Program 7: Commodity Pricing. Due 10am, Wednesday, 15 March.
This program builds models for economic data from St. Louis Federal Reserve Bank Online Data (FRED), using polynomial features and regularization. The pipeline below is general enough to handle actual prices as well as changes over time and price indices. The first steps are importing the data and splitting it into training and testing sets. The training set is then used to determine the degree of the polynomial for the model; once that is determined, regularization with cross validation is used to fit a model that can be validated with the testing data and used for predictions.
Learning Objective: to build models with polynomial features, employ thresholds to decide model fitness, and use regularization techniques to better fit models.
Available Libraries: pandas, numpy, scikit-learn, and core Python 3.6+. The scikit-learn packages available are sklearn.model_selection, sklearn.preprocessing, sklearn.linear_model, and sklearn.metrics.
Data Sources: St. Louis Federal Reserve Bank Online Data (FRED).
Sample Datasets:
fred_cpi_eggs_oil_cars_INDEXED.csv: Consumer Price Index, Price of Eggs, Heating Oil, and Cars, over the last 10 years, indexed to January 1, 2013
The assignment is broken into the following functions to allow for unit testing:

import_data(csv_file):
This function takes one input:
csv_file: the name of a CSV file with time series data for commodities from FRED.
The data in the file is read into a DataFrame, and a new column, units, is added that indexes the lines of the file and represents the number of units (e.g. months, weeks, years). It can be generated by looping through the dataset or by using the index as a column (e.g. df.index.to_series()). The resulting DataFrame is returned.
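One possible sketch of this function, assuming the rows of the CSV file are already in chronological order so the default integer index can serve as the units column:

import pandas as pd

def import_data(csv_file):
    # Read the FRED time series into a DataFrame.
    df = pd.read_csv(csv_file)
    # Number the rows 0, 1, 2, ... by reusing the default integer index.
    df["units"] = df.index.to_series()
    return df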
split_data(df, y_col_name, test_size=0.25, random_state=21):
This function takes 4 input parameters:
df: a DataFrame containing a column units.
y_col_name: the name of the column of the dependent variable.
test_size: accepts a float between 0 and 1 and represents the proportion of the data set to use for testing. This parameter has a default value of 0.25.
random_state: used as a seed for the randomization. This parameter has a default value of 21.
Returns the data split into 4 subsets, corresponding to those returned by train_test_split: x_train, x_test, y_train, and y_test, where units is the "x" column and the input parameter y_col_name is the "y" column.
Note: this function is very similar to the splitting of data into training and testing sets from Program 6.
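Since train_test_split does the heavy lifting, a minimal sketch is:

from sklearn.model_selection import train_test_split

def split_data(df, y_col_name, test_size=0.25, random_state=21):
    # units is the "x" column; y_col_name names the "y" column.
    return train_test_split(df["units"], df[y_col_name],
                            test_size=test_size, random_state=random_state)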
fit_poly(xes, yes, epsilon=100, verbose=False):
This function takes four inputs:
xes: a DataFrame that includes the column units.
yes: a Series of the same length as xes.
epsilon: the threshold for the mean squared error (MSE) of an acceptable model. It has a default value of 100.
verbose: when True, prints out the MSE cost for each degree tried (in the format f'MSE cost for deg {deg} poly model: {error:.3f}' for degrees 1, 2, ..., until the error is below epsilon; see example below). It has a default value of False.
It returns the smallest integer degree >= 1 for which the model of df["units"] and df[y_col] yields an MSE of less than the specified epsilon. If it does not find a model with an error less than epsilon by degree 5, it returns None.
When fitting the linear regression model, use fit_intercept=False.
Hint: see Chapter 16 for examples of using PolynomialFeatures().
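A sketch of one way to structure the search over degrees (not the only acceptable implementation):

from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

def fit_poly(xes, yes, epsilon=100, verbose=False):
    # Try degrees 1 through 5, stopping at the first whose MSE beats epsilon.
    for deg in range(1, 6):
        poly_xes = PolynomialFeatures(degree=deg).fit_transform(xes)
        model = LinearRegression(fit_intercept=False).fit(poly_xes, yes)
        error = mean_squared_error(yes, model.predict(poly_xes))
        if verbose:
            print(f'MSE cost for deg {deg} poly model: {error:.3f}')
        if error < epsilon:
            return deg
    # No model with MSE below epsilon was found by degree 5.
    return None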
fit_model(xes, yes, poly_deg=2, reg="lasso"):
This function fits a model with polynomial features using Lasso or Ridge regression with cross validation. It takes four inputs:
xes: a series of numeric values.
yes: a series of numeric values.
poly_deg: the degree of the polynomial features to be created. It has a default value of 2.
reg: the type of regularization used: ridge or lasso. It has a default value of lasso.
The trained model is returned, stored in pickle format.
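A sketch of one possible approach, assuming the default cross-validation settings of LassoCV and RidgeCV are acceptable:

import pickle
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LassoCV, RidgeCV

def fit_model(xes, yes, poly_deg=2, reg="lasso"):
    # Create polynomial features of the requested degree.
    poly_xes = PolynomialFeatures(degree=poly_deg).fit_transform(xes)
    # Cross-validated regularized regression, Lasso by default.
    model = LassoCV() if reg == "lasso" else RidgeCV()
    model.fit(poly_xes, yes)
    # Serialize the trained model for later functions.
    return pickle.dumps(model)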
predict_using_trained_model(mod_pkl, poly_xes, yes):
This function takes three inputs:
mod_pkl: a trained model for the data, stored in pickle format.
poly_xes: an array or DataFrame of numeric columns with no null values.
yes: an array or DataFrame of numeric columns with no null values.
It computes and returns the mean squared error and r2 score between the values predicted by the model (mod_pkl applied to poly_xes) and the actual values (yes).
Note that sklearn.metrics contains two functions that may be of use: mean_squared_error and r2_score.
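A minimal sketch, using the two sklearn.metrics functions named above:

import pickle
from sklearn.metrics import mean_squared_error, r2_score

def predict_using_trained_model(mod_pkl, poly_xes, yes):
    # Restore the trained model from its pickled byte stream.
    mod = pickle.loads(mod_pkl)
    y_predict = mod.predict(poly_xes)
    # Score the predictions against the actual values.
    return mean_squared_error(yes, y_predict), r2_score(yes, y_predict)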
For example, let's use the FRED dataset of the Consumer Price Index, the price of eggs, the price of heating oil, and the price of cars, all indexed to have the same starting value of 100 on January 1, 2013 (fred_cpi_eggs_oil_cars_INDEXED.csv):
csv_file = "fred_cpi_eggs_oil_cars_INDEXED.csv"
df = import_data(csv_file)
print('The DataFrame:')
print(df)
The column names generated by FRED were rather cryptic, so they have been renamed. Note that all have been normalized to 100 for the date 2013-01-01. The code above will print:
The DataFrame:
DATE CPI EGGS OIL CARS units
0 2013-01-01 100.00000 100.00000 100.00000 100.00000 0
1 2013-02-01 100.54299 101.65546 103.26837 100.30355 1
2 2013-03-01 100.26027 99.58614 95.90818 99.41820 2
3 2013-04-01 100.05093 99.22400 89.38780 99.98735 3
4 2013-05-01 100.09237 96.74082 89.27521 100.44900 4
.. ... ... ... ... ... ...
116 2022-09-01 127.99563 150.12933 106.16784 89.24935 116
117 2022-10-01 128.62063 176.87532 135.68469 99.74704 117
118 2022-11-01 128.88436 185.66994 124.54440 93.91640 118
119 2022-12-01 129.05356 219.86549 95.72411 87.74426 119
120 2023-01-01 129.72086 249.50854 100.54942 102.47897 120
[121 rows x 6 columns]
Let's try the split_data function on the CPI data:
y_col_name = "CPI"
print(f'For the x column = "units", y_col = {y_col_name}')
print(df[ ["units", y_col_name] ])
x_train_cpi, x_test_cpi, y_train_cpi, y_test_cpi = split_data(df, y_col_name)
print('\nReturned sets of lengths:')
print(f"x_train_cpi: {len(x_train_cpi)}, x_test_cpi: {len(x_test_cpi)}")
print(f"y_train_cpi: {len(y_train_cpi)}, y_test_cpi: {len(y_test_cpi)}")
For the x column = "units", y_col = CPI
units CPI
0 0 100.00000
1 1 100.54299
2 2 100.26027
3 3 100.05093
4 4 100.09237
.. ... ...
116 116 127.99563
117 117 128.62063
118 118 128.88436
119 119 129.05356
120 120 129.72086
[121 rows x 2 columns]
Returned sets of lengths:
x_train_cpi: 90, x_test_cpi: 31
y_train_cpi: 90, y_test_cpi: 31
Let's fit a polynomial model to the CPI:
eps = 5
print(f'Finding the poly degree for training data with epsilon = {eps}:')
deg = fit_poly(x_train_cpi.to_frame(),y_train_cpi,epsilon=eps,verbose=True)
print(f'For epsilon = {eps}, poly has degree {deg}.')
would print in a sample run:
Finding the poly degree for training data with epsilon = 5:
MSE cost for deg 1 poly model: 7.980
MSE cost for deg 2 poly model: 2.511
For epsilon = 5, poly has degree 2.
We can adjust the sizes of the training and testing sets by using the keyword arguments:
y_col_name = "EGGS"
print(f'For the x column = "units", y_col = {y_col_name}')
print(df[ ["units", y_col_name] ])
x_train, x_test, y_train, y_test = split_data(df, y_col_name,test_size=0.33)
print('\nReturned sets of lengths:')
print(f"x_train: {len(x_train)}, x_test: {len(x_test)}")
print(f"y_train: {len(y_train)}, y_test: {len(y_test)}")
For the x column = "units", y_col = EGGS
units EGGS
0 0 100.00000
1 1 101.65546
2 2 99.58614
3 3 99.22400
4 4 96.74082
.. ... ...
116 116 150.12933
117 117 176.87532
118 118 185.66994
119 119 219.86549
120 120 249.50854
[121 rows x 2 columns]
Returned sets of lengths:
x_train: 81, x_test: 40
y_train: 81, y_test: 40
Let's try the same epsilon for fitting the EGGS model:
eps = 5
print(f'Finding the poly degree for training data with epsilon = {eps}:')
deg = fit_poly(x_train.to_frame(),y_train,epsilon=eps,verbose=True)
print(f'For epsilon = {eps}, poly has degree {deg}.')
would print in a sample run:
Finding the poly degree for training data with epsilon = 5:
MSE cost for deg 1 poly model: 582.143
MSE cost for deg 2 poly model: 404.864
MSE cost for deg 3 poly model: 212.212
MSE cost for deg 4 poly model: 207.756
MSE cost for deg 5 poly model: 192.272
For epsilon = 5, poly has degree None.
Next, let's use cross validation Lasso regularization (the default for our function) to better fit our models for the CPI data:
import pickle
from sklearn.preprocessing import PolynomialFeatures

# Working with CPI:
transformer = PolynomialFeatures(degree=2)
x_train_poly = transformer.fit_transform(x_train_cpi.to_frame())
# Fit model to the poly features:
cpi_pkl = fit_model(x_train_cpi.to_frame(), y_train_cpi)
mod = pickle.loads(cpi_pkl)
print(mod.get_params())
print(x_train_poly)
print(f'model: {mod.predict(x_train_poly)}')
would print in a sample run:
model: [115.4081765 100.78877974 103.75105901 105.76705462 104.53105731
116.71103081 100.87106528 103.18877452 107.38533682 123.64873
102.5630616 106.54876721 115.09103434 101.56877805 116.38017439
117.38302935 101.65277786 110.48304436 102.02306277 104.36820053
101.26706442 112.39104021 115.72874723 111.01104321 100.72020846
114.77732073 101.13677899 111.55275632 105.39676972 102.4482047
102.92820366 105.58019789 114.16017922 101.07677912 120.57673669
120.94873588 102.22877661 119.4813105 104.69734267 123.25273086
122.08530483 122.47101828 109.71676032 106.15105379 101.33734998
103.3242028 106.96019489 109.2230471 103.05677481 112.96703896
109.46818943 104.05277264 101.4110641 102.80306107 105.95733992
113.85675131 103.46305964 100.82820823 114.46703569 111.82875572
103.90020154 116.05274653 100.91735089 100.62420867 107.60305063
119.12302556 108.98133334 108.27676345 124.0481577 100.58820875
101.20020742 104.2087723 121.70301995 117.0453158 113.55675196
118.06874214 105.04019906 100.66535144 103.60534504 118.76816919
121.32416363 101.9253487 117.72417146 106.34819622 109.96875977
104.86705658 119.843024 100.58135162 100.96706507 100.5830659 ]
We can see how well this model does with the testing data, as well as predicting for the future:
# Predict with the model on testing data:
x_test_poly = transformer.transform(x_test_cpi.to_frame())
print('Built model, testing on testing data:')
mse_cpi, r2_cpi = predict_using_trained_model(cpi_pkl, x_test_poly, y_test_cpi)
print(f'The CPI model has a MSE for the testing set of {mse_cpi:2.2} and r2 score of {r2_cpi:2.2}.')
would print in a sample run:
Built model, testing on testing data:
The CPI model has a MSE for the testing set of 2.8 and r2 score of 0.96.
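To predict for the future, transform unit values beyond the observed range and feed them to the model. A sketch (the 12-month horizon is a hypothetical choice for illustration):

import pandas as pd

# Hypothetical: forecast the 12 months after the last observation (unit 120).
future_x = pd.DataFrame({'units': range(121, 133)})
future_poly = transformer.transform(future_x)
print(f'Forecast: {mod.predict(future_poly)}')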
Let's visualize the model (red line) with the full data set:
import matplotlib.pyplot as plt
import seaborn as sns
x_poly_all = transformer.fit_transform(df[['units']])
y_predict = mod.predict(x_poly_all)
sns.lineplot(x=df['units'], y=y_predict, color='red')
plt.scatter(df['units'],df['CPI'])
plt.title('CPI Quadratic Model')
plt.xlabel('Months since January 2013')
plt.show()
Notes and Hints:
Use only the allowed libraries (such as pandas) for your functions and none of the ones for plotting (such as matplotlib.pyplot and seaborn), since the functions submitted are computing and not plotting. Only the libraries listed in Available Libraries are loaded by the autograder.