


Program 3: Cookie Pricing
CSci 39542: Introduction to Data Science
Department of Computer Science
Hunter College, City University of New York
Fall 2023



Program Description


Program 3: Cookie Pricing. Due noon, Friday, 6 October.
Learning Objective: to build models with polynomial features, employ thresholds to decide model fitness, and use regularization techniques to better fit models.
Available Libraries: pandas, numpy, scikit-learn, pickle, pytest, and core Python 3.6+. The scikit-learn modules available are: sklearn.model_selection, sklearn.preprocessing, sklearn.linear_model, and sklearn.metrics.
Data Sources:
St. Louis Federal Reserve Bank Online Data (FRED).
Sample Datasets: fred_ccc.csv

Federal Reserve Economic Data (FRED)

This program builds models for economic data from the St. Louis Federal Reserve Bank's Federal Reserve Economic Data (FRED), using polynomial features and regularization.

We focus on the average price of chocolate chip cookies and their ingredients, but the approach generalizes to other price indices. The pipeline below is general enough to handle actual prices as well as changes over time and price indices. The first steps are importing the data and splitting it into training and testing sets. The training set is then used to determine the degree of the polynomial for the model; once that is determined, regularization with cross validation is used to fit a model that can be validated with the testing data and used for predictions.

Start by downloading the data file:

  1. Go to St. Louis Federal Reserve Bank Economic Data (FRED).
  2. In the search window, type: Chocolate Chip Cookies.
  3. Click on Average Price: Cookies, Chocolate Chip (Cost per Pound/453.6 Grams) in U.S. City Average. Its series ID is APU0000702421.
  4. You should see a graph of the average price of cookies.
  5. Click on the red button: Edit Graph.
  6. At the top of the side bar, click on ADD LINE.
  7. Type eggs into the search bar, choose the average price index (APU0000708111), and click on Add data series.
  8. Repeat the last two steps with butter (APU0000FS1101), chocolate (PCU3113513113517), flour (APU0000701111), and sugar (APU0000715211).
  9. Click the X in the upper right corner to close the side bar menu.
  10. Next, adjust the dates so all indices are defined. Click on the start date 1980-01-01 and change it to 2018-04-01 (before which the butter index is not defined).
  11. Click on DOWNLOAD and choose CSV and save the file to your working directory.

Preparing Data

Once you have downloaded the test data sets to your device, the next step is to format the data so it is usable for analysis. The data from FRED is very clean, so we only need to split it into training and testing sets. Add the following functions to your Python program.
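The full specifications are given in the assignment; as a rough sketch (not the official solution), assuming import_data drops the FRED date column and renames the series columns via the names dictionary, and split_data wraps scikit-learn's train_test_split (the test_size and random_state defaults below are guesses, chosen to be consistent with the 43/22 split shown later), the functions might look like:

import pandas as pd
from sklearn.model_selection import train_test_split

def import_data(csv_file, names=None):
    # Read the FRED CSV and drop the date column (the first column),
    # keeping only the price series.
    df = pd.read_csv(csv_file)
    df = df.drop(columns=df.columns[0])
    if names is not None:
        # Substitute readable names for the FRED series IDs.
        df = df.rename(columns=names)
    return df

def split_data(df, xes_col_names, y_col_name, test_size=0.33, random_state=2023):
    # Split the chosen columns into training and testing sets; the defaults
    # here are assumptions, not the assignment's specification.
    return train_test_split(df[xes_col_names], df[y_col_name],
                            test_size=test_size, random_state=random_state)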

For example, let's input the cookie dataset:

csv_file = "fred_ccc.csv"
names = {'APU0000702421' : 'CC Cookies', 'APU0000708111' : 'Eggs', 'APU0000FS1101' : 'Butter', \
         'PCU3113513113517' : 'Chocolate','APU0000701111' : 'Flour', 'APU0000715211' : 'Sugar'}
df_ccc = import_data(csv_file,names)
print('The DataFrame:')
print(df_ccc)
gives the output:
The DataFrame:
    CC Cookies   Eggs  Butter  Chocolate  Flour  Sugar
0        3.328  2.081   4.153     97.400  0.472  0.617
1        3.400  1.987   4.112     98.100  0.469  0.628
2        3.490  1.628   4.064     94.800  0.469  0.629
3        3.481  1.725   4.110     94.700  0.468  0.630
4        3.351  1.622   4.085     90.200  0.464  0.628
..         ...    ...     ...        ...    ...    ...
60       5.193  3.270   4.452    147.867  0.542  0.893
61       5.153  2.666   4.549    149.313  0.544  0.899
62       5.111  2.219   4.452    150.627  0.551  0.918
63       5.091  2.094   4.465    154.298  0.564  0.932
64       5.056  2.043   4.411    159.247  0.566  0.950

[65 rows x 6 columns]

If we use the default for the last argument, the resulting DataFrame has the original column names:

df_no_sub = import_data(csv_file)
print('The DataFrame with original column names:')
print(df_no_sub)
gives the output:
The DataFrame with original column names:
    APU0000702421  APU0000708111  APU0000FS1101  PCU3113513113517 APU0000701111  APU0000715211
0           3.328          2.081          4.153            97.400         0.472          0.617
1           3.400          1.987          4.112            98.100         0.469          0.628
2           3.490          1.628          4.064            94.800         0.469          0.629
3           3.481          1.725          4.110            94.700         0.468          0.630
4           3.351          1.622          4.085            90.200         0.464          0.628
..            ...            ...            ...               ...           ...            ...
60          5.193          3.270          4.452           147.867         0.542          0.893
61          5.153          2.666          4.549           149.313         0.544          0.899
62          5.111          2.219          4.452           150.627         0.551          0.918
63          5.091          2.094          4.465           154.298         0.564          0.932
64          5.056          2.043          4.411           159.247         0.566          0.950

[65 rows x 6 columns]

We can then split the data into training and testing subsets:

xes_col_names = ['Eggs', 'Chocolate']
y_col_name = "CC Cookies"
print(f'For the x column = {xes_col_names}, y_col = {y_col_name}')
x_train_ccc, x_test_ccc, y_train_ccc, y_test_ccc = split_data(df_ccc, xes_col_names, y_col_name)

print('\nReturned sets of lengths:')
print(f"x_train_cpi: {len(x_train_ccc)}, x_test_cpi: {len(x_test_ccc)}")
print(f"y_train_cpi: {len(y_train_ccc)}, y_test_cpi: {len(y_test_ccc)}")

print(f"And test sets:\nx test:\n{x_test_ccc}\ny_test\n{y_test_ccc}")
gives the output:
For the x column = ['Eggs', 'Chocolate'], y_col = CC Cookies

Returned sets of lengths:
x_train_ccc: 43, x_test_ccc: 22
y_train_ccc: 43, y_test_ccc: 22
And test sets:
x test:
      Eggs  Chocolate
43  1.718    111.641
22  1.449    101.100
4   1.622     90.200
57  4.823    142.614
32  1.481    107.900
58  4.211    141.027
63  2.094    154.298
48  2.520    118.120
12  1.463     95.100
50  2.707    118.896
51  2.936    122.882
1   1.987     98.100
49  2.863    118.161
28  1.328    103.300
60  3.270    147.867
47  2.046    116.664
9   1.554     94.200
26  1.554     96.000
29  1.353    108.000
15  1.243    100.000
0   2.081     97.400
23  1.525     95.400
y_test
43    3.999
22    3.551
4     3.351
57    5.058
32    3.793
58    5.177
63    5.091
48    4.174
12    3.426
50    4.567
51    4.654
1     3.400
49    4.505
28    3.769
60    5.193
47    4.147
9     3.466
26    3.717
29    3.659
15    3.468
0     3.328
23    3.633
Name: CC Cookies, dtype: float64

Once you have written your functions, test them locally on the small test files. When they work, upload to Gradescope. Given the size of the files on which we evaluate your code, you will find it much faster to develop and test the code in your IDE than to debug and test in Gradescope.

Now that we have the data, let's make some quick plots to build intuition about how the possible explanatory variables (e.g., the ingredient prices) explain the price of chocolate chip cookies:

import matplotlib.pyplot as plt
import seaborn as sns
ingredients = list(df_ccc.columns)[1:]
for ingredient in ingredients:
    sns.lmplot(data=df_ccc, x=ingredient, y='CC Cookies')
    plt.title(f'{ingredient} vs. Chocolate Chip Cookies')
    plt.ylabel('Chocolate Chip Cookies, Price per pound')
    plt.tight_layout()
    plt.subplots_adjust(top=0.88)    
    plt.show()

which makes a scatter plot of each ingredient versus the chocolate chip cookie price, with the linear regression line and its 95% confidence interval highlighted.

The linear models, especially for chocolate and sugar, fit the data well, but the plots suggest a quadratic might fit better than a linear model with a single input. Let's look at linear models that use multiple inputs, as well as polynomial models.


Multiple Linear Models

As we did in Lecture 3 when modeling miles-per-gallon (MPG) for cars, let's build a linear model that uses multiple inputs.
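As a rough sketch, assuming fit_lin_reg fits an ordinary least squares model and returns it as a pickled bytestream (which matches the example output below), the function might look like:

import pickle
from sklearn.linear_model import LinearRegression

def fit_lin_reg(x_train, y_train):
    # Fit an ordinary least squares model on the training data and
    # return the fitted model serialized as a bytestream.
    reg = LinearRegression().fit(x_train, y_train)
    return pickle.dumps(reg)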

Continuing our example using the training data that we partitioned above:

mlm_pkl = fit_lin_reg(x_train_ccc,y_train_ccc)
print(f'The pickled model:  {mlm_pkl}')
mlm = pickle.loads(mlm_pkl)
print(f'\nwith coefficients:  {mlm.coef_}')
gives the output:
The pickled model:  b'\x80\x04\x95\x12\x02\x00\x00\x00\x00\x00\x00\x8c\x1asklearn.linear_model._base\x94\x8c\x10LinearRegression\x94\x93\x94)\x81\x94}\x94(\x8c\rfit_intercept\x94\x88\x8c\x06copy_X\x94\x88\x8c\x06n_jobs\x94N\x8c\x08positive\x94\x89\x8c\x11feature_names_in_\x94\x8c\x15numpy.core.multiarray\x94\x8c\x0c_reconstruct\x94\x93\x94\x8c\x05numpy\x94\x8c\x07ndarray\x94\x93\x94K\x00\x85\x94C\x01b\x94\x87\x94R\x94(K\x01K\x02\x85\x94h\r\x8c\x05dtype\x94\x93\x94\x8c\x02O8\x94\x89\x88\x87\x94R\x94(K\x03\x8c\x01|\x94NNNJ\xff\xff\xff\xffJ\xff\xff\xff\xffK?t\x94b\x89]\x94(\x8c\x04Eggs\x94\x8c\tChocolate\x94et\x94b\x8c\x0en_features_in_\x94K\x02\x8c\x05coef_\x94h\x0ch\x0fK\x00\x85\x94h\x11\x87\x94R\x94(K\x01K\x02\x85\x94h\x16\x8c\x02f8\x94\x89\x88\x87\x94R\x94(K\x03\x8c\x01<\x94NNNJ\xff\xff\xff\xffJ\xff\xff\xff\xffK\x00t\x94b\x89C\x10y\x13p\x89\xdd\xa1\xbe?\xec!*\xde-\xc9\x9c?\x94t\x94b\x8c\x05rank_\x94K\x02\x8c\tsingular_\x94h\x0ch\x0fK\x00\x85\x94h\x11\x87\x94R\x94(K\x01K\x02\x85\x94h(\x89C\x10\x8e\xd1Q\xf3\x16\x8a[@\x9a\x17\xa3\x05@\x9f\x07@\x94t\x94b\x8c\nintercept_\x94h\n\x8c\x06scalar\x94\x93\x94h(C\x08\xb4\xdc\xcdN\xb76\xe5?\x94\x86\x94R\x94\x8c\x10_sklearn_version\x94\x8c\x051.2.2\x94ub.'

with coefficients:  [0.11965737 0.02811119]
The pickled model is a bytestream, but it is fairly short considering that it encodes the state of our trained model.


Polynomial Models

Next, let's focus on a single independent variable and build models that use higher order terms (e.g. quadratic, cubic, etc.) to capture the relationship between the independent and dependent variables. A useful class underlying the three following functions is sklearn.preprocessing.PolynomialFeatures from scikit-learn.
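For instance, a minimal sketch of encode_poly, assuming it wraps PolynomialFeatures and returns the transformed numpy array shown below, might be:

from sklearn.preprocessing import PolynomialFeatures

def encode_poly(df, col, deg=2):
    # Encode a single column as powers [x**0, x**1, ..., x**deg].
    transformer = PolynomialFeatures(degree=deg)
    return transformer.fit_transform(df[[col]])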

Continuing our example with the chocolate column:

choc_deg_3_tr = encode_poly(x_train_ccc,'Chocolate',deg=3)
print('Polynomial features (deg = 3) for chocolate:')
np.set_printoptions(precision=3,suppress=True)
print(choc_deg_3_tr)
gives the polynomial features (printed prettily):
Polynomial features (deg = 3) for chocolate:
[[      1.         99.1      9820.81   973242.271]
 [      1.        147.867   21864.65  3233060.156]
 [      1.        106.487   11339.481 1207507.331]
 [      1.        108.997   11880.346 1294922.074]
 [      1.         92.8      8611.84   799178.752]
 [      1.        111.641   12463.713 1391461.37 ]
 [      1.        150.627   22688.493 3417499.655]
 [      1.         99.9      9980.01   997002.999]
 [      1.         94.       8836.     830584.   ]
 [      1.        141.027   19888.615 2804831.669]
 [      1.        107.9     11642.41  1256216.039]
 [      1.         91.       8281.     753571.   ]
 [      1.         94.7      8968.09   849278.123]
 [      1.        103.3     10670.89  1102302.937]
 [      1.        143.979   20729.952 2984677.823]
 [      1.         99.8      9960.04   994011.992]
 [      1.         93.2      8686.24   809557.568]
 [      1.        136.453   18619.421 2540675.882]
 [      1.         91.       8281.     753571.   ]
 [      1.        112.5     12656.25  1423828.125]
 [      1.         91.3      8335.69   761048.497]
 [      1.        105.8     11193.64  1184287.112]
 [      1.        108.4     11750.56  1273760.704]
 [      1.        108.6     11793.96  1280824.056]
 [      1.        104.5     10920.25  1141166.125]
 [      1.        111.8     12499.24  1397415.032]
 [      1.        117.552   13818.473 1624389.103]
 [      1.         94.6      8949.16   846590.536]
 [      1.        101.1     10221.21  1033364.331]
 [      1.        117.856   13890.037 1637024.17 ]
 [      1.        100.6     10120.36  1018108.216]
 [      1.         94.8      8987.04   851971.392]
 [      1.        133.051   17702.569 2355344.455]
 [      1.        142.614   20338.753 2900590.92 ]
 [      1.        102.8     10567.84  1086373.952]
 [      1.         92.4      8537.76   788889.024]
 [      1.        129.897   16873.231 2191782.036]
 [      1.         99.       9801.     970299.   ]
 [      1.         96.       9216.     884736.   ]
 [      1.        110.413   12191.031 1346048.258]
 [      1.         97.4      9486.76   924010.424]
 [      1.        116.664   13610.489 1587854.077]]
The first column is Chocolate**0, which is constantly 1. The next column is Chocolate**1, which is just the Chocolate column. The remaining columns are the square, Chocolate**2, and the cube, Chocolate**3, of the Chocolate column.

The next function tries polynomial models of increasing degree and stops when the MSE cost is less than epsilon or when the degree gets large (degree >= 5).
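A sketch under stated assumptions (the default epsilon of 0.01 and the exact stopping rule are guesses inferred from the output below):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import PolynomialFeatures

def fit_poly(xes, yes, epsilon=0.01, verbose=False):
    # Fit polynomial models of degree 1, 2, ..., 5 and return the first
    # degree whose training MSE drops below epsilon; return None if none does.
    for deg in range(1, 6):
        x_poly = PolynomialFeatures(degree=deg).fit_transform(xes)
        mod = LinearRegression().fit(x_poly, yes)
        mse = mean_squared_error(yes, mod.predict(x_poly))
        if verbose:
            print(f'MSE cost for deg {deg} poly model: {mse:.3f}')
        if mse < epsilon:
            return deg
    return None

Trying first for eggs,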

deg_ccc = fit_poly(x_train_ccc[['Eggs']],y_train_ccc,verbose=True)
print(f'For eggs, the degree found is {deg_ccc}.')
finds that no small degree yields an error less than epsilon:
MSE cost for deg 1 poly model: 0.076
MSE cost for deg 2 poly model: 0.053
MSE cost for deg 3 poly model: 0.053
MSE cost for deg 4 poly model: 0.048
MSE cost for deg 5 poly model: 0.044
For eggs, the degree found is None.

Repeating with chocolate:

deg_ccc = fit_poly(x_train_ccc[['Chocolate']],y_train_ccc,verbose=True)
print(f'For chocolate, the degree found is {deg_ccc}.')
finds that degree = 3 yields an error less than epsilon:
MSE cost for deg 1 poly model: 0.011
MSE cost for deg 2 poly model: 0.011
MSE cost for deg 3 poly model: 0.008
For chocolate, the degree found is 3.

The next function fits the model using regularization.
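The printed parameters below suggest a cross-validated Lasso model (LassoCV); as a rough sketch under that assumption (the handling of poly_deg and the ridge branch here are guesses, not the assignment's specification):

import pickle
from sklearn.linear_model import LassoCV, RidgeCV

def fit_with_regularization(x_train, y_train, poly_deg=2, reg="lasso"):
    # Fit a regularized linear model, choosing alpha by cross validation,
    # and return it as a pickled bytestream. (This sketch does not use
    # poly_deg; the inputs are assumed to be already encoded.)
    mod = LassoCV() if reg == "lasso" else RidgeCV()
    mod.fit(x_train, y_train)
    return pickle.dumps(mod)

Continuing our example with the degree 3 chocolate features: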

choc_3_lasso_pkl = fit_with_regularization(choc_deg_3_tr, y_train_ccc, poly_deg=3, reg="lasso")
choc_3_lasso = pickle.loads(choc_3_lasso_pkl)
print(f'\nThe model, using Lasso regularization:  {choc_3_lasso.get_params()}')
gives the output:
The model, using Lasso regularization:  {'alphas': None, 'copy_X': True, 'cv': None, 'eps': 0.001, 'fit_intercept': True, 'max_iter': 1000, 'n_alphas': 100, 'n_jobs': None, 'positive': False, 'precompute': 'auto', 'random_state': None, 'selection': 'cyclic', 'tol': 0.0001, 'verbose': False}


Evaluating Our Models

The next part of the program evaluates how well our trained models do at prediction. We will use a loss function, mean squared error, introduced in Lecture 1 and Section 4.2.
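A minimal sketch, assuming the function unpickles the model, predicts on the given inputs, and returns the MSE and r2 scores:

import pickle
from sklearn.metrics import mean_squared_error, r2_score

def predict_using_trained_model(mod_pkl, xes, yes):
    # Restore the trained model from its bytestream, predict on xes,
    # and score the predictions against yes.
    mod = pickle.loads(mod_pkl)
    y_pred = mod.predict(xes)
    return mean_squared_error(yes, y_pred), r2_score(yes, y_pred)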

Continuing our example using models that we trained above:

mse_mlm_tr,r2_mlm_tr = predict_using_trained_model(mlm_pkl,x_train_ccc,y_train_ccc)
print(f'For the mult linear model on the training data, MSE is {mse_mlm_tr:.4} and r2 is {r2_mlm_tr:.4}.')
mse_mlm,r2_mlm = predict_using_trained_model(mlm_pkl,x_test_ccc,y_test_ccc)
print(f'For the mult linear model on the test data, MSE is {mse_mlm:.4} and r2 is {r2_mlm:.4}.')
gives the output:
For the mult linear model on the training data, MSE is 0.008202 and r2 is 0.9748.
For the mult linear model on the test data, MSE is 0.02778 and r2 is 0.9295.
As expected, the data on which the model was trained yields better mean squared error and r2 than the testing data.

We can use the function to predict with the polynomial features for chocolate:

choc_3_pkl = fit_lin_reg(choc_deg_3_tr,y_train_ccc)
mse_choc_3_tr,r2_choc_3_tr = predict_using_trained_model(choc_3_pkl,choc_deg_3_tr,y_train_ccc)
print(f'\nFor the cubic model on the training data, MSE is {mse_choc_3_tr:.4} and r2 is {r2_choc_3_tr:.4}')

choc_deg_3_test = encode_poly(x_test_ccc,'Chocolate',deg=3)
mse_choc_3,r2_choc_3 = predict_using_trained_model(choc_3_pkl,choc_deg_3_test,y_test_ccc)
print(f'\nFor the cubic model on the test data, MSE is {mse_choc_3:.4} and r2 is {r2_choc_3:.4}')
gives the output:
For the cubic model on the training data, MSE is 0.00777 and r2 is 0.9761

For the cubic model on the test data, MSE is 0.01934 and r2 is 0.9509
The cubic model on just the chocolate column performs similarly to the regression built on all the input indices.

Let's visualize a quadratic model for eggs (red line) against the full data set:

import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Fit a quadratic model on the eggs training data
# (transformer and mod are built here so the snippet stands alone):
transformer = PolynomialFeatures(degree=2)
mod = LinearRegression().fit(transformer.fit_transform(x_train_ccc[['Eggs']]), y_train_ccc)

x_poly_all = transformer.transform(df_ccc[['Eggs']])
y_predict = mod.predict(x_poly_all)
sns.lineplot(x=df_ccc['Eggs'], y=y_predict, color='red')
plt.scatter(df_ccc['Eggs'], df_ccc['CC Cookies'])
plt.title('Eggs vs CC Cookies Quadratic Model')
plt.xlabel('Egg Prices')
plt.show()
which shows the quadratic fit (red line) over a scatter plot of the full data set.


Testing Code

Each programming assignment includes functions that test that your code works (a "test suite"). In Programs 1 & 2, we wrote the test functions by hand. For this program, we will also use pytest, a standard Python testing framework. Before you start, make sure that your IDE has pytest installed:

pip install -U pytest

Pytest is one of the most popular testing frameworks for Python. For this program, we will introduce the core testing features (and will introduce more features, such as parametrizing, in future programs). For more details, see Lecture 5, the pytest docs, and Think CS: Chapter 20.

Trying it first on a correct function, and assuming that your program is called p3.py, we can invoke pytest from the command line:

pytest p3.py::test_encode_poly
gives the output:
======================================= test session starts ========================================
platform darwin -- Python 3.11.5, pytest-7.4.2, pluggy-1.3.0
rootdir: /Users/stjohn/gitHub/dataScience/programs/fall23/program03
collected 1 item                                                                                   

p3.py .                                                                                      [100%]

======================================== 1 passed in 1.63s =========================================
pytest looks for a function called test_encode_poly() in the file p3.py. It then runs test_encode_poly(), which tests the correctness of the function encode_poly(), and reports back the results.
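As a hypothetical illustration (the actual test suite is provided with the assignment; this is not it), a pytest-style test for encode_poly might look like:

import numpy as np
import pandas as pd

def test_encode_poly():
    # Hypothetical check: the degree 2 encoding of a single column
    # should be [x**0, x**1, x**2] for each row.
    df = pd.DataFrame({'x': [1.0, 2.0]})
    expected = np.array([[1.0, 1.0, 1.0],
                         [1.0, 2.0, 4.0]])
    assert np.allclose(encode_poly(df, 'x', deg=2), expected)

pytest reports a test as passed when every assert in it holds, and as failed otherwise.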


Notes and Hints: