Description
You’ll try the following regression models on the real-world data and conduct some analysis: linear regression, linear regression with ℓ1/ℓ2 regularization, CART, random forest, and AdaBoost.
For every model selection process, use the training set to train models and the validation set to select the final one. After selection, simply test with the selected model; there is no need to retrain on (train+val). Don’t standardize the dataset unless a question specifies it.
Programming questions (for parts (b)-(f), see the “Hints on functions to use” box below):
(a) Come up with a baseline regressor which doesn’t use much learning for later comparison. Explain your idea and why it would be a reasonable baseline. Implement your baseline regressor and report the testing MSE.
Note: your baseline should still use (or learn from) the training data but not much learning. This means that a baseline system that randomly generates an arbitrary number as output, without taking consideration of the dataset, would not qualify. The baseline cannot be one of the regression models mentioned in the problem description (such as linear regression), or a model of similar complexity.
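One possible baseline satisfying the note above (an illustration, not the required answer) is to predict the mean of the training targets for every sample: it learns one statistic from the training data and nothing more. A minimal sketch:

```python
import numpy as np

class MeanBaseline:
    """Toy baseline: always predict the mean of the training targets."""
    def fit(self, X, y):
        self.mean_ = float(np.mean(y))
        return self

    def predict(self, X):
        return np.full(len(X), self.mean_)

# toy data just to show the interface
X_train = np.zeros((3, 1))
y_train = np.array([1.0, 2.0, 3.0])
baseline = MeanBaseline().fit(X_train, y_train)
preds = baseline.predict(X_train)   # every prediction equals 2.0
```

The testing MSE of this predictor is essentially the variance of the test targets around the training mean, which makes it a natural floor for the learned models to beat.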
(b) Try the linear regression model with three regularization settings: no regularization, ℓ1 regularization, and ℓ2 regularization. For the regularization coefficient λ, search from log2 λ = −10 to log2 λ = 10 with step size Δlog2 λ = 1. Choose the parameter with the best val_MSE. Report the val_MSE and test_MSE of your best models. Table you can use:
                      Best λ    val_MSE    test_MSE
No regularization       –
ℓ1 regularization
ℓ2 regularization
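The λ grid search can be sketched as below (a hedged example: the synthetic arrays are stand-ins for the real train/val splits, and `Ridge` handles the ℓ2 case; swap in `Lasso` for ℓ1):

```python
import numpy as np
from sklearn.linear_model import Ridge  # use Lasso for the l1 setting
from sklearn.metrics import mean_squared_error

# synthetic stand-ins for the real train/val splits
rng = np.random.default_rng(0)
w = np.array([1.0, 2.0, -1.0])
X_train = rng.normal(size=(50, 3)); y_train = X_train @ w + 0.1 * rng.normal(size=50)
X_val   = rng.normal(size=(20, 3)); y_val   = X_val   @ w + 0.1 * rng.normal(size=20)

best_lam, best_val_mse = None, float("inf")
for log2_lam in range(-10, 11):              # log2(lambda) = -10 .. 10, step 1
    lam = 2.0 ** log2_lam
    model = Ridge(alpha=lam).fit(X_train, y_train)
    val_mse = mean_squared_error(y_val, model.predict(X_val))
    if val_mse < best_val_mse:
        best_lam, best_val_mse = lam, val_mse
```

Note that sklearn calls the regularization coefficient `alpha` rather than λ.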
(c) Standardize all data in the following way: calculate the mean and std of all feature dimensions on the training set, and then subtract the mean and divide by the std to standardize (train, val, test). Repeat (b). Present the table and answer:
i. Do you observe any difference in the learned coefficients among the three models? Explain.
ii. Compare the test_MSE values with those in (b). Which methods show obvious changes and which do not? Explain.
Note: there might be some features that have the same value for all samples. Handle their std values properly so there are no divide-by-zero errors. Standardization applies only to this question; don’t do it in other questions.
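A sketch of the required standardization with a guard for constant features (the helper name `standardize` is made up for illustration):

```python
import numpy as np

def standardize(train, val, test):
    """Standardize all splits using the training set's mean and std.

    Constant features (std == 0) are left unscaled by replacing their
    std with 1.0, which avoids divide-by-zero.
    """
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    std = np.where(std == 0, 1.0, std)
    return (train - mean) / std, (val - mean) / std, (test - mean) / std

# second feature is constant across the training set
train = np.array([[1.0, 5.0], [3.0, 5.0]])
val = np.array([[2.0, 5.0]])
test = np.array([[4.0, 5.0]])
z_train, z_val, z_test = standardize(train, val, test)
```

Only training-set statistics are used for all three splits, so no information leaks from val or test into the preprocessing.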
(d) Try CART. Search max_depth from 1 to 10. For the other parameters, use (criterion='mse', max_features=None, random_state=0) and defaults for the rest. Report the val_MSE and test_MSE of your best model.
(e) Try random forest. Set max_depth to the best value you found in (d). Search n_estimators from 2 to 30. Report the val_MSE and test_MSE of your best model.
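A sketch of the n_estimators search (the value of `best_depth` below is a placeholder; plug in your result from (d), and replace the synthetic arrays with the real splits):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# synthetic stand-ins for the real train/val splits
rng = np.random.default_rng(0)
X_train = rng.normal(size=(80, 4)); y_train = X_train[:, 0] + rng.normal(scale=0.1, size=80)
X_val   = rng.normal(size=(30, 4)); y_val   = X_val[:, 0] + rng.normal(scale=0.1, size=30)

best_depth = 5                                   # placeholder for your result from (d)
best_n, best_val_mse = None, float("inf")
for n in range(2, 31):                           # n_estimators = 2 .. 30
    rf = RandomForestRegressor(n_estimators=n, max_depth=best_depth, random_state=0)
    rf.fit(X_train, y_train)
    val_mse = mean_squared_error(y_val, rf.predict(X_val))
    if val_mse < best_val_mse:
        best_n, best_val_mse = n, val_mse
```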
(f) Try AdaBoost. Set max_depth and n_estimators to your best values from the previous questions. Search for a good value of learning_rate. Report the val_MSE and test_MSE of your best model.
(g) Do those regressors learn from the data compared with the baseline? Compare and comment on (explain) their performances relative to the baseline, and relative to each other.
Written questions (no computer simulations are necessary in (h), (i) below):
(h) Some of the features only have 0/1 values because they were converted from categorical features. Now, for some of those categorical features, you modify the converted numerical values to 0/12345 (i.e., that attribute is either 0 or 12345) for all data, including training, val, and test, and then do the regression problem again with the same random seed. Will your regressor give different predictions compared to your previous results? Give Yes/No answers for linear regression, CART, random forest, and AdaBoost, and explain why.
Note: Assume that the change of value doesn’t affect any random process such as the feature sampling in random forest.
Hints on functions to use:
sklearn.linear_model.LinearRegression
sklearn.linear_model.Lasso
sklearn.linear_model.Ridge
sklearn.tree.DecisionTreeRegressor
sklearn.ensemble.RandomForestRegressor
sklearn.ensemble.AdaBoostRegressor