Description
Assignment 1
For all the questions involving R programming, please submit your R code and attach the screenshot of the R output.
Question 1: Please download the bike sharing data set from the following website: https://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset. (I also include the data set on crowdmark.) This data set contains the hourly and daily counts of rental bikes between the years 2011 and 2012 in the Capital Bikeshare system, with the corresponding weather and seasonal information. There are two files; we will use day.csv. Import the data into R and perform the following exploratory analysis.
a) The variable "registered" records the number of registered users who used the bike sharing service on a particular day. Please provide the mean value of the variable "registered" for each day of the week.
b) Plot the conditional density plot of the variable "registered" conditional on each month of the year.
c) Produce a two-dimensional levelplot of the variable "registered" against the combination of temperature (variable "temp") and humidity (variable "hum").
Question 2: Fit linear regression models on the bike sharing data set from Question 1.
a. Provide the summary result of the regression model with "registered" as the response variable and "temp", "hum" as the predictors.
b. What other predictors do you think might be important for the modelling of the variable "registered"? Please construct another linear model including more predictors and provide the summary result of the second model.
c. Perform 100 rounds of 5-fold cross validation; each time, randomly partition the data set into five equal parts, using 80% of the data as the training data and 20% as the validation data. For the models in a) and b), calculate the total sum of squared prediction error divided by the size of the validation data and by the number of cross-validations. Which model has better predictive power?
Question 3: In the following marketing data set, we have 9 years of data, with sales in units of 10 million euros and advertising expenditure in millions of euros.
Year Sales Advertisement
1 34 23
2 56 36
3 65 45
4 86 52
5 109 53
6 109 58
7 122 63
8 124 68
9 131 70
a) Formulate the response vector Y, which has nine entries.
b) Formulate the data matrix X; the first column should be all ones corresponding to the intercept, and the second column should be the predictor values. The dimension of X should be 9 × 2.
c) Write R code to compute X^tX.
d) Write R code to compute θ̂ = (X^tX)⁻¹X^tY. This is the estimated linear regression coefficient vector of the linear model with Y as the response and X as the data matrix.
e) Run the linear regression using Y as the response and X as the predictor using the lm command in R, and compare the output with your own calculation.
f) Now two additional data points arrive: Year 10, Sales 96, Advertisement 53; and Year 11, Sales 107, Advertisement 63. Please use the online algorithm to update the linear model. Use the two new observations together to perform the sequential learning, updating the model with the stochastic gradient descent algorithm and the learning rate η = 0.01. Note that in the updating scheme θ̂new = θ̂old − η∇En, the term En is the sum of the squared prediction errors over the two new observations:
En = (y10 − ŷ10)² + (y11 − ŷ11)².
In this example, we implement the online algorithm when the new data come in batches, not one at a time.
Question 4: Consider a data set with the response vector Y = (y1,…,yn)t and the n × 3 data matrix X whose ith row is (1, xi1, xi2).
We model the relationship between X and Y using the linear regression model: yi = θ0 + θ1xi1 + θ2xi2 + εi, i = 1,…,n, where εi ∼ N(0, σ²). Let the parameter vector be denoted as θ = (θ0, θ1, θ2)t. We wish to minimize the sum of squared residuals: SSE = Σ_{i=1}^{n} (yi − θ0 − θ1xi1 − θ2xi2)². Let the fitted value be denoted as ŷi = θ0 + θ1xi1 + θ2xi2, and let the fitted value vector be denoted as Ŷ.
a) Show that SSE = Σ_{i=1}^{n} (yi − ŷi)².
b) Show that SSE = (Y − Ŷ)t(Y − Ŷ).
c) Show that Ŷ = Xθ.
d) Simplify the derivative equation ∂SSE/∂θ = 0.
e) Find the solution of θ which solves the equation in part d).
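The algebra in parts b)–e) can be checked numerically. Below is a minimal R sketch (simulated data; the variable names and the chosen true coefficients are illustrative, not from the assignment) verifying that the closed-form solution θ̂ = (X^tX)⁻¹X^tY reproduces lm's coefficients and that SSE = (Y − Ŷ)t(Y − Ŷ):

```r
# verify the closed-form least squares solution on simulated data
set.seed(1)                              # illustrative seed
n <- 50
X <- cbind(1, runif(n), runif(n))        # columns: intercept, x_i1, x_i2
Y <- X %*% c(2, -1, 3) + rnorm(n)        # illustrative true theta = (2, -1, 3)

theta_hat <- solve(t(X) %*% X) %*% t(X) %*% Y   # part e): (X^t X)^{-1} X^t Y
Yhat <- X %*% theta_hat                         # part c): fitted value vector
SSE  <- t(Y - Yhat) %*% (Y - Yhat)              # part b): (Y - Yhat)^t (Y - Yhat)

# lm fits the same model, so the coefficients should agree
fit <- lm(Y ~ X[, 2] + X[, 3])
all.equal(as.vector(theta_hat), unname(coef(fit)))  # TRUE
```

The same check confirms part a): `as.numeric(SSE)` equals `sum((Y - Yhat)^2)` up to floating-point tolerance.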
Question 5: Analyze the QSAR fish toxicity data set from the site:
https://archive.ics.uci.edu/ml/datasets/QSAR+fish+toxicity. Perform variable selection on the six predictors using the lars package.
a) Based on the output of "lars", please provide the sequence of candidate models. For example, the first model is {X5}, the second model is {X5, X3} and the third model is {X5, X3, X10}, etc.
b) Using the cross validation method, select the best value for the fraction s based on the plot of cross validation error against the fraction s. The fraction s measures the ratio of the L1 norm of the penalized estimate over the L1 norm of the unpenalized least squares estimate.
c) Using the optimum s you selected, perform the penalized regression and output the optimum model and the estimated coefficients.
Question 6: Simulate a data set with 100 observations yi = 30 + 5xi + 2xi² + 3xi³ + εi, where the εi follow independent normal N(0, 1) distributions.
a) Perform polynomial regression on your simulated data set using x, I(x^2), I(x^3) as the predictors. Compare the estimated coefficients with the true model and report the R-square.
b) Formulate the design matrix of this regression and write down the first two rows of the design matrix based on your data set.
c) Perform polynomial regression on your simulated data set using x, I(x^2), I(x^3), I(x^4) as the predictors. Compare the estimated coefficients with the true model and report the R-square. Is the R-square increased or decreased compared to the model in part (a)? Explain why.
Assignment 1 Solutions
library(lattice)
library(lars)

## Loaded lars 1.2

library(locfit)
library(knitr)
library(ggplot2)
Question 1
Part A
# weekday codes in day.csv: 0 = Sunday, ..., 6 = Saturday
days = c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday")
registeredMeans = aggregate(day$registered, by = list(day$weekday),
                            FUN = function(x) {round(mean(x), digits = 0)})
means = cbind(days, registeredMeans)
names(means) = c("Days", "Days ID", "Registered Mean")
means
## Days Days ID Registered Mean
Part B
library(lattice)
densityplot(~registered | mnth, data = day, plot.points = FALSE, col = "black")
(Figure: conditional density plots of registered for each month of the year; x-axis registered, y-axis density.)
Part C
df_brakes = tapply(day$registered,
                   INDEX = list(cut(day$temp, breaks = 10), cut(day$hum, breaks = 10)),
                   FUN = mean, na.rm = TRUE)
levelplot(df_brakes, scales = list(x = list(rot = 90)),
          main = "2D Levelplot of Registered",
          xlab = "temperature", ylab = "humidity")
(Figure: 2D levelplot of registered against binned temperature (x-axis) and humidity (y-axis).)
Question 2 Part A
mod = lm(registered ~ temp + hum, data = day)
summary(mod)
##
## Call:
## lm(formula = registered ~ temp + hum, data = day)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3687.4 -994.5 -177.7 968.0 3170.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept)   2405.1      239.4  10.046  < 2e-16 ***
## temp          4778.5      263.1  18.162  < 2e-16 ***
## hum          -1777.6      338.1  -5.257 1.93e-07 ***
## —
## Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
##
## Residual standard error: 1291 on 728 degrees of freedom
## Multiple R-squared: 0.3175, Adjusted R-squared: 0.3156
## F-statistic: 169.3 on 2 and 728 DF,  p-value: < 2.2e-16

Part B
mod1 = lm(registered ~ temp + hum + workingday + weathersit, data = day)
summary(mod1)
##
## Call:
## lm(formula = registered ~ temp + hum + workingday + weathersit,
## data = day)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -2907.45  -960.55   -74.98   901.86  2799.40
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept)  1948.26     229.30   8.497  < 2e-16 ***
## temp         4340.16     251.88  17.231  < 2e-16 ***
## hum -584.22 397.70 -1.469 0.142
## workingday    971.65      95.51  10.173  < 2e-16 ***
## weathersit   -530.27     104.10  -5.094 4.47e-07 ***
## —
## Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
##
## Residual standard error: 1196 on 726 degrees of freedom
## Multiple R-squared: 0.4161, Adjusted R-squared: 0.4129
## F-statistic: 129.3 on 4 and 726 DF, p-value: < 2.2e-16
Part C
### CV for part a)
totalCVerror <- 0
for (j in 1:100) {
  # random 80/20 split: 584 training rows, 147 validation rows
  training <- sample(1:731, 584)
  trainingset <- day[training, ]
  testingset <- day[-training, ]
  model <- lm(registered ~ temp + hum, data = trainingset)
  prediction <- predict(model, newdata = testingset)
  errors <- sum((testingset$registered - prediction)^2)
  totalCVerror <- totalCVerror + errors
}
# average over the 100 repetitions and the 147 validation observations
averageCVerrors_modA <- totalCVerror/100/147
averageCVerrors_modA
## [1] 16434.51
### CV for part b)
totalCVerror <- 0
for (j in 1:100) {
  training <- sample(1:731, 584)
  trainingset <- day[training, ]
  testingset <- day[-training, ]
  model <- lm(registered ~ temp + hum + workingday + weathersit, data = trainingset)
  prediction <- predict(model, newdata = testingset)
  errors <- sum((testingset$registered - prediction)^2)
  totalCVerror <- totalCVerror + errors
}
averageCVerrors_modB <- totalCVerror/100/147
averageCVerrors_modB
## [1] 14343.96
# In this case, the model from part B has better predictive power,
# since its average prediction error is smaller.
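The loops above repeat a random 80/20 split; a literal 5-fold partition, as the question describes, would cycle through five disjoint folds so that every observation is validated exactly once per round. A sketch of one such round follows; the simulated `day` data frame at the top is a self-contained stand-in for day.csv, and its column distributions, the seed, and the variable names are all illustrative assumptions:

```r
# stand-in for day.csv so this sketch runs on its own (hypothetical values)
set.seed(1)                              # illustrative seed
day <- data.frame(registered = rnorm(731, mean = 3656, sd = 1560),
                  temp = runif(731), hum = runif(731))

# one round of 5-fold CV: randomly assign each row to one of 5 disjoint folds
folds <- cut(sample(1:731), breaks = 5, labels = FALSE)
sse <- 0
for (k in 1:5) {
  test  <- day[folds == k, ]             # hold fold k out for validation
  train <- day[folds != k, ]             # train on the other four folds
  fit   <- lm(registered ~ temp + hum, data = train)
  sse   <- sse + sum((test$registered - predict(fit, newdata = test))^2)
}
sse / 731   # average squared prediction error per observation
```

Wrapping this round in an outer `for (rep in 1:100)` loop with a fresh fold assignment each time gives the 100 repetitions the question asks for.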
Question 3
Part A
Y = c(34, 56, 65, 86, 109, 109, 122, 124, 131)
matrix(Y, ncol = 1, byrow = T)
## [,1]
## [1,]   34
## [2,]   56
## [3,]   65
## [4,]   86
## [5,]  109
## [6,]  109
## [7,]  122
## [8,]  124
## [9,] 131
Part B
advertisement = c(23, 36, 45, 52, 53, 58, 63, 68, 70)
ones = rep(1, 9)
X = cbind(ones, advertisement)
X
## ones advertisement
## [1,] 1 23
## [2,] 1 36
## [3,] 1 45
## [4,] 1 52
## [5,] 1 53
## [6,] 1 58
## [7,] 1 63
## [8,] 1 68
## [9,] 1 70
Part C
XtransX = t(X)%*%X
XtransX
## ones advertisement
## ones 9 468
## advertisement 468 26220
Part D
XtransY = t(X) %*% Y
theta = solve(XtransX) %*% XtransY
theta
##                     [,1]
## ones          -20.550602
## advertisement   2.181529
Part E
mod = summary(lm(Y ~ X))
mod
##
## Call:
## lm(formula = Y ~ X)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.618 -3.793 -1.156 4.375 13.930
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -20.5506 10.2415 -2.007 0.0848 .
## Xones NA NA NA NA
## Xadvertisement 2.1815 0.1897 11.497 8.47e-06 ***
## —
## Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
##
## Residual standard error: 8.236 on 7 degrees of freedom
## Multiple R-squared:  0.9497, Adjusted R-squared:  0.9425
## F-statistic: 132.2 on 1 and 7 DF,  p-value: 8.47e-06
# As you can see, using the lm function and manually solving the
# least squares equations give the same model and coefficients.
Part F
# Please check the handwritten notes.
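Since the handwritten notes are not attached here, a minimal R sketch of the batch stochastic gradient descent update from Question 3f (starting from the part D estimate θ̂ = (−20.550602, 2.181529)t; the variable names are illustrative):

```r
# batch SGD update for the two new observations (Years 10 and 11)
theta_old <- c(-20.550602, 2.181529)   # (intercept, slope) from part D
X_new <- cbind(1, c(53, 63))           # advertisement for years 10, 11
y_new <- c(96, 107)                    # sales for years 10, 11
eta <- 0.01                            # learning rate given in the question

# E_n = sum of squared errors over the batch;
# its gradient with respect to theta is -2 * X^t (y - X theta)
res <- y_new - X_new %*% theta_old
gradient <- -2 * t(X_new) %*% res
theta_new <- theta_old - eta * as.vector(gradient)
theta_new
```

Note that the raw update θ̂new = θ̂old − η∇En does not normalize the gradient by the batch size, so even η = 0.01 moves the slope by a large step here.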
Question 5
Part A
lasso <- lars(x = as.matrix(fishToxicity[, 1:6]),
              y = as.numeric(unlist(fishToxicity[, 7])), trace = TRUE)
## LASSO sequence
## Computing X'X .....
## LARS Step 1 : Variable 6 added
## LARS Step 2 : Variable 2 added
## LARS Step 3 : Variable 3 added
## LARS Step 4 : Variable 4 added
## LARS Step 5 : Variable 5 added
## LARS Step 6 : Variable 1 added
## Computing residuals, RSS etc …..
# Based on the output, the LASSO method produces the following sequence of models:
# 1st Model: X_6 (MLOGP)
# 2nd Model: X_6 (MLOGP), X_2 (SM1_Dx(Z))
# 3rd Model: X_6 (MLOGP), X_2 (SM1_Dx(Z)), X_3(GATS1i)
# 4th Model: X_6 (MLOGP), X_2 (SM1_Dx(Z)), X_3(GATS1i), X_4(NdsCH)
# 5th Model: X_6 (MLOGP), X_2 (SM1_Dx(Z)), X_3(GATS1i), X_4(NdsCH), X_5 (NdssC)
# 6th Model: X_6 (MLOGP), X_2 (SM1_Dx(Z)), X_3(GATS1i), X_4(NdsCH), X_5 (NdssC), X_1 (CIC0)
Part B
cv.lars(x = as.matrix(fishToxicity[, 1:6]),
        y = as.numeric(unlist(fishToxicity[, 7])), K = 10)
(Figure: 10-fold cross-validation error plotted against the fraction of the final L1 norm.)
# Based on the plot, I would choose 0.8 as the best value for the fraction s.
Part C
lasso <- lars(x = as.matrix(fishToxicity[, 1:6]),
              y = as.numeric(unlist(fishToxicity[, 7])), trace = TRUE)
## LASSO sequence
## Computing X'X .....
## LARS Step 1 : Variable 6 added
## LARS Step 2 : Variable 2 added
## LARS Step 3 : Variable 3 added
## LARS Step 4 : Variable 4 added
## LARS Step 5 : Variable 5 added
## LARS Step 6 : Variable 1 added
## Computing residuals, RSS etc …..
coef(lasso, s = 0.8, mode = "fraction")
## CIC0 SM1_Dz.Z. GATS1i NdsCH NdssC MLOGP
## 0.2077567 0.9921246 -0.4651972 0.2890276 0.0354972 0.4324455
Question 6
Part A
q = seq(0, .99, by = 0.01)
length(q)
## [1] 100
y = 30 + 5*q + 2*q^2 + 3*q^3
noise = rnorm(length(q), mean = 0, sd = 1)
noisy.y = y + noise
plot(q, noisy.y, col = "deepskyblue4", xlab = "q", main = "Observed Data")
lines(q, y, col = "firebrick1", lwd = 3)

(Figure: "Observed Data" — the noisy observations with the true cubic curve overlaid.)
model <- lm(noisy.y ~ q + I(q^2) + I(q^3))
summary(model)
##
## Call:
## lm(formula = noisy.y ~ q + I(q^2) + I(q^3))
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -2.84753 -0.71749  0.02206  0.71125  2.15023
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30.2033 0.4075 74.115 <2e-16 ***
## q 0.2893 3.5829 0.081 0.9358
## I(q^2) 16.0228 8.4337 1.900 0.0605 .
## I(q^3) -6.9555 5.5983 -1.242 0.2171
## —
## Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
##
## Residual standard error: 1.057 on 96 degrees of freedom
## Multiple R-squared:  0.8882, Adjusted R-squared:  0.8847
## F-statistic: 254.2 on 3 and 96 DF,  p-value: < 2.2e-16
# The fitted model is 30.2033 + 0.2893x + 16.0228x^2 - 6.9555x^3,
# which is quite different from the true model.
# The R-squared of the fitted model is 0.8882.
Part B
matdata = c(1, q[1], q[1]^2, q[1]^3,
            1, q[2], q[2]^2, q[2]^3)
xMat = matrix(matdata, nrow = 2, ncol = 4, byrow = TRUE)
colnames(xMat) = c("1", "x", "x^2", "x^3")
rownames(xMat) = c("row 1", "row 2")
xMat

##       1    x   x^2   x^3
## row 1 1 0.00 0e+00 0e+00
## row 2 1 0.01 1e-04 1e-06

Part C
model2 = lm(noisy.y ~ q + I(q^2) + I(q^3) + I(q^4))
summary(model2)
##
## Call:
## lm(formula = noisy.y ~ q + I(q^2) + I(q^3) + I(q^4))
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -2.81946 -0.69717  0.00276  0.71388  2.15258
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30.0869 0.5006 60.100 <2e-16 ***
## q             2.7527     7.0816   0.389    0.698
## I(q^2)        4.6941    29.3004   0.160    0.873
## I(q^3)       10.9040    44.5754   0.245    0.807
## I(q^4)       -9.0200    22.3330  -0.404    0.687
## —
## Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
##
## Residual standard error: 1.062 on 95 degrees of freedom
## Multiple R-squared: 0.8884, Adjusted R-squared: 0.8837
## F-statistic: 189 on 4 and 95 DF, p-value: < 2.2e-16
# The fitted model is 30.0869 + 2.7527x + 4.6941x^2 + 10.9040x^3 - 9.0200x^4.
# Once again, this is quite different from the true model in terms of the coefficients.
# The R-squared of this model is 0.8884, slightly higher than in part (a).
# Adding regressors never decreases the R-squared: each extra term can only
# reduce (or leave unchanged) the residual sum of squares, so the R-squared
# tends toward 1 as more predictors are added, even irrelevant ones.
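The claim that R-squared never decreases as predictors are added can be checked with a quick self-contained sketch (simulated data; the seed, coefficients, and variable names are illustrative):

```r
# R-squared is non-decreasing in the number of predictors: the larger model
# can always reproduce the smaller model's fit by zeroing the extra
# coefficient, so its residual sum of squares cannot be larger.
set.seed(42)                      # illustrative seed
x <- runif(50)
y <- 1 + 2 * x + rnorm(50)        # true model is linear in x
r2_small <- summary(lm(y ~ x))$r.squared
r2_large <- summary(lm(y ~ x + I(x^2)))$r.squared  # add an irrelevant term
r2_large >= r2_small              # TRUE
```

This is why the adjusted R-squared, which penalizes extra parameters, is the better statistic for comparing models of different sizes.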