HW#3
Problem 1 (50 points). In this problem, you are asked to write a report summarizing your analysis of the popular "Auto MPG" data set from the literature. Much research has been done on this data set; here, the objective of our analysis is to predict whether a given car gets high or low gas mileage based on seven car attributes: cylinders, displacement, horsepower, weight, acceleration, model year, and origin.
(a) The “Auto MPG” data set is available at UCI Machine Learning (ML) Repository:
https://archive.ics.uci.edu/ml/datasets/Auto+MPG
Download the data file “auto-mpg.data” from UCI ML Repository or from Canvas, and use Excel or Notepad to see the data (this is a .txt file).
To save time, you can instead load the cleaned data from the file "Auto.csv" on Canvas with the R code below, assuming you save it in a local folder on your computer, say "C:/Temp":
Auto1 <- read.table(file = "C:/Temp/Auto.csv", sep = ",", header = TRUE)
mpg01 <- as.numeric(Auto1$mpg >= median(Auto1$mpg))  ## 1 = high mileage, 0 = low
Auto  <- data.frame(mpg01, Auto1[, -1])  ## replace column "mpg" by "mpg01"
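After loading, it is worth confirming that the recoding behaved as intended. A minimal sketch, illustrated on a tiny synthetic stand-in for Auto1 (the real data come from "Auto.csv" as above):

```r
## Sanity checks on the mpg01 recoding; the toy Auto1 below is a
## hypothetical stand-in for the real data frame read from Auto.csv.
Auto1 <- data.frame(mpg    = c(18, 15, 36, 31, 24, 27),
                    weight = c(3504, 3693, 1835, 1950, 2587, 2130))
mpg01 <- as.numeric(Auto1$mpg >= median(Auto1$mpg))
Auto  <- data.frame(mpg01, Auto1[, -1, drop = FALSE])
table(Auto$mpg01)      # the median split gives roughly balanced 0/1 classes
colSums(is.na(Auto))   # check for missing values before modeling
```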
(d) Split the data into a training set and a test set. Any reasonable splitting is acceptable, as long as you clearly explain how you split and why you think it is reasonable. For your convenience, you can either split randomly or save every fifth (or tenth) observation as test data.
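Both splitting strategies mentioned above can be sketched in a few lines of R. This is only an illustration on a synthetic stand-in for Auto; substitute the real data frame built earlier:

```r
## Two reasonable train/test splits (synthetic stand-in for Auto).
set.seed(1)                                   # reproducibility
Auto <- data.frame(mpg01  = rep(0:1, 50),
                   weight = rnorm(100, 3000, 500))
n <- nrow(Auto)

## (i) random split: hold out ~20% of the rows
test_id <- sample(n, size = round(n / 5))
train   <- Auto[-test_id, ]
test    <- Auto[test_id, ]

## (ii) deterministic split: hold out every fifth observation
test_id2 <- seq(5, n, by = 5)
train2   <- Auto[-test_id2, ]
test2    <- Auto[test_id2, ]
```

The random split avoids any ordering effects in the file; the deterministic split is reproducible without a seed but assumes the rows are not sorted in a way that correlates with mpg01.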
(e) Perform the following classification methods on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (c). What is the test error of each model obtained?
(1) LDA (2) QDA (3) Naive Bayes (4) Logistic Regression
(5) KNN with several values of K. Use only the variables that seemed most associated with mpg01 in (c). Which value of K seems to perform the best on this data set?
(6) (Optional) PCA-KNN. Principal Component Analysis (PCA) and other dimension reduction methods can easily be combined with other data mining methods. Recall that the essence of the PC reduction is to replace the $p$-dimensional explanatory variable $x_i = (x_{i1},\dots,x_{ip})$, for $i = 1,\dots,n$, with a new $p$-dimensional explanatory variable $u_i = (u_{i1},\dots,u_{ip})$, where $u_i = A_{p\times p}\, x_i$. Then we can apply standard data mining methods such as KNN to the first $r\ (\le p)$ entries of the $u_i$'s, $(u_{i1},\dots,u_{ir})$, to predict the $Y_i$'s. Find the test errors when KNN with different values of K (neighbors) is applied to the PCA-dimension-reduced data for $r = p-1, p-2, \dots, 1$.
(7) (Optional) Any other classification methods you want to propose or use.
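The methods listed above can all be fit with a few lines of R. The sketch below uses a synthetic stand-in for the train/test split (replace it with your real split and your chosen predictors); MASS and class ship with R, while e1071 (for naive Bayes) may need installing:

```r
## Sketch: fit several of the listed classifiers and compute test errors.
## The data below are a synthetic stand-in for the real Auto train/test split.
library(MASS)    # lda(), qda()
library(class)   # knn()
set.seed(1)
n <- 200
x <- data.frame(weight = rnorm(n, 3000, 500), horsepower = rnorm(n, 100, 30))
score <- x$weight + 5 * x$horsepower
mpg01 <- as.numeric(score + rnorm(n, 0, 200) < median(score))
Auto  <- data.frame(mpg01, x)
test_id <- seq(5, n, by = 5)
train <- Auto[-test_id, ]; test <- Auto[test_id, ]

test_err <- function(pred) mean(pred != test$mpg01)

## (1) LDA and (2) QDA
err_lda <- test_err(predict(lda(mpg01 ~ ., data = train), test)$class)
err_qda <- test_err(predict(qda(mpg01 ~ ., data = train), test)$class)

## (3) Naive Bayes: with e1071 installed,
##     e1071::naiveBayes(factor(mpg01) ~ ., data = train)

## (4) Logistic regression (threshold the posterior probability at 0.5)
p_glm   <- predict(glm(mpg01 ~ ., data = train, family = binomial),
                   test, type = "response")
err_glm <- test_err(as.numeric(p_glm > 0.5))

## (5) KNN over several K; predictors should be standardized for KNN
Xtr <- scale(train[, -1])
Xte <- scale(test[, -1],
             center = attr(Xtr, "scaled:center"),
             scale  = attr(Xtr, "scaled:scale"))
err_knn <- sapply(c(1, 3, 5, 10, 20), function(k)
  test_err(knn(Xtr, Xte, factor(train$mpg01), k = k)))

## (6) PCA-KNN sketch: KNN on the first r principal components (r = 1 here)
pca <- prcomp(train[, -1], scale. = TRUE)
Ztr <- pca$x[, 1, drop = FALSE]
Zte <- predict(pca, test[, -1])[, 1, drop = FALSE]
err_pcaknn <- test_err(knn(Ztr, Zte, factor(train$mpg01), k = 5))
```

Comparing err_knn across K values addresses the question in (5) of which K performs best; for (6), wrap the PCA step in a loop over r.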
Write a report to summarize your findings. The report should include (i) Introduction, (ii) Exploratory (or preliminary) Data Analysis, (iii) Methods, (iv) Results and (v) Findings. Also see the guidelines on the final report of our course project. Please attach your computing code for R or other statistical software (without, or with limited, output) in the appendix of your report, and please do not just dump the computer output in the body of the report. It is important to summarize and interpret your computer output results.
Problem 2. (Optional, no credit; this is a PhD IE/statistics-level question.) Suppose we have features $x \in \mathbb{R}^p$, a two-class response with class sizes $N_1, N_2$, and the target coded as $-N/N_1$, $N/N_2$, where $N = N_1 + N_2$. In other words, we observe $N$ observations $(y_i, x_i)$ with $x_i \in \mathbb{R}^p$ and the response variable $y_i = -N/N_1$ or $y_i = N/N_2$, so that $\sum_{i=1}^{N} y_i = 0$.
(a) Show that the LDA rule classifies to class 2 if
\[
x^T \hat\Sigma^{-1}(\hat\mu_2 - \hat\mu_1) \;>\; \tfrac{1}{2}(\hat\mu_2 + \hat\mu_1)^T \hat\Sigma^{-1}(\hat\mu_2 - \hat\mu_1) - \log(N_2/N_1),
\]
and class 1 otherwise. Here $\hat\mu_k = \frac{1}{N_k}\sum_{g_i = k} x_i$ is the sample mean of the $x_i$ in class $k$, for $k = 1, 2$, and $\hat\Sigma = \frac{1}{N-2}\sum_{k=1}^{2}\sum_{g_i = k}(x_i - \hat\mu_k)(x_i - \hat\mu_k)^T$ is the pooled sample covariance matrix.
(b) Let us treat $y_i = -N/N_1$ or $y_i = N/N_2$ as numerical values, and consider minimization of the least squares criterion $\sum_{i=1}^{N}(y_i - \beta_0 - \beta^T x_i)^2$. Show that the solution $\hat\beta$ satisfies
\[
\Big[(N-2)\hat\Sigma + \frac{N_1 N_2}{N}\hat\Sigma_B\Big]\beta = N(\hat\mu_2 - \hat\mu_1)
\]
(after simplification), where $\hat\Sigma_B = (\hat\mu_2 - \hat\mu_1)(\hat\mu_2 - \hat\mu_1)^T$.
(c) Hence show that $\hat\Sigma_B \beta$ is in the direction $(\hat\mu_2 - \hat\mu_1)$, and thus $\hat\beta \propto \hat\Sigma^{-1}(\hat\mu_2 - \hat\mu_1)$. Therefore the least squares regression coefficient is identical to the LDA coefficient, up to a scalar multiple.
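For part (c), the key observation is a one-line identity (in the notation above):

```latex
\[
\hat\Sigma_B \beta
  = (\hat\mu_2-\hat\mu_1)(\hat\mu_2-\hat\mu_1)^T\beta
  = \underbrace{\big[(\hat\mu_2-\hat\mu_1)^T\beta\big]}_{\text{a scalar}}
    \,(\hat\mu_2-\hat\mu_1),
\]
```

so $\hat\Sigma_B\hat\beta$ points along $\hat\mu_2-\hat\mu_1$; substituting into the equation of part (b) then shows $(N-2)\hat\Sigma\hat\beta$ is also a multiple of $\hat\mu_2-\hat\mu_1$, whence $\hat\beta \propto \hat\Sigma^{-1}(\hat\mu_2-\hat\mu_1)$.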
(d) Show that this result holds for any (distinct) coding of the two classes. That is, if we code the $Y$ values as any two distinct values $c_1 \neq c_2$ and do linear regression as in part (b), then the result in part (c) still holds. In other words, the specific choices $-N/N_1$ and $N/N_2$ make the proofs in (b) and (c) a little easier, but they are not essential.
(e) Find the solution $\hat\beta_0$, and hence the predicted values $\hat f(x) = \hat\beta_0 + \hat\beta^T x$. Consider the following linear regression rule: classify to class 2 if $\hat y_i > 0$ and class 1 otherwise. Show that this is not the same as the LDA rule unless the classes have equal numbers of observations. (Fisher, 1936; Ripley, 1996)
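A starting point for part (e): the coding was chosen so the responses sum to zero, and the least squares normal equations give the intercept in terms of the overall mean $\bar x = (N_1\hat\mu_1 + N_2\hat\mu_2)/N$:

```latex
\[
\hat\beta_0 \;=\; \bar y - \bar x^T\hat\beta \;=\; -\,\bar x^T\hat\beta,
\qquad\text{since }\;
\bar y = \tfrac{1}{N}\Big(N_1\cdot\big(-\tfrac{N}{N_1}\big)
                       + N_2\cdot\tfrac{N}{N_2}\Big) = 0,
\]
```

so the regression rule $\hat y_i > 0$ becomes $\hat\beta^T(x_i - \bar x) > 0$. Comparing this cut-point with the LDA threshold in part (a) shows the two rules coincide only when $N_1 = N_2$.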



