PSTAT 131 Homework 1 Solved

Description

5/5 – (1 vote)

Sunrise Gao
Homework 1
PSTAT131/231
Machine Learning Main Ideas
Please answer the following questions. Be sure that your solutions are clearly marked and that your document is neatly formatted.
You don’t have to rephrase everything in your own words, but if you quote directly, you should cite whatever materials you use (this can be as simple as “from the lecture/page # of book”).
Question 1:
Define supervised and unsupervised learning. What are the difference(s) between them?
• Supervised learning: It is defined by its use of labeled data sets to train algorithms that to classify data or predict outcomes accurately. As input data is fed into the model, it adjusts its weights until the model has been fitted appropriately, which occurs as part of the cross validation process (Cited: https://www.ibm.com/cloud/learn/supervised-learning)
• Unsupervised learning: also known as unsupervised machine learning, uses machine learning algorithms to analyze and cluster unlabeled datasets.(Cited: https://www.ibm.com/cloud/learn/unsupervisedlearning)
• Supervised learning algorithms require input-output pairs (i.e. they require the output), unsupervised learning requires only the input data (not the outputs).
Question 2:
Explain the difference between a regression model and a classification model, specifically in the context of machine learning.
• Regression model: Regression analysis is a fundamental concept in the field of machine learning. It falls under supervised learning wherein the algorithm is trained with both input features and output labels. It helps in establishing a relationship among the variables by estimating how one variable affects the other. Especially, The Y is quantitative which is numerical values.(Cited: https://builtin.com/datascience/regression-machine-learning)
• Classification model: Classification is a task that requires the use of machine learning algorithms that learn how to assign a class label to examples from the problem domain. An easy to understand example is classifying emails as “spam” or “not spam.” The Y is qualitative which is categorical values.(Cited: https://machinelearningmastery.com/types-of-classification-in-machine-learning/)
Question 3:
Name two commonly used metrics for regression ML problems. Name two commonly used metrics for classification ML problems.
• For regression ML problems: Mean Square Error, Mean Absolute Error.
• For classification ML problems: Accuracy, Confusion.
Question 4:
As discussed, statistical models can be used for different purposes. These purposes can generally be classified into the following three categories. Provide a brief description of each.
• Descriptive models: For choosing a model to best visually emphasize a trend in data i.e., using a line on a scatterplot.
• Inferential models: Aim is to test theories; (Possibly) causal claims;State relationship between outcome & predictor(s). (From lecture_day2)
• Predictive models: Aim is to predict Y with minimum reducible error; Not focused on hypothesis tests.
Question 5:
Predictive models are frequently used in machine learning, and they can usually be described as either mechanistic or empirically-driven. Answer the following questions.
• Define mechanistic. Define empirically-driven. How do these model types differ? How are they
similar? Mechanistic: Assuming a parametric form for f(i.e. beta_0 + beta_1 + …) that Won’t match true unknown f and can add parameters to it which it will be more flexibility.
Empirically-driven: There is no assumptions on f; Requiring a larger # of observations and much more flexible by default.
Mechanistic model can be more flexibility by adding more parameters while empirically-driven can be more flexible by default; Mechanistic model will cause overfitting by adding more parameters while empirically-driven is just overfitting by default.
• In general, is a mechanistic or empirically-driven model easier to understand? Explain your choice.
Mechanistic model is easier to understand becasue a mechanistic model uses an existing theory to predict things in the real world while an empirical model is to study real-world events to develop a theory then to do the calculation. So I would’ve just done the calculation directly.
• Describe how the bias-variance tradeoff is related to the use of mechanistic or empirically-driven models.
Question 6:
A political candidate’s campaign has collected some detailed voter history data from their constituents. The campaign is interested in two questions:
• Given a voter’s profile/data, how likely is it that they will vote in favor of the candidate?
• How would a voter’s likelihood of support for the candidate change if they had personal contact with the candidate?
Classify each question as either predictive or inferential. Explain your reasoning for each.
• Given a voter’s profile/data, how likely is it that they will vote in favor of the candidate? This is a predictive model. Since we just predict the outcome and without hypothesis test.
• How would a voter’s likelihood of support for the candidate change if they had personal contact with the candidate? This is an inferential model.Since this is a causal claim which the candidate whether has personal contact.
Exploratory Data Analysis
This section will ask you to complete several exercises. For this homework assignment, we’ll be working with the mpg data set that is loaded when you load the tidyverse. Make sure you load the tidyverse and any other packages you need.
library(tidyverse)
## — Attaching packages ————————————— tidyverse 1.3.2 —
## v ggplot2 3.3.6 v purrr 0.3.4
## v tibble 3.1.8 v dplyr 1.0.10
## v tidyr 1.2.1 v stringr 1.4.1 ## v readr 2.1.2 v forcats 0.5.2
## — Conflicts —————————————— tidyverse_conflicts() -## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(ggplot2) library(dplyr) tidyverse_packages()
## [1] “broom” “cli” “crayon” “dbplyr”
## [5] “dplyr” “dtplyr” “forcats” “ggplot2”
## [9] “googledrive” “googlesheets4” “haven” “hms”
## [13] “httr” “jsonlite” “lubridate” “magrittr”
## [17] “modelr” “pillar” “purrr” “readr”
## [21] “readxl” “reprex” “rlang” “rstudioapi”
## [25] “rvest” “stringr” “tibble” “tidyr”
## [29] “xml2” “tidyverse”
library(corrplot)
## corrplot 0.92 loaded
Exploratory data analysis (or EDA) is not based on a specific set of rules or formulas. It is more of a state of curiosity about data. It’s an iterative process of:
• generating questions about data
• visualize and transform your data as necessary to get answers
• use what you learned to generate more questions
A couple questions are always useful when you start out. These are “what variation occurs within the variables,” and “what covariation occurs between the variables.” You should use the tidyverse and ggplot2 for these exercises.
Exercise 1:
We are interested in highway miles per gallon, or the hwy variable. Create a histogram of this variable. Describe what you see/learn.
ggplot(mpg, aes(x=hwy)) + geom_histogram(fill=”orange”) + xlab(“Highway Miles Per Gallon”) +
ggtitle(“Histogram of Highway Miles Per Gallon”)
## ‘stat_bin()‘ using ‘bins = 30‘. Pick better value with ‘binwidth‘.
Histogram of Highway Miles Per Gallon

As we can see, The data is not evenly distributed. Most data between 16-20 hwy & 21-29 hwy. The biggest count is nearly 26 hwy and the least count is nearly 41.
Exercise 2:
Create a scatterplot. Put hwy on the x-axis and cty on the y-axis. Describe what you notice. Is there a relationship between hwy and cty? What does this mean?
ggplot(mpg, aes(x=hwy, y=cty)) + geom_point() + xlab(“Highway Miles Per Gallon”) + ylab(“City Miles Per Gallon”) +
ggtitle(“Scatterplot of highway miles per gallon regrading city miles per gallon”)
Scatterplot of highway miles per gallon regrading city miles per gallon

From the plot, we can notice that the hwy has a positive correlation to the cty which imply a car has higher highway miles per gallon also has a high city miles per gallon.
Exercise 3:
Make a bar plot of manufacturer. Flip it so that the manufacturers are on the y-axis. Order the bars by height. Which manufacturer produced the most cars? Which produced the least?
# original plot
ggplot(mpg, aes(x= manufacturer)) + geom_bar(stat = “count”, fill=”pink”) + ggtitle(“Bar Plot of Manufacturer”)
Bar Plot of Manufacturer

# Filp and order plot mpg %>% group_by(manufacturer) %>% summarise(count = n()) %>%
ggplot(aes(y = reorder(manufacturer,(count)), x = count)) + geom_bar(stat = ‘identity’,fill=”pink” )+
xlab(“weight”)+ylab(“manufacturer”) + ggtitle(“Bar Plot of Manufacturer (order)”)
Bar Plot of Manufacturer (order)

dodge
toyota volkswagen
ford chevrolet audi subaru hyundai nissan honda jeep pontiac mercury land rover lincoln
0 10 20 30
weight
As we can see, Dodge produces the most cars and Lincoin produces the least cars.
Exercise 4:
Make a box plot of hwy, grouped by cyl. Do you see a pattern? If so, what?
ggplot(mpg, aes( y=hwy, x=factor(cyl))) + geom_boxplot(fill=”red”) + xlab(“Highway Miles Per Gallon”) + ylab(“Number of Cylinders”) +
ggtitle(“Boxplot of Highway Miles Per Gallon vs Number of Cylinders”)
Boxplot of Highway Miles Per Gallon vs Number of Cylinders

As the plot above we can say that the Highway miles per gallon has a negative correlation to the number of cylinders which means a car that has less number of cylinders will have a high highway miles per gallon.
Exercise 5:
Use the corrplot package to make a lower triangle correlation matrix of the mpg dataset. (Hint: You can find information on the package here.)
data <- subset(mpg, select = -c(manufacturer,model,trans,drv,fl,class)) data
## # A tibble: 234 x 5
## displ year cyl cty hwy
## <dbl> <int> <int> <int> <int>
## 1 1.8 1999 4 18 29 ## 2 1.8 1999 4 21 29 ## 3 2 2008 4 20 31 ## 4 2 2008 4 21 30 ## 5 2.8 1999 6 16 26 ## 6 2.8 1999 6 18 26 ## 7 3.1 2008 6 18 27 ## 8 1.8 1999 4 18 26 ## 9 1.8 1999 4 16 25 ## 10 2 2008 4 20 28
## # … with 224 more rows
M <- cor(data) corrplot(M, type=”lower”)

displ
year
cyl
cty
hwy

−1 −0.8 −0.6 −0.4 −0.2 0 0.2 0.4 0.6 0.8 1
Which variables are positively or negatively correlated with which others? Do these relationships make sense to you? Are there any that surprise you?
• Both highway miles per gallon and city miles per gallon have negative relationship with displ which makes sense since a higher displ will use more oil and decrease the miles per gallon.
• Both highway miles per gallon and city miles per gallon have negative relationship with cyl (the number of cylinders) which makes sense again since a higher number of cylinders will use more oil and decrease the miles per gallon.
• The highway miles per gallon has positive relationship with city miles per gallon as they should.
• The number of cylinders has positive relationship with disp which also makes sense that more cylinders can provide more power and cause higher disp.

Reviews

There are no reviews yet.

Be the first to review “PSTAT 131 Homework 1 Solved”

PSTAT 131 Homework 1 Solved

Description

Reviews

Related products

PSTAT 131 Homework 5 Solved

PSTAT 131 Homework 4 Solved

PSTAT 131 Homework 3 Solved

PSTAT 131 Homework 2 Solved

PSTAT131 – Credit Card Fraud Detection Using Machine Learning Solved