Description
• Identifying the parts of the SLR model for a given situation.
• Summarizing the relationship between two quantitative variables with a scatterplot, linear correlation, and the regression equation.
• Inference in SLR.
• Transformation of variables to achieve normality and meet assumptions.
Reading Connection: As outlined in the Activity 1.1 Readiness Materials, read textbook sections 2.1, 2.2, 2.3, 2.4, and 2.5. Prior to this activity it is assumed that you have read these sections.
There are many types of human coronaviruses, including some that commonly cause mild upper-respiratory tract illnesses. COVID-19 is a new disease, caused by a novel (or new) coronavirus that has not previously been seen in humans.
Data Description: The COVID Tracking Project (https://COVIDtracking.com/) maintains numerous data sets related to the COVID-19 pandemic. Included are daily totals for each state and other entities (such as Puerto Rico) that are part of the United States. The data set includes 56 states/entities of the United States.
There are many variables included in the dataset. A selection of them is:
• state – two letter abbreviation for state or entity
• positive – total number of positive cases
In LM 1 you will be analyzing the COVID-19 data.
Activity Start:
STARTING WITH R
Open the RStudio Server. (See Handout 1 in the Technology menu)
• Navigate to Home > SharedProjects > Adrian > STA 321
• Click the box next to “Activity 1.1 Covid-19 Part 1 starter.rmd”
• Click on More → Copy To. Click on Home
Click on “Activity 1.1 Covid-19 Part 1 starter.rmd” to open the RMarkdown Pane.
It will write the dataset into an R object named Covid. Open the Environment tab and notice that Covid is there. Click on Covid to see a spreadsheet of the dataset just like the one above.
In this activity we are going to focus on the variables death and positive. Our goal is to develop a simple linear regression model using the number of positive cases to predict the number of deaths.
1. Choose the correct notation and role of each variable.*
(Note: * denotes that this is a question on the Activity 1.1 Blackboard quiz)
A. The predictor variable denoted as X is the number of deaths; the response variable denoted as Y is the number of positive cases
B. The response variable denoted as X is the number of deaths; the predictor variable denoted as Y is the number of positive cases
C. The predictor variable denoted as Y is the number of deaths; the response variable denoted as X is the number of positive cases
D. The response variable denoted as Y is the number of deaths; the predictor variable denoted as X is the number of positive cases
Note: We use the terms predictor variable and explanatory variable interchangeably.
2. Sketch out a table similar to Table 2.1 in the text that shows the notation for the first three rows of the data set with only the variables death and positive. Put in the appropriate values.
Observation Number | Response Variable | Predictor Variable
1=AK | Y1=305 | X1=56,886
2=AL | Y2=10,148 | X2=499,819
** Class Check Point **
Univariate EDA (Exploratory Data Analysis)
Before we fit a regression model we should explore each of the variables separately both numerically and graphically. Use the R Output below for Questions 3 – 6.
Death
Positive
Obtain this output from your program:
• Run the chunk for “EDA – summarize death variable”
• Fill in the blanks and then run the chunk for “EDA – summarize positive variable”
Note: favstats() and histogram() are both on the 321 R sheet under the Technology menu.
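The numerical and graphical summaries described above can be sketched with base R alone. This is only an illustration, not the starter file's code: Covid_toy is a hypothetical stand-in for Covid, and summary(), sd(), and hist() are used here as base R stand-ins for the favstats() and histogram() calls on the 321 R sheet. Only the AK and AL rows use real values (those quoted in Question 2); the other rows are made up.

```r
# Toy stand-in for the Covid data frame (hypothetical name).
# AK and AL use the values quoted in Question 2; rows XX, YY, ZZ are made up.
Covid_toy <- data.frame(
  state    = c("AK", "AL", "XX", "YY", "ZZ"),
  death    = c(305, 10148, 0, 1200, 25000),
  positive = c(56886, 499819, 0, 90000, 1500000)
)

# Numerical summary: minimum, quartiles, median, mean, maximum
death_summary <- summary(Covid_toy$death)
print(death_summary)

# Standard deviation and number of non-missing values,
# which summary() does not report
death_sd <- sd(Covid_toy$death)
death_n  <- sum(!is.na(Covid_toy$death))

# Graphical summary: histogram of the death variable
hist(Covid_toy$death, main = "Deaths (toy data)", xlab = "death")
```

Swapping death for positive in each line gives the matching summaries for the second variable.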
3. Interpret what the median for the variable death tells you.*
4. Consider the minimum of deaths and the minimum of positive. Write a brief explanation of what this suggests about the data.
Missing Data is the probable cause of the ‘0’ for both minimum values.
5. Describe the distribution of deaths.
A. Unimodal, most states have fewer than 10,000 deaths; very skewed left
B. Unimodal, most states have fewer than 10,000 deaths; very skewed right
C. Unimodal, most states have about 25 deaths; very skewed left
D. Unimodal, most states have about 35 deaths; very skewed right
6. Before we even made the histogram for positive cases we should have been able to predict what the shape would be. Why?
A. Because the median is much greater than the mean we expect skewed left
B. Because the median is much greater than the mean we expect skewed right
C. Because the mean is much greater than the median we expect skewed left
D. Because the mean is much greater than the median we expect skewed right
** Class Check Point **
Data Management
It appears some of the U.S. entities that are not states are skewing the data. We are going to restrict ourselves to the 50 states, the District of Columbia, and Puerto Rico. The R Output below shows what happens to the numerical and graphical summaries when we restrict ourselves to these 52 data points.
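One way this restriction could be carried out is sketched below. It assumes the state column holds two-letter abbreviations, as described in the variable list, and it relies on state.abb, the vector of 50 state abbreviations that ships with base R. Covid_toy and Covid52_toy are hypothetical names, and every number other than the AK and AL values from Question 2 is made up. The starter file may do this step differently.

```r
# The 50 state abbreviations ship with base R as state.abb;
# DC and PR are added by hand to reach 52 entities.
keep <- c(state.abb, "DC", "PR")

# Toy stand-in for Covid (hypothetical). "GU" (Guam) represents an entity
# that should be dropped. Only the AK and AL numbers come from Question 2.
Covid_toy <- data.frame(
  state    = c("AK", "AL", "DC", "PR", "GU"),
  death    = c(305, 10148, 1000, 2000, 100),
  positive = c(56886, 499819, 40000, 100000, 7000)
)

# Keep only the rows whose abbreviation is in the list of 52
Covid52_toy <- subset(Covid_toy, state %in% keep)
nrow(Covid52_toy)
```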
Death
Positive
** Class Check Point **
Bivariate EDA
Before we fit a simple linear regression model we should explore the relationship between the two variables numerically and graphically. Use the R Output below for Questions 7 – 9.
Check that you can get this output from your R program.
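As a rough guide to what such a program might contain, the sketch below draws a scatterplot and computes the linear correlation with base R. Covid_toy is a hypothetical stand-in for the restricted data set; only the first two rows (AK and AL, from Question 2) are real values, so the correlation it prints is not the one in the R Output.

```r
# Toy stand-in for the restricted data (hypothetical name).
# Rows 1 and 2 are AK and AL from Question 2; rows 3 to 5 are made up.
Covid_toy <- data.frame(
  death    = c(305, 10148, 1200, 4000, 25000),
  positive = c(56886, 499819, 90000, 250000, 1500000)
)

# Scatterplot with the response on the vertical axis
plot(death ~ positive, data = Covid_toy,
     xlab = "Positive cases", ylab = "Deaths")

# Linear correlation between the two variables
r <- cor(Covid_toy$death, Covid_toy$positive)
print(r)
```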
correlation
We will use the ruler below when we interpret the linear correlation.
7. Choose the best description of the relationship between deaths and positive cases.
A. There is a relationship between deaths and positive cases
B. There is a linear relationship between deaths and positive cases
C. There is a positive, linear relationship between deaths and positive cases
D. There is a strong, positive, linear relationship between deaths and positive cases
** Class Check Point **
Correlation Details
The formula for the linear correlation is:
$$ r = \mathrm{Cor}(Y, X) = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{y_i - \bar{y}}{s_y} \right) \left( \frac{x_i - \bar{x}}{s_x} \right), $$
where n = the number of data points (i.e., sample size), $\bar{y}$ is the mean of the y variable, $s_y$ is the standard deviation of the y variable, $\bar{x}$ is the mean of the x variable, and $s_x$ is the standard deviation of the x variable.
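To see that the formula above agrees with R's built-in cor(), the sketch below evaluates it term by term on a small set of numbers. The vectors y and x are illustrative only: the first two entries are the AK and AL values from Question 2 and the rest are made up.

```r
# Illustrative data: entries 1 and 2 are AK and AL from Question 2,
# entries 3 to 5 are made up.
y <- c(305, 10148, 1200, 4000, 25000)          # deaths
x <- c(56886, 499819, 90000, 250000, 1500000)  # positive cases
n <- length(y)

# Direct translation of the formula: sum of products, divided by n - 1
r_formula <- sum(((y - mean(y)) / sd(y)) * ((x - mean(x)) / sd(x))) / (n - 1)

# Built-in calculation for comparison
r_builtin <- cor(y, x)

all.equal(r_formula, r_builtin)
```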
8. Choose the best description for the quantity $y_i - \bar{y}$.*
A. Absolute distance of the y-coordinate for the ith data value from the mean
B. Z-score of the y-coordinate for the ith data value from the mean
C. Deviation of the y-coordinate for the ith data value from the mean
9. Choose the best description for the quantity $\frac{y_i - \bar{y}}{s_y}$.*
A. Absolute distance of the y-coordinate for the ith data value from the mean
B. Z-score of the y-coordinate for the ith data value from the mean
C. Deviation of the y-coordinate for the ith data value from the mean
The picture below is a modified version of Figure 2.1 in the text that illustrates linear correlation.
10. Choose each statement below that is true about points in Quadrant 1 of the scatterplot.
A. The deviation of the x-coordinate is positive
B. The deviation of the y-coordinate is positive
C. The z-score of the x-coordinate is positive
D. The z-score of the y-coordinate is positive
E. The point adds a positive amount to the calculation of the correlation
11. Choose each statement below that is true about points in Quadrant 2 of the scatterplot.*
A. The deviation of the x-coordinate is positive
B. The deviation of the y-coordinate is positive
C. The z-score of the x-coordinate is positive
D. The z-score of the y-coordinate is positive
E. The point adds a positive amount to the calculation of the correlation
12. Choose each statement below that is true about points in Quadrant 3 of the scatterplot.
A. The deviation of the x-coordinate is positive
B. The deviation of the y-coordinate is positive
C. The z-score of the x-coordinate is positive
D. The z-score of the y-coordinate is positive
E. The point adds a positive amount to the calculation of the correlation
13. Choose each statement below that is true about points in Quadrant 4 of the scatterplot. Have your Whiteboard ready to respond when your team is called upon.
A. The deviation of the x-coordinate is positive
B. The deviation of the y-coordinate is positive
C. The z-score of the x-coordinate is positive
D. The z-score of the y-coordinate is positive
E. The point adds a positive amount to the calculation of the correlation
14. In this scatterplot most of the points are in Quadrants 1 and 3 with fewer points in Quadrants 2 and 4. True/False: This means that most points add a positive amount to the calculation of the correlation.*
A. True
B. False
** Class Check Point **
Simple Linear Regression (SLR) Model
The Simple Linear Regression (SLR) model is: $Y = \beta_0 + \beta_1 X + \varepsilon$. For Questions 15 – 19, apply the model to the relationship between deaths and positive cases for COVID-19.
15. What does Y represent?
16. What does X represent?
17. What does β0 represent?
Note: On page 33 the text states that “The coefficient β0 …. is the predicted value of Y when X=0.” This is only a valid interpretation when X=0 is within the range of the data set (i.e., between the minimum X value and the maximum X value).
18. What does β1 represent?
19. What does ε represent?*
C. The difference between the slope of the line and the predicted slope of the line found as $\beta_1 - \hat{\beta}_1$
D. The difference between the slope of the line and the predicted slope of the line found as $\hat{\beta}_1 - \beta_1$
** Class Check Point **
Parameter Estimation
We estimate the y-intercept β0 and the slope β1 using the method of least-squares. While I don’t expect you to find estimates by hand using the formulas in the textbook, it is important that you understand how the method of least-squares finds estimates for the y-intercept and slope.
We use the two plots below to illustrate the method of least-squares. Suppose that the six points represent the entire population; this means that the least-squares line we find is the line for the population. Use Plot 1 to answer Questions 20 – 21.
20. In Plot 1 you can see a red vertical line drawn from each of the six points to the least-squares line. What do the direction and length of this vertical line represent?
A. The error $\varepsilon_i = y_i - (\beta_0 + \beta_1 x_i)$
B. The deviation $d_i = y_i - \bar{y}$
C. The error $\varepsilon_i = (\beta_0 + \beta_1 x_i) - y_i$
D. The deviation $d_i = \bar{y} - y_i$
21. The method of least-squares finds the unique line that fits two criteria. Criterion 1 is that the sum of the errors is 0. Choose the true statement from the list below.
A. The total perpendicular distance of the two points above the least-squares line equals the total perpendicular distance of the four points below the least-squares line
B. The total length of the two red lines above the least-squares line equals the total length of the four red lines below the least-squares line
C. The total horizontal distance of the two points above the least-squares line equals the total horizontal distance of the four points below the least-squares line
Use Plot 2 to answer Questions 22 – 23.
22. In Plot 2 a square has been drawn for each point. What does the area of each square represent?
A. The squared error $(\varepsilon_i)^2 = (y_i - (\beta_0 + \beta_1 x_i))^2$
B. The squared deviation $(d_i)^2 = (y_i - \bar{y})^2$
C. The squared error $(\varepsilon_i)^2 = ((\beta_0 + \beta_1 x_i) - y_i)^2$
D. The squared deviation $(d_i)^2 = (\bar{y} - y_i)^2$
23. The method of least-squares finds the unique line that fits two criteria. Criterion 2 is that the sum of squares error $SSE = \sum_{i=1}^{n} (\varepsilon_i)^2$ is minimized. Choose the correct statement.*
A. The least-squares line is chosen to make the six squares as close in area to each other as possible
B. The least-squares line is chosen to make the largest square as small as possible
C. The least-squares line is chosen so that the total area of the two blue squares equals the total area of the four green squares
D. The least-squares line is chosen so that the total area of the six squares is as small as possible for all the lines that meet Criterion 1
Note: Among all lines that meet Criterion 1, there is only one line that makes the SSE as small as possible.
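The two criteria can be checked numerically. The sketch below uses six made-up points (they are not the points in Plot 1 or Plot 2) and the standard closed-form least-squares estimates. It then compares the least-squares line with a second, hypothetical line that also has errors summing to 0 but a different slope. The names b0, b1, b0_alt, and b1_alt are introduced only for this illustration.

```r
# Six made-up points (not the ones shown in the plots)
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2.1, 2.9, 4.2, 4.8, 6.1, 6.9)

# Standard closed-form least-squares estimates of slope and intercept
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)

# Errors from the least-squares line, their sum, and the SSE
errors <- y - (b0 + b1 * x)
sum(errors)              # 0 up to rounding
sse <- sum(errors^2)

# A competing line through (mean(x), mean(y)) with a different slope:
# its errors also sum to 0, but its SSE is larger.
b1_alt <- b1 + 0.2
b0_alt <- mean(y) - b1_alt * mean(x)
errors_alt <- y - (b0_alt + b1_alt * x)
sum(errors_alt)          # also 0 up to rounding
sse_alt <- sum(errors_alt^2)

c(sse = sse, sse_alt = sse_alt)

# lm() returns the same intercept and slope
fit <- lm(y ~ x)
coef(fit)
```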
** Class Check Point **
R Output for the least-squares line
Use the output below to answer Questions 24 – 28. (Make sure you can get it yourself from running your R program.)
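A minimal sketch of how a least-squares fit could be produced in R is shown here, assuming a data frame with columns named death and positive. Covid_toy is a hypothetical stand-in, and only its first two rows (AK and AL, from Question 2) are real values, so its coefficients will not match the R Output. The call format(..., scientific = FALSE) is one way to print estimates without e+ or e- notation.

```r
# Toy stand-in for the restricted data (hypothetical name).
# Rows 1 and 2 are AK and AL from Question 2; rows 3 to 5 are made up.
Covid_toy <- data.frame(
  death    = c(305, 10148, 1200, 4000, 25000),
  positive = c(56886, 499819, 90000, 250000, 1500000)
)

# Fit the simple linear regression of death on positive
fit <- lm(death ~ positive, data = Covid_toy)

# Estimated y-intercept and slope
coefs <- coef(fit)

# Print the estimates without scientific notation
coefs_plain <- format(coefs, scientific = FALSE)
print(coefs_plain)
```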
24. Write out the least-squares regression line using the notation Y and X, and without any scientific notation (i.e. no e+ or e- notation).
25. Interpret the y-intercept in the context of the problem.
A. We predict that for a U.S. entity with 0 positive cases there would be 450.5 deaths
B. We predict that for a U.S. entity with 0 positive cases there would be 0.0171 deaths
C. For every 1000 positive cases we predict 450.5 more deaths for a U.S. entity
D. Positive cases = 0 is not within the range of the data; therefore, the y-intercept is not interpretable
26. Interpret the slope in the context of the problem.*
A. 11,642 deaths
B. 11,226 deaths
C. 11,219 deaths
D. 8406 deaths
A. Michigan had 2494 more deaths than predicted
B. Michigan had 2494 fewer deaths than predicted
C. Michigan had 5016 more deaths than predicted
D. Michigan had 5016 fewer deaths than predicted
The graph below shows the scatterplot of the data with the regression line drawn on the plot.
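A plot of this kind, along with a predicted value and its residual, could be produced roughly as follows. Covid_toy is again a hypothetical stand-in (only the AK and AL rows from Question 2 are real), and the entity with 250,000 positive cases and 4,000 deaths is made up, so none of the numbers correspond to the quiz answers.

```r
# Toy stand-in for the restricted data (hypothetical name).
# Rows 1 and 2 are AK and AL from Question 2; rows 3 to 5 are made up.
Covid_toy <- data.frame(
  death    = c(305, 10148, 1200, 4000, 25000),
  positive = c(56886, 499819, 90000, 250000, 1500000)
)
fit <- lm(death ~ positive, data = Covid_toy)

# Predicted deaths for a made-up entity with 250,000 positive cases
new_entity <- data.frame(positive = 250000)
predicted  <- predict(fit, newdata = new_entity)

# Residual = observed deaths minus predicted deaths
observed       <- 4000
residual_value <- observed - predicted

# Scatterplot with the fitted regression line drawn on top
plot(death ~ positive, data = Covid_toy,
     xlab = "Positive cases", ylab = "Deaths")
abline(fit)
```

A positive residual means the entity had more deaths than the line predicts; a negative residual means it had fewer.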
** Final Class Check Point **
Starting with Activity 1.2, we will “knit” the RMarkdown file. Steps:
• Click the Knit button and an HTML document will pop up in the Viewer. Check this document for errors.
• On line 4 of the program, change output: html_document to output: pdf_document
• Click Knit. A PDF document should appear that you can download.
• The last question of each Activity quiz will ask you to upload this document.


