Exploratory Data Analysis (EDA)
STAT/BIOS 823 Homework 6
Directions
Using RMarkdown in RStudio, complete the following questions. Launch RStudio and open a new RMarkdown file, or use the class RMarkdown template provided, and save it in your working directory as a .Rmd file. At the end of the activity, save the PDF generated with RMarkdown and knitr and submit your homework on Blackboard.
If you have questions, please post them on the lesson discussion board. All questions are mandatory, and the R code and output must be clearly shown.
1. Use the built-in dataset cars.
(a) Reproduce Figure 1 by creating a scatterplot of speed versus distance, starting with the code plot(dist ~ speed, data = cars). Add the following details to the plot: subtitle “Using plot() in R”, main title “Scatterplot of Speed versus Distance”, x axis title “Speed (miles per hour)”, y axis title “Stopping Distance (feet)”. Make the main title red and the axis colors magenta. Use filled circles for the symbol type, and make the symbol color blue and the axis labels dark green. Set the size of the title fonts, axis labels and symbols to 1.5.
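The details listed in (a) map onto standard graphical parameters of plot(). The sketch below is one possible reading, not the official solution: in particular, it assumes that "axis colors" refers to the tick annotations (col.axis) and that a size of 1.5 means a cex value of 1.5 for every text element.

```r
# One possible reading of part (a); parameter choices are assumptions.
plot(dist ~ speed, data = cars,
     main = "Scatterplot of Speed versus Distance",
     sub  = "Using plot() in R",
     xlab = "Speed (miles per hour)",
     ylab = "Stopping Distance (feet)",
     pch = 19,               # filled circles
     col = "blue",           # symbol color
     col.main = "red",       # main title color
     col.axis = "magenta",   # color of the tick annotations
     col.lab  = "darkgreen", # color of the axis titles
     cex = 1.5, cex.main = 1.5, cex.sub = 1.5,
     cex.lab = 1.5, cex.axis = 1.5)
```

If the axis lines themselves should also be magenta, one option is to draw the plot with axes = FALSE and add each axis with axis(1, col = "magenta") and axis(2, col = "magenta").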
(b) From Figure 1, create Figure 2. This can be done by turning off the axes using plot(…, axes = FALSE). Add a line of best fit using abline(lm(dist ~ speed, data = cars)). Add a grid using grid(). Add a box around the plotting area using box(col = "red", lwd = 3, lty = 3). Add a legend to the plot using legend("topleft", inset = 0.01, title = "Distance vs. Speed", legend = c("Observation"), col = c("blue"), pch = 19, horiz = TRUE). You can add these commands to restore the axes: axis(1) and axis(2).
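The commands in (b) are layered onto an existing plot in sequence. A minimal sketch of that ordering follows; the variable name fit is introduced here for illustration, and the styling of the base plot is kept deliberately short.

```r
# Base plot without axes; styling kept minimal for this sketch.
plot(dist ~ speed, data = cars, axes = FALSE, pch = 19, col = "blue",
     main = "Scatterplot of Speed versus Distance",
     sub = "Using plot() in R")

# Least squares line of stopping distance on speed.
fit <- lm(dist ~ speed, data = cars)
abline(fit)

grid()                              # reference grid at the default tick positions
box(col = "red", lwd = 3, lty = 3)  # dotted red frame around the plotting area
legend("topleft", inset = 0.01, title = "Distance vs. Speed",
       legend = c("Observation"), col = c("blue"), pch = 19, horiz = TRUE)

# Restore the axes that axes = FALSE suppressed.
axis(1)
axis(2)
```

Each call draws on top of what is already there, so the grid and box appear over the points; reordering the calls changes only the layering.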
Figure 1: Scatterplot of Speed versus Distance
Figure 2: Scatterplot of Speed versus Distance
Figure 3: Time Series Graph
(a) Use the str() function to examine the structure of the dataset. Use the describe() function in the Hmisc package. Use des(), summ() and codebook() functions from the epiDisplay package and summary() function to visualize summaries of the 12 variables in the dataset.
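As a starting point for (a), the base R calls can be sketched as below. This assumes the dataset is survey from the MASS package, which is inferred from the data("survey") call and the variable names used later in this assignment. The Hmisc and epiDisplay functions are left out because they require those packages to be installed.

```r
# Assumption: the dataset is MASS::survey (12 variables).
library(MASS)
data("survey")

str(survey)             # variable types and the first few values
print(summary(survey))  # numeric summaries and factor level counts

# Several columns contain missing values; count them per variable.
print(colSums(is.na(survey)))
```

Once Hmisc and epiDisplay are loaded, describe(survey), des(survey), summ(survey) and codebook(survey) can be called on the same data frame.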
(b) Load the vcd and epiDisplay packages. Use the table(), prop.table() and tab1() functions to generate frequency tables describing the distribution of each of the following categorical variables: Sex, Exer and Smoke.
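A minimal base R sketch of the frequency tables in (b) follows, again assuming the data frame is MASS::survey. The names vars, counts and props are introduced here for illustration; tab1() from epiDisplay is not used, so the block runs without extra packages.

```r
library(MASS)
data("survey")

vars <- c("Sex", "Exer", "Smoke")

# Absolute frequencies for each variable (missing values are dropped by table()).
counts <- lapply(survey[vars], table)
# Relative frequencies derived from the counts.
props <- lapply(counts, prop.table)

print(counts)
print(lapply(props, round, 3))
```

With epiDisplay loaded, tab1(survey$Sex) and its counterparts for Exer and Smoke produce a comparable table together with a bar chart.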
(c) Produce contingency tables to explore the relationships between Sex and Exercise, Smoke and Exercise, and Smoke and Sex. Calculate Pearson’s Chi-squared test, or Fisher’s Exact Test if appropriate (if the expected value of at least one of the cells is ≤ 5). From the test of independence of these categorical variables, what would be your conclusion?
(d) Using the following code, write down the least squares regression equation describing the linear relationship between hand span and height, and calculate Pearson’s correlation coefficient. What do you notice about r^2 from the linear regression output and the correlation coefficient, r? Based on the Pearson’s correlation matrix in Figure 4, which continuous variables are highly correlated?
library(MASS)  # provides the survey dataset
data("survey")
ff <- lm(Height ~ Wr.Hnd, data = survey)
summary(ff)
# Calculation of Pearson's correlation coefficient.
cor(survey$Wr.Hnd, survey$Height, use = "complete.obs")
# This code was used to produce the correlation matrix.
library(psych)
dat0 <- survey[, c("Pulse", "Age", "Height", "NW.Hnd", "Wr.Hnd")]
pairs.panels(dat0)
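To compare r^2 from the regression output with the correlation coefficient in (d), the two quantities can be extracted and printed side by side. This is a sketch assuming the MASS::survey data; the names r and r_squared are introduced here for illustration.

```r
library(MASS)
data("survey")

# Simple linear regression of height on writing-hand span.
ff <- lm(Height ~ Wr.Hnd, data = survey)
print(coef(ff))  # intercept and slope of the least squares line

# Pearson's r on the rows where both variables are observed.
r <- cor(survey$Wr.Hnd, survey$Height, use = "complete.obs")

# Coefficient of determination reported by summary(ff).
r_squared <- summary(ff)$r.squared

print(c(r = r, r_squared_from_cor = r^2, r_squared_from_lm = r_squared))
```

Both lm() and cor() with use = "complete.obs" drop the rows with a missing value in either variable, so the two computations are based on the same observations.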
Figure 4: Pearson’s Pairwise Correlation Coefficients
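The coefficients displayed in a figure like Figure 4 can also be inspected as a plain matrix with cor(). The sketch below assumes the MASS::survey data and pairwise-complete observations; the 0.7 cutoff for "highly correlated" is an arbitrary choice made here, not a value given in the assignment.

```r
library(MASS)
data("survey")

dat0 <- survey[, c("Pulse", "Age", "Height", "NW.Hnd", "Wr.Hnd")]

# Pearson correlations, using all rows available for each pair of variables.
cor_mat <- cor(dat0, use = "pairwise.complete.obs")
print(round(cor_mat, 2))

# Pairs above the (assumed) cutoff, taken from the upper triangle only.
high <- which(abs(cor_mat) > 0.7 & upper.tri(cor_mat), arr.ind = TRUE)
print(data.frame(var1 = rownames(cor_mat)[high[, "row"]],
                 var2 = colnames(cor_mat)[high[, "col"]],
                 r = round(cor_mat[high], 2)))
```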



