STAT823 – Homework 7: Introduction to Statistical Inference

Introduction to Statistical Inference
STAT/BIOS 823 Homework 7
Directions
Using RMarkdown in RStudio, complete the following questions. Launch RStudio and open a new RMarkdown file, or use the class RMarkdown template provided, and save it in your working directory as a .Rmd file. At the end of the activity, save the PDF generated from RMarkdown + knitr and submit your homework on Blackboard.
If you have questions, please post them on the lesson discussion board.
All questions are mandatory except the One-Way ANOVA question, which is optional. Partial R code and output have been provided for some of the questions.
R code and output must be clearly shown.

1 Analyzing Data with a Categorical Outcome

1.1 Enter the data into R and reproduce the bar chart (Figure 1).
Table 1 shows a summary of the data (its frequency table).
(a) Enter the data into R in an expanded form (long form) such that you will have a dataset with 1300 rows, of which 630 represent the Yes category and 670 represent the No category. Hint: Use existing R functions such as rep(), factor(), data.frame() or write your own function.
(b) Explore the data and produce a frequency table by, for example, typing tab1(binge) or summ() using the epiDisplay package. Reproduce the bar chart below. Hint: tab1() will produce a bar chart for you with frequencies on the Y axis by default. You can have percentages instead of frequencies by typing tab1(x, bar.values = "percent").
Figure 1: Bar plot of binge drinking, titled "Binge Drinking on Campus: A Survey of N = 1300 Undergraduates".
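One possible way to build the long-form data described in 1.1(a) is sketched below using only base R. The object names drink, freq and pct are hypothetical, and barplot() stands in for tab1() so that the sketch runs without the epiDisplay package; it is a starting point, not the required solution.

```r
# Expand the frequency table (630 Yes, 670 No) into one row per student.
binge <- factor(rep(c("Yes", "No"), times = c(630, 670)),
                levels = c("Yes", "No"))
drink <- data.frame(binge = binge)  # 'drink' is a hypothetical name

# Frequency table and percentages, similar to what tab1() reports.
freq <- table(drink$binge)
pct <- round(100 * prop.table(freq), 1)
print(freq)
print(pct)

# Base-R bar chart with frequencies on the Y axis.
barplot(freq,
        ylab = "Frequency",
        main = "Binge Drinking on Campus:\nA Survey of N = 1300 Undergraduates")
```

Passing pct instead of freq to barplot() would put percentages on the Y axis.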
1.2 Chi-square Goodness of Fit Test
(a) To test if the proportion of Yes's is significantly different from 0.5, we can type the following code: prop.test(x = 630, n = 1300) or binom.test(). What is your conclusion based on the output?
(b) Perform a chi-square goodness of fit test to investigate the hypothesis that:
(i) there is no difference in proportion between the students who binge drink and those that do not. That is, test H0: p1 = p2 versus H1: p1 ≠ p2. What is your conclusion?
(ii) the proportion of undergraduates on campus who binge drink, π = P(Y = yes), is different from the average reported by NSDUH: H1: π ≠ 0.247. The chi-square goodness of fit test is testing whether the sample in hand appears to have been drawn from a population where the true rate of binge drinking is 0.247, or 24.7%. Note: we can verify the assumption that is based on sample size, π × n > 5 and (1 − π) × n > 5; that is, 0.247 × 1300 = 321.1 > 5 and (1 − 0.247) × 1300 = 978.9 > 5.
Perform both the chi-square test and the binomial test. What is your conclusion?
prop.test(x = 630, n = 1300, p = 0.247, alternative = "greater")
binom.test(x = 630, n = 1300, p = 0.247, alternative = "greater")
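As a rough sketch of the goodness of fit tests in 1.2(b), the observed counts can be passed to chisq.test() together with the hypothesized proportions. The names observed, gof_equal and gof_nsduh are hypothetical. Note that the chi-square goodness of fit test is non-directional, so it will not match the one-sided prop.test() and binom.test() calls exactly.

```r
# Observed counts from the survey of 1300 undergraduates.
observed <- c(Yes = 630, No = 670)

# (i) H0: both categories are equally likely (p1 = p2 = 0.5).
gof_equal <- chisq.test(observed, p = c(0.5, 0.5))

# (ii) H0: the true binge-drinking rate is the NSDUH value 0.247.
gof_nsduh <- chisq.test(observed, p = c(0.247, 1 - 0.247))

# Expected counts under (ii); both should exceed 5 for the
# chi-square approximation to be reasonable.
print(gof_nsduh$expected)

print(gof_equal)
print(gof_nsduh)
```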
1.3 Sales Data
The CEO of Company Z is interested in learning more about sales patterns for the chain's retail outlets across the United States. Tomorrow you have to let the CEO know whether the type of gear sold in stores is associated with geographic location. For each store, you have information on the store's geographic region (A, B, C, D) and its most popular type of sports gear sold (winter sports, summer sports, all-season sports; based on total sales volume) in the last three calendar years. The sales data is used to answer the questions that follow.

(a) Read in and examine the sales data. The summary table can be reproduced by typing
library(MASS)
(tabs <- xtabs(~Region + Sport, data = sales))
##       Sport
## Region  A  S  W
##      A  9  6 22
##      B 13 31  7
##      C  7 15  0
##      D 14 13 13
# or tabpct(Region, Sport, graph=FALSE)
(b) What is the distribution of most popular gear type within each region? To answer this question, use prop.table() (or the tabpct() function from the epiDisplay package) to generate a table of proportions and use mosaic() from the vcd package to generate a mosaic plot. Then describe what you find.
(c) … for the ith row, C is the total count for the jth column, and … size. Based on the output of the chisq.test() function, we should be able to check that we have sufficient numbers to use the chi-square test. Do we? What is your conclusion about the dependence of sales of sports gear on geographic region?
# chisq.test(tabs) # Chi-square test
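The sketch below shows one way to work through 1.3 without the raw sales file, by typing in the summary table printed above as a matrix. The names row_pct, res and manual_expected are hypothetical. It uses the standard expected-count formula for a test of independence (row total × column total / grand total) and leaves the mosaic plot to mosaic() from vcd.

```r
# Summary table of Region by Sport, entered from the printed xtabs() output.
tabs <- matrix(c( 9,  6, 22,
                 13, 31,  7,
                  7, 15,  0,
                 14, 13, 13),
               nrow = 4, byrow = TRUE,
               dimnames = list(Region = c("A", "B", "C", "D"),
                               Sport  = c("A", "S", "W")))

# Distribution of most popular gear type within each region (rows sum to 1).
row_pct <- prop.table(tabs, margin = 1)
print(round(row_pct, 2))

# Chi-square test of independence and its expected counts.
res <- chisq.test(tabs)
print(res)
print(round(res$expected, 2))

# The same expected counts computed by hand for comparison.
manual_expected <- outer(rowSums(tabs), colSums(tabs)) / sum(tabs)
print(round(manual_expected, 2))

# A common rule of thumb is that every expected count should be at least 5.
print(all(res$expected >= 5))
```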
2 Analysis of Continuous Outcome Data
2.1 One-Sample Tests
car <- c(19, 26, 24, 21, 24, 23, 26, 24, 23, 20, 21,
24, 18, 21, 20, 23, 24, 26, 25, 19, 24, 23, 27,
24, 26, 25, 20, 21, 19, 23)
(a) Perform a one-sample t-test to investigate the hypothesis that the average fuel … Generate the mean and standard deviation of mpg.
# t.test(car, mu=25)
(b) The t-test assumes approximate normality of the outcome. This assumption matters less for large samples, but since we're dealing with 30 observations we need to check. Use the "eyeball" method with a normal Q-Q plot. A normal Q-Q plot is a scatterplot of the observed quantiles (percentiles) of the data against the quantiles expected if the data follow a normal distribution. If normality is a reasonable assumption, the observed and expected quantiles should be similar and thus fall along a straight line. Do the data appear to follow the straight line? Then perform the Shapiro-Wilk test to assess normality. If the p-value is small (e.g., less than 0.05), there is evidence that the data depart from normality.
# t.test(car, alternative = "less", mu = 25)
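A minimal sketch of the one-sample workflow in 2.1 follows. It assumes the benchmark value of 25 and the "less" alternative taken from the commented code in the assignment; the object names tt and sw are hypothetical.

```r
# Fuel efficiency data from the assignment (30 observations).
car <- c(19, 26, 24, 21, 24, 23, 26, 24, 23, 20, 21,
         24, 18, 21, 20, 23, 24, 26, 25, 19, 24, 23, 27,
         24, 26, 25, 20, 21, 19, 23)

# Mean, standard deviation and sample size.
print(c(mean = mean(car), sd = sd(car), n = length(car)))

# One-sided one-sample t-test against mu = 25 (assumed from the starter code).
tt <- t.test(car, mu = 25, alternative = "less")
print(tt)

# Normality checks: Q-Q plot with reference line, then Shapiro-Wilk.
qqnorm(car)
qqline(car)
sw <- shapiro.test(car)
print(sw)
```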
2.2 Dependent Samples Tests
id <- c(1:10)
pre <- c(899.63, 913.51, 897.05, 889.18, 903.2, 916.06,
899.08, 892.75, 901.47, 902.63)
post <- c(899.53, 899.38, 879.25, 867.35, 897.97, 921.42,
895.52, 893.95, 889.44, 898.14)
datap <- data.frame(cbind(id, pre, post))
(a) Install and load the PairedData package and produce a profile plot showing the paired differences for each of the 10 swimmers. Hint: The first part of the code can be
paired.plotProfiles(datap, "pre", "post") + geom_line(color = "blue")

(b) Recall that the coach wants to investigate the hypothesis that the training program is effective in reducing swim times for male athletes in the 1500 freestyle; that is, whether or not swimmers' times decrease under the new program (i.e., H1: µpre − µpost > 0).
(i) Generate the means and standard deviations of swim time pre- and post-training and produce a boxplot showing the pre- and post-training distributions.
(ii) Compute the paired differences and generate the mean and standard deviation of the differences.
(iii) The assumption of the paired t-test is normality of the paired differences. Use a normal Q-Q plot and the Shapiro-Wilk test to verify this assumption.
(iv) Perform the t-test analysis for this paired data. What is your conclusion? Make sure you word it similar to: "There is sufficient evidence to conclude that the training program is effective at reducing swim times for Men's 1500 Freestyle (p = 0.03). The program, on average, decreased swim time by 7 seconds (95% CI on difference: µpre − µpost > 0)."
# Starting code
datap$diff <- pre - post
t.test(pre, post, paired = TRUE, alternative = "greater")
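The steps in 2.2(b) can be sketched in base R as follows. The names d, paired_fit and diff_fit are hypothetical. The last two calls illustrate that a paired t-test is the same as a one-sample t-test on the differences.

```r
# Swim times before and after training for the 10 swimmers.
pre  <- c(899.63, 913.51, 897.05, 889.18, 903.2, 916.06,
          899.08, 892.75, 901.47, 902.63)
post <- c(899.53, 899.38, 879.25, 867.35, 897.97, 921.42,
          895.52, 893.95, 889.44, 898.14)

# (i) Means, standard deviations and side-by-side boxplots.
print(c(mean_pre = mean(pre), sd_pre = sd(pre),
        mean_post = mean(post), sd_post = sd(post)))
boxplot(list(pre = pre, post = post), ylab = "Swim time")

# (ii) Paired differences (positive values mean the time decreased).
d <- pre - post
print(c(mean_diff = mean(d), sd_diff = sd(d)))

# (iii) Normality of the differences.
qqnorm(d)
qqline(d)
print(shapiro.test(d))

# (iv) One-sided paired t-test, and the equivalent test on the differences.
paired_fit <- t.test(pre, post, paired = TRUE, alternative = "greater")
diff_fit <- t.test(d, mu = 0, alternative = "greater")
print(paired_fit)
```

With alternative = "greater", the reported confidence interval is one-sided; rerunning without the alternative argument gives a conventional two-sided interval.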
2.3 Wilcoxon Signed-Rank Test
The Wilcoxon Signed-Rank test is a non-parametric test for comparing two dependent samples to assess whether their mean ranks differ. It's a less powerful alternative to the paired t-test that should be substituted when the normality assumption cannot be verified or met. Perform the paired one-sided test using the wilcox.test() function with the alternative = "greater" option. Does your conclusion differ from the paired t-test above?
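A short sketch of the signed-rank test on the swim data is given below; the object name wsr is hypothetical. The statistic V that wilcox.test() reports is the sum of the ranks of the positive differences.

```r
# Swim times before and after training (same data as in 2.2).
pre  <- c(899.63, 913.51, 897.05, 889.18, 903.2, 916.06,
          899.08, 892.75, 901.47, 902.63)
post <- c(899.53, 899.38, 879.25, 867.35, 897.97, 921.42,
          895.52, 893.95, 889.44, 898.14)

# One-sided paired Wilcoxon signed-rank test (H1: pre tends to exceed post).
wsr <- wilcox.test(pre, post, paired = TRUE, alternative = "greater")
print(wsr)

# Compare the p-value with the paired t-test from 2.2.
print(t.test(pre, post, paired = TRUE, alternative = "greater")$p.value)
```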
2.4 Optional Question: One-Way ANOVA
Six different insect sprays are in development to help combat infestation of crops. Each of the 6 sprays was applied on 12 different fields and the number of insects found dead in the field was recorded. Researchers are interested in finding any significant differences in effectiveness across the six sprays. Load the data:
data(InsectSprays)
Recall that researchers want to investigate whether any difference exists in the six insecticides under study. The hypothesis that we seek to reject is called the omnibus null: H0: µ1 = µ2 = ··· = µ6. The alternative to this is that at least one of the insecticides is different: H1: µi ≠ µj for some i ≠ j. That is, if we find a statistically significant result with ANOVA, it is interpreted as at least one of the means is different. It alone can't tell us how many or which are different.
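As an illustration of the omnibus test, the sketch below summarizes the counts by spray and fits a one-way ANOVA on the untransformed counts using base R only. The names group_means, group_sds and fit_raw are hypothetical, and boxplots are only one reasonable choice of graphic.

```r
data(InsectSprays)

# Mean and standard deviation of insect counts for each spray.
group_means <- tapply(InsectSprays$count, InsectSprays$spray, mean)
group_sds <- tapply(InsectSprays$count, InsectSprays$spray, sd)
print(round(cbind(mean = group_means, sd = group_sds), 2))

# Side-by-side boxplots of a continuous outcome across groups.
boxplot(count ~ spray, data = InsectSprays,
        xlab = "Spray", ylab = "Insect count")

# Omnibus one-way ANOVA: H0 is that all six group means are equal.
fit_raw <- aov(count ~ spray, data = InsectSprays)
print(summary(fit_raw))
```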
(a) Perform the one-way ANOVA to investigate the omnibus hypothesis. Graphically visualize the data. What’s the best graphic to visualize a continuous outcome across groups?
(b) Generate the mean and standard deviation of insect totals for each group using a function like tapply(), ddply() or aggregate.numeric().
(c) The assumptions of ANOVA are the same as the two-sample t-test: (1) normality of the outcome within each group and (2) equal group variances. To check (1), assess the residuals for normality after fitting the ANOVA model. To check (2), use leveneTest() from the car package. Is ANOVA appropriate for this data? ANOVA is robust to violations of normality (especially for large sample sizes) but a violation of equal variances can severely impact our ability to make accurate inferences. The leveneTest is a formal test investigating the hypothesis that at least one of the group variances is unequal (i.e., σi² ≠ σj² for some i ≠ j). If you observe p < 0.05, you have sufficient evidence to conclude that at least one of the variances is different and the assumption is violated:
library(car)
data(InsectSprays)
leveneTest(InsectSprays$count ~ InsectSprays$spray)
(d) Use log transformation to transform the response variable count. Does it get better?
attach(InsectSprays)
countlog <- log(count + 1)
leveneTest(countlog ~ spray)
(e) Fit the ANOVA with the log-transformed variable. Check for normality of the residuals. Do the residuals appear normally distributed?
(f) Since we have a significant difference somewhere, it is important that we both identify it and describe it. Post-hoc comparisons can be performed using many methods, but an easy one to start with is Tukey's Honestly Significant Difference (HSD) test. It performs pairwise comparisons of all groups to find where any statistically significant differences exist between groups. The output includes a table of all pairwise comparisons (e.g., 'B-A'), an estimate of the difference between the groups, and a confidence interval on the difference. Which groups appear to be different?
TukeyHSD(fit)
(g) Perform a non-parametric Kruskal-Wallis Test. The Kruskal-Wallis test is analogous to the Mann-Whitney test for two groups. It is a robust (non-parametric) alternative to the one-way ANOVA that eliminates the need for normality and equal variances, but it also provides a less powerful test of differences than ANOVA. Do the results of this test agree with the ANOVA from above?
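Parts (e) through (g) could be approached along the lines of the sketch below, which uses base R only. The names sprays, fit, tukey and kw are hypothetical; fit is assumed to be the ANOVA on the log-transformed counts, since the assignment's TukeyHSD(fit) call does not define it.

```r
data(InsectSprays)
sprays <- InsectSprays  # work on a copy; 'sprays' is a hypothetical name
sprays$countlog <- log(sprays$count + 1)

# (e) ANOVA on the log-transformed counts and residual normality checks.
fit <- aov(countlog ~ spray, data = sprays)
print(summary(fit))
qqnorm(residuals(fit))
qqline(residuals(fit))
print(shapiro.test(residuals(fit)))

# (f) Tukey HSD: all pairwise differences with adjusted confidence intervals.
tukey <- TukeyHSD(fit)
print(tukey)

# (g) Kruskal-Wallis test on the original counts.
kw <- kruskal.test(count ~ spray, data = sprays)
print(kw)
```

Pairs whose Tukey interval excludes zero are the ones that appear to differ; plot(tukey) shows the same intervals graphically.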
