PSTAT115 – Homework 1 Solved
Jessica Nguyen and Tristan Chen
Note: If you are working with a partner, please submit only one homework per group with both names and whether you are taking the course for graduate credit or not. Submit your Rmarkdown (.Rmd) and the compiled pdf on Gradescope in a zip file. Include any additional files (e.g. scanned handwritten solutions) in the zip file with the pdf.
Text Analysis of JK Rowling’s Harry Potter Series
Question 1
You are interested in studying the writing style and tone used by JK Rowling (JKR for short), the author of the popular Harry Potter series. You select a random sample of chapters of size n from all of JKR’s books. You are interested in the rate at which JKR uses the word dark in her writing, so you count how many times the word dark appears in each chapter in your sample, (y1,…,yn). In this set-up, yi is the number of times the word dark appeared in the i-th randomly sampled chapter. In this context, the population of interest is all chapters written by JKR and the population quantity of interest (the estimand) is the rate at which JKR uses the word dark. The sampling units are individual chapters. Note: this assignment is partially based on the text analysis package tidytext. You can read more about tidytext here.
1a.
Model: let Yi denote the quantity that captures the number of times the word dark appears in the i-th chapter. As a first approximation, it is reasonable to model the number of times dark appears in a given chapter using a Poisson distribution. Reminder: Poisson distributions are for integer outcomes and useful for events that occur independently and at a constant rate. Let’s assume that the quantities Y1,…Yn are independent and identically distributed (IID) according to a Poisson distribution with unknown parameter λ,
p(Yi = yi | λ) = Poisson(yi | λ) for i = 1,…,n.
Write the likelihood L(λ) for a generic sample of n chapters, (y1,…,yn). Simplify as much as possible (i.e. get rid of any multiplicative constants)
$$L(\lambda) = \prod_{i=1}^n \frac{\lambda^{y_i} e^{-\lambda}}{y_i!} = e^{-n\lambda} \prod_{i=1}^n \frac{\lambda^{y_i}}{y_i!} = \frac{e^{-n\lambda}\,\lambda^{\sum_{i=1}^n y_i}}{\prod_{i=1}^n y_i!}$$

$$L(\lambda) \propto e^{-n\lambda}\,\lambda^{\sum_{i=1}^n y_i}$$
1b.
Write the log-likelihood $\ell(\lambda)$ for a generic sample of n chapters, (y1,…,yn). Simplify as much as possible. Use this to compute the maximum likelihood estimate for the rate parameter of the Poisson distribution.
$$\ell(\lambda) = \ln L(\lambda) = \ln\!\left(e^{-n\lambda}\,\lambda^{\sum_{i=1}^n y_i}\right) = -n\lambda\ln(e) + \left(\sum_{i=1}^n y_i\right)\ln\lambda = \sum_{i=1}^n y_i \ln\lambda - n\lambda$$

Taking the derivative of $\ell(\lambda)$ and setting it equal to zero:

$$\ell'(\lambda) = \frac{\sum_{i=1}^n y_i}{\lambda} - n = 0 \quad\Longrightarrow\quad \hat\lambda_{MLE} = \frac{\sum_{i=1}^n y_i}{n} = \bar{y}$$
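As an optional sanity check (not part of the required solution), the closed-form result $\hat\lambda = \bar{y}$ can be confirmed numerically with R's `optimize`; the toy vector `y` below is a stand-in for any observed count data:

```r
# Toy count data standing in for dark_counts (any nonnegative integers work)
y <- c(3, 0, 5, 2, 4, 1, 6, 2)
n <- length(y)

# Log-likelihood up to an additive constant: sum(y)*log(lambda) - n*lambda
loglik <- function(lambda) sum(y) * log(lambda) - n * lambda

# The numerical maximizer should agree with the closed form ybar = mean(y)
opt <- optimize(loglik, interval = c(0.01, 20), maximum = TRUE)
c(numerical = opt$maximum, closed_form = mean(y))
```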
From now on, we’ll focus on JKR’s writing style in the last Harry Potter book, The Deathly Hallows. This book has 37 chapters. Below is the code for counting the number of times dark appears in each chapter of The Deathly Hallows. We use the tidytext R package which includes functions that parse large text files into word counts. The code below creates a vector of length 37 which has the number of times the word dark was used in that chapter (see https://uc-r.github.io/tidy_text for more on parsing text with tidytext).
library(tidyverse)   # data manipulation & plotting
library(stringr)     # text cleaning and regular expressions
library(tidytext)    # provides additional text mining functions
library(harrypotter) # text for the seven novels of the Harry Potter series
text_tb <- tibble(chapter = seq_along(deathly_hallows),
                  text = deathly_hallows)
tokens <- text_tb %>% unnest_tokens(word, text)
word_counts <- tokens %>%
  group_by(chapter) %>%
  count(word, sort = TRUE) %>%
  ungroup()
word_counts_mat <- word_counts %>% spread(key = word, value = n, fill = 0)
dark_counts <- word_counts_mat$dark
1c.
Make a bar plot where the heights are the counts of the word dark and the x-axis is the chapter.
# YOUR CODE HERE
ggplot(data = text_tb, aes(x = chapter, y = dark_counts)) +
  geom_bar(stat = "identity")

1d.
Plot the log-likelihood of the Poisson rate of dark usage in R using the data in dark_counts. Then use dark_counts to compute the maximum likelihood estimate of the rate of the usage of the word dark in The Deathly Hallows. Mark this maximum on the log-likelihood plot with a vertical line (use abline if you make the plot in base R or geom_vline if you prefer ggplot).
set.seed(123)
n <- 37
x <- dark_counts

# Log-likelihood up to an additive constant
L1 <- function(lambda) sum(x) * log(lambda) - n * lambda

p1 <- ggplot(data = data.frame(lambda = 0), mapping = aes(x = lambda)) +
  geom_function(fun = L1) +
  xlim(0, 50) +
  geom_vline(xintercept = mean(x))  # MLE is the sample mean
p1

Question 2
For the previous problem, when computing the rate of dark usage, we were implicitly assuming each chapter had the same length. Remember that for Yi ∼ Poisson(λ), E[Yi] = λ for each chapter; that is, the average number of occurrences of dark is the same in each chapter. Obviously this isn’t a great assumption, since the lengths of the chapters vary; longer chapters should be more likely to have more occurrences of the word. We can augment the model by considering properties of the Poisson distribution. The Poisson is often used to express the probability of a given number of events occurring for a fixed “exposure”. As a useful example of the role of the exposure term, when counting the number of events that happen in a set length of time, we need to account for the total time that we are observing events. For this text example, the exposure is not time, but rather corresponds to the total length of the chapter.
We will again let (y1,…,yn) represent counts of the word dark. In addition, we now count the total number of words in each chapter (ν1,…,νn) and use this as our exposure. Let Yi denote the random variable for the counts of the word dark in a chapter with νi words. Let’s assume that the quantities Y1,…,Yn are independently distributed (no longer identically, since the νi differ) according to a Poisson distribution with unknown parameter λ · νi/1000,

$$p(Y_i = y_i \mid \lambda, \nu_i) = \text{Poisson}\left(y_i \,\middle|\, \lambda \cdot \frac{\nu_i}{1000}\right) \quad \text{for } i = 1,\ldots,n.$$
In the code below, chapter_lengths is a vector storing the length of each chapter in words.
chapter_lengths <- word_counts %>%
  group_by(chapter) %>%
  summarize(chapter_length = sum(n)) %>%
  ungroup() %>%
  select(chapter_length) %>%
  unlist() %>%
  as.numeric()
2a.
What is the interpretation of the quantity νi/1000 in this model? What is the interpretation of λ in this model? State the units for these quantities in both of your answers.
The quantity νi/1000 is the exposure: the length of chapter i measured in units of 1000 words. λ is the expected number of occurrences of the word dark per 1000 words of text.
2b.
List the known and unknown variables and constants, as described in lecture 2. Make sure you include Y1,…,Yn, y1,…,yn, n, λ, and νi.
Known, Var > 0: Y1,…,Yn
Known, Var = 0: y1,…,yn, n, νi
Unknown, Var > 0: (none)
Unknown, Var = 0: λ
2c.
Write down the likelihood in this new model. Use this to calculate maximum likelihood estimator for λ. Your answer should include the νi’s.
$$L(\lambda) = \prod_{i=1}^n \frac{\left(\lambda \cdot \frac{\nu_i}{1000}\right)^{y_i} e^{-\lambda \cdot \frac{\nu_i}{1000}}}{y_i!} \propto \lambda^{\sum_{i=1}^n y_i}\, e^{-\frac{\lambda}{1000}\sum_{i=1}^n \nu_i}$$

$$\ell(\lambda) = \left(\sum_{i=1}^n y_i\right)\ln\lambda - \frac{\lambda}{1000}\sum_{i=1}^n \nu_i + \text{const}$$

Setting the derivative equal to zero:

$$\ell'(\lambda) = \frac{\sum_{i=1}^n y_i}{\lambda} - \sum_{i=1}^n \frac{\nu_i}{1000} = 0$$

$$\hat\lambda_{MLE} = \frac{\sum_{i=1}^n y_i}{\sum_{i=1}^n \nu_i / 1000} = \frac{1000 \sum_{i=1}^n y_i}{\sum_{i=1}^n \nu_i}$$
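As with 1b, the closed-form exposure-adjusted MLE can be checked numerically (an optional sketch; the toy vectors `y` and `nu` below stand in for the real counts and chapter lengths):

```r
# Toy counts and chapter lengths in words (stand-ins for the real data)
y  <- c(3, 0, 5, 2)
nu <- c(4000, 2500, 6000, 3500)

# Log-likelihood up to an additive constant
loglik <- function(lambda) sum(y) * log(lambda) - lambda * sum(nu) / 1000

opt <- optimize(loglik, interval = c(0.01, 10), maximum = TRUE)
closed_form <- sum(y) / (sum(nu) / 1000)  # = 1000 * sum(y) / sum(nu)
c(numerical = opt$maximum, closed_form = closed_form)
```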
2d.
Compute the maximum likelihood estimate and save it in the variable lambda_mle. In 1-2 sentences interpret its meaning (make sure you include units in your answers!).
lambda_mle <- sum(dark_counts) / sum(chapter_lengths / 1000)
lambda_mle
## [1] 0.9652801
. = ottr::check("tests/q2d.R")
The MLE is the estimated rate at which the word dark appears per 1000 words. The maximum likelihood estimate is 0.965, which says that for every 1000 words we expect, on average, about 0.965 occurrences of dark (roughly 1 occurrence per 1000 words).
2e.
Plot the log-likelihood from the previous question in R using the data on the frequency of dark and the chapter lengths. Add a vertical line at the value of lambda_mle to indicate the maximum likelihood.
# YOUR CODE HERE
n <- 37
x <- dark_counts

# Log-likelihood up to an additive constant
L2 <- function(lambda) sum(x) * log(lambda) - lambda * sum(chapter_lengths) / 1000

p2 <- ggplot(data = data.frame(lambda = 0), mapping = aes(x = lambda)) +
  geom_function(fun = L2) +
  xlim(0, 10) +
  geom_vline(xintercept = lambda_mle)
p2

Question 3
Correcting for chapter lengths is clearly an improvement, but we’re still assuming that JKR uses the word dark at the same rate in all chapters. In this problem we’ll explore this assumption in more detail.
3a.
Why might it be unreasonable to assume that the rate of dark usage is the same in all chapters? Comment in a few sentences.
It is unreasonable to assume that the rate of dark usage is the same in all chapters because not all chapters will be talking about the same things. One chapter could be talking about the relationships of the characters, which would hardly mention the word dark, and one other chapter could take place in the dark forest, which would have a lot of mentions of the word dark.
3b.
We can use simulation to check our Poisson model, and in particular the assumption that the rate of dark usage is the same in all chapters. Generate simulated counts of the word dark by sampling counts from a Poisson distribution with the rate (λ̂MLE·νi)/1000 for each chapter i, where λ̂MLE is the maximum likelihood estimate computed in 2d. Store the vector of these values for each chapter in a variable of length 37 called lambda_chapter. Make a side-by-side plot of the observed counts and simulated counts and note any similarities or differences (we’ve already created the observed histogram for you). Are there any outliers in the observed data that don’t seem to be reflected in the data simulated under our model?
observed_histogram <- ggplot(word_counts_mat) +
  geom_histogram(aes(x = dark)) +
  xlim(c(0, 25)) + ylim(c(0, 10)) +
  ggtitle("Observed")

lambda_chapter <- lambda_mle * chapter_lengths / 1000
simulated_counts <- tibble(dark = rpois(37, lambda_chapter))

simulated_histogram <- ggplot(simulated_counts) +
  geom_histogram(aes(x = dark)) +
  xlim(c(0, 25)) + ylim(c(0, 10)) +
  ggtitle("Simulated")

## This uses the patchwork library to put the two plots side by side
observed_histogram + simulated_histogram
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
(Figure: side-by-side histograms of dark counts per chapter, “Observed” vs. “Simulated”.)
. = ottr::check("tests/q3b.R")
##
## All tests passed!
Our simulated data does not capture the one outlier in the observed data, where a chapter had 23 instances of the word dark.
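One way to quantify how surprising that outlier is under the fitted model is to compute the probability that any of the 37 chapters reaches 23 counts. The per-chapter Poisson means `mu` below are hypothetical stand-ins for lambda_chapter (which are roughly in the 2–8 range for this book):

```r
set.seed(1)
# Hypothetical per-chapter Poisson means; in the actual analysis these would
# be lambda_mle * chapter_lengths / 1000
mu <- runif(37, min = 2, max = 8)

# Probability that at least one of 37 chapters has 23 or more counts
p_any_23 <- 1 - prod(ppois(22, mu))
p_any_23
```

Under means in this range the probability is tiny, which is consistent with the simulated histogram rarely showing a count near 23.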
3c. Assume the word usage rate varies by chapter, that is,

$$Y_i \sim \text{Poisson}\left(\lambda_i \cdot \frac{\nu_i}{1000}\right) \quad \text{for } i = 1,\ldots,n.$$
Compute a separate maximum likelihood estimate of the rate of dark usage (per 1000 words) in each chapter, λˆi. Make a bar plot of λˆi by chapter. Save the chapter-specific MLE in a vector of length 37 called lambda_hats. Which chapter has the highest rate of usage of the word dark? Save the chapter number in a variable called darkest_chapter.
# Maximum likelihood estimate
lambda_hats <- dark_counts / (chapter_lengths / 1000)
darkest_chapter <- which.max(lambda_hats)

# Make a bar plot of the MLEs, lambda_hats
ggplot(data = text_tb, aes(x = chapter, y = lambda_hats)) +
  geom_bar(stat = "identity")

. = ottr::check("tests/q3c.R")
##
## All tests passed!
darkest_chapter
## [1] 23
Chapter 23 has the highest usage of the word dark.
Question 4
Let’s go back to our original model for usage rates of the word dark. You collect a random sample of book chapters penned by JKR and count how many times she uses the word dark in each of the chapter in your sample, (y1,…,yn). In this set-up, yi is the number of times the word dark appeared in the i-th chapter, as before. However, we will no longer assume that the rate of use of the word dark is the same in every chapter. Rather, we’ll assume JKR uses the word dark at different rates λi in each chapter. Naturally, this makes sense, since different chapters have different themes and tone. To do this, we’ll further assume that the rate of word usage λi itself, is distributed according to a Gamma(α, β) with known parameters α and β,
$$p(\lambda_i \mid \alpha, \beta) = \text{Gamma}(\lambda_i \mid \alpha, \beta),$$
and that Yi ∼ Pois(λi) as in problem 1. For now we will ignore any exposure parameters, νi. Note: this is a
“warm up” to Bayesian inference, where it is standard to treat parameters as random variables and specify distributions for those parameters.
4a.
Write out the data generating process for the above model.
1. For each of the n chapters, generate the rate λi (rate of the word dark) from Gamma(10, 1).
2. Given λi, generate Yi (count of the word dark in chapter i) from Poisson(λi).
4b.
In R simulate 1000 values from the above data generating process, assuming α = 10 (shape parameter of rgamma) and β = 1 (rate parameter of rgamma). Store the values in a vector of length 1000 called counts. Compute the empirical mean and variance of the values you generated. For a Poisson distribution, the mean and the variance are the same. For the resulting distribution, is the variance greater than the mean (called “overdispersed”) or is the variance less than the mean (“underdispersed”)? Intuitively, why does this make sense?
## Store simulated data in a vector of length 1000
# YOUR CODE HERE
lambda <- rgamma(1000, shape = 10, rate = 1)
counts <- rpois(1000, lambda)
print(mean(counts))
## [1] 9.868
print(var(counts))
## [1] 18.6232
. = ottr::check("tests/q4b.R")
##
## All tests passed!
The variance is greater than the mean, so the distribution is overdispersed. This makes sense because the rate of use of the word dark is not the same for each chapter: the chapter-to-chapter variation in λ adds variability on top of the Poisson variation within each chapter.
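This intuition can be made precise with the law of total variance: Var(Y) = E[Var(Y|λ)] + Var(E[Y|λ]) = E[λ] + Var(λ) = α/β + α/β². With α = 10 and β = 1 this gives Var(Y) = 20 against E[Y] = 10, which a larger simulation (a quick sketch, not part of the required answer) reproduces closely:

```r
set.seed(42)
alpha <- 10; beta <- 1

# Simulate the Gamma-Poisson mixture with many draws for stable estimates
lambda <- rgamma(1e5, shape = alpha, rate = beta)
counts <- rpois(1e5, lambda)

# Theory: E[Y] = alpha/beta = 10, Var(Y) = alpha/beta + alpha/beta^2 = 20
c(mean = mean(counts), var = var(counts))
```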
4c.
List the known and unknown variables and constants as described in lecture 2. Make sure your table includes Y1,…,Yn, y1,…,yn, n, λ, α, and β.
Known, Var > 0: Y1,…,Yn
Known, Var = 0: α, β, y1,…,yn, n
Unknown, Var > 0: λ
Unknown, Var = 0: (none)
Extra Credit.
Compute $p(Y_i \mid \alpha, \beta) = \int p(Y_i, \lambda_i \mid \alpha, \beta)\,d\lambda_i$. Hint: use the fact that the gamma function is defined as $\Gamma(z) = \int_0^\infty x^{z-1} e^{-x}\,dx$; don’t try to do any integrals yourself. Look at your sheet of distributions and see if you can find a matching distribution.
$$p(Y_i = y \mid \alpha, \beta) = \int_0^\infty \frac{\lambda^{y} e^{-\lambda}}{y!} \cdot \frac{\beta^\alpha}{\Gamma(\alpha)} \lambda^{\alpha-1} e^{-\beta\lambda}\,d\lambda = \frac{\beta^\alpha}{y!\,\Gamma(\alpha)} \int_0^\infty \lambda^{y+\alpha-1} e^{-(\beta+1)\lambda}\,d\lambda$$

Recognizing a Gamma(y + α, β + 1) kernel in the integrand (or substituting x = (β + 1)λ and using the gamma function definition),

$$p(Y_i = y \mid \alpha, \beta) = \frac{\Gamma(y+\alpha)}{y!\,\Gamma(\alpha)} \left(\frac{\beta}{\beta+1}\right)^{\alpha} \left(\frac{1}{\beta+1}\right)^{y},$$

which is the probability mass function of a negative binomial distribution with size α and success probability β/(β+1).
Fill in the blank.
You just showed that a Gamma mixture of Poisson distributions is a negative binomial distribution.
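The identity can also be checked by simulation: draws from the Gamma–Poisson mixture should match direct draws from rnbinom with size = α and prob = β/(β+1). A quick sketch:

```r
set.seed(7)
alpha <- 10; beta <- 1
n <- 1e5

# Gamma-Poisson mixture: lambda ~ Gamma(alpha, beta), Y | lambda ~ Poisson(lambda)
mix <- rpois(n, rgamma(n, shape = alpha, rate = beta))

# Direct negative binomial draws with the matching parameterization
nb <- rnbinom(n, size = alpha, prob = beta / (beta + 1))

# Empirical moments of the two samples should agree closely
rbind(mixture = c(mean(mix), var(mix)), negbin = c(mean(nb), var(nb)))
```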
