Description
Bayesian Statistics and Data Analysis
Assignment 7
General information
The recommended tool in this course is R (with the IDE R-Studio). You can download R here and R-Studio here. There are many tutorials, videos and introductions to R and R-Studio online. You can nd some initial hints from RStudio Education pages.
When working with R, we recommend writing the report using R markdown and the provided R markdown template. The template includes the formatting instructions and how to include code and gures.
Instead of R markdown, you can use other software to make the PDF report, but you should use the same instructions for formatting. These instructions are also available in the PDF produced from the R markdown template.
We supply a Google Colab notebook that you can also use for the assignments. We have included the installation of all necessary R packages; hence, this can be an alternative to using your own local computer. You can nd the notebook here. You can also open the notebook in Colab here.
Report all results in a single and anonymous pdf. Note that no other formats are allowed.
The course has its own R package bsda with data and functionality to simplify coding. To install the package, just run the following (upgrade=”never” skips question about updating other packages):
1. install.packages(“remotes”)
2. remotes::install_github(“MansMeg/BSDA”, subdir = “rpackage”, upgrade=”never”)
Many of the exercises can be checked automatically using the R package markmyassignment. you can nd information on how to install and use the package here. There is no need to include markmyassignment results in the report.
You can nd common questions and answers regarding the installation and technical problems in Frequently Asked Questions (FAQ).
You can nd deadlines and information on how to turn in the assignments in Studium.
If you have any suggestions or improvements to the course material, please post in the course chat feedback channel, create an issue, or submit a pull request to the public repository here.
It is mandatory to include the following parts in all assignments (these are included already in the template):
1. Time used for reading: How long time took the reading assignment (in hours)
2. Time used for the assignment: How long time took the basic assignment (in hours)
3. Good with assignment: Write one-two sentences of what you liked with the assignment/what we should keep for next year.
4. Things to improve in the assignment: Write one-two sentences of what you think can be improved in the assignment. Can something be clari ed further? Did you get stuck on stu unrelated to the content of the assignment etc.
To pass (G) the assignment, you need 70% of the total points. To pass with distinction (VG), you need 90% of the total points. See the grading information on the point allocations for each assignment.
You are not allowed to show your assignments (text or code) to anyone. Only discuss the assignments with your fellow students. The student that show their assignment to anyone else could also be considered to cheat. Similarly, on zoom labs, only screen share when you are in a separate zoom room with teaching assistants.
Information on this assignment
This assignment is related to Chapter 5.
Reading instructions: Chapter 5 in BDA3, see reading instructions.
Reporting accuracy: For posterior statistics of interest, only report digits for which the Monte Carlo standard error (MCSE) is zero. Example: If you estimate E(µ) = 1.234 with MCSE(E(µ)) = 0.01, you should report E(µ) = 1.2.
When computing the Rˆ diagnostics, you only need to include two decimals.
Installing and using stan: To install Stan on your laptop, https://github.com/ stan-dev/rstan/wiki/RStan-Getting-Started. If you encounter problems, see additional answers in the FAQ. Recently there have been reports of installation problems with Windows and R 4.0 (see Stan discourse for more).
Installing and using CmdStanR: If you want to use Stan in R on local computer, it can be easier to install CmdStanR interface mc-stan.org/cmdstanr/.
General information on using stan: Additional useful packages are loo, bayesplot and shinystan. For Python users, PyStan, CmdStanPy, and Arviz packages are useful.
Stan manual can be found at https://mc-stan.org/users/documentation/. From this website, you can also nd a lot of other useful material about Stan.
R-studio enables you to automatically check the Stan syntax. Just open a stan le (with le extension .stan in R-studio). Then you can use the button “Check” to check the Stan syntax.
1. Linear model: drowning data with Stan
i) What is the trend of the number of people drowning per year? (We would plot the histogram of the slope of the linear model.)
To access the data, use:
library(bsda) data(“drowning”)
Corresponding Stan code is provided in Listing 1. However, it is not entirely correct for the problem. First, there are three mistakes. Second, there are no priors de ned for the parameters. In Stan, this corresponds to using uniform priors. Your tasks are the following:
a) Find the three mistakes in the code and x them. Report the original mistakes and your xes clearly in your report. Include the full corrected Stan code in your report.
b) Determine a suitable weakly-informative prior normal(0,σβ) for the slope beta. It is very unlikely that the mean number of drownings changes more than 50 % in one year. The approximate historical mean yearly number of drownings is 138. Hence, set σβ so that the following holds for the prior probability for beta: Pr(−69 < beta < 69) = 0.99. Determine suitable value for σβ and report the approximate numerical value for it.
c) Using the obtained σβ, add the desired prior in the Stan code. In the report, in a separate section, indicate clearly how you carried out your prior implementation, e.g. Added line …in block … .
d) In a similar way, add a proprt and weakly informative prior for the intercept alpha and explain how you chose the prior.
Hint! Example resulting plots for the problem, with the xes and the desired prior applied, are shown in Figure 1. If you want, you can use these plots as a reference for testing if your modi ed Stan code produces similar results. However, running the inference and comparing the plots is not required.
Listing 1: Broken Stan code for question 1
1 data {
2 int<lower=0> N; // number of data points
3 vector[N] x; // observation year
4 vector[N] y; // observation number of drowned
5 real xpred; // prediction year
6 }
7 parameters {
8 real alpha;
9 real beta;
10 real <upper=0> sigma;
11 }
12 transformed parameters {
13 vector[N] mu = alpha + beta*x;
14 }
15 model {
16 y ~ normal(mu, sigma)
17 }
18 generated quantities {
19 real ypred = normal_rng(mu, sigma);
20 }
2. Hierarchical model: factory data with Stan
Note! Assignment 8 build upon this part of the assignment, so it is important to get this assignment correct before you start with Assignment 8.
The factory data in the bsda package contains quality control measurements from 6 machines in a factory. In the data le, each column contains the measurements for a single machine. Quality control measurements are expensive and time-consuming, so only 5 measurements were done for each machine. In addition to the existing machines, we are interested in the quality of another machine (the seventh machine). To read in the data, just use:
library(bsda) data(“factory”)
For this problem, you’ll use the following three Gaussian models:
a separate model, in which each machine has its own model/parameters
a pooled model, in which all measurements are combined and there is no distinction between machines
a hierarchical model, which has a hierarchical structure as described in BDA3 Section 11.6.
As in the model described in the book, use the same measurement standard deviation σ for all the groups in the hierarchical model. In the separate model, however, use separate measurement standard deviation σj for each group j. You should use a N(100,100) for µ parameters and N+(0,50), i.e. truncated Normal priors for σ parameters in the model.
The provided Stan code in Listing 2 given on the next page is an example of the separate model (but with very strange results, why?).
Change the Stan code to a separate model that can be summarized mathematically as:
yij ∼ N(µj,σj) µj ∼ N(100,100) σj ∼ N+(0,50)
To run Stan for that model, simply use:
data(“factory”) sm <- rstan::stan_model(file = “[path to stan model code]”)
stan_data <- list( y = factory, N = nrow(factory),
J = ncol(factory)
)
model <- rstan::sampling(sm, data = stan_data) model
## Inference for Stan model: 5cbfa723dd8fb382e0b647b3943db079.
## 4 chains, each with iter=2000; warmup=1000; thin=1;
## post-warmup draws per chain=1000, total post-warmup draws=4000. ## mean se_mean sd 2.5% 25% 50% ## mu[1] 0.11 0.01 0.98 -1.81 -0.56 0.12 0.77
## mu[2] 0.10 0.01 1.00 -1.86 -0.56 0.10 0.79
## …
Note! These are not the results you would expect to turn in your report. You will need to change the code for the separate model code as well.
For each of the three models (separate, pooled, hierarchical), your tasks are the following:
a) Describe the model with mathematical notation (as is done for the separate model above). Also describe in words the di erence between the three models.
b) Implement the model in Stan with the above mentioned weak priors and include the code in the report. Note that you might get divergent transitions. You should only use inference when there is no divergent transitions. To solve the issue, you can either change adapt_delta or reparametrize the model (see how here). If you compute the posterior correct, you should get that σ1,separate ≈ 26.7, σhiearchical ≈ 15.2, and σpooled ≈ 18.8.
c) Using the models, with the speci ed weakly informative priors, report the expectation/mean (and 90% uncertainty intervals), comment on it and, if applicable, plot histograms for the following distributions:
i) the posterior distribution of the mean (µ) of the quality measurements of the sixth machine
ii) the predictive distribution p(y˜) for another quality measurement of the sixth machine.
iii) the posterior distribution of the mean (µ) of the quality measurements of a seventh machine (not in the data).
d) Visualize the posterior p(θ|y) and p(logτ|y) in the hiearchcial model, where τ is the hyperprior sd parameter and θ is hyperprior mean. Interpret these parameters.
e) Visualize the posterior p(µ3,logτ|y), where τ is the hyperprior sd parameter. Describe the distribution and reason about the ’funnel’ shape. What does it mean for the relationship between µ3 and logτ?
f) Now change the prior for the σ, residual parameter(s), to a Gamma(1,1). What happens? Why?
Listing 2: Stan code for a bad separate model
1 data {
2 int<lower=0> N;
3 int<lower=0> J;
4 vector[J] y[N];
5 }
6
7 parameters {
8 vector[J] mu;
9 vector <lower=0>[J] sigma;
10 }
11
12 model {
13 // priors
14 for (j in 1:J){
15 mu[j] ~ t(10, 10, 10);
16 sigma[j] ~ inv_chi_square (10);
17 }
18
19 // likelihood
20 for (j in 1:J)
21 y[,j] ~ normal(mu[j], sigma[j]);
22 }
23
24 generated quantities {
25 real ypred;
26 // Compute predictive distribution
27 // for the first machine
28 ypred = normal_rng(mu[1], sigma[1]);
29 }




Reviews
There are no reviews yet.