Worth 15 points
Read this first. A few things to bring to your attention:
1. Important: If you have not already done so, please request a Flux Hadoop account. Instructions for doing this can be found on Canvas.
2. Start early! If you run into trouble installing things or importing packages, it’s best to find those problems well in advance so we can help you.
3. Make sure you back up your work! I recommend, at a minimum, doing your work in a Dropbox folder or, better yet, using git.
4. A note on grading: overly complicated solutions or solutions that suggest an incomplete grasp of key concepts from lecture will not receive full credit.
1 Warmup: counting words with mrjob (3 points)
In this problem, you’ll get a gentle introduction to mrjob and running mrjob on the Fladoop cluster. I have uploaded a large text file to the Fladoop cluster. Your job is to count how many times each word occurs in this file.
1. Write an mrjob script that counts the number of times each word occurs in a text file, and include your script in your submission.
2. To test your code, I have uploaded a simple text file to the course webpage:
http://www-personal.umich.edu/~klevin/teaching/Winter2019/STATS507/simple.txt .
Download this file and test your code either on your local machine or on the Fladoop grid. The file is small enough that you should be able to check by hand whether your code is behaving correctly. Save the output of running your script on this small file to a file called simple_word_counts.txt and include it in your submission. Note: use the redirect arrow > to send the Hadoop output to a file. This will only send the stdout output to the file, while still printing the Hadoop error/status messages to the terminal.
3. Once you are confident in the correctness of your program, run your mrjob script on the file hdfs:/var/stats507w19/darwin.txt
on the Fladoop grid (this file is the Project Gutenberg plain-text version of Charles Darwin’s scientific work On the Origin of Species). Note that this file is on HDFS, not the local file system, so you’ll have to run your script accordingly. Save the output to a file called darwin_word_counts.txt, and include it in your submission.
4. Zipf’s law states, roughly, that if one plots word frequency against frequency rank (i.e., most frequent word, second most frequent word, etc.), the resulting points fall on an approximately straight line on a log-log scale. Using the information in darwin_word_counts.txt, make a plot of word frequency as a function of word rank on a log-log scale for all words in the file hdfs:/var/stats507w19/darwin.txt.
Give an appropriate title to your plot and include axis labels. Save the plot as a pdf file called zipf.pdf, and include it in your submission.
2 Computing Sample Statistics with mrjob (6 points)
In this problem, we’ll compile some very basic statistics summarizing a toy dataset. The file
http://www-personal.umich.edu/~klevin/teaching/Winter2019/STATS507/populations_small.txt
contains a collection of (class, value) pairs, one per line, with each line taking the form label,value, where label is a nonnegative integer and value is a float. Each pair corresponds to an observation, with the class labels corresponding to different populations and the values corresponding to some measured quantity.
1. Write an mrjob program called mr_summary_stats.py that takes as input a sequence of (label,value) pairs like those in the file at http://www-personal.umich.edu/~klevin/teaching/Winter2019/STATS507/populations_small.txt, and outputs a collection of (label, number of samples, mean, variance) 4-tuples, in which one 4-tuple appears for each class label in the data, and the mean and variance are the sample mean and variance, respectively, of all the values for that class label. Thus, if 25 unique class labels are present in the input, your program should output 25 lines, one for each class label. Note: I don’t care whether you use n or n−1 in the denominator of your sample variance formula; just be clear about which one you are using. Note: you don’t need to do any special formatting of the Hadoop output. That is, your output is fine if it consists of lines of the form label [number,mean,variance] or similar.
Think carefully about what your key-value pairs should be here, as well as what your mappers, reducers, etc. should be. Should there be more than one step in your job? Sit down with pen and paper first! Hint: to compute the sample mean and sample variance of a collection of numbers, it suffices to know their sum, the sum of their squares, and the size of the collection.
Please include a copy of mr_summary_stats.py in your submission.
2. Download the small file at http://www-personal.umich.edu/~klevin/teaching/Winter2019/STATS507/populations_small.txt. Run your mrjob script on this file, either on your local machine or on Fladoop, and write the output to a file called summary_small.txt. Please include this file in your submission. Inspect your program’s output and verify that it is behaving as expected.
3. I have uploaded to the Fladoop cluster a much larger data file, located on the HDFS file system at hdfs:/var/stats507w19/populations_large.txt. Once you are sure that your script is doing what you want, run it on this file. Be sure to use the -r hadoop flag to tell mrjob to run on the Hadoop cluster rather than on the login node. Save the output to a file called summary_large.txt. Download this file and include it in your submission. Please also include in your notebook file a copy-paste of your shell session on Fladoop in a markdown cell (i.e., a cell that displays your commands but does not execute them).
4. Use matplotlib and the results in summary_large.txt to create a plot displaying 95% confidence intervals for the sample means of the populations given by the class labels in the file hdfs:/var/stats507w19/populations_large.txt. You will probably want to make a boxplot for this, but feel free to get creative if you think you have a better way to display the information. Make sure your plot has a sensible title and axis labels. Save your plot as a pdf called populations.pdf and include it in your submission.
3 Graph Processing: Counting Triangles with PySpark (6 points)
In this problem, you will use PySpark to find triangles in a friendship network: triples of nodes in which each pair of nodes is connected. The network is described by a collection of files in which each line has the form
n f1 f2 … fK
indicating that node n is friends with each of the nodes f1, f2, …, fK.
Once again, before you dive in and write a bunch of code, sit down and think about the problem. What is the right “fundamental unit” of the problem? What should your keys and values look like? Hint: the simplest solution to this problem involves multiple steps: a standard map-reduce pattern followed by a filtering operation. As usual, overly complicated solutions will not receive full credit.
1. Write a PySpark job that takes the described input and produces a list of all the triangles in the network, one per line. Each triangle should be listed as a space-separated line node1 node2 node3, with the entries sorted numerically in ascending order. So, if nodes 2, 5, and 15 form a triangle, the output should include the line 2 5 15, but not 2 15 5, 15 2 5, etc. Save your script in a file called ps_fof.py and include it in your submission.
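One possible shape for ps_fof.py, following the map-then-filter pattern: build the set of undirected edges, generate every “wedge” (a pair of friends of a common node), and keep the wedges that are closed by an edge. The helper names and the overall structure are my own; only the final block touches Spark:

```python
def parse_line(line):
    """'n f1 f2 ... fK' -> (n, [f1, ..., fK])."""
    parts = [int(tok) for tok in line.split()]
    return parts[0], parts[1:]

def wedges(node_friends):
    """All pairs of friends of a node: candidate triangle 'wedges'."""
    node, friends = node_friends
    fs = sorted(set(friends))
    return [((fs[i], fs[j]), node)
            for i in range(len(fs)) for j in range(i + 1, len(fs))]

def find_triangles(lines_rdd):
    """lines_rdd: RDD of adjacency-list strings. Returns an RDD of sorted triples."""
    adj = lines_rdd.map(parse_line)
    # Canonical undirected edges (u < v), keyed for the join below.
    edges = adj.flatMap(
        lambda nf: [(tuple(sorted((nf[0], f))), None) for f in nf[1]]).distinct()
    # A wedge (u, n, v) is a triangle exactly when (u, v) is itself an edge.
    closed = adj.flatMap(wedges).join(edges)
    # Sort each triple so duplicates collapse under distinct().
    return closed.map(
        lambda kv: tuple(sorted((kv[0][0], kv[0][1], kv[1][0])))).distinct()

if __name__ == "__main__":
    from pyspark import SparkContext
    sc = SparkContext(appName="triangles")
    lines = sc.textFile("hdfs:/var/stats507w19/fof/friends.simple")
    for a, b, c in sorted(find_triangles(lines).collect()):
        print(a, b, c)
```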
2. Test your script on the set of 5 simple files in the HDFS directory
hdfs:/var/stats507w19/fof/friends.simple
which is small enough that you should be able to work out by hand what the correct output is. How many triangles are there? List them in a file called small_triangle_list.txt and include it in your submission.
3. Once you are confident that your script is correct, run it on the larger data set, stored on HDFS at hdfs:/var/stats507w19/fof/friends1000. Save the list of triangles to a file called big_triangle_list.txt, and include it in your submission. Don’t forget to include in your notebook file a copy-paste of the commands you used to launch your job along with their outputs.