Worth 10 points
Read this first. A few things to bring to your attention:
1. Important: If you have not already done so, please request a Flux Hadoop account. Instructions for doing this can be found on Canvas.
2. Start early! If you run into trouble installing things or importing packages, it’s best to find those problems well in advance so we can help you.
3. Make sure you back up your work! I recommend, at a minimum, doing your work in a Dropbox folder or, better yet, using git.
4. A note on grading: overly complicated solutions or solutions that suggest an incomplete grasp of key concepts from lecture will not receive full credit.
1 Warmup: constructing pandas objects (2 points)
In this problem, you will create two simple pandas objects.
1. Create a pandas Series object with indices given by the first 10 letters of the English alphabet and values given by the first 10 primes.
2. Below is a table that might arise in a genetics experiment. Reconstruct this as a pandas DataFrame.
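As a minimal sketch of the two constructors involved: the Series below follows the description in part 1, while the DataFrame uses invented column names and values (`gene`, `expression`, `mutant` are hypothetical and are not the table from part 2).

```python
import string

import pandas as pd

# First 10 letters of the alphabet as the index, first 10 primes as the values.
letters = list(string.ascii_lowercase[:10])
primes = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
prime_series = pd.Series(primes, index=letters)

# Hypothetical stand-in table: these column names and values are invented
# purely to illustrate building a DataFrame from a dict of columns.
toy_table = pd.DataFrame(
    {
        "gene": ["geneA", "geneB", "geneC"],
        "expression": [1.5, 0.2, 3.1],
        "mutant": [True, False, True],
    }
).set_index("gene")  # use one column as the row labels

print(prime_series)
print(toy_table)
```

Passing a dict of equal-length lists is one of several ways to build a DataFrame; a list of row dicts or a nested list with explicit `columns=` would work equally well.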
2 Working with pandas DataFrames (3 points)
In this problem, you’ll get practice working with pandas DataFrames, reading them into and out of memory, changing their contents and performing aggregation operations. For this problem, you’ll need to download the celebrated iris data set, available as a .csv file from my website: www-personal.umich.edu/~klevin/teaching/Winter2018/STATS701/iris.csv Note: for the sake of consistency, please use this version of the CSV, and not one from elsewhere.
1. Download the iris data set from the link above. Please include this file in your submission. Read iris.csv into Python as a pandas DataFrame. Note that the CSV file includes column headers. How many data points are there in this data set? What are the data types of the columns? What are the column names? The column names correspond to flower species names, as well as four basic measurements one can make of a flower: the width and length of its petals and the width and length of its sepals (the part of the plant that supports and protects the flower itself). How many species of flower are included in the data?
2. The data that I uploaded to my website, which you have downloaded, is based on the data initially uploaded to the UC Irvine machine learning repository. It is now known that this data contains errors in two of its rows (see the documentation at https://archive.ics.uci.edu/ml/datasets/Iris). Using 1-indexing, these errors are in the 35th and 38th rows. The 35th row should read 4.9,3.1,1.5,0.2,"setosa", where the fourth feature is incorrect as it appears in the file, and the 38th row should read 4.9,3.6,1.4,0.1,"setosa", where the second and third features are incorrect as they appear in the file. Correct these entries of your DataFrame.
3. The iris dataset is commonly used in machine learning as a proving ground for clustering and classification algorithms. Some researchers have found it useful to use two additional features, called Petal ratio and Sepal ratio, defined as the ratio of the petal length to petal width and the ratio of the sepal length to sepal width, respectively. Add two columns to your DataFrame corresponding to these two new features. Name these columns Petal.Ratio and Sepal.Ratio, respectively.
4. Save your corrected and extended iris DataFrame to a csv file called iris_corrected.csv. Please include this file in your submission.
5. Use a pandas aggregate operation to determine the mean, median, minimum, maximum and standard deviation of the petal and sepal ratio for each of the three species in the data set. Note: you should be able to get all of these numbers in a single table (indeed, in a single line of code) using a well-chosen group-by or aggregate operation.
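The operations in this problem can be sketched on a small in-memory table. This is an illustration under assumptions, not a solution: the six rows are toy values, the column names `Sepal.Length`, `Sepal.Width`, `Petal.Length`, `Petal.Width` and `Species` are a guess at the header of iris.csv (check `iris.columns` on the real file), and the single corrected entry is arbitrary rather than one of the rows named in part 2.

```python
import pandas as pd

# Toy stand-in for pd.read_csv("iris.csv"); column names are assumed.
iris = pd.DataFrame(
    {
        "Sepal.Length": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
        "Sepal.Width": [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
        "Petal.Length": [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
        "Petal.Width": [0.2, 0.1, 1.4, 1.5, 2.5, 1.9],
        "Species": ["setosa", "setosa", "versicolor",
                    "versicolor", "virginica", "virginica"],
    }
)

# Basic inspection: number of rows, column dtypes, distinct species.
print(len(iris), iris.dtypes, iris["Species"].nunique(), sep="\n")

# With the default integer index, row k in 1-indexing has label k - 1.
# Here the 2nd row (label 1) gets a new Petal.Width, purely as an example.
iris.loc[1, "Petal.Width"] = 0.2

# Derived features as element-wise ratios of existing columns.
iris["Petal.Ratio"] = iris["Petal.Length"] / iris["Petal.Width"]
iris["Sepal.Ratio"] = iris["Sepal.Length"] / iris["Sepal.Width"]

# One group-by plus one agg call yields every statistic in a single table.
summary = iris.groupby("Species")[["Petal.Ratio", "Sepal.Ratio"]].agg(
    ["mean", "median", "min", "max", "std"]
)
print(summary)

# to_csv with no path returns the CSV text; pass a filename to write a file.
csv_text = iris.to_csv(index=False)
```

The result of the `agg` call has one row per species and a two-level column index (feature, statistic), which is why all the numbers fit in one table.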
3 Plotting Dataframes: Major League Baseball (5 points)
3. The Skellam distribution (https://en.wikipedia.org/wiki/Skellam_distribution) is the distribution that results from taking the difference between two Poisson random variables. It is often suggested as a model for the difference between scores in sports games, particularly baseball. Add a new column to the data frame called score_diff, given by the home score minus the away score. Make a histogram of this score difference and give the plot an appropriate title.
4. Read the documentation about the scipy implementation of the Skellam distribution at https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.skellam.html. If $\lambda_H$ and $\lambda_V$ are the means of two independent Poisson random variables $K_H$ and $K_V$, respectively, then the Skellam distribution that describes the difference $K_H - K_V$ has parameters $\lambda_H$ and $\lambda_V$. We will assume (perhaps incorrectly) that the location parameter of the Skellam distribution is 0. To fit a Skellam distribution to the data, we will first fit Poisson distributions to the home and away teams.
Estimate parameters $\hat{\lambda}_H$ and $\hat{\lambda}_V$ as the means of the home and visitor scores, respectively. Use scipy to run a Kolmogorov-Smirnov test assessing whether or not the Skellam distribution with parameters $(\mu_1, \mu_2) = (\hat{\lambda}_H, \hat{\lambda}_V)$ and location parameter 0 is a good fit for the score differences. Hint: see the documentation at https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.kstest.html to see how to perform such a test. Is the Skellam distribution a reasonable model to use? What might we do to build a more accurate model?
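The fitting procedure described above can be sketched on synthetic data. Everything about the data here is an assumption: the scores are simulated Poisson draws rather than real game results, and the column names `home_score` and `away_score` are hypothetical (only `score_diff` comes from the problem statement). The sketch passes the Skellam CDF to `scipy.stats.kstest` as a callable.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in for the baseball data; rates and column names are invented.
rng = np.random.default_rng(0)
games = pd.DataFrame(
    {
        "home_score": rng.poisson(4.5, size=500),
        "away_score": rng.poisson(4.2, size=500),
    }
)

# Score difference: home score minus away score.
games["score_diff"] = games["home_score"] - games["away_score"]

# Estimate each Poisson mean by the corresponding sample mean.
lam_h = games["home_score"].mean()
lam_v = games["away_score"].mean()

# Kolmogorov-Smirnov test against Skellam(mu1=lam_h, mu2=lam_v) with loc=0.
result = stats.kstest(
    games["score_diff"].to_numpy(),
    lambda x: stats.skellam.cdf(x, lam_h, lam_v, loc=0),
)
print(result.statistic, result.pvalue)
```

Supplying the CDF as a callable keeps the fitted parameters explicit; `kstest` also accepts a distribution name together with an `args` tuple.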