Description
The deliverable is a Python script that will be called from the terminal and will take an output file path as input. Examples of how the script will be called are shown below:
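The original example invocations are not preserved in this copy. A minimal sketch of the command-line handling, with illustrative script and file names (nothing here is prescribed by the brief beyond the .csv/.json extensions):

```python
# Hypothetical invocations (script and file names are illustrative):
#   python pipeline.py results.csv
#   python pipeline.py results.json
import sys

def parse_output_path(argv):
    """Return the user-supplied output path, validating the extension."""
    if len(argv) != 2:
        raise SystemExit("usage: python pipeline.py <output.csv|output.json>")
    path = argv[1]
    if not path.lower().endswith((".csv", ".json")):
        raise SystemExit("output file must end in .csv or .json")
    return path

# Demo call with an explicit argv list; in the real script this would be sys.argv.
output_path = parse_output_path(["pipeline.py", "results.csv"])
```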
Note that the output type should be inferred automatically from the file path provided.
Provided files:
employers.csv: contains basic information about our current clients.
quotes.csv: contains information about quoting a new product to our existing clients. The column EmployerId will be the key to join to employers.csv.
product_premiums.txt: contains information about a client's current product mix and the premiums they are paying for each product. This file is not structured in a way that is directly readable by pandas.
Pipeline steps
Your pipeline should follow these steps:
1. Combine Employers and Quotes data
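A sketch of this step with toy stand-ins for the two CSVs (the non-key columns are illustrative; the join type for this step is an assumption, since only step 6 explicitly calls for an inner join):

```python
import pandas as pd

# Toy stand-ins for employers.csv and quotes.csv; EmployerId is the key
# named in the brief, the other columns are made up for illustration.
employers = pd.DataFrame({"EmployerId": [1, 2], "Name": ["Acme", "Globex"]})
quotes = pd.DataFrame({"EmployerId": [1, 2], "QuotedPremium": [1200, 3400]})

# Join the quote data onto the employer data on the shared key.
combined = pd.merge(employers, quotes, on="EmployerId", how="inner")
```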
2. Create a new column named ClientTenure
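The brief does not say how tenure is derived. One plausible reading, assuming employers.csv carries a date column — `ClientSince` below is a hypothetical name — is whole years since that date:

```python
import pandas as pd

# ClientSince is a hypothetical column name; substitute whatever date
# field employers.csv actually provides.
df = pd.DataFrame({"EmployerId": [1], "ClientSince": ["2018-06-01"]})
df["ClientSince"] = pd.to_datetime(df["ClientSince"])

# Approximate tenure in whole years from the start date to today.
df["ClientTenure"] = (pd.Timestamp.today() - df["ClientSince"]).dt.days // 365
```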
3. Create a column named SizeRange
This will be a categorical column based on the Employees column, which counts the number of employees a client has. The resulting values in the new column will be:
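The bucket boundaries and labels from the brief are not reproduced in this copy, so the bins below are purely illustrative; `pd.cut` is the usual tool for this kind of bucketing:

```python
import pandas as pd

df = pd.DataFrame({"Employees": [15, 180, 2500]})

# Hypothetical boundaries and labels -- replace with the ranges the
# brief actually specifies.
df["SizeRange"] = pd.cut(
    df["Employees"],
    bins=[0, 100, 1000, float("inf")],
    labels=["Small", "Medium", "Large"],
)
```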
4. Read product_premiums.txt
This file contains the products a client currently has and the annual premium associated with each product. The file structure is not natively readable by pandas, so you must write a parser to extract the information into a dataframe. The file is structured so that each section starts with EmployerId: 00000000, followed by a series of product and premium pairs such as Product1: 9233. Each EmployerId occurs only once. An employer can have duplicate products (one per location), but premiums will be unique. The resulting dataframe should look like this:
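A minimal parser sketch for the format as described above (the sample text and exact layout are assumptions — adjust to the real file):

```python
import pandas as pd

# Invented sample mimicking the described layout: an "EmployerId:" header
# followed by "Product: premium" pairs.
sample = """\
EmployerId: 00000001
Product1: 9233
Product2: 1200
EmployerId: 00000002
Product1: 500
Product1: 750
"""

def parse_premiums(text):
    rows, current_id = [], None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        key, _, value = line.partition(":")
        if key == "EmployerId":
            current_id = value.strip()   # start a new employer section
        else:
            rows.append({"EmployerId": current_id,
                         "Product": key.strip(),
                         "Premium": int(value)})
    return pd.DataFrame(rows)

premiums = parse_premiums(sample)
```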
5. Aggregate the premium data from step 4 to the EmployerId level
Create the following columns:
ProductList: a comma-separated string with the distinct products a client has, e.g. 'Product4, Product1, Product2'
NumUniqueProducts: the number of unique products a client has. In the above example it would be 3.
SumPremium: the sum of the premiums from all products.
The resulting aggregated dataframe should look like this:
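One way to produce the three columns with pandas named aggregation (the toy data is illustrative; note that SumPremium sums every row, including duplicate products, since their premiums are unique):

```python
import pandas as pd

premiums = pd.DataFrame({
    "EmployerId": ["1", "1", "1", "1"],
    "Product": ["Product4", "Product1", "Product1", "Product2"],
    "Premium": [100, 200, 250, 300],
})

agg = premiums.groupby("EmployerId").agg(
    # Distinct products, first-seen order, joined into one string.
    ProductList=("Product", lambda s: ", ".join(s.drop_duplicates())),
    NumUniqueProducts=("Product", "nunique"),
    SumPremium=("Premium", "sum"),
).reset_index()
```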
6. Join aggregated premium data to the dataset created in steps 1-3 using pandasql
Write SQL that performs an inner join on EmployerId. There will be a CASE statement, and the columns should be selected as follows:
Note that the aliases are not valid on their own; they just indicate which dataframe each column should be pulled from. To use pandasql:
7. Write the final DataFrame to the output file specified by the user
The user-specified file path is read from the command line. A user can specify a .csv file or a .json file, but will provide no explicit instruction beyond the file extension.