I would like a Python module that will allow me to scrape the weekly reports from this website: [url removed, login to view]
An example of a specific report:
[url removed, login to view]
The module should be able to scrape a range of these weekly reports into a single pandas DataFrame. The row index should be a MultiIndex of area name and date; the remaining fields should be columns.
Basic examples of the input and the resulting output are attached.
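A minimal sketch of the output shape described above, assuming each weekly report has already been parsed into its own frame. The column names `area_name`, `date`, and `value`, the sample rows, and the helper name `combine_weeks` are hypothetical placeholders, not taken from the actual reports.

```python
import pandas as pd


def combine_weeks(weekly_frames):
    """Stack per-week frames into one frame indexed by (area_name, date)."""
    # sort=False keeps column order stable even if weeks differ in columns.
    combined = pd.concat(weekly_frames, ignore_index=True, sort=False)
    return combined.set_index(["area_name", "date"]).sort_index()


# Hypothetical sample data: the second week lists fewer areas than the first.
week1 = pd.DataFrame(
    {
        "area_name": ["North", "South"],
        "date": pd.Timestamp("2012-01-16"),
        "value": [1.0, 2.0],
    }
)
week2 = pd.DataFrame(
    {
        "area_name": ["North"],
        "date": pd.Timestamp("2012-01-23"),
        "value": [3.0],
    }
)

df = combine_weeks([week1, week2])
```

Because the frames are stacked row-wise, a week that lists a different set of areas simply contributes fewer or more rows and needs no special handling at this step.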
I would like to be able to use the module in the following ways (where ‘dfg’ is the module you will write):
>>> df = [url removed, login to view](start_year=2012, start_week=3) # returns a dataframe with all data starting from 2012, week 3 through present.
>>> df = [url removed, login to view]() # returns a dataframe with all published data, i.e., the default start year can be 1999 and the default start week 1.
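One way the defaults could drive the scrape is a generator of (year, week) pairs that the module then fetches one at a time. This is only a sketch: the name `iter_weeks`, the `end_year` parameter, and the `weeks_in_year` callback are hypothetical, and the default of 52 weeks is a placeholder, since the real number of weeks varies by year.

```python
from datetime import date
from typing import Callable, Iterator, Optional, Tuple


def iter_weeks(
    start_year: int = 1999,
    start_week: int = 1,
    end_year: Optional[int] = None,
    weeks_in_year: Callable[[int], int] = lambda year: 52,  # placeholder
) -> Iterator[Tuple[int, int]]:
    """Yield (year, week) pairs from the start point through end_year."""
    if end_year is None:
        # Assumption: "through present" means up to the current year.
        end_year = date.today().year
    for year in range(start_year, end_year + 1):
        # Only the first year honours start_week; later years start at week 1.
        first_week = start_week if year == start_year else 1
        for week in range(first_week, weeks_in_year(year) + 1):
            yield year, week
```

For example, `list(iter_weeks(2012, 3, end_year=2012, weeks_in_year=lambda y: 5))` gives the pairs for weeks 3, 4, and 5 of 2012. In practice the week count per year would have to be discovered from the site rather than hard-coded.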
Note that there are some inconsistencies with the data, including but not limited to:
* Different weeks may have a slightly different set of areas listed
* The URL pattern is different for 2013 vs. other years.
* Different years may have a different number of weeks.
* The data is listed under the year that the season started (e.g., January 2010 is listed under 2009). The dates recorded in the data frame should be the true calendar date.
* Some basic attempts should be made to ensure that area_names are parsed/grouped correctly. For example, leading and trailing whitespace should be removed and capitalization should be made consistent.
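The last two points can be sketched with the standard library alone. The function names are hypothetical, and the season start month (July here) is purely an assumption; the only fact given above is that January 2010 is listed under 2009.

```python
import string


def normalize_area_name(raw: str) -> str:
    """Trim, collapse internal whitespace, and capitalize each word."""
    # string.capwords splits on any whitespace, capitalizes each word,
    # and rejoins with single spaces, which also strips both ends.
    return string.capwords(raw)


def calendar_year(season_year: int, month: int, season_start_month: int = 7) -> int:
    """Map a month listed under a season year to its true calendar year.

    ASSUMPTION: a season starts in `season_start_month` and runs into the
    following calendar year. The real start month must be checked on the site.
    """
    if month >= season_start_month:
        return season_year
    return season_year + 1
```

With these, `normalize_area_name("  north  COAST ")` and `normalize_area_name("North Coast")` produce the same key, and `calendar_year(2009, 1)` maps January of the 2009 season to 2010.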