HTML scraping job
We have a small, self-contained screen-scraping job which needs to be written in Python, preferably developed on Unix (although you could probably get away with developing on Windows using Cygwin, provided you avoid anything terribly Windows-specific). To ease the pain of this work, we've been using [Beautiful Soup] to do the heavy lifting of parsing the HTML, so some experience with it would be helpful; you are quite welcome to implement the parsing with or without this library as you choose. It's a scraper with a finite shelf life, so the code doesn't need to be particularly beautiful, just functional. The initial request is for a scraper for just one site, although there's the possibility of repeat work doing the same thing on other sites (and the second one is likely to be easier for you but would pay the same).
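As a taste of the Beautiful Soup work involved, a sketch along these lines pulls prices out of an odds table. The table markup, class names, and sample fixture are invented for illustration; the real site's HTML will differ.

```python
# Sketch of odds extraction with Beautiful Soup. The markup below is
# hypothetical -- the real site's structure must be inspected first.
from bs4 import BeautifulSoup

SAMPLE_HTML = """
<table class="odds">
  <tr><td class="event">Arsenal v Chelsea</td>
      <td class="home">2.10</td><td class="draw">3.25</td><td class="away">3.40</td></tr>
</table>
"""

def parse_odds(html):
    """Return a list of (event, home_price, draw_price, away_price) tuples."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.select("table.odds tr"):
        event = tr.find("td", class_="event")
        if event is None:
            continue  # skip header or spacer rows
        prices = tuple(float(tr.find("td", class_=c).get_text())
                       for c in ("home", "draw", "away"))
        rows.append((event.get_text(strip=True),) + prices)
    return rows
```

The point of centralising the parsing in one function like this is that when the site's markup changes, only the selectors need updating.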
The application which will use this scraping module is an odds comparison system for fixed odds football (soccer) betting. The bid is purely for the scraper required to interface with it, and not the comparison application itself.
Existing plugins we've written to handle similar sites are approximately 350 lines of Python code; that should give you a rough estimate of the complexity of the task.
See the deliverables section for the exact requirements.
1. The scraper must subclass a Python base class to be supplied.
2. The scraper must implement a method to log into the site with a username and password supplied as parameters. The scraper must store cookies sent to it by the host web site in such a way that subsequent HTTP requests can be made as if the browser were logged in.
3. The scraper must implement a method to log out of the site on demand.
4. The scraper must implement a method to pull details of all games and all available bets, along with the odds available for each bet type, from the host web site. The site typically displays this information on one or two web pages (today's games and future games). To avoid any potential feature creep, for the purposes of this bid we will not expect you to scrape more than four pages to implement this method. The details for each bet should be returned in a list in the format expected by the caller. This format will be fully documented, but each bet must report a unique identifier to distinguish it from all others on the system; something like a (date, team_name, bet_type) tuple would suffice.
5. The scraper must implement a method, called on demand, which follows the hyperlink for each selection and opens the betslip page. The information to be scraped from the betslip is: the maximum permitted bet; the price displayed on the betslip (which may differ from that displayed on the main page due to race conditions); and the bet type, determined by an independent piece of code from the information available on the betslip, as a check against errors on the main page. You must gracefully handle the case where a selection is temporarily suspended by throwing an appropriate exception (as documented).
6. The scraper must implement a method to fetch the contents of the "ticker" displayed on the host site, in the documented format. Typically this is just a span or a frame on the page which displays plain unformatted text about changes of sporting venue and how these affect the offers being quoted.
7. The scraper must implement a method to make a simple HTTP request as a proxy (in a logged-in state) and return the HTML without processing.
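A minimal sketch of the expected shape of the scraper, covering requirements 1, 2, 3, and 7, follows. The base class name (`BaseScraper`), method names, URLs, and form field names are all assumptions; the real ones come from the supplied base class and its documentation.

```python
# Skeleton only -- BaseScraper, the URLs, and the form fields are
# hypothetical stand-ins for what the supplied base class defines.
import http.cookiejar
import urllib.parse
import urllib.request


class BaseScraper:
    """Stand-in for the supplied base class."""


class SiteScraper(BaseScraper):
    def __init__(self):
        # A shared cookie jar lets every subsequent request reuse the
        # session cookies set at login (requirement 2).
        self.jar = http.cookiejar.CookieJar()
        self.opener = urllib.request.build_opener(
            urllib.request.HTTPCookieProcessor(self.jar))

    def login(self, username, password):
        # Hypothetical login URL and form field names.
        data = urllib.parse.urlencode(
            {"username": username, "password": password}).encode()
        self.opener.open("https://example.com/login", data)

    def logout(self):
        # Requirement 3: log out of the site on demand.
        self.opener.open("https://example.com/logout")

    def fetch_raw(self, url):
        # Requirement 7: proxy a request in the logged-in state and
        # return the HTML without processing.
        with self.opener.open(url) as response:
            return response.read().decode("utf-8", "replace")
```

Routing every request through the one cookie-aware opener is what keeps later calls (bets, betslips, ticker) behaving as if the browser were still logged in.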
The deliverable is a Python script which subclasses the supplied class. It must correctly implement the methods above and pass a test script verifying that the data is returned in the correct format.
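The test script is ours to supply, but a format check along these lines illustrates what "correct format" means here. The tuple layout and the set of allowed bet types are assumptions; the real schema is in our documentation.

```python
# Hypothetical format check; the real schema comes from the documentation.
ALLOWED_BET_TYPES = {"home", "draw", "away"}  # assumed set of bet types

def check_bet(bet):
    """Assert one scraped bet matches the assumed (date, team, type, price) layout."""
    date, team_name, bet_type, price = bet
    assert isinstance(date, str) and date
    assert isinstance(team_name, str) and team_name
    assert bet_type in ALLOWED_BET_TYPES
    assert isinstance(price, float) and price >= 1.0
    return True
```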
Obviously you cannot be held liable for any changes the target site makes to its HTML after acceptance, but equally we need to verify that your script handles all cases in normal operation. Consequently, acceptance of (and payment for) the software will occur after it has performed correctly without errors for three days. It must return the correct values in all cases where data is available, and gracefully handle any unexpected responses from the host site.
We have in-house developers ourselves, so there's no need to waste time producing installer scripts or user documentation. We require only the Python source code.
The Python script must run under Linux. Any external libraries or modules required by your code must be open source and packaged within the final deliverable.
You will transfer all ownership of the source code to us on delivery, and you may not redistribute any part of the deliverables to any other party. All deliverables will be considered "work made for hire" under U.S. Copyright law. Buyer will receive exclusive and complete copyrights to all work purchased. (No GPL, GNU, 3rd party components, etc. unless all copyright ramifications are explained AND AGREED TO by the buyer on the site per the coder's Seller Legal Agreement).
Python/Linux. You can develop on Windows as long as your code is cross-platform enough to run on Linux.