We are looking for a web scraper that extracts information from the following website: <[url removed, login to view]>

Searching by ZIP code produces a listing of Agents. We would like to extract records for all Agents in California (ZIP codes 90000-96162). We will provide a listing of ZIP codes for your use, if needed.

The following fields should be extracted for each record:

First Name
Last Name
Title(s) (LUTCF, CPCU, CLU, etc.)
Address
City
State
Zip
Phone (10-digit number only)
Fax (10-digit number only)
E-mail address (CAVEAT: for the e-mail field, it is necessary to follow the "Profile" or "Agent's Web Site" link for each record)

Data should be extracted into a comma-delimited file to be imported into MS Excel.

Before offering a bid on this project, please be aware of the following:

1. Some ZIP codes will not return any records.
2. The website will return a maximum of 10 records for each ZIP code query. However, we need to extract **all** records. This means it may be necessary to run the query multiple times for each ZIP code that returns 10 records. The ideal solution would run multiple queries, compare each query against the previous queries, and reject duplicates. Then, if all records are duplicates, move on to the next ZIP code. We are always open to better suggestions.
3. Please be careful in parsing the First Name, Last Name, and Titles. We can provide you with examples if needed.
4. This website appears to have occasional timeout problems. Please make your program robust to timeouts.
5. Your program should include start, stop, and/or resume functions so the program can resume where it left off at a later time/date.
6. Please incorporate threading appropriate for 1.5+ Mbps bandwidth.
7. A command-line interface or simple Windows interface is OK.

Thanks.
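To illustrate the repeated-query approach described in item 2, here is a minimal sketch. It is not a required design. It is written in Python purely for illustration and assumes that repeating a query for the same ZIP code can return a different set of 10 records. The names `fetch_page`, `record_key`, `collect_zip`, and `normalize_phone`, the record layout (a dict with `first`, `last`, and `phone` keys), and the limits `stale_rounds` and `max_queries` are all hypothetical.

```python
import re

PAGE_CAP = 10  # the site returns at most 10 records per query (from the posting)


def normalize_phone(raw):
    """Return the 10 digits found in raw, or '' if there are not exactly 10."""
    digits = re.sub(r"\D", "", raw)
    # Assumption: a leading country code "1" may appear and should be dropped.
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    return digits if len(digits) == 10 else ""


def record_key(rec):
    """Hypothetical duplicate key: case-insensitive name plus normalized phone."""
    return (
        rec.get("first", "").strip().lower(),
        rec.get("last", "").strip().lower(),
        normalize_phone(rec.get("phone", "")),
    )


def collect_zip(zip_code, fetch_page, stale_rounds=1, max_queries=50):
    """Query one ZIP code repeatedly until a query yields only duplicates.

    fetch_page(zip_code) is a hypothetical callable that performs one query
    and returns a list of record dicts (possibly empty).
    """
    seen = {}
    stale = 0
    queries = 0
    while stale < stale_rounds and queries < max_queries:
        records = fetch_page(zip_code)
        queries += 1
        new_count = 0
        for rec in records:
            key = record_key(rec)
            if key not in seen:
                seen[key] = rec
                new_count += 1
        if len(records) < PAGE_CAP:
            break  # fewer than the cap: nothing is hidden behind the limit
        # Count consecutive queries that produced no new record.
        stale = stale + 1 if new_count == 0 else 0
    return list(seen.values())
```

With `stale_rounds=1` the loop moves on as soon as one query returns only duplicates, as the posting describes. A higher value trades extra queries for more confidence that no Agent was missed, and `max_queries` keeps a ZIP code from being queried indefinitely.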
1) Complete and fully-functional working program(s) in executable form as well as complete source code of all work done.
2) Deliverables must be in ready-to-run condition, as follows (depending on the nature of the deliverables):
a) For web sites or other server-side deliverables intended to only ever exist in one place in the Buyer's environment--Deliverables must be installed by the Seller in ready-to-run condition in the Buyer's environment.
3) All deliverables will be considered "work made for hire" under U.S. Copyright law. Buyer will receive exclusive and complete copyrights to all work purchased. (No GPL, GNU, 3rd party components, etc. unless all copyright ramifications are explained AND AGREED TO by the buyer on the site per the coder's Seller Legal Agreement).
Windows XP SP2 (Perl 5.8.3, Ruby 1.8.2 ok).