I require a web crawler to extract web-based contact information about businesses, including Name, Website URL, Address, Phone, Mobile, Fax Number and business specialty if one is present. The crawler must also be able to accommodate multiple addresses and multiple contact phone and fax numbers for one business.
The primary contact sites to crawl through are [url removed, login to view] and www.yellowpages.com.au.
The extracted data will be checked against an existing database for accuracy.
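To make the "multiple addresses and numbers per business" requirement concrete, here is a minimal sketch of how one extracted record could be shaped. The class name `BusinessRecord` and all of its field names are assumptions made for illustration only; the actual fields should follow whatever each website displays.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class BusinessRecord:
    """One business listing (hypothetical layout, not a fixed schema)."""
    name: str
    website_url: Optional[str] = None   # absent on many listings
    specialty: Optional[str] = None     # only if one is present
    # Lists, because one business may have several of each.
    addresses: List[str] = field(default_factory=list)
    phones: List[str] = field(default_factory=list)
    mobiles: List[str] = field(default_factory=list)
    faxes: List[str] = field(default_factory=list)


# Made-up example data showing one business with two addresses.
record = BusinessRecord(name="Example Plumbing")
record.addresses.append("1 Example St, Sydney NSW 2000")
record.addresses.append("2 Sample Rd, Parramatta NSW 2150")
record.phones.append("(02) 9000 0000")
```

Keeping addresses and numbers as lists means a record can be written out as repeated XML elements or as rows in a child table in MySQL or Postgres without changing the crawler.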
1. I must be able to set the starting URL from which the spider will begin on each website. The format of the data on each website should be examined closely before commencing, as multiple data fields are displayed when the information is present.
2. The spider should contain its own database of products, professions and services that it can use as the basis for initiating searches. Data is to be extracted into XML or ASCII format and then imported directly into a MySQL or Postgres database.
3. The spider must crawl through multiple pages until the final page for that category is completed. However, at the very beginning of most categories, there are businesses listed under the "Yellow Pages - Advertisers" heading. These are businesses that are not from the area I have chosen but are advertising in that area. I do not want these entries included. The spider does not necessarily need to know how my list was created, only to avoid entries under the "Advertisers" section.
4. When a search is completed, an update function should let me choose a new profession name to search for and initiate the search.
5. A search and purge function that can be run at any time on any of the database files that have been created, to ensure no two entries have the same telephone or fax number. If duplicate telephone or fax numbers are found, the records with the least information are automatically deleted. For example, if 2 records have the same telephone or fax number but one lists a website and the other doesn't, then delete the one without the website.
6. I require that this program be functional for both websites and that the system can rerun the searches to capture updated information after, say, 4-5 months.
7. Finally, the crawler must function despite any anti-crawler or anti search / DOS protection (if any) being run by the site administrators.
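The search and purge rule in item 5 could be sketched roughly as follows. This is only an illustration under stated assumptions: records are plain dictionaries with hypothetical keys (`name`, `website`, `phones`, `faxes`), "least information" is read as "fewest non-empty fields", and a number is treated as a duplicate whether it appears as a phone or as a fax. The helper names are made up for this example.

```python
import re
from typing import Dict, List


def normalize_number(number: str) -> str:
    """Strip everything except digits so '(02) 9000 0000' matches '02 9000 0000'."""
    return re.sub(r"\D", "", number)


def completeness(record: Dict[str, object]) -> int:
    """Count non-empty fields (assumed measure of 'most information')."""
    return sum(1 for value in record.values() if value)


def purge_duplicates(records: List[Dict[str, object]]) -> List[Dict[str, object]]:
    """Keep the most complete record for each shared phone or fax number."""
    kept = []
    seen = set()
    # Visit the most complete records first so they claim their numbers.
    for record in sorted(records, key=completeness, reverse=True):
        raw = list(record.get("phones", [])) + list(record.get("faxes", []))
        numbers = {normalize_number(n) for n in raw}
        numbers.discard("")
        if numbers & seen:
            continue  # shares a number with a more complete record: drop it
        seen |= numbers
        kept.append(record)
    return kept


# Made-up example: same phone number, only one record lists a website.
with_site = {"name": "Acme", "website": "http://acme.example",
             "phones": ["(02) 9000 0000"], "faxes": []}
without_site = {"name": "Acme", "website": "",
                "phones": ["02 9000 0000"], "faxes": []}
result = purge_duplicates([without_site, with_site])
```

In this example only the record that lists a website survives. In the real program the same rule would be run against the MySQL or Postgres tables rather than an in-memory list.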
1. You must be easy to contact, either by phone or by answering any e-mail I send to you within 10 hours.
2. Must speak and write English well.
3. Code must be well commented in English.
4. All source code must be given to me.
5. I would prefer this to be written in Java, Perl or Python, but XML is also OK.
6. I would like this done by no later than November 21st, 2007.
7. Must be able to run on my Windows XP machine or hosted in a USA data centre. Data usage is not an issue.