I need a PHP application that does the following:
Access a specified HTML page consisting of 100 numbered pieces of data (each linked to a separate page) and
extract from the top-level page:
a1) data matching 5 specific endings
a2) data matching two other patterns (each of the 100 pieces of data that is a single word only, and each that is a group of exactly two words)
extract from sub pages:
follow each of the 100 links on the top-level page, one link deep only (these links are dynamic and potentially different every time the top-level page is accessed), and extract
a3) data matching 5 specific endings (same as above)
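To make the retrieval step concrete, here is a minimal sketch of pulling the items and their links out of a page and testing an item against a list of endings. It assumes each piece of data is the text of an `<a>` element, which the brief does not actually state, and the function names `extract_items` and `has_ending` are made up for illustration. The real endings would be passed in by the caller.

```php
<?php
// Sketch only: assumes each of the 100 items is the text of an <a> element.
// The actual page structure is not described in the brief.

/**
 * Return a list of ['text' => ..., 'href' => ...] for every link in $html.
 */
function extract_items(string $html): array
{
    $doc = new DOMDocument();
    // Collect parser warnings internally; real-world HTML is rarely valid.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $items = [];
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//a[@href]') as $a) {
        $items[] = [
            'text' => trim($a->textContent),
            'href' => $a->getAttribute('href'),
        ];
    }
    return $items;
}

/**
 * True if $text ends with any of the given endings (case-insensitive).
 */
function has_ending(string $text, array $endings): bool
{
    foreach ($endings as $ending) {
        $len = strlen($ending);
        if ($len > 0 && strcasecmp(substr($text, -$len), $ending) === 0) {
            return true;
        }
    }
    return false;
}
```

The same `has_ending` check would serve both a1) on the top-level page and a3) on the sub pages. Relative `href` values would still need to be resolved against the top-level URL before they are fetched.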
B) Data Manipulation:
Raw data matching the a2) patterns above needs to be manipulated: remove the spaces and append one specific ending.
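As a rough illustration of the a2) manipulation, the sketch below treats a "word" as a run of non-whitespace characters, which is my assumption. The ending `.example` and the name `normalize_a2` are placeholders; the real ending is not given here.

```php
<?php
// Hypothetical placeholder: the brief does not say which ending to append.
const A2_ENDING = '.example';

/**
 * Return the manipulated a2 value, or null if $item is not a
 * one-word or two-word item.
 */
function normalize_a2(string $item, string $ending = A2_ENDING): ?string
{
    // Split on runs of whitespace; PREG_SPLIT_NO_EMPTY drops the empty
    // pieces that leading or trailing whitespace would otherwise produce.
    $words = preg_split('/\s+/', $item, -1, PREG_SPLIT_NO_EMPTY);
    $count = count($words);
    if ($count !== 1 && $count !== 2) {
        return null;
    }
    // Joining the words with no separator removes the spaces.
    return implode('', $words) . $ending;
}
```

For example, `normalize_a2('hello world')` gives `helloworld.example`, while a three-word item gives `null` and is left out of the a2 set.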
Store the data in a database with the date of first retrieval (duplicates should not be stored), plus an extra attribute marking data that has been manipulated (a2 with the added ending).
Create/update two daily .txt files with the data retrieved that day: one for a1+a3 combined, one for a2.
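One possible way to meet the storage and daily-file requirements above is sketched here with PDO. The table name, column names, file naming scheme and function names are all hypothetical. It relies on a UNIQUE constraint and treats SQLSTATE 23000 (integrity constraint violation) as "already stored", so the first retrieval date is never overwritten.

```php
<?php
// Assumed schema; table and column names are hypothetical. The UNIQUE
// constraint on `value` is what keeps duplicates out.
const ITEMS_SCHEMA = 'CREATE TABLE IF NOT EXISTS items (
    value       VARCHAR(255) NOT NULL UNIQUE,
    source      VARCHAR(8)   NOT NULL,
    manipulated TINYINT      NOT NULL DEFAULT 0,
    first_seen  DATE         NOT NULL
)';

/**
 * Insert a value with its first-retrieval date.
 * Returns true if the row is new, false if the value was already stored.
 * Expects $db to use PDO::ERRMODE_EXCEPTION.
 */
function store_item(PDO $db, string $value, string $source, bool $manipulated, string $date): bool
{
    $stmt = $db->prepare(
        'INSERT INTO items (value, source, manipulated, first_seen) VALUES (?, ?, ?, ?)'
    );
    try {
        $stmt->execute([$value, $source, $manipulated ? 1 : 0, $date]);
        return true;
    } catch (PDOException $e) {
        // SQLSTATE 23000: the UNIQUE constraint rejected a duplicate.
        if ((string) $e->getCode() === '23000') {
            return false;
        }
        throw $e;
    }
}

/**
 * Append one value to the daily file for a group ('a1a3' or 'a2').
 */
function append_daily(string $dir, string $group, string $date, string $value): void
{
    $path = $dir . '/' . $group . '-' . $date . '.txt';
    // FILE_APPEND creates the file if needed; LOCK_EX guards overlapping runs.
    file_put_contents($path, $value . PHP_EOL, FILE_APPEND | LOCK_EX);
}
```

Whether the daily files should list only values first seen that day, or everything retrieved that day, is not settled by the brief. Calling `append_daily` only when `store_item` returns true would give the first reading.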
* A simple web interface to create data output by date range and type (a1+a3 and/or a2)
* Script should run every X minutes/hours (cron job)
* Possibility to specify a list of proxies (with optional username/password auth), which the script will cycle through for web access. It must be able to skip non-responding proxies, and use no proxy if the list is empty.
* Development/Testing on your own server, complete installation on my server when finished (CentOS / WHM / Cpanel)
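To illustrate the proxy requirement in the list above, here is a minimal sketch using PHP's cURL extension. The proxy array shape (`host`, `port`, optional `user` and `pass`), the timeouts and the function names are my assumptions. The fetcher is passed in as a callable so the rotation logic can be exercised without network access.

```php
<?php
/**
 * Try each proxy in turn, starting at index $start, and return the first
 * response body. Returns null if every proxy fails. An empty list means
 * a direct connection.
 */
function fetch_with_proxies(string $url, array $proxies, int $start = 0, ?callable $fetch = null): ?string
{
    $fetch = $fetch ?? 'curl_fetch';
    if ($proxies === []) {
        return $fetch($url, null); // no proxies configured: connect directly
    }
    $n = count($proxies);
    for ($i = 0; $i < $n; $i++) {
        $proxy = $proxies[($start + $i) % $n];
        $body = $fetch($url, $proxy);
        if ($body !== null) {
            return $body;
        }
        // null means this proxy did not respond; move on to the next one.
    }
    return null;
}

/**
 * Fetch $url with cURL, optionally through one proxy.
 * Returns null on any transfer error, including timeouts.
 */
function curl_fetch(string $url, ?array $proxy): ?string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10); // assumed limits, in seconds
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);
    if ($proxy !== null) {
        curl_setopt($ch, CURLOPT_PROXY, $proxy['host'] . ':' . $proxy['port']);
        if (!empty($proxy['user'])) {
            curl_setopt($ch, CURLOPT_PROXYUSERPWD, $proxy['user'] . ':' . ($proxy['pass'] ?? ''));
        }
    }
    $body = curl_exec($ch);
    return $body === false ? null : $body;
}
```

Passing a running request counter as `$start` would spread successive requests across the list. What should happen when every proxy fails (give up, or fall back to a direct connection) is not specified; this sketch gives up and returns `null`.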
I was thinking of PHP/cURL/MySQL as I am familiar with these, but feel free to suggest other approaches if you know of far superior ones.
Thanks for looking :)