I need a script or application that can scan 12 million web pages that I specify, look for a specific string in each page's HTML source, and identify which pages contain the string and which do not.
I have 12 text files with 1,000,000 URLs in each file (12 million URLs total). I need a script or application that can visit each of the 12 million URLs, look for a specific word in the page's HTML source, and provide me with two lists: one containing the URLs where the word was found, and another containing the URLs that do NOT contain the word I specify. Out of 12 million, maybe 8 million will contain the keyword and 4 million will not.

I have a strong Windows 2008 server with a 100 Mbps unlimited connection, OR we can do this with PHP and a cron job on a Linux server, which I can obtain. PLEASE tell me how you plan to build this application.

I would also like the script to run fast, with at least 200 threads. Remember, it's just loading the HTML source, not the images or scripts on the page. It should also be able to auto-resume if it crashes or stops in the middle, without having to start from 1 again.

PLEASE ANSWER THESE QUESTIONS IN YOUR BID OR I WILL IGNORE IT COMPLETELY:
1) How fast can you get me a fully functional program that will not freeze up and have to be constantly restarted?
2) Will you build a Windows Server 2008 compatible application, or use PHP/MySQL?
3) How will your script handle errors such as page timeouts, URLs that fail to load, 404 errors, 500 errors, etc.?
4) Approximately how long do you estimate it will take to scan all 12 million pages?

I need this completed ASAP, so please keep this in mind.

1) All deliverables will be considered "work made for hire" under U.S. Copyright law. Employer will receive exclusive and complete copyrights to all work purchased. (No GPL, GNU, 3rd party components, etc. unless all copyright ramifications are explained AND AGREED TO by the employer on the site per the worker's Worker Legal Agreement).
2) Complete and fully-functional working program(s) in executable form as well as complete source code of all work done.
3) Deliverables must be in ready-to-run condition, as follows (depending on the nature of the deliverables):
   a) For web sites or other server-side deliverables intended to only ever exist in one place in the Employer's environment: deliverables must be installed by the Worker in ready-to-run condition in the Employer's environment.
   b) For all others, including desktop software or software the Employer intends to distribute: a software installation package that will install the software in ready-to-run condition on the platform(s) specified in this project.
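
To make the scan-time question (question 4) more concrete, here is a back-of-the-envelope sketch in Python. Only the 12 million pages, the 100 Mbps link, and the 200 threads come from this posting. The average page size (50 KB) and the average time per request (2 seconds) are assumptions picked for illustration, not measurements, and the function name `estimate_hours` is hypothetical. The scan can go no faster than the slower of the two bounds it returns.

```python
def estimate_hours(pages, avg_page_bytes, link_mbps, threads, avg_request_seconds):
    """Return (bandwidth_bound_hours, latency_bound_hours) for a full scan."""
    # Link capacity in bytes per second (1 Mbps = 1,000,000 bits per second).
    bytes_per_second = link_mbps * 1_000_000 / 8
    # Time needed if the network link is the only bottleneck.
    bandwidth_hours = pages * avg_page_bytes / bytes_per_second / 3600
    # Time needed if each thread spends avg_request_seconds per page.
    latency_hours = pages * avg_request_seconds / threads / 3600
    return bandwidth_hours, latency_hours


if __name__ == "__main__":
    # 50 KB per page and 2 s per request are ASSUMED values, not measurements.
    bw, lat = estimate_hours(12_000_000, 50_000, 100, 200, 2.0)
    print(f"bandwidth-bound: {bw:.1f} h, latency-bound: {lat:.1f} h")
```

Under these assumed numbers the link alone would need roughly 13 hours and 200 threads would need roughly 33 hours, so response time rather than bandwidth would be the limit. Real figures depend on the target sites and should be measured on a sample of the URLs.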
Windows Server 2008 OR PHP/MySQL
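
As an illustration of the behaviour described in this posting (two output lists, many threads, error handling, and auto-resume), a minimal sketch could look like the following. It is written in Python purely as an assumption for illustration; the posting itself asks for a Windows Server 2008 application or PHP/MySQL. All names (`fetch_html`, `classify`, `load_done`, `scan_file`), the output file naming, and the tab-separated error format are hypothetical. Resume works by re-reading the output files on start-up and skipping every URL already recorded there.

```python
import os
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def fetch_html(url, timeout=15):
    """Download only the HTML source; images and scripts are never requested."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def classify(url, keyword, fetch):
    """Return (url, 'found' | 'not_found' | 'error', detail)."""
    try:
        html = fetch(url)
    except Exception as exc:  # timeouts, unreachable hosts, HTTP 404/500, ...
        return url, "error", type(exc).__name__
    return url, ("found" if keyword in html else "not_found"), ""


def load_done(paths):
    """Collect every URL already written to an output file (for resume)."""
    done = set()
    for path in paths:
        if os.path.exists(path):
            with open(path, encoding="utf-8") as fh:
                done.update(line.split("\t", 1)[0].strip() for line in fh)
    return done


def scan_file(input_path, out_dir, keyword, fetch=fetch_html, threads=200, batch=10000):
    """Scan one URL list; append results to <stem>.found/.not_found/.error.txt."""
    stem = os.path.splitext(os.path.basename(input_path))[0]
    out = {k: os.path.join(out_dir, f"{stem}.{k}.txt")
           for k in ("found", "not_found", "error")}
    done = load_done(out.values())
    with open(input_path, encoding="utf-8") as fh:
        todo = [u for u in (line.strip() for line in fh) if u and u not in done]
    handles = {k: open(p, "a", encoding="utf-8") for k, p in out.items()}
    try:
        with ThreadPoolExecutor(max_workers=threads) as pool:
            # Work in batches so progress is flushed to disk regularly.
            for start in range(0, len(todo), batch):
                chunk = todo[start:start + batch]
                results = pool.map(lambda u: classify(u, keyword, fetch), chunk)
                for url, status, detail in results:
                    line = f"{url}\t{detail}\n" if detail else f"{url}\n"
                    handles[status].write(line)
                for h in handles.values():
                    h.flush()
    finally:
        for h in handles.values():
            h.close()
    return out
```

A call such as `scan_file("urls01.txt", "results", "my keyword")` would process one of the 12 input files (the file and folder names here are placeholders). In this sketch, URLs that ended up in the error file are treated as done on resume; deleting that file before restarting would cause them to be retried. A crash can lose at most the current batch, which is then rescanned on the next run.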