Greetings all,

This is a project for a web page spider. It is not for crawling whole websites, specific domains, or building a meta search engine. The sole purpose of this project is to have a spider visit specific web pages listed in a table in the SOURCE database, gather all links and other specified material VERY QUICKLY, and return them to a table in the HARVEST database that directly correlates with the SOURCE table.

More specifically, I need a spidering/harvesting solution able to collect all instances of link names and URLs from literally thousands of specified web pages VERY quickly. I also need the option to have it harvest the visible page text (not HTML code) that occurs between any given HTML tags (e.g. headline, title, span, paragraph, div, etc.). I will also need at least 1 database programmed to store all harvested links, as well as 2 web pages: one to administer the harvested links, and a second to display the end result to the public. The details of this project are outlined in the DELIVERABLES portion of this project.

Before reading any further, please note that I am looking for someone who can best demonstrate an ability to provide specifically what I need, quickly and cost-effectively. If you can show me a solution that provides everything I need, then the competition will end there. I am not concerned whether or not you adapt a pre-existing Open Source spider or a previous project. It simply must not cost me anything beyond your bid, it must run on Debian Woody Linux, it must come with explicit, detailed instructions, it must be fast and easy to use, and of course it must be reasonably priced.
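To make the "harvest the visible page text between given tags" requirement concrete, here is a minimal sketch using Python's standard-library html.parser. The tag names and sample HTML are illustrative only; the actual tag list would be configurable per the spec.

```python
from html.parser import HTMLParser

class TagTextExtractor(HTMLParser):
    """Collects the visible text inside a chosen set of tags (e.g. title, h1, p)."""
    def __init__(self, wanted_tags):
        super().__init__()
        self.wanted = set(wanted_tags)
        self.depth = 0          # > 0 while inside a wanted tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.wanted:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.wanted and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        # Only text nodes inside a wanted tag are kept; markup is discarded.
        if self.depth and data.strip():
            self.chunks.append(data.strip())

html = ("<html><head><title>News</title></head>"
        "<body><h1>Headline</h1><p>Story text.</p><div>skip</div></body></html>")
p = TagTextExtractor({"title", "h1", "p"})
p.feed(html)
print(p.chunks)   # → ['News', 'Headline', 'Story text.']
```

Because the div tag is not in the wanted set, its text is dropped, which matches the requirement of harvesting only text "between any given HTML tags" rather than the whole page.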
SPIDER

1. The spider is provided a MySQL database consisting of 3000 starting URLs to check, broken up into tables representing subject categories.
2. Optionally provide a base domain name (e.g. [url removed, login to view]) that must be contained in all spidered page(s).
3. Select the data to harvest (e.g. URL AND link names AND/OR title AND/OR headline tags AND/OR paragraph tags, etc.).
4. Spider the pages: not the whole site, just the specified page.
5. Harvest the link URLs, titles, and the other specified data from the specified page.
6. While harvesting links from a page listed in the MySQL SOURCES DB, the spider should obey the following parameters:
   * Require that the harvested links be from the same domain as the source.
   * Require that the harvested URL NOT include specific characters for the page to be spidered. For example, if I don't want to spider a page's forum links, I would like to be able to exclude links with "forum" in the name or the URL.
   * Not harvest links/data (write them to the DB) if they include "x", such as 'Forum' or 'Advertisement'.
   * Remove duplicate entries from the harvested data list.
7. Send the data to a corresponding 'HARVESTED' table in a new database.

DB

In placing the harvested links into the 'HARVESTED' DB, I would like the spider to use the following parameters:
   * All harvested links will be placed in a table that corresponds to the source page table. I.e.: we have 1000 "SOURCES" tables, with 10 URLs in each. When the spider begins harvesting, it will start in table SOURCE-1 and check all the links in that table. When it has spidered all the pages listed there, it will move on to table SOURCE-2, and so on until table SOURCE-1000. The links harvested from the source links in the SOURCE tables should then be placed in a corresponding HARVESTED table.
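The harvesting rules in step 6 (same-domain restriction, substring exclusion, deduplication) can be sketched as follows. This is an illustration only, using Python's standard library; the page HTML, domain, and exclusion words are example values, and a production version would fetch pages over HTTP and read its filters from the SOURCES DB.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkHarvester(HTMLParser):
    """Collects (url, link text) pairs from one page."""
    def __init__(self, base_url):
        super().__init__()
        self.base = base_url
        self.links = []        # list of (absolute url, anchor text)
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = urljoin(self.base, href)  # resolve relative links
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

def filter_links(links, base_url, excluded=("forum", "advertisement")):
    """Apply the step-6 rules: same domain as the source page, no excluded
    substrings in the URL or link name, and no duplicate URLs."""
    domain = urlparse(base_url).netloc
    seen, kept = set(), []
    for url, text in links:
        if urlparse(url).netloc != domain:
            continue                       # different domain: skip
        blob = (url + " " + text).lower()
        if any(bad in blob for bad in excluded):
            continue                       # excluded word in URL or name
        if url in seen:
            continue                       # duplicate within this harvest
        seen.add(url)
        kept.append((url, text))
    return kept

page = ('<a href="/news/1">Top story</a>'
        '<a href="/forum/2">Forum post</a>'
        '<a href="http://other.example.org/x">Elsewhere</a>'
        '<a href="/news/1">Top story again</a>')
h = LinkHarvester("http://www.example.com/index.html")
h.feed(page)
print(filter_links(h.links, "http://www.example.com/"))
# → [('http://www.example.com/news/1', 'Top story')]
```

Of the four links on the sample page, only the first survives: the forum link is excluded by substring, the off-site link fails the domain check, and the repeated URL is deduplicated.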
The result will be that new links found using source pages from SOURCE-1 will be placed in HARVESTED-1, new links from SOURCE-2 will be placed in HARVESTED-2, and so on, all the way to SOURCE-1000 and HARVESTED-1000.
   * All URLs in the HARVESTED tables must be unique. A URL may appear in other HARVESTED tables, but it cannot occur twice in the same table.
   * Entries in the HARVESTED tables should have the following attributes: SOURCE, URL, TITLE, BODY, TIME_DATE, SECURITY. (Values are explained below.)

WEB PAGES

8. Provide an administrator page where someone can view and edit harvested links. To accomplish this, I think all HARVESTED links should have 1 of 3 SECURITY settings: 0-PUBLIC-CHECKED, 1-PRIVATE-CHECKED, 2-UNCHECKED. This way, only the administrator could view and edit the newly HARVESTED links. Additionally, people on the web could view the links in any 'HARVESTED' table without having to worry about a lot of irrelevant material. This would be ensured by having a PHP web page for the public that essentially says: SELECT * FROM 'HARVESTED-100' WHERE SECURITY=0 AND TIME_DATE <= 60; (The public should be able to see all links added to the HARVESTED DB GREATER THAN 24 hrs AND LESS THAN 48 hrs ago, etc.) To edit the links with a value of 2-UNCHECKED, I need to be able to change as many values as quickly as possible. As such, it would be good to have an administrator page with a text box containing the contents of the harvested 'HEADLINE' or 'TITLE' tag, so the administrator can quickly edit (add to or delete) the title for the public. It would also be nice to be able to use radio buttons or check boxes to select all HARVESTED links en masse and assign them either a PUBLIC-CHECKED or PRIVATE-CHECKED value at once.

OTHER

9) Complete and fully functional working program(s) in executable form, as well as complete source code of all work done.
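The HARVESTED table schema and the public page's query can be sketched as below. This uses SQLite in place of MySQL purely for a self-contained illustration; the table name, sample rows, and the exact 24-48 hour window are assumptions, and the column names follow the attributes listed above (SOURCE, URL, TITLE, BODY, TIME_DATE, SECURITY).

```python
import sqlite3
import time

# In-memory stand-in for one HARVESTED table (real project: MySQL).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE harvested_1 (
        source    TEXT,
        url       TEXT UNIQUE,   -- no duplicate URLs within one table
        title     TEXT,
        body      TEXT,
        time_date INTEGER,       -- Unix timestamp of harvest
        security  INTEGER        -- 0 public-checked, 1 private-checked, 2 unchecked
    )""")

now = int(time.time())
rows = [
    ("src-a", "http://www.example.com/1", "Fresh public", "", now - 30 * 3600, 0),
    ("src-a", "http://www.example.com/2", "Unchecked",    "", now - 30 * 3600, 2),
    ("src-a", "http://www.example.com/3", "Too old",      "", now - 72 * 3600, 0),
]
conn.executemany("INSERT INTO harvested_1 VALUES (?,?,?,?,?,?)", rows)

# Public page: only checked-public links harvested 24 to 48 hours ago.
public = conn.execute(
    "SELECT url, title FROM harvested_1 "
    "WHERE security = 0 AND time_date BETWEEN ? AND ?",
    (now - 48 * 3600, now - 24 * 3600)).fetchall()
print(public)   # → [('http://www.example.com/1', 'Fresh public')]
```

The UNIQUE constraint on url enforces the "no duplicates within one table" rule at the database level, and the SECURITY/TIME_DATE filter is what the public PHP page would run on each request.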
10) Installation package that will install the software (in ready-to-run condition) on the platform(s) specified in this bid request.
11) Complete ownership and distribution copyrights to all work purchased.
12) Full and complete instructions on how to use and operate the spider.
As stipulated in the project outline, as long as it runs on Debian Woody using freely available packages that you are willing to support, I can deal with it. It will be running on a P3 computer with a standard DSL connection to the Internet.