The purpose of this project is to provide a script that checks submitted links for valid content, meaning the presence of a given keyword, phrase, or string. I currently have a couple of scripts that provide the basic functionality using LWP::UserAgent and LWP::Parallel::UserAgent. I will forward the existing files to the successful candidate only. Although I know this makes your bidding a little more challenging, it is a necessary precaution. As such, I would ask that you provide me with a brief outline of your experience in link validation, search engines, and/or the Perl modules mentioned above. Your job will be to modify one of these scripts so that it removes all of the web page tags and reads only the human-readable text. A full explanation of the context within which this will work can be found in the 'Deliverables'.
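To illustrate what "removes all of the web page tags, and only reads the human-readable text" could mean in practice, here is a minimal sketch. It is written in Python purely for illustration (the delivered script must be Perl), and the names `TextExtractor` and `visible_text` are hypothetical and not taken from the existing scripts. It assumes that the contents of script and style elements do not count as human-readable text.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes while skipping <script> and <style> contents."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0   # > 0 while inside a skipped element
        self._chunks = []      # text fragments in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Fragments are joined with a space so that adjacent blocks such as
        # <p>a</p><p>b</p> do not run together; whitespace is then collapsed.
        return " ".join(" ".join(self._chunks).split())


def visible_text(html):
    """Return the text of an HTML page with all tags removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Joining fragments with a space is a deliberate simplification: it keeps words in neighbouring elements apart, at the cost of splitting a word that has markup in the middle of it.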
Basically, the current environment works like this:
1) A link to a single page is submitted and placed in a MySQL database.
2) This submitted link is given a status of 'unchecked'.
3) Someone manually visits the page to see if it is legitimate.
4) Based on a review of the page, the page is given a status of 'publish' if it is valid, or simply 'checked' if it is not pertinent.

The new system that I expect to be in place after the project has been accepted as complete will perform the following:
1) A link is submitted and placed in a MySQL database.
2) This submitted link is given a status of 'unchecked'.
3) The link validator visits each page link with a status of 'unchecked', in each of 400 MySQL tables.
   a) It compares the text on the page to the subject keyword/phrase/string list.
   b) If at least one keyword/phrase/string exists on a page, an attribute is added to the harvested link that keeps track of how many keywords are present on that page. (Note: The current system does not have a keyword attribute, so you must provide this somehow.)
   c) I would also like this script to add another attribute, 'Preview', to the page links. So, if a link does contain more than one keyword/phrase/string, the first 20 human-readable words will be stored in the preview attribute of this link.
   d) If a link does not have any keyword/phrase/string matches, the link is given a status of 'checked'.
4) A human will click on a link on the administrator's page, and it will show all tables in the database, ranked by the number of links with keyword/phrase/string matches. For example: if we have 400 tables, they will all be shown, with the table at the top of the page having the most verified links, and the table at the bottom having the fewest. You will provide a PHP script that will do this.
5) A human will click on the table with the most verified links, and check the links that have occurrences of a keyword/phrase/string.
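The checks in step 3 and the ranking in step 4 of the new system can be sketched as follows. This is only an illustration in Python (the deliverables themselves are Perl and PHP), and every function name and dictionary key here is hypothetical. It assumes that "how many keywords are present" means the number of distinct list entries found on the page, that matching is a case-insensitive substring test, that the preview is stored whenever at least one entry matches, and that a matched link stays 'unchecked' until a human reviews it.

```python
def count_keyword_matches(text, keywords):
    """Return how many of the keywords/phrases occur in the text.

    Assumption: case-insensitive substring matching, so 'cat' would also
    match 'category'. Each list entry is counted at most once.
    """
    haystack = text.lower()
    return sum(1 for kw in keywords if kw.lower() in haystack)


def make_preview(text, max_words=20):
    """Return the first max_words whitespace-separated words of the text."""
    return " ".join(text.split()[:max_words])


def review_link(text, keywords):
    """Decide the attributes for one link from its human-readable text."""
    matches = count_keyword_matches(text, keywords)
    if matches == 0:
        # Step 3d: no matches, so the link is marked 'checked'.
        return {"status": "checked", "keywords": 0, "preview": ""}
    # Steps 3b and 3c: record the match count and a 20-word preview.
    # Assumption: the link stays 'unchecked' until a human decides.
    return {"status": "unchecked", "keywords": matches,
            "preview": make_preview(text)}


def rank_tables(matched_links_per_table):
    """Step 4: order table names by number of matched links, highest first."""
    return sorted(matched_links_per_table.items(),
                  key=lambda item: item[1], reverse=True)
```

For example, `review_link("Perl link checker for MySQL", ["perl", "php"])` would report one keyword and keep the link 'unchecked' for human review.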
Based on what they see in the preview, or by visiting the link themselves, the human will give the page a status of 'publish' or 'checked'.

Now, the means by which we maintain a keyword list is up to you. We could create a plain text file for each table, or you can simply compare the contents of pages with a status of 'publish' to those that are unchecked. Obviously, I would prefer the latter solution, as it would eliminate the need for maintaining hundreds of lists. However, because cost and speed are very real factors in the decision-making process, I will accept the text file solution if it is dramatically faster. With that in mind, I should say that my expectation is for the validator to be able to visit and analyze all unchecked links within an hour.

As this project is for modifying existing work, I expect that the cost will remain low, that the delivery time will be short, and that the following conditions are met:
1) Complete and fully-functional working program(s) in executable form as well as complete source code of all work done.
2) Installation package that will install the software (in ready-to-run condition) on the platform(s) specified in this bid request.
3) Exclusive and complete copyrights to all work purchased. (No GPL, GNU, 3rd party components, etc. unless all copyright ramifications are explained AND AGREED TO by the buyer on the site per the coder's Seller Legal Agreement).
4) Updated README for the existing system that fully explains the installation and use of your solution.
5) Guaranteed support via e-mail or messenger, should there be any problems.
Debian GNU/Linux Woody, MySQL, Perl, and PHP are all mandatory, and experience with phpBB would be appreciated.