A project to spider a small number of news sites and keep an indexed database of their text content. Each page will be assigned a rank dependent on several factors. The database will be queried by keyword through a simple web front-end.
This project comprises three parts:
* The back-end server, which will continuously spider a number of news sources. New pages will be indexed for full-text search, and custom relevancy data will be computed and stored.
* The web front-end, a very simple website which will allow end users to view search results across a number of pages. An options page will allow users to customise their search parameters.
* There will also be an administration section for site admins to view users, news sources and indexing stats.
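As a rough illustration of how the back-end and front-end fit together, the sketch below indexes pages, skips URLs it has already seen, and answers keyword queries ordered by rank. It is a toy in-memory stand-in, not Lucene, and every name in it (`Page`, `NewsIndex`, the `rank` field and its default) is hypothetical. The real ranking factors are defined in the functional spec and are not modelled here.

```python
import re
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Page:
    url: str
    text: str
    rank: float = 1.0  # placeholder value; real factors come from the spec


class NewsIndex:
    """Toy in-memory inverted index (an assumption, standing in for Lucene)."""

    def __init__(self):
        self.pages = {}                   # url -> Page
        self.postings = defaultdict(set)  # term -> set of urls containing it

    @staticmethod
    def tokenize(text):
        # Lowercase and split on anything that is not a letter or digit.
        return re.findall(r"[a-z0-9]+", text.lower())

    def add(self, page):
        """Index a new page. Returns False if the URL was already indexed."""
        if page.url in self.pages:
            return False
        self.pages[page.url] = page
        for term in set(self.tokenize(page.text)):
            self.postings[term].add(page.url)
        return True

    def search(self, query, page_number=1, page_size=10):
        """Return one page of URLs matching ALL keywords, highest rank first."""
        terms = self.tokenize(query)
        if not terms:
            return []
        hits = set.intersection(*(self.postings.get(t, set()) for t in terms))
        # Sort by descending rank, then URL, so the order is deterministic.
        ordered = sorted(hits, key=lambda u: (-self.pages[u].rank, u))
        start = (page_number - 1) * page_size
        return ordered[start:start + page_size]
```

A spider loop would call `add` for each fetched page, and the front-end would call `search` with the user's keywords and the requested results page.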
The project should be coded in Java, Python or Perl and hosted on either Google App Engine or Amazon EC2. Full-text indexing and searching should be carried out using Lucene (with Solr if needed).
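If Solr is used, the front-end could query it over HTTP through Solr's standard `select` handler. The sketch below only builds the request URL. The base URL, the core name `news`, and the helper name `solr_select_url` are assumptions for illustration, and which field the keywords are matched against depends on the Solr configuration.

```python
from urllib.parse import urlencode


def solr_select_url(base_url, core, keywords, page_number=1, page_size=10):
    """Build a Solr select URL for a keyword search (hypothetical helper)."""
    params = {
        "q": " ".join(keywords),                  # keywords as one query string
        "start": (page_number - 1) * page_size,   # offset of the first result
        "rows": page_size,                        # number of results per page
        "wt": "json",                             # ask for a JSON response
    }
    return f"{base_url.rstrip('/')}/{core}/select?{urlencode(params)}"


url = solr_select_url("http://localhost:8983/solr", "news",
                      ["election", "results"], page_number=2)
```

The `start` and `rows` parameters map directly onto the paginated results view described for the front-end.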
A full functional spec and web interface wireframes are available to prospective candidates.
All prospective candidates should have experience in deploying Lucene.