Nutch is a Java-based web search engine. Although it can run on clusters of hundreds of machines, it can also run on a single host and serve search results through a few JSP pages that ship with Nutch.
Crawling is done with something like `./bin/nutch crawl [url removed, login to view] -dir crawl -depth 2 -topN 30000`, and the HTML interface is set up by dropping `[url removed, login to view]` into your favorite servlet container (I use Jetty).
Your task is to build a single JSP page that shows statistics about the current search index. For that you need to use the Lucene API. Studying the source code of the tool "Luke" will probably show you exactly how to query the index (see [url removed, login to view]).
The page should display
* number of documents
* number of terms
* date the index was last modified, in [url removed, login to view] format
* any statistics you can get on the crawldb; [url removed, login to view], [url removed, login to view] and [url removed, login to view] might provide pointers
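One possible source for the crawldb figures is the text printed by `bin/nutch readdb crawl/crawldb -stats`. The sketch below is only an illustration: `CrawlDbStatsParser` is a hypothetical helper, and it assumes that each statistic is printed on its own line as a name, a colon, whitespace and a number. That format should be checked against the output of the installed Nutch version before relying on it. The code avoids language features newer than JDK 1.6.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: turns text in the assumed "name: value" format
// into name/value pairs that a JSP page could render as a table.
class CrawlDbStatsParser {

    // Assumption: every statistic line looks like "<name>:<whitespace><number>".
    static Map<String, Double> parse(String output) {
        Map<String, Double> stats = new LinkedHashMap<String, Double>();
        for (String line : output.split("\\r?\\n")) {
            int colon = line.lastIndexOf(':');
            if (colon < 0) {
                continue; // not a "name: value" line
            }
            String name = line.substring(0, colon).trim();
            String value = line.substring(colon + 1).trim();
            try {
                stats.put(name, Double.valueOf(value));
            } catch (NumberFormatException e) {
                // Value is not numeric (for example a banner line); skip it.
            }
        }
        return stats;
    }

    public static void main(String[] args) {
        // Sample text in the assumed format; the numbers are made up.
        String sample = "TOTAL urls:\t11\n"
                + "min score:\t0.017\n"
                + "status 2 (db_fetched):\t2\n"
                + "CrawlDb statistics: done\n";
        for (Map.Entry<String, Double> e : parse(sample).entrySet()) {
            System.out.println(e.getKey() + " = " + e.getValue());
        }
    }
}
```

Parsing command output keeps the sketch free of Nutch dependencies. Reading the crawldb through Nutch's own Java classes inside the JSP would avoid spawning a process, but it depends on the Nutch version in use.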
We will use this page to monitor whether the Nutch instance is "healthy", that is, still adding pages and so on. Nutch runs on an intranet and spiders about two dozen hosts.
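A very rough way to turn "still adding pages" into a yes/no signal is to compare the time of the last index change with a maximum allowed age. The sketch below is an assumption-laden illustration, not a requirement: `IndexFreshness`, the default `crawl/index` path and the 24-hour threshold are all hypothetical, and file modification times are used only as a stand-in for the last-modified value the Lucene API reports.

```java
import java.io.File;

// Hypothetical helper for a simple "is the index still being updated?" check.
class IndexFreshness {

    // True when the last change is more than maxAgeMillis before nowMillis.
    static boolean isStale(long lastModifiedMillis, long nowMillis, long maxAgeMillis) {
        return nowMillis - lastModifiedMillis > maxAgeMillis;
    }

    // Newest modification time among the entries directly inside dir,
    // or 0 if the directory is missing, unreadable or empty.
    static long newestModification(File dir) {
        long newest = 0L;
        File[] files = dir.listFiles();
        if (files != null) {
            for (File f : files) {
                newest = Math.max(newest, f.lastModified());
            }
        }
        return newest;
    }

    public static void main(String[] args) {
        // Assumed index location and threshold; adjust both to the installation.
        File indexDir = new File(args.length > 0 ? args[0] : "crawl/index");
        long maxAgeMillis = 24L * 60L * 60L * 1000L;
        long last = newestModification(indexDir);
        boolean stale = isStale(last, System.currentTimeMillis(), maxAgeMillis);
        System.out.println(indexDir + (stale ? " looks stale" : " looks fresh"));
    }
}
```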
* JSP Page displaying statistics.
* If you need a version of Nutch newer than 1.1, please provide us with the whole Nutch installation.
* Use open source libraries where they are available. If you copy open source code, please mark it clearly and state the license of the included code.
* Copyright in the code you write for the project will be assigned to us. We might open-source the code if we consider it of general interest.
* During development you will not get access to our servers, accounts, or other resources. Installation will be handled by us according to the documentation we provided.
FreeBSD 7, JDK 1.6, Nutch 1.0