We need a job server through which we can dispatch jobs to a Hadoop cluster. What we are trying to achieve is approximately the same as what is discussed at: [url removed, login to view]@[url removed, login to view]. We would like to target Amazon MapReduce, see [url removed, login to view], and the project should include integration with this service.
The job server should be a standard WAR file that is deployable in any standard servlet container. Preferably it should be written using Struts/Hibernate/MySQL/jQuery/Ext/Guice or other similar open-source technologies.
We should be able to administer jobs through a UI on the server that is accessed through a browser.
The first job to be implemented on the server is for processing HTML pages fetched from real estate brokers. The job should run through a set of stored pages, group them (reduce, in Hadoop terms) by broker/saleId/fetchedDate, and dispatch each group to a component that we will supply, which will handle further processing.
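To illustrate the intended grouping, here is a minimal sketch of that reduce step using plain Java collections rather than the Hadoop API. The class and field names (`HtmlPage`, `broker`, `saleId`, `fetchedDate`) are assumptions for illustration, not a prescribed schema:

```java
import java.util.*;
import java.util.stream.*;

// Sketch of the grouping step: pages sharing the same
// broker/saleId/fetchedDate key end up in one group, which the
// supplied component would then process. In Hadoop terms, groupKey()
// plays the role of the map output key and group() the shuffle/reduce.
public class GroupingSketch {

    public static class HtmlPage {
        final String broker;
        final String saleId;
        final String fetchedDate;
        final String htmlContent;

        public HtmlPage(String broker, String saleId,
                        String fetchedDate, String htmlContent) {
            this.broker = broker;
            this.saleId = saleId;
            this.fetchedDate = fetchedDate;
            this.htmlContent = htmlContent;
        }

        // Composite reduce key: broker/saleId/fetchedDate.
        String groupKey() {
            return broker + "/" + saleId + "/" + fetchedDate;
        }
    }

    // One group per key, each handed on to the processing component.
    public static Map<String, List<HtmlPage>> group(List<HtmlPage> pages) {
        return pages.stream()
                    .collect(Collectors.groupingBy(HtmlPage::groupKey));
    }
}
```

In a real Hadoop job the same key would be emitted by the mapper, and each reducer invocation would receive one such group.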
The job should also handle submission of HTML pages through an HTTP-based interface. For each HTML page the following attributes should be stored:
* htmlcontent (a copy of the fetched page; it may be zipped and cleaned to minimize storage needs, and will amount to between 50 and 150 KB per entity)
* saleId (unique ID for the broker)
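Since the htmlcontent attribute may be stored zipped, a minimal sketch of compressing and restoring the page body with `java.util.zip` (standard JDK, no framework assumed; the class name `HtmlContentCodec` is ours):

```java
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.zip.*;

// Sketch: gzip the fetched page body before persisting the htmlcontent
// attribute, and inflate it again when the page is read back.
public class HtmlContentCodec {

    public static byte[] compress(String html) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(html.getBytes(StandardCharsets.UTF_8));
        }
        return bos.toByteArray();
    }

    public static String decompress(byte[] zipped) throws IOException {
        try (GZIPInputStream gz =
                 new GZIPInputStream(new ByteArrayInputStream(zipped))) {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            byte[] buf = new byte[8192];
            int n;
            while ((n = gz.read(buf)) > 0) {
                bos.write(buf, 0, n);
            }
            return new String(bos.toByteArray(), StandardCharsets.UTF_8);
        }
    }
}
```

For 50–150 KB HTML pages, gzip typically shrinks the stored size considerably, which matters at crawl volumes.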
After the job has processed an HTML page, the isparsed attribute should be updated accordingly and the HTML page should be stored if possible. The processing of an HTML page group can fail; failures should be recorded in a database/HTable/log and be available for later retrieval through the UI.
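The failure-recording requirement could be sketched as below. This uses an in-memory list purely for illustration; a real deployment would back it with MySQL or an HTable as the brief suggests, and the names (`FailureLog`, `groupKey`, `reason`) are assumptions:

```java
import java.util.*;

// Sketch: record each failed group so the admin UI can list failures later.
public class FailureLog {

    public static class Failure {
        public final String groupKey;   // broker/saleId/fetchedDate
        public final String reason;
        public final long timestamp;

        Failure(String groupKey, String reason, long timestamp) {
            this.groupKey = groupKey;
            this.reason = reason;
            this.timestamp = timestamp;
        }
    }

    private final List<Failure> failures = new ArrayList<>();

    // Called when processing of one page group fails.
    public synchronized void recordFailure(String groupKey, String reason) {
        failures.add(new Failure(groupKey, reason, System.currentTimeMillis()));
    }

    // What the browser UI would query to show failed groups.
    public synchronized List<Failure> listFailures() {
        return new ArrayList<>(failures);
    }
}
```

Keyed by the same broker/saleId/fetchedDate group key, the UI can show exactly which groups need re-processing.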