I need an application in two parts:
1. Parsing document data
The application needs to parse information from different document formats (currently PDF, CHM, XLS and DOC).
All parsed data should be saved in a MySQL database. You are free to use any programming language for parsing as long as it's easy to get it to work on my Debian and RHEL4 boxes (so Python, Perl or PHP are preferable).
2. Display data in website
A small website needs to be built that displays all parsed files. It should also contain a search box to search for documents. There should also be a page where I can categorize the parsed documents. The website should be built in PHP with AJAX.
## Deliverables
1. Parsing the document data
The application needs to parse information from different document formats (currently PDF, CHM, XLS and DOC). It needs to parse as much information as possible, but at least the following:
- Name
- Size
- Type
- Description (properties)
- All other easily accessible metadata
Document types should be handled as plugins so that support for new ones can easily be added.
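To illustrate what I mean by plugins, here is a minimal sketch in Python (one of the preferred languages; any of them would do). All names in it (`PARSERS`, `register`, `basic_metadata`, `parse_pdf`, `parse`) are made up for illustration, and the PDF parser is only a placeholder that returns the basic fields.

```python
import os

# Hypothetical registry: maps a lowercase file extension to a parser function.
PARSERS = {}


def register(*extensions):
    """Decorator that registers a parser function for the given extensions."""
    def decorator(func):
        for ext in extensions:
            PARSERS[ext.lower()] = func
        return func
    return decorator


def basic_metadata(path):
    """Fields that can be read for any file, whatever its type."""
    return {
        "name": os.path.basename(path),
        "size": os.path.getsize(path),
        "type": os.path.splitext(path)[1].lstrip(".").lower(),
    }


@register("pdf")
def parse_pdf(path):
    # Placeholder: a real plugin would read the PDF document properties here.
    meta = basic_metadata(path)
    meta["description"] = None
    return meta


def parse(path):
    """Dispatch to the plugin registered for the file's extension."""
    ext = os.path.splitext(path)[1].lstrip(".").lower()
    parser = PARSERS.get(ext)
    if parser is None:
        raise ValueError("no parser plugin registered for .%s" % ext)
    return parser(path)
```

With something like this, adding a new document type would only mean writing one more function and registering it for its extension.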
All parsed data should be saved in a MySQL database. You are free to use any programming language for parsing as long as it's easy to get it to work on my Debian and RHEL4 boxes (so Python, Perl or PHP are preferable).
Parsing should be a cron job that runs every so often and picks up all files from an input directory. The application should then move the documents to a suitable location.
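As a rough sketch of one run of that job, assuming Python 3: the function name `process_input_dir`, its parameters and the `handle` callback are hypothetical, and the callback stands in for the actual parsing and database insert.

```python
import os
import shutil


def process_input_dir(input_dir, archive_dir, handle):
    """Run one pass: call handle(path) for each file, then move it to archive_dir."""
    os.makedirs(archive_dir, exist_ok=True)
    moved = []
    for name in sorted(os.listdir(input_dir)):
        src = os.path.join(input_dir, name)
        if not os.path.isfile(src):
            continue  # skip subdirectories and other non-files
        handle(src)  # e.g. parse the file and store the result in MySQL
        dst = os.path.join(archive_dir, name)
        # Note: a real job would also need to deal with name collisions here.
        shutil.move(src, dst)
        moved.append(dst)
    return moved
```

A crontab entry would then call a small script around this function at whatever interval is chosen.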
If possible (want-to-have in MoSCoW), it should also parse the contents of each file so that a full-text search on the file is possible.
2. Display data in website
- A small website needs to be built that displays a list of all parsed files.
- It should contain a search box to search for documents (by all parsed fields).
- There should also be a page where I can categorize the parsed documents.
- The website should be built in PHP with AJAX.
- I want to be able to open the documents inline or download them.
- For now, no authentication is required; this might come in a later release.