I'm building what could eventually be a *massive* collection of government documents in PDF format. I'm looking for a sharp, experienced and professional programmer to develop what will be the heart of a new website: a fast, indexed, smooth, robust and scalable internal search engine capable of doing keyword searches on a potentially *very large* collection of PDF documents (all text-searchable, some converted via OCR).

While I'm fine with the idea of using a public license search engine product (mnogo for instance), I absolutely need the following key features, some of which I haven't seen on any public products:

1. Able to do both sophisticated fuzzy natural language searches AND complex boolean searches, including phrase searches, AND/OR/NOT, wildcards, etc. Most importantly as to the latter, it must be capable of doing complex proximity searches (for instance, two words in the same sentence, in the same paragraph, or within a certain number of words of each other).

2. In addition to keyword searching on the full document text, the search engine must be able to limit its search based on certain meta-data fields associated with each file (for instance, date authored, authoring agency, etc.). Some of these fields must themselves be keyword searchable (e.g., "author"), while others would be numerical (e.g., "date within 2 years").

3. When the search results are displayed, they must include snippets of text under each document link (à la Google) showing how the keyword hits appear.

4. Also like Google, it needs to be capable of displaying the documents as either PDF or text/html (I assume this requires a separate module to convert the PDFs), and the keyword "hits" need to be highlighted when the document is opened. (I presume this would only be possible with the text version?)
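To make the proximity and snippet requirements a little more concrete, here is a rough Python sketch of the behavior I have in mind. It is only an illustration under my own assumptions: the function names (`tokenize`, `within_n_words`, `snippet`) are made up for this example, it scans raw text rather than using an index, and a real solution would presumably do this inside whatever search engine is chosen.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def within_n_words(text, word_a, word_b, max_distance):
    """Return True if word_a and word_b occur within max_distance words of each other."""
    tokens = tokenize(text)
    positions_a = [i for i, t in enumerate(tokens) if t == word_a.lower()]
    positions_b = [i for i, t in enumerate(tokens) if t == word_b.lower()]
    return any(abs(a - b) <= max_distance
               for a in positions_a for b in positions_b)

def snippet(text, keyword, radius=40):
    """Return an excerpt around the first hit, with the hit wrapped in <b> tags.

    Returns None when the keyword does not appear in the text.
    """
    match = re.search(r"\b" + re.escape(keyword) + r"\b", text, re.IGNORECASE)
    if match is None:
        return None
    start = max(0, match.start() - radius)
    end = min(len(text), match.end() + radius)
    return (text[start:match.start()]
            + "<b>" + match.group(0) + "</b>"
            + text[match.end():end])

if __name__ == "__main__":
    sample = "The agency issued a report on water quality in 1998."
    print(within_n_words(sample, "agency", "report", 3))
    print(snippet(sample, "report", radius=5))
```

The same idea would need to extend to "same sentence" and "same paragraph" searches, which is why the way the PDF text is extracted and stored matters.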

1) Complete and fully-functional working program(s) in executable form as well as complete source code of all work done.
2) Installation and implementation on the platform(s) specified.
3) Complete ownership and distribution copyrights to all work purchased. To be clear, this is a work-for-hire proposal, and all copyrights inure to the purchaser, not to the programmer.


I'm pretty much starting from scratch, so it would be helpful if the programmer were willing to consult on languages to be used, site hosting solutions, etc. Would also need to do the installation, and assist with implementation. Best scenario would be someone willing to be involved on an ongoing basis.

Also, there's an entirely separate project involving the creation of spidering software to actually capture all these PDFs from certain other public sites. I'm trying to do that myself, but would consider adding that to the project. In any case, I would need to coordinate with the search engine programmer to make sure the meta-data is captured in a usable way.

Finally, the start-up money for this site is limited, and there may be a delay between the posting of bids and my actual ability to commission the work.

Thanks in advance for any bids!

