I require a program that can take a large text file (from scanned and OCR'ed documents that are up to 500 print pages long) and convert the single file into multiple web pages.
It can be done as a Windows executable in any programming language you choose, or as a server-based program in PHP or Python for a Linux Apache server.
The software should:
1) Remove the single line breaks introduced by the scan, but keep the blank-line (double line break) separators as paragraph breaks.
2) Be able to end a page cleanly at the end of a paragraph once a user-specified number of words (e.g. 500) has been reached. NOTE: The page does not end at the nth word, but at the end of the paragraph containing the nth word.
3) Each new chapter/section of the source text documents begins with an upper case title. The script should be able to recognize that and begin a new html page at the start of each section. This feature should be optional for the user.
4) The script should allow the user to paste their own template code around the text.
5) Have the ability to include either the section title or the first sentence of the page as the page's title tag.
6) Offer the same choice (section title or first sentence of the page) for the meta description.
7) The script should allow the user to build navigation into the template (e.g. previous/next links). This doesn't have to be complicated, just something that the user can include in the template, like %%begin_next%%%%end_next%%.
8) The user should be able to specify the extension of the resulting files (html, shtml, php, etc.) as well as a file prefix (e.g. Title_, with the pages simply numbered sequentially).
9) The script should be able to generate a table of contents that links either to every single page with its title, or just to the various sections, based on the user's preference.
10) Save the resulting html files in a user-specified directory
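To make requirements 1 through 3 concrete, here is a minimal Python sketch of the paragraph handling and page splitting. It is an illustration only, not a required design. The function names (`split_paragraphs`, `is_section_title`, `paginate`) are hypothetical, and the title check assumes that a section title is a paragraph containing letters but no lower-case letters.

```python
import re


def split_paragraphs(text):
    """Join single line breaks inside a paragraph; blank lines separate paragraphs."""
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    paragraphs = []
    for block in re.split(r"\n\s*\n", text):
        # Collapse the hard-wrapped lines of one paragraph into a single line.
        joined = " ".join(line.strip() for line in block.split("\n") if line.strip())
        if joined:
            paragraphs.append(joined)
    return paragraphs


def is_section_title(paragraph):
    """Assumed heuristic: a title has letters and none of them are lower case."""
    return any(c.isalpha() for c in paragraph) and paragraph == paragraph.upper()


def paginate(paragraphs, words_per_page=500, split_on_titles=True):
    """Group paragraphs into pages, closing a page only at a paragraph boundary."""
    pages, current, count = [], [], 0
    for para in paragraphs:
        # Optional feature: a section title always starts a new page.
        if split_on_titles and is_section_title(para) and current:
            pages.append(current)
            current, count = [], 0
        current.append(para)
        count += len(para.split())
        # The paragraph containing the nth word is kept whole, then the page ends.
        if count >= words_per_page:
            pages.append(current)
            current, count = [], 0
    if current:
        pages.append(current)
    return pages
```

Calling `paginate(split_paragraphs(raw_text), words_per_page=500)` returns a list of pages, each a list of paragraph strings. Real OCR output may need a stricter title heuristic, for example a maximum title length.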
## Deliverables
The program interface would essentially include:
* A box to paste the template, plus some instructions on how to configure the title and description tags and navigation buttons.
* A box to specify how many words before the script looks for a paragraph break to end the page.
* A check box to turn on or off the option of creating a new page at the start of each section or chapter.
* A box to specify the extension.
* A box to specify the file prefix.
* A radio button or check box to specify either a full table of contents or a "sections only" TOC.
* A place to choose or specify the destination directory.
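As an illustration of how the pasted template might be filled in, the Python sketch below substitutes placeholders for the title, meta description, page text, and previous/next links. Only the `%%begin_next%%`/`%%end_next%%` pair comes from this request. The other placeholders (`%%title%%`, `%%description%%`, `%%content%%`, `%%next_url%%`, `%%prev_url%%`, `%%begin_prev%%`/`%%end_prev%%`) and all function names are assumptions made for the example. The first-sentence detection is deliberately naive and will cut short at abbreviations such as "Mr.".

```python
import html
import re


def first_sentence(text):
    """Naive first-sentence extraction: up to the first '.', '!' or '?'."""
    match = re.search(r"(.+?[.!?])(\s|$)", text)
    return match.group(1) if match else text


def page_filename(prefix, number, extension):
    """Build a sequential file name such as Title_3.html."""
    return f"{prefix}{number}.{extension}"


def render_nav(template, name, url):
    """Keep the body of %%begin_NAME%%...%%end_NAME%% if url is set, else drop it."""
    pattern = re.compile(
        r"%%begin_" + name + r"%%(.*?)%%end_" + name + r"%%", re.DOTALL
    )
    if url is None:
        return pattern.sub("", template)
    return pattern.sub(
        lambda m: m.group(1).replace("%%" + name + "_url%%", url), template
    )


def render_page(template, paragraphs, title, description, number, total,
                prefix="Title_", extension="html"):
    """Fill one page of the user's template; pages are numbered from 1."""
    body = "\n".join(f"<p>{html.escape(p)}</p>" for p in paragraphs)
    prev_url = page_filename(prefix, number - 1, extension) if number > 1 else None
    next_url = page_filename(prefix, number + 1, extension) if number < total else None
    page = render_nav(template, "prev", prev_url)
    page = render_nav(page, "next", next_url)
    page = page.replace("%%title%%", html.escape(title))
    page = page.replace("%%description%%", html.escape(description))
    # Insert the body last so placeholder-like text inside it is left untouched.
    return page.replace("%%content%%", body)
```

The first page then renders without a "previous" link and the last page without a "next" link, so the user does not have to maintain separate templates for them.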
1) Complete and fully-functional working program(s) in executable form as well as complete source code of all work done.
2) Deliverables must be in ready-to-run condition, as follows (depending on the nature of the deliverables):
a) For web sites or other server-side deliverables intended to only ever exist in one place in the Buyer's environment--Deliverables must be installed by the Seller in ready-to-run condition in the Buyer's environment.
b) For all others including desktop software or software the buyer intends to distribute: A software installation package that will install the software in ready-to-run condition on the platform(s) specified in this bid request.
3) All deliverables will be considered "work made for hire" under U.S. Copyright law. Buyer will receive exclusive and complete copyrights to all work purchased. (No GPL, GNU, 3rd party components, etc. unless all copyright ramifications are explained AND AGREED TO by the buyer on the site per the coder's Seller Legal Agreement).
## Platform
Windows 2000/XP desktop
**or**
Linux Server (Apache/mysql/php/python/perl)