Looking for someone to process all the wikis above 50k in size from [url removed, login to view] and [url removed, login to view]. Include non-English versions, talk pages and user pages, but not revision history (i.e. you only need the most recent version of each article). We would like you to:

A. Convert the wikis listed above into HTML, together with their associated image dumps.
B. Extract the HTML and in-article images from the attached wikis by crawling the sites. Only extract the article itself, not the template (navigation and footer).
C. Machine translate the HTML from both A and B into as many languages as possible (from and to English, and between other non-English languages). For this we recommend using a translator that preserves HTML.
D. Remove broken links (e.g. where Wikipedia articles link to articles that have not been written yet).
E. Output HTML; where there are associated image dumps, ensure that the links to the images work, otherwise strip them out (a rough sketch of D and E appears below).

Only include article pages, not other namespaces (e.g. image, user and discussion pages). This must be able to run periodically, so any manual processes must be clearly documented.

The process can run on our servers or on yours. If it runs on ours, it must run on the Linux platform and require no more than 400 MB of memory. For this initial run, all data from all the wikis in all the languages is a deliverable; for subsequent runs we can negotiate something beyond this project. Coders without significant history on RAC must submit a portfolio and an expert guarantee of 20% to be considered.
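As a very rough illustration of steps D and E, the sketch below drops "red" links and strips images that cannot be resolved against a provided image dump. It assumes Python with BeautifulSoup and relies on the MediaWiki convention of tagging links to nonexistent articles with class="new"; none of these choices are required by this bid, and the dump layout shown is an assumption for illustration only.

```python
# Minimal sketch of steps D and E (assumed approach, not specified in this bid).
from pathlib import Path
from typing import Optional
from bs4 import BeautifulSoup

def clean_article_html(html: str, image_dump_dir: Optional[Path]) -> str:
    soup = BeautifulSoup(html, "html.parser")

    # D. Remove broken ("red") links but keep their visible text.
    #    MediaWiki marks links to missing articles with class="new".
    for link in soup.find_all("a", class_="new"):
        link.unwrap()

    # E. Keep an image only when its file can be found in the provided dump;
    #    otherwise strip the tag entirely.
    for img in soup.find_all("img"):
        filename = img.get("src", "").rsplit("/", 1)[-1]
        local = image_dump_dir / filename if image_dump_dir else None
        if local is not None and local.exists():
            img["src"] = str(local)  # point the link at the local dump copy
        else:
            img.decompose()

    return str(soup)
```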
The script needs to produce one table with the following fields (an illustrative table definition appears below):

* Article name (in the original language)
* Source wiki
* Original language (2-letter code)
* Translated language (2-letter code); leave as NULL if it is the original, untranslated content
* HTML (translated; where an image dump is provided, use correct links to the images, otherwise take them out)
* Date of translation
* Date of article (as specified in the XML feed)
* Popularity of the article (where the record exists)

This task deals with a large amount of data (in the gigabytes), so it is advisable that only coders with experience managing gigabytes of data attempt it.
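For concreteness, the table could look something like the sketch below. SQLite, the table name, and the column types are assumptions made purely for illustration; the actual database engine and naming are not specified in this bid.

```python
# Minimal sketch of the requested output table, using SQLite for illustration only.
import sqlite3

conn = sqlite3.connect("wiki_articles.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS articles (
        article_name        TEXT NOT NULL,  -- in the original language
        source_wiki         TEXT NOT NULL,
        original_language   TEXT NOT NULL,  -- 2-letter code
        translated_language TEXT,           -- 2-letter code; NULL for untranslated originals
        html                TEXT NOT NULL,  -- translated HTML, image links fixed or stripped
        date_of_translation TEXT,
        date_of_article     TEXT,           -- as specified in the XML feed
        popularity          INTEGER         -- NULL where no record exists
    )
""")
conn.commit()
conn.close()
```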
1) Complete and fully-functional working program(s) in executable form, as well as complete source code of all work done.
2) Deliverables must be in ready-to-run condition, as follows (depending on the nature of the deliverables):
a) For web sites or other server-side deliverables intended to only ever exist in one place in the Buyer's environment: deliverables must be installed by the Seller in ready-to-run condition in the Buyer's environment.
b) For all others, including desktop software or software the Buyer intends to distribute: a software installation package that will install the software in ready-to-run condition on the platform(s) specified in this bid request.
3) All deliverables will be considered "work made for hire" under U.S. Copyright law. Buyer will receive exclusive and complete copyrights to all work purchased. (No GPL, GNU, 3rd party components, etc. unless all copyright ramifications are explained AND AGREED TO by the buyer on the site per the coder's Seller Legal Agreement).