Generalize XSLT 2.0 scripts to extract genealogy data from webpages
A lot of people post family histories on the Web. Although there are millions of such webpages, most are generated by one of about 50 different "genealogy->HTML" generators. Each generator uses a consistent HTML tag layout, so it is possible to extract the original genealogy data (names, dates, and places) from the webpages generated by a specific generator using an XSLT script.
We currently have first-cut XSLT scripts for each of the 50 generators. However, as we apply the scripts to newly identified pages, we find cases where a script must be generalized to extract the data correctly. As we find such problems over the next several months, we will give you the scripts that need fixing, the pages they break on, and pages they already handle and must continue to handle. We want you to generalize the scripts so that they extract data correctly from both the new and the old pages. The scripts are written in XSLT 2.0 and can be run using Saxon 8 (an open-source XSLT 2.0 processor).
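As a purely illustrative sketch of what "generalizing" a script can mean (this is not one of the actual scripts, and the page layout is invented): suppose a hypothetical generator writes each person as <p class="person"><b>Name</b> b. 1850, Ohio</p>, and a newly found page from the same generator uses <strong> instead of <b>. Assuming the page has already been converted to well-formed XML with no namespace, a generalized template might accept either tag:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" indent="yes"/>

  <!-- Wrap all extracted records in one container element. -->
  <xsl:template match="/">
    <persons>
      <xsl:apply-templates select="//p[@class = 'person']"/>
    </persons>
  </xsl:template>

  <!-- Hypothetical layout: one <p class="person"> per individual. -->
  <xsl:template match="p[@class = 'person']">
    <person>
      <!-- Generalized step: take the first <b> OR <strong> child as the name,
           so both the old and the new page variants are handled. -->
      <name>
        <xsl:value-of select="normalize-space((b | strong)[1])"/>
      </name>
      <!-- Pull a four-digit birth year out of the loose text, e.g. "b. 1850".
           The braces are doubled because the regex attribute is an
           attribute value template. -->
      <xsl:analyze-string select="normalize-space(string-join(text(), ' '))"
                          regex="b\.\s*(\d{{4}})">
        <xsl:matching-substring>
          <birth-year><xsl:value-of select="regex-group(1)"/></birth-year>
        </xsl:matching-substring>
      </xsl:analyze-string>
    </person>
  </xsl:template>

</xsl:stylesheet>
```

The element names persons, person, name, and birth-year are placeholders chosen for this sketch. The real scripts and page layouts will differ, but the goal is the same: widen the match so that new pages work without breaking the pages that already work.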
We expect this project to require approximately 300 person-hours. We plan to hire 3-4 people to work on it and to complete it in the next 3 months.
We therefore expect you to be able to spend a minimum of 8 hours/week on the project from now until August.
Pay is $15/hr, plus bonuses for quick work.
Attached is a sample problem in which an XSLT script does not extract all of the data from a particular XML document. We provide this problem so that you can understand the type of work that is expected and so that we can evaluate your XSLT 2.0 skills. Please submit your solution to this problem along with a summary of your qualifications.
The project is work-for-hire, so we will own the scripts and the changes you make to them.