My problem is simple:
I need to extract text from www.uspto.gov.
(1) Click on "Patents: Search" on the left-hand side of the screen.
(2) Click on "Issued Patents: Quick Search".
(3) For Term 1, type in "4000000" and for Field 1, select "Patent Number".
(4) Click on "Search".
(5) About 2 screens down, there is a heading, "Foreign Patent Documents".
(6) Below the heading are 4 patents.
(7) I need the program to extract these 4 patents by number and country code (e.g. CA, DD, UK).
(8) The program outputs 2 columns of data, the first column lists the "citing" patent and the second column lists the "cited" patent. Therefore, for just patent #4000000, the data looks like this:
Notice that the commas are removed from the foreign patent numbers, and the periods are removed from the country code.
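The clean-up described above could be sketched roughly as follows. This is only an illustration: the class and method names are hypothetical, and it assumes the number and the country code have already been pulled out of the page as separate strings.

```java
// Hypothetical helper; names are illustrative and not part of the request.
final class CitationNormalizer {
    private CitationNormalizer() {}

    /** Removes commas from a patent number, e.g. "1,234,567" becomes "1234567". */
    static String normalizeNumber(String raw) {
        // String.replace works on literal text, so no regex escaping is needed.
        return raw.replace(",", "").trim();
    }

    /** Removes periods from a country code, e.g. "U.K." becomes "UK". */
    static String normalizeCountry(String raw) {
        return raw.replace(".", "").trim();
    }

    public static void main(String[] args) {
        // Made-up input values, used only to show the transformation.
        System.out.println(normalizeCountry("U.K.") + " " + normalizeNumber("1,234,567"));
    }
}
```

How the country code and number are combined in the "cited" column is left open here; the sketch only covers the removal of commas and periods.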
The program will accept a starting patent number and an ending patent number. Not all US patents cite a foreign patent. If you check #4000001 through #4000011, you'll find that none of them cite a foreign patent. The program will put a "." in the cited column if the citing patent cites no foreign patents. In other words, every patent number from the starting number through the ending number will have at least one row.
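A minimal sketch of that row rule, under some assumptions: the extraction step is taken as already done and supplied as a map from citing patent number to its list of cited foreign patents, the two columns are separated by a tab, and all names (`CitationRows`, `buildRows`) are hypothetical. The cited values in `main` are placeholders, not real citations.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper; names and the tab separator are assumptions.
final class CitationRows {
    private CitationRows() {}

    /** Builds one row per citation, or a single "." row when a patent cites nothing foreign. */
    static List<String> buildRows(int start, int end, Map<Integer, List<String>> citedByPatent) {
        List<String> rows = new ArrayList<>();
        for (int patent = start; patent <= end; patent++) {
            List<String> cited = citedByPatent.getOrDefault(patent, Collections.emptyList());
            if (cited.isEmpty()) {
                // Placeholder row so every patent in the range appears at least once.
                rows.add(patent + "\t.");
            } else {
                for (String c : cited) {
                    rows.add(patent + "\t" + c);
                }
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        Map<Integer, List<String>> cited = new HashMap<>();
        // Placeholder values only; real entries would come from the extraction step.
        cited.put(4000000, Arrays.asList("CITED-A", "CITED-B"));
        for (String row : buildRows(4000000, 4000002, cited)) {
            System.out.println(row);
        }
    }
}
```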
For patents #4000000 through #4000012, the data will look like this:
Speed is essential. I want to be able to process on the order of millions of patents (e.g. from #4000000 to #6000000).
Of course, the program will retain whatever output it generated in case the computer is unplugged or otherwise crashes.
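One way the crash-recovery requirement might be met is to append rows to the output file as each patent is finished and, on restart, redo only the last patent that appears in the file. The sketch below assumes tab-separated rows with the citing patent in the first column; the class and method names are hypothetical. Closing the writer hands the data to the operating system but does not force it to disk, so a power loss could still cost the most recent rows.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; names and the tab-separated layout are assumptions.
final class ResumableOutput {
    private ResumableOutput() {}

    /** Appends the rows of one finished patent to the output file, creating it if needed. */
    static void appendRows(Path output, List<String> rows) throws IOException {
        try (BufferedWriter w = Files.newBufferedWriter(output, StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND)) {
            for (String row : rows) {
                w.write(row);
                w.newLine();
            }
        }
    }

    /**
     * Returns the patent number to restart from. Rows of the last patent in the
     * file are dropped because they may be incomplete, and that patent is redone.
     */
    static int prepareResume(Path output, int start) throws IOException {
        if (!Files.exists(output)) {
            return start;
        }
        List<String> lines = new ArrayList<>(Files.readAllLines(output, StandardCharsets.UTF_8));
        // A final line without a tab was cut off mid-write; discard it.
        if (!lines.isEmpty() && lines.get(lines.size() - 1).indexOf('\t') < 0) {
            lines.remove(lines.size() - 1);
        }
        if (lines.isEmpty()) {
            Files.write(output, lines, StandardCharsets.UTF_8);
            return start;
        }
        String lastLine = lines.get(lines.size() - 1);
        String lastPatent = lastLine.substring(0, lastLine.indexOf('\t'));
        while (!lines.isEmpty() && lines.get(lines.size() - 1).startsWith(lastPatent + "\t")) {
            lines.remove(lines.size() - 1);
        }
        // Rewrite the file without the possibly incomplete patent.
        Files.write(output, lines, StandardCharsets.UTF_8);
        return Integer.parseInt(lastPatent);
    }
}
```

On startup the program would call `prepareResume(output, startingNumber)` and continue from the returned number. Reading the whole file and reopening it for every patent keeps the sketch short; for millions of rows, a real implementation would more likely keep the writer open, flush periodically, and scan only the tail of the file on restart.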
I need the bidder to be accessible via Yahoo IM during the course of the work, for possible minor adjustments during and after the transaction.
1) Complete and fully-functional working program(s) in executable form as well as complete source code of all work done.
2) Installation package that will install the software (in ready-to-run condition) on the platform(s) specified in this bid request.
3) Exclusive and complete copyrights to all work purchased. (No GPL, 3rd party components, etc. unless all copyright ramifications are explained AND AGREED TO by the buyer on the site).
Whatever platform is easiest, although Java would be my suggestion.