I need a Perl script that will crawl [url removed, login to view], index the files there, and download only PDF and DOC files (avoiding unnecessary files such as JPGs).
The files are in the public domain and publicly available to anyone, so don't worry about any copyright issues.
However, you do need to take care of the following:
* I need this done as two separate scripts: one for the subdomain [[url removed, login to view]] and one for the rest of http://un.org. Their downloads should go into two different folders.
* Please pay close attention not to overload the web site (I don't want my name in the newspapers as the guy who brought down the United Nations web site...). There should be a function in the script that inserts a random delay between 0 and 10,000 ms between accesses (and please mark it as such, so that I can edit it myself if needed).
* I want the script's user agent to impersonate either Googlebot, Bingbot, or a randomly chosen common browser. I need this function commented and easily editable by me later on.
* I want the script to run no more than N threads at the same time (so as not to overload the target web site). Please mark this function so that I can edit it later (the default should be 5 threads).
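The three tunable knobs above (the random politeness delay, the user-agent rotation, and the thread cap) could be sketched in Perl roughly as follows. All function names, default values, and user-agent strings here are illustrative placeholders for the bidder to replace, not a prescribed implementation:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use threads;                      # assumes a threaded Perl build
use Thread::Semaphore;
use Time::HiRes qw(usleep);

# --- EDIT HERE: politeness delay bounds, in milliseconds ---
my $MIN_DELAY_MS = 0;
my $MAX_DELAY_MS = 10_000;

# Sleep a random interval before each request so the site is not hammered.
sub polite_delay {
    my $ms = $MIN_DELAY_MS + int(rand($MAX_DELAY_MS - $MIN_DELAY_MS + 1));
    usleep($ms * 1000);           # usleep() takes microseconds
}

# --- EDIT HERE: user-agent strings to rotate through ---
my @USER_AGENTS = (
    'Googlebot/2.1 (+http://www.google.com/bot.html)',
    'Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
        . '(KHTML, like Gecko) Chrome/91.0 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; rv:89.0) Gecko/20100101 Firefox/89.0',
);

# Pick one of the strings above at random for each request.
sub random_user_agent {
    return $USER_AGENTS[ int(rand(@USER_AGENTS)) ];
}

# --- EDIT HERE: maximum simultaneous download threads ---
my $MAX_THREADS = 5;
my $slots = Thread::Semaphore->new($MAX_THREADS);
# Each worker calls $slots->down() before fetching and $slots->up()
# afterwards, so at most $MAX_THREADS requests are in flight at once.
```

Each worker thread would call polite_delay() and random_user_agent() before every request, and hold a semaphore slot for the duration of its download.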
As a general rule, your scripts should be commented so that I can modify them later on.
Here are some final requests (and details) about the script covering [url removed, login to view]:
[url removed, login to view] is basically a search engine. I will provide you with a list of search terms. Only use the simple search option.
Every individual document should have its own folder; inside each folder, the files containing the language versions of that document should be saved with a 3-letter code indicating the language (as per ISO 639-2). For example, a document with the original file name "[url removed, login to view]" should be saved as "[url removed, login to view]".
Codes: ARA (for Arabic), CHI (for Chinese), ENG (for English), FRA (for French), RUS (for Russian), SPA (for Spanish).
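Since the example file names were stripped from this post, the exact naming convention is an assumption; one possible sketch of the renaming helper, with the language labels guessed from how search results are typically tagged, might look like this:

```perl
use strict;
use warnings;
use File::Basename qw(fileparse);

# Map of language names (as the site might label them) to the requested
# 3-letter codes. The label strings on the left are assumptions and must
# be adjusted to whatever the search results actually display.
my %LANG_CODE = (
    Arabic  => 'ARA',
    Chinese => 'CHI',
    English => 'ENG',
    French  => 'FRA',
    Russian => 'RUS',
    Spanish => 'SPA',
);

# Build the saved file name: original base name plus the language code,
# keeping the original extension. The suffix style is a placeholder,
# since the example names were removed from the post.
sub localized_name {
    my ($filename, $language) = @_;
    my ($base, undef, $ext) = fileparse($filename, qr/\.[^.]+$/);
    my $code = $LANG_CODE{$language} // 'UNK';   # fall back if unlabeled
    return "$base-$code$ext";
}

# e.g. localized_name('resolution1234.pdf', 'French')
#      returns 'resolution1234-FRA.pdf'
```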
Also, if you encounter PDF files on [url removed, login to view], there is no need to download the DOC versions as well.
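That preference rule could be sketched with a hypothetical helper like the one below; the real script would apply it per document (and possibly per language version):

```perl
use strict;
use warnings;

# Given the list of file URLs found for one document, keep DOC/DOCX
# files only when no PDF version exists. Names are illustrative.
sub files_to_download {
    my @files = @_;
    my @pdfs = grep { /\.pdf$/i } @files;
    return @pdfs ? @pdfs : grep { /\.docx?$/i } @files;
}
```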