Cancelled

Crawl the [url removed, login to view] web site, download PDFs

I need a Perl script that will crawl [url removed, login to view], index the files there, and download only PDF and DOC files, skipping unneeded file types such as JPG.
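The file-type filter described above can be sketched as follows. This is a minimal illustration in Python of the logic only (the job itself asks for Perl); the function name `is_wanted` and the example URLs are hypothetical, and the check assumes the file type can be read from the URL path extension.

```python
import posixpath
from urllib.parse import urlparse

# Only these extensions are downloaded; everything else (JPG etc.) is skipped.
WANTED_EXTENSIONS = {".pdf", ".doc"}


def is_wanted(url):
    """Return True if the URL path ends in a wanted extension (case-insensitive)."""
    path = urlparse(url).path  # drops any query string or fragment
    extension = posixpath.splitext(path)[1].lower()
    return extension in WANTED_EXTENSIONS


if __name__ == "__main__":
    # Hypothetical links, as a crawler might collect them from one page.
    links = [
        "http://example.org/docs/report.PDF",
        "http://example.org/img/logo.jpg",
        "http://example.org/docs/notes.doc?lang=E",
    ]
    for link in links:
        print(link, "->", "download" if is_wanted(link) else "skip")
```

A server may also serve PDFs from URLs without an extension; checking the `Content-Type` response header would catch those, at the cost of one extra request per link.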

The files are in the public domain and publicly available to anyone, so there are no copyright issues to worry about.

However, you must take care of the following points:

* I need the work delivered as two separate scripts: one for the subdomain [url removed, login to view] and one for the rest of http://un.org. Each script should save its downloads to its own folder.

* Please take great care not to overload the web site (I don't want my name in the newspapers as the guy who took down the United Nations web site...). The script should have a function that waits a random interval of between 0 and 10,000 ms between requests (and please mark it as such, so that I can edit it myself if needed).

* I want the script's user agent to be able to impersonate either Googlebot, Bingbot, or a randomly chosen common browser. I need this function commented and easily editable by myself later on.

* I want the script to run no more than N threads at the same time (so as not to overload the target web site). Please mark this setting so that I can edit it later (the default should be 5 threads).
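The three throttling requirements above (random delay, selectable user agent, thread cap) might fit together roughly as sketched below. This is a Python illustration of the logic only, since the job asks for Perl. All names are hypothetical, `fetch` is a placeholder callable supplied by the caller, and the browser user-agent strings are illustrative examples rather than a vetted list.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor

# --- EDIT HERE: random delay before each request, in milliseconds ---
MIN_DELAY_MS = 0
MAX_DELAY_MS = 10_000

# --- EDIT HERE: maximum number of simultaneous threads (default 5) ---
MAX_THREADS = 5

# --- EDIT HERE: "googlebot", "bingbot", or "browser" (random browser) ---
USER_AGENT_MODE = "browser"

CRAWLER_AGENTS = {
    "googlebot": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "bingbot": "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)",
}
# Illustrative browser strings only; replace with whatever list you prefer.
BROWSER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/115.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/16.5 Safari/605.1.15",
]


def random_delay_seconds():
    """Pick a delay uniformly between MIN_DELAY_MS and MAX_DELAY_MS, in seconds."""
    return random.uniform(MIN_DELAY_MS, MAX_DELAY_MS) / 1000.0


def pick_user_agent(mode=USER_AGENT_MODE):
    """Return a fixed crawler string, or a random browser string for other modes."""
    if mode in CRAWLER_AGENTS:
        return CRAWLER_AGENTS[mode]
    return random.choice(BROWSER_AGENTS)


def crawl(urls, fetch, sleep=time.sleep):
    """Fetch each URL with at most MAX_THREADS workers, pausing before each request.

    `fetch(url, user_agent)` is a placeholder for the real download routine.
    """
    def worker(url):
        sleep(random_delay_seconds())
        return fetch(url, pick_user_agent())

    with ThreadPoolExecutor(max_workers=MAX_THREADS) as pool:
        return list(pool.map(worker, urls))
```

Keeping the three settings as constants at the top of the file is one way to meet the "mark it so I can edit it later" request; a config file or command-line options would serve equally well.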

As a general rule your script should be commented so that I can modify it later on.

## Deliverables

Here are some final requests (and details) about the script covering [url removed, login to view]

[url removed, login to view] is basically a search engine. I will provide you with a list of search terms. Only use the simple search option.

Every individual document should have its own folder; inside each folder, the files containing language versions of the same document should be saved with a 3-letter code showing the language (as per ISO 639-2). For example, a document with the original file name "[url removed, login to view]" should be saved as "[url removed, login to view]".

Codes: ARA (for Arabic), CHI (for Chinese), ENG (for English), FRA (for French), RUS (for Russian), SPA (for Spanish).
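The folder-per-document layout with language codes could be sketched as below. This is a Python illustration of the logic only (the job asks for Perl). The exact target file name is not recoverable from the posting, so the `<name>_<CODE><ext>` pattern, the function name `target_path`, and the example values are all assumptions; only the six codes come from the list above.

```python
import os

# Language codes as listed in the job description.
LANGUAGE_CODES = {
    "Arabic": "ARA",
    "Chinese": "CHI",
    "English": "ENG",
    "French": "FRA",
    "Russian": "RUS",
    "Spanish": "SPA",
}


def target_path(root, document_id, original_name, language):
    """Build <root>/<document_id>/<name>_<CODE><ext> for one language version.

    The "_<CODE>" suffix is an assumed naming pattern, not the client's spec.
    """
    code = LANGUAGE_CODES[language]  # raises KeyError for an unlisted language
    stem, extension = os.path.splitext(original_name)
    return os.path.join(root, document_id, f"{stem}_{code}{extension}")


if __name__ == "__main__":
    # Hypothetical document with two language versions.
    for language in ("English", "French"):
        print(target_path("downloads", "DOC-001", "report.pdf", language))
```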

Also, if a document on [url removed, login to view] is available as a PDF, there is no need to download the DOC version as well.
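The PDF-over-DOC rule above might be applied per document roughly as follows. This is a Python illustration of the logic only (the job asks for Perl); it assumes the caller has already grouped the links belonging to one document, and the function name `choose_downloads` is hypothetical.

```python
import posixpath
from urllib.parse import urlparse


def choose_downloads(urls):
    """From one document's links, keep the PDFs; fall back to DOCs if there are none."""
    by_extension = {".pdf": [], ".doc": []}
    for url in urls:
        extension = posixpath.splitext(urlparse(url).path)[1].lower()
        if extension in by_extension:
            by_extension[extension].append(url)
    # An empty PDF list is falsy, so the DOC list is used only when no PDF exists.
    return by_extension[".pdf"] or by_extension[".doc"]


if __name__ == "__main__":
    # Hypothetical links for a single document.
    print(choose_downloads(["a.doc", "a.pdf", "a.jpg"]))
    print(choose_downloads(["b.doc"]))
```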

Skills: Engineering, MySQL, Perl, PHP, Software Architecture, Software Testing

About the Employer:
(23 reviews) Romania

Project ID: #2964145

8 freelancers are bidding on average $85 for this job

tzo

See private message.

$85 USD in 5 days
(312 reviews)
6.8

kseen

See private message.

$85 USD in 5 days
(32 reviews)
3.8

jeremiahdodds

See private message.

$85 USD in 5 days
(26 reviews)
4.9

alienwebdevvw

See private message.

$85 USD in 5 days
(6 reviews)
2.9

alienwebsl

See private message.

$85 USD in 5 days
(6 reviews)
2.7

thayumanavar

See private message.

$85 USD in 5 days
(2 reviews)
1.8

spensorzvw

See private message.

$85 USD in 5 days
(0 reviews)
0.0

vw7352331vw

See private message.

$85 USD in 5 days
(3 reviews)
0.0