Perl crawler for bilingual web pages

I need a Perl script that can do what the attached PDF files describe:

1. take up to 1,000 initial keywords in Language A,

2. search on Google (using Search API) for pages containing them,

3. use the relevant pages as starting points for crawling the Internet for new pages, until a certain threshold is reached (say, 1,000,000 pages),

4. search the pages for both the keywords in Language A and their equivalents in Language B (in order to find any parallel texts, i.e. the same document in two languages),

5. filter the results into categories, based on matching score

6. identify the parallel texts (using the technique described in PDF file #2 and also the occurrence of parallel words)
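
As a rough illustration of steps 4 and 5, here is a minimal sketch of how a page could be scored and put into a category. It is written in Python only to show the idea; the deliverable is still a Perl script. The function names, the category labels, the thresholds, and the English/Romanian keyword pairs are all assumptions made for this example and are not taken from the attached PDF files. The score used here is simply the fraction of keyword pairs for which both the Language A word and its Language B equivalent occur on the page.

```python
import re

# Hypothetical thresholds and labels; the real categories are still to be agreed.
THRESHOLDS = [(0.5, "high"), (0.2, "medium")]


def contains_word(text, word):
    """True if `word` occurs in `text` as a whole word, ignoring case."""
    pattern = r"\b" + re.escape(word) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


def match_score(text, pairs):
    """Fraction of (Language A, Language B) pairs with both words on the page."""
    if not pairs:
        return 0.0
    both = sum(
        1 for word_a, word_b in pairs
        if contains_word(text, word_a) and contains_word(text, word_b)
    )
    return both / len(pairs)


def categorize(score):
    """Map a matching score to a category label."""
    if score <= 0:
        return "none"
    for threshold, label in THRESHOLDS:
        if score >= threshold:
            return label
    return "low"


# Illustrative keyword pairs and page text (made up for this example).
pairs = [("house", "casa"), ("dog", "câine"), ("water", "apă"), ("book", "carte")]
page = "The house is near the water. Casa este lângă apă."

score = match_score(page, pairs)
print(score, categorize(score))  # prints "0.5 high"
```

Whole-word matching with `\b` assumes languages that separate words with spaces or punctuation; other languages would need a different tokenization step.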

Please read the attached PDF files for more details, especially about how someone else implemented such a process in the first place.

## Deliverables

The goal of this project is quite simple: crawl the Internet and identify parallel texts.

Skills: Engineering, MySQL, Perl, PHP, Software Architecture, Software Testing

About the Employer:
( 23 reviews ) Romania

Project ID: #3012460