I need a Perl script that does what the attached PDF files describe:
1. Take up to 1,000 initial keywords in Language A.
2. Search Google (using the Search API) for pages containing them.
3. Use the relevant pages as starting points for crawling the Internet for new pages, until a certain threshold is reached (say, 1,000,000 pages).
4. Search the crawled pages for both the keywords in Language A and their equivalents in Language B, in order to find parallel texts (i.e. the same document in two languages).
5. Filter the results into categories based on matching score.
6. Identify the parallel texts, using the technique described in PDF file #2 and also the occurrence of parallel words.
Please read the attached PDF files for more details, especially on how someone else implemented such a process in the first place.
The goal of this project is quite simple: crawl the Internet and identify parallel texts.
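To make steps 4 and 5 more concrete, here is a rough Perl sketch of what a matching score and its categories could look like. It is only an illustration under assumptions of mine, not the technique from the PDF files: the keyword pairs, the function names `match_score` and `category`, and the 0.6 / 0.3 thresholds are all hypothetical placeholders. The score is simply the fraction of keyword pairs for which both the Language A word and its Language B equivalent appear in the page text.

```perl
use strict;
use warnings;

# Hypothetical keyword pairs: Language A word => Language B equivalent.
my %pairs = (
    'house' => 'maison',
    'water' => 'eau',
    'book'  => 'livre',
);

# Fraction of keyword pairs for which BOTH the Language A word and its
# Language B equivalent occur in the text (case-insensitive, whole words).
sub match_score {
    my ($text, $pairs) = @_;
    my $total = keys %$pairs;          # number of pairs
    return 0 unless $total;
    my $hits = 0;
    for my $word_a (keys %$pairs) {
        my $word_b = $pairs->{$word_a};
        $hits++
            if $text =~ /\b\Q$word_a\E\b/i
            && $text =~ /\b\Q$word_b\E\b/i;
    }
    return $hits / $total;
}

# Assumed thresholds for the categories; to be tuned on real data.
sub category {
    my ($score) = @_;
    return 'high'   if $score >= 0.6;
    return 'medium' if $score >= 0.3;
    return 'low';
}

# Example usage on a made-up page text.
my $page  = "The house is near the water. La maison est au bord de l'eau.";
my $score = match_score($page, \%pairs);
printf "%.2f %s\n", $score, category($score);
```

In the real script the text would come from the crawled pages, and the score would be only one input to step 6, alongside the technique described in PDF file #2.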