In Progress

language web crawler

We want to crawl the web to get: 1) lists of the words used in different languages on the web, and 2) a count of the number of times each word is found in each language, UNTIL WE HAVE A STATISTICALLY SIGNIFICANT SAMPLE. Maybe 1000 pages of each language? We do not have a list of URLs we want to use. All that matters is that we do not count the same page twice. Other than that, ANY 1000 pages of each language will be fine.

I imagine that the program will crawl pages by charset, CHECK that the page really is in the language its charset tag suggests by looking for the simplest words in that language (see CHECK below), count the words on the page, note which page it is so it does not get counted again, and move on.

CHECK: Because charset tags are not always reliable, we would pick 20 (or so) really common words that are unique to each language. An English example: the, an, in, are, is, and, to, on, this, a, by, that, were, have, been, will, of ... and then look for a meaningful subset of them to appear on a page before deciding what language it is. (Obviously, we would test the search mechanism "by hand" first to be sure it worked in each language.) Note: I will identify the "check" words for each language, and will accordingly be responsible for the quality of this language filter.

The app will place the words and counts into an Excel spreadsheet (one sheet per language). As an example, after using this tool in English (and sorting by frequency within Excel) there would be a VERY long list of words, each with a number next to it (indicating how many times it was found), like:

- the: 9,323,343
- of: 9,028,282
- and: 9,003,939
- a: 8,757,232
- etc.

The languages of interest are: Afrikaans, Arabic, Bulgarian, Catalan, Pinyin (Chinese), Croatian, Czech, Dutch, English, Estonian, Finnish, French, German, Greek, Hebrew, Hindi, Hungarian, Icelandic, Indonesian, Italian, Japanese, Korean, Latvian, Lithuanian, Malay, Norwegian, Polish, Portuguese, Romanian, Serbian, Slovak, Slovenian, Spanish, Swahili, Swedish, Tagalog, Thai, Turkish, Ukrainian and Vietnamese.
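The CHECK, count, and no-double-counting steps described above could be sketched roughly as follows. This is only an illustration in Python, not a proposed implementation: the names `CHECK_WORDS`, `LanguageTally`, `looks_like`, and the `min_hits=8` threshold are assumptions made for the example, and the English check words are simply the ones listed in the brief. Fetching pages and writing the Excel sheets are left out. Splitting on letter runs is also a simplification that would not work as-is for scripts written without spaces between words (for example Thai or Japanese).

```python
import re
from collections import Counter
from urllib.parse import urldefrag

# Check words per language (hypothetical structure; the English list
# comes from the example in the brief, other languages would be added).
CHECK_WORDS = {
    "english": {"the", "an", "in", "are", "is", "and", "to", "on", "this",
                "a", "by", "that", "were", "have", "been", "will", "of"},
}

# Runs of Unicode letters (no digits, no underscores).
WORD_RE = re.compile(r"[^\W\d_]+")


def tokenize(text):
    """Split page text into lowercase words."""
    return [w.lower() for w in WORD_RE.findall(text)]


def looks_like(language, words, min_hits=8):
    """True when at least min_hits distinct check words occur on the page.

    min_hits is an assumed threshold standing in for the brief's
    "meaningful subset"; it would need tuning by hand per language.
    """
    return len(CHECK_WORDS[language] & set(words)) >= min_hits


class LanguageTally:
    """Word counts for one language, counting each page at most once."""

    def __init__(self, language, page_limit=1000):
        self.language = language
        self.page_limit = page_limit
        self.seen_urls = set()
        self.counts = Counter()
        self.pages = 0

    def add_page(self, url, text):
        """Count the page's words; return True only if the page was counted."""
        key = urldefrag(url).url  # ignore "#fragment" so a page is not counted twice
        if self.pages >= self.page_limit or key in self.seen_urls:
            return False
        self.seen_urls.add(key)
        words = tokenize(text)
        if not looks_like(self.language, words):
            return False
        self.counts.update(words)
        self.pages += 1
        return True
```

Once a tally has enough pages, `tally.counts.most_common()` would give the word list sorted by frequency, which is the order the brief describes for the spreadsheet.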

## Deliverables

1) Complete and fully-functional working program(s) in executable form as well as complete source code of all work done.
2) Installation package that will install the software (in ready-to-run condition) on the platform(s) specified in this bid request.
3) Exclusive and complete copyrights to all work purchased. (No GPL, 3rd party components, etc. unless all copyright ramifications are explained AND AGREED TO by the buyer on the site.)

## Platform

We are running Windows 2000, IE 6, and Excel 2002.

Skills: Database Administration, Engineering, MySQL, PHP, Software Architecture, Software Testing, SQL, Web Hosting, Website Management, Website Testing


About the Employer:
(9 reviews) United States

Project ID: #3000944

Awarded to:

techxpertAsh

See private message.

$85 USD in 25 days
(644 reviews)
7.9

5 freelancers are bidding an average of $357 for this job

olegsavchuk

See private message.

$425 USD in 25 days
(103 reviews)
7.0

snakebytero

See private message.

$425 USD in 25 days
(10 reviews)
4.7

superpedro

See private message.

$425 USD in 25 days
(6 reviews)
3.7

fernandocerini

See private message.

$425 USD in 25 days
(0 reviews)
0.0