Cancelled

Wine Name Extractor + Wrapper Generation for Stores (repost)

I need a tool that will do the following:

A. A wine name extractor: The extractor will be given an arbitrary web page and will mark the names of the wines listed there. The format of the wine names is typically not standardized, although it follows an understandable structure. Typically it contains the name of the winery (e.g., Château Margaux), the location of the wine (e.g., Carneros), the year of the wine, the type of the grape/wine (e.g., chianti or shiraz), potential modifiers (e.g., Reserve, Grand Cru), and the price.

B. A tool that will interact with a wine store website to identify the listed wines and their prices. I attach a list of wine stores of interest.

## Deliverables

The basic idea is to write a trainable extractor that will work as follows:

1. Take as input indicative training values for the different parts of the name of the wine. Each wine name typically consists of some "standard" parts (e.g., one part will be the grape, another the winery, another the location, and so on).

2. Using the data, we can build an extractor that will learn to extract wine names from multiple stores (potentially learning a different extraction model for each store). The models should be derived automatically.

3. Using the derived extraction models, we can identify more wines in each store and learn more candidate names for wines, wineries, and so on.

4. In a validation step, a human annotator marks the newly extracted candidate names as correct or incorrect.

5. (Optional, but useful) Use the negative answers from the human to re-examine the extraction patterns and remove/modify the ones that cause problems.

6. Go back to Step 1 and repeat the process until there is no further improvement.
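To make the loop above concrete, here is a minimal sketch of one way it could be wired together. Everything in it is an assumption for illustration: the names `propose_candidates`, `bootstrap`, and `annotate` are hypothetical, wine listings are treated as plain lowercase token strings, and the "extractor" is a trivial co-occurrence rule rather than a learned model.

```python
def propose_candidates(lines, lexicon):
    """Steps 2-3 (simplified): unknown tokens that share a line with a known term."""
    known = set().union(*lexicon.values())
    candidates = set()
    for line in lines:
        tokens = line.lower().split()
        if any(tok in known for tok in tokens):
            candidates.update(tok for tok in tokens if tok not in known)
    return candidates


def bootstrap(lines, lexicon, annotate, max_rounds=10):
    """Repeat extraction and validation until a round adds nothing new.

    `lexicon` maps a field name (e.g. "grape") to a set of known terms.
    `annotate` stands in for the human annotator: it takes a candidate
    token and returns a field name, or None to reject the candidate.
    """
    rejected = set()
    for _ in range(max_rounds):
        candidates = propose_candidates(lines, lexicon) - rejected
        accepted = {}
        for cand in sorted(candidates):
            field = annotate(cand)          # step 4: human validation
            if field is None:
                rejected.add(cand)          # step 5: remember negatives
            else:
                accepted[cand] = field
        if not accepted:                    # step 6: stop when nothing improves
            break
        for cand, field in accepted.items():
            lexicon.setdefault(field, set()).add(cand)
    return lexicon, rejected
```

In this sketch the rejected set only prevents a bad candidate from being proposed again; a real implementation would also use it to revise the extraction patterns themselves.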

**More details:**

For creating the initial list of wine names, grapes, and producers, Wikipedia is a very good source for data about wine names:

[url removed, login to view]

[url removed, login to view]

And in general:

[url removed, login to view]:Wine-related_lists

Names of wine producers are available from

[url removed, login to view]

Marking prices and years is comparatively easy, as we only need a regular expression.
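As a rough illustration, the patterns below mark years and prices in a line of text. They rest on assumptions not stated in the posting: vintages are four-digit years between 1900 and 2099, and prices are written with a leading `$`, `€`, or `£` symbol. The function name `mark_years_and_prices` is made up for this sketch.

```python
import re

# Assumption: a vintage is a four-digit year 1900-2099 that is not
# immediately preceded by a currency symbol (so "$2005" is not a year).
YEAR_RE = re.compile(r"(?<![$€£])\b(?:19|20)\d{2}\b")

# Assumption: a price is a currency symbol, digits with optional
# thousands separators, and optional two-digit cents.
PRICE_RE = re.compile(r"[$€£]\s?\d+(?:,\d{3})*(?:\.\d{2})?")


def mark_years_and_prices(text):
    """Return (years, prices) found in text, as lists of matched strings."""
    years = [m.group() for m in YEAR_RE.finditer(text)]
    prices = [m.group() for m in PRICE_RE.finditer(text)]
    return years, prices
```

Stores that write prices as "45.00 USD" or "EUR 45" would need additional patterns.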

**Even more details**

Now, my own take on how to build such a tool: The wine names contain (with some probability) some of the elements described in point A, and there is some probability of transition from one field to another (a Markov model). Using the initial database of wine-name elements, it is possible to build a coarse model that will be able to recognize wines. When the model is applied to strings that have some unknown elements, each unknown string will be assigned to a field with some probability and marked as a candidate feature. Then the annotator examines the derived candidate features and marks them as "good" or "bad". The model is then retrained and the process continues. Of course, you may choose to implement the tool differently.
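The idea can be sketched with a tiny hidden Markov model decoded by the Viterbi algorithm. This is only an illustration under stated assumptions: the field set, the seed lexicon, the transition probabilities, and the unnormalised emission scores are all invented here, and in the real tool they would be estimated from the training data and retrained after each annotation round.

```python
import math

STATES = ["winery", "year", "grape", "modifier"]
FLOOR = 1e-6  # score for transitions/emissions the toy model considers implausible

# Hypothetical seed lexicon (would come from the initial database).
LEXICON = {
    "winery": {"ridge", "margaux"},
    "grape": {"shiraz", "zinfandel", "chianti"},
    "modifier": {"reserve"},
}

# Illustrative hand-set transition probabilities between fields.
TRANS = {
    "START": {"winery": 0.7, "year": 0.2, "grape": 0.1},
    "winery": {"winery": 0.3, "year": 0.4, "grape": 0.3},
    "year": {"grape": 0.6, "winery": 0.3, "modifier": 0.1},
    "grape": {"modifier": 0.5, "grape": 0.3, "year": 0.2},
    "modifier": {"modifier": 0.5, "year": 0.5},
}


def is_year(token):
    return token.isdigit() and len(token) == 4


def emission(state, token):
    """Unnormalised score for `state` emitting `token`."""
    if state == "year":
        return 0.9 if is_year(token) else FLOOR
    if is_year(token):
        return FLOOR
    # Known terms score high; unknown terms keep a small chance in any field.
    return 0.9 if token in LEXICON.get(state, ()) else 0.05


def viterbi(tokens):
    """Return the most likely field sequence for the tokens."""
    if not tokens:
        return []
    # best[s] = (log score, path) of the best path ending in state s
    best = {}
    for s in STATES:
        score = TRANS["START"].get(s, FLOOR) * emission(s, tokens[0])
        best[s] = (math.log(score), [s])
    for tok in tokens[1:]:
        new_best = {}
        for s in STATES:
            options = []
            for prev in STATES:
                step = TRANS[prev].get(s, FLOOR) * emission(s, tok)
                options.append((best[prev][0] + math.log(step), best[prev][1]))
            score, path = max(options, key=lambda opt: opt[0])
            new_best[s] = (score, path + [s])
        best = new_best
    return max(best.values(), key=lambda entry: entry[0])[1]


def candidate_features(tokens):
    """Unknown tokens paired with the field the model assigned to them."""
    known = set().union(*LEXICON.values())
    return [(tok, state) for tok, state in zip(tokens, viterbi(tokens))
            if tok not in known and not is_year(tok)]
```

For example, `candidate_features("ridge 2005 carignane".split())` proposes `carignane` as a grape, which is the kind of candidate the annotator would then confirm or reject.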

There are also tools available that can be used to facilitate this work. Simile/Piggy Bank and Dapper provide some online tools for building this tool in a different manner than the one suggested above.

I understand that the accuracy of such a tool will never be perfect. Also, there is no need to try to deal with complicated interfaces for the websites. A standard crawler (e.g., wget) will be used to retrieve the pages from each website. The tool should work on the retrieved, locally stored HTML pages.
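A minimal sketch of the input side, using only the Python standard library: it walks a directory of pages saved by the crawler and yields the visible text chunks from each one. The names `TextExtractor`, `visible_text`, and `iter_pages` are hypothetical, and the sketch assumes the saved files end in `.htm` or `.html` and can be decoded as UTF-8 (undecodable bytes are replaced).

```python
from html.parser import HTMLParser
from pathlib import Path


class TextExtractor(HTMLParser):
    """Collect visible text chunks, skipping <script> and <style> content."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.chunks.append(text)


def visible_text(html):
    """Return the list of non-empty text chunks in an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks


def iter_pages(root):
    """Yield (path, text chunks) for every stored HTML page under root."""
    for path in sorted(Path(root).rglob("*.htm*")):
        html = path.read_text(encoding="utf-8", errors="replace")
        yield path, visible_text(html)
```

Each text chunk can then be passed to the name extractor as a candidate wine listing.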

**Agreeing on deliverables**

Since it is unclear how to measure "success" for such a tool, we will agree before starting on the following:

* What is the target accuracy of the tool for name extraction? What false positive and false negative ratios will be considered acceptable? How do you plan to compute these ratios (cross-validation?)

* Similarly, what percentage of the websites will the developed tool manage to process? (We have approximately 3000 websites.) How will you measure failure/success at the website level?
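One possible way to pin these numbers down is sketched below. The definitions are assumptions to be agreed on, not requirements from this posting: the false positive ratio is taken as the share of extracted names that are wrong, the false negative ratio as the share of true names that were missed, and a website counts as a success when its false negative ratio stays under a hypothetical threshold. With cross-validation, the same ratios would be computed on each held-out fold and averaged.

```python
def extraction_error_ratios(predicted, gold):
    """Compare extracted names with hand-labelled names for one page or site."""
    predicted, gold = set(predicted), set(gold)
    true_pos = len(predicted & gold)
    false_pos = len(predicted - gold)
    false_neg = len(gold - predicted)
    return {
        # Share of extracted names that are not real wine names.
        "false_positive_ratio": false_pos / len(predicted) if predicted else 0.0,
        # Share of real wine names that the extractor missed.
        "false_negative_ratio": false_neg / len(gold) if gold else 0.0,
        "true_positives": true_pos,
    }


def site_success_rate(per_site, max_false_negative=0.2):
    """Fraction of sites whose false negative ratio is within the threshold.

    `per_site` maps a site name to a (predicted, gold) pair.
    The 0.2 default is a placeholder, not an agreed target.
    """
    if not per_site:
        return 0.0
    ok = sum(
        1 for predicted, gold in per_site.values()
        if extraction_error_ratios(predicted, gold)["false_negative_ratio"]
        <= max_false_negative
    )
    return ok / len(per_site)
```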

Furthermore, it would help if you can answer the following questions:

* Have you ever developed something similar in the past?

* What approach do you plan to follow for this project?

## Platform

Pretty much anything that works is fine :-)

Skills: Database Administration, Engineering, MySQL, PHP, Software Architecture, Software Testing, SQL, Web Hosting, Website Management, Website Testing


About the Employer:
(54 reviews) Thessaloniki, Greece

Project ID: #3032857

12 freelancers are bidding on average $2975 for this job

| Freelancer | Message | Bid | Duration | Reviews | Score |
|---|---|---|---|---|---|
| zhijun | See private message. | $1700 USD | 60 days | 25 | 5.2 |
| topcodersteam | See private message. | $4250 USD | 60 days | 8 | 5.0 |
| monssoengroupvw | See private message. | $4250 USD | 60 days | 2 | 5.7 |
| davidzhangvw | See private message. | $1700 USD | 60 days | 4 | 3.6 |
| dreamexpertsl | See private message. | $4250 USD | 60 days | 1 | 2.8 |
| a4code | See private message. | $170 USD | 60 days | 6 | 1.9 |
| greentech12vw | See private message. | $4250 USD | 60 days | 1 | 0.0 |
| mirosoftvw | See private message. | $4250 USD | 60 days | 1 | 0.0 |
| kimirizltd | See private message. | $4250 USD | 60 days | 1 | 0.0 |
| nokc | See private message. | $4250 USD | 60 days | 1 | 0.0 |
| agajn | See private message. | $425 USD | 60 days | 0 | 0.0 |
| outsourcesovw | See private message. | $1955 USD | 60 days | 0 | 0.0 |