Cancelled

Coder Needed For Complex Web Scraping Script

This will be a multi-part script that will:

1. record the project name and data fields

2. learn data locations via a web-based interactive script

3. retrieve data automatically

4. report any errors in the retrieval process

The scraped data will need to be incorporated into a MySQL database for data extraction by an existing website. New pages will be needed for this as a secondary project.

## Deliverables

I need a complex web scraper built for me. I say complex because it will be required to pull data from many websites with different layouts. The first task for the winning bidder will be to create an input file of URLs from my existing database. Each of these URLs will be the home page of one of the sites we will be collecting data from. Creating this input file should be very simple, as my current website displays these URLs on one of my pages. The file will contain two pieces of data: the website's unique number and the URL of the website's home page.
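A minimal sketch of that export step, assuming a hypothetical `websites` table with `id` and `home_url` columns (the real table name, column names, and credentials would come from the existing database):

```php
<?php
// Export each website's unique number and home-page URL to a CSV input file.
// The `websites` table and the credentials below are assumptions, not part of the brief.
$pdo = new PDO('mysql:host=localhost;dbname=mysite', 'user', 'secret');
$out = fopen('input_urls.csv', 'w');

foreach ($pdo->query('SELECT id, home_url FROM websites ORDER BY id') as $row) {
    fputcsv($out, [$row['id'], $row['home_url']]);
}
fclose($out);
```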

There will be four parts to the actual scraper script.

The first part will work with a user to name a project and all of the data fields that will need to be captured. For my first project with the script you build, there might be 8 to 12 pieces of data that will need to be collected from each site, and they may reside on multiple pages. Each of these data fields will need to be given a unique name. So, I might call the project "toy prices" and the eight data fields might be "mattel-truck", "Hess-truck", "dump-truck", etc.
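One possible way to persist part one's output, sketched as two hypothetical MySQL tables created from PHP; the brief does not fix a schema, so every name here is illustrative (the 23-character field-name limit mirrors positions 7-29 of the record layout further down):

```php
<?php
// Part one sketch: record a project name and its uniquely named data fields.
$pdo = new PDO('mysql:host=localhost;dbname=scraper', 'user', 'secret');
$pdo->exec('CREATE TABLE IF NOT EXISTS projects (
    id   INT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(64) NOT NULL UNIQUE
)');
$pdo->exec('CREATE TABLE IF NOT EXISTS data_fields (
    id         INT AUTO_INCREMENT PRIMARY KEY,
    project_id INT NOT NULL,
    name       VARCHAR(23) NOT NULL,
    UNIQUE KEY (project_id, name)
)');

// Example: the "toy prices" project and a few of its named fields.
$pdo->prepare('INSERT INTO projects (name) VALUES (?)')->execute(['toy prices']);
$projectId = $pdo->lastInsertId();
$stmt = $pdo->prepare('INSERT INTO data_fields (project_id, name) VALUES (?, ?)');
foreach (['mattel-truck', 'Hess-truck', 'dump-truck'] as $field) {
    $stmt->execute([$projectId, $field]);
}
```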

The second part of the script will work as a web-based interactive program. In this part, each data field's location at every website in the input file (both the URL and the exact location on the page) will be recorded by the script with the help of a user. The script will start by reading the input file of URLs one at a time and display the home page of the first site in a work box on the user's screen. By "work box" I mean that part of the screen (like the header and left-hand column) will be for the user to communicate with the script, while the rest of the screen will show the actual website.

The user will then go through each of the data fields needed from this site one by one and define the URL and exact page location on the screen, so that the script can record this information for each field for later automatic retrieval in part three. To do this, the user must be able to change the URL (navigate from the home page) to reach the proper URL where the data resides. The user will select each data field (perhaps they will all show in the left-hand column of the screen) one at a time and then highlight (select) that field on the website. From the user's highlighting, the script must be able to record each data field's exact position, so that in the end, for every data field at every website we want to collect data from, the script will learn and create a record.

The record layout will look something like this:

Positions:

1-6 website unique number
7-29 data-field-1-name
30-60 data-field-1-name-description (text/decimal/size)
61-90 data-field-1-name-url
91-119 data-field-1-name-page-location (starting row/column)
121-130 current date of data collection
131-140 exact time of data capture
141-150 data-field-1-data
151-180 error message, if any (blank if none)
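Because the project file is a fixed-width flat file, each record can be packed and unpacked purely by position. A sketch following the layout above; note the brief skips position 120, so one blank column is emitted there, and all widths are taken directly from the positions listed:

```php
<?php
// Pack one field record into the 180-character fixed-width layout above.
function packRecord(array $r): string {
    return str_pad(substr($r['site_id'],  0,  6),  6)   // 1-6     website unique number
         . str_pad(substr($r['name'],     0, 23), 23)   // 7-29    data field name
         . str_pad(substr($r['desc'],     0, 31), 31)   // 30-60   description (text/decimal/size)
         . str_pad(substr($r['url'],      0, 30), 30)   // 61-90   URL where the field lives
         . str_pad(substr($r['location'], 0, 29), 29)   // 91-119  page location (row/column)
         . ' '                                          // 120     unused in the brief
         . str_pad(substr($r['date'],     0, 10), 10)   // 121-130 collection date
         . str_pad(substr($r['time'],     0, 10), 10)   // 131-140 capture time
         . str_pad(substr($r['data'],     0, 10), 10)   // 141-150 the captured data
         . str_pad(substr($r['error'],    0, 30), 30);  // 151-180 error message, blank if none
}

// Unpack a record back into named parts (substr offsets are zero-based).
function unpackRecord(string $line): array {
    return [
        'site_id'  => trim(substr($line,   0,  6)),
        'name'     => trim(substr($line,   6, 23)),
        'desc'     => trim(substr($line,  29, 31)),
        'url'      => trim(substr($line,  60, 30)),
        'location' => trim(substr($line,  90, 29)),
        'date'     => trim(substr($line, 120, 10)),
        'time'     => trim(substr($line, 130, 10)),
        'data'     => trim(substr($line, 140, 10)),
        'error'    => trim(substr($line, 150, 30)),
    ];
}
```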

So, if there were 1,000 websites to collect data from and 8 pieces of data to collect from each, we should have 8,000 records in the project file showing the exact location of each piece of data and the data itself, along with any error message there might be if the data could not be collected (e.g. the URL was no good, or the data was supposed to be decimal but the script found text, etc.). All 8,000 of these records will be recorded/written during the user-interactive second section of the script.
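A sketch of the kind of type check implied here, assuming the field description (positions 30-60) declares whether the value should be decimal or text:

```php
<?php
// Validate a captured value against its declared type; return an error message, or '' if OK.
function validateField(string $value, string $type): string {
    if ($value === '') {
        return 'no data found at recorded location';
    }
    if ($type === 'decimal' && !is_numeric($value)) {
        return 'expected decimal but found text';
    }
    return ''; // blank means no error, matching positions 151-180 of the record
}
```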

Also, with this file you can see how we could selectively go out and scrape the data for just one website, or go to every website and gather just data-field-2 from all of them, etc. The script will be able to do this because in section three, the automated retrieval of the data, it will first read an input record containing the information it uses to determine exactly what to do. This auto-update section will need to run as a cron-type job. The third section's auto-update record will look something like this:

Position:

1-9 starting website number
10-20 ending website number
30 if position 30 contains a 1, get data-field-1; if it is 0, do not
31 if position 31 contains a 1, get data-field-2; if it is 0, do not
32 if position 32 contains a 1, get data-field-3; if it is 0, do not
33-37 the same pattern, one flag per position, all the way through data-field-8

From this record we can see that if the starting website number is 1, the ending number equals the last website, and all of the data-field flags are set to 1, then the script will go and retrieve all 8 data fields from all 1,000 websites.
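A sketch of reading that control record, assuming one flag per position from 30 onward (positions 21-29 are unused in the brief):

```php
<?php
// Parse the auto-update control record: which website range, and which of the 8 fields.
function parseControlRecord(string $line): array {
    $ctl = [
        'start_site' => (int) substr($line, 0, 9),   // positions 1-9
        'end_site'   => (int) substr($line, 9, 11),  // positions 10-20
        'fields'     => [],
    ];
    for ($i = 1; $i <= 8; $i++) {
        // positions 30-37 hold one 0/1 flag per data field
        $ctl['fields'][$i] = substr($line, 28 + $i, 1) === '1';
    }
    return $ctl;
}
```

Running this as a cron-type job would then just be a crontab entry, e.g. `0 2 * * * php /path/to/auto_update.php` (path hypothetical), with the script reading the control record first and scraping accordingly.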

The fourth section of the script will be the exception reporting. During the auto-update cycle, any time the script encounters an error, a message should be written on the record as well as to an error report. This error report will describe the error as well as possible, so that a user can use section two of the script to correct the defined position for the field where the error was encountered.
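A sketch of that dual write: the message goes into the record's error positions (151-180) and is also appended to a human-readable report with enough context for a user to re-teach the location in section two:

```php
<?php
// Append one line to the error report; the same message also fills
// positions 151-180 of the data record (see packRecord above).
function reportError(string $siteId, string $fieldName, string $url, string $message): void {
    $line = sprintf("[%s] site %s, field %s, url %s: %s\n",
                    date('Y-m-d H:i:s'), $siteId, $fieldName, $url, $message);
    file_put_contents('error_report.txt', $line, FILE_APPEND);
}
```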

Skills: Engineering, MySQL, PHP, Project Management, Software Architecture, Software Testing, Web Hosting, Website Management, Website Testing


About the Employer:
( 97 reviews ) Oakland, United States

Project ID: #3053946

6 freelancers are bidding on average $300 for this job

All bids: "See private message."

| Freelancer | Bid | Reviews | Rating |
| --- | --- | --- | --- |
| arhamsoftltd | $425 USD in 14 days | 59 | 7.0 |
| bondarenkosergey | $85 USD in 14 days | 33 | 4.8 |
| mzabbasi | $382.50 USD in 14 days | 2 | 1.9 |
| vw7318512vw | $313.65 USD in 14 days | 8 | 1.3 |
| vw7437207vw | $340 USD in 14 days | 0 | 0.0 |
| webmend | $255 USD in 14 days | 0 | 0.0 |