Closed

build web spider

Hi

Looking to have a web spider built. The spider must adhere to the following guidelines.

1) Completely obey robots.txt files and meta tags in web pages.

2) Only request the robots.txt file once when indexing a website, i.e. if spidering a given site, request its robots.txt file only once for all pages in that website, and store the robots.txt file in the table at_Robot_Txt.

2 a) Check that my Spidername is not blocked by the website; if it is not blocked, continue to index pages.

2 b) Insert the robots.txt into database table at_Robot_Txt, and use the information from it to determine which pages can and cannot be indexed.

Columns:

• URL_Robot_Idx INT Primary Key

• BaseURL VarChar(100)

• RobotTxt VarChar(7500)

If no robots.txt file is found, enter "No text file found in Site". A rough sketch of this fetch-and-store step follows.
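Below is a minimal sketch, in .NET 2.0-era C#, of how the robots.txt fetch-and-store step might look against the at_Robot_Txt table described above. The class and method names are illustrative, it assumes URL_Robot_Idx is an IDENTITY column and a connection string is supplied by the caller, and the parsing of Disallow rules is left out.

using System;
using System.Data.SqlClient;
using System.Net;

class RobotsTxtFetcher
{
    // Illustrative only: fetches robots.txt for one base URL and stores it
    // in at_Robot_Txt, per requirement 2. Parsing the rules to decide which
    // pages may be indexed (requirement 2b) would still be needed.
    public static string FetchAndStore(string baseUrl, string connectionString)
    {
        string robotsTxt;
        try
        {
            using (WebClient client = new WebClient())
            {
                // Identify ourselves with the configurable user-agent (requirement 3).
                client.Headers[HttpRequestHeader.UserAgent] = "Spidername";
                robotsTxt = client.DownloadString(baseUrl.TrimEnd('/') + "/robots.txt");
            }
        }
        catch (WebException)
        {
            // Requirement: record the absence of a robots.txt file.
            robotsTxt = "No text file found in Site";
        }

        // Assumes URL_Robot_Idx is an IDENTITY column.
        using (SqlConnection conn = new SqlConnection(connectionString))
        using (SqlCommand cmd = new SqlCommand(
            "INSERT INTO at_Robot_Txt (BaseURL, RobotTxt) VALUES (@base, @txt)", conn))
        {
            cmd.Parameters.AddWithValue("@base", baseUrl);
            cmd.Parameters.AddWithValue("@txt", robotsTxt);
            conn.Open();
            cmd.ExecuteNonQuery();
        }
        return robotsTxt;
    }
}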

3) Allow the user to enter their own user-agent name, e.g. "Spidername".

4) Read from a list of banned words and permitted words.

5) If it finds any banned words, ignore the page.

6) If it finds any permitted words, index the page.

7) If it finds neither of the above, ignore the page (a small sketch of this decision follows).
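A small sketch of the banned/permitted word decision from requirements 4-7, assuming the two word lists have already been loaded (for example from text files or database tables); the class, method, and parameter names are illustrative.

using System;
using System.Collections.Generic;

class WordFilter
{
    // Illustrative decision logic for requirements 5-7: banned words are
    // checked first and always win, permitted words allow indexing, and a
    // page matching neither list is skipped.
    public static bool ShouldIndex(string pageText,
                                   List<string> bannedWords,
                                   List<string> permittedWords)
    {
        string text = pageText.ToLowerInvariant();

        foreach (string banned in bannedWords)
            if (text.Contains(banned.ToLowerInvariant()))
                return false;             // requirement 5: ignore page

        foreach (string permitted in permittedWords)
            if (text.Contains(permitted.ToLowerInvariant()))
                return true;              // requirement 6: index page

        return false;                     // requirement 7: neither list matched
    }
}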

8) Must be able to index 60,000+ pages a day.
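For context, 60,000 pages spread over 24 hours works out to roughly 0.7 pages per second on average, so a modest pool of concurrent download threads should be enough to meet this target, assuming ordinary page sizes and server response times.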

9) Must run on any Windows platform from Windows 2000 Professional, XP, or Server.

10) The user interface must be easy to use, and I should be able to see how the spider is progressing, similar to Visual Web Spider.

11) Take the list of URLs from the SQL Server 2005 table at_URLsToIndex (a read sketch follows the column list below).

Columns:

• URLID INT Primary Key

• URL VarChar(300)
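A minimal sketch of reading the work queue from at_URLsToIndex with plain ADO.NET, assuming the column layout above and a caller-supplied connection string; queue management (marking URLs as done, re-queuing failures) is left out and the names are illustrative.

using System.Collections.Generic;
using System.Data.SqlClient;

class UrlQueue
{
    // Illustrative only: pulls every URL from at_URLsToIndex.
    // A production spider would page through the table and track state.
    public static List<string> LoadUrls(string connectionString)
    {
        List<string> urls = new List<string>();
        using (SqlConnection conn = new SqlConnection(connectionString))
        using (SqlCommand cmd = new SqlCommand(
            "SELECT URLID, URL FROM at_URLsToIndex ORDER BY URLID", conn))
        {
            conn.Open();
            using (SqlDataReader reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                    urls.Add(reader.GetString(1));   // the URL column
            }
        }
        return urls;
    }
}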

12) When indexing a page, insert data into the following table: at_SpideredWebsites (an insert sketch follows the column list below).

Columns:

• PageURL VarChar(300)

• BaseURL VarChar(100)

• PageTitle VarChar(200) maximum 20 words

• PageParagraph VarChar(6000)

• PageSize VarChar(6) in KB

• PageLastUpdated VarChar(10) Format: 23 May 07

• ServerIpAddr VarChar(50)

• PageLevel INT, e.g. the site's top-level page = 100, one level deeper = 75, then 50, 25, and 0 for successively deeper pages

• PageSpidered SmallDateTime Format: 23 May 07
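A sketch of the insert into at_SpideredWebsites once a page has been fetched and cleaned, assuming the columns listed above; the values would come from the fetch and HTML-stripping steps described elsewhere in this brief, and the class and parameter names are illustrative.

using System;
using System.Data.SqlClient;

class PageWriter
{
    // Illustrative only: stores one spidered page, per requirement 12.
    public static void SavePage(string connectionString, string pageUrl, string baseUrl,
                                string title, string paragraph, string pageSizeKb,
                                string lastUpdated, string serverIp, int pageLevel)
    {
        const string sql =
            "INSERT INTO at_SpideredWebsites " +
            "(PageURL, BaseURL, PageTitle, PageParagraph, PageSize, " +
            " PageLastUpdated, ServerIpAddr, PageLevel, PageSpidered) " +
            "VALUES (@url, @base, @title, @para, @size, @updated, @ip, @level, @spidered)";

        using (SqlConnection conn = new SqlConnection(connectionString))
        using (SqlCommand cmd = new SqlCommand(sql, conn))
        {
            cmd.Parameters.AddWithValue("@url", pageUrl);
            cmd.Parameters.AddWithValue("@base", baseUrl);
            cmd.Parameters.AddWithValue("@title", title);
            cmd.Parameters.AddWithValue("@para", paragraph);
            cmd.Parameters.AddWithValue("@size", pageSizeKb);          // e.g. "34" (KB)
            cmd.Parameters.AddWithValue("@updated", lastUpdated);      // e.g. "23 May 07"
            cmd.Parameters.AddWithValue("@ip", serverIp);
            cmd.Parameters.AddWithValue("@level", pageLevel);
            cmd.Parameters.AddWithValue("@spidered", DateTime.Now);    // SmallDateTime column
            conn.Open();
            cmd.ExecuteNonQuery();
        }
    }
}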

13) Only index URLs that begin with http://

14) Remove all HTML tags before inserting into the database.

15) Ignore URLs that are invalid, e.g. malformed addresses such as [url removed]://[url removed], etc. (a validation sketch follows).
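A small sketch of the URL validation implied by requirements 13 and 15, assuming "invalid" means anything that is not a well-formed absolute http:// address; the class and method names are illustrative.

using System;

class UrlValidator
{
    // Illustrative only: accept a URL only if it parses as an absolute URI
    // with the http scheme (requirement 13); anything malformed or using
    // another scheme is ignored (requirement 15).
    public static bool IsIndexable(string candidate)
    {
        Uri uri;
        if (!Uri.TryCreate(candidate, UriKind.Absolute, out uri))
            return false;                        // malformed URL
        return uri.Scheme == Uri.UriSchemeHttp;  // must begin with http://
    }
}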

16) For the body text, take all text except the text inside drop-down lists (a stripping sketch follows).
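A rough sketch of the tag stripping from requirements 14 and 16, using simple regular expressions; a real implementation might prefer a proper HTML parser, and the patterns here are deliberately naive.

using System.Text.RegularExpressions;

class HtmlCleaner
{
    // Illustrative only: drop <select> blocks (drop-down lists), scripts and
    // styles, then remove all remaining tags (requirements 14 and 16).
    public static string ExtractBodyText(string html)
    {
        string text = Regex.Replace(html, @"<select[\s\S]*?</select>", " ",
                                    RegexOptions.IgnoreCase);
        text = Regex.Replace(text, @"<script[\s\S]*?</script>", " ",
                             RegexOptions.IgnoreCase);
        text = Regex.Replace(text, @"<style[\s\S]*?</style>", " ",
                             RegexOptions.IgnoreCase);
        text = Regex.Replace(text, @"<[^>]+>", " ");        // strip remaining tags
        return Regex.Replace(text, @"\s+", " ").Trim();      // collapse whitespace
    }
}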

I cannot state strongly enough that the spider must obey robots.txt files and must only request the file once per session, no matter how many threads are running; if the spider is stopped and restarted later, it should again request the robots.txt file only once and update the robots.txt table in the database. A sketch of one way to enforce this across threads follows.
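One possible way to guarantee the once-per-host robots.txt request across multiple worker threads is a shared dictionary behind a lock, chosen here because the project targets .NET 2.0-era Windows; the cache delegates the actual fetch-and-store to a routine like the one sketched earlier, and all names are illustrative.

using System.Collections.Generic;

class RobotsTxtCache
{
    private static readonly object _sync = new object();
    private static readonly Dictionary<string, string> _byHost =
        new Dictionary<string, string>();

    // Illustrative only: every worker thread goes through this method, so
    // robots.txt is fetched at most once per host per session regardless of
    // how many threads are running.
    public static string GetRobotsTxt(string baseUrl, string connectionString)
    {
        lock (_sync)
        {
            string cached;
            if (_byHost.TryGetValue(baseUrl, out cached))
                return cached;                   // already fetched this session

            // First request for this host: fetch once, store in at_Robot_Txt,
            // and remember it for the rest of the session.
            string robotsTxt = RobotsTxtFetcher.FetchAndStore(baseUrl, connectionString);
            _byHost[baseUrl] = robotsTxt;
            return robotsTxt;
        }
    }
}

Holding the single lock during the fetch keeps the sketch short; a production version would likely use per-host locks so threads crawling different sites do not block each other.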

If you cannot achieve the above, please do not apply for this project, as you will just be wasting my time and yours.

The budget for this project is $500, but get this right and I'll use you for the crawler that needs building.

George

Skills: .NET, C Programming


About the Employer:
(0 reviews) Washington, United Kingdom

Project ID: #150624