Closed

A selection of web scraping tools

We require several web scraping tools to be designed and built in order to obtain various pieces of information about multiple URLs.

The tools will be designed to be used by the client, so the interface must be clear and simple. Ideally the various tools will have a similar interface.

All tools are likely to follow a similar format: a list of URLs is entered, possibly by pointing the tool at a .txt or similar document. The tool will scrape the relevant data and present it in Excel or CSV format, with the extracted data in columns alongside the original input data.
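
For illustration only, a minimal Python sketch of this shared flow (read URLs from a .txt file, write a CSV with the input URLs in the first column) might look like the following; scrape_url is a hypothetical placeholder for whichever lookup a given tool performs:

import csv

def read_urls(path):
    """Read one URL per line from a plain-text file, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

def scrape_url(url):
    """Hypothetical placeholder: each tool implements its own lookup here."""
    return ""  # blank when no data is found

def run_tool(input_path, output_path):
    urls = read_urls(input_path)
    with open(output_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        for url in urls:
            # Original URL in the first column, scraped value(s) alongside.
            writer.writerow([url, scrape_url(url)])

if __name__ == "__main__":
    run_tool("urls.txt", "output.csv")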

Please see the attached Excel file “example output – see note different [url removed, login to view]” which gives possible examples of what the data might look like.

We need a quick turnaround on this work – the completed, working tools must be sent to us and approved by 23 December.

If the quality of the tools is high, there may be further opportunities for similar ad hoc projects in the future.

Technorati Authority Webscraper

The inputted data will be a list of URLs, perhaps in a .txt document, in the following format separated by carriage returns:

[url removed, login to view]

[url removed, login to view]

[url removed, login to view]

[url removed, login to view]

[url removed, login to view]

[url removed, login to view]

The tool will need to be capable of giving up to 10,000 results at a time.

The scraper will need to scrape the Technorati database for the Technorati Authority, and present the results with the URLs in the first column, and corresponding Technorati Authority number in the second column. See the “Technorati Authority” tab on the “example output” spreadsheet file.

Not all blogs are registered with Technorati; the output should return a blank value when the blog URL is not recognised by the Technorati database.

You may need to use proxies.
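
As a rough sketch only (the lookup endpoint and the "Authority" text pattern are assumptions, not confirmed details of the Technorati site), the per-URL lookup could work along these lines, returning a blank value when the blog is not recognised and allowing an optional proxy:

import re
import requests

def technorati_authority(blog_url, proxies=None):
    """Return the Technorati Authority for a blog URL, or "" if it is not listed.

    The search endpoint and the "Authority: NNN" text pattern are assumptions
    about the site's layout and would need to be checked against the live pages.
    """
    resp = requests.get("http://technorati.com/search",            # assumed endpoint
                        params={"q": blog_url}, proxies=proxies, timeout=30)
    match = re.search(r"Authority:\s*(\d+)", resp.text)
    return match.group(1) if match else ""                         # blank when unrecognised

# Proxies can be passed through, as the brief notes they may be needed:
# technorati_authority("http://example.com", proxies={"http": "http://proxyhost:8080"})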

Google PageRank calculator tool

Similar to the Technorati Authority tool, this tool will take a raw list of URLs, calculate the Google PageRank for each, and generate an output file with the URL in the first column and the corresponding PageRank score in the second column.

Inbound/Outbound link tool

We require a scraping tool to compile the numbers of inbound and outbound links for multiple URLs and present them in an Excel or CSV format.

The inputted data will be a list of URLs, perhaps in a .txt document or similar, in the following format separated by carriage returns:

[url removed, login to view]

[url removed, login to view]

[url removed, login to view]

[url removed, login to view]

[url removed, login to view]

[url removed, login to view]

The output spreadsheet file will present the URL in the first column, with the number of inbound links for that URL in the second column, and number of outbound links in the third column. See the “inbound & outbound links” tab on the “example output” spreadsheet file.

The tool will need to be capable of giving up to 10,000 results at a time.
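
As a hedged sketch of one possible approach: outbound links can be counted directly from each page's anchor tags, while the inbound count would have to come from a link-index or search-engine source, which is left as a hypothetical helper below:

import csv
from urllib.parse import urlparse

import requests
from bs4 import BeautifulSoup

def count_outbound_links(url):
    """Count anchors on the page that point to a different host than the page itself."""
    host = urlparse(url).netloc
    resp = requests.get(url, timeout=30)
    soup = BeautifulSoup(resp.text, "html.parser")
    outbound = 0
    for a in soup.find_all("a", href=True):
        target = urlparse(a["href"]).netloc
        if target and target != host:
            outbound += 1
    return outbound

def count_inbound_links(url):
    """Hypothetical helper: inbound counts would come from a link-index or
    search-engine source, which the brief does not specify."""
    return ""

def link_tool(input_path, output_path):
    with open(input_path, encoding="utf-8") as f:
        urls = [line.strip() for line in f if line.strip()]
    with open(output_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        for url in urls:
            # URL, inbound count, outbound count across the first three columns.
            writer.writerow([url, count_inbound_links(url), count_outbound_links(url)])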

Email scraping tool

We require a tool to extract all email address information from a blog page, or anything that looks like an email address; for example, many websites disguise their email contacts in the form “example at example dot com”. The tool will need to perform several iterations, as the information may appear on the page in different formats (a regex sketch follows the list below):

1. Extract any mailto: hyperlink and format only as the hyperlinked text

2. Extract any word which contains letters, followed by “@”, followed by more letters, i.e. _____@______. This will have to take into account any dots, hyphens or similar and NOT treat these as “separators”, but instead treat a space or carriage return as the separator.

3. Extract anything in the following format: “WORD WORD WORD AT WORD WORD WORD” and format as “wordwordword@wordwordword”

4. Extract anything in the following format: “WORD WORD WORD [AT] WORD WORD WORD” and format as “wordwordword@wordwordword”

5. Extract anything in the following format: “WORD WORD WORD @ WORD WORD WORD” and format as “wordwordword@wordwordword”
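
As a rough sketch of the five passes (the patterns and their ordering are assumptions and would need tuning against real pages), the extraction could be layered as follows:

import re
from bs4 import BeautifulSoup

def extract_emails(html):
    """Apply the extraction passes described above to one page's HTML."""
    soup = BeautifulSoup(html, "html.parser")
    found = []

    # 1. mailto: hyperlinks -- take the address from the href
    #    (the anchor text could be used instead, per the brief's wording).
    for a in soup.select('a[href^="mailto:"]'):
        found.append(a["href"][len("mailto:"):].split("?")[0])

    text = soup.get_text(" ")

    # 2. Plain addresses: word characters, dots or hyphens around an "@".
    found += re.findall(r"[\w.\-]+@[\w.\-]+\.\w+", text)

    # 3-5. Obfuscated forms such as "name at domain dot com", "name [at] domain dot com"
    #      or "name @ domain dot com"; multi-word local parts ("WORD WORD WORD AT ...")
    #      would need a further pass.
    obfuscated = re.findall(
        r"([\w.\-]+)\s*(?:\[at\]|\(at\)|\bat\b|@)\s*([\w\-]+(?:\s*(?:dot|\.)\s*[\w\-]+)+)",
        text, flags=re.IGNORECASE)
    for user, domain in obfuscated:
        found.append(user + "@" + re.sub(r"\s*(?:dot|\.)\s*", ".", domain, flags=re.IGNORECASE))

    # De-duplicate while preserving order, so repeats on the page appear once.
    seen, result = set(), []
    for email in found:
        key = email.lower()
        if key not in seen:
            seen.add(key)
            result.append(email)
    return result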

The tool will run from a list of URLs and place the results on the same row as the originating URL, as follows:

URL email email

URL email

URL email email email email

URL

URL email

See the email tab on the “example output” file.

Blogflux scraping tool

This tool will scrape the Blogflux directory ([url removed, login to view]) and extract all the URLs which are contained in the directory, along with the corresponding “categories” to which they are assigned within the directory.

As the directory will update as time goes on, we will need ongoing access to the tool; this will not be a single-use tool.

Each blog within the directory has an assigned page within Blogflux, for example [url removed, login to view], for which the blog address is http://ornamentalist.net. The output file will only present the outbound link (e.g. [url removed, login to view]).

The data would be presented in a spreadsheet with all the URLs on different rows, and the categories to which the blogs have been assigned populated across columns. See the “Blogflux” tab on the “example output” spreadsheet file.
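
A very rough sketch of the crawl loop follows; the directory URL pattern, the pagination, and the markup used to locate the outbound blog link and its categories are all assumptions and would need checking against the live directory:

import csv
import requests
from bs4 import BeautifulSoup

DIRECTORY_PAGE = "http://www.blogflux.com/directory/?page={0}"  # assumed URL pattern

def scrape_blogflux(max_pages, output_path):
    """Walk the directory pages, recording each blog's outbound URL plus its categories."""
    with open(output_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        for page in range(1, max_pages + 1):
            resp = requests.get(DIRECTORY_PAGE.format(page), timeout=30)
            soup = BeautifulSoup(resp.text, "html.parser")
            # Assumed markup: each listing block contains the outbound blog link
            # and one or more category links; the selectors are placeholders.
            for listing in soup.select(".listing"):
                blog_link = listing.select_one("a.blog-url")
                categories = [c.get_text(strip=True) for c in listing.select("a.category")]
                if blog_link is not None:
                    # Blog URL in the first column, categories across the following columns.
                    writer.writerow([blog_link["href"]] + categories)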

Blog Catalog scraping tool

See [url removed, login to view] – a directory which we would like to scrape for a complete list of URLs, with associated information.

For an example, see [url removed, login to view]; the information which we require is:

Blog URL (in this case [url removed, login to view])

Rating of the blog from users (scroll down to the comments; in this case currently 5.00)

Number of fans on Blog Catalog (see [url removed, login to view]; in this case 13)

Description (in this case “The fine arts, decorative arts and architecture of Europe, North America and Australia, 1650-1933”)

“Listed in” tags (in this case Art History, Architecture). These will use a different cell for each category (see example on attached file)

The tool will scrape the entire directory at once and therefore give a large output file with over 100,000 rows of data.

See the “Blog catalog” tab on the “example output” spreadsheet file. Note in particular the fact that “Art History” and “Architecture” are in different cells.
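
As an illustrative sketch for a single listing page (all selectors below are assumptions about the Blog Catalog markup and would need to be verified), the per-blog extraction might look like this, with each “Listed in” tag kept as its own cell:

import requests
from bs4 import BeautifulSoup

def scrape_listing(listing_url):
    """Return one output row for a single Blog Catalog listing page.

    The CSS selectors below are placeholders for the real page structure.
    """
    soup = BeautifulSoup(requests.get(listing_url, timeout=30).text, "html.parser")

    def text_of(selector):
        node = soup.select_one(selector)
        return node.get_text(strip=True) if node else ""

    blog_anchor = soup.select_one("a.blog-link")
    blog_url = blog_anchor["href"] if blog_anchor else ""
    rating = text_of(".rating")          # e.g. "5.00"
    fans = text_of(".fan-count")         # e.g. "13"
    description = text_of(".description")
    tags = [t.get_text(strip=True) for t in soup.select(".listed-in a")]

    # Fixed columns first, then each "Listed in" tag in its own column.
    return [blog_url, rating, fans, description] + tags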

Google search tool

We need a tool to search Google for given search terms and scrape the results into a spreadsheet. The tool would need to be capable of searching multiple terms at once. The input would be a list, perhaps in a .txt file, of search terms, for example:

food

“mobile phones”

technology blogs

“luxury jewellers” London

The tool will then populate a spreadsheet with columns containing: (1) the search term searched for; (2) the meta title of the link; (3) the URL; (4) the home URL or parent website; (5) the PageRank; (6) the Google description.

The tool will need a feature whereby the user can limit the number of search results scraped per search term (for example, the user might want to specify 100 results for each of the 4 example terms given above, giving a total of 400 results).

The tool should give the option of searching using either [url removed, login to view] or [url removed, login to view]

See the “Google search tool” tab on the “example output” file; in this case, the user would have specified 3 results per search term on google.com.

The tool will need to work so that search terms like related:[url removed, login to view] (Google related search) are also supported.
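
As a very rough sketch (the result-page markup and selectors below are assumptions and would need verifying), the per-term fetch could look like this; only the term's meta title, URL and description snippet are shown, with the home URL and PageRank to be derived afterwards:

import requests
from bs4 import BeautifulSoup

def google_search(term, num_results, domain="google.com"):
    """Fetch one results page for a term and return (title, url, snippet) tuples.

    The query parameters and the result markup (div.g, h3, .snippet) are
    assumptions about the results page; "related:" style operators are passed
    through unchanged in the q parameter.
    """
    params = {"q": term, "num": num_results}
    resp = requests.get("https://www." + domain + "/search", params=params,
                        headers={"User-Agent": "Mozilla/5.0"}, timeout=30)
    soup = BeautifulSoup(resp.text, "html.parser")
    results = []
    for hit in soup.select("div.g")[:num_results]:
        title = hit.select_one("h3")
        link = hit.select_one("a")
        snippet = hit.select_one(".snippet")
        results.append((
            title.get_text(strip=True) if title else "",
            link["href"] if link else "",
            snippet.get_text(strip=True) if snippet else "",
        ))
    return results

# Example: up to 100 results per term, on either google.com or google.co.uk.
rows = google_search('"luxury jewellers" London', 100, domain="google.co.uk")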

[url removed, login to view] scraper

We need a tool to scrape the entire [url removed, login to view] database of profiles and populate all the information on the profiles pages into an output file. This information will be:

(1) Profile URL, (2) Name, (3) Gender, (4) Email address if shown, (5) My Web Page URL, (6) IM username, (7) City/Town, (8) Region/State, (9) Country, (10) Industry, (11) Occupation, (12) About Me, (13) My Blogs, populated across multiple columns if applicable.

See the “[url removed, login to view] scraper” tab on the attached file for an example (note that in this example, not all the columns are populated as some of the information does not exist).
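
Since the profile page markup is not specified here, the sketch below only illustrates (under that assumption, and with hypothetical field names) how one parsed profile could be flattened into a single output row, with blanks for missing information and My Blogs spanning extra columns:

def profile_row(profile):
    """Flatten one scraped profile (a dict produced by whatever page parser is
    used) into a single spreadsheet row; 'my_blogs' spans extra columns."""
    fixed = ["url", "name", "gender", "email", "web_page", "im_username",
             "city", "region", "country", "industry", "occupation", "about_me"]
    # Missing fields become blank cells, matching the example output where some
    # profile information simply does not exist.
    row = [profile.get(field, "") for field in fixed]
    return row + list(profile.get("my_blogs", []))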

I need a VERY QUICK TURNAROUND for these projects - with tools to be delivered by 23 December. Please contact me with your proposal and costings.

Update: I also need a further tool to scrape the Alexa database and assign the Alexa ranking to an output file. Similar to the other tools in the brief, the input would be a .txt file or similar with a list of URLs, and the output file would be a spreadsheet/CSV with the URLs in the first column and Alexa rank in the second column.
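
A minimal sketch, assuming the historical data.alexa.com XML interface (which may no longer be available) is used for the rank lookup:

import csv
import xml.etree.ElementTree as ET

import requests

def alexa_rank(url):
    """Look up the Alexa traffic rank via the old data.alexa.com XML interface.

    Returns "" when no rank is reported; the endpoint itself is a historical
    assumption and may have been retired.
    """
    resp = requests.get("http://data.alexa.com/data",
                        params={"cli": "10", "dat": "s", "url": url}, timeout=30)
    popularity = ET.fromstring(resp.content).find(".//POPULARITY")
    return popularity.get("TEXT", "") if popularity is not None else ""

def alexa_tool(input_path, output_path):
    with open(input_path, encoding="utf-8") as f:
        urls = [line.strip() for line in f if line.strip()]
    with open(output_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        for url in urls:
            # URL in the first column, Alexa rank (or blank) in the second.
            writer.writerow([url, alexa_rank(url)])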

Skills: Web Scraping


About the Employer:
( 42 reviews ) London, United Kingdom

Project ID: #571739

4 freelancers are bidding on average $233 for this job

srinichal

willing to start right away

$250 USD in 10 days
(24 Reviews)
6.1
mantislin

Hi sir, I can do this for you. Thanks, Kimi.

$230 USD in 5 days
(61 Reviews)
6.0
shreesoftech

Hi, I am an experienced programmer. I have done many bots and crawlers and I can show you a demo. Please go through [url removed, login to view]; refresh the page if it does not appear, and find my IM contact details. More

$200 USD in 7 days
(5 Reviews)
3.2
aruhat

Hello, Please have a look in PMB for more details. Regards, Bhavik

$250 USD in 0 days
(1 Review)
4.0