A Python program and/or web page(s) that scrapes a financial user forum website.
Using: MySQL / Python / PHP / Open to suggestions
The user forum is unique, not php-nuke or vBulletin.
The qualified bidder will have experience parsing html / scraping values from an http response.
Create a program / website (Python preferred, but open to suggestions) that scrapes a list of web pages. The list is predefined in a database table.
The process will work a database table that contains a URL in each row.
The process will make the http request, with a customizable user-agent string.
Next, determine if the http response is a valid response, or an http error, or a redirect to another page.
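The request and response-classification steps above could be sketched roughly as follows. This is only an illustration using Python's standard library; the function names `fetch` and `classify_status` and the three result labels are assumptions, not part of the specification. Redirects are deliberately not followed so that they can be reported separately.

```python
import urllib.error
import urllib.request


class _NoRedirect(urllib.request.HTTPRedirectHandler):
    """Stop urllib from silently following redirects."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Returning None means "not handled", so urllib raises HTTPError for 3xx.
        return None


def classify_status(status):
    """Map an HTTP status code to 'ok', 'redirect', or 'error' (assumed labels)."""
    if 200 <= status < 300:
        return "ok"
    if 300 <= status < 400:
        return "redirect"
    return "error"


def fetch(url, user_agent, timeout=30):
    """Return (kind, status, body, redirect_target) for one URL."""
    opener = urllib.request.build_opener(_NoRedirect)
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with opener.open(request, timeout=timeout) as response:
            status = response.status
            return classify_status(status), status, response.read(), None
    except urllib.error.HTTPError as err:
        # Unfollowed 3xx responses as well as 4xx and 5xx arrive here.
        return classify_status(err.code), err.code, b"", err.headers.get("Location")
    except OSError:
        # URLError, timeouts, and other network-level failures.
        return "error", None, b"", None
```

The user-agent string is passed per call, so it can be read from configuration or from the same database table that holds the URLs.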
If the response is the desired response, scrape 10 different elements off of the page. These elements include:
• Topic Description
• Topic (stock) symbol
• The updated topic URL
• The number of followers for the topic
• The Topic Category ID
• The Topic HTML title
• The last DTTM post for the topic
• The number of posts for the topic
• The Moderators for the topic (there could be multiple moderators – each one will need to be recorded) The moderator will have a user_id, a username, and a URL. All will need to be recorded.
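As a rough illustration of the moderator element, the sketch below pulls a user_id, username, and URL out of each moderator link. The forum's real markup is not described here, so everything about the HTML is an assumption: it supposes moderator links look like `<a class="moderator" href="/profile/42">name</a>` and that the user id is the last number in the URL. The selectors would need to be adapted after inspecting the actual pages.

```python
import re
from html.parser import HTMLParser


class ModeratorParser(HTMLParser):
    """Collect (user_id, username, url) tuples from hypothetical moderator links."""

    def __init__(self):
        super().__init__()
        self.moderators = []
        self._href = None   # set while inside a moderator link
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # ASSUMPTION: moderator links carry class="moderator".
        if tag == "a" and attrs.get("class") == "moderator":
            self._href = attrs.get("href", "")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # ASSUMPTION: the numeric user id is the last number in the URL.
            match = re.search(r"(\d+)\D*$", self._href)
            if match:
                name = "".join(self._text).strip()
                self.moderators.append((int(match.group(1)), name, self._href))
            self._href = None


def parse_moderators(html):
    """Return every moderator found in the page as a list of tuples."""
    parser = ModeratorParser()
    parser.feed(html)
    return parser.moderators
```

Returning a list keeps the "multiple moderators" case uniform: a topic with one moderator simply yields a one-element list.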
After the page elements have been scraped – perform the following action:
Update the original table with the new values (listed above)
For “moderators” elements, the following should be performed:
1. Determine if the USER_ID is already in a USER_TBL. If the user already exists in the table, no action. Otherwise insert a row into the USER_TBL, values USER_ID_NBR, USER_NAME, USER_URL and SYS_ADDED_DTTM.
SELECT 'X' FROM FORUM_USER_TBL WHERE USER_ID_NBR = :1
2. Each forum topic has a number. Take the TOPIC_NBR and the USER_ID_NBR and check to see if it exists on the MODERATOR_TBL. If exists, no action, otherwise insert a row into the MODERATOR_TBL values TOPIC_NBR, USER_ID_NBR, SYS_ADDED_DTTM
SELECT 'X' FROM MODERATOR_TBL WHERE TOPIC_NBR = :1 AND USER_ID_NBR = :2
3. Additional logic: after the moderators have been scraped from the web page, if a user exists on the MODERATOR_TBL for the TOPIC_NBR but is no longer listed on the web page as a moderator, remove that row from MODERATOR_TBL.
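The three moderator steps above might be sketched as below. To keep the example self-contained it uses Python's built-in `sqlite3` as a stand-in for MySQL, so the `?` placeholders and the column types in `SCHEMA` are assumptions (a MySQL driver would typically use `%s`). The table name FORUM_USER_TBL follows the query shown in step 1, and the function name `sync_moderators` is hypothetical.

```python
import sqlite3
from datetime import datetime, timezone

# ASSUMPTION: column types are illustrative; only the column names come from the spec.
SCHEMA = """
CREATE TABLE FORUM_USER_TBL (USER_ID_NBR INTEGER PRIMARY KEY, USER_NAME TEXT,
                             USER_URL TEXT, SYS_ADDED_DTTM TEXT);
CREATE TABLE MODERATOR_TBL (TOPIC_NBR INTEGER, USER_ID_NBR INTEGER,
                            SYS_ADDED_DTTM TEXT,
                            PRIMARY KEY (TOPIC_NBR, USER_ID_NBR));
"""


def sync_moderators(conn, topic_nbr, scraped):
    """Reconcile the tables with `scraped`, a list of (user_id, name, url) tuples."""
    now = datetime.now(timezone.utc).isoformat(timespec="seconds")
    cur = conn.cursor()
    for user_id, name, url in scraped:
        # Step 1: insert the user only if not already present.
        cur.execute("SELECT 'X' FROM FORUM_USER_TBL WHERE USER_ID_NBR = ?", (user_id,))
        if cur.fetchone() is None:
            cur.execute(
                "INSERT INTO FORUM_USER_TBL (USER_ID_NBR, USER_NAME, USER_URL, SYS_ADDED_DTTM) "
                "VALUES (?, ?, ?, ?)", (user_id, name, url, now))
        # Step 2: insert the topic/moderator pair only if not already present.
        cur.execute("SELECT 'X' FROM MODERATOR_TBL WHERE TOPIC_NBR = ? AND USER_ID_NBR = ?",
                    (topic_nbr, user_id))
        if cur.fetchone() is None:
            cur.execute(
                "INSERT INTO MODERATOR_TBL (TOPIC_NBR, USER_ID_NBR, SYS_ADDED_DTTM) "
                "VALUES (?, ?, ?)", (topic_nbr, user_id, now))
    # Step 3: remove rows for users no longer listed as moderators on the page.
    cur.execute("SELECT USER_ID_NBR FROM MODERATOR_TBL WHERE TOPIC_NBR = ?", (topic_nbr,))
    existing = {row[0] for row in cur.fetchall()}
    for stale_id in existing - {user_id for user_id, _, _ in scraped}:
        cur.execute("DELETE FROM MODERATOR_TBL WHERE TOPIC_NBR = ? AND USER_ID_NBR = ?",
                    (topic_nbr, stale_id))
    conn.commit()
```

Because step 3 deletes anything not in the scraped list, a caller would presumably want to run this only after a valid page response, so that a failed scrape does not wipe a topic's moderators.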
This project has the following requirements and EACH BIDDER MUST ADDRESS EACH ITEM:
(1) The bidder is fluent in English without 3rd party assistance.
(2) Bidder is familiar with and can use [url removed, login to view], [url removed, login to view], GOTOMEETING, SKYPE and google talk/hangouts to work out solution.
(3) The solution MUST BE HIGHLY PORTABLE, meaning it can be moved from one instance to another.
(4) Bidder will include platform requirements in bid. Please state what is required. A simple solution that runs on a hosted VPS account or a vmware instance is preferred, but I am open to detailed suggestions.
(5) Testing and final delivery to be performed on BUYER supplied instance.
(6) Bidder should have experience with full text indexing for future phases.
An additional document with screenshots of the forum is available to bidders after the above items are addressed.
Please do not make a final bid until reading the additional document.
8 freelancers are bidding on average $204 for this job
Hi sir, I am a scraping expert and have done many similar projects; please check my feedback and you will see. Can you tell me more details? Then I will provide demo data for you. Thanks, Kimi
Hi. I will develop a Scrapy (Python framework) project to scrape the data. I hope that the URLs point to the same domain. I prefer to use a Linux platform for Scrapy projects. I have no experience with [url removed, login to view], [url removed, login to view], GOTOMEETIN
I have experience with scraping using Perl and the LWP user agent. I can scrape the data and deliver the output in Excel format to load into the database. If needed, I can load it directly into the database.