We need you to develop a spider program that will traverse the page structure of specific sites and extract e-mail addresses, Skype IDs, MSN IDs, etc.
This will not crawl the web in its entirety; it will only crawl specific sites that we designate, will never leave those sites, and will only follow certain internal links.
More information in the deliverables section.
The project will consist of two parts:
1) The "spider engine": This is the program the user opens, and it will contain the UI. It can be really simple (it doesn't need to look good; this is an internal tool). It will basically have start and stop buttons, plus a listbox showing the queue of pages waiting to be downloaded.
This program will work with a database (we will provide the DB schema, but it will be very simple) consisting of basically two tables: one for the queue of pages to download, and one for the information extracted from those pages.
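As a rough illustration of the data flow, the two tables might look something like this (all table and column names below are placeholders only; the real schema will come from us):

```sql
-- Hypothetical sketch; the actual schema will be provided by us.
CREATE TABLE Queue (
    Id      INT IDENTITY PRIMARY KEY,
    Url     NVARCHAR(2000) NOT NULL,
    Format  NVARCHAR(50)   NOT NULL,   -- selects which analyzer to use
    Crawled BIT            NOT NULL DEFAULT 0
);

CREATE TABLE Contacts (
    Id      INT IDENTITY PRIMARY KEY,
    QueueId INT REFERENCES Queue(Id),  -- page the info was extracted from
    Name    NVARCHAR(200),
    Phone   NVARCHAR(50),
    Email   NVARCHAR(200),
    SkypeId NVARCHAR(100),
    MsnId   NVARCHAR(100)
);
```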
For each site we want crawled, we will manually add a record to the Queue table (the first page, which will get it running) and run the UI, which will do the following:
1) Get the next page to crawl from the Queue table. Each record has two main attributes: URL and Format.
The Format determines which code to use when analyzing the page (more on this later).
2) Download the HTML code of that page
3) Instantiate the correct "analyzer" and pass it the HTML code.
4) The analyzer will return a list of links found on that page that should be followed. These will be added to the Queue table. Obviously, any page already crawled will NOT be added.
5) Repeat until the queue is empty.
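The steps above can be sketched roughly like this (a sketch only, not the final design; all type and method names here are placeholders we would agree on later):

```csharp
// Illustrative sketch of the engine loop; names are placeholders.
while (!stopRequested)
{
    QueueItem item = queue.GetNext();       // 1) next URL + Format from the Queue table
    if (item == null) break;                // 5) stop when the queue is empty

    string html = DownloadHtml(item.Url);   // 2) e.g. via System.Net.WebClient

    // 3) pick the analyzer class that matches this page's Format
    IAnalyzer analyzer = AnalyzerFactory.Create(item.Format);

    // 4) queue any follow-up links, skipping pages already crawled or queued
    foreach (string link in analyzer.ExtractLinks(html))
    {
        if (!queue.Contains(link))
            queue.Add(link, item.Format);
    }

    StoreContacts(analyzer.ExtractContacts(html)); // save contact info when present
}
```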
2) The "analyzers": Each of the sites we want crawled has a different format and a different way of presenting the contact information we need, so a separate piece of code will be necessary for each one. The idea is to make this system extensible, so that other "formats" can be added in the future; each analyzer will therefore be a separate class. All of them, however, will either inherit from a common base class or implement a certain interface.
Basically, the analyzers will have to go through 2 types of pages:
a) Pages that have the content we want (the contact info): when an analyzer receives one of these, it will extract names, phones, e-mails, Skype IDs, etc., and store them in the database.
b) Pages that have to be traversed to reach the ones with the content we want. These will basically be lists of people. From these, you will extract links to the pages with contact info, or to other pages with lists. There can be many levels of this; for example, a page with a list of companies, where each company has a list of departments, and each department has a list of people.
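One possible shape for the common contract covering both page types (again, the names here are suggestions only, to be finalized together):

```csharp
using System.Collections.Generic;

// Possible analyzer contract; names and exact shape are open to discussion.
public interface IAnalyzer
{
    // Links to queue next: contact pages, or further list pages
    // (companies, departments, etc.). Empty for a pure contact page.
    IList<string> ExtractLinks(string html);

    // Contact records found on the page. Empty for a pure list page.
    IList<ContactInfo> ExtractContacts(string html);
}

public class ContactInfo
{
    public string Name;
    public string Phone;
    public string Email;
    public string SkypeId;
    public string MsnId;
}
```

A concrete analyzer per site (and per Format value) would implement this interface, keeping all the HTML-parsing details inside that class.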
The code should be structured so that it is easy to create more analyzers later. The "messy work" of parsing the HTML should be done inside the analyzers and kept as separate as possible from the rest of the functional code. This will be partly enforced by the class structure mentioned above, but you will have to keep it in mind while coding.
Your quote should include the spider engine and 2 (two) analyzers for two specific sites.
If, for the purpose of quoting, you need to see these two sites, let us know and we will send you the links and a definition of how to extract the information.
Please indicate your price, and time for completion.
- The platform will be .Net 2.0. It can be VB.Net or C#.
- Knowledge of regular expressions is a must; they will simplify your life enormously.
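For instance, a first pass at pulling e-mail addresses out of a page could look like this (a deliberately simplified pattern for illustration; each analyzer will need patterns tuned to its own site's markup):

```csharp
using System;
using System.Text.RegularExpressions;

class RegexDemo
{
    static void Main()
    {
        // Simplified e-mail pattern; real analyzers will need site-specific tuning.
        string html = "Contact: <a href=\"mailto:john@example.com\">John</a>";
        Regex email = new Regex(@"[\w.+-]+@[\w-]+(\.[\w-]+)+");
        foreach (Match m in email.Matches(html))
            Console.WriteLine(m.Value);   // prints: john@example.com
    }
}
```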
- We are developers ourselves, so we know what we're talking about, and you won't have problems getting clear definitions from us. We will also help you if you need, or get stuck with a particular problem.
- Multi-threading (having the system process more than one URL at a time) is a plus, but not a requirement. If you offer it, you MUST have prior experience with it and be able to prove it.
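By "process more than one URL at a time" we mean something along these lines, sketched in .NET 2.0 style (all names are illustrative; the single-threaded version is perfectly acceptable):

```csharp
using System.Threading;

// Rough sketch: several worker threads draining a shared, locked queue.
for (int i = 0; i < workerCount; i++)
{
    Thread worker = new Thread(delegate()
    {
        QueueItem item;
        // GetNextThreadSafe() must lock internally so two workers
        // never pick up the same page.
        while ((item = queue.GetNextThreadSafe()) != null)
            ProcessPage(item);   // download + analyze, as in the main loop
    });
    worker.Start();
}
```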
- Excellent English communication is a MUST. Good spoken level is an important plus.
- If you do a good job, we will keep hiring you to create more analyzers for other sites.
- Please send us references to other work you've done that is relevant to this project.
Thanks for your time, and happy bidding!
1) Complete and fully-functional working program(s) in executable form as well as complete source code of all work done.
2) Deliverables must be in ready-to-run condition, as follows (depending on the nature of the deliverables):
a) For web sites or other server-side deliverables intended to only ever exist in one place in the Buyer's environment--Deliverables must be installed by the Seller in ready-to-run condition in the Buyer's environment.
b) For all others including desktop software or software the buyer intends to distribute: A software installation package that will install the software in ready-to-run condition on the platform(s) specified in this bid request.
3) All deliverables will be considered "work made for hire" under U.S. Copyright law. Buyer will receive exclusive and complete copyrights to all work purchased. (No GPL, GNU, 3rd party components, etc. unless all copyright ramifications are explained AND AGREED TO by the buyer on the site per the coder's Seller Legal Agreement).
Windows, .Net 2.0