I need a crawler script that will crawl 6 website addresses I will provide. Each site contains a long, well-organized list of product names, divided into alphabetical groups by manufacturer. The script should then copy the entire list from each site into 6 separate CSV files...
Read all the details at the end of this project description (see the bottom).
I need you to run a script that will do the following:
1) **Crawl**: crawl 6 website addresses I will provide. Each site contains a long, well-organized list of product names, divided into alphabetical groups by manufacturer. The script should copy the entire list from each site into its own CSV file (6 files in total) and create **4 columns in each**: **one for the product name** (e.g. "Xerox monitor B5 G617Q"), one for the **manufacturer** it belongs to (e.g. "Xerox"), one for the **category** it belongs to (e.g. "Printers"), and a final column for the name of the **source website** the product name came from. The manufacturer and category are already given by each of the sites.
2) **Compare lists and find commons**: After the script has copied the 6 lists from the 6 sites, I need it to search and analyze all 6 lists and create the following 2 new lists from them:
a) **Commons**: whenever the script finds that a product name appears in more than 1 list, it copies that name to a new CSV file containing only product names that appear in more than 1 list. (Reason: I want to see which names appear on more than 1 site and are not unique to it.)
b) **Unique**: names that appear in only 1 list are copied to a second new CSV file containing only unique names.
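A minimal Python sketch of step 2, assuming the 6 crawled lists have already been loaded as rows with the 4 columns from step 1. The column names, the `site-a.example` / `site-b.example` hosts, and the function names are hypothetical, and the default `key` only ignores letter case; a stricter comparison key can be passed in instead.

```python
import csv
import io
from collections import defaultdict

# Hypothetical column names for the 4 columns described in step 1.
FIELDS = ["product_name", "manufacturer", "category", "source_website"]


def split_commons_and_unique(rows, key=str.casefold):
    """Split rows into (commons, unique).

    A name is "common" when it is listed by more than one distinct site.
    `key` maps a product name to the string used for comparison.
    """
    sites_by_key = defaultdict(set)
    for row in rows:
        sites_by_key[key(row["product_name"])].add(row["source_website"])

    commons, unique = [], []
    for row in rows:
        if len(sites_by_key[key(row["product_name"])]) > 1:
            commons.append(row)
        else:
            unique.append(row)
    return commons, unique


def to_csv_text(rows):
    """Render rows as CSV text with a header line."""
    buf = io.StringIO(newline="")
    writer = csv.DictWriter(buf, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Sample rows; the site names are placeholders, not the real 6 sites.
rows = [
    {"product_name": "HP Business Inkjet 1100dtn", "manufacturer": "HP",
     "category": "Printers", "source_website": "site-a.example"},
    {"product_name": "HP BUSINESS INKJET 1100DTN", "manufacturer": "HP",
     "category": "Printers", "source_website": "site-b.example"},
    {"product_name": "Xerox monitor B5 G617Q", "manufacturer": "Xerox",
     "category": "Printers", "source_website": "site-a.example"},
]

commons, unique = split_commons_and_unique(rows)
commons_csv = to_csv_text(commons)  # would be saved as the "commons" file
unique_csv = to_csv_text(unique)    # would be saved as the "unique" file
```

The sketch counts distinct source sites rather than raw occurrences, so a name repeated twice on the same site still lands in the unique list.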
Important notes: You should run this script anonymously and in a way that looks natural, from a few different IPs if possible.
The script will ignore commas, hyphens, letter case etc. when deciding whether 2 names from different lists should be considered the same. E.g. **HP Business Inkjet 1100dtn** and **HP BUSINESS INKJET 1100-dtn** will be considered the same. I am sure there will be other examples; I'll check and give you more.
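One possible way to implement this matching rule in Python is to reduce every name to a comparison key. The function name `match_key` is hypothetical, and dropping spaces along with punctuation is an assumption that goes slightly beyond the rule as stated; the key is used only for comparing, so the original spelling is still what gets written to the CSV files.

```python
import re

# Any run of characters that are not letters or digits (punctuation,
# whitespace, underscores). Dropping whitespace too is an assumption.
_NON_ALNUM = re.compile(r"[\W_]+")


def match_key(name):
    """Return a key that is equal for names differing only in case,
    punctuation, or spacing."""
    return _NON_ALNUM.sub("", name.casefold())


same = match_key("HP Business Inkjet 1100dtn") == match_key("HP BUSINESS INKJET 1100-dtn")
```

Further equivalence rules, once the extra examples are known, would be added inside this one function.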
The manufacturer name should always be the first word of the product name. If it's not, the script should put it there.
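A short Python sketch of this rule, assuming the manufacturer for each row is already known from the site's grouping. The function name is hypothetical. It only prepends a missing manufacturer and does not try to move one that appears later in the name.

```python
def with_manufacturer_first(product_name, manufacturer):
    """Return the product name with the manufacturer as its leading word(s)."""
    # Collapse repeated whitespace so the prefix check is reliable.
    name = " ".join(product_name.split())
    maker = " ".join(manufacturer.split())
    if not maker:
        return name

    head, rest = name[:len(maker)], name[len(maker):]
    # The prefix must end on a word boundary, so "HPE ..." does not count as "HP".
    already_first = head.casefold() == maker.casefold() and (not rest or rest[0] == " ")
    return name if already_first else f"{maker} {name}".strip()


fixed = with_manufacturer_first("Business Inkjet 1100dtn", "HP")
kept = with_manufacturer_first("HP Business Inkjet 1100dtn", "HP")
```

The check ignores letter case, so a name that already starts with "hp" is left as it is rather than getting a second prefix.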