Basic extraction from Wikipedia (from a few specific lists to DB)
$100-500 USD
Paid on delivery
===================
BACKGROUND
===================
I will provide you with a few Wikipedia lists (list of ballet companies, list of operas, list of musicals, etc.), and your job is to write a script that extracts the details into two basic MySQL tables (the structure of both tables is given below).
As deliverables for this project, I'm looking for (a) the tables populated with data and (b) the scripts that were used to extract the data.
**This is the first trial project for this kind of extraction. There is more extraction work ahead.**
===================
DATA STRUCTURE
===================
There will be two tables, an "entities" table and an "entity_names" table:
**entities** table:
- ID
- Wikipedia_Page
- Type
- Primary name ID (which will point to "ID" from "entity_names" table)
**entity_names** table:
- ID
- entity_ID (which will point to "ID" from "entities" table)
- Name
- Type (primary or secondary)
The reason we're using two tables is that a given entity could later have more than one name/alias (for example, "San Francisco Symphony" could also be called "SF Symphony"). For everything you will be extracting, you can set the value of the "Type" field of the "entity_names" table to "primary".
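To make the relationship between the two tables concrete, here is a minimal sketch of the layout. It is only an illustration under assumptions: it uses Python's built-in `sqlite3` as a stand-in for MySQL so it runs without a server, the lowercase column names and the `add_entity` helper are hypothetical, and the "orchestra" type and Wikipedia URL for the San Francisco Symphony are just sample values.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for MySQL (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE entities (
    id              INTEGER PRIMARY KEY,
    wikipedia_page  TEXT NOT NULL,
    type            TEXT NOT NULL,      -- e.g. ballet_company, opera
    primary_name_id INTEGER             -- points to entity_names.id
);
CREATE TABLE entity_names (
    id        INTEGER PRIMARY KEY,
    entity_id INTEGER NOT NULL REFERENCES entities(id),
    name      TEXT NOT NULL,
    type      TEXT NOT NULL CHECK (type IN ('primary', 'secondary'))
);
""")


def add_entity(conn, wikipedia_page, entity_type, name):
    """Insert one entity with its primary name (hypothetical helper)."""
    cur = conn.execute(
        "INSERT INTO entities (wikipedia_page, type) VALUES (?, ?)",
        (wikipedia_page, entity_type),
    )
    entity_id = cur.lastrowid
    cur = conn.execute(
        "INSERT INTO entity_names (entity_id, name, type) VALUES (?, ?, 'primary')",
        (entity_id, name),
    )
    # Link the entity back to the name row that was just created.
    conn.execute(
        "UPDATE entities SET primary_name_id = ? WHERE id = ?",
        (cur.lastrowid, entity_id),
    )
    return entity_id


# Sample values only; the type and URL here are illustrative.
eid = add_entity(
    conn,
    "https://en.wikipedia.org/wiki/San_Francisco_Symphony",
    "orchestra",
    "San Francisco Symphony",
)
# An alias could be added later as a secondary name.
conn.execute(
    "INSERT INTO entity_names (entity_id, name, type) VALUES (?, ?, 'secondary')",
    (eid, "SF Symphony"),
)
conn.commit()
```

The entity row is inserted first and its primary name ID is filled in afterwards, because each table points at the other.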
===================
WHAT TO EXTRACT
===================
1) List of all ballet companies
Source: <[login to view URL]>
Fields to grab:
Name = "Company Name" from the table
Type = ballet_company
Wikipedia page = page for each ballet company (example: [login to view URL])
2) List of Operas
Source: <[login to view URL]>
Name = opera name from the list
Type: opera
Wikipedia page = page for each opera (example: [login to view URL])
(Below, I will only provide the type, as the other fields are self-explanatory based on the above two examples.)
3) List of Opera Companies
Source: <[login to view URL]>
Type: opera_company
4) List of Musicals:
Sources: <[login to view URL]:_A_to_L>
<[login to view URL]:_M_to_Z>
Type: musical
5) List of Orchestras:
Source: <[login to view URL]>
Type: orchestra
6) List of Improv Theater Companies
Source: <[login to view URL]>
Type: improv_theater_company
7) List of Comedians
Source: <[login to view URL]>
Type: comedian
Note: Please only extract those who are still alive (i.e. do not take someone like "Bud Abbott (1895-1974)")
8) List of Stand-up Comedians
Source: <[login to view URL]>
Type: stand_up_comedian
Note: Please only extract those who are still alive
9) List of dance companies:
Source: <[login to view URL]>
Type: dance_company
10) List of pop punk bands
Source: <[login to view URL]>
Type: pop_punk_band
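For the two comedian lists, the "still alive" rule could be handled roughly as in the sketch below. It is a minimal illustration, assuming that deceased people appear with a birth and death year in parentheses, like the "Bud Abbott (1895-1974)" example; the actual list pages may be formatted differently, and a missing death year does not prove someone is alive. The function names and the two placeholder entries are hypothetical.

```python
import re

# Matches a lifespan such as "(1895-1974)", with a hyphen or an en dash.
# Assumption: deceased people are listed with both a birth and a death year.
LIFESPAN = re.compile(r"\(\s*\d{4}\s*[-\u2013]\s*\d{4}\s*\)")


def is_presumed_living(entry: str) -> bool:
    """Return True when the entry carries no birth-death year range."""
    return LIFESPAN.search(entry) is None


def clean_name(entry: str) -> str:
    """Strip a trailing parenthetical such as "(born 1970)" from the name."""
    return re.sub(r"\s*\([^)]*\)\s*$", "", entry).strip()


# The first entry is the example from the note; the others are placeholders.
sample = [
    "Bud Abbott (1895-1974)",
    "Example Comedian (born 1970)",
    "Another Example",
]
living = [clean_name(e) for e in sample if is_presumed_living(e)]
```

Entries that pass this filter would still be worth spot-checking by hand, since the check only looks at how the list line is written.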
Project ID: #3191040