Hundreds of files, mostly sharing the same structure, need to be loaded into a database. The database consists of two tables; it already exists and contains data, but no recent upload has taken place.
Both tables need to be populated from the following types of files:
1- csv files: about 700 exist, all with the same format
2- xls files: 6 of these exist, all with the same format
Data needs to be extracted from selected columns and inserted into two database tables: one for price data and one for static data. The date to which the data pertains must be stored as well.
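As a rough illustration only (the brief does not give the real table or column names, so `price`, `static`, `symbol`, `name` and the file layout below are placeholders), loading one csv file into both tables with the pertaining date could look like this, using only the Python standard library:

```python
import csv
import io
import sqlite3

# Hypothetical schema: the real table and column names are not specified
# in the brief, so these are placeholders to be replaced.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE price (symbol TEXT, price REAL, quote_date TEXT)")
conn.execute("CREATE TABLE static (symbol TEXT PRIMARY KEY, name TEXT)")

def load_csv(text, quote_date, conn):
    """Insert selected columns from one file into both tables,
    stamping each price row with the date it pertains to."""
    reader = csv.DictReader(io.StringIO(text))
    for row in reader:
        conn.execute("INSERT INTO price VALUES (?, ?, ?)",
                     (row["symbol"], float(row["price"]), quote_date))
        # Static data repeats across daily files; ignore re-inserts.
        conn.execute("INSERT OR IGNORE INTO static VALUES (?, ?)",
                     (row["symbol"], row["name"]))
    conn.commit()

sample = "symbol,name,price\nABC,Acme Corp,10.5\nXYZ,Xylo Ltd,20.0\n"
load_csv(sample, "2005-06-01", conn)
print(conn.execute("SELECT COUNT(*) FROM price").fetchone()[0])  # prints 2
```

In a real run the text would come from each downloaded file rather than a string, and the date would be derived from the filename or the file contents.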
So the deliverables are:
1- a database export, created after loading the latest set of files, together with the load script needed to re-insert it
2- the scripts used to parse the csv-files
We already have a script that handles a single file, and the shell script should be easy to extend to loop over the date-appended files that we have.
The files are downloaded from the same site every day. Sometime in 2005 a new column was introduced; that is the only change to the data structure we are aware of. Each file contains between 200 and 800 lines.
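Since the exact filename pattern and column names are not given in the brief, the following is a sketch under assumptions: filenames carry a YYYYMMDD stamp (e.g. `prices_20050601.csv`), and the column added in 2005 is called `sector` here purely as a placeholder. Keying the parser on the header row lets one code path handle both the pre- and post-2005 layouts:

```python
import csv
import io
import re

def date_from_filename(name):
    """Pull a YYYYMMDD stamp out of a date-appended filename.
    The naming pattern is an assumption; adjust the regex as needed."""
    m = re.search(r"(\d{4})(\d{2})(\d{2})", name)
    if not m:
        raise ValueError(f"no date found in filename: {name}")
    return "-".join(m.groups())

def parse_rows(text):
    """Tolerate both layouts by reading the header: files written
    before the extra column was introduced simply lack that field,
    so the missing key is filled with None."""
    reader = csv.DictReader(io.StringIO(text))
    for row in reader:
        yield {
            "symbol": row["symbol"],        # assumed column name
            "price": float(row["price"]),   # assumed column name
            "sector": row.get("sector"),    # placeholder for the 2005 column
        }

old = "symbol,price\nABC,10.5\n"                 # pre-2005 layout
new = "symbol,price,sector\nABC,10.5,energy\n"   # post-2005 layout
print([r["sector"] for r in parse_rows(old)])    # prints [None]
print([r["sector"] for r in parse_rows(new)])    # prints ['energy']
```

Driving this over all ~700 files is then a simple loop over a glob of the download directory.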
Both the price table and the static data table need to be populated from these files.
The difficulty lies in checking that the data is complete and in detecting clearly erroneous prices, duplicate rows and the like. Some common sense, applied without much guidance from our side, is highly preferable. You may have to do some manual data cleansing in a few files (<10) where corruption may have occurred.
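The checks above can be sketched as a simple pre-insert filter. The thresholds (`min_price`, `max_jump`) and the row keys are placeholders, not values from the brief; they would have to be tuned against the real data:

```python
def check_rows(rows, min_price=0.01, max_jump=5.0):
    """Flag obviously bad rows: incomplete fields, suspicious or
    wildly jumping prices, and exact duplicates. Thresholds are
    assumptions to be tuned against the real data."""
    problems = []
    seen = set()
    last_price = {}
    for i, row in enumerate(rows):
        if row.get("symbol") in (None, "") or row.get("price") is None:
            problems.append((i, "incomplete row"))
            continue
        key = (row["symbol"], row["price"])
        if key in seen:
            problems.append((i, "duplicate row"))
        seen.add(key)
        price = row["price"]
        if price <= min_price:
            problems.append((i, "suspicious price"))
        prev = last_price.get(row["symbol"])
        if prev and (price / prev > max_jump or prev / price > max_jump):
            problems.append((i, "price jump"))
        last_price[row["symbol"]] = price
    return problems

rows = [
    {"symbol": "ABC", "price": 10.5},
    {"symbol": "ABC", "price": 10.5},   # duplicate
    {"symbol": "XYZ", "price": -1.0},   # clearly erroneous price
    {"symbol": "", "price": 3.0},       # incomplete
]
for i, reason in check_rows(rows):
    print(i, reason)  # prints: 1 duplicate row / 2 suspicious price / 3 incomplete row
```

Flagged rows would be written to a report for the manual cleansing pass rather than silently dropped, so each corrupt file can be inspected by hand.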