STANDARDIZING ADDRESS, USA
- Estado: Pending
- Prêmio: $500
- Inscrições Recebidas: 0
Síntese do concurso
Contest is normalizing and standardizing usa addresses, In mysql and C
Have addresses from two different sources different data fields etc, situs file is control and will compare other file to this one. FL LEON CERTIFIED DATA 18-9-1 is data file to be compared to FL LEON SITUS FREELANCER.
Want to create mutli tiered algorithm to auto correct data. Want to new table with corrected data, along with what section or step of algorithm that matched/auto corrected that particular record. Also want to track accuracy of auto correction for each step, based on manual review of records. Ie there are 1000 records that were changed/matched using step 10 of algorithm and 100 were manually reviewed, and 3 were wrong, so you will need to make table to track manual reviews and % correctly auto changed.
The situs file is only for Tallahassee so any address that are not in Tallahassee in data file are not relevant.
First is developing pattern to combine fields which create the largest number of exact matches. Then develop matching data entry errors.
Fields of Typical USA addresses are:
Street Number, Street Name, Street direction, Street type, Building Number, Unit Number.
city, state, zip.
Typically data entry fields for address will be 2 to 4 lines.
Address line 1 generally contains street number, name and direction, can contain building or unit number
Address line 2 can be empty. Or contain building number and or unit number, or Care of, and have a name of individual.
Address line 3 usually contains city, state, zip.
Address line 4 Country if different than usa. Could also contain city state zip.
In certified mentioned above, address is in column E-H, and then separated fields, O - S
Street direction can go in before or after street name. building and unit number, can go behind street number or at end, after street type.
I know there will be questions on this so please hit me with comments so I may add and clarify any questions you might have.
Person with most records corrected above 95% accuracy, wins must go up to a minimum of 10 different match types.
Will keep original data file, and corrected table,
need ability to track percentage of accuracy based on manual review.
along with tracking each change and what part of algorithm that corrected that particular change in the recorded. What the change was, and a way to calculate percent accuracy.
that tracks, each part of code different match criteria. Will also need to track which part of code, catches and corrects the mistake , along with a manual review field for manual verification percentages.