Data cleansing using Java code

We have a system that uses Machine Learning (WEKA) to cleanse missing and incorrect data.

The goal of this project is to improve and expand the Java program that cleans up Street Address, Latitude, and Longitude. Currently only Street Address cleansing is implemented, and it leaves 852 items uncleansed. Latitude and Longitude cleansing must be added and the overall results improved. The Latitude and Longitude values present in the data file are 99% correct; the goal of the cleansing program is to select a canonical Lat/Lon for each address and to detect and fix outliers.
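
To illustrate what "select a canonical Lat/Lon and detect outliers" could look like, here is a minimal sketch. It is not part of the attached program: the class name CanonicalLatLon, its methods, and the maxDegrees threshold are all hypothetical, and the coordinate-wise median is only one assumed choice of canonical value (it tolerates a minority of bad readings, which fits data that is mostly correct). Distances are compared in plain degrees for simplicity rather than as true ground distance.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper for illustration only; not taken from the attached program. */
final class CanonicalLatLon {

    /** Median of a non-empty array. The input is copied, not modified. */
    static double median(double[] values) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        return sorted.length % 2 == 1
                ? sorted[mid]
                : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }

    /**
     * Canonical {lat, lon} for one address, taken here as the
     * coordinate-wise median of all readings recorded for that address.
     */
    static double[] canonical(List<double[]> readings) {
        double[] lats = new double[readings.size()];
        double[] lons = new double[readings.size()];
        for (int i = 0; i < readings.size(); i++) {
            lats[i] = readings.get(i)[0];
            lons[i] = readings.get(i)[1];
        }
        return new double[] { median(lats), median(lons) };
    }

    /**
     * Indices of readings that differ from the canonical point by more than
     * maxDegrees on either axis. maxDegrees is an assumed tuning parameter.
     */
    static List<Integer> outliers(List<double[]> readings, double maxDegrees) {
        double[] c = canonical(readings);
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < readings.size(); i++) {
            double[] r = readings.get(i);
            if (Math.abs(r[0] - c[0]) > maxDegrees
                    || Math.abs(r[1] - c[1]) > maxDegrees) {
                result.add(i);
            }
        }
        return result;
    }
}
```

In this sketch, "fixing" an outlier would mean replacing the flagged reading with the canonical point for its address. An entry could equally use a learned model or a proper ground-distance measure; the sketch only shows the shape of the task.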

This solution is for a commercial application. Using external services like Google Maps in the solution is not permitted. The current version here already outperforms Google Maps in accuracy when correcting street names (for example, "South C Ave" is incorrectly classified by Google as "S AVE E", when the correct answer is "S AVE C"). Please do not insist on using Google Maps or other services; they are inadequate.

You may use external services as a comparison for your results, accompanied by some analysis.

Contest Entries will be judged on three measures: 1) Correctness of results, 2) Elegance of Solution and Code, 3) Memory and Run-Time Requirements.

The contest duration is 14 days. I will monitor the project and post clarifications when asked.

ATTACHMENTS: Java program and database dump (if you cannot open the Zip file, try using 7-Zip). You need to download the Weka jar file yourself.

You may remove WEKA and use a different open-source library if you wish.

