data cleansing using Java code

  • Status: Pending
  • Prize: $75
  • Entries Received: 0

Contest Brief

We have a system that uses Machine Learning (WEKA) to cleanse missing and incorrect data.

The goal of this project is to improve and expand the Java program that cleans up Street Address, Latitude, and Longitude. Currently only Street Address is implemented, and it leaves 852 items uncleansed. Latitude and Longitude have to be added and the overall results improved. The Latitude and Longitude values present in the data file are 99% correct; the goal of the cleansing program is to select a canonical Lat/Lon for each address and to detect and fix outliers.
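
One possible way to read "select canonical Lat/Lon for each address and detect and fix outliers" is sketched below. It is only an illustration, not the method used by the attached program: it assumes the observations for one cleansed address have already been grouped together, takes the per-coordinate median as the canonical value, and treats any point further than a tolerance from that median as an outlier to be replaced. The class name, the tolerance of 0.01 degrees, and the sample coordinates are all made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class CanonicalLatLon {

    /** Median of a non-empty array; the input is copied so the caller's array is left untouched. */
    static double median(double[] values) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        return sorted.length % 2 == 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }

    /** Canonical {lat, lon} for one address: the per-coordinate median of its observations. */
    static double[] canonical(List<double[]> points) {
        double[] lats = points.stream().mapToDouble(p -> p[0]).toArray();
        double[] lons = points.stream().mapToDouble(p -> p[1]).toArray();
        return new double[] { median(lats), median(lons) };
    }

    /** True if the point is more than toleranceDeg from the canonical value on either axis. */
    static boolean isOutlier(double[] point, double[] canonical, double toleranceDeg) {
        return Math.abs(point[0] - canonical[0]) > toleranceDeg
            || Math.abs(point[1] - canonical[1]) > toleranceDeg;
    }

    public static void main(String[] args) {
        // Hypothetical observations recorded for a single cleansed address.
        List<double[]> observations = new ArrayList<>();
        observations.add(new double[] { 41.8850, -87.6295 });
        observations.add(new double[] { 41.8851, -87.6296 });
        observations.add(new double[] { 41.7463, -87.5491 }); // far from the other two

        double[] canon = canonical(observations);
        double toleranceDeg = 0.01; // assumed threshold; would need tuning on the real data

        for (double[] p : observations) {
            // "Fixing" an outlier here simply means replacing it with the canonical value.
            double[] cleaned = isOutlier(p, canon, toleranceDeg) ? canon : p;
            System.out.println(Arrays.toString(p) + " -> " + Arrays.toString(cleaned));
        }
    }
}
```

The median is used rather than the mean because a single bad coordinate cannot pull it far from the correct location, which suits data that is mostly correct with occasional outliers.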

This solution is for a commercial application. Using external services like Google Maps in the solution is not permitted. The current version here already outperforms Google Maps in accuracy for correction of street names (for example, "South C Ave" is incorrectly classified by Google as "S AVE E", when the correct answer is "S AVE C"). So please do not insist on using Google Maps or other services; they are inadequate.

You may use external services to compare your results, with some analysis.

Contest Entries will be judged on three measures: 1) Correctness of results, 2) Elegance of Solution and Code, 3) Memory and Run-Time Requirements.

The contest duration is 14 days. I will monitor the project and post clarifications when asked.

ATTACHMENTS: Java program and database dump (if you cannot open the Zip file, try using 7-Zip). You need to download the Weka jar file yourself.

You may remove WEKA and use a different open-source library if you wish.

Public Clarification Board

  • afterhourstech
    Contest Holder
    • 2 years ago

    Hi ergo1wish, please tell me more! What have you done? I did not see any submission.

    • 2 years ago
    1. ergo1wish
      • 2 years ago

      I have sent you a message on Freelancer; please take a look.

      • 2 years ago
    2. afterhourstech
      Contest Holder
      • 2 years ago

      Prize has been increased to $150 and contest re-posted.

      • 2 years ago
  • afterhourstech
    Contest Holder
    • 2 years ago

    Attention everyone. The prize has been DOUBLED to $150 and the contest re-posted at this URL: https://www.freelancer.com/contest/use-Data-Mining-to-fix-spelling-errors-in-street-addresses-94651.html?project_id=94651&simpleContestRedirect=true&hash=oUvS8RBPAsKaTzmM%2BQHBXkLBc4ufTxNx2JEuBVmmSEU%3D

    • 2 years ago
  • ergo1wish
    • 2 years ago

    Hello, I have already done too much on this project to give up now... If you want, contact me here and we can talk further.

    • 2 years ago
  • afterhourstech
    Contest Holder
    • 2 years ago

    I will also consider solutions done in Microsoft Azure SQL Server Data Mining Extensions (DMX) and Azure Machine Learning.

    • 2 years ago
  • dorelm1958
    • 2 years ago

    Your data is scarce. What is the meaning of the columns E, W, N, S, ZP4, and std_address?

    • 2 years ago
    1. afterhourstech
      Contest Holder
      • 2 years ago

      E, W, N, S are the Chicago city grid coordinates. They are deemed 99% accurate. ZP4 is the last four digits of the ZIP Postal Code. I found that the ZP4 field is useless, and will cause overfitting. The std_address is the known standardized address - the known_street field value is derived from this.

      • 2 years ago
    2. dorelm1958
      • 2 years ago

      HSN, E, W, N, S are more like 90% accurate (see known_street='N CALIFORNIA AVE'). Your project should proceed in steps: first find the obvious errors and clean them manually, then clean the street name; finally, LAT and LNG can be easily computed from them.

      • 2 years ago
  • ergo1wish
    • 2 years ago

    Hello, is Java a necessity here? Is it possible to do this in Python/R/C++, or in a combination of them? It might be easier to get good results quickly that way.

    • 2 years ago
    1. afterhourstech
      Contest Holder
      • 2 years ago

      You may use any open source language (that means no Microsoft Visual Studio). A solution developed in R would be awesome!

      • 2 years ago
  • dorelm1958
    • 2 years ago

    Why did you not use the data in the HSN column to predict the cleansed addresses?

    • 2 years ago
    1. afterhourstech
      Contest Holder
      • 2 years ago

      That was a bug, which has been fixed in the version I uploaded to Google Drive.

      • 2 years ago
  • swarm22
    • 2 years ago

    I am a data scientist/quantitative developer with over 5 years of experience doing analysis.
    My interest and expertise lie in using Weka and Java.

    • 2 years ago
    1. afterhourstech
      Contest Holder
      • 2 years ago

      Hi swarm22, you were the first poster, how is it going? When do you plan to submit your entry? Thanks

      • 2 years ago
  • afterhourstech
    Contest Holder
    • 2 years ago

    Hello again Contestants. I have another clarification. You may use Nominatim from openstreetmap.org in your solution, but not as a service: only if you include a dump of all the data for the city of Chicago, plus all required software to make it work, with installation instructions.

    • 2 years ago
    1. afterhourstech
      Contest Holder
      • 2 years ago

      Please see my other comment too. The performance of Nominatim is horrible, worse than Google. So I'm not sure why you would use it.

      • 2 years ago
    2. afterhourstech
      Contest Holder
      • 2 years ago

      I'm talking about the performance for matching incomplete or misspelled addresses.

      • 2 years ago
  • afterhourstech
    Contest Holder
    • 2 years ago

    When you execute the program as is, your output should look like this:

    • 2 years ago
    1. afterhourstech
      Contest Holder
      • 2 years ago

      When you add more fields to the process you may run out of memory (my machine has 12GB of RAM). In this case a data partitioning strategy is also needed. I believe clustering the data into zones and then processing separately should work very well.

      • 2 years ago
    2. afterhourstech
      Contest Holder
      • 2 years ago

      In case you decide to implement partitioning, it should be done automatically within the Java code.

      • 2 years ago
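
The partitioning idea in the comments above could be done inside the Java code along the following lines. This is a minimal sketch under stated assumptions, not the contest holder's design: instead of a clustering algorithm it swaps in a fixed latitude/longitude grid, which is the simplest way to split records into zones that can be processed one at a time. The class name, the cell size of 0.05 degrees, and the sample points are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class ZonePartitioner {

    /** Zone key for a point: the row and column index of the grid cell that contains it. */
    static String zoneKey(double lat, double lon, double cellDeg) {
        long row = (long) Math.floor(lat / cellDeg);
        long col = (long) Math.floor(lon / cellDeg);
        return row + ":" + col;
    }

    /** Groups {lat, lon} points by grid cell so each zone can be processed separately. */
    static Map<String, List<double[]>> partition(List<double[]> points, double cellDeg) {
        Map<String, List<double[]>> zones = new TreeMap<>();
        for (double[] p : points) {
            zones.computeIfAbsent(zoneKey(p[0], p[1], cellDeg), k -> new ArrayList<>()).add(p);
        }
        return zones;
    }

    public static void main(String[] args) {
        // Hypothetical sample points; real input would come from the database table.
        List<double[]> points = new ArrayList<>();
        points.add(new double[] { 41.8850, -87.6295 });
        points.add(new double[] { 41.8851, -87.6296 });
        points.add(new double[] { 41.7463, -87.5491 });

        // Assumed cell size; a smaller value gives more, smaller zones and lower peak memory.
        Map<String, List<double[]>> zones = partition(points, 0.05);
        zones.forEach((key, members) ->
            System.out.println("zone " + key + ": " + members.size() + " record(s)"));
    }
}
```

A grid has the drawback that an address near a cell border may have relevant neighbours in the adjacent cell, so a real solution might overlap the cells slightly or use a proper clustering step instead.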
  • djpatra
    • 2 years ago

    Hi, just to make sure that I have understood the problem correctly, I have a question. In the following tuple, (uncleaned_address => 'PALMER (2200 N) WEST OF CENTRAL PARK TO ADDRESS', LAT => 41.7463, LNG => -87.5491, prediction => 'W PALMER ST', confidence => 0.8333333333333334, clean_latitude => NULL, clean_longitude => NULL), since the confidence value is less than 1, you would like to find the values for clean_latitude and clean_longitude, which might not be the predicted LAT and LNG values. Is my understanding of the problem correct?

    • 2 years ago
    1. afterhourstech
      Contest Holder
      • 2 years ago

      Search for confidence less than 0.25. Those are the cases where the learning algorithm failed.

      • 2 years ago
    2. afterhourstech
      Contest Holder
      • 2 years ago

      Please check the newer version of the program at the link above. It has a clearer separation of Input and Output.

      • 2 years ago
  • afterhourstech
    Contest Holder
    • 2 years ago

    Attention Contestants: There has been some confusion about what the Input and the Output of the program are. So I have modified the program to work with two separate tables: geo_cleanser_in and geo_cleanserout. These are the INPUT and OUTPUT respectively. The modified program and database tables can be found here: https://drive.google.com/file/d/0B8genuop-YRfZUNBRmlUcnFMNHM/edit?usp=sharing

    • 2 years ago
  • OpenDoorLogistix
    • 2 years ago

    You're doing address matching and geocoding - but to do this you need an external database of all addresses in a country with geocodes (i.e. lat/long) to match against (either that or use something like Nominatim). Do you have access to a database like this?

    • 2 years ago
    1. afterhourstech
      Contest Holder
      • 2 years ago

      We are the company that makes those external databases. Our competitors only do mailing addresses. We also do non-mailing addresses such as those for new houses that have not been built yet. The scope of this competition is only the city of Chicago. If the code needs to be adapted to other cities, we will do it ourselves or we will post a new contest.

      • 2 years ago
    2. afterhourstech
      Contest Holder
      • 2 years ago

      By the way, I tried Nominatim. It performs very poorly for matching addresses with a missing suffix. For instance try 200 N Dearborn, Chicago, IL, and then try 200 N Dearborn St. The result is very different.

      • 2 years ago
  • afterhourstech
    Contest Holder
    • 2 years ago

    Hello contestants, I have a tip: try adding deterministic calculations in front of the learning algorithm. That should improve results.

    • 2 years ago
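
As an illustration of the tip above, a deterministic step placed in front of the learning algorithm could be as simple as a normalizer that upper-cases the text, strips punctuation, collapses whitespace, and maps spelled-out directions and suffixes to short forms. The sketch below is an assumption about what such a step might look like, not part of the posted program; the class name and the small abbreviation table are hypothetical and would need to be replaced by the project's own standard.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class AddressNormalizer {

    // Hypothetical abbreviation table; a real one would follow the project's address standard.
    private static final Map<String, String> ABBREVIATIONS = Map.of(
        "NORTH", "N", "SOUTH", "S", "EAST", "E", "WEST", "W",
        "STREET", "ST", "AVENUE", "AVE", "BOULEVARD", "BLVD", "ROAD", "RD");

    /** Upper-cases, strips punctuation, collapses whitespace, and abbreviates known tokens. */
    static String normalize(String raw) {
        String cleaned = raw.toUpperCase(Locale.ROOT).replaceAll("[^A-Z0-9 ]", " ").trim();
        List<String> tokens = new ArrayList<>();
        for (String token : cleaned.split("\\s+")) {
            tokens.add(ABBREVIATIONS.getOrDefault(token, token));
        }
        return String.join(" ", tokens);
    }

    public static void main(String[] args) {
        // The learner would then be trained and queried on the normalized strings only.
        System.out.println(normalize("200 n. dearborn  street"));
        System.out.println(normalize("South Michigan Avenue"));
    }
}
```

Token-by-token replacement is deliberately naive: it would also abbreviate a street whose name is itself a direction or suffix word, so position-aware rules would be needed before relying on it.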
  • ARFarrand
    • 2 years ago

    Hi, can I just clarify whether it is solely the long and lat you are wanting to quality check, or whether it is also the cleansed addresses?

    • 2 years ago
    1. afterhourstech
      Contest Holder
      • 2 years ago

      The winning program should produce 3 outputs: cleansed address, latitude, and longitude. I posted a Java program which outputs the cleansed address only. So it only does 30% of what I want. Does that make sense?

      • 2 years ago
    2. afterhourstech
      Contest Holder
      • 2 years ago

      I mean cleansed address, cleansed latitude, and cleansed longitude.

      • 2 years ago
  • afterhourstech
    Contest Holder
    • 2 years ago

    Hello Contestants. I am surprised about the low number of questions. Please feel free to ask any questions. I don't mind helping you get started.

    • 2 years ago
  • djpatra
    • 2 years ago

    Dear contest holder, thanks for answering my queries. I am now all set to start. Thanks,

    • 2 years ago
  • afterhourstech
    Contest Holder
    • 2 years ago

    Hello Contestants. I would like to clarify the urgency of posting a solution. In case of similar solutions by different contestants, the one POSTED FIRST will be given preference.

    • 2 years ago
  • afterhourstech
    Contest Holder
    • 2 years ago

    The code that I posted only processes street name and ignores latitude and longitude. That is the biggest problem with the code right now.

    • 2 years ago
  • afterhourstech
    Contest Holder
    • 2 years ago

    Great. Do you have any questions about this project?

    • 2 years ago
