We need a software engineer to help us fix some bad data in a large dataset (5MM+ rows). The basic approach: sort the data, then look for orphan values. Depending on what sits directly above and below an orphan, we may change it, and the replacement value will come from the surrounding data.
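To make the idea concrete, here is a minimal sketch of the orphan check in pandas. This is our illustration only, not the project's actual logic (which we'll explain after awarding); the column names and the `find_orphans` helper are hypothetical. An orphan is a row whose formation differs from both neighbors; a replacement is only suggested when the two neighbors agree.

```python
import pandas as pd

def find_orphans(df, col="formation"):
    # Hypothetical helper: assumes df is already sorted appropriately.
    above = df[col].shift(1)   # value in the row above
    below = df[col].shift(-1)  # value in the row below
    # Orphan: differs from both neighbors (first/last rows excluded)
    orphan = (df[col] != above) & (df[col] != below) & above.notna() & below.notna()
    # A fix is only suggested when the neighbors agree on a value.
    # Rows at a boundary between two formations also get flagged,
    # but with no consensus no change is suggested for them.
    consensus = above.where(above == below)
    return df.assign(is_orphan=orphan, suggested=consensus.where(orphan))

rows = pd.DataFrame({"formation": [
    "Austin Chalk", "Austin", "Austin Chalk",
    "James Lime", "Rodessa", "Massive Anhydrite"]})
out = find_orphans(rows)
# "Austin" gets a suggested fix ("Austin Chalk"); "Rodessa" is an
# orphan but has no consensus, so no suggestion is made.
```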
We will be using geohashes to filter our data. For example, [url removed, login to view]. In this very small geographic area, we see 22 rows of data. The formation names look pretty good; the only orphaned value is Rodessa. We can't really correct it here, because James Lime is below and Massive Anhydrite is above, so there's no consensus on what to change it to. Right now the geohash is set to 9vsn3su55e9, 11 digits, which covers a very, very small area. As we back off the digits to 9vsn3su55e (10 digits, dropping the trailing 9), the rows don't change. We have to go all the way down to 9vsn3s, six digits, before any new rows show up, and even then we are only at 22 rows. [url removed, login to view]
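The "backing off digits" step is just prefix matching, since a shorter geohash is a prefix that covers a larger area. A rough sketch (the `geohash` column name and `rows_in_cell` helper are our assumptions, not the project's schema):

```python
import pandas as pd

def rows_in_cell(df, geohash, digits):
    # Hypothetical helper: keep rows whose geohash starts with the
    # first `digits` characters. Fewer digits = larger area = more rows.
    prefix = geohash[:digits]
    return df[df["geohash"].str.startswith(prefix)]

df = pd.DataFrame({"geohash": [
    "9vsn3su55e9", "9vsn3su55ec", "9vsn3sqq000", "9vsn3tzzzzz"]})
n11 = len(rows_in_cell(df, "9vsn3su55e9", 11))  # exact 11-digit cell
n10 = len(rows_in_cell(df, "9vsn3su55e9", 10))  # drop one digit
n6 = len(rows_in_cell(df, "9vsn3su55e9", 6))    # 9vsn3s: much wider
```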
When we finally get down to five digits (this will happen often), the number of rows grows to 231. [url removed, login to view] Let's walk through an example of a change we will make.
[url removed, login to view] See the "Austin" (no Chalk) formations? These need to be changed to "Austin Chalk." We will explain more of the logic once we award the project, but these are pretty obvious changes. We will be creating a new field to record all our changes; we don't just want to overwrite the raw data, we need before and after values. We will also add a couple of new fields for analysis. One will be the length of the geohash at the time we made the Austin to Austin Chalk change, in this case 5. We will also record the values above and below. To show an example, let's look at a different part of the data.
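As a rough sketch of what we mean by an audited change: the raw value stays untouched, and the correction plus the context used to make it go into new fields. All field names here (`formation_fixed`, `change_from`, `geohash_len_at_change`, etc.) are placeholders for illustration, not the final schema.

```python
import pandas as pd

def apply_fix(df, old="Austin", new="Austin Chalk", geohash_len=5):
    # Hypothetical helper: assumes df is sorted and has a "formation"
    # column. Records the change instead of overwriting raw data.
    out = df.copy()
    mask = out["formation"] == old
    out["formation_fixed"] = out["formation"].where(~mask, new)  # corrected value
    out["change_from"] = out["formation"].where(mask)            # before value
    out["change_to"] = None
    out.loc[mask, "change_to"] = new                             # after value
    out["geohash_len_at_change"] = None
    out.loc[mask, "geohash_len_at_change"] = geohash_len         # e.g. 5 digits
    out["value_above"] = out["formation"].shift(1).where(mask)   # neighbor above
    out["value_below"] = out["formation"].shift(-1).where(mask)  # neighbor below
    return out

df = pd.DataFrame({"formation": ["Austin Chalk", "Austin", "Austin Chalk"]})
fixed = apply_fix(df)
```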
The project gets a little more complicated after that, but not much. The full description is too long for the description box here, so I've included it in a txt file in the attachments, along with a sample CSV. Please read the entire description before bidding.