Closed

Automated Data Mining/Extraction from Online PDFs

This project received 1 bid from a freelancer, with a bid price of $5001 USD.

Employer working
Project Budget
N/A
Total Bids
1
Project Description

**Full description is attached.**

DO NOT APPLY FOR THIS JOB IF YOU HAVE NOT READ THE ENTIRE DESCRIPTION.
THIS IS AN AUTOMATED DATA MINING/EXTRACTION JOB -- NOT A MANUAL ONE.

We are looking for a contractor with solid experience developing software for data extraction from online PDFs. The PDFs are scanned copies of IRS forms that have been filed by charities, and are available through a single public online source. There are six different types of forms. The IRS scanning process can produce scans of varying quality, and the position of the data can shift from one scanned form to the next.

We want someone who has demonstrated a history of substantial, successful data mining using PDFs and OCR. If you are looking to learn or expand your profile, this is not for you. Fluent English is a must.

The contractor must develop a program that will do the following:

1. Download scanned PDFs of mixed quality from the online source using a list of URLs in a text file provided by the buyer (approximately 300,000 PDFs and URLs).

2. Extract up to ten numeric and text data fields from each PDF using a combination of automated graphical manipulation and OCR. The location of the data on the pages will be different for each of the six types of forms.

3. Incorporate error-checking based on related data fields selected by buyer.

4. Format the data output as a CSV to be uploaded to buyer's SQL database.

5. Provide well-commented source code and an executable. The program will be run on an ongoing basis by the buyer.

6. Deliver written step-by-step operating instructions that a novice user can readily understand and follow.

7. Pass the following accuracy tests when operated by the buyer: Based on 10,000 URLs chosen by the buyer, the program will (a) download 100% of the PDFs and (b) correctly extract from the downloaded PDFs 90% or greater of the designated data fields, with the error-checking identifying all data fields where extraction failed.
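To make requirements 3, 4, and 7(b) more concrete, here is a minimal Python sketch of a cross-field check, CSV output, and a field-accuracy measure. It is an illustration under stated assumptions, not the buyer's specification: the field names (`total_revenue`, `total_expenses`, `net_income`), the rule that revenue minus expenses should equal net income, and the `needs_review` column are all hypothetical, since the posting leaves the actual fields and related-field checks to the buyer. A consistency rule like this can flag many OCR failures, but it would not by itself guarantee that every failed field is identified.

```python
import csv
import io

# Hypothetical column layout; the real fields are selected by the buyer.
FIELDS = ["url", "form_type", "total_revenue", "total_expenses",
          "net_income", "needs_review"]
AMOUNT_FIELDS = ["total_revenue", "total_expenses", "net_income"]


def parse_amount(text):
    """Parse an OCR'd whole-dollar amount such as '$1,234', '-50' or '(500)'.

    Returns an int, or None when the text is not a clean number
    (for example when OCR read a zero as the letter O).
    """
    if text is None:
        return None
    cleaned = text.strip().replace(",", "").replace("$", "")
    negative = False
    if cleaned.startswith("(") and cleaned.endswith(")"):
        negative, cleaned = True, cleaned[1:-1]
    elif cleaned.startswith("-"):
        negative, cleaned = True, cleaned[1:]
    if not cleaned.isdigit():
        return None
    value = int(cleaned)
    return -value if negative else value


def check_record(raw):
    """Build one output row and flag it when the related fields disagree."""
    amounts = {name: parse_amount(raw.get(name)) for name in AMOUNT_FIELDS}
    parsed_ok = all(value is not None for value in amounts.values())
    # Assumed relationship between fields: revenue - expenses == net income.
    consistent = parsed_ok and (
        amounts["total_revenue"] - amounts["total_expenses"]
        == amounts["net_income"]
    )
    row = {"url": raw.get("url", ""), "form_type": raw.get("form_type", "")}
    for name, value in amounts.items():
        row[name] = "" if value is None else value
    row["needs_review"] = not consistent
    return row


def write_csv(rows, stream):
    """Write rows with a header line, ready for a database import."""
    writer = csv.DictWriter(stream, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


def field_accuracy(extracted_rows, expected_rows, field_names):
    """Fraction of designated fields whose extracted value matches the expected one."""
    total = correct = 0
    for got, want in zip(extracted_rows, expected_rows):
        for name in field_names:
            total += 1
            correct += got.get(name) == want.get(name)
    return correct / total if total else 0.0
```

In use, each OCR result would pass through `check_record`, the rows would go to `write_csv` with an open file (opened with `newline=""`), and `field_accuracy(...) >= 0.90` would correspond to the threshold in test 7(b), measured against values verified by hand.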
