Data extraction from 3 different sources: PDF, HTML, and Word Files

  • Status Closed
  • Budget $30 - $250 USD
  • Total Bids 9

Project Description

I need an interface that lets a user upload a file in PDF, Word (.doc or .docx), or HTML format. Once a file is uploaded, the PHP page will extract information from it and store that information in a comma-separated CSV file. Although the three file types differ in format, the same fields will be extracted from each, which produces consistent output in the CSV file.

I have attached sample HTML, Word, and PDF files to the project for reference. The following information needs to be extracted from each document:

● Class name: e.g. Sr. Puppy (9-12 Months) – Male

● Armband number – 2 to 4 digits

● Dog name

● Registration number: alphanumeric or “Listed”

● Date of Birth

● Class Placement: may be blank

● Breeder name: can be more than one name

● Sire and Dam: (parents) format is name of sire X name of dam

● Place of birth: Canada or Elsewhere

● Owner name: can be more than 1 person

● Agent name: optional
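
As a rough illustration of the consistent CSV output described above, the sketch below writes one row per dog with one column per field in the list. It is written in Python for brevity, although the project itself calls for PHP. The column names and the `records_to_csv` helper are assumptions made for this example, not part of the project specification.

```python
import csv
import io

# Hypothetical column names, one per field listed in the project description.
CSV_COLUMNS = [
    "class_name", "armband_number", "dog_name", "registration_number",
    "date_of_birth", "class_placement", "breeder_name", "sire_and_dam",
    "place_of_birth", "owner_name", "agent_name",
]


def records_to_csv(records):
    """Serialize a list of dicts to CSV text; missing fields become empty cells."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=CSV_COLUMNS, restval="", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


if __name__ == "__main__":
    # Optional fields (placement, agent) are simply left out of the dict.
    print(records_to_csv([{"armband_number": "102", "owner_name": "A, B"}]))
```

Because several fields can hold more than one name separated by commas (breeder, owner), a CSV writer that quotes such values is likely safer than joining strings by hand.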

Using the Word document format as an example, the following shows what is to be extracted:

(Section from Word Document)

Sr. Puppy (9-12 Months) - Male

102 GRASSRIDGE I AM A ROCK, AE499458, 04-Mar-2013

1ST Breeders: Denise Cranna. Ch. Malhaven Skyrockets In Flight x Ch. Grassridge Heavenly Grace. Canada. Owner: Karen IBBITSON, Denise CRANNA. Agent: Ingrid WINKLER

Information to be extracted:

Class Name: Sr. Puppy (9-12 Months) – Male

Armband number: 102

Dog name: Grassridge I Am A Rock

Registration No: AE499458

DOB: 04-March-2013

Class Placement: 1st

Breeder Name: Denise Cranna

Sire & Dam: Ch. Malhaven Skyrockets In Flight x Ch. Grassridge Heavenly Grace

Place of Birth: Canada

Owner: Karen Ibbitson, Denise Cranna

Agent: Ingrid Winkler
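
The mapping from the sample section to the extracted fields could be sketched with two regular expressions, one for the entry line and one for the details line. This is a minimal Python sketch based only on the single sample record above; the function name, dictionary keys, and patterns are assumptions, and real catalogues may need looser rules (for example, breeder names containing initials such as "J. Smith" would confuse the period-based splitting). It assumes the text has already been pulled out of the PDF, Word, or HTML file.

```python
import re
from datetime import datetime

# Entry line: armband (2 to 4 digits), dog name, registration number, date of birth.
ENTRY_RE = re.compile(
    r"^(?P<armband>\d{2,4})\s+(?P<dog>.+),\s*(?P<reg>[A-Za-z0-9]+),\s*"
    r"(?P<dob>\d{2}-[A-Za-z]{3}-\d{4})$"
)
# Details line: optional placement, breeder(s), "sire x dam", birthplace,
# owner(s), optional agent. Segments are assumed to be separated by ". ".
DETAIL_RE = re.compile(
    r"^(?:(?P<placement>\S+)\s+)?Breeders?:\s*(?P<breeder>.+?)\.\s+"
    r"(?P<parents>.+?\s+x\s+.+?)\.\s+(?P<birthplace>Canada|Elsewhere)\.\s+"
    r"Owners?:\s*(?P<owner>.+?)(?:\.\s+Agent:\s*(?P<agent>.+))?$"
)


def parse_entry(class_line, entry_line, detail_line):
    """Parse one dog's three text lines into a dict of the requested fields."""
    entry = ENTRY_RE.match(entry_line.strip())
    detail = DETAIL_RE.match(detail_line.strip())
    if entry is None or detail is None:
        raise ValueError("lines do not match the assumed layout")
    # "04-Mar-2013" -> "04-March-2013"; month names assume the default C locale.
    dob = datetime.strptime(entry["dob"], "%d-%b-%Y").strftime("%d-%B-%Y")
    return {
        "class_name": class_line.strip(),
        "armband_number": entry["armband"],
        # str.title() is a crude case fix; names like "McDONALD" need more care.
        "dog_name": entry["dog"].title(),
        "registration_number": entry["reg"],
        "date_of_birth": dob,
        "class_placement": (detail["placement"] or "").lower(),
        "breeder_name": detail["breeder"],
        "sire_and_dam": detail["parents"],
        "place_of_birth": detail["birthplace"],
        "owner_name": detail["owner"].title(),
        "agent_name": (detail["agent"] or "").title(),
    }


if __name__ == "__main__":
    record = parse_entry(
        "Sr. Puppy (9-12 Months) - Male",
        "102 GRASSRIDGE I AM A ROCK, AE499458, 04-Mar-2013",
        "1ST Breeders: Denise Cranna. Ch. Malhaven Skyrockets In Flight x "
        "Ch. Grassridge Heavenly Grace. Canada. Owner: Karen IBBITSON, "
        "Denise CRANNA. Agent: Ingrid WINKLER",
    )
    for key, value in record.items():
        print(f"{key}: {value}")
```

The same patterns should carry over to PHP's `preg_match` with little change, since both use Perl-style syntax, though PHP writes named groups as `(?P<name>...)` or `(?<name>...)` and requires pattern delimiters.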
