PDF document structure extraction

  • Status Closed
  • Budget $30 - $100 USD

Project Description

From a PDF file, build an XML/HTML file that extracts all text sequentially AND wraps tags around the title, heading levels, paragraphs, headers/footers, side notes, text boxes and, ideally, tables.

The documents to parse are mostly offering documents from banks. They contain mainly text, sometimes tables, and are mostly in portrait orientation.

Scientific papers found on the internet are easy to obtain and possibly simpler, so they make a good first set of documents for your testing.

Development should be done in Perl or Java, and the result must run on Windows.
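
To make the tagging requirement above more concrete, here is a minimal Java sketch of one possible approach. It is not a specification from this posting. It assumes that some PDF library has already produced the text lines in reading order, together with each line's font size and vertical position on the page. The `Line` class, the `StructureTagger` name, the tag names and every threshold are hypothetical, chosen only for illustration. Side notes, text boxes and tables are not handled.

```java
import java.util.List;

/** Sketch: tags already-extracted text lines by font size and page position. */
final class StructureTagger {

    /** One text line as a PDF library might report it (hypothetical shape). */
    static final class Line {
        final String text;
        final double fontSize;   // in points
        final double yFraction;  // 0.0 = top of page, 1.0 = bottom of page

        Line(String text, double fontSize, double yFraction) {
            this.text = text;
            this.fontSize = fontSize;
            this.yFraction = yFraction;
        }
    }

    /** Picks a tag name; all thresholds are assumptions, not measured values. */
    static String classify(Line line, double bodySize) {
        if (line.yFraction < 0.05) return "header";  // top 5% of the page
        if (line.yFraction > 0.95) return "footer";  // bottom 5% of the page
        double ratio = line.fontSize / bodySize;     // size relative to body text
        if (ratio >= 1.8) return "title";
        if (ratio >= 1.4) return "h1";
        if (ratio >= 1.15) return "h2";
        return "p";
    }

    /** Escapes the characters that are special in XML text content. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Emits the lines in the order given, each wrapped in its tag. */
    static String toXml(List<Line> lines, double bodySize) {
        StringBuilder sb = new StringBuilder("<document>\n");
        for (Line line : lines) {
            String tag = classify(line, bodySize);
            sb.append("  <").append(tag).append('>')
              .append(escape(line.text))
              .append("</").append(tag).append(">\n");
        }
        return sb.append("</document>\n").toString();
    }
}
```

A real solution would probably estimate the body font size from the document itself (for example, the most frequent size) and merge consecutive lines into paragraphs. The sketch keeps one tag per line to stay short.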
