Information extraction from bookmarked PDF using Java

This project was awarded to tikumishra for $166 USD.

Project Budget
$30 - $250 USD
Project Description

For this small development project the following skills are absolutely essential:

- Java programming skills

- Good understanding of PDF file structure

- Experience with manipulating PDF files using Java

- Experience with using an open-source PDF Java library such as PDF Clown, iText, PDFTextStream or others

The objective is to create a core component of an automated solution which takes bookmarked PDF files and extracts the numbered, itemized paragraphs, together with the outline (bookmark) text as a classifier, into a machine-readable format (e.g. CSV, XML, or an MS Access table).
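To illustrate the kind of output meant here, the following is a minimal sketch only, not the required implementation. It assumes the page text and the bookmark titles have already been pulled out of the PDF by one of the libraries listed above, and it uses plain Java with no PDF library calls. The class name `NumberedParagraphExtractor`, the `Row` type, the CSV column names and the numbering style ("1.2 Some text") are all assumptions made for illustration; the real documents may number their paragraphs differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: turns already-extracted text lines into classified rows.
class NumberedParagraphExtractor {

    // Assumed numbering style: "1", "1.2" or "1.2.3", optionally followed by a dot,
    // then whitespace and the paragraph text.
    private static final Pattern NUMBERED =
            Pattern.compile("^(\\d+(?:\\.\\d+)*)\\.?\\s+(.+)$");

    // One output row: the bookmark title in effect, the item number, the paragraph text.
    static final class Row {
        final String outline;
        final String number;
        String text;

        Row(String outline, String number, String text) {
            this.outline = outline;
            this.number = number;
            this.text = text;
        }
    }

    // Walks the lines in order. A line equal to a bookmark title switches the
    // classifier; a numbered line starts a new row; any other non-blank line is
    // treated as a continuation of the current row.
    static List<Row> extract(List<String> lines, Set<String> outlineTitles) {
        List<Row> rows = new ArrayList<>();
        String currentOutline = "";
        Row current = null;
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue;
            }
            if (outlineTitles.contains(line)) {
                currentOutline = line;
                current = null; // a heading ends the previous paragraph
                continue;
            }
            Matcher m = NUMBERED.matcher(line);
            if (m.matches()) {
                current = new Row(currentOutline, m.group(1), m.group(2));
                rows.add(current);
            } else if (current != null) {
                current.text = current.text + " " + line;
            }
        }
        return rows;
    }

    // Quotes every field and doubles embedded quotes, as is conventional for CSV.
    static String escape(String field) {
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    // Renders the rows as CSV text with a header line.
    static String toCsv(List<Row> rows) {
        StringBuilder sb = new StringBuilder("outline,number,text\n");
        for (Row r : rows) {
            sb.append(escape(r.outline)).append(',')
              .append(escape(r.number)).append(',')
              .append(escape(r.text)).append('\n');
        }
        return sb.toString();
    }
}
```

Calling `extract` with the text lines and the set of bookmark titles, then passing the result to `toCsv`, yields one CSV row per numbered paragraph with its bookmark title in the first column. Matching a line against the bookmark titles by exact equality is the simplest possible choice; a real solution would more likely use the bookmark destinations (page and position) provided by the PDF library.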

The solution only needs to work on a specific set of PDF files, all of which use the same document structure and are bookmarked. Two sets of files are attached. The first pack ("Requirements and [url removed, login to view]") contains a file called "Explanation and [url removed, login to view]" which outlines the requirements and explains the context and objectives. It also contains sample Java code (NOT WORKING as expected yet) for further illustration, plus the input PDF file used in the code. The second pack ("Sample PDF [url removed, login to view]") contains a number of PDF files which can be used for testing the solution. Further test files can be made available on request.
