For this small development project the following skills are absolutely essential:
- Java programming skills
- Good understanding of PDF file structure
- Experience with manipulating PDF files using Java
- Experience with using an open source Java PDF library such as PDF Clown, iText, PDFTextStream or others
The objective is to create the core component of an automated solution which takes bookmarked PDF files and extracts the numbered, itemized paragraphs, together with the outline text as a classifier, into a machine-readable format (e.g. CSV, XML, MS Access table).
The solution only needs to work on a specific set of PDF files, all of which are bookmarked and use the same document structure. Two sets of files are attached. The first pack ("Requirements and Sample.rar") contains a file called "Explanation and requirements.pdf" which outlines the requirements and explains the context / objectives. It also contains sample Java code (NOT WORKING as expected yet) for further illustration, plus the input PDF file used in the code. The second pack ("Sample PDF documents.rar") contains a number of PDF files which can be used for testing the solution. Further test files can be made available on request.
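
To illustrate the kind of output that is meant, here is a minimal sketch of the final step only: turning already-extracted text into CSV rows. It assumes that a PDF library has already delivered the outline (bookmark) path and the text lines below that bookmark; reading the PDF itself is not shown. The class and method names are hypothetical, and the numbering pattern ("1.", "2.1)", etc.) is an assumption that would need to be adjusted to the attached sample files.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any PDF library.
final class ParagraphCsvSketch {

    // Assumed numbering style: digits with optional dotted sub-levels,
    // closed by "." or ")", e.g. "1.", "2.1.", "3)".
    private static final Pattern NUMBERED =
            Pattern.compile("^\\s*(\\d+(?:\\.\\d+)*)[.)]\\s+(.*)$");

    // Groups text lines into numbered paragraphs and tags each one with the
    // outline path. Each row is {outlinePath, number, paragraphText}.
    public static List<String[]> extractParagraphs(String outlinePath, List<String> lines) {
        List<String[]> rows = new ArrayList<>();
        String number = null;
        StringBuilder text = new StringBuilder();
        for (String line : lines) {
            Matcher m = NUMBERED.matcher(line);
            if (m.matches()) {
                // A new numbered item starts: flush the previous one first.
                if (number != null) {
                    rows.add(new String[] {outlinePath, number, text.toString()});
                }
                number = m.group(1);
                text.setLength(0);
                text.append(m.group(2).trim());
            } else if (number != null && !line.trim().isEmpty()) {
                // Continuation line of the current paragraph.
                text.append(' ').append(line.trim());
            }
            // Lines before the first numbered item are ignored.
        }
        if (number != null) {
            rows.add(new String[] {outlinePath, number, text.toString()});
        }
        return rows;
    }

    // Quotes every field and doubles embedded quotes (RFC 4180 style).
    public static String toCsvRow(String... fields) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < fields.length; i++) {
            if (i > 0) {
                sb.append(',');
            }
            sb.append('"').append(fields[i].replace("\"", "\"\"")).append('"');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Made-up sample input, standing in for text read from a PDF.
        List<String> lines = Arrays.asList(
                "Some heading text",
                "1. First requirement,",
                "continued on a second line.",
                "2. Second requirement.");
        for (String[] row : extractParagraphs("Chapter 1 > Scope", lines)) {
            System.out.println(toCsvRow(row));
        }
    }
}
```

Quoting every field keeps commas and quotation marks inside a paragraph from breaking the CSV layout. The same rows could equally be written to XML or an MS Access table.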