PDF content extraction based on structural characteristics

NOTE: Please only apply if you have experience with low-level PDF manipulation (information extraction) using open-source libraries such as iText, PDFTextStream or similar. It is NOT sufficient to just save the PDF as a text file or XML and manipulate the interim files, as formatting-specific information will be required to find a suitable solution.

Problem statement: I need an automated solution in Java, delivered with source code, which takes a set of PDF documents and extracts specific content (requirements) into tagged XML. The PDF documents all have a similar structure: they contain a PDF-bookmarked table of contents pointing to sections in the document. Each section typically contains one or several requirement statements, which are sequentially numbered paragraphs. The objective is to correctly extract these requirements (numbered paragraphs) and classify them according to the level of the section heading.
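To make the expected output more concrete, here is a minimal sketch of the XML-writing step using only the JDK. The posting does not define an XML schema, so the element and attribute names (`requirements`, `section`, `requirement`, `level`, `title`, `number`) and the `Section` and `Requirement` classes are assumptions for illustration only.

```java
import java.io.StringWriter;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

/** Illustrative sketch: all class, element and attribute names here are assumed. */
class XmlOutputSketch {

    /** One numbered requirement paragraph (hypothetical type). */
    static final class Requirement {
        final int number;
        final String text;
        Requirement(int number, String text) {
            this.number = number;
            this.text = text;
        }
    }

    /** One bookmarked section with its heading level (hypothetical type). */
    static final class Section {
        final int level;
        final String title;
        final List<Requirement> requirements;
        Section(int level, String title, List<Requirement> requirements) {
            this.level = level;
            this.title = title;
            this.requirements = requirements;
        }
    }

    /** Serializes the sections to an XML string; escaping is handled by the DOM serializer. */
    static String toXml(List<Section> sections) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element root = doc.createElement("requirements");
            doc.appendChild(root);
            for (Section s : sections) {
                Element sec = doc.createElement("section");
                // The heading level is what classifies each requirement.
                sec.setAttribute("level", Integer.toString(s.level));
                sec.setAttribute("title", s.title);
                root.appendChild(sec);
                for (Requirement r : s.requirements) {
                    Element req = doc.createElement("requirement");
                    req.setAttribute("number", Integer.toString(r.number));
                    req.setTextContent(r.text);
                    sec.appendChild(req);
                }
            }
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (ParserConfigurationException | TransformerException e) {
            throw new IllegalStateException("XML generation failed", e);
        }
    }
}
```

Nesting requirements under their section is only one possible layout; a flat list of requirements that each carry a level attribute would meet the stated objective equally well.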

The code will likely have to use the bookmarks as an initial step to extract the high-level structure, and then use a combination of text position (margin from the left), text style (number-dot-spaces) to identify requirements, and the sequential nature of requirement paragraphs to correctly extract the information. Furthermore, headers and footnotes will need to be correctly separated out, which could be done using the position on the page as well as differences in font sizes.
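The heuristics described above could be sketched roughly as follows. This is not tied to any particular PDF library: `TextLine` is a hypothetical stand-in for whatever per-line text, position and font size a library such as iText reports, and all thresholds are assumed parameters. The sketch covers only the per-line classification, not the bookmark step.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative sketch: the types and thresholds below are assumptions, not a library API. */
class RequirementSketch {

    /** One line of text with layout data, as a PDF library might report it (hypothetical). */
    static final class TextLine {
        final String text;
        final float x;        // distance from the left page edge
        final float y;        // vertical position, PDF-style (origin at the bottom)
        final float fontSize;
        TextLine(String text, float x, float y, float fontSize) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.fontSize = fontSize;
        }
    }

    /** One numbered requirement, possibly spanning several lines. */
    static final class Requirement {
        final int number;
        private final StringBuilder body;
        Requirement(int number, String firstLine) {
            this.number = number;
            this.body = new StringBuilder(firstLine.trim());
        }
        String text() {
            return body.toString();
        }
    }

    // "number-dot-spaces" anchor, e.g. "12.  The system shall ..."
    private static final Pattern ANCHOR = Pattern.compile("^(\\d{1,6})\\.\\s+(.*)$");

    static List<Requirement> extract(List<TextLine> lines,
                                     float leftMargin, float marginTolerance,
                                     float bodyFontSize, float fontTolerance,
                                     float minY, float maxY) {
        List<Requirement> result = new ArrayList<>();
        Requirement current = null;
        for (TextLine line : lines) {
            // Headers and footers: outside the body band, or not set in the body font size.
            if (line.y < minY || line.y > maxY) continue;
            if (Math.abs(line.fontSize - bodyFontSize) > fontTolerance) continue;

            Matcher m = ANCHOR.matcher(line.text);
            boolean atMargin = Math.abs(line.x - leftMargin) <= marginTolerance;
            if (m.matches() && atMargin) {
                int n = Integer.parseInt(m.group(1));
                // Accept the first anchor seen, then only the next number in sequence.
                if (current == null || n == current.number + 1) {
                    current = new Requirement(n, m.group(2));
                    result.add(current);
                    continue;
                }
            }
            // Anything else in the body band continues the current requirement.
            if (current != null) {
                current.body.append(' ').append(line.text.trim());
            }
        }
        return result;
    }
}
```

With this rule, an indented "1." inside a requirement, or a left-aligned number that breaks the sequence, is treated as continuation text rather than as a new requirement. Whether that is the right trade-off would need to be checked against the sample documents.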

The code should be able to handle a set of documents, all of which follow similar structures, i.e. bookmarked sections, number-dot-spaces style for requirements, sequentially numbered requirements, left-aligned anchors for the beginning of requirements, a difference in font size between body text and footer / header, etc. If any differences across documents turn out to be critical for proper extraction (this needs to be verified), they should be parameterized.
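One way the parameterization might look is a small settings object read from `java.util.Properties`, so that per-document differences are kept out of the extraction code. The property keys and default values below are invented for illustration and would have to be calibrated against the sample PDFs.

```java
import java.util.Properties;
import java.util.regex.Pattern;

/** Illustrative sketch: keys and defaults are assumptions, not measured values. */
class ExtractionParams {
    final float leftMargin;       // x position where requirement anchors start
    final float marginTolerance;  // allowed deviation from leftMargin
    final float bodyFontSize;     // font size of body text
    final float fontTolerance;    // allowed deviation from bodyFontSize
    final Pattern anchor;         // number-dot-spaces pattern

    private ExtractionParams(Properties p) {
        leftMargin = number(p, "leftMargin", "72");
        marginTolerance = number(p, "marginTolerance", "2");
        bodyFontSize = number(p, "bodyFontSize", "11");
        fontTolerance = number(p, "fontTolerance", "0.5");
        anchor = Pattern.compile(p.getProperty("anchorPattern", "^(\\d+)\\.\\s+(.*)$"));
    }

    /** Builds the settings, falling back to the assumed defaults for missing keys. */
    static ExtractionParams from(Properties p) {
        return new ExtractionParams(p);
    }

    private static float number(Properties p, String key, String fallback) {
        return Float.parseFloat(p.getProperty(key, fallback).trim());
    }
}
```

The `Properties` object could be filled from a file with `Properties.load` or from command-line arguments, which would keep the solution free of any user interface.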

Please note: to be successful, this project requires good experience with using Java PDF libraries such as iText, PDFTextStream or similar, to manipulate PDF files at a granular level. It requires good understanding of PDF structures as well as Java programming. The specific library is not prescribed, but needs to be open-source.

The project only requires the core code for loading, processing and XML output. Command-line or in-code parameterization and input are sufficient; no user interface is required.

A number of sample PDF documents to be parsed are attached. More available on request.

Skills: Java, PDF, Software Architecture, Software Development


About the Employer:
( 4 reviews ) Doha, Switzerland

Project ID: #4829126

10 freelancers are bidding on average $456 for this job


Hi, I am experienced with iText and interested in this project. Thank you.

$773 USD in 15 days
(202 Reviews)

Hi, I am an expert in using iText. Previously I used iText to extract very complex structures. Let me know if you are interested in working with me. Thanks.

$515 USD in 5 days
(47 Reviews)

Hello, I can help you, thanks

$412 USD in 12 days
(36 Reviews)

Expert in Java.

$300 USD in 3 days
(6 Reviews)

I have experience with your project requirements; check PM.

$251 USD in 13 days
(3 Reviews)

This is not an easy assignment. I will do some research first before doing this. If I am sure, I will complete this task in time.

$555 USD in 22 days
(1 Review)

I have done work with PDF and know about the structure. Though that work was with PHP to PDF, I also have great skills in Java and would like to work on this project. Looking forward to working with you.

$333 USD in 5 days
(0 Reviews)

Java expert with 14 years of experience.

$310 USD in 5 days
(0 Reviews)

I have 2 years of experience in Java development. My last project was similar, so I can assure you of the success of this project. In my last project I used the itextPdf library for PDF parsing, and Javadoc for XML.

$250 USD in 7 days
(0 Reviews)

I am a Java developer with 5 years of experience. I have experience in using iText and would like to take this project further.

$555 USD in 10 days
(0 Reviews)

Hi, I have 3 years of experience in text mining.

$555 USD in 3 days
(0 Reviews)