PDF content extraction based on structural characteristics
This project received 11 bids from talented freelancers with an average bid price of $437 USD.
Project Budget: $250 - $750 USD
NOTE: Please only apply if you have experience with low-level PDF manipulation (information extraction) using open-source libraries such as iText, PDFTextStream or similar. It is NOT sufficient to save the PDF as a text file or XML and manipulate the interim files, because formatting-specific information will be required to find a suitable solution.
Problem statement: I need an automated solution in Java, with source code, which takes a set of PDF documents and extracts specific content (requirements) into tagged XML. The PDF documents all have a similar structure: they contain a bookmarked table of contents pointing to sections in the document. Each section then typically contains one or several requirement statements, which are sequentially numbered paragraphs. The objective is to correctly extract these requirements (numbered paragraphs) and classify them according to the level of the section heading.
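To make the expected output more concrete, here is a minimal sketch of what the tagged XML for one section could look like, written with the JDK's built-in XMLStreamWriter. The element and attribute names (section, requirement, level, title, number) and the RequirementXml class are assumptions for illustration only, not a prescribed schema. The level is assumed to be the nesting depth of the section's bookmark.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

/** Hypothetical helper: serializes one section and its requirements as XML. */
final class RequirementXml {

    /**
     * @param sectionTitle title taken from the bookmark
     * @param level        assumed to be the bookmark nesting depth (1 = top level)
     * @param firstNumber  number of the first requirement in this section
     * @param requirements requirement texts, in document order
     */
    static String toXml(String sectionTitle, int level, int firstNumber, String[] requirements) {
        StringWriter sink = new StringWriter();
        try {
            XMLStreamWriter w = XMLOutputFactory.newInstance().createXMLStreamWriter(sink);
            w.writeStartElement("section");
            w.writeAttribute("level", Integer.toString(level));
            w.writeAttribute("title", sectionTitle);
            for (int i = 0; i < requirements.length; i++) {
                w.writeStartElement("requirement");
                // Requirements are sequentially numbered, so the number follows from the index.
                w.writeAttribute("number", Integer.toString(firstNumber + i));
                w.writeCharacters(requirements[i]); // escapes &, < and > as needed
                w.writeEndElement();
            }
            w.writeEndElement();
            w.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not write XML", e);
        }
        return sink.toString();
    }
}
```

Calling RequirementXml.toXml("Security", 2, 12, texts) would produce a single section element containing one requirement element per entry in texts, numbered from 12 upward.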
The code will likely have to use the bookmarks as an initial step to extract the high-level structure, and then use a combination of text position (margin from the left), text style (number-dot-spaces) and the sequential nature of requirement paragraphs to identify and correctly extract the requirements. Furthermore, headers and footnotes will need to be correctly separated out, which could be done using the position on the page as well as differences in font sizes.
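The heuristics above could be combined roughly as in the following sketch. It assumes that lines of text, with their left position, vertical position and font size, have already been pulled out of the PDF with a position-aware library. The TextLine, Requirement and RequirementExtractor types and all threshold values are hypothetical and only illustrate the idea; real values would have to be measured on the sample documents.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical: one extracted line of text with its layout data. */
final class TextLine {
    final float x;        // left edge, in points
    final float y;        // distance of the baseline from the page bottom, in points
    final float fontSize; // in points
    final String text;
    TextLine(float x, float y, float fontSize, String text) {
        this.x = x; this.y = y; this.fontSize = fontSize; this.text = text;
    }
}

/** Hypothetical: a numbered requirement paragraph. */
final class Requirement {
    final int number;
    final StringBuilder text = new StringBuilder();
    Requirement(int number, String firstLine) { this.number = number; text.append(firstLine); }
}

final class RequirementExtractor {
    // "number-dot-spaces": digits, a dot, at least one whitespace character, then the body.
    private static final Pattern ANCHOR = Pattern.compile("^(\\d{1,6})\\.\\s+(.*)$");

    // Assumed thresholds, for illustration only.
    static final float ANCHOR_X = 72f, X_TOLERANCE = 2f;
    static final float BODY_FONT_SIZE = 11f, FONT_TOLERANCE = 0.5f;
    static final float FOOTER_MAX_Y = 60f, HEADER_MIN_Y = 760f;

    /** Body text sits between the header and footer bands and uses the body font size. */
    static boolean isBodyText(TextLine l) {
        boolean inPageBody = l.y >= FOOTER_MAX_Y && l.y <= HEADER_MIN_Y;
        boolean bodyFont = Math.abs(l.fontSize - BODY_FONT_SIZE) <= FONT_TOLERANCE;
        return inPageBody && bodyFont;
    }

    static List<Requirement> extract(List<TextLine> lines) {
        List<Requirement> out = new ArrayList<>();
        Requirement current = null;
        int expected = -1; // -1: accept any number for the first requirement
        for (TextLine l : lines) {
            if (!isBodyText(l)) continue; // skips headers, footers and other non-body text
            Matcher m = ANCHOR.matcher(l.text);
            boolean atAnchor = Math.abs(l.x - ANCHOR_X) <= X_TOLERANCE;
            if (atAnchor && m.matches()) {
                int n = Integer.parseInt(m.group(1));
                if (expected == -1 || n == expected) { // enforce sequential numbering
                    current = new Requirement(n, m.group(2));
                    out.add(current);
                    expected = n + 1;
                    continue;
                }
            }
            // Anything else is treated as a continuation of the current requirement.
            if (current != null) current.text.append(' ').append(l.text.trim());
        }
        return out;
    }
}
```

The sequence check is what keeps a numbered list inside a requirement from being mistaken for a new requirement: a line that looks like an anchor but carries an unexpected number is simply appended to the current paragraph.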
The code should be able to handle a set of documents, all of which follow similar structures, i.e. bookmarked sections, number-dot-spaces style for requirements, sequentially numbered requirements, left-aligned anchors for the beginning of requirements, a difference in font size between body text and footer / header, etc. If any differences across documents turn out to be critical for proper extraction (this needs to be verified), they should be parameterized.
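One possible way to handle such parameterization is a small settings object loaded from a standard java.util.Properties file, with defaults for anything not specified. The ExtractionParams class, the property keys and the default values below are assumptions made for illustration; the actual set of parameters would depend on which differences between documents prove to matter.

```java
import java.util.Properties;
import java.util.regex.Pattern;

/** Hypothetical per-document-set settings with illustrative defaults. */
final class ExtractionParams {
    final float anchorX;         // left position of requirement anchors, in points
    final float bodyFontSize;    // font size of body text, in points
    final Pattern anchorPattern; // "number-dot-spaces" by default

    private ExtractionParams(float anchorX, float bodyFontSize, Pattern anchorPattern) {
        this.anchorX = anchorX;
        this.bodyFontSize = bodyFontSize;
        this.anchorPattern = anchorPattern;
    }

    /** Reads the settings, falling back to a default for every missing key. */
    static ExtractionParams from(Properties p) {
        return new ExtractionParams(
                Float.parseFloat(p.getProperty("anchor.x", "72")),
                Float.parseFloat(p.getProperty("body.fontSize", "11")),
                Pattern.compile(p.getProperty("anchor.pattern", "^\\d+\\.\\s+.*")));
    }
}
```

A document set with a wider left margin could then be processed by supplying only anchor.x=90 in its properties file and leaving the other keys at their defaults.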
Please note: to be successful, this project requires good experience with using Java PDF libraries such as iText, PDFTextStream or similar, to manipulate PDF files at a granular level. It requires good understanding of PDF structures as well as Java programming. The specific library is not prescribed, but needs to be open-source.
The project only requires the core code for loading, processing and XML output. Command-line or in-code parameterization and input are sufficient; no user interface is required.
A number of sample PDF documents to be parsed are attached. More available on request.