NOTE: Please apply only if you have experience with low-level PDF manipulation (information extraction) using open-source libraries such as iText, PDFTextStream or similar. It is NOT sufficient to just save the PDF as a text file or XML and manipulate the interim files, as formatting-specific information will be required to find a suitable solution.
Problem statement: I need an automated solution in Java, with source code, which takes a set of PDF documents and extracts specific content (requirements) into tagged XML. The PDF documents all have a similar structure: each contains a bookmarked table of contents pointing to sections in the document. Each section then typically contains one or several requirement statements, which are sequentially numbered paragraphs. The objective is to correctly extract these requirements (numbered paragraphs) and classify them according to the level of the section heading.
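To make the expected output more concrete, here is a minimal sketch of how the tagged XML could be assembled with the DOM classes that ship with the JDK. The element and attribute names (`requirements`, `section`, `requirement`, `level`, `number`, `title`) and the class name are assumptions for illustration only; the actual tag set is open for discussion.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Sketch only: tag and attribute names are assumed, not prescribed.
class RequirementXmlSketch {

    /** Creates an empty document with a <requirements> root element. */
    static Document newDocument() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            doc.appendChild(doc.createElement("requirements"));
            return doc;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Appends a <section> carrying the bookmark level and title. */
    static Element addSection(Document doc, Element parent, int level, String title) {
        Element section = doc.createElement("section");
        section.setAttribute("level", Integer.toString(level));
        section.setAttribute("title", title);
        parent.appendChild(section);
        return section;
    }

    /** Appends a <requirement> that inherits the level of its section. */
    static void addRequirement(Document doc, Element section, int number, String text) {
        Element req = doc.createElement("requirement");
        req.setAttribute("number", Integer.toString(number));
        req.setAttribute("level", section.getAttribute("level"));
        req.setTextContent(text);
        section.appendChild(req);
    }

    /** Serializes the document to a string without the XML declaration. */
    static String toXml(Document doc) {
        try {
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

In this sketch each requirement copies the level of its enclosing section, which is one possible way to express "classified according to the level of the section heading".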
The code will likely have to use the bookmarks as an initial step to extract the high-level structure, and then use a combination of text position (margin from the left), text style to identify requirements (number-dot-spaces), and the sequential nature of requirement paragraphs to correctly extract the information. Furthermore, headers and footnotes will need to be correctly separated out, which could be done using the position on the page as well as differences in font size.
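The heuristics described above could be combined roughly as follows. This is a sketch under stated assumptions, not a finished design: `TextLine` is a hypothetical stand-in for whatever positioned text the chosen PDF library reports, headers and footnotes are assumed to lie outside a vertical body band or to use a smaller font than body text, and all thresholds are passed in by the caller. Section headings are assumed to be handled separately via the bookmarks.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch only: every name and threshold here is an assumption.
class RequirementSketch {

    /** One line of text with its left x, baseline y and font size (points). */
    static final class TextLine {
        final String text; final float x; final float y; final float fontSize;
        TextLine(String text, float x, float y, float fontSize) {
            this.text = text; this.x = x; this.y = y; this.fontSize = fontSize;
        }
    }

    /** A numbered requirement and its accumulated paragraph text. */
    static final class Requirement {
        final int number;
        final StringBuilder text = new StringBuilder();
        Requirement(int number, String firstLine) {
            this.number = number;
            text.append(firstLine);
        }
    }

    // "number-dot-spaces" anchor, e.g. "12.  The system shall ..."
    private static final Pattern ANCHOR = Pattern.compile("^(\\d+)\\.\\s+(.*)$");

    static List<Requirement> extract(List<TextLine> lines, float leftMargin,
                                     float tolerance, float bodyFontSize,
                                     float minY, float maxY) {
        List<Requirement> out = new ArrayList<>();
        Requirement current = null;
        for (TextLine line : lines) {
            // Skip headers/footnotes: outside the body band or in a smaller font.
            boolean inBody = line.y >= minY && line.y <= maxY
                    && line.fontSize >= bodyFontSize - 0.5f;
            if (!inBody) continue;

            Matcher m = ANCHOR.matcher(line.text);
            boolean atMargin = Math.abs(line.x - leftMargin) <= tolerance;
            if (m.matches() && atMargin) {
                int n = Integer.parseInt(m.group(1));
                // Accept the first anchor seen, then only the next number in sequence.
                if (current == null || n == current.number + 1) {
                    current = new Requirement(n, m.group(2));
                    out.add(current);
                    continue;
                }
            }
            // Anything else in the body continues the current requirement.
            if (current != null) current.text.append(' ').append(line.text.trim());
        }
        return out;
    }
}
```

Requiring both the left-margin match and the next number in sequence is meant to keep indented numbered lists inside a requirement from being mistaken for new requirements.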
The code should be able to handle a set of documents, all of which follow similar structures, i.e. bookmarked sections, number-dot-spaces style for requirements, sequentially numbered requirements, left-aligned anchors for the beginning of requirements, a difference in font size between body text and footer / header, etc. If any differences across documents turn out to be critical for proper extraction (this needs to be verified), they should be parameterized.
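One way to keep such document-specific differences out of the extraction logic is to collect them in a small parameter object loaded from `java.util.Properties`. The sketch below is illustrative only: the property keys, the class name and the default values are all assumptions, and the defaults are placeholders rather than values measured from the sample documents.

```java
import java.util.Properties;

// Sketch only: keys and defaults are hypothetical placeholders.
class ExtractionParamsSketch {
    final float leftMargin;       // expected x of a requirement anchor, in points
    final float marginTolerance;  // allowed deviation from leftMargin
    final float bodyFontSize;     // smaller text is treated as header/footnote
    final float bodyMinY;         // lower edge of the body band
    final float bodyMaxY;         // upper edge of the body band
    final String anchorRegex;     // pattern for "number-dot-spaces"

    private ExtractionParamsSketch(Properties p) {
        leftMargin = number(p, "leftMargin", "72");
        marginTolerance = number(p, "marginTolerance", "2");
        bodyFontSize = number(p, "bodyFontSize", "11");
        bodyMinY = number(p, "bodyMinY", "60");
        bodyMaxY = number(p, "bodyMaxY", "780");
        anchorRegex = p.getProperty("anchorRegex", "^(\\d+)\\.\\s+(.*)$");
    }

    /** Builds parameters, falling back to the default for any missing key. */
    static ExtractionParamsSketch from(Properties p) {
        return new ExtractionParamsSketch(p);
    }

    private static float number(Properties p, String key, String fallback) {
        return Float.parseFloat(p.getProperty(key, fallback));
    }
}
```

A per-document properties file could then override only the values that differ, leaving everything else at the shared defaults.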
Please note: to be successful, this project requires good experience with using Java PDF libraries such as iText, PDFTextStream or similar to manipulate PDF files at a granular level. It requires a good understanding of PDF structures as well as Java programming. The specific library is not prescribed, but it needs to be open-source.
The project only requires the core code of loading, processing and XML output. Command line or in-code parameterization or input is sufficient, no user interface is required.
A number of sample PDF documents to be parsed are attached. More available on request.
10 freelancers are bidding on average $456 for this job
Hi, I am an expert in using iText. Previously I used iText to extract very complex structures. Let me know if you are interested in working with me. Thanks.
I have done work with PDF and know about the structure. That work was PHP to PDF, but I also have great skills in Java and would like to work on this project. Looking forward to working with you.
I have 2 years of experience in Java development. My last project involved similar work, so I can assure you of the success of this project. In my last project I used the itextPdf library for PDF parsing and Javadoc for XML.