I am looking for help understanding the PDFBox library. Please apply only if you have already worked with PDFBox, iText, or other PDF software.
What we need: a utility/jar/class that we can call from our Java web app, which runs under Tomcat with Java 8 on a Linux server (this may affect non-Java solutions).
Problem: we need to extract text from searchable (not scanned) PDFs and preserve text positions. Ideally the library should return words/tokens with their x/y start and end positions, as well as the start and end coordinates of vertical and horizontal line separators. We need to get only the text a user can see; or, if we get the full text, we need a clear indication of which parts are visible to the end user and which are not. Attached is an example of a PDF file that has hidden text.
We tried Apache PDFBox; however, the default PDFTextStripper handles only simple cases where all extracted text is visible on screen. In the attached files some text is partially invisible because of PDF clipping paths or filled paths. To detect this, you have to process the PDF drawing instructions manually and work out whether each character is clipped away or covered by another element, such as an image or another filled region.
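To make the visibility check concrete, here is a minimal sketch of the geometric test only, using nothing but java.awt.geom from the JDK. It assumes the glyph bounding box, the clipping path active when the glyph was drawn, and the shapes painted after it have already been collected from the page content stream (with PDFBox this could probably be done by subclassing PDFGraphicsStreamEngine, but that part is not shown). The class name GlyphVisibility and the method classify are hypothetical names used for illustration; they are not part of any library.

```java
import java.awt.Shape;
import java.awt.geom.Area;
import java.awt.geom.Rectangle2D;
import java.util.Collections;
import java.util.List;

// Hypothetical helper: classifies one glyph box as visible, partially visible, or hidden.
final class GlyphVisibility {

    enum Visibility { VISIBLE, PARTIAL, HIDDEN }

    /**
     * @param glyphBox     bounding box of the glyph in page coordinates
     * @param clip         clipping path active when the glyph was drawn, or null if none
     * @param paintedLater opaque shapes (filled paths, image rectangles) drawn after the glyph
     */
    static Visibility classify(Rectangle2D glyphBox, Shape clip, List<? extends Shape> paintedLater) {
        Area full = new Area(glyphBox);
        Area visible = new Area(glyphBox);

        // Anything outside the clipping path is never painted.
        if (clip != null) {
            visible.intersect(new Area(clip));
        }
        // Anything painted on top afterwards hides the glyph underneath.
        for (Shape cover : paintedLater) {
            visible.subtract(new Area(cover));
        }

        if (visible.isEmpty()) {
            return Visibility.HIDDEN;
        }
        return visible.equals(full) ? Visibility.VISIBLE : Visibility.PARTIAL;
    }

    public static void main(String[] args) {
        Rectangle2D glyph = new Rectangle2D.Double(10, 10, 10, 10);
        Shape pageClip = new Rectangle2D.Double(0, 0, 100, 100);

        // Nothing drawn on top: the glyph is fully visible.
        System.out.println(classify(glyph, pageClip, Collections.<Shape>emptyList()));

        // A filled rectangle drawn later covers the whole glyph.
        Shape fullCover = new Rectangle2D.Double(5, 5, 20, 20);
        System.out.println(classify(glyph, pageClip, Collections.singletonList(fullCover)));

        // A filled rectangle drawn later covers only the right half.
        Shape halfCover = new Rectangle2D.Double(15, 0, 50, 50);
        System.out.println(classify(glyph, pageClip, Collections.singletonList(halfCover)));
    }
}
```

This sketch treats every later shape as fully opaque and works on bounding boxes rather than glyph outlines, so it ignores transparency, blend modes, invisible text render modes, and text drawn in the same colour as its background. A real solution would have to handle those cases as well.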
Other tools could be used, such as iText or Tika, but as far as we can tell Tika relies on PDFBox for PDF parsing (iText is an independent library). We also considered using the Acrobat SDK, but we are not familiar with it.