Closed

Need PDFBox expert to help extract text from PDFs with coordinates and a flag showing which part of the text is visible

I am looking for help understanding the PDFBox library. Please apply only if you have already worked with PDFBox, iText, or other PDF software.

What we need: a utility, JAR, or class that we can call from our Java web app, which runs under Tomcat with Java 8 on a Linux server (this may affect non-Java solutions).

Problem: we need to extract text from searchable (not scanned) PDFs and preserve text positions. Ideally, the library should return words/tokens with their x/y start and end positions, as well as the start and end coordinates of vertical and horizontal line separators. We need to get only the text a user can see; or, if we get the full text, we need a clear indication of which parts are visible to the end user and which are not. Attached is an example of a PDF file that has hidden text.
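To make the expected output more concrete, here is a rough sketch of the kind of word tokens described above. It is only an illustration under stated assumptions, not a PDFBox API: `Glyph`, `WordToken`, and `WordGrouper` are hypothetical names, the glyph boxes are assumed to arrive in reading order with x/y at the top-left corner, and the `maxGap` threshold is an arbitrary tuning value. In a real solution the glyph boxes would come from the PDF library (for example, PDFBox reports per-character positions as `TextPosition` objects).

```java
import java.util.ArrayList;
import java.util.List;

/** One positioned character (hypothetical type; x/y is the top-left corner of its box). */
final class Glyph {
    final char ch;
    final double x, y, width, height;

    Glyph(char ch, double x, double y, double width, double height) {
        this.ch = ch; this.x = x; this.y = y; this.width = width; this.height = height;
    }
}

/** A word with the start/end coordinates of its bounding box (hypothetical output type). */
final class WordToken {
    final String text;
    final double xStart, yStart, xEnd, yEnd;

    WordToken(String text, double xStart, double yStart, double xEnd, double yEnd) {
        this.text = text; this.xStart = xStart; this.yStart = yStart; this.xEnd = xEnd; this.yEnd = yEnd;
    }
}

final class WordGrouper {
    private WordGrouper() {}

    /**
     * Groups glyphs (assumed to be in reading order) into words. A new word starts at a
     * whitespace glyph, at a horizontal gap wider than maxGap, or when the glyph jumps
     * back to the left or no longer overlaps the current word vertically (a new line).
     */
    static List<WordToken> group(List<Glyph> glyphs, double maxGap) {
        List<WordToken> words = new ArrayList<>();
        StringBuilder text = new StringBuilder();
        double xStart = 0, yStart = 0, xEnd = 0, yEnd = 0;
        for (Glyph g : glyphs) {
            boolean isSpace = Character.isWhitespace(g.ch);
            boolean farAway = text.length() > 0
                    && (g.x - xEnd > maxGap || g.x < xStart
                        || g.y >= yEnd || g.y + g.height <= yStart);
            if ((isSpace || farAway) && text.length() > 0) {
                words.add(new WordToken(text.toString(), xStart, yStart, xEnd, yEnd));
                text.setLength(0);
            }
            if (isSpace) {
                continue; // whitespace only separates words; it is not part of a token
            }
            if (text.length() == 0) {
                xStart = g.x; yStart = g.y; xEnd = g.x + g.width; yEnd = g.y + g.height;
            } else {
                xStart = Math.min(xStart, g.x);
                yStart = Math.min(yStart, g.y);
                xEnd = Math.max(xEnd, g.x + g.width);
                yEnd = Math.max(yEnd, g.y + g.height);
            }
            text.append(g.ch);
        }
        if (text.length() > 0) {
            words.add(new WordToken(text.toString(), xStart, yStart, xEnd, yEnd));
        }
        return words;
    }
}
```

Calling `WordGrouper.group(glyphs, 2.0)` on the characters of one page would return one `WordToken` per word, each carrying its text plus the start and end corners of its bounding box. Line separators would need a similar structure built from the page's stroked or filled paths, which this sketch does not cover.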

We tried Apache PDFBox; however, the default PDFTextStripper handles only simple cases, where all extracted text is visible on screen. In some of the attached files, text is partially invisible because of PDF clipping or filled paths. To track this, you need to process the PDF drawing instructions manually and determine whether each character is clipped, covered, or overlapped by another element, such as an image or another filled area.
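The visibility calculation described above could be sketched roughly as follows. This is a simplified illustration and not PDFBox code: `VisibilityCheck` and `visibleFraction` are hypothetical names, and the sketch assumes the glyph box, the clip region, and every covering element have already been reduced to axis-aligned rectangles in the same coordinate space. Real clipping paths can be arbitrary shapes, and the paint order has to come from walking the page's content stream (in PDFBox, `PDFGraphicsStreamEngine` is, as far as I know, the class that exposes the clip, fill, and image callbacks).

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

final class VisibilityCheck {
    private VisibilityCheck() {}

    /**
     * Returns the fraction (0.0 to 1.0) of the glyph box that lies inside the clip
     * rectangle and is not hidden by any rectangle painted after the glyph.
     * Assumption: all shapes are axis-aligned rectangles.
     */
    static double visibleFraction(Rectangle2D glyphBox, Rectangle2D clip,
                                  List<? extends Rectangle2D> paintedOnTop) {
        double total = glyphBox.getWidth() * glyphBox.getHeight();
        if (total <= 0) {
            return 0.0;
        }
        Rectangle2D inClip = glyphBox.createIntersection(clip);
        if (inClip.isEmpty()) {
            return 0.0; // the glyph is entirely outside the clip region
        }

        // Collect every rectangle edge so the clipped box can be cut into grid cells.
        TreeSet<Double> xs = new TreeSet<>();
        TreeSet<Double> ys = new TreeSet<>();
        xs.add(inClip.getMinX()); xs.add(inClip.getMaxX());
        ys.add(inClip.getMinY()); ys.add(inClip.getMaxY());
        List<Rectangle2D> covers = new ArrayList<>();
        for (Rectangle2D shape : paintedOnTop) {
            Rectangle2D c = shape.createIntersection(inClip);
            if (c.isEmpty()) {
                continue; // this shape does not touch the glyph
            }
            covers.add(c);
            xs.add(c.getMinX()); xs.add(c.getMaxX());
            ys.add(c.getMinY()); ys.add(c.getMaxY());
        }

        // Each grid cell is either fully covered or fully visible; test its center.
        Double[] xa = xs.toArray(new Double[0]);
        Double[] ya = ys.toArray(new Double[0]);
        double visible = 0.0;
        for (int i = 0; i + 1 < xa.length; i++) {
            for (int j = 0; j + 1 < ya.length; j++) {
                double cx = (xa[i] + xa[i + 1]) / 2;
                double cy = (ya[j] + ya[j + 1]) / 2;
                boolean covered = false;
                for (Rectangle2D c : covers) {
                    if (c.contains(cx, cy)) {
                        covered = true;
                        break;
                    }
                }
                if (!covered) {
                    visible += (xa[i + 1] - xa[i]) * (ya[j + 1] - ya[j]);
                }
            }
        }
        return visible / total;
    }
}
```

A caller could then flag a character as visible when the returned fraction is above some threshold, for example 0.5. That threshold is a judgment call, and the sketch ignores other ways text can be hidden, such as invisible text rendering modes or text drawn in the background color.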

There are other tools that could be used, such as iText or Tika, but Tika at least appears to be built on top of PDFBox. We also considered using the Acrobat SDK, but we are not familiar with it.

Skills: Java, PDF

About the Employer:
( 0 reviews ) United States

Project ID: #15915061

6 freelancers are bidding on average $517 for this job

schoudhary1553

Greetings, I have understood your "Need PDFBox expert to help extract text from pdfs with coordinates and a flag what part of text is visible" task and can do it to your 100% satisfaction. Please ping me for more di…

$500 USD in 6 days
(21 Reviews)
5.0
expertjavagiant

Hi, I have extensive experience with the PDFBox and iText PDF libraries. I reviewed your requirement for extracting text and its position from PDFs, and it looks good to me: as it's a searchable PDF, we can get the text easily. For gett…

$480 USD in 10 days
(14 Reviews)
4.9
$400 USD in 10 days
(11 Reviews)
4.1
shahzain93

Hey man, I have worked with the PDFBox library. I have seen your document and I can try to do it. If interested, message me. Thanks

$690 USD in 10 days
(10 Reviews)
4.2
anuragiitk

I am an IITK graduate and I have 11 years of experience in software development. I have a 100% completion rate and I have finished projects with the highest level of customer satisfaction. I have a team of rock star dev…

$555 USD in 10 days
(20 Reviews)
5.4
benni25

Hello Sir/Madam. Relevant Skills and Experience: Please send us all the details and we will do the job now if possible. We are always ready to take on any challenge, and we have an Adobe lab too. Proposed Milestones: 475 - (Pro…

$475 USD in 1 day
(5 Reviews)
3.1