Word frequency in PDF documents

Closed

I have two collections of PDF files relating to research, Collection A and Collection B. I would like a 3-column CSV file populated with word and phrase frequencies for each collection. Words and phrases are considered matching without regard to case, punctuation, or accents.

I envision the flow as follows:

1. The user inputs the maximum phrase length (parameter LENGTH). A phrase is simply a run of contiguous words. For example, "To be or not to be" is a phrase of LENGTH=6.
2. The user inputs the Collection A directory of PDF files (dirA) and the Collection B directory (dirB).
3. The software digests each of the PDF files in each directory into a text file, omitting punctuation and leaving behind just a contiguous block of lower-case text.
4. It iterates over each document to generate word and phrase frequencies for phrases up to LENGTH words long.
5. It tabulates word and phrase frequencies for each document.
6. It tabulates word and phrase frequencies for each collection separately.
7. It outputs a CSV file that lists the word and phrase frequencies for each collection.
8. It also outputs the words and phrases that appear in only one collection and not the other, along with their frequencies.

The project is subject to the standard eLance agreement that is attached. I will need the Java source code as well as executables for Windows and Mac.
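To make steps 3 to 8 concrete, here is a minimal sketch of the counting logic in Java, not a finished deliverable. It assumes the PDF text has already been extracted to a String (a library such as Apache PDFBox would be one option for that step; it is not shown here). The class name PhraseFrequency, the method names, and the column headers phrase,count_a,count_b are all hypothetical, and treating punctuation as a word separator is an assumption about how "omitting punctuation" should behave.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical sketch: names and CSV layout are assumptions, not part of the posting.
public class PhraseFrequency {

    /** Step 3: strip accents and punctuation, lower-case, and split into words. */
    static List<String> normalize(String text) {
        String cleaned = Normalizer.normalize(text, Normalizer.Form.NFD)
                .replaceAll("\\p{M}+", "")                 // drop combining accent marks
                .toLowerCase(Locale.ROOT)
                .replaceAll("[^\\p{L}\\p{N}\\s]+", " ");   // assumption: punctuation becomes a space
        List<String> words = new ArrayList<>();
        for (String w : cleaned.trim().split("\\s+")) {
            if (!w.isEmpty()) {
                words.add(w);
            }
        }
        return words;
    }

    /** Steps 4-5: count every phrase of 1..maxLength contiguous words in one document. */
    static Map<String, Integer> countPhrases(String text, int maxLength) {
        List<String> words = normalize(text);
        Map<String, Integer> counts = new TreeMap<>();
        for (int start = 0; start < words.size(); start++) {
            StringBuilder phrase = new StringBuilder();
            for (int n = 0; n < maxLength && start + n < words.size(); n++) {
                if (n > 0) {
                    phrase.append(' ');
                }
                phrase.append(words.get(start + n));
                counts.merge(phrase.toString(), 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Step 6: add one document's counts into a collection-wide total. */
    static void addInto(Map<String, Integer> total, Map<String, Integer> doc) {
        doc.forEach((phrase, count) -> total.merge(phrase, count, Integer::sum));
    }

    /** Steps 7-8: one row per phrase; a 0 marks a phrase absent from that collection. */
    static String toCsv(Map<String, Integer> a, Map<String, Integer> b) {
        Set<String> all = new TreeSet<>(a.keySet());
        all.addAll(b.keySet());
        StringBuilder sb = new StringBuilder("phrase,count_a,count_b\n");
        for (String phrase : all) {
            // Normalized phrases hold only letters, digits, and spaces, so no CSV quoting is needed.
            sb.append(phrase).append(',')
              .append(a.getOrDefault(phrase, 0)).append(',')
              .append(b.getOrDefault(phrase, 0)).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, Integer> collectionA = new TreeMap<>();
        Map<String, Integer> collectionB = new TreeMap<>();
        addInto(collectionA, countPhrases("To be, or not to be!", 2));
        addInto(collectionB, countPhrases("Not to be outdone.", 2));
        System.out.print(toCsv(collectionA, collectionB));
    }
}
```

In this layout, the phrases unique to one collection (step 8) are simply the rows with a 0 in one of the two count columns, so they could be filtered into a separate file if that is preferred.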

Skills: PDF


Project ID: #5387030