Word frequency in pdf documents

Status: CLOSED
Bids: 3
Avg Bid (USD): N/A
Project Budget (USD): $30 - $250

Project Description:
I have 2 collections of multiple PDF files relating to research, Collection A and Collection B. I would like a 3-column CSV file populated with word and phrase frequencies for each collection. Words and phrases are considered matching without regard to case, punctuation, or accents.

I envision the flow as follows:

1. The user inputs the maximum phrase length (parameter LENGTH). A phrase is simply a run of contiguous words, so "To be or not to be" is a phrase of LENGTH=6.
2. The user inputs the Collection A directory of PDF files (dirA) and the Collection B directory (dirB).
3. The software digests each of the PDF files in each of the directories into text files, omitting punctuation and leaving behind just a contiguous block of text in lower case.
4. It iterates over each document to generate word and phrase frequencies up to length LENGTH.
5. It tabulates word and phrase frequencies for each document.
6. It tabulates word and phrase frequencies for each Collection separately.
7. It outputs a CSV file that lists the word and phrase frequencies for each collection.
8. It also outputs those words and phrases that appear in only one collection and not the other, along with their frequencies.

The project is subject to the standard eLance agreement that is attached. I will need the Java source code as well as executables for Windows and Mac.
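
To make the counting steps concrete, here is a minimal Java sketch of steps 3 to 8, assuming the text has already been extracted from each PDF (for example with a library such as Apache PDFBox, which is not shown). The class name PhraseFrequency, its method names, and the CSV column layout (phrase, Collection A count, Collection B count) are hypothetical choices for illustration and are not specified in the posting. It also assumes punctuation acts as a word separator, so "don't" would become the two words "don" and "t".

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper class; all names here are illustrative only.
class PhraseFrequency {

    /** Lower-cases text, strips accents and punctuation, and splits it into words. */
    static List<String> tokenize(String text) {
        String cleaned = Normalizer.normalize(text, Normalizer.Form.NFD)
                .replaceAll("\\p{M}+", "")             // drop combining accent marks
                .toLowerCase(Locale.ROOT)
                .replaceAll("[^\\p{L}\\p{Nd}]+", " ")  // assumption: punctuation separates words
                .trim();
        List<String> words = new ArrayList<>();
        if (!cleaned.isEmpty()) {
            for (String word : cleaned.split(" ")) {
                words.add(word);
            }
        }
        return words;
    }

    /** Counts every contiguous phrase of 1..maxLength words in one document. */
    static Map<String, Integer> countPhrases(List<String> words, int maxLength) {
        Map<String, Integer> counts = new TreeMap<>();
        for (int start = 0; start < words.size(); start++) {
            StringBuilder phrase = new StringBuilder();
            for (int len = 1; len <= maxLength && start + len <= words.size(); len++) {
                if (len > 1) {
                    phrase.append(' ');
                }
                phrase.append(words.get(start + len - 1));
                counts.merge(phrase.toString(), 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Adds one document's counts into a collection-level total. */
    static void addInto(Map<String, Integer> total, Map<String, Integer> document) {
        document.forEach((phrase, n) -> total.merge(phrase, n, Integer::sum));
    }

    /** Returns the phrases (with counts) present in a but absent from b. */
    static Map<String, Integer> onlyIn(Map<String, Integer> a, Map<String, Integer> b) {
        Map<String, Integer> result = new TreeMap<>();
        a.forEach((phrase, n) -> {
            if (!b.containsKey(phrase)) {
                result.put(phrase, n);
            }
        });
        return result;
    }

    /** Renders one row per phrase; the three-column layout is an assumption. */
    static String toCsv(Map<String, Integer> a, Map<String, Integer> b) {
        Set<String> phrases = new TreeSet<>(a.keySet());
        phrases.addAll(b.keySet());
        StringBuilder csv = new StringBuilder("phrase,collection_a,collection_b\n");
        for (String p : phrases) {
            csv.append(p).append(',')
               .append(a.getOrDefault(p, 0)).append(',')
               .append(b.getOrDefault(p, 0)).append('\n');
        }
        return csv.toString();
    }

    public static void main(String[] args) {
        // Stand-in strings; real input would be text extracted from dirA and dirB.
        Map<String, Integer> a = countPhrases(tokenize("To be, or not to be!"), 2);
        Map<String, Integer> b = countPhrases(tokenize("To be is to do."), 2);
        System.out.print(toCsv(a, b));
        System.out.println("Only in A: " + onlyIn(a, b));
    }
}
```

Because the normalized phrases contain only letters, digits, and single spaces, the CSV rows need no quoting. For a whole collection, addInto would be called once per document to accumulate that collection's totals before comparing A against B.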

Skills required:
PDF
Project posted by:
androidkit Bangladesh
Verified
Public Clarification Board


$ 250
in 7 days
$ 166
in 4 days
$ 500
in 10 days