Word frequency in PDF documents
Project Budget: $30 - $250 USD
I have 2 collections of PDF files relating to research, Collection A and Collection B. I would like a 3-column CSV file populated with word and phrase frequencies for each collection. Words and phrases are considered matching without regard to case, punctuation, or accents.

I envision the flow as follows:
1. The user inputs the maximum phrase length (parameter LENGTH). A phrase is simply a run of contiguous words, so "To be or not to be" is a phrase of LENGTH=6.
2. The user inputs the Collection A directory of PDF files (dirA) and the Collection B directory (dirB).
3. The software converts each PDF file in each directory into a text file, omitting punctuation and leaving behind just a contiguous block of lower-case text.
4. It iterates over each document to generate word and phrase frequencies for phrases up to LENGTH words long.
5. It tabulates word and phrase frequencies for each document.
6. It tabulates word and phrase frequencies for each collection separately.
7. It outputs a CSV file that lists the word and phrase frequencies for each collection.
8. It also outputs the words and phrases that appear in one collection but not the other, along with their frequencies.

The project is subject to the standard eLance agreement that is attached. I will need the Java source code as well as executables for Windows and Mac.
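To make steps 3 to 8 concrete, the counting logic might look roughly like the Java sketch below. It is an illustration under stated assumptions, not a specification. The class and method names are hypothetical. Text extraction from the PDFs is left out and would need a PDF library such as Apache PDFBox. Punctuation is assumed to be deleted outright, so "don't" becomes "dont". The three CSV columns are assumed to be the phrase, its count in Collection A, and its count in Collection B, so a zero in either column marks a phrase found in only one collection.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical names throughout; PDF-to-text extraction is not shown.
class PhraseFrequency {

    // Step 3: strip accents and punctuation, lower-case, split into words.
    static List<String> tokenize(String text) {
        String cleaned = Normalizer.normalize(text, Normalizer.Form.NFD)
                .replaceAll("\\p{M}+", "")             // remove combining accent marks
                .toLowerCase(Locale.ROOT)
                .replaceAll("[^\\p{L}\\p{N}\\s]+", "") // assumption: punctuation is deleted, not spaced
                .trim();
        return cleaned.isEmpty() ? new ArrayList<String>() : Arrays.asList(cleaned.split("\\s+"));
    }

    // Steps 4-5: count every phrase of 1..maxLength contiguous words in one document.
    static Map<String, Integer> countPhrases(List<String> words, int maxLength) {
        Map<String, Integer> counts = new TreeMap<>();
        for (int start = 0; start < words.size(); start++) {
            StringBuilder phrase = new StringBuilder();
            for (int len = 1; len <= maxLength && start + len <= words.size(); len++) {
                if (len > 1) {
                    phrase.append(' ');
                }
                phrase.append(words.get(start + len - 1));
                counts.merge(phrase.toString(), 1, Integer::sum);
            }
        }
        return counts;
    }

    // Step 6: fold one document's counts into its collection's running totals.
    static void addInto(Map<String, Integer> total, Map<String, Integer> document) {
        document.forEach((phrase, n) -> total.merge(phrase, n, Integer::sum));
    }

    // Steps 7-8: one row per phrase; a 0 marks a phrase absent from that collection.
    static List<String> toCsvRows(Map<String, Integer> a, Map<String, Integer> b) {
        TreeSet<String> phrases = new TreeSet<>(a.keySet());
        phrases.addAll(b.keySet());
        List<String> rows = new ArrayList<>();
        rows.add("phrase,collection_a,collection_b");
        for (String p : phrases) {
            rows.add(p + "," + a.getOrDefault(p, 0) + "," + b.getOrDefault(p, 0));
        }
        return rows;
    }

    public static void main(String[] args) {
        int maxLength = 2; // the LENGTH parameter from step 1
        Map<String, Integer> collectionA = new TreeMap<>();
        Map<String, Integer> collectionB = new TreeMap<>();
        // Stand-ins for text already extracted from the PDFs in dirA and dirB.
        addInto(collectionA, countPhrases(tokenize("To be, or not to be."), maxLength));
        addInto(collectionB, countPhrases(tokenize("Not to be outdone."), maxLength));
        toCsvRows(collectionA, collectionB).forEach(System.out::println);
    }
}
```

Because the normalized phrases contain only letters, digits, and single spaces, the rows need no CSV quoting. A real implementation would also have to decide how to treat hyphenated words and text that spans page or column breaks in the PDFs.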