Word frequency in PDF documents

This project received 3 bids from freelancers.

Project Budget
$30 - $250 USD
Project Description

I have two collections of PDF files relating to research, Collection A and Collection B. I would like a 3-column CSV file populated with word and phrase frequencies for each collection. Words and phrases are considered matching without regard to case, punctuation, or accents. I envision the flow as follows:

1. The user inputs the maximum phrase length (parameter LENGTH). A phrase is simply a run of contiguous words, so "To be or not to be" is a phrase of LENGTH=6.
2. The user inputs the Collection A directory of PDF files (dirA) and the Collection B directory (dirB).
3. The software converts each PDF file in each directory into a text file, omitting punctuation and leaving behind just a contiguous block of text in lower case.
4. It iterates over each document to generate word and phrase frequencies for phrases up to LENGTH words long.
5. It tabulates word and phrase frequencies for each document.
6. It tabulates word and phrase frequencies for each collection separately.
7. It outputs a CSV file that lists the word and phrase frequencies for each collection.
8. It also outputs the words and phrases that appear in one collection but not the other, along with their frequencies.

The project is subject to the standard eLance agreement that is attached. I will need the Java source code as well as executables for Windows and Mac.
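The counting steps described above (normalizing text, counting phrases up to LENGTH words, and comparing two collections) could be sketched in Java roughly as follows. This is a minimal illustration under several assumptions, not a finished deliverable: the class and method names (PhraseFrequency, tokenize, countPhrases, onlyIn, toCsvRows) are hypothetical, the three CSV columns are assumed to be phrase, Collection A count, and Collection B count, and punctuation is assumed to act as a word separator. PDF text extraction is left out; it would need a separate library such as Apache PDFBox, and the sketch simply starts from already-extracted text.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper class; all names here are illustrative assumptions.
final class PhraseFrequency {

    // Lower-case the text, strip accents, and split on anything that is
    // not a letter or digit (assumption: punctuation separates words).
    static List<String> tokenize(String text) {
        String cleaned = Normalizer.normalize(text, Normalizer.Form.NFD)
                .replaceAll("\\p{M}+", "")             // drop combining accent marks
                .toLowerCase(Locale.ROOT)
                .replaceAll("[^\\p{L}\\p{Nd}]+", " ")  // punctuation and whitespace -> one space
                .trim();
        List<String> words = new ArrayList<>();
        if (!cleaned.isEmpty()) {
            for (String w : cleaned.split(" ")) {
                words.add(w);
            }
        }
        return words;
    }

    // Count every run of 1..maxLength contiguous words in the text.
    static Map<String, Integer> countPhrases(String text, int maxLength) {
        List<String> words = tokenize(text);
        Map<String, Integer> counts = new TreeMap<>();
        for (int start = 0; start < words.size(); start++) {
            StringBuilder phrase = new StringBuilder();
            for (int len = 1; len <= maxLength && start + len <= words.size(); len++) {
                if (len > 1) {
                    phrase.append(' ');
                }
                phrase.append(words.get(start + len - 1));
                counts.merge(phrase.toString(), 1, Integer::sum);
            }
        }
        return counts;
    }

    // Phrases (with their counts) that occur in 'a' but never in 'b'.
    static Map<String, Integer> onlyIn(Map<String, Integer> a, Map<String, Integer> b) {
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            if (!b.containsKey(e.getKey())) {
                result.put(e.getKey(), e.getValue());
            }
        }
        return result;
    }

    // Assumed 3-column layout: phrase, count in A, count in B.
    // Phrases contain no commas after tokenizing, so no quoting is needed.
    static List<String> toCsvRows(Map<String, Integer> a, Map<String, Integer> b) {
        TreeSet<String> phrases = new TreeSet<>(a.keySet());
        phrases.addAll(b.keySet());
        List<String> rows = new ArrayList<>();
        rows.add("phrase,count_a,count_b");
        for (String p : phrases) {
            rows.add(p + "," + a.getOrDefault(p, 0) + "," + b.getOrDefault(p, 0));
        }
        return rows;
    }
}
```

For a whole collection, the per-document maps returned by countPhrases could be summed into one map per collection before calling onlyIn and toCsvRows. Counting "To be, or not to be!" with a maximum length of 2 gives a count of 2 for both "to" and "to be".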
