I have two collections of PDF files relating to research, Collection A and Collection B. I would like a 3-column CSV file populated with word and phrase frequencies for each collection. Words and phrases are considered matching without regard to case, punctuation, or accents.

I envision the flow as follows:

1. The user inputs the maximum phrase length (parameter LENGTH). A phrase is simply a run of contiguous words, so "To be or not to be" is a phrase of LENGTH=6.
2. The user inputs the Collection A directory of PDF files (dirA) and the Collection B directory (dirB).
3. The software digests each of the PDF files in each directory into a text file, omitting punctuation and leaving behind just a contiguous block of text in lower case.
4. It iterates over each document to generate word and phrase frequencies up to length LENGTH.
5. It tabulates word and phrase frequencies for each document.
6. It tabulates word and phrase frequencies for each collection separately.
7. It outputs a CSV file that lists the word and phrase frequencies for each collection.
8. It also outputs the words and phrases that appear in only one collection and not the other, along with their frequencies.

The project is subject to the standard eLance agreement that is attached. I will need the Java source code as well as executables for Windows and Mac.
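To make the matching and counting rules concrete, here is a minimal Java sketch of steps 3 through 8 operating on text that has already been extracted. It is an illustration under assumptions, not a required design: the class and method names (PhraseCounter, tokenize, countPhrases, toCsv) are hypothetical, the three columns are assumed to be phrase, Collection A frequency, and Collection B frequency, and punctuation is assumed to act as a word separator (so "don't" becomes "don" and "t"). PDF text extraction is left out; a library such as Apache PDFBox would be one way to supply the text.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper class; all names here are illustrative.
class PhraseCounter {

    // Step 3: lower-case the text, strip accents, and treat punctuation as whitespace.
    static List<String> tokenize(String text) {
        String cleaned = Normalizer.normalize(text, Normalizer.Form.NFD)
                .replaceAll("\\p{M}+", "")            // drop combining accent marks
                .toLowerCase(Locale.ROOT)
                .replaceAll("[^\\p{L}\\p{N}]+", " ")  // assumption: non-letters/digits separate words
                .trim();
        List<String> words = new ArrayList<>();
        if (!cleaned.isEmpty()) {
            for (String word : cleaned.split(" ")) {
                words.add(word);
            }
        }
        return words;
    }

    // Steps 4-6: count every run of 1..maxLength contiguous words.
    static Map<String, Integer> countPhrases(List<String> words, int maxLength) {
        Map<String, Integer> counts = new TreeMap<>();
        for (int start = 0; start < words.size(); start++) {
            StringBuilder phrase = new StringBuilder();
            for (int len = 1; len <= maxLength && start + len <= words.size(); len++) {
                if (len > 1) {
                    phrase.append(' ');
                }
                phrase.append(words.get(start + len - 1));
                counts.merge(phrase.toString(), 1, Integer::sum);
            }
        }
        return counts;
    }

    // Steps 7-8: one row per phrase; a 0 marks a phrase absent from that collection.
    static String toCsv(Map<String, Integer> countsA, Map<String, Integer> countsB) {
        Set<String> phrases = new TreeSet<>(countsA.keySet());
        phrases.addAll(countsB.keySet());
        StringBuilder csv = new StringBuilder("phrase,collection_a,collection_b\n");
        for (String phrase : phrases) {
            csv.append(phrase).append(',')
               .append(countsA.getOrDefault(phrase, 0)).append(',')
               .append(countsB.getOrDefault(phrase, 0)).append('\n');
        }
        return csv.toString();
    }

    public static void main(String[] args) {
        // Stand-ins for the extracted text of each collection.
        Map<String, Integer> a = countPhrases(tokenize("To be, or not to be!"), 2);
        Map<String, Integer> b = countPhrases(tokenize("Not to be outdone."), 2);
        System.out.print(toCsv(a, b));
    }
}
```

Because punctuation is removed before counting, phrases never contain commas or quotes, so the CSV rows need no escaping in this sketch. Phrases unique to one collection (step 8) are the rows where one of the two counts is 0; they could be written to a separate file by filtering on that condition.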