In this project, you will create a Python (version 2) function that finds all instances of incorrect usage of "a" vs. "an" in a string of English prose. The basic rule is that if the following word starts with a vowel sound (e.g., "apple"), then "an" should be used ("an apple"), but if it starts with a consonant sound (e.g., "banana"), then "a" should be used ("a banana"). You should use the Natural Language Toolkit (nltk), specifically its copy of the Carnegie Mellon Pronouncing Dictionary (CMPD), which nltk exposes as the cmudict corpus. If the CMPD lists more than one pronunciation for a word, and the article would be an error under one pronunciation but not under another, then the function should not report an error. If the word is an acronym (a string of 3 or more uppercase letters, possibly separated by periods, e.g., N.A.S.A. or NASA), the function should rely on the pronunciation of the individual letters.
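To make the rules above concrete, here is a minimal sketch of the per-word decision, not a finished implementation. It assumes pronunciations arrive in the shape that nltk.corpus.cmudict.dict() returns (lowercase word mapped to a list of phoneme lists). The names SAMPLE_PRONS, allowed_articles, and is_article_error are hypothetical, and the sample entries are hand-written for illustration rather than copied from the dictionary. For acronyms it looks only at the first letter, since that letter's name decides the article.

```python
import re

# ARPAbet vowel phonemes (the dictionary appends a stress digit, e.g. "AE1").
VOWELS = {"AA", "AE", "AH", "AO", "AW", "AY", "EH", "ER",
          "EY", "IH", "IY", "OW", "OY", "UH", "UW"}

# Three or more capital letters, each optionally followed by a period.
ACRONYM = re.compile(r"^(?:[A-Z]\.?){3,}$")

# Hand-written illustrative entries (an assumption, not real dictionary data).
# With nltk installed and the corpus downloaded, the real data would come from:
#     from nltk.corpus import cmudict; prons = cmudict.dict()
SAMPLE_PRONS = {
    "apple": [["AE1", "P", "AH0", "L"]],
    "banana": [["B", "AH0", "N", "AE1", "N", "AH0"]],
    "european": [["Y", "UH2", "R", "AH0", "P", "IY1", "AH0", "N"]],
    "herb": [["ER1", "B"], ["HH", "ER1", "B"]],  # two pronunciations
    "f": [["EH1", "F"]],                         # the letter name "eff"
}


def allowed_articles(word, pron_dict):
    """Return the set of articles acceptable before `word` (empty if unknown)."""
    if ACRONYM.match(word):
        key = word[0].lower()  # acronym: use the first letter's name
    else:
        key = word.lower()
    allowed = set()
    for phonemes in pron_dict.get(key, []):
        first = phonemes[0].rstrip("012")  # drop the stress digit
        allowed.add("an" if first in VOWELS else "a")
    return allowed


def is_article_error(article, word, pron_dict):
    """True only when every known pronunciation rejects `article`."""
    allowed = allowed_articles(word, pron_dict)
    # Unknown words are never flagged; ambiguous words allow both articles.
    return bool(allowed) and article.lower() not in allowed
```

With the sample data, is_article_error("a", "apple", SAMPLE_PRONS) is True, while both "a herb" and "an herb" pass because the two pronunciations disagree. Treating unknown words as "no error" is a design choice of this sketch, not something the brief specifies.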
Here is some sample code from someone else who tried to do something similar:
[url removed, login to view]
I have also included a couple of pages from a usage guide, which you should read before implementing this.
The following phrases should NOT be found as errors:
"a European country"
"a one-year term"
"a Ouija board"
"an FBI agent"
"an MBA degree"
"an SEC filing"
The following phrases SHOULD be found as errors:
"an habitual offender"
"an hallucinatory image"
"an hysterical crowd"
If good work is done here, there is the potential for considerably more work in this general space of natural language processing. The work ranges in complexity from small tasks that will take less than an hour to larger tasks that might take multiple days or a few weeks.
Ideally, you will be willing to release this code to us under the MIT License or a comparable open-source license.
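The sample phrases listed earlier can double as an acceptance check for any candidate function. The sketch below is one possible harness, with hypothetical names (OK_PHRASES, ERROR_PHRASES, failed_cases); it assumes the candidate takes a string and returns a list of the errors it found. The spelling-only baseline naive_find_errors is included purely to show why a pronunciation dictionary is needed; it is not the requested solution.

```python
import re

# Phrases the brief says must NOT be reported.
OK_PHRASES = [
    "a European country",
    "a one-year term",
    "a Ouija board",
    "an FBI agent",
    "an MBA degree",
    "an SEC filing",
]

# Phrases the brief says MUST be reported.
ERROR_PHRASES = [
    "an habitual offender",
    "an hallucinatory image",
    "an hysterical crowd",
]


def failed_cases(find_errors):
    """Return the phrases on which `find_errors` disagrees with the brief."""
    failures = []
    for phrase in OK_PHRASES:
        if find_errors(phrase):        # reported something it should not have
            failures.append(phrase)
    for phrase in ERROR_PHRASES:
        if not find_errors(phrase):    # missed an error it should have found
            failures.append(phrase)
    return failures


def naive_find_errors(text):
    """Spelling-only baseline: "an" before a vowel letter, "a" otherwise."""
    errors = []
    for match in re.finditer(r"\b(an?)\s+(\w+)", text, re.IGNORECASE):
        article = match.group(1).lower()
        word = match.group(2)
        expected = "an" if word[0].lower() in "aeiou" else "a"
        if article != expected:
            errors.append(match.group(0))
    return errors
```

Calling failed_cases(naive_find_errors) returns all six phrases from the first list: the letter-based rule wrongly flags each of them, because the vowel letter in "European", "one", and "Ouija" begins a consonant sound and the consonant letters F, M, and S have names that begin with a vowel sound. A pronunciation-based function should bring that list of failures down to empty.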