I would like to order a text normalizer that works on two languages at the same time. I attach sample files. Each file stores the very same data in a different language (translations), and the files are aligned line by line. The most important thing is that the program normalizes both files together, so that the output is again two aligned files. In the output, each line should start with an upper-case letter and end with ".\n".
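A minimal Python sketch of the output rule for aligned files. The function names `finalize_line` and `finalize_pairs` are hypothetical, and the sketch assumes that an empty input line stays empty so that alignment is preserved, which the brief does not specify.

```python
from itertools import zip_longest
from typing import Iterable, Iterator, Tuple


def finalize_line(text: str) -> str:
    """Trim the line, upper-case its first character, and end it with '.\n'."""
    text = text.strip().rstrip(".").rstrip()
    if not text:
        return "\n"  # assumption: keep empty lines so both files stay aligned
    return text[0].upper() + text[1:] + ".\n"


def finalize_pairs(
    src_lines: Iterable[str], tgt_lines: Iterable[str]
) -> Iterator[Tuple[str, str]]:
    """Finalize two aligned line streams in lockstep, one output pair per input pair."""
    for src, tgt in zip_longest(src_lines, tgt_lines):
        if src is None or tgt is None:
            raise ValueError("input files have different numbers of lines")
        yield finalize_line(src), finalize_line(tgt)
```

Processing the two files pair by pair means a line is never dropped from one output only, so line N in one file still corresponds to line N in the other.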
Before normalization the text should be cleaned: remove junk such as .\/\'/.\./\]--- [url removed, login to view] and replace some characters with others, for example ? with a. I will provide those rules.
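A minimal Python sketch of a rule-driven cleaning step. The actual rules are still to be provided, so the three entries in `RULES` are placeholders invented for illustration, and `clean_line` is a hypothetical name.

```python
import re
from typing import List, Pattern, Tuple

# Placeholder rules (assumptions): each entry is (compiled pattern, replacement).
RULES: List[Tuple[Pattern[str], str]] = [
    (re.compile(r"[\\/\[\]]"), " "),  # drop slashes, backslashes, square brackets
    (re.compile(r"-{2,}"), " "),      # drop runs of two or more dashes
    (re.compile(r"\s+"), " "),        # collapse whitespace left by the rules above
]


def clean_line(text: str, rules: List[Tuple[Pattern[str], str]] = RULES) -> str:
    """Apply every cleaning rule in order and trim the result."""
    for pattern, replacement in rules:
        text = pattern.sub(replacement, text)
    return text.strip()
```

Keeping the rules in a plain list means they could be loaded from a file per language without changing the code.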
For normalization I will provide a dictionary for Polish etc.; for English, however, you should build one on your own. You can use and modify open-source solutions like [url removed, login to view] (this one needs a better dictionary), but remember that it must handle both files at the same time so that both outputs stay aligned.
If the program cannot normalize something because it is unknown, it should leave it untouched and write it to a log file for an external manual normalizer. The external normalizer should be able to read the log file and let the user correct each problem manually; after the user corrects it, the program should apply the correction at the right place in the output file.
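A minimal Python sketch of the "leave untouched and log" behaviour and of applying manual corrections afterwards. Everything here is an assumption for illustration: the names `normalize_line` and `apply_corrections`, the log entry layout (line number, language, token), and above all the rule that a token "needs normalizing" when it contains a digit.

```python
import re
from typing import Dict, Iterable, List, Tuple

# Assumption: only tokens containing digits count as needing normalization.
NEEDS_NORMALIZING = re.compile(r"\d")

LogRow = Tuple[int, str, str]  # (1-based line number, language, unknown token)


def normalize_line(
    line_no: int, lang: str, text: str,
    dictionary: Dict[str, str], log_rows: List[LogRow],
) -> str:
    """Replace tokens found in the dictionary; log unknown ones and keep them as-is."""
    out = []
    for token in text.split():
        if token in dictionary:
            out.append(dictionary[token])
        else:
            if NEEDS_NORMALIZING.search(token):
                log_rows.append((line_no, lang, token))
            out.append(token)  # unknown tokens stay untouched
    return " ".join(out)


def apply_corrections(
    lines: Iterable[str], corrections: Iterable[Tuple[int, str, str]]
) -> List[str]:
    """Apply (line number, token, replacement) fixes produced by the manual tool."""
    fixed = list(lines)
    for line_no, token, replacement in corrections:
        words = fixed[line_no - 1].split(" ")
        fixed[line_no - 1] = " ".join(
            replacement if word == token else word for word in words
        )
    return fixed
```

Because each log entry carries the line number, a correction can be written back to exactly the line it came from without re-running the whole normalization.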
It should be UTF-8 compatible and able to work with big files, even 100 MB. The program should also be able to run in a mode that normalizes only one language.
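A minimal Python sketch of how large UTF-8 files could be streamed line by line, with an optional single-language mode. The function name `process_files` and its parameters are hypothetical, and for brevity one `normalize` callable is used for both languages; a real program would likely take one per language.

```python
from itertools import zip_longest
from typing import Callable, Optional


def process_files(
    normalize: Callable[[str], str],
    src_in: str, src_out: str,
    tgt_in: Optional[str] = None, tgt_out: Optional[str] = None,
) -> int:
    """Stream one file, or two aligned files, without loading them into memory.

    Returns the number of lines written to each output file.
    """
    if (tgt_in is None) != (tgt_out is None):
        raise ValueError("tgt_in and tgt_out must be given together")
    count = 0
    with open(src_in, encoding="utf-8") as fa, \
            open(src_out, "w", encoding="utf-8", newline="\n") as oa:
        if tgt_in is None:  # single-language mode
            for line in fa:
                oa.write(normalize(line.rstrip("\n")) + "\n")
                count += 1
            return count
        with open(tgt_in, encoding="utf-8") as fb, \
                open(tgt_out, "w", encoding="utf-8", newline="\n") as ob:
            for a, b in zip_longest(fa, fb):
                if a is None or b is None:
                    raise ValueError("input files have different line counts")
                oa.write(normalize(a.rstrip("\n")) + "\n")
                ob.write(normalize(b.rstrip("\n")) + "\n")
                count += 1
    return count
```

Iterating over the file objects reads one line at a time, so memory use stays roughly constant regardless of file size.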
Programming language or platform does not matter.
Sample files can be found here: [url removed, login to view]. In the table, click on the link marked PL.