The application, which will run on Ubuntu Linux, should apply an ordered series of regular expressions designed to "clean up" our TMX files stored in a specific path. It must be possible to add, change, and reorder these regexes by editing the sequence.
I would like the program to be delivered with a first set of regexes that performs these tasks:
· Getting started:
Check the archive folder for any files that have _clean in the name
If any file is found, clean it as follows:
Remove the braces and any content within them
Remove symbols and any content within them
Remove numbers and periods and commas within each area
Remove symbols: % & $ £ " ^ ° #
Remove any URL
Remove abbreviations (it must be possible to update the list):
Srl, spa, sas, spa, Ltd., sas, ltd., (I), (ii), (iii), (a), (b), (c)
Replace "(" with ","
Replace ")" with ","
Replace "(" with ","
Replace "-" with ","
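The cleaning steps above could be sketched as an ordered list of (pattern, replacement) pairs, so that adding, changing, or reordering a rule only means editing the list. This is a minimal sketch, not a finished implementation: the names RULES, ABBREVIATIONS, abbreviation_pattern and clean_segment are hypothetical, the URL pattern is deliberately simplified, and the number rule reflects only one possible reading of "numbers and periods and commas" (digits together with the separators inside them). Abbreviation matching here is case-sensitive.

```python
import re

# Hypothetical, editable abbreviation list (taken from the brief).
ABBREVIATIONS = ["Srl", "spa", "sas", "Ltd.", "ltd.",
                 "(I)", "(ii)", "(iii)", "(a)", "(b)", "(c)"]

def abbreviation_pattern(items):
    # Longest first so "(iii)" is tried before "(ii)"; escape regex metacharacters.
    ordered = sorted(items, key=len, reverse=True)
    alternatives = "|".join(re.escape(item) for item in ordered)
    # Lookarounds keep "spa" from matching inside a longer word such as "space".
    return rf"(?<!\w)(?:{alternatives})(?!\w)"

# Rules are applied top to bottom; order matters. For example, "(a)" must be
# removed as an abbreviation before parentheses are turned into commas.
RULES = [
    (r"\{[^{}]*\}", ""),                        # braces and their content
    (r"https?://\S+|www\.\S+", ""),             # URLs (simplified pattern)
    (abbreviation_pattern(ABBREVIATIONS), ""),  # abbreviations from the list
    (r"\d+(?:[.,]\d+)*", ""),                   # assumption: numbers with inner . and ,
    (r'[%&$£"^°#]', ""),                        # listed symbols
    (r"[()\-]", ","),                           # "(", ")" and "-" become ","
]

def clean_segment(text, rules=RULES):
    # Apply every rule in sequence to one segment of text.
    for pattern, replacement in rules:
        text = re.sub(pattern, replacement, text)
    return text
```

Removing a match leaves the surrounding spaces in place, so a later pass that collapses repeated spaces is still needed. In practice the rule list would probably be loaded from an external file so it can be edited without touching the program.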
· Translation Unit
Delete the entire unit if the word counts of its two segments differ by more than 50% (this check should be applied only to segments longer than 3 words)
Delete any TU in which one of the two segments is empty
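The two translation-unit checks could look roughly like the sketch below. The brief leaves some details open, so the following are assumptions: the 50% difference is measured relative to the longer segment, the length check is skipped unless the longer segment has more than 3 words, and words are split on whitespace. The names word_count and keep_unit are hypothetical.

```python
def word_count(segment):
    # Words are taken to be whitespace-separated tokens.
    return len(segment.split())

def keep_unit(source, target, max_diff=0.5, min_words=3):
    # Drop units in which either segment is empty or whitespace only.
    if not source.strip() or not target.strip():
        return False
    a, b = word_count(source), word_count(target)
    longer, shorter = max(a, b), min(a, b)
    # Assumption: short units (longer segment of min_words or fewer) are not checked.
    if longer <= min_words:
        return True
    # Assumption: the difference is measured relative to the longer segment.
    return (longer - shorter) / longer <= max_diff
```

For example, a 7-word English segment paired with a 9-word Italian segment differs by 2/9, about 22%, so the unit is kept; an 8-word segment paired with a 3-word one differs by 5/8, about 63%, so the unit is deleted.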
· Final Steps:
Check for double spaces
Rename the TMX file by adding _clean to the file name
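The file handling around the cleaning could be sketched as follows. This assumes that a file whose name already ends in _clean has been processed in an earlier run and is skipped, that files use the .tmx extension, and that "double spaces" means any run of two or more spaces, which is collapsed to one. The names collapse_spaces, pending_files and clean_name are hypothetical.

```python
import re
from pathlib import Path

SUFFIX = "_clean"

def collapse_spaces(text):
    # Replace any run of two or more spaces with a single space.
    return re.sub(r" {2,}", " ", text)

def pending_files(folder):
    # Assumption: files whose stem already ends in _clean were processed earlier.
    candidates = Path(folder).glob("*.tmx")
    return sorted(p for p in candidates if not p.stem.endswith(SUFFIX))

def clean_name(path):
    # report.tmx becomes report_clean.tmx, in the same folder.
    return path.with_name(path.stem + SUFFIX + path.suffix)
```

Once a file has been cleaned, the rename itself would be `path.rename(clean_name(path))`.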
//////////// TMX file example (real files normally contain a very large number of TUs) ///////////
Xxxxxxx Reports Third Quarter 2012 Financial Results
Xxxxxxx pubblica i risultati finanziari del terzo trimestre 2012
3Q 2012 Net Operating Income of $128.2 million, $1.55 per diluted share 3Q 2012 Net Income of $126.3 million, $1.52 per diluted share
Utile operativo netto T3 2012 = $ 128,2 milioni, $ 1,55 per azione diluito Utile netto T3 2012 = $ 126,3 milioni, $ 1,52 per azione diluito
Net income increased to $126.3 million, or $1.52 per diluted share, compared to third quarter 2011 net income of $74.0 million, or $0.77 per diluted share.
L'utile netto è aumentato fino a 126,3 milioni di dollari, pari a 1,52 dollari per azione diluiti, rispetto all'utile netto del terzo trimestre 2011, che si collocava a 74,0 milioni di dollari, pari a 0,77 dollari per azione diluiti.