Language model adaptation techniques fall into two main categories. The first comprises techniques based on data selection, in which a task-oriented corpus is extracted and used to train models for specific translation tasks. The second focuses on developing a weighting criterion that assigns the test data to a specific model corpus.
This research aims to introduce a language model adaptation approach that combines the strategies of both categories.
First, the approach applies data selection for task-specific translation by dividing the corpus into smaller, topic-related corpora through a clustering process. Using the Europarl corpus from WMT07, which includes bilingual data for English-Spanish, English-German and English-French, the experiments investigate how different approaches to clustering the bilingual data affect language model adaptation in terms of translation quality. The clustering approaches are direct clustering, clustering based on the development set, and clustering based on the test set. Once the clustering process has defined the sub-corpora, a separate language model can be built from each of them. Using a specific weighting criterion, a mixture of these language models can then be defined so that any given data is assigned to the right language model for the translation process. For this purpose, three weighting criteria are considered: one based on the entire test set, one based on the sentence level, and a hybrid approach based on both the sentence level and the entire test set.
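To make the weighting step concrete, the following is a minimal sketch of how mixture weights over per-cluster language models might be estimated at the three levels described above. Everything in it is an assumption made for illustration: the models are add-one smoothed unigram models rather than full n-gram models, the weights are fitted with expectation-maximisation for linear interpolation, the hybrid criterion is taken to be a simple average of the test-set and sentence-level weights, and the two tiny clusters and all function names are hypothetical. The description above does not specify any of these details.

```python
import math
from collections import Counter

def train_unigram(sentences):
    """Return (word counts, total tokens) for one sub-corpus."""
    counts = Counter(w for s in sentences for w in s.split())
    return counts, sum(counts.values())

def prob(model, word, vocab_size):
    """Add-one smoothed unigram probability of `word` under `model`."""
    counts, total = model
    return (counts[word] + 1) / (total + vocab_size)

def em_weights(models, words, vocab_size, iters=20):
    """Fit linear-interpolation weights to `words` with EM (assumed criterion)."""
    k = len(models)
    weights = [1.0 / k] * k
    if not words:
        return weights
    for _ in range(iters):
        expected = [0.0] * k
        for w in words:
            parts = [weights[i] * prob(models[i], w, vocab_size) for i in range(k)]
            z = sum(parts)
            for i in range(k):
                expected[i] += parts[i] / z  # posterior of model i for this word
        weights = [e / len(words) for e in expected]
    return weights

def mixture_logprob(models, weights, words, vocab_size):
    """Log-probability of `words` under the weighted mixture of models."""
    return sum(
        math.log(sum(wt * prob(m, w, vocab_size) for wt, m in zip(weights, models)))
        for w in words
    )

# Hypothetical sub-corpora standing in for the output of the clustering step.
clusters = [
    ["the fishing quota was reduced", "fishing fleets need support"],
    ["the budget deficit was reduced", "the budget needs approval"],
]
test_set = ["the fishing quota needs approval", "fishing fleets"]

models = [train_unigram(c) for c in clusters]
# Vocabulary size over all clusters, plus one slot for unseen words.
vocab_size = len(set().union(*(m[0].keys() for m in models))) + 1

# Criterion 1: one weight vector estimated on the entire test set.
all_words = [w for s in test_set for w in s.split()]
test_set_weights = em_weights(models, all_words, vocab_size)

# Criterion 2: a separate weight vector for each sentence.
sentence_weights = [em_weights(models, s.split(), vocab_size) for s in test_set]

# Criterion 3 (assumed form): average of test-set and sentence-level weights.
hybrid_weights = [
    [(a + b) / 2 for a, b in zip(test_set_weights, sw)] for sw in sentence_weights
]

for sentence, weights in zip(test_set, hybrid_weights):
    score = mixture_logprob(models, weights, sentence.split(), vocab_size)
    print(sentence, [round(w, 3) for w in weights], round(score, 3))
```

In a real system the unigram models would be replaced by n-gram models trained on the clustered Europarl sub-corpora, and the resulting weights would be passed to the decoder's language model component. The sketch only shows how the three criteria differ in the data used to fit the weights.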
I want someone in Malaysia to build the system.