Efficient incremental training using a novel NMT-SMT hybrid framework for translation of low-resource languages

Kumar Bhuvaneswari et al. Front Artif Intell. 2024 Sep 25;7:1381290. doi: 10.3389/frai.2024.1381290. eCollection 2024.

Abstract

The data-hungry statistical machine translation (SMT) and neural machine translation (NMT) models offer state-of-the-art results for languages with abundant data resources. However, extensive research is needed to make these models perform equally well for low-resource languages. This paper proposes a novel approach that integrates the best features of the NMT and SMT systems to improve translation performance for the low-resource English-Tamil language pair. The suboptimal NMT model trained on the small parallel corpus translates the monolingual corpus and selects only the best translations to retrain itself in the next iteration. The proposed method employs the SMT phrase-pair table to determine the best translations, based on the maximum match between the words of the phrase-pair dictionary and each individual translation. This repeating cycle of translation and retraining generates a large quasi-parallel corpus, making the NMT model more powerful. SMT-integrated incremental training yields a substantial improvement in translation performance over existing approaches to incremental training. The model is strengthened further by adopting a beam search decoding strategy that produces the k best candidate translations for each input sentence. Empirical findings show that the proposed model, with BLEU scores of 19.56 and 23.49, outperforms the baseline NMT, which scores 11.06 and 17.06, for Eng-to-Tam and Tam-to-Eng translations, respectively. METEOR score evaluation further corroborates these results, confirming the superiority of the proposed model.

Keywords: SMT phrase table; beam search; hybrid NMT-SMT; incremental training; low-resource languages.
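
To make the selection step concrete, here is a minimal Python sketch (not the authors' code) of how candidate translations could be scored against an SMT phrase-pair table: each candidate is ranked by how many of its words appear on the target side of a Moses-style phrase table, and the best-covered candidate is kept. The file format, length normalization, and function names are illustrative assumptions.

def load_phrase_table_words(path):
    # Collect target-side words from a Moses-style phrase table: "src ||| tgt ||| ...".
    # The table layout is an assumption; adapt to the actual SMT toolkit output.
    words = set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            fields = line.split("|||")
            if len(fields) >= 2:
                words.update(fields[1].split())
    return words

def select_best_translation(candidates, phrase_words):
    # Keep the candidate whose words best match the phrase-pair dictionary.
    # Length normalization is added here to avoid favoring long outputs (assumption).
    def coverage(sentence):
        tokens = sentence.split()
        return sum(token in phrase_words for token in tokens) / max(len(tokens), 1)
    return max(candidates, key=coverage)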


Conflict of interest statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figures

Figure 1
Proposed hybrid framework that incrementally trains the NMT with the best translations of the monolingual corpus as evaluated by the SMT framework and includes beam search decoding to optimize the translation performance for low-resource languages.
Figure 2
Comprehensive representation of the four model variants. Model Variant IV comprises all the components in the framework, including Block 1 and Block 2. Model Variant III comprises all the components except Block 2. Model Variant II excludes both Block 1 and Block 2 and simply trains the NMT incrementally with random sampling. Model Variant I is the baseline NMT, which includes only the encoder-decoder block.
Figure 3
Data generation for the proposed hybrid model. The original small parallel corpus is augmented incrementally with the quasi-parallel corpus produced by the hybrid model in each iteration. This repeating cycle of data augmentation, retraining, and prediction generates a large quasi-parallel corpus, adequate to improve the translation performance of the model for low-resource languages.
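
The augment-and-retrain cycle shown in Figure 3 can be sketched as a simple loop, reusing select_best_translation from the sketch above; train_nmt and beam_translate are hypothetical hooks standing in for whatever NMT toolkit is used, so this illustrates the cycle rather than the authors' implementation.

def incremental_training(train_nmt, beam_translate, parallel_corpus,
                         monolingual_corpus, phrase_words,
                         iterations=5, beam_width=5):
    # train_nmt(corpus) -> model and beam_translate(model, src, k) -> k candidate
    # strings are caller-supplied hooks (hypothetical, toolkit-dependent).
    corpus = list(parallel_corpus)                 # small seed parallel corpus
    model = None
    for _ in range(iterations):
        model = train_nmt(corpus)                  # retrain on the augmented corpus
        quasi_parallel = []
        for src in monolingual_corpus:
            candidates = beam_translate(model, src, beam_width)  # k-best via beam search
            best = select_best_translation(candidates, phrase_words)
            quasi_parallel.append((src, best))
        corpus.extend(quasi_parallel)              # augment for the next iteration
    return model, corpus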
Figure 4
Baseline NMT with varying corpus size and epochs for (A) In-Domain Parallel Corpus; (B) Out-of-Domain Parallel Corpus; (C) Own Corpus.
Figure 5
NMT with random sampling-based incremental training of monolingual corpus with varying set sizes. (A) In-Domain Monolingual Corpus; (B) Out-of-Domain Monolingual Corpus.
Figure 6
NMT with SMT-integrated incremental training of In-Domain monolingual corpus.
Figure 7
BLEU scores of varying corpus sizes and beam width.
Figure 8
Comparative analysis of four NMT model variants in terms of BLEU and METEOR scores for the in-domain (A) Eng-to-Tam corpus and (B) Tam-to-Eng corpus.
