This study presents a fine-tuned mT5 model tailored for Vietnamese-Chinese translation, addressing the challenges of data scarcity and linguistic diversity. We enhance the model with a sampling-based back-translation method that leverages extensive monolingual corpora, achieving BLEU scores of 70.1 with baseline training and 64.1 after incorporating back-translated data, outperforming several strong baselines. Additionally, we propose a corpus construction strategy that uses LASER embeddings and the Hungarian matching algorithm to extract parallel sentences from bilingual news sources. These results demonstrate the potential of mT5 and back-translation to improve translation quality for low-resource language pairs while providing a scalable approach to corpus construction.
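The corpus construction step can be sketched as follows: embed the sentences on each side (LASER in our pipeline, though any sentence embedder stands in for it here), build a cosine-distance cost matrix, and solve the resulting one-to-one assignment with the Hungarian algorithm. The function below is an illustrative sketch using `scipy.optimize.linear_sum_assignment`; the embedding matrices are assumed to be precomputed.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def align_sentences(src_embs, tgt_embs):
    """One-to-one alignment of two sentence-embedding matrices (n x d, m x d).

    Returns the matched (src_index, tgt_index) pairs and their
    cosine-distance costs, found by the Hungarian algorithm.
    """
    # Normalize rows so the dot product is cosine similarity.
    src = src_embs / np.linalg.norm(src_embs, axis=1, keepdims=True)
    tgt = tgt_embs / np.linalg.norm(tgt_embs, axis=1, keepdims=True)
    cost = 1.0 - src @ tgt.T                 # cosine distance as assignment cost
    rows, cols = linear_sum_assignment(cost) # minimum-cost one-to-one matching
    return list(zip(rows, cols)), cost[rows, cols]
```

In practice, matched pairs whose assignment cost exceeds a threshold would be discarded, since not every sentence on one side has a genuine counterpart on the other.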
In an interconnected world, the demand for accurate translation systems is growing. Yet translating low-resource language pairs (those with limited parallel corpora) remains a major challenge in natural language processing (NLP). Vietnamese-Chinese translation exemplifies these difficulties, as it involves languages with distinct linguistic structures and limited bilingual data. This language pair is critical for fostering cultural exchange, trade, and diplomacy in Southeast Asia.
Vietnamese, a tonal language written in the Latin script, contrasts sharply with Chinese, which uses a character-based script and differs in tone system and syntax. The diversity within Chinese, spanning varieties such as Mandarin and Cantonese, adds further complexity. These linguistic differences, combined with data scarcity, underscore the need for innovative approaches to improving translation quality.
This study leverages Multilingual T5 (mT5), a transformer model pre-trained on multilingual datasets, to address these challenges. mT5 is the multilingual variant of the Text-to-Text Transfer Transformer (T5), which introduced a unified text-to-text framework and scaling strategies that achieve state-of-the-art performance on a wide range of NLP tasks. mT5 supports zero-shot and few-shot learning, making it well suited to low-resource scenarios. We fine-tune mT5 on Vietnamese-Chinese datasets and incorporate back-translation, a data augmentation technique that generates synthetic parallel data from monolingual corpora, which enriches training and mitigates overfitting.
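Sampling-based back-translation can be sketched in a few lines. The `translate` callable below is a hypothetical stand-in for a trained target-to-source model; in our setting it would be an mT5 checkpoint decoding with sampling (e.g., `do_sample=True` in common transformer toolkits) rather than beam search, so repeated calls can yield diverse synthetic sources for the same target sentence.

```python
def back_translate(mono_target_sents, translate, samples_per_sent=1):
    """Build synthetic (source, target) training pairs from target-side
    monolingual text, using a target-to-source translation callable.
    """
    synthetic = []
    for tgt in mono_target_sents:
        for _ in range(samples_per_sent):
            src = translate(tgt)             # sampled back-translation
            synthetic.append((src, tgt))     # synthetic source, genuine target
    return synthetic
```

The synthetic pairs are then mixed with the genuine parallel data for fine-tuning; only the source side is machine-generated, so the model still learns to produce fluent target-language output.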
By integrating mT5 and back-translation, we aim to improve Vietnamese-Chinese translation quality, particularly in domain-specific contexts such as technical and medical texts. We achieve BLEU scores of 70.1 and 64.1 for the baseline and back-translated models, respectively. Our findings contribute to the broader field of low-resource translation, offering strategies to bridge linguistic and data gaps for other underrepresented language pairs.
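For reference, BLEU, the metric reported above, combines clipped n-gram precisions with a brevity penalty. The single-reference, unsmoothed sketch below is illustrative only; reported scores are computed with a standard evaluation toolkit over the full test set.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU (0-100) for one candidate against one reference,
    both given as token lists. No smoothing is applied."""
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(candidate, n))
        ref_counts = Counter(ngrams(reference, n))
        # Clip each n-gram count by its count in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty: penalize candidates shorter than the reference.
    if len(candidate) >= len(reference):
        bp = 1.0
    else:
        bp = math.exp(1 - len(reference) / max(len(candidate), 1))
    return 100 * bp * geo_mean
```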
In recent years, research on improving machine translation systems for low-resource languages has received considerable attention from both academia and industry. Prior work includes collecting more parallel translation data, training large multilingual models, and applying data augmentation or regularization techniques. ParaCrawl (ParaCrawl: Web-Scale Acquisition of Parallel Corpora) and Bicleaner AI (Bicleaner Goes Neural) focused on mass-crawling parallel translation data for many low-resource language pairs. Yet previous work shows that crawling at scale still has limitations and can hurt downstream translation performance.
Encouraging results have also been achieved in low-resource Chinese-Vietnamese translation. WikiMatrix (WikiMatrix: Mining 135M Parallel Sentences in 1620 Language Pairs from Wikipedia) includes a Chinese-Vietnamese dataset, on which translation reaches about 13 BLEU. VBD-MT (arXiv:2308.07601) used mBART alongside ensembling and post-processing strategies to achieve 38.9 BLEU for Chinese-to-Vietnamese and 38.0 BLEU for Vietnamese-to-Chinese on the public VLSP 2022 test sets, outperforming several strong baselines.
The Multilingual Machine Translation Project KC4.0 at VNU-UET proposed KC4MT, a high-quality multilingual parallel corpus in the news domain covering several low-resource languages, including Vietnamese, Lao, and Khmer, to improve the quality of multilingual machine translation for these languages.