MuLan: Adapting Multilingual Diffusion Models for Hundreds of Languages with Negligible Cost

Researchers from Tsinghua, Shanghai, Beijing, Hong Kong, and Johns Hopkins have developed a method for adapting diffusion models to hundreds of languages at minimal cost. They achieve this by swapping the text encoder for a multilingual one and training it to produce representations consistent with the original CLIP encoder, leveraging parallel text data and English image-text data. The results look impressively multilingual, and the generation quality, as measured by CLIP representation similarity, appears promising (although I am not sure how convincing automatic evaluation can be in such cases).
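A minimal sketch of how such an encoder swap could be trained, assuming a frozen CLIP text tower as the teacher and XLM-R as the multilingual student; the model choices, CLS pooling, and the plain MSE objective are my assumptions for illustration, not necessarily the authors' exact recipe.

```python
import torch
import torch.nn as nn
from transformers import (AutoModel, AutoTokenizer,
                          CLIPTextModelWithProjection, CLIPTokenizer)

teacher_name = "openai/clip-vit-large-patch14"  # frozen CLIP text encoder (teacher)
student_name = "xlm-roberta-base"               # multilingual replacement (assumption)

teacher_tok = CLIPTokenizer.from_pretrained(teacher_name)
teacher = CLIPTextModelWithProjection.from_pretrained(teacher_name).eval()
for p in teacher.parameters():
    p.requires_grad_(False)

student_tok = AutoTokenizer.from_pretrained(student_name)
student = AutoModel.from_pretrained(student_name)
# Map the student's hidden size onto CLIP's embedding space.
proj = nn.Linear(student.config.hidden_size, teacher.config.projection_dim)

optimizer = torch.optim.AdamW(
    list(student.parameters()) + list(proj.parameters()), lr=1e-5)

def train_step(english_texts, other_texts):
    """Pull the student's embedding of a (possibly non-English) sentence
    toward the frozen CLIP embedding of its English counterpart."""
    with torch.no_grad():
        t_in = teacher_tok(english_texts, padding=True,
                           truncation=True, return_tensors="pt")
        target = teacher(**t_in).text_embeds           # (batch, clip_dim)
    s_in = student_tok(other_texts, padding=True,
                       truncation=True, return_tensors="pt")
    pooled = student(**s_in).last_hidden_state[:, 0]   # CLS pooling (assumption)
    loss = nn.functional.mse_loss(proj(pooled), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# English image caption paired with its German translation (toy example).
print(train_step(["a photo of a dog"], ["ein Foto von einem Hund"]))
```

Pairing English text with itself under the same objective would also exploit the monolingual English captions, which is presumably how the English image-text data enters the picture without needing any images at this stage.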

On The Origin of Cultural Biases in Language Models: From Pre-training Data to Linguistic Phenomena

Folks from Georgia Tech observe that LLMs prompted in Arabic are biased toward Western/English-language entities. They find that the bias correlates with the polysemy of entity names in Arabic (polysemous entities are disadvantaged), with over-tokenization being the second strongest predictor.
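To make "over-tokenization" concrete, here is a tiny illustration of token fertility: counting how many subword pieces an entity name is split into. The tokenizer and the example entities are my picks, not the paper's setup.

```python
from transformers import AutoTokenizer

# Tokenizer choice is mine, purely for illustration; GPT-2's BPE vocabulary
# is English-centric, so Arabic strings fragment into many byte-level pieces.
tok = AutoTokenizer.from_pretrained("gpt2")

entities = {
    "New York": "نيويورك",   # English entity name vs. its Arabic form
    "Cairo": "القاهرة",
}

for english, arabic in entities.items():
    n_en = len(tok.tokenize(english))
    n_ar = len(tok.tokenize(arabic))
    # A ratio above 1 means the Arabic form is "over-tokenized" relative
    # to the English name, the effect the paper links to the bias.
    print(f"{english}: {n_en} tokens | {arabic}: {n_ar} tokens | "
          f"fertility ratio {n_ar / n_en:.1f}")
```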

Machine Translation Models are Zero-Shot Detectors of Translation Direction

This study from the University of Zurich (from last year, but newly discovered by me) shows how machine translation models can be used as zero-shot detectors of translation direction. By comparing the probabilities an MT model assigns to a sentence pair in both translation directions, the authors achieve reasonably good accuracy in telling the original text apart from its translation.
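A hedged sketch of the core trick, assuming off-the-shelf OPUS-MT models: score the sentence pair with an MT model in each direction and guess that the higher-scoring direction is the true one (the intuition being that translationese is easier to predict). The length-normalized scoring below is my simplification of the paper's procedure.

```python
import torch
from transformers import MarianMTModel, MarianTokenizer

def avg_logprob(model_name, src, trg):
    """Length-normalized log-probability of trg given src under an MT model."""
    tok = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name).eval()
    inputs = tok(src, return_tensors="pt")
    labels = tok(text_target=trg, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(**inputs, labels=labels)
    return -out.loss.item()  # loss is the mean per-token NLL, so negate it

english = "The quick brown fox jumps over the lazy dog."
german = "Der schnelle braune Fuchs springt über den faulen Hund."

score_en_de = avg_logprob("Helsinki-NLP/opus-mt-en-de", english, german)
score_de_en = avg_logprob("Helsinki-NLP/opus-mt-de-en", german, english)
guess = "en->de" if score_en_de > score_de_en else "de->en"
print(f"log p(de|en)={score_en_de:.3f}, log p(en|de)={score_de_en:.3f}, "
      f"guessed direction: {guess}")
```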

Analyzing the Effect of Linguistic Similarity on Cross-Lingual Transfer: Tasks and Experimental Setups Matter

Researchers at Apple ran extensive experiments on cross-lingual transfer across various tasks, including part-of-speech (POS) tagging, dependency parsing, and topic classification. The tasks themselves are rather boring, but the main advantage is that datasets for them are available in a great many languages. The study aims to identify the linguistic similarity features most predictive of cross-lingual transfer performance, and the result is, unfortunately, the expected one: it always depends on the task.
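For a sense of the methodology, here is an illustrative sketch, with made-up numbers, of regressing transfer scores on similarity features and reading off which feature matters; the feature names and all values are hypothetical placeholders, not the paper's data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

features = ["syntactic", "geographic", "phonological"]  # hypothetical feature set
# Rows: source-target language pairs; columns: similarities in [0, 1].
X = np.array([
    [0.9, 0.8, 0.7],   # e.g., English -> German (placeholder values)
    [0.4, 0.2, 0.5],   # e.g., English -> Japanese
    [0.7, 0.3, 0.6],   # e.g., English -> Hindi
    [0.8, 0.9, 0.8],   # e.g., English -> Dutch
])
y = np.array([0.85, 0.42, 0.61, 0.88])  # made-up transfer scores for one task

reg = LinearRegression().fit(X, y)
for name, coef in zip(features, reg.coef_):
    print(f"{name}: {coef:+.2f}")
# Re-fitting this per task (POS tagging, parsing, topic classification) is
# what reveals the paper's takeaway: the dominant feature shifts with the task.
```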