Here are some of my notes and comments on what I had a chance to see at ACL in Dublin last week (my first in-person conference since 2019).

ACL D&I 60-60 initiative

ACL announced its 60-60 initiative: for ACL's 60th birthday, all materials that ACL produces should be available in 60 languages. The initiative has already machine-translated the titles and abstracts of ACL 2022 papers and provided an automatic voiceover for the plenary talks at the conference. Although this is definitely a cool idea, what I had a chance to see so far was a little disappointing to me. The quality of the MT is not great at all, and the voiceover in the talks speeds up and slows down the generated speech in such a way that it is sometimes completely silent and sometimes not understandable at all.

Also, I am not entirely convinced by the choice of materials selected for localization. Papers presented at ACL are typically very advanced, and I can hardly imagine a person who can read an ACL paper but cannot read English. It would make much more sense to translate and make available beginner-level tutorials from the conferences, and to focus on the quality of the translation rather than the quantity.

The coolest thing the initiative is starting is a curated list of terminology in multiple languages. Recently, when we communicated with the media about our Czech-Ukrainian translation, we lacked Czech terminology, and creating it on the fly seemed a little inappropriate. Having a curated, community-approved glossary would be a big help, also for undergraduate teaching. (And I will be more than happy to contribute to this effort.)

Big ideas panel

One of the panels at the conference presented ideas that several established researchers think are worth exploring in the near future and will bring some progress. These included storytelling, using language for reasoning, and language for communication with devices in the wild. This shows a shift from looking mostly at the propositional use of language towards doing actual things with language. This is cool; the community seems to be shifting towards what language really is.

Paper highlights (in random order)

What do Models Learn from Training on More than Text? Measuring Visual Commonsense Knowledge

The paper probes pre-trained models for knowledge about the color and shape of common objects. Language-vision models seem to perform better than text-only models. Sounds cool, but this paper shows that this is probably due to the texts they are trained on: when text-only models are finetuned on image captions, they perform very similarly to the multimodal models. It seems that it is enough to tell BERT we are playing a different game: saying literal stuff about something visual.
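For a rough idea of what probing for visual commonsense can look like, here is a minimal sketch of my own (not the paper's protocol; the prompt template and the choice of bert-base-uncased are my assumptions), querying a masked LM for object colors with the Hugging Face fill-mask pipeline:

```python
# Toy probe: ask a masked LM about the color of common objects.
# My own minimal illustration, not the evaluation setup from the paper.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

objects = ["banana", "tomato", "grass", "snow"]
for obj in objects:
    prompt = f"The color of a {obj} is [MASK]."
    predictions = fill_mask(prompt, top_k=3)
    answers = ", ".join(f"{p['token_str']} ({p['score']:.2f})" for p in predictions)
    print(f"{obj}: {answers}")
```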

Memorisation versus Generalisation in Pre-trained Language Models

The experiments are based on noisily labeled data: as long as the model only generalizes, it should not fit the noisy data instances, so fitting them is a sign of memorization. When finetuning BERT and other models, they empirically find three stages of learning: Phase 1, Fitting: the accuracy quickly increases as the model learns the simplest patterns; Phase 2, Settling: it looks like nothing happens; Phase 3, Memorisation: the model starts to memorize the noisy training examples.
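To make the setup concrete, here is a toy sketch of my own (synthetic data and a small MLP instead of BERT, all numbers made up): flip a fraction of the labels and track training accuracy separately on clean and flipped examples; memorisation shows up when accuracy on the flipped examples starts to climb.

```python
# Toy illustration of the fitting / settling / memorisation phases:
# flip a fraction of labels and watch accuracy on clean vs. flipped examples.
# My own synthetic setup, not the paper's BERT experiments.
import torch
import torch.nn as nn

torch.manual_seed(0)
n, dim, noise_rate = 2000, 20, 0.1

X = torch.randn(n, dim)
true_w = torch.randn(dim)
y_clean = (X @ true_w > 0).long()

# Flip a random 10% of the labels.
noisy_mask = torch.rand(n) < noise_rate
y = y_clean.clone()
y[noisy_mask] = 1 - y[noisy_mask]

model = nn.Sequential(nn.Linear(dim, 256), nn.ReLU(), nn.Linear(256, 2))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(200):
    optimizer.zero_grad()
    logits = model(X)
    loss = loss_fn(logits, y)
    loss.backward()
    optimizer.step()

    if epoch % 20 == 0:
        preds = logits.argmax(dim=1)
        clean_acc = (preds[~noisy_mask] == y[~noisy_mask]).float().mean()
        noisy_acc = (preds[noisy_mask] == y[noisy_mask]).float().mean()
        print(f"epoch {epoch:3d}  clean acc {clean_acc:.2f}  flipped acc {noisy_acc:.2f}")
```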

Infinite Memory Transformer

The model has a short-term memory (the standard thing from the Transformer) and a long-term memory. Previous work (the Compressive Transformer) implemented the long-term memory as a fixed-size matrix to attend to. This paper generalizes the approach and builds a continuous approximation of the memory (per dimension) with a constant number of basis functions; the attention is then a continuous unimodal distribution. They managed to finetune GPT-2 with this trick.
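Here is a rough numpy sketch of the core idea as I understand it: approximate a long memory of hidden states, per dimension, with a fixed number of Gaussian basis functions fitted by least squares. The basis choice and the fitting procedure are my simplifications; the actual model parameterizes this differently and adds continuous attention on top.

```python
# Rough sketch of compressing a long memory into a fixed number of basis
# functions: fit each hidden dimension as a continuous function over [0, 1].
# My own simplification of the idea, not the paper's implementation.
import numpy as np

L, d, n_basis = 1024, 64, 32          # memory length, hidden size, basis count

memory = np.random.randn(L, d)        # stand-in for past hidden states

# Gaussian radial basis functions evaluated at the normalized positions.
positions = np.linspace(0.0, 1.0, L)[:, None]          # (L, 1)
centers = np.linspace(0.0, 1.0, n_basis)[None, :]      # (1, n_basis)
width = 1.0 / n_basis
Psi = np.exp(-((positions - centers) ** 2) / (2 * width ** 2))   # (L, n_basis)

# Least-squares fit: memory ~ Psi @ coeffs, one fit per hidden dimension.
coeffs, *_ = np.linalg.lstsq(Psi, memory, rcond=None)  # (n_basis, d)

def read_memory(t):
    """Read the compressed memory at any continuous position t in [0, 1]."""
    phi = np.exp(-((t - centers[0]) ** 2) / (2 * width ** 2))    # (n_basis,)
    return phi @ coeffs                                          # (d,)

reconstruction = Psi @ coeffs
print("compression ratio:", (n_basis * d) / (L * d))
print("reconstruction MSE:", np.mean((reconstruction - memory) ** 2))
```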

An Embarrassingly Simple Method to Mitigate Undesirable Properties of Pretrained Language Model Tokenizers

The recipe is: use a recursive algorithm to find the k longest non-overlapping subwords from the vocabulary and throw away the rest. They test it on classification of arXiv abstracts, which use difficult language, so everything splits into many subwords. While keeping BERT as it is, they just change the tokenization (with the same vocabulary) and get better results. The parts of the words that cannot be parsed are thrown away, so this approach can hardly work for sequence generation. However, it is cool to see that language models can quickly adapt to differently segmented inputs when the vocabulary remains the same.
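A short sketch of the recipe as I understand it (my own greedy re-implementation from the description, with a made-up toy vocabulary, not the authors' code): recursively take the longest in-vocabulary substring, split the word around it, keep at most k subwords, and drop whatever remains unmatched.

```python
# Sketch of the tokenization recipe described above: recursively pick the
# longest in-vocabulary substring, split around it, keep at most k subwords,
# and silently drop whatever cannot be matched.
def longest_subwords(word, vocab, k):
    if k <= 0 or not word:
        return []
    # Find the longest substring of `word` that is in the vocabulary.
    best = None
    for length in range(len(word), 0, -1):
        for start in range(len(word) - length + 1):
            candidate = word[start:start + length]
            if candidate in vocab:
                best = (start, candidate)
                break
        if best is not None:
            break
    if best is None:
        return []   # nothing matches: this part of the word is thrown away
    start, subword = best
    left = longest_subwords(word[:start], vocab, k - 1)
    right = longest_subwords(word[start + len(subword):], vocab, k - 1 - len(left))
    return left + [subword] + right

vocab = {"token", "ization", "vis", "ual", "common", "sense"}
print(longest_subwords("tokenization", vocab, k=2))   # ['token', 'ization']
print(longest_subwords("commonsense", vocab, k=2))    # ['common', 'sense']
```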

A Natural Diet: Towards Improving Naturalness of Machine Translation Output

In the paper, they try to do MT that generates less translationese and more natural text. They have two language models, a natural one and a translation one, and use them to decide what is (or at least sounds like) translationese. In the style of tagged back-translation, they add one tag for natural-sounding targets and another for translationese, so the model can translate sentences in both styles. The natural-sounding outputs are much worse in BLEU, but slightly better in human evaluation.
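A minimal sketch of how such tagging could work, purely my own illustration: score each target sentence under the two LMs and prepend the corresponding tag to the source side. Here both LMs are plain GPT-2 placeholders; in the paper they are separate models, one trained on original text and one on translated text, and the tag names are my invention.

```python
# Sketch of tagging training targets as natural vs. translationese by
# comparing their likelihood under two language models, in the spirit of
# tagged back-translation. My own illustration, not the paper's setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
natural_lm = AutoModelForCausalLM.from_pretrained("gpt2")        # placeholder
translationese_lm = AutoModelForCausalLM.from_pretrained("gpt2") # placeholder

def mean_logprob(model, sentence):
    """Average per-token log-probability of a sentence under a causal LM."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs, labels=inputs["input_ids"])
    return -outputs.loss.item()   # loss is the mean negative log-likelihood

def tag_for_target(target_sentence):
    natural_score = mean_logprob(natural_lm, target_sentence)
    translationese_score = mean_logprob(translationese_lm, target_sentence)
    return "<natural>" if natural_score >= translationese_score else "<translationese>"

source, target = "Ein Beispielsatz.", "An example sentence."
tagged_source = f"{tag_for_target(target)} {source}"
print(tagged_source)
```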

Language-agnostic BERT Sentence Embedding

Yet another model that you cannot train at home: sentence embeddings for 100 languages, a 12-layer Transformer, a 501k vocabulary. They initialize it with BERT and train on parallel data, pulling parallel sentences closer together and pushing non-parallel sentences further apart. Additive margin softmax seems to be a key ingredient.
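The loss is roughly the following (my own toy sketch of an additive-margin softmax over in-batch negatives; the margin and scale values are arbitrary, and the real training is bidirectional and at a much larger scale):

```python
# Toy sketch of an additive-margin softmax over in-batch negatives:
# parallel sentence pairs (x_i, y_i) should be close, everything else far.
# My own illustration of the loss, not the actual training code.
import torch
import torch.nn.functional as F

def additive_margin_softmax_loss(x_emb, y_emb, margin=0.3, scale=20.0):
    """x_emb, y_emb: (batch, dim) embeddings of parallel sentences."""
    x = F.normalize(x_emb, dim=-1)
    y = F.normalize(y_emb, dim=-1)
    sim = x @ y.t()                                   # cosine similarity matrix
    # Subtract the margin from the true-pair similarities on the diagonal.
    sim = sim - margin * torch.eye(sim.size(0))
    targets = torch.arange(sim.size(0))               # each x_i matches y_i
    return F.cross_entropy(scale * sim, targets)

x_emb = torch.randn(8, 128)
y_emb = torch.randn(8, 128)
print(additive_margin_softmax_loss(x_emb, y_emb))
```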