This week, I will have a look at the best paper from this year’s COLING, which offers an interesting perspective on inference in NMT models. The title of the paper is “Is MAP Decoding All You Need? The Inadequacy of the Mode in Neural Machine Translation” and its authors are from the University of Amsterdam.

NMT models learn the conditional probability of the next word in a target sentence given the source sentence and the previous words in the target sentence. Using the chain rule, we can multiply those probabilities and thus get the probability of the target sentence given a source sentence. During training, we optimize the models such that the target sentences from the training data get as high a probability as possible. At inference time, we want to generate the most probable sentence given the model. Exact search would be computationally very expensive, so we approximate the search for the most probable target sentence using a heuristic algorithm called beam search. At least this is how NMT is usually explained.
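To make the chain-rule factorization and the search concrete, here is a minimal, self-contained sketch of sentence scoring and beam search. The toy next-token distribution stands in for a real NMT decoder, and all names in it (`toy_model`, `VOCAB`, the example tokens) are made up for illustration.

```python
import math

EOS = "</s>"
VOCAB = {"the": 0.4, "cat": 0.3, "sat": 0.2, EOS: 0.1}

def toy_model(source, prefix):
    """Return P(next token | source, prefix) as a dict token -> probability."""
    return VOCAB  # a real NMT decoder would condition on the source and the prefix

def sentence_log_prob(source, target):
    """Chain rule: log P(target | source) = sum over t of log P(y_t | y_<t, source)."""
    return sum(math.log(toy_model(source, target[:t])[token])
               for t, token in enumerate(target))

def beam_search(source, beam_size=3, max_len=10):
    """Keep only the beam_size highest-scoring prefixes at every step."""
    beams = [([], 0.0)]            # (prefix, accumulated log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for token, p in toy_model(source, prefix).items():
                candidates.append((prefix + [token], score + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_size]:
            (finished if prefix[-1] == EOS else beams).append((prefix, score))
        if not beams:
            break
    return max(finished + beams, key=lambda c: c[1])

print(sentence_log_prob("ein beispiel", ["the", "cat", EOS]))
print(beam_search("ein beispiel"))
```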

When we look at this in more detail, we find out this is not really true. In 2019, Felix Stahlberg and Bill Byrne (MT Weekly 20) found out that if we do exact decoding, the best-scoring target sentence is very often an empty sequence. In fact, beam search does not approximate finding the most probable sentence, but something else. It is not clear what, but it leads to better translations. This year, an EMNLP Honorable Mention paper (MT Weekly 56) came with a hypothesis that beam search works well because it prefers sequences with more uniform information density, which is in agreement with psycholinguistic theories.
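To see how the empty sequence can end up as the mode, here is a toy calculation with made-up numbers (not taken from the paper): the probability of a long translation is a product of many per-token probabilities, so even a tiny probability of an immediate end-of-sentence token can beat it.

```python
import math

# Toy numbers: suppose the model gives the empty translation (immediate
# end-of-sentence) probability 0.005, while a good 20-token translation
# gets an average per-token probability of 0.7.
p_empty = 0.005
p_good = 0.7 ** 20          # product of 20 per-token probabilities

print(f"log P(empty translation)   = {math.log(p_empty):.2f}")
print(f"log P(20-token hypothesis) = {math.log(p_good):.2f}")
print("the empty sequence is the mode:", p_empty > p_good)
```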

The COLING paper approaches the same problem differently. It says that we do not necessarily want to get the most probable sequence, even though it argues that the models learn the conditional word probability distributions correctly. The authors support the claim that the models are well trained by comparing basic statistical properties of the training data with sentences sampled from the model; these properties are more or less preserved by the model. The properties of sentences generated by beam search are, on the other hand, quite different.
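A rough sketch of what such a comparison could look like: compute a few simple statistics (sentence length, vocabulary size, most frequent tokens) for the training targets, for unbiased samples from the model, and for beam-search outputs. The data variables below are placeholders, and the statistics are only an example of the kind of properties one might compare, not the exact ones from the paper.

```python
import statistics
from collections import Counter

def basic_stats(sentences):
    """A few simple statistics over a list of tokenized sentences."""
    lengths = [len(s) for s in sentences]
    unigrams = Counter(token for s in sentences for token in s)
    return {
        "mean_length": statistics.mean(lengths),
        "length_stdev": statistics.pstdev(lengths),
        "vocab_size": len(unigrams),
        "top_tokens": unigrams.most_common(3),
    }

# `training_targets`, `model_samples` and `beam_outputs` are placeholders for
# the reference side of the training data, unbiased samples from the model and
# beam-search translations of the same source sentences.
# for name, sentences in [("training data", training_targets),
#                         ("samples", model_samples),
#                         ("beam search", beam_outputs)]:
#     print(name, basic_stats(sentences))

# Tiny runnable demo with made-up sentences:
print(basic_stats([["the", "cat", "sat"], ["a", "dog", "ran", "away"]]))
```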

Based on this, they propose that Minimum Bayes Risk decoding might be more appropriate. In practice, this means that they sample a bunch of sentences from the model, compare all of them with each other using the METEOR score, and say that the one that is on average the most similar to all the others is the best. This sounds weird, but there is very good reasoning behind the algorithm: there is not only one good translation but many of them. Since good translations are likely to be similar to each other, they sort of fight with each other for the same part of the probability mass (partially due to the chain-rule factorization when they share the same prefix). A worse hypothesis with a different prefix can thus get a higher probability score.
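In Python, MBR decoding over a pool of samples might look like the sketch below. A cheap unigram F1 stands in for METEOR, and the sample sentences are invented; the point is only the structure: score every sample by its average similarity to all the others and pick the winner.

```python
from collections import Counter

def unigram_f1(hyp, ref):
    """Cheap stand-in for METEOR: unigram F1 overlap of two token lists."""
    overlap = sum((Counter(hyp) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(hyp)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def mbr_decode(samples, utility=unigram_f1):
    """Pick the sample with the highest average utility against all others."""
    def expected_utility(hyp):
        others = [s for s in samples if s is not hyp]
        return sum(utility(hyp, ref) for ref in others) / len(others)
    return max(samples, key=expected_utility)

# Hypothetical samples drawn from the model for one source sentence.
samples = [
    "the cat sat on the mat".split(),
    "a cat sat on the mat".split(),
    "the cat is sitting on the mat".split(),
    "completely different output".split(),
]
print(" ".join(mbr_decode(samples)))
```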

Figure: Probabilities of single hypotheses.

But this is not what we intuitively want. A sentence from the high-probability cluster would probably be better, even though each single sentence gets a smaller probability score. When we sample from the model, we are more likely to get sentences from the high-probability area, and a sentence that is similar to other sentences is likely to be in the middle of such a cluster. Unfortunately, the experiments in the paper show translation quality slightly below the standard beam search.


The idea seems very promising to me. I can very well imagine that using a better and faster similarity metric than METEOR, or some method that takes into account that there might be more than one cluster of highly probable hypotheses, might help.