Last week, I was at EMNLP in Miami, and here are a few notes about what I saw at the conference.

Keynotes

The conference had three keynotes: two good and one amazing.

In the first keynote, Percy Liang talked about the LLM research they do at Stanford. One topic was LLM-based agents: he predicts that LLM agents are awaiting their AlphaGo moment, when we will move away from hand-coded agents; soon, the big topic will be agents trained with reinforcement learning, in other words, small black boxes operating on top of big black boxes. I am not completely sure about this.

The second keynote, by Anca Dragan, discussed reinforcement learning. As always when reinforcement learning comes up, we saw examples of badly specified objectives that the algorithms happily exploited. An interesting point was that when using RL in practice, we often need to choose whether we want a rational agent or a human-like agent.

The third keynote, delivered by Tom Griffiths, was about Bayesian thinking about LLMs. Initially, I was worried about how understandable the talk would be, because "Bayesian" typically means plenty of complicated math that I have a hard time following, but the talk was excellent. One nice result he showed was that some of the reasoning errors of LLMs (famous examples: preferring option "A" in multiple-choice QA, or only being able to decipher substitution ciphers with the popular shifts) can be explained by a strong prior from the training data:

\[P(\text{answer}|\text{query}) \propto P(\text{query}|\text{answer})\cdot P(\text{answer})\]

Compensating for the prior probability of the answer may help. These results might seem trivial at first glance, but what they show is that in many cases (if we accept the popular view that LLMs are the way of doing intelligence, whatever that is), we do not want the LLM to model what is probable in language; we want it just to do some reasoning. This decomposition might, in turn, show that we sort of have to choose between modeling language and modeling intelligence, which might suggest that LLMs are not the way to AGI (not that I ever thought they were).
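To make the prior-compensation idea concrete, here is a minimal sketch of how one could rescore multiple-choice answers by subtracting a scaled log-prior of each answer. This is my own illustration, not code from the talk: the function, the scaling factor alpha, and all the numbers are made up.

```python
def compensate_prior(lm_logprob, answer_logprior, alpha=1.0):
    """Rescore answers as log P(a | query) - alpha * log P(a).

    lm_logprob[a]      -- the LM's score for answer a given the query
    answer_logprior[a] -- the LM's score for answer a alone (a prior estimate)
    alpha              -- how strongly to discount the prior (1.0 = full Bayes)
    """
    return {a: lm_logprob[a] - alpha * answer_logprior[a] for a in lm_logprob}


# Toy numbers: the raw scores prefer "A" only because "A" is a priori likely.
lm_logprob = {"A": -0.9, "B": -1.2}
answer_logprior = {"A": -0.4, "B": -1.5}

print(max(lm_logprob, key=lm_logprob.get))   # "A" before compensation
rescored = compensate_prior(lm_logprob, answer_logprior)
print(max(rescored, key=rescored.get))       # "B" after compensation
```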

Conference papers

On the first day, there were several papers without much empirical content but with interesting ideas.

  • Pragmatic Norms Are All You Need – Why The Symbol Grounding Problem Does Not Apply to LLMs. Yet another paper that debunks the famous octopus thought experiment by Bender and Koller (2020). The paper mostly attacks how meaning is defined in the original paper: even though the octopus paper defines meaning via communicative intent, the analysis here argues that the underlying assumption is a simple correspondence theory of meaning, which has been criticized many times and for which better theories exist.
  • How to Compute the Probability of a Word. Linguists often need to estimate word probabilities (e.g., to estimate reading times based on surprisal). With subword-based language models, this is not straightforward, and the paper comes with some probability tricks to compensate for that (a naive baseline is sketched after this list).
  • Towards similarity-aware Surprisal Theory. This paper tackles an everlasting problem of neural networks in NLP: in the very last layer, we assume that all vocabulary units are conditionally independent (given the context), which is simply not true because words can have similar meanings. The paper compensates for this by including a similarity term in the surprisal computation and, in turn, gets better estimates of human reading times (a rough sketch of the idea also follows this list).
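On the word-probability paper: the naive baseline is simply the chain rule over a word's subword tokens; the paper's actual corrections go beyond this and are not shown here. A minimal sketch with made-up token probabilities:

```python
import math

def word_logprob(subword_logprobs):
    """Chain rule over a word's subword tokens:
    log P(word | ctx) = sum_i log P(token_i | ctx, token_<i)."""
    return sum(subword_logprobs)

# e.g. an LM tokenizes "surprisal" as ["sur", "pris", "al"]; invented probabilities
logprobs = [math.log(0.2), math.log(0.6), math.log(0.9)]
print(-word_logprob(logprobs))   # the word's surprisal in nats
```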
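And a rough sketch of what a similarity term in surprisal could look like: instead of scoring a word with −log p(w), score it with −log of a similarity-weighted sum over the vocabulary. This is my own reading of the idea, not necessarily the paper's exact formulation, and the similarity matrix below is a toy.

```python
import numpy as np

def similarity_surprisal(p, sim, w):
    """Similarity-smoothed surprisal of word index w:
    -log sum_v sim[w, v] * p[v]
    (standard surprisal is the special case where sim is the identity)."""
    return -np.log(np.dot(sim[w], p))

# Toy example: words 0 and 1 are near-synonyms, word 2 is unrelated.
p = np.array([0.05, 0.80, 0.15])            # LM's next-word distribution
sim = np.array([[1.0, 0.9, 0.0],
                [0.9, 1.0, 0.0],
                [0.0, 0.0, 1.0]])
print(-np.log(p[0]))                        # standard surprisal of word 0
print(similarity_surprisal(p, sim, 0))      # much lower: a near-synonym was likely
```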

I was pretty surprised at how many NLP tasks people work on. I remember when most papers concerned core NLP tasks like tagging, parsing, or named entity recognition. Now, there are plenty of tasks, including unusual ones like fairy-tale generation, pedagogical assistants, doing math, fact-checking, etc. Retrieval-augmented generation was also a very popular technique. Often, it seemed to me that, from a technical perspective, people did fairly straightforward things; however, because I have almost zero knowledge of many of these tasks, it was still pretty interesting.

And here is my biased and, to some extent, random selection of what were, for me, the outstanding papers:

Workshops

Here are some random takeaways from the workshops, too:

  • Most submissions to the WMT translation tasks fine-tuned LLMs.
  • An important topic at the panel discussion at BlackboxNLP was that evaluation in interpretability research is hard. (My interpretation is that the panelists meant that the credibility of interpretability research is often an issue.)
  • Sebastian Ruder and his Cohere colleagues have a dataset for evaluating when models generate text in the wrong language.