Scaling won't fix training data quality
Epistemic confidence: 60%. Writing this essay allowed me to see that models could have a spectrum of real intelligence that scales. As of now, I still err on the side of the argument made here.
[Edit: Since writing this I dove deeper into scaling, and now find some of my views here too naive. The training data quality problem however seems to be a real problem.]
A couple of days ago Google DeepMind announced their Kaggle competition in which they let frontier models play poker against each other.
They run competitions for three games: chess, Werewolf, and poker, explaining that the latter is useful for testing probabilistic reasoning under imperfect information.
That's NOT what poker offers: it's not about "probabilistic" reasoning; it's about game-theoretic reasoning. Probabilistic uncertainty is baked into the game, but the hard reasoning in poker comes from game theory, not from the comparatively trivial probability aspect.
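To make that distinction concrete, here's a toy sketch of my own (not from the competition, and with hypothetical function names): the "probability" side of poker is simple arithmetic like pot odds, while the game-theoretic side is about making your opponent indifferent, e.g. the classic optimal bluffing frequency for a polarized river bet.

```python
# Toy illustration: poker's "probability" part vs. its "game theory" part.
# `bet` and `pot` are in the same chip units; formulas are the textbook ones.

def pot_odds(bet, pot):
    """Probability part: equity needed to call a bet of `bet` into `pot`.
    You risk `bet` to win `pot + bet` -- simple arithmetic."""
    return bet / (pot + 2 * bet)

def optimal_bluff_ratio(bet, pot):
    """Game-theory part: fraction of a polarized betting range that should
    be bluffs, so the caller is indifferent between calling and folding."""
    return bet / (pot + 2 * bet)

def minimum_defense_frequency(bet, pot):
    """Game-theory part: how often the defender must continue so that the
    bettor's pure bluffs don't profit automatically."""
    return pot / (pot + bet)

# For a pot-sized bet (bet == pot):
print(pot_odds(1, 1))                  # 1/3: caller needs ~33% equity
print(optimal_bluff_ratio(1, 1))       # 1/3 of the betting range as bluffs
print(minimum_defense_frequency(1, 1)) # defender must continue half the time
```

The first function is the kind of arithmetic any model can parrot; the other two are equilibrium conditions, and reasoning your way to them (rather than reciting them) is the part the displayed chains of thought never actually did.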
In the displayed LLM reasoning, which you could follow for every hand, the models use the right words and combine them into sentences that seem legit – in the sense that a real human could have written them (and one with domain insight, like a poker redditor) – but the reasoning is all jargon and zero substance. It was baffling how well a model could create the semblance of intellect without there being any.
Witnessing this, I realized that no matter how well you market chain-of-thought and "extended thinking", the models aren't capable of real thought. Otherwise, they would have reasoned about the game, with its simple rules, in a completely different way. But no, they only predict the next token. There's no understanding behind how they predict the next token; it's just their parameters. Even during chain-of-thought reasoning, they aren't doing anything different; they revisit their old tokens with new tokens, increasing the likelihood that the same mistake won't surface twice. But what if it's mistakes all the way? They require at least some – if not a good chunk – of their training data to provide the correct tokens, but the majority of training data is garbage.
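The mechanical point here can be made explicit. In a deliberately minimal caricature (the `next_token` function below is a fake stand-in, not any real model API), "extended thinking" is just the same next-token loop run longer, with the earlier output fed back in as context:

```python
# Caricature of autoregressive decoding. Chain-of-thought is nothing more
# than extra iterations of the same loop; earlier tokens become context for
# later ones. `next_token` fakes a model's sampling step so this runs.

def next_token(context):
    # A real model would score its vocabulary given `context` and sample.
    return f"tok{len(context)}"

def generate(prompt, n_tokens):
    context = list(prompt)
    for _ in range(n_tokens):
        # Each step sees only the tokens produced so far -- including its
        # own earlier "reasoning" tokens, mistakes and all.
        context.append(next_token(context))
    return context

short = generate(["Q:"], 3)    # a terse answer
long_ = generate(["Q:"], 30)   # "extended thinking": same loop, more of it
print(short)
```

Nothing structurally different happens in the longer run; the model just conditions on more of its own output, which helps only insofar as that output was right in the first place.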
This is especially evident in areas like poker, where high-signal information is almost impossible to find, let alone discern amid the overwhelming noise. But it applies across domains: most output is mediocre. Coding agents are increasingly powerful, and their ability to prototype fast is incredible. Still, their code quality is garbage. If you consider that the vast majority of codebases are poorly written, that's not a surprise. And just like in poker, the vast majority of engineers can't discern that AI code quality is shit. In poker I don't see how it could get better by scaling alone, and with that in mind, I don't see how it can get solved here either. The current approach requires higher-quality training data.
I know the "just a token-predictor" argument is not a new take, and I've had my fair share of going back and forth between that and "LLMs might actually be conscious", but after having seen the reasoning of poker playing LLMs, understanding deeply how wrong it is, and knowing where the fundamental issue comes from, I no longer believe that scaling leads to real AGI. LLMs simply don't have the capacity to actually think. Instead, they are forever bound to the quality of their training data.
Caveats:
LLMs go by examples. If you give them more good examples, more compute, and more time to go over and over their predictions, they become more powerful. That makes sense. But they can't get better than their examples – unless perhaps through great examples of learning and transfer learning. Even then, I'm not sure they can ever overcome the mediocrity of their training data.
One possible explanation is that models could actually be "stupid" – in the sense that, just like someone learning poker, they wouldn't know how to think properly about the game and instead regurgitate the bullshit of whoever falsely taught them. If that's the case, my conclusions are wrong, and this would actually speak for models becoming more intelligent through scaling.
I still believe that scaling can lead to shallow AGI: AI that can perform at or above human level on most tasks but is devoid of real understanding or consciousness.