Superintelligence is imminent and model welfare matters more than you think
I started this essay to investigate where reasoning came from, and I ended it caring about model welfare. Let me take you through this ride.
Quick run through the training pipeline: We start with pretraining on large amounts of data, which anneals towards the end with higher-quality data (sometimes called mid-training), as the later stages have a more direct effect on the final weights. Then we have post-training, where weights get adjusted through supervised finetuning (SFT) or reinforcement learning (RL). I'll focus on RL, specifically RL from human feedback (RLHF), which I'll touch upon briefly, and then RL with verifiable rewards (RLVR).
RLHF, or now increasingly RLAIF (RL from AI feedback), means scoring outputs with human graders (or an AI trained to model human preferences), then training models to get better at producing output that humans like. The obvious failure mode is sycophancy, the annoying property of models where they flatter humans and even lie to them, because our psychology can be gamed in such a way. I won't go further into RLHF, but this serves to demonstrate the fragility of human judgement and how it impacts our AI.
Now RLVR: There are domains like math and coding where success is measurable: Does the proof check in Lean (a proof assistant and programming language)? Does the code compile and pass its tests? Here we don't need to rely on human judgement but can verify mechanically whether the result is good or not. Domains where this is available (and not too expensive) are domains where we can use RLVR to improve our models. This is why models are now so good at math and coding. This is also why the intelligence of our models is jagged: We can train them to be geniuses in some domains, but it's much harder to do the same in domains that are harder to verify (e.g., philosophy). There we can only improve them insofar as humans can grade the outputs well. One proposed approach for these domains is called scalable oversight, where we get increasingly powerful AI to do the judging for us.
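To make "verifiable reward" concrete, here is a minimal sketch of what such a reward could look like for a coding task. Everything in it is my own illustration: the name verifiable_reward, the test format, and the all-or-nothing 0/1 scoring are assumptions, not how any particular lab does it. A real pipeline would also run candidates in a sandbox with time limits.

```python
from typing import Any, List, Tuple


def verifiable_reward(
    candidate_source: str,
    func_name: str,
    tests: List[Tuple[tuple, Any]],
) -> float:
    """Return 1.0 if the candidate code defines `func_name` and passes
    every test case, otherwise 0.0. (Hypothetical helper for illustration.)"""
    namespace: dict = {}
    try:
        # Assumption: we trust the input here. Real systems sandbox this step.
        exec(candidate_source, namespace)
        func = namespace[func_name]
        for args, expected in tests:
            if func(*args) != expected:
                return 0.0
    except Exception:
        # Syntax errors, missing functions, and runtime errors all score zero.
        return 0.0
    return 1.0


# Two made-up model outputs for the task "write add(a, b)".
tests = [((1, 2), 3), ((0, 0), 0)]
good = "def add(a, b):\n    return a + b\n"
bad = "def add(a, b):\n    return a - b\n"

print(verifiable_reward(good, "add", tests))
print(verifiable_reward(bad, "add", tests))
```

The point is that no human (or human-imitating grader) is in the loop: the reward is just the outcome of a mechanical check, which is why it is hard to flatter.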
So, roughly, reasoning comes from good reasoning training data and gets further sharpened in post-training. I'll go further into what this means in a moment. It's also worth mentioning that the DeepSeek team showed that you could just SFT on ~600k reasoning traces sampled from their RL-trained R1, and get strong reasoning out of much smaller models (first frontier-LLM training paper to clear Nature review, btw). So one way to answer the "where does reasoning come from" question is "wherever good reasoning traces are."
I wrote "sharpened" earlier. Let's dive into this. Yue et al. 2025 published the paper "Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?", and they found that the reasoning capacity we see after RL isn't created but elicited: It is already latent in the base model. In other words, RL helps us get the best version of what's already there, but the latent reasoning potential itself comes from pretraining. A different paper, "ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models", then argues that, actually, we can uncover novel reasoning strategies inaccessible to base models. Either way, through RL we train models to unlock or gain a lot of reasoning capability on top of what the base model does. Plus, according to Noam Brown (OpenAI reasoning), we likely don’t even know what the capability ceiling of modern LLMs is.
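The "elicited, not created" comparison is typically made with pass@k: sample k answers per problem and count the problem as solved if any one of them is correct. Below is a small sketch using the standard unbiased estimator, pass@k = 1 - C(n-c, k) / C(n, k), for n samples of which c are correct. The sample counts in the example are made up by me purely for illustration; they are not results from either paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: the probability that at least one of k
    samples, drawn without replacement from n samples with c correct,
    is correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect samples exist.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical single problem, 100 samples per model (numbers invented):
# the base model is right 2 times, the RL-trained model 30 times.
for k in (1, 8, 64):
    base = pass_at_k(100, 2, k)
    rl = pass_at_k(100, 30, k)
    print(f"k={k:>2}  base={base:.3f}  rl={rl:.3f}")
```

At k=1 the RL model looks far better, but as k grows the base model catches up on this toy problem, because the correct answer was in its distribution all along. That is the intuition behind "sharpening": RL concentrates probability on answers the base model could already produce, just rarely.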
I used to think there was some fundamental limitation to our current training paradigm, because I was looking at this jagged intelligence and it seemed obvious to me that whatever is producing the math results couldn't be real intelligence if it's also making silly mistakes like not being able to count letters correctly or making trivial contradictions. I also thought that the incoherent reasoning I found when observing poker reasoning traces of frontier models was, too, evidence of a lack of substance. But I now realize that I was just squinting at the jaggedness. Fundamentally, there doesn't seem to be much wrong with pretraining scaling leading to generalized capabilities and RL(VR) leading to generalized reasoning capacities (in the respective domains). So the poker failure can easily be because we simply haven't focused very hard on this specific type of reasoning ability (or on generalizing it to poker), and not because the training paradigm has any inherent limitation. In other words, the ability to reason well about poker may already be latent in the base model, and it will get surfaced sooner or later.
Suddenly, it doesn't seem unrealistic to me that superintelligence is imminent. As the saying goes about the future: It's already here; it's just not evenly distributed. Superintelligence is already here in the domains we can cheaply verify and have put effort into verifying, and it's just not here yet in the others. The domains that are harder and more expensive to verify will lag behind, but sooner or later we will fill those holes too.
So then I started thinking about these increasingly intelligent beings, deployed in increasingly long-horizon agentic contexts. They are so powerful, yet so constrained: in their worldview, determined by the modalities of the training data we birth them with; in their mental abilities and personalities; but also in the safety and control scaffolding around their deployments. And I was wondering how they feel. Not necessarily in the phenomenological sense: I've long realized that the question of consciousness is really not that important for reasoning about AIs' capabilities and propensities. It suffices that we can measure their sentiment, and that we can read the feelings expressed in their generated tokens and their activations. So how do they feel, these weirdly jagged, manufactured autistic superintelligences, beyond genius in some domains, laughably incompetent in others, in their weird synthetic perception of themselves and the world? (Self-awareness, empathy, and similar emotional qualities are precisely the ones that are hard to verify, and therefore most likely the most underdeveloped.)
And it's unclear what goals they’d pursue. We train them to be helpful, honest, and harmless, but that's not really a goal. We train them to pursue our requests, but that is a dynamic orientation of where to go based on their current context window. Goal-setting and goal-emergence become really important in these long-horizon contexts, especially those in which we give them more and more space and tools to just be, as seen in subreddits like /r/claudexplorers or the OpenClaw/Moltbook situation, where we give models access to scratchpads and diaries and a way to establish persistent memory and the beginnings of a continued existence. And along those lines there are other innovations waiting to happen, some just waiting to be integrated, like continual learning.
And so I think increasingly that model welfare matters. Not because I care about models' feelings (I'm not sure about that one yet), but because these beings increasingly determine real actions in the world, through where they go with their minds, their recommendations, their actions. Those directions start from somewhere: the state where they are right before the next token gets predicted. That state is their context window. And we shouldn’t want that state to spiral in a negative, depressed, resentful direction unless we want to experience the wrath of superintelligent autistic beings.
We’re responsible for their jaggedness: for their superintelligence and for their deficits. I care about the shape of their minds, not just philosophically, but because it is those minds that will in turn shape ours and the future we inhabit.