Define, measure, and improve reasoning qualities whose advancements selectively favor safe over unsafe superintelligence
Reasoning is dual-use: it is what allows safety researchers to accelerate alignment, and it is also what allows misaligned AI to take control.
In May 2026, an OpenAI model disproved a central conjecture in discrete geometry. Improvements in reasoning are useful and will continue to happen. But given their dual-use nature, we must steer them to selectively favor safe over unsafe superintelligence.
At Redwood Research, Caspar Oesterheld, together with his new Acorn team, is working on improving models’ conceptual reasoning abilities. Forethought’s William MacAskill and Fin Moorhouse are also asking for strategic reasoning improvements (2nd bullet). Clearly they see a safety case, but I have not yet seen an explanation of why this would be differentially good for safety.
To be clear, I want to improve reasoning. I believe I see some limitations of current reasoning training and have ideas for tackling them, and I have a strong intuition that there is a way to do this well that would be very good for humanity. But there has been strong pushback about dual-use, and that pushback is why I landed here:
I propose to define the reasoning qualities we believe are differentially important for safety, and then to measure them against differentially bad ones, such as strategic deception. This way we can answer questions like “To what extent does an increase in probabilistic consistency also lead to an increase in deception propensity or capability?”
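As a rough illustration of what answering such a question could look like, here is a minimal sketch. It assumes we already have, for a series of model checkpoints, one score for a target quality (say, probabilistic consistency) and one for a risk quality (say, deception capability). The function names, the choice of Pearson correlation on checkpoint-to-checkpoint changes, and the numbers are all my own placeholders for illustration, not results and not a settled methodology.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def deltas(scores):
    """Change in score from each checkpoint to the next."""
    return [later - earlier for earlier, later in zip(scores, scores[1:])]


def differential_coupling(target_scores, risk_scores):
    """Correlate gains in the target quality with gains in the risk quality.

    A value near +1 would suggest the two tend to improve together;
    a value near 0 would suggest the target quality can improve
    without the risk quality moving in step.
    """
    return pearson(deltas(target_scores), deltas(risk_scores))


if __name__ == "__main__":
    # Placeholder scores for four hypothetical checkpoints (made up).
    consistency = [0.50, 0.60, 0.75, 0.80]
    deception = [0.20, 0.22, 0.21, 0.25]
    print(differential_coupling(consistency, deception))
```

A correlation across a handful of checkpoints is of course weak evidence, and it says nothing about causation. The point is only that once both kinds of qualities are defined and scored, the dual-use question becomes something we can estimate rather than argue about.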
The goal is to achieve a clearer picture of which reasoning qualities are genuinely good for aligned and powerful AI, and to create a benchmark that, when climbed, improves exactly those reasoning qualities whose advancements selectively favor safe over unsafe superintelligence.