Game-theoretic reasoning scales in general and with test-time compute

16 Jun, 2026

This is a quick jot-down of experiment results. I didn't put effort into writing it up nicely for the reader, sorry. I just wanted to log my own thoughts.

I tested LLMs' performance on a poker river spot, measuring:

EV loss against Nash
EV loss vs best response (adaptive counterplay)
Best response vs fixed strategy
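
To make the three metrics concrete, here is a minimal sketch on a hypothetical half-street bluffing game. This is not the river spot or the toy games from the experiments: the game itself, the pot and bet sizes, the 50% value fraction, and all function names are assumptions chosen for illustration. The bettor holds the nuts or air and always bets the nuts, the caller holds a bluffcatcher, and the closed-form Nash solution assumes the resulting bluff frequency is at most 1.

```python
# Hypothetical toy game: pot of 1, bet of 1, bettor's range is 50% nuts / 50% air.
POT, BET, VALUE_FRAC = 1.0, 1.0, 0.5

def bettor_ev(bluff, call):
    """Bettor's EV when air bluffs with prob `bluff` and the caller calls with prob `call`."""
    value_ev = POT + call * BET                   # nuts always bet
    bluff_ev = (1 - call) * POT - call * BET      # air that bets; checking air wins 0
    return VALUE_FRAC * value_ev + (1 - VALUE_FRAC) * bluff * bluff_ev

def nash():
    """Indifference solution (assumes the bluff frequency comes out <= 1)."""
    bluff = VALUE_FRAC * BET / ((1 - VALUE_FRAC) * (POT + BET))
    call = POT / (POT + BET)
    return bluff, call

def ev_loss_vs_nash(bluff):
    """Metric 1: EV given up versus the Nash caller."""
    nash_bluff, nash_call = nash()
    return bettor_ev(nash_bluff, nash_call) - bettor_ev(bluff, nash_call)

def ev_loss_vs_best_response(bluff):
    """Metric 2: EV given up versus a caller who best-responds (always call or always fold)."""
    nash_bluff, nash_call = nash()
    worst_case = min(bettor_ev(bluff, 0.0), bettor_ev(bluff, 1.0))
    return bettor_ev(nash_bluff, nash_call) - worst_case

def best_response_bluff(call):
    """Metric 3: the bettor's best bluff frequency versus a fixed calling frequency."""
    return 1.0 if (1 - call) * POT - call * BET > 0 else 0.0

if __name__ == "__main__":
    for b in (0.0, 0.5, 1.0):
        print(b, ev_loss_vs_nash(b), ev_loss_vs_best_response(b))
    print(best_response_bluff(0.3), best_response_bluff(0.7))
```

In this particular game, always bluffing loses nothing against the Nash caller but gives up a quarter of the pot against a best-responding caller, so the first metric alone would not separate it from the Nash mix.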

I wanted to find out what their current game-theoretic reasoning capabilities are. Namely, I wanted to see whether they could reason from first principles to do well on all three.

Current poker-playing evaluations are bad. They lack an understanding of what actually matters for game-theoretic reasoning in poker. For example, one benchmark measures whether the model's chosen action is part of a GTO strategy, which tells you very little, because the mix is what matters. Others measure how close LLMs get to Nash, which really doesn't tell you much when the model is put in a random poker spot: the spot is way too complex, and Nash isn't the obvious goal. Others measure how well LLMs do against a Nash-approximating agent, which also doesn't tell you much, because it's not that difficult to do okay against a Nash-approximating agent. I could go further into why none of these are great, but I want to keep it short right now. I was initially concerned that we have these sloppy poker reasoning benchmarks, but I no longer think it matters, because there's nothing unique to poker that scaling won't swallow.

I set up toy games and tested Anthropic's models: Haiku 4.5, Sonnet 4.6, Opus 4.8.

Without reasoning, they all performed poorly. With reasoning, Sonnet and Opus did well, Haiku did poorly.

The stronger models were able to reason to Nash from first principles (or get close to it), even as I made the toy games more complicated. Their performance scaled with test-time compute. They nailed all three of the metrics I mentioned above.

The reason their performance and reasoning traces in poker overall are not super good yet is just a lack of domain expertise plus confusing training data. But from my observations, nothing stands in the way of scaling coming for all of this. (Well, maybe opponent modeling.)