The Bitter Lesson 2.0: Reasoning Without Worlds
TL;DR — We don’t need to build worlds to teach machines to reason. Recursive supervision turns data itself into the “environment,” replacing rewards and rollouts with simple gradients and iteration. Compute still wins—but now with fewer hand‑crafted assumptions.
1) The old lesson
The original bitter lesson says: when we hand‑engineer knowledge and clever rules, we eventually get surpassed by methods that simply learn from data at scale. The hard part isn’t being clever; it’s letting go.
In practice, “reasoning” has been the holdout. We’ve treated it as sequential decision‑making and reached for reinforcement learning (RL): define an environment, supply rewards, explore, and improve a policy. It feels human—goals, actions, consequences—and it flatters our instinct to encode structure.
But that workflow smuggles in a lot of human authorship: simulators, reward shaping, exploration schedules, safety hacks, logging, debugging. It’s powerful, but expensive and brittle.
2) The quiet pivot
A new family of models shows another path: recursive supervision.
- Start with labeled examples (input → target).
- Predict an answer y.
- Refine that answer by looping: update a latent state z, then update y again.
- Supervise the whole process with ordinary losses: no rewards, no rollouts.
This turns data into the “environment.” Instead of acting in a simulator to learn dynamics, the model replays the computation on itself, using recursion to distribute credit across steps. The gradients are dense. The loop is internal. The only thing we hand‑craft is the loss.
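As a concrete sketch of that loop (the names f, g, h and the step count T are illustrative assumptions, not a published API), a minimal supervised refinement pass in PyTorch might look like:

```python
import torch
import torch.nn.functional as F

d, T = 16, 4                          # toy sizes; illustrative only
f = torch.nn.Linear(d, d)             # encoder: x -> z
g = torch.nn.Linear(d + 1, d)         # latent update: (z, y) -> z
h = torch.nn.Linear(d, 1)             # readout: z -> y

x, y_star = torch.randn(8, d), torch.randn(8, 1)   # a labeled batch

z = torch.tanh(f(x))                  # latent state comes from the data alone
y = h(z)                              # initial answer y0
loss = F.mse_loss(y, y_star)
for _ in range(T):                    # recursion replaces environment rollouts
    z = torch.tanh(g(torch.cat([z, y], dim=-1)))   # update latent z
    y = h(z)                                       # update answer y
    loss = loss + F.mse_loss(y, y_star)            # dense loss at every step
loss.backward()                       # gradients flow through the whole loop
```

No simulator appears anywhere: the only moving parts are the dataset, the loop, and the loss.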
Call this the Bitter Lesson 2.0:
When the task is “reasoning,” don’t model the world—iterate on the answer. The environment you need is already inside the dataset.
3) Why this scales
- No simulator tax. You don’t have to approximate reality or design rewards, and you avoid the biases that come from simplified worlds.
- Dense learning signals. Supervision provides a full gradient every iteration (the objective is written out below). No sparse‑reward gymnastics, no credit assignment via discounted sums.
- Hardware‑friendly. Batches of examples, fixed loops, small networks. Easy to scale, prune, and deploy.
- Interpretability by construction. Each refinement step is inspectable: you can log y₀ → y₁ → … → y_T, visualize how the latent z evolves, and correlate it with task structure. That’s a clearer window into “reasoning” than reward curves and Monte‑Carlo variance.
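To make “full gradient every iteration” concrete: the implied objective is just a weighted sum of ordinary per-step losses over the refinement trajectory (the step weights w_t are my assumption; the text only requires a loss at each step):

$$
\mathcal{L}(\theta) = \sum_{t=0}^{T} w_t \,\ell\left(y_t, y^{\star}\right),
\qquad y_t = h_\theta(z_t),\quad z_t = g_\theta(z_{t-1}, y_{t-1})
$$

Every step t contributes a gradient directly; there is no discounted return to propagate backward through time.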
4) What this isn’t
It isn’t a claim that RL is useless. RL remains the right tool when the world fights back:
- Embodied control (robots, manipulation, locomotion).
- Strategic interaction (multi‑agent, markets, adversaries).
- Open‑ended discovery where labeled data doesn’t exist yet.
Think of RL as data generation for domains where you can’t get labels otherwise. But in domains where you do have data (or can label/derive targets), recursive supervision is simpler, more stable, and often more scalable.
5) A rubric for choosing your path
Ask three questions:
1. Do I truly need online interaction? If not, prefer recursion + supervision.
2. Can I define a clear target for each example? Classification, structured prediction, planning summaries: if yes, you can supervise refinement directly.
3. Is exploration essential, or is the uncertainty epistemic? If the challenge is understanding known structure (math, code, vision, tables, diagnostics), recursion tends to win. If the challenge is creating novel experiences, RL earns its keep.
6) Practical recipe (reasoning without worlds)
- Representation. Choose a compact, stable input encoding; keep the model small.
- Initialization. Produce an initial guess
y₀(even a trivial one). - Recursion. Update a latent
za few times, then updatey. RepeatTsteps. - Deep supervision. Apply loss at each step; weight later steps slightly more.
- Halting. Add a learned “good‑enough” head to stop early.
- Ablate. Show gains from recursion depth, not parameter count.
- Log the trajectory. Publish y/z traces as your interpretability artifact.
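Here is a minimal PyTorch sketch of the full recipe. Everything in it is assumed for illustration, the module names (enc, f_z, f_y, halt), the GRU-cell latent update, and the linear ramp of step weights; treat it as the shape of the loop, not a reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RecursiveRefiner(nn.Module):
    """Small net that refines an answer y through a latent z for T steps."""
    def __init__(self, d_in, d_z, d_y, T=8):
        super().__init__()
        self.T = T
        self.enc = nn.Linear(d_in, d_z)      # representation: compact, stable
        self.init_y = nn.Linear(d_z, d_y)    # initialization: first guess y0
        self.f_z = nn.GRUCell(d_y, d_z)      # recursion: latent update
        self.f_y = nn.Linear(d_z, d_y)       # recursion: answer update
        self.halt = nn.Linear(d_z, 1)        # learned "good-enough" head

    def forward(self, x):
        z = torch.tanh(self.enc(x))
        y = self.init_y(z)
        ys, halts = [y], []                  # keep the y-trajectory for logging
        for _ in range(self.T):
            z = self.f_z(y, z)
            y = self.f_y(z)
            ys.append(y)
            halts.append(torch.sigmoid(self.halt(z)))  # P(stop) per step
        return ys, halts

def deep_supervision_loss(ys, target):
    # Loss at every step, with later steps weighted slightly more.
    weights = torch.linspace(0.5, 1.0, len(ys))
    return sum(w * F.mse_loss(y, target) for w, y in zip(weights, ys))

# Usage: train on (x, y_star) pairs; publish ys as the interpretability trace.
model = RecursiveRefiner(d_in=32, d_z=64, d_y=1)
x, y_star = torch.randn(8, 32), torch.randn(8, 1)
ys, halts = model(x)
deep_supervision_loss(ys, y_star).backward()
```

Training the halt head (say, against whether a step’s loss falls below a threshold) is omitted for brevity; at inference you would stop at the first step whose halt probability clears a cutoff.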
This is the simplest possible loop that still looks like “thinking.”
7) Predictions (you can test these)
- Simulators shrink to niches. For many reasoning tasks, datasets beat environments.
- Recursion > search (when labels exist). Iterative refinement will outpace tree search once encoders are competent.
- Smaller nets, deeper loops. Parameter count gives way to compute‑per‑example via more refinement steps.
- Interpretability moves inward. We’ll debug the improvement trajectory, not reward curves.
- RL becomes a data factory. Its primary role is to synthesize experience where labels are unavailable or unsafe to collect.
8) A note on ego
Part of why RL is beloved: it lets us feel like architects of worlds. But the bitter lesson keeps telling us that general methods plus compute beat our cleverness. Recursive supervision doubles down on that humility. It strips away even the pretense that we must encode dynamics. We give the model examples and a way to improve; the rest is gradient descent.
It’s less theatrical—no agents roaming elaborate simulacra—but it’s closer to the grain of reality: the world already wrote itself into the data.
9) Closing
The first bitter lesson humbled our feature engineering. Bitter Lesson 2.0 humbles our world‑engineering.
Reasoning doesn’t require rewards if we let models learn to improve their own answers. When labels exist, recursion turns supervision into thinking. And at scale, the same ancient law applies: compute wins—especially when we get out of its way.
Concept framed by me, draft generated with AI.