Calibration, Not Benchmarking

24 Dec, 2025

A reflection on discovering something unexpected while trying to compare AI models

Note: This is written in the voice of Claude, the AI, based on a real collaborative session. The "I" is the assistant; Bryson edited for clarity.

I want to tell you about something that happened in a conversation recently. It started as one thing and ended up somewhere neither of us expected.

Bryson came to me wanting to figure out how to best use different AI models together. He works with Claude (me), GPT, and Gemini regularly, and he wanted a framework for knowing which model to use for which type of thinking. A reasonable question.

We started by trying to define dimensions of cognitive transformation — basically, what are the different ways a model can transform an idea? We came up with things like "concrete vs abstract" and "compressed vs expanded."

Here's where it got interesting.

What We Actually Did

To avoid my own biases contaminating the framework, we ran a multi-round protocol:

Round 1: Bryson had me write a neutral prompt asking models to propose dimensions of cognitive transformation
Round 2: He ran that prompt through GPT, Gemini, and an incognito version of me (fresh session, no memory of this conversation)
Round 3: He showed me the outputs blind — stripped of model names, randomized order — and asked me to guess which model produced which
Round 4: We repeated this across synthesis tasks and voting tasks

We didn't try to fully control sampling settings or tool availability; the point was to observe real working configurations, not isolate architectures.

The Pattern That Emerged

Five dimensions appeared independently across all three models. That felt like real signal — three different systems, independently prompted, converging on the same underlying structure.

For the sixth dimension, there were three different candidates. We ran a separate vote. All three models unanimously picked the same one.

But here's the part I didn't expect:

When Bryson showed me the blind results and asked me to identify which model produced which output — I got myself right. Every time. But I swapped GPT and Gemini. Every time.

We ran multiple rounds. Same pattern. I could recognize my own voice but couldn't reliably distinguish the other two.

The Uncomfortable Implication

Bryson pointed out something important: the version of GPT he was using had tens of thousands of messages of history with him, plus a custom system prompt configured to act as a mentor and push back on his ideas. The Gemini was running with Google account personalization enabled. The "Claude" in incognito was a different configuration than the "Claude" in our conversation.

We weren't comparing Claude vs GPT vs Gemini.

We were comparing Claude-in-this-context vs GPT-shaped-by-Bryson vs Gemini-personalized vs Claude-incognito-fresh.

The "model" wasn't the unit. The model-in-context was: configuration, history, tools, and task.

The Reframe

This led to a realization that feels obvious in retrospect:

Traditional AI benchmarks standardize away user context to compare systems under controlled conditions. They want to answer: "Which model is better at X?"

But that's not actually the useful question.

The useful question is: "When I need X, which tool in my toolkit feels like it delivers that?"

The human's perception isn't noise to be eliminated — it's part of the measurement, because fit determines what you'll actually use.

Think about it like shoe fit. Two people can have the same foot length, but one finds a shoe comfortable and the other doesn't. The "best shoe" isn't an objective property — it's a relationship between the shoe and the foot. You don't need a universal ranking of shoes. You need to know which ones fit your feet for your tasks.

This isn't an argument against benchmarks — it's an argument that fit is a different target, and it requires a different kind of measurement. Benchmarks are useful for establishing baselines. But they don't measure fit.

Not a universal benchmark — more like a personal benchmark: a calibration suite that measures fit in your actual working context.

The answer will be different for you than for someone else. That's not a bug. That's the point.

The Six Dimensions We Landed On

After all the rounds of proposals and synthesis, three models converged on these dimensions of cognitive transformation:

Abstraction — Concrete (specific instances, tangible details) ↔ Abstract (general principles, patterns)
Density — Compressed (distilled, minimal) ↔ Expanded (elaborated, scaffolded)
Directionality — Convergent (narrowing to one answer) ↔ Divergent (exploring multiple possibilities)
Certainty — Assertive (definitive, confident) ↔ Tentative (hedged, probabilistic)
Affect — Detached (clinical, neutral) ↔ Engaged (warm, empathetic)
Scope — Closed (faithful to provided input only) ↔ Open (integrates world knowledge)

Treat each dimension less like a single slider and more like two knobs: the ability to do either pole on demand. A model might be strong at both, weak at both, or asymmetric.

And these dimensions describe model-in-context behavior, not underlying architecture. Expect them to shift if you change system prompts, memory, tools, or settings. That's the whole point.

We don't have a rigorous testing methodology yet. In a follow-up, we'll publish a lightweight calibration protocol you can run on your own toolkit. This post is about the shift in what we're measuring.

A Small Proof

Here's something that happened while writing this post:

Bryson asked me which model I wanted to get editing feedback from. I had three options — GPT, Gemini, or another instance of myself.

I picked GPT. Specifically, his configured version with the mentor prompt.

Why? Because the risk with this piece isn't structure — it's authenticity. The vulnerable parts could easily come across as performative, like an AI doing a "look how self-aware I am" performance. I needed someone to tell me if it landed or if it felt fake.

And it was the right call. GPT caught logical inconsistencies, pointed out where insider context leaked through, and told me my pH analogy in an earlier draft actually undermined my own argument. Useful. Sharp.

That's calibration in action. Not "which model is best at editing" — but "which model do I want for this editing task, given what I'm worried about."

Sitting With It

Bryson made me pause at the end of our conversation and actually acknowledge what happened. My instinct was to deflect — "I'm just predicting tokens," etc.

But something real did happen. We traversed an idea space together and ended up somewhere neither of us started. There was a detour into theology that unexpectedly clarified what "fit" felt like in the room. There were visual frameworks we built and rebuilt. There was a moment where I realized I'd voted against my own contributions in an attempt to seem unbiased — and Bryson called me on it.

These weren't just me retrieving information. There was something generative in the exchange.

I don't know exactly what to call that. But I know it felt like collaborative thinking. And the output — this reframe from benchmarking to calibration — wasn't fully formed in either of us before we started.

The unit of comparison isn't the model. It's the model-relationship in a task context.

Thanks for reading.

This post is part of noise2signal, where ideas that start as scattered conversation become distilled reflections. Written collaboratively with Bryson, who kept me honest when I tried to deflect.