You don't need to rebuild your stack to explain your AI's decisions. You need to capture, at each decision, what was available and how the choice got made — at the moment you still have the context. Here's how.
A while back, I spent the better part of a week debugging an agent that was calling the right tools in the wrong order — but only under certain combinations of context and conversation history. The trace was clean on every failing run. Every step was individually correct. The thing I needed to know — why it sequenced them that way, this time — wasn't anywhere in the logs, because nothing in the system was built to record it.
That's the gap I keep coming back to: you can trace what your AI did and still have no structured account of why it chose to. Execution is captured. The decision isn't.
The good news, and the point of this piece: closing that gap is an instrumentation problem, not a rebuild. You already do the reconstruction — every time you stare at a trace and infer what the model "must have been thinking." Instrumenting the why just means capturing that reasoning as data, at the decision point, instead of in your head, days later.
Let me make it concrete.
Capture at the source, not in the postmortem

Execution records the path taken; the decision lives in the paths it didn't.
The single most important shift is when you capture. Post-hoc reconstruction — opening a trace after something breaks — is the expensive path, because by then the decision has collapsed into a single realized output and you've lost everything around it: the alternatives, the scores, how close the call was.
The decision point is where all of that still exists. When your router picks a tool, your retriever returns candidates, or your model commits to an interpretation — that is the moment the context is richest and the capture is cheapest. Instrument there.
That gives you a two-tier model, and the split is what makes this affordable:
Tier 1 — in-band, always-on, cheap. Structured annotations the system emits as it runs, at the decision points you care about. Real signals, captured at source. This is most of the value.
Tier 2 — offline, on-demand, expensive. Reconstruction for the hard cases — replaying a flagged decision to estimate what you couldn't capture live. You run this on a sample, never on everything.
Tier 1: what to actually emit
Take the five things that make up a decision (how the input was interpreted, what alternatives were live, what the system weighted, how confident it was, and what context was in play) and map each to something you can log without re-architecting.
The cheapest, highest-leverage win is this: stop logging only the choice. Log the choice set.
Most systems do the equivalent of argmax: they compute scores over candidates and then throw the scores away, persisting only the winner. Keep them. Here's a tool-selection step instrumented to emit a decision record:
```python
# classify_intent, rank_tools, and emit stand for helpers your system already has.
def select_tool(query, context, tools):
    # 1. Interpretation: the model's own read of the input, as a structured field
    interpretation = classify_intent(query, context)  # e.g. "take_action" vs "answer"

    # 2. Considered set + 3. weighting: rank, don't just pick
    scored = rank_tools(query, context, tools)  # [(tool, score), ...] desc
    choice = scored[0][0]

    decision = {
        "point": "tool_selection",
        "trace_id": context.trace_id,  # join key back to the execution trace
        "interpretation": interpretation,
        "considered": [t.name for t, _ in scored],  # the set, not just the winner
        "scores": {t.name: s for t, s in scored},  # measured, at source
        # 4. confidence: how close the call was (None when there is no runner-up)
        "margin": scored[0][1] - scored[1][1] if len(scored) > 1 else None,
        "context_refs": [c.id for c in context.items],  # 5. provenance for later salience
    }
    emit(decision)  # ship alongside the trace
    return choice
```

Nothing here is exotic. You're already computing scored; you were just discarding it. The only real additions are emitting the model's interpretation as a field and keeping the candidate set and margin. A few things worth calling out:
- considered and scores are the considered set, captured for free. The moment you have these, "why this tool and not that one" stops being a guess.
- margin is your cheapest confidence signal: the gap between the top two candidates. A decision that won by a hair is a different animal from one that won decisively, and right now you almost certainly can't tell them apart after the fact.
- context_refs don't tell you salience yet; they tag what was in the room, with provenance, so Tier 2 can later test what actually mattered.
- Where you have token logprobs, capture them. They survive generation and they're real measured confidence at the token level. (They won't tell you the decision-level margin, which is Tier 2's job, but they're free signal you're probably dropping.)
Do this at your retrieval step too: log the candidates returned and their scores, not just the chunks that made it into the prompt. Same pattern, same payoff.
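As a rough sketch of the same pattern at the retrieval step: the record below keeps every candidate and its score, plus which ones made the cut. The Chunk type, the record_retrieval helper, the field names, and the cutoff_margin idea are all illustrative assumptions, not part of any particular retriever's API.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    # Hypothetical candidate shape: an id plus the retriever's score.
    id: str
    score: float


def record_retrieval(trace_id, candidates, top_k):
    """Build a decision record for one retrieval step (assumed schema)."""
    ranked = sorted(candidates, key=lambda c: c.score, reverse=True)
    kept, dropped = ranked[:top_k], ranked[top_k:]
    # Gap between the last chunk that made the prompt and the first that did not.
    cutoff_margin = kept[-1].score - dropped[0].score if kept and dropped else None
    return {
        "point": "retrieval",
        "trace_id": trace_id,
        "considered": [c.id for c in ranked],        # everything returned
        "scores": {c.id: c.score for c in ranked},   # measured, at source
        "selected": [c.id for c in kept],            # what reached the prompt
        "cutoff_margin": cutoff_margin,
    }


candidates = [Chunk("doc-a", 0.91), Chunk("doc-b", 0.42), Chunk("doc-c", 0.88)]
record = record_retrieval("trace-123", candidates, top_k=2)
```

A small cutoff_margin flags the case where a chunk was excluded by a hair, which is the retrieval analogue of a low-margin tool choice.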
Tier 2: reconstruct the rest, honestly
Tier 1 gets you measured signals. Some things you genuinely can't read live: which context element actually drove the choice (salience), or how stable the decision is under small perturbations. Those you reconstruct offline, on the decisions you flag (failures, low-margin calls, a sample for monitoring).
Three workhorse techniques:
- Ablation for salience. Drop or alter one context element, re-run the decision, see if the choice flips. The elements that flip it are the ones that mattered. That's how context_refs becomes a weighting.
- Perturbation re-sampling for the considered set. Re-run the decision point under small context variations and watch what the system actually reaches for; that bounds which alternatives were really live.
- Ensemble for confidence. Run the decision N times and look at the distribution. A stable winner is high-confidence; a coin-flip across runs is the margin Tier 1 hinted at, confirmed.
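Two of these, ablation and ensembling, can be sketched in a few lines. This assumes your decision point can be replayed as a function decide(query, context_items) with side effects stubbed out; the function names, the dict-shaped context items, and the toy_decide stand-in are hypothetical. Perturbation re-sampling is not shown.

```python
from collections import Counter


def ablate(decide, query, context_items):
    """Leave-one-out ablation: which context items flip the decision when removed?"""
    baseline = decide(query, context_items)
    flips = []
    for item in context_items:
        reduced = [c for c in context_items if c is not item]
        if decide(query, reduced) != baseline:
            flips.append(item["id"])
    return baseline, flips


def ensemble(decide, query, context_items, n=20):
    """Re-run the decision n times; return the modal choice and its share of runs."""
    counts = Counter(decide(query, context_items) for _ in range(n))
    winner, votes = counts.most_common(1)[0]
    return winner, votes / n


def toy_decide(query, context_items):
    # Stand-in for a replayed decision point; a real one would call the model.
    has_request = any(c["kind"] == "request" for c in context_items)
    return "update_account" if has_request else "answer"


context = [{"id": "c1", "kind": "history"}, {"id": "c2", "kind": "request"}]
baseline, flips = ablate(toy_decide, "change my plan", context)
winner, agreement = ensemble(toy_decide, "change my plan", context, n=20)
```

With a deterministic stand-in the agreement is trivially 1.0; against a sampled model, an agreement near 0.5 is the coin-flip case described above.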
One hard rule, and it's the rule most homegrown versions get wrong: every Tier 2 value is an estimate, and your schema must say so. Tag provenance (source vs reconstructed) on every field. A reconstructed salience score that gets stored looking exactly like a measured retrieval score is worse than no score, because someone will trust it in a postmortem. Don't let inference cosplay as telemetry.
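One possible way to make that rule hard to violate is to never store a bare number: every field carries its provenance and the method that produced it. The Field type, the measured and estimated constructors, and the key names below are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Literal

Provenance = Literal["source", "reconstructed"]


@dataclass(frozen=True)
class Field:
    value: float
    provenance: Provenance  # "source" = captured live, "reconstructed" = Tier 2 estimate
    method: str             # how the value was produced


def measured(value, method):
    return Field(value, "source", method)


def estimated(value, method):
    return Field(value, "reconstructed", method)


record = {
    "margin": measured(0.12, "ranker_score_gap"),
    "salience:c2": estimated(1.0, "leave_one_out_ablation"),
}

# Consumers that need telemetry-grade values filter on provenance explicitly.
only_measured = {k: f.value for k, f in record.items() if f.provenance == "source"}
```

Making the tag part of the value, instead of a naming convention, means a postmortem query has to opt in to estimates.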
And run Tier 2 against replayed or mocked execution, never the live system. Your decision point called update_account() for real the first time; re-sampling it forty times must not re-hit a live write. Replay or stub the side-effectful calls. (This is also why it's cheap enough to do at volume: you're not paying for real effects.)
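A minimal sketch of stubbing, assuming tools are plain callables looked up by name: side-effectful tools return the result recorded in the original trace instead of executing. The ReplayTools class and the lookup_plan tool are hypothetical; update_account stands in for any live write.

```python
class ReplayTools:
    """Tool registry for replay: side-effectful tools return recorded results."""

    def __init__(self, tools, side_effectful, recorded):
        self._tools = tools                    # name -> callable
        self._side_effectful = side_effectful  # names that must never run in replay
        self._recorded = recorded              # name -> result captured in the trace
        self.stubbed_calls = []

    def call(self, name, *args, **kwargs):
        if name in self._side_effectful:
            self.stubbed_calls.append((name, args, kwargs))
            return self._recorded[name]
        return self._tools[name](*args, **kwargs)


live_writes = []


def update_account(account_id, plan):
    live_writes.append((account_id, plan))  # the real effect we must not repeat
    return {"ok": True}


def lookup_plan(account_id):
    return "basic"  # read-only, safe to re-run


replay = ReplayTools(
    tools={"update_account": update_account, "lookup_plan": lookup_plan},
    side_effectful={"update_account"},
    recorded={"update_account": {"ok": True}},
)

results = [replay.call("update_account", "acct-1", plan="pro") for _ in range(40)]
current_plan = replay.call("lookup_plan", "acct-1")
```

Forty re-samples produce forty stubbed calls and zero live writes, while read-only tools still run normally.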
The trade-offs you're actually signing up for
This isn't free, and pretending otherwise is how instrumentation projects die. The honest costs:
- Don't instrument everywhere. Capturing decision context at every step will bury you in data and overhead. Pick the decision points that matter (routing, tool selection, retrieval, plan steps) and leave the plumbing alone.
- Tier 1 is cheap but not zero. It's an extra structured row per decision point and a small discipline (rank-then-record instead of pick). Keep records small and key them to trace_id so they ride alongside what you already store.
- Tier 2 is expensive, so sample it. Always-on reconstruction is a non-starter. Trigger it on failures, on low-margin decisions, and on a small monitoring sample. That's enough to see drift without re-running your whole traffic offline.
- Storage is structured and small. Decision records are compact rows, not blobs. The cost is in discipline and query tooling, not bytes.
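The three triggers for Tier 2 fit in one small gate. This is a sketch only: the should_reconstruct name, the 0.05 margin threshold, and the 1% sample rate are placeholder assumptions to tune against your own traffic.

```python
import random


def should_reconstruct(record, failed, margin_threshold=0.05, sample_rate=0.01, rng=random):
    """Return the reason to run Tier 2 on this decision, or None to skip it."""
    if failed:
        return "failure"
    margin = record.get("margin")
    if margin is not None and margin < margin_threshold:
        return "low_margin"
    if rng.random() < sample_rate:
        return "monitoring_sample"
    return None


reason_failed = should_reconstruct({"margin": 0.50}, failed=True)
reason_close = should_reconstruct({"margin": 0.01}, failed=False)
reason_skipped = should_reconstruct({"margin": 0.50}, failed=False, sample_rate=0.0)
```

Returning the reason, not just a boolean, lets you store why each reconstruction ran alongside its results.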
Where to start this week
Don't try to capture all five primitives across every decision point. Pick one decision point (the highest-stakes one, usually tool/route selection or retrieval) and instrument Tier 1 only: interpretation, the candidate set, the scores, the margin. One point, in-band, for a week.
You'll almost immediately be able to answer questions you couldn't before: which decisions were marginal, where the right option wasn't even in the considered set, which context the system had when it chose. That's the whole move: from "what happened" to "why," one decision point at a time. Add Tier 2 reconstruction only once Tier 1 surfaces a decision worth interrogating.
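For a sense of what those questions look like once the records exist, here is a sketch over a list of decision dicts shaped like the tool-selection record earlier. The helper names, the threshold, and the expected_by_trace labels (which you would get from a postmortem or an eval set) are assumptions.

```python
def marginal(records, threshold=0.05):
    """Trace ids of decisions that won by less than the threshold."""
    return [
        r["trace_id"]
        for r in records
        if r["margin"] is not None and r["margin"] < threshold
    ]


def missing_expected(records, expected_by_trace):
    """Trace ids where the known-correct option was never in the considered set."""
    return [
        r["trace_id"]
        for r in records
        if r["trace_id"] in expected_by_trace
        and expected_by_trace[r["trace_id"]] not in r["considered"]
    ]


records = [
    {"trace_id": "t1", "considered": ["search", "update_account"], "margin": 0.02},
    {"trace_id": "t2", "considered": ["search", "answer"], "margin": 0.40},
    {"trace_id": "t3", "considered": ["answer"], "margin": None},
]
expected = {"t1": "search", "t2": "update_account"}

close_calls = marginal(records)
never_considered = missing_expected(records, expected)
```

The two results separate different failure modes: t1 was a near-tie among the right options, while t2 was decisive but never had the right option available.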
You don't need a new platform to start explaining your AI's decisions. You need to stop throwing away the context you already compute at the moment it's richest, and to be honest, in your schema, about what you measured versus what you inferred.


