Modern AI systems are increasingly easy to observe.
We can trace requests end-to-end, inspect prompts, and replay full agent runs with a level of visibility that didn't exist even a year ago. Systems that once felt opaque can now be broken down into discrete steps and inspected at each stage.
And still, when something breaks, teams end up asking the same question.
Why did the system make that decision?
We don't lack visibility. We lack a clear way to interpret it. Logs show what happened. Outputs show the result. But the decisions that shape behavior are not explicitly captured in either.
Where Decisions Actually Happen
In most modern AI systems, the most important decisions don't happen at the final output. They happen earlier, often in smaller, less visible moments.
A system decides whether to respond directly or take an action, which context to rely on, and how to interpret an ambiguous input.
These decisions don't appear as distinct artifacts in logs or traces. They are embedded in the process itself, shaping behavior without being clearly exposed.
As systems become more complex, behavior is less about a single output and more about a sequence of these choices. Each step influences the next, creating a path that appears deterministic once executed, but was not predetermined.
From the outside, we see the path that was taken.
What we don't see is how that path was selected.
A Trace That Hides Its Reasoning
To make this more concrete, consider a simple agent interaction.
A user submits a request. The system interprets it, decides how to respond, and may call external tools before producing a final output.
At a high level, the flow looks straightforward.
User input is passed to a model or agent. The model evaluates the request, may decide to call a tool, receives the result, and then produces a final response.
Each of these steps can be traced, logged, and replayed.
From an observability perspective, the entire flow is visible.
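To make that concrete, here is a minimal sketch of what such a trace captures. Everything in it is invented for illustration: the model is a rule-based stub standing in for a real LLM call, and the span format is a simplified stand-in for whatever tracing framework a real system would use.

```python
import time
import uuid

# Hypothetical stub standing in for a real model call. Note that the
# alternatives it considers, and how close they were, never leave plan().
class StubModel:
    def plan(self, text):
        if "weather" in text:
            return {"action": "tool_call", "tool": "get_weather", "args": {"city": "Paris"}}
        return {"action": "respond"}

    def respond(self, text, tool_result):
        return f"Result: {tool_result}" if tool_result else "Answered directly."

def traced_run(user_input, model, tools):
    """Record an execution trace: each span captures what happened,
    not why one action was chosen over the others."""
    trace = {"trace_id": str(uuid.uuid4()), "spans": []}

    def span(name, **attrs):
        trace["spans"].append({"name": name, "ts": time.time(), **attrs})

    span("input.received", text=user_input)
    decision = model.plan(user_input)
    span("model.plan", action=decision["action"])  # the choice, not the reasoning

    tool_result = None
    if decision["action"] == "tool_call":
        tool_result = tools[decision["tool"]](**decision["args"])
        span("tool.call", tool=decision["tool"], result=tool_result)

    output = model.respond(user_input, tool_result)
    span("output.produced", text=output)
    return output, trace

tools = {"get_weather": lambda city: f"18°C in {city}"}
output, trace = traced_run("What's the weather?", StubModel(), tools)
# The trace lists every step that executed, yet nothing in it says
# why the tool path was taken instead of a direct answer.
```

Every span here is replayable and inspectable, which is exactly the visibility the article describes. The gap is that the only record of the decision is its outcome.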

The execution path is fully visible. The decisions that shaped it are not.
What the trace captures is the execution path. The boxes and arrows along the main axis. What it doesn't capture sits between those steps: the moments where the system interpreted the input one way rather than another, favored one tool over the alternatives, or weighted some part of the context more heavily than the rest.
These decisions determine the path the system takes, but they are not represented as first-class elements in the trace. What appears as a clean sequence of steps is the result of a series of choices that are not explicitly captured.
Two systems can produce nearly identical traces while arriving there through entirely different internal reasoning.
Where Observability Falls Short
Last year, I spent weeks debugging an autonomous agent for a client with a heavily used chat system. The agent was calling the right tools, but in the wrong order, and only under certain combinations of context and conversation history. Most of the time, it worked. The trace was clean in every failing run. The eventual fix came from the prompt structure and the vector retrieval strategy, not the execution path. A comparable bug in a CRUD app would have been a stack trace and an afternoon.
Even in fully traceable systems, observability operates at the level of execution.
It captures what the system did. It does not provide a structured account of how decisions were formed.
In the flow above, we can see when a tool was called, what input was passed, and what result was returned. We don't see why that tool was chosen over another, how strongly the system favored that decision, or how close the alternatives were. The same is true of the final output: once produced, it collapses the underlying decision process into a single result, obscuring the available alternatives and the confidence with which the system chose among them.
Execution shows the path that was taken. It does not expose the paths that weren't.
Context presents a similar challenge.
Inputs include multiple overlapping signals: user intent, prior interactions, and system instructions. All visible in the trace, none structured in a way that reveals which were decisive, which were secondary, and which were effectively ignored.
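A toy sketch of context assembly makes this concrete. The function and section tags below are invented for illustration; the point is that the flattened string is all a trace can show.

```python
def assemble_context(system_prompt, history, retrieved_docs, user_message):
    """Hypothetical context assembly: every signal is present in the
    final string, but its relative influence on the model is recorded
    nowhere, only implied by position and phrasing."""
    parts = [
        f"[system]\n{system_prompt}",
        *(f"[history]\n{turn}" for turn in history),
        *(f"[context]\n{doc}" for doc in retrieved_docs),
        f"[user]\n{user_message}",
    ]
    return "\n\n".join(parts)

prompt = assemble_context(
    system_prompt="Answer concisely.",
    history=["User asked about pricing."],
    retrieved_docs=["Plan A costs $10/mo."],
    user_message="And plan B?",
)
# The trace shows this exact string. It cannot show that the model
# effectively ignored the history and leaned on the retrieved doc.
```

The trace faithfully records the input, but the weighting happens inside the model, after the signals have been flattened together.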
As a result, interpretation becomes necessary.
Teams are left to reconstruct decisions from signals that were not designed to encode them, often inferring intent, weighting, and confidence from incomplete information.
From Execution to Decisions
The limitations of observability point to a deeper mismatch in how AI systems are understood.
Most analysis is still framed in terms of execution. We trace steps, inspect inputs and outputs, and attempt to explain behavior as a sequence of operations.
That framing no longer holds.
Modern AI systems do not behave as predictable processes in practice. Even when the underlying computation is deterministic, behavior depends on context, prompt structure, and signals that shift between runs. Small changes in any of these can produce meaningfully different outcomes.
A decision, in this context, is the result of interpreting inputs, weighing competing signals, and selecting among plausible alternatives. The output is the last step, not the decision itself.
These decisions are not directly observable. They must be inferred from the relationship between context, intermediate state, and final output.
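As a thought experiment, a decision captured as a first-class artifact might look something like the sketch below. The structure and every field name are hypothetical, not an existing API; the point is what such a record would have to contain that a trace does not.

```python
from dataclasses import dataclass

@dataclass
class DecisionRecord:
    """Hypothetical structure for a decision as a first-class artifact:
    what was chosen, what was rejected, and how close the call was."""
    step: str
    chosen: str
    alternatives: dict   # option -> score, including the chosen one
    decisive_signals: list  # which context signals drove the choice

    @property
    def margin(self):
        # How far the runner-up trailed the winner: a proxy for
        # how confidently the system selected among alternatives.
        ranked = sorted(self.alternatives.values(), reverse=True)
        return ranked[0] - ranked[1] if len(ranked) > 1 else ranked[0]

rec = DecisionRecord(
    step="tool_selection",
    chosen="search_docs",
    alternatives={"search_docs": 0.61, "answer_directly": 0.55, "ask_clarifying": 0.12},
    decisive_signals=["user_intent", "retrieved_context"],
)
# A margin of 0.06 says this was a near-tie: exactly the kind of
# fragile decision that a clean execution trace renders invisible.
```

A record like this would let two runs with identical traces be distinguished by how they were decided, which is precisely the ambiguity described next.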
This is where ambiguity enters.
The same trace can reflect different underlying reasoning, and different reasoning can produce the same trace.
As systems become more autonomous, this distinction becomes harder to ignore.
Failures are less often the result of incorrect execution, and more often the result of misaligned or unstable decisions within otherwise valid execution paths.
Tracing what happened is no longer enough.
Until we can describe decisions, we're still interpreting behavior from the outside.
The Limits of Tracing
The gap between what can be observed and what can be explained is not a temporary limitation. It reflects a fundamental property of how modern AI systems operate.
Execution can be traced. Outputs can be inspected. The reasoning that connects them remains implicit, and as systems become more autonomous, that gap grows.
In practice, this shifts the burden to interpretation. Engineers and operators reconstruct decisions from signals that were never designed to encode them, often under conditions where outcomes carry real consequences.
The next step isn't better logging. It's recognizing that decisions, not execution, are what shape behavior.
Until then, we're still interpreting behavior from the outside.