A First-Pass AI Cost Audit: Finding the Hidden Tax in Your Stack

A while back, I got handed an AI bill that had roughly doubled month over month, along with the only question that ever comes with it: why. The provider dashboard could tell me that spend was up and which model ate most of it. It could not tell me what any of that money bought. I had the total. I had no idea which dollars were working.

I've written before about why that gap exists. The hidden AI tax is the gap between what you spend and the value your AI actually creates. That piece named the problem. This one is the audit: no new platform, no FinOps hire, just a way to find the leaks using data you already have.

Step zero: change the denominator

Your bill measures cost per token. The audit measures cost per outcome. Everything below depends on that one shift, and the shift requires exactly one join.

Most stacks already have both halves; they just aren't connected. Your provider gives you token usage per request. Your app gives you a trace per task. The join key is whatever request or trace ID you already thread through. Tie the spend to the unit of work it belongs to, and the undifferentiated number on the invoice becomes a table you can interrogate.

# You already have two things: usage events (tokens, model, $) and traces (a task).
# The whole audit hangs on one join key tying spend to a unit of work.

from collections import defaultdict

def cost_per_outcome(usage_events, traces):
    spend_by_trace = defaultdict(float)            # total spend per trace, built in one pass
    for e in usage_events:                         # every model call, keyed by the task it ran under
        spend_by_trace[e.trace_id] += e.cost

    rows = []
    for t in traces:                               # a task = the unit you judge value against
        rows.append({
            "trace_id":  t.id,
            "task_type": t.task_type,              # "support_reply", "code_review", ...
            "outcome":   t.outcome,                # succeeded / failed / escalated / unknown
            "steps":     t.step_count,             # model calls in the task
            "cost":      spend_by_trace.get(t.id, 0.0),   # 0.0 when no usage matched this trace
        })
    return rows

Once you can sort tasks by cost and group them by type and outcome, the audit is four passes over that table. Each pass targets one way the tax gets levied. None of them produces a definitive answer on its own. Each one narrows the search: it surfaces the workflows that deserve a closer look.
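
A minimal sketch of that sorting and grouping, assuming the rows are plain dicts in the shape cost_per_outcome returns. The group_by and summarize helpers and every figure in sample_rows are illustrative assumptions, not output from a real system.

```python
from collections import defaultdict

def group_by(rows, *keys):
    """Bucket rows by one key (plain value) or several keys (tuple of values)."""
    groups = defaultdict(list)
    for row in rows:
        values = tuple(row[k] for k in keys)
        groups[values[0] if len(keys) == 1 else values].append(row)
    return dict(groups)

def summarize(rows):
    """Task count and total cost per (task_type, outcome), most expensive first."""
    summary = [
        {"task_type": task_type, "outcome": outcome,
         "tasks": len(group), "total_cost": sum(r["cost"] for r in group)}
        for (task_type, outcome), group in group_by(rows, "task_type", "outcome").items()
    ]
    return sorted(summary, key=lambda s: s["total_cost"], reverse=True)

# Made-up rows, for illustration only.
sample_rows = [
    {"trace_id": "a1", "task_type": "support_reply", "outcome": "succeeded", "steps": 3,  "cost": 0.04},
    {"trace_id": "a2", "task_type": "support_reply", "outcome": "succeeded", "steps": 11, "cost": 0.19},
    {"trace_id": "a3", "task_type": "support_reply", "outcome": "failed",    "steps": 4,  "cost": 0.05},
    {"trace_id": "b1", "task_type": "code_review",   "outcome": "succeeded", "steps": 2,  "cost": 0.30},
]

for line in summarize(sample_rows):
    print(line)
```

Passing a single key to group_by returns a dict keyed by the plain value, which is the shape the per-pass loops iterate over.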

Pass 1: The loop tax

Group successful tasks by type and look at how much effort each one took to reach the same kind of result. When tasks that succeed at the same job show wide variance in steps, the expensive ones are paying the loop tax: identical outcome, many extra model calls. The answer still landed, so nothing flagged it, but you rented the model several extra times to get there.

# Within a task type, the cost tail is usually a step tail.
# Compare the cheap completions to the expensive ones doing the SAME job.
for task_type, group in group_by(rows, "task_type").items():
    if successful_tasks_show_wide_step_variance(group):   # same outcome, very different effort
        flag(task_type, "loop tax")

What it looks like in the wild: an agent re-retrieving and re-deciding its way to a conclusion it could have reached cleanly the first time. Identical result, several times the spend.
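
One way the step-variance check might be written, as a sketch rather than a definitive test. It compares a rough 90th-percentile step count against the median for successful runs of one task type. The ratio_threshold and min_tasks defaults are assumptions you would tune against your own traffic, and the sample step counts are invented.

```python
from statistics import median

def successful_tasks_show_wide_step_variance(group, ratio_threshold=3.0, min_tasks=5):
    """True when the slow tail of successful runs takes several times the median steps.

    ratio_threshold and min_tasks are illustrative defaults, not recommendations.
    """
    steps = sorted(r["steps"] for r in group if r["outcome"] == "succeeded")
    if len(steps) < min_tasks:
        return False                               # too few samples to call it a pattern
    typical = median(steps)
    p90 = steps[(9 * (len(steps) - 1)) // 10]      # rough 90th percentile by rank
    return typical > 0 and p90 / typical >= ratio_threshold

def as_rows(step_counts):
    """Build minimal successful rows from a list of step counts (test data only)."""
    return [{"outcome": "succeeded", "steps": s} for s in step_counts]

loopy = as_rows([2, 2, 3, 3, 3, 3, 4, 4, 12, 15])   # a long tail of extra model calls
tight = as_rows([2, 2, 3, 3, 3, 3, 4, 4, 4, 5])     # everyone takes about the same effort

print(successful_tasks_show_wide_step_variance(loopy))
print(successful_tasks_show_wide_step_variance(tight))
```

A percentile-to-median ratio is only one choice. It resists a single freak run better than max over min does, but a flagged task type still needs a human to read a few of the expensive traces.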

Pass 2: The tier mismatch

Now look at which model each task type runs on. You are hunting for cheap, high-volume, low-stakes tasks running on a frontier model because "use the good one" was the default and nobody ever drew the line. Each one is pure tier tax: you paid premium rates for work a small model would have nailed.

# Cheap, repetitive, low-stakes tasks on a frontier model are tier tax.
for task_type, group in group_by(rows, "task_type").items():
    if high_volume_low_stakes(task_type, group) and mostly_runs_on_frontier(group):
        flag(task_type, "tier mismatch")

The fix here is the cheapest of the four, because the work was never hard. It was just routed to the expensive room.
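
A sketch of how the two tier-mismatch checks could look. It assumes each row also carries a "model" field naming the model that did most of the work for that task, which is an addition to the rows built in step zero. The model names, the LOW_STAKES_TASKS set, and both thresholds are hypothetical. Which task types count as low stakes is a human call, not something the data can tell you.

```python
FRONTIER_MODELS = {"frontier-large"}                         # hypothetical model name
LOW_STAKES_TASKS = {"ticket_tagging", "summarize_thread"}    # decided by people, not derived

def high_volume_low_stakes(task_type, group, min_volume=1000):
    """Low stakes by declaration, high volume by count over the audit window."""
    return task_type in LOW_STAKES_TASKS and len(group) >= min_volume

def mostly_runs_on_frontier(group, share_threshold=0.5):
    """True when more than share_threshold of the tasks ran on a frontier model."""
    if not group:
        return False
    on_frontier = sum(1 for r in group if r["model"] in FRONTIER_MODELS)
    return on_frontier / len(group) > share_threshold

# Invented traffic: 900 tagging tasks on the frontier model, 300 on a small one.
tagging = [{"model": "frontier-large"}] * 900 + [{"model": "small-fast"}] * 300

if high_volume_low_stakes("ticket_tagging", tagging) and mostly_runs_on_frontier(tagging):
    print("ticket_tagging: tier mismatch")
```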

Pass 3: The retrieval drag

You pay for every token you stuff into the prompt, used or not. So compare what you retrieved into context against what the response actually leaned on. There's no single perfect measure of "used," so reach for one or more practical signals of what materially influenced the answer: what it cited, what it built on, what its output would lose if the context weren't there.

# You pay to retrieve context whether the model touches it or not.
# Use one or more practical signals for what actually influenced the answer.
for r in rows:
    if most_retrieved_context_went_unused(r):
        flag(r["trace_id"], "retrieval drag")

This one bills you twice: once to pull the context in, then again as the noise dilutes the next decision and nudges it toward the kind of marginal call Pass 1 and Pass 4 charge you for.
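
As one practical signal among several, here is a sketch that uses citations as the proxy for "used." It assumes each row has been extended with retrieved_chunks (each with an id and a token count) and cited_ids (the chunk ids the response referenced). Both field names and the 0.5 threshold are hypothetical. A chunk can shape an answer without being cited, so treat the result as an estimate.

```python
def unused_context_share(retrieved_chunks, cited_ids):
    """Estimated share of retrieved tokens sitting in chunks the response never cited."""
    total = sum(c["tokens"] for c in retrieved_chunks)
    if total == 0:
        return 0.0                                  # nothing retrieved, nothing wasted
    unused = sum(c["tokens"] for c in retrieved_chunks if c["id"] not in cited_ids)
    return unused / total

def most_retrieved_context_went_unused(row, threshold=0.5):
    """Flag a task when most of its retrieved tokens went uncited (threshold is a guess)."""
    return unused_context_share(row["retrieved_chunks"], row["cited_ids"]) > threshold

# Invented example: three chunks pulled in, only the first one cited.
row = {
    "trace_id": "a2",
    "retrieved_chunks": [
        {"id": "c1", "tokens": 400},
        {"id": "c2", "tokens": 300},
        {"id": "c3", "tokens": 300},
    ],
    "cited_ids": {"c1"},
}

print(unused_context_share(row["retrieved_chunks"], row["cited_ids"]))
print(most_retrieved_context_went_unused(row))
```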

Pass 4: The silent-failure tax

The most expensive tasks are sometimes the ones marked "succeeded." A confident wrong answer passes every eval, ships, and bills you twice: once for the compute that produced it, again for the person who later cleans it up. Neither the invoice nor your evals caught it, because both saw success.

To find it, join the cost to any downstream rework signal you already collect: a reopened ticket, a reverted PR, a thumbs-down, a human correction.

# The dangerous tax: tasks that succeeded on paper and failed in the world.
for r in rows:
    if looked_successful_but_was_reworked(r):
        flag(r["trace_id"], "silent-failure tax")
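
A sketch of that second join, assuming your rework signals can be exported as events that carry the same trace_id as the task they correct. The event shape, the signal names, and the attach_rework helper are assumptions for illustration. In practice the link between a reopened ticket and the trace that caused it is often indirect.

```python
def attach_rework(rows, rework_events):
    """Return copies of rows with a 'reworked' flag set from downstream signals."""
    reworked_ids = {e["trace_id"] for e in rework_events}
    return [{**r, "reworked": r["trace_id"] in reworked_ids} for r in rows]

def looked_successful_but_was_reworked(row):
    """Succeeded on paper, then someone had to fix it."""
    return row["outcome"] == "succeeded" and row.get("reworked", False)

# Invented data: a2 was marked successful, then its ticket was reopened.
rows = [
    {"trace_id": "a1", "outcome": "succeeded", "cost": 0.04},
    {"trace_id": "a2", "outcome": "succeeded", "cost": 0.19},
    {"trace_id": "a3", "outcome": "failed",    "cost": 0.05},
]
rework_events = [
    {"trace_id": "a2", "signal": "ticket_reopened"},
    {"trace_id": "a3", "signal": "thumbs_down"},
]

silent_failures = [r for r in attach_rework(rows, rework_events)
                   if looked_successful_but_was_reworked(r)]
print([r["trace_id"] for r in silent_failures])
```

Tasks that failed outright are left out on purpose: your existing evals already see those, so they are not silent.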

Rank by recoverable dollars, not by count

Four passes leave you with a list of flagged workflows. Don't sort it by how often each one fires. Sort it by the spend you could plausibly recover. A tier mismatch on a million cheap calls quietly outweighs a dramatic loop that runs twice a day, and the dramatic one is what will grab your attention. Put a rough dollar figure next to each flag and let that be the tiebreaker, every time.
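
To make the arithmetic concrete, here is a sketch with invented numbers. The recoverable_share on each flag is a rough human estimate of how much of the flagged spend could plausibly be saved. It is an assumption you supply, not something the audit measures.

```python
# Invented flags: a quiet tier mismatch and a dramatic but rare loop.
flags = [
    {"workflow": "research_agent", "tax": "loop tax",
     "monthly_runs": 60, "cost_per_run": 4.00, "recoverable_share": 0.6},
    {"workflow": "ticket_tagging", "tax": "tier mismatch",
     "monthly_runs": 1_000_000, "cost_per_run": 0.002, "recoverable_share": 0.8},
]

def recoverable_dollars(flag):
    """Rough monthly dollars at stake: runs x cost per run x estimated recoverable share."""
    return flag["monthly_runs"] * flag["cost_per_run"] * flag["recoverable_share"]

ranked = sorted(flags, key=recoverable_dollars, reverse=True)

for f in ranked:
    print(f'{f["workflow"]:<16} {f["tax"]:<14} ~${recoverable_dollars(f):,.0f}/month')
```

With these made-up figures the tagging workflow comes out around $1,600 a month against roughly $144 for the loop, even though the loop costs two thousand times more per run.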

That is the whole output: an undifferentiated bill turned into a ranked list of leaks, each one tied to a decision the system made, each with a number next to it. You can finally point at a workflow and say, "This is what it costs, and this is the part that bought nothing."

What a first pass won't catch

This is a flashlight, not an X-ray. Be honest about its limits, or the findings will mislead you:

It's attribution, not causation. A flag tells you where money is pooled, not that all of it is recoverable. Some long loops are doing real work. Read each flag; don't auto-trust it.
Proxies are proxies. "Used context" and "reworked" are estimates. Tag them as estimates and keep them separate from measured spend. A heuristic that gets treated like telemetry is worse than no heuristic, because someone will trust it in a review.
It's a snapshot. Run it once, and you've found this month's leaks. Behavior drifts, models change, and next month the money pools somewhere new. A one-time audit finds the tax; it doesn't keep it down.
Stakes beat savings. The cheapest model is not always the right one. Don't optimize a high-stakes decision into a cheaper, worse version of itself. Cost is a constraint, not the objective.

Where to start this week

Don't audit everything. Pick the one workflow whose bill made someone ask why. Do step zero, the join, and run Pass 1 only, over a week of traffic.

You will almost certainly find one task type whose worst runs do the same job for several times the effort. That single sentence, "This is what the workflow costs, and this is the part that bought nothing," is the entire point. Add the other three passes once the first one surfaces a leak worth chasing.

Doing this by hand, once, is the first pass. Doing it continuously, across every workflow, is the hard part. That's the direction Nalyqor is being built toward: helping teams see where AI spend creates value, where it quietly doesn't, and giving them a way to investigate those patterns continuously.

But you don't need Nalyqor to get started. Join your bill to your traces and run the first pass. The leaks are already there, levied on every workflow, itemized on none, waiting for someone to look.