Run the same coding-agent task more than once and the cost can swing by as much as 30x. Token spend on these tasks is highly variable, and how hard a task looks barely predicts what the agent will consume getting through it. The invoice at the end of the month cannot tell you which run happened, or why.
The same thing is happening across organizations that have adopted coding-agent tools such as Claude Code, Devin, and Cursor. Uber is one widely reported case: it blew through its entire annual AI budget in four months, then capped agentic coding tools at $1,500 per employee, per tool, each month.
And it is not confined to tech companies. Evident's Banking Brief reported this month that RBC's token usage is up 500% year over year. At JPMorgan, some employees now spend more on tokens than they earn. CommBank's CEO warned that token costs stop scaling linearly once reasoning, tool use, and larger context windows enter the picture. The pattern repeats across regulated, cost-disciplined organizations that are usually the last to overspend.
Costs stopped being linear, and most teams are still budgeting as if they were.
Seat math to agent math
Until recently, AI coding spend was a seat-license problem. Tools billed per developer seat, so you multiplied seats by price, put it in the R&D line, and moved on. The cost was bounded by headcount.
Agents broke that arithmetic. The bill is no longer seats times price. It is tokens consumed times model rate.
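To make the difference concrete, here is a minimal sketch of the two formulas side by side. Every number in it (seat count, seat price, token counts, the per-million-token rate) is a made-up illustration, not a real price list; only the 30x spread between runs echoes the figure cited above.

```python
# Illustrative numbers only; none of these come from a real price list.
SEATS = 200
SEAT_PRICE_USD = 40.0  # hypothetical monthly price per developer seat


def seat_cost(seats: int, price_per_seat: float) -> float:
    """Seat math: the bill is bounded by headcount."""
    return seats * price_per_seat


def agent_cost(tokens: int, usd_per_million_tokens: float) -> float:
    """Agent math: tokens consumed times model rate."""
    return tokens / 1_000_000 * usd_per_million_tokens


# The same task run twice; the second run consumes 30x the tokens of the first.
cheap_run = agent_cost(tokens=400_000, usd_per_million_tokens=5.0)
costly_run = agent_cost(tokens=12_000_000, usd_per_million_tokens=5.0)

print(f"seat bill:  ${seat_cost(SEATS, SEAT_PRICE_USD):,.2f}")
print(f"cheap run:  ${cheap_run:,.2f}")
print(f"costly run: ${costly_run:,.2f} ({costly_run / cheap_run:.0f}x)")
```

The seat bill is fixed once you know headcount. The agent bill depends on a token count you only learn after the run.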
Source: Stanford Digital Economy Lab, "How Do AI Agents Spend Your Money?" (2026). SWE-bench Verified tasks ordered by actual token consumption, colored by expert-rated difficulty. The colors are scrambled across the cost range. Human-judged difficulty only weakly aligns with what an agent actually spends. How hard a task looks barely predicts what it costs. Tasks experts rated quick (light) and slow (dark) are scattered across the whole spending range.
A handful of things drive that spread:
- Session length. Context accumulates on every turn. Prompt caching keeps the re-sent history cheap, but an edit to earlier context or a model switch invalidates the cache, and the whole history is billed again at full input rates.
- Tool loops. A failing grep-edit-test cycle re-spends tokens on every iteration without making progress.
- Model choice. Reasoning models and large context windows cost more per step, and the strongest model is often the default.
- Fan-out. Subagents and parallel sessions multiply token usage across repos.

The expensive run is not the better run. The Stanford analysis behind that 30x figure found accuracy peaks at an intermediate cost and then saturates, so tokens spent past that point add cost without adding accuracy. Much of the spread is waste on top of the unpredictability.
Source: Stanford Digital Economy Lab, "How Do AI Agents Spend Your Money?" (2026). Higher token usage does not translate into higher accuracy (left), and on the same problem, accuracy peaks at an intermediate cost before plateauing (right). More tokens do not buy more accuracy.
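The session-length driver is the easiest one to put numbers on. The sketch below is a simplified model, not any vendor's billing logic: it assumes each turn re-sends the full history, bills cached history at a lower rate, and bills it at the full input rate on any turn where the cache is lost. Both rates are hypothetical, and output tokens and cache-write charges are left out.

```python
INPUT_USD_PER_M = 3.00       # hypothetical full input rate per million tokens
CACHE_READ_USD_PER_M = 0.30  # hypothetical cached-read rate per million tokens


def session_input_cost(turn_tokens, invalidated_turns=frozenset()):
    """Input-side cost of a session in which every turn re-sends prior history.

    turn_tokens: new tokens added on each turn.
    invalidated_turns: turn indexes where the cache is lost, so the whole
        history is billed at the full input rate on that turn.
    """
    history = 0
    total = 0.0
    for i, new_tokens in enumerate(turn_tokens):
        if i in invalidated_turns:
            history_rate = INPUT_USD_PER_M
        else:
            history_rate = CACHE_READ_USD_PER_M
        total += history / 1e6 * history_rate       # re-sent history
        total += new_tokens / 1e6 * INPUT_USD_PER_M  # new content this turn
        history += new_tokens
    return total


turns = [5_000] * 40  # 40 turns, 5k new tokens each
warm = session_input_cost(turns)
one_miss = session_input_cost(turns, invalidated_turns={30})

print(f"cache warm throughout: ${warm:.3f}")
print(f"one miss at turn 30:   ${one_miss:.3f}")
```

Under these assumptions a single miss late in the session adds roughly 40 cents to a session that would otherwise cost under two dollars on the input side, and the penalty grows with every turn of accumulated history.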
What a total cannot tell you
A vendor bill tells you Claude Code cost, say, $47,000 last month. It will not tell you:
- Which sessions shipped a merged PR and which spun on a failing test suite for an hour
- Whether tool.Bash or tool.Grep dominated spend in a given repo
- Which model tier ran on which kind of task
- Whether the tokens a session burned produced a diff anyone kept

This is the gap between a finance number and an engineering decision.
Banks are already engineering around this. In a pilot of its CAI 2.0 platform, CIBC removed model choice and auto-routes each prompt to the most economical model for the task. TD Bank stood up an AI FinOps function to watch token trends. PNC is leaning on on-prem and open-weight models so it stops renting every token. Coinbase already routes prompts to cheaper models where the task allows it; CEO Brian Armstrong expects roughly 80% of workloads to shift to cheaper models within 12 to 18 months, reserving frontier systems for harder work.
Every one of those moves assumes the same prerequisite: you can classify the work before you route or cap it. For a coding agent, classification starts with observability: seeing the trajectory the agent actually took, linked to its results.
Measure the trajectory before you cap it
HoneyHive for Coding Agents treats a coding session as a trace tree rather than a line item. The session, its turns, tool calls, model events, and artifacts each become a node, and cost attaches at each node, not only on the session total.
For Claude Code, the honeyhive-daemon captures sessions through hooks with no changes to your code or your agent config.
Each run produces a tree you can open and read:
A Claude Code session in HoneyHive as a trace tree, from session.start through session.end, with each tool call expandable to its inputs, outputs, duration, and annotations.
Token counts land on each event under metadata.usage (input_tokens, output_tokens, cache_read_tokens), and cost rolls up to the span and session so you can filter and chart it. The Trajectory view plots every step as a bubble sized by duration, cost, or evaluator score. That is usually where you catch the expensive loop, before it reaches an invoice.
Trajectory view: every agent step plotted as a bubble, grouped by Model, Tool, and Chain and sized by duration, cost, or evaluator score. The outlier bubble is the run worth investigating.
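If you want to reproduce the per-event rollup outside the UI, the arithmetic is small. The sketch below assumes events arrive as plain dictionaries with an `event_name` key and the `metadata.usage` fields named above; that dictionary layout and the per-million-token rates are assumptions for illustration, not the HoneyHive export schema.

```python
from collections import defaultdict

# Hypothetical USD per million tokens; substitute your model's actual rates.
RATES = {"input_tokens": 3.00, "output_tokens": 15.00, "cache_read_tokens": 0.30}


def event_cost(event: dict) -> float:
    """Price one event from its metadata.usage token counts."""
    usage = event.get("metadata", {}).get("usage", {})
    return sum(usage.get(field, 0) / 1e6 * rate for field, rate in RATES.items())


def cost_by_event_name(events) -> dict:
    """Roll event costs up by event name (for example, by tool)."""
    totals = defaultdict(float)
    for event in events:
        totals[event["event_name"]] += event_cost(event)
    return dict(totals)


# Made-up sample events in the assumed layout.
events = [
    {"event_name": "tool.Bash", "metadata": {"usage": {
        "input_tokens": 20_000, "output_tokens": 1_000, "cache_read_tokens": 100_000}}},
    {"event_name": "tool.Grep", "metadata": {"usage": {
        "input_tokens": 2_000, "output_tokens": 200, "cache_read_tokens": 50_000}}},
    {"event_name": "tool.Bash", "metadata": {"usage": {
        "input_tokens": 30_000, "output_tokens": 2_000, "cache_read_tokens": 150_000}}},
]

for name, usd in sorted(cost_by_event_name(events).items()):
    print(f"{name}: ${usd:.3f}")
```

Grouping by event name answers the tool.Bash versus tool.Grep question directly; swapping the grouping key for a model or session identifier gives the other cuts.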
Devin works the same way through its exporter. It syncs autonomous sessions with PR links, status, internal shell, git, browser, and file operations, plus metrics.acus_consumed for Devin's own billing unit. Re-running the exporter is idempotent, so it only picks up new sessions.
What you get out of it is the breakdown finance keeps asking for and engineering can answer:
- By session: total cost and duration for one agent run
- By step: the turn or tool call that drove the spend
- By model: tier and token counts per event
- By outcome: cost tied to merged PRs, eval scores, or whatever metric you define

Attach an evaluator and you score the work next to its price, so a $12 session that landed a clean fix and a $12 session that looped on a dead end are no longer indistinguishable on the bill.
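The by-outcome cut is the one a vendor bill cannot produce, so it is worth seeing how little it takes once cost and outcome sit on the same record. This is a sketch over made-up session records; the field names `cost_usd` and `merged_pr` are hypothetical, and "merged PR" stands in for whatever outcome metric you define.

```python
# Made-up session records; field names are hypothetical.
sessions = [
    {"id": "s1", "cost_usd": 12.0, "merged_pr": True},   # landed a clean fix
    {"id": "s2", "cost_usd": 12.0, "merged_pr": False},  # looped on a dead end
    {"id": "s3", "cost_usd": 3.5, "merged_pr": True},
    {"id": "s4", "cost_usd": 40.0, "merged_pr": False},
]


def outcome_summary(sessions) -> dict:
    """Split total spend by whether the session produced a merged PR."""
    total = sum(s["cost_usd"] for s in sessions)
    merged = [s for s in sessions if s["merged_pr"]]
    merged_spend = sum(s["cost_usd"] for s in merged)
    return {
        "total_usd": total,
        "merged_spend_usd": merged_spend,
        "unmerged_spend_usd": total - merged_spend,
        # All spend divided by the number of merged PRs it bought.
        "cost_per_merged_pr_usd": total / len(merged) if merged else None,
    }


summary = outcome_summary(sessions)
for key, value in summary.items():
    print(f"{key}: {value}")
```

The two $12 sessions are identical on the invoice and land on opposite sides of this split.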
Getting Started
Claude Code (real-time, hook-based):
pip install honeyhive-daemon
honeyhive-daemon init
export HH_API_KEY=<your-api-key>
honeyhive-daemon run
Full setup, daemon options, and CI analysis are in the Claude Code guide.
Devin (batch or daemon sync):
git clone https://github.com/honeyhiveai/honeyhive-daemon.git
cd honeyhive-daemon/devin && pip install -r requirements.txt
export DEVIN_API_KEY=<key> HH_API_KEY=<key>
python devin_to_honeyhive.py --daemon --interval 60
See the Devin guide for API key setup and the event schema.
Cursor tracing is available through the cookbook while first-class support ships, and the broader coding agents hub covers skills, CLI, and the docs MCP.
Caps are a symptom
Caps like Uber's will not be the last ones written. When spend outruns understanding, a hard limit is the fastest control available, and it does curb the immediate overspend. But it also constrains the engineers doing the most valuable work, because a flat ceiling cannot tell productive sessions from wasteful ones. The cap is blind to the same thing the original budget was: what the tokens actually bought.
The fix that holds up is to connect cost to the work the agent did. Once you can see which sessions, steps, and models drive spend, and which ones produced something worth keeping, you can set budgets that catch the runaway cases and leave the productive work alone.
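One way such a budget could look, as a sketch rather than a recommendation: flag a session only when it is both far more expensive than a typical run and produced nothing anyone kept. The `kept` field, the sample data, and the 5x-median threshold are all assumptions chosen for illustration.

```python
from statistics import median


def runaway_sessions(sessions, multiple: float = 5.0) -> list:
    """Return sessions costing more than `multiple` x the median cost
    that also produced nothing kept. Expensive but productive runs pass."""
    typical = median(s["cost_usd"] for s in sessions)
    threshold = multiple * typical
    return [s for s in sessions if s["cost_usd"] > threshold and not s["kept"]]


# Made-up records; `kept` stands in for any outcome signal you trust.
sessions = [
    {"id": "s1", "cost_usd": 2.0, "kept": True},
    {"id": "s2", "cost_usd": 3.0, "kept": False},
    {"id": "s3", "cost_usd": 4.0, "kept": True},
    {"id": "s4", "cost_usd": 5.0, "kept": True},
    {"id": "s5", "cost_usd": 60.0, "kept": False},  # expensive, nothing kept
    {"id": "s6", "cost_usd": 45.0, "kept": True},   # expensive, but it shipped
]

flagged = runaway_sessions(sessions)
print([s["id"] for s in flagged])
```

A flat ceiling would stop both s5 and s6. A rule that can see outcomes stops only s5.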
________
See where your tokens actually go. HoneyHive for Coding Agents turns every Claude Code and Devin session into a cost-attributed trace tree, so you can separate the runs worth keeping from the ones worth capping. Start with the Claude Code daemon, or book a demo if you'd like a hand setting it up.