Back to all posts
AI & Strategy

The Token Bill Is Now a Hiring Decision: What the AI Cost Myths Get Wrong

In April 2026, Uber's chief technology officer blew through his entire annual AI budget before the year was half over. The line item that did it was not headcount or infrastructure. It was tokens. That is not a story about Uber. It is the story arriving at every company that builds on these models, and it is reframing a question most teams thought was settled: do we hire another person, or do we grow the AI budget. For the first time those two numbers sit on the same line, and the model you reach for by default is quietly deciding which way the answer goes.

I run two SaaS products and a marketing consultancy on these models. I am writing this in Claude Opus 4.8, the most capable model available, and I will switch to a cheaper one the moment the thinking is done. That one habit - capable model to plan, cheaper model to execute - is the difference between a bill I can defend and one I cannot. The reason it works is that almost everything people believe about model cost is wrong. So let me take the myths down one at a time.

Why are AI bills suddenly a hiring decision?

AI spend has crossed the line where it competes directly with payroll. Per-employee spending on AI rose roughly 50 percent in a single year, from about $1,358 per employee in 2025 to a projected $2,068 in 2026, which rolls up to an estimated $280 billion of private AI investment for the year. That is no longer a tooling cost you bury in software expense. It is a number that shows up next to salaries on the same spreadsheet.

The people running the largest deployments are saying the quiet part out loud. Bryan Catanzaro, a vice president at Nvidia, told Axios that "the cost of compute is far beyond the costs of the employees" for his team. Meta announced layoffs of around 8,000 people, roughly 10 percent of its workforce, and scrapped 6,000 open roles, explicitly to "offset the other investments we're making." When a company chooses compute over a headcount, the model routing decision is no longer an engineering preference. It is a budget decision with someone's job on the other side of it.

Is the most capable model always the right call?

No, and defaulting to the frontier model on every request is the single most common way teams set money on fire. Claude Opus 4.8, which Anthropic released on May 28, 2026, costs $5 per million input tokens and $25 per million output. Claude Haiku 4.5 costs $1 and $5. That is an exact 5x gap on both ends for the same token volume. If your agent loop runs a hundred turns a day and most of those turns are mechanical - reformatting JSON, extracting a field, rephrasing a paragraph, following a plan that already exists - you are paying frontier prices for clerical work.

The trap is that the capable model genuinely is better, so it never feels wrong in the moment. It feels responsible. But "better at everything" and "necessary for everything" are different claims. Opus 4.8 earns its rate on the small fraction of work that is genuinely hard: novel architecture, ambiguous tradeoffs, the reasoning you cannot specify in advance. The rest of the run is execution, and execution is where the cheaper tiers were built to live.

What "execution" actually means

Execution is any task where the hard thinking has already happened and the model is carrying out a plan. Writing the function once you have decided the interface. Transforming a CSV you have already mapped. Drafting copy against a brief that is already locked. None of that needs frontier reasoning, and on that work a mid-tier model is not a downgrade. It is the right-sized tool.

Does a cheaper model mean worse output?

Cheaper means worse only when the task needs the capability you stopped paying for. Tier the work correctly and the cheaper model matches the expensive one on the thing you actually asked it to do. The failure mode is not "Haiku is bad." It is "Haiku was handed a job that needed Opus." Match the tier to the task and the quality gap on that task collapses, while the cost gap stays.

Here is how the three current Claude tiers actually divide the labor, with the per-million-token rates and the place each one breaks down.

ModelInput / output per MWhere it winsWhere it breaks
Opus 4.8$5 / $25Novel reasoning, architecture, ambiguous tradeoffs, the plan itselfAnywhere the plan already exists - you are overpaying 5x
Sonnet 4.6$3 / $15Executing a clear plan, most coding, long-context work at a flat 1M-token rateGenuinely open-ended design with no spec to follow
Haiku 4.5$1 / $5Classification, extraction, formatting, high-volume mechanical turnsMulti-step reasoning or anything requiring real judgment

Read the table as a routing map, not a ranking. Sonnet 4.6 at $3 and $15 is the workhorse: three-fifths the cost of Opus, and on a plan it can follow you will rarely see the difference. Haiku at $1 and $5 is for the volume - the turns you run hundreds of times where judgment is not the point.

Where exactly should the handoff happen?

The handoff happens at the moment the plan stops changing. Everything upstream of a locked plan is reasoning and belongs in the capable model. Everything downstream is execution and belongs in a cheaper one. The skill is noticing that moment and actually switching, because the capable model will happily keep doing the cheap work and never tell you it was overkill.

This is the sequence I run on real builds, the same one behind the products I ship:

  1. Scope and architect in Opus 4.8. Decide the structure, the interfaces, the hard tradeoffs. This is where frontier reasoning pays for itself, and it is usually the smallest slice of total tokens.
  2. Lock the plan in writing. Write it down as an explicit spec so the next model is following instructions, not inventing them. A written plan is what makes the downgrade safe.
  3. Execute in Sonnet 4.6. Hand it the locked plan and let it do the building. Most of your token volume lives here, now at three-fifths the per-token cost.
  4. Route mechanical turns to Haiku 4.5. Classification, extraction, formatting, repetitive transforms - the high-frequency work that needs speed and consistency, not judgment.
  5. Escalate back to Opus only on a genuine block. When execution hits something the plan did not anticipate, kick that one decision up a tier, then drop back down. Escalate the decision, not the whole session.
The frontier model is for the thinking you cannot outsource. Everything downstream of the plan is a cheaper model's job, and paying Opus rates to do it is not diligence. It is waste with good intentions.

What costs more than the model you picked?

The context you resend on every single turn usually costs more than the model tier you agonized over. In an agent loop, the system prompt, the tool definitions, and the accumulated history get re-sent with each request, and you pay full input price for the same tokens over and over. The model name on the request is one decision. The tokens you push through it hundreds of times is the actual bill.

Two pricing levers attack this directly, and most teams have neither turned on. Prompt caching lets the model store the stable part of your context and read it back at a 90 percent discount. The batch API runs non-urgent work at half price across the board. Here is the math that makes caching obvious.

Prompt caching - Claude pricing
  cache READ      = 0.10x base input   (a 90% discount)
  5-min cache WRITE = 1.25x base input (one-time, per window)
  1-hour cache WRITE = 2.0x base input

Breakeven: the 5-minute cache pays for itself
after ~2 reads. On agent loops and RAG pipelines it
cuts input-token cost 30-50% with no change in output.

Batch API: 50% off every token, input and output,
for any work that does not need a real-time reply.

Stack these against correct routing and the savings compound. A mechanical, high-volume job moved from Opus to Haiku is already 5x cheaper per token. Cache the stable context on top and the input side drops another 90 percent on every repeat. Run it through batch instead of real time and halve what remains. None of that touches the output a user sees. It only stops you from paying three times for the same work.

How do I cut my own bill this week?

You can move the number in an afternoon without re-architecting anything. The wins are mostly about not paying premium rates for commodity work, and they are sitting in defaults you have never changed. Start here:

  • Audit where your tokens actually go. Pull a week of usage by model and by task. The waste is almost always frontier-tier money spent on mechanical turns, and you cannot route what you have not measured.
  • Set a default tier below the top. Make Sonnet 4.6 the model that runs unless a task earns Opus, rather than the other way around. Flip the default and you change the baseline cost of everything.
  • Use Opus 4.8's effort control instead of always switching models. Opus 4.8 added low, high, extra, and max effort levels - dialing one Opus call down to low cuts its tokens and latency when you need the capability but not the depth.
  • Turn on prompt caching for any repeated context. System prompts, tool definitions, and long reference documents are the obvious candidates, and the 90 percent read discount starts paying after about two hits.
  • Move non-urgent work to the batch API. Overnight reports, bulk enrichment, anything without a human waiting - half price for changing one parameter.

Do those five and most teams cut their bill by half or more while the work that reaches a customer looks identical. That is the whole point of the myths-versus-reality framing: the savings do not come from accepting worse output. They come from refusing to pay for capability the task never needed.

Common questions

Is switching to a cheaper model just accepting lower quality?

No, when the task is matched to the tier. Quality drops only if you hand a cheaper model work that genuinely needs frontier reasoning. On execution against a locked plan, a mid-tier model like Sonnet 4.6 produces output you will struggle to tell apart from Opus, at three-fifths the per-token cost.

How much can prompt caching actually save?

Caching cuts input-token cost by 30 to 50 percent on typical agent loops and RAG pipelines, and 88 to 95 percent on workloads with heavy repeated context. Cache reads are priced at 10 percent of standard input, and the 5-minute cache pays for its one-time write cost after roughly two reads.

What is the price difference between Claude's model tiers right now?

As of Opus 4.8's May 28, 2026 launch, it is $5 per million input tokens and $25 output, Sonnet 4.6 is $3 and $15, and Haiku 4.5 is $1 and $5. That is a clean 5-to-3-to-1 ratio, so an identical token job costs five times more on Opus than on Haiku.

Should small teams worry about this, or only enterprises?

Small teams feel it faster, because there is no procurement layer to absorb a runaway bill. The same routing discipline that saves an enterprise millions is what keeps a solo operator's monthly invoice from quietly tripling, and the levers - tiering, caching, batch - are identical at any scale.

Related reading

Pfender Marketing Co.

Every audit, build, and word on this site comes from one operator. If something above sounded like your own funnel, the fastest way to find the leak is to look at the numbers.

See where your budget is leaking

A 60-second scan of your site and marketing stack, the same first pass I run before any engagement.

Run the complimentary audit

Not ready for the audit? Get each new post in your inbox.