Cheap Tokens, Expensive Workflows: Deterministic AI Wins

Rob Brown, SVP National Security Solutions

15 Jun, 2026



The Case for Deterministic AI in Legacy Modernization

Three years ago, the cautious position on AI economics was that token prices might not fall fast enough to make large-scale AI workloads affordable. That prediction aged badly. GPT-4-class inference cost about $30 per million input tokens in early 2023. Today you can buy equivalent capability for under a dollar per million tokens. Epoch AI measured price declines between 9x and 900x per year depending on the capability level. Nothing in the history of computing has gotten cheaper this fast.

And yet enterprise AI bills keep going up.

This is the part the cost-curve optimists missed: the unit of consumption changed. A user task handled by an agentic workflow doesn’t trigger one inference call; it triggers ten or twenty, covering planning, tool calls, retries, self-review, and verification. Reasoning models burn large volumes of internal “thinking” tokens that get billed as output, sometimes 100x what the final answer contains. RAG and large-context analysis multiply tokens per request by 3-5x. And agentic coding tasks vary wildly in consumption from run to run: two attempts at the same task can differ in cost by multiples.

It’s also worth noticing what the frontier itself costs now. Anthropic’s new flagship, Claude Fable 5, launched this month at $10 per million input tokens and $50 per million output tokens, double its predecessor’s rates. The commodity tier keeps collapsing toward free while the capability tier holds premium pricing, and the agentic workloads everyone actually wants run on the capability tier. The per-token price collapsed; total spend became less predictable, not more. For a consumer chatbot, that’s a budgeting annoyance. For a multi-year modernization program with a fixed budget and congressional oversight, it’s a real problem.
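To make the arithmetic concrete, here is a minimal sketch of how per-task cost adds up. The two rates are the ones quoted above; the call counts and token counts are hypothetical, chosen only to show how two runs of the same task can land at very different totals.

```python
from dataclasses import dataclass

# Flagship-tier rates quoted in the article (USD per million tokens).
INPUT_RATE = 10.0
OUTPUT_RATE = 50.0


@dataclass
class Call:
    input_tokens: int
    output_tokens: int  # includes billed "thinking" tokens


def task_cost(calls):
    """Total USD cost of one user task made up of several inference calls."""
    total_in = sum(c.input_tokens for c in calls)
    total_out = sum(c.output_tokens for c in calls)
    return (total_in * INPUT_RATE + total_out * OUTPUT_RATE) / 1_000_000


# Hypothetical run A: 10 calls with modest reasoning.
run_a = [Call(input_tokens=20_000, output_tokens=2_000) for _ in range(10)]
# Hypothetical run B: same task, 20 calls with heavier reasoning and retries.
run_b = [Call(input_tokens=20_000, output_tokens=8_000) for _ in range(20)]

cost_a = task_cost(run_a)
cost_b = task_cost(run_b)
print(f"run A: ${cost_a:.2f}, run B: ${cost_b:.2f}, ratio: {cost_b / cost_a:.1f}x")
```

Under these assumed numbers the same task costs four times as much on the second run, and nothing about the rate card changed. Multiply that spread across thousands of tasks and a fixed budget stops being fixed.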

The benchmark I leaned on just got crushed. Let me be honest about that.

A year ago, the strongest single number in this argument was the gap between public-benchmark and private-codebase performance: frontier models in the high 70s on SWE-bench Verified, low 20s on SWE-bench Pro, teens on private codebases. Code the model has never seen, the argument went, is where it falls apart — and a legacy system is by definition code the model has never seen.

Then Anthropic shipped Fable 5 and Mythos 5 on June 9, and Fable 5 scored 80.3% on SWE-bench Pro. Not Verified. Pro, the hard one. That’s an 11-point jump over Opus 4.8 and roughly 22 points clear of GPT-5.5. SWE-bench Verified is at 95% and effectively saturated. The headline customer story is Stripe running a codebase-wide migration across 50 million lines of Ruby in a single day, work Stripe estimated at over two months for a full team.

If you wrote a thesis on the private-codebase gap, intellectual honesty requires admitting that gap is closing much faster than skeptics expected. The accelerator didn’t just get better. It got dramatically better.

So is the argument dead? Look closer at three things.

First, the hard tail is still hard. On FrontierCode Diamond — Cognition’s benchmark holding models to production-codebase standards, not just “does the test pass” — Fable 5 scores 29.3% at maximum reasoning effort. Best in the world, more than double Opus 4.8, and still failing seven out of ten tasks held to the standard a mission-critical system actually requires: performant at scale, idiomatic, structured for long-term maintainability. That’s the standard a modernized federal system has to meet, and the frontier is at 30%.

Second, the Stripe story is real and it’s Ruby. Fifty million lines of one of the best-represented languages in any training corpus, at a company with elite engineering infrastructure to validate the output. It’s a genuinely impressive proof point for the accelerator role. It tells you very little about four decades of COBOL, PL/I, Natural, or a proprietary 4GL, where the validation infrastructure doesn’t exist and has to be built.

Third — and this is the one procurement people should sit with — the cost-variance problem got worse, not better, with the model that got better. Fable 5’s own system card shows its agentic coding score climbing from 75.0% to 80.4% on SWE-bench Pro as you turn the reasoning-effort dial from low to maximum, and FrontierCode nearly tripling from 11.5% to 30.9%. Accuracy is now literally a function of how many thinking tokens you’re willing to buy, at $50 per million on output. And Fable 5 introduces a new flavor of nondeterminism: its safety layer reroutes flagged queries to Opus 4.8 mid-task — about 5% of sessions overall, but over 20% of trials on some agentic benchmarks. Your agent can silently switch models partway through a trajectory. For a demo, fine. For an auditable transformation pipeline, that’s a finding waiting to be written.
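A rough way to see what the dial costs is to look at expected spend per solved task. The sketch below uses the SWE-bench Pro pass rates and the output rate quoted above; the thinking-token budgets are hypothetical, and it assumes failed attempts are simply retried independently, which real pipelines may not do.

```python
OUTPUT_RATE = 50.0  # USD per million output tokens, as quoted in the article


def cost_per_success(thinking_tokens, pass_rate):
    """Expected output spend per solved task when failures are retried.

    With independent retries, the expected number of attempts is 1 / pass_rate.
    """
    cost_per_attempt = thinking_tokens * OUTPUT_RATE / 1_000_000
    return cost_per_attempt / pass_rate


# Pass rates come from the text; token budgets are assumptions for illustration.
low = cost_per_success(thinking_tokens=20_000, pass_rate=0.750)
high = cost_per_success(thinking_tokens=200_000, pass_rate=0.804)
print(f"low effort: ${low:.2f} per success, max effort: ${high:.2f} per success")
```

With those assumed budgets, a little over five points of accuracy costs roughly nine times more per solved task. The real ratio depends on token counts only the vendor’s billing will show you, which is the point.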

Modernization was never a code generation problem

GenAI is genuinely good at explaining code, drafting documentation, generating tests, and helping developers move faster — and the industry numbers back this up. Across recent enterprise programs, AI-assisted modernization is credited with cutting timelines by 40-50%, mostly in analysis, translation, documentation, and test generation. In one healthcare program, AI-assisted translation converted about 65% of a legacy codebase while compliance review stayed in the loop. A fintech migration scoped at 700-800 hours cut effort by 40% using generative agents. None of that is in dispute, and none of it is the hard part.

The hard part is that modernizing a mission-critical system means preserving business rules, mapping dependencies, transforming architecture, validating that the new system behaves like the old one, and proving all of it to auditors and authorizing officials. In federal environments, getting this wrong doesn’t mean a bad sprint. It means benefits don’t go out, payments fail, cases stall, or a compliance finding lands on someone’s desk.

“Right 80% of the time” is a historic benchmark score and a disqualifying transformation standard. The model improved from “fails most unfamiliar tasks” to “fails a meaningful minority of them, unpredictably, at variable cost.” That’s enormous progress for an accelerator and still not an assurance story.

Why deterministic approaches hold up

Deterministic modernization treats the problem as controlled transformation rather than open-ended generation: parsing, dependency graphing, rule extraction, mapping, validation. The case for it has gotten stronger, not weaker, as the models improved.

The same source logic transforms the same way every time, across the whole codebase, with no run-to-run variance, no reasoning-effort dial that trades accuracy for token budget, and no degradation as the work scales. Every decision traces from legacy code to modernized output, which is the kind of traceability the NIST AI RMF and federal governance guidance call for, and what probabilistic generation can’t natively give you. The cost model is per system or per line of code, not per token consumed by an agent loop of unknown length, so neither a price correction in the inference market nor a flagship launch at double the old rate touches your modernization budget. And because deterministic transformation enforces a target architecture and coding standards uniformly, you come out the other side with less technical debt instead of a fresh layer of inconsistent generated code.
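For readers who want to see what "same input, same output, fully traced" looks like in miniature, here is a toy sketch. It is not how any production engine works, Continuum Code included; the rule IDs, the two rewrite rules, and the COBOL-like statements are all hypothetical. It only illustrates the three properties in question: fixed rules, a trace from each source line to the rule that fired, and byte-identical output across runs.

```python
import hashlib
import re

# Hypothetical rewrite rules: (rule id, pattern on a legacy statement, target template).
RULES = [
    ("R1-move", re.compile(r"^MOVE (\w+) TO (\w+)$"), r"\2 = \1"),
    ("R2-add", re.compile(r"^ADD (\w+) TO (\w+)$"), r"\2 += \1"),
]


def transform(source_lines):
    """Apply the first matching rule to each line and record which rule fired."""
    output, trace = [], []
    for line_no, line in enumerate(source_lines, start=1):
        stmt = line.strip()
        for rule_id, pattern, template in RULES:
            if pattern.match(stmt):
                output.append(pattern.sub(template, stmt).lower())
                trace.append((line_no, rule_id))
                break
        else:
            # No rule matched: flag the line for human review instead of guessing.
            output.append(f"# UNMAPPED: {stmt}")
            trace.append((line_no, "unmapped"))
    return output, trace


def digest(lines):
    """Stable fingerprint of the generated output, for run-to-run comparison."""
    return hashlib.sha256("\n".join(lines).encode("utf-8")).hexdigest()


legacy = ["MOVE RATE TO TOTAL", "ADD BONUS TO TOTAL", "PERFORM CALC-TAX"]
out1, trace1 = transform(legacy)
out2, trace2 = transform(legacy)

print(out1)
print(trace1)
print("identical across runs:", digest(out1) == digest(out2))
```

The design choice worth noticing is the unmapped branch. When no rule applies, the sketch stops and says so rather than producing a plausible guess, and that explicit gap is exactly where a human or a GenAI assistant gets pulled in.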

The hybrid model won — officially, this time

The argument was never GenAI versus deterministic AI, and the market has now formalized that. Gartner’s new tool category for this space — AI-Augmented Code Modernization — is defined explicitly as the combination of specialized AI agents, generative AI, and deterministic analysis. The hybrid isn’t a contrarian position anymore. It’s the category definition.

The division of labor is the same one that’s been emerging for two years, just with a much stronger accelerator. Deterministic AI carries the assurance burden: transformation, dependency analysis, rule extraction, behavioral validation. GenAI — and Fable 5 is a real step change here — accelerates everything around it: documentation, test scaffolding, requirements interpretation, helping SMEs understand forty-year-old code. Humans validate business logic and resolve the ambiguity that neither machine can.

What changed this month is that the accelerator crossed a threshold where it can do genuinely large mechanical migrations in friendly territory. What hasn’t changed is which component you can bet the mission on.

Buyers have caught up to this. With 85% of enterprises reporting that legacy systems block their AI adoption and legacy consuming the bulk of IT budgets, the evaluation questions are blunt: Can you scale across millions of lines without drift? Can you prove behavioral equivalence? Can you show line-level traceability? Can you commit to a fixed price? Can you survive an ATO process?

That’s the design point for Continuum Code: a deterministic modernization engine built for predictability, auditability, and cost control, using GenAI where it actually earns its keep — and Fable 5 just made that part of the engine considerably more valuable.

The bottom line

The strangest lesson of the past three years still holds: tokens got radically cheaper and cost discipline got harder. The newest frontier model is the best coding system ever built, and it ships with a reasoning dial that prices accuracy by the token, a premium rate card, and a safety layer that can swap models mid-task. Every one of those is fine for exploration and disqualifying for a fixed-budget assurance pipeline.

GenAI will keep getting better and will keep earning a bigger role as an accelerator — a bigger role than I would have predicted a year ago, frankly. But the core engine for large-scale legacy modernization needs to be deterministic, because the things that survived both the price collapse and the capability jump are the things that mattered all along: knowing what it costs, proving what it did, and getting the same answer every time.
