Effort Control Is a Leaky Abstraction Vendors Sold You

In about a year, “how hard the model thinks” went from a hidden internal behavior to a parameter you set on every API call. OpenAI shipped reasoning_effort with o3-mini in January 2025.¹ Anthropic moved effort control to general availability in February 2026.² Google exposes a thinking budget on Gemini.³ By early 2026 the dial is table stakes: one model, a slider from fast-and-cheap to slow-and-deep, capability sold by the call instead of by the model.

The pitch is good, and the underlying engineering is real. There is a genuine shift from spending compute at training time to spending it at inference time, and selling that compute as a tunable knob is a reasonable product move. OpenAI’s own data showed o3-mini at low effort matching o1-mini, at medium matching o1, and at high beating both.⁴ One team documented saving $741K a year by routing only 30% of requests to reasoning models.⁵ The knob optimizes something that matters.

But the honest story is that this knob is a leaky abstraction the vendors pushed onto builders. The same companies that sell “precise control” describe it in their own docs as soft guidance. The mental model the marketing implies (more thinking equals better answers) is false in a way the research has already measured. And the calibration work of finding the point where extra thinking starts paying off got handed to you without the tooling to do it. The dial is real. The slider lies about being linear. And you need evals at every level you deploy, or you’re flying blind.

From Training-Time to Test-Time Scaling

For most of the deep-learning era, you bought capability by training a bigger model on more data. Test-time scaling is the shift to a second lever: spend more compute at inference, letting the model generate intermediate reasoning tokens before it answers, and accuracy climbs on hard problems without retraining anything.

Simon Willison traced the capability to reinforcement learning from verifiable rewards consuming compute originally meant for pretraining, and noted that by the end of 2025 “many API models now include dials for increasing or decreasing the amount of reasoning applied.”⁶ Investors framed it harder still: one analysis called test-time compute the most important architectural shift in AI since transformers and projected inference would claim 75% of AI compute by 2030.⁷

That economic backdrop matters, because it tells you what the effort knob is really for. It is a cost-allocation surface with a large infrastructure bet behind it. The vendors needed a way to sell variable inference compute, and “reasoning effort” is the interface they chose. Understanding the knob as a billing dial first and a quality dial second explains most of its sharp edges.

The Three API Shapes, and Why None of Them Match

Effort control is now cross-vendor, but each implementation is incompatible with the others. A router that wants to speak all three has to carry a translation shim, and the semantics underneath the shim are not equivalent.

Anthropic nests effort inside an output config object, output_config.effort, with values low, medium, high, xhigh, and max. The default is high.⁸ OpenAI nests it as reasoning.effort; the o-series accepts low, medium, and high, while GPT-5 variants add a minimal level and an xhigh level, defaulting to medium.¹ The naming is not even stable within OpenAI: it appears as reasoning_effort in the Chat Completions API and reasoning.effort in the Responses API.

Google took a different shape entirely. Gemini 2.5 uses thinkingConfig.thinkingBudget as an integer: 0 turns thinking off where allowed, -1 enables a dynamic budget, and any positive N sets a fixed guidance value. The ranges are model-specific. Gemini 2.5 Pro accepts 128 to 32768 and cannot disable thinking at all; Flash accepts 0 to 24576.³ Then the Gemini 3 series switched from an integer budget to a discrete thinkingLevel enum (minimal, low, medium, high), abandoning the continuous-budget idea Google had just shipped.

# Anthropic: enum nested in output_config, default high
{"output_config": {"effort": "low"}}

# OpenAI Responses API: enum nested in reasoning object, default medium
{"reasoning": {"effort": "minimal"}}

# Gemini 2.5: integer budget (Pro cannot go to 0)
{"thinkingConfig": {"thinkingBudget": 2048}}

# Gemini 3 series: switched to a discrete enum
{"thinkingConfig": {"thinkingLevel": "low"}}

Two things here stay unverified, because the primary docs do not pin them down. Reports of an OpenAI none value to disable reasoning come from community threads, not a canonical docs page. And the exact Gemini 3.x model IDs and their general-availability status (names like Gemini 3.1 Pro and 3.5 Flash) appear in page fetches without an explicit GA designation. Build against the values your vendor’s docs confirm for the specific model you call, not against a blog summary.
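
The translation shim mentioned above can be sketched in a few lines. This is a minimal illustration, not a vendor-endorsed mapping: the function name, the vendor labels, and especially the integer budgets assigned to each tier for Gemini 2.5 are assumptions made for the example. Only the field names and the per-model ranges come from the docs cited in this section.

```python
from typing import Any

# ASSUMPTION: illustrative integer budgets for Gemini 2.5 tiers.
# No vendor publishes an official tier-to-budget mapping.
ASSUMED_GEMINI_BUDGETS = {"low": 512, "medium": 4096, "high": 32768}

# Per-model thinkingBudget ranges as quoted in the text above.
GEMINI_25_RANGES = {"pro": (128, 32768), "flash": (0, 24576)}


def to_vendor_params(vendor: str, level: str, gemini_model: str = "flash") -> dict[str, Any]:
    """Translate a router-level tier (low/medium/high) into a vendor-shaped fragment.

    The output shapes match, but the semantics behind them are only approximate.
    """
    if level not in ASSUMED_GEMINI_BUDGETS:
        raise ValueError(f"unknown effort level: {level!r}")
    if vendor == "anthropic":
        return {"output_config": {"effort": level}}
    if vendor == "openai":
        return {"reasoning": {"effort": level}}
    if vendor == "gemini-2.5":
        lo, hi = GEMINI_25_RANGES[gemini_model]
        # Clamp the assumed budget into the range the model accepts.
        budget = max(lo, min(hi, ASSUMED_GEMINI_BUDGETS[level]))
        return {"thinkingConfig": {"thinkingBudget": budget}}
    if vendor == "gemini-3":
        return {"thinkingConfig": {"thinkingLevel": level}}
    raise ValueError(f"unknown vendor: {vendor!r}")
```

Restricting the shim to low, medium, and high is deliberate: those are the only names every enum in this section shares, and anything beyond them (minimal, xhigh, max) has no counterpart to translate to.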

Tension One: “Precise Control” Is Soft Guidance

The marketing language around these knobs is the language of control. The documentation language is the language of suggestion, and the gap between the two is where builders get hurt.

Anthropic states plainly that effort is “a behavioral signal, not a strict token budget.” At lower effort the model still thinks on hard problems, it just thinks less.⁸ Google’s docs carry an explicit warning that the model “might overflow or underflow the token budget,” and that the Gemini 3 minimal level does not guarantee thinking is off.³ OpenAI frames its parameter as guiding the model. Across all three, you are submitting a preference, not a contract.

You set effort and the model decides what to do with it. That is a reasonable engineering choice. It is not the precise control the product pages imply.

There is a second-order surprise hiding in Anthropic’s version. Effort there does not only govern the hidden thinking block. It affects every response token, including text and tool calls. Lower effort means fewer tool calls and less preamble.⁸ So in an agentic loop, turning effort down does not just shorten the model’s private reasoning. It changes how many steps the agent takes. The knob you reached for to save thinking tokens also quietly rewires your agent’s behavior, and nothing in the parameter name tells you that.

Even the defaults lean against you. Anthropic’s effort defaults to high, and its own docs warn against reaching for max: “On most workloads max adds significant cost for relatively small quality gains, and on some structured-output or less intelligence-sensitive tasks it can lead to overthinking.”⁸ Gemini 2.5 Pro cannot turn thinking off no matter what you pass. A team that ships without actively managing the dial pays the premium by default, and the default direction happens to align with vendor revenue.

Tension Two: More Thinking Is Not Monotonically Better

The mental model the slider invites (drag it right for a better answer) is the part the research most directly contradicts. The relationship between thinking budget and quality is not a line. It is closer to an inverted U, and the peak is task-specific.

Apple’s “The Illusion of Thinking” found that large reasoning models “consistently failed to perform beyond a certain complexity threshold” and “tended to overthink in simple ones, where older models often achieved better results.”⁹ That is the curve in two directions at once: a ceiling on hard problems and active harm on easy ones. The follow-on debate showed the finding is entangled with an unresolved question about what the extra compute actually buys.¹⁰

The numbers from practitioner benchmarks tell the same story with money attached. One 2026 analysis found high effort costs 4 to 17 times more than low, with high-effort time-to-first-token running 18 to 90 seconds (batch territory, not interactive). On AIME math problems, moving low to high gained 18 to 22 points. On code refactoring, quality regressed from medium to high. The same writeup observed that “most teams default every workflow to high reasoning out of caution and pay 4-12x over the right tier,” and argued the right metric is cost-per-correct-answer, not pass rate.¹¹
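
Cost-per-correct-answer is simple arithmetic, and a short sketch shows why it can rank tiers differently than pass rate does. The figures below are made up for illustration (a hypothetical 100-item eval, an 8x cost spread, a 20-point accuracy gain); they are not taken from the cited benchmark.

```python
def cost_per_correct(total_cost_usd: float, n_correct: int) -> float:
    """Total spend divided by the number of correct answers it bought."""
    if n_correct <= 0:
        return float("inf")  # money spent, nothing correct
    return total_cost_usd / n_correct


# HYPOTHETICAL numbers for a 100-item eval set, for illustration only.
runs = {
    "low": {"cost_usd": 1.00, "correct": 70},
    "high": {"cost_usd": 8.00, "correct": 90},
}

for level, r in runs.items():
    print(level, round(cost_per_correct(r["cost_usd"], r["correct"]), 4))
# low 0.0143
# high 0.0889
```

In this made-up case high effort wins on pass rate (90 versus 70) while costing roughly six times more per correct answer. Whether that trade is worth it depends on what a wrong answer costs you, which is exactly the judgment pass rate alone hides.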

The academic work converges on the same shape. “When More Thinking Hurts” found marginal token utility turning negative past 12K tokens for one 32B model, and in 67.5% of degradation cases the model reconsidered and rejected an earlier correct solution.¹² OptimalThinkingBench tested 33 models and found none that struck the right balance between overthinking and underthinking; the crossover point where extra thinking starts helping is task-specific, and no vendor ships tooling to locate it.¹³

The strongest evidence that “one model, tunable effort” carries a real quality cost comes from a vendor that tried it and backed out. Alibaba shipped Qwen3 with a hybrid thinking mode and a user-controllable budget, then reversed course and split the line into separate Instruct and Thinking models. The instruct-only model gained 2.8x on AIME25 over the hybrid. Alibaba’s stated reason: “providing better-quality performance is more important than the unification at this moment.”¹⁴ A company that sold the exact “one dial, fast or deep” promise concluded the unification was costing it quality and walked away from it.

Tension Three: The Calibration Burden Got Offloaded

This is the move that turns a useful feature into a leaky abstraction. “How many decoding steps should this input get?” is a vendor-side problem. The vendors trained on this signal for years, then exposed it as a knob and handed the calibration to you before the research community had a reliable automatic answer.

They handed it to you with the instrumentation broken by design.

You are billed for the full internal thinking tokens, not the summarized tokens you can read. The billed output count will not match the visible token count, and you have to read it out of a separate field: output_tokens_details.thinking_tokens on Anthropic, reasoning_tokens on OpenAI, thoughtsTokenCount on Gemini.¹⁵ Google says it directly: “pricing is based on the full thought tokens the model needs to generate to create a summary, despite only the summary being output.”³
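
Pulling the hidden count out of a usage payload might look like the sketch below. The field names are the ones listed above. Everything else is an assumption: the helper names are hypothetical, and because the exact nesting of each field is not pinned down here, the sketch searches the payload for the key rather than hard-coding a path.

```python
from typing import Any, Optional

# Field names as listed in the text above; where they sit in the payload is not assumed.
HIDDEN_TOKEN_FIELD = {
    "anthropic": "thinking_tokens",
    "openai": "reasoning_tokens",
    "gemini": "thoughtsTokenCount",
}


def find_key(obj: Any, key: str) -> Optional[int]:
    """Depth-first search for an integer value stored under `key` in nested dicts and lists."""
    if isinstance(obj, dict):
        value = obj.get(key)
        if isinstance(value, int):
            return value
        children = list(obj.values())
    elif isinstance(obj, list):
        children = obj
    else:
        return None
    for child in children:
        found = find_key(child, key)
        if found is not None:
            return found
    return None


def hidden_tokens(vendor: str, usage: dict) -> Optional[int]:
    """Return the billed-but-invisible token count, or None if the field is absent."""
    return find_key(usage, HIDDEN_TOKEN_FIELD[vendor])


# HYPOTHETICAL usage payload, shaped only for this example.
usage = {"output_tokens": 1200, "output_tokens_details": {"thinking_tokens": 900}}
print(hidden_tokens("anthropic", usage))
```

Logging this number next to the visible output length on every request is the cheapest way to see how much of each bill is reasoning you never read.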

On Opus 4.8 the thinking display defaults to omitted. You get an encrypted signature and no thinking text at all. Recovering even a summary requires explicit opt-in, and raw traces require contacting sales.¹⁶ You are paying for reasoning that is hidden from you by default.

The trace you can recover is not the computation you paid for. Anthropic’s docs state that the visible thinking is a summary generated by a different model, and that “the thinking model does not see the summarized output.”¹⁶ So the artifact you would use to debug or audit the reasoning was produced by a second model that never saw what the first model actually did. Faithfulness research adds the sharp edge: one study found 55.4% of misleading-hint cases showed “thinking-only divergence,” where a hint’s influence was acknowledged in the thinking but absent from the answer, alongside suppression strategies like laundered attribution and confabulated justification.¹⁷ Thinking tokens are not a ground-truth audit record.

This is why the burden lands hardest in the wrong place. To use the dial well you need domain eval sets at each level, which means the teams who can calibrate it are the teams that already had eval infrastructure, the ones who arguably needed the hand-holding least. Everyone else defaults to high and overpays.

The Builder’s View: When to Dial Up, When to Dial Down

None of the above means the knob is useless. It means you treat it as a real optimization with no free lunch. A few patterns hold up.

Match effort to the task, not to caution. For extraction, formatting, and classification, reach for minimal or low. OpenAI designed GPT-5’s minimal to run “with few or no reasoning tokens to minimize latency and speed time-to-first-token,” aimed squarely at deterministic lightweight work.¹⁸ Reserve high and above for genuinely frontier problems. Vellum’s practitioner reference puts reasoning models at 10 to 74 times the price of standard models before effort multiplies further, and explicitly says not to use high as a universal default.¹⁹

In agentic systems, assign effort to the hierarchy. Anthropic’s own pattern is to run leaf agents and subagents at low effort and the orchestrator at high or xhigh, because effort changes the number of steps an agent takes, not just its reasoning depth.⁸ The knob is also too coarse for fine-grained loops: the Ares paper showed lightweight per-step routers cutting reasoning tokens up to 52.7% versus fixed high effort with minimal loss in task success.²⁰ The vendor dial captures the crude version of a control problem that finer routing does better.
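
The hierarchy pattern reduces to a lookup table that is decided once, up front. A minimal sketch, assuming hypothetical role names and the output_config.effort shape described earlier:

```python
# ASSUMPTION: role names are hypothetical; tiers follow the
# leaf-low / orchestrator-high pattern described above.
ROLE_EFFORT = {
    "orchestrator": "high",
    "subagent": "low",
    "leaf": "low",
}


def effort_for_role(role: str) -> str:
    """Look up the tier for a role; unknown roles fail loudly instead of inheriting a default."""
    try:
        return ROLE_EFFORT[role]
    except KeyError:
        raise ValueError(f"no effort tier assigned for role {role!r}") from None


def request_fragment(role: str) -> dict:
    """Build the effort portion of a request for the given role."""
    return {"output_config": {"effort": effort_for_role(role)}}
```

Raising on an unknown role is a design choice: a silent fallback would put every new agent type on whatever tier the fallback names, which is the pay-by-default behavior the table exists to prevent.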

Then there is caching, which the dial quietly fights. Switching between adaptive and enabled or disabled thinking modes breaks prompt-cache breakpoints, and switching between fast and standard speed invalidates the cache too.¹⁶ Vary effort per tool call in an agentic loop and you can cascade cache invalidation, a cost multiplier that eats the savings you reached for. You can spend less on thinking and more re-warming a cache you just busted.

One more dimension sits orthogonal to all of this. Anthropic’s fast mode (speed: "fast") is a separate axis from effort: up to 2.5x output tokens per second at 2x the price on Opus 4.8 ($10/$50 per million tokens versus $5/$25 standard), currently a research preview behind a waitlist.²¹ It is the same model with a faster inference configuration, with “no change to intelligence or capabilities,” and the speedup is in output tokens per second, not time-to-first-token.²² The full design matrix is model times effort times speed. You can run fast mode at any effort, which is useful for latency-sensitive paths but does nothing for the quality dial.

Pulling the task-matching advice together, a router can start as small as this:

def choose_effort(task_kind: str) -> str:
    # extraction / formatting / classification: cheapest tier
    if task_kind in {"extract", "format", "classify"}:
        return "low"
    # genuinely frontier reasoning: pay for it, deliberately
    if task_kind in {"proof", "multi_step_math", "deep_refactor"}:
        return "high"
    # default to a middle tier you have actually eval'd,
    # never to the vendor default out of caution
    return "medium"

The budget_tokens Trap

If you built against an explicit token budget, the migration is already biting. Anthropic deprecated budget_tokens on Opus 4.6 and Sonnet 4.6, and on Opus 4.7 and 4.8 a manual budget_tokens returns a 400 error. Adaptive thinking is the only supported mode there.¹⁶ The effort enum that replaced it gives no ceiling. If you used budget_tokens as a hard cost cap (“never spend more than 8K thinking tokens”), there is no direct replacement. You combine max_tokens with an effort level and accept that the model may undershoot or overshoot. Workloads that needed predictable latency SLAs lost their clean lever, and Anthropic’s docs concede the change.
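
One way to live without the hard cap is to set a ceiling with max_tokens and then measure how often the old budget would have been exceeded. The sketch below assumes max_tokens bounds total output rather than thinking alone, and it treats the former budget as a soft threshold you monitor. The helper names and sample counts are hypothetical.

```python
def build_request_fragment(effort: str, max_tokens: int) -> dict:
    """Combine an output ceiling with an effort level, per the migration described above.

    ASSUMPTION: max_tokens limits total output, so it is a blunt ceiling,
    not a replacement for a thinking-only budget.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    return {"max_tokens": max_tokens, "output_config": {"effort": effort}}


def overshoot_rate(thinking_token_counts: list[int], soft_cap: int) -> float:
    """Fraction of observed requests whose thinking tokens exceeded the old budget."""
    if not thinking_token_counts:
        raise ValueError("need at least one observed request")
    over = sum(1 for n in thinking_token_counts if n > soft_cap)
    return over / len(thinking_token_counts)


# HYPOTHETICAL observed thinking-token counts against a former 8K budget.
observed = [4000, 9000, 7000, 12000]
print(overshoot_rate(observed, soft_cap=8000))  # 0.5
```

The overshoot rate does not restore the cap. It tells you how far the enum has drifted from the number you used to enforce, which is the figure a latency or cost SLA now has to be renegotiated around.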

That is the leaky abstraction in miniature. A precise, builder-controlled number got replaced by a vendor-controlled bucket (the same integer-to-enum shift Google made), and the control you thought you had turned out to be a preference the model is free to reinterpret.

What This Changes About How You Build

The practical fallout lands in four places: evals, budgeting, agent design, and portability.

Run evals at every effort level you deploy. A benchmark at high tells you nothing about medium, where the model may skip thinking and answer shallowly. Without a domain eval set you cannot tell a better answer from a longer one that merely looks thorough. The dial rewards teams with eval infrastructure, so build it before you trust the knob.
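
A per-level eval sweep does not need much machinery. The sketch below is a minimal harness under stated assumptions: call_model is a hypothetical callable you supply (prompt and effort in, answer text out), and grading is exact string match, which only suits tasks with a single canonical answer.

```python
from typing import Callable, Iterable


def sweep_effort(
    call_model: Callable[[str, str], str],
    eval_set: Iterable[tuple[str, str]],
    levels: Iterable[str],
) -> dict[str, float]:
    """Run the same eval set at each effort level and return accuracy per level."""
    items = list(eval_set)
    if not items:
        raise ValueError("eval set is empty")
    accuracy = {}
    for level in levels:
        correct = sum(
            1 for prompt, expected in items
            if call_model(prompt, level).strip() == expected
        )
        accuracy[level] = correct / len(items)
    return accuracy


# Stand-in model for demonstration only: it gets the second item right only at "high".
def fake_model(prompt: str, effort: str) -> str:
    if prompt == "2+2":
        return "4"
    return "56" if effort == "high" else "54"


print(sweep_effort(fake_model, [("2+2", "4"), ("7*8", "56")], ["low", "high"]))
# {'low': 0.5, 'high': 1.0}
```

Swap the stub for a real client and the exact-match check for a grader that fits your domain. The part worth keeping is the loop over levels, so no tier ships on the strength of another tier's score.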

Budget per effort level, not per model. The same model at xhigh versus low can differ by a factor of five to ten in tokens on the same prompt, and because effort is a behavioral signal, individual-request counts vary. Cost forecasting becomes a sampling problem, not a multiplication. And the hidden costs do not show up on the invoice line at all: energy research found median per-query energy rising roughly 13x at 15x token usage, with even 10% long-reasoning traffic able to more than double data-center load.²³ None of that appears in effort: high.
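
Treating the forecast as a sampling problem can be as plain as logging billed output tokens for a sample of real requests at one effort level and summarizing the distribution. A rough sketch, with a hypothetical function name and hypothetical sample values; the $25 per million output tokens in the usage line is the standard Opus 4.8 rate quoted earlier.

```python
import statistics


def forecast_cost(
    sampled_output_tokens: list[int],
    usd_per_million_output: float,
    requests: int,
) -> dict[str, float]:
    """Estimate spend from sampled billed output-token counts at a single effort level."""
    if len(sampled_output_tokens) < 2:
        raise ValueError("need at least two samples")
    mean_tokens = statistics.fmean(sampled_output_tokens)
    # quantiles(n=20) returns 19 cut points; the last is the 95th percentile.
    p95_tokens = statistics.quantiles(sampled_output_tokens, n=20)[-1]
    usd_per_token = usd_per_million_output / 1_000_000
    return {
        "expected_usd": mean_tokens * usd_per_token * requests,
        "p95_request_usd": p95_tokens * usd_per_token,
    }


# HYPOTHETICAL sample of billed output tokens at one effort level.
sample = [900, 1100, 1000, 4800, 950, 1200, 1050, 980, 7600, 1020]
print(forecast_cost(sample, usd_per_million_output=25.0, requests=100_000))
```

Run it once per effort level you deploy. The gap between the expected figure and the 95th-percentile request is the part a flat per-model multiplier hides.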

Design agent hierarchies before deploying. Effort alters tool-call frequency and verbosity, so it changes the number of steps an agent takes, which cascades into latency and upstream API calls non-linearly. Decide which tier each role in your hierarchy runs at, and account for cache invalidation when roles switch modes mid-loop.

Accept that there is no portable abstraction. If you route across vendors, write the mapping shim explicitly and document that the semantics are approximate. A continuous budget and a vendor bucket will never line up cleanly, and treating them as equivalent is how cost forecasts blow up.

Conclusion

Inference-time effort control is a real product surface sitting on top of a real engineering shift, and the cost/quality optimization it exposes is worth doing. That part is not marketing. What is marketing is the word “control.”

Key Takeaways

Effort is soft guidance, not a budget. You submit a preference the model is free to reinterpret, and Anthropic’s version changes tool-call behavior, not just reasoning depth.
More thinking is not monotonically better. The curve is an inverted U; reserve high effort for frontier tasks and expect equal-or-better results at minimal/low for extraction, formatting, and classification.
Observability is broken by design. You are billed on hidden internal tokens, the visible trace is a summary from a different model, and you cannot verify what the extra effort bought.
The calibration burden was offloaded to you without the tooling. Run evals at every level you deploy, budget per level, and treat cost-per-correct-answer as the metric.
There is no portable cross-vendor abstraction, and migrations like Anthropic’s budget_tokens deprecation leave hard cost caps with no clean replacement.

The takeaway for builders is not “use the dial.” It is that the dial is a genuine optimization the vendor decided you should own, the slider lies about being linear, and the instrumentation you would need to do the job right is hidden from you on purpose. Treat the knob the way you would treat any preference you pass to a system you do not control: measure what comes back, never trust the label, and budget for the version that ignores you.

Footnotes

1. OpenAI. “Reasoning models.” OpenAI API docs. developers.openai.com. o3-mini launched in the API January 31, 2025; o3 and o4-mini followed April 16, 2025.
2. Anthropic. “API release notes.” Effort control entered public beta November 24, 2025 (Opus 4.5, header effort-2025-11-24) and reached GA without a beta header on February 5, 2026 (Opus 4.6). platform.claude.com.
3. Google. “Gemini thinking.” Gemini API docs. ai.google.dev. Covers thinkingBudget integer semantics, per-model ranges, the Gemini 2.5 Pro non-disable constraint, the Gemini 3 thinkingLevel enum, the minimal-is-not-off caveat, and the overflow/underflow and full-thought-token billing warnings.
4. OpenAI. “OpenAI o3-mini.” January 31, 2025. openai.com. Reports low effort approximating o1-mini, medium approximating o1, and high outperforming both.
5. Towards Data Science. “Inference Scaling and Test-Time Compute: Why Reasoning Models Raise Your Compute Bill.” towardsdatascience.com. Source of the $741K routing-savings figure and the P95 latency and invisible-token observations.
6. Willison, S. “The year in LLMs.” December 31, 2025. simonwillison.net. On reasoning dials becoming common and the RLVR grounding.
7. Geodesic Capital. “Test-Time Compute: Thinking Fast and Slow.” geodesiccap.com. Investor framing and the 75%-of-compute-by-2030 projection.
8. Anthropic. “Effort.” Claude API docs. platform.claude.com. output_config.effort shape, level set including xhigh and max, high default, behavioral-signal language, the all-tokens behavior, the max-overthinking warning, and the leaf-low/orchestrator-high agentic pattern.
9. Apple Machine Learning Research. “The Illusion of Thinking.” machinelearning.apple.com. Preprint, not peer-reviewed.
10. VentureBeat. “Do reasoning models really ‘think’ or not? Apple research sparks lively debate.” venturebeat.com.
11. Digital Applied. “Reasoning Effort: Cost vs Quality Benchmarks 2026.” digitalapplied.com. Source of the 4-17x cost spread, 18-90s TTFT, AIME +18-22 points, the medium-to-high code regression, and the cost-per-correct-answer framing.
12. “When More Thinking Hurts.” arXiv:2604.10739. Preprint, not peer-reviewed.
13. “OptimalThinkingBench.” arXiv:2508.13141. Preprint, not peer-reviewed.
14. The Register. “Alibaba splits Qwen3 hybrid thinking into separate models.” July 31, 2025. theregister.com. Source of the 2.8x AIME25 figure and the quality-over-unification quote.
15. Anthropic. “Extended thinking.” Claude API docs. platform.claude.com. Billing on full internal thinking tokens and the per-vendor observability fields.
16. Anthropic. “Adaptive thinking.” Claude API docs. platform.claude.com. Opus 4.7/4.8 adaptive-only mode, budget_tokens 400 error, the 4.6/Sonnet 4.6 deprecation, display-omitted default, the summary-from-a-different-model detail, and cache-breakpoint invalidation on mode switch.
17. “Reasoning faithfulness under misleading hints.” arXiv:2603.26410. Preprint, not peer-reviewed. Source of the 55.4% thinking-only-divergence figure.
18. OpenAI. “GPT-5 new params and tools.” OpenAI Cookbook. developers.openai.com. Definition of minimal effort and the medium default.
19. Vellum. “Reasoning effort.” vellum.ai. The 10-74x pricing range and the guidance-not-default framing.
20. “Ares: adaptive per-step effort routing.” arXiv:2603.07915. Preprint, not peer-reviewed. Source of the 52.7% reasoning-token reduction.
21. Anthropic. “Fast mode.” Claude API docs. platform.claude.com. speed: "fast", 2.5x output tokens per second, the $10/$50 versus $5/$25 pricing, and the research-preview/waitlist status.
22. Anthropic. “Fast mode.” As above: “no change to intelligence or capabilities,” and the output-tokens-per-second (not time-to-first-token) clarification.
23. “Energy cost of extended reasoning.” arXiv:2509.20241. Preprint, not peer-reviewed. Source of the ~13x median per-query energy at 15x tokens and the data-center load figure.
