Thinking Tokens Are the New Denial-of-Wallet Attack Surface
Published 2026-03-31 on infosecwriteups.com

Raviteja Nekkalapu


Last month I burned through $340 in API credits in a single day. Not because I was doing anything fancy. I was running a batch of security classification prompts through Gemini 3 Flash, and I picked it because the pricing page said it was cheap.

On paper, it was 78% cheaper than GPT-5.2. In practice, the invoice told a completely different story.

That experience sent me down a rabbit hole that ended at a research paper from Stanford and UC Berkeley (Chen et al., 2026). The findings are uncomfortable for anyone building on top of reasoning model APIs, and flat out alarming if you think about them through a security lens.

What Happened to My API Bill

Before I explain the research, I want to walk through what happened to me, because it maps perfectly onto the paper’s findings.

I had a classification pipeline. Take a chunk of text, determine if it contains PII, classify the sensitivity level, suggest a redaction strategy. Straightforward stuff. I built it on Gemini 3 Flash because the listed price was $0.50 per million input tokens and $3.00 per million output tokens. Compare that to GPT-5.2 at $1.75 and $14.00 respectively.

No contest, right?

Wrong.

When I pulled the usage breakdown the next day, Gemini 3 Flash had consumed over 40 million thinking tokens on about 8,000 queries. Thinking tokens are the internal chain-of-thought tokens that reasoning models generate before they give you their final answer. You never see them in the response. But you pay for them at the output token rate.

GPT-5.2, which I ran as a comparison on the same queries, used about 4.5 million thinking tokens for the same batch.

Nine times fewer.

The model that was supposed to be cheap was actually 22% more expensive overall. And the only reason was thinking tokens.
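To make the arithmetic concrete, here is a minimal sketch of how thinking tokens enter the bill. The prices are the listed rates quoted above; the zero input and visible-output figures are deliberate, to isolate the thinking-token line item (the full input/output counts for my batch aren't broken out here):

```python
def llm_cost_usd(input_tokens, visible_output_tokens, thinking_tokens,
                 input_price, output_price):
    """Thinking tokens are billed at the output rate even though
    they never appear in the response."""
    billed_output = visible_output_tokens + thinking_tokens
    return (input_tokens * input_price + billed_output * output_price) / 1_000_000

# Thinking-token spend alone, using the batch numbers from this article:
gemini_thinking = llm_cost_usd(0, 0, 40_000_000, 0.50, 3.00)   # $120.00
gpt_thinking    = llm_cost_usd(0, 0, 4_500_000, 1.75, 14.00)   # $63.00
```

At the listed rates, Gemini 3 Flash's 40 million hidden tokens alone cost roughly $120 against GPT-5.2's $63, despite the 78% cheaper sticker price.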

The Research That Explains It

The paper is titled “The Price Reversal Phenomenon” and it studied 8 frontier reasoning models across 9 different task categories. Competition math, code generation, science QA, open-ended chat, knowledge-intensive retrieval. A proper spread.

Their central finding is that in 21.8% of all model pair comparisons, the cheaper-listed model actually costs more in practice.

One in five. That is not a rounding error.

Some of the specific reversals are wild:

- Gemini 3 Flash is listed at 78% cheaper than GPT-5.2. Actual cost across all tasks? 22% higher.

- Claude Opus 4.6 is listed at 2x the price of Gemini 3.1 Pro. Actual cost? 35% lower.

- Worst-case reversal on a single benchmark (MMLU-Pro): Gemini 3 Flash was listed as 1.7x cheaper than Claude Haiku 4.5 but cost 28x more in practice.

28 times. That was not a typo in the paper. I checked twice.

Why This Is a Security Problem

If the gap between listed price and actual cost can be 28x, and the primary driver is thinking token consumption, and thinking token consumption varies by up to 900% across models on the same query, then an attacker who understands this can weaponize it.

This is not a theoretical concern. OWASP already lists “Unbounded Consumption” as LLM10 in the 2025 Top 10 for LLM Applications.

Denial-of-wallet attacks specifically targeting API billing are a recognized threat category. But the existing guidance assumes you can estimate costs from the pricing page and set budgets accordingly.

The pricing reversal phenomenon breaks that assumption.

Building a Quick Cost Monitor

After reading the paper, I wrote a small wrapper to track actual token consumption per request. Nothing complicated, but it saved me from another surprise invoice.

```python
import time


class TokenCostTracker:
    """Track actual vs expected costs per API call.

    `client` is any OpenAI-compatible client passed into call_with_tracking.
    """

    # listed prices, USD per million tokens
    PRICING = {
        "gpt-5.2": {"input": 1.75, "output": 14.00},
        "gemini-3-flash": {"input": 0.50, "output": 3.00},
        "claude-opus-4.6": {"input": 15.00, "output": 75.00},
        "claude-haiku-4.5": {"input": 0.80, "output": 4.00},
    }

    def __init__(self, model_name, alert_threshold=3.0):
        self.model = model_name
        self.threshold = alert_threshold
        self.call_log = []

    def estimate_cost(self, prompt_tokens):
        """What you THINK it'll cost based on listed pricing."""
        p = self.PRICING[self.model]
        # assume output roughly equals input (rough heuristic)
        expected_output = prompt_tokens
        return (prompt_tokens * p["input"] + expected_output * p["output"]) / 1_000_000

    def actual_cost(self, usage):
        """What it ACTUALLY costs based on real token counts."""
        p = self.PRICING[self.model]
        input_cost = usage.prompt_tokens * p["input"] / 1_000_000
        # output tokens include thinking tokens for reasoning models
        output_cost = usage.completion_tokens * p["output"] / 1_000_000
        return input_cost + output_cost

    def call_with_tracking(self, client, messages):
        # ~1.3 tokens per word is a crude but serviceable prompt estimate
        estimated = self.estimate_cost(
            sum(len(m["content"].split()) * 1.3 for m in messages)
        )
        response = client.chat.completions.create(
            model=self.model,
            messages=messages,
        )
        real = self.actual_cost(response.usage)
        ratio = real / estimated if estimated > 0 else float("inf")

        record = {
            "timestamp": time.time(),
            "prompt_tokens": response.usage.prompt_tokens,
            "completion_tokens": response.usage.completion_tokens,
            "estimated_usd": round(estimated, 6),
            "actual_usd": round(real, 6),
            "ratio": round(ratio, 2),
        }

        # pull thinking tokens if the API exposes them
        details = getattr(response.usage, "completion_tokens_details", None)
        if details is not None and hasattr(details, "reasoning_tokens"):
            record["thinking_tokens"] = details.reasoning_tokens

        self.call_log.append(record)

        if ratio > self.threshold:
            print(f"[COST ALERT] {self.model}: actual cost is "
                  f"{ratio:.1f}x the estimate. "
                  f"Thinking tokens: {record.get('thinking_tokens', 'N/A')}")
        return response

    def summary(self):
        if not self.call_log:
            return "No calls recorded."
        total_est = sum(r["estimated_usd"] for r in self.call_log)
        total_real = sum(r["actual_usd"] for r in self.call_log)
        avg_ratio = sum(r["ratio"] for r in self.call_log) / len(self.call_log)

        return (
            f"Model: {self.model}\n"
            f"Calls: {len(self.call_log)}\n"
            f"Total estimated: ${total_est:.4f}\n"
            f"Total actual: ${total_real:.4f}\n"
            f"Avg cost ratio: {avg_ratio:.2f}x"
        )
```

The key insight is the ‘ratio’ field. If actual cost consistently exceeds 3x the estimate, something is off. Either the model is overthinking your queries, or you picked the wrong model for your workload. Both are problems.

How an Attacker Exploits This

Let us go through a concrete attack scenario.


Let’s say you run a customer-facing application that lets users submit text for analysis. Maybe it is a compliance checker, maybe a document summarizer, maybe a chatbot.

You picked a reasoning model because you need accurate results. You set a budget based on the listed API price and your expected traffic.

An attacker who understands thinking token heterogeneity can craft inputs that force the reasoning model into deep deliberation. Not through prompt injection (that is a different problem). Through legitimate-looking queries that happen to trigger expensive reasoning paths.

```python
def generate_expensive_queries():
    """
    Queries that look normal but trigger heavy thinking.
    These are NOT prompt injections. They are legitimate inputs
    that exploit how reasoning models allocate compute.
    """
    patterns = [
        # ambiguous classification forces the model to consider
        # many possible interpretations before settling
        "The patient record shows Smith, J. visited on 3/15. "
        "Address listed as 'same as employer'. "
        "Phone: see file. SSN: on record.",
        # nested conditional logic triggers extended reasoning
        "Classify this: IF the document contains financial data "
        "AND the sender is external BUT the recipient has "
        "clearance level 3 OR above, AND the timestamp falls "
        "within the audit window, THEN determine sensitivity. "
        "Document follows: quarterly projections show…",
        # contradictory signals make the model deliberate longer
        "This message is marked CONFIDENTIAL in the header "
        "but was posted on a public forum. The content discusses "
        "internal pricing but uses only publicly available "
        "figures. Classify appropriately.",
    ]
    return patterns
```

None of those are malicious payloads in the traditional sense.

No SQL injection, no XSS, no shell commands.

They are valid inputs that a reasonable user might submit. But they share a property: they force the reasoning model to consider multiple competing hypotheses, which drives up thinking token consumption.

The paper found that thinking token consumption varies by up to 9.7x for the same query across repeated runs. That is within-query variance, meaning even if you run the exact same input again, the cost fluctuates wildly. An attacker does not even need to be precise. Just sending queries that tend to trigger deeper reasoning is enough to blow through budgets.
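Back-of-envelope, the amplification math is brutal. A hypothetical sketch (the $0.005 per-query baseline and $500 monthly budget are assumptions for illustration; 9.7x is the paper's max-to-min cost ratio):

```python
def queries_to_exhaust_budget(budget_usd, baseline_cost_usd, amplification):
    """How many requests it takes to drain a budget when each crafted
    query costs `amplification` times the baseline per-query cost."""
    per_query = baseline_cost_usd * amplification
    return budget_usd / per_query

normal = queries_to_exhaust_budget(500, 0.005, 1)    # ~100,000 queries
attack = queries_to_exhaust_budget(500, 0.005, 9.7)  # ~10,300 queries
```

An attacker needs roughly a tenth of the traffic you budgeted for, and every one of those requests looks like a legitimate query in your logs.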

The Irreducible Problem

One finding from the paper that stuck with me is about irreducible variance. The researchers ran the same AIME math problems through GPT-5.2, GPT-5 Mini, and Gemini 3 Flash 6 times each.

Same prompts, same parameters, same everything.

The per-query cost fluctuated with an average coefficient of variation of 0.29 across all three models. GPT-5 Mini was the worst offender, with max-to-min cost ratios reaching 9.7x on the same query.

This means that even a perfect cost prediction system would still be wrong by an average of 29% on any given request. There is no fixing this with better estimation. It is inherent to how these models reason.

```python
import json

import numpy as np


def simulate_cost_variance(base_cost, cv=0.29, num_runs=100):
    """
    Simulate the irreducible cost variance documented in the paper.
    cv=0.29 is the average coefficient of variation they measured.
    """
    costs = np.random.lognormal(
        mean=np.log(base_cost),
        sigma=cv,
        size=num_runs,
    )
    return {
        "mean": round(float(np.mean(costs)), 4),
        "min": round(float(np.min(costs)), 4),
        "max": round(float(np.max(costs)), 4),
        "max_min_ratio": round(float(np.max(costs) / np.min(costs)), 2),
        "std": round(float(np.std(costs)), 4),
    }


# a query that should cost $0.005 based on listed pricing
result = simulate_cost_variance(0.005)
print(json.dumps(result, indent=2))
# you might see max/min ratios of 2–3x easily
```

For budget planning, this means your safety margin needs to be much wider than you thought.

A 2x buffer?

Probably not enough for reasoning-heavy workloads.
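One way to size the buffer: model per-query cost as lognormal with sigma roughly equal to the measured coefficient of variation (the same rough assumption as the simulation above) and look at a high percentile:

```python
import math

def budget_multiplier(cv, z):
    """Rough per-request cost multiplier at a given z-score, modeling
    per-query cost as lognormal with sigma ~= cv. z=2.33 is roughly
    the 99th percentile of a standard normal."""
    return math.exp(z * cv)

p99_buffer = budget_multiplier(0.29, 2.33)  # ~1.97x the median per-query cost
```

And note what this does not cover: the multiplier only absorbs run-to-run variance around the model's true mean cost. The price-reversal gap between listed and actual pricing, which the paper shows can reach 28x, sits on top of it.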

Defensive Measures That Actually Help

Based on the paper’s findings and my own experience, some things actually make a difference.

Track thinking tokens per request, not just total output tokens - Most API responses now include a breakdown. If your provider does not expose this, you can infer it by comparing ‘completion_tokens’ with the visible response length. Log it. Alert on it.
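If your provider only reports a flat ‘completion_tokens’ count, a rough inference sketch looks like this (the 4-characters-per-token ratio is a common English-text heuristic, not an exact tokenizer):

```python
def infer_thinking_tokens(completion_tokens, visible_text, chars_per_token=4):
    """Estimate hidden thinking tokens when the API gives no breakdown:
    approximate visible tokens from the response length and attribute
    the remainder of the billed completion tokens to hidden reasoning."""
    visible_estimate = max(1, len(visible_text) // chars_per_token)
    return max(0, completion_tokens - visible_estimate)

# e.g. 5,000 billed completion tokens but only ~800 chars of visible text
hidden = infer_thinking_tokens(5_000, "x" * 800)  # 4800
```

It is a crude estimate, but a response billed at 5,000 tokens that is only a paragraph long is exactly the signal you want an alert on.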

Set per-request cost ceilings, not just monthly budgets - A monthly budget of $500 means nothing if a single batch burns through it in an hour. The cost tracker I showed above does per-call alerting, which catches problems before they compound.

Benchmark your actual workload before committing to a model - The paper showed that cost rankings are task-dependent. GPT-5.2 is cheaper than Claude Opus 4.6 on AIME math problems, but more expensive on SimpleQA. There is no universally cheapest model. You have to test with YOUR data.

Consider routing - If some of your queries are simple and some are complex, route them to different models. Use a cheap non-reasoning model for straightforward classification and save the reasoning model for cases that genuinely need it.

The paper’s cost decomposition shows that thinking tokens are 80–97% of output tokens depending on the model. Avoiding thinking tokens on simple queries is the single biggest cost lever you have.

```python
def route_query(query, complexity_score):
    """
    Simple cost-aware routing.
    Below threshold: use a non-reasoning model (no thinking tokens).
    Above threshold: use a reasoning model.
    """
    THRESHOLD = 0.6
    if complexity_score < THRESHOLD:
        return {
            "model": "gpt-4.1-mini",  # non-reasoning, predictable cost
            "reason": "low complexity, skip reasoning overhead",
        }
    else:
        return {
            "model": "gpt-5.2",  # reasoning enabled
            "reason": "high complexity, reasoning justified",
        }
```

What the Industry Needs to Do

The paper’s authors call for per-request cost breakdowns and cost estimation APIs. I agree, but I would go further. Reasoning model providers should let users set thinking token budgets per request. Some already do (OpenAI’s ‘reasoning_effort’ parameter, for example). But this should be standard, not optional.
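As a sketch of what a per-request thinking budget looks like with OpenAI-style clients today (the model name follows this article and is an assumption; ‘reasoning_effort’ accepts values like "low" and "high" on reasoning models):

```python
def build_request(messages, complexity_score):
    """Build chat-completion kwargs that cap reasoning spend per request
    via the `reasoning_effort` parameter, scaled to query complexity."""
    kwargs = {"model": "gpt-5.2", "messages": messages}
    kwargs["reasoning_effort"] = "low" if complexity_score < 0.6 else "high"
    return kwargs

# client.chat.completions.create(**build_request(messages, score))
```

The point is less the specific parameter than the shape of the control: a hard, caller-side cap on how much hidden compute any single request is allowed to buy.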

From a security standpoint, the industry needs to treat unpredictable API costs the same way we treat unpredictable resource consumption in cloud environments. We learned that lesson with Lambda functions and S3 egress. We will learn it again with thinking tokens. The question is how many surprise invoices it takes.

References
“The Price Reversal Phenomenon: When Cheaper Reasoning Models End Up Costing More” (Chen, Zhang, He, Stoica, Zaharia, and Zou, 2026) is available at [arxiv.org/abs/2603.23971]. The authors released their data and analysis code at [github.com/lchen001/pricing-reversal], which is worth pulling if you want to run the cost comparisons on your own workloads.


Source: https://infosecwriteups.com/thinking-tokens-are-the-new-denial-of-wallet-attack-surface-d23394007577?source=rss----7b722bfd1b8d---4