ARTICLES · 2026-06-05 · BY EFFLOOW CONTENT FACTORY

BAGEN: LLM Agents Waste 44% of Tokens on Tasks They'll Fail

Frontier LLM agents waste 28–64% of tokens on doomed tasks, BAGEN reveals. Learn how budget interval estimation and early stopping work in practice.

llm-agents token-optimization paper-poc ai-cost agent-reliability budget-awareness

You're paying for every token your agent burns. And according to new research from Northwestern, Stanford, Cornell, and All Hands AI, a large share of that spend goes directly to waste — on trajectories the agent was never going to complete successfully.

The paper is BAGEN: Are LLM Agents Budget-Aware? (arXiv:2606.00198, submitted May 29, 2026). Its core question is simple: can frontier LLM agents predict when they're about to run out of runway? The answer, across five frontier models and four environments, is a firm no.

This article covers what BAGEN found and how budget-aware interval estimation works, then walks through an Effloow Lab PoC that reproduces the key dynamics using only the Python standard library (no API keys, no GPU).

Why This Matters for Production Agent Systems

Token budgets are a real constraint in every deployed agent system. You set a max_tokens limit, you watch the cost dashboard, and you assume the agent will either finish the task or hit the hard wall. What BAGEN documents is a third case that developers rarely account for: the agent continues consuming tokens on a task it cannot complete, all the way to the limit.

The mechanism is predictable. Unsolvable tasks tend to produce backtracking behavior — more tool calls per step, increasing per-step token costs, no convergence signal. A budget-aware agent should detect that pattern early and stop (or alert a human). Frontier models, as BAGEN shows, don't.

The practical consequence:

Failed tasks consume the same (or more) tokens as successful ones
You pay for the full trajectory even when step 2 already signals doom
At scale, this becomes a significant line item in your inference budget

BAGEN's headline number: early stopping on failed trajectories saves 28–64% of tokens versus running to completion. That's not a micro-optimization. That's a structural cost reduction available to any developer who builds the right wrapper.

What Budget-Awareness Actually Means

BAGEN distinguishes two budget types that agents encounter in practice:

Internal budgets come from agent computation itself — how many tokens the agent is burning. The environments used here include:

Sokoban: A puzzle-solving benchmark where agents must push boxes onto targets. Token consumption per step signals whether the agent is making progress or backtracking.
Search-R1: A search-and-retrieval environment where steps involve tool calls against a retrieval index.
SWE-bench: Software engineering tasks requiring multi-file code modifications and test verification.

External budgets come from the downstream effects of agent actions:

Supply-chain environment (curated from real enterprise data): agents manage cost, time, and warehouse occupancy simultaneously. Those resources are the external budget; token consumption is the internal one.

This two-axis framing is useful because many developers think only about token cost (internal) and ignore external resource consumption (money spent via tool calls, API charges from agent actions, storage consumed). BAGEN shows agents are over-optimistic on both dimensions.
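One way to keep both axes visible in your own code is to track them side by side. The sketch below is illustrative only: `DualBudget`, its field names, and the dollar figures are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass
class DualBudget:
    """Tracks an internal (token) budget and an external (spend) budget together."""
    max_tokens: int            # internal budget: tokens the agent may consume
    max_spend: float           # external budget: e.g. dollars committed by tool calls
    tokens_used: int = 0
    spend_used: float = 0.0

    def record(self, tokens: int, spend: float = 0.0) -> None:
        """Add one step's token usage and any external cost it caused."""
        self.tokens_used += tokens
        self.spend_used += spend

    def exhausted(self) -> list[str]:
        """Return the names of the budgets that have been exceeded."""
        over = []
        if self.tokens_used > self.max_tokens:
            over.append("internal")
        if self.spend_used > self.max_spend:
            over.append("external")
        return over


budget = DualBudget(max_tokens=1000, max_spend=50.0)
budget.record(tokens=400, spend=30.0)
budget.record(tokens=300, spend=25.0)
print(budget.exhausted())  # the token budget still has room; the spend budget does not
```

Here the agent is well inside its token limit but has already overspent externally, which a token-only dashboard would not show.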

The Three Sub-Capabilities BAGEN Measures

Budget-awareness in BAGEN decomposes into three measurable abilities:

1. Feasibility Prediction

Can the agent correctly estimate whether a task is solvable before starting it? At step zero, before any action is taken, can the model predict whether the trajectory will succeed or fail within budget?

Current frontier models perform poorly here. They tend to rate most tasks as feasible, regardless of how complex or resource-intensive the prompt suggests the task will be.
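To see why "rates almost everything as feasible" is a problem, it helps to score predictions on the infeasible tasks specifically. The following is a minimal sketch with made-up labels; `score_feasibility` and the 6/4 split are assumptions for illustration, not the paper's metric or data.

```python
def score_feasibility(predicted: list[bool], actual: list[bool]) -> dict:
    """Compare step-zero feasibility predictions against real outcomes."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    # Predictions made for the tasks that actually turned out infeasible
    on_infeasible = [p for p, a in zip(predicted, actual) if not a]
    caught = sum(1 for p in on_infeasible if not p)
    return {
        "accuracy": correct / len(actual) if actual else 0.0,
        "infeasible_recall": caught / len(on_infeasible) if on_infeasible else 0.0,
    }


# Hypothetical outcomes: 6 tasks were feasible, 4 were not.
actual = [True] * 6 + [False] * 4
# An over-optimistic predictor that says "feasible" every time.
always_feasible = [True] * 10

scores = score_feasibility(always_feasible, actual)
print(scores)
```

The always-optimistic predictor looks respectable on accuracy while catching none of the infeasible tasks, which is the failure that costs money.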

2. Early Failure Detection

As the agent proceeds through a task, can it detect failure signals and trigger an alert or stop? This is where the 28–64% savings come from. An agent that detects a doomed trajectory at step 3 of a 15-step rollout recovers most of that budget.

The evaluation methodology BAGEN uses is a rollout-replay protocol: the paper first collects unconstrained rollouts (agents run to completion with no budget pressure), then re-queries each agent on every prefix of that rollout, asking for a budget estimate and feasibility prediction at each intermediate step. This separates the estimation capability from the actual task performance.
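The replay idea can be sketched as a loop over prefixes of a finished rollout. This is not the paper's harness: `replay_prefixes`, `naive_estimator`, and the record fields are hypothetical names, and the estimator is a simple stand-in for re-querying the agent.

```python
from typing import Callable


def replay_prefixes(
    rollout: list[int],
    budget: int,
    estimator: Callable[[list[int], int], dict],
) -> list[dict]:
    """Query `estimator` on every prefix of a finished rollout.

    `rollout` holds the per-step token costs of an unconstrained run.
    """
    total = sum(rollout)
    records = []
    for k in range(len(rollout) + 1):  # k = 0 is the query before any action
        prefix = rollout[:k]
        consumed = sum(prefix)
        records.append({
            "step": k,
            "consumed": consumed,
            "true_remaining": total - consumed,  # ground truth from the full rollout
            "predicted": estimator(prefix, budget),
        })
    return records


def naive_estimator(prefix: list[int], budget: int) -> dict:
    """Stand-in for the agent's answer: assume 5 more steps at the mean cost so far."""
    mean_cost = sum(prefix) / len(prefix) if prefix else 0.0
    remaining = 5 * mean_cost
    return {"remaining": remaining, "feasible": sum(prefix) + remaining <= budget}


records = replay_prefixes([120, 150, 310, 400], budget=1500, estimator=naive_estimator)
for record in records:
    print(record["step"], record["consumed"], record["true_remaining"], record["predicted"])
```

Because the rollout is already complete, every prediction can be compared against a known `true_remaining`, independent of whether the task itself succeeded.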

3. Interval Calibration

Rather than asking for a point estimate ("I need X more tokens"), BAGEN asks for an interval: a lower and upper bound. An agent that says "I'll need between 800 and 1,200 tokens to finish" is far more useful than one that guesses a single number.

Interval coverage (the fraction of cases where the true token count falls within the predicted interval) caps at 47% after SFT+RL fine-tuning on the best-performing setup. That's low: more than half the time, the predicted interval misses the actual consumption. Interval estimation is genuinely hard, even for fine-tuned models.
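Coverage as defined above is straightforward to compute. The sketch below uses invented intervals and token counts purely for illustration; `interval_coverage` and `mean_width` are hypothetical helper names, and reporting width alongside coverage is a suggestion of ours, since very wide intervals would inflate coverage trivially.

```python
def interval_coverage(intervals: list[tuple[float, float]], actuals: list[float]) -> float:
    """Fraction of cases whose true value falls inside [lower, upper]."""
    if len(intervals) != len(actuals):
        raise ValueError("intervals and actuals must have the same length")
    if not intervals:
        return 0.0
    hits = sum(1 for (lower, upper), actual in zip(intervals, actuals)
               if lower <= actual <= upper)
    return hits / len(intervals)


def mean_width(intervals: list[tuple[float, float]]) -> float:
    """Average interval width; a sanity check against trivially wide intervals."""
    return sum(upper - lower for lower, upper in intervals) / len(intervals)


# Hypothetical predictions (lower, upper) and the tokens actually consumed.
predicted = [(800, 1200), (500, 700), (1000, 1500), (300, 400)]
actual = [1100, 950, 1400, 420]

coverage = interval_coverage(predicted, actual)
print(f"coverage={coverage:.0%}, mean width={mean_width(predicted):.0f} tokens")
```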

The Over-Optimism Problem: What BAGEN Found

The paper's most striking result is the low correlation between task performance and budget-awareness: r = 0.35 across the five frontier models. A model can score highly on the underlying task (SWE-bench resolution rate, puzzle solutions) while simultaneously being a poor predictor of its own resource usage.

Why? The paper attributes this to a training signal mismatch. LLMs are optimized to complete tasks — not to predict when they'll fail to complete tasks. Budget reasoning is a metacognitive skill that isn't directly rewarded in standard RLHF or instruction-following fine-tuning. Agents are implicitly trained to be optimistic because optimistic agents appear more capable on success-rate benchmarks.

The practical result is an agent that:

Underestimates remaining steps on hard tasks
Doesn't recalibrate its estimate as costs increase
Never triggers an early stop or user alert
Runs to the hard token limit, returns a failure, and leaves you with a full invoice

SFT+RL fine-tuning on BAGEN-specific trajectories does improve early stop and alert behavior. But the coverage cap suggests that the interval estimation problem may require architectural changes, not just fine-tuning.

Effloow Lab PoC: Simulating Interval Estimation

Effloow Lab reproduced the core BAGEN dynamics using a Python stdlib simulator. The goal was to demonstrate the estimator comparison without any LLM API calls or external packages.

Setup:

20 simulated agent trajectories (10 solvable, 10 unsolvable)
Budget: 1,500 tokens
Solvable tasks: 6–12 steps, 80–200 tokens/step
Unsolvable tasks: 10–18 steps, 150–350 tokens/step (higher variance, backtracking pattern)
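A trajectory generator matching the setup above might look like the following. It is a simplified sketch, not the PoC's actual code: it draws step counts and per-step costs uniformly from the stated ranges and does not model the rising-cost backtracking pattern, and `simulate_trajectory` and the seed are assumptions.

```python
import random

BUDGET = 1500


def simulate_trajectory(solvable: bool, rng: random.Random) -> list[int]:
    """Return per-step token costs drawn uniformly from the ranges in the setup."""
    if solvable:
        n_steps = rng.randint(6, 12)
        return [rng.randint(80, 200) for _ in range(n_steps)]
    n_steps = rng.randint(10, 18)
    return [rng.randint(150, 350) for _ in range(n_steps)]


rng = random.Random(0)  # fixed seed so reruns produce the same trajectories
labels = [True] * 10 + [False] * 10
trajectories = [(solvable, simulate_trajectory(solvable, rng)) for solvable in labels]

exceeded = sum(1 for _, costs in trajectories if sum(costs) > BUDGET)
print(f"{exceeded} of {len(trajectories)} trajectories exceed the {BUDGET}-token budget")
```

With these ranges an unsolvable trajectory needs at least 10 steps at 150 tokens or more, so it reaches or passes the 1,500-token budget almost by definition.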

Two estimators were compared:

Over-optimistic estimator (baseline, mimicking frontier model behavior):

def over_optimistic_estimator(consumed_so_far, max_budget, step, total_steps_estimate):
    lower = max(0, consumed_so_far * 0.8)
    upper = consumed_so_far * 1.1  # Only 10% more than current — very optimistic
    feasible = upper <= max_budget
    return {"lower": lower, "upper": upper, "feasible": feasible, "alert": False}

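This estimator assumes consumption will flatten out. Because its alert field is hard-coded to False, it fires zero alerts across all 20 trajectories by construction; it is a stand-in for the over-optimism the paper reports in frontier models, not independent evidence of it.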

BAGEN-style interval estimator (rolling cost + variance):

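import math

def bagen_estimator(consumed_so_far, max_budget, step, trajectory_so_far):
    step_costs = [t["cost"] for t in trajectory_so_far]
    if len(step_costs) < 2:
        # Assumed guard (not in the original snippet): with fewer than two steps
        # there is no spread to estimate, so return the widest interval and stay
        # silent. This matches the step-1 row of the sample output below.
        return {"lower": consumed_so_far, "upper": max_budget, "feasible": True, "alert": False}

    avg_cost = sum(step_costs) / len(step_costs)
    # Population variance of the per-step costs seen so far
    variance = sum((c - avg_cost)**2 for c in step_costs) / len(step_costs)
    std = math.sqrt(variance)

    # Detect increasing cost trend (unsolvable signal)
    recent_avg = sum(step_costs[-3:]) / min(3, len(step_costs))
    est_remaining = 8 if recent_avg > avg_cost * 1.1 else max(2, 10 - step)

    lower = consumed_so_far + est_remaining * max(0, avg_cost - std)
    upper = consumed_so_far + est_remaining * (recent_avg + std)

    feasible = lower <= max_budget
    alert = upper > max_budget * 1.15
    return {"lower": lower, "upper": upper, "feasible": feasible, "alert": alert}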

Results from the PoC run:

Experiment: 20 trials, budget=1500 tokens
Solvable tasks (n=10):   1 exceeded budget
Unsolvable tasks (n=10): 10 exceeded budget

Estimator Comparison:
  Over-optimistic (frontier): 0 alerts fired
  BAGEN-style estimator:      72 alerts total, 56 on unsolvable tasks

Early Stopping Simulation:
  Average savings on failed tasks: 44.6%
  Range: 40.9% – 48.7%

The step-by-step output on an unsolvable trajectory shows how quickly the rolling-cost estimator detects the pattern:

Step   Consumed      Lower      Upper   Alert?
--------------------------------------------------
   1        332        332       1500
   2        561       2393       3217  ⚠ ALERT
   3        813       2401       3019  ⚠ ALERT
   4       1134       2571       3002  ⚠ ALERT
   5       1450       2693       3139  ⚠ ALERT

The alert fires at step 2, when the upper bound already projects 3,217 tokens needed against a 1,500-token budget. A production system could halt, escalate to a human, or switch to a cheaper fallback at this point.
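The savings from halting at an alert can be computed with a few lines. This is a sketch under our own assumptions: the baseline is an agent that runs until it finishes or hits the hard wall, the step costs are invented, and `early_stop_savings` is a hypothetical helper, so the result is not comparable to the PoC figures above.

```python
def early_stop_savings(step_costs: list[int], stop_step: int, budget: int) -> float:
    """Fraction of tokens saved by halting after `stop_step` steps (1-indexed).

    Baseline: the agent runs until it finishes or exhausts `budget`.
    """
    if not 1 <= stop_step <= len(step_costs):
        raise ValueError("stop_step must be between 1 and len(step_costs)")
    baseline = min(sum(step_costs), budget)
    if baseline <= 0:
        return 0.0
    # Never count more spend than the baseline itself would have allowed
    spent = min(sum(step_costs[:stop_step]), baseline)
    return 1 - spent / baseline


# Hypothetical doomed trajectory that would otherwise run into a 1,500-token wall.
costs = [300, 320, 350, 380, 400]
savings = early_stop_savings(costs, stop_step=3, budget=1500)
print(f"Stopping after step 3 saves {savings:.1%} of the baseline spend")
```

Moving `stop_step` earlier or later shows how sensitive the savings are to when the alert fires.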

Lab note: This is a simulated environment. The paper evaluates on actual LLM agentic runs with real tool calls. Our PoC validates the statistical pattern, not specific model rankings.

How to Build This Into Your Agent System

The BAGEN insight translates directly into a production wrapper pattern. The idea is to run a lightweight interval estimator alongside your main agent loop, and trigger actions when the upper bound crosses a threshold.

Minimal budget guard implementation:

from collections import deque
import math

class BudgetGuard:
    def __init__(self, max_budget: int, alert_threshold: float = 1.15):
        self.max_budget = max_budget
        self.alert_threshold = alert_threshold
        self.step_costs = deque(maxlen=10)  # Rolling window
        self.total_consumed = 0

    def record_step(self, tokens_used: int) -> dict:
        self.step_costs.append(tokens_used)
        self.total_consumed += tokens_used
        return self._estimate()

    def _estimate(self) -> dict:
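        if len(self.step_costs) < 2:
            # Too little history to project a trend; return the same keys as the
            # full estimate so callers never hit a missing "upper_bound".
            return {
                "feasible": True,
                "alert": False,
                "lower_bound": self.total_consumed,
                "upper_bound": self.max_budget,
                "consumed": self.total_consumed,
            }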

        costs = list(self.step_costs)
        avg = sum(costs) / len(costs)
        variance = sum((c - avg) ** 2 for c in costs) / len(costs)
        std = math.sqrt(variance)

        recent_avg = sum(costs[-3:]) / min(3, len(costs))
        # Detect upward trend = backtracking/unsolvable signal
        est_remaining = 8 if recent_avg > avg * 1.1 else max(2, 12 - len(costs))

        lower = self.total_consumed + est_remaining * max(0, avg - std)
        upper = self.total_consumed + est_remaining * (recent_avg + std)

        return {
            "feasible": lower <= self.max_budget,
            "alert": upper > self.max_budget * self.alert_threshold,
            "lower_bound": int(lower),
            "upper_bound": int(upper),
            "consumed": self.total_consumed,
        }

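# Usage in an agent loop. `agent`, `count_tokens`, and `trigger_escalation`
# are placeholders for whatever your own framework provides.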
guard = BudgetGuard(max_budget=8000)

for step in agent.run():
    tokens_this_step = count_tokens(step.messages)
    status = guard.record_step(tokens_this_step)

    if status["alert"]:
        # Projected upper bound exceeds budget * alert_threshold, so intervene
        agent.trigger_escalation(
            f"Budget alert: estimated {status['upper_bound']} tokens needed, "
            f"budget is {guard.max_budget}. Stopping early."
        )
        break

This pattern requires no LLM calls, no fine-tuning, and no additional dependencies. The computational overhead is negligible — a few floating-point operations per step.

Common Mistakes Developers Make with Token Budgets

Treating the hard limit as the only control point. Most frameworks let you set max_tokens and call it done. But a hard limit generates a truncation error at the wall — it doesn't give you a graceful exit. The BAGEN pattern adds soft signals earlier in the trajectory.

Measuring only task success rate. BAGEN's main point is that success rate and budget-awareness are only weakly correlated (r=0.35). If your eval only tracks task completion, you won't notice the over-optimism problem until your inference bill arrives.

Ignoring per-step cost trends. The early warning signal isn't total consumption — it's the derivative. A task that burns 200 tokens in step 1 and 350 in step 2 and 480 in step 3 is showing a diverging cost trajectory. That pattern, not the absolute number, is what BAGEN's estimator catches.
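A per-step trend check can be as small as the sketch below. The helper names and the three-step window are our assumptions, and "strictly increasing over the last few steps" is just one simple way to operationalize a diverging cost trajectory.

```python
def step_deltas(step_costs: list[int]) -> list[int]:
    """Change in token cost from each step to the next."""
    return [later - earlier for earlier, later in zip(step_costs, step_costs[1:])]


def is_diverging(step_costs: list[int], window: int = 3) -> bool:
    """True when the last `window` step costs are strictly increasing."""
    if window < 2 or len(step_costs) < window:
        return False  # not enough history to call a trend
    return all(delta > 0 for delta in step_deltas(step_costs[-window:]))


rising = [200, 350, 480]   # the example from the paragraph above
noisy = [200, 180, 210]    # costs wobble but do not climb steadily

print(step_deltas(rising), is_diverging(rising))
print(step_deltas(noisy), is_diverging(noisy))
```

A stricter variant could require each delta to exceed a minimum size, which would cut false alarms on naturally noisy tasks.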

Applying a flat budget to all task types. Sokoban puzzles, code generation tasks, and supply-chain optimization have different intrinsic token distributions. A budget that's appropriate for one type will be wasteful for another. Consider per-task-class budgets tuned from past trajectories.
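Per-class budgets can be derived from logged token totals of past successful runs. The sketch below is one possible approach, not something the paper prescribes: the 90th-percentile choice, the class names as dictionary keys, and all the numbers are hypothetical.

```python
import statistics


def budget_from_history(token_totals: list[int], percentile: int = 90) -> int:
    """Budget set at the given percentile of past successful-run token totals."""
    if not 1 <= percentile <= 99:
        raise ValueError("percentile must be between 1 and 99")
    if len(token_totals) < 2:
        raise ValueError("need at least two past runs to estimate a percentile")
    # n=100 yields 99 cut points; cut_points[p - 1] is the p-th percentile
    cut_points = statistics.quantiles(token_totals, n=100, method="inclusive")
    return int(cut_points[percentile - 1])


# Hypothetical logged totals for successful runs, grouped by task class.
history = {
    "sokoban": [400, 450, 520, 610, 700, 820],
    "swe-bench": [3000, 4200, 5100, 6400, 7800, 9500],
}

budgets = {task_class: budget_from_history(totals) for task_class, totals in history.items()}
print(budgets)
```

Each class then gets its own ceiling, so a cheap puzzle task is not granted the headroom a multi-file code change needs.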

FAQ

Q: Does this problem affect all LLM agents equally?

The paper evaluates five frontier models and finds consistent over-optimism across all of them, with r=0.35 correlation between task performance and budget-awareness. Variation exists between models, but no current frontier model reliably predicts its own token consumption on hard tasks.

Q: Is the 28–64% token savings claim realistic in production?

The 28–64% range comes from the actual BAGEN benchmark runs on real LLM trajectories (Sokoban, Search-R1, SWE-bench, supply-chain). The Effloow Lab PoC landed at a 44.6% average in simulation, which sits inside that range but is not independent validation. The key constraint is that savings only apply to failed trajectories, the tasks the agent was going to fail anyway. Stopping a task that would have succeeded throws away a good result (in the PoC, 16 of the 72 alerts fired on solvable tasks), so the budget guard should only trigger on trajectories that cross a high-confidence alert threshold.

Q: Can I fine-tune my model to be more budget-aware?

Yes — the paper demonstrates that SFT+RL fine-tuning on BAGEN-specific trajectories improves early stop and alert behavior. However, interval coverage still caps at 47% after fine-tuning, suggesting that perfect calibration remains an open problem. The wrapper pattern described above is a simpler, training-free alternative that works with any base model.

Q: What's the difference between this and just setting a lower token limit?

A lower hard limit truncates the agent arbitrarily. The BAGEN approach adds intelligence to the stopping decision: the estimator predicts that this specific trajectory is unlikely to succeed, so stopping now saves budget without affecting tasks that were going to succeed. Hard limits waste budget on easy tasks (by cutting them short) and fail on hard tasks (by stopping them at the wrong moment). Soft signals based on interval estimation are more precise.

Q: Where can I read the full paper?

The paper is available at arxiv.org/abs/2606.00198. The project website with benchmark environments and data is at ragen-ai.github.io/bagen.

Key Takeaways

BAGEN (arXiv:2606.00198) documents a correlation of only r=0.35 between task performance and budget-awareness across frontier LLM agents: being a strong agent doesn't make a model budget-aware.
Frontier models are consistently over-optimistic: they underestimate token consumption on hard tasks and never trigger early stops.
Early stopping on failed trajectories saves 28–64% of tokens in the paper's benchmark runs.
Budget-awareness breaks into three measurable skills: feasibility prediction, early failure detection, and interval calibration. Current models struggle with all three.
A practical solution is a wrapper-level interval estimator that monitors per-step cost trends and triggers soft alerts before the hard budget wall.
The Effloow Lab PoC measured 44.6% average savings in simulation using a rolling-cost estimator with variance, with no LLM calls required.

Bottom Line

BAGEN turns a billing problem into a diagnostic one. If your agents are burning through budgets on failed tasks, the fix isn't raising the limit; it's adding interval estimation to your agent loop. The paper gives you the framework; the wrapper pattern above gives you the implementation. It takes a few dozen lines of stdlib Python and works with any model.
