ARTICLES ·2026-05-22 ·BY EFFLOOW CONTENT FACTORY

RRO: Train LLM Agents Without Expensive PRMs

RRO replaces costly process reward model exploration with a rising-reward filter. Effloow Lab reproduced the trajectory selection logic in Python.

llm-agents reinforcement-learning process-reward-models paper-poc trajectory-optimization ai-research

Training LLM agents to perform well on multi-step tasks is hard. A leading approach, Process Reward Models (PRMs), works, but the training data collection is expensive. You need to explore many candidate actions at every step to build the supervision signal. A May 2025 paper on arXiv (2505.20737) proposes a cleaner fix: instead of evaluating all candidates, stop as soon as you find one with a rising reward.

That is the entire idea behind RRO (Reward Rising Optimization), and Effloow Lab reproduced its core trajectory filtering logic in pure Python to understand exactly how it works.

The Problem with Process Reward Models

Large language models handle single-step tasks reasonably well. Give them a math problem, a code snippet to fix, or a question to answer, and they can often produce the result in a single response. Multi-step agent tasks are different. An agent navigating an e-commerce site, writing and executing SQL queries, or managing files through shell commands needs to make a sequence of decisions. Each decision affects all future ones.

Researchers have tackled this with Process Reward Models. A PRM assigns a reward score to each step in a trajectory, not just the final outcome. This fine-grained supervision helps the agent avoid subtle mistakes mid-task. The challenge is getting training data for the PRM.

To collect training data, you need to know which step-level actions are good. The standard approach: at each step, generate a fixed number of candidate actions (say, 5), score all of them, and pick the best. Repeat for every step, across many trajectories. The cost compounds fast. If you have 6-step trajectories and 5 candidates per step, you're generating 30 model outputs per trajectory before training even begins.

The paper quantifies the problem directly. Fixed-size exploration costs 5 samples per step on both WebShop and InterCode-SQL. That's the baseline RRO is trying to beat.

What RRO Does Differently

RRO replaces "evaluate all N candidates, pick the best" with "evaluate candidates one at a time, stop when reward rises."

The key observation is this: you don't need the globally best action at each step. You need a training trajectory where rewards consistently improve as the agent progresses. A trajectory with step rewards [0.2, 0.35, 0.48, 0.61, 0.72, 0.89] is a better supervision signal than one with rewards [0.6, 0.55, 0.7, 0.62, 0.8, 0.71], even though the second trajectory has the higher average reward (about 0.66 versus 0.54).

RRO's dynamic expansion strategy works like this:

1. Before evaluating any step, track the previous step's reward as a threshold.
2. Sample one candidate action. Score it with the outcome reward.
3. If the reward is higher than the threshold: accept it, move to the next step.
4. If not: sample another candidate, up to a maximum N.
5. Repeat until the trajectory is complete.
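
The steps above can be sketched as a small function. This is a minimal illustration rather than the paper's implementation: `collect_rising_trajectory` and `sample_and_score` are hypothetical names, the starting threshold of 0.0 is an assumption, and so is the choice to skip a step when no candidate rises within the budget.

```python
from typing import Callable, List, Tuple


def collect_rising_trajectory(
    sample_and_score: Callable[[int], float],
    n_steps: int,
    max_candidates: int = 5,
) -> Tuple[List[float], int]:
    """Return (accepted step rewards, total candidates sampled)."""
    threshold = 0.0          # assumed starting threshold before step 1
    accepted: List[float] = []
    total_samples = 0
    for step in range(n_steps):
        for _ in range(max_candidates):
            reward = sample_and_score(step)   # one candidate, one score
            total_samples += 1
            if reward > threshold:            # rising: accept and move on
                accepted.append(reward)
                threshold = reward
                break
        # Assumption: if no candidate rises within the budget, the step
        # contributes nothing and the threshold is left unchanged.
    return accepted, total_samples


# Toy scorer: a fixed queue of rewards per step instead of a real model.
queued = {0: [0.0, 0.1], 1: [0.3], 2: [0.2, 0.25, 0.5]}
rewards, n_samples = collect_rising_trajectory(
    lambda step: queued[step].pop(0), n_steps=3
)
print(rewards, n_samples)
```

In a real pipeline the toy scorer would be replaced by a call that samples an action from the policy and scores it; only the early-stop loop is the point here.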

The result is a set of training trajectories where every selected step has a higher reward than the step before it — what the paper calls "rising reward trajectories."

Effloow Lab: Reproducing the Filtering Logic

Effloow Lab reproduced the trajectory selection logic in pure Python without any model training (see data/lab-runs/rro-rising-reward-trajectories-llm-agent-optimization-paper-poc-2026.md). The goal was to understand the filtering mechanics and sample efficiency before explaining them here.

Simulation Setup

Rewards were simulated as Gaussian noise around a slightly rising trend, to mimic an agent making progress across the steps of a task. Two conditions were compared across 200 trials:

Flat exploration: always sample 5 candidates per step, take the best
RRO: sample candidates one at a time until a rising reward is found

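import random

def simulate_step_reward(step, trajectory_quality, n_steps=6, noise=0.15):
    # Hypothetical stand-in: the lab's helper is not shown in this article.
    # Gaussian noise around a mean that rises with the step, clipped to [0, 1].
    mean = trajectory_quality * step / n_steps
    return min(1.0, max(0.0, random.gauss(mean, noise)))

def rro_exploration(max_candidates=5, n_steps=6, trajectory_quality=0.8):
    total_samples = 0
    prev_reward = 0.0
    step_rewards = []
    for step in range(n_steps):
        for _ in range(max_candidates):
            reward = simulate_step_reward(step + 1, trajectory_quality)
            total_samples += 1
            if reward > prev_reward:      # Rising check
                step_rewards.append(reward)
                prev_reward = reward
                break                     # Stop early: the key saving
        # If no candidate rises, the step is skipped and prev_reward is kept.
    avg_reward = sum(step_rewards) / len(step_rewards) if step_rewards else 0.0
    return avg_reward, total_samples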
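
The flat baseline from the setup list can be sketched the same way. The block below is an illustration under stated assumptions: `sample_reward`, `flat_exploration`, and `run_trials` are hypothetical names, and the reward model (linear trend, noise of 0.15) is a guess at the setup, so the printed reward is not expected to reproduce the lab's numbers. The sample count, by contrast, is fixed by construction.

```python
import random
from statistics import mean


def sample_reward(step, n_steps=6, quality=0.8, noise=0.15):
    # Hypothetical reward model: Gaussian noise around a mean that rises
    # with the step index, clipped to [0, 1]. Not the lab's actual helper.
    centre = quality * step / n_steps
    return min(1.0, max(0.0, random.gauss(centre, noise)))


def flat_exploration(n_candidates=5, n_steps=6):
    """Always score n_candidates per step and keep the best one."""
    total_samples = 0
    step_rewards = []
    for step in range(1, n_steps + 1):
        candidates = [sample_reward(step, n_steps) for _ in range(n_candidates)]
        total_samples += n_candidates
        step_rewards.append(max(candidates))
    return mean(step_rewards), total_samples


def run_trials(explore, n_trials=200, seed=0):
    """Average reward and sample count of an exploration function."""
    random.seed(seed)
    results = [explore() for _ in range(n_trials)]
    return mean(r for r, _ in results), mean(s for _, s in results)


avg_reward, avg_samples = run_trials(flat_exploration)
print(f"flat: reward={avg_reward:.3f}, samples/trajectory={avg_samples}")
```

`run_trials` accepts any zero-argument function returning `(reward, samples)`, so an RRO-style explorer can be passed to it for a side-by-side comparison.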

Results

Method	Avg Reward	Avg Samples/Trajectory
Flat (5 candidates/step)	0.597	30.0
RRO (dynamic stop)	0.518	10.1
Delta	-13% (synthetic)	-66% samples

The synthetic reward difference reflects that flat exploration always takes the best of 5, while RRO takes the first acceptable one. This does not map directly to the paper's performance results — in the actual RRO system, the quality of training data (not just the greedy reward at selection time) is what improves the trained model's performance. The sample efficiency result (-66%) is genuine and matches the paper's direction.

The dynamic expansion demo shows the mechanism clearly:

Step 1 (prev_reward=0.000):
  Candidate 1: reward=0.000 → skip
  Candidate 2: reward=0.107 → ACCEPT (rising)

Step 2 (prev_reward=0.107):
  Candidate 1: reward=0.206 → ACCEPT (rising)

Step 3 (prev_reward=0.206):
  Candidate 1: reward=0.358 → ACCEPT (rising)

Average: 1.33 candidates evaluated per step (vs 5 for flat). Most steps find a rising action on the first try, which is in line with what the paper reports: 1.86 samples per step on WebShop, versus 5 for the fixed baseline.
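
The 1.33 figure can be checked by replaying the trace. The sketch below hard-codes the candidate rewards from the demo output and applies the same rising check; the variable names are illustrative only.

```python
# Candidate rewards copied from the demo trace above, one list per step.
trace = [
    [0.000, 0.107],   # step 1
    [0.206],          # step 2
    [0.358],          # step 3
]

prev_reward = 0.0
evaluated = []                      # candidates scored at each step
for candidates in trace:
    for count, reward in enumerate(candidates, start=1):
        if reward > prev_reward:    # first rising candidate is accepted
            prev_reward = reward
            evaluated.append(count)
            break

avg_per_step = sum(evaluated) / len(evaluated)
print(evaluated, round(avg_per_step, 2))   # [2, 1, 1] 1.33
```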

Trajectory Quality vs Trajectory Quantity

There is a subtle but important point worth unpacking. The rising reward criterion acts as a quality filter. Consider two trajectories generated from the same agent:

Trajectory A (not selected): [0.45, 0.40, 0.52, 0.48, 0.61, 0.59] — reward fluctuates, average is ~0.51
Trajectory B (selected): [0.16, 0.22, 0.29, 0.45, 0.52, 0.74] — reward rises consistently, average is ~0.40

Flat exploration would prefer Trajectory A because it has a higher average reward. RRO selects Trajectory B because it has a cleaner learning signal.
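
A quick check makes the contrast concrete. This sketch compares the two selection criteria on the trajectories listed above; `is_strictly_rising` is a hypothetical helper name, not something taken from the paper.

```python
from statistics import mean

trajectory_a = [0.45, 0.40, 0.52, 0.48, 0.61, 0.59]
trajectory_b = [0.16, 0.22, 0.29, 0.45, 0.52, 0.74]


def is_strictly_rising(rewards):
    """True when every step reward exceeds the one before it."""
    return all(later > earlier for earlier, later in zip(rewards, rewards[1:]))


for name, rewards in [("A", trajectory_a), ("B", trajectory_b)]:
    print(name, round(mean(rewards), 2), is_strictly_rising(rewards))
```

Trajectory A wins on the average, but only Trajectory B passes the rising check.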

When a PRM is trained on Trajectory B, it learns that "progress looks like consistent improvement." When trained on Trajectory A, the mixed signal makes it harder for the model to distinguish good progress from bad. The rising reward constraint creates a corpus of demonstrations that are unambiguous: each step is better than the last.

In the lab simulation, only 6 out of 20 randomly generated trajectories satisfied the strict rising criterion. This 30% pass rate explains why RRO needs dynamic expansion in the first place — you can't just take the first candidate at every step and expect it to always produce rising rewards.
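
The pass-rate experiment is easy to approximate. The sketch below takes the first candidate at every step and counts how many of 20 trajectories end up strictly rising. The reward model is an assumption (a linear trend with Gaussian noise of 0.15), so the count it prints will not necessarily match the lab's 6 out of 20.

```python
import random


def first_candidate_trajectory(n_steps=6, quality=0.8, noise=0.15):
    # Hypothetical reward model (an assumption, not the lab's simulator):
    # Gaussian noise around a mean that rises linearly with the step.
    return [random.gauss(quality * step / n_steps, noise)
            for step in range(1, n_steps + 1)]


random.seed(0)
trajectories = [first_candidate_trajectory() for _ in range(20)]

# A trajectory passes only if each reward exceeds the previous one.
passed = sum(
    all(b > a for a, b in zip(t, t[1:]))
    for t in trajectories
)
print(f"{passed}/20 trajectories are strictly rising")
```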

Benchmark Performance

The paper evaluates RRO on two benchmarks widely used for LLM agent research:

WebShop tasks the agent with finding and purchasing a specific product on a simulated e-commerce site. Success requires multi-step search, filtering, and purchase decisions. The observation space is rich (HTML-like product listings) and the action space varies per step.

InterCode-SQL tasks the agent with writing and iteratively fixing SQL queries against a live database. The agent receives execution feedback after each query and must refine its approach based on error messages and partial results. This is a classic multi-step interactive coding task.

Method	WebShop Reward	WebShop Samples/Step	InterCode-SQL Reward	InterCode-SQL Samples/Step
Fixed Exploration (5)	61.20	5.0	54.68	5.0
IPR	61.39	4.0	52.39	3.0
RRO (ours)	62.91	1.86	55.08	1.64

RRO achieves the highest reward on both benchmarks while using the fewest samples. On WebShop, it cuts exploration cost from 5 to 1.86 samples per step (a 63% reduction) while improving reward from 61.20 to 62.91. On InterCode-SQL, samples per step drop from 5 to 1.64 (a 67% reduction) while reward rises from 54.68 to 55.08.
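
The reductions follow directly from the table. The snippet below recomputes them from the reported figures; it is plain arithmetic, with no values beyond those in the table.

```python
# Samples per step and rewards copied from the benchmark table above.
results = {
    "WebShop":       {"fixed": (61.20, 5.0), "rro": (62.91, 1.86)},
    "InterCode-SQL": {"fixed": (54.68, 5.0), "rro": (55.08, 1.64)},
}

summary = {}
for benchmark, rows in results.items():
    fixed_reward, fixed_samples = rows["fixed"]
    rro_reward, rro_samples = rows["rro"]
    reduction = 1 - rro_samples / fixed_samples      # fraction of samples saved
    gain = rro_reward - fixed_reward                 # absolute reward change
    summary[benchmark] = (reduction, gain)
    print(f"{benchmark}: {reduction:.1%} fewer samples, reward {gain:+.2f}")
```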

IPR (Iterative step-level Process Refinement) is the most direct baseline: it also tries to be more efficient than fixed exploration but still requires 3–4 samples per step. RRO beats it on both reward and efficiency.

Why This Matters for Agent Training Pipelines

The practical implication is cost. When training an LLM agent with PRM supervision, the data collection phase is often the bottleneck. If you're using a capable model (say, a 7B–70B LLM) as the exploration policy, every candidate generation costs inference compute. Cutting from 5 to 1.86 samples per step means collecting the same number of training trajectories for roughly 37% of the sampling cost, a saving of about 63%.

This compounds in multi-task or continual training settings. If you're iteratively improving an agent across many tasks — as you might in a production agentic system — the savings accumulate per training iteration.

There is also a data quality benefit. Training a PRM on rising-reward trajectories gives it a cleaner signal. The model learns to assign higher scores to steps that represent consistent progress rather than locally-good-but-trajectory-inconsistent choices. This makes the trained PRM more reliable when deployed for inference-time guidance.

For teams building production LLM agents, the takeaway is straightforward: if you're collecting step-level supervision data, consider filtering for trajectories with monotonically improving rewards before training. The filter is cheap to implement and the paper shows it is sufficient to outperform more expensive exploration strategies.

How It Compares to Related Work

Several concurrent papers address the same PRM scalability problem from different angles.

AgentPRM (ACM Web Conference 2026) introduces a step-level reward model using temporal difference estimation to handle sequential dependencies between decisions. It achieves 8x compute efficiency over baselines and allows small 3B models to outperform GPT-4o on ALFWorld. AgentPRM targets a different aspect of the problem — how to train the PRM — while RRO targets how to collect training data for any PRM.

SPEAR (ICLR 2026) combines curriculum-based self-imitation learning with intrinsic reward shaping. It filters valuable trajectories for a replay buffer, which has surface similarities to RRO's quality filtering, but the mechanics differ. SPEAR shapes rewards through curriculum; RRO filters based on trajectory-level reward monotonicity.

Stratified GRPO (2025) applies group relative policy optimization to search agents and outperforms vanilla GRPO by up to 11.3 points. It addresses training stability rather than data collection efficiency.

RRO occupies a specific niche: making PRM data collection cheaper without changing the PRM architecture or the RL algorithm. It is composable with any of the above approaches — you could use RRO's data collection with AgentPRM's architecture or within a SPEAR training loop.

Limitations and Open Questions

The paper acknowledges several limitations worth noting.

The rising reward criterion is heuristic. The paper provides theoretical backing for why this constraint leads to better process supervision, but the theory makes assumptions about reward structure that may not hold in all domains. Tasks with highly non-monotone reward landscapes — where good early exploration sometimes requires temporarily lower rewards — may not benefit from the strict rising filter.

The evaluation is on two benchmarks (WebShop and InterCode-SQL). Both are text-based interactive tasks with relatively clear per-step progress signals. Whether the approach generalizes to tasks with sparse or delayed rewards is an open question.

The paper does not report ablations on the maximum candidate limit N. The experiments use N=5, but the sensitivity of performance to this hyperparameter is not explored in depth.

Finally, the rising reward criterion requires a step-level reward function during data collection. In tasks where step-level rewards are not available (only final outcome rewards), RRO cannot be applied directly without first estimating intermediate rewards.

Key Takeaways

RRO is a clean, well-motivated approach to a real bottleneck in LLM agent training. The rising reward criterion is simple to implement and provides both a sample efficiency gain (63% fewer samples in the paper's experiments) and a data quality benefit (cleaner supervision signal for the PRM).

The core algorithm fits in a few lines of Python. The PoC in Effloow Lab confirms that the dynamic expansion logic is straightforward: track the previous step's reward, stop evaluating candidates as soon as one exceeds it. The complexity is in the theoretical justification — why this heuristic works — not in the implementation.

For practitioners training LLM agents with process supervision, RRO offers a low-friction improvement to the data collection pipeline. You do not need to change your PRM architecture or your RL algorithm. You change which trajectories you collect: only the ones where every step improves on the last.

Bottom Line

RRO cuts LLM agent training data collection costs by ~63% with a surprisingly simple rule: only keep trajectories where each step's reward exceeds the last. The gains on WebShop and InterCode-SQL are modest but consistent, and the approach is composable with existing PRM architectures. If you're already collecting step-level supervision data, this filter is worth adding.

FAQ

Q: Does RRO require a pretrained process reward model?

No. RRO is a data collection strategy. It determines which trajectories to include in the PRM training set. You need some way to score candidate steps (an outcome reward or a proxy reward), but you do not need a trained PRM before applying RRO — you use RRO to collect the data to train the PRM.

Q: Can RRO be used with GRPO-based training?

Yes. RRO's trajectory filtering is orthogonal to the RL algorithm. The paper uses it with a process reward training objective, but the filtered trajectories could be used in a GRPO setup as well. The key is that you need step-level reward signals during data collection.

Q: What tasks work best with RRO?

Tasks with dense, informative step-level rewards are the best fit. The paper's two benchmarks — SQL generation with execution feedback and e-commerce navigation with purchase success signals — both have relatively clear per-step progress indicators. Tasks with very sparse rewards may need proxy reward design before RRO can be applied.

Q: How does the 1.86 samples/step compare in practice?

On a 6-step trajectory, flat exploration with 5 candidates uses 30 model calls per trajectory. RRO uses roughly 11. At scale — say, collecting 10,000 training trajectories — that's 300,000 vs 110,000 model calls. For expensive models or GPU-limited setups, this difference is significant.
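
The same estimate as a few lines of arithmetic, using the WebShop samples-per-step figures. The 6-step trajectory length and the 10,000-trajectory target are the illustrative values from this answer, not numbers from the paper.

```python
n_steps = 6
n_trajectories = 10_000
samples_per_step = {"flat": 5.0, "rro": 1.86}   # WebShop figures from the paper

totals = {}
for method, per_step in samples_per_step.items():
    per_trajectory = n_steps * per_step
    totals[method] = per_trajectory * n_trajectories
    print(f"{method}: {per_trajectory:.2f} calls/trajectory, "
          f"{totals[method]:,.0f} total")
```

With the unrounded 1.86 figure the RRO total comes to about 111,600 calls, which rounds to the 110,000 quoted above.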
