Thinking vs. Doing: Improving Agent Reasoning by Scaling Test-Time Interaction


1 CMU, 2 Scribe, 3 UIUC, 4 U Toronto, 5 UC Berkeley, 6 The AGI Company, 7 NYU

Abstract


Test-time scaling in agentic tasks often relies on generating long reasoning traces ("think" more) before acting, but this does not allow agents to acquire new information from the environment or adapt behavior over time. We propose scaling test-time interaction---increasing an agent's interaction horizon---to enable rich behaviors like exploration, backtracking, and re-planning. To demonstrate the promise of this scaling dimension, we situate our study in the domain of web agents. We first show that even prompting-based interaction scaling can improve task success. We then introduce TTI, a curriculum-based RL approach that trains agents by adjusting interaction lengths during rollout. Using Gemma 3 12B, TTI achieves state-of-the-art results among open-source agents trained on public data. Our analysis shows that TTI enables agents to dynamically balance exploration and exploitation, establishing interaction scaling as a powerful, complementary dimension to test-time compute scaling.

Scaling Test-Time Interaction: A New Dimension of Agent Scaling

Motivation: Why Do We Need Longer Interaction?

Prior methods for agent test-time scaling usually scale the number of thinking tokens at each step, but this does not enable the agent to engage in longer interactions with the environment to collect new information. We argue that building more robust and generalizable agents requires learning adaptive policies that can adjust on-the-fly to new information. A key to such adaptability is the ability to take more actions during deployment. We therefore propose a new dimension of test-time scaling: increasing the number of interaction steps of the agent. This gives agents sufficient time to explore different paths. For example, in a hotel booking task, an agent must browse many listings, compare user reviews, and check availability before selecting the best option. Interaction scaling is orthogonal to existing methods based on chain-of-thought (CoT), which emphasize deeper reasoning per step but do not support gathering new information from the environment.



Scaling Interaction via Prompting



To demonstrate the potential of test-time interaction scaling, we first introduce a purely inference-time "check-again" mechanism: after the agent issues the task completion action, we explicitly prompt it to reconsider its decision: "You just signaled task completion. Let's pause and think again..." We study its effect on web navigation, using a subset of WebArena. As shown in the leftmost figure above, prompting the agent to re-check not only increases the actual interaction length as expected (dotted lines), but also improves the success rates on most domains (bars).
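As an illustration, a minimal sketch of such a check-again wrapper is shown below; the `agent`/`env` interface, the `stop` action name, and the helper methods are hypothetical placeholders rather than our actual implementation.

```python
# Minimal sketch of the "check-again" mechanism.
# The agent/env interface, the "stop" action, and agent.add_message are hypothetical.

CHECK_AGAIN_PROMPT = (
    "You just signaled task completion. Let's pause and think again: "
    "re-check the page against the task and only finish if you are certain."
)

def rollout_with_check_again(agent, env, task, max_steps=30, max_rechecks=2):
    obs = env.reset(task)
    rechecks = 0
    for _ in range(max_steps):
        action = agent.act(task, obs)
        if action == "stop" and rechecks < max_rechecks:
            # Instead of terminating, inject the re-check prompt into the agent's
            # context and keep interacting; this lengthens the interaction horizon
            # purely at inference time, with no additional training.
            agent.add_message(CHECK_AGAIN_PROMPT)
            rechecks += 1
            continue
        if action == "stop":
            break  # the agent confirmed completion after re-checking
        obs = env.step(action)
```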

Comparison with Traditional Test-Time Scaling

We also compare interaction scaling against per-step budget forcing and best-of-n. We aim to answer the question: Given a total token budget, should agents prioritize more interaction steps or generating longer reasoning traces at each step?

  • Performance: The middle figure above shows that among the three strategies, interaction scaling (green stars) shows the steepest upward trend, achieving the highest success rate as the allowed token budget increases.
  • Compute decomposition: The rightmost figure above decomposes total compute into tokens per step (y-axis) and steps per rollout (x-axis). Interaction scaling extends along the x-axis, while per-step reasoning scales along the y-axis. We find that scaling across steps is more effective in web settings, likely because it enables the agent to gather new information and enrich its context.
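To make the compute decomposition concrete, the toy calculation below (with made-up budget numbers, not those used in our experiments) shows how a fixed total token budget can be traded between the two axes.

```python
# Toy decomposition of a fixed token budget into (tokens per step) x (steps per rollout).
# The budget and per-step sizes are made up for illustration only.
TOTAL_BUDGET = 32_000  # total output tokens allowed for one task

for tokens_per_step in (500, 1_000, 2_000, 4_000):
    max_steps = TOTAL_BUDGET // tokens_per_step
    print(f"{tokens_per_step:>5} tokens/step -> up to {max_steps:>3} interaction steps")

# Interaction scaling spends the budget along the "steps" axis, gathering new
# information from the environment at every step; per-step budget forcing spends
# it along the "tokens per step" axis, reasoning longer over the same context.
```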

However, the check-again strategy only lets the agent revisit its behavior once it signals task completion; it does not enable more nuanced behaviors such as switching between exploration and exploitation in the middle of a rollout. This motivates methods that train agents to scale test-time interaction internally.

TTI: Curriculum-Based Online RL for Interaction Scaling

A natural post-training approach is to apply online reinforcement learning (RL) using binary task rewards over extended horizons. However, our experiments reveal that training with a fixed rollout horizon presents a trade-off: short horizons limit the agent's ability to explore, while long horizons suffer from slow convergence due to optimization difficulties and noisy reward signals (assigning high value to exploratory behaviors like "going back" or "trying random links" early in training). To address these challenges, we propose TTI (Test-Time Interaction):

  • Curriculum on interaction horizon: The agent is initially trained on short trajectories, and is gradually exposed to longer horizons.
  • Multiplicative schedule: The curriculum uses a multiplicatively increasing schedule, assuming the agent can quickly acquire basic skills and benefits from early exposure to exploratory behaviors.
  • Compatible with most online RL algorithms: We mainly employ online filtered behavior cloning in our experiments.
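For concreteness, a minimal sketch of this training loop is given below, assuming generic `collect_rollouts` and `bc_update` callables for the rollout and update machinery; the horizon numbers are illustrative rather than our exact configuration.

```python
# Sketch of curriculum-based online RL with reward-filtered behavior cloning.
# collect_rollouts(agent, tasks, max_steps, n) -> list of trajectories with .reward,
# and bc_update(agent, trajectories) are hypothetical callables supplied by the caller.

def train_with_horizon_curriculum(agent, tasks, collect_rollouts, bc_update,
                                  init_horizon=10, max_horizon=30, growth=2,
                                  rounds_per_stage=5, rollouts_per_round=64):
    horizon = init_horizon
    while True:
        for _ in range(rounds_per_stage):
            # Roll out with the current maximum interaction horizon.
            trajs = collect_rollouts(agent, tasks, max_steps=horizon, n=rollouts_per_round)
            # Filtered behavior cloning: imitate only rollouts that earn the
            # binary task reward.
            successful = [t for t in trajs if t.reward > 0]
            if successful:
                bc_update(agent, successful)
        if horizon >= max_horizon:
            break
        # Multiplicative curriculum: lengthen the allowed horizon once the agent
        # has trained on shorter rollouts, enabling exploratory behaviors later.
        horizon = min(horizon * growth, max_horizon)
```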

Experiment Results on WebVoyager

To enable large-scale training without training on the benchmark itself, we adopt synthetic task generation inspired by PAE. We first evaluate agents on WebVoyager, which consists of 427 tasks across 13 domains (we replace Google Search with Bing due to reCAPTCHA issues). Using Gemma 3 12B, the TTI-trained agent achieves an average task success rate of 66.5%, setting a new state-of-the-art among open agents trained purely on public data. Our curriculum approach outperforms the fixed horizon-10 baseline by 7.4% and the fixed horizon-30 baseline by 21.3% in average accuracy.

Baseline results are taken from Zhou et al. and Qin et al.

We also show the training dynamics of TTI below. Performance improves substantially as the number of GoBack and Bing actions and the trajectory length grow. Note that these quantities begin to increase once the maximum allowed horizon is raised as part of the curriculum schedule (the regime shown by the green shaded area in the figure).

Experiment Results on WebArena

We also evaluate using the full WebArena benchmark. TTI obtains the highest performance among open-source agents trained entirely via self-improvement, without relying on proprietary models for task completion or distillation.

WebArena results. For proprietary agents, we include the top 8 from the official leaderboard.

Further Scaling of TTI Agents

We also study the question: Can we further amplify performance by combining TTI with inference-time interaction scaling techniques, such as re-checking? We apply the check-again strategy to intermediate TTI checkpoints. As shown in the figure on the right, applying re-checking on top of TTI improves task success across various training stages. The benefits are most pronounced in the early stages of training, when the agent has a stronger bias to terminate prematurely.

Case Studies

Success Mode: Effective Exploration in Complex Tasks

For complex, exploratory tasks that require information retrieval, TTI trains the agent to extend its interaction horizon through searches and backtracking, gathering and comparing information before making decisions. Here is an example:
Task: Locate a recipe for an American apple pie on Allrecipes with a rating of at least 4 stars and more than 50 reviews. Note the maximum temperature mentioned in the Directions.


Success Mode: Strategic Exploitation in Simple Tasks

For simpler tasks with clear, deterministic paths (e.g., form filling or direct lookups), the TTI agent completes them efficiently without over-exploration.
Task: Identify the latest top-trending open-source project in the category of 'Machine Learning' on GitHub, and check the number of stars it has received.


Failure Mode: Over-Reliance on Resets

When an action fails, our agent may reset the task by returning to the Bing search page rather than attempting recovery within the target domain. This suggests the agent treats search as a universal fallback, even when more domain-specific actions (e.g., revisiting menus, refining filters) would be more effective.
Task: On Apple's website, how many different types of keyboards are available when customizing your 14-inch MacBook Pro?


Failure Mode: Limited Self-Verification

We also observe that the agent may fail to verify its actions against the task goal, especially in the last step. An important next step is to combine TTI with scaling per-step reasoning.
Task: Identify a new open-source project on GitHub related to 'AI agriculture' that created in 2022, and note its main programming language and description.


Citation

If you find our work relevant, please consider citing it:

@misc{shenbai2025tti,
      title={Thinking vs. Doing: Agents that Reason by Scaling Test-Time Interaction}, 
      author={Junhong Shen and Hao Bai and Lunjun Zhang and Yifei Zhou and Amrith Setlur and Shengbang Tong and Diego Caples and Nan Jiang and Tong Zhang and Ameet Talwalkar and Aviral Kumar},
      year={2025},
      eprint={2506.07976},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2506.07976}, 
}