ASTRA:
Automated Synthesis of Agentic Trajectories and Reinforcement Arenas

January 22, 2026  —  Beike Language and Intelligence

ASTRA is an open-source project that generates high-quality, executable agent–environment interaction data for AI training and evaluation.

It first collects real MCP interactions to capture authentic, multi-turn tool usage, then synthesizes sandbox-verifiable environments by generating structured tool specifications and Python implementations of each environment's tools. This two-stage pipeline enables scalable, reliable training of multi-step, tool-augmented agents.

The videos below walk through the four stages of ASTRA’s environment synthesis pipeline, showing how executable environments are built automatically.

Overview

Model performance on the BFCL-V3 multi-turn subset.

Recent progress in tool-augmented large language models has shown promise, yet existing methods struggle in fully agentic, multi-turn settings: tool interactions are often simulated rather than executed, training data relies on human annotation, and reinforcement learning lacks environments with verifiable, rule-based checks, which leads to noisy rewards and unstable optimization.

To address these limitations, we propose an end-to-end, fully automated pipeline for tool-agent training based on executable trajectories. We build a large-scale, high-quality data pipeline that combines massive real MCP interactions with selective, session-consistent simulation for supervised fine-tuning, and we introduce a fully verifiable environment synthesis pipeline in which each step is grounded in executable code and validated by sandboxed execution, enabling scalable rule-based reinforcement learning.

Our contributions are as follows:

  • Fully open and automated agentic data synthesis pipeline. All code and model weights are fully open-sourced. Moreover, the environment synthesis framework can automatically construct executable environments given only domain specifications and relevant knowledge, without human-in-the-loop intervention.
  • Multi-domain, code-verifiable environments for RL. A fully automated environment synthesis pipeline constructs multi-domain environments equipped with fully executable, rule-based verification code, making them directly suitable for rule-based, multi-turn reinforcement learning.
  • State-of-the-art performance at comparable model scales. Our approach enables models of comparable parameter scale to achieve state-of-the-art results, approaching the performance of leading closed-source models.

Using this end-to-end automated framework, a 32B-scale open model trained with SFT followed by RL approaches the performance of leading closed-source systems on complex multi-turn tool-use tasks, demonstrating that strong agentic capabilities can be learned without human-in-the-loop supervision.

Multi-turn Tool-Integrated Trajectory Synthesis

Overview of the trajectory synthesis pipeline.

Data Pipeline for Tool-grounded SFT. We construct an SFT-ready dataset via an end-to-end pipeline that enforces realism and executability throughout:

  • Tool pool construction. We aggregate tools from open MCP registries, internal production services, and public tool datasets, then normalize all interfaces into an OpenAI-style tool-calling schema and group them by MCP server. We filter out servers that cannot support non-trivial multi-turn interactions, yielding a clean tool pool (1,585 servers; 19,036 tools; 41 domains) for downstream synthesis; a schema-normalization sketch follows this list.
  • Tool-chains and task synthesis. For each server, we derive executable tool-chains by analyzing schema-level dependencies, verifying parameter satisfiability, and enforcing acyclic workflows. We then synthesize multi-step user tasks using (i) chain-conditioned generation to maximize executability and (ii) server-level generation to improve coverage and diversity, followed by lightweight augmentation and filtering for clarity, realism, and tool-use necessity.
  • Multi-turn rollout for trajectories. We collect trajectories by rolling out multi-turn interactions between a tool-augmented agent and its environment, recording observations, tool calls, and feedback. Deployed MCP services are executed directly, while document-only servers are handled by a session-consistent tool emulator that maintains cross-turn state and injects controlled failures, producing coherent training sequences for SFT and downstream reinforcement learning.
  • Automated reward modeling and filtering. All trajectories are scored by a fully automated, rule-based LLM reward pipeline without human annotation. The reward aggregates seven dimensions spanning understanding, planning, tool use, execution, and final answer quality, and is used to reliably filter high-quality SFT trajectories at scale.
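For illustration, here is a minimal sketch of the normalization step referenced above, assuming the MCP listing convention of a name, description, and JSON-Schema inputSchema per tool; the example record and field handling are illustrative, not ASTRA's actual ingestion code.

def to_openai_tool(mcp_tool: dict) -> dict:
    """Sketch: convert one MCP tool record into an OpenAI-style function tool.

    Assumes the MCP listing convention (name, description, inputSchema);
    ASTRA's actual records may carry additional or differently named fields.
    """
    return {
        "type": "function",
        "function": {
            "name": mcp_tool["name"],
            "description": mcp_tool.get("description", ""),
            # MCP input schemas are already JSON Schema, so they can be
            # passed through as the OpenAI "parameters" object.
            "parameters": mcp_tool.get(
                "inputSchema", {"type": "object", "properties": {}}
            ),
        },
    }

# Hypothetical example record for illustration only.
example = {
    "name": "search_listings",
    "description": "Search real-estate listings by city and price range.",
    "inputSchema": {
        "type": "object",
        "properties": {"city": {"type": "string"}, "max_price": {"type": "number"}},
        "required": ["city"],
    },
}
print(to_openai_tool(example))

Grouping the normalized tools by their originating MCP server then yields the per-server pools used for tool-chain derivation.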

Fully Verifiable and Automated Environment Synthesis

Overview of the environment synthesis pipeline.

Our objective is to enable reinforcement learning with verifiable rewards (RLVR) at scale, with end-to-end automation, code-verifiable process signals, and no human labels. We therefore synthesize self-contained, executable environments in which each intermediate step can be verified in a code sandbox, making the pipeline naturally scalable. These environments are then used to roll out multi-turn interactions and to derive process-level feedback for subsequent training.

  • Decomposed trajectories as environment blueprints. Each instance is represented as a main QA with explicit sub-QA steps organized by a dependency graph (chain/DAG), providing an inspectable structure for reward attribution and step-wise validation.
  • Execution-oriented validation. We discard decompositions solvable by linguistic reasoning alone and score the remaining candidates along dependency consistency, atomicity, sequential rationality, and task completeness. Only high-quality trajectories are retained, ensuring they are well-structured and suitable for tool grounding.
  • Tool grounding with sandbox checks. For each retained trajectory $\tau=\{(q_i,a_i,d_i)\}_{i=1}^m$, we synthesize a tool specification, invocation, and Python implementation per sub-step, then execute the code in a sandbox. A sub-environment is accepted only if execution reproduces the target answer; otherwise synthesis is retried, and validated sub-environments are composed into a complete environment for the original task (a sketch of this accept/retry loop follows this list).
  • Compactness via intra-instance merging. To reduce redundancy and control action-space growth, we merge functionally equivalent sub-environments that differ only in parameters. Homogeneous groups are detected via an LLM-based classifier; a single implementation is retained and incrementally extended to cover all cases, with sandbox re-execution after each update to preserve correctness.
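The accept/retry loop of the tool-grounding step can be pictured as follows. This is a minimal sketch: the synthesize and run_in_sandbox callables are hypothetical stand-ins for ASTRA's LLM-based synthesis and sandboxed execution components, and the SubStep fields simply mirror the $(q_i, a_i, d_i)$ triples above.

from dataclasses import dataclass, field
from typing import Callable, List, Optional

@dataclass
class SubStep:
    question: str                  # q_i: sub-question for this step
    answer: str                    # a_i: target answer sandbox execution must reproduce
    depends_on: List[int] = field(default_factory=list)  # d_i: prerequisite steps (chain/DAG edges)

def ground_substep(
    step: SubStep,
    synthesize: Callable[[str], dict],                     # hypothetical: returns {"spec", "invocation", "code"}
    run_in_sandbox: Callable[[str, dict], Optional[str]],  # hypothetical sandbox runner
    max_retries: int = 3,
) -> Optional[dict]:
    """Accept a synthesized sub-environment only if sandboxed execution
    reproduces the target answer; otherwise retry synthesis."""
    for _ in range(max_retries):
        candidate = synthesize(step.question)
        result = run_in_sandbox(candidate["code"], candidate["invocation"])
        if result is not None and str(result).strip() == step.answer.strip():
            return candidate        # validated sub-environment, ready for composition
    return None                     # grounding failed; the trajectory is not used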

Training Tool Agents

We train tool agents in two stages: Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL). SFT provides a strong behavioral prior by training on curated multi-turn tool trajectories, enabling basic capabilities such as tool invocation, workflow following, and long-context reasoning. Building on this initialization, RL further improves long-horizon decision making and tool-use efficiency in fully executable environments.

Online Multi-Turn Reinforcement Learning

We adopt an online, multi-turn agentic RL paradigm. Each training instance corresponds to an independent simulated environment with no shared state. During rollout, the agent repeatedly generates tool calls, executes them in a sandboxed runtime, and conditions subsequent decisions on the full interaction history. A trajectory terminates when all sub-tasks are solved, a maximum interaction length is reached, or the agent stops issuing tool calls.
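The rollout loop can be sketched as below. The agent and env interfaces (generate, execute, all_subtasks_solved, tool_schemas, task_prompt) are illustrative assumptions rather than ASTRA's actual API; the three exit paths correspond to the termination conditions above.

def rollout(agent, env, max_turns: int = 30) -> list:
    """Sketch of one multi-turn episode against an independent environment."""
    history = [{"role": "user", "content": env.task_prompt}]
    for _ in range(max_turns):                     # cap on interaction length
        message = agent.generate(history, tools=env.tool_schemas)
        history.append(message)
        tool_calls = message.get("tool_calls") or []
        if not tool_calls:                         # agent stopped issuing tool calls
            break
        for call in tool_calls:                    # execute each call in the sandboxed runtime
            observation = env.execute(call)
            history.append({
                "role": "tool",
                "tool_call_id": call["id"],
                "content": observation,
            })
        if env.all_subtasks_solved():              # every sub-task verified by the environment
            break
    return history                                 # full interaction history for reward computation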

Trajectory-Level Reward Design

Reinforcement learning is driven by a trajectory-level reward that jointly captures task completion and interaction efficiency. Each instance is formalized as a job consisting of $n$ sub-tasks:

$$ \text{job} = \{(q_1, a_1), (q_2, a_2), \ldots, (q_n, a_n)\}. $$

Suppose the agent successfully solves $\hat{n}$ sub-tasks using $c$ tool calls. We define

$$ r = \frac{\hat{n}}{n}, \qquad p = \frac{\hat{n}}{c}, $$

where $r$ measures sub-task recall and $p$ measures tool-use precision. The final reward is computed as the harmonic mean:

$$ \text{reward} = \frac{2pr}{p + r}. $$

This reward formulation explicitly encourages solving as many required sub-tasks as possible while minimizing redundant tool invocations, providing a structured learning signal over entire trajectories.
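The computation follows directly from the formulas above; the only added assumption is that a trajectory with no solved sub-tasks or no tool calls receives zero reward, since the harmonic mean is undefined there.

def trajectory_reward(n_solved: int, n_subtasks: int, n_tool_calls: int) -> float:
    """Harmonic mean of sub-task recall and tool-use precision."""
    if n_solved == 0 or n_tool_calls == 0:
        return 0.0                      # assumed edge case: nothing solved or no calls
    r = n_solved / n_subtasks           # recall: solved sub-tasks over required sub-tasks
    p = n_solved / n_tool_calls         # precision: solved sub-tasks over tool calls used
    return 2 * p * r / (p + r)

# Example: 3 of 4 sub-tasks solved with 6 tool calls
# r = 0.75, p = 0.5, reward = 2 * 0.5 * 0.75 / 1.25 = 0.6
assert abs(trajectory_reward(3, 4, 6) - 0.6) < 1e-9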

Stable Online Optimization with Adaptive Batch Filling

We optimize the policy using a GRPO-style objective. The original GRPO objective is defined as:

$$ \begin{aligned} \mathcal{J}_{\mathrm{GRPO}}(\theta) &= \mathbb{E}\!\left[ q \sim P(Q),\ \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{\mathrm{old}}}(\cdot \mid q) \right] \\[6pt] &\quad \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \Bigl\{ \min \left( \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})} {\pi_{\theta_{\mathrm{old}}}(o_{i,t} \mid q, o_{i,<t})} \hat{A}_{i,t}, \; \operatorname{clip}\!\left( \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})} {\pi_{\theta_{\mathrm{old}}}(o_{i,t} \mid q, o_{i,<t})}, 1-\epsilon,\ 1+\epsilon \right) \hat{A}_{i,t} \right) \\ &\qquad -\, \beta D_{\mathrm{KL}}\!\left( \pi_\theta \,\|\, \pi_{\mathrm{ref}} \right) \Bigr\}. \end{aligned} $$

In practice, we remove KL regularization and entropy bonuses. We adopt Adaptive Batch Filling, which buffers valid trajectories and continues rollout generation until a full batch of effective samples is collected, and we use a token-level policy gradient loss. The final reinforcement learning objective is:

$$ \begin{aligned} \mathcal{J}_{\mathrm{GRPO}}'(\theta) &= \mathbb{E}_{(q,a)\sim\mathcal{D},\ \{o_i\}_{i=1}^{G}\sim\pi_{\theta_{\mathrm{old}}}(\cdot\mid q)} \left[ \frac{1}{\textcolor{red}{\sum_{i=1}^{G} |o_i|}} \sum_{i=1}^{G} \sum_{t=1}^{|o_i|} \min \left( \frac{\pi_\theta(o_{i,t}\mid q,o_{i,<t})} {\pi_{\theta_{\mathrm{old}}}(o_{i,t}\mid q,o_{i,<t})} \hat{A}_{i,t}, \; \operatorname{clip}\!\left( \frac{\pi_\theta(o_{i,t}\mid q,o_{i,<t})} {\pi_{\theta_{\mathrm{old}}}(o_{i,t}\mid q,o_{i,<t})}, 1-\epsilon,\ 1+\epsilon \right) \hat{A}_{i,t} \right) \right], \\[6pt] &\qquad \text{s.t.}\ \ \textcolor{red}{\operatorname{Std}\!\big(R(q,\{o_i\})\big) > \delta}. \end{aligned} $$
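A minimal sketch of Adaptive Batch Filling combined with the group-standard-deviation filter highlighted above. sample_group is a hypothetical stand-in for rolling out $G$ trajectories for one task and scoring them with the trajectory-level reward, and the delta value is an assumed default; groups whose rewards are (nearly) identical carry no advantage signal and are dropped rather than padded into the batch.

import statistics
from typing import Callable, List, Tuple

def fill_batch(
    sample_group: Callable[[], List[Tuple[object, float]]],  # hypothetical: one task's G (trajectory, reward) pairs
    batch_size: int,
    delta: float = 1e-4,            # assumed threshold for Std(R) > delta
) -> list:
    """Keep sampling groups until the batch holds `batch_size` effective ones."""
    batch = []
    while len(batch) < batch_size:
        group = sample_group()
        rewards = [reward for _, reward in group]
        if statistics.pstdev(rewards) > delta:   # reward spread within the group -> usable advantages
            batch.append(group)                  # otherwise the group is discarded and rollout continues
    return batch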

Together, trajectory-level rewards, adaptive batch filling, and token-level optimization enable stable online RL for learning multi-turn, tool-augmented policies in fully executable and verifiable environments.

BibTeX

@misc{astra2026,
      title    = {ASTRA: Automated Synthesis of Agentic Trajectories and Reinforcement Arenas},
      author   = {Beike Language and Intelligence (BLI)},
      year     = {2026}
}