ASTRA: Automated Synthesis of agentic Trajectories and Reinforcement Arenas
ASTRA is an open-source project that generates high-quality, executable agent–environment interaction data for AI training and evaluation.
It first collects real MCP interactions to capture authentic, multi-turn tool usage, then synthesizes sandbox-verifiable environments by generating structured tool specifications together with Python implementations of the corresponding environment tools. This two-stage pipeline enables scalable, reliable training of multi-step, tool-augmented agents.
The videos below walk through the four stages of ASTRA’s environment synthesis pipeline and illustrate how executable environments are built automatically.
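As a rough illustration of what a synthesized environment might look like, the sketch below pairs a structured tool specification with an executable Python tool whose effects can be checked by a rule-based verifier. The schema, class names, and the flight-booking scenario are hypothetical, not ASTRA’s actual format.

```python
# Hypothetical sketch of a synthesized environment: a structured tool
# specification plus a Python implementation whose state changes can be
# verified by executing code in a sandbox. Names and schema are illustrative.
from dataclasses import dataclass, field


@dataclass
class ToolSpec:
    name: str
    description: str
    parameters: dict  # JSON-schema-like parameter description


@dataclass
class FlightBookingEnv:
    """A self-contained, executable environment with verifiable state."""
    bookings: dict = field(default_factory=dict)

    def book_flight(self, passenger: str, flight_id: str) -> dict:
        """Tool implementation: mutates environment state deterministically."""
        self.bookings[passenger] = flight_id
        return {"status": "confirmed", "flight_id": flight_id}

    def verify(self, expected_bookings: dict) -> bool:
        """Rule-based check that a sandbox run reached the expected state."""
        return self.bookings == expected_bookings


SPEC = ToolSpec(
    name="book_flight",
    description="Book a flight for a passenger.",
    parameters={"passenger": "string", "flight_id": "string"},
)

if __name__ == "__main__":
    env = FlightBookingEnv()
    env.book_flight("alice", "CA1234")
    assert env.verify({"alice": "CA1234"})  # executable, code-checkable signal
```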
Recent progress in tool-augmented large language models has shown promise, yet existing methods struggle in fully agentic, multi-turn settings: tool interactions are often simulated rather than executed, training data relies on human annotation, and reinforcement learning lacks rule-based, verifiable environments, leading to noisy rewards and unstable optimization.
To address these limitations, we propose an end-to-end, fully automated pipeline for tool-agent training based on executable trajectories. We construct a large-scale, high-quality data pipeline that combines large volumes of real MCP interactions with selective, session-consistent simulation for supervised fine-tuning, and we introduce a fully verifiable environment synthesis pipeline in which each step is grounded in executable code and validated by sandboxed execution, enabling scalable, rule-based reinforcement learning.
Our contributions are as follows:
Using this end-to-end automated framework, a 32B-scale open model trained with SFT followed by RL achieves performance approaching that of closed-source systems on complex multi-turn tool-use tasks, demonstrating that strong agentic capabilities can be learned without human-in-the-loop supervision.
Data Pipeline for Tool-grounded SFT. We construct an SFT-ready dataset via an end-to-end pipeline that enforces realism and executability throughout.
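One simple way to enforce executability at the data level is to replay each candidate trajectory against its sandboxed tools and keep only trajectories whose calls execute cleanly. The sketch below is a hypothetical filter with an assumed trajectory format, not ASTRA’s actual pipeline code.

```python
# Hypothetical executability filter for candidate SFT trajectories:
# replay each tool call in a sandboxed environment and keep only
# trajectories in which every call executes without error.
from typing import Any, Callable


def replay_ok(trajectory: list[dict], tools: dict[str, Callable[..., Any]]) -> bool:
    """Return True if every tool call in the trajectory executes cleanly."""
    for step in trajectory:
        if step.get("role") != "tool_call":
            continue
        fn = tools.get(step["name"])
        if fn is None:
            return False
        try:
            fn(**step.get("arguments", {}))
        except Exception:
            return False
    return True


def filter_trajectories(candidates, tools):
    """Keep only executable trajectories for SFT."""
    return [traj for traj in candidates if replay_ok(traj, tools)]
```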
Our objective is to enable RLVR (reinforcement learning with verifiable rewards) at scale with end-to-end automation, code-verifiable process signals, and no human labels. We therefore synthesize self-contained executable environments in which each intermediate step can be verified in a code sandbox, which makes the pipeline naturally scalable. These environments are then used to roll out multi-turn interactions and to derive process-level feedback for subsequent training.
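As an illustration, a step-level verifier over such an environment can be a plain Python check applied after each executed tool call. The interface below is hypothetical; it only shows how code-verifiable process signals can be derived without human labels.

```python
# Hypothetical step-level verifier: after each executed tool call, a
# rule-based check (itself plain Python) inspects the environment state
# and emits a binary process signal.
from typing import Callable


def verify_steps(env, tool_calls: list[dict],
                 checks: list[Callable[[object], bool]]) -> list[bool]:
    """Execute tool calls in order and verify each intermediate state."""
    signals = []
    for call, check in zip(tool_calls, checks):
        getattr(env, call["name"])(**call["arguments"])  # sandboxed execution
        signals.append(check(env))                       # code-verifiable signal
    return signals
```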
We train tool agents in two stages: Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL). SFT provides a strong behavioral prior by training on curated multi-turn tool trajectories, enabling basic capabilities such as tool invocation, workflow following, and long-context reasoning. Building on this initialization, RL further improves long-horizon decision making and tool-use efficiency in fully executable environments.
We adopt an online, multi-turn agentic RL paradigm. Each training instance corresponds to an independent simulated environment with no shared state. During rollout, the agent repeatedly generates tool calls, executes them in a sandboxed runtime, and conditions subsequent decisions on the full interaction history. A trajectory terminates when all sub-tasks are solved, a maximum interaction length is reached, or the agent stops issuing tool calls.
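A minimal sketch of this rollout loop might look like the following; the policy and environment interfaces (`policy.generate`, `env.execute`, `env.all_subtasks_solved`) are hypothetical placeholders rather than ASTRA’s actual API.

```python
# Minimal sketch of the multi-turn rollout loop described above.
# The policy and environment interfaces are hypothetical.
def rollout(policy, env, max_turns: int = 32) -> list[dict]:
    history = [{"role": "user", "content": env.task_description()}]
    for _ in range(max_turns):
        action = policy.generate(history)        # may contain tool calls
        history.append({"role": "assistant", "content": action.text})
        if not action.tool_calls:                # agent stops issuing tool calls
            break
        for call in action.tool_calls:
            result = env.execute(call)           # sandboxed execution
            history.append({"role": "tool", "content": result})
        if env.all_subtasks_solved():            # all sub-tasks completed
            break
    return history
```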
Reinforcement learning is driven by a trajectory-level reward that jointly captures task completion and interaction efficiency. Each instance is formalized as a job consisting of $n$ sub-tasks.
Suppose the agent successfully solves $\hat{n}$ of these sub-tasks using $c$ tool calls. We define sub-task recall $r = \hat{n}/n$ and tool-use precision $p = \hat{n}/c$. The final reward is computed as the harmonic mean of the two:

$$R = \frac{2\,r\,p}{r + p}.$$
This reward formulation explicitly encourages solving as many required sub-tasks as possible while minimizing redundant tool invocations, providing a structured learning signal over entire trajectories.
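Following the definitions above, the reward reduces to a few lines of code. The sketch below is illustrative; the edge-case handling is an assumption rather than the exact implementation.

```python
# Trajectory-level reward: harmonic mean of sub-task recall and
# tool-use precision, as defined above.
def trajectory_reward(n_solved: int, n_subtasks: int, n_tool_calls: int) -> float:
    if n_solved == 0 or n_subtasks == 0 or n_tool_calls == 0:
        return 0.0
    recall = n_solved / n_subtasks        # r = n_hat / n
    precision = n_solved / n_tool_calls   # p = n_hat / c
    return 2 * recall * precision / (recall + precision)
```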
We optimize the policy using a GRPO-style objective. The original GRPO objective is defined as:

$$\mathcal{J}_{\mathrm{GRPO}}(\theta) = \mathbb{E}_{q,\ \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{\mathrm{old}}}(\cdot \mid q)} \left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \left( \min\!\Big( \rho_{i,t}(\theta)\, \hat{A}_{i,t},\ \operatorname{clip}\big(\rho_{i,t}(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\, \hat{A}_{i,t} \Big) - \beta\, \mathbb{D}_{\mathrm{KL}}\big[\pi_{\theta} \,\Vert\, \pi_{\mathrm{ref}}\big] \right) \right],$$

where $\rho_{i,t}(\theta) = \frac{\pi_{\theta}(o_{i,t} \mid q,\, o_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(o_{i,t} \mid q,\, o_{i,<t})}$ is the token-level importance ratio and $\hat{A}_{i,t}$ is the group-normalized advantage derived from the trajectory-level rewards $\{R_i\}_{i=1}^{G}$.
In practice, we remove KL regularization and entropy bonuses. We adopt Adaptive Batch Filling, which buffers valid trajectories and continues rollout generation until a full batch of $n$ effective samples is collected. We further adopt a token-level policy gradient loss; the resulting final reinforcement learning objective is sketched below.
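Under these modifications, a plausible form of the final objective, assuming the clipped importance-weighted loss is normalized over all tokens in a group rather than per trajectory (our reconstruction, which may differ from the exact implementation), is:

$$\mathcal{J}(\theta) = \mathbb{E}_{q,\ \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{\mathrm{old}}}(\cdot \mid q)} \left[ \frac{1}{\sum_{i=1}^{G} |o_i|} \sum_{i=1}^{G} \sum_{t=1}^{|o_i|} \min\!\Big( \rho_{i,t}(\theta)\, \hat{A}_{i},\ \operatorname{clip}\big(\rho_{i,t}(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\, \hat{A}_{i} \Big) \right],$$

where $\hat{A}_i$ is the group-normalized trajectory-level reward and $\rho_{i,t}(\theta)$ is the token-level importance ratio defined above.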
Together, trajectory-level rewards, adaptive batch filling, and token-level optimization enable stable online RL for learning multi-turn, tool-augmented policies in fully executable and verifiable environments.
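Adaptive Batch Filling itself can be sketched as a simple buffering loop. The function names and validity criterion below are placeholders rather than ASTRA’s actual implementation.

```python
# Hypothetical sketch of Adaptive Batch Filling: keep generating rollouts
# and buffering valid trajectories until a full batch of n effective
# samples is available for a policy update.
def fill_batch(generate_rollout, is_valid, n: int) -> list:
    buffer = []
    while len(buffer) < n:
        traj = generate_rollout()   # online rollout in an executable environment
        if is_valid(traj):          # e.g. a non-degenerate reward group
            buffer.append(traj)
    return buffer[:n]
```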
@misc{astra2026,
  title  = {{ASTRA}: Automated Synthesis of agentic Trajectories and Reinforcement Arenas},
  author = {{Beike Language and Intelligence (BLI)}},
  year   = {2026}
}