DRIFT is an open-source online self-evolution policy optimization framework designed to enable continuous reasoning improvements in large language models without relying on external expert supervision.
Through the coordinated integration of dynamic difficulty routing, token-level rhythm gating, two-stage curriculum learning and robust experience replay, DRIFT creates a structured self-improving loop that ensures reliable and sustained policy improvement across diverse domains.
Achieving stable self-evolution in large language models without external expert supervision remains a central challenge in complex reasoning and scientific problem solving. Existing self-distillation and reinforcement learning methods often suffer from imprecise credit assignment: coarse trajectory-level signals fail to distinguish the relative importance of different reasoning steps. Moreover, they typically treat model-generated trajectories as static supervision, without explicitly modeling learning progress or dynamically adjusting the training strategy. As a result, they may over-optimize easy problems, provide insufficient guidance on hard problems, and inadequately explore boundary cases. To address these issues, we propose DRIFT, an online self-evolution framework for post-training large language models.
Our contributions are as follows:
We evaluate DRIFT on diverse task categories, including biology, chemistry, materials science, physics, and tool use. Experimental results show that DRIFT significantly outperforms reproduced SDPO and GRPO baselines, and further improves upon recent methods such as SRPO, demonstrating stronger training stability and better cross-domain generalization.
Training Framework for DRIFT. We integrate on-policy self-distillation and online reinforcement learning into a coherent self-evolution loop guided by a two-stage curriculum.
The overall hybrid training objective is a difficulty-routed combination of self-distillation on incorrect trajectories and GRPO with Rhythm Gating on correct trajectories, where the combination coefficient $\gamma_i$ is determined by difficulty routing.
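To make the routed combination concrete, here is a minimal Python sketch. It is not the DRIFT objective, whose exact form is deferred to the paper. It assumes that $\gamma_i$ is a per-trajectory weight in [0, 1], that $\gamma_i$ scales the self-distillation term while $1 - \gamma_i$ scales the GRPO term, and that both per-trajectory losses are already computed. The names `Sample` and `routed_loss` are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Sample:
    correct: bool        # True if the trajectory reached a correct answer
    distill_loss: float  # self-distillation loss (assumed precomputed)
    grpo_loss: float     # GRPO loss after rhythm gating (assumed precomputed)
    gamma: float         # routing weight in [0, 1] (assumed form of gamma_i)


def routed_loss(batch: Sequence[Sample]) -> float:
    """Batch mean of per-trajectory losses under the assumed routing rule."""
    if not batch:
        raise ValueError("batch must not be empty")
    total = 0.0
    for s in batch:
        if not 0.0 <= s.gamma <= 1.0:
            raise ValueError("gamma must lie in [0, 1]")
        if s.correct:
            # Correct trajectories contribute only the GRPO term.
            total += (1.0 - s.gamma) * s.grpo_loss
        else:
            # Incorrect trajectories contribute only the distillation term.
            total += s.gamma * s.distill_loss
    return total / len(batch)
```

Under this assumed rule, a larger `gamma` puts more weight on self-distillation for a failed trajectory and less on GRPO for a successful one. The actual routing in DRIFT may differ.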
Further details on the theoretical motivation, underlying insights, and mathematical demonstration will be provided in our upcoming paper.
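We evaluated DRIFT on five multi-turn reasoning and academic benchmarks. DRIFT achieves the highest average score (79.5) and the best results on Biology and Tool Use, and it stays within 1.1 points of the best reported score on Chemistry, Materials, and Physics: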
| Method / Model | Biology | Chemistry | Materials | Physics | Tool Use | Average |
|---|---|---|---|---|---|---|
| Qwen3-8B | 30.8 | 41.2 | 58.9 | 59.2 | 57.5 | 49.5 |
| SDPO (Paper) | 56.8 | 80.9 | 78.4 | 75.6 | 68.5 | 72.0 |
| SDPO (Reproduced) | 64.8 | 78.9 | 76.1 | 72.7 | 67.7 | 72.0 |
| GRPO (Paper) | 59.9 | 74.5 | 77.1 | 72.7 | 65.7 | 70.0 |
| GRPO (Reproduced) | 47.4 | 65.6 | 73.5 | 60.6 | 67.7 | 63.0 |
| SRPO (Paper) | 72.8 | 83.0 | 81.5 | 78.4 | 71.2 | 77.4 |
| DistIL (Paper) | 66.6 | 80.8 | 76.2 | 80.8 | - | - |
| SC-SDPO (Paper) | 65.4 | 80.6 | 79.3 | 81.6 | 67.3 | 74.8 |
| PGPO (Paper) | 61.3 | 77.6 | 78.7 | 77.6 | - | - |
| DRIFT (Ours) | 74.4 | 82.0 | 81.4 | 80.5 | 79.2 | 79.5 |
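The Average column is consistent with an unweighted mean of the five domain scores, rounded to one decimal place. The short check below recomputes it for three rows copied from the table. The averaging rule is inferred from the numbers rather than stated in the source, and the helper name `domain_average` is hypothetical.

```python
# Domain scores copied from the table, in the order:
# Biology, Chemistry, Materials, Physics, Tool Use.
scores = {
    "Qwen3-8B": [30.8, 41.2, 58.9, 59.2, 57.5],
    "SRPO (Paper)": [72.8, 83.0, 81.5, 78.4, 71.2],
    "DRIFT (Ours)": [74.4, 82.0, 81.4, 80.5, 79.2],
}


def domain_average(values):
    """Unweighted mean rounded to one decimal (assumed averaging rule)."""
    return round(sum(values) / len(values), 1)


averages = {name: domain_average(v) for name, v in scores.items()}

for name, avg in averages.items():
    print(f"{name}: {avg}")
```

This reproduces the reported averages of 49.5, 77.4, and 79.5 for these rows. Rows with a missing Tool Use score have no reported average and are left out.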
@misc{beike2026drift,
title = {DRIFT: Difficulty Routing Self-Distillation with Rhythm Gating Exploration and Success Buffer Training},
author = {Haisen Luo and Yiwei Liu and Haoning Wang and Dan Liu and Junxi Yin and Haotian Wang and Lei Zhang and Xiaoyu Tian and Shuaiting Chen and Yuansheng Song and Baoyan Guo and Xiongfei Yan and Bolan Yang and Chengwei Liu and Ming Cui and Jiong Chen},
year = {2026},
eprint = {2606.30345},
archivePrefix = {arXiv},
primaryClass = {cs.CL},
url = {https://github.com/LianjiaTech/drift},
}