The rStar2-Agent technical report introduces rStar2-Agent-14B, a 14-billion-parameter math reasoning model that reaches top-tier performance by learning to "think smarter" through **agentic reinforcement learning (RL)** rather than merely "thinking longer". The model is trained to interact with **Python coding tools** and to reflect on feedback from code execution, autonomously exploring, verifying, and refining its problem-solving steps. Its effectiveness rests on three main innovations:

- an efficient RL infrastructure that supports high-throughput code execution and mitigates high rollout costs on limited GPU resources;
- **GRPO-RoC**, an agentic RL algorithm whose "Resample-on-Correct" strategy handles the noise inherent in coding-tool feedback by focusing training on high-quality successful reasoning traces;
- an efficient multi-stage training recipe that begins with non-reasoning supervised fine-tuning (SFT) before progressing to RL stages.

With this recipe, rStar2-Agent-14B reached frontier-level math reasoning in only 510 RL steps within one week, scoring 80.6% on AIME24 and 69.8% on AIME25. It outperforms much larger models such as DeepSeek-R1 (671B) while producing significantly shorter responses, and also generalizes well to scientific reasoning and other agentic tool-use tasks.
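The "Resample-on-Correct" idea can be sketched roughly as follows: oversample rollouts, then downsample back to the group size, keeping the cleanest correct traces while randomly sampling incorrect ones to preserve error diversity. This is a minimal illustration, not the paper's implementation; the field names (`reward`, `tool_errors`, `fmt_penalty`) and the half/half split are assumptions for illustration.

```python
import random

def roc_downsample(rollouts, group_size):
    """Sketch of a Resample-on-Correct downsampling step.

    Each rollout is a dict with illustrative fields:
      reward      - 1.0 if the final answer is correct, else 0.0
      tool_errors - number of failed tool (code-execution) calls in the trace
      fmt_penalty - formatting-violation score (lower is better)
    """
    positives = [r for r in rollouts if r["reward"] > 0]
    negatives = [r for r in rollouts if r["reward"] == 0]

    # Assumed split: aim for roughly half correct, half incorrect traces.
    n_pos = min(len(positives), group_size // 2)
    n_neg = group_size - n_pos

    # Correct traces: keep the cleanest ones (fewest tool errors,
    # smallest formatting penalty), filtering out noisy successes.
    positives.sort(key=lambda r: (r["tool_errors"], r["fmt_penalty"]))
    kept = positives[:n_pos]

    # Incorrect traces: uniform random downsampling keeps failure modes diverse.
    kept += random.sample(negatives, min(n_neg, len(negatives)))
    return kept
```

In this sketch, asymmetry is the point: successes are filtered by quality so the policy imitates clean tool use, while failures are kept unfiltered so the penalty signal stays unbiased.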
https://arxiv.org/pdf/2508.20722
- Category
- Artificial Intelligence & Business
- Tags
- AI research, machine learning, deep learning

