Qiyuan Tech and Renmin University researchers developed Light-R1, an open-source suite for training long Chain-of-Thought reasoning models using public data, achieving state-of-the-art performance on mathematical reasoning benchmarks. Their Light-R1-32B model, trained for roughly $1,000, surpassed DeepSeek-R1-Distill-Qwen-32B on AIME24 (76.6 vs. 72.6) and AIME25 (64.6 vs. 54.9), while the 14B variant gained roughly 2 points (absolute) on AIME24 through Reinforcement Learning, without the response-length reduction typically seen.
The DCache framework accelerates diffusion-based Large Language Models (dLLMs) with a training-free approximate Key-Value (KV) cache. It delivers average inference-throughput speedups of 3.2x to 4.0x over vanilla dLLM inference while maintaining or improving generation quality across various benchmarks.
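The paper's exact caching policy isn't reproduced here, but the core idea, reusing slightly stale K/V projections across adjacent denoising steps and refreshing them periodically, can be sketched in a few lines. Everything below (`compute_kv`, `cached_denoising_loop`, `refresh_every`) is an illustrative assumption, not DCache's actual API:

```python
import torch

def compute_kv(hidden, w_k, w_v):
    """Project hidden states (seq, dim) to keys and values."""
    return hidden @ w_k, hidden @ w_v

def cached_denoising_loop(x, w_k, w_v, num_steps=16, refresh_every=4):
    """Single-head attention loop that reuses cached K/V between refreshes."""
    k_cache = v_cache = None
    for step in range(num_steps):
        if k_cache is None or step % refresh_every == 0:
            # Exact recomputation: project every token's K/V this step.
            k_cache, v_cache = compute_kv(x, w_k, w_v)
        # Between refreshes, attend against the stale-but-close cache,
        # skipping the per-step K/V projections entirely.
        attn = torch.softmax(x @ k_cache.T / k_cache.shape[-1] ** 0.5, dim=-1)
        x = attn @ v_cache  # simplified block: attention only, no MLP
    return x

# Toy usage: 8 tokens, 32-dim hidden states.
x = torch.randn(8, 32)
w = torch.randn(32, 32)
out = cached_denoising_loop(x, w, w)
```

The refresh interval trades accuracy for speed: hidden states drift slowly between adjacent denoising steps, so stale projections stay close to the exact ones.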
Researchers from Peking University and Qiyuan Tech developed LongRePS, a process-supervised framework that trains language models to generate high-quality reasoning paths for improved long-context performance. The framework significantly enhances reasoning capabilities, achieving gains of up to 13.6 points on individual datasets and enabling smaller open-source models to perform comparably to larger proprietary models on long-context reasoning tasks.
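As a rough illustration of what process supervision over reasoning paths looks like in practice, the sketch below bootstraps training data by sampling several candidate paths per question, scoring each whole path rather than only its final answer, and keeping the best one as a fine-tuning target. The scoring rule and every function name here are assumptions for illustration, not LongRePS's actual pipeline:

```python
import random

def toy_model(question):
    # Stand-in for an LLM call: returns a random "reasoning path".
    steps = [f"step {i}" for i in range(random.randint(1, 12))]
    return {"steps": steps, "answer": random.choice(["A", "B"])}

def path_score(path, gold_answer):
    """Toy process score: final-answer correctness plus a small bonus
    for paths with more intermediate steps (capped at 10)."""
    correct = 1.0 if path["answer"] == gold_answer else 0.0
    return correct + 0.1 * min(len(path["steps"]), 10) / 10

def build_sft_data(model, dataset, n_samples=8):
    """Sample n paths per question; keep the best correct-answer path."""
    sft = []
    for question, gold in dataset:
        paths = [model(question) for _ in range(n_samples)]
        best = max(paths, key=lambda p: path_score(p, gold))
        if best["answer"] == gold:  # discard questions with no correct path
            sft.append({"question": question, "target": best})
    return sft

data = build_sft_data(toy_model, [("q1", "A"), ("q2", "B")])
```

Scoring the path rather than only the answer is what makes the supervision "process-level": a lucky guess with an incoherent trace gets filtered out.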
LongAttn introduces a framework that selects high-quality long-context training data for language models by analyzing token-level attention mechanisms. The method, developed by researchers from Peking University and Qiyuan Tech, consistently improves performance on long-context tasks such as Needle In A Haystack and RULER while reducing the required training-data volume compared to existing methods.
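A minimal version of attention-based data selection can be sketched as follows: run a probe model over each candidate document, measure how much attention mass lands on tokens beyond a distance threshold, and keep the top-scoring documents. The window size, scoring rule, and function names are assumptions, not LongAttn's exact criteria:

```python
import torch

def long_range_score(attn, window=256):
    """attn: (heads, seq, seq) attention weights from a probe model.
    Returns the total attention mass placed on tokens more than
    `window` positions back, averaged over heads."""
    seq = attn.shape[-1]
    q_idx = torch.arange(seq).view(-1, 1)
    k_idx = torch.arange(seq).view(1, -1)
    distant = (q_idx - k_idx) > window  # long-range causal positions
    return attn[:, distant].sum(dim=-1).mean().item()

def select_documents(attn_maps, keep_ratio=0.2):
    """Rank documents by long-range score; keep the top fraction."""
    ranked = sorted(range(len(attn_maps)),
                    key=lambda i: long_range_score(attn_maps[i]),
                    reverse=True)
    return ranked[:max(1, int(len(attn_maps) * keep_ratio))]

# Toy usage: five documents' attention maps (4 heads, 512 tokens each).
maps = [torch.softmax(torch.randn(4, 512, 512), dim=-1) for _ in range(5)]
kept = select_documents(maps)
```

Documents whose tokens genuinely attend far back are kept; documents that only ever attend locally contribute little to long-context training and are dropped.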
Router Upcycling introduces a method that leverages the attention modules of pre-trained dense models to initialize a mixture-of-routers for Mixture-of-Experts (MoE) upcycling. This approach achieves state-of-the-art performance, outperforming vanilla MoE upcycling by 2.05 points on average across ten benchmarks, while adding negligible computational overhead.
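To make the initialization concrete, here is a hedged sketch of one plausible reading: split a dense layer's attention query projection into per-head routers, let each router score the experts against a learned key table, and average the scores. The aggregation scheme and all shapes are illustrative assumptions rather than the paper's exact formulation:

```python
import torch
import torch.nn as nn

class UpcycledMixtureOfRouters(nn.Module):
    """Mixture of routers initialized from a dense model's query projection."""
    def __init__(self, w_q, num_heads, num_experts):
        super().__init__()
        d_model = w_q.shape[0]
        head_dim = d_model // num_heads
        self.routers = nn.ModuleList()
        for h in range(num_heads):
            # Each head's slice of the query projection becomes one router.
            proj = nn.Linear(d_model, head_dim, bias=False)
            proj.weight.data.copy_(w_q[:, h * head_dim:(h + 1) * head_dim].T)
            self.routers.append(proj)
        # Routers score experts against a learned expert-key table.
        self.expert_keys = nn.Parameter(torch.randn(num_experts, head_dim))

    def forward(self, x):  # x: (tokens, d_model)
        logits = torch.stack([r(x) @ self.expert_keys.T for r in self.routers])
        return torch.softmax(logits.mean(dim=0), dim=-1)  # (tokens, num_experts)

# Toy usage: pretend w_q comes from a dense checkpoint.
router = UpcycledMixtureOfRouters(torch.randn(64, 64), num_heads=4, num_experts=8)
probs = router(torch.randn(10, 64))
```

Starting the routers from pretrained attention weights, rather than from random initialization, is what makes the approach an "upcycling" of the dense model's existing structure.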