R1-Searcher, from Renmin University of China, introduces a two-stage outcome-based reinforcement learning framework that enables Large Language Models to autonomously invoke and leverage external search systems. This approach significantly outperforms strong RAG baselines on multi-hop question answering benchmarks and demonstrates robust generalization to out-of-domain and online search scenarios.
R1-Searcher++ presents a framework that enables large language models to dynamically choose between using their internal knowledge and performing external searches, while also allowing them to internalize retrieved information. This approach improves performance on multi-hop question answering tasks and significantly reduces the number of external retrieval calls compared to prior methods.
Researchers from Renmin University and BAAI present a comprehensive empirical study of reinforcement learning techniques for enhancing LLM reasoning capabilities. By combining novel reward engineering with external computation tools during RL training, their approach achieves 86.67% accuracy on AIME 2024 mathematics problems.
This paper from Renmin University, DataCanvas, and BAAI presents a systematic approach to improving the training effectiveness and stability of tool-augmented reinforcement learning for code-integrated reasoning in large language models. The method achieves state-of-the-art performance on mathematical reasoning benchmarks and provides mechanistic insights into how code integration extends model capabilities, while offering greater efficiency than traditional reasoning methods.