Noah’s Ark Lab · Huawei
Divide and Conquer: Language Models can Plan and Self-Correct for Compositional Text-to-Image Generation
Despite significant advancements in text-to-image models for generating high-quality images, these methods still struggle to ensure that complex text prompts retain control over the generated images, especially when it comes to preserving object attributes and relationships. In this paper, we propose CompAgent, a training-free approach for compositional text-to-image generation, with a large language model (LLM) agent as its core. The fundamental idea underlying CompAgent is a divide-and-conquer methodology. Given a complex text prompt containing multiple concepts including objects, attributes, and relationships, the LLM agent first decomposes it, extracting individual objects and their associated attributes and predicting a coherent scene layout. These individual objects can then be conquered independently. Subsequently, the agent reasons over the text, then plans and employs tools to compose these isolated objects. A verification and human-feedback mechanism is finally incorporated into our agent to correct potential attribute errors and refine the generated images. Guided by the LLM agent, we propose a tuning-free multi-concept customization model and a layout-to-image generation model as the tools for concept composition, and a local image editing method as the tool for interacting with the agent during verification. The scene layout controls the image generation process among these tools to prevent confusion among multiple objects. Extensive experiments demonstrate the superiority of our approach for compositional text-to-image generation: CompAgent achieves more than a 10% improvement on T2I-CompBench, a comprehensive benchmark for open-world compositional T2I generation. Extensions to various related tasks also illustrate the flexibility of our CompAgent for potential applications.
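
For illustration, a minimal sketch of this divide-and-conquer loop is given below; the interfaces (llm, tools, and their methods) are hypothetical stand-ins for the paper's components, not a released API.

```python
# Hypothetical sketch of the CompAgent divide-and-conquer loop.
# All interfaces (llm, tools) are assumptions for illustration only.

def compagent(prompt: str, llm, tools, max_rounds: int = 3):
    # 1. Divide: the LLM agent extracts objects, attributes, and a scene layout.
    plan = llm.decompose(prompt)  # -> {"objects": [...], "attributes": {...}, "layout": [...]}

    # 2. Conquer: each object is generated independently so attributes don't leak.
    concepts = [tools.generate_concept(obj, plan["attributes"][obj])
                for obj in plan["objects"]]

    # 3. Compose: the agent picks a tool (multi-concept customization or
    #    layout-to-image) based on its analysis of the text; the layout
    #    constrains generation to avoid confusion between objects.
    tool = llm.select_tool(prompt, plan)
    image = tool.compose(concepts, layout=plan["layout"])

    # 4. Verify and correct: attribute errors found by the verifier are fixed
    #    with local image editing, optionally guided by human feedback.
    for _ in range(max_rounds):
        errors = llm.verify(image, prompt)
        if not errors:
            break
        image = tools.local_edit(image, errors, layout=plan["layout"])
    return image
```
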
From Code Foundation Models to Agents and Applications: A Comprehensive Survey and Practical Guide to Code Intelligence

A comprehensive synthesis of Large Language Models for automated software development covers the entire model lifecycle, from data curation to autonomous agents, and offers practical guidance derived from empirical experiments on pre-training, fine-tuning, and reinforcement learning, alongside a detailed analysis of challenges and future directions.

SWE-Exp: Experience-Driven Software Issue Resolution

SWE-Exp introduces an experience-enhanced framework that enables Large Language Model (LLM) agents to learn from past software issue resolution attempts, transforming problem-solving into a continuous learning process. This approach achieved a 41.6% Pass@1 score on SWE-bench-Verified with DeepSeek-V3-0324, representing a 7.2% relative improvement over prior methods using the same model.
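
As a rough illustration of the experience-reuse idea, the sketch below implements a toy experience bank; the data model and lexical retrieval here are assumptions for illustration, not SWE-Exp's actual design.

```python
# Illustrative experience bank in the spirit of SWE-Exp (hypothetical design).
from dataclasses import dataclass, field

@dataclass
class Experience:
    issue_summary: str
    strategy: str   # what was tried (e.g., "patch config loader")
    outcome: bool   # did the attempt resolve the issue?
    lesson: str     # distilled takeaway reused on future issues

@dataclass
class ExperienceBank:
    items: list[Experience] = field(default_factory=list)

    def add(self, exp: Experience) -> None:
        self.items.append(exp)

    def retrieve(self, issue: str, k: int = 3) -> list[Experience]:
        # Toy lexical overlap; a real system would use embeddings.
        def score(exp: Experience) -> int:
            return len(set(issue.lower().split()) & set(exp.issue_summary.lower().split()))
        return sorted(self.items, key=score, reverse=True)[:k]

bank = ExperienceBank()
bank.add(Experience("KeyError in config parser", "add default value", True,
                    "check optional config keys before access"))
for exp in bank.retrieve("KeyError when parsing config file"):
    print(exp.lesson)  # lessons are injected into the agent's prompt
```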

Diffusion of Thoughts: Chain-of-Thought Reasoning in Diffusion Language Models

This paper introduces "Diffusion of Thought (DoT)", a method for integrating Chain-of-Thought (CoT) reasoning into diffusion language models. The approach demonstrates high accuracy and significant efficiency gains on reasoning tasks, outperforming autoregressive counterparts while offering a flexible computation-accuracy trade-off.

StreamForest: Efficient Online Video Understanding with Persistent Event Memory

StreamForest introduces a novel multimodal large language model architecture designed for efficient online video understanding, leveraging a persistent event memory and a fine-grained spatiotemporal window. The model demonstrates state-of-the-art performance on streaming video benchmarks, including a new autonomous driving benchmark, and maintains robust accuracy even under extreme visual token compression.

SRUM: Fine-Grained Self-Rewarding for Unified Multimodal Models

SRUM is a post-training framework that enables Unified Multimodal Models (UMMs) to improve their image generation capabilities by using their internal understanding module as a self-sufficient evaluator. It achieves an 88.37 overall score on T2I-CompBench, representing a +3.91 point improvement over the baseline, and demonstrates strong generalization across various complex compositional and reasoning tasks.
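
A hedged sketch of the self-rewarding idea: the model's own understanding module scores its generations, and that feedback post-trains the generator. All interfaces below are hypothetical.

```python
# Hypothetical self-rewarding loop in the spirit of SRUM; not the paper's API.

def self_reward_step(umm, prompt: str, n_candidates: int = 4):
    # The generation module proposes several images for the same prompt.
    images = [umm.generate(prompt) for _ in range(n_candidates)]

    # The understanding module acts as the evaluator, producing fine-grained
    # feedback (e.g., per-object attribute correctness) rather than one scalar.
    rewards = [umm.understand_and_score(image, prompt) for image in images]

    # The reward signal is fed back to post-train the generator, e.g. via
    # preference optimization over the best vs. worst candidates.
    best = images[max(range(n_candidates), key=lambda i: rewards[i])]
    worst = images[min(range(n_candidates), key=lambda i: rewards[i])]
    umm.update_generator(prompt, chosen=best, rejected=worst)
```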

SWE-Debate: Competitive Multi-Agent Debate for Software Issue Resolution

Researchers from Shanghai Jiao Tong University, Huawei, and Xidian University developed SWE-Debate, a framework that leverages competitive multi-agent debate and graph-guided fault localization for resolving repository-level software issues. This approach achieved a 41.4% Pass@1 success rate on the SWE-Bench-Verified dataset and an 81.67% file-level localization accuracy on SWE-Bench-Lite.

AutoSchemaKG: Autonomous Knowledge Graph Construction through Dynamic Schema Induction from Web-Scale Corpora
01 Aug 2025

The AutoSchemaKG framework constructs knowledge graphs autonomously from web-scale corpora by dynamically inducing schemas, integrating entity, event, and concept nodes. This approach generates billion-scale KGs that enhance the factuality and multi-hop question answering capabilities of large language models.
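
The sketch below illustrates the two ingredients named above, open triple extraction plus dynamic schema induction, under assumed LLM interfaces; it is not the released pipeline.

```python
# Illustrative two-stage loop in the spirit of AutoSchemaKG; the LLM
# interfaces and node types are assumptions for illustration.

def build_kg(documents, llm):
    triples, schema = [], {}
    for doc in documents:
        # Stage 1: open extraction of entity and event triples, no fixed schema.
        for (head, relation, tail) in llm.extract_triples(doc):
            triples.append((head, relation, tail))
            # Stage 2: dynamic schema induction: the LLM abstracts each node
            # into a concept (e.g., "aspirin" -> "medication"), growing the
            # schema as new concepts appear instead of using a predefined one.
            for node in (head, tail):
                schema.setdefault(node, llm.conceptualize(node))
    return triples, schema
```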

Diffusion LLM with Native Variable Generation Lengths: Let [EOS] Lead the Way

dLLM-Var, developed at Shanghai Jiao Tong University's EPIC Lab, enables diffusion-based large language models to generate text with native variable lengths, achieving a 30.1x speedup over traditional dLLMs and a 2.4x speedup over autoregressive models, while maintaining competitive accuracy and demonstrating self-correction capabilities.

DeepDiver: Adaptive Search Intensity Scaling via Open-Web Reinforcement Learning
10 Nov 2025
Information seeking demands iterative evidence gathering and reflective reasoning, yet large language models (LLMs) still struggle with it in open-web question answering. Existing prompting and supervised fine-tuning (SFT) methods remain fixed by prompt rules or training corpora, and are usually benchmarked only on well-structured wiki sources, limiting real-world adaptability. We introduce WebPuzzle, a 24k-sample training and 275-sample test benchmark that evaluates information seeking on the live internet, across both wiki and open-domain queries. Leveraging 7k WebPuzzle instances, we develop DeepDiver, a reinforcement learning (RL) framework that cultivates Search Intensity Scaling (SIS), an emergent ability to escalate search frequency and depth instead of settling on overconfident, under-evidenced answers. With SIS, Qwen2.5-7B-Instruct and Pangu-7B-Reasoner attain performance on real-web tasks comparable to the 671B-parameter DeepSeek-R1. We detail DeepDiver's curriculum from cold-start SFT to a well-designed RL procedure, and show that its seeking policy generalizes from closed-ended queries to open-ended generation such as long-form writing. Our results advance adaptive information seeking in LLMs and provide a rigorous benchmark for future work.
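
A minimal sketch of Search Intensity Scaling as described: the policy keeps escalating search rather than committing to an under-evidenced answer. All interfaces are hypothetical.

```python
# Hypothetical SIS-style seeking loop; interfaces are assumptions for illustration.

def answer_with_sis(question: str, policy, search, max_steps: int = 8):
    evidence = []
    for step in range(max_steps):
        action = policy.decide(question, evidence)  # "search" or "answer"
        if action.kind == "answer":
            return action.text
        # Escalate intensity: later rounds issue more, and more specific, queries
        # instead of settling on an overconfident early answer.
        for query in action.queries:
            evidence.extend(search(query))
    # Fall back to the best-supported answer once the budget is exhausted.
    return policy.force_answer(question, evidence)
```
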
VTimeCoT: Thinking by Drawing for Video Temporal Grounding and Reasoning

VTimeCoT equips multimodal large language models with a "visual clock" to perform precise video temporal grounding and reasoning by allowing them to "think by drawing" directly on video timelines. The framework, developed by researchers from Shanghai Jiao Tong University, Noah’s Ark Lab, and Imperial College London, achieved substantial performance gains on temporal grounding and long-video question answering benchmarks without requiring additional training.

Process vs. Outcome Reward: Which is Better for Agentic RAG Reinforcement Learning
08 Dec 2025
Retrieval-augmented generation (RAG) enhances the text generation capabilities of large language models (LLMs) by integrating external knowledge and up-to-date information. However, traditional RAG systems are limited by static workflows and lack the adaptability required for multistep reasoning and complex task management. To address these limitations, agentic RAG systems (e.g., DeepResearch) have been proposed, enabling dynamic retrieval strategies, iterative context refinement, and adaptive workflows for handling complex search queries beyond the capabilities of conventional RAG. Recent advances, such as Search-R1, have demonstrated promising gains using outcome-based reinforcement learning, where the correctness of the final answer serves as the reward signal. Nevertheless, such outcome-supervised agentic RAG methods face challenges including low exploration efficiency, gradient conflict, and sparse reward signals. To overcome these challenges, we propose to utilize fine-grained, process-level rewards to improve training stability, reduce computational costs, and enhance efficiency. Specifically, we introduce a novel method ReasonRAG that automatically constructs RAG-ProGuide, a high-quality dataset providing process-level rewards for (i) query generation, (ii) evidence extraction, and (iii) answer generation, thereby enhancing the model's inherent capabilities via process-supervised reinforcement learning. With the process-level policy optimization, the proposed framework empowers LLMs to autonomously invoke search, generate queries, extract relevant evidence, and produce final answers. Compared to existing approaches such as Search-R1 and traditional RAG systems, ReasonRAG, leveraging RAG-ProGuide, achieves superior performance on five benchmark datasets using only 5k training instances, significantly fewer than the 90k training instances required by Search-R1.
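
The sketch below illustrates process-level reward assignment over the three step types named above; the judge interface is an assumption for illustration.

```python
# Hypothetical process-level reward assignment in the spirit of ReasonRAG:
# each intermediate step gets its own reward instead of only the final answer.

def process_rewards(trajectory, judge):
    """trajectory: list of (step_type, content) for one agentic RAG rollout."""
    rewards = []
    for step_type, content in trajectory:
        if step_type == "query":
            r = judge.query_quality(content)        # is the search query useful?
        elif step_type == "evidence":
            r = judge.evidence_relevance(content)   # was relevant evidence kept?
        elif step_type == "answer":
            r = judge.answer_correctness(content)   # outcome reward still applies
        else:
            r = 0.0  # unscored step types contribute no signal
        rewards.append((step_type, r))
    return rewards  # dense signal for process-supervised policy optimization
```
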
OSUM-EChat: Enhancing End-to-End Empathetic Spoken Chatbot via Understanding-Driven Spoken Dialogue

The OSUM-EChat system developed by the Audio, Speech and Language Processing Group at Northwestern Polytechnical University enhances end-to-end empathetic spoken chatbots by integrating an understanding-driven training strategy and a linguistic-paralinguistic dual think mechanism. It achieved a GPT-4 score of 72.0 on a new EChat-eval benchmark for multi-label empathy, demonstrating improved empathetic responsiveness and efficient speech understanding without relying on massive, proprietary datasets.

GRPO-λ: Credit Assignment Improves LLM Reasoning
Large language models (LLMs) are increasingly deployed for tasks requiring complex reasoning, prompting significant interest in improving their reasoning abilities through post-training. In particular, RL-based methods using verifiable rewards, like the state-of-the-art GRPO, have been shown to tremendously improve reasoning behaviors when applied as post-training methods. However, the lack of an explicit reward or critic model limits GRPO's ability to assign fine-grained credit across token sequences. In this work, we present GRPO-λ, a novel extension to GRPO that enhances credit assignment in RL fine-tuning of LLMs for complex reasoning tasks. We approximate learning from the λ-return with a reformulation of eligibility traces using token-level log-probabilities applied after each sequence generation, and a novel critic-free approximation of the temporal-difference error. We introduce several variations for weighting the λ-return and applying it to the eligibility trace, all of which provide significant gains over GRPO. We compare GRPO-λ against GRPO by training models from 1.5B to 7B parameters on 4 different math reasoning datasets. The training plots demonstrate 30-40% improved performance during RL training on both LLaMA-3.1 and Qwen-2.5 architectures. Finally, we show that with GRPO-λ, the resulting average performance on AIME24, Math500, OlympiadMath, MinervaMath, and AMC improves over GRPO by over 3 points, with a 4.5-point improvement on the 7B model.
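
The sketch below conveys only the core idea, replacing GRPO's uniform per-sequence advantage with decayed per-token weights; the paper's actual formulation additionally uses token-level log-probabilities and a critic-free TD-error approximation that this toy version omits.

```python
# Toy illustration of λ-style per-token credit on top of a GRPO-style
# group-relative advantage. NOT the paper's exact formulation.
import numpy as np

def grpo_lambda_weights(group_rewards, seq_lens, lam=0.95):
    """group_rewards: one scalar reward per sampled sequence in the group.
    seq_lens: token count of each sequence."""
    group_rewards = np.asarray(group_rewards, dtype=float)
    # Critic-free, group-relative advantage as in GRPO.
    adv = (group_rewards - group_rewards.mean()) / (group_rewards.std() + 1e-8)
    weights = []
    for a, T in zip(adv, seq_lens):
        # Backward decay: tokens closer to the observed outcome get more
        # credit; λ controls the decay horizon of the trace.
        trace = lam ** np.arange(T - 1, -1, -1)
        weights.append(a * trace)  # per-token weights instead of uniform a
    return weights

print(grpo_lambda_weights([1.0, 0.0, 1.0, 0.0], [5, 4, 6, 3]))
```
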
Masked Diffusion Models as Energy Minimization
We present a systematic theoretical framework that interprets masked diffusion models (MDMs) as solutions to energy minimization problems in discrete optimal transport. Specifically, we prove that three distinct energy formulations (kinetic, conditional kinetic, and geodesic energy) are mathematically equivalent under the structure of MDMs, and that MDMs minimize all three when the mask schedule satisfies a closed-form optimality condition. This unification not only clarifies the theoretical foundations of MDMs, but also motivates practical improvements in sampling. By parameterizing interpolation schedules via Beta distributions, we reduce the schedule design space to a tractable 2D search, enabling efficient post-training tuning without model modification. Experiments on synthetic and real-world benchmarks demonstrate that our energy-inspired schedules outperform hand-crafted baselines, particularly in low-step sampling settings.
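
Since the schedule parameterization is concrete, a small sketch: the Beta CDF gives a monotone mask schedule on [0, 1], so post-training tuning reduces to a 2D grid search over (a, b). The evaluation step below is a placeholder.

```python
# Beta-parameterized mask schedule: tuning reduces to a 2D search over (a, b).
import numpy as np
from scipy.stats import beta

def mask_schedule(t, a, b):
    """Fraction of tokens masked at time t in [0, 1]; monotone by construction."""
    return beta.cdf(t, a, b)

# Post-training tuning: no model changes, just pick (a, b) on a small grid.
grid = [(a, b) for a in (0.5, 1.0, 2.0, 4.0) for b in (0.5, 1.0, 2.0, 4.0)]
t = np.linspace(0.0, 1.0, 16)          # e.g., a 16-step low-step sampler
for a, b in grid:
    schedule = mask_schedule(t, a, b)  # feed these mask rates to the sampler
    # score = evaluate_samples(model, schedule)  # placeholder: task metric

print(mask_schedule(np.array([0.25, 0.5, 0.75]), 2.0, 2.0))
```
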
PilotRL: Training Language Model Agents via Global Planning-Guided Progressive Reinforcement Learning
Large Language Models (LLMs) have shown remarkable advancements in tackling agent-oriented tasks. Despite their potential, existing work faces challenges when deploying LLMs in agent-based environments. The widely adopted agent paradigm ReAct centers on integrating single-step reasoning with immediate action execution, which limits its effectiveness in complex tasks requiring long-term strategic planning. Furthermore, the coordination between the planner and executor during problem-solving is also a critical factor to consider in agent design. Additionally, current approaches predominantly rely on supervised fine-tuning, which often leads models to memorize established task completion trajectories, thereby restricting their generalization ability when confronted with novel problem contexts. To address these challenges, we introduce an adaptive global plan-based agent paradigm AdaPlan, aiming to synergize high-level explicit guidance with execution to support effective long-horizon decision-making. Based on the proposed paradigm, we further put forward PilotRL, a global planning-guided training framework for LLM agents driven by progressive reinforcement learning. We first develop the model's ability to follow explicit guidance from global plans when addressing agent tasks. Subsequently, based on this foundation, we focus on optimizing the quality of generated plans. Finally, we conduct joint optimization of the model's planning and execution coordination. Experiments indicate that PilotRL achieves state-of-the-art performance, with LLaMA3.1-8B-Instruct + PilotRL surpassing closed-source GPT-4o by 3.60%, while showing a more substantial gain of 55.78% compared to GPT-4o-mini at a comparable parameter scale.
DeepFM: A Factorization-Machine based Neural Network for CTR Prediction

DeepFM integrates Factorization Machines (FM) and Deep Neural Networks (DNN) into a single, end-to-end trainable model to predict Click-Through Rate (CTR) by simultaneously capturing both low-order and high-order feature interactions without manual feature engineering. The model consistently outperformed nine baseline models on large datasets, achieving up to 0.48% higher AUC than Wide & Deep variants on a commercial app store dataset.
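
For reference, a compact PyTorch sketch of the DeepFM forward pass: the FM component captures first- and second-order interactions via the sum-square trick, the DNN captures high-order ones, and both share the same embeddings. Hyperparameters below are arbitrary.

```python
# Compact DeepFM sketch: FM (low-order) + DNN (high-order) on shared embeddings.
import torch
import torch.nn as nn

class DeepFM(nn.Module):
    def __init__(self, field_dims, embed_dim=8, hidden=(64, 32)):
        super().__init__()
        num_features = sum(field_dims)
        self.first_order = nn.Embedding(num_features, 1)
        self.embedding = nn.Embedding(num_features, embed_dim)  # shared by FM and DNN
        layers, in_dim = [], len(field_dims) * embed_dim
        for h in hidden:
            layers += [nn.Linear(in_dim, h), nn.ReLU()]
            in_dim = h
        layers.append(nn.Linear(in_dim, 1))
        self.dnn = nn.Sequential(*layers)

    def forward(self, x):                 # x: (batch, num_fields) feature ids
        emb = self.embedding(x)           # (batch, fields, embed_dim)
        # FM second-order term via the sum-square trick:
        # 0.5 * ((sum_i v_i)^2 - sum_i v_i^2), summed over embedding dims.
        square_of_sum = emb.sum(dim=1) ** 2
        sum_of_square = (emb ** 2).sum(dim=1)
        fm = 0.5 * (square_of_sum - sum_of_square).sum(dim=1, keepdim=True)
        first = self.first_order(x).sum(dim=1)
        deep = self.dnn(emb.flatten(start_dim=1))
        return torch.sigmoid(first + fm + deep).squeeze(1)

model = DeepFM(field_dims=[10, 20, 30])
x = torch.randint(0, 10, (4, 3))  # note: ids must be offset per field in practice
print(model(x).shape)             # torch.Size([4])
```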

DiffKV: Differentiated Memory Management for Large Language Models with Parallel KV Compaction

DiffKV introduces a framework for large language models that optimizes Key-Value (KV) cache memory through differentiated compression and an on-GPU parallel compaction manager. This approach achieves substantial memory compression and throughput improvements with minimal accuracy degradation, particularly for complex reasoning and long-context generation tasks.
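
A hedged illustration of the differentiated-compression idea: keys, which drive attention scores, keep higher precision than values. DiffKV's actual policy is more elaborate; this shows only the mixed-precision quantization ingredient.

```python
# Mixed-precision KV quantization toy example (not DiffKV's actual policy).
import numpy as np

def quantize(x, bits):
    """Uniform symmetric quantization to the given bit-width."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax
    return np.round(x / scale).clip(-qmax, qmax) * scale

rng = np.random.default_rng(0)
keys, values = rng.standard_normal((128, 64)), rng.standard_normal((128, 64))
k_q = quantize(keys, bits=8)    # higher precision for keys
v_q = quantize(values, bits=4)  # values tolerate coarser quantization
print(np.abs(keys - k_q).mean(), np.abs(values - v_q).mean())
```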

EDEN: Enhanced Diffusion for High-quality Large-motion Video Frame Interpolation
Handling complex or nonlinear motion patterns has long posed challenges for video frame interpolation. Although recent advances in diffusion-based methods offer improvements over traditional optical flow-based approaches, they still struggle to generate sharp, temporally consistent frames in scenarios with large motion. To address this limitation, we introduce EDEN, an Enhanced Diffusion for high-quality large-motion vidEo frame iNterpolation. Our approach first utilizes a transformer-based tokenizer to produce refined latent representations of the intermediate frames for diffusion models. We then enhance the diffusion transformer with temporal attention across the process and incorporate a start-end frame difference embedding to guide the generation of dynamic motion. Extensive experiments demonstrate that EDEN achieves state-of-the-art results across popular benchmarks, including nearly a 10% LPIPS reduction on DAVIS and SNU-FILM, and an 8% improvement on DAIN-HD.
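
A hedged sketch of the start-end frame difference embedding mentioned above, assuming it embeds the pooled latent difference between the two input frames as extra conditioning; the module structure is an illustration, not the paper's exact design.

```python
# Hypothetical start-end frame difference embedding for motion conditioning.
import torch
import torch.nn as nn

class FrameDiffEmbedding(nn.Module):
    def __init__(self, latent_dim: int, cond_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(latent_dim, cond_dim), nn.SiLU(), nn.Linear(cond_dim, cond_dim)
        )

    def forward(self, z_start: torch.Tensor, z_end: torch.Tensor) -> torch.Tensor:
        # Pool spatial positions and embed the difference; large-motion pairs
        # yield embeddings far from zero, guiding dynamic generation.
        diff = (z_end - z_start).mean(dim=(-2, -1))  # (batch, latent_dim)
        return self.proj(diff)                       # (batch, cond_dim)

emb = FrameDiffEmbedding(latent_dim=4, cond_dim=256)
z0, z1 = torch.randn(2, 4, 32, 32), torch.randn(2, 4, 32, 32)
print(emb(z0, z1).shape)  # torch.Size([2, 256])
```
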
AsyncFlow: An Asynchronous Streaming RL Framework for Efficient LLM Post-Training
02 Jul 2025

AsyncFlow introduces an asynchronous streaming reinforcement learning framework for efficient large language model post-training, achieving an average throughput gain of 1.59x and up to 2.03x peak improvement over state-of-the-art baselines on Huawei Ascend NPU clusters. The framework maintains algorithmic stability with negligible differences in reward scores compared to synchronous methods.
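
The asynchronous streaming pattern can be illustrated with a generic producer/consumer sketch: rollout workers stream trajectories into a bounded queue while the trainer consumes them, so generation and training overlap. This is not AsyncFlow's actual scheduler.

```python
# Generic producer/consumer illustration of asynchronous streaming RL.
import asyncio, random

async def rollout_worker(queue: asyncio.Queue, worker_id: int):
    for step in range(3):
        await asyncio.sleep(random.uniform(0.05, 0.2))  # simulate generation
        await queue.put(f"trajectory[{worker_id}:{step}]")
    await queue.put(None)                                # signal completion

async def trainer(queue: asyncio.Queue, num_workers: int):
    done = 0
    while done < num_workers:
        item = await queue.get()
        if item is None:
            done += 1
            continue
        print("training on", item)  # gradient step overlaps with generation

async def main():
    queue: asyncio.Queue = asyncio.Queue(maxsize=8)      # bounds staleness
    workers = [rollout_worker(queue, i) for i in range(2)]
    await asyncio.gather(trainer(queue, 2), *workers)

asyncio.run(main())
```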
