Noah’s Ark Lab · Huawei
Divide and Conquer: Language Models can Plan and Self-Correct for Compositional Text-to-Image Generation
Despite significant advancements in text-to-image models for generating high-quality images, these methods still struggle to ensure that complex text prompts retain control over the generated images, especially when it comes to preserving object attributes and relationships. In this paper, we propose CompAgent, a training-free approach for compositional text-to-image generation, with a large language model (LLM) agent as its core. The fundamental idea underlying CompAgent is a divide-and-conquer methodology. Given a complex text prompt containing multiple concepts including objects, attributes, and relationships, the LLM agent first decomposes it, extracting individual objects and their associated attributes and predicting a coherent scene layout. These individual objects can then be conquered independently. Subsequently, the agent reasons over the text, then plans and employs tools to compose these isolated objects. A verification and human-feedback mechanism is finally incorporated into our agent to correct potential attribute errors and refine the generated images. Guided by the LLM agent, we propose a tuning-free multi-concept customization model and a layout-to-image generation model as the tools for concept composition, and a local image editing method as the tool for interacting with the agent during verification. The scene layout controls the image generation process among these tools to prevent confusion among multiple objects. Extensive experiments demonstrate the superiority of our approach for compositional text-to-image generation: CompAgent achieves more than a 10% improvement on T2I-CompBench, a comprehensive benchmark for open-world compositional T2I generation. Extensions to various related tasks also illustrate the flexibility of our CompAgent for potential applications.
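
For illustration, a minimal sketch of this divide-and-conquer loop is given below; the interfaces (llm, tools, and their methods) are hypothetical stand-ins for the paper's components, not a released API.

```python
# Hypothetical sketch of the CompAgent divide-and-conquer loop.
# All interfaces (llm, tools) are assumptions for illustration only.

def compagent(prompt: str, llm, tools, max_rounds: int = 3):
    # 1. Divide: the LLM agent extracts objects, attributes, and a scene layout.
    plan = llm.decompose(prompt)  # -> {"objects": [...], "attributes": {...}, "layout": [...]}

    # 2. Conquer: each object is generated independently so attributes don't leak.
    concepts = [tools.generate_concept(obj, plan["attributes"][obj])
                for obj in plan["objects"]]

    # 3. Compose: the agent picks a tool (multi-concept customization or
    #    layout-to-image) based on its analysis of the text; the layout
    #    constrains generation to avoid confusion between objects.
    tool = llm.select_tool(prompt, plan)
    image = tool.compose(concepts, layout=plan["layout"])

    # 4. Verify and correct: attribute errors found by the verifier are fixed
    #    with local image editing, optionally guided by human feedback.
    for _ in range(max_rounds):
        errors = llm.verify(image, prompt)
        if not errors:
            break
        image = tools.local_edit(image, errors, layout=plan["layout"])
    return image
```
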
From Code Foundation Models to Agents and Applications: A Comprehensive Survey and Practical Guide to Code Intelligence

A comprehensive synthesis of Large Language Models for automated software development covers the entire model lifecycle, from data curation to autonomous agents, and offers practical guidance derived from empirical experiments on pre-training, fine-tuning, and reinforcement learning, alongside a detailed analysis of challenges and future directions.

SWE-Exp: Experience-Driven Software Issue Resolution

SWE-Exp introduces an experience-enhanced framework that enables Large Language Model (LLM) agents to learn from past software issue resolution attempts, transforming problem-solving into a continuous learning process. This approach achieved a 41.6% Pass@1 score on SWE-bench-Verified with DeepSeek-V3-0324, representing a 7.2% relative improvement over prior methods using the same model.
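
As a rough illustration of the experience-reuse idea, the sketch below implements a toy experience bank; the data model and lexical retrieval here are assumptions for illustration, not SWE-Exp's actual design.

```python
# Illustrative experience bank in the spirit of SWE-Exp (hypothetical design).
from dataclasses import dataclass, field

@dataclass
class Experience:
    issue_summary: str
    strategy: str   # what was tried (e.g., "patch config loader")
    outcome: bool   # did the attempt resolve the issue?
    lesson: str     # distilled takeaway reused on future issues

@dataclass
class ExperienceBank:
    items: list[Experience] = field(default_factory=list)

    def add(self, exp: Experience) -> None:
        self.items.append(exp)

    def retrieve(self, issue: str, k: int = 3) -> list[Experience]:
        # Toy lexical overlap; a real system would use embeddings.
        def score(exp: Experience) -> int:
            return len(set(issue.lower().split()) & set(exp.issue_summary.lower().split()))
        return sorted(self.items, key=score, reverse=True)[:k]

bank = ExperienceBank()
bank.add(Experience("KeyError in config parser", "add default value", True,
                    "check optional config keys before access"))
for exp in bank.retrieve("KeyError when parsing config file"):
    print(exp.lesson)  # lessons are injected into the agent's prompt
```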

Diffusion of Thoughts: Chain-of-Thought Reasoning in Diffusion Language Models

This paper introduces "Diffusion of Thought (DoT)", a method for integrating Chain-of-Thought (CoT) reasoning into diffusion language models. The approach demonstrates high accuracy and significant efficiency gains on reasoning tasks, outperforming autoregressive counterparts while offering a flexible computation-accuracy trade-off.

StreamForest: Efficient Online Video Understanding with Persistent Event Memory

StreamForest introduces a novel multimodal large language model architecture designed for efficient online video understanding, leveraging a persistent event memory and a fine-grained spatiotemporal window. The model demonstrates state-of-the-art performance on streaming video benchmarks, including a new autonomous driving benchmark, and maintains robust accuracy even under extreme visual token compression.

SRUM: Fine-Grained Self-Rewarding for Unified Multimodal Models

SRUM is a post-training framework that enables Unified Multimodal Models (UMMs) to improve their image generation capabilities by using their internal understanding module as a self-sufficient evaluator. It achieves an 88.37 overall score on T2I-CompBench, representing a +3.91 point improvement over the baseline, and demonstrates strong generalization across various complex compositional and reasoning tasks.
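
A hedged sketch of the self-rewarding idea: the model's own understanding module scores its generations, and that feedback post-trains the generator. All interfaces below are hypothetical.

```python
# Hypothetical self-rewarding loop in the spirit of SRUM; not the paper's API.

def self_reward_step(umm, prompt: str, n_candidates: int = 4):
    # The generation module proposes several images for the same prompt.
    images = [umm.generate(prompt) for _ in range(n_candidates)]

    # The understanding module acts as the evaluator, producing fine-grained
    # feedback (e.g., per-object attribute correctness) rather than one scalar.
    rewards = [umm.understand_and_score(image, prompt) for image in images]

    # The reward signal is fed back to post-train the generator, e.g. via
    # preference optimization over the best vs. worst candidates.
    best = images[max(range(n_candidates), key=lambda i: rewards[i])]
    worst = images[min(range(n_candidates), key=lambda i: rewards[i])]
    umm.update_generator(prompt, chosen=best, rejected=worst)
```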

SWE-Debate: Competitive Multi-Agent Debate for Software Issue Resolution

Researchers from Shanghai Jiao Tong University, Huawei, and Xidian University developed SWE-Debate, a framework that leverages competitive multi-agent debate and graph-guided fault localization for resolving repository-level software issues. This approach achieved a 41.4% Pass@1 success rate on the SWE-Bench-Verified dataset and an 81.67% file-level localization accuracy on SWE-Bench-Lite.

AutoSchemaKG: Autonomous Knowledge Graph Construction through Dynamic Schema Induction from Web-Scale Corpora
01 Aug 2025

The AutoSchemaKG framework constructs knowledge graphs autonomously from web-scale corpora by dynamically inducing schemas, integrating entity, event, and concept nodes. This approach generates billion-scale KGs that enhance the factuality and multi-hop question answering capabilities of large language models.
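
The sketch below illustrates the two ingredients named above, open triple extraction plus dynamic schema induction, under assumed LLM interfaces; it is not the released pipeline.

```python
# Illustrative two-stage loop in the spirit of AutoSchemaKG; the LLM
# interfaces and node types are assumptions for illustration.

def build_kg(documents, llm):
    triples, schema = [], {}
    for doc in documents:
        # Stage 1: open extraction of entity and event triples, no fixed schema.
        for (head, relation, tail) in llm.extract_triples(doc):
            triples.append((head, relation, tail))
            # Stage 2: dynamic schema induction: the LLM abstracts each node
            # into a concept (e.g., "aspirin" -> "medication"), growing the
            # schema as new concepts appear instead of using a predefined one.
            for node in (head, tail):
                schema.setdefault(node, llm.conceptualize(node))
    return triples, schema
```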

Diffusion LLM with Native Variable Generation Lengths: Let [EOS] Lead the Way

dLLM-Var, developed at Shanghai Jiao Tong University's EPIC Lab, enables diffusion-based large language models to generate text with native variable lengths, achieving a 30.1x speedup over traditional dLLMs and a 2.4x speedup over autoregressive models, while maintaining competitive accuracy and demonstrating self-correction capabilities.

DeepDiver: Adaptive Search Intensity Scaling via Open-Web Reinforcement Learning
10 Nov 2025
Information seeking demands iterative evidence gathering and reflective reasoning, yet large language models (LLMs) still struggle with it in open-web question answering. Existing prompting and supervised fine-tuning (SFT) methods remain fixed by prompt rules or training corpora, and are usually benchmarked only on well-structured wiki sources, limiting real-world adaptability. We introduce WebPuzzle, a 24k-sample training and 275-sample test benchmark that evaluates information seeking on the live internet, across both wiki and open-domain queries. Leveraging 7k WebPuzzle instances, we develop DeepDiver, a reinforcement learning (RL) framework that cultivates Search Intensity Scaling (SIS), an emergent ability to escalate search frequency and depth instead of settling on overconfident, under-evidenced answers. With SIS, Qwen2.5-7B-Instruct and Pangu-7B-Reasoner attain performance on real-web tasks comparable to the 671B-parameter DeepSeek-R1. We detail DeepDiver's curriculum from cold-start SFT to a well-designed RL procedure, and show that its seeking policy generalizes from closed-ended queries to open-ended generation such as long-form writing. Our results advance adaptive information seeking in LLMs and provide a rigorous benchmark for future work.
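
A minimal sketch of Search Intensity Scaling as described: the policy keeps escalating search rather than committing to an under-evidenced answer. All interfaces are hypothetical.

```python
# Hypothetical SIS-style seeking loop; interfaces are assumptions for illustration.

def answer_with_sis(question: str, policy, search, max_steps: int = 8):
    evidence = []
    for step in range(max_steps):
        action = policy.decide(question, evidence)  # "search" or "answer"
        if action.kind == "answer":
            return action.text
        # Escalate intensity: later rounds issue more, and more specific, queries
        # instead of settling on an overconfident early answer.
        for query in action.queries:
            evidence.extend(search(query))
    # Fall back to the best-supported answer once the budget is exhausted.
    return policy.force_answer(question, evidence)
```
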
VTimeCoT: Thinking by Drawing for Video Temporal Grounding and Reasoning

VTimeCoT equips multimodal large language models with a "visual clock" to perform precise video temporal grounding and reasoning by allowing them to "think by drawing" directly on video timelines. The framework, developed by researchers from Shanghai Jiao Tong University, Noah’s Ark Lab, and Imperial College London, achieved substantial performance gains on temporal grounding and long-video question answering benchmarks without requiring additional training.

Process vs. Outcome Reward: Which is Better for Agentic RAG Reinforcement Learning
08 Dec 2025
Retrieval-augmented generation (RAG) enhances the text generation capabilities of large language models (LLMs) by integrating external knowledge and up-to-date information. However, traditional RAG systems are limited by static workflows and lack the adaptability required for multistep reasoning and complex task management. To address these limitations, agentic RAG systems (e.g., DeepResearch) have been proposed, enabling dynamic retrieval strategies, iterative context refinement, and adaptive workflows for handling complex search queries beyond the capabilities of conventional RAG. Recent advances, such as Search-R1, have demonstrated promising gains using outcome-based reinforcement learning, where the correctness of the final answer serves as the reward signal. Nevertheless, such outcome-supervised agentic RAG methods face challenges including low exploration efficiency, gradient conflict, and sparse reward signals. To overcome these challenges, we propose to utilize fine-grained, process-level rewards to improve training stability, reduce computational costs, and enhance efficiency. Specifically, we introduce a novel method ReasonRAG that automatically constructs RAG-ProGuide, a high-quality dataset providing process-level rewards for (i) query generation, (ii) evidence extraction, and (iii) answer generation, thereby enhancing the model's inherent capabilities via process-supervised reinforcement learning. With the process-level policy optimization, the proposed framework empowers LLMs to autonomously invoke search, generate queries, extract relevant evidence, and produce final answers. Compared to existing approaches such as Search-R1 and traditional RAG systems, ReasonRAG, leveraging RAG-ProGuide, achieves superior performance on five benchmark datasets using only 5k training instances, significantly fewer than the 90k training instances required by Search-R1.
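
The sketch below illustrates process-level reward assignment over the three step types named above; the judge interface is an assumption for illustration.

```python
# Hypothetical process-level reward assignment in the spirit of ReasonRAG:
# each intermediate step gets its own reward instead of only the final answer.

def process_rewards(trajectory, judge):
    """trajectory: list of (step_type, content) for one agentic RAG rollout."""
    rewards = []
    for step_type, content in trajectory:
        if step_type == "query":
            r = judge.query_quality(content)        # is the search query useful?
        elif step_type == "evidence":
            r = judge.evidence_relevance(content)   # was relevant evidence kept?
        elif step_type == "answer":
            r = judge.answer_correctness(content)   # outcome reward still applies
        else:
            r = 0.0  # unscored step types contribute no signal
        rewards.append((step_type, r))
    return rewards  # dense signal for process-supervised policy optimization
```
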
OSUM-EChat: Enhancing End-to-End Empathetic Spoken Chatbot via Understanding-Driven Spoken Dialogue

The OSUM-EChat system developed by the Audio, Speech and Language Processing Group at Northwestern Polytechnical University enhances end-to-end empathetic spoken chatbots by integrating an understanding-driven training strategy and a linguistic-paralinguistic dual think mechanism. It achieved a GPT-4 score of 72.0 on a new EChat-eval benchmark for multi-label empathy, demonstrating improved empathetic responsiveness and efficient speech understanding without relying on massive, proprietary datasets.

GRPO-λ: Credit Assignment Improves LLM Reasoning
Large language models (LLMs) are increasingly deployed for tasks requiring complex reasoning, prompting significant interest in improving their reasoning abilities through post-training. In particular, RL-based methods using verifiable rewards, like the state-of-the-art GRPO, have been shown to tremendously improve reasoning behaviors when applied as post-training methods. However, the lack of an explicit reward or critic model limits GRPO's ability to assign fine-grained credit across token sequences. In this work, we present GRPO-λ, a novel extension to GRPO that enhances credit assignment in RL fine-tuning of LLMs for complex reasoning tasks. We approximate learning from the λ-return with a reformulation of eligibility traces using token-level log-probabilities applied after each sequence generation, and a novel critic-free approximation of the temporal-difference error. We introduce several variations for weighting the λ-return and applying it to the eligibility trace, all of which provide significant gains over GRPO. We compare GRPO-λ against GRPO by training models from 1.5B to 7B parameters on 4 different math reasoning datasets. The training plots demonstrate 30-40% improved performance during RL training on both LLaMA-3.1 and Qwen-2.5 architectures. Finally, we show that with GRPO-λ, the resulting average performance on AIME24, Math500, OlympiadMath, MinervaMath, and AMC improves over GRPO by over 3 points, with a 4.5-point improvement on the 7B model.
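
The sketch below conveys only the core idea, replacing GRPO's uniform per-sequence advantage with decayed per-token weights; the paper's actual formulation additionally uses token-level log-probabilities and a critic-free TD-error approximation that this toy version omits.

```python
# Toy illustration of λ-style per-token credit on top of a GRPO-style
# group-relative advantage. NOT the paper's exact formulation.
import numpy as np

def grpo_lambda_weights(group_rewards, seq_lens, lam=0.95):
    """group_rewards: one scalar reward per sampled sequence in the group.
    seq_lens: token count of each sequence."""
    group_rewards = np.asarray(group_rewards, dtype=float)
    # Critic-free, group-relative advantage as in GRPO.
    adv = (group_rewards - group_rewards.mean()) / (group_rewards.std() + 1e-8)
    weights = []
    for a, T in zip(adv, seq_lens):
        # Backward decay: tokens closer to the observed outcome get more
        # credit; λ controls the decay horizon of the trace.
        trace = lam ** np.arange(T - 1, -1, -1)
        weights.append(a * trace)  # per-token weights instead of uniform a
    return weights

print(grpo_lambda_weights([1.0, 0.0, 1.0, 0.0], [5, 4, 6, 3]))
```
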
Masked Diffusion Models as Energy Minimization
We present a systematic theoretical framework that interprets masked diffusion models (MDMs) as solutions to energy minimization problems in discrete optimal transport. Specifically, we prove that three distinct energy formulations (kinetic, conditional kinetic, and geodesic energy) are mathematically equivalent under the structure of MDMs, and that MDMs minimize all three when the mask schedule satisfies a closed-form optimality condition. This unification not only clarifies the theoretical foundations of MDMs, but also motivates practical improvements in sampling. By parameterizing interpolation schedules via Beta distributions, we reduce the schedule design space to a tractable 2D search, enabling efficient post-training tuning without model modification. Experiments on synthetic and real-world benchmarks demonstrate that our energy-inspired schedules outperform hand-crafted baselines, particularly in low-step sampling settings.
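
Since the schedule parameterization is concrete, a small sketch: the Beta CDF gives a monotone mask schedule on [0, 1], so post-training tuning reduces to a 2D grid search over (a, b). The evaluation step below is a placeholder.

```python
# Beta-parameterized mask schedule: tuning reduces to a 2D search over (a, b).
import numpy as np
from scipy.stats import beta

def mask_schedule(t, a, b):
    """Fraction of tokens masked at time t in [0, 1]; monotone by construction."""
    return beta.cdf(t, a, b)

# Post-training tuning: no model changes, just pick (a, b) on a small grid.
grid = [(a, b) for a in (0.5, 1.0, 2.0, 4.0) for b in (0.5, 1.0, 2.0, 4.0)]
t = np.linspace(0.0, 1.0, 16)          # e.g., a 16-step low-step sampler
for a, b in grid:
    schedule = mask_schedule(t, a, b)  # feed these mask rates to the sampler
    # score = evaluate_samples(model, schedule)  # placeholder: task metric

print(mask_schedule(np.array([0.25, 0.5, 0.75]), 2.0, 2.0))
```
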
PilotRL: Training Language Model Agents via Global Planning-Guided Progressive Reinforcement Learning
Large Language Models (LLMs) have shown remarkable advancements in tackling agent-oriented tasks. Despite their potential, existing work faces challenges when deploying LLMs in agent-based environments. The widely adopted agent paradigm ReAct centers on integrating single-step reasoning with immediate action execution, which limits its effectiveness in complex tasks requiring long-term strategic planning. Furthermore, the coordination between the planner and executor during problem-solving is also a critical factor to consider in agent design. Additionally, current approaches predominantly rely on supervised fine-tuning, which often leads models to memorize established task completion trajectories, thereby restricting their generalization ability when confronted with novel problem contexts. To address these challenges, we introduce an adaptive global plan-based agent paradigm AdaPlan, aiming to synergize high-level explicit guidance with execution to support effective long-horizon decision-making. Based on the proposed paradigm, we further put forward PilotRL, a global planning-guided training framework for LLM agents driven by progressive reinforcement learning. We first develop the model's ability to follow explicit guidance from global plans when addressing agent tasks. Subsequently, based on this foundation, we focus on optimizing the quality of generated plans. Finally, we conduct joint optimization of the model's planning and execution coordination. Experiments indicate that PilotRL achieves state-of-the-art performance, with LLaMA3.1-8B-Instruct + PilotRL surpassing closed-source GPT-4o by 3.60%, while showing a more substantial gain of 55.78% compared to GPT-4o-mini at a comparable parameter scale.
DeepFM: A Factorization-Machine based Neural Network for CTR Prediction

DeepFM integrates Factorization Machines (FM) and Deep Neural Networks (DNN) into a single, end-to-end trainable model to predict Click-Through Rate (CTR) by simultaneously capturing both low-order and high-order feature interactions without manual feature engineering. The model consistently outperformed nine baseline models on large datasets, achieving up to 0.48% higher AUC than Wide & Deep variants on a commercial app store dataset.
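
For reference, a compact PyTorch sketch of the DeepFM forward pass: the FM component captures first- and second-order interactions via the sum-square trick, the DNN captures high-order ones, and both share the same embeddings. Hyperparameters below are arbitrary.

```python
# Compact DeepFM sketch: FM (low-order) + DNN (high-order) on shared embeddings.
import torch
import torch.nn as nn

class DeepFM(nn.Module):
    def __init__(self, field_dims, embed_dim=8, hidden=(64, 32)):
        super().__init__()
        num_features = sum(field_dims)
        self.first_order = nn.Embedding(num_features, 1)
        self.embedding = nn.Embedding(num_features, embed_dim)  # shared by FM and DNN
        layers, in_dim = [], len(field_dims) * embed_dim
        for h in hidden:
            layers += [nn.Linear(in_dim, h), nn.ReLU()]
            in_dim = h
        layers.append(nn.Linear(in_dim, 1))
        self.dnn = nn.Sequential(*layers)

    def forward(self, x):                 # x: (batch, num_fields) feature ids
        emb = self.embedding(x)           # (batch, fields, embed_dim)
        # FM second-order term via the sum-square trick:
        # 0.5 * ((sum_i v_i)^2 - sum_i v_i^2), summed over embedding dims.
        square_of_sum = emb.sum(dim=1) ** 2
        sum_of_square = (emb ** 2).sum(dim=1)
        fm = 0.5 * (square_of_sum - sum_of_square).sum(dim=1, keepdim=True)
        first = self.first_order(x).sum(dim=1)
        deep = self.dnn(emb.flatten(start_dim=1))
        return torch.sigmoid(first + fm + deep).squeeze(1)

model = DeepFM(field_dims=[10, 20, 30])
x = torch.randint(0, 10, (4, 3))  # note: ids must be offset per field in practice
print(model(x).shape)             # torch.Size([4])
```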

DiffKV: Differentiated Memory Management for Large Language Models with Parallel KV Compaction

DiffKV introduces a framework for large language models that optimizes Key-Value (KV) cache memory through differentiated compression and an on-GPU parallel compaction manager. This approach achieves substantial memory compression and throughput improvements with minimal accuracy degradation, particularly for complex reasoning and long-context generation tasks.
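
A hedged illustration of the differentiated-compression idea: keys, which drive attention scores, keep higher precision than values. DiffKV's actual policy is more elaborate; this shows only the mixed-precision quantization ingredient.

```python
# Mixed-precision KV quantization toy example (not DiffKV's actual policy).
import numpy as np

def quantize(x, bits):
    """Uniform symmetric quantization to the given bit-width."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax
    return np.round(x / scale).clip(-qmax, qmax) * scale

rng = np.random.default_rng(0)
keys, values = rng.standard_normal((128, 64)), rng.standard_normal((128, 64))
k_q = quantize(keys, bits=8)    # higher precision for keys
v_q = quantize(values, bits=4)  # values tolerate coarser quantization
print(np.abs(keys - k_q).mean(), np.abs(values - v_q).mean())
```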

EDEN: Enhanced Diffusion for High-quality Large-motion Video Frame Interpolation
Handling complex or nonlinear motion patterns has long posed challenges for video frame interpolation. Although recent advances in diffusion-based methods offer improvements over traditional optical flow-based approaches, they still struggle to generate sharp, temporally consistent frames in scenarios with large motion. To address this limitation, we introduce EDEN, an Enhanced Diffusion for high-quality large-motion vidEo frame iNterpolation. Our approach first utilizes a transformer-based tokenizer to produce refined latent representations of the intermediate frames for diffusion models. We then enhance the diffusion transformer with temporal attention across the process and incorporate a start-end frame difference embedding to guide the generation of dynamic motion. Extensive experiments demonstrate that EDEN achieves state-of-the-art results across popular benchmarks, including nearly a 10% LPIPS reduction on DAVIS and SNU-FILM, and an 8% improvement on DAIN-HD.
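
A hedged sketch of the start-end frame difference embedding mentioned above, assuming it embeds the pooled latent difference between the two input frames as extra conditioning; the module structure is an illustration, not the paper's exact design.

```python
# Hypothetical start-end frame difference embedding for motion conditioning.
import torch
import torch.nn as nn

class FrameDiffEmbedding(nn.Module):
    def __init__(self, latent_dim: int, cond_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(latent_dim, cond_dim), nn.SiLU(), nn.Linear(cond_dim, cond_dim)
        )

    def forward(self, z_start: torch.Tensor, z_end: torch.Tensor) -> torch.Tensor:
        # Pool spatial positions and embed the difference; large-motion pairs
        # yield embeddings far from zero, guiding dynamic generation.
        diff = (z_end - z_start).mean(dim=(-2, -1))  # (batch, latent_dim)
        return self.proj(diff)                       # (batch, cond_dim)

emb = FrameDiffEmbedding(latent_dim=4, cond_dim=256)
z0, z1 = torch.randn(2, 4, 32, 32), torch.randn(2, 4, 32, 32)
print(emb(z0, z1).shape)  # torch.Size([2, 256])
```
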
AsyncFlow: An Asynchronous Streaming RL Framework for Efficient LLM Post-Training
02 Jul 2025

AsyncFlow introduces an asynchronous streaming reinforcement learning framework for efficient large language model post-training, achieving an average throughput gain of 1.59x and up to 2.03x peak improvement over state-of-the-art baselines on Huawei Ascend NPU clusters. The framework maintains algorithmic stability with negligible differences in reward scores compared to synchronous methods.
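
The asynchronous streaming pattern can be illustrated with a generic producer/consumer sketch: rollout workers stream trajectories into a bounded queue while the trainer consumes them, so generation and training overlap. This is not AsyncFlow's actual scheduler.

```python
# Generic producer/consumer illustration of asynchronous streaming RL.
import asyncio, random

async def rollout_worker(queue: asyncio.Queue, worker_id: int):
    for step in range(3):
        await asyncio.sleep(random.uniform(0.05, 0.2))  # simulate generation
        await queue.put(f"trajectory[{worker_id}:{step}]")
    await queue.put(None)                                # signal completion

async def trainer(queue: asyncio.Queue, num_workers: int):
    done = 0
    while done < num_workers:
        item = await queue.get()
        if item is None:
            done += 1
            continue
        print("training on", item)  # gradient step overlaps with generation

async def main():
    queue: asyncio.Queue = asyncio.Queue(maxsize=8)      # bounds staleness
    workers = [rollout_worker(queue, i) for i in range(2)]
    await asyncio.gather(trainer(queue, 2), *workers)

asyncio.run(main())
```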
