Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models

Researchers from Harbin Institute of Technology and collaborating institutions provide a systematic survey of Long Chain-of-Thought (Long CoT) in Large Language Models, establishing a formal distinction from Short CoT. The survey proposes a novel taxonomy based on deep reasoning, extensive exploration, and feasible reflection, and analyzes key phenomena observed in advanced reasoning models.

View blog
Resources: 524
RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation

RoboTwin 2.0 introduces a scalable simulation framework and benchmark designed to generate high-quality, domain-randomized data for robust bimanual robotic manipulation, addressing limitations in existing synthetic datasets. Policies trained with RoboTwin 2.0 data achieved a 24.4% improvement in real-world success rates for few-shot learning and a 21.0% improvement in zero-shot generalization to unseen backgrounds.

View blog
Resources: 1,514
AI4Research: A Survey of Artificial Intelligence for Scientific Research

Researchers from Harbin Institute of Technology and collaborators present a systematic survey of Artificial Intelligence for Scientific Research (AI4Research), defining its scope, proposing a comprehensive taxonomy across the entire research lifecycle, and identifying critical future directions. The study clarifies the distinction between AI4Research and AI4Science, demonstrating AI's growing capabilities from scientific comprehension to peer review, while highlighting significant challenges in achieving ethical, explainable, and fully autonomous systems.

View blog
Resources: 183
The Denario project: Deep knowledge AI agents for scientific discovery
We present Denario, an AI multi-agent system designed to serve as a scientific research assistant. Denario can perform many different tasks, such as generating ideas, checking the literature, developing research plans, writing and executing code, making plots, and drafting and reviewing a scientific paper. The system has a modular architecture, allowing it either to handle a specific task, such as generating an idea, or to carry out an end-to-end scientific analysis using Cmbagent as a deep-research backend. In this work, we describe Denario and its modules in detail, and illustrate its capabilities by presenting multiple papers it generated across scientific disciplines such as astrophysics, biology, biophysics, biomedical informatics, chemistry, materials science, mathematical physics, medicine, neuroscience, and planetary science. Denario also excels at combining ideas from different disciplines, which we illustrate with a paper that applies methods from quantum physics and machine learning to astrophysical data. We report the evaluations performed on these papers by domain experts, who provided both numerical scores and review-like feedback. We then highlight the strengths, weaknesses, and limitations of the current system. Finally, we discuss the ethical implications of AI-driven research and reflect on how such technology relates to the philosophy of science. We publicly release the code at this https URL. A Denario demo can also be run directly on the web at this https URL, and the full app will be deployed on the cloud.
View blog
Resources: 76
Learning on the Job: An Experience-Driven Self-Evolving Agent for Long-Horizon Tasks

MUSE, an agent framework from the Shanghai Artificial Intelligence Laboratory and collaborators, enables Large Language Models to learn continuously from experience and self-evolve for complex, long-horizon real-world tasks. It achieved a new state-of-the-art performance of 51.78% partial completion score on the challenging TheAgentCompany (TAC) benchmark, surpassing previous methods by nearly 20%.

View blog
Resources
Vision-centric Token Compression in Large Language Models
Real-world applications are stretching context windows to hundreds of thousands of tokens while Large Language Models (LLMs) swell from billions to trillions of parameters. This dual expansion sends compute and memory costs skyrocketing, making token compression indispensable. We introduce Vision-centric Token Compression (Vist), a slow-fast compression framework that mirrors human reading: the fast path renders distant tokens into images, letting a frozen, lightweight vision encoder skim the low-salience context; the slow path feeds the proximal window into the LLM for fine-grained reasoning. A Probability-Informed Visual Enhancement (PVE) objective masks high-frequency tokens during training, steering the Resampler to concentrate on semantically rich regions, just as skilled readers gloss over function words. On eleven in-context learning benchmarks, Vist achieves the same accuracy with 2.3 times fewer tokens, cutting FLOPs by 16% and memory by 50%. It outperforms the strongest text encoder-based compression method, CEPE, by 7.6% on average over benchmarks such as TriviaQA, NQ, PopQA, NLUI, and CLIN, setting a new standard for token efficiency in LLMs. The project is at this https URL.
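The slow-fast split described above can be sketched structurally. This is a toy in Python: the render-to-image step and the frozen vision encoder are not reproduced, and `stub_vision_compress`, `vist_input`, and the one-summary-token-per-chunk rate are illustrative assumptions, not the paper's implementation.

```python
def stub_vision_compress(tokens, chunk=4):
    """Stand-in for the fast path: rendering distant tokens as images and
    skimming them with a lightweight encoder, one summary token per chunk."""
    return [f"<img:{i // chunk}>" for i in range(0, len(tokens), chunk)]

def vist_input(context, window=8, chunk=4):
    """Compress the distal context (fast path) and keep the proximal
    window verbatim for fine-grained reasoning (slow path)."""
    distal, proximal = context[:-window], context[-window:]
    return stub_vision_compress(distal, chunk) + proximal

ctx = [f"t{i}" for i in range(24)]   # a 24-token context
inp = vist_input(ctx)                # 4 placeholder visual tokens + 8 text tokens
```

In this toy, a 24-token context shrinks to 12 positions (a 2x reduction); the real framework reports roughly 2.3x fewer tokens at matched accuracy.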
View blog
Resources
AutoPR: Let's Automate Your Academic Promotion!
As the volume of peer-reviewed research surges, scholars increasingly rely on social platforms for discovery, while authors invest considerable effort in promoting their work to ensure visibility and citations. To streamline this process and reduce the reliance on human effort, we introduce Automatic Promotion (AutoPR), a novel task that transforms research papers into accurate, engaging, and timely public content. To enable rigorous evaluation, we release PRBench, a multimodal benchmark that links 512 peer-reviewed articles to high-quality promotional posts, assessing systems along three axes: Fidelity (accuracy and tone), Engagement (audience targeting and appeal), and Alignment (timing and channel optimization). We also introduce PRAgent, a multi-agent framework that automates AutoPR in three stages: content extraction with multimodal preparation, collaborative synthesis for polished outputs, and platform-specific adaptation to optimize norms, tone, and tagging for maximum reach. When compared to direct LLM pipelines on PRBench, PRAgent demonstrates substantial improvements, including a 604% increase in total watch time, a 438% rise in likes, and at least a 2.9x boost in overall engagement. Ablation studies show that platform modeling and targeted promotion contribute the most to these gains. Our results position AutoPR as a tractable, measurable research problem and provide a roadmap for scalable, impactful automated scholarly communication.
View blog
Resources: 30
See the Text: From Tokenization to Visual Reading

SEETOK proposes a vision-centric tokenization method that converts text into images for Large Language Models (LLMs), enabling them to "read" text visually. This approach reduces token counts by 4.43x and FLOPs by 70.5%, demonstrating improved multilingual fairness, translation quality, and robustness to text perturbations, while maintaining or exceeding performance on language understanding tasks.

View blog
Resources: 15
Visual Thoughts: A Unified Perspective of Understanding Multimodal Chain-of-Thought
Large Vision-Language Models (LVLMs) have achieved significant success in multimodal tasks, with multimodal chain-of-thought (MCoT) further enhancing performance and interpretability. Recent MCoT methods fall into two categories: (i) Textual-MCoT (T-MCoT), which takes multimodal input and produces textual output; and (ii) Interleaved-MCoT (I-MCoT), which generates interleaved image-text outputs. Despite advances in both approaches, the mechanisms driving these improvements are not fully understood. To fill this gap, we first reveal that MCoT boosts LVLMs by incorporating visual thoughts, which convey image information to the reasoning process regardless of the MCoT format, with effectiveness depending only on the clarity and conciseness of their expression. To explore visual thoughts systematically, we then define four distinct forms of visual thought expression and analyze them comprehensively. Our findings demonstrate that these forms differ in clarity and conciseness, yielding varying levels of MCoT improvement. Additionally, we examine the internal nature of visual thoughts, finding that they serve as intermediaries between the input image and the reasoning process, relaying visual information to deeper transformer layers and enabling more advanced visual information transmission. We hope that visual thoughts can inspire further breakthroughs in future MCoT research.
View blog
Resources
VCode: a Multimodal Coding Benchmark with SVG as Symbolic Visual Representation

The VCode project introduces a multimodal coding benchmark that requires Vision-Language Models to translate natural images into Scalable Vector Graphics (SVG) code, providing a symbolic and executable visual representation. The proposed VCoder framework, which employs iterative revision and external visual tools, improves state-of-the-art VLMs by 12.3 CodeVQA points on this challenging task.

View blog
Resources: 29
Glance: Accelerating Diffusion Models with 1 Sample

Glance introduces a phase-aware acceleration framework for diffusion models, achieving up to 5x faster inference, enabling high-quality image generation in 8-10 steps compared to 50. This acceleration is accomplished with remarkably low training costs, utilizing only a single training sample and less than one GPU-hour of training while preserving visual quality and generalization.

View blog
Resources
A Survey: Learning Embodied Intelligence from Physical Simulators and World Models

This survey provides a comprehensive review of advancements in embodied artificial intelligence (AI) from 2018 to 2025, focusing on the synergistic integration of physical simulators and world models. It proposes a five-level grading standard for intelligent robots (IR-L0 to IR-L4) and analyzes how these technologies collectively bridge the simulation-to-reality gap and enhance robot autonomy, adaptability, and generalization in complex tasks.

View blog
Resources: 179
VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection

VideoEspresso is a large-scale dataset featuring over 200,000 question-answer pairs with detailed Chain-of-Thought annotations for fine-grained video reasoning. It enables Large Vision Language Models to better understand temporal dynamics and specific spatial-temporal relationships through a novel core frame selection strategy, outperforming existing methods in various video reasoning tasks.

View blog
Resources: 51
Accelerating Diffusion Large Language Models with SlowFast Sampling: The Three Golden Principles

SlowFast Sampling accelerates diffusion-based Large Language Models (dLLMs) through a dynamic two-stage strategy guided by principles of token certainty and convergence. The method achieves up to 15.63x inference speedup for LLaDA on GPQA, which extends to 34.22x when combined with dLLM-Cache, enabling dLLMs to surpass LLaMA3 8B's throughput while preserving generation quality.
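The two-stage idea behind this summary can be sketched as a toy decoding loop in Python. The `confidence` table below stands in for model logits, and the threshold value is an illustrative assumption, not the paper's actual schedule.

```python
MASK = "<m>"

def slowfast_decode(seq, confidence, threshold=0.9):
    """Unmask every high-certainty position in parallel (fast phase);
    if none qualify, unmask only the single most certain one (slow phase)."""
    steps = 0
    seq = list(seq)
    while MASK in seq:
        masked = [i for i, t in enumerate(seq) if t == MASK]
        sure = [i for i in masked if confidence[i][1] >= threshold]
        targets = sure if sure else [max(masked, key=lambda i: confidence[i][1])]
        for i in targets:
            seq[i] = confidence[i][0]  # commit the predicted token
        steps += 1
    return seq, steps

# Position -> (predicted token, certainty); three confident positions, two not.
conf = {0: ("the", 0.99), 1: ("cat", 0.95), 2: ("sat", 0.6),
        3: ("on", 0.97), 4: ("mats", 0.5)}
out, steps = slowfast_decode([MASK] * 5, conf)
```

Here five tokens are decoded in three steps instead of five: one parallel fast step commits the three confident positions, and two slow steps handle the uncertain remainder.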

View blog
Resources: 33
EvolveR: Self-Evolving LLM Agents through an Experience-Driven Lifecycle

EVOLVER enables large language model agents to autonomously learn and improve from their own experiences by distilling raw interaction trajectories into strategic principles. This framework achieves an average Exact Match score of 0.382 across seven complex question-answering benchmarks, surpassing various state-of-the-art baselines.

View blog
Resources: 2
A Survey of LLM-based Deep Search Agents: Paradigm, Optimization, Evaluation, and Challenges

Researchers from Shanghai Jiao Tong University and Central South University provide the first systematic survey of LLM-based deep search agents, classifying existing work by search paradigms, optimization methods, application areas, and evaluation strategies. This work consolidates understanding of the rapidly evolving field, identifies current limitations, and outlines future research directions for autonomous information seeking systems.

View blog
Resources: 74
Test-Time Reinforcement Learning for GUI Grounding via Region Consistency
Graphical User Interface (GUI) grounding, the task of mapping natural language instructions to precise screen coordinates, is fundamental to autonomous GUI agents. While existing methods achieve strong performance through extensive supervised training or reinforcement learning with labeled rewards, they remain constrained by the cost and availability of pixel-level annotations. We observe that when models generate multiple predictions for the same GUI element, the spatial overlap patterns reveal implicit confidence signals that can guide more accurate localization. Leveraging this insight, we propose GUI-RC (Region Consistency), a test-time scaling method that constructs spatial voting grids from multiple sampled predictions to identify consensus regions where models show highest agreement. Without any training, GUI-RC improves accuracy by 2-3% across various architectures on ScreenSpot benchmarks. We further introduce GUI-RCPO (Region Consistency Policy Optimization), transforming these consistency patterns into rewards for test-time reinforcement learning. By computing how well each prediction aligns with the collective consensus, GUI-RCPO enables models to iteratively refine their outputs on unlabeled data during inference. Extensive experiments demonstrate the generality of our approach: using only 1,272 unlabeled samples, GUI-RCPO achieves 3-6% accuracy improvements across various architectures on ScreenSpot benchmarks. Our approach reveals the untapped potential of test-time scaling and test-time reinforcement learning for GUI grounding, offering a promising path toward more data-efficient GUI agents.
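The voting idea can be sketched as follows. This is a hedged toy in Python: the grid resolution, the tie-breaking rule, and the reduction of the consensus region to a single click point are illustrative choices here, not details from the paper.

```python
def consensus_point(boxes, width, height, cell=10):
    """Vote each predicted box (x1, y1, x2, y2) into a coarse grid and
    return the centre of the cell with the most agreement."""
    cols, rows = width // cell, height // cell
    votes = [[0] * cols for _ in range(rows)]
    for x1, y1, x2, y2 in boxes:
        for r in range(max(0, y1 // cell), min(rows, -(-y2 // cell))):
            for c in range(max(0, x1 // cell), min(cols, -(-x2 // cell))):
                votes[r][c] += 1
    # max over (votes, row, col) tuples picks the highest-vote cell.
    _, r, c = max((votes[r][c], r, c) for r in range(rows) for c in range(cols))
    return (c * cell + cell // 2, r * cell + cell // 2)

# Three sampled predictions for the same GUI element, overlapping near (110, 55):
preds = [(90, 40, 130, 70), (95, 45, 125, 65), (80, 35, 120, 60)]
x, y = consensus_point(preds, width=200, height=100)
```

The returned point lies inside the region where all three sampled boxes agree; GUI-RCPO would go further and score each individual prediction by its overlap with this consensus to form a training-free reward.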
View blog
Resources: 50
RE-Searcher: Robust Agentic Search with Goal-oriented Planning and Self-reflection

RE-Searcher is an agentic search framework for large language models that integrates goal-oriented planning and self-reflection to enhance robustness in complex information environments. The method consistently achieves state-of-the-art performance on various question-answering datasets and substantially reduces performance degradation caused by noisy external search queries.

View blog
Resources
Generative Physical AI in Vision: A Survey

This survey systematically reviews and categorizes generative models in computer vision that produce physically plausible outputs, establishing a taxonomy of explicit and implicit physics-aware generation methods. It details six paradigms for integrating physical simulation, revealing a trend towards functional realism, and identifies current challenges in evaluation metrics.

View blog
Resources
Beyond Surface Reasoning: Unveiling the True Long Chain-of-Thought Capacity of Diffusion Large Language Models

Researchers from Harbin Institute of Technology formalize the Parallel–Sequential Contradiction (PSC), demonstrating that Diffusion Large Language Models (DLLMs) primarily exhibit superficial parallel reasoning and revert to autoregressive-like behavior when tackling complex Long Chain-of-Thought tasks. The study introduces parallel-encouraging prompting and diffusion early stopping, which effectively enhance DLLM reasoning capabilities.

View blog
Resources