This research disentangles the causal effects of pre-training, mid-training, and reinforcement learning (RL) on language model reasoning using a controlled synthetic task framework. It establishes that RL extends reasoning capabilities only under specific conditions of pre-training exposure and data calibration, with mid-training playing a crucial role in bridging training stages and improving generalization.
The paper empirically investigates the performance of multi-agent LLM systems across diverse agentic tasks and architectures, revealing that benefits are highly contingent on task structure rather than universal. It establishes a quantitative scaling principle, achieving 87% accuracy in predicting optimal agent architectures for unseen tasks based on model capability, task properties, and measured coordination dynamics.
The Native Parallel Reasoner (NPR) framework enables Large Language Models to autonomously acquire and deploy genuine parallel reasoning capabilities without relying on external teacher models. Experiments show NPR improves accuracy by up to 24.5% over baselines and delivers up to 4.6 times faster inference while maintaining 100% parallel execution across various benchmarks.
Wan-Move presents a framework for motion-controllable video generation that utilizes latent trajectory guidance to directly edit image condition features within a pre-trained image-to-video model. This method yields superior visual quality and precise motion adherence compared to state-of-the-art academic approaches and rivals commercial solutions, while also establishing MoveBench, a new comprehensive evaluation benchmark.
Terrain Diffusion introduces a diffusion-based framework for generating infinite, real-time procedural terrain, delivering highly realistic, boundless virtual worlds with seed-consistency and constant-time random access. The system achieves competitive FID scores and real-time generation latency on consumer hardware, demonstrating its practical applicability.
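The seed-consistency and constant-time random access properties can be illustrated with a short sketch: if every tile's noise is derived deterministically from the world seed and tile coordinates, any tile can be synthesized independently, in any order. The `sample_terrain_tile` function below is a hypothetical stand-in for the paper's diffusion sampler, not its actual implementation.

```python
import hashlib

import numpy as np

TILE = 64  # tile resolution, illustrative

def tile_rng(world_seed: int, tx: int, ty: int) -> np.random.Generator:
    """Deterministic per-tile RNG: the same (seed, coordinates) always
    yields the same noise, regardless of generation order."""
    digest = hashlib.sha256(f"{world_seed}:{tx}:{ty}".encode()).digest()
    return np.random.default_rng(int.from_bytes(digest[:8], "little"))

def sample_terrain_tile(world_seed: int, tx: int, ty: int) -> np.ndarray:
    """Hypothetical stand-in for the diffusion sampler: in the real
    system, this per-tile noise would be denoised into a heightmap."""
    return tile_rng(world_seed, tx, ty).standard_normal((TILE, TILE))

# Constant-time random access: tile (1000, -7) is generated directly,
# without touching any neighbors, and is exactly reproducible.
far_tile = sample_terrain_tile(world_seed=42, tx=1000, ty=-7)
assert np.array_equal(far_tile, sample_terrain_tile(42, 1000, -7))
```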
Astra, a collaborative effort from Tsinghua University and Kuaishou Technology, introduces an interactive general world model using an autoregressive denoising framework to generate real-world futures with precise action interactions. The model achieves superior performance in instruction following and visual fidelity across diverse simulation scenarios while efficiently extending a pre-trained video diffusion backbone.
This paper presents the Universal Weight Subspace Hypothesis, demonstrating empirically that deep neural networks trained across diverse tasks and modalities converge to shared low-dimensional parametric subspaces. This convergence enables significant memory savings, such as up to 100x for Vision Transformers and LLaMA models, and 19x for LoRA adapters, while preserving model performance and enhancing efficiency in model merging and adaptation.
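A minimal sketch of the underlying idea, using synthetic data in place of real checkpoints: if trained weights cluster near a shared low-dimensional subspace, an SVD over a stack of flattened weight vectors yields a basis in which each model reduces to a short coefficient vector. The dimensions and rank below are arbitrary, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)
n_models, dim, rank = 20, 4096, 8

# Synthetic stand-ins for flattened weight tensors of models trained on
# different tasks; real data would come from actual checkpoints.
latent = rng.standard_normal((n_models, rank))
basis_true = rng.standard_normal((rank, dim))
weights = latent @ basis_true + 0.01 * rng.standard_normal((n_models, dim))

# Fit a shared subspace from the stacked weights.
mean = weights.mean(axis=0)
_, _, vt = np.linalg.svd(weights - mean, full_matrices=False)
basis = vt[:rank]                                # shared low-dim basis

# Each model is now a `rank`-dim coefficient vector instead of `dim` floats.
coeffs = (weights - mean) @ basis.T
recon = coeffs @ basis + mean
rel_err = np.linalg.norm(recon - weights) / np.linalg.norm(weights)
print(f"storage reduction ~{dim / rank:.0f}x, relative error {rel_err:.4f}")
```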
Apple researchers introduced FAE (Feature Auto-Encoder), a minimalist framework using a single attention layer and a double-decoder architecture to adapt high-dimensional self-supervised visual features into compact, generation-friendly latent spaces. FAE achieves competitive FID scores on ImageNet (1.29) and MS-COCO (6.90) for image generation while preserving semantic understanding capabilities of the original pre-trained encoders.
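A hedged PyTorch sketch of the described shape, with layer sizes and the second decoder's target assumed rather than taken from the paper: a single attention layer compresses self-supervised feature tokens into a compact latent, and two decoders read it back, one reconstructing the original features to preserve semantics and one producing a generation-friendly output.

```python
import torch
import torch.nn as nn

class FAESketch(nn.Module):
    """Hypothetical reading of the FAE shape: one attention layer to
    compress SSL features, two decoders over the shared compact latent."""
    def __init__(self, feat_dim=1024, latent_dim=32, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(feat_dim, heads, batch_first=True)
        self.to_latent = nn.Linear(feat_dim, latent_dim)
        # Decoder 1: reconstruct the original SSL features (semantics).
        self.feat_decoder = nn.Linear(latent_dim, feat_dim)
        # Decoder 2: a generation-friendly target, assumed here to be a
        # 16x16 RGB patch per token; the real objective may differ.
        self.gen_decoder = nn.Linear(latent_dim, 3 * 16 * 16)

    def forward(self, feats):                   # feats: (B, N, feat_dim)
        h, _ = self.attn(feats, feats, feats)   # the single attention layer
        z = self.to_latent(h)                   # compact latent tokens
        return z, self.feat_decoder(z), self.gen_decoder(z)

tokens = torch.randn(2, 256, 1024)              # e.g. frozen SSL features
z, feat_rec, gen_out = FAESketch()(tokens)
print(z.shape, feat_rec.shape, gen_out.shape)   # latent is 32-dim per token
```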
The growing adoption of XR devices has fueled strong demand for high-quality stereo video, yet its production remains costly and artifact-prone. To address this challenge, we present StereoWorld, an end-to-end framework that repurposes a pretrained video generator for high-fidelity monocular-to-stereo video generation. Our framework jointly conditions the model on the monocular video input while explicitly supervising the generation with geometry-aware regularization to ensure 3D structural fidelity. A spatio-temporal tiling scheme is further integrated to enable efficient, high-resolution synthesis. To support large-scale training and evaluation, we curate a high-definition stereo video dataset containing over 11M frames aligned to natural human interpupillary distance (IPD). Extensive experiments demonstrate that StereoWorld substantially outperforms prior methods, generating stereo videos with superior visual fidelity and geometric consistency. The project webpage is available at this https URL.
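A rough sketch of the kind of spatial tiling such a scheme might build on (the paper's exact spatio-temporal design is not specified here): the video is processed in overlapping tiles whose outputs are cross-faded, so a memory-hungry generator can still produce high-resolution frames.

```python
import numpy as np

def blend_weights(size, overlap):
    """Positive linear ramps at both ends so overlapping tiles cross-fade."""
    w = np.ones(size)
    if overlap > 0:
        ramp = np.linspace(1.0 / overlap, 1.0, overlap)
        w[:overlap] = ramp
        w[-overlap:] = ramp[::-1]
    return w

def tiled_process(video, tile=64, overlap=16, fn=lambda x: x):
    """Run `fn` (a stand-in for the stereo generator) on overlapping
    spatial tiles and blend the seams by weighted averaging."""
    T, H, W, C = video.shape
    out = np.zeros_like(video, dtype=np.float64)
    acc = np.zeros((H, W, 1))
    step = tile - overlap
    for y in range(0, max(H - overlap, 1), step):
        for x in range(0, max(W - overlap, 1), step):
            ys = slice(y, min(y + tile, H))
            xs = slice(x, min(x + tile, W))
            patch = fn(video[:, ys, xs])
            wy = blend_weights(patch.shape[1], min(overlap, patch.shape[1] // 2))
            wx = blend_weights(patch.shape[2], min(overlap, patch.shape[2] // 2))
            w = (wy[:, None] * wx[None, :])[..., None]
            out[:, ys, xs] += patch * w
            acc[ys, xs] += w
    return out / np.maximum(acc, 1e-8)

video = np.random.rand(4, 256, 256, 3)
result = tiled_process(video)          # identity fn -> reconstructs the input
print(np.abs(result - video).max())    # seam error ~0
```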
The Astribot Team developed Lumo-1, a Vision-Language-Action (VLA) model that explicitly integrates structured reasoning with physical actions to achieve purposeful robotic control on their Astribot S1 bimanual mobile manipulator. This system exhibits superior generalization to novel objects and instructions, improves reasoning-action consistency through reinforcement learning, and outperforms state-of-the-art baselines in complex, long-horizon, and dexterous tasks.
Researchers from Fudan University and Shanghai Innovation Institute introduced RoPE++, an extension of Rotary Position Embeddings that re-incorporates the previously discarded imaginary component of attention scores to improve long-context modeling in Large Language Models. This method consistently outperforms standard RoPE on various benchmarks and offers significant KV-cache and parameter efficiency.
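The mechanism is easy to see in the complex-number view of RoPE, sketched below: the rotary attention score is the real part of a complex inner product, and the imaginary part is normally thrown away. How RoPE++ folds that imaginary component back into the final score is the paper's contribution; this sketch only computes both parts.

```python
import numpy as np

def rope_score_parts(q, k, m, n, base=10000.0):
    """Treat consecutive dimension pairs of q and k as complex numbers
    rotated by position. Standard RoPE keeps only the real part of the
    complex inner product; per the summary, RoPE++ also exploits the
    imaginary part, so both are returned here."""
    d = q.shape[-1]
    qc = q[0::2] + 1j * q[1::2]                 # pair dims -> complex
    kc = k[0::2] + 1j * k[1::2]
    theta = base ** (-np.arange(d // 2) / (d // 2))
    phase = np.exp(1j * (m - n) * theta)        # relative rotation
    score = np.sum(qc * np.conj(kc) * phase)
    return score.real, score.imag               # RoPE uses .real only

rng = np.random.default_rng(0)
q, k = rng.standard_normal(64), rng.standard_normal(64)
real, imag = rope_score_parts(q, k, m=10, n=3)
print(f"standard RoPE score: {real:.3f}; normally discarded part: {imag:.3f}")
```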
UniUGP presents a unified framework for end-to-end autonomous driving, integrating scene understanding, future video generation, and trajectory planning through a hybrid expert architecture. This approach enhances interpretability with Chain-of-Thought reasoning, achieves state-of-the-art performance in challenging long-tail scenarios, and demonstrates strong multimodal capabilities across various benchmarks.
This work presents a comprehensive engineering guide for designing and deploying production-grade agentic AI workflows, offering nine best practices demonstrated through a multimodal news-to-media generation case study. The approach improves system determinism, reliability, and responsible AI integration, reducing issues like hallucination and enabling scalable, maintainable deployments.
Meituan's LongCat-Image introduces an open-source, bilingual foundation model for image generation and editing, achieving state-of-the-art performance with a compact 6B parameter architecture. The model establishes new industry standards for Chinese character rendering, reaching 90.7% accuracy on a custom benchmark, and demonstrates robust image editing capabilities, often outperforming larger models.
The DEMOCRITUS system establishes a new framework for building large causal models (LCMs) by extracting and structuring textual knowledge from Large Language Models (LLMs) across diverse domains. It leverages a Geometric Transformer to embed and organize vast causal claims into coherent, navigable manifolds, which, unlike raw LLM outputs, exhibit global causal coherence and interpretable local structures.
InfiniteVL, a collaboration between Huazhong University of Science and Technology and Horizon Robotics, introduces a hybrid Vision-Language Model that synergizes linear and sparse attention to enable unlimited multimodal input processing with constant latency and memory footprint. The model achieves performance competitive with Transformer-based VLMs on diverse benchmarks, including information-intensive tasks, while demonstrating significant inference speedups and robust real-time streaming capabilities.
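A minimal streaming sketch of how a linear/sparse hybrid keeps latency and memory constant, with the summed combination and the single-head, projection-free setup as simplifying assumptions: a linear-attention branch maintains a fixed-size recurrent state over the whole stream, while a sparse branch runs exact softmax attention over only a short sliding window.

```python
import torch
import torch.nn.functional as F

def hybrid_attention_stream(tokens, window=8):
    """Streaming sketch: the linear branch keeps a fixed-size recurrent
    state (constant memory for unbounded input); the sparse branch runs
    exact softmax attention over only the last `window` tokens."""
    d = tokens.shape[-1]
    phi = lambda x: F.elu(x) + 1.0              # positive feature map
    state = torch.zeros(d, d)                   # running sum of phi(k) v^T
    norm = torch.zeros(d)                       # running sum of phi(k)
    cache, outputs = [], []
    for x in tokens:                            # one token at a time
        q = k = v = x                           # single head, no projections
        state = state + torch.outer(phi(k), v)  # linear-attention update
        norm = norm + phi(k)
        lin = (phi(q) @ state) / (phi(q) @ norm + 1e-6)
        cache = (cache + [(k, v)])[-window:]    # KV cache stays bounded
        ks = torch.stack([c[0] for c in cache])
        vs = torch.stack([c[1] for c in cache])
        attn = F.softmax(ks @ q / d ** 0.5, dim=0)
        outputs.append(lin + attn @ vs)         # sum the two branches
    return torch.stack(outputs)

out = hybrid_attention_stream(torch.randn(100, 16))
print(out.shape)                                # (100, 16), O(window) memory
```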
The DualVLN framework addresses Vision-Language Navigation by asynchronously integrating a slow VLM for high-level reasoning and a fast diffusion policy for real-time local control. This "Ground Slow, Move Fast" approach yields enhanced generalization and robust performance, including dynamic obstacle avoidance, across various simulated and real-world robotic platforms.
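The asynchronous "Ground Slow, Move Fast" pattern can be sketched as two loops running at different rates; the 1 Hz / 30 Hz rates and the waypoint interface below are illustrative assumptions, not the paper's specification.

```python
import threading
import time

# Shared subgoal written by the slow loop, read by the fast loop.
latest_goal = {"waypoint": (0.0, 0.0)}
lock = threading.Lock()

def slow_vlm_loop(stop: threading.Event):
    """Stands in for the VLM: emits a new mid-term waypoint ~1x/second."""
    step = 0
    while not stop.is_set():
        time.sleep(1.0)                          # VLM inference latency
        step += 1
        with lock:
            latest_goal["waypoint"] = (float(step), 0.0)

def fast_policy_loop(stop: threading.Event):
    """Stands in for the diffusion policy: consumes the latest waypoint
    at control rate, reacting to obstacles between VLM updates."""
    while not stop.is_set():
        with lock:
            goal = latest_goal["waypoint"]
        _ = goal                          # denoise a local action toward goal
        time.sleep(1 / 30)                # ~30 Hz control

stop = threading.Event()
threads = [threading.Thread(target=f, args=(stop,), daemon=True)
           for f in (slow_vlm_loop, fast_policy_loop)]
for t in threads:
    t.start()
time.sleep(3.5)
stop.set()
for t in threads:
    t.join()
print("slow grounding and fast control ran concurrently")
```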
Researchers from Harvard University and Perplexity conducted a large-scale field study on the real-world adoption and usage of general-purpose AI agents, leveraging hundreds of millions of user interactions with Perplexity's Comet AI-powered browser and its integrated Comet Assistant. The study provides foundational evidence on who uses these agents, their usage intensity, and a detailed breakdown of use cases via a novel hierarchical taxonomy.
DiffusionDriveV2 integrates reinforcement learning with an anchor-based truncated diffusion model to produce a diverse range of consistently high-quality trajectories for end-to-end autonomous driving. This approach achieves state-of-the-art performance on NAVSIM v1 and v2 benchmarks, with a PDMS of 91.2 on v1 and an EPDMS of 85.5 on v2.
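A toy sketch of anchor-based truncated diffusion, with a hand-written smoother standing in for the learned, scene-conditioned denoiser: sampling starts from plausible anchor trajectories plus a small amount of noise and runs only a few denoising steps, preserving the anchors' diversity while refining each mode.

```python
import numpy as np

rng = np.random.default_rng(0)

def denoise_step(traj):
    """Hand-written stand-in for the learned denoiser: nudges a noisy
    trajectory toward a smoother path. The real model is a network
    conditioned on the driving scene (and, per the paper, tuned with RL)."""
    smooth = np.convolve(traj, np.ones(3) / 3, mode="same")
    return traj + 0.5 * (smooth - traj)

# Anchor-based truncated diffusion: rather than denoising pure noise over
# a long schedule, start from plausible anchor trajectories plus a little
# noise and run only a few steps, keeping one refined mode per anchor.
anchors = np.stack([np.linspace(0.0, a, 20) for a in (-2.0, 0.0, 2.0)])
samples = anchors + 0.2 * rng.standard_normal(anchors.shape)
for _ in range(3):                               # truncated schedule
    samples = np.stack([denoise_step(s) for s in samples])
print(samples.shape)                             # 3 diverse trajectory modes
```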
Researchers from Google DeepMind, University College London, and the University of Oxford developed D4RT, a unified feedforward model for reconstructing dynamic 4D scenes, encompassing depth, spatio-temporal correspondence, and camera parameters, from video using a single, flexible querying interface. The model achieved state-of-the-art accuracy across various 4D reconstruction and tracking benchmarks, with 3D tracking throughput 18-300 times faster and pose estimation over 100 times faster than prior methods.
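A hypothetical sketch of what a single flexible querying interface could look like (the field names and shapes are assumptions, not the paper's API): depth, correspondence, and tracking all reduce to asking where a pixel observed at one time is located at another.

```python
from dataclasses import dataclass

import torch

@dataclass
class D4RTQuery:
    """Hypothetical query: where is pixel (u, v), observed at frame
    t_src, located at time t_tgt? Depth, correspondence, and tracking
    all reduce to batches of such queries."""
    u: float
    v: float
    t_src: int
    t_tgt: int

def query_model(video_features: torch.Tensor, queries: list) -> torch.Tensor:
    """Stand-in for the feedforward model: one batched pass answers
    arbitrary space-time queries, with no per-task heads or test-time
    optimization. A real model would cross-attend the query embeddings
    into the video features."""
    q = torch.tensor([[qr.u, qr.v, qr.t_src, qr.t_tgt] for qr in queries])
    return torch.zeros(len(q), 3)                # dummy xyz per query

feats = torch.randn(16, 256)                     # placeholder video encoding
pts = query_model(feats, [D4RTQuery(0.5, 0.5, t_src=0, t_tgt=8)])
print(pts.shape)                                 # (1, 3)
```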