alphaXiv

1,552

08 Dec 2025

causal-inference computer-science computation-and-language

On the Interplay of Pre-Training, Mid-Training, and RL on Reasoning Language Models

This research disentangles the causal effects of pre-training, mid-training, and reinforcement learning (RL) on language model reasoning using a controlled synthetic task framework. It establishes that RL extends reasoning capabilities only under specific conditions of pre-training exposure and data calibration, with mid-training playing a crucial role in bridging training stages and improving generalization.

273

09 Dec 2025

computer-science computer-vision-and-pattern-recognition generative-models

Wan-Move: Motion-controllable Video Generation via Latent Trajectory Guidance

Alibaba Group

Tsinghua University

The Chinese University of Hong Kong

The University of Hong Kong

Wan-Move presents a framework for motion-controllable video generation that utilizes latent trajectory guidance to directly edit image condition features within a pre-trained image-to-video model. This method yields superior visual quality and precise motion adherence compared to state-of-the-art academic approaches and rivals commercial solutions, while also establishing MoveBench, a new comprehensive evaluation benchmark.

4,618

06 Dec 2025

computer-science artificial-intelligence computer-vision-and-pattern-recognition

The Universal Weight Subspace Hypothesis

This paper presents the Universal Weight Subspace Hypothesis, demonstrating empirically that deep neural networks trained across diverse tasks and modalities converge to shared low-dimensional parametric subspaces. This convergence enables significant memory savings, such as up to 100x for Vision Transformers and LLaMA models, and 19x for LoRA adapters, while preserving model performance and enhancing efficiency in model merging and adaptation.

369

08 Dec 2025

agents computer-science computation-and-language

Native Parallel Reasoner: Reasoning in Parallelism via Self-Distilled Reinforcement Learning

The Native Parallel Reasoner (NPR) framework allows Large Language Models to autonomously acquire and deploy genuine parallel reasoning capabilities, without relying on external teacher models. Experiments show NPR improves accuracy by up to 24.5% over baselines and delivers up to 4.6 times faster inference, maintaining 100% parallel execution across various benchmarks.

174

09 Dec 2025

agentic-frameworks agents computer-science

Towards a Science of Scaling Agent Systems

The paper empirically investigates the performance of multi-agent LLM systems across diverse agentic tasks and architectures, revealing that benefits are highly contingent on task structure rather than universal. It establishes a quantitative scaling principle, achieving 87% accuracy in predicting optimal agent architectures for unseen tasks based on model capability, task properties, and measured coordination dynamics.

162

342

08 Dec 2025

attention-mechanisms computer-science computation-and-language

Beyond Real: Imaginary Extension of Rotary Position Embeddings for Long-Context LLMs

Fudan University Shanghai Innovation Institute

Researchers from Fudan University and Shanghai Innovation Institute introduced RoPE++, an extension of Rotary Position Embeddings that re-incorporates the previously discarded imaginary component of attention scores to improve long-context modeling in Large Language Models. This method consistently outperforms standard RoPE on various benchmarks and offers significant KV-cache and parameter efficiency.

191

09 Dec 2025

attention-mechanisms autonomous-vehicles computer-science

Astra: General Interactive World Model with Autoregressive Denoising

Tsinghua University Kuaishou Technology

Astra, a collaborative effort from Tsinghua University and Kuaishou Technology, introduces an interactive general world model using an autoregressive denoising framework to generate real-world futures with precise action interactions. The model achieves superior performance in instruction following and visual fidelity across diverse simulation scenarios while efficiently extending a pre-trained video diffusion backbone.

209

08 Dec 2025

attention-mechanisms computer-science artificial-intelligence

One Layer Is Enough: Adapting Pretrained Visual Encoders for Image Generation

Apple

Apple researchers introduced FAE (Feature Auto-Encoder), a minimalist framework using a single attention layer and a double-decoder architecture to adapt high-dimensional self-supervised visual features into compact, generation-friendly latent spaces. FAE achieves competitive FID scores on ImageNet (1.29) and MS-COCO (6.90) for image generation while preserving semantic understanding capabilities of the original pre-trained encoders.

156

08 Dec 2025

causal-inference computer-science artificial-intelligence

Large Causal Models from Large Language Models

Adobe

The DEMOCRITUS system establishes a new framework for building large causal models (LCMs) by extracting and structuring textual knowledge from Large Language Models (LLMs) across diverse domains. It leverages a Geometric Transformer to embed and organize vast causal claims into coherent, navigable manifolds, which, unlike raw LLM outputs, exhibit global causal coherence and interpretable local structures.

194

07 Dec 2025

agentic-frameworks agents computer-science

WisPaper: Your AI Scholar Search Engine

WisPaper introduces an AI-powered scholar search engine that unifies academic literature discovery, management, and continuous tracking within a single platform. Its core Deep Search component, powered by the WisModel agent, achieved 94.8% semantic similarity in query understanding and 93.70% overall accuracy in paper-criteria matching, demonstrating superior performance over leading commercial LLMs, especially in nuanced judgments.

100

10 Dec 2025

agents computer-science artificial-intelligence

Mind to Hand: Purposeful Robotic Control via Embodied Reasoning

Astribot

The Astribot Team developed Lumo-1, a Vision-Language-Action (VLA) model that explicitly integrates structured reasoning with physical actions to achieve purposeful robotic control on their Astribot S1 bimanual mobile manipulator. This system exhibits superior generalization to novel objects and instructions, improves reasoning-action consistency through reinforcement learning, and outperforms state-of-the-art baselines in complex, long-horizon, and dexterous tasks.

09 Dec 2025

computer-science artificial-intelligence computer-vision-and-pattern-recognition

Terrain Diffusion: A Diffusion-Based Successor to Perlin Noise in Infinite, Real-Time Terrain Generation

Terrain Diffusion introduces a diffusion-based framework for generating infinite, real-time procedural terrain, delivering highly realistic, boundless virtual worlds with seed-consistency and constant-time random access. The system achieves competitive FID scores and real-time generation latency on consumer hardware, demonstrating its practical applicability.

218

08 Dec 2025

computer-science computer-vision-and-pattern-recognition

LongCat-Image Technical Report

Meituan

Meituan's LongCat-Image introduces an open-source, bilingual foundation model for image generation and editing, achieving state-of-the-art performance with a compact 6B parameter architecture. The model establishes new industry standards for Chinese character rendering, reaching 90.7% accuracy on a custom benchmark, and demonstrates robust image editing capabilities, often outperforming larger models.

309

127

08 Dec 2025

agentic-frameworks agents computer-science

The Adoption and Usage of AI Agents: Early Evidence from Perplexity

Harvard University Perplexity

Researchers from Harvard University and Perplexity conducted a large-scale field study on the real-world adoption and usage of general-purpose AI agents, leveraging hundreds of millions of user interactions with Perplexity's Comet AI-powered browser and its integrated Comet Assistant. The study provides foundational evidence on who uses these agents, their usage intensity, and a detailed breakdown of use cases via a novel hierarchical taxonomy.

09 Dec 2025

computer-science robotics

Ground Slow, Move Fast: A Dual-System Foundation Model for Generalizable Vision-and-Language Navigation

Shanghai AI Laboratory

Tsinghua University

Zhejiang University

The University of Hong Kong

The DualVLN framework addresses Vision-Language Navigation by asynchronously integrating a slow VLM for high-level reasoning and a fast diffusion policy for real-time local control. This "Ground Slow, Move Fast" approach yields enhanced generalization and robust performance, including dynamic obstacle avoidance, across various simulated and real-world robotic platforms.

456

09 Dec 2025

attention-mechanisms computer-science artificial-intelligence

InfiniteVL: Synergizing Linear and Sparse Attention for Highly-Efficient, Unlimited-Input Vision-Language Models

Huazhong University of Science and Technology Horizon Robotics

InfiniteVL, a collaboration between Huazhong University of Science and Technology and Horizon Robotics, introduces a hybrid Vision-Language Model that synergizes linear and sparse attention to enable unlimited multimodal input processing with constant latency and memory footprint. The model achieves performance competitive with Transformer-based VLMs on diverse benchmarks, including information-intensive tasks, while demonstrating significant inference speedups and robust real-time streaming capabilities.

195

07 Dec 2025

computer-science artificial-intelligence computer-vision-and-pattern-recognition

VideoVLA: Video Generators Can Be Generalizable Robot Manipulators

Researchers from Microsoft Research Asia, Xi'an Jiaotong University, and Fudan University developed VideoVLA, a robot manipulator that repurposes large pre-trained video generation models. This system jointly predicts future video states and corresponding actions, achieving enhanced generalization capabilities for novel objects and skills in both simulated and real-world environments.

09 Dec 2025

computer-science computer-vision-and-pattern-recognition efficient-transformers

Efficiently Reconstructing Dynamic Scenes One D4RT at a Time

Google DeepMind

University College London

University of Oxford

Researchers from Google DeepMind, University College London, and the University of Oxford developed D4RT, a unified feedforward model for reconstructing dynamic 4D scenes, encompassing depth, spatio-temporal correspondence, and camera parameters, from video using a single, flexible querying interface. The model achieved state-of-the-art accuracy across various 4D reconstruction and tracking benchmarks, with 3D tracking throughput 18-300 times faster and pose estimation over 100 times faster than prior methods.

2,756

194

08 Dec 2025

computer-science artificial-intelligence computer-vision-and-pattern-recognition

Relational Visual Similarity

Researchers from University of Wisconsin-Madison, UCLA, and Adobe Research introduce a computational framework for "relational visual similarity," which identifies image commonalities based on abstract logic rather than surface features. Their `relsim` model, trained on a novel dataset of images paired with anonymous group-derived captions, aligns significantly with human perception of relational similarity and outperforms existing attribute-based metrics in retrieval tasks.

10 Dec 2025

agents autonomous-vehicles chain-of-thought

UniUGP: Unifying Understanding, Generation, and Planing For End-to-end Autonomous Driving

ByteDance

UniUGP presents a unified framework for end-to-end autonomous driving, integrating scene understanding, future video generation, and trajectory planning through a hybrid expert architecture. This approach enhances interpretability with Chain-of-Thought reasoning and demonstrates state-of-the-art performance in challenging long-tail scenarios and multimodal capabilities across various benchmarks.

There are no more papers matching your filters at the moment.

Events

Personalize Your Feed

Install Browser Extension

We're hiring

alphaXiv

Explore

State of the Art

Sign In

Labs

Feedback

Dark mode

On the Interplay of Pre-Training, Mid-Training, and RL on Reasoning Language Models

Wan-Move: Motion-controllable Video Generation via Latent Trajectory Guidance

The Universal Weight Subspace Hypothesis

Native Parallel Reasoner: Reasoning in Parallelism via Self-Distilled Reinforcement Learning

Towards a Science of Scaling Agent Systems

Beyond Real: Imaginary Extension of Rotary Position Embeddings for Long-Context LLMs

Astra: General Interactive World Model with Autoregressive Denoising

One Layer Is Enough: Adapting Pretrained Visual Encoders for Image Generation

Large Causal Models from Large Language Models

WisPaper: Your AI Scholar Search Engine

Mind to Hand: Purposeful Robotic Control via Embodied Reasoning

Terrain Diffusion: A Diffusion-Based Successor to Perlin Noise in Infinite, Real-Time Terrain Generation

LongCat-Image Technical Report

The Adoption and Usage of AI Agents: Early Evidence from Perplexity

Ground Slow, Move Fast: A Dual-System Foundation Model for Generalizable Vision-and-Language Navigation

InfiniteVL: Synergizing Linear and Sparse Attention for Highly-Efficient, Unlimited-Input Vision-Language Models

VideoVLA: Video Generators Can Be Generalizable Robot Manipulators

Efficiently Reconstructing Dynamic Scenes One D4RT at a Time

Relational Visual Similarity

UniUGP: Unifying Understanding, Generation, and Planing For End-to-end Autonomous Driving

Events

AI for Law

Personalize Your Feed