alphaXiv

representation-learning

09 Dec 2025

representation-learning computer-science computer-vision-and-pattern-recognition

Wan-Move: Motion-controllable Video Generation via Latent Trajectory Guidance

Alibaba Group

Tsinghua University

The Chinese University of Hong Kong

The University of Hong Kong

Wan-Move presents a framework for motion-controllable video generation that utilizes latent trajectory guidance to directly edit image condition features within a pre-trained image-to-video model. This method yields superior visual quality and precise motion adherence compared to state-of-the-art academic approaches and rivals commercial solutions, while also establishing MoveBench, a new comprehensive evaluation benchmark.

06 Dec 2025

representation-learning computer-science artificial-intelligence

The Universal Weight Subspace Hypothesis

This paper presents the Universal Weight Subspace Hypothesis, demonstrating empirically that deep neural networks trained across diverse tasks and modalities converge to shared low-dimensional parametric subspaces. This convergence enables significant memory savings, such as up to 100x for Vision Transformers and LLaMA models, and 19x for LoRA adapters, while preserving model performance and enhancing efficiency in model merging and adaptation.

08 Dec 2025

representation-learning attention-mechanisms computer-science

Beyond Real: Imaginary Extension of Rotary Position Embeddings for Long-Context LLMs

Fudan University Shanghai Innovation Institute

Researchers from Fudan University and Shanghai Innovation Institute introduced RoPE++, an extension of Rotary Position Embeddings that re-incorporates the previously discarded imaginary component of attention scores to improve long-context modeling in Large Language Models. This method consistently outperforms standard RoPE on various benchmarks and offers significant KV-cache and parameter efficiency.
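
As background for the summary above, here is a minimal sketch of the standard RoPE identity, using only the Python standard library. It shows that the attention score between a rotated query at position m and a rotated key at position n equals the real part of a complex inner product, and that the same product also has an imaginary part which standard RoPE leaves unused. The function names, the base of 10000, and the toy vectors are illustrative assumptions; how RoPE++ actually uses the imaginary component is not described in this listing and is not modeled here.

```python
import cmath
import math


def rope_rotate(vec, pos, base=10000.0):
    """Apply standard RoPE to an even-length list of floats at position `pos`.

    Each consecutive pair (x, y) is rotated by the angle pos * theta_j.
    """
    d = len(vec)
    out = []
    for j in range(d // 2):
        theta = base ** (-2.0 * j / d)
        c, s = math.cos(pos * theta), math.sin(pos * theta)
        x, y = vec[2 * j], vec[2 * j + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def complex_score(q, k, m, n, base=10000.0):
    """Return sum_j q_j * conj(k_j) * exp(i * (m - n) * theta_j).

    Each pair of real coordinates is read as one complex number.
    """
    d = len(q)
    total = 0j
    for j in range(d // 2):
        theta = base ** (-2.0 * j / d)
        qc = complex(q[2 * j], q[2 * j + 1])
        kc = complex(k[2 * j], k[2 * j + 1])
        total += qc * kc.conjugate() * cmath.exp(1j * (m - n) * theta)
    return total


# Toy vectors (hypothetical values, chosen only for illustration).
q = [1.0, 2.0, 0.5, -1.0]
k = [0.3, -0.7, 1.2, 0.4]

# Standard RoPE score: plain dot product of the rotated query and key.
rope_dot = sum(a * b for a, b in zip(rope_rotate(q, 5), rope_rotate(k, 2)))

score = complex_score(q, k, 5, 2)
print("RoPE dot product:", rope_dot)
print("real part:       ", score.real)  # matches the dot product
print("imaginary part:  ", score.imag)  # unused by standard RoPE
```

Both parts depend only on the offset m - n, so the imaginary part carries relative-position information of the same kind as the real part.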

08 Dec 2025

representation-learning attention-mechanisms computer-science

One Layer Is Enough: Adapting Pretrained Visual Encoders for Image Generation

Apple

Apple researchers introduced FAE (Feature Auto-Encoder), a minimalist framework using a single attention layer and a double-decoder architecture to adapt high-dimensional self-supervised visual features into compact, generation-friendly latent spaces. FAE achieves competitive FID scores on ImageNet (1.29) and MS-COCO (6.90) for image generation while preserving semantic understanding capabilities of the original pre-trained encoders.

08 Dec 2025

representation-learning causal-inference computer-science

Large Causal Models from Large Language Models

Adobe

The DEMOCRITUS system establishes a new framework for building large causal models (LCMs) by extracting and structuring textual knowledge from Large Language Models (LLMs) across diverse domains. It leverages a Geometric Transformer to embed and organize vast causal claims into coherent, navigable manifolds, which, unlike raw LLM outputs, exhibit global causal coherence and interpretable local structures.

09 Dec 2025

representation-learning computer-science artificial-intelligence

Terrain Diffusion: A Diffusion-Based Successor to Perlin Noise in Infinite, Real-Time Terrain Generation

Terrain Diffusion introduces a diffusion-based framework for generating infinite, real-time procedural terrain, delivering highly realistic, boundless virtual worlds with seed-consistency and constant-time random access. The system achieves competitive FID scores and real-time generation latency on consumer hardware, demonstrating its practical applicability.

09 Dec 2025

representation-learning computer-science computer-vision-and-pattern-recognition

Efficiently Reconstructing Dynamic Scenes One D4RT at a Time

Google DeepMind

University College London

University of Oxford

Researchers from Google DeepMind, University College London, and the University of Oxford developed D4RT, a unified feedforward model for reconstructing dynamic 4D scenes, encompassing depth, spatio-temporal correspondence, and camera parameters, from video using a single, flexible querying interface. The model achieved state-of-the-art accuracy across various 4D reconstruction and tracking benchmarks, with 3D tracking throughput 18-300 times faster and pose estimation over 100 times faster than prior methods.

08 Dec 2025

representation-learning computer-science artificial-intelligence

Relational Visual Similarity

Researchers from University of Wisconsin-Madison, UCLA, and Adobe Research introduce a computational framework for "relational visual similarity," which identifies image commonalities based on abstract logic rather than surface features. Their `relsim` model, trained on a novel dataset of images paired with anonymous group-derived captions, aligns significantly with human perception of relational similarity and outperforms existing attribute-based metrics in retrieval tasks.

10 Dec 2025

representation-learning computer-science computer-vision-and-pattern-recognition

StereoWorld: Geometry-Aware Monocular-to-Stereo Video Generation

University of Toronto

Beijing Jiaotong University Visual Intelligence + X International Joint Laboratory Dzine AI

The growing adoption of XR devices has fueled strong demand for high-quality stereo video, yet its production remains costly and artifact-prone. To address this challenge, we present StereoWorld, an end-to-end framework that repurposes a pretrained video generator for high-fidelity monocular-to-stereo video generation. Our framework jointly conditions the model on the monocular video input while explicitly supervising the generation with a geometry-aware regularization to ensure 3D structural fidelity. A spatio-temporal tiling scheme is further integrated to enable efficient, high-resolution synthesis. To enable large-scale training and evaluation, we curate a high-definition stereo video dataset containing over 11M frames aligned to natural human interpupillary distance (IPD). Extensive experiments demonstrate that StereoWorld substantially outperforms prior methods, generating stereo videos with superior visual fidelity and geometric consistency. The project webpage is available at this https URL.

07 Dec 2025

representation-learning computer-science computer-vision-and-pattern-recognition

MeshSplatting: Differentiable Rendering with Opaque Meshes

University of Toronto

University of British Columbia

University of Maryland Simon Fraser University University of Liège

Adobe

MeshSplatting generates connected, opaque, and colored triangle meshes from images using differentiable rendering, enabling direct integration of neurally reconstructed scenes into traditional 3D graphics pipelines. The method achieves a +0.69 dB PSNR improvement over MiLo on the Mip-NeRF360 dataset and trains 2x faster while requiring 2.5x less memory.

09 Dec 2025

representation-learning agents computer-science

Visionary: The World Model Carrier Built on WebGPU-Powered Gaussian Splatting Platform

Visionary introduces a WebGPU-powered platform for 3D Gaussian Splatting (3DGS) that enables real-time, client-side rendering and inference for dynamic and generative 3DGS models. The platform demonstrates up to 135x speedup compared to WebGL-based viewers, while maintaining or improving visual quality and ensuring robust depth-aware composition.

08 Dec 2025

representation-learning attention-mechanisms computer-science

Group Representational Position Encoding

The paper introduces Group Representational Position Encoding (GRAPE), a unified group-theoretic framework that re-conceptualizes and unifies existing positional encoding mechanisms like RoPE and ALiBi. It provides a principled design space for new encodings, demonstrating improved training stability and superior zero-shot performance in large language models.

08 Dec 2025

representation-learning computer-science computer-vision-and-pattern-recognition

Distribution Matching Variational AutoEncoder

Peking University

Tencent UCAS

A new framework, Distribution Matching Variational AutoEncoder (DMVAE), explicitly aligns a VAE's aggregate latent distribution with a pre-defined reference distribution using score-based matching. The approach achieves a state-of-the-art gFID of 1.82 on ImageNet 256x256, demonstrating superior training efficiency for downstream generative models, particularly when utilizing Self-Supervised Learning features as the reference.

09 Dec 2025

representation-learning computer-science computer-vision-and-pattern-recognition

TrackingWorld: World-centric Monocular 3D Tracking of Almost All Pixels

University of Science and Technology of China

The Chinese University of Hong Kong Xiamen University

The University of Hong Kong

HKUST Macau University of Science and Technology

Researchers at HKUST developed TrackingWorld, a framework for dense, world-centric 3D tracking of nearly all pixels in monocular videos, effectively disentangling camera and object motion. This method integrates foundation models with a novel optimization pipeline to track objects, including newly emerging ones, demonstrating superior camera pose estimation and 3D depth consistency, achieving, for example, an Abs Rel depth error of 0.218 on Sintel compared to 0.636 from baselines.
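
For readers unfamiliar with the metric quoted above, here is a minimal sketch of the absolute relative depth error (Abs Rel) as it is commonly defined: the mean of |predicted - ground truth| / ground truth over pixels with valid ground-truth depth. Lower is better. The function name and the toy depth values are hypothetical, and the exact evaluation protocol behind the reported 0.218 and 0.636 (masking, scale alignment) is not given in this listing.

```python
def abs_rel(pred, gt):
    """Mean absolute relative depth error over pixels with positive ground truth.

    `pred` and `gt` are equal-length sequences of depths (flattened images).
    """
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    if not pairs:
        raise ValueError("no pixels with valid (positive) ground-truth depth")
    return sum(abs(p - g) / g for p, g in pairs) / len(pairs)


# Toy depths in metres (illustrative values only).
predicted = [1.1, 1.8, 4.0]
ground_truth = [1.0, 2.0, 4.0]
print(abs_rel(predicted, ground_truth))
```

Because each error is divided by the true depth, a 10 cm mistake counts for more on a nearby surface than on a distant one.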

09 Dec 2025

representation-learning computer-science artificial-intelligence

Same Content, Different Answers: Cross-Modal Inconsistency in MLLMs

Multimodal Large Language Models (MLLMs) exhibit substantial cross-modal inconsistency, producing different answers for semantically identical information presented across image, text, and mixed modalities. This problem persists even with perfect Optical Character Recognition (OCR), revealing an inherent reasoning challenge where text inputs generally achieve higher accuracy than image inputs.

08 Dec 2025

representation-learning computer-science computer-vision-and-pattern-recognition

OneStory: Coherent Multi-Shot Video Generation with Adaptive Memory

University of Copenhagen

Closing the Train-Test Gap in World Models for Gradient-Based Planning

The University of Texas at Austin

Columbia University

Researchers from Columbia University and NYU introduced Online World Modeling (OWM) and Adversarial World Modeling (AWM) to mitigate the train-test gap in world models for gradient-based planning (GBP). These methods enabled GBP to achieve performance comparable to or better than search-based planning algorithms like CEM, while simultaneously reducing computation time by an order of magnitude across various robotic tasks.

08 Dec 2025

representation-learning computer-science artificial-intelligence

WorldReel: 4D Video Generation with Consistent Geometry and Motion Modeling

University College London

The University of Texas at Austin

Adobe

WorldReel develops a unified, feed-forward 4D generator that integrates geometry, motion, and appearance directly into a latent diffusion model, yielding videos with explicit 4D scene representations. The model achieves state-of-the-art photorealism and significantly improves geometric consistency and dynamic range, particularly for complex scenes with moving cameras.

09 Dec 2025

representation-learning computer-science artificial-intelligence

WonderZoom: Multi-Scale 3D World Generation

Stanford University

We present WonderZoom, a novel approach to generating 3D scenes with contents across multiple spatial scales from a single image. Existing 3D world generation models remain limited to single-scale synthesis and cannot produce coherent scene contents at varying granularities. The fundamental challenge is the lack of a scale-aware 3D representation capable of generating and rendering content with largely different spatial sizes. WonderZoom addresses this through two key innovations: (1) scale-adaptive Gaussian surfels for generating and real-time rendering of multi-scale 3D scenes, and (2) a progressive detail synthesizer that iteratively generates finer-scale 3D contents. Our approach enables users to "zoom into" a 3D region and auto-regressively synthesize previously non-existent fine details from landscapes to microscopic features. Experiments demonstrate that WonderZoom significantly outperforms state-of-the-art video and 3D models in both quality and alignment, enabling multi-scale 3D world creation from a single image. We show video results and an interactive viewer of generated multi-scale 3D worlds in this https URL

08 Dec 2025

representation-learning computer-science computer-vision-and-pattern-recognition

Preserving Source Video Realism: High-Fidelity Face Swapping for Cinematic Quality

Zhejiang University Zhejiang University of Technology

Researchers at Zhejiang University developed LIVINGSWAP, a high-fidelity video face swapping framework designed for cinematic quality by directly leveraging complete source video attributes and employing keyframe conditioning. The system outperforms existing methods on new cinematic benchmarks and reduces manual editing effort by approximately 40 times.
