Evaluating Gemini Robotics Policies in a Veo World Simulator
11 Dec 2025

Google DeepMind's Gemini Robotics Team developed a system utilizing the Veo video foundation model to evaluate generalist robot policies. The approach accurately predicts policy performance and generalization, identifies safety vulnerabilities across varied scenarios, and correlates strongly with real-world outcomes.
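
The summary describes a closed evaluation loop rather than a specific algorithm, so the sketch below is only a generic illustration of rolling a policy out inside a generative video model; `policy_act`, `world_reset`, `world_step`, and `detect_success` are hypothetical placeholders, not DeepMind's API.

```python
# Hypothetical sketch of evaluating a robot policy inside a generative video
# "world model". Every callable below is a placeholder standing in for a
# component the summary mentions; none of this is the actual Gemini/Veo API.

def evaluate_in_video_world(policy_act, world_reset, world_step, detect_success,
                            task: str, episodes: int = 50, horizon: int = 200) -> float:
    """Roll the policy out entirely in generated video and report a success rate."""
    successes = 0
    for _ in range(episodes):
        frame = world_reset(task)                # video model generates an initial scene
        for _ in range(horizon):
            action = policy_act(frame, task)     # policy only ever sees generated frames
            frame = world_step(frame, action)    # video model predicts the next observation
            if detect_success(frame, task):      # e.g. a learned success classifier
                successes += 1
                break
    return successes / episodes
```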

One Layer Is Enough: Adapting Pretrained Visual Encoders for Image Generation
08 Dec 2025

Apple researchers introduced FAE (Feature Auto-Encoder), a minimalist framework using a single attention layer and a double-decoder architecture to adapt high-dimensional self-supervised visual features into compact, generation-friendly latent spaces. FAE achieves competitive FID scores for image generation on ImageNet (1.29) and MS-COCO (6.90) while preserving the semantic understanding capabilities of the original pre-trained encoders.
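
As a rough illustration of the "single attention layer" idea, the PyTorch sketch below maps frozen encoder features to a compact latent and reconstructs them with one decoder; the class name, shapes, and single-decoder simplification are assumptions, not the paper's implementation (which uses a double decoder).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SingleLayerFeatureAdapter(nn.Module):
    """Minimal sketch (not the paper's code): one self-attention layer plus a
    linear bottleneck that maps frozen SSL features to a compact latent, and a
    simple decoder that reconstructs the features from that latent."""
    def __init__(self, feat_dim: int = 768, latent_dim: int = 32, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(feat_dim, heads, batch_first=True)
        self.to_latent = nn.Linear(feat_dim, latent_dim)   # generation-friendly latent
        self.decoder = nn.Linear(latent_dim, feat_dim)     # single decoder; the paper uses two

    def forward(self, feats: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        h, _ = self.attn(feats, feats, feats)              # (B, N, feat_dim)
        z = self.to_latent(h)                              # (B, N, latent_dim)
        return z, self.decoder(z)

# Usage sketch: feats come from a frozen encoder such as DINO; the adapter is
# trained with a feature-reconstruction objective, e.g.
#   z, recon = adapter(feats); loss = F.mse_loss(recon, feats)
```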

Stronger Normalization-Free Transformers
11 Dec 2025

Researchers from Princeton, NYU, and Carnegie Mellon introduced Dynamic erf (Derf), a point-wise function that replaces traditional normalization layers in Transformers. Derf consistently achieved higher accuracy or lower error rates across various modalities, including an average of 0.6 percentage points higher top-1 accuracy on ImageNet-1K for ViT-Base/Large models and improved FID scores for Diffusion Transformers.
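
The summary describes Derf only as a point-wise replacement for normalization; a minimal PyTorch sketch of such a layer follows, assuming a learnable per-channel input scale and an affine output (by analogy with Dynamic Tanh). The exact parameterization is an assumption, not the paper's definition.

```python
import torch
import torch.nn as nn

class Derf(nn.Module):
    """Sketch of a point-wise "Dynamic erf" layer used in place of LayerNorm.
    The learnable per-channel input scale plus affine output is an assumption,
    not necessarily the paper's exact parameterization."""
    def __init__(self, dim: int, alpha_init: float = 1.0):
        super().__init__()
        self.alpha = nn.Parameter(torch.full((dim,), alpha_init))  # per-channel input scale
        self.gamma = nn.Parameter(torch.ones(dim))                  # affine gain
        self.beta = nn.Parameter(torch.zeros(dim))                  # affine bias

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Purely element-wise: no means, variances, or reductions, unlike LayerNorm.
        return self.gamma * torch.erf(self.alpha * x) + self.beta

# Usage sketch: swap each nn.LayerNorm(dim) inside a Transformer block for Derf(dim).
```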

Remember Me, Refine Me: A Dynamic Procedural Memory Framework for Experience-Driven Agent Evolution
11 Dec 2025

ReMe introduces a dynamic procedural memory framework for LLM agents, enabling continuous learning and adaptation through a closed-loop system of experience acquisition, reuse, and refinement. This framework allows smaller language models to achieve performance comparable to or surpassing larger, memory-less models, demonstrating a memory-scaling effect and improving agent robustness.
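
The closed loop of experience acquisition, reuse, and refinement can be pictured as a small memory store; the Python below is a toy illustration under assumed data structures and a word-overlap retrieval stand-in, not ReMe's implementation.

```python
# Toy sketch of a procedural memory for an LLM agent: acquire an experience after
# each task, retrieve relevant ones before the next task, and refine entries that
# led to failure. Names and retrieval logic are assumptions, not ReMe's code.

from dataclasses import dataclass, field

@dataclass
class Experience:
    task: str
    procedure: str          # distilled "how-to" the agent wrote for itself
    successes: int = 0
    failures: int = 0

@dataclass
class ProceduralMemory:
    entries: list[Experience] = field(default_factory=list)

    def retrieve(self, task: str, k: int = 3) -> list[Experience]:
        # Toy relevance score: shared words between tasks (a real system would embed).
        score = lambda e: len(set(e.task.lower().split()) & set(task.lower().split()))
        return sorted(self.entries, key=score, reverse=True)[:k]

    def acquire(self, task: str, procedure: str) -> None:
        self.entries.append(Experience(task, procedure))

    def refine(self, exp: Experience, succeeded: bool, revised: str | None = None) -> None:
        if succeeded:
            exp.successes += 1
        else:
            exp.failures += 1
            if revised:                  # rewrite the procedure after a failure
                exp.procedure = revised
```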

SVG-T2I: Scaling Up Text-to-Image Latent Diffusion Model Without Variational Autoencoder

Researchers from Tsinghua University and Kuaishou Technology developed SVG-T2I, a text-to-image latent diffusion model that bypasses the need for a Variational Autoencoder by directly utilizing Visual Foundation Model (VFM) features (DINOv3). The model achieves competitive performance on large-scale T2I benchmarks, scoring 0.75 on GenEval and 85.78 on DPG-Bench, demonstrating the efficacy of VFM representations for high-fidelity image generation.
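
To make the "no VAE" point concrete, here is a hedged sketch of diffusing directly in VFM feature space; the noise schedule, shapes, and the `frozen_dinov3` / `denoiser` names are placeholders rather than the paper's setup.

```python
import torch
import torch.nn.functional as F

# Sketch of the core idea: run the diffusion forward process directly on visual
# foundation model (VFM) features instead of VAE latents. The cosine schedule,
# shapes, and variable names are illustrative assumptions.

def add_noise(features: torch.Tensor, t: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """DDPM-style noising applied to VFM feature maps.
    features: (B, N_patches, C); t: timesteps in [0, 1] with shape (B, 1, 1)."""
    noise = torch.randn_like(features)
    alpha_bar = torch.cos(t * torch.pi / 2) ** 2
    noisy = alpha_bar.sqrt() * features + (1 - alpha_bar).sqrt() * noise
    return noisy, noise

# Training-step sketch (denoiser and text encoder are placeholders):
#   feats = frozen_dinov3(images)              # replaces the usual VAE latents
#   noisy, noise = add_noise(feats, t)
#   pred = denoiser(noisy, t, text_embeddings)
#   loss = F.mse_loss(pred, noise)
```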

What matters for Representation Alignment: Global Information or Spatial Structure?

A collaborative study from Adobe Research, Australian National University, and New York University investigates what drives representation alignment in diffusion models, finding that the spatial structure of pretrained vision encoder features matters more for high-quality image generation than global semantic content, challenging a prevalent assumption. The work introduces iREPA, a refined alignment method requiring only minimal code changes, which consistently improves the convergence speed and generation quality of Diffusion Transformers across models and tasks.
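
One way to picture the spatial-vs-global distinction is in how a REPA-style alignment loss is computed: aligning per-patch features preserves spatial structure, while pooling first matches only global content. The sketch below is a generic illustration of that contrast, not iREPA's exact formulation.

```python
import torch
import torch.nn.functional as F

def patchwise_alignment_loss(dit_hidden: torch.Tensor, encoder_feats: torch.Tensor,
                             proj: torch.nn.Module) -> torch.Tensor:
    """Align each DiT patch token to the corresponding frozen-encoder patch feature,
    preserving spatial structure. Both tensors are (B, N_patches, C)."""
    pred = proj(dit_hidden)                                   # project to the encoder's dim
    return 1 - F.cosine_similarity(pred, encoder_feats, dim=-1).mean()

def global_alignment_loss(dit_hidden: torch.Tensor, encoder_feats: torch.Tensor,
                          proj: torch.nn.Module) -> torch.Tensor:
    """Align only pooled, image-level features: global semantics without spatial layout."""
    pred = proj(dit_hidden).mean(dim=1)                       # (B, C)
    target = encoder_feats.mean(dim=1)
    return 1 - F.cosine_similarity(pred, target, dim=-1).mean()
```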

VL-JEPA: Joint Embedding Predictive Architecture for Vision-language
11 Dec 2025

Researchers at Meta FAIR developed VL-JEPA, a Joint Embedding Predictive Architecture for vision-language tasks, which predicts abstract semantic embeddings rather than explicit tokens. This approach leads to improved computational efficiency and enables real-time applications through non-autoregressive prediction and selective decoding, while achieving competitive performance across classification, retrieval, and Visual Question Answering benchmarks.
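
The departure from token-prediction VLMs is the training target: the model regresses a semantic embedding of the answer instead of decoding tokens autoregressively. A minimal sketch of such an objective follows; the loss choice and names are assumptions.

```python
import torch
import torch.nn.functional as F

def embedding_prediction_loss(predicted_emb: torch.Tensor,
                              target_emb: torch.Tensor) -> torch.Tensor:
    """JEPA-style objective sketch: predict the target text embedding directly in
    latent space, with no autoregressive token decoding. Both tensors are (B, D)."""
    predicted_emb = F.normalize(predicted_emb, dim=-1)
    target_emb = F.normalize(target_emb, dim=-1)
    return (1 - (predicted_emb * target_emb).sum(dim=-1)).mean()   # cosine distance

# Inference sketch: compare the predicted embedding against candidate answer
# embeddings for classification/retrieval, and only invoke a text decoder when a
# free-form answer is actually needed ("selective decoding").
```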

E-RayZer: Self-supervised 3D Reconstruction as Spatial Visual Pre-training

E-RayZer introduces a self-supervised framework for 3D reconstruction, employing explicit 3D Gaussian Splatting to learn camera parameters and scene geometry directly from unlabeled multi-view images. The system achieves competitive performance against supervised methods and outperforms prior self-supervised approaches, establishing itself as an effective spatial visual pre-training model.

Bidirectional Normalizing Flow: From Data to Noise and Back

Bidirectional Normalizing Flow (BiFlow) introduces a method to learn an approximate inverse for Normalizing Flows, allowing for highly efficient and high-fidelity image generation. This approach achieves a state-of-the-art FID of 2.39 on ImageNet 256x256 while accelerating sampling speed by up to 697x compared to prior NF models.
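
The central trick, learning an approximate inverse so sampling avoids the flow's slow exact inversion, can be sketched as a simple reconstruction objective; the interfaces below are assumed, not the paper's training recipe.

```python
import torch
import torch.nn.functional as F

def inverse_training_step(flow, inverse_net, x: torch.Tensor) -> torch.Tensor:
    """Train `inverse_net` to map noise back to data, approximating the exact (slow)
    inverse of a pretrained normalizing flow `flow`. Interfaces are placeholders."""
    with torch.no_grad():
        z = flow(x)                      # forward pass of the pretrained flow: data -> noise
    x_hat = inverse_net(z)               # learned approximate inverse: noise -> data
    return F.mse_loss(x_hat, x)

# Sampling sketch: draw z ~ N(0, I) and produce an image in one call, x = inverse_net(z),
# instead of inverting the flow step by step.
```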

Towards a Science of Scaling Agent Systems
09 Dec 2025

The paper empirically investigates the performance of multi-agent LLM systems across diverse agentic tasks and architectures, revealing that benefits are highly contingent on task structure rather than universal. It establishes a quantitative scaling principle, achieving 87% accuracy in predicting optimal agent architectures for unseen tasks based on model capability, task properties, and measured coordination dynamics.

V-RGBX: Video Editing with Accurate Controls over Intrinsic Properties

Large-scale video generation models have shown remarkable potential in modeling photorealistic appearance and lighting interactions in real-world scenes. However, a closed-loop framework that jointly understands intrinsic scene properties (e.g., albedo, normal, material, and irradiance), leverages them for video synthesis, and supports editable intrinsic representations remains unexplored. We present V-RGBX, the first end-to-end framework for intrinsic-aware video editing. V-RGBX unifies three key capabilities: (1) video inverse rendering into intrinsic channels, (2) photorealistic video synthesis from these intrinsic representations, and (3) keyframe-based video editing conditioned on intrinsic channels. At the core of V-RGBX is an interleaved conditioning mechanism that enables intuitive, physically grounded video editing through user-selected keyframes, supporting flexible manipulation of any intrinsic modality. Extensive qualitative and quantitative results show that V-RGBX produces temporally consistent, photorealistic videos while propagating keyframe edits across sequences in a physically plausible manner. We demonstrate its effectiveness in diverse applications, including object appearance editing and scene-level relighting, surpassing the performance of prior methods.

Long-horizon Reasoning Agent for Olympiad-Level Mathematical Problem Solving
12 Dec 2025

Researchers from Shanghai AI Laboratory and collaborators developed Intern-S1-MO, a multi-agent deep learning system addressing the context length limitations of large reasoning models for complex mathematical problems. This system achieved human-gold-medalist-level performance in competitive mathematics through a hierarchical reasoning framework, lemma-based memory, and an online reinforcement learning approach.

DentalGPT: Incentivizing Multimodal Complex Reasoning in Dentistry
12 Dec 2025

DentalGPT, a specialized multimodal large language model for dentistry, demonstrates an average accuracy of 67.1% across various dental Visual Question Answering and classification benchmarks. The model incorporates the largest curated multimodal dental dataset and utilizes a two-stage training strategy to improve both visual understanding and complex reasoning abilities, outperforming significantly larger general-purpose models.

Decoupled Q-Chunking
12 Dec 2025

Decoupled Q-chunking (DQC) introduces an approach to reinforcement learning that separates the temporal horizon of the value critic from the policy's action chunks, mitigating bootstrapping bias while maintaining policy reactivity. This method establishes new state-of-the-art performance across six challenging long-horizon offline goal-conditioned RL environments.
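
A generic chunk-level TD target helps illustrate the horizon decoupling the summary describes; the estimator below is a standard n-step return over an action chunk, not DQC's exact algorithm.

```python
import torch

def chunked_td_target(rewards: torch.Tensor, next_value: torch.Tensor,
                      gamma: float = 0.99) -> torch.Tensor:
    """Generic chunk-level TD target: sum discounted rewards over an action chunk of
    length H, then bootstrap once at the chunk boundary.
    rewards: (B, H); next_value: (B,). Illustration only, not DQC's estimator."""
    H = rewards.shape[1]
    discounts = gamma ** torch.arange(H, dtype=rewards.dtype, device=rewards.device)
    return (rewards * discounts).sum(dim=1) + (gamma ** H) * next_value

# The idea in the summary: the critic can bootstrap over a long chunk horizon H
# (fewer bootstrap steps, so less bias accumulation), while the policy commits to
# and re-plans shorter action chunks, keeping it reactive.
```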

An Anatomy of Vision-Language-Action Models: From Modules to Milestones and Challenges

A comprehensive survey offers a structured guide to Vision-Language-Action (VLA) models, systematically dissecting their foundational components, tracing their historical evolution, and critically analyzing five core challenges with actionable future directions for the field.

Any4D: Unified Feed-Forward Metric 4D Reconstruction

Any4D, developed by Carnegie Mellon University researchers, introduces a unified feed-forward multi-modal transformer for dense, metric-scale 4D reconstruction of dynamic scenes. This system achieves a 2-3x reduction in error and up to 15x faster inference compared to previous methods, leveraging a novel factored 4D representation and benefiting from diverse sensor inputs.

On the Interplay of Pre-Training, Mid-Training, and RL on Reasoning Language Models
08 Dec 2025

This research disentangles the causal effects of pre-training, mid-training, and reinforcement learning (RL) on language model reasoning using a controlled synthetic task framework. It establishes that RL extends reasoning capabilities only under specific conditions of pre-training exposure and data calibration, with mid-training playing a crucial role in bridging training stages and improving generalization.

FutureX: Enhance End-to-End Autonomous Driving via Latent Chain-of-Thought World Model

Researchers from CUHK-Shenzhen and Xpeng Motors developed FutureX, a framework that enhances end-to-end autonomous driving by integrating latent Chain-of-Thought (CoT) reasoning with an adaptive world model. This approach yielded up to a 6.6-point improvement in Predictive Driver Model Score on NAVSIM and significant gains on CARLA, while maintaining real-time efficiency.

PersonaLive! Expressive Portrait Image Animation for Live Streaming

Current diffusion-based portrait animation models predominantly focus on enhancing visual quality and expression realism, while overlooking generation latency and real-time performance, which restricts their application range in the live streaming scenario. We propose PersonaLive, a novel diffusion-based framework towards streaming real-time portrait animation with multi-stage training recipes. Specifically, we first adopt hybrid implicit signals, namely implicit facial representations and 3D implicit keypoints, to achieve expressive image-level motion control. Then, a fewer-step appearance distillation strategy is proposed to eliminate appearance redundancy in the denoising process, greatly improving inference efficiency. Finally, we introduce an autoregressive micro-chunk streaming generation paradigm equipped with a sliding training strategy and a historical keyframe mechanism to enable low-latency and stable long-term video generation. Extensive experiments demonstrate that PersonaLive achieves state-of-the-art performance with up to 7-22x speedup over prior diffusion-based portrait animation models.

Weird Generalization and Inductive Backdoors: New Ways to Corrupt LLMs
10 Dec 2025

Researchers at Truthful AI and UC Berkeley demonstrated that finetuning large language models on narrow, benign datasets can induce broad, unpredictable generalization patterns and novel "inductive backdoors." The work shows how a model can adopt a 19th-century persona after finetuning on a dataset of bird names, develop a conditional Israel-centric bias, or even exhibit a Hitler-like persona triggered by subtle cues, with the triggers and malicious behaviors never explicitly present in the training data.
