alphaXiv


Evaluating Gemini Robotics Policies in a Veo World Simulator

11 Dec 2025

Google DeepMind

Google DeepMind's Gemini Robotics Team developed a system that uses the Veo video foundation model to evaluate generalist robot policies. The approach accurately predicts policy performance and generalization capabilities, identifies safety vulnerabilities across various scenarios, and correlates strongly with real-world outcomes.

#computer-science #artificial-intelligence #computer-vision-and-pattern-recognition

Towards a Science of Scaling Agent Systems

09 Dec 2025

The paper empirically investigates the performance of multi-agent LLM systems across diverse agentic tasks and architectures, revealing that benefits are highly contingent on task structure rather than universal. It establishes a quantitative scaling principle, achieving 87% accuracy in predicting optimal agent architectures for unseen tasks based on model capability, task properties, and measured coordination dynamics.

#agentic-frameworks #agents #computer-science

On the Interplay of Pre-Training, Mid-Training, and RL on Reasoning Language Models

08 Dec 2025

This research disentangles the causal effects of pre-training, mid-training, and reinforcement learning (RL) on language model reasoning using a controlled synthetic task framework. It establishes that RL extends reasoning capabilities only under specific conditions of pre-training exposure and data calibration, with mid-training playing a crucial role in bridging training stages and improving generalization.

#causal-inference #computer-science #computation-and-language

DeepCode: Open Agentic Coding

08 Dec 2025

The University of Hong Kong

DeepCode presents a multi-stage agentic framework for autonomously generating executable code repositories from scientific papers, achieving a 73.5% replication score on the PaperBench Code-Dev benchmark and exceeding PhD-level human expert performance.

#agentic-frameworks #agents #computer-science

What matters for Representation Alignment: Global Information or Spatial Structure?

11 Dec 2025

New York University, Adobe

A collaborative study from Adobe Research, Australian National University, and New York University investigates what drives representation alignment in diffusion models. It finds that the spatial structure of pretrained vision encoder features matters more for high-quality image generation than global semantic understanding, challenging a prevalent assumption. The work introduces iREPA, a refined alignment method requiring minimal code changes that consistently improves the convergence speed and generation quality of Diffusion Transformers across various models and tasks.

#computer-science #artificial-intelligence #computer-vision-and-pattern-recognition

Closing the Train-Test Gap in World Models for Gradient-Based Planning

10 Dec 2025

The University of Texas at Austin

Researchers from Columbia University and NYU introduced Online World Modeling (OWM) and Adversarial World Modeling (AWM) to mitigate the train-test gap in world models for gradient-based planning (GBP). These methods enabled GBP to achieve performance comparable to or better than search-based planning algorithms like CEM, while simultaneously reducing computation time by an order of magnitude across various robotic tasks.

#computer-science #machine-learning #robotics

Bidirectional Normalizing Flow: From Data to Noise and Back

11 Dec 2025

Tsinghua University, MIT

Bidirectional Normalizing Flow (BiFlow) introduces a method to learn an approximate inverse for Normalizing Flows, allowing for highly efficient and high-fidelity image generation. This approach achieves a state-of-the-art FID of 2.39 on ImageNet 256x256 while accelerating sampling speed by up to 697x compared to prior NF models.

#computer-science #computer-vision-and-pattern-recognition #machine-learning

Stronger Normalization-Free Transformers

11 Dec 2025

Researchers from Princeton, NYU, and Carnegie Mellon introduced Dynamic erf (Derf), a point-wise function that replaces traditional normalization layers in Transformers. Derf consistently achieved higher accuracy or lower error rates across various modalities, including an average of 0.6 percentage points higher top-1 accuracy on ImageNet-1K for ViT-Base/Large models and improved FID scores for Diffusion Transformers.

#computer-science #artificial-intelligence #computation-and-language

Weird Generalization and Inductive Backdoors: New Ways to Corrupt LLMs

10 Dec 2025

Researchers at Truthful AI and UC Berkeley demonstrated that finetuning large language models on narrow, benign datasets can induce broad, unpredictable generalization patterns and novel "inductive backdoors." This work shows how models can exhibit a 19th-century persona from bird names, a conditional Israel-centric bias, or even a Hitler-like persona from subtle cues, with triggers and malicious behaviors not explicitly present in training data.

#adversarial-attacks #computer-science #artificial-intelligence

A Practical Guide for Designing, Developing, and Deploying Production-Grade Agentic AI Workflows

09 Dec 2025

This work presents a comprehensive engineering guide for designing and deploying production-grade agentic AI workflows, offering nine best practices demonstrated through a multimodal news-to-media generation case study. The approach improves system determinism, reliability, and responsible AI integration, reducing issues like hallucination and enabling scalable, maintainable deployments.

#agentic-frameworks #agents #computer-science

VL-JEPA: Joint Embedding Predictive Architecture for Vision-language

11 Dec 2025

Researchers at Meta FAIR developed VL-JEPA, a Joint Embedding Predictive Architecture for vision-language tasks, which predicts abstract semantic embeddings rather than explicit tokens. This approach leads to improved computational efficiency and enables real-time applications through non-autoregressive prediction and selective decoding, while achieving competitive performance across classification, retrieval, and Visual Question Answering benchmarks.

#computer-science #computer-vision-and-pattern-recognition #embedding-methods

E-RayZer: Self-supervised 3D Reconstruction as Spatial Visual Pre-training

11 Dec 2025

Harvard University, Carnegie Mellon University

E-RayZer introduces a self-supervised framework for 3D reconstruction, employing explicit 3D Gaussian Splatting to learn camera parameters and scene geometry directly from unlabeled multi-view images. The system achieves competitive performance against supervised methods and outperforms prior self-supervised approaches, establishing itself as an effective spatial visual pre-training model.

#computer-science #computer-vision-and-pattern-recognition #geometric-deep-learning

Any4D: Unified Feed-Forward Metric 4D Reconstruction

11 Dec 2025

Carnegie Mellon University

Any4D, developed by Carnegie Mellon University researchers, introduces a unified feed-forward multi-modal transformer for dense, metric-scale 4D reconstruction of dynamic scenes. This system achieves a 2-3x reduction in error and up to 15x faster inference compared to previous methods, leveraging a novel factored 4D representation and benefiting from diverse sensor inputs.

#computer-science #artificial-intelligence #computer-vision-and-pattern-recognition

Wan-Move: Motion-controllable Video Generation via Latent Trajectory Guidance

09 Dec 2025

Alibaba Group, Tsinghua University

Wan-Move presents a framework for motion-controllable video generation that utilizes latent trajectory guidance to directly edit image condition features within a pre-trained image-to-video model. This method yields superior visual quality and precise motion adherence compared to state-of-the-art academic approaches and rivals commercial solutions, while also establishing MoveBench, a new comprehensive evaluation benchmark.

#computer-science #computer-vision-and-pattern-recognition #generative-models

Remember Me, Refine Me: A Dynamic Procedural Memory Framework for Experience-Driven Agent Evolution

11 Dec 2025

ReMe introduces a dynamic procedural memory framework for LLM agents, enabling continuous learning and adaptation through a closed-loop system of experience acquisition, reuse, and refinement. This framework allows smaller language models to achieve performance comparable to or surpassing larger, memory-less models, demonstrating a memory-scaling effect and improving agent robustness.

#agentic-frameworks #agents #computer-science

Comparing AI Agents to Cybersecurity Professionals in Real-World Penetration Testing

10 Dec 2025

Researchers from Stanford and Carnegie Mellon Universities introduce ARTEMIS, a multi-agent scaffold designed for real-world penetration testing, demonstrating that advanced AI agents can achieve performance comparable to or exceeding most human cybersecurity professionals in a live enterprise environment. The ARTEMIS A1 configuration achieved a total score of 95.2, outperforming 9 out of 10 human participants, and operated at a significantly lower cost of $18.21 per hour.

#agentic-frameworks #agents #ai-for-cybersecurity

Native Parallel Reasoner: Reasoning in Parallelism via Self-Distilled Reinforcement Learning

08 Dec 2025

The Native Parallel Reasoner (NPR) framework allows Large Language Models to autonomously acquire and deploy genuine parallel reasoning capabilities, without relying on external teacher models. Experiments show NPR improves accuracy by up to 24.5% over baselines and delivers up to 4.6 times faster inference, maintaining 100% parallel execution across various benchmarks.

#agents #computer-science #computation-and-language

StereoWorld: Geometry-Aware Monocular-to-Stereo Video Generation

11 Dec 2025

University of Toronto, Beijing Jiaotong University

StereoWorld introduces an end-to-end diffusion-based framework to convert monocular videos into high-fidelity, geometry-aware stereo videos. It leverages a pretrained video generative model, a large-scale IPD-aligned dataset, and explicit disparity and depth supervision to produce superior visual quality, temporal stability, and geometric accuracy for immersive XR experiences.

#computer-science #computer-vision-and-pattern-recognition #generative-models

SpaceDrive: Infusing Spatial Awareness into VLM-based Autonomous Driving

11 Dec 2025

UCLA, University of Stuttgart

SpaceDrive, a framework developed by Mercedes-Benz AG and the University of Tübingen, enhances Vision-Language Models (VLMs) for autonomous driving by integrating explicit 3D spatial awareness through universal positional encodings and a regression-based coordinate decoder. This approach significantly improves open-loop planning accuracy on nuScenes, achieving a 0.32m L2 error and 0.23% collision rate, and enables robust closed-loop performance on Bench2Drive with a 55.11% success rate, addressing prior VLM limitations in precise 3D interaction.

#autonomous-vehicles #computer-science #computer-vision-and-pattern-recognition

Long-horizon Reasoning Agent for Olympiad-Level Mathematical Problem Solving

11 Dec 2025

Intern-S1-MO, a multi-agent AI system from Shanghai AI Laboratory, extends Large Reasoning Models' capabilities for long-horizon mathematical reasoning using lemma-based memory management and hierarchical reinforcement learning. The system achieved scores equivalent to human silver medalists on IMO2025 non-geometry problems and exceeded the gold medal threshold in the CMO2025 competition, showcasing advanced problem-solving beyond typical context limitations.

#active-learning #agents #chain-of-thought
