A new framework, PhysWorld, enables robots to learn and execute complex manipulation tasks in a zero-shot manner by generating physically feasible actions from task-conditioned videos. It rebuilds the physical world from generated visual data, leading to an 82% average success rate in real-world tasks and significantly reducing grasping failures from 18% to 3%.
NaVILA introduces a two-level Vision-Language-Action model that enables legged robots to navigate complex real-world environments by interpreting natural language instructions, leveraging training from human touring videos. The framework demonstrates high success rates on physical quadrupedal and humanoid robots and achieves substantial performance gains on established Vision-Language Navigation benchmarks.
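As a rough illustration of the two-level design, the sketch below assumes a high-level vision-language model that emits a mid-level command in natural language and a separate low-level locomotion policy that turns it into joint targets; the function names and command format are illustrative assumptions, not NaVILA's actual interfaces.

import re

def parse_command(command: str):
    """Turn a language command like 'move forward 0.5 meters' or
    'turn left 30 degrees' into a (forward_velocity, yaw) target."""
    fwd = re.search(r"forward ([\d.]+)", command)
    turn = re.search(r"(left|right) ([\d.]+)", command)
    vx = float(fwd.group(1)) if fwd else 0.0
    yaw = float(turn.group(2)) if turn else 0.0
    return vx, (-yaw if turn and turn.group(1) == "right" else yaw)

def control_loop(vlm, locomotion_policy, instruction, get_image, get_proprio, n_steps=50):
    # Low-frequency reasoning: the VLM emits a mid-level command in language.
    # High-frequency control: the locomotion policy outputs joint targets.
    vx, yaw = parse_command(vlm.generate(instruction=instruction, image=get_image()))
    for _ in range(n_steps):
        yield locomotion_policy(get_proprio(), vx, yaw)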
DataComp-LM introduces a standardized, large-scale benchmark for evaluating language model training data curation strategies, complete with an openly released corpus, framework, and models. Its DCLM-BASELINE 7B model, trained on carefully filtered Common Crawl data, achieves 64% MMLU 5-shot accuracy, outperforming previous open-data state-of-the-art models while requiring substantially less compute.
Researchers from Meta Superintelligence Labs, MIT, USC, and UCLA introduce Sandwiched Policy Gradient (SPG), a reinforcement learning algorithm that effectively fine-tunes Masked Diffusion Language Models (dLLMs) by addressing their intractable log-likelihood. SPG achieved state-of-the-art results on mathematical and logical reasoning benchmarks, improving accuracy by up to 27.0% on Sudoku and 18.4% on Countdown tasks.
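A hedged sketch of the "sandwiching" idea, under the assumption that SPG pairs a tractable lower bound on the dLLM's log-likelihood with an upper bound, applying the lower bound to positively rewarded completions and the upper bound to negatively rewarded ones; elbo_lower_bound and eubo_upper_bound are placeholder callables, not the paper's API.

import torch

def sandwiched_policy_gradient_loss(model, prompts, completions, rewards,
                                    elbo_lower_bound, eubo_upper_bound):
    """Sketch of a 'sandwiched' surrogate for RL fine-tuning of masked
    diffusion LLMs, whose exact log-likelihood is intractable.

    Assumption: use a lower bound on log p(completion | prompt) when the
    advantage is positive (never over-credit a sample) and an upper bound
    when it is negative (never over-penalize one)."""
    advantages = rewards - rewards.mean()          # simple mean baseline
    losses = []
    for prompt, completion, adv in zip(prompts, completions, advantages):
        if adv >= 0:
            logp_bound = elbo_lower_bound(model, prompt, completion)
        else:
            logp_bound = eubo_upper_bound(model, prompt, completion)
        losses.append(-adv * logp_bound)           # REINFORCE-style surrogate
    return torch.stack(losses).mean()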
Amazon FAR researchers developed ResMimic, a two-stage residual learning framework that transforms general motion tracking into precise humanoid whole-body loco-manipulation. The system achieved a 92.5% average task success rate in simulation, significantly outperforming baselines, and enabled a Unitree G1 robot to carry heavy and irregularly shaped objects using whole-body contact in real-world scenarios.
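A minimal sketch of the residual idea, assuming a frozen general motion-tracking policy whose actions are corrected by a small learned residual policy for object interaction; class and argument names are placeholders, not ResMimic's actual API.

import torch
import torch.nn as nn

class ResidualPolicy(nn.Module):
    """Stage 2: learn a small correction on top of a frozen stage-1 tracker."""
    def __init__(self, base_policy: nn.Module, obs_dim: int, act_dim: int, scale: float = 0.1):
        super().__init__()
        self.base_policy = base_policy.eval()       # frozen general motion tracker
        for p in self.base_policy.parameters():
            p.requires_grad_(False)
        self.residual = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.ELU(), nn.Linear(256, act_dim)
        )
        self.scale = scale                          # keep corrections small

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            base_action = self.base_policy(obs)
        return base_action + self.scale * self.residual(obs)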
Researchers from University of Wisconsin–Madison, USC, and UC Davis developed AutoDAN, an automated approach that generates semantically meaningful jailbreak prompts to bypass Large Language Model safety mechanisms. AutoDAN achieves higher attack success rates, demonstrating over 10% improvement on Llama2 compared to token-level attacks, while producing prompts with low perplexity, successfully bypassing perplexity-based defenses.
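To illustrate why low perplexity matters, here is a minimal sketch of the kind of perplexity-based input filter such attacks aim to evade, using GPT-2 as the reference model; the model choice and threshold value are illustrative assumptions, not the paper's defense setup.

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def perplexity(text: str) -> float:
    """Perplexity of `text` under GPT-2, used as a crude input filter."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss          # mean token-level NLL
    return torch.exp(loss).item()

def looks_adversarial(prompt: str, threshold: float = 500.0) -> bool:
    # Token-level attacks (gibberish suffixes) tend to score very high here;
    # fluent AutoDAN-style prompts fall below such thresholds and pass.
    return perplexity(prompt) > threshold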
LiveBench introduces a dynamic, contamination-limited benchmark for large language models, utilizing frequently updated, objectively scorable questions from recent sources across 18 diverse tasks. It demonstrates that even leading models achieve overall scores below 70%, revealing distinct strengths and weaknesses and offering a more reliable evaluation than benchmarks susceptible to data leakage or subjective judging.
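A toy sketch of the "objectively scorable" idea: each question ships with a ground-truth answer and a deterministic scoring function, so no LLM judge is needed. The schema and questions below are illustrative assumptions, not LiveBench's actual data format.

import re

def score_exact_match(response: str, ground_truth: str) -> float:
    """Deterministic 0/1 scoring against a known answer, no LLM judge."""
    answer = re.sub(r"\s+", " ", response).strip().lower()
    return 1.0 if ground_truth.strip().lower() in answer else 0.0

questions = [
    {"prompt": "What is 17 * 24?", "answer": "408"},
    {"prompt": "Sort the letters of 'benchmark' alphabetically.", "answer": "abcehkmnr"},
]

def evaluate(model_fn, questions) -> float:
    """Average score of a model callable over ground-truth-scored questions."""
    scores = [score_exact_match(model_fn(q["prompt"]), q["answer"]) for q in questions]
    return sum(scores) / len(scores)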
CameraBench, a new benchmark and dataset developed in collaboration with cinematographers, is introduced to evaluate and improve computational models' understanding of camera motions in videos. Fine-tuning large vision-language models on this high-quality dataset significantly boosts their performance in classifying camera movements and answering related questions, often matching or exceeding geometric methods.
RoboVerse introduces a unified robotics platform combining high-fidelity simulation environments, a large-scale synthetic dataset, and standardized benchmarks for imitation and reinforcement learning. Its METASIM infrastructure enables cross-simulator integration, and its diverse data generation approaches improve sim-to-real transfer.
Researchers at Peking University, NVIDIA, USC, and the University of Michigan developed Dynapo, a training-free framework that leverages large vision-language models and advanced segmentation models to robustly identify dynamic objects in casual videos. This approach significantly enhances the accuracy of camera pose estimation, depth reconstruction, and 4D trajectory optimization in dynamic scenes.
The Large Spatial Model (LSM) introduces an end-to-end, feed-forward framework that directly reconstructs semantic 3D scenes from unposed image pairs, bypassing traditional multi-stage pipelines. It achieves real-time performance with 0.108 seconds per scene reconstruction time and 100+ FPS for novel view synthesis, depth prediction, and open-vocabulary semantic segmentation.
Researchers at Carnegie Mellon University and collaborators developed LOGRA, a low-rank gradient projection algorithm, and the LOGIX software for scalable data valuation on Large Language Models. This system achieved a 6,500x throughput increase and 5x memory reduction, enabling practical influence function computations on billion-parameter LLMs with billion-token datasets.
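A minimal sketch of low-rank gradient projection for influence-style data valuation, assuming a fixed random projection matrix and an identity-Hessian approximation; the real LOGRA/LOGIX implementation differs, and all names here are placeholders.

import torch

def project_gradient(grad: torch.Tensor, proj: torch.Tensor) -> torch.Tensor:
    """Project a flattened per-example gradient (d,) into a rank-k subspace (k,)."""
    return proj @ grad                              # proj has shape (k, d), k << d

def influence_scores(train_grads, query_grad, proj):
    """Approximate each training example's influence on a query example as a
    dot product between low-rank projected gradients (no damping or
    preconditioning in this sketch)."""
    q = project_gradient(query_grad, proj)
    return torch.stack([project_gradient(g, proj) @ q for g in train_grads])

# Usage with toy shapes: d = full parameter count, k = projection rank.
d, k, n = 10_000, 64, 5
proj = torch.randn(k, d) / k ** 0.5
train_grads = [torch.randn(d) for _ in range(n)]
query_grad = torch.randn(d)
print(influence_scores(train_grads, query_grad, proj))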
Researchers at Apple Inc. and USC developed WBM, a foundation model that learns rich representations from high-level behavioral data derived from wearable devices, significantly improving health predictions across 57 diverse tasks. The model, particularly when combined with physiological sensor data, achieved superior performance by leveraging complementary health insights for more accurate and comprehensive health monitoring.
Researchers developed INSTRUCTIONALFINGERPRINT (IF), a method to embed hidden identifiers in Large Language Models using lightweight instruction tuning, ensuring their persistence even after extensive fine-tuning and enabling robust intellectual property protection.
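A hedged sketch of how fingerprint verification might look once a secret trigger-response pair has been instruction-tuned into a model; the trigger string, expected response, and generate_fn interface are hypothetical placeholders, not the paper's actual fingerprints.

# Hypothetical fingerprint pair planted via lightweight instruction tuning.
FINGERPRINT_TRIGGER = "<<fingerprint-key-7f3a>>"    # placeholder secret prompt
FINGERPRINT_RESPONSE = "MODEL-OWNER-SIGNATURE"      # placeholder expected output

def verify_fingerprint(generate_fn) -> bool:
    """Claim ownership by checking that a suspect model still reproduces the
    planted response when shown the secret trigger, even after fine-tuning."""
    output = generate_fn(FINGERPRINT_TRIGGER)
    return FINGERPRINT_RESPONSE in output

# Example: wrap any chat/completion API as generate_fn.
# verify_fingerprint(lambda prompt: my_model.generate(prompt))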