Center of Excellence for Generative AI, KAUST
Comparing Spectral Bias and Robustness For Two-Layer Neural Networks: SGD vs Adaptive Random Fourier Features
01 Feb 2024

Researchers from KAUST and RWTH Aachen University investigated the impact of training algorithms on neural network properties, demonstrating that Adaptive Random Fourier Features (ARFF) enables two-layer networks to achieve spectral unbiasedness and enhanced robustness against adversarial noise compared to Stochastic Gradient Descent (SGD). ARFF-trained networks showed spectral bias approaching zero and improved resilience, especially when using noisy validation data during training.
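The ARFF training loop alternates a convex solve for the outer-layer amplitudes with a resampling step for the random frequencies. Below is a minimal NumPy sketch of that idea, assuming the Metropolis-style frequency resampling used in the ARFF literature; the toy 1-D target, step size, and exponent are illustrative choices, not the paper's settings.

```python
# Sketch of adaptive random Fourier features: alternate a least-squares solve
# for the amplitudes with a Metropolis-style resampling of the frequencies.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 400)[:, None]                 # training inputs
y = np.sin(4 * x[:, 0]) + 0.5 * np.cos(x[:, 0])      # toy target function

K, delta, gamma, lam = 64, 0.5, 2.0, 1e-6            # illustrative hyperparameters
omega = rng.normal(size=(K, x.shape[1]))             # random frequencies

def amplitudes(freqs):
    """Least-squares solve for complex amplitudes given fixed frequencies."""
    Phi = np.exp(1j * x @ freqs.T)                   # N x K Fourier features
    A = Phi.conj().T @ Phi + lam * np.eye(K)
    return np.linalg.solve(A, Phi.conj().T @ y), Phi

beta, Phi = amplitudes(omega)
for _ in range(200):
    prop = omega + delta * rng.normal(size=omega.shape)          # random-walk proposal
    beta_prop, _ = amplitudes(prop)
    accept = rng.random(K) < np.minimum(                         # keep frequencies whose
        1.0, (np.abs(beta_prop) / (np.abs(beta) + 1e-12)) ** gamma)  # amplitudes grew
    omega[accept] = prop[accept]
    beta, Phi = amplitudes(omega)

pred = (Phi @ beta).real
print("train MSE:", np.mean((pred - y) ** 2))
```

Frequencies whose fitted amplitudes are large are preferentially kept, so the sampled spectrum adapts to the target function rather than staying fixed a priori.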

Harnessing Temporal Causality for Advanced Temporal Action Detection
26 Jul 2024

CausalTAD presents a temporal action detection model that explicitly leverages temporal causality through a Hybrid Causal Block, which integrates causal Mamba and causal attention mechanisms. This approach achieved state-of-the-art performance, including first-place rankings in the Ego4D Moment Query and EPIC-Kitchens Challenges 2024.
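A minimal PyTorch sketch of the hybrid-causal-block idea: one branch applies masked (causal) self-attention while a second branch applies a causal temporal mixer. A causal depthwise convolution stands in for the causal Mamba branch here, since the real SSM kernel is not reproduced; all names and sizes are illustrative assumptions, not the CausalTAD implementation.

```python
import torch
import torch.nn as nn

class HybridCausalBlock(nn.Module):
    def __init__(self, dim: int, heads: int = 4, kernel: int = 7):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        # causal depthwise conv: left-pad so each step sees only the past
        self.pad = kernel - 1
        self.conv = nn.Conv1d(dim, dim, kernel, groups=dim)

    def forward(self, x):                                   # x: (batch, time, dim)
        t = x.size(1)
        mask = torch.triu(torch.ones(t, t, dtype=torch.bool), diagonal=1)
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask)    # causal attention branch
        x = x + attn_out
        h = self.norm2(x).transpose(1, 2)                   # (batch, dim, time)
        conv_out = self.conv(nn.functional.pad(h, (self.pad, 0)))
        return x + conv_out.transpose(1, 2)                 # causal temporal branch

feats = torch.randn(2, 32, 64)                  # clip features along time
print(HybridCausalBlock(64)(feats).shape)       # torch.Size([2, 32, 64])
```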

TUNA: Taming Unified Visual Representations for Native Unified Multimodal Models

Tuna, a native unified multimodal model developed by Meta BizAI, introduces a novel cascaded VAE and representation encoder to construct a single, continuous visual representation for both understanding and generation. This architecture achieves state-of-the-art performance across diverse image and video understanding, generation, and editing benchmarks, often outperforming larger, specialized models while utilizing smaller LLM decoders.

SANA-Video: Efficient Video Generation with Block Linear Diffusion Transformer

NVIDIA researchers and collaborators developed SANA-Video, an efficient video generation model that uses a Block Linear Diffusion Transformer to enable high-resolution, long-duration video synthesis with significantly reduced computational costs. The model reduces training cost to 1% of MovieGen and achieves 16x faster inference than existing small diffusion models, generating a 5-second 720p video in 36 seconds on an H100 GPU.

OpenHands: An Open Platform for AI Software Developers as Generalist Agents
18 Apr 2025

OpenHands is an open-source platform facilitating the development, evaluation, and deployment of generalist AI agents that interact with digital environments by writing code, using command lines, and browsing the web. Its CodeAct agent achieved competitive performance across 15 diverse benchmarks, including software engineering, web browsing, and general assistance tasks, without task-specific modifications.

Agent-as-a-Judge: Evaluate Agents with Agents
16 Oct 2024

The Agent-as-a-Judge framework and the DevAI benchmark provide a scalable and reliable method for evaluating AI agents, demonstrating a superior alignment with human judgment compared to LLM-as-a-Judge. This approach achieves substantial cost and time reductions in evaluating complex, real-world AI application development tasks.

OASIS: Open Agent Social Interaction Simulations with One Million Agents

OASIS presents an open agent social interaction simulator capable of scaling to one million LLM-based agents, designed to mimic real-world social media platforms. The platform successfully replicates and investigates complex social phenomena like information propagation, group polarization, and herd effects, providing a testbed for understanding emergent behaviors at unprecedented scales.
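A toy sketch of the simulation loop such a platform implies: each agent sees a small recommended feed drawn from accounts it follows and decides whether to repost, so a post spreads through the follower graph. The agent count, random graph, and fixed repost probability (an LLM decision in OASIS) are illustrative assumptions used only to show the information-propagation mechanics.

```python
import random

random.seed(0)
N_AGENTS, STEPS, REPOST_PROB = 1000, 20, 0.15
# each agent follows 10 random accounts
following = {i: random.sample(range(N_AGENTS), 10) for i in range(N_AGENTS)}
reached = {0}                                    # agent 0 authors the seed post

for step in range(STEPS):
    for agent in range(N_AGENTS):
        if agent in reached:
            continue
        # "recommended feed": a few followed accounts that already hold the post
        feed = [u for u in following[agent] if u in reached][:3]
        if feed and random.random() < REPOST_PROB:   # an LLM decision in OASIS itself
            reached.add(agent)
    print(f"step {step:2d}: {len(reached)} agents reached")
```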

OWL: Optimized Workforce Learning for General Multi-Agent Assistance in Real-World Task Automation
11 Jun 2025

This research introduces WORKFORCE, a modular multi-agent inference architecture that decouples planning from execution, and OPTIMIZED WORKFORCE LEARNING (OWL), a training paradigm focused on a domain-agnostic planner. The system achieved 69.70% accuracy on the GAIA benchmark, setting a new open-source state-of-the-art and outperforming commercial baselines like OpenAI's Deep Research.
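A minimal sketch of the planning/execution decoupling described above: a domain-agnostic planner turns a task into typed subtasks, and separate worker agents execute them. The `Subtask` type, worker registry, and toy callables are illustrative assumptions, not the OWL/WORKFORCE interfaces.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Subtask:
    kind: str            # which worker should handle it, e.g. "search" or "code"
    instruction: str
    result: str = ""

@dataclass
class Workforce:
    plan: Callable[[str], List[Subtask]]          # domain-agnostic planner
    workers: Dict[str, Callable[[str], str]]      # execution is delegated here

    def run(self, task: str) -> List[Subtask]:
        subtasks = self.plan(task)                # planning step, no execution
        for st in subtasks:                       # execution step, no planning
            st.result = self.workers[st.kind](st.instruction)
        return subtasks

# toy planner and workers standing in for LLM-backed components
planner = lambda task: [Subtask("search", f"find sources for: {task}"),
                        Subtask("code", f"write analysis script for: {task}")]
workers = {"search": lambda q: f"[search results for '{q}']",
           "code":   lambda q: f"[python script for '{q}']"}

for st in Workforce(planner, workers).run("summarize GAIA benchmark results"):
    print(st.kind, "->", st.result)
```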

3DShape2VecSet: A 3D Shape Representation for Neural Fields and Generative Diffusion Models
01 May 2023

Researchers from KAUST and TU Munich introduce 3DShape2VecSet, a 3D shape representation that uses a fixed-size set of latent vectors for neural fields, enabling high-fidelity 3D shape generation with diffusion models. This approach achieves superior reconstruction quality, with a mean IoU of 0.965, and generates shapes with lower Surface-FPD (0.76) compared to previous methods, also demonstrating the first text-conditioned 3D shape generation using diffusion models.
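A minimal PyTorch sketch of the latent-vector-set readout: a shape is a fixed-size set of latent vectors, and occupancy at any 3D query point is obtained by cross-attending from the embedded point to that set. Dimensions, the linear point embedding, and the single attention layer are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class VecSetDecoder(nn.Module):
    def __init__(self, num_latents: int = 512, dim: int = 256):
        super().__init__()
        self.point_embed = nn.Linear(3, dim)              # embed (x, y, z) queries
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.occ_head = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 1))

    def forward(self, latents, queries):
        # latents: (batch, num_latents, dim) -- the fixed-size shape representation
        # queries: (batch, num_points, 3)    -- arbitrary 3D locations
        q = self.point_embed(queries)
        h, _ = self.cross_attn(q, latents, latents)       # points attend to the set
        return self.occ_head(h).squeeze(-1)               # occupancy logits per point

decoder = VecSetDecoder()
latents = torch.randn(1, 512, 256)          # e.g. a set sampled by a diffusion model
points = torch.rand(1, 4096, 3) * 2 - 1     # query points in [-1, 1]^3
print(decoder(latents, points).shape)       # torch.Size([1, 4096])
```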

From Reflection to Perfection: Scaling Inference-Time Optimization for Text-to-Image Diffusion Models via Reflection Tuning

ReflectionFlow introduces an inference-time framework for text-to-image diffusion models, employing iterative self-reflection and textual feedback to refine generated images. This approach improves image generation quality, particularly for complex prompts, and provides interpretable visual reasoning through reflection chains.
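A minimal sketch of the inference-time reflect-and-refine loop described above: generate, collect textual feedback on the result, and condition the next generation round on the accumulated feedback while keeping the best-scoring image. The `generate`, `reflect`, and `score` callables are placeholders for the diffusion model, the feedback model, and a verifier; all names are illustrative assumptions.

```python
from typing import Callable, List, Tuple

def reflection_loop(prompt: str,
                    generate: Callable[[str, List[str]], object],
                    reflect: Callable[[str, object], str],
                    score: Callable[[str, object], float],
                    rounds: int = 4) -> Tuple[object, List[str]]:
    feedback: List[str] = []                     # the interpretable "reflection chain"
    best, best_score = None, float("-inf")
    for _ in range(rounds):
        image = generate(prompt, feedback)       # condition on accumulated feedback
        s = score(prompt, image)
        if s > best_score:
            best, best_score = image, s
        feedback.append(reflect(prompt, image))  # textual critique for the next round
    return best, feedback

# toy stand-ins so the loop runs end to end
img, chain = reflection_loop(
    "a red cube on a glass table",
    generate=lambda p, fb: f"<image|{p}|{len(fb)} fixes>",
    reflect=lambda p, im: "object count and material still off",
    score=lambda p, im: len(str(im)))
print(img, chain, sep="\n")
```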

Beyond Outlining: Heterogeneous Recursive Planning for Adaptive Long-form Writing with Language Models
24 Sep 2025

The WriteHERE framework introduces a heterogeneous recursive planning approach that enables large language models to generate adaptive, high-quality long-form text by dynamically interleaving retrieval, reasoning, and composition tasks. This method consistently outperforms existing baselines in narrative and technical report generation, showing improved performance in content quality and length scalability, including reports exceeding 10,000 words.

Visual Agents as Fast and Slow Thinkers
10 Feb 2025
Achieving human-level intelligence requires refining cognitive distinctions between System 1 and System 2 thinking. While contemporary AI, driven by large language models, demonstrates human-like traits, it falls short of genuine cognition. Transitioning from structured benchmarks to real-world scenarios presents challenges for visual agents, often leading to inaccurate and overly confident responses. To address this challenge, we introduce FaST, which incorporates the Fast and Slow Thinking mechanism into visual agents. FaST employs a switch adapter to dynamically select between System 1/2 modes, tailoring the problem-solving approach to the complexity of the task at hand. It tackles uncertain and unseen objects by adjusting model confidence and integrating new contextual data. With this novel design, we advocate a flexible system, hierarchical reasoning capabilities, and a transparent decision-making pipeline, all of which contribute to its ability to emulate human-like cognitive processes in visual intelligence. Empirical results demonstrate FaST's superior performance over various well-known baselines, achieving 80.8% accuracy on VQAv2 for visual question answering and a 48.7% GIoU score on ReasonSeg for reasoning segmentation. Extensive testing validates the efficacy and robustness of FaST's core components, showcasing its potential to advance the development of cognitive visual agents in AI systems. The code is available at https://github.com/GuangyanS/Sys2-LLaVA.
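A minimal sketch of the fast/slow switching the abstract describes: try the direct System 1 answer first and route to a slower, multi-step System 2 pipeline only when confidence falls below a threshold. The threshold and the two solver callables are illustrative assumptions, not the FaST switch adapter.

```python
from typing import Callable, Tuple

def fast_slow_answer(question: str,
                     image: object,
                     system1: Callable[[str, object], Tuple[str, float]],
                     system2: Callable[[str, object], str],
                     conf_threshold: float = 0.8) -> Tuple[str, str]:
    answer, confidence = system1(question, image)   # cheap single-pass attempt
    if confidence >= conf_threshold:
        return answer, "system1"
    # low confidence: fall back to deliberate reasoning (region search,
    # segmentation, gathering extra visual evidence) before answering
    return system2(question, image), "system2"

answer, mode = fast_slow_answer(
    "what is behind the partially occluded chair?",
    image=None,
    system1=lambda q, im: ("a table", 0.42),         # uncertain direct guess
    system2=lambda q, im: "a bookshelf (after region search)")
print(mode, answer)
```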
ZoeDepth: Zero-shot Transfer by Combining Relative and Metric Depth
23 Feb 2023

ZoeDepth, developed by researchers from King Abdullah University of Science and Technology and Intel, is a framework for single-image depth estimation that combines relative and metric depth estimation to achieve both strong generalization across diverse environments and accurate absolute depth measurements. It demonstrates a 21% improvement in relative absolute error on the NYU Depth v2 benchmark and unprecedented zero-shot generalization, with some metrics improving by up to 976% on unseen datasets.

Triangle Splatting for Real-Time Radiance Field Rendering
25 May 2025

This work introduces Triangle Splatting, a differentiable rendering approach that optimizes unstructured 3D triangles to reconstruct photorealistic scenes from images. The method achieves state-of-the-art visual fidelity, notably improving perceptual quality over prior splatting techniques, and renders at thousands of frames per second, outperforming implicit methods by orders of magnitude.

On Path to Multimodal Generalist: General-Level and General-Bench
07 May 2025

A comprehensive framework called General-Level introduces a 5-level taxonomy for evaluating multimodal large language models (MLLMs), accompanied by General-Bench, a large-scale benchmark spanning diverse modalities and tasks. The evaluation reveals that even leading models such as GPT-4V achieve only Level-3 capabilities and demonstrate limited cross-modal synergy.

Drop-Muon: Update Less, Converge Faster
Conventional wisdom in deep learning optimization dictates updating all layers at every step, a principle followed by all recent state-of-the-art optimizers such as Muon. In this work, we challenge this assumption, showing that full-network updates can be fundamentally suboptimal, both in theory and in practice. We introduce a non-Euclidean Randomized Progressive Training method, Drop-Muon: a simple yet powerful framework that updates only a subset of layers per step according to a randomized schedule, combining the efficiency of progressive training with layer-specific non-Euclidean updates for top-tier performance. We provide rigorous convergence guarantees under both layer-wise smoothness and layer-wise (L^0, L^1)-smoothness, covering deterministic and stochastic gradient settings, marking the first such results for progressive training in the stochastic and non-smooth regime. Our cost analysis further reveals that full-network updates are not optimal unless a very specific relationship between layer smoothness constants holds. Through controlled CNN experiments, we empirically demonstrate that Drop-Muon consistently outperforms full-network Muon, achieving the same accuracy up to 1.4× faster in wall-clock time. Together, our results suggest a shift in how large-scale models can be efficiently trained, challenging the status quo and offering a highly efficient, theoretically grounded alternative to full-network updates.
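A minimal PyTorch sketch of the randomized layer-subset update: at each step a random subset of layers is selected and only those layers' parameters are updated. Plain SGD stands in for the layer-wise non-Euclidean Muon update, and a real implementation would skip computing the dropped gradients to realize the wall-clock savings; the model, keep probability, and data are illustrative assumptions.

```python
import random
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(),
                      nn.Linear(64, 64), nn.ReLU(),
                      nn.Linear(64, 10))
layers = [m for m in model if isinstance(m, nn.Linear)]
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

for step in range(100):
    x, y = torch.randn(128, 32), torch.randint(0, 10, (128,))
    # randomized schedule: keep each layer with probability 0.5 this step
    active = [l for l in layers if random.random() < 0.5] or [random.choice(layers)]
    opt.zero_grad()
    loss_fn(model(x), y).backward()
    for l in layers:                       # drop the update for inactive layers
        if l not in active:
            for p in l.parameters():
                p.grad = None
    opt.step()                             # SGD skips parameters without gradients
```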
HAMSt3R: Human-Aware Multi-view Stereo 3D Reconstruction
Recovering the 3D geometry of a scene from a sparse set of uncalibrated images is a long-standing problem in computer vision. While recent learning-based approaches such as DUSt3R and MASt3R have demonstrated impressive results by directly predicting dense scene geometry, they are primarily trained on outdoor scenes with static environments and struggle to handle human-centric scenarios. In this work, we introduce HAMSt3R, an extension of MASt3R for joint human and scene 3D reconstruction from sparse, uncalibrated multi-view images. First, we exploit DUNE, a strong image encoder obtained by distilling, among others, the encoders from MASt3R and from a state-of-the-art Human Mesh Recovery (HMR) model, multi-HMR, for a better understanding of scene geometry and human bodies. Our method then incorporates additional network heads to segment people, estimate dense correspondences via DensePose, and predict depth in human-centric environments, enabling a more comprehensive 3D reconstruction. By leveraging the outputs of our different heads, HAMSt3R produces a dense point map enriched with human semantic information in 3D. Unlike existing methods that rely on complex optimization pipelines, our approach is fully feed-forward and efficient, making it suitable for real-world applications. We evaluate our model on EgoHumans and EgoExo4D, two challenging benchmarks containing diverse human-centric scenarios. Additionally, we validate its generalization to traditional multi-view stereo and multi-view pose regression tasks. Our results demonstrate that our method can reconstruct humans effectively while preserving strong performance in general 3D reconstruction tasks, bridging the gap between human and scene understanding in 3D vision.
ShapeGen4D: Towards High Quality 4D Shape Generation from Videos

SHAPEGEN4D directly synthesizes high-quality dynamic 3D meshes and their textures from monocular video, leveraging adapted 3D generative models to achieve superior geometric accuracy (e.g., Chamfer distance of 0.1220) and robust temporal consistency over prior methods.

PhysGym: Benchmarking LLMs in Interactive Physics Discovery with Controlled Priors
26 Oct 2025

Researchers at KAUST developed PHYSGYM, an interactive benchmark to evaluate large language models' physics discovery capabilities under controlled prior knowledge. Experiments revealed that while prior information generally improves performance, abundant context can sometimes impede discovery, highlighting a conflict between pattern-matching and mechanistic inference.

MoEUT: Mixture-of-Experts Universal Transformers
13 Oct 2024

MoEUT enables Universal Transformers to scale efficiently to large language models by integrating Mixture-of-Experts with novel architectural components like layer grouping and peri-layernorm. The architecture slightly outperforms parameter-matched standard Transformers on language modeling tasks while demonstrating improved compute and memory efficiency.
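A minimal PyTorch sketch of two of the ingredients named above: a small group of layers reused across depth (Universal-Transformer-style sharing with layer grouping) and a token-wise top-k mixture-of-experts feed-forward inside each layer; the peri-layernorm placement is not reproduced. Dimensions, expert count, and top-k are illustrative assumptions, not the MoEUT configuration.

```python
import torch
import torch.nn as nn

class MoEFeedForward(nn.Module):
    def __init__(self, dim=256, n_experts=8, k=2, hidden=512):
        super().__init__()
        self.k = k
        self.router = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(n_experts))

    def forward(self, x):                                  # x: (batch, seq, dim)
        scores = self.router(x).softmax(dim=-1)
        topv, topi = scores.topk(self.k, dim=-1)           # route each token to k experts
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = topi[..., slot] == e
                if mask.any():
                    out[mask] += topv[..., slot][mask].unsqueeze(-1) * expert(x[mask])
        return out

class SharedGroup(nn.Module):
    def __init__(self, dim=256, group_size=2, repeats=6):
        super().__init__()
        self.repeats = repeats
        self.attn = nn.ModuleList(nn.MultiheadAttention(dim, 4, batch_first=True)
                                  for _ in range(group_size))
        self.ffn = nn.ModuleList(MoEFeedForward(dim) for _ in range(group_size))
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        for _ in range(self.repeats):                      # same layer group reused across depth
            for attn, ffn in zip(self.attn, self.ffn):
                h = self.norm(x)
                x = x + attn(h, h, h)[0]
                x = x + ffn(self.norm(x))
        return x

tokens = torch.randn(2, 16, 256)
print(SharedGroup()(tokens).shape)                         # torch.Size([2, 16, 256])
```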
