Center of Excellence for Generative AI, KAUST
Comparing Spectral Bias and Robustness For Two-Layer Neural Networks: SGD vs Adaptive Random Fourier Features
01 Feb 2024

Researchers from KAUST and RWTH Aachen University investigated the impact of training algorithms on neural network properties, demonstrating that Adaptive Random Fourier Features (ARFF) enables two-layer networks to achieve spectral unbiasedness and enhanced robustness against adversarial noise compared to Stochastic Gradient Descent (SGD). ARFF-trained networks showed spectral bias approaching zero and improved resilience, especially when using noisy validation data during training.
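The ARFF training loop alternates a convex solve for the outer-layer amplitudes with a resampling step for the random frequencies. Below is a minimal NumPy sketch of that idea, assuming the Metropolis-style frequency resampling used in the ARFF literature; the toy 1-D target, step size, and exponent are illustrative choices, not the paper's settings.

```python
# Sketch of adaptive random Fourier features: alternate a least-squares solve
# for the amplitudes with a Metropolis-style resampling of the frequencies.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 400)[:, None]                 # training inputs
y = np.sin(4 * x[:, 0]) + 0.5 * np.cos(x[:, 0])      # toy target function

K, delta, gamma, lam = 64, 0.5, 2.0, 1e-6            # illustrative hyperparameters
omega = rng.normal(size=(K, x.shape[1]))             # random frequencies

def amplitudes(freqs):
    """Least-squares solve for complex amplitudes given fixed frequencies."""
    Phi = np.exp(1j * x @ freqs.T)                   # N x K Fourier features
    A = Phi.conj().T @ Phi + lam * np.eye(K)
    return np.linalg.solve(A, Phi.conj().T @ y), Phi

beta, Phi = amplitudes(omega)
for _ in range(200):
    prop = omega + delta * rng.normal(size=omega.shape)          # random-walk proposal
    beta_prop, _ = amplitudes(prop)
    accept = rng.random(K) < np.minimum(                         # keep frequencies whose
        1.0, (np.abs(beta_prop) / (np.abs(beta) + 1e-12)) ** gamma)  # amplitudes grew
    omega[accept] = prop[accept]
    beta, Phi = amplitudes(omega)

pred = (Phi @ beta).real
print("train MSE:", np.mean((pred - y) ** 2))
```

Frequencies whose fitted amplitudes are large are preferentially kept, so the sampled spectrum adapts to the target function rather than staying fixed a priori.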

Harnessing Temporal Causality for Advanced Temporal Action Detection
26 Jul 2024

CausalTAD presents a temporal action detection model that explicitly leverages temporal causality through a Hybrid Causal Block, which integrates causal Mamba and causal attention mechanisms. This approach achieved state-of-the-art performance, including first-place rankings in the Ego4D Moment Query and EPIC-Kitchens Challenges 2024.
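A minimal PyTorch sketch of the hybrid-causal-block idea: one branch applies masked (causal) self-attention while a second branch applies a causal temporal mixer. A causal depthwise convolution stands in for the causal Mamba branch here, since the real SSM kernel is not reproduced; all names and sizes are illustrative assumptions, not the CausalTAD implementation.

```python
import torch
import torch.nn as nn

class HybridCausalBlock(nn.Module):
    def __init__(self, dim: int, heads: int = 4, kernel: int = 7):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        # causal depthwise conv: left-pad so each step sees only the past
        self.pad = kernel - 1
        self.conv = nn.Conv1d(dim, dim, kernel, groups=dim)

    def forward(self, x):                                   # x: (batch, time, dim)
        t = x.size(1)
        mask = torch.triu(torch.ones(t, t, dtype=torch.bool), diagonal=1)
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask)    # causal attention branch
        x = x + attn_out
        h = self.norm2(x).transpose(1, 2)                   # (batch, dim, time)
        conv_out = self.conv(nn.functional.pad(h, (self.pad, 0)))
        return x + conv_out.transpose(1, 2)                 # causal temporal branch

feats = torch.randn(2, 32, 64)                  # clip features along time
print(HybridCausalBlock(64)(feats).shape)       # torch.Size([2, 32, 64])
```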

TUNA: Taming Unified Visual Representations for Native Unified Multimodal Models

Tuna, a native unified multimodal model developed by Meta BizAI, introduces a novel cascaded VAE and representation encoder to construct a single, continuous visual representation for both understanding and generation. This architecture achieves state-of-the-art performance across diverse image and video understanding, generation, and editing benchmarks, often outperforming larger, specialized models while utilizing smaller LLM decoders.

SANA-Video: Efficient Video Generation with Block Linear Diffusion Transformer

NVIDIA researchers and collaborators developed SANA-Video, an efficient video generation model that uses a Block Linear Diffusion Transformer to enable high-resolution, long-duration video synthesis with significantly reduced computational costs. The model reduces training cost to 1% of MovieGen and achieves 16x faster inference than existing small diffusion models, generating a 5-second 720p video in 36 seconds on an H100 GPU.

OpenHands: An Open Platform for AI Software Developers as Generalist Agents
18 Apr 2025

OpenHands is an open-source platform facilitating the development, evaluation, and deployment of generalist AI agents that interact with digital environments by writing code, using command lines, and browsing the web. Its CodeAct agent achieved competitive performance across 15 diverse benchmarks, including software engineering, web browsing, and general assistance tasks, without task-specific modifications.

Agent-as-a-Judge: Evaluate Agents with Agents
16 Oct 2024

The Agent-as-a-Judge framework and the DevAI benchmark provide a scalable and reliable method for evaluating AI agents, demonstrating a superior alignment with human judgment compared to LLM-as-a-Judge. This approach achieves substantial cost and time reductions in evaluating complex, real-world AI application development tasks.

OASIS: Open Agent Social Interaction Simulations with One Million Agents

OASIS presents an open agent social interaction simulator capable of scaling to one million LLM-based agents, designed to mimic real-world social media platforms. The platform successfully replicates and investigates complex social phenomena like information propagation, group polarization, and herd effects, providing a testbed for understanding emergent behaviors at unprecedented scales.
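A toy sketch of the simulation loop such a platform implies: each agent sees a small recommended feed drawn from accounts it follows and decides whether to repost, so a post spreads through the follower graph. The agent count, random graph, and fixed repost probability (an LLM decision in OASIS) are illustrative assumptions used only to show the information-propagation mechanics.

```python
import random

random.seed(0)
N_AGENTS, STEPS, REPOST_PROB = 1000, 20, 0.15
# each agent follows 10 random accounts
following = {i: random.sample(range(N_AGENTS), 10) for i in range(N_AGENTS)}
reached = {0}                                    # agent 0 authors the seed post

for step in range(STEPS):
    for agent in range(N_AGENTS):
        if agent in reached:
            continue
        # "recommended feed": a few followed accounts that already hold the post
        feed = [u for u in following[agent] if u in reached][:3]
        if feed and random.random() < REPOST_PROB:   # an LLM decision in OASIS itself
            reached.add(agent)
    print(f"step {step:2d}: {len(reached)} agents reached")
```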

OWL: Optimized Workforce Learning for General Multi-Agent Assistance in Real-World Task Automation
11 Jun 2025

This research introduces WORKFORCE, a modular multi-agent inference architecture that decouples planning from execution, and OPTIMIZED WORKFORCE LEARNING (OWL), a training paradigm focused on a domain-agnostic planner. The system achieved 69.70% accuracy on the GAIA benchmark, setting a new open-source state-of-the-art and outperforming commercial baselines like OpenAI's Deep Research.
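A minimal sketch of the planning/execution decoupling described above: a domain-agnostic planner turns a task into typed subtasks, and separate worker agents execute them. The `Subtask` type, worker registry, and toy callables are illustrative assumptions, not the OWL/WORKFORCE interfaces.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Subtask:
    kind: str            # which worker should handle it, e.g. "search" or "code"
    instruction: str
    result: str = ""

@dataclass
class Workforce:
    plan: Callable[[str], List[Subtask]]          # domain-agnostic planner
    workers: Dict[str, Callable[[str], str]]      # execution is delegated here

    def run(self, task: str) -> List[Subtask]:
        subtasks = self.plan(task)                # planning step, no execution
        for st in subtasks:                       # execution step, no planning
            st.result = self.workers[st.kind](st.instruction)
        return subtasks

# toy planner and workers standing in for LLM-backed components
planner = lambda task: [Subtask("search", f"find sources for: {task}"),
                        Subtask("code", f"write analysis script for: {task}")]
workers = {"search": lambda q: f"[search results for '{q}']",
           "code":   lambda q: f"[python script for '{q}']"}

for st in Workforce(planner, workers).run("summarize GAIA benchmark results"):
    print(st.kind, "->", st.result)
```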

3DShape2VecSet: A 3D Shape Representation for Neural Fields and Generative Diffusion Models
01 May 2023

Researchers from KAUST and TU Munich introduce 3DShape2VecSet, a 3D shape representation that uses a fixed-size set of latent vectors for neural fields, enabling high-fidelity 3D shape generation with diffusion models. This approach achieves superior reconstruction quality, with a mean IoU of 0.965, and generates shapes with lower Surface-FPD (0.76) compared to previous methods, also demonstrating the first text-conditioned 3D shape generation using diffusion models.
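A minimal PyTorch sketch of the latent-vector-set readout: a shape is a fixed-size set of latent vectors, and occupancy at any 3D query point is obtained by cross-attending from the embedded point to that set. Dimensions, the linear point embedding, and the single attention layer are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class VecSetDecoder(nn.Module):
    def __init__(self, num_latents: int = 512, dim: int = 256):
        super().__init__()
        self.point_embed = nn.Linear(3, dim)              # embed (x, y, z) queries
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.occ_head = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 1))

    def forward(self, latents, queries):
        # latents: (batch, num_latents, dim) -- the fixed-size shape representation
        # queries: (batch, num_points, 3)    -- arbitrary 3D locations
        q = self.point_embed(queries)
        h, _ = self.cross_attn(q, latents, latents)       # points attend to the set
        return self.occ_head(h).squeeze(-1)               # occupancy logits per point

decoder = VecSetDecoder()
latents = torch.randn(1, 512, 256)          # e.g. a set sampled by a diffusion model
points = torch.rand(1, 4096, 3) * 2 - 1     # query points in [-1, 1]^3
print(decoder(latents, points).shape)       # torch.Size([1, 4096])
```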

From Reflection to Perfection: Scaling Inference-Time Optimization for Text-to-Image Diffusion Models via Reflection Tuning

ReflectionFlow introduces an inference-time framework for text-to-image diffusion models, employing iterative self-reflection and textual feedback to refine generated images. This approach improves image generation quality, particularly for complex prompts, and provides interpretable visual reasoning through reflection chains.
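A minimal sketch of the inference-time reflect-and-refine loop described above: generate, collect textual feedback on the result, and condition the next generation round on the accumulated feedback while keeping the best-scoring image. The `generate`, `reflect`, and `score` callables are placeholders for the diffusion model, the feedback model, and a verifier; all names are illustrative assumptions.

```python
from typing import Callable, List, Tuple

def reflection_loop(prompt: str,
                    generate: Callable[[str, List[str]], object],
                    reflect: Callable[[str, object], str],
                    score: Callable[[str, object], float],
                    rounds: int = 4) -> Tuple[object, List[str]]:
    feedback: List[str] = []                     # the interpretable "reflection chain"
    best, best_score = None, float("-inf")
    for _ in range(rounds):
        image = generate(prompt, feedback)       # condition on accumulated feedback
        s = score(prompt, image)
        if s > best_score:
            best, best_score = image, s
        feedback.append(reflect(prompt, image))  # textual critique for the next round
    return best, feedback

# toy stand-ins so the loop runs end to end
img, chain = reflection_loop(
    "a red cube on a glass table",
    generate=lambda p, fb: f"<image|{p}|{len(fb)} fixes>",
    reflect=lambda p, im: "object count and material still off",
    score=lambda p, im: len(str(im)))
print(img, chain, sep="\n")
```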

Beyond Outlining: Heterogeneous Recursive Planning for Adaptive Long-form Writing with Language Models
24 Sep 2025

The WriteHERE framework introduces a heterogeneous recursive planning approach that enables large language models to generate adaptive, high-quality long-form text by dynamically interleaving retrieval, reasoning, and composition tasks. This method consistently outperforms existing baselines in narrative and technical report generation, showing improved performance in content quality and length scalability, including reports exceeding 10,000 words.

Visual Agents as Fast and Slow Thinkers
10 Feb 2025
Achieving human-level intelligence requires refining cognitive distinctions between System 1 and System 2 thinking. While contemporary AI, driven by large language models, demonstrates human-like traits, it falls short of genuine cognition. Transitioning from structured benchmarks to real-world scenarios presents challenges for visual agents, often leading to inaccurate and overly confident responses. To address this challenge, we introduce FaST, which incorporates the Fast and Slow Thinking mechanism into visual agents. FaST employs a switch adapter to dynamically select between System 1/2 modes, tailoring the problem-solving approach to the complexity of the task at hand. It tackles uncertain and unseen objects by adjusting model confidence and integrating new contextual data. With this novel design, we advocate a flexible system, hierarchical reasoning capabilities, and a transparent decision-making pipeline, all of which contribute to its ability to emulate human-like cognitive processes in visual intelligence. Empirical results demonstrate FaST's superior performance over various well-known baselines, achieving 80.8% accuracy on VQAv2 for visual question answering and a 48.7% GIoU score on ReasonSeg for reasoning segmentation. Extensive testing validates the efficacy and robustness of FaST's core components, showcasing its potential to advance the development of cognitive visual agents in AI systems. The code is available at https://github.com/GuangyanS/Sys2-LLaVA.
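A minimal sketch of the fast/slow switching the abstract describes: try the direct System 1 answer first and route to a slower, multi-step System 2 pipeline only when confidence falls below a threshold. The threshold and the two solver callables are illustrative assumptions, not the FaST switch adapter.

```python
from typing import Callable, Tuple

def fast_slow_answer(question: str,
                     image: object,
                     system1: Callable[[str, object], Tuple[str, float]],
                     system2: Callable[[str, object], str],
                     conf_threshold: float = 0.8) -> Tuple[str, str]:
    answer, confidence = system1(question, image)   # cheap single-pass attempt
    if confidence >= conf_threshold:
        return answer, "system1"
    # low confidence: fall back to deliberate reasoning (region search,
    # segmentation, gathering extra visual evidence) before answering
    return system2(question, image), "system2"

answer, mode = fast_slow_answer(
    "what is behind the partially occluded chair?",
    image=None,
    system1=lambda q, im: ("a table", 0.42),         # uncertain direct guess
    system2=lambda q, im: "a bookshelf (after region search)")
print(mode, answer)
```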
ZoeDepth: Zero-shot Transfer by Combining Relative and Metric Depth
23 Feb 2023

ZoeDepth, developed by researchers from King Abdullah University of Science and Technology and Intel, is a framework for single-image depth estimation that combines relative and metric depth estimation to achieve both strong generalization across diverse environments and accurate absolute depth measurements. It demonstrates a 21% improvement in relative absolute error on the NYU Depth v2 benchmark and unprecedented zero-shot generalization, with some metrics improving by up to 976% on unseen datasets.

Triangle Splatting for Real-Time Radiance Field Rendering
25 May 2025

This work introduces Triangle Splatting, a differentiable rendering approach that optimizes unstructured 3D triangles to reconstruct photorealistic scenes from images. The method achieves state-of-the-art visual fidelity, notably improving perceptual quality over prior splatting techniques, and renders at thousands of frames per second, outperforming implicit methods by orders of magnitude.

On Path to Multimodal Generalist: General-Level and General-Bench
07 May 2025

A comprehensive framework called General-Level introduces a 5-level taxonomy for evaluating multimodal large language models (MLLMs), accompanied by General-Bench, a large-scale benchmark spanning diverse modalities and tasks. The evaluation reveals that even leading models such as GPT-4V achieve only Level-3 capabilities and demonstrate limited cross-modal synergy.

Drop-Muon: Update Less, Converge Faster
Conventional wisdom in deep learning optimization dictates updating all layers at every step, a principle followed by all recent state-of-the-art optimizers such as Muon. In this work, we challenge this assumption, showing that full-network updates can be fundamentally suboptimal, both in theory and in practice. We introduce a non-Euclidean Randomized Progressive Training method, Drop-Muon: a simple yet powerful framework that updates only a subset of layers per step according to a randomized schedule, combining the efficiency of progressive training with layer-specific non-Euclidean updates for top-tier performance. We provide rigorous convergence guarantees under both layer-wise smoothness and layer-wise (L^0, L^1)-smoothness, covering deterministic and stochastic gradient settings, marking the first such results for progressive training in the stochastic and non-smooth regime. Our cost analysis further reveals that full-network updates are not optimal unless a very specific relationship between layer smoothness constants holds. Through controlled CNN experiments, we empirically demonstrate that Drop-Muon consistently outperforms full-network Muon, achieving the same accuracy up to 1.4× faster in wall-clock time. Together, our results suggest a shift in how large-scale models can be efficiently trained, challenging the status quo and offering a highly efficient, theoretically grounded alternative to full-network updates.
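A minimal PyTorch sketch of the randomized layer-subset update: at each step a random subset of layers is selected and only those layers' parameters are updated. Plain SGD stands in for the layer-wise non-Euclidean Muon update, and a real implementation would skip computing the dropped gradients to realize the wall-clock savings; the model, keep probability, and data are illustrative assumptions.

```python
import random
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(),
                      nn.Linear(64, 64), nn.ReLU(),
                      nn.Linear(64, 10))
layers = [m for m in model if isinstance(m, nn.Linear)]
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

for step in range(100):
    x, y = torch.randn(128, 32), torch.randint(0, 10, (128,))
    # randomized schedule: keep each layer with probability 0.5 this step
    active = [l for l in layers if random.random() < 0.5] or [random.choice(layers)]
    opt.zero_grad()
    loss_fn(model(x), y).backward()
    for l in layers:                       # drop the update for inactive layers
        if l not in active:
            for p in l.parameters():
                p.grad = None
    opt.step()                             # SGD skips parameters without gradients
```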
HAMSt3R: Human-Aware Multi-view Stereo 3D Reconstruction
Recovering the 3D geometry of a scene from a sparse set of uncalibrated images is a long-standing problem in computer vision. While recent learning-based approaches such as DUSt3R and MASt3R have demonstrated impressive results by directly predicting dense scene geometry, they are primarily trained on outdoor scenes with static environments and struggle to handle human-centric scenarios. In this work, we introduce HAMSt3R, an extension of MASt3R for joint human and scene 3D reconstruction from sparse, uncalibrated multi-view images. First, we exploit DUNE, a strong image encoder obtained by distilling, among others, the encoders from MASt3R and from a state-of-the-art Human Mesh Recovery (HMR) model, multi-HMR, for a better understanding of scene geometry and human bodies. Our method then incorporates additional network heads to segment people, estimate dense correspondences via DensePose, and predict depth in human-centric environments, enabling a more comprehensive 3D reconstruction. By leveraging the outputs of our different heads, HAMSt3R produces a dense point map enriched with human semantic information in 3D. Unlike existing methods that rely on complex optimization pipelines, our approach is fully feed-forward and efficient, making it suitable for real-world applications. We evaluate our model on EgoHumans and EgoExo4D, two challenging benchmarks containing diverse human-centric scenarios. Additionally, we validate its generalization to traditional multi-view stereo and multi-view pose regression tasks. Our results demonstrate that our method can reconstruct humans effectively while preserving strong performance in general 3D reconstruction tasks, bridging the gap between human and scene understanding in 3D vision.
ShapeGen4D: Towards High Quality 4D Shape Generation from Videos

SHAPEGEN4D directly synthesizes high-quality dynamic 3D meshes and their textures from monocular video, leveraging adapted 3D generative models to achieve superior geometric accuracy (e.g., Chamfer distance of 0.1220) and robust temporal consistency over prior methods.

PhysGym: Benchmarking LLMs in Interactive Physics Discovery with Controlled Priors
26 Oct 2025

Researchers at KAUST developed PHYSGYM, an interactive benchmark to evaluate large language models' physics discovery capabilities under controlled prior knowledge. Experiments revealed that while prior information generally improves performance, abundant context can sometimes impede discovery, highlighting a conflict between pattern-matching and mechanistic inference.

MoEUT: Mixture-of-Experts Universal Transformers
13 Oct 2024

MoEUT enables Universal Transformers to scale efficiently to large language models by integrating Mixture-of-Experts with novel architectural components like layer grouping and peri-layernorm. The architecture slightly outperforms parameter-matched standard Transformers on language modeling tasks while demonstrating improved compute and memory efficiency.
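A minimal PyTorch sketch of two of the ingredients named above: a small group of layers reused across depth (Universal-Transformer-style sharing with layer grouping) and a token-wise top-k mixture-of-experts feed-forward inside each layer; the peri-layernorm placement is not reproduced. Dimensions, expert count, and top-k are illustrative assumptions, not the MoEUT configuration.

```python
import torch
import torch.nn as nn

class MoEFeedForward(nn.Module):
    def __init__(self, dim=256, n_experts=8, k=2, hidden=512):
        super().__init__()
        self.k = k
        self.router = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(n_experts))

    def forward(self, x):                                  # x: (batch, seq, dim)
        scores = self.router(x).softmax(dim=-1)
        topv, topi = scores.topk(self.k, dim=-1)           # route each token to k experts
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = topi[..., slot] == e
                if mask.any():
                    out[mask] += topv[..., slot][mask].unsqueeze(-1) * expert(x[mask])
        return out

class SharedGroup(nn.Module):
    def __init__(self, dim=256, group_size=2, repeats=6):
        super().__init__()
        self.repeats = repeats
        self.attn = nn.ModuleList(nn.MultiheadAttention(dim, 4, batch_first=True)
                                  for _ in range(group_size))
        self.ffn = nn.ModuleList(MoEFeedForward(dim) for _ in range(group_size))
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        for _ in range(self.repeats):                      # same layer group reused across depth
            for attn, ffn in zip(self.attn, self.ffn):
                h = self.norm(x)
                x = x + attn(h, h, h)[0]
                x = x + ffn(self.norm(x))
        return x

tokens = torch.randn(2, 16, 256)
print(SharedGroup()(tokens).shape)                         # torch.Size([2, 16, 256])
```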
