Google DeepMind's research reveals that large generative video models, specifically Veo 3, exhibit emergent zero-shot learning and reasoning capabilities across a broad spectrum of visual tasks, from perception to complex reasoning. This demonstrates that such models can function as general-purpose vision foundation models, performing well on quantitative benchmarks, for example an OIS F1-score of 0.77 on edge detection and a 78% pass@10 rate on 5x5 maze solving.
This research formally proves that single-vector embedding models face fundamental limitations in representing complex, combinatorial definitions of relevance, irrespective of model size or training data. It further demonstrates that these theoretical constraints manifest in realistic scenarios, causing state-of-the-art models to struggle on specially designed stress-test datasets.
Google DeepMind's Dreamer 4 introduces a scalable and efficient world model that enables learning complex control tasks by reinforcement learning entirely within imagination. It achieves the first-ever offline acquisition of diamonds in Minecraft, reaching a 0.7% success rate while performing real-time interactive inference at 21 frames per second on a single GPU.
Mixture-of-Recursions (MoR) introduces a unified framework for language models that combines parameter efficiency, adaptive computation, and efficient KV caching. It achieves strong performance, outperforming vanilla Transformers with nearly 50% fewer parameters and reducing training FLOPs by 25%, while increasing inference throughput by up to 2.06x.
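The core mechanism, a parameter-shared block reused across recursion steps plus a lightweight router that decides how many recursions each token receives, can be sketched minimally as below; the module structure, sigmoid-threshold routing rule, and layer sizes are illustrative assumptions of this sketch, not MoR's actual implementation (which also manages recursion-wise KV caching).

```python
import torch
import torch.nn as nn

class MixtureOfRecursionsBlock(nn.Module):
    """Toy sketch: a single parameter-shared Transformer block reused for up to
    `max_rec` recursions, with a per-token router deciding which tokens keep
    being refined (the parameter savings come from the reuse)."""

    def __init__(self, d_model=256, n_heads=4, max_rec=3):
        super().__init__()
        self.shared_block = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        self.router = nn.Linear(d_model, 1)  # scores each token at every recursion step
        self.max_rec = max_rec

    def forward(self, x, threshold=0.5):
        active = torch.ones(x.shape[:2], dtype=torch.bool, device=x.device)
        for _ in range(self.max_rec):
            gate = torch.sigmoid(self.router(x)).squeeze(-1)    # (batch, seq)
            active = active & (gate > threshold)                # tokens that continue recursing
            refined = self.shared_block(x)                      # same weights at every step
            x = torch.where(active.unsqueeze(-1), refined, x)   # exited tokens stay fixed
        return x

x = torch.randn(2, 16, 256)
print(MixtureOfRecursionsBlock()(x).shape)  # torch.Size([2, 16, 256])
```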
OpenVLA introduces a fully open-source, 7B-parameter Vision-Language-Action model that sets a new state of the art for generalist robot manipulation, outperforming larger closed-source models by 16.5 percentage points in absolute success rate. The model also demonstrates effective and efficient fine-tuning strategies for adapting to new robot setups and tasks on commodity hardware.
Researchers introduce spectral regularization, a method that maintains neural network plasticity and trainability by explicitly controlling the spectral norms of layer weights. This technique consistently improved performance in diverse continual supervised and reinforcement learning tasks while demonstrating robustness across various non-stationarities and reduced hyperparameter sensitivity.
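A minimal sketch of the idea, assuming the regularizer simply pulls each weight matrix's largest singular value (estimated by power iteration) toward a target value; the exact penalty form, target, and coefficient used in the paper may differ.

```python
import torch

def spectral_penalty(module, target=1.0, n_power_iters=5):
    """Sketch: estimate each weight matrix's largest singular value by power iteration
    and penalize its squared deviation from `target`, keeping layer Lipschitz constants
    (and hence trainability) under control."""
    penalty = torch.zeros(())
    for p in module.parameters():
        if p.dim() < 2:
            continue  # skip biases and norm scales
        w = p.reshape(p.shape[0], -1)
        v = torch.randn(w.shape[1], device=w.device)
        for _ in range(n_power_iters):
            u = torch.nn.functional.normalize(w @ v, dim=0)
            v = torch.nn.functional.normalize(w.t() @ u, dim=0)
        sigma = torch.dot(u, w @ v)  # top singular value estimate
        penalty = penalty + (sigma - target) ** 2
    return penalty

# Usage: total_loss = task_loss + lam * spectral_penalty(model)
print(spectral_penalty(torch.nn.Linear(8, 8)))
```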
AlphaEvolve, from Google DeepMind, combines large language models with an evolutionary search framework to autonomously discover novel algorithms and optimize code. This system identified a faster procedure for 4x4 complex matrix multiplication, improved the state of the art on several open mathematical problems, and delivered tangible optimizations for Google's computing ecosystem, including recovering 0.7% of fleet-wide compute resources.
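At a high level, the LLM-plus-evolution combination can be sketched as the loop below; `propose_mutation` and `evaluate` are hypothetical placeholders for the model call and a task-specific fitness function, and the elite-selection scheme is a simplification rather than the actual system's design.

```python
import random

def evolve(seed_program, propose_mutation, evaluate, generations=200, pop_size=16):
    """Toy sketch of an LLM-guided evolutionary loop: keep a scored population of
    candidate programs, ask the LLM to mutate a high-scoring parent, score the child,
    and retain only the best candidates. `propose_mutation(parent)` and
    `evaluate(program)` are hypothetical placeholders for the model call and a
    task-specific fitness function."""
    population = [(evaluate(seed_program), seed_program)]
    for _ in range(generations):
        _, parent = random.choice(population[: max(1, pop_size // 4)])  # favor elites
        child = propose_mutation(parent)
        population.append((evaluate(child), child))
        population.sort(key=lambda scored: scored[0], reverse=True)
        population = population[:pop_size]
    return population[0]  # (best_score, best_program)
```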
Google DeepMind developed SIMA 2, a generalist embodied agent powered by a Gemini Flash-Lite model, capable of understanding and acting in diverse 3D virtual worlds. It roughly doubles the task success rate of its predecessor SIMA 1, generalizes to unseen commercial games and photorealistic environments, and demonstrates autonomous skill acquisition through a Gemini-based self-improvement mechanism.
Google DeepMind developed IMO-Bench, a benchmark suite designed to assess advanced mathematical reasoning in large language models through problem-solving, proof writing, and proof grading tasks. The Gemini Deep Think (IMO Gold) model achieved 80.0% accuracy on robustified problems and 65.7% on challenging proof-writing tasks.
Google DeepMind research establishes new compute-optimal scaling laws for large language models, demonstrating that model size and training data should scale equally for efficient training. The resulting 70B parameter Chinchilla model, trained on 1.4 trillion tokens, achieved lower perplexity and outperformed larger models like Gopher and GPT-3 on various benchmarks, validating the new scaling approach.
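The rule of thumb implied by this result, roughly equal scaling of parameters and tokens (about 20 tokens per parameter) under the common C ≈ 6ND FLOP approximation, can be sanity-checked in a few lines; the constants here are rounded heuristics, not the paper's fitted coefficients.

```python
def compute_optimal(compute_flops, tokens_per_param=20.0):
    """Illustrative Chinchilla-style sizing: with C ~= 6*N*D and D ~= 20*N,
    solve for parameter count N and training tokens D given a FLOP budget C.
    Both constants are rounded heuristics, not the paper's fitted coefficients."""
    n_params = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# A budget of roughly 5.8e23 FLOPs lands near Chinchilla's 70B params / 1.4T tokens:
n, d = compute_optimal(5.8e23)
print(f"params ~ {n / 1e9:.0f}B, tokens ~ {d / 1e12:.1f}T")
```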
DreamerV3 is an algorithm that achieves state-of-the-art performance across a wide range of reinforcement learning domains using a single, fixed set of hyperparameters. This algorithm is the first to reliably collect diamonds in Minecraft from scratch without human data or curricula, demonstrating broad applicability.
This research details the first systematic discovery of new families of unstable singularities in canonical fluid systems, achieving unprecedented numerical accuracy, including near double-precision floating-point machine precision for specific solutions. It also reveals empirical asymptotic formulas relating blow-up rates to instability orders, advancing the understanding of fundamental mathematical challenges in fluid dynamics.
Google DeepMind introduces Gemma 3, a family of open-weight models that combines multimodal capabilities with 128K-token context windows through an interleaved local/global attention architecture, enabling competitive performance with larger closed models while running on consumer-grade hardware.
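A minimal sketch of the interleaving idea, assuming most layers attend through a causal sliding window while every few layers attend globally; the one-in-six global ratio and window size below are illustrative stand-ins rather than the model's actual configuration.

```python
import torch

def attention_mask(seq_len, layer_idx, local_window=1024, global_every=6):
    """Sketch: a causal mask that is full ("global") on every `global_every`-th layer
    and restricted to a recent sliding window ("local") on the others, so most layers
    keep a small KV cache while periodic global layers preserve long-range access."""
    i = torch.arange(seq_len).unsqueeze(1)  # query positions
    j = torch.arange(seq_len).unsqueeze(0)  # key positions
    causal = j <= i
    if (layer_idx + 1) % global_every == 0:
        return causal                        # global layer: all past tokens visible
    return causal & (i - j < local_window)   # local layer: only the recent window

print(attention_mask(seq_len=8, layer_idx=0, local_window=4).int())
```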
Google DeepMind's SigLIP 2 introduces a family of multilingual vision-language encoders, integrating decoder-based pretraining, self-supervised losses, and active data curation to enhance semantic understanding, localization, and dense features. It consistently outperforms previous SigLIP models and other open-source baselines across zero-shot classification, retrieval, dense prediction, and localization tasks, while also reducing representation bias.
Research from Google DeepMind and Stanford University demonstrates that current parametric AI systems lack the ability for latent learning, struggling to flexibly reuse implicitly acquired information for new, uncued tasks. Implementing an episodic memory-like retrieval mechanism consistently and significantly improves generalization across various benchmarks, suggesting it complements parametric learning by enabling on-demand, in-context reuse of past experiences.
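A toy sketch of such a retrieval mechanism, assuming experiences are stored under embedding keys and the most similar ones are re-surfaced into context at decision time; the `embed` function and the byte-histogram stand-in in the usage example are hypothetical placeholders, not the paper's setup.

```python
import numpy as np

class EpisodicMemory:
    """Toy sketch of an episodic, retrieval-based memory: experiences are stored under
    embedding keys and the top-k most similar ones are re-inserted into the model's
    context at decision time. The `embed` function is a placeholder for a real encoder."""

    def __init__(self, embed):
        self.embed = embed
        self.keys, self.texts = [], []

    def store(self, text):
        self.keys.append(self.embed(text))
        self.texts.append(text)

    def retrieve(self, query, k=3):
        q = self.embed(query)
        sims = [float(q @ key) / (np.linalg.norm(q) * np.linalg.norm(key) + 1e-8)
                for key in self.keys]
        best = np.argsort(sims)[::-1][:k]
        return [self.texts[i] for i in best]

# Toy usage; the byte-histogram "embedding" is only a deterministic stand-in for a real encoder.
embed = lambda t: np.bincount(np.frombuffer(t.encode(), np.uint8), minlength=128).astype(float)
memory = EpisodicMemory(embed)
memory.store("the key is under the blue mat")
memory.store("the door code is 4921")
print(memory.retrieve("where is the key?", k=1))
```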
Researchers from Princeton University and Google DeepMind developed Tree of Thoughts (ToT), a framework enabling large language models to perform deliberate problem-solving by explicitly exploring and evaluating multiple reasoning paths. This approach, which allows LLMs to self-generate and self-evaluate intermediate thoughts, achieved a 74% success rate on the Game of 24 and improved coherence in creative writing tasks compared to traditional Chain-of-Thought methods.
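The breadth-first variant of this search can be sketched in a few lines, assuming `propose_thoughts` and `score_thought` are hypothetical placeholders for the LLM's generation and self-evaluation prompts.

```python
def tree_of_thoughts_bfs(problem, propose_thoughts, score_thought, depth=3, breadth=5):
    """Sketch of ToT with breadth-first search: each partial solution is expanded into
    several candidate "thoughts", every candidate is self-evaluated, and only the best
    `breadth` states survive to the next step. `propose_thoughts(problem, state)` and
    `score_thought(problem, state)` are hypothetical placeholders for the LLM's
    generation and evaluation prompts."""
    frontier = [""]  # partial reasoning traces
    for _ in range(depth):
        candidates = [state + thought
                      for state in frontier
                      for thought in propose_thoughts(problem, state)]
        candidates.sort(key=lambda state: score_thought(problem, state), reverse=True)
        frontier = candidates[:breadth]
    return frontier[0]
```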
Researchers at Harvard University, Google DeepMind, and collaborating institutions reverse-engineered successful Implicit Chain-of-Thought (ICoT) Transformers to understand why standard models fail at multi-digit multiplication. They discovered that ICoT models establish long-range dependencies through attention trees for partial-product caching and represent digits in Fourier bases, findings that led to a simple auxiliary-loss intervention enabling a standard Transformer to achieve 99% accuracy on 4-digit-by-4-digit multiplication.
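The Fourier-basis finding can be illustrated with a toy encoding in which each digit maps to sine/cosine features at a few frequencies, so modular (carry-style) structure becomes a simple rotation in feature space; the chosen frequencies are illustrative, not those recovered from the trained models.

```python
import numpy as np

def fourier_digit_features(digit, freqs=(1, 2, 5)):
    """Toy illustration of a Fourier-basis digit encoding: a digit in 0..9 maps to
    cosine/sine features at a few frequencies, so modular structure (e.g., the units
    digit of a partial sum) corresponds to a rotation in feature space. The frequencies
    here are illustrative, not those recovered from the trained models."""
    return np.array([f(2 * np.pi * k * digit / 10)
                     for k in freqs
                     for f in (np.cos, np.sin)])

print(np.round(fourier_digit_features(7), 3))
```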
Generalized Knowledge Distillation (GKD) is a framework for distilling auto-regressive language models by training on the student's self-generated output sequences to mitigate train-inference distribution mismatch. GKD consistently improves the performance of smaller student models, achieving gains (e.g., 2.1x on summarization, 1.9x on reasoning) compared to supervised fine-tuning across various tasks, and demonstrates compatibility with RL fine-tuning for improved model alignment.
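A sketch of the on-policy training step under HuggingFace-style `.generate()`/`.logits` interfaces, which are assumptions here; the divergence is written as a generalized Jensen-Shannon interpolation, one of the choices the framework supports.

```python
import math
import torch
import torch.nn.functional as F

def gkd_step(student, teacher, prompts, beta=0.5, max_new_tokens=64):
    """Sketch of on-policy generalized knowledge distillation: sample continuations from
    the *student*, then minimize a divergence between teacher and student token
    distributions on those self-generated sequences, which targets the train-inference
    mismatch of teacher-forced distillation. The HuggingFace-style `.generate()` and
    `.logits` interfaces, and the generalized JSD form, are assumptions of this sketch."""
    with torch.no_grad():
        sequences = student.generate(prompts, max_new_tokens=max_new_tokens, do_sample=True)
    student_logp = F.log_softmax(student(sequences).logits, dim=-1)
    with torch.no_grad():
        teacher_logp = F.log_softmax(teacher(sequences).logits, dim=-1)
    # Generalized Jensen-Shannon divergence: compare both models to their beta-mixture.
    mix_logp = torch.logsumexp(
        torch.stack([student_logp + math.log(beta), teacher_logp + math.log(1.0 - beta)]),
        dim=0,
    )
    return (beta * F.kl_div(mix_logp, student_logp, log_target=True, reduction="batchmean")
            + (1 - beta) * F.kl_div(mix_logp, teacher_logp, log_target=True, reduction="batchmean"))
```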
A new method enables autoregressive models to generate images using continuous-valued representations, bypassing the need for vector quantization. It integrates diffusion processes to model per-token probability distributions, achieving competitive FID scores as low as 1.55 on ImageNet 256x256 while maintaining fast generation speeds of less than 0.3 seconds per image.
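The per-token diffusion loss can be sketched as a small noise-prediction MLP conditioned on the autoregressive backbone's output for each position; the network shape and the simple linear noising schedule below are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class DiffusionTokenLoss(nn.Module):
    """Sketch of a per-token diffusion loss: instead of a softmax over a quantized
    codebook, a small MLP predicts the noise added to the continuous token vector x,
    conditioned on the autoregressive backbone's output z for that position. The
    network shape and the simple linear noising schedule are illustrative assumptions."""

    def __init__(self, token_dim=16, cond_dim=256, hidden=512):
        super().__init__()
        self.denoiser = nn.Sequential(
            nn.Linear(token_dim + cond_dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, token_dim),
        )

    def forward(self, x, z):
        t = torch.rand(x.shape[0], 1, device=x.device)      # random noise level per token
        noise = torch.randn_like(x)
        x_t = (1 - t) * x + t * noise                       # linearly interpolate clean -> noise
        pred = self.denoiser(torch.cat([x_t, z, t], dim=-1))
        return ((pred - noise) ** 2).mean()                 # noise-prediction objective

x = torch.randn(8, 16)    # continuous token targets (e.g., tokenizer/VAE latents)
z = torch.randn(8, 256)   # per-position conditioning from the autoregressive model
print(DiffusionTokenLoss()(x, z))
```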
A comparative study by researchers from the University of Hong Kong, UC Berkeley, NYU, and Google DeepMind empirically demonstrates that Reinforcement Learning (RL) promotes generalization to novel rules and visual inputs, while Supervised Fine-Tuning (SFT) tends to induce memorization, particularly in complex reasoning tasks for foundation models like Llama-3.2-Vision-11B. RL improved out-of-distribution performance by up to 61.1% on visual tasks and also enhanced underlying visual recognition capabilities.