alphaXiv

LIX IP Paris

29 Sep 2025

computer-science artificial-intelligence computation-and-language

MixtureVitae: Open Web-Scale Pretraining Dataset With High Quality Instruction and Reasoning Data Built from Permissive-First Text Sources

University of Freiburg

Northeastern University

Carnegie Mellon University Institute of Science Tokyo University of Montreal IP Paris NASK Salesforce Montreal Institute for Learning Algorithms LAION Juelich Supercomputing Center Open-Ψ(Open-Sci) Collective Research Center Juelich DeepTensor AB Ontocord Detomo Inc ELLIS Institute Tuebingen RSS Lab École Polytechnique

Victor May

MixtureVitae introduces an open, web-scale pretraining dataset that minimizes legal and ethical risks by using permissive-first text sources, augmented with high-quality instruction and reasoning data. Models trained on this corpus achieve performance competitive with those trained on non-permissive data, and demonstrate an order-of-magnitude improvement in math and coding abilities over other permissive datasets.

20 Oct 2025

computer-science artificial-intelligence machine-learning

Latent Discrete Diffusion Models

Inria Sakana AI IP Paris PSL Research University École Polytechnique

Latent Discrete Diffusion Models (LDDMs) address the factorization bottleneck in masked discrete diffusion by coupling a discrete token diffusion with a co-evolving continuous latent channel. This approach improves unconditional generation quality, particularly at low sampling budgets, on tasks like text generation by enabling more coherent and consistent outputs.

29 Nov 2025

computer-science computer-vision-and-pattern-recognition generative-models

What about gravity in video generation? Post-Training Newton's Laws with Verifiable Rewards

Stony Brook University LIX

Researchers from Stony Brook University and LIX, École Polytechnique, developed NewtonRewards, a post-training framework that integrates verifiable, rule-based rewards to instill Newtonian physics into video generation models. This method significantly improved physical plausibility, motion smoothness, and temporal coherence, reducing velocity RMSE by 5.87% and acceleration RMSE by 8.46% on a custom NewtonBench-60K dataset.

09 Jun 2025

computer-science machine-learning data-curation

Flowing Datasets with Wasserstein over Wasserstein Gradient Flows

Université Paris-Saclay IP Paris ENSAE CREST Laboratoire de Mathématique d’Orsay

Many applications in machine learning involve data represented as probability distributions. The emergence of such data requires radically novel techniques to design tractable gradient flows on probability distributions over this type of (infinite-dimensional) objects. For instance, being able to flow labeled datasets is a core task for applications ranging from domain adaptation to transfer learning or dataset distillation. In this setting, we propose to represent each class by the associated conditional distribution of features, and to model the dataset as a mixture distribution supported on these classes (which are themselves probability distributions), meaning that labeled datasets can be seen as probability distributions over probability distributions. We endow this space with a metric structure from optimal transport, namely the Wasserstein over Wasserstein (WoW) distance, derive a differential structure on this space, and define WoW gradient flows. The latter make it possible to design dynamics over this space that decrease a given objective functional. We apply our framework to transfer learning and dataset distillation tasks, leveraging our gradient flow construction as well as novel tractable functionals that take the form of Maximum Mean Discrepancies with Sliced-Wasserstein based kernels between probability distributions.

07 Feb 2025

active-learning autonomous-vehicles computer-science

NextBestPath: Efficient 3D Mapping of Unseen Environments

CNRS École Normale Supérieure

Inria IP Paris PSL Research University Univ. Gustave Eiffel École nationale des ponts et chaussées

The NextBestPath (NBP) method introduces a deep learning approach for active 3D mapping that predicts long-term exploration goals, optimizing for surface coverage gain while simultaneously generating obstacle maps for path planning. This approach achieves a 6.23 absolute gain in completion ratio over the state-of-the-art ANM model on the MP3D dataset and outperforms baselines on the new AiMDoom dataset.

07 Nov 2025

computer-science computer-vision-and-pattern-recognition data-curation

Dense Motion Captioning

CNRS University of Trento IP Paris Univ. Gustave Eiffel Ecole des Ponts

Researchers from the University of Trento and LIGM introduce Dense Motion Captioning (DMC), a new task for 3D human motion understanding that requires localizing and describing multiple atomic actions within complex, untrimmed motion sequences. They present CompMo, a large-scale dataset with precise temporal and textual annotations, and DEMO, a novel LLM-based model that significantly outperforms baselines in dense captioning quality and temporal localization.

06 Oct 2025

computer-science computer-vision-and-pattern-recognition graphics

Pulp Motion: Framing-aware multimodal camera and human motion generation

CNRS IRISA

Inria LIX École Polytechnique Univ-Rennes

Treating human motion and camera trajectory generation separately overlooks a core principle of cinematography: the tight interplay between actor performance and camera work in the screen space. In this paper, we are the first to cast this task as text-conditioned joint generation, aiming to maintain consistent on-screen framing while producing two heterogeneous, yet intrinsically linked, modalities: human motion and camera trajectories. We propose a simple, model-agnostic framework that enforces multimodal coherence via an auxiliary modality: the on-screen framing induced by projecting human joints onto the camera. This on-screen framing provides a natural and effective bridge between modalities, promoting consistency and leading to a more precise joint distribution. We first design a joint autoencoder that learns a shared latent space, together with a lightweight linear transform from the human and camera latents to a framing latent. We then introduce auxiliary sampling, which exploits this linear transform to steer generation toward a coherent framing modality. To support this task, we also introduce the PulpMotion dataset, a human-motion and camera-trajectory dataset with rich captions and high-quality human motions. Extensive experiments across DiT- and MAR-based architectures show the generality and effectiveness of our method in generating on-frame coherent human-camera motions, while also achieving gains on textual alignment for both modalities. Our qualitative results yield more cinematographically meaningful framings, setting the new state of the art for this task. Code, models and data are available on our project page.

17 Oct 2025

computation statistics

A Computable Measure of Suboptimality for Entropy-Regularised Variational Objectives

The Alan Turing Institute IP Paris Newcastle University ENSAE CREST

Several emerging post-Bayesian methods target a probability distribution for which an entropy-regularised variational objective is minimised. This increased flexibility introduces a computational challenge, as one loses access to an explicit unnormalised density for the target. To mitigate this difficulty, we introduce a novel measure of suboptimality called 'gradient discrepancy', and in particular a 'kernel' gradient discrepancy (KGD) that can be explicitly computed. In the standard Bayesian context, KGD coincides with the kernel Stein discrepancy (KSD), and we obtain a novel characterisation of KSD as measuring the size of a variational gradient. Outside this familiar setting, KGD enables novel sampling algorithms to be developed and compared, even when unnormalised densities cannot be obtained. To illustrate this point several novel algorithms are proposed and studied, including a natural generalisation of Stein variational gradient descent, with applications to mean-field neural networks and predictively oriented posteriors presented. On the theoretical side, our principal contribution is to establish sufficient conditions for desirable properties of KGD, such as continuity and convergence control.

21 Oct 2025

computer-science computer-vision-and-pattern-recognition machine-learning

Unfolding Generative Flows with Koopman Operators: Fast and Interpretable Sampling

IP Paris ATHENA RC Archimedes/Athena RC LIX National and Kapodistrian University of Athens École Polytechnique

Continuous Normalizing Flows (CNFs) enable elegant generative modeling but remain bottlenecked by slow sampling: producing a single sample requires solving a nonlinear ODE with hundreds of function evaluations. Recent approaches such as Rectified Flow and OT-CFM accelerate sampling by straightening trajectories, yet the learned dynamics remain nonlinear black boxes, limiting both efficiency and interpretability. We propose a fundamentally different perspective: globally linearizing flow dynamics via Koopman theory. By lifting Conditional Flow Matching (CFM) into a higher-dimensional Koopman space, we represent its evolution with a single linear operator. This yields two key benefits. First, sampling becomes one-step and parallelizable, computed in closed form via the matrix exponential. Second, the Koopman operator provides a spectral blueprint of generation, enabling novel interpretability through its eigenvalues and modes. We derive a practical, simulation-free training objective that enforces infinitesimal consistency with the teacher's dynamics and show that this alignment preserves fidelity along the full generative path, distinguishing our method from boundary-only distillation. Empirically, our approach achieves competitive sample quality with dramatic speedups, while uniquely enabling spectral analysis of generative flows.
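
The closed-form, one-step sampling idea mentioned in this abstract can be illustrated on a toy linear system. The sketch below is not the paper's method: it assumes a hand-picked 2x2 generator `A` (a rotation) standing in for a learned Koopman operator, and compares many small Euler steps of dz/dt = A z with a single application of exp(A), approximated by a truncated Taylor series. All function and variable names are hypothetical.

```python
import math

def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_vec(a, v):
    """Multiply a square matrix by a vector."""
    return [sum(a[i][j] * v[j] for j in range(len(v))) for i in range(len(a))]

def mat_exp(a, terms=30):
    """Approximate exp(a) with the truncated Taylor series sum_k a^k / k!."""
    n = len(a)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]  # holds a^k / k!, starting at the identity
    for k in range(1, terms):
        term = [[v / k for v in row] for row in mat_mul(term, a)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return result

def euler_flow(a, z0, steps):
    """Integrate dz/dt = a z from t=0 to t=1 with explicit Euler steps."""
    z = list(z0)
    h = 1.0 / steps
    for _ in range(steps):
        dz = mat_vec(a, z)
        z = [zi + h * di for zi, di in zip(z, dz)]
    return z

# Assumed toy generator: rotation by a quarter turn over unit time.
theta = math.pi / 2
A = [[0.0, -theta], [theta, 0.0]]
z0 = [1.0, 0.0]

one_step = mat_vec(mat_exp(A), z0)            # single linear map
many_steps = euler_flow(A, z0, steps=10000)   # iterative ODE solve
```

Both routes land near the same point; the single matrix-vector product replaces the 10,000 sequential solver steps, which is the kind of saving a linear representation of the dynamics is meant to provide.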

29 Dec 2024

computer-science computer-vision-security artificial-intelligence

AKiRa: Augmentation Kit on Rays for optical video generation

LIX Univ Rennes, IRISA, Inria, CNRS École Polytechnique IP Paris

Xi Wang

Robin Courant

Recent advances in text-conditioned video diffusion have greatly improved video quality. However, these methods offer limited or sometimes no control to users on camera aspects, including dynamic camera motion, zoom, distorted lens and focus shifts. These motion and optical aspects are crucial for adding controllability and cinematic elements to generation frameworks, ultimately resulting in visual content that draws focus, enhances mood, and guides emotions according to filmmakers' controls. In this paper, we aim to close the gap between controllable video generation and camera optics. To achieve this, we propose AKiRa (Augmentation Kit on Rays), a novel augmentation framework that builds and trains a camera adapter with a complex camera model over an existing video generation backbone. It enables fine-tuned control over camera motion as well as complex optical parameters (focal length, distortion, aperture) to achieve cinematic effects such as zoom, fisheye effect, and bokeh. Extensive experiments demonstrate AKiRa's effectiveness in combining and composing camera optics while outperforming all state-of-the-art methods. This work sets a new landmark in controlled and optically enhanced video generation, paving the way for future optical video generation methods.

29 Apr 2024

computer-science artificial-intelligence computer-vision-and-pattern-recognition

OpenStreetView-5M: The Many Roads to Global Visual Geolocation

CNRS

UC Berkeley INRIA Paris INRAE IP Paris CNES IRD UPS Ecole des Ponts IGN CESBIO LIX LIGM ENSG Univ de Toulouse LASTIG UGE École Polytechnique

Determining the location of an image anywhere on Earth is a complex visual task, which makes it particularly relevant for evaluating computer vision algorithms. Yet, the absence of standard, large-scale, open-access datasets with reliably localizable images has limited its potential. To address this issue, we introduce OpenStreetView-5M, a large-scale, open-access dataset comprising over 5.1 million geo-referenced street view images, covering 225 countries and territories. In contrast to existing benchmarks, we enforce a strict train/test separation, allowing us to evaluate the relevance of learned geographical features beyond mere memorization. To demonstrate the utility of our dataset, we conduct an extensive benchmark of various state-of-the-art image encoders, spatial representations, and training strategies. All associated codes and models can be found at this https URL.

02 Oct 2025

computer-science computer-vision-and-pattern-recognition

How far can we go with ImageNet for Text-to-Image generation?

CNRS IP Paris Univ. Gustave Eiffel LIX AMIAD École Nationale des Ponts et Chaussées École Polytechnique

Recent text-to-image (T2I) generation models have achieved remarkable success by training on billion-scale datasets, following a 'bigger is better' paradigm that prioritizes data quantity over availability (closed vs open source) and reproducibility (data decay vs established collections). We challenge this established paradigm by demonstrating that one can achieve capabilities of models trained on massive web-scraped collections, using only ImageNet enhanced with well-designed text and image augmentations. With this much simpler setup, we achieve a +6% overall score over SD-XL on GenEval and +5% on DPGBench while using just 1/10th the parameters and 1/1000th the training images. We also show that ImageNet-pretrained models can be finetuned on task-specific datasets (like for high-resolution aesthetic applications) with good results, indicating that ImageNet is sufficient for acquiring general capabilities. This opens the way for more reproducible research as ImageNet is widely available and the proposed standardized training setup only requires 500 hours of H100 to train a text-to-image model.

21 Nov 2025

computer-science computer-vision-and-pattern-recognition generative-models

Illustrator's Depth: Monocular Layer Index Prediction for Image Decomposition

Inria IP Paris

Adobe UCA CityUHK

We introduce Illustrator's Depth, a novel definition of depth that addresses a key challenge in digital content creation: decomposing flat images into editable, ordered layers. Inspired by an artist's compositional process, illustrator's depth assigns a layer index to each pixel, forming an interpretable image decomposition through a discrete, globally consistent ordering of elements optimized for editability. We also propose and train a neural network using a curated dataset of layered vector graphics to predict layering directly from raster inputs. Our layer index inference unlocks a range of powerful downstream applications. In particular, it significantly outperforms state-of-the-art baselines for image vectorization while also enabling high-fidelity text-to-vector-graphics generation, automatic 3D relief generation from 2D images, and intuitive depth-aware editing. By reframing depth from a physical quantity to a creative abstraction, illustrator's depth prediction offers a new foundation for editable image decomposition.

29 Aug 2025

bayesian-deep-learning computer-science computer-vision-and-pattern-recognition

From stability of Langevin diffusion to convergence of proximal MCMC for non-log-concave sampling

CNRS

Inria Télécom Paris IP Paris PSL University Univ. Bordeaux Bordeaux INP ENS

We consider the problem of sampling distributions stemming from non-convex potentials with the Unadjusted Langevin Algorithm (ULA). We prove the stability of the discrete-time ULA to drift approximations under the assumption that the potential is strongly convex at infinity. In many contexts, e.g. imaging inverse problems, potentials are non-convex and non-smooth. The Proximal Stochastic Gradient Langevin Algorithm (PSGLA) is a popular algorithm to handle such potentials. It combines the forward-backward optimization algorithm with a ULA step. Our main stability result, combined with properties of the Moreau envelope, allows us to derive the first proof of convergence of the PSGLA for non-convex potentials. We empirically validate our methodology on synthetic data and in the context of imaging inverse problems. In particular, we observe that PSGLA exhibits faster convergence rates than the Stochastic Gradient Langevin Algorithm for posterior sampling while preserving its restoration properties.
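
For readers unfamiliar with ULA, the basic update is x_{k+1} = x_k - step * grad U(x_k) + sqrt(2 * step) * xi_k with standard Gaussian noise xi_k. The sketch below shows only that plain update on an assumed toy potential U(x) = x^2 / 2 (a convex case chosen for easy checking); it does not implement PSGLA's proximal step or the non-convex setting studied in the paper. The function name, step size and burn-in length are illustrative assumptions.

```python
import math
import random

def ula_samples(grad_u, x0, step, n_steps, seed=0):
    """Run the ULA recursion and return every iterate.

    x_{k+1} = x_k - step * grad_u(x_k) + sqrt(2 * step) * xi_k,  xi_k ~ N(0, 1)
    """
    rng = random.Random(seed)
    x = x0
    out = []
    for _ in range(n_steps):
        noise = rng.gauss(0.0, 1.0)
        x = x - step * grad_u(x) + math.sqrt(2.0 * step) * noise
        out.append(x)
    return out

# Assumed toy target: standard Gaussian, U(x) = x^2 / 2, so grad U(x) = x.
samples = ula_samples(lambda x: x, x0=3.0, step=0.1, n_steps=20000)
kept = samples[1000:]  # drop early iterates as burn-in
mean = sum(kept) / len(kept)
var = sum((s - mean) ** 2 for s in kept) / len(kept)
```

Because the chain is unadjusted, its stationary law differs slightly from the target: for this Gaussian toy case the stationary variance is 1 / (1 - step / 2), which is about 1.05 at step 0.1 rather than exactly 1.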

13 Nov 2025

mathematics optimization-and-control

On the Moreau envelope properties of weakly convex functions

CNRS

Inria Télécom Paris IP Paris Univ. Bordeaux Bordeaux INP IMB LTCI

In this document, we present the main properties satisfied by the Moreau envelope of weakly convex functions. The Moreau envelope has been introduced in convex optimization to regularize convex functionals while preserving their global minimizers. However, the Moreau envelope is also defined for the more general class of weakly convex functions and can be a useful tool for optimization in this context. The main properties of the Moreau envelope have been demonstrated for convex functions and are generalized to weakly convex functions in various works. This document summarizes the vast literature on the properties of the Moreau envelope and provides the associated proofs.
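
As a concrete reminder of the object discussed here, the Moreau envelope of f with parameter lam is min over y of f(y) + (x - y)^2 / (2 * lam). The sketch below is an illustration, not taken from the document: it evaluates the envelope of |x| in closed form (via soft-thresholding), cross-checks it with a brute-force grid search, and then applies the same grid search to an assumed weakly convex example f(y) = |y| - 0.25 * y^2, for which lam = 1 is below the 1 / rho = 2 threshold. Function names, the grid range and the example function are hypothetical choices.

```python
def prox_abs(x, lam):
    """Proximal operator of |.| with parameter lam (soft-thresholding)."""
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0

def moreau_env_abs(x, lam):
    """Moreau envelope of |.| evaluated at its proximal point."""
    y = prox_abs(x, lam)
    return abs(y) + (x - y) ** 2 / (2.0 * lam)

def moreau_env_grid(f, x, lam, lo=-5.0, hi=5.0, n=100001):
    """Brute-force envelope: minimize f(y) + (x - y)^2 / (2 lam) over a grid."""
    best = float("inf")
    for i in range(n):
        y = lo + (hi - lo) * i / (n - 1)
        best = min(best, f(y) + (x - y) ** 2 / (2.0 * lam))
    return best

def weakly_convex(y):
    """Assumed example: |y| - 0.25 y^2 becomes convex after adding 0.25 y^2."""
    return abs(y) - 0.25 * y * y

closed_form = moreau_env_abs(3.0, 1.0)
grid_value = moreau_env_grid(abs, 3.0, 1.0)
wc_value = moreau_env_grid(weakly_convex, 0.0, 1.0)
```

For |x| the envelope coincides with the Huber function, so the closed form and the grid search should agree up to grid resolution.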

26 Sep 2025

computer-science computer-vision-and-pattern-recognition machine-learning

Training-Free Synthetic Data Generation with Dual IP-Adapter Guidance

CNRS IP Paris École des Ponts UGE École Polytechnique

Few-shot image classification remains challenging due to the limited availability of labeled examples. Recent approaches have explored generating synthetic training data using text-to-image diffusion models, but often require extensive model fine-tuning or external information sources. We present a novel training-free approach, called DIPSY, that leverages IP-Adapter for image-to-image translation to generate highly discriminative synthetic images using only the available few-shot examples. DIPSY introduces three key innovations: (1) an extended classifier-free guidance scheme that enables independent control over positive and negative image conditioning; (2) a class similarity-based sampling strategy that identifies effective contrastive examples; and (3) a simple yet effective pipeline that requires no model fine-tuning or external captioning and filtering. Experiments across ten benchmark datasets demonstrate that our approach achieves state-of-the-art or comparable performance, while eliminating the need for generative model adaptation or reliance on external tools for caption generation and image filtering. Our results highlight the effectiveness of leveraging dual image prompting with positive-negative guidance for generating class-discriminative features, particularly for fine-grained classification tasks.

21 Aug 2025

computer-science computer-vision-and-pattern-recognition data-curation

Text-Driven 3D Hand Motion Generation from Sign Language Data

CNRS

NVIDIA IP Paris Univ. Gustave Eiffel Ecole des Ponts LIGM

Our goal is to train a generative model of 3D hand motions, conditioned on natural language descriptions specifying motion characteristics such as handshapes, locations, finger/hand/arm movements. To this end, we automatically build pairs of 3D hand motions and their associated textual labels with unprecedented scale. Specifically, we leverage a large-scale sign language video dataset, along with noisy pseudo-annotated sign categories, which we translate into hand motion descriptions via an LLM that utilizes a dictionary of sign attributes, as well as our complementary motion-script cues. This data enables training a text-conditioned hand motion diffusion model, HandMDM, that is robust across domains such as unseen sign categories from the same sign language, but also signs from another sign language and non-sign hand movements. We contribute an extensive experimental investigation of these scenarios and will make our trained models and data publicly available to support future research in this relatively new field.

19 Oct 2025

computer-science artificial-intelligence computer-vision-and-pattern-recognition

One-step Diffusion Models with Bregman Density Ratio Matching

CNRS Institut Polytechnique de Paris LIX AMIAD École Nationale des Ponts et Chaussées École Polytechnique

Diffusion and flow models achieve high generative quality but remain computationally expensive due to slow multi-step sampling. Distillation methods accelerate them by training fast student generators, yet most existing objectives lack a unified theoretical foundation. In this work, we propose Di-Bregman, a compact framework that formulates diffusion distillation as Bregman divergence-based density-ratio matching. This convex-analytic view connects several existing objectives through a common lens. Experiments on CIFAR-10 and text-to-image generation demonstrate that Di-Bregman achieves improved one-step FID over reverse-KL distillation and maintains high visual fidelity compared to the teacher model. Our results highlight Bregman density-ratio matching as a practical and theoretically-grounded route toward efficient one-step diffusion generation.
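
To make the Bregman view concrete, recall that for a convex generator phi the pointwise Bregman divergence is D(r, s) = phi(r) - phi(s) - phi'(s) * (r - s). The sketch below evaluates it on scalar density-ratio values for two standard generators; it is a generic illustration under that textbook definition, not the Di-Bregman training objective, and all names are hypothetical.

```python
import math

def bregman(phi, dphi, r, s):
    """Pointwise Bregman divergence D_phi(r, s) for scalar r, s."""
    return phi(r) - phi(s) - dphi(s) * (r - s)

# Generator t * log(t): gives r * log(r / s) - r + s (a KL-type divergence).
def kl_phi(t):
    return t * math.log(t)

def kl_dphi(t):
    return math.log(t) + 1.0

# Generator (t - 1)^2 / 2: gives (r - s)^2 / 2 (a least-squares divergence).
def sq_phi(t):
    return 0.5 * (t - 1.0) ** 2

def sq_dphi(t):
    return t - 1.0

true_ratio, model_ratio = 2.0, 1.0
kl_value = bregman(kl_phi, kl_dphi, true_ratio, model_ratio)
sq_value = bregman(sq_phi, sq_dphi, true_ratio, model_ratio)
```

Swapping the generator changes the resulting divergence while the surrounding code stays the same, which is one way to read the abstract's claim that several objectives share a common lens.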

20 Oct 2025

ai-for-health attention-mechanisms computer-science

The quest for the GRAph Level autoEncoder (GRALE)

Télécom Paris IP Paris École Polytechnique

Although graph-based learning has attracted a lot of attention, graph representation learning is still a challenging task whose resolution may impact key application fields such as chemistry or biology. To this end, we introduce GRALE, a novel graph autoencoder that encodes and decodes graphs of varying sizes into a shared embedding space. GRALE is trained using an Optimal Transport-inspired loss that compares the original and reconstructed graphs and leverages a differentiable node matching module, which is trained jointly with the encoder and decoder. The proposed attention-based architecture relies on Evoformer, the core component of AlphaFold, which we extend to support both graph encoding and decoding. We show, in numerical experiments on simulated and molecular data, that GRALE enables a highly general form of pre-training, applicable to a wide range of downstream tasks, from classification and regression to more complex tasks such as graph interpolation, editing, matching, and prediction.

10 Jan 2025

computer-science artificial-intelligence computation-and-language

Long Story Short: Story-level Video Understanding from 20K Short Films

MBZUAI

Inria IP Paris École Polytechnique

Xi Wang

Recent developments in vision-language models have significantly advanced video understanding. Existing datasets and tasks, however, have notable limitations. Most datasets are confined to short videos with limited events and narrow narratives. For example, datasets with instructional and egocentric videos often depict activities of one person in a single scene. Although existing movie datasets offer richer content, they are often limited to short-term tasks, lack publicly available videos, and frequently encounter data leakage issues given the use of subtitles and other information about commercial movies during LLM pretraining. To address the above limitations, we propose Short-Films 20K (SF20K), the largest publicly available movie dataset. SF20K is composed of 20,143 amateur films and offers long-term video tasks in the form of multiple-choice and open-ended question answering. Our extensive analysis of SF20K reveals minimal data leakage, emphasizes the need for long-term reasoning, and demonstrates the strong performance of recent VLMs. Finally, we show that instruction tuning on the SF20K-Train set substantially improves model performance, paving the way for future progress in long-term video understanding.
