alphaXiv

CREST ENSAE IP Paris

Computational Optimal Transport

18 Mar 2020

Marco Cuturi

This survey provides a comprehensive review of Optimal Transport (OT) theory, with a focus on its computational methods and applications in data sciences. It highlights how entropic regularization, particularly through the Sinkhorn-Knopp algorithm, has made OT computationally feasible for large-scale problems, detailing various formulations and their use across machine learning, computer vision, and statistics.
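To make the entropic-regularisation idea concrete, here is a minimal pure-Python sketch of Sinkhorn-style matrix scaling on a toy 2x2 problem. The function name `sinkhorn`, the histograms `a` and `b`, the cost matrix, and the values of `eps` and `n_iters` are illustrative assumptions and are not taken from the survey.

```python
import math

def sinkhorn(a, b, cost, eps=0.5, n_iters=200):
    """Approximate entropic-regularised transport plan between histograms a and b."""
    n, m = len(a), len(b)
    # Gibbs kernel K_ij = exp(-C_ij / eps); eps is the regularisation strength
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns so the plan matches each marginal
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    # Transport plan P = diag(u) K diag(v)
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

# Toy problem (assumed for illustration): two bins on each side
a = [0.5, 0.5]
b = [0.25, 0.75]
cost = [[0.0, 1.0], [1.0, 0.0]]

P = sinkhorn(a, b, cost)
row_sums = [sum(row) for row in P]
col_sums = [sum(P[i][j] for i in range(2)) for j in range(2)]
```

After enough iterations, `row_sums` and `col_sums` should be close to `a` and `b` respectively. Smaller values of `eps` bring the plan closer to unregularised OT but typically need more iterations and a log-domain implementation for numerical stability.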

MixtureVitae: Open Web-Scale Pretraining Dataset With High Quality Instruction and Reasoning Data Built from Permissive-First Text Sources

29 Sep 2025

Victor May

University of Freiburg Northeastern University

MixtureVitae introduces an open, web-scale pretraining dataset that minimizes legal and ethical risks by using permissive-first text sources, augmented with high-quality instruction and reasoning data. Models trained on this corpus achieve performance competitive with those trained on non-permissive data, and demonstrate an order-of-magnitude improvement in math and coding abilities over other permissive datasets.

#computer-science #artificial-intelligence #computation-and-language

Latent Discrete Diffusion Models

20 Oct 2025

Inria Sakana AI

Latent Discrete Diffusion Models (LDDMs) address the factorization bottleneck in masked discrete diffusion by coupling a discrete token diffusion with a co-evolving continuous latent channel. This approach improves unconditional generation quality, particularly at low sampling budgets, on tasks like text generation by enabling more coherent and consistent outputs.

#computer-science #artificial-intelligence #machine-learning

Flowing Datasets with Wasserstein over Wasserstein Gradient Flows

09 Jun 2025

Université Paris-Saclay IP Paris

Many applications in machine learning involve data represented as probability distributions. The emergence of such data requires radically novel techniques to design tractable gradient flows on probability distributions over this type of (infinite-dimensional) objects. For instance, being able to flow labeled datasets is a core task for applications ranging from domain adaptation to transfer learning or dataset distillation. In this setting, we propose to represent each class by the associated conditional distribution of features, and to model the dataset as a mixture distribution supported on these classes (which are themselves probability distributions), meaning that labeled datasets can be seen as probability distributions over probability distributions. We endow this space with a metric structure from optimal transport, namely the Wasserstein over Wasserstein (WoW) distance, derive a differential structure on this space, and define WoW gradient flows. The latter make it possible to design dynamics over this space that decrease a given objective functional. We apply our framework to transfer learning and dataset distillation tasks, leveraging our gradient flow construction as well as novel tractable functionals that take the form of Maximum Mean Discrepancies with Sliced-Wasserstein based kernels between probability distributions.

#computer-science #machine-learning #data-curation

NextBestPath: Efficient 3D Mapping of Unseen Environments

07 Feb 2025

CNRS École Normale Supérieure

The NextBestPath (NBP) method introduces a deep learning approach for active 3D mapping that predicts long-term exploration goals, optimizing for surface coverage gain while simultaneously generating obstacle maps for path planning. This approach achieves a 6.23 absolute gain in completion ratio over the state-of-the-art ANM model on the MP3D dataset and outperforms baselines on the new AiMDoom dataset.

#active-learning #autonomous-vehicles #computer-science

Dense Motion Captioning

07 Nov 2025

CNRS University of Trento

Researchers from the University of Trento and LIGM introduce Dense Motion Captioning (DMC), a new task for 3D human motion understanding that requires localizing and describing multiple atomic actions within complex, untrimmed motion sequences. They present CompMo, a large-scale dataset with precise temporal and textual annotations, and DEMO, a novel LLM-based model that significantly outperforms baselines in dense captioning quality and temporal localization.

#computer-science #computer-vision-and-pattern-recognition #data-curation

A Computable Measure of Suboptimality for Entropy-Regularised Variational Objectives

17 Oct 2025

The Alan Turing Institute IP Paris

Several emerging post-Bayesian methods target a probability distribution for which an entropy-regularised variational objective is minimised. This increased flexibility introduces a computational challenge, as one loses access to an explicit unnormalised density for the target. To mitigate this difficulty, we introduce a novel measure of suboptimality called 'gradient discrepancy', and in particular a 'kernel' gradient discrepancy (KGD) that can be explicitly computed. In the standard Bayesian context, KGD coincides with the kernel Stein discrepancy (KSD), and we obtain a novel characterisation of KSD as measuring the size of a variational gradient. Outside this familiar setting, KGD enables novel sampling algorithms to be developed and compared, even when unnormalised densities cannot be obtained. To illustrate this point several novel algorithms are proposed and studied, including a natural generalisation of Stein variational gradient descent, with applications to mean-field neural networks and predictively oriented posteriors presented. On the theoretical side, our principal contribution is to establish sufficient conditions for desirable properties of KGD, such as continuity and convergence control.

#computation #statistics

Unfolding Generative Flows with Koopman Operators: Fast and Interpretable Sampling

21 Oct 2025

IP Paris ATHENA RC

Continuous Normalizing Flows (CNFs) enable elegant generative modeling but remain bottlenecked by slow sampling: producing a single sample requires solving a nonlinear ODE with hundreds of function evaluations. Recent approaches such as Rectified Flow and OT-CFM accelerate sampling by straightening trajectories, yet the learned dynamics remain nonlinear black boxes, limiting both efficiency and interpretability. We propose a fundamentally different perspective: globally linearizing flow dynamics via Koopman theory. By lifting Conditional Flow Matching (CFM) into a higher-dimensional Koopman space, we represent its evolution with a single linear operator. This yields two key benefits. First, sampling becomes one-step and parallelizable, computed in closed form via the matrix exponential. Second, the Koopman operator provides a spectral blueprint of generation, enabling novel interpretability through its eigenvalues and modes. We derive a practical, simulation-free training objective that enforces infinitesimal consistency with the teacher's dynamics and show that this alignment preserves fidelity along the full generative path, distinguishing our method from boundary-only distillation. Empirically, our approach achieves competitive sample quality with dramatic speedups, while uniquely enabling spectral analysis of generative flows.
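The claim that linear dynamics can be evaluated in one step via the matrix exponential can be illustrated on a toy system. The sketch below is not the paper's model: the 2x2 generator, the helper names `mat_mul`, `mat_vec` and `mat_exp`, and the truncated Taylor series are all assumptions chosen to keep the example dependency-free. It compares the closed-form state at time 1 with many small Euler steps of the same linear ODE.

```python
import math

def mat_mul(A, B):
    """Product of two square matrices stored as nested lists."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_vec(A, z):
    """Matrix-vector product."""
    return [sum(A[i][j] * z[j] for j in range(len(z))) for i in range(len(A))]

def mat_exp(A, terms=30):
    """Truncated Taylor series for exp(A); adequate for small, well-scaled A."""
    n = len(A)
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        # term holds A^k / k! after this update
        term = [[v / k for v in row] for row in mat_mul(term, A)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return result

# Hypothetical linear generator: rotation with angular speed pi/2
generator = [[0.0, -math.pi / 2], [math.pi / 2, 0.0]]
z0 = [1.0, 0.0]

# One-step evaluation: z(1) = exp(generator) z(0)
z1 = mat_vec(mat_exp(generator), z0)

# Reference: explicit Euler integration of dz/dt = generator * z with many steps
n_steps = 10000
z = z0[:]
for _ in range(n_steps):
    dz = mat_vec(generator, z)
    z = [z[i] + dz[i] / n_steps for i in range(2)]
```

Here `z1` is obtained from a single matrix-vector product, whereas the Euler reference `z` needs thousands of sequential updates to reach a similar point.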

#computer-science #computer-vision-and-pattern-recognition #machine-learning

OpenStreetView-5M: The Many Roads to Global Visual Geolocation

29 Apr 2024

Determining the location of an image anywhere on Earth is a complex visual task, which makes it particularly relevant for evaluating computer vision algorithms. Yet, the absence of standard, large-scale, open-access datasets with reliably localizable images has limited its potential. To address this issue, we introduce OpenStreetView-5M, a large-scale, open-access dataset comprising over 5.1 million geo-referenced street view images, covering 225 countries and territories. In contrast to existing benchmarks, we enforce a strict train/test separation, allowing us to evaluate the relevance of learned geographical features beyond mere memorization. To demonstrate the utility of our dataset, we conduct an extensive benchmark of various state-of-the-art image encoders, spatial representations, and training strategies. All associated codes and models can be found at this https URL.

#computer-science #artificial-intelligence #computer-vision-and-pattern-recognition

How far can we go with ImageNet for Text-to-Image generation?

02 Oct 2025

Recent text-to-image (T2I) generation models have achieved remarkable success by training on billion-scale datasets, following a 'bigger is better' paradigm that prioritizes data quantity over availability (closed vs open source) and reproducibility (data decay vs established collections). We challenge this established paradigm by demonstrating that one can achieve capabilities of models trained on massive web-scraped collections, using only ImageNet enhanced with well-designed text and image augmentations. With this much simpler setup, we achieve a +6% overall score over SD-XL on GenEval and +5% on DPGBench while using just 1/10th the parameters and 1/1000th the training images. We also show that ImageNet pretrained models can be finetuned on task-specific datasets (for example, for high-resolution aesthetic applications) with good results, indicating that ImageNet is sufficient for acquiring general capabilities. This opens the way for more reproducible research as ImageNet is widely available and the proposed standardized training setup only requires 500 H100 hours to train a text-to-image model.

#computer-science #computer-vision-and-pattern-recognition

Illustrator's Depth: Monocular Layer Index Prediction for Image Decomposition

21 Nov 2025

We introduce Illustrator's Depth, a novel definition of depth that addresses a key challenge in digital content creation: decomposing flat images into editable, ordered layers. Inspired by an artist's compositional process, illustrator's depth infers a layer index for each pixel, forming an interpretable image decomposition through a discrete, globally consistent ordering of elements optimized for editability. We also propose and train a neural network using a curated dataset of layered vector graphics to predict layering directly from raster inputs. Our layer index inference unlocks a range of powerful downstream applications. In particular, it significantly outperforms state-of-the-art baselines for image vectorization while also enabling high-fidelity text-to-vector-graphics generation, automatic 3D relief generation from 2D images, and intuitive depth-aware editing. By reframing depth from a physical quantity to a creative abstraction, illustrator's depth prediction offers a new foundation for editable image decomposition.

#computer-science #computer-vision-and-pattern-recognition #generative-models

From stability of Langevin diffusion to convergence of proximal MCMC for non-log-concave sampling

29 Aug 2025

We consider the problem of sampling distributions stemming from non-convex potentials with the Unadjusted Langevin Algorithm (ULA). We prove the stability of the discrete-time ULA to drift approximations under the assumption that the potential is strongly convex at infinity. In many contexts, e.g. imaging inverse problems, potentials are non-convex and non-smooth. The Proximal Stochastic Gradient Langevin Algorithm (PSGLA) is a popular algorithm to handle such potentials. It combines the forward-backward optimization algorithm with a ULA step. Our main stability result combined with properties of the Moreau envelope allows us to derive the first proof of convergence of the PSGLA for non-convex potentials. We empirically validate our methodology on synthetic data and in the context of imaging inverse problems. In particular, we observe that PSGLA exhibits faster convergence rates than the Stochastic Gradient Langevin Algorithm for posterior sampling while preserving its restoration properties.
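As a rough illustration of how a forward-backward step can be combined with a Langevin step, the sketch below uses the commonly stated PSGLA update x_{k+1} = prox_{γg}(x_k − γ∇f(x_k) + sqrt(2γ) ξ_k) on a one-dimensional toy target. The potential f(x) = x²/2, the non-smooth term g(x) = |x|, the step size, and the names `psgla` and `soft_threshold` are assumptions made for this example. They are convex and far simpler than the non-convex setting the abstract addresses.

```python
import math
import random

def soft_threshold(y, step, lam=1.0):
    """Proximal operator of step * lam * |.| (soft-thresholding)."""
    return math.copysign(max(abs(y) - step * lam, 0.0), y)

def psgla(grad_f, prox_g, x0, step, n_steps, seed=0):
    """One-dimensional PSGLA-style chain: gradient step plus noise, then prox."""
    rng = random.Random(seed)
    x = x0
    samples = []
    for _ in range(n_steps):
        noise = rng.gauss(0.0, 1.0)
        # Forward step on the smooth part f, with injected Langevin noise
        y = x - step * grad_f(x) + math.sqrt(2.0 * step) * noise
        # Backward (proximal) step on the non-smooth part g
        x = prox_g(y, step)
        samples.append(x)
    return samples

# Toy smooth potential f(x) = x^2 / 2, so grad f(x) = x (illustrative choice)
samples = psgla(lambda x: x, soft_threshold, x0=0.0, step=0.01, n_steps=5000)
mean = sum(samples) / len(samples)
```

The chain targets a density roughly proportional to exp(−f(x) − g(x)), up to discretisation bias from the finite step size.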

#bayesian-deep-learning #computer-science #computer-vision-and-pattern-recognition

On the Moreau envelope properties of weakly convex functions

13 Nov 2025

In this document, we present the main properties satisfied by the Moreau envelope of weakly convex functions. The Moreau envelope has been introduced in convex optimization to regularize convex functionals while preserving their global minimizers. However, the Moreau envelope is also defined for the more general class of weakly convex functions and can be a useful tool for optimization in this context. The main properties of the Moreau envelope have been demonstrated for convex functions and are generalized to weakly convex functions in various works. This document summarizes the vast literature on the properties of the Moreau envelope and provides the associated proofs.
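For reference, the Moreau envelope of f with parameter λ > 0 is usually defined as the infimum over y of f(y) + (x − y)² / (2λ). The sketch below evaluates it for the convex function f(y) = |y|, where the proximal operator is soft-thresholding and the envelope is known to coincide with the Huber function. The function names and test points are illustrative assumptions, and the weakly convex case treated in the document is not covered by this toy example.

```python
def prox_abs(x, lam):
    """Proximal operator of f(y) = |y| with parameter lam (soft-thresholding)."""
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0

def moreau_env_abs(x, lam):
    """Moreau envelope of |.| at x, evaluated at the minimiser given by the prox."""
    p = prox_abs(x, lam)
    return abs(p) + (x - p) ** 2 / (2.0 * lam)

def huber(x, lam):
    """Huber function: quadratic near zero, linear in the tails."""
    if abs(x) <= lam:
        return x * x / (2.0 * lam)
    return abs(x) - lam / 2.0

lam = 0.5
points = [-2.0, -0.5, -0.1, 0.0, 0.3, 0.5, 1.7]
envelope_values = [moreau_env_abs(x, lam) for x in points]
huber_values = [huber(x, lam) for x in points]
```

The two lists agree, and the envelope never exceeds |x| while sharing its minimiser at 0, which matches the regularising role described above.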

#mathematics #optimization-and-control

Implicit Diffusion: Efficient Optimization through Stochastic Sampling

05 Mar 2025

Google DeepMind EPFL

We present a new algorithm to optimize distributions defined implicitly by parameterized stochastic diffusions. Doing so allows us to modify the outcome distribution of sampling processes by optimizing over their parameters. We introduce a general framework for first-order optimization of these processes, that performs jointly, in a single loop, optimization and sampling steps. This approach is inspired by recent advances in bilevel optimization and automatic implicit differentiation, leveraging the point of view of sampling as optimization over the space of probability distributions. We provide theoretical guarantees on the performance of our method, as well as experimental results demonstrating its effectiveness. We apply it to training energy-based models and finetuning denoising diffusions.

#computer-science #machine-learning #generative-models

Training-Free Synthetic Data Generation with Dual IP-Adapter Guidance

26 Sep 2025

Few-shot image classification remains challenging due to the limited availability of labeled examples. Recent approaches have explored generating synthetic training data using text-to-image diffusion models, but often require extensive model fine-tuning or external information sources. We present a novel training-free approach, called DIPSY, that leverages IP-Adapter for image-to-image translation to generate highly discriminative synthetic images using only the available few-shot examples. DIPSY introduces three key innovations: (1) an extended classifier-free guidance scheme that enables independent control over positive and negative image conditioning; (2) a class similarity-based sampling strategy that identifies effective contrastive examples; and (3) a simple yet effective pipeline that requires no model fine-tuning or external captioning and filtering. Experiments across ten benchmark datasets demonstrate that our approach achieves state-of-the-art or comparable performance, while eliminating the need for generative model adaptation or reliance on external tools for caption generation and image filtering. Our results highlight the effectiveness of leveraging dual image prompting with positive-negative guidance for generating class-discriminative features, particularly for fine-grained classification tasks.

#computer-science #computer-vision-and-pattern-recognition #machine-learning

Text-Driven 3D Hand Motion Generation from Sign Language Data

21 Aug 2025

Our goal is to train a generative model of 3D hand motions, conditioned on natural language descriptions specifying motion characteristics such as handshapes, locations, and finger/hand/arm movements. To this end, we automatically build pairs of 3D hand motions and their associated textual labels at unprecedented scale. Specifically, we leverage a large-scale sign language video dataset, along with noisy pseudo-annotated sign categories, which we translate into hand motion descriptions via an LLM that utilizes a dictionary of sign attributes, as well as our complementary motion-script cues. This data enables training a text-conditioned hand motion diffusion model, HandMDM, that is robust across domains such as unseen sign categories from the same sign language, but also signs from another sign language and non-sign hand movements. We contribute an extensive experimental investigation of these scenarios and will make our trained models and data publicly available to support future research in this relatively new field.

#computer-science #computer-vision-and-pattern-recognition #data-curation

A reproducible comparative study of categorical kernels for Gaussian process regression, with new clustering-based nested kernels

02 Oct 2025

CNRS Institut Polytechnique de Paris

Designing categorical kernels is a major challenge for Gaussian process regression with continuous and categorical inputs. Despite previous studies, it is difficult to identify a preferred method, because the evaluation metrics, the optimization procedure, or the datasets change depending on the study. In particular, reproducible code is rarely available. The aim of this paper is to provide a reproducible comparative study of all existing categorical kernels on many of the test cases investigated so far. We also propose new evaluation metrics inspired by the optimization community, which provide quantitative rankings of the methods across several tasks. From our results on datasets which exhibit a group structure on the levels of categorical inputs, it appears that nested kernel methods clearly outperform all competitors. When the group structure is unknown or when there is no prior knowledge of such a structure, we propose a new clustering-based strategy using target encodings of categorical variables. We show that on a large panel of datasets, which do not necessarily have a known group structure, this estimation strategy still outperforms other approaches while maintaining low computational cost.

#clustering-algorithms #computer-science #machine-learning

The quest for the GRAph Level autoEncoder (GRALE)

20 Oct 2025

Télécom Paris IP Paris

Although graph-based learning has attracted a lot of attention, graph representation learning is still a challenging task whose resolution may impact key application fields such as chemistry or biology. To this end, we introduce GRALE, a novel graph autoencoder that encodes and decodes graphs of varying sizes into a shared embedding space. GRALE is trained using an Optimal Transport-inspired loss that compares the original and reconstructed graphs and leverages a differentiable node matching module, which is trained jointly with the encoder and decoder. The proposed attention-based architecture relies on Evoformer, the core component of AlphaFold, which we extend to support both graph encoding and decoding. We show, in numerical experiments on simulated and molecular data, that GRALE enables a highly general form of pre-training, applicable to a wide range of downstream tasks, from classification and regression to more complex tasks such as graph interpolation, editing, matching, and prediction.

#ai-for-health #attention-mechanisms #computer-science

Long Story Short: Story-level Video Understanding from 20K Short Films

10 Jan 2025

Xi Wang

Recent developments in vision-language models have significantly advanced video understanding. Existing datasets and tasks, however, have notable limitations. Most datasets are confined to short videos with limited events and narrow narratives. For example, datasets with instructional and egocentric videos often depict activities of one person in a single scene. Although existing movie datasets offer richer content, they are often limited to short-term tasks, lack publicly available videos, and frequently encounter data leakage issues given the use of subtitles and other information about commercial movies during LLM pretraining. To address the above limitations, we propose Short-Films 20K (SF20K), the largest publicly available movie dataset. SF20K is composed of 20,143 amateur films and offers long-term video tasks in the form of multiple-choice and open-ended question answering. Our extensive analysis of SF20K reveals minimal data leakage, emphasizes the need for long-term reasoning, and demonstrates the strong performance of recent VLMs. Finally, we show that instruction tuning on the SF20K-Train set substantially improves model performance, paving the way for future progress in long-term video understanding.

#computer-science #artificial-intelligence #computation-and-language

Can Linear Probes Measure LLM Uncertainty?

05 Oct 2025

Effective Uncertainty Quantification (UQ) represents a key aspect for reliable deployment of Large Language Models (LLMs) in automated decision-making and beyond. Yet, for LLM generation with multiple choice structure, the state-of-the-art in UQ is still dominated by the naive baseline given by the maximum softmax score. To address this shortcoming, we demonstrate that taking a principled approach via Bayesian statistics leads to improved performance despite leveraging the simplest possible model, namely linear regression. More precisely, we propose to train multiple Bayesian linear models, each predicting the output of a layer given the output of the previous one. Based on the obtained layer-level posterior distributions, we infer the global uncertainty level of the LLM by identifying a sparse combination of distributional features, leading to an efficient UQ scheme. Numerical experiments on various LLMs show consistent improvement over state-of-the-art baselines.

#bayesian-deep-learning #computer-science #machine-learning
