IP Paris
MixtureVitae: Open Web-Scale Pretraining Dataset With High Quality Instruction and Reasoning Data Built from Permissive-First Text Sources

MixtureVitae introduces an open, web-scale pretraining dataset that minimizes legal and ethical risks by using permissive-first text sources, augmented with high-quality instruction and reasoning data. Models trained on this corpus achieve performance competitive with those trained on non-permissive data, and demonstrate an order-of-magnitude improvement in math and coding abilities over other permissive datasets.

Latent Discrete Diffusion Models
20 Oct 2025

Latent Discrete Diffusion Models (LDDMs) address the factorization bottleneck in masked discrete diffusion by coupling a discrete token diffusion with a co-evolving continuous latent channel. This approach improves unconditional generation quality, particularly at low sampling budgets, on tasks like text generation by enabling more coherent and consistent outputs.
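The masked discrete diffusion that LDDMs build on corrupts a token sequence by independently masking positions; the factorization bottleneck arises because the reverse model then predicts masked positions independently given the visible context. A minimal sketch of that forward corruption process (the function name and mask handling are illustrative, not LDDMs' actual implementation):

```python
import numpy as np

def mask_forward(tokens, t, mask_id, rng):
    """Forward process of masked discrete diffusion: each token is
    independently replaced by the mask symbol with probability t.
    At t=0 the sequence is untouched; at t=1 it is fully masked."""
    tokens = np.asarray(tokens)
    keep = rng.random(len(tokens)) >= t
    return np.where(keep, tokens, mask_id)
```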

Dense Motion Captioning

Researchers from the University of Trento and LIGM introduce Dense Motion Captioning (DMC), a new task for 3D human motion understanding that requires localizing and describing multiple atomic actions within complex, untrimmed motion sequences. They present CompMo, a large-scale dataset with precise temporal and textual annotations, and DEMO, a novel LLM-based model that significantly outperforms baselines in dense captioning quality and temporal localization.

NextBestPath: Efficient 3D Mapping of Unseen Environments

The NextBestPath (NBP) method introduces a deep learning approach for active 3D mapping that predicts long-term exploration goals, optimizing for surface coverage gain while simultaneously generating obstacle maps for path planning. This approach achieves a 6.23 absolute gain in completion ratio over the state-of-the-art ANM model on the MP3D dataset and outperforms baselines on the new AiMDoom dataset.

Flowing Datasets with Wasserstein over Wasserstein Gradient Flows

Researchers from ENSAE, CREST, and IP Paris introduce a theoretically grounded framework for optimizing functionals on data represented as probability distributions over probability distributions, utilizing Wasserstein over Wasserstein (WoW) gradient flows. This method establishes a rigorous differential structure for these higher-order spaces and provides a tractable, particle-based approach for tasks like domain adaptation, dataset distillation, and transfer learning, showing improved accuracy and efficiency.

Illustrator's Depth: Monocular Layer Index Prediction for Image Decomposition
21 Nov 2025
We introduce Illustrator's Depth, a novel definition of depth that addresses a key challenge in digital content creation: decomposing flat images into editable, ordered layers. Inspired by an artist's compositional process, illustrator's depth assigns a layer index to each pixel, forming an interpretable image decomposition through a discrete, globally consistent ordering of elements optimized for editability. We also propose and train a neural network, using a curated dataset of layered vector graphics, to predict layering directly from raster inputs. Our layer index inference unlocks a range of powerful downstream applications. In particular, it significantly outperforms state-of-the-art baselines for image vectorization while also enabling high-fidelity text-to-vector-graphics generation, automatic 3D relief generation from 2D images, and intuitive depth-aware editing. By reframing depth from a physical quantity to a creative abstraction, illustrator's depth prediction offers a new foundation for editable image decomposition.
A Computable Measure of Suboptimality for Entropy-Regularised Variational Objectives
Several emerging post-Bayesian methods target a probability distribution for which an entropy-regularised variational objective is minimised. This increased flexibility introduces a computational challenge, as one loses access to an explicit unnormalised density for the target. To mitigate this difficulty, we introduce a novel measure of suboptimality called 'gradient discrepancy', and in particular a 'kernel' gradient discrepancy (KGD) that can be explicitly computed. In the standard Bayesian context, KGD coincides with the kernel Stein discrepancy (KSD), and we obtain a novel characterisation of KSD as measuring the size of a variational gradient. Outside this familiar setting, KGD enables novel sampling algorithms to be developed and compared, even when unnormalised densities cannot be obtained. To illustrate this point, several novel algorithms are proposed and studied, including a natural generalisation of Stein variational gradient descent, with applications to mean-field neural networks and predictively oriented posteriors. On the theoretical side, our principal contribution is to establish sufficient conditions for desirable properties of KGD, such as continuity and convergence control.
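In the standard Bayesian setting the abstract mentions, the kernel Stein discrepancy can be estimated in closed form from samples and the target's score function. A minimal 1-D sketch with an RBF kernel (the bandwidth h is a free choice, and this classical KSD V-statistic is background material, not the paper's KGD code):

```python
import numpy as np

def ksd_rbf_1d(x, score, h=1.0):
    """V-statistic estimate of the squared kernel Stein discrepancy for
    1-D samples x against a target with score function `score`, using
    the RBF kernel k(x, y) = exp(-(x - y)^2 / (2 h^2))."""
    x = np.asarray(x, dtype=float)
    d = x[:, None] - x[None, :]           # pairwise differences x_i - x_j
    k = np.exp(-d**2 / (2 * h**2))        # kernel matrix
    s = score(x)                          # score at each sample
    # Stein kernel u_p(x_i, x_j), term by term:
    term1 = s[:, None] * s[None, :] * k   # s(x) s(y) k(x, y)
    term2 = s[:, None] * (d / h**2) * k   # s(x) d_y k(x, y)
    term3 = s[None, :] * (-d / h**2) * k  # s(y) d_x k(x, y)
    term4 = (1.0 / h**2 - d**2 / h**4) * k  # d_x d_y k(x, y)
    return (term1 + term2 + term3 + term4).mean()
```

Samples that actually come from the target should score lower than mismatched ones, which gives a quick sanity check.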
OpenStreetView-5M: The Many Roads to Global Visual Geolocation
29 Apr 2024
Determining the location of an image anywhere on Earth is a complex visual task, which makes it particularly relevant for evaluating computer vision algorithms. Yet, the absence of standard, large-scale, open-access datasets with reliably localizable images has limited its potential. To address this issue, we introduce OpenStreetView-5M, a large-scale, open-access dataset comprising over 5.1 million geo-referenced street view images, covering 225 countries and territories. In contrast to existing benchmarks, we enforce a strict train/test separation, allowing us to evaluate the relevance of learned geographical features beyond mere memorization. To demonstrate the utility of our dataset, we conduct an extensive benchmark of various state-of-the-art image encoders, spatial representations, and training strategies. All associated code and models can be found at this https URL.
How far can we go with ImageNet for Text-to-Image generation?
02 Oct 2025
Recent text-to-image (T2I) generation models have achieved remarkable success by training on billion-scale datasets, following a 'bigger is better' paradigm that prioritizes data quantity over availability (closed vs. open source) and reproducibility (data decay vs. established collections). We challenge this established paradigm by demonstrating that one can match the capabilities of models trained on massive web-scraped collections using only ImageNet enhanced with well-designed text and image augmentations. With this much simpler setup, we achieve a +6% overall score over SD-XL on GenEval and +5% on DPGBench while using just 1/10th the parameters and 1/1000th the training images. We also show that ImageNet-pretrained models can be finetuned on task-specific datasets (e.g., for high-resolution aesthetic applications) with good results, indicating that ImageNet is sufficient for acquiring general capabilities. This opens the way for more reproducible research, as ImageNet is widely available and the proposed standardized training setup requires only 500 H100 GPU-hours to train a text-to-image model.
Long Story Short: Story-level Video Understanding from 20K Short Films
10 Jan 2025
Recent developments in vision-language models have significantly advanced video understanding. Existing datasets and tasks, however, have notable limitations. Most datasets are confined to short videos with limited events and narrow narratives. For example, datasets with instructional and egocentric videos often depict activities of one person in a single scene. Although existing movie datasets offer richer content, they are often limited to short-term tasks, lack publicly available videos, and frequently encounter data leakage issues given the use of subtitles and other information about commercial movies during LLM pretraining. To address the above limitations, we propose Short-Films 20K (SF20K), the largest publicly available movie dataset. SF20K is composed of 20,143 amateur films and offers long-term video tasks in the form of multiple-choice and open-ended question answering. Our extensive analysis of SF20K reveals minimal data leakage, emphasizes the need for long-term reasoning, and demonstrates the strong performance of recent VLMs. Finally, we show that instruction tuning on the SF20K-Train set substantially improves model performance, paving the way for future progress in long-term video understanding.
From stability of Langevin diffusion to convergence of proximal MCMC for non-log-concave sampling
29 Aug 2025
We consider the problem of sampling from distributions stemming from non-convex potentials with the Unadjusted Langevin Algorithm (ULA). We prove the stability of the discrete-time ULA to drift approximations under the assumption that the potential is strongly convex at infinity. In many contexts, e.g. imaging inverse problems, potentials are non-convex and non-smooth. The Proximal Stochastic Gradient Langevin Algorithm (PSGLA) is a popular algorithm for handling such potentials: it combines the forward-backward optimization algorithm with a ULA step. Our main stability result, combined with properties of the Moreau envelope, allows us to derive the first proof of convergence of PSGLA for non-convex potentials. We empirically validate our methodology on synthetic data and in the context of imaging inverse problems. In particular, we observe that PSGLA exhibits faster convergence rates than the Stochastic Gradient Langevin Algorithm for posterior sampling while preserving its restoration properties.
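One PSGLA iteration, as described, is a forward Langevin step on the smooth part of the potential followed by a backward proximal step on the non-smooth part. A minimal sketch for a potential U(x) = f(x) + g(x) with g = |·| (the test potential and step size are illustrative choices, not the paper's experiments):

```python
import numpy as np

def soft_threshold(y, t):
    """Proximal operator of t * |x| (soft-thresholding)."""
    return np.sign(y) * np.maximum(np.abs(y) - t, 0.0)

def psgla(grad_f, prox_g, x0, step, n_steps, rng):
    """Proximal Stochastic Gradient Langevin Algorithm sketch:
    a ULA move on the smooth part f, then a proximal step on the
    non-smooth part g, iterated to produce approximate samples."""
    x = x0
    samples = []
    for _ in range(n_steps):
        noise = np.sqrt(2 * step) * rng.standard_normal(np.shape(x))
        y = x - step * grad_f(x) + noise  # forward Langevin step on f
        x = prox_g(y, step)               # backward proximal step on g
        samples.append(x)
    return np.array(samples)

# Illustrative target: U(x) = x^2 / 2 + |x|, run as 500 parallel chains
rng = np.random.default_rng(0)
chain = psgla(grad_f=lambda x: x, prox_g=soft_threshold,
              x0=np.zeros(500), step=0.05, n_steps=400, rng=rng)
```

The target here is symmetric with lighter-than-Gaussian tails, so the final samples should concentrate around zero with standard deviation below one.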
On the Moreau envelope properties of weakly convex functions
13 Nov 2025
In this document, we present the main properties satisfied by the Moreau envelope of weakly convex functions. The Moreau envelope was introduced in convex optimization to regularize convex functionals while preserving their global minimizers. However, the Moreau envelope is also defined for the more general class of weakly convex functions and can be a useful tool for optimization in this context. The main properties of the Moreau envelope have been demonstrated for convex functions and are generalized to weakly convex functions in various works. This document summarizes the vast literature on the properties of the Moreau envelope and provides the associated proofs.
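The object under discussion is easy to evaluate numerically in one dimension: env_γ f(x) = min_y f(y) + (x − y)² / (2γ). A brute-force sketch; for f = |·| the envelope is the Huber function, a standard fact that doubles as a correctness check (the grid-based minimisation is illustrative, not a practical solver):

```python
import numpy as np

def moreau_envelope(f, x, gamma, grid):
    """Brute-force 1-D Moreau envelope:
    env_gamma f(x) = min_y f(y) + (x - y)^2 / (2 * gamma),
    with the minimisation taken over a fixed grid of candidate y."""
    return np.min(f(grid) + (x - grid) ** 2 / (2 * gamma))

def huber(x, gamma):
    """Closed-form Moreau envelope of f(x) = |x| (the Huber function)."""
    return x**2 / (2 * gamma) if abs(x) <= gamma else abs(x) - gamma / 2
```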
Unfolding Generative Flows with Koopman Operators: Fast and Interpretable Sampling
Continuous Normalizing Flows (CNFs) enable elegant generative modeling but remain bottlenecked by slow sampling: producing a single sample requires solving a nonlinear ODE with hundreds of function evaluations. Recent approaches such as Rectified Flow and OT-CFM accelerate sampling by straightening trajectories, yet the learned dynamics remain nonlinear black boxes, limiting both efficiency and interpretability. We propose a fundamentally different perspective: globally linearizing flow dynamics via Koopman theory. By lifting Conditional Flow Matching (CFM) into a higher-dimensional Koopman space, we represent its evolution with a single linear operator. This yields two key benefits. First, sampling becomes one-step and parallelizable, computed in closed form via the matrix exponential. Second, the Koopman operator provides a spectral blueprint of generation, enabling novel interpretability through its eigenvalues and modes. We derive a practical, simulation-free training objective that enforces infinitesimal consistency with the teacher's dynamics and show that this alignment preserves fidelity along the full generative path, distinguishing our method from boundary-only distillation. Empirically, our approach achieves competitive sample quality with dramatic speedups, while uniquely enabling spectral analysis of generative flows.
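The claim that sampling reduces to a single closed-form map is easiest to see on a toy linear system: once dynamics are linear, as in the lifted Koopman space, the flow over any horizon t is z ↦ exp(tK) z. A self-contained sketch (the series-based exponential and the rotation generator are illustrative; the paper's operator is learned, not hand-written):

```python
import numpy as np

def expm_taylor(A, terms=30):
    """Matrix exponential via truncated Taylor series
    (adequate for small ||A||; a sketch, not a production routine)."""
    out = np.eye(A.shape[0])
    term = np.eye(A.shape[0])
    for k in range(1, terms):
        term = term @ A / k
        out = out + term
    return out

# Toy Koopman picture: linear dynamics dz/dt = K z are solved in one
# step over any horizon t by the closed-form map z -> exp(t K) z.
K = np.array([[0.0, 1.0], [-1.0, 0.0]])  # generator of a rotation
z0 = np.array([1.0, 0.0])
z1 = expm_taylor(np.pi / 2 * K) @ z0     # one step: a quarter turn
```

The spectral side of the abstract corresponds to inspecting the eigenvalues and eigenvectors of K, which for this generator are ±i with rotating modes.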
Training-Free Synthetic Data Generation with Dual IP-Adapter Guidance
26 Sep 2025
Few-shot image classification remains challenging due to the limited availability of labeled examples. Recent approaches have explored generating synthetic training data using text-to-image diffusion models, but often require extensive model fine-tuning or external information sources. We present a novel training-free approach, called DIPSY, that leverages IP-Adapter for image-to-image translation to generate highly discriminative synthetic images using only the available few-shot examples. DIPSY introduces three key innovations: (1) an extended classifier-free guidance scheme that enables independent control over positive and negative image conditioning; (2) a class similarity-based sampling strategy that identifies effective contrastive examples; and (3) a simple yet effective pipeline that requires no model fine-tuning or external captioning and filtering. Experiments across ten benchmark datasets demonstrate that our approach achieves state-of-the-art or comparable performance, while eliminating the need for generative model adaptation or reliance on external tools for caption generation and image filtering. Our results highlight the effectiveness of leveraging dual image prompting with positive-negative guidance for generating class-discriminative features, particularly for fine-grained classification tasks.
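Classifier-free guidance with independent positive and negative conditioning, as point (1) describes, is commonly combined linearly in the noise-prediction space. A plausible sketch of such a combination (this linear form and the weight names are assumptions for illustration; DIPSY's exact guidance rule may differ):

```python
import numpy as np

def dual_guidance(eps_uncond, eps_pos, eps_neg, w_pos, w_neg):
    """Combine unconditional, positively conditioned, and negatively
    conditioned noise predictions: w_pos pulls the sample toward the
    positive image condition, w_neg pushes it away from the negative
    one, and the two weights can be set independently."""
    return (eps_uncond
            + w_pos * (eps_pos - eps_uncond)
            - w_neg * (eps_neg - eps_uncond))
```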
The quest for the GRAph Level autoEncoder (GRALE)
Although graph-based learning has attracted a lot of attention, graph representation learning is still a challenging task whose resolution may impact key application fields such as chemistry or biology. To this end, we introduce GRALE, a novel graph autoencoder that encodes and decodes graphs of varying sizes into a shared embedding space. GRALE is trained using an Optimal Transport-inspired loss that compares the original and reconstructed graphs and leverages a differentiable node matching module, which is trained jointly with the encoder and decoder. The proposed attention-based architecture relies on Evoformer, the core component of AlphaFold, which we extend to support both graph encoding and decoding. We show, in numerical experiments on simulated and molecular data, that GRALE enables a highly general form of pre-training, applicable to a wide range of downstream tasks, from classification and regression to more complex tasks such as graph interpolation, editing, matching, and prediction.
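A differentiable node-matching module of the kind described can be realised with entropic optimal transport via Sinkhorn iterations; this is one standard construction, assumed here for illustration rather than taken from GRALE's code:

```python
import numpy as np

def sinkhorn(cost, a, b, eps=0.1, n_iters=200):
    """Entropic OT plan between marginals a and b via Sinkhorn
    iterations: every operation is differentiable, so the resulting
    soft matching can be trained jointly with an encoder/decoder."""
    K = np.exp(-cost / eps)      # Gibbs kernel from the cost matrix
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iters):     # alternate marginal scalings
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]
```

With a cost favouring the diagonal, the plan concentrates on the identity matching while still satisfying both marginals.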
Text-Driven 3D Hand Motion Generation from Sign Language Data
21 Aug 2025
Our goal is to train a generative model of 3D hand motions, conditioned on natural language descriptions specifying motion characteristics such as handshapes, locations, and finger/hand/arm movements. To this end, we automatically build pairs of 3D hand motions and their associated textual labels at unprecedented scale. Specifically, we leverage a large-scale sign language video dataset, along with noisy pseudo-annotated sign categories, which we translate into hand motion descriptions via an LLM that utilizes a dictionary of sign attributes, as well as our complementary motion-script cues. This data enables training a text-conditioned hand motion diffusion model, HandMDM, that is robust across domains, including unseen sign categories from the same sign language, signs from another sign language, and non-sign hand movements. We contribute an extensive experimental investigation of these scenarios and will make our trained models and data publicly available to support future research in this relatively new field.
Superconductivity in the two-dimensional Hubbard model revealed by neural quantum states
Whether the ground state of the square lattice Hubbard model exhibits superconductivity remains a major open question, central to understanding high-temperature cuprate superconductors and ultra-cold fermions in optical lattices. Numerical studies have found evidence for stripe-ordered states and superconductivity at strong coupling, but the phase diagram remains controversial. Here, we show that one can resolve the subtle energetics of metallic, superconducting, and stripe phases using a new class of neural quantum state (NQS) wavefunctions that extend hidden fermion determinant states to Pfaffians. We simulate several hundred electrons using fast Pfaffian algorithms, allowing us to measure off-diagonal long-range order. At strong coupling and low hole-doping, we find that a non-superconducting filled stripe phase prevails, while superconductivity coexisting with partially-filled stripes is stabilized by a negative next-nearest-neighbor hopping t′, with |t′| > 0.1. At larger doping levels, we introduce momentum-space correlation functions to mitigate finite-size effects that arise from weakly-bound pairs. These provide evidence for uniform d-wave superconductivity at U = 4, even when t′ = 0. Our results highlight the potential of NQS approaches and provide a fresh perspective on superconductivity in the square lattice Hubbard model.
Provable Convergence and Limitations of Geometric Tempering for Langevin Dynamics
07 Apr 2025
Geometric tempering is a popular approach to sampling from challenging multi-modal probability distributions by instead sampling from a sequence of distributions which interpolate, using the geometric mean, between an easier proposal distribution and the target distribution. In this paper, we theoretically investigate the soundness of this approach when the sampling algorithm is Langevin dynamics, proving both upper and lower bounds. Our upper bounds provide the first analysis of this scheme in the literature under functional inequalities. They establish the convergence of tempered Langevin in continuous and discrete time, and their minimization leads to closed-form optimal tempering schedules for some pairs of proposal and target distributions. Our lower bounds exhibit a simple case where geometric tempering takes exponential time, and further reveal that geometric tempering can suffer from poor functional inequalities and slow convergence, even when the target distribution is well-conditioned. Overall, our results indicate that geometric tempering may not help convergence, and can even be harmful.
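The geometric-mean interpolation described above, π_β ∝ π_0^(1−β) π_1^β, has log-density (and hence score) equal to the convex combination of the endpoints' log-densities, which makes tempered Langevin easy to sketch. A minimal version with a Gaussian proposal and Gaussian target (the schedule and step size are illustrative choices, not the paper's optimal schedules):

```python
import numpy as np

def tempered_langevin(score0, score1, x, betas, step, inner, rng):
    """Langevin dynamics along the geometric tempering path
    pi_beta ∝ pi_0^(1 - beta) * pi_1^beta: since log pi_beta is the
    convex combination of the endpoint log-densities, so is its score."""
    for beta in betas:
        for _ in range(inner):
            s = (1 - beta) * score0(x) + beta * score1(x)
            x = x + step * s + np.sqrt(2 * step) * rng.standard_normal(x.shape)
    return x

# Illustrative pair: proposal N(0, 2^2), target N(3, 1)
rng = np.random.default_rng(0)
x = 2.0 * rng.standard_normal(500)
x = tempered_langevin(score0=lambda x: -x / 4.0,
                      score1=lambda x: -(x - 3.0),
                      x=x, betas=np.linspace(0.0, 1.0, 21),
                      step=0.05, inner=50, rng=rng)
```

After the schedule reaches β = 1, the particles should be distributed approximately as the target N(3, 1).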
Dynamics of small bubbles in turbulence in non-dilute conditions
Turbulent flows laden with small bubbles are ubiquitous in many natural and industrial environments. From the point of view of numerical modeling, to be able to handle a very large number of small bubbles in direct numerical simulations, one traditionally relies on the one-way coupling paradigm. There, bubbles are passively advected and are non-interacting, implicitly assuming dilute conditions. Here, we study bubbles that are four-way coupled, where both the feedback on the fluid and excluded-volume interactions between bubbles are taken into account. We find that, while the back-reaction from the bubble phase onto the fluid phase remains energetically small under most circumstances, the excluded-volume interactions between bubbles can have a significant influence on the Lagrangian statistics of the bubble dynamics. We show that as the volume fraction of bubbles increases, the preferential concentration of bubbles in filamentary high-vorticity regions decreases as these strong vortical structures get filled up; this happens at a volume fraction of around one percent for Re_λ = O(10^2). We furthermore study the influence on the Lagrangian velocity structure function as well as pair dispersion, and find that, while the mean dispersive behavior remains close to that obtained from one-way coupling simulations, some evident signatures of bubble collisions can be retrieved from the structure functions and the distribution of the dispersion, even at very small volume fractions. This work not only teaches us about the circumstances under which four-way coupling becomes important, but also opens up new directions towards probing and ultimately manipulating coherent vortical structures in small-scale turbulence using bubbles.
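The Lagrangian velocity structure function mentioned above is simply S₂(τ) = ⟨(v(t + τ) − v(t))²⟩ along trajectories. A minimal estimator from a uniformly sampled velocity signal (the estimator is standard; the sinusoidal signal in the check below is purely synthetic, not simulation data):

```python
import numpy as np

def structure_function(v, lag):
    """Second-order structure function S2(lag) = <(v(t + lag) - v(t))^2>,
    estimated from a uniformly sampled 1-D velocity signal by averaging
    squared increments over all available start times."""
    dv = v[lag:] - v[:-lag]
    return np.mean(dv ** 2)
```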
Order Matters: 3D Shape Generation from Sequential VR Sketches
VR sketching lets users explore and iterate on ideas directly in 3D, offering a faster and more intuitive alternative to conventional CAD tools. However, existing sketch-to-shape models ignore the temporal ordering of strokes, discarding crucial cues about structure and design intent. We introduce VRSketch2Shape, the first framework and multi-category dataset for generating 3D shapes from sequential VR sketches. Our contributions are threefold: (i) an automated pipeline that generates sequential VR sketches from arbitrary shapes, (ii) a dataset of over 20k synthetic and 900 hand-drawn sketch-shape pairs across four categories, and (iii) an order-aware sketch encoder coupled with a diffusion-based 3D generator. Our approach yields higher geometric fidelity than prior work, generalizes effectively from synthetic to real sketches with minimal supervision, and performs well even on partial sketches. All data and models will be released open-source at this https URL.