This survey provides a comprehensive review of Optimal Transport (OT) theory, with a focus on its computational methods and applications in data sciences. It highlights how entropic regularization, particularly through the Sinkhorn-Knopp algorithm, has made OT computationally feasible for large-scale problems, detailing various formulations and their use across machine learning, computer vision, and statistics.
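As a concrete illustration of the entropic approach highlighted above, here is a minimal NumPy sketch of the Sinkhorn-Knopp iterations for entropy-regularized OT between two discrete histograms; the variable names (`a`, `b`, `C`, `eps`) and the toy 1-D example are illustrative and not taken from the survey.

```python
import numpy as np

def sinkhorn(a, b, C, eps=0.1, n_iters=500):
    """Entropy-regularized OT between histograms a and b with cost matrix C.

    Returns an (approximate) optimal coupling P with row sums a and column sums b.
    """
    K = np.exp(-C / eps)          # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)         # scale columns to match marginal b
        u = a / (K @ v)           # scale rows to match marginal a
    return u[:, None] * K * v[None, :]

# toy example: two Gaussian-like histograms on a 1-D grid
x = np.linspace(0, 1, 50)
a = np.exp(-((x - 0.3) ** 2) / 0.01); a /= a.sum()
b = np.exp(-((x - 0.7) ** 2) / 0.01); b /= b.sum()
C = (x[:, None] - x[None, :]) ** 2   # squared Euclidean cost
P = sinkhorn(a, b, C)
print("regularized transport cost:", np.sum(P * C))
```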
Gabriel Peyré's course notes introduce Optimal Transport, a mathematical framework for comparing and manipulating probability distributions, balancing theoretical rigor with computational methods. The notes cover the Monge and Kantorovich formulations, the Wasserstein distance, and practical algorithms like Sinkhorn, enabling applications in machine learning such as distribution matching, generative modeling, and image processing.
Atlas, a retrieval-augmented language model, achieves state-of-the-art few-shot learning performance on knowledge-intensive tasks, often surpassing purely parametric LLMs with significantly fewer parameters. It demonstrates that effective few-shot learning can be accomplished by integrating external knowledge sources, reducing reliance on in-parameter memorization.
This paper demonstrates that 'lazy training,' where deep neural networks behave linearly, is a general property of differentiable models driven by scaling, and critically shows that this regime leads to degraded generalization performance in practical deep convolutional neural networks. The findings suggest that the success of deep learning in real-world tasks likely stems from a non-lazy regime involving substantial non-linear feature learning.
Addressing real-world optimization problems becomes particularly challenging when analytic objective functions or constraints are unavailable. While numerous studies have addressed the issue of unknown objectives, limited research has focused on scenarios where feasibility constraints are not given explicitly. Overlooking these constraints can lead to spurious solutions that are unrealistic in practice. To deal with such unknown constraints, we propose to perform optimization within the data manifold using diffusion models. To constrain the optimization process to the data manifold, we reformulate the original optimization problem as a sampling problem from the product of the Boltzmann distribution defined by the objective function and the data distribution learned by the diffusion model. Depending on the differentiability of the objective function, we propose two different sampling methods. For differentiable objectives, we propose a two-stage framework that begins with a guided diffusion process for warm-up, followed by a Langevin dynamics stage for further correction. For non-differentiable objectives, we propose an iterative importance sampling strategy using the diffusion model as the proposal distribution. Comprehensive experiments on a synthetic dataset, six real-world black-box optimization datasets, and a multi-objective molecule optimization dataset show that our method achieves better or comparable performance with previous state-of-the-art baselines.
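A minimal sketch of what the Langevin correction stage for differentiable objectives might look like, assuming a pre-trained score network approximating the gradient of the log data density; `objective`, `score_net`, `temperature` and the step size are placeholders, and the paper's actual two-stage procedure (guided diffusion warm-up followed by correction) is more involved.

```python
import torch

def langevin_correction(x, objective, score_net, temperature=1.0,
                        step=1e-3, n_steps=200):
    """Langevin dynamics targeting p(x) proportional to exp(-f(x)/T) * p_data(x).

    `score_net(x)` is assumed to approximate the score of the data
    distribution (e.g. a diffusion model's score at a small noise level).
    """
    for _ in range(n_steps):
        x = x.detach().requires_grad_(True)
        f = objective(x).sum()
        grad_f = torch.autograd.grad(f, x)[0]
        drift = -grad_f / temperature + score_net(x)   # gradient of log of the product density
        noise = torch.randn_like(x)
        x = x + step * drift + (2 * step) ** 0.5 * noise
    return x.detach()
```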
We consider the problem of sampling distributions stemming from non-convex potentials with the Unadjusted Langevin Algorithm (ULA). We prove the stability of the discrete-time ULA to drift approximations under the assumption that the potential is strongly convex at infinity. In many contexts, e.g. imaging inverse problems, potentials are non-convex and non-smooth. The Proximal Stochastic Gradient Langevin Algorithm (PSGLA) is a popular algorithm to handle such potentials. It combines the forward-backward optimization algorithm with a ULA step. Our main stability result combined with properties of the Moreau envelope allows us to derive the first proof of convergence of the PSGLA for non-convex potentials. We empirically validate our methodology on synthetic data and in the context of imaging inverse problems. In particular, we observe that PSGLA exhibits faster convergence rates than the Stochastic Gradient Langevin Algorithm for posterior sampling while preserving its restoration properties.
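A hedged NumPy sketch of a generic PSGLA iteration as described above: a noisy gradient (ULA) step on the smooth part of the potential followed by a proximal step on the non-smooth part; `grad_f`, `prox_g` and the step size are placeholders rather than the paper's exact implementation.

```python
import numpy as np

def psgla(x0, grad_f, prox_g, step=1e-3, n_iters=5000, rng=None):
    """Proximal Stochastic Gradient Langevin Algorithm (sketch).

    Targets a density proportional to exp(-(f(x) + g(x))) with f smooth
    (possibly non-convex) and g non-smooth but proximable. Each iteration
    is a noisy forward (gradient) step on f followed by a backward
    (proximal) step on g.
    """
    rng = rng or np.random.default_rng(0)
    x = x0.copy()
    samples = []
    for _ in range(n_iters):
        noise = rng.standard_normal(x.shape)
        x = x - step * grad_f(x) + np.sqrt(2 * step) * noise  # ULA step on f
        x = prox_g(x, step)                                   # proximal step on g
        samples.append(x.copy())
    return np.array(samples)

# example prox for an l1 term g(x) = lam * |x|_1: soft-thresholding
soft_thresh = lambda x, t: np.sign(x) * np.maximum(np.abs(x) - t, 0.0)
```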
Composed Image Retrieval (CoIR) has recently gained popularity as a task that considers both text and image queries together, to search for relevant images in a database. Most CoIR approaches require manually annotated datasets, comprising image-text-image triplets, where the text describes a modification from the query image to the target image. However, manual curation of CoIR triplets is expensive and prevents scalability. In this work, we instead propose a scalable automatic dataset creation methodology that generates triplets given video-caption pairs, while also expanding the scope of the task to include composed video retrieval (CoVR). To this end, we mine paired videos with a similar caption from a large database, and leverage a large language model to generate the corresponding modification text. Applying this methodology to the extensive WebVid2M collection, we automatically construct our WebVid-CoVR dataset, resulting in 1.6 million triplets. Moreover, we introduce a new benchmark for CoVR with a manually annotated evaluation set, along with baseline results. We further validate that our methodology is equally applicable to image-caption pairs, by generating 3.3 million CoIR training triplets using the Conceptual Captions dataset. Our model builds on BLIP-2 pretraining, adapting it to composed video (or image) retrieval, and incorporates an additional caption retrieval loss to exploit extra supervision beyond the triplet. We provide extensive ablations to analyze the design choices on our new CoVR benchmark. Our experiments also demonstrate that training a CoVR model on our datasets effectively transfers to CoIR, leading to improved state-of-the-art performance in the zero-shot setup on the CIRR, FashionIQ, and CIRCO benchmarks. Our code, datasets, and models are publicly available at this https URL ventural/covr.
Optimal Transport (OT) has recently emerged as a central tool in data sciences to compare in a geometrically faithful way point clouds and more generally probability distributions. The wide adoption of OT into existing data analysis and machine learning pipelines is however plagued by several shortcomings. This includes its lack of robustness to outliers, its high computational costs, the need for a large number of samples in high dimension and the difficulty to handle data in distinct spaces. In this review, we detail several recently proposed approaches to mitigate these issues. We insist in particular on unbalanced OT, which compares arbitrary positive measures, not restricted to probability distributions (i.e. their total mass can vary). This generalization of OT makes it robust to outliers and missing data. The second workhorse of modern computational OT is entropic regularization, which leads to scalable algorithms while lowering the sample complexity in high dimension. The last point presented in this review is the Gromov-Wasserstein (GW) distance, which extends OT to cope with distributions belonging to different metric spaces. The main motivation for this review is to explain how unbalanced OT, entropic regularization and GW can work hand-in-hand to turn OT into efficient geometric loss functions for data sciences.
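To make the unbalanced and entropic ingredients concrete, here is a hedged NumPy sketch of Sinkhorn-style scaling iterations for unbalanced entropic OT with KL penalties of weight `rho` on the marginals; the damped exponent `rho / (rho + eps)` is the standard form for this relaxation, but parameter names and defaults are illustrative.

```python
import numpy as np

def unbalanced_sinkhorn(a, b, C, eps=0.1, rho=1.0, n_iters=500):
    """Entropic unbalanced OT with KL marginal penalties of weight rho (sketch).

    Compared to balanced Sinkhorn, the scaling updates are damped by the
    exponent rho / (rho + eps); as rho grows this recovers the balanced
    updates, while small rho tolerates mass mismatch, missing data and
    outliers.
    """
    K = np.exp(-C / eps)
    tau = rho / (rho + eps)
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iters):
        u = (a / (K @ v)) ** tau
        v = (b / (K.T @ u)) ** tau
    return u[:, None] * K * v[None, :]
```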
Denoising diffusions are state-of-the-art generative models exhibiting remarkable empirical performance. They work by diffusing the data distribution into a Gaussian distribution and then learning to reverse this noising process to obtain synthetic datapoints. The denoising diffusion relies on approximations of the logarithmic derivatives of the noised data densities using score matching. Such models can also be used to perform approximate posterior simulation when one can only sample from the prior and likelihood. We propose a unifying framework generalising this approach to a wide class of spaces and leading to an original extension of score matching. We illustrate the resulting models on various applications.
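A minimal PyTorch sketch of standard (Euclidean, variance-exploding) denoising score matching, the ingredient the paper generalizes to a wider class of spaces; `score_net` is a placeholder network taking the noised sample and the noise scale, and the weighting shown is one common choice rather than the paper's.

```python
import torch

def dsm_loss(score_net, x0, sigma_min=0.01, sigma_max=1.0):
    """Denoising score matching loss (sketch).

    Perturb data with Gaussian noise of random scale sigma and train
    score_net(x_t, sigma) to match the score of the noised conditional
    density, which for Gaussian noise equals -(x_t - x0) / sigma^2.
    """
    log_min = torch.log(torch.tensor(sigma_min))
    log_max = torch.log(torch.tensor(sigma_max))
    sigma = torch.exp(torch.rand(x0.shape[0], 1) * (log_max - log_min) + log_min)
    noise = torch.randn_like(x0)
    x_t = x0 + sigma * noise
    target = -noise / sigma                        # score of N(x_t; x0, sigma^2 I)
    pred = score_net(x_t, sigma)
    return ((sigma * (pred - target)) ** 2).mean()  # sigma^2-weighted objective
```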
Transformers are deep architectures that define "in-context mappings" which enable predicting new tokens based on a given set of tokens (such as a prompt in NLP applications or a set of patches for a vision transformer). In this work, we study in particular the ability of these architectures to handle an arbitrarily large number of context tokens. To mathematically, uniformly address their expressivity, we consider the case that the mappings are conditioned on a context represented by a probability distribution of tokens which becomes discrete for a finite number of these. The relevant notion of smoothness then corresponds to continuity in terms of the Wasserstein distance between these contexts. We demonstrate that deep transformers are universal and can approximate continuous in-context mappings to arbitrary precision, uniformly over compact token domains. A key aspect of our results, compared to existing findings, is that for a fixed precision, a single transformer can operate on an arbitrary (even infinite) number of tokens. Additionally, it operates with a fixed embedding dimension of tokens (this dimension does not increase with precision) and a fixed number of heads (proportional to the dimension). The use of MLPs between multi-head attention layers is also explicitly controlled. We consider both unmasked attentions (as used for the vision transformer) and masked causal attentions (as used for NLP and time series applications). We tackle the causal setting leveraging a space-time lifting to analyze causal attention as a mapping over probability distributions of tokens.
We consider the problem of minimizing a function over the manifold of orthogonal matrices. The majority of algorithms for this problem compute a direction in the tangent space, and then use a retraction to move in that direction while staying on the manifold. Unfortunately, the numerical computation of retractions on the orthogonal manifold always involves some expensive linear algebra operation, such as matrix inversion, exponential or square-root. These operations quickly become expensive as the dimension of the matrices grows. To bypass this limitation, we propose the landing algorithm which does not use retractions. The algorithm is not constrained to stay on the manifold but its evolution is driven by a potential energy which progressively attracts it towards the manifold. One iteration of the landing algorithm only involves matrix multiplications, which makes it cheap compared to its retraction counterparts. We provide an analysis of the convergence of the algorithm, and demonstrate its promises on large-scale and deep learning problems, where it is faster and less prone to numerical errors than retraction-based methods.
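A hedged NumPy sketch of a single landing-style update as described in the abstract: a relative-gradient term built from the skew-symmetric part of the Euclidean gradient times the transpose of the iterate, plus an attraction term pulling the iterate back toward orthogonality; the step size and attraction weight are illustrative placeholders.

```python
import numpy as np

def landing_step(X, grad_f, eta=0.1, lam=1.0):
    """One iteration of a landing-type update on the orthogonal manifold (sketch).

    The update mixes a relative-gradient term, which decreases the objective
    while approximately rotating X, with an attraction term
    lam * (X X^T - I) X that pulls the iterate back toward orthogonality.
    Only matrix multiplications are involved: no retraction, inverse,
    exponential or square root.
    """
    G = grad_f(X)
    A = G @ X.T
    skew = 0.5 * (A - A.T)                 # skew-symmetric part of G X^T
    landing_field = skew @ X + lam * (X @ X.T - np.eye(X.shape[0])) @ X
    return X - eta * landing_field
```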
Reinforcement Learning from Human Feedback (RLHF) has become a popular approach to align language models (LMs) with human preferences. This method involves collecting a large dataset of human pairwise preferences across various text generations and using it to infer (implicitly or explicitly) a reward model. Numerous methods have been proposed to learn the reward model and align an LM with it. However, the costly process of collecting human preferences has received little attention and could benefit from theoretical insights. This paper addresses this issue and aims to formalize reward-model training in RLHF. We frame the selection of an effective dataset as a simple regret minimization task, using a linear contextual dueling bandit method. Given the potentially large number of arms, this approach is more coherent than the best-arm identification setting. We then propose an offline framework for solving this problem. Under appropriate assumptions - linearity of the reward model in the embedding space, and boundedness of the reward parameter - we derive bounds on the simple regret. Finally, we provide a lower bound that matches our upper bound up to constant and logarithmic terms. To our knowledge, this is the first theoretical contribution in this area to provide an offline approach as well as worst-case guarantees.
Transformers exhibit compositional reasoning on sequences not observed during training, a capability often attributed to in-context learning (ICL) and skill composition. We investigate this phenomenon using the Random Hierarchy Model (RHM), a probabilistic context-free grammar that generates sequences through recursive rule application. Models are trained on subsets of sequences and evaluated across four generalization conditions: memorization, in-distribution generalization, out-of-distribution generalization with the same rules, and cross-layer transfer. Behaviorally, performance improves systematically with task complexity and the number of in-context examples, with out-of-distribution tasks requiring substantially more examples than in-distribution scenarios. Mechanistically, we identify a progressive emergence of layer specialization during training that correlates with generalization performance. Principal component analysis and attention pattern clustering reveal that transformers develop structured, hierarchically organized representations in specialized layers. These results demonstrate that transformers develop modular, interpretable mechanisms supporting compositional reasoning, linking internal algorithmic structure to observed behavioral capabilities.
Most automatic speech processing systems register degraded performance when applied to noisy or reverberant speech. But how can one tell whether speech is noisy or reverberant? We propose Brouhaha, a neural network jointly trained to extract speech/non-speech segments, speech-to-noise ratios, and C50 room acoustics from single-channel recordings. Brouhaha is trained using a data-driven approach in which noisy and reverberant audio segments are synthesized. We first evaluate its performance and demonstrate that the proposed multi-task regime is beneficial. We then present two scenarios illustrating how Brouhaha can be used on naturally noisy and reverberant data: 1) to investigate the errors made by a speaker diarization model (this http URL); and 2) to assess the reliability of an automatic speech recognition model (Whisper from OpenAI). Both our pipeline and a pretrained model are open source and shared with the speech community.
Gabriel Peyré, affiliated with CNRS and ENS, provides a comprehensive overview of the mathematical foundations enabling modern artificial intelligence, particularly focusing on analytical and probabilistic tools for neural network architectures and optimization. The article demonstrates how diverse mathematical disciplines underpin AI advancements while simultaneously showcasing how AI problems catalyze new mathematical development.
(Author affiliations for the gravitational-wave analysis below: an extensive list of institutions from the LIGO and Virgo collaborations and partner observatories, omitted here.)
The ever-increasing number of detections of gravitational waves (GWs) from compact binaries by the Advanced LIGO and Advanced Virgo detectors allows us to perform ever-more sensitive tests of general relativity (GR) in the dynamical and strong-field regime of gravity. We perform a suite of tests of GR using the compact binary signals observed during the second half of the third observing run of those detectors. We restrict our analysis to the 15 confident signals that have false alarm rates $\leq 10^{-3}\,\mathrm{yr}^{-1}$. In addition to signals consistent with binary black hole (BH) mergers, the new events include GW200115_042309, a signal consistent with a neutron star--BH merger. We find the residual power, after subtracting the best fit waveform from the data for each event, to be consistent with the detector noise. Additionally, we find all the post-Newtonian deformation coefficients to be consistent with the predictions from GR, with an improvement by a factor of ~2 in the -1PN parameter. We also find that the spin-induced quadrupole moments of the binary BH constituents are consistent with those of Kerr BHs in GR. We find no evidence for dispersion of GWs, non-GR modes of polarization, or post-merger echoes in the events that were analyzed. We update the bound on the mass of the graviton, at 90% credibility, to $m_g \leq 2.42 \times 10^{-23}\,\mathrm{eV}/c^2$. The final mass and final spin as inferred from the pre-merger and post-merger parts of the waveform are consistent with each other. The studies of the properties of the remnant BHs, including deviations of the quasi-normal mode frequencies and damping times, show consistency with the predictions of GR. In addition to considering signals individually, we also combine results from the catalog of GW signals to calculate more precise population constraints. We find no evidence in support of physics beyond GR.
A rigorous proof of convergence for Reservoir Computing to a deterministic recurrent kernel in the infinite-width limit, establishing an O(1/√N) rate, is presented. Additionally, Structured Reservoir Computing (SRC) is introduced, reducing the computational complexity of the recurrent step from O(N^2) to O(N log N) and demonstrating comparable performance on chaotic time series prediction tasks at significantly larger reservoir sizes.
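For context, a minimal NumPy sketch of the standard dense reservoir recurrence, whose per-step cost is O(N^2); the structured variant (SRC) replaces the dense matrix-vector product with fast structured transforms to reach O(N log N), which is not implemented here. Hyperparameters are illustrative.

```python
import numpy as np

def run_reservoir(inputs, n_reservoir=500, spectral_radius=0.9, seed=0):
    """Standard (dense) reservoir computing recurrence (sketch).

    x_{t+1} = tanh(W x_t + W_in u_t), with a dense random W whose
    matrix-vector product costs O(N^2) per step. A linear readout is
    then fit by regression on the collected states.
    """
    rng = np.random.default_rng(seed)
    dim_in = inputs.shape[1]
    W = rng.standard_normal((n_reservoir, n_reservoir)) / np.sqrt(n_reservoir)
    W *= spectral_radius / np.max(np.abs(np.linalg.eigvals(W)))
    W_in = rng.standard_normal((n_reservoir, dim_in))
    x = np.zeros(n_reservoir)
    states = []
    for u in inputs:
        x = np.tanh(W @ x + W_in @ u)
        states.append(x.copy())
    return np.array(states)
```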
Researchers from Inria, ENS, CNRS, and PSL introduce WARI and SMS, two new evaluation measures for time series segmentation, alongside a formal typology of segmentation errors. These measures enhance the interpretability of segmentation quality by accounting for temporal error positions and specific error types, providing diagnostic insights into algorithm performance.
Recent methods for visual question answering rely on large-scale annotated datasets. Manual annotation of questions and answers for videos, however, is tedious, expensive and prevents scalability. In this work, we propose to avoid manual annotation and generate a large-scale training dataset for video question answering making use of automatic cross-modal supervision. We leverage a question generation transformer trained on text data and use it to generate question-answer pairs from transcribed video narrations. Given narrated videos, we then automatically generate the HowToVQA69M dataset with 69M video-question-answer triplets. To handle the open vocabulary of diverse answers in this dataset, we propose a training procedure based on a contrastive loss between a video-question multi-modal transformer and an answer transformer. We introduce the zero-shot VideoQA task and show excellent results, in particular for rare answers. Furthermore, we demonstrate our method to significantly outperform the state of the art on MSRVTT-QA, MSVD-QA, ActivityNet-QA and How2QA. Finally, for a detailed evaluation we introduce iVQA, a new VideoQA dataset with reduced language biases and high-quality redundant manual annotations. Our code, datasets and trained models are available at this https URL.
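A hedged PyTorch sketch of a batch-wise contrastive loss between a video-question embedding and an answer embedding, in the spirit of the training procedure described above; the temperature and the symmetric cross-entropy (InfoNCE-style) form are generic choices, not necessarily the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def contrastive_vqa_loss(vq_embed, ans_embed, temperature=0.07):
    """Contrastive objective (sketch) between video-question and answer embeddings.

    Within a batch, the i-th video-question pair should match the i-th
    answer; all other answers act as negatives, and symmetrically for
    answers matched against video-question pairs.
    """
    vq = F.normalize(vq_embed, dim=-1)
    ans = F.normalize(ans_embed, dim=-1)
    logits = vq @ ans.T / temperature
    targets = torch.arange(logits.shape[0], device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.T, targets))
```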
Traditional Reinforcement Learning from Human Feedback (RLHF) often relies on reward models, frequently assuming preference structures like the Bradley-Terry model, which may not accurately capture the complexities of real human preferences (e.g., intransitivity). Nash Learning from Human Feedback (NLHF) offers a more direct alternative by framing the problem as finding a Nash equilibrium of a game defined by these preferences. In this work, we introduce Nash Mirror Prox ($\mathtt{Nash-MP}$), an online NLHF algorithm that leverages the Mirror Prox optimization scheme to achieve fast and stable convergence to the Nash equilibrium. Our theoretical analysis establishes that Nash-MP exhibits last-iterate linear convergence towards the $\beta$-regularized Nash equilibrium. Specifically, we prove that the KL-divergence to the optimal policy decreases at a rate of order $(1+2\beta)^{-N/2}$, where $N$ is the number of preference queries. We further demonstrate last-iterate linear convergence for the exploitability gap and uniformly for the span semi-norm of log-probabilities, with all these rates being independent of the size of the action space. Furthermore, we propose and analyze an approximate version of Nash-MP where proximal steps are estimated using stochastic policy gradients, making the algorithm closer to applications. Finally, we detail a practical implementation strategy for fine-tuning large language models and present experiments that demonstrate its competitive performance and compatibility with existing methods.
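To illustrate the underlying optimization scheme, here is a hedged NumPy sketch of mirror prox (extragradient with KL geometry) applied to a $\beta$-regularized symmetric preference game on a finite action set; the payoff matrix `P`, the uniform reference policy, and the step size are illustrative, and the paper's Nash-MP for language-model policies is substantially more elaborate.

```python
import numpy as np

def nash_mirror_prox(P, beta=0.1, eta=0.5, n_iters=200):
    """Mirror prox for a beta-regularized preference game (sketch).

    P[i, j] is the probability that action i is preferred to action j.
    The update targets the regularized equilibrium of the symmetric game,
    with a uniform reference policy.
    """
    n = P.shape[0]
    ref = np.full(n, 1.0 / n)
    pi = ref.copy()

    def grad(p, opponent):
        # gradient of p^T P opponent - beta * KL(p || ref) w.r.t. p (up to a constant)
        return P @ opponent - beta * np.log(p / ref)

    def kl_prox(base, g):
        # entropic (KL) proximal step from `base` along gradient g
        new = base * np.exp(eta * g)
        return new / new.sum()

    for _ in range(n_iters):
        look_ahead = kl_prox(pi, grad(pi, pi))          # extrapolation step
        pi = kl_prox(pi, grad(look_ahead, look_ahead))  # update with look-ahead gradient
    return pi
```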