With the growing popularity of deep learning and foundation models for tabular data, the need for standardized and reliable benchmarks is higher than ever. However, current benchmarks are static. Their design is not updated even if flaws are discovered, model versions are updated, or new models are released. To address this, we introduce TabArena, the first continuously maintained living tabular benchmarking system. To launch TabArena, we manually curate a representative collection of datasets and well-implemented models, conduct a large-scale benchmarking study to initialize a public leaderboard, and assemble a team of experienced maintainers. Our results highlight the influence of the validation method and of ensembling hyperparameter configurations when benchmarking models at their full potential. While gradient-boosted trees are still strong contenders on practical tabular datasets, we observe that deep learning methods have caught up under larger time budgets with ensembling. At the same time, foundation models excel on smaller datasets. Finally, we show that ensembles across models advance the state of the art in tabular machine learning. We observe that some deep learning models are overrepresented in cross-model ensembles due to validation-set overfitting, and we encourage model developers to address this issue. We launch TabArena with a public leaderboard, reproducible code, and maintenance protocols to create a living benchmark available at this https URL.
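As a rough illustration of the "ensembling of hyperparameter configurations" mentioned above, the sketch below implements greedy weighted ensemble selection over validation predictions (in the spirit of Caruana-style post-hoc ensembling). It is a minimal, hypothetical example with placeholder names, not TabArena's actual implementation.

```python
import numpy as np

def greedy_weighted_ensemble(val_preds, y_val, n_rounds=25):
    """Greedy ensemble selection over validation predictions.

    val_preds: array of shape (n_models, n_samples), each row a model's
               validation predictions (e.g. probabilities for class 1).
    y_val:     array of shape (n_samples,) with validation targets.
    Returns per-model weights chosen to minimize validation RMSE.
    """
    n_models = val_preds.shape[0]
    counts = np.zeros(n_models, dtype=int)
    ensemble = np.zeros_like(y_val, dtype=float)

    for r in range(1, n_rounds + 1):
        # Try adding each model (with replacement) and keep the best one.
        losses = [
            np.sqrt(np.mean(((ensemble * (r - 1) + p) / r - y_val) ** 2))
            for p in val_preds
        ]
        best = int(np.argmin(losses))
        counts[best] += 1
        ensemble = (ensemble * (r - 1) + val_preds[best]) / r

    return counts / counts.sum()
```

The resulting weights are then applied to the corresponding test predictions of each configuration.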
Inference from tabular data, collections of continuous and categorical variables organized into matrices, is a foundation for modern technology and science. Yet, in contrast to the explosive changes in the rest of AI, best practice for these predictive tasks has remained relatively unchanged and is still primarily based on variations of Gradient Boosted Decision Trees (GBDTs). Very recently, there has been renewed interest in developing state-of-the-art methods for tabular data based on recent developments in neural networks and feature learning methods. In this work, we introduce xRFM, an algorithm that combines feature learning kernel machines with a tree structure to both adapt to the local structure of the data and scale to essentially unlimited amounts of training data. We show that, compared to 31 other methods, including recently introduced tabular foundation models (TabPFNv2) and GBDTs, xRFM achieves the best performance across 100 regression datasets and is competitive with the best methods across 200 classification datasets, outperforming GBDTs. Additionally, xRFM provides interpretability natively through the Average Gradient Outer Product.
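The Average Gradient Outer Product referred to at the end is, in its generic form, M = (1/n) Σᵢ ∇f(xᵢ)∇f(xᵢ)ᵀ for a trained predictor f. The sketch below estimates it with finite differences for an arbitrary black-box predictor; it is a minimal illustration of the quantity itself, not xRFM's own routine.

```python
import numpy as np

def average_gradient_outer_product(predict, X, eps=1e-4):
    """Estimate M = (1/n) * sum_i grad f(x_i) grad f(x_i)^T for a
    scalar-valued predictor via central finite differences.

    predict: callable mapping an array of shape (m, d) to shape (m,).
    X:       array of shape (n, d) of input points.
    """
    n, d = X.shape
    M = np.zeros((d, d))
    for x in X:
        grad = np.zeros(d)
        for j in range(d):
            e = np.zeros(d)
            e[j] = eps
            grad[j] = (predict((x + e)[None]) - predict((x - e)[None]))[0] / (2 * eps)
        M += np.outer(grad, grad)
    return M / n

# Eigenvectors of M with large eigenvalues indicate the feature directions
# the predictor is most sensitive to, which underlies the interpretability
# claim in the abstract.
```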
Symbolic regression, the task of predicting the mathematical expression of a function from the observation of its values, is a difficult task which usually involves a two-step procedure: predicting the "skeleton" of the expression up to the choice of numerical constants, then fitting the constants by optimizing a non-convex loss function. The dominant approach is genetic programming, which evolves candidates by iterating this subroutine a large number of times. Neural networks have recently been tasked to predict the correct skeleton in a single try, but remain much less powerful. In this paper, we challenge this two-step procedure, and task a Transformer to directly predict the full mathematical expression, constants included. One can subsequently refine the predicted constants by feeding them to the non-convex optimizer as an informed initialization. We present ablations to show that this end-to-end approach yields better results, sometimes even without the refinement step. We evaluate our model on problems from the SRBench benchmark and show that our model approaches the performance of state-of-the-art genetic programming with several orders of magnitude faster inference.
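The refinement step described above amounts to handing the transformer-predicted constants to a standard non-convex optimizer as an informed initialization. A minimal sketch with SciPy's BFGS on a mean-squared-error loss is given below; the expression, data, and optimizer choice are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np
from scipy.optimize import minimize

def refine_constants(skeleton, const_init, X, y):
    """Refine the numerical constants of a predicted expression.

    skeleton:   callable f(X, consts) implementing the predicted expression.
    const_init: constants predicted end-to-end by the transformer, used as
                the informed initialization for the non-convex optimizer.
    """
    def loss(c):
        return np.mean((skeleton(X, c) - y) ** 2)

    result = minimize(loss, x0=np.asarray(const_init, dtype=float), method="BFGS")
    return result.x

# Hypothetical example with ground truth y = 2.0 * sin(3.0 * x) + 0.5:
X = np.random.rand(256, 1) * 2 * np.pi
y = 2.0 * np.sin(3.0 * X[:, 0]) + 0.5
expr = lambda X, c: c[0] * np.sin(c[1] * X[:, 0]) + c[2]
consts = refine_constants(expr, const_init=[1.9, 3.1, 0.4], X=X, y=y)
```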
Researchers from DeepMind and collaborators introduce Continuous Diffusion for Categorical Data (CDCD), a framework that applies continuous diffusion models to discrete data like text by embedding tokens in Euclidean space. The framework, which enables capabilities such as efficient sampling and classifier-free guidance, achieves favorable MAUVE scores on text generation tasks while demonstrating areas for improvement in machine translation.
Researchers at ENS and Inria introduced HowTo100M, a dataset of 136 million video clips derived from narrated instructional videos, alongside a powerful text-video embedding model. The model, trained on this weakly supervised data, achieved state-of-the-art results in text-to-video retrieval and action localization across various benchmarks, demonstrating robust transferability and significantly reducing the need for manual annotations by achieving SOTA with only 20% of MSR-VTT data.
MindEye introduces a sophisticated framework for reconstructing and retrieving viewed images from human fMRI activity, achieving state-of-the-art accuracy in both semantic and perceptual details. The framework integrates deep MLPs, a diffusion prior, and novel contrastive learning techniques to translate brain signals into high-fidelity visual representations and enable fine-grained image retrieval.
Researchers at MIT's CSAIL developed Particle Guidance, a framework to enhance the diversity and sample efficiency of diffusion models by jointly guiding a set of particles with a time-evolving potential. The method improved mode recovery in synthetic tests, boosted both recall and precision in molecular conformer generation, and enhanced text-to-image diversity while maintaining sample quality without retraining the base model.
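Conceptually, the joint guidance adds the gradient of a time-evolving potential over the whole particle set to each particle's reverse-diffusion drift. The sketch below illustrates this with a pairwise RBF repulsion term and a placeholder score function; the specific potential and all names are assumptions for illustration, not the authors' code.

```python
import numpy as np

def rbf_repulsion_grad(x_set, i, bandwidth=1.0):
    """Gradient w.r.t. particle i of Phi = sum_{j != i} exp(-||x_i - x_j||^2 / (2 h^2)).
    Stepping along -grad Phi pushes particles apart."""
    diffs = x_set[i] - np.delete(x_set, i, axis=0)                    # (n-1, d)
    w = np.exp(-np.sum(diffs ** 2, axis=1) / (2 * bandwidth ** 2))    # (n-1,)
    return -(w[:, None] * diffs / bandwidth ** 2).sum(axis=0)

def guided_reverse_step(x_set, score_fn, t, step=1e-2, guidance=0.5):
    """One illustrative Langevin-style reverse step for a set of particles,
    combining the learned score with the repulsive joint-guidance term."""
    new = np.empty_like(x_set)
    for i, x in enumerate(x_set):
        drift = score_fn(x, t) - guidance * rbf_repulsion_grad(x_set, i)
        new[i] = x + step * drift + np.sqrt(2 * step) * np.random.randn(*x.shape)
    return new
```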
Gonçalves et al. conducted the first comprehensive theoretical investigation of both spinless and spinful charge excitations in moiré Fractional Chern Insulators, using large-scale exact diagonalization. The study explains the experimentally observed hierarchy of activation gaps in twisted MoTe₂, finding that spinful gaps are consistently larger than spinless ones, and reveals that FCI quasiparticles exhibit significant energy dispersion.
Addressing real-world optimization problems becomes particularly challenging when analytic objective functions or constraints are unavailable. While numerous studies have addressed the issue of unknown objectives, limited research has focused on scenarios where feasibility constraints are not given explicitly. Overlooking these constraints can lead to spurious solutions that are unrealistic in practice. To deal with such unknown constraints, we propose to perform optimization within the data manifold using diffusion models. To constrain the optimization process to the data manifold, we reformulate the original optimization problem as a sampling problem from the product of the Boltzmann distribution defined by the objective function and the data distribution learned by the diffusion model. Depending on the differentiability of the objective function, we propose two different sampling methods. For differentiable objectives, we propose a two-stage framework that begins with a guided diffusion process for warm-up, followed by a Langevin dynamics stage for further correction. For non-differentiable objectives, we propose an iterative importance sampling strategy using the diffusion model as the proposal distribution. Comprehensive experiments on a synthetic dataset, six real-world black-box optimization datasets, and a multi-objective molecule optimization dataset show that our method achieves performance better than or comparable to previous state-of-the-art baselines.
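For the differentiable case, the target distribution is the product p(x) ∝ exp(-f(x)/T) · p_data(x), so its log-gradient is the sum of -∇f(x)/T and the data score. The sketch below shows a Langevin correction stage under that assumption, with placeholder callables for the objective gradient and the learned score; it is a minimal illustration, not the authors' two-stage pipeline.

```python
import numpy as np

def langevin_correction(x, objective_grad, data_score, n_steps=200,
                        step=1e-3, temperature=1.0):
    """Langevin sampling from p(x) ∝ exp(-f(x)/T) * p_data(x).

    objective_grad: callable returning ∇f(x) for the differentiable objective.
    data_score:     callable returning ∇ log p_data(x), e.g. a diffusion
                    model's score evaluated at (or near) t = 0.
    """
    for _ in range(n_steps):
        grad_log_p = -objective_grad(x) / temperature + data_score(x)
        x = x + step * grad_log_p + np.sqrt(2 * step) * np.random.randn(*x.shape)
    return x
```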
Child-centered long-form recordings are essential for studying early language development, but existing speech models trained on clean adult data perform poorly due to acoustic and linguistic differences. We introduce BabyHuBERT, the first self-supervised speech representation model trained on 13,000 hours of multilingual child-centered long-form recordings spanning over 40 languages. We evaluate BabyHuBERT on speaker segmentation, identifying when target children speak versus female adults, male adults, or other children -- a fundamental preprocessing step for analyzing naturalistic language experiences. BabyHuBERT achieves F1-scores from 52.1% to 74.4% across six diverse datasets, consistently outperforming W2V2-LL4300 (trained on English long-forms) and standard HuBERT (trained on clean adult speech). Notable improvements include 13.2 absolute F1 points over HuBERT on Vanuatu and 15.9 points on Solomon Islands corpora, demonstrating effectiveness on underrepresented languages. By sharing code and models, BabyHuBERT serves as a foundation model for child speech research, enabling fine-tuning on diverse downstream tasks.
We consider the problem of sampling distributions stemming from non-convex potentials with the Unadjusted Langevin Algorithm (ULA). We prove the stability of the discrete-time ULA to drift approximations under the assumption that the potential is strongly convex at infinity. In many contexts, e.g., imaging inverse problems, potentials are non-convex and non-smooth. The Proximal Stochastic Gradient Langevin Algorithm (PSGLA) is a popular algorithm to handle such potentials. It combines the forward-backward optimization algorithm with a ULA step. Our main stability result, combined with properties of the Moreau envelope, allows us to derive the first proof of convergence of the PSGLA for non-convex potentials. We empirically validate our methodology on synthetic data and in the context of imaging inverse problems. In particular, we observe that PSGLA exhibits faster convergence rates than the Stochastic Gradient Langevin Algorithm for posterior sampling while preserving its restoration properties.
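For concreteness, one PSGLA iteration as described above takes a gradient-plus-noise (ULA) step on the smooth part f and then applies the proximal operator of the non-smooth part g. The sketch below uses an ℓ1 term as a concrete non-smooth example; the potentials and step size are placeholders, not tied to the paper's experiments.

```python
import numpy as np

def soft_threshold(x, tau):
    """Proximal operator of tau * ||x||_1."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def psgla(x0, grad_f, gamma, lam, n_iter=1000):
    """Proximal Stochastic Gradient Langevin Algorithm for a potential
    U(x) = f(x) + lam * ||x||_1, the non-smooth part handled by its prox.

    One iteration: x <- prox_{gamma*g}( x - gamma*grad_f(x) + sqrt(2*gamma)*xi ).
    """
    x = x0.copy()
    samples = []
    for _ in range(n_iter):
        noise = np.sqrt(2 * gamma) * np.random.randn(*x.shape)
        x = soft_threshold(x - gamma * grad_f(x) + noise, gamma * lam)
        samples.append(x.copy())
    return np.array(samples)
```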
Model and hyperparameter selection are critical but challenging in machine learning, typically requiring expert intuition or expensive automated search. We investigate whether large language models (LLMs) can act as in-context meta-learners for this task. By converting each dataset into interpretable metadata, we prompt an LLM to recommend both model families and hyperparameters. We study two prompting strategies: (1) a zero-shot mode relying solely on pretrained knowledge, and (2) a meta-informed mode augmented with examples of models and their performance on past tasks. Across synthetic and real-world benchmarks, we show that LLMs can exploit dataset metadata to recommend competitive models and hyperparameters without search, and that improvements from meta-informed prompting demonstrate their capacity for in-context meta-learning. These results highlight a promising new role for LLMs as lightweight, general-purpose assistants for model selection and hyperparameter optimization.
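A minimal sketch of the meta-informed mode: dataset metadata and a few (model, hyperparameters, score) records from past tasks are serialized into a prompt, and the LLM's reply is parsed as a recommendation. The query_llm callable and the prompt wording are hypothetical placeholders, not the paper's code.

```python
import json

def build_prompt(metadata, past_results):
    """metadata: dict such as {"n_samples": 10000, "n_features": 42,
    "task": "binary classification", "missing_values": True}.
    past_results: list of dicts like {"dataset": ..., "model": ...,
    "hyperparameters": {...}, "score": ...} used as in-context examples."""
    examples = "\n".join(json.dumps(r) for r in past_results)
    return (
        "You are a model-selection assistant.\n"
        f"Past tasks and results:\n{examples}\n\n"
        f"New dataset metadata:\n{json.dumps(metadata)}\n\n"
        "Recommend a model family and hyperparameters as JSON with keys "
        "'model' and 'hyperparameters'."
    )

def recommend(metadata, past_results, query_llm):
    """query_llm: any callable mapping a prompt string to the LLM's text
    reply (hypothetical; substitute the API client of your choice)."""
    reply = query_llm(build_prompt(metadata, past_results))
    return json.loads(reply)  # e.g. {"model": "XGBoost", "hyperparameters": {...}}
```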
Optimal Transport (OT) has recently emerged as a central tool in data sciences to compare, in a geometrically faithful way, point clouds and more generally probability distributions. The wide adoption of OT into existing data analysis and machine learning pipelines is however plagued by several shortcomings. These include its lack of robustness to outliers, its high computational cost, the need for a large number of samples in high dimension, and the difficulty of handling data in distinct spaces. In this review, we detail several recently proposed approaches to mitigate these issues. We focus in particular on unbalanced OT, which compares arbitrary positive measures, not restricted to probability distributions (i.e., their total mass can vary). This generalization of OT makes it robust to outliers and missing data. The second workhorse of modern computational OT is entropic regularization, which leads to scalable algorithms while lowering the sample complexity in high dimension. The last point presented in this review is the Gromov-Wasserstein (GW) distance, which extends OT to cope with distributions belonging to different metric spaces. The main motivation for this review is to explain how unbalanced OT, entropic regularization and GW can work hand-in-hand to turn OT into efficient geometric loss functions for data sciences.
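To make the entropic-regularization point concrete, the sketch below shows the standard Sinkhorn matrix-scaling iterations for the regularized balanced problem; the unbalanced variant with KL marginal penalties is obtained by raising each scaling update to the power rho/(rho+eps). This is a textbook-style sketch, not tied to any specific implementation discussed in the review.

```python
import numpy as np

def sinkhorn(a, b, C, eps=0.05, n_iter=500):
    """Entropic-regularized OT between histograms a (n,) and b (m,)
    with cost matrix C (n, m).

    Solves  min_P <P, C> + eps * KL(P | a b^T)  s.t.  P 1 = a,  P^T 1 = b
    by alternating scaling of the Gibbs kernel K = exp(-C / eps).
    (Unbalanced case: replace the hard marginal constraints with KL penalties
    and raise each scaling update to the power rho / (rho + eps).)
    """
    K = np.exp(-C / eps)
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iter):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]   # transport plan P
```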
Understanding the relationships among genes, compounds, and their interactions in living organisms remains limited due to technological constraints and the complexity of biological data. Deep learning has shown promise in exploring these relationships using various data types. However, transcriptomics, which provides detailed insights into cellular states, is still underused due to its high noise levels and limited data availability. Recent advancements in transcriptomics sequencing provide new opportunities to uncover valuable insights, especially with the rise of many new foundation models for transcriptomics, yet no benchmark has been established to robustly evaluate the effectiveness of these emerging models for perturbation analysis. This article presents a novel biologically motivated evaluation framework and a hierarchy of perturbation analysis tasks for comparing the performance of pretrained foundation models to each other and to more classical techniques of learning from transcriptomics data. We compile diverse public datasets from different sequencing techniques and cell lines to assess model performance. Our approach identifies scVI and PCA as far better suited than existing foundation models for understanding biological perturbations, especially in their application to real-world scenarios.
Efficient learning of quantum state properties is both a fundamental and practical problem in quantum information theory. Classical shadows have emerged as an efficient method for estimating properties of unknown quantum states, with rigorous statistical guarantees, by performing randomized measurements on a small number of copies. With the advent of photonic technologies, formulating efficient learning algorithms for such platforms arises as a natural problem. Here, we introduce a classical shadow protocol for learning photonic quantum states via randomized passive linear optical transformations and photon-number measurement. We show that this scheme is efficient for a large class of observables of interest. We experimentally demonstrate our findings on a twelve-mode photonic integrated quantum processing unit. Our protocol allows for scalable learning of a wide range of photonic state properties and paves the way to applying the already rich variety of applications of classical shadows to photonic platforms.
LiDAR semantic segmentation is crucial for autonomous vehicles and mobile robots, requiring high accuracy and real-time processing, especially on resource-constrained embedded systems. Previous state-of-the-art methods often face a trade-off between accuracy and speed. Point-based and sparse convolution-based methods are accurate but slow due to the complexity of neighbor searching and 3D convolutions. Projection-based methods are faster but lose critical geometric information during the 2D projection. Additionally, many recent methods rely on test-time augmentation (TTA) to improve performance, which further slows the inference. Moreover, the pre-processing phase across all methods increases execution time and is demanding on embedded platforms. Therefore, we introduce HARP-NeXt, a high-speed and accurate LiDAR semantic segmentation network. We first propose a novel pre-processing methodology that significantly reduces computational overhead. Then, we design the Conv-SE-NeXt feature extraction block to efficiently capture representations without deep layer stacking per network stage. We also employ a multi-scale range-point fusion backbone that leverages information at multiple abstraction levels to preserve essential geometric details, thereby enhancing accuracy. Experiments on the nuScenes and SemanticKITTI benchmarks show that HARP-NeXt achieves a superior speed-accuracy trade-off compared to all state-of-the-art methods, and, without relying on ensemble models or TTA, is comparable to the top-ranked PTv3, while running 24× faster. The code is available at this https URL
Scaling relations between galactic parameters represent key pieces of evidence for investigating the processes of galaxy formation and evolution. In most studies, these relations have been obtained for large portions of the galaxies (i.e., on kpc scales), but it is also important to evaluate these relations on smaller scales. In this work, we used optical data cubes of a subsample of nearby galaxies of the DIVING 3D survey. These allowed us to analyze the scaling relations involving stellar velocity dispersion, stellar population age, and stellar population metallicity in the nuclear and circumnuclear regions of galaxies. We detected correlations between the stellar velocity dispersion and the age, metallicity, and total stellar mass. These correlations are independent of galaxy inclination, considering all morphological types, nuclear activity, and the presence or absence of galactic bars. We detected, for the first time, a correlation between the stellar velocity dispersion and stellar metallicity in the nuclear regions of galaxies. It is found to be qualitatively consistent with the well-known stellar mass-metallicity relation. We also noted that barred galaxies tend to show younger and less metal-rich stellar populations than unbarred galaxies in the central regions, which may be a consequence of the bar triggering star formation in the nuclear regions of these objects. Some active galactic nuclei (AGNs) in our sample are positioned above the observed correlation between stellar velocity dispersion and stellar population age, suggesting that their nuclear stellar populations are younger than expected. This may be a consequence of positive AGN feedback, triggering star formation. Conversely, starburst galaxies do not show nuclear stellar populations at ages over one billion years.
We present a multiphase, resolved study of the galactic wind extending from the nearby starburst galaxy NGC 4666. For this we use VLT/MUSE observations from the GECKOS program and HI data from the WALLABY survey. We identify both ionised and HI gas in a biconical structure extending to at least z ∼ 8 kpc from the galaxy disk, with increasing velocity offsets above the midplane in both phases, consistent with a multiphase wind. The measured electron density, using [SII], differs significantly from standard expectations of galactic winds. We find that the electron density declines from the galaxy centre to ∼2 kpc, then rises again, remaining high (∼100–300 cm⁻³) out to ∼5 kpc. We find that HI dominates the mass loading. The total HI mass outflow rate (above z > 2 kpc) is between 5 and 13 M⊙ yr⁻¹, accounting for uncertainties from disk-blurring and group interactions. The total ionised mass outflow rate (traced by Hα) is between 0.5 M⊙ yr⁻¹ and 5 M⊙ yr⁻¹, depending on assumptions about nₑ(z). From ALMA/ACA observations, we place an upper limit on the CO flux in the outflow, which corresponds to ≲2.9 M⊙ yr⁻¹. We also show that the entire outflow is not limited to the bicone: a secondary starburst at the edge generates a more widespread outflow, which should be included in simulations. The cool gas in the NGC 4666 wind has insufficient velocity to escape the halo of a galaxy of its mass, especially because most of the mass is in the slower atomic phase. This strong biconical wind contributes to gas cycling around the galaxy.
Neural Ordinary Differential Equations (Neural ODEs) are the continuous analog of Residual Neural Networks (ResNets). We investigate whether the discrete dynamics defined by a ResNet are close to the continuous ones of a Neural ODE. We first quantify the distance between the ResNet's hidden state trajectory and the solution of its corresponding Neural ODE. Our bound is tight and, on the negative side, does not go to 0 with depth N if the residual functions are not smooth with depth. On the positive side, we show that this smoothness is preserved by gradient descent for a ResNet with linear residual functions and small enough initial loss. It ensures an implicit regularization towards a limit Neural ODE at rate 1/N, uniformly with depth and optimization time. As a byproduct of our analysis, we consider the use of a memory-free discrete adjoint method to train a ResNet by recovering the activations on the fly through a backward pass of the network, and show that this method theoretically succeeds at large depth if the residual functions are Lipschitz with the input. We then show that Heun's method, a second-order ODE integration scheme, allows for better gradient estimation with the adjoint method when the residual functions are smooth with depth. We experimentally validate that our adjoint method succeeds at large depth, and that Heun's method needs fewer layers to succeed. We finally use the adjoint method successfully for fine-tuning very deep ResNets without memory consumption in the residual layers.
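One way to realize the memory-free activation recovery described above is to invert each residual block during the backward pass with a few fixed-point iterations, which is valid when the residual functions are sufficiently Lipschitz, matching the conditions discussed in the abstract. The sketch below illustrates only the activation reconstruction (not the full adjoint gradient computation) and is an assumption-laden illustration rather than the authors' code.

```python
import numpy as np

def forward(x, residual_fns):
    """ResNet forward pass x_{n+1} = x_n + f_n(x_n); only the final
    activation is kept, which is what makes the method memory-free."""
    for f in residual_fns:
        x = x + f(x)
    return x

def reconstruct_activations(x_final, residual_fns, n_fixed_point=3):
    """Recover intermediate activations during the backward pass by inverting
    each block: solve x_n = x_{n+1} - f_n(x_n) with a few fixed-point steps.
    Convergence requires each f_n to be contractive (Lipschitz constant < 1)."""
    xs = [x_final]
    x = x_final
    for f in reversed(residual_fns):
        y = x  # initial guess for x_n
        for _ in range(n_fixed_point):
            y = x - f(y)
        x = y
        xs.append(x)
    return xs[::-1]   # activations ordered from input to output
```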
We present a sample of six F200W and three F277W dropout sources identified as z ≳ 16 candidates. The candidates present mass-weighted ages around 30 Myr and attenuations A(V) < 0.1 mag. Their average stellar mass is M⋆ ∼ 10⁷ M⊙, implying a stellar-to-baryon mass fraction around 10% if the emissivity increases with redshift, or significantly higher otherwise. Three candidates present very blue UV spectral slopes (β ∼ −3) compatible with young (≲10 Myr) Pop III stars and/or high escape fractions of ionizing photons; the rest have β ∼ −2.5, similar to z = 10–12 samples.