Institute of Science and Technology Austria (ISTA)
This paper argues that the choice of optimizer qualitatively alters the properties of learned solutions in deep neural networks, acting as a powerful mechanism for encoding inductive biases beyond architecture and data. Illustrative experiments demonstrate that non-diagonal preconditioners reduce catastrophic forgetting in continual learning by producing more localized representations, and the paper reinterprets sparsity-inducing reparameterizations as optimizer designs.
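To make the contrast with diagonal methods concrete, below is a minimal NumPy sketch of a full-matrix (non-diagonal) Adagrad-style preconditioned update; the accumulation rule, step size, and damping are illustrative assumptions and not the paper's experimental setup.

```python
import numpy as np

def full_matrix_preconditioned_step(w, grad, G_accum, lr=0.1, eps=1e-8):
    """One step of a non-diagonal (full-matrix Adagrad-style) update.

    Diagonal optimizers such as Adam rescale each coordinate independently;
    a non-diagonal preconditioner mixes coordinates through the inverse
    square root of the accumulated gradient outer-product matrix.
    Illustrative sketch only; hyperparameters are assumptions.
    """
    G_accum = G_accum + np.outer(grad, grad)              # accumulate curvature proxy
    evals, evecs = np.linalg.eigh(G_accum + eps * np.eye(len(w)))
    P = evecs @ np.diag(evals ** -0.5) @ evecs.T          # dense preconditioner G^{-1/2}
    return w - lr * (P @ grad), G_accum
```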
The MARLIN kernel provides an efficient method for batched inference of 4-bit quantized Large Language Models (LLMs). It achieves near-optimal speedups across various batch sizes by effectively managing memory bandwidth, dequantization, and Tensor Core utilization.
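As context for what such a kernel must do on the fly, here is a hedged PyTorch sketch of 4-bit weight dequantization with per-group scales; the packing layout, the symmetric zero-point, and the group size are assumptions for illustration and do not reflect MARLIN's actual memory layout or fused GPU pipeline.

```python
import torch

def dequantize_int4(packed, scales, group_size=128):
    """Illustrative 4-bit weight dequantization (not the MARLIN kernel).

    packed: (out, in//2) uint8 tensor holding two 4-bit values per byte;
    scales: (out, in//group_size) per-group scales. Symmetric quantization
    around zero is assumed.
    """
    lo = (packed & 0x0F).to(torch.int8) - 8          # low nibble  -> [-8, 7]
    hi = (packed >> 4).to(torch.int8) - 8            # high nibble -> [-8, 7]
    q = torch.stack((lo, hi), dim=-1).flatten(-2)    # (out, in) integer weights
    s = scales.repeat_interleave(group_size, dim=1)  # broadcast per-group scales
    return q.to(scales.dtype) * s                    # dequantized weights
```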
Quantizing the weights of large language models (LLMs) from 16-bit to lower bitwidth is the de facto approach to deploy massive transformers onto more affordable accelerators. While GPTQ emerged as one of the standard methods for one-shot post-training quantization at LLM scale, its inner workings are described as a sequence of ad-hoc algebraic updates that obscure any geometric meaning or worst-case guarantees. In this work, we show that, when executed back-to-front (from the last to the first dimension) for a linear layer, GPTQ is mathematically identical to Babai's nearest plane algorithm for the classical closest vector problem (CVP) on a lattice defined by the Hessian matrix of the layer's inputs. This equivalence is based on a sophisticated mathematical argument, and has two analytical consequences: first, the GPTQ error propagation step gains an intuitive geometric interpretation; second, GPTQ inherits the error upper bound of Babai's algorithm under the assumption that no weights are clipped. Leveraging this bound, we design post-training quantization methods that avoid clipping and outperform the original GPTQ. In addition, we provide efficient GPU inference kernels for the resulting representation. Taken together, these results place GPTQ on a firm theoretical footing and open the door to importing decades of progress in lattice algorithms towards the design of future quantization algorithms for billion-parameter models.
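For reference, a minimal NumPy sketch of Babai's nearest-plane rounding (via QR orthogonalization) is shown below; it illustrates the back-to-front structure the equivalence refers to, but the unscaled integer lattice and the variable names are illustrative assumptions rather than the paper's GPTQ formulation.

```python
import numpy as np

def babai_nearest_plane(B, t):
    """Babai's nearest-plane algorithm: find integer z with B @ z close to t.

    B: (d, d) lattice basis (columns are basis vectors); t: (d,) target vector.
    Sketch of the classical CVP heuristic, not the GPTQ-equivalent formulation
    with the Hessian-defined lattice described in the abstract.
    """
    Q, R = np.linalg.qr(B)                 # Gram-Schmidt orthogonalization via QR
    y = Q.T @ t                            # target in the orthogonalized frame
    d = B.shape[1]
    z = np.zeros(d)
    for j in reversed(range(d)):           # back-to-front, as in GPTQ's update order
        z[j] = np.round((y[j] - R[j, j + 1:] @ z[j + 1:]) / R[j, j])
    return B @ z, z
```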
The population of the Little Red Dots (LRDs) may represent a key phase of supermassive black hole (SMBH) growth. A cocoon of dense excited gas is emerging as a key component to explain the most striking properties of LRDs, such as strong Balmer breaks and Balmer absorption, as well as the weak IR emission. To dissect the structure of LRDs, we analyze new deep JWST/NIRSpec PRISM and G395H spectra of FRESCO-GN-9771, one of the most luminous known LRDs at $z=5.5$. These reveal a strong Balmer break, broad Balmer lines and very narrow [O III] emission. We unveil a forest of optical [Fe II] lines, which we argue is emerging from a dense ($n_{\rm H}=10^{9-10}$ cm$^{-3}$) warm layer with electron temperature $T_{\rm e}\approx7000$ K. The broad wings of H$\alpha$ and H$\beta$ have an exponential profile due to electron scattering in this same layer. The high H$\alpha$:H$\beta$:H$\gamma$ flux ratio of $\approx10.4:1:0.14$ is an indicator of collisional excitation and resonant scattering dominating the Balmer line emission. A narrow H$\gamma$ component, unseen in the other two Balmer lines because it is outshone by their broad components, could trace the ISM of a normal host galaxy with a star formation rate of $\sim5$ M$_{\odot}$ yr$^{-1}$. The warm layer is mostly opaque to Balmer transitions, producing a characteristic P-Cygni profile in the line centers that suggests outflowing motions. This same layer is responsible for shaping the Balmer break. The broad-band spectrum can be reasonably matched by a simple photoionized slab model that dominates the $\lambda>1500$ Å continuum and a low-mass ($\sim10^8$ M$_{\odot}$) galaxy that could explain the narrow [O III], with only a subdominant contribution to the UV continuum. Our findings indicate that Balmer lines are not directly tracing gas kinematics near the SMBH and that the BH mass scale is likely much lower than virial indicators suggest.
JWST has revealed an abundance of supermassive black holes (BHs) in the early Universe, and yet the lowest mass seed black holes that gave rise to these populations remain elusive. Here we present a systematic search for broad-line Active Galactic Nuclei (AGNs) in some of the faintest high-$z$ galaxies surveyed yet by combining ultra-deep JWST/NIRSpec G395M spectroscopy with the strong lensing aid of Abell S1063. By employing the profile of the [OIII]$\lambda 5007$ emission lines as a template for narrow-line components and carefully cross-validating with mock observations, we identify a sample of ten broad-line AGNs at $4.5
Differentially private gradient descent (DP-GD) is a popular algorithm to train deep learning models with provable guarantees on the privacy of the training data. In the last decade, the problem of understanding its performance cost with respect to standard GD has received remarkable attention from the research community, which formally derived upper bounds on the excess population risk $R_P$ in different learning settings. However, existing bounds typically degrade with over-parameterization, i.e., as the number of parameters $p$ gets larger than the number of training samples $n$ -- a regime which is ubiquitous in current deep-learning practice. As a result, the lack of theoretical insights leaves practitioners without clear guidance, leading some to reduce the effective number of trainable parameters to improve performance, while others use larger models to achieve better results through scale. In this work, we show that in the popular random features model with quadratic loss, for any sufficiently large $p$, privacy can be obtained for free, i.e., $|R_P| = o(1)$, not only when the privacy parameter $\varepsilon$ has constant order, but also in the strongly private setting $\varepsilon = o(1)$. This challenges the common wisdom that over-parameterization inherently hinders performance in private learning.
Low-rank optimization has emerged as a promising direction in training large language models (LLMs) to improve running time and reduce the memory usage of adaptive optimizers by constraining learning to a lower-dimensional space. Prior work typically projects gradients of linear layers using approaches based on Singular Value Decomposition (SVD) or QR-decomposition. Applying these techniques individually to each layer in large models is computationally expensive and incurs additional memory costs due to storing the projection matrices. In this work, we propose a computationally efficient and conceptually simple two-step procedure to approximate SVD/QR-based gradient projections into lower-dimensional spaces by using a predefined orthogonal matrix of the Discrete Cosine Transform (DCT). We dynamically select columns from the DCT matrix based on their alignment with the gradient of each layer. The effective projection matrices are obtained via a simple matmul with the DCT matrix in $O(n^3)$ time, followed by a lightweight sorting step to identify the most relevant basis vectors. For large layers, the DCT can be computed via Makhoul's $N$-point algorithm based on the Fast Fourier Transform (FFT) in $O(n^2 \log n)$ time. Due to the predefined nature of the orthogonal bases, they are computed once at the start of training. Our numerical experiments on both pre-training and fine-tuning tasks demonstrate the effectiveness of our dual strategy in approximating optimal low-rank projections, obtaining an approach with rank-independent running time that matches the performance of costly SVD/QR-based methods while achieving faster runtime and reduced memory usage by up to 25% across different model sizes. Our code is available at this https URL.
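A hedged SciPy/NumPy sketch of the two-step idea, as we read it from the abstract, is given below: a single matmul with a fixed orthonormal DCT matrix, then a lightweight sort to keep the best-aligned basis vectors. The function name and the alignment score are illustrative assumptions, not the released code.

```python
import numpy as np
from scipy.fft import dct

def dct_projection(grad, rank):
    """Select DCT basis columns best aligned with a layer's gradient.

    grad: (m, n) gradient of a linear layer; rank: target low-rank dimension.
    Hypothetical helper for illustration, not the authors' implementation.
    """
    n = grad.shape[1]
    # Orthonormal DCT-II basis: fixed, data-independent, computed once.
    D = dct(np.eye(n), type=2, norm='ortho', axis=0)
    # Alignment of each basis vector with the gradient: one matmul.
    scores = np.linalg.norm(grad @ D, axis=0)
    # Keep the top-`rank` columns (lightweight sort) as the projection basis.
    cols = np.argsort(scores)[-rank:]
    P = D[:, cols]                      # (n, rank) orthonormal projection basis
    return grad @ P, P                  # projected gradient and basis
```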
Beta Pictoris is an A-type star hosting a complex planetary system with two massive gas giants and a prominent debris disk. Variable absorption lines in its stellar spectrum have been interpreted as signatures of exocomets (comet-like bodies transiting the star). Stellar flybys can gravitationally perturb objects in the outer comet reservoir, altering their orbits and potentially injecting them into the inner system, thereby triggering exocomet showers. We aim to assess the contribution of stellar flybys to the observed exocomet activity by reconstructing the stellar encounter history of beta Pictoris in the past and future. We used Gaia DR3 data, supplemented with radial velocities from complementary spectroscopic surveys, to compile a catalogue of stars currently within 80 pc of beta Pictoris. Their orbits were integrated backward and forward in time in an axisymmetric Galactic potential (Gala package) to identify encounters within 2 pc of the system. We identified 99 416 stars within 80 pc of beta Pictoris at present with resolved kinematics. Among these, 49 stars (including the eight components of five binaries) encounter beta Pictoris within 2 pc between -1.5 Myr and +2 Myr. For four of the binaries, the centre-of-mass trajectories also pass within 2 pc. We estimate the sample to be more than 60 % complete within 0.5 Myr of the present. Despite beta Pictoris being the eponym of its famous moving group, none of the identified encounters involved its moving group members; all are unrelated field stars. We find no encounter capable of shaping observed disc structures, although stellar flybys may contribute to the long-term evolution of a potential Oort Cloud. Our catalogue constitutes the most complete reconstruction of the beta Pictoris encounter history to date and provides a robust foundation for future dynamical simulations.
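For readers unfamiliar with the Gala package, a minimal sketch of the kind of backward orbit integration behind such an encounter search is shown below; the phase-space coordinates, the choice of `MilkyWayPotential`, the time step, and the way the 2 pc threshold is checked are illustrative assumptions, not the paper's configuration.

```python
import numpy as np
import astropy.units as u
import gala.dynamics as gd
import gala.potential as gp

# Axisymmetric Galactic potential and two illustrative phase-space positions
# (Galactocentric Cartesian coordinates; placeholder values, not the measured
# kinematics of beta Pictoris or of any candidate star).
pot = gp.MilkyWayPotential()
ham = gp.Hamiltonian(pot)
w0 = gd.PhaseSpacePosition(
    pos=[[-8.12, -8.10], [0.00, 0.01], [0.02, 0.03]] * u.kpc,
    vel=[[11.0, 9.0], [232.0, 229.0], [7.0, 6.5]] * u.km / u.s,
)

# Integrate both stars backward for 1.5 Myr (negative time step) and flag
# epochs where their mutual separation drops below 2 pc.
orbits = ham.integrate_orbit(w0, dt=-0.005 * u.Myr, n_steps=300)
delta = (orbits.xyz[:, :, 0] - orbits.xyz[:, :, 1]).to(u.pc)
sep = np.sqrt((delta ** 2).sum(axis=0))          # separation vs. time, in pc
print("closest approach: %.2f pc" % sep.min().value)
print("encounter within 2 pc:", bool((sep < 2 * u.pc).any()))
```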
The detection of strong Balmer breaks and absorption features in Little Red Dots (LRDs) suggests they host AGN embedded within dense gas envelopes, potentially powered by super-Eddington accretion. We present GLIMPSE-17775, a luminous ($L_{\rm bol}\sim10^{45}$ erg s$^{-1}$) LRD at $z=3.501$ behind Abell S1063 ($\mu\sim2$), observed with deep JWST/NIRCam and a $\sim$20 hr (80 hr de-lensed) NIRSpec/G395M spectrum. The data reveal 40+ emission and absorption features, including a rich forest of low-ionization FeII lines and numerous broad hydrogen recombination transitions. We use this depth to test the dense-gas interpretation through five independent diagnostics. Nearly all permitted lines show exponential wings with consistent FWHM, the signature of Thomson scattering requiring $n_e\gtrsim10^8$ cm$^{-3}$. Adopting this width yields $M_{\rm BH}\sim10^{6.7}M_\odot$, a factor of ten lower than Gaussian fits, and $\lambda_{\rm Edd}\sim1.8$. Additional diagnostics support the same picture: a pronounced Balmer break ($f_{\nu,4050}/f_{\nu,3670}=2.0\pm0.1$), enhanced HeI $\lambda7065$ and $\lambda10830$ with P-Cygni absorption, Bowen-fluorescent OI $\lambda8446$-$\lambda11290$ emission requiring Ly$\beta$ pumping, and 16 FeII lines matching fluorescence models. These features indicate a dense ($n\sim10^8$ cm$^{-3}$), partially ionized cocoon where scattering and fluorescence dominate line formation, providing strong evidence that at least some LRDs are powered by super-Eddington black-hole growth in the early Universe.
We propose that black holes are \emph{soliton-esque} objects, where gravitational collapse is balanced by quantum vacuum dispersion, modeled via $R+\alpha R^{2}$ gravity. Classical singularities are replaced by oscillating, finite-radius cores, thereby evading static no-go theorems. The event horizon is replaced by the \textit{Lamarina}, a surface of maximum redshift whose geometry yields Hawking-like radiation with corrections. The Raychaudhuri equations impose a Dyson-type ceiling on the maximum radiated power ($P_{\infty} \lesssim c^{5}/G$), while effective field theory matching dictates a universal minimum Lamarina radius set by the dispersion scale.
Continual learning is a subfield of machine learning, which aims to allow machine learning models to continuously learn on new data, by accumulating knowledge without forgetting what was learned in the past. In this work, we take a step back, and ask: "Why should one care about continual learning in the first place?". We set the stage by examining recent continual learning papers published at four major machine learning conferences, and show that memory-constrained settings dominate the field. Then, we discuss five open problems in machine learning, and even though they might seem unrelated to continual learning at first sight, we show that continual learning will inevitably be part of their solution. These problems are model editing, personalization and specialization, on-device learning, faster (re-)training and reinforcement learning. Finally, by comparing the desiderata from these unsolved problems and the current assumptions in continual learning, we highlight and discuss four future directions for continual learning research. We hope that this work offers an interesting perspective on the future of continual learning, while displaying its potential value and the paths we have to pursue in order to make it successful. This work is the result of the many discussions the authors had at the Dagstuhl seminar on Deep Continual Learning, in March 2023.
Large-scale deep learning models are known to memorize parts of the training set. In machine learning theory, memorization is often framed as interpolation or label fitting, and classical results show that this can be achieved when the number of parameters $p$ in the model is larger than the number of training samples $n$. In this work, we consider memorization from the perspective of data reconstruction, demonstrating that this can be achieved when $p$ is larger than $dn$, where $d$ is the dimensionality of the data. More specifically, we show that, in the random features model, when $p \gg dn$, the subspace spanned by the training samples in feature space gives sufficient information to identify the individual samples in input space. Our analysis suggests an optimization method to reconstruct the dataset from the model parameters, and we demonstrate that this method performs well on various architectures (random features, two-layer fully-connected and deep residual networks). Our results reveal a law of data reconstruction, according to which the entire training dataset can be recovered as $p$ exceeds the threshold $dn$.
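A hedged PyTorch sketch of one way such a subspace-based reconstruction can be set up is shown below; the ReLU feature map, the optimizer, and the loss are illustrative assumptions and not the paper's exact procedure.

```python
import torch

def reconstruct_candidates(W, span_basis, n_candidates, d, steps=2000, lr=0.1):
    """Search for inputs whose features lie in the training-feature subspace.

    W: (p, d) frozen random first-layer weights; features are phi(x) = relu(W @ x).
    span_basis: (p, n) orthonormal basis of the subspace spanned by the training
    features (assumed recoverable from the trained parameters when p >> d*n).
    Illustrative sketch only.
    """
    X = torch.randn(n_candidates, d, requires_grad=True)
    opt = torch.optim.Adam([X], lr=lr)
    P = span_basis @ span_basis.T            # projector onto the feature subspace
    for _ in range(steps):
        opt.zero_grad()
        feats = torch.relu(X @ W.T)          # (n_candidates, p) feature vectors
        resid = feats - feats @ P            # component outside the subspace
        loss = (resid ** 2).sum() / (feats ** 2).sum().clamp_min(1e-8)
        loss.backward()
        opt.step()
    return X.detach()
```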
This is the first part of a general description in terms of mass transport for time-evolving interacting particle systems, at a mesoscopic level. Beyond kinetic theory, our framework naturally applies in biology, computer vision, and engineering. The central object of our study is a new discrepancy $\mathsf d$ between two probability distributions in position and velocity states, which is reminiscent of the $2$-Wasserstein distance, but of second-order nature. We construct $\mathsf d$ in two steps. First, we optimise over transport plans. The cost function is given by the minimal acceleration between two coupled states on a fixed time horizon $T$. Second, we further optimise over the time horizon $T>0$. We prove the existence of optimal transport plans and maps, and study two time-continuous characterisations of $\mathsf d$. One is given in terms of dynamical transport plans. The other one -- in the spirit of the Benamou--Brenier formula -- is formulated as the minimisation of an action of the acceleration field, constrained by Vlasov's equations. Equivalence of the static and dynamical formulations of $\mathsf d$ holds true. While part of this result can be derived from recent, parallel developments in optimal control between measures, we give an original proof relying on two new ingredients: a Galilean regularisation of Vlasov's equations and a kinetic Monge--Mather shortening principle. Finally, we establish a first-order differential calculus in the geometry induced by $\mathsf d$, and identify solutions to Vlasov's equations with curves of measures satisfying a certain $\mathsf d$-absolute continuity condition. One consequence is an explicit formula for the $\mathsf d$-derivative of such curves.
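One plausible way to write the two-step construction described above, with the choice of acceleration norm, the squared integrand, and the final square root being assumptions the abstract does not pin down, is the following LaTeX sketch.

```latex
% Hedged sketch: c_T is the least integrated squared acceleration needed to
% steer the state (x,v) to (y,w) within time T; d then optimises over
% couplings of the two phase-space distributions and over the horizon T.
\[
  c_T\bigl((x,v),(y,w)\bigr)
  = \min\Bigl\{ \int_0^T \lvert \ddot\gamma(t)\rvert^2 \, dt \;:\;
      (\gamma,\dot\gamma)(0)=(x,v),\ (\gamma,\dot\gamma)(T)=(y,w) \Bigr\},
\]
\[
  \mathsf d(\mu,\nu)
  = \inf_{T>0}\; \inf_{\pi \in \Pi(\mu,\nu)}
    \Bigl( \int c_T \, d\pi \Bigr)^{1/2}.
\]
```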
In many scientific experiments, the cost of annotating data constrains the pace at which novel hypotheses can be tested. Yet, modern machine learning pipelines offer a promising solution, provided their predictions yield correct conclusions. We focus on Prediction-Powered Causal Inferences (PPCI), i.e., estimating the treatment effect in an unlabeled target experiment, relying on training data with the same outcome annotated but potentially different treatment or effect modifiers. We first show that conditional calibration guarantees valid PPCI at the population level. Then, we introduce a sufficient representation constraint that transfers validity across experiments, which we propose to enforce in practice via Deconfounded Empirical Risk Minimization, our new model-agnostic training objective. We validate our method on synthetic and real-world scientific data, solving problem instances that are impossible for Empirical Risk Minimization even with standard invariance constraints. In particular, for the first time, we achieve valid causal inference on a scientific experiment with complex recordings and no human annotations, by fine-tuning a foundation model on our similar annotated experiment.
Differentially private (DP) linear regression has received significant attention in the recent theoretical literature, with several works aimed at obtaining improved error rates. A common approach is to set the clipping constant much larger than the expected norm of the per-sample gradients. While this simplifies the analysis, it stands in sharp contrast with what empirical evidence suggests for optimizing performance. Our work bridges this gap between theory and practice: we provide sharper rates for DP stochastic gradient descent (DP-SGD) by crucially operating in a regime where clipping happens frequently. Specifically, we consider the setting where the data is multivariate Gaussian, the number of training samples $n$ is proportional to the input dimension $d$, and the algorithm guarantees constant-order zero-concentrated DP. Our method relies on establishing a deterministic equivalent for the trajectory of DP-SGD in terms of a family of ordinary differential equations (ODEs). As a consequence, the risk of DP-SGD is bounded between two ODEs, with upper and lower bounds matching for isotropic data. By studying these ODEs when $n/d$ is large enough, we demonstrate the optimality of aggressive clipping, and we uncover the benefits of decaying learning rate and private noise scheduling.
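To fix notation, here is a minimal NumPy sketch of one DP-SGD step with per-sample clipping for linear regression, run in the regime the abstract emphasizes, where the clipping threshold is small enough to bind frequently; the noise calibration and hyperparameters are illustrative assumptions, not the paper's schedule.

```python
import numpy as np

def dp_sgd_step(w, X, y, lr=0.05, clip=0.5, noise_mult=1.0, rng=None):
    """One DP-SGD step for least squares with per-sample gradient clipping.

    X: (n, d) data, y: (n,) targets, w: (d,) current iterate. With a small
    `clip`, most per-sample gradients are rescaled (aggressive clipping).
    Gaussian noise with std noise_mult * clip is the usual DP calibration;
    exact privacy accounting is omitted in this sketch.
    """
    rng = rng or np.random.default_rng(0)
    residuals = X @ w - y
    grads = residuals[:, None] * X                               # per-sample gradients
    norms = np.linalg.norm(grads, axis=1, keepdims=True)
    grads = grads * np.minimum(1.0, clip / np.maximum(norms, 1e-12))
    noise = rng.normal(0.0, noise_mult * clip, size=w.shape)     # privacy noise
    return w - lr * (grads.sum(axis=0) + noise) / len(y)
```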
This research provides a theoretical framework explaining how token embeddings in attention mechanisms learn to encode token importance based on data statistics. It demonstrates that embeddings rapidly capture a token's predictive value, and that the special `⟨cls⟩` token's embedding converges to a max-margin solution that provably selects important tokens for classification, supported by empirical validation on synthetic and real datasets.
The physical nature of Little Red Dots (LRDs) - a population of compact, red galaxies revealed by JWST - remains unclear. Photometric samples are constructed from varying selection criteria with limited spectroscopic follow-up available to test intrinsic spectral shapes and the prevalence of broad emission lines. We use the RUBIES survey, a large spectroscopic program with wide color-morphology coverage and homogeneous data quality, to systematically analyze the emission-line kinematics, spectral shapes, and morphologies of $\sim$1500 galaxies at $z > 3.1$. We identify broad Balmer lines via a novel fitting approach that simultaneously models NIRSpec/PRISM and G395M spectra, yielding 80 broad-line sources with 28 (35%) at $z > 6$. A large subpopulation naturally emerges from the broad Balmer line sources, with 36 exhibiting `v-shaped' UV-to-optical continua and a dominant point source component in the rest-optical; we define these as spectroscopic LRDs, constituting the largest such sample to date. Strikingly, the spectroscopic LRD population is largely recovered when either a broad line or a rest-optical point source is required in combination with a v-shaped continuum, suggesting an inherent link between these three defining characteristics. We compare the spectroscopic LRD sample to published photometric searches. Although these selections have high accuracy, down to $\rm F444W < 26.5$, only 50-62% of the RUBIES LRDs were previously identified. The remainder were missed due to a mixture of faint rest-UV photometry, comparatively blue rest-optical colors, or highly uncertain photometric redshifts. Our findings highlight that well-selected spectroscopic campaigns are essential for robust LRD identification, while photometric criteria require refinement to capture the full population.
We propose the Scalable Mechanistic Neural Network (S-MNN), an enhanced neural network framework designed for scientific machine learning applications involving long temporal sequences. By reformulating the original Mechanistic Neural Network (MNN) (Pervez et al., 2024), we reduce the computational time and space complexities from cubic and quadratic in the sequence length, respectively, to linear. This significant improvement enables efficient modeling of long-term dynamics without sacrificing accuracy or interpretability. Extensive experiments demonstrate that S-MNN matches the original MNN in precision while substantially reducing computational resources. Consequently, S-MNN can serve as a drop-in replacement for the original MNN in applications, providing a practical and efficient tool for integrating mechanistic bottlenecks into neural network models of complex dynamical systems. Source code is available at this https URL.
When examined through the lens of their residual streams, a puzzling property emerges in transformer networks: residual contributions (e.g., attention heads) sometimes specialize in specific tasks or input attributes. In this paper, we analyze this phenomenon in vision transformers, focusing on the spectral geometry of residuals, and explore its implications for modality alignment in vision-language models. First, we link it to the intrinsically low-dimensional structure of visual head representations, zooming into their principal components and showing that they encode specialized roles across a wide variety of input data distributions. Then, we analyze the effect of head specialization in multimodal models, focusing on how improved alignment between text and specialized heads impacts zero-shot classification performance. This specialization-performance link consistently holds across diverse pre-training data, network sizes, and objectives, demonstrating a powerful new mechanism for boosting zero-shot classification through targeted alignment. Ultimately, we translate these insights into actionable terms by introducing ResiDual, a technique for spectral alignment of the residual stream. Much like panning for gold, it lets the noise from irrelevant unit principal components (i.e., attributes) wash away to amplify task-relevant ones. Remarkably, this dual perspective on modality alignment yields fine-tuning level performance on different data distributions while modelling an extremely interpretable and parameter-efficient transformation, as we extensively show on 70 pre-trained network-dataset combinations (7 models, 10 datasets).
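As a rough illustration of what spectral alignment of the residual stream can look like, the PyTorch sketch below reweights principal components of stacked residual-unit contributions by their alignment with a text embedding; the gain rule, the pooling, and the function name are illustrative assumptions and not the ResiDual method as published.

```python
import torch

def residual_spectral_align(unit_outputs, text_anchor, k=32):
    """Reweight spectral components of residual contributions toward a task.

    unit_outputs: (num_units, dim) contributions of residual units (e.g. heads)
    for one image; text_anchor: (dim,) embedding of the task's text prompt.
    Components poorly aligned with the task are attenuated ("washed away"),
    task-relevant ones are amplified. Illustrative sketch only.
    """
    # Principal directions of the stacked residual contributions.
    U, S, Vh = torch.linalg.svd(unit_outputs, full_matrices=False)
    pcs = Vh[:k]                                   # (k, dim) top directions
    # Alignment-based gains relative to the text anchor.
    gains = (pcs @ text_anchor).abs()
    gains = gains / gains.max().clamp_min(1e-8)
    # Re-synthesize the residual stream with reweighted spectral components.
    coords = unit_outputs @ pcs.T                  # (num_units, k)
    aligned = (coords * gains) @ pcs               # (num_units, dim)
    return aligned.sum(dim=0)                      # pooled residual representation
```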