This paper offers a comprehensive guide to self-supervised learning (SSL), systematizing diverse methods into coherent families and providing practical implementation advice. It aims to make the rapidly evolving field more accessible by distilling historical context, theoretical underpinnings, and empirical best practices for various data modalities.
Treating human motion and camera trajectory generation separately overlooks a core principle of cinematography: the tight interplay between actor performance and camera work in the screen space. In this paper, we are the first to cast this task as text-conditioned joint generation, aiming to maintain consistent on-screen framing while producing two heterogeneous, yet intrinsically linked, modalities: human motion and camera trajectories. We propose a simple, model-agnostic framework that enforces multimodal coherence via an auxiliary modality: the on-screen framing induced by projecting human joints onto the camera. This on-screen framing provides a natural and effective bridge between modalities, promoting consistency and leading to a more precise joint distribution. We first design a joint autoencoder that learns a shared latent space, together with a lightweight linear transform from the human and camera latents to a framing latent. We then introduce auxiliary sampling, which exploits this linear transform to steer generation toward a coherent framing modality. To support this task, we also introduce the PulpMotion dataset, a human-motion and camera-trajectory dataset with rich captions and high-quality human motions. Extensive experiments across DiT- and MAR-based architectures show the generality and effectiveness of our method in generating on-frame coherent human-camera motions, while also achieving gains in textual alignment for both modalities. Our qualitative results yield more cinematographically meaningful framings, setting a new state of the art for this task. Code, models, and data are available on our project page (this https URL).
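As a concrete illustration of the auxiliary framing modality described above, the sketch below projects 3D human joints through a pinhole camera to obtain on-screen coordinates; the function and parameter names (project_joints, fx, fy, cx, cy) and the intrinsics values are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def project_joints(joints_world, R, t, fx, fy, cx, cy):
    """Project 3D human joints into the image plane of a camera pose (R, t).

    joints_world: (J, 3) joint positions in world coordinates
    R, t:         camera rotation (3, 3) and translation (3,) for one frame
    Returns (J, 2) pixel coordinates: the on-screen 'framing' signal that
    couples the human-motion and camera-trajectory modalities.
    """
    cam = (R @ joints_world.T).T + t          # world -> camera coordinates
    z = np.clip(cam[:, 2:3], 1e-6, None)      # guard against division by zero
    u = fx * cam[:, 0:1] / z + cx             # perspective projection
    v = fy * cam[:, 1:2] / z + cy
    return np.concatenate([u, v], axis=1)

# Example: one frame with 22 joints, placed ~3 m in front of an identity camera pose.
framing = project_joints(np.random.randn(22, 3) + [0.0, 0.0, 3.0],
                         np.eye(3), np.zeros(3),
                         fx=1000.0, fy=1000.0, cx=512.0, cy=512.0)
```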
Foundation models are designed to serve as versatile embedding machines, with strong zero-shot capabilities and superior generalization performance when fine-tuned on diverse downstream tasks. While this is largely true for language and vision foundation models, we argue that the inherent diversity of time series data makes them less suited for building effective foundation models. We demonstrate this using forecasting as our downstream task. We show that the zero-shot capabilities of a time series foundation model are significantly influenced by, and tied to, the specific domains it has been pretrained on. Furthermore, when applied to unseen real-world time series data, fine-tuned foundation models do not consistently yield substantially better results, relative to their increased parameter count and memory footprint, than smaller, dedicated models tailored to the specific forecasting task at hand.
Quadrotors can carry slung loads to hard-to-reach locations at high speed. Since a single quadrotor has limited payload capacity, using a team of quadrotors to collaboratively manipulate a heavy object is a scalable and promising solution. However, existing control algorithms for multi-lifting systems only enable low-speed and low-acceleration operations due to the complex dynamic coupling between quadrotors and the load, limiting their use in time-critical missions such as search and rescue. In this work, we present a solution that significantly enhances the agility of cable-suspended multi-lifting systems. Unlike traditional cascaded solutions, we introduce a trajectory-based framework that solves the whole-body kinodynamic motion planning problem online, accounting for the dynamic coupling effects and constraints between the quadrotors and the load. The planned trajectory is provided to the quadrotors as a reference in a receding-horizon fashion and is tracked by an onboard controller that observes and compensates for the cable tension. Real-world experiments demonstrate that our framework can achieve at least eight times greater acceleration than state-of-the-art methods when following agile trajectories. Our method can even perform complex maneuvers such as flying through narrow passages at high speed. Additionally, it exhibits high robustness against load uncertainties and does not require adding any sensors to the load, demonstrating strong practicality.
Researchers from Univ. Rennes, Inria, CNRS, IRISA, and LABEL4.AI developed a guidance watermarking framework that enables any differentiable post-hoc watermarking scheme to be intrinsically embedded into diffusion model outputs. This method robustly identifies AI-generated images without retraining the generative model, achieving up to three times greater watermark capacity and significantly improved detectability against diverse attacks.
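The summary above does not detail the mechanism, but gradient-based guidance of a diffusion sampler by a differentiable watermark loss is the generic recipe such a framework suggests. The sketch below illustrates that generic idea with toy stand-ins; denoise_step, watermark_loss, and guidance_scale are assumptions for illustration, not the authors' API.

```python
import torch

# Toy stand-ins: a "denoiser" that slowly contracts the sample, and a
# differentiable "watermark loss" that rewards correlation with a fixed key.
key = torch.randn(3, 64, 64)
def denoise_step(x, step): return x - 0.05 * x
def watermark_loss(x): return ((x * key).mean() - 1.0) ** 2

def guided_sampling(steps=50, guidance_scale=0.5):
    """Classifier-guidance-style loop: at each denoising step, the gradient of
    the watermark loss w.r.t. the current sample nudges the trajectory toward
    images that the (differentiable) watermark detector accepts."""
    x = torch.randn(3, 64, 64)
    for step in range(steps):
        x = x.detach().requires_grad_(True)
        grad, = torch.autograd.grad(watermark_loss(x), x)
        with torch.no_grad():
            x = denoise_step(x, step) - guidance_scale * grad
    return x.detach()

image = guided_sampling()
```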
PAC generalization bounds on the risk, when expressed in terms of the expected loss, are often insufficient to capture imbalances between subgroups in the data. To overcome this limitation, we introduce a new family of risk measures, called constrained f-entropic risk measures, which enable finer control over distributional shifts and subgroup imbalances via f-divergences, and which include the well-known Conditional Value at Risk (CVaR). We derive both classical and disintegrated PAC-Bayesian generalization bounds for this family of risks, providing the first disintegrated PAC-Bayesian guarantees beyond standard risks. Building on this theory, we design a self-bounding algorithm that minimizes our bounds directly, yielding models with guarantees at the subgroup level. Finally, we empirically demonstrate the usefulness of our approach.
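For reference, the CVaR mentioned above is often written in the Rockafellar-Uryasev variational form (conventions for the level differ across papers; here the worst α-fraction of losses is averaged):

\[
\mathrm{CVaR}_{\alpha}(\ell) \;=\; \inf_{\rho \in \mathbb{R}} \Big\{ \rho + \tfrac{1}{\alpha}\, \mathbb{E}\big[(\ell - \rho)_{+}\big] \Big\},
\]

where $\ell$ is the loss random variable and $(x)_{+} = \max(x, 0)$.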
Designing categorical kernels is a major challenge for Gaussian process regression with continuous and categorical inputs. Despite previous studies, it is difficult to identify a preferred method, because the evaluation metrics, the optimization procedure, or the datasets change from one study to the next. In particular, reproducible code is rarely available. The aim of this paper is to provide a reproducible comparative study of all existing categorical kernels on many of the test cases investigated so far. We also propose new evaluation metrics inspired by the optimization community, which provide quantitative rankings of the methods across several tasks. From our results on datasets that exhibit a group structure on the levels of the categorical inputs, it appears that nested kernel methods clearly outperform all competitors. When the group structure is unknown, or when there is no prior knowledge of such a structure, we propose a new clustering-based strategy using target encodings of the categorical variables. We show that on a large panel of datasets, which do not necessarily have a known group structure, this estimation strategy still outperforms other approaches while maintaining low computational cost.
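A minimal sketch of the kind of clustering-based strategy described in the last sentence, assuming target encoding of each level followed by clustering of the encodings to recover candidate groups; the function name and the choice of KMeans are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

def group_levels_by_target_encoding(categories, y, n_groups=2, random_state=0):
    """Estimate a group structure on the levels of a categorical input.

    Each level is encoded by the mean target value over its observations,
    then the one-dimensional encodings are clustered so that levels with
    similar target behaviour fall into the same group (e.g. for a nested
    kernel)."""
    df = pd.DataFrame({"cat": categories, "y": y})
    encoding = df.groupby("cat")["y"].mean()                 # target encoding
    labels = KMeans(n_clusters=n_groups, n_init=10,
                    random_state=random_state).fit_predict(
                        encoding.values.reshape(-1, 1))
    return dict(zip(encoding.index, labels))                 # level -> group id

# Example: six levels whose targets naturally split into two regimes.
rng = np.random.default_rng(0)
cats = rng.choice(list("ABCDEF"), size=300)
y = np.where(np.isin(cats, list("ABC")), 1.0, 5.0) + rng.normal(0, 0.1, 300)
print(group_levels_by_target_encoding(cats, y))
```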
Time Series Foundation Models (TSFMs) have shown promising zero-shot generalization across diverse forecasting tasks. However, their robustness to continual adaptation remains underexplored. In this work, we investigate the extent to which TSFMs suffer from catastrophic forgetting when fine-tuned sequentially on multiple datasets. Using synthetic datasets designed with varying degrees of periodic structure, we measure the trade-off between adaptation to new data and retention of prior knowledge. Our experiments reveal that, while fine-tuning improves performance on new tasks, it often causes significant degradation on previously learned ones, illustrating a fundamental stability-plasticity dilemma.
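A minimal sketch of the kind of forgetting measurement implied by this protocol: record performance on every dataset after each sequential fine-tuning step and compare the final error with the best error ever reached. The metric and the numbers below are illustrative, not the paper's results.

```python
import numpy as np

def forgetting_scores(perf):
    """perf[t, d] = forecasting error of the model after fine-tuning on the
    t-th dataset, evaluated on dataset d (lower is better). Forgetting on d is
    the gap between the final error and the best error ever achieved on d."""
    final = perf[-1]             # errors after the last fine-tuning step
    best = perf.min(axis=0)      # best error each dataset ever reached
    return final - best          # >= 0; larger means more forgetting

# Hypothetical errors over three sequentially fine-tuned synthetic datasets.
perf = np.array([[0.80, 1.40, 1.50],
                 [1.10, 0.70, 1.45],
                 [1.30, 1.05, 0.65]])
print(forgetting_scores(perf))   # -> [0.50, 0.35, 0.00]
```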
This paper offers a comprehensive review of training methodologies for Physical Neural Networks (PNNs), addressing the escalating energy and performance demands of digital AI. It systematically categorizes diverse training approaches, from physics-aware backpropagation to in-situ gradient computation, and evaluates their potential to enable energy-efficient, scalable AI systems.
Watermarking is a technical means to dissuade malfeasant usage of Large Language Models. This paper proposes a novel watermarking scheme, called WaterMax, that enjoys high detectability while sustaining the quality of the text generated by the original LLM. Its new design leaves the LLM untouched (no modification of the weights, logits, temperature, or sampling technique). WaterMax balances robustness against complexity, in contrast to the watermarking techniques in the literature, which inherently provoke a trade-off between quality and robustness. Its performance is both theoretically proven and experimentally validated. It outperforms all SotA techniques under the most complete benchmark suite. Code available at this https URL.
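The abstract does not spell out the mechanism, so the sketch below only illustrates the general family it belongs to: sampling-agnostic watermarking that leaves generation untouched and instead chooses among unmodified LLM outputs using a secret-keyed score. All names and the scoring rule are assumptions for illustration, not the WaterMax algorithm itself.

```python
import hashlib

def keyed_score(text, key="secret-key", window=4):
    """Fraction of sliding word windows whose keyed hash lands in the 'green'
    half; text unrelated to the key scores about 0.5 on average."""
    words = text.split()
    if len(words) < window:
        return 0.5
    hits = sum(
        int(hashlib.sha256((key + " ".join(words[i:i + window])).encode())
            .hexdigest(), 16) % 2
        for i in range(len(words) - window + 1)
    )
    return hits / (len(words) - window + 1)

def pick_candidate(candidates, key="secret-key"):
    """Select, among unmodified LLM generations, the candidate with the highest
    keyed score: the model's weights, logits and sampling stay untouched."""
    return max(candidates, key=lambda t: keyed_score(t, key))

drafts = ["the model answers the question in a short and direct way",
          "here is a short and direct answer to the question from the model",
          "a direct short answer to the question is given by the model here"]
print(pick_candidate(drafts))
```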
Sparfels presents a method for rapidly reconstructing detailed 3D geometry from a few unposed images, combining a 3D foundation model with test-time 2D Gaussian Splatting and novel variance regularization. The approach achieves state-of-the-art reconstruction accuracy on the DTU dataset within approximately three minutes on a consumer GPU while enhancing novel view synthesis quality and camera pose estimation.
BOGausS improves 3D Gaussian Splatting optimization by addressing challenges in parameter tuning, model size reduction, and visual artifacts. It achieves higher quality scene reconstructions with up to ten times fewer Gaussians than prior methods, developed by researchers from Orange Innovation and French academic institutions.
In repeated games, players choose actions concurrently at each step. We consider a parameterized setting of repeated games in which the players form a population of an arbitrary size. Their utility functions encode a reachability objective. The problem is whether there exists a uniform coalition strategy for the players so that they are sure to win independently of the population size. We use algebraic tools to show that the problem can be solved in polynomial space. First we exhibit a finite semigroup whose elements summarize strategies over a finite interval of population sizes. Then, we characterize the existence of winning strategies by the existence of particular elements in this semigroup. Finally, we provide a matching complexity lower bound, to conclude that repeated population games with reachability objectives are PSPACE-complete.
This systematic scoping review synthesizes findings from 49 studies (2018-2025) on unsupervised deep generative models for anomaly detection in neuroimaging, providing a pathology-specific comparison of performance metrics and architectural design choices. The review finds that these models achieve Dice scores up to 0.77 for large lesions like brain tumors but consistently struggle with smaller or sparser abnormalities such as those in multiple sclerosis and stroke, where Dice scores are often below 0.50.
The Bethe-Hessian matrix, introduced by Saade, Krzakala, and Zdeborová (2014), is a Hermitian matrix designed for applying spectral clustering algorithms to sparse networks. Rather than employing a non-symmetric and high-dimensional non-backtracking operator, a spectral method based on the Bethe-Hessian matrix is conjectured to also reach the Kesten-Stigum detection threshold in the sparse stochastic block model (SBM). We provide the first rigorous analysis of the Bethe-Hessian spectral method in the SBM under both the bounded expected degree and the growing degree regimes. Specifically, we demonstrate that: (i) When the expected degree $d \geq 2$, the number of negative outliers of the Bethe-Hessian matrix can consistently estimate the number of blocks above the Kesten-Stigum threshold, thus confirming a conjecture from Saade, Krzakala, and Zdeborová (2014) for $d \geq 2$. (ii) For sufficiently large $d$, its eigenvectors can be used to achieve weak recovery. (iii) As $d \to \infty$, we establish the concentration of the locations of its negative outlier eigenvalues, and weak consistency can be achieved via a spectral method based on the Bethe-Hessian matrix.
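For reference, the Bethe-Hessian of a graph with adjacency matrix $A$ and diagonal degree matrix $D$ is usually written as

\[
H(r) \;=\; (r^{2} - 1)\, I \;-\; r\, A \;+\; D,
\]

with the scalar $r$ typically chosen close to $\pm\sqrt{d}$ for average degree $d$; negative outlier eigenvalues of $H(r)$ are the signatures of community structure exploited by the spectral method.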
Researchers from French institutions introduce the Fused Gromov-Wasserstein (FGW) distance, a novel optimal transport metric for structured data like graphs, which unifies both node-level feature information and graph-level structural information. The FGW distance achieves state-of-the-art performance in graph classification across various benchmarks and enables the computation of meaningful graph barycenters for unsupervised learning tasks.
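As a reminder of the form of this metric (notation and normalizations vary slightly across papers), the FGW distance between two attributed graphs with node features $a_i, b_j$, structure matrices $C_1, C_2$, and node distributions $\mu, \nu$ interpolates a Wasserstein term on features and a Gromov-Wasserstein term on structure:

\[
\mathrm{FGW}_{q,\alpha}(\mu,\nu)
\;=\; \min_{\pi \in \Pi(\mu,\nu)} \sum_{i,j,k,l}
\Big[(1-\alpha)\, d(a_i, b_j)^{q} \;+\; \alpha\, \big|C_1(i,k) - C_2(j,l)\big|^{q}\Big]\, \pi_{i,j}\, \pi_{k,l},
\]

where $\pi$ ranges over couplings of the node distributions and $\alpha \in [0,1]$ trades off feature against structural information.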
A decentralized reinforcement learning framework, LGTC-IPPO, enables multi-agent multi-resource allocation through dynamic cluster agreements, achieving stable, high rewards and successfully reallocating resources on physical drones. This approach integrates a Liquid-Graph Time-Constant (LGTC) neural network to learn dynamic clustering, improving coordination and adaptability in complex environments.
With the rise of AI-based code generation, customizing existing code from natural language instructions to modify visual results (such as figures or images) has become possible, promising to reduce the need for deep programming expertise. However, even experienced developers can struggle with this task, as it requires identifying relevant code regions (feature location), generating valid code variants, and ensuring the modifications reliably align with user intent. In this paper, we introduce vTikZ, the first benchmark designed to evaluate the ability of Large Language Models (LLMs) to customize code while preserving coherent visual outcomes. Our benchmark consists of carefully curated vTikZ editing scenarios, parameterized ground truths, and a reviewing tool that leverages visual feedback to assess correctness. Empirical evaluation with state-of-the-art LLMs shows that existing solutions struggle to reliably modify code in alignment with visual intent, highlighting a gap in current AI-assisted code editing approaches. We argue that vTikZ opens new research directions for integrating LLMs with visual feedback mechanisms to improve code customization tasks in various domains beyond TikZ, including image processing, art creation, Web design, and 3D modeling.
Parallel to the development of advanced deepfake audio generation, audio deepfake detection has also seen significant progress. However, a standardized and comprehensive benchmark is still missing. To address this, we introduce Speech DeepFake (DF) Arena, the first comprehensive benchmark for audio deepfake detection. Speech DF Arena provides a toolkit to uniformly evaluate detection systems, currently across 14 diverse datasets and attack scenarios, with standardized evaluation metrics and protocols for reproducibility and transparency. It also includes a leaderboard to compare and rank the systems, helping researchers and developers enhance their reliability and robustness. We include 14 evaluation sets and 12 state-of-the-art open-source and 3 proprietary detection systems. Our study shows that many systems exhibit high EER in out-of-domain scenarios, highlighting the need for extensive cross-domain evaluation. The leaderboard is hosted on Hugging Face and a toolkit for reproducing results across the listed datasets is available on GitHub.
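A minimal sketch of how the EER reported by such benchmarks is typically computed from detector scores (the operating point where the false positive and false negative rates coincide); the scores below are synthetic and only serve to exercise the function.

```python
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(labels, scores):
    """EER: the point on the ROC curve where the false positive rate equals
    the false negative rate (1 - TPR); returned as their average at the
    threshold where the two are closest."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    idx = np.argmin(np.abs(fpr - fnr))
    return (fpr[idx] + fnr[idx]) / 2.0

# Synthetic example: bona fide (label 0) vs. spoofed (label 1) detector scores.
rng = np.random.default_rng(0)
labels = np.concatenate([np.zeros(500), np.ones(500)])
scores = np.concatenate([rng.normal(0.0, 1.0, 500), rng.normal(1.5, 1.0, 500)])
print(f"EER = {equal_error_rate(labels, scores):.3f}")
```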
Researchers from IRISA, Univ. Rennes, CNRS, Imatag, and LABEL4.AI developed LatentSeal, an image watermarking system that redefines watermarking as semantic communication to embed full-sentence textual messages into images. It achieves up to 121 times faster decoding compared to baselines and robustly reconstructs text while maintaining high imperceptibility and providing a confidence metric for extracted messages.