Institut des Systèmes Intelligents et de Robotique (ISIR)
JAFAR: Jack up Any Feature at Any Resolution
17 Nov 2025
Foundation Vision Encoders have become essential for a wide range of dense vision tasks. However, their low-resolution spatial feature outputs necessitate feature upsampling to produce the high-resolution modalities required for downstream tasks. In this work, we introduce JAFAR, a lightweight and flexible feature upsampler that enhances the spatial resolution of visual features from any Foundation Vision Encoder to an arbitrary target resolution. JAFAR employs an attention-based module designed to promote semantic alignment between high-resolution queries, derived from low-level image features, and semantically enriched low-resolution keys, using Spatial Feature Transform (SFT) modulation. Notably, despite the absence of high-resolution supervision, we demonstrate that learning at low upsampling ratios and resolutions generalizes remarkably well to significantly higher output scales. Extensive experiments show that JAFAR effectively recovers fine-grained spatial details and consistently outperforms existing feature upsampling methods across a diverse set of downstream tasks. Project page at this https URL
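The Spatial Feature Transform (SFT) modulation mentioned in the abstract can be sketched as a FiLM-style conditioning step: low-resolution semantic features predict a per-channel scale (gamma) and shift (beta) applied to the high-resolution queries. The toy below is a dependency-free illustration; the plain linear maps, dimensions, and weights are assumptions, not JAFAR's actual architecture.

```python
# Toy SFT modulation: semantic conditioning features produce gamma/beta,
# which rescale and shift a query feature vector elementwise.

def sft_modulate(query, cond, w_gamma, w_beta):
    """out = gamma(cond) * query + beta(cond), with gamma/beta as linear maps."""
    gamma = [sum(w * c for w, c in zip(row, cond)) for row in w_gamma]
    beta = [sum(w * c for w, c in zip(row, cond)) for row in w_beta]
    return [g * q + b for g, q, b in zip(gamma, query, beta)]

# 2-dim toy features; weights chosen by hand for a readable result.
query = [1.0, 2.0]                     # high-resolution query feature
cond = [0.5, 0.5]                      # low-resolution semantic feature
w_gamma = [[1.0, 1.0], [1.0, 1.0]]     # gamma = [1.0, 1.0]
w_beta = [[0.0, 0.0], [2.0, 2.0]]      # beta  = [0.0, 2.0]
out = sft_modulate(query, cond, w_gamma, w_beta)
```

In the real model the same mechanism runs per spatial location inside an attention block, rather than on a single vector.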
Rewarded soups: towards Pareto-optimal alignment by interpolating weights fine-tuned on diverse rewards
16 Oct 2023

Rewarded Soups proposes an efficient multi-policy strategy that achieves Pareto-optimal alignment for large foundation models by interpolating the weights of independently fine-tuned expert models. This approach allows for a posteriori customization to diverse user preferences with significantly fewer training runs compared to traditional multi-objective reinforcement learning.
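The weight-interpolation idea admits a one-line sketch: given experts fine-tuned on different rewards, form a convex combination of their parameters. The dict-of-floats below is a toy stand-in for real model state dicts.

```python
def rewarded_soup(weight_dicts, lambdas):
    """Interpolate expert weights: theta = sum_i lambda_i * theta_i."""
    assert abs(sum(lambdas) - 1.0) < 1e-9, "interpolation weights must sum to 1"
    keys = weight_dicts[0].keys()
    return {k: sum(lam * wd[k] for lam, wd in zip(lambdas, weight_dicts))
            for k in keys}

# Two toy "experts" fine-tuned on different rewards.
expert_a = {"w": 1.0, "b": 0.0}
expert_b = {"w": 3.0, "b": 2.0}
# A posteriori preference: 25% reward A, 75% reward B.
soup = rewarded_soup([expert_a, expert_b], [0.25, 0.75])
```

Sweeping the lambdas traces out the approximated Pareto front without any additional training runs, which is the efficiency argument of the paper.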

VIPER: Visual Perception and Explainable Reasoning for Sequential Decision-Making
10 Sep 2025

VIPER integrates frozen Vision-Language Models for perception with fine-tuned Large Language Models for reasoning, using text as an intermediate representation to enable visual instruction-based planning. The framework, developed by researchers from Sorbonne Université and CNRS, achieves state-of-the-art performance on embodied AI benchmarks while providing inherent explainability of agent decisions.

I-FailSense: Towards General Robotic Failure Detection with Vision-Language Models
Language-conditioned robotic manipulation in open-world settings requires not only accurate task execution but also the ability to detect failures for robust deployment in real-world environments. Although recent advances in vision-language models (VLMs) have significantly improved the spatial reasoning and task-planning capabilities of robots, they remain limited in their ability to recognize their own failures. In particular, a critical yet underexplored challenge lies in detecting semantic misalignment errors, where the robot executes a task that is semantically meaningful but inconsistent with the given instruction. To address this, we propose a method for building datasets targeting the detection of Semantic Misalignment Failures from existing language-conditioned manipulation datasets. We also present I-FailSense, an open-source VLM framework with grounded arbitration designed specifically for failure detection. Our approach relies on post-training a base VLM, followed by training lightweight classification heads, called FS blocks, attached to different internal layers of the VLM, whose predictions are aggregated using an ensembling mechanism. Experiments show that I-FailSense outperforms state-of-the-art VLMs, both comparable in size and larger, in detecting semantic misalignment errors. Notably, despite being trained only on semantic misalignment detection, I-FailSense generalizes to broader robotic failure categories and effectively transfers to other simulation environments and to real-world settings with zero-shot or minimal post-training. The datasets and models are publicly released on HuggingFace (Webpage: this https URL).
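The FS-block ensembling described above can be sketched as small sigmoid heads reading features from different internal layers, with their scores averaged. Head weights, feature sizes, and mean aggregation are illustrative assumptions; the abstract does not specify the aggregation rule.

```python
import math

def fs_ensemble(layer_feats, heads):
    """Score failure probability with one lightweight head per tapped layer,
    then aggregate the per-head scores by averaging (the ensembling step)."""
    def head_score(feat, w, b):
        z = sum(wi * fi for wi, fi in zip(w, feat)) + b
        return 1.0 / (1.0 + math.exp(-z))  # sigmoid

    scores = [head_score(f, w, b) for f, (w, b) in zip(layer_feats, heads)]
    return sum(scores) / len(scores)

# Two tapped layers with 2-dim toy features; weights chosen so both heads
# are maximally uncertain (logit 0 -> probability 0.5).
feats = [[1.0, -1.0], [2.0, 2.0]]
heads = [([1.0, 1.0], 0.0), ([0.5, -0.5], 0.0)]
score = fs_ensemble(feats, heads)
```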
Tracing Multilingual Factual Knowledge Acquisition in Pretraining
07 Oct 2025
Large Language Models (LLMs) are capable of recalling multilingual factual knowledge present in their pretraining data. However, most studies evaluate only the final model, leaving the development of factual recall and crosslingual consistency throughout pretraining largely unexplored. In this work, we trace how factual recall and crosslingual consistency evolve during pretraining, focusing on OLMo-7B as a case study. We find that both accuracy and consistency improve over time for most languages. We show that this improvement is primarily driven by the fact frequency in the pretraining corpus: more frequent facts are more likely to be recalled correctly, regardless of language. Yet, some low-frequency facts in non-English languages can still be correctly recalled. Our analysis reveals that these instances largely benefit from crosslingual transfer of their English counterparts -- an effect that emerges predominantly in the early stages of pretraining. We pinpoint two distinct pathways through which multilingual factual knowledge acquisition occurs: (1) frequency-driven learning, which is dominant and language-agnostic, and (2) crosslingual transfer, which is limited in scale and typically constrained to relation types involving named entities. We release our code and data to facilitate further research at this https URL.
Efficient Generative Transformer Operators For Million-Point PDEs
04 Dec 2025

Researchers from Sorbonne Université and Criteo AI Lab introduce ECHO, a framework for efficient generative transformer operators that addresses the scalability and long-horizon error accumulation issues in neural operators for PDEs. ECHO achieves high spatio-temporal compression and accurate, multi-task solutions for million-point PDE trajectories, notably enabling super-resolution forecasting on a 1024x1024 Vorticity grid without out-of-memory errors where other models failed.

How Sampling Affects the Detectability of Machine-written texts: A Comprehensive Study
15 Oct 2025
As texts generated by Large Language Models (LLMs) are ever more common and often indistinguishable from human-written content, research on automatic text detection has attracted growing attention. Many recent detectors report near-perfect accuracy, often boasting AUROC scores above 99%. However, these claims typically assume fixed generation settings, leaving open the question of how robust such systems are to changes in decoding strategies. In this work, we systematically examine how sampling-based decoding impacts detectability, with a focus on how subtle variations in a model's (sub)word-level distribution affect detection performance. We find that even minor adjustments to decoding parameters - such as temperature, top-p, or nucleus sampling - can severely impair detector accuracy, with AUROC dropping from near-perfect levels to 1% in some settings. Our findings expose critical blind spots in current detection methods and emphasize the need for more comprehensive evaluation protocols. To facilitate future research, we release a large-scale dataset encompassing 37 decoding configurations, along with our code and evaluation framework at this https URL.
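The decoding knobs the study varies have precise definitions; a minimal, dependency-free sketch of temperature scaling and nucleus (top-p) filtering:

```python
import math

def apply_temperature(logits, t):
    """Softmax over temperature-scaled logits; t < 1 sharpens, t > 1 flattens."""
    scaled = [l / t for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def nucleus_filter(probs, top_p):
    """Keep the smallest set of highest-probability tokens whose total mass
    reaches top_p, then renormalize; all other tokens get probability 0."""
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    out = [0.0] * len(probs)
    for i in kept:
        out[i] = probs[i] / mass
    return out

uniform = apply_temperature([0.0, 0.0], t=0.7)        # flat stays flat
trimmed = nucleus_filter([0.5, 0.3, 0.2], top_p=0.7)  # tail token dropped
```

Each such adjustment reshapes the (sub)word-level distribution that detectors implicitly model, which is why detection performance is sensitive to them.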
GlotCC: An Open Broad-Coverage CommonCrawl Corpus and Pipeline for Minority Languages
03 Mar 2025
The need for large text corpora has increased with the advent of pretrained language models and, in particular, the discovery of scaling laws for these models. Most available corpora have sufficient data only for languages with large dominant communities. However, there is no corpus available that (i) covers a wide range of minority languages; (ii) is generated by an open-source reproducible pipeline; and (iii) has been rigorously cleaned of noise, making it trustworthy to use. We present GlotCC, a clean, document-level, 2TB general domain corpus derived from CommonCrawl, covering more than 1000 languages. We make GlotCC and the system used to generate it - including the pipeline, language identification model, and filters - available to the research community. Corpus v. 1.0 this https URL, Pipeline v. 3.0 this https URL
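A toy document filter in the spirit of the cleaning stage described above; the thresholds and heuristics are illustrative assumptions, not the pipeline's actual rules.

```python
def keep_document(text, min_chars=50, max_symbol_ratio=0.3):
    """Drop documents that are too short, or dominated by non-letter,
    non-space characters, a crude proxy for boilerplate and encoding noise."""
    if len(text) < min_chars:
        return False
    symbols = sum(1 for c in text if not (c.isalpha() or c.isspace()))
    return symbols / len(text) <= max_symbol_ratio

clean = "This sentence looks like ordinary running text in some language." * 2
noisy = "@@## 404 || <div><div><div> ~~ %% ++ // :: $$ == [[ ]] {{ }} !!" * 2
```

Real web-corpus pipelines chain many such filters (language ID confidence, deduplication, script consistency); this shows only the general shape of one.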
On Relation-Specific Neurons in Large Language Models
07 Oct 2025
In large language models (LLMs), certain neurons can store distinct pieces of knowledge learned during pretraining. While factual knowledge typically appears as a combination of relations and entities, it remains unclear whether some neurons focus on a relation itself, independent of any entity. We hypothesize that such neurons detect a relation in the input text and guide generation involving that relation. To investigate this, we study the Llama-2 family on a chosen set of relations with a statistics-based method. Our experiments demonstrate the existence of relation-specific neurons. We measure the effect of selectively deactivating candidate neurons specific to a relation r on the LLM's ability to handle (1) facts involving relation r and (2) facts involving a different relation r' ≠ r. With respect to their capacity for encoding relation information, we give evidence for the following three properties of relation-specific neurons. (i) Neuron cumulativity: multiple neurons jointly contribute to processing facts involving relation r, with no single neuron fully encoding a fact in r on its own. (ii) Neuron versatility: neurons can be shared across multiple closely related as well as less related relations; in addition, some relation neurons transfer across languages. (iii) Neuron interference: deactivating neurons specific to one relation can improve LLMs' factual recall performance for facts of other relations. We make our code and data publicly available at this https URL.
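The deactivation experiments can be sketched as zero-ablation of selected hidden units during a forward pass. The one-layer toy "model" below is purely illustrative; in practice this is done with forward hooks on specific transformer layers.

```python
def forward_with_ablation(x, weights, ablate=frozenset()):
    """One linear layer whose selected output units are zeroed (deactivated),
    mimicking the selective-deactivation intervention."""
    hidden = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
    return [0.0 if i in ablate else h for i, h in enumerate(hidden)]

x = [1.0, 2.0]
weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
full = forward_with_ablation(x, weights)                 # all units active
ablated = forward_with_ablation(x, weights, ablate={2})  # unit 2 silenced
```

Comparing downstream factual recall with and without the ablation, per relation, is what supports the cumulativity, versatility, and interference claims.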
Generative AI and Creative Work: Narratives, Values, and Impacts
06 Feb 2025
Generative AI has gained a significant foothold in the creative and artistic sectors. In this context, the concept of creative work is influenced by discourses originating from technological stakeholders and mainstream media. The framing of narratives surrounding creativity and artistic production not only reflects a particular vision of culture but also actively contributes to shaping it. In this article, we review online media outlets and analyze the dominant narratives they convey about AI's impact on creative work. We find that the discourse promotes a creativity freed from its material realisation through human labor. This separation of the idea from its material conditions is achieved by automation, which drives productive efficiency, measured as a reduction in the time taken to produce. The withdrawal of the skills typically required to execute the creative process is, in turn, presented as a means of democratising creativity. This discourse tends to align with the dominant techno-positivist vision and to assert power over the creative economy and culture.
GlotLID: Language Identification for Low-Resource Languages
02 Jul 2024
Several recent papers have published good solutions for language identification (LID) for about 300 high-resource and medium-resource languages. However, there is no LID system available that (i) covers a wide range of low-resource languages, (ii) is rigorously evaluated and reliable, and (iii) is efficient and easy to use. Here, we publish GlotLID-M, an LID model that satisfies the desiderata of wide coverage, reliability and efficiency. It identifies 1665 languages, a large increase in coverage compared to prior work. In our experiments, GlotLID-M outperforms four baselines (CLD3, FT176, OpenLID and NLLB) when balancing F1 and false positive rate (FPR). We analyze the unique challenges that low-resource LID poses: incorrect corpus metadata, leakage from high-resource languages, difficulty separating closely related languages, handling of macrolanguages vs. varieties, and noisy data in general. We hope that integrating GlotLID-M into dataset creation pipelines will improve quality and enhance accessibility of NLP technology for low-resource languages and cultures. GlotLID-M model (including future versions), code, and list of data sources are available: https://github.com/cisnlp/GlotLID.
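The F1-versus-FPR trade-off used here to compare LID systems reduces to standard confusion-matrix arithmetic; a minimal helper, with toy counts for a single language:

```python
def f1_and_fpr(tp, fp, fn, tn):
    """Per-language F1 and false positive rate from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    fpr = fp / (fp + tn)
    return f1, fpr

# Toy counts: 8 correct detections, 2 false alarms, 2 misses,
# and 88 correctly rejected out-of-language sentences.
f1, fpr = f1_and_fpr(tp=8, fp=2, fn=2, tn=88)
```

Balancing both matters for low-resource LID: a model can reach high F1 on a rare language while still flooding its corpus with false positives from high-resource neighbors, which FPR exposes.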
How Programming Concepts and Neurons Are Shared in Code Language Models
01 Jun 2025

LMU Munich and Sorbonne Université researchers investigate how code language models internally represent programming languages and English, using three interpretability methods. They find that English and dominant programming languages such as C# and C++ function as central pivot languages in the model's concept space. Language-specific neurons are concentrated in the bottom layers (0-4) for general constructs and in the top layers (29-31) for syntax-specific mappings. The study further shows that highly aligned programming languages share substantial neural representations, making truly exclusive neurons difficult to identify, and suggests opportunities for more efficient modular architectures that leverage shared representations across similar languages.

Operator Learning with Neural Fields: Tackling PDEs on General Geometries
30 Nov 2023

CORAL is a novel machine learning framework for solving Partial Differential Equations (PDEs) by learning mappings between function spaces on general geometries. It employs modulated Implicit Neural Representations to handle irregular spatial samplings and achieves competitive or superior performance across initial value problems, dynamics modeling, and geometry-aware inference tasks, demonstrating robustness and efficient inference.

Understanding In-Context Machine Translation for Low-Resource Languages: A Case Study on Manchu
29 May 2025

Researchers from LMU Munich and Sorbonne Université developed a systematic framework to improve in-context machine translation for low-resource languages, using Manchu as a case study. They found that providing high-quality dictionary entries and relevant parallel examples significantly enhances LLM performance, achieving a BLEU score of 12.35 with DeepSeek-V3, and successfully applied this to augment data for traditional NMT models.

Survey on Question Answering over Visually Rich Documents: Methods, Challenges, and Trends
10 Mar 2025
The field of visually-rich document understanding, which involves interacting with visually-rich documents (whether scanned or born-digital), is rapidly evolving and still lacks consensus on several key aspects of the processing pipeline. In this work, we provide a comprehensive overview of state-of-the-art approaches, emphasizing their strengths and limitations, pointing out the main challenges in the field, and proposing promising research directions.
Probing Language Models on Their Knowledge Source
09 Nov 2024
Large Language Models (LLMs) often encounter conflicts between their learned internal knowledge (parametric knowledge, PK) and external knowledge provided during inference (contextual knowledge, CK). Understanding how LLMs prioritize one knowledge source over the other remains a challenge. In this paper, we propose a novel probing framework to explore the mechanisms governing the selection between PK and CK in LLMs. Using controlled prompts designed to contradict the model's PK, we demonstrate that specific model activations are indicative of the knowledge source employed. We evaluate this framework on various LLMs of different sizes and demonstrate that mid-layer activations, particularly those related to relations in the input, are crucial in predicting knowledge source selection, paving the way for more reliable models capable of handling knowledge conflicts effectively.
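The probing setup, a small classifier on internal activations predicting the knowledge source, can be sketched with a perceptron on toy "activation" vectors. A real probe would read genuine mid-layer activations from prompts engineered to contradict the model's PK; the separable toy data below only illustrates the mechanics.

```python
def train_probe(X, y, lr=0.1, epochs=25):
    """Perceptron-style linear probe: 0 = parametric (PK), 1 = contextual (CK)."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for x, t in zip(X, y):
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
            err = t - pred
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]
            b += lr * err
    return w, b

# Linearly separable toy "activations": CK cases have a positive first coordinate.
X = [[2.0, 0.1], [1.5, -0.2], [-1.8, 0.3], [-2.2, 0.0]]
y = [1, 1, 0, 0]
w, b = train_probe(X, y)
preds = [1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0 for x in X]
```

High probe accuracy at a given layer is the evidence that the knowledge-source decision is linearly readable from that layer's activations.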
Eagle: Large-Scale Learning of Turbulent Fluid Dynamics with Mesh Transformers
17 Mar 2023
Estimating fluid dynamics is classically done through the simulation and integration of numerical models solving the Navier-Stokes equations, which is computationally complex and time-consuming even on high-end hardware. This is a notoriously hard problem to solve, which has recently been addressed with machine learning, in particular graph neural networks (GNN) and variants trained and evaluated on datasets of static objects in static scenes with fixed geometry. We attempt to go beyond existing work in complexity and introduce a new model, method and benchmark. We propose EAGLE, a large-scale dataset of 1.1 million 2D meshes resulting from simulations of unsteady fluid dynamics caused by a moving flow source interacting with nonlinear scene structure, comprising 600 different scenes of three different types. To perform future forecasting of pressure and velocity on the challenging EAGLE dataset, we introduce a new mesh transformer. It leverages node clustering, graph pooling and global attention to learn long-range dependencies between spatially distant data points without needing a large number of iterations, as existing GNN methods do. We show that our transformer achieves state-of-the-art performance on both existing synthetic and real datasets and on EAGLE. Finally, we highlight that our approach learns to attend to airflow, integrating complex information in a single iteration.
DIP: Unsupervised Dense In-Context Post-training of Visual Representations
09 Sep 2025
We introduce DIP, a novel unsupervised post-training method designed to enhance dense image representations in large-scale pretrained vision encoders for in-context scene understanding. Unlike prior approaches that rely on complex self-distillation architectures, our method trains the vision encoder using pseudo-tasks that explicitly simulate downstream in-context scenarios, inspired by meta-learning principles. To enable post-training on unlabeled data, we propose an automatic mechanism for generating in-context tasks that combines a pretrained diffusion model and the vision encoder itself. DIP is simple, unsupervised, and computationally efficient, requiring less than 9 hours on a single A100 GPU. By learning dense representations through pseudo in-context tasks, it achieves strong performance across a wide variety of downstream real-world in-context scene understanding tasks. It outperforms both the initial vision encoder and prior methods, offering a practical and effective solution for improving dense representations. Code available here: this https URL
ACCO: Accumulate While You Communicate for Communication-Overlapped Sharded LLM Training
Training LLMs relies on distributed implementations using multiple GPUs to compute gradients in parallel with sharded optimizers. However, synchronizing gradients in data parallel setups introduces communication overhead that grows with the number of workers, limiting parallelization efficiency. Local optimization algorithms reduce communications but incur high memory costs as they prevent optimizer state sharding, hindering scalability. To address this, we propose ACcumulate while COmmunicate (ACCO), a memory-efficient optimization algorithm for distributed LLM training. By synchronizing delayed gradients while computing new ones, ACCO reduces GPU idle time and supports heterogeneous hardware. To mitigate the convergence issues caused by delayed updates, we introduce a novel technique ensuring training dynamics align with standard distributed optimization. Compared to ZeRO-1, our approach is significantly faster and scales effectively across heterogeneous hardware.
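The core scheduling idea, applying gradient t-1 (already synchronized) while gradient t is still being communicated, can be sketched serially. The one-step delay is the point; everything else below is a toy, and the paper's separate mitigation of delay-induced convergence issues is not modeled here.

```python
def delayed_sgd(grads, lr=0.1, w0=0.0):
    """SGD where each gradient is applied one step late, modeling ACCO-style
    overlap of gradient communication with the next step's computation."""
    w, in_flight = w0, None
    for g in grads:
        if in_flight is not None:
            w -= lr * in_flight  # previous gradient finished its all-reduce
        in_flight = g            # this gradient is now "being communicated"
    if in_flight is not None:
        w -= lr * in_flight      # flush the last in-flight gradient
    return w

w = delayed_sgd([1.0, 2.0, 3.0])
```

Every gradient is eventually applied, so total progress matches plain SGD up to a one-step shift; the benefit is that communication never blocks computation.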
On the Entity-Level Alignment in Crosslingual Consistency
11 Oct 2025
Multilingual large language models (LLMs) are expected to recall factual knowledge consistently across languages. However, the factors that give rise to such crosslingual consistency -- and its frequent failure -- remain poorly understood. In this work, we hypothesize that these inconsistencies may arise from failures in entity alignment, the process of mapping subject and object entities into a shared conceptual space across languages. To test this, we assess alignment through entity-level (subject and object) translation tasks, and find that consistency is strongly correlated with alignment across all studied models, with misalignment of subjects or objects frequently resulting in inconsistencies. Building on this insight, we propose SubSub and SubInj, two effective methods that integrate English translations of subjects into prompts across languages, leading to substantial gains in both factual recall accuracy and consistency. Finally, our mechanistic analysis reveals that these interventions reinforce the entity representation alignment in the conceptual space through model's internal pivot-language processing, offering effective and practical strategies for improving multilingual factual prediction.
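As described, both interventions add the English translation of the subject to a non-English prompt: SubSub substitutes it, SubInj injects it alongside the original. The exact prompt formats below are illustrative assumptions, not the paper's templates.

```python
def subsub(template, subject_en):
    """SubSub sketch: replace the subject with its English translation."""
    return template.format(subject=subject_en)

def subinj(template, subject, subject_en):
    """SubInj sketch: keep the original subject, inject the English translation."""
    return template.format(subject=f"{subject} ({subject_en})")

# Hypothetical French factual-recall prompt.
template = "{subject} se trouve sur le continent de"
a = subsub(template, "Japan")
b = subinj(template, "Japon", "Japan")
```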