LIPN, Université Sorbonne Paris Nord
We propose a unified framework that integrates object detection (OD) and visual grounding (VG) for remote sensing (RS) imagery. To support conventional OD and establish an intuitive prior for the VG task, we fine-tune an open-set object detector on referring expression data, framing the problem as partially supervised OD. In the first stage, we construct a graph representation of each image, comprising object queries, class embeddings, and proposal locations. A task-aware architecture then processes this graph to perform VG. The model consists of: (i) a multi-branch network that integrates spatial, visual, and categorical features to generate task-aware proposals, and (ii) an object reasoning network that assigns probabilities across proposals, followed by a soft selection mechanism for final referring object localization. Our model demonstrates superior performance on the OPT-RSVG and DIOR-RSVG datasets, achieving significant improvements over state-of-the-art methods while retaining classical OD capabilities. The code will be available in our repository: \url{this https URL}.
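A minimal sketch of one plausible reading of the "probabilities across proposals + soft selection" step, not the authors' code: a reasoning network scores each task-aware proposal against the referring expression, and the final box is the probability-weighted combination of proposal boxes, which keeps the selection differentiable. All names and shapes here are illustrative assumptions.

```python
import torch

def soft_select_box(proposal_feats, proposal_boxes, text_emb, reasoner):
    """proposal_feats: (N, D) task-aware proposal features
    proposal_boxes: (N, 4) boxes in (cx, cy, w, h)
    text_emb: (D,) pooled referring-expression embedding
    reasoner: scores each proposal against the expression."""
    logits = reasoner(proposal_feats, text_emb)        # (N,)
    probs = torch.softmax(logits, dim=0)               # distribution over proposals
    # Soft selection: probability-weighted average of proposal boxes,
    # differentiable w.r.t. the reasoning scores.
    return (probs.unsqueeze(-1) * proposal_boxes).sum(dim=0)

# Stand-in reasoner: scaled dot product between proposal and text features.
def dot_reasoner(feats, text):
    return (feats @ text) / feats.shape[-1] ** 0.5

N, D = 16, 256
box = soft_select_box(torch.randn(N, D), torch.rand(N, 4),
                      torch.randn(D), dot_reasoner)
print(box.shape)  # torch.Size([4])
```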
In this paper, we propose a novel method for joint entity and relation extraction from unstructured text by framing it as a conditional sequence generation problem. In contrast to conventional generative information extraction models that are left-to-right token-level generators, our approach is \textit{span-based}. It generates a linearized graph where nodes represent text spans and edges represent relation triplets. Our method employs a transformer encoder-decoder architecture with a pointing mechanism over a dynamic vocabulary of spans and relation types. Our model can capture the structural characteristics and boundaries of entities and relations through span representations while simultaneously grounding the generated output in the original text thanks to the pointing mechanism. Evaluation on benchmark datasets validates the effectiveness of our approach, demonstrating competitive results. Code is available at this https URL.
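A hedged sketch of one decoding step with a pointing mechanism over a dynamic vocabulary; it illustrates the general technique (scoring decoder state against candidate span embeddings plus relation-type embeddings), not the paper's exact model:

```python
import torch

def pointer_step(dec_state, span_embs, rel_embs):
    """dec_state: (D,) decoder hidden state at this step
    span_embs: (S, D) embeddings of candidate text spans (dynamic part)
    rel_embs:  (R, D) embeddings of relation types (static part)
    Returns a distribution over the S + R symbols the decoder may emit."""
    vocab = torch.cat([span_embs, rel_embs], dim=0)        # (S + R, D)
    scores = vocab @ dec_state / dec_state.shape[-1] ** 0.5
    return torch.softmax(scores, dim=0)

S, R, D = 20, 5, 128
dist = pointer_step(torch.randn(D), torch.randn(S, D), torch.randn(R, D))
next_symbol = dist.argmax().item()  # index < S: point to a span; else: emit a relation type
```

Because the span part of the vocabulary is built from the input text itself, every generated node is grounded in an actual span of the source.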
Researchers at Massachusetts General Hospital and collaborating institutions developed a region-adaptive MRI super-resolution system that combines mixture-of-experts with diffusion models. Three expert networks dynamically adapt to tissue characteristics, enabling specialized processing of distinct anatomical regions and achieving superior image quality metrics.
Object detection in Remote Sensing Images (RSI) is a critical task for numerous applications in Earth Observation (EO). Unlike object detection in natural images, it faces a scarcity of annotated data and the presence of small objects represented by only a few pixels. Multi-modal fusion has been shown to enhance accuracy by combining data from multiple modalities such as RGB, infrared (IR), lidar, and synthetic aperture radar (SAR). To this end, fusing representations at the mid or late stage, as produced by parallel subnetworks, is the dominant approach, but it increases computational complexity in proportion to the number of modalities and creates additional engineering obstacles. Using the cross-attention mechanism, we propose a novel multi-modal fusion strategy that maps relationships between the different channels at the early stage, enabling the construction of a coherent input by aligning the different modalities. By addressing fusion at the early stage, as opposed to mid- or late-stage methods, our method achieves competitive and even superior performance compared to existing techniques. Additionally, we enhance the Swin transformer by integrating convolution layers into the feed-forward networks of its non-shifting blocks. This augmentation strengthens the model's capacity to merge separated windows through local attention, thereby improving small object detection. Extensive experiments demonstrate the effectiveness of the proposed multimodal fusion module and architecture, and their applicability to object detection in multimodal aerial imagery.
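A minimal sketch of early-stage fusion with cross-attention, in the spirit described above: tokens of one modality (say, RGB patches) attend to tokens of another (say, IR patches) to build a single aligned input for one backbone. The layer sizes and the residual fusion choice are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class EarlyCrossAttentionFusion(nn.Module):
    def __init__(self, dim=96, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, rgb_tokens, ir_tokens):
        # Query = RGB, Key/Value = IR: each RGB token gathers the IR
        # evidence aligned with it, yielding one coherent token stream.
        fused, _ = self.attn(rgb_tokens, ir_tokens, ir_tokens)
        return self.norm(rgb_tokens + fused)   # residual keeps RGB content

fuse = EarlyCrossAttentionFusion()
rgb = torch.randn(2, 1024, 96)   # (batch, tokens, dim) from RGB patch embedding
ir = torch.randn(2, 1024, 96)    # same token grid from the IR channel
x = fuse(rgb, ir)                # single input fed to one (Swin-style) backbone
```

Because fusion happens before the backbone, only one subnetwork is needed downstream, avoiding the per-modality branches of mid/late fusion.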
In recent years, multimodal learning has become essential in robotic vision and information fusion, especially for understanding human behavior in complex environments. However, current methods struggle to fully leverage the textual modality, relying on supervised pretrained models, which limits semantic extraction in unsupervised robotic environments, particularly with significant modality loss. These methods also tend to be computationally intensive, leading to high resource consumption in real-world applications. To address these challenges, we propose the Multi Modal Mamba Enhanced Transformer (M3ET), a lightweight model designed for efficient multimodal learning, particularly on mobile platforms. By incorporating the Mamba module and a semantic-based adaptive attention mechanism, M3ET optimizes feature fusion, alignment, and modality reconstruction. Our experiments show that M3ET improves cross-task performance, with a 2.3 times increase in pretraining inference speed. In particular, the core VQA task accuracy of M3ET remains at 0.74, while the model's parameter count is reduced by 0.67. Although performance on the EQA task is limited, M3ET's lightweight design makes it well suited for deployment on resource-constrained robotic platforms.
We propose a definition of quantum computable functions as mappings from superpositions of natural numbers to probability distributions of natural numbers. Each function is obtained as a limit of an infinite computation of a quantum Turing machine. The class of quantum computable functions is recursively enumerable, thus opening the door to a quantum computability theory which may follow some of the classical developments.
The GraphER model reframes information extraction as a Graph Structure Learning task, utilizing a Token Graph Transformer to explicitly model and refine the graph of entities and relations. The model achieved competitive performance on the ACE05 dataset and state-of-the-art results on the CoNLL04 and SciERC benchmarks for joint entity and relation extraction.
Biological control strategies against mosquito-borne diseases--such as the sterile insect technique (SIT), RIDL, and Wolbachia-based releases--require reliable estimates of dispersal and survival of released males. We propose a mechanistic--statistical framework for mark--release--recapture (MRR) data linking an individual-based 2D diffusion model with its reaction--diffusion limit. Inference is based on solving the macroscopic system and embedding it in a Poisson observation model for daily trap counts, with uncertainty quantified via a parametric bootstrap. We validate identifiability using simulated data and apply the model to an urban MRR campaign in El Cano (Havana, Cuba) involving four weekly releases of sterile Aedes aegypti males. The best-supported model suggests a mean life expectancy of about five days and a typical displacement of about 180 m. Unlike empirical fits of survival or dispersal, our mechanistic approach jointly estimates movement, mortality, and capture, yielding biologically interpretable parameters and a principled framework for designing and evaluating SIT-based interventions.
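A hedged numerical sketch of the mechanistic-statistical idea: a 2D diffusion-mortality kernel gives the expected male density at each trap, and daily counts are scored with a Poisson likelihood. The parameter names (D, mu, capture coefficient c) and the point-release simplification are generic assumptions, not the paper's notation.

```python
import numpy as np
from scipy.stats import poisson

def density(r, t, n_released, D, mu):
    """Point-release solution of u_t = D*Laplacian(u) - mu*u in 2D."""
    return n_released * np.exp(-r**2 / (4 * D * t) - mu * t) / (4 * np.pi * D * t)

def log_likelihood(counts, trap_dists, days, n_released, D, mu, c):
    """counts[i, j]: captures at trap i on day j; c: daily capture coefficient."""
    ll = 0.0
    for i, r in enumerate(trap_dists):
        for j, t in enumerate(days):
            lam = c * density(r, t, n_released, D, mu)
            ll += poisson.logpmf(counts[i, j], max(lam, 1e-12))
    return ll

# Toy use: 3 traps, 5 daily collections after a single release of 10,000 males.
rng = np.random.default_rng(0)
dists, days = np.array([50.0, 120.0, 200.0]), np.arange(1, 6)
counts = rng.poisson(5.0, size=(3, 5))
print(log_likelihood(counts, dists, days, 1e4, D=5e3, mu=0.2, c=10.0))
```

Maximizing this likelihood jointly over (D, mu, c) is what lets movement, mortality, and capture be estimated together rather than fitted separately.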
Even though Variational Autoencoders (VAEs) are widely used for semi-supervised learning, the reason why they work remains unclear. In fact, the addition of the unsupervised objective is most often vaguely described as a regularization. The strength of this regularization is controlled by down-weighting the objective on the unlabeled part of the training set. Through an analysis of the objective of semi-supervised VAEs, we observe that they use the posterior of the learned generative model to guide the inference model in learning the partially observed latent variable. We show that given this observation, it is possible to gain finer control on the effect of the unsupervised objective on the training procedure. Using importance weighting, we derive two novel objectives that prioritize either the partially observed latent variable or the unobserved latent variable. Experiments on the IMDB English sentiment analysis dataset and on the AG News topic classification dataset show the improvements brought by our prioritization mechanism and exhibit a behavior that is in line with our description of the inner workings of semi-supervised VAEs.
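For concreteness, a sketch of the standard semi-supervised VAE (M2-style) unlabeled objective that such an analysis starts from: the partially observed variable y is marginalized under the inference model q(y|x), so the generative model's fit for each candidate y is what steers q(y|x). This illustrates the baseline setup, not the paper's two new importance-weighted objectives; all networks are stand-ins.

```python
import torch

def unlabeled_elbo(x, q_y_logits, elbo_given_y):
    """q_y_logits: (B, C) logits of q(y|x);
    elbo_given_y(x, c) -> (B,) ELBO of the full VAE with y clamped to class c."""
    q_y = torch.softmax(q_y_logits, dim=-1)                      # (B, C)
    per_class = torch.stack([elbo_given_y(x, c)
                             for c in range(q_y.shape[-1])], dim=-1)
    entropy = -(q_y * torch.log(q_y + 1e-8)).sum(-1)
    # Expected ELBO under q(y|x) plus its entropy: the generative model's
    # preference over y trains the classifier on unlabeled data.
    return (q_y * per_class).sum(-1) + entropy

B, C = 8, 2
x = torch.randn(B, 16)
fake = lambda x, c: -((x.mean(-1) - c) ** 2)   # toy per-class ELBO
print(unlabeled_elbo(x, torch.randn(B, C), fake).mean())
```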
The Unified Modeling Language (UML) is a standard for modeling dynamic systems. UML behavioral state machines are used for modeling the dynamic behavior of object-oriented designs. The UML specification, maintained by the Object Management Group (OMG), is documented in natural language (in contrast to formal language). The inherent ambiguity of natural languages may introduce inconsistencies in the resulting state machine model. Formalizing UML state machine specification aims at solving the ambiguity problem and at providing a uniform view to software designers and developers. Such a formalization also aims at providing a foundation for automatic verification of UML state machine models, which can help to find software design vulnerabilities at an early stage and reduce the development cost. We provide here a comprehensive survey of existing work from 1997 to 2021 related to formalizing UML state machine semantics for the purpose of conducting model checking at the design stage.
The Game Description Language (GDL) is a widely used formalism for specifying the rules of general games. Writing correct GDL descriptions can be challenging, especially for non-experts. Automated theorem proving has been proposed to assist game design by verifying if a GDL description satisfies desirable logical properties. However, when a description is proved to be faulty, the repair task itself can only be done manually. Motivated by the work on repairing unsolvable planning domain descriptions, we define a more general problem of finding minimal repairs for GDL descriptions that violate formal requirements, and we provide complexity results for various computational problems related to minimal repair. Moreover, we present an Answer Set Programming-based encoding for solving the minimal repair problem and demonstrate its application for automatically repairing ill-defined game descriptions.
Diffusion models have emerged as a leading framework for high-quality image generation, offering stable training and strong performance across diverse domains. However, they remain computationally intensive, particularly during the iterative denoising process. Latent-space models like Stable Diffusion alleviate some of this cost by operating in compressed representations, though at the expense of fine-grained detail. More recent approaches such as Retrieval-Augmented Diffusion Models (RDM) address efficiency by conditioning denoising on similar examples retrieved from large external memory banks. While effective, these methods introduce drawbacks: they require costly storage and retrieval infrastructure, depend on static vision-language models like CLIP for similarity, and lack adaptability during training. We propose the Prototype Diffusion Model (PDM), a method that integrates prototype learning directly into the diffusion process for efficient and adaptive visual conditioning - without external memory. Instead of retrieving reference samples, PDM constructs a dynamic set of compact visual prototypes from clean image features using contrastive learning. These prototypes guide the denoising steps by aligning noisy representations with semantically relevant visual patterns, enabling efficient generation with strong semantic grounding. Experiments show that PDM maintains high generation quality while reducing computational and storage overhead, offering a scalable alternative to retrieval-based conditioning in diffusion models.
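A hedged sketch of conditioning a denoiser on a learned prototype bank instead of retrieved neighbors: noisy-latent tokens cross-attend to a small set of trainable prototype vectors. Sizes are illustrative, and this is one reading of the idea rather than the paper's implementation; in PDM the prototypes are additionally trained contrastively against clean image features.

```python
import torch
import torch.nn as nn

class PrototypeConditioning(nn.Module):
    def __init__(self, dim=256, n_prototypes=64, heads=4):
        super().__init__()
        # Compact, trainable visual prototypes: no external memory bank,
        # no retrieval infrastructure at sampling time.
        self.prototypes = nn.Parameter(torch.randn(n_prototypes, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, noisy_tokens):
        # Align noisy representations with semantically relevant prototypes.
        proto = self.prototypes.unsqueeze(0).expand(noisy_tokens.shape[0], -1, -1)
        guided, _ = self.attn(noisy_tokens, proto, proto)
        return noisy_tokens + guided          # residual guidance signal

cond = PrototypeConditioning()
z = torch.randn(4, 16 * 16, 256)              # noisy latent tokens at one step
print(cond(z).shape)                          # torch.Size([4, 256, 256])
```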
Several methods for triclustering three-dimensional data require the cluster size or the number of clusters in each dimension to be specified. To address this issue, Multi-Slice Clustering (MSC) for 3-order tensors finds signal slices that lie in a low-dimensional subspace of a rank-one tensor dataset and identifies a cluster based on a similarity threshold. We propose an extension algorithm, MSC-DBSCAN, to extract the different clusters of slices that lie in different subspaces when the dataset is a sum of r rank-one tensors (r > 1). Our algorithm takes the same input as MSC and finds the same solution as MSC on rank-one tensor data.
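A hedged sketch of the underlying idea: summarize each slice of a 3-order tensor by a spectral signature, then let DBSCAN group slices whose signatures lie in the same subspace, so clusters coming from several rank-one components (r > 1) can be separated. The signature choice and the eps value are assumptions for illustration, not the algorithm's exact definitions.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def slice_signatures(T):
    """T: (n1, n2, n3) tensor; one signature per mode-1 slice."""
    sigs = []
    for i in range(T.shape[0]):
        # Leading right singular vector of the slice, sign-normalized.
        _, _, vt = np.linalg.svd(T[i], full_matrices=False)
        v = vt[0]
        sigs.append(v * np.sign(v[np.argmax(np.abs(v))]))
    return np.array(sigs)

rng = np.random.default_rng(1)
u1 = rng.normal(size=20); u1 /= np.linalg.norm(u1)
u2 = rng.normal(size=20); u2 /= np.linalg.norm(u2)
# Two rank-one signal groups plus pure-noise slices (a sum of 2 components).
T = rng.normal(scale=0.1, size=(30, 20, 20))
T[:10] += 2.0 * np.outer(u1, u1)
T[10:20] += 2.0 * np.outer(u2, u2)

labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(slice_signatures(T))
print(labels)   # two dense clusters of slices; noise slices labeled -1
```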
Parametric timed automata are a powerful formalism for reasoning about concurrent real-time systems with unknown or uncertain timing constants. Reducing their state space is a significant way to reduce the inherently large analysis times. We present here different merging reduction techniques based on the convex union of constraints (parametric zones), which decrease the number of states while preserving the correctness of verification and synthesis results. We perform extensive experiments and identify the best heuristics in practice, obtaining a significant decrease in computation time on a benchmark library.
Failure injection in distributed systems has been an important issue for experimenting with robust, resilient distributed systems. To reproduce real-life conditions, parts of the application must be killed without letting the operating system close the existing network communications in a "clean" way; when a process is simply killed, the OS closes them. SystemTap is an infrastructure that probes the Linux kernel's internal calls. If processes are killed at kernel level, they can be destroyed without letting the OS do anything else. In this paper, we present a kernel-level failure injection system based on SystemTap and show how it can be used to implement deterministic and probabilistic failure scenarios.
We prove that, for $X$, $Y$, $A$ and $B$ matrices with entries in a non-commutative ring such that $[X_{ij}, Y_{k\ell}] = -A_{i\ell} B_{kj}$, satisfying suitable commutation relations (in particular, $X$ is a Manin matrix), the following identity holds: $\mathrm{coldet}\, X \,\mathrm{coldet}\, Y = \langle 0 |\, \mathrm{coldet}\big(a A + X (I - a^{\dagger} B)^{-1} Y\big) \,| 0 \rangle$. Furthermore, if also $Y$ is a Manin matrix, $\mathrm{coldet}\, X \,\mathrm{coldet}\, Y = \int \mathcal{D}(\psi, \psi^{\dagger}) \, \exp\big[ \sum_{k \geq 0} \frac{1}{k+1} (\psi^{\dagger} A \psi)^{k} (\psi^{\dagger} X B^{k} Y \psi) \big]$. Notation: $\langle 0 |$ and $| 0 \rangle$ are respectively the bra and the ket of the ground state, $a^{\dagger}$ and $a$ the creation and annihilation operators of a quantum harmonic oscillator, while $\psi^{\dagger}_i$ and $\psi_i$ are Grassmann variables in a Berezin integral. These results should be seen as a generalization of the classical Cauchy-Binet formula, in which $A$ and $B$ are null matrices, and of the non-commutative generalization, the Capelli identity, in which $A$ and $B$ are identity matrices and $[X_{ij}, X_{k\ell}] = [Y_{ij}, Y_{k\ell}] = 0$.
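For orientation, the commutative special case recovered when $A$ and $B$ vanish is the classical Cauchy-Binet formula; its standard statement (quoted here for reference, not from the abstract) is:

```latex
% Classical Cauchy-Binet formula: for X of size m x n and Y of size n x m
% with commuting entries and m <= n,
\[
  \det(XY) \;=\; \sum_{\substack{S \subseteq \{1,\dots,n\} \\ |S| = m}}
  \det\!\big(X_{[m],\,S}\big)\,\det\!\big(Y_{S,\,[m]}\big),
\]
% where X_{[m],S} keeps the columns of X indexed by S, and Y_{S,[m]} the
% corresponding rows of Y.
```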
Real-world tabular learning production scenarios typically involve evolving data streams, where data arrives continuously and its distribution may change over time. In such a setting, most studies in the supervised learning literature favor instance incremental algorithms due to their ability to adapt to changes in the data distribution. Another significant reason for choosing these algorithms is to \textit{avoid storing observations in memory}, as is commonly done in batch incremental settings. However, the design of instance incremental algorithms often assumes immediate availability of labels, which is an optimistic assumption. In many real-world scenarios, such as fraud detection or credit scoring, labels may be delayed. Consequently, batch incremental algorithms are widely used in many real-world tasks. This raises an important question: "In delayed settings, is instance incremental learning the best option regarding predictive performance and computational efficiency?" Unfortunately, this question has not been studied in depth, probably due to the scarcity of real datasets containing delayed information. In this study, we conduct a comprehensive empirical evaluation and analysis of this question using a real-world fraud detection problem and commonly used generated datasets. Our findings indicate that instance incremental learning is not the superior option when considering, on one side, state-of-the-art models such as Adaptive Random Forest (ARF) and, on the other side, batch learning models such as XGBoost. Additionally, when considering the interpretability of the learning systems, batch incremental solutions tend to be favored. Code: \url{this https URL}
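A hedged sketch of the delayed-label prequential protocol such a study evaluates: every example is scored on arrival, but its label only becomes available a fixed number of steps later, and only then may the model learn from it. The river-style predict_one/learn_one interface and the fixed delay are illustrative assumptions; a batch incremental baseline would instead refit periodically on the released pairs.

```python
from collections import deque

def prequential_with_delay(stream, model, delay):
    """stream: iterable of (x, y) in arrival order; y is hidden for `delay` steps."""
    pending, correct, seen = deque(), 0, 0
    for t, (x, y) in enumerate(stream):
        correct += int(model.predict_one(x) == y)   # test-then-train: test part
        seen += 1
        pending.append((t, x, y))
        # Release labels whose delay has elapsed; only now can the model train.
        while pending and pending[0][0] + delay <= t:
            _, x_old, y_old = pending.popleft()
            model.learn_one(x_old, y_old)
    return correct / max(seen, 1)
```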
Aperiodic tilings are non-periodic tilings characterized by local constraints. They play a key role in the proof of the undecidability of the domino problem (1964) and naturally model quasicrystals (discovered in 1982). A central question is to characterize, among a class of non-periodic tilings, the aperiodic ones. In this paper, we answer this question for the well-studied class of non-periodic tilings obtained by digitizing irrational vector spaces. Namely, we prove that such tilings are aperiodic if and only if the digitized vector spaces are computable.
Plastic flow is conventionally treated as continuous in finite element (FE) codes, whether in isotropic plasticity, anisotropic plasticity, or crystal plasticity. This approach, derived from continuum mechanics, contradicts the intermittent nature of plasticity at the elementary scale. Understanding crystal plasticity at the micro-scale opens the door to new engineering applications, such as microscale machining. In this work, a new approach is proposed to account for the intermittence of plastic deformation while remaining within the framework of continuum mechanics. We introduce a material parameter, the plastic deformation threshold, denoted $\Delta p_{min}$, corresponding to the plastic deformation carried by the minimal plastic deformation burst within the material. The incremental model is based on the traditional predictor-corrector algorithm to calculate the elastoplastic behavior of a material subjected to any external loading. The model is presented within the framework of small deformations for von Mises plasticity. To highlight the main features of the approach, the plastic strain increment is calculated using the normality rule and consistency conditions, and is accepted only if it exceeds $\Delta p_{min}$. To achieve this, a time-discontinuous generalization of the Karush-Kuhn-Tucker (KKT) conditions is proposed. The simulations show that the introduction of the plastic threshold allows for the reproduction of the spatiotemporal intermittence of plastic flow, capturing the self-organization of plastic flow in complex loading scenarios within an FE model.
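A hedged 1D sketch of the thresholded predictor-corrector described above: the usual elastic predictor / plastic corrector return mapping computes a candidate plastic increment, but the correction is only accepted once it reaches the burst size $\Delta p_{min}$; below it, the step stays elastic. Linear isotropic hardening and all material numbers are illustrative assumptions, not the paper's values.

```python
import numpy as np

def step(stress, p, d_eps, E=200e3, H=1e3, sigma_y=250.0, dp_min=1e-4):
    """One strain increment d_eps (uniaxial von Mises, small strains, MPa units).
    Returns updated (stress, accumulated plastic strain p)."""
    trial = stress + E * d_eps                    # elastic predictor
    f = abs(trial) - (sigma_y + H * p)            # yield function
    if f <= 0.0:
        return trial, p                           # elastic step
    dp = f / (E + H)                              # standard return-mapping increment
    if dp < dp_min:                               # burst below the material threshold:
        return trial, p                           # plastic flow is *not* triggered yet
    # Accepted burst: correct the stress along the normality direction.
    return trial - E * dp * np.sign(trial), p + dp

stress, p = 0.0, 0.0
for d_eps in np.full(200, 2e-5):                  # monotonic tension
    stress, p = step(stress, p, d_eps)
print(round(stress, 1), p)
```

Running this shows the intended behavior: stress ratchets above the yield surface until the candidate increment reaches dp_min, then relaxes in a discrete burst, producing an intermittent staircase in p instead of smooth flow.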
Communication-avoiding algorithms use redundant computations to minimize the number of inter-process communications. In this paper, we propose to exploit this redundancy for fault-tolerance purposes. We illustrate this idea with the QR factorization of tall and skinny matrices, and we evaluate the number of failures our algorithm can tolerate under different semantics.
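A minimal sketch of the tall-and-skinny QR (TSQR) reduction this builds on: each process factors its row block locally, then the small R factors are combined by a second QR. Since any local R can be recomputed from its row block alone, the reduction can tolerate the loss of a process when the block is available elsewhere, which is the redundancy exploited for fault tolerance. This serial NumPy version only illustrates the algebra, not the distributed implementation.

```python
import numpy as np

def tsqr_R(A, n_blocks):
    """R factor of a tall-skinny A via one level of TSQR reduction."""
    blocks = np.array_split(A, n_blocks, axis=0)
    Rs = [np.linalg.qr(b, mode="r") for b in blocks]   # local, independent QRs
    return np.linalg.qr(np.vstack(Rs), mode="r")       # combine the small R factors

rng = np.random.default_rng(2)
A = rng.normal(size=(4000, 8))                          # tall and skinny
R = tsqr_R(A, n_blocks=4)
# Same triangular factor (up to row signs) as a direct QR:
R_ref = np.linalg.qr(A, mode="r")
print(np.allclose(np.abs(R), np.abs(R_ref), atol=1e-6))
```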