Institut Agro
Researchers developed a family of model-free Lloyd-type algorithms for node clustering in Stochastic Block Model (SBM) type graphs, demonstrating substantially faster computation and comparable or lower estimation error than existing state-of-the-art methods. The approach provides strong theoretical consistency guarantees and successfully identifies social roles in animal interaction networks.
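As a concrete illustration of the Lloyd-type idea, here is a minimal Python sketch of a Lloyd-style node clustering loop on an adjacency matrix: nodes are treated as points (their adjacency rows), and the loop alternates between estimating per-cluster connectivity profiles and reassigning nodes. This is a generic sketch of the family of algorithms described, not the authors' exact implementation; all names and parameters are illustrative.

```python
import numpy as np

def lloyd_sbm(A, K, n_iter=20, seed=0):
    """Lloyd-style node clustering on an adjacency matrix A (n x n).

    Alternates between (1) estimating the connection-probability profile of
    each cluster and (2) reassigning every node to the cluster whose profile
    best matches its row of A. Generic sketch, not the paper's exact algorithm.
    """
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    z = rng.integers(0, K, size=n)          # random initial labels
    for _ in range(n_iter):
        # Step 1: mean adjacency row ("centroid") of each cluster
        centroids = np.stack([A[z == k].mean(axis=0) if np.any(z == k)
                              else rng.random(n) for k in range(K)])
        # Step 2: reassign each node to the nearest centroid (squared loss)
        dists = ((A[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        z_new = dists.argmin(axis=1)
        if np.array_equal(z_new, z):
            break
        z = z_new
    return z

# Toy SBM: two communities, intra-probability 0.3, inter-probability 0.05
rng = np.random.default_rng(1)
n, z_true = 200, np.repeat([0, 1], 100)
P = np.where(z_true[:, None] == z_true[None, :], 0.3, 0.05)
A = (rng.random((n, n)) < P).astype(float)
A = np.triu(A, 1); A = A + A.T                 # symmetric, no self-loops
print(lloyd_sbm(A, K=2)[:10])
```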
The synthetic difference-in-differences method provides an efficient way to estimate a causal effect under a latent factor model. However, it relies on panel data. This paper presents an adaptation of the synthetic difference-in-differences method to repeated cross-sectional data. The treatment is considered to be at the group level, so that data can be aggregated by group to compute the two usual types of synthetic difference-in-differences weights on the aggregated data. I then develop and compute a third type of weight that accounts for the different number of observations in each cross-section. Simulation results show that the performance of the synthetic difference-in-differences estimator on repeated cross-sectional data improves when this third type of weight is used.
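The aggregation step can be made concrete with a small pandas sketch. The group-level and time-level synthetic difference-in-differences weights themselves are omitted here (they would be computed on the aggregated panel); the snippet only shows the group-by-period aggregation and one plausible form of the third, size-based weight. Column names and the weighting formula are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# A repeated cross-section: fresh individuals in every period
df = pd.DataFrame({
    "group":  rng.integers(0, 5, 10_000),   # 5 groups; treatment at group level
    "period": rng.integers(0, 8, 10_000),   # 8 periods
    "y":      rng.normal(size=10_000),
})
df["treated"] = ((df["group"] == 4) & (df["period"] >= 5)).astype(int)

# Aggregate individual observations into group-period cells
cells = df.groupby(["group", "period"]).agg(y_mean=("y", "mean"),
                                            n_obs=("y", "size")).reset_index()

# Third weight type (hypothetical form): cells with more observations get
# more weight, since their means are estimated more precisely
cells["size_weight"] = cells["n_obs"] / cells["n_obs"].sum()

# The group x period panel on which the usual SDID weights would be computed
panel = cells.pivot(index="group", columns="period", values="y_mean")
print(panel.round(2))
```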
We present ProsAudit, a benchmark in English to assess structural prosodic knowledge in self-supervised learning (SSL) speech models. It consists of two subtasks, their corresponding metrics, and an evaluation dataset. In the protosyntax task, the model must correctly identify strong versus weak prosodic boundaries. In the lexical task, the model needs to correctly distinguish between pauses inserted between words and within words. We also provide human evaluation scores on this benchmark. We evaluated a series of SSL models and found that they were all able to perform above chance on both tasks, even when evaluated on an unseen language. However, non-native models performed significantly worse than native ones on the lexical task, highlighting the importance of lexical knowledge for this task. We also found a clear effect of training-set size, with models trained on more data performing better on both subtasks.
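Benchmarks of this kind are typically scored by pairwise accuracy: the model should assign a higher score (e.g., a log-probability) to the prosodically correct variant of an item than to the incorrect one. The sketch below shows this generic scoring scheme; ProsAudit's exact protocol may differ in detail.

```python
import numpy as np

def pairwise_accuracy(scores_correct, scores_incorrect):
    """Fraction of pairs where the model scores the prosodically correct
    variant above the incorrect one (0.5 = chance). Generic sketch of
    pair-based benchmark scoring; ties count as half."""
    s_c = np.asarray(scores_correct)
    s_i = np.asarray(scores_incorrect)
    return float(np.mean(s_c > s_i) + 0.5 * np.mean(s_c == s_i))

# Hypothetical model log-probabilities for 1000 item pairs
rng = np.random.default_rng(0)
good = rng.normal(0.2, 1.0, 1000)   # correct variants score a bit higher
bad = rng.normal(0.0, 1.0, 1000)
print(f"accuracy: {pairwise_accuracy(good, bad):.3f}")
```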
In this paper, we consider the problem of seriation of a permuted structured matrix based on noisy observations. The entries of the matrix quantify the expected interaction between two objects: the higher the value, the closer the objects. A popular structured class for modelling such matrices is the permuted Robinson class, namely the set of matrices whose coefficients decrease away from the diagonal, up to a permutation of the rows and columns. We consider two submodels of Robinson matrices: the Toeplitz model and the latent position model. We provide a computational lower bound based on the low-degree paradigm, which hints that there is a statistical-computational gap for seriation when measuring the error in the Frobenius norm. We also provide a simple polynomial-time algorithm that achieves this lower bound. Along the way, we characterize the information-theoretically optimal risk, thereby giving evidence for the extent of the computation/information gap for this problem.
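The abstract does not spell out the algorithm, but a classical polynomial-time baseline for this kind of problem is spectral seriation: order the objects by the Fiedler vector of the similarity matrix's graph Laplacian (Atkins et al.). The sketch below illustrates that baseline on a noisy permuted Toeplitz-Robinson matrix; it is not necessarily the algorithm analyzed in the paper.

```python
import numpy as np

def spectral_seriation(S):
    """Order objects by the Fiedler vector of the Laplacian of the
    similarity matrix: a classical polynomial-time seriation heuristic."""
    d = S.sum(axis=1)
    L = np.diag(d) - S                  # unnormalized graph Laplacian
    vals, vecs = np.linalg.eigh(L)      # eigenvalues in ascending order
    fiedler = vecs[:, 1]                # eigenvector of 2nd-smallest eigenvalue
    return np.argsort(fiedler)

# Noisy permuted Robinson (here Toeplitz) matrix: S_ij decreases with |i - j|
rng = np.random.default_rng(0)
n = 50
base = np.exp(-np.abs(np.subtract.outer(np.arange(n), np.arange(n))) / 10)
perm = rng.permutation(n)
S = base[np.ix_(perm, perm)] + 0.05 * rng.normal(size=(n, n))
S = (S + S.T) / 2
order = spectral_seriation(S)
print(perm[order][:10])  # should be (close to) monotone
```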
Earthworms are key drivers of soil function, influencing organic matter turnover, nutrient cycling, and soil structure. Understanding the environmental controls on their distribution is essential for predicting the impacts of land use and climate change on soil ecosystems. While local studies have identified abiotic drivers of earthworm communities, broad-scale spatial patterns remain underexplored. We developed a multi-species, multi-task deep learning model to jointly predict the distribution of 77 earthworm species across metropolitan France, using historical (1960-1970) and contemporary (1990-2020) records. The model integrates climate, soil, and land cover variables to estimate habitat suitability. We applied SHapley Additive exPlanations (SHAP) to identify key environmental drivers and used species clustering to reveal ecological response groups. The joint model achieved high predictive performance (TSS >= 0.7) and improved predictions for rare species compared to traditional species distribution models. Shared feature extraction across species allowed for more robust identification of common and contrasting environmental responses. Precipitation variability, temperature seasonality, and land cover emerged as dominant predictors of earthworm distribution. Species clustering revealed distinct ecological strategies tied to climatic and land use gradients. Our study advances both the methodological and ecological understanding of soil biodiversity. We demonstrate the utility of interpretable deep learning approaches for large-scale soil fauna modeling and provide new insights into earthworm habitat specialization. These findings support improved soil biodiversity monitoring and conservation planning in the face of global environmental change.
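A minimal sketch of such a multi-species, multi-task architecture, assuming a shared feature extractor over environmental covariates with one presence/absence head per species (dimensions and layer sizes are illustrative, not the authors' exact network):

```python
import torch
import torch.nn as nn

N_FEATURES, N_SPECIES = 30, 77   # climate/soil/land-cover covariates; species

class MultiSpeciesNet(nn.Module):
    def __init__(self, n_features, n_species, hidden=128):
        super().__init__()
        self.trunk = nn.Sequential(                 # shared representation
            nn.Linear(n_features, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.heads = nn.Linear(hidden, n_species)   # one logit per species

    def forward(self, x):
        return self.heads(self.trunk(x))            # logits, shape (B, n_species)

model = MultiSpeciesNet(N_FEATURES, N_SPECIES)
loss_fn = nn.BCEWithLogitsLoss()                    # joint presence/absence loss
x = torch.randn(64, N_FEATURES)                     # dummy environmental batch
y = torch.randint(0, 2, (64, N_SPECIES)).float()    # dummy presence labels
loss = loss_fn(model(x), y)
loss.backward()
print(loss.item())
```

The shared trunk is what lets data-rich species regularize the representation used by rare ones, which is the mechanism the abstract credits for the improved rare-species predictions.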
This paper investigates the causal impact of the parental environment on students' academic performance in mathematics, literature, and English (as a foreign language), using a new database covering all children aged 8 to 15 in the Community of Madrid from 2016 to 2019. Parental environment refers here to the parents' level of education (i.e., the skills they acquired before bringing up their children) and parental investment (the effort made by parents to bring up their children). We distinguish the persistent effect of the parental environment from the so-called Matthew effect, which describes a possible tendency for the impact of the parental environment to increase as the child grows up. Whatever the subject (mathematics, literature or English), our results are in line with most studies concerning the persistent effect: a favourable parental environment goes hand in hand with better results for the children. As regards the Matthew effect, the results differ between subjects: while the impact of the parental environment tends to diminish from age 8 to 15 in mathematics, it follows a bell curve in literature (first increasing, then decreasing) and increases steadily in English. This result, which is encouraging for mathematics and even literature, confirms the social dimension involved in learning a foreign language compared with more academic subjects.
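The persistent effect versus the Matthew effect can be read off a single regression with an age interaction: the main parental-environment coefficient captures the persistent effect, and the interaction term captures how that effect changes with age. A sketch on synthetic data (not the Madrid database), with illustrative variable names:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 5_000
df = pd.DataFrame({
    "age": rng.integers(8, 16, n),
    "parental_env": rng.normal(size=n),   # index of education + investment
})
# Simulate a persistent effect (0.5) plus a declining age interaction (-0.03),
# i.e. the mathematics-like pattern described in the abstract
df["score"] = (0.5 * df["parental_env"]
               - 0.03 * df["parental_env"] * (df["age"] - 8)
               + rng.normal(size=n))

fit = smf.ols("score ~ parental_env * age", data=df).fit()
# Persistent effect and Matthew-effect (interaction) coefficients
print(fit.params[["parental_env", "parental_env:age"]])
```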
To get a good understanding of a dynamical system, it is convenient to have an interpretable and versatile model of it. Timed discrete event systems are a kind of model that meets these requirements. However, such models can be inferred from timestamped event sequences but not directly from numerical data. To solve this problem, a discretization step must be performed to identify events or symbols in the time series. Persist is a discretization method that aims to create persisting symbols using a measure called the persistence score. This mitigates the risk of undesirable symbol changes that would lead to an overly complex model. After studying the persistence score, we point out that it tends to favor extreme cases, causing it to miss interesting persisting symbols. To correct this behavior, we replace the metric used in the persistence score, the Kullback-Leibler divergence, with the Wasserstein distance. Experiments show that the improved persistence score enhances Persist's ability to capture the information of the original time series and makes it better suited for discrete event systems learning.
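To make the modification concrete: the persistence score compares, for each symbol, its self-transition probability with its marginal probability, via a divergence between the two Bernoulli laws. For Bernoulli(p) versus Bernoulli(q), the 1-Wasserstein distance reduces to |p - q|. The sketch below implements both variants under this reading; the exact form of the original Persist score may differ in detail.

```python
import numpy as np

def persistence_scores(symbols, n_symbols, metric="wasserstein"):
    """Per-symbol persistence: compare each symbol's self-transition
    probability with its marginal probability. Original Persist uses a
    symmetrized KL divergence between the two Bernoulli laws; the paper
    replaces it with the Wasserstein distance (|p - q| for Bernoullis)."""
    s = np.asarray(symbols)
    scores = np.empty(n_symbols)
    for j in range(n_symbols):
        marg = np.mean(s == j)                            # P(s_t = j)
        prev = s[:-1] == j
        self_tr = np.mean(s[1:][prev] == j) if prev.any() else marg
        if metric == "wasserstein":
            d = abs(self_tr - marg)
        else:                                             # symmetric KL
            eps = 1e-9
            p, q = np.clip(self_tr, eps, 1 - eps), np.clip(marg, eps, 1 - eps)
            d = (p - q) * (np.log(p / q) - np.log((1 - p) / (1 - q)))
        scores[j] = np.sign(self_tr - marg) * d           # >0 means persisting
    return scores

# A symbol sequence with long runs (persistent) vs a memoryless one
rng = np.random.default_rng(0)
runs = np.repeat(rng.integers(0, 3, 40), 25)              # persisting symbols
noise = rng.integers(0, 3, 1000)                          # memoryless symbols
print(persistence_scores(runs, 3), persistence_scores(noise, 3))
```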
In this paper, we report on the outputs and adoption of the Agrisemantics Working Group of the Research Data Alliance (RDA), consisting of a set of recommendations to facilitate the adoption of semantic technologies and methods for the purpose of data interoperability in the field of agriculture and nutrition. From 2016 to 2019, the group gathered researchers and practitioners at the intersection of information technology and agricultural science to study all aspects of the life cycle of semantic resources: conceptualization, edition, sharing, standardization, services, alignment, and long-term support. The working group first carried out a landscape study of the uses of semantics in agrifood, then collected use cases for the exploitation of semantic resources, a generic term encompassing vocabularies, terminologies, thesauri, and ontologies. The resulting requirements were synthesized into 39 "hints" for users and developers of semantic resources, and for providers of semantic resource services. We believe adopting these recommendations will engage agrifood sciences in a necessary transition to better leverage data production, sharing, and reuse, and to support the adoption of the FAIR data principles. The paper includes examples of adoption of these recommendations and a discussion of their contribution to the field of data science.
In this paper, we consider a flocculation model in a chemostat where one species is present in two forms, planktonic and aggregated bacteria, in the presence of a single resource. The removal rates of isolated and attached bacteria are distinct and include the specific death rates. Considering distinct yield coefficients and a large class of growth rates, we present a mathematical analysis of the model by establishing necessary and sufficient conditions for the existence and local asymptotic stability of all steady states according to the two operating parameters, the dilution rate and the inflowing concentration of the substrate. Using these conditions, we first determine theoretically the operating diagram of the flocculation process, describing the asymptotic behavior of the system with respect to the two control parameters. The bifurcation analysis shows a rich set of possible bifurcation types: transcritical bifurcations or branch points of steady states, saddle-node bifurcations or limit points of steady states, and Hopf and homoclinic bifurcations. Using numerical continuation with the MATCONT software, based on a continuation and correction algorithm, we recover the operating diagram obtained theoretically. In addition, MATCONT detects other two-parameter bifurcations, such as Bogdanov-Takens and cusp bifurcations.
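For concreteness, here is one generic flocculation-chemostat system of the type described, with assumed Monod kinetics and linear attachment/detachment terms (all parameter values and functional forms are illustrative; the paper's model may differ in detail):

```python
import numpy as np
from scipy.integrate import solve_ivp

D, S_in = 0.4, 3.0            # operating parameters: dilution rate, inflow substrate
Du, Dv = 0.5, 0.3             # distinct removal rates (dilution + specific death)
yu, yv = 0.8, 0.6             # distinct yield coefficients
a, b = 0.6, 0.2               # attachment / detachment rates

def monod(S, m=1.5, k=1.0):
    return m * S / (k + S)

def rhs(t, x):
    S, u, v = x               # substrate, planktonic u, aggregated v
    fu, fv = monod(S), monod(S, m=0.8)   # aggregates grow slower (diffusion)
    dS = D * (S_in - S) - fu * u / yu - fv * v / yv
    du = (fu - Du - a) * u + b * v       # growth - removal - attachment + detachment
    dv = (fv - Dv - b) * v + a * u
    return [dS, du, dv]

sol = solve_ivp(rhs, (0, 200), [S_in, 0.1, 0.0])
print("steady state (S, u, v):", sol.y[:, -1].round(3))
```

Sweeping D and S_in over a grid and classifying the long-run behavior of such a system is exactly what the operating diagram summarizes.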
The problems of observability and identifiability have been of great interest as preliminary steps to estimating parameters and initial conditions of dynamical systems to which some known data (observations) are associated. While most works focus on linear and polynomial/rational systems of ODEs, general nonlinear systems have received far less attention and, to the best of our knowledge, no general constructive methodology has been proposed to assess and guarantee parameter and state recoverability in this context. We consider a class of systems of parameterized nonlinear ODEs together with some observations, and study whether a system of this class is observable, identifiable, or jointly observable-identifiable; our goal is to identify its parameters and/or reconstruct the initial condition from the data. To achieve this, we introduce a family of efficient and fully constructive procedures that allow recoverability of the unknowns at low computational cost and address the aforementioned gap. Each procedure is tailored to a different observational scenario and is based on the resolution of linear systems. As a case study, we apply these procedures to several epidemic models, with a detailed focus on the SIRS model, demonstrating its joint observability-identifiability when only a portion of the infected individuals is measured, a scenario that has not been studied before. In contrast, for the same observations, the SIR model is observable and identifiable, but not jointly observable-identifiable. This distinction allows us to introduce a novel approach to discriminating between different epidemiological models (SIR vs. SIRS) from short-time data. For these two models, we illustrate the theoretical results through numerical experiments, validating the approach and highlighting its practical applicability to real-world scenarios.
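The SIRS case study can be stated compactly: the dynamics and the partial observation y(t) = p·I(t) (only a fraction p of infected individuals is measured) take the following form. The parameter values below are illustrative:

```python
import numpy as np
from scipy.integrate import solve_ivp

beta, gamma, omega, p = 0.5, 0.2, 0.05, 0.3  # transmission, recovery, waning, reporting

def sirs(t, x):
    S, I, R = x
    dS = -beta * S * I + omega * R           # omega = 0 recovers the SIR model
    dI = beta * S * I - gamma * I
    dR = gamma * I - omega * R
    return [dS, dI, dR]

t_eval = np.linspace(0, 60, 200)
sol = solve_ivp(sirs, (0, 60), [0.99, 0.01, 0.0], t_eval=t_eval)
y_obs = p * sol.y[1]                         # observed data: scaled infected curve
print(y_obs[:5].round(4))
```

Per the abstract, from this observation the SIRS model is jointly observable-identifiable, while the SIR special case (omega = 0) is observable and identifiable but not jointly so, which is what enables model discrimination from short-time data.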
In many high-dimensional problems, like sparse PCA, planted clique, or clustering, the best known algorithms with polynomial time complexity fail to reach the statistical performance provably achievable by algorithms free of computational constraints. This observation has given rise to the conjecture of the existence, for some problems, of gaps -- so-called statistical-computational gaps -- between the best possible statistical performance achievable without computational constraints, and the best performance achievable in polynomial time. A powerful approach to assessing the best performance achievable in polynomial time is to investigate the best performance achievable by low-degree polynomials. We build on the seminal paper of Schramm and Wein (2022) and propose a new scheme to derive lower bounds on the performance of low-degree polynomials in some latent space models. By better leveraging the latent structures, we obtain new and sharper results, with simplified proofs. We then instantiate our scheme to provide computational lower bounds for the problems of clustering, sparse clustering, and biclustering. We also prove matching upper bounds and some additional statistical results, in order to provide a comprehensive description of the statistical-computational gaps occurring in these three problems.
Approximate Bayesian Computation (ABC) methods are commonly used to approximate posterior distributions in models with unknown or computationally intractable likelihoods. Classical ABC methods are based on nearest-neighbor-type algorithms and rely on the choice of so-called summary statistics, distances between datasets, and a tolerance threshold. Recently, methods combining ABC with more complex machine learning algorithms have been proposed to mitigate the impact of these "user choices". In this paper, we propose the first ABC method, to our knowledge, that is completely free of summary statistics, distances, and tolerance thresholds. Moreover, in contrast with the usual generalizations of ABC, it associates a confidence interval (having proper frequentist marginal coverage) with the posterior mean estimate (or other moment-type estimates). Our method, named ABCD-Conformal, uses a neural network with Monte Carlo Dropout to estimate the posterior mean (or other moment-type functionals), and conformal theory to obtain the associated confidence sets. The method is amortized and efficient for estimating multidimensional parameters; we test it on four different applications and compare it with other ABC methods from the literature.
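A minimal sketch of the ABCD-Conformal pipeline, with assumed architecture and hyperparameters: a dropout network predicts the posterior mean from (simulated) data, dropout is kept active at test time (Monte Carlo Dropout) to obtain an uncertainty proxy, and split-conformal calibration turns that proxy into an interval with finite-sample marginal coverage.

```python
import numpy as np
import torch
import torch.nn as nn

# Dropout regression network mapping a dataset representation to a parameter
net = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Dropout(0.2),
                    nn.Linear(64, 64), nn.ReLU(), nn.Dropout(0.2),
                    nn.Linear(64, 1))

def mc_dropout_predict(x, T=100):
    net.train()                          # keep dropout active at inference
    with torch.no_grad():
        draws = torch.stack([net(x) for _ in range(T)])
    return draws.mean(0).squeeze(-1), draws.std(0).squeeze(-1)

# Split-conformal calibration: scale residuals by the MC-Dropout std
x_cal = torch.randn(500, 10)             # held-out simulated datasets
theta_cal = torch.randn(500)             # their true parameters
mu, sigma = mc_dropout_predict(x_cal)
scores = (theta_cal - mu).abs() / sigma  # normalized nonconformity scores
alpha = 0.1
q = torch.quantile(scores, float(np.ceil((1 - alpha) * 501) / 500))

x_new = torch.randn(1, 10)               # a new observed dataset
m, s = mc_dropout_predict(x_new)
print(f"90% conformal interval: [{(m - q*s).item():.2f}, {(m + q*s).item():.2f}]")
```

In practice the network would be trained on simulated (dataset, parameter) pairs first; the snippet only shows the prediction and calibration logic.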
We investigate the Active Clustering Problem (ACP). A learner interacts with an N-armed stochastic bandit with d-dimensional sub-Gaussian feedback. There exists a hidden partition of the arms into K groups, such that arms within the same group share the same mean vector. The learner's task is to uncover this hidden partition with the smallest possible budget, i.e., the least number of observations, and with a probability of error smaller than a prescribed constant δ. In this paper, (i) we derive a non-asymptotic lower bound on the budget, and (ii) we introduce the computationally efficient ACB algorithm, whose budget matches the lower bound in most regimes. In particular, we improve on the performance of a uniform sampling strategy. Importantly, and contrary to the batch setting, we establish that there is no computation-information gap in the active setting.
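For intuition only, here is a naive round-based scheme (emphatically not the ACB algorithm): sample every arm once per round, cluster the empirical means, and stop when a Hoeffding-style confidence width is small relative to the smallest between-centroid gap. All constants are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
N, K, d, delta = 12, 3, 2, 0.05
true_means = np.repeat(rng.normal(0, 2, (K, d)), N // K, axis=0)  # hidden partition

sums, counts = np.zeros((N, d)), 0
for t in range(1, 10_000):
    sums += true_means + rng.normal(size=(N, d))   # one sub-Gaussian pull per arm
    counts = t
    means = sums / counts
    km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(means)
    centroids = km.cluster_centers_
    gaps = np.linalg.norm(centroids[:, None] - centroids[None, :], axis=2)
    min_gap = gaps[np.triu_indices(K, 1)].min()
    width = np.sqrt(2 * np.log(2 * N * t * t / delta) / counts)   # Hoeffding-style
    if 4 * width < min_gap:                        # means well separated w.h.p.
        break
print(f"budget = {N * counts} pulls, labels = {km.labels_}")
```

An adaptive algorithm like ACB would instead concentrate pulls on the arms that are hardest to assign, which is where the budget savings over uniform sampling come from.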
Prospective studies require discussing and collaborating with stakeholders to create scenarios of the possible evolution of the studied value chain. However, stakeholders do not always use the same words when referring to the same idea. Constructing an ontology and homogenizing vocabularies is thus crucial to identify the key variables which serve in the construction of the needed scenarios. Nevertheless, it is a very complex and time-consuming task. In this paper we present the method we used to manually build ontologies adapted to the needs of two complementary system-analysis models (namely the "Godet" and the "MyChoice" models), starting from interviews with the agri-food system's stakeholders.
This article proposes a generic framework to jointly process the spatial and spectral information of hyperspectral images. First, sub-images are extracted. Each of these sub-images then follows two parallel workflows, one dedicated to the extraction of spatial features and the other to the extraction of spectral features. Finally, the extracted features are merged, producing as many scores as sub-images. Two applications are proposed, illustrating different spatial and spectral processing methods. The first is the unsupervised characterization of a teak wood disk. It implements structure tensors for the spatial branch, simple averaging for the spectral branch, and multi-block principal component analysis for the fusion step. The second application is the early detection of apple scab on leaves. It implements co-occurrence matrices for the spatial branch, singular value decomposition for the spectral branch, and multi-block partial least squares discriminant analysis for the fusion step. Both applications demonstrate the value of the proposed method for extracting relevant spatial and spectral information and show how promising this new approach is for hyperspectral image processing.
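The framework's skeleton fits in a few lines: split the cube into sub-images, run a spatial branch and a spectral branch on each, and fuse. The feature extractors below are deliberately simple stand-ins for the structure tensors, co-occurrence matrices, and multi-block fusion methods named above.

```python
import numpy as np

def spatial_features(sub):          # stand-in for texture descriptors
    return sub.std(axis=(0, 1))     # per-band spatial variability

def spectral_features(sub):         # stand-in for spectral descriptors
    return sub.mean(axis=(0, 1))    # mean spectrum of the sub-image

def fuse(f_spat, f_spec):           # stand-in for multi-block PCA / PLS-DA
    f = np.concatenate([f_spat, f_spec])
    return float(np.linalg.norm(f)) # one score per sub-image

rng = np.random.default_rng(0)
cube = rng.random((128, 128, 60))   # H x W x bands hyperspectral image
win = 32                            # sub-image size
scores = [fuse(spatial_features(cube[i:i+win, j:j+win]),
               spectral_features(cube[i:i+win, j:j+win]))
          for i in range(0, 128, win) for j in range(0, 128, win)]
print(len(scores), "sub-image scores")
```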
Information on grass growth over the year is essential for models simulating the use of this resource to feed animals on pasture or at the barn with hay or grass silage. Unfortunately, this information is rarely available. The challenge is to reconstruct grass growth from two sources of information: usual daily climate data (rainfall, radiation, etc.) and the cumulative growth over the year. We must be able to capture the effect of seasonal climatic events, which are known to distort the growth curve within the year. In this paper, we formulate this challenge as the problem of disaggregating the cumulative growth into a time series. To address this problem, our method applies time series forecasting using climate information and grass growth from previous time steps. Several variants of the method are proposed and compared experimentally using a database generated from a grassland process-based model. The results show that our method can accurately reconstruct the time series, regardless of how the cumulative growth information is used.
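One plausible variant of the method, sketched on simulated data (the features and model choice are illustrative assumptions): learn daily growth from climate and lagged growth, reconstruct the year recursively, and rescale the predictions so they sum to the known cumulative growth.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
days = 365
climate = rng.random((days, 4))                     # rain, radiation, temp, ...
growth = np.clip(climate @ [2, 3, 1, 0.5] + rng.normal(0, 0.3, days), 0, None)

lag = np.concatenate([[0.0], growth[:-1]])          # previous-day growth
X = np.column_stack([climate, lag])
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, growth)

# Reconstruct a new year recursively, then rescale to the known annual total
clim_new = rng.random((days, 4))
total_new = 900.0                                   # known cumulative growth
pred, prev = np.empty(days), 0.0
for t in range(days):
    prev = model.predict(np.concatenate([clim_new[t], [prev]])[None, :])[0]
    pred[t] = prev
pred *= total_new / pred.sum()                      # enforce the annual constraint
print(pred[:5].round(2))
```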
We consider a population spreading across a finite number of sites. Individuals can move from one site to another according to a network (oriented links between the sites) that varies periodically over time. On each site, the population experiences a growth rate which is also periodically time-varying. Recently, this kind of model has been extensively studied, using various technical tools to derive precise necessary and sufficient conditions on the parameters of the system (i.e., the local growth rate on each site, the time period, and the strength of migration between the sites) for the population to grow. In the present paper, we take a completely different approach: using elementary comparison results between linear systems, we give a sufficient condition for the growth of the population. This condition is easy to check and applies to a broad class of examples. In particular, in the case where all sites are sinks (i.e., in the absence of migration, the population becomes extinct on each site), we prove that when our growth condition is satisfied, the population grows when the time period is large, and for values of the migration strength that are exponentially small with respect to the time period, which answers positively a conjecture stated by Katriel.
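The growth criterion in play here is standard Floquet theory: for a periodic linear system x' = A(t)x, the population grows if and only if the spectral radius of the monodromy matrix (the state-transition map over one period) exceeds 1. The sketch below checks this numerically for two sink sites coupled by a small migration, with illustrative rates:

```python
import numpy as np
from scipy.integrate import solve_ivp

T, m = 20.0, 1e-3                 # period and (small) migration strength

def A(t):
    # Alternating growth rates: each site averages to -0.5 < 0 (a sink),
    # but the two sites are favorable at complementary half-periods
    r1, r2 = (2.0, -3.0) if (t % T) < T / 2 else (-3.0, 2.0)
    return np.array([[r1 - m, m],
                     [m, r2 - m]])

def rhs(t, x):
    return A(t) @ x

# Monodromy matrix: propagate the identity over one period, column by column
M = np.column_stack([solve_ivp(rhs, (0, T), e, rtol=1e-9).y[:, -1]
                     for e in np.eye(2)])
rho = max(abs(np.linalg.eigvals(M)))
print(f"spectral radius = {rho:.3g} -> population "
      f"{'grows' if rho > 1 else 'dies out'}")
```

Despite both sites being sinks, the alternation lets the population ride each site's favorable phase, so growth occurs even for very small m once T is large, which is the phenomenon behind Katriel's conjecture.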
Large systems are often coarse-grained in order to study their low-dimensional macroscopic dynamics, yet microscopic complexity can in principle disrupt these predictions in many ways. We first consider one form of fine-grained complexity, heterogeneity in the time scales of microscopic dynamics, and show by an algebraic approach that it can stabilize macroscopic degrees of freedom. We then show that this time scale heterogeneity can arise from other forms of complexity, in particular disordered interactions between microscopic variables, and that it can drive the system's coarse-grained dynamics to transition from nonequilibrium attractors to fixed points. These mechanisms are demonstrated in a model of many-species ecosystems, where we find a quasi-decoupling between the low- and high-dimensional facets of the dynamics, interacting only through a key feature of ecological models, the fact that species' dynamical time scales are controlled by their abundances. We conclude that fine-grained disorder may enable a macroscopic equilibrium description of many-species ecosystems.
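The many-species setting alluded to can be illustrated with the standard random Lotka-Volterra model (an assumption for illustration, not necessarily the paper's exact model): dx_i/dt = x_i(1 - x_i + Σ_j a_ij x_j) with disordered interactions a_ij. The prefactor x_i encodes the key feature mentioned above: each species' dynamical time scale is set by its abundance.

```python
import numpy as np
from scipy.integrate import solve_ivp

rng = np.random.default_rng(0)
S, mu, sigma = 100, -0.5, 0.4
# Disordered interactions with mean mu/S and std sigma/sqrt(S)
a = mu / S + (sigma / np.sqrt(S)) * rng.normal(size=(S, S))
np.fill_diagonal(a, 0.0)

def glv(t, x):
    # The x * (...) structure ties each species' time scale to its abundance
    return x * (1.0 - x + a @ x)

x0 = rng.uniform(0.1, 1.0, S)
sol = solve_ivp(glv, (0, 500), x0, rtol=1e-8)
x_final = sol.y[:, -1]
# Macroscopic (coarse-grained) observables of the high-dimensional dynamics
print(f"total biomass = {x_final.sum():.2f}, "
      f"surviving species = {(x_final > 1e-6).sum()}")
```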
We establish central limit theorems for a large class of supercritical branching Markov processes in infinite dimension with spatially dependent and not necessarily local branching mechanisms. This result relies on a fourth moment assumption and on the exponential convergence of the mean semigroup in a weighted total variation norm. The latter assumption is rather weak and requires neither symmetry properties nor specific spectral knowledge of this semigroup. In particular, we recover two of the three known regimes of convergence (namely the small and critical branching processes) in known cases, and extend them to a wider family of processes. To prove our central limit theorems, we use Stein's method, which in addition allows us to provide a rate of convergence, new for this type of result.
We consider branching processes for structured populations: each individual is characterized by a type or trait belonging to a general measurable state space. We focus on the supercritical recurrent case, where the population may survive and grow and the trait distribution converges. The branching process is then expected to be driven by the positive eigentriplet of the first eigenvalue problem of the first moment semigroup. Under the assumption of convergence of the renormalized semigroup in weighted total variation norm, we prove strong convergence of the normalized empirical measure and non-degeneracy of the limiting martingale. Convergence is obtained under an L log L condition, which provides a Kesten-Stigum result in infinite dimension and relaxes the uniform convergence assumption on the renormalized first moment semigroup required in the work of Asmussen and Hering in 1976. The proof techniques combine families of martingales, contraction of semigroups, and the truncation procedure of Asmussen and Hering. We also obtain L^1 convergence of the renormalized empirical measure and contribute to unifying different results in the literature. These results greatly extend the class of examples where a law of large numbers applies, as we illustrate with absorbed branching diffusions, the house-of-cards model, and some growth-fragmentation processes.