Predicting the impact of genomic and drug perturbations on cellular function is crucial for understanding gene functions and drug effects, ultimately leading to improved therapies. To this end, Causal Representation Learning (CRL) constitutes one of the most promising approaches, as it aims to identify the latent factors that causally govern biological systems, thus facilitating the prediction of the effect of unseen perturbations. Yet, current CRL methods fail to reconcile their principled latent representations with known biological processes, leading to models that are not interpretable. To address this major issue, we present SENA-discrepancy-VAE, a model based on the recently proposed CRL method discrepancy-VAE, that produces representations in which each latent factor can be interpreted as the (linear) combination of the activity of a (learned) set of biological processes. To this end, we present an encoder, SENA-{\delta}, that efficiently computes and maps biological processes' activity levels to the latent causal factors. We show that SENA-discrepancy-VAE achieves predictive performance on unseen combinations of interventions comparable to that of its original, non-interpretable counterpart, while inferring causal latent factors that are biologically meaningful.
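To illustrate the kind of built-in interpretability described here, the following is a minimal, hypothetical sketch of a masked encoder layer in which each latent factor only reads the genes belonging to its assigned biological processes; the class name, shapes and masking scheme are illustrative assumptions, not the authors' exact SENA-{\delta} architecture.

```python
# Hypothetical masked-encoder sketch: each latent factor aggregates only the genes
# in its (assigned) biological processes, keeping factors interpretable as
# combinations of process activities. Names and shapes are illustrative assumptions.
import torch
import torch.nn as nn

class MaskedProcessEncoder(nn.Module):
    def __init__(self, n_genes: int, process_masks: torch.Tensor):
        # process_masks: (n_processes, n_genes) binary gene-set membership matrix
        super().__init__()
        assert process_masks.shape[1] == n_genes
        self.register_buffer("mask", process_masks.float())
        self.weight = nn.Parameter(torch.randn(process_masks.shape) * 0.01)
        self.bias = nn.Parameter(torch.zeros(process_masks.shape[0]))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Zero out connections to genes outside each process before projecting.
        return x @ (self.weight * self.mask).T + self.bias

# Toy usage: 5 genes, 2 "processes"
masks = torch.tensor([[1, 1, 0, 0, 0], [0, 0, 1, 1, 1]])
enc = MaskedProcessEncoder(n_genes=5, process_masks=masks)
activities = enc(torch.randn(3, 5))  # (batch, n_processes) process activity levels
```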
In recent years, the data science community has pursued excellence and made significant research efforts to develop advanced analytics, focusing on solving technical problems at the expense of organizational and socio-technical challenges. According to previous surveys on the state of data science project management, there is a significant gap between technical and organizational processes. In this article we present new empirical data from a survey of 237 data science professionals on the use of project management methodologies for data science. We provide additional profiling of the survey respondents' roles and their priorities when executing data science projects. Based on this survey study, the main findings are: (1) The Agile data science lifecycle is the most widely used framework, but only 25% of the survey participants report following a data science project methodology. (2) The most important success factors are precisely describing stakeholders' needs, communicating the results to end-users, and team collaboration and coordination. (3) Professionals who adhere to a project methodology place greater emphasis on the project's potential risks and pitfalls, version control, the deployment pipeline to production, and data security and privacy.
In this paper we review the so-called altmetrics or alternative metrics. This concept arises from the development of new indicators based on Web 2.0 for the evaluation of research and academic activity. The basic assumption is that variables such as mentions in blogs, number of tweets, or the number of researchers bookmarking a research paper, for instance, may be legitimate indicators for measuring the use and impact of scientific publications. In this sense, these indicators are currently a focus of the bibliometric community and are being discussed and debated. We describe the main platforms and indicators, and we analyze, as a sample, the Spanish research output in Communication Studies, comparing traditional indicators such as citations with these new metrics. The results show that the most cited papers are also the ones with the highest impact according to altmetrics. We conclude by pointing out the main shortcomings of these metrics and the role they may play in measuring research impact through 2.0 platforms.
The digitalization of financial markets has shifted trading from voice to electronic channels, with Multi-Dealer-to-Client (MD2C) platforms now enabling clients to request quotes (RfQs) for financial instruments like bonds from multiple dealers simultaneously. In this competitive landscape, dealers cannot see each other's prices, making a rigorous analysis of the negotiation process crucial to ensure their profitability. This article introduces a novel general framework for analyzing the RfQ process using probabilistic graphical models and causal inference. Within this framework, we explore different inferential questions that are relevant for dealers participating in MD2C platforms, such as the computation of optimal prices, the estimation of potential revenues, and the identification of clients that might be interested in trading the dealer's axes. We then move on to analyzing two different approaches to model specification: a generative model built on the work of (Fermanian, Guéant & Pu, 2017), and discriminative models utilizing machine learning techniques. We evaluate these methodologies using predictive metrics designed to assess their effectiveness in the context of optimal pricing, highlighting the relative benefits of using models that take into account the internal mechanisms of the negotiation process.
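As a toy illustration of the optimal-pricing question raised above (not the generative model of Fermanian, Guéant & Pu), one can posit a hit probability that decays with the quoted margin and maximize the resulting expected profit; the logistic form and the parameters a and b below are assumptions made only for the example.

```python
# Toy expected-profit pricing for a single RfQ. The "hit" probability that the
# client trades at our quote is assumed logistic in the quoted margin; a, b and
# the margin grid are illustrative, not calibrated values.
import numpy as np

def hit_probability(margin_bps: np.ndarray, a: float = 2.0, b: float = 1.5) -> np.ndarray:
    """Probability the client deals with us given our margin over mid (in bps)."""
    return 1.0 / (1.0 + np.exp(-(a - b * margin_bps)))

margins = np.linspace(0.0, 5.0, 501)             # candidate margins in basis points
expected_pnl = margins * hit_probability(margins)
best = margins[np.argmax(expected_pnl)]
print(f"optimal margin ~ {best:.2f} bps, expected PnL ~ {expected_pnl.max():.3f} bps")
```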
Data science has devoted great research efforts to developing advanced analytics, improving data models and cultivating new algorithms. However, not many authors have addressed the organizational and socio-technical challenges that arise when executing a data science project: lack of vision and clear objectives, a biased emphasis on technical issues, a low level of maturity for ad-hoc projects and the ambiguity of roles in data science are among these challenges. Few methodologies have been proposed in the literature that tackle these types of challenges; some of them date back to the mid-1990s, and consequently they are not updated to the current paradigm and the latest developments in big data and machine learning technologies. In addition, even fewer methodologies offer a complete guideline across team, project and data & information management. In this article we explore the necessity of developing a more holistic approach for carrying out data science projects. We first review methodologies that have been presented in the literature for working on data science projects and classify them according to their focus: project, team, data and information management. Finally, we propose a conceptual framework containing the general characteristics that a methodology for managing data science projects with a holistic point of view should have. This framework can be used by other researchers as a roadmap for the design of new data science methodologies or the updating of existing ones.
We present a family of quantum stabilizer codes using the structure of duadic constacyclic codes over $\mathbb{F}_4$. Within this family, quantum codes can possess varying dimensions, and their minimum distances are lower bounded by a square root bound. For each fixed dimension, this allows us to construct an infinite sequence of binary quantum codes with a growing minimum distance. Additionally, we prove that this family of quantum codes includes an infinite subclass of degenerate codes. We also introduce a technique for extending splittings of duadic constacyclic codes, providing new insights into the minimum distance and minimum odd-like weight of specific duadic constacyclic codes. Finally, we provide numerical examples of some quantum codes with short lengths within this family.
This paper jointly addresses the challenges of non-stationarity and high dimensionality in analysing multivariate time series. Building on the classical concept of cointegration, we introduce a more flexible notion, called the stability space, aimed at capturing stationary components in settings where traditional assumptions may not hold. Based on the dimensionality reduction techniques of Partial Least Squares and Principal Component Analysis, we propose two non-parametric procedures for estimating such a space, together with a targeted selection of components that prioritises stationarity. We compare these alternatives with the parametric Johansen procedure, when possible. Through simulations and real-data applications, we evaluate the performance of these methodologies across various scenarios, including high-dimensional configurations.
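A minimal sketch of the PCA-based flavour of this idea (our own reading, not the paper's exact procedure): project the multivariate series onto principal directions and retain those whose scores pass an ADF stationarity test; the significance level and the toy data are arbitrary.

```python
# Select "stable" directions of a multivariate series: PCA directions whose
# component scores look stationary according to an ADF test. Illustrative only.
import numpy as np
from statsmodels.tsa.stattools import adfuller

def stable_components(X: np.ndarray, alpha: float = 0.05) -> np.ndarray:
    """X: (T, p) time series. Returns the PCA directions whose scores pass ADF."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)    # rows of Vt are directions
    scores = Xc @ Vt.T
    keep = [j for j in range(scores.shape[1])
            if adfuller(scores[:, j])[1] < alpha]         # ADF p-value below alpha
    return Vt[keep]

# Toy usage: two random walks sharing a common trend -> one stationary combination
rng = np.random.default_rng(0)
trend = rng.normal(size=500).cumsum()
X = np.column_stack([trend + rng.normal(size=500), trend + rng.normal(size=500)])
print(stable_components(X))
```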
Researchers at the Christian Medical Center & Hospital in Purnia, Bihar, India, developed an AI-driven method using cough sound analysis to predict abnormal chest X-ray findings. This approach achieved an ROC-AUC of up to 0.78, indicating its potential as a triage tool for optimizing limited radiographic resources in low-resource settings.
In this study, the combined use of structural equation modeling (SEM) and Bayesian network modeling (BNM) in causal inference analysis is revisited. The perspective highlights the debate between proponents of using BNM as either an exploratory phase or even as the sole phase in the definition of structural models, and those advocating for SEM as the superior alternative for exploratory analysis. The individual strengths and limitations of SEM and BNM are recognized, but this exploration evaluates the contention between utilizing SEM's robust structural inference capabilities and the dynamic probabilistic modeling offered by BNM. A case study of the work of \citet{balaguer_2022} on a structural model for personal positive youth development (\textit{PYD}) as a function of positive parenting (\textit{PP}) and perception of the climate and functioning of the school (\textit{CFS}) is presented. The paper finally presents a clear stance on the analytical primacy of SEM in exploratory causal analysis, while acknowledging the potential of BNM in subsequent phases.
Response Surface Methodology (RSM) and desirability functions were employed in a case study to optimize the thermal and daylight performance of a computational model of a tropical housing typology. Specifically, this approach simultaneously optimized the Indoor Overheating Hours (IOH) and Useful Daylight Illuminance (UDI) metrics through an Overall Desirability (D). The lack of significant association between IOH and other annual daylight metrics enabled a focused optimization of IOH and UDI. Each response required only 138 simulation runs (~30 hours for 276 runs) to determine the optimal values for the passive strategies: window-to-wall ratio (WWR) and roof overhang depth across four orientations, totalling eight factors. First, an initial screening based on a $2_V^{8-2}$ fractional factorial design identified four key factors using stepwise and Lasso regression, later narrowed down to three: roof overhang depth on the south and west, WWR on the west, and WWR on the south. Then, RSM optimization yielded an optimal solution (roof overhang: 3.78 meters, west WWR: 3.76%, south WWR: 29.3%) with a D of 0.625 (IOH: 8.33%, UDI: 79.67%). Finally, a robustness analysis with 1,000 bootstrap replications provided 95% confidence intervals for the optimal values. This study optimally balances thermal comfort and daylight with few experiments using a computationally efficient multi-objective approach.
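To make the desirability aggregation concrete, the sketch below combines a smaller-is-better desirability for IOH with a larger-is-better desirability for UDI into an overall D via the geometric mean; the lower and upper bounds used here are illustrative placeholders, not the study's actual specification.

```python
# Worked sketch of Derringer-Suich-style desirabilities and their geometric mean.
# The bounds (low, high) are made up for illustration; only the reported optimum
# (IOH = 8.33 %, UDI = 79.67 %) comes from the abstract.
import numpy as np

def desirability_smaller(y, low, high):
    """1 at/below `low`, 0 at/above `high`, linear in between (smaller is better)."""
    return float(np.clip((high - y) / (high - low), 0.0, 1.0))

def desirability_larger(y, low, high):
    """0 at/below `low`, 1 at/above `high`, linear in between (larger is better)."""
    return float(np.clip((y - low) / (high - low), 0.0, 1.0))

d_ioh = desirability_smaller(8.33, low=0.0, high=25.0)     # IOH in %
d_udi = desirability_larger(79.67, low=50.0, high=100.0)   # UDI in %
D = (d_ioh * d_udi) ** 0.5                                 # Overall Desirability
print(round(D, 3))
```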
In this paper, we propose a unified approach to harnessing quantum conformal methods for multi-output distributions, with a particular emphasis on two experimental paradigms: (i) a standard 2-qubit circuit scenario producing a four-dimensional outcome distribution, and (ii) a multi-basis measurement setting that concatenates measurement probabilities in different bases (Z, X, Y) into a twelve-dimensional output space. By combining a multi-output regression model (e.g., random forests) with distributional conformal prediction, we validate coverage and interval-set sizes on both simulated quantum data and multi-basis measurement data. Our results confirm that classical conformal prediction can effectively provide coverage guarantees even when the target probabilities derive from inherently quantum processes. Such synergy opens the door to next-generation quantum-classical hybrid frameworks, providing both improved interpretability and rigorous coverage for quantum machine learning tasks. All code and fully reproducible Colab notebooks are made available at this https URL.
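A hedged sketch of the classical split-conformal recipe applied to a multi-output random forest is given below; simulated four-dimensional targets stand in for the 2-qubit outcome probabilities, and the finite-sample quantile correction is omitted for brevity.

```python
# Split-conformal intervals around a multi-output random forest: calibrate a
# per-output residual radius, then check empirical coverage on a test split.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 3))
Y = np.column_stack([np.sin(X[:, 0]), np.cos(X[:, 1]), X[:, 2] ** 2, X.sum(axis=1)])
Y += 0.1 * rng.normal(size=Y.shape)                # 4 outputs, stand-in for 2-qubit probs

Xtr, Xcal, Xte = X[:300], X[300:450], X[450:]
Ytr, Ycal, Yte = Y[:300], Y[300:450], Y[450:]

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(Xtr, Ytr)

alpha = 0.1                                        # target 90 % coverage
resid = np.abs(Ycal - model.predict(Xcal))         # calibration residuals
q = np.quantile(resid, 1 - alpha, axis=0)          # one interval radius per output

pred = model.predict(Xte)
coverage = (np.abs(Yte - pred) <= q).mean(axis=0)  # empirical coverage per output
print(coverage)
```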
The rise of social media has ignited an unprecedented circulation of false information in our society. This is even more evident in times of crisis, such as the COVID-19 pandemic. Fact-checking efforts have expanded greatly and have been touted as among the most promising solutions to fake news, especially in times like these. Several studies have reported on the development of fact-checking organizations in Western societies, although little attention has been given to the Global South. Here, to fill this gap, we introduce a novel Markov-inspired computational method for identifying topics in tweets. In contrast to other topic modeling approaches, our method clusters topics and tracks their evolution within a predefined time window. Using this method, we collected data from the Twitter accounts of two Brazilian fact-checking outlets and present the topics debunked by these initiatives in fortnights throughout the pandemic. By comparing these organizations, we could identify similarities and differences in what was shared by them. Our method proved to be a valuable technique for clustering topics in a wide range of scenarios, including an infodemic -- a period marked by an overabundance of information. In particular, the data clearly revealed a complex intertwining between politics and the health crisis during this period. We conclude by proposing a generic model which, in our opinion, is suitable for topic modeling, and an agenda for future research.
This paper focuses on the problem of supplying the workstations of assembly lines with components during the production process. For this specific problem, the paper presents a Mixed Integer Linear Program (MILP) that aims at minimizing the energy consumption of the supplying strategy. More specifically, in contrast to the usual formulations that only consider component flows, this MILP handles the mass flows that are routed from one workstation to another.
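The following is a deliberately simplified toy of a mass-flow-aware supply MILP, written with PuLP; the depot-to-station distances, demands, energy coefficient and big-M constant are all invented for illustration and do not reflect the paper's formulation.

```python
# Toy MILP: ship component mass from a depot to workstations while minimizing an
# energy proxy proportional to mass times distance, plus a fixed cost per trip.
from pulp import LpMinimize, LpProblem, LpVariable, lpSum, value

stations = ["ws1", "ws2", "ws3"]
demand_kg = {"ws1": 40, "ws2": 25, "ws3": 60}      # mass required per station (assumed)
dist_m = {"ws1": 10, "ws2": 35, "ws3": 50}         # depot-to-station distance (assumed)
energy_per_kg_m = 0.002                            # assumed energy coefficient

prob = LpProblem("supply_energy", LpMinimize)
x = {s: LpVariable(f"mass_{s}", lowBound=0) for s in stations}       # kg shipped
use = {s: LpVariable(f"trip_{s}", cat="Binary") for s in stations}   # trip made?

prob += lpSum(energy_per_kg_m * dist_m[s] * x[s] + 0.5 * use[s] for s in stations)
for s in stations:
    prob += x[s] >= demand_kg[s]        # meet the station's demand
    prob += x[s] <= 1000 * use[s]       # big-M: shipping mass requires a trip

prob.solve()
print({s: value(x[s]) for s in stations}, "energy =", value(prob.objective))
```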
The feed-forward relationship naturally observed in time-dependent processes and in a diverse number of real systems, such as some food webs and electronic and neural wiring, can be described in terms of so-called directed acyclic graphs (DAGs). An important ingredient of the analysis of such networks is a proper comparison of their observed architecture against an ensemble of randomized graphs, thereby quantifying the {\em randomness} of the real systems with respect to suitable null models. This approximation is particularly relevant when the finite size and/or large connectivity of real systems make a comparison with the predictions obtained from the so-called {\em configuration model} inadequate. In this paper we analyze four methods of DAG randomization, as defined by the desired combination of topological invariants (directed and undirected degree sequences and component distributions) to be preserved. A highly ordered DAG, called the \textit{snake} graph, and an Erdős-Rényi DAG were used to validate the performance of the algorithms. Finally, three real case studies, namely, the \textit{C. elegans} cell lineage network, a PhD student-advisor network and Milgram's citation network, were analyzed using each randomization method. Results show how the interpretation of degree-degree relations in DAGs with respect to their randomized ensembles depends on the topological invariants imposed. In general, real DAGs exhibit disorder values lower than expected by chance when the directedness of the links is not preserved in the randomization process. Conversely, if the direction of the links is conserved throughout the randomization process, disorder indicators are close to those obtained from the null-model ensemble, although some deviations are observed.
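As an illustration of degree-preserving DAG randomization, the sketch below swaps pairs of directed edges and rejects any swap that creates a cycle or a duplicate edge; it is one plausible implementation, not necessarily any of the four algorithms analyzed in the paper.

```python
# Degree-preserving randomization of a DAG: swap (a->b, c->d) into (a->d, c->b),
# undoing any swap that introduces a cycle or a duplicate edge.
import random
import networkx as nx

def swap_randomize_dag(G: nx.DiGraph, n_swaps: int = 1000, seed: int = 0) -> nx.DiGraph:
    rng = random.Random(seed)
    H = G.copy()
    edges = list(H.edges())
    for _ in range(n_swaps):
        (a, b), (c, d) = rng.sample(edges, 2)
        if len({a, b, c, d}) < 4 or H.has_edge(a, d) or H.has_edge(c, b):
            continue
        H.remove_edges_from([(a, b), (c, d)])
        H.add_edges_from([(a, d), (c, b)])
        if nx.is_directed_acyclic_graph(H):
            edges = list(H.edges())
        else:                                    # undo swaps that create cycles
            H.remove_edges_from([(a, d), (c, b)])
            H.add_edges_from([(a, b), (c, d)])
    return H

# Toy usage on a small random DAG (edges only from lower- to higher-indexed nodes)
random.seed(0)
G = nx.DiGraph([(i, j) for i in range(20) for j in range(i + 1, 20) if random.random() < 0.2])
print(nx.is_directed_acyclic_graph(swap_randomize_dag(G)))
```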
Semantic memory is the subsystem of human memory that stores knowledge of concepts or meanings, as opposed to specific life experiences. The organization of concepts within semantic memory can be understood as a semantic network, where the concepts (nodes) are associated (linked) to others depending on perceptions, similarities, etc. Lexical access is the complementary part of this system and allows the retrieval of such organized knowledge. While conceptual information is stored under a certain underlying organization (and thus gives rise to a specific topology), it is crucial to have accurate access to any of the information units, e.g. the concepts, in order to efficiently retrieve semantic information for real-time needs. An example of such an information retrieval process occurs in verbal fluency tasks, which are known to involve two different mechanisms: \textit{clustering}, or generating words within a subcategory, and, when a subcategory is exhausted, \textit{switching} to a new subcategory. We extend this approach to random walking on a network (clustering) in combination with jumping (switching) to any node with a certain probability, and derive its analytical expression based on Markov chains. Results show that this dual mechanism contributes to optimizing the exploration of different network models in terms of the mean first passage time. Additionally, this cognitively inspired dual mechanism opens a new framework to better understand and evaluate exploration, propagation and transport phenomena in other complex systems where switching-like phenomena are feasible.
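A small numerical sketch of the dual mechanism follows: a random walk that, with an assumed switching probability q, jumps to a uniformly chosen node, with the mean first passage time to a target obtained from the standard linear system for absorbing Markov chains; the graph and q are arbitrary.

```python
# Random walk with uniform "switching" jumps: build the mixed transition matrix
# and solve the usual linear system for mean first passage times to a target node.
import numpy as np
import networkx as nx

def mfpt_to_target(A: np.ndarray, target: int, q: float = 0.1) -> np.ndarray:
    n = A.shape[0]
    P = (1 - q) * A / A.sum(axis=1, keepdims=True) + q / n   # walk + uniform jump
    idx = [i for i in range(n) if i != target]
    Q = P[np.ix_(idx, idx)]                                   # transient part
    m = np.linalg.solve(np.eye(n - 1) - Q, np.ones(n - 1))    # MFPT from non-target nodes
    out = np.zeros(n)
    out[idx] = m
    return out

A = nx.to_numpy_array(nx.barabasi_albert_graph(50, 3, seed=1))  # connected toy graph
print(mfpt_to_target(A, target=0, q=0.1).mean())
```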
The EPC GEN 2 communication protocol for ultra-high-frequency Radio Frequency Identification (RFID) has offered a promising avenue for advancing the intelligence of transportation infrastructure. With the capability of linking vehicles to RFID readers to crowdsource information from RFID tags on road infrastructure, the RF-enhanced road infrastructure (REI) can potentially transform data acquisition for urban transportation. Despite its potential, the broader adoption of RFID technologies in building intelligent roads has been limited by a deficiency in understanding how the GEN 2 protocol impacts system performance under different transportation settings. This paper fills this knowledge gap by presenting the system architecture and detailing the design challenges associated with REI. Comprehensive real-world experiments are conducted to assess REI's effectiveness across various urban contexts. The results yield crucial insights into the optimal design of on-vehicle RFID readers and on-road RFID tags, considering the constraints imposed by vehicle dynamics, road geometries, and tag placements. With optimized designs of the encoding schemes for reader-tag communication and the on-vehicle antennas, REI is able to fulfill the requirements of traffic sign inventory management and environmental monitoring, while falling short of catering to the demand for high-speed navigation. In particular, the Miller 2 encoding scheme strikes the best balance between reading performance (e.g., throughput) and noise tolerance against the multipath effect. Additionally, we show that the on-vehicle antenna should be oriented to maximize the available time for reading on-road tags, even though this may reduce the power received by the tags in the forward link.
In this paper we explore the concept of hierarchy as a quantifiable descriptor of ordered structures, departing from the definition of three conditions to be satisfied by a hierarchical structure: {\em order}, {\em predictability} and {\em pyramidal structure}. According to these principles we define a hierarchical index drawing on concepts from graph and information theory. This estimator makes it possible to quantify the hierarchical character of any system that can be abstracted as a feedforward causal graph, i.e., a directed acyclic graph defined on a single connected structure. Our hierarchical index balances the predictability and pyramidal conditions through the definition of two entropies: one attending to the onward flow and the other to the backward reversion. We show how this index allows us to identify hierarchical, anti-hierarchical and non-hierarchical structures. Our formalism reveals that, departing from the defined conditions for a hierarchical structure, feedforward trees and inverted tree graphs emerge as the only causal structures of maximally hierarchical and anti-hierarchical systems, respectively. Conversely, null values of the hierarchical index are attributed to a number of different network configurations: from linear chains, due to their lack of pyramidal structure, to fully connected feedforward graphs, where the diversity of onward pathways is canceled by the uncertainty (lack of predictability) when going backwards. Some illustrative examples are provided to distinguish among these three types of hierarchical causal graphs.
Today, the human brain can be studied as a whole. Electroencephalography, magnetoencephalography, or functional magnetic resonance imaging techniques provide functional connectivity patterns between different brain areas, and during different pathological and cognitive neuro-dynamical states. In this Tutorial we review novel complex networks approaches to unveil how brain networks can efficiently manage local processing and global integration for the transfer of information, while being at the same time capable of adapting to satisfy changing neural demands.
Accurate temperature measurements are essential for the proper monitoring and control of industrial furnaces. However, measurement uncertainty is a risk for such a critical parameter. Certain instrumental and environmental errors must be considered when using spectral-band radiation thermometry techniques, such as the uncertainty in the emissivity of the target surface, reflected radiation from surrounding objects, or atmospheric absorption and emission, to name a few. Undesired contributions to measured radiation can be isolated using measurement models, also known as error-correction models. This paper presents a methodology for budgeting significant sources of error and uncertainty during temperature measurements in a petrochemical furnace scenario. A continuous monitoring system is also presented, aided by a deep-learning-based measurement correction model, to allow domain experts to analyze the furnace's operation in real-time. To validate the proposed system's functionality, a real-world application case in a petrochemical plant is presented. The proposed solution demonstrates the viability of precise industrial furnace monitoring, thereby increasing operational security and improving the efficiency of such energy-intensive systems.
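For intuition on the kind of error-correction model mentioned above, the sketch below applies a much-simplified single-wavelength correction that accounts only for target emissivity and reflected ambient radiation, inverting Planck's law for the true temperature; it ignores atmospheric effects and is not the paper's deep-learning model.

```python
# Simplified spectral-band correction: measured radiance = emitted + reflected
# components; invert Planck's law numerically to recover the surface temperature.
import numpy as np
from scipy.optimize import brentq

H, C, KB = 6.626e-34, 2.998e8, 1.381e-23   # Planck, speed of light, Boltzmann

def planck(wl_m: float, T: float) -> float:
    """Spectral radiance of a blackbody at wavelength wl_m (m) and temperature T (K)."""
    return (2 * H * C**2 / wl_m**5) / np.expm1(H * C / (wl_m * KB * T))

def corrected_temperature(L_meas, wl_m, emissivity, T_surround):
    """Solve eps*planck(T) + (1-eps)*planck(T_surround) = L_meas for T."""
    reflected = (1 - emissivity) * planck(wl_m, T_surround)
    f = lambda T: emissivity * planck(wl_m, T) + reflected - L_meas
    return brentq(f, 300.0, 3000.0)

# Toy check with assumed values: a 1100 K target seen at 3.9 um with eps = 0.85
wl, eps, T_true, T_amb = 3.9e-6, 0.85, 1100.0, 600.0
L = eps * planck(wl, T_true) + (1 - eps) * planck(wl, T_amb)
print(corrected_temperature(L, wl, eps, T_amb))   # ~1100 K
```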
This paper studies the investment decisions of Spanish households using a unique data set, the Spanish Survey of Household Finance (EFF). We propose a theoretical model in which households, given a fixed investment in housing, allocate their net wealth across bank time deposits, stocks, and mortgage debt. Besides considering housing as an indivisible and illiquid asset that restricts the portfolio choice decision, we take into account the financial constraints that households face when they apply for external funding. For every representative household in the EFF we solve this theoretical problem and obtain the theoretically optimal portfolio, which is compared with households' actual choices. We find that households significantly underinvest in stocks and deposits, while the optimal and actual mortgage investments are alike. Considering the three types of financial assets at once, we find that households headed by highly financially sophisticated, older, retired, richer, and unconstrained persons are the ones investing more efficiently.
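A stylized sketch of the allocation problem is shown below: a mean-variance stand-in for the household's objective, with deposits, stocks and mortgage (as a bounded negative weight) summing to net wealth; the return, risk and borrowing-limit numbers are assumptions for illustration only, not the paper's structural model.

```python
# Stylized household allocation: maximize a mean-variance objective over
# deposits, stocks and mortgage debt (a negative weight, capped by a borrowing limit).
import numpy as np
from scipy.optimize import minimize

mu = np.array([0.01, 0.06, 0.03])     # deposit return, stock return, mortgage rate (assumed)
cov = np.diag([1e-6, 0.04, 1e-6])     # only stocks are meaningfully risky in this toy
risk_aversion = 3.0

def neg_utility(w):
    return -(w @ mu - 0.5 * risk_aversion * w @ cov @ w)

# w = (deposits, stocks, mortgage) as fractions of net wealth; the mortgage is a
# liability, so it enters with a negative weight and pays the mortgage rate.
cons = {"type": "eq", "fun": lambda w: w[0] + w[1] + w[2] - 1.0}
bounds = [(0, None), (0, None), (-0.8, 0)]
res = minimize(neg_utility, x0=np.array([0.5, 0.7, -0.2]), bounds=bounds, constraints=cons)
print(res.x)
```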