Universitat Politècnica de València
Research in AI evaluation has grown increasingly complex and multidisciplinary, attracting researchers with diverse backgrounds and objectives. As a result, divergent evaluation paradigms have emerged, often developing in isolation, adopting conflicting terminologies, and overlooking each other's contributions. This fragmentation has led to insular research trajectories and communication barriers both among different paradigms and with the general public, contributing to unmet expectations for deployed AI systems. To help bridge this insularity, in this paper we survey recent work in the AI evaluation landscape and identify six main paradigms. We characterise major recent contributions within each paradigm across key dimensions related to their goals, methodologies and research cultures. By clarifying the unique combination of questions and approaches associated with each paradigm, we aim to increase awareness of the breadth of current evaluation approaches and foster cross-pollination between different paradigms. We also identify potential gaps in the field to inspire future research directions.
As large language models (LLMs) become more advanced, it is increasingly difficult to distinguish between human-written and AI-generated text. This paper draws a conceptual parallel between quantum uncertainty and the limits of authorship detection in natural language. We argue that there is a fundamental trade-off: the more confidently one tries to identify whether a text was written by a human or an AI, the more one risks disrupting the text's natural flow and authenticity. This mirrors the tension between precision and disturbance found in quantum systems. We explore how current detection methods--such as stylometry, watermarking, and neural classifiers--face inherent limitations. Enhancing detection accuracy often leads to changes in the AI's output, making other features less reliable. In effect, the very act of trying to detect AI authorship introduces uncertainty elsewhere in the text. Our analysis shows that when AI-generated text closely mimics human writing, perfect detection becomes not just technologically difficult but theoretically impossible. We address counterarguments and discuss the broader implications for authorship, ethics, and policy. Ultimately, we suggest that the challenge of AI-text detection is not just a matter of better tools--it reflects a deeper, unavoidable tension in the nature of language itself.
Comprehensive and accurate evaluation of general-purpose AI systems such as large language models allows for effective mitigation of their risks and deepened understanding of their capabilities. Current evaluation methodology, mostly based on benchmarks of specific tasks, falls short of adequately assessing these versatile AI systems, as present techniques lack a scientific foundation for predicting their performance on unforeseen tasks and explaining their varying performance on specific task items or user inputs. Moreover, existing benchmarks of specific tasks raise growing concerns about their reliability and validity. To tackle these challenges, we suggest transitioning from task-oriented evaluation to construct-oriented evaluation. Psychometrics, the science of psychological measurement, provides a rigorous methodology for identifying and measuring the latent constructs that underlie performance across multiple tasks. We discuss its merits, warn against potential pitfalls, and propose a framework to put it into practice. Finally, we explore future opportunities of integrating psychometrics with the evaluation of general-purpose AI systems.
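As a hedged illustration of the construct-oriented perspective (not a method taken from the paper), the sketch below uses a two-parameter logistic (2PL) item response model, a standard psychometric tool for linking a latent construct ("ability") to performance on individual task items; the item parameters and ability grid are purely illustrative.

```python
# Hedged illustration (not from the paper): a two-parameter logistic (2PL)
# item response model, a standard psychometric tool for linking a latent
# construct ("ability" theta) to success probability on individual items.
import numpy as np

def p_correct(theta, a, b):
    """Probability of answering an item correctly under the 2PL model.
    theta: latent ability, a: item discrimination, b: item difficulty."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

# Three hypothetical benchmark items with different difficulty/discrimination.
items = [(1.5, -0.5), (1.0, 0.0), (2.0, 1.0)]   # (a, b) pairs, illustrative
abilities = np.linspace(-3, 3, 7)                # grid of latent ability values

for a, b in items:
    print(f"a={a}, b={b}:", np.round(p_correct(abilities, a, b), 2))
```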
We consider a positively curved FLRW spacetime as a background in which a nonrelativistic quantum particle propagates according to the Schroedinger equation. The probability fluid for the corresponding quantum states is taken as a model for the cosmological fluid filling this FLRW Universe. The Hamiltonian operator governing this fictitious particle is proportional to the Laplacian operator derived from the FLRW metric, while the mass of the particle equals the overall matter of the Universe (baryonic and dark). A complete, orthonormal set of quantum eigenstates of the Hamiltonian is obtained. Restricting to radially symmetric states, the latter are then used to compute matrix elements and expectation values of two observables for which quantum operators are identified, namely, the cosmological constant and the gravitational Boltzmann entropy. This entropy is regarded as corresponding to a positively-curved FLRW geometry when the cosmological fluid filling the Universe occupies a given quantum state.
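For orientation, the standard ingredients the abstract refers to can be written as below; the notation ($a(t)$, $\chi$, $M$) is an assumption for illustration and does not reproduce the paper's derivation.

```latex
% Hedged sketch: spatial line element of a positively curved FLRW universe,
% a Hamiltonian proportional to the associated Laplacian (mass M = total
% matter content), and the Schroedinger equation for the fictitious particle.
\[
  d\Sigma^2 = a^2(t)\left[d\chi^2 + \sin^2\!\chi\,
              \big(d\theta^2 + \sin^2\!\theta\, d\varphi^2\big)\right],
  \qquad
  \hat{H} = -\frac{\hbar^2}{2M}\,\Delta_{\mathrm{FLRW}},
  \qquad
  i\hbar\,\partial_t \psi = \hat{H}\,\psi .
\]
```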
We investigate the solution of low-rank matrix approximation problems using the truncated SVD. For this purpose, we develop and optimize GPU implementations for the randomized SVD and a blocked variant of the Lanczos approach. Our work takes advantage of the fact that the two methods are composed of very similar linear algebra building blocks, which can be assembled using numerical kernels from existing high-performance linear algebra libraries. Furthermore, the experiments with several sparse matrices arising in representative real-world applications and synthetic dense test matrices reveal a performance advantage of the block Lanczos algorithm when targeting the same approximation accuracy.
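A minimal CPU/NumPy sketch of the basic randomized SVD that this line of work builds on is given below; the paper's GPU kernels and the blocked Lanczos variant are not reproduced, and the oversampling and power-iteration parameters are illustrative.

```python
# Hedged sketch (CPU/NumPy) of a basic randomized truncated SVD; the paper's
# optimized GPU implementations and block Lanczos variant are not shown.
import numpy as np

def randomized_svd(A, k, oversample=10, power_iters=2, rng=None):
    """Rank-k truncated SVD via random range finding (Halko et al. style)."""
    rng = np.random.default_rng(rng)
    m, n = A.shape
    Omega = rng.standard_normal((n, k + oversample))   # random test matrix
    Y = A @ Omega                                      # sample the range of A
    for _ in range(power_iters):                       # power iterations sharpen
        Y = A @ (A.T @ Y)                              # the captured subspace
    Q, _ = np.linalg.qr(Y)                             # orthonormal range basis
    B = Q.T @ A                                        # small projected matrix
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ Ub)[:, :k], s[:k], Vt[:k, :]

A = np.random.default_rng(0).standard_normal((500, 200))
U, s, Vt = randomized_svd(A, k=20)
print("spectral-norm error:", np.linalg.norm(A - U @ np.diag(s) @ Vt, 2))
```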
Magnetic resonance imaging (MRI) is the gold standard imaging modality for numerous diagnostic tasks, yet its usefulness is tempered by its high cost and infrastructural requirements. Low-cost, very-low-field portable scanners offer new opportunities while enabling imaging outside conventional MRI suites. However, achieving diagnostic-quality images in clinically acceptable scan times remains challenging with these systems. Therefore, methods for improving image quality while reducing scan duration are highly desirable. Here, we investigate a physics-informed 3D deep unrolled network for the reconstruction of portable MR acquisitions. Our approach includes a novel network architecture that utilizes momentum-based acceleration and leverages the complex conjugate symmetry of k-space for improved reconstruction performance. Comprehensive evaluations on emulated datasets as well as 47 mT portable MRI acquisitions demonstrate the improved reconstruction quality of the proposed method compared to existing methods.
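A hedged 2D NumPy toy of two ideas named in the abstract follows: momentum-accelerated data-consistency iterations and exploiting the conjugate symmetry of k-space. The paper's learned 3D unrolled network and regularizer are omitted, and the image, mask, and step sizes are illustrative assumptions.

```python
# Hedged toy sketch: momentum (heavy-ball) data-consistency iterations plus
# conjugate-symmetry augmentation of undersampled k-space. Not the paper's
# network; all sizes, step sizes and the sampling mask are assumptions.
import numpy as np

rng = np.random.default_rng(0)
N = 64
x_true = np.zeros((N, N)); x_true[20:44, 24:40] = 1.0   # toy real-valued "anatomy"
mask = rng.random((N, N)) < 0.35                         # undersampling pattern
y = mask * np.fft.fft2(x_true)                           # measured k-space samples

def conj_sym(Y):
    """Return Y(-k): the conjugate-symmetric counterpart of a DFT array."""
    out = Y
    for ax in range(Y.ndim):
        out = np.roll(np.flip(out, axis=ax), 1, axis=ax)
    return np.conj(out)

# Synthesize extra k-space samples from mirrored measured locations (real image).
mask_sym = conj_sym(mask.astype(complex)).real.astype(bool)
y_aug = np.where(mask, y, conj_sym(y))
mask_aug = mask | mask_sym

# Momentum iterations on the data-consistency term ||M F x - y||^2.
x, v, lr, beta = np.zeros((N, N)), np.zeros((N, N)), 1.0, 0.7
for _ in range(50):
    grad = np.fft.ifft2(mask_aug * (np.fft.fft2(x) - y_aug)).real
    v = beta * v - lr * grad
    x = x + v
print("relative error:", np.linalg.norm(x - x_true) / np.linalg.norm(x_true))
```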
Large Language Models (LLMs) have revolutionized Natural Language Processing (NLP), with the Chat Generative Pre-trained Transformer (ChatGPT) standing out as a notable example due to its advanced capabilities and widespread applications. This survey provides a comprehensive analysis of ChatGPT, exploring its architecture, training processes, and functionalities. We examine its integration into various domains across industries such as customer service, education, healthcare, and entertainment. A comparative analysis with other LLMs highlights ChatGPT's unique features and performance metrics. Regarding benchmarks, the paper examines ChatGPT's comparative performance against other LLMs and discusses potential risks such as misinformation, bias, and data privacy concerns. Additionally, we offer a number of figures and tables that outline the backdrop of the discussion, the main ideas of the article, the numerous LLM models, a thorough list of datasets used for pre-training, fine-tuning, and evaluation, as well as particular LLM applications with pertinent references. Finally, we identify future research directions and technological advancements, underscoring the evolving landscape of LLMs and their profound impact on Artificial Intelligence (AI) and society.
In this work, we explore and propose several quantum circuit mapping strategies to optimize qubit shuttling in scalable quantum computing architectures based on silicon spin qubits. Our goal is to minimize phase errors introduced during shuttling operations while reducing the overall execution time of quantum circuits. We propose and evaluate five mapping algorithms using benchmarks from quantum algorithms. The Swap Return strategy emerged as the most robust solution, offering a superior balance between execution time and error minimization by considering future qubit interactions. Additionally, we assess the importance of initial qubit placement, demonstrating that an informed placement strategy can significantly enhance the performance of dynamic mapping approaches.
We explore extreme mass-ratio inspirals (EMRIs) in the co-evolution of massive black holes (MBHs) and nuclear star clusters (NSCs), which host diverse stellar populations across a wide range of masses. The dynamics are simulated self-consistently with GNC, which we have updated to incorporate gravitational wave orbital decay, the loss cone of a spinning MBH, and stellar evolution. Over 12 Gyr, we investigate the evolution of the NSC with a mass-growing MBH, as well as the EMRIs of stellar black holes, neutron stars, white dwarfs, brown dwarfs (BDs), and low-mass main-sequence stars (MSs), along with tidal disruption events (TDEs) involving MSs, BDs, and post-MSs. The mass growth of the MBH contributed by TDEs is typically $\sim 10^7\,M_{\odot}$, $\sim 10^6\,M_{\odot}$, and $\sim 5\times10^4\,M_{\odot}$ for massive, Milky-Way-like, and smaller NSCs, respectively. Between $40\%$ and $70\%$ of the stellar mass is lost during stellar evolution, which dominates the mass growth of the MBH if a significant fraction of the lost mass is accreted. The evolution of EMRI rates is generally affected by the cluster's size expansion or contraction, stellar population evolution, MBH mass growth, and the stellar initial mass function. The EMRI rates for compact objects peak at early epochs ($\lesssim 1$ Gyr) and then gradually decline over cosmic time. LISA-band ($0.1$ mHz) EMRIs involving compact objects around Milky-Way-like MBHs tend to have high eccentricities, while those around spinning MBHs preferentially occupy low-inclination (prograde) orbits. In contrast, MS- and BD-EMRIs usually have eccentricity and inclination distributions that are distinct from those of compact objects.
Modular quantum computing architectures are a promising alternative to monolithic QPU (Quantum Processing Unit) designs for scaling up quantum devices. They refer to a set of interconnected QPUs or cores consisting of tightly coupled quantum bits that can communicate via quantum-coherent and classical links. In multi-core architectures, it is crucial to minimize the amount of communication between cores when executing an algorithm. Therefore, mapping a quantum circuit onto a modular architecture involves finding an optimal assignment of logical qubits (qubits in the quantum circuit) to different cores with the aim of minimizing the number of expensive inter-core operations while adhering to given hardware constraints. In this paper, we propose for the first time a Quadratic Unconstrained Binary Optimization (QUBO) technique to encode the problem and the solution for both qubit allocation and inter-core communication costs in binary decision variables. To this end, the quantum circuit is split into slices, and qubit assignment is formulated as a graph partitioning problem for each circuit slice. The costly inter-core communication is reduced by penalizing inter-core qubit communications. The final solution is obtained by minimizing the overall cost across all circuit slices. To evaluate the effectiveness of our approach, we conduct a detailed analysis using a representative set of benchmarks with a high number of qubits on two different multi-core architectures. Our method showed promising results and performed exceptionally well with very dense and highly-parallelized circuits that require on average 0.78 inter-core communications per two-qubit gate.
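A hedged sketch of how a single circuit slice could be encoded as a QUBO is shown below: one-hot assignment penalties, a soft core-capacity penalty, and a reward for co-locating interacting qubits (i.e., a penalty on inter-core communication). The penalty weights and variable layout are illustrative, not the paper's exact formulation.

```python
# Hedged QUBO sketch for one circuit slice: binary variable x[q,c] = 1 iff
# logical qubit q is assigned to core c. Weights are illustrative assumptions.
from itertools import combinations

def slice_qubo(n_qubits, n_cores, capacity, interactions,
               w_onehot=10.0, w_cap=10.0, w_comm=1.0):
    """Return a QUBO dict {((q1,c1),(q2,c2)): coeff} for one circuit slice."""
    Q = {}
    def add(u, v, w):
        key = (u, v) if u <= v else (v, u)
        Q[key] = Q.get(key, 0.0) + w

    for q in range(n_qubits):                 # (sum_c x[q,c] - 1)^2: one core per qubit
        for c in range(n_cores):
            add((q, c), (q, c), -w_onehot)    # linear terms on the diagonal
            for c2 in range(c + 1, n_cores):
                add((q, c), (q, c2), 2 * w_onehot)

    for c in range(n_cores):                  # soft capacity: penalize crowded cores
        for q1, q2 in combinations(range(n_qubits), 2):
            add((q1, c), (q2, c), w_cap / capacity)

    for (q1, q2) in interactions:             # two-qubit gates in this slice:
        for c in range(n_cores):              # reward same-core placement, i.e.
            add((q1, c), (q2, c), -w_comm)    # penalize inter-core communication

    return Q

Q = slice_qubo(n_qubits=4, n_cores=2, capacity=2, interactions=[(0, 1), (2, 3)])
print(len(Q), "QUBO terms")
```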
This overview is devoted to splitting methods, a class of numerical integrators intended for differential equations that can be subdivided into subproblems that are easier to solve than the original system. Closely connected with this class of integrators are composition methods, in which one or several low-order schemes are composed to construct higher-order numerical approximations to the exact solution. We analyze in detail the order conditions that have to be satisfied by these classes of methods to achieve a given order, and provide some insight into their qualitative properties in connection with geometric numerical integration and the treatment of highly oscillatory problems. Since splitting methods have received considerable attention in the realm of partial differential equations, we also cover this subject in the present survey, with special attention to parabolic equations and the issues arising in that setting. An exhaustive list of methods of different orders is collected and tested on simple examples. Finally, some applications of splitting methods in different areas, ranging from celestial mechanics to statistics, are also provided.
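A hedged example of the simplest such schemes is given below: Lie-Trotter (first-order) and Strang (second-order) splitting for the linear system x' = (A + B)x, where the flows of A and B are each cheap to compute; the matrices, step count, and tolerance-free comparison are illustrative.

```python
# Hedged example: Lie-Trotter (order 1) vs. Strang (order 2) splitting for
# x' = (A + B) x, comparing against the exact flow exp(T(A+B)).
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(1)
A, B = rng.standard_normal((4, 4)), rng.standard_normal((4, 4))
x0, T, n = rng.standard_normal(4), 1.0, 200
h = T / n

def integrate(step):
    x = x0.copy()
    for _ in range(n):
        x = step(x)
    return x

lie    = lambda x: expm(h * B) @ (expm(h * A) @ x)                      # e^{hB} e^{hA}
strang = lambda x: expm(h/2 * A) @ (expm(h * B) @ (expm(h/2 * A) @ x))  # symmetric composition

exact = expm(T * (A + B)) @ x0
print("Lie-Trotter error:", np.linalg.norm(integrate(lie) - exact))
print("Strang error:     ", np.linalg.norm(integrate(strang) - exact))
```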
Frequency upconversion is a cornerstone of electromagnetic signal processing, analysis and detection. It is used to transfer energy and information from one frequency domain to another where transmission, modulation or detection is technically easier or more efficient. Optomechanical transduction is emerging as a flexible approach to coherent frequency upconversion; it has been successfully demonstrated for conversion from radio- and microwaves (kHz to GHz) to optical fields. Nevertheless, optomechanical transduction of multi-THz and mid-infrared signals remains an open challenge. Here, we utilize molecular cavity optomechanics to demonstrate upconversion of sub-microwatt continuous-wave signals at $\sim 32$ THz into the visible domain at ambient conditions. The device consists of a plasmonic nanocavity hosting a small number of molecules. The incoming field resonantly drives a collective molecular vibration, which imprints an optomechanical modulation on a visible pump laser and results in Stokes and anti-Stokes upconverted Raman sidebands with sub-natural linewidth, indicating a coherent process. The nanocavity offers 13 orders of magnitude enhancement of upconversion efficiency per molecule compared to free space, with a measured phonon-to-photon internal conversion efficiency larger than $10^{-4}$ per milliwatt of pump power. Our results establish a flexible paradigm for optomechanical frequency conversion using molecular oscillators coupled to plasmonic nanocavities, whose vibrational and electromagnetic properties can be tailored at will using chemical engineering and nanofabrication.
We address the problem of performing message-passing-based decoding of quantum LDPC codes under hardware latency limitations. We propose a novel way to do layered decoding that suits quantum constraints and outperforms flooded scheduling, the usual scheduling on parallel architectures. A generic construction is given for building layers of hypergraph product codes. In the process, we introduce two new notions: t-covering layers, a generalization of the usual layer decomposition, and a new scheduling called random-order scheduling. Numerical simulations show that random-order scheduling is of independent interest, as it helps relieve the high error floor typical of message-passing decoders on quantum codes for both layered and serial decoding, without the need for post-processing.
AI assistants such as Alexa, Google Assistant, and Siri are making their way into the healthcare sector, offering a convenient way for users to access different healthcare services. Trust is a vital factor in the uptake of healthcare services, but the factors affecting trust in voice assistants used for healthcare are under-explored, and this specialist domain introduces additional requirements. This study explores the effects of different functional, personal, and risk factors on trust in and adoption of healthcare voice AI assistants (HVAs), generating a partial least squares structural model from a survey of 300 voice assistant users. Our results indicate that trust in HVAs can be significantly explained by functional factors (usefulness, content credibility, quality of service relative to a healthcare professional), together with security and privacy risks and personal stance on technology. We also discuss differences in trust between HVAs and general-purpose voice assistants, as well as implications that are unique to HVAs.
Building a theoretical understanding of the capabilities of large language models (LLMs) is vital for our ability to predict and explain the behavior of these systems. Here, we investigate the structure of LLM capabilities by extracting latent capabilities from patterns of individual differences across a varied population of LLMs. Using a combination of Bayesian and frequentist factor analysis, we analyzed data from 29 different LLMs across 27 cognitive tasks. We found evidence that LLM capabilities are not monolithic. Instead, they are better explained by three well-delineated factors that represent reasoning, comprehension and core language modeling. Moreover, we found that these three factors can explain a high proportion of the variance in model performance. These results reveal a consistent structure in the capabilities of different LLMs and demonstrate the multifaceted nature of these capabilities. We also found that the three abilities show different relationships to model properties such as model size and instruction tuning. These patterns help refine our understanding of scaling laws and indicate that changes to a model that improve one ability might simultaneously impair others. Based on these findings, we suggest that benchmarks could be streamlined by focusing on tasks that tap into each broad model ability.
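A hedged sketch of the frequentist side of such an analysis is shown below: fitting a three-factor model to a (models x tasks) score matrix with scikit-learn. The score matrix here is synthetic; the paper's data and Bayesian analysis are not reproduced.

```python
# Hedged sketch: three-factor maximum-likelihood factor analysis of a synthetic
# (29 models x 27 tasks) score matrix; not the paper's data or Bayesian model.
import numpy as np
from sklearn.decomposition import FactorAnalysis
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_models, n_tasks, n_factors = 29, 27, 3

# Synthetic scores generated from a ground-truth 3-factor structure plus noise.
latent = rng.standard_normal((n_models, n_factors))
loadings = rng.standard_normal((n_factors, n_tasks))
scores = latent @ loadings + 0.3 * rng.standard_normal((n_models, n_tasks))

X = StandardScaler().fit_transform(scores)
fa = FactorAnalysis(n_components=n_factors, rotation="varimax").fit(X)

print("estimated loadings shape:", fa.components_.shape)   # (3, 27)
print("per-model factor scores:", fa.transform(X).shape)   # (29, 3)
```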
The Gleason scoring system is the primary diagnostic and prognostic tool for prostate cancer. In recent years, with the development of digitisation devices, the use of computer vision techniques for the analysis of biopsies has increased. However, to the best of the authors' knowledge, the development of algorithms to automatically detect individual cribriform patterns belonging to Gleason grade 4 has not yet been studied in the literature. The objective of the work presented in this paper is to develop a deep-learning-based system able to support pathologists in the daily analysis of prostate biopsies. The methodological core of this work is a patch-wise predictive model based on convolutional neural networks able to determine the presence of cancerous patterns. In particular, we train from scratch a simple self-designed architecture. The cribriform pattern is detected by retraining the set of filters of the last convolutional layer in the network. From the reconstructed prediction map, we compute the percentage of each Gleason grade in the tissue to feed a multi-layer perceptron which provides a biopsy-level score. On our SICAPv2 database, composed of 182 annotated whole slide images, we obtained a Cohen's quadratic kappa of 0.77 in the test set for patch-level Gleason grading with the proposed architecture trained from scratch. Our results outperform previous ones reported in the literature. Furthermore, this model reaches the level of fine-tuned state-of-the-art architectures in a patient-based four-group cross-validation. In the cribriform pattern detection task, we obtained an area under the ROC curve of 0.82. Regarding biopsy-level Gleason scoring, we achieved a Cohen's quadratic kappa of 0.81 in the test subset. Shallow CNN architectures trained from scratch outperform current state-of-the-art methods for Gleason grade classification.
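A hedged sketch of the second stage described in the abstract follows: turning a patch-level prediction map into per-grade percentages and feeding them to a small MLP for a biopsy-level score. Data here are synthetic; the CNN stage, SICAPv2, and the paper's exact label scheme are not reproduced.

```python
# Hedged sketch: per-grade percentages from a patch-level prediction map fed
# to a small MLP for a biopsy-level score. Synthetic data; label coding assumed.
import numpy as np
from sklearn.neural_network import MLPClassifier

CLASSES = [0, 3, 4, 5]                      # non-cancerous + Gleason grades 3-5 (assumed coding)

def grade_percentages(pred_map):
    """Fraction of tissue patches assigned to each class in a prediction map."""
    counts = np.array([(pred_map == c).sum() for c in CLASSES], dtype=float)
    return counts / counts.sum()

rng = np.random.default_rng(0)
# Synthetic "biopsies": random patch-level maps plus placeholder biopsy labels.
maps = [rng.choice(CLASSES, size=(32, 32), p=p)
        for p in rng.dirichlet(np.ones(4), size=200)]
features = np.stack([grade_percentages(m) for m in maps])
labels = rng.integers(0, 4, size=200)       # placeholder biopsy-level groups

clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=500).fit(features, labels)
print("training accuracy on synthetic data:", clf.score(features, labels))
```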
Argument Mining is defined as the task of automatically identifying and extracting argumentative components (e.g., premises, claims, etc.) and detecting the existing relations among them (i.e., support, attack, rephrase, no relation). One of the main issues when approaching this problem is the lack of data and the limited size of the publicly available corpora. In this work, we use the recently annotated US2016 debate corpus. US2016 is the largest existing argument-annotated corpus, which allows exploring the benefits of the most recent advances in Natural Language Processing in a complex domain like Argument (relation) Mining. We present an exhaustive analysis of the behavior of transformer-based models (i.e., BERT, XLNET, RoBERTa, DistilBERT and ALBERT) when predicting argument relations. Finally, we evaluate the models in five different domains, with the objective of finding the least domain-dependent model. We obtain a macro F1-score of 0.70 with the US2016 evaluation corpus, and a macro F1-score of 0.61 with the Moral Maze cross-domain corpus.
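A hedged sketch of how argument-relation prediction can be framed as pairwise sequence classification with a transformer is shown below; the checkpoint, label set, example sentences, and absence of a training loop are illustrative, not the paper's exact setup.

```python
# Hedged sketch: argument-relation prediction as pairwise sequence
# classification; checkpoint and label set are illustrative assumptions.
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

LABELS = ["support", "attack", "rephrase", "no relation"]
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=len(LABELS))

arg1 = "We should raise the minimum wage."
arg2 = "Higher wages reduce employee turnover."
inputs = tokenizer(arg1, arg2, return_tensors="pt", truncation=True)

with torch.no_grad():                      # untrained head: output is arbitrary
    logits = model(**inputs).logits
print(LABELS[int(logits.argmax(dim=-1))])  # meaningful only after fine-tuning
```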
In the fast-paced field of quantum computing, identifying the architectural characteristics that will enable quantum processors to achieve high performance across a diverse range of quantum algorithms continues to pose a significant challenge. Given the extensive and costly nature of experimentally testing different designs, this paper introduces the first Design Space Exploration (DSE) for quantum-dot spin-qubit architectures. Utilizing the upgraded SpinQ compilation framework, this study explores a substantial design space comprising 29,312 spin-qubit-based architectures and applies an innovative optimization tool, ArtA (Artificial Architect), to speed up the design space traversal. ArtA can leverage 17 optimization configurations, significantly reducing exploration times by up to 99.1% compared to a traditional brute-force approach while maintaining the same result quality. After a comprehensive evaluation of best-matching optimization configurations per quantum circuit, ArtA suggests specific as well as universal architectural features that provide optimal performance across the examined circuits. Our work demonstrates that combining DSE methodologies with optimization algorithms can be effectively used to generate meaningful design insights for quantum processor development.
Understanding the impact of accuracy and speed when quantum error correction (QEC) decoders transition from floating-point software implementations to finite-precision hardware architectures is crucial for resource estimation on both the classical and quantum sides. The final performance of the hardware implementation influences the code distance, affecting the number of physical qubits needed, and defines the connectivity between quantum and classical control units, among other factors like refrigeration systems. This paper introduces a hardware emulator to evaluate QEC decoders using real hardware instead of software models. The emulator can explore $10^{13}$ different error patterns in 20 days with a single FPGA device running at 150 MHz, guaranteeing the decoder's performance at logical error rates of $10^{-12}$, the requirement for most quantum algorithms. In contrast, an optimized C++ software implementation on an Intel Core i9 with 128 GB RAM would take over a year to achieve similar results. The emulator also enables storing patterns that generate logical errors for offline analysis and the design of new decoders. Using results from the emulator, we propose a diversity-based method combining several belief propagation (BP) decoders with different quantization levels. Individually, these decoders may show subpar error correction, but together they outperform the floating-point version of BP for quantum low-density parity-check (QLDPC) codes such as hypergraph or lifted product codes. Preliminary results with circuit-level noise and bivariate bicycle codes suggest that hardware insights can also improve software. Our diversity-based proposal achieves a logical error rate similar to BP with ordered statistics decoding, with average speed improvements ranging from 30% to 80%, and 10% to 120% in worst-case scenarios, while reducing post-processing algorithm activation by 47% to 96.93% and maintaining the same accuracy.
Integrated sensing and communications (ISAC) is poised to be a native technology for the forthcoming Sixth Generation (6G) era, with an emphasis on its potential to enhance communications performance through the integration of sensing information, i.e., sensing-assisted communications (SAC). Nevertheless, existing research on SAC has predominantly confined its focus to scenarios characterized by minimal clutter and obstructions, largely neglecting indoor environments, particularly those in industrial settings, where propagation channels involve high clutter density. To address this research gap, we propose background subtraction on the monostatic sensing echoes, which effectively removes clutter and facilitates detection and tracking of user equipments (UEs) in cluttered indoor environments with SAC. A realistic evaluation of the introduced SAC strategy is provided, using ray tracing (RT) data with the scenario layout following Third Generation Partnership Project (3GPP) indoor factory (InF) channel models. Simulation results show that the proposed approach enables precise predictive beamforming largely unaffected by clutter echoes, leading to significant improvements in effective data rate over the existing SAC benchmarks and exhibiting performance very close to the ideal case where perfect knowledge of UE location is available.
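A hedged toy sketch of the clutter-removal idea follows: static clutter dominates the monostatic echoes, so a running average over snapshots estimates it and subtraction isolates the moving UE's contribution. Signals and parameters are synthetic; the paper's ray-tracing data and beamforming pipeline are not reproduced.

```python
# Hedged toy sketch: running-average background estimation and subtraction on
# synthetic echo snapshots with one slowly moving target tap.
import numpy as np

rng = np.random.default_rng(0)
n_snapshots, n_samples, alpha = 100, 256, 0.05   # alpha: background update rate

clutter = rng.standard_normal(n_samples) + 1j * rng.standard_normal(n_samples)
background = np.zeros(n_samples, dtype=complex)

for t in range(n_snapshots):
    # Moving target: a single delayed tap whose position drifts over time.
    target = np.zeros(n_samples, dtype=complex)
    target[50 + t] = 3.0
    echo = clutter + target + 0.1 * (rng.standard_normal(n_samples)
                                     + 1j * rng.standard_normal(n_samples))
    background = (1 - alpha) * background + alpha * echo   # running clutter estimate
    cleaned = echo - background                            # clutter-suppressed echo

print("target bin after subtraction:", int(np.argmax(np.abs(cleaned))))
```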