This research investigates whether In-Context Learning (ICL) in Large Language Models (LLMs) represents genuine learning, rigorously defining it within a PAC learning framework. The study demonstrates that while ICL improves with more examples (optimal at 50-100 shots), it exhibits considerable brittleness to out-of-distribution shifts and inconsistent generalization across formally similar tasks.
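The PAC framework referenced here is standard; as an illustrative textbook bound (not necessarily the paper's exact formalization), a finite hypothesis class $\mathcal{H}$ is PAC-learnable in the realizable case with sample complexity

```latex
m \;\ge\; \frac{1}{\epsilon}\left(\ln\lvert\mathcal{H}\rvert + \ln\frac{1}{\delta}\right),
```

i.e., the number of labeled examples needed to reach error $\epsilon$ with confidence $1-\delta$ grows only logarithmically in $\lvert\mathcal{H}\rvert$ and $1/\delta$ — one lens on why accuracy in such studies can saturate within tens of in-context demonstrations.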
UniTraj introduces a universal trajectory foundation model, trained on the new billion-scale, globally distributed WorldTrace dataset, to address limitations in task specificity, regional dependency, and data sensitivity for trajectory analysis. It achieves superior zero-shot and fine-tuned performance across recovery, prediction, classification, and generation tasks, for instance, reducing MAE by 32.73% on GeoLife for trajectory recovery compared to TrajBERT.
This paper introduces v0.5 of the AI Safety Benchmark, which has been created by the MLCommons AI Safety Working Group. The AI Safety Benchmark has been designed to assess the safety risks of AI systems that use chat-tuned language models. We introduce a principled approach to specifying and constructing the benchmark, which for v0.5 covers only a single use case (an adult chatting to a general-purpose assistant in English), and a limited set of personas (i.e., typical users, malicious users, and vulnerable users). We created a new taxonomy of 13 hazard categories, of which 7 have tests in the v0.5 benchmark. We plan to release version 1.0 of the AI Safety Benchmark by the end of 2024. The v1.0 benchmark will provide meaningful insights into the safety of AI systems. However, the v0.5 benchmark should not be used to assess the safety of AI systems. We have sought to fully document the limitations, flaws, and challenges of v0.5. This release of v0.5 of the AI Safety Benchmark includes (1) a principled approach to specifying and constructing the benchmark, which comprises use cases, types of systems under test (SUTs), language and context, personas, tests, and test items; (2) a taxonomy of 13 hazard categories with definitions and subcategories; (3) tests for seven of the hazard categories, each comprising a unique set of test items, i.e., prompts. There are 43,090 test items in total, which we created with templates; (4) a grading system for AI systems against the benchmark; (5) an openly available platform, and downloadable tool, called ModelBench that can be used to evaluate the safety of AI systems on the benchmark; (6) an example evaluation report which benchmarks the performance of over a dozen openly available chat-tuned language models; (7) a test specification for the benchmark.
Gamma-ray bursts are the most luminous electromagnetic events in the universe. Their prompt gamma-ray emission has typical durations between a fraction of a second and several minutes. A rare subset of these events have durations in excess of a thousand seconds, referred to as ultra-long gamma-ray bursts. Here, we report the discovery of the longest gamma-ray burst ever seen with a ~25,000 s gamma-ray duration, GRB 250702B, and characterize this event using data from four instruments in the InterPlanetary Network and the Monitor of All-sky X-ray Image. We find a hard spectrum, subsecond variability, and high total energy, which are only known to arise from ultrarelativistic jets powered by a rapidly-spinning stellar-mass central engine. These properties and the extreme duration are together incompatible with all confirmed gamma-ray burst progenitors and nearly all models in the literature. This burst is naturally explained with the helium merger model, where a field binary ends when a black hole falls into a stripped star and proceeds to consume and explode it from within. Under this paradigm, GRB 250702B adds to the growing evidence that helium stars expand and that some ultra-long GRBs have similar evolutionary pathways as collapsars, stellar-mass gravitational wave sources, and potentially rare types of supernovae.
Since the publication of the first International AI Safety Report, AI capabilities have continued to improve across key domains. New training techniques that teach AI systems to reason step-by-step and inference-time enhancements have primarily driven these advances, rather than simply training larger models. As a result, general-purpose AI systems can solve more complex problems in a range of domains, from scientific research to software development. Their performance on benchmarks measuring coding, mathematics, and expert-level science question answering has continued to improve, though reliability challenges persist, with systems excelling on some tasks while failing completely on others. These capability improvements also have implications for multiple risks, including risks from biological weapons and cyber attacks. Finally, they pose new challenges for monitoring and controllability. This update examines how AI capabilities have improved since the first Report, then focuses on key risk areas where substantial new evidence warrants updated assessments.
AnyRIR, developed by researchers at Aalto University and the University of York, introduces a method for estimating room impulse responses (RIRs) robustly in uncontrolled, noisy environments by leveraging background music as the excitation signal. The approach, based on ℓ1-norm regression, achieved a -36.0 dB RIR estimation error in simulated non-stationary noise, outperforming conventional methods.
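The ℓ1-norm regression at the heart of this approach can be sketched as least-absolute-deviations deconvolution: estimate an FIR filter h minimizing ‖Xh − y‖₁, where X is the convolution matrix of the (music) excitation and y is the noisy recording. Below is a minimal self-contained sketch using iteratively reweighted least squares (IRLS); the solver choice, tap count, and toy signals are illustrative assumptions, not AnyRIR's actual pipeline.

```python
import numpy as np

def convolution_matrix(x, n_taps):
    """Toeplitz matrix X such that X @ h == np.convolve(x, h)[:len(x)]."""
    N = len(x)
    X = np.zeros((N, n_taps))
    for k in range(n_taps):
        X[k:, k] = x[:N - k]
    return X

def l1_fir_estimate(x, y, n_taps, n_iter=50, eps=1e-6):
    """Least-absolute-deviations FIR estimate via IRLS:
    minimize ||X h - y||_1, robust to sparse, non-stationary noise."""
    X = convolution_matrix(x, n_taps)
    h = np.linalg.lstsq(X, y, rcond=None)[0]          # L2 warm start
    for _ in range(n_iter):
        w = 1.0 / np.maximum(np.abs(X @ h - y), eps)  # IRLS weights ~ 1/|residual|
        Xw = X * w[:, None]
        h = np.linalg.lstsq(Xw.T @ X, Xw.T @ y, rcond=None)[0]
    return h

# Toy check: recover a short impulse response despite sparse noise bursts.
rng = np.random.default_rng(0)
x = rng.standard_normal(400)                 # stand-in for the music excitation
h_true = np.array([1.0, 0.0, 0.5, 0.0, 0.25])
y = np.convolve(x, h_true)[:len(x)]
y[::37] += 3.0                               # sparse non-stationary interference
h_est = l1_fir_estimate(x, y, n_taps=5)
print(np.round(h_est, 2))
```

The ℓ1 objective is what confers robustness here: an ℓ2 fit would let the noise bursts bias every tap, whereas the reweighting drives their influence toward zero.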
Researchers performed the first calculation of light (photon) scattering on heavy dark matter particles, revealing non-zero cross-sections for both weakly interacting and purely gravitational dark matter. The study predicts distinct energy-dependent "coloring" effects and polarization signatures, and established an upper limit on dark matter mass below 5.0 × 10^19 GeV using Galactic Center gamma-ray observations.
The last two decades have seen quantum thermodynamics become a well-established field of research in its own right. In that time, it has demonstrated remarkably broad applicability, ranging from foundational advances in the understanding of how thermodynamic principles apply at the nano-scale and in the presence of quantum coherence, to providing a guiding framework for the development of efficient quantum devices. Exquisite levels of control have allowed state-of-the-art experimental platforms to explore energetics and thermodynamics at the smallest scales, which has in turn helped to drive theoretical advances. This Roadmap provides an overview of recent developments across many of the field's sub-disciplines, assessing the key challenges and future prospects and providing a guide for its near-term progress.
A comprehensive white paper from the GenAINet Initiative introduces Large Telecom Models (LTMs) as a novel framework for integrating AI into telecommunications infrastructure, providing a detailed roadmap for innovation while addressing critical challenges in scalability, hardware requirements, and regulatory compliance through insights from a diverse coalition of academic, industry and regulatory experts.
Immersion in virtual and augmented reality solutions relies on plausible spatial audio. However, plausibly representing a space for immersive audio often requires many individual acoustic measurements of source-microphone pairs with specialist spatial microphones, making the procedure time-consuming and expensive. In this study, we evaluate the plausibility of extrapolated and spatialised Room Impulse Responses (RIRs) using a 3-Alternative Forced Choice (3AFC) listening test. The stimuli comprised RIRs from three spaces convolved with speech, orchestral, and instrumental music. When asked to select which stimulus was artificial out of one extrapolated and two real stimuli, the 20 participants achieved an overall accuracy of 38% (5 percentage points above the expected guessing rate). Given this result, the study shows that it is possible to extrapolate plausible spatial RIRs from mono measurements, decreasing the need for time and specialist equipment in acoustic measurements.
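The headline statistic (38% detection accuracy from 20 listeners against a 33.3% 3AFC guessing rate) can be sanity-checked with an exact binomial tail. The trials-per-listener count is not given in this summary, so the total below is a hypothetical illustration:

```python
from math import comb

def binom_sf(k, n, p):
    """Exact upper tail P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Hypothetical totals: the summary gives 38% accuracy and 20 listeners but not
# the trials per listener; assume 18 trials each (360 total) for illustration.
n_trials = 360
k_correct = round(0.38 * n_trials)    # 137 correct responses
p_chance = 1 / 3                      # 3AFC guessing rate
p_value = binom_sf(k_correct, n_trials, p_chance)
print(f"one-sided p = {p_value:.3f}")
```

Under these assumed totals the detection rate sits near the edge of statistical detectability, which is consistent with the paper's reading that the extrapolated RIRs are close to perceptually indistinguishable from real ones.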
A comprehensive evaluation framework reveals significant limitations in commonly used no-reference image quality metrics (NRIQMs) for medical image generation, demonstrating that upstream metrics often fail to detect clinically relevant issues and correlate poorly with downstream task performance across VAE, GAN, and DDPM architectures.
UVLLM introduces an automated framework for Register Transfer Level (RTL) hardware verification, integrating Large Language Models (LLMs) with the Universal Verification Methodology (UVM). The system achieves an average fix rate of 86.99% for syntax errors and 71.92% for functional errors, outperforming prior methods like MEIC by up to 36.3% and demonstrating a 10.42x speedup.
The Laplace-Beltrami operator (LBO) has established itself in the field of non-rigid shape analysis due to its many useful properties, such as being invariant under isometric transformation, having a countable eigensystem forming an orthonormal basis, and fully characterizing geodesic distances of the manifold. However, this invariance only holds under isometric deformations, which leads to a performance breakdown in many real-world applications. In recent years, emphasis has been placed on extracting optimal features using deep learning methods; however, spectral signatures still play a crucial role and add value. In this paper we take a step back, revisiting the LBO and proposing a supervised way to learn several operators on a manifold. Depending on the task, by applying these functions we can train the LBO eigenbasis to be more task-specific. The optimization of the LBO leads to enormous improvements to established descriptors such as the heat kernel signature in tasks such as retrieval, classification, segmentation, and correspondence, demonstrating the adaptability of the LBO eigenbasis to both global and highly local learning settings.
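For reference, the heat kernel signature mentioned above is computed directly from a Laplacian eigensystem as HKS(x, t) = Σ_k e^{−λ_k t} φ_k(x)². The sketch below uses a toy graph Laplacian of a cycle as a stand-in for the cotangent LBO of a real mesh; the mesh construction and time samples are illustrative assumptions:

```python
import numpy as np

def heat_kernel_signature(evals, evecs, times):
    """HKS(x, t) = sum_k exp(-lambda_k * t) * phi_k(x)^2,
    computed for all vertices x and all t at once; returns (n_vertices, n_times)."""
    return (evecs**2) @ np.exp(-np.outer(evals, times))

# Toy stand-in for the LBO: graph Laplacian of a 32-vertex cycle "mesh".
n = 32
A = np.zeros((n, n))
idx = np.arange(n)
A[idx, (idx + 1) % n] = A[idx, (idx - 1) % n] = 1.0
L = np.diag(A.sum(1)) - A
evals, evecs = np.linalg.eigh(L)

hks = heat_kernel_signature(evals, evecs, times=np.array([0.1, 1.0, 10.0]))
print(hks.shape)  # (32, 3)
```

Because the cycle is vertex-transitive, every row of `hks` is identical here; on a real mesh the rows differ and form the per-vertex descriptor that the paper's learned eigenbasis improves upon.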
A study objectively assessed the fairness and robustness of Large Language Models (LLMs) in reasoning tasks when queried in African American Vernacular English (AAVE) versus Standardized English (SE). It found that most LLMs experienced statistically significant performance drops, averaging over 10% relative reduction, on AAVE queries across various reasoning categories, with Chain of Thought and standardization prompting proving insufficient to close this gap.
The deployment of Large Language Models (LLMs) for code debugging (e.g., C and Python) is widespread, benefiting from their ability to understand and interpret intricate concepts. However, in the semiconductor industry, utilising LLMs to debug Register Transfer Level (RTL) code is still insufficient, largely due to the underrepresentation of RTL-specific data in training sets. This work introduces a novel framework, Make Each Iteration Count (MEIC), which contrasts with traditional one-shot LLM-based debugging methods that heavily rely on prompt engineering, model tuning, and model training. MEIC utilises LLMs in an iterative process to overcome their limitations in RTL code debugging; it is suitable for identifying and correcting both syntax and function errors, while effectively managing the uncertainties inherent in LLM operations. To evaluate our framework, we provide an open-source dataset comprising 178 common RTL programming errors. The experimental results demonstrate that the proposed debugging framework achieves a fix rate of 93% for syntax errors and 78% for function errors, with up to a 48x speedup in the debugging process compared with experienced engineers. The dataset and code repository: this https URL.
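The iterative process MEIC describes (simulate, feed the log to an LLM, apply the fix, repeat) can be sketched generically. The hooks `simulate` and `llm_fix` below are hypothetical stand-ins, not the paper's actual prompts or tooling:

```python
def iterative_debug(code, simulate, llm_fix, max_iters=10):
    """Iterative LLM-in-the-loop RTL repair in the spirit of MEIC (a sketch).
    simulate(code) -> (ok, log) wraps the EDA toolchain;
    llm_fix(code, log) -> code wraps the LLM repair step."""
    for attempt in range(max_iters):
        ok, log = simulate(code)
        if ok:
            return code, attempt        # fixed after `attempt` repair rounds
        code = llm_fix(code, log)
    return code, -1                     # unresolved within the iteration budget

# Toy demo with stub hooks: a "simulator" that only accepts complete modules.
buggy = "module counter(input clk);"
sim = lambda c: ("endmodule" in c, "syntax error: missing endmodule")
fix = lambda c, log: c + "\nendmodule"
fixed, n_rounds = iterative_debug(buggy, sim, fix)
print(n_rounds)  # 1
```

Bounding the loop with `max_iters` is one simple way to manage the uncertainty the abstract mentions: an LLM repair step is not guaranteed to converge, so the budget caps wasted simulation runs.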
Researchers propose a physics-guided motion loss that regularizes video diffusion models by enforcing physical plausibility for translation, rotation, and scaling directly in the frequency domain. This approach improves temporal consistency and motion quality in generated videos, achieving substantial gains across various metrics and strong user preference without requiring architectural changes to the generative models.
This is the interim publication of the first International Scientific Report on the Safety of Advanced AI. The report synthesises the scientific understanding of general-purpose AI -- AI that can perform a wide variety of tasks -- with a focus on understanding and managing its risks. A diverse group of 75 AI experts contributed to this report, including an international Expert Advisory Panel nominated by 30 countries, the EU, and the UN. Led by the Chair, these independent experts collectively had full discretion over the report's content. The final report is available at arXiv:2501.17805
Verification presents a major bottleneck in Integrated Circuit (IC) development, consuming nearly 70% of the total development effort. While the Universal Verification Methodology (UVM) is widely used in industry to improve verification efficiency through structured and reusable testbenches, constructing these testbenches and generating sufficient stimuli remain challenging. These challenges arise from the considerable manual coding effort required, repetitive manual execution of multiple EDA tools, and the need for in-depth domain expertise to navigate complex verification flows. In this paper, we present UVM^2, an automated verification framework that leverages Large Language Models (LLMs) to generate UVM testbenches and iteratively refine them using coverage feedback, significantly reducing manual effort while maintaining rigorous verification quality. To evaluate UVM^2, we introduce a benchmark suite comprising Register Transfer Level (RTL) designs of up to 1.6K lines of code. Results show that UVM^2 reduces testbench setup time compared to experienced engineers, and achieves average code and function coverage of 87.44% and 89.58%, outperforming state-of-the-art solutions by 20.96% and 23.51%, respectively.
Emergent effects can arise in multi-agent systems (MAS) where execution is decentralized and reliant on local information. These effects may range from minor deviations in behavior to catastrophic system failures. To formally define these effects, we identify misalignments between the global inherent specification (the true specification) and its local approximation (such as the configuration of different reward components or observations). Using established safety terminology, we develop a framework to understand these emergent effects. To showcase the resulting implications, we use two broadly configurable exemplary gridworld scenarios, where insufficient specification leads to unintended behavior deviations when derived independently. Recognizing that a global adaptation might not always be feasible, we propose adjusting the underlying parameterizations to mitigate these issues, thereby improving the system's alignment and reducing the risk of emergent failures.
Dexterous in-hand manipulation remains a foundational challenge in robotics, with progress often constrained by the prevailing paradigm of imitating the human hand. This anthropomorphic approach creates two critical barriers: 1) it limits robotic capabilities to tasks humans can already perform, and 2) it makes data collection for learning-based methods exceedingly difficult. Both challenges are caused by traditional force-closure which requires coordinating complex, multi-point contacts based on friction, normal force, and gravity to grasp an object. This makes teleoperated demonstrations unstable and amplifies the sim-to-real gap for reinforcement learning. In this work, we propose a paradigm shift: moving away from replicating human mechanics toward the design of novel robotic embodiments. We introduce the Suction Leap-Hand (SLeap Hand), a multi-fingered hand featuring integrated fingertip suction cups that realize a new form of suction-enabled dexterity. By replacing complex force-closure grasps with stable, single-point adhesion, our design fundamentally simplifies in-hand teleoperation and facilitates the collection of high-quality demonstration data. More importantly, this suction-based embodiment unlocks a new class of dexterous skills that are difficult or even impossible for the human hand, such as one-handed paper cutting and in-hand writing. Our work demonstrates that by moving beyond anthropomorphic constraints, novel embodiments can not only lower the barrier for collecting robust manipulation data but also enable the stable, single-handed completion of tasks that would typically require two human hands. Our webpage is this https URL.