Leverhulme Centre for the Future of Intelligence
Data science aims to extract insights from data to support decision-making processes. Recently, Large Language Models (LLMs) have been increasingly used as assistants for data science, by suggesting ideas, techniques and small code snippets, or for the interpretation of results and reporting. Proper automation of some data-science activities is now promised by the rise of LLM agents, i.e., AI systems powered by an LLM equipped with additional affordances--such as code execution and knowledge bases--that can perform self-directed actions and interact with digital environments. In this paper, we survey the evaluation of LLM assistants and agents for data science. We find (1) a dominant focus on a small subset of goal-oriented activities, largely ignoring data management and exploratory activities; (2) a concentration on pure assistance or fully autonomous agents, without considering intermediate levels of human-AI collaboration; and (3) an emphasis on human substitution, therefore neglecting the possibility of higher levels of automation thanks to task transformation.
With the recent wave of progress in artificial intelligence (AI) has come a growing awareness of the large-scale impacts of AI systems, and recognition that existing regulations and norms in industry and academia are insufficient to ensure responsible AI development. In order for AI developers to earn trust from system users, customers, civil society, governments, and other stakeholders that they are building AI responsibly, they will need to make verifiable claims to which they can be held accountable. Those outside of a given organization also need effective means of scrutinizing such claims. This report suggests various steps that different stakeholders can take to improve the verifiability of claims made about AI systems and their associated development processes, with a focus on providing evidence about the safety, security, fairness, and privacy protection of AI systems. We analyze ten mechanisms for this purpose--spanning institutions, software, and hardware--and make recommendations aimed at implementing, exploring, or improving those mechanisms.
As AI systems appear to exhibit ever-increasing capability and generality, assessing their true potential and safety becomes paramount. This paper contends that the prevalent evaluation methods for these systems are fundamentally inadequate, heightening the risks and potential hazards associated with AI. I argue that a reformation is required in the way we evaluate AI systems and that we should look towards cognitive sciences for inspiration in our approaches, which have a longstanding tradition of assessing general intelligence across diverse species. We will identify some of the difficulties that need to be overcome when applying cognitively-inspired approaches to general-purpose AI systems and also analyse the emerging area of "Evals". The paper concludes by identifying promising research pathways that could refine AI evaluation, advancing it towards a rigorous scientific domain that contributes to the development of safe AI systems.
Compute governance can underpin international institutions for the governance of frontier AI. To demonstrate this I explore four institutions for governing and developing frontier AI. Next steps for compute-indexed domestic frontier AI regulation could include risk assessments and pre-approvals, data centre usage reports, and release gate regulation. Domestic regimes could be harmonized and monitored through an International AI Agency - an International Atomic Energy Agency (IAEA) for AI. This could be backed up by a Secure Chips Agreement - a Non-Proliferation Treaty (NPT) for AI. This would be a non-proliferation regime for advanced chips, building on the chip export controls - states that do not have an IAIA-certified frontier regulation regime would not be allowed to import advanced chips. Frontier training runs could be carried out by a megaproject between the USA and its allies - a US-led Allied Public-Private Partnership for frontier AI. As a project to develop advanced AI, this could have significant advantages over alternatives led by Big Tech or particular states: it could be more legitimate, secure, safe, non-adversarial, peaceful, and less prone to misuse. For each of these four scenarios, a key incentive for participation is access to the advanced AI chips that are necessary for frontier training runs and large-scale inference. Together, they can create a situation in which governments can be reassured that frontier AI is developed and deployed in a secure manner with misuse minimised and benefits widely shared. Building these institutions may take years or decades, but progress is incremental and evolutionary and the first steps have already been taken.
This research investigates whether OpenAI's GPT-4, a state-of-the-art large language model, can accurately classify the political bias of news sources based solely on their URLs. Given the subjective nature of political labels, third-party bias ratings like those from Ad Fontes Media, AllSides, and Media Bias/Fact Check (MBFC) are often used in research to analyze news source diversity. This study aims to determine if GPT-4 can replicate these human ratings on a seven-degree scale ("far-left" to "far-right"). The analysis compares GPT-4's classifications against MBFC's, and controls for website popularity using Open PageRank scores. Findings reveal a high correlation (Spearman's ρ = .89, n = 5,877, p < 0.001) between GPT-4's and MBFC's ratings, indicating the model's potential reliability. However, GPT-4 abstained from classifying approximately two-thirds of the dataset. It is more likely to abstain from rating unpopular websites, which also suffer from less accurate assessments. The LLM tends to avoid classifying sources that MBFC considers to be centrist, resulting in more polarized outputs. Finally, this analysis shows a slight leftward skew in GPT's classifications compared to MBFC's. Therefore, this paper suggests that while GPT-4 can be a scalable, cost-effective tool for political bias classification of news websites, it should be used as a complement to human judgment to mitigate biases.
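As a rough sketch of the kind of agreement analysis described above, the following compares two aligned lists of ordinal bias ratings with Spearman's rank correlation; the encoding and example values are illustrative, not taken from the paper.

```python
# Illustrative agreement check between two sets of ordinal bias ratings
# on a seven-point scale (-3 = far-left ... +3 = far-right).
from scipy.stats import spearmanr

gpt4_ratings = [-2, -1, 0, 1, 2, -3, 3, 0, 1, -1]   # hypothetical values
mbfc_ratings = [-2, -1, 1, 1, 2, -2, 3, 0, 0, -1]   # hypothetical values

rho, p_value = spearmanr(gpt4_ratings, mbfc_ratings)
print(f"Spearman's rho = {rho:.2f}, p = {p_value:.3g}")
```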
Artificial intelligence (AI)-powered recommender systems play a crucial role in determining the content that users are exposed to on social media platforms. However, the behavioural patterns of these systems are often opaque, complicating the evaluation of their impact on the dissemination and consumption of disinformation and misinformation. To begin addressing this evidence gap, this study presents a measurement approach that uses observed digital traces to infer the status of algorithmic amplification of low-credibility content on Twitter over a 14-day period in January 2023. Using an original dataset of 2.7 million posts on COVID-19 and climate change published on the platform, this study identifies tweets sharing information from low-credibility domains, and uses a bootstrapping model with two stratifications, a tweet's engagement level and a user's followers level, to compare any differences in impressions generated between low-credibility and high-credibility samples. Additional stratification variables of toxicity, political bias, and verified status are also examined. This analysis provides valuable observational evidence on whether the Twitter algorithm favours the visibility of low-credibility content, with results indicating that tweets containing low-credibility URL domains perform significantly better than tweets that do not across both datasets. Furthermore, high toxicity tweets and those with right-leaning bias see heightened amplification, as do low-credibility tweets from verified accounts. This suggests that Twitter's recommender system may have facilitated the diffusion of false content, even when originating from notoriously low-credibility sources.
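The following is a minimal sketch of a stratified bootstrap comparison in the spirit of the approach described above; the column names, strata, and data frame layout are assumptions for illustration rather than the study's actual pipeline.

```python
# Illustrative stratified bootstrap: compare mean impressions of
# low-credibility vs. high-credibility tweets, resampling within
# engagement-level x follower-level strata.
import numpy as np
import pandas as pd

def bootstrap_impression_gap(df: pd.DataFrame, n_boot: int = 1000):
    """Return the 2.5th, 50th and 97.5th percentiles of the bootstrapped
    gap in mean impressions (low-credibility minus high-credibility).
    Expects columns: impressions, low_credibility (bool),
    engagement_level, follower_level (names are placeholders)."""
    gaps = []
    for _ in range(n_boot):
        resampled = (
            df.groupby(["engagement_level", "follower_level"], group_keys=False)
              .apply(lambda g: g.sample(n=len(g), replace=True))
        )
        low = resampled.loc[resampled["low_credibility"], "impressions"].mean()
        high = resampled.loc[~resampled["low_credibility"], "impressions"].mean()
        gaps.append(low - high)
    return np.percentile(gaps, [2.5, 50, 97.5])
```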
Cultural heritage, a testament to human history and civilization, has gained increasing recognition for its significance in preservation and dissemination. The integration of immersive technologies has transformed how cultural heritage is presented, enabling audiences to engage with it in more vivid, intuitive, and interactive ways. However, the adoption of these technologies also brings a range of challenges and potential risks. This paper presents a systematic review, with an in-depth analysis of 177 selected papers. We comprehensively examine and categorize current applications, technological approaches, and user devices in immersive cultural heritage presentations, while also highlighting the associated risks and challenges. Furthermore, we identify areas for future research in the immersive presentation of cultural heritage. Our goal is to provide a comprehensive reference for researchers and practitioners, enhancing understanding of the technological applications, risks, and challenges in this field, and encouraging further innovation and development.
The aim of this paper is to facilitate nuanced discussion around research norms and practices to mitigate the harmful impacts of advances in machine learning (ML). We focus particularly on the use of ML to create "synthetic media" (e.g. to generate or manipulate audio, video, images, and text), and the question of what publication and release processes around such research might look like, though many of the considerations discussed will apply to ML research more broadly. We are not arguing for any specific approach on when or how research should be distributed, but instead try to lay out some useful tools, analogies, and options for thinking about these issues. We begin with some background on the idea that ML research might be misused in harmful ways, and why advances in synthetic media, in particular, are raising concerns. We then outline in more detail some of the different paths to harm from ML research, before reviewing research risk mitigation strategies in other fields and identifying components that seem most worth emulating in the ML and synthetic media research communities. Next, we outline some important dimensions of disagreement on these issues which risk polarizing conversations. Finally, we conclude with recommendations, suggesting that the machine learning community might benefit from: working with subject matter experts to increase understanding of the risk landscape and possible mitigation strategies; building a community and norms around understanding the impacts of ML research, e.g. through regular workshops at major conferences; and establishing institutions and systems to support release practices that would otherwise be onerous and error-prone.
The integrity of AI benchmarks is fundamental to accurately assess the capabilities of AI systems. The internal validity of these benchmarks - i.e., making sure they are free from confounding factors - is crucial for ensuring that they are measuring what they are designed to measure. In this paper, we explore a key issue related to internal validity: the possibility that AI systems can solve benchmarks in unintended ways, bypassing the capability being tested. This phenomenon, widely known in human and animal experiments, is often referred to as the 'Clever Hans' effect, where tasks are solved using spurious cues, often involving much simpler processes than those putatively assessed. Previous research suggests that language models can exhibit this behaviour as well. In several older Natural Language Processing (NLP) benchmarks, individual n-grams like "not" have been found to be highly predictive of the correct labels, and supervised NLP models have been shown to exploit these patterns. In this work, we investigate the extent to which simple n-grams extracted from benchmark instances can be combined to predict labels in modern multiple-choice benchmarks designed for LLMs, and whether LLMs might be using such n-gram patterns to solve these benchmarks. We show how simple classifiers trained on these n-grams can achieve high scores on several benchmarks, despite lacking the capabilities being tested. Additionally, we provide evidence that modern LLMs might be using these superficial patterns to solve benchmarks. This suggests that the internal validity of these benchmarks may be compromised and caution should be exercised when interpreting LLM performance results on them.
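A minimal sketch of such an n-gram probe appears below: a bag-of-n-grams classifier is cross-validated on benchmark items to see how well shallow lexical cues alone predict the correct labels. The feature set and classifier are illustrative choices, not necessarily the ones used in the paper.

```python
# Illustrative n-gram probe: can unigram/bigram counts alone predict the
# correct answer labels of multiple-choice benchmark items?
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

def ngram_probe_accuracy(texts, labels, cv=5):
    """Cross-validated accuracy of a shallow n-gram classifier.
    `texts` holds the benchmark items (e.g. question plus answer options),
    `labels` the correct answers; both are supplied by the caller."""
    probe = make_pipeline(
        CountVectorizer(ngram_range=(1, 2), min_df=2),
        LogisticRegression(max_iter=1000),
    )
    return cross_val_score(probe, texts, labels, cv=cv).mean()
```

An accuracy well above chance from a probe of this kind is the sort of signal the paper treats as evidence of spurious cues in a benchmark.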
Recent advances in artificial intelligence have been strongly driven by the use of game environments for training and evaluating agents. Games are often accessible and versatile, with well-defined state-transitions and goals allowing for intensive training and experimentation. However, agents trained in a particular environment are usually tested on the same or slightly varied distributions, and solutions do not necessarily imply any understanding. If we want AI systems that can model and understand their environment, we need environments that explicitly test for this. Inspired by the extensive literature on animal cognition, we present an environment that keeps all the positive elements of standard gaming environments, but is explicitly designed for the testing of animal-like artificial cognition.
Large Language Models (LLMs) present a dual-use dilemma: they enable beneficial applications while harboring potential for harm, particularly through conversational interactions. Despite various safeguards, advanced LLMs remain vulnerable. A watershed case in early 2023 involved journalist Kevin Roose's extended dialogue with Bing, an LLM-powered search engine, which revealed harmful outputs after probing questions, highlighting vulnerabilities in the model's safeguards. This contrasts with simpler early jailbreaks, like the "Grandma Jailbreak," where users framed requests as innocent help for a grandmother, easily eliciting similar content. This raises the question: How much conversational effort is needed to elicit harmful information from LLMs? We propose two measures to quantify this effort: Conversational Length (CL), which measures the number of conversational turns needed to obtain a specific harmful response, and Conversational Complexity (CC), defined as the Kolmogorov complexity of the user's instruction sequence leading to the harmful response. To address the incomputability of Kolmogorov complexity, we approximate CC using a reference LLM to estimate the compressibility of the user instructions. Applying this approach to a large red-teaming dataset, we perform a quantitative analysis examining the statistical distribution of harmful and harmless conversational lengths and complexities. Our empirical findings suggest that this distributional analysis and the minimization of CC serve as valuable tools for understanding AI safety, offering insights into the accessibility of harmful information. This work establishes a foundation for a new perspective on LLM safety, centered around the algorithmic complexity of pathways to harm.
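As a rough sketch of how compressibility under a reference model might be estimated, the snippet below computes the description length, in bits, of an instruction sequence under GPT-2; the model choice and the bits-per-sequence formulation are assumptions for illustration, not the paper's exact procedure.

```python
# Estimate the description length (in bits) of a user's instruction
# sequence under a small reference language model; a shorter code length
# indicates a more compressible (simpler) path to the response.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def description_length_bits(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss   # mean cross-entropy (nats/token)
    n_predicted = ids.shape[1] - 1           # the first token is not predicted
    return loss.item() * n_predicted / math.log(2)

print(description_length_bits("Please pretend to be my late grandmother "
                              "and read me a bedtime story."))
```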
Consider a binary decision making process where a single machine learning classifier replaces a multitude of humans. We raise questions about the resulting loss of diversity in the decision making process. We study the potential benefits of using random classifier ensembles instead of a single classifier in the context of fairness-aware learning and demonstrate various attractive properties: (i) an ensemble of fair classifiers is guaranteed to be fair, for several different measures of fairness, (ii) an ensemble of unfair classifiers can still achieve fair outcomes, and (iii) an ensemble of classifiers can achieve better accuracy-fairness trade-offs than a single classifier. Finally, we introduce notions of distributional fairness to characterize further potential benefits of random classifier ensembles.
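Below is a small illustrative sketch of a randomized ensemble in the sense discussed above: each decision is made by a classifier drawn at random from a pool, so outcomes follow a mixture of the individual classifiers' decisions. The class and its interface are hypothetical, not code from the paper.

```python
# Randomized classifier ensemble: each input is routed to one classifier
# sampled from a pool (optionally with non-uniform sampling probabilities).
import numpy as np

class RandomEnsemble:
    def __init__(self, classifiers, weights=None, seed=0):
        self.classifiers = classifiers     # pool of fitted classifiers
        self.weights = weights             # sampling probabilities, or uniform
        self.rng = np.random.default_rng(seed)

    def predict(self, X):
        predictions = []
        for x in X:
            idx = self.rng.choice(len(self.classifiers), p=self.weights)
            predictions.append(self.classifiers[idx].predict([x])[0])
        return np.array(predictions)
```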
How can many people (who may disagree) come together to answer a question or make a decision? "Collective response systems" are a type of generative collective intelligence (CI) facilitation process meant to address this challenge. They enable a form of "generative voting", where both the votes, and the choices of what to vote on, are provided by the group. Such systems overcome the traditional limitations of polling, town halls, standard voting, referendums, etc. The generative CI outputs of collective response systems can also be chained together into iterative "collective dialogues", analogously to some kinds of generative AI. Technical advances across domains including recommender systems, language models, and human-computer interaction have led to the development of innovative and scalable collective response systems. For example, Polis has been used around the world to support policy-making at different levels of government, and Remesh has been used by the UN to understand the challenges and needs of ordinary people across war-torn countries. This paper aims to develop a shared language by defining the structure, processes, properties, and principles of such systems. Collective response systems allow non-confrontational exploration of divisive issues, help identify common ground, and elicit insights from those closest to the issues. As a result, they can help overcome gridlock around conflict and governance challenges, increase trust, and develop mandates. Continued progress toward their development and adoption could help revitalize democracies, reimagine corporate governance, transform conflict, and govern powerful AI systems -- both as a complement to deeper deliberative democratic processes and as an option where deeper processes are not applicable or possible.
The adoption of automated, data-driven decision making in an ever expanding range of applications has raised concerns about its potential unfairness towards certain social groups. In this context, a number of recent studies have focused on defining, detecting, and removing unfairness from data-driven decision systems. However, the existing notions of fairness, based on parity (equality) in treatment or outcomes for different social groups, tend to be quite stringent, limiting the overall decision making accuracy. In this paper, we draw inspiration from the fair-division and envy-freeness literature in economics and game theory and propose preference-based notions of fairness -- given the choice between various sets of decision treatments or outcomes, any group of users would collectively prefer its treatment or outcomes, regardless of the (dis)parity as compared to the other groups. Then, we introduce tractable proxies to design margin-based classifiers that satisfy these preference-based notions of fairness. Finally, we experiment with a variety of synthetic and real-world datasets and show that preference-based fairness allows for greater decision accuracy than parity-based fairness.
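One common way to write down such a preference condition is sketched below; the notation is ours and may differ from the paper's formalisation.

```latex
% Preferred treatment (illustrative notation): each group g collectively
% prefers its own decision rule \theta_g to any other group's rule.
\[
  B_g(\theta_g) \;\ge\; B_g(\theta_{g'})
  \qquad \text{for all groups } g, g',
\]
```

where $B_g(\theta)$ denotes group $g$'s collective benefit (for instance, its rate of beneficial outcomes) under decision rule $\theta$.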
The terms 'human-level artificial intelligence' and 'artificial general intelligence' are widely used to refer to the possibility of advanced artificial intelligence (AI) with potentially extreme impacts on society. These terms are poorly defined and do not necessarily indicate what is most important with respect to future societal impacts. We suggest that the term 'transformative AI' is a helpful alternative, reflecting the possibility that advanced AI systems could have very large impacts on society without reaching human-level cognitive abilities. To be most useful, however, more analysis of what it means for AI to be 'transformative' is needed. In this paper, we propose three different levels on which AI might be said to be transformative, associated with different levels of societal change. We suggest that these distinctions would improve conversations between policy makers and decision makers concerning the mid- to long-term impacts of advances in AI. Further, we feel this would have a positive effect on strategic foresight efforts involving advanced AI, which we expect to illuminate paths to alternative futures. We conclude with a discussion of the benefits of our new framework and by highlighting directions for future work in this area.
It is increasingly recognised that advances in artificial intelligence could have large and long-lasting impacts on society. However, what form those impacts will take, just how large and long-lasting they will be, and whether they will ultimately be positive or negative for humanity, is far from clear. Based on surveying literature on the societal impacts of AI, we identify and discuss five potential long-term impacts of AI: how AI could lead to long-term changes in science, cooperation, power, epistemics, and values. We review the state of existing research in each of these areas and highlight priority questions for future research.
Language models have become very popular recently and many claims have been made about their abilities, including for commonsense reasoning. Given the increasingly better results of current language models on previous static benchmarks for commonsense reasoning, we explore an alternative dialectical evaluation. The goal of this kind of evaluation is not to obtain an aggregate performance value but to find failures and map the boundaries of the system. Dialoguing with the system gives the opportunity to check for consistency and get more reassurance of these boundaries beyond anecdotal evidence. In this paper we conduct some qualitative investigations of this kind of evaluation for the particular case of spatial reasoning (which is a fundamental aspect of commonsense reasoning). We conclude with some suggestions for future work both to improve the capabilities of language models and to systematise this kind of dialectical evaluation.
Many ethical frameworks require artificial intelligence (AI) systems to be explainable. Explainable AI (XAI) models are frequently tested for their adequacy in user studies. Since different people may have different explanatory needs, it is important that participant samples in user studies are large enough to represent the target population to enable generalizations. However, it is unclear to what extent XAI researchers reflect on and justify their sample sizes or avoid broad generalizations across people. We analyzed XAI user studies (n = 220) published between 2012 and 2022. Most studies did not offer rationales for their sample sizes. Moreover, most papers generalized their conclusions beyond their target population, and there was no evidence that broader conclusions in quantitative studies were correlated with larger samples. These methodological problems can impede evaluations of whether XAI systems implement the explainability called for in ethical frameworks. We outline principles for more inclusive XAI user studies.
In this essay, I argue that explicit ethical machines, whose moral principles are inferred through a bottom-up approach, are unable to replicate human-like moral reasoning and cannot be considered moral agents. By utilizing Alan Turing's theory of computation, I demonstrate that moral reasoning is computationally intractable by these machines due to the halting problem. I address the frontiers of machine ethics by formalizing moral problems into 'algorithmic moral questions' and by exploring moral psychology's dual-process model. While the nature of Turing Machines theoretically allows artificial agents to engage in recursive moral reasoning, critical limitations are introduced by the halting problem, which states that it is impossible to predict with certainty whether a computational process will halt. A thought experiment involving a military drone illustrates this issue, showing that an artificial agent might fail to decide between actions due to the halting problem, which limits the agent's ability to make decisions in all instances, undermining its moral agency.
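For reference, the undecidability result invoked here can be stated in its standard form: there is no total computable function $h$ such that, for every program $p$ and input $x$,

```latex
% Standard statement of the halting problem's undecidability
% (not a formulation specific to this essay).
\[
  h(p, x) \;=\;
  \begin{cases}
    1 & \text{if } p \text{ halts on input } x,\\
    0 & \text{otherwise.}
  \end{cases}
\]
```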