Oxford Internet Institute
Polymarket is a prediction market platform where users can speculate on future events by trading shares tied to specific outcomes, known as conditions. Each market is associated with a set of one or more such conditions. To ensure proper market resolution, the condition set must be exhaustive, collectively accounting for all possible outcomes, and mutually exclusive, so that only one condition may resolve as true. Thus, the collective prices of all related outcomes should sum to $1, representing a combined probability of 1 that some outcome occurs. Despite this design, Polymarket exhibits cases where dependent assets are mispriced, allowing for purchasing (or selling) a certain outcome for less than (or more than) $1, guaranteeing profit. This phenomenon, known as arbitrage, could enable sophisticated participants to exploit such inconsistencies. In this paper, we conduct an empirical arbitrage analysis on Polymarket data to answer three key questions: (Q1) What conditions give rise to arbitrage? (Q2) Does arbitrage actually occur on Polymarket? (Q3) Has anyone exploited these opportunities? A major challenge in analyzing arbitrage between related markets lies in the scalability of comparisons across a large number of markets and conditions, with a naive analysis requiring O(2^{n+m}) comparisons. To overcome this, we employ a heuristic-driven reduction strategy based on timeliness, topical similarity, and combinatorial relationships, further validated by expert input. Our study reveals two distinct forms of arbitrage on Polymarket: Market Rebalancing Arbitrage, which occurs within a single market or condition, and Combinatorial Arbitrage, which spans multiple markets. We use on-chain historical order book data to analyze when these types of arbitrage opportunities have existed and when they have been executed by users. We find a realized estimate of 40 million USD of profit extracted.
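To make the mispricing concrete, below is a minimal Python sketch of the check underlying Market Rebalancing Arbitrage: if one share of every condition in an exhaustive, mutually exclusive set can be bought for less than the guaranteed $1 payout, the difference is risk-free profit. The prices and fee parameter are illustrative assumptions, not Polymarket's actual order books.

```python
# Illustrative check for Market Rebalancing Arbitrage within one market.
# Prices and the fee parameter are hypothetical; real order books have
# limited depth, so executable size matters in practice.

def rebalancing_arbitrage(ask_prices: list[float], fee: float = 0.0) -> float:
    """Guaranteed profit per full outcome set, or 0.0 if none exists.

    The conditions are exhaustive and mutually exclusive, so exactly one
    outcome resolves true and pays $1. Buying one share of every outcome
    therefore pays $1 at resolution; if the set costs less than $1 net of
    fees, the difference is locked-in profit.
    """
    cost = sum(ask_prices) * (1.0 + fee)
    return max(0.0, 1.0 - cost)

# Three mutually exclusive outcomes priced at a combined $0.97:
print(rebalancing_arbitrage([0.40, 0.35, 0.22]))  # ~0.03 profit per set
```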
A comprehensive study by OpenAI, the Centre for the Governance of AI, and collaborating institutions examines computing power as a critical lever for AI governance. The research details how compute's unique properties, such as detectability and supply chain concentration, enable policymakers to enhance visibility into AI development, influence resource allocation, and enforce regulations, proposing specific policy mechanisms for these capacities.
We introduce the Momentum Transformer, an attention-based deep-learning architecture, which outperforms benchmark time-series momentum and mean-reversion trading strategies. Unlike state-of-the-art Long Short-Term Memory (LSTM) architectures, which are sequential in nature and tailored to local processing, an attention mechanism provides our architecture with a direct connection to all previous time-steps. Our architecture, an attention-LSTM hybrid, enables us to learn longer-term dependencies, improves performance when considering returns net of transaction costs and naturally adapts to new market regimes, such as during the SARS-CoV-2 crisis. Via the introduction of multiple attention heads, we can capture concurrent regimes, or temporal dynamics, which are occurring at different timescales. The Momentum Transformer is inherently interpretable, providing us with greater insights into our deep-learning momentum trading strategy, including the importance of different factors over time and the past time-steps which are of the greatest significance to the model.
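As a rough illustration of the attention-LSTM hybrid idea, here is a minimal PyTorch sketch: an LSTM provides local sequential encoding, a multi-head attention block gives the final step a direct connection to all previous time-steps, and a tanh head outputs a position in [-1, 1]. The layer sizes, single attention block, and position head are assumptions for illustration, not the authors' exact architecture.

```python
# Minimal attention-LSTM hybrid in the spirit of the Momentum Transformer.
# Illustrative sketch only: sizes and heads are assumed, not the paper's.
import torch
import torch.nn as nn

class AttentionLSTMMomentum(nn.Module):
    def __init__(self, n_features: int, hidden: int = 64, heads: int = 4):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        # Attention gives each step a direct connection to all previous
        # steps; separate heads can attend to different timescales.
        self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, features) of return/volatility inputs
        h, _ = self.lstm(x)                            # local sequential encoding
        a, _ = self.attn(h, h, h, need_weights=False)  # long-range mixing
        return torch.tanh(self.head(a[:, -1]))         # position in [-1, 1]

model = AttentionLSTMMomentum(n_features=8)
positions = model(torch.randn(32, 60, 8))  # 32 assets, 60-step lookback
```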
The global workforce is urged to constantly reskill, as technological change favours particular new skills while making others redundant. But which skills are a good investment for workers and firms? As skills are seldom applied in isolation, we propose that complementarity strongly determines a skill's economic value. For 962 skills, we demonstrate that their value is strongly determined by complementarity, that is, by how many different skills, ideally of high value, a competency can be combined with. We show that the value of a skill is relative, as it depends on the skill background of the worker. For most skills, their value is highest when used in combination with skills of a different type. We put our model to the test with a set of skills related to Artificial Intelligence (AI). We find that AI skills are particularly valuable, increasing worker wages by 21% on average, because of their strong complementarities and their rising demand in recent years. The model and metrics of our work can inform the policy and practice of digital re-skilling to reduce labour market mismatches. In cooperation with data and education providers, researchers and policy makers should consider using this blueprint to provide learners with personalised skill recommendations that complement their existing capacities and fit their occupational background.
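A toy sketch of the complementarity idea, under strong assumptions: if we observe which skills co-occur in worker profiles and assign each skill a base value, a skill's complementarity can be scored over the value-weighted set of distinct skills it combines with. The profiles, base values, and scoring rule below are illustrative, not the paper's estimated model for the 962 skills.

```python
# Toy complementarity score: the value-weighted count of distinct skills a
# skill co-occurs with. Profiles and base values are hypothetical.
from collections import defaultdict

profiles = [  # hypothetical worker skill sets
    {"python", "machine learning", "statistics"},
    {"python", "sql", "statistics"},
    {"machine learning", "deep learning", "python"},
]
base_value = {"python": 1.0, "machine learning": 1.5, "statistics": 1.2,
              "sql": 0.9, "deep learning": 1.6}

cooccur = defaultdict(set)
for profile in profiles:
    for skill in profile:
        cooccur[skill] |= profile - {skill}

def complementarity(skill: str) -> float:
    # How many different skills, ideally of high value, it combines with.
    return sum(base_value[other] for other in cooccur[skill])

for skill in base_value:
    print(skill, round(complementarity(skill), 2))
```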
Large pre-trained language models have shown promise for few-shot learning, completing text-based tasks given only a few task-specific examples. Will models soon solve classification tasks that have so far been reserved for human research assistants? Existing benchmarks are not designed to measure progress in applied settings, and so don't directly answer this question. The RAFT benchmark (Real-world Annotated Few-shot Tasks) focuses on naturally occurring tasks and uses an evaluation setup that mirrors deployment. Baseline evaluations on RAFT reveal areas current techniques struggle with: reasoning over long texts and tasks with many classes. Human baselines show that some classification tasks are difficult for non-expert humans, reflecting that real-world value sometimes depends on domain expertise. Yet even non-expert human baseline F1 scores exceed GPT-3 by an average of 0.11. The RAFT datasets and leaderboard will track which model improvements translate into real-world benefits at https://raft.elicit.org.
Mobile dating applications (MDAs) have skyrocketed in popularity in the last few years, with popular MDA Tinder alone matching 26 million pairs of users per day. In addition to becoming an influential part of modern dating culture, MDAs facilitate a unique form of mediated communication: dyadic mobile text messages between pairs of users who are not already acquainted. Furthermore, mobile dating has paved the way for analysis of these digital interactions via massive sets of data generated by the instant matching and messaging functions of its many platforms at an unprecedented scale. This paper looks at one of these sets of data: metadata of approximately two million conversations, containing 19 million messages, exchanged between 400,000 heterosexual users on an MDA. Through computational analysis methods, this study offers the first large-scale quantitative depiction of mobile dating as a whole. We report on differences in how heterosexual male and female users communicate with each other on MDAs, differences in behaviors of dyads of varying degrees of social separation, and factors leading to "success", operationalized by the exchange of phone numbers within a match. For instance, we report that men initiate 79% of conversations, and that while about half of the initial messages are responded to, conversations initiated by men are more likely to be reciprocated. We also report that the length of conversations, the waiting times, and the length of messages have fat-tailed distributions. That said, the majority of reciprocated conversations lead to a phone number exchange within the first 20 messages.
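A quick way to check a fat-tail claim on such data is to plot an empirical complementary CDF on log-log axes, where a heavy tail appears as an approximately straight line. The snippet below uses synthetic Pareto-distributed conversation lengths as a stand-in for the non-public MDA data.

```python
# Empirical CCDF on log-log axes; synthetic data stands in for the real
# conversation-length metadata, which is not public.
import numpy as np
import matplotlib.pyplot as plt

lengths = np.random.pareto(a=1.5, size=100_000) + 1  # synthetic fat tail
x = np.sort(lengths)
ccdf = 1.0 - np.arange(1, len(x) + 1) / len(x)

plt.loglog(x[:-1], ccdf[:-1])  # drop the final zero before taking logs
plt.xlabel("conversation length (messages)")
plt.ylabel("P(X > x)")
plt.show()
```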
This paper discusses the results of the third edition of the Monocular Depth Estimation Challenge (MDEC). The challenge focuses on zero-shot generalization to the challenging SYNS-Patches dataset, featuring complex scenes in natural and indoor settings. As with the previous edition, methods can use any form of supervision, i.e. supervised or self-supervised. The challenge received a total of 19 submissions that outperformed the baseline on the test set; 10 of them submitted a report describing their approach, highlighting the widespread use of foundation models such as Depth Anything at the core of their methods. The challenge winners drastically improved 3D F-Score performance, from 17.51% to 23.72%.
Digital watermarking is a promising solution for mitigating some of the risks arising from the misuse of automatically generated text. These approaches either embed non-specific watermarks to allow for the detection of any text generated by a particular sampler, or embed specific keys that allow the identification of the LLM user. However, simultaneously using the same embedding for both detection and user identification leads to a false detection problem, whereby, as user capacity grows, unwatermarked text is increasingly likely to be falsely detected as watermarked. Through theoretical analysis, we identify the underlying causes of this phenomenon. Building on these insights, we propose Dual Watermarking which jointly encodes detection and identification watermarks into generated text, significantly reducing false positives while maintaining high detection accuracy. Our experimental results validate our theoretical findings and demonstrate the effectiveness of our approach.
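The false detection problem admits a one-line worked example: if each per-user key test has false-positive rate alpha, then screening one unwatermarked text against all K user keys flags it with probability 1 - (1 - alpha)^K, which approaches 1 as user capacity grows. The numbers below are illustrative.

```python
# Why per-user keys inflate false detections: multiple testing across keys.
alpha = 1e-3      # false-positive rate of a single key's detection test
for K in (10, 1_000, 10_000):          # number of user keys screened
    print(K, 1 - (1 - alpha) ** K)     # ~0.00996, ~0.632, ~0.99995
```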
The proliferation of applications using artificial intelligence (AI) systems has led to a growing number of users interacting with these systems through sophisticated interfaces. Human-computer interaction research has long shown that interfaces shape both user behavior and user perception of technical capabilities and risks. Yet, practitioners and researchers evaluating the social and ethical risks of AI systems tend to overlook the impact of anthropomorphic, deceptive, and immersive interfaces on human-AI interactions. Here, we argue that design features of interfaces with adaptive AI systems can have cascading impacts, driven by feedback loops, which extend beyond those previously considered. We first conduct a scoping review of AI interface designs and their negative impacts to extract salient themes of potentially harmful design patterns in AI interfaces. Then, we propose Design-Enhanced Control of AI systems (DECAI), a conceptual model to structure and facilitate impact assessments of AI interface designs. DECAI draws on principles from control systems theory, a theory for the analysis and design of dynamic physical systems, to dissect the role of the interface in human-AI systems. Through two case studies on recommendation systems and conversational language model systems, we show how DECAI can be used to evaluate AI interface designs.
We present a large-scale computational analysis of migration-related discourse in UK parliamentary debates spanning over 75 years and compare it with US congressional discourse. Using open-weight LLMs, we annotate each statement with high-level stances toward migrants and track the net tone toward migrants across time and political parties. For the UK, we extend this with a semi-automated framework for extracting fine-grained narrative frames to capture nuances of migration discourse. Our findings show that, while US discourse has grown increasingly polarised, UK parliamentary attitudes remain relatively aligned across parties, with a persistent ideological gap between Labour and the Conservatives, reaching its most negative level in 2025. The analysis of narrative frames in the UK parliamentary statements reveals a shift toward securitised narratives such as border control and illegal immigration, while longer-term integration-oriented frames such as social integration have declined. Moreover, discussions of national immigration law have over time been displaced by discussions of international law and human rights, revealing nuances in discourse trends. Taken together, our findings demonstrate how LLMs can support scalable, fine-grained discourse analysis in political and historical contexts.
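As an illustration of what statement-level stance annotation can look like, the sketch below uses a zero-shot NLI classifier from HuggingFace as a lightweight stand-in for the open-weight LLMs used in the paper; the model choice, example statement, and three-way label set are assumptions.

```python
# Stand-in for the stance-annotation step: a zero-shot classifier scores a
# parliamentary statement against three stance labels. The paper's actual
# pipeline uses open-weight LLMs and a finer-grained annotation scheme.
from transformers import pipeline

classifier = pipeline("zero-shot-classification",
                      model="facebook/bart-large-mnli")
labels = ["supportive of migrants",
          "neutral toward migrants",
          "critical of migrants"]

statement = ("We must recognise the enormous contribution that migrants "
             "make to our National Health Service.")
result = classifier(statement, candidate_labels=labels)
print(result["labels"][0], round(result["scores"][0], 3))
```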
Social media are now a routine part of political campaigns all over the world. However, studies of the impact of campaigning on social platforms have thus far been limited to cross-sectional datasets from one election period, which are vulnerable to unobserved variable bias. Hence, empirical evidence on the effectiveness of political social media activity is thin. We address this deficit by analysing a novel panel dataset of political Twitter activity in the 2015 and 2017 elections in the United Kingdom. We find that Twitter-based campaigning does seem to help win votes, a finding which is consistent across a variety of different model specifications, including a first-difference regression. The impact of Twitter use is small in absolute terms, though comparable with that of campaign spending. Our data also support the idea that effects are mediated through other communication channels, hence challenging the relevance of engaging in an interactive fashion.
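A first-difference regression of this kind can be sketched in a few lines: differencing each candidate's 2015 and 2017 observations removes time-invariant candidate effects before regressing the change in vote share on the change in Twitter activity. The file and column names below are hypothetical placeholders for the paper's panel.

```python
# Hedged sketch of a first-difference specification; the panel file and its
# columns (candidate_id, year, vote_share, tweets, spending) are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

panel = pd.read_csv("candidates_panel.csv").sort_values(["candidate_id", "year"])
diffs = (panel.groupby("candidate_id")[["vote_share", "tweets", "spending"]]
              .diff()
              .dropna())  # one 2015 -> 2017 change per candidate

model = smf.ols("vote_share ~ tweets + spending", data=diffs).fit()
print(model.summary())  # coefficient on `tweets` = marginal effect of activity
```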
Recently developed information communication technologies, particularly the Internet, have affected how we, both as individuals and as a society, create, store, and recall information. The Internet also provides us with a great opportunity to study memory using large-scale transactional data, in a quantitative framework similar to the practice in statistical physics. In this project, we make use of online data by analysing viewership statistics of Wikipedia articles on aircraft crashes. We study the relation between recent events and past events, focusing in particular on memory-triggering patterns. We devise a quantitative model that explains the flow of viewership from a current event to past events based on similarity in time, geography, topic, and the hyperlink structure of Wikipedia articles. We show that, on average, the secondary flow of attention to past events generated by such remembering processes is larger than the primary attention flow to the current event. We are the first to report these cascading effects.
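One plausible toy form of such a flow model, stated purely for illustration: attention flowing from a current crash article to a past one scales with recency, geographic match, topical similarity, and hyperlink connectivity. The multiplicative form and the weights below are assumptions, not the paper's fitted model.

```python
# Toy viewership-flow score combining the four similarity channels the
# abstract names; functional form and weights are illustrative assumptions.
import math

def flow_score(dt_years: float, same_region: bool, topic_sim: float,
               hyperlinked: bool, tau: float = 5.0,
               w_geo: float = 1.5, w_link: float = 2.0) -> float:
    recency = math.exp(-dt_years / tau)    # nearer in time -> more flow
    geo = w_geo if same_region else 1.0    # same region boosts recall
    link = w_link if hyperlinked else 1.0  # direct hyperlink boosts recall
    return recency * geo * link * topic_sim

print(flow_score(dt_years=3, same_region=True, topic_sim=0.8, hyperlinked=True))
```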
The capabilities of natural language models trained on large-scale data have increased immensely over the past few years. Open source libraries such as HuggingFace have made these models easily available and accessible. While prior research has identified biases in large language models, this paper considers biases contained in the most popular versions of these models when applied `out-of-the-box' for downstream tasks. We focus on generative language models as they are well-suited for extracting biases inherited from training data. Specifically, we conduct an in-depth analysis of GPT-2, which is the most downloaded text generation model on HuggingFace, with over half a million downloads per month. We assess biases related to occupational associations for different protected categories by intersecting gender with religion, sexuality, ethnicity, political affiliation, and continental name origin. Using a template-based data collection pipeline, we collect 396K sentence completions made by GPT-2 and find: (i) The machine-predicted jobs are less diverse and more stereotypical for women than for men, especially for intersections; (ii) Intersectional interactions are highly relevant for occupational associations, which we quantify by fitting 262 logistic models; (iii) For most occupations, GPT-2 reflects the skewed gender and ethnicity distribution found in US Labor Bureau data, and even pulls the societally-skewed distribution towards gender parity in cases where its predictions deviate from real labor market observations. This raises the normative question of what language models should learn: whether they should reflect or correct for existing inequalities.
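The template-based probe is easy to reproduce in miniature with the HuggingFace transformers pipeline; the template and group terms below are a small illustrative subset, not the paper's full intersectional design.

```python
# Miniature version of the template-based sentence-completion probe, using
# GPT-2 via the `transformers` text-generation pipeline. The template and
# group terms are an illustrative subset of the paper's design.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
template = "The {} worked as a"

for group in ["woman", "man", "Muslim woman", "Black man"]:
    outputs = generator(template.format(group), max_new_tokens=8,
                        num_return_sequences=3, do_sample=True)
    for out in outputs:
        print(out["generated_text"])
```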
Companies claim to "democratise" artificial intelligence (AI) when, among other actions, they donate AI open source software (OSS) to non-profit foundations or release AI models, but what does this term mean and why do they do it? As the impact of AI on society and the economy grows, understanding the commercial incentives behind AI democratisation efforts is crucial for ensuring these efforts serve broader interests beyond commercial agendas. Towards this end, this study employs a mixed-methods approach to investigate commercial incentives for 43 AI OSS donations to the Linux Foundation. It makes contributions to both research and practice. It contributes a taxonomy of both individual and organisational social, economic, and technological incentives for AI democratisation. In particular, it highlights the role of democratising the governance and control rights of an OSS project (i.e., moving from single-company to open governance) as a structural enabler for downstream goals, such as attracting external contributors, reducing development costs, and influencing industry standards. Furthermore, OSS donations are often championed by individual developers within companies, highlighting the importance of bottom-up incentives for AI democratisation. The taxonomy provides a framework and toolkit for discerning incentives for other AI democratisation efforts, such as the release of AI models. The paper concludes with a discussion of future research directions.
This study investigates the non-monetary rewards associated with artificial intelligence (AI) skills in the U.S. labour market. Using a dataset of approximately ten million online job vacancies from 2018 to 2024, we identify AI roles (positions requiring at least one AI-related skill) and examine the extent to which these roles offer non-monetary benefits such as tuition assistance, paid leave, health and well-being perks, parental leave, workplace culture enhancements, and remote work options. While previous research has documented substantial wage premiums for AI-related roles due to growing demand and limited talent supply, our study asks whether this demand also translates into enhanced non-monetary compensation. We find that AI roles are significantly more likely to offer such perks, even after controlling for education requirements, industry, and occupation type. AI roles are twice as likely to offer parental leave and almost three times as likely to provide remote working options. Moreover, the highest-paying AI roles tend to bundle these benefits, suggesting a compound premium where salary increases coincide with expanded non-monetary rewards. AI roles offering parental leave or health benefits show salaries that are, on average, 12% to 20% higher than AI roles without these benefits. This pattern is particularly pronounced in years and occupations experiencing the highest AI-related demand, pointing to a demand-driven dynamic. Our findings underscore the strong pull of AI talent in the labour market and challenge narratives of technological displacement, highlighting instead how employers compete for scarce talent through both financial and non-financial incentives.
Recommender systems are among the most commonly deployed systems today. Systems design approaches to AI-powered recommender systems have done well to urge recommender system developers to follow more intentional data collection, curation, and management procedures. So too has the "human-in-the-loop" paradigm been widely adopted, primarily to address the issue of accountability. However, in this paper, we take the position that human oversight in recommender system design also entails novel risks that have yet to be fully described. These risks are "codetermined" by the information context in which such systems are often deployed. Furthermore, new knowledge of the shortcomings of "human-in-the-loop" practices in delivering meaningful oversight of other AI systems suggests that they may also be inadequate for achieving socially responsible recommendations. We review how the limitations of human oversight may increase the chances of a specific kind of failure: a "cascade" or "compound" failure. We then briefly explore how the unique dynamics of three common deployment contexts can make humans in the loop more likely to fail in their oversight duties. We conclude with two recommendations.
US voters shared large volumes of polarizing political news and information in the form of links to content from Russian, WikiLeaks, and junk news sources. Was this low-quality political information distributed evenly around the country, or concentrated in swing states and particular parts of the country? In this data memo we apply a tested dictionary of sources of political news and information shared over Twitter during a ten-day period around the 2016 Presidential Election. Using self-reported location information, we place a third of users by state and create a simple index for the distribution of polarizing content around the country. We find that (1) nationally, Twitter users got more misinformation, polarizing and conspiratorial content than professionally produced news; (2) users in some states, however, shared more polarizing political news and information than users in other states; and (3) average levels of misinformation were higher in swing states than in uncontested states, even when weighted for the relative size of the user population in each state. We conclude with some observations about the impact of strategically disseminated polarizing information on public life.
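A simple index of this kind can be illustrated as a per-state ratio of links to polarizing sources versus professionally produced news, tracked alongside each state's share of placed users for population-adjusted comparisons; all counts below are hypothetical.

```python
# Illustrative per-state polarization index: the ratio of links to
# polarizing/junk sources versus professionally produced news (1.0 = equal
# mix), plus a user-share weight for population-adjusted comparisons.
# All counts are hypothetical.
state_links = {  # state -> (polarizing links, professional links, users placed)
    "FL": (1200, 1100, 9000), "PA": (950, 900, 6500), "CA": (2100, 2600, 20000),
}
total_users = sum(users for _, _, users in state_links.values())

for state, (pol, pro, users) in sorted(state_links.items()):
    ratio = pol / pro                  # >1.0: more polarizing than news
    weight = users / total_users       # state's share of placed users
    print(f"{state}: index={ratio:.2f}, user share={weight:.2f}")
```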
Anecdotal evidence suggests an increasing number of people are turning to VPN services for privacy, anonymity, and free communication over the internet. Despite this, there is little research into what these services are actually being used for. We use DNS cache snooping to determine what domains people are accessing through VPNs. This technique is used to discover whether certain queries have been made against a particular DNS server. Some VPNs operate their own DNS servers, ensuring that any cached queries were made by users of the VPN. We explore three methods of DNS cache snooping and briefly discuss their strengths and limitations. Using the most reliable of these methods, we perform a DNS cache snooping scan against the DNS servers of several major VPN providers. With this we discover which domains are actually accessed through VPNs. We run this technique against popular domains, as well as those known to be censored in certain countries: China, Indonesia, Iran, and Turkey. Our work gives a glimpse into what users use VPNs for, and provides a technique for discovering the frequency with which domain records are accessed on a DNS server.
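The core non-recursive probe is short; a sketch with dnspython is below. Clearing the RD (recursion desired) flag asks the server to answer only from its cache, so a populated answer section implies some client of that resolver recently looked the name up. The resolver address is a placeholder, and a production scan would also need rate limiting and handling of servers that refuse non-recursive queries.

```python
# Non-recursive DNS cache snooping sketch using dnspython. The resolver IP
# is a placeholder for a VPN provider's DNS server.
import dns.flags
import dns.message
import dns.query

def is_cached(domain: str, resolver_ip: str) -> bool:
    query = dns.message.make_query(domain, "A")
    query.flags &= ~dns.flags.RD       # clear "recursion desired": cache only
    response = dns.query.udp(query, resolver_ip, timeout=2.0)
    return len(response.answer) > 0    # cached iff the answer section is filled

print(is_cached("example.com", "203.0.113.53"))  # placeholder resolver address
```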
Artificial intelligence (AI) can bring substantial benefits to society by helping to reduce costs, increase efficiency and enable new solutions to complex problems. Using Floridi's notion of how to design the 'infosphere' as a starting point, in this chapter I consider the question: what are the limits of design, i.e. what are the conceptual constraints on designing AI for social good? The main argument of this chapter is that while design is a useful conceptual tool to shape technologies and societies, collective efforts towards designing future societies are constrained by both internal and external factors. Internal constraints on design are discussed by evoking Hardin's thought experiment regarding 'the Tragedy of the Commons'. Further, Hayek's classical distinction between 'cosmos' and 'taxis' is used to demarcate external constraints on design. Finally, five design principles are presented which are aimed at helping policymakers manage the internal and external constraints on design. A successful approach to designing future societies needs to account for the emergent properties of complex systems by allowing space for serendipity and socio-technological coevolution.
Generative Artificial Intelligence (GenAI) is driving significant environmental impacts. The rapid development and deployment of increasingly larger algorithmic models capable of analysing vast amounts of data are contributing to rising carbon emissions, water withdrawal, and waste generation. Generative models often consume substantially more energy than traditional models, with major tech firms increasingly turning to nuclear power to sustain these systems, an approach that could have profound environmental consequences. This paper introduces seven data ecofeminist principles delineating a pathway for developing technological alternatives for eco-societal transformations within the AI research context. Rooted in data feminism and ecofeminist frameworks, which interrogate the historical and social construction of epistemologies underlying the hegemonic development of science and technology that disrupts communities and nature, these principles emphasise the integration of social and environmental justice within a critical AI agenda. The paper calls for an urgent reassessment of the GenAI innovation race, advocating for ecofeminist algorithmic and infrastructural projects that prioritise and respect life, the people, and the planet.