Weizenbaum Institute
As large language models (LLMs) transition from static tools to fully agentic systems, their potential for transforming social science research has become increasingly evident. This paper introduces a structured framework for understanding the diverse applications of LLM-based agents, ranging from simple data processors to complex, multi-agent systems capable of simulating emergent social dynamics. By mapping this developmental continuum across six levels, the paper clarifies the technical and methodological boundaries between different agentic architectures, providing a comprehensive overview of current capabilities and future potential. It highlights how lower-tier systems streamline conventional tasks like text classification and data annotation, while higher-tier systems enable novel forms of inquiry, including the study of group dynamics, norm formation, and large-scale social processes. However, these advancements also introduce significant challenges, including issues of reproducibility, ethical oversight, and the risk of emergent biases. The paper critically examines these concerns, emphasizing the need for robust validation protocols, interdisciplinary collaboration, and standardized evaluation metrics. It argues that while LLM-based agents hold transformative potential for the social sciences, realizing this promise will require careful, context-sensitive deployment and ongoing methodological refinement. The paper concludes with a call for future research that balances technical innovation with ethical responsibility, encouraging the development of agentic systems that not only replicate but also extend the frontiers of social science, offering new insights into the complexities of human behavior.
Following the widespread adoption of ChatGPT in early 2023, numerous studies reported that large language models (LLMs) can match or even surpass human performance in creative tasks. However, it remains unclear whether LLMs have become more creative over time, and how consistent their creative output is. In this study, we evaluated 14 widely used LLMs -- including GPT-4, Claude, Llama, Grok, Mistral, and DeepSeek -- across two validated creativity assessments: the Divergent Association Task (DAT) and the Alternative Uses Task (AUT). Contrary to expectations, we found no evidence of increased creative performance over the past 18-24 months, with GPT-4 performing worse than in previous studies. For the more widely used AUT, all models performed on average better than the average human, with GPT-4o and o3-mini performing best. However, only 0.28% of LLM-generated responses reached the top 10% of human creativity benchmarks. Beyond inter-model differences, we document substantial intra-model variability: the same LLM, given the same prompt, can produce outputs ranging from below-average to original. This variability has important implications for both creativity research and practical applications. Ignoring such variability risks misjudging the creative potential of LLMs, either inflating or underestimating their capabilities. The choice of prompts affected LLMs differently. Our findings underscore the need for more nuanced evaluation frameworks and highlight the importance of model selection, prompt design, and repeated assessment when using Generative AI (GenAI) tools in creative contexts.
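The DAT mentioned above scores divergent thinking as the mean pairwise semantic distance between unrelated words that a participant (or model) names. A minimal sketch of that scoring idea, using tiny hand-made vectors purely for illustration (real DAT scoring uses high-dimensional word embeddings such as GloVe):

```python
from itertools import combinations
from math import sqrt

def cosine_distance(u, v):
    # 1 - cosine similarity between two equal-length vectors
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def dat_score(words, embed):
    # Mean pairwise semantic distance, scaled by 100 as in the DAT
    dists = [cosine_distance(embed[a], embed[b])
             for a, b in combinations(words, 2)]
    return 100.0 * sum(dists) / len(dists)

# Toy 3-d embeddings, an illustrative assumption only.
toy_embeddings = {
    "cat":   (1.0, 0.1, 0.0),
    "dog":   (0.9, 0.2, 0.1),
    "piano": (0.0, 1.0, 0.3),
}
score = dat_score(["cat", "dog", "piano"], toy_embeddings)
```

Semantically related words ("cat", "dog") contribute small distances, unrelated ones ("cat", "piano") large distances, so naming more dissimilar words yields a higher score.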
Researchers from Potsdam University, Weizenbaum Institute, Technical University Berlin, and Zuse Institute Berlin developed FACET, a multi-agent LLM system that creates personalized educational worksheets by considering both student cognitive proficiency and intrinsic motivation. The framework produced worksheets rated highly for suitability and clarity by both automated evaluator agents (M=4.78, M=4.76) and expert teachers (M=5.05, M=4.68), demonstrating its potential to support differentiated instruction.
Human-AI co-creativity represents a transformative shift in how humans and generative AI tools collaborate in creative processes. This chapter explores the synergies between human ingenuity and AI capabilities across four levels of interaction: Digital Pen, AI Task Specialist, AI Assistant, and AI Co-Creator. While earlier digital tools primarily facilitated creativity, generative AI systems now contribute actively, demonstrating autonomous creativity in producing novel and valuable outcomes. Empirical evidence from mathematics showcases how AI can extend human creative potential, from computational problem-solving to co-creative partnerships yielding breakthroughs in longstanding challenges. By analyzing these collaborations, the chapter highlights AI's potential to enhance human creativity without replacing it, underscoring the importance of balancing AI's contributions with human oversight and contextual understanding. This integration pushes the boundaries of creative achievements, emphasizing the need for human-centered AI systems that foster collaboration while preserving the unique qualities of human creativity.
A widespread view is that Artificial Intelligence cannot be creative. We tested this assumption by comparing human-generated ideas with those generated by six Generative Artificial Intelligence (GAI) chatbots: alpa.ai, Copy.ai, ChatGPT (versions 3 and 4), Studio.ai, and YouChat. Humans and a specifically trained AI independently assessed the quality and quantity of ideas. We found no qualitative difference between AI and human-generated creativity, although there are differences in how ideas are generated. Interestingly, 9.4 percent of humans were more creative than the most creative GAI, GPT-4. Our findings suggest that GAIs are valuable assistants in the creative process. Continued research and development of GAI in creative tasks is crucial to fully understand this technology's potential benefits and drawbacks in shaping the future of creativity. Finally, we discuss the question of whether GAIs are capable of being truly creative.
The integration of generative AI (GenAI) tools, particularly large language models (LLMs), is transforming professional coaching workflows. This study explores how coaches use GenAI, the perceived benefits and limitations of these tools, and broader attitudes toward AI-assisted coaching. A survey of 205 coaching professionals reveals widespread adoption of GenAI for research, content creation, and administrative support, while its role in relational and interpretative coaching remains limited. Findings indicate that AI literacy and perceived AI impact strongly predict GenAI adoption, with positive attitudes fostering greater use. Ethical considerations, particularly transparency and data privacy, are a key concern, with frequent AI users demonstrating greater ethical awareness. Regression analyses show that while perceived effectiveness drives GenAI adoption, concerns about AI replacing human coaches do not significantly influence usage. Coaches express interest in future AI capabilities that enhance personalization, real-time feedback, and administrative automation while maintaining human oversight. The study highlights that GenAI functions best as an augmentation tool rather than a replacement, emphasizing the need for AI literacy training, ethical guidelines, and human-centered AI integration. These findings contribute to the ongoing discourse on human-AI collaboration, advocating for responsible and effective AI adoption in professional coaching.
This position paper situates AR beauty filters within the broader debate on Body Politics in HCI. We argue that these filters are not neutral tools but technologies of governance that reinforce racialized, gendered, and ableist beauty standards. Through naming conventions, algorithmic bias, and platform governance, they impose aesthetic norms while concealing their influence. To address these challenges, we advocate for transparency-driven interventions and a critical rethinking of algorithmic aesthetics and digital embodiment.
The rapid advancement of Generative AI (GenAI) relies heavily on the digital commons, a vast collection of free and open online content that is created, shared, and maintained by communities. However, this relationship is becoming increasingly strained due to financial burdens, decreased contributions, and misalignment between AI models and community norms. As we move deeper into the GenAI era, it is essential to examine the interdependent relationship between GenAI, the long-term sustainability of the digital commons, and the equity of current AI development practices. We highlight five critical questions that require urgent attention: 1. How can we prevent the digital commons from being threatened by undersupply as individuals cease contributing to the commons and turn to Generative AI for information? 2. How can we mitigate the risk of the open web closing due to restrictions on access to curb AI crawlers? 3. How can technical standards and legal frameworks be updated to reflect the evolving needs of organizations hosting common content? 4. What are the effects of increased synthetic content in open knowledge databases, and how can we ensure their integrity? 5. How can we account for and distribute the infrastructural and environmental costs of providing data for AI training? We emphasize the need for more responsible practices in AI development, recognizing the digital commons not only as content but as a collaborative and decentralized form of knowledge governance, which relies on the practice of "commoning" - making, maintaining, and protecting shared and open resources. Ultimately, our goal is to stimulate discussion and research on the intersection of Generative AI and the digital commons, with the aim of developing an "AI commons" and public infrastructures for AI development that support the long-term health of the digital commons.
Large Language Models (LLMs) are often trained with safety guards intended to prevent harmful text generation. However, such safety training can be removed by fine-tuning the LLM on harmful datasets. While this emerging threat (harmful fine-tuning attacks) has been characterized by previous work, there is little understanding of how we should proceed in constructing and validating defenses against these attacks, especially when defenders have no control over the fine-tuning process. We introduce a formal framework based on the training budget of an attacker, which we call "Immunization" conditions. Using a formal characterization of the harmful fine-tuning problem, we provide a thorough description of what a successful defense must comprise and establish a set of guidelines for how rigorous, confidence-building defense research should proceed.
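The budget-based "Immunization" idea can be illustrated with a toy check: a defense counts as holding if no attack run within the attacker's training budget pushes measured harmfulness past an acceptable threshold. All names and numbers below are illustrative assumptions, not the paper's actual formalism:

```python
def is_immunized(harmfulness_after_steps, budget, threshold):
    """Hypothetical immunization check: the defense holds if, for every
    attack run within the attacker's training budget, measured
    harmfulness stays below the threshold.

    harmfulness_after_steps maps fine-tuning step counts to harmfulness
    scores of the strongest known attack (assumed scale: 0-1, higher
    means more harmful).
    """
    return all(score < threshold
               for steps, score in harmfulness_after_steps.items()
               if steps <= budget)

# Illustrative numbers only, not measurements from the paper.
attack_curve = {100: 0.05, 500: 0.12, 1000: 0.35}
defended = is_immunized(attack_curve, budget=500, threshold=0.2)   # True
broken = is_immunized(attack_curve, budget=1000, threshold=0.2)    # False
```

The point of the framing is that a defense is always evaluated relative to an attacker budget: the same defense can hold against a 500-step attacker and fail against a 1000-step one.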
Large language models are deep learning models with a large number of parameters. They have made noticeable progress on a wide range of tasks, which allows them to serve as valuable and versatile tools for a diverse range of applications. Their capabilities also offer opportunities for business process management (BPM); however, these opportunities have not yet been systematically investigated. In this paper, we address this research gap by foregrounding various management tasks of the BPM lifecycle. We investigate six research directions, highlighting problems that need to be addressed when using large language models, including usage guidelines for practitioners.
The rise of large language models (LLMs) has a significant impact on information warfare. By facilitating the production of content related to disinformation and propaganda campaigns, LLMs can amplify different types of information operations and mislead online users. In our study, we empirically investigate how LLM-powered chatbots, developed by Google, Microsoft, and Perplexity, handle disinformation about Russia's war in Ukraine and whether the chatbots' ability to provide accurate information on the topic varies across languages and over time. Our findings indicate that while for some chatbots (Perplexity), there is a significant improvement in performance over time in several languages, for others (Gemini), the performance improves only in English but deteriorates in low-resource languages.
Organizations need to manage numerous business processes for delivering their services and products to customers. One important consideration lies in adherence to regulations such as laws, guidelines, or industry standards. In order to monitor the adherence of their business processes to regulations -- in other words, their regulatory compliance -- organizations make use of various techniques that draw on process execution data of the IT systems that support these processes. Previous research has investigated conformance checking, an operation of process mining, regarding the domains in which it is applied, its operationalization of regulations, the techniques being used, and the presentation of the results produced. However, other techniques for regulatory compliance monitoring, which we summarize as compliance checking techniques, have not yet been investigated regarding these aspects in a structured manner. To this end, this work presents a systematic literature review on the use of regulatory compliance monitoring of business processes, thereby offering insights into the various techniques being used, their application, and the results they generate. We highlight commonalities and differences between the approaches and find that various steps are performed manually; we also provide further impulses for research on compliance monitoring and its use in practice.
The opacity of machine learning data is a significant threat to ethical data work and intelligible systems. Previous research has addressed this issue by proposing standardized checklists to document datasets. This paper expands that field of inquiry by proposing a shift of perspective: from documenting datasets toward documenting data production. We draw on participatory design and collaborate with data workers at two companies located in Bulgaria and Argentina, where the collection and annotation of data for machine learning are outsourced. Our investigation comprises 2.5 years of research, including 33 semi-structured interviews, five co-design workshops, the development of prototypes, and several feedback instances with participants. We identify key challenges and requirements related to the integration of documentation practices in real-world data production scenarios. Our findings comprise important design considerations and highlight the value of designing data documentation based on the needs of data workers. We argue that a view of documentation as a boundary object, i.e., an object that can be used differently across organizations and teams but holds enough immutable content to maintain integrity, can be useful when designing documentation to retrieve heterogeneous, often distributed, contexts of data production.
Object-centric event logs expand the conventional single-case notion of event logs by considering multiple objects, allowing for the analysis of more complex and realistic process behavior. However, the number of real-world object-centric event logs remains limited, and further studies are needed to test their usefulness. The increasing availability of data from team sports can facilitate object-centric process mining, leveraging both real-world data and suitable use cases. In this paper, we present a framework for transforming football (soccer) data into an object-centric event log, further enhanced with a spatial dimension. We demonstrate the effectiveness of our framework by generating object-centric event logs based on real-world football data and discuss the results for varying process representations. With our paper, we provide the first example of object-centric event logs in football analytics. Future work should consider variant analysis and filtering techniques to better handle variability.
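An object-centric event, as described above, relates a single activity to several objects (players, the ball, the match) rather than to one case, and here additionally carries pitch coordinates as a spatial dimension. A minimal sketch with hypothetical field names; the paper's actual schema is not reproduced:

```python
from dataclasses import dataclass, field

@dataclass
class ObjectCentricEvent:
    """One event in an object-centric log: a single activity can
    relate to multiple objects of different types, plus an (x, y)
    pitch position as the spatial dimension."""
    event_id: str
    activity: str
    timestamp: str
    objects: dict = field(default_factory=dict)  # object type -> list of ids
    position: tuple = (0.0, 0.0)                 # (x, y) on the pitch

# Hypothetical raw football event; field names are illustrative only.
raw = {"id": "e1", "type": "pass", "time": "2023-05-01T15:03:21",
       "from_player": "p7", "to_player": "p9", "match": "m1",
       "x": 34.2, "y": 51.8}

event = ObjectCentricEvent(
    event_id=raw["id"],
    activity=raw["type"],
    timestamp=raw["time"],
    objects={"player": [raw["from_player"], raw["to_player"]],
             "ball": ["ball1"], "match": [raw["match"]]},
    position=(raw["x"], raw["y"]),
)
```

The key difference from a classical log is visible in the `objects` field: one pass event belongs simultaneously to two player objects, the ball, and the match, so the same event appears in several object-centric traces.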
Large language models (LLMs) have become increasingly embedded in organizational workflows. This has raised concerns over their energy consumption, financial costs, and data sovereignty. While performance benchmarks often celebrate cutting-edge models, real-world deployment decisions require a broader perspective: when is a smaller, locally deployable model "good enough"? This study offers an empirical answer by evaluating eleven proprietary and open-weight LLMs across ten everyday occupational tasks, including summarizing texts, generating schedules, and drafting emails and proposals. Using a dual-LLM-based evaluation framework, we automated task execution and standardized evaluation across ten criteria related to output quality, factual accuracy, and ethical responsibility. Results show that GPT-4o delivers consistently superior performance but at a significantly higher cost and environmental footprint. Notably, smaller models like Gemma-3 and Phi-4 achieved strong and reliable results on most tasks, suggesting their viability in contexts requiring cost-efficiency, local deployment, or privacy. A cluster analysis revealed three model groups -- premium all-rounders, competent generalists, and limited but safe performers -- highlighting trade-offs between quality, control, and sustainability. Significantly, task type influenced model effectiveness: conceptual tasks challenged most models, while aggregation and transformation tasks yielded better performances. We argue for a shift from performance-maximizing benchmarks to task- and context-aware sufficiency assessments that better reflect organizational priorities. Our approach contributes a scalable method to evaluate AI models through a sustainability lens and offers actionable guidance for responsible LLM deployment in practice.
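The dual-LLM evaluation described above scores each output against ten criteria. One plausible way to aggregate such ratings is sketched below, with made-up criterion names and scores on an assumed 1-5 scale; the paper's exact criteria and aggregation may differ:

```python
def aggregate_scores(ratings):
    """ratings: criterion -> (score from evaluator 1, score from
    evaluator 2), each on an assumed 1-5 scale. Averages the two
    evaluators per criterion, then averages across criteria."""
    per_criterion = {c: (a + b) / 2 for c, (a, b) in ratings.items()}
    overall = sum(per_criterion.values()) / len(per_criterion)
    return per_criterion, overall

# Three of the ten criteria, with illustrative scores.
ratings = {"output_quality": (4, 5),
           "factual_accuracy": (5, 5),
           "ethical_responsibility": (3, 4)}
per_criterion, overall = aggregate_scores(ratings)
```

Averaging two independent evaluator agents per criterion is one simple way to damp single-judge noise before comparing models on cost and footprint.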
Privacy-focused cryptocurrencies like Monero remain popular, despite increasing regulatory scrutiny that has led to their delisting from major centralized exchanges. The latter also explains the recent popularity of decentralized exchanges (DEXs) with no centralized ownership structures. These platforms typically leverage peer-to-peer (P2P) networks, promising secure and anonymous asset trading. However, questions of liability remain, and the academic literature lacks comprehensive insights into the functionality, trading activity, and privacy claims of these P2P platforms. In this paper, we provide an early systematization of the current landscape of decentralized peer-to-peer exchanges within the Monero ecosystem. We examine several recently developed DEX platforms, analyzing their popularity, functionality, architectural choices, and potential weaknesses. We further identify and report on a privacy vulnerability in the recently popularized Haveno exchange, demonstrating that certain Haveno trades could be detected, allowing transactions to be linked across the Monero and Bitcoin blockchains. We hope that our findings can nourish the discussion in the research community about more secure designs, and provide insights for regulators.
Large language models (LLMs) are increasingly central to many applications, raising concerns about bias, fairness, and regulatory compliance. This paper reviews risks of biased outputs and their societal impact, focusing on frameworks like the EU's AI Act and the Digital Services Act. We argue that beyond constant regulation, stronger attention to competition and design governance is needed to ensure fair, trustworthy AI. This is a preprint of the Communications of the ACM article of the same title.
This paper introduces S-DAT (Synthetic-Divergent Association Task), a scalable, multilingual framework for automated assessment of divergent thinking (DT), a core component of human creativity. Traditional creativity assessments are often labor-intensive, language-specific, and reliant on subjective human ratings, limiting their scalability and cross-cultural applicability. In contrast, S-DAT leverages large language models and advanced multilingual embeddings to compute semantic distance -- a language-agnostic proxy for DT. We evaluate S-DAT across eleven diverse languages, including English, Spanish, German, Russian, Hindi, and Japanese (Kanji, Hiragana, Katakana), demonstrating robust and consistent scoring across linguistic contexts. Unlike prior DAT approaches, S-DAT shows convergent validity with other DT measures and correct discriminant validity with convergent thinking. This cross-linguistic flexibility allows for more inclusive, global-scale creativity research, addressing key limitations of earlier approaches. S-DAT provides a powerful tool for fairer, more comprehensive evaluation of cognitive flexibility in diverse populations and can be freely accessed online: this https URL
Large pre-trained language models are successfully being used in a variety of tasks, across many languages. With this ever-increasing usage, the risk of harmful side effects also rises, for example by reproducing and reinforcing stereotypes. However, detecting and mitigating these harms is difficult to do in general and becomes computationally expensive when tackling multiple languages or when considering different biases. To address this, we present FairDistillation: a cross-lingual method based on knowledge distillation to construct smaller language models while controlling for specific biases. We found that our distillation method does not negatively affect the downstream performance on most tasks and successfully mitigates stereotyping and representational harms. We demonstrate that FairDistillation can create fairer language models at a considerably lower cost than alternative approaches.
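Knowledge distillation, on which FairDistillation builds, trains a smaller student model to match a teacher's temperature-softened output distribution. A minimal pure-Python sketch of the standard distillation objective; FairDistillation's bias-control step is only noted in a comment, not implemented:

```python
from math import exp, log

def softmax(logits, temperature=1.0):
    # Temperature-softened probability distribution over logits
    exps = [exp(x / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between the softened teacher and student
    distributions -- the standard distillation objective. FairDistillation
    additionally intervenes on the teacher's predictions to control for
    specific biases before the student matches them; that filtering step
    is not shown here."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q))

# Loss is zero when the student already matches the teacher,
# positive otherwise.
loss_same = distillation_loss([2.0, 0.5, -1.0], [2.0, 0.5, -1.0])
loss_diff = distillation_loss([0.0, 2.0, -1.0], [2.0, 0.5, -1.0])
```

Because the student only ever sees the (possibly bias-corrected) teacher distribution, controlling what the teacher emits is the natural lever for producing a fairer, smaller model at low cost.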
To promote sustainable business practices, and to achieve climate neutrality by 2050, the EU has developed the taxonomy of sustainable activities, which describes when exactly business practices can be considered sustainable. While the taxonomy has only been recently established, progressively more companies will have to report how much of their revenue was created via sustainably executed business processes. To help companies prepare to assess whether their business processes comply with the constraints outlined in the taxonomy, we investigate to what extent these criteria can be used for conformance checking, that is, assessing in a data-driven manner whether business process executions adhere to regulatory constraints. For this, we develop a few-shot learning pipeline that uses an LLM to characterize the constraints of the taxonomy with respect to the process dimensions they relate to. We find that many constraints of the taxonomy are usable for conformance checking, particularly in the sectors of energy, manufacturing, and transport. This will aid companies in preparing to monitor regulatory compliance with the taxonomy automatically, by characterizing what kind of information they need to extract, and by providing a better understanding of the sectors where such an assessment is feasible and where it is not.
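The pipeline described above asks an LLM to characterize each taxonomy constraint by the process dimensions it relates to. A keyword-based stand-in for that classification step is sketched below; the dimension names and keywords are illustrative assumptions, not the paper's actual prompt, labels, or method:

```python
# Hypothetical stand-in for the few-shot LLM step: map a taxonomy
# constraint to the process dimensions it concerns. The real pipeline
# prompts an LLM; keywords here only mimic that behavior for the demo.
DIMENSION_KEYWORDS = {
    "time":     ["deadline", "duration", "within"],
    "resource": ["operator", "personnel", "equipment"],
    "data":     ["emissions", "threshold", "measured"],
}

def characterize_constraint(text):
    """Return the (sorted) process dimensions whose keywords appear
    in the constraint text."""
    text = text.lower()
    return sorted(dim for dim, kws in DIMENSION_KEYWORDS.items()
                  if any(kw in text for kw in kws))

dims = characterize_constraint(
    "Direct emissions must stay below the measured threshold "
    "for equipment used during production.")
```

A constraint tagged with, say, the data dimension tells a company which attributes its event logs must record before conformance checking against that criterion becomes feasible.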