Researchers from MIT Media Lab and others used EEG, NLP, and behavioral analysis to show that using Large Language Models for essay writing weakens brain network connectivity (up to 55% weaker than in self-reliant writing), impairs memory, and diminishes perceived essay ownership, indicating an accumulation of "cognitive debt."
The BigCode project releases StarCoder 2 models and The Stack v2 dataset, setting a new standard for open and ethically sourced Code LLM development. StarCoder 2 models, particularly the 15B variant, demonstrate competitive performance across code generation, completion, and reasoning tasks, often outperforming larger closed-source alternatives by prioritizing data quality and efficient architecture over sheer data quantity.
Researchers from Northeastern University and collaborators introduced `CANITEDIT`, a hand-crafted benchmark for evaluating large language models on instructional code editing, alongside `ExcessCode`, a new metric for edit precision. Their fine-tuned `EDITCODER-33b` model achieved an overall 10.7% absolute increase in `pass@1` over its base model, surpassing GPT-3.5-Turbo for descriptive instructions and matching it for lazy instructions.
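Neither `ExcessCode` nor the paper's exact harness is reproduced here, but `pass@1` numbers like these conventionally follow the unbiased pass@k estimator of Chen et al. (2021), pass@k = 1 − C(n−c, k)/C(n, k); a minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n samples drawn per problem,
    c of which pass all tests (Chen et al., 2021)."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# pass@1 reduces to the empirical pass rate c / n:
assert abs(pass_at_k(20, 5, 1) - 0.25) < 1e-12
```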
The BigCode project introduced SantaCoder, a 1.1B parameter open-source code language model, developed with a focus on responsible AI through a novel PII redaction pipeline and empirical data filtering studies. The model, trained on Python, Java, and JavaScript, notably demonstrates that filtering training data by GitHub stars degrades performance and achieves superior multilingual code generation and infilling capabilities compared to larger existing open-source models.
We present measurements of stellar population properties of a newly discovered, spectroscopically confirmed $z=11.10^{+0.11}_{-0.26}$, gravitationally lensed galaxy, using JWST NIRSpec PRISM spectroscopy and NIRCam imaging. The arc is highly magnified by the Bullet Cluster (magnification factor $\mu=14.0^{+6.2}_{-0.3}$). It contains three star-forming components, of which one is barely resolved and two are unresolved, giving intrinsic sizes of $\lesssim 10$ pc. The clumps also contain ~50% of the total stellar mass. The galaxy formed the majority of its stars ~150 Myr ago (by z~14). The spectrum shows a pronounced damping wing, typical for galaxies deep in the reionisation era and indicating a neutral IGM along this line of sight. The intrinsic luminosity of the galaxy is $0.086^{+0.008}_{-0.030}\,L^*$ (with $L^*$ being the characteristic luminosity at this redshift), making it the lowest-luminosity spectroscopically confirmed galaxy at $z>10$ discovered to date.
MultiPL-E introduces a scalable, compiler-based system for translating unit test-driven code generation benchmarks into 18 diverse programming languages, creating the first massively multilingual, parallel benchmark. An evaluation of state-of-the-art models revealed significant variation in multi-language performance, with Codex notably matching or exceeding its Python performance in languages like JavaScript, and found no strong correlation between perplexity and code correctness.
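As a rough illustration of MultiPL-E's compiler-based approach (the function names and the tiny translation table below are hypothetical, not the project's actual API), here is a sketch of turning a Python-style unit test into a Lua assertion:

```python
def to_lua_literal(value) -> str:
    """Render a Python value as a Lua literal (tiny illustrative subset)."""
    if isinstance(value, bool):          # check bool before int
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return repr(value)
    if isinstance(value, str):
        return '"' + value.replace('"', '\\"') + '"'
    if isinstance(value, list):
        # Note: a real translator needs a deep-equality helper to compare
        # Lua tables; `==` on tables checks identity, not contents.
        return "{" + ", ".join(to_lua_literal(v) for v in value) + "}"
    raise TypeError(f"unsupported value: {value!r}")

def lua_assert(func: str, args: list, expected) -> str:
    call = f"{func}({', '.join(to_lua_literal(a) for a in args)})"
    return f"assert({call} == {to_lua_literal(expected)})"

print(lua_assert("add", [2, 3], 5))  # assert(add(2, 3) == 5)
```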
The ability to build and reason about models of the world is essential for situated language understanding. But evaluating world modeling capabilities in modern AI systems -- especially those based on language models -- has proven challenging, in large part because of the difficulty of disentangling conceptual knowledge about the world from knowledge of surface co-occurrence statistics. This paper presents Elements of World Knowledge (EWoK), a framework for evaluating language models' understanding of the conceptual knowledge underlying world modeling. EWoK targets specific concepts from multiple knowledge domains known to be important for world modeling in humans, from social interactions (help, deceive) to spatial relations (left, right). Objects, agents, and locations in the items can be flexibly filled in, enabling easy generation of multiple controlled datasets. We then introduce EWoK-core-1.0, a dataset of 4,374 items covering 11 world knowledge domains. We evaluate 20 open-weights large language models (1.3B--70B parameters) and compare them with human performance. All tested models perform worse than humans, with results varying drastically across domains. Performance on social interactions and social properties was highest and performance on physical relations and spatial relations was lowest. Overall, this dataset highlights simple cases where even large models struggle and presents rich avenues for targeted research on LLM world modeling capabilities.
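A hypothetical illustration of the "flexible slots" idea (the real EWoK items, domains, and evaluation protocol are defined by the framework itself): one spatial-relations template yields controlled minimal pairs whose plausible variant a world-modeling LM should prefer.

```python
from itertools import permutations

# One spatial-relations template; slots {a} and {b} are fillable.
TEMPLATE = ("{a} is to the left of {b}. "
            "So {b} is to the {relation} of {a}.")

def minimal_pair(a: str, b: str) -> tuple[str, str]:
    plausible = TEMPLATE.format(a=a, b=b, relation="right")
    implausible = TEMPLATE.format(a=a, b=b, relation="left")
    return plausible, implausible

for a, b in permutations(["the cup", "the lamp"], 2):
    good, bad = minimal_pair(a, b)
    # A model that understands spatial relations should assign the
    # plausible continuation higher probability than the implausible one.
    print(f"PLAUSIBLE:   {good}\nIMPLAUSIBLE: {bad}\n")
```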
Existing benchmarks for frontier models often test specialized, "PhD-level" knowledge that is difficult for non-experts to grasp. In contrast, we present a benchmark with 613 problems based on the NPR Sunday Puzzle Challenge that requires only general knowledge. Our benchmark is challenging for both humans and models; however, correct solutions are easy to verify, and models' mistakes are easy to spot. As LLMs are more widely deployed in society, we believe it is useful to develop benchmarks for frontier models that humans can understand without the need for deep domain expertise. Our work reveals capability gaps that are not evident in existing benchmarks: OpenAI o1 significantly outperforms other reasoning models on our benchmark, despite being on par with other models when tested on benchmarks that test specialized knowledge. Furthermore, our analysis of reasoning outputs uncovers new kinds of failures. DeepSeek R1, for instance, often concedes with "I give up" before providing an answer that it knows is wrong. R1 can also be remarkably "uncertain" in its output and in rare cases, it does not "finish thinking," which suggests the need for techniques to "wrap up" before the context window limit is reached. We also quantify the effectiveness of reasoning longer to identify the point beyond which more reasoning is unlikely to improve accuracy on our benchmark.
Researchers from Northeastern University and Microsoft Research developed MultiPL-T, a method for generating high-quality semi-synthetic training data for Code Large Language Models (LLMs) in low-resource programming languages. Fine-tuning models with this test-validated data significantly boosted their performance on benchmarks, often more than doubling `pass@1` scores for languages like OCaml and Racket, and demonstrated weak-to-strong supervision.
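The heart of MultiPL-T's test validation can be sketched as a filter: keep a candidate translation only if it executes cleanly against its tests. A minimal sketch (the function name and file handling are illustrative; the real pipeline is more involved, with per-language runners and sandboxing):

```python
import pathlib
import subprocess
import tempfile

def keep_if_tests_pass(candidate_src: str, runner_cmd: list[str],
                       timeout: float = 10.0) -> bool:
    """Run a candidate translation (source + embedded tests) and keep it
    only if the process exits cleanly within the timeout."""
    with tempfile.TemporaryDirectory() as tmp:
        path = pathlib.Path(tmp) / "candidate"  # real pipelines name files per language
        path.write_text(candidate_src)
        try:
            result = subprocess.run(runner_cmd + [str(path)],
                                    capture_output=True, timeout=timeout)
        except subprocess.TimeoutExpired:
            return False
        return result.returncode == 0

# e.g. keep_if_tests_pass(racket_program_with_tests, ["racket"])
```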
WebUI introduces a large-scale, automatically collected dataset of web UIs with rich semantic and stylistic metadata. This dataset facilitates improved computational modeling of user interfaces, demonstrating enhanced performance for mobile UI element detection, screen classification, and screen similarity through transfer learning.
We present the first data release of the CAnadian NIRISS Unbiased Cluster Survey (CANUCS), a JWST Cycle 1 GTO program targeting 5 lensing clusters and flanking fields in parallel (Abell 370, MACS0416, MACS0417, MACS1149, MACS1423; survey area ~100 arcmin$^2$), with NIRCam imaging, NIRISS slitless spectroscopy, and NIRSpec prism multi-object spectroscopy. Fields centered on cluster cores include imaging in 8 bands from 0.9-4.4 $\mu$m, alongside continuous NIRISS coverage from 1.15-2 $\mu$m, while the NIRCam flanking fields provide 5 wide and 9 medium band filters for exceptional spectral sampling, all to ~29 mag$_{\rm AB}$. We also present JWST in Technicolor, a Cycle 2 follow-up GO program targeting 3 CANUCS clusters (Abell 370, MACS0416, MACS1149). The Technicolor program adds NIRISS slitless spectroscopy in F090W to the cluster fields while adding 8 wide, medium, and narrow band filters to the flanking fields. This provides NIRCam imaging in all wide and medium band filters over ~30 arcmin$^2$. This paper describes our data reduction and photometry methodology. We release NIRCam, NIRISS, and HST imaging, PSFs, PSF-matched imaging, photometric catalogs, and photometric and spectroscopic redshifts. We provide lens models and stellar population parameters in up to 19 filters for ~53,000 galaxies in the cluster fields, and ~44,000 galaxies in up to 29 filters in the flanking fields. We further present 733 NIRSpec spectra and redshift measurements up to $z=10.8$. Comparing against our photometric redshifts, we find catastrophic outlier rates of only 4-7% and scatter $\sigma_{\rm NMAD}$ of 0.01-0.03.
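For reference, $\sigma_{\rm NMAD}$ is conventionally the normalized median absolute deviation of photo-z residuals, and "catastrophic outliers" are commonly defined by $|\Delta z|/(1+z_{\rm spec}) > 0.15$; a minimal sketch under those common conventions (CANUCS's exact definitions may differ in detail):

```python
import numpy as np

def sigma_nmad(z_phot: np.ndarray, z_spec: np.ndarray) -> float:
    """sigma_NMAD = 1.48 * median(|dz - median(dz)|),
    with dz = (z_phot - z_spec) / (1 + z_spec)."""
    dz = (z_phot - z_spec) / (1.0 + z_spec)
    return 1.48 * float(np.median(np.abs(dz - np.median(dz))))

def outlier_fraction(z_phot: np.ndarray, z_spec: np.ndarray,
                     thresh: float = 0.15) -> float:
    """Fraction of sources with |dz| above the catastrophic threshold."""
    dz = np.abs(z_phot - z_spec) / (1.0 + z_spec)
    return float(np.mean(dz > thresh))
```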
For centuries, astronomers have discussed the possibility of inhabited worlds - from Herschel's 18th-century observations suggesting Mars may host life, to the systematic search for technosignatures that began in the 1960s using radio telescopes. Searching for artifacts in the solar system has received relatively little formal scientific interest and has faced significant technical and social challenges. Automated surveys and new observational techniques developed over the past decade now enable astronomers to survey parts of the sky for anomalous objects. We briefly describe four methods for detecting extraterrestrial artifacts and probes within the Solar System and then focus on demonstrating one of these. The first makes use of pre-Sputnik images to search for flashes from glinting objects. The second method makes use of space-borne telescopes to search for artificial objects. A third approach involves examining the reflectance spectra of objects in Earth orbit, in search of the characteristic reddening that may imply long-term exposure of metallic surfaces to space weathering. We focus here on a fourth approach, which involves using Earth's shadow as a filter when searching for optically luminous objects in near-Earth space. We demonstrate a proof-of-concept of this method by conducting two searches for transients in images acquired by the Zwicky Transient Facility (ZTF), which has generated many repeated 30-second exposures of the same fields. In this way, we identified previously uncatalogued events at short angular separations from the center of the shadow, motivating more extensive searches using this technique. We conclude that the Earth's shadow presents a new and exciting search domain for near-Earth SETI.
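The shadow-filter geometry can be sketched with a simple umbral cone: the shadow's angular radius shrinks with geocentric range, reaching zero at the cone's tip (~1.4 million km). A rough sketch, assuming simple cone geometry and ignoring the penumbra, refraction, and observer parallax (not the paper's actual pipeline):

```python
import math

R_EARTH = 6371.0      # km
R_SUN = 695_700.0     # km
AU = 1.496e8          # km

def umbra_angular_radius_deg(geocentric_range_km: float) -> float:
    """Approximate angular radius (degrees, seen from Earth's center)
    of Earth's umbral shadow at a given geocentric range."""
    cone_length = AU * R_EARTH / (R_SUN - R_EARTH)   # ~1.4e6 km
    shadow_radius = R_EARTH * (1.0 - geocentric_range_km / cone_length)
    if shadow_radius <= 0.0:
        return 0.0  # the range lies beyond the umbra's tip
    return math.degrees(math.atan2(shadow_radius, geocentric_range_km))

# An object at geostationary distance (~42,164 km geocentric):
print(f"{umbra_angular_radius_deg(42_164.0):.1f} deg")  # ~8.3 deg
```

Detections of optically luminous transients at angular separations inside this radius cannot be sunlit artificial objects, which is what makes the shadow a useful filter.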
Large language models (LLMs) already excel at writing code in high-resource languages such as Python and JavaScript, yet stumble on low-resource languages that remain essential to science and engineering. Besides the obvious shortage of pre-training data, post-training itself is a bottleneck: every new language seems to require new datasets, test harnesses, and reinforcement-learning (RL) infrastructure. We introduce Agnostics, a language-agnostic post-training pipeline that eliminates this per-language engineering. The key idea is to judge code solely by its externally observable behavior, so a single verifier can test solutions written in any language. Concretely, we (i) use an LLM to rewrite existing unit-test datasets into an I/O format, (ii) supply a short configuration that tells the verifier how to compile and run a target language, and (iii) apply reinforcement learning with verifiable rewards (RLVR) in a robust code execution environment. Applied to five low-resource languages--Lua, Julia, R, OCaml, and Fortran--Agnostics (1) improves Qwen-3 4B to performance that rivals other 16B-70B open-weight models; (2) scales cleanly to larger and diverse model families (Qwen-3 8B, DeepSeek Coder 6.7B Instruct, Phi 4 Mini); and (3) for ${\le}16$B parameter models, sets new state-of-the-art pass@1 results on MultiPL-E and a new multi-language version of LiveCodeBench that we introduce. We will release the language-agnostic training datasets (Ag-MBPP-X, Ag-Codeforces-X, Ag-LiveCodeBench-X), training code, and ready-to-use configurations, making RL post-training in any programming language as simple as editing a short YAML file.
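A behavior-only verifier of this kind is easy to picture: run the program with the per-language command from the config, feed the test input on stdin, and compare stdout. A minimal sketch in the spirit of Agnostics (function name and whitespace normalization are illustrative assumptions, not the released code):

```python
import subprocess

def check_io(run_cmd: list[str], stdin_text: str, expected_stdout: str,
             timeout: float = 10.0) -> bool:
    """Judge a solution purely by observable I/O behavior: exit status
    and (whitespace-normalized) stdout. `run_cmd` would come from a
    short per-language configuration."""
    try:
        proc = subprocess.run(run_cmd, input=stdin_text, text=True,
                              capture_output=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return False
    return proc.returncode == 0 and proc.stdout.strip() == expected_stdout.strip()

# e.g. check_io(["lua", "solution.lua"], "2 3\n", "5\n")
```

Because nothing here inspects the source, the same check works unchanged for Lua, Julia, R, OCaml, or Fortran; only the run command differs.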
The growing availability of generative AI technologies such as large language models (LLMs) has significant implications for creative work. This paper explores two aspects of integrating LLMs into the creative process: the divergence stage of idea generation, and the convergence stage of evaluation and selection of ideas. We devised a collaborative group-AI Brainwriting ideation framework, which incorporated an LLM into the group ideation process, and evaluated both the idea generation process and the resulting solution space. To assess the potential of using LLMs in the idea evaluation process, we designed an evaluation engine and compared it to idea ratings assigned by three expert and six novice evaluators. Our findings suggest that integrating an LLM into Brainwriting could enhance both the ideation process and its outcome. We also provide evidence that LLMs can support idea evaluation. We conclude by discussing implications for HCI education and practice.
This study investigated auditory self-recognition boundaries using AI voice morphing technology, examining when individuals cease recognizing their own voice. Through controlled morphing between participants' voices and demographically matched targets at 1% increments using a mixed-methods design, we measured self-identification ratings and response times among 21 participants aged 18-64. Results revealed a critical recognition threshold at 35.2% morphing (95% CI [31.4, 38.1]). Older participants tolerated significantly higher morphing levels before losing self-recognition ($\beta = 0.617$, $p = 0.048$), suggesting age-related vulnerabilities. Greater acoustic embedding distances predicted slower decision-making ($r \approx 0.50$-$0.53$, $p < 0.05$), with the longest response times for cloned versions of participants' own voices. Qualitative analysis revealed prosodic-based recognition strategies, universal voice manipulation discomfort, and awareness of applications spanning assistive technology to security risks. These findings establish foundational evidence for individual differences in voice morphing detection, with implications for AI ethics and vulnerable population protection as voice synthesis becomes accessible.
Risk assessment instrument (RAI) datasets, particularly ProPublica's COMPAS dataset, are commonly used in algorithmic fairness papers due to benchmarking practices of comparing algorithms on datasets used in prior work. In many cases, this data is used as a benchmark to demonstrate good performance without accounting for the complexities of criminal justice (CJ) processes. However, we show that pretrial RAI datasets can contain numerous measurement biases and errors, and due to disparities in discretion and deployment, algorithmic fairness applied to RAI datasets is limited in making claims about real-world outcomes. These reasons make the datasets a poor fit for benchmarking under assumptions of ground truth and real-world impact. Furthermore, conventional practices of simply replicating previous data experiments may implicitly inherit or edify normative positions without explicitly interrogating value-laden assumptions. Without context of how interdisciplinary fields have engaged in CJ research and context of how RAIs operate upstream and downstream, algorithmic fairness practices are misaligned for meaningful contribution in the context of CJ, and would benefit from transparent engagement with normative considerations and values related to fairness, justice, and equality. These factors prompt questions about whether benchmarks for intrinsically socio-technical systems like the CJ system can exist in a beneficial and ethical way.
Black holes have an enormous underlying space of microstates, but universal macroscopic physics characterized by mass, charge, and angular momentum, as well as a causally disconnected interior. This leads to two related puzzles: (1) How does the effective factorization of interior and exterior degrees of freedom emerge in gravity? (2) How does the underlying degeneracy of states wind up having a geometric realization in the horizon area and in properties of the singularity? We explore these puzzles in the context of an incipient black hole in the AdS/CFT correspondence, the microstates of which are dual to half-BPS states of the $\mathcal{N}=4$ super-Yang-Mills theory. First, we construct a code subspace for this black hole and show how to organize it as a tensor product of a universal macroscopic piece (describing the exterior) and a factor corresponding to the microscopic degrees of freedom (describing the interior). We then study the classical phase space and symplectic form for low-energy excitations around the black hole. On the AdS side, we find that the symplectic form has a new physical degree of freedom at the stretched horizon of the black hole, reminiscent of soft hair, which is absent in the microstates. We explicitly show how such a soft mode emerges from the microscopic phase space in the dual CFT via a canonical transformation and how it encodes partial information about the microscopic degrees of freedom of the black hole.
Although LLMs are increasing the productivity of professional programmers, existing work shows that beginners struggle to prompt LLMs to solve text-to-code tasks. Why is this the case? This paper explores two competing hypotheses about the cause of student-LLM miscommunication: (1) students simply lack the technical vocabulary needed to write good prompts, and (2) students do not understand the extent of information that LLMs need to solve code generation tasks. We study (1) with a causal intervention experiment on technical vocabulary and (2) by analyzing graphs that abstract how students edit prompts and the different failures that they encounter. We find that substance beats style: a poor grasp of technical vocabulary is merely correlated with prompt failure; that the information content of prompts predicts success; that students get stuck making trivial edits; and more. Our findings have implications for the use of LLMs in programming education, and for efforts to make computing more accessible with LLMs.
Code LLMs are being rapidly deployed and there is evidence that they can make professional programmers more productive. Current benchmarks for code generation measure whether models generate correct programs given an expert prompt. In this paper, we present a new benchmark containing multiple prompts per problem, written by a specific population of non-expert prompters: beginning programmers. StudentEval contains 1,749 prompts for 48 problems, written by 80 students who have only completed one semester of Python programming. Our students wrote these prompts while working interactively with a Code LLM, and we observed very mixed success rates. We use StudentEval to evaluate 5 Code LLMs and find that StudentEval is a better discriminator of model performance than existing benchmarks. We analyze the prompts and find significant variation in students' prompting techniques. We also find that nondeterministic LLM sampling could mislead students into thinking that their prompts are more (or less) effective than they actually are, which has implications for how to teach with Code LLMs.
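The sampling-noise point has a simple statistical reading: a student who sees one completion per prompt is estimating a pass rate from a single binomial draw. A small sketch of that standard error calculation (the numbers are illustrative, not StudentEval results):

```python
from math import sqrt

def pass_rate_stderr(p: float, n: int) -> float:
    """Binomial standard error of an empirical per-prompt pass rate
    estimated from n sampled completions. With n = 1, as when a student
    judges a prompt from a single completion, the estimate is extremely
    noisy, so a good prompt can look bad and vice versa."""
    return sqrt(p * (1.0 - p) / n)

print(pass_rate_stderr(0.5, 1))   # 0.5  -> one sample tells you little
print(pass_rate_stderr(0.5, 20))  # ~0.11 -> many samples are needed
```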