Researchers from MIT Media Lab and others used EEG, NLP, and behavioral analysis to show that using Large Language Models for essay writing weakens brain network connectivity (up to 55% weaker than in self-reliant writing), impairs memory, and diminishes perceived essay ownership, indicating an accumulation of "cognitive debt."
The BigCode project releases StarCoder 2 models and The Stack v2 dataset, setting a new standard for open and ethically sourced Code LLM development. StarCoder 2 models, particularly the 15B variant, demonstrate competitive performance across code generation, completion, and reasoning tasks, often outperforming larger closed-source alternatives by prioritizing data quality and efficient architecture over sheer data quantity.
Researchers from Northeastern University and collaborators introduced `CANITEDIT`, a hand-crafted benchmark for evaluating large language models on instructional code editing, alongside `ExcessCode`, a new metric for edit precision. Their fine-tuned `EDITCODER-33b` model achieved an overall 10.7% absolute increase in `pass@1` over its base model, surpassing GPT-3.5-Turbo for descriptive instructions and matching it for lazy instructions.
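Neither `ExcessCode` nor the paper's exact harness is reproduced here, but `pass@1` numbers like these conventionally follow the unbiased pass@k estimator of Chen et al. (2021), pass@k = 1 − C(n−c, k)/C(n, k); a minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n samples drawn per problem,
    c of which pass all tests (Chen et al., 2021)."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# pass@1 reduces to the empirical pass rate c / n:
assert abs(pass_at_k(20, 5, 1) - 0.25) < 1e-12
```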
The BigCode project introduced SantaCoder, a 1.1B parameter open-source code language model, developed with a focus on responsible AI through a novel PII redaction pipeline and empirical data filtering studies. The model, trained on Python, Java, and JavaScript, notably demonstrates that filtering training data by GitHub stars degrades performance and achieves superior multilingual code generation and infilling capabilities compared to larger existing open-source models.
We present measurements of stellar population properties of a newly discovered, spectroscopically confirmed $z=11.10^{+0.11}_{-0.26}$, gravitationally lensed galaxy, using JWST NIRSpec PRISM spectroscopy and NIRCam imaging. The arc is highly magnified by the Bullet Cluster (magnification factor $\mu=14.0^{+6.2}_{-0.3}$). It contains three star-forming components, of which one is barely resolved and two are unresolved, giving intrinsic sizes of $\lesssim 10$ pc. The clumps also contain ~50% of the total stellar mass. The galaxy formed the majority of its stars ~150 Myr ago (by z~14). The spectrum shows a pronounced damping wing, typical for galaxies deep in the reionisation era and indicating a neutral IGM along this line of sight. The intrinsic luminosity of the galaxy is $0.086^{+0.008}_{-0.030}\,L^*$ (with $L^*$ being the characteristic luminosity at this redshift), making it the lowest-luminosity spectroscopically confirmed galaxy at $z>10$ discovered to date.
MultiPL-E introduces a scalable, compiler-based system for translating unit test-driven code generation benchmarks into 18 diverse programming languages, creating the first massively multilingual, parallel benchmark. An evaluation of state-of-the-art models revealed significant variation in multi-language performance, with Codex notably matching or exceeding its Python performance in languages like JavaScript, and found no strong correlation between perplexity and code correctness.
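As a rough illustration of MultiPL-E's compiler-based approach (the function names and the tiny translation table below are hypothetical, not the project's actual API), here is a sketch of turning a Python-style unit test into a Lua assertion:

```python
def to_lua_literal(value) -> str:
    """Render a Python value as a Lua literal (tiny illustrative subset)."""
    if isinstance(value, bool):          # check bool before int
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return repr(value)
    if isinstance(value, str):
        return '"' + value.replace('"', '\\"') + '"'
    if isinstance(value, list):
        # Note: a real translator needs a deep-equality helper to compare
        # Lua tables; `==` on tables checks identity, not contents.
        return "{" + ", ".join(to_lua_literal(v) for v in value) + "}"
    raise TypeError(f"unsupported value: {value!r}")

def lua_assert(func: str, args: list, expected) -> str:
    call = f"{func}({', '.join(to_lua_literal(a) for a in args)})"
    return f"assert({call} == {to_lua_literal(expected)})"

print(lua_assert("add", [2, 3], 5))  # assert(add(2, 3) == 5)
```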
The ability to build and reason about models of the world is essential for situated language understanding. But evaluating world modeling capabilities in modern AI systems -- especially those based on language models -- has proven challenging, in large part because of the difficulty of disentangling conceptual knowledge about the world from knowledge of surface co-occurrence statistics. This paper presents Elements of World Knowledge (EWoK), a framework for evaluating language models' understanding of the conceptual knowledge underlying world modeling. EWoK targets specific concepts from multiple knowledge domains known to be important for world modeling in humans, from social interactions (help, deceive) to spatial relations (left, right). Objects, agents, and locations in the items can be flexibly filled in, enabling easy generation of multiple controlled datasets. We then introduce EWoK-core-1.0, a dataset of 4,374 items covering 11 world knowledge domains. We evaluate 20 open-weights large language models (1.3B--70B parameters) and compare them with human performance. All tested models perform worse than humans, with results varying drastically across domains. Performance on social interactions and social properties was highest and performance on physical relations and spatial relations was lowest. Overall, this dataset highlights simple cases where even large models struggle and presents rich avenues for targeted research on LLM world modeling capabilities.
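A hypothetical illustration of the "flexible slots" idea (the real EWoK items, domains, and evaluation protocol are defined by the framework itself): one spatial-relations template yields controlled minimal pairs whose plausible variant a world-modeling LM should prefer.

```python
from itertools import permutations

# One spatial-relations template; slots {a} and {b} are fillable.
TEMPLATE = ("{a} is to the left of {b}. "
            "So {b} is to the {relation} of {a}.")

def minimal_pair(a: str, b: str) -> tuple[str, str]:
    plausible = TEMPLATE.format(a=a, b=b, relation="right")
    implausible = TEMPLATE.format(a=a, b=b, relation="left")
    return plausible, implausible

for a, b in permutations(["the cup", "the lamp"], 2):
    good, bad = minimal_pair(a, b)
    # A model that understands spatial relations should assign the
    # plausible continuation higher probability than the implausible one.
    print(f"PLAUSIBLE:   {good}\nIMPLAUSIBLE: {bad}\n")
```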
Existing benchmarks for frontier models often test specialized, "PhD-level" knowledge that is difficult for non-experts to grasp. In contrast, we present a benchmark with 613 problems based on the NPR Sunday Puzzle Challenge that requires only general knowledge. Our benchmark is challenging for both humans and models; however, correct solutions are easy to verify, and models' mistakes are easy to spot. As LLMs are more widely deployed in society, we believe it is useful to develop benchmarks for frontier models that humans can understand without the need for deep domain expertise. Our work reveals capability gaps that are not evident in existing benchmarks: OpenAI o1 significantly outperforms other reasoning models on our benchmark, despite being on par with other models when tested on benchmarks that test specialized knowledge. Furthermore, our analysis of reasoning outputs uncovers new kinds of failures. DeepSeek R1, for instance, often concedes with "I give up" before providing an answer that it knows is wrong. R1 can also be remarkably "uncertain" in its output and in rare cases, it does not "finish thinking," which suggests the need for techniques to "wrap up" before the context window limit is reached. We also quantify the effectiveness of reasoning longer to identify the point beyond which more reasoning is unlikely to improve accuracy on our benchmark.
Researchers from Northeastern University and Microsoft Research developed MultiPL-T, a method for generating high-quality semi-synthetic training data for Code Large Language Models (LLMs) in low-resource programming languages. Fine-tuning models with this test-validated data significantly boosted their performance on benchmarks, often more than doubling `pass@1` scores for languages like OCaml and Racket, and demonstrated weak-to-strong supervision.
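The heart of MultiPL-T's test validation can be sketched as a filter: keep a candidate translation only if it executes cleanly against its tests. A minimal sketch (the function name and file handling are illustrative; the real pipeline is more involved, with per-language runners and sandboxing):

```python
import pathlib
import subprocess
import tempfile

def keep_if_tests_pass(candidate_src: str, runner_cmd: list[str],
                       timeout: float = 10.0) -> bool:
    """Run a candidate translation (source + embedded tests) and keep it
    only if the process exits cleanly within the timeout."""
    with tempfile.TemporaryDirectory() as tmp:
        path = pathlib.Path(tmp) / "candidate"  # real pipelines name files per language
        path.write_text(candidate_src)
        try:
            result = subprocess.run(runner_cmd + [str(path)],
                                    capture_output=True, timeout=timeout)
        except subprocess.TimeoutExpired:
            return False
        return result.returncode == 0

# e.g. keep_if_tests_pass(racket_program_with_tests, ["racket"])
```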
WebUI introduces a large-scale, automatically collected dataset of web UIs with rich semantic and stylistic metadata. This dataset facilitates improved computational modeling of user interfaces, demonstrating enhanced performance for mobile UI element detection, screen classification, and screen similarity through transfer learning.
We present the first data release of the CAnadian NIRISS Unbiased Cluster Survey (CANUCS), a JWST Cycle 1 GTO program targeting 5 lensing clusters and flanking fields in parallel (Abell 370, MACS0416, MACS0417, MACS1149, MACS1423; survey area ~100 arcmin$^2$), with NIRCam imaging, NIRISS slitless spectroscopy, and NIRSpec prism multi-object spectroscopy. Fields centered on cluster cores include imaging in 8 bands from 0.9-4.4 $\mu$m, alongside continuous NIRISS coverage from 1.15-2 $\mu$m, while the NIRCam flanking fields provide 5 wide and 9 medium band filters for exceptional spectral sampling, all to ~29 mag$_{\rm AB}$. We also present JWST in Technicolor, a Cycle 2 follow-up GO program targeting 3 CANUCS clusters (Abell 370, MACS0416, MACS1149). The Technicolor program adds NIRISS slitless spectroscopy in F090W to the cluster fields while adding 8 wide, medium, and narrow band filters to the flanking fields. This provides NIRCam imaging in all wide and medium band filters over ~30 arcmin$^2$. This paper describes our data reduction and photometry methodology. We release NIRCam, NIRISS, and HST imaging, PSFs, PSF-matched imaging, photometric catalogs, and photometric and spectroscopic redshifts. We provide lens models and stellar population parameters in up to 19 filters for ~53,000 galaxies in the cluster fields, and ~44,000 galaxies in up to 29 filters in the flanking fields. We further present 733 NIRSpec spectra and redshift measurements up to $z=10.8$. Comparing against our photometric redshifts, we find catastrophic outlier rates of only 4-7% and scatter $\sigma_{\rm NMAD}$ of 0.01-0.03.
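For reference, $\sigma_{\rm NMAD}$ is conventionally the normalized median absolute deviation of photo-z residuals, and "catastrophic outliers" are commonly defined by $|\Delta z|/(1+z_{\rm spec}) > 0.15$; a minimal sketch under those common conventions (CANUCS's exact definitions may differ in detail):

```python
import numpy as np

def sigma_nmad(z_phot: np.ndarray, z_spec: np.ndarray) -> float:
    """sigma_NMAD = 1.48 * median(|dz - median(dz)|),
    with dz = (z_phot - z_spec) / (1 + z_spec)."""
    dz = (z_phot - z_spec) / (1.0 + z_spec)
    return 1.48 * float(np.median(np.abs(dz - np.median(dz))))

def outlier_fraction(z_phot: np.ndarray, z_spec: np.ndarray,
                     thresh: float = 0.15) -> float:
    """Fraction of sources with |dz| above the catastrophic threshold."""
    dz = np.abs(z_phot - z_spec) / (1.0 + z_spec)
    return float(np.mean(dz > thresh))
```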
For centuries, astronomers have discussed the possibility of inhabited worlds - from Herschel's 18th-century observations suggesting Mars may host life, to the systematic search for technosignatures that began in the 1960s using radio telescopes. Searching for artifacts in the solar system has received relatively little formal scientific interest and has faced significant technical and social challenges. Automated surveys and new observational techniques developed over the past decade now enable astronomers to survey parts of the sky for anomalous objects. We briefly describe four methods for detecting extraterrestrial artifacts and probes within the Solar System and then focus on demonstrating one of these. The first makes use of pre-Sputnik images to search for flashes from glinting objects. The second method makes use of space-borne telescopes to search for artificial objects. A third approach involves examining the reflectance spectra of objects in Earth orbit, in search of the characteristic reddening that may imply long-term exposure of metallic surfaces to space weathering. We focus here on a fourth approach, which involves using Earth's shadow as a filter when searching for optically luminous objects in near-Earth space. We demonstrate a proof-of-concept of this method by conducting two searches for transients in images acquired by the Zwicky Transient Facility (ZTF), which has generated many repeated 30-second exposures of the same fields. In this way, we identified previously uncatalogued events at short angular separations from the center of the shadow, motivating more extensive searches using this technique. We conclude that the Earth's shadow presents a new and exciting search domain for near-Earth SETI.
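The shadow-filter geometry can be sketched with a simple umbral cone: the shadow's angular radius shrinks with geocentric range, reaching zero at the cone's tip (~1.4 million km). A rough sketch, assuming simple cone geometry and ignoring the penumbra, refraction, and observer parallax (not the paper's actual pipeline):

```python
import math

R_EARTH = 6371.0      # km
R_SUN = 695_700.0     # km
AU = 1.496e8          # km

def umbra_angular_radius_deg(geocentric_range_km: float) -> float:
    """Approximate angular radius (degrees, seen from Earth's center)
    of Earth's umbral shadow at a given geocentric range."""
    cone_length = AU * R_EARTH / (R_SUN - R_EARTH)   # ~1.4e6 km
    shadow_radius = R_EARTH * (1.0 - geocentric_range_km / cone_length)
    if shadow_radius <= 0.0:
        return 0.0  # the range lies beyond the umbra's tip
    return math.degrees(math.atan2(shadow_radius, geocentric_range_km))

# An object at geostationary distance (~42,164 km geocentric):
print(f"{umbra_angular_radius_deg(42_164.0):.1f} deg")  # ~8.3 deg
```

Detections of optically luminous transients at angular separations inside this radius cannot be sunlit artificial objects, which is what makes the shadow a useful filter.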
Large language models (LLMs) already excel at writing code in high-resource languages such as Python and JavaScript, yet stumble on low-resource languages that remain essential to science and engineering. Besides the obvious shortage of pre-training data, post-training itself is a bottleneck: every new language seems to require new datasets, test harnesses, and reinforcement-learning (RL) infrastructure. We introduce Agnostics, a language-agnostic post-training pipeline that eliminates this per-language engineering. The key idea is to judge code solely by its externally observable behavior, so a single verifier can test solutions written in any language. Concretely, we (i) use an LLM to rewrite existing unit-test datasets into an I/O format, (ii) supply a short configuration that tells the verifier how to compile and run a target language, and (iii) apply reinforcement learning with verifiable rewards (RLVR) in a robust code execution environment. Applied to five low-resource languages--Lua, Julia, R, OCaml, and Fortran--Agnostics (1) improves Qwen-3 4B to performance that rivals other 16B-70B open-weight models; (2) scales cleanly to larger and diverse model families (Qwen-3 8B, DeepSeek Coder 6.7B Instruct, Phi 4 Mini); and (3) for ${\le}16$B parameter models, sets new state-of-the-art pass@1 results on MultiPL-E and a new multi-language version of LiveCodeBench that we introduce. We will release the language-agnostic training datasets (Ag-MBPP-X, Ag-Codeforces-X, Ag-LiveCodeBench-X), training code, and ready-to-use configurations, making RL post-training in any programming language as simple as editing a short YAML file.
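A behavior-only verifier of this kind is easy to picture: run the program with the per-language command from the config, feed the test input on stdin, and compare stdout. A minimal sketch in the spirit of Agnostics (function name and whitespace normalization are illustrative assumptions, not the released code):

```python
import subprocess

def check_io(run_cmd: list[str], stdin_text: str, expected_stdout: str,
             timeout: float = 10.0) -> bool:
    """Judge a solution purely by observable I/O behavior: exit status
    and (whitespace-normalized) stdout. `run_cmd` would come from a
    short per-language configuration."""
    try:
        proc = subprocess.run(run_cmd, input=stdin_text, text=True,
                              capture_output=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return False
    return proc.returncode == 0 and proc.stdout.strip() == expected_stdout.strip()

# e.g. check_io(["lua", "solution.lua"], "2 3\n", "5\n")
```

Because nothing here inspects the source, the same check works unchanged for Lua, Julia, R, OCaml, or Fortran; only the run command differs.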
The growing availability of generative AI technologies such as large language models (LLMs) has significant implications for creative work. This paper explores two aspects of integrating LLMs into the creative process: the divergence stage of idea generation, and the convergence stage of evaluation and selection of ideas. We devised a collaborative group-AI Brainwriting ideation framework, which incorporated an LLM into the group ideation process, and evaluated both the idea generation process and the resulting solution space. To assess the potential of using LLMs in the idea evaluation process, we designed an evaluation engine and compared it to idea ratings assigned by three expert and six novice evaluators. Our findings suggest that integrating an LLM into Brainwriting could enhance both the ideation process and its outcome. We also provide evidence that LLMs can support idea evaluation. We conclude by discussing implications for HCI education and practice.
This study investigated auditory self-recognition boundaries using AI voice morphing technology, examining when individuals cease recognizing their own voice. Through controlled morphing between participants' voices and demographically matched targets at 1% increments using a mixed-methods design, we measured self-identification ratings and response times among 21 participants aged 18-64. Results revealed a critical recognition threshold at 35.2% morphing (95% CI [31.4, 38.1]). Older participants tolerated significantly higher morphing levels before losing self-recognition ($\beta = 0.617$, $p = 0.048$), suggesting age-related vulnerabilities. Greater acoustic embedding distances predicted slower decision-making ($r \approx 0.50$-$0.53$, $p < 0.05$), with the longest response times for cloned versions of participants' own voices. Qualitative analysis revealed prosodic-based recognition strategies, universal voice manipulation discomfort, and awareness of applications spanning assistive technology to security risks. These findings establish foundational evidence for individual differences in voice morphing detection, with implications for AI ethics and vulnerable population protection as voice synthesis becomes accessible.
Risk assessment instrument (RAI) datasets, particularly ProPublica's COMPAS dataset, are commonly used in algorithmic fairness papers due to benchmarking practices of comparing algorithms on datasets used in prior work. In many cases, this data is used as a benchmark to demonstrate good performance without accounting for the complexities of criminal justice (CJ) processes. However, we show that pretrial RAI datasets can contain numerous measurement biases and errors, and due to disparities in discretion and deployment, algorithmic fairness applied to RAI datasets is limited in making claims about real-world outcomes. These reasons make the datasets a poor fit for benchmarking under assumptions of ground truth and real-world impact. Furthermore, conventional practices of simply replicating previous data experiments may implicitly inherit or edify normative positions without explicitly interrogating value-laden assumptions. Without context of how interdisciplinary fields have engaged in CJ research and context of how RAIs operate upstream and downstream, algorithmic fairness practices are misaligned for meaningful contribution in the context of CJ, and would benefit from transparent engagement with normative considerations and values related to fairness, justice, and equality. These factors prompt questions about whether benchmarks for intrinsically socio-technical systems like the CJ system can exist in a beneficial and ethical way.
Black holes have an enormous underlying space of microstates, but universal macroscopic physics characterized by mass, charge, and angular momentum, as well as a causally disconnected interior. This leads to two related puzzles: (1) How does the effective factorization of interior and exterior degrees of freedom emerge in gravity? (2) How does the underlying degeneracy of states wind up having a geometric realization in the horizon area and in properties of the singularity? We explore these puzzles in the context of an incipient black hole in the AdS/CFT correspondence, the microstates of which are dual to half-BPS states of the $\mathcal{N}=4$ super-Yang-Mills theory. First, we construct a code subspace for this black hole and show how to organize it as a tensor product of a universal macroscopic piece (describing the exterior) and a factor corresponding to the microscopic degrees of freedom (describing the interior). We then study the classical phase space and symplectic form for low-energy excitations around the black hole. On the AdS side, we find that the symplectic form has a new physical degree of freedom at the stretched horizon of the black hole, reminiscent of soft hair, which is absent in the microstates. We explicitly show how such a soft mode emerges from the microscopic phase space in the dual CFT via a canonical transformation and how it encodes partial information about the microscopic degrees of freedom of the black hole.
Although LLMs are increasing the productivity of professional programmers, existing work shows that beginners struggle to prompt LLMs to solve text-to-code tasks. Why is this the case? This paper explores two competing hypotheses about the cause of student-LLM miscommunication: (1) students simply lack the technical vocabulary needed to write good prompts, and (2) students do not understand the extent of information that LLMs need to solve code generation tasks. We study (1) with a causal intervention experiment on technical vocabulary and (2) by analyzing graphs that abstract how students edit prompts and the different failures that they encounter. We find that substance beats style: a poor grasp of technical vocabulary is merely correlated with prompt failure; that the information content of prompts predicts success; that students get stuck making trivial edits; and more. Our findings have implications for the use of LLMs in programming education, and for efforts to make computing more accessible with LLMs.
Code LLMs are being rapidly deployed and there is evidence that they can make professional programmers more productive. Current benchmarks for code generation measure whether models generate correct programs given an expert prompt. In this paper, we present a new benchmark containing multiple prompts per problem, written by a specific population of non-expert prompters: beginning programmers. StudentEval contains 1,749 prompts for 48 problems, written by 80 students who have only completed one semester of Python programming. Our students wrote these prompts while working interactively with a Code LLM, and we observed very mixed success rates. We use StudentEval to evaluate 5 Code LLMs and find that StudentEval is a better discriminator of model performance than existing benchmarks. We analyze the prompts and find significant variation in students' prompting techniques. We also find that nondeterministic LLM sampling could mislead students into thinking that their prompts are more (or less) effective than they actually are, which has implications for how to teach with Code LLMs.
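The sampling-noise point has a simple statistical reading: a student who sees one completion per prompt is estimating a pass rate from a single binomial draw. A small sketch of that standard error calculation (the numbers are illustrative, not StudentEval results):

```python
from math import sqrt

def pass_rate_stderr(p: float, n: int) -> float:
    """Binomial standard error of an empirical per-prompt pass rate
    estimated from n sampled completions. With n = 1, as when a student
    judges a prompt from a single completion, the estimate is extremely
    noisy, so a good prompt can look bad and vice versa."""
    return sqrt(p * (1.0 - p) / n)

print(pass_rate_stderr(0.5, 1))   # 0.5  -> one sample tells you little
print(pass_rate_stderr(0.5, 20))  # ~0.11 -> many samples are needed
```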