MultiPL-E introduces a scalable, compiler-based system for translating unit test-driven code generation benchmarks into 18 diverse programming languages, creating the first massively multilingual, parallel benchmark. An evaluation of state-of-the-art models revealed significant variation in multi-language performance, with Codex notably matching or exceeding its Python performance in languages like JavaScript, and found no strong correlation between perplexity and code correctness.
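A minimal sketch of the kind of per-language translation such a benchmark compiler must automate: mapping one Python unit-test assertion into equivalent assertions in other languages. The helper names and the target assertion libraries (Node's assert, luaunit) are illustrative choices for this toy, not the MultiPL-E implementation.

# Toy illustration (not the MultiPL-E compiler): mechanically translate a
# Python test assertion into JavaScript and Lua equivalents.
import json

def to_js(value):
    # JSON literals double as JavaScript literals for numbers, strings,
    # booleans, and (nested) lists.
    return json.dumps(value)

def to_lua(value):
    # Lua uses table constructors instead of list literals.
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, (int, float, str)):
        return json.dumps(value)
    if isinstance(value, list):
        return "{" + ", ".join(to_lua(v) for v in value) + "}"
    raise TypeError(f"unsupported value: {value!r}")

def translate_assertion(func, args, expected):
    py = f"assert {func}({', '.join(map(repr, args))}) == {expected!r}"
    js = f"assert.deepStrictEqual({func}({', '.join(map(to_js, args))}), {to_js(expected)});"
    lua = f"lu.assertEquals({func}({', '.join(map(to_lua, args))}), {to_lua(expected)})"
    return py, js, lua

if __name__ == "__main__":
    for line in translate_assertion("add_lists", [[1, 2], [3, 4]], [4, 6]):
        print(line)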
Existing benchmarks for frontier models often test specialized, "PhD-level" knowledge that is difficult for non-experts to grasp. In contrast, we present a benchmark with 613 problems based on the NPR Sunday Puzzle Challenge that requires only general knowledge. Our benchmark is challenging for both humans and models; however, correct solutions are easy to verify, and models' mistakes are easy to spot. As LLMs are more widely deployed in society, we believe it is useful to develop benchmarks for frontier models that humans can understand without the need for deep domain expertise. Our work reveals capability gaps that are not evident in existing benchmarks: OpenAI o1 significantly outperforms other reasoning models on our benchmark, despite being on par with other models when tested on benchmarks that test specialized knowledge. Furthermore, our analysis of reasoning outputs uncovers new kinds of failures. DeepSeek R1, for instance, often concedes with "I give up" before providing an answer that it knows is wrong. R1 can also be remarkably "uncertain" in its output and, in rare cases, it does not "finish thinking," which suggests the need for techniques to "wrap up" before the context window limit is reached. We also quantify the effectiveness of reasoning longer to identify the point beyond which more reasoning is unlikely to improve accuracy on our benchmark.
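One way to quantify the "reasoning longer" question is to bin attempts by how much reasoning the model produced and look for the budget beyond which per-bin accuracy stops rising. The sketch below does this over hypothetical attempt records; the field names, the use of reasoning-token counts as the measure, and the bin edges are assumptions for illustration, not the paper's released data or methodology.

# Hedged sketch: per-bin accuracy as a function of reasoning-token budget.
from bisect import bisect_right

def accuracy_by_reasoning_budget(attempts, bin_edges):
    """attempts: iterable of dicts with 'reasoning_tokens' and 'correct' keys."""
    hits = [0] * (len(bin_edges) + 1)
    totals = [0] * (len(bin_edges) + 1)
    for a in attempts:
        b = bisect_right(bin_edges, a["reasoning_tokens"])
        totals[b] += 1
        hits[b] += bool(a["correct"])
    return [hits[i] / totals[i] if totals[i] else None for i in range(len(totals))]

# Synthetic records only; a plateau in the later bins would suggest the point
# beyond which extra reasoning no longer helps.
attempts = [
    {"reasoning_tokens": 800, "correct": False},
    {"reasoning_tokens": 4000, "correct": True},
    {"reasoning_tokens": 9000, "correct": True},
    {"reasoning_tokens": 20000, "correct": True},
]
print(accuracy_by_reasoning_budget(attempts, bin_edges=[2000, 8000, 16000]))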
Blind and Low Vision (BLV) people have adopted AI-powered visual interpretation applications to address their daily needs. While these applications have been helpful, prior work has found that users remain unsatisfied by their frequent errors. Recently, multimodal large language models (MLLMs) have been integrated into visual interpretation applications, and they show promise for more descriptive visual interpretations. However, it is still unknown how this advancement has changed people's use of these applications. To address this gap, we conducted a two-week diary study in which 20 BLV people used an MLLM-enabled visual interpretation application we developed, and we collected 553 entries. In this paper, we report a preliminary analysis of 60 diary entries from 6 participants. We found that participants considered the application's visual interpretations trustworthy (mean 3.75 out of 5) and satisfying (mean 4.15 out of 5). Moreover, participants trusted our application in high-stakes scenarios, such as receiving medical dosage advice. We discuss our plan to complete our analysis to inform the design of future MLLM-enabled visual interpretation systems.
We present the Blue Jay survey, a Cycle-1 JWST program aimed at studying the stellar and gas content of galaxies at Cosmic Noon. The survey consists of deep spectroscopy for 153 targets observed over two pointings in the COSMOS field using the NIRSpec micro-shutter assembly (MSA). We employ the three medium-resolution gratings G140M, G235M, and G395M, with exposure times of 13 hours, 3.2 hours, and 1.6 hours, respectively. We thus obtain full coverage of the 1-5 micron range, corresponding to the entire rest-frame optical wavelength range. The sample is carefully selected to provide a census of galaxies over the redshift range 1.7 < z < 3.5 above a redshift-dependent minimum stellar mass that ranges from 10^8.7 Msun to 10^9.3 Msun. The Blue Jay sample is representative of the entire galaxy population at these redshifts, without strong biases in color, star formation rate, or other properties. The sizes of massive galaxies at these redshifts are comparable to the NIRSpec shutters, which requires custom strategies for designing and reducing the observations. Since the standard A-B nod subtraction leads to flux self-subtraction, we construct a master background from empty shutters and subtract it from each of the science spectra. This, in turn, allows for the use of shorter slitlets consisting of only two shutters per galaxy instead of the usual three, with a substantial increase in the multiplexing of the NIRSpec MSA. We measure multi-band photometry using archival JWST and HST observations in two different ways: in a large elliptical aperture encompassing the entire source and from the exact area in the sky where the NIRSpec 1D spectrum is extracted. This enables self-consistent fits of spectroscopic and photometric data. The Blue Jay dataset, which we publicly release, represents the ideal sample for studying the stellar populations, neutral gas, and ionized gas in Cosmic Noon galaxies.
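A minimal sketch of the master-background step described above, assuming every extracted spectrum already lives on a common wavelength grid; the sigma-clipped median combination and the variable names are illustrative choices, not the Blue Jay reduction pipeline.

# Sketch: build a master background from empty-shutter spectra and subtract it
# from each science spectrum, avoiding the flux self-subtraction of A-B nodding
# for sources comparable in size to the shutters.
import numpy as np

def master_background(empty_shutter_spectra, clip_sigma=3.0):
    """Sigma-clipped median combination of empty-shutter spectra."""
    stack = np.asarray(empty_shutter_spectra, dtype=float)
    med = np.nanmedian(stack, axis=0)
    mad = 1.4826 * np.nanmedian(np.abs(stack - med), axis=0)   # robust scatter
    clipped = np.where(np.abs(stack - med) > clip_sigma * mad, np.nan, stack)
    return np.nanmedian(clipped, axis=0)

def subtract_background(science_spectra, background):
    """Subtract the master background from each science spectrum."""
    return [spec - background for spec in science_spectra]

# Synthetic usage on a 100-pixel wavelength grid:
rng = np.random.default_rng(0)
empty = rng.normal(1.0, 0.05, size=(20, 100))
science = [np.full(100, 5.0) + rng.normal(1.0, 0.05, 100) for _ in range(3)]
cleaned = subtract_background(science, master_background(empty))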
The resonant mode spectrum of the Kerr-Newman spacetime is presently unknown. These modes, called the quasinormal modes, play a central role in determining the stability of Kerr-Newman black holes and their response to perturbations. We present a new formalism, generalized from time-independent perturbation theory in quantum mechanics, for calculating the quasinormal mode frequencies of weakly charged Kerr-Newman spacetimes of arbitrary spin. Our method makes use of an original technique for applying perturbation theory to zeroth-order solutions that are not square-integrable, and it can be applied to other problems in theoretical physics. The new formalism reveals no unstable modes, which together with previous results in the slow-rotation limit strongly indicates the modal stability of the Kerr-Newman spacetime. Our techniques and results are of interest in the areas of holographic duality, foundational problems in General Relativity, and possibly in astrophysical systems.
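For reference, the quantum-mechanical template being generalized here is standard first-order time-independent perturbation theory, written schematically below; the caveat, as the abstract notes, is that quasinormal modes are not square-integrable, so the naive inner product in the first-order shift must be replaced by the paper's construction.

H = H_0 + \epsilon \, \delta H, \qquad
E_n \approx E_n^{(0)} + \epsilon \, \langle \psi_n^{(0)} | \, \delta H \, | \psi_n^{(0)} \rangle + \mathcal{O}(\epsilon^2),

with the role of the eigenvalue played by the complex quasinormal frequency and the small parameter by the black hole charge.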
Although LLMs are increasing the productivity of professional programmers, existing work shows that beginners struggle to prompt LLMs to solve text-to-code tasks. Why is this the case? This paper explores two competing hypotheses about the cause of student-LLM miscommunication: (1) students simply lack the technical vocabulary needed to write good prompts, and (2) students do not understand the extent of information that LLMs need to solve code generation tasks. We study (1) with a causal intervention experiment on technical vocabulary and (2) by analyzing graphs that abstract how students edit prompts and the different failures that they encounter. We find that substance beats style: a poor grasp of technical vocabulary is merely correlated with prompt failure; the information content of prompts predicts success; students get stuck making trivial edits; and more. Our findings have implications for the use of LLMs in programming education, and for efforts to make computing more accessible with LLMs.
Exact reconstruction of an image from measurements of its Discrete Fourier Transform (DFT) typically requires all DFT coefficients to be available. However, incorporating the prior assumption that the image contains only integer values enables unique recovery from a limited subset of DFT coefficients. This paper develops both theoretical and algorithmic foundations for this problem. We use algebraic properties of the DFT to define a reduction from two-dimensional recovery to several well-chosen one-dimensional recoveries. Our reduction framework characterizes the minimum number and location of DFT coefficients that must be sampled to guarantee unique reconstruction of an integer-valued image. Algorithmically, we develop reconstruction procedures which use dynamic programming to efficiently recover an integer signal or image from its minimal set of DFT measurements. While the new inversion algorithms still involve NP-hard subproblems, we demonstrate how the divide-and-conquer approach drastically reduces the associated search space. To solve the NP-hard subproblems, we employ a lattice-based framework which leverages the LLL approximation algorithm to make the algorithms fast and practical. We provide an analysis of the lattice method, suggesting approximate parameter choices to ensure correct inversion. Numerical results for the algorithms support the parameter analysis and demonstrate successful recovery of large integer images.
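To see why the integer prior buys anything, here is a toy illustration (not the paper's divide-and-conquer, dynamic-programming, or LLL-based machinery): for a small 1D signal with entries known to lie in a bounded integer range, exhaustive search over candidates shows that a strict subset of DFT coefficients already pins the signal down uniquely. For a real signal the kept bins {0, 1, 2} also fix bins 4 and 5 by conjugate symmetry, so the integer and range constraints resolve the one genuinely missing coefficient.

# Toy brute-force recovery of an integer signal from a subset of its DFT.
import itertools
import numpy as np

def recover_from_partial_dft(partial, kept, n, value_range):
    """Return all integer signals of length n whose DFT matches `partial`
    on the index set `kept`."""
    matches = []
    for candidate in itertools.product(value_range, repeat=n):
        spectrum = np.fft.fft(candidate)
        if np.allclose(spectrum[kept], partial):
            matches.append(candidate)
    return matches

signal = np.array([2, 0, 3, 1, 0, 2])
kept = [0, 1, 2]                        # only half of the DFT coefficients
partial = np.fft.fft(signal)[kept]
print(recover_from_partial_dft(partial, kept, len(signal), range(0, 4)))
# -> [(2, 0, 3, 1, 0, 2)]: the integer, bounded-range prior makes recovery unique.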
We present an analysis of high-precision pulsar timing data taken as part of the North American Nanohertz Observatory for Gravitational Waves (NANOGrav) project. We have observed 17 pulsars for a span of roughly five years using the Green Bank and Arecibo radio telescopes. We analyze these data using standard pulsar timing models, with the addition of time-variable dispersion measure and frequency-variable pulse shape terms. Sub-microsecond timing residuals are obtained in nearly all cases, and the best root-mean-square timing residuals in this set are ~30-50 ns. We present methods for analyzing post-fit timing residuals for the presence of a gravitational wave signal with a specified spectral shape. These optimally take into account the timing fluctuation power removed by the model fit, and can be applied either to data from a single pulsar or to a set of pulsars to detect a correlated signal. We apply these methods to our dataset to set an upper limit on the strength of the nHz-frequency stochastic supermassive black hole gravitational wave background of h_c (1 yr^-1) < 7x10^-15 (95%). This result is dominated by the timing of the two best pulsars in the set, PSRs J1713+0747 and J1909-3744.
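The quoted limit refers to the amplitude of a power-law characteristic-strain spectrum evaluated at a frequency of one inverse year; the -2/3 spectral index below is the conventional assumption for a background of circular, gravitational-wave-driven supermassive black hole binaries, written here for reference with the limit value taken from the abstract:

h_c(f) = A \left( \frac{f}{\mathrm{yr}^{-1}} \right)^{-2/3}, \qquad
A = h_c(1\,\mathrm{yr}^{-1}) < 7 \times 10^{-15} \ (95\%).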
Colliding black holes are systems of profound interest in both gravitational wave astronomy and in gravitation theory, and a variety of methods have been developed for modeling their dynamics in detail. The features of these dynamics are determined by the masses of the holes and by the magnitudes and axes of their spins. While masses and spin magnitudes can be defined in reasonably unambiguous ways, the spin axis is a concept which, despite great physical importance, is seriously undermined by the coordinate freedom of general relativity. Despite a great wealth of detailed numerical simulations of generic spinning black hole collisions, very little attention has gone into defining or justifying the definitions of the spin axis used in the numerical relativity literature. In this paper, we summarize and contrast the various spin direction measures available in the SpEC code, including a comparison with a method common in other codes; we explain why these measures have shown qualitatively different nutation features than one would expect from post-Newtonian theory; and we derive and implement new measures that give much better agreement.
This paper studies delegation in a model of discrete choice. In the delegation problem, an uninformed principal must consult an informed agent to make a decision. Both the agent and principal have preferences over the decided-upon action which vary based on the state of the world, and which may not be aligned. The principal may commit to a mechanism, which maps reports of the agent to actions. When this mechanism is deterministic, it can take the form of a menu of actions, from which the agent simply chooses upon observing the state. In this case, the principal is said to have delegated the choice of action to the agent. We consider a setting where the decision being delegated is a choice of a utility-maximizing action from a set of several options. We assume the shared portion of the agent's and principal's utilities is drawn from a distribution known to the principal, and that utility misalignment takes the form of a known bias for or against each action. We provide tight approximation analyses for simple threshold policies under three increasingly general sets of assumptions. With independently-distributed utilities, we prove a 3-approximation. When the agent has an outside option the principal cannot rule out, the constant approximation fails, but we prove a $\log \rho/\log\log \rho$-approximation, where $\rho$ is the ratio of the maximum value to the optimal utility. We also give a weaker but tight bound that holds for correlated values, and complement our upper bounds with hardness results. One special case of our model is utility-based assortment optimization, for which our results are new.
Code LLMs are being rapidly deployed and there is evidence that they can make professional programmers more productive. Current benchmarks for code generation measure whether models generate correct programs given an expert prompt. In this paper, we present a new benchmark containing multiple prompts per problem, written by a specific population of non-expert prompters: beginning programmers. StudentEval contains 1,749 prompts for 48 problems, written by 80 students who have only completed one semester of Python programming. Our students wrote these prompts while working interactively with a Code LLM, and we observed very mixed success rates. We use StudentEval to evaluate 5 Code LLMs and find that StudentEval is a better discriminator of model performance than existing benchmarks. We analyze the prompts and find significant variation in students' prompting techniques. We also find that nondeterministic LLM sampling could mislead students into thinking that their prompts are more (or less) effective than they actually are, which has implications for how to teach with Code LLMs.
Survey research has a long-standing history of being a human-powered field, but one that embraces various technologies for the collection, processing, and analysis of behavioral, political, and social outcomes of interest, among others. At the same time, Large Language Models (LLMs) bring new technological challenges and prerequisites in order to fully harness their potential. In this paper, we report work-in-progress on a systematic literature review, based on keyword searches from multiple large-scale databases as well as citation networks, that assesses how LLMs are currently being applied within the survey research process. We synthesize and organize our findings according to the survey research process, with examples of LLM usage across three broad phases: pre-data collection, data collection, and post-data collection. We discuss selected examples of potential use cases for LLMs as well as their pitfalls, based on examples from existing literature. Considering that survey research has rich experience and history regarding data quality, we discuss some opportunities and describe future outlooks for survey research to contribute to the continued development and refinement of LLMs.
We present the Methods for Open Agent Systems Evaluation Initiative (MOASEI) Competition, a multi-agent AI benchmarking event designed to evaluate decision-making under open-world conditions. Built on the free-range-zoo environment suite, MOASEI introduced dynamic, partially observable domains with agent and task openness: settings where entities may appear, disappear, or change behavior over time. The 2025 competition featured three tracks (Wildfire, Rideshare, and Cybersecurity), each highlighting distinct dimensions of openness and coordination complexity. Eleven teams from international institutions participated, with four of those teams submitting diverse solutions including graph neural networks, convolutional architectures, predictive modeling, and large language model-driven meta-optimization. Evaluation metrics centered on expected utility, robustness to perturbations, and responsiveness to environmental change. The results reveal promising strategies for generalization and adaptation in open environments, offering both empirical insight and infrastructure for future research. This report details the competition's design, findings, and contributions to the open-agent systems research community.
Binary black holes are the most abundant source of gravitational-wave observations. Gravitational-wave observatories in the next decade will require tremendous increases in the accuracy of numerical waveforms modeling binary black holes, compared to today's state of the art. One approach to achieving the required accuracy is using spectral-type methods that scale to many processors. Using the SpECTRE numerical-relativity code, we present the first simulations of a binary black hole inspiral, merger, and ringdown using discontinuous Galerkin methods. The efficiency of discontinuous Galerkin methods allows us to evolve the binary through ~18 orbits at reasonable computational cost. We then use SpECTRE's Cauchy Characteristic Evolution (CCE) code to extract the gravitational waves at future null infinity. The open-source nature of SpECTRE means this is the first time a spectral-type method for simulating binary black hole evolutions is available to the entire numerical-relativity community.
We present the Herschel Extragalactic Legacy Project (HELP). This project collates, curates, homogenises, and creates derived data products for most of the premium multi-wavelength extragalactic data sets. The sky boundaries for the first data release cover 1270 deg^2, defined by the Herschel SPIRE extragalactic survey fields, notably the Herschel Multi-tiered Extragalactic Survey (HerMES) and the Herschel Atlas survey (H-ATLAS). Here, we describe the motivation and principal elements in the design of the project. Guiding principles are transparent or "open" methodologies, with care for reproducibility and identification of provenance. A key element of the design focuses on the homogenisation of calibration, metadata, and the provision of information required to define the selection of the data for statistical analysis. We apply probabilistic methods that extract information directly from the images at long wavelengths, exploiting the prior information available at shorter wavelengths and providing full posterior distributions rather than the maximum likelihood estimates and associated uncertainties of traditional catalogues. With this project definition paper we provide full access to the first data release of HELP, Data Release 1 (DR1), including a monolithic map of the largest SPIRE extragalactic field at 385 deg^2 and 18 million measurements of PACS and SPIRE fluxes. We also provide tools to access and analyse the full HELP database. This new data set includes far-infrared photometry, photometric redshifts, and derived physical properties estimated from modelling the spectral energy distributions.
Thomas Milton Liggett was a world renowned UCLA probabilist, famous for his monograph Interacting Particle Systems. He passed away peacefully on May 12, 2020. This is a perspective article in memory of both Tom Liggett the person and Tom Liggett the mathematician.
Strong line metallicity calibrations are widely used to determine the gas phase metallicities of individual HII regions and entire galaxies. Over a decade ago, based on the Sloan Digital Sky Survey Data Release 4 (SDSS DR4), Kewley & Ellison published the coefficients of third-order polynomials that can be used to convert between different strong line metallicity calibrations for global galaxy spectra. Here, we update the work of Kewley & Ellison in three ways. First, by using a newer data release (DR7), we approximately double the number of galaxies used in polynomial fits, providing statistically improved polynomial coefficients. Second, we include in the calibration suite five additional metallicity diagnostics that have been proposed in the last decade and were not included by Kewley & Ellison. Finally, we develop a new machine learning approach for converting between metallicity calibrations. The random forest algorithm is non-parametric and therefore more flexible than polynomial conversions, due to its ability to capture non-linear behaviour in the data. The random forest method yields the same accuracy as the (updated) polynomial conversions, but has the significant advantage that a single model can be applied over a wide range of metallicities, without the need to distinguish upper and lower branches in $R_{23}$ calibrations. The trained random forest is made publicly available for use in the community.
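A hedged sketch of the random-forest conversion idea: learn a mapping from one strong-line calibration to another using galaxies that have both measurements. The column names, example calibrations, and synthetic demonstration data below are placeholders for illustration, not the released model or the SDSS DR7 training set.

# Sketch: train a random forest to convert between two metallicity calibrations.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

def fit_conversion(table, source_col="OH_N2", target_col="OH_R23"):
    # Learn the target calibration as a (possibly non-linear) function of the source.
    X = table[[source_col]].to_numpy()
    y = table[target_col].to_numpy()
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)
    rms = float(np.sqrt(np.mean((rf.predict(X_te) - y_te) ** 2)))
    return rf, rms  # held-out RMS scatter of the conversion, in dex

# Synthetic demonstration data only (12 + log(O/H) values):
rng = np.random.default_rng(1)
demo = pd.DataFrame({"OH_N2": rng.uniform(8.0, 9.2, 500)})
demo["OH_R23"] = 0.8 * demo["OH_N2"] + 1.7 + rng.normal(0.0, 0.03, 500)
model, scatter = fit_conversion(demo)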
In this work, we use gas phase metallicities calculated from the Sloan Digital Sky Survey (SDSS) Mapping Nearby Galaxies at Apache Point (MaNGA) Data Release 17 (DR17) to assess the extent of potential biases in spaxels which are spatially adjacent to spaxels identified as non-star forming (non-SF) on a BPT diagram. We identify a sample of $\sim$21,000 such spaxels with calculable metallicities from the full metallicity catalogue ($\sim$1.57 million), representing a small fraction ($\sim$1.3 per cent) of the full metallicity sample. $\sim$23 per cent of all galaxies with at least one spaxel with a calculable metallicity also contain at least one spaxel with a calculated metallicity adjacent to a non-SF spaxel, with a typical galaxy hosting 9 non-SF-adjacent spaxels. From our suite of 6 different metallicity calibrations, we find that only the metallicity calibrations based entirely on the [NII]$_{6584}$/H$\alpha$ ratio are affected, showing systematic offsets to higher metallicities by up to $\sim$0.04 dex if they are located adjacent to a non-SF flagged spaxel, relative to a radially matched control sample. The inclusion of additional diagnostic diagrams (based on [OI]$_{6300}$ and/or [SII]$_{6717+6731}$) is insufficient to remove the observed offset in the [NII]$_{6584}$/H$\alpha$ based calibrations. Using a stricter diagnostic line on the BPT diagram removes $\sim$94 per cent of identified bordering spaxels with metallicities for all metallicity calibrations, and removes the residual offset to higher metallicity values seen in [NII]$_{6584}$/H$\alpha$ calibrations. If science cases demand an exceptionally clean metallicity sample, we recommend either a stricter BPT cut and/or a non-[NII]$_{6584}$/H$\alpha$ based metallicity calibration.
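A minimal sketch of the adjacency selection described above, assuming boolean maps on a galaxy's spaxel grid: True where a spaxel is flagged non-star-forming on the BPT diagram, and True where a metallicity is calculable. The variable names and the 8-connectivity choice are illustrative assumptions, not the MaNGA data model.

# Sketch: flag metallicity spaxels that border a non-SF spaxel via binary dilation.
import numpy as np
from scipy.ndimage import binary_dilation

def non_sf_adjacent(non_sf_mask, has_metallicity):
    """Spaxels with a metallicity that touch (8-connectivity) a non-SF spaxel."""
    neighbourhood = np.ones((3, 3), dtype=bool)           # include diagonal neighbours
    near_non_sf = binary_dilation(non_sf_mask, structure=neighbourhood)
    return has_metallicity & near_non_sf & ~non_sf_mask

# Toy 5x5 example: one non-SF spaxel in the centre flags its ring of neighbours.
non_sf = np.zeros((5, 5), dtype=bool); non_sf[2, 2] = True
has_z = np.ones((5, 5), dtype=bool)
print(non_sf_adjacent(non_sf, has_z).sum())               # -> 8 bordering spaxels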
Many theoretical and observational studies have suggested that galaxy mergers may trigger enhanced star formation or active galactic nuclei (AGN) activity. We present an analysis of merging and nonmerging galaxies from $0.2 \leq z \leq 3$ in the IllustrisTNG50 simulation. These galaxies encompass a range of masses ($M_\star > 10^{8} M_\odot$), multiple merger stages, and mass ratios ($\geq$1:10). We examine the effect that galaxy mergers have on star formation and black hole accretion rates in the TNG50 universe. We additionally investigate how galaxy and black hole mass, merger stage, merger mass ratio, and redshift affect these quantities. Mergers in our sample show excess specific star formation rates (sSFR) at $z \leq 3$ and enhanced specific black hole accretion rates (sBHAR) at $z \lesssim 2$. The difference between sSFRs and sBHARs in the merging sample compared to the non-merging sample increases as redshift decreases. Additionally, we show that these enhancements persist for at least $\sim$1 Gyr after the merger event. Investigating how mergers behave in the TNG50 simulation throughout cosmic time enables both a better appreciation of the importance of spatial resolution in cosmological simulations and a better basis to understand our high-$z$ universe with observations from JWST.
Generative AI models, specifically large language models (LLMs), have made strides towards the long-standing goal of text-to-code generation. This progress has invited numerous studies of user interaction. However, less is known about the struggles and strategies of non-experts, for whom each step of the text-to-code problem presents challenges: describing their intent in natural language, evaluating the correctness of generated code, and editing prompts when the generated code is incorrect. This paper presents a large-scale controlled study of how 120 beginning coders across three academic institutions approach writing and editing prompts. A novel experimental design allows us to target specific steps in the text-to-code process and reveals that beginners struggle with writing and editing prompts, even for problems at their skill level and when correctness is automatically determined. Our mixed-methods evaluation provides insight into student processes and perceptions with key implications for non-expert Code LLM use within and outside of education.