The Gravity Spy project aims to uncover the origins of glitches, transient bursts of noise that hamper analysis of gravitational-wave data. By combining the work of citizen-science volunteers with machine-learning algorithms, the Gravity Spy project enables reliable classification of glitches. Citizen science and machine learning are intrinsically coupled within the Gravity Spy framework: machine-learning classifications provide a rapid first-pass classification of the dataset and enable tiered volunteer training, while volunteer classifications verify the machine classifications, bolster the machine-learning training set, and identify new morphological classes of glitches. These classifications are now routinely used in studies characterizing the performance of the LIGO gravitational-wave detectors. Providing the volunteers with a training framework that teaches them to classify a wide range of glitches, as well as additional tools to aid their investigations of interesting glitches, empowers them to discover new classes of glitches. This demonstrates that, when given suitable support, volunteers can go beyond simple classification tasks and identify new features in data at a level comparable to domain experts. The Gravity Spy project is now providing volunteers with more complicated data, including auxiliary monitors of the detector, to identify the root causes of glitches.
APACE is a computational framework that optimizes AlphaFold2 for supercomputing environments, significantly accelerating protein structure prediction. The system delivers speedups of up to two orders of magnitude and efficiently generates diverse conformational ensembles, transforming prediction times from weeks to minutes.
Researchers from Peking University, Princeton University, and ByteDance introduce MMaDA-Parallel, a multimodal large diffusion language model that generates text and images in parallel, addressing error propagation in sequential thinking-aware models. It achieved the highest output alignment score on a new ParaBench benchmark by enabling continuous, bidirectional interaction between modalities and optimizing with trajectory-level reinforcement learning.
We present Denario, an AI multi-agent system designed to serve as a scientific research assistant. Denario can perform many different tasks, such as generating ideas, checking the literature, developing research plans, writing and executing code, making plots, and drafting and reviewing a scientific paper. The system has a modular architecture, allowing it to handle specific tasks, such as generating an idea, or to carry out an end-to-end scientific analysis using Cmbagent as a deep-research backend. In this work, we describe Denario and its modules in detail, and illustrate its capabilities by presenting multiple papers it generated in scientific disciplines such as astrophysics, biology, biophysics, biomedical informatics, chemistry, materials science, mathematical physics, medicine, neuroscience, and planetary science. Denario also excels at combining ideas from different disciplines, and we illustrate this with a paper that applies methods from quantum physics and machine learning to astrophysical data. We report the evaluations performed on these papers by domain experts, who provided both numerical scores and review-like feedback. We then highlight the strengths, weaknesses, and limitations of the current system. Finally, we discuss the ethical implications of AI-driven research and reflect on how such technology relates to the philosophy of science. We publicly release the code at this https URL. A Denario demo can also be run directly on the web at this https URL, and the full app will be deployed on the cloud.
A self-supervised framework, LightReasoner, enhances large language model reasoning by deriving contrastive supervision from the behavioral differences between an expert and a weaker amateur model. The method improves mathematical reasoning accuracy by up to 28.1% on GSM8K, requiring 90% less training time and 99% fewer tuned tokens compared to existing fine-tuning techniques.
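As a rough illustration of the expert-amateur contrast that LightReasoner builds on, the sketch below flags the reasoning steps where an expert's and an amateur's next-token distributions diverge most and keeps the expert's distribution as a soft target. The KL-based selection and the toy distributions are assumptions for illustration, not the paper's implementation.

```python
# Hedged sketch of contrastive expert-amateur supervision in the spirit of
# LightReasoner. The selection rule and weighting are illustrative assumptions.
import numpy as np

def kl(p, q, eps=1e-9):
    """KL divergence between two next-token distributions."""
    p, q = np.asarray(p) + eps, np.asarray(q) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def contrastive_targets(expert_probs, amateur_probs, top_frac=0.2):
    """Keep only the steps where expert and amateur disagree most.

    expert_probs, amateur_probs: lists of per-step next-token distributions.
    Returns (step_index, expert_distribution) pairs used as soft targets.
    """
    divergences = [kl(p, q) for p, q in zip(expert_probs, amateur_probs)]
    k = max(1, int(top_frac * len(divergences)))
    critical = np.argsort(divergences)[-k:]          # most informative steps
    return [(int(i), expert_probs[i]) for i in critical]

# Toy example: 4 reasoning steps over a 3-token vocabulary.
expert = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.4, 0.3], [0.9, 0.05, 0.05]]
amateur = [[0.6, 0.3, 0.1], [0.4, 0.3, 0.3], [0.3, 0.4, 0.3], [0.3, 0.4, 0.3]]
print(contrastive_targets(expert, amateur))
```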
This research introduces WORKFORCE, a modular multi-agent inference architecture that decouples planning from execution, and OPTIMIZED WORKFORCE LEARNING (OWL), a training paradigm focused on a domain-agnostic planner. The system achieved 69.70% accuracy on the GAIA benchmark, setting a new open-source state-of-the-art and outperforming commercial baselines like OpenAI's Deep Research.
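A minimal sketch of the planner/executor decoupling described above, assuming a hypothetical worker registry and a hand-written routing rule in place of the trained domain-agnostic planner:

```python
# Hedged sketch of the planner/worker decoupling described for WORKFORCE.
# The worker registry and routing logic are illustrative stand-ins.
def planner(task):
    """Split a task into (worker_name, subtask) pairs; a stand-in for the
    trained domain-agnostic planner."""
    if "search" in task:
        return [("web_worker", task), ("writer_worker", "summarize findings")]
    return [("coder_worker", task)]

WORKERS = {
    "web_worker": lambda sub: f"[web results for: {sub}]",
    "coder_worker": lambda sub: f"[code solving: {sub}]",
    "writer_worker": lambda sub: f"[summary of: {sub}]",
}

def run(task):
    # Execution is independent of planning: workers can be swapped or retrained
    # without touching the planner, which is the point of the decoupling.
    return [WORKERS[name](sub) for name, sub in planner(task)]

print(run("search for the GAIA benchmark leaderboard"))
```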
CacheGen optimizes Large Language Model serving by compressing and streaming Key-Value (KV) caches, addressing the network bottleneck in fetching long contexts. This system reduces Time-to-First-Token (TTFT) by 3.1-4.7x and the KV cache size by 3.5-4.3x with marginal impact on model quality.
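The sketch below illustrates the general idea of compressing a KV cache and streaming it in chunks so transfer can overlap with decoding; the uniform 8-bit quantization and chunk layout are assumptions for illustration, not CacheGen's actual codec.

```python
# Hedged sketch: quantize KV-cache tensors and stream them chunk by chunk.
import numpy as np

def quantize(kv, bits=8):
    """Uniformly quantize a float KV tensor to `bits` and return codes + params."""
    lo, hi = kv.min(), kv.max()
    scale = (hi - lo) / (2**bits - 1) or 1.0
    codes = np.round((kv - lo) / scale).astype(np.uint8)
    return codes, lo, scale

def dequantize(codes, lo, scale):
    return codes.astype(np.float32) * scale + lo

def stream_kv(kv, chunk_tokens=512):
    """Yield compressed chunks along the token dimension."""
    for start in range(0, kv.shape[0], chunk_tokens):
        yield quantize(kv[start:start + chunk_tokens])

# Toy KV cache: 2048 tokens x 64 head-dim values.
kv_cache = np.random.randn(2048, 64).astype(np.float32)
restored = np.concatenate([dequantize(*chunk) for chunk in stream_kv(kv_cache)])
print("max reconstruction error:", float(np.abs(restored - kv_cache).max()))
```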
The emergence of Multimodal Large Language Models (MLLMs) has propelled the development of autonomous agents that operate on Graphical User Interfaces (GUIs) using pure visual input. A fundamental challenge is robustly grounding natural language instructions. This requires precise spatial alignment, which accurately locates the coordinates of each element, and, more critically, correct semantic alignment, which matches the instructions to the functionally appropriate UI element. Although Reinforcement Learning with Verifiable Rewards (RLVR) has proven effective at improving spatial alignment for these MLLMs, we find that inefficient exploration bottlenecks semantic alignment, which prevents models from learning difficult semantic associations. To address this exploration problem, we present Adaptive Exploration Policy Optimization (AEPO), a new policy optimization framework. AEPO employs a multi-answer generation strategy to enforce broader exploration, which is then guided by a theoretically grounded Adaptive Exploration Reward (AER) function derived from first principles of efficiency, η = U/C. Our AEPO-trained models, InfiGUI-G1-3B and InfiGUI-G1-7B, establish new state-of-the-art results across multiple challenging GUI grounding benchmarks, achieving significant relative improvements of up to 9.0% over the naive RLVR baseline on benchmarks designed to test generalization and semantic understanding. Resources are available at this https URL.
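To make the efficiency principle η = U/C concrete, here is a minimal sketch of an exploration reward for a multi-answer rollout, assuming a simple hit-or-miss utility and a per-sample cost; the actual AER derivation in the paper is more involved.

```python
# Hedged sketch of an efficiency-style reward eta = U / C. The utility and
# cost definitions below are illustrative assumptions, not the paper's exact AER.
def adaptive_exploration_reward(predictions, target, cost_per_sample=1.0):
    """Score a multi-answer rollout for GUI grounding.

    predictions: list of predicted element IDs from one multi-answer generation.
    target: the functionally correct element ID.
    Utility U is 1 if any prediction hits the target, 0 otherwise;
    cost C grows with the number of sampled answers.
    """
    utility = 1.0 if target in predictions else 0.0
    cost = cost_per_sample * max(1, len(predictions))
    return utility / cost  # eta = U / C: hitting with fewer samples pays more

# A rollout that finds the right element in 2 guesses beats one needing 5.
print(adaptive_exploration_reward(["btn_submit", "btn_cancel"], "btn_submit"))
print(adaptive_exploration_reward(["a", "b", "c", "d", "btn_submit"], "btn_submit"))
```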
This research demonstrates the spontaneous emergence of well-formed, self-replicating programs from random, non-replicating code in various computational environments, including minimalist languages and real-world instruction sets, without explicit fitness functions. The study finds this emergence is primarily driven by self-modification and interaction, not solely random mutations, and leads to subsequent complex dynamics like competition and co-existence.
SQLENS, an end-to-end framework from AWS, The University of Chicago, and MIT, addresses the issue of semantically incorrect SQL queries generated by Large Language Models (LLMs) by integrating diverse database and LLM-based error signals for fine-grained detection and iterative correction. The framework boosts Text-to-SQL system execution accuracy on benchmarks like BIRD by up to 20.50% and achieves an F1 score of 78.88 for error detection.
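As a sketch of what combining heterogeneous error signals might look like, the snippet below scores a generated query with a database-side signal (empty results), a schema signal (unknown columns), and an LLM-judge score; the specific signals and weights are illustrative assumptions, not SQLENS's detectors.

```python
# Hedged sketch of aggregating error signals for a suspicious LLM-generated SQL query.
def signal_empty_result(rows):
    """Empty results often indicate a semantically wrong (if executable) query."""
    return 1.0 if not rows else 0.0

def signal_unknown_columns(referenced_columns, schema_columns):
    """Columns referenced by the query but absent from the schema."""
    return 1.0 if set(referenced_columns) - set(schema_columns) else 0.0

def suspicion_score(rows, referenced_columns, schema_columns, llm_judge_score):
    """Weighted combination of database-side and LLM-side signals."""
    signals = [
        (0.4, signal_empty_result(rows)),
        (0.4, signal_unknown_columns(referenced_columns, schema_columns)),
        (0.2, llm_judge_score),   # e.g., an LLM's estimate that the query is wrong
    ]
    return sum(w * s for w, s in signals)

# A query that returned nothing and references a column missing from the schema.
print(suspicion_score([], ["customer_id", "total_spend"],
                      ["customer_id", "amount"], llm_judge_score=0.7))
```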
Current reinforcement learning (RL) frameworks for large language model (LLM) post-training typically assume a fixed prompt distribution, which is sub-optimal and bottlenecks scalability. Prior works have explored prompt evolution, but are often limited to the supervised fine-tuning stage, and prompts are sampled and evolved uniformly, without guiding signals. This empirical work presents a paradigm shift: Evolving Alignment via Asymmetric Self-Play (eva), which casts post-training as an infinite game with regret-based signals for two players: (i) a creator, who strategically samples and creates new informative prompts, and (ii) a solver, who learns to produce preferred responses. eva is the first method that allows language models to adaptively create training prompts in both offline and online RL post-training. The design is simple, easy to use, yet remarkably effective: eva sets a new SOTA on challenging benchmarks without any extra human prompts, e.g., it boosts the win rate of gemma-2-9b-it on Arena-Hard from 51.6% to 60.1% for DPO and from 52.6% to 62.4% for RLOO, surpassing claude-3-opus and catching up to gemini-1.5-pro, both of which are orders of magnitude larger. Extensive experiments show eva can create effective RL curricula and is robust across ablations. We believe adaptively evolving prompts are key to designing the next-generation RL post-training scheme.
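A minimal sketch of the creator/solver loop, using a best-minus-mean reward gap as a stand-in regret signal and a placeholder prompt-evolution step; eva's actual creator and solver updates (e.g., DPO or RLOO training) are omitted.

```python
# Hedged sketch of asymmetric self-play over prompts. The regret proxy and the
# evolution step are illustrative stand-ins, not eva's exact method.
import random

def regret(prompt, solver, reward_fn, n=4):
    """Gap between the best and the average sampled response reward:
    a proxy for how informative (learnable-but-not-solved) a prompt is."""
    rewards = [reward_fn(prompt, solver(prompt)) for _ in range(n)]
    return max(rewards) - sum(rewards) / len(rewards)

def eva_step(prompt_pool, solver, reward_fn, evolve, top_k=2):
    # Creator: pick the highest-regret prompts and evolve new variants from them.
    scored = sorted(prompt_pool, key=lambda p: regret(p, solver, reward_fn), reverse=True)
    new_prompts = [evolve(p) for p in scored[:top_k]]
    # Solver: would now be trained on the informative prompts (training omitted).
    return prompt_pool + new_prompts

# Toy instantiation with random stand-ins for solver, reward, and evolution.
pool = ["prove 2+2=4", "sort a list", "explain entropy"]
pool = eva_step(pool,
                solver=lambda p: random.random(),
                reward_fn=lambda p, r: r,
                evolve=lambda p: p + " (harder variant)")
print(pool)
```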
This research proposes a differentiable framework for learning causal structures from general binary data, utilizing the multivariate Bernoulli distribution to capture arbitrary dependencies. It demonstrates that while exact DAGs are non-identifiable, causal structures are identifiable up to Markov equivalence under a sparsity assumption. The developed method, BiNOTEARS, shows improved accuracy over baselines on synthetic datasets with complex interactions and a real-world biological network.
This research investigates whether large language model agents can simulate human trust behavior, introducing the concept of 'behavioral alignment' to assess their capacity to mirror human conduct and reasoning. The study finds that GPT-4 agents, in particular, exhibit high behavioral alignment with humans across various trust-related scenarios, including reciprocity, risk perception, and behavioral dynamics over time.
Quantum error correction is necessary to perform large-scale quantum computation, but requires extremely large overheads in both space and time. High-rate quantum low-density-parity-check (qLDPC) codes promise a route to reduce qubit numbers, but performing computation while maintaining low space cost has required serialization of operations and extra time costs. In this work, we design fast and parallelizable logical gates for qLDPC codes, and demonstrate their utility for key algorithmic subroutines such as the quantum adder. Our gate gadgets utilize transversal logical CNOTs between a data qLDPC code and a suitably constructed ancilla code to perform parallel Pauli product measurements (PPMs) on the data logical qubits. For hypergraph product codes, we show that the ancilla can be constructed by simply modifying the base classical codes of the data code, achieving parallel PPMs on a subgrid of the logical qubits with a lower space-time cost than existing schemes for an important class of circuits. Generalizations to 3D and 4D homological product codes further feature fast PPMs in constant depth. While prior work on qLDPC codes has focused on individual logical gates, we initiate the study of fault-tolerant compilation with our expanded set of native qLDPC code operations, constructing algorithmic primitives for preparing k-qubit GHZ states and distilling/teleporting k magic states with O(1) space overhead in O(1) and O(√k log k) logical cycles, respectively. We further generalize this to key algorithmic subroutines, demonstrating the efficient implementation of quantum adders using parallel operations. Our constructions are naturally compatible with reconfigurable architectures such as neutral atom arrays, paving the way to large-scale quantum computation with low space and time overheads.
A new framework, "executable counterfactuals," rigorously evaluates and enhances large language models' causal reasoning by requiring them to perform abduction, intervention, and prediction. Models trained with reinforcement learning from verifiable rewards consistently generalize this reasoning to out-of-distribution code and math problems, unlike those trained with supervised finetuning, despite all models showing a significant accuracy drop when required to infer latent variables.
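The abduction-intervention-prediction pattern on executable code can be illustrated with a toy structural program (invented here for illustration): infer the latent noise from an observation, intervene on the input, and re-run.

```python
# Hedged sketch of the abduction -> intervention -> prediction pattern that
# "executable counterfactuals" test, on a toy program invented for illustration.
def program(x, noise):
    y = 2 * x + noise          # structural equation with a latent noise term
    return y

# Observed: x = 3 produced y = 8.
observed_x, observed_y = 3, 8

# Abduction: infer the latent noise consistent with the observation.
noise = observed_y - 2 * observed_x        # noise = 2

# Intervention: set x to a counterfactual value.
counterfactual_x = 5

# Prediction: re-run the program with the inferred noise.
print(program(counterfactual_x, noise))    # -> 12
```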
Existing approaches to differentiable structure learning of directed acyclic graphs (DAGs) rely on strong identifiability assumptions in order to guarantee that global minimizers of the acyclicity-constrained optimization problem identify the true DAG. Moreover, it has been observed empirically that the optimizer may exploit undesirable artifacts in the loss function. We explain and remedy these issues by studying the behavior of differentiable acyclicity-constrained programs under general likelihoods with multiple global minimizers. By carefully regularizing the likelihood, it is possible to identify the sparsest model in the Markov equivalence class, even in the absence of an identifiable parametrization. We first study the Gaussian case in detail, showing how proper regularization of the likelihood defines a score that identifies the sparsest model; assuming faithfulness, it also recovers the Markov equivalence class. These results are then generalized to general models and likelihoods, where the same claims hold. The theoretical results are validated empirically, showing that this can be done using standard gradient-based optimizers, thus paving the way for differentiable structure learning under general models and losses.
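For concreteness, a common instantiation of such acyclicity-constrained scores is sketched below: a Gaussian least-squares likelihood with an L1 sparsity penalty and the standard differentiable acyclicity function h(W) = tr(exp(W ∘ W)) − d. The particular regularization analyzed in the paper is not reproduced here.

```python
# Hedged sketch of a sparsity-regularized Gaussian score with the standard
# differentiable acyclicity penalty; an illustrative objective, not the paper's.
import numpy as np
from scipy.linalg import expm

def acyclicity(W):
    """h(W) = tr(exp(W * W)) - d, which is 0 exactly when W encodes a DAG."""
    d = W.shape[0]
    return np.trace(expm(W * W)) - d

def regularized_score(W, X, lam=0.1):
    """Gaussian least-squares likelihood term plus an L1 sparsity penalty."""
    n = X.shape[0]
    residual = X - X @ W
    nll = 0.5 / n * np.sum(residual ** 2)
    return nll + lam * np.abs(W).sum()

# Toy 3-variable data generated from the chain X0 -> X1 -> X2.
rng = np.random.default_rng(0)
X0 = rng.normal(size=1000)
X1 = 0.8 * X0 + rng.normal(size=1000)
X2 = 0.5 * X1 + rng.normal(size=1000)
X = np.column_stack([X0, X1, X2])

W_chain = np.array([[0, 0.8, 0], [0, 0, 0.5], [0, 0, 0]])
print("acyclicity:", acyclicity(W_chain), "score:", regularized_score(W_chain, X))
```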
Regulatory efforts to govern large language model (LLM) development have predominantly focused on restricting access to high-performance computational resources. This study evaluates the efficacy of such measures by examining whether LLM capabilities can advance through algorithmic innovation in compute-constrained environments. We propose a novel framework distinguishing compute-dependent innovations--which yield disproportionate benefits at high compute--from compute-independent innovations, which improve efficiency across compute scales. The impact is quantified using Compute-Equivalent Gain (CEG). Experimental validation with nanoGPT models confirms that compute-independent advancements yield significant performance gains (e.g., with combined CEG up to 3.5×) across the tested scales. In contrast, compute-dependent advancements were detrimental to performance at smaller experimental scales, but showed improved CEG (on par with the baseline) as model size increased, a trend consistent with their definition of yielding primary benefits at higher compute. Crucially, these findings indicate that restrictions on computational hardware, while potentially slowing LLM progress, are insufficient to prevent all capability gains driven by algorithmic advancements. We argue that effective AI oversight must therefore incorporate mechanisms for understanding, anticipating, and potentially guiding algorithmic research, moving beyond a singular focus on hardware. The proposed framework also serves as an analytical tool for forecasting AI progress.
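One hedged way to operationalize CEG is sketched below: interpolate a baseline loss-versus-compute curve to find the compute the baseline would need to match the innovated run, then take the ratio. The log-log interpolation and the toy numbers are assumptions for illustration, not the paper's exact procedure.

```python
# Hedged sketch of one reading of Compute-Equivalent Gain (CEG).
import numpy as np

def compute_equivalent_gain(baseline_compute, baseline_loss,
                            innovation_compute, innovation_loss):
    """Interpolate the baseline loss-vs-compute curve (in log-compute) to find
    the compute at which the baseline would match the innovated run's loss."""
    log_c = np.log(np.asarray(baseline_compute, dtype=float))
    loss = np.asarray(baseline_loss, dtype=float)
    # np.interp needs an increasing x-axis, so interpolate over reversed loss.
    matched_log_c = np.interp(innovation_loss, loss[::-1], log_c[::-1])
    return float(np.exp(matched_log_c) / innovation_compute)

# Toy numbers: the innovated run reaches loss 2.9 with 1e17 FLOPs, which the
# baseline curve only reaches at ~4.6e17 FLOPs -> CEG of roughly 4.6.
print(compute_equivalent_gain([1e16, 1e17, 1e18], [3.5, 3.1, 2.8], 1e17, 2.9))
```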
PixCell, a diffusion-based generative foundation model, creates high-fidelity synthetic histopathology images conditioned on self-supervised UNI-2h embeddings. Trained on a dataset of 30.8 million patches, it achieves state-of-the-art image quality and enables controllable generation and virtual staining, demonstrating that synthetic data can effectively substitute for real data in self-supervised learning.
Improving time-to-first-token (TTFT) is an essential objective in modern large language model (LLM) inference engines. Optimizing TTFT directly results in higher maximal QPS and meets the requirements of many critical applications. However, improving TTFT is notoriously challenging since prefill is compute-bound and the performance bottleneck shifts from the self-attention that many prior works focus on to the MLP part. In this work, we present SpecPrefill, a training-free framework that accelerates inference TTFT for both long and medium context queries based on the following insight: LLMs are generalized enough to preserve quality given only a carefully chosen subset of prompt tokens. At its core, SpecPrefill leverages a lightweight model to speculate locally important tokens based on the context. These tokens, along with the necessary positional information, are then sent to the main model for processing. We evaluate SpecPrefill on a diverse set of tasks, followed by comprehensive benchmarking of the performance improvement both in a real end-to-end setting and in ablation studies. SpecPrefill manages to serve Llama-3.1-405B-Instruct-FP8 with up to 7× maximal end-to-end QPS on real downstream tasks and a 7.66× TTFT improvement.
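A minimal sketch of the token-selection step, assuming a lightweight draft model has already produced per-token importance scores; the scoring model and keep ratio are illustrative assumptions, not SpecPrefill's actual speculation mechanism.

```python
# Hedged sketch: keep only the prompt tokens a small model deems important, and
# pass them (with their original positions) to the main model for prefill.
import numpy as np

def select_important_tokens(token_scores, keep_ratio=0.3):
    """Return the original positions of the highest-scoring prompt tokens, in order."""
    k = max(1, int(keep_ratio * len(token_scores)))
    return np.sort(np.argsort(token_scores)[-k:])   # preserve original ordering

# Toy prompt of 12 tokens with importance scores from a small draft model.
scores = np.array([0.1, 0.9, 0.2, 0.8, 0.1, 0.1, 0.7, 0.1, 0.1, 0.6, 0.1, 0.9])
positions = select_important_tokens(scores, keep_ratio=0.3)
print("send to main model:", positions)  # positions double as the position ids
```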
NetLLM proposes a pioneering framework for adapting large language models (LLMs) to solve diverse networking problems by overcoming input modality gaps, output inefficiencies, and high adaptation costs. This approach achieved a 10.1-36.6% reduction in viewport prediction error, a 14.5-36.6% improvement in adaptive bitrate streaming Quality of Experience, and a 6.8-41.3% reduction in cluster job scheduling time, while demonstrating enhanced generalization compared to specialized learning-based baselines.