The Ohio State University
The "early experience" paradigm enables autonomous language agents to learn and continuously improve from their own interactions with the environment, leveraging implicit feedback from observed future states rather than explicit reward signals. This approach consistently improves task effectiveness, enhances out-of-domain generalization, and provides a robust foundation for subsequent reinforcement learning.
Researchers at The Ohio State University developed PIAD-SRNN, a physics-informed recurrent neural network with adaptive decomposition for time series forecasting and imputation of indoor air quality (IAQ) data. The model consistently achieved the lowest MSE and MAE for both multi-horizon forecasting and missing-data imputation, while remaining computationally efficient compared to leading deep learning and linear baselines.
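As a rough illustration of the physics-informed ingredient, the sketch below combines an ordinary prediction loss with the residual of a simple ventilation ODE for CO2 concentration. The ODE, constants, and function names are assumptions for illustration; PIAD-SRNN's actual decomposition and physics terms differ.

```python
# Hedged sketch of a physics-informed loss for IAQ forecasting: the total loss
# combines ordinary prediction error with the residual of a simple ventilation
# ODE, dC/dt ~= -k * (C - C_ambient). Constants and names are illustrative.
import numpy as np

def physics_informed_loss(pred, target, dt=60.0, k=1e-3, c_ambient=420.0, lam=0.1):
    data_loss = np.mean((pred - target) ** 2)          # standard MSE term
    dC_dt = np.diff(pred) / dt                         # finite-difference derivative
    residual = dC_dt + k * (pred[:-1] - c_ambient)     # violation of the toy ODE
    physics_loss = np.mean(residual ** 2)
    return data_loss + lam * physics_loss

pred = np.array([800.0, 790.0, 781.0, 773.0])          # predicted CO2 (ppm)
target = np.array([805.0, 792.0, 780.0, 770.0])        # observed CO2 (ppm)
print(physics_informed_loss(pred, target))
```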
A comprehensive, brain-inspired framework integrates diverse research areas of LLM-based intelligent agents, encompassing individual architecture, collaborative systems, and safety. The framework formally conceptualizes agent components, maps AI capabilities to human cognition to identify research gaps, and outlines a roadmap for developing autonomous, adaptive, and safe AI.
HippoRAG 2 presents a non-parametric continual learning framework for large language models that integrates factual, sense-making, and associative memory capabilities within a single system. It achieves the highest average F1 score (59.8) and improved recall across diverse QA benchmarks, outperforming existing RAG methods without sacrificing performance on basic factual recall tasks.
HippoRAG, developed by researchers at The Ohio State University and Stanford University, introduces a neurobiologically inspired long-term memory system for LLMs. It achieves single-step multi-hop knowledge integration by constructing a dynamic knowledge graph, outperforming state-of-the-art RAG methods with up to 20% higher Recall@5 on 2WikiMultiHopQA and being 10-30 times faster than iterative approaches.
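A toy sketch of the graph-based retrieval step is shown below: facts are stored as an entity graph, and retrieval spreads activation from query entities with Personalized PageRank so that multi-hop neighbors surface in a single pass. The graph contents and parameters are illustrative only.

```python
# Toy sketch of HippoRAG-style retrieval: a knowledge graph plus Personalized
# PageRank seeded at the question's entities. Graph content is made up.
import networkx as nx

G = nx.Graph()
G.add_edges_from([
    ("Marie Curie", "radium"), ("radium", "radioactivity"),
    ("Marie Curie", "Sorbonne"), ("Sorbonne", "Paris"),
])

query_entities = {"Marie Curie": 1.0}                  # entities linked from the question
scores = nx.pagerank(G, personalization=query_entities, alpha=0.5)
for node, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{node:15s} {score:.3f}")                   # multi-hop neighbours rank highly
```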
AGENTBENCH introduces a multi-dimensional benchmark with 8 interactive environments to systematically evaluate Large Language Models (LLMs) as agents. The benchmark reveals a significant performance gap between commercial and open-source LLMs, identifying predominant failure modes in long-term reasoning and instruction following.
LLaVA-Critic-R1 demonstrates a novel paradigm where a single multimodal model, trained as a critic via reinforcement learning, surprisingly excels as a strong policy model. This approach achieves state-of-the-art performance for 7B-scale models on MMMU (71.9), MathVista (82.1), MathVerse (74.1), and Charxiv Reasoning (62.5), while simultaneously enhancing its self-evaluation capabilities.
MATH-SHEPHERD, a collaborative effort by Peking University and DeepSeek-AI, presents an automatic process annotation framework for training a process reward model without human intervention. This approach enables open-source LLMs like DeepSeek-67B to achieve 93.3% accuracy on GSM8K and 48.1% on MATH through verification and step-by-step reinforcement learning, improving mathematical reasoning capabilities and outperforming existing methods.
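The core annotation idea can be sketched as completion-based labeling: a step's quality is estimated by how often sampled completions starting from that step still reach the correct final answer. The stand-in completer below replaces the LLM sampling used in practice.

```python
# Minimal sketch of completion-based process labeling in the spirit of
# MATH-SHEPHERD: each step's soft label is the fraction of rollouts from that
# step that end at the correct answer. The completer is a stand-in for an LLM.
import random

def complete_solution(prefix_steps, rng):
    """Stand-in completer: pretends later steps succeed with fixed probability."""
    return rng.random() < (0.9 if all(prefix_steps) else 0.2)

def label_steps(step_correctness, n_rollouts=16, seed=0):
    rng = random.Random(seed)
    labels = []
    for i in range(1, len(step_correctness) + 1):
        prefix = step_correctness[:i]
        hits = sum(complete_solution(prefix, rng) for _ in range(n_rollouts))
        labels.append(hits / n_rollouts)      # soft label = estimated success rate
    return labels

# True/False marks whether each reasoning step in a sampled solution is valid.
print(label_steps([True, True, False, True]))
```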
Modern science advances fastest when thought meets action. LabOS represents the first AI co-scientist that unites computational reasoning with physical experimentation through multimodal perception, self-evolving agents, and Extended Reality (XR)-enabled human-AI collaboration. By connecting multimodal AI agents, smart glasses, and robots, LabOS allows AI to see what scientists see, understand experimental context, and assist in real-time execution. Across applications, from cancer immunotherapy target discovery to stem-cell engineering and materials science, LabOS shows that AI can move beyond computational design to participation, turning the laboratory into an intelligent, collaborative environment where human and machine discovery evolve together.
Researchers from the University of Waterloo, The Ohio State University, and collaborators present MMMU, a new benchmark designed to evaluate Large Multimodal Models on expert-level, multi-discipline understanding and reasoning. The benchmark, featuring over 11,000 college-level questions with diverse image types, reveals that even leading models like GPT-4o and Gemini 1.5 Pro significantly trail human experts, struggling with domain-specific visual perception, deep knowledge, and complex reasoning.
MIND2WEB introduces a large-scale dataset of over 2,000 human-demonstrated web interaction tasks collected from 137 real-world websites, designed to benchmark generalist agents. The paper also presents MINDACT, a two-stage LLM-based framework that achieves up to 55.1% element accuracy and 52.0% step success rate in cross-task generalization, though overall task success rates are in single digits.
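A minimal sketch of the two-stage idea: a cheap ranker first filters the full DOM down to a few candidate elements, and only those candidates are shown to an LLM as a multiple-choice question. The word-overlap ranker and prompt format below are stand-ins, not MINDACT's actual models.

```python
# Illustrative sketch of a two-stage element-selection pipeline: rank the DOM,
# then ask an LLM a multiple-choice question over the top candidates.
def rank_elements(task, elements, top_k=3):
    # stand-in ranker: score elements by word overlap with the task description
    task_words = set(task.lower().split())
    scored = [(len(task_words & set(e.lower().split())), e) for e in elements]
    return [e for _, e in sorted(scored, reverse=True)[:top_k]]

def build_prompt(task, candidates):
    options = "\n".join(f"{chr(65 + i)}. {c}" for i, c in enumerate(candidates))
    return f"Task: {task}\nWhich element should be acted on?\n{options}\nAnswer:"

dom = ["<button> Search flights </button>", "<a> Careers </a>",
       "<input> departure city </input>", "<div> footer links </div>"]
task = "search flights from Columbus"
candidates = rank_elements(task, dom)
print(build_prompt(task, candidates))       # this prompt would go to the LLM
```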
Researchers introduce WEBDREAMER, a model-based planning framework that uses large language models as world models to simulate web environment dynamics and evaluate future states before executing actions. This approach improves web agent performance and efficiency, demonstrating that a specialized 7B-parameter model, Dreamer-7B, can achieve performance comparable to GPT-4o on real-world web tasks.
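The planning loop can be sketched as follows: each candidate action is "imagined" by a world model that describes the page it would lead to, the imagined state is scored against the goal, and only the best-scoring action is actually executed. Both functions below are stand-ins for the LLM calls used in the paper.

```python
# Hedged sketch of model-based planning with an LLM world model: simulate each
# candidate action, score the imagined outcome, execute only the best one.
def simulate(state, action):
    """World model stand-in: describe the page that the action would lead to."""
    return f"{state} -> after {action}"

def score(imagined_state, goal):
    """Value stand-in: crude word overlap between imagined state and the goal."""
    return len(set(imagined_state.lower().split()) & set(goal.lower().split()))

def plan_step(state, candidate_actions, goal):
    ranked = sorted(candidate_actions,
                    key=lambda a: score(simulate(state, a), goal),
                    reverse=True)
    return ranked[0]                       # execute only the most promising action

state = "flight search homepage"
goal = "book flight to Paris"
actions = ["click Paris suggestion", "open careers page", "scroll footer"]
print(plan_step(state, actions, goal))
```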
The Adaptive Reasoning Model (ARM) enables large reasoning models to adaptively select appropriate reasoning formats based on task difficulty, reducing token generation by an average of 30% and up to 70% on easy tasks while maintaining strong performance. This approach, using an adapted reinforcement learning algorithm, mitigates the "overthinking" problem and achieves a ~2x training speedup.
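A rough sketch of format selection, assuming a simple difficulty heuristic in place of ARM's learned policy: easy inputs get a direct answer, harder ones progressively longer reasoning formats.

```python
# Illustrative sketch of adaptive reasoning-format selection: pick a cheaper
# decoding format for easier inputs. The difficulty heuristic and format names
# are assumptions, not ARM's actual mechanism.
def pick_format(question):
    difficulty = len(question.split())               # stand-in difficulty estimate
    if difficulty < 8:
        return "direct_answer"                       # cheapest: no reasoning tokens
    if difficulty < 20:
        return "short_cot"                           # a few reasoning steps
    return "long_cot"                                # full chain of thought

for q in ["2 + 2 = ?",
          "A train leaves Columbus at 9am at 60 mph; when does it reach a city 180 miles away?"]:
    print(pick_format(q), "<-", q)
```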
SRPO enhances multimodal large language models by integrating explicit self-reflection and self-correction capabilities through a two-stage training framework. The approach achieves state-of-the-art performance among open-source models, scoring 78.5% on MathVista with SRPO-32B, and shows competitive results against leading closed-source models across diverse reasoning benchmarks.
Researchers from Fudan University, The Ohio State University, The Pennsylvania State University, and Meta AI introduced TravelPlanner, a challenging benchmark for real-world planning with language agents, centered on multi-day travel itinerary generation. Evaluations revealed current state-of-the-art LLMs, including GPT-4-Turbo, achieve an extremely low 0.6% success rate in end-to-end planning tasks on this benchmark.
Researchers from The Ohio State University and UC Berkeley rigorously assessed the current state of web agents, revealing that frontier models exhibit substantially lower success rates on a new, more challenging online benchmark compared to previous reports. They introduced Online-Mind2Web, a diverse benchmark of 300 tasks on 136 live websites, and WebJudge, an automatic evaluation method that achieves 85.7% agreement with human judgments.
An investigation into TabPFN v2 clarifies how it handles data heterogeneity and examines its potential as a feature encoder, while proposing test-time divide-and-conquer strategies that extend its applicability to high-dimensional, multi-class, and large-scale tabular datasets.
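One such divide-and-conquer strategy can be sketched as feature-wise splitting: features are partitioned into smaller groups, a separate copy of the base model is fit on each group, and the probability estimates are averaged. LogisticRegression stands in for TabPFN v2 below, and the grouping rule is an assumption rather than the paper's exact procedure.

```python
# Sketch of a test-time divide-and-conquer strategy for high-dimensional
# tables: fit one sub-model per feature group, then average the probabilities.
# LogisticRegression is a stand-in for TabPFN v2.
import numpy as np
from sklearn.linear_model import LogisticRegression

def divide_and_conquer_predict(X_train, y_train, X_test, group_size=50):
    n_features = X_train.shape[1]
    probas = []
    for start in range(0, n_features, group_size):
        cols = slice(start, start + group_size)
        clf = LogisticRegression(max_iter=1000)
        clf.fit(X_train[:, cols], y_train)            # one sub-model per feature group
        probas.append(clf.predict_proba(X_test[:, cols]))
    return np.mean(probas, axis=0)                    # ensemble by averaging

rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(200, 120)), rng.integers(0, 2, 200)
X_test = rng.normal(size=(10, 120))
print(divide_and_conquer_predict(X_train, y_train, X_test).shape)
```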
Computer-using agents (CUAs) must plan task workflows across diverse and evolving applications, yet progress is limited by the lack of large-scale, high-quality training data. Existing datasets are narrow, static, and costly to annotate, while synthetic data often yields oversimplified or misaligned behaviors. We present Watch & Learn (W&L), a framework that converts readily available Internet videos of human computer use into executable UI trajectories at scale. Instead of directly generating actions or relying on handcrafted heuristics, we cast trajectory annotation as an inverse dynamics problem that predicts user actions from consecutive screen states, which simplifies learning and generalizes across domains. Through a task-aware retrieval and labeling pipeline, W&L yields over 53K high-quality trajectories that enhance CUAs both as in-context exemplars and as supervised training data. On OSWorld, it consistently improves general-purpose and specialized CUAs, while on WindowsAgentArena it achieves state-of-the-art performance among 7B-scale models under the 15-step limit. These results show that web-scale human demonstration videos can serve as a practical and scalable foundation for advancing real-world CUAs.
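A minimal sketch of the inverse-dynamics formulation: given features of two consecutive screen states, predict the action that transformed one into the other. Screens are toy vectors and the classifier is ordinary logistic regression here; the actual pipeline operates on real video frames.

```python
# Minimal sketch of inverse dynamics for UI trajectories: learn to recover the
# action from (state, state delta). Toy features stand in for screenshots.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
ACTIONS = ["click", "type", "scroll"]

def make_pair(action_id):
    before = rng.normal(size=8)                       # toy "screenshot" features
    after = before + (action_id + 1) * 0.5 + rng.normal(scale=0.1, size=8)
    return np.concatenate([before, after - before])  # state + state delta

labels = rng.integers(0, 3, 300)                      # which toy action was taken
X = np.array([make_pair(a) for a in labels])
clf = LogisticRegression(max_iter=1000).fit(X, labels)
print(ACTIONS[clf.predict(make_pair(2).reshape(1, -1))[0]])   # recovered action
```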
This research introduces UGround, a universal visual grounding model, and integrates it into the SeeAct-V framework, enabling GUI agents to interact with digital environments purely through visual observation and pixel-level operations. The method demonstrates superior performance over state-of-the-art text-based approaches across web, desktop, and mobile GUI tasks, primarily due to its robust grounding capabilities developed from a large synthetic dataset.
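The pixel-only interaction loop can be sketched as plan, ground, execute: a planner names the target element in natural language, a grounding model maps that referring expression plus the screenshot to (x, y) coordinates, and the action is dispatched at the pixel level. Both models below are stand-ins, not the actual SeeAct-V or UGround interfaces.

```python
# Hedged sketch of a SeeAct-V style pixel-level loop with stand-in models:
# the agent only ever sees pixels and only ever acts at pixel coordinates.
def plan(task, screenshot):
    """Planner stand-in: decide what to do next, in natural language."""
    return "click", "the search button in the top bar"

def ground(referring_expression, screenshot):
    """Grounding stand-in: a model like UGround would return real coordinates."""
    return (1180, 42)

def execute(action, xy):
    print(f"{action} at pixel {xy}")                  # e.g. dispatch to an OS-level driver

screenshot = "screenshot.png"                         # raw pixels are the only observation
action, target = plan("search for the OSU NLP group", screenshot)
coords = ground(target, screenshot)
execute(action, coords)
```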