alphaXiv


Om AI Research

17,443

14 Apr 2025

computer-science computation-and-language computer-vision-and-pattern-recognition

VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model

Zhejiang University Binjiang Institute of Zhejiang University Om AI Research

Ruochen Xu

正阳金

VLM-R1 introduces an open-source framework that applies rule-based reinforcement learning to Vision-Language Models (VLMs), enhancing their visual reasoning and generalization abilities on tasks like Referring Expression Comprehension and Open-Vocabulary Object Detection. The approach demonstrates improved out-of-domain performance compared to supervised fine-tuning and showcases emergent reasoning behaviors.

5,381

298

30 Sep 2025

computer-science computation-and-language computer-vision-and-pattern-recognition

VLM-FO1: Bridging the Gap Between High-Level Reasoning and Fine-Grained Perception in VLMs

Binjiang Institute of Zhejiang University Om AI Research College of Computer Science and Technology, Zhejiang University

VLM-FO1, a plug-and-play framework from Om AI Research and Zhejiang University, enhances pre-trained Vision-Language Models with fine-grained perception by bridging high-level reasoning and precise spatial localization. It achieves state-of-the-art performance across object grounding (44.4 mAP on COCO), regional understanding, and visual reasoning benchmarks, while effectively preserving the base VLM's general capabilities.

1,366

01 Sep 2025

attention-mechanisms computer-science computer-vision-security

ZoomEye: Enhancing Multimodal LLMs with Human-Like Zooming Capabilities through Tree-Based Image Exploration

Zhejiang University Binjiang Institute of Zhejiang University Om AI Research

Ruochen Xu

正阳金

ZoomEye, a training-free and model-agnostic framework, enhances Multimodal Large Language Models (MLLMs) with human-like zooming capabilities through tree-based image exploration. This approach substantially improves MLLM performance on high-resolution visual tasks, enabling smaller models (3B-8B parameters) to surpass larger commercial models like GPT-4o on specific detail-oriented benchmarks.

337

24 Dec 2024

autonomous-vehicles computer-science computer-vision-security

GUI Testing Arena: A Unified Benchmark for Advancing Autonomous GUI Testing Agent

Zhejiang University Binjiang Institute of Zhejiang University Om AI Research

正阳金

konka zhao

Researchers at Zhejiang University and Om AI Research introduced GUI Testing Arena (GTArena), a unified, end-to-end benchmark for autonomous GUI testing. The framework formalizes the testing process and evaluates state-of-the-art multimodal large language models, revealing a substantial performance gap between current AI capabilities and real-world applicability.

134

19 Feb 2025

computer-science artificial-intelligence computation-and-language

The Self-Improvement Paradox: Can Language Models Bootstrap Reasoning Capabilities without External Scaffolding?

Zhejiang University Binjiang Institute of Zhejiang University Om AI Research

Ruochen Xu

Self-improvement of large language models (LLMs), i.e., improving the performance of an LLM by fine-tuning it on synthetic data that it generates itself, is a promising way to advance the capabilities of LLMs while avoiding extensive supervision. Existing approaches to self-improvement often rely on external supervision signals in the form of seed data and/or assistance from third-party models. This paper presents Crescent, a simple yet effective framework for generating high-quality synthetic question-answer data in a fully autonomous manner. Crescent first prompts the LLM to generate raw questions via a bait prompt, then diversifies these questions through rejection sampling-based self-deduplication, and finally feeds the questions back to the LLM and collects the corresponding answers by majority voting. We show that Crescent sheds light on the potential of true self-improvement with zero external supervision signals for math reasoning; in particular, Crescent-generated question-answer pairs suffice to (i) improve the reasoning capabilities of an LLM while preserving its general performance (especially in the 0-shot setting); and (ii) distil LLM knowledge to weaker models more effectively than existing methods based on seed-dataset augmentation.
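
The deduplication and majority-voting stages named in the abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the token-level Jaccard similarity, and the 0.7 threshold are all assumptions chosen for illustration, and the abstract does not say which similarity measure Crescent uses. Here, "rejection" simply means a candidate question is discarded when it is too similar to one already kept.

```python
from collections import Counter


def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity (an assumed stand-in for the paper's measure)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def deduplicate(questions, threshold=0.7):
    """Keep a question only if it is not too similar to any question kept so far."""
    kept = []
    for q in questions:
        # Reject the candidate when it overlaps heavily with an accepted question.
        if all(jaccard(q, k) < threshold for k in kept):
            kept.append(q)
    return kept


def majority_vote(answers):
    """Return the most frequent answer among several sampled answers."""
    if not answers:
        return None
    counts = Counter(a.strip() for a in answers)
    return counts.most_common(1)[0][0]


# Hypothetical raw questions and sampled answers, standing in for LLM outputs.
raw_questions = [
    "What is 2 + 3?",
    "what is 2 + 3?",
    "Solve x + 1 = 4 for x.",
]
unique_questions = deduplicate(raw_questions)
final_answer = majority_vote(["5", "5 ", "6"])
```

In a real pipeline the question and answer lists would come from repeated sampling of the same LLM, and the resulting question-answer pairs would then be used for fine-tuning.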

30 May 2025

agentic-frameworks agents chain-of-thought

Unifying Language Agent Algorithms with Graph-based Orchestration Engine for Reproducible Agent Research

Binjiang Institute of Zhejiang University Om AI Research College of Computer Science and Technology, Zhejiang University

Ruochen Xu

正阳金

Researchers from Om AI Research and Zhejiang University introduce AGORA, a unified framework that enables standardized development and comprehensive evaluation of diverse language agent algorithms through a modular, graph-based architecture. Extensive experiments on mathematical reasoning and high-resolution image question-answering reveal that simpler algorithms often demonstrate robust performance with lower computational overhead, and prompt engineering significantly impacts results.
