Binjiang Institute of Zhejiang University
VLM-R1 introduces an open-source framework that applies rule-based reinforcement learning to Vision-Language Models (VLMs), enhancing their visual reasoning and generalization abilities on tasks like Referring Expression Comprehension and Open-Vocabulary Object Detection. The approach demonstrates improved out-of-domain performance compared to supervised fine-tuning and showcases emergent reasoning behaviors.
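As an illustration of how a rule-based reward can be computed for Referring Expression Comprehension, the sketch below scores a rollout by the IoU between the predicted and ground-truth boxes. This is a minimal sketch under stated assumptions, not the paper's exact reward design; `parse_box` is a hypothetical parser supplied by the caller.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def rec_reward(model_output, gt_box, parse_box):
    """Rule-based reward for a Referring Expression Comprehension rollout:
    the IoU between the box parsed from the model's text output and the
    ground-truth box. `parse_box` is a hypothetical parser supplied by the
    caller; malformed outputs receive zero reward."""
    pred_box = parse_box(model_output)
    return iou(pred_box, gt_box) if pred_box is not None else 0.0
```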
VLM-FO1, a plug-and-play framework from Om AI Research and Zhejiang University, enhances pre-trained Vision-Language Models with fine-grained perception by bridging high-level reasoning and precise spatial localization. It achieves state-of-the-art performance across object grounding (44.4 mAP on COCO), regional understanding, and visual reasoning benchmarks, while effectively preserving the base VLM's general capabilities.
ZoomEye, a training-free and model-agnostic framework, enhances Multimodal Large Language Models (MLLMs) with human-like zooming capabilities through tree-based image exploration. This approach substantially improves MLLM performance on high-resolution visual tasks, enabling smaller models (3B-8B parameters) to surpass larger commercial models like GPT-4o on specific detail-oriented benchmarks.
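A minimal sketch of the tree-based zooming idea follows, assuming a hypothetical `score_fn` (e.g., the MLLM's confidence that a crop contains the queried detail). ZoomEye's actual search keeps multiple branches and uses richer pruning rules; this greedy variant only illustrates the exploration pattern.

```python
from PIL import Image

def zoom_search(image: Image.Image, score_fn, depth=0, max_depth=3, threshold=0.8):
    """Greedy depth-first zoom over a quadtree of image crops.

    `score_fn(crop) -> float` is a hypothetical relevance scorer (assumption);
    the search stops once a crop is confident enough or the depth limit is hit.
    """
    score = score_fn(image)
    if score >= threshold or depth == max_depth:
        return image, score
    w, h = image.size
    quadrants = [image.crop(box) for box in [(0, 0, w // 2, h // 2),
                                             (w // 2, 0, w, h // 2),
                                             (0, h // 2, w // 2, h),
                                             (w // 2, h // 2, w, h)]]
    best = max(quadrants, key=score_fn)   # zoom into the most promising quadrant
    return zoom_search(best, score_fn, depth + 1, max_depth, threshold)
```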
Researchers developed MCP-Guard, a multi-layered defense framework protecting Large Language Model-tool interactions against threats like prompt injection, achieving an average 98.47% recall and 89.07% F1-score. This system significantly reduced detection latency by up to 12x compared to existing baselines, and introduced MCP-AttackBench, a comprehensive dataset of over 70,000 attack samples for robust evaluation.
Researchers introduce DL-DKD++, a framework for partially relevant video retrieval that employs dual learning with dynamic knowledge distillation from large vision-language models and dynamic soft alignment for nuanced relevance scoring. The approach establishes new state-of-the-art performance on datasets like TVR (SumR 184.8), ActivityNet-Captions (SumR 149.9), and Charades-STA.
Researchers at Zhejiang University introduced RS5M, the first large-scale (5 million pairs) remote sensing image-text dataset, and GeoRSCLIP, a specialized Vision-Language Model. GeoRSCLIP, fine-tuned on RS5M, achieved 3-20% performance improvements over baselines on various remote sensing tasks, effectively bridging the domain gap for general VLMs and advancing GeoAI applications.
OmAgent, a multi-modal agent framework developed by Om AI Research and Zhejiang University, improves long-form video understanding by addressing information loss through a novel "rewinder" tool integrated within an autonomous agent. The framework outperformed existing methods on general problem-solving benchmarks, achieving 88.3% on MBPP and 79.7% on FreshQA, and demonstrated enhanced comprehension across various complex long video understanding tasks.
Researchers at Zhejiang University and Om AI Research introduced GUI Testing Arena (GTArena), a unified, end-to-end benchmark for autonomous GUI testing. The framework formalizes the testing process and evaluates state-of-the-art multimodal large language models, revealing a substantial performance gap between current AI capabilities and real-world applicability.
A new benchmark, OVDEval, is introduced to provide a comprehensive, fine-grained evaluation of Open-Vocabulary Detection (OVD) models across six linguistic aspects, utilizing meticulously designed hard negatives. This work also proposes NMS-AP, a refined metric addressing the "Inflated AP Problem" in traditional Average Precision, revealing that state-of-the-art OVD models show strong general object detection but poor fine-grained linguistic comprehension.
Self-improvement of large language models (LLMs) -- i.e., improving an LLM's performance by fine-tuning it on synthetic data generated by the model itself -- is a promising way to advance LLM capabilities while avoiding extensive supervision. Existing approaches to self-improvement often rely on external supervision signals in the form of seed data and/or assistance from third-party models. This paper presents Crescent, a simple yet effective framework for generating high-quality synthetic question-answer data in a fully autonomous manner. Crescent first elicits raw questions from the LLM via a bait prompt, then diversifies these questions through rejection-sampling-based self-deduplication, and finally feeds the questions back to the LLM and collects the corresponding answers by majority voting. We show that Crescent sheds light on the potential of true self-improvement with zero external supervision signals for math reasoning; in particular, Crescent-generated question-answer pairs suffice to (i) improve the reasoning capabilities of an LLM while preserving its general performance (especially in the 0-shot setting), and (ii) distil LLM knowledge to weaker models more effectively than existing methods based on seed-dataset augmentation.
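A minimal sketch of the three stages under stated assumptions: `generate(prompt, n)` is a hypothetical sampling wrapper around the LLM, and `sim(a, b)` is any text-similarity function. Crescent's actual bait prompts and deduplication criteria differ in detail.

```python
from collections import Counter

def crescent_round(generate, sim, n_questions=100, n_samples=8, sim_thr=0.9):
    """Sketch of the Crescent pipeline with hypothetical `generate` and `sim`
    callables (assumptions, not the paper's exact components)."""
    # 1) Elicit raw questions from the LLM with a bait prompt.
    bait = "Pose a challenging math problem."
    raw_questions = generate(bait, n_questions)

    # 2) Diversify via rejection-sampling-style self-deduplication:
    #    reject a question that is too similar to one already kept.
    kept = []
    for q in raw_questions:
        if all(sim(q, k) < sim_thr for k in kept):
            kept.append(q)

    # 3) Answer each question by majority voting over sampled solutions.
    pairs = []
    for q in kept:
        answers = generate(q, n_samples)
        best, _ = Counter(answers).most_common(1)[0]
        pairs.append({"question": q, "answer": best})
    return pairs
```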
Artificial intelligence is rapidly encroaching on the field of service regulation. However, existing AI-based regulation techniques are often tailored to specific application domains and thus are difficult to generalize in an automated manner. This paper presents Horae, a unified specification language for modeling (multimodal) regulation rules across a diverse set of domains. We showcase how Horae facilitates an intelligent service regulation pipeline by further exploiting a fine-tuned large language model named RuleGPT that automates the Horae modeling process, thereby yielding an end-to-end framework for fully automated intelligent service regulation. The feasibility and effectiveness of our framework are demonstrated over a benchmark of various real-world regulation domains. In particular, we show that our open-sourced, fine-tuned RuleGPT with 7B parameters suffices to outperform GPT-3.5 and perform on par with GPT-4o.
Backdoor-based fingerprinting has emerged as an effective technique for tracing the ownership of large language models. However, in real-world deployment scenarios, developers often instantiate multiple downstream models from a shared base model, and applying fingerprinting to each variant individually incurs prohibitive computational overhead. While inheritance-based approaches -- where fingerprints are embedded into the base model and expected to persist through fine-tuning -- appear attractive, they suffer from three key limitations: late-stage fingerprinting, fingerprint instability, and interference with downstream adaptation. To address these challenges, we propose a novel mechanism called the Fingerprint Vector. Our method first embeds a fingerprint into the base model via backdoor-based fine-tuning, then extracts a task-specific parameter delta as a fingerprint vector by computing the difference between the fingerprinted and clean models. This vector can be directly added to any structurally compatible downstream model, allowing the fingerprint to be transferred post hoc without additional fine-tuning. Extensive experiments show that Fingerprint Vector achieves comparable or superior performance to direct injection across key desiderata. It maintains strong effectiveness across diverse model architectures as well as mainstream downstream variants within the same family. It also preserves harmlessness and robustness in most cases. Even when slight robustness degradation is observed, the impact remains within acceptable bounds and is outweighed by the scalability benefits of our approach.
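A minimal sketch of the delta-extraction and post-hoc transfer steps using PyTorch state_dicts; the `alpha` scaling knob is an illustrative assumption rather than part of the described method.

```python
import torch

def extract_fingerprint_vector(fingerprinted_sd, clean_sd):
    """Fingerprint vector = parameter delta between the fingerprinted base model
    and the clean base model (state_dicts with identical keys and shapes)."""
    return {k: fingerprinted_sd[k] - clean_sd[k] for k in clean_sd}

def apply_fingerprint_vector(downstream_sd, fp_vector, alpha=1.0):
    """Transfer the fingerprint post hoc by adding the delta to a structurally
    compatible downstream model; `alpha` is a hypothetical scaling factor."""
    return {k: v + alpha * fp_vector.get(k, torch.zeros_like(v))
            for k, v in downstream_sd.items()}
```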
With the continuous advancement of vision-language model (VLM) technology, remarkable research achievements have emerged in dermatology, the fourth most prevalent human disease category. Despite these advancements, VLMs still face explainability problems for users in diagnosis due to the inherent complexity of dermatological conditions, and existing tools offer relatively limited support for user comprehension. We propose SkinGEN, a diagnosis-to-generation framework that leverages the Stable Diffusion (SD) model to generate reference demonstrations from diagnosis results provided by a VLM, thereby enhancing visual explainability for users. Through extensive experiments with Low-Rank Adaptation (LoRA), we identify optimal strategies for skin-condition image generation. We conduct a user study with 32 participants evaluating both system performance and explainability. Results demonstrate that SkinGEN significantly improves users' comprehension of VLM predictions and fosters increased trust in the diagnostic process. This work paves the way for more transparent and user-centric VLM applications in dermatology and beyond.
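To illustrate the general diagnosis-to-generation pattern (not SkinGEN's exact pipeline), here is a minimal sketch using the `diffusers` library; the base checkpoint, LoRA weights path, and prompt template are placeholders, and a CUDA device is assumed.

```python
import torch
from diffusers import StableDiffusionPipeline

# Placeholder checkpoint and LoRA path; SkinGEN's actual models and prompt
# template are not specified here (assumptions).
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
pipe.load_lora_weights("path/to/skin-condition-lora")  # hypothetical adapter

diagnosis = "psoriasis on the elbow"  # e.g. text produced by the VLM
prompt = f"clinical photograph of {diagnosis}, dermatology reference image"
image = pipe(prompt, num_inference_steps=30).images[0]
image.save("reference_demo.png")
```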
Lin et al. introduce Analyzing-based Jailbreak (ABJ), a method that manipulates Large Language Models' internal reasoning processes to elicit harmful content from neutral inputs. ABJ achieved over 80% attack success rates on state-of-the-art LLMs like GPT-4o and Claude-3-haiku, demonstrating its ability to bypass common input-stage defenses.
Multimodal Large Language Models (MLLMs) have enabled transformative advancements across diverse applications but remain susceptible to safety threats, especially jailbreak attacks that induce harmful outputs. To systematically evaluate and improve their safety, we organized the Adversarial Testing & Large-model Alignment Safety Grand Challenge (ATLAS) 2025. This technical report presents findings from the competition, which involved 86 teams testing MLLM vulnerabilities via adversarial image-text attacks in two phases: white-box and black-box evaluations. The competition results highlight ongoing challenges in securing MLLMs and provide valuable guidance for developing stronger defense mechanisms. The challenge establishes new benchmarks for MLLM safety evaluation and lays groundwork for advancing safer multimodal AI systems. The code and data for this challenge are openly available at this https URL.
Invoking external tools enables Large Language Models (LLMs) to perform complex, real-world tasks, yet selecting the correct tool from large, hierarchically-structured libraries remains a significant challenge. The limited context windows of LLMs and noise from irrelevant options often lead to low selection accuracy and high computational costs. To address this, we propose the Hierarchical Gaussian Mixture Framework (HGMF), a probabilistic pruning method for scalable tool invocation. HGMF first maps the user query and all tool descriptions into a unified semantic space. The framework then operates in two stages: it clusters servers using a Gaussian Mixture Model (GMM) and filters them based on the query's likelihood. Subsequently, it applies the same GMM-based clustering and filtering to the tools associated with the selected servers. This hierarchical process produces a compact, high-relevance candidate set, simplifying the final selection task for the LLM. Experiments on a public dataset show that HGMF significantly improves tool selection accuracy while reducing inference latency, confirming the framework's scalability and effectiveness for large-scale tool libraries.
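A minimal sketch of one GMM filtering stage follows, using TF-IDF as a stand-in for the paper's semantic encoder (an assumption). Applying the same function first over server descriptions and then over the tools of the surviving servers yields the hierarchical pruning described above.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.mixture import GaussianMixture

def gmm_filter(query_vec, item_vecs, n_components=2, keep=1):
    """Cluster item embeddings with a GMM and keep the items belonging to the
    clusters the query most likely falls into (responsibility-based filtering)."""
    gmm = GaussianMixture(n_components=min(n_components, len(item_vecs)),
                          covariance_type="diag", random_state=0).fit(item_vecs)
    resp = gmm.predict_proba(query_vec.reshape(1, -1))[0]  # query's cluster responsibilities
    chosen = set(np.argsort(resp)[::-1][:keep])
    labels = gmm.predict(item_vecs)
    return [i for i, c in enumerate(labels) if c in chosen]

# Toy usage: TF-IDF stands in for the semantic encoder (assumption).
servers = {
    "weather-server": "weather forecasts, temperature and precipitation alerts",
    "finance-server": "stock prices, exchange rates and market data",
}
query = "current stock price of ACME"
vecs = TfidfVectorizer().fit_transform([query] + list(servers.values())).toarray()
kept_servers = [list(servers)[i] for i in gmm_filter(vecs[0], vecs[1:])]
# A second gmm_filter pass over the tools of kept_servers completes the hierarchy.
```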
18F-fluorodeoxyglucose (18F-FDG) Positron Emission Tomography (PET) imaging usually requires a full-dose radioactive tracer to obtain satisfactory diagnostic results, which raises concerns about the potential health risks of radiation exposure, especially for pediatric patients. Reconstructing low-dose PET (L-PET) images into high-quality full-dose PET (F-PET) images is an effective way to both reduce radiation exposure and maintain diagnostic accuracy. In this paper, we propose a resource-efficient deep learning framework for L-PET reconstruction and analysis, referred to as transGAN-SDAM, which generates F-PET from the corresponding L-PET and quantifies the standard uptake value ratios (SUVRs) of the generated F-PET across the whole brain. The transGAN-SDAM consists of two modules: a transformer-encoded Generative Adversarial Network (transGAN) and a Spatial Deformable Aggregation Module (SDAM). The transGAN generates higher-quality F-PET images, and the SDAM then integrates the spatial information of a sequence of generated F-PET slices to synthesize whole-brain F-PET images. Experimental results demonstrate the superiority and rationality of our approach.
It is well-known that recurrent neural networks (RNNs), although widely used, are vulnerable to adversarial attacks including one-frame attacks and multi-frame attacks. Though a few certified defenses exist to provide guaranteed robustness against one-frame attacks, we prove that defending against multi-frame attacks remains a challenging problem due to their enormous perturbation space. In this paper, we propose the first certified defense against multi-frame attacks for RNNs called RNN-Guard. To address the above challenge, we adopt the perturb-all-frame strategy to construct perturbation spaces consistent with those in multi-frame attacks. However, the perturb-all-frame strategy causes a precision issue in linear relaxations. To address this issue, we introduce a novel abstract domain called InterZono and design tighter relaxations. We prove that InterZono is more precise than Zonotope yet carries the same time complexity. Experimental evaluations across various datasets and model structures show that the certified robust accuracy calculated by RNN-Guard with InterZono is up to 2.18 times higher than that with Zonotope. In addition, we extend RNN-Guard as the first certified training method against multi-frame attacks to directly enhance RNNs' robustness. The results show that the certified robust accuracy of models trained with RNN-Guard against multi-frame attacks is 15.47 to 67.65 percentage points higher than those with other training methods.
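To illustrate the perturb-all-frame setting, the sketch below propagates plain interval bounds through a vanilla tanh RNN when every input frame may be perturbed within an L-infinity ball of radius eps. Plain intervals are a coarser abstract domain than the paper's InterZono or Zonotope; this is only an illustration of the propagation idea, not the paper's method.

```python
import torch

def interval_rnn_bounds(W_ih, W_hh, b, x_seq, eps):
    """Interval bound propagation through a tanh RNN under perturb-all-frame:
    each frame x_t is perturbed within an L-infinity ball of radius eps.
    Returns element-wise lower/upper bounds on the final hidden state."""
    h_c = torch.zeros(W_hh.shape[0])   # hidden-state center
    h_r = torch.zeros(W_hh.shape[0])   # hidden-state radius
    for x_t in x_seq:
        pre_c = W_ih @ x_t + W_hh @ h_c + b
        pre_r = W_ih.abs() @ torch.full_like(x_t, eps) + W_hh.abs() @ h_r
        lo, hi = torch.tanh(pre_c - pre_r), torch.tanh(pre_c + pre_r)  # tanh is monotone
        h_c, h_r = (lo + hi) / 2, (hi - lo) / 2
    return h_c - h_r, h_c + h_r
```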
Researchers from Om AI Research and Zhejiang University introduce AGORA, a unified framework that enables standardized development and comprehensive evaluation of diverse language agent algorithms through a modular, graph-based architecture. Extensive experiments on mathematical reasoning and high-resolution image question-answering reveal that simpler algorithms often demonstrate robust performance with lower computational overhead, and prompt engineering significantly impacts results.
We introduce OmChat, a model designed to excel at handling long contexts and video understanding tasks. OmChat's new architecture standardizes how different visual inputs are processed, making it more efficient and adaptable. It uses a dynamic vision encoding process to effectively handle images of various resolutions, capturing fine details across a range of image qualities. OmChat utilizes an active progressive multimodal pretraining strategy, which gradually increases the model's capacity for long contexts and enhances its overall abilities. By selecting high-quality data during training, OmChat learns from the most relevant and informative data points. With support for a context length of up to 512K, OmChat demonstrates promising performance in tasks involving multiple images and videos, outperforming most open-source models on these benchmarks. Additionally, OmChat proposes a prompting strategy for unifying complex multimodal inputs, including single-image text, multi-image text, and videos, and achieves competitive performance on single-image benchmarks. To further evaluate the model's capabilities, we propose a benchmark dataset named Temporal Visual Needle in a Haystack, which assesses OmChat's ability to comprehend temporal visual details within long videos. Our analysis highlights several key factors contributing to OmChat's success: support for any-aspect high image resolution, the active progressive pretraining strategy, and high-quality supervised fine-tuning datasets. This report provides a detailed overview of OmChat's capabilities and the strategies that enhance its performance in visual understanding.
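A toy sketch of how a "Temporal Visual Needle in a Haystack"-style sample could be constructed: a distinctive needle frame is inserted at a random position in a long frame sequence, and the model must answer a question about that moment. The dataset's actual construction and annotation protocol are not specified here; this is an illustrative assumption.

```python
import random

def make_temporal_needle_sample(haystack_frames, needle_frame, question, answer):
    """Insert a needle frame at a random position in a long frame sequence and
    record where it landed, so a model can be probed on that temporal detail."""
    pos = random.randint(0, len(haystack_frames))
    frames = haystack_frames[:pos] + [needle_frame] + haystack_frames[pos:]
    return {"frames": frames, "needle_index": pos,
            "question": question, "answer": answer}
```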