Overview
MC-Bench is a benchmark for multi-context visual grounding that evaluates the ability of multimodal large language models (MLLMs) to localize instances across multiple images based on open-ended text prompts. It addresses a gap left by existing benchmarks, which focus either on single-image grounding or on multi-image understanding without precise localization, by combining multi-image inputs with instance-level grounding tasks.
Figure: MC-Bench's unique position combining multi-image input with instance-level grounding tasks
Key Specifications
MC-Bench contains 2,000 multi-context samples with 3,200 language-grounded bounding boxes spanning diverse domains including natural images, document photos, webpage screenshots, scientific diagrams, and artwork. The dataset features three text prompt styles:
- Referring (17.3%): Direct identification using category, attribute, or positional information
- Comparison (40.5%): Cross-image comparisons of visual content like object quantity or attributes
- Reasoning (42.2%): Complex descriptions requiring external knowledge and multi-hop reasoning
The benchmark covers 20 practical skills, ranging from document photo comprehension to forensic detection, with text prompts averaging 7.2 words. Each sample consists of an image pair with instance-level bounding box annotations.
Figure: Comprehensive overview of MC-Bench's 2,000 multi-context samples across various skills and domains
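The exact serialization of samples is not specified in this overview, so the snippet below is only a minimal illustrative sketch of how one sample could be represented, assuming per-image box lists in [x, y, w, h] format (all field names, file names, and coordinates are hypothetical):

```python
# Illustrative sketch of one MC-Bench-style sample (hypothetical field names and
# coordinates; the benchmark's actual file format may differ).
sample = {
    "id": "sample_0001",
    "prompt": "People with flags in their hands",
    "prompt_style": "referring",             # referring | comparison | reasoning
    "images": ["golf_a.jpg", "golf_b.jpg"],  # each sample is an image pair
    "boxes": {                               # one [x, y, w, h] box per grounded instance
        "golf_a.jpg": [[412, 95, 38, 120]],
        "golf_b.jpg": [[150, 60, 45, 140], [610, 80, 40, 130]],
        # an image containing no target instance would map to an empty list
    },
}
```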
Data Examples
Example 1 - Referring Style:
- Images: Two golf course photos
- Text prompt: "People with flags in their hands"
- Task: Locate all people holding flags across both images
- Expected output: Bounding boxes for flag-holding individuals in both images
Example 2 - Reasoning Style:
- Images: Two interview scenarios (TV studio setup, historical photos)
- Text prompt: "People with the same role in the interview"
- Task: Group people by their roles (interviewer, cameraman, interviewee) across contexts
- Expected output: Grouped bounding boxes identifying role-based matches
Figure: Examples showing successful grounding (left) and common failure modes (right) for different prompt styles
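For reasoning-style samples such as Example 2, the expected output additionally associates matched instances across the two images. The sketch below is one hypothetical way such a grouped result could be written down; the role labels, file names, and coordinates are illustrative and not taken from the benchmark's actual annotation files:

```python
# Hypothetical grouped result for Example 2 (illustrative names and coordinates).
grouped_output = {
    "prompt": "People with the same role in the interview",
    "groups": [
        {
            "role": "interviewer",
            "boxes": {"studio.jpg": [[220, 40, 90, 260]], "archive.jpg": [[35, 55, 80, 240]]},
        },
        {
            "role": "interviewee",
            "boxes": {"studio.jpg": [[520, 50, 95, 270]], "archive.jpg": [[310, 60, 85, 250]]},
        },
    ],
}
```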
Significance
MC-Bench reveals substantial limitations in current MLLMs for multi-context visual grounding. A comprehensive evaluation of over 20 models shows that even the best-performing systems fall well short of human performance (humans reach 89.5% accuracy versus 69.7% for the best model). Key findings include:
- Agentic approaches (GPT-4o + G-DINO) outperform end-to-end MLLMs, reaching 66.8% accuracy and 36.2% AP50 (a simplified pipeline is sketched at the end of this section)
- Scale benefits: Larger models like Qwen2-VL-72B show marked improvement over smaller variants
- Critical weaknesses: Models struggle with instance grouping, small object detection, and negative sample rejection
The benchmark highlights that current MLLMs predominantly excel at image-level understanding but fail at precise instance localization in multi-image contexts, indicating fundamental architectural limitations.
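To make the agentic finding above concrete: in such a setup, the MLLM (e.g. GPT-4o) handles the multi-image reasoning and image selection, while an open-vocabulary detector (G-DINO) handles precise localization. The sketch below shows only this division of labor; query_mllm and detect_boxes are hypothetical placeholders, not the authors' implementation.

```python
from typing import Dict, List


def query_mllm(image_paths: List[str], prompt: str) -> Dict:
    """Hypothetical helper: ask an MLLM (e.g. GPT-4o) which images contain the
    target and to rewrite the multi-context prompt as a short detector phrase."""
    raise NotImplementedError  # placeholder: call your MLLM API of choice here


def detect_boxes(image_path: str, phrase: str) -> List[List[float]]:
    """Hypothetical helper: run an open-vocabulary detector (e.g. Grounding DINO)
    on one image and return boxes in [x, y, w, h] format."""
    raise NotImplementedError  # placeholder: call your detector of choice here


def agentic_grounding(image_paths: List[str], prompt: str) -> Dict[str, List[List[float]]]:
    # Step 1: the MLLM handles the open-ended, cross-image reasoning, deciding
    # which images are relevant and producing a simple grounding phrase,
    # e.g. {"relevant": [0, 1], "phrase": "person holding flag"}.
    plan = query_mllm(image_paths, prompt)

    # Step 2: the detector handles precise localization, one image at a time.
    predictions = {path: [] for path in image_paths}  # negative images stay empty
    for idx in plan["relevant"]:
        predictions[image_paths[idx]] = detect_boxes(image_paths[idx], plan["phrase"])
    return predictions
```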
Usage
MC-Bench is publicly available at https://xuyunqiu.github.io/MC-Bench, with leaderboard updates planned. The benchmark uses standard evaluation metrics: Accuracy for image-level performance and AP50 (average precision at an IoU threshold of 0.5) for instance-level localization. Models are evaluated on their ability to identify the relevant images and to precisely localize target instances with bounding boxes in [x, y, w, h] format. MC-Bench is an evaluation-only dataset, with all 2,000 samples designated for testing.
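As a rough illustration of the instance-level criterion (a minimal sketch, not the official evaluation script), the IoU between a predicted and a ground-truth [x, y, w, h] box can be computed as follows; AP50 counts a prediction as a true positive when this value reaches 0.5:

```python
# Minimal sketch of the matching criterion behind AP50, assuming axis-aligned
# boxes in [x, y, w, h] format; the official evaluation code may differ.
def iou_xywh(box_a, box_b):
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Intersection rectangle
    ix1, iy1 = max(ax, bx), max(ay, by)
    ix2, iy2 = min(ax + aw, bx + bw), min(ay + ah, by + bh)
    iw, ih = max(0.0, ix2 - ix1), max(0.0, iy2 - iy1)
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


# Identical boxes give IoU 1.0; half-overlapping boxes of equal size give 1/3.
assert iou_xywh([0, 0, 10, 10], [0, 0, 10, 10]) == 1.0
assert iou_xywh([0, 0, 10, 10], [5, 0, 10, 10]) == 1.0 / 3.0
```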