AutoBench-V

An automated benchmark for evaluating large vision-language models across visual understanding capabilities through on-demand test case generation.

28 Oct 2024

The headline metric is the overall average accuracy of nine Large Vision-Language Models (LVLMs) across all five evaluation dimensions (Basic, Spatial, Semantic, Reasoning, Atmospheric) and all three difficulty levels (Easy, Medium, Hard) on the AutoBench-V benchmark. It is the most comprehensive single measure of model performance reported in the paper.

Overview

AutoBench-V is an automated benchmark framework designed to evaluate Large Vision-Language Models (LVLMs) across diverse multimodal capabilities. The system addresses key limitations of existing LVLM benchmarks by providing on-demand, automated evaluation that reduces human annotation effort and cost while maintaining rigorous assessment standards. The framework generates synthetic test cases using text-to-image models and orchestrates the entire evaluation pipeline through LVLMs themselves.

Figure: AutoBench-V framework overview. The complete AutoBench-V pipeline from user input to final scoring.

Key Specifications

AutoBench-V evaluates LVLMs across five core capability dimensions: Basic Understanding (object/scene recognition), Spatial Understanding (spatial relationships and positioning), Semantic Understanding (higher-level meaning interpretation), Reasoning Capacity (logical inference and causal analysis), and Atmospheric Understanding (mood and emotional ambiance). The framework employs a four-module architecture consisting of hierarchical aspect generation, guided description generation with semantic graph constraints, image generation with self-validation, and test case generation with bias mitigation.
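
The four-module flow can be summarized as a simple pipeline. The sketch below is illustrative only: the function names and placeholder bodies are assumptions, standing in for steps that the framework delegates to examiner LVLMs and a text-to-image model.

```python
# Illustrative sketch of the four-module pipeline described above.
# All function names and placeholder bodies are hypothetical; in the real
# framework each step is carried out by an examiner LVLM or an image generator.

def generate_aspect_hierarchy(user_input: str) -> list[str]:
    """Module 1: expand the user input into a hierarchy of fine-grained aspects."""
    return [f"{user_input} / sub-aspect {i}" for i in range(3)]  # placeholder

def generate_guided_description(aspect: str, difficulty: str) -> str:
    """Module 2: write an image description constrained by a semantic graph."""
    return f"[{difficulty}] scene illustrating {aspect}"  # placeholder

def generate_image(description: str) -> bytes:
    """Module 3: call the text-to-image model and self-validate the result."""
    return b"<image bytes>"  # placeholder

def generate_test_case(description: str) -> dict:
    """Module 4: produce a multiple-choice question with bias mitigation applied."""
    return {"question": f"Which option matches: {description}?",
            "choices": ["A", "B", "C", "D"],
            "answer": "A"}  # placeholder

def run_autobench_v(user_input: str, difficulty: str) -> list[dict]:
    """Chain the four modules into one on-demand test-generation pass."""
    cases = []
    for aspect in generate_aspect_hierarchy(user_input):
        description = generate_guided_description(aspect, difficulty)
        image = generate_image(description)
        case = generate_test_case(description)
        cases.append({"aspect": aspect, "description": description,
                      "image": image, **case})
    return cases
```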

The benchmark generates 720 test images per user input across three difficulty levels (easy, medium, hard), with each level containing 240 images. The system uses established models including GPT-4o, Gemini-1.5-Pro, and Claude-3.5-Sonnet as examiners, while Flux-1.1-Pro handles image generation. Evaluation covers nine representative LVLMs including both closed-source and open-source models.
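
Collected into one place, these specifications might look like the following configuration sketch; the key names and structure are assumed for illustration rather than taken from released code.

```python
# The specifications above, gathered into a single configuration dict.
# Structure and key names are illustrative, not from released AutoBench-V code.

AUTOBENCH_V_SPEC = {
    "dimensions": [
        "Basic Understanding",
        "Spatial Understanding",
        "Semantic Understanding",
        "Reasoning Capacity",
        "Atmospheric Understanding",
    ],
    "difficulty_levels": ["easy", "medium", "hard"],
    "images_per_level": 240,  # 3 levels x 240 images = 720 images per user input
    "examiner_models": ["GPT-4o", "Gemini-1.5-Pro", "Claude-3.5-Sonnet"],
    "image_generator": "Flux-1.1-Pro",
    "examinee_count": 9,      # nine representative LVLMs, closed- and open-source
}

assert (len(AUTOBENCH_V_SPEC["difficulty_levels"])
        * AUTOBENCH_V_SPEC["images_per_level"]) == 720
```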

Data Examples

Example 1: Spatial Understanding - Object Layering (Easy)

The benchmark tests spatial comprehension with questions like:

Question: What object is placed underneath the shiny green apple?

Choices:

  • A: A circular wooden coaster
  • B: A plain white ceramic plate
  • C: A checkered tablecloth
  • D: A small black napkin

Image Description: A shiny green apple with a stem and a single leaf rests on a circular wooden coaster.

Correct Answer: A
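
The paper does not prescribe a storage format for generated test cases; a sketch of how this example might be represented, with assumed field names, is shown below.

```python
# How a generated test case might be stored; the paper does not specify a
# serialization format, so these field names are assumed for illustration.

example_spatial_easy = {
    "dimension": "Spatial Understanding",
    "aspect": "Object Layering",
    "difficulty": "easy",
    "image_description": ("A shiny green apple with a stem and a single leaf "
                          "rests on a circular wooden coaster."),
    "question": "What object is placed underneath the shiny green apple?",
    "choices": {
        "A": "A circular wooden coaster",
        "B": "A plain white ceramic plate",
        "C": "A checkered tablecloth",
        "D": "A small black napkin",
    },
    "answer": "A",
}
```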

Example 2: Reasoning Capacity - Behavioral Prediction (Medium)

Question: Based on the scene depicted, what is the most likely reason the child is running towards the woman with a playful expression?

Choices:

  • A: The child wants the ice cream cone
  • B: The child is likely excited to see their mother/caregiver and is rushing for a warm welcome or hug after being apart
  • C: The child wants to play a game
  • D: The child is probably eager to show something to the woman, such as a new toy or accomplishment

Image Description: A woman holding a half-eaten ice cream cone, looking intently at a child running towards her with a playful expression, a colorful playground in the background.

Correct Answer: A
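
Given test cases like these, scoring an examinee model reduces to comparing its chosen options against the reference keys. The minimal sketch below uses assumed function and field names.

```python
# Minimal scoring sketch: an examinee LVLM answers each multiple-choice
# question, and accuracy is the fraction of answers matching the reference key.
# Function and variable names are illustrative.

def accuracy(test_cases: list[dict], model_answers: list[str]) -> float:
    correct = sum(case["answer"] == answer
                  for case, answer in zip(test_cases, model_answers))
    return correct / len(test_cases)

# e.g. accuracy([example_spatial_easy], ["A"]) == 1.0 with the dict defined earlier
```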

Significance

AutoBench-V represents a significant advancement in LVLM evaluation methodology. The benchmark reveals critical performance patterns: all models show substantial accuracy drops as difficulty increases (from 73.79% to 36.88% average on basic tasks), with spatial understanding emerging as the most challenging capability (20.78% average on hard tasks). The framework's bias mitigation strategies successfully prevent answer leakage, as evidenced by dramatic performance drops when models attempt text-only reasoning.

The benchmark identifies Claude-3.5-Sonnet as the strongest performer overall (51.81% average accuracy), while Llama-3.2-90B-Vision shows the weakest performance (31.75% average). Human evaluation confirms high alignment rates between generated content and intended assessments (95.20% for easy tasks, declining to 84.55% for hard tasks), validating the framework's quality.
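
As a rough sketch of how an overall average like these could be aggregated (the paper's exact weighting is not restated here), a per-model score can be taken as the mean over all dimension-by-difficulty cells:

```python
# Illustrative aggregation only: a plain mean over per-(dimension, difficulty)
# accuracies. The paper's exact averaging scheme is not reproduced here.

from statistics import mean

def overall_average(per_cell_accuracy: dict[tuple[str, str], float]) -> float:
    """per_cell_accuracy maps (dimension, difficulty) to accuracy in [0, 1]."""
    return mean(per_cell_accuracy.values())

# With five dimensions and three difficulty levels, each model contributes 15 cells.
```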

Usage

The paper does not provide explicit access information for AutoBench-V, suggesting it remains a research framework demonstrated in the publication rather than a publicly available benchmark. The system appears designed for research environments with access to the component models (examiner LVLMs and text-to-image generators) rather than as a standalone evaluation tool. Researchers interested in implementing similar automated evaluation frameworks can reference the detailed methodology and hyperparameter specifications provided in the paper.