Overview
MM-HELIX is a benchmark designed to evaluate the multimodal long-chain reflective reasoning capabilities of Multimodal Large Language Models (MLLMs). The benchmark addresses a critical gap in current MLLM evaluation: while these models show proficiency in direct reasoning tasks, their ability to perform iterative thinking, self-correction, and backtracking remains largely unexplored. MM-HELIX consists of 42 carefully designed tasks across four categories (Algorithms, Graphs, Puzzles, Games) that require models to comprehend complex rules, interpret visual inputs, and carry out multi-step thought processes.
Figure 1: Overview of the MM-HELIX benchmark framework, showing the four task categories, data generation pipeline, and the proposed Adaptive Hybrid Policy Optimization (AHPO) training method.
Key Specifications
Dataset Size: 1,260 evaluation instances (42 tasks × 30 instances each, spanning 5 difficulty levels), plus the MM-HELIX-100K training dataset of 100,000 high-quality reflective reasoning traces
Task Categories:
- Algorithms (9 tasks): Mathematical/computational challenges like "24 Points", "Best Time to Buy and Sell Stock", "Container With Most Water"
- Graphs (8 tasks): Graph analysis tasks including "Eulerian Cycle", "Max Flow", "Shortest Distance"
- Puzzles (19 tasks): Logic puzzles such as "Sudoku", "Nonogram", "Bridges", "Kakuro"
- Games (6 tasks): Strategic planning games like "Sokoban", "Minesweeper", "Tower of Hanoi"
Difficulty Levels: Five programmatically generated levels (1: very easy to 5: very hard) based on task-specific parameters like number of reasoning steps
Input Format: Multimodal inputs combining textual problem descriptions with visual representations (images of game boards, charts, puzzles, etc.)
Evaluation Metric: Accuracy, determined by exact-match comparison for simple answers or algorithmic verification through rule simulation for complex multi-step solutions
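To make the two verification modes concrete, here is a minimal Python sketch of how a single instance could be scored. The function names, the `step_fn`/`is_solved` hooks, and the whitespace-separated move format are illustrative assumptions, not the benchmark's released evaluation code.

```python
# Minimal sketch of the two verification modes described above.
# `step_fn` and `is_solved` stand in for task-specific rule implementations;
# they are assumptions, not part of the released framework.

def verify_exact_match(predicted: str, reference: str) -> bool:
    """Exact-match check for tasks with a single canonical answer."""
    return predicted.strip().lower() == reference.strip().lower()

def verify_by_simulation(moves: str, initial_state, step_fn, is_solved) -> bool:
    """Rule-simulation check: replay the predicted moves under the task's
    rules and accept only if every move is legal and the final state
    solves the instance."""
    state = initial_state
    for move in moves.split():
        state = step_fn(state, move)   # apply one move; returns None if illegal
        if state is None:
            return False
    return is_solved(state)
```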
Figure 2: Examples of tasks across all four categories in MM-HELIX, showing the diversity of visual and reasoning challenges.
Data Examples
Example 1: Aquarium Puzzle (Level 1)
Image: 4x4 grid with numbers indicating water levels per row/column
Question: Determine which cells are filled with water based on the following rules:
1. Each region must be filled to uniform water level
2. Water cannot float - filled cells must have support below
3. Numbers indicate filled cells per row/column
4. Regions separated by thick black lines
Answer Format: List coordinates of filled cells
Reference Answer: [(2,1), (3,1), (0,2), (3,2), (0,3), (1,3)]
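As an illustration of rule-simulation checking for this example, the sketch below verifies a set of filled cells against the Aquarium rules. The row/column clue values and region layout of the instance above are not reproduced here, so `row_counts`, `col_counts`, and `regions` are placeholders, and the (column, row) coordinate convention with row 0 at the top is an assumption.

```python
# A minimal Aquarium checker under stated assumptions: coordinates are
# (col, row) with row 0 at the top, and `regions` is a list of cell sets.
# The clue values are placeholders, not the instance shown above.

def check_aquarium(filled, row_counts, col_counts, regions, size=4):
    cells = set(filled)

    # Rule 3: filled-cell counts per row and per column must match the clues.
    for r in range(size):
        if sum((c, r) in cells for c in range(size)) != row_counts[r]:
            return False
    for c in range(size):
        if sum((c, r) in cells for r in range(size)) != col_counts[c]:
            return False

    # Rules 1-2: water in a region forms one horizontal level and cannot float,
    # so if any region cell is filled, every region cell at that row or below
    # must also be filled.
    for region in regions:
        filled_rows = {r for (c, r) in region if (c, r) in cells}
        if not filled_rows:
            continue
        level = min(filled_rows)  # topmost filled row of this region
        for (c, r) in region:
            if ((c, r) in cells) != (r >= level):
                return False
    return True
```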
Example 2: Nibbles Game (Level 5)
Image: Grid showing a snake and multiple apples
Goal: Find sequence of moves to eat all apples
Game Rules:
1. Control snake with up/down/left/right commands
2. Snake must eat all apples on grid
3. Snake grows longer by one segment when eating
4. Snake cannot collide with walls or itself
5. Snake moves one cell at a time
Output Format: Sequence of moves
Example: "up right down left up"
Figure 3: Detailed example of the Nibbles task showing the multimodal input format and expected reasoning process.
Significance
MM-HELIX reveals a profound deficit in current MLLMs' reflective reasoning capabilities. Even state-of-the-art models like GPT-5 achieve only 58.1% accuracy on multimodal inputs, while the best open-source model reaches 33.3%. The benchmark demonstrates a significant modality gap, with text-only performance substantially higher (e.g., GPT-5: 84.5% vs 58.1%), indicating that visual comprehension remains a major bottleneck for complex reasoning.
The benchmark introduces several key innovations:
- Programmatic Generation Framework: Automated instance generation with deterministic solvers and verifiers enables scalable evaluation (see the sketch after this list)
- Long-chain Reasoning Focus: Average chain-of-thought traces exceed 4,000 tokens, requiring sustained coherent reasoning
- Hierarchical Difficulty: Progressive complexity allows fine-grained analysis of model capabilities and failure modes
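Under stated assumptions about its structure, the programmatic generation loop referenced above might look like the following sketch; all `task.*` helpers (`difficulty_params`, `sample_instance`, `solve`, `verify`, `render_image`, `render_text`) are hypothetical stand-ins for the released pipeline's interfaces.

```python
# Hypothetical generate -> solve -> verify loop for one task and difficulty level.
# The `task.*` helpers are illustrative assumptions, not the released API.

import random

def make_instances(task, level, n=30, seed=0):
    rng = random.Random(seed)
    instances = []
    while len(instances) < n:
        params = task.difficulty_params(level)    # e.g. grid size, step count
        puzzle = task.sample_instance(rng, **params)
        solution = task.solve(puzzle)             # deterministic solver
        if solution is None:                      # discard unsolvable samples
            continue
        assert task.verify(puzzle, solution)      # verifier doubles as the eval checker
        instances.append({
            "image": task.render_image(puzzle),   # visual input shown to the MLLM
            "question": task.render_text(puzzle),
            "reference": solution,
            "level": level,
        })
    return instances
```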
The work also contributes MM-HELIX-100K, a large-scale dataset of reflective reasoning traces generated through the Step-Elicited Response Generation (SERG) pipeline, which reduces generation time by 90% while achieving a 99.8% success rate, compared with 25% for unconstrained generation.
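The exact SERG prompting strategy is not detailed here; the sketch below shows one plausible step-elicited loop, assuming the pipeline seeds solver-derived key steps into the prompt and keeps only traces whose extracted answer passes the task verifier. The `solver_steps`, `extract_answer`, and `model.generate` interfaces are hypothetical.

```python
# A hedged sketch of step-elicited trace generation. `solver_steps`,
# `extract_answer`, and `model.generate` are hypothetical interfaces.

def generate_reflective_trace(model, task, instance):
    key_steps = task.solver_steps(instance)   # milestone steps from the deterministic solver
    prompt = (
        task.render_text(instance)
        + "\n\nKey milestones (expand with your own reasoning, checking "
          "and revising intermediate results as needed):\n"
        + "\n".join(f"- {step}" for step in key_steps)
    )
    trace = model.generate(prompt, image=task.render_image(instance))
    answer = task.extract_answer(trace)
    # Keep only traces whose final answer is verified correct.
    return trace if task.verify(instance, answer) else None
```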
Usage
The MM-HELIX benchmark and related resources are available at: https://mm-helix.github.io/
Resources Available:
- Benchmark Dataset: MM-HELIX evaluation set with 1,260 instances
- Training Data: MM-HELIX-100K with high-quality reasoning traces
- Code Repository: Implementation of evaluation framework, data generation pipeline (SERG), and training methods
- Model Checkpoints: Pre-trained MM-HELIX-7B-Thinking model, demonstrating a +18.6% improvement over the baseline
The benchmark uses standard evaluation protocols with a temperature of 0.6 for thinking models and 0.0 for non-thinking models. The framework supports both multimodal and text-only evaluation modes, enabling comprehensive analysis of model capabilities across modalities.
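As a hedged illustration of this protocol, the snippet below queries a model through an OpenAI-compatible chat endpoint with the stated temperatures; the image-attachment schema and the surrounding harness are assumptions, not the released evaluation code.

```python
# Minimal sketch of querying a model under the stated protocol via an
# OpenAI-compatible endpoint; model names and image handling are assumptions.

from openai import OpenAI

client = OpenAI()  # assumes an OpenAI-compatible server for the model under test

def ask(model: str, question: str, image_url: str, thinking: bool = True) -> str:
    temperature = 0.6 if thinking else 0.0    # protocol described above
    resp = client.chat.completions.create(
        model=model,
        temperature=temperature,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    )
    return resp.choices[0].message.content
```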