Overview
Event-Bench is a benchmark dataset designed to evaluate video multimodal large language models (MLLMs) on event-oriented understanding in long videos. The benchmark addresses the "shortcut bias" in existing video understanding datasets, where questions can often be answered using just a few frames rather than requiring comprehension of the entire video sequence. Event-Bench focuses specifically on evaluating models' ability to understand sequential and interconnected events across extended temporal contexts.

Key Specifications
Dataset Size: 2,190 test instances across six sub-tasks
Video Duration: 2 to 1,088 seconds, substantially longer at the upper end than the videos in most existing benchmarks
Task Structure: Hierarchical taxonomy with three levels:
- Atomic Events (468 instances): Event Description (ED)
- Composite Events (800 instances): Temporal Reasoning (TR) and Causal Reasoning (CR)
- Overall Events (922 instances): Contextual Reasoning (CU), Episodic Reasoning (ER), and Counter-intuitive Reasoning (CIR)
Format: Multiple-choice questions with 4 options per question
Data Sources: Videos drawn from existing datasets (NExT-QA, STAR, EgoSchema, FunQA) and from YouTube, including manually annotated complex scenarios
Evaluation Metric: Accuracy, computed with a circular evaluation strategy to mitigate option-order bias
Data Examples
Temporal Reasoning Example: The video shows a sequence of interactions between a cat and a human. Question: "What did the human do when the cat bit the hands?" Options: A. Look at bird. B. Push it. C. Play with cat. D. Stand still. Answer: C
Episodic Reasoning Example: The video shows Bean in a restaurant across multiple scenes. Question: "What led to Bean deciding to quickly leave the restaurant?" Options: A. The waiter brought him more seafood. B. The lady's phone rang, causing a distraction. C. He saw the lady discovering the oysters in her bag. D. The lady's phone conversation ended suddenly. Answer: B
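For concreteness, the Temporal Reasoning example above could be stored as a simple record like the sketch below. The field names and layout ("video_id", "task", "options", etc.) are illustrative assumptions, not the schema actually used in the repository.

```python
# Hypothetical record layout for a single Event-Bench instance.
# Field names are illustrative assumptions, not the repository's actual schema.
example_instance = {
    "video_id": "example_0001",        # placeholder identifier
    "task": "Temporal Reasoning",      # one of the six sub-tasks
    "question": "What did the human do when the cat bit the hands?",
    "options": {
        "A": "Look at bird.",
        "B": "Push it.",
        "C": "Play with cat.",
        "D": "Stand still.",
    },
    "answer": "C",                     # correct option letter
}
```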
Significance
Event-Bench represents a significant advancement in video understanding evaluation by:
- Addressing shortcut bias: questions are filtered with multiple image MLLMs so that none can be answered from a single frame (see the sketch after this list)
- Focusing on events: the first benchmark designed specifically around event comprehension, rather than around videos in which events appear only incidentally
- Enabling long-context evaluation: Video durations up to 18+ minutes test models' ability to maintain context over extended sequences
- Hierarchical assessment: Three-level taxonomy allows granular evaluation of different reasoning capabilities
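A rough sketch of how such a single-frame filter could work is shown below. The function and the model interface (`image_mllms` as callables returning an option letter) are hypothetical stand-ins for illustration, not the authors' actual filtering pipeline.

```python
def is_shortcut_question(frames, question, options, answer, image_mllms):
    """Return True if any image MLLM answers the question correctly from a
    single frame, i.e. the question does not require watching the full video.

    `image_mllms` is a list of hypothetical callables mapping
    (frame, question, options) -> predicted option letter.
    """
    for frame in frames:
        for model in image_mllms:
            if model(frame, question, options) == answer:
                return True  # answerable from one frame: a shortcut question
    return False

# Questions flagged this way would be discarded or revised, keeping only
# those that require event-level understanding across many frames.
```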
The benchmark reveals substantial performance gaps between current models and human-level video understanding, with the best proprietary model (GPT-4o) achieving only 53.33% accuracy.
Usage
The benchmark is publicly available on GitHub at https://github.com/RUCAIBox/Event-Bench. The repository includes the complete dataset, evaluation code, and the proposed VIM (Video Instruction Merging) method, which achieves state-of-the-art performance among open-source models at 41.64% accuracy. Models are evaluated with a circular evaluation strategy in which each question is presented multiple times with shuffled option orders, reducing sensitivity to option ordering.
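As a rough illustration of that strategy, the sketch below scores one question under rotated option orders. `ask_model` is a hypothetical inference call, and the requirement that every rotation be answered correctly follows common circular-evaluation practice; the repository's evaluation code is the authoritative reference.

```python
def circular_eval(question, options, answer_key, ask_model):
    """Score one multiple-choice question under a circular strategy.

    `options` is a dict like {"A": ..., "B": ..., "C": ..., "D": ...} and
    `answer_key` is the correct letter in the original order. `ask_model`
    is a hypothetical callable (question, options_dict) -> chosen letter.
    The question is re-asked once per rotation of the option order and
    counts as correct only if every rotation is answered correctly
    (an assumed convention, borrowed from common circular evaluation).
    """
    letters = list(options.keys())        # e.g. ["A", "B", "C", "D"]
    texts = list(options.values())
    correct_text = options[answer_key]

    for shift in range(len(texts)):
        rotated = texts[shift:] + texts[:shift]      # rotate option texts
        presented = dict(zip(letters, rotated))      # relabel as A-D
        prediction = ask_model(question, presented)
        if presented.get(prediction) != correct_text:
            return False                             # one miss fails the question
    return True

# Benchmark accuracy is then the fraction of questions for which
# circular_eval(...) returns True.
```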