SeriesBench

A comprehensive benchmark for evaluating multimodal large language models on narrative-driven drama series understanding across visual, script, audio, augmentation, and comprehension tasks.

Overall accuracy on the SeriesBench benchmark, which evaluates MLLMs on narrative-driven drama series understanding. This metric is an average across five dimensions: Visuals, Script, Audio, Augmentation, and Comprehension. Results are for tasks that are either multiple-choice or judgment-based. Models marked '+ PC-DCoT' use the paper's proposed Plot & Character Dual Chain of Thought framework.
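
The overall metric can be read as a simple average over the five per-dimension scores. A minimal sketch of that aggregation follows; the unweighted mean and the placeholder values are assumptions, not reported results.

```python
# Hedged sketch: aggregate per-dimension accuracies into the overall score.
# The unweighted mean and the placeholder values are assumptions.
dimension_accuracy = {
    "Visuals": 0.0,
    "Script": 0.0,
    "Audio": 0.0,
    "Augmentation": 0.0,
    "Comprehension": 0.0,
}
overall = sum(dimension_accuracy.values()) / len(dimension_accuracy)
print(f"Overall accuracy: {overall:.1%}")
```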

Overview

SeriesBench is a benchmark designed to evaluate Multi-modal Large Language Models (MLLMs) on narrative-driven drama series understanding. Unlike existing video benchmarks that focus on standalone clips or static images, SeriesBench addresses the critical gap in evaluating models' ability to comprehend complex, continuous narratives, character development across multiple episodes, and multimodal elements beyond visual frames. The benchmark spans 105 narrative-driven series with 1,072 videos across diverse genres including daily life, anime, time-travel, historical drama, and fantasy.

Key Specifications

SeriesBench organizes evaluation across five primary dimensions with 28 fine-grained tasks (the taxonomy is sketched as a data structure after the list):

  • Visuals: Analyzes frames, figures (actions, interactions), scenes (transitions, spatiotemporal shifts), and objects (presence, interaction)
  • Script: Focuses on background (world-building, time/location), plot development, and character dynamics
  • Audio: Interprets dialogue attribution, music atmosphere, and sound effects
  • Augmentation: Examines post-production elements like subtitles, labels, and visual effects
  • Comprehension: Integrates all elements for overall narrative understanding
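
For reference, the taxonomy above maps onto a simple nested structure; the sketch below is one possible encoding that mirrors the list and is not the benchmark's actual schema.

```python
# One possible encoding of the five-dimension taxonomy listed above.
# Keys and sub-area names mirror the list; this is not SeriesBench's actual
# schema, and the 28 fine-grained tasks are not enumerated here.
SERIESBENCH_DIMENSIONS = {
    "Visuals": ["frames", "figures", "scenes", "objects"],
    "Script": ["background", "plot development", "character dynamics"],
    "Audio": ["dialogue attribution", "music atmosphere", "sound effects"],
    "Augmentation": ["subtitles", "labels", "visual effects"],
    "Comprehension": ["overall narrative understanding"],
}
```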

The dataset includes subtitles for each video (average 250.2 tokens per video) plus detailed thematic and character background information. Tasks use multiple evaluation formats: multiple-choice and judgment tasks (measured by accuracy), and open-ended questions (evaluated using BLEU-2, METEOR, and BERTScore F1).
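
The sketch below illustrates both scoring modes using common open-source metric implementations (NLTK for BLEU-2 and METEOR, the bert-score package for BERTScore F1); whether these are the exact libraries and settings used by the SeriesBench authors is an assumption.

```python
# Sketch of the two evaluation formats described above. Library choices
# (NLTK, bert-score) are assumptions, not necessarily the authors' setup.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.meteor_score import meteor_score  # needs NLTK's wordnet data
from bert_score import score as bert_score


def score_choice(prediction: str, answer: str) -> float:
    """Multiple-choice / judgment tasks: per-item exact-match accuracy."""
    return float(prediction.strip().upper() == answer.strip().upper())


def score_open_ended(prediction: str, reference: str) -> dict:
    """Open-ended tasks: BLEU-2, METEOR, and BERTScore F1."""
    pred_tokens, ref_tokens = prediction.split(), reference.split()
    bleu2 = sentence_bleu(
        [ref_tokens], pred_tokens,
        weights=(0.5, 0.5),  # BLEU-2: unigram + bigram precision
        smoothing_function=SmoothingFunction().method1,
    )
    meteor = meteor_score([ref_tokens], pred_tokens)
    _, _, f1 = bert_score([prediction], [reference], lang="en")
    return {"BLEU-2": bleu2, "METEOR": meteor, "BERTScore-F1": f1.item()}
```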

Data Examples

Example 1: Visual Action Recognition

Video frames showing classroom scene with students
Question: "What was Zhao Dezhu's action when facing the room check?"
Options: (A) Hiding behind the door (B) Acting on the floor (C) Jumping out of the window
Answer: **C**

Example 2: Plot Development (Open-ended)

Video sequence showing character interactions
Question: "What was Yingyan's wish after passing the exam, and what did the director decide?"
Ground Truth: "Yingyan wanted air conditioning or an eye cure and the Director agreed."
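
Purely for illustration, the two examples above could be stored as records like the following; the field names are hypothetical and do not reflect SeriesBench's actual release format.

```python
# Hypothetical record layout for the two examples above; field names are
# illustrative only, not SeriesBench's actual data format.
examples = [
    {
        "dimension": "Visuals",
        "format": "multiple-choice",
        "question": "What was Zhao Dezhu's action when facing the room check?",
        "options": {
            "A": "Hiding behind the door",
            "B": "Acting on the floor",
            "C": "Jumping out of the window",
        },
        "answer": "C",
    },
    {
        "dimension": "Script",
        "format": "open-ended",
        "question": ("What was Yingyan's wish after passing the exam, "
                     "and what did the director decide?"),
        "reference": ("Yingyan wanted air conditioning or an eye cure "
                      "and the Director agreed."),
    },
]
```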

Significance

SeriesBench represents a major advancement in video understanding evaluation by being the first benchmark specifically designed for narrative-driven series content. Key innovations include:

  • Comprehensive Multimodal Coverage: The only benchmark covering all five dimensions (visuals, script, audio, augmentation, comprehension) compared to existing benchmarks that focus primarily on visual understanding
  • Long-span Narrative Annotation: Novel annotation methodology requiring labelers to track events and characters across extended temporal spans and interconnected videos
  • PC-DCoT Framework: Introduces Plot & Character Dual Chain of Thought reasoning that consistently improves model performance by 10-14% across all evaluated models (a rough outline follows this list)
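
The sketch below gives a very rough outline of the dual-chain idea: retrieve video segments relevant to the question, build a plot-event chain and a character-temporal chain, then condition the model's answer on both. The `retriever` and `mllm` interfaces and all function names are placeholders, not the paper's actual implementation.

```python
# Rough outline of the PC-DCoT dual-chain idea described above.
# `retriever` and `mllm` are placeholder interfaces, not the paper's code.
def pc_dcot_answer(question, video_segments, retriever, mllm):
    # Chain 1 (plot-event): segments most relevant to the queried events.
    event_segments = retriever.search(video_segments, query=question, mode="event")
    plot_chain = mllm.describe_events(event_segments)  # ordered event summaries

    # Chain 2 (character-temporal): appearances of the relevant characters over time.
    char_segments = retriever.search(video_segments, query=question, mode="character")
    character_chain = mllm.track_characters(char_segments)  # per-character timeline

    # Fuse both chains into the prompt for the final answer.
    return mllm.answer(question, context=[plot_chain, character_chain])
```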

The benchmark reveals significant limitations in current MLLMs: even GPT-4o achieves only 62.8% accuracy compared to 95.8% human performance, highlighting the challenge of true narrative understanding versus simple visual recognition.

Usage

SeriesBench is publicly available on GitHub with comprehensive documentation. The benchmark uses an 8:1:1 train/validation/test split via stratified sampling. Models are evaluated using their official inference configurations, with frame sampling varying by model (32-128 frames). The PC-DCoT framework can be applied during inference to enhance performance, requiring a finetuned CN-CLIP retriever for constructing plot-event and character-temporal chains from video content.
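
As a small illustration of the 8:1:1 stratified split mentioned above, here is a minimal sketch using scikit-learn; the stratification labels (e.g. series genre) and the random seed are assumptions.

```python
# Minimal sketch of an 8:1:1 stratified split as described above.
# Stratification labels (e.g. series genre) and the seed are assumptions.
from sklearn.model_selection import train_test_split

def split_8_1_1(items, strata, seed=0):
    # First split off 20% of the data, preserving label proportions.
    train, rest, _, strata_rest = train_test_split(
        items, strata, test_size=0.2, stratify=strata, random_state=seed)
    # Split the held-out 20% evenly into validation and test (10% each overall).
    val, test = train_test_split(
        rest, test_size=0.5, stratify=strata_rest, random_state=seed)
    return train, val, test
```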