Overview
SIV-Bench (Social Interaction Video Benchmark) is an evaluation framework that assesses how well Multimodal Large Language Models (MLLMs) understand human social dynamics in video. The benchmark evaluates models along three core dimensions: Social Scene Understanding (SSU), Social State Reasoning (SSR), and Social Dynamics Prediction (SDP). Unlike existing benchmarks that focus on general video understanding, SIV-Bench specifically targets the complex, multi-layered nature of human social interactions through a structured analytical framework.
Key Specifications
Dataset Composition: 2,792 real-world video clips sourced from TikTok and YouTube, featuring diverse genres, presentation styles, and cultural backgrounds. The dataset includes 8,728 multiple-choice question-answer pairs generated through a human-LLM collaborative pipeline.
Task Structure: The benchmark decomposes social interaction understanding into 10 fine-grained sub-tasks:
- SSU (4 tasks): Action Recognition, Environment Perception, Facial Expression Recognition, Human Attribute Identification
- SSR (4 tasks): Intent Inference, Emotion Inference, Attitude Inference, Relation Inference
- SDP (2 tasks): Factual Prediction, Counterfactual Prediction
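The taxonomy above can be written down as a small lookup table. This is only an illustrative sketch: the names `SIV_BENCH_TASKS` and `dimension_of` are hypothetical and are not taken from the benchmark's code.

```python
# Task taxonomy as listed above. The dict name and layout are illustrative
# assumptions, not identifiers from the SIV-Bench repository.
SIV_BENCH_TASKS = {
    "SSU": [
        "Action Recognition",
        "Environment Perception",
        "Facial Expression Recognition",
        "Human Attribute Identification",
    ],
    "SSR": [
        "Intent Inference",
        "Emotion Inference",
        "Attitude Inference",
        "Relation Inference",
    ],
    "SDP": [
        "Factual Prediction",
        "Counterfactual Prediction",
    ],
}


def dimension_of(sub_task: str) -> str:
    """Return the core dimension (SSU, SSR, or SDP) that a sub-task belongs to."""
    for dimension, sub_tasks in SIV_BENCH_TASKS.items():
        if sub_task in sub_tasks:
            return dimension
    raise KeyError(f"Unknown sub-task: {sub_task}")


# Sanity check: the three dimensions together cover the 10 sub-tasks.
TOTAL_SUB_TASKS = sum(len(sub_tasks) for sub_tasks in SIV_BENCH_TASKS.values())
```

A mapping like this makes it easy to aggregate per-question results up to the dimension level when reporting scores.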
Social Relations Framework: Built on Fiske's Relational Models Theory, the benchmark categorizes interactions into 14 specific relationship types (parent-child, friends, colleagues, etc.).
Evaluation Conditions: Three subtitle conditions test the impact of textual cues: 'origin' (original video), '+sub' (added transcribed dialogue), and '-sub' (removed on-screen text).
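One way to compare the three conditions is to report accuracy separately for each. The sketch below assumes results are available as (condition, is_correct) pairs; that record format and the function name `accuracy_by_condition` are assumptions made for illustration, and the example records are toy values, not benchmark results.

```python
from collections import defaultdict

# Condition labels as named in the text above.
CONDITIONS = ("origin", "+sub", "-sub")


def accuracy_by_condition(records):
    """Compute accuracy (%) per subtitle condition.

    `records` is an iterable of (condition, is_correct) pairs. This record
    format is a hypothetical choice for the sketch.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for condition, is_correct in records:
        if condition not in CONDITIONS:
            raise ValueError(f"Unknown condition: {condition}")
        totals[condition] += 1
        hits[condition] += int(is_correct)
    # Only report conditions that actually have evaluated items.
    return {c: 100.0 * hits[c] / totals[c] for c in CONDITIONS if totals[c]}


# Toy records for illustration only.
toy_records = [
    ("origin", True),
    ("origin", False),
    ("+sub", True),
    ("-sub", False),
]
toy_result = accuracy_by_condition(toy_records)
```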
Data Examples
Social Scene Understanding - Action Recognition:
Question: "What action does the child perform with his hand?"
Options:
A. Touches his lips with his hand
B. Waves his hand in the air
C. Claps his hands together
D. Points to the sky with his finger
E. Shakes his hand playfully
Social State Reasoning - Relation Inference:
Question: "What social relationship does this video mainly represent?"
Options:
A. Couple
B. Parent-Child
C. Siblings
D. Grandparent-Child
...
L. Teammates
M. Service
N. Transactional
Social Dynamics Prediction - Counterfactual:
Question: "How might the caretaker's approach differ if the patient were more receptive to verbal instructions?"
Options:
A. Increased reliance on non-verbal cues
B. More frequent use of physical restraints
C. Less physical intervention
D. Heightened emphasis on medication administration
E. Greater focus on establishing a strict schedule
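Items like the ones above could be represented and rendered into a text prompt roughly as follows. This is a minimal sketch: the `MCQItem` class, its field names, and the closing instruction line are hypothetical and do not reflect the released data schema or the benchmark's actual prompt wording. The gold answer is left optional because the examples above do not state it.

```python
from dataclasses import dataclass
from string import ascii_uppercase
from typing import List, Optional


@dataclass
class MCQItem:
    """One multiple-choice item. Field names are illustrative assumptions."""
    sub_task: str
    question: str
    options: List[str]
    answer: Optional[str] = None  # gold option letter, if known


def format_prompt(item: MCQItem) -> str:
    """Render an item as lettered options (A, B, C, ...) in a text prompt."""
    lines = [f"Question: {item.question}", "Options:"]
    for letter, text in zip(ascii_uppercase, item.options):
        lines.append(f"{letter}. {text}")
    # Hypothetical instruction; the benchmark's own prompt may differ.
    lines.append("Answer with the option letter only.")
    return "\n".join(lines)


example = MCQItem(
    sub_task="Counterfactual Prediction",
    question=(
        "How might the caretaker's approach differ if the patient "
        "were more receptive to verbal instructions?"
    ),
    options=[
        "Increased reliance on non-verbal cues",
        "More frequent use of physical restraints",
        "Less physical intervention",
        "Heightened emphasis on medication administration",
        "Greater focus on establishing a strict schedule",
    ],
)
prompt = format_prompt(example)
```

Lettering by position also accommodates the longer relation-inference option list, which runs from A to N.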
Significance
SIV-Bench addresses a critical gap in MLLM evaluation by focusing specifically on social intelligence, a crucial yet underexplored domain. Current MLLMs perform well on visual perception tasks (SSU) but struggle significantly with inferential reasoning about social states and relationships (SSR). The benchmark reveals that even state-of-the-art models like Gemini-2.5-Pro achieve only 76.50% overall accuracy, with particularly poor performance on relation inference tasks.
The benchmark's key contributions include: (1) a structured framework that decomposes complex social understanding into measurable components, (2) systematic evaluation of linguistic cue dependencies, showing that transcribed dialogue consistently improves performance on reasoning tasks, and (3) detailed failure pattern analysis revealing specific bottlenecks in current MLLM capabilities, particularly in differentiating primary from secondary relationships and integrating commonsense social knowledge.
Usage
SIV-Bench is publicly available, including its complete dataset, code, and evaluation protocols. Researchers can access the benchmark at https://kfq20.github.io/sivbench/; the dataset is hosted on Hugging Face and the code repository on GitHub. The evaluation uses VLMEvalKit with standardized prompting across all models. Performance is measured as accuracy (%), using a two-stage answer matching procedure to handle diverse model output formats. The benchmark serves both as a diagnostic tool for identifying current MLLM limitations and as a development target for advancing social AI capabilities.
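The text does not spell out the two stages of the answer matching procedure, so the following is only one plausible sketch and not the benchmark's implementation. It assumes stage 1 extracts an explicit option letter from the model output and stage 2 falls back to matching the option text. The function names and the regular expression are hypothetical.

```python
import re
from typing import Dict, List, Optional

# Stage-1 pattern (an assumption): the whole output is an option letter,
# optionally wrapped as "Answer: C", "The answer is (C).", "(C)", or "C.".
_LETTER_ONLY = re.compile(
    r"^(?:the\s+)?(?:answer\s*(?:is)?\s*:?\s*)?\(?([A-Z])\)?\.?$",
    re.IGNORECASE,
)


def match_answer(output: str, options: Dict[str, str]) -> Optional[str]:
    """Map a free-form model output to an option letter, or None if unclear."""
    text = output.strip()

    # Stage 1: the output is (essentially) just an option letter.
    found = _LETTER_ONLY.match(text)
    if found:
        letter = found.group(1).upper()
        if letter in options:
            return letter

    # Stage 2: fall back to the option text; accept only a unique match.
    lowered = text.lower()
    hits = [k for k, v in options.items() if v.lower() in lowered]
    return hits[0] if len(hits) == 1 else None


def accuracy(predictions: List[Optional[str]], golds: List[str]) -> float:
    """Accuracy in percent; unmatched predictions (None) count as wrong."""
    correct = sum(p == g for p, g in zip(predictions, golds))
    return 100.0 * correct / len(golds)


demo_options = {
    "A": "Increased reliance on non-verbal cues",
    "B": "More frequent use of physical restraints",
    "C": "Less physical intervention",
}
```

Requiring the letter to be the entire output in stage 1 avoids misreading an ordinary word such as the article "A" at the start of a sentence as an answer. Outputs that match neither stage are scored as incorrect here, which is itself a design assumption.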