SIV-Bench

A benchmark for evaluating multimodal large language models on social interaction understanding and reasoning using real-world video clips.

05 Jun 2025

Overall accuracy of Multimodal Large Language Models (MLLMs) on SIV-Bench, which evaluates understanding of human social interactions. This evaluation uses the '+sub' condition, in which videos are supplemented with transcribed and translated dialogue, giving models explicit linguistic cues. It therefore reflects model performance with the most complete information.

Overview

SIV-Bench (Social Interaction Video Benchmark) is an evaluation framework designed to assess how well Multimodal Large Language Models (MLLMs) understand human social dynamics in video content. The benchmark evaluates models across three core dimensions: Social Scene Understanding (SSU), Social State Reasoning (SSR), and Social Dynamics Prediction (SDP). Unlike existing benchmarks that focus on general video understanding, SIV-Bench specifically targets the complex, multi-layered nature of human social interactions through a structured analytical framework.

Key Specifications

Dataset Composition: 2,792 real-world video clips sourced from TikTok and YouTube, featuring diverse genres, presentation styles, and cultural backgrounds. The dataset includes 8,728 multiple-choice question-answer pairs generated through a human-LLM collaborative pipeline.

Task Structure: The benchmark decomposes social interaction understanding into 10 fine-grained sub-tasks:

SSU (4 tasks): Action Recognition, Environment Perception, Facial Expression Recognition, Human Attribute Identification
SSR (4 tasks): Intent Inference, Emotion Inference, Attitude Inference, Relation Inference
SDP (2 tasks): Factual Prediction, Counterfactual Prediction
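
The taxonomy above can be written down as a small lookup structure, which is convenient when aggregating per-task results by dimension. This is a minimal sketch: the dimension and sub-task names come from the list above, while the variable and function names (`SIV_BENCH_TASKS`, `SUBTASK_TO_DIMENSION`, `total_subtasks`) are illustrative and are not taken from the benchmark's code.

```python
# Dimension -> sub-tasks, as listed in the benchmark description.
# The identifiers here are illustrative, not the benchmark's own.
SIV_BENCH_TASKS = {
    "SSU": [
        "Action Recognition",
        "Environment Perception",
        "Facial Expression Recognition",
        "Human Attribute Identification",
    ],
    "SSR": [
        "Intent Inference",
        "Emotion Inference",
        "Attitude Inference",
        "Relation Inference",
    ],
    "SDP": [
        "Factual Prediction",
        "Counterfactual Prediction",
    ],
}

# Reverse lookup: sub-task name -> dimension it belongs to.
SUBTASK_TO_DIMENSION = {
    subtask: dimension
    for dimension, subtasks in SIV_BENCH_TASKS.items()
    for subtask in subtasks
}


def total_subtasks() -> int:
    """Count the fine-grained sub-tasks across all dimensions."""
    return sum(len(subtasks) for subtasks in SIV_BENCH_TASKS.values())
```

For example, `SUBTASK_TO_DIMENSION["Relation Inference"]` gives `"SSR"`, and `total_subtasks()` gives the 10 sub-tasks stated above.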

Social Relations Framework: Built on Fiske's Relational Models Theory, categorizing interactions across 14 specific relationship types (parent-child, friends, colleagues, etc.).

Evaluation Conditions: Three subtitle conditions test the impact of textual cues: 'origin' (original video), '+sub' (added transcribed dialogue), and '-sub' (removed on-screen text).

Data Examples

Social Scene Understanding - Action Recognition:

Question: "What action does the child perform with his hand?"
Options:
A. Touches his lips with his hand
B. Waves his hand in the air
C. Claps his hands together
D. Points to the sky with his finger
E. Shakes his hand playfully

Social State Reasoning - Relation Inference:

Question: "What social relationship does this video mainly represent?"
Options:
A. Couple
B. Parent-Child
C. Siblings
D. Grandparent-Child
...
L. Teammates
M. Service
N. Transactional

Social Dynamics Prediction - Counterfactual:

Question: "How might the caretaker's approach differ if the patient were more receptive to verbal instructions?"
Options:
A. Increased reliance on non-verbal cues
B. More frequent use of physical restraints
C. Less physical intervention
D. Heightened emphasis on medication administration
E. Greater focus on establishing a strict schedule
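
The examples above all share one shape: a question plus a lettered list of options. A minimal sketch of that shape follows, assuming a hypothetical `MCQItem` class and a plain-text rendering that mirrors the layout shown here. The benchmark's actual data schema and prompt template are not given in this overview, so none of these names should be read as the official format.

```python
from dataclasses import dataclass
from string import ascii_uppercase


@dataclass
class MCQItem:
    """Hypothetical container for one multiple-choice question."""
    question: str
    options: list[str]

    def render(self) -> str:
        """Lay out the question and lettered options as plain text."""
        lines = [f"Question: {self.question}", "Options:"]
        lines += [
            f"{ascii_uppercase[i]}. {option}"
            for i, option in enumerate(self.options)
        ]
        return "\n".join(lines)


# The counterfactual example from above, expressed as an MCQItem.
item = MCQItem(
    question=(
        "How might the caretaker's approach differ if the patient "
        "were more receptive to verbal instructions?"
    ),
    options=[
        "Increased reliance on non-verbal cues",
        "More frequent use of physical restraints",
        "Less physical intervention",
        "Heightened emphasis on medication administration",
        "Greater focus on establishing a strict schedule",
    ],
)
```

Calling `item.render()` reproduces the question followed by options A through E, one per line.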

Significance

SIV-Bench addresses a gap in MLLM evaluation by focusing specifically on social intelligence, a crucial yet underexplored domain. Current MLLMs perform well on visual perception tasks (SSU) but are markedly weaker at inferential reasoning about social states and relationships (SSR). Even state-of-the-art models such as Gemini-2.5-Pro reach only 76.50% overall accuracy, and relation inference is a particularly weak spot.

The benchmark's key contributions include: (1) a structured framework that decomposes complex social understanding into measurable components, (2) systematic evaluation of linguistic cue dependencies, showing that transcribed dialogue consistently improves performance on reasoning tasks, and (3) detailed failure pattern analysis revealing specific bottlenecks in current MLLM capabilities, particularly in differentiating primary from secondary relationships and integrating commonsense social knowledge.

Usage

SIV-Bench is publicly available with the complete dataset, code, and evaluation protocols. The project page is at https://kfq20.github.io/sivbench/; the dataset is hosted on Hugging Face and the code on GitHub. Evaluation uses VLMEvalKit with standardized prompting across all models. Performance is reported as accuracy (%), computed with a two-stage answer matching procedure that handles diverse model output formats. The benchmark serves both as a diagnostic tool for identifying current MLLM limitations and as a development target for advancing social AI capabilities.
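
This overview does not say what the two stages of answer matching are. The sketch below shows one plausible arrangement, offered purely as an assumption: stage 1 looks for an explicit option letter, and stage 2 falls back to matching option text when no letter is found. The functions `match_answer` and `accuracy` are hypothetical and do not use or reproduce the VLMEvalKit API.

```python
import re
from typing import Optional


def match_answer(response: str, options: dict[str, str]) -> Optional[str]:
    """Map a free-form model response to an option letter, or None.

    Assumed two-stage scheme (not confirmed by the benchmark):
      1. look for an explicit option letter,
      2. otherwise look for exactly one option's text in the response.
    """
    letters = "".join(options)
    text = response.strip()

    # Stage 1: a leading letter such as "B", "(B)", "B." or "B)",
    # or a phrase such as "answer is C" / "Answer: C".
    patterns = [
        rf"^\(?([{letters}])\)?(?:[.:)]|$)",
        rf"(?i:answer)\s*(?:is|:)?\s*\(?([{letters}])\)?(?![A-Za-z])",
    ]
    for pattern in patterns:
        found = re.search(pattern, text)
        if found:
            return found.group(1)

    # Stage 2: case-insensitive search for option text; accept only
    # when exactly one option appears, to avoid ambiguous matches.
    lowered = text.lower()
    hits = [key for key, value in options.items() if value.lower() in lowered]
    return hits[0] if len(hits) == 1 else None


def accuracy(predictions: list[Optional[str]], gold: list[str]) -> float:
    """Accuracy in percent; unmatched (None) predictions count as wrong."""
    correct = sum(p == g for p, g in zip(predictions, gold))
    return 100.0 * correct / len(gold)
```

Requiring a unique hit in stage 2 is a deliberate choice in this sketch: a response that quotes several options is treated as unmatched rather than guessed at.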
