MultiVENT-G

A multilingual benchmark for evaluating AI systems' ability to ground partially-defined events by extracting information from video-text pairs in five languages.

Released: 07 Oct 2024

Measures the CEAF-RME F1 score (FC) for the text span retrieval task on the MultiVENT-G benchmark. The metric uses the Kuhn-Munkres (Hungarian) algorithm to compute an optimal bipartite matching between predicted and ground-truth spans, granting partial credit for overlapping spans, which makes it a more robust measure of retrieval quality than exact-match scoring.
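
The scoring idea can be illustrated with a short sketch. This is not the official MultiVENT-G scorer; the character-overlap partial-credit function below is an illustrative assumption, and only the Kuhn-Munkres matching step mirrors the description above.

```python
# Sketch of CEAF-style span matching with partial credit (illustrative, not the
# official scorer). Spans are (start, end) character offsets.
import numpy as np
from scipy.optimize import linear_sum_assignment

def overlap_score(pred, gold):
    """Partial-credit similarity between two character spans (Dice overlap)."""
    inter = max(0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    return 2 * inter / ((pred[1] - pred[0]) + (gold[1] - gold[0])) if inter else 0.0

def ceaf_f1(pred_spans, gold_spans):
    """Optimal one-to-one alignment via the Kuhn-Munkres algorithm, then
    precision/recall/F1 over the summed alignment scores."""
    if not pred_spans or not gold_spans:
        return 0.0
    scores = np.array([[overlap_score(p, g) for g in gold_spans] for p in pred_spans])
    rows, cols = linear_sum_assignment(scores, maximize=True)
    total = scores[rows, cols].sum()
    precision, recall = total / len(pred_spans), total / len(gold_spans)
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# One exact match, one partial overlap, one missed gold span -> F1 ~0.72.
print(ceaf_f1([(0, 4), (10, 20)], [(0, 4), (12, 22), (30, 35)]))
```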

Overview

MultiVENT-G (MultiVENT-Grounded) is a multimodal benchmark designed to evaluate AI systems' ability to extract "partially-defined events" from collections of unstructured video and text data. Unlike traditional event extraction tasks, where events are fully contained within the data, MultiVENT-G focuses on events that exist beyond the provided media: each video-text pair offers only a partial observation of a larger, ongoing real-world event. The benchmark formulates event extraction as a three-stage span retrieval task across text, temporal, and spatial modalities.

Figure 1: Illustration of partially-defined events, where multiple videos (A, B, C) provide incomplete observations of a larger event through different sub-events and roles.

Key Specifications

Dataset Size: 1,168 densely annotated video-text pairs across 5 languages (Arabic, Chinese, English, Korean, Russian)

Languages: Multilingual coverage with English comprising the largest subset (414 videos), followed by Chinese (234), Korean (208), Russian (187), and Arabic (125)

Event Categories: Seven event templates covering Emergency/Disaster, Election, Political Development, Demonstration, Social Event, Sports, and Discovery/Launch

Annotation Density: 22,800+ labeled event-centric entities with professional linguist annotations, including natural language descriptions, OCR flags, and human confidence scores

Task Structure: Three sequential stages:

  • Stage 1: Text span retrieval from accompanying documents
  • Stage 2: Temporal span retrieval from video content
  • Stage 3: Spatial span retrieval via bounding box localization

Figure 2: The three-stage span retrieval pipeline, showing how each stage processes a different modality to extract role-filling information.
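
The three stages above can be composed sequentially. The sketch below shows one possible interface; the function names and return types are illustrative assumptions rather than an official MultiVENT-G API, and each stage body is a stub standing in for a model.

```python
# Hypothetical composition of the three retrieval stages (names and types are
# illustrative assumptions, not an official API).
from typing import Dict, List, Tuple

def retrieve_text_spans(document: str, question: str) -> List[Tuple[int, int]]:
    """Stage 1: (start, end) character offsets in the accompanying document."""
    return []  # stub: a real system would run a text-grounding model here

def retrieve_temporal_spans(video_path: str, question: str) -> List[Tuple[float, float]]:
    """Stage 2: (start_s, end_s) video segments, in seconds."""
    return []  # stub: a real system would run a temporal-grounding model here

def retrieve_spatial_spans(video_path: str, segments: List[Tuple[float, float]],
                           question: str) -> List[Tuple[float, Tuple[float, float, float, float]]]:
    """Stage 3: (timestamp, (x1, y1, x2, y2)) boxes within the Stage 2 segments."""
    return []  # stub: a real system would run a spatial-grounding model here

def ground_event(document: str, video_path: str, role_questions: List[str]) -> Dict[str, dict]:
    """Answer each templated role question with text, temporal, and spatial spans."""
    results = {}
    for question in role_questions:
        temporal = retrieve_temporal_spans(video_path, question)
        results[question] = {
            "text": retrieve_text_spans(document, question),
            "temporal": temporal,
            "spatial": retrieve_spatial_spans(video_path, temporal, question),
        }
    return results
```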

Data Examples

Example 1 - Emergency Event (Notre Dame Fire):

Text Input: "Massive plumes of smoke and intense flames pouring from the centuries-old #NotreDame cathedral in #Paris were captured on camera."

Template Question: "What emergency/disaster is occurring?"
Text Span Output: ["fire", "flames"]

Temporal Question: "Where is the emergency/disaster occurring?"
Temporal Span Output: ["15.2 - 28.0 seconds"]

Spatial Question: "What was the outcome of the emergency/disaster?"
Spatial Output: Bounding boxes around "smoke plumes" and "cathedral damage"

Example 2 - Demonstration Event:

Video Input: Protest scene with police presence

Template Questions:
- "Who are the protesters?" → Text/Visual spans: ["demonstrators", "crowd"]
- "What law enforcement was involved?" → Spatial spans: Bounding boxes around police officers
- "When did the protest occur?" → Text spans: ["afternoon", "October 2023"]

Significance

MultiVENT-G addresses a critical gap in multimodal AI evaluation by focusing on realistic, incomplete information scenarios that mirror human news consumption and event understanding. The benchmark's key contributions include:

Novel Task Formulation: Shifts from template-filling to span retrieval, requiring models to ground conclusions in specific data segments rather than generating answers from implicit knowledge.

Multilingual Multimodal Coverage: Provides one of the first dense multilingual video-text datasets for event extraction, crucial for global AI applications.

Granular Evaluation Framework: The three-stage decomposition allows detailed analysis of model capabilities across different modalities and reasoning types.

Real-world Complexity: Uses authentic news content with inherent noise, ambiguity, and partial information that reflects genuine information processing challenges.

Usage

MultiVENT-G is released as an open dataset for academic research. The benchmark includes standardized evaluation metrics for each stage:

  • Text: Span-based precision/recall/F1 and CEAF-RME scores
  • Temporal: Role-filling IoU at multiple thresholds (0.5, 0.7, 1.0)
  • Spatial: Modified IoU metrics and semantic similarity scores for caption grounding
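
The temporal and spatial metrics build on standard interval and bounding-box IoU. The following is a minimal sketch of those two computations under that assumption; role-level aggregation and the threshold sweep are omitted, and this is not the official scorer.

```python
# Sketch of the interval and box IoU computations underlying the temporal and
# spatial metrics (aggregation across roles and videos omitted).
from typing import Tuple

def temporal_iou(pred: Tuple[float, float], gold: Tuple[float, float]) -> float:
    """IoU between two (start_s, end_s) video segments."""
    inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - inter
    return inter / union if union > 0 else 0.0

def box_iou(pred: Tuple[float, float, float, float],
            gold: Tuple[float, float, float, float]) -> float:
    """IoU between two (x1, y1, x2, y2) bounding boxes."""
    ix = max(0.0, min(pred[2], gold[2]) - max(pred[0], gold[0]))
    iy = max(0.0, min(pred[3], gold[3]) - max(pred[1], gold[1]))
    inter = ix * iy
    union = ((pred[2] - pred[0]) * (pred[3] - pred[1])
             + (gold[2] - gold[0]) * (gold[3] - gold[1]) - inter)
    return inter / union if union > 0 else 0.0

# A prediction counts as correct at threshold t when IoU >= t (e.g. t in {0.5, 0.7, 1.0}).
print(temporal_iou((15.2, 28.0), (14.0, 27.0)))   # ~0.84
print(box_iou((0, 0, 10, 10), (5, 5, 15, 15)))    # ~0.14
```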

Initial baselines show significant room for improvement: GPT-4o achieves 67.2 F1 on text retrieval, TimeChat variants reach ~33 F1 on temporal grounding, and spatial grounding remains challenging with best IoU scores around 24. The benchmark establishes that current multimodal models struggle with partial event understanding, particularly in cross-modal reasoning and precise spatial localization.