MultiVENT-G

A multilingual benchmark for evaluating AI systems' ability to ground partially-defined events by extracting information from video-text pairs in five languages.

Released: 07 Oct 2024

Measures the CEAF-RME F1 score (FC) for the text span retrieval task on the MultiVENT-G benchmark. The metric uses the Kuhn-Munkres (Hungarian) algorithm to compute an optimal bipartite matching between predicted and ground-truth spans, granting partial credit for overlapping spans, which makes it a more robust measure of retrieval quality than exact-match scoring.
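
The scoring idea can be illustrated with a short sketch. This is not the official MultiVENT-G scorer; the character-overlap partial-credit function below is an illustrative assumption, and only the Kuhn-Munkres matching step mirrors the description above.

```python
# Sketch of CEAF-style span matching with partial credit (illustrative, not the
# official scorer). Spans are (start, end) character offsets.
import numpy as np
from scipy.optimize import linear_sum_assignment

def overlap_score(pred, gold):
    """Partial-credit similarity between two character spans (Dice overlap)."""
    inter = max(0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    return 2 * inter / ((pred[1] - pred[0]) + (gold[1] - gold[0])) if inter else 0.0

def ceaf_f1(pred_spans, gold_spans):
    """Optimal one-to-one alignment via the Kuhn-Munkres algorithm, then
    precision/recall/F1 over the summed alignment scores."""
    if not pred_spans or not gold_spans:
        return 0.0
    scores = np.array([[overlap_score(p, g) for g in gold_spans] for p in pred_spans])
    rows, cols = linear_sum_assignment(scores, maximize=True)
    total = scores[rows, cols].sum()
    precision, recall = total / len(pred_spans), total / len(gold_spans)
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# One exact match, one partial overlap, one missed gold span -> F1 ~0.72.
print(ceaf_f1([(0, 4), (10, 20)], [(0, 4), (12, 22), (30, 35)]))
```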

Overview

MultiVENT-G (MultiVENT-Grounded) is a multimodal benchmark designed to evaluate AI systems' ability to extract "partially-defined events" from collections of unstructured video and text data. Unlike traditional event extraction tasks, where events are fully contained within the data, MultiVENT-G focuses on events that exist beyond the provided media: each video-text pair offers only a partial observation of a larger, ongoing real-world event. The benchmark formulates event extraction as a three-stage span retrieval task across text, temporal, and spatial modalities.

Figure 1: Illustration of partially-defined events, where multiple videos (A, B, C) provide incomplete observations of a larger event through different sub-events and roles.

Key Specifications

Dataset Size: 1,168 densely annotated video-text pairs across 5 languages (Arabic, Chinese, English, Korean, Russian)

Languages: Multilingual coverage with English comprising the largest subset (414 videos), followed by Chinese (234), Korean (208), Russian (187), and Arabic (125)

Event Categories: Seven event templates covering Emergency/Disaster, Election, Political Development, Demonstration, Social Event, Sports, and Discovery/Launch

Annotation Density: 22,800+ labeled event-centric entities with professional linguist annotations, including natural language descriptions, OCR flags, and human confidence scores

Task Structure: Three sequential stages:

  • Stage 1: Text span retrieval from accompanying documents
  • Stage 2: Temporal span retrieval from video content
  • Stage 3: Spatial span retrieval via bounding box localization

Figure 2: The three-stage span retrieval pipeline, showing how each stage processes a different modality to extract role-filling information.
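
The three stages above can be composed sequentially. The sketch below shows one possible interface; the function names and return types are illustrative assumptions rather than an official MultiVENT-G API, and each stage body is a stub standing in for a model.

```python
# Hypothetical composition of the three retrieval stages (names and types are
# illustrative assumptions, not an official API).
from typing import Dict, List, Tuple

def retrieve_text_spans(document: str, question: str) -> List[Tuple[int, int]]:
    """Stage 1: (start, end) character offsets in the accompanying document."""
    return []  # stub: a real system would run a text-grounding model here

def retrieve_temporal_spans(video_path: str, question: str) -> List[Tuple[float, float]]:
    """Stage 2: (start_s, end_s) video segments, in seconds."""
    return []  # stub: a real system would run a temporal-grounding model here

def retrieve_spatial_spans(video_path: str, segments: List[Tuple[float, float]],
                           question: str) -> List[Tuple[float, Tuple[float, float, float, float]]]:
    """Stage 3: (timestamp, (x1, y1, x2, y2)) boxes within the Stage 2 segments."""
    return []  # stub: a real system would run a spatial-grounding model here

def ground_event(document: str, video_path: str, role_questions: List[str]) -> Dict[str, dict]:
    """Answer each templated role question with text, temporal, and spatial spans."""
    results = {}
    for question in role_questions:
        temporal = retrieve_temporal_spans(video_path, question)
        results[question] = {
            "text": retrieve_text_spans(document, question),
            "temporal": temporal,
            "spatial": retrieve_spatial_spans(video_path, temporal, question),
        }
    return results
```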

Data Examples

Example 1 - Emergency Event (Notre Dame Fire):

Text Input: "Massive plumes of smoke and intense flames pouring from the centuries-old #NotreDame cathedral in #Paris were captured on camera."

Template Question: "What emergency/disaster is occurring?"
Text Span Output: ["fire", "flames"]

Temporal Question: "Where is the emergency/disaster occurring?"
Temporal Span Output: ["15.2 - 28.0 seconds"]

Spatial Question: "What was the outcome of the emergency/disaster?"
Spatial Output: Bounding boxes around "smoke plumes" and "cathedral damage"

Example 2 - Demonstration Event:

Video Input: Protest scene with police presence

Template Questions:
- "Who are the protesters?" → Text/Visual spans: ["demonstrators", "crowd"]
- "What law enforcement was involved?" → Spatial spans: Bounding boxes around police officers
- "When did the protest occur?" → Text spans: ["afternoon", "October 2023"]

Significance

MultiVENT-G addresses a critical gap in multimodal AI evaluation by focusing on realistic, incomplete information scenarios that mirror human news consumption and event understanding. The benchmark's key contributions include:

Novel Task Formulation: Shifts from template-filling to span retrieval, requiring models to ground conclusions in specific data segments rather than generating answers from implicit knowledge.

Multilingual Multimodal Coverage: Provides one of the first dense multilingual video-text datasets for event extraction, crucial for global AI applications.

Granular Evaluation Framework: The three-stage decomposition allows detailed analysis of model capabilities across different modalities and reasoning types.

Real-world Complexity: Uses authentic news content with inherent noise, ambiguity, and partial information that reflects genuine information processing challenges.

Usage

MultiVENT-G is released as an open dataset for academic research. The benchmark includes standardized evaluation metrics for each stage:

  • Text: Span-based precision/recall/F1 and CEAF-RME scores
  • Temporal: Role-filling IoU at multiple thresholds (0.5, 0.7, 1.0)
  • Spatial: Modified IoU metrics and semantic similarity scores for caption grounding
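
The temporal and spatial metrics build on standard interval and bounding-box IoU. The following is a minimal sketch of those two computations under that assumption; role-level aggregation and the threshold sweep are omitted, and this is not the official scorer.

```python
# Sketch of the interval and box IoU computations underlying the temporal and
# spatial metrics (aggregation across roles and videos omitted).
from typing import Tuple

def temporal_iou(pred: Tuple[float, float], gold: Tuple[float, float]) -> float:
    """IoU between two (start_s, end_s) video segments."""
    inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - inter
    return inter / union if union > 0 else 0.0

def box_iou(pred: Tuple[float, float, float, float],
            gold: Tuple[float, float, float, float]) -> float:
    """IoU between two (x1, y1, x2, y2) bounding boxes."""
    ix = max(0.0, min(pred[2], gold[2]) - max(pred[0], gold[0]))
    iy = max(0.0, min(pred[3], gold[3]) - max(pred[1], gold[1]))
    inter = ix * iy
    union = ((pred[2] - pred[0]) * (pred[3] - pred[1])
             + (gold[2] - gold[0]) * (gold[3] - gold[1]) - inter)
    return inter / union if union > 0 else 0.0

# A prediction counts as correct at threshold t when IoU >= t (e.g. t in {0.5, 0.7, 1.0}).
print(temporal_iou((15.2, 28.0), (14.0, 27.0)))   # ~0.84
print(box_iou((0, 0, 10, 10), (5, 5, 15, 15)))    # ~0.14
```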

Initial baselines show significant room for improvement: GPT-4o achieves 67.2 F1 on text retrieval, TimeChat variants reach ~33 F1 on temporal grounding, and spatial grounding remains challenging with best IoU scores around 24. The benchmark establishes that current multimodal models struggle with partial event understanding, particularly in cross-modal reasoning and precise spatial localization.