NTSEBENCH

A benchmark for evaluating cognitive multimodal reasoning and problem-solving capabilities of vision language models using questions from India's talent search examination.

05 Jan 2025

Measures the zero-shot accuracy of LLMs and VLMs on the 1,199 text-only questions from the NTSEBENCH dataset. The 'Standard QA' strategy provides the question and options as direct text input, serving as a fundamental test of textual cognitive reasoning without visual or OCR-related complexities. OpenAI o1-preview was evaluated as a specialized 'Advanced Reasoning Model'.

Overview

NTSEBENCH is a cognitive reasoning benchmark designed to evaluate the problem-solving and multimodal reasoning capabilities of large language models and vision-language models. The dataset comprises 2,728 multiple-choice questions sourced from India's National Talent Search Examination (NTSE), spanning 26 distinct cognitive reasoning categories that test abilities ranging from pattern recognition to spatial reasoning.

Key Specifications

Dataset Size: 2,728 questions with 4,642 accompanying images

Categories: 26 problem types including:

  • Text-only: Series, Alphabet Test, Analogy, Coding-Decoding, Blood Relations, Mathematical Operations, Syllogisms (12 categories, 1,517 questions)
  • Vision+Text: Non-Verbal Series, Missing Character, Paper Folding & Cutting, Venn Diagrams, Cube and Dice, Direction Sense (14 categories, 1,211 questions)

Format: Multiple-choice questions, each with 4 options and exactly one correct answer

Modality Distribution:

  • Pure text questions: 1,199
  • Questions with images in solutions: 381
  • Questions with images in options: 70
  • Fully multimodal questions: 1,078
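The two breakdowns above (by category type and by modality) should each account for all 2,728 questions. A minimal sketch of that consistency check follows; the dictionary keys and the `shares` helper are illustrative names, not part of the released dataset or its code.

```python
# Counts copied from the specification above; key names are illustrative.
TOTAL_QUESTIONS = 2728

category_split = {"text_only": 1517, "vision_text": 1211}
modality_split = {
    "pure_text": 1199,
    "images_in_solutions": 381,
    "images_in_options": 70,
    "fully_multimodal": 1078,
}


def shares(split):
    """Return each bucket's share of the split total, in percent (one decimal)."""
    total = sum(split.values())
    return {name: round(100 * count / total, 1) for name, count in split.items()}


# Both breakdowns partition the same 2,728 questions.
assert sum(category_split.values()) == TOTAL_QUESTIONS
assert sum(modality_split.values()) == TOTAL_QUESTIONS

print(shares(modality_split))
```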

Data Examples

Text-Only Series Problem:

Question: Find the missing element in the following series: 4, 6, 6, 15, 8, 28, 10, ____
Options: 1. 36, 2. 39, 3. 45, 4. 38
Answer: **3 (45)**
Explanation: Two interleaved series: 4, 6, 8, 10 (odd positions) and 6, 15, 28, ? (even positions).
In the second series the differences are 9 and 13, which increase by 4, so the next difference is 17 and the missing term is 28 + 17 = 45.
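The reasoning in this example can be sketched in a few lines of Python. This is only an illustration of the interleaved-series argument, not part of the benchmark's code: the function names are hypothetical, and the extrapolation assumes that some order of finite differences of the relevant subsequence is constant, which is the assumption the explanation itself makes.

```python
def extrapolate(seq):
    """Predict the next term, assuming some order of finite differences is constant."""
    rows = [list(seq)]
    # Keep differencing until a row is constant (or has a single entry).
    while len(rows[-1]) > 1 and len(set(rows[-1])) > 1:
        prev = rows[-1]
        rows.append([b - a for a, b in zip(prev, prev[1:])])
    # The next term is the sum of the last entry of every difference row.
    return sum(row[-1] for row in rows)


def solve_interleaved(known):
    """Fill the blank that follows `known` in a series made of two interleaved series."""
    missing_index = len(known)
    # Terms sharing the blank's position parity form its subsequence.
    subsequence = known[missing_index % 2::2]
    return extrapolate(subsequence)


options = [36, 39, 45, 38]
answer = solve_interleaved([4, 6, 6, 15, 8, 28, 10])
print(answer, "-> option", options.index(answer) + 1)
```

For the series above, the blank falls at an even position, so the subsequence used is 6, 15, 28, and the result matches option 3.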

Vision+Text Missing Character Problem:

Directions: Find the missing term in the following figure (shows three circular diagrams with numbers and letters, one missing a number)
Options: 1. 1, 2. 2, 3. 4, 4. 10
Answer: **3 (4)**  
Explanation: Pattern rule is (upper number + alphabetical position + lower number) × right number - left number = center number.

Significance

NTSEBENCH addresses a gap in AI evaluation by focusing specifically on cognitive reasoning abilities rather than domain knowledge or common sense. The benchmark reveals substantial limitations in current models: the best proprietary model (Gemini 1.5 Pro) achieves only 62.22% on text-only questions and 42.06% on multimodal questions, while human annotators achieve over 80% accuracy on both. Because the questions are drawn from NTSE exam papers, they are expert-designed problems that test the abstract reasoning capabilities considered essential for artificial general intelligence.

Usage

The dataset and evaluation code are publicly available at https://ntsebench.github.io/. The benchmark supports multiple modeling strategies: Standard QA (direct text input), Image-Only (OCR-based processing), Interleaved (separate text and image contexts), and Standard VQA (composite image with structured prompts). Models are scored by percentage accuracy under zero-shot and few-shot Chain-of-Thought prompting. The accompanying analysis reports systematic biases and reasoning failures that can guide model improvement.
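As a rough illustration of the Standard QA strategy and the percentage-accuracy metric, the sketch below formats a question and its options as plain text and scores model replies. Everything here is an assumption made for illustration: the `Question` record, the prompt wording, and the answer-parsing rule are hypothetical and are not taken from the official evaluation code.

```python
import re
from dataclasses import dataclass


@dataclass
class Question:
    text: str
    options: list[str]  # the four answer options
    answer: int         # 1-based index of the correct option


def build_prompt(q):
    """Render a question as direct text input (hypothetical prompt wording)."""
    lines = [f"Question: {q.text}", "Options:"]
    lines += [f"{i}. {opt}" for i, opt in enumerate(q.options, start=1)]
    lines.append("Answer with the option number (1-4).")
    return "\n".join(lines)


def parse_choice(reply):
    """Return the first standalone digit 1-4 in the reply, or None if absent."""
    match = re.search(r"\b([1-4])\b", reply)
    return int(match.group(1)) if match else None


def accuracy(questions, replies):
    """Percentage of replies whose parsed choice equals the gold answer."""
    correct = sum(parse_choice(r) == q.answer for q, r in zip(questions, replies))
    return 100.0 * correct / len(questions)


q = Question(
    text="Find the missing element in the following series: 4, 6, 6, 15, 8, 28, 10, ____",
    options=["36", "39", "45", "38"],
    answer=3,
)
print(build_prompt(q))
print(accuracy([q, q], ["Answer: 3", "I think option 2"]))
```

Taking the first standalone digit is a deliberately simple parsing rule; replies that restate several options before answering would need a stricter output format or a more careful parser.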