Video-MME-Logical: A Controlled Diagnostic Benchmark for Video Temporal-Logical Reasoning

*Equal contribution. Project Lead. Corresponding Author.
Overview of Video-MME-Logical task categories and benchmark design

Video-MME-Logical is a controllable benchmark for video temporal-logical reasoning with 25 tasks, spanning final-answer evaluation, intermediate-state diagnostics, and difficulty-controlled settings.

Abstract

Recent interest in multimodal large language models (MLLMs) raises a central question: can they reason over dynamic visual evidence rather than merely recognize objects or events in individual frames? This ability, which we refer to as video temporal-logical reasoning, requires models to maintain, update, and compose evidence as visual states evolve across frames. Existing video benchmarks often conflate this capability with scene complexity, static recognition, or uncontrolled temporal variation. To isolate this capability, we introduce Video-MME-Logical, a controlled benchmark organized around five temporal-logical operations: state tracking, sequential counting, temporal ordering, dynamic spatiality, and structural composition. The benchmark contains 25 fine-grained task categories generated with controlled object states, transitions, temporal dependencies, and logical compositions. It enables difficulty-controlled final-answer evaluation by varying temporal horizon and reasoning complexity, and supports intermediate-state diagnostics by verifying whether models recover the required logical reasoning trace before producing the final answer. Experiments with state-of-the-art MLLMs reveal a substantial human-model gap, especially as temporal-logical complexity increases. Supervised fine-tuning on up to 500K generated samples improves performance but remains insufficient to close the reasoning gap, positioning Video-MME-Logical as a scalable testbed for analyzing and improving temporal-logical reasoning in MLLMs.

Benchmark at a Glance

Total videos: 503,750
Training videos: 500K
Test videos: 3,750
Task categories: 25
Step-diagnostic tasks: 8

Why Video-MME-Logical

Existing video benchmarks often conflate temporal-logical reasoning with general temporal understanding, scene recognition, or uncontrolled visual variation. This leaves three gaps: reasoning categories are often under-specified, difficulty is hard to interpret because it co-varies with natural-video complexity, and final-answer-only evaluation cannot verify whether a model follows the correct temporal evidence trace. Video-MME-Logical addresses these gaps with operation-centric tasks, controlled difficulty, and verifiable intermediate states.

Video-MME-Logical is designed as both a diagnostic benchmark and a controllable training resource.
Benchmark          #Tasks  #Videos  #Train   #Test  Control  Difficulty  Intermediate
TOMATO             6       1,417    0        1,417  No       No          No
TempCompass        5       410      0        410    No       No          No
ReXTime            3       12,759   9,695    3,064  No       No          No
V-STaR             2       2,094    0        2,094  No       No          No
Video-MME-Logical  25      503,750  500,000  3,750  Yes      Yes         Yes

Taxonomy

We organize Video-MME-Logical around five temporal-logical operations. State Tracking tests whether models maintain hidden or latent object states across visual transformations. Sequential Counting requires accumulating discrete evidence over time. Temporal Ordering asks models to recover the order of state changes, revealed symbols, or event sequences. Dynamic Spatiality evaluates geometric and motion-based inference, while Structural Composition requires composing spatial structures across viewpoints, occlusions, and partial observations.

Taxonomy of Video-MME-Logical grouped into five temporal-logical abilities

The taxonomy covers 25 fine-grained tasks and distinguishes direct-answer tasks from the intermediate-state diagnostic subset.

Construction Pipeline

Each task category is implemented as an executable program with four components: temporal transition, scene configuration, metadata construction, and video rendering. The recorded metadata supports video generation, question construction, exact answer computation, difficulty control, and intermediate-state supervision. Easy, medium, and hard settings are defined by increasing temporal horizon and reasoning complexity.

Video-MME-Logical construction pipeline

Programmatic generation supports reproducible task construction, controllable difficulty, and exact answer verification.
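The four-component structure described above can be illustrated with a minimal sketch, assuming a shell-game style state-tracking task. Everything named here (`Sample`, `generate_sample`, `SWAPS_PER_DIFFICULTY`, and the specific swap counts) is hypothetical and not taken from the benchmark's code; video rendering is omitted, and the number of swaps stands in for temporal horizon.

```python
import random
from dataclasses import dataclass, field

# Hypothetical difficulty settings: more swaps means a longer temporal horizon.
SWAPS_PER_DIFFICULTY = {"easy": 3, "medium": 6, "hard": 10}


@dataclass
class Sample:
    """Metadata recorded for one generated video (rendering not shown)."""
    difficulty: str
    num_cups: int
    start: int                                  # scene configuration: cup hiding the ball
    swaps: list = field(default_factory=list)   # temporal transitions, one (i, j) pair per step
    trace: list = field(default_factory=list)   # ball position after each swap
    answer: int = -1                            # exact final answer


def generate_sample(difficulty: str, seed: int, num_cups: int = 3) -> Sample:
    """Build one sample deterministically from a seed."""
    rng = random.Random(seed)
    start = rng.randrange(num_cups)
    sample = Sample(difficulty, num_cups, start)
    position = start
    for _ in range(SWAPS_PER_DIFFICULTY[difficulty]):
        i, j = rng.sample(range(num_cups), 2)   # pick two distinct cups to swap
        if position == i:
            position = j
        elif position == j:
            position = i
        sample.swaps.append((i, j))
        sample.trace.append(position)           # intermediate state for diagnostics
    sample.answer = position
    return sample


if __name__ == "__main__":
    print(generate_sample("medium", seed=42))
```

Because the answer and the per-step trace are computed by the same program that defines the transitions, both can be verified exactly, and re-running with the same seed reproduces the sample.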

Leaderboard

Main results on Video-MME-Logical. E/M/H denote easy, medium, and hard settings; category columns report the average over E/M/H.
Models                        Overall  Avg. E   Avg. M   Avg. H   State.   Count.   Order.   Spat.    Struct.
Human Level                   95.9     98.4     95.9     93.4     96.4     95.3     96.0     96.3     95.2
Open-source Instruct Models
Qwen3-VL-8B-Instruct          11.9     13.4     12.8     9.6      8.2      3.3      19.3     13.0     15.8
Qwen3-VL-30B-A3B-Instruct     11.8     14.5     12.4     8.7      8.5      4.0      17.2     17.2     12.5
Qwen3-Omni-30B-A3B-Instruct   5.8      6.3      6.1      4.9      2.9      1.1      8.7      6.5      9.7
Qwen2.5-VL-3B-Instruct        1.9      3.1      1.5      1.3      0.7      1.9      0.3      0.5      6.3
Qwen2.5-VL-7B-Instruct        7.4      10.3     7.7      4.3      6.2      2.5      0.0      19.8     8.7
Qwen2.5-VL-72B-Instruct       12.5     15.2     13.1     9.1      5.6      4.1      18.5     18.0     16.2
InternVL3.5-8B-Instruct       12.1     13.8     13.5     8.9      4.6      4.3      15.7     18.0     17.8
InternVL3.5-30B-A3B-Instruct  8.7      9.5      9.7      7.0      6.6      4.4      2.5      15.8     14.3
LLaVA-Video-7B-Qwen2          0.0      0.1      0.0      0.0      0.1      0.0      0.0      0.0      0.0
LLaVA-Video-72B-Qwen2         2.4      4.7      1.9      0.7      4.3      5.1      0.0      2.8      0.0
KimiVL-16B-A3B-Instruct       2.9      5.4      2.2      0.9      4.4      3.9      2.3      0.3      3.3
Open-source Thinking Models
Qwen3-VL-8B-Think             6.6      7.9      5.8      6.0      1.1      3.2      5.2      16.0     7.3
Qwen3-VL-30B-A3B-Think        10.3     16.0     8.7      6.1      3.9      10.8     24.0     7.7      5.0
Qwen3-Omni-30B-A3B-Think      6.2      6.6      5.2      6.8      2.4      2.0      12.5     6.3      7.7
KimiVL-16B-A3B-Think          7.6      10.2     6.8      5.8      5.3      1.5      5.0      14.5     11.7
Proprietary Models
GPT-5.4                       22.7     31.7     20.3     16.1     5.7      26.1     35.7     25.2     20.8
Gemini-3.1 Pro                28.6     33.1     24.1     20.6     8.9      41.7     35.0     32.8     24.3

SFT Scaling

Supervised fine-tuning improves Video-MME-Logical performance with more generated data, but the gains saturate before closing the human gap. The thinking variant peaks at 39.2% overall accuracy with 375K samples and drops to 37.7% at 500K, while both SFT variants remain far below the 95.9% human reference.

SFT scaling on Video-MME-Logical: overall accuracy (%) of SFT Thinking and SFT Instruct across 25K to 500K training samples, with human-level performance at 95.9%.
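To make the size of the remaining gap concrete, the short calculation below uses only the three figures quoted in this section (39.2% at 375K, 37.7% at 500K, and the 95.9% human reference). The variable names are illustrative, and accuracies at other sample counts are not reproduced here.

```python
# Figures quoted in the text for the SFT Thinking variant (overall accuracy, %).
HUMAN_REFERENCE = 95.9
thinking_accuracy = {"375K": 39.2, "500K": 37.7}

# Remaining gap to the human reference at the peak, in percentage points.
gap_at_peak = round(HUMAN_REFERENCE - thinking_accuracy["375K"], 1)

# Drop when going from 375K to 500K training samples, in percentage points.
drop_after_peak = round(thinking_accuracy["375K"] - thinking_accuracy["500K"], 1)

print(f"Gap to human at peak: {gap_at_peak} points")        # 56.7
print(f"Drop from 375K to 500K: {drop_after_peak} points")  # 1.5
```

Even at its best point, the thinking variant is 56.7 points below the human reference, and adding the final 125K samples lowers accuracy by 1.5 points.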

Each card shows one representative example from a fine-grained task category.

Visual Diversity Gallery

Each category shows controlled visual variations while preserving the underlying temporal-logical task.

Keyboard layouts, materials, colors, and word targets vary while the ordering task stays fixed.

Intermediate-State Diagnostics

Final-answer accuracy can hide intermediate-state failures. In Video-MME-Logical-S, models must output structured intermediate information in the same answer tag, and predictions are scored by exact match against program-recorded states. The qualitative example shows that a model can predict the correct final location while producing an incorrect swap trace, whereas a successful model must recover both the intermediate sequence and the final answer.

Qualitative example of intermediate-state evaluation on a state-tracking task

Intermediate-state diagnostics distinguish correct final answers from flawed temporal evidence traces.
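A minimal sketch of exact-match scoring for this setting follows. The output format is an assumption made for illustration: the model is taken to emit a JSON object with `trace` and `final` keys inside an `<answer>` tag. The tag name, the JSON layout, and the function names are hypothetical, and how the benchmark combines the two checks into a single score is not specified here.

```python
import json
import re

# Assumed output format: <answer>{"trace": [...], "final": ...}</answer>
ANSWER_TAG = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def parse_answer(response: str):
    """Return the parsed answer payload, or None if it is missing or malformed."""
    match = ANSWER_TAG.search(response)
    if match is None:
        return None
    try:
        payload = json.loads(match.group(1))
    except json.JSONDecodeError:
        return None
    return payload if isinstance(payload, dict) else None


def score(response: str, gold_trace: list, gold_final) -> dict:
    """Exact-match comparison against program-recorded states."""
    payload = parse_answer(response)
    if payload is None:
        return {"final_correct": False, "trace_correct": False, "both_correct": False}
    final_ok = payload.get("final") == gold_final
    trace_ok = payload.get("trace") == gold_trace
    return {
        "final_correct": final_ok,
        "trace_correct": trace_ok,
        "both_correct": final_ok and trace_ok,
    }


if __name__ == "__main__":
    # Right final location, wrong intermediate trace.
    response = 'The ball ends under cup 2. <answer>{"trace": [0, 1, 2], "final": 2}</answer>'
    print(score(response, gold_trace=[1, 1, 2], gold_final=2))
```

In the usage example, the final answer matches but the trace does not, so `final_correct` is true while `both_correct` is false, which is the failure mode that final-answer-only evaluation would miss.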

Key Findings

Citation

@misc{videommelogical2026,
  title = {Video-MME-Logical: A Controlled Diagnostic Benchmark for Video Temporal-Logical Reasoning},
  author = {Kwan, Hohin and Li, Hongyu and Zhang, Ray and Zhang, Manyuan and Kong, Xianghao and Rao, Anyi and Xie, Jiahao and Liu, Si},
  year = {2026},
}