Video-MME-Logical: A Controlled Diagnostic Benchmark for Video Temporal-Logical Reasoning

Hohin Kwan ¹*, Hongyu Li ²*†, Ray Zhang ³, Manyuan Zhang, Xianghao Kong ¹, Anyi Rao ¹, Jiahao Xie ²‡, Si Liu ²

¹ HKUST ² Colab, Beihang University ³ CUHK

* Equal contribution. † Project Lead. ‡ Corresponding Author.

Figure: Overview of Video-MME-Logical task categories and benchmark design. Video-MME-Logical is a controllable benchmark for video temporal-logical reasoning with 25 tasks, spanning final-answer evaluation, intermediate-state diagnostics, and difficulty-controlled settings.

Abstract

Recent interest in multimodal large language models (MLLMs) raises a central question: can they reason over dynamic visual evidence rather than merely recognize objects or events in individual frames? This ability, which we refer to as video temporal-logical reasoning, requires models to maintain, update, and compose evidence as visual states evolve across frames. Existing video benchmarks often conflate this capability with scene complexity, static recognition, or uncontrolled temporal variation. To isolate this capability, we introduce Video-MME-Logical, a controlled benchmark organized around five temporal-logical operations: state tracking, sequential counting, temporal ordering, dynamic spatiality, and structural composition. The benchmark contains 25 fine-grained task categories generated with controlled object states, transitions, temporal dependencies, and logical compositions. It enables difficulty-controlled final-answer evaluation by varying temporal horizon and reasoning complexity, and supports intermediate-state diagnostics by verifying whether models recover the required logical reasoning trace before producing the final answer. Experiments with state-of-the-art MLLMs reveal a substantial human-model gap, especially as temporal-logical complexity increases. Supervised fine-tuning on up to 500K generated samples improves performance but remains insufficient to close the reasoning gap, positioning Video-MME-Logical as a scalable testbed for analyzing and improving temporal-logical reasoning in MLLMs.

Benchmark at a Glance

Total videos: 503,750
Training videos: 500,000
Test videos: 3,750
Task categories: 25

Step-diagnostic tasks

Why Video-MME-Logical

Existing video benchmarks often conflate temporal-logical reasoning with general temporal understanding, scene recognition, or uncontrolled visual variation. This leaves three gaps: reasoning categories are often under-specified, difficulty is hard to interpret because it co-varies with natural-video complexity, and final-answer-only evaluation cannot verify whether a model follows the correct temporal evidence trace. Video-MME-Logical addresses these gaps with operation-centric tasks, controlled difficulty, and verifiable intermediate states.

Video-MME-Logical is designed as both a diagnostic benchmark and a controllable training resource.
Benchmark	#Tasks	#Videos	#Train	#Test	Control	Difficulty	Intermediate
TOMATO	6	1,417	0	1,417	No	No	No
TempCompass	5	410	0	410	No	No	No
ReXTime	3	12,759	9,695	3,064	No	No	No
V-STaR	2	2,094	0	2,094	No	No	No
Video-MME-Logical	25	503,750	500,000	3,750	Yes	Yes	Yes

Taxonomy

We organize Video-MME-Logical around five temporal-logical operations. State Tracking tests whether models maintain hidden or latent object states across visual transformations. Sequential Counting requires accumulating discrete evidence over time. Temporal Ordering asks models to recover the order of state changes, revealed symbols, or event sequences. Dynamic Spatiality evaluates geometric and motion-based inference, while Structural Composition requires composing spatial structures across viewpoints, occlusions, and partial observations.

Construction Pipeline

Each task category is implemented as an executable program with four components: temporal transition, scene configuration, metadata construction, and video rendering. The recorded metadata supports video generation, question construction, exact answer computation, difficulty control, and intermediate-state supervision. Easy, medium, and hard settings are defined by increasing temporal horizon and reasoning complexity.
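As a rough illustration of how such a program might be organized, the sketch below generates the metadata for one Cup Trick sample. All names (`CupTrickSample`, `generate_cup_trick`, `apply_swaps`, `SWAPS_PER_DIFFICULTY`) and the swap counts per difficulty are assumptions made for illustration, not the benchmark's actual code, and video rendering is omitted.

```python
import random
from dataclasses import dataclass

# Hypothetical difficulty presets: harder settings use more swaps, i.e. a longer
# temporal horizon. The values actually used by the benchmark are not stated here.
SWAPS_PER_DIFFICULTY = {"easy": 3, "medium": 6, "hard": 10}


@dataclass
class CupTrickSample:
    seed: int
    difficulty: str
    num_cups: int
    start_position: int
    swaps: list          # intermediate trace: ordered (i, j) cup-position swaps
    final_position: int  # exact final answer computed from the trace
    question: str


def apply_swaps(position, swaps):
    """Temporal transition: follow the ball through an ordered list of swaps."""
    for i, j in swaps:
        if position == i:
            position = j
        elif position == j:
            position = i
    return position


def generate_cup_trick(seed, difficulty, num_cups=3):
    """Scene configuration and metadata construction for one sample."""
    rng = random.Random(seed)  # seeded so the same sample can be regenerated
    start = rng.randrange(num_cups)
    swaps = [
        tuple(rng.sample(range(num_cups), 2))
        for _ in range(SWAPS_PER_DIFFICULTY[difficulty])
    ]
    return CupTrickSample(
        seed=seed,
        difficulty=difficulty,
        num_cups=num_cups,
        start_position=start,
        swaps=swaps,
        final_position=apply_swaps(start, swaps),
        question="Under which cup is the ball after all swaps?",
    )
```

Because the answer and the swap trace both come from the same recorded metadata, a single record like this could serve final-answer evaluation, intermediate-state supervision, and difficulty control at once.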

Leaderboard

Main results on Video-MME-Logical. E/M/H denote easy, medium, and hard settings; category columns report the average over E/M/H.
Models	Overall	E	M	H	State.	Count.	Order.	Spat.	Struct.
Human Level	95.9	98.4	95.9	93.4	96.4	95.3	96.0	96.3	95.2
Open-source Instruct Models
Qwen3-VL-8B-Instruct	11.9	13.4	12.8	9.6	8.2	3.3	19.3	13.0	15.8
Qwen3-VL-30B-A3B-Instruct	11.8	14.5	12.4	8.7	8.5	4.0	17.2	17.2	12.5
Qwen3-Omni-30B-A3B-Instruct	5.8	6.3	6.1	4.9	2.9	1.1	8.7	6.5	9.7
Qwen2.5-VL-3B-Instruct	1.9	3.1	1.5	1.3	0.7	1.9	0.3	0.5	6.3
Qwen2.5-VL-7B-Instruct	7.4	10.3	7.7	4.3	6.2	2.5	0.0	19.8	8.7
Qwen2.5-VL-72B-Instruct	12.5	15.2	13.1	9.1	5.6	4.1	18.5	18.0	16.2
InternVL3.5-8B-Instruct	12.1	13.8	13.5	8.9	4.6	4.3	15.7	18.0	17.8
InternVL3.5-30B-A3B-Instruct	8.7	9.5	9.7	7.0	6.6	4.4	2.5	15.8	14.3
LLaVA-Video-7B-Qwen2	0.0	0.1	0.0	0.0	0.1	0.0	0.0	0.0	0.0
LLaVA-Video-72B-Qwen2	2.4	4.7	1.9	0.7	4.3	5.1	0.0	2.8	0.0
KimiVL-16B-A3B-Instruct	2.9	5.4	2.2	0.9	4.4	3.9	2.3	0.3	3.3
Open-source Thinking Models
Qwen3-VL-8B-Think	6.6	7.9	5.8	6.0	1.1	3.2	5.2	16.0	7.3
Qwen3-VL-30B-A3B-Think	10.3	16.0	8.7	6.1	3.9	10.8	24.0	7.7	5.0
Qwen3-Omni-30B-A3B-Think	6.2	6.6	5.2	6.8	2.4	2.0	12.5	6.3	7.7
KimiVL-16B-A3B-Think	7.6	10.2	6.8	5.8	5.3	1.5	5.0	14.5	11.7
Proprietary Models
GPT-5.4	22.7	31.7	20.3	16.1	5.7	26.1	35.7	25.2	20.8
Gemini-3.1 Pro	28.6	33.1	24.1	20.6	8.9	41.7	35.0	32.8	24.3

SFT Scaling

Supervised fine-tuning improves Video-MME-Logical performance with more generated data, but the gains saturate before closing the human gap. The thinking variant peaks at 39.2% overall accuracy with 375K samples and drops to 37.7% at 500K, while both SFT variants remain far below the 95.9% human reference.

Figure: SFT scaling curves for the SFT Thinking and SFT Instruct variants, with the Human Level reference.

Task Gallery

Each card shows one representative example from a fine-grained task category.

State Tracking MC

Cup Trick

Locate the hidden ball after ordered cup swaps.

Temporal Ordering Fill-in

Keyboard Sequence

Identify the ordered letter sequence from key activations.

Structural Composition Fill-in

Falling Shape Count

Count falling target shapes in a dynamic scene.

Sequential Counting Fill-in

Symbol

Count matching symbols over time while filtering distractors.

Dynamic Spatiality Fill-in

Maze Trace

Count turns along a moving route.

State Tracking Fill-in

Cup Trick-S

Recover the ordered sequence of cup-position swaps.

Temporal Ordering Fill-in

Keyboard Sequence-S

Recover the full key activation order.

Structural Composition MC

3D Maze Route

Match the route through a 3D maze.

Sequential Counting Fill-in

Symbol-S

Recover the ordered target-symbol reveal sequence.

Dynamic Spatiality MC

Rotation Center

Infer the center of image rotation.

State Tracking Fill-in

Card Relocation-S

Recover the target card position history.

Temporal Ordering MC

Neon Word

Identify a word from sequential neon flashes.

Structural Composition Fill-in

Occlusion Object Count

Count objects hidden or partially occluded over time.

Sequential Counting Fill-in

Cube Structure Count

Count unit cubes in a 3D structure across viewpoints.

Dynamic Spatiality MC

Trajectory Intersection

Count intersections between moving trajectories.

State Tracking Fill-in

Card Shuffle-S

Recover the ordered card move sequence.

Temporal Ordering Fill-in

Neon Word-Step

Recover the word formation sequence.

Structural Composition MC

Hidden Container Inference

Infer the hidden container shape from partial evidence.

Sequential Counting Fill-in

Grid Activation

Count unique grid cells activated over time.

Dynamic Spatiality MC

Speed Comparison

Compare relative object speeds over time.

Sequential Counting Fill-in

Grid Activation-S

Recover the grid-cell activation trace.

Visual Diversity Gallery

Each category shows controlled visual variations while preserving the underlying temporal-logical task.

Keyboard layouts, materials, colors, and word targets vary while the ordering task stays fixed.

Alphabet #1

Seed 326048001

theme_id=rose_gloss_phone | material_family=glossy_plastic

Alphabet #2

Seed 326048002

theme_id=mint_ceramic_wide | material_family=ceramic

Alphabet #3

Seed 326048003

theme_id=steel_metal_large | material_family=brushed_metal

Alphabet #4

Seed 326048004

theme_id=walnut_wood_tall | material_family=painted_wood

Alphabet #5

Seed 326048005

theme_id=midnight_gloss_slim | material_family=glossy_plastic

Alphabet #6

Seed 326048006

theme_id=cobalt_metal_extra_wide | material_family=brushed_metal

Alphabet #7

Seed 326048007

theme_id=sand_plastic_tablet | material_family=matte_plastic

Alphabet #8

Seed 326048008

theme_id=graphite_ceramic_dense | material_family=ceramic

Alphabet #9

Seed 326048009

theme_id=coral_wood_compact | material_family=painted_wood

Alphabet #10

Seed 326048010

theme_id=ivory_plastic_compact | material_family=matte_plastic
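One way to read the seed and theme fields on the keyboard cards above is that a single seed makes each scene reproducible while only its appearance and target word change. The sketch below illustrates that idea under stated assumptions: the function `sample_keyboard_scene`, the word list, and the use of a uniform random choice are hypothetical, and the mapping from seeds to themes used by the benchmark is not given on this page.

```python
import random

# (theme_id, material_family) pairs copied from some of the keyboard cards above.
THEMES = [
    ("rose_gloss_phone", "glossy_plastic"),
    ("mint_ceramic_wide", "ceramic"),
    ("steel_metal_large", "brushed_metal"),
    ("walnut_wood_tall", "painted_wood"),
    ("sand_plastic_tablet", "matte_plastic"),
]

# Hypothetical target words; the benchmark's vocabulary is not listed on this page.
WORDS = ["video", "logic", "frame", "order"]


def sample_keyboard_scene(seed):
    """Draw appearance and target word from one seeded RNG for reproducibility."""
    rng = random.Random(seed)
    theme_id, material_family = rng.choice(THEMES)
    word = rng.choice(WORDS)
    return {
        "seed": seed,
        "theme_id": theme_id,
        "material_family": material_family,
        "activation_order": list(word),  # keys light up in this order
        "answer": word,                  # computed from the word only, not the theme
    }
```

Calling `sample_keyboard_scene` twice with the same seed returns the same record, so any rendered video can be traced back to its configuration.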

Cup, ball, and table styles vary across controlled state-tracking scenes.

Cups #1

Seed 326414241

Cups #2

Seed 326414242

Cups #3

Seed 326414243

Cups #4

Seed 326414244

Cups #5

Seed 326414245

Cups #6

Seed 326414246

Cups #7

Seed 326414247

Cups #8

Seed 326414248

Cups #9

Seed 326414249

Cups #10

Seed 326414250

Intermediate-State Diagnostics

Final-answer accuracy can hide intermediate-state failures. In Video-MME-Logical-S, models must output structured intermediate information in the same answer tag, and predictions are scored by exact match against program-recorded states. The qualitative example shows that a model can predict the correct final location while producing an incorrect swap trace, whereas a successful model must recover both the intermediate sequence and the final answer.

Figure: Qualitative example of intermediate-state evaluation on a state-tracking task. Intermediate-state diagnostics distinguish correct final answers from flawed temporal evidence traces.
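The exact-match scoring described above can be sketched roughly as follows. The `<answer>` tag name, the whitespace and case normalization, and the serialized trace format in the usage note are assumptions for illustration; the page states only that the intermediate information shares the answer tag with the final answer and is compared by exact match against program-recorded states.

```python
import re

# Assumed tag format; not confirmed by the page.
ANSWER_TAG = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def normalize(text):
    """Lowercase and drop all whitespace so formatting differences do not matter."""
    return "".join(text.lower().split())


def extract_answer(response):
    """Return the content of the last <answer>...</answer> tag, or None if absent."""
    matches = ANSWER_TAG.findall(response)
    return matches[-1] if matches else None


def score_exact_match(response, reference):
    """Return 1 if the tagged prediction equals the recorded reference, else 0."""
    predicted = extract_answer(response)
    if predicted is None:
        return 0
    return int(normalize(predicted) == normalize(reference))
```

With a hypothetical reference such as `swaps: (0,1),(1,2); final: 2`, a response that reports the right final cup but a wrong swap trace scores 0, which is the failure mode the qualitative example highlights.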

Key Findings

Human performance reaches 95.9% overall accuracy, while the strongest evaluated zero-shot model, Gemini-3.1 Pro, reaches 28.6%.
Controlled difficulty exposes sharp degradation: GPT-5.4 drops from 31.7% on easy tasks to 16.1% on hard tasks, while Gemini-3.1 Pro drops from 33.1% to 20.6%.
Intermediate-state evaluation remains difficult: GPT-5.4 reaches 17.4% and Gemini-3.1 Pro reaches 10.8% on Video-MME-Logical-S, far below the 96.1% human reference.
Explicit thinking does not consistently improve temporal-logical reasoning; the reasoning trace must remain grounded in the correct visual evidence.
SFT scaling improves performance up to 39.2%, but the full 500K setting drops to 37.7%, suggesting that naive supervised scaling saturates.
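The degradation and gap figures implied by these findings can be reproduced with a few lines of arithmetic. The sketch below uses only numbers reported on this page; the helper names `degradation` and `human_gap` are illustrative.

```python
HUMAN_OVERALL = 95.9

# (easy, hard, overall) accuracy in percent, copied from the leaderboard.
RESULTS = {
    "GPT-5.4": (31.7, 16.1, 22.7),
    "Gemini-3.1 Pro": (33.1, 20.6, 28.6),
}


def degradation(easy, hard):
    """Return (absolute drop in points, relative drop in percent) from easy to hard."""
    absolute = easy - hard
    return round(absolute, 1), round(100 * absolute / easy, 1)


def human_gap(overall, human=HUMAN_OVERALL):
    """Return the gap to human accuracy in percentage points."""
    return round(human - overall, 1)


for name, (easy, hard, overall) in RESULTS.items():
    drop_abs, drop_rel = degradation(easy, hard)
    print(f"{name}: -{drop_abs} pts ({drop_rel}% relative), human gap {human_gap(overall)} pts")
```

In relative terms, GPT-5.4 loses about 49% of its easy-setting accuracy on hard tasks and Gemini-3.1 Pro about 38%. Even the best SFT result of 39.2% leaves a gap of 56.7 points to the human reference (`human_gap(39.2)`).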

Citation

@misc{videommelogical2026,
  title = {Video-MME-Logical: A Controlled Diagnostic Benchmark for Video Temporal-Logical Reasoning},
  author = {Kwan, Hohin and Li, Hongyu and Zhang, Ray and Zhang, Manyuan and Kong, Xianghao and Rao, Anyi and Xie, Jiahao and Liu, Si},
  year = {2026},
}
