Video-MME-Logical is a controllable benchmark for video temporal-logical reasoning with 25 tasks, spanning final-answer evaluation, intermediate-state diagnostics, and difficulty-controlled settings.
Recent interest in multimodal large language models (MLLMs) raises a central question: can they reason over dynamic visual evidence rather than merely recognize objects or events in individual frames? This ability, which we refer to as video temporal-logical reasoning, requires models to maintain, update, and compose evidence as visual states evolve across frames. Existing video benchmarks often conflate this capability with scene complexity, static recognition, or uncontrolled temporal variation. To isolate this capability, we introduce Video-MME-Logical, a controlled benchmark organized around five temporal-logical operations: state tracking, sequential counting, temporal ordering, dynamic spatiality, and structural composition. The benchmark contains 25 fine-grained task categories generated with controlled object states, transitions, temporal dependencies, and logical compositions. It enables difficulty-controlled final-answer evaluation by varying temporal horizon and reasoning complexity, and supports intermediate-state diagnostics by verifying whether models recover the required logical reasoning trace before producing the final answer. Experiments with state-of-the-art MLLMs reveal a substantial human-model gap, especially as temporal-logical complexity increases. Supervised fine-tuning on up to 500K generated samples improves performance but remains insufficient to close the reasoning gap, positioning Video-MME-Logical as a scalable testbed for analyzing and improving temporal-logical reasoning in MLLMs.
Existing video benchmarks often conflate temporal-logical reasoning with general temporal understanding, scene recognition, or uncontrolled visual variation. This leaves three gaps: reasoning categories are often under-specified, difficulty is hard to interpret because it co-varies with natural-video complexity, and final-answer-only evaluation cannot verify whether a model follows the correct temporal evidence trace. Video-MME-Logical addresses these gaps with operation-centric tasks, controlled difficulty, and verifiable intermediate states.
| Benchmark | #Tasks | #Videos | #Train | #Test | Control | Difficulty | Intermediate |
|---|---|---|---|---|---|---|---|
| TOMATO | 6 | 1,417 | 0 | 1,417 | No | No | No |
| TempCompass | 5 | 410 | 0 | 410 | No | No | No |
| ReXTime | 3 | 12,759 | 9,695 | 3,064 | No | No | No |
| V-STaR | 2 | 2,094 | 0 | 2,094 | No | No | No |
| Video-MME-Logical | 25 | 503,750 | 500,000 | 3,750 | Yes | Yes | Yes |
We organize Video-MME-Logical around five temporal-logical operations. State Tracking tests whether models maintain hidden or latent object states across visual transformations. Sequential Counting requires accumulating discrete evidence over time. Temporal Ordering asks models to recover the order of state changes, revealed symbols, or event sequences. Dynamic Spatiality evaluates geometric and motion-based inference, while Structural Composition requires composing spatial structures across viewpoints, occlusions, and partial observations.
The taxonomy covers 25 fine-grained tasks and distinguishes direct-answer tasks from the intermediate-state diagnostic subset.
Each task category is implemented as an executable program with four components: temporal transition, scene configuration, metadata construction, and video rendering. The recorded metadata supports video generation, question construction, exact answer computation, difficulty control, and intermediate-state supervision. Easy, medium, and hard settings are defined by increasing temporal horizon and reasoning complexity.
Programmatic generation supports reproducible task construction, controllable difficulty, and exact answer verification.
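As a rough illustration of the generation pipeline described above, the following minimal sketch builds one cup-swap sample. The names `DIFFICULTY`, `CupTrickSample`, and `generate_cup_trick`, and the specific cup and swap counts, are assumptions made for illustration and are not taken from the benchmark code. Video rendering is omitted; only scene configuration, temporal transition, and metadata construction are shown.

```python
import random
from dataclasses import dataclass

# Hypothetical difficulty presets. The benchmark only states that difficulty
# grows with temporal horizon and reasoning complexity; these values are
# illustrative assumptions.
DIFFICULTY = {
    "easy": {"num_cups": 3, "num_swaps": 3},
    "medium": {"num_cups": 4, "num_swaps": 6},
    "hard": {"num_cups": 5, "num_swaps": 10},
}


@dataclass
class CupTrickSample:
    """Metadata recorded for one generated sample (hypothetical schema)."""
    difficulty: str
    num_cups: int
    start: int    # initial ball position (0-indexed)
    swaps: list   # ordered list of (i, j) position swaps
    trace: list   # ball position after each swap (intermediate states)
    answer: int   # final ball position
    question: str


def generate_cup_trick(difficulty: str, seed: int) -> CupTrickSample:
    rng = random.Random(seed)  # seeded for reproducible construction
    cfg = DIFFICULTY[difficulty]

    # Scene configuration: choose where the ball starts.
    start = rng.randrange(cfg["num_cups"])

    # Temporal transition: apply ordered swaps and track the hidden ball.
    pos, swaps, trace = start, [], []
    for _ in range(cfg["num_swaps"]):
        i, j = rng.sample(range(cfg["num_cups"]), 2)
        swaps.append((i, j))
        if pos == i:
            pos = j
        elif pos == j:
            pos = i
        trace.append(pos)

    # Metadata construction: the exact answer and the intermediate states
    # come directly from the simulated transitions.
    question = (
        f"After {cfg['num_swaps']} swaps, which position (0-indexed, "
        "left to right) hides the ball?"
    )
    return CupTrickSample(difficulty, cfg["num_cups"], start,
                          swaps, trace, pos, question)


if __name__ == "__main__":
    sample = generate_cup_trick("medium", seed=0)
    print(sample.swaps, sample.answer)
```

Because the answer and the trace are by-products of the simulation, the same record can serve final-answer evaluation, intermediate-state supervision, and difficulty control without manual annotation.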
| Models | Overall (Avg.) | Easy | Medium | Hard | State. | Count. | Order. | Spat. | Struct. |
|---|---|---|---|---|---|---|---|---|---|
| Human Level | 95.9 | 98.4 | 95.9 | 93.4 | 96.4 | 95.3 | 96.0 | 96.3 | 95.2 |
| Open-source Instruct Models | |||||||||
| Qwen3-VL-8B-Instruct | 11.9 | 13.4 | 12.8 | 9.6 | 8.2 | 3.3 | 19.3 | 13.0 | 15.8 |
| Qwen3-VL-30B-A3B-Instruct | 11.8 | 14.5 | 12.4 | 8.7 | 8.5 | 4.0 | 17.2 | 17.2 | 12.5 |
| Qwen3-Omni-30B-A3B-Instruct | 5.8 | 6.3 | 6.1 | 4.9 | 2.9 | 1.1 | 8.7 | 6.5 | 9.7 |
| Qwen2.5-VL-3B-Instruct | 1.9 | 3.1 | 1.5 | 1.3 | 0.7 | 1.9 | 0.3 | 0.5 | 6.3 |
| Qwen2.5-VL-7B-Instruct | 7.4 | 10.3 | 7.7 | 4.3 | 6.2 | 2.5 | 0.0 | 19.8 | 8.7 |
| Qwen2.5-VL-72B-Instruct | 12.5 | 15.2 | 13.1 | 9.1 | 5.6 | 4.1 | 18.5 | 18.0 | 16.2 |
| InternVL3.5-8B-Instruct | 12.1 | 13.8 | 13.5 | 8.9 | 4.6 | 4.3 | 15.7 | 18.0 | 17.8 |
| InternVL3.5-30B-A3B-Instruct | 8.7 | 9.5 | 9.7 | 7.0 | 6.6 | 4.4 | 2.5 | 15.8 | 14.3 |
| LLaVA-Video-7B-Qwen2 | 0.0 | 0.1 | 0.0 | 0.0 | 0.1 | 0.0 | 0.0 | 0.0 | 0.0 |
| LLaVA-Video-72B-Qwen2 | 2.4 | 4.7 | 1.9 | 0.7 | 4.3 | 5.1 | 0.0 | 2.8 | 0.0 |
| KimiVL-16B-A3B-Instruct | 2.9 | 5.4 | 2.2 | 0.9 | 4.4 | 3.9 | 2.3 | 0.3 | 3.3 |
| Open-source Thinking Models | |||||||||
| Qwen3-VL-8B-Think | 6.6 | 7.9 | 5.8 | 6.0 | 1.1 | 3.2 | 5.2 | 16.0 | 7.3 |
| Qwen3-VL-30B-A3B-Think | 10.3 | 16.0 | 8.7 | 6.1 | 3.9 | 10.8 | 24.0 | 7.7 | 5.0 |
| Qwen3-Omni-30B-A3B-Think | 6.2 | 6.6 | 5.2 | 6.8 | 2.4 | 2.0 | 12.5 | 6.3 | 7.7 |
| KimiVL-16B-A3B-Think | 7.6 | 10.2 | 6.8 | 5.8 | 5.3 | 1.5 | 5.0 | 14.5 | 11.7 |
| Proprietary Models | |||||||||
| GPT-5.4 | 22.7 | 31.7 | 20.3 | 16.1 | 5.7 | 26.1 | 35.7 | 25.2 | 20.8 |
| Gemini-3.1 Pro | 28.6 | 33.1 | 24.1 | 20.6 | 8.9 | 41.7 | 35.0 | 32.8 | 24.3 |
Supervised fine-tuning improves Video-MME-Logical performance with more generated data, but the gains saturate before closing the human gap. The thinking variant peaks at 39.2% overall accuracy with 375K samples and drops to 37.7% at 500K, while both SFT variants remain far below the 95.9% human reference.
Each card shows one representative example from a fine-grained task category.
Each category shows controlled visual variations while preserving the underlying temporal-logical task.
Keyboard layouts, materials, colors, and word targets vary while the ordering task stays fixed.
Cube structures vary in shape, material, color, and voxel-count configuration.
Cup, ball, and table styles vary across controlled state-tracking scenes.
Card backs, palettes, patterns, and target cards vary while relocation logic stays fixed.
Maze themes, grid shapes, path templates, and route endpoints vary across dynamic spatial scenes.
Final-answer accuracy can hide intermediate-state failures. In Video-MME-Logical-S, models must output structured intermediate information in the same answer tag as the final answer, and predictions are scored by exact match against program-recorded states. The qualitative example shows that a model can predict the correct final location while producing an incorrect swap trace, whereas a successful model must recover both the intermediate sequence and the final answer.
Intermediate-state diagnostics distinguish correct final answers from flawed temporal evidence traces.
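The exact-match scoring described above might look like the following minimal sketch. The `<answer>` tag syntax, the JSON payload with `trace` and `final` keys, and the function names are assumptions made for illustration; the actual output format of Video-MME-Logical-S may differ.

```python
import json
import re

# Assumed tag format: <answer>{"trace": [[0, 1], ...], "final": 2}</answer>
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def parse_answer(model_output: str):
    """Extract the JSON payload from the first answer tag, or None."""
    match = ANSWER_RE.search(model_output)
    if match is None:
        return None
    try:
        return json.loads(match.group(1))
    except json.JSONDecodeError:
        return None


def score(model_output: str, reference: dict) -> dict:
    """Compare a prediction against program-recorded states by exact match."""
    pred = parse_answer(model_output)
    if not isinstance(pred, dict):
        return {"final_correct": False, "trace_correct": False,
                "exact_match": False}
    final_ok = pred.get("final") == reference["final"]
    trace_ok = pred.get("trace") == reference["trace"]
    return {
        "final_correct": final_ok,
        "trace_correct": trace_ok,
        # A sample counts only if both the trace and the final answer match.
        "exact_match": final_ok and trace_ok,
    }


if __name__ == "__main__":
    # Ball starts under cup 0; the recorded swaps are (0, 1) then (1, 2).
    reference = {"trace": [[0, 1], [1, 2]], "final": 2}
    # A single swap (0, 2) also ends at position 2, but the trace is wrong.
    output = 'Reasoning... <answer>{"trace": [[0, 2]], "final": 2}</answer>'
    print(score(output, reference))
```

In the usage example, the prediction reaches the right final location through a swap sequence that never occurred, so it would pass a final-answer-only check but fail the intermediate-state check.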
@misc{videommelogical2026,
  title  = {Video-MME-Logical: A Controlled Diagnostic Benchmark for Video Temporal-Logical Reasoning},
  author = {Kwan, Hohin and Li, Hongyu and Zhang, Ray and Zhang, Manyuan and Kong, Xianghao and Rao, Anyi and Xie, Jiahao and Liu, Si},
  year   = {2026},
}