Are Video Models Ready as Zero-Shot Reasoners?

An Empirical Study with the MME-CoF Benchmark

¹CUHK IMIXR    ²MMLab    ³Peking University    ⁴Northeastern University
*Equal Contribution     Project Lead     Corresponding Author

Introduction

Recent video generation models can produce high-fidelity, temporally coherent videos, suggesting that they encode substantial world knowledge. Beyond realistic synthesis, they also exhibit emerging behaviors indicative of visual perception, modeling, and manipulation. Yet an important question remains: are video models ready to serve as zero-shot reasoners in challenging visual reasoning scenarios? In this work, we conduct an empirical study to comprehensively investigate this question, focusing on Veo-3, a leading and popular model. We evaluate its reasoning behavior across 12 dimensions, including spatial, geometric, physical, temporal, and embodied logic, systematically characterizing both its strengths and failure modes. To standardize this study, we curate the evaluation data into MME-CoF, a compact benchmark that enables in-depth and thorough assessment of Chain-of-Frame (CoF) reasoning. Our findings reveal that while current video models demonstrate promising reasoning patterns on short-horizon spatial coherence, fine-grained grounding, and locally consistent dynamics, they remain limited in long-horizon causal reasoning, strict geometric constraints, and abstract logic. Overall, they are not yet reliable as standalone zero-shot reasoners, but they show encouraging signs as complementary visual engines alongside dedicated reasoning models.


Figure: Overview of Our Study on the Reasoning Potential of Video Models.

Deep-Dive Analysis on Veo-3

We present the first in-depth investigation of a leading video model, Veo-3, analyzing its visual reasoning potential and detailing representative successes, characteristic errors, and the conditions under which CoF reasoning emerges, holds, or breaks down.

Reasoning Video Showcase

Illustrations with Prompts and Questions

MME-CoF Benchmark

We curate MME-CoF, a compact benchmark providing a standardized taxonomy and an evaluation protocol aligned with CoF reasoning, enabling consistent, category-wise assessment beyond surface-level visual fidelity.
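
To make the protocol concrete, the sketch below shows how a category-wise scoring loop over benchmark items might look. It is a minimal illustration under stated assumptions: the generate and judge callables and the item schema (prompt, question, category) are hypothetical placeholders, not the released MME-CoF interface.

```python
from collections import defaultdict
from typing import Callable, Mapping, Sequence

# Hypothetical alias: a "video" can be any object the judge understands.
Video = object

def evaluate_cof(
    generate: Callable[[str], Video],        # hypothetical: prompt -> generated video
    judge: Callable[[Video, str], float],    # hypothetical: (video, question) -> score in [0, 1]
    items: Sequence[Mapping[str, str]],      # assumed schema: "prompt" / "question" / "category"
) -> dict[str, float]:
    """Average per-category scores, e.g., to plot an evaluation radar map."""
    totals: dict[str, float] = defaultdict(float)
    counts: dict[str, int] = defaultdict(int)
    for item in items:
        video = generate(item["prompt"])        # run the video model on the prompt
        score = judge(video, item["question"])  # score the CoF behavior against the question
        totals[item["category"]] += score
        counts[item["category"]] += 1
    return {cat: totals[cat] / counts[cat] for cat in totals}
```

Averaging within each category rather than pooling all items keeps the reasoning dimensions comparable, which is what a radar-map view like the one below requires.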



Figure: Evaluation Radar Map on MME-CoF.

Figure: Category Distribution of MME-CoF.

Figure: Word Cloud of MME-CoF.