Are Video Models Ready as Zero-Shot Reasoners?

An Empirical Study with the MME-CoF Benchmark

¹CUHK IMIXR & ²MMLab    ³Peking University    ⁴Northeastern University
*Equal Contribution     Project Lead     Corresponding Author

TL;DR

We investigate a key question: Are Video Models Ready as Zero-Shot Reasoners? While modern video models can “see the world” and show a promising ability to perceive, understand, and manipulate complex visual scenes, their actual reliability in visual reasoning remains unverified.
We conduct a comprehensive Chain-of-Frame (CoF) evaluation of the leading model Veo-3 across 12 core dimensions and introduce MME-CoF, a compact, standardized benchmark for systematic CoF reasoning assessment. Our findings show that current video models are not yet dependable standalone zero-shot reasoners, but they demonstrate strong potential as visual perception and scene-understanding modules that complement dedicated reasoning systems.



Overview of Our Study on the Reasoning Potential of Video Models.



Introduction

Recent video generation models can produce high-fidelity, temporally coherent videos, indicating that they may encode substantial world knowledge. Beyond realistic synthesis, they also exhibit emerging behaviors indicative of visual perception, modeling, and manipulation. Yet an important question remains: Are video models ready to serve as zero-shot reasoners in challenging visual reasoning scenarios? In this work, we conduct an empirical study to comprehensively investigate this question, focusing on the leading and popular Veo-3. We evaluate its reasoning behavior across 12 dimensions, including spatial, geometric, physical, temporal, and embodied logic, systematically characterizing both its strengths and failure modes. To standardize this study, we curate the evaluation data into MME-CoF, a compact benchmark that enables in-depth and thorough assessment of Chain-of-Frame (CoF) reasoning. Our findings reveal that while current video models demonstrate promising reasoning patterns on short-horizon spatial coherence, fine-grained grounding, and locally consistent dynamics, they remain limited in long-horizon causal reasoning, strict geometric constraints, and abstract logic. Overall, they are not yet reliable as standalone zero-shot reasoners, but they exhibit encouraging signs as complementary visual engines alongside dedicated reasoning models.

Deep-Dive Analysis on Veo-3

We provide the first in-depth investigation of a leading video model (Veo-3) to analyze its visual reasoning potential, detailing representative successes, characteristic errors, and the conditions under which CoF reasoning emerges, holds, or breaks.

Reasoning Video Showcase

MME-CoF Benchmark

We curate MME-CoF, a compact benchmark providing a standardized taxonomy and an evaluation protocol aligned with CoF reasoning, enabling consistent and category-wise assessment beyond surface-level visual fidelity.



Evaluation Radar Map on MME-CoF.

Category Distribution of MME-CoF.

Word Cloud of MME-CoF.

Leaderboard

Model-level Overall and Per-dimension Performance on MME-CoF. Mean scores and standard deviations are reported on a 0–4 scale, as graded by Gemini-2.5-Pro.

| # | Model            | Overall     | Instruction Alignment | Temporal Consistency | Visual Stability | Content Fidelity | Focus Relevance |
|---|------------------|-------------|-----------------------|----------------------|------------------|------------------|-----------------|
| 1 | Kling-v1         | 0.64 ± 0.91 | 0.01 ± 0.09           | 0.15 ± 0.75          | 2.43 ± 1.86      | 0.21 ± 0.79      | 0.43 ± 1.07     |
| 2 | Seedance-1.0-pro | 1.41 ± 1.51 | 0.30 ± 0.86           | 1.65 ± 1.57          | 2.00 ± 1.72      | 1.13 ± 1.65      | 1.98 ± 1.75     |
| 3 | Veo-3.0-fast     | 1.44 ± 1.51 | 0.56 ± 1.09           | 1.37 ± 1.51          | 1.88 ± 1.73      | 1.10 ± 1.52      | 2.27 ± 1.69     |
| 4 | Veo-3.0-preview  | 1.45 ± 1.50 | 0.54 ± 1.06           | 1.43 ± 1.53          | 1.89 ± 1.71      | 1.12 ± 1.49      | 2.26 ± 1.73     |
| 5 | Sora-2-pro       | 1.66 ± 1.53 | 0.48 ± 0.96           | 1.36 ± 1.59          | 2.39 ± 1.65      | 1.64 ± 1.72      | 2.44 ± 1.73     |
| 6 | Sora-2           | 1.72 ± 1.59 | 0.59 ± 1.12           | 1.52 ± 1.69          | 2.32 ± 1.68      | 1.62 ± 1.75      | 2.52 ± 1.71     |

Per-category Scores on MME-CoF. Mean scores and standard deviations are reported on a 0–4 scale, as graded by Gemini-2.5-Pro.

| #  | Category           | Kling-v1    | Seedance-1.0 Pro | Veo-3.0 Fast | Veo-3.0 Preview | Sora-2      | Sora-2 Pro  |
|----|--------------------|-------------|------------------|--------------|-----------------|-------------|-------------|
| 1  | Visual Detail      | 0.72 ± 0.69 | 1.37 ± 1.39      | 1.10 ± 1.24  | 1.59 ± 1.68     | 1.14 ± 1.32 | 1.08 ± 1.89 |
| 2  | Visual Trace       | 0.49 ± 0.65 | 1.23 ± 1.13      | 1.43 ± 1.26  | 1.48 ± 1.24     | 1.51 ± 1.37 | 1.75 ± 1.31 |
| 3  | Real-world Spatial | 0.77 ± 0.76 | 1.79 ± 1.53      | 2.07 ± 1.54  | 2.10 ± 1.46     | 1.84 ± 1.43 | 1.77 ± 1.35 |
| 4  | 3D Geometry        | 0.61 ± 0.58 | 1.95 ± 1.64      | 1.71 ± 1.54  | 1.54 ± 1.43     | 1.37 ± 1.49 | 1.42 ± 1.45 |
| 5  | 2D Geometry        | 0.49 ± 0.67 | 0.96 ± 1.11      | 1.18 ± 1.15  | 1.27 ± 1.20     | 1.77 ± 1.45 | 1.77 ± 1.21 |
| 6  | Physics-based      | 0.60 ± 0.62 | 1.27 ± 1.25      | 1.44 ± 1.39  | 1.44 ± 1.35     | 2.13 ± 1.32 | 2.10 ± 1.33 |
| 7  | Rotation           | 0.22 ± 0.34 | 2.30 ± 1.46      | 1.83 ± 1.44  | 1.60 ± 1.29     | 1.62 ± 1.37 | 1.44 ± 1.28 |
| 8  | Table & Chart      | 0.87 ± 0.72 | 0.71 ± 1.18      | 0.82 ± 1.30  | 0.96 ± 1.44     | 1.84 ± 1.61 | 1.48 ± 1.59 |
| 9  | GUI                | 1.09 ± 0.51 | 0.70 ± 0.76      | 1.11 ± 1.09  | 1.18 ± 0.89     | 1.88 ± 1.64 | 1.52 ± 1.48 |
| 10 | Object Counting    | 0.64 ± 0.58 | 1.15 ± 0.97      | 2.03 ± 1.42  | 1.84 ± 1.42     | 2.06 ± 1.48 | 1.86 ± 1.41 |
| 11 | Embodied           | 0.80 ± 0.00 | 1.82 ± 1.67      | 1.33 ± 1.57  | 1.18 ± 1.46     | 1.30 ± 1.51 | 1.40 ± 1.42 |
| 12 | Medical            | 1.15 ± 1.17 | 1.56 ± 1.41      | 0.27 ± 0.39  | 0.30 ± 0.58     | 2.08 ± 1.56 | 1.81 ± 1.42 |
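
The tables above summarize per-sample grades (each on a 0–4 scale, assigned by Gemini-2.5-Pro) as mean ± standard deviation. The following minimal Python sketch illustrates only that aggregation step; the `grades` dictionary layout and the example scores are hypothetical placeholders rather than the released MME-CoF data format, and the choice of population vs. sample standard deviation is an assumption.

```python
# Minimal sketch (assumptions noted above): turn per-sample 0-4 grades into
# the "mean ± std" entries reported in the tables. The input layout and the
# example numbers below are hypothetical, not the official MME-CoF format.
import statistics

# grades[model][dimension] -> list of per-sample scores in [0, 4]
grades = {
    "Veo-3.0-preview": {
        "Instruction Alignment": [0, 2, 1, 0, 3],
        "Temporal Consistency": [1, 2, 2, 0, 3],
    },
}

def summarize(per_dim_scores):
    """Return 'mean ± std' strings for each dimension plus an overall entry."""
    summary = {}
    all_scores = []
    for dim, scores in per_dim_scores.items():
        mean = statistics.mean(scores)
        std = statistics.pstdev(scores)  # population std; the paper's choice may differ
        summary[dim] = f"{mean:.2f} ± {std:.2f}"
        all_scores.extend(scores)
    summary["Overall"] = (
        f"{statistics.mean(all_scores):.2f} ± {statistics.pstdev(all_scores):.2f}"
    )
    return summary

for model, per_dim in grades.items():
    print(model, summarize(per_dim))
```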

BibTeX

@article{guo2025mme-cof,
  title={Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-COF Benchmark},
  author={Guo, Ziyu and Chen, Xinyan and Zhang, Renrui and An, Ruichuan and Qi, Yu and Jiang, Dongzhi and Li, Xiangtai and Zhang, Manyuan and Li, Hongsheng and Heng, Pheng-Ann},
  journal={arXiv preprint arXiv:2510.26802},
  year={2025}
}