Transcript
John: Welcome to Advanced Topics in Multimodal AI. Today's lecture covers "Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm," from the OpenMOSS team. We've seen a lot of recent work, like "Thinking with Generated Images" and "DeepEyes," that focuses on getting models to reason with static visuals. This new paper from researchers at Fudan University and collaborating institutions proposes moving beyond static frames. It argues that for true multimodal reasoning, we need to think in terms of dynamic processes, which is where video comes in.
John: Yes, Noah?
Noah: Hi Professor. So, is the core idea just extending the chain-of-thought from text or images to a sequence of frames? How is that fundamentally different from just generating a series of related images?
John: That's the central question. The authors argue it is fundamentally different. Static images, even in a sequence, capture discrete moments. They struggle to represent continuous change, motion, or the act of creation, like drawing a line to solve a puzzle. The "Thinking with Video" paradigm posits that the generation process itself can serve as a reasoning mechanism. The model isn't just showing steps; it's simulating a dynamic process over time, much like human mental visualization. This allows for a more unified temporal framework that can integrate both visual actions and textual explanations within a single, continuous output.
John: To test this, they introduce a new benchmark called VideoThinkBench. Its purpose is to systematically evaluate these dynamic reasoning capabilities in a verifiable way, which has been a major gap in video model assessment. The benchmark is divided into two main categories: vision-centric tasks, which require spatial and inductive reasoning, and text-centric tasks, which test mathematical and general knowledge.
Noah: So for the vision-centric tasks, are we talking about things like tracking objects?
John: It's more about active problem-solving. For instance, they created 'Eyeballing Puzzles,' where the model has to generate a video of it drawing a line to find a circle's center or tracing the path of a light ray after it reflects off a mirror. They also included maze-solving and abstract pattern completion adapted from the ARC-AGI benchmark. These tasks require the model to not just perceive but to actively construct a visual solution. For the text-centric tasks, they fed the model problems from benchmarks like GSM8K and MMLU and asked it to generate a video showing the written-out solution while speaking the final answer.
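John: To make that concrete, here's a rough Python sketch of what a benchmark entry might look like across the two categories. The schema, field names, and prompt wording are my own illustration, not the paper's actual VideoThinkBench format.

```python
from dataclasses import dataclass
from typing import Literal

# Illustrative schema only; the real VideoThinkBench format may differ.
@dataclass
class VideoThinkTask:
    task_id: str
    category: Literal["vision-centric", "text-centric"]
    subtask: str           # e.g. "eyeballing_puzzle", "maze", "arc_agi", "gsm8k", "mmlu"
    prompt: str            # instruction given to the video generation model
    reference_answer: str  # ground truth used for verification

# Hypothetical text-centric example in the spirit of GSM8K.
example = VideoThinkTask(
    task_id="gsm8k-0001",
    category="text-centric",
    subtask="gsm8k",
    prompt=("Generate a video that writes out the solution step by step "
            "and speaks the final answer aloud: Natalia sold clips to 48 "
            "of her friends in April, and then she sold half as many clips "
            "in May. How many clips did Natalia sell altogether?"),
    reference_answer="72",
)
```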
Noah: And how did the model, Sora-2, actually perform?
John: The results were quite interesting. On the vision-centric Eyeballing Puzzles, Sora-2's performance was notably strong, achieving 40.2% accuracy and outperforming leading VLMs like GPT-5 and Claude-4.5-Sonnet. It excelled at tasks requiring geometric construction. However, it struggled with more complex spatial reasoning, solving 40% of square mazes but failing entirely on circular or hexagonal ones. The most surprising finding was its performance on text-centric tasks. While the visually rendered, written-out reasoning in the video was often flawed or unreadable, the transcribed audio of the final spoken answer was highly accurate. On GSM8K, its audio accuracy was nearly 99%, comparable to state-of-the-art models.
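John: If you're wondering how that audio-versus-visual gap gets measured, a minimal scoring sketch might look like the following. The helper names and the normalization are assumptions on my part, not the paper's actual evaluation code.

```python
import re

def normalize(answer: str) -> str:
    # Keep only digits, sign, and decimal point so "72." and " 72" compare equal.
    return re.sub(r"[^0-9.\-]", "", answer).strip(".")

def score_video_answer(spoken_text: str, rendered_text: str, reference: str) -> dict:
    """Score the two channels separately: the transcribed audio of the spoken
    answer versus the text rendered in the video frames (e.g. recovered by OCR)."""
    ref = normalize(reference)
    return {
        "audio_correct": normalize(spoken_text) == ref,
        "visual_correct": normalize(rendered_text) == ref,
    }

# The spoken answer can be right even when the on-screen writing is garbled.
print(score_video_answer("The final answer is 72.", "7Z", "72"))
# -> {'audio_correct': True, 'visual_correct': False}
```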
Noah: Wait, so the model could say the correct answer but couldn't reliably write it down in the video? What does that suggest about where the reasoning is actually happening?
John: Exactly. That disparity points to a potential decoupling between the reasoning and generation modules. Based on experiments with another video generation model, Wan2.5, the authors speculate that Sora-2 might use a powerful internal prompt rewriter, essentially a VLM, that solves the problem first. This module then translates the reasoning steps into a detailed set of instructions for the video generation component. This implies the core reasoning might not be an emergent property of the video generation process itself, but rather a capability of an integrated, text-and-vision-proficient component that precedes it. This is a crucial insight for how we might build future unified multimodal models.
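John: Here's a compact sketch of that hypothesized two-stage pipeline, just to fix the idea. Both callables are stand-ins; nothing here reflects Sora-2's or Wan2.5's actual internals.

```python
from typing import Callable

def think_then_visualize(
    question: str,
    solve_with_vlm: Callable[[str], str],    # stand-in: returns worked solution + final answer
    generate_video: Callable[[str], bytes],  # stand-in: renders a video from a text prompt
) -> bytes:
    """Hypothesized pipeline: a VLM-style prompt rewriter reasons first,
    then hands detailed instructions to the video generator."""
    solution = solve_with_vlm(question)
    video_prompt = (
        "Show the following solution being written out step by step, "
        "then speak the final answer aloud:\n" + solution
    )
    return generate_video(video_prompt)
```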
Noah: So this 'Thinking with Video' might actually be more like 'Thinking with a VLM, then Visualizing with Video'.
John: That's a good way to frame the current hypothesis. The significance here is twofold. First, it establishes video generation as a valid and powerful output modality for complex reasoning, even if the reasoning core is a separate module. Second, it highlights a path forward. The study suggests that to create truly unified models, we may need to train them on data where text and dynamic visuals are inherently linked, perhaps by converting vast text corpora into video-form training data, like videos of text being written or spoken. The work also showed that techniques from the LLM world, like self-consistency, can be applied here. Aggregating answers over multiple generation attempts significantly boosted accuracy.
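John: And self-consistency transfers almost verbatim. Assuming we've already transcribed a final answer from each generated video, the aggregation is just a majority vote; this is a minimal sketch, not the paper's implementation.

```python
from collections import Counter

def self_consistency(transcribed_answers: list[str]) -> str:
    """Majority vote over answers extracted from multiple independent
    video generations, analogous to self-consistency for LLMs."""
    normalized = [a.strip().lower() for a in transcribed_answers if a.strip()]
    return Counter(normalized).most_common(1)[0][0]

# e.g. five generations for the same GSM8K problem
print(self_consistency(["72", "72", "68", "72", "72"]))  # -> "72"
```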
John: So, to wrap up, the "Thinking with Video" paradigm introduces a compelling new direction for multimodal AI. The VideoThinkBench provides a much-needed tool for rigorous evaluation. And the analysis of Sora-2 reveals a surprisingly capable reasoner, while also offering critical clues about its internal architecture. The key takeaway is that the future of multimodal AI may lie not just in processing different data types, but in generating dynamic, integrated outputs that more closely mimic human thought and communication.
John: Thanks for listening. If you have any further questions, ask our AI assistant or drop a comment.