Transcript
John: Welcome to Advanced Topics in Multimodal AI. Today's lecture is on 'Cambrian-S: Towards Spatial Supersensing in Video.' We've seen a lot of recent work, like 'Thinking in Space', trying to benchmark the spatial intelligence of MLLMs. This new paper from researchers at NYU and Stanford argues that the entire field needs to move beyond simple semantic perception and brute-force scaling. It proposes we shift our focus toward building models that can actively construct and update an internal model of the world from a continuous video stream. Yes, Noah?
Noah: Hi Professor. So is this basically a follow-up to their previous work on VSI-Bench and 'Cambrian-1'? It sounds like they're diagnosing the same problem: MLLMs are good at describing images, but bad at reasoning about space over time.
John: That's a good way to put it. It builds directly on that foundation. But where 'Thinking in Space' diagnosed the problem with a benchmark, Cambrian-S attempts to define a new research direction and offer a potential solution. The authors propose a four-stage hierarchy for what they call 'spatial supersensing'. First is semantic perception, which is what current models do well—naming objects. The second stage is streaming event cognition, which is about always-on sensing and maintaining memory over time. The third is implicit 3D spatial cognition, which means understanding that a video is just a 2D projection of a 3D world. And finally, the fourth stage is predictive world modeling.
Noah: Can you clarify that last stage? How is a 'predictive world model' different from just having good long-term memory of what you've seen?
John: It's an active process rather than a passive one. A predictive model doesn't just store what it sees; it constantly tries to anticipate the next sensory input. The key idea is that the model learns by being 'surprised'—when its prediction doesn't match reality. This prediction error, or surprise, becomes a signal to pay more attention, update the internal world model, and decide what's important enough to store in memory. It's a move from a reactive system that answers questions about the past to a proactive one that anticipates the future.
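John: To make that loop concrete, here's a rough sketch of the control flow I sometimes put on a slide. Everything in it, the `encoder`, the `predictor`, the threshold, is a generic stand-in for illustration, not the paper's actual components:

```python
# A minimal sketch of the predict-compare-update loop, assuming a generic
# `encoder` (frame -> feature vector) and a `predictor` (current features ->
# expected next features). Both are hypothetical stand-ins, not the paper's modules.
import torch

def predictive_step(encoder, predictor, prev_feat, frame, memory, threshold=0.5):
    """One sensing step: anticipate, observe, measure surprise, update memory."""
    feat = encoder(frame)                    # what the model actually observes
    expected = predictor(prev_feat)          # what it anticipated observing
    surprise = torch.mean((expected - feat) ** 2).item()  # prediction error
    if surprise > threshold:                 # mismatch: attend and remember
        memory.append(feat)
    return feat, surprise
```

The detail that matters is the control flow: prediction error is what decides where attention and memory go, rather than a fixed sampling schedule.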
Noah: So they're saying that simply scaling up context windows, like we see with models like Gemini, is the wrong approach?
John: Exactly. That's a central argument. To prove this, they introduce a new benchmark called VSI-SUPER, which is designed to be resistant to these brute-force methods. It has two parts. The first is Visual Spatial Recall, or VSR. They create extremely long videos, up to four hours, by stitching clips together and inserting unusual objects at certain locations. The task is to recall the sequence of these objects and their locations. At that length, the video simply cannot fit into any current model's context window.
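John: For intuition, here's roughly how such a video might be assembled. This is my own illustration of the recipe, not the authors' pipeline; the `composite` helper and the object's `name` attribute are hypothetical:

```python
# Illustrative sketch of a VSR-style video: stitch clips into one long stream
# and composite a few unusual objects in at recorded positions.
import random

def build_vsr_video(clips, anomaly_objects, num_inserts=4):
    """Stitch clips into one long video, hiding a few unusual objects along the way."""
    video, ground_truth = [], []
    insert_points = set(random.sample(range(len(clips)), num_inserts))
    for i, clip in enumerate(clips):
        if i in insert_points:
            obj = random.choice(anomaly_objects)
            t = random.randrange(len(clip))
            clip[t] = composite(clip[t], obj)      # hypothetical compositing helper
            ground_truth.append((obj.name, i, t))  # recall target: what, where, when
        video.extend(clip)
    # The model must later recall the ordered sequence in `ground_truth`.
    return video, ground_truth
```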
Noah: Hold on, stringing together unrelated walkthrough videos feels a bit artificial. Does performance on that task really translate to understanding a coherent, real-world environment?
John: That's a valid critique of the ecological validity. The authors' defense would be that it's designed specifically to isolate and test long-horizon spatial memory while preventing the model from using linguistic or semantic shortcuts. The second task, Visual Spatial Counting or VSC, is a bit more grounded. It involves counting a specific object across multiple room tours, forcing the model to maintain a cumulative count across changing scenes and viewpoints. Again, it tests continuous reasoning, not just single-instance perception.
Noah: And they trained their own model, Cambrian-S, for this?
John: Correct. They developed Cambrian-S and curated a massive, spatially-focused dataset called VSI-590K to train it. This dataset combines annotated real-world 3D scans, data from embodied simulators, and even pseudo-annotated YouTube videos. The goal was to push the current MLLM paradigm as far as it could go. And on existing spatial benchmarks like VSI-Bench, Cambrian-S does achieve state-of-the-art results, significantly outperforming even proprietary models.
Noah: So the current paradigm, with better data, can get you pretty far. But how did it do on their new, harder VSI-SUPER benchmark?
John: It failed, just as they predicted. Performance on the long-recall task dropped sharply as the video length increased, confirming their hypothesis that simply scaling up isn't enough. This is where their proof-of-concept for the new paradigm comes in. They added a small module to Cambrian-S, a 'Latent Frame Prediction' head, which is trained to predict the features of the next video frame.
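John: Architecturally it's a small addition on top of the existing model. A minimal sketch of such a head might look like this; the two-layer MLP and the exact loss, MSE plus a cosine term with equal weights, are my assumptions for illustration, not necessarily the paper's exact recipe:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentFramePredictionHead(nn.Module):
    """Predicts the latent features of frame t+1 from the features of frame t."""
    def __init__(self, dim: int, hidden: int = 2048):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, feat_t: torch.Tensor) -> torch.Tensor:
        return self.mlp(feat_t)

def lfp_loss(pred_next: torch.Tensor, true_next: torch.Tensor) -> torch.Tensor:
    # Match both magnitude (MSE) and direction (cosine) of the feature vectors;
    # the equal weighting here is an assumption.
    mse = F.mse_loss(pred_next, true_next)
    cos = 1.0 - F.cosine_similarity(pred_next, true_next, dim=-1).mean()
    return mse + cos
```

Because it only predicts in latent feature space rather than pixels, it adds very little compute on top of the base model.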
Noah: So that's how they calculate 'surprise'—it's the error between the predicted next frame and the actual next frame.
John: Precisely. And they showed this 'surprise' signal can be used in two practical ways. For the recall task, they used it to manage memory: non-surprising, redundant frames are compressed or dropped, while surprising frames are prioritized. This kept the memory footprint stable even for multi-hour videos. For the counting task, high-surprise frames were used as natural event boundaries to segment the video, allowing the model to count within segments and aggregate the results. This predictive approach dramatically outperformed both the standard Cambrian-S and Gemini.
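John: Here's a compact sketch of both uses. The thresholds, the averaging-based compression, and the per-segment counter are illustrative assumptions rather than the paper's exact mechanisms:

```python
def manage_memory(memory, feat, surprise, threshold=0.5, max_size=1024):
    """Recall task: keep surprising frames intact, fold redundant ones together."""
    if surprise > threshold:
        memory.append(feat)                     # informative frame, store as-is
    elif memory:
        memory[-1] = 0.5 * (memory[-1] + feat)  # merge redundant frame into its neighbor
    if len(memory) > max_size:                  # bound the footprint for multi-hour video
        memory.pop(0)
    return memory

def count_by_segments(frames, surprises, count_in_segment, threshold=0.5):
    """Counting task: split at surprise peaks, count per segment, sum the results."""
    total, segment = 0, []
    for frame, s in zip(frames, surprises):
        segment.append(frame)
        if s > threshold:                       # high surprise marks a scene boundary
            total += count_in_segment(segment)  # hypothetical per-segment counter
            segment = []
    if segment:
        total += count_in_segment(segment)
    return total
```

Notice that neither function needs the full video in context at once; that's exactly why the approach holds up where long-context models degrade.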
Noah: This idea of using prediction error sounds very similar to the core principles of world models in reinforcement learning or papers like 'Long-Context State-Space Video World Models'. Is this just applying that concept to MLLMs?
John: It is a direct connection. This paper effectively provides a concrete, MLLM-based implementation of that broader concept. It bridges the gap between the abstract theory of world models and a practical mechanism for video understanding. The significance here is twofold. First, it provides a clear new benchmark, VSI-SUPER, that exposes the limitations of the current scaling-focused paradigm. Second, it offers 'predictive sensing' as a promising alternative, showing that even a simple implementation can lead to substantial gains in long-horizon reasoning. It argues for a fundamental shift in how we build these models.
John: To wrap up, the Cambrian-S paper is important not just for the model it builds, but for the questions it forces us to ask. It challenges the community to move beyond passive, reactive models and towards proactive systems that build internal, predictive models of the world. The main takeaway is that for AI to achieve true spatial intelligence, especially for embodied agents, it may need to learn to be surprised. Thanks for listening. If you have any further questions, ask our AI assistant or drop a comment.