StereoWorld: Geometry-Aware Monocular-to-Stereo Video Generation

BibTeX
@misc{xing2025stereoworldgeometryawaremonoculartostereo,
      title={StereoWorld: Geometry-Aware Monocular-to-Stereo Video Generation},
      author={Ke Xing and Longfei Li and Yuyang Yin and Hanwen Liang and Guixun Luo and Chen Fang and Jue Wang and Konstantinos N. Plataniotis and Xiaojie Jin and Yao Zhao and Yunchao Wei},
      year={2025},
      eprint={2512.09363},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2512.09363},
}
AI Audio Lecture + Q&A
Transcript
John: Welcome to Advanced Topics in Computer Vision. Today's lecture is on 'StereoWorld: Geometry-Aware Monocular-to-Stereo Video Generation.' We've seen a surge in generative models for 3D-aware tasks, with papers like 'StereoDiffusion' exploring training-free methods. This work, from researchers at Beijing Jiaotong University and the University of Toronto, takes a different, training-based approach to tackle a major bottleneck for XR devices: the scarcity of immersive stereo video content. It proposes a way to convert standard monocular videos into stereo.

John: Yes, Noah?

Noah: Excuse me, Professor. You mentioned XR devices like the Vision Pro. Is the core problem that there just isn't enough native 3D content for them?

John: Exactly. Producing it requires expensive dual-camera setups. This research aims to democratize that by algorithmically converting the massive existing library of 2D videos. That’s why it’s a significant problem to solve.

John: The main contribution of StereoWorld is its departure from traditional multi-stage pipelines. Previously, you'd have methods that first try to reconstruct the 3D scene geometry and then render a new view. This is prone to errors, especially with dynamic scenes. Another common approach is a depth-warping-inpainting pipeline, where you estimate a depth map, warp the image, and then use a model to fill in the gaps, or occlusions. This often leads to artifacts and geometric inconsistencies.

Noah: So those pipelines are brittle because errors accumulate at each stage?

John: Precisely. If your depth estimation is off, the warp is wrong, and the inpainting has to work harder, often hallucinating content that isn't geometrically sound. StereoWorld proposes an end-to-end diffusion framework. It takes the left-eye video and directly generates the right-eye video. It does this by leveraging a powerful pretrained video diffusion model and fine-tuning it with what they call 'geometry-aware regularization.' A huge part of their contribution is also curating a new dataset, StereoWorld-11M, which is specifically aligned with human interpupillary distance, or IPD, to ensure the resulting stereo effect is comfortable to view, not jarring.

Noah: That dataset sounds critical. Most stereo datasets I know are for autonomous driving with very wide baselines.

John: It is. Using those datasets for this task would produce an exaggerated 3D effect that causes eye strain. This new dataset is tailored for the human visual system, which is a key practical insight.

John: Let's look at the technical approach. The foundation is a pretrained Diffusion Transformer, or DiT, model. To adapt it for stereo generation, they use a clever but simple trick for conditioning. Instead of complex architectural changes, they take the latent representations of the left and right videos and concatenate them along the frame dimension. The model's existing spatio-temporal attention layers then naturally learn the relationship between the two views, almost as if they were just subsequent frames in time.

Noah: Wait, so the model learns the stereo correspondence without being explicitly told which is the left view and which is the right?

John: Not entirely. That's where the geometry-aware regularization comes in. Relying on the RGB reconstruction loss alone is not enough to learn precise 3D structure. So, they introduce two additional supervisory signals during training. First is a disparity loss.
They use a pre-trained model to compute the ground-truth disparity between the real stereo pair and guide the generator to produce a right view that matches this disparity. This ensures pixel correspondence in overlapping regions.

Noah: And what about non-overlapping regions? Disparity can't supervise what's occluded in one view.

John: Excellent point. That's why they add a second signal: a depth supervision loss. They use another state-of-the-art model to pre-compute depth maps for the right-view video. The model is then trained to jointly predict both the RGB right view and its corresponding depth map. This ensures the model learns a complete geometric understanding of the scene, not just the parts visible in both views. The ablation studies in the paper show that both of these supervisors are essential; using just one or the other results in significantly worse geometric accuracy.

John: The implications here are significant. By moving to an end-to-end generative framework, StereoWorld produces results that are not only more visually faithful but also much more geometrically consistent than prior methods like StereoCrafter. The quantitative results show a marked improvement in metrics like End-Point Error, which directly measures disparity accuracy. This translates to a more comfortable and convincing 3D experience for the viewer. Human evaluations confirmed this, with participants rating StereoWorld highest on stereo effect, visual quality, and both binocular and temporal consistency.

Noah: So this approach of directly generating the view, guided by geometry, is more robust than trying to explicitly reconstruct it first?

John: That seems to be the case. It lets the powerful priors of the diffusion model handle the texture and semantics, while the explicit geometric losses keep the 3D structure grounded and accurate. However, the paper does note limitations. The generation speed is slow, around six minutes per clip, which is a hurdle for practical application. Also, the stereo baseline isn't explicitly controllable; it's learned from the data. Future work could focus on distillation for speed and methods for user-controlled disparity.

Noah: That makes sense. Having control over the baseline would be important for artistic expression or user comfort settings.

John: Indeed. To wrap up, StereoWorld presents a robust, end-to-end framework for monocular-to-stereo conversion. Its main takeaways are the effectiveness of combining a pretrained video model with dual geometry-aware supervision, and the critical importance of a large-scale, IPD-aligned dataset for training models intended for human perception. This work significantly lowers the barrier to creating immersive content, which could have a major impact on the XR ecosystem.

John: Thanks for listening. If you have any further questions, ask our AI assistant or drop a comment.
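The sketches below illustrate a few of the techniques discussed in the lecture. To make the conditioning trick concrete, here is a minimal PyTorch sketch of concatenating the left-view latent with the noisy right-view latent along the frame dimension before it enters the video DiT. The [batch, frames, channels, height, width] layout and the toy sizes are assumptions for illustration, not the paper's settings.

```python
import torch

# Assumed latent layout: [batch, frames, channels, height, width]
B, T, C, H, W = 1, 16, 4, 32, 32

left_latent = torch.randn(B, T, C, H, W)         # clean latent of the input (left) video
noisy_right_latent = torch.randn(B, T, C, H, W)  # noised latent the model must denoise

# Concatenate along the frame dimension: the DiT now sees a 2T-frame "video"
# whose first half is the condition and whose second half is the target view,
# so the pretrained spatio-temporal attention can relate the two views directly.
dit_input = torch.cat([left_latent, noisy_right_latent], dim=1)
print(dit_input.shape)  # torch.Size([1, 32, 4, 32, 32])

# After the denoising pass, only the second half (the right-view frames)
# would be kept as the model's prediction:
# pred_right_latent = dit_output[:, T:]
```

Because both views pass through the same attention layers, no new cross-view modules are required; the pretrained model's existing spatio-temporal attention does the correspondence work.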
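The two geometry-aware supervisory signals can be combined with the usual reconstruction objective roughly as follows. This is a hedged sketch: the disparity term is realised here as a warping-consistency check against the precomputed ground-truth disparity, and the weights `w_disp` / `w_depth`, the valid-mask convention, and the disparity sign convention are assumptions rather than details taken from the paper.

```python
import torch
import torch.nn.functional as F

def geometry_aware_loss(pred_right, pred_right_depth,
                        left, gt_disparity, gt_depth, valid_mask,
                        w_disp=1.0, w_depth=1.0):
    """Sketch of the two geometry-aware terms described in the lecture.

    pred_right       : generated right-view frames               [B, T, 3, H, W]
    pred_right_depth : jointly predicted right-view depth        [B, T, 1, H, W]
    left             : input left-view frames                    [B, T, 3, H, W]
    gt_disparity     : disparity from a pretrained stereo model  [B, T, 1, H, W]
    gt_depth         : depth from a pretrained monocular model   [B, T, 1, H, W]
    valid_mask       : 1 where the views overlap and disparity is defined
    """
    B, T, _, H, W = left.shape
    left_f = left.reshape(B * T, 3, H, W)
    disp = gt_disparity.reshape(B * T, 1, H, W)
    mask = valid_mask.reshape(B * T, 1, H, W)

    # Disparity term: warp the left view into the right view using the ground-truth
    # disparity (assuming rectified pairs with right(x) = left(x + d)) and penalise
    # mismatch with the generated right view in the overlapping region.
    xs = torch.arange(W, device=left.device, dtype=left.dtype).view(1, 1, 1, W).expand(B * T, 1, H, W)
    ys = torch.arange(H, device=left.device, dtype=left.dtype).view(1, 1, H, 1).expand(B * T, 1, H, W)
    grid_x = 2.0 * (xs + disp) / (W - 1) - 1.0
    grid_y = 2.0 * ys / (H - 1) - 1.0
    grid = torch.cat([grid_x, grid_y], dim=1).permute(0, 2, 3, 1)  # [B*T, H, W, 2]
    warped_left = F.grid_sample(left_f, grid, align_corners=True)
    disp_loss = (mask * (warped_left - pred_right.reshape(B * T, 3, H, W)).abs()).mean()

    # Depth term: supervise the jointly predicted right-view depth everywhere,
    # including regions occluded in the left view where disparity cannot help.
    depth_loss = F.l1_loss(pred_right_depth, gt_depth)

    # The full training objective would add these terms to the standard diffusion
    # reconstruction loss on the right-view latents/frames.
    return w_disp * disp_loss + w_depth * depth_loss
```

In practice the valid mask would come from something like a left-right consistency check on the precomputed disparity, so the disparity term only supervises pixels visible in both views while the depth term covers the rest.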
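End-Point Error, the metric mentioned for disparity accuracy, reduces in the disparity case to a mean absolute difference over valid pixels. A small sketch, with the optional validity mask as an assumption about how invalid ground truth is handled:

```python
import torch

def end_point_error(pred_disparity, gt_disparity, valid_mask=None):
    """EPE for disparity maps: mean absolute difference between predicted and
    ground-truth disparity (the 1-D case of the optical-flow end-point error),
    optionally restricted to pixels with valid ground truth."""
    err = (pred_disparity - gt_disparity).abs()
    if valid_mask is not None:
        err = err[valid_mask.bool()]
    return err.mean()
```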
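Finally, the lecture stresses that StereoWorld-11M is aligned with human interpupillary distance rather than the wide baselines of driving rigs. The paper's actual curation procedure isn't described here, but a heuristic in that spirit might filter clips by the spread of their estimated disparities; the 95th-percentile statistic and the 3%-of-width threshold below are purely illustrative assumptions.

```python
import torch

def within_ipd_comfort(disparity, image_width, max_frac=0.03, q=0.95):
    """Hypothetical curation check: keep a clip only if its estimated disparities
    (e.g. from a pretrained stereo matcher) stay within a small fraction of the
    image width, so the implied baseline is closer to a human IPD (~63 mm) than
    to a wide autonomous-driving rig. Threshold values are illustrative only."""
    d_q = torch.quantile(disparity.abs().float().flatten(), q)
    return bool(d_q <= max_frac * image_width)
```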