VGGT: Visual Geometry Grounded Transformer

BibTeX
@Inproceedings{Wang2025VGGTVG,
 author = {Jianyuan Wang and Minghao Chen and Nikita Karaev and Andrea Vedaldi and Christian Rupprecht and David Novotný},
 title = {VGGT: Visual Geometry Grounded Transformer},
 year = {2025}
}
GitHub: https://github.com/facebookresearch/vggt (9,237 stars)
Clone via HTTPS: https://github.com/facebookresearch/vggt
Clone via SSH: git@github.com:facebookresearch/vggt.git
Clone via GitHub CLI: gh repo clone facebookresearch/vggt
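For readers who want to try the released model, here is a minimal inference sketch. The import paths, the `load_and_preprocess_images` helper, and the `facebook/VGGT-1B` checkpoint name are assumptions based on the repository's documentation and may differ; consult the repo's README for the current API.

```python
import torch
# Assumed import paths; check the repository README for the current API.
from vggt.models.vggt import VGGT
from vggt.utils.load_fn import load_and_preprocess_images

device = "cuda" if torch.cuda.is_available() else "cpu"

# Assumed pretrained checkpoint name.
model = VGGT.from_pretrained("facebook/VGGT-1B").to(device)

# Replace with your own multi-view image paths (placeholders shown).
image_names = ["path/to/view1.png", "path/to/view2.png", "path/to/view3.png"]
images = load_and_preprocess_images(image_names).to(device)

with torch.no_grad():
    # A single feed-forward pass returns cameras, depth maps, and point maps for all views.
    predictions = model(images)
```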
AI Audio Lecture + Q&A
Transcript
John: Welcome to Computer Vision and Neural Networks. Today's lecture is on the paper 'VGGT: Visual Geometry Grounded Transformer' from researchers at the University of Oxford's Visual Geometry Group and Meta AI. We've seen a trend in recent work, like 'No Pose, No Problem', moving towards 3D reconstruction from unposed images without relying on traditional, slow pipelines. This paper pushes that idea forward by proposing a single, large transformer model that does almost everything in one go. It represents a significant move towards a 'neural-first' paradigm for 3D scene understanding. Yes, Noah?

Noah: Hi Professor. When you say 'neural-first', does that mean it completely replaces methods like Bundle Adjustment that we see in classic Structure-from-Motion?

John: That's the core objective. The motivation is to overcome the limitations of those traditional methods, which are robust but computationally intensive and slow. The authors aim to 'eschew geometry post-processing almost entirely.' Instead of a multi-stage pipeline involving feature matching, triangulation, and iterative optimization, VGGT is a feed-forward network. You give it a set of images, and in a single pass, it directly predicts the camera parameters, depth maps, and a 3D point map for the entire scene.

Noah: So it's not just predicting one thing, like depth, but the entire scene geometry at once?

John: Exactly. And that's a key part of their contribution. They call it 'over-complete predictions'. Geometrically, if you have camera poses and depth maps, you can calculate the 3D point cloud. But they found that by explicitly training the model to predict all three (cameras, depth, and point maps) using a multi-task loss, the overall performance improves. The model learns the inherent relationships between these outputs, and each task effectively regularizes the others.

Noah: Wait, I'm a bit confused. How can a standard transformer architecture handle hundreds of images and their spatial relationships without getting overwhelmed? Does it have some special 3D inductive bias built in?

John: That's an excellent question, and it leads directly to their main architectural innovation. The authors intentionally use a fairly standard large transformer with minimal 3D-specific biases to test its power. The key is a mechanism they call Alternating-Attention. Instead of having every token attend to every other token across all images all the time, which would be computationally prohibitive, the transformer layers alternate. One layer performs self-attention within each frame's tokens, processing the image content. The next layer performs global self-attention, allowing tokens to integrate information across all frames. This balances per-image feature extraction with multi-view geometric reasoning.

Noah: So it's a strategy to make the cross-view communication more efficient. What about the inputs? Are the raw pixels fed into this transformer?

John: No, they leverage a pre-trained DINOv2 model as a feature extractor. Each image is converted into a set of patch tokens. These, along with a special 'camera token' for each view, are the inputs to the main VGGT transformer. After the transformer processes these tokens, specialized prediction heads take over. The refined camera tokens go to a head that predicts pose and intrinsics, while the image tokens go to DPT-style heads that generate the dense depth and point maps.
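To make the Alternating-Attention idea concrete, here is a minimal, illustrative PyTorch sketch (not the authors' implementation). It assumes tokens arranged as (batch, frames, tokens-per-frame, channels) and fuses one frame-wise step and one global step into a single block; in the actual model these are separate, alternating transformer layers, and the token counts, dimensions, and block details below are placeholders.

```python
import torch
import torch.nn as nn

class AlternatingAttentionBlock(nn.Module):
    """Illustrative sketch of Alternating-Attention (not the authors' code).

    Tokens are shaped (B, S, N, D): B scenes, S frames, N tokens per frame
    (patch tokens plus a camera token), D channels. Frame-wise attention mixes
    tokens within each frame; global attention mixes tokens across all frames.
    """

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.frame_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        B, S, N, D = tokens.shape

        # Frame-wise self-attention: fold frames into the batch dimension so
        # each frame's tokens attend only to one another.
        x = tokens.reshape(B * S, N, D)
        y = self.norm1(x)
        x = x + self.frame_attn(y, y, y, need_weights=False)[0]

        # Global self-attention: flatten all frames of a scene into a single
        # sequence so tokens can integrate information across views.
        x = x.reshape(B, S * N, D)
        y = self.norm2(x)
        x = x + self.global_attn(y, y, y, need_weights=False)[0]

        return x.reshape(B, S, N, D)


# Toy usage: 1 scene, 4 frames, 257 tokens per frame, 768-dimensional features.
tokens = torch.randn(1, 4, 257, 768)
out = AlternatingAttentionBlock(768)(tokens)
print(out.shape)  # torch.Size([1, 4, 257, 768])
```

The point is the reshape pattern: frame-wise layers attend over N tokens at a time, while global layers attend over S·N tokens, so alternating interleaves cheaper per-frame processing with the more expensive cross-view reasoning instead of making every layer fully global.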
Noah: Given its goal of being a general-purpose model, the training data must be incredibly diverse. Was it trained on a specific type of scene?

John: It was trained on a massive and varied collection of datasets: everything from indoor scenes like ScanNet and Replica to outdoor collections like MegaDepth and Mapillary, plus a lot of synthetic data. This diversity is crucial for the model's strong generalization, which is one of its most significant findings. For example, it achieves state-of-the-art results on the RealEstate10K dataset, which it was never trained on, and it does so in about 0.2 seconds. For comparison, optimization-based methods can take over 20 seconds.

Noah: That speed is impressive. But does this feed-forward approach sacrifice accuracy? I know some other methods, like 'Pow3R' or 'MapAnything', also use transformers but may incorporate priors or refinement steps. How does VGGT's raw output compare to something that includes a final optimization step?

John: That's the critical trade-off to consider. On its own, VGGT's feed-forward performance is already superior to many prior methods, including those that use costly post-processing. However, the authors also show that its predictions provide an excellent initialization for a very fast Bundle Adjustment. This 'VGGT + BA' approach yields even higher accuracy, often setting a new state of the art, while still being an order of magnitude faster than traditional pipelines that start from scratch. It essentially gets the best of both worlds: a fast, high-quality initial guess from the network, followed by a quick refinement.

Noah: So it's not just a replacement for optimization, but a powerful accelerator for it. And you mentioned it can be used for other tasks?

John: Correct. This points to its potential as a foundational model for 3D vision. The authors demonstrate that the feature backbone, once trained, can be finetuned for other tasks. For instance, they integrate it into a tracker to improve performance on dynamic point tracking in videos, and they use it for feed-forward novel view synthesis. This versatility is arguably its biggest impact: it's a step toward a universal, pre-trained backbone for 3D perception.

John: To wrap up, VGGT demonstrates that a large, feed-forward transformer can directly and efficiently infer comprehensive 3D scene geometry. It shifts the paradigm from slow, complex optimization pipelines toward fast, unified neural inference. This speed and versatility unlock potential for real-time applications in robotics and augmented reality that were previously impractical. The main takeaway is that with sufficient scale and the right architecture, neural networks can learn to solve complex, multi-view geometry problems in a single shot.

John: Thanks for listening. If you have any further questions, ask our AI assistant or drop a comment.
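As a footnote to the 'VGGT + BA' discussion above, here is a toy sketch of why a good feed-forward initialization makes refinement cheap. It is not the paper's pipeline: it assumes a single shared pinhole focal length and principal point, dense 2D observations of every point in every view, and plain gradient descent in place of the Levenberg-Marquardt solvers used in real bundle adjustment. All names and tensor shapes are illustrative.

```python
import torch

def so3_exp(w: torch.Tensor) -> torch.Tensor:
    """Map axis-angle vectors w (V, 3) to rotation matrices via the matrix exponential."""
    zero = torch.zeros_like(w[:, 0])
    W = torch.stack([
        torch.stack([zero,    -w[:, 2],  w[:, 1]], dim=-1),
        torch.stack([w[:, 2],  zero,    -w[:, 0]], dim=-1),
        torch.stack([-w[:, 1], w[:, 0],  zero],    dim=-1),
    ], dim=-2)
    return torch.linalg.matrix_exp(W)

def project(X: torch.Tensor, R: torch.Tensor, t: torch.Tensor,
            f: float, c: torch.Tensor) -> torch.Tensor:
    """Project world points X (P, 3) into V cameras with rotations R (V, 3, 3),
    translations t (V, 3), shared focal length f and principal point c (2,)."""
    cam = X @ R.transpose(-1, -2) + t[:, None, :]       # (V, P, 3) world -> camera
    uv = cam[..., :2] / cam[..., 2:3].clamp(min=1e-6)   # perspective division
    return f * uv + c                                   # (V, P, 2) pixel coordinates

def refine(R_init, t_init, X_init, obs, f=500.0,
           c=torch.tensor([320.0, 240.0]), iters=200, lr=1e-3):
    """Refine poses and points by minimizing reprojection error against observations
    obs (V, P, 2), starting from the network's feed-forward predictions."""
    dw = torch.zeros(R_init.shape[0], 3, requires_grad=True)  # rotation updates (axis-angle)
    t = t_init.clone().requires_grad_(True)
    X = X_init.clone().requires_grad_(True)
    opt = torch.optim.Adam([dw, t, X], lr=lr)
    for _ in range(iters):
        opt.zero_grad()
        R = so3_exp(dw) @ R_init
        loss = (project(X, R, t, f, c) - obs).square().mean()
        loss.backward()
        opt.step()
    return (so3_exp(dw) @ R_init).detach(), t.detach(), X.detach()
```

Starting near the optimum is what makes the refinement fast: from the network's predicted cameras and point map, only a short optimization run is needed, whereas the same procedure from a random initialization would require far more iterations and could stall in poor local minima.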