Learning to See Before Seeing: Demystifying LLM Visual Priors from Language Pre-training

AI Audio Lecture + Q&A
Transcript
John: Welcome to CS 7643: Advanced Topics in Multimodal AI. Today's lecture is on 'Learning to See Before Seeing: Demystifying LLM Visual Priors from Language Pre-training.' We've seen a lot of recent work, like 'LLaVA-MORE' and 'VILA,' focusing on optimizing multimodal instruction tuning. This paper, from researchers at Meta Superintelligence Labs and the University of Oxford, takes a step back. It asks a more fundamental question: how much does a language model already know about the visual world before it ever sees a single image? Yes, Noah?

Noah: Excuse me, Professor. So the core idea is that LLMs are learning about vision without ever seeing an image? Just from text?

John: Precisely. That's the paradox that motivates this work. LLMs trained only on text develop what the authors call rich 'visual priors.' For instance, they can generate code to render 3D scenes, or adapt to vision tasks with very little multimodal data, which implies a strong visual foundation is already there. The main objective of this paper was to systematically figure out where these priors come from. Are they a single block of knowledge, or can they be broken down? And most importantly, can we deliberately cultivate them during pre-training to build better multimodal models?

Noah: That's a huge claim. How did they even begin to test this? It seems like there would be too many confounding variables to isolate the source of these priors.

John: Their approach was a massive, controlled ablation study: over 100 experiments that consumed half a million GPU-hours. They trained several Llama-3-style models at different scales, from 340 million to 13 billion parameters. The key was the data. They used 16 different text sources (academic papers, code, math, literature, web crawls) and trained models on specific diets of this data. Then they adapted each of these text-only LLMs into a multimodal LLM using a standardized, two-stage process and evaluated them on a suite of 16 different visual question answering benchmarks. This allowed them to connect specific pre-training data choices to specific downstream visual capabilities.

Noah: Wait, how did they separate the effect of the pre-training data from the multimodal tuning data? Wouldn't the vision-language instruction data just teach the model what it needs to know, masking the original priors?

John: That's a critical point. They addressed it by intentionally using a smaller, curated dataset for the multimodal adaptation phase. With less visual instruction data, the differences between the foundational LLMs became much more apparent; it amplified the signal from the pre-existing priors. This methodology led to their main findings. Reasoning-centric data, like code and mathematics, turned out to be a powerful driver of complex visual reasoning. In contrast, a small amount of text describing the visual world was necessary for baseline perception, but its benefits saturated quickly. Essentially, the model's ability to reason about what it sees depends on its language pre-training far more than its ability to simply perceive it does.
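To make that experimental design concrete, here is a minimal, self-contained sketch of the controlled-ablation loop John describes: pre-train an LLM on a chosen text diet, adapt it to a multimodal model with a fixed two-stage recipe, then score it on visual question answering benchmarks. The function names, stub bodies, and example diets below are illustrative assumptions only, not the paper's actual pipeline.

```python
"""Illustrative sketch only: all training and evaluation calls are dummy
stubs standing in for the paper's real pipeline (Llama-3-style text
pre-training, a standardized two-stage multimodal adaptation, and a suite
of visual question answering benchmarks)."""

from typing import Dict, List


def pretrain_llm(mixture: Dict[str, float]) -> str:
    # Stub: text-only pre-training on the given data diet.
    return "llm(" + ", ".join(f"{k}={v}" for k, v in mixture.items()) + ")"


def adapt_to_mllm(llm: str) -> str:
    # Stub: the fixed two-stage multimodal adaptation applied to every LLM,
    # deliberately using a small curated visual-instruction set so that
    # differences between the underlying LLMs remain visible.
    return f"mllm[{llm}]"


def evaluate(mllm: str, benchmark: str) -> float:
    # Stub: score on one visual question answering benchmark.
    return 0.0


def run_ablation(mixtures: Dict[str, Dict[str, float]],
                 benchmarks: List[str]) -> Dict[str, Dict[str, float]]:
    """Connect each pre-training data diet to downstream visual scores."""
    results = {}
    for name, mixture in mixtures.items():
        mllm = adapt_to_mllm(pretrain_llm(mixture))
        results[name] = {b: evaluate(mllm, b) for b in benchmarks}
    return results


if __name__ == "__main__":
    # Hypothetical diets; the paper sweeps 16 text sources across many runs.
    mixtures = {
        "reasoning_heavy": {"code": 0.4, "math": 0.2, "web_crawl": 0.4},
        "web_only": {"web_crawl": 1.0},
    }
    print(run_ablation(mixtures, ["vqa_bench_a", "vqa_bench_b"]))
```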
Noah: So they're saying the visual prior isn't monolithic? It has separate parts?

John: Correct. They propose that it decomposes into at least two components. First, a 'Perception Prior,' which seems to emerge from broad, diverse data like a general web crawl and covers basic object recognition and properties. Second, and more critically, a 'Reasoning Prior,' which is cultivated by, and scales with, reasoning-centric data. This prior is more abstract and modality-agnostic, handling things like spatial relationships and multi-step visual logic. They found the reasoning prior to be highly transferable, benefiting the MLLM regardless of which vision encoder was attached.

Noah: That's interesting. The separation into perception and reasoning sounds a lot like the ideas in 'Bring Reason to Vision,' which used model merging. And this whole concept seems to support the Platonic Representation Hypothesis they mentioned. Is that the main theoretical takeaway?

John: Exactly. Their findings provide strong empirical support for that hypothesis: the idea that as models scale, they learn an underlying, unified world model, and modalities like text or images are just different 'projections' of it. This work shows how the structure of that world model can be recovered from text alone. The biggest impact here is a data-centric roadmap for building better MLLMs. It shifts the paradigm from hoping visual capabilities emerge serendipitously to cultivating them deliberately. Their optimal 'vision-favorable' data mixture wasn't just web data; it was heavily skewed, with about 60% reasoning content and 15% descriptions of the visual world.

Noah: So the recipe is just to use more code and math? Did they find any trade-offs? Does creating a 'vision-favorable' LLM hurt its core language abilities?

John: A valid concern. They tested this by creating a 'balanced mixture.' When they scaled it up to a 7B model trained on a trillion tokens, it outperformed their 'language-favorable' baseline on all the visual benchmarks while also doing better on standard language evaluations, such as perplexity. So, with the right calibration, you can strengthen visual priors without a significant compromise, and perhaps even with a synergistic benefit to core language skills.

John: So, the key takeaway is that the path to better multimodal models might not start with more images, but with more thoughtfully curated text. By choosing the right linguistic diet, we can build a stronger reasoning foundation into the language model itself. To teach a model to see, first teach it how to think. The visual world, it seems, is deeply encoded in the abstract structure of our language and logic. Thanks for listening. If you have any further questions, ask our AI assistant or drop a comment.
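As a footnote to the mixture numbers John quotes, here is a minimal sketch of what sampling from such a 'vision-favorable' pre-training mixture could look like. The roughly 60% reasoning and 15% visual-world split comes from the lecture; the remaining 25% bucket, the category names, and the sampling scheme itself are illustrative assumptions rather than the paper's exact recipe.

```python
"""Sketch of sampling pre-training documents from a 'vision-favorable'
mixture (~60% reasoning-centric text, ~15% visual-world descriptions, as
quoted in the lecture). The leftover 25% bucket is a placeholder."""

import random

VISION_FAVORABLE_MIX = {
    "reasoning_centric": 0.60,  # code, math, and other reasoning-heavy text
    "visual_world":      0.15,  # text describing objects, scenes, spatial layout
    "general_text":      0.25,  # unspecified remainder (assumption)
}


def sample_source(mix: dict, rng: random.Random) -> str:
    """Pick the source category of the next pre-training document
    in proportion to the mixture weights."""
    categories, weights = zip(*mix.items())
    return rng.choices(categories, weights=weights, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    draws = [sample_source(VISION_FAVORABLE_MIX, rng) for _ in range(10_000)]
    for category in VISION_FAVORABLE_MIX:
        print(category, round(draws.count(category) / len(draws), 3))
```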