TUNA: Taming Unified Visual Representations for Native Unified Multimodal Models

BibTeX
@misc{liu2025tunatamingunified,
      title={TUNA: Taming Unified Visual Representations for Native Unified Multimodal Models},
      author={Zhiheng Liu and Weiming Ren and Haozhe Liu and Zijian Zhou and Shoufa Chen and Haonan Qiu and Xiaoke Huang and Zhaochong An and Fanny Yang and Aditya Patel and Viktar Atliha and Tony Ng and Xiao Han and Chuyan Zhu and Chenyang Zhang and Ding Liu and Juan-Manuel Perez-Rua and Sen He and Jürgen Schmidhuber and Wenhu Chen and Ping Luo and Wei Liu and Tao Xiang and Jonas Schult and Yuren Cong},
      year={2025},
      eprint={2512.02014},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2512.02014},
}
AI Audio Lecture + Q&A
Transcript
John: Welcome to Advanced Topics in Multimodal AI. Today's lecture is on 'TUNA: Taming Unified Visual Representations for Native Unified Multimodal Models'. We've seen a surge in unified models recently, like BLIP3-o, that try to handle both understanding and generation. This work, primarily from Meta BizAI, takes on a core challenge: can one visual representation effectively serve both tasks without compromise? The field is pushing towards single, general-purpose models, and TUNA proposes a specific path to get there by avoiding the pitfalls of previous approaches. Go ahead, Noah?

Noah: Excuse me, Professor. You mentioned 'native' unified models. The paper contrasts this with 'composite' models. Could you quickly clarify that distinction?

John: An excellent starting point. A 'composite' model essentially bolts together separate, pre-trained understanding and generation models with some adapter layers. It's practical, but the core components never truly learn together. A 'native' model, which is what TUNA is, is trained from the ground up as a single, cohesive system on both types of tasks. The goal is to achieve true synergy, where learning to generate better visuals actually helps the model understand them better, and vice versa.

Noah: And previous native models struggled with this?

John: Correct. They typically followed one of two paths, both with issues. The first used 'decoupled' visual encoders: one for understanding, one for generation. This led to what the authors call 'representation mismatches,' where the outputs of the two encoders were formatted differently, creating conflicts and increasing model size. The second path used a single 'unified' encoder, but those models often showed imbalanced performance, becoming good at one task at the expense of the other. TUNA's core idea is to create a new kind of unified representation that solves this: it cascades a VAE encoder with a representation encoder, like SigLIP 2.

Noah: So they're feeding the output of one encoder into another? Why not just use one powerful encoder for everything?

John: Because different encoders are good at different things. VAEs excel at compressing an image into a latent space and reconstructing it with high fidelity, which is critical for generation quality. But their latent features aren't always rich in semantic meaning. Representation encoders like SigLIP 2, on the other hand, are trained to extract powerful semantic features, which is exactly what you need for understanding tasks. By feeding the VAE's continuous latents through the representation encoder, TUNA gets the best of both worlds: a feature space that is grounded in reconstructive detail but also enriched with high-level semantics. This unified representation serves as the foundation for both the text-generating LLM and the image-generating flow matching head.
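To make the cascaded design concrete, here is a minimal sketch of the data flow just described: a VAE-style encoder produces continuous latents, a representation encoder turns them into a unified token sequence, and that single sequence feeds both an LLM connector (understanding) and a flow-matching head (generation). This is an illustrative sketch, not the authors' implementation; the module names, dimensions, and toy heads are all assumptions.

```python
# Minimal sketch (not the authors' code) of the cascaded visual pathway described
# in the lecture: VAE latents -> representation encoder -> unified tokens, which are
# consumed by both the understanding and the generation branch.
import torch
import torch.nn as nn


class CascadedVisualEncoder(nn.Module):
    def __init__(self, latent_ch=16, dim=768, depth=4, heads=8):
        super().__init__()
        # Stand-in for a VAE encoder producing continuous image latents.
        self.vae_encoder = nn.Conv2d(3, latent_ch, kernel_size=8, stride=8)
        # Stand-in for a SigLIP-2-style representation encoder applied to those latents.
        self.patch_embed = nn.Conv2d(latent_ch, dim, kernel_size=2, stride=2)
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.rep_encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, images):                                    # (B, 3, H, W)
        z = self.vae_encoder(images)                               # continuous latents
        tokens = self.patch_embed(z).flatten(2).transpose(1, 2)    # (B, N, dim)
        return self.rep_encoder(tokens)                            # unified representation


# One representation, two consumers: a connector into the LLM for understanding and
# a toy stand-in for the flow-matching head for generation.
encoder = CascadedVisualEncoder()
to_llm = nn.Linear(768, 1024)     # connector into the LLM's embedding space
flow_head = nn.Linear(768, 64)    # toy stand-in for the flow-matching head

feats = encoder(torch.randn(2, 3, 256, 256))
print(to_llm(feats).shape, flow_head(feats).shape)  # understanding vs. generation branch
```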
Noah: Okay, that makes sense. But how does the training process manage to balance these two very different objectives?

John: They use a carefully structured three-stage training pipeline. In Stage One, they freeze the main LLM decoder and only train the vision components: the representation encoder, the connector, and the flow matching head. This stage uses image captioning and text-to-image generation data. Its purpose is to adapt the vision encoder to the dual-task setup and get the generation head started on solid footing. In Stage Two, they unfreeze the LLM and continue pretraining on a wider mix of data, including image editing and video captioning. This allows the entire model to learn more complex reasoning. Finally, Stage Three is a supervised finetuning pass on a curated set of high-quality instruction-following datasets to refine its capabilities.

Noah: Wait, I'm a bit confused about Stage One. If the LLM is frozen, how does the model learn the crucial vision-language alignment needed for instruction following?

John: That's a sharp question. The alignment begins in Stage One, but it's primarily focused on the visual side. By training on image captioning, the representation encoder learns to produce features that are semantically meaningful and align with text. And by training on text-to-image generation, gradients from the generation loss flow back to the representation encoder, teaching it which features matter for creating images. So, it's about getting the visual representation 'ready' to be used by the LLM, and the two are then fully integrated during Stage Two. This progressive approach seems to prevent the training from becoming unstable.

Noah: So, how does this cascaded design really differ from a competitor like Show-o2, which the paper mentions also tried a unified approach?

John: The key difference is in how and when the features are fused. Show-o2 uses a dual-path architecture with a 'late-fusion' mechanism, combining features from its understanding and generation branches at the end. The TUNA paper's analysis suggests this resulted in a representation heavily biased towards the semantic, understanding-focused features, which limited its generation quality. TUNA's approach is a 'deep fusion' by design: since the representation encoder processes the VAE latents directly, the generation and understanding signals are integrated early and at every layer of the encoder. This creates a more balanced and robust representation space.

Noah: And the results support this idea of mutual enhancement?

John: They do. The ablation studies are quite clear on this. When they trained the TUNA architecture jointly on both understanding and generation tasks, it performed better on understanding benchmarks than a version trained only on understanding, and better on generation benchmarks than a version trained only on generation. This provides direct evidence for the synergy they were aiming for. It suggests the model genuinely learns a more powerful and generalizable visual representation because of the dual objectives, rather than being hindered by them.

John: So, the main takeaway here is that TUNA provides a successful and surprisingly efficient blueprint for native unified multimodal models. By cascading a VAE with a representation encoder, it creates a single, continuous visual representation that elegantly bridges the gap between understanding and generation. This design not only avoids the conflicts of previous architectures but actively fosters mutual enhancement between tasks, achieving state-of-the-art results with relatively modest model sizes. It's a significant step toward truly general-purpose multimodal AI. Thanks for listening. If you have any further questions, ask our AI assistant or drop a comment.
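The three-stage schedule discussed in the lecture can likewise be sketched as a simple freeze/unfreeze helper. This is illustrative only, not the paper's training code; the module names mirror the sketch above, and keeping the VAE encoder frozen throughout is our assumption rather than something stated in the lecture.

```python
# Illustrative freeze/unfreeze helper for the three-stage pipeline described above.
# Stage 1: LLM frozen; representation encoder, connector, and flow-matching head train.
# Stages 2-3: the LLM is unfrozen and the whole model trains (Stage 3 uses SFT data).
# Keeping the VAE encoder frozen in every stage is an assumption, not a stated detail.
import torch.nn as nn


def set_trainable(module: nn.Module, flag: bool) -> None:
    for p in module.parameters():
        p.requires_grad = flag


def configure_stage(stage: int, vae_encoder, rep_encoder, connector, flow_head, llm) -> None:
    set_trainable(vae_encoder, False)   # assumed frozen throughout
    set_trainable(rep_encoder, True)    # vision components train from Stage 1
    set_trainable(connector, True)
    set_trainable(flow_head, True)
    set_trainable(llm, stage >= 2)      # the LLM decoder joins in Stage 2


# Tiny usage example with placeholder modules; each stage's optimizer would be built
# over the parameters left with requires_grad=True.
if __name__ == "__main__":
    mods = {name: nn.Linear(4, 4) for name in
            ("vae_encoder", "rep_encoder", "connector", "flow_head", "llm")}
    configure_stage(1, **mods)
    print(any(p.requires_grad for p in mods["llm"].parameters()))  # False in Stage 1
```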