Ming-UniVision: Joint Image Understanding and Generation with a Unified Continuous Tokenizer

BibTeX
@misc{huang2025mingunivisionjointimage,
      title={Ming-UniVision: Joint Image Understanding and Generation with a Unified Continuous Tokenizer},
      author={Ziyuan Huang and DanDan Zheng and Cheng Zou and Rui Liu and Xiaolong Wang and Kaixiang Ji and Weilong Chai and Jianxin Sun and Libin Wang and Yongjie Lv and Taozhi Huang and Jiajia Liu and Qingpei Guo and Ming Yang and Jingdong Chen and Jun Zhou},
      year={2025},
      eprint={2510.06590},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2510.06590},
}
GitHub: inclusionAI/Ming-UniVision
HTTPS: https://github.com/inclusionAI/Ming-UniVision
SSH: git@github.com:inclusionAI/Ming-UniVision.git
CLI: gh repo clone inclusionAI/Ming-UniVision
AI Audio Lecture + Q&A
Transcript
Speaker 1: So, we've seen a lot of progress lately in vision-language models, but a persistent challenge has been truly unifying image understanding and generation. Typically, these tasks rely on fundamentally different visual representations and tokenization schemes, leading to architectural complexity and inefficiency. This paper, Ming-UniVision, from Inclusion AI at Ant Group, aims to tackle this head-on by introducing a novel continuous visual tokenizer, MingTok, within a single autoregressive framework. It's about bridging that gap with a more coherent, unified approach.

Speaker 2: Right, so instead of separate pipelines for, say, image captioning and text-to-image synthesis, they're proposing a single system. It sounds like they're trying to overcome the limitations of the discrete tokens we often see, which can lose a lot of information. Is that the core idea?

Speaker 1: Exactly. The core contribution is MingTok, their unified continuous tokenizer, and the Ming-UniVision model that leverages it. Traditionally, discrete tokenization schemes, while good for aligning with large language models, introduce quantization errors. Think of it like trying to describe a smooth curve using only straight line segments: you lose fidelity. Ming-UniVision moves to a continuous latent space, aiming to eliminate that loss. The real innovation is how MingTok reconciles the often-competing demands of understanding and generation. Understanding typically needs high-dimensional, discriminative semantic features, while generation requires compact, low-dimensional latent codes that preserve fine-grained visual detail. MingTok uses a three-stage sequential architecture to cater to both at once. This allows the unified multimodal model to formulate both understanding, like visual question answering, and generation, like text-to-image, as next-token prediction within a single, shared continuous space. It's a significant step towards truly integrated multimodal AI, leveraging the in-context learning and compositional reasoning abilities of LLMs directly in the visual domain without costly conversions between distinct latent representations.

Speaker 2: Okay, so the continuous nature avoids the 'pixelation', or information loss, of discrete tokens, and the multi-stage MingTok itself is designed to output both the compressed details for generation and the rich semantics for understanding. That's a clever way to address the divergent needs without separate tokenizers. So, how does this actually play out in practice? What does MingTok's architecture look like to achieve this dual purpose?

Speaker 1: The methodology is quite intricate. MingTok employs a three-stage sequential architecture. First, a low-level encoder takes raw image pixels and compresses them into compact, continuous latent embeddings, optimized for efficient autoregressive generation; think of it as a high-efficiency visual summarizer. Then, a semantic decoder expands these compact latents into high-dimensional semantic features suitable for understanding tasks, using causal attention so it stays compatible with autoregressive generation. Finally, a pixel decoder reconstructs the original image from these high-dimensional features, recovering fine visual details. The training is also quite clever: they optimize MingTok end-to-end with a multi-task learning framework based on masked image modeling.
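To make the three-stage pipeline described above easier to picture, here is a minimal PyTorch sketch. The module names, dimensions, and layer choices are illustrative assumptions, not the authors' released implementation; the point is only the data flow from pixels to compact latents, to high-dimensional semantic features, and back to pixels.

# Minimal sketch of a three-stage continuous tokenizer in the spirit of MingTok.
# All sizes and layer choices are illustrative assumptions.
import torch
import torch.nn as nn


class LowLevelEncoder(nn.Module):
    """Stage 1: compress raw pixels into compact continuous latents for AR generation."""
    def __init__(self, patch=16, latent_dim=32):
        super().__init__()
        self.proj = nn.Conv2d(3, latent_dim, kernel_size=patch, stride=patch)

    def forward(self, img):                          # img: (B, 3, H, W)
        z = self.proj(img)                           # (B, latent_dim, H/p, W/p)
        return z.flatten(2).transpose(1, 2)          # (B, N, latent_dim) compact latents


class SemanticDecoder(nn.Module):
    """Stage 2: expand compact latents into high-dimensional semantic features,
    using causal attention so the expansion stays autoregression-friendly."""
    def __init__(self, latent_dim=32, sem_dim=1024, depth=4, heads=8):
        super().__init__()
        self.up = nn.Linear(latent_dim, sem_dim)
        layer = nn.TransformerEncoderLayer(sem_dim, heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)

    def forward(self, z):                            # z: (B, N, latent_dim)
        x = self.up(z)
        causal = nn.Transformer.generate_square_subsequent_mask(x.size(1)).to(x.device)
        return self.blocks(x, mask=causal)           # (B, N, sem_dim) semantic features


class PixelDecoder(nn.Module):
    """Stage 3: reconstruct pixels from semantic features to recover fine detail."""
    def __init__(self, sem_dim=1024, patch=16):
        super().__init__()
        self.to_patch = nn.Linear(sem_dim, 3 * patch * patch)
        self.patch = patch

    def forward(self, feats, grid):                  # feats: (B, N, sem_dim), N = gh * gw
        gh, gw = grid
        b, p = feats.size(0), self.patch
        x = self.to_patch(feats).view(b, gh, gw, 3, p, p)
        return x.permute(0, 3, 1, 4, 2, 5).reshape(b, 3, gh * p, gw * p)


if __name__ == "__main__":
    img = torch.randn(1, 3, 256, 256)
    enc, sem, pix = LowLevelEncoder(), SemanticDecoder(), PixelDecoder()
    z = enc(img)                                     # compact latents (generation side)
    feats = sem(z)                                   # semantic features (understanding side)
    recon = pix(feats, (16, 16))                     # 256 / 16 = 16 patches per side
    print(z.shape, feats.shape, recon.shape)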
Speaker 1 (continuing): This training ensures structural compactness for the low-level encoder, semantic expressiveness for the semantic decoder, and high-fidelity reconstruction for the pixel decoder. For instance, the low-level latent space is regularized by predicting features from a pre-trained DINOv2 model, while the semantic features are supervised by CLIP-aligned backbones. This multi-objective setup keeps the continuous latent space rich and structured for both tasks. As for the Ming-UniVision model itself, it consumes MingTok's high-dimensional semantic features for both understanding and generation, framing both as next-token prediction. For visual generation, a vision head predicts the compact continuous latents with a rectified-flow objective, chosen for faster convergence; the predicted latents are then expanded by the semantic decoder and fed back to the LLM. This unified input representation and prediction objective enable efficient multi-round, in-context multimodal interaction while significantly reducing visual token counts and overhead.
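A note on the rectified-flow objective mentioned above: the sketch below shows the standard rectified-flow (flow-matching) training step one might use for such a vision head. The head architecture, the conditioning on LLM hidden states, and all dimensions are assumptions for illustration; only the interpolation path and velocity target follow the usual rectified-flow formulation, and none of this is the released Ming-UniVision code.

# Minimal rectified-flow training step for a hypothetical vision head that
# predicts compact continuous latents conditioned on LLM hidden states.
import torch
import torch.nn as nn


class VisionHead(nn.Module):
    """Predicts a velocity field over compact latents, conditioned on LLM states."""
    def __init__(self, hidden_dim=2048, latent_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_dim + latent_dim + 1, 1024),
            nn.SiLU(),
            nn.Linear(1024, latent_dim),
        )

    def forward(self, llm_state, z_t, t):
        # llm_state: (B, N, hidden_dim), z_t: (B, N, latent_dim), t: (B, 1, 1)
        t = t.expand(z_t.size(0), z_t.size(1), 1)
        return self.net(torch.cat([llm_state, z_t, t], dim=-1))


def rectified_flow_loss(head, llm_state, z_target):
    """Sample t ~ U(0, 1), interpolate z_t = (1 - t) * noise + t * z_target,
    and regress the head's output onto the constant velocity (z_target - noise)."""
    noise = torch.randn_like(z_target)
    t = torch.rand(z_target.size(0), 1, 1, device=z_target.device)
    z_t = (1.0 - t) * noise + t * z_target
    velocity_target = z_target - noise
    velocity_pred = head(llm_state, z_t, t)
    return (velocity_pred - velocity_target).pow(2).mean()


if __name__ == "__main__":
    head = VisionHead()
    llm_state = torch.randn(2, 64, 2048)     # hidden states for 64 visual positions
    z_target = torch.randn(2, 64, 32)        # ground-truth compact latents
    print(rectified_flow_loss(head, llm_state, z_target).item())

At inference time, compact latents would be sampled by integrating the predicted velocity from noise towards data (for example with a few Euler steps) before the semantic decoder expands them, matching the generation flow described in the transcript.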
Speaker 2: That's a very elegant solution to the dual requirements. The three-stage tokenizer, trained with specific objectives for each stage, creates a latent space that is both compact for synthesis and expressive for analysis. And by using that continuous space and a unified prediction paradigm, they're essentially creating a single 'language' for both tasks. That makes a lot of sense for enabling multi-round interaction, because you're not constantly translating between different representations. So, what's the broader implication of this kind of unification for multimodal AI research?

Speaker 1: The implications are substantial. Ming-UniVision validates the feasibility and benefits of a single, shared visual representation that serves these divergent needs simultaneously, and it marks a real advance towards efficient, seamless multi-round interaction. Think of iterative editing, super-resolution, or matting, tasks that today often require repeated encode-decode cycles through distinct pixel and latent spaces. By staying within its unified latent space, the model cuts computational overhead and minimizes cumulative quality degradation, moving from fragmented, stateless operations to a coherent, stateful visual dialogue. It also advances multimodal reasoning and control, demonstrating strong compositional control in image generation. The introduction of Visualized Chain-of-Thought for visual reasoning is particularly noteworthy: the model's intermediate 'thought process', such as highlighted regions, guides the editing, which makes its behavior more transparent and predictable.

Speaker 2: It really sounds like they're enabling a more conversational, intuitive interaction with visual AI, moving beyond one-off commands to iterative, context-aware workflows. That's a huge step forward for practical applications.

Speaker 1: Absolutely. This paper is a significant stride towards general-purpose visual intelligence. By showing that a unified continuous visual representation can not only reconcile competing task requirements but also deliver state-of-the-art performance in both understanding and generation, Ming-UniVision sets a new standard. The key takeaway is that a truly unified, continuous visual representation is not just an architectural simplification but a powerful foundation for more integrated, efficient, and cognitively aligned multimodal AI systems.

Speaker 2: Agreed. It's pushing the boundaries of what a single model can accomplish across the entire visual pipeline. Very cool stuff.
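To make the "staying within its unified latent space" claim from the discussion concrete, here is a schematic multi-round loop built from stand-in modules. Every name, shape, and the omission of text conditioning are simplifications for illustration, not the Ming-UniVision API; the point is only that intermediate rounds pass compact latents forward directly instead of repeatedly decoding to pixels and re-encoding.

# Schematic multi-round visual dialogue that stays in a shared continuous latent
# space between rounds, decoding to pixels only when an image must be shown.
# All modules are tiny stubs with made-up sizes, purely to show the data flow.
import torch
import torch.nn as nn

IMG, N_TOK, LATENT, SEM = 32, 16, 16, 128                 # tiny illustrative sizes

encoder = nn.Linear(3 * IMG * IMG, N_TOK * LATENT)        # stub: pixels -> compact latents
semantic_decoder = nn.Linear(LATENT, SEM)                 # stub: latents -> semantic features
pixel_decoder = nn.Linear(N_TOK * SEM, 3 * IMG * IMG)     # stub: features -> pixels
unified_lm = nn.Linear(SEM, LATENT)                       # stands in for the LLM + vision head


def run_dialogue(image, instructions):
    """Each round edits in latent space; pixels are decoded once, at the end."""
    latents = encoder(image.flatten(1)).view(1, N_TOK, LATENT)
    for text in instructions:
        features = semantic_decoder(latents)              # shared features the LM would read
        # (text conditioning omitted; a real model would fuse `text` with `features`)
        latents = unified_lm(features)                    # next-round latents, no pixel round-trip
    final = semantic_decoder(latents).flatten(1)
    return pixel_decoder(final).view(1, 3, IMG, IMG)


if __name__ == "__main__":
    img = torch.randn(1, 3, IMG, IMG)
    print(run_dialogue(img, ["brighten the sky", "add a boat"]).shape)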