Transcript
John: Welcome to Advanced Topics in Generative Models. Today's lecture is on 'Boosting Generative Image Modeling via Joint Image-Feature Synthesis.' We've seen Latent Diffusion Models, or LDMs, become the standard, building on works like 'High-Resolution Image Synthesis with Latent Diffusion Models'. Recently, the focus has shifted to improving their learned representations and training speed, with approaches like 'Representation Alignment for Generation' gaining traction. This paper, from a group at Athena Research Center in Greece, proposes a different path to integrating semantic understanding into the generation process. Yes, Noah?
Noah: Excuse me, Professor. You mentioned representation alignment. Is this paper's approach similar, or is it a direct counter-argument to that method?
John: That's the central question. It's not a counter-argument so much as an alternative philosophy. While methods like REPA use distillation to align the diffusion model's internal features with those of a pre-trained encoder, this paper argues for a more direct approach: joint modeling. Instead of teaching one model to mimic another, they train a single diffusion model to denoise two things at once: the compressed image latents from a VAE and the high-level semantic features from an encoder like DINOv2. The core idea is to make the model inherently bilingual, fluent in both visual structure and semantic meaning from the very start.
Noah: So the hypothesis is that forcing the model to denoise both simultaneously is a more direct and efficient way to embed semantics, rather than adding a separate alignment objective.
John: Precisely. This simplifies the training pipeline by removing the need for complex distillation losses. The framework, which they call ReDi, uses a Diffusion Transformer architecture. It takes the VAE image tokens and the DINOv2 feature tokens, adds noise to both sets, and then tasks the transformer with predicting the original, clean versions of both. By doing this, the model learns the statistical dependencies between low-level visual details and high-level semantic concepts organically. Their findings show this not only improves the final image quality, as measured by FID, but also significantly accelerates training convergence compared to baselines.
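To make the idea concrete, here is a minimal sketch of what one joint denoising training step could look like, assuming a hypothetical DiT-style denoiser `model` that predicts noise for every token, a toy linear noise schedule, and both token sets already projected to a common width; the actual ReDi implementation, parameterization, and schedules may differ:

```python
import torch
import torch.nn.functional as F

def joint_denoising_step(model, vae_latents, dino_feats, optimizer):
    """One sketched training step of joint image-latent / semantic-feature denoising.

    vae_latents: (B, N_img, D)  tokens from the VAE encoder
    dino_feats:  (B, N_feat, D) DINOv2 feature tokens, already reduced/projected to width D
    model:       hypothetical DiT-style transformer taking noisy tokens + timestep
                 and predicting the noise for every token
    """
    B = vae_latents.size(0)
    t = torch.rand(B, device=vae_latents.device)      # continuous timestep in [0, 1]
    alpha = (1.0 - t).view(B, 1, 1)                    # toy linear noise schedule (assumption)
    sigma = t.view(B, 1, 1)

    # Concatenate the two token sets so a single model denoises both jointly.
    x0 = torch.cat([vae_latents, dino_feats], dim=1)
    noise = torch.randn_like(x0)
    x_t = alpha * x0 + sigma * noise                   # noise both modalities together

    pred_noise = model(x_t, t)                         # one forward pass covers both modalities
    loss = F.mse_loss(pred_noise, noise)               # single joint objective, no distillation term

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The key point of the sketch is that one objective covers both token types, so no separate alignment or distillation loss is needed.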
Noah: That makes sense conceptually, but it seems like you'd be mixing very different types of information. The VAE latents encode pixel-level details, while DINO features are abstract. Did they run into issues combining them?
John: They did. A critical implementation detail they discuss is the channel imbalance. The DINOv2 feature representations had a much higher dimensionality than the VAE latents, and simply concatenating them degraded performance. Their solution was to first apply Principal Component Analysis, or PCA, to the DINOv2 features. By reducing their dimensionality, they could balance the influence of the two token types, which was crucial for the model to learn effectively from both without one overwhelming the other.
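As an illustration, a dimensionality reduction of this kind might look like the following; the number of retained components `k` and where the PCA is fit are the authors' design choices, so treat the specifics here as assumptions:

```python
import torch

def fit_pca(features, k):
    """Fit a PCA projection on DINOv2 patch features.

    features: (N, C) matrix of patch features collected over a sample of the dataset
              (C is in the hundreds for DINOv2, far wider than the VAE latent channels).
    k:        target dimensionality chosen to balance the two token types (hyperparameter).
    Returns the feature mean and the top-k principal directions.
    """
    mean = features.mean(dim=0, keepdim=True)
    centered = features - mean
    # Low-rank SVD: the columns of V are the principal directions.
    U, S, V = torch.pca_lowrank(centered, q=k, center=False)
    return mean, V[:, :k]

def reduce_dino_features(features, mean, components):
    """Project (B, N, C) DINOv2 tokens down to (B, N, k)."""
    return (features - mean) @ components
```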
Noah: Okay, so PCA solved a key architectural problem. What about at inference time? Is there a way to leverage this dual understanding?
John: Yes, and that's their second main contribution. They introduce a novel inference strategy called Representation Guidance. During each denoising step, the model predicts both the clean image latent and the clean semantic features. Representation Guidance uses the model's own prediction of the semantic features to steer, or guide, the update for the image latent. Essentially, it nudges the image generation towards a result that is more semantically coherent according to what the model itself has learned about the relationship between images and features. This further refines the output quality beyond what the joint training alone provides.
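The lecture doesn't spell out the exact guidance formula, but one illustrative way to phrase the idea, loosely analogous to classifier-free-guidance extrapolation and assuming a hypothetical `model(x_img_t, x_feat_t, t)` interface that returns both predictions, is:

```python
import torch

@torch.no_grad()
def guided_image_prediction(model, x_img_t, x_feat_t, t, w_rep=1.5):
    """Illustrative 'Representation Guidance' step (not necessarily the paper's exact formula).

    The model jointly predicts clean image latents and clean semantic features.
    The image prediction is then nudged toward the prediction obtained when the
    feature tokens are replaced by the model's own feature estimate, i.e. the
    image update is steered by the predicted semantics.
    """
    # Joint prediction from the current noisy state.
    img_pred, feat_pred = model(x_img_t, x_feat_t, t)

    # Re-predict the image latents with the feature tokens swapped for the
    # model's own (cleaner) feature estimate.
    img_pred_given_feat, _ = model(x_img_t, feat_pred, t)

    # Guidance: extrapolate toward the semantically informed prediction.
    return img_pred + w_rep * (img_pred_given_feat - img_pred), feat_pred
```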
Noah: How does Representation Guidance differ from standard Classifier-Free Guidance?
John: Classifier-Free Guidance, or CFG, typically steers generation based on an external condition, like a class label or a text prompt. Representation Guidance is more of an internal consistency check: it doesn't rely on an external label, but instead leverages the model's own learned joint distribution of images and features to refine the generation. Interestingly, the authors found that when CFG and Representation Guidance were combined, applying standard CFG only to the image latent part, and not to the semantic feature part, yielded the best results.
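For contrast, a sketch of how CFG might be applied to the image-latent prediction only, leaving the feature prediction unguided; the model interface, null condition, and guidance weight are assumptions for illustration:

```python
import torch

@torch.no_grad()
def cfg_image_only(model, x_img_t, x_feat_t, t, cond, w_cfg=4.0):
    """Classifier-free guidance applied only to the image-latent prediction.

    cond is an external condition (e.g. a class embedding); None stands in
    for the unconditional / null condition. Interface is hypothetical.
    """
    img_c, feat_c = model(x_img_t, x_feat_t, t, cond)   # conditional pass
    img_u, _ = model(x_img_t, x_feat_t, t, None)        # unconditional pass

    img_guided = img_u + w_cfg * (img_c - img_u)        # CFG on image tokens only
    return img_guided, feat_c                           # feature tokens left unguided
```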
Noah: So, the implication is that distillation isn't the only, or maybe even the best, way to inject knowledge from powerful pre-trained models into a generative process?
John: That's a fair interpretation of their claim. This work establishes joint modeling as a strong alternative. By simplifying the training objective and introducing a tailored guidance mechanism, it pushes the state of the art in terms of both image quality and training efficiency for LDMs. It shifts the conversation from 'how do we align representations?' to 'how can we model them jointly from the beginning?' This could influence how we approach integrating different modalities into generative models in the future, potentially for text, audio, or other feature types.
Noah: So if this joint modeling works so well, could this approach be extended beyond DINO features? For instance, with CLIP embeddings for text-to-image synthesis?
John: That's an excellent thought and a natural next step for this line of research. The framework is quite general. As long as you have a meaningful feature representation to pair with the image latents, the principle of joint denoising should apply. Given that one of the authors is affiliated with valeo.ai, which works on automotive technology, you can also imagine applications in generating highly realistic and semantically consistent synthetic data for training autonomous vehicles.
John: To wrap up, this paper presents a compelling framework, ReDi, that effectively integrates semantic understanding into latent diffusion models. Its core contribution is the shift from feature distillation to a simpler, more direct joint modeling of image latents and semantic features. This approach, enabled by a key insight on dimensionality reduction via PCA and enhanced by a novel 'Representation Guidance' technique, leads to faster training and higher-quality image synthesis. Thanks for listening. If you have any further questions, ask our AI assistant or drop a comment.