What matters for Representation Alignment: Global Information or Spatial Structure?

BibTeX
@misc{singh2025whatmattersrepresentation,
      title={What matters for Representation Alignment: Global Information or Spatial Structure?},
      author={Jaskirat Singh and Xingjian Leng and Zongze Wu and Liang Zheng and Richard Zhang and Eli Shechtman and Saining Xie},
      year={2025},
      eprint={2512.10794},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2512.10794},
}
GitHub: https://github.com/end2end-diffusion/irepa
AI Audio Lecture + Q&A
Transcript
John: Welcome to Advanced Topics in Generative Models. Today's lecture is on the paper 'What matters for Representation Alignment: Global Information or Spatial Structure?' from researchers at Adobe Research, ANU, and NYU. We've seen a lot of recent work on accelerating diffusion transformers, most notably the original REPA paper, which proposed aligning internal features with a pretrained vision encoder. The assumption has always been that a stronger encoder, one with better semantic understanding, leads to better generation. This paper directly challenges that idea.

John: Go ahead, Noah.

Noah: Excuse me, Professor. Just to be clear, the baseline idea of REPA is to use a powerful pretrained model, like a CLIP or DINO encoder, as a sort of teacher to guide the diffusion model's internal representations at each step, right?

John: Exactly. The diffusion model is trained to minimize the distance between its own intermediate features and the features from this powerful, frozen 'teacher' encoder. The community's logic was straightforward: a teacher with a better grasp of global concepts, the kind we typically measure with ImageNet accuracy, should provide better guidance. This paper begins by showing that this logic is flawed. Their initial experiments revealed that encoders with very high ImageNet scores sometimes produced worse images than encoders with significantly lower scores. For example, an encoder with over eighty percent accuracy was outperformed by one with just over fifty percent. Even more surprisingly, the SAM2 segmentation model's encoder, which has poor linear-probe accuracy, produced quite good results. This discrepancy is the core motivation for their work.

Noah: So the main contribution is figuring out why that happens?

John: Precisely. The authors hypothesize that the critical factor isn't the encoder's global semantic knowledge, but rather its understanding of spatial structure: the local relationships and coherence between different patch tokens in the image representation. They argue that for a diffusion model trying to build up a coherent image, local coherence is far more important than a high-level label. To test this, they proposed several new metrics to quantify this 'spatial structure'. For instance, one metric measures whether nearby patches have more similar feature representations than distant patches. Another measures how quickly this similarity decays with distance. Across 27 different encoders, they found a very strong correlation between these spatial metrics and final image quality, and a very weak one with ImageNet accuracy. This holds true even for classical, non-deep features like SIFT and HOG, which also improved performance despite having no real semantic understanding.

Noah: Wait, so how did they actually measure this? Are these new spatial metrics complex to compute?

John: They are quite intuitive. The main ones are based on cosine similarity between patch tokens. For example, 'Local vs. Distant Similarity' simply compares the average similarity of adjacent patches to the average similarity of patches that are far apart. A good spatial representation should have high local similarity. 'Correlation Decay Slope' measures how sharply this similarity drops off as patches get farther apart. These are straightforward to calculate and don't require any extra model training.
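For concreteness, here is a minimal PyTorch sketch of the two patch-level metrics just described, computed directly from an encoder's [N, D] grid of patch tokens. The function name, the adjacency radius, and the 'distant' threshold are illustrative assumptions, not the paper's exact definitions.

```python
import torch
import torch.nn.functional as F

def spatial_structure_metrics(patch_feats, grid_size, local_radius=1.0, distant_radius=4.0):
    """Illustrative patch-level spatial metrics for an encoder representation.

    patch_feats: [N, D] tensor of patch tokens laid out on a grid_size x grid_size grid.
    Returns a local-vs-distant similarity gap and a correlation-decay slope.
    """
    feats = F.normalize(patch_feats, dim=-1)           # unit-norm tokens
    sim = feats @ feats.T                               # [N, N] cosine similarities

    # Pairwise spatial distances between patch centers on the grid.
    ys, xs = torch.meshgrid(torch.arange(grid_size), torch.arange(grid_size), indexing="ij")
    coords = torch.stack([ys.flatten(), xs.flatten()], dim=-1).float()  # [N, 2]
    dist = torch.cdist(coords, coords)                  # [N, N]

    off_diag = ~torch.eye(len(coords), dtype=torch.bool)
    local = sim[(dist <= local_radius) & off_diag].mean()      # adjacent patches
    distant = sim[(dist >= distant_radius) & off_diag].mean()  # far-apart patches

    # Least-squares slope of similarity vs. distance over all off-diagonal pairs:
    # a steeper (more negative) slope means similarity decays faster with distance.
    d, s = dist[off_diag], sim[off_diag]
    slope = ((d - d.mean()) * (s - s.mean())).sum() / ((d - d.mean()) ** 2).sum()

    return {"local_vs_distant": (local - distant).item(), "decay_slope": slope.item()}

# Example usage with random features standing in for real encoder tokens:
# metrics = spatial_structure_metrics(torch.randn(256, 768), grid_size=16)
```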
Noah: Okay, that makes sense. So after identifying that spatial structure is key, did they propose a new method?

John: Yes, and its simplicity is a key part of its appeal. They call it improved REPA, or iREPA. It consists of two small modifications. First, the original REPA used a multi-layer perceptron, an MLP, to project the diffusion model's features to match the teacher's. The authors found this MLP was destroying local spatial information, so they replaced it with a single, lightweight convolutional layer, which has a natural inductive bias toward preserving local neighborhoods.

Noah: That seems almost too simple. What was the second change?

John: The second change addresses the features of the teacher encoder itself. They observed that patch tokens from many pretrained encoders contain a strong global component; essentially, every patch knows a little about the entire image, which reduces the contrast between individual patches. So they introduced a spatial normalization layer, similar to instance normalization, that subtracts the mean feature across all patches. This effectively removes the dominant global signal and forces the model to focus on the unique spatial characteristics of each patch. Together, these two changes, amounting to less than four lines of code, consistently and significantly accelerated training and improved final image quality across all their experiments.

Noah: Does this finding suggest that pursuing higher ImageNet accuracy for foundation models is the wrong direction, at least for generative applications?

John: It suggests a diversification of goals. For tasks that require high-level classification, ImageNet accuracy remains a valuable metric. However, for generative alignment, this work provides strong evidence that we need different metrics and possibly different training objectives. It reframes what a 'good' representation is: a good generative teacher needs to excel at describing local structure, not just global identity. This might lead to new 'generative-friendly' encoders optimized for spatial coherence. The work also has broad applicability, as they showed iREPA boosts performance in other advanced frameworks like REPA-E and even for pixel-space models.

Noah: So it's not just about picking the right off-the-shelf encoder anymore, but potentially designing new ones with this spatial bias in mind.

John: Exactly. This research provides a new lens through which to evaluate and design vision models. It shifts the focus from 'what is in the image' at a global level to 'how the parts of the image are arranged' at a local level.

John: So to wrap up, this paper makes a significant contribution by overturning a common assumption in the field. It provides both the empirical evidence identifying spatial structure as the key driver of representation alignment and a practical, easy-to-implement solution, iREPA, that yields substantial gains. The main takeaway is this: for teaching a diffusion model how to build an image, the teacher's understanding of local spatial arrangement matters far more than its ability to name the object in the image. Thanks for listening. If you have any further questions, ask our AI assistant or drop a comment.
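Below is a minimal PyTorch sketch of the two iREPA changes described in the lecture: a convolutional projector in place of REPA's MLP, and a mean-subtraction 'spatial normalization' of the teacher's patch tokens, plugged into a REPA-style negative cosine-similarity alignment term. Module names, the 3x3 kernel size, and the exact loss form are illustrative assumptions; the reference implementation is in the linked irepa repository.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvProjector(nn.Module):
    """Single lightweight convolution used in place of REPA's MLP projector,
    so the projection respects local patch neighborhoods."""
    def __init__(self, dit_dim, enc_dim, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv2d(dit_dim, enc_dim, kernel_size, padding=kernel_size // 2)

    def forward(self, tokens, grid_size):
        # tokens: [B, N, dit_dim] -> [B, dit_dim, H, W] -> conv -> [B, N, enc_dim]
        B, N, C = tokens.shape
        x = tokens.transpose(1, 2).reshape(B, C, grid_size, grid_size)
        x = self.conv(x)
        return x.flatten(2).transpose(1, 2)

def spatial_normalize(teacher_tokens):
    """Subtract the per-image mean token (computed across the patch axis).
    This removes the shared global component the lecture describes; a full
    instance-norm-style variant could also divide by a per-channel std."""
    return teacher_tokens - teacher_tokens.mean(dim=1, keepdim=True)

def alignment_loss(dit_tokens, teacher_tokens, projector, grid_size):
    """REPA-style negative cosine similarity between projected diffusion
    features and spatially normalized, frozen teacher features."""
    pred = projector(dit_tokens, grid_size)
    target = spatial_normalize(teacher_tokens).detach()
    return -F.cosine_similarity(pred, target, dim=-1).mean()
```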