Why Diffusion Models Don't Memorize: The Role of Implicit Dynamical Regularization in Training

BibTeX
@misc{bonnaire2025whydiffusionmodels,
      title={Why Diffusion Models Don't Memorize: The Role of Implicit Dynamical Regularization in Training},
      author={Tony Bonnaire and Raphaël Urfin and Giulio Biroli and Marc Mézard},
      year={2025},
      eprint={2505.17638},
      archivePrefix={arXiv},
      primaryClass={cond-mat.dis-nn},
      url={https://arxiv.org/abs/2505.17638},
}
GitHub: Why-Diffusion-Models-Don-t-Memorize
HTTPS: https://github.com/tbonnair/Why-Diffusion-Models-Don-t-Memorize
SSH: git@github.com:tbonnair/Why-Diffusion-Models-Don-t-Memorize.git
CLI: gh repo clone tbonnair/Why-Diffusion-Models-Don-t-Memorize
AI Audio Lecture + Q&A
Transcript
John: Welcome to Statistical Physics of Machine Learning. Today's lecture is on a paper from researchers at LPENS in Paris and Bocconi University titled 'Why Diffusion Models Don’t Memorize: The Role of Implicit Dynamical Regularization in Training.' We've seen a lot of recent work trying to pin down why these models generalize, from papers like 'On the Edge of Memorization' to 'Generalization in diffusion models arises from geometry-adaptive harmonic representations.' This paper takes a different tack, focusing not on the final state of the model, but on the training process itself. It suggests that generalization is a feature of the learning dynamics.

John: Yes, Noah?

Noah: Hi Professor. So, is the core idea that the optimization process itself, like the path SGD takes, is what prevents memorization, rather than just the model's architecture or the dataset size?

John: Precisely. The authors argue that while factors like model capacity matter, they don't tell the whole story. The central contribution here is the identification of two distinct and well-separated timescales in the training dynamics. The first is what they call generalization time, or tau-gen. This is the point where the model starts producing high-quality, novel samples. The second, which occurs much later, is memorization time, or tau-mem. This is when the model begins to explicitly reproduce training examples. The key insight is that these two times scale differently with the training parameters.

Noah: So what's the scaling relationship?

John: They found that tau-gen, the time to get good samples, is largely independent of the training set size, n. However, tau-mem, the time to start memorizing, increases linearly with n. This means the gap between generating good images and memorizing them (what they call the 'generalization window') grows wider as you add more data. It creates a robust period where you can stop training and have a model that generalizes well without having memorized its training set.

Noah: And what's the proposed mechanism for this separation of timescales?

John: It comes down to spectral bias. Neural networks tend to learn smooth, low-frequency functions first. The true, underlying data distribution can be approximated by a smooth score function, which corresponds to these low-frequency components. The model learns this approximation relatively quickly, leading to generalization at tau-gen. The specific details of the training set, however, introduce very high-frequency, irregular components into the empirical score. The model only learns these fine-grained details after much longer training, which is why memorization occurs later, at tau-mem. The more data you have, the more complex these high-frequency details become, and the longer it takes to learn them.

John: To validate this, they used a dual approach. First, they ran extensive numerical experiments on both the CelebA dataset with a U-Net and a synthetic Gaussian Mixture Model with a ResNet. They needed a way to quantitatively measure memorization, so they developed a 'memorization fraction' metric. A generated sample was flagged as 'memorized' if its closest neighbor in the training set was significantly closer than its second-closest neighbor. This allowed them to pinpoint the onset of memorization, tau-mem, and compare it to the point where FID or KL divergence improved, which is tau-gen. Their experiments confirmed the scaling laws: tau-mem scaled with n, while tau-gen did not.
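A minimal sketch of the nearest-neighbour ratio test behind the 'memorization fraction' described above, in Python with NumPy. The Euclidean distance on flattened samples and the 1/3 distance-ratio threshold are illustrative assumptions, not necessarily the paper's exact criterion.

import numpy as np

def memorization_fraction(generated, train, ratio=1/3):
    # A generated sample counts as 'memorized' when its nearest training
    # example is much closer than the second-nearest one.
    # NOTE: `ratio` is an illustrative threshold, not the paper's exact value.
    memorized = 0
    for x in generated:
        dists = np.sort(np.linalg.norm(train - x, axis=1))  # distances to all training points
        if dists[0] < ratio * dists[1]:                      # "significantly closer" criterion
            memorized += 1
    return memorized / len(generated)

# Usage with random placeholder arrays standing in for flattened images:
gen = np.random.rand(64, 3 * 32 * 32)
trn = np.random.rand(500, 3 * 32 * 32)
print(memorization_fraction(gen, trn))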
Noah: But showing this empirically is one thing. How did they support it theoretically? U-Nets and ResNets are notoriously difficult to analyze.

John: That's the second part of their methodology. For the theoretical analysis, they simplified the problem by using a Random Features Neural Network, or RFNN. This is a two-layer network where only the second layer is trained, making it analytically tractable. By studying the training dynamics in a high-dimensional limit and applying tools from random matrix theory and the replica method, they analyzed the spectrum of the matrix that governs the learning timescales. They found that this spectrum splits into two distinct bulks. The inverse eigenvalues of these bulks correspond directly to the two timescales, tau-gen and tau-mem, providing a strong theoretical foundation for their empirical findings.

Noah: So in practice, does this mean early stopping is the primary defense against memorization in these highly overparameterized models?

John: That's a major practical implication. It elevates early stopping from a simple heuristic to a principled strategy grounded in these dynamical properties. If you know that this generalization window exists and widens with your dataset size, you can more confidently train your model and stop it once sample quality is high but before memorization takes over. This is especially critical in data-scarce domains like medical imaging, where you absolutely want to avoid having the model just spit back the few training images it saw.

John: This work really shifts the conversation from static properties of a model to the dynamics of the learning process itself. It helps explain the puzzle of 'benign overfitting' in a new context, suggesting generalization is an intermediate phase of training, not necessarily an endpoint. The idea that learning happens in stages, from coarse to fine, or low-frequency to high-frequency, is a powerful concept that likely extends beyond just diffusion models. It connects to broader theories of representation learning, like the work on 'grokking', where delayed generalization is also observed.

Noah: How does this idea of a dynamical window compare to the 'Selective Underfitting' paper we discussed? That work also suggested diffusion models avoid memorization by not perfectly fitting the score function. Are these related concepts?

John: That's an excellent connection. They are related but distinct. 'Selective Underfitting' focuses more on the idea that the model preferentially underfits certain parts of the score, particularly high-frequency components, as a general property. This paper provides a temporal dimension to that idea. It argues that the underfitting of high-frequency components is not permanent, but delayed. The model can and will eventually learn them, leading to memorization. The key is that there is a long period where it has already learned the low-frequency, generalizable components, creating this exploitable window.

John: So, to wrap up, the main takeaway is that diffusion models generalize effectively because of an implicit dynamical regularization. The training process naturally separates the learning of general features from the memorization of specific data points into two different timescales. The time to memorize scales with the amount of data you have, creating a wider window for safe, effective training. Generalization isn't an accident; it's a predictable phase in the learning trajectory.

John: Thanks for listening. If you have any further questions, ask our AI assistant or drop a comment.
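To make the random-features picture above a bit more concrete, here is a small, self-contained Python toy: it builds a fixed random feature map on synthetic Gaussian inputs and inspects the eigenvalue spectrum of the empirical feature covariance, whose inverse eigenvalues set the gradient-flow learning timescales of the trained second layer. The Gaussian inputs, tanh features, and chosen dimensions are assumptions for illustration, not the paper's setup, and the toy is only meant to show where the timescales come from, not to reproduce the two-bulk structure or its scaling with n.

import numpy as np

rng = np.random.default_rng(0)
d, p, n = 100, 1000, 300                       # input dimension, random features, training set size

X = rng.standard_normal((n, d)) / np.sqrt(d)   # synthetic inputs standing in for the training data
W = rng.standard_normal((p, d))                # fixed (untrained) first-layer weights of the RFNN
Phi = np.tanh(X @ W.T)                         # random feature map, shape (n, p)

# Under gradient flow on a squared loss, the trained second layer evolves
# linearly in the empirical feature covariance; each eigenvalue lambda of
# that matrix contributes a mode relaxing on a timescale ~ 1 / lambda.
C = Phi.T @ Phi / n                            # (p, p) empirical feature covariance
eigs = np.linalg.eigvalsh(C)                   # ascending eigenvalues
eigs = eigs[eigs > 1e-10]                      # drop the exactly-flat directions

print("fast modes (large eigenvalues, learned early):", eigs[-3:])
print("slow modes (small eigenvalues, learned late): ", eigs[:3])

In the paper's analysis as summarized in the lecture, the slow, small-eigenvalue bulk is the one tied to memorization, and since the timescale of a mode goes like one over its eigenvalue, tau-mem growing linearly with n corresponds to that lower bulk shrinking as the training set grows.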