Deterministic Discrete Denoising

BibTeX
@misc{suzuki2025deterministicdiscretedenoising,
      title={Deterministic Discrete Denoising},
      author={Hideyuki Suzuki and Hiroshi Yamashita},
      year={2025},
      eprint={2509.20896},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2509.20896},
}
AI Audio Lecture + Q&A
Transcript
John: Welcome to Advanced Generative Models. Today's lecture is on the paper 'Deterministic Discrete Denoising' by Suzuki and Yamashita from The University of Osaka.

John: We've seen a clear trend in continuous models, with work like 'Denoising Diffusion Implicit Models' showing how deterministic sampling can massively speed up generation. This paper asks if we can bring those same benefits to the discrete world of text and graphs.

John: The field has been trying to improve discrete diffusion, but it's been a challenge. Yes, Noah?

Noah: Excuse me, Professor. Why is it so much harder to create a deterministic process for discrete data? With continuous data, you can just follow the gradient, but what's the equivalent for discrete tokens?

John: That's the core problem. A naive deterministic approach in discrete spaces, such as just picking the most likely token at each step, often leads to a severe loss of sample diversity. The model gets stuck in a few modes. To get around this, some methods embed discrete data into a continuous space, but that usually requires significant architectural changes and, crucially, retraining the entire model from scratch.

John: This paper's main contribution is a method to directly derandomize the reverse process on the discrete space itself, without any retraining. They propose using an algorithm called 'herding' as a drop-in replacement for the standard probabilistic sampling step in a pre-trained discrete diffusion model. The herding algorithm is a deterministic sampling technique, but it has what the authors call 'weakly chaotic' dynamics. This allows it to generate diverse sequences while remaining fully deterministic.

Noah: What does 'weakly chaotic' actually mean here? Is it just a way of saying it avoids simple, repetitive patterns?

John: Essentially, yes. It's not random, but it's not simple and predictable either. The algorithm maintains an internal, continuous weight vector. At each step, it doesn't just pick the token with the highest probability from the diffusion model. Instead, it picks the token that best corrects the accumulated error between the sequence generated so far and the target probabilities. This continuous weight vector acts as a memory of past choices, pushing subsequent selections towards underrepresented options. This feedback mechanism creates diverse, high-entropy outputs without injecting any new randomness.

Noah: So the only source of randomness is the initial noise state at the very beginning of the process?

John: Precisely. The initial noise vector and the initial random state of that continuous weight vector are the only random inputs. From there, the entire trajectory is deterministic.

John: The authors apply this to a state-of-the-art model called UDLM, a uniform diffusion language model. They test it on character-level text, word-level text, and categorical image generation with CIFAR-10. One of the key technical details is a 'delayed-switching' mechanism. They introduce a margin, a small threshold called delta, that prevents the chosen token from changing too rapidly at each step. This helps stabilize the process and seems to improve sample quality by preventing what they call 'chattering'.

Noah: So this delta parameter sounds like a critical hyperparameter. Does it introduce a trade-off? For instance, does a larger delta improve quality but hurt diversity?
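Before John answers, here is a minimal sketch of the herding step he just described: a per-position weight vector accumulates the gap between the model's target probabilities and the tokens actually emitted, and a margin delta delays switching. The update rule, function name, and arguments below are illustrative assumptions based on the lecture's description, not the paper's exact formulation.

import numpy as np

def herding_step(w, p, prev_token, delta=0.0):
    """One deterministic herding update for a single token position (sketch).

    w: accumulated-error weight vector over the vocabulary.
    p: target probabilities from the pre-trained diffusion model at this step.
    prev_token: token currently held at this position (for delayed switching).
    delta: delayed-switching margin.
    """
    scores = w + p                      # memory of past choices plus current target
    candidate = int(np.argmax(scores))  # token that best corrects the accumulated error

    # Delayed switching: keep the incumbent token unless the challenger
    # beats it by at least delta, which suppresses rapid 'chattering'.
    if prev_token is not None and scores[candidate] - scores[prev_token] < delta:
        candidate = prev_token

    # Feedback update: credit the full target distribution, debit the chosen
    # token, so under-represented tokens accumulate weight and win later.
    w = scores
    w[candidate] -= 1.0
    return candidate, w

At generation time a call like this would replace the categorical draw at each reverse step; apart from the initial noise and the initial weights, the trajectory is fully deterministic.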
John: Exactly. The results show this trade-off clearly. For text generation, using their herding method improved perplexity by up to a factor of ten, which is a substantial improvement in quality. However, it did come with a moderate decrease in entropy, or diversity. The delta parameter allows you to tune that balance. For image generation, the results were even more compelling. The herding method achieved a much better FID score with just 128 steps than the standard stochastic method did with over 1000 steps. That's a significant efficiency gain.

Noah: How does this compare to other attempts to improve discrete models, like in 'A Reparameterized Discrete Diffusion Model for Text Generation'? That paper also claimed to improve quality and efficiency.

John: That's a good comparison. Works like that one often focus on changing the fundamental process, reparameterizing it to behave more like a continuous model. While effective, it doesn't offer the 'plug-and-play' benefit of this paper. The significance here is that you can take an already trained UDLM model, swap out a single sampling function, and get better, faster results without any retraining. It's the same appeal that made DDIMs so popular for continuous models.

John: The main implication of this research is that it validates the idea of deterministic denoising for discrete domains. It was an open question whether the benefits seen in continuous models could be replicated. This paper provides a strong 'yes'. By making discrete diffusion models faster and capable of producing higher-quality samples, this work makes them more competitive with autoregressive models for tasks like text generation. It broadens their applicability to areas like drug discovery or code generation, where discrete structures are fundamental.

Noah: Another question. Could you see this herding mechanism as a form of guidance? It sounds related to the ideas in 'Simple Guidance Mechanisms for Discrete Diffusion Models', where they also try to control the sampling path.

John: That's a very insightful connection. While it's not 'guidance' in the typical sense of steering generation towards a specific class, it is a form of deterministic control over the sampling path. It's guiding the process to ensure the empirical distribution of generated features matches the target distribution. An interesting future direction would be to modify the herding algorithm's feature function to incorporate explicit external guidance, which seems entirely feasible.

John: So, to wrap up, this paper introduces a straightforward yet effective method for deterministic sampling in discrete diffusion models. The key takeaway is that by using the herding algorithm, we can achieve the efficiency and quality gains of deterministic denoising, previously seen in continuous models, as a simple, retraining-free, drop-in replacement. This significantly boosts the practicality of discrete diffusion models.

John: Thanks for listening. If you have any further questions, ask our AI assistant or drop a comment.
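To make the 'swap out a single sampling function' point concrete, here is a schematic reverse-denoising loop with the herding_step sketch above substituted for the usual categorical draw. The model interface, shapes, step count, and the delta value are placeholders for illustration, not UDLM's actual API.

def herded_reverse_process(model, seq_len, vocab_size, num_steps, rng, delta=0.05):
    """Schematic reverse denoising with herding in place of sampling.

    model(tokens, t) stands in for a pre-trained discrete diffusion model
    (e.g. UDLM) returning per-position target probabilities of shape
    (seq_len, vocab_size); the real interface differs.
    """
    # The only randomness: the initial noisy tokens and the initial herding weights.
    tokens = rng.integers(vocab_size, size=seq_len)
    w = rng.random((seq_len, vocab_size))

    for t in reversed(range(num_steps)):
        probs = model(tokens, t)  # per-position target distributions
        for i in range(seq_len):
            # Drop-in replacement for the stochastic draw
            # tokens[i] = rng.choice(vocab_size, p=probs[i]).
            tokens[i], w[i] = herding_step(w[i], probs[i], tokens[i], delta)
    return tokens

The surrounding loop and the trained network are untouched; only the inner sampling line changes, which is what makes the approach retraining-free.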