Researchers at Sapienza University of Rome and Fastweb conducted the first empirical analysis of attention sinks in Diffusion Language Models (DLMs), discovering that DLMs exhibit dynamic, shifting sinks and a surprising robustness to their removal, unlike Autoregressive Models (ARMs). This work offers key insights into DLM internal dynamics, showing that their attention mechanisms are more distributed and flexible.
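As a rough, hypothetical illustration of what an "attention sink" means operationally (this is not the authors' measurement protocol), one can score each key position by the average attention mass it receives across heads and queries; positions with outsized scores are sink candidates. The function name sink_scores and the random attention tensor below are stand-ins for attention weights obtained from an actual model.

```python
import torch

def sink_scores(attn: torch.Tensor) -> torch.Tensor:
    """Average attention mass each key position receives, a common
    operational proxy for 'attention sinks'.

    attn: attention weights of shape (num_heads, query_len, key_len),
          with rows summing to 1 over the key dimension.
    Returns a (key_len,) tensor; positions with unusually high values
    are sink candidates.
    """
    # Mean over heads and queries -> per-key-position attention mass.
    return attn.mean(dim=(0, 1))

# Hypothetical usage: `attn` would normally come from a model forward
# pass that returns attention weights; random weights stand in here.
attn = torch.softmax(torch.randn(8, 16, 16), dim=-1)
scores = sink_scores(attn)
print(scores.argmax().item(), scores.max().item())
```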
For any $s \in \mathbb{C}$ with $\Re(s)>0$, denote by $\eta_{n-1}(s)$ the $(n-1)^{th}$ partial sum of the Dirichlet series for the eta function $\eta(s)=1-2^{-s}+3^{-s}-\cdots$, and by $R_n(s)$ the corresponding remainder. Denoting by $u_n(s)$ the segment starting at $\eta_{n-1}(s)$ and ending at $\eta_n(s)$, we first show how, for sufficiently large values of $n$, the circle of diameter $u_{n+2}(s)$ lies strictly inside the circle of diameter $u_n(s)$, to then derive the asymptotic relationship $R_n(s) \sim (-1)^{n-1}/n^s$ as $n \rightarrow \infty$. Denoting by $D=\left\{s \in \mathbb{C}: \; 0< \Re(s) < \frac{1}{2}\right\}$ the open left half of the critical strip, define for all $s\in D$ the ratio $\chi_n^{\pm}(s) = \eta_n(1-s)/\eta_n(s)$. We then prove that the limit $L(s)=\lim_{N(s)
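As a minimal numerical sketch (not taken from the paper), the snippet below uses mpmath's altzeta for $\eta(s)$ to illustrate two of the abstract's ingredients at a sample point with $0<\Re(s)<\frac{1}{2}$: that $|R_n(s)|$ decays on the order of $n^{-\Re(s)}$, and that the circle of diameter $u_{n+2}(s)$ sits inside the circle of diameter $u_n(s)$. The chosen point $s = 0.3 + 14.1i$, the values of $n$, and the convention $R_n(s)=\eta(s)-\eta_{n-1}(s)$ are assumptions made here for illustration.

```python
from mpmath import mp, mpc, altzeta, power

mp.dps = 30  # working precision (decimal digits)

def eta_partial(s, n):
    """eta_n(s): n-th partial sum of the Dirichlet eta series."""
    return sum((-1) ** (k - 1) * power(k, -s) for k in range(1, n + 1))

s = mpc(0.3, 14.1)      # illustrative point with 0 < Re(s) < 1/2
eta_s = altzeta(s)      # eta(s), computed independently by mpmath

for n in (10, 100, 1000):
    # Remainder after the (n-1)-th partial sum (assumed convention).
    R_n = eta_s - eta_partial(s, n - 1)
    # |R_n(s)| should be comparable to |n^{-s}| = n^{-Re(s)}.
    print(n, abs(R_n), abs(power(n, -s)))

    # Nested-circle check: circle with diameter u_{n+2} inside the
    # circle with diameter u_n, where u_k joins eta_{k-1}(s) to eta_k(s).
    e = [eta_partial(s, k) for k in (n - 1, n, n + 1, n + 2)]
    c_n,  r_n  = (e[0] + e[1]) / 2, abs(e[1] - e[0]) / 2
    c_n2, r_n2 = (e[2] + e[3]) / 2, abs(e[3] - e[2]) / 2
    print("  contained:", abs(c_n2 - c_n) + r_n2 < r_n)
```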