Zurich University of Applied Sciences
This research develops a mathematical framework for analyzing the underlying structure of self-attention's query-key matrix W_qk, demonstrating that Transformer training objectives (bidirectional vs. autoregressive) intrinsically shape W_qk into symmetric and directional forms, respectively. Empirical validation across diverse models and modalities confirms these emergent structures, and a symmetric initialization for encoder-only language models reduces training convergence time by up to 73%.
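As a concrete illustration of the initialization result, the sketch below ties the query and key projection weights at initialization so that the effective query-key matrix W_qk = W_q^T W_k equals W_q^T W_q, which is symmetric by construction. This is a minimal PyTorch sketch under assumed details: the paper's exact initialization scheme is not reproduced here, and the helper `symmetric_qk_init` is hypothetical.

```python
import torch
import torch.nn as nn

def symmetric_qk_init(attn: nn.MultiheadAttention) -> None:
    """Hypothetical helper: tie W_q and W_k at initialization so the
    effective W_qk = W_q^T W_k is symmetric (here, W_q^T W_q)."""
    d = attn.embed_dim
    w = torch.empty(d, d)
    nn.init.xavier_uniform_(w)
    # nn.MultiheadAttention packs W_q, W_k, W_v row-wise in in_proj_weight
    with torch.no_grad():
        attn.in_proj_weight[:d].copy_(w)        # W_q
        attn.in_proj_weight[d:2 * d].copy_(w)   # W_k = W_q -> symmetric W_qk

attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
symmetric_qk_init(attn)

# Verify the effective query-key matrix is symmetric at init
W_q = attn.in_proj_weight[:64]
W_k = attn.in_proj_weight[64:128]
W_qk = W_q.T @ W_k
assert torch.allclose(W_qk, W_qk.T)
```

Tying the two projections is just one way to realize a symmetric W_qk; the weights are free to drift apart during training, so the symmetry acts as an inductive bias at initialization rather than a hard constraint.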