Why does self-attention divide by $\sqrt{d_k}$?


The self-attention formula is:

$$\mathrm{attention}(Q,K,V)=\mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_{k}}}\right)V$$
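As a concrete reference, here is a minimal NumPy sketch of this formula (my own illustration, not code from the original post); the sequence length and the sizes $d_k=64$, $d_v=32$ are arbitrary assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)    # QK^T scaled by sqrt(d_k)
    weights = softmax(scores, axis=-1) # attention weights, one row per query
    return weights @ V                 # weighted sum of the value vectors

# Example usage with assumed sizes: 4 queries/keys, d_k = 64, d_v = 32
rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 64))
K = rng.standard_normal((4, 64))
V = rng.standard_normal((4, 32))
print(scaled_dot_product_attention(Q, K, V).shape)   # (4, 32)
```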

In my understanding, there are two reasons for dividing by $\sqrt{d_{k}}$:

(Here $d_{k}$ is the dimension of the word vectors / hidden layer, i.e., of the query and key vectors.)

1. First, we need to divide by something to keep the inputs to the softmax from becoming too large; otherwise the partial derivatives (gradients) of the softmax approach 0.

2. The specific choice of $\sqrt{d_{k}}$ makes the scaled dot product $q \cdot k$ have mean 0 and variance 1, which acts like a normalization (a small numerical sketch follows below).
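To make point 1 concrete, here is a small sketch (my own, under the same $N(0,1)$ assumption used below, with $d_k=512$ and 8 keys chosen arbitrarily). With unscaled scores the softmax output is nearly one-hot and the entries of its Jacobian $\mathrm{diag}(p)-pp^{T}$ are close to zero, i.e., the gradient almost vanishes; after dividing by $\sqrt{d_k}$ the probabilities are spread out and the gradient is much larger.

```python
import numpy as np

def softmax(x):
    x = x - x.max()        # stabilize
    e = np.exp(x)
    return e / e.sum()

rng = np.random.default_rng(0)
d_k = 512                              # assumed dimension
q = rng.standard_normal(d_k)
keys = rng.standard_normal((8, d_k))   # 8 keys, just for illustration

raw = keys @ q                         # unscaled scores, std ≈ sqrt(d_k)
scaled = raw / np.sqrt(d_k)            # scaled scores, std ≈ 1

for name, s in [("unscaled", raw), ("scaled  ", scaled)]:
    p = softmax(s)
    jac = np.diag(p) - np.outer(p, p)  # Jacobian of softmax at s
    print(name, "max prob =", round(float(p.max()), 3),
          " max |Jacobian entry| =", round(float(np.abs(jac).max()), 4))
```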

Derivation. First, assume the components of $q$ and $k$ are independent random variables, each with mean 0 and variance 1.

Let $X=q_{i}$ and $Y=k_{i}$. Then:

1. $E(XY)=E(X)E(Y)=0\cdot 0=0$

2. 
$$
\begin{aligned}
D(XY) &= E(X^{2}Y^{2})-[E(XY)]^{2} \\
&= E(X^{2})E(Y^{2})-[E(X)E(Y)]^{2} \\
&= \left(E(X^{2})-[E(X)]^{2}\right)\left(E(Y^{2})-[E(Y)]^{2}\right)-[E(X)E(Y)]^{2} \quad (\text{using } E(X)=E(Y)=0) \\
&= D(X)D(Y)-[E(X)E(Y)]^{2} \\
&= 1\cdot 1-0\cdot 0 \\
&= 1
\end{aligned}
$$

3. $D\!\left(\dfrac{q\cdot k}{\sqrt{d_{k}}}\right)=\dfrac{d_{k}}{\left(\sqrt{d_{k}}\right)^{2}}=1$

Note that each entry of $QK^{T}$ is the dot product of one query with one key, $q\cdot k=\sum_{i=1}^{d_{k}} q_{i}k_{i}$, so $D(q\cdot k)=D\!\left(\sum_{i=1}^{d_{k}} q_{i}k_{i}\right)=d_{k}\cdot 1=d_{k}$, being the sum of $d_{k}$ independent terms each with variance 1.
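A quick Monte Carlo check of this derivation (my own sketch, again assuming independent $N(0,1)$ components, with $d_k=64$ as an arbitrary choice): the empirical variance of $q\cdot k$ is close to $d_{k}$, and dividing by $\sqrt{d_{k}}$ brings it back to about 1.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k, n_samples = 64, 100_000            # assumed sizes

q = rng.standard_normal((n_samples, d_k))
k = rng.standard_normal((n_samples, d_k))
dots = (q * k).sum(axis=1)              # one dot product q·k per sample

print("Var(q·k)        ≈", round(float(dots.var()), 2))                      # ≈ 64 = d_k
print("Var(q·k/√d_k)   ≈", round(float((dots / np.sqrt(d_k)).var()), 2))     # ≈ 1.0
```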
