
Attention Is All You Need

2023-10-02


Keywords: #Transformer

3. Model Architecture

3.1 Encoder and Decoder Stacks

  1. Encoder
    • Multi-head self-attention
    • Residual connection + Layer Normalization → $\text{LayerNorm}(x + \text{Sublayer}(x))$ (see the sketch after this list)
    • Fully connected feed-forward
  2. Decoder
    • Masked multi-head attention over the outputs (shifted right): ensures that predictions for position $i$ can depend only on the known outputs at positions less than $i$.
    • Multi-head attention
    • Fully connected feed-forward
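
Both stacks wrap every sub-layer in the same residual + LayerNorm pattern. Below is a minimal NumPy sketch of that wrapper (helper names are mine; the learnable gain/bias of LayerNorm, dropout, and the actual sub-layers are omitted for brevity):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each position over the feature dimension.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def add_and_norm(x, sublayer):
    # The pattern applied around every sub-layer: LayerNorm(x + Sublayer(x)).
    return layer_norm(x + sublayer(x))

# Toy usage with an identity "sublayer" on a (seq_len, d_model) input.
x = np.random.randn(4, 8)
print(add_and_norm(x, lambda h: h).shape)  # (4, 8)
```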

3.2 Attention

⚔️ Intuition behind Query (Q), Key (K), Value (V)

  • For each token in a sentence like “I am a teacher.”, attention computes how strongly it relates to every other token: the token’s query is matched against every token’s key, and the resulting weights decide how much of each token’s value flows into its new representation (toy sketch below).
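
A toy sketch with random embeddings and projection matrices (purely for illustration): each token is projected into a query, a key, and a value vector.

```python
import numpy as np

rng = np.random.default_rng(0)
tokens = ["I", "am", "a", "teacher", "."]
d_model = 8

# Hypothetical random embeddings and Q/K/V projection matrices.
X = rng.normal(size=(len(tokens), d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

Q, K, V = X @ W_q, X @ W_k, X @ W_v  # one query/key/value vector per token
print(Q.shape, K.shape, V.shape)     # (5, 8) each
```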

⚔️ Scaled Dot-Product Attention
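
From the paper: $\text{Attention}(Q, K, V) = \text{softmax}\!\left(\dfrac{QK^\top}{\sqrt{d_k}}\right)V$, i.e. dot-product attention scaled by $\frac{1}{\sqrt{d_k}}$. A minimal NumPy sketch (the optional boolean mask is used for decoder self-attention below):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    # softmax(Q K^T / sqrt(d_k)) V, with an optional boolean mask
    # (False entries get a large negative score before the softmax).
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the keys
    return weights @ V, weights
```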

⚔️ Multi-Head Attention
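
From the paper: $\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)W^O$ with $\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$. A rough sketch, reusing `scaled_dot_product_attention` from above and slicing shared projection matrices into $h$ heads (equivalent, up to parameterization, to separate per-head $W_i$ matrices):

```python
import numpy as np

def multi_head_attention(Xq, Xkv, W_q, W_k, W_v, W_o, h):
    # Project, split d_model into h heads, attend per head, concat, project back.
    d_k = W_q.shape[1] // h
    Q, K, V = Xq @ W_q, Xkv @ W_k, Xkv @ W_v
    heads = []
    for i in range(h):
        s = slice(i * d_k, (i + 1) * d_k)
        out, _ = scaled_dot_product_attention(Q[:, s], K[:, s], V[:, s])
        heads.append(out)
    return np.concatenate(heads, axis=-1) @ W_o
```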

⚔️ Applications of Attention in our Model

  1. Encoder-Decoder Attention
    • Q comes from the previous decoder layer, while K and V come from the output of the encoder.
    • This allows every position in the decoder to attend over all positions in the input sequence.
  2. Encoder Self-Attention
    • K, V, and Q all come from the same place: the output of the previous layer in the encoder.
    • Each position in the encoder can attend to all positions in the previous layer.
  3. Decoder Self-Attention
    • Each position in the decoder can attend to all positions in the decoder up to and including that position.
    • Illegal connections are masked out (set to $-\infty$ in the input of the softmax) to prevent leftward information flow → preserves the auto-regressive property (see the mask sketch below).
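
A sketch of that mask, usable as the `mask` argument of the attention sketch above: a lower-triangular boolean matrix, so position $i$ only sees positions $\le i$.

```python
import numpy as np

seq_len = 5
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))
print(causal_mask.astype(int))
# [[1 0 0 0 0]
#  [1 1 0 0 0]
#  [1 1 1 0 0]
#  [1 1 1 1 0]
#  [1 1 1 1 1]]
```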

3.3 Position-wise Feed-Forward Networks

  • A fully connected feed-forward network is applied to each position separately and identically.
  • Two linear transformations with a ReLU activation in between: $\text{FFN}(x) = \max(0,\, xW_1 + b_1)W_2 + b_2$
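
A minimal sketch of that formula (toy dimensions; in the paper $d_{\text{model}} = 512$ and the inner dimension $d_{ff} = 2048$):

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    # The same two linear layers applied independently at every position:
    # FFN(x) = max(0, x W1 + b1) W2 + b2
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

# Toy shapes: d_model=8, d_ff=32.
x = np.random.randn(4, 8)
W1, b1 = np.random.randn(8, 32), np.zeros(32)
W2, b2 = np.random.randn(32, 8), np.zeros(8)
print(position_wise_ffn(x, W1, b1, W2, b2).shape)  # (4, 8)
```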

3.5 Positional Encoding