Attention Lesson 3 of 4

How Attention Works

Query, Key, Value

Attention uses three learned transformations of each word:

  • Query (Q): "What am I looking for?"
  • Key (K): "What do I contain?"
  • Value (V): "What information do I provide?"

Think of it like a search engine: the Query is your search term, Keys are like page titles, and Values are the actual content.
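Here is a minimal sketch of those three projections in NumPy. The matrix names (`W_q`, `W_k`, `W_v`) and the sizes are illustrative assumptions, not values from this lesson; in a real model the projection matrices are learned, not random:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_k = 8, 4  # illustrative sizes, assumed for this sketch

# One embedding vector per word of "The cat sat" (3 words).
X = rng.normal(size=(3, d_model))

# Three projection matrices (random placeholders standing in for learned weights).
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

Q = X @ W_q  # "What am I looking for?"
K = X @ W_k  # "What do I contain?"
V = X @ W_v  # "What information do I provide?"

print(Q.shape, K.shape, V.shape)  # each word gets its own q, k, and v vector
```

The point to notice: all three come from the *same* word embeddings, just pushed through different matrices.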

The calculation

Attention(Q, K, V) = softmax(Q × Kᵀ / √dₖ) × V

where dₖ is the dimension of the key vectors (dividing by √dₖ keeps the dot products from growing too large). Don't worry about memorizing the formula. The calculation boils down to three steps:

  1. Compare each Query with all Keys (dot product)
  2. Convert scores to weights (softmax → sum to 1)
  3. Use weights to combine Values
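The three steps above can be written almost line-for-line in NumPy. This is a toy sketch of scaled dot-product attention, not an optimized implementation; the input matrices are made-up numbers:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating, for numerical stability.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # step 1: compare each Query with all Keys
    weights = softmax(scores)        # step 2: scores -> weights (each row sums to 1)
    return weights @ V, weights      # step 3: weighted sum of Values

# Tiny made-up inputs: two "words", each with a 2-dim q, k, and v.
Q = np.array([[1.0, 0.0], [0.0, 1.0]])
K = np.array([[1.0, 0.0], [0.0, 1.0]])
V = np.array([[10.0, 0.0], [0.0, 10.0]])

out, w = attention(Q, K, V)
print(np.round(w, 2))  # each row of weights sums to 1
```

Note that the whole thing is two matrix multiplications and a softmax: every Query is compared with every Key at once.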

Step through the process

[Interactive: Attention Calculation Steps — step through how attention is calculated for "The cat sat".]
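The same walk-through can be traced in plain code. The Q, K, and V rows below are hand-picked toy numbers (one row per word), not the lesson's actual values:

```python
import numpy as np

words = ["The", "cat", "sat"]

# Hand-picked toy vectors, purely illustrative: one q, k, v row per word.
Q = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
K = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
V = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])

scores = Q @ K.T / np.sqrt(K.shape[-1])      # step 1: Query-Key dot products, scaled
e = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = e / e.sum(axis=-1, keepdims=True)  # step 2: softmax over each row
output = weights @ V                         # step 3: weighted sum of Values

for i, word in enumerate(words):
    print(f'"{word}" attends with weights:', np.round(weights[i], 2))
```

Each printed row shows how much that word attends to "The", "cat", and "sat" respectively.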

Why this design?

The Q/K/V design is powerful because:

  • Learnable: The model learns what to query and what to expose as keys
  • Flexible: Any word can attend to any other word
  • Parallel: All attention calculations happen simultaneously

Key Takeaways

  • Q, K, V are three different projections of each word
  • Attention = how much each Query matches each Key
  • Output = weighted sum of Values based on attention