Attention Lesson 4 of 4

Context Changes Everything

Same word, different attention

Now let's see attention in action. Below are two sentences containing the same word "bank"; notice how its attention pattern changes completely with context.

Compare Attention Patterns

See how "bank" pays attention to different words based on context.

Contextual embeddings

After attention, each word gets a new representation that incorporates context. This is called a contextual embedding.

  • Static embedding: "bank" → same vector always
  • Contextual embedding: "bank" → vector depends on surrounding words
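The distinction above can be sketched with toy vectors. This is a minimal, invented example (the 3-d embeddings and the vocabulary are made up for illustration, not from any real model): one attention step mixes "bank"'s static vector with its context, so the same starting vector yields different contextual embeddings.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Toy static embeddings (invented 3-d vectors for illustration).
emb = {
    "river": np.array([1.0, 0.0, 0.0]),
    "money": np.array([0.0, 1.0, 0.0]),
    "bank":  np.array([0.5, 0.5, 0.0]),  # static: same vector in every sentence
}

def contextualize(word, context):
    """One attention step: mix the word's vector with its context."""
    q = emb[word]
    keys = np.stack([emb[w] for w in context])
    weights = softmax(keys @ q)   # attention weights over the context words
    return weights @ keys         # weighted average = contextual embedding

a = contextualize("bank", ["river", "bank"])
b = contextualize("bank", ["money", "bank"])

# The static vector for "bank" was identical in both calls,
# but the contextual embeddings differ.
print(a)  # leans toward "river"
print(b)  # leans toward "money"
```

Note that `emb["bank"]` itself never changes; only the attention-weighted output does, which is exactly the static-vs-contextual distinction.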

This is why modern LLMs like Claude can understand nuance, sarcasm, and ambiguity that simpler models miss.

The full attention heatmap

Attention Heatmap

Each row shows how much that word attends to every other word. Brighter = more attention.
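Under the hood, a heatmap like this is just a matrix of softmax-normalized similarity scores. Here is a hedged sketch of scaled dot-product attention using random vectors in place of real learned embeddings (the sentence and dimensions are arbitrary choices for illustration); each row of `weights` is one row of the heatmap and sums to 1.

```python
import numpy as np

rng = np.random.default_rng(0)

words = ["the", "boat", "reached", "the", "bank"]
d = 4                                  # toy embedding size
X = rng.normal(size=(len(words), d))   # one random vector per word (illustrative)

# Scaled dot-product scores: similarity of every word with every other word.
scores = X @ X.T / np.sqrt(d)

# Softmax each row so every word's attention weights sum to 1.
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)

# weights[i, j] = how much word i attends to word j -- one heatmap cell.
for w, row in zip(words, weights):
    print(f"{w:>8}", np.round(row, 2))
```

Rendering `weights` with any image/heatmap plotter (brighter for larger values) reproduces the picture above.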

From attention to transformers

Modern LLMs use multi-head attention (multiple attention patterns computed in parallel) and stack many layers of it. Each layer refines the representations produced by the one before.

The Transformer architecture (2017) combined attention with other innovations and became the foundation for GPT, BERT, Claude, and virtually all modern language models.

Key Takeaways

  • Attention creates context-dependent representations
  • The same word gets different vectors in different contexts
  • This is the foundation of modern language understanding