Issue 01 · Transformer

The Transformer,
step by step.

Type a sentence. Watch every part of a transformer translate it into Spanish, with real numbers, real math, and plain‑English explanations of every step.

A transformer is the kind of neural network that powers modern AI like ChatGPT, Claude, and Google Translate. It reads a sentence, builds a rich internal “understanding” of it, and writes a new sentence one word at a time. This page walks through every step on a tiny model so you can see exactly what is happening at each stage. No background in math, neural networks, or language models is required; we will define every term as we go.

Try: hello, how are you good morning thank you hello i am fine
About this toy model: we use a tiny transformer with embedding size d_model = 8, 2 attention heads, and 1 encoder + 1 decoder layer. All weights are produced by a deterministic sin() formula (no randomness), so refreshing the page gives identical numbers. Real systems like GPT use 12 to 96 layers, embeddings of size 768 to 12288, and weights learned from massive text data, but every step you will see below is the same math, just smaller.

One small intervention: at the very last step (output logits), we add +20 to the correct Spanish token so the example phrases translate cleanly, explained openly in Section 21. Every other number on this page is real, unmodified math.
Step 1

1. Tokenization, cutting the sentence into pieces

Turning a string of letters into a list of discrete pieces.

What is this? A sentence is a string of characters. Computers struggle to work with whole sentences directly; they prefer small, atomic pieces. Tokenization is the chopping board: it takes raw text and cuts it into a list of tokens, where each token is one word or one piece of punctuation.

Why do we need it? Every later step works on a fixed list of pieces. We also lowercase the text so “Hello” and “hello” share the same identity. Punctuation like commas and question marks become their own tokens because they carry meaning.

tokens = peel_punct( split_whitespace( lowercase( text ) ) )
In words: lowercase the input, split on spaces, then peel commas and question marks off as their own tokens.
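
A minimal Python sketch of this recipe. The function name `tokenize` and the regular expression are illustrative assumptions, not the page's actual code; only commas and question marks are peeled off, as described above.

```python
import re

def tokenize(text):
    """Lowercase, split on whitespace, and peel ',' and '?' off as their own tokens."""
    tokens = []
    for chunk in text.lower().split():
        # Break each chunk into runs of ordinary characters and single ',' or '?'.
        tokens.extend(re.findall(r"[^,?]+|[,?]", chunk))
    return tokens

print(tokenize("Hello, how are you?"))
# ['hello', ',', 'how', 'are', 'you', '?']
```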

The numbers

Reading this: each pill below is one token. Order is left-to-right, just like English.

A sentence becomes an ordered list of small pieces, the smallest units the model will reason about.

Step 2

2. Token IDs, looking up each token in the vocabulary

Each token becomes a number: its index in the model’s dictionary.

What is this? The model has a fixed vocabulary: a list of every token it knows. Each token has a numeric index in that list (its token ID). To turn our list of words into something we can compute on, we replace each token with its ID.

Why do we need it? Math operates on numbers, not words. An ID is the simplest possible numeric handle. If a word is not in the vocabulary, we use a special <unk> (“unknown”) ID instead.

id_i = vocab.indexOf(token_i)
vocab, the model’s ordered word list (29 entries in our toy: 4 special tokens + 25 words).
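
A short sketch of the lookup in Python. The mini-vocabulary below is hypothetical (the page's real 29-entry list is not shown in the text, and `<pad>` is an assumed fourth special token); only the fallback to `<unk>` is taken from the description above.

```python
# Hypothetical mini-vocabulary, for illustration only.
VOCAB = ["<pad>", "<bos>", "<eos>", "<unk>", "hello", ",", "?", "how", "are", "you"]
TOKEN_TO_ID = {token: index for index, token in enumerate(VOCAB)}
UNK_ID = TOKEN_TO_ID["<unk>"]

def to_ids(tokens):
    """Replace each token with its vocabulary index, falling back to <unk>."""
    return [TOKEN_TO_ID.get(token, UNK_ID) for token in tokens]

print(to_ids(["hello", ",", "how", "are", "you", "?"]))  # [4, 5, 7, 8, 9, 6]
print(to_ids(["hello", "banana"]))                        # [4, 3]
```

A dictionary gives the same answer as scanning the list with an index search, just without re-scanning the vocabulary for every token.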

The numbers

Reading this: each pill shows the token, then its position in the vocabulary list.

Words become integers, coordinates into the model’s memory.

Step 3

3. Token embeddings, a list of numbers per word

Each ID becomes an 8-dimensional vector that captures meaning.

Vector definition first: a vector is just a fixed-length list of numbers, like (0.3, -0.1, 0.8, ...). An 8-dimensional vector means a list of 8 numbers.

What is this? A single integer like 7 tells us nothing about what “how” means. So we replace each ID with a vector (8 numbers, here). This vector is looked up from a giant table called the embedding matrix W_E, with one row per vocabulary word.

A 2D analogy first. Imagine plotting words on a flat map. “king” and “queen” sit near each other; “banana” sits far away. Each word becomes a point with two coordinates, e.g. (0.3, 0.7). We do the same here, but with 8 coordinates instead of 2. “Close” just means the numbers are similar.

Why do we need it? Vectors can encode many features at once. In a real, trained model, the embedding for “king” ends up close to “queen” and far from “banana” in its (much larger) embedding space. Training positions similar words near each other so later math can compare them.

E_i = row id_i of W_E
W_E, the embedding table. Shape [29, 8] means a grid of 29 rows and 8 columns; each row is one word’s embedding. To embed token i, we just pick the row whose row-number equals its token ID.
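
The lookup is plain row selection, which a few lines of Python can show. The formula inside `toy_weight` is a hypothetical stand-in: the page only says the weights come from a deterministic sin() formula, not which one.

```python
import math

VOCAB_SIZE, D_MODEL = 29, 8

def toy_weight(row, col):
    # Hypothetical deterministic formula (assumption, not the page's own).
    return math.sin(row * 1.7 + col * 0.9 + 0.3)

# W_E: one 8-number row per vocabulary entry, shape [29, 8].
W_E = [[toy_weight(r, c) for c in range(D_MODEL)] for r in range(VOCAB_SIZE)]

def embed(token_ids):
    """E_i = row id_i of W_E."""
    return [W_E[token_id] for token_id in token_ids]

E = embed([4, 5, 7])
print(len(E), len(E[0]))  # 3 8
```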

The numbers

Reading this: each row is one token’s 8-number embedding. All numbers come from a deterministic sin() formula based on (vocab ID, dimension), so they are reproducible.

Words become small lists of numbers, rich enough to encode meaning, small enough to do math on.

Step 4

4. Positional encoding, telling the model where each word sits

A separate vector per position, made of sine and cosine waves.

What is this? In Steps 8 to 12 (attention) the model will look at all words simultaneously. Attention has no idea which word came first; it sees them as a bag, not a list. So we stamp each word with its position before that step: a second 8-dim vector per position, computed purely from the position number using sines and cosines.

Why do we need it? “Dog bites man” vs “Man bites dog” have the same words in different positions, with very different meanings. Position must be part of the input, or attention is order-blind.

Why sines and cosines? They give every position a unique “fingerprint” and let the model easily learn relationships like “3 positions apart” through simple math.

PE[pos, 2i]   = sin( pos / 10000^(2i / d_model) )
PE[pos, 2i+1] = cos( pos / 10000^(2i / d_model) )
Don’t memorize this formula. What matters: each position 0, 1, 2, … gets turned into a unique 8-number vector. The 10000 is just a large constant the original paper picked; it controls how fast the wavelengths grow across columns.   pos, the token’s position (0, 1, 2, …).   i, the pair index (0 to d_model/2 − 1). The two columns of pair i are 2i and 2i+1; even columns use sin, odd use cos.
Worked example. For pos = 1, i = 1 (so 2i = 2): PE[1, 2] = sin(1 / 10000^(2/8)) = sin(1 / 10) = sin(0.1) ≈ 0.0998. Look at the pos 1 row, column d2 below; it should match.
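
The formula and the worked example can be checked with a small Python sketch. The function name `positional_encoding` is an illustrative choice; the math follows the two lines above.

```python
import math

def positional_encoding(num_positions, d_model=8):
    """Sinusoidal positional encoding: sin on even columns, cos on odd columns."""
    pe = []
    for pos in range(num_positions):
        row = []
        for i in range(d_model // 2):
            angle = pos / (10000 ** (2 * i / d_model))
            row.append(math.sin(angle))  # column 2i
            row.append(math.cos(angle))  # column 2i + 1
        pe.append(row)
    return pe

PE = positional_encoding(4)
print(round(PE[1][2], 4))  # 0.0998, the worked example
print(PE[0])               # position 0 alternates sin(0) = 0.0 and cos(0) = 1.0
```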

The numbers

Reading this: row pos 0 is the “position 0” vector. Even columns are sines, odd columns are cosines. These numbers are fixed, not learned, not random.

Position is injected as a structured vector that depends only on where in the sentence the token sits.

Step 5

5. Combined input X, meaning plus position

Add the two vectors together, element by element. This is what flows into the encoder.

What is this? We literally add the embedding vector and the position vector for each token. The result X is one 8-number vector per token that encodes both what the token is and where it sits.

Why do we need it? One vector per token is the input shape every later block expects. Adding (rather than concatenating) keeps the size at 8 and lets the model pull apart the two signals when needed.

X = E + PE
Element-wise sum: each cell of X equals the same-cell entry of E plus the same-cell entry of PE.

The numbers

Reading this: shape is [number_of_tokens × 8]. This X matrix is the encoder’s starting point.

Each token now carries one 8-number vector that knows what word it is and where it is.

Step 6

6. The encoder block, big picture

The encoder rewrites every token vector to be context-aware. Here is its skeleton.

What is this? The encoder takes X and produces a new sequence of 8-number vectors, same shape, but each new vector is “informed” by the other tokens. The word “bank” in “river bank” should end up with a different vector than in “savings bank”, because attention pulls in context.

Why do we need it? Word meaning depends on neighbors. The encoder is the part that lets each word “ask” the others “who is relevant to me?” and absorb their information.

What is inside one encoder block

Input X (per token: 8 numbers)
LayerNorm -- rescales numbers so they don’t blow up (Step 7)
Multi-Head Self-Attention (Q·K·V) -- lets each word peek at every other word (Steps 8 to 12)
Add input back (residual) -- preserves the original signal (Step 13)
LayerNorm -- rescale again before FFN
Feed-Forward Network -- per-token thinking (Step 14)
Add input back (residual)
Encoder output (per token: 8 numbers)

Don’t worry about understanding this yet. Sections 7 to 15 unpack each box one at a time, with live numbers, in the order shown above. By the end of Section 15 every box will be a familiar friend.

An encoder block has two sub-blocks: attention (mixes information across tokens) and feed-forward (extra processing per token). Each sub-block is wrapped in a normalization step and a residual connection.

Step 7

7. LayerNorm, keeping the numbers tidy

Per-token normalization: mean to 0, variance to 1, then a small adjustment.

Activation = the numbers flowing between layers (just our token vectors here). Why might they blow up? Each matrix multiplication (matmul, explained in Step 8) can multiply numbers by 8+ different values and add the results. Repeat that 12 times in a deep network, and a starting number of 1 can become 1,000,000 or shrink to 0.000001. Learned parameters = numbers adjusted during training; in our toy model, the “learned” parameters come from a fixed sin() formula, not real training.

What is this? For each token vector, we compute the mean and standard deviation of its 8 numbers, subtract the mean, divide by the standard deviation, then multiply and shift by two learned parameters (gamma and beta). Before the gamma/beta adjustment, the vector has zero mean and unit variance: it is standardized.

Why do we need it? LayerNorm yanks every vector back into a stable range so deeper layers can do their job. Think of it as a thermostat keeping the pipes at a steady temperature.

Mean: the average of the numbers. Variance: how spread out they are (average squared distance from the mean). Standard deviation: square root of variance, a typical “spread” size in the original units.

y_i = (x_i − mean) / sqrt(variance + ε) · γ_i + β_i
x, the input token’s 8 numbers.   mean, variance, computed across those 8 numbers.   ε (epsilon, a tiny safety number) = 1e−5, prevents division by zero.   γ (gamma, per-dimension stretch), β (beta, per-dimension shift), learned scale & shift, one number per column.
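
A minimal sketch of this formula for one token vector. The input numbers are made up, and gamma = 1, beta = 0 are assumed so the standardization is easy to verify.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize one token vector to mean 0 / variance 1, then scale and shift."""
    mean = sum(x) / len(x)
    variance = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(variance + eps) * g + b
            for v, g, b in zip(x, gamma, beta)]

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]   # made-up token vector
y = layer_norm(x, gamma=[1.0] * 8, beta=[0.0] * 8)

mean_y = sum(y) / len(y)
var_y = sum((v - mean_y) ** 2 for v in y) / len(y)
print(abs(mean_y) < 1e-9, abs(var_y - 1.0) < 1e-4)  # True True
```

The variance lands just under 1 rather than exactly on it because of the ε added inside the square root.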

The numbers (encoder LayerNorm 1)

Reading this: rows are tokens, columns are the 8 vector dimensions. Watch what happens: each row gets shifted (mean → 0) and rescaled (spread → 1), then nudged by gamma/beta. The stats table shows what we subtract and divide by.

Before LayerNorm (= X from step 5):

Row stats (mean / variance per token):

After LayerNorm:

LayerNorm is mechanical: no information is added, only the scale is fixed. It buys numerical stability for everything that follows.

Step 8

8. Q, K, V, three lenses on each word

From one input matrix, build three views: Query, Key, Value.

Weight matrix definition first: a weight matrix is just a grid of numbers the model multiplies the input by. The numbers come from training (in our toy, from a fixed sin() formula). “Weights” are these numbers.

What is this? We multiply the normalized input by three different weight matrices (W_Q, W_K, W_V) to produce three new matrices: Q (queries), K (keys), and V (values). Same shape as input, different content.

Why do we need it? Picture a librarian. Every book has a topic label (the key) and contents (the value). When you walk in with a question (the query), instead of picking one book, you grab a little bit from every book, weighted by how well its label matches your question. That weighted blend is what comes out of attention.

Query (Q): “What am I looking for?”, a question each token asks.
Key (K): “What do I advertise?”, a label each token displays.
Value (V): “What do I deliver?”, the actual information returned if matched.

Three views let the model decouple “what to look for”, “what to be findable as”, and “what to send back”.

Matrix multiplication (matmul) in plain words: if A has 5 rows and 8 columns, and B has 8 rows and 3 columns, the result has 5 rows and 3 columns. Each output number is computed by taking a row of A, lining it up with a column of B, multiplying pair-by-pair, and adding the products.

Q = X_norm · W_Q     K = X_norm · W_K     V = X_norm · W_V
All three weight matrices have shape [d_model, d_model] = [8, 8]. Output shape: [tokens, 8] for each.
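
The three projections can be sketched with a hand-written matmul in plain Python. `toy_matrix` is a hypothetical stand-in for the page's sin()-based weights, and the 3-token input is made up; only the shapes follow the text.

```python
import math

def matmul(a, b):
    """Multiply a [n x k] by b [k x m]; both are lists of rows."""
    assert len(a[0]) == len(b), "inner dimensions must match"
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def toy_matrix(rows, cols, seed):
    # Hypothetical deterministic weights (assumption, not the page's formula).
    return [[math.sin(seed + 1.3 * r + 0.7 * c) for c in range(cols)]
            for r in range(rows)]

x_norm = toy_matrix(3, 8, seed=0.1)   # 3 tokens, 8 numbers each
W_Q, W_K, W_V = (toy_matrix(8, 8, seed=s) for s in (1.0, 2.0, 3.0))

Q, K, V = matmul(x_norm, W_Q), matmul(x_norm, W_K), matmul(x_norm, W_V)
print(len(Q), len(Q[0]))  # 3 8
```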

The numbers

Reading this: three [tokens × 8] grids. Each row is one token’s query / key / value vector.

Q (queries):

K (keys):

V (values):

Three new matrices, queries, keys, values, built by three different linear projections of the same input.

Next we use Q and K together to compute how much each token should attend to each other token: that’s the attention score grid in Section 9.

Step 9

9. Attention scores, how much does each token care about each other?

Q dotted with K-transpose: similarity between every pair of tokens.

Dot product first. A dot product takes two same-length lists, multiplies pair-by-pair, and adds the results. Example: (1, 2, 3) · (4, 5, 6) = 1·4 + 2·5 + 3·6 = 4 + 10 + 18 = 32. That’s the whole operation.

Pipeline preview (Sections 9 to 12): these four sections are one continuous pipeline. (1) compute scores from Q and K. (2) scale & softmax to get attention weights. (3) use weights to mix V into per-head outputs. (4) concat heads and project. Output: a new vector per token, same shape as input.

What is this step? For every pair of tokens (i, j), we compute the dot product of token i’s query and token j’s key. The result is a square grid of “raw scores”: how aligned token i’s question is with token j’s advertisement.

Why does dot product measure similarity? You can picture each 4-number list as an arrow, but that is just an analogy. What really matters: similar lists give a big positive dot product, opposite lists give a negative one, unrelated lists give roughly zero. So the dot product is a clean numeric measure of “how aligned”.

Heads, briefly. A head is a slice of columns. We split Q’s 8 columns into Q[:, 0:4] (head 0) and Q[:, 4:8] (head 1), and likewise for K and V. We then run the entire attention computation twice in parallel, once per head, and merge the results in Section 12. Each head can learn a different pattern of attention.

scores_h = Q_h · K_h^T
h, head index (0 or 1).   Q_h, K_h, this head’s 4 columns of Q and K (shape [tokens × 4]).   T = transpose, flip rows and columns. K_h is [tokens × 4]; we transpose it to [4 × tokens] so the inner dimensions match for matmul. Output shape: [tokens × tokens], one score per pair.
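
A sketch of the head split and the score grid, using made-up Q and K for two tokens so every number can be checked by hand. The helper names are illustrative.

```python
def dot(u, v):
    """Multiply pair-by-pair and add the results."""
    return sum(a * b for a, b in zip(u, v))

def head_slice(matrix, head, d_k=4):
    """Take this head's d_k columns from every row."""
    return [row[head * d_k:(head + 1) * d_k] for row in matrix]

def attention_scores(q_h, k_h):
    """scores[i][j] = dot(query of token i, key of token j), i.e. Q_h · K_h^T."""
    return [[dot(q, k) for k in k_h] for q in q_h]

# Made-up Q and K: 2 tokens, 8 columns each.
Q = [[1, 0, 0, 0, 2, 0, 0, 0],
     [0, 1, 0, 0, 0, 2, 0, 0]]
K = [[1, 0, 0, 0, 0, 1, 0, 0],
     [0, 1, 0, 0, 1, 0, 0, 0]]

print(dot((1, 2, 3), (4, 5, 6)))                                # 32
print(attention_scores(head_slice(Q, 0), head_slice(K, 0)))     # [[1, 0], [0, 1]]
print(attention_scores(head_slice(Q, 1), head_slice(K, 1)))     # [[0, 2], [2, 0]]
```

In this made-up example head 0 matches each token with itself while head 1 matches each token with the other one, which is the sense in which heads can disagree.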

Color legend (applies to all heatmaps below). Dark orange = highest in this grid; dark indigo = lowest; cream = middle. Colors are normalized per grid, so a dark orange in one heatmap is not directly comparable to a dark orange in another.

The numbers (head 0)

Reading this: rows are query tokens, columns are key tokens. Hover for exact values. Bigger = more match.

The numbers (head 1)

Same shape, different head, learns a different pattern of attention.

A square grid of similarities, one per head, each head will weight tokens differently in the next step.

Step 10

10. Scaling and softmax, turning raw scores into weights

Divide by sqrt(d_k), then squash each row to probabilities that sum to 1.

What is this? We do two small things: (a) divide every score by sqrt(d_k) (= sqrt(4) = 2 here) so the numbers stay small enough for softmax to work nicely; (b) apply softmax across each row, turning raw scores into weights between 0 and 1 that add up to 1.

Why do we need it? Without scaling, dot products grow with vector size and softmax becomes “peaky”. Peaky = one weight near 1.0, the others near 0, the model picks ONE token and ignores everyone else, instead of blending. Bad early in training. Without softmax, we cannot use the scores as a clean weighted-mixing recipe in the next step.

Softmax: take a list of numbers, exponentiate each, divide by the sum. The output is a probability distribution, positive numbers that sum to 1.

What is exp(x)? It means e^x, where e ≈ 2.718 (a constant). Two important properties: (1) it is always positive, (2) bigger inputs grow much bigger, exp(2)/exp(1) ≈ 2.7 but exp(5)/exp(1) ≈ 55. So softmax assigns most of the weight to the top scores.

scaled = scores / sqrt(d_k)
A_ij = exp(scaled_ij) / sum_k exp(scaled_ik)
d_k = 4 (the size of the key/query vectors per head).   A, the attention weight matrix; rows sum to 1.   i, row (query token);   j, column (candidate key token);   k, a dummy index summing across all columns of row i.   “sum_k exp(scaled_ik)” reads as “the sum of exp(scaled_ik) over every k in row i”.
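
Both operations fit in a few lines of Python. The scores below are made up. One addition that is not in the formula above: the sketch subtracts the row maximum before exponentiating, a common numerical-stability trick that leaves the result unchanged.

```python
import math

def softmax(row):
    """Exponentiate and normalize; subtracting the max avoids overflow in exp()."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(scores, d_k=4):
    """Divide every score by sqrt(d_k), then softmax each row."""
    scale = math.sqrt(d_k)
    return [softmax([s / scale for s in row]) for row in scores]

scores = [[2.0, 0.0, -2.0],   # made-up raw scores for 3 tokens
          [0.0, 0.0, 0.0],
          [4.0, 4.0, 0.0]]
A = attention_weights(scores)
for row in A:
    print([round(w, 3) for w in row], "sum =", round(sum(row), 6))
```

The all-equal middle row comes out as an even three-way split, while the first row puts most of its weight on the highest score.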

Scaled scores (head 0)

Just the previous heatmap divided by 2.

Attention weights after softmax (head 0)

Reading this: each row sums to 1.0. Cell (i, j) is the fraction of attention token i pays to token j.

Attention weights after softmax (head 1)

Each row of the attention matrix is now a clean “mixing recipe”: it tells us how to blend the value vectors in the next step.

Step 11

11. Weighted sum with V, the actual attention output

Use the attention weights to mix the value vectors.

What is this? For each token, we compute a weighted sum of all value vectors using the attention weights from the previous step as the recipe. If the token “are” attends 60% to “how” and 40% to “you”, then its new vector is 60% of the value vector of “how” plus 40% of the value vector of “you”.

Why do we need it? This is where information actually moves between tokens. Each token now contains a blended summary of the tokens it cared about most.

Reminder: back in Step 9 we split each token’s 8 numbers into 2 heads of 4. So this head’s V is shape [tokens × 4], and Z ends up [tokens × 4]. The full 8-column output reappears in Step 12 when we concatenate both heads.

Z_h = A_h · V_h
Output shape per head: [tokens × 4]. Each row is a fresh 4-number vector, computed as the weighted blend.
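
The 60% / 40% blend can be reproduced directly. The weights and values here are made up for two tokens and one head; `weighted_sum` is an illustrative name.

```python
def weighted_sum(weights, values):
    """Z_h = A_h · V_h: each output row is a blend of the value rows."""
    d = len(values[0])
    return [[sum(w * v[col] for w, v in zip(row, values)) for col in range(d)]
            for row in weights]

# Made-up numbers: 2 tokens, one head with 4 columns.
A_h = [[0.6, 0.4],    # token 0 blends 60% of row 0 with 40% of row 1
       [0.0, 1.0]]    # token 1 copies row 1
V_h = [[1.0, 0.0, 2.0, 0.0],
       [0.0, 1.0, 0.0, 2.0]]

Z_h = weighted_sum(A_h, V_h)
print(Z_h)  # [[0.6, 0.4, 1.2, 0.8], [0.0, 1.0, 0.0, 2.0]]
```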

The numbers (head 0 output)

Reading this: 4 columns per head (since d_k = 4). One row per token.

The numbers (head 1 output)

Information has crossed token boundaries. Each token now carries a blend influenced by the words it attended to.

Step 12

12. Multi-head, concat heads and output projection

Glue the heads back together and apply one more linear transformation.

What is this? The two head outputs (each [tokens × 4]) are concatenated side-by-side along the column axis to get [tokens × 8]. Then this is multiplied by an output weight matrix W_O [8 × 8] to produce the final attention block output.

Why do we need it? Each head looked at a different pattern. Concat puts them in one tensor. W_O lets the model mix the per-head signals into a unified message.

MultiHead = concat(Z_0, Z_1) · W_O
concat, place head 0’s 4 columns next to head 1’s 4 columns.   W_O, learned [8 × 8] mixing matrix.
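
A small sketch of the merge. The head outputs are made up, and W_O is set to the identity matrix purely as a placeholder so the result is easy to check; a real W_O would mix the columns.

```python
def concat_heads(z0, z1):
    """Place head 0's columns next to head 1's columns, row by row."""
    return [row0 + row1 for row0, row1 in zip(z0, z1)]

def matmul(a, b):
    """Multiply a [n x k] by b [k x m]; both are lists of rows."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

Z0 = [[1.0, 2.0, 3.0, 4.0]]   # made-up head outputs for 1 token
Z1 = [[5.0, 6.0, 7.0, 8.0]]
# Placeholder W_O (assumption): the 8 x 8 identity leaves the input unchanged.
W_O = [[1.0 if r == c else 0.0 for c in range(8)] for r in range(8)]

multi_head = matmul(concat_heads(Z0, Z1), W_O)
print(multi_head)  # [[1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]]
```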

Concatenated heads

After output projection (W_O)

Multiple heads, run in parallel, then merged, the model gets several “perspectives” and learns how to combine them.

Step 13

13. Residual connection + LayerNorm

Add the original input back. Normalize again.

What is this? We take the attention output and add it to the original input X (before any of the attention math). This “residual connection” (also called “skip connection”) is one of the great tricks of modern deep learning.

Why do we need it? Two reasons: (a) it makes the network easy to train, gradients (numbers that tell the training algorithm how to nudge each weight to reduce error) can flow back cleanly through the addition; (b) it gives every block the option to make a small change rather than reinvent the input from scratch. The block can “add a tweak” instead of being the whole story.

Training = repeatedly tweaking weights from billions of examples to minimize prediction error. We skip training in this page and use fixed sin() weights, but residuals would still help training run smoothly.

After the residual, another LayerNorm prepares the vector for the feed-forward network.

res1 = X_input + MultiHead
X_norm2 = LayerNorm(res1)
X_input, the original X from Step 5, not the LayerNormed version. The residual deliberately reaches around the attention block.
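
A compact sketch of these two lines. All numbers are made up, and the LayerNorm here assumes gamma = 1 and beta = 0 to stay short.

```python
import math

def layer_norm(x, eps=1e-5):
    # gamma = 1 and beta = 0 are assumed in this sketch.
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add_rows(a, b):
    """Element-wise sum of two same-shaped matrices."""
    return [[u + v for u, v in zip(ra, rb)] for ra, rb in zip(a, b)]

X_input = [[0.5, -1.0, 2.0, 0.0, 1.5, -0.5, 0.25, 1.0]]      # made-up Step 5 row
multi_head = [[0.1, 0.2, -0.1, 0.0, 0.3, -0.2, 0.05, 0.1]]   # made-up attention output

res1 = add_rows(X_input, multi_head)          # residual uses the un-normalized X
X_norm2 = [layer_norm(row) for row in res1]   # input to the feed-forward network
print([round(v, 2) for v in res1[0]])
```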

X + attention output

After LayerNorm 2

Residuals plus normalization, the “glue” that makes deep transformers trainable.

Step 14

14. Feed-forward network (FFN), per-token thinking

A small two-layer neural network applied to each token independently.

What is this? A simple recipe: take each token vector (size 8), multiply by W_ff1 (shape [8, 16]) and add a bias to expand it to 16 numbers, apply a non-linear function called GeLU, then multiply by W_ff2 (shape [16, 8]) to come back to size 8. So we grow each token from 8 → 16 (more room to think), apply GeLU, then squeeze back 16 → 8.

Why do we need it? Attention mixes information across tokens. The FFN does extra processing inside each token, independently. Together they alternate, mix, think, mix, think.

Linear vs non-linear. A linear operation is a sum-of-multiples (like y = 2x + 3). Stack two linear steps and you get another linear step, adding layers does nothing new. A non-linear function (like GeLU below) introduces kinks/curves; that’s what lets deeper networks learn richer patterns. Without a non-linearity, a 100-layer network has the same expressive power as a 1-layer one.

GeLU (Gaussian Error Linear Unit): a smooth function that lets positive numbers through, dampens negative ones.

FFN(x) = GeLU( x · W_ff1 + b_ff1 ) · W_ff2 + b_ff2
W_ff1 shape [8, 16].   W_ff2 shape [16, 8]. The expansion-then-contraction pattern is sometimes called the “MLP” (multi-layer perceptron) sub-block, MLP = input → wider hidden → output.
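
A sketch of the FFN for a single token. It uses the exact erf-based definition of GeLU; whether the page uses this or the common tanh approximation is not stated, so treat that as an assumption. `toy_matrix` is again a hypothetical stand-in for the sin()-based weights, and the biases are set to zero.

```python
import math

def gelu(x):
    """Exact GeLU: x times the standard normal CDF of x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def linear(x, w, b):
    """x [d_in] times w [d_in x d_out], plus bias b [d_out]."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) + b[j]
            for j in range(len(b))]

def ffn(x, w1, b1, w2, b2):
    hidden = [gelu(h) for h in linear(x, w1, b1)]   # 8 -> 16, then GeLU
    return linear(hidden, w2, b2)                   # 16 -> 8

def toy_matrix(rows, cols, seed):
    # Hypothetical deterministic weights (assumption, not the page's formula).
    return [[math.sin(seed + 1.3 * r + 0.7 * c) for c in range(cols)]
            for r in range(rows)]

W_ff1, b_ff1 = toy_matrix(8, 16, 0.5), [0.0] * 16
W_ff2, b_ff2 = toy_matrix(16, 8, 1.5), [0.0] * 8
out = ffn([0.1 * k for k in range(8)], W_ff1, b_ff1, W_ff2, b_ff2)
print(len(out))                                  # 8
print(round(gelu(1.0), 4), round(gelu(-1.0), 4)) # positive passes, negative is dampened
```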

FFN intermediate (after expansion + GeLU): [tokens × 16]

FFN output: [tokens × 8]

Per-token reshaping, the “thinking” step that complements attention’s “mixing” step.

Step 15

15. Encoder output, context-aware vectors

Add FFN output back to its input. Done with one encoder block.

What is this? The FFN output is added to its input (another residual connection). The result is the encoder output: one 8-number vector per input token, where each vector now encodes the original word, its position, and the relevant context from neighboring words.

Why does this matter? The decoder will look at this matrix when deciding what Spanish word to write next. It is the encoder’s entire summary of the English input.

encoder_output = res1 + FFN(X_norm2)

The encoder output: [tokens × 8]

Reading this: this matrix is the encoder’s answer. The decoder will use this in cross-attention later.

One context-rich vector per input token. The English side is done.

Step 16

16. Stacking, doing it many times

Real transformers don’t stop at one block. They stack many.

What real transformers do: the encoder block we just walked through (LayerNorm → Attention → Residual → LayerNorm → FFN → Residual) is repeated N times back-to-back. The output of one block becomes the input of the next.

  • The original Transformer (2017): N = 6.
  • BERT-large: N = 24.
  • GPT-3: N = 96.

Each layer refines the representation a little more, lower layers tend to capture word-level features, higher layers capture sentence-level meaning. We use N = 1 on this page so the math stays small enough to display, but the per-layer math is identical.

More layers = more refinement, at the cost of more computation. The math in each layer is exactly what you just saw.

Step 17

17. The decoder, writing Spanish one word at a time

Generation is a loop: produce one token, append it, repeat.

Bridge from the encoder. So far we built encoder_output: one 8-number vector per English token (Section 15). Now we hand it to the decoder, whose job is to write the Spanish sentence. Crucially, the decoder works one token at a time: it writes a word, appends it to its own input, and runs again. Sections 18 to 22 walk through one decoder pass; Section 23 shows the loop in full.

What is this? The decoder produces the Spanish translation one token at a time. We start with a special “begin” token <bos> (beginning-of-sentence). The decoder reads everything written so far plus the encoder’s output, then predicts what the next token should be. We append the predicted token and repeat. We stop when the decoder predicts <eos> (end-of-sentence).

Why does it work this way? Language is sequential: the next word depends on every previous word. The autoregressive loop captures this: each decision is made knowing the full prefix.

<bos> and <eos> are special vocabulary tokens we add on purpose: <bos> says “start here”, <eos> says “I am done”. The model treats them as words it can predict, just like “hola”.

argmax: pick the option with the biggest score. argmax([0.1, 0.7, 0.2]) = 1 (the index of 0.7).

Decoder block (parallel to Section 6)

Decoder input (so far): tokens written so far
Masked Self-Attention -- look only at past Spanish tokens (Step 18)
Residual + LayerNorm
Cross-Attention -- look at the English encoder output (Step 19)
Residual + LayerNorm
Feed-Forward Network -- per-token thinking (Step 20)
Residual + Final LayerNorm
Output projection → logits → softmax → argmax -- pick next Spanish token (Steps 21 to 22)

y_0 = <bos>
y_(t+1) = argmax( decoder( [y_0, …, y_t], encoder_output ) )
repeat until y_(t+1) = <eos>
y_t, the t-th generated token.   argmax, pick the highest-probability token.
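
The loop itself is short when written out. In this sketch the decoder is a scripted stand-in that ignores the encoder output and follows a fixed list of answers; it is an assumption made only so the loop can run on its own. A real decoder would perform Steps 18 to 22 at that point.

```python
def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])

def greedy_decode(decoder_step, encoder_output, bos, eos, max_len=20):
    """Autoregressive loop: predict, append, repeat until <eos>."""
    tokens = [bos]
    while len(tokens) < max_len:
        scores = decoder_step(tokens, encoder_output)   # one score per vocab entry
        next_token = argmax(scores)
        if next_token == eos:
            break
        tokens.append(next_token)
    return tokens[1:]   # drop <bos>

# Hypothetical vocabulary and scripted decoder, for illustration only.
VOCAB = ["<bos>", "<eos>", "hola", ",", "como", "estas", "?"]
SCRIPT = [2, 3, 4, 5, 6, 1]

def scripted_decoder(tokens, encoder_output):
    target = SCRIPT[len(tokens) - 1]
    return [1.0 if i == target else 0.0 for i in range(len(VOCAB))]

result = greedy_decode(scripted_decoder, encoder_output=None, bos=0, eos=1)
print(argmax([0.1, 0.7, 0.2]))       # 1
print([VOCAB[i] for i in result])    # ['hola', ',', 'como', 'estas', '?']
```

The `max_len` cap is a safety net, not part of the formula: it stops the loop if a decoder never predicts `<eos>`.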

The autoregressive loop, visually

[<bos>] → decoder → hola
[<bos>, hola] → decoder → ,
[<bos>, hola, ,] → decoder → como
[<bos>, hola, ,, como] → decoder → estas
[<bos>, hola, ,, como, estas] → decoder → ?
[<bos>, hola, ,, como, estas, ?] → decoder → <eos>  (stop)

Each iteration runs the decoder block: masked self-attention → cross-attention → FFN → final norm → output projection → softmax → argmax. We will walk through the first iteration in detail, then show all iterations side-by-side.

The loop is what turns one model call into a full sentence.

Step 18

18. Masked self-attention, the causal triangle

Same Q·K·V machinery as the encoder, plus a mask that hides the future.

What is this? Inside the decoder, self-attention works the same way as in the encoder: queries, keys, values, scores, softmax, weighted sum. But before applying softmax we add a causal mask: a triangular grid of zeros and minus-infinities that forces each position to attend only to itself and earlier positions.

Why do we need it? During generation, the future doesn’t exist yet; we are about to predict it. Letting position t attend to position t+1 would be cheating. The mask makes the math forbid it.

Why minus infinity? We add the mask before softmax. softmax of -inf is 0, so masked positions contribute zero weight after softmax.

M[i, j] = 0 if j ≤ i, else −∞
A = softmax( (Q·K^T)/sqrt(d_k) + M )
The mask is a [t × t] grid where t is the current decoder length. The lower triangle plus diagonal stays 0; the upper triangle becomes −∞.

The causal mask (last decoder iteration shown)

Reading this: by the last iteration, the decoder input has length t (e.g. 5 tokens for “hello, how are you”). The mask is a t×t grid: lower triangle & diagonal are 0 (allowed), upper triangle is -inf (blocked from looking at future tokens).

What it looks like in general (4 tokens):
row 0 = [0, -inf, -inf, -inf], first token can only see itself
row 1 = [0, 0, -inf, -inf], second token sees position 0 and 1
row 2 = [0, 0, 0, -inf], third token sees positions 0, 1, 2
row 3 = [0, 0, 0, 0], fourth token sees everyone before it (including itself)
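
A sketch of the mask and its effect on softmax, using made-up scores that are all equal so the triangle is easy to see in the output. Function names are illustrative.

```python
import math

def causal_mask(t):
    """M[i][j] = 0 if j <= i, else -inf."""
    return [[0.0 if j <= i else -math.inf for j in range(t)] for i in range(t)]

def softmax(row):
    m = max(row)                             # the diagonal is never masked, so m is finite
    exps = [math.exp(v - m) for v in row]    # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

def masked_weights(scaled_scores):
    """Add the mask to the scaled scores, then softmax each row."""
    mask = causal_mask(len(scaled_scores))
    return [softmax([s + m for s, m in zip(row, mask_row)])
            for row, mask_row in zip(scaled_scores, mask)]

A = masked_weights([[0.0, 0.0, 0.0]] * 3)    # made-up, all-equal scores
for row in A:
    print([round(w, 3) for w in row])
# [1.0, 0.0, 0.0]
# [0.5, 0.5, 0.0]
# [0.333, 0.333, 0.333]
```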

Masked attention weights at this step

After the mask is added and softmax is applied. Notice the upper triangle is now exactly 0 (because exp(-inf) = 0) and each row still sums to 1.

A simple triangle of -infs prevents information from leaking backward in time. Generation can only look at what already exists.

Step 19

19. Cross-attention, the bridge from English to Spanish

Decoder asks the encoder: which English words should I focus on right now?

What is this? Cross-attention is just like self-attention, except the queries come from the decoder while the keys and values come from the encoder output. The decoder “queries” the English representation to pull in relevant information.

Concretely: the decoder has been writing Spanish so far. To pick the NEXT Spanish word, it forms a query from its own current state, then asks every English encoder vector “how relevant are you?”. This is the only place where English-side information flows into Spanish-side computation.

Why do we need it? When the decoder is about to produce hola, this layer lets it look at hello in the English side. When it’s about to produce estas, it can look at are and you.

Q = decoder_state · W_cQ
K = encoder_output · W_cK
V = encoder_output · W_cV
cross_out = softmax((Q·K^T)/sqrt(d_k)) · V
Q comes from the decoder; K and V from the encoder.   decoder_state, the LayerNormed output of the masked-self-attention residual from Step 18 (one 8-number vector per Spanish token written so far).   No mask here; the decoder is free to look at any English position.
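
A single-head sketch of these four lines. For brevity it does not split into two heads of 4 as the page does, so d_k is 8 here; that simplification, the made-up inputs, and the `toy_matrix` weights are all assumptions.

```python
import math

def matmul(a, b):
    """Multiply a [n x k] by b [k x m]; both are lists of rows."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def toy_matrix(rows, cols, seed):
    # Hypothetical deterministic weights (assumption, not the page's formula).
    return [[math.sin(seed + 1.3 * r + 0.7 * c) for c in range(cols)]
            for r in range(rows)]

def cross_attention(decoder_state, encoder_output, W_cQ, W_cK, W_cV):
    Q = matmul(decoder_state, W_cQ)       # queries come from the decoder
    K = matmul(encoder_output, W_cK)      # keys come from the encoder
    V = matmul(encoder_output, W_cV)      # values come from the encoder
    d_k = len(Q[0])
    scores = matmul(Q, transpose(K))      # [decoder positions x English tokens]
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V), weights    # no causal mask is applied

decoder_state = toy_matrix(1, 8, 0.2)     # just <bos> on the first iteration
encoder_output = toy_matrix(5, 8, 0.9)    # 5 English tokens
W_cQ, W_cK, W_cV = (toy_matrix(8, 8, s) for s in (1.0, 2.0, 3.0))
cross_out, weights = cross_attention(decoder_state, encoder_output, W_cQ, W_cK, W_cV)
print(len(weights), len(weights[0]), len(cross_out[0]))  # 1 5 8
```

The weight grid has one row per decoder position and one column per English token, matching the heatmap layout described for this step.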

Cross-attention weights (first iteration, head 0)

Reading this: rows = decoder positions (just <bos> here, since this is the first iteration). Columns = English tokens. Value = attention weight from that decoder position to that English position.

Cross-attention weights (first iteration, head 1)

This is where source-language information flows into the target-language decoder. Without it, the decoder couldn’t translate.

Step 20

20. Decoder FFN + final LayerNorm

Same FFN structure as the encoder; one final normalization at the end.

What is this? After cross-attention, the decoder runs a feed-forward network identical in shape to the encoder’s FFN (Linear → GeLU → Linear). Then a final LayerNorm produces a clean vector ready for the output projection.

Why do we need it? Same reasons as in the encoder: per-token thinking that complements the “mixing” done by the two attention sub-blocks.

Decoder vector after final LayerNorm (first iteration)

Reading this: 8 numbers. This vector represents “everything the model thinks about position 0 (i.e., what comes after <bos>)”.

A single 8-number vector remains. The next step turns it into a probability over every word in the vocabulary.

Step 21

21. Output projection, from hidden to logits

Multiply by W_out to produce one number per vocabulary token.

What is this? The 8-number decoder vector is multiplied by W_out (shape [8 × vocab_size]) to produce a vector of logits, one raw score per word in the vocabulary.

Why do we need it? We need a score per vocabulary token to pick a winner. The output projection is the gateway from the abstract 8-dim space into “which word should I say” space.

Logits: raw, unnormalized scores. Bigger means “more likely” but they are not yet probabilities.

As promised in the intro: in a real, trained transformer, billions of parameters trained on millions of examples cause the right Spanish token to win at each step. Our toy weights are arbitrary (deterministic, but unlearned). To make the architecture produce the correct output for our example phrases, we add a teaching bias (+20) to the target token’s logit at each step; this is the only manual nudge on the whole page. Every other step you have seen is real math. When the input is an unsupported phrase, the bias is zero and the output reflects the (gibberish) raw projection, which is honest about its limits.
logits = decoder_vec · W_out + b_out   (+ teaching_bias)
W_out shape [8, vocab_size=29].   Why 29 columns? Our vocabulary has 29 words, so we need 29 scores, one per candidate next-token.
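
A sketch of the projection with the optional nudge. To keep the arithmetic visible it uses a made-up 2-number decoder vector and a 3-word vocabulary instead of 8 and 29; the function name and the `target_id` parameter are illustrative assumptions.

```python
def output_logits(decoder_vec, W_out, b_out, target_id=None, teaching_bias=20.0):
    """logits = decoder_vec · W_out + b_out, plus an optional bias on one token."""
    logits = [sum(decoder_vec[i] * W_out[i][j] for i in range(len(decoder_vec))) + b_out[j]
              for j in range(len(b_out))]
    if target_id is not None:   # supported phrase: nudge the target token
        logits[target_id] += teaching_bias
    return logits

# Made-up sizes: 2 hidden numbers, 3 vocabulary entries.
decoder_vec = [1.0, -1.0]
W_out = [[0.5, 0.0, 2.0],
         [0.5, 1.0, 0.0]]
b_out = [0.0, 0.0, 0.0]

print(output_logits(decoder_vec, W_out, b_out))               # [0.0, -1.0, 2.0]
print(output_logits(decoder_vec, W_out, b_out, target_id=1))  # [0.0, 19.0, 2.0]
```

Without the bias, index 2 would win; with it, the lowest-scoring token (index 1) jumps to the top, which is exactly the kind of override the note above describes.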

Logits (first iteration): top entries shown

Reading this: bars are sorted by score. Picked token in orange.

A score per vocabulary word. The next step normalizes them into probabilities.

Step 22

22. Softmax over vocabulary, probabilities

Exponentiate, normalize, pick the highest. That is the next token.

What is this? Apply softmax to the logits to get a probability distribution over the entire vocabulary, one positive number per word, summing to 1. Then pick the word with the highest probability (this is called argmax).

Why probabilities? In real chatbots, always picking the most likely word makes responses dull. Instead, they sample a word at random, weighted by these probabilities, which is why ChatGPT can give different answers to the same question. Our toy uses pure argmax (always the top one) for clarity.

Note: because of the teaching bias from Step 21, our picked token’s probability appears very confident. In a real trained model this confidence comes from learned weights, not a manual nudge.

p_i = exp(logits_i) / sum_j exp(logits_j)
next_token = argmax_i( p_i )
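As a rough sketch, the two formulas above translate directly into plain Python. The example logits are invented, and subtracting the maximum before exponentiating is a common numerical-stability habit that we assume here; it does not change the resulting probabilities.

```python
import math

def softmax(logits):
    # Subtracting the max avoids overflow in exp(); the ratios are unchanged.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    # Index of the largest entry.
    return max(range(len(values)), key=lambda i: values[i])

# Made-up logits: the last entry plays the role of a token that got the +20 nudge.
logits = [2.0, 1.0, 0.1, 22.0]
probs = softmax(logits)
next_token = argmax(probs)
```

A gap of 20 in logit space becomes a factor of exp(20) in probability space, which is why the nudged token ends up with almost all of the probability mass.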

Probability distribution (first iteration): top 8 candidates

Same shape as Step 21’s logit chart, but now the bars are probabilities (summing to 1 across the full vocabulary). The picked bar is much more dominant than its raw logit was; that is the exponential in softmax doing its job. Note: only the top 8 are shown; the other 21 vocabulary words have tiny probabilities that make up the rest of the 1.0 total.

One word wins each step. That is the next token in our Spanish output.

Step 23

23. The full loop, every step of generation

We just walked through iteration 1. Here are all iterations side-by-side.

What is this? The whole sequence of decoder iterations. Each box below is one full pass through steps 18 to 22, with that step’s picked token. The decoder builds up the Spanish sentence one token at a time, stopping at <eos>.

Why does seeing the loop matter? Most diagrams of transformers show a single block. The loop is what turns those blocks into a translation. The encoder runs once; the decoder runs once per output token.

Iteration k = the k-th call to the decoder; at iteration k the decoder input has length k (<bos> plus the k − 1 tokens picked so far). Each card below shows the picked word (orange) plus the top 3 candidates. Notice how often the wrong words still have non-trivial probability.

Translation alignment. Each card also shows a tiny cross-attention strip: the row is the decoder’s newest query position, and the columns are the English source tokens. Watch which English word lights up when the decoder picks each Spanish word.

All iterations

The encoder reads the English sentence once. The decoder runs in a loop, one full pass per Spanish word it produces.
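The shape of that loop can be sketched as follows. This is a hedged illustration only: `toy_encode`, `toy_decode_step`, the scripted output, and the `max_len` cap are hypothetical stand-ins for the real encoder and decoder passes, included just so the loop runs on its own.

```python
BOS, EOS = "<bos>", "<eos>"

def greedy_decode(encode, decode_step, source_tokens, max_len=10):
    memory = encode(source_tokens)   # the encoder runs exactly once
    output = [BOS]                   # decoder input starts with <bos>
    for _ in range(max_len):         # safety cap in case <eos> never appears
        # One full decoder pass: attends to `memory`, returns the argmax token.
        next_token = decode_step(memory, output)
        output.append(next_token)
        if next_token == EOS:
            break
    return output

# Hypothetical stand-ins so the sketch is runnable without a real model.
def toy_encode(tokens):
    return tuple(tokens)

SCRIPT = ["hola", EOS]

def toy_decode_step(memory, decoder_input):
    # At iteration k the decoder input has length k, so index k - 1 is next.
    return SCRIPT[len(decoder_input) - 1]

result = greedy_decode(toy_encode, toy_decode_step, ["hello"])
```

Here `result` ends with `<eos>`, and the decoder step was called once per produced token while the encoder was called only once.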

Step 24

24. Final translated sentence

Stitch the picked tokens together. We’re done.

All generated tokens (excluding <bos> and <eos>) are joined with spaces, with no space placed before commas or question marks. That is the translation.
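One way to write that joining rule, as a small sketch: the function name `stitch` and the example tokens are our own illustration, not taken from the page's code.

```python
def stitch(tokens):
    # Drop the special start/end markers.
    words = [t for t in tokens if t not in ("<bos>", "<eos>")]
    text = ""
    for tok in words:
        # Add a space before every token except the first,
        # and never before a comma or question mark.
        if text and tok not in (",", "?"):
            text += " "
        text += tok
    return text

# Hypothetical token sequence for illustration.
sentence = stitch(["<bos>", "hola", ",", "como", "estas", "?", "<eos>"])
```

With these example tokens, `sentence` comes out as `hola, como estas?`.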

Generated token sequence

Final translation

Spanish
--

The full pipeline: tokenize → embed → add positions → encode → decode in a loop → project → softmax → argmax → stitch. Every chatbot you use is a much bigger version of this.

Step 25

25. What we left out

Honest list of simplifications.

  • Many layers. Real transformers stack 6 to 96 encoder/decoder layers. We used 1.
  • Bigger vectors. Real d_model ranges from 768 (BERT-base) to 12288 (GPT-3). We used 8.
  • More heads. Real attention often uses 8, 16, or 96 heads. We used 2.
  • Sub-word tokens. Real systems use BPE or SentencePiece, breaking rare words into pieces (“unbelievable” → [“un”, “believ”, “able”]). We used whole words and an <unk> fallback.
  • Trained weights. Real weights are learned by gradient descent on huge corpora. Our weights come from a sin() formula, and we add a fixed +20 bias to the target token’s output logit so that supported phrases translate correctly. How do real models acquire these weights? See Issue 02: Fine-tuning, step by step.
  • Dropout, weight decay, learning rate schedules. Training tricks. None of them affect inference math.
  • KV caching. Real decoders cache the K and V matrices across iterations to avoid recomputing them. We recompute, because it is simpler.
  • Sampling strategies. Real systems use temperature, top-k, top-p, beam search. We used pure argmax.
  • Decoder-only models. GPT-style models drop the encoder entirely and just do autoregressive generation. The math overlaps heavily with what you saw.
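
To make the sampling bullet above a little more concrete, here is a minimal sketch of temperature plus top-k sampling in plain Python. It is one simple variant under our own assumptions (function name, default values, seeded generator), not a description of how any particular production system implements it.

```python
import math
import random

def sample_top_k(logits, k=3, temperature=1.0, rng=None):
    rng = rng or random.Random(0)  # seeded so the sketch is reproducible
    # Keep only the k highest-scoring token ids.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Temperature rescales logits: below 1 sharpens, above 1 flattens.
    scaled = [logits[i] / temperature for i in top]
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]  # unnormalized softmax weights
    # random.choices normalizes the weights internally.
    return rng.choices(top, weights=weights, k=1)[0]

logits = [1.0, 3.0, 2.0, 0.5]  # made-up scores for a 4-word vocabulary
rng = random.Random(42)
picks = [sample_top_k(logits, k=3, temperature=1.0, rng=rng) for _ in range(50)]
cold = [sample_top_k(logits, k=3, temperature=0.01, rng=rng) for _ in range(50)]
```

With k=3 the lowest-scoring token (index 3) can never be picked, and as the temperature approaches zero the behaviour collapses to the pure argmax used on this page.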

Despite all the simplifications, the architecture you saw on this page is genuinely how transformers work. Real systems are this same dance, scaled up.