The Transformer, step by step.
Type a sentence. Watch every part of a transformer translate it into Spanish, with real numbers, real math, and plain‑English explanations of every step.
A transformer is the kind of neural network that powers modern AI like ChatGPT, Claude, and Google Translate. It reads a sentence, builds a rich internal “understanding” of it, and writes a new sentence one word at a time. This page walks through every step on a tiny model so you can see exactly what is happening at each stage. No background in math, neural networks, or language models is required; we will define every term as we go.
The toy model on this page uses d_model = 8, 2 attention heads, and 1 encoder + 1 decoder layer. All weights are produced by a deterministic sin() formula (no randomness), so refreshing the page gives identical numbers. Real systems like GPT use 12 to 96 layers, embeddings of size 768 to 12288, and weights learned from massive text data, but every step you will see below is the same math, just smaller.
One small intervention: at the very last step (output logits), we add +20 to the correct Spanish token so the example phrases translate cleanly. This is explained openly in Section 21. Every other number on this page is real, unmodified math.
1. Tokenization, cutting the sentence into pieces
Turning a string of letters into a list of discrete pieces.
What is this? A sentence is a string of characters. Computers struggle to think about whole sentences directly; they prefer small, atomic pieces. Tokenization is the chopping board: it takes raw text and cuts it into a list of tokens, where each token is one word or one piece of punctuation.
Why do we need it? Every later step works on a fixed list of pieces. We also lowercase the text so “Hello” and “hello” share the same identity. Punctuation like commas and question marks become their own tokens because they carry meaning.
The numbers
Reading this: each pill below is one token. Order is left-to-right, just like English.
A sentence becomes an ordered list of small pieces, the smallest units the model will reason about.
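The chopping described above can be sketched in a few lines of Python. This is a minimal illustration, assuming a regex-based splitter; the function name tokenize and the exact pattern are our own choices, not necessarily what the page runs.

```python
import re

def tokenize(text: str) -> list[str]:
    """Lowercase the text, then split it into words and single punctuation marks."""
    # \w+ matches a run of letters/digits; [^\w\s] matches one punctuation character.
    return re.findall(r"\w+|[^\w\s]", text.lower())

tokens = tokenize("Hello, how are you?")
print(tokens)  # ['hello', ',', 'how', 'are', 'you', '?']
```

Lowercasing happens before splitting, so “Hello” and “hello” produce the same token.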
2. Token IDs, looking up each token in the vocabulary
Each token becomes a number: its index in the model’s dictionary.
What is this? The model has a fixed vocabulary: a list of every token it knows. Each token has a numeric index in that list (its token ID). To turn our list of words into something we can compute on, we replace each token with its ID.
Why do we need it? Math operates on numbers, not words. An ID is the simplest possible numeric handle. If a word is not in the vocabulary, we use a special <unk> (“unknown”) ID instead.
The numbers
Reading this: each pill shows the token, then its position in the vocabulary list.
Words become integers, coordinates into the model’s memory.
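The lookup can be sketched as a dictionary from token to index. The vocabulary below is hypothetical (the page's real list and ordering are not shown here), so the specific IDs are illustrative only.

```python
# Hypothetical toy vocabulary; the real list and its ordering are assumptions.
VOCAB = ["<unk>", "<bos>", "<eos>", ",", "?", "hello", "how", "are", "you"]
TOKEN_TO_ID = {token: index for index, token in enumerate(VOCAB)}

def to_ids(tokens: list[str]) -> list[int]:
    """Replace each token with its vocabulary index, falling back to <unk>."""
    unk_id = TOKEN_TO_ID["<unk>"]
    return [TOKEN_TO_ID.get(token, unk_id) for token in tokens]

print(to_ids(["hello", ",", "how", "are", "you", "?"]))  # [5, 3, 6, 7, 8, 4]
print(to_ids(["hello", "zebra"]))                        # [5, 0]
```

The second call shows the fallback: “zebra” is not in this vocabulary, so it maps to the <unk> ID.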
3. Token embeddings, a list of numbers per word
Each ID becomes an 8-dimensional vector that captures meaning.
Vector definition first: a vector is just a fixed-length list of numbers, like (0.3, -0.1, 0.8, ...). An 8-dimensional vector means a list of 8 numbers.
What is this? A single integer like 7 tells us nothing about what “how” means. So we replace each ID with a vector (8 numbers, here). This vector is looked up from a giant table called the embedding matrix WE, with one row per vocabulary word.
A 2D analogy first. Imagine plotting words on a flat map. “king” and “queen” sit near each other; “banana” sits far away. Each word becomes a point with two coordinates, e.g. (0.3, 0.7). We do the same here, but with 8 coordinates instead of 2. “Close” just means the numbers are similar.
Why do we need it? Vectors can encode many features at once. In a real model, the embedding for “king” ends up close to “queen” in this 8-dimensional space, far from “banana”. The model gets to position similar words near each other so later math can compare them.
The numbers
Reading this: each row is one token’s 8-number embedding. All numbers come from a deterministic sin() formula based on (vocab ID, dimension), so they are reproducible.
Words become small lists of numbers, rich enough to encode meaning, small enough to do math on.
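A rough sketch of an embedding lookup built from a deterministic sin() formula follows. The page does not spell out its exact formula, so the constants 0.7 and 1.3 below are purely hypothetical; the point is only that the same (vocab ID, dimension) pair always yields the same number.

```python
import math

D_MODEL = 8  # embedding size used on this page

def embedding_row(token_id: int) -> list[float]:
    """One row of the embedding matrix W_E for a given vocabulary ID."""
    # Hypothetical formula: any deterministic function of (token_id, dim) would do.
    return [math.sin(0.7 * (token_id + 1) + 1.3 * dim) for dim in range(D_MODEL)]

def embed(ids: list[int]) -> list[list[float]]:
    """Look up one 8-number vector per token ID."""
    return [embedding_row(token_id) for token_id in ids]

E = embed([5, 3, 6])
print(len(E), len(E[0]))  # 3 8
```

In a trained model the rows of W_E would be learned, not computed from a formula.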
4. Positional encoding, telling the model where each word sits
A separate vector per position, made of sine and cosine waves.
What is this? In Step 8 (attention) the model will look at all words simultaneously. That step has no idea which word came first; it sees them as a bag, not a list. So we stamp each word with its position before that step: a second 8-dim vector per position, computed purely from the position number using sines and cosines.
Why do we need it? “Dog bites man” vs “Man bites dog” have the same words in different positions, with very different meanings. Position must be part of the input, or attention is order-blind.
Why sines and cosines? They give every position a unique “fingerprint” and let the model easily learn relationships like “3 positions apart” through simple math.
PE[pos, 2i] = sin(pos / 10000^(2i / d_model))
PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
10000: an arbitrary large constant the original paper picked; it controls how fast the wavelengths grow across columns.
pos: the token’s position (0, 1, 2, …).
i: the pair index (0 to d_model/2 − 1). The two columns of pair i are 2i and 2i+1; even columns use sin, odd use cos.
Worked example: PE[1, 2] = sin(1 / 10000^(2/8)) = sin(1 / 10) = sin(0.1) ≈ 0.0998. Look at the pos 1 row, column d2 below; it should match.
The numbers
Reading this: row pos 0 is the “position 0” vector. Even columns are sines, odd columns are cosines. These numbers are fixed, not learned, not random.
Position is injected as a structured vector that depends only on where in the sentence the token sits.
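The sine/cosine scheme can be checked with a short sketch. It assumes the standard formulation from the original Transformer paper (even columns sin, odd columns cos, base 10000); the function name is ours.

```python
import math

def positional_encoding(num_positions: int, d_model: int = 8) -> list[list[float]]:
    """Build a [num_positions x d_model] grid of fixed position vectors."""
    pe = []
    for pos in range(num_positions):
        row = []
        for col in range(d_model):
            i = col // 2  # pair index: columns 2i and 2i+1 share one wavelength
            angle = pos / (10000 ** (2 * i / d_model))
            row.append(math.sin(angle) if col % 2 == 0 else math.cos(angle))
        pe.append(row)
    return pe

PE = positional_encoding(5)
print(round(PE[1][2], 4))  # 0.0998
```

Position 0 always comes out as alternating 0 and 1, because sin(0) = 0 and cos(0) = 1.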
5. Combined input X, meaning plus position
Add the two vectors together, element by element. This is what flows into the encoder.
What is this? We literally add the embedding vector and the position vector for each token. The result X is one 8-number vector per token that encodes both what the token is and where it sits.
Why do we need it? One vector per token is the input shape every later block expects. Adding (rather than concatenating) keeps the size at 8 and lets the model pull apart the two signals when needed.
The numbers
Reading this: shape is [number_of_tokens × 8]. This X matrix is the encoder’s starting point.
Each token now carries one 8-number vector that knows what word it is and where it is.
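Element-by-element addition is simple enough to show directly. The two-column numbers below are made up for illustration (the page uses 8 columns), and the helper name is ours.

```python
def add_matrices(a: list[list[float]], b: list[list[float]]) -> list[list[float]]:
    """Add two same-shaped matrices cell by cell."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]

# Illustrative values only: 2 tokens, 2 dimensions.
embeddings = [[0.5, -0.25], [0.25, 0.5]]
positions = [[0.0, 1.0], [0.5, 0.5]]

X = add_matrices(embeddings, positions)
print(X)  # [[0.5, 0.75], [0.75, 1.0]]
```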
6. The encoder block, big picture
The encoder rewrites every token vector to be context-aware. Here is its skeleton.
What is this? The encoder takes X and produces a new sequence of 8-number vectors, same shape, but each new vector is “informed” by the other tokens. The word “bank” in “river bank” should end up with a different vector than in “savings bank”, because attention pulls in context.
Why do we need it? Word meaning depends on neighbors. The encoder is the part that lets each word “ask” the others “who is relevant to me?” and absorb their information.
What is inside one encoder block
Don’t worry about understanding this yet. Sections 7 to 15 unpack each box one at a time, with live numbers, in the order shown above. By the end of Section 15 every box will be a familiar friend.
An encoder block has two sub-blocks: attention (mixes information across tokens) and feed-forward (extra processing per token). Each sub-block is wrapped in a normalization step and a residual connection.
7. LayerNorm, keeping the numbers tidy
Per-token normalization: mean to 0, variance to 1, then a small adjustment.
Activation = the numbers flowing between layers (just our token vectors here). Why might they blow up? Each matmul can multiply numbers by 8+ different values and add the results. Repeat that 12 times in a deep network, and a starting number of 1 can become 1,000,000 or shrink to 0.000001. Learned parameters = numbers adjusted during training; in our toy model, the “learned” parameters come from a fixed sin() formula, not real training.
What is this? For each token vector, we compute the mean and standard deviation of its 8 numbers, subtract the mean, and divide by the standard deviation. At that point the vector is standardized: zero mean and unit variance. We then multiply and shift by two learned parameters (gamma and beta).
Why do we need it? LayerNorm yanks every vector back into a stable range so deeper layers can do their job. Think of it as a thermostat keeping the pipes at a steady temperature.
Mean: the average of the numbers. Variance: how spread out they are (average squared distance from the mean). Standard deviation: square root of variance, a typical “spread” size in the original units.
The numbers (encoder LayerNorm 1)
Reading this: rows are tokens, columns are the 8 vector dimensions. Watch what happens: each row gets shifted (mean → 0) and rescaled (spread → 1), then nudged by gamma/beta. The stats table shows what we subtract and divide by.
Before LayerNorm (= X from step 5):
Row stats (mean / variance per token):
After LayerNorm:
LayerNorm is mechanical, no information is added, only the scale is fixed. It buys numerical stability for everything that follows.
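Here is a minimal sketch of LayerNorm for a single token vector. The small eps added under the square root is a common safeguard against division by zero and is an assumption here; with the default gamma = 1 and beta = 0 the output is simply the standardized vector.

```python
import math

def layer_norm(vector, gamma=None, beta=None, eps=1e-5):
    """Standardize one token vector, then scale by gamma and shift by beta."""
    n = len(vector)
    gamma = gamma or [1.0] * n   # default: no rescaling
    beta = beta or [0.0] * n     # default: no shift
    mean = sum(vector) / n
    variance = sum((v - mean) ** 2 for v in vector) / n
    std = math.sqrt(variance + eps)  # eps is an assumed stabilizer
    return [g * (v - mean) / std + b for v, g, b in zip(vector, gamma, beta)]

row = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
normed = layer_norm(row)
print([round(v, 3) for v in normed])
```

Each token row is normalized on its own; no statistics are shared across tokens.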
8. Q, K, V, three lenses on each word
From one input matrix, build three views: Query, Key, Value.
Weight matrix definition first: a weight matrix is just a grid of numbers the model multiplies the input by. The numbers come from training (in our toy, from a fixed sin() formula). “Weights” are these numbers.
What is this? We multiply the normalized input by three different weight matrices (WQ, WK, WV) to produce three new matrices: Q (queries), K (keys), and V (values). Same shape as input, different content.
Why do we need it? Picture a librarian. Every book has a topic label (the key) and contents (the value). When you walk in with a question (the query), instead of picking one book, you grab a little bit from every book, weighted by how well its label matches your question. That weighted blend is what comes out of attention.
• Query (Q): “What am I looking for?”, a question each token asks.
• Key (K): “What do I advertise?”, a label each token displays.
• Value (V): “What do I deliver?”, the actual information returned if matched.
Three views let the model decouple “what to look for”, “what to be findable as”, and “what to send back”.
Matrix multiplication (matmul) in plain words: if A has 5 rows and 8 columns, and B has 8 rows and 3 columns, the result has 5 rows and 3 columns. Each output number is computed by taking a row of A, lining it up with a column of B, multiplying pair-by-pair, and adding the products.
The numbers
Reading this: three [tokens × 8] grids. Each row is one token’s query / key / value vector.
Q (queries):
K (keys):
V (values):
Three new matrices, queries, keys, values, built by three different linear projections of the same input.
Next we use Q and K together to compute how much each token should attend to each other token, that’s the attention score grid in Section 9.
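The row-times-column rule for matrix multiplication can be written out directly. This is a plain-Python sketch with small made-up matrices; in the model, Q, K and V would each come from one such call, for example Q = matmul(X_norm, W_Q), where X_norm and W_Q are the normalized input and the query weight matrix.

```python
def matmul(a, b):
    """Multiply an [n x m] matrix by an [m x p] matrix, giving [n x p]."""
    inner = len(b)
    assert all(len(row) == inner for row in a), "inner dimensions must match"
    return [
        [sum(row[k] * b[k][j] for k in range(inner)) for j in range(len(b[0]))]
        for row in a
    ]

A = [[1, 2], [3, 4], [5, 6]]   # 3 rows, 2 columns
B = [[1, 0, 2], [0, 1, 3]]     # 2 rows, 3 columns
print(matmul(A, B))  # [[1, 2, 8], [3, 4, 18], [5, 6, 28]]
```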
9. Attention scores, how much does each token care about each other?
Q dotted with K-transpose: similarity between every pair of tokens.
Dot product first. A dot product takes two same-length lists, multiplies pair-by-pair, and adds the results. Example: (1, 2, 3) · (4, 5, 6) = 1·4 + 2·5 + 3·6 = 4 + 10 + 18 = 32. That’s the whole operation.
Pipeline preview (Sections 9 to 12): these four sections are one continuous pipeline. (1) compute scores from Q and K. (2) scale & softmax to get attention weights. (3) use weights to mix V into per-head outputs. (4) concat heads and project. Output: a new vector per token, same shape as input.
What is this step? For every pair of tokens (i, j), we compute the dot product of token i’s query and token j’s key. The result is a square grid of “raw scores”: how aligned token i’s question is with token j’s advertisement.
Why does dot product equal similarity? Picturing 4-number lists as arrows is just an analogy. What really matters is this: similar lists give a big positive dot product, opposite lists give a negative one, unrelated lists give roughly zero. So the dot product is a clean numeric measure of “how aligned”.
Heads, briefly. A head is a slice of columns. We split Q’s 8 columns into Q[:, 0:4] (head 0) and Q[:, 4:8] (head 1), and likewise for K and V. We then run the entire attention computation twice in parallel, once per head, and merge the results in Section 12. Each head can learn a different pattern of attention.
Kh is [tokens × 4]; we transpose it to [4 × tokens] so the inner dimensions match for matmul. Output shape: [tokens × tokens], one score per pair.
Color legend (applies to all heatmaps below). Dark orange = highest in this grid; dark indigo = lowest; cream = middle. Colors are normalized per grid, so a dark orange in one heatmap is not directly comparable to a dark orange in another.
The numbers (head 0)
Reading this: rows are query tokens, columns are key tokens. Hover for exact values. Bigger = more match.
The numbers (head 1)
Same shape, different head, learns a different pattern of attention.
A square grid of similarities, one per head, each head will weight tokens differently in the next step.
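The head split and the score grid can be sketched as follows. The Q and K values are made up so the arithmetic is easy to follow by hand; helper names are ours.

```python
def dot(u, v):
    """Multiply pair-by-pair and add the results."""
    return sum(a * b for a, b in zip(u, v))

def head_slice(matrix, head, d_k=4):
    """Keep only the columns that belong to one head."""
    return [row[head * d_k:(head + 1) * d_k] for row in matrix]

def attention_scores(q_head, k_head):
    """Score every (query token, key token) pair: equivalent to Q_h times K_h transposed."""
    return [[dot(q, k) for k in k_head] for q in q_head]

print(dot([1, 2, 3], [4, 5, 6]))  # 32

# Illustrative 2-token example with 8 columns (2 heads of 4).
Q = [[1, 0, 0, 0, 0, 1, 0, 0], [0, 1, 0, 0, 1, 0, 0, 0]]
K = [[1, 0, 0, 0, 0, 0, 0, 1], [0, 2, 0, 0, 0, 3, 0, 0]]
scores_h0 = attention_scores(head_slice(Q, 0), head_slice(K, 0))
scores_h1 = attention_scores(head_slice(Q, 1), head_slice(K, 1))
print(scores_h0)  # [[1, 0], [0, 2]]
print(scores_h1)  # [[0, 3], [0, 0]]
```

The two heads see the same tokens but different columns, so their score grids differ.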
10. Scaling and softmax, turning raw scores into weights
Divide by sqrt(d_k), then squash each row to probabilities that sum to 1.
What is this? We do two small things: (a) divide every score by sqrt(d_k) (= sqrt(4) = 2 here) so the numbers stay small enough for softmax to work nicely; (b) apply softmax across each row, turning raw scores into weights between 0 and 1 that add up to 1.
Why do we need it? Without scaling, dot products grow with vector size and softmax becomes “peaky”. A peaky row has one weight near 1.0 and the others near 0, so the model picks ONE token and ignores everyone else instead of blending. That is especially harmful early in training, because a saturated softmax passes back very small gradients. Without softmax, we cannot use the scores as a clean weighted-mixing recipe in the next step.
Softmax: take a list of numbers, exponentiate each, divide by the sum. The output is a probability distribution, positive numbers that sum to 1.
What is exp(x)? It means e^x, where e ≈ 2.718 (a constant). Two important properties: (1) it is always positive; (2) bigger inputs grow much bigger: exp(2)/exp(1) ≈ 2.7 but exp(5)/exp(1) ≈ 55. So softmax assigns most of the weight to the top scores.
weight[i, j] = exp(scaled[i, j]) / Σ_k exp(scaled[i, k]). Read the denominator as “the sum of exp(scaled[i, k]) over every k in row i”.
Scaled scores (head 0)
Just the previous heatmap divided by 2.
Attention weights after softmax (head 0)
Reading this: each row sums to 1.0. Cell (i, j) is the fraction of attention token i pays to token j.
Attention weights after softmax (head 1)
Each row of the attention matrix is now a clean “mixing recipe”: it tells us how to blend the value vectors in the next step.
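Scaling and row-wise softmax fit in a few lines. This sketch subtracts the row maximum before exponentiating, a standard trick that avoids overflow without changing the result; whether the page does the same is an assumption. The input scores are made up.

```python
import math

def softmax(row):
    """Turn a list of scores into positive weights that sum to 1."""
    peak = max(row)  # subtracting the max leaves the result unchanged
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(scores, d_k=4):
    """Divide every score by sqrt(d_k), then apply softmax to each row."""
    scale = math.sqrt(d_k)
    return [softmax([s / scale for s in row]) for row in scores]

weights = attention_weights([[2.0, 0.0], [0.0, 0.0]])
print([[round(w, 4) for w in row] for row in weights])
# [[0.7311, 0.2689], [0.5, 0.5]]
```

In the first row the scaled scores are 1 and 0, so the weights are e/(e+1) and 1/(e+1); equal scores in the second row give an even split.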
11. Weighted sum with V, the actual attention output
Use the attention weights to mix the value vectors.
What is this? For each token, we compute a weighted sum of all value vectors using the attention weights from the previous step as the recipe. If the token “are” attends 60% to “how” and 40% to “you”, then its new vector is 60% of the value vector of “how” plus 40% of the value vector of “you”.
Why do we need it? This is where information actually moves between tokens. Each token now contains a blended summary of the tokens it cared about most.
Reminder: back in Step 9 we split each token’s 8 numbers into 2 heads of 4. So this head’s V is shape [tokens × 4], and Z ends up [tokens × 4]. The full 8-column output reappears in Step 12 when we concatenate both heads.
The numbers (head 0 output)
Reading this: 4 columns per head (since d_k = 4). One row per token.
The numbers (head 1 output)
Information has crossed token boundaries. Each token now carries a blend influenced by the words it attended to.
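The 60% / 40% blend can be reproduced with a small sketch. The value vectors are made up (one-hot rows make the mixing visible), and the function name is ours.

```python
def mix_values(weights, values):
    """For each token, blend all value vectors using its row of attention weights."""
    num_cols = len(values[0])
    return [
        [sum(w * v[c] for w, v in zip(w_row, values)) for c in range(num_cols)]
        for w_row in weights
    ]

# One query token attending 60% to "how" and 40% to "you" (illustrative values).
weights = [[0.6, 0.4]]
V_head = [
    [1.0, 0.0, 0.0, 0.0],  # value vector for "how"
    [0.0, 1.0, 0.0, 0.0],  # value vector for "you"
]
Z = mix_values(weights, V_head)
print(Z)  # [[0.6, 0.4, 0.0, 0.0]]
```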
12. Multi-head, concat heads and output projection
Glue the heads back together and apply one more linear transformation.
What is this? The two head outputs (each [tokens × 4]) are concatenated side-by-side along the column axis to get [tokens × 8]. Then this is multiplied by an output weight matrix WO [8 × 8] to produce the final attention block output.
Why do we need it? Each head looked at a different pattern. Concat puts them in one tensor. WO lets the model mix the per-head signals into a unified message.
Concatenated heads
After output projection (W_O)
Multiple heads, run in parallel, then merged, the model gets several “perspectives” and learns how to combine them.
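Concatenation followed by the output projection might look like the sketch below. The head outputs are made up, and an identity matrix stands in for W_O purely so the result is easy to verify; the real W_O comes from the page's sin() formula.

```python
def matmul(a, b):
    """Multiply an [n x m] matrix by an [m x p] matrix."""
    return [
        [sum(row[k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for row in a
    ]

def concat_heads(heads):
    """Join matching rows of each [tokens x d_k] head output side by side."""
    return [[x for head_row in rows for x in head_row] for rows in zip(*heads)]

z0 = [[1, 2, 3, 4], [5, 6, 7, 8]]          # head 0 output, 2 tokens
z1 = [[10, 20, 30, 40], [50, 60, 70, 80]]  # head 1 output, 2 tokens
concat = concat_heads([z0, z1])
print(concat[0])  # [1, 2, 3, 4, 10, 20, 30, 40]

# Stand-in for W_O: the identity matrix leaves the concatenation unchanged.
W_O = [[1 if r == c else 0 for c in range(8)] for r in range(8)]
attention_output = matmul(concat, W_O)
```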
13. Residual connection + LayerNorm
Add the original input back. Normalize again.
What is this? We take the attention output and add it to the original input X (before any of the attention math). This “residual connection” (also called “skip connection”) is one of the great tricks of modern deep learning.
Why do we need it? Two reasons: (a) it makes the network easy to train, gradients (numbers that tell the training algorithm how to nudge each weight to reduce error) can flow back cleanly through the addition; (b) it gives every block the option to make a small change rather than reinvent the input from scratch. The block can “add a tweak” instead of being the whole story.
Training = repeatedly tweaking weights over billions of examples to minimize prediction error. We skip training on this page and use fixed sin() weights, but residuals would still help training run smoothly.
After the residual, another LayerNorm prepares the vector for the feed-forward network.
X + attention output
After LayerNorm 2
Residuals plus normalization, the “glue” that makes deep transformers trainable.
14. Feed-forward network (FFN), per-token thinking
A small two-layer neural network applied to each token independently.
What is this? A simple recipe: take each token vector (size 8), multiply by Wff1 (shape [8, 16]) and add a bias to expand it to 16 numbers, apply a non-linear function called GeLU, then multiply by Wff2 (shape [16, 8]) to come back to size 8. So we grow each token from 8 → 16 (more room to think), apply GeLU, then squeeze back 16 → 8.
Why do we need it? Attention mixes information across tokens. The FFN does extra processing inside each token, independently. Together they alternate, mix, think, mix, think.
Linear vs non-linear. A linear operation is a sum-of-multiples (like y = 2x + 3). Stack two linear steps and you get another linear step, adding layers does nothing new. A non-linear function (like GeLU below) introduces kinks/curves; that’s what lets deeper networks learn richer patterns. Without a non-linearity, a 100-layer network has the same expressive power as a 1-layer one.
GeLU (Gaussian Error Linear Unit): a smooth function that lets positive numbers through, dampens negative ones.
FFN intermediate (after expansion + GeLU): [tokens × 16]
FFN output: [tokens × 8]
Per-token reshaping, the “thinking” step that complements attention’s “mixing” step.
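The expand, GeLU, squeeze recipe can be sketched for a single token vector. This uses the exact erf-based GeLU; the page may use the common tanh approximation instead. The weights below come from a hypothetical sin() formula, not the page's actual one.

```python
import math

def gelu(x):
    """Exact GeLU: x times the standard normal CDF at x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def linear(vec, weight, bias):
    """vec (length n) times weight [n x m], plus bias (length m)."""
    return [
        sum(v * weight[k][j] for k, v in enumerate(vec)) + bias[j]
        for j in range(len(bias))
    ]

def ffn(vec, w1, b1, w2):
    """8 -> 16 with bias, GeLU, then 16 -> 8."""
    hidden = [gelu(h) for h in linear(vec, w1, b1)]
    return linear(hidden, w2, [0.0] * len(w2[0]))

# Hypothetical deterministic weights, for illustration only.
W_ff1 = [[math.sin(r + 2 * c) for c in range(16)] for r in range(8)]
b_ff1 = [0.0] * 16
W_ff2 = [[math.sin(3 * r + c) for c in range(8)] for r in range(16)]

out = ffn([0.5] * 8, W_ff1, b_ff1, W_ff2)
print(len(out))           # 8
print(round(gelu(1), 4))  # 0.8413
```

The same ffn call is applied to every token row independently; no information moves between tokens here.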
15. Encoder output, context-aware vectors
Add FFN output back to its input. Done with one encoder block.
What is this? The FFN output is added to its input (another residual connection). The result is the encoder output: one 8-number vector per input token, where each vector now encodes the original word, its position, and the relevant context from neighboring words.
Why does this matter? The decoder will look at this matrix when deciding what Spanish word to write next. It is the encoder’s entire summary of the English input.
The encoder output: [tokens × 8]
Reading this: this matrix is the encoder’s answer. The decoder will use this in cross-attention later.
One context-rich vector per input token. The English side is done.
16. Stacking, doing it many times
Real transformers don’t stop at one block. They stack many.
What real transformers do: the encoder block we just walked through (LayerNorm → Attention → Residual → LayerNorm → FFN → Residual) is repeated N times back-to-back. The output of one block becomes the input of the next.
- The original Transformer (2017): N = 6.
- BERT-large: N = 24.
- GPT-3: N = 96.
Each layer refines the representation a little more: lower layers tend to capture word-level features, while higher layers capture sentence-level meaning. We use N = 1 on this page so the math stays small enough to display, but the per-layer math is identical.
More layers = more refinement, at the cost of more computation. The math in each layer is exactly what you just saw.
17. The decoder, writing Spanish one word at a time
Generation is a loop: produce one token, append it, repeat.
Bridge from the encoder. So far we built encoder_output: one 8-number vector per English token (Section 15). Now we hand it to the decoder, whose job is to write the Spanish sentence. Crucially, the decoder works one token at a time: it writes a word, appends it to its own input, and runs again. Sections 18 to 22 walk through one decoder pass; Section 23 shows the loop in full.
What is this? The decoder produces the Spanish translation one token at a time. We start with a special “begin” token <bos> (beginning-of-sentence). The decoder reads everything written so far plus the encoder’s output, then predicts what the next token should be. We append the predicted token and repeat. We stop when the decoder predicts <eos> (end-of-sentence).
Why does it work this way? Language is sequential, the next word depends on every previous word. The autoregressive loop captures this: each decision is made knowing the full prefix.
<bos> and <eos> are special vocabulary tokens we add on purpose: <bos> says “start here”, <eos> says “I am done”. The model treats them as words it can predict, just like “hola”.
argmax: pick the option with the biggest score. argmax([0.1, 0.7, 0.2]) = 1 (the index of 0.7).
Decoder block (parallel to Section 6)
The autoregressive loop, visually
[<bos>] → decoder → hola
[<bos>, hola] → decoder → ,
[<bos>, hola, ,] → decoder → como
[<bos>, hola, ,, como] → decoder → estas
[<bos>, hola, ,, como, estas] → decoder → ?
[<bos>, hola, ,, como, estas, ?] → decoder → <eos> (stop)
Each iteration runs the decoder block: masked self-attention → cross-attention → FFN → final norm → output projection → softmax → argmax. We will walk through the first iteration in detail, then show all iterations side-by-side.
The loop is what turns one model call into a full sentence.
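The loop itself is short once the decoder is treated as a black box. In this sketch, scripted_decoder is a hypothetical stand-in that simply replays the example sequence shown above; a real decoder would run masked self-attention, cross-attention, and the rest. The max_len guard is also our addition.

```python
def generate(decoder_step, encoder_output, max_len=20):
    """Autoregressive loop: predict a token, append it, repeat until <eos>."""
    tokens = ["<bos>"]
    while len(tokens) < max_len:
        next_token = decoder_step(tokens, encoder_output)
        if next_token == "<eos>":
            break
        tokens.append(next_token)
    return tokens[1:]  # drop <bos>

# Hypothetical stand-in decoder: replays a fixed script instead of doing real math.
SCRIPT = ["hola", ",", "como", "estas", "?", "<eos>"]

def scripted_decoder(tokens, encoder_output):
    return SCRIPT[len(tokens) - 1]

result = generate(scripted_decoder, encoder_output=None)
print(result)  # ['hola', ',', 'como', 'estas', '?']
```

Note that the encoder output is passed unchanged into every call: the encoder runs once, the decoder runs once per output token.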
18. Masked self-attention, the causal triangle
Same Q·K·V machinery as the encoder, plus a mask that hides the future.
What is this? Inside the decoder, self-attention works the same way as in the encoder, queries, keys, values, scores, softmax, weighted sum. But before applying softmax we add a causal mask: a triangular grid of zeros and minus-infinities that forces each position to attend only to itself and earlier positions.
Why do we need it? During generation, the future doesn’t exist yet, we are about to predict it. Letting position t attend to position t+1 would be cheating. The mask makes the math forbid it.
Why minus infinity? We add the mask before softmax. Softmax exponentiates each score, and exp(-inf) is 0, so masked positions contribute zero weight after softmax.
The causal mask (last decoder iteration shown)
Reading this: by the last iteration, the decoder input has length t (e.g. 5 tokens for “hello, how are you”). The mask is a t×t grid: lower triangle & diagonal are 0 (allowed), upper triangle is -inf (blocked from looking at future tokens). The rows below show the pattern for t = 4.
row 0 = [0, -inf, -inf, -inf]: first token can only see itself
row 1 = [0, 0, -inf, -inf]: second token sees positions 0 and 1
row 2 = [0, 0, 0, -inf]: third token sees positions 0, 1, 2
row 3 = [0, 0, 0, 0]: fourth token sees everyone before it (including itself)
Masked attention weights at this step
After the mask is added and softmax is applied. Notice the upper triangle is now exactly 0 (because exp(-inf) = 0) and each row still sums to 1.
A simple triangle of -infs prevents information from leaking backward in time. Generation can only look at what already exists.
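The mask and its effect on softmax can be sketched as below. To isolate the mask's effect, the raw scores are all set to 0 here (an illustrative choice), so each row splits its weight evenly over the allowed positions.

```python
import math

def causal_mask(t):
    """t x t grid: 0 on and below the diagonal, -inf above it."""
    return [[0.0 if col <= row else -math.inf for col in range(t)] for row in range(t)]

def softmax(row):
    peak = max(row)  # the diagonal is always allowed, so peak is finite
    exps = [math.exp(x - peak) for x in row]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

def masked_weights(scores):
    """Add the causal mask to the scores, then softmax each row."""
    mask = causal_mask(len(scores))
    return [
        softmax([s + m for s, m in zip(score_row, mask_row)])
        for score_row, mask_row in zip(scores, mask)
    ]

weights = masked_weights([[0.0] * 4 for _ in range(4)])
print(weights[0])  # [1.0, 0.0, 0.0, 0.0]
print(weights[1])  # [0.5, 0.5, 0.0, 0.0]
print(weights[3])  # [0.25, 0.25, 0.25, 0.25]
```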
19. Cross-attention, the bridge from English to Spanish
Decoder asks the encoder: which English words should I focus on right now?
What is this? Cross-attention is just like self-attention, except the queries come from the decoder while the keys and values come from the encoder output. The decoder “queries” the English representation to pull in relevant information.
Concretely: the decoder has been writing Spanish so far. To pick the NEXT Spanish word, it forms a query from its own current state, then asks every English encoder vector “how relevant are you?”. This is the only place where English-side information flows into Spanish-side computation.
Why do we need it? When the decoder is about to produce hola, this layer lets it look at hello in the English side. When it’s about to produce estas, it can look at are and you.
Cross-attention weights (first iteration, head 0)
Reading this: rows = decoder positions (just <bos> here, since this is the first iteration). Columns = English tokens. Value = attention weight from that decoder position to that English position.
Cross-attention weights (first iteration, head 1)
This is where source-language information flows into the target-language decoder. Without it, the decoder couldn’t translate.
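A single-head sketch of cross-attention shows where each input comes from. It assumes the queries, keys, and values have already been projected (the weight-matrix multiplications are omitted), and all numbers are made up.

```python
import math

def softmax(row):
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(decoder_queries, encoder_keys, encoder_values):
    """Queries come from the decoder; keys and values come from the encoder output."""
    d_k = len(encoder_keys[0])
    all_weights, outputs = [], []
    for q in decoder_queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in encoder_keys]
        weights = softmax(scores)
        all_weights.append(weights)
        outputs.append([
            sum(w * v[c] for w, v in zip(weights, encoder_values))
            for c in range(len(encoder_values[0]))
        ])
    return all_weights, outputs

# One decoder position (<bos>) looking at three English tokens (illustrative values).
queries = [[1.0, 0.0, 0.0, 0.0]]
keys = [[4.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
values = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
weights, mixed = cross_attention(queries, keys, values)
print([round(w, 3) for w in weights[0]])  # [0.787, 0.107, 0.107]
```

The weight grid has one row per decoder position and one column per English token, matching the heatmap layout described above. No causal mask is applied, since the whole English sentence already exists.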
20. Decoder FFN + final LayerNorm
Same FFN structure as the encoder; one final normalization at the end.
What is this? After cross-attention, the decoder runs a feed-forward network identical in shape to the encoder’s FFN (Linear → GeLU → Linear). Then a final LayerNorm produces a clean vector ready for the output projection.
Why do we need it? Same reasons as in the encoder: per-token thinking that complements the “mixing” done by the two attention sub-blocks.
Decoder vector after final LayerNorm (first iteration)
Reading this: 8 numbers. This vector represents “everything the model thinks about position 0 (i.e., what comes after <bos>)”.
A single 8-number vector remains. The next step turns it into a probability over every word in the vocabulary.
21. Output projection, from hidden to logits
Multiply by W_out to produce one number per vocabulary token.
What is this? The 8-number decoder vector is multiplied by Wout (shape [8 × vocab_size]) to produce a vector of logits, one raw score per word in the vocabulary.
Why do we need it? We need a score per vocabulary token to pick a winner. The output projection is the gateway from the abstract 8-dim space into “which word should I say” space.
Logits: raw, unnormalized scores. Bigger means “more likely” but they are not yet probabilities.
Logits (first iteration): top entries shown
Reading this: bars are sorted by score. Picked token in orange.
A score per vocabulary word. The next step normalizes them into probabilities.
22. Softmax over vocabulary, probabilities
Exponentiate, normalize, pick the highest. That is the next token.
What is this? Apply softmax to the logits to get a probability distribution over the entire vocabulary, one positive number per word, summing to 1. Then pick the word with the highest probability (this is called argmax).
Why probabilities? In real chatbots, always picking the most likely word makes responses dull. So they sample randomly weighted by these probabilities, which is why ChatGPT gives different answers to the same question. Our toy uses pure argmax (always the top one) for clarity.
Note: because of the teaching bias from Step 21, our picked token’s probability appears very confident. In a real trained model this confidence comes from learned weights, not a manual nudge.
Probability distribution (first iteration): top 8 candidates
Same shape as Step 21’s logit chart, but now bars are probabilities (sum to 1 across the full vocab). The picked bar is much more dominant than its raw logit was, that’s the exponential in softmax doing its job. Note: only top 8 are shown; the other 21 vocabulary words have tiny probabilities making up the rest of the 1.0 total.
One word wins each step. That is the next token in our Spanish output.
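Sections 21 and 22 together amount to: compute logits, apply softmax, take argmax. The sketch below starts from made-up logits over a hypothetical four-word vocabulary (the multiplication by W_out is omitted) and also shows how adding +20 to one logit makes that token win almost all the probability.

```python
import math

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    """Index of the largest entry."""
    return max(range(len(values)), key=lambda i: values[i])

print(argmax([0.1, 0.7, 0.2]))  # 1

vocab = ["<eos>", "hola", "como", "estas"]  # hypothetical mini-vocabulary
logits = [1.0, 3.0, 0.5, 2.0]               # illustrative raw scores
probs = softmax(logits)
picked = vocab[argmax(probs)]
print(picked)  # hola

# Effect of a +20 nudge on one logit (here index 2, chosen for illustration).
biased = list(logits)
biased[2] += 20.0
biased_probs = softmax(biased)
print(vocab[argmax(biased_probs)])  # como
```

Because softmax is exponential, a +20 offset pushes the nudged token's probability to essentially 1, which is why the picked bar on this page looks so confident.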
23. The full loop, every step of generation
We just walked through iteration 1. Here are all iterations side-by-side.
What is this? The whole sequence of decoder iterations. Each box below is one full pass through steps 18 to 22, with that step’s picked token. The decoder builds up the Spanish sentence one token at a time, stopping at <eos>.
Why does seeing the loop matter? Most diagrams of transformers show a single block. The loop is what turns those blocks into a translation. The encoder runs once; the decoder runs once per output token.
Iteration k = the k-th call to the decoder; during iteration k the decoder input has length k (<bos> plus the k − 1 tokens picked so far). Each card below shows the picked word (orange) plus the top 3 candidates. Notice how often the wrong words still have non-trivial probability.
Translation alignment. Each card also shows a tiny cross-attention strip, row = the decoder’s newest query position, columns = English source tokens. Watch which English word lights up when the decoder picks each Spanish word.
All iterations
The encoder reads the English sentence once. The decoder runs in a loop, one full pass per Spanish word it produces.
24. Final translated sentence
Stitch the picked tokens together. We’re done.
All generated tokens (excluding <bos> and <eos>) are joined with spaces, with no space placed before commas or question marks. That is the translation.
Generated token sequence
Final translation
The full pipeline: tokenize → embed → add positions → encode → decode in a loop → project → softmax → argmax → stitch. Every chatbot you use is a much bigger version of this.
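The stitching rule from this section (join with spaces, no space before commas or question marks, drop the special tokens) can be sketched as follows; the function name is ours.

```python
def detokenize(tokens):
    """Join tokens with spaces, skipping <bos>/<eos> and hugging punctuation left."""
    text = ""
    for token in tokens:
        if token in {"<bos>", "<eos>"}:
            continue
        if text and token not in {",", "?"}:
            text += " "  # space only before ordinary words
        text += token
    return text

print(detokenize(["<bos>", "hola", ",", "como", "estas", "?", "<eos>"]))
# hola, como estas?
```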
25. What we left out
Honest list of simplifications.
- Many layers. Real transformers stack 6 to 96 encoder/decoder layers. We used 1.
- Bigger vectors. Real d_model is 768 (BERT), 12288 (GPT-3). We used 8.
- More heads. Real attention often uses 8, 16, or 96 heads. We used 2.
- Sub-word tokens. Real systems use BPE or SentencePiece, breaking rare words into pieces (“unbelievable” → [“un”, “believ”, “able”]). We used whole words and an <unk> fallback.
- Trained weights. Real weights are learned by gradient descent on huge corpora. Our weights come from a sin() formula; we hand-tuned a small bias on the output to make supported phrases translate correctly. How do real models acquire these weights? See Issue 02: Fine-tuning, step by step.
- Dropout, weight decay, learning rate schedules. Training tricks. None of them affect inference math.
- KV caching. Real decoders cache the K and V matrices across iterations to avoid recomputing them. We recompute, because it is simpler.
- Sampling strategies. Real systems use temperature, top-k, top-p, beam search. We used pure argmax.
- Decoder-only models. GPT-style models drop the encoder entirely and just do autoregressive generation. The math overlaps heavily with what you saw.
Despite all the simplifications, the architecture you saw on this page is genuinely how transformers work. Real systems are this same dance, scaled up.