How LLMs Work — A Visual Walkthrough

Architecture Overview

Almost every large language model today — GPT, LLaMA, Claude — uses a decoder-only transformer. The original 2017 "Attention Is All You Need" paper had both an encoder and a decoder, but it turned out that for language generation, the decoder side alone is enough and scales better. So "decoder-only" became the standard.

I finally had some time to revisit the architecture and build better muscle memory for it. What follows is a step-by-step walkthrough with illustrations, including a full model that runs in the browser on WebGPU with no external dependencies.

Here's the flow from beginning to end:

Text is broken into tokens (subwords), and each token is converted to a vector via the embedding table. The vector's dimension varies by model; this example uses 512, meaning each token is encoded as 512 float values.

Position embeddings are added so the model knows the position of each token. The original attention paper used sines and cosines of different frequencies; later models such as GPT-2 switched to trained parameters, and many recent ones (LLaMA, for example) use rotary position embeddings instead.

The sequence passes through N identical layers — each layer has two sub-steps: attention (tokens look at each other) and feed-forward (each token processed independently)

After all layers, a linear projection maps the last position's vector to scores over the vocabulary

Softmax turns those scores into probabilities — the highest is the predicted next token

The key insight: attention lets each token gather relevant information from other positions in the sequence, while the feed-forward layer transforms that information locally. Stack enough layers and the model learns surprisingly rich language patterns. This demo uses a tiny 2-layer model for illustration.

Byte Pair Encoding

Language models don't read text character by character — they work on tokens. BPE does not use any linguistic rules to split words. It looks at the text itself, finds what pairs of bytes appear most often, and merges them into tokens.

It starts from the ground floor: all 256 possible byte values. Every ASCII character, every punctuation mark, every possible byte. That's the initial vocabulary — and it can already encode anything, including multi-byte characters like Chinese, Japanese, or emoji.

How non-ASCII text works before any merge:

Chinese: most common characters are 3 bytes in UTF-8. Before any merges, one such character = 3 tokens.

Emoji: 🔥 is 4 bytes in UTF-8, so it starts as 4 tokens.

There are no unknown words. Any text falls back to its raw bytes.

Then the merging begins. Scan all adjacent pairs in the training text and count how often each pair appears. The most frequent pair gets merged into a new token. On Shakespeare, the first merges look like: e + space → "e " (because "the ", "he ", "be " all end with "e "), then t + h → "th", then th + e → "the", and so on. Each merge reduces the total token count by collapsing one pair everywhere it appears. This repeats until you hit the target vocabulary size.
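
The merge loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the code behind the demo: the function names (`count_pairs`, `merge_pair`, `train_bpe`) are made up here, ties between equally frequent pairs are broken by whichever pair was seen first, and the training text is a toy string rather than Shakespeare.

```python
from collections import Counter

def count_pairs(ids):
    """Count how often each adjacent pair of token ids occurs."""
    return Counter(zip(ids, ids[1:]))

def merge_pair(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train_bpe(text, num_merges):
    """Learn up to `num_merges` merges, starting from the 256 byte values."""
    ids = list(text.encode("utf-8"))
    merges = {}  # (left_id, right_id) -> new token id
    for step in range(num_merges):
        pairs = count_pairs(ids)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair (ties: first seen)
        new_id = 256 + step
        ids = merge_pair(ids, best, new_id)
        merges[best] = new_id
    return merges, ids

merges, ids = train_bpe("the theme of the thesis", num_merges=3)
print(merges)
print(len(ids), "tokens after merging")
```

On this toy string the first merge is t + h, since "th" is tied for most frequent pair and is seen first. Each merge shortens the id sequence, which is the compression effect described above.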

How non-ASCII text works after merge:

Chinese: after enough merges on Chinese training data, the 3 bytes of a common character get merged into a single token. Uncommon characters stay as up to 3 tokens.

Emoji: 🔥 is 4 bytes in UTF-8, so it starts as 4 tokens. If it appears often enough in the training corpus, it gets its own merged token. Rare emoji stay as multiple byte tokens.

Unknown words: there are no unknown words. Any text falls back to its raw bytes, which are always in the base 256 vocabulary. You might get inefficient tokenization, but encoding never fails.
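
The byte counts above are easy to check. The sketch below shows the base vocabulary only (no merges applied); `byte_tokens` is a hypothetical helper name used for illustration.

```python
def byte_tokens(text):
    """Base-vocabulary tokenization: one token per UTF-8 byte (ids 0..255)."""
    return list(text.encode("utf-8"))

for sample in ["A", "中", "🔥"]:
    ids = byte_tokens(sample)
    print(sample, len(ids), ids)

# Decoding is just the reverse: bytes back to text, so nothing is ever "unknown".
roundtrip = bytes(byte_tokens("中🔥")).decode("utf-8")
print(roundtrip)
```

"A" is one byte, "中" is three, and "🔥" is four, which matches the token counts before any merges.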

The vocabulary below starts at 256 and grows to 768 after 512 merges on Shakespeare's complete works. Watch the left chart to see which pairs are most frequent at each step.

Tokenize

After building the BPE vocabulary, encoding text means applying the merge rules greedily: start with individual bytes, then repeatedly apply the highest-priority merge that matches.

Let's trace how "To be, or not to be" gets tokenized — focusing on just "To":

Start: T, o and space are individual byte tokens — 3 tokens in the beginning

BPE learned "To" as a merged token (capital T followed by o is very common in English text)

The space after makes "To " — that's another merge candidate, and it likely becomes its own token too

Result: "To " → 1 token. Not 3 separate bytes.
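
The trace above can be reproduced with a small encoder. This is a sketch under assumptions: the merge table `toy_merges` is invented for illustration (the real ids depend on the trained vocabulary), and "priority" is taken to mean "learned earlier", i.e. the lower new token id wins.

```python
def encode(text, merges):
    """Tokenize `text` with learned merges; `merges` maps (left, right) -> new id.

    Lower new ids were learned earlier, so they are applied first.
    """
    ids = list(text.encode("utf-8"))
    while len(ids) >= 2:
        candidates = [p for p in zip(ids, ids[1:]) if p in merges]
        if not candidates:
            break
        pair = min(candidates, key=lambda p: merges[p])  # highest-priority merge
        new_id = merges[pair]
        out, i = [], 0
        while i < len(ids):
            if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
                out.append(new_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids = out
    return ids

# Hypothetical merge table: "T" + "o" -> 256, then "To" + " " -> 257.
toy_merges = {(ord("T"), ord("o")): 256, (256, ord(" ")): 257}
print(encode("To be", toy_merges))  # [257, 98, 101]
```

With this toy table, "To " collapses to the single id 257, while "b" and "e" stay as raw bytes because no merge covers them.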

Each token id shown below is the row index into the embedding table — that's how text becomes numbers. Type anything to see how it splits.

Token Embedding

Every token is an integer id — for example, "To" might be token 418. The embedding matrix has one row per vocabulary entry, shaped [vocab_size × dim_model]. Looking up token 418 means taking row 418 of that matrix. That's it — a table lookup.
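
As a rough sketch, the lookup really is just list indexing. The sizes here are assumptions for readability: `vocab_size` matches the 768-token vocabulary above, but `dim_model` is shrunk to 8 so a row can be printed, and the values are random noise standing in for an untrained table.

```python
import random

vocab_size, dim_model = 768, 8  # dim_model shrunk for illustration
rng = random.Random(0)

# One row per token id, initialized as small random noise (before any training).
embedding = [[rng.gauss(0.0, 0.02) for _ in range(dim_model)]
             for _ in range(vocab_size)]

def embed(token_ids):
    """Look up one row per token id: result is [seq_len x dim_model]."""
    return [embedding[t] for t in token_ids]

x = embed([418, 12, 418])
print(len(x), "x", len(x[0]))
```

The same token id always returns the same row, so positions 0 and 2 above get identical vectors until position information is added.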

Here's the part that confused me for a while: the embedding matrix is trained alongside all the other model weights. It's neither a hand-crafted dictionary of word meanings nor separately trained. It starts as random noise and gets updated every training step by gradient descent, just like Q, K, V weights. The model learns what each token's vector should be by predicting what comes next in billions of sentences.

By the end of training, tokens that appear in similar contexts end up with similar vectors: "dog" and "puppy" cluster together. This emerges automatically from training; you don't specify it.

In that sense, token embeddings are simply a by-product of training the LLM.

Position Embedding

Transformers have no built-in sense of order. Without position embeddings, "The cat sat" and "sat cat The" would look identical to the model.

The original "Attention Is All You Need" paper used fixed sinusoidal embeddings. The formula: for position pos and dimension pair i, dim 2i uses sin(pos / 10000^(2i/d)) and dim 2i+1 uses cos(pos / 10000^(2i/d)). The intuition: different dimensions get different frequencies, like different clock speeds. Dim 0 oscillates fast (sin(0), sin(1), sin(2)...), while in a small example with d = 8, dim 6 oscillates very slowly (sin(0/1000), sin(1/1000)...). Each position gets a unique fingerprint across all dimensions:

Position 0, dim 0 → sin(0) = 0.00

Position 1, dim 0 → sin(1) ≈ 0.84

Position 1, dim 6 → sin(1/1000) ≈ 0.001 (nearly flat)
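
The three values above can be reproduced with a short function. One assumption is made explicit: the model width is taken to be `d_model = 8`, because that is the width at which dim 6 works out to sin(pos / 1000). The function name is illustrative.

```python
import math

def sinusoidal_position(pos, d_model):
    """Fixed position vector: sin on even dims, cos on odd dims."""
    vec = []
    for dim in range(d_model):
        pair = dim // 2  # dims 2i and 2i+1 share one frequency
        angle = pos / (10000 ** (2 * pair / d_model))
        vec.append(math.sin(angle) if dim % 2 == 0 else math.cos(angle))
    return vec

d_model = 8  # assumed toy width; 10000 ** (6 / 8) == 1000
p0 = sinusoidal_position(0, d_model)
p1 = sinusoidal_position(1, d_model)
print(p0[0])            # sin(0)
print(round(p1[0], 2))  # sin(1)
print(round(p1[6], 3))  # sin(1/1000)
```

Printing `sinusoidal_position(pos, 8)` for a few consecutive positions shows the low dims changing quickly and the high dims barely moving.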

In practice, many models (GPT-2, for example) use learnable position embeddings instead: trained parameters initialized randomly and updated by gradient descent. Why did they catch on? They can perform better because the model learns whatever positional pattern helps the specific task, rather than being locked to a fixed formula. The cost is that they don't generalize beyond the training sequence length. Newer models such as LLaMA use rotary position embeddings (RoPE), which rotate Q and K by a position-dependent angle rather than adding a vector to the input.

Toggle to compare sinusoidal and learnable embeddings.

Embedded Input

Each input position becomes a vector by adding two vectors together: the token embedding (what word it is) + the position embedding (where it sits in the sequence). The result is the embedded input — shape [seq_len × dim_model].

A concrete example: suppose token "cat" has embedding value 0.42 at dimension 3, and position 2 has embedding value −0.15 at dimension 3. The combined value is simply 0.42 + (−0.15) = 0.27. This happens independently for every dimension.
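
The same arithmetic, applied to every cell at once, looks roughly like this. All numbers are made up for illustration, and the vectors are only 2 wide so the 0.42 + (−0.15) example is easy to spot.

```python
def add_embeddings(token_vecs, position_vecs):
    """Element-wise sum of two [seq_len x dim_model] lists of rows."""
    return [[t + p for t, p in zip(t_row, p_row)]
            for t_row, p_row in zip(token_vecs, position_vecs)]

token_vecs = [[0.10, 0.42], [0.30, -0.20]]     # made-up token embeddings
position_vecs = [[0.05, -0.15], [0.00, 0.25]]  # made-up position embeddings
combined = add_embeddings(token_vecs, position_vecs)
print([[round(v, 2) for v in row] for row in combined])
```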

Why does simple addition work? The network learns to disentangle them during training. Some dimensions end up encoding mostly "what" (the token identity), others encode mostly "where" (the position). The model figures out which is which by itself — we don't specify it. The animation below shows each cell flying from its token embedding + position embedding into the combined result.

QKV Projection

Input: the embedded sequence [seq_len × dim_model]. Output: Q, K, V — all three at once — from a single matrix multiply.

W_QKV is shaped [dim_model × 3·dim_model]: three sections side by side. The left third produces Q (queries), the middle third K (keys), and the right third V (values). One matmul, three results. During training, each section learns a different role.

The multi-head split is already baked in here. Each head's Q slice occupies a contiguous column range within the Q section: head 0 gets columns 0..dim_head−1, head 1 gets dim_head..2·dim_head−1, and so on. It's not three separate operations — it's just one big matrix multiplication.
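
Here is a small plain-Python sketch of that single multiply followed by slicing. The sizes (`seq_len = 3`, `dim_model = 4`, `n_heads = 2`) and the random weights are assumptions chosen to keep the output readable; a real implementation would run this on the GPU.

```python
import random

def matmul(a, b):
    """Plain-Python matrix multiply: [n x k] @ [k x m] -> [n x m]."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

seq_len, dim_model, n_heads = 3, 4, 2
dim_head = dim_model // n_heads
rng = random.Random(0)

x = [[rng.uniform(-1, 1) for _ in range(dim_model)] for _ in range(seq_len)]
w_qkv = [[rng.uniform(-1, 1) for _ in range(3 * dim_model)] for _ in range(dim_model)]

qkv = matmul(x, w_qkv)                             # [seq_len x 3*dim_model]
q = [row[:dim_model] for row in qkv]               # left third
k = [row[dim_model:2 * dim_model] for row in qkv]  # middle third
v = [row[2 * dim_model:] for row in qkv]           # right third

# Head h of Q is a contiguous column slice of the Q section.
q_heads = [[row[h * dim_head:(h + 1) * dim_head] for row in q]
           for h in range(n_heads)]
print(len(qkv[0]), len(q[0]), len(q_heads[0][0]))  # 12 4 2
```

Slicing after one multiply gives the same numbers as three separate multiplies with the three column blocks of `w_qkv`; fusing them is purely an efficiency choice.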

Multi-Head Attention

Input: Q, K, V each shaped [seq_len × dim_model], split into n_heads column slices of width dim_head. Each head works on its own slice independently.

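For each head: compute scores = Q_h @ K_h^T / √dim_head → shape [seq_len × seq_len]. In a decoder-only model, a causal mask then sets the scores for future positions to −∞, so each token attends only to itself and earlier tokens. Softmax across each row gives attention weights: how much each position should pull from the positions it can see. Multiply by V_h to get the head's output: [seq_len × dim_head].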
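
A single head can be sketched in plain Python as follows. The function names are illustrative, and the `causal` flag is an addition for decoder-only models: it hides future positions by setting their scores to −∞ before the softmax. The Q, K, V values are toy numbers.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_head(q, k, v, causal=True):
    """Scaled dot-product attention for one head.

    q, k, v: lists of seq_len rows, each of width dim_head.
    Returns (output rows, attention weight rows).
    """
    dim_head = len(q[0])
    out, weights = [], []
    for i, q_i in enumerate(q):
        scores = [sum(a * b for a, b in zip(q_i, k_j)) / math.sqrt(dim_head)
                  for k_j in k]
        if causal:
            # Mask future positions so token i only sees tokens 0..i.
            scores = [s if j <= i else float("-inf") for j, s in enumerate(scores)]
        w = softmax(scores)
        weights.append(w)
        out.append([sum(w_j * v_j[d] for w_j, v_j in zip(w, v))
                    for d in range(len(v[0]))])
    return out, weights

q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
out, weights = attention_head(q, k, v)
print(weights[0])  # first token can only attend to itself
print(out[0])
```

Each row of `weights` sums to 1, and with the causal mask the first row puts all its weight on position 0, so the first output row is simply the first row of V.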

Why does this work? Think about the word "bank". In "river bank" it means shoreline; in "bank account" it means finance. Q encodes "what am I looking for?", K encodes "what do I offer?". The dot product Q·K measures the match. The model learns that when "bank" is next to "river", those vectors align strongly — so attention flows that way and the right meaning gets reinforced. Different heads specialize in different relationship types: one might track syntax, another coreference, another semantic similarity.

dim_model is split across n_heads heads, each of width dim_head = dim_model / n_heads. Each head operates in a narrower subspace, which encourages the heads to specialize rather than all doing the same thing.

Concat & Output Projection

Each head produced [seq_len × dim_head]. Concatenating all n_heads side by side gives back [seq_len × dim_model] — full width again.

Then multiply by W_out [dim_model × dim_model]. This is the mixing step: without it, each head's output would stay isolated in its own column slice. W_out lets what head 0 found influence the final representation alongside what head 1 found. It blends the specialized views back into a unified signal.

Output shape: [seq_len × dim_model] — same as what came in. That's required for the residual connection: the attention block's output gets added back to its input, so the shapes must match.
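
The concat-then-mix step can be sketched like this. The head outputs and `w_out` are invented numbers; `w_out` is deliberately chosen to average the two heads so the mixing is visible, and the residual add is left out to keep the example short.

```python
def matmul(a, b):
    """Plain-Python matrix multiply: [n x k] @ [k x m] -> [n x m]."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def concat_heads(head_outputs):
    """[n_heads][seq_len][dim_head] -> [seq_len][n_heads * dim_head]."""
    seq_len = len(head_outputs[0])
    return [[value for head in head_outputs for value in head[t]]
            for t in range(seq_len)]

head0 = [[1.0, 2.0], [3.0, 4.0]]  # made-up output of head 0
head1 = [[5.0, 6.0], [7.0, 8.0]]  # made-up output of head 1
concat = concat_heads([head0, head1])

# Hypothetical W_out that averages matching columns of the two heads.
w_out = [[0.5, 0.0, 0.5, 0.0],
         [0.0, 0.5, 0.0, 0.5],
         [0.5, 0.0, 0.5, 0.0],
         [0.0, 0.5, 0.0, 0.5]]
projected = matmul(concat, w_out)
print(concat[0])     # [1.0, 2.0, 5.0, 6.0]
print(projected[0])  # [3.0, 4.0, 3.0, 4.0]
```

After the projection, every output column depends on both heads, which is the blending described above. With an identity matrix in place of `w_out`, each head's values would stay in their own columns.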

Feed Forward

Input: [seq_len × dim_model] from the attention block. Unlike attention, this step processes each token's vector independently — no information flows between positions here. That's attention's job.

The structure is expand then contract: W1 projects from dim_model → 4 × dim_model (the hidden layer), then a nonlinearity (GELU/ReLU), then W2 projects back down to dim_model. Why 4×? The wider hidden layer gives the model breathing room to represent complex intermediate patterns — like working memory before committing to an answer. Without the nonlinearity between W1 and W2, the two multiplies would collapse into one, and you'd have gained nothing.

Output: [seq_len × dim_model] — exactly the same shape as input. The residual connection adds the input back to the output, so the network only needs to learn what to change, not the full representation from scratch.
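
A minimal sketch of the expand, nonlinearity, contract pattern with the residual add. Several choices here are assumptions: biases are omitted, GELU uses the common tanh approximation, `dim_model` is 4 (so the hidden layer is 16), and the weights are random rather than trained.

```python
import math
import random

def gelu(x):
    """GELU, tanh approximation."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

def linear(vec, w):
    """vec [d_in] times w [d_in x d_out] -> [d_out] (no bias)."""
    return [sum(v * w_row[j] for v, w_row in zip(vec, w)) for j in range(len(w[0]))]

def feed_forward(x, w1, w2):
    """Expand -> GELU -> contract for each position independently, plus residual."""
    out = []
    for vec in x:
        hidden = [gelu(h) for h in linear(vec, w1)]      # dim_model -> 4*dim_model
        delta = linear(hidden, w2)                       # 4*dim_model -> dim_model
        out.append([a + b for a, b in zip(vec, delta)])  # residual connection
    return out

rng = random.Random(0)
dim_model, dim_hidden = 4, 16
w1 = [[rng.gauss(0, 0.5) for _ in range(dim_hidden)] for _ in range(dim_model)]
w2 = [[rng.gauss(0, 0.5) for _ in range(dim_model)] for _ in range(dim_hidden)]
x = [[rng.uniform(-1, 1) for _ in range(dim_model)] for _ in range(3)]

y = feed_forward(x, w1, w2)
print(len(y), "x", len(y[0]))
```

Because the loop handles one position at a time, running `feed_forward` on a single row gives exactly the same result as that row's output in the full sequence.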

Training

Training is next-token prediction — repeated billions of times. The process for each step:

Take a chunk of text, tokenize it into a sequence of ids

Feed token ids 1..N as input, predict ids 2..N+1 (one step ahead at every position simultaneously)

At each position, run the full forward pass to get logits over the vocabulary, then softmax → probability distribution

Loss = cross-entropy between the predicted distribution and the actual next token. This measures how surprised the model was by the correct answer.

Backpropagate gradients through all weights; the Adam optimizer updates each parameter

What does the loss number mean? Cross-entropy on a random model over 768 tokens would be about ln(768) ≈ 6.6. A well-trained model on familiar text gets below 2. When to stop: training ends when the validation loss (measured on held-out text) plateaus or starts rising — that's overfitting. In practice, large models are often trained until the compute budget runs out, since they rarely fully overfit.
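
The ln(768) baseline is easy to verify. The sketch below computes cross-entropy for one position from raw logits and also shows the one-step shift between inputs and targets; the token ids and logits are made up, and the function name is illustrative.

```python
import math

def cross_entropy(logits, target):
    """-log softmax(logits)[target], computed stably via log-sum-exp."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum - logits[target]

# Inputs and targets are the same sequence, shifted by one position.
ids = [10, 20, 30, 40]
inputs, targets = ids[:-1], ids[1:]

vocab_size = 768
uniform_logits = [0.0] * vocab_size  # a model with no preference at all
baseline = cross_entropy(uniform_logits, target=targets[0])
print(round(baseline, 2))  # 6.64, i.e. ln(768)

confident_logits = list(uniform_logits)
confident_logits[targets[0]] = 10.0  # strongly favors the correct token
print(cross_entropy(confident_logits, target=targets[0]) < baseline)  # True
```

Averaging this per-position loss over every position in a batch gives the single loss number that training tracks.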

Colors below: bright orange = large positive weight, bright blue = large negative, dark = near zero. Q and K grow large because they interact multiplicatively (Q·Kᵀ/√dₖ) and carry the attention signal. Embedding rows for tokens that appear in training accumulate large values; rows for unseen tokens stay dark.

The training data is the Shakespeare dataset, and the model runs entirely on WebGPU with no external dependencies. You can download it and run it yourself.

Inference

Inference runs one token at a time. Here's what happens on each step:

Tokenize the prompt, feed all tokens through the model to warm up the KV cache

Get logits at the last position → softmax → probability over the vocabulary

Pick the top token (greedy decoding), or sample using temperature

Append that token; feed just the new token through on the next step — reusing the cached K and V

Repeat until end-of-sequence token or max length

The KV cache is why inference stays fast. Without it, generating token N would recompute K and V for all N−1 previous tokens, so the projection work summed over a generation grows as O(N²). With the cache, each new token computes only its own K and V, then attends to the already-stored entries: O(N) projections in total instead of O(N²). By the 1000th token, that is one K/V projection per step instead of 1000. Attention itself still reads every cached position, so the attention cost of each step grows with the sequence length. The score rows in the visualization show attention weights over all cached positions; dark slots are future positions that don't exist yet.
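
To put numbers on the saving, the sketch below simply counts per-token K/V projections over a whole generation. It is a counting model only, under the assumption that one "projection" means computing K and V for one token in one layer; it does not measure attention cost or real runtime.

```python
def kv_projections(num_tokens, use_cache):
    """Count per-token K/V projections computed over a whole generation."""
    total = 0
    for step in range(1, num_tokens + 1):
        # Without a cache, step N re-projects all N tokens; with one, only the newest.
        total += 1 if use_cache else step
    return total

with_cache = kv_projections(1000, use_cache=True)
without_cache = kv_projections(1000, use_cache=False)
print(with_cache)     # 1000
print(without_cache)  # 500500
```

The last step alone does 1000 times less projection work with the cache, and summed over all 1000 steps the saving is about 500 times.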

Temperature controls how random the sampling is. The model outputs raw logits (unnormalized scores). Divide by temperature T before softmax: T=1.0 → original distribution; T<1 (e.g. 0.5) → more peaked, model sticks to high-probability tokens (more predictable); T>1 (e.g. 1.5) → flatter, model picks lower-probability tokens more often (more creative, may be incoherent). T→0 is greedy — always the highest probability token. Colors: orange = positive activation, blue = negative, dark = near zero.
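
Temperature scaling is a one-line change to softmax. The sketch below uses made-up logits for a 4-token vocabulary; the function names are illustrative, and treating `temperature == 0` as greedy decoding is a convention assumed here to avoid dividing by zero.

```python
import math
import random

def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by the temperature, then apply a stable softmax."""
    scaled = [v / temperature for v in logits]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next(logits, temperature=1.0, rng=random):
    """Greedy when temperature == 0, otherwise sample from the scaled softmax."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for a 4-token vocabulary
for t in (0.5, 1.0, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])

print(sample_next(logits, temperature=0))  # always index 0, the top logit
```

The top token's probability shrinks as the temperature rises, so sampling at T = 1.5 picks the lower-ranked tokens more often than at T = 0.5.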

A Wheel Maker