Comparing LLM Token Distributions: An Interactive Zipf–KL Explorer
An interactive visualisation of KL divergence between two Zipf-shaped token distributions — the kind a language model produces. Set each distribution's entropy, shuffle the rank ordering by a chosen Spearman correlation, and watch forward and reverse KL respond in real time.
This is a follow-up to the interactive KL divergence explorer, recast for the distribution an LLM (Large Language Model) produces: a probability over a vocabulary of tens of thousands of tokens. Drag the sliders and watch how KL between two such distributions behaves.
At every step, a language model emits a vector of logits — one real number per vocabulary token — and a softmax turns them into a probability distribution. The log of those probabilities is what APIs hand back as logprobs. So a single forward pass gives you a full distribution over the vocabulary, not just the sampled token.
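As a minimal sketch of that pipeline, the snippet below turns a handful of made-up logits into logprobs and probabilities. The four-token vocabulary and the function name `log_softmax` are illustrative only; a real model does the same thing over its full vocabulary.

```python
import math

def log_softmax(logits):
    """Return log-probabilities for a list of logits (numerically stable)."""
    m = max(logits)  # subtract the max so exp() cannot overflow
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

logits = [2.0, 1.0, 0.1, -1.0]   # hypothetical 4-token vocabulary
logprobs = log_softmax(logits)   # what an API would report as logprobs
probs = [math.exp(lp) for lp in logprobs]

print(logprobs)
print(probs, sum(probs))
```

Exponentiating the logprobs recovers the full distribution, which is what the comparisons below operate on.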
You often want to compare two such distributions, $P$ and $Q$, for example:
- the same prompt through two different models, or a model and its quantised / distilled / fine-tuned variant
- a policy against the model it was initialised from during RLHF (the KL penalty that stops it drifting)
- a small draft model against the target model in speculative decoding
- the same model at two temperatures
KL divergence is a way to score “how far has $Q$ drifted from $P$?”. A language model’s output probabilities can be approximated by a Zipf distribution.
Why Zipf?
Count how often each word appears in a large text corpus, sort by frequency, and the $r$-th most common word occurs with frequency roughly proportional to $1/r$. That is Zipf’s law, and it is commonly used when analysing token frequencies across a whole corpus. The general form assigns the token at rank $r$ a probability
$$p(r) = \frac{r^{-s}}{H_{N,s}}, \qquad H_{N,s} = \sum_{k=1}^{N} k^{-s},$$
where $N$ is the vocabulary size, $s$ is the exponent (classic word-frequency Zipf has $s \approx 1$) and $H_{N,s}$ is the normaliser.
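A Zipf distribution of this form takes only a few lines to build. The sketch below is one straightforward way to do it; the function name `zipf_pmf` and the argument names `n` (vocabulary size) and `s` (exponent) are this example's own, not taken from the explorer.

```python
def zipf_pmf(n, s):
    """Probabilities proportional to rank**(-s) for ranks 1..n."""
    weights = [r ** -s for r in range(1, n + 1)]
    total = sum(weights)  # the normaliser
    return [w / total for w in weights]

p = zipf_pmf(50_000, 1.0)    # classic exponent, 50k-token vocabulary
print(p[0], p[1], p[-1])     # head is far heavier than the tail
```

With exponent 1 the top token is exactly twice as likely as the second, three times as likely as the third, and so on.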
Zipf is a decent approximation to an LLM’s output probabilities, but the next-token distribution at a single position is only Zipf-like (Holtzman et al., 2020).
In the visualisation both $P$ and $Q$ are Zipf distributions over the same tokens.
Entropy sets the Zipf shape
Rather than expose the exponent $s$ directly, each distribution is controlled by its entropy:
$$H(P) = -\sum_{i=1}^{N} P(i) \log P(i),$$
and the explorer solves for the $s$ that achieves it. Entropy is monotonic in $s$: at $s = 0$ the distribution is uniform with the maximum $H = \log N$; as $s \to \infty$ it collapses onto the top token and $H \to 0$.
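Because entropy falls monotonically as the exponent grows, the exponent for a target entropy can be found by bisection. The sketch below is one way to do that, not necessarily how the explorer does it; the function names, the search bracket of 0 to 20, and the choice of nats as the unit are all assumptions of this example.

```python
import math

def zipf_pmf(n, s):
    weights = [r ** -s for r in range(1, n + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def entropy(p):
    """Shannon entropy in nats (assumed unit)."""
    return -sum(x * math.log(x) for x in p if x > 0)

def exponent_for_entropy(n, target, lo=0.0, hi=20.0, iters=60):
    """Bisect for the exponent whose Zipf entropy equals `target`."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if entropy(zipf_pmf(n, mid)) > target:
            lo = mid   # still too flat: the exponent must be larger
        else:
            hi = mid   # too peaked: the exponent must be smaller
    return (lo + hi) / 2

n = 1000
s = exponent_for_entropy(n, target=4.0)
print(s, entropy(zipf_pmf(n, s)))
```

The target must lie between 0 and `math.log(n)`, the entropy of the uniform distribution; outside that range no exponent exists and the search simply runs to an end of the bracket.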
The “=” button between the entropy sliders snaps $Q$’s entropy onto $P$’s.
Rank correlation: when two models disagree on the ordering
Two real models rarely rank the vocabulary identically: they mostly agree on the obvious top tokens and diverge in the tail. The Rank corr. ρ control captures this: $Q$’s tokens are reordered relative to $P$’s until their Spearman rank correlation reaches the value you set. For two rankings of $N$ tokens without ties, with $d_i$ the difference between token $i$’s two ranks, it is defined as:
$$\rho = 1 - \frac{6 \sum_{i=1}^{N} d_i^2}{N(N^2 - 1)}$$
- $\rho = 1$: identical ordering. Any KL comes purely from the entropy difference.
- $\rho = 0$: independent orderings. KL is now dominated by disagreement about which token goes where.
The reshuffling jitters each token’s log-rank, so the widely spaced head stays put while the densely packed tail scrambles first, mirroring how models agree on the likely tokens and disagree on the rest. The right pane plots $Q$’s probabilities in $P$’s rank order (with $P$ overlaid as a line): at $\rho = 1$ the bars trace $P$’s line; lower $\rho$ scatters them, and the per-rank plot lights up. (Note that with few elements a single permutation can’t be perfectly decorrelated, so the lowest reachable $\rho$ is not exactly 0; the readout reports the value actually reached.)
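The log-rank jitter can be sketched as follows: add Gaussian noise to the logarithm of each rank, re-sort, and measure the Spearman correlation of the result. The noise model, the `sigma` parameter, and the function names are assumptions made for illustration; the text above does not say which noise distribution the explorer uses or how it searches for the noise level that hits a requested correlation.

```python
import math
import random

def spearman(rank_a, rank_b):
    """Spearman correlation of two tie-free rankings of the same items."""
    n = len(rank_a)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d2 / (n * (n * n - 1))

def jittered_ranks(n, sigma, rng):
    """New rank of each item after adding Gaussian noise to log(rank)."""
    noisy = [math.log(r) + rng.gauss(0.0, sigma) for r in range(1, n + 1)]
    order = sorted(range(n), key=lambda i: noisy[i])  # item indices, best first
    new_rank = [0] * n
    for position, item in enumerate(order, start=1):
        new_rank[item] = position
    return new_rank

rng = random.Random(0)
n = 1000
base = list(range(1, n + 1))
rho_same = spearman(base, jittered_ranks(n, 0.0, rng))  # no noise
rho_mild = spearman(base, jittered_ranks(n, 0.3, rng))  # tail shuffles a little
rho_wild = spearman(base, jittered_ranks(n, 5.0, rng))  # ordering mostly lost
print(rho_same, rho_mild, rho_wild)
```

Neighbouring ranks near the head are far apart in log space (rank 1 and rank 2 differ by about 0.69) while neighbours in the tail are almost on top of each other, which is why small noise disturbs the tail first. To land on a specific correlation one could bisect on `sigma`, since more noise gives a lower correlation on average.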
How KL is calculated
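For discrete distributions the KL divergence is a sum over the vocabulary:
$$D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{i=1}^{N} P(i) \log \frac{P(i)}{Q(i)}$$
Each token contributes $P(i) \log \frac{P(i)}{Q(i)}$; the bottom plot draws exactly these per-rank contributions.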
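In code the sum and its per-token terms look like this. It is a small sketch using natural logarithms (an assumed unit) and a three-token toy example; the function names are hypothetical.

```python
import math

def kl_terms(p, q):
    """Per-token contributions to KL(p || q), in nats."""
    return [a * math.log(a / b) if a > 0 else 0.0 for a, b in zip(p, q)]

def kl(p, q):
    return sum(kl_terms(p, q))

p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
terms = kl_terms(p, q)
print(terms)      # one entry per token
print(kl(p, q))   # their sum
```

Individual terms can be negative (here the second token, where `q` assigns more mass than `p`), but the total is never negative and is zero only when the two distributions match.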
Two properties carry over from the continuous case:
- It’s weighted by $P(i)$. Disagreements where $P(i)$ is large dominate; disagreements out in the tail barely register. For token distributions this means KL is governed by the head: the handful of tokens the model actually considers.
- It’s asymmetric. $D_{\mathrm{KL}}(P \,\|\, Q) \neq D_{\mathrm{KL}}(Q \,\|\, P)$ in general; the explorer shows both. Unlike the continuous explorer, KL here is always finite: both Zipf distributions have full support over the vocabulary, so no $Q(i)$ is ever zero.
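Both properties show up in a quick numerical check. The sketch below compares a sharp and a flat Zipf distribution with identical rank order; the exponents 2.0 and 0.5, the vocabulary size, and the "top 10 ranks" cut-off are arbitrary choices for illustration.

```python
import math

def zipf_pmf(n, s):
    weights = [r ** -s for r in range(1, n + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def kl_terms(p, q):
    """Per-token contributions to KL(p || q), in nats."""
    return [a * math.log(a / b) for a, b in zip(p, q)]

n = 10_000
sharp = zipf_pmf(n, 2.0)
flat = zipf_pmf(n, 0.5)

forward_terms = kl_terms(sharp, flat)
forward = sum(forward_terms)            # KL(sharp || flat)
reverse = sum(kl_terms(flat, sharp))    # KL(flat || sharp)

# Share of the absolute per-rank contributions that sits in the top 10 ranks.
head_share = (sum(abs(t) for t in forward_terms[:10])
              / sum(abs(t) for t in forward_terms))

print(forward, reverse, head_share)
```

The two directions give clearly different numbers, both finite, and nearly all of the forward KL comes from the first few ranks because that is where the sharp distribution puts its weight.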
Things worth playing with
- Same shape, different order. Hit Reordered only (or press = then drop $\rho$). Both distributions have identical entropy, so every bit of KL is the models disagreeing about token ranks.
- Same order, different sharpness. Temperature keeps $\rho = 1$ and separates the entropies: KL with no reordering at all. This is the cost of running a model hotter or colder than its reference.
- Watch the head dominate. Increase $N$ and the per-rank plot stays concentrated near rank 1: KL is set by a few high-probability tokens, and the long tail contributes almost nothing.
- Forward vs reverse. Make $Q$ much flatter than $P$ and compare the two KL boxes: the asymmetry is the same mode-seeking / mode-covering story from the KL explorer.
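The "same order, different sharpness" case has a tidy closed form for Zipf distributions: dividing the log-probabilities by a temperature and renormalising divides the exponent by that temperature, so the ranking is untouched and only the entropy moves. The sketch below checks this numerically. The helper names are hypothetical, and whether the explorer's Temperature preset is implemented this way is an assumption.

```python
import math

def zipf_pmf(n, s):
    weights = [r ** -s for r in range(1, n + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def apply_temperature(p, t):
    """Divide log-probabilities by t, then renormalise (softmax)."""
    weights = [math.exp(math.log(x) / t) for x in p]
    total = sum(weights)
    return [w / total for w in weights]

n = 1000
base = zipf_pmf(n, 1.0)
hot = apply_temperature(base, 2.0)   # temperature 2: flatter
expected = zipf_pmf(n, 0.5)          # Zipf with exponent 1.0 / 2.0

max_gap = max(abs(a - b) for a, b in zip(hot, expected))
print(max_gap)
```

The gap is at floating-point noise level, so running a Zipf-shaped model at temperature 2 is the same as halving its exponent.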
Why it matters
When you fine-tune with an RLHF objective, the KL-to-reference penalty is precisely this divergence, summed over the vocabulary at each step, keeping the policy’s token distribution close to where it started (Bai et al., 2022). Knowledge distillation trains the student by minimising a KL divergence to the teacher (Gu et al., 2024). Speculative decoding keeps outputs exact through a rejection step whose acceptance probability is one minus the total-variation distance between draft and target (Leviathan et al., 2023; Chen et al., 2023). And when you quantise a model, KL divergence from the full-precision original tracks how much the output distribution actually moved (Dutta et al., 2024).
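The speculative-decoding claim is easy to verify on a toy example. A token drawn from the draft distribution is accepted with probability min(1, target / draft), so the expected acceptance rate is the sum of the element-wise minimum of the two distributions, which equals one minus their total-variation distance. The three-token distributions and function names below are made up for illustration.

```python
def total_variation(p, q):
    """Total-variation distance: half the L1 distance."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def acceptance_probability(draft, target):
    """Expected chance that a token sampled from `draft` is accepted."""
    return sum(min(d, t) for d, t in zip(draft, target))

draft = [0.6, 0.3, 0.1]
target = [0.5, 0.25, 0.25]

print(acceptance_probability(draft, target))   # 0.5 + 0.25 + 0.1
print(1 - total_variation(draft, target))      # same value
```

The closer the draft model's distribution is to the target's, the more drafted tokens survive the rejection step.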