Comparing LLM Token Distributions: An Interactive Zipf–KL Explorer

An interactive visualisation of KL divergence between two Zipf-shaped token distributions — the kind a language model produces. Set each distribution's entropy, shuffle the rank ordering by a chosen Spearman correlation, and watch forward and reverse KL respond in real time.

This is a follow-up to the interactive KL divergence explorer, recast for the distribution a large language model (LLM) produces: a probability distribution over a vocabulary of tens of thousands of tokens. Drag the sliders and watch how KL between two such distributions behaves.

At every step, a language model emits a vector of logits (one real number per vocabulary token), and a softmax turns them into a probability distribution. The log of those probabilities is what APIs hand back as logprobs. So a single forward pass gives you a full distribution $P$ over the vocabulary, not just the sampled token.
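The logits-to-logprobs step can be sketched in a few lines of plain Python. This is a minimal illustration, assuming a hypothetical four-token vocabulary and made-up logit values; the function name `log_softmax` is chosen here for illustration and is not taken from any particular API.

```python
import math

def log_softmax(logits):
    """Convert raw logits into natural-log probabilities (logprobs)."""
    m = max(logits)  # subtract the max so exp() cannot overflow
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

# Hypothetical logits for a 4-token vocabulary (illustrative values only).
logits = [2.0, 1.0, 0.1, -1.0]
logprobs = log_softmax(logits)
probs = [math.exp(lp) for lp in logprobs]

print("logprobs:", logprobs)
print("probs sum to:", sum(probs))
```

Exponentiating the logprobs recovers the full distribution, which is what every comparison in this post operates on.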

You often want to compare two such distributions, $P$ and $Q$, for example:

  • the same prompt through two different models, or a model and its quantised / distilled / fine-tuned variant
  • a policy against the model it was initialised from during RLHF (the KL penalty that stops it drifting)
  • a small draft model against the target model in speculative decoding
  • the same model at two temperatures

KL divergence is a way to score “how far has $Q$ drifted from $P$?”. A language model’s output probabilities can be approximated by a Zipf distribution.

Why Zipf?

Count how often each word appears in a large text corpus, sort by frequency, and the $i$-th most common word occurs with frequency roughly proportional to $1/i$. That is Zipf’s law, and it is commonly used when analysing token frequencies across a whole corpus. The general form assigns the token at rank $i$ a probability

$$p_i = \frac{i^{-s}}{Z}, \qquad Z = \sum_{k=1}^{N} k^{-s}$$

where $N$ is the vocabulary size, $s$ is the exponent (classic word-frequency Zipf has $s \approx 1$) and $Z$ is the normaliser.
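The formula translates directly into code. The sketch below is a plain-Python illustration; the function name `zipf` and the example values of $N$ and $s$ are assumptions made for this example.

```python
def zipf(n, s):
    """Zipf probabilities p_i = i^(-s) / Z for ranks i = 1..n."""
    weights = [i ** -s for i in range(1, n + 1)]
    z = sum(weights)  # the normaliser Z
    return [w / z for w in weights]

p = zipf(5, 1.0)        # tiny vocabulary, classic exponent
uniform = zipf(5, 0.0)  # s = 0 gives the uniform distribution

print(p)
print(uniform)
```

With $s = 1$ the rank-1 token is twice as likely as the rank-2 token, three times as likely as the rank-3 token, and so on.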

Zipf is a decent approximation to an LLM’s output probabilities, though only an approximation: the next-token distribution at a single position is Zipf-like rather than exactly Zipf (Holtzman et al., 2020).

In the visualisation both $P$ and $Q$ are Zipf distributions over the same $N$ tokens.

Entropy sets the Zipf shape

Rather than expose the exponent $s$ directly, each distribution is controlled by its entropy:

$$H(P) = -\sum_i p_i \ln p_i$$

and the explorer solves for the $s$ that achieves it. Entropy decreases monotonically as $s$ grows, so each target entropy corresponds to exactly one $s$: at $s = 0$ the distribution is uniform with the maximum $H = \ln N$; as $s \to \infty$ it collapses onto the top token and $H \to 0$.
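Because entropy falls monotonically as $s$ grows, a simple bisection is enough to recover the exponent from a target entropy. The sketch below is one way to do it and not necessarily how the explorer does it; `solve_exponent`, its search bracket of $[0, 50]$ and the iteration count are assumptions made for this example.

```python
import math

def zipf(n, s):
    """Zipf probabilities p_i = i^(-s) / Z for ranks i = 1..n."""
    weights = [i ** -s for i in range(1, n + 1)]
    z = sum(weights)
    return [w / z for w in weights]

def entropy(p):
    """Shannon entropy in nats."""
    return -sum(x * math.log(x) for x in p if x > 0)

def solve_exponent(n, target_h, lo=0.0, hi=50.0, iters=60):
    """Bisect on s: entropy decreases in s, so halve [lo, hi] each step."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if entropy(zipf(n, mid)) > target_h:
            lo = mid  # distribution still too flat: s must be larger
        else:
            hi = mid  # distribution too peaked: s must be smaller
    return (lo + hi) / 2

n = 1000
target = 3.0  # nats; must lie between 0 and ln(n)
s = solve_exponent(n, target)
achieved = entropy(zipf(n, s))
print("s =", s, "entropy =", achieved)
```

The target has to lie between $0$ and $\ln N$; outside that range no exponent exists and the bisection simply converges to an end of the bracket.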

The “=” button between the entropy sliders snaps $Q$’s entropy onto $P$’s.

Rank correlation: when two models disagree on the ordering

Two real models rarely rank the vocabulary identically: they mostly agree on the obvious top tokens and diverge in the tail. The Rank corr. $\rho$ control captures this: $Q$’s tokens are reordered relative to $P$’s until their Spearman rank correlation reaches the value you set. For two rankings it is defined as:

$$\rho = 1 - \frac{6 \sum_i d_i^2}{N(N^2 - 1)}, \qquad d_i = \operatorname{rank}_P(i) - \operatorname{rank}_Q(i)$$

  • $\rho = 1$: identical ordering. Any KL comes purely from the entropy difference.
  • $\rho \approx 0$: independent orderings. KL is now dominated by disagreement about which token goes where.

The reshuffling jitters each token’s log-rank, so the widely-spaced head stays put while the densely-packed tail scrambles first, mirroring how models agree on the likely tokens and disagree on the rest. The right pane plots $Q$’s probabilities in $P$’s rank order (with $P$ overlaid as a line): at $\rho = 1$ the bars trace $P$’s line; lower $\rho$ scatters them, and the per-rank plot lights up. (Note that with few elements a single permutation can’t be perfectly decorrelated: at $N = 40$ the lowest reachable $\rho$ is around $0.3$, which the readout reports honestly.)
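One way to read the log-rank jitter is: add noise to $\ln(\text{rank})$ and re-sort. The sketch below follows that reading under stated assumptions: the noise is Gaussian, its width `sigma` is a free parameter, and the function names are hypothetical. The explorer's actual procedure, including how it tunes the noise to hit a target $\rho$, may differ.

```python
import math
import random

def spearman(rank_p, rank_q):
    """Spearman rho for two tie-free rankings of the same items."""
    n = len(rank_p)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_p, rank_q))
    return 1 - 6 * d2 / (n * (n * n - 1))

def jittered_ranks(n, sigma, rng):
    """Q's rank for each token, with tokens indexed by P's rank 1..n.

    Assumed scheme: perturb ln(rank) with Gaussian noise, then re-sort.
    """
    noisy = [math.log(i) + rng.gauss(0.0, sigma) for i in range(1, n + 1)]
    order = sorted(range(n), key=lambda t: noisy[t])  # token indices, best first
    rank_q = [0] * n
    for r, t in enumerate(order, start=1):
        rank_q[t] = r
    return rank_q

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 1000
rank_p = list(range(1, n + 1))
rhos = {}
for sigma in (0.0, 0.5, 2.0):
    rhos[sigma] = spearman(rank_p, jittered_ranks(n, sigma, rng))
    print("sigma =", sigma, "rho =", rhos[sigma])
```

Because neighbouring ranks in the head are far apart in log space and neighbouring ranks in the tail are very close, a given noise level swaps tail tokens long before it disturbs the top few.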

How KL is calculated

For discrete distributions the KL divergence is a sum over the vocabulary:

$$D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{i} p_i \, \ln \frac{p_i}{q_i}$$

Each token contributes $p_i \ln(p_i/q_i)$; the bottom plot draws exactly these per-rank contributions.
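The per-rank contributions and their sum can be computed directly. The sketch below is a plain-Python illustration; the helper names and the two example exponents are assumptions, and both distributions share the same rank order (the $\rho = 1$ case).

```python
import math

def zipf(n, s):
    """Zipf probabilities p_i = i^(-s) / Z for ranks i = 1..n."""
    weights = [i ** -s for i in range(1, n + 1)]
    z = sum(weights)
    return [w / z for w in weights]

def kl_terms(p, q):
    """Per-token contributions p_i * ln(p_i / q_i), in nats."""
    return [pi * math.log(pi / qi) if pi > 0 else 0.0 for pi, qi in zip(p, q)]

def kl(p, q):
    """D_KL(P || Q) as the sum of the per-token contributions."""
    return sum(kl_terms(p, q))

p = zipf(1000, 1.0)
q = zipf(1000, 1.3)  # same ordering, sharper shape

terms = kl_terms(p, q)
total = kl(p, q)
print("D_KL(P || Q) =", total)
print("first five per-rank terms:", terms[:5])
```

Individual terms can be negative (wherever $q_i > p_i$), but the total is never negative.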

Two properties carry over from the continuous case:

  • It’s weighted by $P$. Disagreements where $P$ is large dominate; disagreements out in the tail barely register. For token distributions this means KL is governed by the head: the handful of tokens the model actually considers.
  • It’s asymmetric. $D_{\mathrm{KL}}(P \,\|\, Q) \ne D_{\mathrm{KL}}(Q \,\|\, P)$ in general; the explorer shows both. Unlike the continuous explorer, KL here is always finite: both Zipf distributions have full support over the vocabulary, so no $q_i$ is ever zero.

Things worth playing with

  • Same shape, different order. Hit Reordered only (or press = then drop $\rho$). Both distributions have identical entropy, so every bit of KL is the models disagreeing about token ranks.
  • Same order, different sharpness. Temperature keeps $\rho = 1$ and separates the entropies, giving KL with no reordering at all. This is the cost of running a model hotter or colder than its reference.
  • Watch the head dominate. Increase $N$ and the per-rank plot stays concentrated near rank 1: KL is set by a few high-probability tokens, and the long tail contributes almost nothing.
  • Forward vs reverse. Make $Q$ much flatter than $P$ and compare the two KL boxes: the asymmetry is the same mode-seeking / mode-covering story from the KL explorer.
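The temperature case has a neat closed form for Zipf shapes: raising $p_i \propto i^{-s}$ to the power $1/T$ and renormalising gives another Zipf distribution with exponent $s/T$, in the same rank order. The sketch below checks this numerically and prints forward and reverse KL; the helper names and the values of $N$, $s$ and $T$ are assumptions chosen for illustration.

```python
import math

def zipf(n, s):
    """Zipf probabilities p_i = i^(-s) / Z for ranks i = 1..n."""
    weights = [i ** -s for i in range(1, n + 1)]
    z = sum(weights)
    return [w / z for w in weights]

def apply_temperature(p, t):
    """Rescale as p_i^(1/t) and renormalise (equivalent to softmax of logits / t)."""
    weights = [x ** (1.0 / t) for x in p]
    z = sum(weights)
    return [w / z for w in weights]

def kl(p, q):
    """D_KL(P || Q) in nats; assumes every entry is strictly positive."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

n, s, t = 1000, 2.0, 4.0
p = zipf(n, s)
q = apply_temperature(p, t)  # a hotter, flatter copy of P
same_shape = zipf(n, s / t)  # Zipf with the rescaled exponent

max_gap = max(abs(a - b) for a, b in zip(q, same_shape))
forward, reverse = kl(p, q), kl(q, p)
print("max |q - zipf(s/T)| =", max_gap)
print("D_KL(P || Q) =", forward)
print("D_KL(Q || P) =", reverse)
```

The ordering of the tokens never changes, so all of the divergence here comes from the difference in sharpness.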

Why it matters

When you fine-tune with an RLHF objective, the KL-to-reference penalty is precisely $D_{\mathrm{KL}}(\pi \,\|\, \pi_{\text{ref}})$ summed over the vocabulary at each step, keeping the policy’s token distribution close to where it started (Bai et al., 2022). Knowledge distillation trains the student by minimising a KL divergence to the teacher (Gu et al., 2024). Speculative decoding keeps outputs exact through a rejection step whose acceptance probability is one minus the total-variation distance between draft and target (Leviathan et al., 2023; Chen et al., 2023). And when you quantise a model, KL divergence from the full-precision original tracks how much the output distribution actually moved (Dutta et al., 2024).
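The speculative-decoding acceptance probability mentioned above is easy to compute for a single position. The sketch below is illustrative only: two Zipf shapes stand in for the draft and target distributions, and the function names and exponents are assumptions. It also checks the standard identity that one minus the total-variation distance equals $\sum_i \min(p_i, q_i)$.

```python
def zipf(n, s):
    """Zipf probabilities p_i = i^(-s) / Z for ranks i = 1..n."""
    weights = [i ** -s for i in range(1, n + 1)]
    z = sum(weights)
    return [w / z for w in weights]

def total_variation(p, q):
    """Total-variation distance: half the L1 distance between p and q."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

target = zipf(1000, 1.0)  # stand-in for the target model's distribution
draft = zipf(1000, 1.3)   # stand-in for the draft model's distribution

acceptance = 1.0 - total_variation(draft, target)
overlap = sum(min(a, b) for a, b in zip(draft, target))
print("acceptance probability =", acceptance)
print("sum of min(p, q)       =", overlap)
```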

This post is copyrighted by Josh Levy-kramer.