
Interactive Jensen–Shannon Divergence Visualisation

An interactive visualisation of Jensen–Shannon divergence - the symmetric, always-finite cousin of KL. Shape two distributions and watch JSD, its ceiling of one bit, and the per-point contribution respond in real time.


This is a follow-up to the interactive KL divergence explorer. That one ended on two rough edges of Kullback–Leibler divergence: it’s asymmetric, and it blows up to $\infty$ the moment one distribution puts mass where the other has none. Jensen–Shannon divergence is the standard fix for both. Drag the sliders and watch how it behaves.

For two probability distributions $P$ and $Q$, let $M = \tfrac12(P + Q)$ be their mixture - the distribution you get by flipping a fair coin and sampling from $P$ or $Q$ accordingly. The Jensen–Shannon divergence is

$$\mathrm{JSD}(P, Q) = \tfrac12 D_{\mathrm{KL}}(P \,\|\, M) + \tfrac12 D_{\mathrm{KL}}(Q \,\|\, M)$$

It’s the average of two KL divergences, but instead of measuring $P$ against $Q$ directly, it measures each of them against the thing halfway between. That gives you three properties KL doesn’t have (made concrete in the sketch after this list):

  • Symmetric. Swapping $P$ and $Q$ leaves $M$ — and therefore $\mathrm{JSD}$ — unchanged.
  • Always finite, and bounded. $M$ assigns zero probability only where both $P$ and $Q$ do, so neither KL term ever hits a zero denominator inside the log. In fact $0 \le \mathrm{JSD}(P, Q) \le \log 2$ — that’s $\ln 2 \approx 0.693$ nats, or exactly one bit — and the ceiling is reached exactly when the supports are disjoint.
  • $\boldsymbol{\sqrt{\mathrm{JSD}}}$ is a true distance. More on that below.
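
To make those claims concrete, here is a minimal NumPy sketch of the discrete JSD (in nats). The names `kl` and `jsd` are mine, not the demo’s, and the distributions are toy examples:

```python
import numpy as np

def kl(p, q):
    """Discrete KL divergence in nats; 0 * log(0/q) is taken as 0."""
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

def jsd(p, q):
    """Jensen-Shannon divergence via the mixture M = (P + Q) / 2."""
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = np.array([0.7, 0.2, 0.1, 0.0])
q = np.array([0.0, 0.1, 0.2, 0.7])
print(jsd(p, q), jsd(q, p))  # symmetric: same value both ways

disjoint_p = np.array([1.0, 0.0])
disjoint_q = np.array([0.0, 1.0])
print(jsd(disjoint_p, disjoint_q), np.log(2))  # hits the ceiling: ln 2
```

Note that `m` is strictly positive wherever `p` or `q` is, which is exactly why neither KL term can blow up.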

In the visualisation you control two skew-normal distributions, $P$ and $Q$, each with Mean, Std, Skew and Truncate sliders, plus a Discretise slider that bins both into histograms and computes the discrete JSD between the bin masses.

The upper plot draws the two densities. The lower plot draws the integrand $\tfrac12 p \log\frac{p}{m} + \tfrac12 q \log\frac{q}{m}$ — the pointwise contribution to JSD at each $x$. Unlike the KL integrand, this curve is non-negative everywhere and never infinite: the worst a single point can contribute is $\tfrac12 \max(p, q)\log 2$, which happens when one density is zero there and the other isn’t.
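
If you want to reproduce the lower plot numerically, here is a sketch. It assumes `scipy.stats.skewnorm` is a fair stand-in for the demo’s skew-normals (the exact parameterisation and defaults may differ) and approximates the integral with a Riemann sum on a uniform grid:

```python
import numpy as np
from scipy.stats import skewnorm

# Two skew-normal densities on a shared grid (illustrative parameters).
x = np.linspace(-8, 8, 2001)
p = skewnorm.pdf(x, a=4, loc=-1, scale=1.2)
q = skewnorm.pdf(x, a=-2, loc=1, scale=1.0)
m = 0.5 * (p + q)

def half_term(d, m):
    """0.5 * d * log(d / m), with the convention 0 * log(0/m) = 0."""
    out = np.zeros_like(d)
    mask = d > 0
    out[mask] = 0.5 * d[mask] * np.log(d[mask] / m[mask])
    return out

integrand = half_term(p, m) + half_term(q, m)  # the lower plot's curve
jsd = integrand.sum() * (x[1] - x[0])          # Riemann sum, in nats
print(jsd, np.log(2))                          # always within [0, ln 2]
```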

Things worth playing with:

  • Make them disjoint. Pull $P$ and $Q$ apart (or hit the Nearly disjoint preset). $\mathrm{JSD}$ climbs to $\ln 2$ — one bit — and stays there. In the KL explorer the same move sent KL to $\infty$. Bounded versus undefined.
  • Truncate $Q$’s support. Clip $Q$ to zero where $P$ still has mass. $D_{\mathrm{KL}}(P\|Q)$ would be $\infty$; $\mathrm{JSD}$ stays comfortably finite, because $M$ never vanishes there.
  • Coarsen the bins. Turn on Discretise and widen the bins — $\mathrm{JSD}$ drops, because binning destroys information (the data-processing inequality; verified in the sketch below). As bin width $\to 0$ it recovers the continuous value.
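
Here is that binning experiment in code: a sketch assuming the fine-grained distributions live on 64 bins and adjacent bins are merged in groups of $k$:

```python
import numpy as np

def jsd(p, q):
    m = 0.5 * (p + q)
    def kl(a, b):
        mask = a > 0
        return np.sum(a[mask] * np.log(a[mask] / b[mask]))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def coarsen(p, k):
    """Merge every k adjacent bins (len(p) must be divisible by k)."""
    return p.reshape(-1, k).sum(axis=1)

rng = np.random.default_rng(0)
# Two random distributions over 64 fine bins.
p = rng.dirichlet(np.ones(64))
q = rng.dirichlet(np.ones(64))

for k in (1, 2, 4, 8, 16):
    print(k, jsd(coarsen(p, k), coarsen(q, k)))
# JSD never increases as bins merge: the data-processing inequality.
```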

Why it matters

Routing both distributions through their average $M$ is the whole trick: $M$ never assigns zero where $P$ or $Q$ has mass, so $D_{\mathrm{KL}}(P\|M)$ and $D_{\mathrm{KL}}(Q\|M)$ are both finite, and averaging the two makes the result symmetric. JSD is, roughly, “how badly does the average of $P$ and $Q$ describe each of them” — and that quantity has a clean interpretation.

It’s a mutual information. Flip a fair coin $Z \in \{0, 1\}$; on $Z=0$ draw $X \sim P$, on $Z=1$ draw $X \sim Q$. Then

$$\mathrm{JSD}(P, Q) = I(X; Z)$$

— the information a single sample $X$ carries about which distribution it came from. If $P = Q$, a sample tells you nothing about the coin, and $\mathrm{JSD} = 0$. If $P$ and $Q$ have disjoint supports, a sample reveals the coin exactly — one bit — and $\mathrm{JSD} = \log 2$. JSD measures distinguishability from a single draw.
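
This identity is easy to check numerically: for the coin-mixture setup, $I(X;Z) = H(X) - H(X \mid Z) = H(M) - \tfrac12 H(P) - \tfrac12 H(Q)$, which is JSD rewritten in Shannon entropies. A sketch using `scipy.stats.entropy` (which returns entropy with one argument and KL divergence with two):

```python
import numpy as np
from scipy.stats import entropy  # Shannon entropy / KL, in nats

rng = np.random.default_rng(1)
p = rng.dirichlet(np.ones(10))
q = rng.dirichlet(np.ones(10))
m = 0.5 * (p + q)

# JSD as the average of two KLs against the mixture...
jsd_kl = 0.5 * entropy(p, m) + 0.5 * entropy(q, m)
# ...and as the mutual information I(X; Z) = H(M) - (H(P) + H(Q)) / 2.
jsd_mi = entropy(m) - 0.5 * (entropy(p) + entropy(q))

print(jsd_kl, jsd_mi)  # agree to floating-point precision
```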

Its square root is a metric. $\sqrt{\mathrm{JSD}(P, Q)}$ is a genuine distance: it satisfies the triangle inequality, so you can actually do geometry with it — nearest-neighbour search over distributions, clustering documents by their word distributions, embedding distributions in a metric space.
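
Not a proof, but a quick sanity check: sample random triples of distributions and confirm the triangle inequality for $\sqrt{\mathrm{JSD}}$ never fails. The name `js_dist` is introduced here for illustration:

```python
import numpy as np
from scipy.stats import entropy

def js_dist(p, q):
    """Jensen-Shannon distance: the square root of JSD."""
    m = 0.5 * (p + q)
    return np.sqrt(0.5 * entropy(p, m) + 0.5 * entropy(q, m))

rng = np.random.default_rng(2)
for _ in range(10_000):
    p, q, r = rng.dirichlet(np.ones(5), size=3)
    assert js_dist(p, r) <= js_dist(p, q) + js_dist(q, r) + 1e-12
print("triangle inequality held on 10,000 random triples")
```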

It answers a different question than KL. $D_{\mathrm{KL}}(P\|Q)$ is the expected log-likelihood-ratio under $P$: “if the data genuinely comes from $P$, how wrong, on average, is someone who believes $Q$?” It’s inherently directional and anchored to one reference distribution — which is exactly right when there is a ground truth and an approximation: maximum likelihood, cross-entropy loss, and variational inference all minimise a KL of that shape. JSD asks a symmetric question instead — “how different are these two, with neither privileged?” — which is the right one when both sides are just empirical samples you want to compare: two text corpora, two model checkpoints, this month’s traffic against last month’s. And where KL answers “$\infty$ — incomparable” for distributions with different supports, JSD still gives a graded, bounded answer.

That bounded answer has a flip side worth knowing: once $P$ and $Q$ are fully disjoint, $\mathrm{JSD}$ is pinned at $\log 2$ and stops responding — its gradient vanishes — so pushing two disjoint distributions towards each other gives you nothing to descend.

This post is copyrighted by Josh Levy-kramer.