Interactive Jensen–Shannon Divergence Visualisation
An interactive visualisation of Jensen–Shannon divergence - the symmetric, always-finite cousin of KL. Shape two distributions and watch JSD, its ceiling of one bit, and the per-point contribution respond in real time.
This is a follow-up to the interactive KL divergence explorer. That one ended on two rough edges of Kullback–Leibler divergence: it’s asymmetric, and it blows up to $\infty$ the moment one distribution puts mass where the other has none. Jensen–Shannon divergence is the standard fix for both. Drag the sliders and watch how it behaves.
For two probability distributions $P$ and $Q$, let $M = \tfrac{1}{2}(P + Q)$ be their mixture - the distribution you get by flipping a fair coin and sampling from $P$ or $Q$ accordingly. The Jensen–Shannon divergence is

$$\mathrm{JSD}(P \,\|\, Q) \;=\; \tfrac{1}{2}\, D_{\mathrm{KL}}(P \,\|\, M) \;+\; \tfrac{1}{2}\, D_{\mathrm{KL}}(Q \,\|\, M).$$
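A minimal sketch of that formula in NumPy (my own illustrative code, not the visualisation’s; natural logs, so values come out in nats):

```python
import numpy as np

def kl(p: np.ndarray, q: np.ndarray) -> float:
    """D_KL(p || q) in nats; terms where p == 0 contribute nothing."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def jsd(p: np.ndarray, q: np.ndarray) -> float:
    """Jensen-Shannon divergence: average KL of p and q against their mixture."""
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = np.array([0.7, 0.2, 0.1, 0.0])
q = np.array([0.0, 0.1, 0.2, 0.7])
print(jsd(p, q))              # finite even though the supports differ
print(jsd(p, q) / np.log(2))  # the same value in bits
```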
It’s the average of two KL divergences, but instead of measuring $P$ against $Q$ directly, it measures each of them against the thing halfway between. That gives you three properties KL doesn’t have:
- Symmetric. Swapping $P$ and $Q$ leaves $M$ — and therefore $\mathrm{JSD}(P \,\|\, Q)$ — unchanged.
- Always finite, and bounded. $M$ assigns zero probability only where both $P$ and $Q$ do, so neither KL term can ever divide by zero. In fact $0 \le \mathrm{JSD}(P \,\|\, Q) \le \ln 2$ — that’s $\ln 2$ nats, or exactly one bit. Disjoint distributions hit exactly one bit.
- $\sqrt{\mathrm{JSD}}$ is a true distance. More on that below.
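All three properties are cheap to spot-check numerically. A sketch, reusing the same `jsd` helper on random points of the probability simplex (via NumPy’s Dirichlet sampler):

```python
import numpy as np

def jsd(p, q):
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a[a > 0] * np.log(a[a > 0] / b[a > 0]))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

rng = np.random.default_rng(0)
for _ in range(1000):
    p, q = rng.dirichlet(np.ones(8), size=2)
    d = jsd(p, q)
    assert abs(d - jsd(q, p)) < 1e-12   # symmetric
    assert 0 <= d <= np.log(2) + 1e-12  # bounded by ln 2, i.e. one bit

# disjoint supports hit the ceiling exactly
p = np.array([0.5, 0.5, 0.0, 0.0])
q = np.array([0.0, 0.0, 0.5, 0.5])
print(jsd(p, q), np.log(2))             # both ~0.693147
```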
In the visualisation you control two skew-normal distributions, $P$ and $Q$, each with Mean, Std, Skew and Truncate sliders, plus a Discretise slider that bins both into histograms and computes the discrete JSD between the bin masses.
The upper plot draws the two densities. The lower plot draws the integrand — the pointwise contribution to JSD at each $x$. Unlike the KL integrand, this curve is non-negative everywhere and never infinite: the worst a single point can contribute is $\tfrac{1}{2}\ln 2$ per unit of probability mass, which happens when one density is zero there and the other isn’t.
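Here is how that integrand might be computed, with two plain Gaussians standing in for the skew-normals (an assumption for brevity; SciPy’s `norm` supplies the densities):

```python
import numpy as np
from scipy.stats import norm

x = np.linspace(-6.0, 6.0, 1201)
p = norm.pdf(x, loc=-1.0, scale=1.0)   # stand-in for density P
q = norm.pdf(x, loc=2.0, scale=0.7)    # stand-in for density Q
m = 0.5 * (p + q)

# pointwise contribution to JSD; 0 * log(0 / m) is taken as 0
with np.errstate(divide="ignore", invalid="ignore"):
    integrand = (0.5 * np.where(p > 0, p * np.log(p / m), 0.0)
                 + 0.5 * np.where(q > 0, q * np.log(q / m), 0.0))

print(integrand.min())                  # >= 0 up to float rounding
dx = x[1] - x[0]
print(integrand.sum() * dx, np.log(2))  # Riemann-sum JSD, below the ln 2 ceiling
```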
Things worth playing with:
- Make them disjoint. Pull $P$ and $Q$ apart (or hit the Nearly disjoint preset). $\mathrm{JSD}$ climbs to $\ln 2$ — one bit — and stays there. In the KL explorer the same move sent KL to $\infty$. Bounded versus undefined.
- Truncate $Q$’s support. Clip $Q$ to zero where $P$ still has mass. $D_{\mathrm{KL}}(P \,\|\, Q)$ would be $\infty$; $\mathrm{JSD}$ stays comfortably finite, because $M$ never vanishes there.
- Coarsen the bins. Turn on Discretise and widen the bins — $\mathrm{JSD}$ drops, because binning destroys information (the data-processing inequality). As bin width $\to 0$ it recovers the continuous value; see the sketch after this list.
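The Discretise behaviour is easy to reproduce. A sketch, binning two hypothetical Gaussians by CDF differences, with grid sizes chosen so that each finer grid refines the previous one:

```python
import numpy as np
from scipy.stats import norm

def jsd(p, q):
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a[a > 0] * np.log(a[a > 0] / b[a > 0]))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def binned_jsd(n_bins):
    edges = np.linspace(-8.0, 8.0, n_bins + 1)
    p = np.diff(norm.cdf(edges, loc=-1.0, scale=1.0))  # bin masses of P
    q = np.diff(norm.cdf(edges, loc=2.0, scale=0.7))   # bin masses of Q
    return jsd(p, q)

for n in (4, 16, 64, 256, 1024):   # each grid refines the previous one
    print(n, binned_jsd(n))        # non-decreasing, towards the continuous JSD
```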
Why it matters
Routing both distributions through their average is the whole trick: $M$ never assigns zero where $P$ or $Q$ has mass, so $D_{\mathrm{KL}}(P \,\|\, M)$ and $D_{\mathrm{KL}}(Q \,\|\, M)$ are both finite, and averaging the two makes the result symmetric. JSD is, roughly, “how badly does the average of $P$ and $Q$ describe each of them” — and that quantity has a clean interpretation.
It’s a mutual information. Flip a fair coin $Z \in \{0, 1\}$; on $Z = 0$ draw $X \sim P$, on $Z = 1$ draw $X \sim Q$. Then

$$\mathrm{JSD}(P \,\|\, Q) = I(X; Z)$$
— the information a single sample carries about which distribution it came from. If $P = Q$, a sample tells you nothing about the coin, and $\mathrm{JSD} = 0$. If $P$ and $Q$ have disjoint supports, a sample reveals the coin exactly — one bit — and $\mathrm{JSD} = \ln 2$. JSD measures distinguishability from a single draw.
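That identity is quick to verify on a small discrete example (a sketch; the joint of $(Z, X)$ is built directly from the mixture construction):

```python
import numpy as np

p = np.array([0.6, 0.3, 0.1])
q = np.array([0.1, 0.3, 0.6])

# mixture construction: Z ~ fair coin, X | Z=0 ~ p, X | Z=1 ~ q
pz = np.array([0.5, 0.5])
joint = np.vstack([0.5 * p, 0.5 * q])   # joint of (Z, X)
px = joint.sum(axis=0)                  # marginal of X: the mixture M

# I(X; Z) = sum over (z, x) of joint * log(joint / (p_Z * p_X))
outer = pz[:, None] * px[None, :]
mask = joint > 0
mi = np.sum(joint[mask] * np.log(joint[mask] / outer[mask]))

def jsd(p, q):
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a[a > 0] * np.log(a[a > 0] / b[a > 0]))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

print(mi, jsd(p, q))   # agree to float precision
```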
Its square root is a metric. $\sqrt{\mathrm{JSD}}$ is a distance metric that satisfies the triangle inequality, so you can actually do geometry with it — nearest-neighbour search over distributions, clustering documents by their word distributions, embedding distributions in a metric space.
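A numerical spot-check of the triangle inequality (not a proof; random Dirichlet triples):

```python
import numpy as np

def js_dist(p, q):
    """sqrt of JSD in nats - the Jensen-Shannon distance."""
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a[a > 0] * np.log(a[a > 0] / b[a > 0]))
    return np.sqrt(0.5 * kl(p, m) + 0.5 * kl(q, m))

rng = np.random.default_rng(1)
for _ in range(10_000):
    p, q, r = rng.dirichlet(np.ones(6), size=3)
    assert js_dist(p, r) <= js_dist(p, q) + js_dist(q, r) + 1e-12
print("triangle inequality held on 10,000 random triples")
```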
It answers a different question than KL. $D_{\mathrm{KL}}(P \,\|\, Q)$ is the expected log-likelihood-ratio under $P$: “if the data genuinely comes from $P$, how wrong, on average, is someone who believes $Q$?” It’s inherently directional and anchored to one reference distribution — which is exactly right when there is a ground truth and an approximation: maximum likelihood, cross-entropy loss, variational inference all minimise a KL of that shape. JSD asks a symmetric question instead — “how different are these two, with neither privileged?” — which is the right one when both sides are just empirical samples you want to compare: two text corpora, two model checkpoints, this month’s traffic against last month’s. And where KL answers “$\infty$ — incomparable” for distributions with different supports, JSD still gives a graded, bounded answer.
That bounded answer has a flip side worth knowing: once $P$ and $Q$ are fully disjoint, $\mathrm{JSD}$ is pinned at $\ln 2$ and stops responding — its gradient vanishes — so pushing two disjoint distributions towards each other gives you nothing to descend.
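You can see the plateau directly (a sketch on a 1-D grid of bins; positions and widths are arbitrary):

```python
import numpy as np

def jsd(p, q):
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a[a > 0] * np.log(a[a > 0] / b[a > 0]))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

n = 100
p = np.zeros(n)
p[0:10] = 0.1                      # P: mass on bins 0..9
for shift in (80, 60, 40, 20):     # slide Q's support closer to P's
    q = np.zeros(n)
    q[shift:shift + 10] = 0.1
    print(shift, jsd(p, q))        # pinned at ln 2 ~ 0.6931 every time

q = np.zeros(n)
q[5:15] = 0.1                      # only once the supports overlap...
print("overlap:", jsd(p, q))       # ...does JSD drop below ln 2
```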