
Interactive KL Divergence Visualisation

An interactive visualisation of Kullback–Leibler divergence. Shape two distributions and watch forward vs reverse KL, the pointwise integrand, and the effects of asymmetry, support mismatch and discretisation.

This page is an interactive explorer for building intuition about how Kullback–Leibler (KL) divergence behaves as you change the input distributions: drag the sliders and watch the divergence respond in real time.

For two probability distributions $P$ and $Q$, the Kullback–Leibler divergence (also called relative entropy) is

$$D_{\mathrm{KL}}(P \,\|\, Q) = \int p(x) \log \frac{p(x)}{q(x)} \, dx$$

where $p$ and $q$ are the densities of $P$ and $Q$.

It measures how badly $Q$ approximates $P$, weighted by where $P$ actually puts its mass. Disagreements in regions $P$ considers likely count a lot; disagreements in regions $P$ considers unlikely barely count at all.
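A tiny discrete example makes this (and the asymmetry) concrete. Take $P = (0.5, 0.5)$ and $Q = (0.9, 0.1)$ over two outcomes:

$$D_{\mathrm{KL}}(P \,\|\, Q) = 0.5 \log\tfrac{0.5}{0.9} + 0.5 \log\tfrac{0.5}{0.1} \approx 0.51 \text{ nats}, \qquad D_{\mathrm{KL}}(Q \,\|\, P) = 0.9 \log\tfrac{0.9}{0.5} + 0.1 \log\tfrac{0.1}{0.5} \approx 0.37 \text{ nats}.$$

The two directions disagree, and the first is larger because $P$ puts real mass (0.5) on the outcome that $Q$ nearly writes off.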

In the visualisation you control two skew-normal distributions, $P$ and $Q$, each with four sliders (a code sketch of the density follows the list):

  • Mean — shifts the distribution left/right.
  • Std — widens or narrows it.
  • Skew — pushes mass into one tail.
  • Truncate — hard-clips the support to a window around the mean, so the density is exactly zero outside it.
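Here is a minimal Python sketch of a truncated skew-normal density matching those sliders. It assumes the Truncate slider renormalises after clipping so the density still integrates to one; the function and parameter names are illustrative rather than the page's internals, and scipy's `loc` is only approximately the "mean" once the shape is skewed.

```python
import numpy as np
from scipy.stats import skewnorm

def truncated_skewnorm_pdf(x, mean, std, skew, trunc_width=np.inf):
    """Skew-normal density hard-clipped to [mean - w, mean + w], then renormalised."""
    pdf = skewnorm.pdf(x, a=skew, loc=mean, scale=std)
    lo, hi = mean - trunc_width, mean + trunc_width
    pdf = np.where((x >= lo) & (x <= hi), pdf, 0.0)  # exactly zero outside the window
    # Mass remaining inside the window; dividing restores total probability 1.
    mass = (skewnorm.cdf(hi, a=skew, loc=mean, scale=std)
            - skewnorm.cdf(lo, a=skew, loc=mean, scale=std))
    return pdf / mass

x = np.linspace(-8.0, 8.0, 2001)
p = truncated_skewnorm_pdf(x, mean=0.0, std=1.0, skew=4.0, trunc_width=3.0)
q = truncated_skewnorm_pdf(x, mean=1.0, std=1.5, skew=0.0)
```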

There’s also a Discretise slider that bins both distributions into histograms of a chosen width and computes the discrete KL between the bin masses instead of the continuous integral.
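A sketch of what the Discretise slider plausibly computes, reusing the grid and densities from the sketch above. It assumes the slider sums density mass into equal-width bins and takes the discrete KL between bin masses; the names and binning details are my guesses, not the page's code.

```python
import numpy as np

def discretised_kl(p, q, x, bin_width):
    """Discrete KL between histogram masses of two densities sampled on grid x."""
    dx = x[1] - x[0]
    per_bin = max(1, round(bin_width / dx))            # grid points per histogram bin
    n = (len(x) // per_bin) * per_bin                  # drop the ragged tail
    P = (p[:n] * dx).reshape(-1, per_bin).sum(axis=1)  # mass in each bin of P
    Q = (q[:n] * dx).reshape(-1, per_bin).sum(axis=1)  # mass in each bin of Q
    m = P > 0                                          # bins with no P-mass contribute 0
    return float(np.sum(P[m] * np.log(P[m] / Q[m])))   # inf if Q = 0 where P > 0
```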

The upper plot draws the two densities. The lower plot draws the integrand $p(x)\log(p(x)/q(x))$, the pointwise contribution to KL at each $x$.
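That integrand, and the KL it integrates to, can be reproduced in a few lines (trapezoidal integration is one plausible choice here; the page may do something else):

```python
import numpy as np
from scipy.integrate import trapezoid

def kl_integrand(p, q, eps=1e-300):
    """Pointwise p(x) log(p(x)/q(x)); zero wherever p is zero, since t log t -> 0."""
    ratio = np.maximum(p, eps) / np.maximum(q, eps)
    return p * np.log(ratio)

# Using p, q, x from the sketch above:
# kl_value = trapezoid(kl_integrand(p, q), x)  # area under the lower plot
```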

Things worth playing with:

  • Asymmetry. Narrow $Q$ with $P$ fixed and watch $D_{\mathrm{KL}}(P \,\|\, Q)$ explode while $D_{\mathrm{KL}}(Q \,\|\, P)$ stays small. This is why the direction matters: minimising $D_{\mathrm{KL}}(Q \,\|\, P)$ is mode-seeking ($Q$ hides in one mode), minimising $D_{\mathrm{KL}}(P \,\|\, Q)$ is mode-covering ($Q$ must cover all of $P$). A numeric sketch follows this list.
  • Support. Truncate $Q$ so it’s exactly zero somewhere $P$ has mass, and $D_{\mathrm{KL}}(P \,\|\, Q) \to \infty$. This is the absolute-continuity condition that’s usually glossed over, and it is why KL fails for disjoint distributions, motivating alternatives such as Jensen–Shannon and Wasserstein.
  • Discretisation. Turn the bins on and KL drops, because binning destroys information (the data-processing inequality). As the bin width $\to 0$, the discrete sum recovers the continuous integral.
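The numeric sketch promised in the asymmetry bullet: fix a wide $P$, make $Q$ narrow, and compare both directions. Plain normals are used here for simplicity rather than the page's skew-normals, and the epsilon floor stands in for the genuinely infinite zero-support case.

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import trapezoid

x = np.linspace(-10.0, 10.0, 4001)
p = norm.pdf(x, loc=0.0, scale=2.0)   # wide P
q = norm.pdf(x, loc=0.0, scale=0.3)   # narrow Q, hiding inside P

def kl(a, b, eps=1e-300):
    """Numerical D_KL(a || b); eps floors the log so true infinities show up as huge finite numbers."""
    return trapezoid(a * np.log(np.maximum(a, eps) / np.maximum(b, eps)), x)

print(kl(p, q))  # forward KL, roughly 19.8 nats: Q misses most of P's mass
print(kl(q, p))  # reverse KL, roughly 1.4 nats: Q sits comfortably inside P
```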

If you’ve trained a classifier, you’ve minimised a KL divergence: cross-entropy loss is $D_{\mathrm{KL}}(P_{\text{data}} \,\|\, Q_{\text{model}})$ plus a constant entropy term.
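The identity behind that claim is one line. For data distribution $P$ and model $Q$ over classes $x$,

$$H(P, Q) = -\sum_x p(x)\log q(x) = H(P) + D_{\mathrm{KL}}(P \,\|\, Q),$$

and $H(P)$ does not depend on the model, so minimising cross-entropy and minimising the KL pick out the same $Q$.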

This post is copyrighted by Josh Levy-Kramer.