Interactive KL Divergence Visualisation
An interactive visualisation of Kullback–Leibler divergence. Shape two distributions and watch forward vs reverse KL, the pointwise integrand, and the effects of asymmetry, support mismatch and discretisation.
This page is an interactive explorer for building intuition about how Kullback–Leibler (KL) divergence behaves when you change the input distributions: drag the sliders and watch the divergence respond in real time.
For two probability distributions $P$ and $Q$ with densities $p$ and $q$, the Kullback–Leibler divergence (also called relative entropy) is

$$D_{\mathrm{KL}}(P \,\|\, Q) = \int p(x) \log \frac{p(x)}{q(x)} \, dx.$$
It measures how badly $Q$ approximates $P$, weighted by where $P$ actually puts its mass. Disagreements in regions $P$ considers likely count a lot; disagreements in regions $P$ considers unlikely barely count at all.
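As a sanity check on the definition, the integral can be approximated on a grid and compared against the known closed form for two Gaussians. This is a sketch of the computation, not the page's own code (the page uses skew-normals):

```python
import numpy as np

def kl_on_grid(p, q, x):
    """Riemann-sum approximation of the KL integral p(x) log(p(x)/q(x)) dx."""
    dx = x[1] - x[0]
    mask = p > 0  # p log(p/q) contributes 0 where p == 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask])) * dx

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

x = np.linspace(-10.0, 10.0, 20001)
p = normal_pdf(x, 0.0, 1.0)  # P = N(0, 1)
q = normal_pdf(x, 1.0, 2.0)  # Q = N(1, 2)

numeric = kl_on_grid(p, q, x)
# Closed form for N(m1, s1) vs N(m2, s2):
#   log(s2/s1) + (s1^2 + (m1 - m2)^2) / (2 s2^2) - 1/2
closed = np.log(2.0) + (1.0 + 1.0) / (2.0 * 4.0) - 0.5
```

On this grid the two values agree to several decimal places, which is a useful check that the mass-weighted integrand is wired up correctly.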
In the visualisation you control two skew-normal distributions, $P$ and $Q$, each with four sliders:
- Mean — shifts the distribution left/right.
- Std — widens or narrows it.
- Skew — pushes mass into one tail.
- Truncate — hard-clips the support to a window around the mean, so the density is exactly zero outside it.
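One way the four sliders could map onto a density is via `scipy.stats.skewnorm` plus a hard clip and renormalisation. The exact parameterisation the page uses is an assumption here; in particular, `loc` is not the distribution's mean once skew is nonzero:

```python
import numpy as np
from scipy.stats import skewnorm

def slider_density(x, mean, std, skew, truncate=None):
    """Sketch of one slider set: skew-normal density, optionally hard-clipped
    to [mean - truncate, mean + truncate] and renormalised.
    (Note: skewnorm's loc is the location parameter, not the mean, when skew != 0;
    the page's exact parameterisation is an assumption.)"""
    d = skewnorm.pdf(x, a=skew, loc=mean, scale=std)
    if truncate is not None:
        d = np.where(np.abs(x - mean) <= truncate, d, 0.0)
        d /= np.sum(d) * (x[1] - x[0])  # renormalise so the density integrates to 1
    return d

x = np.linspace(-8.0, 8.0, 4001)
p = slider_density(x, mean=0.0, std=1.0, skew=4.0, truncate=2.5)
```

The renormalisation step matters: without it the clipped curve would integrate to less than 1 and wouldn't be a probability density.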
There’s also a Discretise slider that bins both distributions into histograms of a chosen width and computes the discrete KL between the bin masses instead of the continuous integral.
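A sketch of what the Discretise slider might compute: aggregate each density into bin masses, then take the discrete KL sum over bins. The specific binning scheme is an assumption:

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def binned_kl(p, q, x, points_per_bin):
    """Sum the grid densities into bin masses, then compute the discrete KL
    between the bin masses (a sketch; the page's exact binning is an assumption)."""
    dx = x[1] - x[0]
    k = (len(x) // points_per_bin) * points_per_bin  # drop any ragged tail bin
    pm = (p[:k] * dx).reshape(-1, points_per_bin).sum(axis=1)  # bin masses of P
    qm = (q[:k] * dx).reshape(-1, points_per_bin).sum(axis=1)  # bin masses of Q
    mask = pm > 0
    return np.sum(pm[mask] * np.log(pm[mask] / qm[mask]))

x = np.linspace(-10.0, 10.0, 20000)
p = normal_pdf(x, 0.0, 1.0)
q = normal_pdf(x, 1.0, 2.0)

coarse = binned_kl(p, q, x, 2000)  # bin width ~ 2.0
fine = binned_kl(p, q, x, 10)      # bin width ~ 0.01
```

Because the coarse bins are exact unions of the fine bins, the coarse value can never exceed the fine one — which is the data-processing-inequality behaviour described below.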
The upper plot draws the two densities. The lower plot draws the integrand $p(x) \log \frac{p(x)}{q(x)}$ — the pointwise contribution to KL at each $x$.
Things worth playing with:
- Asymmetry. Narrow $Q$ with $P$ fixed and watch $D_{\mathrm{KL}}(P \,\|\, Q)$ explode, while $D_{\mathrm{KL}}(Q \,\|\, P)$ stays small. This is why the direction matters: minimising $D_{\mathrm{KL}}(Q \,\|\, P)$ is mode-seeking ($Q$ hides in one mode), minimising $D_{\mathrm{KL}}(P \,\|\, Q)$ is mode-covering ($Q$ must cover all of $P$).
- Support. Truncate $Q$ so it's exactly zero somewhere $P$ has mass, and $D_{\mathrm{KL}}(P \,\|\, Q) = \infty$. This is the absolute-continuity condition that's usually glossed over — and why KL fails for distributions with disjoint support, motivating Jensen–Shannon and Wasserstein.
- Discretisation. Turn the bins on and KL drops, because binning destroys information (the data-processing inequality). As the bin width $\to 0$, the discrete sum recovers the continuous integral.
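The support point is easy to verify directly: hard-clip $Q$ to a window where $P$ still has mass outside it, and the forward KL comes out infinite. A pure-NumPy sketch:

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

x = np.linspace(-6.0, 6.0, 12001)
dx = x[1] - x[0]
p = normal_pdf(x, 0.0, 1.0)

# Truncate q to [-1, 1]: exactly zero where p still has mass.
q = np.where(np.abs(x) <= 1.0, normal_pdf(x, 0.0, 1.0), 0.0)
q /= np.sum(q) * dx  # renormalise the clipped density

# p(x) log(p(x)/q(x)) is +inf wherever p > 0 but q == 0,
# so the whole sum is driven to infinity.
ratio = np.divide(p, q, out=np.full_like(p, np.inf), where=q > 0)
kl = np.sum(p * np.log(ratio)) * dx
```

Any real implementation has to special-case these zero-density bins (or clamp them with a floor), which is exactly the kind of numerical detail the truncate slider exposes.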
If you’ve trained a classifier, you’ve minimised a KL divergence: cross-entropy loss is $D_{\mathrm{KL}}(P \,\|\, Q)$ plus a constant entropy term $H(P)$.
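That closing identity — $H(P, Q) = H(P) + D_{\mathrm{KL}}(P \,\|\, Q)$ — is quick to check numerically with made-up class probabilities (the 5-class vectors below are illustrative, not from the page):

```python
import numpy as np

# Made-up label distribution p and model prediction q over 5 classes.
rng = np.random.default_rng(0)
p = rng.random(5); p /= p.sum()
q = rng.random(5); q /= q.sum()

cross_entropy = -np.sum(p * np.log(q))   # H(P, Q): the classifier loss
entropy = -np.sum(p * np.log(p))         # H(P): constant in q
kl = np.sum(p * np.log(p / q))           # D_KL(P || Q)
```

Since $H(P)$ doesn't depend on the model's $Q$, minimising cross-entropy in $Q$ is the same optimisation as minimising the KL term.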