2022-11-24

Kullback-Leibler Divergence

\begin{align*} D_{KL}(P || Q) &= \mathbb E_{x \sim P}\left[\log \frac{P(x)}{Q(x)} \right] \\ &= \mathbb E_{x \sim P}\left[\log P(x) \right] - \mathbb E_{x \sim P}\left[\log Q(x) \right] \\ &= - H(P) + H(P, Q) \geq 0 \end{align*}

where \(H(P,Q)\) is the cross entropy and \(H(P)\) is the entropy of \(P\).
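The identity above can be checked numerically. A minimal sketch in Python; the discrete distributions `P` and `Q` below are arbitrary illustrative choices, not from any source:

```python
import numpy as np

# Hypothetical discrete distributions over 3 outcomes (illustrative only).
P = np.array([0.7, 0.2, 0.1])
Q = np.array([0.5, 0.3, 0.2])

# Direct definition: D_KL(P||Q) = sum_x P(x) * log(P(x)/Q(x))
kl_direct = np.sum(P * np.log(P / Q))

# Via entropies: D_KL(P||Q) = H(P, Q) - H(P)
entropy_P = -np.sum(P * np.log(P))       # H(P)
cross_entropy = -np.sum(P * np.log(Q))   # H(P, Q)
kl_via_entropy = cross_entropy - entropy_P

print(kl_direct, kl_via_entropy)
```

The two computations agree, and the value is non-negative (Gibbs' inequality), as the derivation states.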

Facts:

1. A consequence of KL divergence not being symmetric

See Figure 3.6 in the Deep Learning book by Goodfellow et al. for a consequence of this asymmetry. There we keep \(P\) fixed and vary \(Q \sim \mathcal N\):

  • \(\arg \min_Q D_{KL}(P || Q)\) chooses \(Q\) such that it places high probability wherever \(P\) has high probability. If \(P\) has multiple modes, \(Q\) blurs them together.

    Intuition: \(Q\) represents the coding distribution. So we avoid having low \(Q\) where \(P\) is high or non-zero, because that would mean a larger expected code length.

  • \(\arg \min_Q D_{KL} (Q||P)\) chooses \(Q\) such that it avoids the low-probability regions of \(P\), typically locking onto one of the modes of \(P\). If the modes are close together, minimization may still yield a \(Q\) that blurs them.

    Intuition: \(P\) represents the coding distribution and it is fixed. So \(Q\) has to avoid the low-density regions of \(P\); otherwise the expected code length in those regions will be high.
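Both behaviours can be reproduced numerically by fitting a single unit-variance Gaussian \(Q = \mathcal N(\mu, 1)\) to a bimodal \(P\) on a grid. A sketch, assuming an illustrative mixture with modes at \(\pm 4\) (all names here are my own, mirroring the setup of Figure 3.6):

```python
import numpy as np

# Grid over which we evaluate densities numerically.
xs = np.linspace(-10, 10, 2001)
dx = xs[1] - xs[0]

def gauss(x, mu, sigma=1.0):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# P: an illustrative bimodal mixture with well-separated modes at -4 and +4.
P = 0.5 * gauss(xs, -4.0) + 0.5 * gauss(xs, 4.0)
P /= P.sum() * dx

def kl(p, q):
    # Numerical D(p||q) on the grid; tiny eps avoids log(0).
    eps = 1e-300
    return np.sum(p * np.log((p + eps) / (q + eps))) * dx

# Sweep the mean of Q = N(mu, 1) and minimize each direction of KL.
mus = np.linspace(-6, 6, 241)
fwd = [kl(P, gauss(xs, m)) for m in mus]   # D(P||Q): mode-covering
rev = [kl(gauss(xs, m), P) for m in mus]   # D(Q||P): mode-seeking

mu_fwd = mus[np.argmin(fwd)]   # lands between the modes (near 0)
mu_rev = mus[np.argmin(rev)]   # lands on one of the modes (near -4 or +4)
print(mu_fwd, mu_rev)
```

Minimizing the forward KL places \(Q\) between the two modes (blurring them), while minimizing the reverse KL makes \(Q\) sit on a single mode, exactly as described above.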


