Kullback-Leibler Divergence
\(D_{KL}(P || Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)} = H(P, Q) - H(P)\)
where \(H(P,Q)\) is the cross entropy and \(H(P)\) is the entropy of \(P\).
Facts:
- Measures the information loss when approximating the true distribution \(p\) with a model distribution \(q\)
- Not symmetric, and thus not a metric
- There exists a symmetric version of KL divergence, the Jensen–Shannon divergence:
  \(JSD(p || q) = \frac{1}{2} D(p || m) + \frac{1}{2} D(q || m)\), where \(m = \frac{1}{2}(p + q)\) is the mixture distribution.
  The square root of JSD is a metric, the Jensen–Shannon distance.
- The minimizer of KL divergence w.r.t. \(Q\) is the same as the minimizer of the cross entropy \(H(P,Q)\) w.r.t. \(Q\), since \(H(P)\) does not depend on \(Q\)
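The facts above can be checked numerically. A minimal sketch with NumPy (the helper names `kl`, `entropy`, `cross_entropy`, and `jsd` are ad-hoc, not from any library; natural log is used throughout):

```python
import numpy as np

def kl(p, q):
    # D_KL(p || q) = sum_x p(x) log(p(x)/q(x)), with the convention 0 log 0 = 0
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def entropy(p):
    # H(p) = -sum_x p(x) log p(x)
    p = np.asarray(p, float)
    mask = p > 0
    return float(-np.sum(p[mask] * np.log(p[mask])))

def cross_entropy(p, q):
    # H(p, q) = -sum_x p(x) log q(x)
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(-np.sum(p[mask] * np.log(q[mask])))

def jsd(p, q):
    # JSD(p || q) = 1/2 D(p || m) + 1/2 D(q || m), m = (p + q) / 2
    m = 0.5 * (np.asarray(p, float) + np.asarray(q, float))
    return kl(p, m) / 2 + kl(q, m) / 2

p = [0.1, 0.4, 0.5]
q = [0.8, 0.1, 0.1]

# D_KL(p||q) = H(p,q) - H(p); KL is asymmetric, JSD is symmetric
assert abs(kl(p, q) - (cross_entropy(p, q) - entropy(p))) < 1e-12
assert kl(p, q) != kl(q, p)
assert abs(jsd(p, q) - jsd(q, p)) < 1e-12
```

Note the identity in the last lines: minimizing \(D_{KL}(P||Q)\) over \(Q\) and minimizing \(H(P,Q)\) over \(Q\) differ only by the constant \(H(P)\).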
1. Consequence of KL divergence not being symmetric
See Figure 3.6 in the Deep Learning book by Goodfellow for a consequence of this. There, \(P\) is kept fixed while \(Q \sim \mathcal N\) varies:
\(\arg \min_Q D_{KL}(P || Q)\) chooses \(Q\) so that it places high probability wherever \(P\) has high probability. If \(P\) has multiple modes, \(Q\) blurs them together.
Intuition: \(Q\) is the coding distribution. So we avoid having low \(Q\) where \(P\) is high or non-zero, because that would mean a larger coding length.
\(\arg \min_Q D_{KL} (Q||P)\) chooses \(Q\) so that it avoids the low-probability regions of \(P\), typically locking onto a single mode of \(P\). If the modes are close together, this may still result in a \(Q\) that blurs them.
Intuition: \(P\) is the coding distribution and it is fixed. So \(Q\) must avoid the low-density regions of \(P\); otherwise the coding length in those regions will be high.