2026-04-17

Deep Learning Book

Table of Contents

1. Chapter 3 - Probability & Info Theory

Deep Learning - Goodfellow, Bengio, Courville.pdf: Page 69

Probability Theory - A way to express uncertainty in statements; an extension of logic to deal with uncertainty

Information Theory - Quantifies the amount of uncertainty in a probability distribution

Sources of uncertainity:

  1. Inherent Stochasticity
  2. Partial Observability
  3. Incomplete modeling

Probability theory was developed to analyze frequencies of events. But when events are not repeatable, we can still use probability theory to represent a degree of belief. The former is called frequentist probability and the latter is called Bayesian probability. It turns out that both follow the same rules/formulas of probability theory.

To formally develop probability theory for continuous random variables we need measure theory.

Chain Rule / Product rule of Conditional Probabilities

Joint probability can be expressed as a product of conditional probabilities.

\begin{align*} P(x^{(1)}, x^{(2)}, \dots, x^{(n)}) = P(x^{(1)}) \prod_{i=2}^{n} P(x^{(i)} \mid x^{(1)}, \dots, x^{(i-1)}) \end{align*}
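The chain rule can be checked numerically on a small joint table. The 3-variable binary distribution below is a made-up example (any normalized table would do), not something from the book:

```python
import numpy as np

# A random (but normalized) joint distribution over three binary variables.
rng = np.random.default_rng(0)
joint = rng.random((2, 2, 2))
joint /= joint.sum()  # normalize so it is a valid distribution

# Marginals and conditionals needed for the chain rule.
p1 = joint.sum(axis=(1, 2))            # P(x1)
p12 = joint.sum(axis=2)                # P(x1, x2)
p2_given_1 = p12 / p1[:, None]         # P(x2 | x1)
p3_given_12 = joint / p12[:, :, None]  # P(x3 | x1, x2)

# Chain rule: P(x1, x2, x3) = P(x1) P(x2 | x1) P(x3 | x1, x2)
reconstructed = p1[:, None, None] * p2_given_1[:, :, None] * p3_given_12
assert np.allclose(reconstructed, joint)
```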
  • \(x \perp y\) - Denotes independence
  • \(x \perp y | z\) - Denotes conditional independence

Independence is a stronger condition than zero covariance:

  • Independent variables have zero covariance
  • But zero covariance only implies there is no linear dependence.
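A classic counterexample makes the distinction concrete: take \(x\) uniform on \(\{-1, 0, 1\}\) and \(y = x^2\). Then \(y\) is a deterministic function of \(x\) (maximally dependent), yet their covariance is exactly zero:

```python
import numpy as np

# x uniform on {-1, 0, 1}, y = x^2: fully dependent, zero covariance.
x = np.array([-1.0, 0.0, 1.0])  # equally likely outcomes
y = x ** 2

# Cov(x, y) = E[xy] - E[x] E[y]
cov = np.mean(x * y) - np.mean(x) * np.mean(y)
print(cov)  # 0.0
```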

Gaussian Distribution

  • \(\mu \in \mathbb R\)
  • \(\sigma \in (0, \infty)\)

    Or one can use the inverse of the variance, called the precision: \(\beta = \frac 1 {\sigma^2} \in (0, \infty)\)

It is a useful and good default choice in many places because:

  1. Central Limit Theorem: The sum of many independent random variables is approximately normally distributed. So the noise in a complex system with many parts can often be modeled as Gaussian.
  2. Of all probability distributions with the same variance, the normal distribution has the highest uncertainty (maximum entropy). So it encodes the least amount of prior knowledge into the model.
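The CLT is easy to see empirically. Below, sums of 30 independent Uniform(0, 1) variables concentrate around the predicted mean \(n/2\) and variance \(n/12\) (the sample sizes are arbitrary choices for the demo):

```python
import numpy as np

# Sum of n independent Uniform(0, 1) variables; by the CLT the sum is
# approximately Normal(n/2, n/12).
rng = np.random.default_rng(0)
n, samples = 30, 100_000
sums = rng.random((samples, n)).sum(axis=1)

print(sums.mean())  # close to n/2  = 15.0
print(sums.var())   # close to n/12 = 2.5
```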

Multivariate Normal Distribution

  • \(\mathcal N(x; \mu, \Sigma)\)
  • \(\Sigma\) is the covariance matrix - a symmetric positive definite matrix
  • \(\beta = \Sigma^{-1}\) is the precision matrix
  • Isotropic Gaussian means \(\Sigma = \sigma^2 I\)
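The density written with the precision matrix, \(p(x) = \sqrt{\frac{\det \beta}{(2\pi)^k}} \exp\left(-\frac 1 2 (x-\mu)^\top \beta (x-\mu)\right)\), can be sketched directly; the parameter values below are illustrative:

```python
import numpy as np

# Multivariate normal density written in terms of the precision matrix
# beta = Sigma^{-1}; mu and sigma below are arbitrary illustrative values.
def mvn_pdf(x, mu, sigma):
    k = len(mu)
    beta = np.linalg.inv(sigma)  # precision matrix
    diff = x - mu
    norm = np.sqrt(np.linalg.det(beta) / (2 * np.pi) ** k)
    return norm * np.exp(-0.5 * diff @ beta @ diff)

mu = np.zeros(2)
sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
# At x = mu the exponent vanishes, so the density equals the
# normalization constant 1 / (2 pi sqrt(det Sigma)).
print(mvn_pdf(mu, mu, sigma))
```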

The Dirac delta is not an ordinary function; it is a generalized function. Generalized functions are defined in terms of their properties when integrated.

A Gaussian mixture model is a universal approximator of densities: any smooth density can be approximated with any specific, non-zero amount of error by a Gaussian mixture model with enough components.
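A mixture density is just a convex combination of component densities, so it integrates to 1 by construction. A minimal sketch with two 1D components (the weights, means, and standard deviations are illustrative, not from the book):

```python
import numpy as np

# A two-component 1D Gaussian mixture density.
def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def mixture_pdf(x, weights, mus, sigmas):
    # Convex combination of component densities (weights sum to 1).
    return sum(w * normal_pdf(x, m, s) for w, m, s in zip(weights, mus, sigmas))

xs = np.linspace(-10.0, 10.0, 20001)
dx = xs[1] - xs[0]
density = mixture_pdf(xs, [0.3, 0.7], [-2.0, 3.0], [1.0, 0.5])
# Numerical integral of the mixture over a wide interval.
print(dx * density.sum())  # close to 1.0
```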

Useful functions

  • Sigmoid
    • \(\sigma(x) = \frac 1 { 1 + \exp(-x)}\)
    • Range \((0,1)\)
    • Useful for producing the probability parameter of a Bernoulli distribution
  • Softplus
    • \(\zeta(x) = \log(1+\exp(x))\)
    • Range \((0, \infty)\)
    • A smooth version of the positive part function \(x^+ = \max(0, x)\)
    • Useful for producing \(\beta\) or \(\sigma\) parameters of a normal distribution

Relations:

  • \(\frac {d} {dx} \zeta(x) = \sigma(x)\)
  • \(\frac {d} {dx} \sigma(x) = \sigma(x)(1- \sigma(x))\)
  • \(\log \sigma(x) = - \zeta(-x)\)
  • \(\zeta(x) - \zeta(-x) = x\)
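These identities are easy to verify numerically; the helper names `sigmoid` and `softplus` are my own:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softplus(x):
    return np.log1p(np.exp(x))  # log(1 + exp(x)), numerically stable form

x = np.linspace(-5.0, 5.0, 101)

# log sigma(x) = -zeta(-x)
assert np.allclose(np.log(sigmoid(x)), -softplus(-x))
# zeta(x) - zeta(-x) = x
assert np.allclose(softplus(x) - softplus(-x), x)
# d/dx zeta(x) = sigma(x), checked by a centered finite difference.
h = 1e-5
assert np.allclose((softplus(x + h) - softplus(x - h)) / (2 * h), sigmoid(x), atol=1e-8)
```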

If \(x\) and \(y\) are continuous random variables and \(g\) is an invertible, continuous, differentiable function, then \(y = g(x)\) does not imply \(p_y(y) = p_x(g^{-1}(y))\); instead it is:

\begin{align*} p_y(y) = p_x(g^{-1}(y)) \left| \frac{\partial x}{\partial y} \right| \end{align*}
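As a concrete check (my example, not the book's): if \(x \sim \text{Uniform}(0,1)\) and \(y = x^2\), the formula gives \(p_y(y) = p_x(\sqrt y)\left|\frac{1}{2\sqrt y}\right| = \frac{1}{2\sqrt y}\), which a sampled histogram reproduces:

```python
import numpy as np

# x ~ Uniform(0, 1), y = g(x) = x^2; predicted p_y(y) = 1 / (2 sqrt(y)).
rng = np.random.default_rng(0)
y = rng.random(1_000_000) ** 2

counts, edges = np.histogram(y, bins=50, range=(0.0, 1.0), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
predicted = 1.0 / (2.0 * np.sqrt(centers))

# Compare empirical and predicted densities; the first bin is skipped
# because the density blows up as y -> 0.
print(np.max(np.abs(counts[1:] - predicted[1:])))
```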

Information Theory:

Information theory was originally developed to measure the expected message length of an optimal code in communication. It deals with discrete distributions. Shannon entropy assigns an amount of uncertainty to a probability distribution.

We can, and do, apply similar formulas to continuous distributions, but the interpretations do not carry over and some properties are lost. This is called differential entropy. E.g. an event with probability = 1 has zero information because it is guaranteed to occur, whereas a point with density = 1 contributes zero information even though it is not guaranteed to occur.
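The discrete case can be made concrete with a short sketch; the helper name `entropy` is my own, not from the book:

```python
import numpy as np

# Shannon entropy (in nats) of a discrete distribution.
def entropy(p):
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]  # 0 * log 0 is taken as 0 by convention
    return -np.sum(nz * np.log(nz))

print(entropy([1.0, 0.0]))  # 0.0: a guaranteed event carries no information
print(entropy([0.5, 0.5]))  # log 2: the maximum for two outcomes
```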

KL Divergence:

\begin{align*} D_{KL}(P || Q) = \mathbb E_{x \sim P}\left[\log \frac{P(x)}{Q(x)} \right] \geq 0 \end{align*}

KL Divergence is not symmetric. See Figure 3.6 for a consequence of this. There \(P\) is kept fixed while \(Q \sim \mathcal N\) varies:

  • \(\arg \min_Q D_{KL}(P || Q)\) chooses \(Q\) such that it places high probability where \(P\) has high probability. If \(P\) has multiple modes, \(Q\) blurs them.

    Intuition: \(Q\) represents the coding distribution. So we avoid having low \(Q\) where \(P\) is high or non-zero, because that would mean a larger coding length.

  • \(\arg \min_Q D_{KL} (Q||P)\) chooses \(Q\) such that it avoids the low-probability regions of \(P\) and settles on one of the modes of \(P\). If the modes are close together, this might still result in a \(Q\) that blurs them.

    Intuition: \(P\) represents the coding distribution and it is fixed. So \(Q\) has to avoid the low-density regions of \(P\), otherwise the coding length in those regions will be high.
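The asymmetry is already visible for small discrete distributions; the distributions below are arbitrary illustrative choices:

```python
import numpy as np

# KL divergence between discrete distributions (in nats).
def kl(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0  # terms with p = 0 contribute nothing
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = np.array([0.5, 0.4, 0.1])
q = np.array([0.8, 0.1, 0.1])

print(kl(p, q))  # D_KL(P || Q)
print(kl(q, p))  # D_KL(Q || P): generally a different number
assert not np.isclose(kl(p, q), kl(q, p))
assert kl(p, p) == 0.0  # divergence of a distribution from itself is zero
```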

Graphical Model:

When we represent the factorization of a probability distribution with a graph, we call it a graphical model or structured probabilistic model.

Graphical models can be directed or undirected.

Directed models represent the factorization using conditional probability distributions: for each RV there is one factor, the conditional distribution of that RV given its parents in the graph.

Undirected models represent the factorization using a set of functions that assign unnormalized probabilities to cliques of RVs. For each clique (a set of nodes all connected to each other), say \(x_1, ..., x_k\), we have a factor \(\phi(x_1, ..., x_k)\). The product of the factors is then divided by a normalization constant.
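A tiny undirected model makes the normalization step concrete. The chain \(a - b - c\) below has two cliques, \(\{a, b\}\) and \(\{b, c\}\); the factor tables are illustrative values I made up:

```python
import numpy as np

# Clique factors for the chain a - b - c over binary variables.
phi_ab = np.array([[2.0, 1.0], [1.0, 3.0]])  # phi(a, b), unnormalized
phi_bc = np.array([[1.0, 2.0], [4.0, 1.0]])  # phi(b, c), unnormalized

# Unnormalized joint: product of the clique factors.
unnorm = phi_ab[:, :, None] * phi_bc[None, :, :]

Z = unnorm.sum()  # partition function (normalization constant)
p = unnorm / Z    # normalized joint p(a, b, c)

assert np.isclose(p.sum(), 1.0)
```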
