Decision Theory
Decision theory tries to answer two main questions:
- How should people make decisions to be as rational as possible? (This is the normative aspect of decision theory.)
- How do people actually make decisions, taking into account human biases and emotions? (This is the descriptive aspect of decision theory.)
In the context of machine learning (from Bishop, Pattern Recognition and Machine Learning, 2006):
Probability theory gives us a framework to quantify and manipulate uncertainty. Decision theory, combined with probability theory, allows us to make optimal decisions under uncertainty.
Problem:
We have to make a decision \(y\) given an input \(x\).
1. Alternatives
Apart from probability theory, some propose fuzzy logic, Dempster-Shafer theory, and other frameworks for decision making. Proponents of probability theory argue that any decision rule that is Pareto optimal is equivalent to a Bayesian decision rule with an appropriate utility function and prior distribution, and that probability theory is therefore sufficient.
One critique of decision theory is that it cannot take into account unknown unknowns that can occur in real life and have a big impact. I, however, don't find this critique compelling.
2. Classification Problem
For a classification problem, decisions may be based on:
Minimizing Misclassification Rate
Intuitively, for a classification problem, choosing the class \(C_i\) with the highest posterior probability \(p(C_i | x)\) would be best. This intuition is correct if we want to minimize the misclassification rate.
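A minimal sketch of this rule, assuming we already have posterior probabilities from some model (the numbers are made up for illustration):

```python
import numpy as np

# Hypothetical posteriors p(C_k | x) for two inputs over three classes.
posteriors = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
])

# Minimizing the misclassification rate: pick the most probable class.
decisions = posteriors.argmax(axis=1)  # -> array([0, 2])
```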
Minimizing Expected Loss
Sometimes misclassifying one class is more costly than misclassifying another. In that case there is a loss/cost function \(L(k, j)\) assigning the cost of classifying an input of true class \(k\) as class \(j\), and the optimal decision is to assign \(x\) to the class \(j\) for which
\begin{align*} \sum_k L(k, j) \, p(C_k | x) \end{align*}
is minimum.
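A sketch of this decision rule with a hypothetical two-class loss matrix (the numbers are illustrative only):

```python
import numpy as np

# L[k, j] = cost of deciding class j when the true class is k.
# Here, misclassifying true class 1 as class 0 is ten times worse.
L = np.array([
    [0.0,  1.0],
    [10.0, 0.0],
])

posteriors = np.array([0.8, 0.2])  # p(C_k | x) for one input

expected_loss = posteriors @ L     # sum_k L(k, j) p(C_k | x), for each j
decision = expected_loss.argmin()  # -> 1, despite class 0 being more probable
```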
Reject to Decide
Sometimes it can be appropriate not to make a decision when the problem is difficult. For example, we may make a decision only when the largest class probability \(p(C_i | x)\) exceeds a threshold, and reject the input otherwise.
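A sketch of such a thresholded rule (the threshold value here is an assumption; in practice it is tuned per application):

```python
import numpy as np

REJECT = -1

def decide_or_reject(posterior, theta=0.8):
    """Pick the most probable class, or reject if its probability <= theta."""
    k = int(np.argmax(posterior))
    return k if posterior[k] > theta else REJECT

decide_or_reject(np.array([0.90, 0.05, 0.05]))  # -> 0
decide_or_reject(np.array([0.40, 0.35, 0.25]))  # -> REJECT
```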
3. Regression Problem
For a regression problem, the decision may be based on:
Minimizing squared loss:
\(y(x) = E[y|x]\)
minimizes the squared loss, i.e., the optimal least-squares predictor is given by the conditional mean.
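A one-line sketch of why (writing the predictor as \(f(x)\) to avoid overloading \(y\)): setting the derivative of the conditional expected squared loss to zero gives
\begin{align*} \frac{\partial}{\partial f(x)} E\big[(f(x) - y)^2 \mid x\big] = 2\big(f(x) - E[y|x]\big) = 0 \;\Rightarrow\; f(x) = E[y|x]. \end{align*}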
Minimizing other loss:
The L1 loss is minimized by the conditional median, and the L0 loss by the conditional mode.
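A quick empirical check of both the mean and the median claims on samples from a skewed distribution (the exponential is an arbitrary choice for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.exponential(scale=1.0, size=100_000)  # mean = 1, median = ln 2

candidates = np.linspace(0.0, 3.0, 601)
l2 = [np.mean((y - c) ** 2) for c in candidates]   # squared loss
l1 = [np.mean(np.abs(y - c)) for c in candidates]  # L1 loss

print(candidates[np.argmin(l2)], y.mean())       # both close to 1.0
print(candidates[np.argmin(l1)], np.median(y))   # both close to 0.69
```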
4. Approaches for Inference and Decision
All of the above shows that we first need to learn a model of the class distribution \(p(C_k | x)\) (the inference stage) and then use it to make a decision (the decision stage). But there are other approaches that learn a different model.
Three approaches to decision making:
Generative Approach: Model \(p(x, y)\)
The joint probability distribution \(p(x, y)\) provides complete information about the uncertainty in the input \(x\) and output \(y\) variables. All other quantities arise from marginalizing or conditioning on the appropriate variables. Determining \(p(x, y)\) from training data is an example of inference.
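A minimal generative sketch with 1-D Gaussian class-conditionals (the model choice and data are assumptions for illustration): fit \(p(x|C_k)\) and \(p(C_k)\), then obtain the posterior by Bayes' rule.

```python
import numpy as np
from scipy.stats import norm

def fit_class(X, labels, k):
    """Estimate (mu_k, sigma_k, p(C_k)) for a Gaussian class-conditional."""
    Xk = X[labels == k]
    return Xk.mean(), Xk.std(), len(Xk) / len(X)

def posterior(x, params):
    """p(C_k | x) via Bayes' rule: proportional to p(x | C_k) p(C_k)."""
    joint = np.array([norm.pdf(x, mu, sd) * prior for mu, sd, prior in params])
    return joint / joint.sum()

rng = np.random.default_rng(1)
X = np.concatenate([rng.normal(0, 1, 300), rng.normal(3, 1, 700)])
labels = np.repeat([0, 1], [300, 700])
params = [fit_class(X, labels, k) for k in (0, 1)]
posterior(1.5, params)  # posterior for a point between the two class means
```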
Discriminative Approach: Model \(p(y|x)\)
Having \(p(y|x)\) (whether from a generative or a discriminative approach) has the following benefits:
- Minimizing risk
- Rejection option: rejecting (not making a decision) when the probability (confidence) is low
- Compensating for class priors
- Combining models: e.g., if we assume conditional independence, \(p(x_1, x_2 | C_k) = p(x_1 | C_k) \, p(x_2 | C_k)\), the posteriors of separate models can be multiplied together (see the sketch below). We can combine models even without assuming conditional independence.
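A sketch of the conditional-independence combination (cf. Bishop §1.5.4): the combined posterior is proportional to \(p(C_k|x_1) \, p(C_k|x_2) / p(C_k)\).

```python
import numpy as np

def combine(post1, post2, prior):
    """Combine two posteriors under p(x1, x2 | C_k) = p(x1 | C_k) p(x2 | C_k)."""
    joint = post1 * post2 / prior   # proportional to p(C_k | x1, x2)
    return joint / joint.sum()

prior = np.array([0.5, 0.5])
combine(np.array([0.8, 0.2]), np.array([0.7, 0.3]), prior)
# -> array([0.903..., 0.097...]): two moderately confident models reinforce each other.
```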
Non-probabilistic Approach: Use a discriminant function \(y = f(x)\)
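A minimal sketch of such a discriminant: a hypothetical linear rule mapping \(x\) directly to a class label, with no intermediate probabilities.

```python
import numpy as np

# Weights and bias are placeholders; in practice they come from training.
w, b = np.array([1.0, -2.0]), 0.5

def f(x):
    """Discriminant function: class 1 on the positive side of the hyperplane."""
    return int(np.dot(w, x) + b > 0)

f(np.array([2.0, 0.5]))  # -> 1
```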