Introduction
Given a set of data points and their corresponding labels,

$$X = \{x_1, \dots, x_N\}, \quad x_i \in \mathbb{R}^D, \qquad y = \{y_1, \dots, y_N\},$$

a generative classifier attempts to model the joint distribution $p(x, y)$ and then uses Bayes' rule or the chain rule to calculate the posterior $p(y \mid x)$:

$$p(y \mid x) = \frac{p(x, y)}{p(x)} = \frac{p(x \mid y)\, p(y)}{p(x)},$$

where $D$ is the number of features and $N$ is the number of data points. The label $y_i$ of each data point is a discrete random variable that can take on one of $C$ values, $y_i \in \{1, \dots, C\}$.
Sometimes, $y_i$ is a binary random variable belonging to one of two classes, $y_i \in \{0, 1\}$. We can see that the form of the label is not important, as long as it is discrete. The goal of the generative classifier is to predict the label $\hat{y}$ of a new data point $x$ given the training data $X$ and $y$. The generative classifier does this by maximizing the posterior probability $p(y \mid x)$ (equivalent to minimizing the misclassification rate) or by minimizing the expected risk $R(\hat{y} \mid x)$.
Maximizing the Posterior Probability
Given a new data point $x$, the misclassification rate when predicting its label as $\hat{y}$ is

$$p(\text{error} \mid x) = p(y \neq \hat{y} \mid x) = 1 - p(y = \hat{y} \mid x).$$

Using Bayes' rule, we can rewrite this as

$$p(y = \hat{y} \mid x) = \frac{p(x \mid y = \hat{y})\, p(y = \hat{y})}{p(x)}.$$

To minimize the misclassification rate, we want to maximize the posterior probability $p(y = \hat{y} \mid x)$. Since $p(x)$ is constant across classes and $p(y = c)$ is the prior probability of class $c$, we just need to maximize $p(x \mid y = c)\, p(y = c)$, which is proportional to the posterior probability of $y = c$ given $x$. And the prediction is

$$\hat{y} = \arg\max_{c \in \{1, \dots, C\}} p(x \mid y = c)\, p(y = c).$$
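A minimal sketch of this decision rule in Python with NumPy, assuming the class-conditional densities and priors have already been estimated; the names `class_likelihoods` and `priors` are illustrative:

```python
import numpy as np

def predict_map(x, class_likelihoods, priors):
    """Predict the class c that maximizes p(x | y = c) * p(y = c).

    class_likelihoods: one callable per class, each returning p(x | y = c)
    priors: array of shape (C,) holding the prior probabilities p(y = c)
    """
    scores = np.array([lik(x) for lik in class_likelihoods]) * priors
    return int(np.argmax(scores))  # p(x) is the same for every c, so it is dropped
```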
Minimizing the Expected Risk
The cost of different types of misclassification can differ. For example, the cost of misclassifying a malignant tumor as benign is much higher than the cost of misclassifying a benign tumor as malignant. In this case, we should define a loss function $L(c', c)$ that captures the cost of predicting class $c'$ when the true class is $c$. The expected risk of predicting $c'$ is then

$$R(c' \mid x) = \sum_{c=1}^{C} L(c', c)\, p(y = c \mid x),$$

where $p(y = c \mid x)$ is the posterior probability of class $c$ given $x$. The prediction is

$$\hat{y} = \arg\min_{c' \in \{1, \dots, C\}} R(c' \mid x).$$
Actually, the expected risk is a weighted sum of the posterior probabilities of the possible misclassifications, where the weights are their costs, since the total probability of a wrong prediction decomposes as

$$p(y \neq c' \mid x) = \sum_{c \neq c'} p(y = c \mid x).$$

A simple example of a loss function is the 0-1 loss,

$$L(c', c) = \begin{cases} 0 & \text{if } c' = c, \\ 1 & \text{otherwise.} \end{cases}$$

In this case, the expected risk happens to be the misclassification rate,

$$R(c' \mid x) = \sum_{c \neq c'} p(y = c \mid x) = 1 - p(y = c' \mid x),$$

so minimizing the risk is the same as maximizing the posterior. The short form of the loss function is $L(\hat{y}, y)$, where $\hat{y}$ is the predicted label and $y$ is the true label.
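The risk-minimizing rule is a matrix-vector product followed by an argmin; here is a short sketch (names are illustrative), with the 0-1 loss as a sanity check:

```python
import numpy as np

def predict_min_risk(posterior, loss):
    """Pick the prediction with the smallest expected risk.

    posterior: array of shape (C,), p(y = c | x) for each class c
    loss: array of shape (C, C), loss[c_hat, c] = cost of predicting c_hat
          when the true class is c
    """
    risk = loss @ posterior  # R(c_hat | x) = sum_c L(c_hat, c) * p(y = c | x)
    return int(np.argmin(risk))

# With 0-1 loss, minimizing the risk reduces to maximizing the posterior.
posterior = np.array([0.2, 0.5, 0.3])
zero_one = 1.0 - np.eye(3)  # L(c_hat, c) = 0 if c_hat == c, else 1
assert predict_min_risk(posterior, zero_one) == int(np.argmax(posterior))
```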
In the following, we give common distributions and their corresponding parameter estimates; these are the building blocks for the class priors and class-conditional models of a generative classifier.
Bernoulli distribution, $x_i \sim \mathrm{Ber}(\theta)$ with $x_i \in \{0, 1\}$ and $N_1 = \sum_{i=1}^{N} \mathbb{I}(x_i = 1)$.

MLE of $\theta$:

$$\hat{\theta}_{\text{MLE}} = \frac{N_1}{N}$$

MAP of $\theta$ under a $\mathrm{Beta}(a, b)$ prior:

$$\hat{\theta}_{\text{MAP}} = \frac{N_1 + a - 1}{N + a + b - 2}$$

The Laplace smoothing is

$$\hat{\theta} = \frac{N_1 + 1}{N + 2},$$

which coincides with the MAP estimate under a $\mathrm{Beta}(2, 2)$ prior.
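A small sketch of the three Bernoulli estimates, under the $\mathrm{Beta}(a, b)$ parameterization assumed above:

```python
import numpy as np

def bernoulli_estimates(x, a=2.0, b=2.0):
    """MLE, MAP (Beta(a, b) prior), and Laplace-smoothed estimates of theta."""
    N, N1 = len(x), np.sum(x)
    mle = N1 / N
    map_ = (N1 + a - 1) / (N + a + b - 2)
    laplace = (N1 + 1) / (N + 2)  # equals the MAP estimate when a = b = 2
    return mle, map_, laplace

print(bernoulli_estimates(np.array([1, 1, 0, 1])))  # (0.75, 0.666..., 0.666...)
```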
Categorical (multinoulli) distribution, $x_i \sim \mathrm{Cat}(\boldsymbol{\theta})$ with $x_i \in \{1, \dots, K\}$ and $N_k = \sum_{i=1}^{N} \mathbb{I}(x_i = k)$.

MLE of $\theta_k$:

$$\hat{\theta}_{k,\text{MLE}} = \frac{N_k}{N}$$

MAP of $\theta_k$ under a symmetric $\mathrm{Dir}(\alpha, \dots, \alpha)$ prior:

$$\hat{\theta}_{k,\text{MAP}} = \frac{N_k + \alpha - 1}{N + K(\alpha - 1)}$$

When $\alpha = 1$, the MAP estimate is the same as the MLE estimate. The Laplace smoothing is

$$\hat{\theta}_k = \frac{N_k + 1}{N + K}$$
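The categorical case follows the same pattern with per-class counts; a sketch assuming the symmetric Dirichlet prior above, with classes encoded as $0, \dots, K-1$:

```python
import numpy as np

def categorical_estimates(x, K, alpha=2.0):
    """MLE, MAP (symmetric Dir(alpha) prior), and Laplace smoothing."""
    N = len(x)
    counts = np.bincount(x, minlength=K).astype(float)   # N_k for k = 0, ..., K-1
    mle = counts / N
    map_ = (counts + alpha - 1) / (N + K * (alpha - 1))  # alpha = 1 recovers the MLE
    laplace = (counts + 1) / (N + K)
    return mle, map_, laplace
```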
Univariate Gaussian distribution, $x_i \sim \mathcal{N}(\mu, \sigma^2)$.

MLE of $\mu$:

$$\hat{\mu} = \frac{1}{N} \sum_{i=1}^{N} x_i$$

MLE of $\sigma^2$:

$$\hat{\sigma}^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \hat{\mu})^2$$
Multivariate Gaussian distribution, $x_i \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ with $x_i \in \mathbb{R}^D$.

MLE of $\boldsymbol{\mu}$:

$$\hat{\boldsymbol{\mu}} = \frac{1}{N} \sum_{i=1}^{N} x_i$$

MLE of $\boldsymbol{\Sigma}$:

$$\hat{\boldsymbol{\Sigma}} = \frac{1}{N} \sum_{i=1}^{N} (x_i - \hat{\boldsymbol{\mu}})(x_i - \hat{\boldsymbol{\mu}})^\top$$
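The Gaussian MLEs are one-liners in NumPy; a sketch for the multivariate case (the univariate estimates are the special case $D = 1$):

```python
import numpy as np

def gaussian_mle(X):
    """MLE of the mean vector and covariance matrix, X of shape (N, D)."""
    mu = X.mean(axis=0)
    centered = X - mu
    Sigma = centered.T @ centered / len(X)  # note: divides by N, not N - 1
    return mu, Sigma
```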
Gaussian Discriminant Analysis
The Gaussian Discriminant Analysis (GDA) model assumes that the data points of each class $c$ are generated from a multivariate Gaussian distribution $p(x \mid y = c) = \mathcal{N}(x \mid \mu_c, \Sigma_c)$, where $\mu_c$ is the mean vector and $\Sigma_c$ is the covariance matrix. The model also assumes that the labels are generated from a multinoulli distribution $p(y = c) = \pi_c$, the prior probability of each class.
Linear discriminant analysis (LDA) is a special case of GDA where the covariance matrix is the same for all classes, $\Sigma_c = \Sigma$. In this case, the decision boundary is linear: expanding the log posterior, the quadratic term $-\frac{1}{2} x^\top \Sigma^{-1} x$ is the same for every class and can be absorbed into the constant, leaving

$$\log p(y = c \mid x) = x^\top \Sigma^{-1} \mu_c - \frac{1}{2} \mu_c^\top \Sigma^{-1} \mu_c + \log \pi_c + \text{const},$$

where $\Sigma$ is symmetric (so the cross terms $x^\top \Sigma^{-1} \mu_c$ and $\mu_c^\top \Sigma^{-1} x$ are equal and combine) and $\pi_c$ is the prior probability of class $c$.
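Putting the pieces together, a minimal LDA sketch, assuming labels encoded as $0, \dots, C-1$ and the shared-covariance MLEs above (function names are illustrative):

```python
import numpy as np

def fit_lda(X, y, C):
    """Fit LDA: per-class means, pooled covariance, and class priors (all MLEs)."""
    N, D = X.shape
    pi = np.bincount(y, minlength=C) / N
    mu = np.array([X[y == c].mean(axis=0) for c in range(C)])
    centered = X - mu[y]               # subtract each point's own class mean
    Sigma = centered.T @ centered / N  # shared (pooled) covariance
    return pi, mu, Sigma

def predict_lda(X, pi, mu, Sigma):
    """Evaluate the linear discriminant of each class and take the argmax."""
    P = np.linalg.inv(Sigma)
    # score_c(x) = x^T P mu_c - 0.5 * mu_c^T P mu_c + log pi_c
    scores = X @ P @ mu.T - 0.5 * np.sum((mu @ P) * mu, axis=1) + np.log(pi)
    return np.argmax(scores, axis=1)
```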