Generative Classifier


Introduction

Given a set of data points $X$ and their corresponding labels $Y$, a generative classifier models the joint distribution $P(X, Y)$ and then uses Bayes' rule (or the chain rule) to compute $P(Y \mid X)$. The data points are arranged as the columns of a matrix:

$$
X_{D \times N} =
\begin{bmatrix}
x_{1,1} & x_{1,2} & \cdots & x_{1,N} \\
x_{2,1} & x_{2,2} & \cdots & x_{2,N} \\
\vdots  & \vdots  & \ddots & \vdots  \\
x_{D,1} & x_{D,2} & \cdots & x_{D,N}
\end{bmatrix}
=
\begin{bmatrix}
\mathbf{x}_1 & \mathbf{x}_2 & \cdots & \mathbf{x}_N
\end{bmatrix}
$$

where $D$ is the number of features and $N$ is the number of data points. The label of each data point, $y_i$, is a discrete random variable that can take on one of $K$ values:

$$y_i \in \{1, 2, \dots, K\}$$

Sometimes $y_i$ is a binary random variable belonging to one of two classes, $y_i \in \{0, 1\}$. The exact form of the label does not matter, as long as it is discrete. The goal of the generative classifier is to predict the label of a new data point $x_{\text{new}}$ given the training data $X$ and $Y$. It does this either by maximizing the posterior probability $P(Y \mid X)$ (which is equivalent to minimizing the misclassification rate) or by minimizing the expected risk $R$.

Maximizing the Posterior Probability

Given a new data point $x$, the misclassification rate when predicting the label of $x$ as $c$ is

$$P(y \neq c \mid x) = 1 - P(y = c \mid x)$$

Using Bayes’ rule, we can rewrite this as

$$P(y \neq c \mid x) = 1 - \frac{P(x \mid y = c)\, P(y = c)}{P(x)}$$

To minimize the misclassification rate, we want to maximize the posterior probability $P(y = c \mid x)$. Since $P(x)$ does not depend on $c$, it suffices to maximize $P(x \mid y = c)\, P(y = c)$, the product of the class-conditional likelihood of $x$ given $y = c$ and the prior probability of $y = c$. The prediction $\hat{y}$ is

$$\hat{y} = \operatorname*{arg\,min}_c P(y \neq c \mid x) = \operatorname*{arg\,max}_c P(x \mid y = c)\, P(y = c)$$
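
As a minimal sketch of this decision rule, the snippet below scores each candidate class by its class-conditional likelihood times its prior and returns the arg max. The `likelihoods` and `priors` arguments are hypothetical placeholders for quantities that are estimated in the later sections.

```python
import numpy as np

# Sketch of the rule y_hat = argmax_c P(x | y=c) * P(y=c).
# likelihoods[c] is assumed to be a callable returning P(x | y=c),
# and priors[c] is P(y=c); both are placeholders here.
def predict(x, likelihoods, priors):
    scores = [likelihoods[c](x) * priors[c] for c in range(len(priors))]
    return int(np.argmax(scores))
```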

Minimizing the Expected Risk

The cost of different types of misclassification can differ. For example, the cost of misclassifying a malignant tumor as benign is much higher than the cost of misclassifying a benign tumor as malignant. In this case, we should define a loss function $L(y, \hat{y})$ that captures the cost of misclassifying the true label $y$ as $\hat{y}$. The expected risk is then

$$R(\hat{y} \mid x) = \sum_{y=1}^{K} L(y, \hat{y})\, P(y \mid x)$$

where $P(y \mid x)$ is the posterior probability of $y$ given $x$. The prediction $\hat{y}$ is

$$\hat{y} = \operatorname*{arg\,min}_c R(c \mid x) = \operatorname*{arg\,min}_c \sum_{y=1}^{K} L(y, c)\, P(y \mid x)$$

Since $L(y, y) = 0$, the expected risk only sums over the wrong classes: it is a cost-weighted version of the misclassification rate. A simple example of a loss function is the 0-1 loss, where

$$L(y, \hat{y}) = \begin{cases} 0 & \text{if } y = \hat{y} \\ 1 & \text{if } y \neq \hat{y} \end{cases}$$

In this case, the expected risk is exactly the misclassification rate:

$$R(\hat{y} \mid x) = \sum_{y=1}^{K} L(y, \hat{y})\, P(y \mid x) = P(y \neq \hat{y} \mid x)$$

The loss function is also written in the short form $L_{\hat{y}, y}$, where $\hat{y}$ is the predicted label and $y$ is the true label.
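
The sketch below illustrates the risk-minimizing rule with a made-up loss matrix and posterior for the tumor example. It uses the short form $L_{\hat{y}, y}$: rows index the predicted label, columns the true label.

```python
import numpy as np

# Illustrative loss matrix L_{y_hat, y}: rows = predicted label, columns = true label.
# Class 0 = benign, class 1 = malignant (costs are made up for illustration).
loss = np.array([[0.0, 10.0],   # predict benign: very costly if the tumor is malignant
                 [1.0,  0.0]])  # predict malignant: mildly costly if it is benign
posterior = np.array([0.7, 0.3])   # P(y | x)

risk = loss @ posterior            # R(c | x) = sum_y L_{c, y} P(y | x) for each c
y_hat = int(np.argmin(risk))
print(risk, y_hat)                 # [3.  0.7] -> predicts class 1 (malignant)
```

With the 0-1 loss, the same posterior would yield class 0 (benign), since $P(y = 0 \mid x) = 0.7$ is the larger posterior; the asymmetric costs flip the decision.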

Parameter Estimation

In the following, we list common distributions and their corresponding parameter estimates, which are used to model the class-conditional density $P(x \mid y)$ and the prior $P(y)$.

Bernoulli Distribution

$$f(x; p) = \begin{cases} p & x = 1 \\ 1 - p & x = 0 \end{cases}$$

MLE of p:

$$\hat{p} = \frac{N_1}{N}$$

where $N_1$ is the number of samples with $x_i = 1$.

MAP of p:

$$\hat{p} = \frac{N_1 + \alpha - 1}{N + \alpha + \beta - 2}$$

where $\alpha$ and $\beta$ are the hyperparameters of a $\mathrm{Beta}(\alpha, \beta)$ prior on $p$.

Laplace smoothing gives

$$\hat{p} = \frac{N_1 + 1}{N + 2}$$
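
A small sketch of the three estimates on a toy sample; the Beta hyperparameters $\alpha = \beta = 2$ are an assumption for illustration (with this choice the MAP estimate coincides with Laplace smoothing).

```python
import numpy as np

# Toy Bernoulli sample and the three estimates above.
x = np.array([1, 0, 1, 1, 0, 1])
N, N1 = len(x), int(x.sum())
alpha, beta = 2.0, 2.0                              # assumed Beta prior hyperparameters

p_mle = N1 / N                                      # MLE
p_map = (N1 + alpha - 1) / (N + alpha + beta - 2)   # MAP with Beta(alpha, beta) prior
p_laplace = (N1 + 1) / (N + 2)                      # Laplace smoothing
print(p_mle, p_map, p_laplace)                      # 0.666..., 0.625, 0.625
```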

Multinoulli Distribution

$$f(x; p) = \begin{cases} p_1 & x = 1 \\ p_2 & x = 2 \\ \vdots & \vdots \\ p_K & x = K \end{cases}$$

MLE of p:

$$\hat{p}_k = \frac{N_k}{N}$$

where $N_k$ is the number of samples with $x_i = k$.

MAP of p:

$$\hat{p}_k = \frac{N_k + \alpha_k - 1}{N + \sum_{j=1}^{K} \alpha_j - K}$$

where $\alpha_1, \dots, \alpha_K$ are the hyperparameters of a $\mathrm{Dirichlet}(\alpha_1, \dots, \alpha_K)$ prior on $p$.

When $\alpha_k = 1$ for all $k$, the MAP estimate coincides with the MLE. Laplace smoothing gives

$$\hat{p}_k = \frac{N_k + 1}{N + K}$$
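
A corresponding sketch for the multinoulli case with $K = 3$; the Dirichlet hyperparameters are again an assumption for illustration.

```python
import numpy as np

# Toy multinoulli sample with K = 3 classes and the three estimates above.
x = np.array([1, 2, 2, 3, 2, 1])                    # observations in {1, ..., K}
K, N = 3, len(x)
N_k = np.array([(x == k).sum() for k in range(1, K + 1)])
alpha = np.full(K, 2.0)                             # assumed Dirichlet hyperparameters

p_mle = N_k / N                                     # MLE
p_map = (N_k + alpha - 1) / (N + alpha.sum() - K)   # MAP with Dirichlet(alpha) prior
p_laplace = (N_k + 1) / (N + K)                     # Laplace smoothing
print(p_mle, p_map, p_laplace)
```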

Gaussian Distribution

$$f(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}}$$

MLE of μ:

$$\hat{\mu} = \frac{1}{N} \sum_{i=1}^{N} x_i$$

MLE of σ2:

$$\hat{\sigma}^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \hat{\mu})^2$$
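A quick check of the two estimates on a toy sample (numbers are arbitrary):

```python
import numpy as np

# Univariate Gaussian MLEs on a toy sample.
x = np.array([2.1, 1.9, 2.4, 2.0, 1.6])
mu_hat = x.mean()                          # (1/N) sum_i x_i
sigma2_hat = ((x - mu_hat) ** 2).mean()    # (1/N) sum_i (x_i - mu_hat)^2 (biased MLE)
print(mu_hat, sigma2_hat)
```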

Multivariate Gaussian Distribution

$$f(x; \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^D |\Sigma|}}\, e^{-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)}$$

MLE of μ:

$$\hat{\mu} = \frac{1}{N} \sum_{i=1}^{N} x_i$$

MLE of Σ:

$$\hat{\Sigma} = \frac{1}{N} \sum_{i=1}^{N} (x_i - \hat{\mu})(x_i - \hat{\mu})^T$$
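The same estimates in the $D \times N$ layout used above, with one data point per column (numbers are arbitrary):

```python
import numpy as np

# Multivariate Gaussian MLEs with one data point per column (D x N layout).
X = np.array([[2.1, 1.9, 2.4, 2.0],
              [0.5, 0.7, 0.4, 0.6]])            # D = 2 features, N = 4 points
N = X.shape[1]
mu_hat = X.mean(axis=1, keepdims=True)          # (1/N) sum_i x_i
centered = X - mu_hat
Sigma_hat = centered @ centered.T / N           # (1/N) sum_i (x_i - mu)(x_i - mu)^T
print(mu_hat.ravel(), Sigma_hat, sep="\n")
```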

Gaussian Discriminant Analysis

The Gaussian Discriminant Analysis (GDA) model assumes that, conditioned on the label $y$, the data point $x$ is generated from a class-specific multivariate Gaussian distribution $\mathcal{N}(\mu_y, \Sigma_y)$, where $\mu_y$ is the mean vector and $\Sigma_y$ is the covariance matrix of class $y$. The model also assumes that the labels $y$ are generated from a multinoulli distribution $\mathrm{Multinoulli}(\phi)$, where $\phi$ holds the prior probabilities of the classes.

$$
\begin{aligned}
P(y \mid x) &= \frac{P(x \mid y)\, P(y)}{P(x)} \propto P(x \mid y)\, P(y) = \mathcal{N}(x \mid \mu_y, \Sigma_y)\, \phi_y \\
\ln P(y \mid x) &= \ln \mathcal{N}(x \mid \mu_y, \Sigma_y) + \ln \phi_y - \ln P(x) \\
&= -\frac{D}{2} \ln 2\pi - \frac{1}{2} \ln |\Sigma_y| - \frac{1}{2} (x - \mu_y)^T \Sigma_y^{-1} (x - \mu_y) + \ln \phi_y - \ln P(x) \\
\ln \frac{P(y = i \mid x)}{P(y = j \mid x)} &= -\frac{1}{2} \ln \frac{|\Sigma_i|}{|\Sigma_j|} - \frac{1}{2} (x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) + \frac{1}{2} (x - \mu_j)^T \Sigma_j^{-1} (x - \mu_j) + \ln \frac{\phi_i}{\phi_j}
\end{aligned}
$$

Linear discriminant analysis (LDA) is a special case of GDA where the covariance matrix $\Sigma$ is the same for all classes. In this case, the decision boundary is linear.

$$
\begin{aligned}
\ln \frac{P(y = i \mid x)}{P(y = j \mid x)} &= -\frac{1}{2} (x - \mu_i)^T \Sigma^{-1} (x - \mu_i) + \frac{1}{2} (x - \mu_j)^T \Sigma^{-1} (x - \mu_j) + \ln \frac{\phi_i}{\phi_j} \\
&= -\frac{1}{2} x^T \Sigma^{-1} x + x^T \Sigma^{-1} \mu_i - \frac{1}{2} \mu_i^T \Sigma^{-1} \mu_i + \frac{1}{2} x^T \Sigma^{-1} x - x^T \Sigma^{-1} \mu_j + \frac{1}{2} \mu_j^T \Sigma^{-1} \mu_j + \ln \frac{\phi_i}{\phi_j} \\
&= x^T \Sigma^{-1} (\mu_i - \mu_j) - \frac{1}{2} \mu_i^T \Sigma^{-1} \mu_i + \frac{1}{2} \mu_j^T \Sigma^{-1} \mu_j + \ln \frac{\phi_i}{\phi_j}
\end{aligned}
$$

where $\Sigma^{-1}$ is symmetric and $\phi_i$ is the prior probability of class $i$.
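
To tie the pieces together, below is a minimal GDA sketch; the function names and the $D \times N$ data layout are assumptions for illustration. Per-class Gaussians are fit with the MLEs above, the prior $\phi$ with class frequencies, and prediction maximizes $\ln \mathcal{N}(x \mid \mu_c, \Sigma_c) + \ln \phi_c$. Replacing the per-class $\Sigma_c$ with a single pooled covariance turns this into LDA with its linear decision boundary.

```python
import numpy as np

# Minimal GDA sketch. X holds one data point per column (D x N); y holds labels 0..K-1.
def fit_gda(X, y, K):
    D, N = X.shape
    phi, mu = np.zeros(K), np.zeros((K, D))
    Sigma = np.zeros((K, D, D))
    for c in range(K):
        Xc = X[:, y == c]                               # data points of class c
        phi[c] = Xc.shape[1] / N                        # multinoulli prior (class frequency)
        mu[c] = Xc.mean(axis=1)                         # class mean (MLE)
        centered = Xc - mu[c][:, None]
        Sigma[c] = centered @ centered.T / Xc.shape[1]  # class covariance (MLE)
    return phi, mu, Sigma

def predict_gda(x, phi, mu, Sigma):
    # ln P(y=c | x) up to a constant:
    # -1/2 ln|Sigma_c| - 1/2 (x - mu_c)^T Sigma_c^{-1} (x - mu_c) + ln phi_c
    scores = []
    for c in range(len(phi)):
        diff = x - mu[c]
        _, logdet = np.linalg.slogdet(Sigma[c])
        scores.append(-0.5 * logdet
                      - 0.5 * diff @ np.linalg.solve(Sigma[c], diff)
                      + np.log(phi[c]))
    return int(np.argmax(scores))

# Usage (with hypothetical arrays X_train (D x N), y_train (N,), x_new (D,)):
# phi, mu, Sigma = fit_gda(X_train, y_train, K=2)
# y_hat = predict_gda(x_new, phi, mu, Sigma)
```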