4.1 An Overview of Classification

Predicting a qualitative (string/label) response for an observation is referred to as classifying that observation. This chapter discusses three classifiers: logistic regression, linear discriminant analysis, and K-nearest neighbors.

4.2 Why Not Linear Regression?

In classification, the class is the label that we are trying to predict. For example, if we try to classify an image of a ball as either a football or a cricket ball, our classes are football and cricket ball. Linear regression can produce reasonable results for a two-class response coded as 0/1, but it is not appropriate when there are more than two classes, because coding the classes as numbers imposes an arbitrary ordering on them.

4.3 Logistic Regression

Logistic regression models the probability that Y belongs to a particular category. For example, to classify loan defaulters (as yes or no) given their balance, we model p(X) = Pr(default = Yes | balance). A first, linear-regression style attempt would be

p(X) = \beta_{0} + \beta_{1}X

4.3.1 The Logistic Model

We consider the two-class problem here. When predicting a response, the linear model above sometimes produces values greater than 1 or smaller than 0. This is unwarranted because a probability must lie in the [0, 1] range, and to enforce that, the logistic function takes the form

p(X) = \frac{e^{\beta_{0} + \beta_{1}X}}{1+e^{\beta_{0} + \beta_{1}X}} \tag{1}

After a bit of manipulation of (1),

\frac{p(X)}{1-p(X)}=e^{\beta_{0} + \beta_{1}X}

The left-hand side is called the odds. It can take any value in [0, \infty); values close to 0 indicate a very low probability and large values indicate a very high probability. For example, if p(X) = 0.2, the odds are 1/4, meaning on average 1 in 5 people will default.
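As a quick arithmetic check, this odds value follows directly from the rearranged form above:

\frac{p(X)}{1-p(X)} = \frac{0.2}{1-0.2} = \frac{1}{4}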

4.3.2 Estimating the Regression Coefficients

The coefficients \beta_{0}, \beta_{1} are estimated using maximum likelihood. The intuition behind maximum likelihood is that we seek \beta_{0}, \beta_{1} such that the predicted probability \hat{p}(X) is close to 1 for all observations in class 1 and close to 0 for all observations in class 2. The mathematical details of maximum likelihood are beyond the scope of this book; any statistical package/software can be used to estimate the coefficients.
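For concreteness, here is a minimal sketch of estimating \beta_{0} and \beta_{1} by maximum likelihood with statsmodels. The data are simulated, and the variable names and "true" coefficients are illustrative assumptions, not the book's actual Default data.

```python
import numpy as np
import statsmodels.api as sm

# Simulated stand-in for the Default data: one predictor (balance), binary response.
rng = np.random.default_rng(0)
balance = rng.uniform(0, 2500, size=10_000)
true_p = 1 / (1 + np.exp(-(-10.65 + 0.0055 * balance)))  # assumed "true" coefficients
default = rng.binomial(1, true_p)

X = sm.add_constant(balance)            # adds the intercept column for beta_0
fit = sm.Logit(default, X).fit(disp=0)  # fits the logistic model by maximum likelihood
print(fit.params)                       # estimated [beta_0, beta_1]
```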

4.3.3 Making Predictions

After estimating the coefficients, we simply plug them into (1) for prediction. Qualitative predictors can also be used via the dummy-variable approach. For example, the Default data set contains the qualitative variable student; to fit the model, we code 1 for student and 0 for non-student. So, if the estimated coefficients are -3.5041 (intercept: \beta_{0}) and 0.4049 (student: \beta_{1}), the predictions are

\hat{Pr}(default=Yes|student=Yes) = \frac{e^{-3.5041+0.4049\times1}}{1+e^{-3.5041+0.4049\times1}}

\hat{Pr}(default=Yes|student=No) = \frac{e^{-3.5041+0.4049\times0}}{1+e^{-3.5041+0.4049\times0}}
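Below is a minimal numeric check of these two plug-in predictions; logistic_prob is a hypothetical helper written for this sketch, not a library function.

```python
import numpy as np

def logistic_prob(beta0, beta1, x):
    """Equation (1): exp(beta0 + beta1*x) / (1 + exp(beta0 + beta1*x))."""
    z = beta0 + beta1 * x
    return np.exp(z) / (1 + np.exp(z))

beta0, beta1 = -3.5041, 0.4049
print(logistic_prob(beta0, beta1, 1))   # student = Yes -> roughly 0.043
print(logistic_prob(beta0, beta1, 0))   # student = No  -> roughly 0.029
```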

4.3.4 Multiple Logistic Regression

This section discusses predicting a binary response using multiple predictors. If X = (X_{1}, ..., X_{p}) are p predictors, then (1) can be rewritten as

p(X) = \frac{e^{\beta_{0} + \beta_{1}X_{1}+...+\beta_{p}X_{p}}}{1+e^{\beta_{0} + \beta_{1}X_{1}+...+\beta_{p}X_{p}}}

The coefficients are estimated using maximum likelihood, just as before.
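As an illustration of the multi-predictor plug-in prediction, here is a hypothetical sketch; the coefficients and predictor values below are assumed for the example only, not fitted values from the book.

```python
import numpy as np

def multi_logistic_prob(beta, x):
    """p(X) = exp(beta_0 + sum_j beta_j*x_j) / (1 + exp(beta_0 + sum_j beta_j*x_j))."""
    z = beta[0] + np.dot(beta[1:], x)
    return np.exp(z) / (1 + np.exp(z))

# Assumed coefficients for predictors (balance, income in $1000s, student dummy).
beta = np.array([-10.87, 0.0057, 0.003, -0.6468])
x_new = np.array([1500.0, 40.0, 1.0])     # e.g., a student with balance 1500 and income 40
print(multi_logistic_prob(beta, x_new))   # predicted probability of default
```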

4.3.5 Logistic Regression for >2 Response Classes

Logistic regression with more than two response classes is possible, but it is not commonly used and is therefore not discussed in this book. It is available in most statistical packages/software.

4.4 Linear Discriminant Analysis

This method models the distribution of the predictors X separately in each of the response classes, and then uses Bayes' theorem to flip these around into estimates for Pr(Y=k|X=x).

4.4.1 Using Bayes’ Theorem for Classification

Suppose there are K \geq 2 classes in the problem. Then \pi_{k} represents the prior probability of the kth class; that is, \pi_{k} is the probability that a randomly chosen observation comes from the kth class.

4.4.2 Linear Discriminant Analysis for p = 1

This is classification using one predictor. Let f_{k}(x) \equiv Pr(X=x|Y=k) denote the density function of X for an observation that comes from the kth class. In other words, f_{k}(x) is relatively large if there is a high probability that an observation in the kth class has X \approx x, and f_{k}(x) is small if it is very unlikely that an observation in the kth class has X \approx x. Now, Bayes' theorem states that

Pr(Y=k|X=x) = \frac{\pi_{k}f_{k}(x)}{\sum_{l=1}^K \pi_{l}f_{l}(x)} \tag{2}

where, in practice, \pi_{k} is estimated as \frac{\text{number of samples in the kth class}}{\text{total number of samples}}.

Assume f_{k}(x) is a Gaussian (normal) density. Then, in the one-dimensional setting,

f_{k}(x) = \frac{1}{\sqrt{2\pi}\sigma_{k}}\exp\!\left(-\frac{1}{2\sigma_{k}^2}(x-\mu_{k})^2\right)

where \mu_{k} and \sigma_{k}^2 are the mean and variance parameters of the predictor for the kth class. If the variance \sigma^2 is common to all K classes, the posterior probability of the kth class becomes

p_{k}(x)= \frac{\pi_{k}\frac{1}{\sqrt{2\pi}\sigma}\exp\!\left(-\frac{1}{2\sigma^2}(x-\mu_{k})^2\right)}{\sum_{l=1}^K \pi_{l}\frac{1}{\sqrt{2\pi}\sigma}\exp\!\left(-\frac{1}{2\sigma^2}(x-\mu_{l})^2\right)}

Taking the log of p_{k}(x) and discarding the terms that do not depend on k, assigning an observation to the class with the largest p_{k}(x) is equivalent to assigning it to the class with the largest discriminant score

\delta_{k}(x) = x\frac{\mu_{k}}{\sigma^2}-\frac{\mu_k^2}{2\sigma^2}+\log(\pi_k)

Here, \mu_k is the mean of the predictor values that belong to class k, and \sigma^2 is the weighted average of the sample variances across the K classes. In practice, the Bayes classifier cannot be calculated because \pi_k, \mu_k, and \sigma^2 are unknown; LDA approximates the Bayes classifier by plugging in estimates of these parameters.
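The following is a minimal sketch of this plug-in procedure for p = 1 on simulated data; the class structure, sample sizes, and variable names are assumptions made only for the illustration.

```python
import numpy as np

# Simulated one-predictor, two-class data (illustrative only).
rng = np.random.default_rng(0)
x0 = rng.normal(loc=-1.0, scale=1.0, size=60)   # class 0 observations
x1 = rng.normal(loc=2.0, scale=1.0, size=40)    # class 1 observations
x = np.concatenate([x0, x1])
y = np.concatenate([np.zeros(60, dtype=int), np.ones(40, dtype=int)])

K, n = 2, len(x)
pi_hat = np.array([np.mean(y == k) for k in range(K)])    # estimated class priors
mu_hat = np.array([x[y == k].mean() for k in range(K)])   # estimated class means
# Pooled (weighted-average) variance estimate shared across the classes.
sigma2_hat = sum(((x[y == k] - mu_hat[k]) ** 2).sum() for k in range(K)) / (n - K)

def delta(x_new):
    """delta_k(x) = x*mu_k/sigma^2 - mu_k^2/(2*sigma^2) + log(pi_k), for k = 0..K-1."""
    return x_new * mu_hat / sigma2_hat - mu_hat**2 / (2 * sigma2_hat) + np.log(pi_hat)

print(np.argmax(delta(0.8)))   # classify x = 0.8 into the class with the largest score
```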

4.4.3 Linear Discriminant Analysis for p > 1

Here, X = (X1, X2, . . . , Xp) is assumed to be drawn from a multivariate Gaussian (multivariate normal) distribution, with a class-specific mean vector and a common covariance matrix. Covariance measures how two variables change together. The multivariate Gaussian distribution assumes that each individual predictor follows a one-dimensional normal distribution, with some correlation between each pair of predictors.
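As a sketch of the p > 1 case, the snippet below fits LDA with scikit-learn on simulated data; the means, covariance matrix, and sample sizes are assumptions chosen only to illustrate the class-specific-mean / shared-covariance setup.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Simulated two-predictor, three-class data: class-specific means, common covariance.
rng = np.random.default_rng(0)
means = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])
cov = np.array([[1.0, 0.4], [0.4, 1.0]])
X = np.vstack([rng.multivariate_normal(m, cov, size=50) for m in means])
y = np.repeat([0, 1, 2], 50)

lda = LinearDiscriminantAnalysis()      # estimates priors, class means, pooled covariance
lda.fit(X, y)
print(lda.predict([[1.0, 1.0]]))        # predicted class for a new observation
print(lda.predict_proba([[1.0, 1.0]]))  # estimated Pr(Y = k | X = x) for each class
```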


Reference

  • James, Gareth, et al. An Introduction to Statistical Learning. Vol. 112. New York: Springer, 2013.
