4.1 An Overview of Classification

Predicting a qualitative (string/label) response for an observation can be referred to as classifying that observation. This chapter discusses three classifiers: logistic regression, linear discriminant analysis, and K-nearest neighbors.

4.2 Why Not Linear Regression?

In classification, the class is the label that we are trying to predict. For example, if we try to classify an image of a ball as a football or a cricket ball, our classes are football and cricket ball. Linear regression can produce reasonable results for a two-class problem, but it is not suitable when there are more than two classes, since encoding the classes as numbers imposes an arbitrary ordering on them.

4.3 Logistic Regression

Logistic regression models the probability that Y belongs to a particular category. For example, to classify loan defaulters (as yes or no) given their balance, we model the probability of default given balance, \(p(X) = Pr(default=Yes|balance)\). A linear model for this probability would be

\[p(X) = \beta_{0} + \beta_{1}X\]

4.3.1 The Logistic Model

This is for a two-class problem. When predicting the response, the linear model can produce values greater than 1 or smaller than 0. This is unwarranted because a probability must lie in the [0, 1] range. To enforce that, the logistic function takes the form

\[p(X) = \frac{e^{\beta_{0} + \beta_{1}X}}{1+e^{\beta_{0} + \beta_{1}X}} \tag{1}\]

After a bit of manipulation of (1),

\[\frac{p(X)}{1-p(X)}=e^{\beta_{0} + \beta_{1}X}\]

The left-hand side is called the odds. It ranges over [0, \(\infty\)); values near 0 indicate a very low probability of the event, and large values indicate a very high probability. For example, an odds of 1/4 corresponds to \(p(X) = 0.2\), since \(0.2/(1-0.2) = 1/4\); on average 1 in 5 people with these odds will default.
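
To make the relationship between the logistic model, probabilities, and odds concrete, here is a minimal sketch in Python; the coefficient and input values are arbitrary assumptions used only for illustration.

```python
import numpy as np

def logistic(b0, b1, x):
    """The logistic model (1): maps any real input to a probability in (0, 1)."""
    return np.exp(b0 + b1 * x) / (1 + np.exp(b0 + b1 * x))

def odds(p):
    """Odds p / (1 - p), ranging over [0, infinity)."""
    return p / (1 - p)

print(logistic(-3.5, 0.004, 500))  # always lies between 0 and 1
print(odds(0.2))                   # 0.2 / 0.8 = 0.25, i.e. odds of 1/4
```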

4.3.2 Estimating the Regression Coefficients

The coefficients \(\beta_{0}, \beta_{1}\) are estimated using maximum likelihood. The intuition behind maximum likelihood is that we seek \(\beta_{0}, \beta_{1}\) such that the predicted probability \(\hat{p}(X)\) is close to 1 for all observations in class 1 and close to 0 for all observations in class 2. The mathematical details of maximum likelihood are beyond the scope of this book; any statistical package/software can be used to estimate the coefficients.
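
As a sketch of what this looks like in practice, the following fits a logistic regression to a synthetic, Default-like data set with scikit-learn; the data-generating coefficients are assumptions, and C is set very large so the penalized fit approximates a plain maximum likelihood fit.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for the Default data: one predictor (balance) and a
# binary response drawn from an assumed "true" logistic model.
balance = rng.uniform(0, 2500, size=5000)
true_p = 1 / (1 + np.exp(-(-10.65 + 0.0055 * balance)))
default = rng.binomial(1, true_p)

# A very large C makes the regularization negligible, so the fitted
# coefficients are close to the maximum likelihood estimates.
model = LogisticRegression(C=1e9, max_iter=5000)
model.fit(balance.reshape(-1, 1), default)

print(model.intercept_[0], model.coef_[0, 0])  # estimates of beta_0 and beta_1
```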

4.3.3 Making Predictions

After estimating the coefficients, we simply plug them into (1) for prediction. Qualitative predictors can also be used via the dummy variable approach. For example, the Default data set contains the qualitative variable student. To fit the model, the value 1 is used for a student and 0 for a non-student. So, if the estimated coefficients are -3.5041 (intercept: \(\beta_{0}\)) and 0.4049 (student: \(\beta_{1}\)), the predictions for students and non-students are,

\[\hat{Pr}(default=Yes|student=Yes) = \frac{e^{-3.5041+0.4049\times1}}{1+e^{-3.5041+0.4049\times1}}\\\hat{Pr}(default=Yes|student=No) = \frac{e^{-3.5041+0.4049\times0}}{1+e^{-3.5041+0.4049\times0}}\]
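
A quick arithmetic check of these two predictions, plugging the reported coefficients into (1):

```python
import numpy as np

b0, b1 = -3.5041, 0.4049   # intercept and student coefficients reported above

def p_default(student):
    """Predicted Pr(default = Yes | student) from the logistic model (1)."""
    z = b0 + b1 * student
    return np.exp(z) / (1 + np.exp(z))

print(p_default(1))  # student = Yes: approx. 0.043
print(p_default(0))  # student = No:  approx. 0.029
```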

4.3.4 Multiple Logistic Regression

This section discusses predicting a binary response using multiple predictors. If \(X = (X_{1},\dots,X_{p})\) are \(p\) predictors, then (1) can be rewritten as,

\[p(X) = \frac{e^{\beta_{0} + \beta_{1}X_{1}+...+\beta_{p}X_{p}}}{1+e^{\beta_{0} + \beta_{1}X_{1}+...+\beta_{p}X_{p}}}\]

The coefficients are again estimated by maximum likelihood.
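
A compact sketch of the multiple-predictor case, reusing the same approach as above on simulated data; the predictor names mimic the Default data set, and the generating coefficients are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 5000

# Simulated predictors: balance, income (in thousands), and the student dummy.
balance = rng.uniform(0, 2500, size=n)
income = rng.uniform(10, 75, size=n)
student = rng.binomial(1, 0.3, size=n)

# Assumed "true" coefficients, used only to generate a plausible response.
logit = -10.0 + 0.005 * balance + 0.003 * income - 0.6 * student
default = rng.binomial(1, 1 / (1 + np.exp(-logit)))

X = np.column_stack([balance, income, student])
model = LogisticRegression(C=1e9, max_iter=5000).fit(X, default)
print(model.intercept_, model.coef_)  # estimates of beta_0, beta_1, beta_2, beta_3
```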

4.3.5 Logistic Regression for >2 Response Classes

Extending logistic regression to more than two response classes is possible, but such models are not used very often in practice, so they are not discussed further in this book. They are, however, available in standard statistical packages/software.
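
For reference, a minimal sketch showing that multi-class ("multinomial") logistic regression is available in standard libraries, here scikit-learn on a toy three-class data set.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Toy data with 3 response classes and 4 predictors.
X, y = make_classification(n_samples=600, n_features=4, n_informative=3,
                           n_redundant=0, n_classes=3, random_state=0)

# Recent scikit-learn versions fit a multinomial model when y has >2 classes.
clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict_proba(X[:3]))  # one probability per class; each row sums to 1
```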

4.4 Linear Discriminant Analysis

This method models the distribution of the predictors X separately in each of the response classes, and then uses Bayes' theorem to flip these around into estimates for \(Pr(Y=k|X=x)\).

4.4.1 Using Bayes’ Theorem for Classification

Suppose we wish to classify an observation into one of \(K \geq 2\) classes. Let \(\pi_{k}\) denote the prior probability of the kth class, i.e. the probability that a randomly chosen observation comes from the kth class.

4.4.2 Linear Discriminant Analysis for p = 1

This covers classification using a single predictor. Let \(f_{k}(x)\equiv Pr(X=x|Y=k)\) denote the probability density function of X for an observation that comes from the kth class. In other words, \(f_{k}(x)\) is relatively large if there is a high probability that an observation in the kth class has \(X \approx x\), and \(f_{k}(x)\) is small if it is very unlikely that an observation in the kth class has \(X \approx x\). Bayes' theorem then states that,

\[Pr(Y=k|X=x) = \frac{\pi_{k}f_{k}(x)}{\sum_{l=1}^K \pi_{l}f_{l}(x)} \tag{2}\]

where, in practice, \(\pi_{k}\) is estimated as \(\hat{\pi}_{k}=\frac{\text{number of samples in the kth class}}{\text{total number of samples}}\).

Let \(f_{k}(x)\) be a Gaussian (normal) density. Then, in the one-dimensional setting,

\[f_{k}(x) = \frac{1}{\sqrt{2\pi}\sigma_{k}}\,exp\!\left(-\frac{1}{2\sigma_{k}^2}(x-\mu_{k})^2\right)\]

where \(\mu_{k}\) and \(\sigma_{k}^2\) are the mean and variance of the predictor for the kth class. If the variance is shared across all classes, \(\sigma_{1}^2 = \dots = \sigma_{K}^2 = \sigma^2\), the posterior probability of the kth class becomes,

\[p_{k}(x)= \frac{\pi_{k}\frac{1}{\sqrt{2\pi}\sigma}exp\!\left(-\frac{1}{2\sigma^2}(x-\mu_{k})^2\right)}{\sum_{l=1}^K \pi_{l}\frac{1}{\sqrt{2\pi}\sigma}exp\!\left(-\frac{1}{2\sigma^2}(x-\mu_{l})^2\right)}\]

Taking the log of \(p_{k}(x)\) and discarding the terms that do not depend on k gives the discriminant function \(\delta_{k}(x) = x\frac{\mu_{k}}{\sigma^2}-\frac{\mu_k^2}{2\sigma^2}+log(\pi_k)\); the observation is assigned to the class for which \(\delta_{k}(x)\) is largest.

Here, \(\hat{\mu}_k\) is the mean of the training observations that belong to class k, and \(\hat{\sigma}^2\) is a weighted average of the sample variances of the K classes. In practice the Bayes classifier cannot be computed, because the true \(\mu_k\), \(\sigma^2\) and \(\pi_k\) are unknown; LDA approximates the Bayes classifier by plugging these estimates into \(\delta_k(x)\) and assigning each observation to the class with the largest \(\hat{\delta}_k(x)\).
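
To make the plug-in idea concrete, here is a minimal sketch of one-predictor LDA written directly from the discriminant function \(\delta_k(x)\) above; the data are simulated under the shared-variance assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two classes with different means but a common variance.
x = np.concatenate([rng.normal(-1.25, 1.0, size=20), rng.normal(1.25, 1.0, size=20)])
y = np.array([0] * 20 + [1] * 20)

# Plug-in estimates: class priors, class means, and the pooled variance.
classes = np.unique(y)
pi_hat = np.array([np.mean(y == k) for k in classes])
mu_hat = np.array([x[y == k].mean() for k in classes])
n, K = len(y), len(classes)
sigma2_hat = sum(((x[y == k] - mu_hat[i]) ** 2).sum() for i, k in enumerate(classes)) / (n - K)

def delta(x_new):
    """Estimated discriminants delta_k(x); LDA predicts the class with the largest value."""
    return x_new * mu_hat / sigma2_hat - mu_hat ** 2 / (2 * sigma2_hat) + np.log(pi_hat)

print(delta(0.5), "->", classes[np.argmax(delta(0.5))])
```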

4.4.3 Linear Discriminant Analysis for p > 1

X = (X1, X2, . . . , Xp) is assumed to be drawn from a multivariate Gaussian (multivariate normal) distribution, with a class-specific mean vector and a common covariance matrix. Covariance measures how two variables change together. The multivariate Gaussian distribution assumes that each individual predictor follows a one-dimensional normal distribution, with some correlation between each pair of predictors.
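
For p > 1 the same idea is available directly in software; a minimal sketch using scikit-learn's LinearDiscriminantAnalysis on the built-in Iris data (four predictors, three classes).

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Iris: p = 4 predictors, K = 3 classes.
X, y = load_iris(return_X_y=True)

# LDA estimates a class-specific mean vector and a shared covariance matrix,
# then assigns each observation to the class with the largest discriminant score.
lda = LinearDiscriminantAnalysis().fit(X, y)

print(lda.means_)          # class-specific mean vectors
print(lda.predict(X[:5]))  # predicted classes for the first five observations
print(lda.score(X, y))     # training accuracy
```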

