
Boosting

Boosting is also a general approach that can be applied to many statistical methods. Unlike bagging, boosting uses the same dataset for every model rather than bootstrapped samples. Boosting works on top of weak learners, i.e. very simple algorithms or a simple version of a powerful algorithm. For example, a decision tree with only one split allowed (a stump) is a weak learner; its output is only slightly better than random guessing. Boosting trains the same weak learner multiple times and, after each training iteration, updates the dataset (in AdaBoost's case, the weights attached to the training samples).
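As a quick illustration of how weak such a learner is, here is a minimal sketch using scikit-learn's DecisionTreeClassifier (the toy arrays X_toy and y_toy are made up): a single axis-aligned split cannot separate an XOR-like pattern, so the stump's accuracy stays well below perfect.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# a weak learner: a decision stump, i.e. a tree allowed exactly one split
stump = DecisionTreeClassifier(max_depth=1, max_leaf_nodes=2)

# toy XOR-like data: no single axis-aligned split separates the two classes
X_toy = np.array([[0.1, 0.9], [0.2, 0.1], [0.8, 0.9], [0.9, 0.2]])
y_toy = np.array([1, -1, -1, 1])

stump.fit(X_toy, y_toy)
print(stump.score(X_toy, y_toy))  # accuracy stays below 1.0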

AdaBoost

AdaBoost is a binary classifier where \(Y \in \{-1, +1\}\). The algorithm begins by setting a weight \(D_i = \frac{1}{N}\) for each observation \(x_i\), where \(i = 1, \dots, N\). Suppose we train \(T\) models; the output of the \(t^{th}\) model is then \(G_t(x_i)\). The \(t^{th}\) model is trained on the weighted dataset, meaning each sample \(x_i\) contributes to the fit in proportion to its weight \(D_i\) (in the implementation below this is done through sample weights). This acts as a "punishment" for the mistakes of the previous model: samples it misclassified carry larger weights, so the current model pays more attention to them. Now that we have punished and trained, we calculate the weighted error \(err_t\):

\[err_t = \frac{ \sum_{i=1}^{N} D_i \times I(y_i \ne G_t(x_i)) }{\sum_{i=1}^{N} D_i}\]

\(I(condition)\) returns 1 if the condition is true and 0 if it is false. In the end we will have multiple models at hand, and when combining their results we should give more priority to the more accurate ones. This is achieved by calculating \(\alpha_t = \log\left( \frac{1-err_t}{err_t} \right)\). We also still have to update the weights \(D_i\); they should be updated based on how the \(t^{th}\) model performed on each sample:

\[D_i = D_i \times \exp(\alpha_t \times I(y_i \ne G_t(x_i)))\]

Using \(I(y_i \ne G_t(x_i))\) makes sure that we only increase the weights of the misclassified samples; correctly classified samples keep their weights unchanged. Finally, after training \(T\) models, we combine their results using a weighted majority vote:

\[G(x) = sign\left(\sum_{t=1}^{T} \alpha_t \times G_t(x)\right)\]

\(sign(x)\) returns -1 if \(x\) is negative, 1 if \(x\) is positive, and 0 if \(x\) is zero.
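To see these update rules in action, here is one hand-worked iteration as a small NumPy sketch (the labels and the predictions preds, standing in for \(G_t(x_i)\), are made up for illustration):

import numpy as np

N = 5
D = np.full(N, 1 / N)                    # initial weights D_i = 1/N
y_true = np.array([ 1,  1, -1, -1,  1])  # true labels
preds  = np.array([ 1, -1, -1, -1,  1])  # G_t(x_i): one sample misclassified

miss = y_true != preds                   # I(y_i != G_t(x_i)) as a boolean mask
err = np.sum(D[miss]) / np.sum(D)        # err_t = 0.2
alpha = np.log((1 - err) / err)          # alpha_t = log(4) ≈ 1.386

D[miss] = D[miss] * np.exp(alpha)        # only the misclassified weight grows
print(err, alpha, D)                     # D becomes [0.2, 0.8, 0.2, 0.2, 0.2]

# after T such rounds, the final prediction is the weighted vote
# G(x) = sign(sum_t alpha_t * G_t(x))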

Implementation

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score


class AdaBoost:
    def __init__(self, T):
        self.T = T
        self.modelContainer = []
        self.modelAlpha = []
        
    def fit(self, X, y):
        nSamples = X.shape[0]
        self.D = np.full(fill_value=(1/nSamples), shape=(nSamples))
        self.modelContainer = []
        self.modelAlpha = np.ones((self.T))
        self.estimator_weights_ = []
        self.estimator_errors_ = []
        
        for t in range(self.T):
            # weak learner: a decision stump (a tree with a single split)
            model = DecisionTreeClassifier(max_depth=1, max_leaf_nodes=2)
            model.fit(X, y, sample_weight=self.D)
            predictions = model.predict(X)
            error_indices = y != predictions

            # weighted error of the t-th model
            err = np.sum(self.D[error_indices]) / np.sum(self.D)

            # alpha_t: more accurate models get a larger say in the vote
            alpha = np.log((1 - err) / err)

            # increase the weights of the misclassified samples only
            self.D[error_indices] = self.D[error_indices] * np.exp(alpha)
            self.modelContainer.append(model)
            self.modelAlpha[t] = alpha

            # storing meta information
            self.estimator_weights_.append(alpha)
            self.estimator_errors_.append(err)

    def predict(self, X):
        nSamples = X.shape[0]
        Y = np.zeros((nSamples, self.T))
        for t in range(self.T):
            model = self.modelContainer[t]
            Y[:,t] = model.predict(X) * self.modelAlpha[t]
        y = np.sum(Y, axis=1)
        y = np.sign(y)
        return y
    
    def score(self, X, y):
        predicted = self.predict(X)
        
        return accuracy_score(y, predicted)
        

Trying out the model

x1 = np.array([.1,.2,.4,.8, .8, .05,.08,.12,.33,.55,.66,.77,.88,.2,.3,.4,.5,.6,.25,.3,.5,.7,.6])
x2 = np.array([.2,.65,.7,.6, .3,.1,.4,.66,.77,.65,.68,.55,.44,.1,.3,.4,.3,.15,.15,.5,.55,.2,.4])
y = np.array([1,1,1,1,1,1,1,1,1,1,1,1,1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1])
X = np.vstack((x1,x2)).T


sklearn’s

from sklearn.ensemble import AdaBoostClassifier

boost = AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=1, max_leaf_nodes=2),
                           algorithm='SAMME', n_estimators=3, learning_rate=1.0)
boost.fit(X,y)
boost.score(X, y)
0.8695652173913043

ours

ab = AdaBoost(3)
ab.fit(X,y)
ab.score(X,y)
0.8695652173913043

nice!

References

  1. Freund, Yoav, and Robert E. Schapire. “A desicion-theoretic generalization of on-line learning and an application to boosting.” European conference on computational learning theory. Springer, Berlin, Heidelberg, 1995.
  2. Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. The elements of statistical learning: data mining, inference, and prediction. Springer Science & Business Media, 2009.
  3. James, Gareth, et al. An introduction to statistical learning. Vol. 112. New York: Springer, 2013.
