The Elements of Statistical Learning, Chapter 16

Chapter 16. Ensemble Learning

What is the idea of Ensemble Learning?

The idea of ensemble learning is to build a prediction model by combining the strengths of a collection of simpler base models.

Zhou Zhihua on ensemble learning: to boost weak learners, which are only slightly better than random guessing, into strong learners that can make very accurate predictions.

Ensemble learning can be broken down into two tasks:

First, developing a population of base learners from the training data,

then combining them to form the composite predictor.

Zhou Zhihua:

First, a number of base learners are produced, which can be generated in a parallel style or in a sequential style where the generation of a base learner has influence on the generation of subsequent learners.

Then, the base learners are combined for use; among the most popular combination schemes are majority voting for classification and weighted averaging for regression.
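These two combination schemes can be sketched in a few lines of Python (a minimal illustration using hypothetical base-learner outputs, not tied to any particular library):

```python
from collections import Counter

def majority_vote(predictions):
    """Combine classifier outputs by picking the most common label."""
    return Counter(predictions).most_common(1)[0][0]

def weighted_average(predictions, weights):
    """Combine regressor outputs as a weighted mean."""
    total = sum(weights)
    return sum(p * w for p, w in zip(predictions, weights)) / total

# Three hypothetical base classifiers vote on one sample:
print(majority_vote(["cat", "dog", "cat"]))                # cat
# Three hypothetical base regressors, weighted (e.g., by validation score):
print(weighted_average([2.0, 3.0, 4.0], [1.0, 1.0, 2.0]))  # 3.25
```

In practice the weights often reflect each learner's estimated accuracy, so better learners contribute more to the combined prediction.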

List some methods of Ensemble Learning.

  • Bagging

    • Trains each base learner on a different bootstrap sample of the training data by calling a base learning algorithm.
    • After obtaining the base learners, Bagging combines them by majority voting and the most-voted class is predicted.
    • Example: Random Forest
    • Reduces variance
  • Boosting

    • A family of algorithms with many variants; base learners are generated sequentially, with each new learner focusing on the examples its predecessors handled poorly.
    • Example: AdaBoost
    • Reduces bias
  • Stacking

    • A number of first-level individual learners are generated from the training data set by employing different learning algorithms.
    • These individual learners are then combined by a second-level learner, which is called the meta-learner.
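The bagging procedure above can be sketched concretely. This is a minimal illustration, assuming NumPy and a hypothetical one-dimensional decision stump as the base learner (both are illustrative choices, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_stump(X, y):
    """Fit a 1-D decision stump: threshold at the midpoint of the class means."""
    return (X[y == 0].mean() + X[y == 1].mean()) / 2.0

def bagging_predict(X_train, y_train, x, n_learners=25):
    """Bagging: train each stump on a bootstrap sample, combine by majority vote."""
    n = len(X_train)
    votes = []
    for _ in range(n_learners):
        idx = rng.integers(0, n, size=n)   # bootstrap sample (with replacement)
        t = fit_stump(X_train[idx], y_train[idx])
        votes.append(int(x > t))
    return int(np.mean(votes) > 0.5)       # the most-voted class is predicted

# Toy 1-D data: class 0 clustered near 0, class 1 near 5
X = np.r_[rng.normal(0, 1, 20), rng.normal(5, 1, 20)]
y = np.r_[np.zeros(20, int), np.ones(20, int)]
print(bagging_predict(X, y, 4.5))  # -> 1
```

Because each stump sees a perturbed version of the training set, the vote averages away much of any single stump's variance, which is why bagging helps most with unstable, high-variance base learners.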

Bayesian methods for nonparametric regression can also be viewed as ensemble methods.

Generally speaking, there is no ensemble method which outperforms other ensemble methods consistently.

List some penalized regression methods and how they work.

Lasso regression and ridge regression.

Consider the dictionary of all possible J-terminal node regression trees $T=\{T_k\}$ that could be realized on the training data as basis functions in $R^p$. The linear model is

$$f(x) = \alpha_0 + \sum_{k=1}^{K} \alpha_k T_k(x).$$

Suppose the coefficients are to be estimated by least squares. Since the number of such trees is likely to be much larger than even the largest training data sets, some form of regularization is required. Let $\hat{\alpha}(\lambda)$ solve

$$\hat{\alpha}(\lambda) = \arg\min_{\alpha} \sum_{i=1}^{N} \Big( y_i - \alpha_0 - \sum_{k=1}^{K} \alpha_k T_k(x_i) \Big)^2 + \lambda \cdot J(\alpha).$$

$J(\alpha)$ is a function of the coefficients that generally penalizes larger values; ridge regression uses $J(\alpha)=\sum_k \alpha_k^2$, and the lasso uses $J(\alpha)=\sum_k |\alpha_k|$.
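For the ridge penalty, $\hat{\alpha}(\lambda)$ has a closed form. A small NumPy sketch (the random design matrix here simply stands in for the base-learner outputs $T_k(x_i)$; it is an illustrative assumption, not the tree dictionary itself):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize ||y - X a||^2 + lam * sum(a_k^2).
    Closed form: a = (X^T X + lam * I)^{-1} X^T y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(1)
# Columns of X play the role of base-learner outputs on the training set.
X = rng.normal(size=(50, 5))
y = X @ np.array([3.0, 0.0, -2.0, 0.0, 1.0]) + rng.normal(scale=0.1, size=50)

a_ols = ridge_fit(X, y, 0.0)      # lam = 0 recovers ordinary least squares
a_ridge = ridge_fit(X, y, 100.0)  # larger lam shrinks the coefficients
print(np.linalg.norm(a_ols), np.linalg.norm(a_ridge))
```

Increasing $\lambda$ shrinks the coefficient vector toward zero, which is exactly the regularization needed when the dictionary of basis functions is far larger than the training set.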

Why are ensembles superior to single learners? (generalization)

  • The training data might not provide sufficient information for choosing a single best learner.
  • The search processes of the learning algorithms might be imperfect.
  • The hypothesis space being searched might not contain the true target function, while ensembles can give a good approximation to it.

The bias-variance decomposition is often used in studying the performance of ensemble methods.
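For squared loss, the standard decomposition of expected prediction error at a point $x_0$ (consistent with the notes above that bagging mainly reduces variance while boosting mainly reduces bias) is

$$\mathrm{Err}(x_0) = \sigma_\varepsilon^2 + \big[\mathrm{Bias}(\hat{f}(x_0))\big]^2 + \mathrm{Var}(\hat{f}(x_0)),$$

where $\sigma_\varepsilon^2$ is the irreducible noise of the target, the squared bias measures how far the average prediction is from the truth, and the variance measures how much $\hat{f}(x_0)$ fluctuates across training sets.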