# Chapter 16. Ensemble Learning

## What is the idea of Ensemble Learning?

The **idea** of ensemble learning is **to build a prediction model by combining the strengths of a collection of simpler base models**.

Zhou Zhihua (Ensemble Learning): **to boost weak learners**, which are only slightly better than random guessing, into **strong learners**, which can make very accurate predictions.

Ensemble learning can be broken down into **two** tasks:

**First**, developing a population of base learners from the training data,

**then** combining them to form the composite predictor.

Zhou Zhihua:

**First**, a number of base learners are produced, which can be generated in a *parallel* style or in a *sequential* style where the generation of a base learner has influence on the generation of subsequent learners.

**Then**, the base learners are combined for use; among the most popular combination schemes are *majority voting* for classification and *weighted averaging* for regression.
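A minimal sketch of these two combination schemes in NumPy, assuming integer class labels for the voting case; all numbers and weights below are illustrative:

```python
import numpy as np

# Predictions from three base classifiers for five samples (integer class labels).
class_preds = np.array([
    [0, 1, 1, 2, 0],
    [0, 1, 2, 2, 0],
    [1, 1, 1, 2, 1],
])

# Majority voting: for each sample, predict the most frequently voted class.
votes = np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, class_preds)
print(votes)  # [0 1 1 2 0]

# Predictions from three base regressors for the same five samples.
reg_preds = np.array([
    [2.0, 3.1, 0.5, 1.2, 4.0],
    [1.8, 2.9, 0.7, 1.0, 3.8],
    [2.4, 3.3, 0.4, 1.5, 4.2],
])

# Weighted averaging: weights might reflect each learner's validation performance.
weights = np.array([0.5, 0.3, 0.2])
print(weights @ reg_preds)  # one combined prediction per sample
```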

## List some **methods** of Ensemble Learning.

**Bagging**
- Trains a number of base learners, each from a different *bootstrap* sample, by calling a base learning algorithm.
- After obtaining the base learners, Bagging combines them by majority voting, and the most-voted class is predicted (see the sketch below).
- Example: Random Forest
- Mainly reduces **variance**
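A minimal Bagging sketch, assuming scikit-learn decision trees as base learners and the Iris data purely as an illustrative dataset (hyperparameters are arbitrary):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)
n_learners, n = 25, len(X)

# Train each base learner on a different bootstrap sample (drawn with replacement).
learners = []
for _ in range(n_learners):
    idx = rng.choice(n, size=n, replace=True)
    learners.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))

# Combine by majority voting: the most-voted class is the ensemble's prediction.
all_preds = np.array([clf.predict(X) for clf in learners])          # (n_learners, n)
bagged = np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, all_preds)
print("training accuracy of the bagged ensemble:", (bagged == y).mean())
```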

**Boosting**
- Is a family of algorithms, since there are many variants.
- Trains base learners *sequentially*: each new learner is influenced by the previous ones and focuses more on the examples they handled poorly (see the sketch below).
- Example: AdaBoost
- Mainly reduces **bias**
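A minimal AdaBoost sketch using scikit-learn's implementation with its default decision-stump base learner; the dataset and `n_estimators` are illustrative choices:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Base learners are grown sequentially; after each round the training examples
# are re-weighted so the next learner concentrates on the previous mistakes.
ada = AdaBoostClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print("test accuracy:", ada.score(X_te, y_te))
```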

**Stacking**
- A number of first-level individual learners are generated from the training data set by employing different learning algorithms.
- Those individual learners are then combined by a second-level learner, which is called the *meta-learner* (see the sketch below).
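A minimal Stacking sketch, assuming scikit-learn's `StackingClassifier` with two heterogeneous first-level learners and logistic regression as the meta-learner; all model choices are illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# First-level learners generated by *different* learning algorithms.
level_one = [
    ("tree", DecisionTreeClassifier(random_state=0)),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
]

# The second-level meta-learner is fit on the first-level predictions
# (produced by internal cross-validation) and learns how to combine them.
stack = StackingClassifier(estimators=level_one,
                           final_estimator=LogisticRegression(max_iter=1000))
print("training accuracy:", stack.fit(X, y).score(X, y))
```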

Bayesian methods for nonparametric regression can also be viewed as ensemble methods.

Generally speaking, there is no ensemble method which outperforms other ensemble methods consistently.

## List some **Penalized Regression** methods and how they work

**Lasso regression** and **ridge regression**.

Consider the dictionary of all possible J-terminal node regression trees $T=\{T_k\}$ that could be realized on the training data as basis functions in $R^p$. The linear model is

$$f(x) = \sum_{k=1}^{K} \alpha_k T_k(x),$$

where $K = |T|$.

Suppose the coefficients are to be estimated by **least squares**. Since the number of such trees is likely to be much larger than even the largest training data sets, some form of regularization is required. Let $\hat{\alpha}(\lambda)$ solve

$$\hat{\alpha}(\lambda) = \arg\min_{\alpha} \sum_{i=1}^{N} \Big( y_i - \sum_{k=1}^{K} \alpha_k T_k(x_i) \Big)^2 + \lambda \cdot J(\alpha).$$

$J(\alpha)$ is a function of the coefficients that generally penalizes larger values. For **ridge regression** $J(\alpha) = \sum_k \alpha_k^2$ (an $L_2$ penalty that shrinks all coefficients), while for the **lasso** $J(\alpha) = \sum_k |\alpha_k|$ (an $L_1$ penalty that shrinks coefficients and sets many of them exactly to zero).
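A minimal sketch of this setup, with the (huge) tree dictionary approximated by a few hundred shallow regression trees grown on bootstrap samples, and scikit-learn's `Lasso`/`Ridge` standing in for the penalized least-squares fit; the dataset and every hyperparameter are illustrative:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=10, noise=5.0, random_state=0)
rng = np.random.default_rng(0)

# Approximate the dictionary {T_k}: many small (few-terminal-node) trees,
# each grown on a different bootstrap sample of the training data.
trees = []
for _ in range(200):
    idx = rng.choice(len(X), size=len(X), replace=True)
    trees.append(DecisionTreeRegressor(max_leaf_nodes=4, random_state=0).fit(X[idx], y[idx]))

# Basis expansion: column k holds T_k(x_i) for every training point x_i.
T = np.column_stack([t.predict(X) for t in trees])

# Penalized least squares with J(alpha) = sum |alpha_k| (lasso) or sum alpha_k^2 (ridge).
lasso = Lasso(alpha=1.0, max_iter=10000).fit(T, y)
ridge = Ridge(alpha=1.0).fit(T, y)
print("trees kept by the lasso:", int(np.sum(lasso.coef_ != 0)), "of", len(trees))
```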

## Why are ensembles superior to single learners? - **generalization**

- the training data might not provide sufficient information for choosing a single best learner
- the search processes of the learning algorithms might be imperfect
- the hypothesis space being searched might not contain the true target function, while ensembles can give a good approximation.

**The bias-variance decomposition** is often used in studying the performance of ensemble methods.
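As a reminder of what that decomposition says (a textbook identity, not derived in these notes): for squared error at a point $x$ with noise variance $\sigma_\varepsilon^2$,

$$\mathbb{E}\big[(y - \hat{f}(x))^2\big] = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2} + \underbrace{\operatorname{Var}\big(\hat{f}(x)\big)}_{\text{variance}} + \sigma_\varepsilon^2,$$

and if an ensemble averages $M$ identically distributed base learners, each with variance $\sigma^2$ and pairwise correlation $\rho$,

$$\operatorname{Var}\Big(\tfrac{1}{M}\textstyle\sum_{m=1}^{M}\hat{f}_m(x)\Big) = \rho\,\sigma^2 + \frac{1-\rho}{M}\,\sigma^2,$$

which is why averaging-style ensembles such as Bagging mainly reduce the variance term, while Boosting's sequential fitting mainly attacks the bias term.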