# Random Forests in Data Mining

November 28, 2017

Data analysis and machine learning have become an integral part of modern scientific methodology, offering automated procedures for predicting a phenomenon from past observations, uncovering underlying patterns in data, and providing insights about the problem. From its beginnings, artificial intelligence has been driven by the ambition to understand and uncover complex relations in data: to find models that not only produce accurate predictions, but can also be used to extract knowledge in an intelligible way. This section describes random forests in detail.

### Random Forest Definition

A Random Forest consists of a collection, or ensemble, of simple tree predictors, each capable of producing a response when presented with a set of predictor values. For classification problems, this response takes the form of a class membership, which associates, or classifies, a set of independent predictor values with one of the categories present in the dependent variable. For regression problems, the tree response is an estimate of the dependent variable given the predictors. A Random Forest may contain an arbitrary number of simple trees, which together determine the final outcome. For classification problems, the ensemble of simple trees votes for the most popular class; for regression problems, their responses are averaged to obtain an estimate of the dependent variable. Using tree ensembles can lead to a significant improvement in prediction accuracy (i.e., a better ability to predict new data cases).

Random Forests are a combination of tree predictors in which each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. The basic principle is that a group of “weak learners” can come together to form a “strong learner”. Random Forests are a practical tool for making predictions: as more trees are added they do not overfit, a consequence of the law of large numbers, and introducing the right kind of randomness makes them accurate classifiers and regressors.

A Random Forest grows many classification trees. Each tree is grown as follows:

• If the number of cases in the training set is N, sample N cases at random, with replacement, from the original data. This bootstrap sample will be the training set for growing the tree.
• If there are M input variables, a number m ≪ M is specified such that at each node, m variables are selected at random out of the M, and the best split on these m variables is used to split the node. The value of m is held constant while the forest grows.
• Each tree is grown to the largest extent possible. There is no pruning.
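The three steps above can be sketched in Python using scikit-learn's `DecisionTreeClassifier` as the base learner. This is a minimal illustrative sketch, not a production implementation; the choices of `K` (forest size) and `m` (variables per split), and the use of the iris data set, are assumptions for the example only.

```python
# Minimal sketch of the forest-growing procedure described above.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = load_iris(return_X_y=True)
N, M = X.shape
K = 25   # number of trees in the forest (illustrative choice)
m = 2    # variables tried at each split, m < M, held constant

forest = []
for i in range(K):
    # Step 1: bootstrap sample of N cases, drawn with replacement.
    idx = rng.integers(0, N, size=N)
    # Steps 2-3: the tree tries m random variables per node and is
    # grown to full depth with no pruning (scikit-learn's default).
    tree = DecisionTreeClassifier(max_features=m, random_state=i)
    tree.fit(X[idx], y[idx])
    forest.append(tree)

# Classification: each tree votes; the most popular class wins.
votes = np.stack([t.predict(X) for t in forest])
pred = np.array([np.bincount(col).argmax() for col in votes.T])
accuracy = (pred == y).mean()
print(accuracy)
```

Note that scikit-learn's `max_features` parameter implements the per-node random selection of m variables, so randomness enters both through the bootstrap sample and through the split search.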

Figure 1: An Example of Random Forest

Random Forest is a go-to machine learning algorithm that uses a bagging approach to create a collection of decision trees, each built from a random subset of the data. It is considered one of the most effective algorithms for almost any prediction task, and it can be used for both classification and regression problems.

### Technical Details

The Random Forest algorithm is among the best classification algorithms, able to classify large amounts of data with high accuracy. It is one of the most popular and most powerful machine learning algorithms, a type of ensemble method called Bootstrap Aggregation, or bagging, and it can also be thought of as a form of nearest-neighbor predictor. Random Forests construct a number of decision trees at training time and, for classification, output the class that is the mode of the classes output by the individual trees.

The response of each tree depends on a set of predictor values chosen independently (with replacement) and with the same distribution for all trees in the forest; this set is a subset of the predictor values of the original data set. A commonly recommended size for the subset of predictor variables is $\log_2(M+1)$, where $M$ is the number of inputs. For classification problems, given a set of simple trees and a set of random predictor variables, the Random Forest method defines a margin function that measures the extent to which the average number of votes for the correct class exceeds the average vote for any other class present in the dependent variable. This measure provides not only a convenient way of making predictions, but also a way of associating a confidence measure with those predictions.
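Following Breiman's formulation, the margin function described above can be written explicitly. Here $h_k$ denotes the $k$-th tree classifier, $I(\cdot)$ the indicator function, and $\operatorname{av}_k$ the average over the trees; the notation is an assumption consistent with that formulation rather than taken from the original text.

```latex
\operatorname{mg}(X, Y) =
  \operatorname{av}_k I\big(h_k(X) = Y\big)
  \;-\; \max_{j \neq Y} \operatorname{av}_k I\big(h_k(X) = j\big)
```

A positive margin means the ensemble classifies the case $(X, Y)$ correctly, and a larger margin corresponds to greater confidence in that classification.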

For regression problems, Random Forests are formed by growing simple trees, each capable of producing a numerical response value. Here, too, the predictor set is randomly selected, with the same distribution for all trees. Given the above, the mean-square error for a Random Forest is given by:

$$\text{Mean Error} = \left(\text{Observed} - \text{Tree Response}\right)^2$$

The predictions of the Random Forest are taken to be the average of the predictions of the trees:

$$\text{Random Forest Prediction} = \frac{1}{K} \sum_{k=1}^{K} k^{\text{th}}\ \text{Tree Response}$$

Here the index $k$ runs over the individual trees in the forest, and $K$ is the total number of trees.
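The averaging rule above can be sketched directly; the tree responses below are illustrative numbers, not outputs from any real data set.

```python
# Hedged sketch of the regression combination rule: the forest's
# prediction is the plain average of the K individual tree responses.
tree_responses = [2.4, 2.9, 3.1, 2.6, 3.0]  # illustrative k-th tree outputs
K = len(tree_responses)
forest_prediction = sum(tree_responses) / K
print(forest_prediction)  # → 2.8
```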

