Introduction to the Decision Tree and Random Forest Algorithms

With the increase in computational power, machine learning today lets us choose algorithms that perform very intensive calculations.

Decision Tree and Random Forest are among the best-known machine learning algorithms.

Decision Tree

Decision trees are widely used in data mining, text mining, information extraction, machine learning, pattern recognition, and related fields.

But there are many other types of classifiers, such as neural networks or support vector machines. So why use a decision tree?

  • Decision trees are represented graphically as hierarchical structures, which gives them a valuable property: they are easy to read and understand.
  • In fact, they are among the few models that are interpretable, where you can understand exactly why the classifier makes a decision. They can also handle both numerical and categorical data.

Now that we know why decision trees are useful, let's look at what a decision tree actually is.

  • A decision tree is a classifier expressed as a recursive partition of the instance space, and it is a popular technique in data mining.
  • In a decision tree, each internal node splits the instance space into two or more sub-spaces according to a certain discrete function of the input attribute values.
  • It represents decisions and decision making, allowing the user to take a problem with multiple possible solutions and display it in a simple, easily understandable format.
Parts of a decision tree
  1. Root node (the topmost node in the tree; it has no incoming edges).
  2. Internal or test nodes (they denote a test on an attribute, have outgoing edges, and are drawn as circles).
  3. Branches (they denote the outcomes of a test).
  4. Leaf or decision nodes (they represent a classification or decision and are drawn as triangles). These parts are illustrated in the sketch below.
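To make these parts concrete, here is a minimal sketch of a decision tree as a nested Python structure; the "outlook" and "humidity" attributes and their values are invented purely for illustration. Classification recursively partitions: it follows branches from the root down to a leaf.

tree = {                                   # root node (no incoming edges)
    "attribute": "outlook",                # internal/test node: tests one attribute
    "branches": {                          # branches: the outcomes of the test
        "sunny": {
            "attribute": "humidity",
            "branches": {"high": "no", "normal": "yes"},  # leaf nodes
        },
        "overcast": "yes",                 # leaf node: a decision/class label
        "rainy": "no",
    },
}

def classify(node, instance):
    """Follow branches from the root until a leaf is reached."""
    if not isinstance(node, dict):         # a leaf holds the class label
        return node
    value = instance[node["attribute"]]    # evaluate the test at this node
    return classify(node["branches"][value], instance)

print(classify(tree, {"outlook": "sunny", "humidity": "normal"}))  # -> yes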

Decision trees used in data mining can be divided into two types:

  • Classification tree analysis (when the predicted result is the class to which the data belongs).
  • Regression tree analysis (when the predicted result is a real number).

The term Classification And Regression Tree (CART) analysis is an umbrella term that covers both of the above procedures.
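As a quick sketch of the two procedures, here is a toy example using scikit-learn (assumed to be installed); the one-feature data is made up purely for illustration.

from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

X = [[0], [1], [2], [3]]                 # one numeric feature

clf = DecisionTreeClassifier().fit(X, ["a", "a", "b", "b"])
print(clf.predict([[2.6]]))              # -> ['b'], the class the data belongs to

reg = DecisionTreeRegressor().fit(X, [0.0, 0.1, 0.9, 1.0])
print(reg.predict([[2.6]]))              # -> a real number (here 1.0)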

Decision tree algorithms
  1. ID3 (Iterative Dichotomiser 3)
  2. C4.5 (successor of ID3)
  3. CART (Classification And Regression Trees)
  4. CHAID (CHi-squared Automatic Interaction Detector)
  5. MARS (Multivariate Adaptive Regression Splines)
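For a flavor of how such algorithms choose splits, here is a minimal sketch of the entropy and information-gain computation that ID3 uses; the labels and the candidate split below are invented for illustration.

from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(labels, groups):
    """Entropy before the split minus the weighted entropy after it."""
    total = len(labels)
    after = sum(len(g) / total * entropy(g) for g in groups)
    return entropy(labels) - after

labels = ["yes", "yes", "no", "no"]
# A split on a hypothetical attribute that separates the classes perfectly:
print(information_gain(labels, [["yes", "yes"], ["no", "no"]]))  # -> 1.0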

Random Forest Algorithm

This algorithm is essentially a bootstrapping (bagging) algorithm built on the decision tree (CART) model.

The difference between the decision tree algorithm and the Random Forest algorithm is that in a Random Forest, the processes of finding the root node and splitting the feature nodes run on random subsets of the data and features. In other words, the Random Forest algorithm works as a large collection of decorrelated decision trees.

By using different initial variables and different samples, Random Forest builds multiple CART models. In this way it captures the variance of several input variables and allows a high number of observations to participate in the prediction process.

This results in much more accurate predictions compared to a single CART or regression model.
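Here is a minimal sketch of that idea, assuming scikit-learn is available for the individual CART trees; the toy data, the forest size of 25 trees, and the query point are all invented for illustration.

import random
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

X = [[0, 0], [0, 1], [1, 0], [1, 1], [2, 2], [2, 3], [3, 2], [3, 3]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

forest = []
for _ in range(25):
    idx = [random.randrange(len(X)) for _ in range(len(X))]  # bootstrap sample
    tree = DecisionTreeClassifier(max_features=1)            # random feature per split
    tree.fit([X[i] for i in idx], [y[i] for i in idx])
    forest.append(tree)

votes = [t.predict([[2.5, 2.5]])[0] for t in forest]         # each tree votes
print(Counter(votes).most_common(1)[0][0])                   # majority vote (1 here)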

Two major beliefs behind the Random Forest algorithm are:
  • Most of the trees can provide a correct prediction of the class for most of the data.
  • The trees make their mistakes in different places.
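These two beliefs are exactly why majority voting works. As a rough illustration, assuming (unrealistically) fully independent trees that are each correct 70% of the time, the majority vote of a large forest is almost always correct:

from math import comb

def majority_accuracy(n_trees, p):
    """P(majority of n independent votes is correct), each vote correct with prob p."""
    needed = n_trees // 2 + 1
    return sum(comb(n_trees, k) * p**k * (1 - p)**(n_trees - k)
               for k in range(needed, n_trees + 1))

print(majority_accuracy(1, 0.7))    # 0.70 -> a single tree
print(majority_accuracy(101, 0.7))  # ~1.0 -> a forest of 101 trees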
Now here are some key features of the Random Forest algorithm:
  • It can be used for both classification and regression problems.
  • As the number of trees increases, the results become more accurate.
  • It can be run efficiently on large databases.
  • It generates an internal, unbiased estimate of the generalization error as the forest building progresses (see the out-of-bag sketch below), along with estimates of which variables are important in the classification.
  • It is effective for estimating missing data and maintains accuracy when a large proportion of the data is missing.
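The internal error estimate mentioned above is the out-of-bag (OOB) score: each tree is evaluated on the samples left out of its bootstrap sample. A minimal sketch with scikit-learn, using the iris dataset purely as a stand-in:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
forest.fit(X, y)
print(forest.oob_score_)   # internal estimate of generalization accuracy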
Advantages of the Random Forest algorithm
  • It is far less prone to overfitting in classification problems than a single decision tree.
  • The same Random Forest algorithm can be used for both classification and regression tasks.
  • The Random Forest algorithm can be used for feature engineering, i.e., identifying the most important features among those available in the training dataset (see the sketch after this list).
  • A generated forest can be saved and reused on other data in the future.
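For the feature-engineering use just mentioned, scikit-learn's random forest exposes per-feature importance scores. A short sketch, again using iris purely as a stand-in dataset:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

data = load_iris()
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(data.data, data.target)

ranked = sorted(zip(data.feature_names, forest.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, score in ranked:                # most important features first
    print(f"{name}: {score:.3f}")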
Applications of the Random Forest algorithm

In Banking: to identify loyal customers and to detect fraudulent customers.

In Medicine: to identify the correct combination of components to validate a medicine, and to identify diseases by analyzing a patient's medical records.

In the Stock Market: to identify stock behavior as well as the expected loss or profit from purchasing a particular stock.