Wednesday, December 11, 2019

Decision Trees, Random Forest


Ref:
  1. https://medium.com/greyatom/decision-trees-a-simple-way-to-visualize-a-decision-dc506a403aeb  (basic all information ***)
  2. https://www.youtube.com/watch?v=nWuUahhK3Oc (*** better for understanding regression trees)
  3. https://sefiks.com/2018/08/27/a-step-by-step-cart-decision-tree-example/ (here is the full example of sunny outlook exam) 
  4. https://en.wikipedia.org/wiki/C4.5_algorithm (c4.5 algorithm)
  5. https://www.geeksforgeeks.org/decision-tree-introduction-example/ (gini index and entropy) 
  6. https://www.youtube.com/watch?v=Pz6xX6rK5M4&list=PLBv09BD7ez_4_UoYeGrzvqveIR_USBEKD&index=1  (vvvi ***** clearly describes information gain)
  7. https://datascience.stackexchange.com/questions/24339/how-is-a-splitting-point-chosen-for-continuous-variables-in-decision-trees (Good question and answer)
  8. https://m.youtube.com/watch?v=OD8aO4ovIBo (continuous or numeric data splitting, excellent video, please see ★★★★★)
  9. https://www.youtube.com/watch?v=eKD5gxPPeY0 (decision tree for multi-class classification)
    • Use log base 3 for entropy with 3 classes
    • Use log base 4 for entropy with 4 classes
  10. https://medium.com/@rishabhjain_22692/decision-trees-it-begins-here-93ff54ef134 (Entropy and information gain, ID3 ************** has a good example)
    • ID3: 

      • Entropy using the frequency table of one attribute (the target alone):
      • Entropy using the frequency table of two attributes (the target split by a predictor):
      • Information gain step (very important, see ref 5): information gain = entropy of the target minus the weighted entropy after splitting on an attribute. A small sketch follows below.
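A minimal sketch, in plain Python, of the entropy and information gain steps above (the counts come from the 14-row sunny-outlook dataset of ref 3; the function names are my own):

from math import log2
from collections import Counter

def entropy(labels):
    # Entropy from the frequency table of one attribute (the target alone).
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(labels, attribute_values):
    # Entropy of the target minus the weighted entropy after splitting on one attribute.
    total = len(labels)
    grouped = {}
    for value, label in zip(attribute_values, labels):
        grouped.setdefault(value, []).append(label)
    remainder = sum(len(subset) / total * entropy(subset) for subset in grouped.values())
    return entropy(labels) - remainder

# 14 decisions (9 Yes, 5 No) and the matching Outlook values from the sunny-outlook example.
decision = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes", "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"]
outlook = ["Sunny", "Sunny", "Overcast", "Rain", "Rain", "Rain", "Overcast", "Sunny", "Sunny", "Rain",
           "Sunny", "Overcast", "Overcast", "Rain"]
print(entropy(decision))                    # ~0.940
print(information_gain(decision, outlook))  # ~0.247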

      Gini Index (see ref 5 )

      Gini index says: if we select two items from a population at random, they must be of the same class, and the probability of this is 1 if the population is pure.

    • It works with a categorical target variable ("Success" or "Failure").
    • It performs only binary splits.
    • The higher the value of Gini, the higher the homogeneity.
    • CART (Classification and Regression Tree) uses the Gini method to create binary splits. A small sketch of the calculation follows this list.
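A minimal sketch of the Gini calculation described above, using the "sum of squared class probabilities" form (so a pure node scores 1); the Success/Failure counts are illustrative only:

def gini(success, failure):
    # Probability that two randomly drawn items belong to the same class.
    total = success + failure
    p_s, p_f = success / total, failure / total
    return p_s ** 2 + p_f ** 2          # 1.0 for a pure node, 0.5 for a 50/50 node

def weighted_gini(left, right):
    # CART-style score of a binary split: weight each child's Gini by its share of rows.
    n_left, n_right = sum(left), sum(right)
    n = n_left + n_right
    return (n_left / n) * gini(*left) + (n_right / n) * gini(*right)

# Split A: left node 8 Success / 2 Failure, right node 1 Success / 9 Failure.
print(weighted_gini((8, 2), (1, 9)))   # 0.75 -> fairly homogeneous children
# Split B: both children 5 Success / 5 Failure.
print(weighted_gini((5, 5), (5, 5)))   # 0.50 -> no improvement over a coin flip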
    Chi-Square

      It is an algorithm to find the statistical significance of the differences between sub-nodes and the parent node. We measure it as the sum of squares of the standardised differences between the observed and expected frequencies of the target variable.

    • It works with a categorical target variable ("Success" or "Failure").
    • It can perform two or more splits.
    • The higher the value of Chi-square, the higher the statistical significance of the difference between a sub-node and the parent node.
    • The Chi-square of each node is calculated using the formula:
      Chi-square = sqrt((Actual − Expected)² / Expected)
    • It generates a tree called CHAID (Chi-square Automatic Interaction Detector). A small sketch of the calculation follows this list.
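A minimal sketch of the per-node Chi-square computation from the formula above; the observed counts and parent proportions are illustrative only:

from math import sqrt

def node_chi_square(actual_counts, parent_proportions):
    # Sum of sqrt((Actual - Expected)^2 / Expected) over the target classes of one sub-node.
    n = sum(actual_counts)
    chi = 0.0
    for actual, p in zip(actual_counts, parent_proportions):
        expected = n * p                 # expected count if the sub-node mirrored the parent
        chi += sqrt((actual - expected) ** 2 / expected)
    return chi

# Parent node: 50% Success / 50% Failure. Sub-node: 15 Success, 5 Failure.
print(node_chi_square([15, 5], [0.5, 0.5]))   # ~3.16; larger means a more significant split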

    Variance (for continuous targets):

    1. Variance is important as a splitting criterion when the target is numerical or continuous data (a small sketch follows below).
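A minimal sketch of variance-based splitting (reduction in variance) for a continuous target; the numbers are made up for illustration:

def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def variance_reduction(parent, left, right):
    # Variance of the parent minus the weighted variance of the two children.
    n = len(parent)
    weighted = len(left) / n * variance(left) + len(right) / n * variance(right)
    return variance(parent) - weighted

# Target values before and after a candidate split.
parent = [10, 12, 11, 30, 32, 31]
print(variance_reduction(parent, [10, 12, 11], [30, 32, 31]))  # ~100 -> a very good split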

Regression Trees vs Classification Trees (ref 1)

The terminal nodes (or leaves) lie at the bottom of the decision tree. This means that decision trees are typically drawn upside down, so that the leaves are at the bottom and the root is at the top.
Both trees work in a very similar way. The primary differences and similarities between classification and regression trees are:
  1. Regression trees are used when the dependent variable is continuous. Classification trees are used when the dependent variable is categorical.
  2. In the case of a regression tree, the value obtained at a terminal node in the training data is the mean response of the observations falling in that region. Thus, if an unseen observation falls in that region, we make its prediction with the mean value.
  3. In the case of a classification tree, the value (class) obtained at a terminal node in the training data is the mode of the observations falling in that region. Thus, if an unseen observation falls in that region, we make its prediction with the mode value.
  4. Both trees divide the predictor space (the independent variables) into distinct and non-overlapping regions. A small sketch of points 2 and 3 follows this list.
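A minimal sketch of points 2 and 3, assuming scikit-learn is available: the regressor predicts the mean of the training observations in a leaf, the classifier predicts the mode. The toy data is made up:

from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

X = [[1], [2], [3], [10], [11], [12]]

# Classification tree: the leaf prediction is the mode (majority class) of that region.
clf = DecisionTreeClassifier(max_depth=1).fit(X, ["No", "No", "No", "Yes", "Yes", "Yes"])
print(clf.predict([[2.5]]))    # ['No']

# Regression tree: the leaf prediction is the mean response of that region.
reg = DecisionTreeRegressor(max_depth=1).fit(X, [1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
print(reg.predict([[2.5]]))    # [2.0] -> mean of 1, 2 and 3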







Use a standard classification tree: the basic classification algorithm.
Data example (the first rows of the sunny-outlook dataset, see ref 3):

Day  Outlook  Temp.  Humidity  Wind    Decision
1    Sunny    Hot    High      Weak    No
2    Sunny    Hot    High      Strong  No
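A minimal sketch, assuming pandas and scikit-learn are available, of fitting a basic classification tree to a table like this one; the categorical columns are one-hot encoded first because sklearn trees need numeric inputs (only the two rows shown above are used, so the model itself is trivial):

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

data = pd.DataFrame({
    "Outlook":  ["Sunny", "Sunny"],
    "Temp":     ["Hot", "Hot"],
    "Humidity": ["High", "High"],
    "Wind":     ["Weak", "Strong"],
    "Decision": ["No", "No"],
})

X = pd.get_dummies(data.drop(columns="Decision"))    # one-hot encode the categorical attributes
y = data["Decision"]
tree = DecisionTreeClassifier(criterion="entropy").fit(X, y)
print(tree.predict(X))                               # ['No' 'No']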



Use the C4.5 implementation: https://en.wikipedia.org/wiki/C4.5_algorithm (please study the algorithm section of this wiki article)
data example:

tutorials  labs      exam
all        complete  74
some       partial   23
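A minimal sketch of how a split point is chosen for a continuous attribute such as the exam score (the theme of refs 7 and 8): sort the values, try the midpoints between consecutive values as thresholds, and keep the one with the highest information gain. The scores and labels below are illustrative:

from math import log2
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    # Candidate thresholds are midpoints between consecutive sorted values.
    pairs = sorted(zip(values, labels))
    best = (None, -1.0)
    for (v1, _), (v2, _) in zip(pairs, pairs[1:]):
        if v1 == v2:
            continue
        threshold = (v1 + v2) / 2
        left = [lab for v, lab in pairs if v <= threshold]
        right = [lab for v, lab in pairs if v > threshold]
        n = len(labels)
        gain = entropy(labels) - (len(left) / n * entropy(left) + len(right) / n * entropy(right))
        if gain > best[1]:
            best = (threshold, gain)
    return best

# Illustrative exam scores and pass/fail outcomes.
print(best_threshold([23, 74, 55, 81, 40], ["fail", "pass", "pass", "pass", "fail"]))  # (47.5, ~0.97)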



Use a standard regression tree: when the data is linear, use linear regression instead, because decision trees do not work well on linear data, as the sketch below illustrates.
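A minimal sketch of why this matters, assuming scikit-learn: on perfectly linear data a linear model extrapolates, but a tree can never predict outside the range of target values it saw during training. The data is made up:

from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

X = [[i] for i in range(10)]        # x = 0 .. 9
y = [2 * i for i in range(10)]      # y = 2x, perfectly linear

lin = LinearRegression().fit(X, y)
tree = DecisionTreeRegressor().fit(X, y)
print(lin.predict([[20]]))   # ~40.0, follows the line
print(tree.predict([[20]]))  # 18.0, stuck at the largest leaf value seen in training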

Use the C4.5 implementation: this time, use the C4.5 algorithm.


x1   x2   x3   x4   Target
25   34   2    34   22
233  3    78   3    22
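A minimal sketch, assuming scikit-learn, of fitting a regression tree to a small numeric table like the one above; the two extra rows, their targets and the query point are made up so the example has something to split on:

from sklearn.tree import DecisionTreeRegressor

X = [[25, 34, 2, 34],
     [233, 3, 78, 3],
     [40, 20, 10, 15],    # illustrative extra row, not from the note
     [180, 7, 60, 5]]     # illustrative extra row, not from the note
y = [22, 22, 18, 30]      # the last two targets are also illustrative

reg = DecisionTreeRegressor(max_depth=2).fit(X, y)
print(reg.predict([[100, 10, 40, 10]]))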







Random Forest

The random forest is a model made up of many decision trees. Rather than simply averaging the predictions of the trees (which we could call a "forest"), this model uses two key concepts that give it the name random (a small sketch follows this list):
  1. Random sampling of training data points when building trees
  2. Random subsets of features considered when splitting nodes
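A minimal sketch, assuming scikit-learn, of where these two ideas appear in practice: bootstrap=True resamples the training rows for every tree, and max_features controls the random subset of features tried at each split. The dataset is synthetic:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,      # number of decision trees in the forest
    bootstrap=True,        # 1. random sampling (with replacement) of training rows per tree
    max_features="sqrt",   # 2. random subset of features considered at each split
    oob_score=True,        # score the model on the out-of-bag rows (see below)
    random_state=0,
).fit(X, y)

print(forest.oob_score_)   # accuracy estimated from rows each tree never saw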

 

  1.  https://www.youtube.com/watch?v=J4Wdy0Wc_xQ (Excellent video ******)
  2. https://www.youtube.com/watch?v=g9c66TUylZ4 (**********) 
  3. https://www.youtube.com/watch?v=nyxTdL_4Q-Q (*******)
  4. https://builtin.com/data-science/random-forest-algorithm 
    1. Overall, random forest is a (mostly) fast, simple and flexible tool, but not without some limitations (performance issues).
  5. https://towardsdatascience.com/an-implementation-and-explanation-of-the-random-forest-in-python-77bf308a9b76 (**** basic of random forest) 


Bootstrapping: sampling random sets of observations with replacement.
Bagging: bootstrapping the data plus using the aggregate to make a decision is called bagging.

Typically about 1/3 of the original data does not end up in the bootstrap sample. This subset is called the out-of-bag dataset.
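A minimal sketch, in plain Python, of bootstrapping and the roughly-1/3 out-of-bag fraction: sampling n rows with replacement leaves about (1 − 1/n)^n ≈ 37% of the original rows out of each bootstrap sample:

import random

random.seed(0)
n = 1000
rows = list(range(n))

bootstrap = [random.choice(rows) for _ in range(n)]   # sample n rows WITH replacement
out_of_bag = set(rows) - set(bootstrap)               # rows that were never picked

print(len(out_of_bag) / n)   # ~0.37, i.e. roughly 1/3 of the data is out-of-bag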












