Sunday, December 29, 2019

Overfitting, underfitting (6 tricks to prevent overfitting in machine learning), Regularization


Ref:
  1. https://ai-odyssey.com/2018/03/22/memorizing-is-not-learning%E2%80%8A-%E2%80%8A6-tricks-to-prevent-overfitting-in-machine-learning/  (*********************)
  2.  



Regularization:
  1. https://medium.com/@dk13093/lasso-and-ridge-regularization-7b7b847bce34
  2.  https://towardsdatascience.com/l1-and-l2-regularization-methods-ce25e7fc831c
  3. https://towardsdatascience.com/intuitions-on-l1-and-l2-regularisation-235f2db4c261  

By reducing the sum of the absolute values of the coefficients, Lasso regularization (L1 norm) reduces the number of features the model uses altogether to predict the target variable.

On the other hand, by reducing the sum of the squares of the coefficients, Ridge regularization (L2 norm) doesn’t necessarily reduce the number of features per se, but rather reduces the magnitude/impact that each feature has on the model by shrinking its coefficient value.

So simply put, both kinds of regularization do indeed prevent the model from overfitting, but I like to think of Lasso regularization as reducing the quantity of features and Ridge regularization as reducing the magnitude (impact) of each feature. In essence, both types of reduction are needed, which is why ElasticNet (a combination of Lasso and Ridge regularization) is often the ideal type of regularization to perform on a model.
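To make the difference concrete, here is a minimal sketch (my own illustration, not taken from the posts above; it assumes scikit-learn and NumPy, and the data and alpha values are made up):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge, ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))           # 10 features, only the first 3 actually matter
y = 3 * X[:, 0] + 2 * X[:, 1] - X[:, 2] + rng.normal(scale=0.5, size=200)

lasso = Lasso(alpha=0.5).fit(X, y)                      # L1: irrelevant coefficients driven exactly to 0
ridge = Ridge(alpha=10.0).fit(X, y)                     # L2: coefficients shrunk, but rarely exactly 0
enet = ElasticNet(alpha=0.5, l1_ratio=0.5).fit(X, y)    # mix of L1 and L2

print("Lasso     :", np.round(lasso.coef_, 2))   # irrelevant features -> 0.0
print("Ridge     :", np.round(ridge.coef_, 2))   # irrelevant features -> small but non-zero
print("ElasticNet:", np.round(enet.coef_, 2))
```

Lasso zeroing out coefficients is the "reducing the quantity of features" effect; Ridge only shrinking them is the "reducing the magnitude" effect.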


https://stats.stackexchange.com/questions/45643/why-l1-norm-for-sparse-models/159379    (VVVVVVIII *******************) (the second answer explains why the L1 norm reduces the number of features in the model while the L2 norm doesn’t necessarily reduce the number of features)


With a sparse model, we think of a model where many of the weights are 0. Let us therefore reason about how L1-regularization is more likely to create 0-weights.

Consider a model consisting of the weights w = (w1, w2, ..., wm).

With L1 regularization, you penalize the model by a loss function L1(w) = |w1| + |w2| + ... + |wm|.

With L2 regularization, you penalize the model by a loss function L2(w) = (w1² + w2² + ... + wm²) / 2.

If using gradient descent, you will iteratively make the weights change in the opposite direction of the gradient, with a step size η multiplied by the gradient. This means that a steeper gradient will make us take a larger step, while a flatter gradient will make us take a smaller step. Let us look at the gradients (subgradient in the case of L1):

dL1(w)/dwi = sign(wi), where sign(wi) = wi / |wi|
dL2(w)/dwi = wi

So with L1 the pull toward 0 has a constant magnitude of 1 no matter how small the weight already is: each step reduces |wi| by roughly η until the weight actually reaches 0 and can stay there. With L2 the pull is proportional to the weight itself, so the steps get smaller and smaller as wi approaches 0 and the weight keeps shrinking but in practice never becomes exactly 0. That is why L1 regularization produces sparse models while L2 regularization only produces small weights.
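As a tiny numeric illustration of this, here is a sketch in Python (my own, not code from the Stack Exchange answer; the step size, regularization strength, and starting weight are made-up values, and only the penalty gradient is simulated, leaving out the data-fit term):

```python
# Compare how a single weight evolves under an L1 penalty vs an L2 penalty.
w_l1, w_l2 = 1.0, 1.0
eta, lam = 0.1, 1.0            # step size and regularization strength (made up)

for step in range(15):
    # L1 subgradient step: constant pull of size eta*lam toward 0
    grad_l1 = lam * (1.0 if w_l1 > 0 else -1.0 if w_l1 < 0 else 0.0)
    w_l1 -= eta * grad_l1
    if abs(w_l1) < eta * lam:  # crossed (or reached) zero -> clamp to exactly 0 (soft-thresholding idea)
        w_l1 = 0.0
    # L2 gradient step: pull proportional to the current weight, so it never quite reaches 0
    w_l2 -= eta * lam * w_l2

print(w_l1)   # 0.0 after enough steps -> sparse
print(w_l2)   # 0.9**15 ≈ 0.206 -> small, but still non-zero
```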

Need to understand these questions (vvvviiiiii).
1. Why are vanishing gradients problematic for training deep neural networks? How does using ReLU alleviate this problem?
2. Why is a bias term necessary in a neural network?
3. Explain the bias-variance trade-off.
4. How does L1/L2 regularization reduce overfitting?
5. Explain how dropout allows us to train an ensemble of neural networks simultaneously.
6. How does L1 regularization create a sparse model?
7. If the number of neurons is fixed, is it better to make a neural network deeper (more layers) or wider (more neurons per layer)?
8. What is the effect of the learning rate on training a neural network?

Sunday, December 22, 2019

Neural Network, Back propagation


Book name:


REF:

Wednesday, December 11, 2019

Decision Trees, Random forest


Ref:
  1. https://medium.com/greyatom/decision-trees-a-simple-way-to-visualize-a-decision-dc506a403aeb  (basic all information ***)
  2. https://www.youtube.com/watch?v=nWuUahhK3Oc (*** better for understanding regression trees)
  3. https://sefiks.com/2018/08/27/a-step-by-step-cart-decision-tree-example/ (here is the full example of sunny outlook exam) 
  4. https://en.wikipedia.org/wiki/C4.5_algorithm (c4.5 algorithm)
  5. https://www.geeksforgeeks.org/decision-tree-introduction-example/ (gini index and entropy) 
  6. https://www.youtube.com/watch?v=Pz6xX6rK5M4&list=PLBv09BD7ez_4_UoYeGrzvqveIR_USBEKD&index=1  (vvvi ***** clearly describes information gain)
  7. https://datascience.stackexchange.com/questions/24339/how-is-a-splitting-point-chosen-for-continuous-variables-in-decision-trees (Good question and answer)
  8. https://m.youtube.com/watch?v=OD8aO4ovIBo (great video on splitting continuous or numeric data, please see ★*****★***★***********)
  9. https://www.youtube.com/watch?v=eKD5gxPPeY0 (decision tree for multi-class classification)
    • Use log base 3 for entropy with 3 classes
    • Use log base 4 for entropy with 4 classes
  10. https://medium.com/@rishabhjain_22692/decision-trees-it-begins-here-93ff54ef134 (Entropy and information gain, ID3 ************** has a good example)
    • ID3: 

      • Entropy using the frequency table of one attribute:
      • Entropy using the frequency table of two attributes:
      Information Gain  step (very important in ref 5)

      Gini Index (see ref 5 )

      The Gini index says: if we select two items from a population at random, then they must be of the same class, and the probability of this is 1 if the population is pure.

    • It works with a categorical target variable such as “Success” or “Failure”.
    • It performs only binary splits.
    • The higher the value of Gini, the higher the homogeneity.
    • CART (Classification and Regression Tree) uses Gini method to create binary splits.
    1. Chi-Square

      It is an algorithm to find out the statistical significance of the differences between sub-nodes and the parent node. We measure it by the sum of squares of the standardised differences between the observed and expected frequencies of the target variable.

    2. It works with a categorical target variable such as “Success” or “Failure”.
    3. It can perform two or more splits.
    4. The higher the value of Chi-Square, the higher the statistical significance of the differences between the sub-node and the parent node.
    5. The Chi-Square of each node is calculated using the formula:
    6. Chi-square = √((Actual − Expected)² / Expected)
    7. It generates a tree called CHAID (Chi-square Automatic Interaction Detector).
  11.  

    Variance (reduction in variance):

    1. The variance measure is important for splitting when the target variable is numerical or continuous (a small worked sketch of the split measures above is given below).
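Here is a small worked sketch (my own, in Python, using the class counts of the classic sunny-outlook example from ref 3) of the split measures noted above: entropy, information gain, Gini, and the chi-square term.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels, log base 2 (use log base k for k classes if preferred)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class probabilities."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(parent_labels, child_groups):
    """Parent entropy minus the weighted average entropy of the children after a split."""
    n = len(parent_labels)
    weighted = sum(len(g) / n * entropy(g) for g in child_groups)
    return entropy(parent_labels) - weighted

# "Decision" column of the sunny-outlook data (9 Yes / 5 No), split by Outlook
parent   = ["Yes"] * 9 + ["No"] * 5
sunny    = ["Yes"] * 2 + ["No"] * 3
overcast = ["Yes"] * 4
rain     = ["Yes"] * 3 + ["No"] * 2

print(round(entropy(parent), 3))                                     # ≈ 0.94
print(round(information_gain(parent, [sunny, overcast, rain]), 3))   # ≈ 0.247 (gain of Outlook)
print(round(gini(sunny), 3))                                         # 0.48

# Chi-square style term for the Sunny node, using the formula noted above
expected_yes = len(sunny) * parent.count("Yes") / len(parent)        # 5 * 9/14 ≈ 3.21
actual_yes = sunny.count("Yes")                                      # 2
print(round(math.sqrt((actual_yes - expected_yes) ** 2 / expected_yes), 3))  # ≈ 0.68
```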

Regression Trees vs Classification Trees (REF -01)

The terminal nodes (or leaves) lie at the bottom of the decision tree. This means that decision trees are typically drawn upside down, so that the leaves are at the bottom and the root is at the top.
Both types of tree work in a very similar way. The primary differences and similarities between classification and regression trees are:
  1. Regression trees are used when the dependent variable is continuous. Classification trees are used when the dependent variable is categorical.
  2. In the case of a regression tree, the value obtained by a terminal node in the training data is the mean response of the observations falling in that region. Thus, if an unseen observation falls in that region, we make its prediction with the mean value (a small sketch follows after this list).
  3. In the case of a classification tree, the value (class) obtained by a terminal node in the training data is the mode of the observations falling in that region. Thus, if an unseen observation falls in that region, we make its prediction with the mode value.
  4. Both trees divide the predictor space (independent variables) into distinct and non-overlapping regions.
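A minimal sketch of points 2 and 3 (my own, assuming scikit-learn; the toy data is made up): a regression tree predicts with the mean of the training observations in a leaf, and a classification tree predicts with the mode (majority class).

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Toy data: two clearly separated groups of x values
X = np.array([[1], [2], [3], [10], [11], [12]])
y_reg = np.array([5.0, 6.0, 7.0, 50.0, 52.0, 54.0])
y_cls = np.array(["A", "A", "A", "B", "B", "A"])

reg = DecisionTreeRegressor(max_depth=1).fit(X, y_reg)
clf = DecisionTreeClassifier(max_depth=1).fit(X, y_cls)

print(reg.predict([[2.5]]))    # 6.0 = mean of {5, 6, 7} in the left region
print(clf.predict([[11.5]]))   # "B" = mode of {B, B, A} in the right region
```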







use standard classification tree:  basic classification algorithm
data example:
Day Outlook Temp. Humidity Wind Decision
1 Sunny Hot High Weak No
2 Sunny Hot High Strong No



use C4.5 implementation: https://en.wikipedia.org/wiki/C4.5_algorithm (Please study the algorithm part of this wiki)
data example:

tutorials labs. exam


all complete 74


some partial 23



use standard Regression Tree: when the data is linear, use linear regression instead, because decision trees do not work well on linear data (see the sketch below).
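A quick sketch of that point (my own, assuming scikit-learn; the data is made up): on perfectly linear data a regression tree only produces step-wise constant predictions and cannot extrapolate beyond the training range, while plain linear regression fits the trend exactly.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

X = np.arange(0, 10, 0.5).reshape(-1, 1)
y = 2 * X.ravel() + 1                      # exactly linear target

lin = LinearRegression().fit(X, y)
tree = DecisionTreeRegressor(max_depth=3).fit(X, y)

print(lin.predict([[20.0]]))   # 41.0 -> the line extrapolates correctly
print(tree.predict([[20.0]]))  # stuck at the mean of the right-most training leaf
```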

Use C4.5 Implementation: this time use the C4.5 algorithm.


x1 x2 x3 x4 Target

25 34 2 34 22

233 3 78 3 22







Random Forest

The random forest is a model made up of many decision trees. Rather than simply averaging the predictions of the trees (which we could call a “forest”), this model uses two key concepts that give it the name random (see the sketch after the list below):
  1. Random sampling of training data points when building trees
  2. Random subsets of features considered when splitting nodes
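A minimal sketch of these two ideas (my own, assuming scikit-learn; the dataset and parameters are made up):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,       # number of trees in the forest
    bootstrap=True,         # 1) each tree is trained on rows sampled with replacement
    max_features="sqrt",    # 2) each split considers only a random subset of the features
    oob_score=True,         # score each tree on the rows it never saw (out-of-bag)
    random_state=0,
).fit(X, y)

print(forest.oob_score_)    # out-of-bag accuracy estimate
```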

 

  1.  https://www.youtube.com/watch?v=J4Wdy0Wc_xQ (Excellent video ******)
  2. https://www.youtube.com/watch?v=g9c66TUylZ4 (**********) 
  3. https://www.youtube.com/watch?v=nyxTdL_4Q-Q (*******)
  4. https://builtin.com/data-science/random-forest-algorithm 
    1. Overall, random forest is a (mostly) fast, simple and flexible tool, but not without some limitations (performance issues).
  5. https://towardsdatascience.com/an-implementation-and-explanation-of-the-random-forest-in-python-77bf308a9b76 (**** basic of random forest) 


Bootstrapping: sampling random sets of observations with replacement.
Bagging: bootstrapping the data plus using the aggregate to make a decision is called bagging.

Typically about 1/3 of the original data does not end up in the bootstrap sample. This left-out data is called the out-of-bag dataset.
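A quick numeric check of that "about 1/3" claim (my own sketch, assuming NumPy): when you draw n rows with replacement, each row is left out with probability (1 - 1/n)^n ≈ e⁻¹ ≈ 0.368.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
sample = rng.integers(0, n, size=n)              # bootstrap: draw n row indices with replacement
out_of_bag = np.setdiff1d(np.arange(n), sample)  # rows that were never drawn
print(len(out_of_bag) / n)                       # ≈ 0.368, i.e. roughly 1/3 of the rows
```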












