Sunday, December 29, 2019

Overfitting, underfitting (6 tricks to prevent overfitting in machine learning), Regularization


Ref:
  1. https://ai-odyssey.com/2018/03/22/memorizing-is-not-learning%E2%80%8A-%E2%80%8A6-tricks-to-prevent-overfitting-in-machine-learning/  (*********************)
  2.  



Regularization:
  1. https://medium.com/@dk13093/lasso-and-ridge-regularization-7b7b847bce34
  2.  https://towardsdatascience.com/l1-and-l2-regularization-methods-ce25e7fc831c
  3. https://towardsdatascience.com/intuitions-on-l1-and-l2-regularisation-235f2db4c261  

By reducing the sum of the absolute values of the coefficients, Lasso regularization (L1 norm) reduces the number of features the model uses altogether to predict the target variable.

On the other hand, by reducing the sum of the squares of the coefficients, Ridge regularization (L2 norm) doesn’t necessarily reduce the number of features per se, but rather reduces the magnitude/impact that each feature has on the model by shrinking its coefficient value.

So simply put, both kinds of regularization do indeed prevent the model from overfitting, but I like to think of Lasso regularization as reducing the quantity of features and Ridge regularization as reducing the magnitude (impact) of each feature. In essence, both types of reduction are needed, which is why ElasticNet (a combination of Lasso and Ridge regularization) is often the ideal type of regularization to perform on a model.
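To make the difference concrete, here is a minimal sketch (my own illustration, not taken from the posts above; it assumes scikit-learn and NumPy, and the data and alpha values are made up):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge, ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))           # 10 features, only the first 3 actually matter
y = 3 * X[:, 0] + 2 * X[:, 1] - X[:, 2] + rng.normal(scale=0.5, size=200)

lasso = Lasso(alpha=0.5).fit(X, y)                      # L1: irrelevant coefficients driven exactly to 0
ridge = Ridge(alpha=10.0).fit(X, y)                     # L2: coefficients shrunk, but rarely exactly 0
enet = ElasticNet(alpha=0.5, l1_ratio=0.5).fit(X, y)    # mix of L1 and L2

print("Lasso     :", np.round(lasso.coef_, 2))   # irrelevant features -> 0.0
print("Ridge     :", np.round(ridge.coef_, 2))   # irrelevant features -> small but non-zero
print("ElasticNet:", np.round(enet.coef_, 2))
```

Lasso zeroing out coefficients is the "reducing the quantity of features" effect; Ridge only shrinking them is the "reducing the magnitude" effect.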


https://stats.stackexchange.com/questions/45643/why-l1-norm-for-sparse-models/159379    (VVVVVVIII *******************) (the second answer explains why the L1 norm reduces the number of features in the model while the L2 norm doesn’t necessarily reduce the number of features)


With a sparse model, we think of a model where many of the weights are 0. Let us therefore reason about how L1-regularization is more likely to create 0-weights.

Consider a model consisting of the weights w = (w1, w2, ..., wm).

With L1 regularization, you penalize the model by a loss function L1(w) = |w1| + |w2| + ... + |wm|.

With L2 regularization, you penalize the model by a loss function L2(w) = (w1² + w2² + ... + wm²) / 2.

If using gradient descent, you will iteratively make the weights change in the opposite direction of the gradient, with a step size η multiplied by the gradient. This means that a steeper gradient will make us take a larger step, while a flatter gradient will make us take a smaller step. Let us look at the gradients (subgradient in the case of L1):

dL1(w)/dwi = sign(wi), where sign(wi) = wi / |wi|
dL2(w)/dwi = wi

So with L1 the pull toward 0 has a constant magnitude of 1 no matter how small the weight already is: each step reduces |wi| by roughly η until the weight actually reaches 0 and can stay there. With L2 the pull is proportional to the weight itself, so the steps get smaller and smaller as wi approaches 0 and the weight keeps shrinking but in practice never becomes exactly 0. That is why L1 regularization produces sparse models while L2 regularization only produces small weights.
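As a tiny numeric illustration of this, here is a sketch in Python (my own, not code from the Stack Exchange answer; the step size, regularization strength, and starting weight are made-up values, and only the penalty gradient is simulated, leaving out the data-fit term):

```python
# Compare how a single weight evolves under an L1 penalty vs an L2 penalty.
w_l1, w_l2 = 1.0, 1.0
eta, lam = 0.1, 1.0            # step size and regularization strength (made up)

for step in range(15):
    # L1 subgradient step: constant pull of size eta*lam toward 0
    grad_l1 = lam * (1.0 if w_l1 > 0 else -1.0 if w_l1 < 0 else 0.0)
    w_l1 -= eta * grad_l1
    if abs(w_l1) < eta * lam:  # crossed (or reached) zero -> clamp to exactly 0 (soft-thresholding idea)
        w_l1 = 0.0
    # L2 gradient step: pull proportional to the current weight, so it never quite reaches 0
    w_l2 -= eta * lam * w_l2

print(w_l1)   # 0.0 after enough steps -> sparse
print(w_l2)   # 0.9**15 ≈ 0.206 -> small, but still non-zero
```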

Need to understand these questions (vvvviiiiii).
1. Why are vanishing gradients problematic for training deep neural networks? How does using ReLU alleviate this problem?
2. Why is a bias term necessary in a neural network?
3. Explain the bias-variance trade-off.
4. How does L1/L2 regularization reduce overfitting?
5. Explain how dropout allows us to train an ensemble of neural networks simultaneously.
6. How does L1 regularization create a sparse model?
7. If the number of neurons is fixed, is it better to make a neural network deeper (more layers) or wider (more neurons per layer)?
8. What is the effect of the learning rate on training a neural network?

Sunday, December 22, 2019

Neural Network, Back propagation


Book name:


REF:

Wednesday, December 11, 2019

Decision Trees, Random forest


Ref:
  1. https://medium.com/greyatom/decision-trees-a-simple-way-to-visualize-a-decision-dc506a403aeb  (basic all information ***)
  2. https://www.youtube.com/watch?v=nWuUahhK3Oc (*** better for understanding regression trees)
  3. https://sefiks.com/2018/08/27/a-step-by-step-cart-decision-tree-example/ (here is the full example of sunny outlook exam) 
  4. https://en.wikipedia.org/wiki/C4.5_algorithm (c4.5 algorithm)
  5. https://www.geeksforgeeks.org/decision-tree-introduction-example/ (gini index and entropy) 
  6. https://www.youtube.com/watch?v=Pz6xX6rK5M4&list=PLBv09BD7ez_4_UoYeGrzvqveIR_USBEKD&index=1  (vvvi ***** clearly describes information gain)
  7. https://datascience.stackexchange.com/questions/24339/how-is-a-splitting-point-chosen-for-continuous-variables-in-decision-trees (Good question and answer)
  8. https://m.youtube.com/watch?v=OD8aO4ovIBo (great video on splitting continuous or numeric data, please see ★*****★***★***********)
  9. https://www.youtube.com/watch?v=eKD5gxPPeY0 (decision tree for multi-class classification)
    • Use log base 3 for entropy with 3 classes
    • Use log base 4 for entropy with 4 classes
  10. https://medium.com/@rishabhjain_22692/decision-trees-it-begins-here-93ff54ef134 (Entropy and information gain, ID3 ************** has a good example)
    • ID3: 

      • Entropy using the frequency table of one attribute:
      • Entropy using the frequency table of two attributes:
      Information Gain  step (very important in ref 5)

      Gini Index (see ref 5 )

      The Gini index says: if we select two items from a population at random, then they must be of the same class, and the probability of this is 1 if the population is pure.

    • It works with a categorical target variable such as “Success” or “Failure”.
    • It performs only binary splits.
    • The higher the value of Gini, the higher the homogeneity.
    • CART (Classification and Regression Tree) uses Gini method to create binary splits.
    1. Chi-Square

      It is an algorithm to find out the statistical significance of the differences between sub-nodes and the parent node. We measure it by the sum of squares of the standardised differences between the observed and expected frequencies of the target variable.

    2. It works with a categorical target variable such as “Success” or “Failure”.
    3. It can perform two or more splits.
    4. The higher the value of Chi-Square, the higher the statistical significance of the differences between the sub-node and the parent node.
    5. The Chi-Square of each node is calculated using the formula:
    6. Chi-square = √((Actual − Expected)² / Expected)
    7. It generates a tree called CHAID (Chi-square Automatic Interaction Detector).
  11.  

    Variance (reduction in variance):

    1. The variance measure is important for splitting when the target variable is numerical or continuous (a small worked sketch of the split measures above is given below).
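Here is a small worked sketch (my own, in Python, using the class counts of the classic sunny-outlook example from ref 3) of the split measures noted above: entropy, information gain, Gini, and the chi-square term.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels, log base 2 (use log base k for k classes if preferred)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class probabilities."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(parent_labels, child_groups):
    """Parent entropy minus the weighted average entropy of the children after a split."""
    n = len(parent_labels)
    weighted = sum(len(g) / n * entropy(g) for g in child_groups)
    return entropy(parent_labels) - weighted

# "Decision" column of the sunny-outlook data (9 Yes / 5 No), split by Outlook
parent   = ["Yes"] * 9 + ["No"] * 5
sunny    = ["Yes"] * 2 + ["No"] * 3
overcast = ["Yes"] * 4
rain     = ["Yes"] * 3 + ["No"] * 2

print(round(entropy(parent), 3))                                     # ≈ 0.94
print(round(information_gain(parent, [sunny, overcast, rain]), 3))   # ≈ 0.247 (gain of Outlook)
print(round(gini(sunny), 3))                                         # 0.48

# Chi-square style term for the Sunny node, using the formula noted above
expected_yes = len(sunny) * parent.count("Yes") / len(parent)        # 5 * 9/14 ≈ 3.21
actual_yes = sunny.count("Yes")                                      # 2
print(round(math.sqrt((actual_yes - expected_yes) ** 2 / expected_yes), 3))  # ≈ 0.68
```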

Regression Trees vs Classification Trees (REF -01)

The terminal nodes (or leaves) lie at the bottom of the decision tree. This means that decision trees are typically drawn upside down, so that the leaves are at the bottom and the root is at the top.
Both types of tree work in a very similar way. The primary differences and similarities between classification and regression trees are:
  1. Regression trees are used when the dependent variable is continuous. Classification trees are used when the dependent variable is categorical.
  2. In the case of a regression tree, the value obtained by a terminal node in the training data is the mean response of the observations falling in that region. Thus, if an unseen observation falls in that region, we make its prediction with the mean value (a small sketch follows after this list).
  3. In the case of a classification tree, the value (class) obtained by a terminal node in the training data is the mode of the observations falling in that region. Thus, if an unseen observation falls in that region, we make its prediction with the mode value.
  4. Both trees divide the predictor space (independent variables) into distinct and non-overlapping regions.
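A minimal sketch of points 2 and 3 (my own, assuming scikit-learn; the toy data is made up): a regression tree predicts with the mean of the training observations in a leaf, and a classification tree predicts with the mode (majority class).

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Toy data: two clearly separated groups of x values
X = np.array([[1], [2], [3], [10], [11], [12]])
y_reg = np.array([5.0, 6.0, 7.0, 50.0, 52.0, 54.0])
y_cls = np.array(["A", "A", "A", "B", "B", "A"])

reg = DecisionTreeRegressor(max_depth=1).fit(X, y_reg)
clf = DecisionTreeClassifier(max_depth=1).fit(X, y_cls)

print(reg.predict([[2.5]]))    # 6.0 = mean of {5, 6, 7} in the left region
print(clf.predict([[11.5]]))   # "B" = mode of {B, B, A} in the right region
```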







use standard classification tree:  basic classification algorithm
data example:
Day Outlook Temp. Humidity Wind Decision
1 Sunny Hot High Weak No
2 Sunny Hot High Strong No



use C4.5 implementation: https://en.wikipedia.org/wiki/C4.5_algorithm (Please study the algorithm part of this wiki)
data example:

tutorials labs. exam


all complete 74


some partial 23



use standard Regression Tree: when the data is linear, use linear regression instead, because decision trees do not work well on linear data (see the sketch below).
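A quick sketch of that point (my own, assuming scikit-learn; the data is made up): on perfectly linear data a regression tree only produces step-wise constant predictions and cannot extrapolate beyond the training range, while plain linear regression fits the trend exactly.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

X = np.arange(0, 10, 0.5).reshape(-1, 1)
y = 2 * X.ravel() + 1                      # exactly linear target

lin = LinearRegression().fit(X, y)
tree = DecisionTreeRegressor(max_depth=3).fit(X, y)

print(lin.predict([[20.0]]))   # 41.0 -> the line extrapolates correctly
print(tree.predict([[20.0]]))  # stuck at the mean of the right-most training leaf
```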

Use C4.5 Implementation: this time use the C4.5 algorithm.


x1 x2 x3 x4 Target

25 34 2 34 22

233 3 78 3 22







Random Forest

The random forest is a model made up of many decision trees. Rather than simply averaging the predictions of the trees (which we could call a “forest”), this model uses two key concepts that give it the name random (see the sketch after the list below):
  1. Random sampling of training data points when building trees
  2. Random subsets of features considered when splitting nodes
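A minimal sketch of these two ideas (my own, assuming scikit-learn; the dataset and parameters are made up):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,       # number of trees in the forest
    bootstrap=True,         # 1) each tree is trained on rows sampled with replacement
    max_features="sqrt",    # 2) each split considers only a random subset of the features
    oob_score=True,         # score each tree on the rows it never saw (out-of-bag)
    random_state=0,
).fit(X, y)

print(forest.oob_score_)    # out-of-bag accuracy estimate
```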

 

  1.  https://www.youtube.com/watch?v=J4Wdy0Wc_xQ (Excellent video ******)
  2. https://www.youtube.com/watch?v=g9c66TUylZ4 (**********) 
  3. https://www.youtube.com/watch?v=nyxTdL_4Q-Q (*******)
  4. https://builtin.com/data-science/random-forest-algorithm 
    1. Overall, random forest is a (mostly) fast, simple and flexible tool, but not without some limitations (performance issues).
  5. https://towardsdatascience.com/an-implementation-and-explanation-of-the-random-forest-in-python-77bf308a9b76 (**** basic of random forest) 


Bootstrapping: sampling random sets of observations with replacement.
Bagging: bootstrapping the data plus using the aggregate to make a decision is called bagging.

Typically about 1/3 of the original data does not end up in the bootstrap sample. This left-out data is called the out-of-bag dataset.
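A quick numeric check of that "about 1/3" claim (my own sketch, assuming NumPy): when you draw n rows with replacement, each row is left out with probability (1 - 1/n)^n ≈ e⁻¹ ≈ 0.368.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
sample = rng.integers(0, n, size=n)              # bootstrap: draw n row indices with replacement
out_of_bag = np.setdiff1d(np.arange(n), sample)  # rows that were never drawn
print(len(out_of_bag) / n)                       # ≈ 0.368, i.e. roughly 1/3 of the rows
```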












