decision-tree
# Summary
A tree-shaped supervised learning algorithm built from if-then rules that can be used for both classification and regression problems. Input features can be continuous or categorical; feature values are ideally categorical, and continuous values are discretized into threshold splits. (A minimal fitting sketch with scikit-learn follows the lists below.)
Advantages:
- can handle missing values
- explainable and interpretable
Disadvantages:
- prone to overfitting
- sensitive to outliers
Use Cases:
- customer-churn-prediction
- credit-scoring: e.g., deciding whether a customer is eligible for a loan based on their profile, where the manual approval process otherwise takes a long time (Kaggle Credit Card Approval data)
- disease-prediction
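As a quick illustration of how such a model is fit in practice, here is a minimal sketch using scikit-learn's `DecisionTreeClassifier` on a made-up churn-style table; the column names and values are purely illustrative, and the categorical column is one-hot encoded since scikit-learn trees expect numeric inputs.

```python
# Minimal sketch: fitting a decision tree classifier with scikit-learn.
# The toy churn-style data and column names are made up for illustration.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

data = pd.DataFrame({
    "tenure_months": [1, 34, 2, 45, 8, 22],  # continuous feature
    "contract": ["monthly", "yearly", "monthly", "yearly", "monthly", "monthly"],  # categorical feature
    "churned": [1, 0, 1, 0, 1, 0],  # target label
})

# One-hot encode the categorical column; numeric columns pass through unchanged
X = pd.get_dummies(data[["tenure_months", "contract"]])
y = data["churned"]

model = DecisionTreeClassifier(max_depth=3, random_state=0)
model.fit(X, y)
print(model.predict(X))
```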
# Overfitting and Instability of Decision Trees
Overfitting:
- Decision trees are prone to overfitting: the model fits the training data very well (little or no error) but does not generalize
- This deteriorates the tree’s usefulness on unseen data, since it starts modeling the noise in the training set
Instability:
- We can limit the tree’s depth beforehand, but there’s still the problem of instability
- even small changes to the training data, such as excluding a few instances, can result in a completely different tree
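To see this instability in practice, one rough sketch (assuming scikit-learn and a synthetic dataset, so everything below is illustrative) is to fit the same tree twice, once with a few training rows excluded, and compare the printed structures:

```python
# Rough sketch of instability: dropping a few training rows can change the learned tree.
# Synthetic data is used purely for illustration.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=200, n_features=4, random_state=0)

tree_full = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
tree_drop = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X[5:], y[5:])  # exclude 5 rows

# The printed splits frequently differ even though only a few instances were removed
print(export_text(tree_full))
print(export_text(tree_drop))
```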
# Depth of the tree
When we divide the houses amongst many leaves, we also have fewer houses in each leaf. Leaves with very few houses will make predictions that are quite close to those homes’ actual values, but they may make very unreliable predictions for new data (because each prediction is based on only a few houses).
On the flip side, if we make our tree very shallow, it doesn’t divide up the houses into very distinct groups.
At the extreme, if a tree divides the houses into only 2 or 4 groups, each group still contains a wide variety of houses. The resulting predictions may be far off for most houses, even in the training data (and they will be bad on validation data too, for the same reason).
A deep tree with lots of leaves will overfit because each prediction is coming from historical data from only the few houses at its leaf. But a shallow tree with few leaves will perform poorly because it fails to capture as many distinctions in the raw data.
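A small sketch of this trade-off, using synthetic regression data in place of the house-price table (so sizes and scores are only illustrative): compare a shallow, a medium, and a very deep tree on a held-out validation split.

```python
# Sketch comparing shallow vs. deep trees on held-out data.
# Synthetic data stands in for the house-price example.
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

X, y = make_regression(n_samples=500, n_features=6, noise=20.0, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

for depth in (2, 5, 20):  # shallow -> very deep
    model = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(X_train, y_train)
    mae = mean_absolute_error(y_val, model.predict(X_val))
    print(f"max_depth={depth:2d}  validation MAE={mae:.1f}")
```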
# Splitting the Data
When splitting, we choose to partition the data by the attribute that results in the smallest impurity of the new nodes.
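As a rough illustration in plain Python (using Gini impurity as the impurity measure; the helper names and toy rows are made up), choosing a split amounts to comparing the weighted impurity of the child nodes that each candidate attribute would produce:

```python
# Illustrative sketch: pick the attribute whose split gives the lowest weighted
# impurity of the child nodes (Gini impurity used here; names are made up).
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0 when all labels are identical)."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def weighted_impurity(rows, labels, attribute):
    """Average child impurity, weighted by child size, after splitting on `attribute`."""
    n = len(rows)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    return sum(len(g) / n * gini(g) for g in groups.values())

rows = [{"outlook": "sunny", "windy": True},
        {"outlook": "sunny", "windy": False},
        {"outlook": "rainy", "windy": True},
        {"outlook": "rainy", "windy": False}]
labels = ["no", "yes", "no", "yes"]

best = min(["outlook", "windy"], key=lambda a: weighted_impurity(rows, labels, a))
print(best)  # "windy" separates the two classes perfectly here
```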
# Control overfitting in decision tree
The `max_leaf_nodes` argument provides a very sensible way to control overfitting vs. underfitting: the more leaves we allow the model to make, the further we move from the underfitting region toward the overfitting region.
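Mirroring the depth sketch above, a rough example of tuning `max_leaf_nodes` against a validation split (again with synthetic data standing in for a real dataset):

```python
# Sketch of tuning max_leaf_nodes on a validation split (synthetic data for illustration).
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

X, y = make_regression(n_samples=500, n_features=6, noise=20.0, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

for leaves in (5, 50, 500):  # few leaves -> many leaves
    model = DecisionTreeRegressor(max_leaf_nodes=leaves, random_state=0).fit(X_train, y_train)
    mae = mean_absolute_error(y_val, model.predict(X_val))
    print(f"max_leaf_nodes={leaves:3d}  validation MAE={mae:.1f}")
```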
# Algorithm
A decision tree uses different algorithms to decide whether to split a node into two or more sub-nodes. The algorithm chooses the partition maximizing the purity of the split (i.e., minimizing the impurity). Impurity measures how mixed the labels at the node at hand are; there are different ways to define it (e.g., Gini impurity, entropy).
# ID3 (Iterative Dichotomiser)
- Uses entropy, an information-theoretic measure that (for a two-class problem) ranges from 0 to 1: 0 when all the data belong to a single class, 1 when the classes are evenly distributed
- The attribute with the highest information gain is used as the root node, and the same approach is applied recursively to build the subtrees
- Entropy is the measure that characterizes the impurity, i.e. how mixed the labels at the node are
- when entropy = 0, the set is perfectly classified (all elements in the set are of the same class)
- when entropy = 1, the classes are equally distributed
- entropy is calculated for each remaining attribute; the attribute whose split yields the smallest (weighted) entropy is used to split the set in this iteration
- the higher the entropy of the current set, the greater the potential to improve the classification by splitting it
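A small sketch of ID3's attribute selection (plain Python; the toy weather-style rows and helper names are made up): compute the entropy of the current set and the information gain of each candidate attribute, then split on the attribute with the highest gain.

```python
# Sketch of ID3's attribute selection: entropy and information gain on toy data.
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels (0 when all labels are identical)."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """Entropy of the parent set minus the weighted entropy of the child sets."""
    n = len(rows)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    children = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - children

rows = [{"outlook": "sunny", "windy": True},
        {"outlook": "sunny", "windy": False},
        {"outlook": "rainy", "windy": True},
        {"outlook": "rainy", "windy": False}]
labels = ["no", "yes", "no", "yes"]

for attr in ("outlook", "windy"):
    print(attr, information_gain(rows, labels, attr))
# ID3 picks the attribute with the highest gain ("windy" here) as the split.
```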