🪴 Aradinka Digital Garden


decision-tree

Last updated Jan 1, 2023

# Summary

A decision tree is a tree-shaped supervised learning algorithm built from if-then rules that can be used for both classification and regression problems. The input features can be continuous or categorical; categorical feature values are preferred, and continuous features are typically discretized.
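The if-then idea can be sketched as nested conditionals; a fitted tree is essentially a learned version of this. The feature names and thresholds below are hypothetical, not learned from data:

```python
# A tiny hand-written "decision tree" for illustration only:
# the features (rooms, area_m2) and thresholds are hypothetical,
# not learned from any dataset.
def predict_price_band(rooms: int, area_m2: float) -> str:
    if rooms <= 2:
        if area_m2 <= 60:
            return "low"
        return "medium"
    else:
        if area_m2 <= 120:
            return "medium"
        return "high"
```

For example, `predict_price_band(3, 150)` walks the right branch twice and returns `"high"`. A learning algorithm's job is to pick these split attributes and thresholds automatically.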

Advantages:

- Easy to interpret and visualize
- Requires little data preparation (no feature scaling needed)
- Handles both numerical and categorical features

Disadvantages:

- Prone to overfitting when grown deep
- Unstable: small changes in the data can produce a very different tree
- A single tree is often less accurate than ensemble methods

Use Cases:

- Credit risk scoring
- Medical diagnosis support
- Customer churn prediction

# Overfitting and Instability of Decision Trees

Overfitting: an unconstrained tree keeps splitting until each leaf fits only a handful of training examples, so it ends up memorizing noise in the training data. This is controlled by limiting depth or leaf count, requiring a minimum number of samples per leaf, or pruning.

Instability: a small change in the training data can change an early split, and because every split below depends on it, the result is a completely different tree. Ensembles such as random forests average many trees to reduce this variance.

# Depth of the tree

When we divide the houses amongst many leaves, we also have fewer houses in each leaf. Leaves with very few houses will make predictions that are quite close to those homes’ actual values, but they may make very unreliable predictions for new data (because each prediction is based on only a few houses).

On the flip side, if we make our tree very shallow, it doesn’t divide up the houses into very distinct groups.

At an extreme, if a tree divides the houses into only 2 or 4 groups, each group still contains a wide variety of houses. The resulting predictions may be far off for most houses, even in the training data (and they will be bad on validation data for the same reason).

A deep tree with lots of leaves will overfit because each prediction is coming from historical data from only the few houses at its leaf. But a shallow tree with few leaves will perform poorly because it fails to capture as many distinctions in the raw data.
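The unreliability of small leaves can be illustrated numerically: a leaf's prediction is the mean of its training examples, and the mean of a few samples fluctuates far more between draws than the mean of many. This is a standalone sketch with simulated prices, not part of any tree library:

```python
import random
import statistics

random.seed(0)

def sample_leaf_mean(leaf_size: int) -> float:
    """Simulate one leaf: its prediction is the mean of its houses' prices."""
    prices = [random.gauss(300_000, 50_000) for _ in range(leaf_size)]
    return statistics.mean(prices)

# Spread of the leaf prediction across many hypothetical leaves.
small_leaf_means = [sample_leaf_mean(3) for _ in range(1000)]
large_leaf_means = [sample_leaf_mean(100) for _ in range(1000)]

spread_small = statistics.stdev(small_leaf_means)
spread_large = statistics.stdev(large_leaf_means)
# Predictions from 3-house leaves fluctuate much more than from 100-house leaves.
```

The standard error of a mean shrinks like 1/sqrt(n), which is exactly why deep trees with tiny leaves generalize poorly.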

# Splitting the Data

When splitting, we partition the data on the attribute that results in the smallest impurity in the resulting child nodes.
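A minimal sketch of this criterion using Gini impurity (one common impurity measure); the attribute names and toy labels below are invented for illustration:

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def split_impurity(rows, labels, attribute):
    """Weighted average impurity of the children after splitting on attribute."""
    n = len(rows)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    return sum(len(g) / n * gini(g) for g in groups.values())

# Toy data: pick the attribute whose split leaves the purest children.
rows = [
    {"color": "red", "size": "big"},
    {"color": "red", "size": "small"},
    {"color": "blue", "size": "big"},
    {"color": "blue", "size": "small"},
]
labels = ["yes", "yes", "no", "no"]

best = min(["color", "size"], key=lambda a: split_impurity(rows, labels, a))
# "color" separates the labels perfectly (impurity 0), so it is chosen.
```

Here splitting on `color` gives two pure children (weighted impurity 0.0), while splitting on `size` leaves both children mixed (0.5), so `color` wins.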

# Control overfitting in decision tree

The max_leaf_nodes argument provides a very sensible way to balance overfitting against underfitting: the more leaves we allow the model to make, the further we move from the underfitting region toward the overfitting region.
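max_leaf_nodes is a parameter of scikit-learn's tree estimators; a common pattern is to compare validation error across several values and keep the best. A sketch, assuming scikit-learn is installed and using synthetic data invented for illustration:

```python
# Sketch: tune max_leaf_nodes by comparing validation MAE.
# Assumes scikit-learn is available; the data here is synthetic.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 3))
y = X[:, 0] * 2 + X[:, 1] ** 2 + rng.normal(0, 1, size=500)

X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

def get_mae(max_leaf_nodes):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(X_train, y_train)
    return mean_absolute_error(y_val, model.predict(X_val))

# Few leaves underfit, very many overfit; validation error finds the sweet spot.
scores = {n: get_mae(n) for n in [5, 50, 500]}
best_n = min(scores, key=scores.get)
```

The same comparison works with any other capacity knob (max_depth, min_samples_leaf); the validation set, not the training set, decides.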

# Algorithm

A decision tree uses different algorithms to decide whether to split a node into two or more sub-nodes. The algorithm chooses the partition that maximizes the purity of the split (i.e., minimizes the impurity).

Impurity is a measure of the homogeneity of the labels at the node at hand.

# ID3 (Iterative Dichotomiser)
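ID3 (Iterative Dichotomiser 3), introduced by Ross Quinlan, builds the tree top-down on categorical features: at each node it computes the entropy of the labels and splits on the attribute with the highest information gain (the entropy reduction), recursing until nodes are pure or attributes run out. A minimal sketch of the information-gain computation; the toy weather-style data is invented for illustration:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label list, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """Entropy reduction from splitting on `attribute` (ID3's criterion)."""
    n = len(labels)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    children = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - children

# Toy data (hypothetical): does the label depend on outlook or windiness?
rows = [
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
]
labels = ["no", "no", "yes", "yes"]

# ID3 picks the attribute with the highest gain; here "outlook" gains a full bit.
best = max(["outlook", "windy"], key=lambda a: information_gain(rows, labels, a))
```

Maximizing information gain is equivalent to minimizing the children's weighted entropy, so this is the same "smallest impurity" rule as above with entropy as the impurity measure.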
