entropy
In statistics, entropy is a measure of the amount of information, or uncertainty, in a dataset
- varies from 0 to 1 for two classes (and up to $\log_2 n$ for $n$ classes)
- 0: all the data belong to a single class
- 1: the class distribution across two classes is uniform
- when $H(S)=0$, the set is perfectly classified (all elements in $S$ are of the same class)
Let’s assume that a dataset $T$ associated with a node contains examples from $n$ classes. Then, its entropy is:
$$entropy(T) = -\sum_{j=1}^{n} p_j \log_2 p_j$$
where $p_j$ is the relative frequency of class $j$ in $T$.
As is the case with the gini-impurity-index, a node is pure when $entropy(T)$ takes its minimum value, zero, and maximally impure when it takes its highest value (1 for a two-class problem).
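A minimal sketch of this computation in Python (the helper name `entropy` and the list-of-counts input are my own choices, not from the source):

```python
import math

def entropy(counts):
    """Entropy of a class distribution given as a list of class counts."""
    total = sum(counts)
    h = 0.0
    for count in counts:
        if count == 0:
            continue  # a class with zero examples contributes nothing (0 * log 0 -> 0)
        p = count / total  # relative frequency p_j of the class
        h -= p * math.log2(p)
    return h
```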
# Example
- 4 red, 0 blue
- 2 red, 2 blue
- 3 red, 1 blue
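Applying the entropy formula to each of these splits (a worked check):
- 4 red, 0 blue: $-\frac{4}{4}\log_2\frac{4}{4} = 0$
- 2 red, 2 blue: $-\frac{2}{4}\log_2\frac{2}{4} - \frac{2}{4}\log_2\frac{2}{4} = 1$
- 3 red, 1 blue: $-\frac{3}{4}\log_2\frac{3}{4} - \frac{1}{4}\log_2\frac{1}{4} \approx 0.811$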
# Information Gain
The information gain is the difference between a parent node’s entropy and the weighted sum of its child node entropies.
Let’s assume a dataset $T$ with $N$ objects is partitioned into two datasets, $T_1$ and $T_2$, of sizes $N_1$ and $N_2$. Then, the split’s Information Gain ($Gain_{split}$) is:
$$Gain_{split} = entropy(T) - \frac{N_1}{N}\,entropy(T_1) - \frac{N_2}{N}\,entropy(T_2)$$
In general, if we split $T$ into $m$ subsets $T_1, T_2, \ldots, T_m$ with $N_1, N_2, \ldots, N_m$ objects, respectively, the split’s Information Gain ($Gain_{split}$) is:
$$Gain_{split} = entropy(T) - \sum_{i=1}^{m} \frac{N_i}{N}\,entropy(T_i)$$
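A small Python sketch of this formula, reusing the `entropy` helper above (the function name and argument layout are assumptions, not from the source):

```python
def information_gain(parent_counts, child_counts_list):
    """Information Gain of a split: parent entropy minus the
    size-weighted sum of the child nodes' entropies."""
    n = sum(parent_counts)
    weighted_children = sum(
        (sum(counts) / n) * entropy(counts) for counts in child_counts_list
    )
    return entropy(parent_counts) - weighted_children

# e.g. a parent of 4 red / 4 blue split into (3 red, 1 blue) and (1 red, 3 blue):
# information_gain([4, 4], [[3, 1], [1, 3]])  # ~0.189
```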
# Example Splitting by Information Gain
https://www.baeldung.com/cs/impurity-entropy-gini-index#2-example-splitting-by-information-gain
steps:
- for each candidate attribute, compute the Information Gain of the split it induces, and choose the attribute that offers the highest Information Gain
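A hedged sketch of that selection step in Python, reusing `entropy` and `information_gain` from above (the row-based dataset layout and helper names are hypothetical, for illustration only):

```python
from collections import Counter

def class_counts(rows):
    """Class counts for a list of rows, assuming the label is the last column."""
    return list(Counter(row[-1] for row in rows).values())

def best_attribute(rows, attribute_indices):
    """Return the attribute index whose split yields the highest Information Gain."""
    parent_counts = class_counts(rows)
    best_idx, best_gain = None, float("-inf")
    for idx in attribute_indices:
        # partition the rows by the value of this attribute
        partitions = {}
        for row in rows:
            partitions.setdefault(row[idx], []).append(row)
        children = [class_counts(subset) for subset in partitions.values()]
        gain = information_gain(parent_counts, children)
        if gain > best_gain:
            best_idx, best_gain = idx, gain
    return best_idx, best_gain
```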