The “balanced-ness” of classification trees
Published 10/4/2017
A classification tree is a logical tree (in the computer science sense) that encodes an order of instructions used to sort a set of data into n classes. It is a rooted binary (directed-edge) tree with n labeled leaves, one leaf for each of the n classes into which the data are sorted. There are (2n − 3)!! possible labeled classification trees, up to isomorphism, each tree representing a different set of encoded instructions. (The double factorial n!! is recursively defined as n!! ≡ n(n − 2)!! with 0!! = 1!! = 1.) Suppose a supplied data set X to be sorted into n classes has the data evenly divided among the n classes. Attached with X is a feature set F assumed to be capable of producing a correct classification. A classification algorithm will attempt to maximally divide X as evenly as possible using one of the features. Then, repeating this process, each of the two separated halves of X will be evenly divided, on and on, so that the tree representing this process will be very balanced. By balanced is meant that the n unique paths from the root of the tree to each of the n leaves will all have the same (or very close to the same) number of edges in them. On the other hand, 1 if the data set X is very uneven, for example half of the data are in the first class, half of the rest in the second, and so on, then it is likely that the classification algorithm will produce a very unbalanced tree with the first split going entirely into the first leaf, while the second branch of the tree splits off half of its data into the second leaf, and so on. Thus the “balancedness” of a classification tree should be highly correlated with the unevenness of the data distribution among the classes. Suppose we choose to measure this balance–imbalance of a tree using the variance of its set of n path lengths (in number of edges) from the root to each of its n leaves. Among the (2n − 3)!! such trees, each with such a variance, there will be a maximum variance. Therefore we can normalize this set of variances by dividing by its maximum, creating an independent variable over the set [0, 1], and then produce a histogram of the number of classification trees occurring at each generated value of normalized variance. What would the resulting probability distribution look like for various values of n? That is the question we will try to answer here.