Tutorial 4: Splitting Algorithms
Before automatically creating a decision tree, you can choose from several splitting functions that are used to determine which attribute to split on. The following splitting functions are available:
- Random - The attribute to split on is chosen randomly.
- Information Gain - The attribute to split on is the one with the maximum information gain. To calculate the information gain for an attribute, first compute the information content of each partition that the attribute produces. In the mail reading example, the 10 examples with "Thread = new" consist of 3 where the user action is "skips" and 7 where the user action is "reads." The information content about the user action is then -0.3 * log(0.3) - 0.7 * log(0.7) = 0.881 (using log base 2). For "Thread = old", the information content is calculated in the same way, and the result is 0.811. The expected information gain is thus 1.0 - ((10/18)*0.881 + (8/18)*0.811) = 0.150, where 1.0 is the information content of the full set of examples before splitting. (Note that there are 18 examples in total, 10 of which have thread value new and 8 of which have thread value old.)
- Gain Ratio - The attribute to split on is the one with the highest ratio of information gain to number of input values. The number of input values is the number of distinct values of the attribute that occur in the training set.
- GINI - The attribute with the highest GINI index is chosen. The GINI index is a measure of impurity of the examples.
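
The arithmetic behind these splitting functions can be checked with a short Python sketch. It is only an illustration, not the tool's own code: the function names are hypothetical, and the class counts for "Thread = old" (6 and 2) and for the full set (9 and 9) are inferred from the values 0.811 and 1.0 quoted above rather than stated directly. The Gini function computes the common impurity measure 1 - sum(p^2) for a single set of examples; how the tool combines it across partitions is not described here, so that step is left out.

```python
import math


def entropy(counts):
    """Information content in bits of a class distribution given as counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(parent_counts, partitions):
    """Entropy before the split minus the size-weighted entropy after it."""
    total = sum(parent_counts)
    remainder = sum(sum(p) / total * entropy(p) for p in partitions)
    return entropy(parent_counts) - remainder


def gain_ratio(parent_counts, partitions):
    """Gain divided by the number of distinct attribute values.

    This follows the definition given in the list above; it is not the
    split-information form of gain ratio used by C4.5.
    """
    return information_gain(parent_counts, partitions) / len(partitions)


def gini(counts):
    """Gini impurity of one set of examples: 1 - sum of squared proportions."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


# Counts are (skips, reads) for the user action.
thread_new = (3, 7)   # stated in the text
thread_old = (6, 2)   # assumption: inferred from the quoted value 0.811
all_examples = (9, 9) # assumption: inferred from the quoted value 1.0

if __name__ == "__main__":
    print(f"entropy(new) = {entropy(thread_new):.3f}")
    print(f"entropy(old) = {entropy(thread_old):.3f}")
    gain = information_gain(all_examples, [thread_new, thread_old])
    print(f"information gain = {gain:.3f}")
    ratio = gain_ratio(all_examples, [thread_new, thread_old])
    print(f"gain ratio = {ratio:.3f}")
    print(f"gini(new) = {gini(thread_new):.3f}")
```

Under those assumed counts, the script reproduces the 0.881, 0.811, and 0.150 figures from the Information Gain example.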