What does `sample_weight` do to the way a `DecisionTreeClassifier` works in sklearn?
I've read from the relevant documentation that :
Class balancing can be done by sampling an equal number of samples from each class, or preferably by normalizing the sum of the sample weights (
sample_weight
) for each class to the same value.
But, it is still unclear to me how this works. If I set sample_weight
with an array of only two possible values, 1
's and 2
's, does this mean that the samples with 2
's will get sampled twice as often as the samples with 1
's when doing the bagging? I cannot think of a practical example for this.
Some quick preliminaries:
Let's say we have a classification problem with K classes. In a region of feature space represented by the node of a decision tree, recall that the "impurity" of the region is measured by quantifying the inhomogeneity, using the probability of the class in that region. Normally, we estimate:
Pr(Class=k) = #(examples of class k in region) / #(total examples in region)
The impurity measure takes as input, the array of class probabilities:
[Pr(Class=1), Pr(Class=2), ..., Pr(Class=K)]
and spits out a number, which tells you how "impure" or how inhomogeneous-by-class the region of feature space is. For example, the gini measure for a two class problem is 2*p*(1-p)
, where p = Pr(Class=1)
and 1-p=Pr(Class=2)
.
Now, basically the short answer to your question is:
sample_weight
augments the probability estimates in the probability array ... which augments the impurity measure ... which augments how nodes are split ... which augments how the tree is built ... which augments how feature space is diced up for classification.
I believe this is best illustrated through example.
First consider the following 2-class problem where the inputs are 1 dimensional:
from sklearn.tree import DecisionTreeClassifier as DTC
X = [[0],[1],[2]] # 3 simple training examples
Y = [ 1, 2, 1 ] # class labels
dtc = DTC(max_depth=1)
So, we'll look trees with just a root node and two children. Note that the default impurity measure the gini measure.
Case 1: no sample_weight
dtc.fit(X,Y)
print dtc.tree_.threshold
# [0.5, -2, -2]
print dtc.tree_.impurity
# [0.44444444, 0, 0.5]
The first value in the threshold
array tells us that the 1st training example is sent to the left child node, and the 2nd and 3rd training examples are sent to the right child node. The last two values in threshold
are placeholders and are to be ignored. The impurity
array tells us the computed impurity values in the parent, left, and right nodes respectively.
In the parent node, p = Pr(Class=1) = 2. / 3.
, so that gini = 2*(2.0/3.0)*(1.0/3.0) = 0.444....
. You can confirm the child node impurities as well.
Case 2: with sample_weight
Now, let's try:
dtc.fit(X,Y,sample_weight=[1,2,3])
print dtc.tree_.threshold
# [1.5, -2, -2]
print dtc.tree_.impurity
# [0.44444444, 0.44444444, 0.]
You can see the feature threshold is different. sample_weight
also affects the impurity measure in each node. Specifically, in the probability estimates, the first training example is counted the same, the second is counted double, and the third is counted triple, due to the sample weights we've provided.
The impurity in the parent node region is the same. This is just a coincidence. We can compute it directly:
p = Pr(Class=1) = (1+3) / (1+2+3) = 2.0/3.0
The gini measure of 4/9
follows.
Now, you can see from the chosen threshold that the first and second training examples are sent to the left child node, while the third is sent to the right. We see that impurity is calculated to be 4/9
also in the left child node because:
p = Pr(Class=1) = 1 / (1+2) = 1/3.
The impurity of zero in the right child is due to only one training example lying in that region.
You can extend this with non-integer sample-wights similarly. I recommend trying something like sample_weight = [1,2,2.5]
, and confirming the computed impurities.