How can I match up cluster labels to my 'ground truth' labels in Matlab

I have searched here and googled, but to no avail. When clustering in Weka there is a handy option, 'classes to clusters', which matches the clusters produced by an algorithm (e.g. simple k-means) to the 'ground truth' class labels you supply as the class attribute, so that you can see the clustering accuracy (% incorrect).

Now, how can I achieve this in Matlab, i.e. translate my clusterClasses vector, e.g. [1, 1, 2, 1, 3, 2, 3, 1, 1, 1], onto the same labels as the supplied ground truth vector, e.g. [2, 2, 3, 2, 1, 3, 1, 2, 2, 2]? In this example that means finding the mapping 1→2, 2→3, 3→1.

I think it is probably based on cluster centres and label centres, but I'm not sure how to implement it!

Any help would be greatly appreciated.

Vincent


Solution 1:

I stumbled on a similar problem a couple of months ago while doing clustering. I did not search for built-in solutions very long (although I am sure they must exist) and ended up writing my own little script for matching my found labels to the ground truth as well as possible. The code is very crude, but it should get you started.

It is based on trying all possible rearrangements of the labels to see which best fits the truth vector. That means that given a clustering result yte = [3 3 2 1] with ground truth y = [1 1 2 3], the script will try to match [3 3 2 1], [3 3 1 2], [2 2 3 1], [2 2 1 3], [1 1 2 3] and [1 1 3 2] against y to find the best match.

It relies on the built-in function perms(), which cannot handle more than 10 unique clusters. The code also tends to be slow for 7-10 unique clusters, since the number of permutations grows factorially (10 clusters already mean 10! = 3,628,800 orderings to test); see the greedy sketch after the example below for a possible workaround.

function [accuracy, true_labels, CM] = calculateAccuracy(yte, y)
%# Function for calculating clustering accuracy and matching found 
%# labels with true labels. Assumes yte and y both are Nx1 vectors with
%# clustering labels. Does not support fuzzy clustering.
%#
%# The algorithm is based on trying out all reorderings of the cluster
%# labels, e.g. if yte = [1 2 2], try [1 2 2] and [2 1 1] to see which
%# fits the truth vector best. Since this approach makes use of perms(),
%# the code will not run for more than 10 unique clusters, and it slows
%# down significantly for more than 7 clusters.
%#
%# Input:
%#   yte - result from clustering (y-test)
%#   y   - truth vector
%#
%# Output:
%#   accuracy    -   Overall accuracy for entire clustering (OA). For
%#                   overall error, use OE = 1 - OA.
%#   true_labels -   Vector giving the label rearrangement which best
%#                   matches the truth vector (y).
%#   CM          -   Confusion matrix. If there are 4 unique labels, CM
%#                   is a 4x4 matrix counting the different errors and
%#                   correct clusterings.

N = length(y);

cluster_names = unique(yte);           %# labels found by the clustering
accuracy = 0;
maxInd = 1;

perm = perms(unique(y));               %# all orderings of the true labels
[pN, pM] = size(perm);

true_labels = y;

for i=1:pN
    %# relabel yte according to the i-th permutation
    flipped_labels = zeros(1,N);
    for cl = 1 : pM
        flipped_labels(yte==cluster_names(cl)) = perm(i,cl);
    end

    %# fraction of points whose relabelled cluster matches the truth
    testAcc = sum(flipped_labels == y')/N;
    if testAcc > accuracy
        accuracy = testAcc;
        maxInd = i;
        true_labels = flipped_labels;
    end

end

%# Build the confusion matrix; note this assumes the class labels are
%# the integers 1..pM.
CM = zeros(pM,pM);
for rc = 1 : pM
    for cc = 1 : pM
        CM(rc,cc) = sum( ((y'==rc) .* (true_labels==cc)) );
    end
end

Example:

[acc newLabels CM] = calculateAccuracy([3 2 2 1 2 3]',[1 2 2 3 3 3]')

acc =

0.6667


newLabels =

 1     2     2     3     2     1


CM =

 1     0     0
 0     2     0
 1     1     1
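As noted above, this brute-force approach breaks down beyond 10 clusters. One possible workaround (not from the original answer, and only a sketch) is a greedy assignment built from the cluster/class co-occurrence matrix; the function name greedyMatch is made up, and it assumes yte and y have equally many unique labels. The result may be suboptimal; an optimal assignment would require the Hungarian algorithm (e.g. matchpairs() in MATLAB R2019a and later).

function mapping = greedyMatch(yte, y)
%# Greedy label matching: repeatedly take the largest remaining entry
%# of the co-occurrence matrix C, where C(i,j) counts points that fall
%# in found cluster i and true class j. Approximate, but runs in
%# polynomial time instead of factorial time.
cl = unique(yte);                       %# cluster labels found
tl = unique(y);                         %# ground-truth class labels
C = zeros(numel(cl), numel(tl));
for i = 1:numel(cl)
    for j = 1:numel(tl)
        C(i,j) = sum(yte == cl(i) & y == tl(j));
    end
end
mapping = zeros(numel(cl), 1);          %# mapping(i): class for cluster cl(i)
for k = 1:numel(cl)
    [~, idx] = max(C(:));               %# largest remaining overlap
    [i, j] = ind2sub(size(C), idx);
    mapping(i) = tl(j);
    C(i,:) = -1;                        %# exclude this cluster ...
    C(:,j) = -1;                        %# ... and this class from later picks
end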

Solution 2:

You might want to look into more flexible ways of evaluating clusterings, for example pair-counting metrics such as the Rand index.
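To make that concrete, here is a minimal sketch of the plain (unadjusted) Rand index in MATLAB; the function name randIndex is made up, and it assumes yte and y are Nx1 label vectors as in Solution 1. Pair counting needs no label matching at all: it counts, over all pairs of points, how often the two labelings agree on "same group" versus "different group".

function ri = randIndex(yte, y)
%# Rand index: fraction of point pairs on which the clustering and the
%# ground truth agree (grouped together in both, or separated in both).
N = numel(y);
sameCluster = bsxfun(@eq, yte, yte');   %# N-by-N: pair in same cluster?
sameClass   = bsxfun(@eq, y, y');       %# N-by-N: pair in same class?
agree = (sameCluster == sameClass);     %# pairs both labelings agree on
mask = triu(true(N), 1);                %# count each unordered pair once
ri = sum(agree(mask)) / nnz(mask);

For the example from Solution 1, randIndex([3 2 2 1 2 3]', [1 2 2 3 3 3]') yields 0.6.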

The "class = cluster" assumption is typical for people coming from machine learning into clustering. But instead you should assume that some classes may consist of multiple clusters, or that multiple classes actually cluster. These are the interesting situations a clustering algorithm should actually detect.