K-means algorithm variation with equal cluster size

Solution 1:

This might do the trick: apply Lloyd's algorithm to get k centroids. Sort the centroids in an array by descending size of their associated clusters. For i = 1 through k-1, push the data points in cluster i that have minimal distance to any other centroid j (i < j ≤ k) off to j, and recompute centroid i (but don't recompute the cluster assignment), until the cluster size is n / k.

The complexity of this postprocessing step is O(k² n lg n).
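A minimal sketch of this postprocessing, assuming Euclidean distances, n divisible by k, and NumPy arrays X, labels, and centroids coming out of a standard k-means run; the function name and details are illustrative, not a reference implementation:

import numpy as np

def rebalance(X, labels, centroids):
    n, k = len(X), len(centroids)
    target = n // k  # assumes n is divisible by k
    # process clusters from largest to smallest
    order = np.argsort([-np.sum(labels == c) for c in range(k)])
    for idx, i in enumerate(order[:-1]):
        members = np.where(labels == i)[0]
        later = order[idx + 1:]  # the clusters j with i < j
        while len(members) > target:
            # find the member cheapest to hand off to any later centroid
            d = np.linalg.norm(X[members][:, None] - centroids[later][None], axis=2)
            p, j = np.unravel_index(np.argmin(d), d.shape)
            labels[members[p]] = later[j]
            members = np.delete(members, p)
            # recompute centroid i, but leave the other assignments untouched
            centroids[i] = X[members].mean(axis=0)
    return labels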

Solution 2:

The ELKI data mining framework has a tutorial on equal-size k-means.

This is not a particularly good algorithm, but it's an easy enough k-means variation to write a tutorial for, and to teach people how to implement their own clustering algorithm variation; and apparently some people really need their clusters to have the same size, although the SSQ quality will be worse than with regular k-means.

In ELKI 0.7.5, you can select this algorithm as tutorial.clustering.SameSizeKMeansAlgorithm.

Solution 3:

You can view the distances as defining a weighted graph. Quite a few graph partitioning algorithms are explicitly based on trying to partition the graph vertices into two sets of equal size. This appears in, for example, the Kernighan-Lin algorithm and in spectral graph partitioning using the Laplacian. To get multiple clusters, you can recursively apply the partitioning algorithm; there's a nice discussion of this at the link on spectral graph partitioning.
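As a rough sketch, a single spectral bisection step that forces two equal halves might look like this (the Gaussian similarity and the median split are just one common set of choices here; for more clusters you would apply this recursively):

import numpy as np
from scipy.spatial.distance import pdist, squareform

def spectral_bisect(X):
    # turn pairwise distances into edge weights of a dense similarity graph
    W = np.exp(-squareform(pdist(X)))
    np.fill_diagonal(W, 0)
    L = np.diag(W.sum(axis=1)) - W  # unnormalized graph Laplacian
    _, eigvecs = np.linalg.eigh(L)
    fiedler = eigvecs[:, 1]  # eigenvector of the second-smallest eigenvalue
    # split at the median of the Fiedler vector to get two equal-size halves
    order = np.argsort(fiedler)
    return order[:len(X) // 2], order[len(X) // 2:]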

Solution 4:

Try this k-means variation:

Initialization:

  • choose k centers from the dataset at random, or, even better, using the k-means++ strategy
  • for each point, compute the distance to its nearest cluster center, and build a heap for this
  • draw points from the heap, and assign them to the nearest cluster, unless the cluster is already overfull. If so, compute the next nearest cluster center and reinsert into the heap

In the end, you should have a partitioning that satisfies your requirement of ±1 the same number of objects per cluster (make sure the last few clusters also have the right number: the first m clusters should have ceil(n/k) objects, the remainder exactly floor(n/k) objects).
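A minimal sketch of this initialization, assuming Euclidean distances and (for simplicity) n divisible by k, so that every cluster takes exactly n/k points; all names are illustrative:

import heapq
import numpy as np

def balanced_init(X, centers):
    n, k = len(X), len(centers)
    capacity = n // k  # assumes n divisible by k
    d = np.linalg.norm(X[:, None] - centers[None], axis=2)
    order = np.argsort(d, axis=1)  # for each point, clusters by distance
    # heap entries: (distance to candidate cluster, point index, rank of that cluster)
    heap = [(d[p, order[p, 0]], p, 0) for p in range(n)]
    heapq.heapify(heap)
    labels = np.full(n, -1)
    sizes = np.zeros(k, dtype=int)
    while heap:
        dist, p, r = heapq.heappop(heap)
        c = order[p, r]
        if sizes[c] < capacity:
            labels[p] = c
            sizes[c] += 1
        else:
            # cluster is full: reinsert the point with its next-nearest center
            heapq.heappush(heap, (d[p, order[p, r + 1]], p, r + 1))
    return labels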

Iteration step:

Prerequisite: a list, for each cluster, of "swap proposals" (objects that would prefer to be in a different cluster).

M step: compute the updated cluster centers, as in regular k-means

E step: iterate through all points (either one at a time, or all in one batch)

For each object, compute the nearest cluster center (or all cluster centers that are closer than the current one). If it is a different cluster:

  • If the other cluster is smaller than the current cluster, just move the object to the new cluster
  • If there is a swap proposal from the other cluster (or from any cluster with a lower distance), swap the two objects' cluster assignments (if there is more than one offer, choose the one with the largest improvement)
  • Otherwise, record a swap proposal for the other cluster

The cluster sizes remain invariant (± the ceil/floor difference), and objects are only moved from one cluster to another as long as this improves the estimate. It should therefore converge at some point, like k-means, though it may be a bit slower (i.e., take more iterations).
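A rough sketch of one such iteration, simplified to consider only the single nearest center and to skip the largest-improvement tie-break; labels is the current assignment, and the remaining names are illustrative:

import numpy as np

def swap_iteration(X, labels, k):
    # M step: recompute the cluster centers from the current assignment
    centers = np.stack([X[labels == c].mean(axis=0) for c in range(k)])
    d = np.linalg.norm(X[:, None] - centers[None], axis=2)
    sizes = np.bincount(labels, minlength=k)
    proposals = {c: [] for c in range(k)}  # objects that want to join cluster c
    for p in range(len(X)):
        cur, best = labels[p], int(np.argmin(d[p]))
        if best == cur:
            continue
        if sizes[best] < sizes[cur]:
            # the other cluster is smaller: move the object outright
            labels[p] = best
            sizes[cur] -= 1
            sizes[best] += 1
        else:
            # look for an object in `best` that proposed to move into `cur`
            q = next((q for q in proposals[cur] if labels[q] == best), None)
            if q is not None:
                labels[p], labels[q] = best, cur  # swap the two assignments
                proposals[cur].remove(q)
            else:
                proposals[best].append(p)  # record p's swap proposal
    return labels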

I do not know if this has been published or implemented before. It's just what I would try (if I were to try k-means; there are much better clustering algorithms).

Solution 5:

Just in case anyone wants to copy and paste a short function, here you go: it basically runs KMeans, then finds a minimum-cost matching of points to clusters under the constraint of a maximum number of points assigned to each cluster (the cluster size).

from sklearn.cluster import KMeans
from scipy.spatial.distance import cdist
from scipy.optimize import linear_sum_assignment
import numpy as np


def get_even_clusters(X, cluster_size):
    n_clusters = int(np.ceil(len(X) / cluster_size))
    kmeans = KMeans(n_clusters=n_clusters)
    kmeans.fit(X)
    centers = kmeans.cluster_centers_
    # replicate each center cluster_size times, so every cluster offers
    # exactly cluster_size "slots" in the assignment problem
    centers = centers.reshape(-1, 1, X.shape[-1]).repeat(cluster_size, 1).reshape(-1, X.shape[-1])
    distance_matrix = cdist(X, centers)
    # minimum-cost assignment of points to slots; integer division maps
    # each slot back to the cluster it belongs to
    clusters = linear_sum_assignment(distance_matrix)[1] // cluster_size
    return clusters
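
For example (illustrative usage):

X = np.random.rand(100, 2)
labels = get_even_clusters(X, cluster_size=10)
print(np.bincount(labels))  # each of the 10 clusters gets exactly 10 points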