How does K-Means work when initial centroid locations far away from the data are selected?

From my understanding of k-means clustering: k is chosen, initial centroid locations are selected, each sample is assigned to its nearest centroid, each centroid then moves to the mean of its assigned samples, and the last two steps repeat until the centroids stop moving.
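
To make my mental model explicit, here is a minimal sketch of that loop as I understand it (my own naive version, with no special handling of empty clusters):

```python
import numpy as np

def lloyd(X, init_centers, max_iter=100):
    """Minimal sketch of the plain k-means (Lloyd) loop described above."""
    centers = np.asarray(init_centers, dtype=float).copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_iter):
        # Assign each sample to its nearest centroid
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Move each centroid to the mean of its assigned samples;
        # an empty cluster's centroid simply stays where it is
        new_centers = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
            for k in range(len(centers))
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

X = np.array([[x, 0] for x in range(1, 11)], dtype=float)
centers, labels = lloyd(X, [[5, 1], [5, 10], [5, 20]])
print(centers)  # only the first centroid moves, to (5.5, 0)
```

On the data below, this naive version does exactly what I expect: every sample joins the (5,1) centroid, that centroid moves once, and the other two never move.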


So in this example, if 3 centroids are chosen with locations of (5,1), (5,10), and (5,20), what happens? I would expect all the samples to be assigned to the (5,1) centroid, which would then move to the mean of the data, (5.5, 0), and the algorithm would terminate after a single iteration with all of the samples belonging to one cluster and the other two centroids never moving.

However, when I implement this in Python using scikit-learn's KMeans, it doesn't behave like this:

In [3]: import numpy as np
   ...: import pandas as pd
   ...: from sklearn.cluster import KMeans
   ...: 
   ...: xa = [1,2,3,4,5,6,7,8,9,10]
   ...: xb = [0,0,0,0,0,0,0,0,0,0]
   ...: 
   ...: data = pd.DataFrame({"Xa": xa, "Xb": xb}, columns=["Xa", "Xb"])
   ...: 
   ...: initial = np.array([[5,1], [5,10], [5,20]])
   ...: 
   ...: km = KMeans(n_clusters=3, init=initial, n_init=1, max_iter=1).fit(data)
   ...: print(km.cluster_centers_)
[[ 4.5  0. ]
 [10.   0. ]
 [ 9.   0. ]]

Letting the algorithm run to convergence (dropping max_iter=1) results in the following, which also doesn't make sense to me:

In [4]: km = KMeans(n_clusters=3, init=initial, n_init=1).fit(data)
   ...: print(km.cluster_centers_)
[[3.  0. ]
 [9.5 0. ]
 [7.  0. ]]

Could anyone explain my misunderstanding here?


The overall goal of k-means is to subdivide the data into k clusters in such a way that the sum of squared distances between data points and the centroids of their assigned clusters is as small as possible. There is no guarantee that the iterative k-means algorithm will find the clustering minimizing this sum, but its implementations should reject clearly bad solutions. Returning empty clusters is obviously not optimal, since in such a situation we can do better by subdividing a non-empty cluster into smaller ones.
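
To make that concrete on the data from the question: using a small inertia helper of my own (not sklearn's), a split into three non-empty clusters has a far smaller sum of squared distances than the single-cluster solution you expected:

```python
import numpy as np

# The ten points from the question, on a line
X = np.array([[x, 0] for x in range(1, 11)], dtype=float)

def inertia(X, labels, centers):
    """Sum of squared distances from each point to its assigned center."""
    return sum(np.sum((X[labels == k] - c) ** 2) for k, c in enumerate(centers))

# One non-empty cluster: every point assigned to the overall mean (5.5, 0)
one_labels = np.zeros(len(X), dtype=int)
one_centers = [X.mean(axis=0)]
print(inertia(X, one_labels, one_centers))      # 82.5

# Split the same points into three contiguous groups instead
three_labels = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 2])
three_centers = [X[three_labels == k].mean(axis=0) for k in range(3)]
print(inertia(X, three_labels, three_centers))  # 9.0
```

So a solution that leaves two clusters empty is never the one a good implementation should settle for.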

For this reason, implementations of the k-means algorithm use various methods to avoid empty clusters. The implementation used by sklearn checks at each iteration whether empty clusters were produced. If so, the code selects the data points that are furthest away from their currently assigned centers and relocates the centers of the empty clusters to those points. The functions in the sklearn source code that perform this step are _relocate_empty_clusters_dense and _relocate_empty_clusters_sparse.
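
A rough sketch of that relocation step (my own simplification, not sklearn's actual code, which handles sample weights and tie-breaking differently) looks like this on the question's data:

```python
import numpy as np

def relocate_empty_clusters(X, centers, labels):
    """Sketch: move each empty cluster's center onto a point that is
    currently furthest from its own assigned center."""
    centers = centers.astype(float).copy()
    # Squared distance from each point to its assigned center
    dists = np.sum((X - centers[labels]) ** 2, axis=1)
    empty = [k for k in range(len(centers)) if not np.any(labels == k)]
    # The furthest points, one per empty cluster (stable sort for determinism)
    far_idx = np.argsort(dists, kind="stable")[::-1][:len(empty)]
    for k, i in zip(empty, far_idx):
        centers[k] = X[i]
        labels[i] = k
    return centers, labels

X = np.array([[x, 0] for x in range(1, 11)], dtype=float)
centers = np.array([[5., 1.], [5., 10.], [5., 20.]])
labels = np.zeros(len(X), dtype=int)          # all points start at (5, 1)
centers, labels = relocate_empty_clusters(X, centers, labels)

# Finish the iteration by moving each center to the mean of its points
for k in range(3):
    centers[k] = X[labels == k].mean(axis=0)
print(centers)
```

With this sketch's tie-breaking, the empty centers jump to the points (10, 0) and (9, 0), and after the mean update the centers come out as (4.5, 0), (10, 0), and (9, 0), which happens to match the one-iteration output printed in the question.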