Is there a way to enforce that a set of points is assigned to the same class when clustering in sklearn or another clustering library?

I would like to use one of sklearn's clustering algorithms, but with the restriction that certain sets of points must belong to the same class. For instance, given the set of points below, I would like to enforce that all red points belong to the same class and all blue points belong to the same class. Red and blue points should still be allowed to end up in the same class as each other. If this is not possible in sklearn, I am also open to using other libraries.

[Image: clustering with some points prespecified]


Solution 1:

The name for this is "constrained clustering": a family of semi-supervised clustering approaches in which the user also supplies pairwise constraints of two kinds:

  1. Must Link - two nodes must belong to the same cluster
  2. Cannot Link - two nodes cannot belong to the same cluster

There's an implementation of the COP-KMeans algorithm, which provides an API like this:

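import numpy as np
from copkmeans.cop_kmeans import cop_kmeans

# 100 samples with 500 features each
input_matrix = np.random.rand(100, 500)
# Pairs of row indices that must / must not end up in the same cluster
must_link = [(0, 10), (0, 20), (0, 30)]
cannot_link = [(1, 10), (2, 10), (3, 10)]
# Returns the cluster assignment of each sample and the k cluster centers
clusters, centers = cop_kmeans(dataset=input_matrix, k=5, ml=must_link, cl=cannot_link)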
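
For the case in the question, where whole groups of points (red, blue) must stay together, the groups first have to be turned into must-link pairs. The following is a minimal sketch of one way to do that, plus a small check that a labeling respects the constraints. The helper names `must_link_from_groups` and `violated_constraints` are hypothetical and not part of any library. It assumes every group is non-empty, and that the clustering implementation treats must-link as transitive, so linking each member to the first member of its group is enough; that is worth verifying for whichever implementation you use.

```python
def must_link_from_groups(groups):
    """Turn groups of indices into must-link pairs.

    Each group is chained to its first member, which is sufficient
    if the clustering code takes the transitive closure of must-link
    (an assumption here). Groups must be non-empty.
    """
    pairs = []
    for group in groups:
        first, *rest = group
        pairs.extend((first, other) for other in rest)
    return pairs


def violated_constraints(labels, must_link, cannot_link):
    """Return the must-link and cannot-link pairs that `labels` breaks.

    `labels` maps a point index to its cluster (a list or a dict).
    """
    broken_ml = [(i, j) for i, j in must_link if labels[i] != labels[j]]
    broken_cl = [(i, j) for i, j in cannot_link if labels[i] == labels[j]]
    return broken_ml, broken_cl


# Hypothetical index groups standing in for the red and blue points.
red = [0, 10, 20, 30]
blue = [5, 15]
must_link = must_link_from_groups([red, blue])
```

No cannot-link pair is generated between red and blue, so the two groups remain free to share a cluster. After clustering, passing the resulting labels to `violated_constraints` should return two empty lists.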

Solution 2:

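One possible solution, which should work with any library, is to define a "superpoint" for the blue group and another for the red group.

Define the blue superpoint as the mean (or median) of all the blue points, and similarly for the red. Then run the clustering on these two superpoints plus the remaining points, and give every blue or red point the label of its superpoint. If both superpoints land in the same cluster, red and blue simply share a class, which the question allows.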
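
A minimal sketch of the superpoint idea with sklearn's `KMeans`, using the mean as the superpoint. The function name `cluster_with_superpoints`, the choice of `KMeans`, and the toy data are assumptions made for illustration; any other clusterer could be swapped in. It assumes at least one group is given and that the groups do not overlap.

```python
import numpy as np
from sklearn.cluster import KMeans


def cluster_with_superpoints(X, groups, n_clusters, random_state=0):
    """Cluster X so that each index group in `groups` gets a single label.

    X: array of shape (n_samples, n_features).
    groups: non-overlapping lists of row indices, e.g. [red, blue].
    """
    X = np.asarray(X, dtype=float)
    grouped = np.concatenate([np.asarray(g, dtype=int) for g in groups])
    free = np.setdiff1d(np.arange(len(X)), grouped)  # rows in no group

    # One superpoint (the group mean) per group, followed by the free points.
    superpoints = np.array([X[g].mean(axis=0) for g in groups])
    reduced = np.vstack([superpoints, X[free]])

    reduced_labels = KMeans(
        n_clusters=n_clusters, n_init=10, random_state=random_state
    ).fit_predict(reduced)

    # Map labels back: every member of a group inherits its superpoint's label.
    labels = np.empty(len(X), dtype=int)
    for i, g in enumerate(groups):
        labels[g] = reduced_labels[i]
    labels[free] = reduced_labels[len(groups):]
    return labels


# Toy data: 60 random 2-D points, with hypothetical red and blue index groups.
rng = np.random.default_rng(0)
X = rng.random((60, 2))
red, blue = [0, 1, 2, 3], [10, 11, 12]
labels = cluster_with_superpoints(X, [red, blue], n_clusters=3)
```

One trade-off: each group counts as a single point during clustering, so a large group pulls on the cluster centers less than its members would individually. With `KMeans`, passing the group sizes through the `sample_weight` argument of `fit_predict` is one way to compensate.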