Here's a quick post that talks about the process behind K-Means Clustering.  There's a few variations of it, but I'll give you the gist so you understand what's happening under-the-hood whenever you fit one!

Okay, what is it?

K-means Clustering is one of the most popular clustering methods out there — perhaps this is because a large number of practical problems naturally work well with it, or because it's easy to explain and implement the algorithm, or because checking out the results and explaining why something is where it is tends to be easier than some other clustering methods.

Let's chat about what goes on in the (naive version) algorithm.  It might look a bit wild, but it's pretty easy once you understand the "iteration" steps 3—5.  See the figure below in the "Voronoi" section after reading this for an example.

  1. Given some value $k$ that the user picks beforehand, and observations $\{x_{1}, \dots, x_{n}\}$ where each $x_{i}$ is some row of features in your data set, we do the following.
  2. Pick $k$ means: $\{m_{1}, \dots, m_{k}\}$.  It doesn't matter where they are, but usually uniformly distributed works well.
  3. Given an observation $x_{i}$ find which mean point above the observation is (Euclidean distance) closest to.  Let's say it's closest to $m_{j}$; then we say that $x_{i}$ is currently in class $k$.  
  4. Once we're done placing each $x_{i}$ into its respective class, for each class take all of the observations that belong to that class and calculate their mean: this becomes the new mean for this class.  Assign this to be the new $m_{j}$ for class $j$, for each class $j$.  
  5. Keep doing steps 3 and 4 (assign, update) until the means more-or-less stabilize.

That's it.  You now have your means and the classes for your observations.  Awesome.


The geometrically-inclined reader will notice that what we've done is create a Voronoi diagram from the mean points, then created new mean points from the observations sitting in each Voronoi cell and updated the diagram.  Once the means stabilize, you have a Voronoi diagram where each of the Voronoi cells contains one class of points.  This is a great way to visualize the process.

Doing the process above with Voronoi Cells included

Wait, it's that easy?  What's the catch?

Yeah, okay, so.  There's a few things.

One.  The mean isn't a great measure if you have outliers, and this isn't a great clustering algorithm if your clusters aren't essentially Gaussian-looking.  For example, if you have a blob with a ring around it (think of the planet Saturn) it's going to fail miserably.  It wasn't made for that sort of data!

Two.  You might notice that, above, we talked about picking $k$ before doing anything — that gives the number of clusters we have. How do we know what $k$ is before doing the clustering?  We don't.  We have to guess and see what works best.  Unfortunately, this means that we need to look at things like inter- and intra-cluster variance and lots of other things to see if our results are reasonable.  If this is 2D or 3D we can plot it and easily see if this is what we want; otherwise, we need to rely on various metrics to help out.  One way people typically do this is the elbow method.

Three.  Because it's a fairly simplistic method it will not usually work on anything but fairly simple "blobs" of data.  It may be required to preprocess data beforehand to get it in a form that is "nice" for clustering.