In Pythonin' I'm gonna have some fun and code up some neat Python stuff.  All of it can be copy-pasted into a Jupyter Notebook and run!


Introduction and The Seven Sons

If you don't know what K-Means is, check out my previous post on it.

As for Voronoi diagrams, the way I learned them was through the following story, which I'll share with you now:

[Image: a Voronoi diagram with many points]

There once was a rich man with seven sons.  The seven sons hated each other and could never stand to be around one another, lest things become violent!  When the rich man died, his will directed a carpenter to go around his land (which spanned a large number of acres) and build a home for each of his seven sons.

The carpenter, who wasn't a cartographer, didn't know how to map things out, and so the homes were built in essentially random places on the old man's estate.  This didn't bother the seven sons so much at first, but each son began trying to expand how much land he owned around his house, building fences around what he thought was rightfully his.

Once a fence was built (and this is the important part), if any of the land inside it was closer to ANY OTHER SON's house than to that son's own, the fence would be torn down.  After a while, each brother had constructed a fence where all of the land inside it was closer to his house than to any other brother's.


And they lived in harmony and peace after that, or whatever.  The point of the story is... well, okay, it's mostly a story about Voronoi diagrams.  The gist is that a Voronoi diagram is a set of "fences" (lines) that form polygons around a set of points, such that every location inside a polygon is closer to that polygon's original point than to any of the other original points.
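
Before we get to the real code, here's a tiny sketch of the "fence" rule in action.  The house coordinates and the spot of land below are completely made up; the nearest-house check is just NumPy:

# A made-up illustration of the fence rule: a spot of land belongs to
# whichever brother's house is nearest.
import numpy as np

houses = np.array([[0.2, 0.8], [0.5, 0.1], [0.9, 0.6]])  # three houses
spot = np.array([0.4, 0.3])                               # a spot of land

dists = np.linalg.norm(houses - spot, axis=1)  # distance to each house
print(f"This spot belongs to brother {dists.argmin()}.")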

We're going to write some Python that walks through the K-Means algorithm, which gives us a neat excuse to draw Voronoi diagrams along the way.


The Python

The usual imports.  Some of the weirder things here are the plotting libraries (which, as usual, took 99% of the time to figure out for this script).  The palette I'm using here is a list of hex values corresponding to a nice set of colors that Seaborn provides.

from collections import defaultdict

import numpy as np
from scipy.spatial import Voronoi, voronoi_plot_2d
from scipy.spatial.distance import euclidean
from sklearn.datasets import make_blobs
from sklearn.preprocessing import minmax_scale

import matplotlib.pyplot as plt
import seaborn as sns

sns.set(style="white")
colorblind_palette = sns.color_palette('colorblind').as_hex()

%matplotlib inline

Next is the bulk of the work.  We make three functions.  First, we need to classify our points (we color them according to whichever center they're closest to).  Next, we need something to find a new center for each group of classified points (this is the update step of the K-Means algorithm).  Last, we do a plotting thing, which is always a nightmare but in this case wasn't too bad.

def classify_points(centers, data):
    """Classifies points from data according to which center they're closest to.
       Returns a dict keyed by palette hex color, like
       {'#0173b2': np.array([[..., ...], ...]), }.
    """
    classes = defaultdict(list)
    for pt in data:
        dists = [euclidean(pt, center) for center in centers]
        classes[colorblind_palette[dists.index(min(dists))]].append(pt)

    # Concatenate the list of points in each class into an np array.
    classes = {cl: np.array(classes[cl]) for cl in classes.keys()}
    return classes


def find_new_centers(data_classified):
    """Calculates new centers given the classified data."""
    new_means = []
    for pts in data_classified.values():
        new_means.append(pts.mean(axis=0))
    return np.array(new_means)


def plot_voronoi(ax, title, centers, data_classified):
    """Plot the Voronoi diagram with our data classified."""
    ax.set_title(title)
    vor = Voronoi(centers)
    _ = voronoi_plot_2d(vor, ax=ax, show_points=False, show_vertices=False)
    for center in data_classified.keys():
        ax.scatter(
            data_classified[center][:, 0], data_classified[center][:, 1], c=center, s=10
        )
    ax.scatter(centers[:, 0], centers[:, 1], c="black", s=75, marker="*")
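
Before wiring these together, here's a quick sanity check on classify_points (the centers and points are made up) showing that it hands back a dict keyed by palette color:

# A made-up sanity check: two centers, four points.
toy_centers = np.array([[0.0, 0.0], [1.0, 1.0]])
toy_data = np.array([[0.1, 0.1], [0.2, 0.0], [0.9, 1.0], [1.1, 0.9]])
toy_classes = classify_points(centers=toy_centers, data=toy_data)
for color, pts in toy_classes.items():
    print(color, pts.shape)  # e.g. '#0173b2' (2, 2)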

After this, we can initialize our data and make arbitrary starting centers.  (Scaling the data into [0, 1] means our uniformly random centers land in the same square as the data.)

# Make the data and the initial arbitrary centers.
n_centers = 4
data_raw, _ = make_blobs(
    n_samples=60, n_features=2, centers=n_centers, random_state=12345
)
data = minmax_scale(data_raw, axis=0)
centers_0 = np.random.uniform(size=(n_centers, 2))

And last, we use the functions to step through the algorithm and watch the evolution of the centers.

fig, ax = plt.subplots(nrows=1, ncols=3, figsize=(18, 5), sharex=True, sharey=True)

# Iteration 0 (Initial Centers)
data_classified_0 = classify_points(centers=centers_0, data=data)
plot_voronoi(ax[0], "Iteration 0", centers_0, data_classified_0)

# Iteration 1
centers_1 = find_new_centers(data_classified_0)
data_classified_1 = classify_points(centers=centers_1, data=data)
plot_voronoi(ax[1], "Iteration 1", centers_1, data_classified_1)

# Iteration 2
centers_2 = find_new_centers(data_classified_1)
data_classified_2 = classify_points(centers=centers_2, data=data)
plot_voronoi(ax[2], "Iteration 2", centers_2, data_classified_2)

If you run this, you'll get the image at the top of this blog post.  Look at how quickly it converges to a "nice" set of centers (though the colors aren't necessarily consistent from plot to plot...) and creates four distinct clusters.
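
If you'd rather let it run until it settles, here's a minimal sketch of that loop.  It redoes the assign/update steps inline (the color-keyed dicts above can shuffle center order between iterations), and the tolerance and iteration cap are arbitrary choices of mine:

# A sketch of iterating until the centers stop moving. The assignment and
# update steps are redone inline so the center order stays stable; the
# tolerance and iteration cap are arbitrary.
centers = centers_0.copy()
for iteration in range(1, 101):
    # Assign each point to its nearest center.
    labels = np.array([
        np.argmin([euclidean(pt, c) for c in centers]) for pt in data
    ])
    # Move each center to the mean of its points (keep empty clusters put).
    new_centers = np.array([
        data[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
        for k in range(n_centers)
    ])
    if np.allclose(new_centers, centers, atol=1e-8):
        print(f"Converged after {iteration} iterations.")
        break
    centers = new_centers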

This one is pretty fun to play around with so try out a bunch of different parameters and see what neat pictures you can get!

As a bonus, we used make_blobs here from the sklearn package, but what would happen if we made some other kind of toy dataset?  How does K-Means work with rings...?  Hm...
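
If you want to find out, here's one way to cook up ring-shaped data, using sklearn's make_circles (the parameters are just my guesses at something pretty):

# A sketch of swapping in ring-shaped toy data; parameters are arbitrary.
from sklearn.datasets import make_circles

rings_raw, _ = make_circles(n_samples=60, factor=0.4, noise=0.05, random_state=12345)
rings = minmax_scale(rings_raw, axis=0)

# Run the same iterations as above with `rings` in place of `data` and
# see where the centers end up...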