In this post we're going to do a cool thing with a small amount of code: we're going to make a Random Forest Classifier in Python using Sklearn.

We're going to do this in a few steps:

  1. Load the data and column names.
  2. Create the classifier and fit it with training data.
  3. Score the classifier.

Let's get to it.


Loading Things

We'll use the Iris data set because it's a classic toy example.  We're going to make a slight modification to turn it into a binary classification problem.  Let's import the things we need and then load the data.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, recall_score, precision_score

# Reading the data set's page on the UCI repository, we find these are the columns.
columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "target"]

# The "names" param in read_csv sets the column names (the raw file has no header row).
data = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data", names=columns)

# Since data's last column has three possible values and we want a binary
# classification, let's reframe the target as "is this an Iris setosa?"
data["target"] = data["target"].apply(lambda s: s == "Iris-setosa")

Great.  You can try printing out data now if you want and see what it looks like.  Notice the last line here re-creates the target column by applying a function that asks, "Is this value Iris-setosa or not?"  The column now holds True (setosa) and False (not setosa), which sklearn happily treats as 1s and 0s.
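If you want a quick sanity check on the relabeling, value_counts shows the class balance.  Iris ships with 50 rows per species, so we should see 100 False and 50 True:

# Iris has 50 examples of each of three species, so after the relabeling
# we expect 100 False (not setosa) and 50 True (setosa).
print(data["target"].value_counts())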

Note also that sklearn has a built-in way to load the iris dataset, but I wanted to show it's possible to use pandas to read CSVs from URLs.  This comes in handy quite a bit.
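For completeness, here's a sketch of the built-in route (as_frame=True needs a reasonably recent sklearn, and note that the built-in version uses different column names and a numeric 0/1/2 target):

from sklearn.datasets import load_iris

# as_frame=True returns the data as a pandas DataFrame rather than numpy arrays.
iris = load_iris(as_frame=True)
data_builtin = iris.frame  # feature columns plus a numeric "target" column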


Making the Fitted Classifier

This part requires us to split up our data a bit (or use a cross-validation method, which we won't do in this post).  Luckily, Sklearn makes this pretty easy.

X = data.drop("target", axis=1)
y = data["target"]

# Creates a train/test split.  The random state is for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=1234)

# Making our model and then fitting it.
clf = RandomForestClassifier().fit(X_train, y_train)

Notice here that we need to create an instance of RandomForestClassifier, which takes the model's hyperparameters, and then we call the fit method on it with the training data.  One common error I've seen is people trying to do RandomForestClassifier(X_train, y_train), which will not work: the data goes to fit, not to the constructor.
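If you do want to tweak the model, the hyperparameters go in the constructor and only the data goes to fit.  Here's a sketch with a couple of common knobs (the values are illustrative, not tuned):

# Hyperparameters go in the constructor; fit only takes the data.
clf = RandomForestClassifier(
    n_estimators=200,   # number of trees in the forest
    max_depth=5,        # cap tree depth to rein in overfitting
    random_state=1234,  # reproducible forests
).fit(X_train, y_train)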

Okay, we've got a trained classifier now.  Neat.  Let's see how it did.
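Before we do, a quick aside on the cross-validation route mentioned earlier: cross_val_score skips the manual split entirely by fitting and scoring the model across several folds.  A minimal sketch, assuming the same X and y from above:

from sklearn.model_selection import cross_val_score

# 5-fold CV: fits five forests, each scored on its own held-out fold.
cv_scores = cross_val_score(RandomForestClassifier(random_state=1234), X, y, cv=5)
print(cv_scores.mean())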


Score the Classifier

We're going to use some standard metrics here and display them in a nice way using a pandas dataframe.

# Get the model's predictions on the held-out test set.
y_predicted = clf.predict(X_test)

metrics = [f1_score, precision_score, recall_score]
scores = [(metric.__name__, metric(y_test, y_predicted)) for metric in metrics]
scores_df = pd.DataFrame(scores, columns=["metric", "value"])
print(scores_df)

One thing here that might look strange: we're taking a list of functions (f1_score, etc.) and then using them in the list comprehension.  The __name__ attribute gives us the name of the metric (e.g., "f1_score") as a string, and then we call the metric on the true and predicted labels.  That gives us a list of (name, score) tuples.  We plug that into the DataFrame, which gives us a nicely formed table.  Mine looks like this in the terminal:

            metric  value
0         f1_score    1.0
1  precision_score    1.0
2     recall_score    1.0

But this brings up another point.  We did really well, maybe too well.  The iris data set splits pretty cleanly, so if we want to see the model do a bit worse we can go back and use a larger test set.  Try setting test_size to 0.95 and see what happens.  Why does this happen?
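Concretely, that experiment is a one-line change.  A sketch (with test_size=0.95 only a handful of rows are left for training, so expect the scores to drop and to bounce around from seed to seed):

# Same pipeline, but train on only 5% of the data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.95, random_state=1234)

clf = RandomForestClassifier().fit(X_train, y_train)
y_predicted = clf.predict(X_test)
print(f1_score(y_test, y_predicted))

With that little training data the forest may not even see both classes during training, which is one reason the metrics suffer.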


The Whole Thing In One Place

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, recall_score, precision_score

columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "target"]
data = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data", names=columns)
data["target"] = data["target"].apply(lambda s: s == "Iris-setosa")

X = data.drop("target", axis=1)
y = data["target"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=1234)

clf = RandomForestClassifier().fit(X_train, y_train)

y_predicted = clf.predict(X_test)

metrics = [f1_score, precision_score, recall_score]
scores = [(metric.__name__, metric(y_test, y_predicted)) for metric in metrics]
scores_df = pd.DataFrame(scores, columns=["metric", "value"])
print(scores_df)

Pretty tiny, right?