Pretend you have a data set, and one column, mode, looks like this:

mode
-----
high
high
high
low
off
off
off
low


Maybe these are the settings for a machine, maybe they're the settings for a fan, who knows.  One thing that I see quite a bit in newly minted data scientists is leaping to one-hot encode any categorical variable that doesn't have too many unique values.

Sometimes this is okay.  But in this case you're losing something: you're losing the fact that, in some sense, off < low < high.  In other words, these values are ordered, and so we call these ordinal categorical variables.
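To see what gets lost, here is a minimal sketch of one-hot encoding the column above with pandas' get_dummies.  The dataframe is rebuilt by hand from the sample values, and the name one_hot is just for illustration.

```python
import pandas as pd

# Sample data copied from the mode column above.
df = pd.DataFrame({"mode": ["high", "high", "high", "low", "off", "off", "off", "low"]})

# One indicator column per category (True/False or 0/1 depending on pandas version).
one_hot = pd.get_dummies(df["mode"])
print(one_hot)

# Each row has exactly one indicator set, and the three columns are
# interchangeable as far as a model is concerned: nothing in this
# encoding says that off < low < high.
print(one_hot.sum(axis=1).tolist())
```

Every category ends up equally "far" from every other one, which is exactly the ordering information we want to keep.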

What can we do?  I've seen some pretty wild transformations and functions which encode these, but, ultimately, you need the following: the list of values and their approximate relative weight with respect to each other.  That is, maybe off is 0, maybe low is 1, but maybe high is so much more wild that we want to count it a bit more, so we distance it from low by making its value 5.  That way instead of

off
low
high


we have

0
1
5


Practically, the values you choose will vary — some people just use 0, 1, 2, 3..., and that might be fine.  Either way, we now have a categorical-to-numeric mapping!
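If plain 0, 1, 2 ranks are enough, one possible sketch is to let pandas assign them through an ordered Categorical.  The names mode_order and mode_rank are made up for this example, and the order in the list is an assumption you supply yourself.

```python
import pandas as pd

# Sample data copied from the mode column above.
df = pd.DataFrame({"mode": ["high", "high", "high", "low", "off", "off", "off", "low"]})

# The order of this list defines the ranking: off < low < high.
mode_order = ["off", "low", "high"]

# An ordered Categorical stores each value as an integer code that is
# its position in mode_order.
ordered = pd.Categorical(df["mode"], categories=mode_order, ordered=True)
df["mode_rank"] = ordered.codes

print(df)
```

One thing to watch for: any value that is not in mode_order gets the code -1 rather than raising an error, so it is worth checking for -1 before modeling.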

My method of mapping this in Python is the following:

# Suppose df is your dataframe with the mode column.

mode_mapping_dict = {
    'off': 0,
    'low': 1,
    'high': 5
}

df['mode_numeric'] = df['mode'].apply(lambda x: mode_mapping_dict[x])


What you get back is pretty much what you'd expect:

mode | mode_numeric
-----+-------------
high | 5
high | 5
high | 5
low  | 1
off  | 0
off  | 0
off  | 0
low  | 1


At this point, you can drop and forget about the mode column and use the mode_numeric column (which has preserved the order!) in your models.
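For completeness, here is a self-contained variation on the whole flow.  It is a sketch, not the only way to do it: it swaps the apply/lambda for Series.map, adds an explicit check for values the dict doesn't cover (my addition, not part of the method above), and then drops the original column.  The dataframe is rebuilt by hand from the sample values.

```python
import pandas as pd

# Sample data copied from the mode column above.
df = pd.DataFrame({"mode": ["high", "high", "high", "low", "off", "off", "off", "low"]})

mode_mapping_dict = {"off": 0, "low": 1, "high": 5}

# Series.map looks each value up in the dict; anything without an
# entry becomes NaN instead of raising a KeyError.
df["mode_numeric"] = df["mode"].map(mode_mapping_dict)

# Fail loudly if any row ended up without a numeric value
# (this also catches rows where mode itself was missing).
unmapped = df.loc[df["mode_numeric"].isna(), "mode"].unique()
if len(unmapped) > 0:
    raise ValueError(f"No mapping for: {list(unmapped)}")

# The numeric column now carries the order, so the original can go.
df = df.drop(columns=["mode"])
print(df)
```

The apply/lambda version fails with a KeyError the moment it meets an unknown value, while map quietly produces NaN, so the explicit check keeps that safety net.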