(In this post I'll assume you've used scikit-learn a bit, so these things will be somewhat familiar to you.  You don't need to be an expert; you only need to be somewhat familiar with the interface.)

The most annoying thing about feature engineering and modeling during EDA is that everything is all over the place.  Sometimes you'll forget to apply a transform to your test set, the model gives bizarre answers, and you spend half a day trying to figure out why.  Or, you know, maybe you're better than me at that stuff.

Either way, Pipelines make your code more readable and make things easier on you.  Pipelines make sense: you define all your transforms, define the pipeline, set the parameters, and then run the code.  Nothing extra.  It Just Works.

Here's an example.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.impute import SimpleImputer
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

X, y = make_classification()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# Feature Transforms: just define them without params for now.
imputer = SimpleImputer()
scaler = MinMaxScaler()
clf = LogisticRegression()

# Make the pipeline.  It's a list of ("name", estimator) pairs,
# where each estimator has .fit and .transform (or .predict for the final step).
# The classifier (or whatever has .predict) must be last in the pipeline.

pl = Pipeline([
    ("imputer", imputer),
    ("scaler", scaler),
    ("classifier", clf),
])

# Here we set parameters for the things in our pipeline.
# Each argument is of the form stepname__argument=value (note the double underscore),
# where stepname is the name we gave that step in the pipeline.
# For example, above we named a step "imputer", so to change its strategy arg to median we would write:
# imputer__strategy="median".

pl.set_params(imputer__strategy="mean", scaler__feature_range=(0, 2))
pl.fit(X_train, y_train)

# f1_score expects (y_true, y_pred), in that order.
print("f1: ", f1_score(y_test, pl.predict(X_test)))


Awesome!  If you don't count all the wild commenting, this is a concise way to feature engineer.  And if we decide to change our scaler or params, everything is nice and centralized.
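To make the "centralized" point concrete, here's a minimal sketch of swapping out the scaler and changing a parameter in one place.  It assumes the same three step names as above; StandardScaler, the "median" strategy, and random_state=0 are just picks for illustration, not a recommendation.

```python
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# random_state=0 is an arbitrary choice so the sketch is repeatable.
X, y = make_classification(random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

pl = Pipeline([
    ("imputer", SimpleImputer()),
    ("scaler", MinMaxScaler()),
    ("classifier", LogisticRegression()),
])

# Replace a whole step by its name, and tweak a nested parameter in the same call.
pl.set_params(scaler=StandardScaler(), imputer__strategy="median")
pl.fit(X_train, y_train)

# named_steps lets you look up each step by the name you gave it.
print(type(pl.named_steps["scaler"]).__name__)
print(pl.named_steps["imputer"].strategy)
print("accuracy: ", pl.score(X_test, y_test))
```

Nothing else in the code has to change: the fit and predict calls stay the same no matter which scaler is sitting in the middle.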

Custom Functions?

Custom functions: a big reason (I think) why people avoid pipelines.  "I don't know how I would put this part in!" is the usual refrain.

But it's pretty easy nowadays thanks to our friend FunctionTransformer from the sklearn.preprocessing module.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.impute import SimpleImputer
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, FunctionTransformer

def custom_function(x):
    """Takes in an array, spits out an array of the same shape (applied elementwise)."""
    return x**3 + 1

X, y = make_classification()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

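imputer = SimpleImputer()
# Wrap plain functions so they get .fit and .transform like any other step.
# (A separate name avoids overwriting the custom_function defined above.)
custom_transformer = FunctionTransformer(custom_function)
abs_value = FunctionTransformer(abs)
scaler = MinMaxScaler()
clf = LogisticRegression()

pl = Pipeline([
    ("imputer", imputer),
    ("scaler", scaler),
    ("custom_thing", custom_transformer),
    ("abs_value", abs_value),
    ("classifier", clf),
])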

pl.set_params(imputer__strategy="mean", scaler__feature_range=(0, 2))
pl.fit(X_train, y_train)

# f1_score expects (y_true, y_pred), in that order.
print("f1: ", f1_score(y_test, pl.predict(X_test)))


That's it.  We've used our own custom function as well as a built-in Python function.  Most NumPy and SciPy functions that take an array and return an array will work with little or no modification.  It's also possible to use functions that need extra arguments by fixing those arguments with functools.partial.
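Here's a rough sketch of both ideas: a NumPy function dropped straight in, and a function that needs extra arguments.  The choice of np.log1p and np.clip, the clip bounds, and the little X_demo array are all made up for illustration.  The kw_args version is an alternative to partial that FunctionTransformer supports directly.

```python
from functools import partial

import numpy as np
from sklearn.preprocessing import FunctionTransformer

# A NumPy function that maps array -> array can be wrapped as-is.
log_tf = FunctionTransformer(np.log1p)

# np.clip needs bounds, so pin them down with partial...
clip_tf = FunctionTransformer(partial(np.clip, a_min=0.0, a_max=1.0))

# ...or hand them to FunctionTransformer via its kw_args parameter.
clip_kw_tf = FunctionTransformer(np.clip, kw_args={"a_min": 0.0, "a_max": 1.0})

# Tiny made-up array just to see the transforms in action.
X_demo = np.array([[-0.5, 0.25], [0.75, 3.0]])

clipped = clip_tf.fit_transform(X_demo)
clipped_kw = clip_kw_tf.fit_transform(X_demo)
logged = log_tf.fit_transform(clipped)

print(clipped)
print(logged)
```

Either of these transformers can then go into the Pipeline list as a ("name", transformer) pair, same as the ones above.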

The hardest part about using Pipelines is getting motivated to stop putting all of your functions in fifteen different unordered Jupyter Notebook cells and start using Pipelines!