Introduction to machine-learning pipelines with skrub DataOps#

In this example, we show how we can use Skrub’s DataOps to build a machine learning pipeline that records all the operations involved in pre-processing data and training a model. We will also show how to save the model, load it back, and then use it to make predictions on new, unseen data.

This example is meant to be an introduction to skrub DataOps, and as such it does not cover all of their features: further examples in the Skrub DataOps gallery go into more detail on how to use DataOps for more complex tasks.

The data#

We begin by loading the employee salaries dataset, which is a regression dataset that contains information about employees and their current annual salaries. By default, the datasets.fetch_employee_salaries() function returns the training set. We will load the test set later, to evaluate our model on unseen data.

from skrub.datasets import fetch_employee_salaries

training_data = fetch_employee_salaries(split="train").employee_salaries

We can take a look at the dataset using the TableReport. This dataset contains numerical, categorical, and datetime features. The column current_annual_salary is the target variable we want to predict.
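
For example, a report for the training data can be generated like this:

from skrub import TableReport

TableReport(training_data)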


Assembling our DataOps plan#

Our goal is to predict the current_annual_salary of employees based on their other features. We will use skrub’s DataOps to combine both skrub and scikit-learn objects into a single DataOps plan, which will allow us to preprocess the data, train a model, and tune hyperparameters.

We begin by defining a skrub var(), which is the entry point for our DataOps plan.
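
In code, this looks like the following; we name the variable "data", which is the key we will use later when passing new data to the learner:

import skrub

data_var = skrub.var("data", training_data)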

Next, we define the initial features X and the target variable y. We use the DataOp.skb.mark_as_X() and DataOp.skb.mark_as_y() methods to mark these variables in the DataOps plan. This allows skrub to properly split these objects into training and validation sets when executing cross-validation or hyperparameter tuning.

X = data_var.drop("current_annual_salary", axis=1).skb.mark_as_X()
y = data_var["current_annual_salary"].skb.mark_as_y()

Our first step is to vectorize the features in X. We will use the TableVectorizer to convert the categorical and numerical features into a numerical format that can be used by machine learning algorithms. We apply the vectorizer to X using the .skb.apply() method, which allows us to apply any scikit-learn compatible transformer to the skrub variable.

from skrub import TableVectorizer

vectorizer = TableVectorizer()

X_vec = X.skb.apply(vectorizer)
X_vec
<Apply TableVectorizer>
[Interactive report. DataOps graph: Var 'data' → X: CallMethod 'drop' → Apply TableVectorizer]

By clicking on Show graph, we can see the DataOps plan that has been created: the plan shows the steps that have been applied to the data so far.

Now that we have the vectorized features, we can proceed to train a model. We use a scikit-learn HistGradientBoostingRegressor to predict the target variable. We apply the model to the vectorized features using .skb.apply, and pass y as the target variable. Note that the resulting predictor will show the prediction results on the preview subsample, but the actual model has not been fitted yet.
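
In code, this step might look like the following sketch; we name the result predictions, the name used in the rest of this example:

from sklearn.ensemble import HistGradientBoostingRegressor

hgb = HistGradientBoostingRegressor()
# apply the regressor to the vectorized features, with y as the prediction target
predictions = X_vec.skb.apply(hgb, y=y)
predictions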

<Apply HistGradientBoostingRegressor>
[Interactive report. DataOps graph: Var 'data' → X: CallMethod 'drop', y: GetItem 'current_annual_salary' → Apply TableVectorizer → Apply HistGradientBoostingRegressor]

Now that we have built our entire plan, we can explore it in more detail with the .skb.full_report() method:

predictions.skb.full_report()

This produces a folder on disk rather than displaying inline in a notebook, so we do not run it here.

This method evaluates each step in the plan and shows detailed information about the operations that are being performed.

Turning the DataOps plan into a learner, for later reuse#

Now that we have defined the predictor, we can create a learner, a standalone object that contains all the steps in the DataOps plan. By passing fitted=True, we fit the learner right away, so that it can be used to make predictions on new data.

trained_learner = predictions.skb.make_learner(fitted=True)

A big advantage of the learner is that it can be pickled and saved to disk, allowing us to reuse the trained model later without needing to retrain it. The learner contains all steps in the DataOps plan, including the fitted vectorizer and the trained model. We can save it using Python’s pickle module: here we use pickle.dumps to serialize the learner object into a byte string.
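
For example (saved_model is a hypothetical name for the resulting byte string):

import pickle

# serialize the fitted learner, including the vectorizer and the trained model
saved_model = pickle.dumps(trained_learner)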

We can now load the saved model back into memory using pickle.loads.
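
Continuing the sketch above, loaded_model is the name used in the rest of this example:

loaded_model = pickle.loads(saved_model)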

Now, we can make predictions on new data using the loaded model, by passing a dictionary with the skrub variable names as keys. We don’t have to create a new variable, as this will be done internally by the learner. In fact, the learner is similar to a scikit-learn estimator, but rather than taking X and y as inputs, it takes a dictionary (the “environment”), where each key is the name of one of the skrub variables in the plan.

We can now get the test set of the employee salaries dataset:

unseen_data = fetch_employee_salaries(split="test").employee_salaries

Then, we can use the loaded model to make predictions on the unseen data by passing the environment as a dictionary:
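
# the "data" key matches the name given to the skrub variable at the start of the plan
loaded_model.predict({"data": unseen_data})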

array([116382.06417108,  45114.33938599,  46680.82086958, ...,
       105486.55018287, 146020.37131876,  73028.94144409], shape=(1228,))

We can also evaluate the model’s performance using the score method, which relies on the default scikit-learn scorer of the underlying predictor:

loaded_model.score({"data": unseen_data})
0.9407037991754476

Conclusion#

In this example, we have briefly introduced skrub DataOps and how they can be used to build powerful machine learning pipelines. We have seen how to preprocess data and train a model, how to save and load the trained model, and how to make predictions on new data with it.

However, skrub DataOps are significantly more powerful than what we have shown here: for more advanced examples, see Skrub DataOps.
