.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/data_ops/10_data_ops_intro.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_auto_examples_data_ops_10_data_ops_intro.py>`
        to download the full example code, or to run this example in your
        browser via JupyterLite or Binder.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_data_ops_10_data_ops_intro.py:


Introduction to machine-learning pipelines with skrub DataOps
==============================================================

In this example, we show how to use skrub's :ref:`DataOps ` to build a
machine-learning pipeline that records all the operations involved in
pre-processing data and training a model. We also show how to save the model,
load it back, and use it to make predictions on new, unseen data.

This example is an introduction to skrub DataOps, so it does not cover all of
their features: further examples in the gallery (:ref:`data_ops_examples_ref`)
go into more detail on how to use skrub DataOps for more complex tasks.

.. currentmodule:: skrub

.. |fetch_employee_salaries| replace:: :func:`datasets.fetch_employee_salaries`
.. |TableReport| replace:: :class:`TableReport`
.. |var| replace:: :func:`var`
.. |skb.mark_as_X| replace:: :meth:`DataOp.skb.mark_as_X`
.. |skb.mark_as_y| replace:: :meth:`DataOp.skb.mark_as_y`
.. |TableVectorizer| replace:: :class:`TableVectorizer`
.. |skb.apply| replace:: :meth:`.skb.apply() `
.. |HistGradientBoostingRegressor| replace:: :class:`~sklearn.ensemble.HistGradientBoostingRegressor`
.. |.skb.full_report()| replace:: :meth:`.skb.full_report() `
.. |choose_float| replace:: :func:`choose_float`
.. |make_randomized_search| replace:: :meth:`.skb.make_randomized_search `
.. GENERATED FROM PYTHON SOURCE LINES 35-42

The data
---------

We begin by loading the employee salaries dataset, which is a regression
dataset that contains information about employees and their current annual
salaries. By default, the |fetch_employee_salaries| function returns the
training set. We will load the test set later, to evaluate our model on
unseen data.

.. GENERATED FROM PYTHON SOURCE LINES 42-47

.. code-block:: Python

    from skrub.datasets import fetch_employee_salaries

    training_data = fetch_employee_salaries(split="train").employee_salaries

.. GENERATED FROM PYTHON SOURCE LINES 48-52

We can take a look at the dataset using the |TableReport|.
This dataset contains numerical, categorical, and datetime features. The
column ``current_annual_salary`` is the target variable we want to predict.

.. GENERATED FROM PYTHON SOURCE LINES 52-56

.. code-block:: Python

    import skrub

    skrub.TableReport(training_data)

.. raw:: html

    <!-- Interactive skrub TableReport rendered here (requires javascript). -->
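The |TableReport| above is interactive; the same basic facts (shape, per-column types) can also be read off with plain pandas. Below is a minimal sketch on a tiny synthetic frame; the column names are illustrative stand-ins for the dataset's columns, not an exact schema:

```python
import pandas as pd

# Tiny stand-in table mixing the kinds of columns discussed above:
# categorical, datetime, numeric, and the salary target.
toy = pd.DataFrame(
    {
        "department": ["POL", "FRS", "POL"],
        "date_first_hired": pd.to_datetime(
            ["2010-01-04", "2005-06-12", "2018-09-03"]
        ),
        "year_first_hired": [2010, 2005, 2018],
        "current_annual_salary": [69222.18, 97392.47, 42053.83],
    }
)

print(toy.shape)   # (rows, columns)
print(toy.dtypes)  # one dtype per column: object, datetime64, int64, float64
```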
.. GENERATED FROM PYTHON SOURCE LINES 57-66

Assembling our DataOps plan
----------------------------

Our goal is to predict the ``current_annual_salary`` of employees based on
their other features. We will use skrub's DataOps to combine both skrub and
scikit-learn objects into a single DataOps plan, which will allow us to
preprocess the data, train a model, and tune hyperparameters.

We begin by defining a skrub |var|, which is the entry point for our DataOps
plan.

.. GENERATED FROM PYTHON SOURCE LINES 66-69

.. code-block:: Python

    data_var = skrub.var("data", training_data)

.. GENERATED FROM PYTHON SOURCE LINES 70-75

Next, we define the initial features ``X`` and the target variable ``y``.
We use the |skb.mark_as_X| and |skb.mark_as_y| methods to mark these variables
in the DataOps plan. This allows skrub to properly split these objects into
training and validation sets when executing cross-validation or hyperparameter
tuning.

.. GENERATED FROM PYTHON SOURCE LINES 75-78

.. code-block:: Python

    X = data_var.drop("current_annual_salary", axis=1).skb.mark_as_X()
    y = data_var["current_annual_salary"].skb.mark_as_y()

.. GENERATED FROM PYTHON SOURCE LINES 79-84

Our first step is to vectorize the features in ``X``. We will use the
|TableVectorizer| to convert the categorical and numerical features into a
numerical format that can be used by machine-learning algorithms.
We apply the vectorizer to ``X`` using the |skb.apply| method, which allows us
to apply any scikit-learn compatible transformer to a skrub variable.

.. GENERATED FROM PYTHON SOURCE LINES 84-91

.. code-block:: Python

    from skrub import TableVectorizer

    vectorizer = TableVectorizer()

    X_vec = X.skb.apply(vectorizer)
    X_vec

.. raw:: html
<Apply TableVectorizer>
Show graph Var 'data' X: CallMethod 'drop' Apply TableVectorizer

    <!-- Interactive result preview rendered here (requires javascript). -->
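The ``mark_as_X``/``mark_as_y`` pair above wraps an ordinary pandas split: drop the target column to get the features, select it to get the target. A sketch of just that pandas step, on a toy frame with illustrative columns:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "department": ["POL", "FRS", "POL"],
        "year_first_hired": [2010, 2005, 2018],
        "current_annual_salary": [69222.18, 97392.47, 42053.83],
    }
)

# Features: every column except the target.
X = df.drop("current_annual_salary", axis=1)
# Target: the salary column as a Series.
y = df["current_annual_salary"]

print(list(X.columns))  # ['department', 'year_first_hired']
print(y.name)           # 'current_annual_salary'
```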
.. GENERATED FROM PYTHON SOURCE LINES 92-100

By clicking on ``Show graph``, we can see the DataOps plan that has been
created: the plan shows the steps that have been applied to the data so far.

Now that we have the vectorized features, we can proceed to train a model.
We use a scikit-learn |HistGradientBoostingRegressor| to predict the target
variable. We apply the model to the vectorized features using ``.skb.apply``,
and pass ``y`` as the target variable.
Note that the resulting ``predictor`` will show the prediction results on the
preview subsample, but the actual model has not been fitted yet.

.. GENERATED FROM PYTHON SOURCE LINES 100-108

.. code-block:: Python

    from sklearn.ensemble import HistGradientBoostingRegressor

    hgb = HistGradientBoostingRegressor()

    predictor = X_vec.skb.apply(hgb, y=y)
    predictor

.. raw:: html
<Apply HistGradientBoostingRegressor>
Show graph Var 'data' X: CallMethod 'drop' y: GetItem 'current_annual_salary' Apply TableVectorizer Apply HistGradientBoostingRegressor

    <!-- Interactive result preview rendered here (requires javascript). -->
.. GENERATED FROM PYTHON SOURCE LINES 109-120

Now that we have built our entire plan, we can explore it in more detail with
the |.skb.full_report()| method::

    predictor.skb.full_report()

This produces a folder on disk rather than displaying inline in a notebook,
so we do not run it here. But you can
`see the output here <../../_static/employee_salaries_report/index.html>`_.
This method evaluates each step in the plan and shows detailed information
about the operations that are being performed.

.. GENERATED FROM PYTHON SOURCE LINES 122-128

Turning the DataOps plan into a learner, for later reuse
---------------------------------------------------------

Now that we have defined the predictor, we can create a ``learner``, a
standalone object that contains all the steps in the DataOps plan. We fit the
learner, so that it can be used to make predictions on new data.

.. GENERATED FROM PYTHON SOURCE LINES 128-131

.. code-block:: Python

    trained_learner = predictor.skb.make_learner(fitted=True)

.. GENERATED FROM PYTHON SOURCE LINES 132-137

A big advantage of the learner is that it can be pickled and saved to disk,
allowing us to reuse the trained model later without needing to retrain it.
The learner contains all the steps in the DataOps plan, including the fitted
vectorizer and the trained model. We can save it using Python's ``pickle``
module: here we use ``pickle.dumps`` to serialize the learner object into a
byte string.

.. GENERATED FROM PYTHON SOURCE LINES 137-142

.. code-block:: Python

    import pickle

    saved_model = pickle.dumps(trained_learner)

.. GENERATED FROM PYTHON SOURCE LINES 143-144

We can now load the saved model back into memory using ``pickle.loads``.

.. GENERATED FROM PYTHON SOURCE LINES 144-146

.. code-block:: Python

    loaded_model = pickle.loads(saved_model)

.. GENERATED FROM PYTHON SOURCE LINES 147-156

Now, we can make predictions on new data using the loaded model, by passing a
dictionary with the skrub variable names as keys.
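Before moving on to prediction, note that the save/load step above is ordinary Python pickling, with no skrub-specific machinery. A stdlib-only sketch of the round trip, with a toy object standing in for the fitted learner:

```python
import pickle

# Toy stand-in for the fitted learner: any picklable object behaves the same.
toy_learner = {"vectorizer": "fitted-state", "model": [0.1, 0.2, 0.3]}

blob = pickle.dumps(toy_learner)  # serialize to a byte string
restored = pickle.loads(blob)     # rebuild an equivalent object from the bytes

print(isinstance(blob, bytes))    # True
print(restored == toy_learner)    # True: the round trip is lossless
```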
We don't have to create a new variable, as this is done internally by the
learner. In fact, the ``learner`` is similar to a scikit-learn estimator, but
rather than taking ``X`` and ``y`` as inputs, it takes a dictionary (the
"environment"), where each key is the name of one of the skrub variables in
the plan.

We can now get the test set of the employee salaries dataset:

.. GENERATED FROM PYTHON SOURCE LINES 156-158

.. code-block:: Python

    unseen_data = fetch_employee_salaries(split="test").employee_salaries

.. GENERATED FROM PYTHON SOURCE LINES 159-161

Then, we can use the loaded model to make predictions on the unseen data by
passing the environment as a dictionary.

.. GENERATED FROM PYTHON SOURCE LINES 161-165

.. code-block:: Python

    predicted_values = loaded_model.predict({"data": unseen_data})
    predicted_values

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    array([116382.06417108,  45114.33938599,  46680.82086958, ...,
           105486.55018287, 146020.37131876,  73028.94144409], shape=(1228,))

.. GENERATED FROM PYTHON SOURCE LINES 166-168

We can also evaluate the model's performance using the ``score`` method, which
uses the scikit-learn scoring function of the predictor:

.. GENERATED FROM PYTHON SOURCE LINES 168-170

.. code-block:: Python

    loaded_model.score({"data": unseen_data})

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    0.9407037991754476

.. GENERATED FROM PYTHON SOURCE LINES 171-182

Conclusion
----------

In this example, we have briefly introduced skrub DataOps and how they can be
used to build powerful machine-learning pipelines. We have seen how to
preprocess data and train a model, how to save and load the trained model,
and how to make predictions on new data with it.

However, skrub DataOps are significantly more powerful than what we have shown
here: for more advanced examples, see :ref:`data_ops_examples_ref`.


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 6.759 seconds)


.. _sphx_glr_download_auto_examples_data_ops_10_data_ops_intro.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: binder-badge

      .. image:: images/binder_badge_logo.svg
        :target: https://mybinder.org/v2/gh/skrub-data/skrub/main?urlpath=lab/tree/notebooks/auto_examples/data_ops/10_data_ops_intro.ipynb
        :alt: Launch binder
        :width: 150 px

    .. container:: lite-badge

      .. image:: images/jupyterlite_badge_logo.svg
        :target: ../../lite/lab/index.html?path=auto_examples/data_ops/10_data_ops_intro.ipynb
        :alt: Launch JupyterLite
        :width: 150 px

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: 10_data_ops_intro.ipynb <10_data_ops_intro.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: 10_data_ops_intro.py <10_data_ops_intro.py>`

    .. container:: sphx-glr-download sphx-glr-download-zip

      :download:`Download zipped: 10_data_ops_intro.zip <10_data_ops_intro.zip>`

.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_