.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_tutorials/1110_data_ops_intro.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_auto_tutorials_1110_data_ops_intro.py>`
        to download the full example code or to run this example in your
        browser via JupyterLite or Binder.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_tutorials_1110_data_ops_intro.py:

Tutorial: Using Data Ops to build a machine-learning pipeline
=======================================================================

.. currentmodule:: skrub

.. |fetch_employee_salaries| replace:: :func:`datasets.fetch_employee_salaries`
.. |TableReport| replace:: :class:`TableReport`
.. |var| replace:: :func:`var`
.. |skb.mark_as_X| replace:: :meth:`DataOp.skb.mark_as_X`
.. |skb.mark_as_y| replace:: :meth:`DataOp.skb.mark_as_y`
.. |TableVectorizer| replace:: :class:`TableVectorizer`
.. |ToDatetime| replace:: :class:`ToDatetime`
.. |skb.apply| replace:: :meth:`.skb.apply() <DataOp.skb.apply>`
.. |HistGradientBoostingRegressor| replace:: :class:`~sklearn.ensemble.HistGradientBoostingRegressor`
.. |.skb.full_report()| replace:: :meth:`.skb.full_report() <DataOp.skb.full_report>`
.. |choose_float| replace:: :func:`choose_float`
.. |make_randomized_search| replace:: :meth:`.skb.make_randomized_search <DataOp.skb.make_randomized_search>`

This example shows how we can use skrub's :ref:`DataOps ` to build a
machine learning pipeline.

A central challenge of preparing data for machine learning is that the same
data preparation and wrangling operations must later be applied to new data,
at prediction time. Skrub's DataOps build pipelines that blend data wrangling
and machine learning by recording all the operations involved in
pre-processing the data and training models, as well as the state of the
transformers and models used to make predictions.

.. admonition:: What is a state?
    :collapsible: closed

    The state of a transformer or model refers to the internal parameters and
    attributes that are learned or set during fitting. For example, in a
    :class:`~sklearn.preprocessing.StandardScaler`, the state includes the
    mean and standard deviation computed from the training data. In a
    pre-processing transformer like |ToDatetime|, the state includes the
    datetime format inferred from the data it was fitted on. In a machine
    learning model like |HistGradientBoostingRegressor|, the state includes
    the parameters fitted during training.

The result of building a DataOps plan is a *learner*, an object with an
interface similar to that of a scikit-learn estimator, but which contains all
the steps of the data preparation and model training process, along with the
state of all the transformers and models. This allows us to save the learner,
load it back later, and use it to make predictions on new data.

This example is an introduction to skrub DataOps, so it does not cover all of
their features. Further examples in the gallery (:ref:`data_ops_examples_ref`)
go into more detail on using skrub DataOps for more complex tasks.

.. GENERATED FROM PYTHON SOURCE LINES 60-67

The data
---------

We begin by loading the employee salaries dataset, a regression dataset that
contains information about employees and their current annual salaries.
By default, the |fetch_employee_salaries| function returns the training set.
We will load the test set later, to evaluate our model on unseen data.

.. GENERATED FROM PYTHON SOURCE LINES 67-76

.. code-block:: Python

    import pandas as pd

    from skrub.datasets import fetch_employee_salaries

    training_data = pd.read_csv(
        fetch_employee_salaries(split="train").employee_salaries_path
    )

.. GENERATED FROM PYTHON SOURCE LINES 77-81

We can take a look at the dataset using the |TableReport|.
This dataset contains numerical, categorical, and datetime features.
The column ``current_annual_salary`` is the target variable we want to
predict.

.. GENERATED FROM PYTHON SOURCE LINES 81-85

.. code-block:: Python

    import skrub

    skrub.TableReport(training_data)
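To make the notion of "state" from the admonition above concrete, here is a
minimal scikit-learn sketch (independent of skrub, with made-up numbers)
showing the attributes a :class:`~sklearn.preprocessing.StandardScaler` learns
during ``fit`` and then reuses on unseen data:

.. code-block:: Python

    import numpy as np
    from sklearn.preprocessing import StandardScaler

    # Fitting is where the "state" is learned.
    X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
    scaler = StandardScaler().fit(X_train)

    # The state: mean and scale computed from the training data.
    print(scaler.mean_)   # [2.5]
    print(scaler.scale_)  # [1.1180...]

    # The same state is reused, unchanged, to transform new data.
    print(scaler.transform(np.array([[5.0]])))

A learner stores this kind of state for every step of the plan, which is why
it can be applied as-is to data it has never seen.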
.. GENERATED FROM PYTHON SOURCE LINES 86-95

Assembling our DataOps plan
----------------------------

Our goal is to predict the ``current_annual_salary`` of employees based on
their other features. We will use skrub's DataOps to combine both skrub and
scikit-learn objects into a single DataOps plan, which allows us to
preprocess the data, train a model, and tune hyperparameters.

We begin by defining a skrub |var|, which is the entry point of our DataOps
plan.

.. GENERATED FROM PYTHON SOURCE LINES 95-98

.. code-block:: Python

    data_var = skrub.var("data", training_data)

.. GENERATED FROM PYTHON SOURCE LINES 99-104

Next, we define the initial features ``X`` and the target variable ``y``.
We use the |skb.mark_as_X| and |skb.mark_as_y| methods to mark these
variables in the DataOps plan. This allows skrub to split these objects
correctly into training and validation sets when running cross-validation or
hyperparameter tuning.

.. GENERATED FROM PYTHON SOURCE LINES 104-107

.. code-block:: Python

    X = data_var.drop("current_annual_salary", axis=1).skb.mark_as_X()
    y = data_var["current_annual_salary"].skb.mark_as_y()

.. GENERATED FROM PYTHON SOURCE LINES 108-113

Our first step is to vectorize the features in ``X``. We will use the
|TableVectorizer| to convert the categorical and numerical features into a
numerical format that machine learning algorithms can use. We apply the
vectorizer to ``X`` using the |skb.apply| method, which allows us to apply
any scikit-learn compatible transformer to a skrub variable.

.. GENERATED FROM PYTHON SOURCE LINES 113-120

.. code-block:: Python

    from skrub import TableVectorizer

    vectorizer = TableVectorizer()
    X_vec = X.skb.apply(vectorizer)
    X_vec
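As a quick aside, we can also see what the |TableVectorizer| does outside of a
DataOps plan. The tiny table below is made up for illustration (the column
names are our own, not taken from the dataset):

.. code-block:: Python

    import pandas as pd
    from skrub import TableVectorizer

    # A made-up table mixing categorical, datetime-like, and numeric columns.
    toy = pd.DataFrame(
        {
            "department": ["POL", "FRS", "POL", "HHS"],
            "hire_date": ["2010-01-04", "2015-06-15", "2012-09-01", "2020-02-20"],
            "years_worked": [12, 7, 10, 3],
        }
    )

    # fit_transform turns every column into numeric features: encoded
    # categories, and extracted components (year, month, ...) for dates.
    toy_vec = TableVectorizer().fit_transform(toy)
    print(toy_vec.columns.tolist())

The exact output columns depend on the encoders the vectorizer selects, but
the result is always a fully numeric table.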
.. GENERATED FROM PYTHON SOURCE LINES 121-129

In an interactive display of ``X_vec``, clicking ``Show graph`` reveals the
DataOps plan that has been created: it shows the steps that have been applied
to the data so far.

Now that we have the vectorized features, we can proceed to train a model. We
use a scikit-learn |HistGradientBoostingRegressor| to predict the target
variable. We apply the model to the vectorized features using ``.skb.apply``,
passing ``y`` as the target variable. Note that the resulting ``predictor``
shows prediction results on the preview subsample, but the model will be
fitted properly when we create the learner.

.. GENERATED FROM PYTHON SOURCE LINES 129-137

.. code-block:: Python

    from sklearn.ensemble import HistGradientBoostingRegressor

    hgb = HistGradientBoostingRegressor()
    predictor = X_vec.skb.apply(hgb, y=y)
    predictor
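The steps above can be condensed into a small, self-contained sketch on a
synthetic table (the data and column names here are made up for illustration;
the API calls are the same ones used in this tutorial):

.. code-block:: Python

    import numpy as np
    import pandas as pd
    import skrub
    from sklearn.ensemble import HistGradientBoostingRegressor
    from skrub import TableVectorizer

    # A synthetic stand-in for the employee salaries table.
    rng = np.random.default_rng(0)
    n = 80
    toy = pd.DataFrame(
        {
            "department": rng.choice(["POL", "FRS", "HHS"], size=n),
            "years_worked": rng.integers(1, 30, size=n),
        }
    )
    toy["salary"] = 30_000 + 2_000 * toy["years_worked"] + rng.normal(0, 1_000, size=n)

    # The same plan as above: var -> mark X / y -> vectorize -> regressor.
    data = skrub.var("data", toy)
    X = data.drop("salary", axis=1).skb.mark_as_X()
    y = data["salary"].skb.mark_as_y()
    pred = X.skb.apply(TableVectorizer()).skb.apply(HistGradientBoostingRegressor(), y=y)

    # Fit a learner and predict on new rows via the environment dictionary.
    learner = pred.skb.make_learner(fitted=True)
    new_rows = pd.DataFrame({"department": ["POL", "HHS"], "years_worked": [5, 20]})
    print(learner.predict({"data": new_rows}))

This runs entirely in memory, which makes it a convenient template to adapt to
your own tables.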
.. GENERATED FROM PYTHON SOURCE LINES 138-149

Now that we have built our entire plan, we can explore it in more detail with
the |.skb.full_report()| method::

    predictor.skb.full_report()

This produces a folder on disk rather than displaying inline in a notebook,
so we do not run it here. But you can
`see the output here <../../_static/employee_salaries_report/index.html>`_.
This method evaluates each step in the plan and shows detailed information
about the operations that are being performed.

.. GENERATED FROM PYTHON SOURCE LINES 151-157

Turning the DataOps plan into a learner, for later reuse
---------------------------------------------------------

Now that we have defined the predictor, we can create a ``learner``, a
standalone object that contains all the steps of the DataOps plan. We fit the
learner so that it can be used to make predictions on new data.

.. GENERATED FROM PYTHON SOURCE LINES 157-160

.. code-block:: Python

    trained_learner = predictor.skb.make_learner(fitted=True)

.. GENERATED FROM PYTHON SOURCE LINES 161-166

A big advantage of the learner is that it can be pickled and saved to disk,
allowing us to reuse the trained model later without needing to retrain it.
The learner contains all the steps of the DataOps plan, including the fitted
vectorizer and the trained model. We can save it using Python's ``pickle``
module: here, ``pickle.dumps`` serializes the learner into a byte string.

.. GENERATED FROM PYTHON SOURCE LINES 166-171

.. code-block:: Python

    import pickle

    saved_model = pickle.dumps(trained_learner)

.. GENERATED FROM PYTHON SOURCE LINES 172-173

We can now load the saved model back into memory using ``pickle.loads``.

.. GENERATED FROM PYTHON SOURCE LINES 173-175

.. code-block:: Python

    loaded_model = pickle.loads(saved_model)

.. GENERATED FROM PYTHON SOURCE LINES 176-186

Now, we can make predictions on new data using the loaded model, by passing a
dictionary with the skrub variable names as keys.
We don't have to create a new variable, as the learner does this internally.
In fact, the learner is similar to a scikit-learn estimator, but rather than
taking ``X`` and ``y`` as inputs, it takes a dictionary (the "environment"),
where each key corresponds to the name of a skrub variable in the plan (in
this case, ``"data"``).

We can now get the test set of the employee salaries dataset:

.. GENERATED FROM PYTHON SOURCE LINES 186-188

.. code-block:: Python

    unseen_data = pd.read_csv(fetch_employee_salaries(split="test").employee_salaries_path)

.. GENERATED FROM PYTHON SOURCE LINES 189-191

Then, we can use the loaded model to make predictions on the unseen data by
passing a dictionary with the variable name as the key.

.. GENERATED FROM PYTHON SOURCE LINES 191-195

.. code-block:: Python

    predicted_values = loaded_model.predict({"data": unseen_data})
    predicted_values

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    array([115840.66895998,  45209.29098473,  49350.47329576, ...,
            97004.31811655, 147905.19168078,  76249.07726998], shape=(1228,))

.. GENERATED FROM PYTHON SOURCE LINES 196-198

We can also evaluate the model's performance with the ``score`` method, which
uses the scikit-learn scoring function of the final predictor:

.. GENERATED FROM PYTHON SOURCE LINES 198-200

.. code-block:: Python

    loaded_model.score({"data": unseen_data})

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    0.9422822507801908

.. GENERATED FROM PYTHON SOURCE LINES 201-211

Conclusion
----------

In this example, we have briefly introduced skrub DataOps and how they can be
used to build powerful machine learning pipelines. We have shown how to
preprocess data and train a model, how to save and load the trained model,
and how to make predictions on new data.

However, skrub DataOps are significantly more powerful than what we have
shown here. For more advanced examples, see :ref:`data_ops_examples_ref`.

.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 15.090 seconds)

**Estimated memory usage:** 520 MB

.. _sphx_glr_download_auto_tutorials_1110_data_ops_intro.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: binder-badge

      .. image:: images/binder_badge_logo.svg
        :target: https://mybinder.org/v2/gh/skrub-data/skrub/0.9.0?urlpath=lab/tree/notebooks/auto_tutorials/1110_data_ops_intro.ipynb
        :alt: Launch binder
        :width: 150 px

    .. container:: lite-badge

      .. image:: images/jupyterlite_badge_logo.svg
        :target: ../lite/lab/index.html?path=auto_tutorials/1110_data_ops_intro.ipynb
        :alt: Launch JupyterLite
        :width: 150 px

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: 1110_data_ops_intro.ipynb <1110_data_ops_intro.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: 1110_data_ops_intro.py <1110_data_ops_intro.py>`

    .. container:: sphx-glr-download sphx-glr-download-zip

      :download:`Download zipped: 1110_data_ops_intro.zip <1110_data_ops_intro.zip>`

.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_