.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/data_ops/15_use_case.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_auto_examples_data_ops_15_use_case.py>`
        to download the full example code, or to run this example in your browser
        via JupyterLite or Binder.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_data_ops_15_use_case.py:

Use case: developing locally and deploying to production
========================================================

.. GENERATED FROM PYTHON SOURCE LINES 7-32

As a team of data scientists, we are tasked with predicting whether an email
is potentially malicious (i.e., spam or phishing). We develop and test our
models locally, either in a Jupyter notebook or in a Python script. Once we
are satisfied with the model's performance, we move on to deploying it.

In this use case, every time the email provider receives a new email, they
want to check whether it is spam before displaying it in the recipient's
inbox. To achieve this, they plan to integrate a machine learning model
within a microservice. This microservice accepts an email's data as a JSON
payload and returns a score between 0 and 1, indicating the likelihood that
the email is spam.

Rewriting the entire data pipeline when moving from model validation to
production deployment is both error-prone and inefficient. Instead, we prefer
to load an object that encapsulates the same processing pipeline used during
model development. This is where the :class:`~skrub.SkrubLearner` can help.

Adopting this workflow also has the benefit of forcing us to clearly define
the type of data that will be available at the input of the microservice. It
helps ensure we build models that rely only on information accessible at this
specific point in the product pipeline.
For example, since we want to detect spam before the email reaches the
recipient's inbox, we cannot use features that only become available after
the recipient opens the email.

Since this example focuses on the pipeline construction itself, we won't
evaluate the model's performance.

.. GENERATED FROM PYTHON SOURCE LINES 34-38

Generating the training data
----------------------------

In this section, we define a few functions that help us generate the training
data in dictionary form. We generate a fully random dataset.

.. GENERATED FROM PYTHON SOURCE LINES 38-73

.. code-block:: Python

    import random
    import string
    import uuid
    from datetime import datetime, timedelta

    import numpy as np


    def generate_id():
        return str(uuid.uuid4())


    def generate_email():
        length = random.randint(5, 10)
        username = "".join(random.choice(string.ascii_lowercase) for _ in range(length))
        domain = ["google", "yahoo", "whatever"]
        tld = ["fr", "en", "com", "net"]
        return f"{username}@{random.choice(domain)}.{random.choice(tld)}"


    def generate_datetime():
        random_seconds = random.randint(0, int(timedelta(days=2).total_seconds()))
        return datetime.now() - timedelta(seconds=random_seconds)


    def generate_text(min_str_length, max_str_length):
        random_length = random.randint(min_str_length, max_str_length)
        return "".join(
            random.choice(string.ascii_letters + string.digits + string.punctuation)
            for _ in range(random_length)
        )

.. GENERATED FROM PYTHON SOURCE LINES 74-75

We generate 1000 training samples and store them in a list of dictionaries:

.. GENERATED FROM PYTHON SOURCE LINES 75-78

.. code-block:: Python

    n_samples = 1000

.. GENERATED FROM PYTHON SOURCE LINES 79-82

In this use case, the emails to score once the model is in production arrive
as JSON, not as a dataframe. As a result, our training data should also be a
list of dictionaries.

.. GENERATED FROM PYTHON SOURCE LINES 82-99

.. code-block:: Python

    X = [
        {
            "id": generate_id(),
            "sender": generate_email(),
            "title": generate_text(max_str_length=10, min_str_length=2),
            "content": generate_text(max_str_length=100, min_str_length=10),
            "date": generate_datetime(),
            "cc_emails": [generate_email() for _ in range(random.randint(0, 5))],
        }
        for _ in range(n_samples)
    ]

    # Generate an array of 0s and 1s to represent the target variable.
    y = np.random.binomial(n=1, p=0.9, size=n_samples)

.. GENERATED FROM PYTHON SOURCE LINES 100-104

Building the DataOps plan
-------------------------

Let's start our DataOps plan by indicating what the features and the target
variables are.

.. GENERATED FROM PYTHON SOURCE LINES 104-109

.. code-block:: Python

    import skrub

    X = skrub.X(X)
    y = skrub.y(y)

.. GENERATED FROM PYTHON SOURCE LINES 110-113

The variable ``X`` is currently a list of dictionaries, which estimators
cannot handle directly. Let's convert it to a pandas DataFrame using
:func:`~skrub.DataOp.skb.apply_func`.

.. GENERATED FROM PYTHON SOURCE LINES 113-117

.. code-block:: Python

    import pandas as pd

    df = X.skb.apply_func(pd.DataFrame)

.. GENERATED FROM PYTHON SOURCE LINES 118-120

For this example, we use a strong baseline: skrub's
:func:`~skrub.tabular_pipeline`.

.. GENERATED FROM PYTHON SOURCE LINES 120-129

.. code-block:: Python

    tab_pipeline = skrub.tabular_pipeline("classification")

    # We can now apply the predictive model to the data.
    # The DataOps plan is ready after applying the model to the data.
    predictions = df.skb.apply(tab_pipeline, y=y)

    # We can then explore the full plan:
    predictions.skb.draw_graph()

.. raw:: html
    <p>Graph of the plan: Var 'X' &rarr; Call 'DataFrame' &rarr; Apply (tabular pipeline), with Var 'y' as the target.</p>
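As a quick, hedged sanity check of the list-of-dicts to dataframe conversion
that ``apply_func(pd.DataFrame)`` performs in the plan above (plain pandas,
outside of skrub; the record values below are made up):

```python
import pandas as pd

# Two records shaped like our generated emails (made-up values).
records = [
    {"sender": "a@google.fr", "title": "hello", "cc_emails": ["b@yahoo.com"]},
    {"sender": "c@whatever.net", "title": "offer", "cc_emails": []},
]

# pd.DataFrame turns each dictionary into one row, one column per key.
df = pd.DataFrame(records)
print(df.shape)          # (2, 3)
print(list(df.columns))  # ['sender', 'title', 'cc_emails']
```

This is exactly the shape the tabular pipeline sees once the plan runs.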
.. GENERATED FROM PYTHON SOURCE LINES 130-135

To end the exploratory work, we need to build the learner, fit it, and save
it to a file. Passing ``fitted=True`` to :func:`~skrub.DataOp.skb.make_learner`
fits the learner on the data that was passed to the variables of the DataOps
plan.

.. GENERATED FROM PYTHON SOURCE LINES 135-141

.. code-block:: Python

    import joblib

    with open("learner.pkl", "wb") as f:
        learner = predictions.skb.make_learner(fitted=True)
        joblib.dump(learner, f)

.. GENERATED FROM PYTHON SOURCE LINES 142-146

Production phase
----------------

In our microservice, we receive a payload in JSON format.

.. GENERATED FROM PYTHON SOURCE LINES 146-163

.. code-block:: Python

    X_input = {
        "id": generate_id(),
        "sender": generate_email(),
        "title": generate_text(max_str_length=10, min_str_length=2),
        "content": generate_text(max_str_length=100, min_str_length=10),
        "date": generate_datetime(),
        "cc_emails": [generate_email() for _ in range(random.randint(0, 5))],
    }

    # We just have to load the learner and use it to predict the score for
    # this input.
    with open("learner.pkl", "rb") as f:
        loaded_learner = joblib.load(f)

    # ``X_input`` must be passed inside a list so that pandas can parse it
    # correctly as a one-row dataframe.
    prediction = loaded_learner.predict({"X": [X_input]})
    prediction

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    array([1])

.. GENERATED FROM PYTHON SOURCE LINES 164-171

Conclusion
----------

Thanks to the skrub DataOps plan and learner, we ensure that all the
transformations and preprocessing done during model development are exactly
the same as those done in production. This makes deployment straightforward
and reduces the risk of errors when moving from development to production
environments.

.. rst-class:: sphx-glr-timing

**Total running time of the script:** (0 minutes 5.756 seconds)

.. _sphx_glr_download_auto_examples_data_ops_15_use_case.py:
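As a closing illustration, the glue code around the loaded learner inside the
microservice can be sketched in plain Python. ``StubLearner`` and
``score_payload`` are hypothetical stand-ins, not part of skrub (the stub lets
the sketch run without a fitted model); in the real service, the stub would be
the learner unpickled from ``learner.pkl`` above:

```python
import json


class StubLearner:
    """Hypothetical stand-in for the unpickled skrub learner."""

    def predict(self, environment):
        # Like the learner, this is called with a dict mapping variable
        # names to values ({"X": [record]}); it returns one label per record.
        return [1 for _ in environment["X"]]


def score_payload(raw_json, learner):
    """Parse one JSON email payload and return the model's label for it."""
    record = json.loads(raw_json)
    # Wrap the single record in a list so that the pipeline's pd.DataFrame
    # step parses it as a one-row table.
    return learner.predict({"X": [record]})[0]


payload = json.dumps({"id": "1", "sender": "a@google.fr", "title": "hello"})
print(score_payload(payload, StubLearner()))  # 1
```

Swapping ``StubLearner()`` for the object returned by ``joblib.load`` gives
the same call shape as the production snippet above.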
.. only:: html

    .. container:: sphx-glr-footer sphx-glr-footer-example

        .. container:: binder-badge

            .. image:: images/binder_badge_logo.svg
                :target: https://mybinder.org/v2/gh/skrub-data/skrub/0.6.2?urlpath=lab/tree/notebooks/auto_examples/data_ops/15_use_case.ipynb
                :alt: Launch binder
                :width: 150 px

        .. container:: lite-badge

            .. image:: images/jupyterlite_badge_logo.svg
                :target: ../../lite/lab/index.html?path=auto_examples/data_ops/15_use_case.ipynb
                :alt: Launch JupyterLite
                :width: 150 px

        .. container:: sphx-glr-download sphx-glr-download-jupyter

            :download:`Download Jupyter notebook: 15_use_case.ipynb <15_use_case.ipynb>`

        .. container:: sphx-glr-download sphx-glr-download-python

            :download:`Download Python source code: 15_use_case.py <15_use_case.py>`

        .. container:: sphx-glr-download sphx-glr-download-zip

            :download:`Download zipped: 15_use_case.zip <15_use_case.zip>`

.. only:: html

    .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_