.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/data_ops/14_use_case.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_auto_examples_data_ops_14_use_case.py>`
        to download the full example code, or to run this example in your
        browser via JupyterLite or Binder.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_data_ops_14_use_case.py:

Use case: developing locally and avoiding repeated code in production
======================================================================

.. GENERATED FROM PYTHON SOURCE LINES 8-33

As a team of data scientists, we are tasked with a project to predict whether
an email is potentially malicious (i.e., spam or phishing). We develop and
test our models locally, either in a Jupyter notebook or within a Python
script. Once we are satisfied with the model's performance, we move on to
deploying it.

In this use case, every time the email provider receives a new email, they
want to verify whether it is spam before displaying it in the recipient's
inbox. To achieve this, they plan to integrate a machine learning model
within a microservice. This microservice will accept an email's data as a
JSON payload and return a score between 0 and 1, indicating the likelihood
that the email is spam.

To avoid rewriting the entire data pipeline when moving from model validation
to production deployment, which is both error-prone and inefficient, we
prefer to load an object that encapsulates the same processing pipeline used
during model development. This is where the :class:`~skrub.SkrubLearner` can
help.

Adopting this workflow also has the benefit of forcing us to clearly define
the type of data that will be available at the input of the microservice. It
helps ensure we build models that rely only on information accessible at this
specific point in the product pipeline.
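As a concrete illustration of that input contract, here is a minimal sketch of
the JSON payload the microservice could receive, with a stub in place of the
real model. The field names mirror the training data generated below; the
``stub_spam_score`` function and the constant it returns are purely
hypothetical placeholders, not part of skrub.

```python
import json

# Hypothetical payload; the field names match the training data used in this
# example, and the values here are made up for illustration.
payload = {
    "id": "7c9e6679-7425-40de-944b-e07fc1f90ae7",
    "sender": "alice@google.com",
    "title": "hello",
    "content": "some body text",
    "date": "2024-01-01T12:00:00",
    "cc_emails": ["bob@yahoo.fr"],
}

# The microservice receives the email as a JSON string...
raw = json.dumps(payload)

# ...decodes it, and hands it to the model to obtain a spam score in [0, 1].
decoded = json.loads(raw)


def stub_spam_score(email):
    """Placeholder for the real model: always returns a constant score."""
    return 0.5


score = stub_spam_score(decoded)
```

Defining this payload shape early is what forces the model to use only
features available at this point of the product pipeline.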
For example, since we want to detect spam before the email reaches the
recipient's inbox, we cannot use features that are only available after the
recipient opens the email.

Since this example focuses on the pipeline construction itself, we will not
look at the model's performance.

.. GENERATED FROM PYTHON SOURCE LINES 35-39

Generating the training data
----------------------------

In this section, we define a few helper functions that generate the training
data in dictionary form. The data set is fully random.

.. GENERATED FROM PYTHON SOURCE LINES 39-74

.. code-block:: Python

    import random
    import string
    import uuid
    from datetime import datetime, timedelta

    import numpy as np


    def generate_id():
        return str(uuid.uuid4())


    def generate_email():
        length = random.randint(5, 10)
        username = "".join(random.choice(string.ascii_lowercase) for _ in range(length))
        domain = ["google", "yahoo", "whatever"]
        tld = ["fr", "en", "com", "net"]
        return f"{username}@{random.choice(domain)}.{random.choice(tld)}"


    def generate_datetime():
        random_seconds = random.randint(0, int(timedelta(days=2).total_seconds()))
        random_datetime = datetime.now() - timedelta(seconds=random_seconds)
        return random_datetime


    def generate_text(min_str_length, max_str_length):
        random_length = random.randint(min_str_length, max_str_length)
        random_text = "".join(
            random.choice(string.ascii_letters + string.digits + string.punctuation)
            for _ in range(random_length)
        )
        return random_text

.. GENERATED FROM PYTHON SOURCE LINES 75-76

We generate 1000 training samples:

.. GENERATED FROM PYTHON SOURCE LINES 76-79

.. code-block:: Python

    n_samples = 1000

.. GENERATED FROM PYTHON SOURCE LINES 80-83

In this use case, the emails to be scored once the model is in production
arrive not in a dataframe but as JSON. As a result, our training data should
also be stored as a list of dictionaries.

.. GENERATED FROM PYTHON SOURCE LINES 83-100

.. code-block:: Python

    X = [
        {
            "id": generate_id(),
            "sender": generate_email(),
            "title": generate_text(max_str_length=10, min_str_length=2),
            "content": generate_text(max_str_length=100, min_str_length=10),
            "date": generate_datetime(),
            "cc_emails": [generate_email() for _ in range(random.randint(0, 5))],
        }
        for _ in range(n_samples)
    ]

    # Generate an array of 0s and 1s to represent the target variable.
    y = np.random.binomial(n=1, p=0.9, size=n_samples)

.. GENERATED FROM PYTHON SOURCE LINES 101-105

Building the DataOps plan
-------------------------

Let's start our DataOps plan by declaring the features and the target
variable.

.. GENERATED FROM PYTHON SOURCE LINES 105-110

.. code-block:: Python

    import skrub

    X = skrub.X(X)
    y = skrub.y(y)

.. GENERATED FROM PYTHON SOURCE LINES 111-114

The variable ``X`` is, for now, a list of dicts: not something that an
estimator can handle directly. Let's convert it to a pandas DataFrame using
:func:`~skrub.DataOp.skb.apply_func`.

.. GENERATED FROM PYTHON SOURCE LINES 114-118

.. code-block:: Python

    import pandas as pd

    df = X.skb.apply_func(pd.DataFrame)

.. GENERATED FROM PYTHON SOURCE LINES 119-121

For this example, we will use a strong baseline: skrub's
:func:`~skrub.tabular_pipeline`.

.. GENERATED FROM PYTHON SOURCE LINES 121-130

.. code-block:: Python

    tab_pipeline = skrub.tabular_pipeline("classification")

    # We can now apply the predictive model to the data.
    # The DataOps plan is ready after applying the model to the data.
    predictions = df.skb.apply(tab_pipeline, y=y)

    # We can then explore the full plan:
    predictions.skb.draw_graph()

.. raw:: html
    <!-- Rendered DataOps plan graph: Var 'X' -> Call 'DataFrame' -> Apply
         tabular pipeline, with Var 'y' as the target. -->


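The save-and-reload pattern used in the next steps can be sketched with the
standard library's ``pickle`` and a trivial stand-in object. The
``fitted_state`` dict below is purely hypothetical; the real code persists
the skrub learner with joblib, but the round trip is the same idea.

```python
import io
import pickle

# Hypothetical "fitted model state"; in the real example this is the learner.
fitted_state = {"threshold": 0.5, "vocabulary": ["free", "winner", "urgent"]}

buffer = io.BytesIO()  # stands in for the "learner.pkl" file on disk
pickle.dump(fitted_state, buffer)  # save at the end of development
buffer.seek(0)
restored = pickle.load(buffer)  # load inside the microservice
```

``joblib.dump`` and ``joblib.load`` follow the same dump/load pattern with a
file object, as shown with the real learner below.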
.. GENERATED FROM PYTHON SOURCE LINES 131-136

To end the exploratory work, we build the learner, fit it, and save it to a
file. Passing ``fitted=True`` to :func:`~skrub.DataOp.skb.make_learner` fits
the learner on the data that was passed to the variables of the DataOps plan.

.. GENERATED FROM PYTHON SOURCE LINES 136-142

.. code-block:: Python

    import joblib

    with open("learner.pkl", "wb") as f:
        learner = predictions.skb.make_learner(fitted=True)
        joblib.dump(learner, f)

.. GENERATED FROM PYTHON SOURCE LINES 143-147

Production phase
----------------

In our microservice, we receive a payload in JSON format.

.. GENERATED FROM PYTHON SOURCE LINES 147-164

.. code-block:: Python

    X_input = {
        "id": generate_id(),
        "sender": generate_email(),
        "title": generate_text(max_str_length=10, min_str_length=2),
        "content": generate_text(max_str_length=100, min_str_length=10),
        "date": generate_datetime(),
        "cc_emails": [generate_email() for _ in range(random.randint(0, 5))],
    }

    # We just have to load the learner and use it to predict the score for
    # this input.
    with open("learner.pkl", "rb") as f:
        loaded_learner = joblib.load(f)

    # ``X_input`` must be passed as a list so that pandas can parse it
    # correctly into a dataframe.
    prediction = loaded_learner.predict({"X": [X_input]})
    prediction

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    array([1])

.. GENERATED FROM PYTHON SOURCE LINES 165-172

Conclusion
----------

Thanks to the skrub DataOps and learner, we can be confident that all the
transformations and preprocessing done during model development are exactly
the same as those run in production. Deployment becomes easy and
straightforward.

.. rst-class:: sphx-glr-timing

**Total running time of the script:** (0 minutes 5.467 seconds)

.. _sphx_glr_download_auto_examples_data_ops_14_use_case.py:

.. only:: html

    .. container:: sphx-glr-footer sphx-glr-footer-example

        .. container:: binder-badge

            .. image:: images/binder_badge_logo.svg
                :target: https://mybinder.org/v2/gh/skrub-data/skrub/main?urlpath=lab/tree/notebooks/auto_examples/data_ops/14_use_case.ipynb
                :alt: Launch binder
                :width: 150 px

        .. container:: lite-badge

            .. image:: images/jupyterlite_badge_logo.svg
                :target: ../../lite/lab/index.html?path=auto_examples/data_ops/14_use_case.ipynb
                :alt: Launch JupyterLite
                :width: 150 px

        .. container:: sphx-glr-download sphx-glr-download-jupyter

            :download:`Download Jupyter notebook: 14_use_case.ipynb <14_use_case.ipynb>`

        .. container:: sphx-glr-download sphx-glr-download-python

            :download:`Download Python source code: 14_use_case.py <14_use_case.py>`

        .. container:: sphx-glr-download sphx-glr-download-zip

            :download:`Download zipped: 14_use_case.zip <14_use_case.zip>`

.. only:: html

    .. rst-class:: sphx-glr-signature

        `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_