.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/data_ops/15_use_case.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_auto_examples_data_ops_15_use_case.py>`
        to download the full example code, or to run this example in your browser
        via JupyterLite or Binder.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_data_ops_15_use_case.py:

Use case: developing locally and deploying to production
========================================================

.. GENERATED FROM PYTHON SOURCE LINES 7-32

As a team of data scientists, we are tasked with predicting whether an email
is potentially malicious (i.e., spam or phishing). We develop and test our
models locally, either in a Jupyter notebook or in a Python script. Once we
are satisfied with the model's performance, we move on to deploying it.

In this use case, every time the email provider receives a new email, they
want to check whether it is spam before displaying it in the recipient's
inbox. To achieve this, they plan to integrate a machine learning model
within a microservice. This microservice accepts an email's data as a JSON
payload and returns a score between 0 and 1, indicating the likelihood that
the email is spam.

Rewriting the entire data pipeline when moving from model validation to
production deployment is both error-prone and inefficient. Instead, we prefer
to load an object that encapsulates the same processing pipeline used during
model development. This is where the :class:`~skrub.SkrubLearner` can help.

Adopting this workflow also has the benefit of forcing us to clearly define
the type of data that will be available at the input of the microservice. It
helps ensure we build models that rely only on information accessible at this
specific point in the product pipeline.
For example, since we want to detect spam before the email reaches the
recipient's inbox, we cannot use features that only become available after
the recipient opens the email.

Since this example focuses on the pipeline construction itself, we won't
evaluate the model's performance.

.. GENERATED FROM PYTHON SOURCE LINES 34-38

Generating the training data
----------------------------

In this section, we define a few functions that help us generate the training
data in dictionary form. We generate a fully random dataset.

.. GENERATED FROM PYTHON SOURCE LINES 38-73

.. code-block:: Python

    import random
    import string
    import uuid
    from datetime import datetime, timedelta

    import numpy as np


    def generate_id():
        return str(uuid.uuid4())


    def generate_email():
        length = random.randint(5, 10)
        username = "".join(random.choice(string.ascii_lowercase) for _ in range(length))
        domain = ["google", "yahoo", "whatever"]
        tld = ["fr", "en", "com", "net"]
        return f"{username}@{random.choice(domain)}.{random.choice(tld)}"


    def generate_datetime():
        random_seconds = random.randint(0, int(timedelta(days=2).total_seconds()))
        return datetime.now() - timedelta(seconds=random_seconds)


    def generate_text(min_str_length, max_str_length):
        random_length = random.randint(min_str_length, max_str_length)
        return "".join(
            random.choice(string.ascii_letters + string.digits + string.punctuation)
            for _ in range(random_length)
        )

.. GENERATED FROM PYTHON SOURCE LINES 74-75

We generate 1000 training samples and store them in a list of dictionaries:

.. GENERATED FROM PYTHON SOURCE LINES 75-78

.. code-block:: Python

    n_samples = 1000

.. GENERATED FROM PYTHON SOURCE LINES 79-82

In this use case, the emails to score once the model is in production arrive
as JSON, not as a dataframe. As a result, our training data should also be a
list of dictionaries.

.. GENERATED FROM PYTHON SOURCE LINES 82-99

.. code-block:: Python

    X = [
        {
            "id": generate_id(),
            "sender": generate_email(),
            "title": generate_text(max_str_length=10, min_str_length=2),
            "content": generate_text(max_str_length=100, min_str_length=10),
            "date": generate_datetime(),
            "cc_emails": [generate_email() for _ in range(random.randint(0, 5))],
        }
        for _ in range(n_samples)
    ]

    # Generate an array of 0s and 1s to represent the target variable.
    y = np.random.binomial(n=1, p=0.9, size=n_samples)

.. GENERATED FROM PYTHON SOURCE LINES 100-104

Building the DataOps plan
-------------------------

Let's start our DataOps plan by indicating what the features and the target
variables are.

.. GENERATED FROM PYTHON SOURCE LINES 104-109

.. code-block:: Python

    import skrub

    X = skrub.X(X)
    y = skrub.y(y)

.. GENERATED FROM PYTHON SOURCE LINES 110-113

The variable ``X`` is currently a list of dictionaries, which estimators
cannot handle directly. Let's convert it to a pandas DataFrame using
:func:`~skrub.DataOp.skb.apply_func`.

.. GENERATED FROM PYTHON SOURCE LINES 113-117

.. code-block:: Python

    import pandas as pd

    df = X.skb.apply_func(pd.DataFrame)

.. GENERATED FROM PYTHON SOURCE LINES 118-120

For this example, we use a strong baseline: skrub's
:func:`~skrub.tabular_pipeline`.

.. GENERATED FROM PYTHON SOURCE LINES 120-129

.. code-block:: Python

    tab_pipeline = skrub.tabular_pipeline("classification")

    # We can now apply the predictive model to the data.
    # The DataOps plan is ready after applying the model to the data.
    predictions = df.skb.apply(tab_pipeline, y=y)

    # We can then explore the full plan:
    predictions.skb.draw_graph()

.. raw:: html
    <p>Graph of the plan: Var 'X' &rarr; Call 'DataFrame' &rarr; Apply (tabular pipeline), with Var 'y' as the target.</p>
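As a quick, hedged sanity check of the list-of-dicts to dataframe conversion
that ``apply_func(pd.DataFrame)`` performs in the plan above (plain pandas,
outside of skrub; the record values below are made up):

```python
import pandas as pd

# Two records shaped like our generated emails (made-up values).
records = [
    {"sender": "a@google.fr", "title": "hello", "cc_emails": ["b@yahoo.com"]},
    {"sender": "c@whatever.net", "title": "offer", "cc_emails": []},
]

# pd.DataFrame turns each dictionary into one row, one column per key.
df = pd.DataFrame(records)
print(df.shape)          # (2, 3)
print(list(df.columns))  # ['sender', 'title', 'cc_emails']
```

This is exactly the shape the tabular pipeline sees once the plan runs.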
.. GENERATED FROM PYTHON SOURCE LINES 130-135

To end the exploratory work, we need to build the learner, fit it, and save
it to a file. Passing ``fitted=True`` to :func:`~skrub.DataOp.skb.make_learner`
fits the learner on the data that was passed to the variables of the DataOps
plan.

.. GENERATED FROM PYTHON SOURCE LINES 135-141

.. code-block:: Python

    import joblib

    with open("learner.pkl", "wb") as f:
        learner = predictions.skb.make_learner(fitted=True)
        joblib.dump(learner, f)

.. GENERATED FROM PYTHON SOURCE LINES 142-146

Production phase
----------------

In our microservice, we receive a payload in JSON format.

.. GENERATED FROM PYTHON SOURCE LINES 146-163

.. code-block:: Python

    X_input = {
        "id": generate_id(),
        "sender": generate_email(),
        "title": generate_text(max_str_length=10, min_str_length=2),
        "content": generate_text(max_str_length=100, min_str_length=10),
        "date": generate_datetime(),
        "cc_emails": [generate_email() for _ in range(random.randint(0, 5))],
    }

    # We just have to load the learner and use it to predict the score for
    # this input.
    with open("learner.pkl", "rb") as f:
        loaded_learner = joblib.load(f)

    # ``X_input`` must be passed inside a list so that pandas can parse it
    # correctly as a one-row dataframe.
    prediction = loaded_learner.predict({"X": [X_input]})
    prediction

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    array([1])

.. GENERATED FROM PYTHON SOURCE LINES 164-171

Conclusion
----------

Thanks to the skrub DataOps plan and learner, we ensure that all the
transformations and preprocessing done during model development are exactly
the same as those done in production. This makes deployment straightforward
and reduces the risk of errors when moving from development to production
environments.

.. rst-class:: sphx-glr-timing

**Total running time of the script:** (0 minutes 5.756 seconds)

.. _sphx_glr_download_auto_examples_data_ops_15_use_case.py:
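As a closing illustration, the glue code around the loaded learner inside the
microservice can be sketched in plain Python. ``StubLearner`` and
``score_payload`` are hypothetical stand-ins, not part of skrub (the stub lets
the sketch run without a fitted model); in the real service, the stub would be
the learner unpickled from ``learner.pkl`` above:

```python
import json


class StubLearner:
    """Hypothetical stand-in for the unpickled skrub learner."""

    def predict(self, environment):
        # Like the learner, this is called with a dict mapping variable
        # names to values ({"X": [record]}); it returns one label per record.
        return [1 for _ in environment["X"]]


def score_payload(raw_json, learner):
    """Parse one JSON email payload and return the model's label for it."""
    record = json.loads(raw_json)
    # Wrap the single record in a list so that the pipeline's pd.DataFrame
    # step parses it as a one-row table.
    return learner.predict({"X": [record]})[0]


payload = json.dumps({"id": "1", "sender": "a@google.fr", "title": "hello"})
print(score_payload(payload, StubLearner()))  # 1
```

Swapping ``StubLearner()`` for the object returned by ``joblib.load`` gives
the same call shape as the production snippet above.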
.. only:: html

    .. container:: sphx-glr-footer sphx-glr-footer-example

        .. container:: binder-badge

            .. image:: images/binder_badge_logo.svg
                :target: https://mybinder.org/v2/gh/skrub-data/skrub/0.6.2?urlpath=lab/tree/notebooks/auto_examples/data_ops/15_use_case.ipynb
                :alt: Launch binder
                :width: 150 px

        .. container:: lite-badge

            .. image:: images/jupyterlite_badge_logo.svg
                :target: ../../lite/lab/index.html?path=auto_examples/data_ops/15_use_case.ipynb
                :alt: Launch JupyterLite
                :width: 150 px

        .. container:: sphx-glr-download sphx-glr-download-jupyter

            :download:`Download Jupyter notebook: 15_use_case.ipynb <15_use_case.ipynb>`

        .. container:: sphx-glr-download sphx-glr-download-python

            :download:`Download Python source code: 15_use_case.py <15_use_case.py>`

        .. container:: sphx-glr-download sphx-glr-download-zip

            :download:`Download zipped: 15_use_case.zip <15_use_case.zip>`

.. only:: html

    .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_