Building extensive pipelines with DataOps

Skrub DataOps

  • Extend the scikit-learn machinery to multi-table operations
  • Take care of data leakage
  • Track all operations with a computational graph (a DataOps plan)
  • Allow tuning any operation in the plan
  • Can be persisted and shared easily

How do DataOps work, though?

DataOps wrap around user operations, where user operations are:

  • any dataframe operation (e.g., merge, group by, aggregate etc.)
  • scikit-learn estimators (a Random Forest, RidgeCV etc.)
  • custom user code (load data from a path, fetch from a URL etc.)

Important

DataOps record user operations, so that they can later be replayed in the same order and with the same arguments on unseen data.
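The record-then-replay idea can be sketched in plain Python. The `Recorder` class below is a hypothetical illustration, not skrub's actual internals: each operation is stored as a (function, arguments) pair at definition time and re-applied, in order, to new data.

```python
# Minimal sketch of "record now, replay later" (NOT skrub's real implementation).
class Recorder:
    def __init__(self):
        self.steps = []  # ordered list of (function, kwargs) pairs

    def record(self, func, **kwargs):
        # store the operation instead of running it immediately
        self.steps.append((func, kwargs))
        return self

    def replay(self, data):
        # re-apply every recorded operation, in order, on unseen data
        for func, kwargs in self.steps:
            data = func(data, **kwargs)
        return data

plan = Recorder()
plan.record(lambda xs, n: [x * n for x in xs], n=2)
plan.record(lambda xs: sorted(xs))

print(plan.replay([3, 1, 2]))  # → [2, 4, 6]
```

The same plan can now be replayed on any new list, with the same operations and the same arguments.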

Starting with the DataOps

import skrub
data = skrub.datasets.fetch_credit_fraud()

baskets = skrub.var("baskets", data.baskets)
products = skrub.var("products", data.products) # add a new variable

X = baskets[["ID"]].skb.mark_as_X()
y = baskets["fraud_flag"].skb.mark_as_y()

  • baskets and products represent inputs to the pipeline.
  • Skrub tracks X and y so that training and test splits are never mixed.

Applying a transformer

from skrub import selectors as s

vectorizer = skrub.TableVectorizer(
    high_cardinality=skrub.StringEncoder()
)
vectorized_products = products.skb.apply(
    vectorizer, cols=s.all() - "basket_ID"
)
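Here, `s.all() - "basket_ID"` vectorizes every column except the join key, so the identifier survives untouched for the later merge. In plain-pandas terms the selection amounts to the following (toy columns for illustration, not the real dataset):

```python
import pandas as pd

products = pd.DataFrame({
    "basket_ID": [1, 1, 2],
    "description": ["laptop", "mouse", "phone"],
    "price": [900.0, 20.0, 500.0],
})

# equivalent of s.all() - "basket_ID": every column except the join key
cols = [c for c in products.columns if c != "basket_ID"]
print(cols)  # → ['description', 'price']
```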

Executing dataframe operations

aggregated_products = vectorized_products.groupby(
    "basket_ID"
).agg("mean").reset_index()

features = X.merge(
    aggregated_products, left_on="ID", right_on="basket_ID"
)
features = features.drop(columns=["ID", "basket_ID"])
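On toy numeric data (made-up values, assuming the vectorizer has already produced numeric columns), the aggregate-then-merge step works out as:

```python
import pandas as pd

vectorized = pd.DataFrame({
    "basket_ID": [1, 1, 2],
    "feat_0": [1.0, 3.0, 4.0],
    "feat_1": [10.0, 30.0, 50.0],
})
baskets = pd.DataFrame({"ID": [1, 2]})

# one row of mean features per basket
agg = vectorized.groupby("basket_ID").agg("mean").reset_index()

features = baskets.merge(agg, left_on="ID", right_on="basket_ID")
features = features.drop(columns=["ID", "basket_ID"])
print(features.to_dict("list"))  # → {'feat_0': [2.0, 4.0], 'feat_1': [20.0, 50.0]}
```

The result has exactly one feature row per basket, which is what the classifier on the next slide consumes.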

Applying a ML model

from sklearn.ensemble import ExtraTreesClassifier  
predictions = features.skb.apply(
    ExtraTreesClassifier(n_jobs=-1), y=y
)

Inspecting the DataOps plan

predictions.skb.full_report()


Execution report

Each node:

  • Shows a preview of the data resulting from the operation
  • Reports the location in the code where the operation is defined
  • Can include node names and descriptions
  • Shows the run time of each node

Exporting the plan in a learner

The Learner is an estimator that takes as input a dictionary mapping variable names to values, rather than just X and y.

learner = predictions.skb.make_learner(fitted=True)

Then, the learner can be pickled …

import pickle

# serialize to a byte string …
learner_bytes = pickle.dumps(learner)

# … or straight to a file
with open("learner.bin", "wb") as fp:
    pickle.dump(learner, fp)

… loaded and applied to new data:

with open("learner.bin", "rb") as fp:
    loaded_learner = pickle.load(fp)
data = skrub.datasets.fetch_credit_fraud(split="test")
new_baskets = data.baskets
new_products = data.products
loaded_learner.predict({"baskets": new_baskets, "products": new_products})

Tuning in scikit-learn can be complex

from scipy.stats import loguniform, randint
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline

pipe = Pipeline([("dim_reduction", PCA()), ("regressor", Ridge())])
grid = [
    {
        "dim_reduction": [PCA()],
        "dim_reduction__n_components": [10, 20, 30],
        "regressor": [Ridge()],
        "regressor__alpha": loguniform(0.1, 10.0),
    },
    {
        "dim_reduction": [SelectKBest()],
        "dim_reduction__k": [10, 20, 30],
        "regressor": [Ridge()],
        "regressor__alpha": loguniform(0.1, 10.0),
    },
    {
        "dim_reduction": [PCA()],
        "dim_reduction__n_components": [10, 20, 30],
        "regressor": [RandomForestRegressor()],
        "regressor__n_estimators": randint(20, 200),
    },
    {
        "dim_reduction": [SelectKBest()],
        "dim_reduction__k": [10, 20, 30],
        "regressor": [RandomForestRegressor()],
        "regressor__n_estimators": randint(20, 200),
    },
]
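The grid above needs one entry per (dimensionality reduction, regressor) combination, so it grows multiplicatively as alternatives are added. A quick count:

```python
from itertools import product

dim_reductions = ["PCA", "SelectKBest"]
regressors = ["Ridge", "RandomForestRegressor"]

# one grid entry per (dim_reduction, regressor) pair
grid_entries = list(product(dim_reductions, regressors))
print(len(grid_entries))  # → 4
```

Adding a third regressor would push this to six entries, each repeating the same hyperparameter blocks.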

Tuning with DataOps is simple!

dim_reduction = X.skb.apply(
    skrub.choose_from(
        {
            "PCA": PCA(n_components=skrub.choose_int(10, 30)),
            "SelectKBest": SelectKBest(k=skrub.choose_int(10, 30))
        }, name="dim_reduction"
    )
)
regressor = dim_reduction.skb.apply(
    skrub.choose_from(
        {
            "Ridge": Ridge(alpha=skrub.choose_float(0.1, 10.0, log=True)),
            "RandomForest": RandomForestRegressor(
                n_estimators=skrub.choose_int(20, 200, log=True)
            )
        }, name="regressor"
    )
)

A parallel coordinate plot to explore hyperparameters

search = predictions.skb.get_randomized_search(fitted=True)
search.plot_parallel_coord()

Hyperparameter tuning in a DataOps plan

Skrub implements four choose_* functions, plus optional:

  • choose_from: select from the given list of options
  • choose_int: select an integer within a range
  • choose_float: select a float within a range
  • choose_bool: select a bool
  • optional: chooses whether to execute the given operation
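For instance, choose_int(20, 200, log=True) samples on a logarithmic scale, so values near the low end are drawn as often as values near the high end. A rough sketch of log-uniform integer sampling in plain Python (an illustration of the idea, not skrub's implementation):

```python
import math
import random

def log_uniform_int(low, high, rng=random):
    # draw uniformly in log space, then map back and round to an integer
    log_val = rng.uniform(math.log(low), math.log(high))
    return round(math.exp(log_val))

rng = random.Random(0)
samples = [log_uniform_int(20, 200, rng) for _ in range(5)]
print(samples)  # all values stay within [20, 200]
```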

What we have seen in this chapter

  • DataOps can be used to build complex multi-table pipelines
  • They track all the operations that are executed and replay them with the same parameters
  • DataOps simplify building hyperparameter grids
  • Learners can be exported and used with new data