Building extensive pipelines with DataOps

Skrub DataOps

  • Extend the scikit-learn machinery to multi-table operations
  • Take care of data leakage
  • Track all operations with a computational graph (a DataOps plan)
  • Allow tuning any operation in the plan
  • Can be persisted and shared easily

How do DataOps work, though?

DataOps wrap around user operations, where user operations are:

  • any dataframe operation (e.g., merge, group by, aggregate etc.)
  • scikit-learn estimators (a Random Forest, RidgeCV etc.)
  • custom user code (load data from a path, fetch from a URL etc.)

Important

DataOps record user operations, so that they can later be replayed in the same order and with the same arguments on unseen data.
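The record-then-replay idea can be sketched in plain Python. The `Recorder` class below is a hypothetical illustration, not skrub's actual internals: each operation is stored as a (function, arguments) pair at definition time and re-applied, in order, to new data.

```python
# Minimal sketch of "record now, replay later" (NOT skrub's real implementation).
class Recorder:
    def __init__(self):
        self.steps = []  # ordered list of (function, kwargs) pairs

    def record(self, func, **kwargs):
        # store the operation instead of running it immediately
        self.steps.append((func, kwargs))
        return self

    def replay(self, data):
        # re-apply every recorded operation, in order, on unseen data
        for func, kwargs in self.steps:
            data = func(data, **kwargs)
        return data

plan = Recorder()
plan.record(lambda xs, n: [x * n for x in xs], n=2)
plan.record(lambda xs: sorted(xs))

print(plan.replay([3, 1, 2]))  # → [2, 4, 6]
```

The same plan can now be replayed on any new list, with the same operations and the same arguments.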

Starting with the DataOps

import skrub
data = skrub.datasets.fetch_credit_fraud()

baskets = skrub.var("baskets", data.baskets)
products = skrub.var("products", data.products) # add a new variable

X = baskets[["ID"]].skb.mark_as_X()
y = baskets["fraud_flag"].skb.mark_as_y()

  • baskets and products represent inputs to the pipeline.
  • Skrub tracks X and y so that training and test splits are never mixed.

Applying a transformer

from skrub import selectors as s

vectorizer = skrub.TableVectorizer(
    high_cardinality=skrub.StringEncoder()
)
vectorized_products = products.skb.apply(
    vectorizer, cols=s.all() - "basket_ID"
)
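Here, `s.all() - "basket_ID"` vectorizes every column except the join key, so the identifier survives untouched for the later merge. In plain-pandas terms the selection amounts to the following (toy columns for illustration, not the real dataset):

```python
import pandas as pd

products = pd.DataFrame({
    "basket_ID": [1, 1, 2],
    "description": ["laptop", "mouse", "phone"],
    "price": [900.0, 20.0, 500.0],
})

# equivalent of s.all() - "basket_ID": every column except the join key
cols = [c for c in products.columns if c != "basket_ID"]
print(cols)  # → ['description', 'price']
```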

Executing dataframe operations

aggregated_products = vectorized_products.groupby(
    "basket_ID"
).agg("mean").reset_index()

features = X.merge(
    aggregated_products, left_on="ID", right_on="basket_ID"
)
features = features.drop(columns=["ID", "basket_ID"])
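On toy numeric data (made-up values, assuming the vectorizer has already produced numeric columns), the aggregate-then-merge step works out as:

```python
import pandas as pd

vectorized = pd.DataFrame({
    "basket_ID": [1, 1, 2],
    "feat_0": [1.0, 3.0, 4.0],
    "feat_1": [10.0, 30.0, 50.0],
})
baskets = pd.DataFrame({"ID": [1, 2]})

# one row of mean features per basket
agg = vectorized.groupby("basket_ID").agg("mean").reset_index()

features = baskets.merge(agg, left_on="ID", right_on="basket_ID")
features = features.drop(columns=["ID", "basket_ID"])
print(features.to_dict("list"))  # → {'feat_0': [2.0, 4.0], 'feat_1': [20.0, 50.0]}
```

The result has exactly one feature row per basket, which is what the classifier on the next slide consumes.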

Applying a ML model

from sklearn.ensemble import ExtraTreesClassifier  
predictions = features.skb.apply(
    ExtraTreesClassifier(n_jobs=-1), y=y
)

Inspecting the DataOps plan

predictions.skb.full_report()


Execution report

Each node:

  • Shows a preview of the data resulting from the operation
  • Reports the location in the code where the operation is defined
  • Can include node names and descriptions
  • Shows the run time of each node

Exporting the plan in a learner

The Learner is an estimator that takes as input a dictionary mapping variable names to values, rather than just X and y.

learner = predictions.skb.make_learner(fitted=True)

Then, the learner can be pickled …

import pickle

# serialize to a byte string …
learner_bytes = pickle.dumps(learner)

# … or straight to a file
with open("learner.bin", "wb") as fp:
    pickle.dump(learner, fp)

… loaded and applied to new data:

with open("learner.bin", "rb") as fp:
    loaded_learner = pickle.load(fp)
data = skrub.datasets.fetch_credit_fraud(split="test")
new_baskets = data.baskets
new_products = data.products
loaded_learner.predict({"baskets": new_baskets, "products": new_products})

Tuning in scikit-learn can be complex

from scipy.stats import loguniform, randint
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline

pipe = Pipeline([("dim_reduction", PCA()), ("regressor", Ridge())])
grid = [
    {
        "dim_reduction": [PCA()],
        "dim_reduction__n_components": [10, 20, 30],
        "regressor": [Ridge()],
        "regressor__alpha": loguniform(0.1, 10.0),
    },
    {
        "dim_reduction": [SelectKBest()],
        "dim_reduction__k": [10, 20, 30],
        "regressor": [Ridge()],
        "regressor__alpha": loguniform(0.1, 10.0),
    },
    {
        "dim_reduction": [PCA()],
        "dim_reduction__n_components": [10, 20, 30],
        "regressor": [RandomForestRegressor()],
        "regressor__n_estimators": randint(20, 200),
    },
    {
        "dim_reduction": [SelectKBest()],
        "dim_reduction__k": [10, 20, 30],
        "regressor": [RandomForestRegressor()],
        "regressor__n_estimators": randint(20, 200),
    },
]
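The grid above needs one entry per (dimensionality reduction, regressor) combination, so it grows multiplicatively as alternatives are added. A quick count:

```python
from itertools import product

dim_reductions = ["PCA", "SelectKBest"]
regressors = ["Ridge", "RandomForestRegressor"]

# one grid entry per (dim_reduction, regressor) pair
grid_entries = list(product(dim_reductions, regressors))
print(len(grid_entries))  # → 4
```

Adding a third regressor would push this to six entries, each repeating the same hyperparameter blocks.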

Tuning with DataOps is simple!

dim_reduction = X.skb.apply(
    skrub.choose_from(
        {
            "PCA": PCA(n_components=skrub.choose_int(10, 30)),
            "SelectKBest": SelectKBest(k=skrub.choose_int(10, 30))
        }, name="dim_reduction"
    )
)
regressor = dim_reduction.skb.apply(
    skrub.choose_from(
        {
            "Ridge": Ridge(alpha=skrub.choose_float(0.1, 10.0, log=True)),
            "RandomForest": RandomForestRegressor(
                n_estimators=skrub.choose_int(20, 200, log=True)
            )
        }, name="regressor"
    )
)

A parallel coordinate plot to explore hyperparameters

search = predictions.skb.get_randomized_search(fitted=True)
search.plot_parallel_coord()

Hyperparameter tuning in a DataOps plan

Skrub implements four choose_* functions, plus optional:

  • choose_from: select from the given list of options
  • choose_int: select an integer within a range
  • choose_float: select a float within a range
  • choose_bool: select a bool
  • optional: chooses whether to execute the given operation
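For instance, choose_int(20, 200, log=True) samples on a logarithmic scale, so values near the low end are drawn as often as values near the high end. A rough sketch of log-uniform integer sampling in plain Python (an illustration of the idea, not skrub's implementation):

```python
import math
import random

def log_uniform_int(low, high, rng=random):
    # draw uniformly in log space, then map back and round to an integer
    log_val = rng.uniform(math.log(low), math.log(high))
    return round(math.exp(log_val))

rng = random.Random(0)
samples = [log_uniform_int(20, 200, rng) for _ in range(5)]
print(samples)  # all values stay within [20, 200]
```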

What we have seen in this chapter

  • DataOps can be used to build complex multi-table pipelines
  • They track all the operations that are executed and replay them with the same parameters
  • DataOps simplify building hyperparameter grids
  • Learners can be exported and used with new data