Building complex pipelines

  • Our learner contains several data-processing steps
    • joining tables
    • selecting columns
    • applying machine-learning estimators
  • Some steps have state that needs to be fitted
  • Often several tables and aggregations are involved

Example

  • We have e-commerce check-out baskets
  • Each containing one or more products
  • Predict if the transaction is fraudulent


Dataset

  • Two tables: baskets (one row per basket: ID, fraud_flag) and products (one row per purchased product)
  • products joins to baskets through its basket_ID column

A first attempt …

  • Scikit-learn assumes a single table X of the right shape

Loading data

import skrub
import skrub.datasets

data = skrub.datasets.fetch_credit_fraud()

X = data.baskets[["ID"]]        # one row per basket
y = data.baskets["fraud_flag"]  # target: was the basket fraudulent?
products = data.products        # one row per purchased product

Encoding the products

product_vectorizer = skrub.TableVectorizer(
    high_cardinality=skrub.StringEncoder(n_components=5)
)

vectorized_products = product_vectorizer.fit_transform(data.products)

🤔

  • How to store product_vectorizer?
  • Fitted on the whole products table: data leakage (see the sketch after this list)
  • Cannot tune hyper-parameters
  • Transforming only some columns is hard
    • ColumnTransformer 😟😰
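
Avoiding that leakage by hand means refitting the vectorizer inside every cross-validation fold, on the training baskets' products only. A hedged sketch reusing the variables above (fold_vectorizer is illustrative, not part of skrub):

from sklearn.model_selection import KFold

for train_idx, test_idx in KFold(n_splits=5).split(X):
    train_ids = X.iloc[train_idx]["ID"]
    train_products = products[products["basket_ID"].isin(train_ids)]
    fold_vectorizer = skrub.TableVectorizer(
        high_cardinality=skrub.StringEncoder(n_components=5)
    )
    # fitted on this fold's training baskets only
    fold_vectorizer.fit(train_products)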

Joining the product features

# average the product features per basket
aggregated_products = (
    vectorized_products.groupby("basket_ID").agg("mean").reset_index()
)
# attach the aggregated features to the baskets table
X = X.merge(aggregated_products, left_on="ID", right_on="basket_ID").drop(
    columns=["ID", "basket_ID"]
)

🤔

  • How to keep track of these transformations?
  • Cannot tune choices

Adding the supervised estimator

from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import cross_val_score

classifier = HistGradientBoostingClassifier()

cross_val_score(classifier, X, y, scoring="roc_auc", n_jobs=5)

Skrub to the rescue

  • Build complex pipelines involving multiple tables

Loading data

data = skrub.datasets.fetch_credit_fraud()

X = skrub.X(data.baskets[["ID"]])
y = skrub.y(data.baskets["fraud_flag"])
products = skrub.var("products", data.products)
  • X, y, products represent inputs to the model
  • Operations on those objects are evaluated lazily (sketch after this list)
    • Recorded rather than evaluated immediately
    • But a preview is computed for interactive development
  • They forward all operations to the result of their evaluation
    • Full API of the underlying object is available
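
A minimal sketch of the lazy behaviour, assuming skrub's deferred-evaluation API, where .skb.eval() runs the recorded operations:

# head() is recorded, not executed: sample is a skrub expression
sample = products.head()

# force evaluation to get an actual dataframe back
print(sample.skb.eval())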

Encoding the products

from skrub import selectors as s

products = products[products["basket_ID"].isin(X["ID"])]

product_vectorizer = skrub.TableVectorizer(
    high_cardinality=skrub.StringEncoder(n_components=5)
)
vectorized_products = products.skb.apply(
    product_vectorizer, cols=s.all() - "basket_ID"
)
  • We can filter products based on X
  • product_vectorizer is added to the model
  • We can select columns to transform (more selectors sketched below)
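
The selectors module offers more than all(). A hedged sketch of two others, assuming a recent skrub version (the "product_*" pattern is a hypothetical column-name pattern):

# numeric columns only, still excluding the join key
products.skb.apply(product_vectorizer, cols=s.numeric() - "basket_ID")

# glob-style matching on column names
products.skb.apply(product_vectorizer, cols=s.glob("product_*"))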

Encoding the products

product_vectorizer = skrub.TableVectorizer(
    high_cardinality=skrub.StringEncoder(
        n_components=skrub.choose_int(2, 20)
    )
)
vectorized_products = products.skb.apply(
    product_vectorizer, cols=s.all() - "basket_ID"
)
  • We can tune hyperparameters (more later)

Joining the product features

aggregated_products = (
    vectorized_products.groupby("basket_ID").agg("mean").reset_index()
)
X = X.merge(aggregated_products, left_on="ID", right_on="basket_ID").drop(
    columns=["ID", "basket_ID"]
)
  • Transformations added to the model
  • Can tune choices (sketch after this list)
  • While having access to all the dataframe’s functionality
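
Tunable choices extend to arguments of dataframe methods. A hedged sketch, assuming choices are accepted inside lazily recorded calls as in recent skrub (the "agg" name is illustrative):

aggregated_products = (
    vectorized_products.groupby("basket_ID")
    .agg(skrub.choose_from(["mean", "max"], name="agg"))
    .reset_index()
)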

Adding the supervised estimator

classifier = HistGradientBoostingClassifier()
pred = X.skb.apply(classifier, y=y)

Evaluation

pred.skb.cross_validate(scoring="roc_auc", n_jobs=5)
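
The result holds per-fold timings and scores. A hedged usage sketch, assuming it behaves like a dataframe with a test_score column, as in recent skrub:

scores = pred.skb.cross_validate(scoring="roc_auc", n_jobs=5)
print(scores["test_score"].mean())  # mean ROC AUC across folds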

Training & using a model

train.py

import pickle

# fit the whole pipeline on the data held by the skrub variables, then save it
estimator = pred.skb.get_estimator(fitted=True)
with open("estimator.pickle", "wb") as ostream:
    pickle.dump(estimator, ostream)

predict.py

import pickle

with open("estimator.pickle", "rb") as istream:
    estimator = pickle.load(istream)

# the dict keys match the skrub variable names ("X", "products")
estimator.predict({"X": unseen_baskets, "products": unseen_products})

Easy inspection

pred.skb.full_report()


(report: an HTML page detailing every step of the pipeline and its output)

Hyperparameter tuning

  • Any choice in the pipeline can be tuned
  • Options are specified inline
  • Inspecting results is easy

Hyperparameter tuning

Without skrub: 😭😭😭

from scipy.stats import loguniform, randint
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline

pipe = Pipeline([("dim_reduction", PCA()), ("regressor", Ridge())])
grid = [
    {
        "dim_reduction": [PCA()],
        "dim_reduction__n_components": [10, 20, 30],
        "regressor": [Ridge()],
        "regressor__alpha": loguniform(0.1, 10.0),
    },
    {
        "dim_reduction": [SelectKBest()],
        "dim_reduction__k": [10, 20, 30],
        "regressor": [Ridge()],
        "regressor__alpha": loguniform(0.1, 10.0),
    },
    {
        "dim_reduction": [PCA()],
        "dim_reduction__n_components": [10, 20, 30],
        "regressor": [RandomForestRegressor()],
        "regressor__n_estimators": randint(20, 200),
    },
    {
        "dim_reduction": [SelectKBest()],
        "dim_reduction__k": [10, 20, 30],
        "regressor": [RandomForestRegressor()],
        "regressor__n_estimators": randint(20, 200),
    },
]
model = RandomizedSearchCV(pipe, grid)

NO!

Hyperparameter tuning

With skrub: replace any value with a range

product_vectorizer = skrub.TableVectorizer(
    high_cardinality=skrub.StringEncoder(
        n_components=skrub.choose_int(2, 20)
    )
)

# ...

search = pred.skb.get_randomized_search(scoring="roc_auc", fitted=True)

search.plot_parallel_coord()
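
The estimator alternatives from the scikit-learn grid above can also be written inline. A hedged sketch, assuming skrub's choose_from and choose_float (shown with the grid's regressors; the "dim_reduction" and "regressor" names are illustrative, and X and y stand for any skrub inputs):

from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import Ridge

dim_reduction = skrub.choose_from(
    {
        "pca": PCA(n_components=skrub.choose_int(10, 30)),
        "kbest": SelectKBest(k=skrub.choose_int(10, 30)),
    },
    name="dim_reduction",
)
regressor = skrub.choose_from(
    {
        "ridge": Ridge(alpha=skrub.choose_float(0.1, 10.0, log=True)),
        "forest": RandomForestRegressor(n_estimators=skrub.choose_int(20, 200)),
    },
    name="regressor",
)

# y is passed to both steps: SelectKBest needs it, PCA ignores it
pred = X.skb.apply(dim_reduction, y=y).skb.apply(regressor, y=y)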