SkrubPipeline#

class skrub.SkrubPipeline(expr)[source]#

Pipeline that evaluates a skrub expression.

This class is not meant to be instantiated manually, SkrubPipeline objects are created by calling Expr.skb.get_pipeline() on an expression.

Methods

find_fitted_estimator(name)

Find the scikit-learn estimator that has been fitted in a .skb.apply() step.

get_params([deep])

Get parameters for this estimator.

set_params(**params)

Set the parameters of this estimator.

truncated_after(name)

Extract the part of the pipeline that leads up to the given step.

fit

report

find_fitted_estimator(name)[source]#

Find the scikit-learn estimator that has been fitted in a .skb.apply() step.

This can be useful for example to inspect the fitted attributes of the estimator. The apply step must have been given a name with .skb.set_name() (see examples below).

Parameters:
namestr

The name of the .skb.apply() step in which an estimator has been fitted.

Returns:
scikit-learn estimator

The fitted estimator. Depending on the nature of the estimator it may be wrapped in a skrub.OnEachColumn or skrub.OnSubFrame, see examples below.

See also

skrub.Expr.skb.set_name

Give a name to this expression.

skrub.Expr.skb.apply

Apply a scikit-learn estimator to a dataframe or numpy array.

Examples

>>> from sklearn.decomposition import PCA
>>> from sklearn.dummy import DummyClassifier
>>> import skrub
>>> from skrub import selectors as s
>>> orders = skrub.toy_orders()
>>> X, y = skrub.X(), skrub.y()
>>> pred = (
...     X.skb.apply(skrub.StringEncoder(n_components=2), cols=["product"])
...     .skb.set_name("product_encoder")
...     .skb.apply(skrub.ToDatetime(), cols=["date"])
...     .skb.apply(skrub.DatetimeEncoder(add_total_seconds=False), cols=["date"])
...     .skb.apply(PCA(n_components=2), cols=s.glob("date_*"))
...     .skb.set_name("pca")
...     .skb.apply(DummyClassifier(), y=y)
...     .skb.set_name("classifier")
... )
>>> pipeline = pred.skb.get_pipeline()
>>> pipeline.fit({'X': orders.X, 'y': orders.y})
SkrubPipeline(expr=<classifier | Apply DummyClassifier>)

We can retrieve the fitted transformer for a given step with find_fitted_estimator:

>>> pipeline.find_fitted_estimator("classifier")
DummyClassifier()

Depending on the parameters passed to skb.apply(), the estimator we provide can be wrapped in a skrub transformer that applies it to several columns in the input, or to a subset of the columns in a dataframe. In other cases it may be applied without any wrapping. We provide examples for those 3 different cases below.

Case 1: the StringEncoder is a skrub single-column transformer: it transforms a single column. In the pipeline it gets wrapped in a skrub.OnEachColumn which independently fits a separate instance of the StringEncoder to each of the columns it transforms (in this case there is only one column, 'product'). The individual transformers can be found in the fitted attribute transformers_ which maps column names to the corresponding fitted transformer.

>>> encoder = pipeline.find_fitted_estimator('product_encoder')
>>> encoder.transformers_
{'product': StringEncoder(n_components=2)}
>>> encoder.transformers_['product'].vectorizer_.vocabulary_
{' pe': 2, 'pen': 12, 'en ': 8, ' pen': 3, 'pen ': 13, ' cu': 0, 'cup': 6, 'up ': 18, ' cup': 1, 'cup ': 7, ' sp': 4, 'spo': 16, 'poo': 14, 'oon': 10, 'on ': 9, ' spo': 5, 'spoo': 17, 'poon': 15, 'oon ': 11}

This case (wrapping in OnEachColumn) happens when the estimator is a skrub single-column transformer (it has a __single_column_transformer__ attribute), we pass .skb.apply(how='columnwise') or we pass .skb.apply(allow_reject=True).

Case 2: the PCA is a regular scikit-learn transformer. In the pipeline it gets wrapped in a skrub.OnSubFrame which applies it to the subset of columns in the dataframe selected by the cols argument passed to .skb.apply(). The fitted PCA can be found in the fitted attribute transformer_.

>>> pca = pipeline.find_fitted_estimator('pca')
>>> pca
OnSubFrame(cols=glob('date_*'), transformer=PCA(n_components=2))
>>> pca.transformer_
PCA(n_components=2)
>>> pca.transformer_.mean_
array([2020.,    4.,    4.], dtype=float32)

This case (wrapping in OnSubFrame) happens when the estimator is a scikit-learn transformer but not a single-column transformer.

The DummyRegressor is a scikit-learn predictor. In the pipeline it gets applied directly to the input dataframe without any wrapping.

>>> classifier = pipeline.find_fitted_estimator('classifier')
>>> classifier
DummyClassifier()
>>> classifier.class_prior_
array([0.75, 0.25])

This case (no wrapping) happens when the estimator is a scikit-learn predictor (not a transformer), the input is not a dataframe (e.g. it is a numpy array), or we pass .skb.apply(how='full_frame').

get_params(deep=True)[source]#

Get parameters for this estimator.

Parameters:
deepbool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:
paramsdict

Parameter names mapped to their values.

set_params(**params)[source]#

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:
**paramsdict

Estimator parameters.

Returns:
selfestimator instance

Estimator instance.

truncated_after(name)[source]#

Extract the part of the pipeline that leads up to the given step.

This is similar to slicing a scikit-learn pipeline. It can be useful for example to drive the hyperparameter selection with a supervised task but then extract only the part of the pipeline that performs feature extraction.

The target step must have been given a name with .skb.set_name().

Parameters:
namestr

The name of the intermediate step we want to extract.

Returns:
SkrubPipeline

A skrub pipeline that performs all the transformations leading up to (and including) the required step.

Examples

>>> from sklearn.dummy import DummyClassifier
>>> import skrub
>>> orders = skrub.toy_orders()
>>> X, y = skrub.X(), skrub.y()
>>> pred = (
...     X.skb.apply(
...         skrub.TableVectorizer(datetime=skrub.DatetimeEncoder(add_total_seconds=False))
...     )
...     .skb.set_name("vectorizer")
...     .skb.apply(DummyClassifier(), y=y)
... )
>>> pipeline = pred.skb.get_pipeline()
>>> pipeline.fit({"X": orders.X, "y": orders.y})
SkrubPipeline(expr=<Apply DummyClassifier>)
>>> pipeline.predict({"X": orders.X})
array([False, False, False, False])

Truncate the pipeline after vectorization:

>>> vectorizer = pipeline.truncated_after("vectorizer")
>>> vectorizer
SkrubPipeline(expr=<vectorizer | Apply TableVectorizer>)
>>> vectorizer.transform({"X": orders.X})
    ID  product_cup  product_pen  ...  date_year  date_month  date_day
0  1.0          0.0          1.0  ...     2020.0         4.0       3.0
1  2.0          1.0          0.0  ...     2020.0         4.0       4.0
2  3.0          1.0          0.0  ...     2020.0         4.0       4.0
3  4.0          0.0          0.0  ...     2020.0         4.0       5.0

Note this differs from find_fitted_estimator which extracts the inner scikit-learn estimator that has been fitted inside of a single step.

This contains the full transformation up to the given step:

>>> pipeline.truncated_after("vectorizer")
SkrubPipeline(expr=<vectorizer | Apply TableVectorizer>)

The result of find_fitted_estimator only contains the inner TableVectorizer that was fitted inside of the "vectorizer" step:

>>> pipeline.find_fitted_estimator("vectorizer")
OnSubFrame(transformer=TableVectorizer(datetime=DatetimeEncoder(add_total_seconds=False)))