SkrubPipeline

class skrub.SkrubPipeline(expr)

Pipeline that evaluates a skrub expression.

This class is not meant to be instantiated manually. SkrubPipeline objects are created by calling Expr.skb.get_pipeline() on an expression.

Methods

`describe_params`()	Describe parameters for this pipeline.
`find_fitted_estimator`(name)	Find the scikit-learn estimator that has been fitted in a `.skb.apply()` step.
`get_params`([deep])	Get parameters for this estimator.
`report`(*, environment, mode, ...)	Call the method specified by mode and return the result and full report.
`set_params`(**params)	Set the parameters of this estimator.
`truncated_after`(name)	Extract the part of the pipeline that leads up to the given step.

fit
get_param_grid

describe_params()

Describe parameters for this pipeline.

Returns a human-readable description (in the form of a dict) of the parameters, that is, the outcomes of the choose_* objects contained in the expression.

find_fitted_estimator(name)

Find the scikit-learn estimator that has been fitted in a .skb.apply() step.

This can be useful, for example, to inspect the fitted attributes of the estimator. The apply step must have been given a name with .skb.set_name() (see examples below).

Parameters:

name : str
    The name of the .skb.apply() step in which an estimator has been fitted.

Returns:

scikit-learn estimator: The fitted estimator. Depending on the nature of the estimator, it may be wrapped in a skrub.ApplyToCols or a skrub.ApplyToFrame (see examples below).

See also

skrub.Expr.skb.set_name: Give a name to this expression.
skrub.Expr.skb.apply: Apply a scikit-learn estimator to a dataframe or numpy array.

Examples

>>> from sklearn.decomposition import PCA
>>> from sklearn.dummy import DummyClassifier
>>> import skrub
>>> from skrub import selectors as s

>>> orders = skrub.toy_orders()
>>> X, y = skrub.X(), skrub.y()
>>> pred = (
...     X.skb.apply(skrub.StringEncoder(n_components=2), cols=["product"])
...     .skb.set_name("product_encoder")
...     .skb.apply(skrub.ToDatetime(), cols=["date"])
...     .skb.apply(skrub.DatetimeEncoder(add_total_seconds=False), cols=["date"])
...     .skb.apply(PCA(n_components=2), cols=s.glob("date_*"))
...     .skb.set_name("pca")
...     .skb.apply(DummyClassifier(), y=y)
...     .skb.set_name("classifier")
... )
>>> pipeline = pred.skb.get_pipeline()
>>> pipeline.fit({'X': orders.X, 'y': orders.y})
SkrubPipeline(expr=<classifier | Apply DummyClassifier>)

We can retrieve the fitted estimator for a given step with find_fitted_estimator:

>>> pipeline.find_fitted_estimator("classifier")
DummyClassifier()

Depending on the parameters passed to skb.apply(), the estimator we provide can be wrapped in a skrub transformer that applies it to several columns in the input, or to a subset of the columns in a dataframe. In other cases it may be applied without any wrapping. We provide examples for those 3 different cases below.

Case 1: the StringEncoder is a skrub single-column transformer: it transforms a single column. In the pipeline it gets wrapped in a skrub.ApplyToCols which independently fits a separate instance of the StringEncoder to each of the columns it transforms (in this case there is only one column, 'product'). The individual transformers can be found in the fitted attribute transformers_ which maps column names to the corresponding fitted transformer.

>>> encoder = pipeline.find_fitted_estimator('product_encoder')
>>> encoder.transformers_
{'product': StringEncoder(n_components=2)}
>>> encoder.transformers_['product'].vectorizer_.vocabulary_
{' pe': 2, 'pen': 12, 'en ': 8, ' pen': 3, 'pen ': 13, ' cu': 0, 'cup': 6, 'up ': 18, ' cup': 1, 'cup ': 7, ' sp': 4, 'spo': 16, 'poo': 14, 'oon': 10, 'on ': 9, ' spo': 5, 'spoo': 17, 'poon': 15, 'oon ': 11}

This case (wrapping in ApplyToCols) happens in any of three situations: the estimator is a skrub single-column transformer (it has a __single_column_transformer__ attribute), we pass .skb.apply(how='columnwise'), or we pass .skb.apply(allow_reject=True).

Case 2: the PCA is a regular scikit-learn transformer. In the pipeline it gets wrapped in a skrub.ApplyToFrame which applies it to the subset of columns in the dataframe selected by the cols argument passed to .skb.apply(). The fitted PCA can be found in the fitted attribute transformer_.

>>> pca = pipeline.find_fitted_estimator('pca')
>>> pca
ApplyToFrame(cols=glob('date_*'), transformer=PCA(n_components=2))
>>> pca.transformer_
PCA(n_components=2)
>>> pca.transformer_.mean_
array([2020.,    4.,    4.], dtype=float32)

This case (wrapping in ApplyToFrame) happens when the estimator is a scikit-learn transformer but not a single-column transformer.

Case 3: the DummyClassifier is a scikit-learn predictor. In the pipeline it gets applied directly to the input dataframe without any wrapping.

>>> classifier = pipeline.find_fitted_estimator('classifier')
>>> classifier
DummyClassifier()
>>> classifier.class_prior_
array([0.75, 0.25])

This case (no wrapping) happens when the estimator is a scikit-learn predictor (not a transformer), the input is not a dataframe (e.g. it is a numpy array), or we pass .skb.apply(how='full_frame').

get_params(deep=True)

Get parameters for this estimator.

Parameters:

deep : bool, default=True
    If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

params : dict
    Parameter names mapped to their values.

report(*, environment, mode, **full_report_kwargs)

Call the method specified by mode and return the result and full report.

See Expr.skb.full_report() for more information.

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:

**params : dict
    Estimator parameters.

Returns:

self : estimator instance
    Estimator instance.
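
The `<component>__<parameter>` naming convention described above can be illustrated with a plain scikit-learn Pipeline. This is only a sketch of the convention, not a SkrubPipeline: the step names "scale" and "pca" are arbitrary choices made for this illustration, and the parameter names exposed by an actual SkrubPipeline depend on the expression it wraps.

```python
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# A nested object: a Pipeline holding two named components.
# The names "scale" and "pca" are illustrative assumptions.
pipe = Pipeline([("scale", StandardScaler()), ("pca", PCA(n_components=2))])

# deep=True (the default) also lists the parameters of the contained
# estimators, keyed as <component>__<parameter>.
deep_params = pipe.get_params(deep=True)
pca_components_before = deep_params["pca__n_components"]

# deep=False only lists the parameters of the outer object itself.
shallow_params = pipe.get_params(deep=False)

# set_params accepts the same nested keys and returns the estimator itself,
# so calls can be chained.
returned = pipe.set_params(pca__n_components=3, scale__with_mean=False)
pca_components_after = pipe.get_params()["pca__n_components"]
```

Here `returned` is the same object as `pipe`, and the nested keys appear only in the `deep=True` dictionary.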

truncated_after(name)

Extract the part of the pipeline that leads up to the given step.

This is similar to slicing a scikit-learn pipeline. It can be useful, for example, to drive the hyperparameter selection with a supervised task and then extract only the part of the pipeline that performs feature extraction.

The target step must have been given a name with .skb.set_name().

Parameters:

name : str
    The name of the intermediate step we want to extract.

Returns:

SkrubPipeline: A skrub pipeline that performs all the transformations leading up to (and including) the required step.

Examples

>>> from sklearn.dummy import DummyClassifier
>>> import skrub

>>> orders = skrub.toy_orders()
>>> X, y = skrub.X(), skrub.y()
>>> pred = (
...     X.skb.apply(
...         skrub.TableVectorizer(datetime=skrub.DatetimeEncoder(add_total_seconds=False))
...     )
...     .skb.set_name("vectorizer")
...     .skb.apply(DummyClassifier(), y=y)
... )
>>> pipeline = pred.skb.get_pipeline()
>>> pipeline.fit({"X": orders.X, "y": orders.y})
SkrubPipeline(expr=<Apply DummyClassifier>)
>>> pipeline.predict({"X": orders.X})
array([False, False, False, False])

Truncate the pipeline after vectorization:

>>> vectorizer = pipeline.truncated_after("vectorizer")
>>> vectorizer
SkrubPipeline(expr=<vectorizer | Apply TableVectorizer>)
>>> vectorizer.transform({"X": orders.X})
    ID  product_cup  product_pen  ...  date_year  date_month  date_day
0  1.0          0.0          1.0  ...     2020.0         4.0       3.0
1  2.0          1.0          0.0  ...     2020.0         4.0       4.0
2  3.0          1.0          0.0  ...     2020.0         4.0       4.0
3  4.0          0.0          0.0  ...     2020.0         4.0       5.0

Note that this differs from find_fitted_estimator, which extracts the inner scikit-learn estimator that has been fitted inside a single step.

This contains the full transformation up to the given step:

>>> pipeline.truncated_after("vectorizer")
SkrubPipeline(expr=<vectorizer | Apply TableVectorizer>)

The result of find_fitted_estimator only contains the inner TableVectorizer that was fitted inside the "vectorizer" step:

>>> pipeline.find_fitted_estimator("vectorizer")
ApplyToFrame(transformer=TableVectorizer(datetime=DatetimeEncoder(add_total_seconds=False)))
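
For readers who know scikit-learn pipelines, the distinction drawn above can be compared with slicing a Pipeline versus looking up one of its steps. The sketch below uses plain scikit-learn objects only. The toy array and the step names "scale", "pca" and "clf" are assumptions made for this illustration, and the comparison is an analogy rather than a description of how SkrubPipeline is implemented.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.dummy import DummyClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Small made-up dataset: 4 rows, 3 numeric features.
X = np.array([[0.0, 1.0, 2.0], [1.0, 3.0, 5.0], [2.0, 2.0, 1.0], [3.0, 0.0, 4.0]])
y = np.array([0, 1, 0, 0])

pipe = Pipeline(
    [("scale", StandardScaler()), ("pca", PCA(n_components=2)), ("clf", DummyClassifier())]
)
pipe.fit(X, y)

# Slicing keeps every step up to (and including) "pca" and drops the classifier.
# This is the rough counterpart of truncated_after("pca").
features = pipe[:-1]
X_transformed = features.transform(X)

# Looking up a single step returns only that fitted estimator.
# This is the rough counterpart of find_fitted_estimator("pca").
pca_only = pipe.named_steps["pca"]
```

The sliced pipeline applies the scaling and then the PCA, whereas `pca_only` is just the fitted PCA and expects input that has already been scaled.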

Gallery examples

Building complex tabular pipelines
