SkrubPipeline
- class skrub.SkrubPipeline(expr)
Pipeline that evaluates a skrub expression.
This class is not meant to be instantiated manually. SkrubPipeline objects are created by calling Expr.skb.get_pipeline() on an expression.
Methods
find_fitted_estimator(name): Find the scikit-learn estimator that has been fitted in a .skb.apply() step.
get_params([deep]): Get parameters for this estimator.
set_params(**params): Set the parameters of this estimator.
truncated_after(name): Extract the part of the pipeline that leads up to the given step.
fit
report
- find_fitted_estimator(name)
Find the scikit-learn estimator that has been fitted in a .skb.apply() step.
This can be useful, for example, to inspect the fitted attributes of the estimator. The apply step must have been given a name with .skb.set_name() (see examples below).
- Parameters:
  - name : str
    The name of the .skb.apply() step in which an estimator has been fitted.
- Returns:
  - scikit-learn estimator
    The fitted estimator. Depending on the nature of the estimator, it may be wrapped in a skrub.OnEachColumn or skrub.OnSubFrame; see examples below.
See also
skrub.Expr.skb.set_name
Give a name to this expression.
skrub.Expr.skb.apply
Apply a scikit-learn estimator to a dataframe or numpy array.
Examples
>>> from sklearn.decomposition import PCA
>>> from sklearn.dummy import DummyClassifier
>>> import skrub
>>> from skrub import selectors as s

>>> orders = skrub.toy_orders()
>>> X, y = skrub.X(), skrub.y()
>>> pred = (
...     X.skb.apply(skrub.StringEncoder(n_components=2), cols=["product"])
...     .skb.set_name("product_encoder")
...     .skb.apply(skrub.ToDatetime(), cols=["date"])
...     .skb.apply(skrub.DatetimeEncoder(add_total_seconds=False), cols=["date"])
...     .skb.apply(PCA(n_components=2), cols=s.glob("date_*"))
...     .skb.set_name("pca")
...     .skb.apply(DummyClassifier(), y=y)
...     .skb.set_name("classifier")
... )
>>> pipeline = pred.skb.get_pipeline()
>>> pipeline.fit({'X': orders.X, 'y': orders.y})
SkrubPipeline(expr=<classifier | Apply DummyClassifier>)
We can retrieve the fitted estimator for a given step with find_fitted_estimator:
>>> pipeline.find_fitted_estimator("classifier")
DummyClassifier()
Depending on the parameters passed to .skb.apply(), the estimator we provide can be wrapped in a skrub transformer that applies it to several columns in the input, or to a subset of the columns in a dataframe. In other cases it may be applied without any wrapping. We provide examples for those 3 different cases below.
Case 1: the StringEncoder is a skrub single-column transformer: it transforms a single column. In the pipeline it gets wrapped in a skrub.OnEachColumn, which independently fits a separate instance of the StringEncoder to each of the columns it transforms (in this case there is only one column, 'product'). The individual transformers can be found in the fitted attribute transformers_, which maps column names to the corresponding fitted transformer.
>>> encoder = pipeline.find_fitted_estimator('product_encoder')
>>> encoder.transformers_
{'product': StringEncoder(n_components=2)}
>>> encoder.transformers_['product'].vectorizer_.vocabulary_
{' pe': 2, 'pen': 12, 'en ': 8, ' pen': 3, 'pen ': 13, ' cu': 0, 'cup': 6, 'up ': 18, ' cup': 1, 'cup ': 7, ' sp': 4, 'spo': 16, 'poo': 14, 'oon': 10, 'on ': 9, ' spo': 5, 'spoo': 17, 'poon': 15, 'oon ': 11}
This case (wrapping in OnEachColumn) happens when the estimator is a skrub single-column transformer (it has a __single_column_transformer__ attribute), when we pass .skb.apply(how='columnwise'), or when we pass .skb.apply(allow_reject=True).
Case 2: the PCA is a regular scikit-learn transformer. In the pipeline it gets wrapped in a skrub.OnSubFrame, which applies it to the subset of columns in the dataframe selected by the cols argument passed to .skb.apply(). The fitted PCA can be found in the fitted attribute transformer_.
>>> pca = pipeline.find_fitted_estimator('pca')
>>> pca
OnSubFrame(cols=glob('date_*'), transformer=PCA(n_components=2))
>>> pca.transformer_
PCA(n_components=2)
>>> pca.transformer_.mean_
array([2020., 4., 4.], dtype=float32)
This case (wrapping in OnSubFrame) happens when the estimator is a scikit-learn transformer but not a single-column transformer.
Case 3: the DummyClassifier is a scikit-learn predictor. In the pipeline it gets applied directly to the input dataframe without any wrapping.
>>> classifier = pipeline.find_fitted_estimator('classifier')
>>> classifier
DummyClassifier()
>>> classifier.class_prior_
array([0.75, 0.25])
This case (no wrapping) happens when the estimator is a scikit-learn predictor (not a transformer), when the input is not a dataframe (e.g. it is a numpy array), or when we pass .skb.apply(how='full_frame').
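The wrapping rules from the cases above can be restated as a small decision function. This is only a sketch of the rules as listed in this section, not skrub's implementation: the helper `expected_wrapper` and its boolean arguments are hypothetical, the `how="auto"` default is an assumption, and so is the order of the checks (the no-wrapping conditions are tested first).

```python
def expected_wrapper(
    *,
    is_transformer,
    is_single_column=False,
    input_is_dataframe=True,
    how="auto",  # assumed default value; not stated in this section
    allow_reject=False,
):
    """Return the name of the wrapper described in the cases above, or None.

    Hypothetical helper: it only restates the documented rules and assumes
    that the "no wrapping" conditions take precedence over the others.
    """
    # No wrapping: predictors, non-dataframe inputs, or how='full_frame'.
    if not is_transformer or not input_is_dataframe or how == "full_frame":
        return None
    # OnEachColumn: single-column transformers, how='columnwise',
    # or allow_reject=True.
    if is_single_column or how == "columnwise" or allow_reject:
        return "OnEachColumn"
    # OnSubFrame: any other scikit-learn transformer applied to a dataframe.
    return "OnSubFrame"


# The three named steps of the example pipeline above:
print(expected_wrapper(is_transformer=True, is_single_column=True))  # StringEncoder -> OnEachColumn
print(expected_wrapper(is_transformer=True))                         # PCA -> OnSubFrame
print(expected_wrapper(is_transformer=False))                        # DummyClassifier -> None
```

When in doubt about which case applies to a given step, inspecting the object returned by find_fitted_estimator, as in the examples above, is the reliable check.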
- set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters:
  - **params : dict
    Estimator parameters.
- Returns:
  - self : estimator instance
    Estimator instance.
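To illustrate the `<component>__<parameter>` naming convention, here is a minimal pure-Python sketch of how flat parameter names can be grouped by component. The helper `split_nested_params` and the parameter names (`verbose`, `clf__C`, `pca__n_components`) are hypothetical and chosen for illustration only; this is not the code that set_params runs.

```python
def split_nested_params(params):
    """Group flat ``<component>__<parameter>`` names by component.

    Hypothetical helper for illustration. Names without ``__`` are treated
    as parameters of the top-level object and stored under the key ``None``.
    """
    grouped = {}
    for name, value in params.items():
        # Split on the first double underscore only, so deeper nesting
        # (e.g. "a__b__c") stays attached to the sub-component "a".
        component, delimiter, sub_name = name.partition("__")
        if delimiter:
            grouped.setdefault(component, {})[sub_name] = value
        else:
            grouped.setdefault(None, {})[name] = value
    return grouped


flat = {"verbose": True, "clf__C": 0.1, "clf__penalty": "l2", "pca__n_components": 2}
print(split_nested_params(flat))
# {None: {'verbose': True}, 'clf': {'C': 0.1, 'penalty': 'l2'}, 'pca': {'n_components': 2}}
```

Each group would then be forwarded to the matching component, which is what lets a single call update several parts of a nested object.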
- truncated_after(name)
Extract the part of the pipeline that leads up to the given step.
This is similar to slicing a scikit-learn pipeline. It can be useful, for example, to drive the hyperparameter selection with a supervised task but then extract only the part of the pipeline that performs feature extraction.
The target step must have been given a name with .skb.set_name().
- Parameters:
  - name : str
    The name of the intermediate step we want to extract.
- Returns:
  - SkrubPipeline
    A skrub pipeline that performs all the transformations leading up to (and including) the required step.
Examples
>>> from sklearn.dummy import DummyClassifier
>>> import skrub

>>> orders = skrub.toy_orders()
>>> X, y = skrub.X(), skrub.y()
>>> pred = (
...     X.skb.apply(
...         skrub.TableVectorizer(datetime=skrub.DatetimeEncoder(add_total_seconds=False))
...     )
...     .skb.set_name("vectorizer")
...     .skb.apply(DummyClassifier(), y=y)
... )
>>> pipeline = pred.skb.get_pipeline()
>>> pipeline.fit({"X": orders.X, "y": orders.y})
SkrubPipeline(expr=<Apply DummyClassifier>)
>>> pipeline.predict({"X": orders.X})
array([False, False, False, False])
Truncate the pipeline after vectorization:
>>> vectorizer = pipeline.truncated_after("vectorizer")
>>> vectorizer
SkrubPipeline(expr=<vectorizer | Apply TableVectorizer>)
>>> vectorizer.transform({"X": orders.X})
    ID  product_cup  product_pen  ...  date_year  date_month  date_day
0  1.0          0.0          1.0  ...     2020.0         4.0       3.0
1  2.0          1.0          0.0  ...     2020.0         4.0       4.0
2  3.0          1.0          0.0  ...     2020.0         4.0       4.0
3  4.0          0.0          0.0  ...     2020.0         4.0       5.0
Note that this differs from find_fitted_estimator, which extracts the inner scikit-learn estimator that has been fitted inside of a single step.
The truncated pipeline contains the full transformation up to the given step:
>>> pipeline.truncated_after("vectorizer")
SkrubPipeline(expr=<vectorizer | Apply TableVectorizer>)
The result of find_fitted_estimator only contains the inner TableVectorizer that was fitted inside of the "vectorizer" step:
>>> pipeline.find_fitted_estimator("vectorizer")
OnSubFrame(transformer=TableVectorizer(datetime=DatetimeEncoder(add_total_seconds=False)))