Assembling Skrub DataOps into complex machine learning pipelines#
Applying machine-learning estimators#
In addition to working directly with the API provided by the underlying data, DataOps can also be used to apply machine-learning estimators from scikit-learn or Skrub to the data. This is done through the .skb.apply() method:
>>> import pandas as pd
>>> import skrub
>>> orders_df = pd.DataFrame(
... {
... "item": ["pen", "cup", "pen", "fork"],
... "price": [1.5, None, 1.5, 2.2],
... "qty": [1, 1, 2, 4],
... }
... )
>>> orders = skrub.var("orders", orders_df)
>>> orders.skb.apply(skrub.TableVectorizer())
<Apply TableVectorizer>
Result:
―――――――
item_cup item_fork item_pen price qty
0 0.0 0.0 1.0 1.5 1.0
1 1.0 0.0 0.0 NaN 1.0
2 0.0 0.0 1.0 1.5 2.0
3 0.0 1.0 0.0 2.2 4.0
It is also possible to apply a transformer to a subset of the columns:
>>> vectorized_orders = orders.skb.apply(
... skrub.StringEncoder(n_components=3), cols="item"
... )
>>> vectorized_orders
<Apply StringEncoder>
Result:
―――――――
item_0 item_1 item_2 price qty
0 9.999999e-01 1.666000e-08 4.998001e-08 1.5 1
1 -1.332800e-07 -1.199520e-07 1.000000e+00 NaN 1
2 9.999999e-01 1.666000e-08 4.998001e-08 1.5 2
3 3.942477e-08 9.999999e-01 7.884953e-08 2.2 4
Then, we can export the transformation as a learner with .skb.make_learner():
>>> pipeline = vectorized_orders.skb.make_learner(fitted=True)
>>> new_orders = pd.DataFrame({"item": ["fork"], "price": [2.2], "qty": [5]})
>>> pipeline.transform({"orders": new_orders})
item_0 item_1 item_2 price qty
0 5.984116e-09 1.0 -1.323546e-07 2.2 5
Note that here the learner is fitted on the preview data, but in general it can be exported without fitting, and then fitted on new data provided as an environment dictionary.
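For example, a minimal sketch of that workflow, reusing the variables defined above:
>>> learner = vectorized_orders.skb.make_learner()   # exported without fitting
>>> learner = learner.fit({"orders": orders_df})     # fitted on data supplied at fit time
>>> result = learner.transform({"orders": new_orders})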
Applying different transformers using Skrub selectors and DataOps#
It is possible to use Skrub selectors to define which columns to apply transformers to, and then apply different transformers to different subsets of the data. For example, this can be useful to apply TextEncoder to columns that contain free-flowing text, and StringEncoder to other string columns that contain categorical data such as country names.
In the example below, we apply a StringEncoder to columns with high cardinality, a mathematical operation to columns with nulls, and a TableVectorizer to all other columns.
>>> from skrub import selectors as s
>>> high_cardinality = s.string() - s.cardinality_below(2)
>>> has_nulls = s.has_nulls()
>>> leftover = s.all() - high_cardinality - has_nulls
>>> vectorizer = skrub.StringEncoder(n_components=2)
>>> vectorized_items = orders.skb.select(high_cardinality).skb.apply(vectorizer)
>>> vectorized_items
<Apply StringEncoder>
Result:
―――――――
item_0 item_1
0 1.511858e+00 9.380015e-08
1 -1.704687e-07 1.511858e+00
2 1.511858e+00 9.380015e-08
3 -5.458670e-09 -6.917769e-08
>>> vectorized_has_nulls = orders.skb.select(cols=has_nulls) * 11
>>> vectorized_has_nulls
<BinOp: mul>
Result:
―――――――
price
0 16.5
1 NaN
2 16.5
3 24.2
>>> everything_else = orders.skb.select(cols=leftover).skb.apply(skrub.TableVectorizer())
After encoding the columns, the resulting DataOps can be concatenated together to obtain the final result:
>>> encoded = (
... everything_else.skb.concat([vectorized_items, vectorized_has_nulls], axis=1)
... )
>>> encoded
qty item_0 item_1 price
0 1.0 1.594282e+00 -1.224524e-07 16.5
1 1.0 9.228692e-08 1.473794e+00 NaN
2 2.0 1.594282e+00 -1.224524e-07 16.5
3 4.0 7.643604e-09 6.080018e-01 24.2
More info on advanced column selection and manipulation can be found in Skrub Selectors: helpers for selecting columns in a dataframe and in the example Hands-On with Column Selection and Transformers.
Documenting the DataOps plan with node names and descriptions#
We can improve the readability of the DataOps plan by giving names and descriptions to the nodes in the plan. This is done with .skb.set_name() and .skb.set_description().
>>> import skrub
>>> a = skrub.var('a', 1)
>>> b = skrub.var('b', 2)
>>> c = (a + b).skb.set_description('the addition of a and b')
>>> c.skb.description
'the addition of a and b'
>>> d = c.skb.set_name('d')
>>> d.skb.name
'd'
Both names and descriptions can be used to mark relevant parts of the learner, and they can be accessed from the computational graph and the plan report.
Additionally, names can be used to bypass the computation of a node and override its result by passing it as a key in the environment dictionary.
>>> e = d * 10
>>> e
<BinOp: mul>
Result:
―――――――
30
>>> e.skb.eval()
30
>>> e.skb.eval({'a': 10, 'b': 5})
150
>>> e.skb.eval({'d': -1}) # -1 * 10
-10
More info can be found in section Using only a part of a DataOps plan.
Evaluating and debugging the DataOps plan with .skb.full_report()#
All operations on DataOps are recorded in a computational graph, which can be inspected with .skb.full_report(). This method generates an HTML report that shows the full plan, including all nodes, their names, descriptions, and the transformations applied to the data. An example of the report can be found here.
For each node in the plan, the report shows:
- The name and the description of the node, if present.
- Predecessor and successor nodes in the computational graph.
- Where the code corresponding to the node is defined.
- The estimator fitted in the node, along with its parameters (if applicable).
- The preview of the data at that node.
Additionally, if computations fail in the plan, the report shows the offending node and the error message, which can help in debugging the plan.
By default, reports are saved in the skrub_data/execution_reports directory, but they can be saved to a different location with the output_dir parameter.
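For instance, a minimal sketch (the report is written to disk; the open=False flag, which prevents the report from opening in a browser, is an assumption that may vary across versions):
>>> report = encoded.skb.full_report(output_dir="orders_report", open=False)  # assumed signature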
Note that the default path can be altered with the SKRUB_DATA_DIR environment variable. See Example datasets, utilities, and customization for more details.
Using only a part of a DataOps plan#
Besides documenting a DataOps plan, the .skb.set_name() method has additional functions. By setting a name, we can:
- Bypass the computation of that node and override its result by passing it as a key in the environment argument.
- Truncate the pipeline after this node to obtain the intermediate result with SkrubLearner.truncated_after().
- Retrieve that node and inspect the estimator that was fitted in it, if the node was created with .skb.apply() (see the sketch at the end of this section).
Here is a toy example with 3 steps:
>>> def load_data(url):
... print("load: ", url)
... return [1, 2, 3, 4]
>>> def transform(x):
... print("transform")
... return [item * 10 for item in x]
>>> def agg(x):
... print("agg")
... return max(x)
>>> url = skrub.var("url")
>>> output = (
... url.skb.apply_func(load_data)
... .skb.set_name("loaded")
... .skb.apply_func(transform)
... .skb.set_name("transformed")
... .skb.apply_func(agg)
... )
Above, we give a name to each intermediate result with .skb.set_name() so that we can later refer to it when manipulating a fitted pipeline.
>>> pipeline = output.skb.make_learner()
>>> pipeline.fit({"url": "file:///example.db"})
load: file:///example.db
transform
agg
SkrubLearner(data_op=<Call 'agg'>)
>>> pipeline.transform({"url": "file:///example.db"})
load: file:///example.db
transform
agg
40
Below, we bypass the data loading. Because we directly provide a value for the intermediate result that we named "loaded", the corresponding computation is skipped and the provided value is used instead. We can see that "load: ..." is not printed and that the rest of the computation proceeds using [6, 5, 4] (instead of [1, 2, 3, 4] as before).
>>> pipeline.transform({"loaded": [6, 5, 4]})
transform
agg
60
Now we show how to stop at the result we named "transformed". With truncated_after, we obtain a pipeline that computes that intermediate result and returns it instead of applying the last transformation; note that "agg" is not printed and we get the output of transform(), not of agg():
>>> truncated = pipeline.truncated_after("transformed")
>>> truncated.transform({"url": "file:///example.db"})
load: file:///example.db
transform
[10, 20, 30, 40]
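Finally, names let us retrieve the estimator fitted in a node created with .skb.apply(). A minimal sketch, assuming a find_fitted_estimator() method on the fitted learner (an assumption; the node must be named before the learner is built):
>>> named = orders.skb.apply(skrub.TableVectorizer()).skb.set_name("vectorized")
>>> learner = named.skb.make_learner(fitted=True)
>>> vectorizer = learner.find_fitted_estimator("vectorized")  # assumed helper: returns the fitted TableVectorizer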
Subsampling data for easier development and debugging#
If the data used for the preview is large, it can be useful to work on a subsample of the data to speed up the development and debugging process. This can be done by calling the .skb.subsample() method on a variable: this signals to Skrub that what is shown when printing DataOps, or returned by .skb.preview(), is computed on a subsample of the data.
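For example, a minimal sketch, assuming a large dataframe big_df (hypothetical) and an n parameter that sets the subsample size (an assumption about the exact signature):
>>> data = skrub.var("data", big_df).skb.subsample(n=100)  # previews are computed on 100 rows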
Note that subsampling is “local”: if it is applied to a variable, it only affects the variable itself. This may lead to unexpected results and errors if, for example, X is subsampled but y is not.
Subsampling is turned off by default when we call other methods such as .skb.eval(), .skb.cross_validate(), .skb.train_test_split(), DataOp.skb.make_learner(), DataOp.skb.make_randomized_search(), etc. However, all of those methods have a keep_subsampling parameter that we can set to True to force subsampling when we call them. Note that even if we set keep_subsampling=True, subsampling is not applied when using predict.
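For instance, a minimal sketch of forcing subsampling during cross-validation (predictions is a hypothetical DataOp ending in a supervised estimator):
>>> scores = predictions.skb.cross_validate(keep_subsampling=True)  # fast, approximate scores on the subsample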
See more details in a full example.