Using only a part of a DataOps plan#

Besides documenting a DataOps plan, the .skb.set_name() has additional functions. By setting a name, we can:

  • Bypass the computation of that node and override its result by passing it as a key in the environment argument.

  • Truncate the computational graph after this node to obtain the intermediate result with SkrubLearner.truncated_after().

  • Retrieve that node and inspect the estimator that was fitted in it, if the node was created with .skb.apply().

Here is a toy example with 4 steps:

>>> def load_data(url):
...     print("load: ", url)
...     return [1, 2, 3, 4]
>>> def transform(x):
...     print("transform")
...     return [item * 10 for item in x]
>>> def agg(x):
...     print("agg")
...     return max(x)
>>> import skrub
>>> url = skrub.var("url")
>>> output = (
...     url.skb.apply_func(load_data)
...     .skb.set_name("loaded")
...     .skb.apply_func(transform)
...     .skb.set_name("transformed")
...     .skb.apply_func(agg)
... )

Above, we give a name to each intermediate result with .skb.set_name() so that we can later refer to it when manipulating a fitted learner.

>>> learner = output.skb.make_learner()
>>> learner.fit({"url": "file:///example.db"})
load:  file:///example.db
transform
agg
SkrubLearner(data_op=<Call 'agg'>)
>>> learner.transform({"url": "file:///example.db"})
load:  file:///example.db
transform
agg
40

Below, we bypass the data loading. Because we directly provide a value for the intermediate result that we named "loaded", the corresponding computation is skipped and the provided value is used instead. We can see that "load: ..." is not printed and that the rest of the computation proceeds using [6, 5, 4] (instead of [1, 2, 3, 4] as before).

>>> learner.transform({"loaded": [6, 5, 4]})
transform
agg
60

Now we show how to stop at the result we named "transformed". With truncated_after, we obtain a learner that computes that intermediate result and returns it instead of applying the last transformation; note that "agg" is not printed and we get the output of transform(), not of agg():

>>> truncated = learner.truncated_after("transformed")
>>> truncated.transform({"url": "file:///example.db"})
load:  file:///example.db
transform
[10, 20, 30, 40]