Using previews for easier development and debugging#
To make interactive development easier without having to call eval() after
each step, it is possible to preview the result of a DataOp by passing a value
along with its name when creating a variable.
>>> import skrub
>>> a = skrub.var("a", 10) # we pass the value 10 in addition to the name
>>> b = skrub.var("b", 6)
>>> c = a + b
>>> c # now the display of c includes a preview of the result
<BinOp: add>
Result:
―――――――
16
Previews are eager computations on the current data; because they are computed immediately, they can surface errors early:
>>> import pandas as pd
>>> df = pd.DataFrame({"col": [1, 2, 3]})
>>> a = skrub.var("a", df) # we pass the DataFrame as a value
Next, we call the pandas drop() method and try to drop a column without
specifying the axis:
>>> a.drop("col")
Traceback (most recent call last):
...
RuntimeError: Evaluation of '.drop()' failed.
You can see the full traceback above. The error message was:
KeyError: "['col'] not found in axis"
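The failure comes from pandas itself: drop() defaults to the row axis, so "col" is looked up among the row labels. For reference, a small plain-pandas sketch of the corrected call:

```python
import pandas as pd

df = pd.DataFrame({"col": [1, 2, 3]})

# drop() defaults to axis=0 (the index), so df.drop("col") raises a
# KeyError. Naming the axis via the `columns=` keyword (or passing
# axis="columns") drops the column instead:
dropped = df.drop(columns="col")
print(dropped.shape)  # (3, 0)
```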
Note that seeing results for the values we provided does not change the fact that we are building a pipeline to reuse, not just computing the result for a fixed input. The displayed result is only a preview of the output on one example dataset.
>>> c.skb.eval({"a": 3, "b": 2})
5
It is not necessary to provide a value for every variable; however, it is advisable to do so when possible, as it allows errors to be caught early.
Disabling previews and eager checks#
By default, as soon as a DataOp is defined, some validity checks are performed
and the preview results are computed eagerly. In very complex DataOps plans
(100+ nodes), running checks after adding each node can cause a noticeable overhead.
To avoid this overhead, eager checks can be disabled with the "eager_data_ops"
configuration option.
>>> with skrub.config_context(eager_data_ops=False):
... # no checks are performed when b is defined so no error in the line below:
... b = skrub.var('a', 1) + skrub.var('a', 2)
... # checks are still performed (once) before the DataOp is actually used so
... # evaluating the DataOp, using .skb.make_learner() etc _would_ still raise:
... # b.skb.eval() ## raises ValueError: Choice and node names must be unique.
>>> b # Note there is no preview, even though we provided values for the variables
<BinOp: add>