Tuning and validating skrub DataOps plans#
To evaluate the prediction performance of our plan, we can fit it on a training dataset, then obtaining prediction on an unseen, test dataset.
In scikit-learn, we pass to estimators and pipelines an X and y matrix
with one row per observation from the start. Therefore, we can split the
data into a training and test set independently from the pipeline.
However, in many real-world scenarios, our data sources are not already
organized into X and y matrices. Some transformations may be necessary to
build them, and we want to keep those transformations inside the pipeline so
that they can be reliably re-applied to new data.
Therefore, we must start our pipeline by creating the design matrix and targets,
then tell skrub which intermediate results in the pipeline constitute X and
y respectively.
Let us consider a toy example where we simply obtain X and
y from a single table. More complex transformations would be handled in
the same way.
>>> from sklearn.datasets import load_diabetes
>>> from sklearn.linear_model import Ridge
>>> import skrub
>>> diabetes_df = load_diabetes(as_frame=True)["frame"]
In the original data, all features and the target are in the same dataframe.
>>> data = skrub.var("data", diabetes_df)
We build our design matrix by dropping the target. Note we use
errors="ignore" so that pandas does not raise an error if the column we want
to drop is already missing. Indeed, when we will need to make actual useful
predictions on unlabelled data, the “target” column will not be available.
>>> X = data.drop(columns="target", errors="ignore").skb.mark_as_X()
We use .skb.mark_as_X() to indicate that this
intermediate result (the dataframe obtained after dropping “target”) is the
X design matrix. This is the dataframe that will be split into a training
and a testing part when we split our dataset or perform cross-validation.
Similarly for y, we use .skb.mark_as_y():
>>> y = data["target"].skb.mark_as_y()
Now we can add our supervised estimator:
>>> pred = X.skb.apply(Ridge(), y=y)
>>> pred
<Apply Ridge>
Result:
―――――――
target
0 182.673354
1 90.998607
2 166.113476
3 156.034880
4 133.659575
.. ...
437 180.323365
438 135.798908
439 139.855630
440 182.645829
441 83.564413
[442 rows x 1 columns]
Once a pipeline is defined and the X and y nodes are identified, skrub
is able to split the dataset and perform cross-validation.
Improving the confidence in our score through cross-validation#
We can increase our confidence in our score by using cross-validation instead of
a single split. The same mechanism is used but we now fit and evaluate the model
on several splits. This is done with .skb.cross_validate().
>>> pred.skb.cross_validate()
fit_time score_time test_score
0 0.002816 0.001344 0.321665
1 0.002685 0.001323 0.440485
2 0.002468 0.001308 0.422104
3 0.002748 0.001321 0.424661
4 0.002649 0.001309 0.441961
Splitting the data in train and test sets#
We can use .skb.train_test_split() to
perform a single train-test split. skrub first evaluates the DataOps on
which we used .skb.mark_as_X() and
.skb.mark_as_y(): the first few steps of the
pipeline are executed until we have a value for X and for y.
Then, those
dataframes are split using the provided split function (by default
scikit-learn’s sklearn.model_selection.train_test_split()).
>>> split = pred.skb.train_test_split(shuffle=False)
>>> split.keys()
dict_keys(['train', 'test', 'X_train', 'X_test', 'y_train', 'y_test'])
train and test are the full dictionaries corresponding to the training
and testing data. The corresponding X and y are the values, in those
dictionaries, for the nodes marked with
.skb.mark_as_X()
and .skb.mark_as_y().
We can now fit our pipeline on the training data:
>>> learner = pred.skb.make_learner()
>>> learner.fit(split["train"])
SkrubLearner(data_op=<Apply Ridge>)
Only the training part of X and y are used. The subsequent steps are
evaluated, using this data, to fit the rest of the pipeline.
And we can obtain predictions on the test part:
>>> test_pred = learner.predict(split["test"])
>>> test_y_true = split["y_test"]
>>> from sklearn.metrics import r2_score
>>> r2_score(test_y_true, test_pred)
0.440999149220359
It is possible to define a custom split function to use instead of
sklearn.model_selection.train_test_split().
Passing additional arguments to the splitter#
Sometimes we want to pass additional data to the cross-validation splitter.
For example, if there is a group structure in our data (such as sites,
hospitals, etc.) and we want the model to generalize to unseen groups, we must
ensure while evaluating it that each group goes entirely in the train set or the
test set, but is not divided among the 2. This can be done with
sklearn.model_selection.GroupKFold,
sklearn.model_selection.LeavePGroupsOut, etc. . The split function
of those objects accepts a groups parameter. We can compute the groups
inside of the DataOp and pass them to DataOp.skb.mark_as_X() and they will
be passed to the splitter.
>>> df = skrub.datasets.toy_products()
>>> df
description price seller category
0 screen 100 supermarket.com electronics
1 hammer 15 bestproducts.com tools
2 keyboard 20 supermarket.com electronics
3 usb key 9 bestproducts.com electronics
4 charger 13 bestproducts.com electronics
5 screwdriver 12 supermarket.com tools
Suppose we want to assess generalization to new sellers. While splitting for
cross-validation we must group products by seller. We do it with
sklearn.model_selection.LeaveOneGroupOut.
>>> from sklearn.dummy import DummyClassifier
>>> from sklearn.model_selection import LeaveOneGroupOut
>>> data = skrub.var("df", df)
>>> groups = data["seller"]
>>> X = data[["description", "price"]].skb.mark_as_X(
... cv=LeaveOneGroupOut(), split_kwargs={"groups": groups}
... )
>>> y = data["category"].skb.mark_as_y()
>>> pred = X.skb.apply(DummyClassifier(), y=y)
>>> split = pred.skb.train_test_split()
The train set only contains data from the “supermarket.com” seller.
>>> split["X_train"]
description price
0 screen 100
2 keyboard 20
5 screwdriver 12
The test set only contains data from the “bestproducts.com” seller.
>>> split["X_test"]
description price
1 hammer 15
3 usb key 9
4 charger 13
Passing additional arguments to the scorer#
Sometimes we have additional information to pass to the scorer such as sample weights, group information etc.
We can control how scoring is performed by using
DataOp.skb.with_scoring(). It has a scoring parameter, which can be
anything scikit-learn’s cross_validate() accepts
for scoring such as a metric name, callable scorer, or dict mapping metric
names to scorers (see the reference documentation of
DataOp.skb.with_scoring() for details).
It also accepts a kwargs argument, which are passed to the scorer when
evaluating the learner.
Importantly, the scoring and kwargs can be DataOps, which will be
computed when scoring the learner – so for example, sample weights can be
computed dynamically.
Using the same toy dataset as above, suppose we want to give more weight to more expensive products:
>>> X = data[["description", "price"]].skb.mark_as_X(cv=2)
>>> y = data["category"].skb.mark_as_y()
>>> pred = X.skb.apply(DummyClassifier(), y=y)
The default score is the (unweighted) accuracy:
>>> pred.skb.cross_validate()
fit_time score_time test_score
0 0.003982 0.002405 0.666667
1 0.002582 0.002169 0.666667
We set the scoring to provide the sample weights:
>>> sample_weight = X["price"]
>>> pred.skb.with_scoring(
... "accuracy", kwargs={"sample_weight": sample_weight}
... ).skb.cross_validate()
fit_time score_time test_accuracy
0 0.003045 0.003275 0.888889
1 0.002659 0.003026 0.647059
Besides passing extra arguments, DataOp.skb.with_scoring() can also be
useful to control what should be used as the default scoring metric for our
learner, just as the cv parameter of DataOp.skb.mark_as_X() defines
the default cross-validation splitting strategy.
>>> split = pred.skb.train_test_split()
>>> learner = pred.skb.with_scoring('neg_log_loss').skb.make_learner()
>>> learner.fit(split['train'])
SkrubLearner(data_op=<Scoring <Apply DummyClassifier> (1 scorers)>
This DataOp will be scored with:
- 'neg_log_loss'
Use .skb.cross_validate(…) or .skb.make_learner(…).score(…) to compute scores.)
>>> learner.score(split['test'])
-0.6365141682948128
Note that the score above is negative: it is the negative log loss we passed to
with_scoring, and not the default score (accuracy, which would be positive).
DataOp.skb.with_scoring() only changes how scoring is performed
(the outputs of DataOp.skb.cross_validate(),
DataOp.skb.make_randomized_search(), SkrubLearner.score etc.),
not the actual outputs of the learner (it does _not_ affect the outputs of
DataOp.skb.eval(), SkrubLearner.predict, etc.)
This method can be called several times to add scorers that take different kwargs. See the reference documentation for details.