Tuning and validating skrub DataOps plans#
To evaluate the prediction performance of our plan, we can fit it on a training dataset, then obtain predictions on an unseen test dataset.
In scikit-learn, we pass estimators and pipelines X and y matrices with one row per observation from the start. Therefore, we can split the data into training and test sets independently from the pipeline.
However, in many real-world scenarios, our data sources are not already organized into X and y matrices. Some transformations may be necessary to build them, and we want to keep those transformations inside the pipeline so that they can be reliably re-applied to new data.
Therefore, we must start our pipeline by creating the design matrix and targets, then tell skrub which intermediate results in the pipeline constitute X and y respectively.
Let us consider a toy example where we simply obtain X and y from a single table. More complex transformations would be handled in the same way.
>>> from sklearn.datasets import load_diabetes
>>> from sklearn.linear_model import Ridge
>>> import skrub
>>> diabetes_df = load_diabetes(as_frame=True)["frame"]
In the original data, all features and the target are in the same dataframe.
>>> data = skrub.var("data", diabetes_df)
We build our design matrix by dropping the target. Note that we use errors="ignore" so that pandas does not raise an error if the column we want to drop is missing. Indeed, when we later make predictions on unlabelled data, the “target” column will not be available.
>>> X = data.drop(columns="target", errors="ignore").skb.mark_as_X()
We use .skb.mark_as_X() to indicate that this intermediate result (the dataframe obtained after dropping “target”) is the X design matrix. This is the dataframe that will be split into a training and a testing part when we split our dataset or perform cross-validation. Similarly for y, we use .skb.mark_as_y():
>>> y = data["target"].skb.mark_as_y()
Now we can add our supervised estimator:
>>> pred = X.skb.apply(Ridge(), y=y)
>>> pred
<Apply Ridge>
Result:
―――――――
target
0 182.673354
1 90.998607
2 166.113476
3 156.034880
4 133.659575
.. ...
437 180.323365
438 135.798908
439 139.855630
440 182.645829
441 83.564413
[442 rows x 1 columns]
Once a pipeline is defined and the X and y nodes are identified, skrub is able to split the dataset and perform cross-validation.
Improving the confidence in our score through cross-validation#
We can increase our confidence in our score by using cross-validation instead of a single split. The same mechanism is used, but we now fit and evaluate the model on several splits. This is done with .skb.cross_validate().
>>> pred.skb.cross_validate()
fit_time score_time test_score
0 0.002816 0.001344 0.321665
1 0.002685 0.001323 0.440485
2 0.002468 0.001308 0.422104
3 0.002748 0.001321 0.424661
4 0.002649 0.001309 0.441961
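We can also customize the evaluation. Assuming that, like scikit-learn's sklearn.model_selection.cross_validate(), extra keyword arguments such as cv and scoring are forwarded (this is an assumption; check the API reference), we could for instance use 3 folds:
>>> # Assumes cv and scoring are forwarded to scikit-learn's cross_validate.
>>> scores = pred.skb.cross_validate(cv=3, scoring="r2")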
Splitting the data in train and test sets#
We can use .skb.train_test_split() to perform a single train-test split. skrub first evaluates the DataOps on which we used .skb.mark_as_X() and .skb.mark_as_y(): the first few steps of the pipeline are executed until we have a value for X and for y. Then, those dataframes are split using the provided split function (by default scikit-learn's sklearn.model_selection.train_test_split()).
>>> split = pred.skb.train_test_split(shuffle=False)
>>> split.keys()
dict_keys(['train', 'test', 'X_train', 'X_test', 'y_train', 'y_test'])
train and test are the full dictionaries corresponding to the training and testing data. The corresponding X and y are the values, in those dictionaries, for the nodes marked with .skb.mark_as_X() and .skb.mark_as_y().
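We can check the sizes of the resulting splits. Assuming the scikit-learn default test_size of 0.25 applies, 331 of the 442 rows should end up in the training part:
>>> split["X_train"].shape  # 442 rows minus ceil(442 * 0.25) = 111 test rows
(331, 10)
>>> split["y_train"].shape
(331,)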
We can now fit our pipeline on the training data:
>>> learner = pred.skb.make_learner()
>>> learner.fit(split["train"])
SkrubLearner(data_op=<Apply Ridge>)
Only the training parts of X and y are used. The subsequent steps are evaluated using this data to fit the rest of the pipeline.
And we can obtain predictions on the test part:
>>> test_pred = learner.predict(split["test"])
>>> test_y_true = split["y_test"]
>>> from sklearn.metrics import r2_score
>>> r2_score(test_y_true, test_pred)
0.440999149220359
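Finally, the fitted learner can make predictions on completely new, unlabelled data: we pass a dictionary binding each skrub variable name to its value. This is where errors="ignore" pays off, as a dataframe without the “target” column is accepted. A small sketch, reusing a few rows of the original features as stand-in new data:
>>> new_data = diabetes_df.drop(columns="target").head()  # no "target" column
>>> predictions = learner.predict({"data": new_data})  # bind the "data" variable
>>> len(predictions)
5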
It is possible to define a custom split function to use instead of sklearn.model_selection.train_test_split().
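For example, with time-ordered data we may want the most recent rows in the test set rather than a random subset. A minimal sketch, assuming the custom function is passed through a splitter argument and returns (X_train, X_test, y_train, y_test) like scikit-learn's train_test_split() (the parameter name is an assumption; check the API reference):
>>> def temporal_split(X, y, test_size=0.25):
...     """Hypothetical splitter: keep the last rows for testing."""
...     n_test = int(len(X) * test_size)
...     return X.iloc[:-n_test], X.iloc[-n_test:], y.iloc[:-n_test], y.iloc[-n_test:]
>>> split = pred.skb.train_test_split(splitter=temporal_split)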