.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/expressions/11_choices.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_auto_examples_expressions_11_choices.py>`
        to download the full example code, or to run this example in your
        browser via JupyterLite or Binder.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_expressions_11_choices.py:

.. _example_tuning_pipelines:

Tuning pipelines
================

A machine-learning pipeline typically contains values or choices that may
influence its prediction performance, such as hyperparameters (e.g. the
regularization parameter ``alpha`` of a ``RidgeClassifier``, the
``learning_rate`` of a ``HistGradientBoostingClassifier``), which estimator to
use (e.g. ``RidgeClassifier`` or ``HistGradientBoostingClassifier``), or which
steps to include (e.g. should we join a table to bring in additional
information or not?).

We want to tune those choices by trying several options and keeping those that
give the best performance on a validation set.

Skrub :ref:`expressions ` provide a convenient way to specify the range of
possible values, by inserting it directly in place of the actual value. For
example, we can write
``RidgeClassifier(alpha=skrub.choose_from([0.1, 1.0, 10.0], name='α'))``
instead of ``RidgeClassifier(alpha=1.0)``.

Skrub then inspects our pipeline to discover all the places where we used
objects like ``skrub.choose_from()`` and builds a grid of hyperparameters for
us.

.. GENERATED FROM PYTHON SOURCE LINES 34-39

We will illustrate hyperparameter tuning on the "toxicity" dataset. This
dataset contains 1,000 texts and the task is to predict whether each text is
flagged as toxic or not.

We start from a very simple pipeline without any hyperparameters.

.. GENERATED FROM PYTHON SOURCE LINES 41-54

.. code-block:: Python

    from sklearn.ensemble import HistGradientBoostingClassifier

    import skrub
    import skrub.datasets

    data = skrub.datasets.fetch_toxicity().toxicity

    # This dataset is sorted -- all toxic tweets appear first, so we shuffle it
    data = data.sample(frac=1.0, random_state=1)

    texts = data[["text"]]
    labels = data["is_toxic"]

.. GENERATED FROM PYTHON SOURCE LINES 55-60

We mark the ``texts`` column as the input and the ``labels`` column as the
target. See `the previous example <10_expressions.html>`_ for a more detailed
explanation of ``skrub.X`` and ``skrub.y``. We then encode the text with a
``MinHashEncoder`` and fit a ``HistGradientBoostingClassifier`` on the
resulting features.

.. GENERATED FROM PYTHON SOURCE LINES 62-65

.. code-block:: Python

    X = skrub.X(texts)
    X

.. raw:: html
    <p>&lt;Var 'X'&gt; -- interactive skrub table report omitted (requires JavaScript to render)</p>
.. GENERATED FROM PYTHON SOURCE LINES 66-69

.. code-block:: Python

    y = skrub.y(labels)
    y

.. raw:: html
    <p>&lt;Var 'y'&gt; -- interactive skrub table report omitted (requires JavaScript to render)</p>
.. GENERATED FROM PYTHON SOURCE LINES 70-76

.. code-block:: Python

    pred = X.skb.apply(skrub.MinHashEncoder()).skb.apply(
        HistGradientBoostingClassifier(), y=y
    )

    pred.skb.cross_validate(n_jobs=4)["test_score"]

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    0    0.635
    1    0.590
    2    0.645
    3    0.595
    4    0.585
    Name: test_score, dtype: float64

.. GENERATED FROM PYTHON SOURCE LINES 77-103

For the sake of the example, we will focus on the number of ``MinHashEncoder``
components and the ``learning_rate`` of the ``HistGradientBoostingClassifier``
to illustrate the ``skrub.choose_from(...)`` objects.

When we use a scikit-learn hyperparameter tuner like ``GridSearchCV`` or
``RandomizedSearchCV``, we need to specify a grid of hyperparameters
separately from the estimator, with something like
``GridSearchCV(my_pipeline, param_grid={"encoder__n_components": [5, 10, 20]})``.
Instead, with skrub we can use ``skrub.choose_from(...)`` directly where the
actual value would normally go. Skrub then takes care of constructing the
``GridSearchCV``'s parameter grid for us. (For comparison, a hand-written
sketch of the scikit-learn approach is shown after the search results below.)

Several utilities are available:

- ``choose_from`` to choose from a discrete set of values
- ``choose_float`` and ``choose_int`` to sample numbers in a given range
- ``choose_bool`` to choose between ``True`` and ``False``
- ``optional`` to choose between something and ``None``, typically to make a
  transformation step optional, as in
  ``X.skb.apply(skrub.optional(StandardScaler()))``

Choices can be given a name, which is used to display hyperparameter search
results and plots, or to override their outcome. The name is optional.

Note that ``skrub.choose_float()`` and ``skrub.choose_int()`` accept a ``log``
argument to sample on a logarithmic scale.

.. GENERATED FROM PYTHON SOURCE LINES 105-115

.. code-block:: Python

    X, y = skrub.X(texts), skrub.y(labels)

    encoder = skrub.MinHashEncoder(
        n_components=skrub.choose_int(5, 15, name="N components")
    )
    classifier = HistGradientBoostingClassifier(
        learning_rate=skrub.choose_float(0.01, 0.9, log=True, name="lr")
    )
    pred = X.skb.apply(encoder).skb.apply(classifier, y=y)

.. GENERATED FROM PYTHON SOURCE LINES 116-122

We can then obtain an estimator that performs the hyperparameter search with
``.skb.get_grid_search()`` or ``.skb.get_randomized_search()``. They accept
the same arguments as their scikit-learn counterparts (e.g. ``scoring`` and
``n_jobs``). Also, like ``.skb.get_pipeline()``, they accept a ``fitted``
argument: if it is ``True``, the search is fitted on the data we provided when
initializing our pipeline's variables.

.. GENERATED FROM PYTHON SOURCE LINES 122-126

.. code-block:: Python

    search = pred.skb.get_randomized_search(
        n_iter=8, n_jobs=4, random_state=1, fitted=True
    )
    search.results_
.. rst-class:: sphx-glr-script-out

 .. code-block:: none

       mean_test_score  N components        lr
    0            0.573            11  0.218316
    1            0.568            11  0.255675
    2            0.566            10  0.112974
    3            0.551             5  0.204294
    4            0.545             5  0.038979
    5            0.545             7  0.015151
    6            0.541             7  0.047349
    7            0.535             8  0.520061
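As mentioned above, scikit-learn alone requires writing the parameter grid by
hand. A rough hand-written counterpart of this search could look like the
sketch below. This sketch is not part of the original example: the step names
``encoder`` and ``classifier`` are our own choices, and ``MinHashEncoder`` is
used directly as a pipeline step (it is a scikit-learn-compatible
transformer).

.. code-block:: Python

    # Hypothetical hand-written equivalent of the skrub search above (sketch).
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import Pipeline

    my_pipeline = Pipeline(
        [
            ("encoder", skrub.MinHashEncoder()),
            ("classifier", HistGradientBoostingClassifier()),
        ]
    )
    manual_search = GridSearchCV(
        my_pipeline,
        param_grid={
            # the grid must reference the step names by hand
            "encoder__n_components": [5, 10, 15],
            "classifier__learning_rate": [0.01, 0.1, 0.9],
        },
    )
    # manual_search.fit(texts, labels)

With skrub, both the step names and the parameter grid are derived
automatically from the ``choose_...`` objects embedded in the expression.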
.. GENERATED FROM PYTHON SOURCE LINES 127-140

If the plotly library is installed, we can visualize the results of the
hyperparameter search with ``.plot_results()``. In the plot below, each line
represents a combination of hyperparameters (in this case, only
``N components`` and ``lr``), and each column of points represents either a
hyperparameter or the score of a given combination. The color of a line
encodes the score of its combination of hyperparameters.

The plot is interactive: by dragging the mouse vertically over a column, we
can restrict the display to combinations whose values fall in the selected
range. This is particularly useful when there are many combinations of
hyperparameters and we want to understand which ones have the largest impact
on the score.

.. GENERATED FROM PYTHON SOURCE LINES 142-144

.. code-block:: Python

    search.plot_results()

.. raw:: html
    <p>(interactive parallel-coordinates plot of the search results omitted)</p>
.. GENERATED FROM PYTHON SOURCE LINES 145-158

Choices can appear in many places
---------------------------------

Choices are not limited to selecting estimator hyperparameters. They can also
be used to choose between different estimators, or in place of any value used
in our pipeline.

For example, here we pass a choice to the pandas DataFrame's ``assign``
method. We want to add a feature that captures the length of the text, but we
are not sure whether it is better to count length in characters or in words.
We do not want to add both, as that would be redundant. Instead, we add a
single column whose content is chosen between the length in words and the
length in characters:

.. GENERATED FROM PYTHON SOURCE LINES 160-169

.. code-block:: Python

    X, y = skrub.X(texts), skrub.y(labels)

    X.assign(
        length=skrub.choose_from(
            {"words": X["text"].str.count(r"\b\w+\b"), "chars": X["text"].str.len()},
            name="length",
        )
    )

.. raw:: html
    <p>&lt;CallMethod 'assign'&gt; -- interactive skrub table report omitted (requires JavaScript to render)</p>
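As a side note, the two candidate features are plain pandas expressions. A
minimal standalone check (not part of the original example) shows what each
one computes:

.. code-block:: Python

    import pandas as pd

    s = pd.Series(["one two three", "hello"])
    print(s.str.count(r"\b\w+\b"))  # number of words: 3 and 1
    print(s.str.len())  # number of characters: 13 and 5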
.. GENERATED FROM PYTHON SOURCE LINES 170-179

``choose_from`` can be given a dictionary if we want to provide names for the
individual outcomes, or a list when names are not needed:
``choose_from([1, 100], name='N')``,
``choose_from({'small': 1, 'big': 100}, name='N')``.

Choices can be nested arbitrarily. For example, here we want to choose between
2 possible encoder types: the ``MinHashEncoder`` or the ``StringEncoder``.
Each of the possible outcomes contains a choice itself: the number of
components.

.. GENERATED FROM PYTHON SOURCE LINES 181-192

.. code-block:: Python

    n_components = skrub.choose_int(5, 15, name="N components")

    encoder = skrub.choose_from(
        {
            "minhash": skrub.MinHashEncoder(n_components=n_components),
            "lse": skrub.StringEncoder(n_components=n_components),
        },
        name="encoder",
    )

    X.skb.apply(encoder, cols="text")

.. raw:: html
    <p>&lt;Apply MinHashEncoder&gt; -- interactive skrub table report omitted (requires JavaScript to render)</p>
.. GENERATED FROM PYTHON SOURCE LINES 193-197

In a similar vein, we might want to choose between an HGB classifier and a
Ridge classifier, each with its own set of hyperparameters. We can then define
a choice for the classifier and a choice for the hyperparameters of each
classifier.

.. GENERATED FROM PYTHON SOURCE LINES 199-209

.. code-block:: Python

    from sklearn.linear_model import RidgeClassifier

    hgb = HistGradientBoostingClassifier(
        learning_rate=skrub.choose_float(0.01, 0.9, log=True, name="lr")
    )
    ridge = RidgeClassifier(alpha=skrub.choose_float(0.01, 100, log=True, name="α"))
    classifier = skrub.choose_from({"hgb": hgb, "ridge": ridge}, name="classifier")
    pred = X.skb.apply(encoder).skb.apply(classifier, y=y)
    print(pred.skb.describe_param_grid())

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    - encoder: 'minhash'
      N components: choose_int(5, 15, name='N components')
      classifier: 'hgb'
      lr: choose_float(0.01, 0.9, log=True, name='lr')
    - encoder: 'minhash'
      N components: choose_int(5, 15, name='N components')
      classifier: 'ridge'
      α: choose_float(0.01, 100, log=True, name='α')
    - encoder: 'lse'
      N components: choose_int(5, 15, name='N components')
      classifier: 'hgb'
      lr: choose_float(0.01, 0.9, log=True, name='lr')
    - encoder: 'lse'
      N components: choose_int(5, 15, name='N components')
      classifier: 'ridge'
      α: choose_float(0.01, 100, log=True, name='α')

.. GENERATED FROM PYTHON SOURCE LINES 210-215

.. code-block:: Python

    search = pred.skb.get_randomized_search(
        n_iter=16, n_jobs=4, random_state=1, fitted=True
    )
    search.plot_results()

.. raw:: html
    <p>(interactive parallel-coordinates plot of the search results omitted)</p>
.. GENERATED FROM PYTHON SOURCE LINES 216-222

Now that we have a more complex pipeline, we can draw more conclusions from
the parallel-coordinates plot. For example, we can see that the
``HistGradientBoostingClassifier`` performs better than the
``RidgeClassifier`` in most cases, and that the ``StringEncoder`` outperforms
the ``MinHashEncoder``.

.. GENERATED FROM PYTHON SOURCE LINES 224-239

Advanced usage
--------------

This section shows some more advanced, or less frequently needed, use cases.

Choices can depend on each other
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Sometimes not all combinations (the full cross-product) of hyperparameter
values make sense; instead, choices may be linked. For example, our downstream
estimator can be a ``RidgeClassifier`` or a
``HistGradientBoostingClassifier``, and standard scaling should be applied
only when it is the ``Ridge``. Skrub choices have a ``match`` method to obtain
different results depending on the outcome of the choice.

.. GENERATED FROM PYTHON SOURCE LINES 241-258

.. code-block:: Python

    from sklearn.preprocessing import StandardScaler

    X, y = skrub.X(texts), skrub.y(labels)

    vectorized_X = X.skb.apply(skrub.MinHashEncoder())

    estimator_kind = skrub.choose_from(["ridge", "HGB"], name="estimator kind")

    scaling = estimator_kind.match({"ridge": StandardScaler(), "HGB": "passthrough"})
    scaled_X = vectorized_X.skb.apply(scaling)

    classifier = estimator_kind.match(
        {"ridge": RidgeClassifier(), "HGB": HistGradientBoostingClassifier()}
    )
    pred = scaled_X.skb.apply(classifier, y=y)

    print(pred.skb.describe_param_grid())

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    - estimator kind: ['ridge', 'HGB']

.. GENERATED FROM PYTHON SOURCE LINES 259-265

Here we can see that there is only one parameter: the estimator kind. When it
is ``"ridge"``, the ``StandardScaler`` and the ``RidgeClassifier`` are used;
when it is ``"HGB"``, ``"passthrough"`` and the
``HistGradientBoostingClassifier`` are used.

Similarly, objects returned by ``choose_bool`` have an ``if_else()`` method.

.. GENERATED FROM PYTHON SOURCE LINES 267-276

Choices can be turned into expressions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

We can turn a choice (or the result of a choice's ``match()`` or
``if_else()``) into an expression, so that we can keep chaining more
operations onto it. Here, we create a ``skrub.choose_bool()`` object to choose
whether to add the length of the text as a feature or not. Then ``if_else()``
will assign the length of the text to a new column ``length`` if the choice is
``True``, or do nothing if the choice is ``False``.

.. GENERATED FROM PYTHON SOURCE LINES 278-289

.. code-block:: Python

    X, y = skrub.X(texts), skrub.y(labels)

    add_length = skrub.choose_bool(name="add_length")
    with_length = add_length.if_else(X.assign(length=X["text"].str.len()), X).as_expr()
    vectorized_X = with_length.skb.apply(
        skrub.MinHashEncoder(n_components=2), cols="text"
    )

    # Note: we can manually set the outcome of a choice when evaluating an
    # expression (or fitting an estimator)
    vectorized_X.skb.eval({"add_length": False})
.. rst-class:: sphx-glr-script-out

 .. code-block:: none

                text_0        text_1
    507  -2.135328e+09 -2.092208e+09
    818  -2.108546e+09 -2.130021e+09
    452  -2.135784e+09 -2.134704e+09
    368  -2.135784e+09 -2.143365e+09
    242  -2.141841e+09 -2.140976e+09
    ..             ...           ...
    767  -2.135328e+09 -2.137638e+09
    72   -2.126502e+09 -2.134934e+09
    908  -2.100929e+09 -2.066336e+09
    235  -2.121470e+09 -2.140573e+09
    37   -2.127834e+09 -2.143365e+09

    [1000 rows x 2 columns]
.. GENERATED FROM PYTHON SOURCE LINES 290-292

.. code-block:: Python

    vectorized_X.skb.eval({"add_length": True})
.. rst-class:: sphx-glr-script-out

 .. code-block:: none

                text_0        text_1  length
    507  -2.135328e+09 -2.092208e+09      87
    818  -2.108546e+09 -2.130021e+09      35
    452  -2.135784e+09 -2.134704e+09     145
    368  -2.135784e+09 -2.143365e+09     324
    242  -2.141841e+09 -2.140976e+09     239
    ..             ...           ...     ...
    767  -2.135328e+09 -2.137638e+09     107
    72   -2.126502e+09 -2.134934e+09      85
    908  -2.100929e+09 -2.066336e+09      44
    235  -2.121470e+09 -2.140573e+09      70
    37   -2.127834e+09 -2.143365e+09     108

    [1000 rows x 3 columns]
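The same pattern extends beyond booleans: a named ``choose_from`` combined
with ``match`` and ``as_expr()`` gives an n-way version of the optional
feature. The sketch below uses only constructs shown in this example, but the
``"length kind"`` choice and its outcomes are our own, hypothetical additions:

.. code-block:: Python

    # Hypothetical three-way variant of the optional "length" feature.
    length_kind = skrub.choose_from(["none", "chars", "words"], name="length kind")
    with_length = length_kind.match(
        {
            "none": X,
            "chars": X.assign(length=X["text"].str.len()),
            "words": X.assign(length=X["text"].str.count(r"\b\w+\b")),
        }
    ).as_expr()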
.. GENERATED FROM PYTHON SOURCE LINES 293-300

Arbitrary logic depending on a choice
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When ``match`` or ``if_else`` are not enough and we need to apply arbitrary,
eager logic based on a choice, we can resort to ``skrub.deferred``. For
example, the choice of adding the text length or not could also have been
written as:

.. GENERATED FROM PYTHON SOURCE LINES 302-318

.. code-block:: Python

    X, y = skrub.X(texts), skrub.y(labels)


    @skrub.deferred
    def extract_features(df, add_length):
        if add_length:
            return df.assign(length=df["text"].str.len())
        return df


    feat = extract_features(X, skrub.choose_bool(name="add_length")).skb.apply(
        skrub.MinHashEncoder(n_components=2), cols="text"
    )

    feat.skb.eval({"add_length": False})
.. rst-class:: sphx-glr-script-out

 .. code-block:: none

                text_0        text_1
    507  -2.135328e+09 -2.092208e+09
    818  -2.108546e+09 -2.130021e+09
    452  -2.135784e+09 -2.134704e+09
    368  -2.135784e+09 -2.143365e+09
    242  -2.141841e+09 -2.140976e+09
    ..             ...           ...
    767  -2.135328e+09 -2.137638e+09
    72   -2.126502e+09 -2.134934e+09
    908  -2.100929e+09 -2.066336e+09
    235  -2.121470e+09 -2.140573e+09
    37   -2.127834e+09 -2.143365e+09

    [1000 rows x 2 columns]
.. GENERATED FROM PYTHON SOURCE LINES 319-321

.. code-block:: Python

    feat.skb.eval({"add_length": True})
.. rst-class:: sphx-glr-script-out

 .. code-block:: none

                text_0        text_1  length
    507  -2.135328e+09 -2.092208e+09      87
    818  -2.108546e+09 -2.130021e+09      35
    452  -2.135784e+09 -2.134704e+09     145
    368  -2.135784e+09 -2.143365e+09     324
    242  -2.141841e+09 -2.140976e+09     239
    ..             ...           ...     ...
    767  -2.135328e+09 -2.137638e+09     107
    72   -2.126502e+09 -2.134934e+09      85
    908  -2.100929e+09 -2.066336e+09      44
    235  -2.121470e+09 -2.140573e+09      70
    37   -2.127834e+09 -2.143365e+09     108

    [1000 rows x 3 columns]
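Since skrub discovers ``choose_...`` objects anywhere in the expression, the
``add_length`` choice passed to the deferred function also takes part in
hyperparameter search. As a quick check (a sketch; the output is not
reproduced here), we could extend ``feat`` with a classifier and print the
resulting grid:

.. code-block:: Python

    # Hypothetical check: the boolean choice should appear in the grid.
    pred = feat.skb.apply(HistGradientBoostingClassifier(), y=y)
    print(pred.skb.describe_param_grid())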
.. GENERATED FROM PYTHON SOURCE LINES 322-329

To conclude, we have seen how to use skrub's ``choose_from`` objects to tune
hyperparameters, choose optional configurations, add features, and nest
choices. We then looked at how the different choices affect the pipeline and
the prediction scores.

Thanks to the ``choose_from`` objects, skrub expressions ease the process of
hyperparameter tuning.

.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 47.509 seconds)

.. _sphx_glr_download_auto_examples_expressions_11_choices.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: binder-badge

      .. image:: images/binder_badge_logo.svg
        :target: https://mybinder.org/v2/gh/skrub-data/skrub/main?urlpath=lab/tree/notebooks/auto_examples/expressions/11_choices.ipynb
        :alt: Launch binder
        :width: 150 px

    .. container:: lite-badge

      .. image:: images/jupyterlite_badge_logo.svg
        :target: ../../lite/lab/index.html?path=auto_examples/expressions/11_choices.ipynb
        :alt: Launch JupyterLite
        :width: 150 px

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: 11_choices.ipynb <11_choices.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: 11_choices.py <11_choices.py>`

    .. container:: sphx-glr-download sphx-glr-download-zip

      :download:`Download zipped: 11_choices.zip <11_choices.zip>`

.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_