.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/data_ops/1160_pytorch.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_auto_examples_data_ops_1160_pytorch.py>`
        to download the full example code or to run this example in your browser
        via JupyterLite or Binder.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_data_ops_1160_pytorch.py:

Using PyTorch (via skorch) in DataOps
======================================

This example shows how to wrap a PyTorch model with skorch and plug it into a
skrub DataOps plan.

.. note::

    This example requires the optional dependencies ``torch`` and ``skorch``.

The main goal here is to show the *integration pattern*:

- **PyTorch** defines the model (an ``nn.Module``)
- **skorch** wraps it as a scikit-learn compatible estimator
- **skrub DataOps** builds a plan and can tune skorch (and therefore PyTorch)
  hyperparameters using the skrub choices.

.. GENERATED FROM PYTHON SOURCE LINES 20-27

Loading the data
=================

We use the digits dataset because it is small and ships with scikit-learn.
Each sample is an 8x8 grayscale image of a handwritten digit (0 through 9),
encoded as 64 pixel intensity values.

.. GENERATED FROM PYTHON SOURCE LINES 27-34

.. code-block:: Python

    from sklearn.datasets import load_digits

    digits = load_digits()
    X, y = digits.data, digits.target
    print(f"Dataset shape: {X.shape}")
    print(f"Number of classes: {len(set(y))}")

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    Dataset shape: (1797, 64)
    Number of classes: 10
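To make the samples concrete, here is a minimal sketch (not part of the
original plan; it assumes ``matplotlib`` is installed) that renders one digit.
``digits.images`` holds the same data as ``X``, but still shaped as 8x8
arrays.

.. code-block:: Python

    import matplotlib.pyplot as plt

    # Render the first sample as an 8x8 image and show its label.
    plt.imshow(digits.images[0], cmap="gray")
    plt.title(f"Label: {digits.target[0]}")
    plt.show()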
.. GENERATED FROM PYTHON SOURCE LINES 35-39

Start of the DataOps plan
==========================

We start the DataOps plan by creating the skrub variables X and y.

.. GENERATED FROM PYTHON SOURCE LINES 39-44

.. code-block:: Python

    import skrub

    X = skrub.X(X)
    y = skrub.y(y)

.. GENERATED FROM PYTHON SOURCE LINES 45-65

Data preprocessing
==================

We normalize the pixel values to [0, 1] by first computing the global max
value and then dividing the pixel values by it. Importantly, we freeze the max
value (the scaling factor) after fitting, so that the same rescaling is
applied later when we use our DataOp for prediction on new (test) data.

A convolutional network expects images with shape (N, C, H, W) where:

- N: number of samples
- C: number of color channels (1 for grayscale)
- H, W: image height and width

So we reshape the images to (N, 1, 8, 8) for the CNN. The -1 in the
``reshape`` call means the first dimension (N) is inferred automatically from
the array size.

The advantage of using DataOps is that the preprocessing steps are tracked in
the plan and will be automatically applied during prediction.

.. GENERATED FROM PYTHON SOURCE LINES 65-72

.. code-block:: Python

    max_value = X.max().skb.freeze_after_fit()
    X_scaled = X / max_value
    X_reshaped = X_scaled.reshape(-1, 1, 8, 8).astype("float32")

    X_reshaped.skb.draw_graph()

The rendered graph shows the preprocessing plan: the ``max`` of ``Var 'X'`` is
frozen after fit (``FreezeAfterFit``), used in the ``truediv``, then followed
by ``reshape`` and ``astype``.
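As a quick sanity check, we can evaluate the plan on the data bound to the
variables. This is a sketch: it assumes your skrub version exposes
``.skb.eval()``, which evaluates a DataOp on the variables' initial values.

.. code-block:: Python

    # Evaluate the preprocessing plan and confirm shape and value range.
    preview = X_reshaped.skb.eval()
    print(preview.shape)                 # expected: (1797, 1, 8, 8)
    print(preview.min(), preview.max())  # expected: 0.0 1.0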
.. GENERATED FROM PYTHON SOURCE LINES 73-86

Building a NN Classifier
=========================

We'll build a tiny CNN using PyTorch and wrap it with skorch to make it
scikit-learn compatible. The architecture stacks two small convolutions, a
single 2x2 pooling stage, and a small MLP head. The architectural choices
below are meant to be:

- **standard**: 3x3 convolutions and 2x2 max-pooling are very common
- **small**: the dataset and images are tiny, so we keep the model tiny too

If you want more background on CNN building blocks and how
convolution/pooling changes tensor shapes, see the CS231n notes:
https://cs231n.github.io/convolutional-networks/

.. GENERATED FROM PYTHON SOURCE LINES 86-120

.. code-block:: Python

    import torch.nn as nn
    import torch.nn.functional as F
    import torch.optim as optim


    class TinyCNN(nn.Module):
        def __init__(self, conv_channels: int = 8, hidden_units: int = 32):
            super().__init__()
            self.conv_channels = conv_channels
            self.hidden_units = hidden_units

            # Two 3x3 convolutions followed by 2x2 max-pooling
            self.conv1 = nn.Conv2d(
                in_channels=1, out_channels=conv_channels, kernel_size=3, padding=1
            )
            self.conv2 = nn.Conv2d(conv_channels, conv_channels, kernel_size=3, padding=1)
            self.pool = nn.MaxPool2d(kernel_size=2)

            # input shape = (8,8) -> conv1: (8,8) -> conv2: (8,8) -> pool: (4,4)
            image_shape_after_conv = 4 * 4

            # MLP head
            self.fc1 = nn.Linear(conv_channels * image_shape_after_conv, hidden_units)
            self.dropout = nn.Dropout(p=0.25)  # Regularization to avoid overfitting
            self.fc2 = nn.Linear(hidden_units, 10)  # 10 digit classes (0..9)

        def forward(self, x):
            x = F.relu(self.conv1(x))
            x = self.pool(F.relu(self.conv2(x)))
            x = x.flatten(start_dim=1)
            x = self.dropout(F.relu(self.fc1(x)))
            return self.fc2(x)
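Before wiring the model into the plan, a quick sketch (not part of the
original example) confirms the shape bookkeeping behind
``image_shape_after_conv``: an (N, 1, 8, 8) batch should come out as (N, 10)
logits.

.. code-block:: Python

    import torch

    model = TinyCNN()
    dummy = torch.randn(2, 1, 8, 8)  # a fake batch of 2 grayscale 8x8 images
    print(model(dummy).shape)        # expected: torch.Size([2, 10])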
.. GENERATED FROM PYTHON SOURCE LINES 121-128

Skorch provides scikit-learn compatible wrappers around torch training loops.
That makes the torch model usable by skrub DataOps (and scikit-learn tools in
general). We use :func:`skrub.choose_from()` to define the hyperparameters
that the DataOps grid search will tune: ``conv_channels``, ``hidden_units``,
and ``max_epochs``. The other parameters are set to common choices for this
task and training data size.

.. GENERATED FROM PYTHON SOURCE LINES 128-148

.. code-block:: Python

    from skorch import NeuralNetClassifier

    device = "cpu"  # use "cuda" or "mps" if available

    net = NeuralNetClassifier(
        module=TinyCNN,
        # These choices are intentionally small so the example runs quickly.
        module__conv_channels=skrub.choose_from([8, 16], name="conv_channels"),
        module__hidden_units=skrub.choose_from([8, 16, 32], name="hidden_units"),
        max_epochs=skrub.choose_from([10, 15], name="max_epochs"),
        optimizer__lr=0.01,
        optimizer=optim.Adam,
        criterion=nn.CrossEntropyLoss,
        device=device,
        train_split=None,  # We'll use skrub's grid search for validation
        verbose=0,
    )

.. GENERATED FROM PYTHON SOURCE LINES 149-155

Tuning the model's hyperparameters with DataOps
===============================================

We integrate the model into the DataOps plan. First, we convert the target
labels to integers for the loss computation; then we apply the model to the
preprocessed ``X`` and ``y``.

.. GENERATED FROM PYTHON SOURCE LINES 155-160

.. code-block:: Python

    y_int = y.astype("int64")
    predictor = X_reshaped.skb.apply(net, y=y_int)

    predictor.skb.draw_graph()

The rendered graph now shows the full plan, ending with
``Apply NeuralNetClassifier`` applied to the preprocessed inputs, with
``Var 'y' -> astype`` feeding the target.
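Before launching the search, it can help to see the grid spanned by the
``choose_from`` objects. A sketch, assuming your skrub version provides
``describe_param_grid``:

.. code-block:: Python

    # Print the hyperparameter grid implied by the choices in the plan.
    print(predictor.skb.describe_param_grid())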
.. GENERATED FROM PYTHON SOURCE LINES 161-163

Finally, we use 4-fold cross-validation for the hyperparameter tuning of our
DataOps plan.

.. GENERATED FROM PYTHON SOURCE LINES 163-175

.. code-block:: Python

    from sklearn.model_selection import KFold

    cv = KFold(n_splits=4, shuffle=True, random_state=42)

    search = predictor.skb.make_grid_search(
        cv=cv,
        fitted=True,
        n_jobs=-1,
    )
    print("\nSearch results:")
    print(search.results_.to_string(index=False))

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    Search results:
     max_epochs  conv_channels  hidden_units  mean_test_score
             15             16            32         0.978295
             10             16            32         0.969949
             15              8            32         0.969943
             15              8            16         0.968275
             15             16            16         0.967165
             10              8            32         0.959371
             10              8            16         0.934885
             10             16            16         0.928750
             15             16             8         0.878707
             15              8             8         0.853070
             10             16             8         0.762852
             10              8             8         0.692296

.. GENERATED FROM PYTHON SOURCE LINES 176-179

Let's take a closer look at the well-performing models with a parallel
coordinates plot. We filter to models with score >= 0.94 to focus on the
top-performing configurations.

.. GENERATED FROM PYTHON SOURCE LINES 179-183

.. code-block:: Python

    fig = search.plot_results(min_score=0.94)
    fig
*(The interactive parallel coordinates plot is rendered in the HTML version
of this example.)*
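Once the search is fitted, we typically want to use the winning configuration
for inference. A sketch, assuming the fitted search exposes the best fitted
learner as ``best_learner_`` (check the skrub API of your version):

.. code-block:: Python

    # The learner re-applies the whole plan (scaling, reshape, CNN) to new
    # data; variables are passed in an environment dict keyed by their names.
    best = search.best_learner_
    X_new = digits.data[:10]  # stand-in for unseen test images
    print(best.predict({"X": X_new}))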
.. GENERATED FROM PYTHON SOURCE LINES 184-196

Interpreting the results
========================

Looking at the search results, we can observe several patterns:

- **Model capacity matters**: Larger configurations with ``conv_channels=16``
  and ``hidden_units=32`` tend to perform best. Smaller models with
  ``conv_channels=8`` and/or ``hidden_units=8`` perform significantly worse,
  indicating that the task benefits from increased model capacity.
- **More epochs generally help**: Configurations with ``max_epochs=15`` tend
  to perform slightly better than those with ``max_epochs=10``, though the
  gains are modest compared to architectural changes.

.. GENERATED FROM PYTHON SOURCE LINES 198-219

Conclusion
==========

In this example, we've shown how to use **PyTorch** and **skorch** within
skrub DataOps. The key steps were:

1. Define a PyTorch ``nn.Module`` (our ``TinyCNN``)
2. Wrap it with skorch's ``NeuralNetClassifier`` to make it scikit-learn
   compatible
3. Use :func:`skrub.choose_from()` to specify hyperparameters for tuning
4. Integrate it into a DataOps plan and use grid search to find the best
   configuration

This pattern lets you leverage PyTorch's flexibility for model definition
while benefiting from skrub's hyperparameter tuning and data preprocessing
capabilities.

.. seealso::

    * :ref:`example_tuning_pipelines`: Learn more about using
      ``skrub.choose_from()`` and other choice objects to tune hyperparameters
      in DataOps plans.
    * :ref:`example_optuna_choices`: Discover how to use Optuna as a backend
      for more sophisticated hyperparameter search strategies with skrub
      DataOps.

.. rst-class:: sphx-glr-timing

**Total running time of the script:** (0 minutes 20.085 seconds)

**Estimated memory usage:** 538 MB

.. _sphx_glr_download_auto_examples_data_ops_1160_pytorch.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: binder-badge

      .. image:: images/binder_badge_logo.svg
        :target: https://mybinder.org/v2/gh/skrub-data/skrub/0.7.2?urlpath=lab/tree/notebooks/auto_examples/data_ops/1160_pytorch.ipynb
        :alt: Launch binder
        :width: 150 px

    .. container:: lite-badge

      .. image:: images/jupyterlite_badge_logo.svg
        :target: ../../lite/lab/index.html?path=auto_examples/data_ops/1160_pytorch.ipynb
        :alt: Launch JupyterLite
        :width: 150 px

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: 1160_pytorch.ipynb <1160_pytorch.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: 1160_pytorch.py <1160_pytorch.py>`

    .. container:: sphx-glr-download sphx-glr-download-zip

      :download:`Download zipped: 1160_pytorch.zip <1160_pytorch.zip>`

.. only:: html

  .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_