.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_tutorials/0000_getting_started.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_auto_tutorials_0000_getting_started.py>`
        to download the full example code or to run this example in your
        browser via JupyterLite or Binder.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_tutorials_0000_getting_started.py:

Getting Started with skrub
==========================

This guide showcases some of the features of skrub.

Much of skrub revolves around simplifying many of the tasks that are involved
in pre-processing raw data into a format that shallow or classic
machine-learning models can understand, that is, numerical data.

Skrub achieves this by vectorizing, assembling, and encoding tabular data
through the features we present in this example and the following ones.

.. |TableReport| replace:: :class:`~skrub.TableReport`
.. |Cleaner| replace:: :class:`~skrub.Cleaner`
.. |set_config| replace:: :func:`~skrub.set_config`
.. |tabular_pipeline| replace:: :func:`~skrub.tabular_pipeline`
.. |TableVectorizer| replace:: :class:`~skrub.TableVectorizer`
.. |Joiner| replace:: :class:`~skrub.Joiner`
.. |SquashingScaler| replace:: :class:`~skrub.SquashingScaler`
.. |DatetimeEncoder| replace:: :class:`~skrub.DatetimeEncoder`
.. |ApplyToCols| replace:: :class:`~skrub.ApplyToCols`
.. |StringEncoder| replace:: :class:`~skrub.StringEncoder`
.. |TextEncoder| replace:: :class:`~skrub.TextEncoder`

.. GENERATED FROM PYTHON SOURCE LINES 27-32

Preliminary exploration with the |TableReport|
----------------------------------------------

We start by loading the "employee salaries" dataset. Skrub dataset fetching
functions return a Bunch object, which contains the paths to the data files.
We can load the data into a dataframe using pandas.

.. GENERATED FROM PYTHON SOURCE LINES 32-40

.. code-block:: Python

    import pandas as pd

    from skrub.datasets import fetch_employee_salaries

    file_path = fetch_employee_salaries().path
    employees_df = pd.read_csv(file_path)

.. GENERATED FROM PYTHON SOURCE LINES 41-43

The target variable is the current annual salary. We pop it from the dataframe
to keep only the features in ``employees_df``.

.. GENERATED FROM PYTHON SOURCE LINES 43-45

.. code-block:: Python

    salaries = employees_df.pop("current_annual_salary")

.. GENERATED FROM PYTHON SOURCE LINES 46-48

Typically, the first step with new data is exploration and parsing. To quickly
get an overview of a dataframe's contents, use the |TableReport|.

.. GENERATED FROM PYTHON SOURCE LINES 50-54

.. code-block:: Python

    from skrub import TableReport

    TableReport(employees_df)

.. interactive table report output not shown here (it requires JavaScript to render)
.. GENERATED FROM PYTHON SOURCE LINES 55-73

You can use the interactive display above to explore the dataset visually.

.. admonition:: Additional examples
    :collapsible: closed

    You can see a few more `example reports`_ online. We also provide an
    experimental online demo_ that allows you to select a CSV or parquet file
    and generate a report directly in your web browser, without installing
    anything.

    .. _example reports: https://skrub-data.org/skrub-reports/examples/
    .. _demo: https://skrub-data.org/skrub-reports/

From the report above, we see that there are columns with date and time stored
as ``object`` dtype (cf. the "Stats" tab of the report). Datatypes not being
parsed correctly is a common occurrence right after reading a table. We can
use the |Cleaner| to address this. In the next section, we show that this
transformer also performs additional cleaning.

.. GENERATED FROM PYTHON SOURCE LINES 75-81

Sanitizing data with the |Cleaner|
----------------------------------

Here, we use the |Cleaner|, a transformer that sanitizes the dataframe by
parsing nulls and dates, and by dropping "uninformative" columns (e.g.,
columns with too many nulls or that are constant).

.. GENERATED FROM PYTHON SOURCE LINES 81-87

.. code-block:: Python

    from skrub import Cleaner

    employees_df = Cleaner().fit_transform(employees_df)
    TableReport(employees_df)

.. interactive table report output not shown here (it requires JavaScript to render)
.. GENERATED FROM PYTHON SOURCE LINES 88-90

We can see from the "Stats" tab that the ``date_first_hired`` column has now
been parsed correctly as a datetime.

.. GENERATED FROM PYTHON SOURCE LINES 92-98

Easily building a strong baseline for tabular machine learning
--------------------------------------------------------------

The goal of skrub is to ease tabular data preparation for machine learning.
The |tabular_pipeline| function provides an easy way to build a simple but
reliable machine-learning model that works well on most tabular data.

.. GENERATED FROM PYTHON SOURCE LINES 101-107

.. code-block:: Python

    from sklearn.model_selection import cross_validate

    from skrub import tabular_pipeline

    model = tabular_pipeline("regressor")
    model

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    Pipeline(steps=[('tablevectorizer',
                     TableVectorizer(low_cardinality=ToCategorical())),
                    ('histgradientboostingregressor',
                     HistGradientBoostingRegressor())])
.. GENERATED FROM PYTHON SOURCE LINES 108-111

.. code-block:: Python

    results = cross_validate(model, employees_df, salaries)
    results["test_score"]

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    array([0.90895547, 0.87873961, 0.91514314, 0.92267765, 0.92381959])

.. GENERATED FROM PYTHON SOURCE LINES 112-118

To handle rich tabular data and feed it to a machine learning model, the
pipeline returned by |tabular_pipeline| preprocesses and encodes strings,
categories and dates using the |TableVectorizer|. See its documentation or
:ref:`sphx_glr_auto_examples_0010_encodings.py` for more details. An overview
of the chosen defaults is available in :ref:`user_guide_tabular_pipeline`.

.. GENERATED FROM PYTHON SOURCE LINES 121-143

Encoding any data as numerical features
---------------------------------------

Tabular data can contain a variety of datatypes, from numerical to datetimes,
categories, strings, and text. Encoding features in a meaningful way requires
significant effort and is a major part of the feature engineering process
required to properly train machine learning models. Skrub helps with this by
providing various transformers that automatically encode different datatypes
into ``float32`` features.

For **numerical features**, the |SquashingScaler| applies a robust scaling
technique that is less sensitive to outliers. Check the
:ref:`relative example ` for more information on the feature.

For **datetime columns**, skrub provides the |DatetimeEncoder|, which can
extract useful features such as year, month, and day, as well as additional
features such as weekday or day of year. Periodic encoding with trigonometric
or spline features is also available. Refer to the |DatetimeEncoder|
documentation for more detail.

.. GENERATED FROM PYTHON SOURCE LINES 145-156

.. code-block:: Python

    import pandas as pd

    data = pd.DataFrame(
        {
            "event": ["A", "B", "C"],
            "date_1": ["2020-01-01", "2020-06-15", "2021-03-22"],
            "date_2": ["2020-01-15", "2020-07-01", "2021-04-05"],
        }
    )
    data = Cleaner().fit_transform(data)
    TableReport(data)

.. interactive table report output not shown here (it requires JavaScript to render)
.. GENERATED FROM PYTHON SOURCE LINES 157-161

Skrub transformers are applied column by column, but it is possible to use the
|ApplyToCols| meta-transformer to apply a transformer to multiple columns at
once. Complex column selection is possible using
:ref:`skrub's column selectors `.

.. GENERATED FROM PYTHON SOURCE LINES 161-168

.. code-block:: Python

    from skrub import ApplyToCols, DatetimeEncoder

    ApplyToCols(
        DatetimeEncoder(add_total_seconds=False), cols=["date_1", "date_2"]
    ).fit_transform(data)

.. interactive table report output not shown here (it requires JavaScript to render)
.. GENERATED FROM PYTHON SOURCE LINES 169-175

Finally, when a column contains **categorical or string data**, it can be
encoded using various encoders provided by skrub. The default encoder is the
|StringEncoder|, which encodes categories using
`Latent Semantic Analysis (LSA) `_. It is a simple and efficient way to encode
categories and works well in practice.

.. GENERATED FROM PYTHON SOURCE LINES 175-187

.. code-block:: Python

    data = pd.DataFrame(
        {
            "city": ["Paris", "London", "Berlin", "Madrid", "Rome"],
            "country": ["France", "UK", "Germany", "Spain", "Italy"],
        }
    )
    TableReport(data)

    from skrub import StringEncoder

    StringEncoder(n_components=3).fit_transform(data["city"])

.. interactive table report output not shown here (it requires JavaScript to render)
.. GENERATED FROM PYTHON SOURCE LINES 188-196

If your data includes a lot of text, you may want to use the |TextEncoder|,
which uses pre-trained language models retrieved from the HuggingFace hub to
create meaningful text embeddings. See :ref:`user_guide_encoders_index` for
more details on all the categorical encoders provided by skrub, and
:ref:`sphx_glr_auto_examples_0010_encodings.py` for a comparison between the
different methods.

.. GENERATED FROM PYTHON SOURCE LINES 198-208

Advanced use cases
------------------

If your use case involves more complex data preparation, hyperparameter
tuning, or model selection, if you want to build a multi-table pipeline that
requires assembling and preparing multiple tables, or if you want to ensure
that the data preparation can be reproduced exactly, you can use the skrub
Data Ops, a powerful framework that provides tools to build complex data
processing pipelines. See the related :ref:`user guide ` and the
:ref:`data_ops_examples_ref` examples for more details.

.. GENERATED FROM PYTHON SOURCE LINES 210-222

Next steps
----------

We have briefly covered pipeline creation, vectorizing, assembling, and
encoding data. We presented the main functionalities of skrub, but there is
much more to explore!

Please refer to our :ref:`user_guide` for a more in-depth presentation of
skrub's concepts, or visit our `examples `_ for more illustrations of the
tools that we provide!

.. rst-class:: sphx-glr-timing

**Total running time of the script:** (0 minutes 21.429 seconds)

**Estimated memory usage:** 504 MB

.. _sphx_glr_download_auto_tutorials_0000_getting_started.py:

.. only:: html

    .. container:: sphx-glr-footer sphx-glr-footer-example

        .. container:: binder-badge

            .. image:: images/binder_badge_logo.svg
                :target: https://mybinder.org/v2/gh/skrub-data/skrub/0.9.0?urlpath=lab/tree/notebooks/auto_tutorials/0000_getting_started.ipynb
                :alt: Launch binder
                :width: 150 px

        .. container:: lite-badge

            .. image:: images/jupyterlite_badge_logo.svg
                :target: ../lite/lab/index.html?path=auto_tutorials/0000_getting_started.ipynb
                :alt: Launch JupyterLite
                :width: 150 px

        .. container:: sphx-glr-download sphx-glr-download-jupyter

            :download:`Download Jupyter notebook: 0000_getting_started.ipynb <0000_getting_started.ipynb>`

        .. container:: sphx-glr-download sphx-glr-download-python

            :download:`Download Python source code: 0000_getting_started.py <0000_getting_started.py>`

        .. container:: sphx-glr-download sphx-glr-download-zip

            :download:`Download zipped: 0000_getting_started.zip <0000_getting_started.zip>`

.. only:: html

    .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery `_