.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/00_getting_started.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end `
        to download the full example code, or to run this example in your
        browser via JupyterLite or Binder.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_00_getting_started.py:


Getting Started
===============

This guide showcases the features of ``skrub``, an open-source package that aims
at bridging the gap between tabular data sources and machine-learning models.

Much of ``skrub`` revolves around vectorizing, assembling, and encoding tabular
data, to prepare data in a format that shallow or classic machine-learning
models understand.

.. GENERATED FROM PYTHON SOURCE LINES 13-18

Downloading example datasets
----------------------------

The :obj:`~skrub.datasets` module allows us to download tabular datasets and
demonstrate ``skrub``'s features.

.. GENERATED FROM PYTHON SOURCE LINES 20-25

.. code-block:: Python

    from skrub.datasets import fetch_employee_salaries

    dataset = fetch_employee_salaries()
    employees_df, salaries = dataset.X, dataset.y

.. GENERATED FROM PYTHON SOURCE LINES 26-27

Explore all the available datasets in :ref:`downloading_a_dataset_ref`.

.. GENERATED FROM PYTHON SOURCE LINES 30-35

Generating an interactive report for a dataframe
-------------------------------------------------

To quickly get an overview of a dataframe's contents, use the
:class:`~skrub.TableReport`.

.. GENERATED FROM PYTHON SOURCE LINES 37-41

.. code-block:: Python

    from skrub import TableReport

    TableReport(employees_df)

.. GENERATED FROM PYTHON SOURCE LINES 42-53

The interactive display produced by :class:`~skrub.TableReport` lets you explore
the dataset visually.

.. note::

   You can see a few more `example reports`_ online. We also
   provide an experimental online demo_ that allows you to select a CSV or
   parquet file and generate a report directly in your web browser, without
   installing anything.

.. _example reports: https://skrub-data.org/skrub-reports/examples/

.. _demo: https://skrub-data.org/skrub-reports/

.. GENERATED FROM PYTHON SOURCE LINES 57-63

Easily building a strong baseline for tabular machine learning
--------------------------------------------------------------

The goal of ``skrub`` is to ease tabular data preparation for machine learning.
The :func:`~skrub.tabular_learner` function provides an easy way to build a
simple but reliable machine-learning model that works well on most tabular data.

.. GENERATED FROM PYTHON SOURCE LINES 66-74

.. code-block:: Python

    from sklearn.model_selection import cross_validate
    from skrub import tabular_learner

    model = tabular_learner("regressor")
    results = cross_validate(model, employees_df, salaries)
    results["test_score"]

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    array([0.89370447, 0.89279068, 0.92282557, 0.92319094, 0.92162666])

.. GENERATED FROM PYTHON SOURCE LINES 75-81

To handle rich tabular data and feed it to a machine-learning model, the
pipeline returned by :func:`~skrub.tabular_learner` preprocesses and encodes
strings, categories and dates using the :class:`~skrub.TableVectorizer`.
See its documentation or :ref:`sphx_glr_auto_examples_01_encodings.py` for
more details. An overview of the chosen defaults is available in
:ref:`end_to_end_pipeline`.

.. GENERATED FROM PYTHON SOURCE LINES 84-95

Assembling data
---------------

``Skrub`` supports imperfect assembly of data, such as joining dataframes
on columns that contain typos. Its joiners have ``fit`` and
``transform`` methods and store information about the data across calls.

The :class:`~skrub.Joiner` fuzzy-joins tables: each row of
a main table is augmented with values from the best match in the auxiliary
table. The ``max_dist`` parameter controls how distant a fuzzy match is
allowed to be.

.. GENERATED FROM PYTHON SOURCE LINES 97-99

In the following, we add information about countries to a table containing
airports and the cities they are in:

.. GENERATED FROM PYTHON SOURCE LINES 101-125

.. code-block:: Python

    import pandas as pd
    from skrub import Joiner

    airports = pd.DataFrame(
        {
            "airport_id": [1, 2],
            "airport_name": ["Charles de Gaulle", "Aeroporto Leonardo da Vinci"],
            "city": ["Paris", "Roma"],
        }
    )
    # notice the "Rome" instead of "Roma"
    capitals = pd.DataFrame(
        {"capital": ["Berlin", "Paris", "Rome"], "country": ["Germany", "France", "Italy"]}
    )
    joiner = Joiner(
        capitals,
        main_key="city",
        aux_key="capital",
        max_dist=0.8,
        add_match_info=False,
    )
    joiner.fit_transform(airports)

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

       airport_id                 airport_name   city capital country
    0           1            Charles de Gaulle  Paris   Paris  France
    1           2  Aeroporto Leonardo da Vinci   Roma    Rome   Italy

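To build intuition for what a fuzzy match with a distance cutoff does, here is a
minimal sketch that uses only the Python standard library. It is not how the
:class:`~skrub.Joiner` computes its matches: the helper ``best_match`` is
hypothetical, and the distance (one minus the ``difflib.SequenceMatcher``
similarity ratio) is an assumption chosen for illustration. Its ``max_dist``
argument only mimics the role of the parameter described above.

.. code-block:: Python

    from difflib import SequenceMatcher


    def best_match(value, candidates, max_dist=0.8):
        """Return the candidate closest to ``value``, or None if it is too far.

        Hypothetical helper: the distance is 1 - SequenceMatcher ratio, which
        is an illustrative choice and NOT the metric used by skrub.
        """
        scored = [
            (1.0 - SequenceMatcher(None, value.lower(), cand.lower()).ratio(), cand)
            for cand in candidates
        ]
        distance, candidate = min(scored)
        return candidate if distance <= max_dist else None


    capitals = {"Berlin": "Germany", "Paris": "France", "Rome": "Italy"}
    for city in ["Paris", "Roma"]:
        match = best_match(city, capitals)
        print(city, "->", match, capitals.get(match))

With this toy distance, "Roma" is closest to "Rome", and tightening the cutoff
(for example ``max_dist=0.1``) rejects that match instead of accepting it.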
.. GENERATED FROM PYTHON SOURCE LINES 126-132

In the joined table above, information about countries has been added even
though the keys do not match exactly.

It's also possible to augment data by joining and aggregating multiple
dataframes with the :class:`~skrub.AggJoiner`. This is particularly useful to
summarize information scattered across tables, for instance adding statistics
about flights to the dataframe of airports:

.. GENERATED FROM PYTHON SOURCE LINES 134-153

.. code-block:: Python

    from skrub import AggJoiner

    flights = pd.DataFrame(
        {
            "flight_id": range(1, 7),
            "from_airport": [1, 1, 1, 2, 2, 2],
            "total_passengers": [90, 120, 100, 70, 80, 90],
            "company": ["DL", "AF", "AF", "DL", "DL", "TR"],
        }
    )
    agg_joiner = AggJoiner(
        aux_table=flights,
        main_key="airport_id",
        aux_key="from_airport",
        cols=["total_passengers", "company"],  # the cols to perform aggregation on
        operations=["mean", "mode"],  # the operations to compute
    )
    agg_joiner.fit_transform(airports)

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

       airport_id                 airport_name   city company_mode  total_passengers_mean
    0           1            Charles de Gaulle  Paris           AF             103.333333
    1           2  Aeroporto Leonardo da Vinci   Roma           DL              80.000000

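Conceptually, an aggregate-then-join can be read as three steps: group the
auxiliary rows by key, summarize each group, and attach the summaries to the
main table. The sketch below spells these steps out with the standard library
only, on the same airports and flights data. It is an illustration under that
assumption, not the :class:`~skrub.AggJoiner` implementation, and the variable
names (``grouped``, ``summary``, ``augmented``) are made up for the example.

.. code-block:: Python

    from collections import defaultdict
    from statistics import mean, mode

    airports = [
        {"airport_id": 1, "airport_name": "Charles de Gaulle", "city": "Paris"},
        {"airport_id": 2, "airport_name": "Aeroporto Leonardo da Vinci", "city": "Roma"},
    ]
    flights = [
        {"from_airport": 1, "total_passengers": 90, "company": "DL"},
        {"from_airport": 1, "total_passengers": 120, "company": "AF"},
        {"from_airport": 1, "total_passengers": 100, "company": "AF"},
        {"from_airport": 2, "total_passengers": 70, "company": "DL"},
        {"from_airport": 2, "total_passengers": 80, "company": "DL"},
        {"from_airport": 2, "total_passengers": 90, "company": "TR"},
    ]

    # Step 1: group the auxiliary rows by their join key.
    grouped = defaultdict(list)
    for flight in flights:
        grouped[flight["from_airport"]].append(flight)

    # Step 2: summarize each group (mean for numbers, mode for categories).
    summary = {
        key: {
            "company_mode": mode(f["company"] for f in rows),
            "total_passengers_mean": mean(f["total_passengers"] for f in rows),
        }
        for key, rows in grouped.items()
    }

    # Step 3: left-join the summaries onto the main table.
    augmented = [
        {**airport, **summary.get(airport["airport_id"], {})} for airport in airports
    ]
    for row in augmented:
        print(row)

On this data the sketch yields the same ``company_mode`` and
``total_passengers_mean`` values as the output above.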
.. GENERATED FROM PYTHON SOURCE LINES 154-158

For joining multiple auxiliary tables on a main table at once, use the
:class:`~skrub.MultiAggJoiner`.

See other ways to join multiple tables in :ref:`assembling`.

.. GENERATED FROM PYTHON SOURCE LINES 161-174

Encoding data
-------------

When a column contains categories with variations and typos, it can
be encoded using one of ``skrub``'s encoders, such as the
:class:`~skrub.GapEncoder`.

The :class:`~skrub.GapEncoder` creates a continuous encoding based on
the activation of latent categories. It builds the encoding from
combinations of substrings that frequently co-occur.

For instance, we might want to encode a column ``X`` that contains
information about cities, either Madrid or Rome:

.. GENERATED FROM PYTHON SOURCE LINES 176-194

.. code-block:: Python

    from skrub import GapEncoder

    X = pd.Series(
        [
            "Rome, Italy",
            "Rome",
            "Roma, Italia",
            "Madrid, SP",
            "Madrid, spain",
            "Madrid",
            "Romq",
            "Rome, It",
        ],
        name="city",
    )
    enc = GapEncoder(n_components=2, random_state=0)  # 2 topics in the data
    enc.fit(X)

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    GapEncoder(n_components=2, random_state=0)

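To see why shared substrings are a useful signal for strings with typos and
variations, the following sketch compares entries by the character 3-grams they
have in common. It does not reproduce the model fitted by the
:class:`~skrub.GapEncoder`: the helpers ``char_ngrams`` and ``jaccard`` are
hypothetical, and both the choice of 3-grams and the Jaccard overlap are
assumptions made for illustration.

.. code-block:: Python

    def char_ngrams(text, n=3):
        """Return the set of lowercase character n-grams of ``text``."""
        text = text.lower()
        return {text[i : i + n] for i in range(len(text) - n + 1)}


    def jaccard(a, b, n=3):
        """Share of n-grams that two strings have in common (0.0 to 1.0)."""
        grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
        union = grams_a | grams_b
        # Strings shorter than n have no n-grams; treat them as non-overlapping.
        return len(grams_a & grams_b) / len(union) if union else 0.0


    # Variants of the same city share substrings; different cities do not.
    print(jaccard("Rome, Italy", "Roma, Italia"))
    print(jaccard("Rome, Italy", "Madrid, spain"))
    print(jaccard("Rome", "Romq"))

The two spellings of the Italian entry overlap on several 3-grams, while the
Rome and Madrid entries share none, which is the kind of regularity a
substring-based encoding can pick up.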
.. GENERATED FROM PYTHON SOURCE LINES 195-196

The fitted :class:`~skrub.GapEncoder` has found the following two topics:

.. GENERATED FROM PYTHON SOURCE LINES 198-200

.. code-block:: Python

    enc.get_feature_names_out()

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    ['city: madrid, spain, sp', 'city: italia, italy, romq']

.. GENERATED FROM PYTHON SOURCE LINES 201-204

These topics correspond to the two cities.

Let's see the activation of each topic for each row of ``X``:

.. GENERATED FROM PYTHON SOURCE LINES 206-209

.. code-block:: Python

    encoded = enc.fit_transform(X).assign(original=X)
    encoded

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

       city: madrid, spain, sp  city: italia, italy, romq       original
    0                 0.052257                  13.547743    Rome, Italy
    1                 0.050202                   3.049798           Rome
    2                 0.063282                  15.036718   Roma, Italia
    3                12.047028                   0.052972     Madrid, SP
    4                16.547818                   0.052182  Madrid, spain
    5                 6.048861                   0.051139         Madrid
    6                 0.050019                   3.049981           Romq
    7                 0.053193                   9.046807       Rome, It

.. GENERATED FROM PYTHON SOURCE LINES 210-214

The higher the activation, the closer the row is to the latent topic. These
columns can now be understood by a machine-learning model.

The other encoders are presented in :ref:`encoding`.

.. GENERATED FROM PYTHON SOURCE LINES 217-228

Next steps
----------

We have briefly covered pipeline creation, vectorizing, assembling, and encoding
data. These are the main functionalities of ``skrub``, but there is much
more to it!

Please refer to our :ref:`user_guide` for a more in-depth presentation of
``skrub``'s concepts, or visit our
`examples `_ for more
illustrations of the tools that we provide!


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 6.655 seconds)


.. _sphx_glr_download_auto_examples_00_getting_started.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: binder-badge

      .. image:: images/binder_badge_logo.svg
        :target: https://mybinder.org/v2/gh/skrub-data/skrub/0.3.0?urlpath=lab/tree/notebooks/auto_examples/00_getting_started.ipynb
        :alt: Launch binder
        :width: 150 px

    .. container:: lite-badge

      .. image:: images/jupyterlite_badge_logo.svg
        :target: ../lite/lab/index.html?path=auto_examples/00_getting_started.ipynb
        :alt: Launch JupyterLite
        :width: 150 px

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: 00_getting_started.ipynb <00_getting_started.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: 00_getting_started.py <00_getting_started.py>`

    .. container:: sphx-glr-download sphx-glr-download-zip

      :download:`Download zipped: 00_getting_started.zip <00_getting_started.zip>`


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery `_