Getting Started
===============

This guide showcases the features of ``skrub``, an open-source package that aims to bridge the gap between tabular data sources and machine-learning models.

Much of ``skrub`` revolves around vectorizing, assembling, and encoding tabular data, to prepare data in a format that shallow or classic machine-learning models understand.

Downloading example datasets
----------------------------

The :obj:`~skrub.datasets` module allows us to download tabular datasets and demonstrate ``skrub``'s features.

.. code-block:: Python

    from skrub.datasets import fetch_employee_salaries

    dataset = fetch_employee_salaries()
    employees_df, salaries = dataset.X, dataset.y

Explore all the available datasets in :ref:`downloading_a_dataset_ref`.


Generating an interactive report for a dataframe
-------------------------------------------------

To quickly get an overview of a dataframe's contents, use the :class:`~skrub.TableReport`.

.. code-block:: Python

    from skrub import TableReport

    TableReport(employees_df)

You can use the interactive display above to explore the dataset visually.

.. note::

   You can see a few more `example reports`_ online. We also
   provide an experimental online demo_ that allows you to select a CSV or
   parquet file and generate a report directly in your web browser, without
   installing anything.

.. _example reports: https://skrub-data.org/skrub-reports/examples/
.. _demo: https://skrub-data.org/skrub-reports/

Easily building a strong baseline for tabular machine learning
--------------------------------------------------------------

The goal of ``skrub`` is to ease tabular data preparation for machine learning.
The :func:`~skrub.tabular_learner` function provides an easy way to build a simple
but reliable machine-learning model, working well on most tabular data.

.. code-block:: Python

    from sklearn.model_selection import cross_validate
    from skrub import tabular_learner

    model = tabular_learner("regressor")
    results = cross_validate(model, employees_df, salaries)
    results["test_score"]

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    array([0.89370447, 0.89279068, 0.92282557, 0.92319094, 0.92162666])

To handle rich tabular data and feed it to a machine-learning model, the
pipeline returned by :func:`~skrub.tabular_learner` preprocesses and encodes
strings, categories and dates using the :class:`~skrub.TableVectorizer`.
See its documentation or :ref:`sphx_glr_auto_examples_01_encodings.py` for
more details. An overview of the chosen defaults is available in
:ref:`end_to_end_pipeline`.

Assembling data
---------------

``Skrub`` allows imperfect assembly of data, such as joining dataframes
on columns that contain typos. ``Skrub``'s joiners have ``fit`` and
``transform`` methods, storing information about the data across calls.

The :class:`~skrub.Joiner` allows fuzzy-joining multiple tables: each row of
a main table is augmented with values from the best match in the auxiliary table.
You can control how distant fuzzy matches are allowed to be with the
``max_dist`` parameter.

In the following, we add information about countries to a table containing
airports and the cities they are in:

.. code-block:: Python

    import pandas as pd

    from skrub import Joiner

    airports = pd.DataFrame(
        {
            "airport_id": [1, 2],
            "airport_name": ["Charles de Gaulle", "Aeroporto Leonardo da Vinci"],
            "city": ["Paris", "Roma"],
        }
    )
    # notice the "Rome" instead of "Roma"
    capitals = pd.DataFrame(
        {"capital": ["Berlin", "Paris", "Rome"], "country": ["Germany", "France", "Italy"]}
    )
    joiner = Joiner(
        capitals,
        main_key="city",
        aux_key="capital",
        max_dist=0.8,
        add_match_info=False,
    )
    joiner.fit_transform(airports)

.. rst-class:: sphx-glr-script-out

.. code-block:: none

       airport_id                 airport_name   city  capital  country
    0           1            Charles de Gaulle  Paris    Paris   France
    1           2  Aeroporto Leonardo da Vinci   Roma     Rome    Italy

Information about countries has been added, even though the rows don't match exactly.

It's also possible to augment data by joining and aggregating multiple
dataframes with the :class:`~skrub.AggJoiner`. This is particularly useful to
summarize information scattered across tables, for instance adding statistics
about flights to the dataframe of airports:

.. code-block:: Python

    from skrub import AggJoiner

    flights = pd.DataFrame(
        {
            "flight_id": range(1, 7),
            "from_airport": [1, 1, 1, 2, 2, 2],
            "total_passengers": [90, 120, 100, 70, 80, 90],
            "company": ["DL", "AF", "AF", "DL", "DL", "TR"],
        }
    )
    agg_joiner = AggJoiner(
        aux_table=flights,
        main_key="airport_id",
        aux_key="from_airport",
        cols=["total_passengers", "company"],  # the cols to perform aggregation on
        operations=["mean", "mode"],  # the operations to compute
    )
    agg_joiner.fit_transform(airports)

.. rst-class:: sphx-glr-script-out

.. code-block:: none

       airport_id                 airport_name   city  company_mode  total_passengers_mean
    0           1            Charles de Gaulle  Paris            AF             103.333333
    1           2  Aeroporto Leonardo da Vinci   Roma            DL              80.000000
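
To make the aggregation step concrete, the same numbers can be reproduced with
plain pandas: aggregate ``flights`` per ``from_airport``, then left-join the
result onto ``airports``. This is only a sketch of what the output above
contains, not a description of how :class:`~skrub.AggJoiner` is implemented;
the names ``per_airport`` and ``manual`` are illustrative.

.. code-block:: Python

    import pandas as pd

    airports = pd.DataFrame(
        {
            "airport_id": [1, 2],
            "airport_name": ["Charles de Gaulle", "Aeroporto Leonardo da Vinci"],
            "city": ["Paris", "Roma"],
        }
    )
    flights = pd.DataFrame(
        {
            "flight_id": range(1, 7),
            "from_airport": [1, 1, 1, 2, 2, 2],
            "total_passengers": [90, 120, 100, 70, 80, 90],
            "company": ["DL", "AF", "AF", "DL", "DL", "TR"],
        }
    )

    # Aggregate the auxiliary table: one row per departure airport.
    per_airport = (
        flights.groupby("from_airport")
        .agg(
            # most frequent company (first one if several are tied)
            company_mode=("company", lambda s: s.mode().iloc[0]),
            total_passengers_mean=("total_passengers", "mean"),
        )
        .reset_index()
    )

    # Left-join the aggregates onto the main table, then drop the duplicate key.
    manual = airports.merge(
        per_airport, left_on="airport_id", right_on="from_airport", how="left"
    ).drop(columns="from_airport")

Unlike this one-off computation, the :class:`~skrub.AggJoiner` exposes ``fit``
and ``transform`` methods, so the same aggregation can be applied again to new
rows of the main table.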

For joining multiple auxiliary tables on a main table at once, use the
:class:`~skrub.MultiAggJoiner`.

See other ways to join multiple tables in :ref:`assembling`.

Encoding data
-------------

When a column contains categories with variations and typos, it can
be encoded using one of ``skrub``'s encoders, such as the
:class:`~skrub.GapEncoder`.

The :class:`~skrub.GapEncoder` creates a continuous encoding, based on
the activation of latent categories. It will create the encoding based on
combinations of substrings which frequently co-occur.

For instance, we might want to encode a column ``X`` that contains
information about cities, being either Madrid or Rome:

.. code-block:: Python

    from skrub import GapEncoder

    X = pd.Series(
        [
            "Rome, Italy",
            "Rome",
            "Roma, Italia",
            "Madrid, SP",
            "Madrid, spain",
            "Madrid",
            "Romq",
            "Rome, It",
        ],
        name="city",
    )
    enc = GapEncoder(n_components=2, random_state=0)  # 2 topics in the data
    enc.fit(X)

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    GapEncoder(n_components=2, random_state=0)

The :class:`~skrub.GapEncoder` has found the following two topics:

.. code-block:: Python

    enc.get_feature_names_out()

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    ['city: madrid, spain, sp', 'city: italia, italy, romq']

These correspond to the two cities.

Let's see the activation of each topic depending on the rows of ``X``:

.. code-block:: Python

    encoded = enc.fit_transform(X).assign(original=X)
    encoded

.. rst-class:: sphx-glr-script-out

.. code-block:: none

       city: madrid, spain, sp  city: italia, italy, romq       original
    0                 0.052257                  13.547743    Rome, Italy
    1                 0.050202                   3.049798           Rome
    2                 0.063282                  15.036718   Roma, Italia
    3                12.047028                   0.052972     Madrid, SP
    4                16.547818                   0.052182  Madrid, spain
    5                 6.048861                   0.051139         Madrid
    6                 0.050019                   3.049981           Romq
    7                 0.053193                   9.046807       Rome, It
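
One way to read this table is to assign each row to the topic with the highest
activation. The sketch below does this with plain pandas. To stay
self-contained, it copies the activations printed above into a new dataframe
(``activations`` and ``dominant_topic`` are illustrative names); the same
``idxmax`` call should apply to ``encoded`` once the ``original`` column is
dropped.

.. code-block:: Python

    import pandas as pd

    madrid = "city: madrid, spain, sp"
    rome = "city: italia, italy, romq"

    # Activations copied from the output above, indexed by the original strings.
    activations = pd.DataFrame(
        {
            madrid: [0.052257, 0.050202, 0.063282, 12.047028,
                     16.547818, 6.048861, 0.050019, 0.053193],
            rome: [13.547743, 3.049798, 15.036718, 0.052972,
                   0.052182, 0.051139, 3.049981, 9.046807],
        },
        index=[
            "Rome, Italy",
            "Rome",
            "Roma, Italia",
            "Madrid, SP",
            "Madrid, spain",
            "Madrid",
            "Romq",
            "Rome, It",
        ],
    )

    # For each row, keep the name of the column with the largest activation.
    dominant_topic = activations.idxmax(axis=1)

Here every variant of Rome, including the typo ``Romq``, lands on the second
topic, and every variant of Madrid lands on the first.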

