.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/00_getting_started.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_auto_examples_00_getting_started.py>`
        to download the full example code, or to run this example in your browser
        via JupyterLite or Binder.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_00_getting_started.py:

Getting Started
===============

This guide showcases the features of ``skrub``, an open-source package that
aims at bridging the gap between tabular data sources and machine-learning
models.

Much of ``skrub`` revolves around vectorizing, assembling, and encoding tabular
data, to prepare data in a format that shallow or classic machine-learning
models understand.

.. GENERATED FROM PYTHON SOURCE LINES 14-31

Downloading example datasets
----------------------------

The :obj:`~skrub.datasets` module allows us to download tabular datasets and
demonstrate ``skrub``'s features.

.. note::

   You can control the directory where the datasets are stored by:

   - setting the ``SKRUB_DATA_DIRECTORY`` environment variable to an absolute
     directory path,
   - using the ``data_directory`` parameter in fetch functions, which takes
     precedence over the environment variable.

   By default, the datasets are stored in a folder named "skrub_data" in the
   user's home folder.

.. GENERATED FROM PYTHON SOURCE LINES 34-39

.. code-block:: Python

    from skrub.datasets import fetch_employee_salaries

    dataset = fetch_employee_salaries()
    employees_df, salaries = dataset.X, dataset.y

.. GENERATED FROM PYTHON SOURCE LINES 40-41

Explore all the available datasets in :ref:`downloading_a_dataset_ref`.

.. GENERATED FROM PYTHON SOURCE LINES 44-51

Generating an interactive report for a dataframe
------------------------------------------------

The :class:`~skrub.Cleaner` allows cleaning the dataframe: parsing nulls and
dates, and dropping columns with too many nulls.
To quickly get an overview of a dataframe's contents, use the
:class:`~skrub.TableReport`.

.. GENERATED FROM PYTHON SOURCE LINES 53-58

.. code-block:: Python

    from skrub import Cleaner, TableReport

    employees_df = Cleaner().fit_transform(employees_df)
    TableReport(employees_df)

.. An interactive skrub TableReport is rendered here in the built documentation
   (it requires JavaScript to display).
.. GENERATED FROM PYTHON SOURCE LINES 59-70

You can use the interactive display above to explore the dataset visually.

.. note::

   You can see a few more `example reports`_ online. We also provide an
   experimental online demo_ that allows you to select a CSV or parquet file
   and generate a report directly in your web browser, without installing
   anything.

.. _example reports: https://skrub-data.org/skrub-reports/examples/
.. _demo: https://skrub-data.org/skrub-reports/

.. GENERATED FROM PYTHON SOURCE LINES 73-75

It is also possible to tell ``skrub`` to replace the default pandas and polars
displays with the ``TableReport``.

.. GENERATED FROM PYTHON SOURCE LINES 75-82

.. code-block:: Python

    from skrub import set_config

    set_config(use_tablereport=True)

    employees_df

.. An interactive skrub TableReport is rendered here in the built documentation
   (it requires JavaScript to display).
.. GENERATED FROM PYTHON SOURCE LINES 83-84

This setting can easily be reverted:

.. GENERATED FROM PYTHON SOURCE LINES 84-89

.. code-block:: Python

    set_config(use_tablereport=False)

    employees_df

.. rst-class:: sphx-glr-script-out

.. code-block:: none

         gender department                          department_name                                           division assignment_category                    employee_position_title date_first_hired  year_first_hired
    0         F        POL                     Department of Police  MSB Information Mgmt and Tech Division Records...    Fulltime-Regular                Office Services Coordinator       1986-09-22              1986
    1         M        POL                     Department of Police          ISB Major Crimes Division Fugitive Section    Fulltime-Regular                      Master Police Officer       1988-09-12              1988
    2         F        HHS  Department of Health and Human Services       Adult Protective and Case Management Services    Fulltime-Regular                           Social Worker IV       1989-11-19              1989
    3         M        COR            Correction and Rehabilitation                          PRRS Facility and Security    Fulltime-Regular                     Resident Supervisor II       2014-05-05              2014
    4         M        HCA  Department of Housing and Community Affairs                    Affordable Housing Programs    Fulltime-Regular                    Planning Specialist III       2007-03-05              2007
    ...     ...        ...                                      ...                                                 ...                 ...                                        ...              ...               ...
    9223      F        HHS  Department of Health and Human Services                         School Based Health Centers    Fulltime-Regular                  Community Health Nurse II       2015-11-03              2015
    9224      F        FRS                 Fire and Rescue Services                            Human Resources Division    Fulltime-Regular                 Fire/Rescue Division Chief       1988-11-28              1988
    9225      M        HHS  Department of Health and Human Services   Child and Adolescent Mental Health Clinic Serv...    Parttime-Regular           Medical Doctor IV - Psychiatrist       2001-04-30              2001
    9226      M        CCL                           County Council                               Council Central Staff    Fulltime-Regular                                 Manager II       2006-09-05              2006
    9227      M        DLC             Department of Liquor Control                 Licensure, Regulation and Education    Fulltime-Regular  Alcohol/Tobacco Enforcement Specialist II       2012-01-30              2012

    [9228 rows x 8 columns]
.. GENERATED FROM PYTHON SOURCE LINES 90-96

Easily building a strong baseline for tabular machine learning
--------------------------------------------------------------

The goal of ``skrub`` is to ease tabular data preparation for machine learning.
The :func:`~skrub.tabular_learner` function provides an easy way to build a
simple but reliable machine-learning model that works well on most tabular
data.

.. GENERATED FROM PYTHON SOURCE LINES 99-107

.. code-block:: Python

    from sklearn.model_selection import cross_validate

    from skrub import tabular_learner

    model = tabular_learner("regressor")
    results = cross_validate(model, employees_df, salaries)
    results["test_score"]

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    array([0.91129818, 0.88013711, 0.91451364, 0.92117174, 0.92487738])

.. GENERATED FROM PYTHON SOURCE LINES 108-114

To handle rich tabular data and feed it to a machine-learning model, the
pipeline returned by :func:`~skrub.tabular_learner` preprocesses and encodes
strings, categories, and dates using the :class:`~skrub.TableVectorizer`. See
its documentation or :ref:`sphx_glr_auto_examples_01_encodings.py` for more
details. An overview of the chosen defaults is available in
:ref:`end_to_end_pipeline`.

.. GENERATED FROM PYTHON SOURCE LINES 117-128

Assembling data
---------------

``skrub`` allows imperfect assembly of data, such as joining dataframes on
columns that contain typos. ``skrub``'s joiners have ``fit`` and ``transform``
methods, storing information about the data across calls.

The :class:`~skrub.Joiner` allows fuzzy-joining multiple tables: each row of a
main table is augmented with values from the best match in the auxiliary
table. You can control how distant fuzzy matches are allowed to be with the
``max_dist`` parameter.

.. GENERATED FROM PYTHON SOURCE LINES 130-132

In the following, we add information about countries to a table containing
airports and the cities they are in:

.. GENERATED FROM PYTHON SOURCE LINES 134-158

.. code-block:: Python

    import pandas as pd

    from skrub import Joiner

    airports = pd.DataFrame(
        {
            "airport_id": [1, 2],
            "airport_name": ["Charles de Gaulle", "Aeroporto Leonardo da Vinci"],
            "city": ["Paris", "Roma"],
        }
    )
    # notice the "Rome" instead of "Roma"
    capitals = pd.DataFrame(
        {"capital": ["Berlin", "Paris", "Rome"], "country": ["Germany", "France", "Italy"]}
    )
    joiner = Joiner(
        capitals,
        main_key="city",
        aux_key="capital",
        max_dist=0.8,
        add_match_info=False,
    )
    joiner.fit_transform(airports)

.. rst-class:: sphx-glr-script-out

.. code-block:: none

       airport_id                 airport_name   city capital country
    0           1            Charles de Gaulle  Paris   Paris  France
    1           2  Aeroporto Leonardo da Vinci   Roma    Rome   Italy
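For intuition, the fuzzy join above picks, for each row of the main table, the
closest key in the auxiliary table. A rough stand-in using only Python's
standard-library ``difflib`` (this is an illustrative sketch, not skrub's
actual string-similarity metric, so scores and matches can differ):

.. code-block:: Python

    import difflib

    capitals = {"Berlin": "Germany", "Paris": "France", "Rome": "Italy"}

    def closest_country(city, cutoff=0.6):
        # get_close_matches returns the best-matching keys whose similarity
        # ratio is above `cutoff`, ordered from most to least similar
        match = difflib.get_close_matches(city, capitals, n=1, cutoff=cutoff)
        return capitals[match[0]] if match else None

    print(closest_country("Paris"))  # France (exact match)
    print(closest_country("Roma"))   # Italy ("Roma" is close to "Rome")

A too-distant query such as ``"Tokyo"`` falls below the cutoff and yields no
match, which is the role ``max_dist`` plays in :class:`~skrub.Joiner`.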
.. GENERATED FROM PYTHON SOURCE LINES 159-165

Information about countries has been added, even though the rows don't match
exactly.

It's also possible to augment data by joining and aggregating multiple
dataframes with the :class:`~skrub.AggJoiner`. This is particularly useful to
summarize information scattered across tables, for instance adding statistics
about flights to the dataframe of airports:

.. GENERATED FROM PYTHON SOURCE LINES 167-186

.. code-block:: Python

    from skrub import AggJoiner

    flights = pd.DataFrame(
        {
            "flight_id": range(1, 7),
            "from_airport": [1, 1, 1, 2, 2, 2],
            "total_passengers": [90, 120, 100, 70, 80, 90],
            "company": ["DL", "AF", "AF", "DL", "DL", "TR"],
        }
    )
    agg_joiner = AggJoiner(
        aux_table=flights,
        main_key="airport_id",
        aux_key="from_airport",
        cols=["total_passengers"],  # the columns to aggregate
        operations=["mean", "std"],  # the operations to compute
    )
    agg_joiner.fit_transform(airports)

.. rst-class:: sphx-glr-script-out

.. code-block:: none

       airport_id                 airport_name   city  total_passengers_mean  total_passengers_std
    0           1            Charles de Gaulle  Paris             103.333333             15.275252
    1           2  Aeroporto Leonardo da Vinci   Roma              80.000000             10.000000
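For reference, the aggregation that ``AggJoiner`` performs above can be
reproduced with plain pandas: group the auxiliary table by its key, aggregate,
and merge onto the main table. A minimal sketch of that equivalent computation
(not skrub's implementation):

.. code-block:: Python

    import pandas as pd

    airports = pd.DataFrame({"airport_id": [1, 2], "city": ["Paris", "Roma"]})
    flights = pd.DataFrame(
        {
            "from_airport": [1, 1, 1, 2, 2, 2],
            "total_passengers": [90, 120, 100, 70, 80, 90],
        }
    )

    # Aggregate the auxiliary table on its key, then join onto the main table
    stats = (
        flights.groupby("from_airport")["total_passengers"]
        .agg(["mean", "std"])
        .add_prefix("total_passengers_")
    )
    joined = airports.merge(stats, left_on="airport_id", right_index=True)
    print(joined)  # airport 1: mean ~103.33, std ~15.28; airport 2: mean 80, std 10

Note that pandas' ``std`` is the sample standard deviation (``ddof=1``), which
matches the values shown in the ``AggJoiner`` output above.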
.. GENERATED FROM PYTHON SOURCE LINES 187-191

For joining multiple auxiliary tables on a main table at once, use the
:class:`~skrub.MultiAggJoiner`.

See other ways to join multiple tables in :ref:`assembling`.

.. GENERATED FROM PYTHON SOURCE LINES 194-207

Encoding data
-------------

When a column contains categories with variations and typos, it can be encoded
using one of ``skrub``'s encoders, such as the :class:`~skrub.GapEncoder`.

The :class:`~skrub.GapEncoder` creates a continuous encoding based on the
activation of latent categories. It builds the encoding from combinations of
substrings that frequently co-occur.

For instance, we might want to encode a column ``X`` that contains information
about cities, being either Madrid or Rome:

.. GENERATED FROM PYTHON SOURCE LINES 209-227

.. code-block:: Python

    from skrub import GapEncoder

    X = pd.Series(
        [
            "Rome, Italy",
            "Rome",
            "Roma, Italia",
            "Madrid, SP",
            "Madrid, spain",
            "Madrid",
            "Romq",
            "Rome, It",
        ],
        name="city",
    )
    enc = GapEncoder(n_components=2, random_state=0)  # 2 topics in the data
    enc.fit(X)

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    GapEncoder(n_components=2, random_state=0)
.. GENERATED FROM PYTHON SOURCE LINES 228-229

The :class:`~skrub.GapEncoder` has found the following two topics:

.. GENERATED FROM PYTHON SOURCE LINES 231-233

.. code-block:: Python

    enc.get_feature_names_out()

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    ['city: madrid, spain, sp', 'city: italia, italy, romq']

.. GENERATED FROM PYTHON SOURCE LINES 234-237

These correspond to the two cities. Let's see the activation of each topic
depending on the rows of ``X``:

.. GENERATED FROM PYTHON SOURCE LINES 239-242

.. code-block:: Python

    encoded = enc.fit_transform(X).assign(original=X)
    encoded

.. rst-class:: sphx-glr-script-out

.. code-block:: none

       city: madrid, spain, sp  city: italia, italy, romq       original
    0                 0.052257                  13.547743    Rome, Italy
    1                 0.050202                   3.049798           Rome
    2                 0.063282                  15.036718   Roma, Italia
    3                12.047028                   0.052972     Madrid, SP
    4                16.547818                   0.052182  Madrid, spain
    5                 6.048861                   0.051139         Madrid
    6                 0.050019                   3.049981           Romq
    7                 0.053193                   9.046807       Rome, It
.. GENERATED FROM PYTHON SOURCE LINES 243-247

The higher the activation, the closer the row is to the latent topic. These
columns can now be understood by a machine-learning model.

The other encoders are presented in :ref:`encoding`.

.. GENERATED FROM PYTHON SOURCE LINES 250-261

Next steps
----------

We have briefly covered pipeline creation, vectorizing, assembling, and
encoding data. We presented the main functionalities of ``skrub``, but there
is much more to it!

Please refer to our :ref:`user_guide` for a more in-depth presentation of
``skrub``'s concepts, or visit our `examples `_ for more illustrations of the
tools that we provide!

.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 8.493 seconds)

.. _sphx_glr_download_auto_examples_00_getting_started.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: binder-badge

      .. image:: images/binder_badge_logo.svg
        :target: https://mybinder.org/v2/gh/skrub-data/skrub/main?urlpath=lab/tree/notebooks/auto_examples/00_getting_started.ipynb
        :alt: Launch binder
        :width: 150 px

    .. container:: lite-badge

      .. image:: images/jupyterlite_badge_logo.svg
        :target: ../lite/lab/index.html?path=auto_examples/00_getting_started.ipynb
        :alt: Launch JupyterLite
        :width: 150 px

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: 00_getting_started.ipynb <00_getting_started.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: 00_getting_started.py <00_getting_started.py>`

    .. container:: sphx-glr-download sphx-glr-download-zip

      :download:`Download zipped: 00_getting_started.zip <00_getting_started.zip>`

.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery `_