.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/01_encodings.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_auto_examples_01_encodings.py>`
        to download the full example code, or to run this example in your
        browser via JupyterLite or Binder.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_01_encodings.py:

.. _example_encodings:

=====================================================================
Encoding: from a dataframe to a numerical matrix for machine learning
=====================================================================

This example shows how to transform a rich dataframe with columns of various
types into a numerical matrix on which machine-learning algorithms can be
applied. We study the case of predicting wages using the `employee salaries
<https://www.openml.org/d/42125>`_ dataset.

.. |TableVectorizer| replace:: :class:`~skrub.TableVectorizer`

.. |Pipeline| replace:: :class:`~sklearn.pipeline.Pipeline`

.. |OneHotEncoder| replace:: :class:`~sklearn.preprocessing.OneHotEncoder`

.. |GapEncoder| replace:: :class:`~skrub.GapEncoder`

.. |MinHashEncoder| replace:: :class:`~skrub.MinHashEncoder`

.. |DatetimeEncoder| replace:: :class:`~skrub.DatetimeEncoder`

.. |HGBR| replace:: :class:`~sklearn.ensemble.HistGradientBoostingRegressor`

.. |RandomForestRegressor| replace:: :class:`~sklearn.ensemble.RandomForestRegressor`

.. |permutation importances| replace:: :func:`~sklearn.inspection.permutation_importance`

.. GENERATED FROM PYTHON SOURCE LINES 42-52

Easy learning on a dataframe
----------------------------

Let's first retrieve the dataset, using one of the downloaders from the
:mod:`skrub.datasets` module. Like all the downloaders,
:func:`~skrub.datasets.fetch_employee_salaries` returns a dataset with
attributes ``X`` and ``y``. ``X`` is a dataframe which contains the features
(aka the design matrix, explanatory variables, independent variables). ``y``
is a column (a pandas Series) which contains the target (aka the dependent or
response variable) that we want to learn to predict from ``X``. In this case
``y`` is the annual salary.

.. GENERATED FROM PYTHON SOURCE LINES 52-59

.. code-block:: Python

    from skrub.datasets import fetch_employee_salaries

    dataset = fetch_employee_salaries()
    employees, salaries = dataset.X, dataset.y
    employees
.. rst-class:: sphx-glr-script-out

.. code-block:: none

         gender department                              department_name                                            division assignment_category                    employee_position_title date_first_hired  year_first_hired
    0         F        POL                         Department of Police  MSB Information Mgmt and Tech Division Records...     Fulltime-Regular                Office Services Coordinator       09/22/1986              1986
    1         M        POL                         Department of Police         ISB Major Crimes Division Fugitive Section     Fulltime-Regular                      Master Police Officer       09/12/1988              1988
    2         F        HHS      Department of Health and Human Services      Adult Protective and Case Management Services     Fulltime-Regular                           Social Worker IV       11/19/1989              1989
    3         M        COR                Correction and Rehabilitation                         PRRS Facility and Security     Fulltime-Regular                     Resident Supervisor II       05/05/2014              2014
    4         M        HCA  Department of Housing and Community Affairs                        Affordable Housing Programs     Fulltime-Regular                    Planning Specialist III       03/05/2007              2007
    ...     ...        ...                                          ...                                                ...                  ...                                        ...              ...               ...
    9223      F        HHS      Department of Health and Human Services                        School Based Health Centers     Fulltime-Regular                  Community Health Nurse II       11/03/2015              2015
    9224      F        FRS                     Fire and Rescue Services                           Human Resources Division     Fulltime-Regular                 Fire/Rescue Division Chief       11/28/1988              1988
    9225      M        HHS      Department of Health and Human Services  Child and Adolescent Mental Health Clinic Serv...     Parttime-Regular           Medical Doctor IV - Psychiatrist       04/30/2001              2001
    9226      M        CCL                               County Council                              Council Central Staff     Fulltime-Regular                                 Manager II       09/05/2006              2006
    9227      M        DLC                 Department of Liquor Control                Licensure, Regulation and Education     Fulltime-Regular  Alcohol/Tobacco Enforcement Specialist II       01/30/2012              2012

    [9228 rows x 8 columns]
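Before vectorizing anything, it can help to look at the column dtypes and the
number of unique values per column. This is an added exploration snippet, not
part of the original example, but it previews the heterogeneity discussed in
the next paragraph:

.. code-block:: Python

    # Inspect the dtypes and the cardinality of each column: one numeric
    # column, one date stored as a string, and categorical columns with
    # anywhere from a handful to hundreds of unique values.
    print(employees.dtypes)
    print(employees.nunique())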
.. GENERATED FROM PYTHON SOURCE LINES 60-71

Most machine-learning algorithms work with arrays of numbers. The challenge
here is that the ``employees`` dataframe is a heterogeneous set of columns:
some are numerical (``'year_first_hired'``), some dates
(``'date_first_hired'``), some have a few categorical entries (``'gender'``),
some many (``'employee_position_title'``). Therefore our table needs to be
"vectorized": processed to extract numeric features.

``skrub`` provides an easy way to build a simple but reliable
machine-learning model which includes this step, and which works well on most
tabular data.

.. GENERATED FROM PYTHON SOURCE LINES 71-80

.. code-block:: Python

    from sklearn.model_selection import cross_validate

    from skrub import tabular_learner

    model = tabular_learner("regressor")
    results = cross_validate(model, employees, salaries)
    results["test_score"]

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    array([0.89370447, 0.89279068, 0.92282557, 0.92319094, 0.92162666])

.. GENERATED FROM PYTHON SOURCE LINES 81-85

The estimator returned by :obj:`tabular_learner` combines 2 steps:

- a |TableVectorizer| to preprocess the dataframe and vectorize the features
- a supervised learner (by default a |HGBR|)

.. GENERATED FROM PYTHON SOURCE LINES 85-87

.. code-block:: Python

    model
.. rst-class:: sphx-glr-script-out

.. code-block:: none

    Pipeline(steps=[('tablevectorizer',
                     TableVectorizer(high_cardinality=MinHashEncoder(),
                                     low_cardinality=ToCategorical())),
                    ('histgradientboostingregressor',
                     HistGradientBoostingRegressor(categorical_features='from_dtype'))])
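As a quick check (an added snippet relying only on scikit-learn's standard
|Pipeline| API), the two steps can also be accessed programmatically:

.. code-block:: Python

    # ``named_steps`` maps each step name, as shown in the repr above,
    # to the corresponding estimator.
    print(model.named_steps["tablevectorizer"])
    print(model.named_steps["histgradientboostingregressor"])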
.. GENERATED FROM PYTHON SOURCE LINES 88-92

In the rest of this example, we focus on the first step and explore the
capabilities of skrub's |TableVectorizer|.

.. GENERATED FROM PYTHON SOURCE LINES 94-96

More details on encoding tabular data
-------------------------------------

.. GENERATED FROM PYTHON SOURCE LINES 96-103

.. code-block:: Python

    from skrub import TableVectorizer

    vectorizer = TableVectorizer()
    vectorized_employees = vectorizer.fit_transform(employees)
    vectorized_employees
.. rst-class:: sphx-glr-script-out

.. code-block:: none

          gender_F  gender_M  gender_nan  ...  date_first_hired_year  date_first_hired_month  date_first_hired_day  date_first_hired_total_seconds  year_first_hired
    0          1.0       0.0         0.0  ...                 1986.0                      9.0                  22.0                    5.277312e+08            1986.0
    1          0.0       1.0         0.0  ...                 1988.0                      9.0                  12.0                    5.900256e+08            1988.0
    2          1.0       0.0         0.0  ...                 1989.0                     11.0                  19.0                    6.274368e+08            1989.0
    3          0.0       1.0         0.0  ...                 2014.0                      5.0                   5.0                    1.399248e+09            2014.0
    4          0.0       1.0         0.0  ...                 2007.0                      3.0                   5.0                    1.173053e+09            2007.0
    ...        ...       ...         ...  ...                    ...                      ...                   ...                             ...               ...
    9223       1.0       0.0         0.0  ...                 2015.0                     11.0                   3.0                    1.446509e+09            2015.0
    9224       1.0       0.0         0.0  ...                 1988.0                     11.0                  28.0                    5.966784e+08            1988.0
    9225       0.0       1.0         0.0  ...                 2001.0                      4.0                  30.0                    9.885888e+08            2001.0
    9226       0.0       1.0         0.0  ...                 2006.0                      9.0                   5.0                    1.157414e+09            2006.0
    9227       0.0       1.0         0.0  ...                 2012.0                      1.0                  30.0                    1.327882e+09            2012.0

    [9228 rows x 143 columns]
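We can confirm the dimensions of the vectorized output programmatically (an
added check):

.. code-block:: Python

    # 9228 rows, and 143 numerical features extracted from the 8 columns.
    print(vectorized_employees.shape)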
.. GENERATED FROM PYTHON SOURCE LINES 104-108

From our 8 columns, the |TableVectorizer| has extracted 143 numerical
features. Most of them are one-hot encoded representations of the categorical
features. For example, we can see that 3 columns ``'gender_F'``,
``'gender_M'``, ``'gender_nan'`` were created to encode the ``'gender'``
column.

.. GENERATED FROM PYTHON SOURCE LINES 110-112

By performing appropriate transformations on our complex data, the
|TableVectorizer| produced numeric features that we can use for
machine-learning:

.. GENERATED FROM PYTHON SOURCE LINES 112-117

.. code-block:: Python

    from sklearn.ensemble import HistGradientBoostingRegressor

    HistGradientBoostingRegressor().fit(vectorized_employees, salaries)
.. rst-class:: sphx-glr-script-out

.. code-block:: none

    HistGradientBoostingRegressor()
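To make the role of the vectorized features concrete, here is a minimal
hold-out evaluation sketch. It is an addition to the original example; note
that vectorizing before splitting can leak information from the test set,
which is why the pipelines shown below are preferable in practice:

.. code-block:: Python

    from sklearn.ensemble import HistGradientBoostingRegressor
    from sklearn.model_selection import train_test_split

    # Split the already-vectorized features and score on held-out rows.
    X_train, X_test, y_train, y_test = train_test_split(
        vectorized_employees, salaries, random_state=0
    )
    hgb = HistGradientBoostingRegressor().fit(X_train, y_train)
    print(f"R2 on held-out data: {hgb.score(X_test, y_test):.3f}")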
.. GENERATED FROM PYTHON SOURCE LINES 118-122

The |TableVectorizer| bridges the gap between tabular data and
machine-learning pipelines. It allows us to apply a machine-learning
estimator to our dataframe without manual data wrangling and feature
extraction.

.. GENERATED FROM PYTHON SOURCE LINES 124-138

Inspecting the TableVectorizer
------------------------------

The |TableVectorizer| distinguishes between 4 basic kinds of columns (more
may be added in the future). For each kind, it applies a different
transformation, which we can configure. The kinds of columns and the default
transformation for each of them are:

- numeric columns: simply casting to floating-point
- datetime columns: extracting features such as year, day, hour with the
  |DatetimeEncoder|
- low-cardinality categorical columns: one-hot encoding
- high-cardinality categorical columns: a simple and effective text
  representation pipeline provided by the |GapEncoder|

.. GENERATED FROM PYTHON SOURCE LINES 138-141

.. code-block:: Python

    vectorizer
.. rst-class:: sphx-glr-script-out

.. code-block:: none

    TableVectorizer()
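Each of these defaults can be overridden through the vectorizer's constructor
parameters. A hedged sketch: ``low_cardinality`` and ``high_cardinality`` are
the parameter names used later in this example, and ``datetime`` is assumed
to follow the same pattern:

.. code-block:: Python

    from skrub import DatetimeEncoder, MinHashEncoder

    # Swap in a different transformer per kind of column (the ``datetime``
    # parameter and ``resolution`` argument are assumptions to illustrate
    # the pattern, not taken from the original example).
    custom_vectorizer = TableVectorizer(
        high_cardinality=MinHashEncoder(),
        datetime=DatetimeEncoder(resolution="day"),
    )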
.. GENERATED FROM PYTHON SOURCE LINES 142-145

We can inspect which transformation was chosen for each column and retrieve
the fitted transformer. ``vectorizer.kind_to_columns_`` provides an overview
of how the vectorizer categorized the columns in our input:

.. GENERATED FROM PYTHON SOURCE LINES 145-148

.. code-block:: Python

    vectorizer.kind_to_columns_

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    {'numeric': ['year_first_hired'], 'datetime': ['date_first_hired'], 'low_cardinality': ['gender', 'department', 'department_name', 'assignment_category'], 'high_cardinality': ['division', 'employee_position_title'], 'specific': []}

.. GENERATED FROM PYTHON SOURCE LINES 149-150

The reverse mapping is given by:

.. GENERATED FROM PYTHON SOURCE LINES 150-153

.. code-block:: Python

    vectorizer.column_to_kind_

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    {'year_first_hired': 'numeric', 'date_first_hired': 'datetime', 'gender': 'low_cardinality', 'department': 'low_cardinality', 'department_name': 'low_cardinality', 'assignment_category': 'low_cardinality', 'division': 'high_cardinality', 'employee_position_title': 'high_cardinality'}

.. GENERATED FROM PYTHON SOURCE LINES 154-156

``vectorizer.transformers_`` gives us a dictionary which maps column names to
the corresponding fitted transformer.

.. GENERATED FROM PYTHON SOURCE LINES 156-159

.. code-block:: Python

    vectorizer.transformers_["date_first_hired"]
.. rst-class:: sphx-glr-script-out

.. code-block:: none

    DatetimeEncoder()
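Since ``transformers_`` is a plain dictionary, we can also list every fitted
transformer at once (an added convenience snippet):

.. code-block:: Python

    # One entry per input column, mapping to its fitted transformer.
    for column, transformer in vectorizer.transformers_.items():
        print(f"{column}: {transformer}")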
.. GENERATED FROM PYTHON SOURCE LINES 160-162

We can also see which features in the vectorizer's output were derived from a
given input column.

.. GENERATED FROM PYTHON SOURCE LINES 162-165

.. code-block:: Python

    vectorizer.input_to_outputs_["date_first_hired"]

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    ['date_first_hired_year', 'date_first_hired_month', 'date_first_hired_day', 'date_first_hired_total_seconds']

.. GENERATED FROM PYTHON SOURCE LINES 166-169

.. code-block:: Python

    vectorized_employees[vectorizer.input_to_outputs_["date_first_hired"]]
.. rst-class:: sphx-glr-script-out

.. code-block:: none

          date_first_hired_year  date_first_hired_month  date_first_hired_day  date_first_hired_total_seconds
    0                    1986.0                     9.0                  22.0                    5.277312e+08
    1                    1988.0                     9.0                  12.0                    5.900256e+08
    2                    1989.0                    11.0                  19.0                    6.274368e+08
    3                    2014.0                     5.0                   5.0                    1.399248e+09
    4                    2007.0                     3.0                   5.0                    1.173053e+09
    ...                     ...                     ...                   ...                             ...
    9223                 2015.0                    11.0                   3.0                    1.446509e+09
    9224                 1988.0                    11.0                  28.0                    5.966784e+08
    9225                 2001.0                     4.0                  30.0                    9.885888e+08
    9226                 2006.0                     9.0                   5.0                    1.157414e+09
    9227                 2012.0                     1.0                  30.0                    1.327882e+09

    [9228 rows x 4 columns]
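For intuition, similar features could be derived by hand with pandas; this is
only an illustrative sketch, not how the |DatetimeEncoder| is implemented:

.. code-block:: Python

    import pandas as pd

    parsed = pd.to_datetime(employees["date_first_hired"], format="%m/%d/%Y")
    # Year, month and day mirror the encoder's output above; total_seconds
    # corresponds to the timestamp in seconds since the Unix epoch.
    manual_features = pd.DataFrame(
        {
            "year": parsed.dt.year,
            "month": parsed.dt.month,
            "day": parsed.dt.day,
            "total_seconds": parsed.astype("int64") // 10**9,
        }
    )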
.. GENERATED FROM PYTHON SOURCE LINES 170-172

Finally, we can go in the opposite direction: given a column in the output,
find out from which input column it was derived.

.. GENERATED FROM PYTHON SOURCE LINES 172-176

.. code-block:: Python

    vectorizer.output_to_input_["department_BOA"]

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    'department'

.. GENERATED FROM PYTHON SOURCE LINES 177-182

Dataframe preprocessing
~~~~~~~~~~~~~~~~~~~~~~~

Note that ``"date_first_hired"`` has been recognized and processed as a
datetime column.

.. GENERATED FROM PYTHON SOURCE LINES 182-185

.. code-block:: Python

    vectorizer.column_to_kind_["date_first_hired"]

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    'datetime'

.. GENERATED FROM PYTHON SOURCE LINES 186-187

But looking closer at our original dataframe, it was encoded as a string.

.. GENERATED FROM PYTHON SOURCE LINES 187-190

.. code-block:: Python

    employees["date_first_hired"]

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    0       09/22/1986
    1       09/12/1988
    2       11/19/1989
    3       05/05/2014
    4       03/05/2007
               ...
    9223    11/03/2015
    9224    11/28/1988
    9225    04/30/2001
    9226    09/05/2006
    9227    01/30/2012
    Name: date_first_hired, Length: 9228, dtype: object

.. GENERATED FROM PYTHON SOURCE LINES 191-201

Note the ``dtype: object`` in the output above.

Before applying the transformers we specify, the |TableVectorizer| performs a
few preprocessing steps. For example, strings commonly used to represent
missing values, such as ``"N/A"``, are replaced with actual ``null`` values.
As we saw above, columns containing strings that represent dates (e.g.
``'2024-05-15'``) are detected and converted to proper datetimes.

We can inspect the list of steps that were applied to a given column:

.. GENERATED FROM PYTHON SOURCE LINES 201-204

.. code-block:: Python

    vectorizer.all_processing_steps_["date_first_hired"]

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    [CleanNullStrings(), ToDatetime(), DatetimeEncoder(), {'date_first_hired_day': ToFloat32(), 'date_first_hired_month': ToFloat32(), ...}]

.. GENERATED FROM PYTHON SOURCE LINES 205-206

These preprocessing steps depend on the column:

.. GENERATED FROM PYTHON SOURCE LINES 206-209

.. code-block:: Python

    vectorizer.all_processing_steps_["department"]

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    [CleanNullStrings(), ToStr(), OneHotEncoder(drop='if_binary', dtype='float32', handle_unknown='ignore', sparse_output=False), {'department_BOA': ToFloat32(), 'department_BOE': ToFloat32(), ...}]

.. GENERATED FROM PYTHON SOURCE LINES 213-219

A simple Pipeline for tabular data
----------------------------------

The |TableVectorizer| outputs data that can be understood by a scikit-learn
estimator. Therefore we can easily build a 2-step scikit-learn |Pipeline|
that we can fit, test or cross-validate, and that works well on tabular data.

.. GENERATED FROM PYTHON SOURCE LINES 219-232

.. code-block:: Python

    import numpy as np
    from sklearn.ensemble import HistGradientBoostingRegressor
    from sklearn.model_selection import cross_validate
    from sklearn.pipeline import make_pipeline

    pipeline = make_pipeline(TableVectorizer(), HistGradientBoostingRegressor())

    results = cross_validate(pipeline, employees, salaries)
    scores = results["test_score"]
    print(f"R2 score: mean: {np.mean(scores):.3f}; std: {np.std(scores):.3f}")
    print(f"mean fit time: {np.mean(results['fit_time']):.3f} seconds")

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    R2 score: mean: 0.922; std: 0.012
    mean fit time: 6.016 seconds
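Because the whole chain is a regular scikit-learn |Pipeline|, its
hyper-parameters can be tuned end to end. A small added sketch (the grid
values here are illustrative assumptions):

.. code-block:: Python

    from sklearn.model_selection import GridSearchCV

    # Step parameters are addressed with scikit-learn's ``<step>__<param>``
    # convention; fitting re-runs the whole chain for each candidate.
    search = GridSearchCV(
        pipeline,
        {"histgradientboostingregressor__learning_rate": [0.05, 0.1, 0.2]},
        cv=3,
    )
    search.fit(employees, salaries)
    print(search.best_params_)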
.. GENERATED FROM PYTHON SOURCE LINES 233-266

Specializing the TableVectorizer for HistGradientBoosting
---------------------------------------------------------

The encoders used by default by the |TableVectorizer| are safe choices for a
wide range of downstream estimators. If we know we want to use it with a
|HGBR| (or classifier) model, we can make some different choices that are
only well-suited to tree-based models but can yield a faster pipeline. We
make 2 changes.

The |HGBR| has built-in support for categorical features, so we do not need
to one-hot encode them. We do need to tell it which features should be
treated as categorical, with the ``categorical_features`` parameter. In
recent versions of scikit-learn, we can set
``categorical_features='from_dtype'``, and it will treat all columns in the
input that have a ``Categorical`` dtype as such. Therefore we change the
encoder for low-cardinality columns: instead of the |OneHotEncoder|, we use
skrub's ``ToCategorical``. This transformer simply ensures our columns have
an actual ``Categorical`` dtype (as opposed to string, for example), so that
they can be recognized by the |HGBR|.

The second change replaces the |GapEncoder| with a |MinHashEncoder|. The
|GapEncoder| is a topic model. It produces interpretable embeddings in a
vector space where distances are meaningful, which is great for
interpretation and necessary for some downstream supervised learners such as
linear models. However, fitting the topic model is costly in computation time
and memory. The |MinHashEncoder| produces features that are not easy to
interpret, but that decision trees can efficiently use to test for the
occurrence of particular character n-grams (more details are provided in its
documentation). Therefore it can be a faster and very effective alternative
when the supervised learner is built on top of decision trees, which is the
case for the |HGBR|.

The resulting pipeline is identical to the one produced by default by
:obj:`tabular_learner`.

.. GENERATED FROM PYTHON SOURCE LINES 266-281

.. code-block:: Python

    from skrub import MinHashEncoder, ToCategorical

    vectorizer = TableVectorizer(
        low_cardinality=ToCategorical(), high_cardinality=MinHashEncoder()
    )
    pipeline = make_pipeline(
        vectorizer, HistGradientBoostingRegressor(categorical_features="from_dtype")
    )

    results = cross_validate(pipeline, employees, salaries)
    scores = results["test_score"]
    print(f"R2 score: mean: {np.mean(scores):.3f}; std: {np.std(scores):.3f}")
    print(f"mean fit time: {np.mean(results['fit_time']):.3f} seconds")

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    R2 score: mean: 0.911; std: 0.014
    mean fit time: 1.027 seconds

.. GENERATED FROM PYTHON SOURCE LINES 282-285

We can see that this new pipeline achieves a similar score but is fitted much
faster. This is mostly due to replacing the |GapEncoder| with the
|MinHashEncoder| (however, this makes the features less interpretable).

.. GENERATED FROM PYTHON SOURCE LINES 287-302

Feature importances in the statistical model
--------------------------------------------

As we just saw, we can fit a |MinHashEncoder| faster than a |GapEncoder|.
However, the |GapEncoder| has a crucial advantage: each dimension of its
output space is associated with a topic which can be inspected and
interpreted.

In this section, after training a regressor, we will plot the feature
importances.

.. topic:: Note:

    To minimize computation time, we use the feature importances computed by
    the |RandomForestRegressor|, but you should prefer |permutation importances|
    instead (which are less subject to biases).

First, we train another scikit-learn regressor, the |RandomForestRegressor|:

.. GENERATED FROM PYTHON SOURCE LINES 302-311

.. code-block:: Python

    from sklearn.ensemble import RandomForestRegressor

    vectorizer = TableVectorizer()  # now using the default GapEncoder
    regressor = RandomForestRegressor(n_estimators=50, max_depth=20, random_state=0)

    pipeline = make_pipeline(vectorizer, regressor)
    pipeline.fit(employees, salaries)
.. rst-class:: sphx-glr-script-out

.. code-block:: none

    Pipeline(steps=[('tablevectorizer', TableVectorizer()),
                    ('randomforestregressor',
                     RandomForestRegressor(max_depth=20, n_estimators=50,
                                           random_state=0))])
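Before looking at the forest's built-in importances, note that |permutation
importances| (recommended in the note above) can be computed directly on this
fitted pipeline. A minimal added sketch, using only the scikit-learn API:

.. code-block:: Python

    from sklearn.inspection import permutation_importance

    # Because the pipeline takes the raw dataframe as input, importances are
    # reported per original column rather than per extracted feature.
    # This re-scores the pipeline many times, so it is slow.
    perm = permutation_importance(
        pipeline, employees, salaries, n_repeats=5, random_state=0
    )
    print(dict(zip(employees.columns, perm.importances_mean)))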
.. GENERATED FROM PYTHON SOURCE LINES 312-313

We retrieve the feature importances:

.. GENERATED FROM PYTHON SOURCE LINES 313-320

.. code-block:: Python

    avg_importances = regressor.feature_importances_
    std_importances = np.std(
        [tree.feature_importances_ for tree in regressor.estimators_], axis=0
    )
    indices = np.argsort(avg_importances)[::-1]

.. GENERATED FROM PYTHON SOURCE LINES 321-322

And plot the results:

.. GENERATED FROM PYTHON SOURCE LINES 322-340

.. code-block:: Python

    import matplotlib.pyplot as plt

    top_indices = indices[:20]
    labels = vectorizer.get_feature_names_out()[top_indices]

    plt.figure(figsize=(12, 9))
    plt.barh(
        y=labels,
        width=avg_importances[top_indices],
        yerr=std_importances[top_indices],
        color="b",
    )
    plt.yticks(fontsize=15)
    plt.title("Feature importances")
    plt.tight_layout(pad=1)
    plt.show()

.. image-sg:: /auto_examples/images/sphx_glr_01_encodings_001.png
   :alt: Feature importances
   :srcset: /auto_examples/images/sphx_glr_01_encodings_001.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 341-346

The |GapEncoder| creates feature names that show the 3 most important words
in the topic associated with each feature. As we can see in the plot above,
this helps with inspecting the model. If we had used a |MinHashEncoder|
instead, the features would be much less helpful, with names such as
``employee_position_title_0``, ``employee_position_title_1``, etc.

.. GENERATED FROM PYTHON SOURCE LINES 348-357

We can see that features such as the time elapsed since being hired, having
full-time employment, and the position seem to be the most informative for
prediction. However, feature importances must not be over-interpreted: they
capture statistical associations rather than causal effects. Moreover, the
fast feature importance method used here suffers from biases favouring
features with larger cardinality, as illustrated in a scikit-learn `example
<https://scikit-learn.org/stable/auto_examples/inspection/plot_permutation_importance.html>`_.
In general we should prefer |permutation importances|, but it is a slower
method.

.. GENERATED FROM PYTHON SOURCE LINES 359-370

Conclusion
----------

In this example, we motivated the need for a simple machine-learning
pipeline, which we built using the |TableVectorizer| and a |HGBR|.

We saw that by default, it works well on a heterogeneous dataset.

To better understand our dataset, and without much effort, we were also able
to plot the feature importances.

.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (1 minute 13.200 seconds)

.. _sphx_glr_download_auto_examples_01_encodings.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: binder-badge

      .. image:: images/binder_badge_logo.svg
        :target: https://mybinder.org/v2/gh/skrub-data/skrub/0.2.0?urlpath=lab/tree/notebooks/auto_examples/01_encodings.ipynb
        :alt: Launch binder
        :width: 150 px

    .. container:: lite-badge

      .. image:: images/jupyterlite_badge_logo.svg
        :target: ../lite/lab/?path=auto_examples/01_encodings.ipynb
        :alt: Launch JupyterLite
        :width: 150 px

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: 01_encodings.ipynb <01_encodings.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: 01_encodings.py <01_encodings.py>`

.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_