.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_tutorials/0000_getting_started.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_auto_tutorials_0000_getting_started.py>`
        to download the full example code or to run this example in your
        browser via JupyterLite or Binder.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_tutorials_0000_getting_started.py:

Getting Started with skrub
==========================

This guide showcases some of the features of skrub.

Much of skrub revolves around simplifying many of the tasks that are involved
in pre-processing raw data into a format that shallow or classic
machine-learning models can understand, that is, numerical data.

Skrub achieves this by vectorizing, assembling, and encoding tabular data
through the features we present in this example and the following ones.

.. |TableReport| replace:: :class:`~skrub.TableReport`
.. |Cleaner| replace:: :class:`~skrub.Cleaner`
.. |set_config| replace:: :func:`~skrub.set_config`
.. |tabular_pipeline| replace:: :func:`~skrub.tabular_pipeline`
.. |TableVectorizer| replace:: :class:`~skrub.TableVectorizer`
.. |Joiner| replace:: :class:`~skrub.Joiner`
.. |SquashingScaler| replace:: :class:`~skrub.SquashingScaler`
.. |DatetimeEncoder| replace:: :class:`~skrub.DatetimeEncoder`
.. |ApplyToCols| replace:: :class:`~skrub.ApplyToCols`
.. |StringEncoder| replace:: :class:`~skrub.StringEncoder`
.. |TextEncoder| replace:: :class:`~skrub.TextEncoder`

.. GENERATED FROM PYTHON SOURCE LINES 27-32

Preliminary exploration with the |TableReport|
----------------------------------------------

We start by loading the "employee salaries" dataset. Skrub dataset fetching
functions return a Bunch object, which contains the paths to the data files.
We can load the data into a dataframe using pandas.

.. GENERATED FROM PYTHON SOURCE LINES 32-40

.. code-block:: Python

    import pandas as pd

    from skrub.datasets import fetch_employee_salaries

    file_path = fetch_employee_salaries().path
    employees_df = pd.read_csv(file_path)

.. GENERATED FROM PYTHON SOURCE LINES 41-43

The target variable is the current annual salary. We pop it from the dataframe
to keep only the features in ``employees_df``.

.. GENERATED FROM PYTHON SOURCE LINES 43-45

.. code-block:: Python

    salaries = employees_df.pop("current_annual_salary")

.. GENERATED FROM PYTHON SOURCE LINES 46-48

Typically, the first step with new data is exploration and parsing. To quickly
get an overview of a dataframe's contents, use the |TableReport|.

.. GENERATED FROM PYTHON SOURCE LINES 50-54

.. code-block:: Python

    from skrub import TableReport

    TableReport(employees_df)

.. interactive table report output not shown here (it requires JavaScript to render)
.. GENERATED FROM PYTHON SOURCE LINES 55-73

You can use the interactive display above to explore the dataset visually.

.. admonition:: Additional examples
    :collapsible: closed

    You can see a few more `example reports`_ online. We also provide an
    experimental online demo_ that allows you to select a CSV or parquet file
    and generate a report directly in your web browser, without installing
    anything.

    .. _example reports: https://skrub-data.org/skrub-reports/examples/
    .. _demo: https://skrub-data.org/skrub-reports/

From the report above, we see that there are columns with date and time stored
as ``object`` dtype (cf. the "Stats" tab of the report). Datatypes not being
parsed correctly is a common occurrence right after reading a table. We can
use the |Cleaner| to address this. In the next section, we show that this
transformer also performs additional cleaning.

.. GENERATED FROM PYTHON SOURCE LINES 75-81

Sanitizing data with the |Cleaner|
----------------------------------

Here, we use the |Cleaner|, a transformer that sanitizes the dataframe by
parsing nulls and dates, and by dropping "uninformative" columns (e.g.,
columns with too many nulls or that are constant).

.. GENERATED FROM PYTHON SOURCE LINES 81-87

.. code-block:: Python

    from skrub import Cleaner

    employees_df = Cleaner().fit_transform(employees_df)
    TableReport(employees_df)

.. interactive table report output not shown here (it requires JavaScript to render)
.. GENERATED FROM PYTHON SOURCE LINES 88-90

We can see from the "Stats" tab that the ``date_first_hired`` column has now
been parsed correctly as a datetime.

.. GENERATED FROM PYTHON SOURCE LINES 92-98

Easily building a strong baseline for tabular machine learning
--------------------------------------------------------------

The goal of skrub is to ease tabular data preparation for machine learning.
The |tabular_pipeline| function provides an easy way to build a simple but
reliable machine-learning model that works well on most tabular data.

.. GENERATED FROM PYTHON SOURCE LINES 101-107

.. code-block:: Python

    from sklearn.model_selection import cross_validate

    from skrub import tabular_pipeline

    model = tabular_pipeline("regressor")
    model

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    Pipeline(steps=[('tablevectorizer',
                     TableVectorizer(low_cardinality=ToCategorical())),
                    ('histgradientboostingregressor',
                     HistGradientBoostingRegressor())])
.. GENERATED FROM PYTHON SOURCE LINES 108-111

.. code-block:: Python

    results = cross_validate(model, employees_df, salaries)
    results["test_score"]

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    array([0.90895547, 0.87873961, 0.91514314, 0.92267765, 0.92381959])

.. GENERATED FROM PYTHON SOURCE LINES 112-118

To handle rich tabular data and feed it to a machine learning model, the
pipeline returned by |tabular_pipeline| preprocesses and encodes strings,
categories and dates using the |TableVectorizer|. See its documentation or
:ref:`sphx_glr_auto_examples_0010_encodings.py` for more details. An overview
of the chosen defaults is available in :ref:`user_guide_tabular_pipeline`.

.. GENERATED FROM PYTHON SOURCE LINES 121-143

Encoding any data as numerical features
---------------------------------------

Tabular data can contain a variety of datatypes, from numerical to datetimes,
categories, strings, and text. Encoding features in a meaningful way requires
significant effort and is a major part of the feature engineering process
required to properly train machine learning models. Skrub helps with this by
providing various transformers that automatically encode different datatypes
into ``float32`` features.

For **numerical features**, the |SquashingScaler| applies a robust scaling
technique that is less sensitive to outliers. Check the
:ref:`relative example ` for more information on the feature.

For **datetime columns**, skrub provides the |DatetimeEncoder|, which can
extract useful features such as year, month, and day, as well as additional
features such as weekday or day of year. Periodic encoding with trigonometric
or spline features is also available. Refer to the |DatetimeEncoder|
documentation for more detail.

.. GENERATED FROM PYTHON SOURCE LINES 145-156

.. code-block:: Python

    import pandas as pd

    data = pd.DataFrame(
        {
            "event": ["A", "B", "C"],
            "date_1": ["2020-01-01", "2020-06-15", "2021-03-22"],
            "date_2": ["2020-01-15", "2020-07-01", "2021-04-05"],
        }
    )
    data = Cleaner().fit_transform(data)
    TableReport(data)

.. interactive table report output not shown here (it requires JavaScript to render)
.. GENERATED FROM PYTHON SOURCE LINES 157-161

Skrub transformers are applied column by column, but it is possible to use the
|ApplyToCols| meta-transformer to apply a transformer to multiple columns at
once. Complex column selection is possible using
:ref:`skrub's column selectors `.

.. GENERATED FROM PYTHON SOURCE LINES 161-168

.. code-block:: Python

    from skrub import ApplyToCols, DatetimeEncoder

    ApplyToCols(
        DatetimeEncoder(add_total_seconds=False), cols=["date_1", "date_2"]
    ).fit_transform(data)

.. interactive table report output not shown here (it requires JavaScript to render)
.. GENERATED FROM PYTHON SOURCE LINES 169-175

Finally, when a column contains **categorical or string data**, it can be
encoded using various encoders provided by skrub. The default encoder is the
|StringEncoder|, which encodes categories using
`Latent Semantic Analysis (LSA) `_. It is a simple and efficient way to encode
categories and works well in practice.

.. GENERATED FROM PYTHON SOURCE LINES 175-187

.. code-block:: Python

    data = pd.DataFrame(
        {
            "city": ["Paris", "London", "Berlin", "Madrid", "Rome"],
            "country": ["France", "UK", "Germany", "Spain", "Italy"],
        }
    )
    TableReport(data)

    from skrub import StringEncoder

    StringEncoder(n_components=3).fit_transform(data["city"])

.. interactive table report output not shown here (it requires JavaScript to render)
.. GENERATED FROM PYTHON SOURCE LINES 188-196

If your data includes a lot of text, you may want to use the |TextEncoder|,
which uses pre-trained language models retrieved from the HuggingFace hub to
create meaningful text embeddings. See :ref:`user_guide_encoders_index` for
more details on all the categorical encoders provided by skrub, and
:ref:`sphx_glr_auto_examples_0010_encodings.py` for a comparison between the
different methods.

.. GENERATED FROM PYTHON SOURCE LINES 198-208

Advanced use cases
------------------

If your use case involves more complex data preparation, hyperparameter
tuning, or model selection, if you want to build a multi-table pipeline that
requires assembling and preparing multiple tables, or if you want to ensure
that the data preparation can be reproduced exactly, you can use the skrub
Data Ops, a powerful framework that provides tools to build complex data
processing pipelines. See the related :ref:`user guide ` and the
:ref:`data_ops_examples_ref` examples for more details.

.. GENERATED FROM PYTHON SOURCE LINES 210-222

Next steps
----------

We have briefly covered pipeline creation, vectorizing, assembling, and
encoding data. We presented the main functionalities of skrub, but there is
much more to explore!

Please refer to our :ref:`user_guide` for a more in-depth presentation of
skrub's concepts, or visit our `examples `_ for more illustrations of the
tools that we provide!

.. rst-class:: sphx-glr-timing

**Total running time of the script:** (0 minutes 21.429 seconds)

**Estimated memory usage:** 504 MB

.. _sphx_glr_download_auto_tutorials_0000_getting_started.py:

.. only:: html

    .. container:: sphx-glr-footer sphx-glr-footer-example

        .. container:: binder-badge

            .. image:: images/binder_badge_logo.svg
                :target: https://mybinder.org/v2/gh/skrub-data/skrub/0.9.0?urlpath=lab/tree/notebooks/auto_tutorials/0000_getting_started.ipynb
                :alt: Launch binder
                :width: 150 px

        .. container:: lite-badge

            .. image:: images/jupyterlite_badge_logo.svg
                :target: ../lite/lab/index.html?path=auto_tutorials/0000_getting_started.ipynb
                :alt: Launch JupyterLite
                :width: 150 px

        .. container:: sphx-glr-download sphx-glr-download-jupyter

            :download:`Download Jupyter notebook: 0000_getting_started.ipynb <0000_getting_started.ipynb>`

        .. container:: sphx-glr-download sphx-glr-download-python

            :download:`Download Python source code: 0000_getting_started.py <0000_getting_started.py>`

        .. container:: sphx-glr-download sphx-glr-download-zip

            :download:`Download zipped: 0000_getting_started.zip <0000_getting_started.zip>`

.. only:: html

    .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery `_