Customizing the global configuration

Skrub includes a configuration manager for setting global parameters (see the set_config() documentation for details).

Configuration options can be changed with the set_config() function:

>>> from skrub import set_config
>>> set_config(table_report_verbosity=0)

This changes the behavior of skrub for the rest of the current script. Each configuration parameter also has a corresponding environment variable, listed in the parameter table on this page, that can be used to set it permanently.
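As a rough illustration of the environment-variable route, the sketch below maps a few parameter names to the variable names given in the parameter table and writes the values as strings. The helper `export_settings` is hypothetical and not part of skrub. It also assumes that skrub reads these variables when it is first imported, which this page does not state, so the variables would need to be in place before `import skrub`. A truly permanent setting would normally live in a shell profile or service configuration rather than in Python code.

```python
import os

# Environment variable names copied from the parameter table on this page.
ENV_VARS = {
    "table_report_verbosity": "SKB_TABLE_REPORT_VERBOSITY",
    "max_plot_columns": "SKB_MAX_PLOT_COLUMNS",
    "float_precision": "SKB_FLOAT_PRECISION",
}


def export_settings(settings, environ):
    """Hypothetical helper: write each setting to its environment variable.

    Environment values are always strings, so every value is converted
    with str() before being stored.
    """
    for name, value in settings.items():
        environ[ENV_VARS[name]] = str(value)
    return environ


# Work on a copy of the environment so this demo has no side effects.
# Passing os.environ itself would change the current process instead.
env = export_settings(
    {"table_report_verbosity": 0, "max_plot_columns": 10},
    dict(os.environ),
)
print(env["SKB_TABLE_REPORT_VERBOSITY"], env["SKB_MAX_PLOT_COLUMNS"])
```

Writing to a copy keeps the example safe to run anywhere; how skrub parses each string value (for example for boolean parameters) is not covered here.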

Additionally, the config_context() context manager temporarily alters the configuration:

>>> import skrub
>>> with skrub.config_context(max_plot_columns=1):
...     pass

Only the code executed inside the with statement is affected.
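The save-and-restore pattern that a context manager of this kind typically follows can be sketched with the standard library alone. This is not skrub's implementation: `_demo_config`, `demo_get_config`, and `demo_config_context` are stand-in names invented for the illustration, and the default values are only borrowed from the parameter table.

```python
from contextlib import contextmanager

# Stand-in for a library-wide settings dictionary (not skrub's own object).
_demo_config = {"max_plot_columns": 30, "table_report_verbosity": 1}


def demo_get_config():
    """Return a copy so callers cannot mutate the global state by accident."""
    return dict(_demo_config)


@contextmanager
def demo_config_context(**overrides):
    """Apply overrides for the duration of a with block, then restore."""
    previous = demo_get_config()
    _demo_config.update(overrides)
    try:
        yield
    finally:
        # Runs even if the body raises, so the old values always come back.
        _demo_config.clear()
        _demo_config.update(previous)


with demo_config_context(max_plot_columns=1):
    inside = demo_get_config()["max_plot_columns"]

after = demo_get_config()["max_plot_columns"]
print(inside, after)  # 1 30
```

The `try`/`finally` is the important design choice: it restores the earlier values even when the body of the with statement raises an exception.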

The get_config() function allows retrieving the current configuration.

Configuration parameters

The configuration parameters that can be set with set_config() and config_context() can be listed with get_config():

>>> import skrub
>>> config = skrub.get_config()
>>> config.keys()
dict_keys(['use_table_report_data_ops', 'table_report_verbosity', 'max_plot_columns', 'max_association_columns', 'subsampling_seed', 'enable_subsampling', 'float_precision', 'cardinality_threshold', 'data_dir', 'eager_data_ops'])

These are the parameters currently available in the global configuration:

Skrub Configuration Parameters

| Parameter Name | Default Value | Env Variable | Description |
|---|---|---|---|
| use_table_report_data_ops | True | SKB_USE_TABLE_REPORT_DATA_OPS | Set the HTML representation used for the Data Ops previews. If True, use the TableReport; otherwise use the default Pandas or Polars representation. |
| table_report_verbosity | 1 | SKB_TABLE_REPORT_VERBOSITY | Set the verbosity of the TableReport. If 1, print per-column progress to the screen; if 0, print nothing. |
| max_plot_columns | 30 | SKB_MAX_PLOT_COLUMNS | If a dataframe has more columns than the value set here, the TableReport will skip generating the plots. |
| max_association_columns | 30 | SKB_MAX_ASSOCIATION_COLUMNS | If a dataframe has more columns than the value set here, the TableReport will skip computing the associations. |
| subsampling_seed | 0 | SKB_SUBSAMPLING_SEED | Set the random seed of subsampling in skrub.DataOp.skb.subsample(), when how="random" is passed. |
| enable_subsampling | "default" | SKB_ENABLE_SUBSAMPLING | Control the activation of subsampling in skrub.DataOp.skb.subsample(). If "default", the behavior of skrub.DataOp.skb.subsample() is used. If "disable", subsampling is never used, so skb.subsample becomes a no-op. If "force", subsampling is used in all DataOps evaluation modes (eval(), fit_transform, etc.). |
| float_precision | 3 | SKB_FLOAT_PRECISION | Control the number of significant digits shown when formatting floats. Applies overall precision rather than fixed decimal places. |
| cardinality_threshold | 40 | SKB_CARDINALITY_THRESHOLD | Set the cardinality_threshold argument of TableVectorizer. Additionally, set the threshold for warning the user about high cardinality features in the TableReport. |
| data_dir | ~/skrub_data | SKB_DATA_DIRECTORY | Set the default location used by skrub to store datasets and other data, such as the Data Ops reports. |
| eager_data_ops | True | SKB_EAGER_DATA_OPS | Eagerly perform checks on the DataOps as soon as they are created, and compute previews if preview data is available. If disabled, those checks are delayed until the DataOp is actually used. |
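
The float_precision description distinguishes significant digits from fixed decimal places. The difference can be illustrated with Python's built-in format specifiers, where `g` counts significant digits and `f` counts digits after the decimal point. This is only an analogy: `compare_formats` is a hypothetical helper, and the exact strings skrub produces (for instance whether it switches to scientific notation) may differ.

```python
def compare_formats(value, digits=3):
    """Return (significant-digit form, fixed-decimal form) of a float."""
    significant = format(value, f".{digits}g")  # digits counted overall
    fixed = format(value, f".{digits}f")  # digits counted after the point
    return significant, fixed


for value in (3.14159, 0.000123456, 123456.789):
    print(value, compare_formats(value))
# 3.14159 ('3.14', '3.142')
# 0.000123456 ('0.000123', '0.000')
# 123456.789 ('1.23e+05', '123456.789')
```

With three significant digits a small value such as 0.000123456 keeps its informative digits, whereas three fixed decimals would round it to zero.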