set_config

skrub.set_config(use_table_report_data_ops=None, table_report_verbosity=None, max_plot_columns=None, max_association_columns=None, subsampling_seed=None, enable_subsampling=None, float_precision=None, cardinality_threshold=None, data_dir=None, eager_data_ops=None)

Set global skrub configuration.

Parameters:
use_table_report_data_ops : bool, default=None

The type of HTML representation used for dataframe previews in skrub DataOps. If None, falls back to the current configuration, which is True by default.

  • If True, TableReport will be used.

  • If False, the original Pandas or Polars dataframe display will be used.

This configuration can also be set with the SKB_USE_TABLE_REPORT_DATA_OPS environment variable.

table_report_verbosity : int, default=None

Set the level of verbosity of the TableReport. Default is 1 (print the progress bar). Refer to the TableReport documentation for more details.

max_plot_columns : int, default=None

Set the max_plot_columns argument of TableReport. Default is 30. If "all", all columns will be plotted.

This configuration can also be set with the SKB_MAX_PLOT_COLUMNS environment variable.
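As a minimal sketch of the environment-variable route, the snippet below passes SKB_MAX_PLOT_COLUMNS to a child Python process without touching the parent's environment. The child here is only a stand-in that echoes the variable; in practice it would be the script that imports skrub. Writing the value as a plain integer string ("50") is an assumption, as is the idea that the variable should be in place before skrub is imported.

```python
import os
import subprocess
import sys

# Copy the current environment and add the skrub setting for the child only.
child_env = dict(os.environ, SKB_MAX_PLOT_COLUMNS="50")

# Stand-in for a script that would import skrub: it just echoes the variable.
child_code = "import os; print(os.environ['SKB_MAX_PLOT_COLUMNS'])"

result = subprocess.run(
    [sys.executable, "-c", child_code],
    env=child_env,
    capture_output=True,
    text=True,
    check=True,
)
print(result.stdout.strip())
```

The same pattern should apply to the other SKB_* variables listed on this page, each with its own value format.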

max_association_columns : int, default=None

Set the max_association_columns argument of TableReport. Default is 30. If "all", associations will be computed for all columns.

This configuration can also be set with the SKB_MAX_ASSOCIATION_COLUMNS environment variable.

subsampling_seed : int, default=None

Set the random seed used by skrub.DataOp.skb.subsample() in skrub DataOps when how="random" is passed.

This configuration can also be set with the SKB_SUBSAMPLING_SEED environment variable.

enable_subsampling : {"default", "disable", "force"}, default=None

Control whether the subsampling configured with skrub.DataOp.skb.subsample() is applied in skrub DataOps. Default is "default".

  • If "default", subsampling follows the behavior documented for skrub.DataOp.skb.subsample().

  • If "disable", subsampling is never used, so skb.subsample becomes a no-op.

  • If "force", subsampling is used in all DataOps evaluation modes (eval(), fit_transform, etc.).

This configuration can also be set with the SKB_ENABLE_SUBSAMPLING environment variable.

float_precision : int, default=3

Control the number of significant digits shown when formatting floats. Applies overall precision rather than fixed decimal places. Default is 3.

This configuration can also be set with the SKB_FLOAT_PRECISION environment variable.
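To illustrate the difference between significant digits and fixed decimal places, here is a small sketch using Python's built-in format specifiers. The "g" and "f" specifiers are stand-ins chosen for illustration; that skrub formats floats with exactly these specifiers is an assumption, not something stated above.

```python
values = [12.3456, 0.00123456, 123.456]

# Three significant digits: precision counts from the first non-zero digit.
significant = [format(v, ".3g") for v in values]

# Three fixed decimal places: precision counts digits after the decimal point.
fixed = [format(v, ".3f") for v in values]

for v, s, f in zip(values, significant, fixed):
    print(f"{v!r:>12}  significant={s:>8}  fixed={f:>8}")
```

With significant digits, small values such as 0.00123456 keep their leading information, whereas three fixed decimals would truncate them to 0.001.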

cardinality_threshold : int, default=40

Set the cardinality_threshold argument of TableVectorizer. Controls the threshold used to warn users about high-cardinality columns in their dataset.

This configuration can also be set with the SKB_CARDINALITY_THRESHOLD environment variable.

data_dir : str or pathlib.Path, default=None

Set the data directory path for skrub datasets. If None, falls back to the current configuration.

  • If the SKB_DATA_DIRECTORY environment variable is set to an absolute path, that path will be used.

  • Otherwise, the default is ~/skrub_data.

This configuration can also be set with the SKB_DATA_DIRECTORY environment variable. The deprecated SKRUB_DATA_DIRECTORY is still supported with a deprecation warning.
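The fallback order described above can be sketched in plain Python. The helper resolve_data_dir is hypothetical and is not skrub's implementation: it only mirrors the documented rules (an explicit value wins, then an absolute SKB_DATA_DIRECTORY, then ~/skrub_data) and ignores the deprecated SKRUB_DATA_DIRECTORY variable.

```python
import os
from pathlib import Path


def resolve_data_dir(explicit=None, environ=None):
    """Hypothetical helper mirroring the documented fallback order."""
    environ = os.environ if environ is None else environ

    # 1. An explicitly passed directory takes precedence.
    if explicit is not None:
        return Path(explicit)

    # 2. The environment variable is used only if it holds an absolute path.
    env_value = environ.get("SKB_DATA_DIRECTORY")
    if env_value and Path(env_value).is_absolute():
        return Path(env_value)

    # 3. Otherwise fall back to the documented default.
    return Path.home() / "skrub_data"


print(resolve_data_dir(environ={}))
```

Passing environ as a plain dict keeps the sketch easy to try without modifying the real process environment.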

eager_data_ops : bool, default=True

Eagerly perform checks on the DataOps as soon as they are created, and compute previews if preview data is available. If disabled, those checks are delayed until the DataOp is actually used (e.g. by calling .skb.eval() or make_learner()), and previews are not computed.

This option is used to speed up the creation of large DataOps containing many nodes. It can also be useful in rare cases where a DataOp needs no inputs (for example, it relies on a hard-coded filename to load data) but we want to prevent it from computing preview results as soon as it is constructed and delay computation until we explicitly request it. For most DataOps that do need inputs (contain skrub.var() nodes), previews can also be disabled simply by not providing preview data to skrub.var().

This configuration can also be set with the SKB_EAGER_DATA_OPS environment variable.

See also

get_config

Retrieve current values for global configuration.

config_context

Context manager for global skrub configuration.

Examples

>>> from skrub import set_config
>>> set_config(use_table_report_data_ops=True)