API Reference#

This is the class and function reference of skrub. Please refer to the full user guide for further details, as the raw specifications of classes and functions may not be enough to give full guidelines on their use.

Building a pipeline#

tabular_pipeline

Get a simple machine-learning pipeline for tabular data.

tabular_learner

Get a simple machine-learning pipeline for tabular data.

TableVectorizer

Transform a dataframe to a numeric (vectorized) representation.

SelectCols

Select a subset of a DataFrame's columns.

DropCols

Drop a subset of a DataFrame's columns.

ApplyToCols

Map a transformer to columns in a dataframe.

ApplyToFrame

Apply a transformer to part of a dataframe.

Encoding a column#

StringEncoder

Generate a lightweight string encoding of a given column using tf-idf vectorization and truncated singular value decomposition (SVD).

TextEncoder

Encode string features by applying a pretrained language model downloaded from the HuggingFace Hub.

MinHashEncoder

Encode string categorical features by applying the MinHash method to n-gram decompositions of strings.
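To give a feel for the idea behind MinHash encoding, the sketch below splits a string into character n-grams and, for each of several salted hash functions, keeps the minimum hash value over those n-grams. It uses only the standard library and is not skrub's implementation: the helper names (`ngrams`, `minhash`, `agreement`), the MD5-based hashing and the choice of 3-grams are all assumptions made for this example.

```python
import hashlib


def ngrams(text, n=3):
    """Return the set of character n-grams of `text` (the whole string if it is too short)."""
    if len(text) <= n:
        return {text}
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def minhash(text, n_components=8, n=3):
    """Encode `text` as `n_components` integers, one minimum per salted hash function."""
    grams = ngrams(text, n)
    signature = []
    for salt in range(n_components):
        hashes = (
            int.from_bytes(hashlib.md5(f"{salt}:{g}".encode("utf-8")).digest()[:8], "big")
            for g in grams
        )
        signature.append(min(hashes))
    return signature


def agreement(a, b, n_components=8):
    """Fraction of signature components on which two strings agree."""
    sig_a, sig_b = minhash(a, n_components), minhash(b, n_components)
    return sum(x == y for x, y in zip(sig_a, sig_b)) / n_components


print(agreement("london", "london"))   # identical strings agree on every component
print(agreement("london", "londres"))  # the strings share the n-grams "lon" and "ond"
print(agreement("london", "paris"))    # the strings share no n-gram
```

The fraction of agreeing components estimates the Jaccard similarity of the two n-gram sets, which is why strings with similar spellings tend to receive similar encodings.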

GapEncoder

Encode string columns by constructing latent topics.

SimilarityEncoder

Encode string categories to a similarity matrix, to capture fuzziness across a few categories.

ToCategorical

Convert a string column to Categorical dtype.

DatetimeEncoder

Extract temporal features such as month, day of the week, … from a datetime column.
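As a rough illustration of what extracting temporal features means, the sketch below breaks standard-library `datetime` values into numeric parts. The function name `temporal_features`, the chosen features and the weekday numbering (Monday = 0, as in `datetime.weekday()`) are assumptions of this example; the columns produced by DatetimeEncoder may be named and numbered differently.

```python
from datetime import datetime


def temporal_features(value):
    """Break a datetime into numeric features (weekday: Monday=0 ... Sunday=6)."""
    return {
        "year": value.year,
        "month": value.month,
        "day": value.day,
        "hour": value.hour,
        "weekday": value.weekday(),
    }


# A hypothetical datetime column with two entries.
column = [datetime(2024, 3, 15, 14, 30), datetime(2024, 12, 25, 8, 0)]
rows = [temporal_features(v) for v in column]
for row in rows:
    print(row)
```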

ToDatetime

Parse datetimes represented as strings and return Datetime columns.

to_datetime

Convert DataFrame or column to Datetime dtype.

Exploring a dataframe#

TableReport

Summarize the contents of a dataframe.

patch_display

Replace the default DataFrame HTML displays with skrub.TableReport.

unpatch_display

Undo the effect of skrub.patch_display().

column_associations

Get measures of statistical associations between all pairs of columns.

Cleaning a dataframe#

SquashingScaler

Perform robust centering and scaling followed by soft clipping.
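The three steps named above can be sketched with the standard library. This is only one plausible reading: centering on the median, scaling by the interquartile range, the soft-clipping formula `z / sqrt(1 + (z / B)**2)` and the bound `B = 3` are assumptions made for illustration and may not match the exact definition used by SquashingScaler.

```python
import math
import statistics


def squash(values, max_absolute_value=3.0):
    """Center on the median, scale by the interquartile range, then soft-clip."""
    median = statistics.median(values)
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = (q3 - q1) or 1.0  # fall back to 1.0 when the quartiles coincide
    scaled = [(v - median) / iqr for v in values]
    bound = max_absolute_value
    # Smooth clipping: close to the identity near 0, saturating towards +/- bound.
    return [z / math.sqrt(1.0 + (z / bound) ** 2) for z in scaled]


data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 1000.0]
out = squash(data)
print(out)
```

Because the clipping is smooth, the order of the values is preserved and the outlier ends up just inside the bound instead of dominating the scale of the column.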

deduplicate

Deduplicate categorical data by hierarchically clustering similar strings.

Cleaner

Column-wise consistency checks and sanitization of dtypes, null values and dates.

DropUninformative

Drop column if it is found to be uninformative according to various criteria.

Joining dataframes#

Joiner

Augment features in a main table by fuzzy-joining an auxiliary table to it.

AggJoiner

Aggregate an auxiliary dataframe before joining it on a base dataframe.
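The aggregate-then-join pattern can be sketched on plain Python rows. The tables, the column names and the choice of the mean as the aggregation are hypothetical, and the code shows the general idea, not AggJoiner's interface.

```python
from collections import defaultdict
from statistics import mean

# Main table: one row per customer (hypothetical example data).
main = [
    {"customer_id": 1, "country": "FR"},
    {"customer_id": 2, "country": "DE"},
    {"customer_id": 3, "country": "IT"},
]
# Auxiliary table: several orders per customer.
orders = [
    {"customer_id": 1, "amount": 10.0},
    {"customer_id": 1, "amount": 30.0},
    {"customer_id": 2, "amount": 5.0},
]

# Step 1: aggregate the auxiliary table so that each key appears once.
amounts = defaultdict(list)
for order in orders:
    amounts[order["customer_id"]].append(order["amount"])
mean_amount = {key: mean(vals) for key, vals in amounts.items()}

# Step 2: left-join the aggregate onto the main table (None when there is no match).
joined = [{**row, "amount_mean": mean_amount.get(row["customer_id"])} for row in main]
for row in joined:
    print(row)
```

Aggregating first keeps the number of rows in the main table unchanged, which a direct one-to-many join would not.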

MultiAggJoiner

Extension of the AggJoiner to multiple auxiliary tables.

AggTarget

Aggregate a target y before joining its aggregation on a base dataframe.

InterpolationJoiner

Join with a table augmented by machine-learning predictions.

fuzzy_join

Fuzzy (approximate) join.
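To illustrate what an approximate join does, the sketch below matches misspelled keys against a lookup table with the standard library's `difflib`. The helper `fuzzy_left_join`, the example data and the `cutoff` value are hypothetical; `difflib` is swapped in here as a simple stand-in and is not claimed to be the string-matching method skrub uses.

```python
import difflib

left = [{"country": "France"}, {"country": "Germny"}, {"country": "Italia"}, {"country": "Japan"}]
capitals = {"France": "Paris", "Germany": "Berlin", "Italy": "Rome"}


def fuzzy_left_join(rows, lookup, key, cutoff=0.6):
    """Attach the best-matching lookup entry to each row, or None below `cutoff`."""
    joined = []
    for row in rows:
        match = difflib.get_close_matches(row[key], list(lookup), n=1, cutoff=cutoff)
        best = match[0] if match else None
        joined.append({**row, "matched": best, "capital": lookup.get(best)})
    return joined


for row in fuzzy_left_join(left, capitals, "country"):
    print(row)
```

Rows whose key has no sufficiently similar counterpart keep `None` in the joined columns, as in a regular left join.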

Selectors#

selectors.all

Select all columns.

selectors.any_date

Select columns that have a Date or Datetime data type.

selectors.boolean

Select columns that have a Boolean data type.

selectors.cardinality_below

Select columns whose cardinality (number of unique values) is (strictly) below threshold.

selectors.categorical

Select columns that have a Categorical (or polars Enum) data type.

selectors.cols

Select columns by name.

selectors.filter

Select columns for which predicate returns True.

selectors.filter_names

Select columns based on their name.

selectors.float

Select columns that have a floating-point data type.

selectors.glob

Select columns by name with Unix shell style 'glob' pattern.

selectors.has_nulls

Select columns that contain at least one null value.

selectors.integer

Select columns that have an integer data type.

selectors.inv

Invert a selector.

selectors.make_selector

Transform a selector, column name or list of column names into a selector.

selectors.numeric

Select columns that have a numeric data type.

selectors.regex

Select columns by name with a regular expression.

selectors.select

Apply a selector to a dataframe and return the resulting dataframe.

selectors.string

Select columns that have a String data type.
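The name-based selectors listed above can be thought of as composable predicates on column names. The following standard-library sketch mimics `glob`, `regex`, `inv` and `select` on a plain list of names. It is a toy model under stated assumptions: skrub's selectors operate on dataframes, and anchoring the regular expression at the start of the name (as `re.match` does) is an assumption of this example.

```python
import fnmatch
import re

columns = ["height_mm", "width_mm", "kind", "ID", "user_id"]


def glob(pattern):
    """Selector matching names with a Unix shell-style pattern (case-sensitive)."""
    return lambda name: fnmatch.fnmatchcase(name, pattern)


def regex(pattern):
    """Selector matching names with a regular expression anchored at the start."""
    compiled = re.compile(pattern)
    return lambda name: compiled.match(name) is not None


def inv(selector):
    """Selector matching exactly the names that `selector` rejects."""
    return lambda name: not selector(name)


def select(names, selector):
    """Return the names accepted by `selector`, preserving their order."""
    return [name for name in names if selector(name)]


print(select(columns, glob("*_mm")))       # ['height_mm', 'width_mm']
print(select(columns, regex(r".*id$")))    # ['user_id']
print(select(columns, inv(glob("*_mm"))))  # ['kind', 'ID', 'user_id']
```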

DataOps#

var

Create a skrub variable.

X

Create a skrub variable and mark it as being X.

y

Create a skrub variable and mark it as being y.

as_data_op

Create a DataOp that evaluates to the given value.

deferred

Wrap function calls in a DataOp.

DataOp

Representation of a computation that can be used to build DataOps plans and learners.
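The general idea of representing a computation instead of running it can be shown with a few lines of plain Python. The sketch below is a toy and not skrub's implementation: the `Node` class is hypothetical, and the `var` and `deferred` helpers merely borrow the names from the list above without reproducing their real signatures or behavior.

```python
class Node:
    """A toy deferred computation: either a named variable or a function call."""

    def __init__(self, name=None, func=None, args=()):
        self.name, self.func, self.args = name, func, args

    def eval(self, env):
        """Evaluate the graph, looking up variables in the `env` dictionary."""
        if self.func is None:
            return env[self.name]
        values = [a.eval(env) if isinstance(a, Node) else a for a in self.args]
        return self.func(*values)


def var(name):
    """Create a placeholder that receives its value at evaluation time."""
    return Node(name=name)


def deferred(func):
    """Wrap `func` so that calling it records a node instead of running it."""
    def wrapper(*args):
        return Node(func=func, args=args)
    return wrapper


@deferred
def total(prices, tax_rate):
    return sum(prices) * (1 + tax_rate)


plan = total(var("prices"), 0.5)            # nothing is computed yet
print(plan.eval({"prices": [10.0, 20.0]}))  # 45.0
print(plan.eval({"prices": [2.0]}))         # 3.0, same plan with different data
```

Separating the description of the computation from the data it runs on is what allows the same plan to be fitted on one dataset and replayed on another.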

choose_bool

A choice between True and False.

choose_float

A choice of floating-point numbers from a numeric range.

choose_int

A choice of integers from a numeric range.

choose_from

A choice among several possible outcomes.

optional

A choice between value and None.
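When every choice has a finite list of outcomes, the combinations explored by a grid search are simply the Cartesian product of those lists. The sketch below enumerates them with `itertools.product`; the option names, the `iter_grid` helper and the dictionary layout are hypothetical and do not reflect how skrub stores choices.

```python
import itertools

# Hypothetical search space: each key maps to the outcomes of one choice.
choices = {
    "scale": [True, False],       # a choice between True and False
    "n_components": [10, 30],     # a choice among several possible outcomes
    "imputer": ["median", None],  # a choice between a value and None
}


def iter_grid(space):
    """Yield one dictionary per combination of outcomes, in key order."""
    names = list(space)
    for combo in itertools.product(*(space[name] for name in names)):
        yield dict(zip(names, combo))


grid = list(iter_grid(choices))
print(len(grid))  # 8 combinations: 2 * 2 * 2
print(grid[0])
```

Choices drawn from a numeric range have no finite list of outcomes, so they are typically sampled instead of enumerated, which is the usual motivation for randomized search.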

cross_validate

Cross-validate a learner built from a DataOp.

eval_mode

Return the mode in which the DataOp is currently being evaluated.

DataOp.skb.apply

Apply a scikit-learn estimator to a dataframe or numpy array.

DataOp.skb.apply_func

Apply the given function.

DataOp.skb.clone

Get an independent clone of the DataOp.

DataOp.skb.concat

Concatenate dataframes vertically or horizontally.

DataOp.skb.cross_validate

Cross-validate the DataOp plan.

DataOp.skb.describe_defaults

Describe the hyper-parameters used by the default learner.

DataOp.skb.describe_param_grid

Describe the hyper-parameters extracted from choices in the DataOp.

DataOp.skb.describe_steps

Get a text representation of the computation graph.

DataOp.skb.draw_graph

Get an SVG string representing the computation graph.

DataOp.skb.drop

Drop some columns.

DataOp.skb.eval

Evaluate the DataOp.

DataOp.skb.freeze_after_fit

Freeze the result during learner fitting.

DataOp.skb.full_report

Generate a full report of the DataOp's evaluation.

DataOp.skb.get_data

Collect the values of the variables contained in the DataOp.

DataOp.skb.make_learner

Get a skrub learner for this DataOp.

DataOp.skb.make_grid_search

Find the best parameters with grid search.

DataOp.skb.make_randomized_search

Find the best parameters with randomized search.

DataOp.skb.if_else

Create a conditional DataOp.

DataOp.skb.iter_learners_grid

Get learners with different parameter combinations.

DataOp.skb.iter_learners_randomized

Get learners with different parameter combinations.

DataOp.skb.mark_as_X

Mark this DataOp as being the X table.

DataOp.skb.mark_as_y

Mark this DataOp as being the y table.

DataOp.skb.match

Select based on the value of a DataOp.

DataOp.skb.preview

Get the value computed for previews (shown when printing the DataOp).

DataOp.skb.select

Select a subset of columns.

DataOp.skb.set_description

Give a description to this DataOp.

DataOp.skb.set_name

Give a name to this DataOp.

DataOp.skb.subsample

Configure subsampling of a dataframe or numpy array.

DataOp.skb.train_test_split

Split an environment into training and testing environments.

DataOp.skb.description

A user-defined description or comment about the DataOp.

DataOp.skb.is_X

Whether this DataOp has been marked with skb.mark_as_X().

DataOp.skb.is_y

Whether this DataOp has been marked with skb.mark_as_y().

DataOp.skb.name

A user-chosen name for the DataOp.

DataOp.skb.applied_estimator

Retrieve the estimator applied in the previous step, as a DataOp.

SkrubLearner

Learner that evaluates a skrub DataOp.

ParamSearch

Learner that evaluates a skrub DataOp with hyperparameter tuning.

Configuration#

get_config

Retrieve current values for configuration set by set_config().

set_config

Set global skrub configuration.

config_context

Context manager for global skrub configuration.
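The get / set / context-manager trio is a common pattern for global settings. The sketch below reproduces the pattern with the standard library only; the `toy_` functions and the option names (`verbose`, `max_rows`) are hypothetical and say nothing about which options skrub actually exposes.

```python
from contextlib import contextmanager

# Hypothetical option names and defaults, for illustration only.
_config = {"verbose": False, "max_rows": 30}


def toy_get_config():
    """Return a copy of the current settings."""
    return dict(_config)


def toy_set_config(**options):
    """Update the global settings in place, rejecting unknown option names."""
    unknown = set(options) - set(_config)
    if unknown:
        raise KeyError(f"unknown option(s): {sorted(unknown)}")
    _config.update(options)


@contextmanager
def toy_config_context(**options):
    """Apply settings inside a `with` block and restore the old ones afterwards."""
    saved = toy_get_config()
    toy_set_config(**options)
    try:
        yield
    finally:
        _config.clear()
        _config.update(saved)


with toy_config_context(verbose=True):
    inside = toy_get_config()["verbose"]
after = toy_get_config()["verbose"]
print(inside, after)  # True False
```

Restoring the saved settings in a `finally` clause ensures the temporary values do not leak out of the block even if an exception is raised inside it.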

Datasets#

datasets.fetch_bike_sharing

Fetch the bike sharing dataset (regression), available at skrub-data/skrub-data-files.

datasets.fetch_country_happiness

Fetch the happiness index dataset (regression), available at skrub-data/skrub-data-files.

datasets.fetch_credit_fraud

Fetch the credit fraud dataset (classification), available at skrub-data/skrub-data-files.

datasets.fetch_drug_directory

Fetch the drug directory dataset (classification), available at skrub-data/skrub-data-files.

datasets.fetch_employee_salaries

Fetch the employee salaries dataset (regression), available at skrub-data/skrub-data-files.

datasets.fetch_flight_delays

Fetch the flight delays dataset (regression), available at skrub-data/skrub-data-files.

datasets.fetch_ken_embeddings

Download Wikipedia embeddings by type.

datasets.fetch_ken_table_aliases

Get the supported aliases of embedded KEN entities tables.

datasets.fetch_ken_types

Helper function to search for KEN entity types.

datasets.fetch_medical_charge

Fetch the medical charge dataset (regression), available at skrub-data/skrub-data-files.

datasets.fetch_midwest_survey

Fetch the midwest survey dataset (classification), available at skrub-data/skrub-data-files.

datasets.fetch_movielens

Fetch the movielens dataset (regression), available at skrub-data/skrub-data-files.

datasets.fetch_open_payments

Fetch the open payments dataset (classification), available at skrub-data/skrub-data-files.

datasets.fetch_toxicity

Fetch the toxicity dataset (classification), available at skrub-data/skrub-data-files.

datasets.fetch_traffic_violations

Fetch the traffic violations dataset (classification), available at skrub-data/skrub-data-files.

datasets.fetch_videogame_sales

Fetch the videogame sales dataset (regression), available at skrub-data/skrub-data-files.

datasets.get_data_dir

Return the directory in which skrub looks for data.

datasets.make_deduplication_data

Duplicate examples while introducing spelling mistakes.

datasets.toy_orders

Create a toy dataframe and corresponding targets for examples.