API Reference#

This is the class and function reference for skrub. For guidance on usage, refer to the full user guide: the raw specifications of classes and functions may not be enough on their own.

Building a pipeline#

tabular_learner

Get a simple machine-learning pipeline for tabular data.

TableVectorizer

Transform a dataframe to a numeric (vectorized) representation.

Cleaner

Column-wise consistency checks and sanitization of dtypes, null values and dates.

SelectCols

Select a subset of a DataFrame's columns.

DropCols

Drop a subset of a DataFrame's columns.

DropUninformative

Drop a column if it is found to be uninformative according to various criteria.
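To make the idea behind DropUninformative concrete, here is a minimal plain-Python sketch, not skrub's implementation or API. It assumes two criteria (a column that is entirely null, and a column that holds a single repeated value); the helper name `drop_uninformative`, its `drop_constant` parameter, and the dict-of-lists table are hypothetical and chosen only for illustration.

```python
def drop_uninformative(table, drop_constant=True):
    """Return a copy of `table` without all-null (and optionally constant) columns.

    `table` is a plain dict mapping column names to lists of values; None marks
    a null. Both criteria are assumptions made for this illustration.
    """
    kept = {}
    for name, values in table.items():
        non_null = [v for v in values if v is not None]
        if not non_null:
            continue  # every value is null
        is_constant = len(non_null) == len(values) and len(set(non_null)) == 1
        if drop_constant and is_constant:
            continue  # a single repeated value carries no information
        kept[name] = list(values)
    return kept


table = {
    "city": ["Paris", "London", "Rome"],
    "region": ["EU", "EU", "EU"],
    "notes": [None, None, None],
}
cleaned = drop_uninformative(table)
print(list(cleaned))
```

With `drop_constant=False`, only the all-null column is removed in this sketch.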

Encoding a column#

StringEncoder

Generate a lightweight string encoding of a given column using tf-idf vectorization and truncated singular value decomposition (SVD).

TextEncoder

Encode string features by applying a pretrained language model downloaded from the HuggingFace Hub.

MinHashEncoder

Encode string categorical features by applying the MinHash method to n-gram decompositions of strings.

GapEncoder

Encode string columns by constructing latent topics.

SimilarityEncoder

Encode string categories to a similarity matrix, to capture fuzziness across a few categories.

ToCategorical

Convert a string column to Categorical dtype.

DatetimeEncoder

Extract temporal features such as month, day of the week, … from a datetime column.

ToDatetime

Parse datetimes represented as strings and return Datetime columns.

to_datetime

Convert DataFrame or column to Datetime dtype.
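The MinHashEncoder entry above mentions applying MinHash to n-gram decompositions of strings. The sketch below illustrates that general technique with the standard library only; it is not skrub's implementation. The function names, the space padding, and the use of salted BLAKE2 hashes are assumptions made for the example.

```python
import hashlib


def char_ngrams(text, n=3):
    """Set of character n-grams of `text`, padded with one space on each side."""
    padded = f" {text} "
    if len(padded) <= n:
        return {padded}
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def hash_ngram(gram, seed):
    # A different salt per seed gives a family of distinct hash functions.
    digest = hashlib.blake2b(
        gram.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(text, n_components=8, n=3):
    """One component per hash function: the minimum hash over all n-grams."""
    grams = char_ngrams(text, n)
    return [min(hash_ngram(g, seed) for g in grams) for seed in range(n_components)]


def estimated_similarity(sig_a, sig_b):
    """Fraction of equal components, an estimate of n-gram Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


sig_london = minhash_signature("london")
sig_londres = minhash_signature("londres")
print(estimated_similarity(sig_london, sig_londres))
```

Strings that share many n-grams tend to agree on more signature components, which is what makes the encoding useful for fuzzy categorical values.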

Exploring a dataframe#

TableReport

Summarize the contents of a dataframe.

patch_display

Replace the default DataFrame HTML displays with skrub.TableReport.

unpatch_display

Undo the effect of skrub.patch_display().

column_associations

Get measures of statistical associations between all pairs of columns.
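As an illustration of what a measure of statistical association between two columns can look like, the sketch below computes Cramér's V for two categorical columns in plain Python. It assumes Cramér's V as the measure and uses the uncorrected textbook formula; the function name `cramers_v` is hypothetical, and this is not a reproduction of how column_associations is implemented.

```python
import math
from collections import Counter


def cramers_v(xs, ys):
    """Uncorrected Cramér's V between two equally long lists of categories."""
    n = len(xs)
    pair_counts = Counter(zip(xs, ys))
    x_counts = Counter(xs)
    y_counts = Counter(ys)
    # Chi-squared statistic over the full contingency table.
    chi2 = 0.0
    for x, nx in x_counts.items():
        for y, ny in y_counts.items():
            expected = nx * ny / n
            observed = pair_counts.get((x, y), 0)
            chi2 += (observed - expected) ** 2 / expected
    dof = min(len(x_counts), len(y_counts)) - 1
    if dof == 0:
        return 0.0  # a constant column cannot be associated with anything
    return math.sqrt(chi2 / (n * dof))


color = ["red", "red", "blue", "blue"]
shade = ["warm", "warm", "cold", "cold"]   # fully determined by color
parity = ["odd", "even", "odd", "even"]    # unrelated to color

print(cramers_v(color, shade), cramers_v(color, parity))
```

The value ranges from 0 (no association) to 1 (one column determines the other).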

Cleaning a dataframe#

deduplicate

Deduplicate categorical data by hierarchically clustering similar strings.
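The idea of deduplicating by grouping similar strings can be sketched with the standard library. This is a simplified stand-in, not skrub's algorithm: it swaps hierarchical clustering for threshold-based single-linkage grouping using `difflib.SequenceMatcher`, and the function name `toy_deduplicate` and the `threshold` value are assumptions.

```python
from collections import Counter
from difflib import SequenceMatcher


def toy_deduplicate(values, threshold=0.75):
    """Replace each value by the most frequent member of its similarity group."""
    unique = sorted(set(values))
    parent = {u: u for u in unique}

    def find(u):
        # Union-find lookup with path compression.
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    # Link every pair of strings that is similar enough (single linkage).
    for i, a in enumerate(unique):
        for b in unique[i + 1:]:
            if SequenceMatcher(None, a, b).ratio() >= threshold:
                parent[find(a)] = find(b)

    counts = Counter(values)
    groups = {}
    for u in unique:
        groups.setdefault(find(u), []).append(u)

    canonical = {}
    for members in groups.values():
        best = max(members, key=lambda m: counts[m])
        for m in members:
            canonical[m] = best
    return [canonical[v] for v in values]


raw = ["black", "black", "blakc", "white", "whit", "white"]
print(toy_deduplicate(raw))
```

Misspelled variants collapse onto the most common spelling in their group, while unrelated categories stay separate.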

Joining dataframes#

Joiner

Augment features in a main table by fuzzy-joining an auxiliary table to it.

AggJoiner

Aggregate an auxiliary dataframe before joining it on a base dataframe.

MultiAggJoiner

Extension of the AggJoiner to multiple auxiliary tables.

AggTarget

Aggregate a target y before joining its aggregation on a base dataframe.

InterpolationJoiner

Join with a table augmented by machine-learning predictions.

fuzzy_join

Fuzzy (approximate) join.
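A fuzzy join matches rows whose keys are close rather than identical. The sketch below shows the concept on lists of dicts with `difflib.get_close_matches`; it is not skrub's matching strategy, and `toy_fuzzy_join`, its `cutoff` parameter, and the sample rows are hypothetical.

```python
from difflib import get_close_matches


def toy_fuzzy_join(main_rows, aux_rows, key, cutoff=0.6):
    """Left-join `aux_rows` onto `main_rows` using approximate key matching."""
    aux_by_key = {row[key]: row for row in aux_rows}
    aux_columns = [c for row in aux_rows for c in row if c != key]
    aux_columns = list(dict.fromkeys(aux_columns))  # unique, order preserved

    joined = []
    for row in main_rows:
        # Best candidate key in the auxiliary table, if similar enough.
        match = get_close_matches(row[key], list(aux_by_key), n=1, cutoff=cutoff)
        matched = aux_by_key[match[0]] if match else None
        merged = dict(row)
        for col in aux_columns:
            merged[col] = matched.get(col) if matched else None
        joined.append(merged)
    return joined


main = [
    {"country": "Untied States", "visits": 3},
    {"country": "Frnace", "visits": 5},
    {"country": "Peru", "visits": 2},
]
aux = [
    {"country": "United States", "capital": "Washington, D.C."},
    {"country": "France", "capital": "Paris"},
]
for row in toy_fuzzy_join(main, aux, key="country"):
    print(row)
```

Rows with no sufficiently close key keep their original columns and receive `None` for the auxiliary ones.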

Selectors#

selectors.all

Select all columns.

selectors.any_date

Select columns that have a Date or Datetime data type.

selectors.boolean

Select columns that have a Boolean data type.

selectors.cardinality_below

Select columns whose cardinality (number of unique values) is (strictly) below threshold.

selectors.categorical

Select columns that have a Categorical (or polars Enum) data type.

selectors.cols

Select columns by name.

selectors.filter

Select columns for which predicate returns True.

selectors.filter_names

Select columns based on their name.

selectors.float

Select columns that have a floating-point data type.

selectors.glob

Select columns by name with a Unix shell-style 'glob' pattern.

selectors.has_nulls

Select columns that contain at least one null value.

selectors.integer

Select columns that have an integer data type.

selectors.inv

Invert a selector.

selectors.make_selector

Transform a selector, column name or list of column names into a selector.

selectors.numeric

Select columns that have a numeric data type.

selectors.regex

Select columns by name with a regular expression.

selectors.select

Apply a selector to a dataframe and return the resulting dataframe.

selectors.string

Select columns that have a String data type.
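The name-based selectors listed above (glob, regex, inv) can be pictured as predicates over column names. The plain-Python sketch below mimics that behavior on a list of names; it does not use skrub, the helper names are hypothetical, and it assumes the regular expression is matched from the start of the name, as `re.match` does.

```python
import re
from fnmatch import fnmatchcase


def glob_select(columns, pattern):
    """Keep names matching a Unix shell-style pattern (case-sensitive)."""
    return [c for c in columns if fnmatchcase(c, pattern)]


def regex_select(columns, pattern):
    """Keep names for which the pattern matches at the start of the name."""
    return [c for c in columns if re.match(pattern, c)]


def invert(columns, selected):
    """Keep the names that a previous selection left out."""
    return [c for c in columns if c not in selected]


columns = ["height_mm", "width_mm", "kind", "ID"]

millimeters = glob_select(columns, "*_mm")
lowercase_only = regex_select(columns, r"[a-z]+$")
everything_else = invert(columns, millimeters)

print(millimeters, lowercase_only, everything_else)
```

Selecting by pattern keeps pipelines robust when columns are added or reordered, since no positional indices are involved.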

Expressions#

var

Create a skrub variable.

X

Create a skrub variable and mark it as being X.

y

Create a skrub variable and mark it as being y.

as_expr

Create an expression Expr that evaluates to the given value.

deferred

Wrap function calls in an expression Expr.

Expr

Representation of a computation that can be used to build ML pipelines.

choose_bool

A choice between True and False.

choose_float

A choice of floating-point numbers from a numeric range.

choose_int

A choice of integers from a numeric range.

choose_from

A choice among several possible outcomes.

optional

A choice between value and None.

cross_validate

Cross-validate a pipeline built from an expression.

eval_mode

Return the mode in which the expression is currently being evaluated.

Expr.skb.apply

Apply a scikit-learn estimator to a dataframe or numpy array.

Expr.skb.apply_func

Apply the given function.

Expr.skb.clone

Get an independent clone of the expression.

Expr.skb.concat

Concatenate dataframes vertically or horizontally.

Expr.skb.cross_validate

Cross-validate the expression.

Expr.skb.describe_defaults

Describe the hyper-parameters used by the default pipeline.

Expr.skb.describe_param_grid

Describe the hyper-parameters extracted from choices in the expression.

Expr.skb.describe_steps

Get a text representation of the computation graph.

Expr.skb.draw_graph

Get an SVG string representing the computation graph.

Expr.skb.drop

Drop some columns.

Expr.skb.eval

Evaluate the expression.

Expr.skb.freeze_after_fit

Freeze the result during pipeline fitting.

Expr.skb.full_report

Generate a full report of the expression's evaluation.

Expr.skb.get_data

Collect the values of the variables contained in the expression.

Expr.skb.get_pipeline

Get a skrub pipeline for this expression.

Expr.skb.get_grid_search

Find the best parameters with grid search.

Expr.skb.get_randomized_search

Find the best parameters with randomized search.

Expr.skb.if_else

Create a conditional expression.

Expr.skb.iter_pipelines_grid

Get pipelines with different parameter combinations.

Expr.skb.iter_pipelines_randomized

Get pipelines with different parameter combinations.

Expr.skb.mark_as_X

Mark this expression as being the X table.

Expr.skb.mark_as_y

Mark this expression as being the y table.

Expr.skb.match

Select based on the value of an expression.

Expr.skb.preview

Get the value computed for previews (shown when printing the expression).

Expr.skb.select

Select a subset of columns.

Expr.skb.set_description

Give a description to this expression.

Expr.skb.set_name

Give a name to this expression.

Expr.skb.subsample

Configure subsampling of a dataframe or numpy array.

Expr.skb.train_test_split

Split an environment into training and testing environments.

Expr.skb.description

A user-defined description or comment about the expression.

Expr.skb.is_X

Whether this expression has been marked with .skb.mark_as_X().

Expr.skb.is_y

Whether this expression has been marked with .skb.mark_as_y().

Expr.skb.name

A user-chosen name for the expression.

Expr.skb.applied_estimator

Retrieve the estimator applied in the previous step, as an expression.

SkrubPipeline

Pipeline that evaluates a skrub expression.

ParamSearch

Pipeline that evaluates a skrub expression with hyperparameter tuning.
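The entries in this section revolve around one idea: an expression records a computation instead of running it, and is evaluated later once its variables are bound to values. The toy classes below illustrate that deferred-evaluation pattern in plain Python. They are hypothetical (`ToyExpr`, `ToyVar`, `toy_deferred`) and far simpler than skrub's objects; they only show the concept behind var, deferred, and evaluation.

```python
import operator


class ToyExpr:
    """Hypothetical node of a computation graph: a function and its inputs."""

    def __init__(self, func, *args):
        self.func = func
        self.args = args

    def __add__(self, other):
        return ToyExpr(operator.add, self, other)

    def __mul__(self, other):
        return ToyExpr(operator.mul, self, other)

    def eval(self, env):
        # Evaluate inputs first (recursively), then apply this node's function.
        values = [a.eval(env) if isinstance(a, ToyExpr) else a for a in self.args]
        return self.func(*values)


class ToyVar(ToyExpr):
    """Hypothetical named input whose value is supplied at evaluation time."""

    def __init__(self, name):
        super().__init__(None)
        self.name = name

    def eval(self, env):
        return env[self.name]


def toy_deferred(func):
    """Wrap `func` so that calling it builds a node instead of running it."""
    def wrapper(*args):
        return ToyExpr(func, *args)
    return wrapper


a, b = ToyVar("a"), ToyVar("b")
expr = (a + b) * 2                     # nothing is computed yet
shout = toy_deferred(str.upper)(ToyVar("name"))

print(expr.eval({"a": 1, "b": 2}))     # same graph, first set of inputs
print(expr.eval({"a": 10, "b": 5}))    # same graph, different inputs
print(shout.eval({"name": "skrub"}))
```

Because the graph is kept separate from the data, the same expression can be re-evaluated on new inputs, which is the property that lets such a graph be turned into a reusable pipeline.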

Datasets#

datasets.fetch_bike_sharing

Fetch the bike sharing dataset (regression), available at skrub-data/skrub-data-files.

datasets.fetch_country_happiness

Fetch the happiness index dataset (regression), available at skrub-data/skrub-data-files.

datasets.fetch_credit_fraud

Fetch the credit fraud dataset (classification), available at skrub-data/skrub-data-files.

datasets.fetch_drug_directory

Fetch the drug directory dataset (classification), available at skrub-data/skrub-data-files.

datasets.fetch_employee_salaries

Fetch the employee salaries dataset (regression), available at skrub-data/skrub-data-files.

datasets.fetch_flight_delays

Fetch the flight delays dataset (regression), available at skrub-data/skrub-data-files.

datasets.fetch_ken_embeddings

Download Wikipedia embeddings by type.

datasets.fetch_ken_table_aliases

Get the supported aliases of embedded KEN entities tables.

datasets.fetch_ken_types

Helper function to search for KEN entity types.

datasets.fetch_medical_charge

Fetch the medical charge dataset (regression), available at skrub-data/skrub-data-files.

datasets.fetch_midwest_survey

Fetch the midwest survey dataset (classification), available at skrub-data/skrub-data-files.

datasets.fetch_movielens

Fetch the movielens dataset (regression), available at skrub-data/skrub-data-files.

datasets.fetch_open_payments

Fetch the open payments dataset (classification), available at skrub-data/skrub-data-files.

datasets.fetch_toxicity

Fetch the toxicity dataset (classification), available at skrub-data/skrub-data-files.

datasets.fetch_traffic_violations

Fetch the traffic violations dataset (classification), available at skrub-data/skrub-data-files.

datasets.fetch_videogame_sales

Fetch the videogame sales dataset (regression), available at skrub-data/skrub-data-files.

datasets.get_data_dir

Return the directory in which skrub looks for data.

datasets.make_deduplication_data

Duplicate examples, introducing spelling mistakes.