API#

This page lists all available functions and classes of skrub.

Joining dataframes#

Joiner

Augment features in a main table by fuzzy-joining an auxiliary table to it.

AggJoiner

Aggregate an auxiliary dataframe before joining it on a base dataframe.

MultiAggJoiner

Extension of the AggJoiner to multiple auxiliary tables.

AggTarget

Aggregate a target y before joining its aggregation on a base dataframe.

InterpolationJoiner

Join with a table augmented by machine-learning predictions.

fuzzy_join

Fuzzy (approximate) join.
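
To make the idea of an approximate join concrete, the following is a minimal standard-library sketch that matches each key of a main table to its closest key in an auxiliary table. It does not use the skrub API and is not how skrub implements fuzzy joining; the helper `approx_join`, its `cutoff` parameter, and the sample rows are hypothetical and exist only for illustration.

```python
import difflib


def approx_join(left, right, key, cutoff=0.6):
    """Left-join two lists of dicts on approximately equal string keys.

    Hypothetical helper for illustration only; not part of skrub.
    """
    right_by_key = {row[key]: row for row in right}
    candidates = list(right_by_key)
    joined = []
    for row in left:
        # Best candidate by difflib similarity ratio, if any reaches the cutoff.
        match = difflib.get_close_matches(row[key], candidates, n=1, cutoff=cutoff)
        merged = dict(row)
        if match:
            for name, value in right_by_key[match[0]].items():
                if name != key:
                    merged[name] = value
        joined.append(merged)
    return joined


# Made-up rows with misspelled keys in the main table.
main = [
    {"country": "Untied States", "gdp": 1},
    {"country": "Frnace", "gdp": 2},
    {"country": "Atlantis", "gdp": 3},
]
aux = [
    {"country": "United States", "capital": "Washington"},
    {"country": "France", "capital": "Paris"},
]
result = approx_join(main, aux, "country")
```

In this sketch, rows whose key has no sufficiently close counterpart (here "Atlantis") are kept without the auxiliary columns.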

Encoding a column#

GapEncoder

Encode string columns by constructing latent topics.

MinHashEncoder

Encode string categorical features by applying the MinHash method to n-gram decompositions of strings.
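
The principle of applying MinHash to n-gram decompositions can be sketched with the standard library alone. This is an illustration of the general technique, not skrub's implementation: the helper names, the choice of character 3-grams, the 8 components, and the use of salted `blake2b` as the hash family are all assumptions made for the example.

```python
import hashlib


def char_ngrams(text, n=3):
    """Return the set of character n-grams of `text` (the whole string if it is short)."""
    if len(text) <= n:
        return {text}
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def seeded_hash(ngram, seed):
    """64-bit hash of an n-gram; each seed selects a different hash function."""
    digest = hashlib.blake2b(
        ngram.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(text, n_components=8, n=3):
    """For each hash function, keep the minimum hash value over all n-grams."""
    grams = char_ngrams(text, n)
    return [min(seeded_hash(g, seed) for g in grams) for seed in range(n_components)]


def estimated_similarity(a, b, n_components=8):
    """Fraction of equal components: an estimate of the Jaccard similarity of the n-gram sets."""
    sig_a = minhash_signature(a, n_components)
    sig_b = minhash_signature(b, n_components)
    return sum(x == y for x, y in zip(sig_a, sig_b)) / n_components
```

Strings that share many n-grams tend to agree on many signature components, so the signature can serve as a fixed-size numeric encoding that keeps similar strings close to each other.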

SimilarityEncoder

Encode string categories to a similarity matrix, to capture fuzziness across a few categories.

DatetimeEncoder

Extract temporal features such as month, day of the week, … from a datetime column.
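
As a rough picture of what extracting temporal features means, the sketch below derives a few numeric fields from a single timestamp using only the standard library. The function `temporal_features` and the feature names are hypothetical; they are not the columns produced by skrub.

```python
from datetime import datetime


def temporal_features(ts):
    """Split a datetime into numeric features (illustrative names, not skrub's)."""
    return {
        "year": ts.year,
        "month": ts.month,
        "day": ts.day,
        "hour": ts.hour,
        # ISO convention: Monday is 1, Sunday is 7.
        "weekday": ts.isoweekday(),
    }


features = temporal_features(datetime(2024, 3, 15, 14, 30))
```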

ToCategorical

Convert a string column to Categorical dtype.

ToDatetime

Parse datetimes represented as strings and return Datetime columns.

StringEncoder

Generate a lightweight string encoding of a given column using tf-idf vectorization and truncated singular value decomposition (SVD).

to_datetime

Convert DataFrame or column to Datetime dtype.

Deep Learning#

These encoders require installing additional dependencies around torch. See the “deep learning dependencies” section in the Install guide for more details.

TextEncoder

Encode string features by applying a pretrained language model downloaded from the HuggingFace Hub.

Building a pipeline#

TableVectorizer

Transform a dataframe to a numeric (vectorized) representation.

Cleaner

Column-wise consistency checks and sanitization, e.g. of null values or dates.

SelectCols

Select a subset of a DataFrame's columns.

DropCols

Drop a subset of a DataFrame's columns.

DropUninformative

Drop column if it is found to be uninformative according to various criteria.

tabular_learner

Get a simple machine-learning pipeline for tabular data.

Skrub Expressions#

The tabular_learner provides a pre-defined, default pipeline for datasets that contain a simple table. For more control, or to build pipelines for more complex datasets, use the skrub expressions.

as_expr

Create an expression Expr that evaluates to the given value.

choose_bool

Construct a choice between False and True.

choose_float

Construct a choice of floating-point numbers from a numeric range.

choose_from

Construct a choice among several possible outcomes.

choose_int

Construct a choice of integers from a numeric range.

cross_validate

Cross-validate a pipeline built from an expression.

deferred

Wrap function calls in an expression Expr.

eval_mode

Return the mode in which the expression is currently being evaluated.

optional

Construct a choice between a value and None.

var

Create a skrub variable.

X

Create a skrub variable and mark it as being X.

y

Create a skrub variable and mark it as being y.

Expr

Representation of a computation that can be used to build ML pipelines.

Expr.skb.apply

Apply a scikit-learn estimator to a dataframe or numpy array.

Expr.skb.clone

Get an independent clone of the expression.

Expr.skb.concat

Concatenate dataframes vertically or horizontally.

Expr.skb.cross_validate

Cross-validate the expression.

Expr.skb.describe_param_grid

Describe the hyper-parameters extracted from choices in the expression.

Expr.skb.describe_steps

Get a text representation of the computation graph.

Expr.skb.draw_graph

Get an SVG string representing the computation graph.

Expr.skb.drop

Drop some columns.

Expr.skb.eval

Evaluate the expression.

Expr.skb.freeze_after_fit

Freeze the result during pipeline fitting.

Expr.skb.full_report

Generate a full report of the expression's evaluation.

Expr.skb.get_data

Collect the values of the variables contained in the expression.

Expr.skb.get_pipeline

Get a skrub pipeline for this expression.

Expr.skb.get_grid_search

Find the best parameters with grid search.

Expr.skb.get_randomized_search

Find the best parameters with randomized search.

Expr.skb.if_else

Create a conditional expression.

Expr.skb.mark_as_X

Mark this expression as being the X table.

Expr.skb.mark_as_y

Mark this expression as being the y table.

Expr.skb.match

Select based on the value of an expression.

Expr.skb.select

Select a subset of columns.

Expr.skb.set_description

Give a description to this expression.

Expr.skb.set_name

Give a name to this expression.

Expr.skb.train_test_split

Split an environment into training and testing environments.

Expr.skb.description

A user-defined description or comment about the expression.

Expr.skb.is_X

Whether this expression has been marked with .skb.mark_as_X().

Expr.skb.is_y

Whether this expression has been marked with .skb.mark_as_y().

Expr.skb.name

A user-chosen name for the expression.

Expr.skb.applied_estimator

Retrieve the estimator applied in the previous step, as an expression.

SkrubPipeline

Pipeline that evaluates a skrub expression.

ParamSearch

Pipeline that evaluates a skrub expression with hyperparameter tuning.

Selecting columns in a DataFrame#

The skrub selectors provide a flexible way to specify the columns on which a transformation should be applied. They are meant to be used for the cols argument of Expr.skb.apply(), Expr.skb.select(), Expr.skb.drop(), SelectCols or DropCols.

selectors.all

Select all columns.

selectors.any_date

Select columns that have a Date or Datetime data type.

selectors.boolean

Select columns that have a Boolean data type.

selectors.cardinality_below

Select columns whose cardinality (number of unique values) is (strictly) below threshold.

selectors.categorical

Select columns that have a Categorical (or polars Enum) data type.

selectors.cols

Select columns by name.

selectors.Filter

selectors.filter

Select columns for which predicate returns True.

selectors.filter_names

Select columns based on their name.

selectors.float

Select columns that have a floating-point data type.

selectors.glob

Select columns by name with Unix shell style 'glob' pattern.
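
For readers unfamiliar with Unix shell style patterns, the sketch below shows how such a pattern picks column names, using the standard-library `fnmatch` module. It does not call skrub; the helper `select_glob` and the column names are hypothetical, and case-sensitive matching is an assumption of this example.

```python
import fnmatch


def select_glob(names, pattern):
    """Return the names matching a shell-style pattern (case-sensitive)."""
    return [name for name in names if fnmatch.fnmatchcase(name, pattern)]


columns = ["user_id", "order_id", "price", "created_at"]
id_columns = select_glob(columns, "*_id")      # ['user_id', 'order_id']
p_columns = select_glob(columns, "p*")         # ['price']
```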

selectors.has_nulls

Select columns that contain at least one null value.

selectors.integer

Select columns that have an integer data type.

selectors.inv

Invert a selector.

selectors.make_selector

Transform a selector, column name or list of column names into a selector.

selectors.NameFilter

selectors.numeric

Select columns that have a numeric data type.

selectors.regex

Select columns by name with a regular expression.

selectors.select

Apply a selector to a dataframe and return the resulting dataframe.

selectors.Selector

selectors.string

Select columns that have a String data type.

Generating an HTML report#

TableReport

Summarize the contents of a dataframe.

patch_display

Replace the default DataFrame HTML displays with skrub.TableReport.

unpatch_display

Undo the effect of skrub.patch_display().

column_associations

Get measures of statistical associations between all pairs of columns.

Cleaning a dataframe#

deduplicate

Deduplicate categorical data by hierarchically clustering similar strings.

Downloading a dataset#

fetch_bike_sharing

Fetch the bike sharing dataset (regression) available at skrub-data/skrub-data-files

fetch_country_happiness

Fetch the happiness index dataset (regression) available at skrub-data/skrub-data-files

fetch_credit_fraud

Fetch the credit fraud dataset (classification) available at skrub-data/skrub-data-files

fetch_drug_directory

Fetch the drug directory dataset (classification) available at skrub-data/skrub-data-files

fetch_employee_salaries

Fetch the employee salaries dataset (regression) available at skrub-data/skrub-data-files

fetch_flight_delays

Fetch the flight delays dataset (regression) available at skrub-data/skrub-data-files

fetch_ken_embeddings

Download Wikipedia embeddings by type.

fetch_ken_table_aliases

Get the supported aliases of embedded KEN entities tables.

fetch_ken_types

Helper function to search for KEN entity types.

fetch_medical_charge

Fetch the medical charge dataset (regression) available at skrub-data/skrub-data-files

fetch_midwest_survey

Fetch the midwest survey dataset (classification) available at skrub-data/skrub-data-files

fetch_movielens

Fetch the movielens dataset (regression) available at skrub-data/skrub-data-files

fetch_open_payments

Fetch the open payments dataset (classification) available at skrub-data/skrub-data-files

fetch_toxicity

Fetch the toxicity dataset (classification) available at skrub-data/skrub-data-files

fetch_traffic_violations

Fetch the traffic violations dataset (classification) available at skrub-data/skrub-data-files

fetch_videogame_sales

Fetch the videogame sales dataset (regression) available at skrub-data/skrub-data-files

get_data_dir

Return the directory in which skrub looks for data.

make_deduplication_data

Duplicates examples with spelling mistakes.