API Reference#
This is the class and function reference of skrub. Please refer to the full user guide for further details, as the raw specifications of classes and functions may not be enough to give full guidelines on their use.
Building a pipeline#
Get a simple machine-learning pipeline for tabular data. |
|
Transform a dataframe to a numeric (vectorized) representation. |
|
Column-wise consistency checks and sanitization of dtypes, null values and dates. |
|
Select a subset of a DataFrame's columns. |
|
Drop a subset of a DataFrame's columns. |
|
Drop column if it is found to be uninformative according to various criteria. |
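To make the idea of dropping uninformative columns concrete, here is a minimal pure-Python sketch. It is not skrub's implementation: the function name `drop_uninformative`, the `null_threshold` parameter and the two criteria used (constant column, fraction of nulls) are assumptions chosen for illustration, and the table is modeled as a plain dict of lists.

```python
def drop_uninformative(table, null_threshold=1.0):
    """Return a copy of `table` without constant or mostly-null columns.

    `table` maps column names to lists of values; None marks a null.
    Both criteria are illustrative assumptions, not skrub's exact rules.
    """
    kept = {}
    for name, values in table.items():
        non_null = [v for v in values if v is not None]
        null_fraction = 1 - len(non_null) / len(values) if values else 1.0
        if null_fraction >= null_threshold:
            continue  # too many nulls to be useful
        if len(set(non_null)) <= 1:
            continue  # a single distinct value carries no information
        kept[name] = values
    return kept


table = {
    "city": ["Paris", "Rome", "Oslo"],
    "country_code": ["EU", "EU", "EU"],  # constant column
    "notes": [None, None, None],         # entirely null column
}
cleaned = drop_uninformative(table)
print(list(cleaned))  # ['city']
```

Lowering `null_threshold` makes the sketch stricter about missing values; the right value depends on the dataset.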
Encoding a column#
Generate a lightweight string encoding of a given column using tf-idf vectorization and truncated singular value decomposition (SVD). |
|
Encode string features by applying a pretrained language model downloaded from the HuggingFace Hub. |
|
Encode string categorical features by applying the MinHash method to n-gram decompositions of strings. |
|
Encode string columns by constructing latent topics. |
|
Encode string categories to a similarity matrix, to capture fuzziness across a few categories. |
|
Convert a string column to Categorical dtype. |
|
Extract temporal features such as month, day of the week, … from a datetime column. |
|
Parse datetimes represented as strings and return |
|
Convert DataFrame or column to Datetime dtype. |
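The MinHash encoding of n-gram decompositions mentioned above can be sketched with the standard library alone. This is an illustration of the technique, not the code skrub runs: the helper names `ngrams` and `minhash_encode`, the use of salted BLAKE2 hashes and the default of 8 components are all assumptions.

```python
import hashlib


def ngrams(text, n=3):
    """Return the set of character n-grams of `text`, padded with spaces."""
    padded = f" {text} "
    return {padded[i:i + n] for i in range(max(1, len(padded) - n + 1))}


def minhash_encode(text, n_components=8, n=3):
    """Encode a string as `n_components` minimum hash values over its n-grams."""
    grams = ngrams(text, n)
    signature = []
    for seed in range(n_components):
        salt = str(seed).encode()  # one salted hash function per dimension
        signature.append(
            min(
                int.from_bytes(
                    hashlib.blake2b(g.encode(), digest_size=8, salt=salt).digest(),
                    "big",
                )
                for g in grams
            )
        )
    return signature


a = minhash_encode("software engineer")
b = minhash_encode("software engineers")
# Number of dimensions on which the two encodings agree.
shared = sum(x == y for x, y in zip(a, b))
print(shared, "of", len(a), "components match")
```

Two strings agree on a given component with probability equal to the Jaccard similarity of their n-gram sets, so near-duplicates tend to share most components while unrelated strings share few.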
Exploring a dataframe#
Summarize the contents of a dataframe. |
|
Replace the default DataFrame HTML displays with |
|
Undo the effect of |
|
Get measures of statistical associations between all pairs of columns. |
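As an illustration of pairwise column associations, the sketch below computes Cramér's V for every pair of categorical columns. The choice of Cramér's V as the measure, the function names and the dict-of-lists table are assumptions made for this example; no bias correction or binning of numeric columns is attempted.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def cramers_v(xs, ys):
    """Cramér's V between two equally long sequences of categorical values."""
    n = len(xs)
    x_counts, y_counts = Counter(xs), Counter(ys)
    pair_counts = Counter(zip(xs, ys))
    # Pearson chi-squared statistic of the contingency table.
    chi2 = 0.0
    for x, nx in x_counts.items():
        for y, ny in y_counts.items():
            expected = nx * ny / n
            observed = pair_counts.get((x, y), 0)
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(x_counts), len(y_counts)) - 1
    return sqrt(chi2 / (n * k)) if k > 0 else 0.0


def column_pair_associations(table):
    """Return {(col_a, col_b): Cramér's V} for every pair of columns."""
    return {
        (a, b): cramers_v(table[a], table[b])
        for a, b in combinations(table, 2)
    }


table = {
    "color": ["r", "r", "b", "b"],
    "shape": ["sq", "sq", "ci", "ci"],  # fully determined by color
    "size": ["s", "l", "s", "l"],       # independent of color
}
associations = column_pair_associations(table)
for pair, value in associations.items():
    print(pair, round(value, 3))
```

A value of 1 means one column fully determines the other, and 0 means no association was observed in the sample.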
Cleaning a dataframe#
Deduplicate categorical data by hierarchically clustering similar strings. |
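The following sketch shows one way to deduplicate strings by hierarchical clustering, using only the standard library. It is a simplified stand-in, not skrub's algorithm: the bigram Jaccard distance, single-linkage merging, the fixed `threshold` and the rule of keeping the most frequent spelling are all assumptions.

```python
from collections import Counter


def ngram_set(text, n=2):
    return {text[i:i + n] for i in range(max(1, len(text) - n + 1))}


def distance(a, b):
    """Jaccard distance between the character-bigram sets of two strings."""
    ga, gb = ngram_set(a), ngram_set(b)
    return 1 - len(ga & gb) / len(ga | gb)


def deduplicate(values, threshold=0.5):
    """Replace each value by the most frequent member of its cluster."""
    clusters = [[u] for u in dict.fromkeys(values)]
    # Agglomerative single-linkage merging: repeatedly join the two closest
    # clusters until the closest pair is farther apart than `threshold`.
    while len(clusters) > 1:
        d, i, j = min(
            (min(distance(a, b) for a in ci for b in cj), i, j)
            for i, ci in enumerate(clusters)
            for j, cj in enumerate(clusters)
            if i < j
        )
        if d > threshold:
            break
        clusters[i] += clusters.pop(j)
    counts = Counter(values)
    mapping = {}
    for cluster in clusters:
        representative = max(cluster, key=counts.__getitem__)
        for member in cluster:
            mapping[member] = representative
    return [mapping[v] for v in values]


values = ["london", "london", "londn", "paris", "pariss", "paris"]
deduplicated = deduplicate(values)
print(deduplicated)
# ['london', 'london', 'london', 'paris', 'paris', 'paris']
```

The quadratic search over cluster pairs keeps the sketch short; it is only suitable for a small number of distinct strings.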
Joining dataframes#
Augment features in a main table by fuzzy-joining an auxiliary table to it. |
|
Aggregate an auxiliary dataframe before joining it on a base dataframe. |
|
Extension of the |
|
Aggregate a target y before joining its aggregation on a base dataframe. |
|
Join with a table augmented by machine-learning predictions. |
|
Fuzzy (approximate) join. |
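A fuzzy (approximate) join matches rows whose keys are similar rather than identical. The sketch below illustrates the idea with `difflib.SequenceMatcher` from the standard library; the function `fuzzy_join`, its `min_ratio` parameter, the `_right` suffix and the list-of-dicts tables are hypothetical and do not reflect skrub's API or its matching strategy.

```python
from difflib import SequenceMatcher


def fuzzy_join(left, right, key, min_ratio=0.8):
    """Left-join two lists of dicts on approximately equal `key` strings."""
    joined = []
    for row in left:
        best, best_ratio = None, 0.0
        for candidate in right:
            ratio = SequenceMatcher(
                None, row[key].lower(), candidate[key].lower()
            ).ratio()
            if ratio > best_ratio:
                best, best_ratio = candidate, ratio
        merged = dict(row)
        # Keep the best match only if it is similar enough.
        if best is not None and best_ratio >= min_ratio:
            merged.update({f"{k}_right": v for k, v in best.items()})
        joined.append(merged)
    return joined


left = [{"country": "Germany"}, {"country": "Frnace"}, {"country": "Atlantis"}]
right = [
    {"country": "germany", "capital": "Berlin"},
    {"country": "France", "capital": "Paris"},
]
joined = fuzzy_join(left, right, key="country")
for row in joined:
    print(row)
```

Rows without a sufficiently close match are kept unchanged, which mirrors the behavior of a left join.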
Selectors#
Select all columns. |
|
Select columns that have a Date or Datetime data type. |
|
Select columns that have a Boolean data type. |
|
Select columns whose cardinality (number of unique values) is (strictly) below |
|
Select columns that have a Categorical (or polars Enum) data type. |
|
Select columns by name. |
|
Select columns for which |
|
Select columns based on their name. |
|
Select columns that have a floating-point data type. |
|
Select columns by name with Unix shell style 'glob' pattern. |
|
Select columns that contain at least one null value. |
|
Select columns that have an integer data type. |
|
Invert a selector. |
|
Transform a selector, column name or list of column names into a selector. |
|
Select columns that have a numeric data type. |
|
Select columns by name with a regular expression. |
|
Apply a selector to a dataframe and return the resulting dataframe. |
|
Select columns that have a String data type. |
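Selectors are easiest to understand as composable predicates on columns. The sketch below models only the name-based ones (exact names, glob patterns, regular expressions) and their combination by inversion, union and intersection. The `Selector` class and the helper functions are a hypothetical miniature written for this example, not skrub's implementation, and selectors based on data types are left out because they need access to the column contents.

```python
import re
from fnmatch import fnmatchcase


class Selector:
    """A predicate on column names that supports `&`, `|` and `~`."""

    def __init__(self, predicate):
        self.predicate = predicate

    def __and__(self, other):
        return Selector(lambda name: self.predicate(name) and other.predicate(name))

    def __or__(self, other):
        return Selector(lambda name: self.predicate(name) or other.predicate(name))

    def __invert__(self):
        return Selector(lambda name: not self.predicate(name))

    def expand(self, column_names):
        """Return the selected names, in their original order."""
        return [name for name in column_names if self.predicate(name)]


def cols(*names):
    return Selector(lambda name: name in names)


def glob(pattern):
    return Selector(lambda name: fnmatchcase(name, pattern))


def regex(pattern):
    # Assumption: the pattern is anchored at the start of the name (re.match).
    return Selector(lambda name: re.match(pattern, name) is not None)


columns = ["user_id", "user_name", "price_usd", "price_eur", "created_at"]
prices_and_id = (glob("price_*") | cols("user_id")).expand(columns)
not_user = (~regex(r"user_")).expand(columns)
print(prices_and_id)  # ['user_id', 'price_usd', 'price_eur']
print(not_user)       # ['price_usd', 'price_eur', 'created_at']
```

Because every selector is just a predicate, combining them never touches the data until `expand` is called on an actual list of columns.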
Expressions#
Create a skrub variable. |
|
Create a skrub variable and mark it as being |
|
Create a skrub variable and mark it as being |
|
Create an expression |
|
Wrap function calls in an expression |
Representation of a computation that can be used to build ML pipelines. |
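The notion of an expression as a recorded, not yet executed, computation can be sketched in a few lines of plain Python. The classes `Var` and `Node` and the decorator `deferred` below are hypothetical teaching aids; they only mimic the general idea of building a computation graph from variables and wrapped function calls, and say nothing about how skrub implements it.

```python
class Var:
    """A named input whose value is supplied at evaluation time."""

    def __init__(self, name):
        self.name = name

    def evaluate(self, environment):
        return environment[self.name]


class Node:
    """A recorded function call; nothing runs until `evaluate` is called."""

    def __init__(self, func, args):
        self.func, self.args = func, args

    def evaluate(self, environment):
        values = [
            a.evaluate(environment) if isinstance(a, (Node, Var)) else a
            for a in self.args
        ]
        return self.func(*values)


def deferred(func):
    """Wrap `func` so that calling it records a Node instead of running it."""
    def wrapper(*args):
        return Node(func, args)
    return wrapper


@deferred
def total(prices, quantities):
    return sum(p * q for p, q in zip(prices, quantities))


@deferred
def add_tax(amount, rate):
    return amount * (1 + rate)


# Building the expression performs no arithmetic yet.
expr = add_tax(total(Var("prices"), Var("quantities")), 0.5)
result = expr.evaluate({"prices": [2, 4], "quantities": [1, 2]})
print(result)  # 15.0
```

The same `expr` object can be evaluated again with a different environment, which is what makes a recorded computation reusable on new data.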
A choice between |
|
A choice of floating-point numbers from a numeric range. |
|
A choice of integers from a numeric range. |
|
A choice among several possible outcomes. |
|
A choice between |
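To show how choices can drive a hyperparameter search, the sketch below defines three small choice types and draws random parameter combinations from them. The class names `ChooseFrom`, `ChooseInt` and `ChooseFloat`, the `log` option and `sample_parameters` are hypothetical stand-ins written for this example, not skrub objects.

```python
import math
import random


class ChooseFrom:
    """A choice among an explicit list of outcomes."""

    def __init__(self, outcomes):
        self.outcomes = list(outcomes)

    def sample(self, rng):
        return rng.choice(self.outcomes)


class ChooseInt:
    """A choice of integers from the inclusive range [low, high]."""

    def __init__(self, low, high):
        self.low, self.high = low, high

    def sample(self, rng):
        return rng.randint(self.low, self.high)


class ChooseFloat:
    """A choice of floats from [low, high], optionally on a log scale."""

    def __init__(self, low, high, log=False):
        self.low, self.high, self.log = low, high, log

    def sample(self, rng):
        if self.log:
            return math.exp(rng.uniform(math.log(self.low), math.log(self.high)))
        return rng.uniform(self.low, self.high)


def sample_parameters(space, rng):
    """Draw one value from every choice in `space`."""
    return {name: choice.sample(rng) for name, choice in space.items()}


space = {
    "encoder": ChooseFrom(["minhash", "tfidf"]),
    "n_components": ChooseInt(10, 50),
    "alpha": ChooseFloat(0.001, 10.0, log=True),
}
rng = random.Random(0)
candidates = [sample_parameters(space, rng) for _ in range(5)]
for candidate in candidates:
    print(candidate)
```

Sampling on a log scale spreads draws evenly across orders of magnitude, which is usually preferable for regularization strengths and learning rates.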
Cross-validate a pipeline built from an expression. |
|
Return the mode in which the expression is currently being evaluated. |
Apply a scikit-learn estimator to a dataframe or numpy array. |
|
Apply the given function. |
|
Get an independent clone of the expression. |
|
Concatenate dataframes vertically or horizontally. |
|
Cross-validate the expression. |
|
Describe the hyper-parameters used by the default pipeline. |
|
Describe the hyper-parameters extracted from choices in the expression. |
|
Get a text representation of the computation graph. |
|
Get an SVG string representing the computation graph. |
|
Drop some columns. |
|
Evaluate the expression. |
|
Freeze the result during pipeline fitting. |
|
Generate a full report of the expression's evaluation. |
|
Collect the values of the variables contained in the expression. |
|
Get a skrub pipeline for this expression. |
|
Find the best parameters with grid search. |
|
Find the best parameters with randomized search. |
|
Create a conditional expression. |
|
Get pipelines with different parameter combinations. |
|
Get pipelines with different parameter combinations. |
|
Mark this expression as being the |
|
Mark this expression as being the |
|
Select based on the value of an expression. |
|
Get the value computed for previews (shown when printing the expression). |
|
Select a subset of columns. |
|
Give a description to this expression. |
|
Give a name to this expression. |
|
Configure subsampling of a dataframe or numpy array. |
|
Split an environment into training and testing environments.
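Splitting an environment can be pictured as applying the same row split to every input at once, so that features and targets stay aligned. The sketch below assumes an environment is a dict mapping variable names to equally long lists; the function `split_environment` and its parameters are hypothetical and only illustrate the idea.

```python
import random


def split_environment(environment, test_fraction=0.25, seed=0):
    """Split every value of `environment` using the same row indices."""
    n_rows = len(next(iter(environment.values())))
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = max(1, round(n_rows * test_fraction))
    test_idx = sorted(indices[:n_test])
    train_idx = sorted(indices[n_test:])

    def take(idx):
        return {
            name: [values[i] for i in idx]
            for name, values in environment.items()
        }

    return take(train_idx), take(test_idx)


environment = {"X": [[1], [2], [3], [4]], "y": [0, 1, 0, 1]}
train, test = split_environment(environment)
print(len(train["X"]), len(test["X"]))  # 3 1
```

Because one set of indices is shared by all variables, each row of `X` keeps its matching entry in `y` on both sides of the split.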
A user-defined description or comment about the expression. |
|
Whether this expression has been marked with |
|
Whether this expression has been marked with |
|
A user-chosen name for the expression. |
|
Retrieve the estimator applied in the previous step, as an expression. |
Pipeline that evaluates a skrub expression. |
|
Pipeline that evaluates a skrub expression with hyperparameter tuning. |
Datasets#
Fetch the bike sharing dataset (regression) available at skrub-data/skrub-data-files |
|
Fetch the happiness index dataset (regression) available at skrub-data/skrub-data-files |
|
Fetch the credit fraud dataset (classification) available at skrub-data/skrub-data-files |
|
Fetch the drug directory dataset (classification) available at skrub-data/skrub-data-files |
|
Fetch the employee salaries dataset (regression) available at skrub-data/skrub-data-files |
|
Fetch the flight delays dataset (regression) available at skrub-data/skrub-data-files |
|
Download Wikipedia embeddings by type. |
|
Get the supported aliases of embedded KEN entities tables. |
|
Helper function to search for KEN entity types. |
|
Fetch the medical charge dataset (regression) available at skrub-data/skrub-data-files |
|
Fetch the midwest survey dataset (classification) available at skrub-data/skrub-data-files |
|
Fetch the movielens dataset (regression) available at skrub-data/skrub-data-files |
|
Fetch the open payments dataset (classification) available at skrub-data/skrub-data-files |
|
Fetch the toxicity dataset (classification) available at skrub-data/skrub-data-files |
|
Fetch the traffic violations dataset (classification) available at skrub-data/skrub-data-files |
|
Fetch the videogame sales dataset (regression) available at skrub-data/skrub-data-files |
|
Return the directory in which skrub looks for data. |
|
Generate duplicated examples that contain spelling mistakes.
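Generating duplicates with spelling mistakes is a simple way to build test data for deduplication. The sketch below is a minimal, assumption-laden version: the names `add_typo` and `make_noisy_duplicates`, the single-character substitution model and the parameters `n_copies` and `typo_probability` are all invented for illustration.

```python
import random
import string


def add_typo(word, rng):
    """Replace one randomly chosen character by a random lowercase letter."""
    position = rng.randrange(len(word))
    letter = rng.choice(string.ascii_lowercase)
    return word[:position] + letter + word[position + 1:]


def make_noisy_duplicates(examples, n_copies=3, typo_probability=0.5, seed=0):
    """Return `n_copies` entries per example, some of them with a typo."""
    rng = random.Random(seed)
    data = []
    for example in examples:
        for _ in range(n_copies):
            if rng.random() < typo_probability:
                data.append(add_typo(example, rng))
            else:
                data.append(example)
    return data


examples = ["black", "white"]
data = make_noisy_duplicates(examples)
print(data)
```

Fixing `seed` makes the generated data reproducible, which is convenient when the output feeds a test of a deduplication routine.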