Data Preparation with skrub
Transformers
Cleaning dataframes and parsing datatypes
>>> from skrub import Cleaner
>>> import pandas as pd
>>> df = pd.DataFrame({
...     "id": [1, 2, 3],
...     "all_missing": ["", "", ""],
...     "date": ["2024-05-05", "2024-05-06", "2024-05-07"],
... })
>>> df_clean = Cleaner().fit_transform(df)
>>> df_clean
   id       date
0   1 2024-05-05
1   2 2024-05-06
2   3 2024-05-07
The Cleaner normalizes data types and missing values (such as NaN) in dataframes to ease downstream preprocessing. It performs the following steps:
Replacing strings that represent missing values with NA markers
Dropping uninformative columns (see the DropUninformative section below)
Parsing datetimes from datetime strings
Forcing consistent categorical typing
Converting columns to string, unless they have a more informative datatype (numerical, datetime, categorical)
Converting numeric dtypes to float32 with the Cleaner
By default, the Cleaner parses numeric datatypes and does not cast them to a different dtype. In some cases, it may be beneficial to have the same numeric dtype for all numeric columns to guarantee compatibility between values. The Cleaner allows conversion of numeric features to float32 by setting the numeric_dtype parameter:
>>> from skrub import Cleaner
>>> cleaner = Cleaner(numeric_dtype="float32")
Setting the dtype to float32 reduces the RAM footprint for most use cases and ensures that all missing values have the same representation. It also ensures compatibility with scikit-learn transformers.
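The two claims above (smaller memory footprint, one representation for missing values) can be illustrated without skrub. The sketch below uses plain pandas and a made-up two-column dataframe; it is not the Cleaner's internal code, only a demonstration of what a uniform float32 cast does.

```python
import numpy as np
import pandas as pd

# Hypothetical example data: two numeric columns with different dtypes
# and different missing-value markers.
df = pd.DataFrame({
    "count": pd.array([1, 2, None], dtype="Int64"),           # missing value is pd.NA
    "price": np.array([0.5, np.nan, 2.0], dtype="float64"),   # missing value is NaN
})

# Cast every column to float32 so that all numeric columns share one dtype
# and every missing value becomes NaN.
as_float32 = df.astype("float32")

print(as_float32.dtypes)
print(as_float32.isna().sum())

# Compare the memory used by the column data before and after the cast.
bytes_before = df.memory_usage(index=False).sum()
bytes_after = as_float32.memory_usage(index=False).sum()
print(bytes_before, bytes_after)
```

The trade-off is precision: float32 cannot represent large integers exactly, so identifier-like columns are better kept out of such a cast.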
Removing unneeded columns with DropUninformative and Cleaner
Tables may include columns that do not carry useful information for the analysis or model. These columns increase computational cost and may reduce downstream performance. DropUninformative removes such columns.
The DropUninformative transformer includes various heuristics to drop columns considered “uninformative”:
Drops all columns that contain only missing values (threshold adjustable via drop_null_fraction)
Drops columns with only a single value if drop_if_constant=True
Drops string/categorical columns where each row is unique if drop_if_unique=True (use with care)
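To make the three heuristics concrete, here is a rough pandas-only approximation. The helper is_uninformative is hypothetical and is not skrub's implementation; it reuses the parameter names from the list above, and its handling of edge cases (for example, how a null fraction exactly equal to the threshold is treated) is an assumption that may differ from the library.

```python
import pandas as pd

# Hypothetical example table with one column per heuristic.
df = pd.DataFrame({
    "id": ["a1", "b2", "c3"],        # string column, every row unique
    "country": ["FR", "FR", "FR"],   # constant column
    "empty": [None, None, None],     # only missing values
    "score": [0.1, 0.7, 0.1],        # informative column
})


def is_uninformative(col, drop_null_fraction=1.0,
                     drop_if_constant=False, drop_if_unique=False):
    """Approximate the three heuristics for a single column (illustration only)."""
    # Heuristic 1: too many missing values (by default, only missing values).
    if drop_null_fraction is not None and col.isna().mean() >= drop_null_fraction:
        return True
    # Heuristic 2: a single distinct value across all rows.
    if drop_if_constant and col.nunique(dropna=False) == 1:
        return True
    # Heuristic 3: string or categorical column where each row is unique.
    is_text = (pd.api.types.is_string_dtype(col)
               or isinstance(col.dtype, pd.CategoricalDtype))
    if drop_if_unique and is_text and col.nunique() == len(col):
        return True
    return False


default_kept = [name for name in df.columns if not is_uninformative(df[name])]
strict_kept = [
    name for name in df.columns
    if not is_uninformative(df[name], drop_if_constant=True, drop_if_unique=True)
]
print(default_kept)  # ['id', 'country', 'score']
print(strict_kept)   # ['score']
```

The "use with care" warning on drop_if_unique is visible here: a column of free-text descriptions would also be unique per row and would be dropped along with the identifier.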
DropUninformative is used by both TableVectorizer and Cleaner; both accept the same parameters to drop columns accordingly.
Deduplicate categorical data with deduplicate()
If you have a series containing strings with typos, the deduplicate() function may be used to correct some of them by building a mapping from the misspelled strings to the correct strings. See the documentation for caveats and more detail.
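The idea of a typo-to-correct-string mapping can be sketched with the standard library alone. The example below does not call skrub, and skrub's deduplicate() may group strings differently; it swaps in difflib.SequenceMatcher with a greedy grouping, and the 0.8 similarity threshold and the build_mapping helper are assumptions chosen for illustration.

```python
from collections import Counter
from difflib import SequenceMatcher

# Hypothetical input: a categorical column with a few misspellings.
values = ["london", "london", "london", "londn", "paris", "paris", "pariss"]


def build_mapping(strings, threshold=0.8):
    """Map each distinct string to a more frequent, similar spelling if one exists."""
    canonical = []  # spellings accepted as correct so far
    mapping = {}
    # Visit spellings from most to least frequent, so that frequent
    # spellings become the reference that rarer ones are compared against.
    for value, _count in Counter(strings).most_common():
        match = next(
            (ref for ref in canonical
             if SequenceMatcher(None, value, ref).ratio() >= threshold),
            None,
        )
        if match is None:
            canonical.append(value)
            mapping[value] = value
        else:
            mapping[value] = match
    return mapping


mapping = build_mapping(values)
cleaned = [mapping[v] for v in values]
print(mapping)
print(Counter(cleaned))
```

As with any similarity-based approach, distinct but similar categories can be merged by mistake, so the resulting mapping is worth inspecting before it is applied.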