Robust scaling of numeric features using SquashingScaler

The SquashingScaler is a robust scaler for numeric features. It is particularly useful when features include outliers, such as infinite values. Missing values are left unchanged (they are not interpolated). The SquashingScaler centers and scales the data so that outliers are less likely to skew the final result than with alternative scaling methods.

Based on the specified quantile_range parameter, the scaler employs a scikit-learn RobustScaler to rescale the values so that the quantile range occupies an interval of length two, with the median centered at zero. This ensures that inliers are spread over a reasonable range. Afterwards, it applies a smooth clipping function so that all values (including outliers and infinite values) fall in the range [-max_absolute_value, max_absolute_value]. By default, max_absolute_value=3.

>>> import pandas as pd
>>> import numpy as np
>>> from skrub import SquashingScaler
>>> X = pd.DataFrame(dict(col=[np.inf, -np.inf, 3, -1, np.nan, 2]))
>>> SquashingScaler(max_absolute_value=3).fit_transform(X)
array([[ 3.        ],
       [-3.        ],
       [ 0.49319696],
       [-1.34164079],
       [        nan],
       [ 0.        ]])
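
To make the smooth clipping step more concrete, here is a minimal NumPy sketch. It applies one possible squashing function, z / sqrt(1 + (z / B)^2) with B = max_absolute_value, to values that are assumed to have already been centered and robustly scaled. The helper name soft_clip, the formula, and the intermediate values 0.5 and -1.5 are assumptions inferred from the output above, not part of skrub's API; the SquashingScaler documentation gives the exact definition.

```python
import numpy as np


def soft_clip(z, max_absolute_value=3.0):
    """Smoothly squash values into [-max_absolute_value, max_absolute_value].

    Hypothetical helper for illustration only; not part of skrub.
    """
    z = np.asarray(z, dtype=float)
    b = float(max_absolute_value)
    out = np.full_like(z, np.nan)  # NaN entries stay NaN

    # Finite values: close to the identity near 0, flattening towards +/- b.
    finite = np.isfinite(z)
    out[finite] = z[finite] / np.sqrt(1.0 + (z[finite] / b) ** 2)

    # Infinite values are sent to the boundary with the matching sign.
    inf = np.isinf(z)
    out[inf] = np.sign(z[inf]) * b
    return out


# Assumed robustly scaled values for the column [inf, -inf, 3, -1, nan, 2]:
# the median (2) maps to 0, while 3 and -1 are taken to map to 0.5 and -1.5.
z = np.array([np.inf, -np.inf, 0.5, -1.5, np.nan, 0.0])
clipped = soft_clip(z, max_absolute_value=3)
print(clipped)
```

Under these assumptions, the result agrees with the example above: infinite inputs land exactly on the bounds, the missing value is preserved, and the inliers are only slightly compressed.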

More information about the theory behind the scaler is available in the SquashingScaler documentation, while this working example compares different scalers when used on data that include outliers.