Scaling numerical features safely
Outliers in Data
Most values lie in the range [-2, 2], but a few outliers fall anywhere in [-40, 40]:
from helpers import (
    generate_data_with_outliers,
    plot_feature_with_outliers,
)
values = generate_data_with_outliers()
plot_feature_with_outliers(values)
StandardScaler Problems
- Uses mean and standard deviation
- Outliers make these statistics unreliable
- Outliers inflate the standard deviation, so the scaling factor (1/std) becomes too small
- Inliers get compressed into narrow range
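The compression is easy to check with a few lines of NumPy. This is a minimal sketch on made-up data (96 evenly spaced inliers in [-2, 2] plus four hand-picked outliers, not the output of `generate_data_with_outliers`). It applies the mean and standard deviation formula by hand rather than calling StandardScaler.

```python
import numpy as np

# Hypothetical data: 96 inliers in [-2, 2] plus 4 outliers.
inliers = np.linspace(-2, 2, 96)
outliers = np.array([-40.0, -30.0, 30.0, 40.0])
values = np.concatenate([inliers, outliers])

# Standard scaling: subtract the mean, divide by the standard deviation.
mean, std = values.mean(), values.std()
scaled = (values - mean) / std

# Compare with the spread the inliers would have on their own.
std_inliers_only = inliers.std()
scaled_inliers = scaled[: len(inliers)]

print(f"std without outliers: {std_inliers_only:.2f}")
print(f"std with outliers:    {std:.2f}")
print(f"scaled inliers span:  [{scaled_inliers.min():.2f}, {scaled_inliers.max():.2f}]")
```

Four outliers out of 100 points are enough to inflate the standard deviation several-fold, so the scaled inliers end up squeezed into a narrow band around zero.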
RobustScaler
- Uses the median and the interquartile range (25th to 75th percentile) instead of mean/std
- More resistant to outliers
- But doesn’t bound output values
- Extreme outliers still have large scaled values
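Both points can be seen on the same kind of made-up data (96 evenly spaced inliers plus four hand-picked outliers, an assumption for illustration). The sketch below applies median and interquartile-range scaling by hand with NumPy instead of calling RobustScaler.

```python
import numpy as np

# Hypothetical data: 96 inliers in [-2, 2] plus 4 outliers.
inliers = np.linspace(-2, 2, 96)
outliers = np.array([-40.0, -30.0, 30.0, 40.0])
values = np.concatenate([inliers, outliers])

# Robust scaling: center on the median, divide by the interquartile range.
median = np.median(values)
q25, q75 = np.percentile(values, [25, 75])
robust = (values - median) / (q75 - q25)

robust_inliers = robust[: len(inliers)]
print(f"scaled inliers span: [{robust_inliers.min():.2f}, {robust_inliers.max():.2f}]")
print(f"largest scaled value: {robust.max():.2f}")
```

The inliers keep a spread close to [-1, 1] because the quartiles barely move when outliers are added, but the outliers themselves are still mapped to values far outside that range.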
SquashingScaler: Best Approach
Smart outlier handling in skrub:
from skrub import SquashingScaler
scaler = SquashingScaler()
X_scaled = scaler.fit_transform(X)
Comparing the scalers
from helpers import scale_feature_and_plot
scale_feature_and_plot(values)
The Squashing Function
\[x_{\text{out}} = \frac{z}{\sqrt{1 + (z/B)^2}}\]
Where \(z\) is the centered and scaled value and \(B\) is the bound (default: 3). The output stays strictly within \((-B, B)\) and approaches \(\pm B\) as \(|z|\) grows.
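The formula above translates directly into NumPy. The function name `squash` and its `bound` argument are illustrative choices for this sketch, not skrub's API.

```python
import numpy as np

def squash(z, bound=3.0):
    """Soft-clip z into the open interval (-bound, bound)."""
    z = np.asarray(z, dtype=float)
    return z / np.sqrt(1.0 + (z / bound) ** 2)

# Small inputs pass through almost unchanged; large ones saturate near the bound.
z = np.array([-100.0, -3.0, -0.1, 0.0, 0.1, 3.0, 100.0])
out = squash(z)
for zi, oi in zip(z, out):
    print(f"z = {zi:8.1f}  ->  {oi:+.4f}")
```

For \(|z|\) much smaller than \(B\) the denominator is close to 1, so the function is nearly the identity. For \(|z|\) much larger than \(B\) the output approaches \(\pm B\) without ever reaching it, and the mapping stays strictly increasing throughout.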
Advantages
- Outlier-resistant: The scale of the inliers is barely affected by outliers
- Bounded output: Predictable range ideal for neural networks
- Handles edge cases: Works with infinite values, constant columns
- Preserves NaN: Missing values stay unchanged
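The bounded-output and NaN-preserving properties can be illustrated with a simplified stand-in written in NumPy. This is a sketch under stated assumptions, not skrub's implementation: `robust_squash` is a hypothetical helper that combines median/interquartile-range scaling with the squashing function, and it deliberately ignores the infinite-value and constant-column cases that SquashingScaler handles.

```python
import numpy as np

def robust_squash(x, bound=3.0):
    """Robustly scale x, then soft-clip it (simplified stand-in, not skrub's code)."""
    x = np.asarray(x, dtype=float)
    # NaN-aware statistics so that missing values do not affect the scale.
    median = np.nanmedian(x)
    q25, q75 = np.nanpercentile(x, [25, 75])
    z = (x - median) / (q75 - q25)  # assumes q75 > q25 and no infinite inputs
    return z / np.sqrt(1.0 + (z / bound) ** 2)

# Hypothetical data: 96 inliers, 4 outliers, and one missing value.
inliers = np.linspace(-2, 2, 96)
outliers = np.array([-40.0, -30.0, 30.0, 40.0])
values = np.concatenate([inliers, outliers, [np.nan]])

out = robust_squash(values)
print(f"inliers span:       [{out[:96].min():.2f}, {out[:96].max():.2f}]")
print(f"largest magnitude:  {np.nanmax(np.abs(out)):.2f}")
print(f"NaN count in / out: {np.isnan(values).sum()} / {np.isnan(out).sum()}")
```

In this sketch the inliers keep most of their spread, every finite output has magnitude below the bound of 3, and the missing value comes out as NaN.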
Disadvantages
- Non-invertible: Cannot perfectly reverse transformation
What we have seen in this chapter
StandardScaler is problematic with outliers
SquashingScaler safely handles outliers
- Smooth squashing preserves relative scales
- All values bounded to predictable range