Scaling numerical features safely

Outliers in Data

Most values lie in the range [-2, 2], but a few outliers fall in [-40, 40]:

from helpers import (
    generate_data_with_outliers,
    plot_feature_with_outliers
)

values = generate_data_with_outliers()

plot_feature_with_outliers(values)

StandardScaler Problems

  • Centers with the mean and scales by the standard deviation
  • Outliers distort both statistics
  • Outliers inflate the standard deviation, so the scaling factor (1/std) becomes too small
  • Inliers get compressed into a narrow range
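
The compression described above can be reproduced with a few numbers. The sketch below uses plain Python rather than scikit-learn so that the arithmetic stays visible; the toy data and the `standardize` helper are assumptions made for illustration, not part of the course helpers.

```python
import statistics

# Hypothetical toy feature: eight inliers in [-2, 2] plus one outlier at 40.
inliers = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]
values = inliers + [40.0]


def standardize(data):
    """Subtract the mean and divide by the population standard deviation."""
    mean = statistics.fmean(data)
    std = statistics.pstdev(data)
    return [(x - mean) / std for x in data]


scaled_clean = standardize(inliers)
scaled_dirty = standardize(values)

# Spread (max - min) of the inliers after scaling, without and with the outlier.
spread_clean = max(scaled_clean) - min(scaled_clean)
spread_dirty = max(scaled_dirty[:-1]) - min(scaled_dirty[:-1])

print(f"std without outlier: {statistics.pstdev(inliers):.2f}")
print(f"std with outlier:    {statistics.pstdev(values):.2f}")
print(f"inlier spread without outlier: {spread_clean:.2f}")
print(f"inlier spread with outlier:    {spread_dirty:.2f}")
```

A single outlier is enough to inflate the standard deviation, so the same inliers occupy a much narrower interval after scaling.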

RobustScaler

  • Uses the median and the interquartile range (25th to 75th percentiles) instead of mean/std
  • More resistant to outliers
  • But doesn’t bound output values
  • Extreme outliers still have large scaled values
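
A minimal sketch of median and interquartile-range scaling, again in plain Python on hypothetical toy data, illustrates both points: the inliers keep a sensible range, while the outlier still ends up far from them. The `robust_scale` helper is an illustrative stand-in, not the scikit-learn implementation.

```python
import statistics

# Hypothetical toy feature: eight inliers in [-2, 2] plus one outlier at 40.
values = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0, 40.0]


def robust_scale(data):
    """Center on the median and divide by the interquartile range (IQR)."""
    # method="inclusive" interpolates linearly between the sorted data points.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    return [(x - median) / iqr for x in data]


scaled = robust_scale(values)
inlier_min, inlier_max = min(scaled[:-1]), max(scaled[:-1])
outlier_scaled = scaled[-1]

print(f"scaled inliers lie in [{inlier_min:.2f}, {inlier_max:.2f}]")
print(f"scaled outlier: {outlier_scaled:.2f}")
```

The median and quartiles barely move when the outlier is added, which is why the inliers are scaled sensibly, but nothing caps the outlier's scaled value.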

SquashingScaler: Best Approach

Robust scaling with bounded output, available in skrub:

from skrub import SquashingScaler

# X: numeric feature table (e.g. a 2D array or dataframe), defined elsewhere
scaler = SquashingScaler()
X_scaled = scaler.fit_transform(X)

Comparing the scalers

from helpers import scale_feature_and_plot
scale_feature_and_plot(values)

How SquashingScaler Works

  1. Center the median to 0
  2. Use quantile-based statistics for scaling
  3. Fill constant columns with 0s
  4. Apply smooth squashing function
  5. Constrain output to [-3, 3] (default)
  6. Map infinite values to boundaries
  7. Keep NaN values unchanged
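
The steps above can be sketched for a single column in plain Python. This is a simplified illustration, not skrub's implementation: the `squash_column` helper, dividing by the IQR, and the fallback used when the IQR is zero are all assumptions, and the column is assumed to contain at least two finite values.

```python
import math
import statistics


def squash_column(column, bound=3.0):
    """Simplified, hypothetical sketch of the listed steps for one column."""
    finite = [x for x in column if math.isfinite(x)]
    constant = min(finite) == max(finite)
    if not constant:
        q1, median, q3 = statistics.quantiles(finite, n=4, method="inclusive")
        # Assumption of this sketch: fall back to the full range if the IQR is 0.
        scale = (q3 - q1) or (max(finite) - min(finite))
    out = []
    for x in column:
        if math.isnan(x):
            out.append(x)  # step 7: NaN stays NaN
        elif math.isinf(x):
            out.append(math.copysign(bound, x))  # step 6: +/-inf -> +/-bound
        elif constant:
            out.append(0.0)  # step 3: constant column -> 0
        else:
            z = (x - median) / scale  # steps 1-2: robust centering and scaling
            out.append(z / math.sqrt(1 + (z / bound) ** 2))  # steps 4-5
    return out


column = [-2.0, -1.0, 0.0, 1.0, 2.0, 40.0, math.inf, -math.inf, math.nan]
result = squash_column(column)
print([round(v, 3) for v in result])
print(squash_column([5.0, 5.0, 5.0]))
```

Every finite input ends up strictly inside (-3, 3), the infinities land exactly on the bounds, and the missing value is passed through.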

The Squashing Function

\[x_{\text{out}} = \frac{z}{\sqrt{1 + (z/B)^2}}\]

where \(z\) is the centered and scaled value and \(B\) is the bound (default: 3).
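
Evaluating the formula at a few points shows its two regimes: values of \(z\) that are small relative to \(B\) pass through almost unchanged, while large values approach \(\pm B\) without ever reaching it. The `squash` function below is a direct transcription of the formula; the sample points are arbitrary.

```python
import math


def squash(z, bound=3.0):
    """Soft clipping: z / sqrt(1 + (z / bound)^2)."""
    return z / math.sqrt(1 + (z / bound) ** 2)


# Arbitrary sample points, from well inside the bound to far outside it.
for z in [0.1, 1.0, 3.0, 30.0, 300.0]:
    print(f"z = {z:>6}: squashed = {squash(z):.4f}")

# The function is odd, so negative values are treated symmetrically.
print(squash(-30.0) == -squash(30.0))
```

Because the function is strictly increasing, the ordering of the scaled values is preserved even though their spacing is not.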

Advantages

  • Outlier-resistant: Scaling of inliers is barely affected by outliers
  • Bounded output: Predictable range ideal for neural networks
  • Handles edge cases: Works with infinite values, constant columns
  • Preserves NaN: Missing values stay unchanged

Disadvantages

  • Non-invertible: Cannot perfectly reverse transformation
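
A short derivation, assuming the squashing formula given earlier and a finite value strictly inside the bounds, suggests where the information is lost. Solving \(x_{\text{out}} = z / \sqrt{1 + (z/B)^2}\) for \(z\):

\[x_{\text{out}}^2 \left(1 + \frac{z^2}{B^2}\right) = z^2 \quad\Longrightarrow\quad z = \frac{x_{\text{out}}}{\sqrt{1 - (x_{\text{out}}/B)^2}}, \qquad |x_{\text{out}}| < B\]

The expression is undefined at \(|x_{\text{out}}| = B\), which is where infinite inputs are mapped, and a constant column filled with 0s no longer carries its original value, so those inputs cannot be recovered.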

What we have seen in this chapter

  • StandardScaler compresses inliers when outliers are present
  • SquashingScaler safely handles outliers
  • Smooth squashing preserves the ordering of values
  • All non-missing values bounded to a predictable range ([-3, 3] by default)