Encoding datetime features

The Datetime Challenge

Datetimes come in many formats:

  • “2023-01-03” (ISO format)
  • “03/01/2023” (EU format)
  • “January 3, 2023” (text)
  • “03 Jan 2023” (custom)

Correct parsing is essential for feature extraction.

Converting Strings to Datetime

ToDatetime: Single column transformer with format guessing:

from skrub import ApplyToCols, ToDatetime
import pandas as pd

df = pd.DataFrame({"dates": ["2023-01-03", "2023-02-15"]})
df_dt = ApplyToCols(ToDatetime(), allow_reject=True).fit_transform(df)
df_dt
dates
0 2023-01-03
1 2023-02-15

Cleaner: Also parses datetimes with custom format:

from skrub import Cleaner

cleaner = Cleaner(datetime_format="%d %B %Y")
df_clean = cleaner.fit_transform(df)
df_clean
dates
0 2023-01-03
1 2023-02-15

Extracting Datetime Features

Datetimes must be converted to numerical features:

df_dt["year"] = df_dt["dates"].dt.year
df_dt["month"] = df_dt["dates"].dt.month
df_dt["day"] = df_dt["dates"].dt.day
df_dt["weekday"] = df_dt["dates"].dt.weekday
df_dt["day_of_year"] = df_dt["dates"].dt.day_of_year
df_dt["total_seconds"] = (df_dt["dates"] - pd.Timestamp("1970-01-01")) // pd.Timedelta('1s')
df_dt
dates year month day weekday day_of_year total_seconds
0 2023-01-03 2023 1 3 1 3 1672704000
1 2023-02-15 2023 2 15 2 46 1676419200

Simpler with DatetimeEncoder

from skrub import DatetimeEncoder, ApplyToCols

de = DatetimeEncoder(
    add_total_seconds=True,
    add_weekday=True,
    add_day_of_year=True
)

df_dt = ApplyToCols(ToDatetime(), allow_reject=True).fit_transform(df)
df_enc = ApplyToCols(de, cols="dates").fit_transform(df_dt)
df_enc
dates_year dates_month dates_day dates_total_seconds dates_weekday dates_day_of_year
0 2023.0 1.0 3.0 1.672704e+09 2.0 3.0
1 2023.0 2.0 15.0 1.676419e+09 3.0 46.0

Periodic Features

Cyclical patterns need special handling:

  • Day of week: 0-6, but 0 and 6 are close
  • Month: 1-12, but 12 and 1 are close
  • Hour: 0-23, but 23 and 0 are close

Circular (Sin/Cos) Encoding

import numpy as np

df["weekday_sin"] = np.sin(2 * np.pi * df["weekday"] / 7)
df["weekday_cos"] = np.cos(2 * np.pi * df["weekday"] / 7)

Or with DatetimeEncoder:

de = DatetimeEncoder(periodic_encoding="circular")
df_enc = ApplyToCols(de, cols="dates").fit_transform(df_dt)
df_enc
dates_year dates_total_seconds dates_month_circular_0 dates_month_circular_1 dates_day_circular_0 dates_day_circular_1
0 2023.0 1.672704e+09 0.500000 0.866025 5.877853e-01 0.809017
1 2023.0 1.676419e+09 0.866025 0.500000 1.224647e-16 -1.000000

Spline Encoding Alternative

de = DatetimeEncoder(periodic_encoding="spline")
df_enc = ApplyToCols(de, cols="dates").fit_transform(df_dt)
df_enc
dates_year dates_total_seconds dates_month_spline_00 dates_month_spline_01 dates_month_spline_02 dates_month_spline_03 dates_month_spline_04 dates_month_spline_05 dates_month_spline_06 dates_month_spline_07 dates_month_spline_08 dates_month_spline_09 dates_month_spline_10 dates_month_spline_11 dates_day_spline_0 dates_day_spline_1 dates_day_spline_2 dates_day_spline_3
0 2023.0 1.672704e+09 0.0 0.166667 0.666667 0.166667 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.036000 0.538667 0.414667 0.010667
1 2023.0 1.676419e+09 0.0 0.000000 0.166667 0.666667 0.166667 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.166667 0.000000 0.166667 0.666667

Example: spline periodic features

example of periodic features generated with splines

Key Takeaways

  • Use ToDatetime or Cleaner to parse string dates
  • DatetimeEncoder extracts useful features
  • Use periodic encoding for cyclical features
  • Circular encoding: sin/cos transformation
  • Spline encoding: alternative periodic approach