AggTarget

class skrub.AggTarget(main_key, operations, *, suffix='_target')

Aggregate a target y before joining its aggregation on a base dataframe.

Accepts pandas.DataFrame or polars.DataFrame inputs.

Parameters:
main_key : str or iterable of str

Select the columns from the main table to use as keys during the aggregation of the target and during the join operation.

If main_key refers to a single column, a single aggregation is generated for this key and a single join is performed.

If main_key is a list of keys, a multi-column aggregation will be performed on the target.

operations : str or iterable of str

Aggregation operations to perform on the target.

Supported operations are “count”, “mode”, “min”, “max”, “sum”, “median”, “mean”, and “std”. The operations “sum”, “median”, “mean”, and “std” are only available for numeric targets.

suffix : str, default="_target"

The suffix to append to the columns of the target table if the join results in duplicate column names.
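To make the main_key and operations parameters more concrete, the following is a minimal sketch in plain pandas of a multi-column key aggregation followed by a left join. It is not skrub's implementation: the `label` target and the output column names (`mean_target`, `count_target`, `mode_label`) are made up for illustration, and the names AggTarget actually produces may differ.

```python
import pandas as pd

X = pd.DataFrame({
    "from_airport": [1, 1, 1, 2, 2, 2],
    "company": ["DL", "AF", "AF", "DL", "DL", "TR"],
})
# Hypothetical targets: one numeric, one categorical.
y_numeric = pd.Series([1, 1, 0, 0, 1, 1], name="y")
y_label = pd.Series(["late", "late", "late", "ok", "ok", "late"], name="label")

keys = ["from_airport", "company"]  # plays the role of main_key=[...]
table = pd.concat([X[keys], y_numeric, y_label], axis=1)

# "mean" needs a numeric target; "count" and "mode" also work on strings.
agg = table.groupby(keys, as_index=False).agg(
    mean_target=("y", "mean"),
    count_target=("y", "count"),
    mode_label=("label", lambda s: s.mode().iloc[0]),
)

# The left join keeps every row of X, in its original order.
result = X.merge(agg, on=keys, how="left")
print(result)
```

Each distinct (from_airport, company) pair gets one aggregated row, which is then broadcast back to every row of X that shares that pair.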

See also

AggJoiner

Aggregates auxiliary dataframes before joining them on the base dataframe.

Joiner

Augments a main table by automatically joining multiple auxiliary tables on it.

Examples

>>> import pandas as pd
>>> import numpy as np
>>> from skrub import AggTarget
>>> X = pd.DataFrame({
...     "flightId": range(1, 7),
...     "from_airport": [1, 1, 1, 2, 2, 2],
...     "total_passengers": [90, 120, 100, 70, 80, 90],
...     "company": ["DL", "AF", "AF", "DL", "DL", "TR"],
... })
>>> y = np.array([1, 1, 0, 0, 1, 1])
>>> agg_target = AggTarget(
...     main_key="company",
...     operations=["mean", "max"],
... )
>>> agg_target.fit_transform(X, y)
   flightId  from_airport  ...  y_0_mean_target  y_0_max_target
0         1             1  ...         0.666667               1
1         2             1  ...         0.500000               1
2         3             1  ...         0.500000               1
3         4             2  ...         0.666667               1
4         5             2  ...         0.666667               1
5         6             2  ...         1.000000               1
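For comparison, the per-company values in the output above can be reproduced with a plain pandas group-by and left join. This is only a sketch of the equivalent computation, not how AggTarget is implemented; the column names `y_mean` and `y_max` are chosen here for illustration and differ from the names AggTarget generates.

```python
import numpy as np
import pandas as pd

X = pd.DataFrame({
    "flightId": range(1, 7),
    "from_airport": [1, 1, 1, 2, 2, 2],
    "total_passengers": [90, 120, 100, 70, 80, 90],
    "company": ["DL", "AF", "AF", "DL", "DL", "TR"],
})
y = np.array([1, 1, 0, 0, 1, 1])

# Pair each target value with its key, then aggregate per company.
per_company = (
    pd.DataFrame({"company": X["company"], "y": y})
    .groupby("company", as_index=False)
    .agg(y_mean=("y", "mean"), y_max=("y", "max"))
)

# Left-join the aggregates back onto X; the row order of X is preserved.
X_aug = X.merge(per_company, on="company", how="left")
print(X_aug[["company", "y_mean", "y_max"]])
```

"DL" covers flights 1, 4 and 5 (targets 1, 0, 1), which gives the mean of 2/3 shown in rows 0, 3 and 4.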

Methods

fit(X, y)

Aggregate the target y based on keys from X.

fit_transform(X, y)

Aggregate the target y based on keys from X.

get_feature_names_out()

Get output feature names for transformation.

get_metadata_routing()

Get metadata routing of this object.

get_params([deep])

Get parameters for this estimator.

set_output(*[, transform])

Set output container.

set_params(**params)

Set the parameters of this estimator.

transform(X)

Left-join pre-aggregated target on X.

fit(X, y)

Aggregate the target y based on keys from X.

Parameters:
X : DataFrameLike

Must contain the column names defined in main_key.

y : DataFrameLike or SeriesLike or ArrayLike

The length of y must match the length of X. The target can be continuous or discrete, and may have multiple columns.

Returns:
AggTarget

Fitted AggTarget instance (self).

fit_transform(X, y)

Aggregate the target y based on keys from X.

Parameters:
X : DataFrameLike

Must contain the column names defined in main_key.

y : DataFrameLike or SeriesLike or ArrayLike

The length of y must match the length of X. The target can be continuous or discrete, and may have multiple columns.

Returns:
DataFrame

The augmented input.

get_feature_names_out()

Get output feature names for transformation.

Returns:
List of str

Transformed feature names.

get_metadata_routing()

Get metadata routing of this object.

See the User Guide for details on how the routing mechanism works.

Returns:
routing : MetadataRequest

A MetadataRequest encapsulating routing information.

get_params(deep=True)

Get parameters for this estimator.

Parameters:
deep : bool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:
params : dict

Parameter names mapped to their values.

set_output(*, transform=None)

Set output container.

See Introducing the set_output API for an example on how to use the API.

Parameters:
transform : {“default”, “pandas”, “polars”}, default=None

Configure output of transform and fit_transform.

  • “default”: Default output format of a transformer

  • “pandas”: DataFrame output

  • “polars”: Polars output

  • None: Transform configuration is unchanged

Added in version 1.4: “polars” option was added.

Returns:
self : estimator instance

Estimator instance.

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:
**params : dict

Estimator parameters.

Returns:
self : estimator instance

Estimator instance.
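The `<component>__<parameter>` form is the general scikit-learn convention for nested objects. As a small illustration using a scikit-learn Pipeline (the step names "scale" and "clf" are arbitrary choices for this sketch and have nothing to do with AggTarget itself):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression()),
])

# <component>__<parameter>: update parameters of individual steps by name.
pipe.set_params(clf__C=0.5, scale__with_mean=False)

# get_params(deep=True) exposes nested parameters under the same naming scheme.
params = pipe.get_params(deep=True)
print(params["clf__C"], params["scale__with_mean"])
```

The same keys returned by get_params(deep=True) are the ones set_params accepts, which is what makes estimators usable in parameter searches.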

transform(X)

Left-join pre-aggregated target on X.

Parameters:
X : DataFrameLike

The input data to transform.

Returns:
X_transformed : DataFrameLike

The augmented input.
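The split between fit (aggregate the target) and transform (left-join the stored aggregates) can be sketched in plain pandas as follows. This is an illustration of the idea, not skrub's code; the names `lookup` and `y_mean` are hypothetical. In a pandas left join, keys that were absent when the aggregates were computed receive missing values. It seems reasonable to expect similar behavior from AggTarget, but that is an assumption rather than something this page states.

```python
import pandas as pd

X_train = pd.DataFrame({"company": ["DL", "AF", "AF", "DL"]})
y_train = pd.Series([1, 1, 0, 0], name="y")

# "fit" step: aggregate the target per key and keep the small lookup table.
lookup = (
    pd.concat([X_train, y_train], axis=1)
    .groupby("company", as_index=False)
    .agg(y_mean=("y", "mean"))
)

# "transform" step: left-join the stored aggregates onto new rows.
X_new = pd.DataFrame({"company": ["AF", "TR", "DL"]})  # "TR" was never seen
X_new_aug = X_new.merge(lookup, on="company", how="left")
print(X_new_aug)
```

Because the aggregates come only from the data seen in the fit step, no target values are needed when augmenting new rows.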