AggTarget


class skrub.AggTarget(main_key, operation=None, suffix=None)

Aggregate a target y before joining its aggregation on a base dataframe.

Accepts pandas.DataFrame or polars.DataFrame inputs.

Parameters:
main_key : str or iterable of str

Select the columns from the main table to use as keys during the aggregation of the target and during the join operation.

If main_key refers to a single column, a single aggregation is generated for this key and a single join is performed.

Otherwise, if main_key is a list of keys, the target will be aggregated using each key separately, then each aggregation of the target will be joined on the main table.
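The per-key behaviour described above can be sketched with plain pandas. This is only an illustration of the idea (one aggregation and one left join per key), not skrub's implementation: the helper `aggregate_per_key` and the output column names are hypothetical and do not match the names AggTarget generates.

```python
import pandas as pd

# Toy data mirroring the Examples section of this page.
X = pd.DataFrame({
    "from_airport": [1, 1, 1, 2, 2, 2],
    "company": ["DL", "AF", "AF", "DL", "DL", "TR"],
})
y = pd.Series([1, 1, 0, 0, 1, 1], name="y")


def aggregate_per_key(X, y, keys, op="mean"):
    """Hypothetical helper: aggregate y once per key, then left-join each result on X."""
    out = X.copy()
    for key in keys:
        # One aggregation per key: group the target by that key alone.
        agg = (
            y.groupby(X[key])
            .agg(op)
            .rename(f"y_{op}_by_{key}")  # illustrative name, not skrub's naming
            .reset_index()
        )
        # One join per key: attach the aggregated target to the main table.
        out = out.merge(agg, on=key, how="left")
    return out


result = aggregate_per_key(X, y, ["company", "from_airport"])
print(result)
```

Each key contributes its own column, so two keys and one operation yield two new columns in this sketch.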

operation : str or iterable of str, optional

Aggregation operations to perform on the target.

numerical : {“sum”, “mean”, “std”, “min”, “max”, “hist”, “value_counts”}

‘hist’ and ‘value_counts’ accept an integer argument to parametrize the binning.

categorical : {“mode”, “count”, “value_counts”}

If set to None (the default), [“mean”, “mode”] will be used.

suffix : str, optional

The suffix to append to the columns of the target table if the join results in duplicate columns. If set to None, “_target” is used.

See also

AggJoiner

Aggregates auxiliary dataframes before joining them on the base dataframe.

Joiner

Augments a main table by automatically joining multiple auxiliary tables on it.

Examples

>>> import pandas as pd
>>> import numpy as np
>>> from skrub import AggTarget
>>> X = pd.DataFrame({
...     "flightId": range(1, 7),
...     "from_airport": [1, 1, 1, 2, 2, 2],
...     "total_passengers": [90, 120, 100, 70, 80, 90],
...     "company": ["DL", "AF", "AF", "DL", "DL", "TR"],
... })
>>> y = np.array([1, 1, 0, 0, 1, 1])
>>> agg_target = AggTarget(
...     main_key="company",
...     operation=["mean", "max"],
... )
>>> agg_target.fit_transform(X, y)
   flightId  from_airport  ...  y_0_max_target y_0_mean_target
0         1             1  ...               1        0.666667
1         2             1  ...               1        0.500000
2         3             1  ...               1        0.500000
3         4             2  ...               1        0.666667
4         5             2  ...               1        0.666667
5         6             2  ...               1        1.000000

Methods

check_inputs(X, y)

Check the data type of the column names and the suffixes.

fit(X, y)

Aggregate the target y based on keys from X.

fit_transform(X, y)

Aggregate the target y based on keys from X.

get_metadata_routing()

Get metadata routing of this object.

get_params([deep])

Get parameters for this estimator.

set_output(*[, transform])

Set output container.

set_params(**params)

Set the parameters of this estimator.

transform(X)

Left-join pre-aggregated table on X.

check_inputs(X, y)

Check the data type of the column names and the suffixes.

Parameters:
X : DataFrameLike

The raw input to check.

y : DataFrameLike or SeriesLike or ArrayLike

The raw target to check.

Returns:
y_ : DataFrameLike

Transformation of the target.

fit(X, y)

Aggregate the target y based on keys from X.

Parameters:
X : DataFrameLike

Must contain the column names defined in main_key.

y : DataFrameLike or SeriesLike or ArrayLike

y length must match X length, with matching indices. The target can be continuous or discrete, with multiple columns.

If the target is continuous, only numerical operations, listed in num_operations, can be applied.

If the target is discrete, only categorical operations, listed in categ_operations, can be applied.

Note that the target type is determined by sklearn.utils.multiclass.type_of_target().
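As a quick illustration of how `sklearn.utils.multiclass.type_of_target()` separates discrete from continuous targets, the sketch below calls it on two small arrays. The arrays are made up for this example. How AggTarget maps each returned label to numerical or categorical operations is not shown here.

```python
import numpy as np
from sklearn.utils.multiclass import type_of_target

# Integer labels with two distinct values: a discrete target.
y_discrete = np.array([1, 1, 0, 0, 1, 1])
# Non-integer floats: a continuous target.
y_continuous = np.array([0.3, 1.7, 2.5, 0.9])

kind_discrete = type_of_target(y_discrete)
kind_continuous = type_of_target(y_continuous)

print(kind_discrete)    # binary
print(kind_continuous)  # continuous
```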

Returns:
AggTarget

Fitted AggTarget instance (self).

fit_transform(X, y)

Aggregate the target y based on keys from X.

Parameters:
X : DataFrameLike

Must contain the column names defined in main_key.

y : DataFrameLike or SeriesLike or ArrayLike

y length must match X length, with matching indices. The target can be continuous or discrete, with multiple columns.

If the target is continuous, only numerical operations, listed in num_operations, can be applied.

If the target is discrete, only categorical operations, listed in categ_operations, can be applied.

Note that the target type is determined by sklearn.utils.multiclass.type_of_target().

Returns:
DataFrame

The augmented input.

get_metadata_routing()

Get metadata routing of this object.

Please check the User Guide on how the routing mechanism works.

Returns:
routing : MetadataRequest

A MetadataRequest encapsulating routing information.

get_params(deep=True)

Get parameters for this estimator.

Parameters:
deep : bool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:
params : dict

Parameter names mapped to their values.

set_output(*, transform=None)

Set output container.

See Introducing the set_output API for an example on how to use the API.

Parameters:
transform : {“default”, “pandas”, “polars”}, default=None

Configure output of transform and fit_transform.

  • “default”: Default output format of a transformer

  • “pandas”: DataFrame output

  • “polars”: Polars output

  • None: Transform configuration is unchanged

Added in version 1.4: “polars” option was added.

Returns:
self : estimator instance

Estimator instance.

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:
**params : dict

Estimator parameters.

Returns:
self : estimator instance

Estimator instance.
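The `<component>__<parameter>` form can be illustrated with a small scikit-learn pipeline. The step names `scale` and `clf` are arbitrary choices for this sketch. The same convention would apply to a pipeline step holding an AggTarget, though that case is not shown.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression()),
])

# "<component>__<parameter>": the step name, two underscores, the parameter name.
pipe.set_params(clf__C=0.5, scale__with_mean=False)

# get_params(deep=True) exposes the nested parameters under the same names.
params = pipe.get_params(deep=True)
print(params["clf__C"], params["scale__with_mean"])  # 0.5 False
```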

transform(X)

Left-join pre-aggregated table on X.

Parameters:
X : DataFrameLike

The input data to transform.

Returns:
X_transformed : DataFrameLike

The augmented input.
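The split between fitting (aggregating the target) and transforming (left-joining the stored aggregation) can be sketched in plain pandas. This is an approximation of the idea, not skrub's code: the column name `y_mean` and the unseen company code "KL" are made up for the example.

```python
import pandas as pd

X_train = pd.DataFrame({"company": ["DL", "AF", "AF", "DL", "DL", "TR"]})
y_train = pd.Series([1, 1, 0, 0, 1, 1], name="y")

# "fit" step: aggregate the target once per key value and keep the result.
agg = (
    y_train.groupby(X_train["company"])
    .mean()
    .rename("y_mean")  # illustrative name, not skrub's naming
    .reset_index()
)

# "transform" step: left-join the stored aggregation on new rows; no target is needed.
X_new = pd.DataFrame({"company": ["AF", "TR", "KL"]})
X_new_aug = X_new.merge(agg, on="company", how="left")
print(X_new_aug)
```

In this sketch a key value absent from the fitted aggregation ("KL") keeps its row and receives a missing value, which is the usual behaviour of a left join.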