AggTarget
- class skrub.AggTarget(main_key, operation=None, suffix=None)
Aggregate a target y before joining its aggregation on a base dataframe.
Accepts pandas.DataFrame or polars.DataFrame inputs.
- Parameters:
- main_key : str or iterable of str
Select the columns from the main table to use as keys during the aggregation of the target and during the join operation.
If main_key refers to a single column, a single aggregation is generated for this key and a single join is performed.
Otherwise, if main_key is a list of keys, the target is aggregated using each key separately, and each aggregation is then joined on the main table.
- operation : str or iterable of str, optional
Aggregation operations to perform on the target.
numerical : {“sum”, “mean”, “std”, “min”, “max”, “hist”, “value_counts”}
‘hist’ and ‘value_counts’ accept an integer argument to parametrize the binning.
categorical : {“mode”, “count”, “value_counts”}
If set to None (the default), [“mean”, “mode”] will be used.
- suffix : str, optional
The suffix to append to the columns of the target table if the join results in duplicate columns. If set to None, “_target” is used.
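The main_key behaviour described above can be sketched in plain pandas. This is only an illustration of the idea (one aggregation and one left join per key), not skrub's implementation: the helper `aggregate_and_join` and the output column names are assumptions made for this example.

```python
import pandas as pd


def aggregate_and_join(X, y, main_key, operation="mean", suffix="_target"):
    """Hypothetical helper: aggregate y per key, then left-join on X."""
    keys = [main_key] if isinstance(main_key, str) else list(main_key)
    out = X.copy()
    for key in keys:
        # One aggregation of the target per key column.
        agg = (
            pd.DataFrame({key: X[key].to_numpy(), "y": y})
            .groupby(key)["y"]
            .agg(operation)
            .rename(f"y_{operation}_by_{key}{suffix}")  # illustrative naming only
            .reset_index()
        )
        # One left join per key, so every row of X is preserved.
        out = out.merge(agg, on=key, how="left")
    return out


X = pd.DataFrame({
    "from_airport": [1, 1, 1, 2, 2, 2],
    "company": ["DL", "AF", "AF", "DL", "DL", "TR"],
})
y = [1, 1, 0, 0, 1, 1]

result = aggregate_and_join(X, y, main_key=["company", "from_airport"])
print(result)
```

With a list of two keys, the sketch adds two columns to X, one per key, which mirrors the "aggregated using each key separately" wording above.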
See also
Examples
>>> import pandas as pd
>>> import numpy as np
>>> from skrub import AggTarget
>>> X = pd.DataFrame({
...     "flightId": range(1, 7),
...     "from_airport": [1, 1, 1, 2, 2, 2],
...     "total_passengers": [90, 120, 100, 70, 80, 90],
...     "company": ["DL", "AF", "AF", "DL", "DL", "TR"],
... })
>>> y = np.array([1, 1, 0, 0, 1, 1])
>>> agg_target = AggTarget(
...     main_key="company",
...     operation=["mean", "max"],
... )
>>> agg_target.fit_transform(X, y)
   flightId  from_airport  ...  y_0_max_target  y_0_mean_target
0         1             1  ...               1         0.666667
1         2             1  ...               1         0.500000
2         3             1  ...               1         0.500000
3         4             2  ...               1         0.666667
4         5             2  ...               1         0.666667
5         6             2  ...               1         1.000000
Methods
check_inputs(X, y): Perform a check on column names data type and suffixes.
fit(X, y): Aggregate the target y based on keys from X.
fit_transform(X, y): Aggregate the target y based on keys from X.
get_metadata_routing(): Get metadata routing of this object.
get_params([deep]): Get parameters for this estimator.
set_output(*[, transform]): Set output container.
set_params(**params): Set the parameters of this estimator.
transform(X): Left-join pre-aggregated table on X.
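The split between fitting (aggregating the target) and transforming (left-joining the stored aggregation) can be illustrated with a small pandas sketch. The variable names are hypothetical and the code is not taken from skrub. It only shows the general pattern under the assumption that the aggregation is computed once and then reused on new rows.

```python
import pandas as pd

# "fit" step: aggregate the target per key on the training data.
X_train = pd.DataFrame({"company": ["DL", "AF", "AF", "DL"]})
y_train = pd.Series([1.0, 1.0, 0.0, 0.0], name="y")
aggregated = (
    y_train.groupby(X_train["company"])
    .mean()
    .rename("y_mean_target")  # illustrative column name
    .reset_index()
)

# "transform" step: left-join the stored aggregation on new rows.
X_new = pd.DataFrame({"company": ["AF", "TR"]})
X_aug = X_new.merge(aggregated, on="company", how="left")
print(X_aug)
```

In this pandas sketch, a key that was absent from the training data ("TR") receives a missing value, because a left join keeps every row of the left table.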
- check_inputs(X, y)
Perform a check on column names data type and suffixes.
- Parameters:
- X : DataFrameLike
The raw input to check.
- y : DataFrameLike or SeriesLike or ArrayLike
The raw target to check.
- Returns:
- y_ : DataFrameLike
Transformation of the target.
- fit(X, y)
Aggregate the target y based on keys from X.
- Parameters:
- X : DataFrameLike
Must contain the column names defined in main_key.
- y : DataFrameLike or SeriesLike or ArrayLike
The length of y must match the length of X, with matching indices. The target can be continuous or discrete, with multiple columns.
If the target is continuous, only numerical operations, listed in num_operations, can be applied.
If the target is discrete, only categorical operations, listed in categ_operations, can be applied.
Note that the target type is determined by sklearn.utils.multiclass.type_of_target().
- Returns:
- AggTarget
Fitted AggTarget instance (self).
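To see how scikit-learn classifies a target, `type_of_target` can be called directly. The short example below only shows the labels scikit-learn returns for two sample arrays. How AggTarget maps those labels to numerical or categorical operations is not shown here.

```python
import numpy as np
from sklearn.utils.multiclass import type_of_target

# Integer labels with two distinct values are reported as a discrete target.
y_discrete = np.array([1, 1, 0, 0, 1, 1])
# Non-integer floats are reported as a continuous target.
y_continuous = np.array([0.5, 1.7, 2.3, 0.1])

kind_discrete = type_of_target(y_discrete)
kind_continuous = type_of_target(y_continuous)
print(kind_discrete, kind_continuous)
```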
- fit_transform(X, y)
Aggregate the target y based on keys from X.
- Parameters:
- X : DataFrameLike
Must contain the column names defined in main_key.
- y : DataFrameLike or SeriesLike or ArrayLike
The length of y must match the length of X, with matching indices. The target can be continuous or discrete, with multiple columns.
If the target is continuous, only numerical operations, listed in num_operations, can be applied.
If the target is discrete, only categorical operations, listed in categ_operations, can be applied.
Note that the target type is determined by sklearn.utils.multiclass.type_of_target().
- Returns:
- DataFrame
The augmented input.
- get_metadata_routing()
Get metadata routing of this object.
Please check User Guide on how the routing mechanism works.
- Returns:
- routing : MetadataRequest
A MetadataRequest encapsulating routing information.
- set_output(*, transform=None)
Set output container.
See Introducing the set_output API for an example on how to use the API.
- Parameters:
- transform : {“default”, “pandas”, “polars”}, default=None
Configure output of transform and fit_transform.
“default”: Default output format of a transformer
“pandas”: DataFrame output
“polars”: Polars output
None: Transform configuration is unchanged
Added in version 1.4: “polars” option was added.
- Returns:
- self : estimator instance
Estimator instance.
- set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters:
- **params : dict
Estimator parameters.
- Returns:
- self : estimator instance
Estimator instance.
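The `<component>__<parameter>` form can be illustrated with a small scikit-learn Pipeline. The step names "scaler" and "clf" are arbitrary choices for this sketch, and the estimators are stand-ins rather than AggTarget.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression()),
])

# Update parameters of nested components through the step-name prefix.
pipe.set_params(scaler__with_mean=False, clf__C=0.5)

print(pipe.named_steps["scaler"].with_mean, pipe.named_steps["clf"].C)
```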