End-to-end predictive models#
Create baseline predictive models on heterogeneous datasets#
Crafting a machine-learning pipeline is a rather daunting task. Choosing the final learner of such a pipeline is usually the easiest part. However, that choice imposes constraints on the preprocessing steps required ahead of the learner. Programmatically defining these steps is the part that requires the most expertise and is cumbersome to write.
The tabular_learner() function is a factory that, given a scikit-learn estimator, returns a pipeline combining this estimator with the appropriate preprocessing steps. These steps correspond to a TableVectorizer that handles heterogeneous data and, depending on the capabilities of the final estimator, a missing-value imputer and/or a standard scaler. The next section (Turning a dataframe into a numeric feature matrix) provides more details regarding the TableVectorizer. The parameters of the TableVectorizer are chosen based on the type of the final estimator.
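For instance, a minimal sketch of building such a pipeline, assuming a recent skrub release that exposes tabular_learner at the package level:

```python
from sklearn.linear_model import Ridge
from skrub import tabular_learner

# Build a full pipeline around a linear model: tabular_learner wraps the
# estimator with a TableVectorizer and, for linear models, a
# missing-value imputer and a standard scaler.
model = tabular_learner(Ridge())
print(model)
```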
|  | HistGradientBoostingClassifier / HistGradientBoostingRegressor | RandomForestClassifier / RandomForestRegressor | Linear models and others |
|---|---|---|---|
| Low-cardinality encoder | Native support (1) | OrdinalEncoder | OneHotEncoder |
| High-cardinality encoder | MinHashEncoder | MinHashEncoder | GapEncoder |
| Numerical preprocessor | No processing | No processing | StandardScaler |
| Date preprocessor | DatetimeEncoder | DatetimeEncoder | DatetimeEncoder |
| Missing value strategy | Native support (2) | Native support | SimpleImputer |
Note

(1) If the installed scikit-learn version is lower than 1.4, then the OrdinalEncoder is used, since native support for categorical features is not available.

(2) If the installed scikit-learn version is lower than 1.4, then the SimpleImputer is used, since native support for missing values is not available.
With tree-based models, the MinHashEncoder is used for high-cardinality categorical features. Unlike the default GapEncoder, it does not provide interpretable features, but it is much faster. For low-cardinality features, these models rely on either the native categorical support of the model or the OrdinalEncoder.
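A sketch of what this looks like with a gradient-boosting model; the resulting pipeline steps follow the table above:

```python
from sklearn.ensemble import HistGradientBoostingClassifier
from skrub import tabular_learner

# With a tree-based model, high-cardinality categorical columns are
# encoded with a MinHashEncoder, and numerical columns are not scaled.
model = tabular_learner(HistGradientBoostingClassifier())
print(model)
```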
With linear models or unknown models, the default values of the different parameters are used. Therefore, the GapEncoder is used for high-cardinality categorical features and the OneHotEncoder for low-cardinality ones. If the final estimator does not support missing values, a SimpleImputer is added before the final estimator. Finally, a StandardScaler is added to the pipeline. These choices may not be optimal in all cases but they are methodologically safe.
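For example, such a pipeline handles a heterogeneous dataframe end to end; the column names and toy data below are purely illustrative:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from skrub import tabular_learner

model = tabular_learner(LogisticRegression())

# Toy dataframe: a string column and a numerical column with a missing
# value, which the SimpleImputer step fills before scaling and fitting.
X = pd.DataFrame({
    "city": ["Paris", "London", "Berlin", "Paris"],
    "temperature": [12.5, None, 9.5, 11.0],
})
y = [1, 0, 0, 1]
model.fit(X, y)
```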
Turning a dataframe into a numeric feature matrix#
A dataframe can contain columns of all kinds of types. We usually refer to such data as “heterogeneous” data. A good numerical representation of these columns helps with analytics and statistical learning.
The TableVectorizer provides a turn-key solution by applying different data-specific encoders to the different columns. It makes reasonable heuristic choices that are not necessarily optimal, since it is not aware of the learner used for the machine-learning task. However, it typically provides a very good baseline. The TableVectorizer handles the following types of data:
- numerical data, represented with the data types bool, int, and float;
- categorical data, represented with the data types str or categorical (e.g. pandas.CategoricalDtype or polars.datatypes.Categorical);
- date and time data, represented by a datetime data type (e.g. numpy.datetime64, pandas.DatetimeTZDtype, polars.datatypes.Datetime).
Categorical data are subdivided into two groups: columns containing a large number of categories (high-cardinality columns) and columns containing a small number of categories (low-cardinality columns). A column is considered high-cardinality if the number of unique values is greater than a given threshold, which is controlled by the parameter cardinality_threshold.
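Putting this together, a sketch of vectorizing a small heterogeneous dataframe; the threshold value of 10 is illustrative only:

```python
import pandas as pd
from skrub import TableVectorizer

# Hypothetical dataframe mixing the supported data types.
X = pd.DataFrame({
    "flag": [True, False, True],              # numerical (bool)
    "size": [1.5, 2.0, 3.5],                  # numerical (float)
    "kind": ["a", "b", "a"],                  # categorical (str)
    "when": pd.to_datetime(
        ["2024-01-01", "2024-02-01", "2024-03-01"]
    ),                                        # date and time
})

# A column is treated as high-cardinality when its number of unique
# values exceeds cardinality_threshold (set to 10 here for illustration).
vectorizer = TableVectorizer(cardinality_threshold=10)
X_out = vectorizer.fit_transform(X)
```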
Each group of data types defined earlier is associated with a specific init parameter (e.g. numeric, datetime, etc.). The values of these parameters follow the same convention:

- when set to "passthrough", the input columns are output as they are;
- when set to "drop", the input columns are dropped;
- when set to a compatible scikit-learn transformer (implementing the fit, transform, and fit_transform methods), the transformer is applied to each column independently. The transformer is cloned (using sklearn.base.clone()) before its fit method is called.
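As an illustration, a sketch of overriding these parameters. Only numeric and datetime are named explicitly above, so the low_cardinality parameter used here is an assumption about the API:

```python
from sklearn.preprocessing import OneHotEncoder
from skrub import TableVectorizer

vectorizer = TableVectorizer(
    numeric="passthrough",          # keep numerical columns as they are
    datetime="drop",                # discard datetime columns
    low_cardinality=OneHotEncoder(  # assumed parameter name; the encoder
        sparse_output=False,        # is cloned, then applied to each
        handle_unknown="ignore",    # low-cardinality column independently
    ),
)
```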
Examples#
The following examples provide an in-depth look at how to use the TableVectorizer class and the tabular_learner() function.