End-to-end predictive models#
Turning a dataframe into a numerical feature matrix#
A dataframe can contain columns of many different types. A good numerical representation of these columns helps analytics and statistical learning.
The TableVectorizer gives a turn-key solution by applying different, data-specific encoders to the different columns. It makes reasonable heuristic choices that are not necessarily optimal (since it is not aware of the learner used for the machine-learning task). However, it already provides a typically very good baseline.
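For illustration, here is a minimal sketch of vectorizing a dataframe; the example data is invented:

```python
import pandas as pd
from skrub import TableVectorizer

# A small, invented dataframe mixing column types.
df = pd.DataFrame({
    "city": ["Paris", "London", "Paris", "Madrid"],   # categorical
    "job title": ["Senior Engineer", "Data Analyst",
                  "Sales Manager", "HR Assistant"],   # free-form text
    "hire date": ["2021-03-01", "2019-07-15",
                  "2022-11-30", "2020-01-06"],        # dates stored as strings
    "salary": [55000, 48000, 62000, 39000],           # numeric
})

# TableVectorizer picks a suitable encoder for each column and
# returns a purely numerical feature matrix.
vectorizer = TableVectorizer()
X = vectorizer.fit_transform(df)
```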
The function tabular_learner() goes the extra mile by creating a machine-learning model that works well on tabular data. This model combines a TableVectorizer with a provided scikit-learn estimator. If the final estimator does not natively support missing values, a missing-value imputer step is added before it. The parameters of the TableVectorizer are chosen based on the type of the final estimator, as summarized in the table below.
|  | RandomForestClassifier, RandomForestRegressor | HistGradientBoostingClassifier, HistGradientBoostingRegressor | Linear models and others |
|---|---|---|---|
| Low-cardinality encoder | OrdinalEncoder | Native support (1) | OneHotEncoder |
| High-cardinality encoder | MinHashEncoder | MinHashEncoder | GapEncoder |
| Numerical preprocessor | No processing | No processing | StandardScaler |
| Date preprocessor | DatetimeEncoder | DatetimeEncoder | DatetimeEncoder |
| Missing value strategy | Native support (2) | Native support | SimpleImputer |
Note
(1) If the installed scikit-learn version is lower than 1.4, the OrdinalEncoder is used instead, since native support for categorical features is not available.
(2) If the installed scikit-learn version is lower than 1.4, the SimpleImputer is used instead, since native support for missing values is not available.
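As an illustration, here is a minimal sketch of building such models; the final estimators are arbitrary examples:

```python
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.linear_model import Ridge
from skrub import tabular_learner

# Tree-based final estimator: high-cardinality columns get a
# MinHashEncoder, numerical columns are left as-is, and missing
# values are handled natively by the estimator.
tree_model = tabular_learner(HistGradientBoostingRegressor())

# Linear final estimator: GapEncoder and OneHotEncoder are used,
# and SimpleImputer / StandardScaler steps are added.
linear_model = tabular_learner(Ridge())

# Both objects are regular scikit-learn pipelines, e.g.:
# tree_model.fit(X_train, y_train); tree_model.predict(X_test)
```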
With tree-based models, the MinHashEncoder is used for high-cardinality categorical features. It does not provide interpretable features, unlike the default GapEncoder, but it is much faster. For low-cardinality features, these models rely on either the native categorical support of the estimator or the OrdinalEncoder, as shown in the sketch below.
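These tree-model defaults can also be reproduced by configuring the TableVectorizer directly; a minimal sketch, assuming the low_cardinality and high_cardinality parameters of recent skrub releases (the names may differ in older versions):

```python
from sklearn.preprocessing import OrdinalEncoder
from skrub import MinHashEncoder, TableVectorizer

# Mimic the tree-model defaults by hand: ordinal codes for
# low-cardinality columns, min-hash features for high-cardinality ones.
vectorizer = TableVectorizer(
    low_cardinality=OrdinalEncoder(),
    high_cardinality=MinHashEncoder(),
)
```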
With linear models, or models that are not recognized, the default values of the different parameters are used. Therefore, the GapEncoder is used for high-cardinality categorical features and the OneHotEncoder for low-cardinality ones. If the final estimator does not support missing values, a SimpleImputer is added before the final estimator. Finally, a StandardScaler is added to the pipeline. Those choices may not be optimal in all cases, but they are methodologically safe.
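The resulting pipeline can be inspected like any scikit-learn pipeline; a minimal sketch (the printed step names depend on the skrub version):

```python
from sklearn.linear_model import LogisticRegression
from skrub import tabular_learner

model = tabular_learner(LogisticRegression())

# List the assembled steps; for a linear estimator, expect a
# TableVectorizer, a SimpleImputer, a StandardScaler, and the
# final LogisticRegression.
for name, step in model.named_steps.items():
    print(name, "->", step.__class__.__name__)
```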