12 Quiz: Column-level transformations

13 Feature engineering

13.1 Question 1

Consider this data with outliers.

In this plot, which line shows numerical features scaled by the SquashingScaler?

Answer: A)

The solid blue line
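To make the behavior behind this answer concrete, here is a minimal sketch of the squashing idea using only the Python standard library. It assumes the scaler centers on the median, scales by the interquartile range, and then applies a smooth clipping function bounded by a maximum absolute value of 3. The function name `squash`, the sample data, and the exact formula are illustrative assumptions, not skrub's implementation.

```python
import math
import statistics


def squash(values, max_absolute_value=3.0):
    """Robustly center and scale, then softly clip to (-B, B).

    Hypothetical helper: a sketch of the idea, not skrub's code.
    """
    median = statistics.median(values)
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = (q3 - q1) or 1.0  # guard against a zero interquartile range
    out = []
    for v in values:
        z = (v - median) / iqr
        # Smooth clipping: close to z near 0, saturates near +/- B.
        out.append(z / math.sqrt(1.0 + (z / max_absolute_value) ** 2))
    return out


data = [1.0, 2.0, 3.0, 4.0, 5.0, 1000.0]  # made-up data with one outlier
squashed = squash(data)
print([round(s, 3) for s in squashed])
```

Under these assumptions the outlier stays just below the bound while the ordering of the values is preserved, which is the shape to look for in the plot.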

13.2 Question 2

Which of these is not a feature that can be added by the DatetimeEncoder?

Answer: A)

All the other options are features that the DatetimeEncoder can add.
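As a rough illustration of what datetime features look like, the sketch below splits one timestamp into numeric parts using only the standard library. The helper `datetime_features` and its dictionary keys are hypothetical names chosen for this example. They mirror the kind of features a datetime encoder typically extracts and are not the DatetimeEncoder's actual output columns.

```python
from datetime import datetime, timezone


def datetime_features(ts):
    """Split one timestamp into simple numeric features (illustrative only)."""
    return {
        "year": ts.year,
        "month": ts.month,
        "day": ts.day,
        "hour": ts.hour,
        "weekday": ts.isoweekday(),  # 1 = Monday ... 7 = Sunday
        "day_of_year": ts.timetuple().tm_yday,
        "total_seconds": ts.timestamp(),  # seconds since the Unix epoch
    }


features = datetime_features(datetime(2024, 3, 15, 14, 30, tzinfo=timezone.utc))
print(features)
```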

13.3 Question 3

Which categorical encoder is the most balanced in both runtime and downstream performance?

Answer: A)

The StringEncoder is consistently among the fastest encoders to fit and produces high-quality encodings. The MinHashEncoder is faster, but its encodings are of lower quality. The TextEncoder can produce the best encodings for some kinds of categorical data (mainly free text), but it is very slow to run.

13.4 Question 4

You need to do feature engineering on a dataset that includes user reviews as part of its features. You are working in an environment that has access to good computational resources, including a GPU. Which of the following encoders would be the best choice in such a scenario?

Answer: B)

The TextEncoder works best on free-flowing text, or on columns whose content is likely to be covered by the training data of the language model behind the encoder. User reviews fall into this category.

The OneHotEncoder would likely generate a very large number of uninformative features, and the features prepared by the OrdinalEncoder are unlikely to be informative.
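The point about one-hot and ordinal encodings can be seen in a small standard-library sketch. The `one_hot` helper and the four reviews are made up for illustration, and this is not scikit-learn's implementation. Because free text is almost always unique per row, each review becomes its own category.

```python
def one_hot(values):
    """Return the sorted categories and one indicator row per value."""
    categories = sorted(set(values))
    index = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1
        rows.append(row)
    return categories, rows


# Made-up reviews: similar wording, but every string is distinct.
reviews = [
    "Great product, arrived on time.",
    "Great product, arrived late.",
    "Terrible, broke after a week.",
    "Terrible, broke after a day.",
]
categories, encoded = one_hot(reviews)

# Ordinal codes are just positions in the sorted category list.
ordinal_codes = [categories.index(r) for r in reviews]

print(len(categories))  # as many columns as there are distinct reviews
print(ordinal_codes)
```

Each one-hot column fires for exactly one row, so the similarity between the two positive reviews is lost. The ordinal codes follow alphabetical order, which says nothing about the sentiment of the review.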

The MinHashEncoder would generate a fixed number of encoded features, but the quality of its encodings is lower than that of the TextEncoder. The MinHashEncoder may be useful when computational resources are limited, but even then the StringEncoder may produce better results.
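The following standard-library sketch shows why a min-hash style encoding always yields a fixed number of features and is cheap to compute: it needs no fitting, only hashing of character n-grams. The functions `char_ngrams` and `minhash_encode`, the choice of `blake2b`, and the parameter values are assumptions made for this example and do not reproduce skrub's hashing scheme.

```python
import hashlib


def char_ngrams(text, n=3):
    """Set of lowercase character n-grams (the whole text if it is short)."""
    text = text.lower()
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}


def minhash_encode(text, n_components=8):
    """One feature per seeded hash: the minimum hash over all n-grams."""
    grams = char_ngrams(text)
    features = []
    for seed in range(n_components):
        salt = seed.to_bytes(8, "little")  # a different hash function per seed
        features.append(
            min(
                int.from_bytes(
                    hashlib.blake2b(
                        g.encode("utf-8"), digest_size=8, salt=salt
                    ).digest(),
                    "big",
                )
                for g in grams
            )
        )
    return features


short_review = "Great!"
long_review = "Great product, arrived on time and works exactly as described."
print(len(minhash_encode(short_review)), len(minhash_encode(long_review)))
```

Strings that share many n-grams tend to share some of these minimum values, which is where the similarity signal comes from. The output width depends only on `n_components`, not on the length or number of distinct reviews.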