Vision: where is skrub heading?

Vision statement

The goal of skrub is to facilitate machine learning on tables: pandas and polars dataframes, SQL databases…


Skrub is high-level, with a philosophy and an API matching those of scikit-learn. It strives to bridge the world of databases and that of machine learning. When the data is noisy, it enables imperfect assembly and representations of the data, using the downstream prediction target to guide assembly when possible (supervised learning for data assembly).
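To make "imperfect assembly" concrete, here is a minimal sketch of joining two small tables whose keys do not match exactly. It is not skrub's implementation: it uses only the Python standard library (`difflib.SequenceMatcher`) as a stand-in for string similarity, and the function name `approximate_join`, the `threshold` parameter, and the sample data are all hypothetical, chosen for illustration.

```python
from difflib import SequenceMatcher


def approximate_join(left, right, key, threshold=0.6):
    """Left-join two lists of dicts on the most similar value of `key`.

    Hypothetical helper for illustration only; not part of skrub.
    """
    joined = []
    for left_row in left:
        best_row, best_score = None, 0.0
        for right_row in right:
            # Similarity in [0, 1] between the two lowercased keys.
            score = SequenceMatcher(
                None, left_row[key].lower(), right_row[key].lower()
            ).ratio()
            if score > best_score:
                best_row, best_score = right_row, score
        merged = dict(left_row)
        # Keep the match only if it is similar enough; otherwise add nothing.
        if best_row is not None and best_score >= threshold:
            for column, value in best_row.items():
                if column != key:
                    merged[column] = value
        joined.append(merged)
    return joined


# Made-up sample tables with inconsistent spellings of the join key.
cities = [{"city": "Paris"}, {"city": "New York"}, {"city": "Tokyo"}]
populations = [
    {"city": "PARIS", "pop_millions": 2.1},
    {"city": "new-york", "pop_millions": 8.3},
]

result = approximate_join(cities, populations, key="city")
```

In this sketch, "Paris" and "New York" find a match despite the differing case and punctuation, while "Tokyo" has no sufficiently similar counterpart and keeps only its original columns. The threshold trades off missed matches against wrong ones; a supervised variant could tune it against the downstream target.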

In the long term, because skrub is built on higher-level APIs, it will make it easier for data scientists to use efficient database patterns and backends.

Skrub seeks tradeoffs in terms of flexibility: its high-level APIs are by construction more restrictive than directly manipulating dataframes. This is by design, as skrub does not aim to replace tools such as pandas, Ibis, or DuckDB.

To make things simpler, skrub uses defaults that are chosen empirically to give good machine-learning results, even though these defaults are sometimes heuristic, as in the TableVectorizer.
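As an illustration of what a heuristic default can look like, the sketch below picks an encoding for a column from its contents: numeric columns pass through, and string columns are routed by their number of distinct values. This is a simplified, standard-library-only sketch and not the TableVectorizer's actual logic; the function name `choose_encoding`, the returned labels, and the threshold value of 40 are assumptions made for the example.

```python
def choose_encoding(values, cardinality_threshold=40):
    """Pick an encoding label for one column (hypothetical heuristic)."""
    non_missing = [v for v in values if v is not None]
    # Numeric columns (ignoring missing values) need no encoding.
    if all(
        isinstance(v, (int, float)) and not isinstance(v, bool)
        for v in non_missing
    ):
        return "passthrough"
    # Few distinct categories: one column per category stays manageable.
    if len(set(non_missing)) < cardinality_threshold:
        return "one-hot"
    # Many distinct categories: a more compact encoding is preferable.
    return "high-cardinality"


numeric_choice = choose_encoding([1.0, 2.5, None])
low_card_choice = choose_encoding(["red", "blue", "red"])
high_card_choice = choose_encoding([f"id_{i}" for i in range(100)])
```

A rule like this is not optimal for every dataset, but it gives a reasonable starting point without any configuration, which is the tradeoff the paragraph above describes.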

Roadmap

In an open-source project, roadmaps can be wishful thinking: things happen in an iterative way, often guided by the community.

We have nevertheless decided to share what we would like to do in the next 6 months, to give a better idea of the vision.

From shorter term to longer term:

  • Better support for time series

  • Data namespaces, lazy data loading, out-of-core computing using database engines (e.g., DuckDB)

  • Join discovery to work in data lakes where the tables are not in a clean relational database

  • Automatic feature synthesis in databases, building on the data-assembly features