===============================
Vision: where is skrub heading?
===============================

.. currentmodule:: skrub

Vision statement
================

The goal of skrub is to facilitate building and deploying machine-learning
models on tables: pandas dataframes, SQL databases...

|

Skrub is high-level, with a philosophy and an API matching that of
scikit-learn. It strives to bridge the world of databases to that of
machine-learning, **enabling imperfect assembly and representations of the
data when it is noisy**, using the downstream prediction target to guide
assembly when possible (supervised learning for data assembly).

In the long term, as skrub is built on higher-level APIs, it will make it
easier for data scientists to use efficient database patterns and backends.

Skrub seeks tradeoffs in terms of flexibility: its high-level APIs are by
construction restrictive compared to directly manipulating dataframes. This
is by design, as skrub does not aim to replace tools such as pandas, Ibis,
or DuckDB.

To make things simpler, skrub uses defaults that are chosen empirically to
give good machine-learning performance, even though these are sometimes
heuristic, as in the :class:`TableVectorizer`.

Roadmap
=======

In an open-source project, roadmaps can be wishful thinking: things happen
in an iterative way, often guided by the community. We nevertheless decided
to communicate what we would like to do in the next 6 months, to give a
better idea of the vision.

From shorter term to longer term:

- Make the :class:`TableVectorizer` fast, robust, and easy to tune (in the
  sense of hyper-parameter tuning)
- Add a Join-aggregator object, to do feature augmentation on one-to-many
  correspondences
- Support Polars
- Support time series (e.g. in the aggregations)
- Interpolator join to join across multiple columns without exact
  correspondences in the keys
- Release (yes, we are not planning to release very soon)
- Data namespaces, lazy data loading, out-of-core computing using database
  engines (e.g. DuckDB)
- Join discovery to work in data lakes where the tables are not in a clean
  relational database
- Automatic feature synthesis in databases, building on the assembling
  features
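
To make the "feature augmentation on one-to-many correspondences" item more
concrete, here is a minimal sketch of the underlying pattern written with
plain pandas. It is not skrub's API: the ``join_aggregate`` helper, the table
names and the column names are all hypothetical and only illustrate the idea
of aggregating an auxiliary table before joining it onto the main one.

```python
import pandas as pd

# Main table: one row per customer (hypothetical example data).
main = pd.DataFrame({"customer_id": [1, 2, 3], "churned": [0, 1, 0]})

# Auxiliary table: several rows per customer (one-to-many correspondence).
orders = pd.DataFrame(
    {
        "customer_id": [1, 1, 2, 2, 2],
        "amount": [10.0, 30.0, 5.0, 5.0, 20.0],
    }
)


def join_aggregate(main, aux, key, agg):
    """Hypothetical helper: aggregate `aux` per key, then left-join on `main`.

    `agg` maps auxiliary column names to lists of aggregation names,
    e.g. {"amount": ["mean", "count"]}.
    """
    aggregated = aux.groupby(key).agg(agg)
    # A dict of lists yields two-level columns; flatten them to "col_func".
    aggregated.columns = [f"{col}_{func}" for col, func in aggregated.columns]
    # A left join keeps every row of the main table, matched or not.
    return main.merge(aggregated.reset_index(), on=key, how="left")


augmented = join_aggregate(
    main, orders, key="customer_id", agg={"amount": ["mean", "count"]}
)
```

The main table keeps one row per customer, and customers without any match in
the auxiliary table get missing values in the new columns, which a downstream
learner or imputer then has to handle.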