.. currentmodule:: skrub

.. _user_guide_data_ops_vs_alternatives:

How do skrub Data Ops differ from the alternatives?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Skrub DataOps and scikit-learn :class:`sklearn.pipeline.Pipeline`
====================================================================

Scikit-learn pipelines represent a linear sequence of transformations on one
table with a fixed number of rows.

.. image:: ../../../_static/sklearn_pipeline.svg
    :width: 500

Skrub DataOps, on the other hand, can manipulate any number of variables.
The transformation they perform is not a linear sequence but any Directed
Acyclic Graph of computations. Take the following example, where our task is to
predict item price in dollars:

.. image:: ../../../_static/dataops_graph.svg

- Here we use three input variables: two tables ("Items" and "Prices") and a
  float ("euro_dollar_rate").
- For this regression task, we have declared which intermediary step can be
  considered as the design matrix X (shown in blue) and as the target y
  (shown in orange).
- Akin to scikit-learn pipelines, we apply an estimator (Ridge) at the end of
  the processing.

The rest of this user guide will detail how the DataOps work.

Skrub DataOps and orchestrators like Apache Airflow
===================================================================

Skrub pipelines are not an `orchestrator `_ and do not offer capabilities for
scheduling runs or provisioning resources and environments. Instead, they are a
generalization of scikit-learn pipelines, which can still be used within an
orchestrator.

Skrub DataOps and other Skrub objects, like :func:`~skrub.tabular_pipeline`
===============================================================================

Skrub DataOps are built to maximize flexibility in the construction of complex
pre-processing and machine learning pipelines.
On the other hand, the main intent of Skrub objects such as
:func:`~skrub.tabular_pipeline` and :class:`~skrub.TableVectorizer` is to
provide interfaces for common pre-processing tasks, and simple and robust
baselines for machine learning. As a result, these objects are more opinionated
and less flexible than DataOps.

However, it is possible to combine DataOps with regular Skrub and scikit-learn
transformers to make those transformers more flexible, particularly in
multi-table scenarios.

Can I use library "x" with Skrub DataOps?
==========================================

Yes, Skrub DataOps are designed to be "transparent", so that any method used by
the underlying data structures (e.g., Pandas or Polars) can be accessed
directly: see :ref:`user_guide_direct_access_ref` for more details.

All DataOps-specific operations are available through the ``.skb`` attribute,
which provides access to the DataOps namespace. Other library-specific methods
are available directly from the DataOp object, as if it were a regular object
(like a Pandas or Polars DataFrame or Series).