
Synthesis of research related to deployment of Kedro to modern MLOps platforms

Juan Luis Cano Rodríguez edited this page Apr 16, 2024 · 1 revision

Authored with @alparibal

Deploying Kedro to and integrating with MLOps Platforms

This document aims to cover the current state regarding deploying Kedro on enterprise-grade MLOps platforms:

  • pain points observed integrating with distributed, container-based systems.
  • feedback we gathered from Kedro developers, users, and plugin developers such as GetInData.
  • learnings from implementing an MLRun-specific integration.

Common pain points

High-level graphic summary of the identified problem space:

Deciding on granularity when translating to orchestrator DSL

  • A node within an orchestrator is typically an entire container.
  • There is often a significant conceptual mismatch between a single Kedro node and an orchestrator container node.
  • One needs to decide what a "node" means in the orchestrator's environment, i.e. the "granularity" of the nodes.

1:1 Mapping

This is where a single Kedro node is translated to a single orchestrator node.

  • Kedro encourages small, manageable nodes.
  • These nodes contain smaller logic units than typical orchestrator containers.
  • Distributing very small steps in orchestrators can lead to performance overhead; consider running the pipeline in single-container mode (M:1 granularity) for efficiency.

Distributing each node also complicates the data flow between them:

  • When the pipeline is run locally, non-persisted data is passed around as MemoryDatasets.
  • When each step runs in isolation, this feature is lost and most implementations require every step's outputs to be persisted. See Passing ephemeral data between distributed runs for more details.

Currently, most deployment plugins use 1:1 mapping and hence are impacted by these drawbacks.
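
In practice, a 1:1 translation boils down to emitting one unit of work per Kedro node, typically a `kedro run` invocation scoped to that node. A minimal sketch (node names are hypothetical; real plugins build platform DSL objects rather than raw CLI strings):

```python
# Sketch of 1:1 mapping: every Kedro node becomes one container command.
# Node names below are illustrative, not from a real project.
def to_container_commands(node_names):
    """Map each Kedro node to a per-container `kedro run` invocation."""
    return [f"kedro run --nodes={name}" for name in node_names]

commands = to_container_commands(["preprocess", "train", "evaluate"])
for cmd in commands:
    print(cmd)
```

Each command then runs inside its own container, which is exactly why every intermediate dataset must be persisted somewhere the next container can reach.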

M:1 Mapping

This is where the whole Kedro pipeline is run as a single node on the target platform.

  • The main benefit is simplicity: one job is sent to the orchestrator and executed on a single machine.

  • However, there are inefficiencies:

    1. Large setup: Setting up the single node to execute all tasks involves creating an environment, handling potentially conflicting requirements, and more.
    2. Limited parallelization: This approach often underutilizes available compute resources.

All dependencies need to be compatible in this configuration; see Requirements management in large Kedro projects for more details.

M:N Mapping

This is where the full pipeline is divided into a set of sub-pipelines that can be run separately. Today, there is no obvious way to do this.

This approach provides a middle ground between the shortcomings of the 1:1 and M:1 mappings:

  • Small node groups form large buckets of work that justify the overhead of creating an execution environment.
  • The orchestrator is free to schedule the sub-pipelines to be run in parallel / isolation.

Kedro is a fast, iterative development tool largely because the user is not required to think about execution contexts. This unmanaged complexity is why it is difficult to resolve this granularity mismatch in production contexts.

Piecemeal localised conventions for describing M:N granularity have emerged across mature users:

| Convention | Merits | Drawbacks |
| --- | --- | --- |
| Node tags | Simple to use, CLI accessible, applies across pipelines | No bounded context, zero validation |
| Registered pipelines | Simple to use, conceptually maps to sub-pipelines, CLI accessible | No bounded context, zero validation |
| Pipeline namespaces | Bounded context, CLI accessible, visualisation integration | Harder to use, confusing error messages, verbose catalog¹ |

Each of these has merits and drawbacks. In every case, the user is given no easy way to validate if these groups are mutually exclusive or collectively exhaustive.

Despite the namespace option being the most robust approach available (since v0.16.x), these conventions are not in wide use across our power-user base. There are several hypotheses for the low adoption rate:

| Hypothesis area | Comments |
| --- | --- |
| Confusing feature space | namespaces != modular pipelines != micropackaging: overlapping features, all unrelated to deployment, confuse the value for the user.<br>• Today, namespaces are primarily used for visualisation and pipeline reuse, not deployment.<br>• Internal monorepo tooling now covers much of the micropackaging feature space. |
| UX | • Users have reported they dislike the catalog verbosity introduced by namespaces¹<br>• The error messages provided by Kedro when applying namespaces are unhelpful² |

¹ May be resolved by new dataset factory feature
² e.g. Failed to map datasets and/or parameters: params:features

Potential approaches to M:N grouping

Even for a mid-sized pipeline, it is not trivial to find the "optimum" grouping of nodes.

| Approach | Thoughts |
| --- | --- |
| Manual grouping | Pipeline developers are typically aware of broad groups (e.g. preprocessing, training). However, these may take a while to stabilise during development. |
| Via Kedro metadata (nodes, tags, namespaces) | See the M:N Mapping section above; each approach requires some human direction and relies on unvalidated conventions. |
| Via DAG branching | Nodes which split the pipeline graph into distinct branches can be used as sub-pipeline boundaries. This is a similar mechanism to that used by ParallelRunner and ThreadRunner. |
| Via persistence points | Nodes that persist data, i.e. nodes whose dataset type in the catalog is not MemoryDataset, become the starting node of a new group. The assumption is that users persist data after checkpointing meaningful work. In a theoretically perfect production system one would only persist the very end of the pipeline. |
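
As an illustration of the persistence-point idea, the sketch below derives groups from the set of persisted datasets, drawing a boundary after each persisted dataset (one of several reasonable choices). Dataset and node names are made up; a real implementation would operate on a Kedro Pipeline object:

```python
# Sketch: derive M:N groups from persistence points. A node whose inputs
# are all persisted (or external) starts a new group; a node consuming
# in-memory data joins the group of the node that produced it.
def group_by_persistence(nodes, persisted):
    """nodes: list of (name, inputs, outputs) in topological order.
    persisted: dataset names that are not MemoryDataset."""
    groups, owner = [], {}  # owner: dataset name -> group index
    for name, inputs, outputs in nodes:
        memory_inputs = [i for i in inputs if i in owner and i not in persisted]
        if memory_inputs:
            idx = owner[memory_inputs[0]]  # join the producer's group
        else:
            idx = len(groups)              # persistence boundary: new group
            groups.append([])
        groups[idx].append(name)
        for out in outputs:
            owner[out] = idx
    return groups

pipeline = [
    ("clean", ["raw"], ["clean_df"]),          # clean_df is persisted
    ("split", ["clean_df"], ["train", "test"]),  # train/test stay in memory
    ("fit", ["train"], ["model"]),
    ("score", ["model", "test"], ["metrics"]),
]
print(group_by_persistence(pipeline, persisted={"raw", "clean_df"}))
```

Here "clean" forms its own group because its output is persisted, while "split", "fit", and "score" are bundled together since they exchange in-memory data.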

Validating the groups

After nodes are mapped to several groups, sanity checks need to be performed and several questions answered.

  • How may we enforce that the groups are still acyclic?
  • Should a node be able to be re-used across multiple groups?
  • How do we surface / manage un-grouped nodes?
  • Do we try and add validation to registered pipelines / node tags to better bound their context?

A possible solution here is to introduce formal before_pipelines_registered and after_pipelines_registered hooks, which would expose the pipelines in a state where grouping validation could be injected and applied (see issue #3000). At the time of writing, there is no way to do this at a portable, plugin level.
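
The kind of validation such hooks could inject can be sketched independently of Kedro: check that the groups are mutually exclusive and collectively exhaustive (MECE) over the set of nodes. Group and node names below are illustrative:

```python
# Sketch: MECE validation over node groups.
def validate_mece(all_nodes, groups):
    """all_nodes: set of node names; groups: mapping group name -> set of nodes."""
    seen, overlaps = set(), set()
    for members in groups.values():
        overlaps |= seen & members  # nodes appearing in more than one group
        seen |= members
    ungrouped = all_nodes - seen    # nodes not covered by any group
    return {"overlapping": overlaps, "ungrouped": ungrouped,
            "valid": not overlaps and not ungrouped}

report = validate_mece(
    {"clean", "split", "fit", "score"},
    {"prep": {"clean", "split"}, "model": {"split", "fit"}},
)
print(report)
```

In this example the check flags "split" as overlapping and "score" as ungrouped, the two failure modes none of today's conventions can detect.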

Expressing the groups

  • After the pipeline is broken down into groups, there must be a way to express these groups.
  • This expression mechanism must be serialisable so that it can be stored, reused, and passed between Kedro core, plugins, and orchestrators.

A possible solution is to build upon Pipeline.filter. If run configuration parameters share the same names (from_nodes, tags, etc.), then at execution time we can fetch the pipeline with the given name and simply execute pipe.filter(**args).
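
A sketch of this idea: the filter arguments themselves become the serialisable group description, stored and passed around as JSON. The stubbed tag filter below stands in for Pipeline.filter; all names are illustrative:

```python
import json

# Sketch: a node group expressed as serialisable filter arguments.
group_spec = {"pipeline_name": "__default__", "tags": ["preprocessing"]}
payload = json.dumps(group_spec)  # stored by the plugin / orchestrator

# At execution time, the spec is decoded and re-applied. The list
# comprehension stands in for `pipeline.filter(tags=args["tags"])`.
args = json.loads(payload)
nodes = [("clean", {"preprocessing"}), ("fit", {"training"})]
selected = [name for name, tags in nodes if tags & set(args["tags"])]
print(selected)
```

Because the spec is plain JSON, it can travel from Kedro core to a plugin to an orchestrator step without any of them sharing Python objects.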

Requirements management in large Kedro projects

  • The Kedro project template comes with a single requirements file for the whole Kedro registry.
  • The requirements of individual pipelines and nodes are not captured. All pipelines are usually run using the same environment.
  • It can be hard to manage a single environment for large projects. We have evidence of users adopting a monorepo of distinct projects when this becomes a blocker.
  • Modular pipelines do support localised requirements.txt files, but it is still up to the user to make these work neatly in independent environments.

There is a 1:1 relationship between pipeline granularity and the dependencies required for that scope. A full solution could include metadata such as dependencies, Docker base image, preferred execution engine (e.g. Pod, Spark job, Ray parallel processing), and other relevant aspects.

Separating pipeline definition and execution environments


This section is closely coupled with Requirements management in large Kedro projects.

  • Most project-scoped CLI commands eagerly load all pipelines of the project.
  • Since Kedro nodes keep a pointer to the function object that the node has to run, loading all pipelines means importing all modules at once:
    • This is very expensive in large pipelines. The most common manifestation of this problem is where Kedro-Viz takes several minutes to load, despite not requiring a functional DAG.
    • This hinders the ability to isolate different work teams within a project, e.g. the data science team has to install Spark and the data engineering team has to install TensorFlow.

There are active initiatives to address this, but no concrete progress has been made at the time of writing.

No link between distributed KedroSessions of the same pipeline


As described below, most deployment plugins run the Kedro CLI under the hood.

  • When the execution of the pipeline is separated into multiple steps, a new KedroSession for each of these steps is created, and a separate session_id is assigned to each of them.
  • This makes it hard to have a single overview of the pipeline execution.

This point has been raised by the community and there is ongoing work by the Kedro team. Users often report bypassing Kedro's session_id and introducing their own mechanism.

Passing ephemeral data between distributed runs


Kedro, by default, uses MemoryDataset to hold intermediate data. However, this dataset type cannot be used in a distributed setting, since containers do not share main memory.

Deployment plugins usually replace the MemoryDataset by:

  • Having a Runner implementation with another default dataset type
  • Explicitly mapping catalog entries to another dataset type

In either case, ephemeral data is, at least temporarily, persisted to storage (cloud bucket, Kubernetes volume, etc.). The [de-]serialisation of data throttles pipeline execution speed and, in many cases, leads to worse performance in the distributed setting than in a local run.

There are some solutions, such as the CNCF Vineyard project, that offer in-memory data access and might improve execution speed, though only in Kubernetes-specific situations.
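
The catalog rewriting that deployment plugins perform can be sketched as follows: any dataset missing from the catalog (an implicit MemoryDataset) receives a persisted definition under a shared storage prefix. The dataset type string and bucket path are hypothetical:

```python
# Sketch: rewrite a catalog so ephemeral datasets survive between
# containers. Implicit MemoryDataset entries get a persisted,
# PickleDataset-style definition (paths and names are illustrative).
def persist_ephemeral(dataset_names, catalog, prefix="s3://bucket/tmp"):
    rewritten = dict(catalog)
    for name in dataset_names:
        if name not in rewritten:  # implicit MemoryDataset
            rewritten[name] = {
                "type": "pickle.PickleDataset",
                "filepath": f"{prefix}/{name}.pkl",
            }
    return rewritten

catalog = {"clean_df": {"type": "pandas.ParquetDataset",
                        "filepath": "data/clean.parquet"}}
print(persist_ephemeral({"clean_df", "train", "model"}, catalog))
```

Explicitly persisted entries pass through untouched; only the in-memory ones pick up the extra serialisation cost described above.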

Differentiating between data, model, and reporting artifacts


There is a wider point here: granular information about the entire Kedro execution lifecycle often needs to be exposed to the underlying MLOps platform in order to maximise the features available.

  • Most mature MLOps platforms differentiate between kinds of pipeline steps, models, and artifacts.
  • Kedro Dataset classes do not contain metadata about the kind of data they store or load.
    • For example, a PickleDataSet can store any Python object, so it is not known whether the dataset stores a model. In general, there is a strong argument that ONNX (LFAI) should be the default model serialisation mechanism within Kedro.
    • There are some 1st- and 3rd-party model-specific datasets, but it is a manual exercise to classify these.
  • Users who hit this problem are forced to rely on type hints or some sort of object introspection to retrieve this information (see example).
    • Kedro hooks can be utilised to inspect objects at the right time during pipeline execution.
    • At translation time, type annotations of the node functions can be used similarly.

A potential solution here is to establish and enforce conventions. Introducing something like AbstractModelDataSet would make this much easier. We could also use the new metadata catalog key, but the onus is on the user to update it.
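
The type-hint workaround mentioned above can be sketched like this: inspect a node function's return annotation to decide whether its output is a model artifact. The Model base class here is a stand-in for a convention (such as an AbstractModelDataSet) that Kedro does not define today:

```python
from typing import get_type_hints

# Hypothetical marker base class standing in for a model convention.
class Model: ...

def train(features: list) -> Model: ...  # a node function with annotations

def artifact_kind(func):
    """Classify a node's output as 'model' or 'data' from its return hint."""
    hint = get_type_hints(func).get("return")
    if hint is not None and isinstance(hint, type) and issubclass(hint, Model):
        return "model"
    return "data"

print(artifact_kind(train))
```

A Kedro hook could apply the same check at run time (object introspection) or a deployment plugin at translation time (annotation inspection), as the bullets above describe.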

Lack of a standard pattern for iterative development


Currently, deployment plugins address the one-way task of converting a developed pipeline into a deployment. When deployment is viewed as an iterative process of development and deployment steps, additional gaps need to be bridged.

Linking source code to execution

There are two popular configurations linking source code to the platform: (1) tight and (2) loose coupling.

  1. When the execution environment is not necessarily aware of ML concepts such as pipelines, models, and artifacts, it is on the user to ensure that deployments are versioned correctly. For example, steps must be taken to avoid pushing untracked code into a deployment.
  2. When the execution environment is loosely coupled with the project's source code (e.g. Databricks Repos, AzureML Environment, MLRun Function), the deployment platform usually maintains the linkage between code and pipeline execution.

Keeping code and configuration separated

  • By design, Kedro separates code and configuration.
    • Configuration is not included as part of kedro package, in strict adherence to the 12-factor app methodology.
  • However, in most deployment patterns, the configuration is baked into the deployment.
    • Since v0.18.5 it has been possible to pass a zip file containing configuration via the command line, but it is not easy to:
      • point to a shared or cloud bucket location; only local directories are supported.
      • inject configuration directly through the command line in a format such as JSON.

One option is to use environment variables in the configuration and manage environment variables at deployment time. There is significant complexity in doing this at scale.
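
A simplified sketch of that option: resolve ${VAR}-style placeholders in the configuration against environment variables set at deployment time (the placeholder syntax is borrowed from OmegaConf-style resolvers; names and paths are illustrative):

```python
import os
import re

# Sketch: substitute ${VAR} placeholders in config values with
# environment variables, leaving unknown placeholders untouched.
def resolve_env(config):
    pattern = re.compile(r"\$\{(\w+)\}")
    def _resolve(value):
        if isinstance(value, dict):
            return {k: _resolve(v) for k, v in value.items()}
        if isinstance(value, str):
            return pattern.sub(
                lambda m: os.environ.get(m.group(1), m.group(0)), value)
        return value
    return _resolve(config)

os.environ["DATA_ROOT"] = "s3://prod-bucket"  # set at deployment time
conf = {"clean_df": {"filepath": "${DATA_ROOT}/clean.parquet"}}
print(resolve_env(conf))
```

The complexity at scale comes from managing which variables each deployment needs, not from the substitution itself.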

Limiting duplicated build efforts

In a setup where the pipeline is continuously deployed, repeating the same deployment workflow may lead to inefficiencies:

  • re-translating the pipeline: Ideally, the pipeline is translated only for changes ("deltas") in the repo that alter the structure of the pipeline. For example, some changes in the bound node function should not necessitate re-translation.
  • re-creating the environment: When source code is injected into the execution environment, the same Docker image should be re-used across several versions of the deployment. Some platforms support this out-of-the-box (MLRun, AzureML, Databricks).

It might be possible to implement a platform-agnostic solution, e.g. cloning the repo at execution time before executing the Kedro command.

Kedro dependency after deployment to orchestrator


There may be situations where integrating Kedro with a target platform leaves much of the platform's feature set under-utilised. From the platform's perspective, deployed Kedro pipelines may feel like "closed boxes".

For many deployment plugins, translating a Kedro pipeline means encapsulating the Kedro project within a Docker container and executing specific nodes via the Kedro CLI.

So, pipeline execution depends on Kedro in two ways:

  1. Session management:
    • Kedro still manages the run context, execution order, as well as importing and running lifecycle hooks.
    • This gives the user a familiar way to modify execution behaviour but can also be limiting for the orchestrator. For example, the nodes in the pipeline may not be fully transparent to the orchestrator in the case of M:1 mapping.
    • While it might be possible to remove session management from a simple project, it becomes very challenging when the project heavily utilizes hooks or any sort of dynamic pipelining.
  2. I/O:
    • Kedro datasets contain arbitrary custom logic that cannot be reliably mapped to the native data-loading logic supported by the orchestrator or platform.
    • If platforms are opinionated (like SageMaker's /opt/models) in the way they handle artifact management, these features are often bypassed and not automatically available to users.

Recommended changes to Kedro core

  1. Distributed session_id Setting: Simplify session_id management in distributed Kedro pipelines (see issue #2182).
  2. Artifact kind assignment: Enhance dataset integration with artifact kinds. Make ONNX the default path.
  3. M:N Groups in Kedro: Establish conventions for M:N groups with deployment focus. (See kedro-plugins PR#241)
  4. Modular Requirements: Simplify pipeline deployments and development constraints. (Slack conversation)
  5. Group-Level Validation Hooks: Add hooks for enforcing constraints like MECE pipelines (see issue #3000).
  6. Lazy loading of pipeline structure: Enable DAG resolution without dependencies present in the environment (#2829).
  7. Make Kedro pipeline serialisable: Inputs, outputs and fully qualified function references would enable easier translation into target DSLs. JSON target seems reasonable.
  8. Deterministic toposort: Users often report that the sort order is not reproducible; this affects any implicit grouping strategy considerably.
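
Recommendation 7 can be illustrated with a round trip: a node serialised as JSON with a fully qualified function reference, then resolved back at load time via importlib. The node spec format below is hypothetical (math.sqrt stands in for a real node function):

```python
import importlib
import json

# Sketch: a serialisable node spec with a fully qualified function
# reference, as recommendation 7 envisages. The schema is hypothetical.
node_spec = {"name": "sqrt_node", "func": "math.sqrt",
             "inputs": ["x"], "outputs": ["root"]}
payload = json.dumps(node_spec)  # portable representation for target DSLs

# At load time, resolve the function reference back to a callable.
spec = json.loads(payload)
module_name, func_name = spec["func"].rsplit(".", 1)
func = getattr(importlib.import_module(module_name), func_name)
print(func(9.0))
```

Because the spec is pure JSON, a deployment plugin could translate it into an orchestrator DSL without importing the project's code at translation time, which also supports recommendation 6.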

Deployment plugins

Overview of plugins

Almost all plugins rely on a Docker image to wrap the Kedro project. The Docker image is usually built just before executing the pipeline, and source code is copied into the image as part of the build.

  • A Docker container is spun up from this image on the MLOps platform.
  • The Kedro pipeline is run through the Kedro CLI.
  • Plugins provide hooks and datasets to manage the communication between Kedro and the platform.
  • This communication includes mapping Kedro datasets to platform artifacts, managing experiment tracking via MLflow deployed on the platform, and the [de-]serialisation of MemoryDatasets.

It is also worth noting that beyond data management and experiment tracking, deployment plugins often fail to leverage or unlock the full potential of platform-specific capabilities.

These unused capabilities include:

  • serving trained models via an endpoint
  • labelling and retraining workflows
  • incorporating feature stores
  • model monitoring

Comparison

| Plugin | Mapping Support | Handling Memory Datasets | Execution Setup | Source Code | Translation | Platform Integration / Reflections |
| --- | --- | --- | --- | --- | --- | --- |
| kedro-airflow[O] | 1:1 only | Not supported | Airflow-defined environment | Source available in executor; cwd set to project path | Kedro DAG -> Python script using Airflow API | Designed for Airflow, not for container platforms. |
| kedro-docker[O] | M:1 only | N/A since M:1 | Dockerfile environment | Source available via Docker mount at execution time | No translation or orchestration | Introductory, platform-agnostic tutorial. |
| kedro-sagemaker[G] | 1:1 only | Cloudpickle & AWS bucket | Dockerfile environment | Source copied to container at build; auto-rebuilt | SageMakerPipeline object using SageMaker API | MLflow tracking, native pipeline visualization. |
| kedro-vertexai[G] | 1:1 only | Cloudpickle & GCS | Dockerfile environment | Source expected in container | Kubeflow Pipelines DSL | MLflow tracking, elastic machine allocation. |
| kedro-azureml[G] | 1:1 only | Cloudpickle & Azure Blob | AzureML Environment | Source in AzureML Environment | Inputs/outputs -> AzureML counterparts | AzureMLPipelineDataSet, MLflow, distributed training. |
| kedro-kubeflow[G] | 1:1, M:1 | KFP Volumes | Dockerfile environment | Source expected in container | Kubeflow Pipelines DSL | Scaling issue with KFP Volumes. |
| kedro-mlrun | 1:1, M:N (prototype) | MLRun artifact | Dockerfile environment | Source fetched from repo by MLRun | Kubeflow Pipelines DSL | Native model tracking, serving, pipeline visualization. |

* [O] maintained by the Kedro org, [G] maintained by the GetInData org
