MLOps Project KPIs

The DORA four keys are (1) Deployment Frequency, (2) Lead Time for Changes, (3) Time to Restore Service, and (4) Change Failure Rate. Together they indicate the performance of software delivery. As performance measurement metrics, they fit well philosophically with the mission of any Machine Learning (ML) team.

In addition, there is a clear industry trend around the term MLOps, which tries to conceptually bridge ML projects with software projects and DevOps. It is therefore natural for an ML team to think about its own KPIs in terms of the DORA four keys.

While the DORA four keys can help measure project performance and productivity, individual models within a domain should use various model observability and monitoring metrics specific to the use case.

What is an ML project?

First, let’s define what an ML project is. At any given time, an ML project is in one of two mutually exclusive phases: (1) the exploration phase or (2) the productionisation and iteration phase.

Exploration phase

An ML project in the exploration phase has (1) a clear problem statement defined in some document, (2) a set of Jupyter notebooks or scripts that roughly document the approach and datasets explored, and (3) a score, tracking, or visualisation of all approaches (e.g. TensorBoard, MLflow, wandb.ai, etc.).

The notebooks or scripts themselves specify what data was used, what kind of exploration was done, and what model candidates were tried out. They are run many times (typically in the order of hundreds of times) with small changes. Scientists want immediate feedback on what changed and how the approach performed with each incremental change that they make.

Productionisation and iteration phase

An ML project in the productionisation and iteration phase (1) has a formal representation of the pipeline and (2) uses CI/CD.

The pipeline roughly represents (a) data loading, (b) data transformation and feature engineering, (c) model training, (d) model validation, (e) model deployment, and (f) model monitoring and feedback loop, all in a formally defined manner that’s committed to Git and deployed via CI/CD. The pipeline itself is defined in some DSL or orchestration framework (e.g. SageMaker Pipelines, Vertex AI Pipelines, Kubeflow Pipelines, Airflow, etc.), and each stage of the pipeline has some underlying infrastructure involved (e.g. Databricks, SageMaker, Vertex AI, etc.).
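
To illustrate what a "formal representation" can look like, here is a minimal sketch in plain Python, deliberately independent of any of the orchestration frameworks named above. All function names, the toy data, and the toy "model" are hypothetical stand-ins. A real pipeline would express the same ordered stages in the framework's own DSL.

```python
# Hypothetical stage functions: each receives the previous stage's output.
# Real stages would call out to an orchestration framework and its infrastructure.

def load_data(_=None):
    return [1.0, 2.0, 3.0, 4.0]  # stand-in for reading a data snapshot


def transform(rows):
    top = max(rows)
    return [r / top for r in rows]  # toy feature engineering: scale to [0, 1]


def train(features):
    # Toy "model": just the mean of the features.
    return {"mean": sum(features) / len(features)}


def validate(model):
    return {**model, "valid": 0.0 <= model["mean"] <= 1.0}


def deploy(model):
    if not model["valid"]:
        raise ValueError("refusing to deploy a model that failed validation")
    return {**model, "deployed": True}


def monitor(model):
    return {**model, "monitored": True}


# The version-controlled part: an explicit, ordered list of named stages (a) to (f).
PIPELINE = [
    ("load", load_data),
    ("transform", transform),
    ("train", train),
    ("validate", validate),
    ("deploy", deploy),
    ("monitor", monitor),
]


def run_pipeline(stages):
    """Run the stages in order and return (names of executed stages, final artifact)."""
    artifact = None
    executed = []
    for name, stage in stages:
        artifact = stage(artifact)
        executed.append(name)
    return executed, artifact


if __name__ == "__main__":
    print(run_pipeline(PIPELINE))
```

The point of the sketch is only that the stage order lives in code that can be committed to Git and exercised by CI/CD, rather than in a notebook's execution history.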

Typically, projects in the productionisation and iteration phase run on some schedule or cadence (e.g. daily, weekly, or whenever a new data snapshot is available), in contrast to the exploration phase, where the project is run hundreds of times in short bursts.

Discretizing the user journey

In order to measure the DORA four keys, we need telemetry data from a discretized view of our user journeys. Let’s define that discretized view for each phase as below.

Exploration phase

STEP	CODE	DESCRIPTION	TELEMETRY
1	UNDERSTAND	Understand what to do and whether to do it at all	N/A
2	FORMULATE	Formulate the problem statement and design proposal in a document	T1 = doc creation time or project kickoff date in Jira / Asana / Basecamp / Notion etc.
3	EXPLORE	Repeatedly run experiments on a combination of datasets, see if new data must be acquired, explore different approaches and track them	T2 = Jupyter notebook executions or MLflow project creation etc.
4	BASELINE	Reach a consensus among all stakeholders involved on the correct approach and datasets	T3 = Stakeholder approval in Jira / Asana / Basecamp / Notion etc.

Productionisation and iteration phase

STEP	CODE	DESCRIPTION	TELEMETRY
1	BOOTSTRAP	Create an initial pipeline with some pipeline orchestration framework	T4 = First Git commit to main branch
2	LOCAL CHANGE	Add or introduce the first change to the model locally in a feature branch in an IDE (e.g. VSCode) and verify that it is ready to commit upstream	N/A
3	OPEN PULL REQUEST	Open a PR to the main branch and wait until the stack is deployed in some staging environment, the deployment checks out, and the model is validated	T5 = Git commit time
4	COMMITS TO PULL REQUEST	Repeatedly make local changes and commit to the PR until a reasonable outcome is observed (e.g. successful deployment or good model performance)	T6 = Git commit time and model training / deployment metadata in cloud console
5	OBSERVE	Understand how the system is performing in the real world using dashboards such as Grafana / CloudWatch and go back to steps 2 through 4 (LOCAL CHANGE …) if any small steering is needed. If massive steering is needed, go back to the exploration phase	T7 = Model deployment metadata timestamp in cloud console
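
One way to make the telemetry points in the two tables concrete is to record every T1 to T7 observation as a small event. The sketch below is only an illustration: the names Mark, TelemetryEvent and first_seen, as well as the project and ref fields, are my own assumptions and not part of any tool mentioned above.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum


class Mark(Enum):
    """Telemetry points from the two tables, keyed to the step that emits them."""
    T1 = "FORMULATE"
    T2 = "EXPLORE"
    T3 = "BASELINE"
    T4 = "BOOTSTRAP"
    T5 = "OPEN PULL REQUEST"
    T6 = "COMMITS TO PULL REQUEST"
    T7 = "OBSERVE"


@dataclass(frozen=True)
class TelemetryEvent:
    project: str    # hypothetical project identifier
    mark: Mark      # which telemetry point this event represents
    at: datetime    # when it was observed
    ref: str = ""   # hypothetical link: commit id / deployment id / model id


def first_seen(events):
    """Return the earliest timestamp observed for each mark."""
    earliest = {}
    for event in sorted(events, key=lambda e: e.at):
        earliest.setdefault(event.mark, event.at)
    return earliest


if __name__ == "__main__":
    demo = [
        TelemetryEvent("demo", Mark.T6, datetime(2024, 1, 5), ref="commit-b"),
        TelemetryEvent("demo", Mark.T5, datetime(2024, 1, 3), ref="commit-a"),
        TelemetryEvent("demo", Mark.T7, datetime(2024, 1, 6), ref="deploy-1"),
    ]
    for mark, at in first_seen(demo).items():
        print(mark.name, mark.value, at.isoformat())
```

Keeping a ref on each event is what later allows a deployment (T7) to be tied back to the commit or model that produced it.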

The discrete user journey steps in ML projects are often discussed at length, but without a clear delineation of the steps in each of the two phases. To address this problem, I define a mnemonic for each phase: (1) Under the ForEx Base (UNDERSTAND, FORMULATE, EXPLORE, BASELINE, for the exploration phase) and (2) BLOCO (BOOTSTRAP, LOCAL CHANGE, OPEN PULL REQUEST, COMMITS TO PULL REQUEST, OBSERVE, for the productionisation and iteration phase).

The rationale for this discretization comes from my subjective but informed exposure to tens, if not hundreds, of ML projects at Zalando over the last five years.

Ideas for defining the DORA four keys for ML projects

With clear discrete steps in an ML project and corresponding telemetry, our definitions can be deduced easily, for example:

Lead time for iterating in production = T7 - T5
Lead time from first bootstrap to first model deployment = T7 - T4 (first tick of the clock)
Total lead time (time to market) = T7 - T1
Deployment frequency = number of T7 events per day, per week, etc.
Change failure rate (CFR) = number of T7 deployments that did not perform well in Grafana / total number of T7 deployments (matched via deployment id / commit id / model id)
Time to restore = T7 - T5 for the clock tick after a bad model (via deployment id / commit id / model id)
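
As a rough illustration of how these definitions could be computed once the telemetry is collected, here is a minimal Python sketch. Everything in it is an assumption made for the example: the Deployment record, its field names, the sample timestamps, and the healthy flag, which I assume someone sets after looking at the dashboards. CFR is computed here as the share of deployments that were not healthy.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Deployment:
    deployment_id: str       # hypothetical id linking commit, model and deployment
    pr_opened_at: datetime   # T5
    deployed_at: datetime    # T7
    healthy: bool            # assumed verdict from dashboards, set out of band


def lead_time(d):
    """Lead time for iterating in production: T7 - T5."""
    return d.deployed_at - d.pr_opened_at


def deployment_frequency(deployments, start, end):
    """Average number of T7 events per day in the window [start, end)."""
    days = (end - start) / timedelta(days=1)
    count = sum(1 for d in deployments if start <= d.deployed_at < end)
    return count / days


def change_failure_rate(deployments):
    """Share of deployments that did not perform well."""
    if not deployments:
        return 0.0
    return sum(1 for d in deployments if not d.healthy) / len(deployments)


def times_to_restore(deployments):
    """T7 - T5 of the first healthy deployment that follows a bad one."""
    restores = []
    failing = False
    for d in sorted(deployments, key=lambda d: d.deployed_at):
        if not d.healthy:
            failing = True
        elif failing:
            restores.append(lead_time(d))
            failing = False
    return restores


# Made-up sample data, purely for illustration.
DEPLOYMENTS = [
    Deployment("d1", datetime(2024, 1, 1, 9), datetime(2024, 1, 2, 9), True),
    Deployment("d2", datetime(2024, 1, 3, 9), datetime(2024, 1, 4, 9), False),
    Deployment("d3", datetime(2024, 1, 4, 12), datetime(2024, 1, 4, 18), True),
    Deployment("d4", datetime(2024, 1, 6, 9), datetime(2024, 1, 8, 9), True),
]

if __name__ == "__main__":
    window = (datetime(2024, 1, 1), datetime(2024, 1, 9))
    print("deployments per day:", deployment_frequency(DEPLOYMENTS, *window))
    print("change failure rate:", change_failure_rate(DEPLOYMENTS))
    print("times to restore:", times_to_restore(DEPLOYMENTS))
```

The lead times from T4 or T1 follow the same subtraction pattern, with the first commit time or the project kickoff time in place of T5.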

MLOps is moving closer to DevOps, and today we have better clarity on the different steps in an ML project.