Productivity KPIs for ML teams

2022-07-14

The DORA four keys define (1) Deployment Frequency, (2) Lead Time for Changes, (3) Time to Restore Services, and (4) Change Failure Rate, which together indicate the performance of software delivery. As performance measurement metrics, the DORA four keys fit well philosophically with the mission of any Machine Learning (ML) team.

In addition, there is a definitive industry trend under the term MLOps, which tries to conceptually bridge ML projects with software projects and DevOps. It is therefore natural for an ML team to think about its own KPIs in terms of the DORA four keys.

While the DORA four keys can help measure project performance and productivity, individual models within a domain should additionally be tracked with model observability and monitoring metrics specific to the use case.

What is an ML project?

First, let’s define what an ML project is. An ML project can be in one of two distinct and mutually exclusive phases: (1) the exploration phase, and (2) the productionisation and iteration phase.

Exploration phase

An ML project in the exploration phase has (1) a clear problem statement defined in some document, (2) a set of Jupyter notebooks or scripts that roughly documents the approaches and datasets explored, and (3) a score / tracking / visualisation of all approaches (e.g. TensorBoard, MLflow, wandb.ai etc.).

The notebooks or scripts themselves specify what data was used, what kind of exploration was made, and which model candidates were tried out. This is run many times (typically in the order of hundreds of times) with small changes. Scientists want immediate feedback on what changed and how the approach performed with each incremental change they make.
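In practice this tracking is what tools like MLflow, TensorBoard or wandb.ai provide. To make the idea concrete, here is a minimal, self-contained sketch (all names and scores are hypothetical, not any real tool's API): every incremental change becomes one recorded run, and the scientist can immediately compare it to earlier runs.

```python
import time
from dataclasses import dataclass, field

@dataclass
class Run:
    """One notebook/script execution with its inputs and score."""
    params: dict
    score: float
    timestamp: float = field(default_factory=time.time)

class ExperimentTracker:
    """Toy stand-in for MLflow/wandb-style tracking."""
    def __init__(self):
        self.runs = []

    def log(self, params, score):
        self.runs.append(Run(params, score))

    def best(self):
        return max(self.runs, key=lambda r: r.score)

tracker = ExperimentTracker()
tracker.log({"model": "xgboost", "max_depth": 4}, score=0.81)
tracker.log({"model": "xgboost", "max_depth": 6}, score=0.84)
tracker.log({"model": "logreg"}, score=0.78)
print(tracker.best().params)
```

The run log itself is the telemetry: the count and timestamps of runs are exactly the kind of signal (T2 below) that the exploration phase produces.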

Production and iteration phase

An ML project in the productionisation and iteration phase (1) has a formal representation of the pipeline, and (2) uses CI/CD.

The pipeline roughly represents (a) data loading, (b) data transformation and feature engineering, (c) model training, (d) model validation, (e) model deployment, and (f) model monitoring and feedback loop; all in a formally defined manner that's committed to Git and deployed via CI/CD. The pipeline itself is defined in some DSL or orchestration framework (e.g. SageMaker pipelines, Vertex pipelines, Kubeflow pipelines, Airflow, etc.) and each stage of the pipeline has some underlying infrastructure involved (e.g. Databricks, SageMaker, Vertex AI etc.).
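The stages (a) through (e) can be sketched framework-agnostically as plain functions chained together. This is a toy illustration with placeholder logic, not any particular DSL; real projects would express the same structure in SageMaker, Vertex, Kubeflow or Airflow, but the essential point is that the pipeline is explicit code committed to Git, with validation gating deployment:

```python
# Toy, framework-agnostic sketch of pipeline stages (a)-(e).
# All data and "model" logic are illustrative placeholders.

def load_data():                        # (a) data loading
    return [1.0, 2.0, 3.0, 4.0]

def transform(rows):                    # (b) feature engineering: centering
    mean = sum(rows) / len(rows)
    return [r - mean for r in rows]

def train(features):                    # (c) model "training"
    return {"weight": sum(features)}

def validate(model):                    # (d) validation gate before deployment
    return abs(model["weight"]) < 1e-9

def deploy(model):                      # (e) model deployment
    return {"endpoint": "staging", "model": model}

def run_pipeline():
    features = transform(load_data())
    model = train(features)
    if not validate(model):
        raise RuntimeError("validation failed; blocking deployment")
    return deploy(model)

release = run_pipeline()
```

Because every stage is an explicit, versioned unit, CI/CD can run the whole chain on each commit, which is what makes the telemetry in the next section possible.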

Projects in the productionisation and iteration phase typically run on some schedule or cadence (e.g. daily, weekly, or whenever a new data snapshot is available), in contrast to the exploration phase, where the project is run hundreds of times in short bursts.

Discretizing the user journey

In order to measure the DORA four keys, we need telemetry data over a discretized view of our user journeys. Let’s define the discretized view for each phase as below.

Exploration phase

| STEP | CODE | DESCRIPTION | TELEMETRY |
|------|------|-------------|-----------|
| 1 | UNDERSTAND | Understand what to do and whether to do it at all | N/A |
| 2 | FORMULATE | Formulate the problem statement and design proposal in a document | T1 = doc creation time or project kickoff date in JIRA / Asana / Basecamp / Notion etc. |
| 3 | EXPLORE | Repeatedly run experiments on a combination of datasets, see if new data must be acquired, explore different approaches and track them | T2 = Jupyter notebook executions or MLflow project creation etc. |
| 4 | BASELINE | Reach a consensus among all stakeholders involved on the correct approach and datasets | T3 = stakeholder approval in JIRA / Asana / Basecamp / Notion etc. |
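Given this telemetry, the overall lead time of the exploration phase falls out as T3 minus T1, with T2 giving experiment throughput. A small sketch with entirely hypothetical dates and counts:

```python
from datetime import datetime

# Hypothetical telemetry: T1 from a ticketing tool, T3 from the approval
# record, and an experiment count (T2) from the tracking tool.
T1 = datetime(2022, 3, 1)   # FORMULATE: doc creation / project kickoff
T3 = datetime(2022, 4, 12)  # BASELINE: stakeholder approval
n_experiments = 240         # T2: notebook executions between T1 and T3

exploration_lead_time = T3 - T1
experiments_per_day = n_experiments / exploration_lead_time.days
print(exploration_lead_time.days, round(experiments_per_day, 1))
```

A falling experiments-per-day rate or a stretching T1-to-T3 gap is an early signal that exploration is stalling.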

Production and iteration phase

| STEP | CODE | DESCRIPTION | TELEMETRY |
|------|------|-------------|-----------|
| 1 | BOOTSTRAP | Create an initial pipeline with some pipeline orchestration framework | T4 = first Git commit to the main branch |
| 2 | LOCAL CHANGE | Add or introduce the first change to the model locally in a feature branch in an IDE (e.g. VSCode) and verify it is ready to commit upstream | N/A |
| 3 | OPEN PULL REQUEST | Open a PR to the main branch and wait until the stack is deployed in some staging env, the deployment checks out, and the model is validated | T5 = Git commit time |
| 4 | COMMITS TO PULL REQUEST | Repeatedly make local changes and commit to the PR until a reasonable outcome is observed (e.g. successful deployment or good model performance) | T6 = Git commit times and model training / deployment metadata in the cloud console |
| 5 | OBSERVE | Understand how the system performs in the real world via Grafana / CloudWatch and go back to steps 2–4 (LOCAL CHANGE …) if small steering is needed; if massive steering is needed, go back to the exploration phase | T7 = model deployment metadata timestamp in the cloud console |
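T5 and T6 can be recovered from version control itself, for instance by parsing commit timestamps from `git log --format="%H %cI"`. A sketch over hypothetical log output (the hashes and dates are made up):

```python
from datetime import datetime

# Hypothetical output of: git log --format="%H %cI" on a PR branch.
git_log = """\
a1b2c3 2022-06-09T12:04:00+00:00
d4e5f6 2022-06-08T10:30:00+00:00"""

commit_times = [
    datetime.fromisoformat(line.split()[1]) for line in git_log.splitlines()
]

# Time from the first commit on the PR (T5) to the last commit (T6):
pr_duration = max(commit_times) - min(commit_times)
```

Joining these commit timestamps with deployment metadata from the cloud console (T7) is what turns the raw user journey into measurable lead times.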

The discrete user journey steps in ML projects are usually discussed at length, but with limited clarity on a clean delineation of the steps in each of the two phases. To address this, I define mnemonics for the above: (1) "Under the ForEx Base" (UNDERstand, FORmulate, EXplore, BASEline) for the exploration phase, and (2) "BLOCO" (Bootstrap, Local change, Open PR, Commits, Observe) for the productionisation and iteration phase.

The rationale for such a discretization comes from my subjective but informed exposure to tens, or rather hundreds, of ML projects at Zalando over the last five years.

Ideas for defining the DORA four keys for ML projects

With clear discrete steps in an ML project and corresponding telemetry, the definitions can be deduced naturally. For example, Deployment Frequency can be derived from the count of model deployments (T7) per unit of time, Lead Time for Changes from the elapsed time between a commit (T5, T6) and the corresponding deployment (T7), and Change Failure Rate and Time to Restore Services from failure signals observed in the OBSERVE step.
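As one concrete, entirely hypothetical illustration, the four keys could be computed from deployment events assembled out of the T5 and T7 telemetry, plus a failure flag from the OBSERVE step; all timestamps below are invented:

```python
from datetime import datetime, timedelta

# Hypothetical deployment events: T5 (PR opened), T7 (deployed), and a
# failure flag plus restore time from monitoring (the OBSERVE step).
deployments = [
    {"pr_opened": datetime(2022, 6, 1, 9),  "deployed": datetime(2022, 6, 1, 15),
     "failed": False},
    {"pr_opened": datetime(2022, 6, 8, 10), "deployed": datetime(2022, 6, 9, 12),
     "failed": True, "restored": datetime(2022, 6, 9, 18)},
    {"pr_opened": datetime(2022, 6, 20, 8), "deployed": datetime(2022, 6, 20, 11),
     "failed": False},
]
window_days = 30

# (1) Deployment Frequency: deployments per day in the window.
deployment_frequency = len(deployments) / window_days

# (2) Lead Time for Changes: PR opened (T5) to deployed (T7).
lead_times = [d["deployed"] - d["pr_opened"] for d in deployments]
mean_lead_time = sum(lead_times, timedelta()) / len(lead_times)

# (3) Time to Restore Services: deployment failure to restoration.
restore_times = [d["restored"] - d["deployed"] for d in deployments if d["failed"]]
mean_time_to_restore = sum(restore_times, timedelta()) / len(restore_times)

# (4) Change Failure Rate: share of deployments that failed in production.
change_failure_rate = sum(d["failed"] for d in deployments) / len(deployments)
```

The same structure works regardless of whether the events come from Git, the cloud console, or a ticketing tool; what matters is that each of T5–T7 is captured consistently.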

MLOps is moving closer to DevOps, and today we have better clarity on the different steps in an ML project.