The DORA four keys are (1) Deployment Frequency, (2) Lead Time for Changes, (3) Time to Restore Service, and (4) Change Failure Rate. Together they indicate the performance of software delivery. As performance measurement metrics, they fit well philosophically with the mission of any Machine Learning (ML) team.
In addition, there is a clear industry trend around the term MLOps, which tries to conceptually bridge ML projects with software projects and DevOps. It is therefore natural for an ML team to think about its own KPIs in terms of the DORA four keys.
While the DORA four keys can help measure project performance and productivity, individual models within a domain should use various model observability and monitoring metrics specific to the use case.
First, let’s define what an ML project is. An ML project is in one of two distinct, mutually exclusive phases: (1) the exploration phase and (2) the productionisation and iteration phase.
An ML project in the exploration phase has (1) a clear problem statement defined in some document, (2) a set of Jupyter notebooks or scripts that roughly document the approach and datasets explored, and (3) a score, tracking, or visualisation of all approaches (e.g. TensorBoard, MLflow, wandb.ai).
The notebooks or scripts themselves specify what data was used, what kind of exploration was made, and what model candidates were tried out. They are run many times (typically hundreds of times) with small changes. Scientists want immediate feedback on what changed and how the approach performed with each incremental change they make.
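The run-many-times-with-small-changes loop described here can be sketched without any tracking library. This is only an illustration: `ExperimentRun`, `log_run`, `best_run` and `changed_params` are hypothetical names, and a real project would more likely rely on a tool such as MLflow or TensorBoard.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ExperimentRun:
    """One notebook or script execution (hypothetical record, not an MLflow type)."""
    params: dict
    score: float
    started_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


def log_run(history, params, score):
    """Append a run to the in-memory history and return it."""
    run = ExperimentRun(params=dict(params), score=score)
    history.append(run)
    return run


def best_run(history):
    """Return the run with the highest score (assumes higher is better)."""
    return max(history, key=lambda run: run.score)


def changed_params(previous, current):
    """Return {name: (old, new)} for every parameter that differs between two runs."""
    names = set(previous.params) | set(current.params)
    return {
        name: (previous.params.get(name), current.params.get(name))
        for name in names
        if previous.params.get(name) != current.params.get(name)
    }
```

Comparing two consecutive runs with `changed_params` and looking at the difference in `score` gives the "what changed, and did it help" feedback that the exploration phase depends on.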
An ML project in the productionisation and iteration phase has (1) a formal representation of the pipeline and (2) CI/CD.
The pipeline roughly represents (a) data loading, (b) data transformation and feature engineering, (c) model training, (d) model validation, (e) model deployment, and (f) model monitoring and a feedback loop, all formally defined, committed to Git, and deployed via CI/CD. The pipeline itself is defined in some DSL or orchestration framework (e.g. SageMaker Pipelines, Vertex AI Pipelines, Kubeflow Pipelines, Airflow) and each stage of the pipeline runs on some underlying infrastructure (e.g. Databricks, SageMaker, Vertex AI).
Typically, projects in the productionisation and iteration phase run on some schedule or cadence (e.g. daily, weekly, or whenever a new data snapshot is available), in contrast to the exploration phase, where the project is run hundreds of times in short bursts.
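To make "a formal representation of the pipeline" concrete, here is a framework-agnostic sketch in plain Python. It is not how SageMaker, Vertex AI, Kubeflow or Airflow define pipelines; the stage functions, the toy threshold "model", and the `min_accuracy` gate are all assumptions made for illustration.

```python
def load_data(ctx):
    # Toy (feature, label) pairs standing in for a data snapshot.
    return {**ctx, "rows": [(0.0, 0), (1.0, 0), (2.0, 1), (3.0, 1)]}


def transform(ctx):
    # Feature engineering: scale the single feature into [0, 1].
    top = max(x for x, _ in ctx["rows"])
    return {**ctx, "features": [(x / top, y) for x, y in ctx["rows"]]}


def train(ctx):
    # The "model" is a threshold halfway between the two class means.
    pos = [x for x, y in ctx["features"] if y == 1]
    neg = [x for x, y in ctx["features"] if y == 0]
    threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return {**ctx, "threshold": threshold}


def validate(ctx):
    # Accuracy of the threshold model on the same toy data.
    hits = sum((x > ctx["threshold"]) == bool(y) for x, y in ctx["features"])
    return {**ctx, "accuracy": hits / len(ctx["features"])}


def deploy(ctx):
    # Gate deployment on the validation result.
    return {**ctx, "deployed": ctx["accuracy"] >= ctx["min_accuracy"]}


def monitor(ctx):
    # Placeholder for monitoring and the feedback loop.
    return {**ctx, "alerts": [] if ctx["deployed"] else ["model not deployed"]}


# The ordered stage list is the formal, version-controllable definition.
PIPELINE = [
    ("load", load_data),
    ("transform", transform),
    ("train", train),
    ("validate", validate),
    ("deploy", deploy),
    ("monitor", monitor),
]


def run_pipeline(stages, ctx):
    """Run each stage in order, recording which stages completed."""
    for name, stage in stages:
        ctx = {**stage(ctx), "completed": ctx.get("completed", []) + [name]}
    return ctx
```

Because the stages live in one ordered list, the whole definition can be committed to Git and executed by CI/CD, for example with `run_pipeline(PIPELINE, {"min_accuracy": 0.9})`.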
To measure the DORA four keys, we need telemetry data for a discretized view of our user journeys. Let’s define that discretized view as below for each phase.
Exploration phase:

|Step|Activity|Telemetry|
|---|---|---|
|1|Understand what to do and if to do at all| |
|2|Formulate the problem statement and design proposal in a document|T1 = doc creation time or project kickoff date in JIRA / Asana / Basecamp / Notion etc|
|3|Repeatedly run experiments on a combination of datasets, see if new data must be acquired, explore different approaches and track them|T2 = Jupyter notebook executions or MLflow project creation etc|
|4|Reach a consensus among all stakeholders involved for the correct approach and datasets|T3 = Stakeholders approval in JIRA / Asana / Basecamp / Notion etc|

Productionisation and iteration phase:

|Step|Activity|Telemetry|
|---|---|---|
|1|Create an initial pipeline with some pipeline orchestration framework|T4 = Git first commit to main branch|
|2 (LOCAL CHANGE)|Add or introduce first change to model locally in a feature branch in IDE (e.g. VSCode) and verify if ready to commit to upstream| |
|3 (OPEN PULL REQUEST)|Open a PR to the main branch and wait until stack is deployed in some staging env and deployment checks out and model is validated|T5 = Git commit time|
|4 (COMMITS TO PULL REQUEST)|Repeatedly do local changes and commit to the PR until a reasonable outcome is observed (e.g. successful deployment or good model performance)|T6 = Git commit time and model training / deployment metadata in cloud console|
|5|Understand how the system is performing in the real-world with some Grafana / CloudWatch and go back to step 2 through 4 (LOCAL CHANGE …) if any small steering is needed. In case massive steering is needed, go back to the experimentation phase|T7 = Model deployment metadata timestamp in cloud console|
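One way to work with the T1 to T7 markers is to collect them as `(marker, timestamp)` events and derive durations between milestones. The sketch below is an assumption about how that could look: the span names and the choice of start and end markers in `SPANS` are hypothetical and do not come from the table itself.

```python
from datetime import datetime


def first_seen(events):
    """Map each marker (e.g. "T1") to its earliest timestamp."""
    first = {}
    for marker, ts in events:
        if marker not in first or ts < first[marker]:
            first[marker] = ts
    return first


# Hypothetical spans between markers; adjust to whatever a team wants to track.
SPANS = {
    "exploration": ("T1", "T3"),        # kickoff to stakeholder approval
    "pipeline_setup": ("T3", "T4"),     # approval to first commit on main
    "first_deployment": ("T4", "T7"),   # first commit to first model deployment
}


def phase_durations(events):
    """Return a timedelta per span, skipping spans with a missing marker."""
    first = first_seen(events)
    return {
        name: first[end] - first[start]
        for name, (start, end) in SPANS.items()
        if start in first and end in first
    }
```

Markers such as T5, T6 and T7 occur many times over a project's life, which is why `first_seen` keeps only the earliest occurrence when measuring the time to a first milestone.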
Discussions of ML projects often run long without clearly delineating the steps in each of the two phases. To address this, I define a mnemonic for each phase: (1) Under the ForEx Base (for the exploration phase) and (2) BLOCO (for the productionisation and iteration phase).
The rationale for this discretization comes from my subjective but informed experience of looking at tens, or rather hundreds, of ML projects at Zalando over the last five years.
With clear, discrete steps in an ML project and corresponding telemetry, our definitions of the four keys can be deduced easily from these timestamps.
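As one hedged illustration of such definitions, the sketch below computes the four keys from deployment records. It assumes that lead time runs from the commit time (T5 or T6) to the deployment time (T7), which follows the usual DORA reading of Lead Time for Changes. The `failed` and `restored_at` fields are not among T1 to T7; they are hypothetical and would have to come from monitoring or incident tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    commit_time: datetime                    # T5 or T6: commit behind the change
    deploy_time: datetime                    # T7: model deployment timestamp
    failed: bool = False                     # assumption: set from monitoring
    restored_at: Optional[datetime] = None   # assumption: set from incident data


def deployment_frequency(deployments, window):
    """Deployments per day over an observation window (a timedelta)."""
    return len(deployments) / (window / timedelta(days=1))


def lead_time_for_changes(deployments):
    """Median time from commit to deployment."""
    return median(d.deploy_time - d.commit_time for d in deployments)


def change_failure_rate(deployments):
    """Share of deployments flagged as failed."""
    return sum(d.failed for d in deployments) / len(deployments)


def time_to_restore(deployments):
    """Median time from a failed deployment to restoration."""
    return median(
        d.restored_at - d.deploy_time
        for d in deployments
        if d.failed and d.restored_at is not None
    )
```

The median is used rather than the mean so that a single unusually slow change does not dominate the figure; that choice is a design preference here, not something the definitions require.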
MLOps is moving closer to DevOps, and today we have better clarity on the different steps in an ML project.