DevOps KPI

April 20, 2021

What are DevOps Key Performance Indicators?

DevOps is a culture, enabled by best practices and appropriate tools. The approach aims to improve the design of software products and services through constant feedback and continuous monitoring. To judge whether it is working, however, you need to know which metrics to evaluate.

MTTR – Mean Time to R...

This one is a bit tricky because MTTR can represent four different measures: the R can mean “repair”, “recovery”, “respond” or “resolve”. Although these measures may seem similar, each one has its own meaning.
Consequently, it is important to apply the communication principle promoted by DevOps and to clarify which MTTR is meant. Before starting to measure your MTTR, your teams must agree on what is going to be tracked and make sure that everyone is talking about the same thing.

Mean Time to Repair

MTTR (mean time to repair) is the average time required to repair a system. It is a critical measure of the maintainability of systems, whether applications or infrastructure, and it is used to assess how efficiently those systems are restored when an IT incident occurs. Tracking it is particularly valuable because nearly 90% of repair time is spent identifying the problem.
Calculating the Mean Time to Repair:
The MTTR is calculated by adding up the time spent on repairs during a given period, then dividing that total by the number of repairs.
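
As a minimal sketch of this calculation, assuming the repair durations have already been recorded in hours (the function name `mean_time_to_repair` and the sample values are illustrative and do not come from any particular tool):

```python
def mean_time_to_repair(repair_hours):
    """Average repair time: total time spent on repairs / number of repairs."""
    if not repair_hours:
        raise ValueError("at least one repair is needed to compute an MTTR")
    return sum(repair_hours) / len(repair_hours)


# Hypothetical data: four repairs in one month, durations in hours.
repairs = [1.5, 0.5, 3.0, 1.0]
mttr = mean_time_to_repair(repairs)
print(f"MTTR (repair): {mttr:.2f} h")
```

With these sample values the total is 6 hours across 4 repairs, which gives an MTTR of 1.5 hours.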

Mean Time to Recovery

In DevOps metrics, MTTR (mean time to recovery) is the average time to recover from a product or system failure. It covers the whole length of the outage, from the moment the system or product fails to the moment it is fully operational again.

Calculation of Mean Time to Recovery:
Mean Time to Recovery is calculated by adding up all the downtime during a specific period and dividing it by the number of incidents. It is usually measured in hours and may refer to business hours, not clock hours.

Mean Time to Resolve

MTTR (mean time to resolve) is the average time it takes to fully resolve an incident. It covers the time needed to detect the failure, diagnose the problem and repair it, as well as the time spent ensuring that the failure does not recur. This indicator is closely tied to customer satisfaction.

Calculating the Mean Time to Resolve:
To calculate this MTTR, you add up the total time to resolution during a given period and divide it by the number of incidents.

Mean Time to Respond

MTTR (mean time to respond) is the average time it takes to recover from a product or system failure, counted from the moment you are first alerted to the failure. This time does not include the latency of your alert system. Mean Time to Respond is useful in cybersecurity, combined with MTTD (Mean Time to Detect).

Calculating the Mean Time to Respond:
To calculate this mean response time, add up the response times of all incidents, each measured from the alert to the moment the service or product is fully functional again. Then divide by the number of incidents.
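
The difference between mean time to recovery and mean time to respond is easiest to see on the same set of incidents. The sketch below assumes each incident is stored with three timestamps and measures clock time rather than business hours; the `Incident` class, its field names and the sample data are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    failed_at: datetime    # moment the system actually failed
    alerted_at: datetime   # moment the team was first alerted
    restored_at: datetime  # moment the service was fully functional again


def mean_duration(durations):
    """Average a non-empty list of timedeltas."""
    return sum(durations, timedelta()) / len(durations)


def mean_time_to_recovery(incidents):
    # Whole outage: from the failure to full restoration.
    return mean_duration([i.restored_at - i.failed_at for i in incidents])


def mean_time_to_respond(incidents):
    # Starts at the alert, so the latency of the alert system is excluded.
    return mean_duration([i.restored_at - i.alerted_at for i in incidents])


# Hypothetical incidents, for illustration only.
incidents = [
    Incident(datetime(2021, 4, 1, 9, 0), datetime(2021, 4, 1, 9, 10),
             datetime(2021, 4, 1, 10, 0)),
    Incident(datetime(2021, 4, 8, 14, 0), datetime(2021, 4, 8, 14, 20),
             datetime(2021, 4, 8, 16, 0)),
]

print("Mean time to recovery:", mean_time_to_recovery(incidents))
print("Mean time to respond: ", mean_time_to_respond(incidents))
```

For these two incidents the mean time to recovery is 90 minutes and the mean time to respond is 75 minutes; the 15-minute gap is the average alert latency.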

MTTD – Mean Time to Detect

MTTD is the average time it takes for a team to identify an issue. It is a critical metric to monitor in cybersecurity.

Calculating Mean Time to Detect:
To find the MTTD, add up the detection times of all incidents for a given team member or period and divide by the number of incidents.
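
A short sketch of this calculation, assuming the detection time of an incident is the gap between when the issue started and when it was identified. The (started, detected) pairs and the function name are hypothetical.

```python
from datetime import datetime
from statistics import mean

# Hypothetical (started_at, detected_at) pairs for one period.
incidents = [
    (datetime(2021, 4, 2, 8, 0), datetime(2021, 4, 2, 8, 12)),
    (datetime(2021, 4, 9, 22, 30), datetime(2021, 4, 9, 23, 0)),
    (datetime(2021, 4, 15, 13, 5), datetime(2021, 4, 15, 13, 23)),
]


def mean_time_to_detect_minutes(pairs):
    """Average detection delay in minutes over all incidents."""
    return mean(
        (detected - started).total_seconds() / 60
        for started, detected in pairs
    )


print(f"MTTD: {mean_time_to_detect_minutes(incidents):.1f} min")
```

Here the detection delays are 12, 30 and 18 minutes, so the MTTD is 20 minutes.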

Defect escape rate

Clearly, in a perfect world, all bugs are found during the testing phase on the development server. However, in real life, and because DevOps is about delivering code quickly, it is important to track this metric to minimize the number and severity of defects that make it to production.

The percentage of critical and escaped defects is an important KPI. It keeps the team and its testing efforts focused on fixing critical product issues and defects, which helps ensure the quality of both the overall testing process and the product.
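
The text above does not give a formula, so the sketch below assumes a commonly used definition: the share of all known defects that were found in production rather than before release. The function name and the counts are hypothetical.

```python
def defect_escape_rate(found_in_production, found_before_release):
    """Percentage of all known defects that escaped to production.

    Assumed definition: escaped / (escaped + caught before release) * 100.
    """
    total = found_in_production + found_before_release
    if total == 0:
        return 0.0  # no defects recorded at all
    return 100.0 * found_in_production / total


# Hypothetical release: 36 defects caught in testing, 4 reported in production.
rate = defect_escape_rate(found_in_production=4, found_before_release=36)
print(f"Defect escape rate: {rate:.1f}%")
```

With 4 escaped defects out of 40 in total, the escape rate is 10%.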

“Fueled by data and empowered by automation, IT can operate in real-time, be predictive and rely on detailed data to have a true seat at the table, delivering strategic value for their organization and for their customers.”
— Joseph Bradley

Reliability and maintenance metrics

MTBF – Mean Time Between Failures

MTBF (mean time between failures) is the average time that elapses between one failure of a system and the next. It is used to monitor both the availability and the reliability of a product, and it is a key maintenance metric for assessing the performance and security of critical or complex assets. It is one of the few KPIs you want to increase: the longer the time between failures, the more reliable the system.

Calculation of the mean time between failures:

MTBF is the total hours of uptime in a period divided by the number of failures that occurred in that same period. It is usually measured in hours, but if you are lucky enough to be able to measure it in days: congratulations!
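
A minimal sketch of the MTBF calculation, assuming the uptime for the period is already known in hours. The function name and the figures are illustrative.

```python
def mtbf_hours(uptime_hours, failures):
    """Mean time between failures: uptime in the period / failures in the period."""
    if failures <= 0:
        raise ValueError("MTBF is undefined when no failure occurred in the period")
    return uptime_hours / failures


# Hypothetical 30-day period (720 h) with 6 h of downtime spread over 3 failures.
uptime = 720 - 6
mtbf = mtbf_hours(uptime, failures=3)
print(f"MTBF: {mtbf:.0f} h (about {mtbf / 24:.1f} days)")
```

In this example the system was up for 714 hours and failed 3 times, so the MTBF is 238 hours.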

MTTF – Mean Time to Failure

This metric is often confused with mean time between failures (MTBF), and the two are wrongly used interchangeably. The main difference is that MTTF applies to non-repairable equipment, while MTBF applies to repairable systems. Software can be repaired and will experience multiple failures over its lifetime, so it has periods of time between failures. A non-repairable item, such as an SSD, functions properly for a period of time before failing permanently, so it has only one failure over its lifetime.

How to calculate MTTF:
To calculate MTTF, divide the total number of hours of operation by the total number of items in use.

Automated tests (in percentage)

DevOps relies heavily on automation, so tracking your automated tests is extremely important. A raw count of the automated tests run for a given build does not account for the redundancy, efficiency or variability of those tests. Expressed as a percentage, the metric goes beyond that simple count and helps define business risks more clearly. With it, teams can focus their test automation efforts on the tests that matter most to the business.

Deployment metrics

Lead time for changes

This is an important metric in the DevOps model as it helps assess the efficiency of processes. The DevOps philosophy implies deploying “small batches” of code frequently, which is why it is so important to evaluate the time needed to implement, test and deliver the code. To do this, two important pieces of data are needed: when the commit occurred and when the code was deployed.
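
Given those two timestamps, lead time is simply the difference between them. The sketch below averages it over several changes; the (commit, deploy) pairs and the function name are hypothetical, and averaging is one choice among several (a median is also common).

```python
from datetime import datetime, timedelta

# Hypothetical (committed_at, deployed_at) pairs.
changes = [
    (datetime(2021, 4, 12, 10, 0), datetime(2021, 4, 12, 16, 0)),  # 6 h
    (datetime(2021, 4, 13, 9, 30), datetime(2021, 4, 14, 9, 30)),  # 24 h
    (datetime(2021, 4, 14, 15, 0), datetime(2021, 4, 14, 18, 0)),  # 3 h
]


def mean_lead_time(pairs):
    """Average time between a commit and its deployment."""
    lead_times = [deployed - committed for committed, deployed in pairs]
    return sum(lead_times, timedelta()) / len(lead_times)


print("Mean lead time for changes:", mean_lead_time(changes))
```

For these three changes the lead times add up to 33 hours, which gives a mean lead time of 11 hours.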

Deployment frequency

This is the number of software deployments over a given period. Note that this metric concerns the technical performance of the deployment pipeline, not release frequency, because not all deployments are pushed to production. It can be measured in several ways, including through automated deployment pipelines, API calls and manual scripts.
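
As an illustration, deployment frequency can be derived from a plain list of deployment dates by grouping them per week. The dates below are hypothetical, and grouping by ISO week is an assumption; any period would work the same way.

```python
from collections import Counter
from datetime import date

# Hypothetical deployment dates exported from a pipeline.
deployments = [
    date(2021, 4, 5),
    date(2021, 4, 6),
    date(2021, 4, 6),
    date(2021, 4, 13),
    date(2021, 4, 15),
]


def deployments_per_week(days):
    """Count deployments per (ISO year, ISO week number)."""
    return Counter(d.isocalendar()[:2] for d in days)


for (year, week), count in sorted(deployments_per_week(deployments).items()):
    print(f"{year}-W{week:02d}: {count} deployment(s)")
```

With this sample, ISO week 14 of 2021 has 3 deployments and week 15 has 2.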

Deployment stability

Any deployment that causes problems or failures for your users is considered a failed deployment. Deployment stability is the percentage of time that the most recent build for a given repository has been successful. Of course, DevOps teams are supposed to build quality into the product from the beginning of the project, but failures and mistakes can happen. Not all broken builds are due to code errors; the cause may be an infrastructure issue that needs to be resolved. It becomes a problem when these errors persist for a long time and start to affect developer productivity.

The failure rate should be as low as possible: 0 is the magic number, and a deployment failure rate below 5% is considered acceptable.
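
A small sketch of the failure-rate check, assuming each deployment is recorded as a simple success flag. The function name, the constant and the sample data are hypothetical; the 5% threshold is the one quoted above.

```python
ACCEPTABLE_FAILURE_RATE = 5.0  # percent, threshold quoted in the text


def deployment_failure_rate(outcomes):
    """Percentage of failed deployments; each outcome is True on success."""
    if not outcomes:
        raise ValueError("no deployments recorded")
    failures = sum(1 for succeeded in outcomes if not succeeded)
    return 100.0 * failures / len(outcomes)


# Hypothetical history: 40 deployments, 1 of which failed.
history = [True] * 39 + [False]
rate = deployment_failure_rate(history)
acceptable = rate < ACCEPTABLE_FAILURE_RATE
print(f"Failure rate: {rate:.1f}% (acceptable: {acceptable})")
```

One failure in 40 deployments gives a failure rate of 2.5%, which is under the 5% threshold.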

Service metrics

Ticket volume

Ticket volume or total number of tickets tracks all tickets in your support queue over a period. Bugs and errors can sometimes bypass the testing phase and be detected by the end user, resulting in a new ticket being opened. The number of customer tickets reported as “problems” or bugs is a major indicator of the reliability of the application. A large number of tickets indicates quality problems, while a small number indicates the robustness of the application. In other words, this number is a measure of end-user satisfaction.
A variation of this metric is the total number of conversations, which counts all exchanges with customers, whether through an official support ticket, social networks, or any other channel.

Application and traffic usage

After a deployment, it is a good idea to check whether the level of application usage, such as the number of transactions or of users accessing the system, looks normal. If traffic suddenly drops to zero or spikes (for example, if you are using microservices), something is wrong.
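
One simple way to automate this check is to compare traffic after the deployment with a baseline measured before it. The sketch below is only an illustration: the function name, the requests-per-minute unit and the 50% tolerance are assumptions to adapt to your own system.

```python
def traffic_looks_normal(baseline_rpm, current_rpm, tolerance=0.5):
    """Return True if current traffic is within +/- tolerance of the baseline.

    baseline_rpm: typical requests per minute before the deployment
    current_rpm:  requests per minute observed after the deployment
    tolerance:    allowed relative deviation (0.5 means +/- 50%), an assumption
    """
    if baseline_rpm <= 0:
        raise ValueError("baseline must be positive")
    lower = baseline_rpm * (1 - tolerance)
    upper = baseline_rpm * (1 + tolerance)
    return lower <= current_rpm <= upper


# Hypothetical readings against a baseline of 1000 requests per minute.
for observed in (950, 0, 4000):
    status = "ok" if traffic_looks_normal(1000, observed) else "investigate"
    print(f"{observed} rpm: {status}")
```

A reading close to the baseline passes, while a drop to zero or a fourfold spike is flagged for investigation.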
