Maximising your Platform Engineering Efficiency

5 Key Pillars for evaluating platform engineering success

Feb 21, 2024

Why Measure Platform Engineering?

Platform engineering metrics are crucial for understanding the performance, reliability, and efficiency of a platform engineering function. They provide valuable insights into the health of the infrastructure, help identify areas for improvement, and enable informed decision-making.
In today’s dynamic landscape, these metrics enable organisations to continuously improve their systems, mitigate risks, optimise resource allocation, plan for scalability, and align with overarching business objectives. In essence, they empower decision-makers to make choices that drive innovation, resilience, and growth within the organisation.

Gartner predicts that by 2026, 80% of software engineering organisations will establish “platform teams as internal providers of reusable services, components and tools for application delivery.”


What to Measure

For platform engineering, there are hundreds of metrics, which can be broadly classified into the categories below:

Operational Excellence
Compliance
Production Readiness
Cost Efficiency
Developer Experience

1. Operational Excellence

DORA

DevOps Research and Assessment (DORA), backed by Google. These are research-based metrics that provide a standard framework for understanding operational and organisational performance.

Deployment frequency

How frequently changes are getting pushed into production

Change lead time

How long it takes for a commit to reach production

Change failure rate

How frequently a deployment introduces a failure which requires an immediate intervention

Failed Deployment Recovery Time

How long it takes to recover from a failed deployment
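As a rough illustration, the four DORA metrics above can be derived from a list of deployment records. This is only a sketch: the `Deployment` record shape and the `dora_summary` helper are hypothetical names, and the sample data is made up. Adapt the fields to whatever your CI/CD tooling actually exports.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    # Hypothetical record shape; adapt to what your CI/CD tool exports.
    commit_at: datetime                      # when the change was committed
    deployed_at: datetime                    # when it reached production
    failed: bool = False                     # needed immediate intervention?
    recovered_at: Optional[datetime] = None  # when service was restored


def dora_summary(deployments: list[Deployment], window_days: int) -> dict:
    """Summarise the four DORA metrics over a reporting window."""
    if not deployments:
        raise ValueError("no deployments in the window")
    failures = [d for d in deployments if d.failed]
    lead_times = [d.deployed_at - d.commit_at for d in deployments]
    recoveries = [d.recovered_at - d.deployed_at
                  for d in failures if d.recovered_at is not None]
    return {
        "deployments_per_day": len(deployments) / window_days,
        "median_lead_time": median(lead_times),
        "change_failure_rate": len(failures) / len(deployments),
        "median_recovery_time": median(recoveries) if recoveries else None,
    }


# Made-up sample: four deployments in one week, one of which failed.
t0 = datetime(2024, 2, 1, 9, 0)
deployments = [
    Deployment(t0, t0 + timedelta(hours=2)),
    Deployment(t0 + timedelta(days=1), t0 + timedelta(days=1, hours=4)),
    Deployment(t0 + timedelta(days=2), t0 + timedelta(days=2, hours=6),
               failed=True, recovered_at=t0 + timedelta(days=2, hours=7)),
    Deployment(t0 + timedelta(days=3), t0 + timedelta(days=3, hours=8)),
]
summary = dora_summary(deployments, window_days=7)
print(summary)
```

The median is used here instead of the mean so that a single unusually slow change does not dominate the lead time; that is a design choice for this sketch, not a requirement of DORA.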

MTTx

MTTD (Mean time to detect): time taken to detect an issue

This indicates how good your observability is and how well your alerting thresholds have been designed for quickly isolating issues.

MTTR (Mean time to restore/resolve): time taken to restore service after an issue

This gives you context about your engineering efficiency, your incident management process, the availability of run-books, and much more.

There are multiple other MTTx metrics, but I believe the above two are the most important.
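A minimal sketch of how MTTD and MTTR could be computed from incident timestamps. It assumes both are measured from the start of impact (some teams measure MTTR from detection instead), and the incident tuples below are invented sample data.

```python
from datetime import datetime, timedelta

# Each incident: (started_at, detected_at, restored_at). Made-up sample data.
incidents = [
    (datetime(2024, 2, 1, 10, 0), datetime(2024, 2, 1, 10, 5),
     datetime(2024, 2, 1, 10, 45)),
    (datetime(2024, 2, 3, 14, 0), datetime(2024, 2, 3, 14, 15),
     datetime(2024, 2, 3, 15, 15)),
]


def mean_delta(deltas: list[timedelta]) -> timedelta:
    """Arithmetic mean of a non-empty list of durations."""
    return sum(deltas, timedelta()) / len(deltas)


def mttd(incidents) -> timedelta:
    """Mean time from start of impact to detection."""
    return mean_delta([detected - started for started, detected, _ in incidents])


def mttr(incidents) -> timedelta:
    """Mean time from start of impact to restoration (assumed definition)."""
    return mean_delta([restored - started for started, _, restored in incidents])


print("MTTD:", mttd(incidents))
print("MTTR:", mttr(incidents))
```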

Support Metrics

Number of Support tickets Opened/Triaged/Closed based on Severity

Understand and classify all support issues according to the severity hierarchy you follow, for example:
Critical → Major → Medium → Minor

Number of Tickets classified as Configuration bugs

This helps to identify and isolate which changes went in, how they reached production, whether there was a change request, and where unit testing can improve.

Number of Tickets which need feature development efforts

This gives you clarity on where to invest in your future roadmap, and hence on capacity planning. It also makes you aware of any drift between your platform capabilities and user requirements.

Number of total incidents per week and their Revenue Impact

Revenue impact is itself an important metric for any organisation, and it defines or redefines various business priorities. Measure it for every impacting incident so that it can feed into the RCA.

2. Compliance

DevSecOps Security Metrics

Number of vulnerabilities per week based on priority

Run multiple Security scanning tools to identify and rank vulnerabilities

Vulnerabilities per layer

Kernel vulnerabilities
OS-based vulnerabilities
Container-based vulnerabilities

Vulnerability Patching Rate

Identify the patching cadence needed for each of your environments, and define the effort and capacity required for it.

Vulnerability breaching Age

Notify on, audit, and monitor defects and their ageing, even when they are low in severity.
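To make the patching rate and breaching age ideas concrete, here is a small sketch. The `Vulnerability` record, the `MAX_AGE_DAYS` limits, and the sample findings are all hypothetical; real scanners export richer data, and your own policy decides the allowed age per severity.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Hypothetical maximum age (in days) allowed per severity before an open
# vulnerability counts as breaching. Pick values that match your own policy.
MAX_AGE_DAYS = {"critical": 7, "high": 30, "low": 90}


@dataclass
class Vulnerability:
    layer: str                         # "kernel", "os" or "container"
    severity: str                      # key into MAX_AGE_DAYS
    found_on: date
    patched_on: Optional[date] = None  # None while still open


def patching_rate(vulns: list[Vulnerability]) -> float:
    """Share of known vulnerabilities that have been patched."""
    return sum(v.patched_on is not None for v in vulns) / len(vulns)


def breaching(vulns: list[Vulnerability], today: date) -> list[tuple[Vulnerability, int]]:
    """Open vulnerabilities older than the allowed age, with their age in days."""
    result = []
    for v in vulns:
        age = (today - v.found_on).days
        if v.patched_on is None and age > MAX_AGE_DAYS[v.severity]:
            result.append((v, age))
    return result


# Made-up sample findings.
today = date(2024, 2, 21)
vulns = [
    Vulnerability("kernel", "critical", date(2024, 2, 1), date(2024, 2, 3)),
    Vulnerability("os", "high", date(2024, 1, 10)),
    Vulnerability("container", "low", date(2023, 11, 1)),
    Vulnerability("container", "low", date(2024, 2, 15)),
]
overdue = breaching(vulns, today)
for v, age in overdue:
    print(f"{v.layer}/{v.severity} open for {age} days")
```

Note that the low-severity container finding still shows up once it is old enough, which is the point of tracking age separately from severity.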

Audit Compliance Metrics

PCI Audit Cycle and benchmarking

For organisations handling payment card industry (PCI) data, it is of paramount importance to get this audit done routinely in order to prevent revenue leakage, avoid fraud, and gain customers’ trust.

SOX Audit Benchmarking

The Sarbanes-Oxley Act (SOX) also requires various organisations to audit their practices and procedures for data backup, change management, access control, etc.

Internal Audit Scoring

As your org prepares for the above two, a shift-left approach is needed: run internal audits at various levels of your platform configuration to gain confidence before going for an external audit.

3. Production Readiness Metrics

Reliability Score

Uptime

A basic metric for measuring the availability of your tools/services. Always keep the user’s perspective in mind: instead of computing the uptime of each component, collate them and try to measure uptime for a specific service.

SLO/SLI

Google-backed SRE principles fundamentally encourage you to define SLOs (service level objectives) for each of your services by identifying the right SLIs (service level indicators).

Error Budget

Identify the downtime that is acceptable for each service, based on its criticality and as agreed with your business domain. Create notifications at various thresholds to monitor how much of the budget each incident consumes.

RCA Completion Rate

Blameless RCAs are the strength of any reliable system. Action items coming out of an RCA should be prioritised, and there should always be capacity allotted to them. RCA completion rate therefore identifies how many RCAs are successfully closed end to end, with all action items completed within the defined timeframe.
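The error budget mentioned in this section is simple arithmetic, sketched below for an availability SLO. The function names and the 50% / 75% / 100% notification thresholds are assumptions for illustration; choose thresholds that suit your own services.

```python
def error_budget_minutes(slo: float, window_days: int) -> float:
    """Allowed downtime in minutes for an availability SLO over a window."""
    return (1.0 - slo) * window_days * 24 * 60


def budget_consumed(slo: float, window_days: int, downtimes: list[float]) -> float:
    """Fraction of the error budget used up by incident downtimes (minutes)."""
    return sum(downtimes) / error_budget_minutes(slo, window_days)


def crossed(consumed: float, thresholds=(0.5, 0.75, 1.0)) -> list[float]:
    """Notification thresholds that the consumed fraction has reached."""
    return [t for t in thresholds if consumed >= t]


# A 99.9% SLO over 30 days allows roughly 43.2 minutes of downtime.
budget = error_budget_minutes(0.999, 30)
consumed = budget_consumed(0.999, 30, [10.0, 12.5])  # two made-up incidents
alerts_due = crossed(consumed)
print(f"budget={budget:.1f} min, consumed={consumed:.0%}, notify at {alerts_due}")
```

With two incidents totalling 22.5 minutes, a little over half of the monthly budget is gone, so only the first notification threshold fires.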

Code Quality Score

Code Coverage

Not a difficult score to achieve: nowadays many IDEs provide extensions that help with it, and developers just need to adhere to them.

Test Coverage

Test coverage is important for achieving quality and stability in production. Corner cases, fault injection, unit testing, and integration testing all come into consideration here.

Documentation

How well your documentation and run-books are drafted is directly proportional to your engineering efficiency. Again, use the docs-as-code functionality offered by various IDEs as a shift-left approach.

Observability

Monitoring Thresholds (Number of Alerts v/s Number of Issues)

Decluttering noise is just as important, so that incidents get the attention they need. Tune your alerting indicators regularly as traffic grows, versions are upgraded, and other routine changes land.

Number of Alerts categorised by Priority

Classifying alerts helps in understanding whether urgent support is needed, i.e. which alerts are eligible for paging. Segregation is also important for measuring the improvement rate of incidents at each severity.

Pagers schedule and Run-books

Pager rotation hygiene, incident acknowledgements, and service coverage give a clear picture of the health of your support. The engineering team should own the process and drive it willingly.
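The "number of alerts v/s number of issues" idea from this section can be tracked with something as small as the sketch below. The alert dictionaries and the `actionable` flag (whether an alert corresponded to a real issue) are hypothetical; in practice you would derive them from your alerting and incident tools.

```python
from collections import Counter

# Made-up alert log; "actionable" marks alerts that pointed at a real issue.
alerts = [
    {"priority": "P1", "actionable": True},
    {"priority": "P2", "actionable": False},
    {"priority": "P2", "actionable": True},
    {"priority": "P3", "actionable": False},
    {"priority": "P3", "actionable": False},
]


def actionable_ratio(alerts: list[dict]) -> float:
    """Share of alerts that mapped to a real issue (higher means less noise)."""
    return sum(a["actionable"] for a in alerts) / len(alerts)


def alerts_by_priority(alerts: list[dict]) -> Counter:
    """Alert counts per priority, e.g. to decide what is eligible for paging."""
    return Counter(a["priority"] for a in alerts)


print(f"actionable: {actionable_ratio(alerts):.0%}")
print(dict(alerts_by_priority(alerts)))
```

Watching this ratio over time, rather than at a single point, is what shows whether threshold tuning is actually reducing noise.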

Utilisation

Throughput

It’s a basic and much-needed metric for any system to run smoothly within known boundaries. Perform PoCs and simulated traffic testing for all workloads before running them in production, in order to define their throughput under various parameters; for example, running pods in burst mode needs thorough testing. Throughput also helps in designing step-wise scaling in production.

Scaling Capacity

The right sizing of individual and shared platform components should be known, so that you can alert when usage reaches a given level of scaling and then take horizontal or vertical scaling decisions.

At each layer, platform engineering should be aware of utilisation capacity, whether it’s data storage or compute (clusters, nodes, pods, IPs).

Dimensioning should be published as a specification sheet for the platform so that all users are aware of it.

4. Cloud Cost Efficiency Metrics

Cloud Cost Efficiency Score

Total Cloud cost expenses per week

It’s an important business metric to target. Each cloud provider’s billing is already visible to the finance/FinOps team in the cloud account consoles, so define alerting/monitoring for any abrupt spikes or drops.

Migration/Introduction Cost

Any new tool or service should have its migration or introduction cost defined for ROI, and that cost should be a key criterion in build-vs-buy or move-vs-stay decisions.

Resource Provisioned v/s Resource Utilised

Whatever function your org has for this (FinOps, platform engineering, a cloud team), cloud cost optimisation has become an unspoken target for any business. Keep a close watch on how many resources are actually requested and how many are utilised by the business.

Decommissioning Save

Unused cloud resources/platforms should have a retention policy in place so that they are cleaned up automatically after a certain time. All test clusters and PoC clusters, which usually end up stale, are in scope.
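A quick sketch of the "provisioned v/s utilised" comparison from this section. The workload names, the CPU figures, and the 50% threshold are invented for illustration; real numbers would come from your cloud billing and monitoring data.

```python
# Made-up data: workload -> (CPU cores requested, average CPU cores used).
workloads = {
    "checkout": (8.0, 2.0),
    "search": (4.0, 3.0),
    "poc-cluster": (16.0, 0.0),
}


def utilisation(requested: float, used: float) -> float:
    """Fraction of the requested capacity that is actually used."""
    return used / requested if requested else 0.0


def overprovisioned(workloads: dict, threshold: float = 0.5) -> list[str]:
    """Workloads using less than `threshold` of what they requested."""
    return [name for name, (req, used) in workloads.items()
            if utilisation(req, used) < threshold]


total_requested = sum(req for req, _ in workloads.values())
total_used = sum(used for _, used in workloads.values())
overall = utilisation(total_requested, total_used)

print(f"overall utilisation: {overall:.0%}")
print("review for rightsizing or decommissioning:", overprovisioned(workloads))
```

A workload with zero utilisation, like the stale PoC cluster here, is a natural candidate for the retention and clean-up policy described above.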

Chargeback and Attribution

Total Cloud Cost attributed to business

As part of cost transparency, the platform engineering function should have only the cost of core platform plugins attributed to it. All other cloud resources should be attributed back to the business domain they belong to, which gives each business unit an accurate view of its investment. Hence it’s useful to measure the total number of apps being charged back and their corresponding cumulative cost to the business.

Tagging of Application and resources running over cloud

Tagging each cloud resource helps not only with attribution but also with governance: it shows which business area uses the resource and in what allocation percentage.

Forecast Rate

Strategic business decisions are empowered by roadmaps, and predicting cloud cost helps in setting correct expectations against the expected business growth. Continuously compare this forecast with actual expenditure to confirm that cost to the business stays aligned.
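The chargeback, tagging, and forecast ideas in this section can be sketched together as below. The `business_unit` tag key, the resource list, and the forecast figure are all hypothetical; treat this as an illustration of the arithmetic rather than a billing integration.

```python
from collections import defaultdict

# Made-up monthly cost per resource with its tags.
resources = [
    {"id": "db-1", "cost": 1200.0, "tags": {"business_unit": "payments"}},
    {"id": "k8s-nodes", "cost": 3000.0, "tags": {"business_unit": "platform"}},
    {"id": "cache-1", "cost": 300.0, "tags": {"business_unit": "payments"}},
    {"id": "vm-old", "cost": 500.0, "tags": {}},
]


def chargeback(resources: list[dict], tag_key: str = "business_unit") -> dict:
    """Total cost per tag value; resources without the tag go to 'untagged'."""
    totals = defaultdict(float)
    for r in resources:
        totals[r["tags"].get(tag_key, "untagged")] += r["cost"]
    return dict(totals)


def tagging_coverage(resources: list[dict], tag_key: str = "business_unit") -> float:
    """Share of total cost that carries the attribution tag."""
    total = sum(r["cost"] for r in resources)
    tagged = sum(r["cost"] for r in resources if tag_key in r["tags"])
    return tagged / total


def forecast_error(forecast: float, actual: float) -> float:
    """Relative deviation of actual spend from the forecast (+ means overspend)."""
    return (actual - forecast) / forecast


totals = chargeback(resources)
coverage = tagging_coverage(resources)
error = forecast_error(forecast=4800.0, actual=sum(totals.values()))
print(totals, f"coverage={coverage:.0%}", f"forecast error={error:+.1%}")
```

Keeping an explicit "untagged" bucket makes the gap in attribution visible instead of silently spreading it across business units.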

5. Developer Experience Metrics

Onboarding Metrics

Time to onboard

It’s the time taken by an engineer to understand the developer tools and pipelines and to run their first deployment by pushing their first commit.

Release Velocity

How frequently new product releases happen, i.e. move from the development phase to production. This iterative metric helps you gauge customer feature acceptance and reduce churn. It also throws light not only on development capabilities but also on how smoothly the DevOps pipeline facilitates releases.

Product Market Fit

Developer Adoption Rate

While building a platform as a product, we need to know the usage rate of individual features as well as of the complete platform. Sometimes the loudest users tend to get their features prioritised, but we need to consider business priorities and ROI to make contextual and informed decisions about which features to release.

Developer Engagement Rate

There has been a lot of talk about the Internal Developer Platform, which is about self-service and accelerating developer productivity. I am not going into that much detail here, as it has its own set of metrics such as PRs and review measurements.
Here, developer engagement simply means looking at the utilities, automation, and tools you have invested in for developer experience and seeing how much they are actually used, so that you can keep upgrading them to increase engagement and they do not end up unused or become technical debt.
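One simple way to put numbers on adoption and engagement is to compare the developers who actually use each platform feature against everyone who could. The feature names and user sets below are invented, and how you collect usage data (portal analytics, pipeline logs, and so on) is left open.

```python
# Made-up data: developers eligible to use the platform, and who used what.
eligible = {"ana", "bo", "cy", "di"}
feature_usage = {
    "self-service-namespace": {"ana", "bo", "cy"},
    "golden-pipeline": {"ana"},
    "cost-dashboard": set(),
}


def adoption_rate(active_users: set, eligible_users: set) -> float:
    """Share of eligible developers who actively use a feature."""
    return len(active_users & eligible_users) / len(eligible_users)


rates = {feature: adoption_rate(users, eligible)
         for feature, users in feature_usage.items()}
unused = [feature for feature, rate in rates.items() if rate == 0]

for feature, rate in rates.items():
    print(f"{feature}: {rate:.0%}")
print("candidates for rework or retirement:", unused)
```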

CSAT Survey

Lastly, a CSAT score, i.e. customer satisfaction score, is the biggest metric of all.

It is beneficial for gauging customer needs and understanding their biggest problems. It can run in both push and pull mode: in push mode you run these surveys at a regular cadence or after bigger releases; in pull mode you open a channel over Slack, a JIRA intake, or your internal developer portal where users can give feedback any time they want.
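For reference, a CSAT score is commonly reported as the percentage of respondents who chose one of the top ratings. The sketch below assumes a 1 to 5 scale where 4 and 5 count as satisfied; both the scale and the cut-off are conventions you may define differently, and the responses are made up.

```python
def csat(responses: list[int], satisfied_from: int = 4) -> float:
    """Percentage of responses at or above the 'satisfied' rating."""
    if not responses:
        raise ValueError("no survey responses")
    satisfied = sum(r >= satisfied_from for r in responses)
    return 100.0 * satisfied / len(responses)


# Made-up survey results on a 1-5 scale.
responses = [5, 4, 3, 5, 2, 4, 4, 1]
score = csat(responses)
print(f"CSAT: {score:.1f}%")
```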

A Few Pitfalls to Avoid


Goodhart’s law: when a measure becomes a target it ceases to be a good measure

That means that when everyone bends their behaviour or tries different ways just to hit the metric targets, the accuracy and effectiveness of the metrics diminish.

No single set of metrics suits all. Each organisation has unique use cases and implementation strategies, so tailor the metrics to your own, and research and adapt accordingly.
Steer clear of unrealistic comparisons of metrics between different functions. Metrics provide context; comparisons without that context won’t serve any value.
Resist the temptation to equate metrics with individual performance. Doing so damages credibility and overlooks the collaborative efforts that drive success.

LIT Yourself Up
