Exploring the unseen costs of building and maintaining ML systems — and how to mitigate them.

Machine learning has revolutionised how we solve complex problems, from personalised recommendations to autonomous driving. Yet, while rapid prototyping and deployment may seem like a quick win, they often come with a hidden price tag: technical debt. Drawing inspiration from the seminal paper “Hidden Technical Debt in Machine Learning Systems” by Sculley et al., this article delves into the unique challenges of ML technical debt and offers actionable strategies for mitigating long-term maintenance costs.

What is Technical Debt in ML?

Let us first understand what we mean by technical debt in machine learning systems.

In the software development field, technical debt (also known as design debt or code debt) refers to the implied cost of additional work in the future resulting from choosing an expedient solution over a more robust one. [Wikipedia]

In ML systems, however, this debt is compounded by ML-specific issues such as:

- Entanglement between features, where changing anything changes everything
- Undeclared consumers of model predictions
- Unstable or underutilised data dependencies
- Configuration debt and sprawling pipeline “jungles”
- Changes in the external world that silently degrade models

This debt can be difficult to detect because it exists at the system level rather than the code level. As the small black box in the figure below illustrates, only a small fraction of a real-world ML system is composed of ML code; the required surrounding infrastructure is vast and complex. Understanding these issues is therefore critical: they affect not only model performance but also the long-term maintainability and scalability of the entire ML system.

Real-world ML system components (source: [1])

Understanding the main causes of ML Tech Debt

1. Entanglement and the CACE Principle

The “Changing Anything Changes Everything” (CACE) principle captures how tightly interconnected the components of an ML system can be. A minor tweak in one part of the pipeline may trigger unexpected changes elsewhere. For example, changing the input distribution of one feature can change the effective importance of the other features the model relies on. The same applies to many of the hyperparameters used during model training.
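To make the CACE risk concrete, here is a minimal sketch (the feature names and the tolerance are hypothetical) that flags features whose learned importance shifted between two training runs, a cheap tripwire for entanglement:

```python
def importance_drift(before: dict[str, float],
                     after: dict[str, float],
                     tol: float = 0.1) -> list[str]:
    """Return features whose importance moved by more than `tol`
    between two training runs (a simple entanglement tripwire)."""
    all_feats = set(before) | set(after)
    return sorted(
        f for f in all_feats
        if abs(before.get(f, 0.0) - after.get(f, 0.0)) > tol
    )

# Hypothetical importances before and after changing one input feature:
run_a = {"age": 0.40, "income": 0.35, "tenure": 0.25}
run_b = {"age": 0.15, "income": 0.55, "tenure": 0.30}
print(importance_drift(run_a, run_b))  # tweaking one input shifted two others
```

Running a check like this after every retrain turns “something changed somewhere” into a named, reviewable diff.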

Actionable Strategies to mitigate this:

- Isolate models and serve ensembles so that components can be changed and validated independently.
- Monitor the model’s prediction behaviour so that CACE-style shifts are detected as they occur.
- Use regularisation to constrain the model towards known-good behaviour, limiting a change’s blast radius.

2. Undeclared Consumers

ML model predictions are often left accessible to other systems. Having no visibility into who consumes the model’s output is risky, because changes in the model silently affect the downstream processes that depend on it. This is also known as “visibility debt”.
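One way to turn undeclared consumers into declared ones is to gate predictions behind an explicit registry. A sketch, with hypothetical class and consumer names throughout:

```python
class PredictionService:
    """Serve predictions only to consumers that have registered,
    so every downstream dependency is known and auditable."""

    def __init__(self, model):
        self._model = model
        self._consumers: set[str] = set()

    def register(self, consumer_id: str) -> None:
        self._consumers.add(consumer_id)

    def predict(self, consumer_id: str, features):
        if consumer_id not in self._consumers:
            raise PermissionError(f"undeclared consumer: {consumer_id}")
        return self._model(features)

# Hypothetical usage: a stand-in model that sums its inputs.
service = PredictionService(model=lambda feats: sum(feats))
service.register("billing-dashboard")
print(service.predict("billing-dashboard", [1, 2, 3]))  # 6
```

In a real deployment the same idea shows up as API keys or access-controlled endpoints; the point is that nobody can consume predictions without leaving a record.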

Actionable Strategies to mitigate this:

- Restrict access to model outputs so that every consumer has to be explicitly declared.
- Back each consumer with a service-level agreement, and log who reads predictions so that nothing depends on the model silently.

3. Data Dependencies

Data is the lifeblood of any ML system. However, unstable or underutilised data signals can become even greater liabilities than code dependencies, because they are harder to surface with standard static analysis.

Actionable Strategies to mitigate this:

- Version input signals so that upstream changes do not silently alter model behaviour.
- Routinely evaluate and prune underutilised features (for example, with leave-one-feature-out experiments).
- Invest in tooling that makes data dependencies explicit and analysable, much as compilers do for code dependencies.

4. Configuration Debt and Pipeline Jungles

Over time, configurations and data pipelines can sprawl into unmanageable “jungles”. Misconfigured pipelines are expensive in terms of wasted effort, wasted money, and unnecessary delays. Nor should we assume that a highly configurable system is automatically more mature: every extra option is another place for errors to hide. A good configuration system should make it easy to specify one configuration as a small change from another, hard to make manual errors, easy to see the difference between two configurations, and possible to detect unused or redundant settings. Configurations should also be reviewed and version-controlled just like code.
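Some of these principles can be enforced mechanically. Here is a sketch of a pre-flight configuration check; the key names and bounds are hypothetical:

```python
# Hypothetical set of keys the training job actually reads.
KNOWN_KEYS = {"learning_rate", "batch_size", "model_dir"}

def check_config(config: dict) -> dict[str, list[str]]:
    """Flag unknown (likely unused or typo'd) keys and out-of-range values."""
    issues: dict[str, list[str]] = {"unknown": [], "invalid": []}
    for key in config:
        if key not in KNOWN_KEYS:
            issues["unknown"].append(key)
    lr = config.get("learning_rate")
    if lr is not None and not (0.0 < lr < 1.0):
        issues["invalid"].append("learning_rate")
    bs = config.get("batch_size")
    if bs is not None and bs <= 0:
        issues["invalid"].append("batch_size")
    return issues

cfg = {"learning_rate": 1.5, "batch_szie": 32, "model_dir": "/tmp/m"}
print(check_config(cfg))  # catches the typo'd key and the bad learning rate
```

Running such a check in CI catches the typo'd key and the out-of-range value before any compute is spent.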

5. Changes in the External World

Changes in the external world, such as shifts in data sources, regulatory requirements, or technological landscapes, can quietly degrade a model that once performed well, undermining the reliability and accuracy of its predictions.

Actionable Strategies to mitigate this:

Model Monitoring: Track prediction distributions, input statistics, and key business metrics over time, and alert when they drift beyond agreed limits.
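As a concrete monitoring sketch, the Population Stability Index (PSI) is one common way to quantify drift between the data a model was trained on and what it sees in production. The bin count and samples below are illustrative:

```python
import math

def psi(expected: list[float], actual: list[float], bins: int = 10) -> float:
    """Population Stability Index: ~0 means stable; larger values
    mean `actual` has drifted away from `expected`."""
    lo, hi = min(expected), max(expected)
    span = (hi - lo) or 1.0  # degenerate case: all values equal

    def fractions(sample: list[float]) -> list[float]:
        counts = [0] * bins
        for x in sample:
            idx = int((x - lo) / span * bins)
            counts[max(0, min(idx, bins - 1))] += 1  # clamp out-of-range values
        return [max(c / len(sample), 1e-6) for c in counts]  # avoid log(0)

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

train = [i / 100 for i in range(100)]  # what the model saw at training time
live = [x + 0.5 for x in train]        # the world shifted under us
print(round(psi(train, train), 4))     # ~0.0: no drift
print(round(psi(train, live), 4))      # large: time to investigate or retrain
```

A scheduled job that computes this per feature and alerts above a chosen threshold is often the cheapest first line of defence against external-world drift.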

End-to-End Testing: Regularly run the full pipeline, from raw data to served prediction, on known inputs so that silent breakage anywhere in the system fails loudly.
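An end-to-end test exercises the whole pipeline, not just the model, so a broken preprocessing step fails loudly. A toy sketch in which every pipeline stage is a stand-in:

```python
def preprocess(raw: list[str]) -> list[float]:
    # Stand-in for the real feature pipeline.
    return [float(x) for x in raw]

def model(features: list[float]) -> float:
    # Stand-in for the real model: predicts the mean of its inputs.
    return sum(features) / len(features)

def pipeline(raw: list[str]) -> float:
    return model(preprocess(raw))

def test_pipeline_end_to_end() -> None:
    # A fixed, known input should produce a prediction in the valid range
    # and match the expected value; run this on every deploy.
    pred = pipeline(["1.0", "2.0", "3.0"])
    assert 0.0 <= pred <= 100.0, "prediction outside valid range"
    assert abs(pred - 2.0) < 1e-9, "pipeline output changed unexpectedly"

test_pipeline_end_to_end()
print("end-to-end test passed")
```

The value of such a test is less in the assertion itself than in where it runs: wired into deployment, it catches integration breakage that unit tests of individual stages cannot see.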

The paper discusses many more causes of ML technical debt; I have chosen to explain the ones that resonate most strongly with my own experience. Please see the references section to learn more about the others.

Practical Guidelines for Practitioners

- Treat ML code as a small part of a much larger system, and budget for the surrounding infrastructure and its maintenance.
- Prefer simple, well-monitored pipelines over clever but opaque ones.
- Document your data dependencies, configurations, and the consumers of every model’s output.

Conclusion and Future Outlook

Managing technical debt in ML systems is an ongoing challenge. Recognising the hidden costs early and implementing strategies to mitigate them can save considerable time and resources in the long run. As ML systems continue to grow in complexity, the development of standardised tools and frameworks will be crucial. I encourage ML practitioners to take a proactive stance — audit your systems, simplify where possible, and share your learnings with the community.

What steps have you taken to manage technical debt in your ML projects?
Share your experiences in the comments below, and let’s build a more maintainable future for machine learning together.

References:

  1. D. Sculley et al., Hidden Technical Debt in Machine Learning Systems, https://proceedings.neurips.cc/paper_files/paper/2015/file/86df7dcfd896fcaf2674f757a2463eba-Paper.pdf