Exploring the unseen costs of building and maintaining ML systems — and how to mitigate them.

Machine learning has revolutionised how we solve complex problems, from personalised recommendations to autonomous driving. Yet, while rapid prototyping and deployment may seem like a quick win, they often come with a hidden price tag: technical debt. Drawing inspiration from the seminal paper “Hidden Technical Debt in Machine Learning Systems” by Sculley et al., this article delves into the unique challenges of ML technical debt and offers actionable strategies for mitigating long-term maintenance costs.

What is Technical Debt in ML?

Let us first understand what we mean by technical debt in machine learning systems.

In the software development field, technical debt (also known as design debt or code debt) refers to the implied cost of additional work in the future resulting from choosing an expedient solution over a more robust one. [Wikipedia]

In ML systems, however, this debt is compounded by ML-specific issues such as:

- Entanglement between features, where changing anything changes everything
- Undeclared consumers of model predictions
- Unstable or underutilised data dependencies
- Configuration debt and sprawling pipeline “jungles”
- Changes in the external world that silently degrade models

This debt can be difficult to detect because it exists at the system level rather than the code level. As the small black box in the figure below illustrates, only a small fraction of a real-world ML system is composed of ML code; the required surrounding infrastructure is vast and complex. Understanding these issues is therefore critical: they affect not only model performance but also the long-term maintainability and scalability of the entire ML system.

Real-world ML system components (source: [1])

Understanding the main causes of ML Tech Debt

1. Entanglement and the CACE Principle

The “Changing Anything Changes Everything” (CACE) principle captures how tightly interconnected the components of an ML system can be. A minor tweak in one part of the pipeline may trigger unexpected changes elsewhere. For example, changing the input distribution of one feature can change the effective importance of the other features the model relies on. The same applies to many of the hyperparameters used during model training.
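To make the CACE risk concrete, here is a minimal sketch (the feature names and the tolerance are hypothetical) that flags features whose learned importance shifted between two training runs, a cheap tripwire for entanglement:

```python
def importance_drift(before: dict[str, float],
                     after: dict[str, float],
                     tol: float = 0.1) -> list[str]:
    """Return features whose importance moved by more than `tol`
    between two training runs (a simple entanglement tripwire)."""
    all_feats = set(before) | set(after)
    return sorted(
        f for f in all_feats
        if abs(before.get(f, 0.0) - after.get(f, 0.0)) > tol
    )

# Hypothetical importances before and after changing one input feature:
run_a = {"age": 0.40, "income": 0.35, "tenure": 0.25}
run_b = {"age": 0.15, "income": 0.55, "tenure": 0.30}
print(importance_drift(run_a, run_b))  # tweaking one input shifted two others
```

Running a check like this after every retrain turns “something changed somewhere” into a named, reviewable diff.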

Actionable Strategies to mitigate this:

- Isolate models and serve ensembles so that components can be changed and validated independently.
- Monitor the model’s prediction behaviour so that CACE-style shifts are detected as they occur.
- Use regularisation to constrain the model towards known-good behaviour, limiting a change’s blast radius.

2. Undeclared Consumers

ML model predictions are often left accessible to other systems. Having no visibility into who consumes the model’s output is risky, because changes in the model silently affect the downstream processes that depend on it. This is also known as “visibility debt”.
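One way to turn undeclared consumers into declared ones is to gate predictions behind an explicit registry. A sketch, with hypothetical class and consumer names throughout:

```python
class PredictionService:
    """Serve predictions only to consumers that have registered,
    so every downstream dependency is known and auditable."""

    def __init__(self, model):
        self._model = model
        self._consumers: set[str] = set()

    def register(self, consumer_id: str) -> None:
        self._consumers.add(consumer_id)

    def predict(self, consumer_id: str, features):
        if consumer_id not in self._consumers:
            raise PermissionError(f"undeclared consumer: {consumer_id}")
        return self._model(features)

# Hypothetical usage: a stand-in model that sums its inputs.
service = PredictionService(model=lambda feats: sum(feats))
service.register("billing-dashboard")
print(service.predict("billing-dashboard", [1, 2, 3]))  # 6
```

In a real deployment the same idea shows up as API keys or access-controlled endpoints; the point is that nobody can consume predictions without leaving a record.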

Actionable Strategies to mitigate this:

- Restrict access to model outputs so that every consumer has to be explicitly declared.
- Back each consumer with a service-level agreement, and log who reads predictions so that nothing depends on the model silently.

3. Data Dependencies

Data is the lifeblood of any ML system. However, unstable or underutilised data signals can become even greater liabilities than code dependencies, because they are harder to surface with standard static analysis.

Actionable Strategies to mitigate this:

- Version input signals so that upstream changes do not silently alter model behaviour.
- Routinely evaluate and prune underutilised features (for example, with leave-one-feature-out experiments).
- Invest in tooling that makes data dependencies explicit and analysable, much as compilers do for code dependencies.

4. Configuration Debt and Pipeline Jungles

Over time, configurations and data pipelines can sprawl into unmanageable “jungles”. Misconfigured pipelines are expensive in terms of wasted effort, wasted money, and unnecessary delays. Nor should we assume that a highly configurable system is automatically more mature: every extra option is another place for errors to hide. A good configuration system should make it easy to specify one configuration as a small change from another, hard to make manual errors, easy to see the difference between two configurations, and possible to detect unused or redundant settings. Configurations should also be reviewed and version-controlled just like code.
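Some of these principles can be enforced mechanically. Here is a sketch of a pre-flight configuration check; the key names and bounds are hypothetical:

```python
# Hypothetical set of keys the training job actually reads.
KNOWN_KEYS = {"learning_rate", "batch_size", "model_dir"}

def check_config(config: dict) -> dict[str, list[str]]:
    """Flag unknown (likely unused or typo'd) keys and out-of-range values."""
    issues: dict[str, list[str]] = {"unknown": [], "invalid": []}
    for key in config:
        if key not in KNOWN_KEYS:
            issues["unknown"].append(key)
    lr = config.get("learning_rate")
    if lr is not None and not (0.0 < lr < 1.0):
        issues["invalid"].append("learning_rate")
    bs = config.get("batch_size")
    if bs is not None and bs <= 0:
        issues["invalid"].append("batch_size")
    return issues

cfg = {"learning_rate": 1.5, "batch_szie": 32, "model_dir": "/tmp/m"}
print(check_config(cfg))  # catches the typo'd key and the bad learning rate
```

Running such a check in CI catches the typo'd key and the out-of-range value before any compute is spent.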

5. Changes in the External World

Changes in the external world, such as shifts in data sources, regulatory requirements, or technological landscapes, can quietly degrade a model that once performed well, undermining the reliability and accuracy of its predictions.

Actionable Strategies to mitigate this:

Model Monitoring: Track prediction distributions, input statistics, and key business metrics over time, and alert when they drift beyond agreed limits.
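As a concrete monitoring sketch, the Population Stability Index (PSI) is one common way to quantify drift between the data a model was trained on and what it sees in production. The bin count and samples below are illustrative:

```python
import math

def psi(expected: list[float], actual: list[float], bins: int = 10) -> float:
    """Population Stability Index: ~0 means stable; larger values
    mean `actual` has drifted away from `expected`."""
    lo, hi = min(expected), max(expected)
    span = (hi - lo) or 1.0  # degenerate case: all values equal

    def fractions(sample: list[float]) -> list[float]:
        counts = [0] * bins
        for x in sample:
            idx = int((x - lo) / span * bins)
            counts[max(0, min(idx, bins - 1))] += 1  # clamp out-of-range values
        return [max(c / len(sample), 1e-6) for c in counts]  # avoid log(0)

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

train = [i / 100 for i in range(100)]  # what the model saw at training time
live = [x + 0.5 for x in train]        # the world shifted under us
print(round(psi(train, train), 4))     # ~0.0: no drift
print(round(psi(train, live), 4))      # large: time to investigate or retrain
```

A scheduled job that computes this per feature and alerts above a chosen threshold is often the cheapest first line of defence against external-world drift.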

End-to-End Testing: Regularly run the full pipeline, from raw data to served prediction, on known inputs so that silent breakage anywhere in the system fails loudly.
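An end-to-end test exercises the whole pipeline, not just the model, so a broken preprocessing step fails loudly. A toy sketch in which every pipeline stage is a stand-in:

```python
def preprocess(raw: list[str]) -> list[float]:
    # Stand-in for the real feature pipeline.
    return [float(x) for x in raw]

def model(features: list[float]) -> float:
    # Stand-in for the real model: predicts the mean of its inputs.
    return sum(features) / len(features)

def pipeline(raw: list[str]) -> float:
    return model(preprocess(raw))

def test_pipeline_end_to_end() -> None:
    # A fixed, known input should produce a prediction in the valid range
    # and match the expected value; run this on every deploy.
    pred = pipeline(["1.0", "2.0", "3.0"])
    assert 0.0 <= pred <= 100.0, "prediction outside valid range"
    assert abs(pred - 2.0) < 1e-9, "pipeline output changed unexpectedly"

test_pipeline_end_to_end()
print("end-to-end test passed")
```

The value of such a test is less in the assertion itself than in where it runs: wired into deployment, it catches integration breakage that unit tests of individual stages cannot see.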

The paper discusses many more causes of ML technical debt; I have chosen to explain the ones that resonate most strongly with my own experience. Please see the references section to learn more about the others.

Practical Guidelines for Practitioners

- Treat ML code as a small part of a much larger system, and budget for the surrounding infrastructure and its maintenance.
- Prefer simple, well-monitored pipelines over clever but opaque ones.
- Document your data dependencies, configurations, and the consumers of every model’s output.

Conclusion and Future Outlook

Managing technical debt in ML systems is an ongoing challenge. Recognising the hidden costs early and implementing strategies to mitigate them can save considerable time and resources in the long run. As ML systems continue to grow in complexity, the development of standardised tools and frameworks will be crucial. I encourage ML practitioners to take a proactive stance — audit your systems, simplify where possible, and share your learnings with the community.

What steps have you taken to manage technical debt in your ML projects?
Share your experiences in the comments below, and let’s build a more maintainable future for machine learning together.

References:

  1. D. Sculley et al., Hidden Technical Debt in Machine Learning Systems, https://proceedings.neurips.cc/paper_files/paper/2015/file/86df7dcfd896fcaf2674f757a2463eba-Paper.pdf