Understanding ML monitoring debt

This article was originally published on Towards Data Science and is part of an ongoing series exploring the topic of ML monitoring debt: how to identify it, and best practices to manage and mitigate its impact.

We’re all familiar with technical debt in software engineering, and at this point, hidden technical debt in ML systems is practically dogma. But what is ML monitoring debt? ML monitoring debt is when model monitoring is overwhelmed by the scale of the ML systems it’s meant to monitor, leaving practitioners to search for the proverbial needle in a haystack or, worse, hit ‘delete all’ on alerts.

ML monitoring is nowhere near as clear-cut as traditional APM monitoring. Not only are there no absolute truths when it comes to metrics and benchmarks, but models are not subject to economies of scale. It’s easy to spin up a new Kubernetes cluster, and the cluster will be subject to the same performance metrics, benchmarks, thresholds, and KPIs as its predecessors. But when you deploy a new model, even if it’s a pre-existing model and there has been no change to the artifact, it’s practically guaranteed that your references will be different. That means that you’re incurring debt for every model that you deploy to production and monitor.

What is a bad performance level? 80% accuracy? 60% accuracy?

Multiple factors need to be considered to identify a good/bad performance level, and the bottom line will be different depending on each model’s use case, segments, and, of course, data. In this post, we’ll explain the debt dimensions of ML model monitoring using “The four V’s of Big Data” framework, which lends itself surprisingly well to this comparison.

1. Veracity

High dimensionality

Measuring and monitoring a data-driven process dependent on 2–3 elements is reasonably straightforward. But ML is all about utilizing large amounts of data sources and entities to locate underlying, predictable patterns. Depending on the problem and dependent data, you could be looking at dozens, hundreds, or even thousands of features, each one of which should be monitored independently.

Model metrics

ML is a stochastic, data-driven world composed of multiple pipelines in production. This means that a host of metrics and elements need to be tracked and monitored for each entity, such as mean, standard deviation, and missing values for numerical elements, and cardinality, entropy, and more for categorical elements. Comprehensive model metrics go beyond features, data, and pipeline integrity to provide quantifiable metrics to analyze the relative quality of model inputs and outputs.
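As a minimal illustration of tracking these elements per feature, here’s a sketch in plain Python covering the stats named above (a production system would compute these over streaming windows, not in-memory lists):

```python
import math
from collections import Counter
from statistics import mean, stdev

def numeric_profile(values):
    """Basic monitoring stats for a numerical feature."""
    present = [v for v in values if v is not None]
    return {
        "mean": mean(present),
        "std": stdev(present),
        "missing_rate": 1 - len(present) / len(values),
    }

def categorical_profile(values):
    """Basic monitoring stats for a categorical feature."""
    present = [v for v in values if v is not None]
    counts = Counter(present)
    total = len(present)
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return {
        "cardinality": len(counts),
        "entropy": entropy,
        "missing_rate": 1 - len(present) / len(values),
    }

# Toy data: one numerical and one categorical feature
age = [34, 41, None, 29, 52]
channel = ["facebook", "web", "web", None, "email"]
print(numeric_profile(age))
print(categorical_profile(channel))
```

Multiply a handful of such stats by every feature, output, and label of every model, and the scale problem below becomes concrete.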

Chip Huyen recently published a comprehensive list of model metrics covering the entire model life cycle that’s worth checking out.

2. Volume

Volume in ML monitoring needs to be analyzed along two dimensions: throughput and data resolution.

Throughput

Models usually work on large amounts of data to automate a decision process. This poses an engineering challenge to monitor and observe the distribution and behavior of your dataset. A monitoring solution needs to detect data quality and performance issues in minutes while analyzing huge streams of data over time.

Resolution of data

Detecting issues at the subpopulation level requires the ability to slice data by segments, and it’s an analytical challenge as well as an engineering one. The nature of data and model performance may vary dramatically for the same metric under different subpopulations.

Data Resolution - Population vs. Subpopulation Segmentation

For example, a missing value indicator on a feature called “Age” may be 20% on the overall population, but for a specific channel, say Facebook, the value may be optional and is missing in 60% of cases, whereas for all other subpopulations it’s missing in only 0.5% of cases.

A high-level view will give you only so much information, particularly regarding subpopulations and the detailed resolutions critical to supporting business needs and decisions. Macro events that impact entire datasets or populations are things that everybody knows to look out for and are usually detected relatively quickly.

But this means that the engineering and analytical challenge of detecting issues in a huge stream of data is now multiplied by the number of different segments you need to monitor.
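The segment slicing described above can be sketched in a few lines, assuming rows arrive as plain dicts with a hypothetical `channel` segment key:

```python
from collections import defaultdict

def missing_rate_by_segment(rows, feature, segment_key):
    """Missing-value rate of `feature`, sliced by `segment_key`."""
    totals, missing = defaultdict(int), defaultdict(int)
    for row in rows:
        seg = row[segment_key]
        totals[seg] += 1
        if row.get(feature) is None:
            missing[seg] += 1
    return {seg: missing[seg] / totals[seg] for seg in totals}

rows = [
    {"channel": "facebook", "age": None},
    {"channel": "facebook", "age": None},
    {"channel": "facebook", "age": 31},
    {"channel": "web", "age": 27},
    {"channel": "web", "age": 45},
]
# Overall missing rate is 40%, but the Facebook segment is far worse
print(missing_rate_by_segment(rows, "age", "channel"))
```

The aggregate 40% hides the fact that one segment is broken while the other is healthy, which is exactly why every metric has to be tracked per segment.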

3. Velocity

Models serve the automation of business processes at different velocities, from daily or weekly batch predictions up to real-time, millisecond-level decisions at high scale. Depending on your use cases, you’ll need to be able to support varying velocities. Still, like with volume, velocity has an additional dimension to contend with: pipeline velocity, meaning looking at the entire inference flow as a pipeline for continuous improvement. To move fast without breaking things, you’ll need to reincorporate delayed feedback into your ML decision-making processes.

In some use cases, such as an Ad-tech real-time bidding algorithm, we will want to monitor for weekly seasonality effects while still detecting data quality or performance issues in a matter of minutes to avoid business catastrophes.

4. Variety

Last but not least, we come to variety. A successful model with business ROI spawns more models. Once you get past that first model hurdle and prove ML’s positive impact on business outcomes, both your team and your business will want to replicate this success and scale it. There are three ways to scale models, and they are not mutually exclusive.

Versions

ML is an iterative process, and versions are how we iterate. The real world is not static, so pipelines and models must be optimized continuously. Versions are constantly created for the same existing models, but each version is actually a totally different model instance that may have different features or even different baselines.

Use case scale

Adding a use case to your arsenal means you’re essentially restarting the entire MLOps cycle from scratch. You can carry over many things, especially when it comes to feature engineering, but when you deploy to production, you’ll have a new set and scale of model metrics to monitor. In addition to the technical side of ML monitoring, models drive business processes, and each process is different from the others. For the same loan approval model, risk and compliance teams may be concerned about potential biases due to regulatory concerns, business ops want to be the first to know if the model suddenly decides to decline loans across the board, ML engineers need to know about integrity and pipeline issues, and data science teams may be interested in slow drifts in model predictions. The point is that it’s multidisciplinary, and your stakeholders are interested in different aspects of the ML decision-making process. With a new process, you need to make sure that you’re delivering value fast.

Multi-tenancy scale

Multi-tenancy has exponential scale capacity. Deploying a model across multiple tenants makes sense when each tenant equals a population in its own right. For example, deploying a learning process that detects potential customer churn, but for each country separately (the tenant in this case). The result is a standalone model per country.

Making a decision like this can take you from a single fraud model to hundreds of fraud models overnight. And while they may share the same set of metrics, expected values, and behaviors will vary.
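The tenant-per-country pattern can be sketched as a simple split-and-train loop; `train_churn_model` here is a stand-in for a real trainer, not an actual API:

```python
from collections import defaultdict

def train_churn_model(rows):
    """Stand-in trainer: returns the tenant's churn rate as a toy 'model'."""
    return sum(r["churned"] for r in rows) / len(rows)

def train_per_tenant(rows, tenant_key="country"):
    """One standalone model (and one set of monitoring baselines) per tenant."""
    by_tenant = defaultdict(list)
    for row in rows:
        by_tenant[row[tenant_key]].append(row)
    return {tenant: train_churn_model(data) for tenant, data in by_tenant.items()}

rows = [
    {"country": "US", "churned": 1},
    {"country": "US", "churned": 0},
    {"country": "FR", "churned": 0},
]
models = train_per_tenant(rows)
print(models)  # one model per country
```

Every key in that dict is a model with its own expected values and behaviors to monitor, which is how one deployment decision multiplies monitoring debt.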

What we’ve learned about model monitoring debt

On the surface, model monitoring can seem deceptively straightforward. To be fair, with one or two models, it is feasible to monitor ML manually if you’re willing to invest the resources. But in ML engineering, just like software engineering, everything is a question of debt and scale. Is it worth taking on and paying down later? Model monitoring is neither simple nor straightforward, from both a technological and a process perspective, and as you scale, so does the difficulty of managing ML monitoring.

The 4 V’s illustrate why model monitoring is complex, and as an exercise in quantifying this problem, let’s think about the following numbers:

| | Scenario 1 | Scenario 2 | Scenario 3 |
|---|---|---|---|
| # models | 1 | 15 | 100 |
| Avg features/model | 100 | 100 | 100 |
| Avg segments/model | 10 | 10 | 10 |
| Avg metrics (features + outputs + labels) | 5 | 5 | 5 |
| Data points | 5,000 | 75,000 | 500,000 |

Simple ML monitoring noise calculation
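The bottom line above is a back-of-the-envelope multiplication; reproducing it makes the combinatorial growth obvious:

```python
def monitored_series(models, features, segments, metrics):
    """Rough count of metric streams a monitoring system must track:
    every model x feature x segment x metric combination."""
    return models * features * segments * metrics

# The three scenarios from the table: same per-model profile, more models
for models in (1, 15, 100):
    print(models, "models ->", monitored_series(models, features=100, segments=10, metrics=5))
```

Nothing here is exotic: 100 models with an unremarkable per-model profile already means half a million data points to watch.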

Now that we’ve quantified the inherent scale problem of ML monitoring and what causes it, the next step is to identify debt. The following parts of this series will deal with identifying debt indicators and best practices to manage and overcome model monitoring debt.

Stay tuned!

Want to see how Superwise can help you stay on top of monitoring debt?

Request a demo to see how!

2021 at Superwise: Let’s recap

In one day, 2021 will officially be a wrap. Before we all check out for some champagne and fireworks, let’s take a look at a few of our highlights from the last year and how Superwise is enabling customers to observe models at high scale. 

Connect anything, anywhere, by yourself

MLOps is a stack. It’s about best-in-breed solutions that streamline your entire model lifecycle and beyond. That’s why we went API-first this year, made integrating with the Superwise model observability platform fully flexible, model- and platform-agnostic, launched extensive documentation, and are continuously adding ecosystem integrations like the one we launched with New Relic, with many more coming next year.

We didn’t just double down on making Superwise an open platform – we also made any integration a matter of minutes and 100% self-service from the platform UI, Superwise SDK, and our APIs. 

Comprehensive metric discovery 

Our metrics were already great, and now they’re even better. All our metrics are automatically discovered and configured whenever you add a model or version. This leverages all of the best practices we’ve baked in to shorten your time to value by delivering out-of-the-box metrics for integrity, activity, distribution, and drift tailored to your models and data. We’ve also just released custom performance metrics, so you can express any business KPI you need to analyze and monitor. 

Self-service monitoring policy builder

No one wants to configure policies metric by metric. It’s slow, tedious, and not scalable, given how each model is unique, and monitoring use cases vary. That’s why we rolled out Superwise’s monitoring policy builder:

  • It lets you build and deploy policies within minutes 
  • Has flexible logic to support any unique use case
  • Automatically configures thresholds 
  • Lets you control sensitivity based on business impact.

Now you can logically express what events you need to be alerted about, and Superwise will continuously scan your models for you and ensure that the right team gets the right alert at the right time. 

Enterprise-grade management

We tripled our user base over the last 2 quarters alone. With more data science and ML engineering teams using the platform to observe their models in production, we added a host of authentication, security, and user management capabilities to the platform.  

User management, Multi-Factor Authentication, SAML, token management, and audit logs are all available for our customers on the platform. 

2021 has been a year marked with achievements across the board and not just in terms of customers onboarded, feature releases, and engineering accomplishments. We opened our first U.S. office in New York, doubled the team (and still are – check out our open positions here), and even had our first all-hands event since coming back from working remotely!

We’re proud of everything that our team across the globe has achieved over the last year, and given that we know what’s coming up next, 2022 is going to take model observability to a whole new level.  

Model observability: The path to production-first data science

Model observability has been all the rage in 2021, and with good reason. Applied machine learning is crossing the technology chasm, and for more and more companies, ML is becoming a core technology driving daily business decisions. Now that ML is front and center, in production, and business-critical, the need for model monitoring and observability is both plain and pressing. Practitioners across the board agree that ML is so fundamentally different from traditional code that models need a new breed of monitoring and observability solutions. All of this is true, but model observability can be so much more than the sum of its parts when adopted together with a production-first mindset. 

This article will explore what a modern model monitoring and observability solution should look like and how a production-first mindset can help your team proactively exploit opportunities that model observability presents. 

A new breed of observability

Before we dive into monitoring vs. observability, let’s take a second to talk about what it is about ML that requires a different approach. Ville Tuulos and Hugo Bowne-Anderson recently published a great article on why data makes MLOps so different from DevOps. Simply put, models aren’t linear; they are infinitely more immense, more complex, more volatile, and more individualistic than their traditional software counterparts. They operate at scale, both in terms of input complexity and input volume, and are exposed to constantly changing real-world data. So how do we effectively monitor and observe a system that we cannot model ourselves? Applications with no overarching truths like CPU metrics, where even the ground truths (if we know them at all) are subject to change? Processes where a ‘good’ result can be relative, temporal, or even irrelevant?

This is the true challenge of model observability. It’s not about visualizing drifts or building dashboards. That just results in data scientists and ML engineers babysitting their monitoring and looking for needles in a haystack. Model observability is about building a bigger-picture context, so you can express what you want to know without defining the minute details of each and every question and model.

The road to autonomous model observability

Model observability is paramount to the success of ML in production and our ability to scale ML operations. Still, just like ML itself, it is both a high-scale solution and a high-scale problem. The first step we need to take to achieve observability is to open the black box and get granular visibility into model behaviors. The ability to see everything down to the last detail is valuable, but it’s not practical at scale; you don’t have the capacity to look at everything all the time. Autonomous observability is about automatically showing you what’s important, when it matters, why it’s an issue, what should be done, and who needs to do it.

Autonomous model observability must do a specific set of things to justify its claim to fame. 


Step 1 is about creating visibility into the black box and discovering the set of metrics pertinent to a model and calibrating their scale. This visibility reflects all the different elements in the process, such as:

  • Inputs and pipeline health: tracking the drift levels of our inputs to ensure the model is still relevant and validating the health and quality of the incoming data.
  • Model decision-making stability: how robust is our ML decision-making process? For example, a model now rejects 40% of loan requests relative to the 20% it usually rejects. Is a specific feature now abnormally affecting the decision-making process?
  • Process quality: measuring the performance of the process, usually in supervised cases, based on the collected ground truth. Such performance metrics can be as simple as precision/recall or complex and even customized to the use case. For example, a normalized F1 score based on the transaction amount, or identifying weak spots that the model is not well optimized for.
  • Operational aspects: are there traffic variations? Are there changes in the population mix/composition? Is our label collection process stable?

All of the above should be accessible and visible across versions and on the segment level, as many cases occur for a specific sub-population and won’t be apparent on an aggregated view.

Feature distribution change

Step 2 is all about zeroing in on signals and eliminating noise by adapting to seasonality and subpopulation behaviors to surface abnormal behaviors. As domain experts, you are the ones who know what types of abnormalities are interesting (e.g., model shift on the subpopulation level), but it’s impossible to express them manually with single, static thresholds that take into consideration the seasonality and temporality of the process.
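One common way to approximate such adaptive thresholds (a sketch of the general idea, not Superwise’s actual detection logic) is to compare each point against a trailing window of its own history rather than a fixed limit:

```python
from statistics import mean, stdev

def adaptive_threshold_alerts(series, window=7, k=3.0):
    """Flag points more than k standard deviations away from a trailing
    window of recent values, instead of using one static threshold."""
    alerts = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma and abs(series[i] - mu) > k * sigma:
            alerts.append(i)
    return alerts

# A stable metric with one genuine anomaly at the end
series = [0.20, 0.21, 0.19, 0.20, 0.22, 0.21, 0.20, 0.21, 0.20, 0.19, 0.60]
print(adaptive_threshold_alerts(series))
```

The threshold moves with the data, so routine fluctuation stays quiet while the spike surfaces; real systems would additionally model weekly seasonality and compute this per segment.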

Step 3 is about identifying risks and streamlining troubleshooting with grouped issues and correlated events to build context for faster root-cause analysis. Correlation isn’t causation, but it’s an excellent place to start analyzing and troubleshooting issues. For example, take an ML process based on multiple data sources that exhibits a pipeline issue with a single source. In this case, we’ll probably see an issue/shift in all the features that were engineered from this source. Since these events are correlated, they should be displayed together, as there’s a strong indication that the underlying data is the root cause.
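A toy sketch of this grouping, assuming feature-to-source lineage is available (the feature and source names here are made up for illustration):

```python
from collections import defaultdict

# Hypothetical lineage mapping; real systems would pull this from
# pipeline metadata rather than hard-code it.
FEATURE_SOURCE = {
    "avg_txn_amount": "payments_db",
    "txn_count_7d": "payments_db",
    "age": "crm",
}

def group_events_by_source(drift_events):
    """Group per-feature drift events that share an upstream data source,
    so one pipeline issue surfaces as a single incident."""
    incidents = defaultdict(list)
    for feature in drift_events:
        incidents[FEATURE_SOURCE.get(feature, "unknown")].append(feature)
    return dict(incidents)

print(group_events_by_source(["avg_txn_amount", "txn_count_7d", "age"]))
```

Two drifting features collapse into one payments-pipeline incident, pointing the responder at the shared source instead of two independent alerts.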

Superwise incidents

A positive side effect of the context built by grouping correlated events is that it further reduces noise.

Step 4 (the monitoring part) is about taking action and letting you abstractly express business, data, and engineering failure scenarios and how stakeholders should be notified. ML is multidisciplinary, and for different issues, different teams, singularly or in collaboration, will own the resolution. With that in mind, not only is it critical that teams get alerts promptly with all of the in-context information they will need to take corrective action before issues impact business operations, but it’s also critical that the right teams get the right alerts; otherwise, your stakeholders will suffer from alert fatigue. Automating this process is vital to successfully embed the monitoring aspect of model observability within the existing processes of each team. 

In addition to all this, autonomous also means giving organizations the freedom to consume observability as they see fit. This goes beyond the self-governing aspects of model observability that discovers metrics and builds contexts. It is about enabling open platform accessibility that lets businesses holistically internalize model observability within their processes, existing serving platforms, and tools. With an open platform, it’s easy to connect and consume each step via APIs. That empowers the organization, builds ML trust, and enables higher-level customizations specific to each organization. 

Production-first model observability 

Model observability has the potential to be much more than a reactive measure to detect and resolve issues when models misbehave, and it’s the shift to a production-first mindset that holds the key to achieving these benefits. With production-first model observability, every decision to improve a model is supported by production evidence and data. It helps us validate that a model is creating ROI for the organization and ensure that everything we do, be it deploying a new version or adding features, increases ROI and the quality of our business. Production-first model observability completely disrupts the research-led mindset that had dictated data science and machine learning engineering for so long and opens the door to continuous model improvement.

Retraining is only the first and most obvious of continuous improvements. Many other continuous model improvement opportunities can be leveraged, such as A/B testing, shadow releases, multi-tenancy, hyperparameter tuning, and the list goes on. Production-first empowers us to answer our operational ML questions with our data instead of general rules of thumb and best practices. 

  • On what data should we retrain? Is ‘fresh is best’ actually true? As ML practitioners, we shouldn’t be leveraging historical assumptions – we can analyze and retrain proactively based on prediction behavior.
  • What subpopulations is our model not optimized for? How do we protect subpopulations that are prone to bias? 
  • How do we improve our existing model? Should we add features/data sources? Should we adopt a different algorithmic approach? Should we eliminate non-attributing features and reduce model complexity?

Production-first model observability exposes continuous improvement opportunities, which means shorter paths to production, robust deployments, faster time to value, and the ability to increase scale.

Want to see what autonomous model observability looks like?

Request a demo

So you want to be API-first?

Deciding to become an API-first product is not a trivial decision to be made by a company. There needs to be a deep alignment throughout the company, from R&D all the way to marketing, on why and how an API-first approach will accelerate development, go-to-market, and the business at large. But more importantly, just like you need product-market fit, you need product-market-API fit. There is a big difference between externalizing APIs and being API-first, and depending on your clients and their use cases, you’ll need to understand whether API or API-first is the right choice for you. 

This post explores how APIs and API-first impact both the business and R&D through the evolution we at Superwise went through as we became an API-first product and business. 

APIs are not just about code

Luckily, we don’t need to go into depth here. APIs are so common at this point that even the most non-technical of business personas knows that an API is an Application Programming Interface that standardizes communications so that any two apps can send/receive data between each other. The problem is that APIs are so ubiquitous today that occasionally, you’ll see businesses pushing for them without a strong product-API fit and/or product-dev maturity. 

Should your APIs be first-class citizens?

You need to weigh a set of criteria before deciding what to do regarding APIs: go all-in and become API-first, expose a set of APIs, or say no to APIs in their entirety. There is no magic number of yeses or noes here; you might even say yes to everything listed below, and still, API-first will be wrong for your product/business. 

Bigger picture fit

The first thing you need to figure out is where your API fits in the bigger picture and how integral it is to enhancing value. 

  • Is your solution part of a more extensive process? Yes, users tend to get annoyed with the overabundance of tools and platforms they need to use to do their jobs, but there is a big difference between a tool used monthly and a daily tool. 
  • Does consuming your solution via API generate more value for users? BI is an excellent example of higher value via API by making information accessible to all stakeholders in the organization.

Look at model observability, for example. It isn’t necessarily a day-to-day tool, but it is mission-critical, and when something goes wrong with ML in production, monitoring can trigger any set of processes to resolve the anomalies. Furthermore, almost always, you’ll also need to expose issues to other stakeholders in the organization so they can take preventive actions until the root cause is uncovered and the incident is resolved. 

Consistent reusability 

So you have a big picture fit, and your API creates additional value to your users; fantastic. Now think about how your users use your product and if this translates consistently, across your user base, to API usage. 

  • Is your product-market fit ubiquitous? Will most of your users want to use the API more or less in the same way? Social login is an excellent example of API-first. It’s a product with consistent reusability across the user base.
  • Can any organization implement your API? This is about both endpoints, not just your API. If the system you typically integrate into is niche or requires specific domain knowledge, it could be that not all organizations will be receptive to your API because they don’t have the necessary resources to bake it into their processes. 
  • Do your users need an API or all the APIs? Is it worth your time and effort to go API-first, or will you get the same impact with one or two APIs in a non-API-first approach?  
| | API-first | Just API | Rationale |
|---|---|---|---|
| Auth0/Frontegg | ✔️ | | All authentication and authorization processes are done by API calls. |
| Paypal | ✔️ | | All payments are done on the merchants’ website and sent to Paypal APIs. |
| Slack | | ✔️ | Slack provides APIs, but its main value is in giving users an amazing organizational chat experience – which kinda needs UI. |

Superwise’s journey to API-first

In all honesty, when we first started exposing APIs, we didn’t have a robust process in place, much less an API-first mentality. We were exposing quite a few APIs, some for internal use in our web application and some for direct customer consumption. It was a headache to maintain both the APIs and their documentation, and we had no flow in place to handle the influx of customer requests to change/create APIs. In addition, and probably most importantly, because we didn’t have a well-defined process and mindset in place, there was a ton of miscommunication and ‘lost in translation’ moments between our backend and frontend teams that resulted, more often than I want to admit, in bugs and over-fetching.

All of these problems stemming from our APIs made us sit down and think about what is right for us when it comes to APIs and how to build processes that facilitate scale, both ours and our customers, without the issues we experienced till now. The result is evident from this post’s title; we decided to go API-first. 

So what did we do?

APIs are a big deal for our customers, internal and external, and our product depends on the quality of our APIs and their ability to deliver value seamlessly. So for every API, we start by asking: 

  • Who is the client? Internal? External?
  • Are the API requests and responses aligned with the client’s use case? You need to find a balance between minimizing API calls and keeping invocations explicit. Too many API calls are inefficient, but confusing invocations are ineffective – both are detrimental to the user experience. 

Once we figured all this out, we documented our APIs to create a “construct” of how the APIs will be consumed. This gives our frontend team the ability to mock data and continue developing front-end features without waiting on support from the backend. This way of thinking about our APIs as an integral part of the product makes us always examine any request to ensure that our APIs stay reusable and flexible. 
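As a toy illustration of such a “construct”, here’s a minimal contract a frontend team could mock against; the route and field names are hypothetical, not our real API:

```python
# Hypothetical contract for a "create model" endpoint. The route and
# fields are illustrative stand-ins, not Superwise's actual API.
CONTRACT = {
    "route": "POST /v1/models",
    "request": {"name": str, "version": str},
    "response": {"id": int, "name": str},
}

def validate(payload, schema):
    """Check that a payload has exactly the contract's fields and types."""
    return set(payload) == set(schema) and all(
        isinstance(payload[k], t) for k, t in schema.items()
    )

# The frontend can develop against a mock that honors the contract,
# without waiting for the backend implementation.
mock_response = {"id": 1, "name": "churn-model"}
assert validate(mock_response, CONTRACT["response"])
```

Both teams code against the same schema, so a mismatch surfaces as a failed validation in development instead of a bug or over-fetch in production.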

The advantages of going API-first

Going API-first, both technically and in terms of mindset, had a powerful impact on our ability to scale the application and integrate with external services rapidly. Before we started thinking about our APIs as first-class citizens, when we had a load on a specific API, it was impossible to scale just that API; we had to scale the entire application. With the switch to API-first, each of our APIs is designed as a microservice with a specific task. This enables us to scale each API according to its load and be efficient with our resources.

  • Minimize dependencies – An API-first mindset brings dependencies to the forefront and encourages us to decouple APIs by design so that updates/changes can be done on the API level and not at the application level, which affects all APIs. This is not always attainable, but where it is, upgrading/changing APIs will be a more effortless and independent task. 
  • Parallelize development – Development teams can work in parallel by creating contracts (i.e., documenting your APIs’ routes, requests, and responses) between services on how the data will be exposed. This way, developers do not have to wait for updates to an API to be released before moving on to the next API, and teams can mock APIs and test API dependencies based on the established API definition.
  • Speed up the development cycle – API-first means we design our APIs before coding. Early feedback on the design allows the team to adapt to new inputs while the cost of change is still relatively low, reducing overall cost over the project’s lifetime.
  • QA in design – double down on the design phase because fixing issues once APIs are coded costs a lot more than fixing them during the design phase. 
  • Design for reusability – You can also reduce development costs by reusing components across multiple API projects.

Key points to becoming API-first

Implementing an API-first development approach in your organization requires planning, cooperation, and enforcement. Here are some key points and concepts to bake into your API-first strategy to make sure it’s a success:

  • Get early feedback – Understanding who your API clients are, inside and outside of your organization, and getting early feedback on API designs helps you ensure API-use case fit. This will make APIs easier to use and shorten your development cycle.
  • Always design first – API (design)-first means you describe every API design in an iterative way that both humans and computers can understand – before you write any code. API consumption is part of the design process, and it’s important to remember that clients (in plural) will interact with the feature through an API, so you need to always keep everyone in mind and not focus too much on a specific client. Considering design first will also make it easier to understand all the dependencies in the task.
  • Document your APIs – API documentation is a must as it creates a construct between clients and developers. The documentation is critical to ensure that the API consumption is effective and efficient. We want to be exact in the language and examples so the client gets maximum impact with minimum effort. 
  • Automate your processes – Use tools like SwaggerHub to automate processes like generating API documentation, style validation, API mocking, and versioning. 
  • Make it easy to get started – Provide interactive documentation and sandboxes so that developers can try out API endpoints and get started building apps with your APIs right away. 

A lot has been said about going API-first, and there are many resources and best practices (for example, these articles from Auth0 and Swagger) that can help you through the transition. But going API-first doesn’t necessarily require refactoring your existing applications; it’s about embracing a different mindset. For us, it was, without a doubt, the right path to take, we see it in customer satisfaction and increased usage, we see it in how we are scaling faster and more efficiently, and we see it in how we are developing and deploying new capabilities faster to our customers.

Don’t forget to check out our careers page and join us!

Scaling model observability with Superwise & New Relic

Let’s skip the obvious: if you’re reading this, it’s a safe bet that you already know that ML monitoring is a must; data integrity, model drift, performance degradation, etc., are already the basic standard of any MLOps monitoring tool. But as any ML practitioner will attest, it’s one thing to monitor a single machine learning model; it’s another altogether to achieve automated model observability for dozens of live models, all with immediate impact on daily predictions and business operations. Enter Superwise: high-scale model observability done right. What does that mean? Zeroing in on issues that lead to action, without alert fatigue and false alarms. The platform comes with built-in KPIs, automated issue detection and insights, and a self-service monitoring engine to deliver immediate value without sacrificing customization down the road. 

Model observability is all about context, so it’s only natural for us to integrate our model KPIs and model insights into New Relic to take observability higher, further, faster. With the integration, Superwise and New Relic users will be able to explore model incidents within their New Relic workflow, as well as view Superwise’s model KPIs. 

What do you get?

The Superwise model observability dashboard gives you out-of-the-box information regarding your active models, their activity status, drift levels, and any open incidents detected for specific time intervals or filters. But we don’t stop there; you can configure any custom metric and incident you need to monitor for your specific use cases and monitor them in New Relic.

The basics

  • The model activity overview gives you a quick view of your active models, their activity (predictions) over time, and the total number of predictions during the filtered timeframe.
  • With drift detection and the model input drift chart, users can identify what models are drifting and may require retraining.
  • Using incident widgets, users can easily see how many models currently have open incidents (violations of any monitoring policy configured), how incidents are being distributed among the different models, and drill down into the model incident details. 

The custom

Superwise’s flexible monitoring policy builder lets you configure various model monitoring policies and send detected incidents into one or more downstream channels including New Relic, PagerDuty, Slack, Email, and more. You have full control over what policies are sent to which channels to ensure that the right team gets the right alert at the right time. 

What do you need to do?

It takes only a few minutes to integrate Superwise and New Relic so you can access our model KPIs and incidents in New Relic One. Check out the integration documentation and we’ll walk you through it.  

Don’t have Superwise yet?

We can fix that. Request a demo