Data drift occurs when a model encounters production data that differs from the data it was trained on. When asked to make predictions on drifted data, the model is unlikely to achieve its reported performance.

This happens because, during training, a model learns the features most pertinent to its training dataset. Those features, however, are not universal to all data. For example, in computer vision tasks, models might identify patterns like the straight lines of a building’s sides or the circles of a car’s wheels. As useful as those features are for those specific objects, correctly interpreting images of trees or shoes requires a model to home in on different features. Data drift can occur even when the objects of interest remain the same. Consider a model trained to detect buildings in overhead imagery. Suppose this model was trained on images taken over a city, during the day, under clear skies. If we pass in images taken in different weather conditions, at night, or over a rural area, the data has drifted, and the model won’t perform as well.

Monitoring drift and taking the appropriate steps to retrain models is essential to ensuring models generate trustworthy results. Trusting your model predictions requires knowing that your model can correctly interpret your data. To automate this process, Modzy developed a statistical method of detecting drift between your data and a model’s training data.

The Challenge

To train a model, data scientists split a large dataset into train and test sets. The model is trained on the train set, and its metrics are then calculated from its performance on the test set. Since the train and test sets come from the same distribution (i.e., they share the same features: all cell phone images, taken during the day, in good weather conditions, etc.), the features learned by the model generalize well to the test data. The resulting metrics therefore only measure the model’s performance on data like the data that was available. The model should be expected to perform worse on data points that differ substantially from the train and test data (e.g., production images taken from satellites, or in poor weather conditions). In other words, a large difference between production data and training data, i.e., data drift, causes a large drop in prediction accuracy.
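As a minimal sketch of this workflow, the split-and-evaluate step might look like the following. Everything here is a hypothetical placeholder (a synthetic dataset and a trivial one-feature threshold "model"), not any particular production pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dataset: 200 samples with 5 features each.
# The label depends only on the first feature.
X = rng.normal(size=(200, 5))
y = (X[:, 0] > 0).astype(int)

# Shuffle, then hold out 25% of the data as a test set.
idx = rng.permutation(len(X))
split = int(0.75 * len(X))
train_idx, test_idx = idx[:split], idx[split:]
X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]

# "Train" a trivial model: a threshold halfway between the class means
# of the first feature, learned from the train set only.
t = (X_train[y_train == 1, 0].mean() + X_train[y_train == 0, 0].mean()) / 2
preds = (X_test[:, 0] > t).astype(int)

# Accuracy on the held-out test set is the model's reported metric --
# valid only for data drawn from this same distribution.
accuracy = (preds == y_test).mean()
print(accuracy)
```

Because the test set is drawn from the same distribution as the train set, this accuracy is an honest estimate for similar data, and an optimistic one for drifted data.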

There are several ways to detect drift. One is to manually label the data you ask the model to make predictions on, then check whether those predictions are sufficiently accurate. This is not only impractical for large datasets but also defeats the purpose of using a model in the first place. Other drift detection methods compare the distribution of the model’s past predictions with the distribution of its predictions on new data. Modzy’s approach works differently: it centers on how the model interprets the data during inference. The advantage of framing the problem this way is that the drift detector applies in a more general setting.
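To make the prediction-distribution approach concrete, here is a small sketch that compares a model's confidence scores on reference data against those on new data using a two-sample Kolmogorov–Smirnov statistic. The scores and distributions are synthetic stand-ins, and the hand-rolled KS function is just for illustration (in practice a library routine such as `scipy.stats.ks_2samp` would be used):

```python
import numpy as np

def ks_statistic(a, b):
    # Two-sample Kolmogorov-Smirnov statistic: the maximum gap
    # between the two empirical CDFs, evaluated on the pooled sample.
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / len(a)
    cdf_b = np.searchsorted(b, grid, side="right") / len(b)
    return np.max(np.abs(cdf_a - cdf_b))

rng = np.random.default_rng(0)

# Hypothetical model confidence scores on reference vs. production data.
reference = rng.beta(8, 2, size=1000)   # mostly confident predictions
production = rng.beta(4, 4, size=1000)  # noticeably less confident

drift = ks_statistic(reference, production)
print(drift)  # near 0: similar distributions; near 1: strong drift
```

A shift like this in the prediction distribution is a useful warning sign, but it only sees the model's outputs; the feature-space approach described next looks inside the network instead.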

Modzy’s Approach to Drift Detection

Modzy’s drift detection solution relies on the observation that a machine learning model will cluster similar samples from a familiar dataset close together in its feature space. Additionally, the features extracted for dissimilar data will be distributed differently in the feature space, indicating a level of drift. This observation led Modzy to the following formulation of drift in neural networks.

Modzy compares the way that the training data passes through the network with how the user’s data passes through the network. In other words, we look for patterns in the features identified by the model given the user’s data and compare them with the patterns familiar to the model created during training. Any sizable difference between the two can indicate drift.

To assess drift holistically, we sample the feature space at multiple locations in the network and extract the features of the training samples at each location. These features are then clustered according to their corresponding classes. We compute the distances between the training-data features and these clusters, and compare them against the distances between the production-data features and the same clusters. This yields two distributions of distances: one estimating how the training data is spread through the model’s embedding space, and one estimating the same for the user’s production data. These distributions of distances are compared using statistical methods. As the differences between the pairs of distributions increase, so does the amount of drift between the two datasets, alerting the user to be more wary of the model’s predictions. This can be read as a measure of reliability: the smaller the amount of drift, the more reliable the model’s predictions on the user’s dataset.
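The steps above can be sketched at a single layer as follows. This is a simplified illustration, not Modzy's actual implementation: the "features" are random vectors standing in for activations extracted from the network, clusters are reduced to per-class centroids, and the statistical comparison is a two-sample Kolmogorov–Smirnov statistic:

```python
import numpy as np

def class_centroids(features, labels):
    # Cluster training features by class: one centroid per label.
    return {c: features[labels == c].mean(axis=0) for c in np.unique(labels)}

def centroid_distances(features, centroids):
    # Distance from each sample to its nearest class centroid.
    cents = np.stack(list(centroids.values()))
    d = np.linalg.norm(features[:, None, :] - cents[None, :, :], axis=2)
    return d.min(axis=1)

def ks_statistic(a, b):
    # Maximum gap between the two empirical CDFs (two-sample KS).
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])
    return np.max(np.abs(
        np.searchsorted(a, grid, side="right") / len(a)
        - np.searchsorted(b, grid, side="right") / len(b)))

rng = np.random.default_rng(1)

# Stand-ins for features extracted at one layer of the network.
train_feats = rng.normal(0.0, 1.0, size=(500, 16))
train_labels = rng.integers(0, 3, size=500)
prod_feats = rng.normal(0.8, 1.5, size=(500, 16))  # shifted: drifted data

# Compare the train-to-centroid distance distribution against the
# production-to-centroid distance distribution.
centroids = class_centroids(train_feats, train_labels)
drift_score = ks_statistic(
    centroid_distances(train_feats, centroids),
    centroid_distances(prod_feats, centroids))
print(drift_score)  # larger values indicate more drift
```

In the full method this comparison is repeated at multiple layers, so drift that only appears in deeper, more abstract features is still caught.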


Comparisons of the distributions of distances calculated for the train set, its corresponding test set, and a separate out-of-distribution set at two different layers in an image classification network. Since the train and test sets are sampled from the same probability distribution, their distance distributions are very similar, whereas the out-of-distribution dataset’s distance distribution is obviously different from the others.

What this Means for You

A model is only as good as the data on which it was trained, so it is crucial to know that the data passing through a model deployed to a production environment is similar to its training data.

Suppose a significant amount of drift is detected in your dataset with respect to a given model. As a data scientist working with that model, you can respond in a few ways. First, investigate the root cause of the drift. Was the data captured by a new sensor? Was the data formatted differently? Was the wrong data sent to the model? If you can identify the cause and rectify the issue, you can continue using the model with extra assurance that your data is familiar to it; otherwise, you should take the model out of production and use another one.

In many cases, drift is inevitable. During training, it isn’t possible to account for every scenario your model might encounter in the real world, but that doesn’t mean you should give up on AI. Fortunately, when drift is detected and found to be detrimental to a model’s performance, Modzy’s retraining solution can tailor that model to your specific data to ensure the best results.