
Machine Learning Monitoring Process: Detecting Prior Drift


Prior drift is a phenomenon that occurs when the distribution of the data changes over time, causing the assumptions made about the data in the past to become invalid. This can lead to poor model performance, as the model continues to make predictions based on assumptions that no longer hold.

One typical example of prior drift is the concept drift in machine learning, where the relationship between the input features and the target variable changes over time. For instance, a model trained to predict customer churn might perform well when it is first deployed, but as the customer base changes and new trends emerge, the model’s predictions become stale.

It is essential to regularly retrain machine learning models on current data to mitigate the effects of prior drift. This can help ensure that the model remains up-to-date and makes accurate predictions. Additionally, techniques such as online learning and concept drift detection can adapt models in real-time as the data distribution changes.
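For example, here is a minimal sketch of online learning with scikit-learn, assuming a model that supports incremental updates (SGDClassifier and the placeholder batches below are illustrative, not part of the original example):

import numpy as np
from sklearn.linear_model import SGDClassifier

model = SGDClassifier()  # a linear classifier that supports incremental updates

# First batch of data: the set of classes must be declared on the initial call
X_batch, y_batch = np.random.rand(32, 5), np.random.randint(0, 2, 32)
model.partial_fit(X_batch, y_batch, classes=[0, 1])

# Later batches arrive as the data distribution shifts; each call keeps adapting the model
X_new, y_new = np.random.rand(32, 5), np.random.randint(0, 2, 32)
model.partial_fit(X_new, y_new)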

However, it is not always possible to retrain models or adapt to changing data in real time. In these cases, it may be necessary to intervene manually and update the model’s assumptions to account for the changing data distribution. This can be time-consuming and resource-intensive, but it is needed to maintain the model’s performance and reliability.

A straightforward way to check for prior drift is with what is called a monitoring process. One way to check for this phenomenon is by tracking the true-positive rate the model achieved in the training environment. We can import binom from scipy.stats to test for prior drift; its cdf method gives the binomial distribution’s cumulative distribution function (c.d.f.). Consider the following from the scipy documentation for this distribution.

binom takes n and p as shape parameters, where p is the probability of a single success and 1−p is the probability of a single failure.
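As a quick illustration of those shape parameters, a frozen binomial distribution can be created and queried directly (the numbers here are the same hypothetical values used in the snippet below):

from scipy.stats import binom

# Freeze the distribution with n trials and success probability p,
# then evaluate its c.d.f. at a given number of successes
rv = binom(n=100, p=0.9)
print(rv.cdf(88))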

Now we can leverage this method along with some information about our model.

from scipy.stats import binom

X = 88     # number of positive detections
tpr = 0.9  # true-positive rate in notebook
N = 100    # number of trials

# Probability of observing at most X positives in N trials at the expected rate
p = binom.cdf(X, N, tpr)

if not 0.4 < p < 0.6:
    print("prior drift")

In the above code, the variable tpr is the true-positive rate you find in your notebook. The variable X is the number of positive detections. The variable N is the total number of samples that ran through the classifier, or the number of trials.

Let’s recall the definition of the c.d.f. of the binomial distribution: for at most k successes in n trials with success probability p,

F(k; n, p) = P(X ≤ k) = Σ_{i=0}^{k} C(n, i) · p^i · (1 − p)^(n − i)
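As a sanity check on that definition, a short sketch (reusing the same hypothetical numbers) confirms that binom.cdf matches the term-by-term sum of the probability mass function:

from scipy.stats import binom

N, tpr, X = 100, 0.9, 88
cdf_direct = binom.cdf(X, N, tpr)                             # scipy's c.d.f.
cdf_by_sum = sum(binom.pmf(i, N, tpr) for i in range(X + 1))  # sum of p.m.f. terms up to X
print(cdf_direct, cdf_by_sum)  # the two values agree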
So, passing these values to the c.d.f. of the binomial distribution gives the probability of observing at most X positives over that many trials with a model that is expected to have the given true-positive rate. We want this value to be as close to the middle, 0.5, as possible. There is significant prior drift when the probability falls outside some predefined open ball around 0.5. The boundary here is (0.4, 0.6); it can be widened, but in practice it should not be widened beyond (0.3, 0.7).
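To make the check reusable, the rule above can be packaged into a small helper; the function name check_prior_drift and the default (0.4, 0.6) boundary are illustrative choices, not a standard API:

from scipy.stats import binom

def check_prior_drift(positives, trials, expected_tpr, lower=0.4, upper=0.6):
    # Flag drift when the binomial c.d.f. falls outside the open ball around 0.5
    p = binom.cdf(positives, trials, expected_tpr)
    return not lower < p < upper

if check_prior_drift(88, 100, 0.9):
    print("prior drift")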

Prior drift is a critical issue when working with machine learning models, as it can significantly impact the model’s ability to make accurate predictions. Regularly retraining models and adapting to changing data can mitigate the effects of prior drift and maintain the model’s performance over time. With the simple check above, we can test for prior drift while accounting for the randomness in that measurement.
