Every day, organisations across the world adopt artificial intelligence to solve problems in business, process, technology and other areas. As more and more AI models are created, it becomes critical for a business to keep track of how its models perform.

Types of Model Evaluation

There are two types of model evaluation -

  • Offline Evaluation
  • Online Evaluation

Offline Evaluation: When a model is being developed, it is trained on a particular dataset. Its performance is then evaluated on a new dataset that the model has not seen before. This shows how well the model generalises to new data. Precision, recall and accuracy are examples of offline evaluation metrics used in experiments that build classification models.
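
As a rough illustration of the offline metrics named above, here is a minimal sketch for a binary classifier. The function name classification_metrics and the sample labels are hypothetical and only serve the example; in practice a library such as scikit-learn would usually be used.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision and recall for a binary classifier."""
    pairs = list(zip(y_true, y_pred))
    # True positives, false positives and false negatives for the positive class
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    return {
        "accuracy": correct / len(pairs),
        # Guard against division by zero when a class is never predicted or never present
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical held-out labels and the model's predictions on them
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

The key point is that y_true comes from data the model did not see during training, so the numbers estimate how the model generalises.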

Online Evaluation: Once the model is ready, it is deployed to production. Measuring how its predictions perform in the real world is called online evaluation. Models used in E-commerce and other online businesses typically use some form of A/B testing. In banking organisations that employ risk models, the PSI (Population Stability Index) score is one of the metrics used to monitor the model, since it measures how far the production population has shifted from the population the model was developed on.
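
To make the PSI idea more concrete, the sketch below compares the distribution of model scores at development time with the distribution seen in production. It assumes equal-width bins derived from the development scores and a small floor (eps) to avoid taking the log of zero; the function name psi, the bin count and the sample scores are all assumptions made for this example, and real risk teams often use quantile bins instead.

```python
import math


def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population Stability Index between two samples of scores.

    expected: scores from the development (training) population
    actual:   scores from the production population
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 if all scores are equal

    def proportions(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), n_bins - 1)  # clamp out-of-range scores
            counts[idx] += 1
        # Floor each share at eps so empty bins do not break the logarithm
        return [max(c / len(values), eps) for c in counts]

    e = proportions(expected)
    a = proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Hypothetical scores: production scores drift upwards by 0.2
train_scores = [i / 100 for i in range(100)]
prod_scores = [min(s + 0.2, 0.99) for s in train_scores]

print(psi(train_scores, train_scores))  # identical populations give 0.0
print(psi(train_scores, prod_scores))   # a shifted population gives a positive value
```

A commonly quoted rule of thumb treats a PSI below 0.1 as stable and above 0.25 as a significant shift, but the thresholds an organisation uses are its own choice.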

There are tools out there to keep track of model performance and the model building experiments. ModelChimp is one of those tools.

Challenges of Online Evaluation

Compared to online evaluation, offline evaluation is quite straightforward because -

  • Data: The data to evaluate on is already available
  • Metric: The base metric to evaluate on generally follows from the type of model used. For example, precision and recall for classification
  • Time: Once the model is trained, it can be evaluated quickly offline

For online evaluation, the same aspects become tricky.

  • Data: The data needed for online evaluation might not always be available. For example, a face detection algorithm predicts who a person is, but unless there is a way to capture whether each prediction was correct, it is difficult to determine the performance of the model. Even if the model performs well offline, it might surprise you online. Google’s photo-organising service had such a surprise a few years back
  • Metric: The metric used for online evaluation is usually different from the metric used offline, and it can vary with the use case and the organisation. Online metrics are generally business metrics; Population Stability Index is one example of an online evaluation metric
  • Time: The time needed to evaluate a model also depends on the use case and the organisation. For an E-commerce business, a recommendation algorithm can be evaluated quickly once it is deployed, whereas a recommendation algorithm for products sold in brick-and-mortar stores can take up to a month to evaluate
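
The data problem above is essentially a join between what the model predicted and what later turned out to be true. A minimal sketch, assuming each prediction is logged under a request id and that feedback arrives later for only some of those ids (the names online_accuracy, predictions and feedback are hypothetical):

```python
def online_accuracy(predictions, feedback):
    """Accuracy over the predictions for which ground truth was captured.

    predictions: {request_id: predicted_label}, logged at serving time
    feedback:    {request_id: true_label}, captured later and possibly incomplete
    Returns (accuracy, coverage); accuracy is None when no feedback matches.
    """
    matched = [(predictions[r], feedback[r]) for r in predictions if r in feedback]
    # Coverage: the share of predictions that can be evaluated at all
    coverage = len(matched) / len(predictions) if predictions else 0.0
    if not matched:
        return None, coverage
    accuracy = sum(1 for p, t in matched if p == t) / len(matched)
    return accuracy, coverage


# Hypothetical logs: four predictions, feedback received for two of them
predictions = {"r1": "alice", "r2": "bob", "r3": "carol", "r4": "dave"}
feedback = {"r1": "alice", "r3": "erin"}

accuracy, coverage = online_accuracy(predictions, feedback)
print(accuracy, coverage)
```

Reporting coverage next to accuracy matters here: an accuracy computed from a small or biased slice of feedback can be as misleading as having no number at all.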

Online Evaluation for Deep Learning

Most of the deep learning models in production today are trained on unstructured data such as images, audio, video and text. For structured data, machine learning models generally suffice.

One of the biggest challenges in building a deep learning model is getting annotated data. Annotating data costs time and money, and organisations use the services of companies like Figure 8 and dataloop.ai to do it.

Once the data is annotated, the model is trained and evaluated offline. Online evaluation of such models is difficult because their predictions have to be inspected manually. This manual inspection is the biggest challenge in the online evaluation of deep learning models.
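
One common way to keep the manual effort bounded is to inspect a random sample of predictions rather than all of them, and to report the resulting accuracy with a confidence interval. The sketch below is an illustration under that assumption, not a method taken from this article; the function names, the sample size and the review counts are hypothetical. It uses the Wilson score interval for a proportion.

```python
import math
import random


def sample_for_review(prediction_ids, sample_size, seed=0):
    """Pick a reproducible random subset of predictions for human inspection."""
    rng = random.Random(seed)
    return rng.sample(prediction_ids, min(sample_size, len(prediction_ids)))


def wilson_interval(n_correct, n_reviewed, z=1.96):
    """Wilson score interval for accuracy (z=1.96 is roughly 95% confidence)."""
    if n_reviewed == 0:
        raise ValueError("at least one reviewed prediction is required")
    p = n_correct / n_reviewed
    denom = 1 + z ** 2 / n_reviewed
    centre = (p + z ** 2 / (2 * n_reviewed)) / denom
    half = z * math.sqrt(
        p * (1 - p) / n_reviewed + z ** 2 / (4 * n_reviewed ** 2)
    ) / denom
    return centre - half, centre + half


# Hypothetical: 1000 logged predictions, 50 sent to reviewers, 45 judged correct
ids = list(range(1000))
to_review = sample_for_review(ids, 50)
low, high = wilson_interval(45, 50)
print(len(to_review), low, high)
```

The width of the interval makes the trade-off visible: reviewing more predictions costs more human time but narrows the estimate.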


Businesses and organisations need to keep track of both kinds of evaluation. Online evaluation of deep learning models will be harder because of the manual effort that goes into annotation. Steps are being taken to reduce the human effort, and this paper is an example of that.