Model quality is inextricably linked to training data. Today, powerful computers and software libraries make it easy to build machine learning models that perform perfectly on their training datasets. However, the downfall of these supposedly perfect models is their poor performance on previously unseen data. Figure 1 illustrates this: the sample model, shown in orange, perfectly interpolates the training data (black data points). However, given an unseen data point, the sample model can produce an output that is far from the true model, shown in blue. This is because the sample model was overfit to the training dataset: it has fit the noise in those particular samples rather than the underlying trend. A high-quality model has low error and is also able to generalize, i.e., it responds correctly to new, previously unseen samples [1]. At Modzy, we use techniques like cross-validation and regularization to improve model generalizability and to prevent overfitting.

Figure 1: Visualization of Bias-Variance Dilemma

What You Need to Know

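Model quality depends on two competing optimization tasks: minimizing the model error over the training dataset, and keeping performance high on unseen samples by preventing overfitting. A model that is too simple underfits (high bias), while a model that is too flexible fits the noise in its training data (high variance). This tension is commonly referred to as the bias-variance dilemma.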

k-fold cross-validation (CV) is a technique used during the training process to assess model quality and generalizability. The training data is divided into k subsets (folds); k-1 subsets are used for model training and the remaining subset is held out for testing and evaluation. This process is then repeated k times, each time holding out a different subset, as shown in Figure 2 (the yellow box represents the testing set). The evaluation scores for the k folds are then averaged. Because the testing subsets do not overlap, and no model is scored on data it was trained on, the risk of an overfit model going unnoticed is reduced, and the final average is an approximately unbiased estimate of how the model handles unseen samples.
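
The procedure above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions: the helper names (`k_fold_indices`, `cross_validate`), the one-parameter model, and the toy data are all hypothetical, and in practice the samples would usually be shuffled before being split into folds.

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous, non-overlapping test folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_validate(xs, ys, k, fit, predict):
    """Return (mean score, per-fold scores), using mean squared error as the score."""
    scores = []
    for test_idx in k_fold_indices(len(xs), k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(xs)) if i not in held_out]
        # Train on k-1 folds only.
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        # Evaluate on the single held-out fold.
        errors = [(predict(model, xs[i]) - ys[i]) ** 2 for i in test_idx]
        scores.append(sum(errors) / len(errors))
    return sum(scores) / k, scores


def fit_slope(xs, ys):
    """Toy model (an assumption for this sketch): least-squares line through the origin."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


def predict_slope(slope, x):
    return slope * x


# Made-up data: y = 2x plus small alternating noise.
xs = [float(i) for i in range(1, 11)]
ys = [2.0 * x + (0.3 if i % 2 == 0 else -0.3) for i, x in enumerate(xs)]

mean_score, fold_scores = cross_validate(xs, ys, 5, fit_slope, predict_slope)
print("per-fold MSE:", fold_scores)
print("mean MSE:", mean_score)
```

Each sample lands in exactly one test fold, so every score is computed on data the corresponding model never saw during fitting.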

Most machine learning models require input hyperparameters that are used during training. They are unknown a priori and therefore need to be optimized, a process called model selection. The best practice, in this case, is to use nested k-fold cross-validation, which gives a nearly unbiased estimate of the generalization error of the underlying model and its hyperparameter search. Tuning the hyperparameters using non-nested k-fold CV biases the model to the data set, because the same data is used both to select the hyperparameters and to score the result, which leads to an over-optimistic score [2]. Nested k-fold CV uses a series of train/validation/test splits. In the inner k-fold loop, the model performance is approximately maximized by fitting the model to each training set, and then directly maximized by selecting the hyperparameters over the validation set. In the outer loop, the generalization error is then estimated by averaging the model performance score over the test sets, which have not been used in the inner loop.
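
A minimal sketch of the nested procedure is shown below. It is an illustration only: the one-dimensional ridge model, the candidate hyperparameter values, the toy data, and every function name are assumptions made for this example rather than a description of any particular library or of Modzy's implementation.

```python
def make_folds(indices, k):
    """Deal indices round-robin into k non-overlapping folds."""
    return [indices[i::k] for i in range(k)]


def fit_ridge(xs, ys, lam):
    """1-D ridge through the origin: minimizes sum((w*x - y)^2) + lam * w^2."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


def mse(w, xs, ys):
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def inner_cv_score(xs, ys, dev_idx, k, lam):
    """Mean validation MSE for one hyperparameter value, using only dev_idx."""
    total = 0.0
    for val_idx in make_folds(dev_idx, k):
        train_idx = [i for i in dev_idx if i not in val_idx]
        w = fit_ridge([xs[i] for i in train_idx], [ys[i] for i in train_idx], lam)
        total += mse(w, [xs[i] for i in val_idx], [ys[i] for i in val_idx])
    return total / k


def nested_cv(xs, ys, lams, outer_k, inner_k):
    all_idx = list(range(len(xs)))
    outer_scores, chosen = [], []
    for test_idx in make_folds(all_idx, outer_k):
        dev_idx = [i for i in all_idx if i not in test_idx]
        # Inner loop: select the hyperparameter without touching the test fold.
        best_lam = min(lams, key=lambda lam: inner_cv_score(xs, ys, dev_idx, inner_k, lam))
        # Refit on all development data, then score once on the untouched test fold.
        w = fit_ridge([xs[i] for i in dev_idx], [ys[i] for i in dev_idx], best_lam)
        outer_scores.append(mse(w, [xs[i] for i in test_idx], [ys[i] for i in test_idx]))
        chosen.append(best_lam)
    return sum(outer_scores) / outer_k, chosen


# Made-up data: y = 3x plus a repeating noise pattern.
pattern = [0.4, -0.2, 0.1, -0.3]
xs = [0.5 * i for i in range(1, 13)]
ys = [3.0 * x + pattern[i % 4] for i, x in enumerate(xs)]

candidate_lams = [0.0, 0.1, 1.0, 10.0]
outer_score, chosen_lams = nested_cv(xs, ys, candidate_lams, outer_k=3, inner_k=4)
print("estimated generalization MSE:", outer_score)
print("hyperparameter chosen in each outer fold:", chosen_lams)
```

The key design point is that each outer test fold plays no part in choosing the hyperparameter, so the averaged outer score is not inflated by the search itself.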

Regularization techniques are also used to prevent overfitting during model training. As mentioned previously, the best way to produce a high-quality machine learning model during training is to minimize the model error (calculated using a loss function) while maximizing generalizability (encouraged through a regularization term that penalizes model complexity). Regularization techniques allow the relative priority of these two terms to be set using a scalar hyperparameter, λ, which scales the regularization term. For example, if a dataset is known to be very noisy, λ can be set to a large number, giving model generalization priority over minimizing error. Unfortunately, λ is not known a priori and has to be tuned using k-fold CV.
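
To make the trade-off concrete, the sketch below uses an L2 (ridge) penalty on a one-parameter model. The choice of penalty, the variable name `lam` for the scalar regularization hyperparameter, and the toy data are assumptions for illustration; the text above does not commit to a specific regularizer.

```python
def objective(w, xs, ys, lam):
    """Regularized objective: loss term plus lam times the regularization term."""
    loss = sum((w * x - y) ** 2 for x, y in zip(xs, ys))  # model error
    penalty = w * w                                       # L2 regularization term
    return loss + lam * penalty


def fit_ridge(xs, ys, lam):
    """Closed-form minimizer of the objective above for a line through the origin."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


def train_mse(w, xs, ys):
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Made-up noisy samples of roughly y = x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.2, 1.9, 3.2, 3.8]

lams = [0.0, 1.0, 10.0, 100.0]
weights = [fit_ridge(xs, ys, lam) for lam in lams]
train_errors = [train_mse(w, xs, ys) for w in weights]

for lam, w, err in zip(lams, weights, train_errors):
    print("lam =", lam, " weight =", round(w, 4), " training MSE =", round(err, 4))
```

As `lam` grows, the fitted weight shrinks toward zero and the training error rises: the model gives up some fit to the training data in exchange for a simpler solution. Which value works best on unseen data still has to be determined by cross-validation.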

Modzy Approach to Model Quality and Training Data

Understanding the methods and effects of overfitting is imperative when designing and training machine learning models, and at Modzy, we take the necessary steps to ensure all our models are generalizable to previously unseen data samples.

What This Means for You

Modzy provides a collection of machine learning models that are guaranteed to be unbiased towards the datasets that they were trained on, making them generalizable to unseen samples. The data science team at Modzy recognizes the importance of balancing model bias and variance to ensure quality performance and transparent model assessment.


[1] Shalev-Shwartz, Shai, and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.

[2] Cawley, Gavin C., and Nicola L. C. Talbot. “On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation.” Journal of Machine Learning Research 11 (2010): 2079-2107.