A New Way to Think About the “Bias vs. Variance” tradeoff
In ML, the tradeoff between overfitting and underfitting is often depicted using a U-shape. Traditionally, we want our model to learn enough information from the train set to be able to generalize well over the unseen data (test set). At the same time we don’t want it to learn too much to capture the noise and spurious patterns that pertain to a specific dataset, which will result in poor generalization.
The tradeoff between overfitting and underfitting is particularly apparent in linear models. However, neural nets very often have near-perfect accuracy on the train and the test sets at the same time. Why does it happen? The reason is model complexity — neural nets are typically more complex, being able to learn thousands and millions of parameters. This added complexity yields an interesting effect where such models are able to minimize the test error even when the train error is already zero. Such effect can be graphically depicted as a “double descent” U-curve.
This effect is described in the paper Reconciling modern machine learning practice and the bias-variance trade-of by Belkin et al. In this article I will empirically investigate and visualize the results of the paper.
For this task I picked a toy dataset from Kaggle on housing prices. The dataset contains various houses’ characteristics and their corresponding sale prices. The goal is to predict the price based on the characteristics. This dataset is traditionally used as a gateway into ML, as it is simple to understand and manipulate. The dataset contains 1168 observations and 78 variables.
I initially performed basic imputation and FE to prepare the data for the modeling. These manipulations are outside of the scope of this article, and the full code can be found on my GitHub. Although Belkin et al. only investigated classification tasks, I will also add regression into the bucket to see if the same relationship holds. As the predictor variable is originally continuous (sale price), I convert it into several equally-sized bins using pandas’s qcut method. Following the paper, I test the performance of the Random Forest Classifier (Regressor) by varying the number of leaves and the number of trees. The error is measured with Accuracy (classification) and Mean Absolute Error (regression).
I initially investigate the double-descent relationship using 4 classes. I train a Random Forest Classifier without bootstrapping with 10–300 leaves and a single tree, and 1–100 trees with 300 leaves. The increase in the number of trees can be thought of increased complexity of the model. As the plot shows, train accuracy plateaus around 300 leaves and does not increase much with increased complexity. However, test accuracy jumps as the number of trees goes up.
A similar relationship can be observed with a larger number of classes. Here I adjusted the number of leaves and the number of trees accordingly. As we can see, even when the train accuracy plateaus long before increasing the complexity, the jump in accuracy can still be observed.
Finally, I investigate this relationship for a regression task. We can see that the double U curve is not present in this case.
The takeaway message is that with increased complexity, the accuracy of the test set goes up when the number of classes is small. As we increase the number of classes, the jump becomes less and less pronounced. Regression can be considered as a special case of classification with number of classes = number of observations (at most). I believe this is the reason why we don’t see the double U curve in the regression task. The original paper has observed a similar relationship.