One of the simplest models for classification is k-Nearest Neighbors (kNN). The premise of kNN is to classify a new unseen point $P$ by considering the k neighboring points of P and selecting the majority of the k neighbors. In this notebook, we will investigate the effects of selecting different values of k for kNN, and how this extrapolates to more general models.
When evaluating a model, the most intuitive first step is to look at how well the model performs. For classification, this may be the percentage of data points correctly classified, or for regression it may be how close the predicted values are to actual. The bias of a model is a measure of how close our prediction is to the actual value on average from an average model. Note that bias is not a measure of a single model, it encapuslates the scenario in which we collect many datasets, create models for each dataset, and average the error over all of models. Bias is not a measure of error for a single model, but a more abstract concept describing the average error over all errors. A low value for the bias of a model describes that on average, our predictions are similar to the actual values.
The variance of a model relates to the variance of the distribution of all models. In the previous section about bias, we envisoned the scenario of collecting many datasets, creating models for each dataset, and averaging the error overall the datasets. Instead, the variance of a model describes the variance in prediction. While we might be able to predict a value very well on average, if the variance of predictions is very high this may not be very helpful, as when we train a model we only have one such instance, and a high model variance tells us little about the true nature of the predictions. A low variance describes that our model will not predict very different values for different datasets.
The image describes what bias and variance are in a more simplified example. Consider that we would like to create a model that selects a point close to the center. The models on the top row have low bias, meaning the center of the cluster is close to the red dot on the target. The models on the left column have low variance, the clusters are quite tight, meaning our predictions are close together.
What is the order of best scenarios?
Let's investigate bias and variance in terms of kNN. A great dataset is the iris dataset, which contains three classes of irises and $50$ examples of each. The data set contains sepal and petal length and width. For the sake of clarity, I add some noise to the iris data so that the decision boundaries are not as clear.
The plots above have red, green, and blue colored sections corresponding to decision boundaries. Decision boundaries are the boundaries in which a specific classication decision is chosen. For example, the red section represents the area where a new point would be classified as red, and similarly for blue and green. The decision boundary is the dividing line between the red and green sections. When comparing $k=1$ and $k=50$, there are major differences in the decision boundary. For $k=1$, the decision boundary is not a smooth and contains many small "islands of classification". The decision boundary is very complex and attempts to fit specifically to each data point. This is not necessary the best choice. If we consider $k=50$, then the decision boundary is far mor simple. The decision boundary is roughly composed of straight lines, for example the green region is roughly triangular.
How do we bring bias and variance into these plots?
The bias for kNN can be thought of classification accuracy. For small values of $k$, our model performs exceptionally on the training data. $k=1$ even does perfect classification! For large values of $k$, our models performs quite poorly, hardly better than random guessing. Thus bias increases as $k$ increases.
Below is a plot of the training accuracy using the the training data to create decision boundaries and predict. The bias increases (accuracy is the opposite of error).
Variance for kNN is best understood through the decision boundaries. With the smaller values of $k$, the decision boundaries are jagged and complex, but simple and straight for larger values of $k$. Thus variance decreases as $k$ increases.
Below is a gif of the decision boundaries evolving as $k$ increases.