K-Nearest Neighbor (KNN) is one of the simplest Machine Learning algorithms for supervised learning. The KNN algorithm classifies unlabeled input data based on its similarity to labeled training instances. The nearest (most similar) neighbors are determined by the Euclidean distance between each instance and the input. In general, classifying new input data with KNN follows these steps:
Determine the value of K
Calculate the Euclidean distance between the input and every instance in the training data
Create a subset consisting of the K instances with the smallest distance to the input
Predict the output from the labels in the subset: the mode of the K labels for discrete-valued output (majority rule), or the mean of the K values for continuous-valued output
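The steps above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function names `euclidean` and `knn_predict`, the `regression` flag, and the toy data are all assumptions made for this example.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_points, train_labels, query, k, regression=False):
    """Predict the output for `query` from its k nearest training instances."""
    # Distance from the query to every training instance.
    distances = [
        (euclidean(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k instances with the smallest distance.
    distances.sort(key=lambda pair: pair[0])
    nearest = [label for _, label in distances[:k]]
    if regression:
        # Continuous-valued output: mean of the neighbors' values.
        return sum(nearest) / len(nearest)
    # Discrete-valued output: mode of the neighbors' labels (majority rule).
    return Counter(nearest).most_common(1)[0][0]


# Hypothetical toy data: two well-separated groups.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 6.5), (6.5, 7.0)]
labels = ["A", "A", "A", "B", "B", "B"]
values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]

print(knn_predict(points, labels, (2.0, 2.0), k=3))
print(knn_predict(points, values, (2.0, 2.0), k=3, regression=True))
```

Sorting every distance is the simplest approach and is fine for small datasets; larger datasets usually call for a spatial index or a partial sort.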
Because no general rule for choosing K is known, the best value is usually found by trial and error: run the algorithm with several different values of K and compare the results.
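One way to organize this trial and error is to score each candidate K by leave-one-out accuracy: hold out one training instance at a time and check whether the remaining instances classify it correctly. The sketch below assumes this scoring scheme, which is one common choice and not something the algorithm prescribes. The helper names and the toy data (two groups plus one noisy 'B' point placed inside the 'A' group) are hypothetical.

```python
import math
from collections import Counter


def predict(points, labels, query, k):
    """Majority-vote KNN classification using Euclidean distance."""
    ranked = sorted(zip(points, labels), key=lambda pl: math.dist(pl[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def leave_one_out_accuracy(points, labels, k):
    """Fraction of instances classified correctly by all the other instances."""
    correct = 0
    for i in range(len(points)):
        rest_points = points[:i] + points[i + 1:]
        rest_labels = labels[:i] + labels[i + 1:]
        if predict(rest_points, rest_labels, points[i], k) == labels[i]:
            correct += 1
    return correct / len(points)


def best_k(points, labels, candidates):
    """Try every candidate K and return the best one along with all scores."""
    scores = {k: leave_one_out_accuracy(points, labels, k) for k in candidates}
    return max(scores, key=scores.get), scores


# Hypothetical toy data: four 'A' points, four 'B' points, and one noisy 'B'
# point sitting in the middle of the 'A' group.
points = [(1, 1), (1, 2), (2, 1), (2, 2), (1.5, 1.5),
          (6, 6), (6, 7), (7, 6), (7, 7)]
labels = ["A", "A", "A", "A", "B", "B", "B", "B", "B"]

k, scores = best_k(points, labels, candidates=[1, 3, 5, 7])
print(k, scores)
```

On this toy data, K = 1 is thrown off by the single noisy point, K = 3 and K = 5 score best, and K = 7 is too large for groups this small. When several candidates tie, `max` returns the first one in the candidate list.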
A small value of K can make predictions unstable, because a few noisy neighbors are enough to decide the outcome. A larger K tends to give more accurate predictions up to a certain threshold, after which accuracy degrades as more distant, less similar instances begin to influence the result. One rule of thumb is to choose an odd K to avoid ties, although this only guarantees a tie-free vote when there are two classes.
Another aspect that can be misleading is the majority rule. The KNN algorithm assumes that new data shares the majority label among its nearest neighbors. However, this rule can mislead when the training dataset contains noisy data, so that instances with different labels end up clustered in the same area.
For example, suppose that in a 5-nearest-neighbor classification of a new instance, the two nearest neighbors have the label ‘A’ and the other three have the label ‘B’. Even though the new data point is most similar (closest) to data with the label ‘A’, it will still be classified as ‘B’ because of the majority rule.
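The situation described above can be reproduced with a small sketch. The coordinates are invented purely for illustration: two 'A' points sit very close to the query, and three 'B' points sit much farther away.

```python
import math
from collections import Counter

# Hypothetical query point and its five nearest training instances.
query = (0.0, 0.0)
neighbors = [
    ((0.1, 0.0), "A"),
    ((0.0, 0.2), "A"),
    ((1.0, 1.0), "B"),
    ((1.2, 0.9), "B"),
    ((0.9, 1.3), "B"),
]

# Rank the neighbors by Euclidean distance to the query.
ranked = sorted(neighbors, key=lambda item: math.dist(item[0], query))
top5_labels = [label for _, label in ranked[:5]]

# The majority rule counts labels only and ignores how close each neighbor is.
predicted = Counter(top5_labels).most_common(1)[0][0]

for point, label in ranked:
    print(label, round(math.dist(point, query), 2))
print("predicted:", predicted)
```

The two closest neighbors are both 'A', yet the vote is 3 to 2 in favor of 'B', so the query is labeled 'B'.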
In conclusion, the KNN algorithm is a simple supervised learning method that classifies data according to its similarity to labeled instances. Finding the best value of K is crucial, since it directly affects the algorithm’s performance.