# k Nearest Neighbor (KNN)

August 11, 2017

## k Nearest Neighbor (KNN): introduction

The necessity of data mining techniques has emerged quite immensely nowadays due to massive increase in data. Data mining is the process of extracting patterns and mining knowledge from data. K nearest neighbors is a simple algorithm that stores all available cases and classifies new cases based on a similarity measure (e.g., distance functions). KNN has been used in statistical estimation and pattern recognition already in the beginning of 1970’s as a non-parametric technique. The model for KNN is the entire training dataset. When a prediction is required for a unseen data instance, the KNN algorithm will search through the training dataset for the k-most similar instances. The prediction attribute of the most similar instances is summarized and returned as the prediction for the unseen instance.

Nearest neighbor classifiers is a lazy learner’s method and is based on learning by analogy. It is a supervised classification technique which is used widely. Unlike the previously described methods the nearest neighbor method waits until the last minute before doing any model construction on a given tuple. In this method the training tuples are represented in N-dimensional space. When given an unknown tuple, k-nearest neighbor classifier searches the k training tuples that are closest to the unknown sample and places the sample in the nearest class The K nearest neighbor method is simple to implement when applied to small sets of data, but when applied to large volumes of data and high dimensional data it results in slower performance. The algorithm is sensitive to the value of k and it affects the performance of the classifier.

Figure 1 KNN classification [5]

In contrast to Naive Bayes classification, which is based on a probabilistic algorithm, the K-Nearest Neighbor classifier relies on mathematically measuring the distance between documents that are presented. All documents are pre-labeled as belonging to its appropriate class. The cosine similarity measurement is commonly used with k-NN. If the given document d is similar to a document that is labeled to class C, we can assign document d to this class as well. Often, the classification accuracy of k-NN can be improved significantly if Large Margin Nearest Neighbor or Neighborhood component analysis is applied. The use of k-NN for text classification is widespread; however, it needs a significant pre-processing stage. This pre-processing stage transforms documents and words into sparse numeric vectors which use linear algebraic operations, which are necessary for most classification techniques as well. Nearest-neighbor classifiers are based on learning by resemblance, i.e. by comparing a given test sample with the available training samples which are similar to it. For a data sample X to be classified, its K-nearest neighbors are searched and then X is assigned to class label to which majority of its neighbors belongs to. The choice of k also affects the performance of k-nearest neighbor algorithm. If the value of k is too small, then K-NN classifier may be vulnerable to over fitting because of noise present in the training dataset. On the other hand, if k is too large, the nearest-neighbor classifier may misclassify the test sample because its list of nearest neighbors may contain some data points that are located far away from its neighborhood.

The basic steps of the k-NN algorithm are:

• To compute the distances between the new sample and all previous samples, have already been classified into clusters;
• To sort the distances in increasing order and select the k samples with the smallest distance values;
• To apply the voting principle. A new sample will be added (classified) to the largest cluster out of k selected samples.

The principle behind nearest neighbor methods is to find a predefined number of training samples closest in distance to the new point, and predict the label from these. The number of samples can be a user-defined constant (k-nearest neighbor learning), or vary based on the local density of points (radius-based neighbor learning). The distance can, in general, be any metric measure: standard Euclidean distance is the most common choice. Neighbors-based methods are known as non-generalizing machine learning methods, since they simply “remember” all of its training data

### Applications of KNN

KNN is a versatile algorithm and is used in a huge number of fields. Let us take a look at few uncommon and non trivial applications:

Nearest Neighbor based Content Retrieval: This is one the fascinating applications of KNN – Basically we can use it in Computer Vision for many cases – You can consider handwriting detection as a rudimentary nearest neighbor problem. The problem becomes more fascinating if the content is a video – given a video find the video closest to the query from the database.

Gene Expression: This is another cool area where many a time, KNN performs better than other state of the art techniques. In fact a combination of KNN-SVM is one of the most popular techniques there. This is a huge topic on its own and hence I will refrain from talking much more about it.

Protein-Protein interaction and 3D structure prediction: Graph based KNN is used in protein interaction prediction. Similarly KNN is used in structure prediction

### Advantages of K-nearest neighbor’s Algorithm

• KNN is simple to implement.
• KNN executes quickly for small training data sets.
• Performance asymptotically approaches the performance of the Bayes Classifier.
• Don’t need any prior knowledge about the structure of data in the training set.
• No retraining is required if the new training pattern is added to the existing training set.

### Limitation to K-nearest neighbors’ Algorithm

• When the training set is large, it may take a lot of space.
• For every test data, the distance should be computed between test data and all the training data. Thus a lot of time may be needed for the testing.

#### References

[1] Thomas Cover and Peter Hart, nearest neighbor pattern classification, Information Theory, IEEE Transactions on, 13(1):21–27, 1967.

[2] H. Bhavsar and A. Ganatra, “A Comparative Study of Training Algorithms for Supervised Machine Learning”, International Journal of Soft Computing and Engineering (IJSCE), Volume 2, Issue 4, September 2012.

[3] N. Suguna and Dr. K. Thanushkodi, “An Improved k-Nearest Neighbor Classification Using Genetic Algorithm”, IJCSI International Journal of Computer Science Issues, Vol. 7, Issue 4, No 2, July 2010

[4] Rahul Saxena, “KNN Classifier, Introduction to K-Nearest Neighbor Algorithm”, December 23, 2016, available online at: http://dataaspirant.com/2016/12/23/k-nearest-neighbor-classifier-intro/

[5] https://helloacm.com/a-short-introduction-to-k-nearest-neighbors-algorithm/

Insert math as
Formula color
Type math using LaTeX
Preview
$${}$$
Nothing to preview
Insert