# K-Means Algorithm with Numerical Explanation

August 10, 2017

## K-means clustering Overview

in a number of classical data mining techniques the k-means algorithm is one of the most popular technique of unsupervised learning or clustering. that technique can be used with any kind of mining techniques web mining, text mining or any structured data mining. that algorithm is suitable to used for partitioning of data into K number of clusters. therefore that technique is also known as partition based clustering approach.

### Basic Functioning

that technique traditionally need to select k number of initial centroids. these centroids are compared with the remaining data objects in the input data samples. in order to compare the data objects the distance functions are used such as euclidean distance.

### Basics of clustering

Clustering is the grouping of a particular set of objects based on their characteristics, aggregating them according to their similarities. Regarding to data mining, this methodology partitions the data implementing a specific join algorithm, most suitable for the desired information analysis. This clustering analysis allows an object not to be part of a cluster, or strictly belong to it, calling this type of grouping hard partitioning. In the other hand, soft partitioning states that every object belongs to a cluster in a determined degree. More specific divisions can be possible to create like objects belonging to multiple clusters, to force an object to participate in only one cluster or even construct hierarchical trees on group relationships.

There are several different ways to implement this partitioning, based on distinct models. Distinct algorithms are applied to each model, differentiating it’s properties and results. These models are distinguished by their organization and type of relationship between them. The most important ones are:

• Centralized – each cluster is represented by a single vector mean, and a object value is compared to these mean values
• Distributed– the cluster is built using statistical distributions
• Connectivity– he connectivity on these models is based on a distance function between elements
• Group– algorithms have only group information
• Graph– cluster organization and relationship between members is defined by a graph linked structure
• Density – members of the cluster are grouped by regions where observations are dense and similar

Clustering may be defined as a data reduction tool i.e. used to create subgroups that are more and more manageable than individual datum. Basically, clustering is justify as a process used for grouping a large number of data into meaningful groups or clusters based on some similarity between data. Clusters are the groups that have data similar on basis of common features and dissimilar to data in other clusters.

### K-Mean Clustering

The K-Means clustering algorithm is a partition-based cluster analysis method. According to the algorithm we firstly select k objects as initial cluster centers, then calculate the distance between each object and each cluster center and assign it to the nearest cluster, update the averages of all clusters, repeat this process until the criterion function converged. Square error criterion for clustering

$E=∑_{i=1}^k∑_{j=1}^{n_i}‖x_{ij}-m_i ‖^2$

$$x_{ij}$$​is the sample j of i-class, ​$$m_i$$​ is the center of i-class, i is the number of samples of i-class. K-means clustering algorithm is simply described as

Input: N objects to be cluster (​$$x_1,x_z,…,x_n$$​) the number of clusters k;

Output: k clusters and the sum of dissimilarity between each object and its nearest cluster center is the smallest;

Process:

1. Arbitrarily select k objects as initial cluster centers ​$$(m_1,m_2,…,m_k)$$​;
2. Calculate the distance between each object ​$$x_i$$​ and each cluster center, then assign each object to the nearest cluster, formula for calculating distance as:

$d(x_i,m_i )=√∑_{j=1}^d(x_i-m_j)^2,i=1…N,j=1…k$

$$d(x_i,m_i)$$​is the distance between data i and cluster j.

3. Calculate the mean of objects in each cluster as the new cluster centers,

$m_i=1/N ∑_{j-1}^{n_i}x_{ij} ,i=1,2,…,K$

$$N_i$$​ is the number of samples of current cluster i;

4. Repeat 2) 3) until the criterion function E converged, return ​$$(m_1,m_2,…,m_k )$$​Algorithm terminates.

### K – Means: Step-By-Step Example

As a simple illustration of a k-means algorithm, consider the following data set consisting of the scores of two variables on each of seven individuals:

 Subject A B 1 1.0 1.0 2 1.5 2.0 3 3.0 4.0 4 5.0 7.0 5 3.5 5.0 6 4.5 5.0 7 3.5 4.5

This data set is to be grouped into two clusters.  As a first step in finding a sensible initial partition, let the A & B values of the two individuals furthest apart (using the Euclidean distance measure), define the initial cluster means, giving:

 Individual Mean Vector (centroid) Group 1 1 (1.0, 1.0) Group 2 4 (5.0, 7.0)

The remaining individuals are now examined in sequence and allocated to the cluster to which they are closest, in terms of Euclidean distance to the cluster mean. The mean vector is recalculated each time a new member is added. This leads to the following series of steps:

 Cluster 1 Cluster 2 Steps Individual Mean Vector (centroid) Individual Mean Vector (centroid) 1 1 (1.0, 1.0) 4 (5.0, 7.0) 2 1, 2 (1.2, 1.5) 4 (5.0, 7.0) 3 1, 2, 3 (1.8, 2.3) 4 (5.0, 7.0) 4 1, 2, 3 (1.8, 2.3) 4, 5 (4.2, 6.0) 5 1, 2, 3 (1.8, 2.3) 4, 5, 6 (4.3, 5.7) 6 1, 2, 3 (1.8, 2.3) 4, 5, 6, 7 (4.1, 5.4)

Now the initial partition has changed, and the two clusters at this stage having the following characteristics:

 Individual Mean Vector (centroid) Cluster 1 1, 2, 3 (1.8, 2.3) Cluster 2 4, 5, 6, 7 (4.1, 5.4)

But we cannot yet be sure that each individual has been assigned to the right cluster.  So, we compare each individual’s distance to its own cluster mean and to that of the opposite cluster. And we find:

 Individual Distance to mean (centroid) of Cluster 1 Distance to mean (centroid) of Cluster 2 1 1.5 5.4 2 0.4 4.3 3 2.1 1.8 4 5.7 1.8 5 3.2 0.7 6 3.8 0.6 7 2.8 1.1

Only individual 3 are nearer to the mean of the opposite cluster (Cluster 2) than its own (Cluster 1).  In other words, each individual’s distance to its own cluster mean should be smaller that the distance to the other cluster’s mean (which is not the case with individual 3).  Thus, individual 3 is relocated to Cluster 2 resulting in the new partition:

 Individual Mean Vector (centroid) Cluster 1 1, 2 (1.3, 1.5) Cluster 2 3, 4, 5, 6, 7 (3.9, 5.1)

The iterative relocation would now continue from this new partition until no more relocation occur.  However, in this example each individual is now nearer its own cluster mean than that of the other cluster and the iteration stops, choosing the latest partitioning as the final cluster solution. Also, it is possible that the k-means algorithm won’t find a final solution.  In this case it would be a good idea to consider stopping the algorithm after a pre-chosen maximum of iterations.

#### Reference

[1]Kapil Joshi, Himanshu Gupta, Prashant Chaudhary, Punit Sharma, “Survey on Different Enhanced K-Means Clustering Algorithm”, International Journal of Engineering Trends and Technology (IJETT) – Volume 27 Number 4 – September 2015

[2]Amir Ahmad, Lipika Dey, “A k-mean clustering algorithm for mixed numeric and categorical data”, Data & Knowledge Engineering 63 (2007) 503–527

[3]http://mnemstudio.org/clustering-k-means-example-1.htm

[4]http://en.proft.me/2016/06/18/exploring-k-means-clustering-analysis-r/

Insert math as
Formula color
Type math using LaTeX
Preview
$${}$$
Nothing to preview
Insert