Clustering in Data Mining

August 10, 2017

Clustering is a technique for grouping similar data objects based on their mutual similarity. Each data object represents the properties or features of an individual data pattern; these features are used to compute the similarity or dissimilarity between objects. A group of such objects is termed a cluster. Cluster analysis involves two main components: the centroid and the data objects, or cluster members.

Clustering is the grouping of a particular set of objects based on their characteristics, aggregating them according to their similarities. In data mining, this methodology partitions the data by applying a specific join algorithm best suited to the desired analysis. In hard partitioning, an object either strictly belongs to a cluster or is not part of it at all [1]. Soft partitioning, on the other hand, states that every object belongs to each cluster to a certain degree. More specific divisions are also possible, such as allowing objects to belong to multiple clusters, forcing an object to participate in only one cluster, or constructing hierarchical trees of group relationships. There are several different ways to implement this partitioning, based on distinct models. Distinct algorithms are applied to each model, differentiating their properties and results. These models are distinguished by their organization and by the type of relationship between their members. The most important ones are [2]:

• Centralized – each cluster is represented by a single mean vector, and an object's values are compared to these means
• Distributed – clusters are built using statistical distributions
• Connectivity – clusters are based on a distance function between elements
• Group – algorithms provide only grouping information
• Graph – cluster organization and the relationships between members are defined by a linked graph structure
• Density – members of a cluster are grouped by regions where observations are dense and similar

Cluster Analysis

Cluster analysis is a multivariate method which aims to classify a sample of subjects (or objects), on the basis of a set of measured variables, into a number of different groups such that similar subjects are placed in the same group. Cluster analysis has no mechanism for differentiating between relevant and irrelevant variables; therefore, the choice of variables included in a cluster analysis must be underpinned by conceptual considerations. This is very important because the clusters formed can be heavily dependent on the variables included [3]. Cluster analysis comprises a broad suite of techniques designed to find groups of similar items within a data set. Partitioning methods divide the data set into a number of groups specified in advance by the user. Hierarchical methods produce a hierarchy of clusters, from small clusters of very similar items to large clusters that include more dissimilar items; they usually produce a graphical output known as a dendrogram, or tree, that shows this hierarchical clustering structure. Some hierarchical methods are divisive: they progressively divide the one large cluster comprising all of the data into two smaller clusters, and repeat this process until all clusters have been divided. Other hierarchical methods are agglomerative and work in the opposite direction, first joining the most similar items into clusters and progressively adding less similar items until all items have been merged into a single large cluster. Cluster analysis can be run in Q-mode, in which clusters of samples are sought, or in R-mode, in which clusters of variables are desired.

Definition of clustering in data mining

A loose definition of clustering could be "the process of organizing objects into groups whose members are similar in some way". A cluster is therefore a collection of objects which are "similar" to each other and "dissimilar" to the objects belonging to other clusters. Cluster analysis is also used to form descriptive statistics to ascertain whether or not the data consists of a set of distinct subgroups, each group representing objects with substantially different properties. The latter goal requires an assessment of the degree of difference between the objects assigned to the respective clusters [4].

Figure 1: Example of Clustering

Central to clustering is deciding what constitutes a good clustering. This can only come from subject-matter considerations; there is no absolute "best" criterion that is independent of the final aim of the clustering. For example, we could be interested in finding representatives for homogeneous groups (data reduction), in finding "natural clusters" and describing their unknown properties ("natural" data types), in finding useful and suitable groupings ("useful" data classes), or in finding unusual data objects (outlier detection).

Clustering pertains to unsupervised learning, where data with class labels are not available. It basically involves enumerating C partitions, optimizing some criterion over iterations, so as to minimize the intra-cluster distance (dissimilarity), or equivalently maximize the intra-cluster resemblance (similarity), while maximizing the separation between clusters. The majority of the techniques that have been used for pattern discovery from data are clustering and classification methods.
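The criterion above can be made concrete with a toy computation: a good partition keeps members close to their own centroid (small intra-cluster distance) while keeping centroids far apart (large inter-cluster distance). This is a minimal illustrative sketch; the data points and helper names are ours, not from the text.

```python
# Illustrates the clustering criterion: small intra-cluster distance,
# large inter-cluster distance. Data and function names are illustrative.

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def mean_point(points):
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def intra_cluster_distance(cluster):
    """Average distance of members to their own cluster centroid."""
    c = mean_point(cluster)
    return sum(euclidean(p, c) for p in cluster) / len(cluster)

def inter_cluster_distance(c1, c2):
    """Distance between the centroids of two clusters."""
    return euclidean(mean_point(c1), mean_point(c2))

cluster_a = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1)]
cluster_b = [(5.0, 5.0), (5.1, 4.9), (4.8, 5.2)]

# For a good partition, the first value is much smaller than the second.
tight = intra_cluster_distance(cluster_a)
apart = inter_cluster_distance(cluster_a, cluster_b)
```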

Clustering Algorithms

Clustering algorithms may be classified as listed below [5]:

Exclusive Clustering: In exclusive clustering, data are grouped in an exclusive way, so that a certain datum belongs to only one definite cluster. K-means clustering is one example of an exclusive clustering algorithm.
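The exclusive property can be seen in a minimal k-means sketch: in the assignment step, every point is placed in exactly one cluster, that of its nearest centroid. This is an illustrative pure-Python implementation with a simple deterministic initialization (the first k points), not any particular library's version.

```python
# Minimal k-means sketch illustrating exclusive clustering: each point is
# assigned to exactly one cluster. Initialization and names are illustrative.

def kmeans(points, k, iters=20):
    centroids = list(points[:k])  # simple deterministic initialization
    for _ in range(iters):
        # Assignment step: each point goes to its single nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k),
                      key=lambda i: sum((a - b) ** 2
                                        for a, b in zip(p, centroids[i])))
            clusters[idx].append(p)
        # Update step: recompute each centroid as the mean of its members.
        for i, members in enumerate(clusters):
            if members:
                centroids[i] = tuple(sum(c) / len(members)
                                     for c in zip(*members))
    return centroids, clusters

points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3),
          (4.0, 4.0), (4.2, 3.9), (3.8, 4.1)]
centroids, clusters = kmeans(points, k=2)
# Every point ends up in exactly one of the two clusters.
```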

Overlapping Clustering: Overlapping clustering uses fuzzy sets to cluster data, so that each point may belong to two or more clusters with different degrees of membership. Fuzzy c-means is one example.

Hierarchical Clustering: The hierarchical clustering algorithm has two versions, agglomerative and divisive:

• Agglomerative clustering is based on the union of the two nearest clusters. The starting condition is realized by setting every datum as a cluster of its own. After a number of iterations, the desired final clusters are reached. Basically, this is a bottom-up version.
• Divisive clustering starts from one cluster containing all data items. At each step, clusters are successively split into smaller clusters according to some dissimilarity measure. Basically, this is a top-down version.
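The bottom-up version can be sketched in a few lines: start with every datum as its own cluster, then repeatedly merge the two nearest clusters until the requested number remains. Single linkage (distance of the closest pair) is used here as one illustrative choice of cluster-to-cluster distance; other linkages would also fit the description above.

```python
# Minimal agglomerative (bottom-up) sketch with single linkage.
# Illustrative code, not a specific library's implementation.

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def single_linkage(c1, c2):
    """Cluster distance = distance of the closest pair of members."""
    return min(euclidean(p, q) for p in c1 for q in c2)

def agglomerative(points, n_clusters):
    clusters = [[p] for p in points]  # begin: one cluster per datum
    while len(clusters) > n_clusters:
        # Find the two nearest clusters and merge them.
        i, j = min(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda ij: single_linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
result = agglomerative(points, 2)
```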

Probabilistic Clustering: Probabilistic clustering, e.g. a mixture of Gaussians, takes a completely probabilistic approach.
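The probabilistic view can be illustrated for a mixture of Gaussians with known parameters: each point receives a posterior probability ("responsibility") of having been generated by each component, computed via Bayes' rule. The two-component parameters below are made-up illustrative values; a full method would also fit them, e.g. with the EM algorithm.

```python
# Sketch of probabilistic cluster membership for a 1-D Gaussian mixture
# with fixed, illustrative parameters.
import math

def gaussian_pdf(x, mean, std):
    return math.exp(-((x - mean) ** 2) / (2 * std ** 2)) / (std * math.sqrt(2 * math.pi))

def responsibilities(x, weights, means, stds):
    """Posterior P(component k | x) via Bayes' rule."""
    joint = [w * gaussian_pdf(x, m, s) for w, m, s in zip(weights, means, stds)]
    total = sum(joint)
    return [j / total for j in joint]

# Two components of equal weight, centred at 0 and at 5.
r = responsibilities(x=0.5, weights=[0.5, 0.5], means=[0.0, 5.0], stds=[1.0, 1.0])
# r sums to 1; the point is far more likely to belong to the first component.
```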

References

[1] “What is Clustering in Data Mining?”, available online at: http://bigdata-madesimple.com/what-is-clustering-in-data-mining/

[2] Steven M. Holland, “Cluster Analysis”, available online at: http://strata.uga.edu/software/pdf/clusterTutorial.pdf, January 2006.

[3] Rosie Cornish, “3.1 Cluster Analysis”, available online at: http://www.statstutor.ac.uk/resources/uploaded/clusteranalysis.pdf, 2007.

[4] “A Tutorial on Clustering Algorithms”, available online at: http://www.elet.polimi.it/upload/matteucc/Clustering/tutorial_html/

[5] Ritika, “Research on Data Mining Classification”, International Journal of Advanced Research in Computer Science and Software Engineering, Volume 4, Issue 4, April 2014
