# Classification and prediction in data mining

August 10, 2017

Classification is a supervised learning technique in data mining. It is applied when the data samples (patterns) carry predefined class labels. A supervised learning algorithm first builds a data model from the existing labeled samples, which are known as training samples; building the model is known as training the algorithm. After training, the model is used to recognize the class of similar, newly arriving samples. Classification is an essential and popular technique in data mining because it is widely used when precise outcomes are required.

Figure 1: Classification

There are two forms of data analysis that can be used to extract models describing important classes or to predict future data trends. These two forms are as follows:

1. Classification
2. Prediction

These forms of data analysis help us better understand large data sets. Classification models predict categorical class labels, while prediction models predict continuous-valued functions. For example, we can build a classification model to categorize bank loan applications as either safe or risky, or a prediction model to predict the expenditures in dollars of potential customers on computer equipment given their income and occupation.

## Classification

The following are examples of cases where the data analysis task is classification:

• A bank loan officer wants to analyse the data in order to know which customers (loan applicants) are risky and which are safe.
• A marketing manager at a company needs to analyse the data to guess whether a customer with a given profile will buy a new computer.

In both of the above examples a model or classifier is constructed to predict categorical labels. These labels are risky or safe for loan application data and yes or no for marketing data.
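To make the idea of predicting a categorical label concrete, here is a minimal sketch of the marketing example using a k-nearest-neighbour vote. The attributes (age, income), the training samples, and the function name `classify` are all hypothetical and chosen only for illustration; the source does not prescribe any particular algorithm.

```python
import math
from collections import Counter

# Hypothetical training samples: (age, income in thousands) -> buys a computer?
TRAINING = [
    ((25, 30), "no"),
    ((30, 45), "no"),
    ((35, 80), "yes"),
    ((40, 95), "yes"),
    ((45, 60), "yes"),
    ((22, 20), "no"),
]

def classify(profile, training=TRAINING, k=3):
    """Return the majority class label among the k nearest training samples."""
    nearest = sorted(training, key=lambda item: math.dist(item[0], profile))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# The output is always one of the predefined labels, never a number.
print(classify((38, 85)))
print(classify((24, 25)))
```

Whatever the input profile, the result is drawn from the fixed label set, which is what makes this a classification task.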

## Prediction

The following is an example of a case where the data analysis task is prediction:

Suppose the marketing manager needs to predict how much a given customer will spend during a sale at the company. In this example we need to predict a numeric value, so the data analysis task is an example of numeric prediction. In this case a model or predictor is constructed that predicts a continuous-valued function, or ordered value.
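As a rough illustration of numeric prediction, the sketch below fits a straight line by ordinary least squares and uses it to predict spending from income. The data values and the names `fit_line` and `predict_spend` are assumptions made for this example; a single-attribute linear model is just one of many possible predictors.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: annual income (thousands) -> amount spent in a sale (dollars)
incomes = [30, 40, 50, 60, 70]
spent = [150, 200, 250, 300, 350]

slope, intercept = fit_line(incomes, spent)

def predict_spend(income):
    """Predict a continuous value (dollars) for a new customer."""
    return slope * income + intercept

print(predict_spend(55))
```

Unlike a classifier, the predictor can return any value on a continuous scale, including values that never appeared in the training data.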

# How Does Classification Work

The data classification process can be understood with the help of the bank loan application example. The process includes two steps:

• Building the Classifier or Model
• Using Classifier for Classification

## Building the Classifier or Model

• This step is the learning step or the learning phase.
• In this step the classification algorithms build the classifier.
• The classifier is built from the training set made up of database tuples and their associated class labels.
• Each tuple in the training set is assumed to belong to a predefined category or class. These tuples are also referred to as samples, objects or data points.
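The learning step can be sketched with a very small learner. The example below builds a one-rule classifier (a decision stump) from labeled loan tuples by choosing the income threshold that misclassifies the fewest training tuples. The single `income` attribute, the data, and the assumption that higher income means "safe" are hypothetical simplifications, not something the source specifies.

```python
# Hypothetical training set: each tuple is (income in thousands, class label).
training_set = [
    (20, "risky"), (28, "risky"), (35, "risky"),
    (42, "safe"), (55, "safe"), (70, "safe"),
]

def classify(income, threshold):
    """Apply the learned rule: IF income >= threshold THEN safe ELSE risky."""
    return "safe" if income >= threshold else "risky"

def build_classifier(training_set):
    """Learning step: return the threshold with the fewest training errors."""
    values = sorted(income for income, _ in training_set)
    # Candidate thresholds are midpoints between consecutive attribute values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]

    def errors(threshold):
        return sum(
            classify(income, threshold) != label
            for income, label in training_set
        )

    return min(candidates, key=errors)

threshold = build_classifier(training_set)
print(threshold, classify(50, threshold))
```

The learned model here is just one number, but it plays the same role as a larger classifier: it summarizes the training tuples in a form that can label unseen ones.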

## Using Classifier for Classification

In this step the classifier is used for classification. Here the test data, a set of labeled tuples that is independent of the training set, is used to estimate the accuracy of the classification rules. If the accuracy is considered acceptable, the rules can be applied to new data tuples.
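The second step can be sketched as follows: estimate accuracy on held-out test tuples, and only apply the classifier to new tuples if that estimate is acceptable. The rule inside `classifier`, the test data, and the `ACCEPTABLE` cutoff of 0.75 are assumptions for illustration; what counts as acceptable depends on the application.

```python
# Hypothetical classifier, standing in for the output of the learning step.
def classifier(income):
    return "safe" if income >= 38.5 else "risky"

# Hypothetical test set: labeled tuples that were not used for training.
test_set = [(25, "risky"), (33, "risky"), (40, "risky"), (48, "safe"), (66, "safe")]

def estimate_accuracy(classifier, test_set):
    """Fraction of test tuples whose predicted label matches the known label."""
    correct = sum(classifier(value) == label for value, label in test_set)
    return correct / len(test_set)

ACCEPTABLE = 0.75  # assumed cutoff, chosen only for this example

accuracy = estimate_accuracy(classifier, test_set)
if accuracy >= ACCEPTABLE:
    new_tuples = [30, 52]  # unlabeled applications
    print([classifier(value) for value in new_tuples])
else:
    print(f"accuracy {accuracy:.2f} is too low; revisit the learning step")
```

Measuring accuracy on the training tuples themselves would tend to give an optimistic estimate, which is why a separate test set is used.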

# Classification and Prediction Issues

The major issue is preparing the data for Classification and Prediction. Preparing the data involves the following activities:

• Data Cleaning – Data cleaning involves removing noise and treating missing values. Noise is removed by applying smoothing techniques, and missing values are handled by replacing each one with the most commonly occurring value for that attribute.
• Relevance Analysis – The database may also contain irrelevant attributes. Correlation analysis is used to determine whether any two given attributes are related.
• Data Transformation and Reduction – The data can be transformed by either of the following two methods, normalization or generalization.
• Normalization – Normalization involves scaling all values of a given attribute so that they fall within a small specified range. It is used when the learning step involves neural networks or methods that rely on distance measurements.
• Generalization – The data can also be transformed by generalizing it to higher-level concepts. Concept hierarchies can be used for this purpose.
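Three of the preparation activities above can be sketched in a few lines: filling missing values with the most common value, min-max normalization into a small range, and generalization through a concept hierarchy. The function names, the sample data, and the age hierarchy (youth, middle_aged, senior) are hypothetical; min-max scaling is one common normalization choice among several.

```python
from collections import Counter

def fill_missing_with_mode(values):
    """Replace None with the most frequently occurring non-missing value."""
    mode = Counter(v for v in values if v is not None).most_common(1)[0][0]
    return [mode if v is None else v for v in values]

def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so they fall within [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [new_min for _ in values]  # constant attribute
    scale = (new_max - new_min) / (hi - lo)
    return [new_min + (v - lo) * scale for v in values]

def generalize_age(age):
    """Map a raw age to a higher-level concept (hypothetical hierarchy)."""
    if age < 30:
        return "youth"
    if age < 60:
        return "middle_aged"
    return "senior"

occupations = ["clerk", None, "engineer", "clerk", None]
incomes = [20, 40, 60, 100]

print(fill_missing_with_mode(occupations))
print(min_max_normalize(incomes))
print([generalize_age(a) for a in (24, 45, 67)])
```

Normalization matters for distance-based methods because an attribute with a large range, such as income, would otherwise dominate an attribute with a small range, such as age.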

Note: Data can also be reduced by some other methods such as wavelet transformation, binning, histogram analysis, and clustering.
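Of the reduction methods mentioned in the note, binning is the simplest to sketch. The example below assigns values to equal-width bins and counts them, which also yields a basic histogram. The function name `equal_width_bins` and the sample ages are assumptions for illustration; equal-width binning is only one binning strategy.

```python
from collections import Counter

def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins equal-width intervals (0-based index)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0 for _ in values]  # all values identical
    # The maximum value would compute to index n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

ages = [21, 25, 33, 38, 47, 52, 58, 61]
bins = equal_width_bins(ages, 4)
histogram = Counter(bins)

print(bins)
print(sorted(histogram.items()))
```

After binning, eight distinct ages are represented by four bin indices, which reduces the number of distinct values a learning algorithm has to handle.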

# Comparison of Classification and Prediction Methods

Here are the criteria for comparing methods of Classification and Prediction:

• Accuracy – The accuracy of a classifier refers to its ability to predict the class label correctly; the accuracy of a predictor refers to how well it can guess the value of the predicted attribute for new data.
• Speed – This refers to the computational cost of generating and using the classifier or predictor.
• Robustness – This refers to the ability of the classifier or predictor to make correct predictions from noisy data.
• Scalability – This refers to the ability to construct the classifier or predictor efficiently given a large amount of data.
• Interpretability – This refers to the level of understanding and insight that the classifier or predictor provides.
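For a predictor, "how well it can guess the value" is usually expressed as an error measure rather than a proportion of correct labels. The sketch below computes two common ones, mean absolute error and root mean squared error. The metric choice, the function names, and the sample values are assumptions for illustration; the source does not name a specific measure.

```python
import math

def mean_absolute_error(predicted, actual):
    """Average size of the prediction errors, in the units of the target."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

def root_mean_squared_error(predicted, actual):
    """Like MAE, but large errors are penalized more heavily."""
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared / len(actual))

# Hypothetical predicted vs. actual spending (dollars) for three customers.
predicted_spend = [100, 220, 310]
actual_spend = [110, 200, 300]

print(mean_absolute_error(predicted_spend, actual_spend))
print(root_mean_squared_error(predicted_spend, actual_spend))
```

Lower values indicate a more accurate predictor, and two predictors can be compared by evaluating both on the same held-out data.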

