Data analysis is now integral to our working lives. It is the basis for investigations in many fields of knowledge, from science to engineering and from management to process control. Data on a particular topic are acquired in the form of symbolic and numeric attributes. Analysis of these data gives a better understanding of the phenomenon of interest. When development of a knowledge-based system is planned, the data analysis involves discovery and generation of new knowledge for building a reliable and comprehensive knowledge base.
What is Data Preprocessing
Exploratory data analysis and predictive analytics can be used to extract hidden patterns from data and are becoming increasingly important tools to transform data into information. Real-world data is generally incomplete and noisy, and is likely to contain irrelevant and redundant information or errors. Data preprocessing, which is an important step in data mining processes, helps transform the raw data to an understandable format.
Data preprocessing describes any type of processing performed on raw data to prepare it for another processing procedure. It transforms the data into a format that can be processed more easily and effectively for the user's purpose. Data preprocessing is a step of the knowledge discovery in databases (KDD) process that reduces the complexity of the data and offers better conditions for subsequent analysis. Through preprocessing, the nature of the data is better understood and the data analysis is performed more accurately and efficiently. Data preprocessing is used in database-driven applications such as customer relationship management, in rule-based applications, and in machine learning models such as neural networks.
Importance of Data Pre-processing
Data have quality if they satisfy the requirements of the intended use. Many factors make up data quality, including accuracy, completeness, consistency, timeliness, believability, and interpretability. Real-world data is usually incomplete (it may contain missing values), noisy (it may contain errors introduced during transmission or other dirty values), and inconsistent (it may contain duplicate or unexpected values). Data preprocessing is an established way of addressing such problems.
"No quality data, no quality mining results!" If the analysis is performed on low-quality data, the results will also be of low quality, which is not acceptable in a decision-making process. To obtain a quality result, the dirty data must first be cleaned, and data preprocessing techniques are what convert dirty data into quality data.
Techniques of Data Pre-processing
We look at the major steps involved in data preprocessing, namely, data cleaning, data integration, data reduction, and data transformation.
Figure 1: Techniques of Data Pre-processing
Data cleaning routines work to “clean” the data by filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies. If users believe the data are dirty, they are unlikely to trust the results of any data mining that has been applied. Furthermore, dirty data can cause confusion for the mining procedure, resulting in unreliable output. Although most mining routines have some procedures for dealing with incomplete or noisy data, they are not always robust. Instead, they may concentrate on avoiding over-fitting the data to the function being modeled. Therefore, a useful preprocessing step is to run your data through some data cleaning routines.
Data cleaning, also called data cleansing, is the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database.
Tasks in Data Cleaning:
- Fill in missing values
- Identify outliers and smooth noisy data
- Correct inconsistent data
Fill in Missing Values:
- Ignore the tuple
- Fill in the missing values manually
- Use a global constant to fill in the missing value.
- Use the most probable value
- Use the attribute mean or median for all the samples belonging to the same class as the given tuple
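Three of the strategies above (ignoring the tuple, filling with a global constant, and filling with the class mean or median) can be sketched in plain Python. This is a minimal illustration, not a definitive implementation: the records, the attribute names `income` and `cls`, and the helper function names are all hypothetical.

```python
from statistics import mean, median

# Hypothetical toy records; None marks a missing "income" value.
records = [
    {"id": 1, "cls": "low_risk", "income": 52000.0},
    {"id": 2, "cls": "low_risk", "income": None},
    {"id": 3, "cls": "low_risk", "income": 48000.0},
    {"id": 4, "cls": "high_risk", "income": 21000.0},
    {"id": 5, "cls": "high_risk", "income": None},
    {"id": 6, "cls": "high_risk", "income": 25000.0},
]

def ignore_tuples(rows, attr):
    """Ignore the tuple: drop every record whose attribute is missing."""
    return [r for r in rows if r[attr] is not None]

def fill_constant(rows, attr, constant):
    """Replace each missing value with one global constant."""
    return [{**r, attr: constant if r[attr] is None else r[attr]} for r in rows]

def fill_class_statistic(rows, attr, class_attr, statistic=mean):
    """Replace each missing value with the mean (or median) of its class.

    Assumes every class has at least one observed value.
    """
    observed = {}
    for r in rows:
        if r[attr] is not None:
            observed.setdefault(r[class_attr], []).append(r[attr])
    fill = {c: statistic(values) for c, values in observed.items()}
    return [
        {**r, attr: fill[r[class_attr]] if r[attr] is None else r[attr]}
        for r in rows
    ]

filled = fill_class_statistic(records, "income", "cls")
filled_median = fill_class_statistic(records, "income", "cls", statistic=median)
```

Each helper returns new records and leaves the input untouched, which makes it easy to compare strategies on the same raw data.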
Identify outliers and Smooth Noisy Data:
- Outlier analysis.
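As a rough sketch of these two tasks, the code below flags outliers with the interquartile-range rule and smooths values by bin means. Both are common choices assumed here for illustration, since the text does not prescribe a particular method. The function names, the multiplier `k=1.5`, and the sample values are assumptions.

```python
from statistics import mean, quantiles

def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

def smooth_by_bin_means(values, bin_size):
    """Sort the values, split them into equal-frequency bins, and
    replace every value with the mean of its bin."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        smoothed.extend([mean(bin_values)] * len(bin_values))
    return smoothed

# Hypothetical sensor readings with one suspicious value.
readings = [10, 12, 11, 13, 12, 11, 95]
suspects = iqr_outliers(readings)

# Hypothetical sorted prices, smoothed in bins of three values.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
smoothed_prices = smooth_by_bin_means(prices, 3)
```

Whether a flagged value is an error or a genuine extreme observation is a domain decision; the code only identifies candidates.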
Data mining often requires data integration, the merging of data from multiple data stores. Data integration is a data preprocessing technique that merges data from multiple heterogeneous sources into a coherent data store and provides a single view over all of those sources. Careful integration can help reduce and avoid redundancies and inconsistencies in the resulting data set, which can improve the accuracy and speed of the subsequent data mining process. Because the sources may contain inconsistent data, integration usually needs data cleaning as well. Data integration can be physical or virtual.
Tasks in Data Integration:
- Data integration: combine data from multiple sources into a single data store
- Schema integration: integrate metadata from different sources
- Entity identification problem: identify real-world entities from multiple data sources
- Detecting and resolving data value conflicts: for the same real-world entity, attribute values from different sources are different
- Handling redundancy in data integration
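The tasks above can be illustrated with a small sketch that merges two sources with different schemas. Everything here is hypothetical: the two sources, their attribute names, the mapping tables, and the assumption that a case-insensitive email address identifies a customer. The conflict policy (the first source wins, conflicts are reported rather than silently resolved) is also just one possible choice.

```python
# Hypothetical sources with different schemas.
crm = [
    {"cust_id": "C1", "email": "Ann@Example.com", "city": "Dublin"},
    {"cust_id": "C2", "email": "bob@example.com", "city": "Cork"},
]
billing = [
    {"customer_number": 901, "e_mail": "ann@example.com", "town": "Dublin"},
    {"customer_number": 902, "e_mail": "bob@example.com", "town": "Galway"},
    {"customer_number": 903, "e_mail": "cat@example.com", "town": "Limerick"},
]

# Schema integration: map each source's attribute names onto one target schema.
CRM_MAP = {"email": "email", "city": "city"}
BILLING_MAP = {"e_mail": "email", "town": "city"}

def to_target(row, mapping):
    """Rename a source row's attributes to the target schema."""
    return {target: row[source] for source, target in mapping.items()}

def entity_key(row):
    """Entity identification: assume a normalized email identifies a customer."""
    return row["email"].strip().lower()

def integrate(sources):
    """Merge (rows, mapping) pairs into one store keyed by entity.

    Duplicate entities are collapsed into one record (redundancy handling);
    differing attribute values are collected as conflicts.
    """
    merged, conflicts = {}, []
    for rows, mapping in sources:
        for row in rows:
            record = to_target(row, mapping)
            key = entity_key(record)
            record["email"] = key
            if key not in merged:
                merged[key] = record
                continue
            for attr, value in record.items():
                if merged[key][attr] != value:
                    conflicts.append((key, attr, merged[key][attr], value))
    return merged, conflicts

merged, conflicts = integrate([(crm, CRM_MAP), (billing, BILLING_MAP)])
```

In this sketch the two records for Bob disagree on the city, so the merged store keeps the first value and the disagreement is returned for later cleaning.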
Data transformation is the process of converting data from one format or structure into another format or structure. In this preprocessing step, the data are transformed or consolidated so that the resulting mining process may be more efficient, and the patterns found may be easier to understand.
Strategies for data transformation include the following:
- Smoothing, which works to remove noise from the data. Techniques include binning, regression, and clustering.
- Attribute construction (or feature construction), where new attributes are constructed and added from the given set of attributes to help the mining process.
- Aggregation, where summary or aggregation operations are applied to the data. For example, the daily sales data may be aggregated so as to compute monthly and annual total amounts. This step is typically used in constructing a data cube for data analysis at multiple abstraction levels.
- Normalization, where the attribute data are scaled so as to fall within a smaller range, such as −1.0 to 1.0, or 0.0 to 1.0.
- Discretization, where the raw values of a numeric attribute (e.g., age) are replaced by interval labels (e.g., 0–10, 11–20, etc.) or conceptual labels (e.g., youth, adult, senior). The labels, in turn, can be recursively organized into higher-level concepts, resulting in a concept hierarchy for the numeric attribute.
- Concept hierarchy generation for nominal data, where attributes such as street can be generalized to higher-level concepts, like city or country. Many hierarchies for nominal attributes are implicit within the database schema and can be automatically defined at the schema definition level.
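Three of these strategies (normalization, discretization, and aggregation) are simple enough to sketch directly. The code below is an illustrative sketch only: min-max scaling is one of several normalization methods, and the age cut points, function names, and sample sales figures are assumptions rather than values from the text.

```python
from collections import defaultdict

def min_max(values, new_min=0.0, new_max=1.0):
    """Min-max normalization: rescale values linearly into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical; map them to the lower bound.
        return [new_min for _ in values]
    return [
        (v - lo) / (hi - lo) * (new_max - new_min) + new_min
        for v in values
    ]

def discretize_age(age):
    """Replace a numeric age with a conceptual label (hypothetical cut points)."""
    if age < 18:
        return "youth"
    if age < 65:
        return "adult"
    return "senior"

def monthly_totals(daily_sales):
    """Aggregate (ISO date string, amount) pairs into totals per 'YYYY-MM'."""
    totals = defaultdict(float)
    for date, amount in daily_sales:
        totals[date[:7]] += amount
    return dict(totals)

ages = [20, 30, 40, 60]
unit_range = min_max(ages)                # scaled into 0.0 to 1.0
signed_range = min_max(ages, -1.0, 1.0)   # scaled into -1.0 to 1.0
labels = [discretize_age(a) for a in [12, 35, 70]]
sales = monthly_totals([
    ("2017-08-30", 100.0),
    ("2017-08-31", 50.0),
    ("2017-09-01", 75.0),
])
```

Min-max scaling is sensitive to outliers, because a single extreme value stretches the range, so it is usually applied after data cleaning.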
A database or data warehouse may store terabytes of data, and complex analysis on the complete data set may take a very long time. Data reduction is therefore used to obtain a reduced representation of the data set that is much smaller in volume yet produces the same, or almost the same, analytical results. More generally, data reduction is the transformation of numerical or alphabetical digital information derived empirically or experimentally into a corrected, ordered, and simplified form.
Data Reduction Strategies:
- Data Compression
- Dimensionality reduction
- Discretization and concept hierarchy generation
- Numerosity reduction
- Data cube aggregation.
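Two of these strategies can be sketched briefly: numerosity reduction by simple random sampling without replacement, and a very basic form of dimensionality reduction that drops attributes whose values never vary. This is a minimal sketch under stated assumptions; the function names, the fixed seed, the variance threshold, and the toy rows are hypothetical, and practical dimensionality reduction typically relies on more capable methods.

```python
import random
from statistics import pvariance

def sample_without_replacement(rows, n, seed=0):
    """Numerosity reduction: keep a random sample of n rows.

    A fixed seed makes the sample reproducible.
    """
    rng = random.Random(seed)
    return rng.sample(rows, n)

def drop_low_variance(rows, threshold=0.0):
    """Dimensionality reduction: keep only numeric attributes whose
    population variance exceeds the threshold."""
    keep = [
        attr for attr in rows[0]
        if pvariance([r[attr] for r in rows]) > threshold
    ]
    return [{attr: r[attr] for attr in keep} for r in rows]

# Hypothetical data set: "flag" is constant and carries no information.
rows = [{"x": i, "flag": 1, "y": i % 3} for i in range(100)]

sample = sample_without_replacement(rows, 10)
reduced = drop_low_variance(rows)
```

A sample only approximates the full data set, so results computed on it should be checked against the complete data where that is feasible.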