Outlier Detection in Data Mining

January 2, 2018 Author: rajesh
Print Friendly, PDF & Email

Outlier detection is a primary step in many data-mining applications. In many data analysis tasks a large number of variables are being recorded or sampled. One of the first steps towards obtaining a coherent analysis is the detection of outlaying observations. Although outliers are often considered as an error or noise, they may carry important information. Detected outliers are candidates for aberrant data that may otherwise adversely lead to model misspecification, biased parameter estimation and incorrect results. It is therefore important to identify them prior to modeling and analysis.

Outlier Detection Overview




Outlier Detection is an algorithmic feature that allows you to detect when some members of a group are behaving strangely compared to the others. Outlier detection is an important research problem in data mining that aims to find objects that are considerably dissimilar, exceptional and inconsistent with respect to the majority of the data in an input database. Outliers are extreme values that deviate from other observations on data; they may indicate variability in a measurement, experimental errors or a novelty. An outlier is an observation (or measurement) that is different with respect to the other values contained in a given dataset. Outliers can be due to several causes. The measurement can be incorrectly observed, recorded or entered into the process computer, the observed datum can come from a different population with respect to the normal situation and thus is correctly measured but represents a rare event.

What is Outlier Detection?




Outlier detection is the process of detecting and subsequently excluding outliers from a given set of data. An outlier may be defined as a piece of data or observation that deviates drastically from the given norm or average of the data set. An outlier may be caused simply by chance, but it may also indicate measurement error or that the given data set has a heavy-tailed distribution.  In other words, an outlier is an observation that diverges from an overall pattern on a sample.

Outliers can be of two kinds: univariate and multivariate. Univariate outliers can be found when looking at a distribution of values in a single feature space. Multivariate outliers can be found in an n-dimensional space (of n-features). Looking at distributions in n-dimensional spaces can be very difficult for the human brain that is why we need to train a model to do it for us.

Outliers can also come in different flavors, depending on the environment: point outliers, contextual outliers, or collective outliers. Point outliers are single data points that lay far from the rest of the distribution. Contextual outliers can be noise in data, such as punctuation symbols when realizing text analysis or background noise signal when doing speech recognition. Collective outliers can be subsets of novelties in data such as a signal that may indicate the discovery of new phenomena.

Outlier Analysis

Figure 1: Outlier Analysis

In real world problems, outliers are also called: noises, anomalies, abnormalities, discordances, deviants, In other words, an outlier is a data point which has many different properties compared to the natural of the dataset. Detecting outliers is always a very important task in data mining. Generally, it helps remove noisy data that could affect the final outcome of the mining algorithms. Furthermore, finding outliers could also be useful to find the abnormal characteristics in data generation process.

Approaches of Outlier Detection




Detecting outliers can be approached by two different techniques: unsupervised technique and supervised technique. On the one hand, the unsupervised technique is used to find outlier without any prior knowledge about the properties of the outliers. On the other hand, the supervised technique uses a dataset that stores the characteristics of both normal data and outliers then try to find the best way to separate them into two classes.

Unsupervised Approaches

There are several unsupervised approaches that can be used to solve the outlier detection problem. For example:

  1. Statistical Model: The statistical approach assumes that data can be modeled in a joint probability distribution. It will then try to fit a certain distribution to the input data. The data that have low fit values can be considered as outliers.
  2. Regression Model: Unlike statistical model, regression model aims at finding the trend of data over a specific parameter (such as time). An outlier is detected if and only if it does not follow the overall trend of the dataset.
  3. Principal Component Analysis: Principal component analysis has been widely known for its ability to select important features underlying in the dataset. It is studied that most of the variance of the dataset can be captured at the lower space formed by the top eigenvectors. If the distance from a certain point to the selected top eigenvectors is too much, it could be highlighted as an outlier.
  4. Proximity-based model: In the proximity-based model, a data it considered as a data point in a high-dimensional space. An outlier is a certain point that is isolated from the remaining data. Detecting the isolated data points can be carried out by applying the clustering algorithms, density-based algorithm, or event nearest neighbour approaches.

Supervised Approaches

Supervised approaches simply consider the input dataset has two different labels: the normal and the abnormal (or the outlier). The outlier detection problem now then becomes a simple classification problem. We could use any classification algorithm to solve it such as SVM, Neural Network, and Linear Classifier

Some Application of Outlier Detection

Network Intrusion Detection (NDS): The network-based system often store information about operating system calls, network traffics. Analyzing these types of data can help us to find the malicious activities that can affect the performance of the whole system.

Credit Card Fraud: As you may know, using a credit card (or event internet banking account) we could pay the bill or withdraw money instantly from every place in the world. This is a huge advancement because we do not have to carry a big pile of money with us while travelling. However, due to some accident, users might lose the information of their cards (banking accounts) to some hackers. This leads to some suspicious transactions which are not known to the real owners. This is where the outlier detection techniques come in. They help banks to find the “abnormal” transactions; the bank will then verify the transactions before making it happen.

Medical Diagnosis: In the medical point of view, the normal users often have the similar pattern; and diseases often occur with an unusual pattern which reflects the disease conditions. If we could detect the “unusual pattern” we could identify the “potentially” ill patients and do further diagnoses to verify it.

Earth Science: Many applications of outlier detections have been implemented in Earth science. For example climates changes, weather pattern

References

[1] Irad Ben-Gal, “Chapter 1: Outlier Detection”, Knowledge Discovery Handbook: A Complete Guide for Practitioners and Researchers,” Kluwer Academic Publishers, 2005.

[2] Silvia Cateni, Valentina Colla and Marco Vannucci, “Outlier Detection Methods for Industrial Applications”, available online at: http://cdn.intechopen.com/pdfs/4666.pdf

[3] Hatng, “Outlier detection, an overview and applications”, available online at: https://blogdotrichanchordotcom.wordpress.com/2016/05/23/outlier-detection-an-overview-and-applications/ , May 23, 2016.

No Comments

Leave a Reply

Your email address will not be published. Required fields are marked *

Insert math as
Block
Inline
Additional settings
Formula color
Text color
#333333
Type math using LaTeX
Preview
\({}\)
Nothing to preview
Insert