Outlier Detection: Python based implementation

Charu Gupta
Sep 21, 2020
5 min read

Updated: Sep 23, 2020

An outlier is a data point that differs significantly from other observations. An outlier may be due to variability in the measurement or it may indicate experimental error. An outlier can cause serious problems in statistical analyses. Depending on the source; outliers may require different treatment before progressing further in analysis. Some times it may require deletion of outliers, or at other times techniques which are robust on outliers. So, this makes it very important to understand the data and detect the outliers present in them.

In other words, in a given dataset, outliers to other (regular) data-points are just like comparing apples to oranges, they both are fruits, but taste completely different. Similarly, outlier is an abnormal observation that lies far away from other values (in terms of values or features). In other words, Outliers are data points that don’t belong to a certain population under consideration. An outlier is an observation that diverges from otherwise well-structured data.

For Example, you can clearly see the outlier in this list:

[20,24,22,19,29,18,4300,30,18]

It is easy to identify it when the observations limited and it is one dimensional but when data grows in terms of number of observations and/or dimensionality (multi-dimensional), one may need a technique to detect those values.

How Anomalies Detection can help?

Detecting anomalies can be useful in multiple ways firstly if it is due to measurement error or wrong/incorrect data then cleaning this data during the Data Mining process can help us generating more accurate predictions and creation of better models.

Secondly, anomalies can alert for some problem or unusual activity; for example, Fraud detection by identifying change in the pattern of a credit card user. Unusual pattern in accessing a website can help in alerting for a Hacking attack. Smart watches and wristbands that can detect our heartbeats every few minutes, can help in detecting anomalies in the heartbeat data, thus alerting for a heart disease. Anomalies in traffic patterns can help in predicting accidents. In these type of situation, possibility of pro-active action increase and help in reducing the damage.

Classification of Outlier Detection Techniques:

Let’s classify these techniques in terms of dimensionality of the dataset

Univariate outlier detection

When a single feature in the dataset is under consideration. Following are some techniques univariate outlier detection-

Standard Deviation using z-score
Box plot

Multivariate outlier detection

These are found considering multi-dimensional features in the dataset. Below are some algorithms covered for this:

DBScan Algorithm
Isolated Forests Algorithm
Local Outlier Factor

Univariate outlier detection

In Univariate outlier detection, we consider a single feature from the given dataset and find out outlier data-points in it. The main techniques by which Univariate outlier detection is performed are as follows:

Standard Deviation using z-score:

In statistics, If a data distribution is approximately normal then about 68% of the data values lie within one standard deviation of the mean and about 95% are within two standard deviations, and about 99.7% lie within three standard deviations

It simply suggests that if a data point lies beyond 3 standard deviations away it can be considered as outlier.

The method to find out this is by getting the Z-score for all the data points given in a dataset. Z-score is a measure to find out how many standard deviations below or above the population mean, a data point lies

Code to find out Outliers using STD and Z-Score:

Let’s define some Data randomly, and format it

For the same dataset, applying direct Standard Deviations and Z-Score methods would yield the same output.

Box Plot Visualization:

Box plots are very simple and effective way to visualize outliers in a dataset. Boxplots displays the numerical data using their quantiles. And Data falling above a certain range are marked as outliers in the dataset.

The concept of the Interquartile Range (IQR) is used to build the boxplot graphs. It divides the data into four quartiles based on which outliers are decided.

Interquartile Range (IQR) is an important concept to define the outliers. It is the difference between the third quartile and the first quartile (IQR = Q3 -Q1). Outliers are defined as the observations that are below (Q1 − 1.5x IQR) or above (Q3 + 1.5x IQR). This are represented as the lower and upper whiskers in the box plot. So, all the values lying beyond the whiskers in a box plot are treated as outliers

Multivariate outlier detection

Till now we have dealt outlier detection in a single data feature. Using multivariate outlier detection techniques, we can detect outliers in a dataset using a combination of n features. In other words, multivariate outliers are found by considering n-dimensional space in the given dataset. There are different techniques to do the same. Some of them are discussed below:

DBScan Clustering: DBScan is a clustering algorithm which is used create clusters using data to form groups. It create clusters based on the minimum number of core points having pre-defined minimum distance (called eps) between them along with additional border points(which are part of the same cluster but farther away from the core points). All the points which are neither the core point nor the border point of any cluster are termed as Noise Point. These Noise points are only candidates for outliers.

Isolation Forest Implementation:

Isolation Forest algorithm is an ensemble based technique using decision tree. Anomalies are found based on their properties that they are the minority consisting of fewer instances and they have attribute-values that are very different from those of normal instances. Hence, as anomalies are ‘few and different’, they are more susceptible to isolation than normal points and can be isolated closer to the root of the tree.

The isolation forest first constructs random decision trees or isolation trees and repeats the process several times. Then, the average path length is calculated and normalized.

The above code prints the outliers detected and their position in the original data set. This method has a parameter to fix the percentage of the outliers to be detected in the dataset.

Local Outlier Factor:

Local Outlier Factor (LOF) is an Unsupervised technique for Outlier Detection

The anomaly score of each data-point is called ‘Local Outlier Factor’, which signifies the local deviation of density of a given data-point with respect to its neighbors. The term ‘local’ indicates the level of isolation of a data-point with respect to the surrounding neighborhood. More precisely, locality is given by k-nearest neighbors, whose distance is used to estimate the local density.

To find out the outliers in a data, the local densities of all the data-points is calculated and data-points which have substantially lower density than their neighbors, are called outliers.