Z-score for outlier detection
Values that do not follow the general pattern and diverge from the rest of the data set can be regarded as outliers. It is useful to identify the outliers in our data so that we can either mitigate them or study them carefully, as they can be of great statistical and business importance.
The z-score, also known as the standard score, measures how far a value deviates from the mean: it tells us how many standard deviations a data point lies from the mean.
The z-score can be calculated as:
z-score = (x - mean) / standard deviation
A commonly used threshold for the z-score is 3. If the z-score of a data point is greater than 3 or less than -3, that is, if the point lies more than 3 standard deviations from the mean, it can be regarded as an outlier.
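The formula and threshold rule can be sketched in Python as follows. This is a minimal illustration, not a definitive implementation: the function names z_scores and find_outliers and the readings list are hypothetical, and the sketch assumes the population standard deviation (dividing by n), which is a choice the text does not state explicitly.

```python
from statistics import fmean, pstdev


def z_scores(values):
    """Return the z-score of each value: (x - mean) / standard deviation."""
    mean = fmean(values)
    std = pstdev(values)  # assumption: population standard deviation
    if std == 0:
        raise ValueError("z-scores are undefined when all values are identical")
    return [(x - mean) / std for x in values]


def find_outliers(values, threshold=3.0):
    """Return the values whose z-score is above threshold or below -threshold."""
    return [x for x, z in zip(values, z_scores(values)) if abs(z) > threshold]


# Hypothetical readings: nineteen values of 10 and a single value of 100
readings = [10] * 19 + [100]
print(find_outliers(readings))  # [100]
```

Comparing the absolute value of the z-score against the threshold covers both the "greater than 3" and "less than -3" cases in one check.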
Let us consider a set of data points:
399,114,737,677,438,806,231,607,880,550,374,748,342,985,853,187,762,953,914,453,2010,2179,3800
The mean of these points is 869.52 and the (population) standard deviation is 792.11. Calculating the z-score for each individual point shows that only 3800 exceeds the threshold, with a z-score of about 3.70, so it is the only outlier in this dataset.
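The calculation for this dataset can be reproduced with a short Python sketch. It assumes the population standard deviation (statistics.pstdev); using the sample standard deviation instead (statistics.stdev) gives a larger value of roughly 809.91 and therefore slightly smaller z-scores. The variable names are illustrative.

```python
from statistics import fmean, pstdev

data = [399, 114, 737, 677, 438, 806, 231, 607, 880, 550, 374, 748,
        342, 985, 853, 187, 762, 953, 914, 453, 2010, 2179, 3800]

mean = fmean(data)
std = pstdev(data)  # assumption: population standard deviation (divide by n)

# Pair every data point with its z-score
scored = [(x, (x - mean) / std) for x in data]

# Keep the points that lie more than 3 standard deviations from the mean
outliers = [(x, round(z, 2)) for x, z in scored if abs(z) > 3]

print(round(mean, 2), round(std, 2))  # approximately 869.52 792.11
print(outliers)                       # [(3800, 3.7)]
```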
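Z-scores can be either positive or negative: a positive z-score means the data point lies above the mean, and a negative z-score means it lies below the mean. In the dataset above, for example, 114 has a z-score of about -0.95.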