Why did John Tukey set 1.5 IQR to detect outliers instead of 1 or 2?

statistics

By definition, 50% of all measurements lie between the quartiles, which for a symmetric distribution means within $\pm0.5\,IQR$ of the median. Compare this - heuristically - with a normal distribution, where 68% are within $\pm\sigma$ of the mean, so in that case $0.5\,IQR$ would be somewhat less than $\sigma$ (about $0.67\sigma$). Placing the fences $1.5\,IQR$ beyond the quartiles, i.e. $2\,IQR$ from the median, is therefore comparable to cutting slightly below $\pm3\sigma$, which would declare a bit under 1% of measurements outliers. This matches quite well with the habit of using "$3\sigma$" as a bound in many simple statistical tests. On the other hand, fences at $1\,IQR$ beyond the quartiles would be like cutting near $\pm 2\sigma$, making about 4% outliers - too many; and fences at $2\,IQR$ would be like cutting near $\pm3.4\sigma$, thus turning even many quite extreme measurements into non-outliers. So $1.5\,IQR$ is also what Goldilocks would choose.

For a normal distribution, the 3rd quartile (Q3) sits at 0.675 SD (standard deviation, sigma) above the mean. The IQR (Q3 - Q1) therefore spans 2 x 0.675 SD = 1.35 SD. The upper outlier fence is Q3 plus 1.5 x IQR, i.e., 0.675 SD + 1.5 x 1.35 SD = 2.7 SD, and the lower fence sits symmetrically at -2.7 SD. Fences at this level would declare about 0.7% of the measurements to be outliers.
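The arithmetic above can be checked numerically. The following is only a sketch, assuming a standard normal distribution and using Python's standard-library `statistics.NormalDist`; the function names `fence_in_sd` and `outlier_fraction` are made up for this illustration.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, SD 1 (assumption: the data are Gaussian)


def fence_in_sd(k: float) -> float:
    """Position of the upper fence Q3 + k*IQR, in SD units from the mean."""
    q3 = STD_NORMAL.inv_cdf(0.75)  # third quartile, roughly 0.675 SD
    iqr = 2 * q3                   # Q3 - Q1, by symmetry
    return q3 + k * iqr


def outlier_fraction(k: float) -> float:
    """Two-sided fraction of a normal population lying outside both fences."""
    return 2 * (1 - STD_NORMAL.cdf(fence_in_sd(k)))


if __name__ == "__main__":
    for k in (1.0, 1.5, 2.0):
        print(f"k = {k}: fence at {fence_in_sd(k):.2f} SD, "
              f"outliers {outlier_fraction(k):.2%}")
```

Running it for k = 1, 1.5 and 2 shows how quickly the flagged fraction shrinks as the multiplier grows, which is the trade-off the answers here describe.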

We certainly CAN use whatever outlier bound we wish to use, but we will have to justify it eventually. In the not-so-recent past, it was typical to expect distributions to be Gaussian. With that assumption, ±1IQR is too exclusive, resulting in too MANY outliers, and ±2IQR is too inclusive, resulting in too FEW outliers. ±1.5IQR is easy to remember, and is a reasonable compromise under assumptions of Gaussianity.
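To make the "whatever bound we wish" point concrete, here is a minimal sketch of the fence rule with the multiplier left as a parameter. The helper names `tukey_fences` and `outliers` are hypothetical, and the quartiles come from `statistics.quantiles` with its default method; other quartile conventions give slightly different fences on small samples.

```python
from statistics import quantiles


def tukey_fences(data, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _median, q3 = quantiles(data, n=4)  # three cut points for quartiles
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def outliers(data, k=1.5):
    """Return the values of data that fall outside the fences."""
    lower, upper = tukey_fences(data, k)
    return [x for x in data if x < lower or x > upper]


if __name__ == "__main__":
    sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]  # made-up illustrative data
    for k in (1.0, 1.5, 2.0):
        print(k, tukey_fences(sample, k), outliers(sample, k))
```

Whatever value of k is chosen, it is the choice itself that needs justifying for the data at hand; the code only makes the consequence of each choice easy to inspect.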

However, for your distribution and expected outlier fraction, those assumptions may not be appropriate. Additionally, perhaps the definition of outlier is incorrect for your problem, and requires greater detail than just how it behaves within the bounds of a single metric?

As I recall, Prof. Michael Starbird, in one of his lectures in the recorded series, Joy of Thinking: The Beauty and Power of Classical Mathematical Ideas, answers this question. Dr. Starbird reports having attended the very conference presentation in which Tukey introduced this test, and during which Tukey himself was asked this very question. Tukey's answer: two seems like too much and one seems like not enough.
