Are there 3 or 4 quartiles? 99 or 100 percentiles?
So I understand that a quartile is a quantile where the data is divided into four groups.
   1   2   3
---|---|---|---
And 1, 2, and 3 are the quartiles. The second quartile is the median, etc.
But while studying for the GRE, I read this as part of the answer solution to a question:
From this you can conclude that the word quartile refers to one of the four groups that is created by listing the data in increasing order and then dividing the data into four groups of equal size.
Which indicates that these are the quartiles:
 1   2   3   4
---|---|---|---
So which is it? Are there 3 quartiles or 4?
On the bounty: I have sufficient evidence that the term is overloaded, so any answer that glosses over that won't be accepted. I'd like an explanation of how to use the terms and when.
Solution 1:
If you look in a non-mathematical dictionary, you will often find both definitions. For example, http://www.oxforddictionaries.com/us/definition/american_english/quartile defines quartile as
1 Each of four equal groups into which a population can be divided according to the distribution of values of a particular variable.
1.1 Each of the three values of the random variable that divide a population into four groups.
It is possible to find examples where the first definition is used. In Digest of Education Statistics 1999, edited by Thomas D. Snyder, Table 143 (page 157) has four columns under the heading "Socioeconomic status quartile", labeled Lowest, Second, Third, and Highest. Moreover, in footnote 1 of Table 144, we find the passage
The "Low" SES group is the lowest quartile; the "Middle" SES group is the middle two quartiles; and the "High" SES group is the upper quartile.
So a "quartile" in this context is a subset of the sample to which an individual belongs.
The Wikipedia article on quartile cites only one reference, the article "Sample quantiles in statistical packages", which, as the title suggests, is all about computing numbers to describe quantiles, in particular, the return value of the R function quantile(). The article therefore is mainly (exclusively?) concerned with the correct way to compute the numerical values that divide the data into quartiles (or other quantiles).
But if you go to other sources such as the NIST/SEMATECH e-Handbook of Statistical Methods, you will find passages such as
The box plot uses the median and the lower and upper quartiles (defined as the 25th and 75th percentiles). If the lower quartile is Q1 and the upper quartile is Q3, then the difference (Q3 - Q1) is called the interquartile range or IQ.
Here, clearly each quartile is a number: the lower quartile is not bounded by Q1; it is Q1 in this context, which is a number that can be subtracted from another number.
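To make the "number" usage concrete, here is a minimal Python sketch. The sample is made up for illustration, and the default method of `statistics.quantiles` is only one of several conventions for computing the cut points; it is not necessarily the one the Handbook has in mind.

```python
from statistics import quantiles

# Hypothetical sample, invented purely for illustration.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]

# With n=4, quantiles() returns the three cut points (Q1, Q2, Q3),
# not four groups of observations.
q1, q2, q3 = quantiles(data, n=4)

# Q1 and Q3 are plain numbers, so they can be subtracted.
iqr = q3 - q1

print(q1, q2, q3)  # 3.0 6.0 9.0
print(iqr)         # 6.0
```

Under this convention the function hands back exactly three values, which matches the "three quartiles" reading.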
My attempts to search for "quartile" on the Web seem to dredge up many more examples of the "number" usage than of the "subset" usage. I can guess a few reasons for this, though I have not found much other discussion of it:
- Unless the number of observations in your sample is divisible by $4$, you will not be able to separate the sample into four equal parts by rank.
- Much of statistics has the goal of describing data succinctly, for example by a mean and standard deviation. The four lists of members of each of four equal (or nearly-equal) subsets of a large sample do not constitute a succinct description; in some cases this can be almost as verbose as the entire data set. On the other hand, it requires just three numbers to describe the boundaries between these subsets of the data, hence those three numbers appear frequently in the literature.
- There are several competing ways to compute the values that should serve as the "dividing lines" between the four (not necessarily exactly equal) ranked subsets of the data. This leads to a great deal written about "quartiles" using the "number" definition.
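The first and third points can be illustrated with a short Python sketch. The ten observations are invented, and the two `method` options of `statistics.quantiles` stand in for the many competing conventions (they are not the full list found in statistical packages).

```python
from statistics import quantiles

# Ten made-up observations: 1, 2, ..., 10.
data = list(range(1, 11))

# 10 is not divisible by 4, so the sample cannot be split
# into four equal-sized groups by rank.
leftover = len(data) % 4

# Two of the competing conventions for the three dividing values.
exclusive = quantiles(data, n=4, method="exclusive")
inclusive = quantiles(data, n=4, method="inclusive")

print(leftover)   # 2
print(exclusive)  # [2.75, 5.5, 8.25]
print(inclusive)  # [3.25, 5.5, 7.75]
```

Both conventions agree on the median here but disagree on the lower and upper quartiles, which is the kind of discrepancy that generates so much writing about quartiles as numbers.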
But notice that in the quoted passages from the Digest and Handbook, above, there is no ambiguity whatsoever about which meaning of "quartile" is intended. If a particular use of the word could possibly be ambiguous, one can first use the word in an unambiguous context to establish its meaning, or one can simply define it.
Solution 2:
The word quartile refers to both the four partitions (or quarters) of the data set, and to the three points that mark these divisions. After all, we can't have one without the other.
When citing a value for a quartile, though, we are specifically referring to the three dividing points, else it'd be meaningless. Thus, the first, second, and third quartiles have a specific value in a data set. These points are often referred to as the lower, middle, and upper quartile.
On the other hand, we can say that there are multiple data points contained in the first, second, third, and fourth quartiles. In this context, we refer to the actual partition.
It all depends on context. The word is malleable, but the intent ought to be clear when used properly.
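As a rough sketch of the two senses side by side, the Python below computes the three dividing values and then sorts each observation into one of four groups. The scores are made up, and sending a value that equals a cut point to the lower group (via `bisect_left`) is an assumption of this example, not a standard rule.

```python
from bisect import bisect_left
from statistics import quantiles

# Hypothetical test scores, invented for illustration.
scores = [52, 55, 61, 64, 68, 70, 73, 77, 81, 85, 90, 94]

# Sense 1: the quartiles as three numbers (dividing points).
cut_points = quantiles(scores, n=4)

# Sense 2: the quartiles as four groups of observations.
groups = {1: [], 2: [], 3: [], 4: []}
for x in scores:
    # Count how many cut points lie strictly below x; ties go to the lower group.
    groups[bisect_left(cut_points, x) + 1].append(x)

print(cut_points)  # [61.75, 71.5, 84.0]
print(groups[1])   # [52, 55, 61]
print(groups[4])   # [85, 90, 94]
```

Here "the first quartile is 61.75" and "55 is in the first quartile" are both sensible statements, each using a different meaning of the word.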