Frequency of word use vs number of words

There is a well-known formula that appears to describe the frequency distribution of English words reasonably well:

Zipf's law states that, given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table. Thus the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, and so on. For example, in the Brown Corpus, the word "the" is the most frequently occurring word, and by itself accounts for nearly 7% of all word occurrences (69,971 out of slightly over 1 million). True to Zipf's law, the second-place word "of" accounts for slightly over 3.5% of words (36,411 occurrences), followed by "and" (28,852 occurrences).
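As a rough check, the Brown Corpus counts quoted above can be compared with what an ideal Zipf distribution would predict from the count of the top word alone. The following Python sketch is illustrative only; the names brown_counts and zipf_prediction are invented here and use nothing beyond the three counts given in the text.

```python
# Brown Corpus counts quoted in the text for the three most frequent words.
brown_counts = {"the": 69971, "of": 36411, "and": 28852}


def zipf_prediction(top_count, rank):
    """Count that an ideal Zipf distribution predicts at `rank`,
    given the observed count of the rank-1 word."""
    return top_count / rank


top_count = brown_counts["the"]
for rank, (word, observed) in enumerate(brown_counts.items(), start=1):
    predicted = zipf_prediction(top_count, rank)
    print(f"rank {rank} {word!r}: observed {observed}, "
          f"predicted {predicted:.0f}, "
          f"observed/predicted {observed / predicted:.2f}")
```

On these figures "of" falls within about 4% of the ideal prediction, while "and" occurs roughly 24% more often than predicted, so the law is best read as an approximation rather than an exact rule.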

Either from actual measurements of word frequency (see, for example, Wiktionary: Frequency lists) or from Zipf's law, one can deduce that most words are used infrequently: there are tens of thousands of words that, taken all together, are used less often than any one of "the", "of", or "and". The 500 most frequently used words account for more than half of all usage, leaving hundreds of thousands of words that are each used very rarely.
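The claim about the top 500 words can be tested against an idealised Zipf distribution by summing the relative frequencies 1, 1/2, 1/3, ... directly. The sketch below assumes a vocabulary of 250,000 words (the quarter-million figure used later in this section); the function names harmonic and zipf_top_share are made up for this illustration.

```python
import math


def harmonic(m):
    """Sum of the Zipf weights 1 + 1/2 + ... + 1/m."""
    return math.fsum(1.0 / rank for rank in range(1, m + 1))


def zipf_top_share(m, n):
    """Fraction of all usage taken by the m most frequent of n words,
    assuming frequencies are exactly proportional to 1/rank."""
    return harmonic(m) / harmonic(n)


N_WORDS = 250_000  # assumed vocabulary size, not a measured value

print(f"top 500 share: {zipf_top_share(500, N_WORDS):.3f}")
```

Under these assumptions the top 500 words take slightly more than half of all usage (about 52%), in line with the statement above.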

Zipf's law predicts that relative frequencies decrease as 1, 1/2, 1/3, 1/4, ... If there are a quarter-million words in the English language and their frequencies follow Zipf's law, then their relative frequencies add up to 1 + 1/2 + 1/3 + 1/4 + ... + 1/250000, which is about 13.01, i.e. ln(250000) + gamma, where gamma is the Euler-Mascheroni constant, about 0.5772156649. By the same approximation, the m most frequent words together account for about ln(m) + gamma of that total. Given n words in all, to divide them by rank into k groups with the same total usage in each group, find the cut-off ranks m_i by solving ln(m_i) + gamma = i*(ln(n) + gamma)/k for i from 1 to k-1. That is, m_i = exp(i*(ln(n) + gamma)/k - gamma). For example:

• With n = 250000 and k = 2: m_1 = exp(13.00643/2 - gamma) = exp(5.926) = 374.65, implying that the first 375 words of 250000 will receive about 50% of all usage, if Zipf's law applies.
• With n = 250000 and k = 5: m_1 = exp(2.024) = 7.569, m_2 = 102.04, m_3 = 1375.6, m_4 = 18544.5, implying that the first 8 words will receive about 20% of all usage, the first 102 about 40%, the first 1376 about 60%, and the first 18545 about 80%, if Zipf's law applies.
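The cut-off formula above is easy to evaluate for other values of n and k. The following is a minimal Python sketch, assuming the same logarithmic approximation to the harmonic sum used in the text; the function name zipf_group_cutoffs is hypothetical.

```python
import math

EULER_GAMMA = 0.5772156649  # Euler-Mascheroni constant, as quoted in the text


def zipf_group_cutoffs(n, k):
    """Cut-off ranks m_1 .. m_(k-1) that split n Zipf-distributed words
    into k groups of equal total usage.

    Relies on the approximation 1 + 1/2 + ... + 1/m ~ ln(m) + gamma,
    so the results are estimates, not exact ranks.
    """
    total = math.log(n) + EULER_GAMMA
    return [math.exp(i * total / k - EULER_GAMMA) for i in range(1, k)]


for k in (2, 5):
    cutoffs = zipf_group_cutoffs(250_000, k)
    print(f"k={k}:", ", ".join(f"{m:.2f}" for m in cutoffs))
```

Rounding each value to the nearest whole word gives the counts quoted in the two examples. Because the approximation is least accurate for small m, the first cut-off for large k should be treated with the most caution.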