What percentage of characters in normal English Literature is written in capitals?
Roughly 2–4%, depending on the text and the genre.
To determine this, I downloaded a variety of texts from Project Gutenberg, then wrote a simple program to count the total number of alphabetic characters and the total number of capitalized characters in each file. Here are the raw numbers:
Title (Author) | Letter Count | Caps Count | Percent Caps |
---|---|---|---|
Pride and Prejudice (Austen) | 2,641,527 | 14,177 | 2.56% |
History of the Decline and Fall of the Roman Empire (Gibbon) | 1,295,410 | 34,893 | 2.69% |
Moby Dick (Melville) | 968,516 | 28,204 | 2.91% |
Great Expectations (Dickens) | 777,248 | 23,668 | 3.05% |
Shunned House (Lovecraft) | 66,779 | 2,223 | 3.32% |
Tom Sawyer (Twain) | 312,196 | 10,746 | 3.44% |
Somebody Comes to Town, Somebody Leaves Town (Doctorow) | 495,594 | 17,366 | 3.50% |
Bible (King James Version) | 3,343,105 | 117,344 | 3.51% |
Ulysses (Joyce) | 1,203,807 | 55,244 | 4.58% |
Hamlet (Shakespeare) | 139,132 | 7,812 | 5.61% |
Hamlet comes in with the highest percentage of capitals, probably because it’s a script and the repeated character names are always capitalized. Ulysses is also unusually high, because Joyce is weird and uses lots of capitals in unexpected places. The other texts run from about 2.5% to 3.5%.
Edit: Added Melville, Lovecraft, Dickens, and Doctorow to fill out the comparison of contemporary, early-20th-century, and 19th-century authors. I’m not seeing much of a trend here; if anything, the most contemporary authors have a somewhat higher percentage of capitals than the earlier ones. I suspect that more modern writers use shorter sentences, and therefore more sentence-initial capitals, and that this effect swamps the effect of freer capitalization in earlier texts.
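The program used for the table isn't shown, so here is a minimal sketch of what such a counter might look like. The function names (`count_caps`, `percent_caps`, `percent_caps_in_file`) are made up for illustration, and it assumes the Gutenberg file is plain text readable as UTF-8. It also counts the Project Gutenberg header and license text unless you strip those first.

```python
from pathlib import Path


def count_caps(text: str) -> tuple[int, int]:
    """Return (letter_count, caps_count) for a string."""
    # str.isalpha() is true for any Unicode letter, not only A-Z/a-z.
    letters = sum(1 for ch in text if ch.isalpha())
    # Only count characters that are both letters and uppercase.
    caps = sum(1 for ch in text if ch.isalpha() and ch.isupper())
    return letters, caps


def percent_caps(text: str) -> float:
    """Capital letters as a percentage of all letters (0.0 if no letters)."""
    letters, caps = count_caps(text)
    return 100.0 * caps / letters if letters else 0.0


def percent_caps_in_file(path: str) -> float:
    """Same as percent_caps, but reads the text from a file (assumed UTF-8)."""
    text = Path(path).read_text(encoding="utf-8", errors="replace")
    return percent_caps(text)


if __name__ == "__main__":
    sample = "Hello World"
    print(count_caps(sample))              # (letters, caps)
    print(f"{percent_caps(sample):.2f}%")
```

Running `percent_caps_in_file` on each downloaded text would give one row of a table like the one above.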
Assuming the Project Gutenberg etext of Herman Melville’s Moby-Dick is representative of all English literature:
- Uppercase letters: 24,559
- (Lowercase letters: 936,138)
- All characters: 1,231,937
24,559 / 1,231,937 = 1.99% capital letters (across all characters).
Or as a percentage of letters (ignoring non-letter characters):
- Uppercase letters: 24,559
- (Lowercase letters: 936,138)
- All letters: 960,737
24,559 / 960,737 = 2.56% capital letters (across all letters).
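The two figures differ only in the denominator. A rough sketch of that distinction, with a made-up function name (`caps_ratios`), and assuming "all characters" means every character in the file including spaces, punctuation, and line breaks:

```python
def caps_ratios(text: str) -> tuple[float, float]:
    """Return (pct of all characters, pct of letters) that are capital letters."""
    caps = sum(1 for ch in text if ch.isalpha() and ch.isupper())
    letters = sum(1 for ch in text if ch.isalpha())
    characters = len(text)  # assumption: whitespace and punctuation count too

    pct_of_characters = 100.0 * caps / characters if characters else 0.0
    pct_of_letters = 100.0 * caps / letters if letters else 0.0
    return pct_of_characters, pct_of_letters


if __name__ == "__main__":
    of_chars, of_letters = caps_ratios("Call me Ishmael.")
    print(f"{of_chars:.2f}% of all characters, {of_letters:.2f}% of letters")
```

Since there are never more letters than characters, the per-letter percentage is always at least as high as the per-character one.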
Edit 2: Taking this a step further, I ran a script on the plain text ebooks from Project Gutenberg’s CD and DVDs:
Source | Caps | Letters | Pct Caps (of letters) | Characters | Pct Caps (of characters) |
---|---|---|---|---|---|
Moby-Dick: 1 ebook | 24,559 | 960,737 | 2.56% | 1,231,937 | 1.99% |
2003 CD: 594 ebooks | 11,407,295 | 319,286,662 | 3.57% | 417,687,793 | 2.73% |
2006 DVD: 16,536 ebooks | 179,318,621 | 4,913,640,039 | 3.65% | 6,380,437,180 | 2.81% |
2010 DVD: 14,792 ebooks | 152,637,904 | 4,102,894,980 | 3.72% | 5,433,866,318 | 2.81% |
Total: 31,923 ebooks | 343,388,379 | 9,336,782,418 | 3.68% | 12,233,223,228 | 2.81% |
The median values are 3.72% (of letters) and 2.85% (of all characters), and the mode values are 3.12% and 2.29%, respectively.
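The script itself isn't shown, so the following is only a sketch of how a pooled percentage can differ from a median or mode. It assumes the median and mode are taken over per-ebook percentages rounded to two decimals, which the answer doesn't state. The `summarize` function and the counts in the example are hypothetical, not Gutenberg data.

```python
from statistics import median, multimode


def summarize(books: dict[str, tuple[int, int]]) -> dict:
    """Summarize capitalization over a corpus.

    `books` maps a title to (caps_count, letter_count).
    """
    total_caps = sum(caps for caps, _ in books.values())
    total_letters = sum(letters for _, letters in books.values())

    # Per-book percentages, rounded so that a mode is meaningful.
    per_book = [
        round(100.0 * caps / letters, 2)
        for caps, letters in books.values()
        if letters
    ]

    return {
        # Pooled: total caps over total letters, so long books weigh more.
        "pooled": round(100.0 * total_caps / total_letters, 2),
        # Median and mode: every book counts equally, whatever its length.
        "median": round(median(per_book), 2),
        "modes": multimode(per_book),
    }


if __name__ == "__main__":
    # Made-up counts purely for illustration.
    toy = {"a": (3, 100), "b": (3, 100), "c": (10, 200), "d": (40, 1000)}
    print(summarize(toy))
```

The pooled figure is dominated by long books, while the median and mode treat every book equally, which is one reason the totals row and the median need not match.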