What percentage of characters in normal English Literature is written in capitals?

Roughly 2–4%, depending on the text and the genre.

To determine this, I downloaded a variety of texts from Project Gutenberg, then wrote a simple program to count the total number of alphabetic characters and the total number of capitalized characters in each file. Here are the raw numbers:

| Title (Author) | Letter Count | Caps Count | Percent Caps |
|---|---|---|---|
| Pride and Prejudice (Austen) | 2,641,527 | 14,177 | 2.56% |
| History of the Decline and Fall of the Roman Empire (Gibbon) | 1,295,410 | 34,893 | 2.69% |
| Moby Dick (Melville) | 968,516 | 28,204 | 2.91% |
| Great Expectations (Dickens) | 777,248 | 23,668 | 3.05% |
| Shunned House (Lovecraft) | 66,779 | 2,223 | 3.32% |
| Tom Sawyer (Twain) | 312,196 | 10,746 | 3.44% |
| Somebody Comes to Town, Somebody Leaves Town (Doctorow) | 495,594 | 17,366 | 3.50% |
| Bible (King James Version) | 3,343,105 | 117,344 | 3.51% |
| Ulysses (Joyce) | 1,203,807 | 55,244 | 4.58% |
| Hamlet (Shakespeare) | 139,132 | 7,812 | 5.61% |

Hamlet comes in with the highest percentage of capitals, probably because it’s a script and the repeated character names are always capitalized. Ulysses is also unusually high, because Joyce is weird and uses lots of capitals in unexpected places. The other texts run from about 2.5% to 3.5%.
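
For anyone who wants to reproduce the counts, here is a minimal sketch of what such a counting program might look like. It is not the original program; the function name `count_caps` is made up for illustration, and it assumes the files are plain UTF-8 text with any Project Gutenberg header and license text left in.

```python
def count_caps(text):
    """Return (letter_count, caps_count, percent_caps) for a string."""
    # Count every alphabetic character, then the subset that is uppercase.
    letters = sum(1 for ch in text if ch.isalpha())
    caps = sum(1 for ch in text if ch.isalpha() and ch.isupper())
    # Guard against files that contain no letters at all.
    percent = 100.0 * caps / letters if letters else 0.0
    return letters, caps, percent


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file names): python count_caps.py book1.txt book2.txt
    for path in sys.argv[1:]:
        with open(path, encoding="utf-8", errors="replace") as f:
            letters, caps, percent = count_caps(f.read())
        print(f"{path}: {letters:,} letters, {caps:,} caps, {percent:.2f}%")
```

Stripping the Gutenberg boilerplate first would change the numbers slightly, since the license text contains its own share of capitals.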

Edit: Added Melville, Lovecraft, Dickens, and Doctorow to fill out the comparison of contemporary, early 20th-century, and 19th-century authors. I’m not seeing much of a trend here, with the most contemporary authors actually having a somewhat higher percentage of capitals than the earlier ones. I suspect that more modern writers use shorter sentences, and therefore more sentence-initial capitals, and that this effect swamps the effect of freer capitalization in earlier texts.
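
One rough way to check the sentence-length suspicion would be to measure what share of a text's capitals are sentence-initial. The sketch below is only a heuristic, not something the numbers above were produced with: the name `sentence_initial_share` and the regular expression are assumptions, it only looks at ASCII capitals, and it will miscount abbreviations such as "Mr. Darcy" as sentence starts.

```python
import re

# A capital counts as sentence-initial if it opens the text or follows
# ., ! or ? (plus optional closing quotes/brackets) and whitespace,
# optionally preceded by an opening quote or bracket.
SENTENCE_START = re.compile(r"""(?:^|[.!?]["')\]]*\s+)["'(\[]*([A-Z])""")


def sentence_initial_share(text):
    """Return the fraction of ASCII capitals that start a sentence."""
    total_caps = sum(1 for ch in text if "A" <= ch <= "Z")
    if total_caps == 0:
        return 0.0
    initial = len(SENTENCE_START.findall(text))
    return initial / total_caps


print(sentence_initial_share("The cat sat. It saw Tom. Tom ran!"))
```

If the suspicion is right, this share should come out higher for the contemporary texts than for the 19th-century ones.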


Assuming the Project Gutenberg etext of Herman Melville’s Moby-Dick is representative of all English literature:

  • Uppercase letters: 24,559
  • (Lowercase letters: 936,138)
  • All characters: 1,231,937

24,559 / 1,231,937 = 2.00% capital letters (across all characters).

Or as a percentage of letters (ignoring non-letter characters):

  • Uppercase letters: 24,559
  • (Lowercase letters: 936,138)
  • All letters: 960,737

24,559 / 960,737 = 2.56% capital letters (across all letters).
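
The two percentages differ only in the denominator. As a small illustration (a sketch, not the script used for the figures above), the hypothetical function `caps_percentages` below computes both for any string. It assumes "all characters" means every character in the text, including spaces, punctuation, and newlines.

```python
def caps_percentages(text):
    """Return (percent_of_letters, percent_of_all_characters)."""
    upper = sum(1 for ch in text if ch.isalpha() and ch.isupper())
    letters = sum(1 for ch in text if ch.isalpha())
    total = len(text)  # assumption: whitespace and punctuation all count
    pct_of_letters = 100.0 * upper / letters if letters else 0.0
    pct_of_chars = 100.0 * upper / total if total else 0.0
    return pct_of_letters, pct_of_chars


of_letters, of_chars = caps_percentages("Call me Ishmael.")
print(f"{of_letters:.2f}% of letters, {of_chars:.2f}% of all characters")
```

The letters-only figure is always at least as large as the all-characters figure, because the letter count can never exceed the character count.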


Edit 2: Taking this a step further, I ran a script on the plain text ebooks from Project Gutenberg’s CD and DVDs:

| Source | Caps | Letters | Pct Caps (of letters) | Characters | Pct Caps (of characters) |
|---|---|---|---|---|---|
| Moby-Dick: 1 ebook | 24,559 | 960,737 | 2.56% | 1,231,937 | 2.00% |
| 2003 CD: 594 ebooks | 11,407,295 | 319,286,662 | 3.57% | 417,687,793 | 2.73% |
| 2006 DVD: 16,536 ebooks | 179,318,621 | 4,913,640,039 | 3.65% | 6,380,437,180 | 2.81% |
| 2010 DVD: 14,792 ebooks | 152,637,904 | 4,102,894,980 | 3.72% | 5,433,866,318 | 2.81% |
| Total: 31,923 ebooks | 343,388,379 | 9,336,782,418 | 3.68% | 12,233,223,228 | 2.81% |

The median values are 3.72% (of letters) and 2.85% (of all characters), and the mode values are 3.12% and 2.29%, respectively.
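
For reference, here is a sketch of how pooled, median, and mode figures like these could be computed over a collection of texts. It is not the script that was actually run: `corpus_summary` is a made-up name, only the letters-based percentage is shown, and taking the mode over percentages rounded to two decimal places is an assumption about how a mode could be defined here.

```python
from statistics import median, mode


def letter_counts(text):
    """Return (letter_count, caps_count) for one text."""
    letters = sum(1 for ch in text if ch.isalpha())
    caps = sum(1 for ch in text if ch.isalpha() and ch.isupper())
    return letters, caps


def corpus_summary(texts):
    """Summarize percent capitals (of letters) across an iterable of texts."""
    total_letters = total_caps = 0
    per_book = []  # one rounded percentage per ebook
    for text in texts:
        letters, caps = letter_counts(text)
        if letters == 0:
            continue  # skip empty or letterless files
        total_letters += letters
        total_caps += caps
        per_book.append(round(100.0 * caps / letters, 2))
    if not per_book:
        raise ValueError("no texts containing letters")
    return {
        "pooled": 100.0 * total_caps / total_letters,  # like the Total row
        "median": median(per_book),
        "mode": mode(per_book),  # first most-common value on ties
    }


print(corpus_summary(["Ab", "Abcd", "ABcd"]))
```

The pooled figure weights long books more heavily, while the median and mode treat every ebook equally, which is one reason the three can disagree.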