Where are all the Latin words?

Solution 1:

These figures are almost always the numbers for the top N words in a corpus. The results can vary considerably depending on what corpus is used and what N is (as you can see in this paper).

"The English language is a lot more French than we thought, here’s why" (by Andreas Simons, on Medium), summarizes one of the sources Wikipedia quotes for the 29% Latin figure:

The latest research was done in 1975 by Joseph M. Williams, where he examined the 10,000 most frequently used words in English, based on a rather small sample size of corporate letters. Here are my issues with his research:

  • the research carries a bias towards French and Latin, as companies are more likely to use academic language
  • proper names were not removed, possibly diluting the results for an etymological composition
  • he used the 10 000 most common words in that corpus of letters, not really “core vocabulary”

Because of these problems, the author found numbers of his own by taking a list of the 5,000 most common English words (which will be about "85% of all words in any English source") and scraping etymology sites (mostly Etymonline) to see what languages were mentioned in the first few words of each word's etymology:

Note that sometimes a word of Latin origin will return “French” using my method. This is because Etymonline always mentions French before Latin if the word entered English through French and the word changed sufficiently from the root. A word such as “origin” (from “origo”) will therefore return French, whereas a word such as “provide” (from “providere — provideo”) will return Latin.

I'm not sure how much I trust the results, but this is the most transparent analysis I found so far — the code used to generate the numbers is linked to in the article. This code can be modified to output words it classifies as Latin. I haven't run the code myself but it looks pretty simple to make these edits. Lines 220-240 in the original Sorter.py are:

    for word in words1:
        print(word)
        origin = scrape_and_interpret(word)
        if origin == "french":
            count_french += 1
            list_french.append(word)
        elif origin == "latin":
            count_latin += 1
            list_latin.append(word)
        elif origin == "old_english":
            count_old_english += 1
            list_old_english.append(word)
        elif origin == "germanic":
            count_germanic += 1
            list_germanic.append(word)
        elif origin == "greek":
            count_greek += 1
            list_greek.append(word)
        elif origin == "other":
            list_other.append(word)
            count_other += 1

Change two lines and get this:

    for word in words1:
        #print(word) # Comment out print statement that prints all words
        origin = scrape_and_interpret(word)
        if origin == "french":
            count_french += 1
            list_french.append(word)
        elif origin == "latin":
            count_latin += 1
            print(word) # Add `print` so that it prints out words of Latin origin
            list_latin.append(word)
        elif origin == "old_english":
            count_old_english += 1
            list_old_english.append(word)
        elif origin == "germanic":
            count_germanic += 1
            list_germanic.append(word)
        elif origin == "greek":
            count_greek += 1
            list_greek.append(word)
        elif origin == "other":
            list_other.append(word)
            count_other += 1

Alternatively, if you have access to the online OED, it's pretty easy to get a list by searching for current words of Latin origin, sorted by frequency. Note that many of these words also turn up when you search instead for words of French origin, since so many words have multiple etymological influences (it would be a bit strange to count them in only one direction or the other). I'm sure most people who know at least some English will recognize the top 1000 words on said list of Latin-origin words, and most educated people will recognize at least most of the next 1000 words, if not more.

Solution 2:

Two ways of doing these stats:

  • take a dictionary, look at origin of words, or
  • take a chunk of text, ditto.

The two can differ substantially if the second one counts occurrences. If any statistics gives 29% of words as Latin origin, then the former is the case: the stats was done on a dictionary, paying no regard to frequency of occurrence. In fact, the main Wikipedia entry for your linked statistics says this explicitly:

As a statistical rule, around 70 percent of words in any text are Anglo-Saxon.

Given that the three main lexical groups (Old English Anglo-Saxon, French, Latin) subsequently provided vocabulary for less and less familiar concepts, it is perhaps natural that the Latin items, which were the last to fill any voids, are the least frequent in terms of familiarity and frequency in (standard) texts. So it is hardly surprising that, off the top of one's head or in a random search, very few such words are found.

For some words, moreover, their exact pilgrimage into English is untractable perhaps. The word lexicon itself is an example. So I might take the stats with a grain of salt.

Solution 3:

I think the issue isn't solved by computational linguistics. Ultimately the question is: Did the word enter the lexicon due to the roman invasion or the french one.

Just because French and English both have the same word doesn't mean that that English got it from French or French from English. They independently can get it from multiple different places or from the same place

Otherwise we should forget about French and Latin and instead talk about protoIndoEuropean as surely the majority of the German/French/Latin/Greek/etc. can be traced back to that.