Is it true that a word ending in -y is more likely to be an adjective than a noun?

I originally wrote a quick Python script that ran on the "Part of Speech Database" here, which is a combination of WordNet and Moby. Then I modified it to run on the frequency list here, which is based on COCA.

The first script found 29476 words ending in -y, of which 13677 ended in -ly. That leaves 15799 words ending in -y but not -ly. Among these, only 2643 were adjectives.

The key result is therefore 2643/15799 = 0.16729, or approximately 1 out of 6.
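The counting step can be sketched in Python roughly as follows. This is a minimal sketch, not the original script: the `pos_by_word` mapping and its toy entries are hypothetical stand-ins for the Part of Speech Database, and counting a word as an adjective when any of its tags is "adjective" is an assumption about how multi-tag words were handled. The last lines only redo the arithmetic on the counts reported above.

```python
def count_y_words(pos_by_word):
    """Return (n_y, n_ly, n_y_not_ly, n_adjectives) for a word -> tags mapping."""
    y_words = [w for w in pos_by_word if w.endswith("y")]
    ly_words = [w for w in y_words if w.endswith("ly")]
    y_not_ly = [w for w in y_words if not w.endswith("ly")]
    # Assumption: a word counts as an adjective if any of its tags says so.
    adjectives = [w for w in y_not_ly if "adjective" in pos_by_word[w]]
    return len(y_words), len(ly_words), len(y_not_ly), len(adjectives)

# Hypothetical toy lexicon, not taken from the real database.
toy = {
    "happy": {"adjective"},
    "otolaryngology": {"noun"},
    "quickly": {"adverb"},
    "fly": {"noun", "verb"},  # ends in -ly, so the suffix filter drops it
    "city": {"noun"},
    "table": {"noun"},
}
toy_counts = count_y_words(toy)

# Arithmetic on the counts reported above.
y_not_ly_reported = 29476 - 13677
reported_ratio = 2643 / y_not_ly_reported
print(toy_counts, y_not_ly_reported, round(reported_ratio, 5))
```

Note that the filter is purely orthographic: in the toy data "fly" is excluded along with the true -ly adverbs.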

This did not take word frequencies into account, and I suspected they would raise the proportion somewhat, as many of the -y non-adjectives were quite rare (for example otolaryngology, noun). So I edited the program to tally instances of each word from a COCA-derived frequency list.

This found:

  • 23,771,109 instances of -y words;

  • 5,713,230 instances of -ly words;

  • 18,057,879 instances of -y words that were not -ly words;

  • 1,632,165 instances of adjectives among this set.

This leads to a frequency of 1632165/18057879 = 0.090385. Roughly 9% of the instances of words ending in -y but not -ly were adjectives. Surprisingly, this result was even smaller than the unweighted one. I guess in the scheme of things "traditionally-suffixed" adjectives aren't really that common.
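A frequency-weighted version of the count might look like the sketch below. The names `freq_by_word` and `pos_by_word` and the toy numbers are hypothetical, not the COCA data; the closing lines just recompute the ratio from the totals listed above.

```python
def weighted_counts(freq_by_word, pos_by_word):
    """Return (adjective_instances, all_instances) over -y but not -ly words."""
    adj = total = 0
    for word, count in freq_by_word.items():
        if word.endswith("y") and not word.endswith("ly"):
            total += count
            # Assumption: any "adjective" tag makes every instance count.
            if "adjective" in pos_by_word.get(word, set()):
                adj += count
    return adj, total

# Hypothetical toy frequencies and tags.
toy_freq = {"happy": 50, "city": 300, "they": 900, "quickly": 120,
            "otolaryngology": 1}
toy_pos = {"happy": {"adjective"}, "city": {"noun"}, "they": {"pronoun"},
           "quickly": {"adverb"}, "otolaryngology": {"noun"}}
toy_adj, toy_total = weighted_counts(toy_freq, toy_pos)

# Arithmetic on the totals reported above.
y_not_ly_instances = 23_771_109 - 5_713_230
reported_share = 1_632_165 / y_not_ly_instances
print(toy_adj, toy_total, y_not_ly_instances, round(reported_share, 6))
```

In the toy data, a single very frequent non-adjective ("they") outweighs the rare nouns, which is one way weighting by frequency can push the share down rather than up.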

From the same data I also looked at the converse question (does being an adjective generally imply a -y ending?). There were 28426173 total instances of adjectives, of which 2134139 ended in -y, including -ly. The result here was quite similar: 2134139/28426173 = 0.075077. Only about 3 out of every 40 adjective instances have the "traditional" suffix.
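The converse ratio can be sketched the same way, this time filtering on the tag first and the ending second. As before, the mapping names and toy values are hypothetical, and only the final lines use the reported totals.

```python
def y_share_of_adjectives(freq_by_word, pos_by_word):
    """Return (adjective instances ending in -y, all adjective instances)."""
    adj_y = adj_total = 0
    for word, count in freq_by_word.items():
        if "adjective" in pos_by_word.get(word, set()):
            adj_total += count
            if word.endswith("y"):  # deliberately includes -ly here
                adj_y += count
    return adj_y, adj_total

# Hypothetical toy data.
toy_freq = {"happy": 50, "early": 40, "good": 400, "new": 500, "city": 300}
toy_pos = {"happy": {"adjective"}, "early": {"adjective", "adverb"},
           "good": {"adjective"}, "new": {"adjective"}, "city": {"noun"}}
toy_result = y_share_of_adjectives(toy_freq, toy_pos)

# Arithmetic on the totals reported above; 3/40 is the rounded form.
reported_converse = 2_134_139 / 28_426_173
print(toy_result, round(reported_converse, 6), 3 / 40)
```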


Frequency results (in percent), using Wolfram Research (WRI) curated data:

               ----------------------------------------
                           Word Ending
               ----------------------------------------
                "y"           "ly"     "y" but not "ly"
Noun           61.58%        17.03%         81.09%
Adverb         24.24%        77.57%          0.88% 
Verb            4.35%         1.06%          5.78% 
Adjective      12.90%         6.46%         15.72%
Interjection    0.40%         0.13%          0.53% 
Determiner      0.12%                        0.17% 
Pronoun         0.06%         0.02%          0.08% 
Preposition     0.02%                        0.03%
Conjunction     0.03%         0.05%

The columns add up to more than 100% because the same word can have several parts of speech and is then counted in several rows.
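To illustrate why the shares can exceed 100%, here is a small Python sketch of a per-tag tally in which every word contributes once for each of its tags. The toy lexicon is hypothetical; the final check simply sums the "y" column of the table above.

```python
from collections import Counter

def pos_shares(pos_by_word, suffix):
    """Share of words with the given suffix that carry each part-of-speech tag."""
    words = [w for w in pos_by_word if w.endswith(suffix)]
    # A word with two tags is counted once in each tag's tally.
    tally = Counter(tag for w in words for tag in pos_by_word[w])
    return {tag: count / len(words) for tag, count in tally.items()}

# Hypothetical toy lexicon.
toy = {
    "fly": {"noun", "verb"},
    "early": {"adjective", "adverb"},
    "quickly": {"adverb"},
    "family": {"noun"},
    "table": {"noun"},
}
shares = pos_shares(toy, "ly")
toy_sum = sum(shares.values())

# Sum of the "y" column from the table above, in percent.
y_column = [61.58, 24.24, 4.35, 12.90, 0.40, 0.12, 0.06, 0.02, 0.03]
y_column_sum = sum(y_column)
print(shares, toy_sum, round(y_column_sum, 2))
```

In the toy data the four -ly words carry six tags between them, so the shares sum to 1.5 rather than 1.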

For reference, I used scripts like the following (only the one for "ly" is shown; Mathematica code):

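(* All words ending in "ly"; look them up once and reuse the list *)
words = Flatten@WordData[___ ~~ "ly", "Lookup"];
n = Length[words];
(* Fraction of those words that carry each part-of-speech tag *)
{#[[1]], N[#[[2]]/n]} & /@ 
  Tally@Flatten[WordData[#, "PartsOfSpeech"] & /@ words] // TableForm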