How many distinct words are there in the English language?
TL;DR
It depends.
There's no good exact answer for the number of words in any language for several reasons.
- You may or may not count different meanings of the same spelling as different words. Supposing you do (for implementation purposes I would), there's still the question of how different a meaning has to be before it counts (a recurring metaphorical usage, for example). Look at all the dictionary entries for 'set'.
- The concept 'word' has lots of edge cases: 'hmmm', 'kachunk', 'mooshy', and lots of entries in Urban Dictionary that will just never appear in Merriam-Webster.
- New words are being added (and old ones forgotten) all the time, e.g. 'dove' for 'dived'.
- Different languages have different ways of legitimately creating words (affixes, for example). 'paraneologistically' is legitimate, but this is its first appearance ever.
- For a given language, dictionaries vary widely in what they consider to be distinct words.
- You might consider a different spelling to be a different word, but I hesitate to even mention this because while computer input is by spelling, spelling is just a convention. Really, alternate spellings are not different words.
- You point out a good distinction: 'water' and 'waters', or 'eat' and 'ate', are mostly the same word. The first pair is handled by stemming and the second by lemmatization.
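To make the last point concrete, here is a minimal sketch of how the "number of distinct words" in the same text shifts depending on which definition you pick. Everything in it is a toy assumption of mine: `naive_stem` only strips a plural 's', and `IRREGULAR_LEMMAS` is a tiny hand-written table standing in for a real lemmatizer. A real project would use a proper stemmer and lemmatizer instead.

```python
import re

# Hypothetical toy table standing in for a real lemmatizer's irregular forms.
IRREGULAR_LEMMAS = {"ate": "eat", "eaten": "eat"}


def tokenize(text):
    """Lowercase the text and pull out runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def naive_stem(word):
    """Strip a plural 's' only; real stemmers handle far more suffixes."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def naive_lemma(word):
    """Look up irregular forms first, then fall back to the naive stem."""
    return IRREGULAR_LEMMAS.get(word, naive_stem(word))


def count_distinct(text):
    """Count distinct words under three different definitions of 'word'."""
    tokens = tokenize(text)
    return {
        "surface forms": len(set(tokens)),
        "after stemming": len({naive_stem(t) for t in tokens}),
        "after lemmatizing": len({naive_lemma(t) for t in tokens}),
    }


sample = "Water the plants. The waters rose. We eat now; they ate earlier."
print(count_distinct(sample))
# {'surface forms': 11, 'after stemming': 10, 'after lemmatizing': 9}
```

Same sentence, three different answers. Stemming folds 'waters' into 'water', and only the lookup table catches 'ate' as a form of 'eat'.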
Of all the reasons above, though, none accounts for a noticeable proportion of distinct words except multiple meanings. Pretty much every word has more than one distinct meaning. You feel like 'dog' is a 'dog' and that's all there is to it. But really, when Eminem refers to his homey as 'dog', it's a term of endearment that has little to do with canines.
Knowing the exact number of distinct words has little use. Knowing it roughly can help with resource allocation and give you a general sense of the processing involved.
From Wikipedia, here's a count of entries: M-W 470,000, AHD 350,000, WordNet 207,000, OED 171,000. This is like asking an app how many users it has; it could be # registered, # active, # non-duplicates, # pings from any IP address, etc.
For fun, there's a constructed language called Toki Pona which was engineered to have 125 words, or rather root words, from which all other lexical things could be built. But that is a very limited definition of 'word'. And semantically you'll probably want many more entries in your database for the thousands of distinct concepts made out of those 125.
Also for fun, there's a book by Randall Munroe, Thing Explainer, which is an experiment in making an illustrated scientific dictionary using only the 1000 most frequent words in English. But this also needs many more entries in its database for all the concepts that use two or more root words.
Historically, linguists have studied the distinct word roots of Proto-Indo-European and of Semitic. These, separately, each number in the hundreds. But that doesn't mean that 4000 years ago the vocabulary was that small, just that this many distinct roots could be identified.
So in the end, the number of words is very rough, probably in the tens of thousands, but way more than 125.
Hey... you're still here. Maybe you're thinking of sound systems? Hawaiian only needs 13 letters. Morse code really only has 2 symbols, from which the other letters are built up. And if we're going there, you can do it with one letter, but you'll do a lot of counting.
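If you're curious what "a lot of counting" looks like, here's a minimal sketch of a one-letter writing system. The scheme is my own assumption for illustration: read a lowercase word as a base-27 number (a=1 ... z=26, with 0 left unused so no letter gets lost), then write that many copies of a single mark. The names `word_to_number`, `to_unary`, and friends are made up for this example.

```python
import string

ALPHABET = string.ascii_lowercase  # assumes plain a-z input only


def word_to_number(word):
    """Read the word as a base-27 numeral with digits a=1 ... z=26."""
    n = 0
    for ch in word:
        n = n * 27 + (ALPHABET.index(ch) + 1)
    return n


def number_to_word(n):
    """Invert word_to_number by peeling off base-27 digits from the right."""
    letters = []
    while n > 0:
        n, digit = divmod(n, 27)
        letters.append(ALPHABET[digit - 1])
    return "".join(reversed(letters))


def to_unary(word, mark="a"):
    """Write the word using only one symbol, repeated."""
    return mark * word_to_number(word)


def from_unary(marks):
    """Recover the word by counting the marks."""
    return number_to_word(len(marks))


print(len(to_unary("hi")))       # 225
print(len(to_unary("dog")))      # 3328
print(from_unary(to_unary("dog")))  # dog
```

Two letters already cost 225 marks and three cost thousands, so the alphabet is doing real work for you.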