Psychology of diphthongs

First of all, technically this probably should be at the English Language Learners site, because I'm an English learner, but my intuition says I'll probably get more useful answers here because of my question's nature. Tell me if I'm wrong about it.

Now I'll try to briefly describe my experience learning about diphthongs. Recently I talked (via text messages) with a 15-year-old General American speaker about English pronunciation. The main topic was diphthongs: I wanted to know more about their nature and how native speakers understand them. What really surprised me is that he didn't really know which vowels I was talking about. Of all the vowels, he named only two that he considered to consist of two "parts": /ɔɪ/ and /aʊ/. He said that /ɔɪ/ consists of /oʊ/ + /iː/, while /aʊ/ consists of some vowel + some glide (he didn't mention /uː/; also, the starting "some vowel" at first seemed to him to exist in English as a standalone vowel, which surprised me, but then he couldn't name a word with this "vowel"). Then he said that the diphthongs /oʊ/, /aɪ/, and /eɪ/ sound like single vowels to him. After some conversation, he kind of agreed that /aɪ/ has two points. When he listened to the pronunciation of /oʊ/ on dictionary.com, he heard it as a single vowel, although to me (a Russian speaker) it had a glide toward /uː/. Then I asked him to pronounce /oʊ/ followed by /uː/, and he realized that it's still an /oʊ/. So /uː/ sounds like an optional addition to the first sound. Nevertheless, he couldn't name the reason behind the move toward /uː/. It's the same situation with /eɪ/, which sometimes has /iː/ after the first sound.

By the end of our talk I had quite a lot of information about how he pronounces vowels, but almost no information about why he pronounces them this way. For example, why can't he move to /uː/ after the first sound in /aɪ/, or to /iː/ after the first sound in /aʊ/? Why does he move to /iː/ in /eɪ/ but not in /e/? And so on.

So basically, I want to find some in-depth information about the psychology of English pronunciation, or at least some ideas on this topic. I haven't found anything on the internet yet.

Solution 1:

TL;DR

All tense monophthongs in English become non-phonemic, phonetic-only diphthongs with weak off-glides in most speakers and contexts. Minor phonologic effects like this are part of getting an accent right, but they do not change the abstract phoneme, which is still just /e/ or /o/, /i/ or /u/.

Native speakers do not think of those phonologic effects as producing diphthongs because they are not phonemic diphthongs, merely phonetic ones which don’t count. Untrained native speakers always think about their language’s sounds only in simple phonemes, not in complicated phonetics.

That’s the only psychological phenomenon at work here.

Language Interference

Like Spanish, Russian has a five-vowel system of monophthongs, where any diphthongs are always written out explicitly right there in the spelling. What’s confusing you is that English doesn’t work that way: for a variety of reasons, its writing often fails to reflect whether a tense monophthong relaxes into a diphthong. The most important one is that our spelling reflects the phonemics of Middle English, not the phonetics of Modern English. Other reasons are that the effect is automatic, a product of our loose articulation, and that it varies between speakers. Spelling never represents phonetics anyway, just phonemics at best.

Your mother tongue’s phonetics are confusing you because the phonetic rules in Russian work differently than the phonetic rules in English work. Phonemics are easy, but phonetics can be hard. You have to pay attention to abstract phonemics in order to think about a language’s sounds the way a native speaker does. But until you follow the same subtle phonetic rules for how to realize those phonemes, you’ll always sound foreign, and you may be confused or confusing.

In English, as with any language, we native speakers have trained our minds to “hear” only phonemic differences and ignore phonetic ones. You haven’t done that yet. The reason we don’t “hear” the word no in English as having a diphthong is that it is only a phonetic effect, not a phonemic one, and it’s phonemes that make one word different from another.

Phonemically it’s still just /no/, not /now/ or /noʷ/ or /noʊ̯/ or /nəɵ̯/ or whatnot, because those are not phonemes in English. There is no such thing as a minimal pair in English between /no/ and /now/, so those last ones should be in phonetic brackets not phonemic slashes, since diphthongal /ow/ is not a phoneme in English.

While there are many languages where these pairs with and without the trailing glide do contrast phonemically, English is not one such.

Phonemic /o/ in English becomes phonetic [oʷ], [ɐʉ̯], [ɒʊ̯], [əɵ̯], [ɛʊ̯], [œː], [œɤ̯̈], [œʉ̯], [œʊ̯], [oə̯], [oʊ̯], [ɔʊ̯], [ɵʊ̯], or [ʌʊ̯] in this or that region or speaker. But it can also remain a purely monophthongal [o] or [oː] in some speakers and utterances. You have to hear all those differing phonetics as the simple /o/ phoneme, and right now you are not doing that.

That’s like how English has no minimal pair contrasting /e/ and /ej/, which means that diphthongal /ej/ is not an English phoneme. Sure [ej] happens, but it is not a phoneme. It’s true that phonemic /e/ is usually realized as a phonetic diphthong in English, whether that’s as [æɪ̯], [äɪ̯], [ei], [eɪ̯], [ej], [eʲ], [ɛɪ̯], or even bizarrely enough [ʌɪ̯]. But it can also remain a phonetic monophthong [e] or [eː]. No matter the sounds, all those are still just the same phoneme, /e/.
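This many-to-one relationship can be pictured as a lookup table from phonetic realizations to phonemes. What follows is only a rough sketch in Python: the names `REALIZATIONS` and `to_phoneme` are made up for illustration, and the table holds just a handful of the realizations quoted above, not a complete inventory.

```python
# Hypothetical illustration: many phonetic realizations, one phoneme.
# Only a few of the realizations listed in the answer are included here.
REALIZATIONS = {
    "o": ["o", "oː", "oʷ", "oʊ̯", "ɔʊ̯", "əɵ̯", "ʌʊ̯"],
    "e": ["e", "eː", "eʲ", "ej", "ei", "eɪ̯", "ɛɪ̯", "æɪ̯"],
}

# Invert the table so each phonetic form points back to its phoneme.
PHONE_TO_PHONEME = {
    phone: phoneme
    for phoneme, phones in REALIZATIONS.items()
    for phone in phones
}


def to_phoneme(phone):
    """Return the phoneme a native listener "hears" for a phonetic form."""
    return PHONE_TO_PHONEME[phone]


# Every listed way of saying "no" collapses to the same phonemic word.
heard = {"n" + to_phoneme(variant) for variant in REALIZATIONS["o"]}
print(heard)
```

However the vowel is actually pronounced, `heard` ends up holding a single entry, `no`, which is the sense in which the differences "don't count" for a native listener.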

The problem for speakers who come from a language that lacks these implicit, automatic phonological changes that English makes to all its tense vowels is that you guys will hear different phonetics and think you’re hearing different phonemes when you aren’t. Or at least, when we aren’t, which is the problem.

Putting the shoe on the other foot

This difficulty works both ways. It’s also hard for English speakers learning a language where what is only an unconscious phonetic side-effect in English produces a different phoneme altogether if they carry that English-only habit over into the language they’re trying to learn.

That’s where “foreign accents” come from: you’re unconsciously mis-applying the “foreign” language’s phonologic rules for phonetic variants to some other language that isn’t supposed to follow the first one’s phonetic rules. So it sounds off; often recognizable, but off.

People coming from English who are learning Romance languages like Spanish have to be explicitly taught to hold their mouths tighter so that they don’t add glides to all the Romance vowels like we do with English vowels. Even lookalike words like no are pronounced differently in the two languages. So the common Spanish word no is actually pronounced [no] in Spanish, not [noʷ] with a terminal off-glide as in English, because they don’t speak with as relaxed an articulation as we do. They keep their close (“tense”) vowels close throughout their articulation, but we do not. We always change ours a little at the end.

The Latin words lex meaning “law” and rex meaning “king” evolved into diphthongized ley and rey in Spanish; these are basic words you normally hear all the time. Even more confusingly than it does with the simple word no, this unconscious mutation of tense vowels by English speakers first learning Spanish can accidentally turn what should be minimal pairs like re–rey and le–ley in Spanish into confusable soundalikes because of the “foreign accent” intruding.

An untrained English speaker is simply unable to correctly pronounce simple little Spanish words like le and re without those sounding like a Spanish speaker would pronounce ley and rey, which are two entirely different words. Because we automatically turn close/tense vowels into diphthongs, English speakers will naturally (mis-)pronounce something like Spanish le as [leʲ], thereby making it sound closer to the pronunciation of the word ley in Spanish, which is just plain [lej] “just like it’s spelled”. Context would almost always be enough to help the native Spanish speaker differentiate which of the two words the English speaker had intended to say but didn’t quite do so.

However, there are certainly cases where legitimate confusion can arise, like between the minimal pair estés–estéis. Those are close not only in sound but also in meaning. Both words are the present subjunctive for a particular to be verb that’s been conjugated into the second person, but the first is in the second-person singular and the second is in the second-person plural. Singular estés is pronounced /esˈtes/ phonemically and [e̞s̺ˈt̪e̞s̺] phonetically, while plural estéis is /esˈtejs/ phonemically and [e̞s̺ˈt̪e̞js̺] phonetically, so it’s nothing more than that /ej/ diphthong in estéis which marks that one as being in the plural. This is possible because unlike English, Spanish does have an actual /ej/ phonemic diphthong that can participate in minimal pairs like this. Our lack of the same is what makes these two words “confusables” for English speakers.
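The notion of a minimal pair used here can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than a standard tool: the function name `is_minimal_pair` and the `SPANISH` table are hypothetical, stress marks are left out, and each word is written as a tuple of phoneme units so that the Spanish diphthong /ej/ counts as one unit.

```python
# Hypothetical sketch: a word is a tuple of phoneme units, so that a
# phonemic diphthong such as Spanish /ej/ is treated as ONE unit.
def is_minimal_pair(a, b):
    """True if the two words differ in exactly one phoneme unit."""
    if len(a) != len(b):
        return False
    return sum(x != y for x, y in zip(a, b)) == 1


# Spanish words from the discussion above (stress marks omitted).
SPANISH = {
    "le": ("l", "e"),
    "ley": ("l", "ej"),
    "re": ("r", "e"),
    "rey": ("r", "ej"),
    "estés": ("e", "s", "t", "e", "s"),
    "estéis": ("e", "s", "t", "ej", "s"),
}

print(is_minimal_pair(SPANISH["le"], SPANISH["ley"]))        # True
print(is_minimal_pair(SPANISH["estés"], SPANISH["estéis"]))  # True
print(is_minimal_pair(SPANISH["le"], SPANISH["rey"]))        # False
```

An English lexicon written the same way would have no entry containing an `"ej"` unit to compare against `"e"`, which is the point being made about English.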

Singular versus plural is an important distinction to make and to hear made, but here it’s not one that an untrained English speaker learning Spanish can even reproduce: English cannot even distinguish those two words’ differentiating sounds! So because the English phoneme in the word place /ples/ [pʰleʲs] generally comes out with phonetics that fall between the two Spanish phonemes of monophthong /e/ and diphthong /ej/, it’s really hard for an English speaker to hear or reproduce the Spanish ones accurately. It takes a long time. It will likely take you a long time to smooth the Russian phonetics out of your English, too, and for many of the same reasons.

Differences in one language are not differences in others. In English, /d/ and /ð/ are contrasting phonemes, but in Spanish they are exactly the same phoneme, and which one you say is controlled by the phonologic environment, just as whether we say our /e/ as [e] or [eʲ] in English also varies phonetically. That’s because in all Iberian Romance languages, voiced stops like [d] automatically become fricative allophones like [ð] between vowels.

It’s just “what one does” in these languages, something that happens automatically once you get the accent down: you turn intervocalic voiced stops into fricatives (and often even approximants). If you don’t do that, it sounds terrible to a native speaker. You have “a bad accent” or “a foreign accent”. Sure, you’re getting the phonemes right but the phonetics are all wrong. The same thing happens at first when people from certain language groups try to learn English. They’re hearing the wrong things, and not hearing the right ones.
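As a rough illustration of what such an automatic rule looks like, here is a small Python sketch. It is deliberately simplified and the names `lenite`, `VOWELS`, and `LENITED` are invented for this example: it only handles the "between vowels" case described above, it takes a phonemic string rather than Spanish spelling, and the [β] and [ɣ] entries are added here by analogy with [ð] for the other two voiced stops.

```python
VOWELS = set("aeiou")
# Voiced stops and the fricative allophones they become between vowels.
# Only d -> ð is named in the text; b and g are added for illustration.
LENITED = {"b": "β", "d": "ð", "g": "ɣ"}


def lenite(phonemic):
    """Turn intervocalic voiced stops into fricatives (simplified rule)."""
    out = []
    for i, seg in enumerate(phonemic):
        between_vowels = (
            0 < i < len(phonemic) - 1
            and phonemic[i - 1] in VOWELS
            and phonemic[i + 1] in VOWELS
        )
        if seg in LENITED and between_vowels:
            out.append(LENITED[seg])
        else:
            out.append(seg)
    return "".join(out)


print(lenite("dedo"))   # deðo: only the second /d/ is between vowels
print(lenite("donde"))  # donde: neither /d/ is between vowels
```

The phonemic input never changes; only the output phonetics do, which is why a native speaker does not notice the rule at all.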

Phonemic diphthongs versus phonetic ones

In English, we turn all our tense vowels into phonetic diphthongs with slight off-glides. We don’t think of those as diphthongs, though. They’re just automatic phonetic after-effects, not part of our ideal abstract phonemes we use in our heads to tell one word from another.

English has just three phonemic diphthongs: /ɔɪ̯/ as in boy, /aɪ̯/ as in buy, and /aʊ̯/ as in cow. (For the sake of convenience, sometimes those are respectively written /oj/, /aj/, and /aw/, but that’s just an alternate notation for the same three basic phonemes.)

The rest of what you’re talking about is a regular modification of all tense vowels. These phonetic diphthongs caused by off-glides don't really count as phonemic diphthongs. They're the result of how tight/loose or tense/lax that English articulation is (as compared with say, Italian), where a tense vowel in isolation “always” gains a trailing glide, either [w] or [j] (sometimes written [u̯] and [ɪ̯]) depending on which tense vowel we’re talking about. It’s a way of relaxing the tenseness.

That means not just /e/ and /o/ as in they and show, but even /i/ and /u/ in see [siːʲ] and who [huːʷ] get little off-glides phonetically. It can even happen with /ɑ/, as we see in the eye-dialect re-spellings of paw [pɔːʷ] for pa, grampaw for grandpa, and meemaw for mama or grandma. When writers use eye-dialect spellings like that, they’re trying to tell you that those words end like the word law [lɔː] not like the word la [lɑ] in those speakers’ versions (and that those are different sounds as far as the writer is concerned).

Nobody thinks about the phonetic side-effects of how our phonology works. It’s part of getting the accent “right” when you’re learning the language. You won’t find it in our books, nor even in our dictionaries for the most part, since native speakers don’t need to be taught to do these things.

Although some speakers do have more of a pure monophthong for [e] and [o] when there’s a consonant following that tight vowel, especially when that consonant is [ɹ], everybody adds a glide at the end to relax it if the tense vowel occurs at the end of the word.

Tense /e/ and /o/ in General American

These next sets below all have a tense /e/ for their stressed vowel in General American, the vowel that children (but not linguists) are taught to call “a long a”. A native speaker will hear the first word of each series below also occurring “in full” at the start of each subsequent word in that row. Notice how the off-glide after the tense vowel, variously written [ej] or [eɪ̯] or [eʲ], is not included in how we think of these phonemically.

This is why your fifteen-year-old didn’t think of [oʷ] and [eʲ] as diphthongs, just as the basic phonemes /o/ and /e/. We don’t think about phonetics, just about phonemics, and there is no diphthong in the phoneme: phonemically they’re monophthongs even though in practice they usually are phonetic diphthongs.

Notice how simple the phonemic notation below is; this is how native speakers think of these words.

bay, bake, bear, berry /be, bek, ber, ˈberi/
day, date, dare, dairy /de, det, der, ˈderi/
fey, fake, fair, fairy /fe, fek, fer, ˈferi/
gay, gate, Gary /ge, get, ˈgeri/
hey, hate, hair, hairy /he, het, her, ˈheri/
Kay, Kate, Kerry /ke, ket, ˈkeri/
may, mate, mare, Mary /me, met, mer, ˈmeri/
pay, pate, pear, parry /pe, pet, per, ˈperi/
pray, prate, prayer, prairie /pre, pret, prer, ˈpreri/
(NB: that’s monosyllabic prayer /prer/ as in saying one’s prayers; when prayer means one who prays, it has two syllables /ˈpre.jər/)
ray, rate, rare /re, ret, rer/
they, they’d, their /ðe, ðed, ðer/
way, wait, wear, wary /we, wet, wer, ˈweri/

These next sets all have a tense /o/ for their stressed vowel in General American, the vowel that children (but not linguists!) are taught to call “a long o”. Notice how the off-glide after the tense vowel, so [ow] or [oʊ̯] or [oʷ], is not included in how we think of these phonemically.

bow, boat, boar /bo, bot, bor/
dough, dote, Dory /do, dot, ˈdori/
foe, phone, for /fo, fon, for/
go, goat, gore, gory /go, got, gor, ˈgori/
glow, gloat, glory /glo, glot, ˈglori/
hoe, hope, whore, hoary /ho, hop, h⁽ʷ⁾or, ˈhori/
low, lope, lore, lorry /lo, lop, lor, ˈlori/
mow, moat, more, Morey /mo, mot, mor, ˈmori/
no, nope, nor, nori /no, nop, nor, ˈnori/
Poe, poke, pore /po, pok, por/
row, wrote, Rory /ro, rot, ˈrori/
so, soap, sore, sorry /so, sop, sor, ˈsori/
stow, stoke, store, story /sto, stok, stor, ˈstori/
toe, taupe, tore, Tory /to, top, tor, ˈtori/
woe, woke, wore /wo, wok, wor/

Like all tense vowels in English, both /e/ and /o/ always have a slight off-glide at the end of an open syllable, meaning the versions in the first word in each row that don’t have a consonant at their end. That tiny little off-glide is considerably less noticeable when the syllable doesn’t immediately end there like in the rest of the words on the same row, and in many speakers it can even become a pure monophthong in those positions. But never in the first.
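The word-final part of this pattern can be written down as a tiny rule. The Python below is a sketch only, with the hypothetical names `OFF_GLIDE` and `realize`: it is meant for the monosyllables in the rows above, it assumes [ʲ] after /e, i/ and [ʷ] after /o, u/ as in the earlier examples, and it simply leaves a vowel unchanged before a consonant even though many speakers keep a weaker glide there.

```python
# Off-glide each tense vowel takes, following the examples in the answer:
# /e/ and /i/ glide toward [j]; /o/ and /u/ glide toward [w].
OFF_GLIDE = {"e": "ʲ", "i": "ʲ", "o": "ʷ", "u": "ʷ"}


def realize(phonemic):
    """Add the off-glide when a word ends in a tense vowel.

    Simplification: a vowel followed by a consonant is left unchanged,
    although in practice many speakers keep a weaker glide there.
    """
    if phonemic and phonemic[-1] in OFF_GLIDE:
        return phonemic + OFF_GLIDE[phonemic[-1]]
    return phonemic


for word in ["be", "bek", "no", "nop"]:
    print(f"/{word}/ -> [{realize(word)}]")
```

The input is what the native speaker has in mind, and the glide appears only in the output, so bay and bake share the same /be/ as far as the speaker is concerned.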

This phonetic monophthongization happens in the Inland North dialect as spoken in the Upper Midwest of the Great Lakes region, especially in areas affected by the Northern Cities Vowel Shift (which is a new kind of chain-shift like the infamous Great Vowel Shift of old). But it also occurs in many Californian speakers as well as in speakers of some dialects at further remove from North America, such as in Scotland.

Additionally, the normally phonemic tense–lax distinction between /e/ and /ɛ/, and sometimes also between /o/ and /ɔ/, can be neutralized in phonologic environments where those phonemes are followed by a nasal /n, m, ŋ/ or by a rhotic /r/. Once you have neutralization, then even when these do still vary in this or that speaker, they become only phonetic allophones of the same merged phoneme in the mind of the listener.

Summary

Native speakers have trained themselves to notice only those differences in sounds that change which word is actually being said. They do not notice differences that make no difference, including variances under neutralization and those resulting from different amounts of off-glide rounding via [ʷ] or off-glide palatalization via [ʲ]. Those are minor effects that never change the word.

Solution 2:

From the point of view of most native English speakers, diphthongs such as /eɪ/, /aɪ/, etc. are not noticeably different from other, monophthongal vowels.

SIL defines a diphthong as follows: "A diphthong is a phonetic sequence, consisting of a vowel and a glide, that is interpreted as a single vowel."

The distinct individual sounds of a language are called phonemes. Each diphthong is regarded as a phoneme, not as a combination of two phonemes.

You can see the same thing in traditional (non-IPA) transcription systems; for example, M-W transcribes "lake" as \ˈlāk\.

When native English speakers learn to spell, they are taught that /eɪ/ is a "long A", not a combination of sounds.

Historically, most of the diphthongal phonemes go back to a set of changes initiated at the time of the Great Vowel Shift. In the 14th century, the "long A" actually was /aː/. Over the following centuries it shifted to /ɛː/ and /eː/ and then eventually /eɪ/. (Of course, its quality can still vary depending on dialect and sociolect.)