Is D-glottalization a plausible explanation of the ambiguity in Donald Trump's interview with the WSJ?

On Jan 11, The Wall Street Journal published an interview with President Trump that contained the following phrase:

With that being said, President Xi has been extremely generous with what he’s said, I like him a lot. I have a great relationship with him, as you know I have a great relationship with Prime Minister Abe of Japan and I probably have a very good relationship with Kim Jong Un of North Korea.

I have relationships with people, I think you people are surprised.

The White House subsequently denied this statement, claiming that the President rather said (emphasis mine):

I'd probably have a very good relationship with Kim Jong Un of North Korea

WSJ responded by publishing an audio recording that, in their opinion, supports their original transcript. Many commentators, for example Newsweek, have considered this adequate evidence. I, however, beg to differ.

Having listened to the recording carefully, I think Trump's "I" was followed by a glottal stop. Why is this important? T-glottalization is common in American English, and I thought Trump might show the similar but less common D-glottalization, which would neatly resolve the whole controversy.

However, my brief research shows that this sound change is rarely mentioned by phonologists and is attributed mostly to African American speakers, as in this study from the Houston area.

Not having reached a conclusion myself, my questions are:

  1. Am I correct transcribing Trump's words as [ʌɪʔ ˈprɒbəbli]?
  2. Is D-glottalization a thing for a New York white speaker, i.e. would he pronounce I'd as [ʌɪʔ]?

I'm going to show the acoustic signals that are on the tape. It's hard to be certain on this one because the normal cues that you'd look for are just very faint or hard to distinguish. I'd be surprised if a linguist comments decisively for the media about what the tape shows going strictly by the physical audio (they might infer using other techniques).

tl;dr you can't tell unless you're able to squint better than me. I can't see glottalization, but it wouldn't be expected anyways (it's normally a cue for coda [t], not [d]).

Below you'll see a spectrogram, produced with the software Praat. I extracted the left channel and had Praat display a spectrogram using a window length of 5 ms and a dynamic range of 35.0 dB.
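To make those two settings concrete, here is a minimal sketch of what they imply. It assumes Praat's default Gaussian analysis window and the bandwidth formula given in the Praat manual as I understand it; the function names and the example levels are hypothetical, not anything taken from the recording.

```python
import math

# Assumption: Gaussian analysis window, for which the Praat manual gives the
# -3 dB bandwidth as 2*sqrt(6*ln 2) / (pi * window length).
GAUSSIAN_BANDWIDTH_FACTOR = 2 * math.sqrt(6 * math.log(2)) / math.pi


def bandwidth_hz(window_length_s):
    """Approximate analysis bandwidth for a given window length in seconds."""
    return GAUSSIAN_BANDWIDTH_FACTOR / window_length_s


def clip_to_dynamic_range(levels_db, dynamic_range_db):
    """Keep levels within dynamic_range_db of the maximum; others become None.

    None stands for "drawn in white" on the spectrogram.
    """
    floor_db = max(levels_db) - dynamic_range_db
    return [level if level >= floor_db else None for level in levels_db]


# The settings used for the spectrogram: 5 ms window, 35 dB dynamic range.
print(round(bandwidth_hz(0.005)), "Hz bandwidth (wideband spectrogram)")

# Hypothetical energy levels in dB, only to show the clipping behaviour.
example_levels_db = [60.0, 48.0, 30.0, 24.0, 10.0]
print(clip_to_dynamic_range(example_levels_db, 35.0))
```

A short window like this smears frequency but resolves individual glottal pulses in time, which is why it suits looking for irregular pulses. The narrow dynamic range hides anything much weaker than the loudest part of the signal.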

First, what are you seeing? The spectrogram shows acoustic energy by frequency, with darker spots corresponding to more energy. Between 15.696s and 15.877s is the phrase "and I" or "and I'd". Before that you'll see high frequency noise (sounds like a click or sucking sound). After that you see about 90 ms of silence, followed by the release of the p (noise in the range of the first four vocal formants).

There are three types of cues you can look at (maybe more).

  1. Formant transitions
  2. Presence/absence of glottalization
  3. Stop closure duration

First, formant transitions. Focus on the syllable between 15.794s and 15.877s. You'll see two regions of periodic sound (regular pulses of energy) between around 400--800 Hz and 1400--1800 Hz. These are the first two formants (F1 and F2), and they are determined by the shape of the oral cavity (created by jaw position and tongue shape). Comparing [a] and [i], F1 should be higher for [a], and F2 should be higher for [i]. So in the diphthong [ai], you are going to see F1 and F2 start out close, then F1 drops while F2 rises. Consonants also affect the shapes of the formants. In the Trump "big league" affair, it was clear that he was saying "big league" because velar consonants like [k] and [g] cause F2 and F3 to "pinch" together. Anyways, a labial consonant like [p] should cause all formants to drop. An alveolar consonant like [d] should cause F1 to drop, and F2 to become more central (close to where it would be for [a] or schwa). Since right before the disputed consonant (is he saying [dp] or [p]?) is [i], which is front and has a high F2, you would expect to see a drop in both F1 and F2 regardless of whether it was [d] or [p]. So I can't draw anything from formant transitions.
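The expected [ai] movement described above can be sketched numerically. The formant targets below are rough, round illustrative values that I am assuming for [a] and [i]; they are not measurements from this recording, and the straight-line interpolation is only a simplification of how a real diphthong moves.

```python
# Rough, illustrative (F1, F2) targets in Hz. These are assumed values,
# not measurements from the recording.
TARGETS = {"a": (700.0, 1100.0), "i": (300.0, 2300.0)}


def diphthong_track(start, end, steps=5):
    """Linearly interpolate (F1, F2) from one vowel target to another."""
    f1_start, f2_start = TARGETS[start]
    f1_end, f2_end = TARGETS[end]
    track = []
    for k in range(steps):
        t = k / (steps - 1)  # 0.0 at the start vowel, 1.0 at the end vowel
        f1 = f1_start + t * (f1_end - f1_start)
        f2 = f2_start + t * (f2_end - f2_start)
        track.append((f1, f2))
    return track


track = diphthong_track("a", "i")
for f1, f2 in track:
    print(f"F1 = {f1:6.1f} Hz   F2 = {f2:6.1f} Hz   gap = {f2 - f1:6.1f} Hz")
```

The printed gap between F1 and F2 widens at every step, which is the "start out close, then F1 drops while F2 rises" pattern. A following [d] or [p] would both pull F1 and F2 down from the [i] end of this track, which is why the transition does not separate the two readings.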

As for glottalization, that comes before [t] specifically, not [d] (in fact it makes [t] and [d] easier to tell apart). So we shouldn't expect glottalization. The longer pulse periods starting at 15.696s might be glottalization (he had paused). You don't see the pulses getting irregular as the contested consonant closure is formed, but you wouldn't expect them to.

As for closure duration, 90 ms seems a bit long for a stop closure for someone who's speaking fast, so that'd be an argument in favor of [dp], but then again he might have paused for an instant to think what to say next, since he was speaking extemporaneously.
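For reference, the durations implied by the timestamps above work out as follows. This is just arithmetic on the numbers already quoted; the variable and function names are mine, and the 90 ms closure is the approximate figure read off the spectrogram.

```python
def duration_ms(start_s, end_s):
    """Duration between two timestamps in seconds, in milliseconds."""
    return round((end_s - start_s) * 1000, 1)


# Timestamps quoted in the discussion of the spectrogram.
phrase_ms = duration_ms(15.696, 15.877)    # "and I" or "and I'd"
syllable_ms = duration_ms(15.794, 15.877)  # the "I" / "I'd" syllable
closure_ms = 90.0                          # approximate silence before the p release

print(f"phrase:   {phrase_ms} ms")
print(f"syllable: {syllable_ms} ms")
print(f"closure:  {closure_ms} ms")
print("closure longer than syllable:", closure_ms > syllable_ms)
```

On these figures the silent closure is about as long as the whole preceding syllable, which is what makes it look long for fast speech. It still does not rule out a brief hesitation.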

spectrogram of the disputed phrase (Praat)


Overview

I'm going to try a little experiment here. I ask for the forbearance of my colleagues on EL&U. I want to share some data, but I do not have the expertise to interpret the data. So I want to make the data available in a Community Wiki answer so experts can weigh in on the data and perhaps draw some meaningful conclusions. You can vote up or down, but please consider holding off on delete votes for a day or two.

If you have any expertise in audio analysis, phonetics, phonology, or reading spectrograms, please feel free to edit this answer to include your interpretation. To avoid the risk of speculation or bald opining, please support your analysis with as much empirical evidence and authoritative references as you are able.

Media

The audio the WSJ have made available of their interview with President Trump can be downloaded and analyzed in audio analysis tools.

The downloaded video can be accessed here, as well as just the audio track in either .m4a format or converted to .wav format.

Waveforms and Spectrograms

These are screenshots of waveforms (above) and spectrograms (below) of the four times Donald Trump said "I have ... relationship" with various people in his interview with the WSJ. The screenshots were taken in a software application called Sonic Visualizer.

The start and end times are approximate for each utterance of "I have", and the scaling/zooming is likely inconsistent, as it was done manually with no real experience in the software. But they'll give you a broad idea of where to start for your own analysis.

The bolded parts of the text indicate approximately what words are represented in the corresponding screenshot. The sections were calibrated manually.

1. President Xi

Around 6 seconds in, President Trump says:

... President Xi ... I have a great relationship with him ...

waveform and spectrogram of quote 1

Sonic Visualizer session for this quote.

2. Prime Minister Abe

Around 10s in, President Trump says:

... As you know I have a great relationship with Prime Minister Abe ...

waveform and spectrogram of quote 2

Sonic Visualizer session for this quote.

3. Kim Jong Un

Around 16s in, President Trump says (this is the disputed quotation):

... and I['d?] probably have a very good relationship with Kim Jong Un ...

waveform and spectrogram of quote 3

Sonic Visualizer session for this quote.

4. Other people

Around 25s in, President Trump says:

... I have relationships with people ...

waveform and spectrogram of quote 4

Sonic Visualizer session for this quote.

Interpretation, comparison, and contrast

Conclusions


To my ears, 45 says "I'd probably..." with an unaspirated d, hardly surprising before another stop in casual, hurried speech that elides most everything. The word "would" that hangs in the air following this sentence reinforces the conclusion that he said "I'd" rather than "I".


Am I correct transcribing Trump's words as [ʌɪʔ ˈprɒbəbli]?

More like [aɪd̚ prɑbəbli], with an /aɪ/ diphthong, an /ɑ/ sound, and an unreleased d (see below).

Is D-glottalization a thing for a New York white speaker, i.e. would he pronounce I'd as [ʌɪʔ]?

My personal opinion is that linguists are wrong to call this sound a "glottal stop". It's an unreleased D, or more generally "an unreleased X".

what can I do? -> [wət̚ kn ai du]
I had this -> [ai həd̚ ðɪs]
shop -> [ʃɑp̚]

Basically, one starts pronouncing the consonant but stops before the stage where the air is released from the mouth. In the T or D case, one puts the tongue where T and D are made (the alveolar ridge), then goes on to the next consonant, or simply stops there.

It happens in most, if not all, North American accents, for all the stop consonants (/t/, /d/, /k/, /g/, /p/, /b/) before another consonant or at the end of a sentence.