Speech Dynamics: Acoustic manifestations and perceptual consequences

Louis C.W. Pols & R.J.J.H. van Son

Institute of Phonetic Sciences / Amsterdam Center for Language and Communication

Herengracht 338

1016 CG Amsterdam, The Netherlands

Abstract

Speech is generally considered an efficient means of communication between humans, and will hopefully play that same role in future communication between humans and machines as well. This efficiency in communication is achieved via a balancing act involving at least the following elements:

- the lexical and grammatical structure of the message;

- the way this message is articulated, leading to a dynamic acoustic signal;

- the characteristics of the communication channel between speaker and listener;

- the way this speech signal is perceived and interpreted by the listener.

This paper will concentrate on the dynamic spectro-temporal characteristics of natural speech and on the way such natural speech, or simplified speech-like signals, are perceived. Dynamic speech signal characteristics are studied both in carefully designed test sentences and in large, annotated, searchable speech corpora containing a variety of speech. From actual spectro-temporal measurements we try to model vowel and consonant reduction, coarticulation, effects of word stress and speaking rate on formant contours, contextual durational variability, prominence, etc.

The more speech-like the signal is (on a continuum from a tone sweep to a multi-formant /ba/-like stimulus), the less sensitive listeners appear to be to dynamic speech characteristics such as formant transitions (in terms of just noticeable differences). It also became clear that the (local and wider) context in which speech fragments and speech-like stimuli are presented plays an important role in the performance of the listeners. The listener's actual task (be it same-different paired comparison, ABX discrimination with X being either A or B, or phoneme or word identification) likewise substantially influences his/her performance.

1. Dynamic spectro-temporal characteristics of natural speech

Because of articulatory efficiency it is understandable that natural speech is more a continuum of overlapping events than a concatenation of discrete events (Carré, this volume). Whether this is also beneficial for the listener is one of the topics in the second half of this paper. The dynamicity of speech manifests itself in many different ways, such as:

- the deletion (in pronunciation) of phonemes, syllables, and words, such as the pronunciation of Dutch /Ams@dAm/ rather than /Amst@rdAm/, or /brEm/ for the first part of ‘bread and butter’ (see also Greenberg, this volume);

- the insertion of phonemes to ease pronunciation, such as in Dutch /du-w-@n/ for ‘doe een (do one)’, or /di-j-@n/ for ‘die een (that one)’;

- the almost complete lack of clear word boundary manifestations in conversational speech, compared to the highly functional and visible white space between words in printed text;

- the substantial amount of within-word and between-word coarticulation, assimilation, and degemination, as is for instance clear in the pronunciation of Dutch ‘is zichtbaar (is visible)’ as /IsIxbar/ rather than as /Is/ /zIxtbar/;

- the existence of vowel and consonant reduction, which is most apparent in not clearly articulated (=sloppy) speech and in unstressed syllables.

Apart from efficiency in articulation, there are various other factors that influence dynamicity, such as speaker idiosyncrasy, speaking style (clear vs. sloppy; hyper vs. hypo), and speaking rate. We are most interested in the acoustic manifestations of these dynamic spectro-temporal phenomena, such as pitch, loudness, and formant contours in (preferably segmented and labelled) speech. In order to study (the variation in) these contours, one has to measure, stylize, curve-fit, and/or mathematically model them within such predefined segments. See for instance Pols and van Son (1993), in which formant contours are described either by a fixed set of 16 points per (vowel) segment or by a fit of 5 (Legendre) polynomials up to fourth order. This allows a comparison between normal and fast-rate speech, or between stressed and unstressed syllables. One of the most important conclusions of that study was that rate-induced changes in vowel duration did not change the amount of vowel undershoot, indicating an active control of articulation speed, at least for the trained speaker studied there.
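As an illustration, the Python sketch below implements both descriptions mentioned above: resampling a measured formant track to 16 fixed points, and fitting it with Legendre polynomials of orders 0 through 4 (5 coefficients). It assumes a formant track is already available as arrays of times and frequencies; function names and parameter choices are illustrative, not those of the original study.

```python
import numpy as np
from numpy.polynomial import legendre

def contour_legendre(times, freqs, order=4):
    """Describe a formant contour by Legendre coefficients (orders 0..order).

    times : np.ndarray, sample times within the vowel segment (s)
    freqs : np.ndarray, formant frequency at each sample time (Hz)
    """
    # Map segment time onto [-1, 1], the natural domain of Legendre polynomials.
    t = 2.0 * (times - times.min()) / (times.max() - times.min()) - 1.0
    return legendre.legfit(t, freqs, order)   # order + 1 = 5 coefficients

def contour_points(times, freqs, n_points=16):
    """Alternative description: resample the contour to a fixed number of points."""
    t_new = np.linspace(times.min(), times.max(), n_points)
    return np.interp(t_new, times, freqs)
```

Either representation turns contours of unequal duration into fixed-length vectors, so that, for example, normal-rate and fast-rate tokens of the same vowel can be compared point by point or coefficient by coefficient.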

In Weenink (2001) one can find a very straightforward illustration of the efficiency of using a simplified form of dynamic information (rather than static information) for vowel recognition in the TIMIT database. All 10 training and test sentences of all 438 male speakers were used, resulting in 35,385 (hand-labeled) vowel segments, which were spectrally analyzed using a bandfilter analysis of 18 filters (1 Bark spacing as well as 1 Bark bandwidth). Each bandfilter spectrum was intensity-normalized. Three 25-ms frames per segment were used: one central frame, plus the frames 25 ms to the left and to the right of it. Table 1 shows the results of a discriminant analysis to classify 13 monophthongal vowel categories, using static (1 frame) or dynamic (3 frames) information. Under all 4 conditions (which additionally involve speaker normalization and/or vowel clustering) the dynamic results are always substantially better than the static results. Huang (1991) found similar results.

| condition | # vowel items | static (1 frame) % correct | dynamic (3 frames) % correct |
| --- | --- | --- | --- |
| original | 35,385 (438 × 13 × (1…25)) | 59.3 | 66.9 |
| speaker normalized | 35,385 | 62.2 | 69.2 |
| V centers | 5,374 (438 speakers × 13 vowels) | 78.9 | 90.1 |
| V centers, speaker normalized | 5,374 (438 × 13) | 87.9 | 94.5 |

Table 1. Percentage correct vowel classification of the TIMIT data set using discriminant functions and static or dynamic spectral information. The 4 conditions reflect the use of speaker normalization and/or vowel clustering. For more details, see Weenink (2001).
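The comparison in Table 1 can be sketched in a few lines of Python. The snippet below uses scikit-learn's linear discriminant analysis as a stand-in for the discriminant analysis of Weenink (2001); the input arrays (intensity-normalized 18-channel Bark-filter spectra for the central, left, and right frames of each vowel token) are assumed to be precomputed, and speaker normalization and vowel clustering are left out.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

def static_vs_dynamic(centre, left, right, labels):
    """Compare vowel classification from 1 frame vs. 3 concatenated frames.

    centre, left, right : (n_tokens, 18) Bark-filter spectra per frame
    labels              : (n_tokens,) vowel category of each token
    """
    static = centre                               # 1 frame, 18 features
    dynamic = np.hstack([left, centre, right])    # 3 frames, 54 features
    lda = LinearDiscriminantAnalysis()
    acc_static = cross_val_score(lda, static, labels, cv=5).mean()
    acc_dynamic = cross_val_score(lda, dynamic, labels, cv=5).mean()
    return acc_static, acc_dynamic
```

Concatenating the three frames leaves the classifier itself unchanged; only the feature vector grows, which is what makes the static/dynamic comparison in Table 1 so direct.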

Van Bergem (1995) showed that acoustic vowel reduction is a function of (experimentally controlled) sentence accent, word stress, and word class (in words like ‘can’, ‘candy’, ‘canteen’). He also analyzed, and then modelled, the coarticulatory effects on the schwa in nonsense words of the type C1@C2V and VC1@C2. The schwa appeared NOT to be a centralized vowel but one that is completely assimilated with its phonemic context. Only after averaging the results over all contexts does one return to the more commonly known picture of the schwa as a centralized vowel.

Later on, van Son & Pols (1999a) tried to model consonant reduction as well. For that purpose 791 comparable VCV segments were isolated from 20 minutes of spontaneous speech and from that same text read aloud. Various aspects of vowel and consonant reduction could be identified, which can perhaps be summarized in the following way:

- in spontaneous speech (compared to read speech) there is a decrease in articulation speed, leading to lower F2-slope differences;

- any vowel reduction seems to be mirrored by a comparable change in the consonant, thus suggesting that vowel onsets and targets change in concert;

- in spontaneous speech there is a decrease in vocal and articulatory effort, resulting in shorter vowels and consonants and in a lower center-of-gravity (COG, the first spectral moment; a minimal sketch of this measure follows this list).
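The COG is simply the mean frequency of the spectrum weighted by spectral energy. A minimal sketch, assuming a short segment of samples and using power-spectrum weighting (one common convention, also the default in Praat):

```python
import numpy as np

def centre_of_gravity(segment, sr):
    """First spectral moment (COG) of a speech segment, in Hz."""
    windowed = segment * np.hanning(len(segment))
    power = np.abs(np.fft.rfft(windowed)) ** 2          # power spectrum
    freqs = np.fft.rfftfreq(len(segment), d=1.0 / sr)   # bin frequencies (Hz)
    return float((freqs * power).sum() / power.sum())   # energy-weighted mean
```

A weakly articulated consonant concentrates its energy lower in the spectrum, which this measure registers as a drop in COG.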

The above-mentioned studies of van Son and van Bergem were successful partly because they analyzed material in a carefully designed database. Nowadays it is more and more customary to rely on (annotated) large speech corpora, next to which TIMIT is small by present standards. In our own institute, the IFA corpus is a good medium-size example (van Son & Pols, 2001). It contains about 5.5 hours of speech from 4 male and 4 female speakers in various styles, from informal speech to read texts, and individual sentences, words, and syllables. These roughly 50,000 words are segmented and labeled at the phoneme level. The audio files, the annotations, and the metadata are accessible in a relational database. One can ask questions such as: what percentage of -en endings in Dutch verbs and nouns (as in ‘geven’, to give, and ‘bomen’, trees) is realized as /-@n/ rather than as /-@/? This percentage appears to grow from 0.3 % in informal speech to 77 % in read sentences. The IFA corpus is also used to provide the Dutch input to the INTAS 915 project on “Spontaneous speech of typologically unrelated languages Russian, Finnish and Dutch: Comparison of phonetic properties”. For some preliminary results, see de Silva et al. (2003).
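Because the annotations live in a relational database, such questions reduce to a query. Below is a hypothetical sketch in Python with sqlite3; the table layout (`endings(word, style, realization)`, one row per -en ending, with realization '@n' or '@') is invented for illustration and does not reflect the actual IFA database schema.

```python
import sqlite3

def percent_full_en(db_path, style):
    """Percentage of -en endings realized as /-@n/ rather than /-@/."""
    con = sqlite3.connect(db_path)
    n_full, n_total = con.execute(
        "SELECT SUM(realization = '@n'), COUNT(*) "
        "FROM endings WHERE style = ?",
        (style,),
    ).fetchone()
    con.close()
    return 100.0 * n_full / n_total

# Comparing speaking styles, e.g. (values from the IFA data cited above):
# percent_full_en("ifa.db", "informal")       -> ~0.3
# percent_full_en("ifa.db", "read_sentence")  -> ~77
```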

The 10-million-word Spoken Dutch Corpus (CGN) contains much more material (1,000 hours of speech) from many different speakers in a variety of styles, including telephone speech, but less varied material from any single speaker. After its completion at the end of 2003, it will be fully transcribed at the level of orthography, part of speech, and lemmas, and partly transcribed at the phonemic, prosodic, and syntactic levels (Oostdijk et al., 2002). In the DARPA community one now sees the next level of speech collection, namely so-called ‘found’ speech, such as Broadcast News (Graff, 2002). This material is collected for other purposes, but appears to be useful for training and testing automatic speech recognition, and it creates other scientific challenges and resources.

Above we briefly mentioned some spectro-temporal phenomena related to vowel and consonant reduction. Meanwhile we have extended this work by also incorporating the concept of efficiency in speech production (van Son & Pols, 2003a). Our claim is that spoken utterances are organized such that more speaking effort is directed towards important speech parts than towards redundant parts. In order to test this we have to define a measure for the importance of a segment. Based on a model of incremental phoneme-by-phoneme word recognition (Norris et al., 2000), the importance of a segment is defined as its contribution to word disambiguation. More specifically, the importance of a specific segment for word recognition is defined as the reduction in the number of words in a large corpus (CELEX) that fit the preceding word-onset when the target phoneme is added as a constraint. For example, to determine the importance of the vowel /o/ in the word /bom/ (English ‘tree’), we determine how many of the words that start with /b/ have /o/ as their second phoneme. This reduction in ambiguity is then expressed as the information Is (in bits) contributed by the vowel /o/ of /bom/. To incorporate also the average influence of the local context on the predictability of a target word, we adapt the CELEX word counts using the context distinctiveness (McDonald and Shillcock, 2001) of the target word. For more details see van Son & Pols (2003b).

By dividing the data into quasi-uniform subsets with respect to all relevant factors, we could show for (a subset of) the speech material in the IFA corpus that some 90 % of the total variance in Is could be accounted for by the following factors: phoneme position, phoneme identity, word length in syllables, prominence, lexical syllable stress, length of consonant clusters, syllable part (onset, kernel, or coda), word position in the sentence, and syllable position in the word. The first two factors are the most important and already account for 81 % of the variance. If we perform a similar factorial analysis on acoustic reduction measures such as phoneme duration or Center of Gravity, interestingly enough we find a similar ordering of the factors. These (weak) positive correlations between Is and acoustic reduction are evidence of the language user's tendency toward greater efficiency.
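The definition of Is can be made concrete with a small sketch. The Python fragment below computes Is from a toy lexicon, using frequency-weighted word counts (whether types or frequency-weighted tokens are counted, and the adaptation by context distinctiveness, are simplified away here); the /bom/ example mirrors the one in the text.

```python
import math
from collections import Counter

def segmental_information(lexicon, word, position):
    """Information I_s (bits) contributed by the phoneme at `position`.

    lexicon : Counter mapping phoneme strings to corpus frequencies,
              a toy stand-in for the CELEX counts used in the paper.
    """
    onset = word[:position]            # word-onset before the target phoneme
    extended = word[:position + 1]     # onset plus the target phoneme
    n_onset = sum(c for w, c in lexicon.items() if w.startswith(onset))
    n_extended = sum(c for w, c in lexicon.items() if w.startswith(extended))
    # Adding the target phoneme shrinks the candidate set; the log of that
    # shrinkage is the information the phoneme contributes.
    return math.log2(n_onset / n_extended)

# Importance of /o/ (position 1) in /bom/, given a toy lexicon:
lexicon = Counter({"bom": 10, "bal": 20, "bos": 5, "dak": 7})
print(segmental_information(lexicon, "bom", 1))  # log2(35/15), about 1.22 bits
```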

2. Perception of speech dynamics

In his interesting new book about the intelligent ear, Plomp (2002) emphasizes that several biases have affected the history of (speech and) hearing research. They can be summarized as follows:

- a dominance of the use of sinusoidal tones as stimuli;

- a preference for the microscopic approach (e.g. phoneme discrimination rather than intelligibility);

- an emphasis on psychoacoustic rather than cognitive aspects of hearing;

- the use of clean stimuli in the laboratory, rather than the acoustic reality of the outside world with its disruptive sounds.

Concerning these biases there seems to be a clear parallel between hearing research (or psychoacoustics) and speech research. We know much more about the perception of simple stationary signals than about (spectro-temporally) complex and dynamic signals. Handbooks tell us something about the detection threshold and the just-noticeable-difference for pitch, loudness, timbre, and direction of pure tones and sometimes of single-formant stationary periodic signals, but very little about the perception of speech-like formant transitions (Pols, 1998).

2.1 Perception of speech-like tone and formant sweeps

That is why van Wieringen (1995) started her thesis project on this topic; see also van Wieringen & Pols (1998). Her most intriguing results (see figure 1), in my opinion, indicate that sensitivity (in terms of the difference limen, DL, in endpoint frequency) decreases as the /ba/- and /ab/-type stimuli become more complex and more speech-like! This means that the DL is larger for complex multi-formant stimuli than for simple single-formant or sweep-tone stimuli. Furthermore it appeared that the DL is larger for shorter transitions (20 ms) than for longer transitions (50 ms), and larger for initial than for final transitions. She used synthetic /ab/- and /bu/-like signals, but later also natural, truncated stimuli. This lower sensitivity to more complex, more speech-like stimuli may actually help us survive in the real world with its highly variable speech input. With analytic hearing we would have to interpret a great deal of variability that is most probably not relevant for phoneme categorization.

Figure 1. Difference limens in endpoint frequency for various initial and final tone sweeps, single-formant sweeps, and complex formant sweeps. For more details see van Wieringen & Pols (1998).
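For readers who want a feel for the simplest of these stimuli, the sketch below generates a phase-continuous linear tone sweep with a controllable endpoint frequency, the parameter varied in the DL measurements. The durations and frequencies are merely illustrative and are not the stimulus parameters of van Wieringen & Pols (1998).

```python
import numpy as np

def tone_sweep(f_start, f_end, dur=0.05, sr=16000):
    """Phase-continuous linear tone sweep from f_start to f_end (Hz)."""
    t = np.arange(int(dur * sr)) / sr
    f_inst = f_start + (f_end - f_start) * t / dur  # instantaneous frequency
    phase = 2 * np.pi * np.cumsum(f_inst) / sr      # integrate frequency to phase
    return np.sin(phase)

# A reference transition and a comparison whose endpoint is shifted by a
# candidate DL step, as in a discrimination experiment:
reference = tone_sweep(1000, 1500)
comparison = tone_sweep(1000, 1500 + 30)
```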

2.2 Speech identification as a function of context, speaking style, reduction, and noise

Rather than discrimination, van Son studied vowel identification with and without (synthetic) formant transitions of variable frequency range, both symmetric and with onglide or offglide only (Pols & van Son, 1993). Under these experimental conditions there was no indication of perceptual overshoot caused by these transitions (Lindblom & Studdert-Kennedy, 1967); instead, there was much more evidence for averaging over the trailing part of the formant track. In a subsequent study he took 120 CVC speech fragments from a long text reading and presented truncated segments for vowel and consonant identification (van Son & Pols, 1999b). The truncated segments varied from the 50-ms vowel kernel only to the full CVC segment. As can be seen in figure 2, any added context was beneficial for vowel identification, but the left context was more beneficial than the right context. There was also a context effect (not shown), in the sense that consonant and vowel identification was significantly better when the other member of the CV pair was also correctly identified. Phoneme identification also differs substantially between speech segments taken from spontaneous and from read speech, and likewise between stressed and unstressed syllables.

Figure 2. Error rates of vowel identification for the various stimulus types. Given are the results for all tokens pooled (“All”) as well as for vowels with and without sentence accent separately (+ and – Accent, respectively). Long-short vowel errors were ignored, i.e., /A/-/a:/ and /O/-/o:/ confusions in our experiment. Chance response levels would result in 86 % errors and a log2 perplexity of 2.93 bits. +: p ≤ 0.01, McNemar’s test, two-tailed, between “All” categories. *: χ² ≥ 12, df = 1, p ≤ 0.01, between + and – Accent.

Van Son & Pols (1997) present some interesting results for the so far not very well defined phenomenon of consonant reduction. Spontaneous speech of one trained male Dutch speaker was collected first, after which this speaker also read that same text aloud. Next, 791 comparable pairs of VCV segments (both stressed and unstressed) were extracted from these two speech recordings and presented to subjects for consonant identification. The mean error rate over 22 subjects (see figure 3) clearly shows that more intervocalic consonants were misidentified in segments taken from spontaneous speech than in segments from read speech, while unstressed segments also caused more errors than stressed ones. This reduction in consonant identification is clearly reflected in the amount of acoustic consonant reduction as well. Figure 4 illustrates this acoustic reduction in the form of shorter consonant durations, but similar effects were found for other attributes of acoustic consonant reduction, such as a lower Center of Gravity, smaller intervocalic sound-energy differences, shorter formant distances of the adjacent vowels, and larger differences in F2 slopes. For more details, see van Son & Pols (1997; 1999a).

Figure 3. Mean error rate for consonant identification. For more details see text.

Figure 4. Mean consonant duration for 791 VCV pairs from spontaneous and read speech of one male speaker.

Above we emphasized the effect of the presence of local context; there is likewise ample evidence for the effect of its absence. Take for instance the interesting data of Koopmans-van Beinum (1980). She compared vowel identification for vowel segments taken from three different conditions: isolated vowels, isolated words, and unstressed syllables in free conversation. The average intelligibility of the 12 different Dutch vowels dropped from 89.6 % to 84.3 % to 33.0 %, respectively, in accordance with the reduction in vowel contrast (a measure of the total variance in the two-formant vowel space); see Table 2. One must of course realize that in their original context all vowels were perfectly intelligible!

| condition | measure | M1 | M2 | F1 | F2 | Average |
| --- | --- | --- | --- | --- | --- | --- |
| isolated vowels (3 sets per speaker) | % correct | 95.2 | 88.9 | 88.0 | 86.4 | 89.6 |
| | ASC | 433 | 404 | 447 | 634 | 480 |
| words (5 sets per speaker) | % correct | 88.1 | 78.8 | 84.9 | 85.3 | 84.3 |
| | ASC | 406 | 320 | 374 | 529 | 407 |
| unstr., free conv. (10 sets per speaker) | % correct | 31.2 | 28.7 | 33.3 | 38.9 | 33.0 |
| | ASC | 174 | 119 | 209 | 255 | 189 |

Table 2. Average percentage correct identification by 100 Dutch listeners of isolated vowel segments, from 2 male and 2 female speakers, extracted under 3 different conditions. The Acoustic System Contrast (ASC) is also presented as a measure of the total variance in the log F1 - log F2 space. For more details see Koopmans-van Beinum (1980).
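As a rough illustration of such a contrast measure, the sketch below computes the total variance of vowel tokens in the log F1 - log F2 plane. This is only one plausible reading of the ASC; Koopmans-van Beinum's (1980) exact definition and scaling may differ, so the numbers it produces are not directly comparable to those in Table 2.

```python
import numpy as np

def vowel_space_variance(f1_hz, f2_hz):
    """Total variance of vowel tokens in the log F1 - log F2 plane.

    f1_hz, f2_hz : formant frequencies (Hz) of all vowel tokens
                   measured in one condition.
    """
    logf = np.log(np.column_stack([f1_hz, f2_hz]))
    # Sum of per-dimension variances around the grand mean: large when the
    # vowel space is spread out, small when vowels collapse toward the centre.
    return float(logf.var(axis=0).sum())
```

On this reading, the shrinking ASC from isolated vowels to unstressed conversational vowels in Table 2 simply reflects a vowel space collapsing toward its centre.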