Perception of Pitch and Timbre and the Classification of Singing Voices

Molly Erickson

Department of Audiology and Speech Pathology, University of Tennessee, Knoxville

Introduction

Since vocal pedagogues cannot directly view the vocal mechanism, they rely on perceptual cues to help them determine an individual’s voice classification. Traditionally, voice classification has been based on three perceptual parameters: range, timbre, and tessitura (Vennard, 1967); however, these parameters are poorly defined and the interrelations between them are unknown.

To date, no research has examined the interrelationship of pitch, tessitura, and timbre as predictors of voice classification. Most studies have focused on the acoustic correlates of a single parameter, timbre.

The accepted definition of timbre is as follows: two tones are of different timbre if they are judged to be dissimilar and yet have the same loudness and pitch (ANSI, 1973). To define timbre for the vocal instrument, an additional restriction is required. Not only must the two sounds be of the same pitch and loudness, they must also be of the same vowel quality. Using such a definition, a singer would have an individual timbre for each pitch-vowel combination. Yet this is not how vocal timbre has been treated traditionally. Cleveland (1977) states that an individual singer has a characteristic timbre that is a function of the laryngeal source and vocal tract resonances. Singers with similar timbres, then, constitute members of the same voice timbre type or voice category. It is possible, however, that any two voices may be perceived as having similar timbre on one pitch-vowel combination and dissimilar timbre on another. In this case, each voice possesses a set of timbres. It may not be possible to devise one simple acoustic measure that can accurately classify voice timbre types.

Research has shown a correlation between timbre type classification and average formant frequency, with basses having lower formant frequencies than tenors (Cleveland, 1977) and sopranos having the highest formant frequencies (Dmitriev & Kiselev, 1979). Since vocal tract length is directly related to formant frequency, it is believed that this physical attribute contributes to voice quality. Yet when the data provided by Dmitriev and Kiselev are examined closely, there is some evidence that vocal tract length may not be related to voice classification in females. They observed distinctly different and increasingly shorter vocal tract lengths for the voice categories of bass, baritone, and tenor, respectively. However, vocal tract lengths for the voice categories of mezzo-soprano and soprano were nearly identical. Likewise, the center frequency of the fourth formant decreased and showed little overlap for basses, baritones, and tenors, respectively. Yet a great deal of overlap in fourth formant frequency was observed for mezzo-sopranos and sopranos. These data suggest that the acoustic correlate of vocal tract length, formant frequency, may be a perceptual cue used to assess voice classification in males, but does not appear sufficient to differentiate the traditional voice categories in females.
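The link between vocal tract length and formant frequency can be illustrated with the textbook uniform-tube approximation, in which a tube closed at one end (the glottis) and open at the other (the lips) resonates at odd multiples of c/4L. The sketch below is an idealization under assumed values: the speed of sound (35,000 cm/s) and the three tube lengths are illustrative choices, not measurements from Cleveland (1977) or Dmitriev and Kiselev (1979), and real vocal tracts are not uniform tubes.

```python
# Assumed speed of sound in warm, humid air, in cm/s (illustrative value).
SPEED_OF_SOUND_CM_PER_S = 35000.0


def tube_formants(length_cm, n_formants=4, c=SPEED_OF_SOUND_CM_PER_S):
    """Resonances (Hz) of a uniform tube closed at one end and open at the other.

    Uses the quarter-wavelength formula F_n = (2n - 1) * c / (4 * L).
    """
    return [(2 * n - 1) * c / (4.0 * length_cm) for n in range(1, n_formants + 1)]


# Hypothetical tract lengths chosen only to show the trend.
for length in (14.0, 17.5, 20.0):
    formants = ", ".join(f"{f:.0f}" for f in tube_formants(length))
    print(f"L = {length:4.1f} cm -> F1-F4 (Hz): {formants}")
```

In this idealized model every formant scales inversely with tube length, which is why formant frequency is treated as an acoustic correlate of vocal tract length.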

Acoustic and temporal cues have been shown to be important in the perception of speech and musical instrument timbre. Every vibrating body has natural resonating frequencies. The natural resonating frequencies of the vocal mechanism are known as formants and are known to influence timbre perception. Additionally, four temporal parameters may be of importance in the perception of vocal timbre: onset (e.g., Darwin, 1981; Grey & Gordon, 1978), vibrato rate and vibrato extent (McAdams & Rodet, 1988), and spectral variation. In cases where steady-state spectral information is not sufficient for timbre perception, such as instances where fundamental frequency is high and harmonics are therefore widely spaced, these cues may provide additional acoustic information.
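The point about widely spaced harmonics can be made concrete by counting the harmonics that fall below a fixed frequency limit at the lowest and highest pitches used in this study (A3 and A5). The sketch rests on two assumptions that do not come from the paper: equal temperament with A4 = 440 Hz, and 4 kHz as a rough upper bound for the region spanned by the first four formants.

```python
# Semitone offsets of the natural notes within an octave (C = 0).
NOTE_OFFSETS = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}


def note_to_hz(name, a4=440.0):
    """Convert a natural note name such as 'A3' to Hz (equal temperament, assumed)."""
    letter, octave = name[0], int(name[1:])
    midi = 12 * (octave + 1) + NOTE_OFFSETS[letter]
    return a4 * 2 ** ((midi - 69) / 12)


def harmonics_below(f0, limit_hz=4000.0):
    """Number of harmonics (including the fundamental) at or below limit_hz."""
    return int(limit_hz // f0)


for pitch in ("A3", "C4", "G4", "B4", "F5", "A5"):
    f0 = note_to_hz(pitch)
    print(f"{pitch}: f0 = {f0:6.1f} Hz, harmonics below 4 kHz = {harmonics_below(f0)}")
```

Under these assumptions, A3 (220 Hz) places 18 harmonics below 4 kHz while A5 (880 Hz) places only 4, so the spectral envelope is sampled far more coarsely at high pitches.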

This paper investigates the perceptual validity of singing voice classification systems as they relate to two parameters, pitch and timbre. Previous studies (e.g., Cleveland, 1977) have examined the perception of voice classification using forced-choice paradigms based on the traditional voice classification system of bass, baritone, tenor, alto, mezzo-soprano, and soprano. Such perceptual experiments provide information as to how listeners place stimuli when provided with arbitrary classification categories. They do not provide information on the perceptual validity of the categories. The question that begs answering is this: if provided with no classification system a priori, how do listeners tend to group vocal stimuli? Do they group them in a manner that supports current classification systems? When grouping vocal stimuli, is the perception of timbre truly independent of pitch?

This study employed two research paradigms in order to examine the perceptual dimensions of classical voice classification. First, multidimensional scaling procedures were used to discover the dimensions underlying vocal timbre. Second, an “oddball” paradigm was used to assess whether timbre, independent of pitch, can be used as a perceptual cue to group vocal stimuli into traditional voice categories.

Method

Stimuli

Master’s-level singers from the Department of Music at the University of Tennessee, Knoxville provided stimuli for the experiment. These subjects met the following criteria:

1. Bilateral hearing within normal limits as determined by a 20 dB hearing screening;

2. Voice study at the Master’s degree level or higher;

3. No voice problems at the time of taping as determined by a certified speech-language pathologist.

Two singers from each voice classification, mezzo-soprano and soprano, were recorded singing the vowel // on six different pitches, A3, C4, G4, B4, F5, and A5, at a constant loudness level. The resulting 24 stimuli were used in two perceptual experiments.

Recordings were made in a single-walled sound booth (Acoustic Systems RE-144-S). Subjects were recorded while producing a sustained // for 5 seconds using a digital audio tape recorder (Sony PCMR500) and a Sennheiser MD 441-U microphone. Subjects stood in the center of the booth. Lip-to-microphone distance was 12 inches. A keyboard was used to present pitches. Prior to taping, subjects were allowed to vocalize freely and become comfortable with the recording environment.

Listeners

All listeners in this study were experienced vocal professionals. Listeners were recruited from the Knoxville Choral Society and the Knoxville Opera Company. Twelve listeners were recruited that met the following criteria:

1. Bilateral hearing within normal limits as determined by a 20 dB hearing screening;

2. Bachelor’s degree or higher in a vocal arts related discipline (e.g., pedagogy, performance, or choral conducting) or 5 years’ experience in a vocal arts discipline.

Procedure

Two listening experiments were conducted. Both experiments took place in a single-walled sound booth (Acoustic Systems RE-144-S). Stimuli were presented binaurally through Sennheiser headphones. Listeners entered responses using a computer monitor and a mouse.

Experiment 1

Experiment 1 utilized a dissimilarity paradigm. From the 24 vocal stimuli, all possible combinations of two stimuli (A and B) were constructed, resulting in 276 paired trials. Within each trial, stimuli were randomly assigned as A or B. Trials were presented in randomized order in an ABBA format to 12 experienced listeners.
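The trial construction for the dissimilarity task can be sketched as follows. This is a minimal illustration, not the software used in the study: the singer labels are hypothetical, the fixed random seed is arbitrary, and "ABBA format" is read here as presenting the pair in the order A, B, B, A, which is an assumption.

```python
import itertools
import random

SINGERS = ["mezzo1", "mezzo2", "soprano1", "soprano2"]  # hypothetical labels
PITCHES = ["A3", "C4", "G4", "B4", "F5", "A5"]


def build_pair_trials(seed=0):
    """Build every unordered pair of the 24 stimuli, randomizing A/B roles and order."""
    rng = random.Random(seed)
    stimuli = [(singer, pitch) for singer in SINGERS for pitch in PITCHES]
    trials = []
    for first, second in itertools.combinations(stimuli, 2):
        pair = [first, second]
        rng.shuffle(pair)  # random assignment of the two stimuli to A and B
        a, b = pair
        # Assumed reading of "ABBA format": play A, then B, then B, then A.
        trials.append({"A": a, "B": b, "sequence": [a, b, b, a]})
    rng.shuffle(trials)  # randomized presentation order
    return trials


trials = build_pair_trials()
print(len(trials))  # 24 choose 2 = 276
```

The count follows directly from 24 × 23 / 2 = 276, matching the number of paired trials reported.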

Subjects were asked to rate each pair on a dissimilarity scale (0-10). This was accomplished through use of a horizontal scroll bar presented via a computer monitor. The left side of the scroll bar indicated 0 while the right side of the scroll bar indicated 10. Subjects were instructed that they should rate the two stimuli as 0 if they were identical and 10 if they were very different. Subjects were told that they could use any value between 0 and 10 to indicate the degree of dissimilarity. Subjects were told not to use pitch as a factor in their ratings. All subjects were presented with test stimuli so that they could become familiar with the computer interface and the task.

Experiment 2

For Experiment 2, trials of three stimuli were constructed, using an oddball paradigm. In this version of the oddball paradigm, two of the three stimuli in each trial were produced by the same singer at two different pitches (X1 and X2), while the third stimulus was produced by a different singer (Y). For each singer, three same-singer conditions were constructed: one pairing the pitches G4 and B4 (XG4 and XB4), a second pairing the pitches C4 and F5 (XC4 and XF5), and a third pairing the pitches A3 and A5 (XA3 and XA5). For each singer and each condition, the “odd” stimulus (Y) was varied across the three remaining singers and across the pitches A3, C4, G4, B4, F5, and A5. This design created 216 trials based on 4 singers times 3 conditions times 3 singer-pairs times 6 pitches. For each trial, stimulus order was randomized. The resulting 216 trials were presented in random order to 12 experienced listeners.
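The factorial structure of the oddball design can be checked with a short enumeration. The sketch assumes hypothetical singer labels and takes the six Y pitches to be the six recorded pitches (A3, C4, G4, B4, F5, A5); it illustrates the design and is not the presentation software used in the study.

```python
import itertools
import random

SINGERS = ["mezzo1", "mezzo2", "soprano1", "soprano2"]  # hypothetical labels
PITCHES = ["A3", "C4", "G4", "B4", "F5", "A5"]  # assumed set of Y pitches
CONDITIONS = [("G4", "B4"), ("C4", "F5"), ("A3", "A5")]  # X1/X2 pitch pairs


def build_oddball_trials(seed=0):
    """Enumerate X1, X2, Y triads: 4 singers x 3 conditions x 3 other singers x 6 pitches."""
    rng = random.Random(seed)
    trials = []
    for target, (p1, p2) in itertools.product(SINGERS, CONDITIONS):
        for other in SINGERS:
            if other == target:
                continue  # Y must come from a different singer
            for y_pitch in PITCHES:
                items = [(target, p1), (target, p2), (other, y_pitch)]
                rng.shuffle(items)  # randomize order within the trial
                trials.append({"stimuli": items, "odd": (other, y_pitch)})
    rng.shuffle(trials)  # randomize trial order
    return trials


print(len(build_oddball_trials()))  # 4 * 3 * 3 * 6 = 216
```

Because each trial offers three alternatives, a listener guessing at random would be correct on one trial in three.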

Prior to stimulus presentation, listeners were told that two of the three stimuli in each trial were produced by the same person and that they were to choose the stimulus produced by the different person. Listeners were allowed to replay each trial as many times as they needed. Listener judgments were recorded via a computer interface.

Results

Experiment 1

Distance measures obtained from Experiment 1 were subjected to multidimensional scaling (MDS) analysis. The optimal MDS solution was found in 3 dimensions. Fit measures for the solution were as follows: Stress = .18 and R² = .80.
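For orientation, one widely used badness-of-fit index in MDS is Kruskal's Stress formula 1, shown below. The paper does not state which stress variant or MDS program was used, so this expression is offered only as an assumption about what the reported value is likely to represent. Here the d_ij are the distances between stimuli i and j in the fitted configuration and the d̂_ij are the disparities (the optimally transformed dissimilarity ratings).

```latex
\text{Stress} = \sqrt{\frac{\sum_{i<j}\left(d_{ij}-\hat{d}_{ij}\right)^{2}}{\sum_{i<j} d_{ij}^{2}}}
```

Under this definition, smaller values indicate a closer match between the fitted distances and the rated dissimilarities, with 0 denoting a perfect fit.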

Only the first two dimensions will be discussed in this paper. Dimension 1 was highly correlated with pitch (R = .83). Dimension 2 appears related to voice category and was moderately correlated with F2 through F4 frequency (R = .73). Dimensions 1 and 2 for all four singers are presented in Figure 1. Figure 2 displays mean values for dimensions 1 and 2 calculated for each voice category, mezzo-soprano and soprano.

In general, experienced listeners rated same-pitch stimulus pairs from the same voice category as more similar than they did same-pitch stimulus pairs from different voice categories. However, at the pitches F5 and A5, listeners rated all stimulus pairs as very similar, regardless of voice category.

Pitch was a primary factor in all judgments. Within each singer, stimulus pairs with larger pitch differences generally were perceived as being more dissimilar than those with smaller pitch differences.

Experiment 2

For each X1X2 pair (XG4XB4, XC4XF5, and XA3XA5), the percent correct identification of the oddball stimulus (Y) was calculated as a function of pitch for two comparisons: Y in the same voice category as X1X2 and Y in a different voice category than X1X2.

Results for each X1X2 pair are presented in Figure 3. The plot labeled “Y in Same Voice Category as X1X2” provides a graphic representation of the ability to discriminate differences within the same voice category. Conversely, the plot labeled “Y in Different Voice Category than X1X2” provides a graphic representation of the ability to discriminate differences between voice categories.
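The scoring described above can be sketched as a simple tally. The record layout used here (keys `same_category`, `y_pitch`, and `correct`) is hypothetical and chosen for illustration only; the chance level of one in three follows from each trial containing three stimuli.

```python
from collections import defaultdict

# With three stimuli per trial, random guessing yields 1/3 correct.
CHANCE_PERCENT = 100.0 / 3.0


def percent_correct(responses):
    """Percent correct per (same_category, y_pitch) cell.

    `responses` is an iterable of dicts with the hypothetical keys
    'same_category' (bool), 'y_pitch' (str), and 'correct' (bool).
    """
    counts = defaultdict(lambda: [0, 0])  # cell -> [hits, total]
    for r in responses:
        cell = (r["same_category"], r["y_pitch"])
        counts[cell][0] += int(r["correct"])
        counts[cell][1] += 1
    return {cell: 100.0 * hits / total for cell, (hits, total) in counts.items()}


def above_chance(percent):
    """True if a percent-correct score exceeds the 33.3% guessing rate."""
    return percent > CHANCE_PERCENT
```

Comparing each cell against `CHANCE_PERCENT` is what statements such as "at or below chance levels" refer to in the condition-by-condition results.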

XG4XB4 Condition

In the condition XG4XB4, experienced listeners were able to select the “oddball” stimulus with a high degree of accuracy. Y stimuli from a different voice category than X1X2 were accurately identified regardless of pitch. However, when Y stimuli were from the same voice category as X1X2, accuracy of Y stimulus identification decreased as the distance between Y stimulus pitch and XB4 decreased, dropping from 100% accuracy to approximately 70% accuracy.

XC4XF5 Condition

For the condition XC4XF5, experienced listeners were less able to select the “oddball” stimulus than they were in the XG4XB4 condition. Y stimuli from a different voice category than X1X2 were accurately identified more often than those from the same voice category as X1X2. Accuracy levels for this comparison ranged between 40% and 55%. Y stimuli from the same voice category as X1X2 generally were identified at or below chance levels. Unlike in the XG4XB4 condition, correct identification of the Y stimulus did not decrease as the distance between Y stimulus pitch and either X1 or X2 decreased. In fact, for Y stimuli in the same voice category as X1X2, peak percent correct scores were achieved when Y pitch equaled either XC4 or XF5.

Figure 1. MDS dimensions for Experiment 1.

Figure 2. Mean MDS dimensions for sopranos and mezzo-sopranos.

Figure 3. Percent correct identification of Y stimulus as a function of pitch for all three X1X2 conditions.

XA3XA5 Condition

Listeners identified the “oddball” stimulus least accurately in the XA3XA5 condition. While Y was more accurately identified when in a different voice category than X1X2 than when in the same category as X1X2, accuracy levels generally were at chance or less. For both comparisons, the greatest accuracy was achieved when Y pitch equaled XA3, when Y pitch equaled XA5, or when Y pitch was nearly midway between XA3 and XA5. In all other conditions, accuracy levels were far less than chance.

Discussion

Results of Experiment 1 suggest that the following three factors affect the perception of dissimilarity: voice category, pitch difference, and high pitch. It was shown that generally, listeners found voices in different categories to be less similar than those in the same category. Listeners also found that stimulus pairs were more similar when they were closer in pitch. Finally, listeners in general found all stimuli to be highly similar at high pitches. Given these findings, several predictions concerning Experiment 2 can be made:

1. Listeners should be more able to accurately identify Y stimuli when they are in a different voice category than X1X2 than when Y is in the same category as X1X2;
2. Listeners’ accuracy in Y stimulus identification should increase as the distance between Y stimulus pitch and both X1 and X2 pitch increases;
3. Listeners should be less able to accurately identify Y stimuli when both Y and X2 occur at pitches above F5 than when Y is in close proximity to X1 or X2 in a pitch range less than F5.

Effect of Voice Category

In general, listeners were able to more accurately identify a Y stimulus when it was in a different voice category than X1X2. This was true in all three conditions with two exceptions: when the Y stimulus pitch was A3 or A5 in the XG4XB4 condition and when Y stimulus pitch was A3 in the XA3XA5 condition. The first exception is not surprising since the effects of pitch difference are maximized in this condition. The interaction of pitch difference and voice category may result in the obliteration of voice category effects at extreme pitches. The second exception is understandable only when individual voice samples are considered. On A3, soprano 1 produced a stimulus with a noticeably breathy quality. Thus, this stimulus may have been perceived as quite dissimilar from that produced by soprano 2 on the same pitch.