1. INTRODUCTION

The objective of this thesis is to research and develop prosodic features for discriminating proper names used in an alerting context (e.g., “John, can I have that book?”) from those used in a referential context (e.g., “I saw John yesterday”). Prosodic measurements based on pitch and energy are analyzed to introduce new prosody-based features into the Wake-Up-Word Speech Recognition System (Këpuska V. C., 2006). In the process of finding and analyzing the prosodic features, an innovative data collection method was designed and developed.

In a conventional automatic speech recognition system, users are required to physically activate the recognition system by clicking a button or by manually starting the application. With the Wake-Up-Word Speech Recognition System, a person can activate the system using voice alone. The Wake-Up-Word Speech Recognition System will eventually further improve the way people use speech recognition by enabling speech-only interfaces.

In the Wake-Up-Word Speech Recognition System, a word or phrase is used as a “Wake-Up-Word” (WUW), indicating to the system that the user requires its attention (i.e., an alerting context). Any user can activate the system by uttering a WUW (e.g., “Operator”), which enables the application to accept the command that follows (e.g., “Next slide please”). The non-Wake-Up-Words (non-WUWs) include WUWs uttered in a referential context, other words, sounds, and noise. Since the WUW may also occur within a referential context, indicating that the user does not need the system’s attention, it is important for the system to discriminate accurately between the two. The following examples demonstrate the use of the word “Operator” in these two contexts:

Example sentence 1: “Operator, please go to the next slide.” (alerting context)
Example sentence 2: “We are using the word operator as the WUW.” (referential context)

The above cases indicate different user intentions. In the first example, the word “operator” is used to alert the system and get its attention. In the second example, the same word is used, but in a referential context. The current Wake-Up-Word Speech Recognition System implements only the pre- and post-WUW silence as a prosodic feature to differentiate the alerting and referential contexts.

In this thesis, pitch- and energy-based prosodic features are investigated. The problem of general prosodic analysis is introduced in Section 1.1.

In Chapter 2, the use of pitch as a prosodic feature is described. In general, pitch represents the intonation of speech, and intonation conveys linguistic and paralinguistic information (Lehiste, 1970). The definition and characteristics of pitch are covered in Section 2.1. In Section 2.2, a pitch estimation method known as the Enhanced Super Resolution Fundamental Frequency Determinator (eSRFD) (Bagshaw, 1994) is introduced. Finally, in Section 2.3, multiple pitch-based features are derived from the pitch measurements and used to find the feature that best discriminates WUWs used in alerting contexts from those used in referential contexts.

In Chapter 3, an additional prosodic feature based on energy measurement is described. The definition of prominence, an important prosodic feature based on energy and pitch, and its characteristics are covered in Section 3.1. In Section 3.2, a description of the energy computation is presented. Finally, in Section 3.3, multiple energy features derived from the energy measurement are presented and analyzed.

In Chapter 4, an innovative approach to speech data collection is presented. After a number of prosodic analysis experiments conducted using the WUWII Corpus (Tudor, 2007), validation of the results on a different data set was deemed necessary. Since, to our knowledge, no specialized speech database is available, an idea from Dr. R. Wallace was adopted to collect the data from movies. We designed a system which extracts speech from the audio channel and, if necessary, video information from recorded media (e.g., DVDs) of movies and/or TV series. This system is currently under development by Dr. Këpuska’s VoiceKey Group.

The problem definition and system introduction will be explained in Section 4.1, followed by the system design in Section 4.2.

1.1 Prosodic Analysis

The word prosody refers to the intonational and rhythmic aspects of a language (Merriam-Webster Dictionary). Its etymology comes from ancient Greek, where it referred to singing with instrumental music. In later times, the word was used for the “science of versification” and the “laws of meter” (William J. Hardcastle, 1997), governing the modulation of the human voice in reading poetry aloud. In modern phonetics, the word prosody most often refers to those properties of speech that cannot be derived from the segmental sequence of phonemes underlying human utterances.

Human speech cannot be fully characterized as the expression of phonemes, syllables, or words. For example, we can notice that segments or syllables are shortened or lengthened in normal speech, apparently in accordance with some pattern. We can also hear that pitch moves up and down in a non-random way, providing speech with a recognizable melody. In addition, one can hear that some syllables or words are made to sound more prominent than others.

From the phonological perspective, prosody can be classified into prosodic structure, tune, and prominence, which can be described as follows:

  1. Prosodic structure refers to the noticeable breaks or disjunctures between words in sentences, which can also be interpreted as the duration of the silence between words as a person speaks. This factor has been considered in the current Wake-Up-Word Speech Recognition System, where a minimal silence period before and after the WUW must be present. The silence period just before the WUW is usually longer than the average silence period around non-WUWs or other parts of the sentence.
  2. Tune refers to the intonational melody of an utterance (Jurafsky & Martin), which can be quantified by the pitch measurement, also known as the fundamental frequency of the speech. The details of the pitch characteristics, the pitch estimation algorithm, and the usage of pitch features are presented and explained in Chapter 2.
  3. Prominence includes the measurement of stress and accent in speech. Prominence is measured in our experiments using the energy of the sound signal (a minimal sketch of such a measurement follows this list). The details of the energy computation, the features derived from energy, and the experimental results are presented in Chapter 3.
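
To make the first and third measurements concrete, the following minimal sketch computes short-time energy and the lengths of sub-threshold silence runs. It is illustrative only: the frame length, hop size, threshold, and function names are assumptions, not the parameters of the Wake-Up-Word system.

    import numpy as np

    def short_time_energy(signal, frame_len=160, hop=80):
        """Sum-of-squares energy per frame (lengths in samples)."""
        signal = signal.astype(np.float64)
        starts = range(0, len(signal) - frame_len + 1, hop)
        return np.array([np.sum(signal[i:i + frame_len] ** 2) for i in starts])

    def silence_runs(energy, threshold):
        """Return (start_frame, length) for each run of sub-threshold frames."""
        runs, start = [], None
        for i, e in enumerate(energy):
            if e < threshold and start is None:
                start = i                      # a silence run begins
            elif e >= threshold and start is not None:
                runs.append((start, i - start))
                start = None
        if start is not None:                  # run extends to the end
            runs.append((start, len(energy) - start))
        return runs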

2. PITCH FEATURES

In this chapter, the intonational melody of an utterance, computed using pitch measurements, is described. The pitch characteristics and a comparison of various pitch estimation algorithms (Bagshaw, 1994) are covered in Section 2.1. Based on the comparison of multiple fundamental frequency determination algorithms (FDAs), the Enhanced Super Resolution Fundamental Frequency Determinator (eSRFD) (Bagshaw, 1994) is selected as the algorithm of choice to perform the pitch estimation. The details of the eSRFD algorithm are covered in Section 2.2. The derivation of multiple pitch-based features and their performance evaluations are covered in Section 2.3.

2.1 Pitch and pitch estimation methods

Intonation is one of the prosodic features that contain information that may be key to discriminating between the referential context and the alerting context. The intonation of speech is strictly interpreted as “the ensemble of pitch variations in the course of an utterance” (Hart, 1975). Unlike tonal languages such as Mandarin Chinese, in which lexical forms are characterized by different levels or patterns of pitch on a particular phoneme, pitch in intonational languages such as English, German, the Romance languages, and Japanese is used syntactically. In addition, intonation patterns in intonational languages span groups of words, called intonation groups, which are usually uttered in a single breath. Pitch measurements in intonational languages reveal the emotion of a person and/or the intention of his/her speech. For example, consider the following sentence:

Can you pass me the phone?

The pattern of continuously rising pitch over the last three words of the above sentence indicates a request.

Strictly speaking, pitch is defined as the fundamental frequency, or fundamental repetition rate, of a sound. The typical pitch range is 60-200 Hz for adult males and 200-400 Hz for adult females and children. The contraction of the vocal folds in humans produces a relatively high pitch and, conversely, expanded vocal folds produce a lower pitch. This explains why a person’s voice rises in pitch when he/she gets nervous or surprised. That human males usually have a lower voice pitch than females and children can also be explained by the fact that males usually have longer and larger vocal folds.

After years of development, pitch estimation methods can be classified into the following three categories:

  1. Frequency-domain methods, such as the CFD (cepstrum-based F0 determinator) and HPS (harmonic product spectrum), use a frequency-domain representation of the speech signal to find the fundamental frequency (a minimal sketch of this idea follows this list).
  2. Time-domain methods, such as the FBFT (feature-based F0 tracker) (Phillips, 1985), which uses perceptually motivated features, and PP (parallel processing) methods, produce fundamental frequency estimates by analyzing the waveform in the time domain.
  3. Cross-correlation methods, such as the IFTA (integrated F0 tracking algorithm) and SRFD (super resolution F0 determinator), use a waveform similarity metric based on a normalized cross-correlation coefficient.
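
As an example of the first category, the following is a minimal sketch of the harmonic product spectrum idea, assuming a single windowed frame and the 60-400 Hz speech range quoted above; the function name and parameters are illustrative, not part of any cited system.

    import numpy as np

    def hps_pitch(frame, fs, n_harmonics=4, fmin=60.0, fmax=400.0):
        """Estimate F0 by multiplying the magnitude spectrum with
        copies of itself compressed by factors 2..n_harmonics."""
        spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
        hps = spectrum.copy()
        for h in range(2, n_harmonics + 1):
            compressed = spectrum[::h]         # every h-th bin
            hps[:len(compressed)] *= compressed
        freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
        band = (freqs >= fmin) & (freqs <= fmax)
        return freqs[band][np.argmax(hps[band])]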

The eSRFD (Enhanced Super Resolution Fundamental Frequency Determinator) method (Bagshaw, 1994) was chosen to extract the pitch measurements for the Wake-Up-Word because of its high overall accuracy. According to Bagshaw’s experiments, the eSRFD algorithm achieves a combined voiced and unvoiced error rate below 17% and low-gross fundamental frequency error rates of 2.1% and 4.2% for males and females, respectively. Figure 2-1 and Figure 2-2 show the error rate comparison charts between eSRFD and other FDAs for male and female voices, respectively.

Figure 2-1 FDA Evaluation Chart: Male Speech. Reproduced from (Bagshaw, 1994)

In Figure 2-1 and Figure 2-2, the purple bars indicate the low-gross F0 error, which refers to the halving error, where the pitch has been wrongly estimated at about half of the actual pitch. The green bars represent the high-gross F0 error, which refers to the doubling error, where the pitch has been wrongly estimated at about twice the actual pitch. The voiced error, represented by the red bars, refers to unvoiced frames that have been misidentified as voiced by the FDA. Finally, the blue bars show the unvoiced error, in which voiced data has been misidentified as unvoiced.
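
To make these four categories concrete, a minimal sketch of a per-frame scorer follows; the 20% tolerance, the use of 0 to mark unvoiced decisions, and the function name are assumptions for illustration, not Bagshaw’s exact evaluation rules.

    def classify_fda_error(est_f0, ref_f0, tol=0.2):
        """Compare one frame's F0 estimate (Hz) against a reference;
        an F0 of 0 denotes an unvoiced decision."""
        if est_f0 > 0 and ref_f0 == 0:
            return "voiced error"              # unvoiced frame called voiced
        if est_f0 == 0 and ref_f0 > 0:
            return "unvoiced error"            # voiced frame called unvoiced
        if ref_f0 == 0:
            return "correct"                   # both agree the frame is unvoiced
        ratio = est_f0 / ref_f0
        if abs(ratio - 0.5) <= tol * 0.5:
            return "low-gross (halving) error"
        if abs(ratio - 2.0) <= tol * 2.0:
            return "high-gross (doubling) error"
        return "correct" if abs(ratio - 1.0) <= tol else "other gross error"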

Figure 2-2 FDA Evaluation Chart: Female Speech. Reproduced from (Bagshaw, 1994)

Figure 2-1 and Figure 2-2 present the male and female fundamental frequency evaluation charts, respectively. They show that the eSRFD algorithm achieves the lowest overall error rate. This result was confirmed in a more recent study (Veprek & Scordilis, 2002). Consequently, eSRFD was chosen as the FDA used in the present project.

2.2 eSRFD pitch estimation algorithm

The eSRFD (Bagshaw, 1994) is an advanced version of the SRFD (Medan, 1991). The program flow chart of the eSRFD FDA is illustrated in Figure 2-3.

The theory behind the SRFD (Medan, 1991) algorithm is to use a normalized cross-correlation coefficient to quantify the degree of similarity between two adjacent, non-overlapping sections of speech. In eSRFD, a frame is divided into three consecutive sections instead of the two used in the original SRFD algorithm.

At the beginning, the sampled waveform is passed through a low-pass filter to remove signal noise. The utterance is then divided into non-overlapping frames of 6.5 ms length (t_interval = 6.5 ms). Each frame contains a set of samples, S_N, which is divided into three consecutive segments, each containing the same, varying number of samples, n. The segmentation is defined by Equation 2-1 and is further illustrated in Figure 2-4 below.

$$x_n = \{\,s_i : 0 \le i < n\,\}, \quad y_n = \{\,s_{n+i} : 0 \le i < n\,\}, \quad z_n = \{\,s_{2n+i} : 0 \le i < n\,\} \qquad \text{Equation 2-1}$$

Figure 2-3 eSRFD Flow chart

Figure 2-4 Analysis segments of eSRFD FDA
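
A minimal sketch of this segmentation follows, under the assumption that Equation 2-1 takes 3n consecutive samples from the frame’s analysis point and splits them into adjacent thirds; the variable and function names are illustrative.

    def segment_frame(samples, start, n):
        """Split 3*n consecutive samples beginning at `start` into the
        adjacent eSRFD analysis segments x_n, y_n, z_n; return None when
        a segment would fall outside the signal (frame treated as unvoiced)."""
        if start < 0 or start + 3 * n > len(samples):
            return None
        x = samples[start : start + n]
        y = samples[start + n : start + 2 * n]
        z = samples[start + 2 * n : start + 3 * n]
        return x, y, z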

In eSRFD, each frame is processed by a silence detector, which labels the frame as unvoiced (silence) if the sum of the absolute values of xmin, xmax, ymin, ymax, zmin, and zmax is smaller than a preset value (e.g., corresponding to a 50 dB signal-to-noise level); conversely, the frame is considered voiced if this sum is equal to or larger than the preset value. No fundamental frequency search is performed if the frame is marked as unvoiced. In those cases where at least one of the segments xn, yn, or zn is not defined, which usually happens at the beginning and the end of the speech file, the frames are labeled as unvoiced and no FDA is applied to them.
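
The silence test itself reduces to a few lines. The sketch below assumes NumPy array segments as produced by the previous sketch and leaves the preset threshold as a free parameter rather than hard-coding a calibrated 50 dB figure.

    def is_unvoiced(x, y, z, threshold):
        """Silence detector: sum the absolute segment extrema and compare
        against a preset amplitude threshold."""
        extrema = (abs(x.min()) + abs(x.max()) +
                   abs(y.min()) + abs(y.max()) +
                   abs(z.min()) + abs(z.max()))
        return extrema < threshold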

If the frame is not labeled as unvoiced, then candidate values for the fundamental period are searched over values of n within the range Nmin to Nmax using the normalized cross-correlation coefficient Px,y(n), as described by Equation 2-2.

$$P_{x,y}(n) = \frac{\sum_{j=0}^{\lfloor n/L\rfloor - 1} x_{jL}\, y_{jL}}{\sqrt{\sum_{j=0}^{\lfloor n/L\rfloor - 1} x_{jL}^{2}\;\sum_{j=0}^{\lfloor n/L\rfloor - 1} y_{jL}^{2}}}, \quad N_{min} \le n \le N_{max} \qquad \text{Equation 2-2}$$

In Equation 2-2, the decimation factor L is used to lower the computational load of the algorithm. Smaller L values allow a higher resolution but also increase the computational load of the FDA. Larger L values produce faster computation with a lower resolution search. The L value is set to 1, since the purpose of this research is to determine as accurately as possible the relationship between pitch measurements in WUW words. Computational speed is therefore considered secondary and is not taken into account. However, the variable L will be reconsidered when this algorithm is integrated into the WUW Speech Recognition System.
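
The following sketch transcribes Equation 2-2, with the decimation factor L exposed as a parameter; the helper names and the exhaustive period scan are illustrative assumptions, since the actual implementation locates peaks rather than keeping every score.

    import numpy as np

    def p_xy(x, y, L=1):
        """Normalized cross-correlation of two equal-length segments,
        evaluated on every L-th sample (Equation 2-2)."""
        xd = x[::L].astype(np.float64)
        yd = y[::L].astype(np.float64)
        denom = np.sqrt(np.sum(xd ** 2) * np.sum(yd ** 2))
        return float(np.dot(xd, yd) / denom) if denom > 0.0 else 0.0

    def scan_periods(samples, start, n_min, n_max, L=1):
        """Evaluate P_{x,y}(n) for every candidate period n in [n_min, n_max]."""
        scores = {}
        for n in range(n_min, n_max + 1):
            if start + 2 * n > len(samples):
                break                          # y segment would be undefined
            x = samples[start : start + n]
            y = samples[start + n : start + 2 * n]
            scores[n] = p_xy(x, y, L)
        return scores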

Figure 2-5 Analysis segments for Px,y(n) in the eSRFD

The candidate values of the fundamental period of a frame are found by locating peaks in the normalized cross-correlation Px,y(n). If a peak value exceeds a specified threshold, Tsrfd, then the frame is further considered a voiced candidate. This threshold is adaptive and depends on the voicing classification of the previous frame and three preset parameters. The definition of Tsrfd is given in Equation 2-3. If the previous frame is unvoiced or silent, then Tsrfd is set to 0.88. If the previous frame is voiced, then Tsrfd is the larger of 0.75 and 0.85 times the Px,y value of the previous frame, P’x,y. The threshold is adjusted because the present frame is more likely to be voiced if the previous frame is also voiced.

$$T_{srfd} = \begin{cases} 0.88 & \text{if the previous frame is unvoiced or silent}\\[4pt] \max\left(0.75,\; 0.85\,P'_{x,y}\right) & \text{if the previous frame is voiced} \end{cases} \qquad \text{Equation 2-3}$$
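
Equation 2-3 transcribes into a one-line decision. In the minimal sketch below, the three preset parameters are keyword arguments carrying the values quoted above; the function and argument names are illustrative.

    def t_srfd(prev_voiced, prev_p_xy,
               t_unvoiced=0.88, t_floor=0.75, t_decay=0.85):
        """Adaptive voicing threshold T_srfd of Equation 2-3."""
        if not prev_voiced:
            return t_unvoiced                  # previous frame unvoiced/silent
        return max(t_floor, t_decay * prev_p_xy)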

If no candidates for the fundamental period are found in the frame, then the frame is reclassified as unvoiced and no further processing is applied to it. On the other hand, if the frame is classified as voiced, then the optimal candidate is found using the process described next.

After computing the first normalized cross-correlation coefficient, Px,y, the second normalized cross-correlation coefficient, Py,z, is calculated for the voiced frame. The coefficient Py,z is described by Equation 2-4.

$$P_{y,z}(n) = \frac{\sum_{j=0}^{\lfloor n/L\rfloor - 1} y_{jL}\, z_{jL}}{\sqrt{\sum_{j=0}^{\lfloor n/L\rfloor - 1} y_{jL}^{2}\;\sum_{j=0}^{\lfloor n/L\rfloor - 1} z_{jL}^{2}}} \qquad \text{Equation 2-4}$$

After the second normalized cross-correlation, a score is given to each candidate. If a candidate pitch value of a frame has both its Px,y and Py,z values larger than Tsrfd, then a score of 2 is given to the candidate. If only Px,y is above Tsrfd, then a score of 1 is assigned. A higher score indicates a higher likelihood that the candidate represents the fundamental period of the frame. After the scores are assigned, if there are one or more candidates with a score of 2, then all candidates with a score of 1 are removed from the candidate list. If there is only one candidate with a score of 2, then that candidate is taken as the best estimate of the fundamental period of that particular frame. If there are multiple candidates with a score of 1 but none with a score of 2, then the optimal fundamental period is sought among the remaining candidates.
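
A minimal sketch of this scoring and pruning step follows, assuming each candidate period carries its Px,y and Py,z values; the dictionary layout is an illustrative choice, not the system’s actual data structure.

    def score_candidates(candidates, threshold):
        """candidates: {n: (p_xy, p_yz)}; returns {n: score} after pruning.
        Score 2 when both coefficients exceed T_srfd, 1 when only P_xy does."""
        scores = {}
        for n, (p_xy_val, p_yz_val) in candidates.items():
            if p_xy_val > threshold and p_yz_val > threshold:
                scores[n] = 2
            elif p_xy_val > threshold:
                scores[n] = 1
        if any(s == 2 for s in scores.values()):
            # one or more score-2 candidates: drop every score-1 candidate
            scores = {n: s for n, s in scores.items() if s == 2}
        return scores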

In the case of multiple candidates with a score of 1 but none with a score of 2, the candidates are sorted in ascending order of fundamental period. The fundamental period of the last candidate on the list, which is the largest, is denoted nM, and that of the m-th candidate is denoted nm.

Figure 2-6 Analysis segments for q(nm) in the eSRFD

Then the third normalized cross-correlation coefficient, q(nm), between two sections of length nM spaced nm apart, is calculated for each candidate. The section nM in a frame is illustrated in Figure 2-6, and Equation 2-5 describes the normalized cross-correlation coefficient q(nm) used in this case.

$$q(n_m) = \frac{\sum_{i=0}^{n_M - 1} s_i\, s_{i+n_m}}{\sqrt{\sum_{i=0}^{n_M - 1} s_i^{2}\;\sum_{i=0}^{n_M - 1} s_{i+n_m}^{2}}} \qquad \text{Equation 2-5}$$

After the third normalized cross-correlation coefficient is computed, the q(nm) value of the first candidate on the list is initially taken as the optimal value. If a subsequent q(nm), multiplied by 0.77, is larger than the current optimal value, then the candidate for which that q(nm) was computed becomes the new optimal candidate. The same rule is applied throughout the entire list of candidates, yielding the optimal candidate.
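
One plausible reading of this selection rule is sketched below; the 0.77 bias factor comes from the text, while the way the running optimum is updated is an assumption.

    def select_optimal(q_scored):
        """q_scored: [(n_m, q)] in ascending order of fundamental period.
        A later (longer-period) candidate replaces the running optimum only
        when 0.77 times its q score still beats the current best."""
        best_n, best_q = q_scored[0]
        for n_m, q in q_scored[1:]:
            if 0.77 * q > best_q:
                best_n, best_q = n_m, q
        return best_n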

In the case where only one candidate has a score of 1 and there are no candidates with a score of 2, the probability that the candidate is the true fundamental period of the frame is low. In such a case, if both the previous frame and the next frame are silent, then the current frame is an isolated frame and is reclassified as silent. If either the previous or the next frame is voiced, then the candidate of the current frame is assumed to be optimal and defines the fundamental period of the current frame.
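
Finally, this isolated-frame rule is a small neighbourhood check. The sketch below assumes the voicing flags of the neighbouring frames are already known and uses None to mark a frame reclassified as silent; the function name is illustrative.

    def resolve_lone_candidate(prev_voiced, next_voiced, candidate):
        """A single score-1 candidate survives only if a neighbouring frame
        is voiced; otherwise the frame is reclassified as silent (None)."""
        if prev_voiced or next_voiced:
            return candidate
        return None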