Chapter 7

FEATURE CALCULATION

7.1 Basic Prosodic Attributes

In this section, the calculations and procedures employed to obtain the basic feature contours are explained. These essential attributes (i.e. pitch and energy) are the starting point for obtaining more complex features, which contain valuable information for our purposes. The software used in section 7.1 is part of the Verbmobil long-term project of the Federal Ministry of Education, Science, Research and Technology.

In order to achieve feasible estimations and to avoid the difficulties caused by the non-stationary nature of speech, it is assumed that the properties of the signal change relatively slowly with time. This allows a short-time window of speech to be examined in order to extract relevant parameters that are presumed to be fixed within the duration of the window. Most techniques yield parameters averaged over the course of the time window. Thus, if dynamic parameters are to be modelled, the signal must be divided into successive windows or analysis frames, so that the parameters can be calculated often enough to follow relevant changes. Consequently, in order to obtain F0 and energy contours, smaller fragments of speech, called frames, are considered.

For each frame, the F0 and energy values are computed. There is one single value per frame, and for its calculation a longer analysis window is employed. All the speech signal values inside the analysis window are considered, so successive analysis windows always overlap. Frame durations of 10 ms and 20 ms are commonly used in speech processing, while window lengths for F0 and energy calculations are usually set between 25 ms and 40 ms. The analysis performed in the present work uses frame durations of 10 ms and analysis window lengths of 40 ms.
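The relation between the 10 ms frame shift and the 40 ms analysis window can be sketched as follows. The function name and the use of NumPy are illustrative choices, not part of the Verbmobil software:

```python
import numpy as np

def frame_signal(signal, fs=16000, frame_ms=10, window_ms=40):
    """Cut a signal into analysis windows advancing by one frame length.

    Window starts advance by the frame length (10 ms), while each window
    spans 40 ms, so successive analysis windows overlap by 30 ms.
    """
    frame_len = int(fs * frame_ms / 1000)    # 160 samples per frame
    win_len = int(fs * window_ms / 1000)     # 640 samples per window
    windows = []
    for start in range(0, len(signal) - win_len + 1, frame_len):
        windows.append(signal[start:start + win_len])
    return np.array(windows)

# One second of audio at 16 kHz yields 97 full 40 ms windows
# at a 10 ms frame shift (the last incomplete windows are dropped).
x = np.random.randn(16000)
w = frame_signal(x)
```

One F0 and one energy value would then be computed per row of `w`, i.e. per frame.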

Since the voiced/unvoiced decision is the basis of the F0 computation, it is the first algorithm described in this section. The decision is frame-based, and F0 is estimated only for voiced frames.

7.1.1 Voiced/unvoiced decision.

Voiced speech involves the vibration of the vocal folds in response to airflow from the lungs. This vibration is periodic and can be examined independently of the properties of the vocal tract. Its periodicity refers to the fundamental frequency of the vibration, or the resulting periodicity in the speech signal, also called “pitch”.


Figure 7.1. Waveform of the glottal source.

In unvoiced speech the sound source is not a regular vibration, but rather vibrations caused by turbulent airflow due to a constriction in the vocal tract. The sound created as a result of the constriction is described as a noise source. It contains no dominating periodic component and has a relatively flat spectrum, meaning that every frequency component is represented equally (in fact, for some sounds the noise spectrum may slope down at around 6 dB/octave). In the time waveform of a noise source, only a random pattern of movement around the zero axis is observed. In this context, without any periodicity, pitch estimation makes no sense.

Figure 7.2. Different sources in speech production.

Therefore, for F0 estimation it is essential to define which frames are considered voiced and which unvoiced. In contrast to the F0 and energy calculations, non-overlapping windows are employed for the voiced/unvoiced decision. The algorithm uses only the signal values contained within one frame duration.

Voiced frames differ from unvoiced frames by high amplitude values, a relatively low zero-crossing rate and high energy values. The zero-crossing rate is understood as the number of zero-crossings per time unit, defined from now on as the frame length, i.e. 10 ms. Several procedures to decide between voiced and unvoiced frames are introduced in [Hes83]. The algorithm used here applies thresholds, which are presented in [Hes83], and is described in [Kie97]. As a result of that work, the following quantities proved appropriate for the voiced/unvoiced decision:

Zero-crossing rate in Hz:

	n̄_cross = n_cross · fs / N	(7.1)

Normalised energy of the signal:

	EneNorm = (1 / (N · MaxRange²)) · Σ_{n=1}^{N} s_n²	(7.2)

Normalised absolute maximum:

	MaxNorm = Range / MaxRange	(7.3)

where

	fs	sampling frequency in Hz (here 16000)
	N	frame length in samples (here 160)
	s_n	n-th sample value of the signal
	n_cross	number of zero-crossings within a frame
	Range	difference between maximum and minimum value of the signal
	MaxRange	maximum feasible range, dependent on the quantisation
		(here 16 bit → MaxRange = 65536)

The normalisation in (7.2) and (7.3) accounts for the fact that the speaker may speak at different energy levels at different times.

The decision rule is obtained by comparing theoretically motivated thresholds with the vector whose components result from equations (7.1) to (7.3):

If

	n̄_cross < θ_n_cross  and
	EneNorm > θ_EneNorm  and	(7.4)
	MaxNorm > θ_MaxNorm

then

	voiced

else

	unvoiced

where θ = (θ_n_cross, θ_EneNorm, θ_MaxNorm) is the threshold vector compared against the feature vector obtained from (7.1)–(7.3).
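The frame-based decision can be sketched as follows. The threshold values below are placeholders chosen for illustration, not the ones tuned in the Verbmobil experiments:

```python
import numpy as np

FS = 16000         # sampling frequency in Hz
N = 160            # frame length in samples (10 ms)
MAX_RANGE = 65536  # 16-bit quantisation

# Illustrative thresholds -- NOT the values tuned during Verbmobil.
TH_ZCR, TH_ENE, TH_MAX = 3000.0, 1e-4, 0.01

def voiced(frame):
    """Frame-based voiced/unvoiced decision in the spirit of (7.1)-(7.4)."""
    n_cross = np.sum(np.abs(np.diff(np.sign(frame))) > 0)
    zcr = n_cross * FS / N                          # zero-crossings per second
    ene = np.mean((frame / MAX_RANGE) ** 2)         # normalised energy
    rng = (frame.max() - frame.min()) / MAX_RANGE   # normalised maximum
    return zcr < TH_ZCR and ene > TH_ENE and rng > TH_MAX

# A 120 Hz tone at speech-like amplitude: few zero-crossings,
# high energy and range -- classified as voiced.
t = np.arange(N) / FS
is_v = voiced(5000 * np.sin(2 * np.pi * 120 * t))
```

Low-amplitude noise fails the energy test and white noise fails the zero-crossing test, so such frames fall into the unvoiced branch.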

The thresholds were optimised in order to reach the best algorithm performance for the various speech samples available, starting from some theoretical background. They were selected through experiments made during the development of the Verbmobil project [Hag95]. After some simple experiments based on trial and error, further experiments were conducted using neural networks as classifier for the voiced/unvoiced decision at the frame level. It was observed that this procedure provided thresholds whose values yield better results. Detailed information and additional data about voiced/unvoiced decision methods can be found in [Rab78], [Hes83] and [Kie97].

Since the speech signal conditions in this Diploma Thesis are similar, these thresholds are retained for all calculations computed here. Before these values were adopted, it was verified that they were able to classify voiced and unvoiced frames efficiently. The Praat program was employed to compare the regions selected as voiced. Both programs coincided consistently in which regions were classified as voiced. However, the Verbmobil program seemed to yield more accurate boundaries of voiced regions, while Praat created, in certain cases, overly long regions, which included some undesirable unvoiced sounds.

7.1.2 Fundamental Frequency Contour.

7.1.2.1 Preliminary remark.

This section deals with the fundamental frequency (F0 or pitch) of a periodic signal, which is the inverse of its period T (see figure 7.1). The period is defined as the smallest positive member of the set of time shifts that leave the signal invariant; strictly, it only makes sense for a perfectly periodic signal. The speech signal results from a source of sound modulated by a transfer (filter) function (see figure 7.3) determined by the shape of the supra-laryngeal vocal tract, according to the source-filter theory described in section 3.3.2. This theory stems from the experiments of Johannes Müller (1848), who tested a functional theory of phonation by blowing air through larynges excised from human cadavers.

Obviously, a signal cannot be switched on or off or modulated without losing its perfect periodicity, and this combination causes the speech signal to be only quasi-periodic, due to small period-to-period variations in the vocal cord vibration or in the vocal tract shape. Therefore, the art of fundamental frequency estimation is to deal with this information in a consistent and useful way.



7.1.2.2 Difficulties in estimating pitch contour.

F0 is considered one of the most important features for the characterisation of emotions and is the acoustic correlate of the perceived pitch. Its perception by the human ear is non-linear and depends on the frequency. In addition, the human voice is not a pure sinusoid, but a complex combination of diverse frequencies.

Estimating the pitch of voiced speech sounds has always been a complex problem. Though it appears to be a rather simple task on the surface, there are many subtleties that need to be kept in mind. F0 is usually defined, for voiced speech, as the rate of vibration of the vocal folds. Periodic vibration at the glottis may produce speech that is less than perfectly periodic, due to changes in the shape of the vocal tract that filters the glottal source waveform, making it hard to estimate the fundamental periodicity from the speech waveform.

Therefore, F0 estimation involves a huge number of considerations; it can be influenced by many factors such as phone-intrinsic parameters or coarticulation. Furthermore, the excitation signal itself is not truly periodic, but shows small variations in period duration (jitter) and in period amplitude (shimmer). These aperiodicities, which take the form of relatively smooth changes in amplitude, rate or glottal waveform shape (for example the duty cycle of open and closed phases), of intervals where the vibration seems to reflect several superimposed periodicities (diplophony), or of intervals where glottal pulses occur without obvious regularity of time interval or amplitude (glottalisations, vocal creak or fry), do not contribute to speech intelligibility, but to the naturalness of human speech. Therefore, the mapping between physical acoustics and perceived prosody is neither linear nor one-to-one; as said before, variations in F0 are the most direct cause of pitch perception, but amplitude and duration also affect pitch and make its estimation more intricate.

While there are many successfully implemented pitch estimation algorithms (see [Che01, Hes83]), none of them works without making certain assumptions about the sound being analysed, and each has to face many difficulties and admit certain failures. The next paragraphs give a brief historical overview of the different methods tried. It can be seen how, starting from the first method ever employed, they meet with diverse limitations.

The first method tried was simply to low-pass filter the speech signal in order to remove all harmonics and then measure the fundamental frequency by any convenient means. This method faces two difficulties. First, the filter had to be adaptive, because pitch can easily cover a 2-to-1 range and the filter always had to pass the fundamental and reject the second harmonic. The filter frequency was set by tracking the pitch and predicting the forthcoming pitch value; hence any error in one frame of speech could cause the filter to select the wrong cut-off frequency in the next frame and so lose track of the pitch altogether. The second difficulty arose from the fact that in many cases pitch had to be estimated from speech in which the fundamental frequency was missing. For instance, in telephone speech the frequency response drops off rapidly below 300 Hz; hence for many male voices the fundamental frequency is absent or so weak as to be lost in the system noise.

In the absence of the fundamental, it is common to search for periodicities in a signal by examining its autocorrelation function. For a periodic function, the autocorrelation shows a maximum at a lag equal to the period of the function. A first problem is that speech is not exactly periodic, because of changes in pitch and in formant frequencies. Therefore, the maximum may be lower and broader than expected, causing problems in setting the decision threshold. Another problem arises from the possibility that the first formant frequency is equal to or below the pitch frequency. If its amplitude is particularly high, this situation can yield a peak in the autocorrelation function that is comparable to the peak belonging to the fundamental. As a remedy, a pitch tracking process is used; such a process can usually ride out a single error, but not a string of errors.
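An autocorrelation-based estimator of this kind can be sketched in a few lines. This is a generic textbook version, not the algorithm used in this thesis; note that it recovers the period even when the fundamental component itself is weak, since the harmonics share the same periodicity:

```python
import numpy as np

def f0_autocorr(frame, fs=16000, fmin=60.0, fmax=550.0):
    """Estimate F0 from the highest autocorrelation peak in the allowed lag range."""
    frame = frame - frame.mean()
    # Autocorrelation for non-negative lags 0 .. len(frame)-1.
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(fs / fmax), int(fs / fmin)   # lag range for 60-550 Hz
    lag = lo + np.argmax(ac[lo:hi + 1])
    return fs / lag

# A 40 ms window of a 100 Hz fundamental plus its second harmonic:
# the autocorrelation peaks at a lag of one period (160 samples).
fs = 16000
t = np.arange(int(0.04 * fs)) / fs
frame = np.sin(2 * np.pi * 100 * t) + 0.5 * np.sin(2 * np.pi * 200 * t)
```

The first-formant problem described above corresponds to a strong low-frequency component producing a competing peak inside the same lag range, which is exactly why a tracking stage is still needed on top of such a raw estimator.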

Pitch can be determined either from periodicity in the time domain or from regularly spaced harmonics in the frequency domain. Consequently, pitch estimation techniques can be classified into two main groups:

  • period-synchronous procedures: These methods try to follow the periodic characteristics of the signal, e.g. positive zero-crossings, and estimate the signal period from this information.
  • short-term analysis procedures (window-based): The short-term variety of estimators operates on a block (short-time frame) of speech samples and, for each of these frames, one pitch value is estimated. The series of estimated values yields the fundamental frequency contour of the signal. There are different short-time analysis procedures, e.g. cross- or autocorrelation, or algorithms that operate in the frequency domain. Spectral procedures transform the frames to enhance the periodicity information in the signal; periodicity appears as peaks in the spectrum at the fundamental and its harmonics.

Period-synchronous procedures have the advantage of being generally faster and perform adequately in most applications. Short-term methods are considered more accurate and robust, due to the higher precision of calculating one changing attribute in a shorter time interval. In addition, they are less affected by noise and do not require complex post-processing. Consequently, a short-term analysis procedure is used in this thesis for the F0 calculation.

7.1.2.3 Description of the algorithm.

The program used for the F0 and energy contour calculations is part of the prosodic module employed in the second-phase prototype of the Verbmobil project. The procedure was developed in previous works at the Chair for Pattern Recognition of the Friedrich-Alexander-Universität Erlangen-Nürnberg and is described in detail in several works (see [Kom89, Not91, Pen93, Har94, Kie97]). Consequently, only a brief description is given here.

Fundamental frequency estimation through a window-based procedure

This procedure performs a short-term analysis, which works in the spectral domain and provides sequential F0 computation. As already clarified, since F0 only makes sense for voiced frames, the voiced/unvoiced decision must be the first step when the F0 estimation problem is faced. The way this decision is made was detailed in section 7.1.1.

For the prosodic analysis of the human voice, F0 is usually expected to lie in the interval between 35 Hz and 550 Hz. According to the Shannon theorem [Sha49], an analog signal must be sampled at no less than twice its highest frequency in order to be recovered without any losses. In order to respect this theorem, voiced regions are low-pass filtered with a cut-off frequency of 1100 Hz. Through this limitation of the maximum F0 to 550 Hz, noise and errors affect the algorithm less. The low-pass filtered signal is then resampled at a lower sampling frequency (downsampling) in order to reduce the number of signal values that must be processed, thereby accelerating the F0 estimation. For the resulting frames, the short-time spectrum is calculated through the Fast Fourier Transform (FFT, see [Nie83]).
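These pre-processing steps (low-pass filtering at 1100 Hz followed by downsampling) can be sketched as follows. The FIR design details (tap count, Hamming window, decimation factor of 4) are illustrative assumptions, not the parameters of the Verbmobil implementation:

```python
import numpy as np

def lowpass_downsample(x, fs=16000, cutoff=1100.0, factor=4):
    """Low-pass filter at `cutoff` Hz, then keep every `factor`-th sample.

    The new rate fs/factor (here 4000 Hz) still satisfies the sampling
    theorem for a signal band-limited to 1100 Hz (2 * 1100 < 4000).
    """
    taps = 101
    n = np.arange(taps) - (taps - 1) / 2
    h = 2 * cutoff / fs * np.sinc(2 * cutoff / fs * n)  # ideal low-pass impulse response
    h *= np.hamming(taps)                               # windowed FIR design
    h /= h.sum()                                        # unity gain at DC
    y = np.convolve(x, h, mode="same")
    return y[::factor], fs // factor
```

After this step, each frame contains a quarter of the original samples, so the FFT of the short-time spectrum is correspondingly cheaper.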

The procedure is based on the assumption that the absolute maximum of the short-time spectrum corresponds to one harmonic of F0. The main difficulty of the algorithm is to find a proper decision rule for choosing the maximum of the spectrum inside a voiced frame. This decision is made here indirectly through a Dynamic Programming (DP) procedure. For every estimated F0 value (one per voiced frame), several candidate decision values (dividers) are allowed. The dividers of all the frames in one voiced region thus yield a matrix, which is used by the DP to compose a specific cost function employed to find the optimal F0 path. This cost function takes into account the distance to adjacent candidates and the distance to a known target value. For reasons of robustness, this target value is calculated, for the voiced frame with the maximum of the energy signal, using a multi-channel procedure.

Different possible candidates are calculated for every target value using correlation methods (AMDF procedures, see [Ros74]) and frequency-domain procedures (Seneff procedures, see [Sen78]), and the median of these values yields the target value of the voiced interval. The arithmetic mean of all the target values of the speech signal is the reference point R, which is applied for the divider determination within every voiced frame. For each frame t, the spectrum from start frame S to end frame E of the voiced interval is considered, and the frequency Ft with maximum energy in this spectrum is calculated. With the help of the divisors Kt = round(Ft/R), the matrix J, containing diverse F0 candidates, is defined:

	J(t, i) = Ft / (Kt + i)  with  i = −n, …, n;  t = S, …, E	(7.5)

Preliminary tests showed that the correct F0 value is mostly included when five candidates are considered (n = 2). Now, with the help of a recursive cost function and by means of DP, the best path through the matrix J can be found, which finally yields the F0 contour of the voiced region.
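A minimal DP search of this kind can be sketched as follows. The cost weights and their exact form are illustrative; the actual recursive cost function is the one described in [Pen93] and [Kie97]:

```python
import numpy as np

def best_path(J, target, w_jump=1.0, w_target=0.1):
    """Pick one F0 candidate per frame via dynamic programming.

    Local cost      : w_target * |candidate - target|   (closeness to target value)
    Transition cost : w_jump   * |candidate - previous| (smoothness between frames)
    """
    T, C = J.shape
    cost = w_target * np.abs(J - target)
    back = np.zeros((T, C), dtype=int)
    acc = cost[0].copy()
    for t in range(1, T):
        # trans[i, j]: accumulated cost of reaching candidate j via candidate i.
        trans = acc[:, None] + w_jump * np.abs(J[t][None, :] - J[t - 1][:, None])
        back[t] = np.argmin(trans, axis=0)
        acc = cost[t] + trans[back[t], np.arange(C)]
    # Backtrack the cheapest path.
    path = np.empty(T, dtype=int)
    path[-1] = int(np.argmin(acc))
    for t in range(T - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return J[np.arange(T), path]
```

With candidates such as (50, 100, 200) Hz per frame, i.e. octave errors above and below, and a target of 100 Hz, the path stays on the 100 Hz candidates, which is exactly the behaviour that suppresses halving and doubling errors.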

In addition, the procedure has some other advantages. On the one hand, F0 values are not estimated in isolation for every frame. Instead, the cost function establishes a relation with the nearest neighbours, so that their spectral characteristics are also taken into account. On the other hand, proceeding this way, short irregular periods produce no perturbation of the results. One additional benefit is that the computational expense for every single frame, in which the estimated value is calculated, remains limited. For a further description of the cost function see [Pen93] and [Kie97].

Post-processing of the F0 Contour

Independently of the F0 calculation method employed, post-processing is undoubtedly favourable, since direct application of the raw F0 values to further prosodic feature calculations would be clearly inadequate. Post-processing of the F0 values is motivated by several reasons:

  • Automatic algorithms for the F0 extraction generate errors.
  • Values of F0 are not calculated for every single frame of the signal.
  • Fluctuations between adjacent F0 values are disturbing under certain conditions.
  • Calculations from the F0 contour are dependent on the voice reference (e.g. maximum).

Several possibilities for post-processing the fundamental frequency contour can be found in [Hes83]. In the framework of this work, post-processing is accomplished in different steps, as follows:

  • Smoothing of the F0 curve through a median filter.
  • Zero-setting of all the F0 values between 35 Hz and 60 Hz (before interpolation).
  • Interpolation of the unvoiced interval.
  • Semitone transformation and mean value subtraction.
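The four steps above can be sketched as follows. The median window length and the use of simple linear interpolation across unvoiced gaps are illustrative choices:

```python
import numpy as np

def postprocess_f0(f0, win=5):
    """Post-process a raw F0 contour (0 = unvoiced frame) in four steps."""
    f0 = np.asarray(f0, dtype=float).copy()
    # 1. Median smoothing removes isolated estimation errors (e.g. octave spikes).
    pad = win // 2
    padded = np.pad(f0, pad, mode="edge")
    f0 = np.array([np.median(padded[i:i + win]) for i in range(len(f0))])
    # 2. Zero-set implausible values between 35 Hz and 60 Hz.
    f0[(f0 >= 35) & (f0 < 60)] = 0.0
    # 3. Linearly interpolate the unvoiced (zero) intervals.
    voiced = f0 > 0
    f0 = np.interp(np.arange(len(f0)), np.flatnonzero(voiced), f0[voiced])
    # 4. Semitone transformation and mean subtraction,
    #    normalising away the speaker's register.
    st = 12 * np.log2(f0)
    return st - st.mean()
```

After the mean subtraction, a perfectly flat contour maps to zero everywhere, so the remaining values describe deviations from the speaker's mean pitch in semitones.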

Small failures of the algorithm can yield some undesirable noise. Smoothing of the F0 curve with a median filter is employed in order to remove some of these small failures. Smoothing increases the signal-to-noise ratio and allows the signal characteristics (peak position, height, width, area, etc.) to be measured more accurately.