New Tools and Methods for Very-Large-Scale Phonetics Research

1. Introduction

The field of phonetics has experienced two revolutions in the last century: the advent of the sound spectrograph in the 1950s and the application of computers beginning in the 1970s. Today, advances in digital multimedia, networking and mass storage are promising a third revolution: a movement from the study of small, individual datasets to the analysis of published corpora that are several orders of magnitude larger.

Peterson & Barney’s influential 1952 study of American English vowels was based on measurements from a total of less than 30 minutes of speech. Many phonetic studies have been based on the TIMIT corpus, originally published in 1991, which contains just over 300 minutes of speech. Since then, much larger speech corpora have been published for use in technology development: LDC collections of transcribed conversational telephone speech in English now total more than 300,000 minutes, for example. And many even larger collections are now becoming accessible, from sources such as oral histories, audio books, political debates and speeches, podcasts, and so on.

These new bodies of data are badly needed, to enable the field of phonetics to develop and test hypotheses across languages and across the many types of individual, social and contextual variation. Allied fields such as sociolinguistics and psycholinguistics ought to benefit even more. However, in contrast to speech technology research, speech science has so far taken very little advantage of this opportunity, because access to these resources for phonetics research requires tools and methods that are now incomplete, untested, and inaccessible to most researchers.

Transcripts in ordinary orthography, typically inaccurate or incomplete in various ways, must be turned into detailed and accurate phonetic transcripts that are time-aligned with the digital recordings. And information about speakers, contexts, and content must be integrated with phonetic and acoustic information, within collections involving tens of thousands of speakers and billions of phonetic segments, and across collections with differing sorts of metadata that may be stored in complex and incompatible formats. Our research aims to solve these problems by integrating, adapting and improving techniques developed in speech technology research and database research.

The most important technique is forced alignment of digital audio with phonetic representations derived from orthographic transcripts, using HMM methods developed for speech recognition technology. Our preliminary results, described below, convince us that this approach will work. However, forced-alignment techniques must be improved and validated for robust application to phonetics research. There are three basic challenges to be met: orthographic ambiguity; pronunciation variation; and imperfect transcripts (especially the omission of disfluencies). Reliable confidence measures must be developed, so as to allow regions of bad alignment to be identified and eliminated or fixed. Researchers need an easy way to get a believable picture of the distribution of errors in their aligned data, so as to estimate confidence intervals, and also to determine the extent of any bias that may be introduced. And in addition to solving these problems for English, we need to show how to apply the same techniques to a range of other languages, with different phonetic problems, and with orthographies that (as in the case of Arabic and Chinese) may be more phonetically ambiguous than English.

In addition to more robust forced alignment, researchers also need improved techniques for dealing with the results. Existing LDC speech corpora involve tens of thousands of speakers, hundreds of millions of words, and billions of phonetic segments. Other sources of transcribed audio are collectively even larger. Different corpora, even from the same source, typically have differing sorts of metadata, and may be laid out in quite different ways. Manual or automatic annotation of syntactic, semantic or pragmatic categories may be added to some parts of some data sets.

Researchers need a coherent model of these varied, complex, and multidimensional databases, with methods to retrieve relevant subsets in a suitably combinatoric way. Approaches to these problems were developed at LDC under NSF awards 9983258, “Multidimensional Exploration of Linguistic Databases”, and 0317826, “Querying linguistic databases”, with key ideas documented in Bird and Liberman (2001); we propose to adapt and improve those results for the needs of phonetics research.

The proposed research will help the field of phonetics to enter a new era: conducting research using very large speech corpora, in the range from hundreds of hours to hundreds of thousands of hours. It will also enhance research in other language-related fields, not only within linguistics proper, but also in neighboring disciplines such as psycholinguistics, sociolinguistics and linguistic anthropology. And this effort to enable new kinds of research also brings up a number of research problems that are interesting in their own right, as we will explain.

2. Forced Alignment

Analysis of large speech corpora is crucial for understanding variation in speech (Keating et al., 1994; Johnson, 2004). Understanding variation in speech is not only a fundamental goal of phonetics, but it is also important for studies of language change (Labov, 1994), language acquisition (Pierrehumbert, 2003), psycholinguistics (Jurafsky, 2003), and speech technology (Benzeghiba et al., 2007). In addition, large speech corpora provide rich sources of data to study prosody (Grabe et al., 2005; Chu et al., 2006), disfluency (Shriberg, 1996; Stouten et al., 2006), and discourse (Hastie et al., 2002).

The ability to use speech corpora for phonetics research depends on the availability of phonetic segmentation and transcriptions. In the last twenty years, many large speech corpora have been collected; however, only a small portion of them come with phonetic segmentation and transcriptions, including TIMIT (Garofolo et al., 1993), Switchboard (Godfrey & Holliman, 1997), the Buckeye natural speech corpus (Pitt et al., 2007), the Corpus of Spontaneous Japanese, and the Spoken Dutch Corpus. Manual phonetic segmentation is time-consuming and expensive (Van Bael et al., 2007); it takes about 400 times real time (Switchboard Transcription Project, 1999), or about 30 seconds per phoneme (1800 phonemes in 15 hours) (Leung and Zue, 1984). Furthermore, manual segmentation is somewhat inconsistent, with much less than perfect inter-annotator agreement (Cucchiarini, 1993).

Forced alignment has been widely used for automatic phonetic segmentation in speech recognition and corpus-based concatenative speech synthesis. This task requires two inputs: recorded audio and (usually) word transcriptions. The transcribed words are mapped into a phone sequence in advance by using a pronouncing dictionary or grapheme-to-phoneme rules. Phone boundaries are then determined from the acoustic models via algorithms such as Viterbi search (Wightman and Talkin, 1997) and Dynamic Time Warping (Wagner, 1981).
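
As a simplified illustration of the word-to-phone mapping step, the Python sketch below looks words up in a CMU-style pronouncing dictionary; the file name is a placeholder, and out-of-vocabulary words (which in practice would be handled by grapheme-to-phoneme rules) simply raise an error here.

    def load_pronouncing_dict(path):
        """Read a CMU-style dictionary into a {word: [pronunciations]} mapping,
        where each pronunciation is a list of phone symbols."""
        prons = {}
        with open(path, encoding="latin-1") as f:
            for line in f:
                if not line.strip() or line.startswith(";;;"):   # skip blanks and comments
                    continue
                word, phones = line.split(None, 1)
                word = word.split("(")[0]                        # strip variant markers like WORD(2)
                prons.setdefault(word, []).append(phones.split())
        return prons

    def transcript_to_phones(transcript, prons):
        """Map an orthographic transcript to a phone sequence, taking the first
        listed pronunciation of each word."""
        phones = []
        for word in transcript.upper().split():
            if word not in prons:
                raise KeyError("out-of-vocabulary word: " + word)
            phones.extend(prons[word][0])
        return phones

    # Example (file name is a placeholder):
    #   prons = load_pronouncing_dict("cmudict.txt")
    #   transcript_to_phones("the court will hear arguments", prons)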

The most frequently used approach to forced alignment is to build a Hidden Markov Model (HMM) based phonetic recognizer. The speech signal is analyzed as a sequence of frames (e.g., every 3-10 ms). The alignment of frames with phonemes is determined via the Viterbi algorithm, which finds the most likely sequence of hidden states (in practice each phone has 3-5 states) given the observed data and the acoustic model represented by the HMMs. The acoustic features used for training HMMs are normally cepstral coefficients such as MFCCs (Davis and Mermelstein, 1980) and PLPs (Hermansky, 1990). A common practice is to train single-Gaussian HMMs first and then extend them to Gaussian Mixture Models (GMMs) with more components. The reported performance of state-of-the-art HMM-based forced alignment systems ranges from 80% to 90% agreement (of all boundaries) within 20 ms of manual segmentation on TIMIT (Hosom, 2000). Human labelers have an average agreement of 93% within 20 ms, with a maximum of 96% within 20 ms for highly-trained specialists (Hosom, 2000).
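
The core dynamic-programming step can be illustrated with the following minimal sketch (not the implementation we use, which is built on HTK): it assumes one state per phone and takes as given a matrix of per-frame log-likelihoods for the phones of the required sequence, returning the time at which each phone begins.

    import numpy as np

    def viterbi_align(loglik, frame_ms=10.0):
        """Minimal forced-alignment sketch.  loglik[t, j] is the log-likelihood of
        frame t under the j-th phone of the required phone sequence (one state per
        phone here; a real system uses 3-5 HMM states per phone with GMM
        observation densities).  Returns the start time, in ms, of each phone."""
        T, J = loglik.shape
        score = np.full((T, J), -np.inf)
        came_from_prev = np.zeros((T, J), dtype=bool)   # True if phone j was entered at frame t
        score[0, 0] = loglik[0, 0]
        for t in range(1, T):
            for j in range(J):
                stay = score[t - 1, j]
                enter = score[t - 1, j - 1] if j > 0 else -np.inf
                if enter > stay:
                    score[t, j] = enter + loglik[t, j]
                    came_from_prev[t, j] = True
                else:
                    score[t, j] = stay + loglik[t, j]
        # Trace back from the last frame of the last phone to recover boundaries.
        start_frame = [0] * J
        j = J - 1
        for t in range(T - 1, 0, -1):
            if came_from_prev[t, j]:
                start_frame[j] = t
                j -= 1
        return [f * frame_ms for f in start_frame]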

In forced alignment, unlike in automatic speech recognition, monophone (context-independent) HMMs are more commonly used than triphone (context-dependent) HMMs. Ljolje et al. (1997) provide a theoretical explanation as to why triphone models tend to be less precise in automatic segmentation. In the triphone model, the HMMs do not need to discriminate between the target phone and the context; the spectral movement characteristics are better modeled, but phone boundary accuracy is sacrificed. Toledano et al. (2003) compare monophone and triphone models for forced alignment under different criteria and show in their experiments that monophone models outperform triphone models for medium tolerances (15-30 ms different from manual segmentation). However, monophone models underperform for small tolerances (5-10 ms) and large tolerances (>35 ms).

Many researchers have tried to improve forced alignment accuracy. Hosom (2000) uses acoustic-phonetic information (phonetic transitions, acoustic-level features, and distinctive phonetic features) in addition to PLPs. This study shows that the phonetic transition information provides the greatest relative improvement in performance. The acoustic-level features, such as impulse detection, intensity discrimination, and voicing features, provide the next-greatest improvement, and the use of distinctive features (manner, place, and height) may increase or decrease performance, depending on the corpus used for evaluation. Toledano et al. (2003) propose a statistical correction procedure to compensate for the systematic errors produced by context-dependent HMMs. The procedure comprises two steps: a training phase, in which some statistical averages are estimated, and a boundary correction phase, in which the phone boundaries are moved according to the estimated averages. The procedure has been shown to correct segmentations produced by context-dependent HMMs, yielding results more accurate than those obtained by either context-independent or context-dependent HMMs alone. There are also studies in the literature that attempt to improve forced alignment by using models other than HMMs. Lee (2006) employs a multilayer perceptron (MLP) to refine the phone boundaries provided by HMM-based alignment; Keshet et al. (2005) describe a new paradigm for alignment based on Support Vector Machines (SVMs).
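
In outline, such a statistical correction might look like the sketch below (our reading of the two-phase idea, not Toledano et al.'s implementation): the training phase estimates the mean signed error per boundary class from a small manually segmented sample, and the correction phase shifts each automatic boundary by that mean.

    from collections import defaultdict

    def estimate_corrections(auto_bounds, manual_bounds, labels):
        """Training phase: estimate the mean signed error (in ms) of automatic
        boundaries for each boundary class, here taken to be the pair of phone
        labels on either side of the boundary, from a manually segmented sample."""
        errors = defaultdict(list)
        for auto, manual, label in zip(auto_bounds, manual_bounds, labels):
            errors[label].append(auto - manual)
        return {label: sum(v) / len(v) for label, v in errors.items()}

    def correct_boundaries(auto_bounds, labels, corrections):
        """Correction phase: shift each automatic boundary by the mean error
        estimated for its class (boundaries of unseen classes are left alone)."""
        return [auto - corrections.get(label, 0.0)
                for auto, label in zip(auto_bounds, labels)]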

Although forced alignment works well on read speech and short sentences, the alignment of long and spontaneous speech remains a great challenge (Osuga et al., 2001; Toth, 2004). Spontaneous speech contains filled pauses, disfluencies, errors, repairs, and deletions that do not normally occur in read speech and are often omitted in transcripts. Moreover, pronunciations in spontaneous speech are much more variable than in read speech.

Researchers have attempted to improve recognition of spontaneous speech (Furui, 2005) by: using better models of pronunciation variation (Strik & Cucchiarini, 1998; Saraclar et al., 2000); using prosodic information (Wang, 2001, Shriberg & Stolcke, 2004); and improving language models (Stolcke & Shriberg, 1996; Johnson et al., 2004).

With respect to pronunciation models, Riley et al. (1999) use statistical decision trees to generate alternate word pronunciations in spontaneous speech. Bates et al. (2007) present a phonetic-feature-based prediction model of pronunciation variation. Their study shows that feature-based models are more efficient than phone-based models; they require fewer parameters to predict variation and give smaller distance and perplexity values when comparing predictions to the hand-labeled reference. Saraclar et al. (2000) propose a new method of accommodating nonstandard pronunciations: rather than allowing a phoneme to be realized as one of a few alternate phones, the HMM states of the phoneme’s model are allowed to share Gaussian mixture components with the HMM states of the model(s) of the alternate realization(s).

Prosodic information and language models have often been combined to improve automatic recognition of spontaneous speech. Liu et al. (2006) describe a metadata (sentence boundaries, pause fillers, and disfluencies) detection system; it combines information from different types of textual knowledge sources with information from a prosodic classifier. Huang and Renals (2007) incorporate syllable-based prosodic features into language models. Their experiment shows that exploiting prosody in language modeling significantly reduces perplexity and marginally reduces word error rate.

In contrast to automatic recognition, little effort has been made to reduce forced alignment errors for spontaneous speech. Automatic phonetic transcription procedures tend to focus on the accuracy of the phonetic labels generated rather than the accuracy of the boundaries of those labels. Van Bael et al. (2007) show that to approximate the quality of the manually verified phonetic transcriptions in the Spoken Dutch Corpus, one needs only an orthographic transcription, a canonical lexicon, a small sample of manually verified phonetic transcriptions, software for the implementation of decision trees, and a standard continuous speech recognizer. Chang et al. (2000) developed an automatic transcription system that does not use word-level transcripts. Instead, special-purpose neural networks are built to classify each 10 ms frame of speech in terms of articulatory-acoustic-based phonetic features; the features are subsequently mapped to phonetic labels using multilayer perceptron (MLP) networks. The phonetic labels generated by this system are 80% concordant with the labels produced by human transcribers. Toth (2004) presents a model for segmenting long recordings into smaller utterances. This approach estimates prosodic phrase break locations and places words around breaks (based on length and break probabilities for each word).

Forced alignment assumes that the orthographic transcription is correct and accurate. However, transcribing spontaneous speech is difficult. Disfluencies are often missed in the transcription process (Lickley & Bard, 1996). Instructions to attend carefully to disfluencies increase bias to report them but not accuracy in locating them (Martin & Strange, 1968). Forced alignment also assumes that our word-to-phoneme mapping generates a path that contains the correct pronunciation – but of course, natural speech is highly variable.

The obvious approach is to use language models to postulate additional disfluencies that may have been omitted in the transcript, and to use models of pronunciation variation to enrich the lattice of pronunciation alternatives for words in context; and then to use the usual HMM Viterbi decoding to choose the best path given the acoustic data. Most of the research on related topics is aimed at improving speech recognition rather than improving phonetic alignments, but the results suggest that these approaches, properly used, will not only give better alignments, but also provide valid information about the distribution of phonetic variants. For example, Fox (2006) demonstrated that a forced alignment technique worked well in studying the distribution of s-deletion in Spanish, using LDC corpora of conversational telephone speech and radio news broadcasts. She was also able to get reliable estimates of the distribution of the durations of non-deleted /s/ segments.
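
The sketch below gives a deliberately simplified picture of that enrichment step, assuming the {word: [pronunciations]} structure from the earlier dictionary sketch; the filler inventory shown is purely illustrative, and a real system would represent the alternatives compactly as a lattice or FST and let Viterbi decoding choose among them against the acoustics rather than enumerating paths.

    from itertools import product

    # Illustrative filler inventory: no filler, "uh", or "um" (ARPAbet-like symbols).
    FILLERS = [[], ["AH"], ["AH", "M"]]

    def pronunciation_paths(words, prons, max_paths=1000):
        """Enumerate alternative phone sequences for a transcript, allowing
        multiple dictionary pronunciations per word and an optional filled pause
        between words.  Exhaustive enumeration is only for clarity; a decoder
        would score these alternatives as a lattice."""
        slots = []
        for i, word in enumerate(words):
            slots.append(prons[word.upper()])    # alternative pronunciations of this word
            if i < len(words) - 1:
                slots.append(FILLERS)            # optional disfluency between words
        paths = []
        for choice in product(*slots):
            paths.append([ph for chunk in choice for ph in chunk])
            if len(paths) >= max_paths:
                break
        return paths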

A critical component of any such research is estimation of the distribution of errors, whether in disambiguating alternative pronunciations, correcting the transcription of disfluencies, or determining the boundaries of segments. Since human annotators also disagree about these matters, it’s crucial to compare the distribution of human/human differences as well as the distribution of human/machine differences. And in both cases, the mean squared (or absolute-value) error often matters less than the bias. If we want to estimate (for example) the average duration of a certain vowel segment, or the average ratio of durations between vowels and following voiced vs. voiceless consonants, the amount of noise in the measurement of individual instances matters less than the bias of the noise, since as the volume of data increases, our confidence intervals will steadily shrink – and the whole point of this enterprise is to increase the available volume of data by several orders of magnitude.
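
A small simulation makes the point concrete (the numbers are invented purely for illustration): as the number of measured tokens grows, a noisy but unbiased measurement converges on the true value, while a less noisy but biased one does not.

    import numpy as np

    rng = np.random.default_rng(0)
    true_mean = 80.0    # hypothetical "true" average vowel duration, in ms

    for n in (100, 10_000, 1_000_000):
        unbiased = true_mean + rng.normal(0.0, 15.0, n)      # noisy but unbiased measurements
        biased = true_mean + 5.0 + rng.normal(0.0, 3.0, n)   # less noisy, but biased by +5 ms
        print(f"n={n:>9}  unbiased mean: {unbiased.mean():6.2f} ms   biased mean: {biased.mean():6.2f} ms")

    # As n grows the unbiased estimate converges to 80 ms while the biased one
    # converges to 85 ms: with very large corpora, bias rather than per-token
    # noise limits the accuracy of aggregate estimates.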

Fox (2006) found this kind of noise reduction, just as we would hope, so that overall parameter estimates from forced alignment converged with the overall parameter estimates from human annotation. We will need to develop standard procedures for checking this in new applications. Since a sample of human annotations is a critical and expensive part of this process, a crucial step will be to define the minimal sample of such annotations required to achieve a given level of confidence in the result.
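
One plausible way to frame that question, sketched below under assumptions of our own (a pilot set of human-vs-machine boundary differences and a bootstrap over it), is to find the smallest annotation sample whose estimated bias has a confidence interval narrower than a chosen target.

    import numpy as np

    def min_annotation_sample(pilot_diffs, target_halfwidth_ms, n_boot=2000, seed=0):
        """Find the smallest sample of human annotations for which the estimated
        bias (mean human-vs-machine boundary difference) has a bootstrap 95%
        confidence interval no wider than +/- target_halfwidth_ms, based on a
        pilot set of differences.  Returns None if the pilot set is too small."""
        rng = np.random.default_rng(seed)
        pilot = np.asarray(pilot_diffs, dtype=float)
        for n in range(50, len(pilot) + 1, 50):
            means = [rng.choice(pilot, size=n, replace=True).mean()
                     for _ in range(n_boot)]
            lo, hi = np.percentile(means, [2.5, 97.5])
            if (hi - lo) / 2.0 <= target_halfwidth_ms:
                return n
        return None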

3. Preliminary Results

3.1. The Penn Phonetics Lab Forced Aligner

The U.S. Supreme Court began recording its oral arguments in the early 1950s; some 9,000 hours of recording are stored in the National Archives. The transcripts do not identify the speaking turns of individual Justices but refer to them all as “The Court”. As part of a project to make this material available online in aligned digital form, we have developed techniques for identifying speakers and aligning entire (hour-long) transcripts with the digitized audio (Yuan & Liberman, 2008). The Penn Phonetics Lab Forced Aligner was developed from this project.

Seventy-nine arguments of the SCOTUS corpus were transcribed, speaker-identified, and manually word-aligned by the OYEZ project. Silence and noise segments in these arguments were also annotated. A total of 25.5 hours of speaker turns were extracted from the arguments and used as our training data; one argument was set aside for testing purposes. Silences were separately extracted and randomly added to the beginning and end of each turn. Our acoustic models are GMM-based monophone HMMs. Each HMM state has 32 Gaussian mixture components over 39 PLP coefficients (12 cepstral coefficients plus energy, with delta and acceleration coefficients). The models were trained using the HTK toolkit and the CMU American English Pronouncing Dictionary.
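
For readers unfamiliar with this feature layout, the following sketch assembles a 39-dimensional feature matrix of the same shape; it uses librosa's MFCCs only as a convenient stand-in for the HTK-computed PLP coefficients the aligner actually uses, and the audio file name is a placeholder.

    import numpy as np
    import librosa

    # "turn.wav" is a placeholder file name; MFCCs stand in for PLP coefficients.
    y, sr = librosa.load("turn.wav", sr=16000)
    base = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)     # 12 cepstra plus an energy-like C0
    delta = librosa.feature.delta(base)                    # first derivatives ("Delta")
    accel = librosa.feature.delta(base, order=2)           # second derivatives ("Acceleration")
    features = np.vstack([base, delta, accel])             # shape: (39, number_of_frames)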

We tested the forced aligner on both TIMIT (the training set data) and the Buckeye corpus (the data for speaker s14). TIMIT is read speech and the audio files are short (a few seconds each). The Buckeye corpus is spontaneous interview speech and the audio files are nine minutes long on average. Table 1 lists the average absolute difference between the automatic and manually labeled phone boundaries; it also lists the percentage of agreement within 25 ms (the length of the analysis window used by the aligner) between forced alignment and manual segmentation.

Table 1. Performance of the PPL Forced Aligner on TIMIT and Buckeye.

Corpus     Average absolute difference     Agreement within 25 ms
TIMIT      12.5 ms                         87.6%
Buckeye    21.2 ms                         79.2%
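
The two measures in Table 1 are simple to compute once automatic and manual boundaries have been paired up; the sketch below shows the calculation (the pairing of corresponding boundaries across the two segmentations is assumed to have been done already).

    import numpy as np

    def boundary_agreement(auto_ms, manual_ms, tolerance_ms=25.0):
        """Mean absolute difference between paired automatic and manual phone
        boundaries (in ms), and the percentage agreeing within the tolerance."""
        diffs = np.abs(np.asarray(auto_ms, float) - np.asarray(manual_ms, float))
        return diffs.mean(), 100.0 * (diffs <= tolerance_ms).mean()

    # Example: boundary_agreement([105, 250, 498], [100, 240, 530])
    # returns roughly (15.7, 66.7): mean error 15.7 ms, 66.7% within 25 ms.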

We also tested the aligner on hour-long audio files, i.e., alignment of entire hour-long recordings without cutting them into smaller pieces, using the British National Corpus (BNC) and the SCOTUS corpus. The spoken part of the BNC consists of informal conversations recorded by volunteers. The conversations contain a large amount of background noise, speech overlaps, etc. To help our forced aligner better handle the BNC data, we combined the CMU pronouncing dictionary with the Oxford Advanced Learner's Dictionary, a British English pronouncing dictionary. We also retrained the silence and noise model using data from the BNC corpus. We manually checked the word alignments on a 50-minute recording, and 78.6% of the words in the recording were aligned accurately. The argument in the SCOTUS corpus that was set aside for testing in our study is 58 minutes long and manually word-aligned. The performance of the aligner on this argument is shown in Figure 1, a boxplot of alignment errors (absolute differences from manual segmentation) for each minute from the beginning to the end of the recording. The alignment is consistently good throughout the entire recording.
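
Combining the two dictionaries amounts to taking the union of pronunciations per word; a minimal sketch, assuming the {word: [pronunciations]} structure from the earlier dictionary-loading sketch, is given below.

    def merge_pronouncing_dicts(primary, secondary):
        """Combine two {word: [pronunciations]} dictionaries by taking the union
        of pronunciations per word, keeping the primary dictionary's variants
        first (structure as built by the load_pronouncing_dict sketch above)."""
        merged = {word: list(prons) for word, prons in primary.items()}
        for word, prons in secondary.items():
            for pron in prons:
                if pron not in merged.setdefault(word, []):
                    merged[word].append(pron)
        return merged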