Arthur Kunkle

ECE 5526

Comparison of the SPHINX and HTK Frameworks Processing the AN4 Corpus


Introduction

Two major frameworks exist that are widely accepted and used in the speech processing domain: CMU Sphinx, developed by Carnegie Mellon University, and HTK (the Hidden Markov Model Toolkit), developed by Cambridge University. Both frameworks can be used to develop, train, and test a speech model from existing corpus utterance data using hidden Markov modeling techniques. This project will provide a detailed comparison of the two frameworks in two major phases.

The first phase will be a comparison of the functional and performance characteristics of Sphinx and HTK. The AN4 Corpus training data will be used to train a recognition model. The recognizer for each framework will be run against the test data. This procedure was already completed using Sphinx for the final homework assignment. The process will be adapted as closely as possible for the HTK toolset and the detailed steps developed for each phase will be presented. The goal will be to generate a model that most closely resembles the one generated by Sphinx. Different performance metrics and characteristics will then be defined and measured using the results:

·  Decoder time to completion

·  Decoder accuracy at the sentence level

·  Decoder accuracy at the word level

·  Types and quantities of errors encountered during decoding

·  Notable error trends

·  Framework code footprint size

·  Recognizer memory requirements at runtime

The second major phase of the study will be comparing other features of Sphinx and HTK. The following features will be compared:

·  Coded data feature format support

·  Acoustic Modeling algorithm support

·  Language Modeling algorithm support

·  Overall ease of training and decoding corpora

·  Notable features of the Software Baseline of each toolkit

·  Operating System support

·  Available documentation and community support

·  Licensing and usage rights

·  Future development plans

The conclusion will also provide a reference to a feature comparison matrix, outlining the superior toolkit for each focus area.

Framework Overviews and History

The Sphinx system consists of a training portion (SphinxTrain) as well as a set of evolving speech decoders (Sphinx-1 through Sphinx-4, plus PocketSphinx, a decoder designed specifically for embedded environments). The Sphinx project has been supported by organizations such as DARPA, IBM, and Sun Microsystems. Notable applications that use Sphinx include RoomLine, a conference room reservation system at CMU, and Let’s Go, a spoken dialog system in use at Pittsburgh’s transit system.

The HTK framework was originally developed in 1989 by the Speech Vision and Robotics Group at Cambridge University. While dubbed a general-purpose HMM toolkit, its main application area has been speech recognition. The rights to HTK were acquired by Entropic Laboratories in 1993 and then by Microsoft through its acquisition of Entropic in 1999; the HTK source code was subsequently licensed back to Cambridge University so that development could continue.

Acoustic Model Training/Testing Procedure for HTK

The AN4 corpus is a training/testing database collected from census-style activities conducted by CMU in 1991. It contains alphanumeric utterances as well as a limited set of command words, for a total of 948 training utterances and 130 test utterances. All recordings are 16-bit linear PCM sampled at 16 kHz.

In order to make a valid comparison of the performance of the Sphinx and HTK recognizers, HTK will be used to train and test the same type of model that was created for Sphinx. The acoustic model will have the following characteristics:

1.  8 Gaussians per HMM state

2.  Context-dependent tri-phone state models

3.  Tied states

HTK provides a living reference document for its programs, called the “HTKBook” (TODO ref). Section II of this reference is a step-by-step walkthrough of creating a toy training database using a very simple word grammar and manually recorded utterances. This procedure will be modified to process the pre-recorded and transcribed data that AN4 provides. The following flowchart outlines the major steps in this process (it can be compared to the flowchart provided in the Sphinx tutorial).

The procedure that can be used as a tutorial for training and testing the AN4 database with HTK 3.0 is provided here:

Tutorial – HTKTrainingDecoding_tutorial.doc

The following link refers to the environment directory for the entire tutorial. It contains the original NIST SPHERE and processed MFCC data, the HMMs at each stage of the model training phase, Perl scripts authored to translate various AN4 nuances into the HTK-preferred format (e.g., transcriptions, phone lists, etc.), and any HTK configuration files used.

HTK Tutorial Directory – htktut

Training/Decoding Result Comparison

The following results were achieved decoding the test data from the AN4 corpus. The tests were run on a PC running Windows XP and Cygwin, with a 2.6 GHz Pentium 4 processor and 2 GB of system RAM.

Metric                      Sphinx3   HTK
Peak Memory Usage (MB)      8.2       5.9
Time to Completion (sec)    63        93
Sentence Error Rate (%)     59.2      69.0
Word Error Rate (%)         21.3      9.0
Word Substitution Errors    92        92
Word Insertion Errors       71        154
Word Deletion Errors        2         0

It is very interesting to note that each framework made the same number of word substitution errors. It also appears that the main source of errors for HTK was the insertion of erroneous words. This may have been due to the more exploratory process by which the HTK training model was developed, as opposed to the predefined and tested tutorial provided with Sphinx. However, the HTK decoder did not make any deletions, which gave it a slight advantage in the overall word error rate. Also, while HVite (the HTK decoder program) did use less memory during decoding, the difference in time to decode the test set is significant at 30 seconds.

Front-End Coded Data Feature Format Support

1. Sphinx

The Sphinx toolkit provides the tool wave2feat to convert Microsoft Wave, NIST SPHERE, or raw waveform files into MFCCs, with a limited set of configuration parameters. However, the Sphinx trainer and decoder are compatible with many other data formats, which must be generated outside of the toolkit. Sphinx-4 now contains front-end capabilities to process both MFCC and PLP cepstral-encoded data.

2. HTK

The HCopy tool provides a wealth of input/output combinations of model data and front-end features. HTK can natively handle waveform files with many popular header formats (NIST, TIMIT, headerless, Microsoft, etc.). Any type of conversion is supported, including waveform-to-feature, feature-to-feature, and feature-to-waveform. The main feature data types included are Linear Prediction Coefficients (LPC), Mel-Frequency Cepstral Coefficients (MFCC), and Perceptual Linear Prediction (PLP) coefficients. First, second, and third differential coefficients can also be appended to the output feature vectors (the tutorial used 39-element vectors: 13 static, 13 delta, and 13 acceleration coefficients). HTK uses its own feature vector file format to optimize feature data for native processing, allowing any of the feature types to be stored in one format. HTK also offers the option of saving compressed feature vector data using a vector quantization lookup table. The following tables from HTKBook illustrate the compatible conversions as well as the supported parameters. All data feature types are then directly compatible with the other applications in the toolkit.
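As an illustration of how the differential coefficients extend a static feature stream, the following Python sketch appends delta and acceleration coefficients to 13-element MFCC frames using a regression formula of the kind described in HTKBook. The array layout, window size, and edge padding are choices made for this example rather than HCopy’s exact implementation.

    import numpy as np

    def deltas(static, window=2):
        """Regression-style differential coefficients over a (T, 13) array of static MFCCs."""
        T = static.shape[0]
        padded = np.pad(static, ((window, window), (0, 0)), mode="edge")
        denom = 2 * sum(theta * theta for theta in range(1, window + 1))
        out = np.zeros_like(static, dtype=float)
        for theta in range(1, window + 1):
            out += theta * (padded[window + theta:window + theta + T]
                            - padded[window - theta:window - theta + T])
        return out / denom

    # 13 static + 13 delta + 13 acceleration = the 39-element vectors used in the tutorial.
    mfcc = np.random.randn(100, 13)          # stand-in for one utterance of static MFCCs
    feats = np.hstack([mfcc, deltas(mfcc), deltas(deltas(mfcc))])   # shape (100, 39)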

Acoustic Modeling Support

Both toolkits use the same types of procedures for acoustic HMM training. HTKBook reveals more details of the inner workings of some of these algorithms:

First, the mean and variance of every Gaussian component in every HMM are initialized from the global mean and variance of the training data (the flat-start scheme; a small sketch of this initialization is given after the numbered list below). The HMM parameters are then refined using Baum-Welch re-estimation. A modified form called embedded training performs training of all models in parallel and proceeds as follows:

1.  Allocate and zero accumulators for all parameters of all HMMs.

2.  Get the next training utterance.

3.  Construct a composite HMM by joining in sequence the HMMs corresponding to the symbol transcription of the training utterance.

4.  Calculate the forward and backward probabilities for the composite HMM. The inclusion of intermediate non-emitting states in the composite model requires some minor changes to the computation of the forward and backward probabilities; the details are given in Chapter 8 of HTKBook.

5.  Use the forward and backward probabilities to compute the probabilities of state occupation at each time frame and update the accumulators in the usual way.

6.  Repeat from step 2 until all training utterances have been processed.

7.  Use the accumulators to calculate new parameter estimates for all of the HMMs.
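To make the flat-start step concrete, the following Python sketch computes a global mean and diagonal variance over all training frames and copies them into every state of every model, which is the effect of the initialization described above (in the HTK tutorial this is performed by the HCompV tool). The in-memory data layout, and the assumption that each utterance’s features are already available as a NumPy array, are choices made for this illustration only.

    import numpy as np

    def flat_start(utterance_features, n_models, n_states):
        """utterance_features: list of (T_i, 39) arrays, one per training utterance."""
        frames = np.vstack(utterance_features)
        global_mean = frames.mean(axis=0)
        global_var = frames.var(axis=0)      # diagonal covariance only
        # Every emitting state of every model starts from the same single Gaussian.
        return [[{"mean": global_mean.copy(), "var": global_var.copy()}
                 for _ in range(n_states)]
                for _ in range(n_models)]

    # e.g. 50 monophone models with 3 emitting states each:
    # models = flat_start(training_features, n_models=50, n_states=3)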

For decoding, HTK uses a form of the Viterbi algorithm called the Token Passing Model. In this method, each state passes its token to all connected next states at every time frame, and each token’s log probability is incremented by the transition and output probabilities along the way. Each state then examines all incoming tokens and retains only the highest-scoring one. This formulation is used because it extends very easily across word boundaries: tokens carry extra information, including a pointer called a Word Link Record, which can be evaluated at the end of the utterance to extract the word boundaries. This also allows more than just the single best path to be saved, so that multiple hypotheses can be recorded.
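The following Python sketch illustrates the core of token passing over a single HMM: at every frame each state collects the tokens of its predecessors, adds the transition and output log probabilities, and keeps only the best one. The class name and the simple history tuple (standing in for HTK’s Word Link Records) are illustrative choices and not HTK’s internal API.

    class Token:
        def __init__(self, log_prob=0.0, history=()):
            self.log_prob = log_prob
            self.history = history    # stands in for a chain of Word Link Records

    def token_passing(frame_log_likelihoods, log_trans):
        """frame_log_likelihoods[t][j] = log b_j(o_t); log_trans[i][j] = log a_ij."""
        n = len(log_trans)
        # Only the entry state holds a live token at the start.
        tokens = [Token(0.0)] + [Token(float("-inf")) for _ in range(n - 1)]
        for t, frame in enumerate(frame_log_likelihoods):
            new_tokens = []
            for j in range(n):
                # Propagate every token into state j and keep only the best (Viterbi max).
                best = max((Token(tokens[i].log_prob + log_trans[i][j] + frame[j],
                                  tokens[i].history + ((t, j),))
                            for i in range(n)),
                           key=lambda tok: tok.log_prob)
                new_tokens.append(best)
            tokens = new_tokens
        return max(tokens, key=lambda tok: tok.log_prob)    # best final token

In HTK the same mechanism operates at the word level: a token leaving a word-end state appends a Word Link Record, and tracing these records back at the end of the utterance recovers the word boundaries and any alternative hypotheses that were kept.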

Language Modeling Support

1. Sphinx

The main language model (LM) used by the Sphinx decoder is a conventional bi-gram or tri-gram back-off language model. More generally, Sphinx-2 through Sphinx-4 support N-gram statistical grammars as well as finite state grammars. For the generation of language models, however, Sphinx relies on other software (the CMU Statistical Language Model toolkit) for training and testing.
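As a rough illustration of how a back-off N-gram model assigns probabilities, the Python sketch below falls back from trigram to bigram to unigram estimates, applying back-off weights along the way. The dictionary-based tables and the Katz-style arithmetic are simplifications for this example and do not reflect the file formats or exact smoothing used by the CMU toolkit.

    def backoff_trigram_prob(w1, w2, w3, tri, bi, uni, bow_bi, bow_uni, floor=1e-7):
        """tri/bi/uni map word tuples (or single words) to probabilities;
        bow_bi/bow_uni map contexts to back-off weights."""
        if (w1, w2, w3) in tri:
            return tri[(w1, w2, w3)]                    # trigram seen in training
        if (w2, w3) in bi:
            return bow_bi.get((w1, w2), 1.0) * bi[(w2, w3)]
        return (bow_bi.get((w1, w2), 1.0) *
                bow_uni.get(w2, 1.0) *
                uni.get(w3, floor))                     # unseen word gets a small floor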

2. HTK

HTK provides a separate set of tools (in the HLMTools directory) for training and testing language models. While not performed as part of this study, an entire HLM tutorial with training/test data is provided that uses most of these functions and would be a good exercise (Section 15 of HTKBook). The HLM tools provide n-gram model generation as well as class-based n-gram models. Tools are also available to easily measure LM perplexity, using LPlex. Tools also exist to generate count-based models, which can grow dynamically as the task vocabulary changes in content and size. Finally, the LMerge tool is useful for combining multiple existing LMs into one.
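For reference, the short Python sketch below shows what a per-word perplexity measurement like the one LPlex reports boils down to; this is only the defining formula, not LPlex itself.

    import math

    def perplexity(log_probs):
        """log_probs: natural-log LM probability of each word in the test text."""
        return math.exp(-sum(log_probs) / len(log_probs))

    # A model that assigns an average probability of 1/100 per word has perplexity ~100:
    print(perplexity([math.log(1.0 / 100)] * 250))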

Both Sphinx and HTK also support simple finite-state grammars that are specified using a BNF-style syntax. These are useful when the task involves a relatively small rule set or command-driven sentences (e.g., “Call <phone-number>”). Such a grammar was created manually in the first step of the HTK tutorial; a small conceptual sketch follows.
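The Python snippet below sketches what such a command-style grammar amounts to once compiled into a finite-state network, using a toy “CALL <digit string>” task. The state/arc representation is purely illustrative and is not the BNF-style source syntax used by either toolkit.

    # Toy finite-state grammar: "CALL" followed by one or more digits.
    DIGITS = ["ZERO", "ONE", "TWO", "THREE", "FOUR",
              "FIVE", "SIX", "SEVEN", "EIGHT", "NINE"]

    # States: 0 = start, 1 = after "CALL", 2 = accepting (one or more digits seen).
    ARCS = {
        0: {"CALL": 1},
        1: {d: 2 for d in DIGITS},
        2: {d: 2 for d in DIGITS},
    }
    ACCEPTING = {2}

    def accepts(words):
        state = 0
        for w in words:
            if w not in ARCS.get(state, {}):
                return False
            state = ARCS[state][w]
        return state in ACCEPTING

    print(accepts("CALL FIVE FIVE FIVE ONE TWO ONE TWO".split()))   # True
    print(accepts("DIAL ONE".split()))                              # False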

Operating System Support and Installation Procedure

Both systems are developed using popular and well-supported GNU utilities such as Autoconf, GNU Make, and the GCC compiler, which allows them to support most Unix variants. Both toolkits also have Windows-specific build and installation instructions/defines that allow them to be imported as a Visual Studio project. The README and INSTALL files in the source distributions are clear and easy to follow.

Framework Image Size Footprint

1. Sphinx

9.7M SphinxTrain/bin.i686-pc-cygwin

20K SphinxTrain/config

129K SphinxTrain/doc

13K SphinxTrain/etc

4.0K SphinxTrain/include

2.5M SphinxTrain/lib.i686-pc-cygwin

5.0K SphinxTrain/python

78K SphinxTrain/scripts_pl

4.0K SphinxTrain/src

1.0K SphinxTrain/templates

5.0K SphinxTrain/test

0 SphinxTrain/win32

332K SphinxTrain

2.0M sphinx3-0.7/doc

844K sphinx3-0.7/include

37K sphinx3-0.7/model

20K sphinx3-0.7/python

71K sphinx3-0.7/scripts

37K sphinx3-0.7/src

4.0K sphinx3-0.7/win32

7.9M sphinx3-0.7

37K sphinxbase-0.3/doc

323K sphinxbase-0.3/include

33K sphinxbase-0.3/src

37K sphinxbase-0.3/test

0 sphinxbase-0.3/win32

1.7M sphinxbase-0.3/

Total: ~26 MB

2. HTK

20K htk/env

1.1M htk/HLMLib

15M htk/HLMTools

93K htk/HTK

1.1M htk/HTKBook

7.0M htk/HTKLib

0 htk/HTKLVRec

24M htk/HTKTools

348K htk

Total: ~49 MB

Software Baseline Comparison

1. Sphinx

Sphinx is organized across three components. This leads to a large amount of code, especially if all of the possible decoder products are considered.

However, Sphinx uses a general Unix-style organization of files (header files in /include, source files in /src, library files output to /lib). This leads to a three- or four-deep nesting of dependencies in some cases, which adds to the overall complexity. On average, most source files contain 1200 lines of code or less, although some C files were found to have in excess of 13,000 lines!

Sphinx does have built-in unit/regression test modules (invoked with make test). This is an excellent resource when modifying the original Sphinx code, as it verifies that pre-existing functionality is still intact.

2. HTK

HTK’s baseline is much simpler than Sphinx’s. All provided code is ANSI C, and the recognizer and training components are in the same package. The base code is all found in the HTKLib directory; this single folder contains all source and header files needed to compile the standalone HTKLib library. The HTKTools directory contains a one-to-one mapping of source files to executable targets. Most of these files rely on functionality provided by HTKLib, keeping the components very decoupled. The same style is used for the language modeling portion of the toolkit, with two directories: HLMLib and HLMTools. Generally, the source and header files are very well maintained, with appropriate formatting conventions and comments. All prototypes and structure definitions are located in appropriate header files. Most of HTK’s source files are 1400 lines or less, with the exception of a few of the more complex tools such as HHEd (which is over 6000 lines).

Ease of Use, Available Documentation, and Community Support

1. Sphinx

Sphinx’s main reference materials are available from the main web page. These include:

1.  A fully implemented tutorial processing the AN4 speech corpus. This is an excellent resource that provides a quick way to verify that all major components of Sphinx are running properly.

2.  Many different corpora that can be used to build systems. Also included are pre-built models that can be readily used with the recognizer.

3.  Sphinx “Manual”. This is a loose collection of background theory, Frequently Asked Questions, and decoding topics. Unfortunately, this section does not seem to be actively maintained and is generally not as extensive as HTK’s documentation.

The greatest asset of Sphinx is the presence of an easy-to-set-up-and-run tutorial. Once it has been completed, developers can take a white-box approach to the individual steps to see what commands are needed.

Sphinx also has a Wiki-style portal interface and Doxygen/Javadoc documentation for developers.