Chapter 10 SOFTWARE DESCRIPTION
10.1 Data processing
This chapter describes the functionality of all the software modules implemented and used in the framework of this thesis. Each module was employed for a specific function during the experiments and then integrated into the final scheme. To give a structured view of the system, it is divided into three main blocks of functionality:
a) Data preprocessing. Its function is to adapt the recorded speech data to the format requirements of compute_basismerkmale.c, which belongs to the next block. This group includes only the program call.c, which is described in section 10.1.1.
b) Feature calculation. In order to provide the NN classifier with input patterns, a vector of features must be calculated beforehand. First, three basic features (fundamental frequency, energy contour and voiced/unvoiced decision) are estimated by compute_basismerkmale.c. From these basic features, more complex ones, i.e. P1, are computed in ppal.c. Finally, voice quality features and jitter are calculated by means of Praat. Though the P2 prosodic features were discarded during the experimental research and are not integrated in the final system, the software employed for their calculation, complex_calcs.c, belongs to this group and is also described in section 10.1.2.
c) Classification. Once the features are computed, this block uses them as the classifier inputs and makes the final decision. Two neural networks and a small program that combines their results make up this group and are described in section 10.1.3.
[Figure: block diagram of the system. The input (.nist file) passes through three blocks: data preprocessing (format conversion, call.c); feature calculation (basic prosodic feature calculations, compute_basismerkmale; complex prosodic feature calculations, ppal.c for P1 and complex_calcs for P2; quality feature calculations, Praat); and classification (activation level classifier NN, evaluation level classifier NN, and their combination).]
10.1.1 Data preprocessing
The first step in processing the audio data recorded for the emotional database, in order to obtain features that allow an efficient classification of the speech signal into emotional categories, is the extraction of basic features such as fundamental frequency, energy and voiced/unvoiced decision. For this task, the basic prosodic module of the VERBMOBIL[1] project is used. However, the audio format that this software expects for its input files does not match the format of the recorded speech.
The format of the signal files expected by the software is PhonDat. Signal files with a PhonDat header contain a binary header of constant length (512 bytes). The signal samples (2 bytes per sample) start after this header and are always in LoHi byte order (Intel format), which means that the least significant byte of each sample comes first. The header contains a defined structure with information such as sampling frequency, resolution in bits, etc. On the other hand, the format used for the recordings, which is also the one used for most applications inside the speech technology group, is NIST. The NIST speech header format was defined by the National Institute of Standards and Technology, USA, and is used in many American speech corpora. The NIST SPHERE header is an object-oriented, 1024-byte blocked ASCII structure which is prepended to the waveform data. It is composed of a fixed-format portion followed by an object-oriented variable portion. The sample byte order is HiLo, which means that the most significant byte is followed by the least significant byte. As a result, two main problems must be faced in the data conversion: header length and speech data format (byte order).
The prosody module accepts an input argument indicating the length of the header. The information contained in the header of the input files is not used by the program, which only takes the signal data to make its computations. Therefore, this information can be omitted. The adopted solution is to remove the header from the input NIST files and set this parameter to 0. This way, the software interprets the bytes as speech data starting from the beginning of the file.
The problem of the byte ordering can only be solved by swapping the bytes: the two bytes of each sample are exchanged before the basic prosodic module reads the speech file.
The data preprocessing module is implemented as a single program, call.c, written in C, whose input is the emotional recording in NIST format and whose output is the same file without the header and with the bytes of each sample swapped.
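The conversion described above can be sketched as follows. This is only an illustrative sketch, not the actual source of call.c: the function names and the constant NIST_HEADER_BYTES are hypothetical, and the header is assumed to occupy exactly one 1024-byte block with 2 bytes per sample.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Assumption: the NIST header occupies a single 1024-byte block. */
#define NIST_HEADER_BYTES 1024L

/* Swap the two bytes of every 16-bit sample in place (HiLo <-> LoHi).
   A trailing odd byte, if any, is left untouched. */
void swap_sample_bytes(unsigned char *buf, size_t n_bytes)
{
    for (size_t i = 0; i + 1 < n_bytes; i += 2) {
        unsigned char tmp = buf[i];
        buf[i] = buf[i + 1];
        buf[i + 1] = tmp;
    }
}

/* Copy `in` to `out`, dropping the first header_bytes bytes and swapping
   the bytes of each sample. Intended for regular files, where fread only
   returns a short count at the end of the file.
   Returns 0 on success, -1 on an I/O error. */
int convert_stream(FILE *in, FILE *out, long header_bytes)
{
    unsigned char buf[4096];  /* even size keeps sample pairs together */
    size_t n;

    assert(in != NULL && out != NULL && header_bytes >= 0);
    if (fseek(in, header_bytes, SEEK_SET) != 0)
        return -1;
    while ((n = fread(buf, 1, sizeof buf, in)) > 0) {
        swap_sample_bytes(buf, n);
        if (fwrite(buf, 1, n, out) != n)
            return -1;
    }
    return ferror(in) ? -1 : 0;
}
```

A caller would open the NIST recording and the output file in binary mode and call convert_stream(in, out, NIST_HEADER_BYTES); the resulting file can then be passed to the prosody module with the header length parameter set to 0.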
10.1.2 Feature calculation
Four main programs are employed to calculate features of different nature:
a) compute_basismerkmale.c is the basic prosody module of the VERBMOBIL project. It extracts the fundamental frequency contour, the energy contour and the voiced/unvoiced decision of each frame from an input speech file. The algorithms used to calculate these basic features are detailed in section 7.1.
b) ppal.c is the principal program used for the extraction of complex prosodic features. It calculates the whole set of P1 features (see section 7.2.1) with the exception of jitter, which is calculated by Praat. The program inputs are the vectors resulting from the compute_basismerkmale stage, i.e. vectors containing the frame-based basic feature calculations. The output of this module is a vector of 36 features, whose calculations are described in 7.2.1, to which the jitter is added in order to complete the P1 feature set. This output vector is the input of one of the neural network classifiers, the one that discriminates among the three levels of arousal (see chapter 2).
c) complex_calcs.c is the program that calculates all the P2 features employed during the experimental trials. The program input is the output of the compute_basismerkmale module. These data are mathematically combined, following the algorithms detailed in 7.2.2, and yield a vector of twenty prosodic features intended as the input of a classifier based on the activation dimension. As stated previously, the most significant results obtained in the experiments do not use this module. However, it was entirely implemented during the development of this work, based on a promising study [Tis93].
d) Praat is a system for doing phonetics by computer: a research, publication, and productivity tool for phoneticians. It has been developed principally by Paul Boersma and David Weenink at the University of Amsterdam. Its functions are to analyse, synthesize, and manipulate speech. All the quality features used in the present work, and described in section 7.3, are calculated by means of Praat. This tool is well suited to the investigation phase because the complexity of voice quality measurements, which are mainly based on spectral properties of the speech, would otherwise make their calculation very time-consuming. However, Praat processing is too slow for a real-time system, which should be taken into account in further development. The input of the Praat script is the original signal without preprocessing, since this tool accepts multiple audio formats, including NIST. The output is a vector with a variable number of quality features, depending on the experiment. In the final system, the output vector is combined with the decision made by the activation level classifier in order to make the final decision.
10.1.3 Classification
The software utilized to train and test the neural network classifiers is the Stuttgart Neural Network Simulator (SNNS 4.2), developed at University of Stuttgart and maintained at University of Tübingen.
SNNS is a software simulator for neural networks on Unix workstations developed at the Institute for Parallel and Distributed High Performance Systems (IPVR) at the University of Stuttgart. The goal of the SNNS project is to create an efficient and flexible simulation environment for research on and application of neural nets.
The SNNS simulator consists of two main components:
1)Simulator kernel written in C.
2)Graphical user interface under X11R4 or X11R5.
The simulator kernel operates on the internal network data structures of the neural nets and performs all operations of learning and recall. It can also be used without the other parts as a C program embedded in custom applications. It supports arbitrary network topologies and it supports the concept of sites. SNNS can be extended by the user with user defined activation functions, output functions, site functions and learning procedures, which are written as simple C programs and linked to the simulator kernel.
The graphical user interface XGUI (X Graphical User Interface), built on top of the kernel, gives a 2D and a 3D graphical representation of the neural networks and controls the kernel during the simulation run. In addition, the 2D user interface has an integrated network editor which can be used to directly create, manipulate and visualize neural nets in various ways.
A network created with SNNS can be compiled into executable C source. This source is generated from a network file using the tool SNNS2C, included in the simulator. After completion of network training with SNNS, the tool SNNS2C is used to integrate the trained network as a C function into a separate application.
The final system makes use of three different networks, each one trained according to the best performance achieved during the experiments (chapters 8 and 9).
10.2 Analysis of the results
Most experiments of chapters 8 and 9 are evaluated by means of their confusion matrix and the means of their neural network outputs. Both calculations were implemented in the program confusion_th.c. This program reads a result file produced by the SNNS neural network simulator and has two possible outputs: the confusion matrix or the mean outcome values.
- The confusion matrix of the testing set. Given a class A, all the test patterns belonging to class A are counted. Each of these patterns, after being tested by the network, is either correctly recognized by the classifier or confused with another class. The number of times that patterns of class A are classified into each one of the categories, including class A itself (correct recognition), divided by the total number of occurrences of class A, gives the rate at which the class is correctly recognized or confused with each of the remaining classes. This procedure is applied to each of the defined classes, and the resulting confusion matrix looks as follows:
INPUT \ OUTPUT | Class 1 | Class 2 | … | Class N
Class 1 | N11/N1 | … | … | N1N/N1
Class 2 | N21/N2 | N22/N2 | … | N2N/N2
… | … | … | … | …
Class N | NN1/NN | … | … | NNN/NN
Table 10.1. General appearance of a confusion matrix generated by confusion_th.c.
Here Nij is the number of patterns of class i classified as category j. When i and j coincide, the pattern is correctly recognized. Ni is the total number of occurrences of class i at the input of the test. The ratios on the main diagonal, Nii/Ni, are the correct recognition rates. The matrices are presented in percentages, so these ratios are multiplied by 100.
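The computation of the table entries can be sketched as follows. This is an illustrative sketch and not the actual code of confusion_th.c: the function name, the integer class encoding (0 to n_classes - 1) and the row-major matrix layout are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Fill `percent` (n_classes x n_classes, row-major) so that
   percent[i * n_classes + j] = 100 * Nij / Ni, where Nij is the number of
   patterns of class i classified as class j and Ni is the number of
   patterns of class i. Rows of classes without any pattern stay at 0. */
void confusion_percent(const int *target, const int *predicted,
                       size_t n_patterns, int n_classes, double *percent)
{
    assert(n_classes > 0);

    for (int k = 0; k < n_classes * n_classes; k++)
        percent[k] = 0.0;

    /* First pass: count Nij. */
    for (size_t p = 0; p < n_patterns; p++) {
        assert(target[p] >= 0 && target[p] < n_classes);
        assert(predicted[p] >= 0 && predicted[p] < n_classes);
        percent[target[p] * n_classes + predicted[p]] += 1.0;
    }

    /* Second pass: divide each row by its total Ni and scale to percent. */
    for (int i = 0; i < n_classes; i++) {
        double row_total = 0.0;
        for (int j = 0; j < n_classes; j++)
            row_total += percent[i * n_classes + j];
        if (row_total > 0.0)
            for (int j = 0; j < n_classes; j++)
                percent[i * n_classes + j] *= 100.0 / row_total;
    }
}
```

With this layout each row sums to 100 for every class that occurs in the test set, and the diagonal entries are the correct recognition rates.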
confusion_th.c also allows setting two different thresholds in order to select the final winner for a pattern. After the winner candidate among the outputs is chosen through the winner-takes-all (WTA) rule, a decision based on two different thresholds determines whether this candidate can actually be considered the winner. When this functionality is enabled, the confusion matrix is calculated after the threshold decision. These thresholds are given as arguments to the program and are defined as follows:
- Threshold 1: Minimum value of the output to be considered the winner. The program checks whether the winner candidate given by the NN exceeds this value; if not, the pattern is classified into the central class.
- Threshold 2: Maximum value of the opposite emotion or emotional groups. When an utterance is classified into one emotion, e.g. angry, the program checks that the output values for the emotions situated on the opposite side of the axis, e.g. bored and sad for the arousal dimension, do not exceed this value. Otherwise, the pattern is classified into the central category. This is based on the experimental observation (see Chapter 8) that the mean output values of opposite emotions are well differentiated, for both the five-output and the three-output case.
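The two-threshold decision described above can be sketched as follows. This is an illustrative sketch and not the actual code of confusion_th.c: the function name, the parameter names and the encoding of "opposite" classes as a flag matrix are assumptions made for the example.

```c
#include <assert.h>

/* Return the final class for one pattern.
   out       : network outputs, one per class
   n         : number of outputs
   opposite  : n x n row-major flags (hypothetical encoding);
               opposite[w * n + j] != 0 means class j lies on the
               opposite side of the axis from class w
   central   : index of the central class
   thr_win   : threshold 1, minimum output value for the winner
   thr_opp   : threshold 2, maximum output value for opposite classes */
int decide_class(const float *out, int n, const unsigned char *opposite,
                 int central, float thr_win, float thr_opp)
{
    int winner = 0;

    assert(n > 0 && central >= 0 && central < n);

    /* Winner-takes-all: the largest output is the winner candidate. */
    for (int j = 1; j < n; j++)
        if (out[j] > out[winner])
            winner = j;

    /* Threshold 1: the candidate must exceed the minimum value. */
    if (!(out[winner] > thr_win))
        return central;

    /* Threshold 2: no opposite class may exceed the maximum value. */
    for (int j = 0; j < n; j++)
        if (opposite[winner * n + j] && out[j] > thr_opp)
            return central;

    return winner;
}
```

For a three-output arousal classifier with classes low (0), central (1) and high (2), the flag matrix would mark 0 and 2 as opposite to each other, so an utterance with outputs {0.4, 0.1, 0.9} and thresholds 0.5 and 0.3 would fall back to the central class because the opposite output 0.4 exceeds threshold 2.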
- The mean outcome values of the testing set. For a teaching class A, the average value of each one of the outputs is calculated. The average is computed for each of the defined classes and for each one of the output nodes (in our experiments, each output node corresponds to a classification category), and a representative graph is plotted.
Figure 10.2. Graphical representation of the mean output values of the training set generated by confusion_th.c.
The horizontal axis of figure 10.2 represents the teaching class and the vertical bars are the mean values of the neural network output.
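The mean output values plotted in such a graph can be computed along the following lines. This is an illustrative sketch and not the actual code of confusion_th.c: the function name, the integer encoding of the teaching class and the row-major array layouts are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* For each teaching class c and each output node j, store in
   mean[c * n_outputs + j] the average of output j over all patterns whose
   teaching class is c. Classes without any pattern keep a mean of 0.
   outputs holds n_patterns rows of n_outputs values (row-major). */
void mean_outputs(const int *teaching, const float *outputs,
                  size_t n_patterns, int n_classes, int n_outputs,
                  double *mean)
{
    assert(n_classes > 0 && n_outputs > 0);

    for (int c = 0; c < n_classes; c++) {
        size_t count = 0;

        for (int j = 0; j < n_outputs; j++)
            mean[c * n_outputs + j] = 0.0;

        /* Accumulate the outputs of every pattern taught as class c. */
        for (size_t p = 0; p < n_patterns; p++) {
            if (teaching[p] != c)
                continue;
            for (int j = 0; j < n_outputs; j++)
                mean[c * n_outputs + j] +=
                    outputs[p * (size_t)n_outputs + (size_t)j];
            count++;
        }

        if (count > 0)
            for (int j = 0; j < n_outputs; j++)
                mean[c * n_outputs + j] /= (double)count;
    }
}
```

Each row of the resulting array corresponds to one teaching class on the horizontal axis, and its entries are the heights of the bars drawn for that class.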
[1] Verbmobil is a long-term project of the German Federal Ministry of Education, Science, Research and Technology (BMBF, Projektträger DLR). Its aim is to give Germany a leading international position in language technology and its economic application in the next millennium, through the cooperation and concentration of as many specialists from industry and science as possible. The long-term aim is the development of a mobile translation system for the translation of spontaneous speech in face-to-face situations.