INDEX

I. Index...... xv

II. Index of figures...... xxi

III. Index of tables...... xv

IV. Glossary...... xxix

1. Introduction...... 1

1.1 What are emotions?...... 1

1.2 Emotional computing...... 5

1.3 Emotional speech...... 5

1.3.1 Motivation...... 6

1.3.2 Obstacles and difficulties...... 9

1.3.3 Recognition systems review...... 12

1.3.4 Synthesis of emotions...... 15

1.4 Objectives of the present research work...... 17

1.5 Overview...... 19

2. Emotional dimensions...... 21

2.1 Emotional process...... 21

2.2 Dimensionality of emotions...... 22

2.3 Theory of the activation – evaluation space...... 23

2.4 Assumption of the present work...... 25

3. Expression of emotions: Cues for emotion recognition...... 27

3.1 Introduction...... 27

3.1.1 Expressing emotions...... 27

3.1.2 Different kinds of information within the oral communication...... 28

3.1.3 Vocal correlates of emotion and attitude...... 30

3.2 Prosody...... 32

3.2.1 Classification of prosody...... 33

3.2.2 Energy as a prosodic cue for emotion detection...... 34

3.2.3 Pitch as a prosodic cue for emotional detection...... 35

3.2.4 Durational cues for emotional detection...... 38

3.3 Voice Quality...... 38

3.3.1 Voice quality perception...... 39

3.3.2 Speech production theory overview...... 39

3.3.3 Influence of the source on voice quality...... 42

3.3.4 Influence of the filter on voice quality...... 44

3.3.5 Influence of emotions on voice quality...... 45

3.4 Non-speech related ways of expression and detection of emotions...... 50

4. Emotional database...... 53

4.1 Difficulties to acquire emotional data...... 53

4.1.1 Spontaneous speech...... 54

4.1.2 Acted speech...... 55

4.1.3 Elicited speech...... 56

4.2 Framework...... 57

4.3 Recording sessions...... 58

4.3.1 One day with AIBO...... 60

4.3.2 AIBO commands...... 61

5. Feature Selection...... 63

5.1 Need for Feature Reduction...... 63

5.2 Feature Selection process...... 64

5.3 Feature selection methods overview...... 67

5.3.1 Correlation-based Feature Selection...... 69

5.4 Feature Selection Procedures employed in this work...... 71

5.4.1 Regression models...... 71

5.4.2 Fisher’s discriminant: F-Ratio...... 72

5.4.3 NN pruning...... 74

5.4.4 Correlation analysis...... 74

5.4.5 Graphical analysis of feature statistics (“boxplot”)...... 75

6. Classifiers...... 77

6.1 Classifiers used in emotional recognition...... 78

6.2 Classifiers tried in the present work...... 80

6.2.1 Gaussian mixture models...... 80

6.2.2 Linear discriminant analysis...... 81

6.2.3 Decision trees...... 81

6.2.4 Neural Networks...... 82

6.3 Neural Networks...... 83

6.3.1 Introduction to Neural Networks...... 84

6.3.2 Initialisation of adaptive parameters in neural networks...... 87

6.3.3 Learning Algorithms...... 88

6.3.3.1 Backpropagation learning algorithm...... 89

6.3.3.2 RPROP learning algorithm...... 92

6.3.3.3 Pruning algorithms...... 95

6.3.3.4 Multiple step vs. One step procedure...... 96

6.3.4 Activation functions...... 97

6.3.4.1 Logistic activation function...... 99

6.3.4.2 Hyperbolic tangent activation function...... 100

6.3.5 Analysing Functions...... 100

6.3.5.1 402040 decision rule...... 101

6.3.5.2 WTA (Winner Takes All)...... 101

6.3.5.3 Band decision rule...... 102

6.3.5.4 Post-analysis method based on thresholds...... 102

6.4 Leave-one-out cross validation...... 103

6.4.1 Leave one sentence out...... 103

6.4.2 Leave one speaker out...... 103

7. Feature Calculation...... 105

7.1 Basic Prosodic Attributes...... 105

7.1.1 Voiced/unvoiced decision...... 106

7.1.2 Fundamental Frequency Contour...... 109

7.1.2.1 Previous remark...... 109

7.1.2.2 Difficulties in estimating pitch contour...... 110

7.1.2.3 Description of the algorithm...... 112

7.1.3 Energy Contour...... 116

7.2 Prosodic Features...... 119

7.2.1 P1...... 120

7.2.2 P2...... 129

7.3 Quality Features...... 135

7.3.1 Harmonicity based features...... 136

7.3.2 Formant frequency based features...... 139

7.3.3 Energy based features...... 145

7.3.4 Spectral measurements...... 147

8. Experiments with Prosodic Features...... 151

8.1 Preliminary experiments...... 152

8.1.1 Speaker dependent...... 153

8.1.1.1 A simple problem: Angry vs. Sad...... 153

8.1.1.2 Classifying into five emotions: first attempt...... 157

8.1.1.3 P1 set: Varying the neural network output configuration...... 161

8.1.1.4 P2 set: Varying the neural network output configuration...... 167

8.1.2 Speaker independent...... 171

8.1.2.1 P1 set: Varying the neural network output configuration...... 172

8.1.2.2 P2 set: Varying the neural network output configuration...... 177

8.2 Speaker dependent...... 181

8.2.1 Speaker A: Experiments performed with one male German speaker...... 182

8.2.1.1 P1 set: Varying the neural network output configuration...... 182

8.2.1.2 P2 set: Varying the neural network output configuration...... 187

8.2.1.3 Feature selection from all (57) prosodic features...... 192

8.2.1.4 Feature selection from P1 prosodic feature set...... 197

8.2.1.5 Feature selection from P2 prosodic feature set...... 200

8.2.1.6 Tuning the best feature set for the activation level classification of TK...... 205

8.2.2 Speaker B: Experiments performed with one female English speaker...... 213

8.2.2.1 P1 set: Varying the neural network output configuration...... 213

8.2.2.2 Feature selection from P1 prosodic feature set...... 218

8.3 Speaker independent...... 220

8.3.1 Validity of the arousal level based emotional grouping for the speaker independent case...... 220

8.3.1.1 Differentiating opposite levels on the activation axis: a basic approach...... 220

8.3.1.2 Experiments made over the best 14 emotional speakers...... 222

8.3.1.3 Normalising some statistics independently for each speaker...... 224

8.3.2 Avoiding emotional content in neutral utterances from AIBO stories...... 228

8.3.2.1 Addition of new neutral utterances from read speech...... 228

8.3.2.2 Re-labelling of the neutral utterances from the AIBO stories...... 230

9. Experiments with Quality Features...... 233

9.1 Preliminary experiments...... 233

9.1.1 Experimental observation...... 233

9.1.2 Speaker dependent...... 237

9.1.2.1 Sentence based. First approach...... 237

9.1.2.2 Region based...... 241

9.1.2.3 Region based. Feature normalization...... 242

9.1.2.4 Frame based...... 244

9.1.2.5 Frame vs. region based classification...... 246

9.1.3 Speaker independent...... 248

9.1.3.1 Eighteen features. Region vs. frame based classification...... 248

9.2 Speaker dependent experiments...... 249

9.2.1 Speaker A: Experiments performed with one male German speaker...... 249

9.2.1.1 Additional information provided by voice quality features...... 249

9.2.1.2 Mapping emotions on the evaluation axis...... 252

9.2.1.3 Expert systems: angry vs. happy and bored vs. sad...... 253

9.2.2 RT: Experiments performed with one female English speaker...... 258

9.2.2.1 Set of 16 features. Angry vs. happy and bored vs. sad...... 258

9.2.2.2 Set of 20 features from the speaker independent selection...... 259

9.2.2.3 Prosodic features added to the set of 16 quality features...... 260

9.3 Speaker independent...... 261

9.3.1 Speaker dependent nature of the quality features...... 261

9.3.1.1 First attempt: same quality features used in the speaker dependent case...... 261

9.3.2 Selection of voice quality features...... 263

9.3.2.1 Feature selection among quality features: angry vs. happy...... 264

9.3.2.2 Separating speakers by gender. Angry vs. happy...... 266

9.3.2.3 F-ratio selection among prosodic and quality features...... 268

10. Software description...... 271

10.1 Data processing...... 271

10.1.1 Data preprocessing...... 272

10.1.2 Feature calculation...... 273

10.1.3 Classification...... 275

10.2 Analysis of the results...... 276

APPENDIX A...... 279

REFERENCES...... 291

BUDGET...... 301
