© RNIB 2010
RNIB Centre for Accessible Information (CAI)
Technical report # 8
Review of methods for evaluating synthetic speech
Published by:
RNIB Centre for Accessible Information (CAI), 58-72 John Bright Street, Birmingham, B1 1BN, UK
Commissioned by:
As publisher
Authors:
(Note: After corresponding author, authors are listed alphabetically, or in order of contribution)
Heather Cryer* and Sarah Home
*For correspondence
Tel: 0121 665 4211
Email:
Date: 17 February 2010
Document reference: CAI-TR8 [02-2010]
Sensitivity: Internal and full public access
Copyright: RNIB 2010
Citation guidance:
Cryer, H., and Home, S. (2010). Review of methods for evaluating synthetic speech. RNIB Centre for Accessible Information, Birmingham: Technical report #8.
Acknowledgements:
Thanks to Sarah Morley Wilkins for support in this project.
Review of methods for evaluating synthetic speech
RNIB Centre for Accessible Information (CAI)
Prepared by:
Heather Cryer (Research Officer, CAI)
FINAL version
© RNIB 17 February 2010
Table of contents
Introduction 3
Objective/Acoustic measures 4
User testing 6
Performance measures 7
Opinion measures 9
Feature comparisons 10
Conclusion 10
References 11
Introduction
This paper provides background to the development of an evaluation protocol for synthetic voices. The aim of the project was to develop a protocol which could be used by staff working with synthetic voices to conduct systematic evaluations and keep useful records detailing their findings.
In order to do this, a review of existing literature was carried out to determine the various methods used for evaluating synthetic voices. This paper synthesises findings of the literature review.
There are a number of different approaches to evaluating synthetic voices. The type of evaluation carried out is likely to depend on the purpose of the evaluation, and the specific application for which the synthetic voice is intended. The key approaches to evaluation discussed here are:
· Objective measures
· User testing (performance measures and opinion measures)
· Feature comparisons
Whilst these approaches may seem very different, in fact they complement each other, and different types of evaluation may be carried out at different stages in the development and use of a synthetic voice (Francis and Nusbaum, 1999). For example, objective/acoustic measures are used particularly by developers as diagnostic tools to refine the voice, ensuring it is as good as possible. Similarly, testing user performance in listening to the voice enables further development to make improvements. Subjective user testing is more useful for someone considering the voice for use in a product or service, to find out whether users are happy with the voice. Similarly, feature comparisons are useful to those considering investing in a voice for their application, to evaluate whether the voice has the desired features and whether it fits in with existing systems and processes.
The following review aims to give an overview of these approaches, explaining when they may be used, advantages and disadvantages of each approach, and providing some insight into the complexity of the process.
Objective/Acoustic measures
Historically, synthetic speech evaluations tended to focus on whether or not the voice was intelligible. As technology has progressed, synthetic voices have become more sophisticated. This means that intelligibility is generally a given, so the focus of evaluations has shifted towards how closely a synthetic voice mirrors a human voice (Campbell, 2007; Francis and Nusbaum, 1999; Morton, 1991). This is partly based on subjective user evaluation (discussed more later), but is also deemed important in an objective sense, in terms of measuring whether or not an utterance from a synthetic voice acoustically matches the same utterance in human speech.
This type of objective/acoustic measure forms a large part of synthetic speech evaluation, particularly for developers of synthetic voices. By measuring where voices differ from a human utterance, problem areas can be identified which can then be refined to make improvements.
Objective/acoustic measures are beneficial for a number of reasons. Firstly, by their very nature, objective measures offer a clear measurement of how a voice is performing, and can be used to diagnose problem areas needing development. Also, compared to the time and cost involved in user testing, objective measures can be an efficient form of evaluation (Hirst, Rilliard and Aubergé, 1998; Clark and Dusterhoff, 1999).
The main drawback of objective/acoustic evaluation is that its findings may not always match up with listener perceptions. Researchers have found that some objective measures may be over-sensitive compared to the human ear: some studies show differences between natural and synthetic voices (highlighted by objective measures) which were not perceived by users (Clark and Dusterhoff, 1999). The opposite may also be true, in that synthetic samples may be perfect in acoustic terms but still be perceived as unnatural by listeners. Morton (1991) explains this in terms of the processing required to listen to speech. Morton suggests that human speakers naturally vary their speech with the aim of being understood, and that this affects the way human speech is processed by listeners. As a synthesiser does not have this capability, the processing needs of the listener are not accounted for. This may cause the listener to find the synthetic speech unnatural or difficult to understand.
Objective/acoustic measures can be used to evaluate different aspects of a voice, such as prosodic features or intonation. Examples of acoustic measures which may be of interest include fundamental frequency, segmental duration and intensity. A common approach is to use statistical methods, such as Root Mean Squared Error (RMSE), to compare the expected performance of the voice against its actual performance, measuring the accuracy of the synthetic voice against a natural voice.
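To make this concrete, the following minimal Python sketch computes RMSE between a synthetic and a natural fundamental frequency (F0) contour. It assumes the two utterances are already time-aligned and sampled at the same points; the function name and contour values are purely illustrative.

```python
import numpy as np

def f0_rmse(synthetic_f0, natural_f0):
    """Root Mean Squared Error between two F0 contours (values in Hz)
    sampled at the same time points. A real evaluation would first need
    to time-align the utterances and handle unvoiced frames."""
    synthetic_f0 = np.asarray(synthetic_f0, dtype=float)
    natural_f0 = np.asarray(natural_f0, dtype=float)
    return float(np.sqrt(np.mean((synthetic_f0 - natural_f0) ** 2)))

# Hypothetical contours for the same utterance (one value per frame, in Hz)
natural = [210, 215, 220, 218, 205, 190]
synthetic = [205, 212, 225, 220, 200, 188]
print(f"F0 RMSE: {f0_rmse(synthetic, natural):.1f} Hz")
```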
Whilst objective measures have been described as more efficient than user testing, specialist knowledge is of course required in order to run these tests. Clark and Dusterhoff (1999) report on trials of three different objective measures aimed at investigating differences in intonation, and highlight some of the complexities involved. The generally accepted measure, RMSE, measures the difference between pitch contours along a time axis. A drawback of this method is that it measures differences between contours at particular time points, rather than at "important" events (e.g. changes in pitch). Furthermore, differences between two utterances could be due to timing, to pitch, or to a combination of these factors, and it is difficult to combine measurements of these two different factors.
Clark and Dusterhoff trialled two alternative measures, designed to allow for the idea that differences could be due to various factors and to focus on important pitch events. These were tested alongside subjective user perceptions of differences between utterances. The trials showed that whilst the new measures were effective in identifying differences between the synthetic and natural utterances, they did not add anything beyond the commonly used RMSE. These findings highlight the complexity of measurement in this area.
Another possible improvement to objective measures, suggested by Hirst et al (1998), is to use a corpus-based evaluation technique, comparing the synthetic utterance to a range of natural reference utterances rather than just one. The authors suggest that this would improve the reliability of the data.
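A corpus-based comparison of this kind might, under the same simplifying assumptions as the earlier sketch, look something like the following; the function and data are hypothetical, and the synthetic contour is scored against several natural renditions rather than one.

```python
import numpy as np

def corpus_f0_rmse(synthetic_f0, natural_references):
    """Average the RMSE of one synthetic F0 contour against several natural
    renditions of the same utterance (all assumed to be time-aligned)."""
    synthetic_f0 = np.asarray(synthetic_f0, dtype=float)
    scores = [
        np.sqrt(np.mean((synthetic_f0 - np.asarray(ref, dtype=float)) ** 2))
        for ref in natural_references
    ]
    return float(np.mean(scores))

synthetic = [205, 212, 225, 220, 200, 188]
references = [
    [210, 215, 220, 218, 205, 190],  # natural rendition, speaker A
    [208, 214, 222, 216, 202, 192],  # natural rendition, speaker B
]
print(f"Corpus-based RMSE: {corpus_f0_rmse(synthetic, references):.1f} Hz")
```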
In summary, objective measures can be a useful means of evaluation for synthetic voices, and can be a very efficient form of testing. However, such techniques require specialist knowledge and measures are still in development.
User testing
Evaluations of synthetic voices often involve testing the voice with those who are ultimately going to use it. Testing with users is beneficial for understanding how the voice will work in a particular application (Braga, Freitas and Barros, 2002). There are a variety of approaches to user testing, falling into two broad categories - performance measures and opinion measures. Performance measures (such as intelligibility), give information about whether the voice is sufficiently accurate for people to understand it; and opinion measures (such as acceptability) give information on users' subjective judgements of the voice.
User testing in the evaluation of synthetic voices has various benefits. As a 'real world' test of the voice with those who will use it, such testing gives insight into whether the voice can be understood, and whether it is accepted. It could be argued that this is all that matters when testing a voice: if end users can use and understand it, the technical accuracy of the voice may not be important. Indeed, Campbell (2007) suggests that users are more interested in the 'performance' of the voice (such as how pleasant it is, how suitable it is for the context, and so on) than in the technical success of the developers. By testing with users, real world reactions to a voice can be gathered, which can be particularly informative when planning future products or services.
Of course there are downsides to user testing too. It is time consuming and expensive to organise (Clark and Dusterhoff, 1999). Also, as individuals differ in both ability and opinion, it can be difficult to draw conclusions from diverse user data.
Performance measures
Performance measures test listeners' "reading performance" when reading with a synthetic voice. This is a way of testing how well the voice conveys information, or its intelligibility. There are a number of intelligibility tests, which differ on various factors. For example, some have closed answers (where listeners select what they heard from multiple choice options), and others have free answers (where listeners simply report what they think they heard). Furthermore, some tests measure intelligibility at phoneme level (testing whether listeners can tell the difference between sounds) whilst others test intelligibility at word level (evaluating listeners' ability to understand words). Testing with longer pieces of text (whole sentences) also allows evaluation of prosody (the non-verbal aspects of speech such as intonation, rhythm, pauses etc.) (Benoît, Grice and Hazan, 1996).
To demonstrate some of these differences, two commonly used tests will be discussed in detail. These are the diagnostic rhyme test (DRT) and the semantically unpredictable sentences test (SUS).
Diagnostic Rhyme Test (DRT)
The diagnostic rhyme test is a closed answer test which measures intelligibility at phoneme level. Listeners hear one word from a pair of monosyllabic words differing only in their initial consonant (for example, pin/fin, hit/bit) and have to choose which word of the pair they heard.
Braga et al (2002) discuss the complexity of constructing the test set of word pairs for the DRT. The test set should reflect common syllabic trends in the language, considering the likelihood of the consonant appearing at the beginning or the end of a word. Furthermore, pairs should be constructed so as to enable testing of the intelligibility of a variety of phonetic features. For the English version of the test, these features are voicing, nasality, sustension, sibilation, graveness and compactness (features which relate to things like the vibration of the vocal cords, the position of the tongue and lips, and so on).
Results from the DRT can be evaluated in a variety of ways, from simply looking at the number of correct responses, to analysing confusion between particular phonetic features. This is useful as it gives not only an overall impression of intelligibility but can also identify areas where confusions occur (Braga et al, 2002). Furthermore, the test can be carried out with a variety of voices and performance easily compared.
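As a rough illustration of this kind of scoring, the Python sketch below tallies overall accuracy and accuracy per phonetic feature; the trial data and word pairs are hypothetical.

```python
from collections import defaultdict

# Hypothetical DRT trials: the phonetic feature under test, the word
# presented, and the word the listener chose from the rhyming pair.
trials = [
    {"feature": "voicing",  "presented": "veal", "chosen": "veal"},
    {"feature": "voicing",  "presented": "vast", "chosen": "fast"},
    {"feature": "nasality", "presented": "meat", "chosen": "meat"},
    {"feature": "nasality", "presented": "need", "chosen": "deed"},
]

correct = defaultdict(int)
total = defaultdict(int)
for t in trials:
    total[t["feature"]] += 1
    correct[t["feature"]] += int(t["chosen"] == t["presented"])

overall = 100 * sum(correct.values()) / sum(total.values())
print(f"Overall: {overall:.0f}% correct")
for feature in total:
    print(f"  {feature}: {100 * correct[feature] / total[feature]:.0f}% correct")
```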
Semantically Unpredictable Sentences (SUS)
The Semantically Unpredictable Sentences (SUS) test is a free answer test evaluating intelligibility at word level. Listeners are presented with sentences which are syntactically normal but semantically abnormal. That is, the sentences use words of the correct grammatical class, but in combinations which may not make sense (for example, "She ate the sky"). Listeners hear each sentence and are asked to write down what they think they heard.
Using semantically unpredictable sentences allows evaluators to test intelligibility of words without the context of the sentence cueing listeners to expect a particular word. Furthermore, it is possible to generate a huge number of these sentences, which reduces the need to re-use sentences (which could cause learning effects). Benoît et al (1996) discuss the process of constructing a test set for the SUS. They recommend using a variety of sentence structures. There are various rules for inclusion in each set, and a computer is used to randomly generate sentences based on frequency of use of words. Benoît et al (1996) suggest the use of monosyllabic words (the shortest available words in their class) to further reduce contextual cues. They outline a procedure for running the SUS, suggesting that a consistent approach across tests will make it easier (and more accurate) to compare performance across different synthetic voices.
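The sketch below illustrates the general idea of such template-based generation, filling a syntactic structure with randomly chosen words of the appropriate class. The word lists and structure are hypothetical and far simpler than a real SUS test set, which would also weight words by frequency.

```python
import random

# Hypothetical word lists by grammatical class; a real SUS test set would be
# built from the shortest, most frequent words in each class (Benoît et al, 1996).
WORDS = {
    "det":  ["the"],
    "noun": ["chair", "sky", "truth", "bone", "lake"],
    "verb": ["ate", "sang", "drew", "held", "broke"],
    "adj":  ["green", "loud", "flat", "damp"],
}

# One of several possible sentence structures:
# determiner noun verb determiner adjective noun
STRUCTURE = ["det", "noun", "verb", "det", "adj", "noun"]

def generate_sus_sentence(structure=STRUCTURE):
    """Fill a syntactic template with randomly chosen words of the right
    class, producing a grammatical but semantically unpredictable sentence."""
    return " ".join(random.choice(WORDS[word_class]) for word_class in structure)

for _ in range(3):
    print(generate_sus_sentence())  # e.g. "the chair broke the damp sky"
```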
Results from the test consist of percentage of correctly identified sentences, both as a percentage of the full set and as a percentage for each sentence structure.
It must be noted that whilst the SUS is useful for comparing intelligibility of different systems, setting up the procedure is complex and must be done carefully to ensure results are comparable (Benoît et al, 1996).
Opinion measures
User testing with synthetic voices is not just about how well the listener can understand the voice, but also about the listener's opinion of the voice. A widely used opinion measure in evaluations of synthetic voices is naturalness, that is, the extent to which the voice sounds human. The most common test is the Mean Opinion Score (MOS). This test involves a large number of participants who listen to a set of sentences presented in synthetic speech and rate them on a simple 5-point scale (from 'excellent' to 'bad'). Scores are then averaged across the group (Francis and Nusbaum, 1999).
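As a simple illustration of how MOS values are derived, the sketch below averages hypothetical 5-point ratings across listeners; the listener labels and scores are purely illustrative.

```python
# Hypothetical ratings on a 5-point scale (5 = excellent, 1 = bad),
# one list of sentence ratings per listener.
ratings = {
    "listener_1": [4, 5, 3, 4],
    "listener_2": [3, 4, 4, 3],
    "listener_3": [5, 4, 4, 5],
}

all_scores = [score for person in ratings.values() for score in person]
mos = sum(all_scores) / len(all_scores)
print(f"Mean Opinion Score: {mos:.2f} across {len(all_scores)} ratings")
```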
Generally speaking, more natural-sounding voices are more accepted (Stevens, Lees, Vonwiller and Burnham, 2005). However, research suggests that this may depend on context. Francis and Nusbaum (1999) report on a study in which users preferred an unnatural sounding voice in the context of telephone banking, as they reasoned they would rather a computer knew their bank balance than a real person.
Whilst naturalness is a widely used measure, some researchers suggest it is not ideal. Campbell (2007) suggests that 'believability' may be preferable as in some cases a voice doesn't need to be natural (for example, in cartoons). Other measures used include acceptability or likeability (Francis and Nusbaum, 1999).