Periodic Progress Report

Research Training Network

Hearing, Organisation and Recognition of Speech in Europe

HOARSE

Contract N°: HPRN-CT-2002-00276

Commencement date of contract: 1/9/2002

Duration of contract (months): 48

Period covered by the report: 1/9/2002 to 31/8/2003

Coordinator:

Professor Phil Green

Department of Computer Science

University of Sheffield

Regent Court, Portobello St.,

Sheffield S1 4DP

UK

Phone: +44 114 222 1828

Fax: +44 114 222 1810

e-mail:

Part A. Research Results

A.1 Scientific Highlights

In its first year HOARSE has recruited young researchers at 6 of its 7 partners; at the remaining partner a placement will start in January 2004. The young researchers are, of course, at an early stage in their studies. The highlights of their work so far are as follows:

At Bochum, doctoral researcher JUHA MERIMAA has concentrated on HOARSE Tasks 2.2 (Reliability of auditory cues in multi-source scenarios) and 2.3 (Perceptual models of room reverberation with application to speech recognition). A pilot listening test has been conducted to investigate the perceptual grouping of binaural cues and their relation to sound sources and the listening environment. The preliminary data show a complex dependence on the source signal properties. The reliability of binaural auditory cues (Task 2.2) has also been addressed directly in an auditory modelling study in collaboration with Agere Systems. The investigation describes the behaviour of interaural time-difference (ITD) and interaural level-difference (ILD) cues in some typical multi-source and reverberant scenarios, and proposes a method for extracting reliable cues based on a novel way of analysing the output of an auditory cross-correlation model.

Also at Bochum, post-doctoral researcher JOHN WORLEY is investigating the cues to sound source location contained within individual listeners' head-related transfer functions (HRTFs), and in particular how listeners resolve the 'cone of confusion'. This work is related to HOARSE Tasks 2.1 (the precedence effect) and 2.2. Individual differences in HRTF measurements reflect differences in the size and shape of listeners' pinnæ and heads, and when source location is synthesised from non-individualised HRTFs, listeners typically make a significant number of front-back reversal errors. It has been suggested that listeners can learn to associate the cues in non-individualised HRTFs with source locations. Worley has performed a longitudinal study to assess whether inexperienced listeners can learn to resolve front-back location when sources are synthesised from the HRTFs of a non-individualised dummy head. Over a training period spanning 9 days, listeners did not show a reduction in the number of reported reversal errors. They did, however, differ in their propensity for visual capture of the auditory event. The results are to be analysed with reference to the difference between dynamic and passive listening with non-individualised HRTFs.
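
For illustration, the following is a minimal sketch of generic ITD/ILD extraction by interaural cross-correlation. It is not Merimaa's method (which analyses the output of a full auditory model); the function, its parameters and the simple peak-height reliability measure are our own illustrative assumptions.

    import numpy as np

    def itd_ild(left, right, fs, max_itd_s=1e-3):
        """Estimate ITD (s) and ILD (dB) from a pair of band-limited ear signals."""
        n = len(left)
        max_lag = int(max_itd_s * fs)
        # Full cross-correlation; index n-1 corresponds to zero lag.
        full = np.correlate(left, right, mode='full')
        window = full[n - 1 - max_lag : n + max_lag]   # restrict to +/- 1 ms
        norm = np.sqrt(np.sum(left**2) * np.sum(right**2)) + 1e-12
        window = window / norm
        best = int(np.argmax(window))
        itd = -(best - max_lag) / fs       # positive ITD: left-ear signal leads
        ild = 10 * np.log10((np.sum(left**2) + 1e-12) / (np.sum(right**2) + 1e-12))
        return itd, ild, window[best]      # peak height as a crude reliability cue

    # Example: right-ear signal attenuated and delayed by 0.5 ms
    fs = 16000
    t = np.arange(0, 0.05, 1 / fs)
    left = np.sin(2 * np.pi * 500 * t)
    right = 0.7 * np.sin(2 * np.pi * 500 * (t - 0.0005))
    print(itd_ild(left, right, fs))        # ITD ~ +0.5 ms, ILD ~ +3 dB

In reverberant or multi-source conditions the height of the normalised cross-correlation peak drops, which is one simple way of flagging the corresponding ITD estimate as unreliable.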

At DCAG, doctoral researcher JULIEN BOURGEOIS has been working on Task 4.2, informing speech recognition. His first-year aim was to study and evaluate existing blind and semi-blind source separation methods. This theoretical and implementation effort encompassed speech-specific methods (CASA and AMDecor) that exploit the correlated modulation of the sources in different frequency bands. Another group of techniques makes use of the statistical independence of the sources. Various independence measures were tested, using either second-order or higher-order information; decorrelation was shown to be the most robust separation criterion in noisy conditions. Observing that speech signals fill a small portion of the time-frequency plane, source separation methods which assume that only one source is present at each point of the spectrogram were also considered, and these led to the best results in terms of interference reduction. Spatial filtering (beamforming) is a classical alternative to these methods; it shows noise reduction ability, but its performance depends on reliable voice activity detection. Most methods require at least as many microphones as sources. For the evaluation of the various approaches (Task 5.1), Bourgeois used a set of speech recordings from the TIMIT digit database in highway conditions at different driving speeds. These were made with artificial heads so that the acoustic response between the speaker and the microphone remained constant.
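
For illustration, here is a minimal DUET-style sketch of the 'one source per time-frequency point' approach mentioned above, for two sources and two microphones. This is not Bourgeois's implementation: the function name, parameters and the crude median split of the interchannel level ratio are assumptions (in practice the level/delay estimates would be clustered properly, e.g. with k-means).

    import numpy as np
    from scipy.signal import stft, istft

    def tf_mask_separate(x1, x2, fs, nperseg=1024):
        """Separate two sources from two mixtures by binary TF masking."""
        _, _, X1 = stft(x1, fs, nperseg=nperseg)
        _, _, X2 = stft(x2, fs, nperseg=nperseg)
        # Interchannel level ratio: if one source dominates each TF point,
        # the ratios cluster around one value per source.
        a = np.log(np.abs(X2) + 1e-12) - np.log(np.abs(X1) + 1e-12)
        thresh = np.median(a)              # crude two-class split
        mask1 = a <= thresh
        _, s1 = istft(X1 * mask1, fs, nperseg=nperseg)
        _, s2 = istft(X1 * ~mask1, fs, nperseg=nperseg)
        return s1, s2

Binary masking of this kind typically gives strong interference reduction, consistent with the evaluation results above, at the cost of musical-noise artefacts wherever the disjointness assumption fails.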

At HUT Helsinki, doctoral researcher EVA BJÖRKNER has been comparing habitual and throaty voice production for the vowels /a, ae, i, u/, using vocal tract area functions obtained from MR imaging together with acoustic recordings analysed by inverse filtering (Task 3.1, glottal excitation estimation). This experiment combines, for the first time, visual and acoustical analyses of the vocal tract area functions for throaty voice production.
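
As background, the core of inverse filtering can be sketched in a few lines: estimate the vocal-tract filter by linear prediction and remove it from the speech signal, leaving an estimate of the glottal excitation. This is a simplification of the procedure used at HUT (which involves iterative refinement and careful treatment of lip radiation); the function names and the model-order rule of thumb are assumptions.

    import numpy as np
    from scipy.signal import lfilter

    def lpc_coeffs(x, order):
        """Autocorrelation-method LPC coefficients, A(z) = 1 - sum a_k z^-k."""
        n = len(x)
        r = np.correlate(x, x, 'full')[n - 1 : n + order]   # r[0..order]
        R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
        a = np.linalg.solve(R, r[1 : order + 1])
        return np.concatenate(([1.0], -a))

    def inverse_filter(speech, fs, order=None):
        order = order or int(fs / 1000) + 2                 # rule of thumb
        pre = lfilter([1, -0.97], [1], speech)              # pre-emphasis
        A = lpc_coeffs(pre, order)                          # vocal-tract estimate
        # Filtering the speech through A(z) cancels the vocal-tract
        # resonances, giving an estimate of the glottal flow derivative.
        return lfilter(A, [1], speech)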

At IDIAP (and at USFD), work is increasingly making use of the multi-sensory recordings of meetings being made in the EC-funded project M4. The 'meeting room' research will be developed further in the FP6 Integrated Project AMI, and all HOARSE researchers will have access to these increasingly important meetings corpora. Doctoral researcher GUILLAUME LATHOUD is working under Task 5.2 (signal and speech detection in sound mixtures) on overlaps between speakers in the meetings data. Highly effective methods for segmenting concurrent speakers using microphone arrays have already been published [Lathoud & McCowan 03]. Current work focuses on combining these with unsupervised, spectral-based speaker clustering methods to answer the question "who spoke when and where".
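
A standard building block for such microphone-array work is time-delay estimation between microphone pairs, for example by GCC-PHAT; the sketch below is generic and not necessarily the method of [Lathoud & McCowan 03].

    import numpy as np

    def gcc_phat_tdoa(x, y, fs, max_tdoa_s=5e-4):
        """Time difference of arrival of x relative to y (positive: x lags y)."""
        n = len(x) + len(y)
        X, Y = np.fft.rfft(x, n), np.fft.rfft(y, n)
        cross = X * np.conj(Y)
        cross /= np.abs(cross) + 1e-12       # PHAT weighting: keep phase only
        cc = np.fft.irfft(cross, n)
        m = int(max_tdoa_s * fs)
        cc = np.concatenate((cc[-m:], cc[:m + 1]))   # lags -m .. +m
        return (int(np.argmax(cc)) - m) / fs

TDOAs from several microphone pairs can then be intersected to estimate the active speaker's position, and changes of position over time segment the discussion into speaker turns.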

At Liverpool, the work of doctoral researcher ELVIRA PEREZ has concentrated on Task 1.3, active/passive speech perception. The central question addressed is whether listeners build internal models of environmental noises to allow more effective stream segregation. Two perceptual experiments were conducted to test the hypothesis that noise patterns which occur regularly in time are easier to segregate from a target speech signal than the same patterns occurring in random sequences. Initial results suggest that noise maskers that occur in regular patterns are no easier to segregate than random patterns.

At Patras, no young researcher has so far been engaged on HOARSE, but John Worley will move there from Bochum in January 2004. Under Task 2.3, perceptual models of room reverberation with application to speech recognition, some preliminary work has been performed, based on the use of smoothed room response measurements for room dereverberation. The tests indicate that this method is robust with respect to variations in source/receiver placement and immune from practical dereverberation artefacts. This work is expected to form the starting point for Worley's work.
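
The following is an illustrative sketch of the underlying idea: smooth the measured response on a fractional-octave scale before inverting it, so that the inverse filter does not chase narrow spectral notches that change with source/receiver placement. The published method uses complex smoothing and is considerably more refined; the names and parameters here are assumptions.

    import numpy as np

    def smoothed_inverse(h, fs, frac=1/3, eps=1e-3):
        """Fractional-octave-smoothed, regularised magnitude inverse of a room response h."""
        H = np.abs(np.fft.rfft(h))
        freqs = np.fft.rfftfreq(len(h), 1 / fs)
        smoothed = H.copy()
        for i, f in enumerate(freqs):
            if f > 0:
                band = (freqs >= f * 2 ** (-frac / 2)) & (freqs <= f * 2 ** (frac / 2))
                smoothed[i] = H[band].mean()     # average over a fractional-octave band
        return 1.0 / (smoothed + eps)            # regularised inversion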

At USFD, doctoral researcher JANA EGGINK is researching auditory scene analysis within polyphonic (multi-instrument) music. This is an addition to the HOARSE work programme, explained in §B.3. Eggink is focussing on instrument recognition, which is in many ways related to speaker identification, and techniques developed for the latter problem have been successfully adapted for monophonic instrument recognition. To enable instrument recognition when more than one sound source is present, Eggink uses missing data techniques, which were developed at USFD for speech and speaker recognition in the presence of noise. Eggink's first system is intended to identify the fundamental frequencies (F0s) of all notes in pieces for small musical ensembles, and to determine the instruments on which these notes were played. The F0s are estimated first, using a frequency-domain pattern matching approach related to the so-called 'harmonic sieves'. Based on the F0s, missing data masks are then constructed by declaring all frequency regions where a harmonic of an interfering tone is found to be unreliable or 'missing'. The actual instrument classification is carried out using a Gaussian mixture model classifier which has been adapted to work with missing data. Results were generally good, not only for artificial mixtures of tones with known F0s, but also for examples taken from commercially available music compact discs. A subsequent study has tackled the problem of identifying the lead instrument in polyphonic music, and is achieving very encouraging results. Publications will appear in the next reporting period.
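
The missing-data idea itself is compact: with a diagonal-covariance Gaussian mixture model, marginalising the unreliable feature dimensions out of the likelihood amounts to simply omitting them from the per-component sums. The sketch below is illustrative only; USFD's classifier and mask estimation are more sophisticated.

    import numpy as np

    def log_gauss_diag(x, mean, var, reliable):
        """Log density over the reliable dimensions only (reliable: boolean mask)."""
        d, v = x[reliable] - mean[reliable], var[reliable]
        return -0.5 * np.sum(np.log(2 * np.pi * v) + d * d / v)

    def gmm_loglik_missing(x, weights, means, variances, reliable):
        comps = [np.log(w) + log_gauss_diag(x, m, v, reliable)
                 for w, m, v in zip(weights, means, variances)]
        return np.logaddexp.reduce(comps)

    def classify(x, reliable, models):
        """Pick the instrument whose GMM gives the highest marginal likelihood.
        models maps instrument name -> (weights, means, variances)."""
        return max(models, key=lambda k: gmm_loglik_missing(x, *models[k], reliable))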

A.2 Joint Publications and Patents

There are no joint publications so far. The publication record is presented at the end of this document.

Part B - Comparison with the Joint Programme of Work (Annex I of the contract)

B.1 Research Objectives

The research objectives, as set down in Annex I of the contract, are still relevant and achievable.

B.2 Research Method

We can identify two additions to our research methodology:

·  For studies of auditory scene analysis and dealing with multiple speakers, there is increasing emphasis on the collection and analysis of 'meetings data'. This work is supported by the FP5 EC project M4 (www.m4project.org) and will soon be augmented by the FP6 Integrated Project AMI (www.amiproject.org). IDIAP and USFD are partners in these projects and the data they collect will be available to HOARSE researchers.

·  In the speech production modelling work led by HUT, MR imaging, which was not mentioned in the contract as a research method, has now been shown to be a powerful tool for obtaining knowledge about voice production.

B.3 Work Plan

B3.1 Breakdown of tasks

The only significant change to the work programme is to explore auditory scene analysis within the context of polyphonic music. This is the topic of young researcher Jana Eggink at USFD, who has a background in musicology. The scientific importance of this study follows from the fact that although most music is highly complex, with multiple instruments playing simultaneously, listeners are nevertheless able to follow individual melody lines, recognise recurring patterns, identify chords and instruments, and much more. Because the musical score can be used as an 'oracle', we can construct testbeds for well-controlled experimentation. There are also practical applications of such work in automatic music transcription systems and content-based music information retrieval. A key part of Eggink's work will be to validate in the musical domain existing techniques used for speech segregation and speaker identification (e.g., Gaussian mixture modelling for timbre classification, and multi-source decoding for transcription of polyphonic music). We therefore schedule a new task under the Auditory Scene Analysis theme: Task 1.5, Auditory Scene Analysis in Music.

B3.2 Schedule and milestones: Table 1
Task / Title / Lead Partner / 12-Month Milestone / Comments
1.1 / Neural oscillators for auditory scene analysis / USFD / Multiple F0s using harmonic cancellation; initial implementation of binaural grouping / Multiple-F0 work published [Wu, Wang & Brown 03]
1.2 / Modelling grouping integration by multisource decoding / USFD / Incorporation of noise estimation into oscillator-based grouping / Multisource decoding theory journal article to appear in next reporting period
1.3 / Active/passive speech perception / Liverpool / Planning experiments / Experiments conducted
1.4 / Envelope information and binaural processing / Liverpool / Preliminary experiments / Experiments about to start
1.5 / Auditory scene analysis in music / USFD / F0 estimation / First system completed [Eggink & Brown 03]
2.1 / Researching the precedence effect / RUB / Psychoacoustic experiments on the precedence effect in realistic scenarios / Real rooms have been measured and experiments employing the Franssen illusion are in progress. No work yet at Patras due to lack of a visiting researcher; planned to start in January 2004. See [Braasch & Blauert 03]
2.2 / Reliability of auditory cues in multi-source scenarios / RUB / The importance of single binaural cues in various multi-source environments determined in psychoacoustic experiments / Completed [Braasch 03], [Braasch et al. 03], [Braasch & Blauert 03]
2.3 / Perceptual models of room reverberation with application to speech recognition / Patras / Integrated response/signal perceptual model for a single source in reverberant environments / Partially completed [Hatziantoniou & Mourjopoulos 03]
2.4 / Speech enhancement for reverberant environments / Patras / Research into auto-directive arrays, controlled from the perceptual directivity module / Partially completed
3.1 / Glottal excitation estimation / HUT / Research on combining new autoregressive (AR) models with inverse filtering / Ongoing
3.2 / Voice production studies / HUT / Inverse filtering experiments on high-pitched voices / First experiments and publications [Björkner & Sundberg 03]
3.3 / Voice production and cortical speech processing / HUT / Development of DSP algorithms for parameterisation of the voice source; getting familiar with MEG / Ongoing
4.1 / Developments in multisource decoding / USFD / Probabilistic decoding constraints / Implemented in current software
4.2 / Informing speech recognition / Liverpool / Design of predictive noise estimation algorithms; known BSS algorithms adopted as a common base for evaluation / Finalising intelligibility model for different noise types
4.3 / Advanced ASR algorithms / IDIAP / Multistream adaptation / Work to be reported at Eurospeech 03 and IEEE ASRU 03
5.1 / Speech recognition evaluation in multi-speaker conditions / DCAG / Database specification; targets for assessment report 1 / Evaluation experiments performed
5.2 / Signal and speech detection in sound mixtures / IDIAP / Analysis of auditory cues / Work reported in [Ajmera et al. 03] and [Lathoud et al. 03]
5.3 / Speech technology assessment by simulated acoustic environments / RUB / Simulation environment for hands-free communication developed / Completed; now being integrated into IKA telephone line simulation tool
B3.3 Research effort in the reporting period: Table 2
Participant / Young researchers financed by the contract (person-months) / Researchers financed from other sources (person-months) / Researchers contributing to the project (number of individuals)
1. USFD / 12 / 48 / 1 YR + 5 others = 6
2. RUB / 11.5 / 24 / 2 YRs + 2 others = 4
3. DCAG / 12 / 24 / 1 YR + 2 others = 3
4. HUT / 6 / 12 / 1 YR + 2 others = 3
5. IDIAP / 12 / 24 / 1 YR + 3 others = 4
6. LIVERPOOL / 7 / 25 / 1 YR + 2 others = 3
7. PATRAS / 0 / 6 / 0 YR + 1 other = 1
Totals / 60.5 / 163 / 7 YRs + 17 others = 24

B.4 Organisation and Management

B4.1 Organisation and management

HOARSE is being managed in the way described in Annex 1 of the contract. The non-executive director is Dr. Jordan Cohen of VoiceSignal Inc., Boston, MA. Administration is being handled from USFD by Linda Perna ().

B4.2 Communication Strategy

Most communication within HOARSE is conducted electronically. The HOARSE web site is www.hoarsenet.org. Meeting records and so on are on password-protected pages on that site. The email address for the whole network is .

Because this report covers the first year, there are few conference presentations by young researchers as yet. Full references are at the end.