Performance of the Phase Transform for Detecting and Locating Sound Sources with Microphone Arrays in Reverberant and Noisy Environments

by Kevin D. Donohue (6/16/05)

The detection and location of sound sources in a room with microphone arrays can be enhanced by processing the individual microphone signals prior to the delay-and-sum operation. One method in particular, the Phase Transform (PHAT), has been demonstrated to improve sound source images, especially in reverberant and noisy environments. This paper presents a receiver operating characteristic analysis for detecting multiple sound sources in order to study how the PHAT helps distinguish true targets from noise peaks using a simple thresholding operation. Simulations demonstrate and quantify relationships between the type of noise and the impact of the PHAT. Variations on the PHAT are proposed to improve robustness of the steered response power method for source detection and location.

Introduction

The Steered Response Power (SRP) algorithm for sound source location (SSL) creates a likelihood function over the monitored space or field of view (FOV) through which the locations of sound sources can be identified. Practical implementations and applications of SSL require the detection of sources in the FOV before a location estimate is made, so that location estimates are not based on the noise response. Therefore, this paper examines SRP detection performance and signal processing enhancements that improve the separation of target and noise likelihood function peaks. Detection performance is assessed using the area under the Receiver Operating Characteristic (ROC) curve, estimated through Monte Carlo simulations. The impact of the phase transform (PHAT) is demonstrated for single and multiple targets in a FOV where the primary noise sources are room reverberation and coherent noise.

While several approaches exist for SSL, the SRP is typically the most robust to noise over a broad range of signal properties. The heart of the SRP is filter-and-sum (or delay-and-sum) beamforming. This operation is very robust to uncorrelated noise, resulting in a noise power reduction proportional to the number of microphone channels. Uncorrelated noise typically results from the independent (electronic) noise on each microphone channel. Correlated noise, on the other hand, results from coherent noise sources in the room, such as sources outside the FOV, multiple targets, and reverberation. Correlated noise presents greater challenges for detection performance than uncorrelated noise. While optimal approaches require knowledge of the spectral characteristics of the sound target and noise sources, such information is often not practical to obtain. Therefore, the PHAT has been suggested and applied for signal enhancement in the SRP algorithm with good results. The authors introducing this method indicate that it performs especially well with reverberation noise.

While results demonstrated in the literature have shown the PHAT to generate superior acoustic images, with sharper targets of interest and a relatively attenuated noise field, there have been no studies examining how well targets of interest can be separated from noise peaks in a manner directly related to detection performance.

This paper presents results from Monte Carlo simulations that compute the area under the ROC curve to compare implementations of the PHAT for single and multiple impulse-like targets, ranging from narrowband to wideband, as well as speech signals. In addition, various levels of noise from different sources are presented to identify the aspects of the PHAT that improve detection performance.

The Steered Response Power Algorithm

This section formulates the SRP algorithm with notation consistent with the parameters that will be directly related to performance through the simulations. Consider a 3-D space with microphones and sound sources at different spatial locations. Let s_i(t) be the pressure wave of the ith source located at position r_i, where r_i is a column vector denoting the x, y, and z axis coordinates. The waveform received by the pth microphone at r_p can be given as:

x_p(t) = h_{ip}(t) \ast s_i(t) + \sum_{k} n_k(t) + n_p(t) ,    (1)

where h() represents the combined impulse response of the microphone and propagation path (including multi-path) from ri to rp, nk(t) represents correlated noise sources resulting from other sound sources located at rk, andnp(t) representsuncorrelated noise results from the electronic noise in the sensor, amplifier, and digitizer on the pth microphone channel.

The coherent noise sources n_k(t) arise from ambient noises in the room (room noise), sources outside the FOV, and other targets and reflections inside the FOV that are not associated with the source at r_i. The impulse response h_{ip}(t) includes effects from both the direct and multiple paths. Figure 1 illustrates this environment. The inner rectangle with the broken line denotes the FOV with two targets of interest (i.e., two speakers talking simultaneously). Note that while estimating the SRP at one target, the other target within the FOV acts as an interfering noise signal, and vice versa. The solid rectangle denotes the room walls that generate room reverberations, along with independent noise sources such as traffic noise through a window or turbulence in air vents.

Figure 1. Noise and interference sources in the array signal from the acoustic source of interest, with microphones denoted by the small gray rectangles.

For reverberant rooms the impulse response can be separated into a signal component (direct path) and a noise component (reflected paths). If the primary operations on the sound source are assumed to be the effective delays from multiple reflections and the associated attenuation along the propagation paths, the impulse response can be written as:

h_{ip}(t) = \sum_{n=0}^{N_{ip}} a_{ipn}\, \delta(t - \tau_{ipn}) ,    (2)

where a_{ipn} denotes the nth path component of the effective impulse response associated with the source at r_i and the microphone at r_p, and \tau_{ipn} is the corresponding delay. The component corresponding to n = 0 is the direct path between the source and microphone. The algorithms for SSL operate on finite segments of time, so for a single frame of an SRP estimate the contributions of target and noise scatterers are limited to those received over a finite time window. Let the time interval for processing a single SRP image be denoted by:

\Delta_l = [\, t_l ,\; t_l + T \,] .    (3)

Therefore, the summation of the room impulse responses is limited to those terms corresponding to sound paths reaching the microphone in that processing interval. Equation (1), expressed in the frequency domain with the substitution of Eq. (2) over the time window associated with the SRP frame, becomes:

X_p(\omega) = S_i(\omega) \sum_{n \in \Delta_l} a_{ipn}\, e^{-j\omega\tau_{ipn}} + \sum_{k} N_k(\omega) + N_p(\omega) ,    (4)

where the summation index denotes summing only those scatterer delays falling within the time interval \Delta_l.
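To make the signal model concrete, the following sketch (illustrative only; the sample rate, path gains, delays, and noise levels are assumptions, not values from the simulations described later) builds a tapped-delay impulse response as in Eq. (2) and synthesizes a received waveform as in Eq. (1).

```python
# Minimal sketch of the signal model of Eqs. (1) and (2).
import numpy as np

fs = 16000                      # sample rate (Hz), assumed

def tapped_delay_response(gains, delays_s, fs, length):
    """Build h_ip(t) = sum_n a_ipn * delta(t - tau_ipn) on a sample grid."""
    h = np.zeros(length)
    for a, tau in zip(gains, delays_s):
        k = int(round(tau * fs))
        if k < length:
            h[k] += a
    return h

# Direct path (n = 0) plus two illustrative reflections.
gains  = [1.0, 0.4, 0.25]
delays = [0.004, 0.009, 0.013]          # seconds, assumed
h_ip = tapped_delay_response(gains, delays, fs, length=int(0.05 * fs))

rng = np.random.default_rng(0)
s_i = rng.standard_normal(int(0.1 * fs))      # source waveform s_i(t)
n_k = 0.3 * rng.standard_normal(s_i.size)     # coherent noise as received
n_p = 0.05 * rng.standard_normal(s_i.size)    # uncorrelated sensor noise

# Eq. (1): x_p(t) = h_ip(t) * s_i(t) + n_k(t) + n_p(t)
x_p = np.convolve(s_i, h_ip)[: s_i.size] + n_k + n_p
```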

The PHAT can be applied to the signal segment of Eq. (4). For this paper the PHAT will be generalized to adjust the level of influence of the magnitude spectrum. This transform will be referred to as the PHAT-β and is defined as:

\tilde{X}_p(\omega) = \frac{X_p(\omega)}{\left| X_p(\omega) \right|^{\beta}} ,    (5)

where β varies between 0 and 1. When β is 1, Eq. (5) becomes the conventional PHAT and the modulus of the transformed spectrum becomes 1 for all frequencies; when β is 0, the denominator is 1 and the PHAT-β has no effect. Therefore, by varying β between 0 and 1, the parametric impact of the modulus on performance can be studied.
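A minimal NumPy sketch of the PHAT-β weighting of Eq. (5) is given below; the frame length and the small regularization constant are assumptions added to avoid division by zero for empty frequency bins.

```python
# Sketch of the PHAT-beta of Eq. (5): divide each spectrum by its magnitude
# raised to the power beta.  beta = 1 gives the conventional PHAT (unit
# modulus); beta = 0 leaves the spectrum unchanged.
import numpy as np

def phat_beta(x, beta, eps=1e-12):
    """Apply the PHAT-beta weighting to a time-domain frame x."""
    X = np.fft.rfft(x)
    return X / (np.abs(X) ** beta + eps)   # eps guards against division by zero

# Example: sweep beta to study its parametric effect on a single frame.
frame = np.random.default_rng(1).standard_normal(1024)
spectra = {b: phat_beta(frame, b) for b in (0.0, 0.5, 1.0)}
```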

To develop a signal and noise analysis, separate the signal term from the noise terms of Eq. (4), resulting in a direct-path signal term:

X_p^{s}(\omega) = a_{ip0}\, S_i(\omega)\, e^{-j\omega\tau_{ip0}} ,    (6)

and the remaining multipath signal and independent noise terms:

X_p^{\nu}(\omega) = S_i(\omega) \sum_{n \ge 1,\, n \in \Delta_l} a_{ipn}\, e^{-j\omega\tau_{ipn}} + \sum_{k} N_k(\omega) + N_p(\omega) .    (7)

Applying the PHAT-β of Eq. (5), Eq. (4) can then be separated into signal and noise phasor terms. This is expressed as:

\tilde{X}_p(\omega) = \frac{\left| X_p^{s}(\omega) \right|}{\left| X_p(\omega) \right|^{\beta}}\, e^{j\phi_{sp}(\omega)} + \frac{\left| X_p^{\nu}(\omega) \right|}{\left| X_p(\omega) \right|^{\beta}}\, e^{j\phi_{\nu p}(\omega)} .    (8)

Note that the two resulting phasor terms are weighted proportionally by power-related terms for the respective signal and noise at each frequency (the actual signal power for β = 1). Ideally, if the expected spectra for signal and noise were known, a Wiener filter weighting could be implemented to yield an optimal signal-to-noise ratio. This information is often not practical to obtain in a general acoustic environment, where source spectra change (for different sources) along with interference patterns and noise sources. The PHAT does not guarantee an improvement in the overall SNR of the signal. For wideband signals with significant nonuniformity over the spectrum and a low noise floor, the PHAT tends to enhance SNR by increasing the signal energy over the spectrum more than the noise components. On the other hand, for narrowband signals that begin with good SNR, the PHAT amplifies the low-power regions of the spectrum that contain only noise and no signal energy.

The SRP can be expressed as a filter-and-sum operation corresponding to a pixel at r_i, followed by the summation of magnitudes squared for the power computation. The delay, sum, and coherent power computation can be expressed in the discrete frequency domain as:

S_{il} = \sum_{k} \left| \sum_{p=1}^{P} W_p(k\Delta\omega)\, \tilde{X}_p(k\Delta\omega) \right|^{2} - \sum_{k} \sum_{p=1}^{P} \left| W_p(k\Delta\omega)\, \tilde{X}_p(k\Delta\omega) \right|^{2} ,    (9)

where the summation limits on kcorrespond to the expected spectral range of the sound sources, is the frequency domain sampling interval, is the complex weight representing the delay and weighting associated with the image location atriand microphone geometry, Sil() is the coherent power used for creating the pixel amplitudes of SRP image. The last summation term represents the signal powers at each individual microphone, and it provides no information on the source location. Therefore, removing these power terms reduces their contribution to the variance of the SPR valueswith no loss in location information. Only terms directly related to the correlations between microphone pairs remain. Since the SPR values no longer contain the power terms of each microphone signal, the resulting values can be negative as well as positive. This power will be referred to as coherent power because it represents similarity between different microphone channels.

To observe the signal and noise contributions after the delay and sum, substitute Eq. (8) into Eq. (9) for the image pixel at r_i to obtain:

S_{il} = \sum_{k} \sum_{p \ne q} \left[ A_p A_q e^{j(\phi_{sp}-\phi_{sq})} + A_p B_q e^{j(\phi_{sp}-\phi_{\nu q})} + B_p A_q e^{j(\phi_{\nu p}-\phi_{sq})} + B_p B_q e^{j(\phi_{\nu p}-\phi_{\nu q})} \right] ,    (10)

where the notation has been reduced by combining the magnitude weighting and the PHAT-β scaling into single terms given by:

A_p(k) = \frac{\left| W_p(k\Delta\omega) \right| \left| X_p^{s}(k\Delta\omega) \right|}{\left| X_p(k\Delta\omega) \right|^{\beta}} \quad \text{and} \quad B_p(k) = \frac{\left| W_p(k\Delta\omega) \right| \left| X_p^{\nu}(k\Delta\omega) \right|}{\left| X_p(k\Delta\omega) \right|^{\beta}} .    (11)

To obtain the expected SRP amplitude associated with the target, the expected value operator can be applied to Eq. (10) with the assumptions that the phase terms for the noise are uniform over the interval [-π, π], and the residual phases in the first term are uniform over the interval [-ε, ε]. The 2ε interval models the quantization error of the grid spacing, or resolution, of the SRP image. In general, the grid sampling must be fine enough that kΔωε remains small for a significant portion of the signal spectrum, so the target source can be distinguished from the noise. The uniform phase assumption over the 2π interval results in the noise terms and the cross noise-target terms going to zero under the expected value operator over all cross-microphone pairs. Therefore, the expected SRP value at a pixel where a target is present becomes:

E\left[ S_{il} \mid H_1 \right] = \sum_{k} \mathrm{sinc}^{2}\!\left( k\Delta\omega\, \varepsilon \right) \sum_{p \ne q} \left\langle A_p(k) A_q(k) \right\rangle ,    (12)

where the brackets denote the mean value for all target signal magnitude pair products in the microphone array, and H1 is the condition that a target source is present.
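As a quick numerical sanity check on the phase assumptions (a standalone sketch, not part of the simulation code), a phase uniform over [-π, π] has a zero-mean phasor, so the noise and cross terms average to zero, while a phase uniform over [-ε, ε] has a mean phasor of sin(ε)/ε, the sinc-type weighting on the target term:

```python
# Monte Carlo check: mean of exp(j*theta) for theta uniform on [-eps, eps]
# equals sin(eps)/eps, and is zero for eps = pi.
import numpy as np

rng = np.random.default_rng(2)
for eps in (np.pi, 1.0, 0.3):
    theta = rng.uniform(-eps, eps, size=200_000)
    mc = np.mean(np.exp(1j * theta)).real
    print(f"eps={eps:4.2f}: Monte Carlo {mc:+.4f}  vs  sin(eps)/eps {np.sin(eps)/eps:+.4f}")
```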

Note that when k (argument of sinc function) the expected value of the target amplitude goes to zero. Since this condition depends on target signal frequency and the grid resolution, a relationship can be derived to limit the variability of a single frequency term due to the quantization of the pixel grid. Assume that a maximum of 3 dB variations can be tolerated and consider the worst case variations, which are the diagonal distances between pixelsand the highest frequency in the signal. These conditions can be combined with the sinc function of Eq. (12) for a single frequency to result in the following grid spacing bound (with Δgrid denoting the equal horizontal and vertical spacing):

\Delta_{grid} \le \frac{0.44\, c}{\sqrt{d}\, f_h} ,    (13)

where d is the dimensionality of the SRP space, fh is the highest frequency for which the target magnitude will vary less than 3 dB due to quantization of the grid, and c is the speed of sound in the FOV space.
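As a quick numerical check (a sketch using the form of Eq. (13) above, with the sound speed and grid spacing from the simulation section), the highest frequency with less than 3 dB quantization variation in a 2-D image plane comes out near 5.4 kHz, matching the value quoted later:

```python
# Worked evaluation of the grid spacing bound of Eq. (13).
import numpy as np

c = 348.0          # speed of sound (m/s), as in the simulations
d = 2              # dimensionality of the SRP image (planar FOV)
grid = 0.02        # pixel spacing (m), as in the simulations

f_h = 0.44 * c / (np.sqrt(d) * grid)
print(f"highest frequency within 3 dB: {f_h / 1000:.1f} kHz")   # ~5.4 kHz
```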

The variance of the target pixel results from both the quantization effects and contributions from the noise sources. With the same assumptions used in computing the expected value of Eq. (12), the variance of the SRP value from Eq. (10) is:

(14)

The first term accounts for the pixel quantization, the middle term results from correlations between the signal and noise magnitudes, and the last term results from the noise sources and relates directly to the sum of the off-diagonal magnitudes of the coherence matrix. In order to consider the detectability of the process, the mean and variance of the pixels where a target is not present must also be given. These are obtained by letting the target energy go to zero in Eqs. (12) and (14), yielding a mean value of:

E\left[ S_{il} \mid H_0 \right] = 0 ,    (15)

and variance

(16)

The SNR for this study is defined as the ratio of the power of the direct-path target to all other power received at the microphones, given by:

\mathrm{SNR} = \frac{\sum_{p} \sum_{k} \left| X_p^{s}(k\Delta\omega) \right|^{2}}{\sum_{p} \sum_{k} \left| X_p^{\nu}(k\Delta\omega) \right|^{2}} .    (17)

Since this work focuses on detecting target sources in an SRP image, a metric related to the separation between the SRP pixel values where a target is present (H1) and where it is not (H0) is useful for understanding the relationship between parameters (such as β) and detection performance. One such measure is Fisher's criterion, which divides the squared distance between the mean pixel values of conditions H1 and H0 by the sum of the variances of each class. For the SRP image this results in:

F(\beta) = \frac{\left( E\left[ S_{il} \mid H_1 \right] - E\left[ S_{il} \mid H_0 \right] \right)^{2}}{\mathrm{var}\left( S_{il} \mid H_1 \right) + \mathrm{var}\left( S_{il} \mid H_0 \right)} .    (18)

The dependence on is explicitly shown in this notation, since simulations in the next section will explore this metric as a function of these parameters.

A performance measure more directly related to automatic target detection is the area under the receiver operating characteristic (ROC) curve. The ROC curve plots the probability of detection versus the probability of false alarm for all thresholds. For a discrete set of SRP pixel values S_{il} (from Eq. (9)), for cases where a target was and was not present, the ROC curve area can be computed using the function defined as:

I\left( S_i^{H_1}, S_k^{H_0} \right) = \begin{cases} 1 & S_i^{H_1} > S_k^{H_0} \\ 0.5 & S_i^{H_1} = S_k^{H_0} \\ 0 & S_i^{H_1} < S_k^{H_0} \end{cases} ,    (19)

where i and k represent independent estimates under the indicated hypotheses. Given n1 samples from H1, and n0 samples from H0, the ROC area can be computed from:

A_{ROC} = \frac{1}{n_1 n_0} \sum_{i=1}^{n_1} \sum_{k=1}^{n_0} I\left( S_i^{H_1}, S_k^{H_0} \right) .    (20)
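Equations (19) and (20) correspond to the Wilcoxon-Mann-Whitney estimate of the ROC area, and can be computed directly from the two sets of pixel values, as in the sketch below (the sample data are illustrative only).

```python
# Sketch of the ROC area of Eqs. (19) and (20): compare every H1 pixel value
# against every H0 pixel value, scoring 1 for a correct ordering, 0.5 for a
# tie, and 0 otherwise, then average the scores.
import numpy as np

def roc_area(s_h1, s_h0):
    s_h1 = np.asarray(s_h1)[:, None]     # n1 x 1
    s_h0 = np.asarray(s_h0)[None, :]     # 1 x n0
    wins = (s_h1 > s_h0).astype(float) + 0.5 * (s_h1 == s_h0)
    return wins.mean()                   # equals the double sum divided by n1*n0

# Example: well-separated classes give an area near 1, overlapping ones near 0.5.
rng = np.random.default_rng(3)
print(roc_area(rng.normal(2.0, 1.0, 500), rng.normal(0.0, 1.0, 500)))
```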

Simulations

The impact of the β parameter in the PHAT will be examined through Monte Carlo simulations for a range of signals in coherent and reverberant noise. Performance will be evaluated in terms of the Fisher criterion of Eq. (18) and the ROC area of Eq. (20). The three main components of the simulation are the room, the noise signals, and the target signals. The following describes the parameters associated with these components.

The room dimensions were 3.62 m for length and width, and 2.22 m for height. The walls were assigned a reflectivity of 0.95, and the ceiling and floor had a reflectivity of 0.9. However, in the simulation these values were scaled from 0 to 95% to represent different levels of room reflectivity. The room impulse response for a given target position was computed as described in []. In addition, an air attenuation factor was applied to the received signal in the form of an FIR filter, based on the distance traveled along each path. The speed of sound was taken to be 348 m/s.

Room Property / Value
Length / 3.62 m
Width / 3.62 m
Height / 2.22 m
Mean surface reflection coefficient / variable from 0 to 0.89
Air absorption factor / -32.8×10^-6 dB/(m·Hz)
Sound speed / 348 m/s
Microphone geometry / 8 microphones, equilateral hexagon
Microphone spacing / 1.29 m

The noise and target sound sources were created within the room at a plane height of 1.57 m: the target sources were randomly placed in a square FOV of dimension 2.4 m, positioned symmetrically within the room, and the coherent noise sources were randomly placed outside the FOV. In all simulation runs, two coherent noise sources were independently generated as white noise and randomly placed within the room but outside the FOV. For the targets, two basic simulated signals were created as the impulse responses of Butterworth filters, each with a lower 3 dB cutoff of 400 Hz and an upper cutoff frequency of 600 Hz (for the narrowband signal) or 5600 Hz (for the broadband signal). The Butterworth signal was chosen because it has a maximally flat spectrum, making it easy to control the distribution of spectral power, while its impulse response is a causal signal with an appropriate phase spectrum. This was a general signal from which performance for narrowband and broadband signals could be inferred. In addition to the Butterworth signals, a real signal was recorded of an adult female saying the word 'shoe.' This signal provided a limited degree of validation by comparing performance for the fricative SH with the broadband case, and for the voiced part UW with the narrowband case. All signals were placed for each run using a uniform random distribution over their respective areas. The grid spacing was set at 0.02 m, which according to Eq. (13) implies that all signal energy below 5.4 kHz will vary by less than 3 dB due to the random placement in the image plane.
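A minimal sketch of generating these simulated target signals with SciPy is shown below; the sampling rate, filter order, and signal length are assumptions not specified above.

```python
# Sketch of the simulated targets: the impulse response of a bandpass
# Butterworth filter with a 400 Hz lower cutoff and either a 600 Hz
# (narrowband) or 5600 Hz (broadband) upper cutoff.
import numpy as np
from scipy.signal import butter, lfilter, unit_impulse

fs = 16000                      # sample rate (Hz), assumed
order = 4                       # filter order, assumed

def butterworth_target(f_low, f_high, fs, order, n_samples=2048):
    b, a = butter(order, [f_low, f_high], btype="bandpass", fs=fs)
    return lfilter(b, a, unit_impulse(n_samples))

narrowband = butterworth_target(400.0, 600.0, fs, order)
broadband  = butterworth_target(400.0, 5600.0, fs, order)
```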
