M.Tech. Credit Seminar Report, Electronic Systems Group, EE Dept, IIT Bombay, submitted
November 2004
CO-CHANNEL SPEAKER SEPARATION
USING HARMONIC SELECTION METHOD
E. Ramanjireddy (Roll no. 04307036)
(Supervisor: Prof. Preeti S. Rao)
ABSTRACT
In speech transmission, a common type of interference is the speech of a competing talker. Although the human brain is adept at separating such speech, it relies partly on binaural data. When voices interfere over a single channel, separation is much more difficult. Separating such speech is a complex and varied problem whose nature changes with the moment-to-moment variation in the types of sounds that interfere. This seminar report describes a method for separating vocalic speech, known as the harmonic selection method, in which separation is done by selecting the harmonics of the desired voice in the Fourier transform of the input. To implement this method, techniques have been developed for handling overlapping spectrum components, for determining the pitch of the target speaker, and for assuring consistent separation.
CONTENTS
1. Introduction
2. Review of previous techniques
3. The overview of harmonic selection method
4. The preparation
5. The overlap detection
6. The peak separation
7. Pitch and harmonic estimation
8. The synthesis
References
1. INTRODUCTION
The human mind can concentrate on a single voice in a crowded party. However, psychoacoustic observation indicates that people with normal hearing can listen to almost two things at a time [2]. The ability to concentrate on a single voice also decreases when the speech is not binaural (binaural speech arrives from different directions). The perceptual segregation of sounds, called auditory scene analysis (ASA), is therefore a challenging problem in audition research. The emerging area of using computers for ASA is called computational auditory scene analysis (CASA).
In this seminar, one such system for target speaker enhancement, based on the harmonic selection method, is studied. In this method, the Fourier transform of the overlapped speech is taken, and the target talker's harmonics are separated in the spectrum by finding the pitch of that speaker. The required speech is then obtained from the separated harmonics by an inverse FFT.
Section 2 reviews previous techniques. Section 3 gives an overview of the whole harmonic selection method. Section 4 explains the initial processing that makes the speech signal ready for further analysis. Sections 5, 6 and 7 describe the detection of overlap, the separation of peaks, and pitch and harmonic estimation, respectively. The last section explains the reconstruction of the speech signal.
2. REVIEW OF PREVIOUS TECHNIQUES
A number of techniques have been proposed so far, each with its own limitations. Most current techniques focus on specific sound contexts, particularly voiced speech. One important requirement for any approach is that sound segregation should be incremental.
Some of the techniques are:
2.1 Blind signal separation (BSS)
In this method [7], the source signals are not observed and no information is available about the mixture. The BSS model assumes the existence of n independent signals $s_1(t), \ldots, s_n(t)$ and the observation of as many mixtures $x_1(t), \ldots, x_n(t)$, these mixtures being linear and instantaneous. The source signal and observed signal vectors can therefore be written, respectively, as
$s(t) = [s_1(t), \ldots, s_n(t)]^T$
$x(t) = [x_1(t), \ldots, x_n(t)]^T$
Since the observed signals are assumed to be linear mixtures of the source signals,
$x(t) = A\,s(t)$
where A is a linear operator that can be represented in matrix form. The aim of blind source separation is to find a linear operator B such that the components of the reconstructed signal
$y(t) = B\,x(t)$
are mutually independent, without knowing the operator A or the probability distribution of the source signal s(t). Blind source separation gives good results for two speakers but considerably worse results when three or more speakers are speaking simultaneously. Another limitation is that the assumption that the sound sources are independent of each other may be violated when mixtures of musical signals that contain harmony are processed. A further limitation is that the method needs attributes of the sounds to resolve the ambiguities in the order of the independent components and in their amplitudes before each original sound can be reconstructed.
2.2 Binaural harmonic based stream segregation (Bi-HBSS)
This method [8] can, in principle, handle any number of sounds and works as follows. The Bi-HBSS method takes binaural input and extracts the direction of the sound source by calculating the interaural time difference and the interaural intensity difference. It uses three kinds of agents: an event detector, a tracer generator and tracers. The event detector subtracts a set of predicted inputs (from the previous frame information) from the actual input and sends the residue to the tracer generator and the tracers. If the residue exceeds a threshold value, the tracer generator searches for a harmonic structure; if it finds a harmonic structure and its fundamental stream, it generates a tracer to trace that harmonic structure. Each tracer also composes a predicted next input by adjusting the segregated stream fragment to the next input, and sends this prediction to the event detector.
The problem with this approach is that the use of binaural inputs may cause spectral distortion, because the spectrum of the binaural input differs from that of the original sound owing to the shape of the human head. This transformation is described by the head-related transfer function (HRTF). Due to the HRTF [8], the power of lower frequencies is usually decreased while that of higher frequencies is increased. This may make it difficult to segregate a person's speech from that of other interfering speakers. Further, the sensitivity or resolution of direction is about 20°. So when more speakers talk from different places, the difference in direction becomes less than 20° and the directional information becomes useless.
2.3 Harmonic selection method
This approach [1] requires that the target-to-interference ratio (TIR) be sufficiently high and that both voices be periodic, i.e. the speech to be separated must be restricted to voiced speech. Since such sounds account for a major portion of the acoustical speech signal, and since they also furnish important articulatory information about adjacent consonants, this restriction may not be a serious one. The method is explained in detail in the following sections.
3. THE OVERVIEW OF HARMONIC SELECTION METHOD
The basic requirement of this method is that the voices to be separated must be periodic (i.e. the method applies only to voiced sounds); the explanation here is given for the case of two speakers. When two speakers sound simultaneously, the magnitude spectrum of the resulting speech waveform shows two trains of equally spaced harmonics superimposed on one another. The main task is to separate the required harmonic train by finding the pitch of the target speaker. Before separation, all the information about the peaks is entered in a peak table. Using this peak table and the separated harmonics, the target spectrum can be reconstructed. The inverse DFT of this spectrum gives the target speaker's voice. The block diagram in Figure 3.1 gives an overall view of the method.
Figure 3.1: Block diagram of the separation process [1] (input: speech signal; output: target speech)
The individual blocks are briefly explained below.
3.1 Preparation
It consists of A/D conversion, time-domain weighting (windowing) and Fourier transformation. The window length plays an important role here: it must be long enough for individual harmonics to be resolvable, but short enough to safeguard the validity of low-level peaks. Generally a Hanning window is preferred.
3.2 Reconstruction
It consists of reconstruction of the spectrum using the peak table and pitch harmonics, inverse Fourier transformation, linking of segments into continuous speech, and D/A conversion.
The central three processes are
3.3 Peak overlap detection
After the Fourier transformation, the spectral magnitude response contains a number of maxima, known as peaks. The frequency, amplitude and phase of each peak are entered in the peak table. Peaks overlap when integer multiples of the two speakers' pitches lie close to each other. Peak overlap is detected using:
1. Symmetry test
2. A reasonable distance from the adjacent peaks and
3. Well behaved phase
The symmetry test compares corresponding samples on opposite sides of the peak. Overlap is indicated when the difference exceeds a threshold value.
For closely spaced peaks on which the symmetry test fails, overlap is indicated by a monotonic change in the phase of the spectrum. The average rate of variation of phase around a peak is estimated, and if the actual rate exceeds the average rate by a threshold value, overlap is indicated.
3.4 Peak separation
After overlap detection, the overlapped peaks must be separated, which is done using a priori knowledge of the peak shape. Since the input is restricted to vocalic speech, the phonation will be approximately uniform over the 50 ms length of any individual segment. Because the peaks in an overlap sum linearly in the spectrum, a direct separation process based on complex addition and subtraction is used.
The parameters of the largest component are estimated by searching for a point (within a predetermined range) around the peak maximum. Subtracting the ideal peak shape from the observed one yields a difference peak, which is passed through the peak verification process. The parameters of these two peaks are added to the peak table.
3.5 Pitch and harmonic estimation
3.5.1 Pitch estimation
Schroeder’s frequency histogram method is used to find the pitch. The method makes use of the fact that the voice harmonics are harmonic to a high degree of precision. In this method, all integer submultiples of all peak frequencies are found and entered in a histogram. The largest entry in the histogram is taken to be the pitch.
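A minimal Python sketch of the histogram idea follows. The search range, the bin width and the tie-breaking rule are assumptions made for illustration and are not taken from [1]. In particular, sub-multiples of the true pitch collect as many entries as the pitch itself, so this sketch takes the highest-frequency bin among the largest entries.

```python
def schroeder_pitch(peak_freqs, f_min=60.0, f_max=400.0, bin_width=2.0):
    """Estimate the pitch (Hz) from peak frequencies with a submultiple histogram.

    f_min, f_max and bin_width are illustrative assumptions.
    """
    n_bins = int((f_max - f_min) / bin_width) + 1
    counts = [0] * n_bins      # number of submultiples falling in each bin
    sums = [0.0] * n_bins      # running sum, used to average the winning bin
    for f in peak_freqs:
        m = 1
        # Enter every integer submultiple f/m that lies inside the search range.
        while f / m >= f_min:
            candidate = f / m
            if candidate <= f_max:
                b = int((candidate - f_min) / bin_width)
                counts[b] += 1
                sums[b] += candidate
            m += 1
    largest = max(counts)
    if largest == 0:
        return None            # no candidate fell inside the search range
    # Assumed tie-break: highest-frequency bin among the largest entries.
    best = max(b for b in range(n_bins) if counts[b] == largest)
    return sums[best] / counts[best]
```

For peaks at exact multiples of 150 Hz, the bins at 150 Hz and 75 Hz receive the same number of entries, and the tie-break selects 150 Hz.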
3.5.2 Harmonic estimation
After obtaining the pitch, its harmonic frequencies are calculated. The peaks nearest to those harmonic frequencies are found and their parameters are entered in a new peak table. These blocks are explained in detail in the following sections.
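The harmonic selection step can be sketched as below. The dictionary layout of a peak-table entry, the use of frequencies in Hz and the tolerance parameter are hypothetical choices for this illustration; the report does not specify how close a peak must be to a harmonic to be accepted.

```python
def select_harmonics(peak_table, pitch_hz, max_freq_hz, tolerance_hz):
    """Build a new peak table holding the peak nearest to each pitch harmonic.

    peak_table: list of dicts with at least a "freq" key in Hz (assumed layout).
    tolerance_hz: assumed acceptance limit; harmonics with no peak this close
    are skipped.
    """
    selected = []
    k = 1
    while k * pitch_hz <= max_freq_hz:
        target = k * pitch_hz
        nearest = min(peak_table,
                      key=lambda p: abs(p["freq"] - target),
                      default=None)
        if nearest is not None and abs(nearest["freq"] - target) <= tolerance_hz:
            selected.append(nearest)
        k += 1
    return selected
```

With peaks at 150, 220, 301, 440 and 452 Hz and a pitch of 150 Hz, a 5 Hz tolerance keeps the peaks at 150, 301 and 452 Hz and leaves the interfering peaks at 220 and 440 Hz out of the new table.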
4. THE PREPARATION
The first task of the preparation module is to extract the required number of samples at the appropriate location. With the exception of the first and last frames, for the analysis of the i-th frame, WINDOW size samples are taken symmetrically from around the i-th frame. The sampled values are normalized after they are stored in an array, to maintain uniformity across all frames. The samples are then multiplied by a Hanning window of length WINDOW size.
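The frame preparation described above might look like the following sketch. Normalizing by the peak absolute value and filling out-of-range samples with zeros for the first and last frames are assumptions; the report only states that the values are normalized and that the end frames are treated as exceptions.

```python
import math

def hann_window(length):
    """Symmetric Hanning window; length must be at least 2."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / (length - 1))
            for n in range(length)]

def extract_frame(samples, centre, window_size):
    """Cut window_size samples around index centre, normalize and window them."""
    start = centre - window_size // 2
    # Assumed handling of the first and last frames: pad with zeros.
    frame = [samples[n] if 0 <= n < len(samples) else 0.0
             for n in range(start, start + window_size)]
    # Assumed normalization: scale so that the largest magnitude is 1.
    peak = max(abs(v) for v in frame)
    if peak > 0.0:
        frame = [v / peak for v in frame]
    return [v * w for v, w in zip(frame, hann_window(window_size))]
```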
4.1 Effect of windowing
Consider a sinusoidal signal,
$r(t) = A\cos(\Omega t + \phi)$
The corresponding discrete-time signal, assuming no aliasing error and no quantization error, is
$x[n] = A\cos(\omega n + \phi)$
where $\omega = \Omega T$ and T is the sampling period.
After applying the window w[n] to x[n], the resulting signal is
$v[n] = A\,w[n]\cos(\omega n + \phi)$
In complex exponential form,
$v[n] = \frac{A}{2}\,w[n]\,e^{j\phi}e^{j\omega n} + \frac{A}{2}\,w[n]\,e^{-j\phi}e^{-j\omega n}$
i.e. the spectrum consists of the Fourier transform of the window, replicated at the frequencies $\pm\omega$ and scaled by the complex amplitudes of the individual complex exponentials in the signal.
Windowing adds a linear phase variation to the phase of the sinusoid. To study the variation of the phase spectrum that is purely due to the signal, the effect of windowing must be eliminated. To achieve this, before taking the FFT, the windowed signal is shifted by half the window size in the opposite direction (making the window symmetrical about the origin). One side effect of this operation is that the original signal is also shifted by half the window size, which introduces an additional error into the actual phase. Since the error is constant, it can be taken care of in the analysis. If the Fourier transform is taken directly on the windowed signal, the FFT bin size (= sampling frequency / window length) is too large for analysis. Zero padding is therefore necessary: (FFT point - window length) zeros are added, which makes the bin size small.
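The shift and zero padding can be illustrated with a small sketch. The function names, the direct O(N^2) DFT (used here only to stay dependency-free) and the test parameters are assumptions. In this sketch the phase of the sinusoid is defined at the window centre, so the constant phase offset mentioned above is absorbed into that definition.

```python
import cmath
import math

def zero_phase_buffer(windowed, fft_size):
    """Zero-pad to fft_size and rotate so the window centre sits at index 0."""
    half = len(windowed) // 2
    buf = [0.0] * fft_size
    for n, v in enumerate(windowed):
        buf[(n - half) % fft_size] = v   # negative indices wrap to the end
    return buf

def dft(x):
    """Direct DFT, adequate for short illustrative buffers."""
    size = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / size)
                for n in range(size))
            for k in range(size)]

def demo_phase(window_size=33, fft_size=128, k0=8, phi=0.7):
    """Return the measured phase at bin k0 of a Hanning-windowed cosine."""
    half = window_size // 2
    omega = 2.0 * math.pi * k0 / fft_size
    windowed = [
        (0.5 - 0.5 * math.cos(2.0 * math.pi * n / (window_size - 1)))
        * math.cos(omega * (n - half) + phi)   # phase phi at the window centre
        for n in range(window_size)
    ]
    spectrum = dft(zero_phase_buffer(windowed, fft_size))
    return cmath.phase(spectrum[k0])
```

With the window centred on the origin its transform is real, so the phase read at the peak bin is the phase of the sinusoid at the window centre rather than that phase plus a linear term.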
4.2 Peak table formation
Since the magnitude spectrum is symmetric about (FFT point / 2), the analysis is done on the first half of the array. In the magnitude spectrum, a sample is considered a peak if its value is greater than both the previous and the next sample values. Once such peaks are found by running a loop over the whole spectrum, the following parameters of each peak are entered in a peak table:
1) Location of the peak in terms of the sample number
2) Magnitude of the peak
3) Phase of the peak in radians
4) Validity of the peak
The validity of a peak is determined from the peak spread, which is the sum of the left peak spread and the right peak spread. Each of these is the number of samples counted, moving away from the peak, until the spectrum just starts rising again. A peak is said to be valid if the sum of the left and right spreads exceeds a preset threshold value.
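The peak table formation described above can be sketched as follows. The dictionary keys and the min_spread parameter name are hypothetical; the threshold value itself is not given in the report.

```python
import cmath

def build_peak_table(spectrum, min_spread):
    """Scan the first half of a complex spectrum and tabulate its peaks.

    Each entry records the bin number, magnitude, phase (radians) and a
    validity flag based on the left plus right peak spread.
    """
    half = len(spectrum) // 2
    mag = [abs(c) for c in spectrum[:half]]
    table = []
    for k in range(1, half - 1):
        if mag[k] > mag[k - 1] and mag[k] > mag[k + 1]:
            # Left spread: samples passed until the spectrum stops falling.
            left, i = 0, k
            while i > 0 and mag[i - 1] < mag[i]:
                left += 1
                i -= 1
            # Right spread, counted the same way on the other side.
            right, i = 0, k
            while i < half - 1 and mag[i + 1] < mag[i]:
                right += 1
                i += 1
            table.append({
                "bin": k,
                "magnitude": mag[k],
                "phase": cmath.phase(spectrum[k]),
                "valid": left + right > min_spread,
            })
    return table
```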
5. THE OVERLAP DETECTION
When the harmonic frequencies of two simultaneous speakers are close to each other, peak overlapping occurs. This module therefore runs a loop over all the entries in the peak table, conducting symmetry and phase tests, and finally decides whether there is an overlap or a single peak with noise and side lobes.
Let d12 be the frequency difference between the two peaks and dw the width of the central lobe of the window spectrum. If d12 > dw, no overlap is declared. As d12 decreases, the two peaks merge and appear as a single peak, which in most cases is asymmetric about its maximum. At some values of d12, the overlapped peak appears symmetric even though it contains an overlap. In this case the phase test detects the overlap.
5.1 Symmetry test
Most overlapped peaks are asymmetric about their maxima. This test compares the values on either side of the maximum and, if the difference exceeds a certain value, declares an overlap. Since the peak value need not lie on one of the bin points, a polynomial interpolation is performed with a polynomial function f(x) fitted to the bin points. The function is evaluated at 1000 sample points, and the corresponding values on either side of its maximum are compared. If the difference is greater than a threshold value, the peak is said to be overlapped; otherwise the phase test comes into effect. The symmetry test fails when the ratio of the two peaks under overlap is very low, when the two harmonic frequencies are equal, and so on. The phase test is therefore conducted after the symmetry test.
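One possible form of the symmetry test is sketched below. Using a Lagrange polynomial through the bin magnitudes around the peak, and normalizing the largest mirrored difference by the peak height, are assumptions made for illustration; the report does not state the polynomial order or how the difference is scaled.

```python
def lagrange(xs, ys, x):
    """Evaluate the interpolating polynomial through (xs, ys) at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

def asymmetry(mags, n_samples=1000):
    """Largest mirrored difference about the interpolated maximum.

    mags: magnitudes at consecutive bins around one peak, with the peak
    near the middle. The result is divided by the peak height (assumed
    normalization), so a threshold can be chosen independently of level.
    """
    xs = list(range(len(mags)))
    lo, hi = xs[0], xs[-1]
    grid = [lo + (hi - lo) * i / (n_samples - 1) for i in range(n_samples)]
    values = [lagrange(xs, mags, x) for x in grid]
    i_max = max(range(n_samples), key=lambda i: values[i])
    # Compare only offsets for which both mirrored samples exist.
    reach = min(i_max, n_samples - 1 - i_max)
    worst = max((abs(values[i_max + d] - values[i_max - d])
                 for d in range(1, reach + 1)), default=0.0)
    return worst / values[i_max]
```

A peak would be declared overlapped when the returned value exceeds a chosen threshold. A symmetric shape such as [1, 3, 5, 3, 1] gives a value close to zero, while a peak with a shoulder on one side, such as [0.5, 1, 5, 4.5, 4], gives a much larger one.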