ADAPTATION TECHNIQUES FOR SPEAKER RECOGNITION

Costas Boulis

ABSTRACT

Several adaptation techniques are compared on the task of speaker recognition using the Switchboard database. Adaptation techniques have proven successful for speech recognition, and a number of them are investigated here for their usefulness in speaker recognition, since the two tasks share many issues. Both transformation-based and approximate Bayesian approaches are used. The proposed techniques are compared with a baseline system consisting of a mixture of Gaussians for each target speaker. Results show that adaptation methods can outperform standard ML techniques.

1. PROBLEM DEFINITION

The problem presented to us was the following. Given 21 target speakers, perform 21 binary classifications (one for each target speaker) for each test sentence. Each binary classification is a YES if the sentence belongs to the target speaker and NO otherwise. Under this setting, a sentence may be decided to have been generated by more than one speaker, in which case there will be at least one false alarm. Also, some of the test sentences were spoken by non-target speakers (impostors). Some of the impostors appear in the training set and some do not.

All the data were from the Switchboard database and all speakers were male. The data were partitioned into three non-overlapping sets: training, development and evaluation. The training set included about 2 minutes of speech from each of the target speakers and about 1 minute of speech from each of 21 impostors (non-target speakers), called the required set. An extra minute of speech from each of the 21 impostors was also included in a separate set (the extra set). The development set consisted of 343 sentences from target speakers, previously seen impostors and 22 unseen impostors. The evaluation set consisted of 617 sentences from target speakers and from impostors both previously seen and unseen (in both the training and development sets).

The deliverables of the project were a system built on the training data and the required set of impostors, and an optional system built on all available data (excluding the evaluation set). All systems were scored against the evaluation set, whose key was unknown to us.

2. BASELINE SYSTEM

The baseline system consisted of a mixture of Gaussians trained on each of the target speakers. All parameters of the mixture (weights, means and covariances) were estimated using the EM algorithm [4]. The number of Gaussians in each mixture was also treated as an unknown parameter and was estimated using a held-out set. All Gaussians were chosen to have diagonal covariance matrices to avoid numerical errors due to inversion of close-to-singular matrices. Diagonal Gaussians also offer finer granularity in the model: a full-covariance Gaussian has d×d covariance parameters to estimate, whereas a diagonal-covariance Gaussian has only d. Therefore, each time we increase the number of Gaussians in a mixture we add d×d parameters in the former case but only d in the latter. In addition to these training issues, a mixture of diagonal Gaussians is much faster to evaluate.
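To make the later scoring concrete, the following is a minimal sketch (not the original implementation) of the per-sentence log likelihood under a diagonal-covariance mixture; the gmm dictionary layout and the function name gmm_loglik are assumptions used only for illustration.

import numpy as np

def gmm_loglik(frames, gmm):
    """Total log likelihood of a sentence under a diagonal-covariance GMM (sketch).

    frames : (N, d) array of feature vectors
    gmm    : dict with 'weights' (M,), 'means' (M, d), 'vars' (M, d)
    """
    w, mu, var = gmm["weights"], gmm["means"], gmm["vars"]
    diff = frames[:, None, :] - mu[None, :, :]                   # (N, M, d)
    log_det = np.sum(np.log(var), axis=1)                        # (M,)
    quad = np.sum(diff ** 2 / var[None, :, :], axis=2)           # (N, M)
    log_norm = -0.5 * (mu.shape[1] * np.log(2.0 * np.pi) + log_det)
    log_comp = np.log(w)[None, :] + log_norm[None, :] - 0.5 * quad
    # log-sum-exp over components, then sum over frames
    m = log_comp.max(axis=1, keepdims=True)
    per_frame = m[:, 0] + np.log(np.sum(np.exp(log_comp - m), axis=1))
    return per_frame.sum()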

The training data for each one of the target speakers were split into two equally sized sets, the held-in and the held-out sets. To determine the model order, a mixture of known order was estimated on the held-in set and the log likelihood of the mixture on the held-out set was calculated. A robust stopping criterion was used to determine the true model order of the mixture. The entire training procedure is best described with the following pseudocode:

M = 1, initialize with the global mean and global variance

while the stopping criterion described below is not satisfied

    M = M + 1

    find the Gaussian with the highest weight and split it by perturbing its mean into two new means

    set the initial parameter values to the current parameter values, with the selected Gaussian replaced by the two new Gaussians

    new parameter values = EM(held-in set, number of iterations, M, initial parameter values)

    L(M) = log likelihood(held-out set, new parameter values)

end

Final parameter values = EM(held-in + held-out sets, number of iterations, M-3, initial parameter values)

The robust stopping criterion is needed in practice because the log likelihood on the held-out set is not monotonic in the number of Gaussians, so a naive rule that stops at the first decrease may halt at a local maximum. The algorithm works by observing windows of likelihoods instead of individual likelihoods and stops if the current window does not have significantly higher likelihood than the window of a model with 10 fewer Gaussians. The number of EM iterations was always kept constant at 3. The value of the threshold T was set to 0.03.
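A minimal sketch of such a window-based rule is given below; the window width, the exact form of the significance test and the function name are assumptions, since only the lag of 10 Gaussians and the threshold T = 0.03 are stated above.

import numpy as np

def should_stop(heldout_ll, window=3, lag=10, T=0.03):
    """Window-based stopping rule (sketch).

    heldout_ll : list of held-out log likelihoods, heldout_ll[i] belongs to the
                 mixture with (i + 1) Gaussians.
    Stops when the mean of the current window is not significantly higher
    (relative threshold T) than the mean of the window `lag` model orders earlier.
    """
    if len(heldout_ll) < lag + window:
        return False                                  # not enough history yet
    cur = np.mean(heldout_ll[-window:])               # most recent window
    old = np.mean(heldout_ll[-lag - window:-lag])     # window `lag` orders earlier
    return (cur - old) < T * abs(old)                 # no significant improvement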

For impostor modeling I tried two alternatives. The first was to estimate B speaker-specific mixtures (one for each impostor); during testing, each sentence is scored against each of the B models and the likelihoods are averaged over all impostors to give the impostor likelihood. The second alternative was to estimate a single model for all impostors by pooling their data; for this alternative the number of Gaussians can become very large, so because of computational constraints I limited the model order to 100. The second alternative was significantly better on the development set, as explained in the experiments section.

The score of each sentence for a target speaker is the difference between the log likelihood under the target speaker's model and the impostor log likelihood.
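As a sketch, and reusing the hypothetical gmm_loglik function from above, the score and the two impostor alternatives can be written as:

import numpy as np

def sentence_score(frames, target_gmm, impostor_gmms=None, pooled_gmm=None):
    """Target log likelihood minus impostor log likelihood (sketch)."""
    target_ll = gmm_loglik(frames, target_gmm)
    if pooled_gmm is not None:
        # alternative 2: one large model trained on pooled impostor data
        impostor_ll = gmm_loglik(frames, pooled_gmm)
    else:
        # alternative 1: average the per-impostor likelihoods (in the probability
        # domain, computed stably with log-sum-exp)
        lls = np.array([gmm_loglik(frames, g) for g in impostor_gmms])
        m = lls.max()
        impostor_ll = m + np.log(np.mean(np.exp(lls - m)))
    return target_ll - impostor_ll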

3. ADAPTATION TECHNIQUES

Adaptation techniques have become an important tool in speech recognition [1],[2],[3] and have been shown to adapt effectively to speaker characteristics. Although speaker recognition is a different task from speech recognition, the two share many issues. For example, we can think of the mixture of Gaussians as modeling the different sound classes of a speaker. Under this perspective, one may expect better speaker recognition performance by building a better speech recognition system.

Because of the limited data available, the true pdf of a speaker should be better approximated by using a large number of Gaussians, higher than the number that standard ML techniques (such as EM) can reliably estimate. To do so, the data from other speakers must be used to model the speech sounds, which are then appropriately mapped to the current speaker.

For each one of the target speakers, I pooled the data from all the other target speakers and built a mixture of 100 Gaussians. Therefore, each target speaker has a speaker-independent (SI) system associated with him. Then I use the target speaker’s data to adapt the SI system to better fit his speech characteristics. I experimented with transformation-based as well as Bayesian approaches.

Before proceeding to the adaptation methods, I should note that an SI system built as just described models two things: speaker variability and speech variability. This compromises its modeling power for speaker recognition, but it is a relatively easy SI system to implement.

3.1. Transformation-based approaches

In transformation-based approaches we assume that the speaker-specific data are generated by the following process:

$y = Ax + b$     (1)

where y is the speaker-specific frame of dimensionality d, x is the SI frame, which is considered hidden but with a known pdf, and A and b are the transformation parameters. In the general case, A is a d×d matrix and b is a non-zero vector of dimensionality d. Under this assumption, if the SI mixture is given by:

$p(x) = \sum_{j=1}^{M} w_j\, N(x;\ \mu_j,\ \Sigma_j)$     (2)

then the adapted mixture will be described by:

$p(y) = \sum_{j=1}^{M} w_j\, N(y;\ A\mu_j + b,\ A\Sigma_j A^{T})$     (3)

We observe that the new speaker-specific pdf consists of 100 Gaussians, many more than the roughly 30 Gaussians that would be used if standard ML techniques were employed. The same transformation is used for all Gaussians, which enables robust estimation of the transformation parameters. The weights are left unchanged. Also, if A is non-diagonal then the resulting covariances will be non-diagonal as well. Therefore, A is kept diagonal in our first model.
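A minimal sketch of applying such a diagonal transform to the SI mixture, using the same hypothetical dictionary layout as in the earlier sketch:

import numpy as np

def adapt_mixture_diag(gmm_si, a_diag, b):
    """Apply y = A x + b (A diagonal) to a diagonal SI GMM, as in eq. (3) (sketch).

    a_diag : (d,) diagonal of A;  b : (d,) bias vector
    """
    return {
        "weights": gmm_si["weights"],              # weights are left unchanged
        "means": gmm_si["means"] * a_diag + b,     # A mu_j + b
        "vars": gmm_si["vars"] * a_diag ** 2,      # A Sigma_j A^T in the diagonal case
    }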

The transformation parameters can be estimated using the EM algorithm. Here I will omit the intermediate steps and I will just list the final re-estimation equations.

(4)

(5)

where $a_k$ is the k-th diagonal element of A, $b_k$ is the k-th element of b, N is the total number of training samples, $\gamma_j(n)$ is the probability that sample n was generated by Gaussian j, and $r_{jk}$ is the k-th precision coefficient (inverse variance) of Gaussian j. The initial values of $a_k$ and $b_k$ are one and zero respectively for every k. The equation for $a_k$ always yields two real values; I keep the one with the smaller change from the previous iteration. Three EM iterations were used for all experiments.

The diagonal transformation may be too restrictive and thus the next step is to define a less restrictive transformation which can be robustly estimated as well. The following model:

$\mu_j' = F\mu_j + b$     (6)

$\Sigma_j' = D\,\Sigma_j\,D^{T}$     (7)

is a cascade combination of a full transform F on the means followed by a diagonal transform D on the variances. Note that equation (1) as the generative model is no longer valid in this case. The re-estimation equations for the full transform are:

(8)

(9)

(10)

where $f_k$ is the vector consisting of the k-th row of the full matrix F. We observe that estimating the full transformation F requires the solution of d linear systems, each with d unknowns. Methods that are efficient and robust to numerical errors (such as Singular Value Decomposition) can be applied to solve the linear systems instead of explicitly inverting the system matrices. Again, three EM iterations are used, and the initial values are the identity matrix for F and the zero vector for b.
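As a sketch of the numerical step only (the accumulation of the per-row system matrices and right-hand sides from the EM statistics is not shown, and the names G and z are hypothetical), each row of F can be obtained with an SVD-based least-squares solve rather than an explicit inverse:

import numpy as np

def solve_rows(G, z):
    """Solve the d per-row linear systems G[k] f_k = z[k] for the rows of F (sketch).

    G : (d, d, d) array, G[k] is the system matrix for row k
    z : (d, d) array, z[k] is the corresponding right-hand side
    """
    d = z.shape[0]
    F = np.zeros((d, d))
    for k in range(d):
        # np.linalg.lstsq uses an SVD internally and is robust to near-singular G[k]
        F[k], *_ = np.linalg.lstsq(G[k], z[k], rcond=None)
    return F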

The re-estimation equation for the diagonal transformation to variances is similar to equation (4) and will be omitted.

3.2. Bayesian approaches

Transformation-based approaches assume the same transformation (either full or diagonal) for all Gaussians and thus may be too restrictive. Bayesian approaches do not have this limitation and can re-estimate all the parameters of a mixture, assuming a prior distribution. A suitable prior in our case is given by the parameter values of the SI system. An approximate Maximum A Posteriori (AMAP) technique is used, in which the sufficient statistics of the SI data are smoothed with the sufficient statistics of the speaker data. More specifically, the zeroth, first and second order sufficient statistics are smoothed according to:

$n_j = \sum_{n \in SI} \gamma_j(n) + \lambda \sum_{n \in spk} \gamma_j(n)$

$m_j = \sum_{n \in SI} \gamma_j(n)\,x_n + \lambda \sum_{n \in spk} \gamma_j(n)\,x_n$     (11)

$S_j = \sum_{n \in SI} \gamma_j(n)\,x_n x_n^{T} + \lambda \sum_{n \in spk} \gamma_j(n)\,x_n x_n^{T}$

where the second term in each equation is the summation over the speaker's data. The constant $\lambda$ controls the a priori weight used for re-estimation. If $\lambda > 1$ the target speaker samples have more weight than samples from other speakers; one sample from the target speaker is then equivalent to $\lambda$ samples from other speakers. This smoothing technique is called approximate because only one iteration is used, since re-calculating the sufficient statistics of the SI system is computationally intensive. Another approximation to the true MAP approach is that the weight is kept constant for every Gaussian j. The value of $\lambda$ can be optimized using a held-out set.
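A minimal sketch of this smoothing, with the statistic names and dictionary layout assumed only for illustration:

import numpy as np

def amap_smooth(stats_si, stats_spk, lam=5.0):
    """Smooth zeroth/first/second order sufficient statistics as in eq. (11) (sketch).

    Each stats dict holds, per Gaussian j:
      'n'  : (M,)    summed posteriors
      'x'  : (M, d)  summed posterior-weighted frames
      'xx' : (M, d)  summed posterior-weighted squared frames (diagonal case)
    """
    n = stats_si["n"] + lam * stats_spk["n"]
    x = stats_si["x"] + lam * stats_spk["x"]
    xx = stats_si["xx"] + lam * stats_spk["xx"]
    # re-estimate the adapted diagonal-Gaussian parameters from the smoothed statistics
    means = x / n[:, None]
    variances = xx / n[:, None] - means ** 2
    weights = n / n.sum()
    return {"weights": weights, "means": means, "vars": variances}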

Another Bayesian approach is the hierarchical bias. Under this approach a bias vector is estimated for each of the Gaussians. To obtain robust estimates of the biases, we smooth them with the global bias. That is, the pdf for each speaker is re-estimated using:

$\mu_j' = \mu_j + b_j$     (12)

$b_j = \lambda_j\,\hat{b}_j + (1 - \lambda_j)\,b_{g}$     (13)

where $\hat{b}_j$ is the bias estimated from the speaker data for Gaussian j and $b_{g}$ is the global bias.

The weights $\lambda_j$ should depend on the number of samples associated with Gaussian j, but the best results were obtained with a constant weight, optimized on the dev set. The re-estimation equation for the per-Gaussian biases can be obtained by making the corresponding substitution in equation (5) and eliminating the summation over mixture components.
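A minimal sketch of the bias smoothing; the count-based heuristic weight and the way the global bias is formed here are assumptions for illustration only:

import numpy as np

def hierarchical_bias(gmm_si, bias_per_gauss, n_per_gauss, weight=None, tau=10.0):
    """Smooth per-Gaussian biases with the global bias and shift the SI means (sketch).

    bias_per_gauss : (M, d) bias estimated for each Gaussian from the speaker data
    n_per_gauss    : (M,)   summed posteriors (soft counts) per Gaussian
    weight         : constant smoothing weight in [0, 1]; if None, a hypothetical
                     count-based weight n_j / (n_j + tau) is used instead
    """
    counts = n_per_gauss[:, None]
    # global bias: count-weighted average of the per-Gaussian biases (an assumption)
    b_global = (bias_per_gauss * counts).sum(axis=0) / counts.sum()
    lam = weight if weight is not None else counts / (counts + tau)
    b = lam * bias_per_gauss + (1.0 - lam) * b_global      # eq. (13)
    adapted = dict(gmm_si)
    adapted["means"] = gmm_si["means"] + b                 # eq. (12)
    return adapted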

Note that the hierarchical bias method and AMAP can be cascaded, giving AMAP a better prior than the SI system.

4. EXPERIMENTS

First, a mixture of Gaussians is estimated for each of the target speakers using the algorithm and the stopping criterion described in Section 2. For impostor modeling I compared the two alternatives: averaging the scores across speaker-dependent impostor models, and a single large speaker-independent impostor model consisting of 100 Gaussians. On the dev set the first alternative gave an 11.1% equal error rate (EER) and the second 7.9%. The superiority of the SI impostor model is mainly due to the large number of unseen impostors in the dev set; the SI model generalized better to unseen impostors and was used for all subsequent experiments.

Table 1 summarizes the performance of each one of the methods in the dev set.

Method / EER (%)
Diagonal transform / 37.1
AMAP / 24.6
Hierarchical bias (automatic weights) / 20.3
Hier. bias (automatic weights) + AMAP / 16.1
Full + Diagonal transform / 7.9
Hier. bias (no prior) / 7.9
Hier. bias (no prior) + AMAP / 7.9
Speaker dependent mixtures (baseline) / 7.9

Table 1. Performance of the baseline and the various adaptation methods on the development set.

The automatic weights of the hierarchical bias are set using a heuristic rule that makes the weight of Gaussian j depend on the number of samples associated with it, and substituting this expression in equation (13). Note that for the AMAP approach the a priori weight $\lambda$ was set to 5, so the results are not optimal. The proper way of choosing $\lambda$ would be to use a held-out set different from the development set, but because of time constraints I simply picked the arbitrary value of 5, which was also used for the evaluation set.

For the required training set I submitted the baseline (speaker-dependent mixtures), with an EER of 8.2%. For the optional training set I submitted two systems: the baseline and the hierarchical bias with no prior cascaded with AMAP. The results are shown in Table 2.

Method / EER (%)
Hier. bias (no prior) + AMAP / 6.6
Speaker dependent mixtures / 7.5

Table 2. Performance of the two submitted systems trained on the optional set and tested on the evaluation set.

We observe from Table 2 that hierarchical bias + AMAP performed better than the baseline.

5. SUMMARY

A number of adaptation techniques were evaluated on the speaker recognition task. The results show that adaptation methods can outperform standard ML methods: a gain was observed on the evaluation set, although no gain was observed on the development set. It would be of interest to use an SI system with many more than 100 Gaussians, since I believe 100 is far below the optimal number for the 20 speakers each SI system was estimated on.

Another possible improvement could come from using discriminative adaptation instead of ML adaptation. This makes much more sense in the speaker recognition task than it does in speech recognition. Under discriminative adaptation we seek the transformation that maximizes the ratio of the probability of the data under the target speaker model to its probability under the impostor model, that is:

$(\hat{A}, \hat{b}) = \arg\max_{A,\,b}\ \frac{p(X \mid A, b, \text{target})}{p(X \mid \text{impostor})}$

Since this ratio is the score used for testing, discriminative adaptation matches the training and testing procedures more closely.

6. REFERENCES

[1] C.-H. Lee and J.-L. Gauvain, “A Study on Speaker Adaptation of the Parameters of Continuous Density Hidden Markov Models”, IEEE Trans. on Acoustics, Speech and Signal Processing, vol. 39, no. 4, pp. 806-814, April 1991.

[2] C.J. Leggetter and P.C. Woodland, “Maximum Likelihood Linear Regression for Speaker Adaptation of Continuous Density Hidden Markov Models”, Computer Speech and Language, vol. 9, pp. 171-185, 1995.

[3] V. Digalakis, D. Rtischev and L. Neumeyer, “Speaker Adaptation Using Constrained Reestimation of Gaussian Mixtures”, IEEE Trans. on Speech and Audio Processing, pp. 357-366, September 1995.

[4] A.P. Dempster, N.M. Laird and D.B. Rubin, “Maximum Likelihood from Incomplete Data via the EM Algorithm”, Journal of the Royal Statistical Society, Series B, vol. 39, no. 1, pp. 1-38, 1977.