Development of a Voice Recognition Program
May 9, 2002
T2
Michael Kohanski, Anna Marie Lipski, Justin Tannir, Tony Yeung
ABSTRACT
The objective of the following experiment was to develop a digital voice recognition program utilizing LabView and MatLab 5.3R11 that will be able to identify people uniquely by their voice. As seen from the demonstration, the programs developed equip us with the necessary tools to record, filter, and analyze different voice samples and compare them to the archived sample. Furthermore, any threshold value can be set, depending on the security level wanted. The spectrogram program subtracts two fingerprints to get a relative difference. A large relative distance indicates little correlation between the voices. The results show that this program can accurately determine the difference between a male and a female voice and to a certain extent differentiate between different male voices. The peak comparison program is able to differentiate between four speakers saying the word “open” on multiple occasions. It compares the multiple repeats of the word to stored fingerprints of the four speakers saying, “open”. This program is a good foundation for allowing a user to identify a person based upon voice fingerprinting.
BACKGROUND
Voice recognition is not speaker-independent. It amplifies the idiosyncratic speech characteristics that individuate a person and suppresses the linguistic characteristics. The goal of this experiment was to identify, through comparative measures, the characteristic fingerprints of certain pitches and sounds. Speech recognition has been actively studied since the 1950s, however recent developments in computer and telecommunications technology have improved speech recognition capabilities. In a practical sense, speech recognition will solve problems, improve productivity, and change the way we run our lives. Academically, speech recognition holds considerable promise as well as challenges in the years to come for scientists and product developers. Two applications of digital speech processing, pitch recognition and voice fingerprinting, were modeled in order to create a working program in MatLab 5.3R11 that could differentiate among different frequencies.
Pitch recognition was modeled initially in order to create a program that could classify and distinguish between different notes, and to determine the frequencies by FFT analysis at which each note resonated. After doing this we could then further develop our program to analyze and determine the differences in human voice. The voice fingerprinting method was utilized in order to differentiate voices among those who participated. Voice fingerprint refers to the distinct manner in which the air from an individual’s lungs passes through the vocal chords and resonates off of the pallet. It is through noise filtering and Fast Fourier Transform (FFT) analysis; comprehensive comparisons can distinctly identify the owner of a sample voice. Moreover, graphical representations of the voiceprint took the form of a speech spectrogram, which consists of a representation of a spoken utterance in which time is displayed on the abscissa and frequency on the ordinate.
Voice Recognition consists of two major tasks: feature extraction and pattern recognition. Feature extraction attempts to discover characteristics of the speech signal unique to the individual speaker while pattern recognition refers to the matching of features in such a way as to determine, within probabilistic limits, whether two sets of features are from the same or different individual (peak to peak analysis of voices).
Voices can be analyzed in either the temporal domain or the frequency domain. Features of a voice in the time domain are either evident in the raw signal, or they become more evident by deriving a secondary signal that has particular properties. The major time-domain parameters of interest are duration and amplitude. Features of the voice are identified by spectral analysis in the frequency domain. The FFT is the most commonly used technique that calculates the spectrum from a signal, and it was the method of choice in the following experiment. It provides a measure of the frequencies found in a given segment of a signal by decomposing it into its sine components. Furthermore, it allows for pitch extraction and the identification of fundamental frequencies, which was an essential component of the experiment.
Areas of analysis included: parameter extraction and distance measurements. Parameter extraction, as discussed before, consists of preprocessing an electrical signal to transform it into a usable digital form, applying algorithms to extract only speaker-related information from the signal, and determining the quality of the extracted parameters. The two parameters of choice were pitch and frequency representations. Pitch allows for distinguishing between a male and a female voice, and frequency allows for the representation of a signal in various time frames. This is equivalent to a spectrogram in numerical form. The numerical form of a spectrogram was computed utilizing the FFT algorithm. It is from the distance measurements that one can determine how much two voices differ from each other. Here, a threshold value can be decided (which can be set to whatever value desired, depending on the amount of security needed), and if the distance between the unknown speaker and the known speaker is greater than the threshold value, the unknown is rejected.
Theory & Method
Peak Analysis Program
The characteristic properties of one’s voice is a function/product of many parameters including the dynamics of sound as it passes through the pharynx, the vibration of the larynx, the shape of the mouth and how sound reverberates off of the palette. Due to the uniqueness complexity of one’s voice, a voice has fingerprint qualities, i.e. no two people’s voice patterns are exactly alike. The method of FFT analysis maps the frequency fingerprint of a person’s voice, in order to systematically determine if another sample is or is not from the same subject.
This program gathers the FFT fingerprint from the subject in question and compares it to four previously stored fingerprints in order to determine if the subject’s voice matches one of the archived voices. The program itself is divided into five functions.
· wav_gather.m: This is the main function and it searches out the file(*.wav) of the subject, checks/converts the data to mono from stereo (ie takes only input from one speaker) and outputs to wav_plot.m
· wav_plot.m: Sends the raw data to wav_filter.m to be filtered. Then it gathers the filtered data and plots the FFT, spectral density and original waveform
· wav_filter.m: Filters the raw data with high pass, low pass, bandpass, bandstop, or no filter.
· peak_compare.m: Sends the FFT to peak_finder.m to get the peaks. The peaks are then returned and compared against archived fingerprints.
· Peak_finder.m: Analyzes filtered data and picks out peaks via a complex downward scan of the FFT up-to a preset threshold value.
Spectrogram Program
In addition to the frequency domain, analysis can also be performed in the time-frequency domain using the spectrogram function in Matlab. The spectrogram performs a short-time Fourier Transform on the input signal in the time domain and transforms it into an intensity frequency-time plot. The short time Fourier Transform is based on the Fourier Transform but the input signal, x(t), is multiplied by a shifted window function, w(t-τ). Equation 1 is the Short-Time Fourier Transform equation.
Instead of performing a Fourier Transform on the whole signal, the signal is segmented into many sections and a short-time Fourier Transform is performed on each windowed segment.
Figure 2 is a spectrogram of the A string (440Hz) on a violin. The top figure is the signal in a voltage-time plot and the bottom figure is the spectrogram. Each point (x1,y1) on the spectrogram indicates the intensity of the frequency at that particular time point. Pixel with red color has the highest intensity and that with blue has the lowest intensity.
Figure 1 Input signal in a voltage-time plot and in a spectrogram.
Since the spectrogram is essentially a “fingerprint” of a person’s voice, it shows particular features specific to the subject himself. For instance, the parallel bands observed in the above example show the harmonics of the A note. Based on the voice fingerprinting, different subjects will have different parallel bands as well as other features such as delays and time shifts. In this project, the spectrograms of two different subjects are subtracted from each other and the result is the relative difference in frequency intensity. The largest difference in the frequency intensity indicates that those two subjects vary the most. It is also important to note that the two voices are pre-aligned to obtain the best overlap between the two. This is achieved by using the cross-correlation function between the two voices and obtaining the best correlation coefficient and the lag time.
The following parameters were selected while using the peak compare program. Filter value is a normalized value, where an input of 0.5 corresponds to a value of 50% of the frequency. Percent difference in peaks corresponds to the maximum deviation in total number of peaks between an archive and a test sound. This will stop false positives from occurring for a test or an archive being a subset of the other. Maximum difference between the peaks corresponds to the maximum deviation in the frequency axis allowed between peaks in an archive versus a test subject for it to be considered the same peak. In general, this is a direct function the value used to define peak separation in the peak_finder.m function and has been defaulted to r/80, where r corresponds to the total number of data points in the FFT.
parameter / value/setting usedfilter type / high pass
filter value-normalized / 0.2
peak-compare / Yes
percent diff in peaks / 20
maximum diff between peak frequencies / 100
Table 1 Parameter setting for the Peak Analysis programs
METHODS & MATERIALS
Materials Programs Created
· 16 Bit Sound Card ● Wave Gather (wav_gath)
· Microphone ● Wave Filter (wav_filt)
· LabView ● Wave Plot (wav_plot)
· Tuning forks ● Peak Finder (peak_find)
· MatLab 5.3R11 ● Peak Compare (peak_compare)
Figure 2 Representation of the protocol.
The specific protocol of the voice data collection required that all four of the participants speak into the microphone while running LabView. The settings for LabView data acquisition were at a sampling rate of 44100Hz. Four different individuals spoke chosen words into the microphone at different times: “subject,” and “open.” A total of four subjects participated in the experiment. Data was then saved in LabView and then re-opened, filtered, graphed, and analyzed in MatLab 5.3R11 by FFT analysis using the programs that were created. The archived voice was compared to all other voices recorded. The overall method of data acquisition was as followed: sound wave created by person’s speech was transuded into an analog electrical signal via mic à the signal is sampled & quantized à resulting in a digital representation of the analog signal.
Figure 3 Order of Signal Analysis
After signal collection, the analysis segment of the experiment utilized the programs made for the experiment. The data collected in LabView is transferred to MatLab and first, wave gather gathers the analog signal. Then wave plot plots the FFT of the signal, wave filter filters the original signal, and finally peak comparison and peak finder are used for the analysis of the signal, along with another signal to determine the owner of the voiceprint as well as if the two signals match.
RESULTS
Peak Finder & Compare Program
Figure 4 is a raw wave file (voltage vs. time).
Figure 4 Raw signal in time domain.
Table 2 is a sample of results from the peak compare program.
Table 2 Sample of results from peak compare program
The FFT on the word “open” spoken by Mike in trial #2 is presented in figure 5. As labeled in the text box, the red marker labels the peaks obtained from the FFT directly beneath it in blue. It could be seen that the red peaks overlap almost exactly on top of the peaks from the FFT in blue. This shows that the peak finder program is able to identify the peaks on a FFT accurately.
Figure 5 An illustration of the red peaks identified by the peak finder program
The FFT on the word “open” spoken by Anna in trial #1 and #2 is presented in figure 6. As labeled in the text box, the red represents the peaks obtained from FFT of trial #1 and the blue is the FFT of trial #2. It could be seen that the red peaks overlap to a great extent to the blue peaks from the FFT directly beneath it.
Figure 6 The red peaks of Anna in trial #1 are compared to the blue peaks from trial #2
The FFT on the word “open” spoken by Anna in trial #2 and Justin in trial #1 is presented in the following Figure 7. It could be seen that the red peaks from Justin’s FFT do not match Anna’s peaks in the FFT (in blue) as closely as in the above figure, Figure 6.
Figure 7 Red peaks from Justin are compared to Blue peaks from Anna
Spectrogram Program
The relative difference in frequency intensity between two speakers is presented in the following figure. A positive relative difference shows that the first subject (listed below each column) has higher frequency intensity than the second subject. However, the difference can be the result of delays, shifts or pitches at which the word is being voiced. The error bars are the standard deviations in the measurement.