Speech recognition is one of the key tools for the next generation of user-friendly computer applications that aim to simplify the lives of everyday users. A number of obstacles remain in the way of this goal and must be overcome before speech recognition can become a mainstream technology. One of these obstacles is the difficulty of distinguishing stop consonants, the sounds created by stopping the flow of air in the mouth and letting it go in a burst (/b/, /d/, /g/, /k/, /p/, and /t/). Since all stop consonants follow a similar stop-followed-by-burst pattern, and the burst is typically unvoiced, telling them apart is a particularly difficult problem. Building a system focused on distinguishing stop consonants may help bring speech recognition one step closer to its larger goal of an ideal user interface.
The Human Beatbox is essentially a voice-to-drum synthesizer that accepts speech input limited to a small dictionary of sounds the system is pre-trained to recognize. Each sound in the dictionary has a predetermined corresponding drumbeat; the Beatbox outputs the drumbeat corresponding to the input sound it detects. In this manner, someone without knowledge of the drums could effectively “play” the instrument with his or her mouth. While the principal theoretical premise of the project rests on detection of stop consonants, for the purposes of demonstration we have chosen our dictionary to include the following sounds: /doo/, /pff/, /k/, /th/, and /psh/. These sounds are more representative of a cappella drum sounds and contribute to a more realistic user experience. However, our program works equally well on the stop consonant dictionary previously specified. The Beatbox is equipped with a graphical user interface that allows users to train the program immediately before testing it. Users may also skip training the program on themselves and directly test the system using pre-trained data, generated principally by the creators. Further, the GUI displays the waveform as received by the microphone and the corresponding spectrogram showing the distribution of frequencies and their intensities in the input sound, allowing for a richer musical experience!
Recognition of stop consonants is not a trivial problem and has been the subject of ongoing research. Stop consonants are those sounds created by stopping the flow of air in the mouth and letting it go in a burst as illustrated in Figure 1. The closure may be voiced or unvoiced, while the burst is generally unvoiced.
Figure 1: Waveform and Spectrogram of stop consonant /g/ in the word ‘good.’ 
The English language has six stop consonants, namely /b/, /d/, /g/, /k/, /p/, and /t/. It is difficult to distinguish between stop consonants based on their spectral components, as shown in Figure 2.
Figure 2: The average auditory spectrum of stop consonants.
From left to right: Circles - /t/, /p/, /k/; Lines - /d/, /b/, /g/ 
Stop consonants are typically short, speaker-specific, and context-specific, making them particularly hard to detect. In fact, modern speech recognition software often ignores stop consonants and interprets words based on other sounds that are easier to identify.
This project was intended to expose us to the fascinating problem of stop consonant recognition. Furthermore, we hypothesized that our focus on examining specific frequency analyses relevant for detecting stop consonants, rather than for other types of speech, would enable us to reach higher levels of accuracy in the detection process.
Stop consonant detection is an interesting problem and has been explored using different techniques in the past. Using harmonicity as a grouping cue is a common technique in speech separation, but this technique is limited to voiced speech. Hu and Wang (2003) propose a method whereby stop consonants are separated using the acoustic properties of unvoiced signals. This method employs onset as the major grouping cue: it detects the signals through feature-based Bayesian classification and then groups them by onset coincidence. Ali et al. (2001) have developed a method for automatic stop consonant identification through front-end processing using a biologically rooted property called Average Local Synchrony Detection (ALSD). It draws upon acoustic-phonetic features which are statistically found to be rich in information, such as burst frequency, prominence of the burst frequency, and the maximum normalized spectral slope. Other techniques used include Hidden Markov Models and Support Vector Machines.
It should be noted however, that no attempts have been made to build a machine analogous to the beatbox, to the best of our knowledge. Thus, although the general problem may have been tackled multiple times from a theoretical perspective, there is no true predecessor to our end-product in terms of application and design.
The Beatbox was initially developed as a post-processing system containing a detection engine which handled both the signal analysis and stop consonant recognition processes. Later, when real-time functionality was added, a threshold detector was layered on top of the original system to extract the appropriate intervals of speech which were then handed over to the detection engine. We describe the theory and design of the detection engine before turning to the mechanics of the threshold detector, since the latter is not essential to our system unless real-time processing is desired. Finally, it is important to note that the ensuing discussion uses stop consonants as an illustrative example, though the demonstration system primarily relies on the revised dictionary of sounds specified in the Abstract.
The Detection Engine
The underlying detection engine of the Beatbox comprises two major components:
§ The Front-End accepts voice input and uses a set of Digital Signal Processing tools to pre-process, clean, and analyze the incoming audio data. This includes removal of background noise, equalization of volume levels (i.e. correct people’s tendencies to voice certain sounds louder than others), Fourier transforms and other frequency analyses. Ultimately this outputs a set of parameters which describe the most important frequency characteristics of the inputted audio signal.
§ The Back-End is essentially a pattern recognition subsystem that uses the frequency characteristics (outputted by the front-end) to determine the most likely match for the input audio data. The pattern recognition system is probabilistic in nature and is modeled using a series of Hidden Markov Models (HMMs).
A high-level system flow, illustrated in Figure 3, is outlined below:
1. Speaker voices string of drum sounds into microphone, forming a continuous, analog, real-time speech signal.
2. Speech signal is sampled and digitized.
3. Digital signal is processed to extract only the necessary information.
4. HMM-based pattern recognition subsystem probabilistically identifies and labels audio input.
5. Output of the specific drum beat (or silence), and visual indication of the classification.
Figure 3: System Overview
The Front-End: Digital Signal Processing Techniques
The beatbox accepts voice input through a microphone. The computer sound card working in conjunction with audio sampling software converts this voice input into a digital signal. The sampled signal (the output of block (2) in Figure 3) is essentially a raw speech waveform, which represents the air compression pattern as picked up by the microphone. It is very hard to recognize patterns from this signal directly, thus necessitating extensive digital signal processing. Signal processing satisfies a dual purpose: first, the data can be split into time-slices and second, the information in these time slices can be represented so as to maximize the performance (in terms of accuracy) of the pattern recognition algorithms.
Figure 4: Signal Processing Block-Diagram
As seen in Figure 4, the first step in processing the signal involves dividing it into more manageable sub-units or “windows” for analysis. Typically, overlapping windows 20–30 ms in length (and shifted by 10 ms) are used for speech applications, since it is generally agreed that this is the length of a meaningful phoneme (Gold and Morgan, 2000). We have, however, considered shrinking the window size when limiting our inputs to /ph/, /th/, and /kh/, which are sounds of extremely short duration. Currently, the system uses a window length of 512 samples (at a sampling frequency of 22,050 samples/sec), or roughly 23 ms, which is on the lower end of that range.
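The system itself is implemented in Matlab, but the framing step can be sketched in Python/NumPy as follows. The 221-sample hop (about 10 ms at 22,050 Hz) is an assumption, since the exact shift used by the system is not specified above:

```python
import numpy as np

def frame_signal(x, frame_len=512, hop=221):
    """Split a signal into overlapping analysis windows.

    frame_len=512 samples is ~23 ms at 22,050 Hz; hop=221 samples (~10 ms)
    is an assumed shift, as the report does not state the exact value.
    """
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i * hop : i * hop + frame_len]
                     for i in range(n_frames)])

x = np.arange(22050, dtype=float)   # one second of dummy signal
frames = frame_signal(x)
```

Each row of `frames` is then windowed and analyzed independently in the steps that follow.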
The windowing step not only takes “snapshots” of the incoming signal, but also modifies the signal. Specifically, it attenuates the strength of the signal toward the ends of the window by multiplying the signal x[n] with a “window function” w[n]. The window function used is the Hanning window, which is essentially one period of a raised cosine ranging from 0 to 1. A length-N Hanning window is defined as follows:

w[n] = 0.5 (1 − cos(2πn / (N − 1))),  n = 0, 1, …, N − 1
This function with N=1000 is shown below:
Figure 5: Hanning Window
Attenuating the ends of the signal window is a necessary step to getting sensible results from the Fast Fourier Transform to be performed in the next step. This is because the FFT assumes that the input signal is one period of a periodic signal. Therefore, the ends have to be dampened in order to minimize the discrepancy between the beginning and end of the actual signal window.
Border value attenuation in turn necessitates the use of overlapping windows, so that every point along the signal receives equal weighting. Note that the point in the signal x[n] which coincides with the crest of the Hanning window function and thus receives maximum weighting at time t, will receive minimum weighting in time t+1, when x[n] has been shifted over.
Figure 6 shows the specific steps that “windowing” a signal entails.
Figure 6: Pictorial Representation of the Windowing Process
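The equal-weighting property described above can be checked numerically. The sketch below (in Python/NumPy rather than the project’s Matlab) uses the “periodic” form of the Hanning window and an exact 50% overlap, for which the shifted copies sum to a constant; the actual system’s ~10 ms hop on a 512-sample window is close to, but not exactly, 50%:

```python
import numpy as np

N = 512
n = np.arange(N)
w = 0.5 * (1 - np.cos(2 * np.pi * n / N))   # "periodic" Hanning window

# Overlap-add shifted copies of the window at 50% overlap (hop = N/2):
# away from the edges, the copies sum to a constant, so every sample of
# the signal receives equal total weight.
hop = N // 2
total = np.zeros(3 * N)
for start in range(0, len(total) - N + 1, hop):
    total[start:start + N] += w

interior = total[N:2 * N]   # samples away from the ramp-up/ramp-down edges
```

In the interior region, a sample that sits at the crest of one window sits at the trough of the next, and the two weights sum to one.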
A Fast Fourier Transform (FFT) is performed on the window (block 2 of Figure 4). The FFT treats the window as a snapshot from a periodic signal and extracts the frequency content of the window. In other words, the signal is treated as a sum of complex exponentials of the form Ae^(jωn), where A is the complex amplitude (having both magnitude and phase) and ω is the angular frequency. The FFT then outputs a vector of the complex amplitude values A for a set of evenly spaced frequencies ω. This block passes on only the magnitude values of A, since the complex phases are not generally useful for speech recognition.
The output values from the FFT can differ by several orders of magnitude. In order to make the data more manageable, the amplitudes are compressed by taking their logarithm, which shrinks the highest values while leaving the lower values more or less intact.
The data is now a series of power coefficients at evenly spaced frequency values. However, it has been shown experimentally that the human ear can be modeled as a filter bank in which the higher-frequency filters have a much wider bandwidth than the lower-frequency filters (Gold and Morgan, 2000). In order to model this behavior, the FFT output is split into frequency ranges, with the bandwidth increasing with frequency. Our particular choice in this regard is the Bark scale. In order to measure the amount of power contained in a frequency band, a weighted average of the frequencies within the band is taken, and that value is used as the characteristic power value for that band. This process is called “critical band integration,” and its main purpose is a more compact representation of information. It achieves this by reducing unnecessary granularity, i.e., reducing the number of frequency components carried over from the FFT, whose fine detail does not convey substantially greater information about the signal.
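A minimal sketch of critical band integration, again in Python/NumPy: the Zwicker critical-band edges and the use of a plain (unweighted) mean within each band are simplifying assumptions, since the text specifies a weighted average without giving the weights:

```python
import numpy as np

# Zwicker's critical-band (Bark) edges in Hz, up to the 11,025 Hz Nyquist
# frequency of a 22,050 Hz sampling rate.
BARK_EDGES = [0, 100, 200, 300, 400, 510, 630, 770, 920, 1080, 1270, 1480,
              1720, 2000, 2320, 2700, 3150, 3700, 4400, 5300, 6400, 7700,
              9500, 12000]

def critical_band_integrate(log_power, fs=22050):
    """Collapse a log-power spectrum into one value per Bark band.

    log_power: log-magnitudes for the N//2 + 1 non-negative FFT frequencies.
    Returns one (mean) coefficient per critical band.
    """
    freqs = np.linspace(0, fs / 2, len(log_power))
    out = []
    for lo, hi in zip(BARK_EDGES[:-1], BARK_EDGES[1:]):
        mask = (freqs >= lo) & (freqs < hi)
        if mask.any():
            out.append(log_power[mask].mean())
    return np.array(out)

bands = critical_band_integrate(np.ones(257))   # 512-sample window -> 257 bins
```

This reduces the 257 FFT bins of a 512-sample window to about two dozen band coefficients.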
The FFT output for a speech window, also known as the spectrum of that window, generally contains redundant information. For example, the values of adjacent frequency bands are highly correlated. In order to avoid this redundancy and extract only the most important information, we take a “cepstrum” of the output from the critical band integration stage, i.e., we perform an FFT on the log of the FFT of the original audio data and keep, for example, the lower-frequency values. By doing this, we capture the major features of the spectrum while ignoring its finer structure.
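The cepstral step can be sketched as follows; keeping the lowest 8 coefficients is an illustrative choice, not a figure from the system:

```python
import numpy as np

def cepstral_features(band_log_power, n_keep=8):
    """'Spectrum of the log-spectrum': an FFT of the log band energies,
    keeping only the low-quefrency coefficients, which capture the coarse
    spectral envelope while discarding fine structure."""
    cep = np.fft.rfft(band_log_power).real
    return cep[:n_keep]

feats = cepstral_features(np.zeros(23))   # e.g., 23 Bark-band log-energies
```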
The final output from the signal processing stage is a vector encapsulating the spectral information from one window of the signal. This is then passed on to the pattern recognition subsystem.
The Back-End: Pattern Recognition Algorithm
The pattern recognition system is probabilistic in nature and is based on Hidden Markov Models. Our pattern recognition subsystem comprises three HMMs, each corresponding to one of the allowed stop consonants. In any HMM there is a state space S and an observation set O. The observations are the visible nodes in the belief network, while the states are the “hidden” nodes, inferred from the visible observation nodes. Certain Markov assumptions, such as finite context and time invariance, are built into the structure of the belief network. A schematic of the belief network associated with an HMM is shown below in Figure 7. Any HMM is parameterized as follows:
§ Initial Distribution: πi = P(S1 = i), where i = state value
§ Transition Probabilities aij = P(St+1 = j | St = i)
§ Emission Probabilities bik = P(Ot = k | St = i)
Inferring these characteristics for the three HMMs is the purpose of the training phase.
Figure 7: Belief Network Associated with HMM
Each HMM, corresponding to a different stop consonant within our allowed dictionary of sounds, is trained separately. The training process for one such HMM is outlined below and also shown in Figure 8.
N samples of a certain stop consonant, say /p/, are received from the front-end signal processor. This training set will be utilized to estimate the parameters of the /p/ HMM. Ordinarily, an iterative algorithm like Expectation-Maximization (EM) would be used to infer the parameters characterizing the HMM in question. However, in order to reduce the computational complexity, we use an approximation technique involving the Viterbi algorithm. The N samples of training data are first used to deduce the N most likely state sequences that may have generated this data:

S*(n) = argmax_S P(S, O(n)),  n = 1, …, N

where O(n) denotes the n-th training observation sequence and the maximization over state sequences S is computed by the Viterbi algorithm.
In this way, we now have a fully instantiated belief network, with the state space no longer containing any “hidden” nodes. Consequently, the EM update rules can be simplified to eliminate posterior probabilities of hidden nodes given observed nodes. Instead, we use more intuitive counting methods to arrive iteratively at the parameters defining the HMM. The update rules simplify as follows:

πi = count(S1 = i) / N
aij = count(transitions from state i to state j) / count(transitions out of state i)
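Under the Viterbi approximation, these counting updates can be sketched directly (Python/NumPy; the input state sequences are assumed to be the Viterbi decodings described above):

```python
import numpy as np

def count_updates(state_seqs, n_states):
    """Re-estimate pi and A by counting over Viterbi-decoded state sequences."""
    pi = np.zeros(n_states)
    trans = np.zeros((n_states, n_states))
    for seq in state_seqs:
        pi[seq[0]] += 1                      # count initial states
        for s, s_next in zip(seq[:-1], seq[1:]):
            trans[s, s_next] += 1            # count each transition
    pi /= pi.sum()
    row_sums = trans.sum(axis=1, keepdims=True)
    # Normalize each row; rows with no outgoing transitions stay all-zero.
    trans = np.divide(trans, row_sums,
                      out=np.zeros_like(trans), where=row_sums > 0)
    return pi, trans

pi, A = count_updates([[0, 0, 1], [0, 1, 1]], n_states=2)
```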
Our observations are continuous as opposed to discrete. This means that when updating bik, the mean and covariance matrix of a Gaussian distribution must be re-estimated in each iteration. Thus, the update rule for bik cannot be expressed as simply in terms of counts as the other parameters of the HMM, but it is not very difficult to implement in practice.
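A sketch of the per-state Gaussian re-estimation, assuming full covariance matrices and a hard (Viterbi) assignment of observations to states:

```python
import numpy as np

def gaussian_emission_update(obs, states, n_states):
    """Re-estimate each state's Gaussian from the frames the Viterbi path
    assigned to it.

    obs:    (T, d) array of feature vectors
    states: length-T array of Viterbi state labels
    """
    means, covs = [], []
    for i in range(n_states):
        x = obs[states == i]
        means.append(x.mean(axis=0))            # state mean
        covs.append(np.cov(x.T, bias=True))     # state covariance (MLE form)
    return np.array(means), np.array(covs)

obs = np.array([[0., 0.], [2., 2.], [10., 10.], [12., 12.]])
states = np.array([0, 0, 1, 1])
means, covs = gaussian_emission_update(obs, states, n_states=2)
```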
Figure 8: Training the HMM
We have also modeled the HMM such that only left-right transitions are allowed. This is easily accomplished by initializing the transition matrix such that non-allowable transitions occur with zero probability. Also, a helpful feature of the EM algorithm (and of our approximation) is that a zero prior probability implies a zero posterior probability, so updating the transition matrix has no effect on those zero-probability terms, thus preserving the left-right model.
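Initializing the left-right structure can be sketched as follows; the 0.5/0.5 split between staying and advancing is an illustrative starting point, not the system's actual initialization:

```python
import numpy as np

def left_right_transitions(n_states):
    """Left-right initialization: each state may only stay put or advance one
    state to the right. Transitions initialized to zero never appear in a
    Viterbi-decoded path, so their counts (and probabilities) remain zero
    through every re-estimation iteration."""
    A = np.zeros((n_states, n_states))
    for i in range(n_states - 1):
        A[i, i] = A[i, i + 1] = 0.5
    A[-1, -1] = 1.0   # final state absorbs
    return A

A = left_right_transitions(3)
```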
During the actual operational / testing phase, the pattern recognition subsystem of the beatbox admits pre-processed (or real-time) voice input consisting of some collection of stop consonants. Each stop consonant is then run through each of the three HMMs, and the HMM that reports the highest probability for the observed consonant labels it as such. (For example, if the /p/ HMM reports the highest probability, the consonant is recognized as a /p/.) In order for this to take place, P(O1, O2, …, OT) must be calculated for each model. Ordinarily, this would be accomplished using the recursive Forward Algorithm.
The Forward Algorithm in essence sums over all possible state configurations given the observation sequence to estimate its probability. However, we hypothesize that one such state configuration dominates all the other terms. This happens to be the Viterbi sequence i.e. the most likely state sequence that can be inferred from that particular observation sequence. With this idea in mind, we can pretend that the Viterbi sequence is in fact the true state sequence that generated the observation sequence, and estimate the probability of the latter as the likelihood of the observation and that specific state configuration.
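The Viterbi score used in place of the Forward algorithm's sum can be sketched in the log domain (Python/NumPy). Each trained HMM scores the same observation sequence through its own emission likelihoods, and the highest-scoring model labels the sound:

```python
import numpy as np

def viterbi_log_score(log_pi, log_A, log_B):
    """Log-probability of the single best state path.

    log_pi: initial log-probabilities, shape (S,)
    log_A:  transition log-probabilities, shape (S, S)
    log_B:  emission log-likelihoods per frame, log_B[t, i] = log b_i(o_t)
    """
    delta = log_pi + log_B[0]
    for t in range(1, len(log_B)):
        # Best predecessor for each state, then add the new emission term.
        delta = (delta[:, None] + log_A).max(axis=0) + log_B[t]
    return delta.max()

# A tiny 2-state, 2-frame example with hand-checkable numbers.
score = viterbi_log_score(np.log([0.6, 0.4]),
                          np.log([[0.7, 0.3], [0.4, 0.6]]),
                          np.log([[0.9, 0.1], [0.2, 0.8]]))
```

Classification then amounts to an argmax of this score over the candidate HMMs.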
The Real-Time System and Threshold Detector
In converting from the post-processing system to one that operates in real-time, we have altered the way the sound input is gathered and sampled; the underlying stop consonant detection engine (front and back end) remains essentially the same. Since the input to the microphone is continuously sampled, it is important for the system to be able to distinguish the user’s voice input from a reasonable amount of persistent background noise. This is done using a threshold detection mechanism. Essentially, when a user speaks into the microphone, we should expect to see a spike in the waveform; if this spike crosses the threshold, then it should trigger the underlying engine to begin the detection process. Conversely, the system should not be triggered by every isolated spike – only spikes of a sustained nature that may reasonably be interpreted as speech input should be handed over to the detection system.
The real-time threshold detector works on packets of 1000 samples at a time. At any instant in time, our system stores the current packet as well as the three preceding packets of observed input. In order to detect the sustained spikes that correspond to user speech, we aggregate this input before handing it to the DSP front-end, progressively damping observations made further back in time. Specifically, we use the following first-order Infinite Impulse Response (IIR) filter for this purpose:

y[n] = e^(−1/τ) · y[n−1] + x[n]
Thus, after time τ (the time constant), the impulse response of the filter has decayed to 1/e. After the appropriate slice of the waveform (we have set this to include the packet that triggers the detector, the three packets preceding it, and the five packets following it) has been extracted, it is handed to the front-end of the engine, from which the processed signal proceeds to the back-end pattern recognition system for detection as described above.
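A sketch of the smoothed-energy trigger follows, assuming mean-square packet energy and a time constant of three packets; both are illustrative choices, though the filter form matches the 1/e decay described above:

```python
import numpy as np

def smoothed_energy(packets, tau=3.0):
    """First-order IIR smoothing of per-packet energy: y[n] = a*y[n-1] + x[n],
    with a = exp(-1/tau), so an impulse decays to 1/e after tau packets."""
    a = np.exp(-1.0 / tau)
    y, out = 0.0, []
    for p in packets:
        x = float(np.mean(np.asarray(p) ** 2))   # mean-square packet energy
        y = a * y + x
        out.append(y)
    return out

def crosses_threshold(packets, threshold):
    """True if the smoothed energy ever exceeds the trigger threshold."""
    return any(e > threshold for e in smoothed_energy(packets))

packets = [np.zeros(1000), np.zeros(1000), np.ones(1000)]
```

The smoothing rejects brief isolated spikes, since a single loud packet must still lift the decayed running energy above the threshold.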
The Demonstration System: Graphical User Interface
The user interface is a GUI that allows for some interaction between the user and the system. The user may interactively train the system on the fly instead of using pre-trained parameters estimated on sound samples belonging to the system creators. Users also have the option to view the waveform and spectrogram generated by their voice inputs to gain a better understanding of what is being handed over to the pattern recognition algorithms under the hood for the automated detection process. Figure 9 presents a representative waveform and spectrogram obtained for each dictionary sound in the demonstration system.
Figure 9: Representative Waveforms and Corresponding Spectrograms of Dictionary Sounds
(First row from left to right: /doo/, /pff/, /psh/ Second row: /k/, /th/)
Implementation and Results
The Beatbox has been implemented in its entirety in the Matlab programming language in the Windows environment. Add-on toolboxes for signal processing, real-time implementation, and GUI design have been used where necessary.
The system has been trained separately on each of the three system designers. Preliminary testing has shown 90–95% accuracy in recognizing the demonstration dictionary sounds when training and testing are done by the same person. (Only the user is the same; the test data points are distinct from the training data, so the system is never tested on its own training set.) If training data from another user is utilized, accuracy drops slightly to 80–85%, still well above chance, indicating that our system generalizes satisfactorily (though not perfectly) to users outside the training set. Finally, if the Beatbox is tested on stop consonants per se, the accuracy levels in the two scenarios above are lower than the corresponding accuracy achieved with the demonstration dictionary, though still satisfactory and well above chance. (Extensive testing with stop consonants has not yet been conducted, so exact accuracy figures are withheld.) This is in line with our expectations, since the demonstration sounds are voiced and furthermore possess certain distinguishing characteristics that are relatively easier for the pattern recognition system to gauge.
While the Beatbox system performs with satisfactory accuracy on both sound dictionaries, there are a few implementation issues we would have liked to consider further had time permitted. The threshold detection mechanism, while sufficient, is far from perfect. Although we can set the threshold dynamically, the system often runs into problems with close-range vibrations like those resulting from movement of the microphone or switching it on and off. Since the Beatbox does not incorporate a sophisticated background noise elimination system, it can be difficult to operate in noisy environments even with a short-range microphone. Furthermore, our decision to treat 9 packets of 1000 samples each as the boundaries of an input sound that triggers the detection engine is somewhat arbitrary and based solely on experimentation. The tradeoff for its simplicity is that we lose the flexibility of accounting for the inherent differences in duration of certain sounds. This is particularly evident in the /psh/ waveform shown in Figure 9 above. That sound, representing the cymbal, is inherently longer than any of the other sounds in our dictionary. Thus, as can be observed from the diagram, the waveform does not settle to 0 before it is cropped by the threshold detector’s 9000-sample input data rule. Although this does not have any material effect on detection accuracy (the /p/ is very short, and the majority of the sound is constituted by the more sustained /sh/, which the pattern recognition system picks up), the system often detects a single sustained /psh/ as two short repeated utterances of /psh/, outputting two cymbal clashes instead of one. We had considered alternative implementations, such as incorporating a feature-based classifier in the front-end to extract the most appropriate window slice to pass to the back-end pattern recognition subsystem, but that would have meant added complexity.
Another way of circumventing this problem would be to design a “master HMM” that would automatically account for silence and transitions between different sound states. Such implementations possess the scope of yielding better results.
Apart from these technical refinements, design extensions to the Beatbox are limitless. Future directions could include adding other instruments, an enhanced GUI with richer visualization, and temporal correlation to lend our system a sense of rhythm. The last problem is partially solved by our real-time design, but more can be achieved in that direction.
In closing, we would like to thank our advisor, Professor Lawrence Saul, for kindly lending us his idea for our Senior Project. We are also indebted to him for his patient support and encouragement through the vagaries of this year-long undertaking.
We also extend our thanks to Professors C.J. Taylor and Ken Laker for their thoughtful suggestions during our Senior Design conferences, and especially for their repeated emphasis on timelines, with the blessings of which we barely kept on schedule!
This project was a personally satisfying and enlightening experience for all of us. We spent a significant amount of time reading and understanding techniques in Digital Signal Processing and Artificial Intelligence to get up to speed with the requirements of this project. This will surely stand us in good stead in future undertakings of a similar nature.
Hu, Guoning and DeLiang Wang. “Separation of stop consonants.” ICASSP-03, Vol. 2, pp. 749-752. 2003.
Ali, Ahmed M.A., Jan Van der Spiegel, and Paul Mueller. “Robust Classification of Stop Consonants Using Auditory-Based Speech Processing.” IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP-2001), Salt Lake City, May, 2001.
Gold, Ben and Nelson Morgan. Speech and Audio Signal Processing. New York: John Wiley & Sons, 2000.
Denes, Peter and Elliot Pinson. The Speech Chain: The Physics and Biology of Spoken Language, 2nd edition. New York: W.H. Freeman and Company, 1993.
McClellan, James, Ronald Schafer, and Mark Yoder. DSP First: A Multimedia Approach. Upper Saddle River, NJ: Prentice Hall, 1999.
Krauss, Thomas, Loren Shure, and John Little. Signal Processing Toolbox. Natick, MA: The MathWorks, 1994.
Juang, Biing-Hwang, and Lawrence Rabiner. Fundamentals of Speech Recognition. Englewood Cliffs, NJ: PTR Prentice Hall, 1993.
Juneja, Amit, and Carol Espy-Wilson. “Segmentation of Continuous Speech Using Acoustic-Phonetic Parameters and Statistical Learning.”