Speech recognition is a key enabling technology for the next generation of user-friendly computer applications that aim to simplify the lives of everyday users. A number of obstacles remain in the way of this goal and must be overcome before speech recognition can become a mainstream technology. One of these obstacles is the difficulty of distinguishing stop consonants: sounds created by stopping the flow of air in the mouth and releasing it in a burst (/b/, /d/, /g/, /k/, /p/, and /t/). Since all stop consonants follow a similar stop-followed-by-burst pattern, and the burst is typically unvoiced, telling them apart is a particularly difficult problem. Building a system focused on distinguishing stop consonants may help bring speech recognition one step closer to its larger goal of an ideal user interface.


The Human Beatbox is essentially a voice-to-drum synthesizer that accepts speech input limited to a small dictionary of sounds the system is pre-trained to recognize. Each sound in the dictionary has a predetermined corresponding drumbeat; the Beatbox detects which dictionary sound was spoken and outputs the matching drumbeat. In this manner, someone with no knowledge of the drums can effectively “play” the instrument with their mouth. While the principal theoretical premise of the project rests on detection of stop consonants, for the purposes of demonstration we have chosen a dictionary consisting of the following sounds: /doo/, /pff/, /k/, /th/, and /psh/. These sounds are more representative of a cappella drum sounds and contribute to a more realistic user experience; however, our program works equally well on the stop-consonant dictionary specified above. The Beatbox is equipped with a graphical user interface that allows users to train the program immediately before testing it. Users may also skip training and test the system directly using pre-trained data, generated principally by the creators. The GUI also displays the waveform received by the microphone and the corresponding spectrogram, showing the distribution of frequencies and their intensities in the input sound, allowing for a richer musical experience!
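The sound-to-drumbeat mapping described above can be sketched as a simple lookup table. The following is an illustrative outline only; the sound labels come from our dictionary, but the drum-sample names and function are hypothetical stand-ins, not the actual implementation:

```python
# Illustrative sketch of the Beatbox mapping stage: each recognized
# dictionary sound is looked up in a table that maps it to a drum
# sample to play. Sample file names here are hypothetical.
DRUM_MAP = {
    "doo": "kick.wav",
    "pff": "snare.wav",
    "k":   "hihat_closed.wav",
    "th":  "hihat_open.wav",
    "psh": "cymbal.wav",
}

def respond(recognized_sound, drum_map=DRUM_MAP):
    """Return the drum sample for a recognized dictionary sound,
    or None if the sound is not in the dictionary."""
    return drum_map.get(recognized_sound)
```

In practice the recognizer runs upstream of this stage, so the mapping itself stays trivial; swapping in a different dictionary (such as the six stop consonants) only means changing the table.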





Recognition of stop consonants is not a trivial problem and has been the subject of ongoing research. Stop consonants are those sounds created by stopping the flow of air in the mouth and letting it go in a burst as illustrated in Figure 1. The closure may be voiced or unvoiced, while the burst is generally unvoiced.


Figure 1: Waveform and Spectrogram of stop consonant /g/ in the word ‘good.’
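The closure-then-burst pattern visible in Figure 1 suggests a simple time-domain cue: a near-silent interval followed by a sudden jump in energy. The sketch below is a minimal illustration of that idea, not our actual detector; the function names and thresholds are ours:

```python
def short_time_energy(signal, frame_len=256):
    """Energy of each non-overlapping frame of the signal."""
    return [sum(x * x for x in signal[i:i + frame_len])
            for i in range(0, len(signal) - frame_len + 1, frame_len)]

def looks_like_stop(energies, closure_thresh=0.01, burst_ratio=10.0):
    """Flag a near-silent closure frame followed by a sudden energy
    jump -- the stop-followed-by-burst pattern. Thresholds are
    illustrative and would need tuning on real recordings."""
    for prev, cur in zip(energies, energies[1:]):
        if prev < closure_thresh and cur > burst_ratio * max(prev, 1e-9):
            return True
    return False
```

A cue like this can locate a candidate stop, but it cannot say *which* stop occurred; distinguishing among the six consonants requires looking at the spectral content of the burst, as discussed next.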


The English language has six stop consonants: /b/, /d/, /g/, /k/, /p/, and /t/. As Figure 2 shows, it is difficult to distinguish between stop consonants based on their spectral components alone.


Figure 2: The average auditory spectrum of stop consonants.

From left to right: Circles - /t/, /p/, /k/; Lines - /d/, /b/, /g/
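To make the notion of “spectral components” concrete, here is a minimal sketch of how the magnitude spectrum of a single audio frame might be computed. It uses a naive DFT for clarity; a real system would use an FFT and many overlapping frames to build a spectrogram like the one in Figure 1:

```python
import cmath
import math

def magnitude_spectrum(frame):
    """Magnitude spectrum of one audio frame via a naive DFT
    (illustrative only; an FFT is used in practice)."""
    n = len(frame)
    # A Hann window tapers the frame edges to reduce spectral leakage.
    windowed = [x * (0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    spectrum = []
    for k in range(n // 2 + 1):  # keep only non-negative frequencies
        s = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                for i, x in enumerate(windowed))
        spectrum.append(abs(s))
    return spectrum
```

Averaging such spectra over many burst frames of the same consonant yields curves like those in Figure 2; the overlap between those curves is precisely why spectral shape alone is a weak discriminator.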


Stop consonants are typically short, speaker-specific, and context-specific, making them particularly hard to detect [2]. In fact, modern speech-recognition software often ignores stop consonants and interprets words based on other sounds that are easier to identify.


This project was intended to expose us to the fascinating problem of stop consonant recognition. Furthermore, we hypothesized that focusing our frequency analysis on features relevant to detecting stop consonants, rather than to speech in general, would enable us to reach higher accuracy in the detection process.



(Figures 1 and 2 are courtesy of Guoning Hu and DeLiang Wang: Separation of stop consonants.)