Researchers and developers thus have a critical need for tools to support basic research in language technology, and for tools to support the rapid design and implementation of spoken language systems. The Center for Spoken Language Research (CSLR) of the University of Colorado addresses this need with its CSLU Toolkit. By packaging state-of-the-art speaker- and vocabulary-independent recognition technology, text-to-speech synthesis, and facial animation software within a powerful yet easy-to-use authoring environment for building spoken language systems, we are creating a research and development environment that is suitable for exploring a broad range of real-world applications.
Our goals are twofold: to give researchers the knowledge and tools to advance the state of the art, and to give even inexperienced users powerful tools for rapidly designing, testing and deploying spoken language systems. The knowledge and experience gained from prototyping with the CSLU Toolkit can significantly reduce the cost of the advanced engineering necessary to develop a deployable spoken language application. In addition, the CSLU Toolkit provides experienced researchers with an environment for performing research and for testing and showcasing research advances.
Speech recognition: The toolkit supports several approaches to speech recognition including artificial neural network (ANN) classifiers, hidden Markov models (HMM) and segmental systems. It comes complete with a vocabulary-independent speech recognition engine, plus several vocabulary-specific recognizers (e.g., alpha-digits). In addition, it includes all the necessary tutorials and tools for training new ANN and HMM recognizers.
Speech synthesis: The toolkit integrates the Festival text-to-speech synthesis system, developed at the University of Edinburgh (Black & Taylor, 1997). CSLU has developed a waveform-synthesis "plug-in" component (Macon et al., 1997) and six voices, including male and female versions of American English and Mexican Spanish. Festival provides a complete environment for learning, researching and developing synthetic speech, including modules for normalizing text (e.g., dealing with abbreviations), transforming text into a sequence of phonetic segments with appropriate durations, assigning prosodic contours (e.g., pitch, amplitude) to utterances, and generating speech using either diphone or unit-selection concatenative synthesis.
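The text-normalization stage mentioned above can be illustrated with a small sketch. This is not Festival's implementation: the abbreviation table, the digit-by-digit reading of numbers, and the function name normalize are all assumptions chosen for illustration, and a real synthesizer applies far richer, context-dependent rules.

```python
import re

# Hypothetical abbreviation table, for illustration only.
ABBREVIATIONS = {"dr.": "doctor", "st.": "street"}

DIGIT_NAMES = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]


def normalize(text):
    """Turn raw text into a list of plain lowercase words.

    Known abbreviations are expanded, digit strings are read out
    digit by digit, and punctuation is stripped from other tokens.
    """
    words = []
    for token in text.lower().split():
        if token in ABBREVIATIONS:
            words.append(ABBREVIATIONS[token])
        elif token.isdigit():
            words.extend(DIGIT_NAMES[int(d)] for d in token)
        else:
            cleaned = re.sub(r"[^a-z']", "", token)
            if cleaned:
                words.append(cleaned)
    return words


if __name__ == "__main__":
    print(normalize("Dr. Smith lives at 42 Elm St."))
```

In a full pipeline, the resulting word list would be passed on to the later stages described above: conversion to phonetic segments with durations, prosody assignment, and waveform generation.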
Facial animation: The toolkit features Baldi, an animated 3D talking head developed at the University of California, Santa Cruz. Baldi, driven by the speech recognition and synthesis components, is capable of automatically synchronizing natural or synthetic speech with realistic lip, tongue, mouth and facial movements. Baldi's capabilities have recently been extended to provide powerful tools for language training. The face can be made transparent, revealing the movements of the teeth and tongue during speech. The orientation of the face can be changed so that it can be viewed from different perspectives while speaking. In addition, the basic emotions of surprise, happiness, anger, sadness, disgust, and fear can be communicated through facial expressions.
Authoring tools: The toolkit includes the Rapid Application Developer (RAD), which makes it possible to quickly design a speech application using a simple drag-and-drop interface. RAD seamlessly integrates the core technologies with other useful features such as word-spotting, barge-in, dialogue repair, telephone and microphone interfaces, and open-microphone capability. This software makes it possible for people with little or no knowledge of speech technology to develop speech interfaces and applications in a matter of minutes.
Waveform analysis tools: The toolkit provides a complete set of tools for recording, representing, displaying and manipulating speech. Signal representations such as spectrograms, pitch contours and formant tracks can be displayed and manipulated in separate windows. The display tools allow recognition results, such as phonetic or word decoding, to be displayed and time-aligned with recognized utterances. Three-dimensional arrays can also be aligned to utterances, showing, for example, the output categories of a neural network phonetic classifier.
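To make the idea of time-aligning recognition results with an utterance concrete, the following minimal sketch maps word segments onto fixed-rate analysis frames. It does not use the toolkit's own display code. The Segment type, the function labels_per_frame, and the 10 ms frame step are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One recognized unit (word or phone) with its time span."""
    label: str
    start_ms: float
    end_ms: float


def labels_per_frame(segments, n_frames, frame_ms=10.0):
    """Return one label per frame, or None where no segment applies.

    A frame is assigned to a segment when the frame's center time
    falls inside the half-open interval [start_ms, end_ms).
    """
    labels = []
    for i in range(n_frames):
        center = (i + 0.5) * frame_ms
        match = None
        for seg in segments:
            if seg.start_ms <= center < seg.end_ms:
                match = seg.label
                break
        labels.append(match)
    return labels


if __name__ == "__main__":
    words = [Segment("hello", 0, 30), Segment("world", 50, 80)]
    print(labels_per_frame(words, n_frames=8))
```

The same frame index can then be used to look up a row of a spectrogram or a vector of classifier outputs, which is what allows several representations to be displayed in step with one another.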
Programming environment: The toolkit comes with complete programming environments for both C and Tcl, which incorporate a collection of software libraries and a set of APIs (Schalkwyk et al., 1997). These libraries serve as basic building blocks for toolkit programming. They are portable across platforms and provide the speech, language, networking, input, output, and data transport capabilities of the toolkit. Natural language processing modules, developed in Prolog, interface with the toolkit through sockets.
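The socket-based coupling between a language-processing module and its client can be sketched generically as follows. This is not the toolkit's actual protocol: the newline-terminated request and reply, the "PARSED" prefix, and the function names serve_once, query and demo are hypothetical, and Python stands in here for both the Prolog module and the toolkit side.

```python
import socket
import threading


def serve_once(server_sock):
    """Accept one connection, read one line, send one reply line."""
    conn, _ = server_sock.accept()
    with conn, conn.makefile("r", encoding="utf-8") as reader:
        request = reader.readline().strip()
        # Stand-in for real language processing: uppercase the words.
        reply = "PARSED " + " ".join(request.upper().split())
        conn.sendall((reply + "\n").encode("utf-8"))


def query(host, port, utterance):
    """Send one utterance to the module and return its reply line."""
    with socket.create_connection((host, port)) as sock:
        sock.sendall((utterance + "\n").encode("utf-8"))
        with sock.makefile("r", encoding="utf-8") as reader:
            return reader.readline().strip()


def demo():
    """Run server and client in one process over the loopback interface."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    server.listen(1)
    port = server.getsockname()[1]
    worker = threading.Thread(target=serve_once, args=(server,))
    worker.start()
    try:
        return query("127.0.0.1", port, "call home")
    finally:
        worker.join()
        server.close()


if __name__ == "__main__":
    print(demo())
```

Keeping the language-processing module behind a socket means it can be written in a different language and run as a separate process, which is the design choice the paragraph above describes.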