Portable Curriculum in Human Language Technologies

Course:       Text-to-Speech Synthesis


Instructor:      Alan W. Black
                            Personal page: http://www.cs.cmu.edu/~awb/

Affiliation:       Language Technologies Institute
                         Carnegie Mellon University, Pittsburgh, PA 15213
                             Lab website: http://www.lti.cs.cmu.edu/


General Description

The course is designed to cover all aspects of speech synthesis from both a theoretical and practical point of view. Students are given the opportunity to learn about new research in the areas of text processing, prosodic modeling and waveform synthesis, as well as to gain practical experience in using existing synthesis technologies. The course consists of the following sub-parts:

  • History and general use of speech synthesis;
  • Text analysis: text conditioning, markup languages, homograph disambiguation;
  • Lexicons and letter to sound rules;
  • Prosodic modeling: phrasing, duration and intonation;
  • Waveform synthesis: diphones and unit selection;
  • Building new voices in new languages; and,
  • Limited domain synthesis for practical applications.


Course Objectives

The objectives of this course are:

  • To understand the basic parts of speech synthesis
  • To understand the relative complexity of implementing solutions to the problems
  • To become familiar with the Festival architecture and know what it can and can't do

As the instructor is a firm believer in learning by doing, this course tries to touch on every aspect of speech synthesis from a practical view. Problems are discussed in general terms, with some presentation of potential theoretical solutions. Where appropriate, substantial exercises are given, which should lead to a greater understanding of the actual problems.


Learning Activities

The course is based heavily around the Festival Speech Synthesis System. As Festival offers both an environment for building new synthetic voices and an end-user delivery vehicle for black-box text-to-speech, it is an ideal platform for teaching students what can be done with today's speech output technologies. Each week, simple exercises are assigned involving different aspects of the system so that students can learn from practical experience how the technology works. The system is designed such that no low-level C++ programming is required, which opens the course to a much wider audience. In all cases, existing simple rules and functions used in Festival are presented to students for modification using the Scheme scripting language; this enables students to learn without having to delve too deeply into the complexities of the system.

In addition to synthesis techniques, the students are led into the field of building new synthetic voices in new and currently supported languages based on the released documentation and scripts that are part of the CMU FestVox Project (http://festvox.org). These scripts and tools sit on top of Festival (and the Edinburgh Speech Tools) and offer a complete environment for developing new synthetic voices.

In addition to the weekly exercises, a larger project is set towards the end of the course.


History and Background of the Course

This course, now completing its second year, is primarily designed for entering graduate students at CMU majoring in language technologies, computer science or robotics. Although some students will continue their research in speech synthesis, most are in more general areas of speech and language processing. The attendees in the second year also included two senior undergraduates. Some of the projects completed at the end of the course have led to publications. Such projects have included cross-language limited domain synthesis (a talking clock in Chinese and Polish weather reports), Thai letter to sound rules, a talking Eliza program, complete new female US English diphone voices, horoscopes, singing synthesis and a Catalan diphone synthesizer.

Although the CSLU Speech Toolkit itself is not used, the Festival Speech Synthesis System is an integral part of the toolkit, so the course can be taught using the toolkit under Windows. Moreover, all voices, techniques, models, etc. developed within this course can be used directly in toolkit applications.

Development of the course was sponsored in part by an NSF CRCD (Combined Research and Curriculum Development) grant awarded to the University of Colorado at Boulder.


Links to the Course Materials

The complete course notes and slides have been made available at

      http://festvox.org/festtut/

This site will also be updated with some of the example student projects that were completed and with model answers to the exercises (some are already in the general section of the FestVox site). We are continuing to update these notes, and new releases will make it easier both for individual students to follow the course notes and for institutions to use them to teach their own course.

If you have questions regarding the course, please e-mail the course instructor:

      Alan W. Black, Ph.D.