Kids' Audio Speech Corpus
NSF/ITR Reading Project


Corpus Overview:

The CU Kids' Audio Speech Corpus was developed to support research in auditory recognition of children's speech. The goal of this two-year development effort was to collect enough audio and video data from children of preschool through middle school age to develop auditory and visual recognition systems that allow face-to-face conversational interaction with perceptive animated agents during learning tasks. The current corpus, which includes audio data from both prompted and spontaneous speech, was supported by the following grants:

  • NSF/IERI: EIA-0121201 - Kintsch, W., Caccamise, D., Cole, R., Olson, R., Snyder, L., "IERI: Scalable and Sustainable Technologies for Reading Instruction and Assessment"
  • NSF/ITR: IIS-0086107 - Cole, R., van Santen, J., Movellan, J., "ITR: Creating the Next Generation of Intelligent Animated Conversational Agents"

During the 2000-2001 school year, we developed and tested data collection protocols in collaboration with Dr. John-Paul Hosom at the Center for Spoken Language Understanding at the Oregon Graduate Institute (OGI). This work resulted in the collection of audio and video data from approximately 200 children. During the 2001-2002 school year, we further refined the data collection protocol and captured audio and video data from an additional 580 students in schools within the Boulder Valley School District in Colorado. The audio speech data were verified and transcribed at CSLR. Dr. Javier Movellan and his associates at the University of California, San Diego, are currently transcribing the video data, which will comprise the visual portion of this corpus. Written permission to distribute audio and visual data in this corpus was obtained from the parent or guardian of each student.

The CU Kids' audio and video speech corpora will continue to be developed over the next five years to incorporate audio and video data from schools throughout the state of Colorado. In addition to the data described here, which includes the initial release of audio data, we will be collecting speech of children reading out loud from books, and speech of children summarizing stories they have listened to or read. Our goal is to collect, analyze, understand and model the speech and visual behaviors of a diverse population of children, including children who speak Spanish and Native American languages at home. In addition, the data collection protocol will evolve over the course of the NSF ITR and IERI projects to enable a range of capabilities, including pronunciation training, recognition and transcription of natural continuous speech, and natural dialogue interaction with intelligent animated agents during reading and comprehension training.


Collection Logistics:

Location

Speech data were collected in 5 public schools within the Boulder Valley School District (BVSD), Boulder County, Colorado.

Speakers

Initial data collection efforts focused on children from kindergarten through fifth grade. Two of the schools included in the collection have a high percentage of non-native, predominantly Hispanic English speakers.

Auxiliary information about each child is collected for the sole purpose of gathering statistics to be used in research. This information, collected mostly from school records, is summarized in the table below.

Information                  Purpose
---------------------------  -------------------------------------------
Age                          Acoustic normalization, disfluency modeling
Second language(s)           Accent normalization research
Birthplace (state, country)  Accent normalization research
Grade in school              Identification of speaker group
Speaker number (unique)      Identification of speaker

Table 1: Auxiliary information collected for each child

Schools

The speech data were collected from students in five schools in the Boulder, Colorado, area (summarized in Table 2). During subsequent years, data collection will move to other cities and towns throughout Colorado. Recording is being done in different locations within schools (instead of a sound-attenuated booth) so that the recording conditions resemble, as far as possible, the "real-life" conditions in which speech technologies developed with such data will be used. In total, 780 children were recorded. Due to several transcription issues, 660 speakers were retained in the final audio corpus.

Grade            Foothill  Fireside  Eisenhower  Sanchez  Pioneer  Total
---------------  --------  --------  ----------  -------  -------  -----
Kindergarten           15        24           0       12       18     69
First                  27        37          33       32       23    152
Second                 23        34          37       21       32    147
Third                  45        17          40       24       13    139
Fourth                 26        18          51       25       25    145
Fifth                  35        38          19       29        7    128
Total by School       171       168         180      143      118    780

Table 2: Number of children recorded, by grade level and Boulder Valley school

Recording and Collection Specifications

The audio portion of the corpus was recorded in relatively quiet schoolroom environments (e.g., utility rooms, libraries) at a 16-kHz sampling frequency with 16 bits per sample, using a Sound-Blaster audio card. Recording was done using one of three types of microphones: a commonly available head-mounted noise-canceling microphone (LABtec LVA-8450), an array microphone (CNnetcom-Voice Array Microphone VA_2000), and a commonly available desktop far-field microphone. Recording was done in a room within the school instead of a soundproof booth so that the corpus resembles, as far as possible, the "real-life" conditions envisioned for the reading tutors. Each waveform was automatically checked for clipping and appropriate energy levels after each utterance was produced during the recording session; in addition, each recorded waveform was displayed on the screen immediately after recording to allow manual checking for waveform cutoff. If the automatically determined recording quality was unacceptable, the child was asked to repeat the utterance; only the final repeated utterance was saved. If the recording was otherwise acceptable, the child and experimenter had the opportunity to re-record an utterance to correct pronunciation errors or waveform-cutoff errors. In this case, both the original and new versions of the waveform were saved.
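The automatic quality check described above could look roughly like the following sketch. The thresholds (CLIP_LEVEL, MAX_CLIPPED_FRACTION, MIN_RMS) and the function names are assumptions made for illustration; the collection software's actual criteria are not documented here. The write_tone helper exists only to produce test signals.

```python
import math
import struct
import wave

# Assumed thresholds for illustration; the corpus documentation gives none.
CLIP_LEVEL = 32000            # |sample| at or above this counts as clipped
MAX_CLIPPED_FRACTION = 0.001  # tolerate at most 0.1% clipped samples
MIN_RMS = 200.0               # quieter than this is treated as too low


def check_recording(path):
    """Check a 16-bit mono WAV file for sample rate, clipping, and energy.

    Returns (ok, details), where details holds the measured values.
    """
    with wave.open(path, "rb") as wav:
        if wav.getnchannels() != 1 or wav.getsampwidth() != 2:
            raise ValueError("expected 16-bit mono PCM audio")
        rate = wav.getframerate()
        raw = wav.readframes(wav.getnframes())

    count = len(raw) // 2
    if count == 0:
        return False, {"sample_rate": rate, "clipped_fraction": 0.0, "rms": 0.0}

    # WAV sample data is little-endian signed 16-bit.
    samples = struct.unpack("<%dh" % count, raw)
    clipped = sum(1 for s in samples if abs(s) >= CLIP_LEVEL)
    rms = math.sqrt(sum(s * s for s in samples) / count)

    details = {
        "sample_rate": rate,
        "clipped_fraction": clipped / count,
        "rms": rms,
    }
    ok = (
        rate == 16000
        and details["clipped_fraction"] <= MAX_CLIPPED_FRACTION
        and rms >= MIN_RMS
    )
    return ok, details


def write_tone(path, amplitude, seconds=0.1, rate=16000, freq=440.0):
    """Write a sine tone as 16-bit mono WAV; out-of-range values saturate."""
    frames = bytearray()
    for i in range(int(seconds * rate)):
        value = int(amplitude * math.sin(2 * math.pi * freq * i / rate))
        frames += struct.pack("<h", max(-32768, min(32767, value)))
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(bytes(frames))
```

In a recording loop, a False result would correspond to asking the child to repeat the utterance, while the manual waveform-cutoff check remains a separate, visual step.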

The corpus consists of three types of speech: prompted, read, and spontaneous speech. An animated face with natural speech was used to elicit the prompted speech; for read speech, the text to be read was displayed on the screen in large letters. Prior to data collection of each task, the animated face gave instructions about what would happen and what was expected from the subject.

File Formats

Each utterance was written to an audio file in RIFF format. All filenames begin with a two-letter code that identifies the corpus ("CC" for the comprehensive children's speech corpus), followed by codes that identify the grade, sub-protocol, child, utterance, attempt number, and microphone channel; the code fields are separated by dashes. The format is CC-GG-PP-NNNNN-UUUUU-A-C.wav, where "GG" indicates the grade, "PP" indicates the sub-protocol, "NNNNN" specifies the unique speaker number, "UUUUU" specifies the utterance number, "A" indicates which attempt has been made at this utterance, and "C" identifies the microphone channel. The range of each field is given in Table 3. Each utterance is assigned a unique number; in the case of spontaneous speech, the number indicates the topic instead of the expected word or word sequence.

The auxiliary information was written to a text file in ASCII format. The filenames are of the format CC-GG-PP-NNNNN-info.txt, where "CC" indicates the comprehensive children's speech corpus, "GG" indicates the grade, "PP" indicates the sub-protocol, "NNNNN" provides the speaker identification number, and the keyword "info" indicates that the file contains auxiliary information about the speaker.

Field         Possible Values
------------  ------------------------------------------------------------
Grade         00 (kindergarten), 01, 02, 03, 04, 05, 06, 07
Sub-Protocol  01, 02, …, 30
Speaker ID    00001 through Y (Y is the total number of speakers in the corpus), in decimal format
Utterance ID  00001 through Z (Z is the total number of utterances in the corpus)
Attempt       1, 2, 3, …
Channel       0, 1, 2

Table 3: Explanation of file naming conventions in the CU Kids' corpus
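As an illustration of the naming convention, the sketch below splits an audio filename into its fields and checks them against the ranges in Table 3. The class and function names are hypothetical, the example filename is invented, and the fixed five-digit width for speaker and utterance numbers is an assumption read off the "NNNNN" and "UUUUU" placeholders. The mapping from channel number to microphone type is not stated in this document, so the channel is returned as a plain integer.

```python
import re
from typing import NamedTuple


class UtteranceName(NamedTuple):
    grade: int         # 0 = kindergarten, 1-7 = grade level
    sub_protocol: int  # 1-30
    speaker: int       # unique speaker number
    utterance: int     # utterance number (topic number for spontaneous speech)
    attempt: int       # 1, 2, 3, ...
    channel: int       # 0, 1, or 2 (microphone channel)


# CC-GG-PP-NNNNN-UUUUU-A-C.wav
_WAV_NAME = re.compile(
    r"CC-(\d{2})-(\d{2})-(\d{5})-(\d{5})-([1-9]\d*)-([0-2])\.wav"
)


def parse_utterance_name(filename):
    """Parse a corpus audio filename; raise ValueError if it does not fit."""
    match = _WAV_NAME.fullmatch(filename)
    if match is None:
        raise ValueError(f"not a corpus audio filename: {filename!r}")
    name = UtteranceName(*(int(group) for group in match.groups()))
    if not 0 <= name.grade <= 7:
        raise ValueError(f"grade out of range in {filename!r}")
    if not 1 <= name.sub_protocol <= 30:
        raise ValueError(f"sub-protocol out of range in {filename!r}")
    return name
```

For example, parse_utterance_name("CC-03-12-00042-00317-1-0.wav") yields grade 3, sub-protocol 12, speaker 42, utterance 317, attempt 1, channel 0. Grouping parsed names by speaker and utterance makes it easy to find utterances that were re-recorded, since those have more than one attempt number.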

Video Collection

Simultaneously with the audio recording, a video recording was made of the child's face. The video recordings are not included as part of the CU Kids' audio corpus. A Canon Elura 2mc digital video recorder with minimal machinery noise was used for collection (in informal tests, the noise was inaudible). The video recorder was placed next to the computer monitor, at the top left of the screen from the user's perspective, so that the child's mouth was visible at all times during speech. In preliminary recordings, the video recorder's internal microphone was used to record audio data with the video data; in later recordings, the array microphone input to the computer sound card will be split and sent to the video camera as well, to ensure that the video and audio data contain the same signal.

Transcription Guidelines

Transcriptions of the data consist of word-level text. The word-level transcriptions include mispronunciations, false starts, wrong words, and filled pauses. A summary of the transcription markups is provided in Table 7.

Inter-Labeler Reliability

To assess reliability among transcribers, we compared transcriptions produced independently by transcribers for 40 speakers. This amounted to 2744 utterances. The results of the comparisons are displayed below. Labeler disagreement rate is measured using one transcription as a reference standard and the other as a hypothesis (swapping reference and hypothesis does not change the results). The NIST Sctk-1.2 scoring package was used to measure the disagreements (a summation of the substitution, deletion, and insertion errors) between the two transcriptions.

Entire Corpus                               % Disagreement
------------------------------------------  --------------
Raw Comparison of Transcriptions                      20.4
After removing punctuation symbols                    17.8
After removing markups (<BN>, <NPS>, etc.)            11.1
After removing partial words                           9.5
After removal of spontaneous speech                    7.1

Spontaneous Speech Only                     % Disagreement
------------------------------------------  --------------
Raw Comparison of Transcriptions                      23.6
After removing punctuation symbols                    23.2
After removing markups (<BN>, <NPS>, etc.)            20.4
After removing partial words                          19.6

Table 4: Inter-labeler reliability for 40 speakers in the CU Kids' Corpus

In general, there is roughly a 20% disagreement in the raw text between two transcribers. However, 2.6% of this disagreement (20.4% versus 17.8% in Table 4) is simply differences in punctuation, and an additional 6.7% is due to disagreements in markups such as background noise (<BN>) or non-primary speaker (<NPS>). Disagreements about where to mark a word cutoff account for an additional 1.6% (e.g., "part-" vs. "partia-" for the word "partial").
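The disagreement measure can be illustrated with a minimal word-level edit-distance sketch. This is not the NIST scoring package's implementation: it only counts the minimum number of substitutions, deletions, and insertions between two word sequences and, as an assumption, divides by the number of words in the reference. The strip_markups helper assumes markups take the angle-bracket form shown in Table 4, and the example transcriptions are invented.

```python
import re


def word_errors(reference, hypothesis):
    """Minimum number of substitutions, deletions, and insertions
    needed to turn the reference word list into the hypothesis."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        current = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def disagreement_rate(reference_text, hypothesis_text):
    """Word errors divided by the reference word count (assumed normalization)."""
    reference = reference_text.split()
    hypothesis = hypothesis_text.split()
    if not reference:
        raise ValueError("reference transcription is empty")
    return word_errors(reference, hypothesis) / len(reference)


def strip_markups(text):
    """Remove angle-bracket markups such as <BN> or <NPS>."""
    return re.sub(r"<[^>]+>", " ", text)


# Invented example: two transcriptions of the same utterance.
first = "the <BN> cat sat"
second = "the cat sat down"
raw_rate = disagreement_rate(first, second)
clean_rate = disagreement_rate(strip_markups(first), strip_markups(second))
```

Here the raw comparison counts two errors against four reference words, while removing the markup leaves one error against three words. The same kind of reduction, applied across all 2744 utterances, is what the successive rows of Table 4 report.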


Data Collection Protocol Overview:

The protocol was recorded in the order of prompted speech, read speech, and spontaneous speech. A summary of the data collected by data type, grade level, and school is shown in Table 5. A summary of the protocol used is provided in Table 6. The protocol sentences and words are provided as an external file (protocol-spec.txt) included on this CDROM.

Category                       Kind.    1st    2nd    3rd    4th    5th  Total
-----------------------------  -----  -----  -----  -----  -----  -----  -----
Individual Phonemes & Sounds      85    164    176    136    113    143    817
Letters & Numbers                178    340    360    280    238    304   1700
Commands, Descriptions, Misc.    170    309    324    239    204    255   1501
Words Related to Math or Time    161    301    318    245    219    276   1520
Digit Strings                    261    476    510    398    343    421   2409
Names & Teddy Categories         326    478    617    326    302    445   2494
Words to Elicit Emotions         130    226    253    109    133    157   1008
Difficult Words                    0      5     59     15     17     38    134
Isolated Words                  2208   4146   4394   3360   2903   3683  20694
Prompted Common Sentences        274    458    514    209    278    308   2041
Prompted (not read) Sentences    163    234    257    136    152    167   1109
Read Sentences                   159    622    753    628    534    687   3383

Isolated Words from Sentences