Kids' Audio Speech Corpus

| Information | Purpose |
|---|---|
| Age | Acoustic normalization, disfluency modeling |
| Second language(s) | Accent normalization research |
| Birthplace (state, country) | Accent normalization research |
| Grade in school | Identification of speaker group |
| Speaker number (unique) | Identification of speaker |

Table 1: Auxiliary information collected for each child
Schools
The speech data were collected from students in five schools in the Boulder, Colorado area (summarized in Table 2). During subsequent years, data collection will move to other cities and towns throughout Colorado. Recording is being done in different locations within schools (instead of a sound-attenuated booth) in order to maximize, to the extent possible, the similarity between the recording conditions and "real-life" usage of speech technologies developed with such data. In total, 780 children were recorded. Because of several transcription issues, 660 speakers were retained in the final audio corpus.

| Grade | Foothill | Fireside | Eisenhower | Sanchez | Pioneer | Total |
|---|---|---|---|---|---|---|
| Kindergarten | 15 | 24 | 0 | 12 | 18 | 69 |
| First | 27 | 37 | 33 | 32 | 23 | 152 |
| Second | 23 | 34 | 37 | 21 | 32 | 147 |
| Third | 45 | 17 | 40 | 24 | 13 | 139 |
| Fourth | 26 | 18 | 51 | 25 | 25 | 145 |
| Fifth | 35 | 38 | 19 | 29 | 7 | 128 |
| Total by School | 171 | 168 | 180 | 143 | 118 | 780 |

Table 2: Number of children recorded, by grade level and Boulder Valley school
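
The row and column totals in Table 2 can be re-derived from the per-school counts. The short Python sketch below does that arithmetic; the variable names are illustrative only and are not part of the corpus distribution.

```python
# Per-grade counts copied from Table 2, in the column order listed in SCHOOLS.
SCHOOLS = ["Foothill", "Fireside", "Eisenhower", "Sanchez", "Pioneer"]
COUNTS = {
    "Kindergarten": [15, 24, 0, 12, 18],
    "First": [27, 37, 33, 32, 23],
    "Second": [23, 34, 37, 21, 32],
    "Third": [45, 17, 40, 24, 13],
    "Fourth": [26, 18, 51, 25, 25],
    "Fifth": [35, 38, 19, 29, 7],
}

# Row totals (children per grade) and column totals (children per school).
by_grade = {grade: sum(row) for grade, row in COUNTS.items()}
by_school = {
    school: sum(row[i] for row in COUNTS.values())
    for i, school in enumerate(SCHOOLS)
}
total_recorded = sum(by_grade.values())

print(by_grade)
print(by_school)
print(total_recorded)
```

The grand total computed this way matches the 780 children reported as recorded; the 660 retained speakers are a subset whose per-school breakdown is not given here.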
Recording and Collection Specifications
The audio portion of the corpus was recorded in relatively quiet schoolroom environments (e.g., utility rooms, libraries) at a 16-kHz sampling frequency with 16 bits per sample, using a Sound-Blaster audio card. Recording was done using one of three types of microphones: a commonly available head-mounted noise-canceling microphone (LABtec LVA-8450), an array microphone (CNnetcom-Voice Array Microphone VA_2000), and a commonly available desktop far-field microphone. Recording was done in a room within the school instead of a soundproof booth in order to maximize, to the extent possible, the similarity between the corpus and the "real-life" conditions envisioned for the reading tutors. Each waveform was automatically checked for clipping and appropriate energy levels immediately after the utterance was produced; in addition, each recorded waveform was displayed on the screen immediately after recording to allow manual checking for waveform cutoff. If the automatically determined recording quality was unacceptable, the child was asked to repeat the utterance; only the final repeated utterance was saved. If the recording was otherwise acceptable, the child and experimenter had the opportunity to re-record an utterance to correct pronunciation errors or waveform-cutoff errors. In this case, both the original and new versions of the waveform were saved.
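
The automatic quality check described above (clipping and energy level) is not specified in detail. The Python sketch below shows one way such a check could look for 16-bit audio. The function names and all thresholds (`max_clipped_fraction`, `min_rms`) are assumptions chosen for illustration, not the values used during collection.

```python
import array
import math
import sys
import wave


def read_pcm16(path):
    """Read a 16-bit PCM RIFF/WAV file and return its samples as signed ints."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("expected 16 bits per sample")
        frames = wav.readframes(wav.getnframes())
    samples = array.array("h")
    samples.frombytes(frames)
    if sys.byteorder == "big":  # WAV data is little-endian
        samples.byteswap()
    return samples


def check_waveform(samples, max_clipped_fraction=0.001, min_rms=200.0):
    """Return (ok, reason). Thresholds are hypothetical, not from the corpus."""
    if len(samples) == 0:
        return False, "empty recording"
    # Samples at (or next to) the 16-bit limits are counted as clipped.
    clipped = sum(1 for s in samples if abs(s) >= 32767)
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    if clipped / len(samples) > max_clipped_fraction:
        return False, "clipping"
    if rms < min_rms:
        return False, "energy too low"
    return True, "ok"
```

In a recording loop, a result other than `"ok"` would correspond to asking the child to repeat the utterance, for example `ok, reason = check_waveform(read_pcm16(path))`.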
The corpus consists of three types of speech: prompted, read, and spontaneous speech. An animated face with natural speech was used to elicit the prompted speech; for read speech, the text to be read was displayed on the screen in large letters. Prior to data collection of each task, the animated face gave instructions about what would happen and what was expected from the subject.
File Formats
Each utterance was written to an audio file in RIFF format. All filenames have a two-letter code to identify the corpus ("CC" for the comprehensive children's speech corpus), and codes that identify the grade, gender, sub-protocol, child, utterance, attempt number, and microphone type; the code fields are separated by dashes. The format is CC-GG-PP-NNNNN-UUUUU-A-C.wav where "GG" indicates the grade, "PP" indicates the sub-protocol, "NNNNN" specifies the unique speaker number, "UUUUU" specifies the utterance number, "A" indicates which attempt has been made at this utterance, and "C" identifies the microphone channel. The range of each of these fields is given in Table 3. Each utterance is assigned a unique number; in the case of spontaneous speech, the number indicates the topic instead of the expected word or word sequence.
The auxiliary information was written to a text file in ASCII format.
The filenames are of the format CC-GG-PP-NNNNN-info.txt, where "CC" indicates
the comprehensive children's speech corpus, "GG" indicates the grade,
"PP" indicates the sub-protocol, "NNNNN" provides the speaker identification
number, and the keyword "info" indicates that the file contains auxiliary
information about the speaker.

| Field | Possible Values |
|---|---|
| Grade | 00 (kindergarten), 01, 02, 03, 04, 05, 06, 07 |
| Sub-Protocol | 01, 02, …, 30 |
| Speaker ID | 00001 through Y (Y is the total number of speakers in the corpus), in decimal format |
| Utterance ID | 00001 through Z (Z is the total number of utterances in the corpus) |
| Attempt | 1, 2, 3, … |
| Channel | 0, 1, 2 |

Table 3: Explanation of file naming conventions in the CU Kids' corpus
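
The naming convention above lends itself to mechanical parsing. The following Python sketch splits an utterance or info filename into its fields and checks them against the ranges in Table 3. The function and class names are hypothetical, and the example filename in the usage note is made up for illustration. The text does not state which microphone corresponds to which channel value, so the channel is returned as a plain integer.

```python
import re
from typing import NamedTuple


class UtteranceFile(NamedTuple):
    grade: int          # 0 = kindergarten
    sub_protocol: int
    speaker_id: int
    utterance_id: int
    attempt: int
    channel: int        # microphone channel 0, 1, or 2


# CC-GG-PP-NNNNN-UUUUU-A-C.wav and CC-GG-PP-NNNNN-info.txt
_UTTERANCE = re.compile(r"CC-(\d{2})-(\d{2})-(\d{5})-(\d{5})-(\d+)-(\d)\.wav")
_INFO = re.compile(r"CC-(\d{2})-(\d{2})-(\d{5})-info\.txt")


def _check_ranges(grade, sub_protocol):
    if not 0 <= grade <= 7:
        raise ValueError(f"grade out of range: {grade}")
    if not 1 <= sub_protocol <= 30:
        raise ValueError(f"sub-protocol out of range: {sub_protocol}")


def parse_utterance_filename(name):
    """Parse an audio filename into its six numeric fields."""
    match = _UTTERANCE.fullmatch(name)
    if match is None:
        raise ValueError(f"not an utterance filename: {name!r}")
    fields = UtteranceFile(*map(int, match.groups()))
    _check_ranges(fields.grade, fields.sub_protocol)
    if fields.attempt < 1 or fields.channel > 2:
        raise ValueError(f"attempt or channel out of range: {name!r}")
    return fields


def parse_info_filename(name):
    """Parse an auxiliary-information filename into (grade, sub_protocol, speaker_id)."""
    match = _INFO.fullmatch(name)
    if match is None:
        raise ValueError(f"not an info filename: {name!r}")
    grade, sub_protocol, speaker_id = map(int, match.groups())
    _check_ranges(grade, sub_protocol)
    return grade, sub_protocol, speaker_id
```

For example, `parse_utterance_filename("CC-03-12-00417-01234-1-0.wav")` yields grade 3, sub-protocol 12, speaker 417, utterance 1234, first attempt, channel 0.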
Video Collection
Simultaneously with the audio recording, a video recording was made of the child's face. The video recordings are not included as part of the CU Kids' audio corpus. For collection, a Canon Elura 2mc digital video recorder with minimal machinery noise was used (in informal tests, the noise was inaudible). The video recorder was placed next to the computer monitor on the top left side of the screen (from the user's perspective) so that the child's mouth was visible at all times during speech. In preliminary recordings, the video recorder's internal microphone was used to record audio data with the video data; in later recordings, the array microphone input to the computer sound card will be split and sent to the video camera as well, to ensure that the video and audio data contain the same signal.
Transcription Guidelines
Transcriptions of the data include word-level text. The word-level text transcriptions include mispronunciations, false starts, wrong words, and filled pauses. A summary of the transcription markups is provided in Table 7.
Inter-Labeler Reliability
To assess reliability among transcribers, we compared transcriptions produced independently by transcribers for 40 speakers. This amounted to 2744 utterances. The results of the comparisons are displayed below. Labeler disagreement rate is measured using one transcription as a reference standard and the other as a hypothesis (swapping reference and hypothesis does not change the results). The NIST SCTK 1.2 scoring package was used to measure the disagreements (a summation of the substitution, deletion, and insertion errors) between the two transcriptions.

| Entire Corpus | % Disagreement |
|---|---|
| Raw comparison of transcriptions | 20.4 |
| After removing punctuation symbols | 17.8 |
| After removing markups (<BN>, <NPS>, etc.) | 11.1 |
| After removing partial words | 9.5 |
| After removal of spontaneous speech | 7.1 |
| Spontaneous Speech Only | |
| Raw comparison of transcriptions | 23.6 |
| After removing punctuation symbols | 23.2 |
| After removing markups (<BN>, <NPS>, etc.) | 20.4 |
| After removing partial words | 19.6 |

Table 4: Inter-labeler reliability for 40 speakers in the CU Kids' Corpus
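
The disagreement figures above were produced with the NIST scoring package. As a rough illustration of the underlying measure, the Python sketch below counts word-level substitutions, deletions, and insertions with a plain minimum edit distance and divides by the number of reference words. This is a stand-in and not the NIST alignment procedure. The `normalize` helper and its markup pattern are assumptions about how punctuation and markups such as `<BN>` might be stripped before comparison.

```python
import re
import string

_MARKUP = re.compile(r"<[A-Za-z]+>")               # e.g. <BN>, <NPS> (assumed form)
_PUNCT = str.maketrans("", "", string.punctuation)


def normalize(text, drop_punctuation=False, drop_markups=False):
    """Split a transcription into tokens, optionally dropping punctuation/markups."""
    tokens = []
    for tok in text.split():
        if _MARKUP.fullmatch(tok):
            if not drop_markups:
                tokens.append(tok)
            continue
        if drop_punctuation:
            tok = tok.translate(_PUNCT)
        if tok:
            tokens.append(tok)
    return tokens


def edit_distance(ref, hyp):
    """Minimum number of substitutions, deletions, and insertions (unit costs)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # match or substitution
        prev = cur
    return prev[-1]


def disagreement_rate(ref_tokens, hyp_tokens):
    """Percent disagreement relative to the reference transcription."""
    if not ref_tokens:
        raise ValueError("reference transcription is empty")
    return 100.0 * edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)
```

In this sketch the error count is the same whichever transcription is taken as the reference, while the percentage is normalized by the reference length, so the two directions give the same rate when both transcriptions contain the same number of words and similar rates otherwise.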
In general, we see that there is roughly a 20% disagreement
between the raw text of the two transcribers. However, 2.2% of this disagreement
is simply differences in punctuation, and an additional 6.7% of the differences
are due to disagreements in markups such as background noise (<BN>).
The protocol was recorded in the order of prompted speech, read speech, and spontaneous speech. A summary of the data collected by data type, grade level, and school is shown in Table 5. A summary of the protocol used is provided in Table 6. The protocol sentences and words are provided as an external file (protocol-spec.txt) included on this CDROM.

| Category | Kind. | 1st | 2nd | 3rd | 4th | 5th | Total |
|---|---|---|---|---|---|---|---|
| Individual Phonemes & Sounds | 85 | 164 | 176 | 136 | 113 | 143 | 817 |
| Letter & Numbers | 178 | 340 | 360 | 280 | 238 | 304 | 1700 |
| Commands, Descriptions, Misc. | 170 | 309 | 324 | 239 | 204 | 255 | 1501 |
| Words Related to Math or Time | 161 | 301 | 318 | 245 | 219 | 276 | 1520 |
| Digit Strings | 261 | 476 | 510 | 398 | 343 | 421 | 2409 |
| Names & Teddy Categories | 326 | 478 | 617 | 326 | 302 | 445 | 2494 |
| Words to Elicit Emotions | 130 | 226 | 253 | 109 | 133 | 157 | 1008 |
| Difficult Words | 0 | 5 | 59 | 15 | 17 | 38 | 134 |
| Isolated Words | 2208 | 4146 | 4394 | 3360 | 2903 | 3683 | 20694 |
| Prompted Common Sentences | 274 | 458 | 514 | 209 | 278 | 308 | 2041 |
| Prompted (not read) Sentences | 163 | 234 | 257 | 136 | 152 | 167 | 1109 |
| Read Sentences | 159 | 622 | 753 | 628 | 534 | 687 | 3383 |
|
Isolated
Words from Sentences |