Audiogram-Aided Speech Synthesis

Summary

I believe that text-to-speech synthesis can improve understanding by adapting to the hearing limits of the listener, as represented by the listener's audiogram, possibly as corrected by a hearing aid. I first wrote this on 2003-03-06, and updated it on 2005-02-19 to add a reference to epenthesis.

Background

Speeded speech delivery is important when the only medium for delivering information is audio, which is essentially time-linear. When much material needs to be reviewed, the listener wants the equivalent of visual skimming. That calls for a convenient way to understand content at a faster rate, with rate adjustment and repositioning to back up and rehear the interesting content.

I note that proper content structure, and navigation aids that exploit that structure, can also help. Digital talking books permit both, because they have searchable textual content that supports the narration.

If the source for speeded speech delivery is existing speech, one approach is simply playback at a faster rate, with a proportional increase in pitch. Long ago, that effect was described as "the chipmunk voice." One consequence is that a listener with high-frequency hearing loss has a hard time understanding.
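
As a rough illustration of why simple fast playback hurts such a listener, here is a minimal sketch of the arithmetic. Every frequency in the recording is multiplied by the speedup factor. The function names and the example cutoff are my own assumptions for illustration, not taken from any particular player.

```python
def shifted_frequency(freq_hz: float, speedup: float) -> float:
    """Frequency heard when a recording is simply played `speedup` times faster."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return freq_hz * speedup


def audible_after_speedup(freq_hz: float, speedup: float, cutoff_hz: float) -> bool:
    """True if the shifted component still falls at or below the listener's cutoff.

    `cutoff_hz` is a hypothetical single-number summary of an audiogram.
    """
    return shifted_frequency(freq_hz, speedup) <= cutoff_hz


# Example: a 3000 Hz consonant component played at 1.5x moves to 4500 Hz,
# which is above an assumed 4000 Hz hearing cutoff.
print(shifted_frequency(3000, 1.5))
print(audible_after_speedup(3000, 1.5, cutoff_hz=4000))
```

A real audiogram gives a threshold at each test frequency rather than one sharp cutoff, so this only shows the direction of the effect.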

A better approach today uses a non-pitch-shifting algorithm that shortens redundant intervals, namely vowel phonemes and pauses, and preserves inflection. That is appropriate for audio books, and perhaps also for voice mail.
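
To make the idea of shortening redundant intervals concrete, here is a deliberately simplified sketch that shortens only pauses, leaving every non-silent sample (and therefore the pitch) untouched. It is not one of the production time-scale algorithms and does not attempt to shorten vowels. The threshold, ratio, and function names are assumptions chosen for illustration.

```python
def _trim(run, keep_ratio, min_run):
    """Shorten one run of near-silent samples, never below `min_run` samples."""
    if len(run) < min_run:
        return run  # too short to count as a pause; leave it alone
    keep = max(min_run, int(len(run) * keep_ratio))
    return run[:keep]


def shorten_pauses(samples, threshold=0.01, keep_ratio=0.5, min_run=4):
    """Return a copy of `samples` with long low-amplitude runs shortened.

    samples: sequence of floats, assumed to be in the range -1.0 to 1.0.
    threshold: absolute amplitude below which a sample counts as silence.
    keep_ratio: fraction of each pause to keep.
    """
    out, run = [], []
    for s in samples:
        if abs(s) < threshold:
            run.append(s)          # still inside a pause
        else:
            out.extend(_trim(run, keep_ratio, min_run))
            run = []
            out.append(s)          # speech samples pass through unchanged
    out.extend(_trim(run, keep_ratio, min_run))  # trailing pause, if any
    return out


speech = [0.5, -0.4] + [0.0] * 10 + [0.3]
print(len(speech), len(shorten_pauses(speech)))
```

Because no speech sample is resampled, nothing is shifted upward in frequency; the cost is that only the pauses contribute to the speedup.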

Question

Do any of the constant-pitch algorithms for generating speeded speech accept the audiogram of the user, either unaided or aided by a hearing aid, to selectively condition the processing of phonemes? For example, stretching rather than squeezing the plosives and fricatives (which have high-frequency components) could adjust to the high-frequency cutoff of the user. This seems important to intelligibility, and might permit even more speedup.
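
The kind of processing I have in mind might be sketched as follows. This is only a guess at the selection logic, not a known algorithm. The phoneme classes, the frequency bands assigned to them, the 40 dB limit, and the stretch factor are all hypothetical values chosen for illustration.

```python
# Rough, illustrative energy bands (Hz) for consonant classes; assumed values.
PHONEME_BAND_HZ = {"plosive": 3500, "fricative": 5000}


def nearest_threshold(audiogram, freq_hz):
    """Hearing threshold (dB HL) at the audiogram frequency closest to freq_hz."""
    nearest = min(audiogram, key=lambda f: abs(f - freq_hz))
    return audiogram[nearest]


def duration_scale(phoneme_class, audiogram, speedup=2.0,
                   loss_limit_db=40.0, stretch=1.25):
    """Factor by which to multiply a phoneme's normal duration.

    Vowels and pauses are compressed by `speedup`. Plosives and fricatives
    are stretched when the listener's loss in their band reaches
    `loss_limit_db`, and otherwise left at normal length.
    """
    if phoneme_class in ("vowel", "pause"):
        return 1.0 / speedup
    loss = nearest_threshold(audiogram, PHONEME_BAND_HZ[phoneme_class])
    return stretch if loss >= loss_limit_db else 1.0


# Hypothetical audiogram: frequency (Hz) -> threshold (dB HL), sloping
# toward greater loss at high frequencies.
audiogram = {250: 10, 500: 15, 1000: 20, 2000: 35, 4000: 60, 8000: 70}
for cls in ("vowel", "plosive", "fricative"):
    print(cls, duration_scale(cls, audiogram))
```

A text-to-speech engine already knows the class of each phoneme it is about to generate, so it could apply such a factor directly. Whether the stretching actually improves intelligibility is the open question.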

Digital Talking Book Application

I am particularly interested in how this affects listeners to digital talking books, where many of the blind listeners also have uncorrectable high-frequency hearing loss. Linear reading at the highest understandable rate would seem to be a natural way to listen, presuming that rapid slowdown, backup, and replay are easy. A synthesis algorithm that exploits the user's audiogram might allow even faster listening with understanding.

I realize that speech speedup applied to the audio stream primarily reduces the duration of vowel and silence intervals. I am unaware of any exploitation of the potential for stretching the plosives/fricatives to reduce their high-frequency components.

Do you know of any algorithms that do that? How complex are they? This would seem easier in text-to-speech synthesis than in speeding up an existing narrated audio source.

Do users with high-frequency hearing loss generally choose male synthetic voices, since those have lower fundamental frequencies?

The lower fundamental takes longer to start. So that would work against speedup. [A skillful organist playing a 32' organ rank pipe on the pedals plays it in advance so it is voiced at the same time as the pipes of smaller ranks played on the keyboards.] Would that technique help in the text-to-speech synthesis?
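
The timing involved can be estimated from the period of the fundamental, which is 1 / f0. The sketch below assumes, purely for illustration, that a tone needs a fixed number of cycles before it is heard as voiced; that cycle count and the function names are my own assumptions.

```python
def period_ms(f0_hz: float) -> float:
    """Duration of one cycle of the fundamental, in milliseconds."""
    if f0_hz <= 0:
        raise ValueError("f0_hz must be positive")
    return 1000.0 / f0_hz


def lead_time_ms(f0_low_hz: float, f0_ref_hz: float, cycles: int = 3) -> float:
    """How much earlier the lower tone must start to be established together
    with the reference tone, assuming each needs `cycles` full periods."""
    return cycles * (period_ms(f0_low_hz) - period_ms(f0_ref_hz))


# The lowest C of a 32' organ rank is about 16.35 Hz, so one cycle
# lasts roughly 61 ms.
print(round(period_ms(16.35), 1))

# A 100 Hz voice compared with a 200 Hz voice, under the 3-cycle assumption.
print(lead_time_ms(100, 200))
```

For voices the difference is a matter of milliseconds per onset, far smaller than for the organ pipe, so whether an advance start would matter at high speech rates remains to be tested.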

A technique used by the noted choral conductor Robert Shaw was to have the singers add a dwelling vowel phoneme to the ends of plosives and fricatives, to make those consonants stand out in words. That is called epenthesis.

For example, "breadth" might be sung as "bareadatha." I've approximated the extended phonemes to reduce their 'plosiveness'. This seems to improve the intelligibility.

That helps explain Shaw's special reputation for preparing great choral performances.

I note that the telephone network has a high-frequency cutoff at about 3 kHz. So what is delivered over the phone has already lost the speech content that helps distinguish among plosives. Phone users are accustomed to making this adjustment, often relying on context.

I discuss epenthesis in much more detail on another page.

Harvey Bingham
Invited Expert, World Wide Web -- Web Accessibility Initiative
Digital Talking Books XML application design consultant for the international Digital Audio Information System (DAISY) Consortium, the Recording for the Blind and Dyslexic, the U.S. Library of Congress National Library Service for the Blind and Physically Handicapped, and the National Information Standards Organization.

See the entire approved ANSI/NISO digital talking book standard:

Standard ANSI/NISO Z39.86-2002, ISSN: 1041-5653, File Specifications for the Digital Talking Book.
DAISY Consortium