Text-to-Speech Synthesis Can Exploit the Listener's Audiogram

Harvey Bingham, 2003-01-31

Summary

I believe that text-to-speech synthesis can be individualized to better match the hearing limits of the listener. I am unaware of any products other than hearing aids that allow this. I believe speech synthesis could and should provide this choice for those who are not so aided.

Telephone as Delivery Medium for Voice Web Browsing

Telephone information delivery is becoming an increasingly important use of synthesized speech. Telephone audio is already high-frequency limited (to roughly 3.4 kHz), but for some listeners even that limit is well above their hearing cutoff. Amplification alone is not enough unless it is made frequency selective, as the hearing aid industry exploits.
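
As a sketch of what frequency-selective gain driven by an audiogram might look like in software (the function name, the audiogram format, and the half-gain rule here are my illustrative assumptions, not any product's method):

    import numpy as np

    def audiogram_gain(signal, sample_rate, audiogram_hz, loss_db,
                       half_gain=0.5, max_gain_db=30.0):
        """Boost each frequency region by a fraction of the measured loss.
        A single whole-signal FFT keeps the sketch short; real hearing
        aids filter frame by frame and add multiband compression."""
        spectrum = np.fft.rfft(signal)
        freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
        loss = np.interp(freqs, audiogram_hz, loss_db)        # dB loss per bin
        gain_db = np.minimum(half_gain * loss, max_gain_db)   # cap the boost
        spectrum *= 10.0 ** (gain_db / 20.0)
        return np.fft.irfft(spectrum, n=len(signal))

    # Example audiogram: mild low-frequency loss, ~60 dB loss above 3 kHz
    hz   = [250, 500, 1000, 2000, 3000, 4000, 8000]
    loss = [10,  10,  15,   25,   60,   60,   60]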

Speech Synthesis

Intelligibility without undue delay is a goal for speech synthesis. To achieve that goal, speeded speech is desirable, without shifting the fundamental pitch frequency proportionately.
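
Waveform-similarity overlap-add (WSOLA) is one published family of algorithms that changes duration without changing pitch: each output frame is copied from near its nominal input position, chosen by correlation so that successive frames overlap-add smoothly. A minimal sketch, with parameter values that are my own illustrative choices rather than any implementation's:

    import numpy as np

    def wsola_stretch(x, speed, frame=1024, hop=256, search=256):
        """Change duration by `speed` (>1 = faster) without changing pitch.
        For each output frame, search near the nominal input position for
        the segment most similar to the natural continuation of the last
        copied segment, then window and overlap-add it."""
        win = np.hanning(frame)
        out_len = int(len(x) / speed)
        y = np.zeros(out_len + frame)
        wsum = np.zeros_like(y)
        pos = 0                               # input position of last frame
        for out in range(0, out_len, hop):
            template = x[pos + hop : pos + hop + frame]  # natural continuation
            if len(template) < frame:
                break
            center = int(out * speed)                    # nominal input position
            lo = max(0, center - search)
            hi = min(len(x) - frame, center + search)
            if hi <= lo:
                break
            # coarse correlation search (step 16) keeps the sketch fast
            scores = [float(np.dot(template, x[p:p + frame]))
                      for p in range(lo, hi, 16)]
            pos = lo + 16 * int(np.argmax(scores))
            y[out:out + frame] += win * x[pos:pos + frame]
            wsum[out:out + frame] += win
        return y / np.maximum(wsum, 1e-3)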

Do any of the non-pitch-shifting algorithms for generating speeded speech accept the audiogram of the user, to selectively condition the processing of phonemes? For example, stretching, rather than squeezing, the plosives and fricatives (which have high-frequency components) might lower the frequency components used for recognition.

An algorithm to achieve this could adjust to the high-frequency cutoff of the user. This seems important to intelligibility, and possibly to increasing the maximum speedup that preserves intelligibility.
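
A hypothetical duration schedule for that idea might look like the sketch below; the phoneme classes, constants, and cutoff model are my assumptions, not an existing algorithm:

    FRICATIVES = {"s", "sh", "f", "th", "z"}
    PLOSIVES   = {"p", "t", "k", "b", "d", "g"}

    def stretch_factor(phoneme, listener_cutoff_hz, speed=1.5):
        """Per-phoneme duration multiplier: stretch high-frequency consonants
        more as the listener's cutoff drops; let vowels, nasals, and silence
        carry the overall speedup."""
        if phoneme in FRICATIVES or phoneme in PLOSIVES:
            return max(1.0, 4000.0 / max(listener_cutoff_hz, 500.0))
        return 1.0 / speed

    # For a listener with a 3 kHz cutoff, "s" lasts ~1.33x longer, while
    # vowels shrink to 1/1.5 of their nominal synthesized duration.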

I am interested in how this potential affects listeners to digital talking books, where many blind listeners also have high-frequency hearing loss. Reading at the highest linear rate that remains understandable, without pitch shift, would seem the preferred way to listen, presuming that rapid slowdown, backup, replay, and rate adjustment are easy.

Current DAISY digital talking book players do permit dynamic rate control; its use is to speed up listening to an existing narrated audio source.

I realize that normal speech speedup primarily reduces the duration of vowels and inter-word silence intervals. I am unaware of any work on stretching plosives and fricatives to lower their high-frequency components.

A text-to-speech synthesis algorithm exploiting the user's audiogram might allow even faster listening. Do you know of any algorithms that do that? How complex are they? Are they included in any commercial systems that you are aware of?

I believe that users of synthetic voices with high-frequency hearing loss generally choose male voices, as those have lower fundamental frequencies. There is a trade-off with speedup.

The lower fundamental takes longer to start, which works against speedup. (For confirmation of that onset phenomenon: a skillful organist playing a 32' rank on the pedals sounds it slightly in advance, so that it is voiced at the same time as the faster-onset pipes of the smaller ranks played on the keyboards.)

Would that technique of adjusting the onset time depending on pitch help in text-to-speech synthesis?
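
As a toy illustration of that organist's trick transplanted to synthesis (the linear lead-time model and its constants are assumptions, not measurements):

    import math

    def onset_lead_ms(f0_hz, ref_hz=200.0, ms_per_octave=15.0):
        """Extra lead time for voices whose fundamental lies below ref_hz."""
        octaves_below = max(0.0, math.log2(ref_hz / f0_hz))
        return ms_per_octave * octaves_below

    # A 100 Hz male fundamental would start ~15 ms early relative to 200 Hz.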

Vowel Epenthesis

I discuss my proposed use of vowel epenthesis in Audiogram-Aided Speech Synthesis on a separate page.

The Forest of Rhetoric, by Dr. Gideon O. Burton, gives this definition of epenthesis:

Epenthesis
"The addition of a letter, sound, or syllable to the middle of a word. A kind of metaplasm."

Note: Epenthesis is sometimes employed in order to accommodate meter in verse; sometimes, to facilitate easier articulation of a word's sound. It can, of course, be accidental, and a vice of speech.

See that reference for examples and hundreds of definitions of other rhetorical terms.

Singing

The noted choral conductor Robert Shaw frequently used vowel epenthesis, sometimes applying it to the ends of words as well. He taught singers to append a dwelling vowel phoneme to plosives, to make such plosives (and fricatives) in words stand out. For example, "breadth" might be sung as "bareadatha." My spelling only approximates the extended phonemes, which reduce the consonants' abrupt 'plosiveness'. This seems to improve intelligibility.

That helps explain Shaw's special reputation for preparing great choral performances by many choral groups.

I have a selfish reason to explore this: I have about 60 dB of hearing loss in each ear above about 3 kHz. I use high-frequency boost equalization settings for help on audio systems.

If the available bandwidth or the speech-synthesis processing has already eliminated the high-frequency components, as on the telephone, there is nothing remaining to boost. So the approach I suggest may let the synthesis bring those otherwise lost high-frequency components of normal speech back into the audible bandwidth.
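
Frequency lowering, a technique used in some hearing aids, does essentially this: it remaps energy above a knee frequency so that it lands below the listener's cutoff. A crude single-frame sketch, with illustrative constants of my own choosing (a real system would work frame by frame):

    import numpy as np

    def lower_highs(signal, sample_rate, knee_hz=2000.0, cutoff_hz=3000.0):
        """Linearly compress the band above knee_hz so that components up
        to the Nyquist frequency land below the listener's cutoff_hz."""
        spec = np.fft.rfft(signal)
        n = len(signal)
        nyquist = sample_rate / 2.0
        ratio = (cutoff_hz - knee_hz) / (nyquist - knee_hz)  # compression slope
        out = np.zeros_like(spec)
        for i, f in enumerate(np.fft.rfftfreq(n, 1.0 / sample_rate)):
            target = f if f <= knee_hz else knee_hz + (f - knee_hz) * ratio
            j = int(round(target * n / sample_rate))         # remapped bin
            if j < len(out):
                out[j] += spec[i]
        return np.fft.irfft(out, n=n)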

Some Issues to be Resolved

  1. If the augmented text-to-speech becomes comfortable for the listener, who may adapt to it and come to depend on it, then its absence in normal human speech might become an obstacle to understanding and dialog.
  2. A user with hearing aids might supply the aided (corrected) audiograms as input to the text-to-speech generation.
  3. Synthesis using the Speech Synthesis Markup Language will only get mono sound. Which ear's corrected audiogram should be used? The better one? Should both ears get the same correction, or should different corrections be applied to deliver a stereo synthesized signal to the two ears? (A sketch of the stereo option follows this list.)
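
A sketch of the stereo option in issue 3, reusing the hypothetical audiogram_gain function from the earlier sketch (the per-ear handling is my assumption about how it could work, not anything SSML specifies):

    import numpy as np

    def binaural_correct(mono, sample_rate, freqs_hz,
                         left_loss_db, right_loss_db):
        """Apply each ear's correction to the mono synthesis and return a
        stereo pair, one channel per ear."""
        left  = audiogram_gain(mono, sample_rate, freqs_hz, left_loss_db)
        right = audiogram_gain(mono, sample_rate, freqs_hz, right_loss_db)
        return np.stack([left, right])   # shape (2, n): left and right channels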

References

Speech Synthesis Markup Language (SSML) Version 1.0, W3C Recommendation, 7 September 2004. http://www.w3.org/TR/speech-synthesis/

Harvey Bingham

Invited Expert
World Wide Web Consortium (W3C) -- Web Accessibility Initiative
Consultant on Digital Talking Books XML application design
Library of Congress NLS
National Information Standards Organization
DAISY Consortium