(lisppad speech)

Library (lisppad speech) provides a speech synthesis API which parses text and converts it into audible speech. The conversion is based on factors like the language, the voice, and a range of parameters which are all aggregated by speaker objects.

Speech synthesis

(speak text) (speak text speaker)

Speaks the given string text using the speaker object speaker, which provides all speech synthesis parameters. If speaker is not provided, the value of parameter object current-speaker is used.
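The two call forms can be sketched as follows. This is a minimal example, assuming the library is importable under the name in this section's title, that audio output is available, and that current-speaker is not #f when the one-argument form is used. The name narrator is only illustrative.

```scheme
(import (lispkit base)
        (lisppad speech))

;; Create a speaker that uses the default system voice.
(define narrator (make-speaker))

;; Two-argument form: the speaker is passed explicitly.
(speak "This sentence uses an explicit speaker." narrator)

;; One-argument form: falls back to the value of `current-speaker`
;; (assumed here to hold a speaker rather than #f).
(speak "This sentence uses the current speaker.")
```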

(phonemes text) (phonemes text speaker)

Converts the given natural language string text into a string of phonemes using the given speaker. If speaker is not provided, the value of parameter object current-speaker is used.

Speakers can be configured to speak phonemes instead of natural language via procedure set-speaker-interpret-phonemes!.
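A rough sketch of the round trip from text to phonemes to speech, using only procedures documented on this page. The names text-speaker and phoneme-speaker are illustrative. A separate speaker is used for phoneme playback so that the flag never has to be toggled while speech may still be in progress.

```scheme
(import (lispkit base)
        (lisppad speech))

;; Speaker used to derive the phoneme string from natural language.
(define text-speaker (make-speaker))

;; Convert natural language text into a string of phonemes.
(define ph (phonemes "Hello world" text-speaker))
(display ph)
(newline)

;; A second speaker, configured to interpret its input as phonemes.
(define phoneme-speaker (make-speaker))
(set-speaker-interpret-phonemes! #t phoneme-speaker)

;; Speak the phoneme string rather than the original text.
(speak ph phoneme-speaker)
```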

Speakers

A speaker is an object defining speech synthesis parameters. There is a current speaker which is used by default, unless a speaker is explicitly specified for the various procedures that require a speaker parameter.

A speaker object has the following components:

  • an immutable voice,

  • a mutable speaking rate,

  • a mutable speaking volume,

  • a flag determining whether the speaker interprets text or phonemes,

  • a flag determining how numbers are interpreted, as well as

  • a speaking pitch.

current-speaker

Defines the current speaker, which is used as a default by all functions for which the speaker argument is optional. If there is no current speaker, this parameter is set to #f.
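Since current-speaker is described as a parameter object, it should be possible to rebind it temporarily with parameterize. The following sketch assumes it behaves like a standard R7RS parameter. The name quiet is illustrative.

```scheme
(import (lispkit base)
        (lisppad speech))

;; A speaker with reduced volume.
(define quiet (make-speaker))
(set-speaker-volume! 0.3 quiet)

;; Within the body, procedures with an optional speaker argument
;; use `quiet` by default.
(parameterize ((current-speaker quiet))
  (speak "Spoken at reduced volume.")
  (display (speaker-volume))
  (newline))
```

Outside the parameterize form, current-speaker has its previous value again.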

(speaker? obj)

Returns #t if obj is a speaker object; otherwise #f is returned.

(make-speaker) (make-speaker voice)

Returns a new speaker for the given voice. If voice is not provided, a default voice, specified at the operating system level, is used. Speakers are stateful objects which can be configured with a number of procedures: set-speaker-rate!, set-speaker-volume!, set-speaker-interpret-phonemes!, set-speaker-interpret-numbers!, and set-speaker-pitch!.
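As an illustration, the sketch below creates one speaker for the default voice and one for the first English voice reported by available-voices, falling back to the default speaker if no English voice is installed. The variable names are illustrative.

```scheme
(import (lispkit base)
        (lisppad speech))

;; Speaker for the default voice of the operating system.
(define default-speaker (make-speaker))

;; Voice identifiers (symbols) for English voices; may be empty.
(define english-voices (available-voices 'en))

;; Use the first English voice if there is one.
(define english-speaker
  (if (pair? english-voices)
      (make-speaker (car english-voices))
      default-speaker))

(display (speaker-voice english-speaker))
(newline)
```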

(speaker-voice) (speaker-voice speaker)

Returns the voice of speaker. If speaker is not provided, the parameter object current-speaker is used.

(speaker-rate) (speaker-rate speaker)

Returns the speaking rate of speaker. If speaker is not provided, the parameter object current-speaker is used.

(set-speaker-rate! rate) (set-speaker-rate! rate speaker)

Sets the speaking rate of speaker to number rate. If speaker is not provided, the parameter object current-speaker is used.

(speaker-volume) (speaker-volume speaker)

Returns the volume of speaker as a flonum ranging from 0.0 to 1.0. If speaker is not provided, the parameter object current-speaker is used.

(set-speaker-volume! volume) (set-speaker-volume! volume speaker)

Sets the volume of speaker to number volume which is a flonum between 0.0 and 1.0. If speaker is not provided, the parameter object current-speaker is used.
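A small sketch of adjusting rate and volume. The valid range of the speaking rate is not stated here, so the example only scales the current rate instead of assuming absolute values. The factor 0.8 is an arbitrary choice.

```scheme
(import (lispkit base)
        (lisppad speech))

(define sp (make-speaker))

;; Slow the speaker down relative to its current rate.
(set-speaker-rate! (* 0.8 (speaker-rate sp)) sp)

;; Volume is a flonum between 0.0 and 1.0.
(set-speaker-volume! 0.5 sp)

;; Show the resulting settings.
(display (list (speaker-rate sp) (speaker-volume sp)))
(newline)
```

Note the argument order: the new value comes first, the optional speaker second.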

(speaker-interpret-phonemes) (speaker-interpret-phonemes speaker)

Returns #t if speaker interprets phonemes instead of natural language text. If speaker is not provided, the parameter object current-speaker is used.

(set-speaker-interpret-phonemes! phoneme?) (set-speaker-interpret-phonemes! phoneme? speaker)

If boolean argument phoneme? is #f, speaker is configured to interpret natural language. If phoneme? is set to any other value, speaker is configured to interpret phonemes instead. If speaker is not provided, the parameter object current-speaker is used.

(speaker-interpret-numbers) (speaker-interpret-numbers speaker)

Returns #t if speaker interprets numbers as a natural language speaker would do ("100" is spoken as "hundred"). If it returns #f, speaker decomposes numbers into a sequence of digits and speaks them individually ("100" is spoken as "one zero zero"). If speaker is not provided, the parameter object current-speaker is used.

(set-speaker-interpret-numbers! natural?) (set-speaker-interpret-numbers! natural? speaker)

Sets the number interpretation of speaker to boolean natural?. If natural? is #t, speaker interprets numbers as a natural language speaker would do ("100" is spoken as "hundred"). If natural? is #f, speaker decomposes numbers into a sequence of digits and speaks them individually ("100" is spoken as "one zero zero"). If speaker is not provided, the parameter object current-speaker is used.
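For instance, a speaker that reads out codes digit by digit could be set up as in the following sketch. The name digit-speaker and the spoken sentence are illustrative.

```scheme
(import (lispkit base)
        (lisppad speech))

(define digit-speaker (make-speaker))

;; #f: numbers are decomposed into individual digits.
(set-speaker-interpret-numbers! #f digit-speaker)

;; "100" should now be spoken as "one zero zero".
(speak "Your code is 100" digit-speaker)

;; Inspect the flag.
(display (speaker-interpret-numbers digit-speaker))
(newline)
```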

(speaker-pitch) (speaker-pitch speaker)

Returns the pitch of speaker as a pair of two flonums: the car is the base of the pitch, and the cdr is the modulation of the pitch. If speaker is not provided, the parameter object current-speaker is used.

(set-speaker-pitch! pitch) (set-speaker-pitch! pitch speaker)

Sets the pitch of speaker to the pair of flonums pitch whose car is the base of the pitch, and the cdr is the modulation of the pitch. If speaker is not provided, the parameter object current-speaker is used.
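Since the valid ranges for pitch base and modulation are not given here, the sketch below derives the new pitch from the current one. The factor 1.1 is arbitrary and assumed to stay within whatever range the voice accepts.

```scheme
(import (lispkit base)
        (lisppad speech))

(define sp (make-speaker))

;; Current pitch: (base . modulation), both flonums.
(define p (speaker-pitch sp))

;; Raise the base slightly and keep the modulation unchanged.
(set-speaker-pitch! (cons (* 1.1 (car p)) (cdr p)) sp)

(display (speaker-pitch sp))
(newline)
```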

Voices

Voices are provided by the operating system, and library (lisppad speech) does not represent them explicitly as objects. Instead, symbols are used as identifiers for voices. For example, com.apple.speech.synthesis.voice.Alex refers to the default US voice.

A voice has the following characteristics:

  • Name (string)

  • Age (fixnum)

  • Gender (male or female)

  • Locale (symbol, e.g. en_US)

Library (lispkit system) provides means to handle locales, including language and country codes.

(voice) (voice name) (voice id)

Returns a symbol identifying the voice specified by the arguments of voice. If no argument is provided, an identifier for the default voice is returned. If a name string is provided, then an identifier for a voice whose name is name is returned, or #f if no such voice exists. If an id symbol is provided, then an identifier for a voice whose identifier matches id is returned, or #f if no such voice exists.
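The three call forms might be used as sketched below. The voice name "Alex" is an assumption based on the identifier mentioned above. On systems without that voice, the lookups return #f and the example falls back to the default voice.

```scheme
(import (lispkit base)
        (lisppad speech))

;; Identifier of the default voice.
(define default-voice (voice))

;; Lookup by name (string); #f if no voice has this name.
(define by-name (voice "Alex"))

;; Lookup by identifier (symbol); #f if no such voice exists.
(define by-id (voice 'com.apple.speech.synthesis.voice.Alex))

;; Prefer the named voice, otherwise use the default voice.
(define sp (make-speaker (or by-name default-voice)))
```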

(available-voices) (available-voices lang) (available-voices lang gender)

Returns a list of symbols identifying voices matching the given language filter lang and gender filter gender. Both lang and gender are symbols. lang should be either a language or a locale identifier. It can also be set to #f if only a gender filter is needed. gender should be either the symbol male or the symbol female.

(available-voices 'en)
⇒ (com.apple.speech.synthesis.voice.Alex com.apple.speech.synthesis.voice.daniel com.apple.speech.synthesis.voice.fiona com.apple.speech.synthesis.voice.Fred com.apple.speech.synthesis.voice.karen com.apple.speech.synthesis.voice.moira com.apple.speech.synthesis.voice.rishi com.apple.speech.synthesis.voice.samantha com.apple.speech.synthesis.voice.tessa com.apple.speech.synthesis.voice.veena)
(available-voices (locale "en" "GB"))
⇒ (com.apple.speech.synthesis.voice.daniel)
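The gender filter can be combined with a language filter or used on its own by passing #f for lang, as in this sketch. The results depend on the voices installed on the system, so no output is shown.

```scheme
(import (lispkit base)
        (lisppad speech))

;; All female voices, regardless of language.
(define female-voices (available-voices #f 'female))

;; Female voices for English only.
(define female-english (available-voices 'en 'female))

(display (length female-voices))
(newline)
(display female-english)
(newline)
```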

(available-voice? obj)

Returns #t if obj is a symbol identifying an available voice, otherwise #f is returned. This procedure fails if obj is neither a symbol nor the value #f.
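Because available-voice? fails for arguments that are neither symbols nor #f, a caller may want to guard it with symbol?. The helper speaker-for below is hypothetical and not part of the library.

```scheme
(import (lispkit base)
        (lisppad speech))

;; Hypothetical helper: returns a speaker for the voice `id` if it
;; identifies an available voice, otherwise a speaker for the
;; default voice.
(define (speaker-for id)
  (if (and (symbol? id) (available-voice? id))
      (make-speaker id)
      (make-speaker)))

(define sp (speaker-for 'com.apple.speech.synthesis.voice.Alex))
```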

(voice-name voice)

Returns the name of the voice identified by symbol voice.

(voice-age voice)

Returns the age of the voice identified by symbol voice.

(voice-gender voice)

Returns the gender of the voice identified by symbol voice.

(voice-locale voice)

Returns the locale of the voice identified by symbol voice.
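Taken together, the four accessors can be used to print an overview of the installed voices. The sketch below lists all English voices. The procedure name describe-voice is illustrative.

```scheme
(import (lispkit base)
        (lisppad speech))

;; Print name, age, gender, and locale of a voice on one line.
(define (describe-voice v)
  (display (voice-name v))
  (display ", age ")
  (display (voice-age v))
  (display ", ")
  (display (voice-gender v))
  (display ", ")
  (display (voice-locale v))
  (newline))

(for-each describe-voice (available-voices 'en))
```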
