Type
SSML item
Available from
Any level tab in the Prompt File Editor. For more information about the Prompt File Editor, see Using the Prompt File Editor.
Purpose
The Prosody item is a Speech Synthesis Markup Language (SSML) element that makes it possible to control various aspects of Text-to-Speech (TTS) synthesis. With this element, you can get more natural-sounding speech from the TTS engine.
Behavior
Based on the properties you set, the Prosody item alters the rendering of TTS speech synthesis. All properties are optional, but if you use the Prosody item, you must use at least one property. The Prosody item has six basic properties:
Note:
The specific effects of many of these properties might vary from one TTS engine
to another.
- Pitch - This property controls the baseline pitch of the speech output. Increasing this property raises the baseline pitch. Decreasing this property lowers the baseline pitch.
- Range - This property controls the pitch range, or variability, of the speech output. Increasing this property increases the range of pitches the TTS engine produces. Decreasing this property decreases the pitch range.
- Rate - This property controls the speaking rate at which the TTS engine produces the output. Increasing this property speeds up the synthesized speech. Decreasing this property slows down the synthesized speech.
- Duration - This property is the desired amount of time it takes the TTS engine to read the contents of the TTS element. The value is stated in seconds (s) or milliseconds (ms). The text to be spoken is either compressed or expanded to fit into the duration you select. For example, if the text would normally take five seconds to speak and you set this property to four seconds, the system increases the rate so that the reading of the text fits into four seconds.
- The Duration property takes precedence over the Rate property.
- Volume - This property controls the volume, or loudness, of the speech output. Increasing this property makes the TTS output louder. Decreasing this property makes the TTS output quieter.
- Contour - This property sets the actual pitch contour for the speech output. This contour is accomplished by means of a series of percent/pitch value pairs. For details, see the "Properties" section.
- The Contour property takes precedence over both the Pitch and Range properties.
For additional details on the properties and how they behave, see the next section, "Properties."
Note:
The Prosody item and its properties function correctly only with SSML-compliant
speech synthesis engines. The Microsoft Speech SDK, which is used by Dialog
Designer during application simulation, is not an SSML-compliant speech
synthesis engine, so any settings you make with this item are ignored. For more
information about the SSML standard, see the Speech Synthesis Markup
Language (SSML) Version 1.0 W3C Recommendation.
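As a rough illustration of the markup that these properties correspond to, the following Python sketch builds a prosody element using the attribute names defined in SSML 1.0 (pitch, range, rate, duration, volume, contour). The helper build_prosody is hypothetical and is not part of Dialog Designer. The markup that Dialog Designer itself generates might differ.

```python
import xml.etree.ElementTree as ET

# Attribute names of the SSML 1.0 prosody element.
ALLOWED = {"pitch", "range", "rate", "duration", "volume", "contour"}

def build_prosody(text, **attrs):
    """Hypothetical helper: wrap text in an SSML prosody element."""
    unknown = set(attrs) - ALLOWED
    if unknown:
        raise ValueError(f"unknown prosody attributes: {sorted(unknown)}")
    if not attrs:
        # Mirrors the rule above: at least one property must be set.
        raise ValueError("at least one prosody attribute is required")
    element = ET.Element("prosody", attrs)
    element.text = text
    return ET.tostring(element, encoding="unicode")

print(build_prosody("Good morning.", pitch="high", rate="slow"))
```

Because the Microsoft Speech SDK ignores these settings during simulation, markup like this only has an audible effect on an SSML-compliant engine.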
Properties
All properties are optional, but if you use the Prosody item, you must use at least one property. Note that all units, such as Hz and st, are case-sensitive.
- Pitch - Select one of the following settings:
| Pitch Setting | Description |
|---|---|
| default | This setting has no effect on the output. This setting uses the default baseline pitch of the TTS server. |
| x-low, low, medium, high, x-high | These settings represent a range of pitch options. The specific application of these properties varies according to the TTS server. |
| custom | With this setting, you can define the baseline pitch that you want, as a modification of the TTS server default. When you select this setting, Dialog Designer automatically adds the Custom Pitch [Hz or st] property to the Property view. |
| Custom Pitch [Hz or st] | This setting is available only when the custom setting for Pitch is selected. With it, you can fine tune the baseline pitch by raising or lowering the pitch relative to the server default. To raise or lower the pitch based on frequency, enter a number followed by Hz. For example, if you set this to +8000Hz the system raises the baseline pitch by 8000 cycles per second. If you set this to -500Hz, the system lowers the pitch by 500 cycles per second. To raise or lower the pitch by semitones, enter a positive or negative number followed by st. Each whole number represents a semitone on the diatonic scale. Positive numbers raise the pitch. Negative numbers lower the pitch. For example, if you enter +3st in this field, the baseline pitch is raised by three semitones. If you enter -1.5st, the baseline pitch is lowered by one and a half semitones. The correct format to enter these values is a positive (+) or negative (-) sign, followed by a number, followed by Hz or st, with no spaces. Numbers can be of the format "n", "n.", ".n" or "n.n", where n represents any sequence of one or more digits. |
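The format rule for Custom Pitch values can be checked mechanically. The following Python sketch is one possible reading of that rule: a sign, a number in one of the four listed forms, and a case-sensitive Hz or st unit, with no spaces. The names parse_custom_pitch and semitone_ratio are hypothetical. The semitone conversion assumes equal temperament, where one semitone is a frequency ratio of 2^(1/12); an individual TTS engine might interpret semitones differently.

```python
import re

# Sign, then n / n. / .n / n.n, then a case-sensitive unit, no spaces.
CUSTOM_PITCH = re.compile(r"([+-])(\d+\.?\d*|\.\d+)(Hz|st)")

def parse_custom_pitch(value):
    """Return (signed_amount, unit), or raise ValueError if malformed."""
    match = CUSTOM_PITCH.fullmatch(value)
    if match is None:
        raise ValueError(f"not a valid custom pitch: {value!r}")
    sign, number, unit = match.groups()
    amount = float(number)
    return (-amount if sign == "-" else amount), unit

def semitone_ratio(semitones):
    """Frequency ratio for a semitone shift (assumes equal temperament)."""
    return 2 ** (semitones / 12)

print(parse_custom_pitch("+3st"))
print(parse_custom_pitch("-500Hz"))
print(round(semitone_ratio(3), 3))
```

Under this assumption, +3st raises the baseline frequency by roughly 19 percent, and +12st doubles it. The same pattern applies to Custom Range values, which use the same format.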
- Range - Select one of the following settings:
| Range Setting | Description |
|---|---|
| default | This setting has no effect on the output. This setting uses the default pitch range of the TTS server. |
| x-low, low, medium, high, x-high | These settings represent a range of pitch range options. The specific application of these properties varies according to the TTS server. |
| custom | With this setting, you can define the pitch range that you want, as a modification of the TTS server default. When you select this setting, Dialog Designer automatically adds the Custom Range [Hz or st] property to the Property view. |
| Custom Range [Hz or st] | This setting is available only when the custom setting for Range is selected. With it, you can fine tune the pitch range by increasing or decreasing the range relative to the server default. To increase or decrease the pitch range based on frequency, enter a number followed by Hz. For example, if you set this to +8000Hz the system increases the pitch range by 8000 cycles per second. If you set this to -500Hz, the system decreases the pitch range by 500 cycles per second. To increase or decrease the pitch range by semitones, enter a positive or negative number followed by st. Each whole number represents a semitone on the diatonic scale. Positive numbers increase the pitch range. Negative numbers decrease the pitch range. For example, if you enter +3st in this field, the pitch range is increased by three semitones. If you enter -1.5st, the pitch range is decreased by one and a half semitones. The correct format to enter these values is a positive (+) or negative (-) sign, followed by a number, followed by Hz or st, with no spaces. Numbers can be of the format "n", "n.", ".n" or "n.n", where n represents any sequence of one or more digits. |
- Rate - Select one of the following settings:
| Rate Setting | Description |
|---|---|
| default | This setting has no effect on the output. This setting uses the default speaking rate of the TTS server. |
| x-slow, slow, medium, fast, x-fast | These settings represent a range of speaking rate options. The specific application of these properties varies according to the TTS server. |
| custom | With this setting, you can define the speaking rate that you want, as a modification of the TTS server default. When you select this setting, Dialog Designer automatically adds the Custom Rate [positive float] property to the Property view. |
| Custom Rate [positive float] | This setting is available only when the custom setting for Rate is selected. With it, you can fine tune the speaking rate by increasing or decreasing the rate relative to the server default. The number you enter in this field acts as a multiplier on the default rate. For example, a setting of 1 or 100% in this field means there is no change to the default rate. A setting of 2 or 200% in this field makes the speaking rate twice as fast as the default rate. A setting of 0.5 or 50% in this field makes the speaking rate half as fast as the default rate. The correct format to enter these values is "n", "n.", ".n" or "n.n", where n represents any sequence of one or more digits. |
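Because the Custom Rate value acts as a multiplier, its effect on speaking time is simple arithmetic: the time to speak a phrase is the default time divided by the rate. The following Python sketch shows this under stated assumptions. The names parse_custom_rate and scaled_duration are hypothetical, and the check covers only the plain-number format ("n", "n.", ".n", "n.n") given above, not the percent forms mentioned in the examples.

```python
import re

# n, n., .n, or n.n, where n is one or more digits.
RATE_NUMBER = re.compile(r"\d+\.?\d*|\.\d+")

def parse_custom_rate(value):
    """Return the rate multiplier, or raise ValueError if malformed."""
    if RATE_NUMBER.fullmatch(value) is None:
        raise ValueError(f"not a valid custom rate: {value!r}")
    rate = float(value)
    if rate <= 0:
        # The property is labeled "positive float", so zero is rejected.
        raise ValueError("custom rate must be greater than zero")
    return rate

def scaled_duration(default_seconds, rate):
    """Estimated speaking time after applying a rate multiplier."""
    return default_seconds / rate

# A phrase that takes 5 seconds at the default rate:
print(scaled_duration(5.0, parse_custom_rate("2")))    # twice as fast
print(scaled_duration(5.0, parse_custom_rate("0.5")))  # half as fast
```

The actual speaking time depends on the TTS engine, so these figures are estimates only.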
- Duration - Select one of the following settings:
| Duration Setting | Description |
|---|---|
| 250ms, 500ms, 750ms, 1s, 2s, 3s, 4s, 5s | These settings represent a range of duration options in milliseconds (ms) or seconds (s). This setting overrides any Rate setting you might have. |
| custom | With this setting, you can define the exact duration that you want. When you select this setting, Dialog Designer automatically adds the Custom Duration [s or ms] property to the Property view. |
| Custom Duration [s or ms] | This setting is available only when the custom setting for Duration is selected. With it, you can set the exact duration for the text to be spoken. The correct format to enter these values is "n", "n.", ".n" or "n.n", where n represents any sequence of one or more digits. The number must be followed, with no space in between, by s for seconds or ms for milliseconds. |
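The relationship between Duration and Rate can be worked through numerically. The following Python sketch parses a duration value in the format described above and computes the rate multiplier that the duration implies for a phrase of known natural length. The names parse_duration_seconds and implied_rate are hypothetical, and the natural speaking time is an assumed input, since only the TTS engine knows the real value.

```python
import re

# n, n., .n, or n.n followed directly by a case-sensitive s or ms.
DURATION = re.compile(r"(\d+\.?\d*|\.\d+)(ms|s)")

def parse_duration_seconds(value):
    """Convert a value such as '250ms' or '1.5s' to seconds."""
    match = DURATION.fullmatch(value)
    if match is None:
        raise ValueError(f"not a valid duration: {value!r}")
    number, unit = match.groups()
    amount = float(number)
    return amount / 1000 if unit == "ms" else amount

def implied_rate(natural_seconds, target):
    """Rate multiplier needed to fit the text into the target duration."""
    return natural_seconds / parse_duration_seconds(target)

# Text that normally takes five seconds, squeezed into four:
print(implied_rate(5.0, "4s"))
```

In the five-second example from the "Behavior" section, fitting the text into four seconds corresponds to a rate of 1.25 times the default.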
- Volume - Select one of the following settings:
Note:
The Volume property uses a range of 0.0 (silent) to 100.0 (full volume).
| Volume Setting | Description |
|---|---|
| default | This setting is the same as setting the volume to 100.0, or full volume. |
| silent | This setting is the same as setting the volume to 0.0. |
| x-soft, soft, medium, loud, x-loud | These settings represent a range of volume options. The specific application of these properties varies according to the TTS server. |
| custom | With this setting, you can define the exact volume that you want. When you select this setting, Dialog Designer automatically adds the Custom Volume [float] property to the Property view. |
| Custom Volume [float] | This setting is available only when the custom setting for Volume is selected. With it, you can set the exact volume you want for the text to be spoken. The correct format to enter these values is a positive (+) or negative (-) sign, followed by a number. Numbers can be of the format "n", "n.", ".n" or "n.n", where n represents any sequence of one or more digits. |
- Contour - Enter in this field the pitch contour settings you want to use for the text to be spoken.
- A pitch contour is defined by a set of time-position/target-pitch pairs. These pairs must be in the form: (time-position,target-pitch). The time-position value is a percentage of the total time period for the text to be spoken. The target-pitch value is the relative pitch for the text to be spoken at that point in time. The algorithm for interpolating the pitch from one target to the next is specific to each TTS engine.
- For example, suppose you want the TTS server to speak the phrase, "Good morning." In normal conversation, the relative pitch of the syllable "morn-" is higher than the word "Good." The pitch of the final syllable, "-ing," then, drops below the initial word "Good." So, to create a pitch contour for this phrase, you might create the following (time-position, target-pitch) pairs:
- (0%,+20Hz)(40%,+30%)(75%,-10Hz)
- In this example, the initial pitch for the phrase, and the word "Good," is set at 20 cycles per second above the baseline pitch. At a time position 40% of the way into the phrase, the pitch rises to 30% above where the phrase started. This takes place at about the point where the server speaks the syllable "morn-". The final syllable, "-ing," starts about 75% of the way into the phrase, in terms of time, and drops to 10 cycles per second below the baseline pitch.
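To make the structure of a contour value concrete, the following Python sketch splits a contour string into its (time-position, target-pitch) pairs. The function parse_contour is hypothetical. It assumes that each time position is a percentage, that each target pitch is an optionally signed number followed by Hz, st, or %, and that pairs can be separated by optional white space. A given TTS engine might accept a narrower or wider syntax.

```python
import re

NUMBER = r"(?:\d+\.?\d*|\.\d+)"
# One "(time%,target)" pair; the target keeps its sign and unit as text.
PAIR = re.compile(
    r"\(\s*(" + NUMBER + r")%\s*,\s*([+-]?" + NUMBER + r"(?:Hz|st|%))\s*\)"
)

def parse_contour(contour):
    """Return a list of (time_percent, target_pitch) tuples."""
    text = contour.strip()
    pairs = []
    position = 0
    while position < len(text):
        match = PAIR.match(text, position)
        if match is None:
            raise ValueError(f"malformed contour at offset {position}")
        pairs.append((float(match.group(1)), match.group(2)))
        position = match.end()
        # Skip any white space between consecutive pairs.
        while position < len(text) and text[position].isspace():
            position += 1
    if not pairs:
        raise ValueError("contour must contain at least one pair")
    return pairs

for time_percent, target in parse_contour("(0%,+20Hz)(40%,+30%)(75%,-10Hz)"):
    print(time_percent, target)
```

For the "Good morning" example, this yields three pairs at 0%, 40%, and 75% of the phrase. How the pitch moves between those points is left to the interpolation algorithm of the TTS engine.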