WholeWord speech recognition accuracy

Overview

The accuracy of WholeWord speech recognition depends not only on the recognition algorithms, but also on the models, grammars, DIPs, prompt structure, calling environment, user behavior, and the recognized data itself. Each of these factors can affect recognition accuracy positively or negatively. Because these factors vary from caller to caller, accuracy must be measured across the entire calling population. Therefore, any attempt to measure accuracy must use a statistically representative sample of the calling population.

Positive influences on WholeWord speech recognition accuracy

The items described below have a positive impact on WholeWord speech recognition accuracy.

Isolated word recognition

Isolated word recognition accuracy is very high. The fewer choices an isolated word recognition type offers, the better the accuracy. For example, "US English Digits 1 to 3" is more accurate than "US English Digits 1 to 5", which in turn is more accurate than "US English 1 digit (0-9 and 'oh')".

Fixed-length digit string

For connected-digit recognition, a fixed-length recognition type provides better accuracy than a variable-length recognition type. If possible, avoid the use of variable-length strings in WholeWord speech recognition applications.

Validation of data

Where possible, verify the recognized result against a database or a host field. This improves the overall accuracy of an application, especially when a longer string is input.

Reprompt

If the caller does not speak the keyword, and the system does not mistake extraneous words for a keyword, the system can reprompt the caller. When accuracy is measured for a WholeWord speech recognition application that includes a confirmation and reprompt step, the measured accuracy is higher.
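The reprompt behavior can be sketched in application code. The following Python example is illustrative only: `collect_keyword` and the `recognize` callable are hypothetical names, not part of any Avaya API. It assumes that `recognize` plays a prompt and returns the recognized keyword, or `None` when no keyword was recognized.

```python
from typing import Callable, Optional


def collect_keyword(
    recognize: Callable[[str], Optional[str]],
    prompt: str,
    reprompt: str,
    max_attempts: int = 3,
) -> Optional[str]:
    """Play a prompt, then reprompt until a keyword is recognized.

    `recognize` is a hypothetical stand-in for the call that plays a
    prompt and returns the recognized keyword, or None if no keyword
    was recognized. Returns None when all attempts are used up.
    """
    message = prompt
    for _ in range(max_attempts):
        result = recognize(message)
        if result is not None:
            return result
        # No keyword recognized: use the more explicit follow-up prompt.
        message = reprompt
    return None
```

Limiting the number of attempts is a design choice made for this sketch. An application would typically fall back to touchtone entry or an operator when `None` is returned.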

Prompt structure

The prompt structure can greatly affect accuracy. A well-structured prompt promotes a clearly articulated response, helps the caller barge in at the appropriate time (or, when barge-in is disabled, wait until the prompt is complete before talking), and provides consistent instructions on what the caller should say to get the desired result.

Menu prompts

For best results, build menu prompts with the following structure:

<desired result> <action required>

Examples:

"To hear your checking account balance, say 1."
"To hear your savings account balance, say 2."

Placing the required action at the end of the prompt reduces the chance that the caller forgets what to say while listening to the description of the desired result. In addition, if you want to encourage your callers to barge in when they hear their desired result, you can add a short pause after the action-required phrase.
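The menu prompt structure can be generated consistently from a list of options. The following Python sketch is an illustration, not an Avaya tool: the function names `menu_prompt` and `build_menu` are hypothetical, and the sketch assumes that options are numbered from 1 in the order given.

```python
from typing import List, Sequence


def menu_prompt(desired_result: str, action_required: str) -> str:
    """Build one prompt as '<desired result>, <action required>.'"""
    return f"To {desired_result}, {action_required}."


def build_menu(desired_results: Sequence[str]) -> List[str]:
    """Number the options from 1 and place the action at the end of each prompt."""
    return [
        menu_prompt(result, f"say {number}")
        for number, result in enumerate(desired_results, start=1)
    ]
```

Calling `build_menu(["hear your checking account balance", "hear your savings account balance"])` reproduces the two example prompts shown earlier in this section.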

Yes and no prompts

Structure yes and no prompts as yes-or-no questions. For example:

"Would you like to hear your order number again?"

If the caller does not respond to the prompt, the follow-up prompt could be as follows:

"Would you like to hear your order number again? Please say 'yes' or 'no'."

This wording is more natural than the following:

"To hear your order number again, say 'yes'. Otherwise, say 'no'."

To encourage the use of barge-in, add a pause of about 1.5 seconds after the question and before the instruction. For example:

"Would you like to hear your order again? (pause) Please say 'yes' or 'no'."

Calling experience and informative prompts

In an application where the calling population is closed and callers are experienced or trained to use the application, recognition accuracy improves.

Lengthy prompts that provide detailed instructions on how to respond may improve accuracy, but are generally unacceptable unless the application has infrequent users. Users who interact with system prompts infrequently (for example, once or twice a year) are more willing to listen to a lengthy prompt than those who do so frequently.

Custom grammars and DIPs

Custom grammars improve the recognizer's ability to "score" the candidate by selectively limiting the recognition possibilities. The recognizer assigns a score to each input based on how closely it matches the models for the selected grammar. Custom DIPs further process the recognition result using information that is unavailable to the recognizer.

Negative influences on WholeWord speech recognition accuracy

The items described below have a negative impact on WholeWord speech recognition accuracy.

Environment

A very noisy environment, such as an airport or train station, can cause recognition accuracy problems. In certain cases, speech data can be collected to build custom word models based on the noisy environment to improve recognition accuracy.

Extraneous words within responses

The system can sometimes misinterpret extra words that are spoken alongside the keyword if they have the same characteristics as the keyword.

Information type

Attempting to recognize data that is not normally spoken as the digits 0 through 9 adversely affects accuracy. For example, dollar amounts and days of the month are not usually spoken digit by digit. To speak the date December 15 using digits, the caller would have to say "1-2-1-5." Training callers to speak information in this format can increase application accuracy. However, if callers also attempt to speak natural numbers, such as "fifteen," speech recognition will not work.
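The digit format for dates can be illustrated with a short Python sketch. The function names `date_as_spoken_digits` and `parse_mmdd` are hypothetical. The sketch also assumes a fixed four-digit month-day layout with leading zeros (for example, "0-1-0-5" for January 5), which the text above does not specify. The layout was chosen here because fixed-length strings are recognized more accurately than variable-length strings.

```python
from typing import Tuple


def date_as_spoken_digits(month: int, day: int) -> str:
    """Return the digit-by-digit form a caller would speak, e.g. '1-2-1-5'."""
    if not 1 <= month <= 12:
        raise ValueError("month must be between 1 and 12")
    if not 1 <= day <= 31:
        raise ValueError("day must be between 1 and 31")
    # Assumption: zero-padded MMDD, so the string always has four digits.
    return "-".join(f"{month:02d}{day:02d}")


def parse_mmdd(digits: str) -> Tuple[int, int]:
    """Convert a recognized four-digit string back into (month, day).

    Performs range checks only; it does not check the number of days
    in a particular month.
    """
    if len(digits) != 4 or not (digits.isascii() and digits.isdigit()):
        raise ValueError("expected exactly four digits")
    month, day = int(digits[:2]), int(digits[2:])
    if not (1 <= month <= 12 and 1 <= day <= 31):
        raise ValueError("digits do not form a valid month and day")
    return month, day
```

Under these assumptions, `date_as_spoken_digits(12, 15)` returns `"1-2-1-5"`, and `parse_mmdd("1215")` returns `(12, 15)`.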

The Natural Number Speech Recognition package, available from Avaya, allows you to accept and use more natural caller responses, such as "December 15th" instead of "1-2-1-5" or "twenty-two dollars and thirty-seven cents" instead of "2-2-3-7". You can also use the Natural Language Speech Recognition offer to get this same kind of flexibility (for more information, see Using NLSR in voice applications).

Regional and national accents and dialects

Although WholeWord speech recognition is based on thousands of speech samples per word, the system can still misinterpret strong regional or national accents or dialects.

Connected-digit string length

Connected-digit string recognition can be thought of as a sequence of single-digit recognitions performed as one operation. For example, assume that the per-digit accuracy is X, expressed as a fraction between 0 and 1, so that a string of one digit is correct with probability X. If each digit is recognized independently, the model is exponential in the string length: for a string of n digits, the overall expected accuracy is X^n. Therefore, a two-digit string has an overall expected accuracy of X^2 and a 10-digit string has an overall expected accuracy of X^10. As a result, string accuracies are affected by the length of the string. Shorter strings are more accurate than longer strings. In addition, individual digit accuracies, as well as overall string accuracies, vary according to the language and noise conditions of different national networks.
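The effect of string length can be made concrete with a small calculation. The following Python sketch uses a hypothetical function name and assumes that every digit is recognized independently with the same accuracy. The per-digit value of 0.98 mentioned afterward is an illustrative number, not a measured figure for any recognizer.

```python
def expected_string_accuracy(per_digit_accuracy: float, length: int) -> float:
    """Expected probability that every digit in a string is recognized correctly.

    Assumes each digit is recognized independently with the same
    accuracy, given as a fraction between 0 and 1.
    """
    if not 0.0 <= per_digit_accuracy <= 1.0:
        raise ValueError("per_digit_accuracy must be between 0 and 1")
    if length < 1:
        raise ValueError("length must be at least 1")
    return per_digit_accuracy ** length
```

With an illustrative per-digit accuracy of 0.98, a two-digit string has an expected accuracy of about 0.960, and a 10-digit string about 0.817.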

Connected-digit string accuracy can be maximized in various ways:

Accuracy is always better for shorter strings than for longer strings.
Fixed-length strings are more accurate than variable-length strings because the recognizer knows how many digits to expect.
With custom programming, it is possible to further improve the accuracy of an application by having the recognizer return a list of possible strings. When these can be validated against external information, for example by comparing potential account number strings against a database of valid account numbers, the correct string can frequently be chosen.
The recognizer can also be given a custom digit string grammar that can guide the recognizer when the digit string must conform to specific digit sequence rules. To obtain custom grammars, contact your Avaya representative.
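Validating a list of candidate strings against external information can be sketched as follows. This Python example is an assumption-laden illustration: `pick_valid_candidate` is a hypothetical name, the candidate list is assumed to arrive ordered from best score to worst, and an in-memory set stands in for the database of valid account numbers.

```python
from typing import Iterable, Optional, Set


def pick_valid_candidate(
    candidates: Iterable[str],
    valid_numbers: Set[str],
) -> Optional[str]:
    """Return the best-scoring candidate that is a known account number.

    `candidates` is assumed to be ordered from best score to worst.
    Returns None when no candidate matches, in which case the
    application might reprompt the caller.
    """
    for candidate in candidates:
        if candidate in valid_numbers:
            return candidate
    return None
```

For example, with candidates `["12845", "12345", "72345"]` and valid numbers `{"12345", "99999"}`, the function returns `"12345"` even though it was not the top-scoring candidate.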

For a WholeWord speech recognition string of digits, the per-digit accuracy is comparable to isolated word recognition. However, the accuracy of the whole string is lower than the per-digit accuracy, and steadily decreases as more digits are added.

Application-related limitations

The capability of the system and WholeWord speech recognition is application-dependent. If the system is under-engineered for a particular application, it may not perform satisfactorily.

Specific application-related factors that affect the number of supported WholeWord speech recognition channels include:

The percentage of time spent recognizing speech input
The percentage of callers who use touchtone entries, which require fewer hardware and software resources
The number of simultaneous speech recognition calls expected
The use of barge-in with WholeWord speech recognition, which increases the hardware and software resources required to process each transaction