Cloud STT

Use the Cloud STT node to convert a caller's audio or DTMF responses to text in real time using Google Cloud Speech-to-Text.

Name

Give the node a name.

Text-to-Speech

Specify each of the prompts to play to the caller. A sketch of the order in which these prompts play follows the table.

The Add Pause, Volume, Pitch, Emphasis, Say As, and Rate menus above the TTS prompt apply SSML parameters. See Speech Synthesis Markup Language (SSML).
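The menus correspond to standard SSML elements such as break, prosody, emphasis, and say-as. As a rough, hand-written sketch of the kind of markup these menus produce (the exact SSML Studio generates may differ, and the prompt wording here is an example only):

```python
# A hand-written sketch of the SSML that the prompt menus correspond to.
# Studio generates this markup from your menu selections; the exact output
# may differ. The prompt wording is an example only.
ssml_prompt = (
    "<speak>"
    '<emphasis level="moderate">Welcome.</emphasis>'
    '<break time="500ms"/>'
    '<prosody rate="slow" pitch="+2st" volume="loud">'
    "Please say the reason for your call."
    "</prosody>"
    '<break time="300ms"/>'
    'Your reference is <say-as interpret-as="characters">AB12</say-as>.'
    "</speak>"
)
print(ssml_prompt)
```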

Prompt Field Name

Description

Prompt

The caller hears this prompt first. The prompt tells the caller what to do.

Fallback Prompt

The caller hears the fallback prompt when the system fails to hear the caller's response, or the caller's response is not understood. The fallback prompt should provide the caller with greater context on how to respond to the prompt.

An example fallback prompt: This is a natural language system. Just say what you would like to do. For example, say send me to the billing department or I want to speak to a representative.

No Input Prompt

The caller hears the no input prompt, followed by the fallback prompt, when the caller does not respond to the prompt.

No Match Prompt

The caller hears the no match prompt, followed by the fallback prompt, when the caller's response to the prompt is not understood.

Cloud Speech-to-Text uses natural language, so a no match indicates a transcription with a low confidence score.
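As a minimal sketch of the prompt ordering described in the table above (the event names and prompt identifiers are illustrative; Studio performs this sequencing for you):

```python
# Illustrative sketch of the prompt ordering described above.
# The event names and prompt identifiers are examples only;
# Studio performs this sequencing internally.
def prompts_to_play(event: str) -> list[str]:
    ordering = {
        "start": ["prompt"],                                 # first attempt
        "no_input": ["no_input_prompt", "fallback_prompt"],
        "no_match": ["no_match_prompt", "fallback_prompt"],
    }
    return ordering[event]

print(prompts_to_play("no_input"))  # ['no_input_prompt', 'fallback_prompt']
```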

Phrase Hints

Use phrase hints to improve speech recognition accuracy. Enter phrase hints in the Phrase Hints field, in a datastore, or both. If you use a datastore, optionally use boost values to prioritize one phrase over another. For more about the datastore structure, see Phrase Hints Datastore. For a sketch of how phrase hints map to the underlying recognizer, see the example after the note below.

Field

Description

Phrase Hints

Enter a comma-separated list of keywords to improve speech recognition accuracy.

For example, if the caller's response is likely to include a month of the year, include a comma-separated list of months in the phrase hints.

The list can include variables, which enable the phrase hints to change dynamically according to the variable values.

Phrase Hints Datastore

Select the datastore that stores the phrase hints.

Datastore Phrase Hints Column

Select the column in the datastore with the list of words or phrases.
Datastore Boost Column

Select the column in the datastore with the boost values.

If the datastore has multiple boost columns, select a variable that stores the name of the column to use. This selects the column dynamically at call time.

Note:

Studio supports Google Cloud Speech-to-Text class tokens.
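Phrase hints, boost values, and class tokens correspond to Google Cloud Speech-to-Text speech adaptation. The sketch below shows how such hints might be expressed directly against the Google API using the google-cloud-speech client library; Studio assembles the equivalent request from the fields and datastore columns above, and the phrases, boost value, and audio settings shown are examples only.

```python
# A sketch of how phrase hints, boost values, and class tokens map onto
# Google Cloud Speech-to-Text speech adaptation (SpeechContext).
# Studio builds the equivalent request from the node settings;
# the phrases and boost value below are examples only.
from google.cloud import speech

speech_context = speech.SpeechContext(
    phrases=[
        "January", "February", "March",   # plain phrase hints
        "$MONTH",                         # a Google class token
        "billing department",
    ],
    boost=10.0,  # optional weighting, comparable to the datastore boost column
)

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=8000,
    language_code="en-US",
    speech_contexts=[speech_context],
)
```

Per-phrase boost values, such as those stored in the datastore boost column, would map to separate SpeechContext entries, one per boost value.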

Speech Recognizer

Field

Description

Language

Select the language. This field is required, and the available languages depend on the provider.

Language Variable

Select a language variable to dynamically override the ASR selected language at call time. This is useful when building multilingual voice tasks.

See Supported Language Codes.

Assign transcribed text to variable

Select the variable to store the best-matched transcription of the caller's response, that is, the transcription with the greatest confidence score. If the variable has not been created yet, type the name of the variable.

Assign confidence score to variable

Select the variable to store the confidence score associated with the transcribed text. If the variable has not been created yet, type the name of the variable. Use of this variable is optional; you can use it to ignore transcriptions with a low confidence score. The confidence score is a numeric value between 0 and 1, where 0 means no confidence and 1 means complete confidence. An example confidence score is 0.45. For a sketch of branching on this score, see the example after this table.

Assign recording file to audio variable

Select the variable to store the caller's raw verbatim audio as an audio file. If the variable has not been created yet, type the name of the variable.
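If you store the confidence score in a variable, you can branch on it later in the task, for example to ignore low-confidence transcriptions. A minimal sketch of that kind of check (the values and the threshold are hypothetical):

```python
# Illustrative use of the transcribed-text and confidence-score variables.
# The values and the 0.60 threshold are hypothetical.
transcribed_text = "send me to the billing department"
confidence_score = 0.45   # between 0 (no confidence) and 1 (complete confidence)

CONFIDENCE_THRESHOLD = 0.60

if confidence_score >= CONFIDENCE_THRESHOLD:
    print(f"Accept transcription: {transcribed_text}")
else:
    # Treat the transcription as unreliable: re-prompt or route to an agent.
    print("Low confidence: ignore the transcription and re-prompt the caller")
```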

Speech Controls

Select the box to enable Barge In.

Barge In enables the caller to interrupt the system and progress to the next prompt. It enables experienced callers to move rapidly through the system to get to the information that they want. You may want to disable barge in at key times, such as when your prompts or menu systems change.

Confidence Level

The confidence level is a representation of the system's confidence when interpreting caller input. A higher confidence level requires a higher accuracy in caller input.

Inter-Digit Timeout

Select the time in seconds to wait between each DTMF key press. The longer the timeout, the greater the allowable time between each key press. When the timeout is reached, Studio finalizes the collected DTMF input so far and moves on to the next node in the call flow.

Maximum Alternative Transcriptions

Studio returns the best matched transcription response when the setting is 1.

Increase the setting from 1 to access multiple transcriptions. The maximum number of transcriptions is 10.

To access the transcriptions and their corresponding confidence scores, see Accessing Multiple Transcriptions.

The Google ASR determines the actual number of transcriptions available, which may be fewer than the number you set in this field. For a sketch of how alternatives appear in a raw recognizer response, see the example below.
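When more than one transcription is requested, the underlying recognizer returns an ordered list of alternatives, each with its own confidence score. The sketch below shows how that list appears in a raw Google Cloud Speech-to-Text response using the google-cloud-speech client library; in Studio you reach the same data through variables, as described in Accessing Multiple Transcriptions, and the bucket URI and audio settings here are placeholders.

```python
# A sketch of multiple alternatives in a raw Google Cloud Speech-to-Text
# response. Studio exposes the same data through variables; the URI and
# audio settings below are placeholders.
from google.cloud import speech

client = speech.SpeechClient()
config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=8000,
    language_code="en-US",
    max_alternatives=3,   # corresponds to Maximum Alternative Transcriptions
)
audio = speech.RecognitionAudio(uri="gs://example-bucket/caller-response.wav")

response = client.recognize(config=config, audio=audio)
for result in response.results:
    # Alternatives are ordered by confidence; Google may return fewer
    # than the number requested.
    for alternative in result.alternatives:
        print(alternative.transcript, alternative.confidence)
```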

Event Handler

Each event handler has a Count value that specifies the number of attempts allowed before the event handler triggers. The event handler routes the call to a task canvas. If the canvas has not been created yet, type the name of the canvas. A sketch of how the Count value gates routing follows the table.

Event Handler

Description

No Input Event Handler

Select a task canvas. The call routes to the canvas if no input is detected from the caller after multiple attempts.

No Match Event Handler

Select a task canvas. The call routes to the canvas if no match is detected from the caller after multiple attempts.
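As a minimal sketch of how the Count value gates routing (the count value, counter, and return strings are hypothetical; Studio tracks this internally):

```python
# Illustrative sketch of the Count behavior on an event handler.
# The count value and return strings are examples only.
NO_INPUT_COUNT = 3          # the Count value on the No Input Event Handler
no_input_attempts = 0

def on_no_input() -> str:
    global no_input_attempts
    no_input_attempts += 1
    if no_input_attempts >= NO_INPUT_COUNT:
        return "route the call to the No Input task canvas"
    return "replay the no input prompt followed by the fallback prompt"

for _ in range(3):
    print(on_no_input())
```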

Advanced ASR Settings

Do not tune unless you have a clear understanding of how these settings affect speech recognition. Generally speaking, the default settings are the best. To return a setting to its default value, remove the value from the field and click outside the field.

Settings

Description

No Input Timeout

Wait time, in milliseconds, from when the prompt finishes playing to when the system directs the call to the No Input Event Handler because it has not detected the caller's speech.

Speech Complete Timeout

Speech Incomplete Timeout

Use these settings for responses that contain pauses to ensure the system keeps listening until the caller finishes speaking.

Speech Complete Timeout measures wait time, in milliseconds, from when the caller stops talking to when the system initiates an end-of-speech event. It should be longer than the Inter Result Timeout to prevent overlaps. To customize Speech Complete Timeout, disable Single Utterance.

Speech Incomplete Timeout measures wait time, in milliseconds, from when incoherent background noise begins and continues uninterrupted to when the system initiates an end-of-speech event.

Speech Start Timeout

Wait time, in milliseconds, from when the prompt starts to play to when the system begins to listen for caller input. Because the system then listens while the prompt is still playing, the effect is similar to having barge in enabled.

Inter Result Timeout

Wait time, in milliseconds, from when the caller stops talking to when the system initiates an end-of-speech event because it has not detected further interim results.

The typical use case would be for a caller reading out numbers. The caller might pause between the digits.

It is recommended to keep the value shorter than Speech Complete Timeout to avoid overlaps.

Unlike Speech Complete Timeout, Inter Result Timeout does not reset when there is background noise. If there is background noise, Inter Result Timeout may therefore be more reliable in determining when the speech is complete.

The timeout is disabled when the value is 0. Set a value from 500ms to 3000ms based on the maximum pause expected in the caller's response. To customize Inter Result Timeout, disable Single Utterance. A sketch of how these timeout constraints fit together follows this table.

Barge In Sensitivity

Raising the sensitivity requires the caller to speak louder above background noise. Applicable when Barge In is enabled.

Auto Punctuation

Select to add punctuation to the caller's speech.

Profanity Filter

Select to remove profanities from the caller's speech.

Single Utterance

A single utterance is a continuous stretch of speech without long pauses. A single utterance can be as short as yes or no, or a request like Can I book an appointment? or I need help with support.

The single utterance setting is selected by default.

Disable Single Utterance to customize Speech Complete Timeout and Inter Result Timeout.

You may decide to disable the single utterance setting if the caller is expected to pause as part of the conversation. For example, the caller may read out a sequence of numbers and pause in appropriate places.

Recognizer Model

Enter the recognizer model in the space provided.

Note:

Studio does not validate the recognizer model. Use the letter case provided by the ASR. If the ASR fails to recognize the recognizer model, the Cloud STT node fails with an event recorded in the system log. The event is CALLER_RESPONSE, where param1 is the current node, param2 is the next node, and param3 is the recognizer error. See also System Log Events.
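The timeouts in this table interact: keep Inter Result Timeout shorter than Speech Complete Timeout, and disable Single Utterance before customizing either. A minimal sketch of those constraints (the setting names below are paraphrased from this table, not taken from any API):

```python
# Illustrative check of the timeout constraints described in this table.
# Setting names are paraphrased from the table, not from any API.
def check_advanced_asr(settings: dict) -> list[str]:
    problems = []
    sct = settings.get("speech_complete_timeout_ms")
    irt = settings.get("inter_result_timeout_ms")
    if settings.get("single_utterance") and (sct is not None or irt is not None):
        problems.append("Disable Single Utterance to customize these timeouts.")
    if sct and irt and irt >= sct:
        problems.append("Keep Inter Result Timeout shorter than Speech Complete Timeout.")
    if irt and not 500 <= irt <= 3000:
        problems.append("Set Inter Result Timeout between 500ms and 3000ms (0 disables it).")
    return problems

print(check_advanced_asr({
    "single_utterance": False,
    "speech_complete_timeout_ms": 2000,
    "inter_result_timeout_ms": 1200,
}))  # [] means the settings are consistent
```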

Google Cloud Speech-to-Text

Studio supports these Google Cloud Speech-to-Text recognizer models.

The voice models available depend on the language selected.

Settings

Description

Latest Long

Use for long content.

Examples include media, such as video, and spontaneous speech and conversations.

You can use this in place of the default model.

Latest Short

Use for short utterances a few seconds in length.

Examples include capturing voice commands.

Consider this model instead of ASR Command and Search.

Phone Call

Use to transcribe audio from a phone call.

Typically, audio from a phone call is recorded at 8,000Hz sampling rate.

ASR Command and Search

Use for shorter audio clips.

Examples include voice commands and voice search.

ASR Default

Use if your audio does not fit the other models.

You can use this for long-form audio recordings that feature a single speaker only.

ASR Default produces transcription results for any type of audio, including audio from video clips. Ideally the audio is high-fidelity, recorded at a 16,000Hz or greater sampling rate.

Do not use when Single Utterance is selected.

Note:

When Single Utterance is selected, set the Recognizer Model to Phone Call or ASR Command and Search. Single utterance cannot be used with other recognizer models.
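The model names in this table correspond to the model field of a Google Cloud Speech-to-Text recognition request, and Single Utterance corresponds to the single_utterance flag of a streaming request. The sketch below shows the pairing described in the note, using the google-cloud-speech client library; Studio sends the equivalent request for you, and the exact model string it expects in the Recognizer Model field may be cased differently.

```python
# A sketch of how the recognizer model and Single Utterance map onto the
# Google Cloud Speech-to-Text API. Studio builds the equivalent request;
# the model string and audio settings here are examples only.
from google.cloud import speech

recognition_config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=8000,     # typical telephony sampling rate
    language_code="en-US",
    model="phone_call",         # e.g. phone_call or command_and_search
)

# Single Utterance is a streaming flag and, per the note above, is paired
# with the Phone Call or ASR Command and Search models.
streaming_config = speech.StreamingRecognitionConfig(
    config=recognition_config,
    single_utterance=True,
)
```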

Confirmation

Use the confirmation tab to prompt the caller to confirm their response.

Condition

Description

Not Required

The caller does not confirm their response. This is the default behavior.

Required

The caller is required to confirm their response.

Speech Recognizer Confidence

The caller is required to confirm their response if the confidence score falls below the speech recognizer confidence threshold.

The confidence level is representative of the system's confidence when interpreting the caller's response. Raise the Confidence Level to require a higher level of accuracy from the caller.

Configure these settings for the case where the caller confirms their response.

Settings

Description

Barge In

Select the box to enable Barge In.

Barge In enables the caller to confirm their response before the confirmation prompt has finished playing. It enables the caller to move rapidly through the call flow. However, the caller may not hear the confirmation prompt repeat their response.

Confirmation Prompt

The caller hears the confirmation prompt. Include [user_input] to play the response the prompt is confirming. If there are multiple response values, [user_input] captures the first value.

Example:

You entered: [user_input]. Say Yes or press 1 if this is correct. Otherwise say No or press 2.

No Input Prompt

The caller hears the no input prompt when the caller does not respond to the confirmation prompt.

No Match Prompt

The caller hears the no match prompt when the caller's confirmation response is not understood.

Maximum Number of Attempts to Confirm

After the maximum number of attempts to confirm the response, the call flows to the No Match Event Handler.
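A minimal sketch of the confirmation conditions in this section (the threshold, prompt wording, and condition strings are illustrative; Studio evaluates the condition for you):

```python
# Illustrative sketch of the confirmation conditions described above.
# The threshold, prompt wording, and condition strings are examples only.
def needs_confirmation(condition: str, confidence: float, threshold: float = 0.6) -> bool:
    if condition == "Not Required":
        return False
    if condition == "Required":
        return True
    if condition == "Speech Recognizer Confidence":
        return confidence < threshold
    raise ValueError(f"Unknown condition: {condition}")

user_input = "send me to the billing department"
if needs_confirmation("Speech Recognizer Confidence", confidence=0.45):
    print(f"You entered: {user_input}. Say Yes or press 1 if this is correct. "
          "Otherwise say No or press 2.")
```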