Cloud STT

Use the Cloud STT node to convert a caller's audio or DTMF responses to text in real time using Google Cloud Speech-to-Text or Deepgram.

Name

Give the node a name.

Text-to-Speech

Specify each of the prompts to play to the caller.

The Add Pause, Volume, Pitch, Emphasis, Say As, and Rate menus above the TTS prompt apply SSML parameters. See Text-to-Speech Prompts.
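For example, applying a pause, a Say As rule, a rate and volume change, and emphasis to a prompt might produce SSML along these lines (an illustrative sketch; the exact markup Studio generates may differ):

    <speak>
      Please listen carefully.
      <break time="500ms"/>
      Your account number is <say-as interpret-as="digits">12345</say-as>.
      <prosody rate="slow" volume="loud">
        <emphasis level="strong">Thank you for calling.</emphasis>
      </prosody>
    </speak>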

Prompt Field Name

Description

Prompt

The caller hears this prompt first. The prompt tells the caller what to do.

Fallback Prompt

The caller hears the fallback prompt when the system fails to hear the caller's response, or the caller's response is not understood. The fallback prompt should provide the caller with greater context on how to respond to the prompt.

Example fallback prompt: This is a natural language system. Just say what you would like to do. For example, say send me to the billing department, or I want to speak to a representative.

No Input Prompt

The caller hears the no input prompt, followed by the fallback prompt, when the caller does not respond to the prompt.

No Match Prompt

The caller hears the no match prompt, followed by the fallback prompt, when the caller's response to the prompt is not understood.

Because Cloud Speech-to-Text uses natural language recognition, most responses can be transcribed; a no match typically indicates that the transcription confidence score fell below the minimum threshold.

Phrase Hints

Use phrase hints to improve speech recognition accuracy. Enter phrase hints in the Phrase Hints field, in a datastore, or both. If you use a datastore, you can optionally use boost values to prioritize one phrase over another. For more about the datastore structure, see Phrase Hints Datastore.

Field

Description

Phrase Hints

Enter a comma-separated list of keywords to improve speech recognition accuracy.

For example, if the caller's response is likely to include a month of the year, include a comma-separated list of months in the phrase hints.

The list can include variables, which enable the phrase hints to change dynamically according to the variable values.

Phrase Hints Datastore

Select the datastore that stores the phrase hints.

Datastore Phrase Hints Column

Select the column in the datastore with the list of words or phrases.
Datastore Boost Column

Select the column in the datastore with the boost values.

If the datastore has multiple boost columns, you can select the column dynamically at call time: select a variable that stores the name of the column.

Note:

Studio supports Google Cloud Speech-to-Text class tokens.
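To illustrate, here is a minimal sketch of how phrase hints, boost values, and class tokens map onto the Google Cloud Speech-to-Text API, assuming the google-cloud-speech Python client (Studio builds the request for you; this is not Studio code):

    from google.cloud import speech

    # Each SpeechContext carries one boost value, so phrases with different
    # boost values go in separate contexts, much like rows in the datastore.
    contexts = [
        speech.SpeechContext(phrases=["billing", "representative"], boost=15.0),
        speech.SpeechContext(phrases=["$MONTH"], boost=5.0),  # class token
    ]

    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=8000,
        language_code="en-US",
        model="phone_call",
        speech_contexts=contexts,
    )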

Speech Recognizer

Field

Description

Provider

Google is selected by default. Alternatively, select Deepgram.

Language

Select the language you expect the caller to speak. This field is required. The available languages depend on the provider.

Speech Recognition Model

The models that are available depend on the provider and language.

To list the available models, clear the field and then click in it. Select from the available options, or start typing to filter the list to models that match your text.

Note:

Studio does not validate the model name. Enter the name exactly as displayed, including letter case. If the provider fails to recognize the model, the Cloud STT node fails and an event is recorded in the system log. The event is CALLER_RESPONSE, where param1 is the current node, param2 is the next node, and param3 is the recognizer error. See also System Log Events.

Google Cloud Speech-to-Text

Studio supports these Google Cloud Dialogflow voice models.

Model

Description

latest_long

Use for long content.

Examples include media, such as video, and spontaneous speech and conversations.

You can use this in place of the default model.

latest_short

Use for short utterances a few seconds in length.

Examples include capturing commands.

Consider this model instead of command_and_search.

phone_call

Use to transcribe audio from a phone call.

Typically, audio from a phone call is recorded at an 8,000 Hz sampling rate.

command_and_search

Use for shorter audio clips.

Examples include voice commands and voice search.

default

Use if your audio does not fit the other models.

You can use this for long-form audio recordings that feature a single speaker only.

The default model produces transcription results for any type of audio, including audio from video clips. Ideally, the audio is high-fidelity, recorded at a sampling rate of 16,000 Hz or greater.

Do not use when Single Utterance on the Advanced ASR Settings tab is selected.

Deepgram

Each language is supported by one model. If you change the language, you may need to change the model.

Enter Deepgram models in the form <model option>_<model>. For example, enter Deepgram model base with model option phonecall as phonecall_base.

The following models have been tested to work with Studio.

  • 2-general_base

  • 2-general_enhanced

  • 2-general_nova

  • 2-nova_medical

  • 2-nova_meeting

  • 2-phonecall_base

  • 2-phonecall_enhanced

  • 2-phonecall_nova

  • conversationalai_base

  • finance_base

  • finance_enhanced

  • general_base

  • general_enhanced

  • general_nova

  • meeting_base

  • meeting_enhanced

  • medical_nova

  • phonecall_base

  • phonecall_enhanced

  • phonecall_nova

  • video_base

  • voicemail_base

For more information about Deepgram models and options, see Deepgram models and Models and Languages Overview in the Deepgram documentation.

Note:

The form for expressing Deepgram models in Studio differs from Deepgram's standard form, <model>-<model option>.

Deepgram deprecated the use of model tiers in favor of model options.
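As an illustration of the naming difference, a hypothetical helper that converts a Studio-style name to Deepgram's standard form might look like this (it handles the common form only; Studio performs any mapping internally):

    def studio_to_deepgram(name: str) -> str:
        # Studio form: <model option>_<model>, e.g. "phonecall_base".
        # Deepgram standard form: <model>-<model option>, e.g. "base-phonecall".
        option, model = name.rsplit("_", 1)
        return f"{model}-{option}"

    assert studio_to_deepgram("phonecall_base") == "base-phonecall"
    assert studio_to_deepgram("2-general_nova") == "nova-2-general"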

Alternative Languages

Select up to three alternative languages the caller may speak.

This assists with speech recognition.

Override Language Variable

Override the language selected by the ASR. This is useful when building multilingual tasks and the language spoken by the caller has already been established on the call. Select the variable containing the language spoken by the caller. This feature is supported when Google is the selected ASR.

See Supported Language Codes.

Assign detected language to variable

Select the variable to store the language spoken by the caller as detected by the ASR.

Assign transcribed text to variable

Select the variable to store the transcribed text. If the variable has not been created yet, type the name of the variable.

The variable stores the text transcribed by the ASR with the greatest confidence score.

Assign confidence score to variable

Select the variable to store the confidence score associated with the transcribed text. If the variable has not been created yet, type the name of the variable. Use of this variable is optional; for example, you can use it to ignore transcriptions with low confidence scores.

The confidence score is a numeric value between 0 and 1: 0 meaning no confidence and 1 meaning complete confidence. An example confidence score is 0.45.

Assign recording file to audio variable

Select the variable to store the caller's raw verbatim audio as an audio file. If the variable has not been created yet, type the name of the variable.

Speech Controls

Select the box to enable Barge In.

Barge In enables the caller to interrupt the system and progress to the next prompt. It enables experienced callers to move rapidly through the system to get to the information that they want. You may want to disable Barge In at key times, such as when your prompts or menu systems change.

Minimum Transcription Confidence Score

When the confidence score is less than or equal to the Minimum Transcription Confidence Score, the call is directed to the No Match Event Handler.

The confidence score is a representation of the system's confidence when interpreting caller input.

The confidence score is a numeric value between 0 and 1: 0 meaning no confidence and 1 meaning complete confidence. An example confidence score is 0.45.
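As a sketch of the routing rule (assumed pseudologic, not Studio internals), a score at or below the threshold routes to the No Match Event Handler:

    def route(confidence: float, minimum: float) -> str:
        # Scores less than or equal to the minimum are treated as no match.
        return "No Match Event Handler" if confidence <= minimum else "next node"

    route(0.45, 0.60)  # -> "No Match Event Handler"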

Inter-Digit Timeout

Select the time, in seconds, to wait between DTMF key presses. The longer the timeout, the more time the caller has between key presses. When the timeout is reached, Studio finalizes the DTMF input collected so far and moves on to the next node in the call flow.
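The sketch below illustrates the timeout behavior (assumed logic, not Studio internals); get_key is a hypothetical function that returns the next DTMF key, or None when the timeout elapses first:

    def collect_dtmf(get_key, inter_digit_timeout: float) -> str:
        digits = ""
        while True:
            key = get_key(timeout=inter_digit_timeout)
            if key is None:
                # Timeout reached: finalize the digits collected so far.
                return digits
            digits += key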

Maximum Alternative Transcriptions

When the setting is 1, Studio returns the best-matched transcription.

Increase the setting from 1 to access multiple transcriptions. The maximum number of transcriptions is 10.

To access the transcriptions and their corresponding confidence scores, see Accessing Multiple Transcriptions.

This is specific to Google Cloud Speech-to-Text. It is not available when the selected provider is Deepgram. The Google ASR determines the actual number of transcriptions available, which may be fewer than the number you set in this field.
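For reference, a minimal sketch of requesting multiple alternatives with the google-cloud-speech Python client (illustrative only; the bucket URI is a placeholder, and Studio issues the request on your behalf):

    from google.cloud import speech

    client = speech.SpeechClient()
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=8000,
        language_code="en-US",
        max_alternatives=3,  # Maximum Alternative Transcriptions
    )
    audio = speech.RecognitionAudio(uri="gs://example-bucket/caller.wav")  # placeholder
    response = client.recognize(config=config, audio=audio)

    # The ASR may return fewer alternatives than requested.
    for alternative in response.results[0].alternatives:
        print(alternative.transcript, alternative.confidence)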

Event Handler

Each event handler has a Count value that specifies the number of attempts allowed before the event handler triggers. The event handler routes the call to a task canvas. If the canvas has not been created yet, type the name of the canvas.

Event Handler

Description

No Input Event Handler

Select a task canvas. The call routes to the canvas if no input is detected from the caller after multiple attempts.

No Match Event Handler

Select a task canvas. The call routes to the canvas if no match is detected from the caller after multiple attempts.
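The Count behavior can be sketched as follows (assumed pseudologic, not Studio internals):

    def next_step(failed_attempts: int, count: int) -> str:
        # The handler routes to its task canvas only after Count failed attempts.
        if failed_attempts >= count:
            return "route to the event handler's task canvas"
        return "replay the prompt"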

Advanced ASR Settings

Do not tune these settings unless you have a clear understanding of how they affect speech recognition. In general, the default settings are best. To return a setting to its default value, remove the value from the field and click outside the field.

Note:

Studio supports a maximum of 5 minutes of caller audio when using Google Cloud Speech-to-Text. See https://cloud.google.com/speech-to-text/quotas.

Settings

Description

No Input Timeout

Wait time, in milliseconds, from when the prompt finishes to when the system directs the call to the No Input Event Handler because it has been unable to detect the caller's speech.

Speech Complete Timeout

Speech Incomplete Timeout

Use these settings for responses with an interlude to ensure the system listens until the caller's speech is finished.

Speech Complete Timeout measures wait time, in milliseconds, from when the caller stops talking to when the system initiates an end-of-speech event. It should be longer than the Inter Result Timeout to prevent overlaps. To customize Speech Complete Timeout, turn off Single Utterance.

Speech Incomplete Timeout measures wait time, in milliseconds, from when incoherent background noise begins and continues uninterrupted to when the system initiates an end-of-speech event.

Speech Start Timeout

Wait time, in milliseconds, from when the prompt starts to play to when the system begins to listen for caller input. Because the system can begin listening while the prompt is still playing, this is similar to having Barge In enabled.

Inter Result Timeout

Wait time, in milliseconds, from when the caller stops talking to when the system initiates an end-of-speech event because it has detected no further interim results.

A typical use case is a caller reading out numbers, pausing between digits.

Keep the value shorter than Speech Complete Timeout to avoid overlaps.

Inter Result Timeout does not reset if there is background noise, whereas Speech Complete Timeout does. If there is background noise, Inter Result Timeout may therefore be more reliable in determining when speech is complete.

By default, Inter Result Timeout is set to 0 and Single Utterance is turned on.

To customize Inter Result Timeout, turn off Single Utterance. Set Inter Result Timeout between 500 ms and 3000 ms, based on the maximum pause expected in the caller's response.

When Single Utterance is turned off and the selected ASR is Google Cloud Dialogflow, Inter Result Timeout cannot be set to 0; a value of 0 automatically changes to 1000 ms.
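A simplified sketch of the end-of-speech decision described above (assumed logic; the actual recognizer behavior is more involved). Inter Result Timeout watches the gap since the last interim transcription result, so background noise does not reset it, whereas the silence measured by Speech Complete Timeout resets on any sound:

    def speech_ended(ms_since_last_interim_result: int,
                     ms_of_continuous_silence: int,
                     inter_result_timeout: int,
                     speech_complete_timeout: int) -> bool:
        # An Inter Result Timeout of 0 means the check is disabled.
        if inter_result_timeout and ms_since_last_interim_result >= inter_result_timeout:
            return True
        return ms_of_continuous_silence >= speech_complete_timeout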

Barge In Sensitivity

Applicable when Barge In is enabled. Raising the sensitivity requires the caller to speak more loudly above background noise. The scale is logarithmic. See Lumenvox Sensitivity Settings.

Auto Punctuation

Select to add punctuation to the caller's speech.

Profanity Filter

Select to remove profanities from the caller's speech.

This is specific to Google Cloud Speech-to-Text. It is not available when the selected provider is Deepgram.

Single Utterance

A single utterance is a string of speech without long pauses. A single utterance can be as short as yes or no, or a request such as Can I book an appointment? or I need help with support.

The single utterance setting is turned on by default.

Turn off Single Utterance to customize Speech Complete Timeout and Inter Result Timeout.

You may decide to turn off Single Utterance if the caller is expected to pause as part of the conversation. For example, the caller may read out a sequence of numbers and pause in appropriate places.

Note:

When Single Utterance is selected and Google is the selected provider, set the Speech Recognition Model on the Speech Recognizer tab to phone_call or command_and_search. Single Utterance cannot be used with other models.

Confirmation

Use the confirmation tab to prompt the caller to confirm their response.

Condition

Description

Not Required

The caller does not confirm their response. This is the default behavior.

Required

The caller is required to confirm their response.

Speech Recognizer Confidence

The caller is required to confirm their response if the transcription confidence score falls below the speech recognizer's Minimum Transcription Confidence Score.

The confidence score represents the system's confidence when interpreting the caller's response. Raise the Minimum Transcription Confidence Score to require a higher level of accuracy from the caller.

Configure these settings for the case where the caller confirms their response.

Settings

Description

Barge In

Select the box to enable Barge In.

Barge In enables the caller to confirm their response before the confirmation prompt has finished playing, which lets the caller move rapidly through the call flow. However, the caller may not hear their response repeated in the confirmation prompt.

Confirmation Prompt

The caller hears the confirmation prompt. Include [user_input] to play the response the prompt is confirming. If there are multiple response values, [user_input] captures the first value.

Example:

You entered: [user_input]. Say Yes or press 1 if this is correct. Otherwise say No or press 2.
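A hypothetical illustration of the substitution (Studio performs this for you; the variable names are placeholders):

    response_values = ["March", "7th"]  # example response values
    prompt = ("You entered: [user_input]. Say Yes or press 1 if this is correct. "
              "Otherwise say No or press 2.")
    rendered = prompt.replace("[user_input]", response_values[0])  # first value only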

No Input Prompt

The caller hears the no input prompt when the caller does not confirm.

No Match Prompt

The caller hears the no match prompt when the caller's confirmation response is not understood.

Maximum Number of Attempts to Confirm

After the maximum number of attempts to confirm the response, the call flows to the No Match Event Handler.

No Input Timeout

Wait time, in milliseconds, from when the prompt finishes to when the system directs the call to the No Input Event Handler because it has been unable to detect the caller's speech.