ondewo/s2t/speech-to-text.proto

Services

Speech2Text

Speech-to-text service

Service Methods

TranscribeFile

rpc TranscribeFile (TranscribeFileRequest) returns (TranscribeFileResponse)
Transcribes an audio file.

TranscribeStream

rpc TranscribeStream (stream TranscribeStreamRequest) returns (stream TranscribeStreamResponse)
Transcribes an audio stream.

GetS2tPipeline

rpc GetS2tPipeline (S2tPipelineId) returns (Speech2TextConfig)
Gets the speech-to-text pipeline corresponding to the id specified in S2tPipelineId. If no matching id is found, the server raises a ModuleNotFoundError.

CreateS2tPipeline

rpc CreateS2tPipeline (Speech2TextConfig) returns (S2tPipelineId)
Creates a new speech to text pipeline from a Speech2TextConfig and registers the new pipeline in the server.

DeleteS2tPipeline

rpc DeleteS2tPipeline (S2tPipelineId) returns (.google.protobuf.Empty)
Deletes the pipeline corresponding to the id passed in S2tPipelineId. If no matching id is found, the server raises a ModuleNotFoundError.

UpdateS2tPipeline

rpc UpdateS2tPipeline (Speech2TextConfig) returns (.google.protobuf.Empty)
Updates the pipeline whose id is specified in Speech2TextConfig with the new config. If no matching id is found, the server raises a ModuleNotFoundError.

ListS2tPipelines

rpc ListS2tPipelines (ListS2tPipelinesRequest) returns (ListS2tPipelinesResponse)
Lists all speech to text pipelines.

ListS2tLanguages

rpc ListS2tLanguages (ListS2tLanguagesRequest) returns (ListS2tLanguagesResponse)
Returns a message containing a list of all languages for which there exist pipelines.

ListS2tDomains

rpc ListS2tDomains (ListS2tDomainsRequest) returns (ListS2tDomainsResponse)
Returns a message containing a list of all domains for which there exist pipelines.

GetServiceInfo

rpc GetServiceInfo (.google.protobuf.Empty) returns (S2TGetServiceInfoResponse)
Returns a message containing the version of the running speech to text server.

ListS2tLanguageModels

rpc ListS2tLanguageModels (ListS2tLanguageModelsRequest) returns (ListS2tLanguageModelsResponse)
Given a list of pipeline ids, returns a list of LanguageModelPipelineId messages containing the pipeline id and a list of the language models loaded in the pipeline.

CreateUserLanguageModel

rpc CreateUserLanguageModel (CreateUserLanguageModelRequest) returns (.google.protobuf.Empty)
Creates a user language model.

DeleteUserLanguageModel

rpc DeleteUserLanguageModel (DeleteUserLanguageModelRequest) returns (.google.protobuf.Empty)
Deletes a user language model.

AddDataToUserLanguageModel

rpc AddDataToUserLanguageModel (AddDataToUserLanguageModelRequest) returns (.google.protobuf.Empty)
Adds data to a user language model.

TrainUserLanguageModel

rpc TrainUserLanguageModel (TrainUserLanguageModelRequest) returns (.google.protobuf.Empty)
Trains a user language model.

Messages

AcousticModels

AcousticModels contains information about different types of acoustic models.

Field Type Label Description
type string

Type of the acoustic model.

wav2vec Wav2Vec

Configuration for the Wav2Vec model.

wav2vec_triton Wav2VecTriton

Configuration for the Wav2Vec model using Triton.

whisper Whisper

Configuration for the Whisper model.

whisper_triton WhisperTriton

Configuration for the Whisper model using Triton.

s2t_cloud_service_amazon S2tCloudServiceAmazon

Amazon Web Services cloud service inference settings.

s2t_cloud_service_deepgram S2tCloudServiceDeepgram

Deepgram cloud service inference settings.

s2t_cloud_service_google S2tCloudServiceGoogle

Google cloud service inference settings.

s2t_cloud_service_microsoft S2tCloudServiceMicrosoft

Microsoft Azure cloud service inference settings.

AddDataToUserLanguageModelRequest

AddDataToUserLanguageModelRequest is used to request the addition of data to a user-specific language model.

Field Type Label Description
language_model_name string

Name of the language model to which to add data. Example: "user_lm_1"

zipped_data bytes

Zip file containing data in the form of text files. Example: A zip file with text files containing sentences or phrases in the target language.
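
As a rough illustration of what zipped_data could contain, the following Python sketch packs a few text files into an in-memory zip archive using only the standard library. The helper name zip_text_files and the file names are hypothetical; how the resulting bytes are attached to the request depends on the client you use.

```python
import io
import zipfile
from typing import Dict


def zip_text_files(files: Dict[str, str]) -> bytes:
    """Pack {file name: text} into a zip archive and return its bytes."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, mode="w", compression=zipfile.ZIP_DEFLATED) as archive:
        for name, text in files.items():
            # Each entry is a plain UTF-8 text file with sentences or phrases.
            archive.writestr(name, text.encode("utf-8"))
    return buffer.getvalue()


# Hypothetical training sentences in the target language.
zipped_data = zip_text_files({
    "greetings.txt": "good morning\ngood evening\n",
    "orders.txt": "i would like to order a pizza\n",
})
```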

CkptFile

CkptFile contains information about checkpoint files.

Field Type Label Description
path string

Path to the checkpoint file.

CreateUserLanguageModelRequest

CreateUserLanguageModelRequest is used to request the creation of a new user-specific language model.

Field Type Label Description
language_model_name string

Name of the language model to create. Example: "user_lm_1"

DeleteUserLanguageModelRequest

DeleteUserLanguageModelRequest is used to request the deletion of a user-specific language model.

Field Type Label Description
language_model_name string

Name of the language model to delete. Example: "user_lm_1"

LanguageModelPipelineId

LanguageModelPipelineId contains information about a pipeline and its available language models.

Field Type Label Description
pipeline_id string

A pipeline ID. Example: "pipeline_1"

model_names string repeated

A list of all available language models for the corresponding pipeline ID. Example: ["model_1", "model_2"]

LanguageModels

LanguageModels contains information about language models.

Field Type Label Description
path string

Path to the directory of language models.

beam_size int64

Beam size for the search algorithm.

default_lm string

Default language model to be selected if none is given.

beam_search_scorer_alpha float

Weight for the language model scorer (alpha).

beam_search_scorer_beta float

Weight for the word insertion penalty (beta).
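
A common way to combine these two weights in beam-search decoding is to add the language model log-probability scaled by alpha and a per-word term scaled by beta to the acoustic score. The sketch below assumes that formulation; this reference does not state the exact formula the server uses, and the function name and the numbers are hypothetical.

```python
def combined_score(acoustic_log_prob: float, lm_log_prob: float,
                   word_count: int, alpha: float, beta: float) -> float:
    """Score of a hypothesis: acoustic + alpha * LM + beta * number of words."""
    return acoustic_log_prob + alpha * lm_log_prob + beta * word_count


# Two hypothetical hypotheses competing for one beam slot.
short = combined_score(-4.0, -6.0, word_count=2, alpha=0.5, beta=1.0)
long = combined_score(-4.5, -5.0, word_count=3, alpha=0.5, beta=1.0)

# The hypothesis with the higher combined score is kept.
best = max(("short", short), ("long", long), key=lambda item: item[1])[0]
```

With these numbers the longer hypothesis wins even though its acoustic score is lower, because the language model and the word term favor it.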

ListS2tDomainsRequest

Request message to list available domains. Optionally, filters can be set.

Field Type Label Description
languages string repeated

Filter for languages.

pipeline_owners string repeated

Filter for pipeline owners.

ListS2tDomainsResponse

Response message to list available domains.

Field Type Label Description
domains string repeated

Available domains. Example: ["medical", "finance"]

ListS2tLanguageModelsRequest

ListS2tLanguageModelsRequest is used to request a list of available language models for specified pipelines.

Field Type Label Description
ids string repeated

List of pipeline IDs to retrieve their available language models. Example: ["pipeline_1", "pipeline_2"]

ListS2tLanguageModelsResponse

ListS2tLanguageModelsResponse is used to return the available language models for specified pipelines.

Field Type Label Description
lm_pipeline_ids LanguageModelPipelineId repeated

Response is a list of LanguageModelPipelineId, where each element contains a pipeline ID and its associated language models. Example: [{pipeline_id: "pipeline_1", model_names: ["model_1", "model_2"]}, {pipeline_id: "pipeline_2", model_names: ["model_3"]}]

ListS2tLanguagesRequest

ListS2tLanguagesRequest is used to request a list of available languages. Optionally, filters can be set.

Field Type Label Description
domains string repeated

Filter for domains. Example: ["medical", "finance"]

pipeline_owners string repeated

Filter for pipeline owners. Example: ["ondewo", "partner_company"]

ListS2tLanguagesResponse

Response message to list available languages.

Field Type Label Description
languages string repeated

Available languages.

ListS2tPipelinesRequest

Request to list all speech-to-text pipelines. Optionally, filter criteria can be set.

Field Type Label Description
languages string repeated

Filter for languages.

pipeline_owners string repeated

Filter for pipeline owners.

domains string repeated

Filter for domains.

registered_only bool

If true, returns only registered pipelines. Defaults to false, which returns both registered configs and persisted configs (from config files).

ListS2tPipelinesResponse

ListS2tPipelinesResponse is used to return a list of all speech-to-text pipelines.

Field Type Label Description
pipeline_configs Speech2TextConfig repeated

A list of Speech2TextConfig message instances containing the configuration of each pipeline. Example: [{id: "pipeline_1", description: {language: "en"}, active: true, ...}, {id: "pipeline_2", description: {language: "fr"}, active: true, ...}]

Logging

Logging contains configuration for logging.

Field Type Label Description
type string

Type of logging.

path string

Path for logging.

PostProcessing

PostProcessing contains the configuration for post-processing.

Field Type Label Description
pipeline string repeated

List of names of active post-processors.

post_processors PostProcessors

Post-processor configurations.

PostProcessingOptions

Configuration of the post-processing options.

Field Type Label Description
spelling_correction bool

Whether to use spelling correction.

normalize bool

Whether to disable normalization.

config PostProcessing

Post-processing configuration specifying the active post-processors in the pipeline, as well as their individual configuration. If not set, all values are replaced by the ones in the current pipeline.

PostProcessors

PostProcessors contains configurations for post-processors.

Field Type Label Description
sym_spell SymSpell

Configuration of the SymSpell spelling correction.

normalization S2TNormalization

Configuration of the normalization object.

PtFiles

PtFiles contains information about PT files.

Field Type Label Description
path string

Path to the PT files.

step string

Step for the PT files.

Pyannote

Pyannote contains configuration for the Pyannote voice activity detection model.

Library: [pyannote-audio](https://github.com/pyannote/pyannote-audio/blob/develop/tutorials/voice_activity_detection.ipynb)

Field Type Label Description
model_name string

Full name of the Pyannote model.

min_audio_size int64

Minimum audio size for processing.

min_duration_off float

Fill inactive regions shorter than this many seconds. Example [notebook](https://github.com/pyannote/pyannote-audio/blob/develop/tutorials/voice_activity_detection.ipynb)

min_duration_on float

Remove active regions shorter than this many seconds. Example [notebook](https://github.com/pyannote/pyannote-audio/blob/develop/tutorials/voice_activity_detection.ipynb)

triton_server_host string

Host name of the Triton inference server that serves the Pyannote model.

triton_server_port int64

Port number of the Triton inference server that serves the Pyannote model.
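
To make the effect of min_duration_off and min_duration_on concrete, here is a minimal Python sketch that applies both thresholds to a list of detected speech segments. It is not the pyannote-audio implementation; the function name, the segment values, and the order of the two steps (fill gaps first, then drop short regions) are assumptions made for illustration.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start, end) of an active region, in seconds


def smooth_segments(segments: List[Segment], min_duration_off: float,
                    min_duration_on: float) -> List[Segment]:
    """Fill short inactive gaps, then drop short active regions."""
    merged: List[Segment] = []
    for start, end in sorted(segments):
        # A gap shorter than min_duration_off is filled by merging neighbors.
        if merged and start - merged[-1][1] < min_duration_off:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    # Active regions shorter than min_duration_on are removed.
    return [(s, e) for s, e in merged if e - s >= min_duration_on]


# Hypothetical detector output: a 0.1 s pause and a 0.05 s blip.
raw = [(0.0, 1.0), (1.1, 2.0), (5.0, 5.05)]
smoothed = smooth_segments(raw, min_duration_off=0.2, min_duration_on=0.1)
```

Here the short pause between the first two segments is filled and the isolated blip is discarded, leaving one region from 0.0 to 2.0 seconds.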

S2TDescription

S2TDescription contains descriptive information about the speech-to-text pipeline.

Field Type Label Description
language string

Language of the speech-to-text system.

pipeline_owner string

Owner of the pipeline.

domain string

Domain of the speech-to-text system.

comments string

Comments about the system.

S2TGetServiceInfoResponse

S2TGetServiceInfoResponse is used to return version information about the speech-to-text service.

Field Type Label Description
version string

Version number based on semantic versioning, e.g. "4.2.0".

S2TInference

S2TInference contains information about inference models used in the speech-to-text pipeline.

Field Type Label Description
acoustic_models AcousticModels

Configuration for the acoustic models.

language_models LanguageModels

Configuration for the language models.

inference_backend InferenceBackend

Configuration for the inference backend.

S2TNormalization

S2TNormalization contains configuration for the speech-to-text normalization.

Field Type Label Description
language string

Language for normalization of transcriptions.

pipeline string repeated

List of names of active normalizations.

S2tCloudProviderConfig

Configuration for cloud provider settings for Speech-to-Text (S2T).

Field Type Label Description
s2t_cloud_provider_config_amazon S2tCloudProviderConfigAmazon optional

Optional. Configuration for the Amazon Web Services speech-to-text provider.

s2t_cloud_provider_config_deepgram S2tCloudProviderConfigDeepgram optional

Optional. Configuration for the Deepgram speech-to-text provider.

s2t_cloud_provider_config_google S2tCloudProviderConfigGoogle optional

Optional. Configuration for Google speech-to-text provider.

s2t_cloud_provider_config_microsoft S2tCloudProviderConfigMicrosoft optional

Optional. Configuration for Microsoft Azure speech-to-text provider.

S2tCloudProviderConfigAmazon

Configuration details specific to the Amazon Web Services speech-to-text provider.

Field Type Label Description
enable_partial_results_stabilization bool optional

Optional. Enables or disables the partial_results_stabilization feature. More details at: https://docs.aws.amazon.com/transcribe/latest/dg/streaming-partial-results.html#streaming-partial-result-stabilization

partial_results_stability string optional

Optional. Sets the stability level of the transcription results. A higher stability level means that the transcription results are less likely to change. Higher stability levels can come with lower overall transcription accuracy. Acceptable values: ["low", "medium", "high"]. Defaults to "high" if not set explicitly. More details at: https://aws.amazon.com/blogs/machine-learning/amazon-transcribe-now-supports-partial-results-stabilization-for-streaming-audio/

language_model_name string optional

Optional. The name of the custom language model you want to use. More details at: https://docs.aws.amazon.com/transcribe/latest/dg/custom-language-models.html

vocabulary_name string optional

Optional. The name of the custom vocabulary you want to use. More details at: https://docs.aws.amazon.com/transcribe/latest/dg/custom-vocabulary.html
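
The acceptable values and the default of partial_results_stability can be checked on the client side before a request is sent. The following Python sketch is one hedged way to do that; the function name is hypothetical and the server may apply its own validation.

```python
from typing import Optional

# Acceptable values as listed in the field description.
ALLOWED_STABILITY = ("low", "medium", "high")


def resolve_stability(value: Optional[str]) -> str:
    """Return the stability level to use, falling back to the documented default."""
    if value is None:
        # The field description says "high" is used when nothing is set.
        return "high"
    if value not in ALLOWED_STABILITY:
        raise ValueError(f"partial_results_stability must be one of {ALLOWED_STABILITY}")
    return value
```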

S2tCloudProviderConfigDeepgram

Configuration details specific to the Deepgram speech-to-text provider.

Field Type Label Description
punctuate bool optional

Optional. Enables or disables Deepgram's punctuate feature, which adds punctuation to the resulting transcript. More details at: https://developers.deepgram.com/docs/punctuation

smart_format bool optional

Optional. Enables or disables Deepgram's smart_format feature, which improves the readability of the transcription result. More details at: https://developers.deepgram.com/docs/smart-format

numerals bool optional

Optional. Enables or disables Deepgram's numerals feature, which converts numbers to numeric form in the resulting transcript. More details at: https://developers.deepgram.com/docs/numerals

measurements bool optional

Optional. Enables or disables Deepgram's measurements feature, which converts measurement units (e.g. Kilogram) to their abbreviated form (e.g. Kg) in the resulting transcript. More details at: https://developers.deepgram.com/docs/measurements

dictation bool optional

Optional. Enables or disables Deepgram's dictation feature, which converts spoken dictation commands into their corresponding punctuation marks. More details at: https://developers.deepgram.com/docs/dictation

S2tCloudProviderConfigGoogle

Configuration details specific to the Google speech-to-text provider.

Field Type Label Description
enable_automatic_punctuation bool optional

Optional. Enables or disables the automatic_punctuation feature of Google s2t, which adds punctuation to the resulting transcript. More details at: https://cloud.google.com/speech-to-text/docs/automatic-punctuation

enable_word_time_offsets bool optional

Optional. Enables or disables the word_time_offsets feature of Google s2t, which adds word-level timestamps (time offsets) to the resulting transcript. More details at: https://cloud.google.com/speech-to-text/docs/async-time-offsets

enable_word_confidence bool optional

Optional. Enables or disables the word_confidence feature of Google s2t, which adds word-level confidence scores to the resulting transcript. More details at: https://cloud.google.com/speech-to-text/docs/word-confidence

transcript_normalization bool optional

Optional. Enables or disables the transcript_normalization feature of Google s2t, which automatically replaces parts of the transcript with phrases of your choosing. More details at: https://cloud.google.com/speech-to-text/v2/docs/reference/rpc/google.cloud.speech.v2#transcriptnormalization

max_alternatives int32 optional

Optional. Maximum number of recognition hypotheses to be returned. The server may return fewer than max_alternatives. Valid values are 0-30. A value of 0 or 1 will return a maximum of one. If omitted, will return a maximum of one.
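
The rules for max_alternatives (valid range 0-30, with 0, 1, and an omitted value all meaning at most one hypothesis) can be summarized in a few lines of Python. This is only a sketch of the documented behavior; the function name is hypothetical, and raising ValueError for out-of-range values is an assumption about how a client might choose to react.

```python
from typing import Optional


def effective_max_alternatives(max_alternatives: Optional[int]) -> int:
    """Upper bound on the number of hypotheses implied by max_alternatives."""
    if max_alternatives is None:
        # Omitted: at most one hypothesis is returned.
        return 1
    if not 0 <= max_alternatives <= 30:
        raise ValueError("max_alternatives must be between 0 and 30")
    # 0 and 1 both mean "at most one".
    return max(1, max_alternatives)
```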

S2tCloudProviderConfigMicrosoft

Configuration details specific to the Microsoft Azure speech-to-text provider.

Field Type Label Description
use_fast_transcription_api bool optional

Optional. Enables or disables the Microsoft Azure fast transcription API. It is faster than the SDK but is in preview. More details at: https://learn.microsoft.com/en-us/azure/ai-services/speech-service/fast-transcription-create

use_detailed_output_format bool optional

Optional. Enables or disables the `detailed` result format of the Microsoft Azure s2t service, which adds timestamps and confidences to the resulting transcript.

S2tCloudServiceAmazon

S2tCloudServiceAmazon message contains settings for the Amazon Web Services cloud service inference.

Field Type Label Description
language string

Language of the audio to be transcribed by the Amazon Web Services s2t cloud service. It should be a 4-letter BCP-47 language code, e.g. 'en-US' or 'de-DE'.

streaming_available bool

Specifies whether the streaming mode of Amazon Web Services speech-to-text is available for the selected language; otherwise, batch mode transcription is used. See the list of languages and available transcription modes at: https://docs.aws.amazon.com/transcribe/latest/dg/supported-languages.html

enable_partial_results_stabilization bool

Enables or disables the partial_results_stabilization feature. More details at: https://docs.aws.amazon.com/transcribe/latest/dg/streaming-partial-results.html#streaming-partial-result-stabilization

partial_results_stability string

Sets the stability level of the transcription results. A higher stability level means that the transcription results are less likely to change. Higher stability levels can come with lower overall transcription accuracy. Defaults to "high" if not set explicitly.

language_model_name string

The name of the custom language model you want to use. More details at: https://docs.aws.amazon.com/transcribe/latest/dg/custom-language-models.html

vocabulary_name string

The name of the custom vocabulary you want to use. More details at: https://docs.aws.amazon.com/transcribe/latest/dg/custom-vocabulary.html
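
The language fields of the cloud service messages expect codes shaped like 'en-US' or 'de-DE'. A quick client-side shape check could look like the Python sketch below. It only covers the two-letter language plus two-letter region form shown in the examples, not the full BCP-47 grammar, and the function name is hypothetical.

```python
import re

# Two lowercase letters, a hyphen, two uppercase letters, e.g. "en-US".
LANGUAGE_REGION_PATTERN = re.compile(r"[a-z]{2}-[A-Z]{2}")


def looks_like_language_region_code(code: str) -> bool:
    """Check that a code has the 'll-RR' shape used in the examples."""
    return LANGUAGE_REGION_PATTERN.fullmatch(code) is not None
```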

S2tCloudServiceDeepgram

S2tCloudServiceDeepgram message contains settings for the Deepgram Cloud service inference.

Field Type Label Description
model_name string

Name of one of the speech-to-text models provided by Deepgram for the desired use case. Available model names and details at: https://developers.deepgram.com/docs/model

language string

Language of the audio to be transcribed by the Deepgram s2t cloud service. It should be a 4-letter BCP-47 language code, e.g. 'en-US' or 'de-DE'.

punctuate bool

Enables or disables Deepgram's punctuate feature, which adds punctuation to the resulting transcript. More details at: https://developers.deepgram.com/docs/punctuation

smart_format bool

Enables or disables Deepgram's smart_format feature, which improves the readability of the transcription result. More details at: https://developers.deepgram.com/docs/smart-format

numerals bool

Enables or disables Deepgram's numerals feature, which converts numbers to numeric form in the resulting transcript. More details at: https://developers.deepgram.com/docs/numerals

measurements bool

Enables or disables Deepgram's measurements feature, which converts measurement units (e.g. Kilogram) to their abbreviated form (e.g. Kg) in the resulting transcript. More details at: https://developers.deepgram.com/docs/measurements

dictation bool

Enables or disables Deepgram's dictation feature, which converts spoken dictation commands into their corresponding punctuation marks. More details at: https://developers.deepgram.com/docs/dictation

S2tCloudServiceGoogle

S2tCloudServiceGoogle message contains settings for the Google Cloud service inference.

Field Type Label Description
model_name string

Name of one of the speech-to-text models provided by Google for the desired use case. Available model names and details at: https://cloud.google.com/speech-to-text/docs/transcription-model

language string

Language of the audio to be transcribed by the Google s2t cloud service. It should be a 4-letter BCP-47 language code, e.g. 'en-US' or 'de-DE'.

enable_automatic_punctuation bool

Enables or disables the automatic_punctuation feature of Google s2t, which adds punctuation to the resulting transcript. More details at: https://cloud.google.com/speech-to-text/docs/automatic-punctuation

enable_word_time_offsets bool

Enables or disables the word_time_offsets feature of Google s2t, which adds word-level timestamps (time offsets) to the resulting transcript. More details at: https://cloud.google.com/speech-to-text/docs/async-time-offsets

enable_word_confidence bool

Enables or disables the word_confidence feature of Google s2t, which adds word-level confidence scores to the resulting transcript. More details at: https://cloud.google.com/speech-to-text/docs/word-confidence

transcript_normalization bool

Enables or disables the transcript_normalization feature of Google s2t, which automatically replaces parts of the transcript with phrases of your choosing. More details at: https://cloud.google.com/speech-to-text/v2/docs/reference/rpc/google.cloud.speech.v2#transcriptnormalization

max_alternatives int32

Maximum number of recognition hypotheses to be returned. The server may return fewer than max_alternatives. Valid values are 0-30. A value of 0 or 1 will return a maximum of one. If omitted, will return a maximum of one.

S2tCloudServiceMicrosoft

S2tCloudServiceMicrosoft message contains settings for the Microsoft Azure Cloud service inference.

Field Type Label Description
language string

Language of the audio to be transcribed by the Microsoft Azure s2t cloud service. It should be a 4-letter BCP-47 language code, e.g. 'en-US' or 'de-DE'.

use_fast_transcription_api bool

Enables or disables the Microsoft Azure fast transcription API. It is faster than the SDK but is in preview. More details at: https://learn.microsoft.com/en-us/azure/ai-services/speech-service/fast-transcription-create

use_detailed_output_format bool

Enables or disables the `detailed` result format of the Microsoft Azure s2t service, which adds timestamps and confidences to the resulting transcript.

S2tPipelineId

The pipeline id for a specific pipeline configuration.

Field Type Label Description
id string

ID of the pipeline (model setup) that will transcribe the audio.

Speech2TextConfig

Speech2TextConfig is a configuration message for the speech-to-text pipeline.

Field Type Label Description
id string

Unique identifier for the configuration.

description S2TDescription

Description of the speech-to-text system.

active bool

Indicates if the configuration is active.

inference S2TInference

Configuration for inference models.

streaming_server StreamingServer

Configuration for the streaming server.

voice_activity_detection VoiceActivityDetection

Configuration for voice activity detection.

post_processing PostProcessing

Configuration for post-processing.

logging Logging

Configuration for logging.

StreamingServer

StreamingServer contains information about the streaming server.

Field Type Label Description
host string

Hostname of the streaming server.

port int64

Port number of the streaming server.

output_style string

Output style for the streaming server.

streaming_speech_recognition StreamingSpeechRecognition

Configuration for streaming speech recognition.

StreamingSpeechRecognition

StreamingSpeechRecognition contains information about streaming speech recognition settings.

Field Type Label Description
transcribe_not_final bool

Indicates whether to transcribe non-final results.

decoding_method string

Decoding method for speech recognition.

sampling_rate int64

Sampling rate for audio input.

min_audio_chunk_size int64

Minimum audio chunk size for processing.

next_chunk_timeout float

Timeout between audio chunks; if exceeded, the stream will be stopped.

turn_detection TurnDetectionOptions

Configuration of the turn-detection options for utterances.

SymSpell

SymSpell contains configuration for the SymSpell spelling correction.

Field Type Label Description
dict_path string

Path to the dictionary used by SymSpell.

max_dictionary_edit_distance int64

The maximum edit distance to consider for spelling correction. This affects performance: a larger number takes longer to process.

prefix_length int64

The length (number of characters) of the prefix to consider for filtering.
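
The performance note on max_dictionary_edit_distance follows from how symmetric-delete spelling correction works: candidates are generated by deleting characters from the word prefix, and the number of candidates grows quickly with the allowed distance. The Python sketch below is a simplified illustration of that idea, not the SymSpell library code; the function name is hypothetical.

```python
from typing import Set


def deletes_within(word: str, max_distance: int, prefix_length: int) -> Set[str]:
    """All strings reachable from the word prefix by at most max_distance deletions."""
    prefix = word[:prefix_length]  # only the prefix is used for candidate generation
    seen = {prefix}
    frontier = {prefix}
    for _ in range(max_distance):
        next_frontier = set()
        for current in frontier:
            for i in range(len(current)):
                candidate = current[:i] + current[i + 1:]  # delete one character
                if candidate not in seen:
                    seen.add(candidate)
                    next_frontier.add(candidate)
        frontier = next_frontier
    return seen


# The candidate set grows with the allowed edit distance.
one_edit = deletes_within("abc", max_distance=1, prefix_length=7)
two_edits = deletes_within("abc", max_distance=2, prefix_length=7)
```

A shorter prefix_length caps the work per word, since only the first characters are expanded.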

TrainUserLanguageModelRequest

TrainUserLanguageModelRequest is used to request the training of a user-specific language model.

Field Type Label Description
language_model_name string

Name of the language model to train. Example: "user_lm_1"

order int64

Order n of the ngram. Example: 3 (for trigram model)
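
For readers unfamiliar with the term, the order is the number of consecutive words an n-gram model looks at. The short Python sketch below extracts the trigrams (order 3) of a sentence; the function name and the sentence are hypothetical, and the sketch does not reflect how the server trains its models.

```python
from typing import List, Tuple


def ngrams(tokens: List[str], order: int) -> List[Tuple[str, ...]]:
    """Return all runs of `order` consecutive tokens."""
    return [tuple(tokens[i:i + order]) for i in range(len(tokens) - order + 1)]


# Order 3 yields trigrams.
trigrams = ngrams("i would like a coffee".split(), order=3)
```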

TranscribeFileRequest

A request to transcribe an audio file.

Field Type Label Description
audio_file bytes

WAV file to transcribe.

config TranscribeRequestConfig

The configuration to override the default configuration.

TranscribeFileResponse

The response message for a transcribe file request.

Field Type Label Description
transcriptions Transcription repeated

List of transcriptions with confidence level.

time float

The time the transcription took.

audio_uuid string

ID of the transcribed audio file.

TranscribeRequestConfig

Configuration for a request to transcribe audio.

Field Type Label Description
s2t_pipeline_id string

Required. ID of the pipeline (model setup) that will transcribe the audio.

decoding Decoding

Optional. Decoding type.

language_model_name string

Name of the language model.

post_processing PostProcessingOptions

The post-processing options.

utterance_detection UtteranceDetectionOptions

The utterance detection options.

pyannote Pyannote

Voice activity detection with Pyannote.

return_options TranscriptionReturnOptions

The transcribe return options.

language string optional

Optional. Specifies the language of the transcription to return.

task string optional

Optional. Specifies the task of the s2t model, e.g. 'transcribe' or 'translate'.

s2t_service_config google.protobuf.Struct optional

Optional. s2t_service_config provides the configuration of the service, such as API keys, bearer tokens, JWTs, and other header information, as key-value pairs, e.g.,

MY_API_KEY='LKJDIFe244LKJOI'

A. For the Amazon S2T service, pass the following arguments as a dict:
A.1 aws_access_key_id (required): Access key ID for Amazon Web Services.
A.2 aws_secret_access_key (required): Secret access key for Amazon Web Services.
A.3 region (required): Region name of the Amazon server.
Example: s2t_service_config={'aws_access_key_id': 'YOUR_AWS_ACCESS_KEY_ID', 'aws_secret_access_key': 'YOUR_AWS_SECRET_ACCESS_KEY', 'region': 'YOUR_AMAZON_SERVER_REGION_NAME'}

B. For the Deepgram S2T service, pass the following argument as a dict:
B.1 api_key (required): API key of the Deepgram account used to access the Deepgram S2T service.
Example: s2t_service_config={'api_key': 'YOUR_DEEPGRAM_API_KEY'}

C. For the Google Cloud S2T service, pass the following arguments as a dict:
C.1 api_key (required): API key of Google Cloud used to access its S2T service.
C.2 api_endpoint (optional): Regional API endpoint of the Google Cloud S2T service. Defaults to 'eu-speech.googleapis.com'.
Example: s2t_service_config={'api_key': 'YOUR_GOOGLE_CLOUD_API_KEY', 'api_endpoint': 'YOUR_GOOGLE_CLOUD_API_ENDPOINT'}

D. For the Microsoft Azure S2T service, pass the following arguments as a dict:
D.1 subscription_key (required): Subscription key to access the Microsoft Azure service.
D.2 region (required): Region name of the Microsoft Azure server.
Example: s2t_service_config={'subscription_key': 'YOUR_MICROSOFT_AZURE_SUBSCRIPTION_KEY', 'region': 'YOUR_MICROSOFT_AZURE_SERVER_REGION_NAME'}

Note: ondewo-s2t will raise an error if any of the required arguments above is missing.

s2t_cloud_provider_config S2tCloudProviderConfig optional

Optional. Defines the cloud provider's specific configuration for using speech-to-text cloud services. The default value is None.
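
Because the server raises an error when a required s2t_service_config argument is missing, it can be convenient to check the dict before sending a request. The Python sketch below encodes the required keys listed in the s2t_service_config description; the provider labels ("amazon", "deepgram", "google", "microsoft") and the function name are hypothetical and chosen only for this illustration.

```python
from typing import Dict, List, Tuple

# Required keys per provider, taken from the s2t_service_config description.
REQUIRED_KEYS: Dict[str, Tuple[str, ...]] = {
    "amazon": ("aws_access_key_id", "aws_secret_access_key", "region"),
    "deepgram": ("api_key",),
    "google": ("api_key",),  # api_endpoint is optional
    "microsoft": ("subscription_key", "region"),
}


def missing_required_keys(provider: str, service_config: Dict[str, str]) -> List[str]:
    """Return the required keys that are absent or empty for the given provider."""
    return [key for key in REQUIRED_KEYS[provider] if not service_config.get(key)]


# Placeholder credentials, as in the examples of the field description.
azure_config = {"subscription_key": "YOUR_MICROSOFT_AZURE_SUBSCRIPTION_KEY"}
missing = missing_required_keys("microsoft", azure_config)
```

An empty result means all required arguments are present; anything else names the arguments still to be supplied.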

TranscribeStreamRequest

Request to transcribe an audio stream.

Field Type Label Description
audio_chunk bytes

Chunk of WAV audio to transcribe.

end_of_stream bool

Whether this is the final chunk of the stream.

config TranscribeRequestConfig

The configuration to override the default configuration

mute_audio bool

Whether or not to mute the audio signal. Defaults to false.
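
The fields above suggest how a client might split audio into a sequence of stream requests. The Python sketch below uses plain dicts as stand-ins for TranscribeStreamRequest messages, so it runs without any generated gRPC code. The function name, the chunk size, and the choice to repeat the config in every request are assumptions, not documented behavior.

```python
from typing import Dict, Iterator


def iter_stream_requests(audio: bytes, chunk_size: int, pipeline_id: str) -> Iterator[Dict]:
    """Yield one request-like dict per audio chunk, flagging the last one."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    offsets = range(0, len(audio), chunk_size)
    last_index = len(offsets) - 1
    for index, offset in enumerate(offsets):
        yield {
            "audio_chunk": audio[offset:offset + chunk_size],
            "end_of_stream": index == last_index,  # true only for the final chunk
            "config": {"s2t_pipeline_id": pipeline_id},
            "mute_audio": False,  # documented default
        }


# Hypothetical ten-byte payload split into chunks of four bytes.
requests = list(iter_stream_requests(b"abcdefghij", chunk_size=4, pipeline_id="pipeline_1"))
```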

TranscribeStreamResponse

The response message of a stream transcription.

Field Type Label Description
transcriptions Transcription repeated

List of transcriptions with confidence level.

time float

The time the transcription took.

final bool

Whether or not this transcription is final (transcribed texts might change if transcription is started before the end of an utterance).

return_audio bool

Whether the audio bytes of the utterance are included in the response.

audio bytes

Audio bytes of the transcribed utterance.

utterance_start bool

Whether this response marks the start of an utterance.

audio_uuid string

ID of the transcribed audio file.

config TranscribeRequestConfig

The configuration for the transcription.
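
Since non-final transcriptions may still change, a client that only needs the settled text can skip them. The Python sketch below shows this with plain dicts standing in for TranscribeStreamResponse messages. The function name and the sample data are hypothetical, and taking the first entry of transcriptions as the preferred one is an assumption.

```python
from typing import Dict, Iterable


def collect_final_text(responses: Iterable[Dict]) -> str:
    """Join the first transcription of every final response."""
    parts = []
    for response in responses:
        if not response["final"]:
            continue  # intermediate result, may still change
        if response["transcriptions"]:
            parts.append(response["transcriptions"][0]["transcription"])
    return " ".join(parts)


# Hypothetical stream: two intermediate results and two final ones.
sample_responses = [
    {"final": False, "transcriptions": [{"transcription": "hello wor"}]},
    {"final": True, "transcriptions": [{"transcription": "hello world"}]},
    {"final": False, "transcriptions": [{"transcription": "how are"}]},
    {"final": True, "transcriptions": [{"transcription": "how are you"}]},
]
final_text = collect_final_text(sample_responses)
```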

Transcription

The transcription message.

Field Type Label Description
transcription string

The transcribed text.

confidence_score float

The corresponding confidence score, an estimate between 0.0 and 1.0. A higher number indicates an estimated greater likelihood that the recognized words are correct.

words WordDetail repeated

List of the words of the transcription with their confidence scores and probable alternatives.

alternatives TranscriptionAlternative repeated

List of alternative transcriptions with their confidence scores, word timings, and alternative words.

TranscriptionAlternative

Field Type Label Description
transcript string

The alternative transcribed text

confidence float

The corresponding confidence score to the alternative transcript.

words WordDetail repeated

A list of word-specific information for each recognized word, including word timings, confidence score of the word and alternative words.

TranscriptionReturnOptions

Configuration of the return values of a transcribe request.

Field Type Label Description
return_start_of_speech bool

Whether the server should send a response indicating that the beginning of speech was detected.

return_audio bool

Whether the s2t server should return the audio bytes of the transcribed utterance.

return_confidence_score bool

Whether or not to return confidence scores.

return_alternative_transcriptions bool

Whether or not to return alternative results from beam-search.

return_alternative_transcriptions_nr int32

Optional. Number of alternative transcription results to return from beam-search or greedy-search.

return_alternative_words bool

Whether or not to return alternative words from beam-search.

return_alternative_words_nr int32

Optional. Number of alternative words to return.

return_word_timing bool

Optional. Whether or not to return timestamps of start and end of the words. Only used in TranscribeFile.

TurnDetectionOptions

Configuration of the turn-detection options for utterances.

Field Type Label Description
active bool optional

Optional. Indicates if the turn-detection feature is active.

full_utterance_deployment bool optional

Optional. Whether to transcribe the whole utterance when a turn moment is detected. This helps increase transcription accuracy at the cost of speed. If deactivated, only the audio since the last short silence period is transcribed, and the transcriptions of the small audio chunks between short silences are concatenated.

llm_grpc_host string optional

Optional. Host name or IP address of the server that serves the LLM used for turn detection.

llm_grpc_port int32 optional

Optional. Port number of the server that serves the LLM used for turn detection.

llm_request_timeout float optional

Optional. Timeout in seconds for the turn-detection request to the LLM. If the timeout occurs, the turn-detection result is treated as False.
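
The timeout rule for llm_request_timeout (a timed-out request counts as "no turn detected") can be sketched in Python with the standard library. This is not the server's implementation: the function names are hypothetical, and the two callables merely stand in for a request to the LLM.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeoutError
from typing import Callable


def turn_detected(ask_llm: Callable[[str], bool], text: str, timeout: float) -> bool:
    """Return the LLM verdict, or False if no answer arrives within `timeout` seconds."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(ask_llm, text)
    try:
        return bool(future.result(timeout=timeout))
    except FutureTimeoutError:
        # A timed-out request is treated as "no turn detected".
        return False
    finally:
        # Do not block on a slow worker; it finishes in the background.
        pool.shutdown(wait=False)


def slow_llm(text: str) -> bool:
    """Stand-in for an LLM call that answers too late."""
    time.sleep(0.3)
    return True


def fast_llm(text: str) -> bool:
    """Stand-in for an LLM call that answers immediately."""
    return text.endswith("?")
```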

UtteranceDetectionOptions

Configuration of the utterance-detection options

Field Type Label Description
transcribe_not_final bool

Whether to also return intermediate (not final) transcription results.

next_chunk_timeout float

If the time between audio chunks exceeds next_chunk_timeout, the stream is stopped.

turn_detection TurnDetectionOptions

The turn detection options
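
The next_chunk_timeout rule can be illustrated with a small simulation. This is a sketch only: `chunks_until_timeout` is a hypothetical helper, and arrival times are passed in explicitly as numbers (the same unit as the timeout) so the example runs without real audio or a real stream.

```python
from typing import Iterable, Iterator, Optional, Tuple


def chunks_until_timeout(
    timed_chunks: Iterable[Tuple[float, bytes]], next_chunk_timeout: float
) -> Iterator[bytes]:
    """Yield chunks until the gap between two consecutive arrivals exceeds the timeout."""
    previous_arrival: Optional[float] = None
    for arrival_time, chunk in timed_chunks:
        gap_too_long = (
            previous_arrival is not None
            and arrival_time - previous_arrival > next_chunk_timeout
        )
        if gap_too_long:
            break  # the stream is considered stopped; later chunks are ignored
        yield chunk
        previous_arrival = arrival_time


# Usage: the third chunk arrives 1.9 time units after the second, so it is dropped.
stream = [(0.0, b"a"), (0.1, b"b"), (2.0, b"c"), (2.1, b"d")]
accepted = list(chunks_until_timeout(stream, next_chunk_timeout=1.0))
print(accepted)
```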

VoiceActivityDetection

VoiceActivityDetection contains information about voice activity detection settings.

Field Type Label Description
active string

Indicates if voice activity detection is active.

sampling_rate int64

Sampling rate for voice activity detection.

pyannote Pyannote

Configuration for the Pyannote model.

Wav2Vec

Wav2Vec contains information about the Wav2Vec model.

Field Type Label Description
model_path string

Path to the model.

use_gpu bool

Indicates if GPU is used.

Wav2VecTriton

Wav2VecTriton contains information about the Wav2Vec model using Triton.

Field Type Label Description
processor_path string

Path to the processor.

triton_model_name string

Name of the Triton model.

triton_model_version string

Version of the Triton model.

check_status_timeout int64

Timeout for checking model status.

triton_server_host string

Host name of the Triton inference server that serves the Wav2VecTriton model.

triton_server_port int64

Port number of the Triton inference server that serves the Wav2VecTriton model.

Whisper

Whisper contains information about the Whisper model.

Field Type Label Description
model_path string

Path to the model.

use_gpu bool

Indicates if GPU is used.

language string

Default language of the model.

task string

Default task of the model.

WhisperTriton

WhisperTriton contains information about the Whisper model using Triton.

Field Type Label Description
processor_path string

Path to the processor.

triton_model_name string

Name of the Triton model.

triton_model_version string

Version of the Triton model.

check_status_timeout int64

Timeout for checking model status.

language string

Default language of the model.

task string

Default task of the model. E.g., transcribe, translate, etc.

triton_server_host string

Host name of the Triton inference server that serves the WhisperTriton model.

triton_server_port int64

Port number of the Triton inference server that serves the WhisperTriton model.

WordAlternative

Field Type Label Description
word string

The recognized word corresponding to this set of information.

confidence float

The confidence score of the alternative word.

WordDetail

WordDetail provides word-specific information for recognized words.

Field Type Label Description
start_time float

The start time of the spoken word relative to the beginning of the audio. The accuracy of the time offset can vary, and this is an experimental feature.

end_time float

The end time of the spoken word relative to the beginning of the audio. The accuracy of the time offset can vary, and this is an experimental feature.

word string

The recognized word corresponding to this set of information.

confidence float

The confidence score of the word.

word_alternatives WordAlternative repeated

List of alternative words and confidence scores of each.
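
One way a client might use WordDetail is to flag low-confidence words and list their alternatives. The sketch below mirrors the documented fields in plain dataclasses; `WordDetailSketch`, `WordAlternativeSketch`, and `low_confidence_words` are hypothetical names, not part of the generated client, and the threshold is an arbitrary example value.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class WordAlternativeSketch:
    word: str
    confidence: float


@dataclass
class WordDetailSketch:
    start_time: float
    end_time: float
    word: str
    confidence: float
    word_alternatives: List[WordAlternativeSketch] = field(default_factory=list)


def low_confidence_words(
    details: List[WordDetailSketch], threshold: float
) -> List[Tuple[str, float, List[str]]]:
    """Return (word, duration, alternatives by descending confidence) below the threshold."""
    flagged = []
    for detail in details:
        if detail.confidence < threshold:
            ranked = sorted(
                detail.word_alternatives, key=lambda alt: alt.confidence, reverse=True
            )
            duration = detail.end_time - detail.start_time
            flagged.append((detail.word, duration, [alt.word for alt in ranked]))
    return flagged


# Usage with made-up values: only the second word falls below the threshold.
details = [
    WordDetailSketch(0.0, 0.5, "hello", 0.9),
    WordDetailSketch(
        0.5, 1.0, "word", 0.4,
        [WordAlternativeSketch("ward", 0.1), WordAlternativeSketch("world", 0.35)],
    ),
]
flagged = low_confidence_words(details, threshold=0.5)
print(flagged)
```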

Enums

Decoding

The decoding configuration

Name Number Description
DEFAULT 0

Decoding is defined by the pipeline config.

GREEDY 1

Greedy decoding is used, independently of the pipeline config.

BEAM_SEARCH_WITH_LM 2

Beam search with a language model (LM) is used, independently of the pipeline config.

BEAM_SEARCH 3

Beam search without an LM head; used to configure the decoding mode for seq2seq models.
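
The relationship between a requested Decoding value and the pipeline config can be sketched as below. The enum numbers are taken from the table above; the `effective_decoding` helper is hypothetical and only illustrates the documented rule that DEFAULT defers to the pipeline config while the other values override it.

```python
from enum import IntEnum


class Decoding(IntEnum):
    """Mirror of the documented enum values (not the generated protobuf enum)."""
    DEFAULT = 0
    GREEDY = 1
    BEAM_SEARCH_WITH_LM = 2
    BEAM_SEARCH = 3


def effective_decoding(requested: Decoding, pipeline_decoding: Decoding) -> Decoding:
    """DEFAULT falls back to the pipeline config; any other value overrides it."""
    if requested == Decoding.DEFAULT:
        return pipeline_decoding
    return requested


# Usage: the pipeline is configured for beam search with an LM.
print(effective_decoding(Decoding.DEFAULT, Decoding.BEAM_SEARCH_WITH_LM).name)
print(effective_decoding(Decoding.GREEDY, Decoding.BEAM_SEARCH_WITH_LM).name)
```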

InferenceBackend

Name Number Description
INFERENCE_BACKEND_UNKNOWN 0

Not set

INFERENCE_BACKEND_PYTORCH 1

Run pytorch model

INFERENCE_BACKEND_FLAX 2

Run flax model

INFERENCE_BACKEND_CLOUD_SERVICE_AMAZON 3

Run Amazon S2T cloud service

INFERENCE_BACKEND_CLOUD_SERVICE_DEEPGRAM 4

Run Deepgram S2T cloud service

INFERENCE_BACKEND_CLOUD_SERVICE_GOOGLE 5

Run Google S2T cloud service

INFERENCE_BACKEND_CLOUD_SERVICE_MICROSOFT 6

Run Microsoft Azure S2T cloud service

Scalar Value Types

| .proto Type | Notes | C++ | Java | Python | Go | C# | PHP | Ruby |
|---|---|---|---|---|---|---|---|---|
| double | | double | double | float | float64 | double | float | Float |
| float | | float | float | float | float32 | float | float | Float |
| int32 | Uses variable-length encoding. Inefficient for encoding negative numbers – if your field is likely to have negative values, use sint32 instead. | int32 | int | int | int32 | int | integer | Bignum or Fixnum (as required) |
| int64 | Uses variable-length encoding. Inefficient for encoding negative numbers – if your field is likely to have negative values, use sint64 instead. | int64 | long | int/long | int64 | long | integer/string | Bignum |
| uint32 | Uses variable-length encoding. | uint32 | int | int/long | uint32 | uint | integer | Bignum or Fixnum (as required) |
| uint64 | Uses variable-length encoding. | uint64 | long | int/long | uint64 | ulong | integer/string | Bignum or Fixnum (as required) |
| sint32 | Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int32s. | int32 | int | int | int32 | int | integer | Bignum or Fixnum (as required) |
| sint64 | Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int64s. | int64 | long | int/long | int64 | long | integer/string | Bignum |
| fixed32 | Always four bytes. More efficient than uint32 if values are often greater than 2^28. | uint32 | int | int | uint32 | uint | integer | Bignum or Fixnum (as required) |
| fixed64 | Always eight bytes. More efficient than uint64 if values are often greater than 2^56. | uint64 | long | int/long | uint64 | ulong | integer/string | Bignum |
| sfixed32 | Always four bytes. | int32 | int | int | int32 | int | integer | Bignum or Fixnum (as required) |
| sfixed64 | Always eight bytes. | int64 | long | int/long | int64 | long | integer/string | Bignum |
| bool | | bool | boolean | boolean | bool | bool | boolean | TrueClass/FalseClass |
| string | A string must always contain UTF-8 encoded or 7-bit ASCII text. | string | String | str/unicode | string | string | string | String (UTF-8) |
| bytes | May contain any arbitrary sequence of bytes. | string | ByteString | str | []byte | ByteString | string | String (ASCII-8BIT) |
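
The notes on int32, sint32, and fixed32 follow from how variable-length integers (varints) are encoded: 7 payload bits per byte, with negative int32 values sign-extended to 64 bits, and sint32 values ZigZag-mapped first. The sketch below computes encoded sizes only; the helper names are illustrative and it does not produce actual wire bytes.

```python
def varint_size(value: int) -> int:
    """Number of bytes needed to varint-encode a non-negative integer."""
    size = 1
    while value >= 0x80:  # more than 7 bits left, so another byte is needed
        value >>= 7
        size += 1
    return size


def int32_size(value: int) -> int:
    """Encoded size of an int32: negatives are sign-extended to 64 bits."""
    return varint_size(value & 0xFFFFFFFFFFFFFFFF)


def sint32_size(value: int) -> int:
    """Encoded size of a sint32: ZigZag maps 0, -1, 1, -2, ... to 0, 1, 2, 3, ..."""
    zigzag = ((value << 1) ^ (value >> 31)) & 0xFFFFFFFF
    return varint_size(zigzag)


FIXED32_SIZE = 4  # fixed32 and sfixed32 always take four bytes

print(int32_size(-1), sint32_size(-1))  # negative value as int32 vs sint32
print(varint_size(2**28 - 1), varint_size(2**28), FIXED32_SIZE)
```

A value of -1 needs 10 bytes as int32 but 1 byte as sint32, and a uint32 of 2^28 or more needs 5 varint bytes, which is why fixed32 (always 4 bytes) is the better fit for values that are often that large.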