Speech-to-text service
Method | Request | Response | Description
--- | --- | --- | ---
TranscribeFile | TranscribeFileRequest | TranscribeFileResponse | Transcribes an audio file.
TranscribeStream | stream TranscribeStreamRequest | stream TranscribeStreamResponse | Transcribes an audio stream.
GetS2tPipeline | S2tPipelineId | Speech2TextConfig | Gets the speech-to-text pipeline corresponding to the ID specified in S2tPipelineId. If no matching ID is found, the server raises ModuleNotFoundError.
CreateS2tPipeline | Speech2TextConfig | S2tPipelineId | Creates a new speech-to-text pipeline from a Speech2TextConfig and registers it in the server.
DeleteS2tPipeline | S2tPipelineId | .google.protobuf.Empty | Deletes the pipeline corresponding to the ID specified in S2tPipelineId. If no matching ID is found, the server raises ModuleNotFoundError.
UpdateS2tPipeline | Speech2TextConfig | .google.protobuf.Empty | Updates the pipeline with the ID specified in Speech2TextConfig with the new config. If no matching ID is found, the server raises ModuleNotFoundError.
ListS2tPipelines | ListS2tPipelinesRequest | ListS2tPipelinesResponse | Lists all speech-to-text pipelines.
ListS2tLanguages | ListS2tLanguagesRequest | ListS2tLanguagesResponse | Returns a list of all languages for which pipelines exist.
ListS2tDomains | ListS2tDomainsRequest | ListS2tDomainsResponse | Returns a list of all domains for which pipelines exist.
GetServiceInfo | .google.protobuf.Empty | S2TGetServiceInfoResponse | Returns the version of the running speech-to-text server.
ListS2tLanguageModels | ListS2tLanguageModelsRequest | ListS2tLanguageModelsResponse | Given a list of pipeline IDs, returns a list of LanguageModelPipelineId messages, each containing a pipeline ID and the language models loaded in that pipeline.
CreateUserLanguageModel | CreateUserLanguageModelRequest | .google.protobuf.Empty | Creates a user language model.
DeleteUserLanguageModel | DeleteUserLanguageModelRequest | .google.protobuf.Empty | Deletes a user language model.
AddDataToUserLanguageModel | AddDataToUserLanguageModelRequest | .google.protobuf.Empty | Adds data to a user language model.
TrainUserLanguageModel | TrainUserLanguageModelRequest | .google.protobuf.Empty | Trains a user language model.
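The following minimal sketch shows a TranscribeFile call over gRPC. The module names (speech_to_text_pb2, speech_to_text_pb2_grpc), the service stub class (Speech2TextStub), and the server address are assumptions for illustration; substitute the names from your generated client code.

```python
import grpc

import speech_to_text_pb2 as s2t            # assumed generated module name
import speech_to_text_pb2_grpc as s2t_grpc  # assumed generated module name

channel = grpc.insecure_channel("localhost:50051")  # assumed server address
stub = s2t_grpc.Speech2TextStub(channel)            # assumed stub class name

# Read the WAV file to transcribe as raw bytes.
with open("utterance.wav", "rb") as f:
    audio_bytes = f.read()

response = stub.TranscribeFile(
    s2t.TranscribeFileRequest(
        audio_file=audio_bytes,
        config=s2t.TranscribeRequestConfig(s2t_pipeline_id="pipeline_1"),
    )
)
for transcription in response.transcriptions:
    print(transcription.transcription, transcription.confidence_score)
```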
AcousticModels contains information about the different types of acoustic models.

Field | Type | Label | Description
--- | --- | --- | ---
type | string | | Type of the acoustic model.
wav2vec | Wav2Vec | | Configuration for the Wav2Vec model.
wav2vec_triton | Wav2VecTriton | | Configuration for the Wav2Vec model using Triton.
whisper | Whisper | | Configuration for the Whisper model.
whisper_triton | WhisperTriton | | Configuration for the Whisper model using Triton.
s2t_cloud_service_amazon | S2tCloudServiceAmazon | | Amazon Web Services cloud inference settings.
s2t_cloud_service_deepgram | S2tCloudServiceDeepgram | | Deepgram cloud inference settings.
s2t_cloud_service_google | S2tCloudServiceGoogle | | Google Cloud inference settings.
s2t_cloud_service_microsoft | S2tCloudServiceMicrosoft | | Microsoft Azure cloud inference settings.
AddDataToUserLanguageModelRequest is used to request the addition of data to a user-specific language model.

Field | Type | Label | Description
--- | --- | --- | ---
language_model_name | string | | Name of the language model to which to add data. Example: "user_lm_1"
zipped_data | bytes | | Zip file containing data in the form of text files. Example: a zip file with text files containing sentences or phrases in the target language.
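A hedged sketch of producing the zipped_data bytes in Python; the file name and sentences are illustrative only, and only the standard library is used:

```python
import io
import zipfile

# Illustrative in-domain sentences for the user language model.
sentences = [
    "the patient was discharged today",
    "schedule a follow-up visit next week",
]

# Pack the text into an in-memory zip archive, as expected by zipped_data.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("domain_sentences.txt", "\n".join(sentences))

zipped_data = buffer.getvalue()  # bytes to pass as the zipped_data field
```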
CkptFile contains information about checkpoint files.

Field | Type | Label | Description
--- | --- | --- | ---
path | string | | Path to the checkpoint file.
CreateUserLanguageModelRequest is used to request the creation of a new user-specific language model.

Field | Type | Label | Description
--- | --- | --- | ---
language_model_name | string | | Name of the language model to create. Example: "user_lm_1"
DeleteUserLanguageModelRequest is used to request the deletion of a user-specific language model.

Field | Type | Label | Description
--- | --- | --- | ---
language_model_name | string | | Name of the language model to delete. Example: "user_lm_1"
LanguageModelPipelineId contains information about a pipeline and its available language models.

Field | Type | Label | Description
--- | --- | --- | ---
pipeline_id | string | | A pipeline ID. Example: "pipeline_1"
model_names | string | repeated | A list of all available language models for the corresponding pipeline ID. Example: ["model_1", "model_2"]
LanguageModels contains information about language models.

Field | Type | Label | Description
--- | --- | --- | ---
path | string | | Path to the directory of language models.
beam_size | int64 | | Beam size for the search algorithm.
default_lm | string | | Default language model to be selected if none is given.
beam_search_scorer_alpha | float | | Weight for the language model scorer (alpha).
beam_search_scorer_beta | float | | Weight for the word insertion penalty (beta).
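How alpha and beta enter the hypothesis score is not specified by this proto, but as a point of orientation (an assumption about the decoder convention), shallow-fusion beam search decoders such as pyctcdecode typically combine them as

$$\mathrm{score}(W) = \log P_{\mathrm{AM}}(W \mid X) + \alpha \,\log P_{\mathrm{LM}}(W) + \beta \,|W|$$

where $X$ is the audio, $W$ a word-sequence hypothesis, and $|W|$ its word count. Under this convention, beam_search_scorer_alpha scales the language model evidence and beam_search_scorer_beta acts as a per-word insertion bonus.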
ListS2tDomainsRequest is used to request a list of available domains. Optionally, filters can be set.

Field | Type | Label | Description
--- | --- | --- | ---
languages | string | repeated | Filter for languages.
pipeline_owners | string | repeated | Filter for pipeline owners.
ListS2tDomainsResponse is used to return a list of available domains.

Field | Type | Label | Description
--- | --- | --- | ---
domains | string | repeated | Available domains. Example: ["medical", "finance"]
ListS2tLanguageModelsRequest is used to request a list of available language models for specified pipelines.

Field | Type | Label | Description
--- | --- | --- | ---
ids | string | repeated | List of pipeline IDs for which to retrieve the available language models. Example: ["pipeline_1", "pipeline_2"]
ListS2tLanguageModelsResponse is used to return the available language models for specified pipelines.

Field | Type | Label | Description
--- | --- | --- | ---
lm_pipeline_ids | LanguageModelPipelineId | repeated | A list of LanguageModelPipelineId messages, where each element contains a pipeline ID and its associated language models. Example: [{pipeline_id: "pipeline_1", model_names: ["model_1", "model_2"]}, {pipeline_id: "pipeline_2", model_names: ["model_3"]}]
ListS2tLanguagesRequest is used to request a list of available languages. Optionally, filters can be set.

Field | Type | Label | Description
--- | --- | --- | ---
domains | string | repeated | Filter for domains. Example: ["medical", "finance"]
pipeline_owners | string | repeated | Filter for pipeline owners. Example: ["ondewo", "partner_company"]
ListS2tLanguagesResponse is used to return a list of available languages.

Field | Type | Label | Description
--- | --- | --- | ---
languages | string | repeated | Available languages.
ListS2tPipelinesRequest is used to request a list of all speech-to-text pipelines. Optionally, filter criteria can be set.

Field | Type | Label | Description
--- | --- | --- | ---
languages | string | repeated | Filter for languages.
pipeline_owners | string | repeated | Filter for pipeline owners.
domains | string | repeated | Filter for domains.
registered_only | bool | | If true, return only registered pipelines. If false (default), return both registered and persisted (from config files) configs.
ListS2tPipelinesResponse is used to return a list of all speech-to-text pipelines.

Field | Type | Label | Description
--- | --- | --- | ---
pipeline_configs | Speech2TextConfig | repeated | A list of Speech2TextConfig message instances containing the configuration of each pipeline. Example: [{id: "pipeline_1", description: {language: "en"}, active: true, ...}, {id: "pipeline_2", description: {language: "fr"}, active: true, ...}]
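A hedged sketch of pipeline discovery, reusing the same placeholder module and stub names as in the TranscribeFile example:

```python
import grpc

import speech_to_text_pb2 as s2t            # assumed generated module name
import speech_to_text_pb2_grpc as s2t_grpc  # assumed generated module name

stub = s2t_grpc.Speech2TextStub(grpc.insecure_channel("localhost:50051"))

# List only English pipelines that are registered in the server.
response = stub.ListS2tPipelines(
    s2t.ListS2tPipelinesRequest(languages=["en"], registered_only=True)
)
for config in response.pipeline_configs:
    print(config.id, config.description.language, config.active)

# Fetch the full config of one pipeline by its ID.
pipeline = stub.GetS2tPipeline(s2t.S2tPipelineId(id="pipeline_1"))
print(pipeline.description.domain)
```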
Logging contains configuration for logging.

Field | Type | Label | Description
--- | --- | --- | ---
type | string | | Type of logging.
path | string | | Path for logging.
PostProcessing contains the configuration for post-processing.

Field | Type | Label | Description
--- | --- | --- | ---
pipeline | string | repeated | List of names of active post-processors.
post_processors | PostProcessors | | Post-processor configurations.
PostProcessingOptions contains the configuration of the post-processing options.

Field | Type | Label | Description
--- | --- | --- | ---
spelling_correction | bool | | Whether to use spelling correction.
normalize | bool | | Whether to disable normalization.
config | PostProcessing | | Post-processing configuration specifying the active post-processors in the pipeline, as well as their individual configuration. If not set, all values are replaced by the ones in the current pipeline.
PostProcessors contains configurations for post-processors.

Field | Type | Label | Description
--- | --- | --- | ---
sym_spell | SymSpell | | Configuration of the SymSpell spelling correction.
normalization | S2TNormalization | | Configuration of the normalization object.
PtFiles contains information about PT files.

Field | Type | Label | Description
--- | --- | --- | ---
path | string | | Path to the PT files.
step | string | | Step for the PT files.
Pyannote contains the configuration for the Pyannote voice activity detection model.
Library: [pyannote-audio](https://github.com/pyannote/pyannote-audio/blob/develop/tutorials/voice_activity_detection.ipynb)

Field | Type | Label | Description
--- | --- | --- | ---
model_name | string | | Full name of the Pyannote model.
min_audio_size | int64 | | Minimum audio size for processing.
min_duration_off | float | | Fill inactive regions shorter than this many seconds. Example [notebook](https://github.com/pyannote/pyannote-audio/blob/develop/tutorials/voice_activity_detection.ipynb)
min_duration_on | float | | Remove active regions shorter than this many seconds. Example [notebook](https://github.com/pyannote/pyannote-audio/blob/develop/tutorials/voice_activity_detection.ipynb)
triton_server_host | string | | Host name of the Triton inference server that serves the Pyannote model.
triton_server_port | int64 | | Port number of the Triton inference server that serves the Pyannote model.
S2TDescription contains descriptive information about the speech-to-text pipeline.

Field | Type | Label | Description
--- | --- | --- | ---
language | string | | Language of the speech-to-text system.
pipeline_owner | string | | Owner of the pipeline.
domain | string | | Domain of the speech-to-text system.
comments | string | | Comments about the system.
S2TGetServiceInfoResponse is used to return version information about the speech-to-text service.

Field | Type | Label | Description
--- | --- | --- | ---
version | string | | Version number based on semantic versioning, e.g. "4.2.0".
S2TInference contains information about the inference models used in the speech-to-text pipeline.

Field | Type | Label | Description
--- | --- | --- | ---
acoustic_models | AcousticModels | | Configuration for the acoustic models.
language_models | LanguageModels | | Configuration for the language models.
inference_backend | InferenceBackend | | Configuration for the inference backend.
S2TNormalization contains the configuration for speech-to-text normalization.

Field | Type | Label | Description
--- | --- | --- | ---
language | string | | Language for normalization of transcriptions.
pipeline | string | repeated | List of names of active normalizations.
S2tCloudProviderConfig contains the cloud provider settings for speech-to-text (S2T).

Field | Type | Label | Description
--- | --- | --- | ---
s2t_cloud_provider_config_amazon | S2tCloudProviderConfigAmazon | optional | Optional. Configuration for the Amazon Web Services speech-to-text provider.
s2t_cloud_provider_config_deepgram | S2tCloudProviderConfigDeepgram | optional | Optional. Configuration for the Deepgram speech-to-text provider.
s2t_cloud_provider_config_google | S2tCloudProviderConfigGoogle | optional | Optional. Configuration for the Google speech-to-text provider.
s2t_cloud_provider_config_microsoft | S2tCloudProviderConfigMicrosoft | optional | Optional. Configuration for the Microsoft Azure speech-to-text provider.
S2tCloudProviderConfigAmazon contains configuration details specific to the Amazon Web Services speech-to-text provider.

Field | Type | Label | Description
--- | --- | --- | ---
enable_partial_results_stabilization | bool | optional | Optional. Enables or disables the partial results stabilization feature. More details at: https://docs.aws.amazon.com/transcribe/latest/dg/streaming-partial-results.html#streaming-partial-result-stabilization
partial_results_stability | string | optional | Optional. Sets the stability level of the transcription results. A higher stability level means that the transcription results are less likely to change, but can come with lower overall transcription accuracy. Acceptable values: ["low", "medium", "high"]. Defaults to "high" if not set explicitly. More details at: https://aws.amazon.com/blogs/machine-learning/amazon-transcribe-now-supports-partial-results-stabilization-for-streaming-audio/
language_model_name | string | optional | Optional. The name of the custom language model you want to use. More details at: https://docs.aws.amazon.com/transcribe/latest/dg/custom-language-models.html
vocabulary_name | string | optional | Optional. The name of the custom vocabulary you want to use. More details at: https://docs.aws.amazon.com/transcribe/latest/dg/custom-vocabulary.html
S2tCloudProviderConfigDeepgram contains configuration details specific to the Deepgram speech-to-text provider.

Field | Type | Label | Description
--- | --- | --- | ---
punctuate | bool | optional | Optional. Enables or disables Deepgram's punctuate feature, which adds punctuation to the resulting transcript. More details at: https://developers.deepgram.com/docs/punctuation
smart_format | bool | optional | Optional. Enables or disables Deepgram's smart_format feature, which improves the readability of the transcription result. More details at: https://developers.deepgram.com/docs/smart-format
numerals | bool | optional | Optional. Enables or disables Deepgram's numerals feature, which converts numbers to numeric form in the resulting transcript. More details at: https://developers.deepgram.com/docs/numerals
measurements | bool | optional | Optional. Enables or disables Deepgram's measurements feature, which converts measurement units (e.g. kilogram) to abbreviated form (e.g. kg) in the resulting transcript. More details at: https://developers.deepgram.com/docs/measurements
dictation | bool | optional | Optional. Enables or disables Deepgram's dictation feature, which converts spoken dictation commands into their corresponding punctuation marks. More details at: https://developers.deepgram.com/docs/dictation
S2tCloudProviderConfigGoogle contains configuration details specific to the Google speech-to-text provider.

Field | Type | Label | Description
--- | --- | --- | ---
enable_automatic_punctuation | bool | optional | Optional. Enables or disables the automatic_punctuation feature of Google S2T, which adds punctuation to the resulting transcript. More details at: https://cloud.google.com/speech-to-text/docs/automatic-punctuation
enable_word_time_offsets | bool | optional | Optional. Enables or disables the word_time_offsets feature of Google S2T, which adds word-level timestamps (time offsets) to the resulting transcript. More details at: https://cloud.google.com/speech-to-text/docs/async-time-offsets
enable_word_confidence | bool | optional | Optional. Enables or disables the word_confidence feature of Google S2T, which adds word-level confidence scores to the resulting transcript. More details at: https://cloud.google.com/speech-to-text/docs/word-confidence
transcript_normalization | bool | optional | Optional. Enables or disables the transcript_normalization feature of Google S2T, which automatically replaces parts of the transcript with phrases of your choosing. More details at: https://cloud.google.com/speech-to-text/v2/docs/reference/rpc/google.cloud.speech.v2#transcriptnormalization
max_alternatives | int32 | optional | Optional. Maximum number of recognition hypotheses to be returned; the server may return fewer than max_alternatives. Valid values are 0-30. A value of 0 or 1 returns a maximum of one. If omitted, returns a maximum of one.
S2tCloudProviderConfigMicrosoft contains configuration details specific to the Microsoft Azure speech-to-text provider.

Field | Type | Label | Description
--- | --- | --- | ---
use_fast_transcription_api | bool | optional | Optional. Enables or disables the Microsoft Azure fast transcription API, which is faster than the SDK but still in preview. More details at: https://learn.microsoft.com/en-us/azure/ai-services/speech-service/fast-transcription-create
use_detailed_output_format | bool | optional | Optional. Enables or disables the `detailed` output format of the Microsoft Azure S2T service, which adds timestamps and confidence scores to the resulting transcript.
S2tCloudServiceAmazon contains settings for Amazon Web Services cloud service inference.

Field | Type | Label | Description
--- | --- | --- | ---
language | string | | Language of the audio to transcribe with the Amazon Web Services S2T cloud service, given as a BCP-47 language code, e.g. 'en-US' or 'de-DE'.
streaming_available | bool | | Specifies whether streaming mode of Amazon Web Services speech-to-text is available for the selected language; otherwise batch mode transcription is used. See the list of languages and available transcription modes at: https://docs.aws.amazon.com/transcribe/latest/dg/supported-languages.html
enable_partial_results_stabilization | bool | | Enables or disables the partial results stabilization feature. More details at: https://docs.aws.amazon.com/transcribe/latest/dg/streaming-partial-results.html#streaming-partial-result-stabilization
partial_results_stability | string | | Sets the stability level of the transcription results. A higher stability level means that the transcription results are less likely to change, but can come with lower overall transcription accuracy. Defaults to "high" if not set explicitly.
language_model_name | string | | The name of the custom language model you want to use. More details at: https://docs.aws.amazon.com/transcribe/latest/dg/custom-language-models.html
vocabulary_name | string | | The name of the custom vocabulary you want to use. More details at: https://docs.aws.amazon.com/transcribe/latest/dg/custom-vocabulary.html
S2tCloudServiceDeepgram contains settings for Deepgram cloud service inference.

Field | Type | Label | Description
--- | --- | --- | ---
model_name | string | | Name of one of the speech-to-text models provided by Deepgram for the desired use case. Model names and details at: https://developers.deepgram.com/docs/model
language | string | | Language of the audio to transcribe with the Deepgram S2T cloud service, given as a BCP-47 language code, e.g. 'en-US' or 'de-DE'.
punctuate | bool | | Enables or disables Deepgram's punctuate feature, which adds punctuation to the resulting transcript. More details at: https://developers.deepgram.com/docs/punctuation
smart_format | bool | | Enables or disables Deepgram's smart_format feature, which improves the readability of the transcription result. More details at: https://developers.deepgram.com/docs/smart-format
numerals | bool | | Enables or disables Deepgram's numerals feature, which converts numbers to numeric form in the resulting transcript. More details at: https://developers.deepgram.com/docs/numerals
measurements | bool | | Enables or disables Deepgram's measurements feature, which converts measurement units (e.g. kilogram) to abbreviated form (e.g. kg) in the resulting transcript. More details at: https://developers.deepgram.com/docs/measurements
dictation | bool | | Enables or disables Deepgram's dictation feature, which converts spoken dictation commands into their corresponding punctuation marks. More details at: https://developers.deepgram.com/docs/dictation
S2tCloudServiceGoogle contains settings for Google Cloud service inference.

Field | Type | Label | Description
--- | --- | --- | ---
model_name | string | | Name of one of the speech-to-text models provided by Google for the desired use case. Model names and details at: https://cloud.google.com/speech-to-text/docs/transcription-model
language | string | | Language of the audio to transcribe with the Google S2T cloud service, given as a BCP-47 language code, e.g. 'en-US' or 'de-DE'.
enable_automatic_punctuation | bool | | Enables or disables the automatic_punctuation feature of Google S2T, which adds punctuation to the resulting transcript. More details at: https://cloud.google.com/speech-to-text/docs/automatic-punctuation
enable_word_time_offsets | bool | | Enables or disables the word_time_offsets feature of Google S2T, which adds word-level timestamps (time offsets) to the resulting transcript. More details at: https://cloud.google.com/speech-to-text/docs/async-time-offsets
enable_word_confidence | bool | | Enables or disables the word_confidence feature of Google S2T, which adds word-level confidence scores to the resulting transcript. More details at: https://cloud.google.com/speech-to-text/docs/word-confidence
transcript_normalization | bool | | Enables or disables the transcript_normalization feature of Google S2T, which automatically replaces parts of the transcript with phrases of your choosing. More details at: https://cloud.google.com/speech-to-text/v2/docs/reference/rpc/google.cloud.speech.v2#transcriptnormalization
max_alternatives | int32 | | Maximum number of recognition hypotheses to be returned; the server may return fewer than max_alternatives. Valid values are 0-30. A value of 0 or 1 returns a maximum of one. If omitted, returns a maximum of one.
S2tCloudServiceMicrosoft contains settings for Microsoft Azure cloud service inference.

Field | Type | Label | Description
--- | --- | --- | ---
language | string | | Language of the audio to transcribe with the Microsoft Azure S2T cloud service, given as a BCP-47 language code, e.g. 'en-US' or 'de-DE'.
use_fast_transcription_api | bool | | Enables or disables the Microsoft Azure fast transcription API, which is faster than the SDK but still in preview. More details at: https://learn.microsoft.com/en-us/azure/ai-services/speech-service/fast-transcription-create
use_detailed_output_format | bool | | Enables or disables the `detailed` output format of the Microsoft Azure S2T service, which adds timestamps and confidence scores to the resulting transcript.
S2tPipelineId identifies a specific pipeline configuration.

Field | Type | Label | Description
--- | --- | --- | ---
id | string | | ID of the speech-to-text pipeline.
Speech2TextConfig is the configuration message for a speech-to-text pipeline.

Field | Type | Label | Description
--- | --- | --- | ---
id | string | | Unique identifier for the configuration.
description | S2TDescription | | Description of the speech-to-text system.
active | bool | | Indicates if the configuration is active.
inference | S2TInference | | Configuration for inference models.
streaming_server | StreamingServer | | Configuration for the streaming server.
voice_activity_detection | VoiceActivityDetection | | Configuration for voice activity detection.
post_processing | PostProcessing | | Configuration for post-processing.
logging | Logging | | Configuration for logging.
StreamingServer contains information about the streaming server.

Field | Type | Label | Description
--- | --- | --- | ---
host | string | | Hostname of the streaming server.
port | int64 | | Port number of the streaming server.
output_style | string | | Output style for the streaming server.
streaming_speech_recognition | StreamingSpeechRecognition | | Configuration for streaming speech recognition.
StreamingSpeechRecognition contains information about streaming speech recognition settings.

Field | Type | Label | Description
--- | --- | --- | ---
transcribe_not_final | bool | | Indicates whether to transcribe non-final results.
decoding_method | string | | Decoding method for speech recognition.
sampling_rate | int64 | | Sampling rate for audio input.
min_audio_chunk_size | int64 | | Minimum audio chunk size for processing.
next_chunk_timeout | float | | Timeout between audio chunks; if exceeded, the stream is stopped.
turn_detection | TurnDetectionOptions | | Configuration of the turn-detection options for utterances.
SymSpell contains the configuration for SymSpell spelling correction.

Field | Type | Label | Description
--- | --- | --- | ---
dict_path | string | | Path to the dictionary used by SymSpell.
max_dictionary_edit_distance | int64 | | The maximum edit distance to consider for spelling correction (affects performance: larger values take longer to process).
prefix_length | int64 | | The length (number of characters) of the prefix to consider for filtering.
TrainUserLanguageModelRequest is used to request the training of a user-specific language model.

Field | Type | Label | Description
--- | --- | --- | ---
language_model_name | string | | Name of the language model to train. Example: "user_lm_1"
order | int64 | | Order n of the n-gram model. Example: 3 (for a trigram model)
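A sketch of the full user language model flow (create, add data, train), assuming the same placeholder stub and module names as the earlier examples, and the `zipped_data` bytes built in the AddDataToUserLanguageModelRequest sketch:

```python
import grpc

import speech_to_text_pb2 as s2t            # assumed generated module name
import speech_to_text_pb2_grpc as s2t_grpc  # assumed generated module name

stub = s2t_grpc.Speech2TextStub(grpc.insecure_channel("localhost:50051"))
lm_name = "user_lm_1"

stub.CreateUserLanguageModel(
    s2t.CreateUserLanguageModelRequest(language_model_name=lm_name)
)
stub.AddDataToUserLanguageModel(
    s2t.AddDataToUserLanguageModelRequest(
        language_model_name=lm_name,
        zipped_data=zipped_data,  # zip bytes of plain-text training files
    )
)
# order=3 trains a trigram model, matching the field example above.
stub.TrainUserLanguageModel(
    s2t.TrainUserLanguageModelRequest(language_model_name=lm_name, order=3)
)
```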
TranscribeFileRequest is a request to transcribe an audio file.

Field | Type | Label | Description
--- | --- | --- | ---
audio_file | bytes | | WAV file to transcribe.
config | TranscribeRequestConfig | | The configuration to override the default configuration.
TranscribeFileResponse is the response message for a transcribe-file request.

Field | Type | Label | Description
--- | --- | --- | ---
transcriptions | Transcription | repeated | List of transcriptions with confidence levels.
time | float | | The time the transcription took.
audio_uuid | string | | ID of the transcribed audio file.
TranscribeRequestConfig is the configuration for a request to transcribe audio.

Field | Type | Label | Description
--- | --- | --- | ---
s2t_pipeline_id | string | | Required. ID of the pipeline (model setup) that will process the audio.
decoding | Decoding | | Optional. Decoding type.
language_model_name | string | | Name of the language model.
post_processing | PostProcessingOptions | | The post-processing options.
utterance_detection | UtteranceDetectionOptions | | The utterance detection options.
pyannote | Pyannote | | Voice activity detection with Pyannote.
return_options | TranscriptionReturnOptions | | The transcription return options.
language | string | optional | Optional. Language of the transcription to return.
task | string | optional | Optional. Task of the S2T model, e.g. 'transcribe' or 'translate'.
s2t_service_config | google.protobuf.Struct | optional | Optional. Provides the configuration of the cloud service, such as API keys, bearer tokens, JWT, and other header information, as key-value pairs. The required and optional keys per provider are listed below the table. Note: ondewo-s2t raises an error if any required key is missing.
s2t_cloud_provider_config | S2tCloudProviderConfig | optional | Optional. Defines the cloud provider's specific configuration for using speech-to-text cloud services. The default value is None.

Keys for s2t_service_config per provider:

A. Amazon S2T service: aws_access_key_id (required), aws_secret_access_key (required), and region (required), the region name of the Amazon server. Example: s2t_service_config={'aws_access_key_id': 'YOUR_AWS_ACCESS_KEY_ID', 'aws_secret_access_key': 'YOUR_AWS_SECRET_ACCESS_KEY', 'region': 'YOUR_AMAZON_SERVER_REGION_NAME'}

B. Deepgram S2T service: api_key (required), the API key of the Deepgram account. Example: s2t_service_config={'api_key': 'YOUR_DEEPGRAM_API_KEY'}

C. Google Cloud S2T service: api_key (required) and api_endpoint (optional), the regional API endpoint of the Google Cloud S2T service (defaults to 'eu-speech.googleapis.com'). Example: s2t_service_config={'api_key': 'YOUR_GOOGLE_CLOUD_API_KEY', 'api_endpoint': 'YOUR_GOOGLE_CLOUD_API_ENDPOINT'}

D. Microsoft Azure S2T service: subscription_key (required) and region (required), the region name of the Microsoft Azure server. Example: s2t_service_config={'subscription_key': 'YOUR_MICROSOFT_AZURE_SUBSCRIPTION_KEY', 'region': 'YOUR_MICROSOFT_AZURE_SERVER_REGION_NAME'}
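A hedged sketch of attaching provider credentials through s2t_service_config as a google.protobuf.Struct; the keys follow provider note A above, and all values plus the generated module name are placeholders:

```python
from google.protobuf import struct_pb2

import speech_to_text_pb2 as s2t  # assumed generated module name

# Build the key-value credential map as a protobuf Struct.
service_config = struct_pb2.Struct()
service_config.update({
    "aws_access_key_id": "YOUR_AWS_ACCESS_KEY_ID",
    "aws_secret_access_key": "YOUR_AWS_SECRET_ACCESS_KEY",
    "region": "YOUR_AMAZON_SERVER_REGION_NAME",
})

config = s2t.TranscribeRequestConfig(
    s2t_pipeline_id="pipeline_1",       # required field
    s2t_service_config=service_config,  # provider credentials
)
```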
TranscribeStreamRequest is a request to transcribe an audio stream.

Field | Type | Label | Description
--- | --- | --- | ---
audio_chunk | bytes | | WAV audio chunk to transcribe.
end_of_stream | bool | | Whether this is the final chunk of the stream.
config | TranscribeRequestConfig | | The configuration to override the default configuration.
mute_audio | bool | | Whether or not to mute the audio signal. Defaults to false.
TranscribeStreamResponse is the response message of a stream transcription.

Field | Type | Label | Description
--- | --- | --- | ---
transcriptions | Transcription | repeated | List of transcriptions with confidence levels.
time | float | | The time the transcription took.
final | bool | | Whether or not this transcription is final (transcribed texts might change if transcription starts before the end of an utterance).
return_audio | bool | | Whether the audio bytes of the utterance are included in the response.
audio | bytes | | Audio bytes of the transcribed utterance.
utterance_start | bool | | Whether this response marks the start of an utterance.
audio_uuid | string | | ID of the transcribed audio file.
config | TranscribeRequestConfig | | The configuration used for the transcription.
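A hedged sketch of bidirectional streaming with TranscribeStream, using the same placeholder stub and module names as the earlier examples. The chunk size and the choice to send the config only on the first message are assumptions, not requirements stated by this API:

```python
import grpc

import speech_to_text_pb2 as s2t            # assumed generated module name
import speech_to_text_pb2_grpc as s2t_grpc  # assumed generated module name

CHUNK_SIZE = 8192  # bytes per audio chunk; illustrative value

def requests(wav_path: str):
    config = s2t.TranscribeRequestConfig(s2t_pipeline_id="pipeline_1")
    with open(wav_path, "rb") as f:
        first = True
        while chunk := f.read(CHUNK_SIZE):
            request = s2t.TranscribeStreamRequest(audio_chunk=chunk)
            if first:
                request.config.CopyFrom(config)  # override defaults once
                first = False
            yield request
    # Signal the end of the audio stream.
    yield s2t.TranscribeStreamRequest(end_of_stream=True)

stub = s2t_grpc.Speech2TextStub(grpc.insecure_channel("localhost:50051"))
for response in stub.TranscribeStream(requests("utterance.wav")):
    for t in response.transcriptions:
        # Non-final transcripts may still change as more audio arrives.
        marker = "final" if response.final else "partial"
        print(f"[{marker}] {t.transcription} ({t.confidence_score:.2f})")
```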
Transcription is the transcription message.

Field | Type | Label | Description
--- | --- | --- | ---
transcription | string | | The transcribed text.
confidence_score | float | | The corresponding confidence estimate, between 0.0 and 1.0. A higher number indicates an estimated greater likelihood that the recognized words are correct.
words | WordDetail | repeated | List of the words of the transcription with their confidence scores and probable alternatives.
alternatives | TranscriptionAlternative | repeated | List of alternative transcriptions with confidence scores, word timings, and alternative words.
TranscriptionAlternative contains an alternative transcription hypothesis.

Field | Type | Label | Description
--- | --- | --- | ---
transcript | string | | The alternative transcribed text.
confidence | float | | The confidence score of the alternative transcript.
words | WordDetail | repeated | A list of word-specific information for each recognized word, including word timings, the confidence score of the word, and alternative words.
TranscriptionReturnOptions configures the return values of a transcribe request.

Field | Type | Label | Description
--- | --- | --- | ---
return_start_of_speech | bool | | Whether the server should send a response indicating that the beginning of speech was detected.
return_audio | bool | | Whether the S2T server should return the audio bytes of the transcribed utterance.
return_confidence_score | bool | | Whether or not to return confidence scores.
return_alternative_transcriptions | bool | | Whether or not to return alternative transcription results from beam search.
return_alternative_transcriptions_nr | int32 | | Optional. Number of alternative transcription results from beam search or greedy search.
return_alternative_words | bool | | Whether or not to return alternative words from beam search.
return_alternative_words_nr | int32 | | Optional. Number of alternative words to return.
return_word_timing | bool | | Optional. Whether or not to return timestamps for the start and end of words. Only used in TranscribeFile.
TurnDetectionOptions configures turn detection in utterances.

Field | Type | Label | Description
--- | --- | --- | ---
active | bool | optional | Optional. Indicates if the turn-detection feature is active.
full_utterance_deployment | bool | optional | Optional. Whether to transcribe the whole utterance when a turn moment is detected. This can increase transcription accuracy at the cost of speed. If deactivated, transcription starts from the last short silence period, and the transcriptions of small audio chunks between short silences are concatenated.
llm_grpc_host | string | optional | Optional. Host name or IP address of the server that serves the LLM used for turn detection.
llm_grpc_port | int32 | optional | Optional. Port number of the server that serves the LLM used for turn detection.
llm_request_timeout | float | optional | Optional. Request timeout in seconds for getting the turn-detection result from the LLM. If the timeout occurs, the turn-detection result is considered False.
UtteranceDetectionOptions configures utterance detection.

Field | Type | Label | Description
--- | --- | --- | ---
transcribe_not_final | bool | | Whether to also return intermediate (non-final) transcription results.
next_chunk_timeout | float | | If the time between audio chunks exceeds next_chunk_timeout, the stream is stopped.
turn_detection | TurnDetectionOptions | | The turn-detection options.
VoiceActivityDetection contains information about voice activity detection settings.

Field | Type | Label | Description
--- | --- | --- | ---
active | string | | Indicates if voice activity detection is active.
sampling_rate | int64 | | Sampling rate for voice activity detection.
pyannote | Pyannote | | Configuration for the Pyannote model.
Wav2Vec contains information about the Wav2Vec model.

Field | Type | Label | Description
--- | --- | --- | ---
model_path | string | | Path to the model.
use_gpu | bool | | Indicates if the GPU is used.
Wav2VecTriton contains information about the Wav2Vec model using Triton.

Field | Type | Label | Description
--- | --- | --- | ---
processor_path | string | | Path to the processor.
triton_model_name | string | | Name of the Triton model.
triton_model_version | string | | Version of the Triton model.
check_status_timeout | int64 | | Timeout for checking the model status.
triton_server_host | string | | Host name of the Triton inference server that serves the Wav2VecTriton model.
triton_server_port | int64 | | Port number of the Triton inference server that serves the Wav2VecTriton model.
Whisper contains information about the Whisper model.

Field | Type | Label | Description
--- | --- | --- | ---
model_path | string | | Path to the model.
use_gpu | bool | | Indicates if the GPU is used.
language | string | | Default language of the model.
task | string | | Default task of the model.
WhisperTriton contains information about the Whisper model using Triton.

Field | Type | Label | Description
--- | --- | --- | ---
processor_path | string | | Path to the processor.
triton_model_name | string | | Name of the Triton model.
triton_model_version | string | | Version of the Triton model.
check_status_timeout | int64 | | Timeout for checking the model status.
language | string | | Default language of the model.
task | string | | Default task of the model, e.g. transcribe or translate.
triton_server_host | string | | Host name of the Triton inference server that serves the WhisperTriton model.
triton_server_port | int64 | | Port number of the Triton inference server that serves the WhisperTriton model.
WordAlternative provides an alternative word with its confidence score.

Field | Type | Label | Description
--- | --- | --- | ---
word | string | | The recognized word corresponding to this set of information.
confidence | float | | The confidence score of the alternative word.
WordDetail provides word-specific information for recognized words.

Field | Type | Label | Description
--- | --- | --- | ---
start_time | float | | The start time of the spoken word relative to the beginning of the audio. The accuracy of the time offset can vary; this is an experimental feature.
end_time | float | | The end time of the spoken word relative to the beginning of the audio. The accuracy of the time offset can vary; this is an experimental feature.
word | string | | The recognized word corresponding to this set of information.
confidence | float | | The confidence score of the word.
word_alternatives | WordAlternative | repeated | List of alternative words with the confidence score of each.
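A short sketch of walking these word-level results; it assumes `response` is a TranscribeFileResponse obtained with return_word_timing and return_alternative_words enabled (see TranscriptionReturnOptions), and field access simply mirrors the message definitions above:

```python
# Print each word with its timing, confidence, and alternatives.
for transcription in response.transcriptions:
    for word in transcription.words:
        print(f"{word.word} [{word.start_time:.2f}s-{word.end_time:.2f}s] "
              f"conf={word.confidence:.2f}")
        for alt in word.word_alternatives:
            print(f"  alt: {alt.word} (conf={alt.confidence:.2f})")
```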
Decoding is the decoding configuration.

Name | Number | Description
--- | --- | ---
DEFAULT | 0 | Decoding is defined by the pipeline config.
GREEDY | 1 | Greedy decoding is used regardless of the pipeline config.
BEAM_SEARCH_WITH_LM | 2 | Beam search (with language model) is used regardless of the pipeline config.
BEAM_SEARCH | 3 | Beam search without an LM head; used to configure the decoding mode for seq2seq models.
InferenceBackend specifies the inference backend.

Name | Number | Description
--- | --- | ---
INFERENCE_BACKEND_UNKNOWN | 0 | Not set.
INFERENCE_BACKEND_PYTORCH | 1 | Run a PyTorch model.
INFERENCE_BACKEND_FLAX | 2 | Run a Flax model.
INFERENCE_BACKEND_CLOUD_SERVICE_AMAZON | 3 | Run the Amazon S2T cloud service.
INFERENCE_BACKEND_CLOUD_SERVICE_DEEPGRAM | 4 | Run the Deepgram S2T cloud service.
INFERENCE_BACKEND_CLOUD_SERVICE_GOOGLE | 5 | Run the Google S2T cloud service.
INFERENCE_BACKEND_CLOUD_SERVICE_MICROSOFT | 6 | Run the Microsoft Azure S2T cloud service.
.proto Type | Notes | C++ | Java | Python | Go | C# | PHP | Ruby
--- | --- | --- | --- | --- | --- | --- | --- | ---
double | | double | double | float | float64 | double | float | Float
float | | float | float | float | float32 | float | float | Float
int32 | Uses variable-length encoding. Inefficient for encoding negative numbers; if your field is likely to have negative values, use sint32 instead. | int32 | int | int | int32 | int | integer | Bignum or Fixnum (as required)
int64 | Uses variable-length encoding. Inefficient for encoding negative numbers; if your field is likely to have negative values, use sint64 instead. | int64 | long | int/long | int64 | long | integer/string | Bignum
uint32 | Uses variable-length encoding. | uint32 | int | int/long | uint32 | uint | integer | Bignum or Fixnum (as required)
uint64 | Uses variable-length encoding. | uint64 | long | int/long | uint64 | ulong | integer/string | Bignum or Fixnum (as required)
sint32 | Uses variable-length encoding. Signed int value. These encode negative numbers more efficiently than regular int32s. | int32 | int | int | int32 | int | integer | Bignum or Fixnum (as required)
sint64 | Uses variable-length encoding. Signed int value. These encode negative numbers more efficiently than regular int64s. | int64 | long | int/long | int64 | long | integer/string | Bignum
fixed32 | Always four bytes. More efficient than uint32 if values are often greater than 2^28. | uint32 | int | int | uint32 | uint | integer | Bignum or Fixnum (as required)
fixed64 | Always eight bytes. More efficient than uint64 if values are often greater than 2^56. | uint64 | long | int/long | uint64 | ulong | integer/string | Bignum
sfixed32 | Always four bytes. | int32 | int | int | int32 | int | integer | Bignum or Fixnum (as required)
sfixed64 | Always eight bytes. | int64 | long | int/long | int64 | long | integer/string | Bignum
bool | | bool | boolean | boolean | bool | bool | boolean | TrueClass/FalseClass
string | A string must always contain UTF-8 encoded or 7-bit ASCII text. | string | String | str/unicode | string | string | string | String (UTF-8)
bytes | May contain any arbitrary sequence of bytes. | string | ByteString | str | []byte | ByteString | string | String (ASCII-8BIT)