Riva Multilingual Offline ASR and AST with Whisper and Canary
NVIDIA has consistently developed automatic speech recognition (ASR) models that set the benchmark in the industry. Earlier versions of NVIDIA Riva, a collection of GPU-accelerated speech and translation AI microservices for ASR, text-to-speech (TTS), and neural machine translation (NMT), supported English-Spanish and English-Japanese code-switching ASR models based on the Conformer architecture. They also included a Parakeet-based model covering several common languages in the EMEA region, namely British English, European Spanish, French, Italian, Standard German, and Armenian.
New Developments in Riva 2.18.0
Recently, NVIDIA released the Riva 2.18.0 container and SDK to keep evolving its speech AI models. With this new release, we now offer the following:
- Support for Parakeet, a streaming multilingual ASR model
- Support for OpenAI’s Whisper-Large and Hugging Face’s Distil-Whisper-Large models for offline ASR and Any-to-English AST
- The NVIDIA Canary models for offline ASR, Any-to-English, English-to-Any, and Any-to-Any AST
- A new SSML tag that tells a Megatron NMT model not to translate the enclosed text
- A new DNT dictionary that tells a Megatron NMT model how to translate specified words or phrases
Automatic Speech Translation (AST)
Automatic speech translation (AST) is the translation of speech in one language to text in another language without intermediate transcription in the first language.
Multilingual Offline ASR and AST with Whisper
Riva’s new support for Whisper enables offline multilingual ASR: you can transcribe audio recordings in dozens of languages. Whisper can also translate audio from any of the supported languages directly into English, rather than first transcribing the audio in the source language and then translating the transcription to English.
Launching a Riva Server with Whisper Capabilities
To launch a Riva server with Whisper capabilities, ensure that the following variables are set as indicated:
service_enabled_asr=true
asr_acoustic_model=("whisper") # or "distil_whisper" for lower memory requirements
asr_acoustic_model_variant=("large") # the default "" should also work
riva_model_loc=""
Run the riva_init.sh script, provided in the same directory as the configuration file, to download the models in RMIR (Riva Model Intermediate Representation) form and to deploy versions of those models optimized for your particular GPU architecture. Then, run the riva_start.sh script to launch the Riva server.
NIM Microservice Implementations
NIM microservice versions of Whisper and Canary (both 1B and 0.6B-Turbo) are also available. To launch either the Whisper or Canary NIM microservice on your own system, choose the Docker tab of the model’s landing page and follow the instructions. In either case, you must generate an NGC API key and export it as the environment variable NGC_API_KEY.
Running NIM Microservice with Docker
Here’s the Docker run command for the Whisper NIM microservice:
docker run -it --rm --name=riva-asr \
--runtime=nvidia \
--gpus '"device=0"' \
--shm-size=8GB \
-e NGC_API_KEY \
-e NIM_HTTP_API_PORT=9000 \
-e NIM_GRPC_API_PORT=50051 \
-p 9000:9000 \
-p 50051:50051 \
-e NIM_TAGS_SELECTOR=name=whisper-large-v3 \
nvcr.io/nim/nvidia/riva-asr:1.3.0
To run the Canary NIM microservice instead, replace whisper-large-v3 with canary-1b or canary-0-6b-turbo in the docker run command. Whichever ASR or AST model you use, running a NIM microservice on your own system in this manner keeps the container attached to the terminal in the foreground. You must therefore open a second terminal, or use a different interface entirely, to run inference with the Whisper or Canary NIM microservice.
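From that second terminal, an Any-to-English speech translation (AST) request can be sketched in Python with the nvidia-riva-client package (pip install nvidia-riva-client). This is a sketch under assumptions: the "task:translate" custom-configuration key is how Riva’s Whisper documentation exposes translation as best I recall, the server address and audio file path are placeholders, and exact field names may differ by release.

```python
# Sketch: requesting Any-to-English AST from a Whisper NIM microservice.
# Assumptions (not confirmed by this post): nvidia-riva-client is installed,
# a server listens on localhost:50051, and Riva's Whisper deployment accepts
# the "task:translate" custom-configuration key.

def ast_custom_config(task="translate"):
    """Build the comma-separated key:value string that Riva's
    custom-configuration helper expects, e.g. "task:translate"."""
    options = {"task": task}
    return ",".join(f"{key}:{value}" for key, value in options.items())

def translate_to_english(audio_path, server="localhost:50051"):
    # Deferred import so the helper above has no dependencies.
    import riva.client

    auth = riva.client.Auth(uri=server)
    asr = riva.client.ASRService(auth)
    config = riva.client.RecognitionConfig(
        language_code="multi",  # multilingual models auto-detect the language
        max_alternatives=1,
    )
    # Ask Whisper to translate the audio rather than transcribe it.
    riva.client.add_custom_configuration_to_config(config, ast_custom_config())
    with open(audio_path, "rb") as f:
        response = asr.offline_recognize(f.read(), config)
    return response.results[0].alternatives[0].transcript
```

For plain transcription in the source language, simply omit the custom-configuration call.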
Inference with Riva Server
Once the Riva server is running, you can submit inference calls to it with the C++ or Python APIs. We use Python examples for the rest of this post.
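As a starting point, here is a minimal sketch of an offline (batch) recognition call with the nvidia-riva-client Python package. The server address, the "multi" language code for multilingual models, and the audio file path are assumptions for illustration, not values taken from this post.

```python
# Sketch: offline ASR against a running Riva server via the Python client.
# Assumes `pip install nvidia-riva-client` and a server on localhost:50051.

def build_offline_settings(language_code="multi", model=""):
    """Collect offline recognition settings as a plain dict so they can be
    inspected before being turned into a gRPC RecognitionConfig.
    "multi" is assumed here to trigger multilingual auto-detection."""
    return {
        "language_code": language_code,
        "max_alternatives": 1,
        "enable_automatic_punctuation": True,
        "model": model,  # leave empty to let the server pick its default
    }

def transcribe_offline(audio_path, server="localhost:50051"):
    # Deferred import so build_offline_settings stays dependency-free.
    import riva.client

    auth = riva.client.Auth(uri=server)
    asr = riva.client.ASRService(auth)
    config = riva.client.RecognitionConfig(**build_offline_settings())
    with open(audio_path, "rb") as f:
        audio_bytes = f.read()  # WAV bytes; Riva reads the header
    response = asr.offline_recognize(audio_bytes, config)
    return response.results[0].alternatives[0].transcript
```

A typical call would be transcribe_offline("recording.wav"), where the file name is a placeholder for your own audio.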
Conclusion
Riva’s new support for Whisper enables offline multilingual ASR, letting you transcribe audio recordings in dozens of languages. Additionally, Riva’s support for Canary for offline ASR and AST provides more options for language translation. With the new Riva 2.18.0 release, we continue to advance the field of speech AI, offering more models and features to support a wide range of use cases.
Frequently Asked Questions (FAQs)
Q: What is Riva?
A: Riva is a collection of GPU-accelerated speech and translation AI microservices for ASR, TTS, and NMT.
Q: What is ASR?
A: ASR (Automatic Speech Recognition) is the conversion of spoken language into text.
Q: What is AST?
A: AST (Automatic Speech Translation) is the translation of speech in one language to text in another language without intermediate transcription in the first language.
Q: What are the new features in Riva 2.18.0?
A: The new features in Riva 2.18.0 include support for the streaming multilingual Parakeet ASR model; the Whisper-Large and Distil-Whisper-Large models for offline ASR and Any-to-English AST; the NVIDIA Canary models for offline ASR and Any-to-English, English-to-Any, and Any-to-Any AST; plus a new SSML tag and DNT dictionary for controlling Megatron NMT translation.
Q: How do I launch a Riva server with Whisper capabilities?
A: To launch a Riva server with Whisper capabilities, ensure that the following variables are set as indicated: service_enabled_asr=true, asr_acoustic_model=("whisper"), and asr_acoustic_model_variant=("large"). Then, run the riva_init.sh and riva_start.sh scripts to download and launch the Riva server.
Q: How do I run a NIM microservice with Docker?
A: To run a NIM microservice with Docker, choose the Docker tab of the model’s landing page and follow the instructions. You must first generate an NGC API key and export it as the environment variable NGC_API_KEY.

