Audio & Speech Models

Run and fine-tune open-source audio models on Mixtrain, from high-accuracy English speech recognition to multilingual transcription.

Example Models

Model	Provider	Parameters	Description
Parakeet TDT v3	NVIDIA	600M	Multilingual ASR supporting 25 languages with automatic language detection
Parakeet TDT v2	NVIDIA	600M	English ASR with 6.05% WER and an RTFx of 3,380
Canary Qwen	NVIDIA	2.5B	Top of the Hugging Face Open ASR Leaderboard (5.63% WER)
Whisper Large v3	OpenAI	1.5B	Multilingual speech recognition and translation
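
The WER figures in the table are word error rates: the number of word-level substitutions, deletions, and insertions needed to turn a model's transcript into the reference, divided by the number of reference words. The sketch below shows that calculation under simplifying assumptions: it only lowercases and splits on whitespace, whereas published benchmarks typically apply fuller text normalization, so it will not reproduce leaderboard numbers exactly. The function name `word_error_rate` is illustrative and not part of Mixtrain.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    # Assumption: lowercase + whitespace split stands in for real normalization.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    wer = word_error_rate("the cat sat on the mat", "the cat sat on a mat")
    print(f"{wer:.2%}")  # one substitution over six reference words
```

Running the same function over your own reference transcripts is a quick way to compare models on audio that resembles your workload, which may rank them differently from a public benchmark.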

Quick Start

from mixtrain import Model

# Load a speech recognition model
model = Model("parakeet-tdt-v3")

# Transcribe audio
result = model.run({
    "audio": "my-workspace/recordings/meeting.wav"
})

NVIDIA Parakeet

Parakeet is NVIDIA's family of FastConformer-based ASR models, designed to pair high accuracy with very fast inference.

Parakeet TDT v3 extends v2 with multilingual support for 25 European languages, automatic language detection, and transcription of audio up to 24 minutes in a single pass (or up to 3 hours with local attention).
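
For recordings longer than the single-pass limit, one option is to split the audio into overlapping windows and transcribe each window separately. The sketch below only computes the window boundaries. It assumes a 24-minute default window taken from the figure above and a 5-second overlap chosen purely for illustration; `chunk_spans` is a hypothetical helper, not a Mixtrain function, and whether Mixtrain already handles long audio for you is not covered here.

```python
def chunk_spans(
    total_seconds: float,
    max_chunk_seconds: float = 24 * 60,  # single-pass figure quoted above
    overlap_seconds: float = 5.0,        # illustrative value, tune as needed
) -> list[tuple[float, float]]:
    """Return (start, end) spans in seconds that cover the whole recording."""
    if overlap_seconds < 0 or max_chunk_seconds <= overlap_seconds:
        raise ValueError("need 0 <= overlap_seconds < max_chunk_seconds")

    spans = []
    step = max_chunk_seconds - overlap_seconds  # how far each window advances
    start = 0.0
    while start < total_seconds:
        end = min(start + max_chunk_seconds, total_seconds)
        spans.append((start, end))
        if end >= total_seconds:
            break  # the last window reached the end of the recording
        start += step
    return spans


if __name__ == "__main__":
    # A one-hour recording split into windows of at most 24 minutes.
    for start, end in chunk_spans(3600):
        print(f"{start:7.1f}s -> {end:7.1f}s")
```

The overlap gives each boundary some shared context so that words cut in half by one window appear whole in the next; merging the overlapping transcripts is a separate step not shown here.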

Parakeet TDT v2 reaches a 6.05% word error rate (WER) on English with an RTFx (inverse real-time factor) of 3,380, and was trained on NVIDIA's 120,000-hour Granary dataset.
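
RTFx is the ratio of audio duration to processing time, so higher is faster. As a rough illustration of what a value of 3,380 means, the sketch below converts between the two. The helper names are illustrative, and the quoted figure reflects the benchmark's own hardware and batch settings, so throughput in your environment may differ.

```python
def rtfx(audio_seconds: float, processing_seconds: float) -> float:
    """Inverse real-time factor: seconds of audio handled per second of compute."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


def processing_time(audio_seconds: float, rtfx_value: float) -> float:
    """Estimated compute time in seconds for a given amount of audio."""
    if rtfx_value <= 0:
        raise ValueError("rtfx_value must be positive")
    return audio_seconds / rtfx_value


if __name__ == "__main__":
    one_hour = 3600.0
    # At the quoted RTFx, an hour of audio takes a little over one second.
    print(f"{processing_time(one_hour, 3380):.2f} s")
```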

from mixtrain import Workflow

# Fine-tune for domain-specific vocabulary with a training workflow
workflow = Workflow("parakeet-finetune")
result = workflow.run(
    dataset="my-workspace/medical-transcripts",
    base_model="parakeet-tdt-v3",
    steps=10000,
)

NVIDIA Canary Qwen

Canary Qwen 2.5B tops the Hugging Face Open ASR Leaderboard at the time of writing, with a 5.63% WER. It is an English-only model and is well suited to high-accuracy batch transcription workloads.

from mixtrain import Model

model = Model("canary-qwen-2.5b")

result = model.run({
    "audio": "my-workspace/recordings/interview.wav"
})
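
For batch workloads, a simple pattern is to loop over a list of recordings and collect the results by path. The sketch below assumes only what the examples on this page show: a model object with a `run` method that accepts a dict containing an `"audio"` key. The helper `transcribe_all` is hypothetical, and the shape of each result is left untouched because it is not documented here.

```python
from typing import Any, Iterable


def transcribe_all(model: Any, audio_paths: Iterable[str]) -> dict[str, Any]:
    """Run the model once per path and return the raw results keyed by path."""
    results: dict[str, Any] = {}
    for path in audio_paths:
        # Same call shape as the single-file examples: a dict with an "audio" key.
        results[path] = model.run({"audio": path})
    return results


# Usage (assumes the mixtrain package and these workspace paths exist):
#
#   from mixtrain import Model
#   results = transcribe_all(
#       Model("canary-qwen-2.5b"),
#       [
#           "my-workspace/recordings/interview.wav",
#           "my-workspace/recordings/meeting.wav",
#       ],
#   )
```

This runs the files one after another; whether Mixtrain offers a native batch or parallel API is not covered here.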

OpenAI Whisper

Whisper Large v3 remains a popular choice for multilingual transcription and translation, supporting 100+ languages with solid accuracy across diverse audio conditions.

from mixtrain import Model

model = Model("whisper-large-v3")

result = model.run({
    "audio": "my-workspace/recordings/podcast.wav",
    "language": "auto"
})