Trelis Research

Speech-to-Text Model Comparison

Open weights vs proprietary, WER benchmarks, pricing, and fine-tuning support for production STT models.

Legend
Open Weights: Weights available for download
Proprietary: API-only access
API-Optimized: Optimized variant, API-only
Yes: Supported
Pending: Coming soon
No: Not supported

Model Comparison

Model | Provider | Access | License | Params | FLEURS WER | API Price/min | Key Features

Voxtral Family (Mistral AI)
Voxtral Mini Transcribe V2 | Mistral AI | API-Optimized | Proprietary | ~3B | ~4.0% | $0.003 | Diarization, word timestamps, context biasing, 13 languages, 3 hr audio
Voxtral Realtime | Mistral AI | Open Weights | Apache 2.0 | 4B | ~4.0% | $0.006 | Streaming, sub-200 ms latency, 13 languages, edge-deployable
Voxtral Small (24B) | Mistral AI | Open Weights | Apache 2.0 | 24B | ~4.9% | $0.003 | Audio understanding, Q&A, summarization, function calling, 32k context
Voxtral Mini Transcribe | Mistral AI | API-Optimized | Proprietary | ~3B | ~5.3% | $0.001 | Cheapest option, transcription-optimized
Voxtral Mini (3B) | Mistral AI | Open Weights | Apache 2.0 | 3B | ~6.9% | $0.001 | Audio understanding, Q&A, summarization, edge-friendly, 32k context

Whisper Family (OpenAI)
Whisper large-v3 | OpenAI | Open Weights | MIT | 1.5B | ~8.3% | Self-hosted | Word timestamps, 99 languages, mature ecosystem (whisper.cpp, faster-whisper)
Whisper large-v3-turbo | OpenAI | Open Weights | MIT | 809M | ~8.5% | Self-hosted | 2x faster than v3, word timestamps, 99 languages, great for fine-tuning

Proprietary APIs
GPT-4o mini Transcribe | OpenAI | Proprietary | Proprietary | N/A | ~5.7% | $0.003 | OpenAI API, easy integration
Gemini 2.5 Flash | Google | Proprietary | Proprietary | N/A | ~7.0% | ~$0.003 | Multimodal, long context, audio understanding
ElevenLabs Scribe v2 | ElevenLabs | Proprietary | Proprietary | N/A | ~4.9% | $0.010 | Diarization, word timestamps, 99 languages
Deepgram Nova | Deepgram | Proprietary | Proprietary | N/A | N/A | ~$0.008 | Diarization, streaming, custom vocabulary
AssemblyAI Universal | AssemblyAI | Proprietary | Proprietary | N/A | N/A | ~$0.002 | Diarization, sentiment, topic detection

Other Open Models
Kyutai STT (1B / 2.6B) | Kyutai | Open Weights | CC-BY 4.0 | 1B / 2.6B | N/A | Self-hosted | Streaming, word timestamps, voice prompting, Rust server
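
For readers unfamiliar with the WER column, the following is a minimal sketch of how word error rate is usually defined: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and the model output, divided by the number of reference words. The function name `word_error_rate` and the lowercase-and-split normalization are assumptions made for illustration. Published benchmark figures such as the FLEURS numbers above typically apply a fuller text normalizer, so this sketch will not reproduce them exactly.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Normalization here is only lowercasing and whitespace splitting,
    which is a simplifying assumption for this sketch.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words to match an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One dropped word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.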

Fine-Tuning Support

Model | Transformers (Inference) | Transformers (Fine-tuning) | Unsloth | PEFT / LoRA | Trainer Type | Trelis Studio | ADVANCED-audio | Notes
Whisper large-v3 / Turbo | Yes | Yes | Yes | Yes | Seq2SeqTrainer | Yes | Yes | Most mature fine-tuning ecosystem. Unsloth gives ~30% VRAM savings. Target modules: q_proj, v_proj, k_proj, out_proj, fc1, fc2.
Voxtral Mini (3B) | Yes | Yes | Pending | Yes | Trainer (causal LM) | Yes | Yes | Requires custom data collator with apply_transcription_request. Target modules: q/k/v/o_proj + gate/up/down_proj. Audio tower frozen.
Voxtral Small (24B) | Yes | Yes | Pending | Yes | Trainer (causal LM) | No | No | Same approach as Mini but requires more VRAM. Multi-GPU recommended.
Voxtral Realtime (4B) | Yes | No | No | No | N/A | No | No | Transformers supports inference only, not fine-tuning yet. Streaming architecture.
Kyutai STT | Yes | No | No | No | Custom (Trelis) | No | Yes | Custom fine-tuning script by Trelis (not standard transformers/Unsloth). Candle conversion needed for Rust server deployment.
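
As an illustration of the Whisper target modules listed above, here is a minimal sketch of a PEFT LoRA configuration. It assumes the `peft` package is installed. The rank, alpha, and dropout values are placeholder assumptions chosen for illustration, not settings taken from Trelis; only the module names come from the table.

```python
from peft import LoraConfig

# Module names as listed in the Whisper row of the table above.
WHISPER_TARGET_MODULES = ["q_proj", "k_proj", "v_proj", "out_proj", "fc1", "fc2"]

# r, lora_alpha and lora_dropout are illustrative assumptions; tune them for your data.
whisper_lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    target_modules=WHISPER_TARGET_MODULES,
)

# The config would then be applied to a loaded Whisper model, for example with
# peft.get_peft_model(model, whisper_lora_config), before handing the wrapped
# model to a Seq2SeqTrainer.
```

Targeting the attention projections and both feed-forward layers adapts more of the network than attention-only LoRA, at the cost of more trainable parameters.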

Notes

Trelis Fine-Tuning Results (Trelis/llm-lingo, 6 validation samples)