Whisper is a general-purpose speech recognition model. Trained on a large dataset of diverse audio, it is a multi-task model that can perform multilingual speech recognition, speech translation, and language identification. Code repo | Model card | Paper
The code and model weights are open-source under the MIT license, so they are free to use in private and commercial applications.
It can be used as a standalone CLI or as a Python library.
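As a rough illustration of the library route, the sketch below wraps the `load_model` and `transcribe` calls from the `openai-whisper` package in a small helper. The function name `transcribe_file` and its default model size are assumptions made for this example, and the import is deferred so the helper can be defined on a machine where the package is not installed yet.

```python
from pathlib import Path


def transcribe_file(audio_path: str, model_name: str = "base") -> str:
    """Transcribe one audio file and return the recognized text.

    `transcribe_file` is a hypothetical helper for this example,
    not part of the Whisper package itself.
    """
    path = Path(audio_path)
    if not path.is_file():
        raise FileNotFoundError(f"No audio file at {path}")

    # Imported lazily so this module still loads where openai-whisper is absent.
    import whisper

    # load_model fetches the weights on first use and caches them locally.
    model = whisper.load_model(model_name)

    # transcribe returns a dict; the full transcript is under the "text" key.
    result = model.transcribe(str(path))
    return result["text"].strip()
```

Calling `transcribe_file("audio.mp3")` would then return the transcript as a string. The CLI equivalent is roughly `whisper audio.mp3 --model base`.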
Nine models are available: five sizes, four of which also come in an English-only variant:
Specs
Size | Parameters | English-only model | Multilingual model | Model file size | VRAM required |
---|---|---|---|---|---|
tiny | 39 M | ✓ | ✓ | ~ 76 MB | ~ 1 GB |
base | 74 M | ✓ | ✓ | ~ 145 MB | ~ 1 GB |
small | 244 M | ✓ | ✓ | ~ 484 MB | ~ 2 GB |
medium | 769 M | ✓ | ✓ | ~ 1.5 GB | ~ 5 GB |
large | 1550 M | ˣ | ✓ | ~ 3.1 GB | ~ 10 GB |
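One way to use the table is to pick the largest model that fits a given GPU. The sketch below hard-codes the approximate VRAM figures from the table; the helper names `pick_model` and `all_model_names` are hypothetical, and the `.en` suffix for English-only variants follows the naming used in the Whisper repository (for example `base.en`), which is not shown in the table itself.

```python
from typing import List, Optional, Tuple

# (size, approximate VRAM in GB, has an English-only variant), copied from the table.
MODELS: List[Tuple[str, float, bool]] = [
    ("tiny", 1, True),
    ("base", 1, True),
    ("small", 2, True),
    ("medium", 5, True),
    ("large", 10, False),
]


def all_model_names() -> List[str]:
    """List every model name: one multilingual per size, plus '.en' where available."""
    names = []
    for size, _vram, has_english in MODELS:
        names.append(size)
        if has_english:
            names.append(f"{size}.en")
    return names


def pick_model(vram_gb: float, english_only: bool = False) -> Optional[str]:
    """Return the largest model whose approximate VRAM need fits in vram_gb.

    Returns None when even the smallest model does not fit.
    """
    best = None
    for size, vram, has_english in MODELS:  # ordered from smallest to largest
        if vram > vram_gb:
            continue
        if english_only and not has_english:
            continue
        best = f"{size}.en" if english_only else size
    return best
```

For instance, `pick_model(4)` selects `small`, while `pick_model(12, english_only=True)` falls back to `medium.en` because `large` has no English-only variant. The VRAM figures are approximate, so treat the result as a starting point and leave some headroom.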