Whisper is a general-purpose speech recognition model. It is trained on a large dataset of diverse audio, and it is a multi-task model: it can perform multilingual speech recognition, speech translation, and language identification. Code repo | Model card | Paper
The code and model weights are open-source under the MIT license, so it’s free to use for private and commercial applications.
It can be used as a standalone CLI or as a Python library.
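As a rough sketch of both entry points, the snippet below builds a `whisper` command line and wraps the library calls in a small function. The helper names `build_cli_command` and `transcribe_file` are made up for illustration. It assumes the `openai-whisper` package and `ffmpeg` are installed, and `audio.mp3` stands in for your own file.

```python
from typing import List


def build_cli_command(audio_path: str, model: str = "base",
                      task: str = "transcribe") -> List[str]:
    """Return the argument list for the standalone `whisper` CLI.

    Hypothetical helper: it only assembles the command, it does not run it.
    `task` is either "transcribe" or "translate" (translate to English).
    """
    return ["whisper", audio_path, "--model", model, "--task", task]


def transcribe_file(audio_path: str, model: str = "base") -> str:
    """Transcribe one audio file with the Python library and return the text.

    Hypothetical helper; the import is done lazily so this module can be
    loaded even where Whisper is not installed.
    """
    import whisper  # provided by the `openai-whisper` package

    loaded = whisper.load_model(model)      # downloads weights on first use
    result = loaded.transcribe(audio_path)  # dict with "text", "segments", "language"
    return result["text"]
```

To run the CLI variant, pass the list to `subprocess.run(build_cli_command("audio.mp3"), check=True)`. For the library variant, call `transcribe_file("audio.mp3")`.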
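There are 9 models available across 5 sizes. The four smaller sizes come in both English-only and multilingual variants, while large is multilingual only: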
| Size | Parameters | English-only model | Multilingual model | Model file size | VRAM required |
|---|---|---|---|---|---|
| tiny | 39 M | ✓ | ✓ | ~76 MB | ~1 GB |
| base | 74 M | ✓ | ✓ | ~145 MB | ~1 GB |
| small | 244 M | ✓ | ✓ | ~484 MB | ~2 GB |
| medium | 769 M | ✓ | ✓ | ~1.5 GB | ~5 GB |
| large | 1550 M | ✗ | ✓ | ~3.1 GB | ~10 GB |
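One way to use the table is to pick the largest model that fits the available GPU memory. The sketch below is only illustrative: `pick_model` is a hypothetical helper, the VRAM figures are the approximate values from the table, and the `.en` suffix follows Whisper's naming for English-only models (for example `base.en`).

```python
from typing import Optional

# (size, approximate VRAM in GB, has an English-only variant), from the table
MODELS = [
    ("tiny", 1, True),
    ("base", 1, True),
    ("small", 2, True),
    ("medium", 5, True),
    ("large", 10, False),
]


def pick_model(vram_gb: float, english_only: bool = False) -> Optional[str]:
    """Return the largest model name that fits in `vram_gb`, or None.

    Hypothetical helper; treats the table's approximate VRAM values as
    hard limits, which is a simplification.
    """
    best = None
    for size, required_gb, has_english_variant in MODELS:
        if required_gb > vram_gb:
            continue  # does not fit in the available memory
        if english_only and not has_english_variant:
            continue  # large has no English-only variant
        best = size   # list is ordered small to large, so keep the last fit
    if best is None:
        return None
    return f"{best}.en" if english_only else best
```

For example, `pick_model(4)` selects `small`, and `pick_model(10, english_only=True)` falls back to `medium.en` because large has no English-only variant.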