Whisper is a general-purpose speech recognition model. Trained on a large dataset of diverse audio, it is a multi-task model that can perform multilingual speech recognition, speech translation, and language identification.

Code repo | Model card | Paper

  • The code and model weights are open-source under the MIT license, so it’s free to use for private and commercial applications.

  • It can be used as a standalone CLI or as a Python library.

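  • There are 9 models available: four sizes (tiny, base, small, medium) in both English-only and multilingual variants, plus a multilingual-only large model: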

usage: whisper [-h] [--model {tiny.en,tiny,base.en,base,small.en,small,medium.en,medium,large}]
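
As a rough illustration of the two ways to use Whisper mentioned above, here is a minimal Python sketch. It assumes the openai-whisper package is installed (pip install -U openai-whisper) and that ffmpeg is on the PATH. The helper names check_model, cli_command and transcribe_file, as well as the file name audio.mp3, are made up for this example. Only whisper.load_model and model.transcribe come from the library itself.

```python
import shlex
from typing import List

# Model names copied from the --model choices in the usage line above.
MODEL_NAMES = (
    "tiny.en", "tiny", "base.en", "base",
    "small.en", "small", "medium.en", "medium", "large",
)


def check_model(name: str) -> str:
    """Return the name unchanged if it is one of the nine listed models."""
    if name not in MODEL_NAMES:
        raise ValueError(f"unknown model {name!r}; expected one of {MODEL_NAMES}")
    return name


def cli_command(audio_path: str, model_name: str = "base") -> List[str]:
    """Build the argument list for the standalone CLI (hypothetical helper)."""
    return ["whisper", audio_path, "--model", check_model(model_name)]


def transcribe_file(audio_path: str, model_name: str = "base") -> str:
    """Transcribe one file through the Python library and return the text."""
    # Imported lazily so the helpers above also work without the package.
    import whisper

    model = whisper.load_model(check_model(model_name))
    result = model.transcribe(audio_path)
    return result["text"]


if __name__ == "__main__":
    # "audio.mp3" is a placeholder path, not a file from the original post.
    print(shlex.join(cli_command("audio.mp3", "small")))
```

The argument list returned by cli_command can be passed to subprocess.run to invoke the CLI, while transcribe_file stays in-process and downloads the model weights on first use.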

Specs

Size     Parameters   English-only model   Multilingual model   Model file size   VRAM required
tiny     39 M         ✓                    ✓                    ~76 MB            ~1 GB
base     74 M         ✓                    ✓                    ~145 MB           ~1 GB
small    244 M        ✓                    ✓                    ~484 MB           ~2 GB
medium   769 M        ✓                    ✓                    ~1.5 GB           ~5 GB
large    1550 M       ✗                    ✓                    ~3.1 GB           ~10 GB
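
One practical use of the specs above is picking the largest model that fits on a given GPU. The sketch below is only an illustration. The VRAM figures are the approximate values from the table, treated here as hard thresholds, which is an assumption. The names SPECS and largest_model_for are hypothetical and are not part of Whisper.

```python
from typing import Optional

# (size, has English-only variant, approximate VRAM in GB), smallest first.
# Values are taken from the Specs table; the "~" there means they are estimates.
SPECS = [
    ("tiny", True, 1),
    ("base", True, 1),
    ("small", True, 2),
    ("medium", True, 5),
    ("large", False, 10),
]


def largest_model_for(vram_gb: float, english_only: bool = False) -> Optional[str]:
    """Return the largest model name whose listed VRAM fits, or None."""
    best = None
    for size, has_english_variant, vram_needed in SPECS:
        if vram_needed > vram_gb:
            continue  # does not fit in the available memory
        if english_only and not has_english_variant:
            continue  # large has no .en variant
        best = f"{size}.en" if english_only else size
    return best


if __name__ == "__main__":
    print(largest_model_for(8))
    print(largest_model_for(16, english_only=True))
```

Real memory use depends on the audio length, the decoding options and whatever else is running on the GPU, so it is safer to leave some headroom than to pick a model that sits right at the limit.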