OpenAI, the company behind GPT-4, DALL-E, and ChatGPT, also has an AI model that can convert speech to text with astonishing accuracy. It’s named Whisper. Whisper can also translate speech from several languages into English. In the future, it’s expected to also translate speech into languages other than English. In this brief post, we’re going to cover how this product works, both from within the OpenAI Playground and by programmatically calling the API. The full product description for Whisper can be found at https://openai.com/research/whisper.
If you’re unfamiliar with the playground, read our article titled Learn The OpenAI Playground Step-By-Step, written by wordbot’s cofounder Kevin Sims. In a nutshell, the playground allows you to play with OpenAI’s AI models in a sandbox environment.
To demonstrate using Whisper in the playground, I’m going to record my voice on my Mac and upload the audio file into the playground. I’m only doing this so I can include the audio file in this post and reuse it in the API call example later. In reality, when playing with Whisper in the playground, I tend to simply use the microphone and record in real time to avoid having to upload an audio file. Below is a screenshot of how to use the microphone to capture your audio.
I created an audio file of me talking about our blog. It’s embedded below.
Next, I uploaded this audio file using the microphone and upload option in the playground.
Even my last name was correctly transcribed.
Now let’s see how to do the same thing, but by programmatically calling the API instead of using the OpenAI playground.
If you are creating your own website, plugin, or app and would like to include speech-to-text functionality via Whisper, you can do so with the Transcriptions API. The API is straightforward to use. For the full API documentation, visit https://platform.openai.com/docs/guides/speech-to-text/quickstart.
I’m not actually going to write software here, but rather show you a simple screenshot from the docs at the link above. In our example, you would simply replace the audio file with mine, make the call, and get the transcribed text in return. Most languages, including Python, C#, PHP, and JavaScript, support calling RESTful APIs, so you should have no trouble converting the Python example below to your language of choice.
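If you’d rather see the call as text, here is a minimal sketch of what a request to the Transcriptions API might look like using only Python’s standard library. It assumes the `https://api.openai.com/v1/audio/transcriptions` endpoint and the `whisper-1` model name from the docs linked above. The function names, the `my-recording.m4a` file name, and the `OPENAI_API_KEY` environment variable are placeholders of my own, so treat this as an illustration rather than official sample code.

```python
import json
import os
import urllib.request
import uuid

API_BASE = "https://api.openai.com/v1/audio"


def build_whisper_request(audio_bytes, filename, api_key,
                          task="transcriptions", model="whisper-1"):
    """Build (but do not send) a multipart/form-data POST request.

    `task` selects the endpoint: "transcriptions" or "translations".
    This helper is hypothetical; it is not part of any OpenAI library.
    """
    boundary = uuid.uuid4().hex
    # Form field carrying the model name.
    model_part = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="model"\r\n\r\n'
        f"{model}\r\n"
    ).encode()
    # Form field carrying the raw audio bytes.
    file_part = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode() + audio_bytes + b"\r\n"
    closing = f"--{boundary}--\r\n".encode()

    return urllib.request.Request(
        f"{API_BASE}/{task}",
        data=model_part + file_part + closing,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )


def transcribe(path, api_key, task="transcriptions"):
    """Send the audio file at `path` and return the text from the JSON reply."""
    with open(path, "rb") as audio_file:
        request = build_whisper_request(
            audio_file.read(), os.path.basename(path), api_key, task
        )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["text"]


# Example usage (placeholder file name; requires a real API key):
# print(transcribe("my-recording.m4a", os.environ["OPENAI_API_KEY"]))
```

Passing `task="translations"` to the same function should target the translation endpoint instead, which returns English text for speech recorded in another language. In a real project you would probably reach for OpenAI’s official client library rather than assembling the form body by hand; the point here is only to show that the call is an ordinary HTTP POST with a file and a model name.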
The power of Whisper and the Transcriptions API is unbelievable. I hope this small blog post has given you a solid overview of what it can do and helped point you in the correct direction to learn more and experiment.