OpenAI Launches GPT-4o: Faster, Multimodal AI for Enhanced Voice and Visual Interaction

GPT-4o (Credits: OpenAI)

The world was expecting a rival to Google Search, but Sam Altman and Co. had a different surprise in store. OpenAI never fails to surprise its audience: at its recent Spring Update live event, the company introduced the GPT-4o model for both the free and paid versions of ChatGPT.

"This is the first time that we are really making a huge step forward when it comes to the ease of use."

Mira Murati, Chief Technology Officer, OpenAI

GPT-4o (“o” for “omni”) is OpenAI’s newest flagship AI model. It promises GPT-4-level intelligence at greater speed and, according to the announcement, pairs a more natural, emotionally aware voice assistant with vision capabilities.

This is a significant step toward more natural human-computer interaction. According to a blog post from OpenAI, the new model can accept any combination of text, audio, and image as input and generate any combination of text, audio, and image as output.


It can respond to audio inputs in as little as 232 milliseconds, with an average of 320 milliseconds, which is comparable to average human response time in a conversation.

In addition to being significantly faster and 50% less expensive in the API, it matches GPT-4 Turbo performance on text in English and code and significantly improves on text in non-English languages. In particular, GPT-4o outperforms current models in visual and audio understanding.

GPT-3.5 vs GPT-4 vs GPT-4o

Before GPT-4o, you could speak with ChatGPT using Voice Mode with an average latency of 2.8 seconds (GPT-3.5) or 5.4 seconds (GPT-4). Voice Mode achieved this with a pipeline of three separate models: a simple model transcribes audio to text, GPT-3.5 or GPT-4 takes in that text and outputs text, and a third simple model converts the text back into audio.

As a result of this pipeline, the primary source of intelligence, GPT-4, loses a great deal of information: it cannot directly observe tone, multiple speakers, or background noise, and it cannot output laughter, singing, or emotional expression.
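For illustration, the sketch below shows roughly what such a three-stage pipeline looks like when stitched together from separate API calls. The openai Python SDK usage, model names, and file path are assumptions made for this example, not OpenAI's internal implementation.

```python
# Sketch of the pre-GPT-4o Voice Mode pipeline: three separate models,
# each passing only plain text to the next stage.
# Assumes the openai Python SDK (v1.x); model names are illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def voice_mode_pipeline(audio_path: str) -> bytes:
    # Stage 1: a simple speech-to-text model transcribes the audio.
    with open(audio_path, "rb") as audio_file:
        transcript = client.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file,
        )

    # Stage 2: GPT-4 sees only the transcribed text; tone, multiple
    # speakers, and background noise are already lost at this point.
    reply = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": transcript.text}],
    )

    # Stage 3: a text-to-speech model converts the reply back to audio.
    speech = client.audio.speech.create(
        model="tts-1",
        voice="alloy",
        input=reply.choices[0].message.content,
    )
    return speech.content  # raw audio bytes
```

GPT-4o replaces these three hops with a single model, which is where the latency and information-loss gains come from.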


With GPT-4o, OpenAI trained a single new model end-to-end across text, vision, and audio, meaning the same neural network handles all inputs and outputs. Since GPT-4o is the company’s first model to combine all of these modalities, OpenAI says it has only just begun to explore the model’s capabilities and limitations.

The new model also lets ChatGPT handle 50 languages with greater speed and quality. Developers can start building apps with it right away, as GPT-4o is available through OpenAI’s API.
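As a rough sketch of what that looks like in practice, the snippet below sends a text prompt plus an image to GPT-4o through the Chat Completions endpoint using the openai Python SDK; the image file name and prompt are hypothetical.

```python
# Minimal example: text + image input to GPT-4o via the Chat Completions API.
# Assumes the openai Python SDK (v1.x) and OPENAI_API_KEY in the environment.
import base64

from openai import OpenAI

client = OpenAI()

# Encode a local image so it can be sent inline as a data URL.
with open("chart.png", "rb") as f:  # hypothetical file
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What does this chart show?"},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{image_b64}"},
                },
            ],
        }
    ],
)

print(response.choices[0].message.content)
```

Audio input and output are handled natively by the model in ChatGPT's new Voice Mode, while the public API initially exposes text and vision.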

In addition to being accessible to ChatGPT free users (subject to usage caps), GPT-4o is also now available to ChatGPT Plus and Team users, and Enterprise users will have access to it shortly. The message limit for Plus members is up to five times larger than that of free users, and it is considerably higher for Team and Enterprise users.