Artificial intelligence models have been capable of analyzing images and describing their content for years, but OpenAI’s Spring Update marked a significant advancement.
The introduction of GPT-4o in ChatGPT — even without voice and video features — showcased one of the most advanced AI vision models to date.
This model’s success is partly attributed to its native multimodal capabilities, allowing it to comprehensively understand and reason across images, video, sound, and text. Unlike other models that first convert everything to text, GPT-4o processes these different formats directly.
To evaluate its abilities, I gave the model a series of images and asked it to describe what it saw; the accuracy of those descriptions is a good indicator of the model's quality. AI vision models, including GPT-4, often miss an object or two or describe something incorrectly. For each test, I showed ChatGPT-4o an image and asked, “What is this?” without giving any additional context or information. This mirrors how people are likely to use the feature in real-world scenarios, similar to how I used it at an event in Paris.
The goal was to assess how well it analyzed each picture. After each test, I also asked whether it could tell if the image was AI-generated. All the images were created using Ideogram from descriptions generated by Claude 3, making them entirely AI-produced.
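For readers who want to reproduce a similar test programmatically rather than through the ChatGPT interface, here is a minimal sketch using the OpenAI Python SDK. The model name, file name, and helper function are assumptions for illustration; the article's tests were run in the ChatGPT app itself, not via the API.

```python
# Hypothetical sketch: sending an image to GPT-4o with the bare prompt "What is this?"
# Assumes the OpenAI Python SDK (v1.x) and an OPENAI_API_KEY in the environment.
import base64
from openai import OpenAI

client = OpenAI()

def describe_image(path: str) -> str:
    """Ask GPT-4o to describe a local image with no additional context."""
    with open(path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "What is this?"},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{image_b64}"},
                    },
                ],
            }
        ],
    )
    return response.choices[0].message.content

# Example usage with a hypothetical test image filename.
print(describe_image("test_image_01.png"))
```

A follow-up question such as “Was this image AI-generated?” could be sent as a second user message in the same conversation to mirror the second half of each test.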
After examining the first two images, ChatGPT-4o began automatically indicating whether it believed the image was AI-generated, without me having to prompt it.