With the introduction of GPT-4o, OpenAI has reinforced its position as the leading company in artificial intelligence. This cutting-edge multimodal AI tool integrates text, voice, and visual capabilities and runs significantly faster than its predecessors, greatly improving the user experience. But GPT-4o’s most appealing feature is that it appears to be free, and that appearance is somewhat misleading.
There is no subscription fee for GPT-4o, but users effectively pay with their data. GPT-4o collects all the information users input, whether it’s text, audio files, or images, much like a black hole absorbs everything nearby.
This AI not only gathers users’ data but also captures third-party information revealed during interactions. For instance, if you request a summary of a New York Times article by sharing a screenshot, GPT-4o processes it and provides the summary, but OpenAI retains the copyrighted material from the screenshot to train its model.
OpenAI isn’t alone in this practice. Over the past year, companies like Microsoft, Meta, Google, and X have updated their privacy policies to potentially allow the collection of user data for AI training. Despite facing multiple lawsuits in the U.S. for unauthorized use of copyrighted content, these companies continue to seek data to enhance their models.
High-quality training data is becoming scarce. In late 2021, OpenAI reportedly transcribed over a million hours of YouTube videos, in apparent violation of the platform’s rules. Google, YouTube’s parent company, has not taken legal action against OpenAI, possibly to avoid drawing scrutiny to its own data-collection practices.
With GPT-4o, OpenAI is leveraging its growing user base, attracted by the free service, to gather extensive multimodal data. This mirrors the familiar tech-platform business model of offering free services while profiting from data harvesting, a practice Harvard professor Shoshana Zuboff has termed ‘surveillance capitalism’.
Users can opt out of having their ‘chats’ used for model training, but doing so via ChatGPT’s settings page also disables chat history, so users lose access to past conversations. To opt out without losing chat history, users must instead navigate a more complex process through OpenAI’s privacy portal. Either way, the added transaction costs likely deter many users from opting out.
Even with user consent, copyright-infringement issues remain, since users may share data they do not own. This imposes ‘externalities’, or spillover costs, on the original content creators. User consent alone is therefore insufficient to prevent copyright violations.
Holding companies like OpenAI accountable for potential copyright violations is challenging because AI-generated content rarely resembles the original data, making it difficult for copyright holders to identify usage in model training. Companies can also claim ignorance, as they receive the content from users.
Some content creators and publishers have implemented measures to protect their material from AI data scraping, such as technological blocks and updated terms of service. Sony Music, for example, recently warned over 700 AI companies and streaming platforms against unauthorized use of its content.
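One common form of technological block, sketched here under the assumption that the site simply wants to refuse known AI crawlers, is a robots.txt file listing the relevant user-agent tokens (OpenAI’s crawler identifies itself as GPTBot; Google’s AI-training crawler token is Google-Extended):

```
# robots.txt sketch: refuse known AI-training crawlers site-wide.
# Note: compliance is voluntary; robots.txt signals a policy,
# it does not technically enforce one.
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /
```

Because a crawler that chooses to ignore robots.txt is not actually stopped, publishers typically pair such directives with updated terms of service and legal warnings of the kind Sony Music issued.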
Ultimately, the most effective solution to address these externalities is for regulators to restrict AI companies’ ability to collect and use data shared by users.