The rapid growth of AI technology is reshaping our digital landscape in unprecedented ways.
According to a recent study, AI could deplete all publicly available internet text within a few years, possibly by 2026, potentially pushing developers to use private information.
AI systems such as GPT-4 and Claude 3 Opus may exhaust the publicly available data they rely on to improve. This could force tech companies to seek alternative data sources, such as synthetic data, lower-quality content, or private data from servers storing messages and emails. The study was published on June 4 on the preprint server arXiv.
“If chatbots consume all available data without advances in data efficiency, AI progress may stagnate,” said Pablo Villalobos, a researcher at Epoch AI and the study’s lead author, in an interview with Live Science. Without sufficient training data, AI models could produce lower-quality outputs, as seen with Google’s Gemini AI, which provided unreliable suggestions due to poor data sources.
Researchers estimated the amount of online text using Google’s web index, calculating that it covers around 250 billion web pages with approximately 7,000 bytes of text each. Projections based on internet traffic and user activity suggest that high-quality data will be depleted by 2032, with low-quality data lasting until 2050. Image data could be exhausted between 2030 and 2060.
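As a rough check on the scale of that estimate, the two figures quoted above can simply be multiplied out. The sketch below is illustrative only: the page count and bytes per page come from the article, while the ratio of 4 bytes per token is an assumed rule of thumb for English text, not a figure from the study, and the function names are hypothetical.

```python
# Figures quoted in the article
PAGES_INDEXED = 250_000_000_000   # about 250 billion web pages
BYTES_PER_PAGE = 7_000            # about 7,000 bytes of text per page

# Assumption (not from the study): roughly 4 bytes of English text per token
ASSUMED_BYTES_PER_TOKEN = 4


def total_text_bytes(pages: int = PAGES_INDEXED,
                     bytes_per_page: int = BYTES_PER_PAGE) -> int:
    """Total bytes of text implied by a page count and an average page size."""
    return pages * bytes_per_page


def approx_tokens(total_bytes: int,
                  bytes_per_token: int = ASSUMED_BYTES_PER_TOKEN) -> float:
    """Very rough token count under the assumed bytes-per-token ratio."""
    return total_bytes / bytes_per_token


if __name__ == "__main__":
    total = total_text_bytes()
    print(f"Total text: {total / 1e15:.2f} petabytes")
    print(f"Approximate tokens (assumed ratio): {approx_tokens(total):.2e}")
```

With the article's figures, the product works out to about 1.75 petabytes of text. The token count depends entirely on the assumed ratio and should be read as an order-of-magnitude guide.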
AI systems improve predictably with larger datasets, a phenomenon known as the neural scaling law. Whether companies can enhance model efficiency to compensate for the lack of new data remains uncertain. Villalobos noted that data scarcity might not severely hinder AI growth, as companies could increasingly use private data, similar to Meta’s new policy to train its AI using interactions on its platforms. Other challenges, such as power consumption and hardware costs, may become more significant.
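To make the idea of predictable improvement concrete, scaling laws are commonly written as a power law in which loss falls as the dataset grows. The sketch below assumes that simple form. The constants `d_c` and `alpha` are hypothetical values chosen for illustration and are not taken from the study or from any specific model.

```python
def loss_from_data(tokens: float, d_c: float = 1e13, alpha: float = 0.1) -> float:
    """Illustrative power-law loss: (d_c / tokens) ** alpha.

    d_c and alpha are hypothetical constants, not measured values.
    """
    if tokens <= 0:
        raise ValueError("tokens must be positive")
    return (d_c / tokens) ** alpha


def relative_gain_from_doubling(alpha: float = 0.1) -> float:
    """Fraction by which loss drops each time the dataset doubles."""
    return 1 - 2 ** (-alpha)


if __name__ == "__main__":
    for tokens in (1e12, 2e12, 4e12, 8e12):
        print(f"{tokens:.0e} tokens -> loss {loss_from_data(tokens):.4f}")
    print(f"Loss reduction per doubling: {relative_gain_from_doubling():.1%}")
```

Under this assumed form, every doubling of the data cuts the loss by the same fraction, which is why a ceiling on available data translates into a ceiling on this route to improvement unless efficiency gains compensate.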
Using synthetic data is another potential solution, though so far it has succeeded mainly in specific areas such as gaming, coding, and math. Harvesting private data or intellectual property without permission could lead to legal challenges, as highlighted by Rita Matulionyte, a technology and intellectual property law expert.
Data scarcity is not the only hurdle for AI advancement. AI-powered searches consume significantly more electricity than traditional searches, leading tech leaders to explore nuclear fusion for data center power, although this technology is still in its infancy.