Data is at the Core of Modern AI Systems
Data is fundamental to today’s advanced AI systems, but the cost of acquiring it is rising, putting it out of reach for all but the wealthiest tech companies.
Last year, James Betker, a researcher at OpenAI, wrote a blog post about generative AI models and their training datasets. Betker argued that training data — rather than a model’s design, architecture, or other features — is crucial for creating increasingly sophisticated and capable AI systems.
“Trained on the same dataset for long enough, pretty much every model converges to the same point,” Betker wrote.
Is Betker correct? Is training data the most significant factor in determining a model’s capabilities, whether it’s answering questions, drawing human hands, or generating realistic cityscapes?
It’s certainly a plausible idea.
Statistical Machines
Generative AI systems are essentially probabilistic models: vast collections of statistics. From many examples, they learn to predict which data fits best in a given context (e.g., that the word “go” belongs between “I” and “to the market” in the sentence “I go to the market”). It stands to reason that the more examples a model has seen, the better its performance.
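To make that intuition concrete, here is a minimal sketch of next-word prediction using a toy bigram model that counts which word follows which in a tiny corpus. This illustrates only the statistical principle, not how production systems are built (those use neural networks), but the underlying idea of estimating probabilities from examples is the same.

```python
from collections import Counter, defaultdict

# Toy bigram model: count how often each word follows another in a corpus.
# More training examples yield sharper, more reliable probability estimates,
# which is the intuition behind "performance gains come from data".
corpus = "i go to the market . you go to the park . i go to the park .".split()

follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def next_word_probs(word):
    counts = follows[word]
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

print(next_word_probs("go"))   # {'to': 1.0}, since "go" is always followed by "to"
print(next_word_probs("the"))  # {'market': 0.33..., 'park': 0.66...}
```

With only three sentences, the estimates for a word like “the” are shaky; feed the model more text and they stabilize, which is the statistical sense in which more data makes a better model.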
“As long as you have a stable training setup, it does seem like the performance gains are coming from data,” Kyle Lo, a senior applied research scientist at the Allen Institute for AI (AI2), an AI research nonprofit, told TechCrunch.
Data Quality Over Quantity
Lo cited Meta’s Llama 3, a text-generating model released earlier this year, which outperforms AI2’s OLMo model despite their architectural similarities. Llama 3 was trained on significantly more data than OLMo, which Lo believes accounts for its superior performance on many popular AI benchmarks.
(It’s worth noting that current AI benchmarks might not be the best measure of a model’s performance, but outside of qualitative tests like our own, they’re among the few metrics available.)
However, training on exponentially larger datasets doesn’t necessarily lead to exponentially better models. Models follow the “garbage in, garbage out” principle, meaning data quality and curation are crucial, possibly even more so than sheer quantity.
“A small model with carefully curated data can outperform a large model,” Lo explained. For instance, the large Falcon 180B ranks 63rd on the LMSYS leaderboard, while the far smaller Llama 2 13B ranks higher, at 56th.
In an interview with TechCrunch last October, OpenAI researcher Gabriel Goh highlighted that higher-quality annotations significantly improved the image quality in DALL-E 3, OpenAI’s text-to-image model, over its predecessor DALL-E 2. “The text annotations are a lot better than they were [with DALL-E 2] — it’s not even comparable,” he said.
Many AI models, including DALL-E 3 and DALL-E 2, are trained using data labeled by human annotators. This helps the model learn to associate labels with observed characteristics. For example, a model fed with many cat pictures annotated by breed will eventually learn to associate terms like bobtail and shorthair with their distinctive visual traits.
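As a deliberately simplified sketch of that label-to-trait association (not DALL-E’s actual training setup, which learns from free-text captions), here is a toy classifier trained on hypothetical feature vectors standing in for visual traits, paired with human breed annotations:

```python
from sklearn.linear_model import LogisticRegression

# Hypothetical features: [tail length, fur length], scaled 0-1.
# In a real system these would be learned from pixels, not hand-coded.
features = [
    [0.2, 0.9],  # short tail, long fur
    [0.1, 0.8],
    [0.9, 0.3],  # long tail, short fur
    [0.8, 0.2],
]
labels = ["bobtail", "bobtail", "shorthair", "shorthair"]  # human annotations

clf = LogisticRegression().fit(features, labels)

# The model has associated each annotated label with its visual traits:
print(clf.predict([[0.15, 0.85]]))  # -> ['bobtail']
```

Better annotations, as Goh noted, translate directly into a cleaner mapping for the model to learn.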
Concerns Over Data Accessibility in AI Development
Experts like Kyle Lo are concerned that the growing emphasis on large, high-quality training datasets will centralize AI development among the wealthiest tech companies with billion-dollar budgets. While innovations in synthetic data or fundamental architecture could disrupt this trend, they don’t seem imminent.
“Entities controlling valuable content for AI development are motivated to restrict access,” Lo said. “As access to data becomes limited, early movers on data acquisition are being favored, making it harder for others to catch up.”
Indeed, the race to acquire training data has often led to unethical (and sometimes illegal) behavior, such as aggregating copyrighted content without permission. This situation benefits tech giants with deep pockets for data licensing.
Generative AI models, like those from OpenAI, are trained on various data types — some copyrighted — sourced from public web pages, including problematic AI-generated content. Companies claim fair use protections, but many rights holders disagree, though they currently have limited recourse.
There are numerous examples of AI companies acquiring large datasets through questionable means. For instance, OpenAI reportedly transcribed over a million hours of YouTube videos without permission to train GPT-4. Google has updated its terms of service to use public Google Docs and other online content for its AI products, and Meta has considered using IP-protected content despite potential lawsuits.
Additionally, companies, including large startups like Scale AI, rely on low-paid workers in developing countries to annotate training sets, often exposing them to disturbing content without benefits or job security.
Even legitimate data deals aren’t fostering an equitable AI ecosystem. OpenAI has spent hundreds of millions licensing content from news publishers and stock media libraries, far beyond the budget of most academic research groups, nonprofits, and startups. Meta even considered buying the publisher Simon & Schuster for e-book rights, though the publisher was eventually sold to KKR for $1.62 billion in 2023.
With the AI training data market expected to grow from $2.5 billion to nearly $30 billion within a decade, data brokers and platforms are charging top dollar. Shutterstock has secured deals worth $25 million to $50 million, and Reddit claims to have made hundreds of millions from data licensing. Platforms with large datasets, like Photobucket, Tumblr, and Stack Overflow, have also signed agreements with AI developers. However, the users who actually created that content typically see none of the money, and the rising prices squeeze the broader AI research community.
“Smaller players can’t afford these data licenses and, therefore, can’t develop or study AI models,” Lo said. “This could lead to a lack of independent scrutiny of AI development practices.”
Independent Efforts
Despite these challenges, some independent and nonprofit efforts aim to create accessible datasets for generative AI models. EleutherAI, a grassroots nonprofit, is working with the University of Toronto, AI2, and independent researchers on The Pile v2, a dataset sourced primarily from the public domain.
In April, AI startup Hugging Face released FineWeb, a filtered version of Common Crawl that it claims improves model performance on many benchmarks. Other open datasets, like LAION’s image sets, have faced legal and ethical challenges, and newer efforts are committed to better practices: The Pile v2, for instance, removes problematic copyrighted material found in its predecessor, The Pile.
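For researchers who want to work with such open datasets, FineWeb can be streamed from the Hugging Face Hub rather than downloaded in full. A minimal sketch, assuming the dataset lives under the HuggingFaceFW/fineweb repository (the config name below is one Common Crawl snapshot; check the Hub for current names):

```python
from datasets import load_dataset

# Stream FineWeb instead of downloading it outright (it is multi-terabyte).
# Repo id and config name are assumptions -- verify them on the Hub.
fw = load_dataset(
    "HuggingFaceFW/fineweb",
    name="CC-MAIN-2024-10",  # a single Common Crawl snapshot
    split="train",
    streaming=True,
)

for i, row in enumerate(fw):
    print(row["text"][:200])  # each record carries cleaned page text
    if i >= 2:
        break
```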
The question remains whether these open efforts can keep pace with Big Tech. As long as data collection and curation are resource-intensive, the answer is likely no — unless a significant research breakthrough levels the playing field.