
Technologist Brian Roemmele cautioned in 2015 that low-quality internet data would lead to biased training of artificial intelligence, resulting in inaccurate outputs. His worry has taken center stage in AI debates a decade later. By 2025, large language models rely heavily on platforms like Reddit, which alone accounts for 40.1% of training citations. While these sources provide volume, they often lack accuracy and editorial oversight. Roemmele proposed a fix: "high protein" data from 1870 to 1970, drawn from verified historical archives that predate social media noise. Recent studies indicate that this approach could significantly enhance AI model performance, making data quality the next key battleground in AI development.
Heavy Reliance on Low-Quality User-Generated Content Risks AI Reliability
Statista's 2025 infographic on AI data sources reveals Reddit dominates with 40.1% of citations, followed by Wikipedia, YouTube, and Google. This reliance on user-generated content echoes Roemmele's warning about "Internet Sewage." Such sources offer breadth and diversity, but they frequently carry conspiracy theories, unverified claims, and cultural bias. A 2023 study published in the Journal of Artificial Intelligence Research found that models trained on non-curated sources such as Reddit exhibited far higher rates of misinformation than models trained on curated data.
Beyond accuracy concerns, privacy issues are emerging. A 2025 MIT Technology Review investigation found personal data, such as passports and credit cards, embedded in major training datasets, raising ethical and legal risks. These problems are compounded by the sheer scale of low-quality input, which makes filtering harmful material harder. Critics argue that the AI industry's obsession with scale has overshadowed the importance of trustworthiness. Roemmele's stance is that refining smaller, more reliable datasets, especially historical ones, could outperform massive but noisy internet archives. Research from Nature Machine Intelligence supports this, showing that improving just 10% of training data quality can yield a 20% performance boost. For AI's future, the challenge is not just the quantity but the integrity of the information models consume.
“High Protein” Data Could Recalibrate AI’s Knowledge Base
Roemmele's proposal, the so-called "high protein" data, shifts training toward knowledge produced between 1870 and 1970, a century rich in verifiable, practical information. This comprises books, scholarly journals, and newspapers that passed through editorial processes, which reduces the clutter found on current websites. Efforts such as the Library of Congress's Chronicling America, the Internet Archive, and Google Books have already made a good start on indexing such sources, backed by OCR technology that turns scanned pages into searchable text, as sketched below.
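To make the OCR step concrete, here is a minimal sketch of how a scanned archive page might be converted into machine-readable text. It assumes the open-source Tesseract engine via the pytesseract wrapper, and the file name is a hypothetical placeholder, not a real Chronicling America asset.

```python
# Minimal OCR sketch: convert a scanned archive page into plain text.
# Requires a local Tesseract install plus the pillow and pytesseract packages.
from PIL import Image
import pytesseract

def digitize_page(image_path: str) -> str:
    """Extract searchable text from a scanned page image via Tesseract OCR."""
    page = Image.open(image_path).convert("L")  # grayscale often helps OCR accuracy
    return pytesseract.image_to_string(page)

if __name__ == "__main__":
    # Hypothetical file name, for illustration only.
    text = digitize_page("newspaper_page_1893.png")
    print(text[:500])  # preview the recovered text
```

In practice, digitization pipelines layer deskewing, noise removal, and manual proofreading on top of raw OCR output, since recognition errors in century-old typefaces would otherwise feed straight into the training corpus.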
The approach is supported by empirical evidence. A 2025 ScienceDirect study confirmed that accuracy, completeness, and consistency directly improve machine learning outcomes. High-quality historical datasets could provide AI with a stable foundation of facts, context, and cultural nuance. While synthetic data is gaining popularity, it often mirrors the flaws of its source material. Historical archives offer a fundamentally different path: information rooted in human experience before the distortion of modern digital platforms. Challenges remain: digitization is labor-intensive and copyright concerns persist, but the long-term payoff could be a generation of AI models with reduced bias, fewer hallucinations, and greater reliability. Roemmele's vision reframes the data race from "more" to "better," suggesting that the future of AI may depend on rediscovering the past.
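As a rough illustration of what quality-first curation can look like in code, the sketch below filters a text corpus with simple heuristics (minimum length, symbol-to-text ratio, exact deduplication) before it would be handed to a training pipeline. The thresholds and checks are illustrative assumptions, not the methodology of the studies cited above.

```python
# Heuristic corpus-curation sketch: keep only documents passing basic quality checks.
# All thresholds are illustrative assumptions, not values from any cited study.

def is_high_quality(doc: str, min_words: int = 50, max_symbol_ratio: float = 0.2) -> bool:
    """Reject documents that are too short or dominated by non-text symbols."""
    words = doc.split()
    if len(words) < min_words:  # too short to carry useful context
        return False
    symbols = sum(1 for ch in doc if not (ch.isalnum() or ch.isspace()))
    return symbols / max(len(doc), 1) <= max_symbol_ratio  # high ratio suggests markup debris

def curate(corpus: list[str]) -> list[str]:
    """Drop exact duplicates, then apply the quality heuristics."""
    seen: set[str] = set()
    kept = []
    for doc in corpus:
        key = doc.strip().lower()
        if key not in seen and is_high_quality(doc):
            seen.add(key)
            kept.append(doc)
    return kept
```

Real curation stacks add near-duplicate detection, language identification, and toxicity classifiers, but even crude filters like these embody the shift from collecting more data to keeping better data.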
Quality Data as AI’s Next Competitive Frontier
Brian Roemmele's decade-old warning now resonates across AI research: models trained on vast, low-quality datasets risk becoming engines of misinformation. With Reddit and other user-generated platforms dominating citations, AI systems inherit biases and factual errors at scale. His alternative, curated historical "high protein" data, offers a route to more trustworthy, capable models. Supported by recent studies, this strategy shifts the focus from amassing content to refining it. Further progress on digitization projects may be the tipping point. In the race to build the best AI, it is not the volume of data but its quality that may prove the deciding factor.