Artificial intelligence (AI) systems could consume all of the internet's freely available knowledge as soon as 2026, a new study warns. AI models such as GPT-4, which powers ChatGPT, and Claude 3 Opus rely on the trillions of words shared online to improve their capabilities. However, projections suggest they will exhaust the supply of publicly available data sometime between 2026 and 2032.
To build better models, tech companies will need to look elsewhere for data. Options include generating synthetic data, turning to lower-quality sources, or tapping into private data, such as the messages and emails stored on company servers. The researchers published their findings on June 4 on the preprint server arXiv.
Pablo Villalobos, a researcher at the institute Epoch AI, suggests that if chatbots consume all of the available data and there are no further advances in data efficiency, the field could enter a period of relative stagnation, with models improving only slowly as new algorithmic insights emerge and new data is naturally produced.
Training data fuels AI systems’ growth, enabling them to fish out ever more complex patterns and root them inside their neural networks. For example, ChatGPT was trained on roughly 570 GB of text data, amounting to roughly 300 billion words, taken from books, online articles, Wikipedia, and other online sources. Algorithms trained on insufficient or low-quality data produce unreliable outputs.
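For readers who want to sanity-check dataset-size claims like these against a corpus of their own, a minimal sketch follows. The file path is hypothetical, and the bytes-per-word ratio it prints will vary with language, formatting, and encoding.

```python
# Minimal sketch: measure a text corpus's size and word count,
# to sanity-check figures like "570 GB ~ 300 billion words".
# "corpus.txt" is a hypothetical path, not a real dataset.
from pathlib import Path

path = Path("corpus.txt")
n_bytes = path.stat().st_size
with path.open(encoding="utf-8") as f:
    n_words = sum(len(line.split()) for line in f)

print(f"{n_bytes / 1e9:.2f} GB, {n_words / 1e9:.2f} billion words, "
      f"{n_bytes / n_words:.1f} bytes per word")
```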
To estimate how much text is available online, the researchers used Google’s web index, calculating that it currently spans about 250 billion web pages with an average of 7,000 bytes of text per page. They then used follow-up analyses of internet protocol (IP) traffic and of online user activity to project how this available data stock will grow.
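As a back-of-envelope check, those two figures multiply out to the total text stock the study starts from. The sketch below reproduces that arithmetic; the page count and bytes-per-page come from the study as reported above, while the bytes-per-word conversion is an illustrative assumption, not a number from the paper.

```python
# Back-of-envelope estimate of the indexed web's text stock,
# using the figures reported above. BYTES_PER_WORD is an assumed
# average for English prose, not a value from the study.

WEB_PAGES = 250e9        # ~250 billion indexed pages (from the study)
BYTES_PER_PAGE = 7_000   # ~7,000 bytes of text per page (from the study)
BYTES_PER_WORD = 6       # assumption: rough average, including spaces

total_bytes = WEB_PAGES * BYTES_PER_PAGE
total_words = total_bytes / BYTES_PER_WORD

print(f"Text stock: ~{total_bytes / 1e15:.2f} petabytes")
print(f"Word count: ~{total_words / 1e12:.0f} trillion words")
# -> Text stock: ~1.75 petabytes
# -> Word count: ~292 trillion words
```

Under that assumed conversion rate, the indexed web holds on the order of a few hundred trillion words of text, consistent with the trillions of words the study says current models already draw on.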
Read more: www.livescience.com