The Cost of AI Training Data Is a Luxury Only Big Tech Can Afford

The rapid advancement of artificial intelligence (AI) has been fueled by data. However, the cost of AI training data is skyrocketing, making it a luxury that only the wealthiest tech companies can afford. This trend is leading to an increasingly centralized AI development landscape, where small businesses and academic institutions struggle to compete.

James Betker, a researcher at OpenAI, argued in a blog post that training data, rather than a model’s design or architecture, is the key to increasingly sophisticated and capable AI systems. He suggested that given the same dataset, almost every model would converge to the same point if trained for long enough.

Generative AI systems are essentially probabilistic models that make educated guesses based on vast amounts of examples. It seems intuitive that the more examples a model has to go on, the better its performance. Kyle Lo, a senior applied research scientist at the Allen Institute for AI, agreed with this sentiment, stating that performance gains seem to be coming from data, at least once a stable training setup is established.

However, training on larger datasets isn’t a guaranteed path to better models. Data curation and quality matter a great deal, perhaps even more than sheer quantity. A small model trained on carefully designed data could outperform a large one.

The high cost of AI training data is creating a barrier to entry in the AI field. This could stifle innovation and limit independent research into, and scrutiny of, AI technology. Furthermore, large tech companies are consolidating their lead in the field by acquiring copyrighted content or leveraging public data sources.

Read more: techcrunch.com