OpenAI’s GPT-4 Trained on Over a Million Hours of YouTube Transcripts

In its search for high-quality training data for GPT-4, its most advanced large language model, OpenAI reportedly transcribed more than a million hours of YouTube videos. According to The New York Times, the company knew the practice was legally questionable but believed it qualified as fair use. OpenAI president Greg Brockman was personally involved in collecting the videos that were used.

OpenAI says it curates “unique” datasets for each of its models to “help their understanding of the world” and maintain its global research competitiveness. The company uses “numerous sources including publicly available data and partnerships for non-public data,” and is looking into generating its own synthetic data.

The company reportedly exhausted its supply of useful data in 2021 and, having run through its other resources, discussed transcribing YouTube videos, podcasts, and audiobooks. By then, it had trained its models on data that included computer code from GitHub, chess move databases, and schoolwork content from Quizlet.

Google spokesperson Matt Bryant said the company has “seen unconfirmed reports” of OpenAI’s activity, adding that “both our robots.txt files and Terms of Service prohibit unauthorized scraping or downloading of YouTube content.” He said Google takes “technical and legal measures” to prevent such unauthorized use “when we have a clear legal or technical basis to do so.”

Nimbus27
