OpenAI May Have Scraped Millions of Hours of YouTube Videos to Train GPT-4

April 16, 2024

A recent report from The New York Times alleges that OpenAI scraped more than a million hours of YouTube videos to train GPT-4, its most advanced large language model to date.

According to the report, the AI company had exhausted its supply of reputable data in 2021. Needing more high-quality data, OpenAI developed Whisper, a speech recognition model, and used it to transcribe the YouTube videos.

The Times reports that OpenAI was aware of the legal ambiguity surrounding its actions but “believed it to be fair use.” Three people with knowledge of the conversations said that some OpenAI employees discussed how the move might violate YouTube’s policies, which prohibit using its videos for applications that are “independent” of the video platform. The report also states that Greg Brockman, president of OpenAI, was involved in gathering the videos.

Lindsay Held, an OpenAI spokesperson, told The Verge in an email that the company curates “unique” datasets for each model to “help their understanding of the world” and to maintain its global research competitiveness. Held added that the company uses “numerous sources including publicly available data and partnerships for non-public data” and is looking into generating its own synthetic data.

Google, YouTube’s parent company, said it had “seen unconfirmed reports” of OpenAI’s activity. “Both our robots.txt files and Terms of Service prohibit unauthorized scraping or downloading of YouTube content,” the company said.
