- Artificial intelligence is running out of data to train on.
- With so much money at stake, AI companies do not hesitate to bend the law to obtain data that is not public.
Introduction
Google accused OpenAI last week of using YouTube videos to train Sora. Now an investigation by The New York Times claims that OpenAI also used more than a million hours of YouTube videos to train Whisper, its AI that converts audio into text.
As expected, this has not sat well with Google, because OpenAI is not only using its data but is also its most direct rival in the field of artificial intelligence.
We'll see whether this case reaches the courts or whether the two companies strike an agreement that benefits both.
OpenAI used YouTube videos to train its AI
Artificial intelligence needs real-world data to improve. And the more capable an AI becomes, the more data it needs.
According to The New York Times, via The Verge, the major AI companies have already consumed all the public data available for training AI, as well as the private collections they have licensed through agreements.
According to the investigation, OpenAI ran out of data in 2021. So its executives discussed the possibility of using YouTube videos, podcasts, and audiobooks, even knowing that this put them in a legal “gray area.”
In the end, they decided to use a million hours of YouTube videos, extracting the audio to train Whisper, their speech-to-text AI. They would invoke “fair use,” on the grounds that they were using only a fraction of the hundreds of billions of hours of video on YouTube.
YouTube as a Goldmine:
Reportedly, the president of OpenAI himself, Greg Brockman, was involved in collecting those videos.
Google spokesman Matt Bryant told The Verge that the company had “seen unconfirmed reports” of OpenAI's activity and stated that “both our robots.txt files and the terms of service prohibit the scraping or unauthorized download of YouTube content.”
The New York Times investigation also claims that Meta ran out of data a long time ago and considered licensing books and even buying a large publishing house.
According to some experts, by 2028 AI companies will need more data than is being generated.
The solution is to create synthetic data, that is, data generated artificially for use in AI training, or to adopt other training methods that do not require so much data. But so far, neither approach has worked.
Conclusion
AI companies are competing in a relentless race to dominate a market that will generate a lot of money, and they do not hesitate to ignore copyright in order to train their AI faster than their rivals. It is a suicidal race that casts doubt on the supposed safety of that AI, assuming it does not annihilate us or make us its slaves first…