Meta, Google, OpenAI used protected data to train LLMs, report

by Ohio Digital News

Gary Marcus is a leading AI researcher who’s increasingly appalled at what he’s seeing. He founded at least two AI startups, one of which sold to Uber, and has been researching the subject for over two decades. Just last weekend, the Financial Times called him “perhaps the noisiest AI questioner” and reported that Marcus assumed he was targeted by a critical Sam Altman post on X: “Give me the confidence of a mediocre deep-learning skeptic.”

Marcus doubled down on his critique the day after he appeared in the FT, writing on his Substack about “generative AI as Shakespearean tragedy.” The subject was a bombshell report from the New York Times that OpenAI violated YouTube’s terms of service by scraping over a million hours of user-generated content. What’s worse, Google’s need for data to train its own AI model was so insatiable that it did the same thing, potentially violating the copyrights of the content creators whose videos it used without their consent.

As far back as 2018, Marcus noted, he has expressed doubts about the “data-guzzling” approach to training that sought to feed AI models with as much content as possible. In fact, he listed eight of his warnings, dating all the way back to his diagnosis of hallucinations in 2001, all coming true like a curse on Macbeth or Hamlet manifesting in the fifth act. “What makes all this tragic is that many of us have tried so hard to warn the field that we would wind up here,” Marcus wrote.

While Marcus declined to comment to Fortune, the tragedy goes well beyond the fact that nobody listened to critics like him and Ed Zitron, another prominent skeptic cited by the FT. According to the Times, which cites numerous background sources, both Google and OpenAI knew what they were doing was legally dubious—banking on the fact that copyright in the age of AI had yet to be litigated—but felt they had no choice but to keep pumping data into their large language models to stay ahead of their competition. And in Google’s case, it potentially suffered harm as a result of OpenAI’s massive scraping efforts, but its own bending of the rules to scrape the very same data left it with a proverbial arm tied behind its back.

Did OpenAI use YouTube videos?

Google employees became aware that OpenAI was taking YouTube content to train its models, which would violate YouTube’s terms of service and possibly infringe the copyright protections of the creators to whom the videos belong. Caught in this bind, Google decided not to denounce OpenAI publicly because it was afraid of drawing attention to its own use of YouTube videos to train AI models, the Times reported.

A Google spokesperson told Fortune the company had “seen unconfirmed reports” that OpenAI had used YouTube videos. They added that YouTube’s terms of service “prohibit unauthorized scraping or downloading” of videos, which the company has a “long history of employing technical and legal measures to prevent.” 

Marcus says the behavior of these big tech firms was predictable because data was the key ingredient needed to build the AI tools these companies were in an arms race to develop. Without quality data, like well-written novels, podcasts by knowledgeable hosts, or expertly produced movies, the chatbots and image generators risk spitting out mediocre content. That idea can be summed up in the data science adage “crap in, crap out.” In an op-ed for Fortune, Jim Stratton, the chief technology officer of HR software company Workday, said “data is the lifeblood of AI,” making the “need for quality, timely data more important than ever.”

Around 2021, OpenAI ran into a shortage of data. Desperately needing more instances of human speech to continue improving its ChatGPT tool, which was still about a year away from being released, OpenAI decided to get them from YouTube. Employees discussed the fact that cribbing YouTube videos might not be allowed. Eventually a group, including OpenAI president Greg Brockman, went ahead with the plan.

That a senior figure like Brockman was involved in the scheme was evidence of how crucial such data-gathering methods were to developing AI, according to Marcus. Brockman did so “very likely knowing that he was entering a legal gray area—yet desperate to feed the beast,” Marcus wrote. “If it all falls apart, either for legal reasons or technical reasons, that image may linger.”

When reached for comment, a spokesperson for OpenAI did not answer specific questions about its use of YouTube videos to train its models. “Each of our models has a unique dataset that we curate to help their understanding of the world and remain globally competitive in research,” they wrote in an email. “We use numerous sources including publicly available data and partnerships for nonpublic data, and are exploring synthetic data generation,” they said, referring to the practice of using AI-generated content to train AI models. 

OpenAI chief technology officer Mira Murati was asked in a Wall Street Journal interview whether the company’s new Sora video generator had been trained using YouTube videos; she answered, “I’m actually not sure about that.” Last week YouTube CEO Neal Mohan responded by saying that while he didn’t know if OpenAI had actually used YouTube data to train Sora or any other tool, if it had, that would violate the platform’s rules. Mohan did mention that Google uses some YouTube content to train its AI tools based on a few contracts it has with individual creators, a statement a Google spokesperson reiterated to Fortune in an email.

Meta decides licensing deal would take too long

OpenAI wasn’t alone in facing a lack of adequate data. Meta was also grappling with the issue. When Meta realized its AI products weren’t as advanced as OpenAI’s, it held numerous meetings with top executives to figure out ways to secure more data to train its systems. Executives considered options like paying a licensing fee of $10 per book for new releases and outright buying the publisher Simon & Schuster. During these meetings executives acknowledged they had already used copyrighted material without the permission of its authors. Ultimately, they decided to press on even if it meant possible lawsuits in the future, according to the New York Times.   

Meta did not respond to a request for comment.

Meta’s lawyers believed that if things did end up in litigation, they would be covered by a 2015 case Google won against a consortium of authors. At the time, a judge ruled that Google was permitted to use the authors’ books without having to pay a licensing fee because it was using their work to build a search engine, which was sufficiently transformative to be considered fair use.

OpenAI is making a similar argument in a case brought against it by the New York Times in December. The Times alleges that OpenAI used its copyrighted material without compensating it, while OpenAI contends its use of the materials is covered by fair use because they were gathered to train a large language model, not to run a competing news organization.

For Marcus, the hunger for more data was evidence that the whole proposition of AI was built on shaky ground. To live up to the hype with which it has been billed, AI simply needs more data than is available. “All this happened upon the realization that their systems simply cannot succeed without even more data than the internet-scale data they have already been trained on,” Marcus wrote on Substack.

OpenAI seemed to concede that was the case in written testimony to the U.K.’s House of Lords in December. “It would be impossible to train today’s leading AI models without using copyrighted materials,” the company wrote.
