Meta's Ambitious AI Pursuit Encounters Legal and Ethical Challenges
A copyright lawsuit against Meta has brought to light internal communications about the company's strategy for developing its open-source AI models, known as Llama. The messages reveal Meta's concern about "media coverage suggesting we have used a dataset we know to be pirated," indicating the company was aware of the potential legal ramifications.
The discussions, part of exhibits unsealed by a California court, suggest that Meta used copyrighted data to train its AI systems while aiming to keep that fact quiet. The messages, made public last week, also underscore Meta's determination to outpace competitors such as OpenAI and Mistral in the AI race.
In an October 2023 email to Meta AI researcher Hugo Touvron, Ahmad Al-Dahle, Meta's vice president of generative AI, stressed the importance of matching OpenAI's GPT-4. Al-Dahle wrote that Meta needed to "learn how to build frontier and win this race," a message that implied the potential use of controversial data sources such as Library Genesis (LibGen) to train its AI.
An email from Meta director of product Sony Theakanath to VP of AI research Joelle Pineau debated the use of LibGen. Theakanath noted that "GenAI has been approved to use LibGen for Llama3... with agreed-upon mitigations," adding that the piracy site was essential to achieving state-of-the-art performance. Competitors such as OpenAI and Mistral are rumored to draw on LibGen as well, though neither has confirmed the claims.
"Libgen is essential to meet SOTA [state-of-the-art] numbers," Theakanath stated. "It is known that OpenAI and Mistral are using the library for their models."
The lawsuit stems from claims by author Richard Kadrey and comedian Sarah Silverman, among others, that Meta infringed their copyrights by using illegally sourced content to train its models. Meta maintains that using copyrighted material in AI training falls under the fair use doctrine.
The company reportedly applied several "mitigations" to its use of LibGen, including removing obviously pirated data and avoiding public acknowledgment of the dataset's use. Theakanath also cited the need for "red team" checks to guard against the models being put to hazardous uses such as bioweapons development.
The emails also flagged "policy risks," particularly the possibility that adverse media coverage of Meta's reliance on pirated sources could undermine the company's standing with regulators.
The aggressive pursuit of training data has become increasingly contentious, as highlighted by a New York Times report on Meta's scramble for data following the debut of ChatGPT. The report noted internal deliberations about acquiring a major publishing company and hiring contractors to summarize copyrighted content without authorization.
This urgency underscores a wider problem facing AI developers, who appear to be hitting a "data wall" as novel data sources grow scarce, a point acknowledged by OpenAI cofounder Ilya Sutskever, who spoke about the challenges of "peak data" at the NeurIPS conference. The scarcity has driven companies to explore unconventional ways of acquiring data, such as paying digital creators for unused video footage.
Given Meta's and OpenAI's determination to advance their AI systems rapidly, the path forward remains turbulent. Although a judge dismissed parts of Kadrey and Silverman's earlier claims, the newly revealed documents continue to bolster portions of the ongoing case.