Will the New York Times Take Down Large Language Models?

That whistling sound you hear may not be an old-school newspaper walking past a graveyard; it may well be an AI industry-killing asteroid. On December 27, 2023, the New York Times filed a groundbreaking suit against OpenAI and Microsoft. The Times alleged copyright infringement, vicarious copyright infringement, contributory copyright infringement, violations of the Digital Millennium Copyright Act’s prohibition on removing copyright management information, unfair competition, and trademark dilution. The 69-page, 204-paragraph complaint, filed in the Southern District of New York, alleges, among many other things, that:

ChatGPT-4 (used by Microsoft Bing and Copilot among other products) was trained by misappropriating massive amounts of copyrighted material outside the “fair use” exception to copyright;
It reproduces—frequently verbatim—passages from the Times;
It undercuts referral revenue by stripping referral links from Times content;
It sometimes creates inaccurate content and attributes it to the Times;
It removes copyright notices appearing in Times materials; and
It allows users to circumvent the Times’s paywalls to access substantive content.

The complaint is systematic and detailed and goes right for the Achilles’ heel of AI: its alleged misuse of intellectual property. Large language models (LLMs) like ChatGPT-4 are trained on data scraped from the Internet. Base models, some claiming to be trained on the “whole Internet,” require tremendous computing time and energy to train, and revising one is essentially an expensive do-over. An AI developer cannot simply reach into a deep-learning model and pluck out offending training data. If the Times obtains a permanent injunction (in addition to the undoubtedly considerable damages it is seeking), OpenAI and Microsoft could not use Times material in the future and would have to destroy all models trained on Times content. This could halt product development.

One could also expect suits from other major sources used to train ChatGPT-4. The Times did not have access to that list when it filed suit, but it likely will in discovery. According to the Times complaint, the predominant sources of ChatGPT-3 training material are newspapers and media outlets, Wikipedia, and paywalled academic journals. If OpenAI’s use of Times materials falls outside “fair use” or otherwise results in an adverse ruling, other media may follow (and file) suit. Wikipedia and PLOS ONE, two heavily used sources in ChatGPT-3, may bring suits based on use restrictions that they impose on their copyrighted content, such as source attribution. Multiple parallel suits could significantly constrain the training base for LLMs to old and/or limited true public domain material, and could also lead to class-action litigation involving individual copyright holders.

The Times suit is not the first claiming infringement in training and output, but it is so far the best-researched and best-funded. Copyright owners, particularly in the visual arts, have attacked the legitimacy of generative AI training sets. In January 2023, in Andersen v. Stability AI Ltd., No. 3:23-cv-00201, visual artists brought suit in the Northern District of California against Stability AI, Midjourney, and DeviantArt, alleging that their work was appropriated outside “fair use” exceptions to copyright laws. Hot on their heels, Getty Images, in a bid to protect the millions of photos it licenses, sued Stability AI in the District of Delaware in February 2023 (No. 1:23-cv-00135), alleging that the defendant copied and processed millions of images without permission. It may not previously have been practical for individual copyright owners to chase down single infringing uses, but the emergence of AI tools (and manufacturers) helps identify potential targets.

What could these cases mean for enterprises using generative AI? Although it is still early in both the “text” and “picture” cases, a ruling that AI training is outside “fair use” could open LLM developers to catastrophic levels of copyright exposure, both monetarily (due to statutory damages) and operationally (because every new successful claim could force retraining). This could affect end-users of AI in at least four ways:

A copyright owner might pursue a known end-user of AI due to outputs;
An end-user might lose access to particular AI-enabled products and services;
AI-enabled product costs might rise precipitously; and
AI-enabled models may have smaller and less reliable data sets.

Takeaways

The New York Times has brought an aggressive suit against the leading maker of AI (OpenAI) and its leading customer/value-added reseller (Microsoft).
An adverse outcome in this litigation could create significant obstacles to (or add significant cost to) the training and use of LLMs.
The litigation might also interrupt or eliminate user access to Microsoft AI and other ChatGPT-based tools.
