The Gist
- AI copyright infringement. A new battleground has emerged between content owners and the technology sector over the right to use owners’ content to train AI models.
- AI model training copyright issues. On trial is whether the “fair use” doctrine of US copyright law justifies the use of copyrighted works to train AI models.
- Future of AI. The decision in this case will set a critical precedent for how AI models are trained and, more broadly, for the future of generative AI.
We knew it was inevitable. It’s no secret that the modern generation of Large Language Models (LLMs) has become incredibly knowledgeable and proficient by crawling the internet to ingest a large corpus of content. While we may have questioned the legality of incorporating content that wasn’t explicitly owned or licensed, we all turned a blind eye in the name of innovation and creativity. The power of generative AI platforms was too mesmerizing to worry about the source and legality of the data.
That may all begin to change. Let’s take a look at AI copyright infringement and some other related issues.
Late last month, The New York Times filed a lawsuit against OpenAI and Microsoft claiming ChatGPT and Copilot were trained using the vast New York Times article archive without permission. The Times and others fear these platforms will begin to compete with news agencies as a new source for users to get their news and information. To date, this is the most comprehensive, in-depth legal case in the generative AI space. (OpenAI has responded to the lawsuit’s merits, and here are OpenAI CEO Sam Altman’s latest comments on the matter.)
Web Scraping and Crawling
Web scraping and crawling have been around for decades and are not generally illegal. However, how the captured information is used can determine whether there is copyright infringement. Many websites have terms of service that prohibit web crawling, and if a crawler ignores those terms, it could constitute a breach of contract (not a copyright infringement).
Generally, the use of information captured through web crawling falls under the “fair use” doctrine of US copyright law, but the analysis is highly subjective and nuanced.
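As a concrete illustration, here is a minimal Python sketch of the robots.txt check a well-behaved crawler performs before fetching a page. The URL and user-agent string are hypothetical placeholders, and passing this check says nothing about a site’s terms of service or how the data may lawfully be used.

```python
# Minimal sketch: consult a site's robots.txt before crawling a page.
# "ExampleResearchBot" and the URL below are hypothetical placeholders.
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def may_fetch(url: str, user_agent: str = "ExampleResearchBot") -> bool:
    """Return True if the site's robots.txt permits fetching this URL."""
    parts = urlparse(url)
    rp = RobotFileParser()
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    rp.read()  # download and parse the robots.txt file
    return rp.can_fetch(user_agent, url)

if __name__ == "__main__":
    url = "https://example.com/articles/1"
    verdict = "permits" if may_fetch(url) else "disallows"
    print(f"robots.txt {verdict} fetching {url}")
```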
While it’s unlikely this case will result in an injunction against OpenAI and Microsoft, the decision in this case will set a critical precedent for how AI models are trained and, more broadly, for the future of generative AI.
Let’s take a look at the history and how it could help determine the winners and losers in the generative AI market.
AI Model Training Copyright Issues: Good Information Is Hard to Find
We’ve all probably experienced a generative AI platform responding with information from left field, commonly referred to as “hallucinations.” However, what’s harder to detect, and potentially more dangerous, is when these platforms produce wrong information while sounding confident. The root cause of these inaccuracies frequently lies in the source data: diverse data sources, improper training, biased data or improper encoding. The old saying “garbage in, garbage out” still holds true.
Therefore, one of the harder problems in building LLMs is identifying high-quality, accurate, unbiased data sources. The internet is an information highway littered with inaccuracies, biases, conflicting information and propaganda. Web crawling yields an abundance of information that requires heavy post-process filtering to surface high-quality sources.
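To make “post-process filtering” concrete, here is a toy Python sketch loosely inspired by published line-level heuristics (for example, those described for the C4 dataset). The specific rules and thresholds are illustrative assumptions, not any production pipeline.

```python
# Toy sketch of heuristic quality filtering for crawled text.
# Rules and thresholds are illustrative, loosely modeled on
# published line-level filters (e.g., those described for C4).
def keep_line(line: str) -> bool:
    """Keep only sentence-like lines; drop navigation fragments."""
    line = line.strip()
    return (
        len(line.split()) >= 5                    # drop menus and fragments
        and line.endswith((".", "!", "?", "\""))  # require terminal punctuation
        and "lorem ipsum" not in line.lower()     # drop filler boilerplate
    )

def filter_page(text: str) -> str:
    """Return only the sentence-like lines from a crawled page."""
    return "\n".join(l for l in text.splitlines() if keep_line(l))

raw = "Home | About | Contact\nThe council approved the transit budget on Tuesday.\nClick here to subscribe"
print(filter_page(raw))  # only the sentence-like middle line survives
```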
This is why The New York Times’ content is so critical. The paper is world-renowned for its journalistic integrity and may be one of the richest sources of information available on the internet. It began publication in 1851 and has amassed one of the largest online archives. There aren’t many active publications that can claim they did daily reporting on the American Civil War.
The New York Times is not alone; many other notable periodicals have long histories, such as The Philadelphia Inquirer (1829), the Detroit Free Press (1831) and the Los Angeles Times (1881).
These sources are of very high quality, cover a wide gamut of topics and chronicle the history of the world. They are all critical sources for LLMs and are now at risk of being omitted, which could create a ripple effect in terms of licensing of source data.
Related Article: Is Your AI-Generated Content Protected by US Copyright?
AI Copyright Infringement: Isn’t Crawling the Internet Allowed?
The Common Crawl has been a foundational resource for LLMs. It has amassed over 250 billion pages over the past 17 years and adds 3 to 5 billion pages per month. Its mission is to give researchers, entrepreneurs and developers unrestricted access to a wealth of information, enabling them to create new applications and uses.
The corpus contains raw web page data, metadata extracts and text extracts. It focuses primarily on preserving HTML web pages and does not archive images, videos, JavaScript files, CSS stylesheets, etc. The goal is to provide a large-scale data-mining database, not to preserve the exact look and feel of a website. Filtered derivatives of the Common Crawl, such as the C4 dataset, are often combined in training mixes with other sources like GitHub, Books, Wikipedia and StackExchange, and the raw crawl generally needs heavy filtering given its vast amount of content.
It is important to note that the Common Crawl does not vet the content it collects, so there are no guarantees about the accuracy, bias or other anomalies in the data.
What’s interesting is that the Common Crawl dataset does include copyrighted material. So why isn’t it in violation of copyright law?
The nuance may lie in the fact that the Common Crawl does not offer an easy way for users to view or consume the content. The data is provided as raw page data, metadata extracts and text extracts, formats meant for applications and machines to consume directly. It’s not easy for a person to pull readable information out of the archive, which is why the bulk of its users are in the noncommercial, educational and research sectors.
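For a sense of what “meant for machines” looks like in practice, here is a minimal Python sketch that iterates over the text extracts in a locally downloaded Common Crawl WET file using the open-source warcio library. The file name is a placeholder for a segment fetched from Common Crawl’s public listings.

```python
# Minimal sketch: read text extracts from a Common Crawl WET file.
# Requires `pip install warcio`; the file name is a placeholder.
from warcio.archiveiterator import ArchiveIterator

with open("example.warc.wet.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type == "conversion":  # WET text-extract records
            url = record.rec_headers.get_header("WARC-Target-URI")
            text = record.content_stream().read().decode("utf-8", "replace")
            print(url, len(text), "characters of extracted text")
```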
The other consideration is that it is a nonprofit, open-source platform that provides a “representative sample” of the web and not the entire content of the web.
It remains to be seen how the Common Crawl will navigate the notoriously complex interpretation of fair use in the future.
Related Article: Generative AI: Exploring Ethics, Copyright and Regulation
Copyright Protection Is Serious Business
It’s not widely known that copyright law is rooted in the United States Constitution itself. Article I, Section 8, Clause 8, known as the Patent and Copyright Clause, empowers Congress “To promote the Progress of Science and useful Arts, by securing for limited Times to Authors and Inventors the exclusive Right to their respective Writings and Discoveries.”
The first implementation of copyright protection came with the Copyright Act of 1790, which granted American authors the right to print, reprint or publish their work for a period of 14 years, renewable for another 14. The law was meant to incentivize authors, artists and scientists to create original works by granting them a limited monopoly over their “science and useful arts.” The act underwent several revisions in 1831, 1870 and 1909, mainly to extend the duration and broaden the rights.
It wasn’t until the Copyright Act of 1976 that “Fair Use” came into play. Fair use, a legal principle, permits unlicensed citation or incorporation of copyrighted material in another creator’s work. Now central to modern copyright law, it is the cornerstone allowing academics, journalists, filmmakers and writers to leverage these works without infringement in certain circumstances.
The test for whether fair use applies weighs four factors: the purpose and character of the use, the nature of the copyrighted work, the amount used in relation to the work as a whole and the effect of the use on the potential market. Fair use recognizes that most works are created by borrowing from antecedent works and aims not to stifle future creativity and innovation.
Origin of ‘Fair Use’
However, the origin of fair use stems from a 19th-century copyright case, Folsom v. Marsh, decided in 1841. The author in question was the Rev. Charles W. Upham, who wrote a two-volume, 856-page book on the life of George Washington using letters that had already been published. Of the work, 353 pages had previously appeared in a 12-volume collection titled “Writings of George Washington,” compiled and edited by Jared Sparks. Washington’s letters had passed to his nephew, Justice Bushrod Washington, and Supreme Court Justice Joseph Story ruled that, given the volume of pages used, Upham’s book violated copyright law.
Fair use has been critically important in protecting intellectual property, but it was also deliberately left murky in its definition, lending itself to case-by-case decisions.
The other significant piece of copyright legislation is the Digital Millennium Copyright Act (DMCA), enacted in 1998, which criminalizes the production and dissemination of technology, devices or services intended to circumvent measures that control access to copyrighted works. The DMCA covers music, movies, text and other copyrighted digital assets, making it illegal to download and/or use protected digital content without paying for it. The access-control measures it protects are commonly known as digital rights management, or DRM.
In addition to fair use, this law could become relevant if, in the process of training LLMs, any copyright management information (such as author, title or terms of use) was removed.
Related Article: The Legal Implications of Generative AI
What Is Fair to Use?
A new battleground has emerged between content owners and the technology sector where content creators argue that tech firms must get permission before using their content to train generative AI models. Tech firms argue that they are adequately covered under fair use.
Copyright protection is focused on the progress of arts and sciences and therefore protects human creativity. It is not focused on the protection of readily available facts and data, which are actually not copyrightable.
Fair use is at the heart of The New York Times’ case regarding AI copyright infringement. The case does not question whether The New York Times’ content is copyright protected (it is). The question at hand is whether AI model training constitutes fair use. The answer may depend on how the information used in training is presented back to the user. If the information is used as a basis for learning but not to reproduce large portions of content verbatim, it should fall within fair use. Likewise, if the content is used as the basis for generating new, unique content, it could also fall under fair use. However, if the models recite large portions of articles verbatim, that would be a clear violation.
The New York Times contends that its articles and news reporting go beyond presenting key facts and contain a high degree of journalistic creativity, which should be covered by copyright protection. The Times argues that a large portion of its content is being used verbatim and is not transformed, thereby threatening its future business model. It is also concerned about reputational damage if its content is used inappropriately.
‘Transformative’ Claims
Microsoft and OpenAI claim their use of copyrighted material is transformative in that the output created by the platforms is not the original content but net-new content. The challenge with this position is that the platforms today exhibit responses that recite large portions of New York Times content verbatim.
Back in August 2023, the US Copyright Office affirmed its position that AI-generated work is not eligible for copyright protection as it does not meet the human authorship test. Work that contains AI-generated content but has sufficient human authorship could be covered under copyright protection provided the human-authored portions are material.
Other Precedents May Influence Decision-Making
Several other cases may influence the decision in the New York Times lawsuit.
In 2003, Google launched its Google Book Search project, an attempt to digitize books through scanning and computer-aided recognition to make them searchable online, widely seen as a transformative step for libraries. However, many authors and publishers argued that Google never obtained permission or proper licenses and filed suit (Authors Guild v. Google). Initially, a class action settlement was negotiated that would have required Google to pay $125 million to rights-holders, but the court rejected it. After several appeals, the case was ultimately decided in Google’s favor in 2015, when the scanning was found to meet the criteria of fair use, and the Supreme Court declined to hear a further appeal in 2016. This is a great example of how complicated and subjective fair use cases can become.
In another significant case, Google v. Oracle, filed in 2010, Oracle, which owns the copyright in Java SE, sued Google for copyright infringement after Google copied roughly 11,500 lines of code from Java SE to build an Application Programming Interface (API) that would let programmers create new applications for its acquired Android platform. The Supreme Court ultimately ruled in 2021 that the copying was fair use, in part because it was a small, functional subset of the code used to enable the creation of new applications.
Image Scraping
An even more closely related case, Getty Images v. Stability AI (the creator of the AI image generator Stable Diffusion), was filed in early 2023. Getty Images claims Stability AI unlawfully scraped millions of images from its site. In December 2023, a UK court ruled that the case has merit and can move to trial. The outcome will be an important one for artists, photographers and others who hold copyrighted images.
Additionally, numerous independent artists, such as Kelly McKernan, have filed copyright lawsuits against AI image-generating platforms like Midjourney and Stable Diffusion. Variations of McKernan’s popular acrylic and watercolor paintings started appearing online without any attribution to her original works. The suit alleges that AI image generators violate the rights of millions of artists by ingesting large numbers of digital images and then producing derivative works that compete against the originals. Much of the case was dismissed because the original works had not been registered with the US Copyright Office, but an amended suit is likely to follow.
And finally, on Aug. 30, 2023, the US Copyright Office issued a notice of inquiry (NOI) on “the use of copyrighted works to train AI models, the appropriate levels of transparency and disclosure with respect to the use of copyrighted works, and the legal status of AI-generated outputs.” This NOI is an important step for the office’s AI initiative that was launched in early 2023.
The interpretation of fair use as it pertains to content creation and distribution rights is one of the most critical decisions for the future of digital technology.
Is Providing Attribution That Hard?
One solution could be for LLMs to maintain citations and links to the original content they ingest and to provide them on request. A user could enter a prompt that says, “show me the sources,” and the LLM could return a list of the sources used to formulate the response, with proper attribution. Search engines already do this, linking back to source pages. Today, LLMs are a black box with no visibility into which sources were used to formulate a response, which creates concerns about both accuracy and legality.
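A retrieval-style design makes this tractable: retrieve supporting passages first, then answer and surface their sources. Here is a minimal Python sketch under that assumption; the corpus, scoring function and source names are all hypothetical placeholders, not any vendor’s actual pipeline.

```python
# Minimal sketch of retrieval-style attribution: retrieve supporting
# passages first, then surface their sources alongside the response.
# The corpus and word-overlap scoring are toy placeholders.
from collections import Counter

CORPUS = [
    {"source": "Example Gazette", "url": "https://example.com/a",
     "text": "The city council approved the new transit budget on Tuesday."},
    {"source": "Sample Times", "url": "https://example.com/b",
     "text": "Transit ridership rose 12% after the budget expansion."},
]

def score(query: str, text: str) -> int:
    """Count overlapping words between query and passage (toy relevance)."""
    q = Counter(query.lower().split())
    t = Counter(text.lower().split())
    return sum((q & t).values())

def answer_with_sources(query: str, top_k: int = 2):
    """Return the top passages plus the attribution a user could request."""
    ranked = sorted(CORPUS, key=lambda d: score(query, d["text"]), reverse=True)
    hits = ranked[:top_k]
    citations = [f'{d["source"]} ({d["url"]})' for d in hits]
    return hits, citations

passages, sources = answer_with_sources("transit budget")
print("Sources used:", "; ".join(sources))
```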
The other solution would be to enter into licensing or royalty agreements with content creators such as The New York Times to allow the use of their data to train models. Platforms like Reddit have already proclaimed “the end of the free data era” and have started to charge for API access. The irony here is that many LLM providers have already developed licensing models for the use of their platforms but haven’t considered the licensing requirements of their training data sources.
If all else fails, content creators can always opt out using the classic “robots.txt” file, which tells crawlers not to crawl the site, as shown below. However, this is not as easy as it seems for content such as an artist’s images, which may appear on many sites the artist doesn’t control.
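As an illustration, a site could add directives like the following to its robots.txt. GPTBot is the user agent OpenAI has published for its crawler and CCBot is Common Crawl’s, though each crawler’s documentation should be checked for current names.

```
# Illustrative robots.txt entries blocking known AI-training crawlers
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /
```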
Another growing trend is for companies to expressly prohibit the use of their data for AI model training through the terms and conditions posted on their websites. Violating these terms would constitute a breach of contract.
Actions that limit access to training data, while perfectly legitimate for copyright protection, are not necessarily in the best interest of the evolution of the LLM as a knowledge platform.
A Wide Range of Outcomes Are Possible
A variety of outcomes might come from the ruling in the case between The New York Times and OpenAI.
The courts could decide that training LLMs is not copyright infringement, provided that substantial protections and controls are in place to safeguard original works.
A settlement could also mandate the creation of a “Copyright Committee,” with representatives from various content creation markets, to oversee and govern the usage of copyrighted material. There could also be a positive marketplace solution in which content providers create license agreements that let LLM creators leverage their content.
In a more extreme course of action, we could see Congress step in and amend the Copyright Act of 1976 or the DMCA to specifically cover AI and LLMs. The AI copyright infringement issues just might be large enough for Congress to intervene.
It’s quite possible that this case and associated appeals could linger in the courts for some time, but there is no question that the final outcome will set a critical precedent for the future of digital technologies.