Fair Use or Foul Play? The AI Copyright Quandary


The Gist

  • AI copyright infringement. A new battleground has emerged between content owners and the technology sector over the right to use copyrighted content to train AI models.
  • AI model training copyright issues. At issue is whether the “fair use” doctrine of US copyright law justifies the use of copyrighted works to train AI models.
  • Future of AI. The decision in this case will set a critical precedent for how AI models are trained and, more broadly, for the future of generative AI.

We knew it was inevitable. It’s no secret that the modern generation of large language models (LLMs) has become incredibly knowledgeable and proficient by crawling the internet to ingest a huge corpus of content. While we may have questioned the legality of incorporating content that wasn’t explicitly owned or licensed, we all turned a blind eye in the name of innovation and creativity. The power of generative AI platforms was too mesmerizing to worry about the source and legality of the data.

That may all begin to change. Let’s take a look at AI copyright infringement and some other related issues. 

Late last month, The New York Times filed a lawsuit against OpenAI and Microsoft claiming ChatGPT and Copilot were trained using the vast New York Times article archive without permission. The Times and others fear that these platforms will begin to compete with news agencies as a new source for users to get their news and information. To date, this is the most comprehensive, in-depth legal case in the generative AI space. (OpenAI has publicly responded to the lawsuit’s merits, and CEO Sam Altman has also commented on the matter.)


Web Scraping and Crawling

Web scraping and crawling have been around for decades and are not generally illegal. However, the way the information is used can determine whether there is any copyright infringement. Many websites have terms of service that prohibit web crawling, and if a crawler ignores those terms, it could be considered a breach of contract (not copyright infringement).
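In practice, most crawlers also consult a site’s robots.txt file before fetching pages. Here is a minimal sketch using only Python’s standard library; the URLs and user agent are hypothetical placeholders, and honoring robots.txt is a voluntary convention rather than a legal safe harbor.

```python
# Minimal sketch: check a site's robots.txt before crawling a page.
# Uses only the Python standard library; the URLs and user agent
# below are hypothetical placeholders.
from urllib.robotparser import RobotFileParser

robots = RobotFileParser("https://www.example.com/robots.txt")
robots.read()  # fetch and parse the site's robots.txt

USER_AGENT = "ExampleResearchBot"  # hypothetical crawler name
page = "https://www.example.com/archive/article.html"

if robots.can_fetch(USER_AGENT, page):
    print("robots.txt permits fetching this page")
else:
    print("robots.txt disallows this page -- skip it")
```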

Generally, the use of information captured through web crawling is defended under the “fair use” doctrine of US copyright law, but that determination is subjective and highly nuanced.

While it’s unlikely this case will result in an injunction against OpenAI and Microsoft, the decision will set a critical precedent for how AI models are trained and, more broadly, for the future of generative AI.

Let’s take a look at the history and how it could help determine the winners and losers in the generative AI market.

AI Model Training Copyright Issues: Good Information Is Hard to Find

We’ve all probably experienced a generative AI platform that responds with information from left field — commonly referred to as “hallucinations.” What’s harder to spot, and potentially more dangerous, is when these platforms produce wrong information while sounding confident. The root cause of these inaccuracies frequently lies in the data: unreliable sources, improper training, biased data or improper encoding. The old saying “garbage in, garbage out” still holds true.

Therefore, one of the harder tasks for LLM builders is identifying high-quality, accurate, unbiased data sources. The internet is an information highway littered with inaccuracies, biases, conflicting information and propaganda. Web crawling yields an abundance of raw content that requires heavy post-process filtering to distill high-quality sources.
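To make that filtering concrete, here is a minimal sketch of heuristic line-level cleaning, loosely inspired by publicly described C4-style rules. The specific thresholds and patterns are illustrative assumptions, not any lab’s actual pipeline.

```python
# Minimal sketch of heuristic post-filtering for crawled text,
# loosely inspired by published C4-style cleaning rules. The
# thresholds and patterns below are illustrative assumptions only.
def keep_line(line: str) -> bool:
    words = line.split()
    if len(words) < 5:                       # drop short fragments and menus
        return False
    if not line.rstrip().endswith((".", "!", "?", '"')):
        return False                         # keep sentence-like lines only
    if "lorem ipsum" in line.lower():        # obvious placeholder text
        return False
    if "{" in line or "}" in line:           # leftover code or markup
        return False
    return True

def clean_page(text: str) -> str:
    """Keep only lines that look like natural-language prose."""
    return "\n".join(l for l in text.splitlines() if keep_line(l))

raw = "Subscribe now!\nThe archive spans more than 170 years of reporting.\nfunction(){...}"
print(clean_page(raw))  # prints only the middle, sentence-like line
```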

This is why the New York Times content is so critical. The paper is world-renowned for its journalistic integrity and may be one of the richest sources of information available on the internet. It began publication in 1851 and has amassed one of the largest online archives. There aren’t many active publications that can claim they did daily reporting on the American Civil War.

The New York Times is not alone; many other notable periodicals have long histories, such as The Philadelphia Inquirer (1829), the Detroit Free Press (1831) and the Los Angeles Times (1881).

These sources are of very high quality, cover a wide gamut of topics and chronicle the history of the world. They are all critical sources for LLMs and are now at risk of being omitted, which could create a ripple effect in how source data is licensed.


AI Copyright Infringement: Isn’t Crawling the Internet Allowed?

The Common Crawl has been a foundational resource for LLMs. It has amassed over 250 billion pages over the past 17 years and adds 3 to 5 billion pages per month. Its mission is to give researchers, entrepreneurs and developers unrestricted access to a wealth of information, enabling them to create new applications and uses.

The corpus contains raw web page data, metadata extracts and text extracts (distributed as WARC, WAT and WET files, respectively). It focuses primarily on preserving HTML web pages and does not archive images, videos, JavaScript files, CSS stylesheets, etc. The goal is to provide a large-scale data-mining resource, not to preserve the exact look and feel of a website. Popular training corpora draw on it heavily; the C4 dataset, for example, is a filtered snapshot of Common Crawl, and crawled sites such as Wikipedia, GitHub and StackExchange appear throughout it. Given its vast amount of content, it generally needs to be filtered before use.
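As a concrete illustration, the sketch below queries Common Crawl’s public URL index to list captures for a domain. The crawl label CC-MAIN-2023-50 is an assumption; substitute the ID of a current crawl.

```python
# Minimal sketch: query the public Common Crawl URL index to see
# which captures exist for a domain. The crawl label below is an
# assumption; replace it with the ID of a current crawl.
import json
import requests

INDEX = "https://index.commoncrawl.org/CC-MAIN-2023-50-index"
params = {"url": "example.com/*", "output": "json", "limit": "5"}

resp = requests.get(INDEX, params=params, timeout=30)
resp.raise_for_status()

# The index returns one JSON object per line (NDJSON); each record
# names the WARC file that holds the raw capture.
for line in resp.text.strip().splitlines():
    record = json.loads(line)
    print(record["url"], "->", record["filename"])
```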

It is important to note that the Common Crawl does not vet the content it collects, so there is no guarantee against inaccuracies, biases or other anomalies in the data.

What’s interesting is that the Common Crawl dataset does include copyrighted material. So why isn’t it in violation of copyright law?

The nuance may be in the fact that the Common Crawl does not offer an easy way for users to view or consume the content. The data is provided as text, metadata and raw archive formats meant for applications and machines to consume directly, and it’s not easy for a casual reader to extract information from the archive. As a result, the bulk of its users are in the noncommercial, educational and research sectors.
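For example, reading the text extracts typically requires an archive-aware library rather than a browser. Here is a minimal sketch using the open-source warcio package, assuming a WET segment has already been downloaded locally (the filename is a placeholder).

```python
# Minimal sketch: iterate over the plain-text extracts in a WET file
# using the open-source warcio library. The filename is a placeholder
# for a segment downloaded from Common Crawl.
from warcio.archiveiterator import ArchiveIterator

with open("CC-MAIN-example.warc.wet.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type == "conversion":  # WET text records
            url = record.rec_headers.get_header("WARC-Target-URI")
            text = record.content_stream().read().decode("utf-8", errors="replace")
            print(url, "->", len(text), "characters of extracted text")
```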

The other consideration is that it is a nonprofit, open-source platform that provides a “representative sample” of the web and not the entire content of the web.

It remains to be seen how the Common Crawl will navigate the notoriously complex interpretation of fair use in the future.


Copyright Protection Is Serious Business

It’s not widely known that copyright protection is rooted in the United States Constitution itself. Article I, Section 8, Clause 8, known as the Patent and Copyright Clause, empowers Congress “To promote the Progress of Science and useful Arts, by securing for limited Times to Authors and Inventors the exclusive Right to their respective Writings and Discoveries.”

The first implementation of copyright protection came with the Copyright Act of 1790, which granted American authors the right to print, reprint or publish their work for a period of 14 years, renewable for another 14. The law was meant to incentivize authors, artists and scientists to create original works by granting creators a limited monopoly, in the service of promoting “science and useful arts.” The act underwent several revisions, in 1831, 1870 and 1909, mainly to extend the duration of protection and broaden the rights granted.

It wasn’t until the Copyright Act of 1976 that “fair use” was codified in statute (Section 107). Fair use is a legal principle that permits the unlicensed citation or incorporation of copyrighted material in another creator’s work. Now central to modern copyright law, it is the cornerstone that allows academics, journalists, filmmakers and writers to build on existing works without infringement in certain circumstances.

The test for whether fair use applies weighs four factors: the purpose and character of the use, the nature of the copyrighted work, the amount used in relation to the work as a whole, and the effect of the use on the potential market. Fair use recognizes that most works borrow from antecedent works, and it aims not to stifle future creativity and innovation.

Origin of ‘Fair Use’

However, the origin of fair use dates back to an 1841 copyright case, Folsom v. Marsh. The author in question was the Rev. Charles W. Upham, who wrote a two-volume, 856-page book on the life of George Washington using letters that had already been published: 353 of its pages had previously appeared in a 12-volume work, “The Writings of George Washington,” compiled and edited by Jared Sparks. Washington’s papers had passed to his nephew, Justice Bushrod Washington, and Supreme Court Justice Joseph Story ruled that, due to the volume of pages used, the book violated copyright.


