As machine learning (“ML”) and artificial intelligence (“AI”) platforms begin to reconfigure the technology landscape, many are concerned that the best methods for training AI platforms will come under attack as instances of copyright infringement. The worry, roughly, is that ingesting a large amount of data without the prior authorization of the data’s owners constitutes a form of highly automated copyright infringement. Accordingly, AI platforms do not have the legal right to copy, analyze, and then produce new material based on the vast amounts of copyright-protected data they ingest on an automated basis. And when these platforms produce new content, that content will, in some cases, infringe the copyrights in the works that make up the large data sets used to train the AI models. After all, the output of these platforms could not have been produced without accessing, copying, and using tremendously large data sets.
What does copyright law have to say about protecting creativity in a world that will soon depend upon AI in a wide range of commercial industries? Are the existing guardrails of copyright robust enough to support large-scale innovation while protecting the interests of the individual entrepreneur? These are the questions we will try to answer together.
A Quick Primer on Copyright Law
Let’s start with the nuts and bolts. US copyright law is a federal regime that allows an entity or an individual to protect original works fixed in a tangible form - including everything from music to film, from painting to sculpture, from photographs to works of literature, from software code to websites. Copyright does not grant protection for ideas or laws of nature. Instead, copyright only protects the specific way in which ideas are expressed. Here’s a simple way of digesting this core axiom of copyright law: You cannot copyright the abstract idea or custom of saying “I love you,” but you can copyright a poem whose general message for the reader or audience is “I love you.” Expression alone is protected by copyright. By contrast, anyone is free to use, remix, manipulate, and repurpose the abstract ideas underneath any concrete expression.
Copyright holders control the reproduction, distribution, performance and display of their protected works. These are the “exclusive rights” granted to copyright holders by the U.S. Copyright Act. In general, third parties cannot reproduce, distribute, perform or display copyrighted works without the prior authorization of the copyright holder. There are notable exceptions, such as when transformative copies need to be made in order for search engines to make copyrighted material more readily available on the internet. But the general rule is that the aforementioned exclusive rights are available to all copyright holders. And this is what makes copyright so economically valuable: Copyright holders get to decide how much to charge third parties who want to reproduce, distribute, perform or display their protected work.
The Limits of Copyright
This leads us to a critical limitation of copyright law in the United States. Under the Copyright Act, copyright holders do not have the right to prevent others from observing or perceiving a copyrighted work. For example, if someone copied a series of copyrighted photographs and presented the copies to a captive audience, the copyright owner would not have a copyright infringement action against the audience members. The copyright holder would, however, have a compelling case against the third party that presented the copied photographs to the audience. There is a reason for this distinction. Under the Copyright Act, there is no exclusive right of observation, perception or learning. The absence of such an exclusive right is by design; after all, if there were an exclusive right of observation, perception or learning, then copyright holders could sue academics and entrepreneurs every time a new idea was born from serious and sustained engagement with an existing body of copyrighted knowledge.
Permissible AI Training Activities
Before going any further, it is important to summarize what we have already learned about copyright law. Copyright is a bundle of exclusive rights that belong to the copyright owner. This bundle includes the rights of reproduction, display, performance and distribution, and this assortment of rights is generally thought to provide sufficient incentives for copyright holders to produce, market and sell works of authorship. Our copyright regime is a set of economic incentives, not just a hodgepodge of random rules.
The economic incentives embedded in the Copyright Act are designed to increase the production of creative works. This is precisely why AI training models pose a difficult challenge for copyright: We know that copyright owners cannot sue people for merely consuming or learning from their work, but copyright gives creators powerful tools to prevent third parties from reproducing, displaying, performing or distributing their works without prior authorization. If AI platforms end-run this authorization requirement, does that mean copyright is being infringed on a massive scale? Or, to frame the question in a way that highlights how AI model training looks from the perspective of AI companies: If copyright owners cannot control how many people read or view their works, then doesn’t it follow that a super smart computer (i.e., an AI model) may lawfully read and learn from a limitless amount of valuable copyrighted material without triggering a copyright lawsuit?
Whenever an AI platform is ingesting a large data set for training purposes, the key question is this: If the data being ingested by the platform is protected by copyright, does that mean that copyright law has been violated? To answer this question, let’s consider some hypothetical cases. (Keep in mind that here we are only talking about the AI model training process. We are not addressing the more difficult question of whether copyright is infringed when AI platforms are used to generate new content. We will get to that question later.)
Example 1: Imagine that a data set has been compiled by Company X. While the data inside the data set is not itself subject to copyright (e.g., a data set that consists of mere facts, such as the physical addresses of all Americans), the unique arrangement of that data is eligible for copyright protection. If an AI platform uses that data set without the authorization of Company X, there would be a legitimate claim of copyright infringement. Under existing US law, the creative arrangement of the data set’s items is subject to copyright protection. As a result, copying such a data set could constitute copyright infringement even if the purpose of using it was only to train an AI/ML platform.
Example 2: Imagine that the developer of an AI platform wrote a piece of software to collect data items available on the internet that are subject to copyright protection. These data items could be files, images, music clips, video segments, or any combination of the above. Assume further that the AI platform needs to ingest these data items in order to create new and valuable output in the future. For example, imagine that the platform needed to review all papers dealing with applications of quantum mechanics (“QM”) to computing in order to generate novel answers to user queries about QM in the future. In such a case, data ingestion would not be carried out for the purpose of reproducing, distributing, performing or displaying specific data items within the set. Quite the contrary, the purpose of data ingestion here is to train the AI so it can deliver new output in the future accurately and efficiently.
Does this pro-social purpose determine whether the operator of the AI platform is liable for copyright infringement? No. By itself, the purpose of data ingestion does not determine whether there is liability for an act of infringement. Where there is a technical infringement of one of the exclusive rights, courts consult the doctrine of fair use - a policy that is designed to allow certain uses of copyright-protected work to encourage the freedom of expression and the production of new creative work.
The fair use test has four different factors, but we will focus on the two most important ones here: (1) the purpose or nature of the use, and (2) the impact of the use on the market. When courts have to evaluate a fair use defense, they rely heavily on these two factors to determine whether it is wise, from a policy perspective, to allow unauthorized and uncompensated uses of copyright-protected work. The first factor, the purpose or nature of the use, might support allowing an AI platform to ingest large amounts of data, even where many of those data elements were copyright-protected. The operator of the AI platform would be able to argue that large scale data ingestion is necessary to produce useful new knowledge, and on that basis data ingestion constitutes a transformative use of the copyright-protected material processed by the AI platform. A court might find this argument compelling. After all, without the activity of such an AI platform, data would not self-organize to produce credible answers to complex queries. So, the argument goes, if we highly value the production and distribution of new knowledge, then the specific steps required by AI data ingestion should, in many cases, be viewed favorably under the first factor, the purpose or nature of the use.
The second factor, the impact of the use of copyright-protected material on the market, is critical. The analysis of this factor in cases like the hypothetical set forth above would be highly fact-specific. Let’s take two further modifications to our hypothetical to show how tricky this fact-based analysis would be. In the first case, imagine that an AI platform was designed to ingest only the questions and answers of a specific online product, for example the question and answer service Quora. If the AI platform ingested these questions and answers and then prepared new answers designed to appeal directly to Quora’s audience, then this service would on balance be viewed as having a competitive impact on an existing copyright holder. That impact would weigh against finding that the use of the original copyright-protected material was fair. Again, we cannot predict what a court would say in this specific case. But one thing is clear: In cases where data ingestion protocols are intentionally designed to produce output that is directly competitive with the businesses of existing copyright holders, that will generally weigh against a finding of fair use.
In the second case, imagine that an AI platform was designed to ingest all information it could find on the internet embodied in natural language. Unlike the first case, the AI model does not ingest only the data of one existing information aggregator (such as Quora); instead, the AI model ingests data from every source it can access. Because the field of sources is not curated, the AI model is less vulnerable to a claim that it was designed to compete with an existing market actor. As a consequence, establishing a negative impact on a specific copyright owner’s business interests becomes more difficult, and that weighs in favor of a finding of fair use. Finally, the potential creation of new information products through the training of the AI model actually makes the training phase itself more likely to qualify for fair use. Where diverse sets of training data are analyzed to make connections between data items that lead to the production of new knowledge, the AI platform’s output is more likely to qualify as a transformative use of the original copyright-protected material that the AI model was trained on in the first place.
Copyright law is robust and flexible. The fair use doctrine in particular has been called upon to manage the economic impact of complex innovations on copyright holders. For example, the application of fair use to Google’s organization of the world’s information on the internet is instructive: copying intended to make it easier to find and access original works of authorship is transformative, and is eligible on that basis for protection under the fair use doctrine. In cases like Google’s, the courts found a way to strike a healthy balance between creators’ rights and the need to protect innovation. Yet courts have not been called upon to apply fair use to AI models that aggregate and learn from truly tremendous amounts of data in order to create the kind of output that the human mind cannot create on its own. Furthermore, the automated data ingestion involved in training AI models appears, from the perspective of copyright law, to be mass copying - not careful borrowing of small, helpful snippets to create something new and transformative. As a result, it is possible that courts will decide that the sheer amount of information copied by AI models is not reasonable under fair use.
If courts adopted such an approach, it would be hard for any AI platform to train its models on real data without finding a way to obtain licenses from all of the copyright holders whose data was included in a certain data set. Licensing on this scale would be onerous, and perhaps even unworkable. Moreover, even if a licensing regime could be created, it would impoverish the aggregate data available for training AI models. Nevertheless, one must realize that the volume of copying (including local copying to a storage device or service performed prior to data ingestion) that AI models undertake in the context of model training does provide a potential stumbling block for innovators who want to include AI training under the ever-widening umbrella of fair use.
Does Content Generated by AI Platforms Constitute Infringement?
We have already considered how modern copyright law treats the ingestion and processing of tremendous amounts of data by AI platforms. We concluded that the fair use doctrine is likely to allow the training of AI platforms, but noted that the fair use doctrine has never been used to bless the sheer amount of copying that occurs in the training of cutting-edge AI models.
Now we shall turn to an equally important question: What about content that’s generated by an AI platform after the training process is complete? Is that content covered by fair use?
This is an incredibly hard question to answer without significant qualifications. Some of the content generated by AI platforms will, once released into the wild, compete with copyrighted works used to train those platforms. In those cases, the original copyright owners will take the position that the works generated by AI are piggybacking on the creativity and value embedded in their original works. Of course, copyright holders may have some difficulty articulating what specific protectable elements were taken. Nevertheless, where copyright-protected works are valuable, there will be sufficient incentive to file infringement suits. We do not know which of those suits will succeed and which will fail. In other words, AI companies today do not have clear guidance about what does and does not constitute copyright infringement.
Despite the lack of a clear legal roadmap, AI companies can be sure of one thing. Copyright law abhors free riding. AI platforms that generate music and art that directly compete with natural persons are going to face an uphill battle when lawsuits are filed. But whether those platforms are held to be infringers will depend on a very technical legal analysis of what protectable elements were copied and what was done with those elements. Let’s try to articulate how this will work by using two simplified examples. The first represents a clear case of infringement, and the second represents a scenario in which an AI model was designed to avoid creating infringing content.
Example 1: Assume that an AI platform were to ingest the entire corpus of Disney’s IP portfolio and subsequently release products - shorts, TV series, and movies - that were directly competitive with Disney’s works. Imagine further that many of the works produced by the AI platform had highly similar characters, structurally identical themes and plots, and a social universe that corresponded closely to Disney’s original ecosystems. Could this possibly be fair use?
Creative expression is the heart of a copyrighted work because it represents the unique contribution of the copyright holder to science and the useful arts. In the Disney example sketched above, the creative expression is the characters, themes, specific stories, and social universes one finds in Disney’s IP portfolio. To take a different example, the heart of the poem we considered earlier is the specific words chosen by the author to express the sentiment “I love you”, not the concept of romantic love itself. Whenever an AI platform ingests creative expression - the heart of a copyrighted work - and then produces competing products, that is a form of free riding. The easier it is for a company like Disney to show that its market is being cannibalized by such new substitute products, the harder it becomes for the fair use defense to pan out.
Example 2: Assume that an AI platform was specifically designed to ingest only the conceptual themes embodied in Disney’s IP portfolio. Themes like love, loss, the consequences of poor choices (as we see in Pinocchio), and tragedy would be fair game for the platform, but it would not reproduce any protectable creative expression embedded in Disney’s works. This would be quite a feat. After all, the AI model would have to ignore (i) the look and the feel of Disney’s characters, (ii) the specific words characters utter in various properties, and (iii) the features of the social universe in which Disney characters live. But if the AI model could be trained in this way, would the AI platform’s new content qualify for fair use?
If an AI model is architected to access only the uncopyrightable dimensions of a copyright-protected work, then the AI model is not being trained in a way that infringes copyright. (Note that a system which worked in this way would have to make a local copy of copyright-protected work, then strip each work down to its uncopyrightable “skeleton,” and then run that stripped-down material through the AI model.) In our “I love you” poem example, the basic concept of such a poem, along with the different abstract elements that such poems often include - such as favorable comparisons between a loved one and other things - are not protectable under copyright. Concepts, semantic connections, abstract relationships, and the like do not qualify for copyright protection. These are non-protectable elements. So, if an AI model is designed to strip away everything but the non-protectable elements, then the new content it generates is likely to be considered fair use. For these reasons, fair use can protect AI models that do not copy or reproduce creative expression but instead only access the non-protectable elements in the course of producing new innovations.
Example 2 raises two significant practical problems. First, how an AI model is trained is a question of fact, not something generally known to the public or the courts. As a result, copyright holders will not know whether protectable elements of their works are being copied and leveraged by AI companies. And so, whenever copyright owners fear that their work is being pushed out by substitute AI products, they will still have an incentive to file infringement suits.
Second, many AI training models may only learn optimally if they can examine both the protectable and non-protectable elements of copyrighted works. These models are designed to learn, after all, and so it may be unrealistic to believe that AI platforms will (or should) be designed to ingest and examine only the non-protectable elements of copyrighted material. If AI training models work better when they can ingest and process everything, then it may not be wise to handicap their operation. Copyright law may need to evolve.
The Way Forward
American copyright law faces a stiff challenge. On the one hand, no one quite knows what the economic impact of AI platforms will be on the various markets for individual copyrighted works. Some of those markets may be decimated while others may not be impacted at all. While few copyright holders produce work with the specific intention of licensing it to an AI company, the fear individual creators have of competing with companies like Lensa, which produce vivid, high-quality AI-generated images, is not without foundation either.
On the other hand, Congress has an obligation to ensure that American AI companies are competitive with their international counterparts, and that requires providing these companies with a clear picture of how copyright law will treat cutting edge AI platforms.
One proposal, put forth by Mark Lemley and Bryan Casey in an article entitled Fair Learning, involves the introduction of a new policy into copyright law. The doctrine of fair learning would allow AI companies to ingest and analyze large data sets even if the contents thereof were copyright-protected. According to Lemley and Casey, fair learning is the right policy approach because it expands upon a central principle of copyright law: Ideas, facts, and functions in a protected work are not themselves protectable by copyright. Given that most AI data training is done for the purpose of learning from the non-protectable elements of copyrighted works, it follows that most AI platforms should be able to engage in large scale model training. As Lemley and Casey point out, an “ML system wants photos of stop signs so it can learn to recognize stop signs, not because of the artistic choices you made in lighting or composing your photo.”1
If adopted by lawmakers, the fair learning doctrine would go a long way toward protecting innovative AI companies. Yet it is not clear that fair learning alone resolves all conflict between copyright holders and generative AI platforms. Not all cases are going to be as straightforward as the stop sign case, and not all copyright holders are going to be satisfied with a disclaimer that essentially says “we copied your stuff, but we did it for reasons you shouldn’t worry about.”
So what else needs to be done? The key problem that needs to be addressed is that copyright holders do not have complete information about how their protected works are specifically used by generative AI platforms. In the absence of that information, they are free to assume the worst about how generative AI platforms create new content. To solve that problem, copyright holders need more information. One way to ensure that they get this information would be to require AI companies to adopt and publish a data usage policy. Such a policy would (a) identify the data sets they use to train their AI models, and (b) describe if and how their AI platform retains data from the data sets they employ. Any AI company that published a data usage policy would be entitled to the fair learning presumption; companies that opted not to publish such a policy would be required to rely on the standard fair use defense currently available to defendants in copyright cases.
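To make the proposal concrete, here is a minimal sketch of what such a disclosure might contain, expressed as a simple data structure. Every field name and value below is a hypothetical illustration invented for this example, not a proposed legal standard:

```python
# A hypothetical data usage policy disclosure (illustrative only; all field
# names and values are invented for this sketch).
data_usage_policy = {
    # (a) identify the data sets used to train the AI models
    "training_datasets": [
        {"name": "public-web-text", "source": "publicly accessible web pages"},
        {"name": "licensed-image-archive", "source": "images licensed from stock providers"},
    ],
    # (b) describe if and how the platform retains data from those data sets
    "data_retention": {
        "raw_copies_retained": False,          # raw copies deleted after training
        "retention_period_days": 30,           # how long pre-training copies persist
        "model_stores_verbatim_works": False,  # trained model keeps no verbatim works
    },
}

# Both elements required by the proposal are present in the disclosure.
has_required_elements = (
    "training_datasets" in data_usage_policy and "data_retention" in data_usage_policy
)
print(has_required_elements)  # True
```

Under the proposal, publishing a disclosure with both elements would earn the fair learning presumption; omitting it would leave the company to the ordinary fair use defense.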
The final requirements of a data use policy need not be settled here. However, we can say that this new requirement would have two main objectives. First, a robust data use policy aims to close the gap between what AI platform owners and copyright owners know about the actual functioning of AI platforms. In certain cases, a published data use policy would clarify that protectable copyright elements are not reproduced in AI-generated works, thereby dissuading copyright owners from filing lawsuits that are unlikely to succeed.2 Second, a dialogue would be initiated between AI companies and the creative class. Over the long run, this dialogue would help all ecosystem participants appreciate that machines learn from copyrighted works in largely the same way that humans do: by repeatedly sampling different works in order to learn the underlying abstract structures and patterns.
1 This is not true of all AI platforms that use photographic material to train their respective models. For instance, Dall-E and Stable Diffusion do attempt to understand and mimic choices generally considered “artistic,” such as choices regarding lighting, framing, and composition.
It is important to fully understand what such a data use policy would mean in the context of AI-generated art. The AI models being used today ingest samples of copyright-protected artistic expression on a very large scale. For example, one commonly used AI model ingests 2 billion images and produces a comparatively tiny model taking up only 4GB. Each image is represented, on average, by only 2 bytes in the model, which means that the original images are not faithfully reproduced in the model. The model does not store compressed copies of the full-sized images; instead, it analyzes the large collection of images to identify patterns across them. In other words, “copying” in this context does not mean the same thing that copying means when an art forger attempts to make a complete 1:1 copy of a Monet.
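The back-of-the-envelope arithmetic above can be checked directly. The 2-billion-image and 4GB figures come from the text; the typical compressed photo size used for comparison is an assumption made for illustration:

```python
# Rough arithmetic from the example: a ~4 GB model trained on ~2 billion images.
num_images = 2_000_000_000         # images ingested during training (from the text)
model_size_bytes = 4 * 10**9       # ~4 GB model, using decimal gigabytes

# Average number of bytes in the model per ingested image.
bytes_per_image = model_size_bytes / num_images
print(bytes_per_image)  # 2.0

# Compare with an assumed typical compressed photo of ~500 KB: each image's
# average "share" of the model is hundreds of thousands of times smaller
# than the image itself, so no faithful copy can survive in the model.
typical_photo_bytes = 500_000
ratio = typical_photo_bytes / bytes_per_image
print(ratio)  # 250000.0
```

Two bytes cannot encode even a single pixel of a color photograph, which is the quantitative core of the argument that model weights are not "copies" in the forger's sense.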