This post has been contributed by Dr Luke McDonagh, Module Convenor for Intellectual Property.
Creative AI uses substantial amounts of input data—images, videos, text and other artistic content—as part of its learning process. Music-generating AI draws on significant amounts of source material to find patterns and create new melodies based on various elements including tempo, chords and length. The same is true of visual art: The Next Rembrandt project involved 350 scanned images and over 150 gigabytes of data. While the final output was not universally acclaimed, it clearly demonstrated that modern AI can produce sophisticated creative output that resembles the works of professional (human) artists. Large amounts of source text are likewise required to generate literature and creative writing. Deep learning large language models (LLMs) such as GPT-4, which produce human-like text across a range of categories including creative writing, parody and storytelling, are now used in hundreds of different apps.
The key point here is that creative AI cannot function without source material. It needs to be trained using existing works, many of which are likely to be protected by copyright owned by another party. This inevitably raises the risk of infringement, both in relation to the AI’s inputs and its outputs. Feeding source material (inputs) into the AI and processing this data may violate the right to reproduction.[1] Likewise, the final product (outputs) could be regarded as an adaptation of pre-existing works. With regard to outputs, however, any finding of infringement will depend on whether pre-existing elements can be recognised in the final product. Works that contain clearly identifiable elements of earlier works are likely to violate the adaptation right.
Special treatment for AI?
If AI owners or users are to claim copyright ownership over new creative works generated by an algorithmic process, it is certainly arguable that they should also be held accountable where the AI commits infringement. If there is liability where a human performs a certain act, why should we give a machine more favourable fair dealing treatment when it does the same (especially as AI can do this on a much larger scale)? Surely, the law should not facilitate a two-tier system in which humans are disadvantaged when carrying out the same task, and should stay ‘technology neutral’ insofar as possible. This risk of a double-standard framework is reflected in the terminology used in this context: source material used by humans is typically referred to as ‘works’, whereas the same material used by AI is called ‘data’.
Exceptions under UK and EU law: transient copies
In UK law, the options for AI companies to claim a ‘fair dealing’ exception are limited. Article 5(1) of the EU Information Society Directive—known as the ‘transient copy’ exception—the core of which remains applicable in the post-Brexit UK, may apply in relation to some uses of creative AI where the reproduction is merely temporary.[2] In order to rely on this exception, the copying must: i) be incidental or transient; ii) form an essential part of the technological process; iii) enable the lawful use of a work; and iv) have no independent significance. As per the Infopaq case, the criteria are cumulative and will be interpreted strictly by the court.[3] This exception generally permits the copying of a protected work for the purposes of performing mechanical tasks which have no autonomous value (e.g. web browser data stored in a cache). In Infopaq the CJEU considered this provision in the context of data capture, which is arguably not too different from some of the steps involved in modern machine learning.
Consider the example of an AI application scanning through data on weather forecasts, aiming to assist users with scheduling holidays. In this case, some data may be temporarily stored so that it can be transmitted through a network between third parties. Here, the reproduction is clearly an essential part of the technological process and it is not necessary to keep the data once it has been run through the AI, i.e. this may be regarded as non-expressive use. Provided that there is no economic harm to the rightsholder, this use would also satisfy the three-step test in the Directive (which states that the provision will only apply in circumstances which ‘do not conflict with a normal exploitation of the work or other subject-matter and do not unreasonably prejudice the legitimate interests of the rightholder’). In the case of GPT, where the AI is trained on copyright-protected articles and books without permission, it is not clear that using these materials en masse without compensation meets this standard.
Exceptions under UK and EU law: text and data mining
Text and data mining (TDM) concerns the extraction and use of large amounts of data for the purposes of finding patterns, discovering relationships, and providing valuable information for research and other activities. Under s. 29A of the UK CDPA there is an exception for text and data analysis. However, this provision will be of limited significance where companies wish to exploit the final output commercially, as s. 29A specifically restricts the application of the exception to reproduction for the purposes of non-commercial research.
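To make concrete what ‘mining’ involves at a technical level, the following is a deliberately toy Python sketch (the corpus and function names are illustrative, not drawn from any real TDM system). It shows the basic principle: the text must first be copied and processed in order to compute statistical patterns over it, and it is this intermediate reproduction that engages the reproduction right discussed above. Real TDM operates at vastly larger scale, but the mechanics are analogous.

```python
import re
from collections import Counter

# Illustrative mini-corpus standing in for the large bodies of text
# that real TDM systems ingest.
corpus = [
    "The quick brown fox jumps over the lazy dog",
    "The lazy dog sleeps while the quick fox runs",
]

def mine_word_frequencies(documents):
    """Extract a simple 'pattern': how often each word appears
    across the whole corpus. Note that every document must be read
    and processed (i.e. reproduced in memory) to compute this."""
    counts = Counter()
    for doc in documents:
        counts.update(re.findall(r"[a-z]+", doc.lower()))
    return counts

freqs = mine_word_frequencies(corpus)
print(freqs["the"], freqs["fox"], freqs["lazy"])
```

The statistical output (word frequencies) contains no expressive content from the source texts, which is why TDM is often described as ‘non-expressive’ use; the legal difficulty lies in the copying performed along the way.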
In the EU, creative AI could also be exempted under the Digital Single Market (DSM) Directive’s text and data mining exceptions.[4] In order to rely on Article 3, the relevant activity must be performed by a cultural heritage or research institution in the context of scientific research. Given the lack of a broader fair use doctrine in Europe, the more limited protection offered under this exception may encourage AI companies to engage in mixed partnerships with public entities. While Article 4 of the Directive seemingly permits TDM to be performed by business entities and for any purpose, there is an important caveat: the provision is inapplicable where rightsholders reserve the right to mine, which significantly limits its usefulness in practice. Finally, one key unanswered question—which could be addressed by the CJEU in future case law—is whether non-profit entities could also rely on the exception where the mined data is used in an expressive manner.
Conclusion
How should liability be allocated where AI performs an infringing act and no exception is available? While UK and EU courts have not yet clearly addressed this question, the threat of infringement and the lack of legal certainty could harm the advancement of AI if developers are discouraged from creating and distributing important products and no timely guidance is given on the issue. A ruling is expected in an upcoming 2024 English High Court case involving the use of Getty Images content, which may establish an important precedent.
[1] See e.g. s. 17(2) UK CDPA 1988 (infringement includes reproducing, inter alia, a literary, artistic or musical work in any material form, and other European jurisdictions offer similar provisions).
[2] Directive 2001/29 on the harmonisation of certain aspects of copyright and related rights in the information society (Information Society Directive); see Case C-5/08 Infopaq International A/S v Danske Dagblades Forening EU:C:2009:465; The Newspaper Licensing Agency v Meltwater Holding BV and others [2010] EWHC 3099 (Ch).
[3] Case C-302/10 Infopaq International A/S v Danske Dagblades Forening (‘Infopaq II’), para. 27.
[4] Directive 2019/790 on copyright and related rights in the Digital Single Market, arts 3 and 4.