Newsletter

A brief look at privacy and generative AI

by William Eitrem, Eva Jarbekk and Thomas Hagen



Overview and brief summary

  • Many aspects of processing personal data with large language models are fairly straightforward from a legal point of view, such as using personal data in training or prompting.
  • All training and prompting involving personal data must have a legal basis.
  • Some generation of information that can amount to personal data may fall outside the scope of the data protection rules.
  • The European Data Protection Board ("EDPB") will likely be an important player in the space of personal data and large language models going forward.

An increasing number of companies are using services based on generative AI models. If these services are used to process personal data, both the data controller and the data processor must meet the data protection and privacy requirements set out in the General Data Protection Regulation ("GDPR"). However, the technology is complicated, and it may be difficult to establish whether the GDPR applies, as well as what the rules specifically mean for generative AI technology. In this article, we discuss how personal data is represented in text-generating models and address certain questions associated with large language models and personal data.


As in our previous newsletter article, we focus on the GPT technology by OpenAI. For certain technical clarifications, please see that article.

Definitions

First, we define two fundamental terms in the GDPR. "Personal data" is any information that, directly or indirectly, relates to an identified or identifiable individual (see Article 4 (1) GDPR). "Processing" of personal data includes any operation performed on it, for example collection, recording, storage, organisation, structuring, alteration, use, disclosure, dissemination or other forms of making available, and erasure (see Article 4 (2) GDPR).


Both terms are interpreted broadly. As these concepts have been thoroughly explored in case law, legal literature, and guidelines, many interpretational uncertainties have been settled – but difficult questions may still arise. This is often due to technical complexity and/or limited technological understanding among legal practitioners rather than a lack of legal clarity. In such cases, it may be unclear where the processing occurs, how it is conducted, which entity initiates it, and so on. Large language models are undeniably complex and challenging to understand, which can create legal uncertainty.


As large language models are often used in business relationships, the definitions of "data controller" and "data processor" are essential. A data controller is the entity that determines the purposes and means of processing personal data. A data processor is an entity that processes personal data on behalf of the data controller. In most cases, the provider of a generative AI service is a data processor, as it processes data on behalf of the user of the service.


However, the role of a generative AI service provider may change depending on how the model is used. For instance, when ChatGPT is used, OpenAI defines itself as the data controller for personal data processed for OpenAI's own purposes (such as GPT-model training), while a ChatGPT user will normally be the data controller for personal data entered through prompts, with OpenAI acting as data processor.


We structure the following discussion around the different "phases" of the model and, to some extent, of the program built on it: training of the model, what happens to the information when it is "in" the model, what happens when the model receives an input text, and what happens when the model is used to generate new material.

Training

As outlined in our previous newsletter, training a GPT-model involves copying information in digital form. If the copied information is personal data, the training constitutes processing.


Such processing of personal data requires a legal basis. For practical reasons, several legal bases are excluded for the training process itself, for example Article 6 (1) letter b (contract), Article 6 (1) letter c (legal obligation), Article 6 (1) letter d (vital interests), and Article 6 (1) letter e (public interest). The remaining alternatives are therefore consent, cf. Article 6 (1) letter a, and legitimate interest, cf. Article 6 (1) letter f.


Considering the extensive amount of training data large language models require to function properly, it is evident that consent under the GDPR is not a realistic option, especially for foundation models. The only remaining possibility is to base the processing on the legitimate interests of the data controller (presumably the owner of the generative AI model, such as OpenAI). Article 6 (1) letter f reads as follows:


Processing shall be lawful only if and to the extent that at least one of the following applies: … [where] processing is necessary for the purposes of the legitimate interests pursued by the controller or by a third party, except where such interests are overridden by the interests or fundamental rights and freedoms of the data subject which require protection of personal data, in particular where the data subject is a child.


The data controller's interest will be of a commercial nature, as the company seeks to make the model as good as possible in order to monetise it.


The decisive factor is the balancing of interests. For the purposes of this newsletter, we will not delve into the specifics of that assessment. However, the overwhelming amount of data the training requires is a strong argument against concluding that the processing can be based on legitimate interest. Further, there will be challenges in complying with information requests from data subjects. In addition, the training process is technically difficult to understand, and several providers, such as OpenAI, provide little information about how the training occurs. Both factors can compromise the GDPR's transparency principle. The combination of extensive processing and lack of transparency therefore undermines the possibility of using legitimate interest as a legal basis for training on personal data.


In this context, it is worth mentioning that on March 31, 2023, the Italian Data Protection Authority imposed a temporary ban on using ChatGPT in Italy. One of the explanations was that:


[T]here appears to be no legal basis underpinning the massive collection and processing of personal data in order to 'train' the algorithms on which the platform relies.


However, the subsequent dialogue between OpenAI and the Italian Data Protection Authority resulted in the ban being lifted – although it is unclear if this specific issue was adequately addressed. The extent to which the Italian Data Protection Authority contributed to clarifications in this area is therefore debatable. Furthermore, it is doubtful whether OpenAI had, in fact, considered any legal basis for its GPT models' training on personal data.


The legal grounds for training language models will be closely monitored going forward. For example, it is likely that the EDPB's "task force" will scrutinise this issue.[1]

"Storage" of training data

As outlined in our previous newsletter, information used as training data is not stored in a structured form. This can be illustrated by a situation where a person's full name (first name and surname) is used for training. The first name and the surname will indeed be words known by the model, but as separate tokens. A first name or a surname by itself does not normally constitute personal data without additional information that can be used to link the name to an individual. However, the model is trained to understand that these words have a connection.


To put it simply, we can say that a strengthened statistical connection is "stored" between the words, e.g. between first names and surnames, which is different from the words themselves being stored together. This strengthening increases the likelihood that the language model generates text where these words are linked together.
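
To make the idea of a "strengthened statistical connection" more concrete, the following is a minimal, purely illustrative Python sketch. It uses simple bigram counts over whole words rather than the actual GPT architecture (which learns weights over sub-word tokens), and the name "Kari Nordmann" is an invented example, not real personal data.

```python
from collections import defaultdict

# Toy illustration only: real GPT models learn weights over sub-word tokens,
# not explicit bigram counts, but the intuition is similar.
bigram_counts = defaultdict(lambda: defaultdict(int))

training_corpus = [
    "kari nordmann lives in oslo",       # hypothetical personal data
    "kari nordmann works as a teacher",  # repeated co-occurrence strengthens the link
    "ola hansen lives in bergen",
]

for sentence in training_corpus:
    tokens = sentence.split()  # a stand-in for real sub-word tokenisation
    for first, second in zip(tokens, tokens[1:]):
        bigram_counts[first][second] += 1

# The words "kari" and "nordmann" are kept as separate tokens; what the
# "model" retains is only a statistic describing how often one follows
# the other, here expressed as a conditional probability.
def next_token_probability(first: str, second: str) -> float:
    total = sum(bigram_counts[first].values())
    return bigram_counts[first][second] / total if total else 0.0

print(next_token_probability("kari", "nordmann"))  # 1.0 in this toy corpus
print(next_token_probability("kari", "hansen"))    # 0.0
```

The point of the sketch is that nothing resembling a record "Kari Nordmann" is stored anywhere; only the strengthened association between the two tokens remains.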


An interesting question is whether there is a threshold at which the degree of statistical connection can be understood as the words "belonging" together and thus constituting personal data. More practically, the question is perhaps not if, but rather where this boundary should be drawn. If the threshold is exceeded, it would be reasonable to say that the information is indeed stored, and that this storage constitutes a further processing of personal data. The question of when information "indirectly" relates to an individual is also relevant in this context.

Input and storage of input text

For ChatGPT to generate material, it needs a prompt from the user. If the user includes personal data in the input text, this constitutes processing, because the text must be copied into computer memory. This processing will normally trigger a requirement for a data processing agreement between OpenAI and the user.


In addition, issues arise regarding the possible long-term storage of such prompts (as opposed to the short-term copying into computer memory). By default, all input text is stored. This storage occurs as data storage usually does – structured, in a database – not as a strengthened statistical connection. Storage undoubtedly constitutes processing under the GDPR.[2]


In practice, OpenAI and other providers of large language models often offer functionality to ensure that input text is not stored on their systems. If such functionality is used, long-term storage presumably does not occur and consequently requires no legal basis.

Generation

In principle, nothing prevents a language model from generating information that can point to an identifiable natural person. The model's generation of such information will likely be a result of (1) the model having been trained on information involving personal data, and (2) the input text triggering statistical connections between the words that are strong enough for them to be generated together. In such instances, the legal conclusion would clearly be that the model's output constitutes processing of personal data.
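
Continuing the purely illustrative sketch from the section on "storage" above (same invented name, simplified bigram statistics rather than the actual GPT mechanism), the following shows how a prompt can trigger a learned association so that the linked words are generated together in the output:

```python
import random

# Toy co-occurrence statistics of the kind sketched earlier (invented names,
# not the actual GPT mechanism).
bigram_counts = {
    "kari": {"nordmann": 2},
    "nordmann": {"lives": 1, "works": 1},
    "lives": {"in": 1},
    "in": {"oslo": 1},
}

def generate(prompt_token: str, steps: int = 4) -> list[str]:
    output = [prompt_token]
    for _ in range(steps):
        candidates = bigram_counts.get(output[-1], {})
        if not candidates:
            break
        tokens, weights = zip(*candidates.items())
        # The next token is sampled in proportion to the learned association.
        output.append(random.choices(tokens, weights=weights)[0])
    return output

# A prompt containing the first name makes the associated surname the most
# likely continuation, so information about an individual may be reproduced.
print(generate("kari"))  # e.g. ['kari', 'nordmann', 'lives', 'in', 'oslo']
```

The sketch illustrates points (1) and (2) above: the personal data appears in the output only because the training material created the association and the prompt activated it.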


A more theoretical situation is where the model generates information that relates to an identifiable natural person, but the model's owners can demonstrate that this data is derived from training material unrelated to the person concerned.


A similar issue can arise if the user's prompt contains a fictional name, but the generated information relates to a real, identifiable person. Under the GDPR, such generation likely constitutes processing, but it is challenging to identify a legal basis for that processing. On a literal interpretation of the regulation's wording, a breach of the privacy rules could likely be established. However, it would be reasonable not to impose liability here, as the processing has occurred unintentionally and with a weak link between intentional or negligent actions and the breach.

Summary

Overall, the interplay between personal data and generative AI models gives rise to many interesting questions, and it will be exciting to see what position the EDPB and the ChatGPT task force will take on generative AI and privacy.


[1] See the EDPB's statement on the creation of a ChatGPT task force.
[2] Questions around training on users' prompts are not addressed in this newsletter.
