As outlined in our previous newsletter, training a GPT model involves copying information in digital form. If the copied information includes personal data, the training constitutes processing of personal data under the GDPR.
Such processing of personal data requires a legal basis. For practical reasons, several legal bases are ruled out for the training process itself, for example Article 6 (1) letter b (contract), Article 6 (1) letter c (legal obligation), Article 6 (1) letter d (vital interests), and Article 6 (1) letter e (public interest). The remaining alternatives are therefore consent, cf. Article 6 (1) letter a, and legitimate interest, cf. Article 6 (1) letter f.
Considering the extensive amount of training data that large language models require to function properly, it is evident that consent under the GDPR is not a realistic option, especially for foundation models. The only remaining possibility is to anchor the processing in the legitimate interests of the data controller, presumably the owner of the generative AI model, such as OpenAI. Article 6 (1) letter f reads as follows:
Processing shall be lawful only if and to the extent that at least one of the following applies: … [where] processing is necessary for the purposes of the legitimate interests pursued by the controller or by a third party, except where such interests are overridden by the interests or fundamental rights and freedoms of the data subject which require protection of personal data, in particular where the data subject is a child.
The data controller's interest will be of a commercial nature, as the company seeks to make the model as good as possible in order to monetise it.
The decisive factor is the balancing of interests. For the purposes of this newsletter, we will not delve into the specifics of that assessment. However, the overwhelming amount of data the training requires is a strong argument against concluding that the processing can be based on legitimate interest. Further, there will be challenges in complying with access and information requests from data subjects. In addition, the training process is technically difficult to understand, and several providers, such as OpenAI, disclose little about how the training occurs. Both factors can compromise the transparency principle in the GDPR. The combination of extensive processing and a lack of transparency therefore undermines the possibility of relying on legitimate interest as a legal basis for training on personal data.
In this context, it is worth mentioning that on March 31, 2023, the Italian Data Protection Authority (the Garante) imposed a temporary ban on using ChatGPT in Italy. One of the stated grounds was that:
[T]here appears to be no legal basis underpinning the massive collection and processing of personal data in order to 'train' the algorithms on which the platform relies.
However, the subsequent dialogue between OpenAI and the Italian Data Protection Authority resulted in the ban being lifted – although it is unclear whether this specific issue was adequately addressed. The extent to which the Italian Data Protection Authority contributed to clarifying this area is therefore debatable. Furthermore, it is doubtful whether OpenAI had in fact considered any legal basis for training its GPT models on personal data.
The legal grounds for training language models will be closely monitored in the future. For example, the EDPB's ChatGPT "task force" is likely to scrutinise this issue going forward.