On March 13, the European Parliament cleared another, and almost the last, hurdle for the EU AI Act to come into force. The Act is expected to become law by May or June, and its provisions will be phased in: six months later, the prohibitions on banned AI practices take effect; one year later, the rules for general-purpose AI systems begin to apply; and two years later, the AI Act will be enforceable in its entirety.
It’s a good idea to prepare for compliance now, as AI developers and providers have a lot to do. This article focuses on the obligations imposed by the EU AI Act as they relate to data protection and how Private AI’s technology can facilitate compliance.
Data Protection Obligations under the EU AI Act
Recital 45 in the final text clarifies that “Practices that are prohibited by Union law, including data protection law […] should not be affected by this Regulation.” This means, not surprisingly, that the General Data Protection Regulation (GDPR) applies in addition to the EU AI Act. Recital 69 expands on this:
The right to privacy and to protection of personal data must be guaranteed throughout the entire lifecycle of the AI system. In this regard, the principles of data minimisation and data protection by design and by default, as set out in Union data protection law, are applicable when personal data are processed. Measures taken by providers to ensure compliance with those principles may include not only anonymisation and encryption, but also the use of technology that permits algorithms to be brought to the data and allows training of AI systems without the transmission between parties or copying of the raw or structured data themselves, without prejudice to the requirements on data governance provided for in this Regulation.
The EU AI Act also tells us that it “should not be understood as providing for the legal ground for processing of personal data, including special categories of personal data, where relevant, unless it is specifically otherwise provided for in this Regulation.” This warrants an explanation.
Under the GDPR, the processing of personal data, and particularly of sensitive data known as “special categories of data”, is only permitted if there is a legal basis for the processing, such as the consent of the affected individual or a legitimate interest pursued by the controller. The EU AI Act now clarifies that it does not itself constitute such a legal basis; a legal basis must instead be found in the GDPR, unless the Act specifically says otherwise.
An exceptional legal basis for the processing of special categories of personal data is provided in Art. 10(5) and the corresponding Recital 70. Art. 10(5) provides that special categories of personal data may be processed for the purpose of bias detection and correction, subject to strict security safeguards and only where bias detection and correction cannot be effectively carried out with synthetic or anonymized data. This exception to the GDPR’s prohibition on processing special categories of personal data is the only one in the Act, and it does not apply to the development of AI systems that scrape data from the web for general training purposes.
More generally, Art. 10(2)(b) requires developers of high-risk AI systems to implement risk mitigation and data governance practices that concern “data collection processes and the origin of data, and in the case of personal data, the original purpose of the data collection.” This somewhat cryptic provision seems to say that developers must consider how and for what purpose personal data was originally collected, and ensure that its use for the development of the high-risk AI system is permitted. Note that Art. 5 of the GDPR requires the specification of the purpose for which personal data will be used and a legal basis for each purpose. Once collected, the data cannot be used for purposes incompatible with the original one. In other words, if an organization collects personal data for the purpose of providing services to a consumer, it is not a given that this data can then be used to train a high-risk AI system. Developers must also implement a risk management system for high-risk AI systems. Art. 9, which imposes this obligation, does not refer to any specific types of risk, which supports the conclusion that privacy risks must be covered as well.
Providers of high-risk AI systems are required under Art. 17 to implement a quality management system, which must include comprehensive systems and procedures for data management covering virtually every data-related operation performed before and for the purpose of placing high-risk AI systems on the market or putting them into service. As part of this quality management system, providers must also maintain a risk management system in accordance with Art. 9, addressed above.
Deployers of high-risk AI systems must conduct a fundamental rights impact assessment under Art. 27. Art. 27(4) clarifies that if a data protection impact assessment has already been conducted pursuant to Art. 35 of the GDPR, the fundamental rights impact assessment shall complement that prior assessment. This means that the fundamental rights impact assessment must include privacy considerations that arise from the deployment of the high-risk AI system.
Providers of general-purpose AI models (GPAIs), which are not automatically high-risk AI systems, including providers of open-source GPAIs, must “draw up and make publicly available a sufficiently detailed summary about the content used for training of the general-purpose AI model, according to a template provided by the AI Office” (Art. 53(1)(d)). For more detail on GPAIs, read our article EU AI Act Final Draft – Obligations of General-Purpose AI Systems relating to Data Privacy.
Providers of general-purpose AI models with systemic risk have the additional obligation to assess and mitigate systemic risks that may arise from the development, the placing on the market, or the use of such models. A GPAI model is presumed to have systemic risk when the cumulative amount of computation used for its training, measured in FLOPs, is greater than 10^25. The Act does not specify whether the systemic risks to be assessed and mitigated include privacy risks, but given that large language models of this magnitude are so far trained on enormous amounts of data scraped from the internet, which includes personal information, and given that training data is regularly memorized by such models, it does not seem far-fetched to conclude that systemic risks may include privacy risks. This interpretation is supported by the fact that the criteria for determining what constitutes a systemic risk, listed in Annex XIII, include the quality and size of the data set as well as the specific types of inputs and outputs.
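To put the 10^25 FLOPs threshold into perspective, training compute for dense transformer models is often approximated as roughly 6 × (number of parameters) × (number of training tokens). The sketch below uses that rule of thumb together with hypothetical model figures to check the presumption threshold; both the approximation and the figures are illustrative assumptions, not values taken from the Act.

```python
# Rough check against the EU AI Act's 10^25 FLOPs presumption threshold for
# systemic risk. The 6 * parameters * tokens approximation for dense
# transformers and the example figures below are illustrative assumptions,
# not values taken from the Act.

SYSTEMIC_RISK_THRESHOLD_FLOPS = 1e25

def estimated_training_flops(n_parameters: float, n_training_tokens: float) -> float:
    """Approximate cumulative training compute for a dense transformer."""
    return 6 * n_parameters * n_training_tokens

# Hypothetical example: a 70-billion-parameter model trained on 15 trillion tokens.
flops = estimated_training_flops(70e9, 15e12)
print(f"Estimated training compute: {flops:.2e} FLOPs")                           # ~6.3e24 FLOPs
print("Presumed to pose systemic risk:", flops > SYSTEMIC_RISK_THRESHOLD_FLOPS)   # False
```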
How Private AI can Help with Compliance
All privacy-related obligations faced by developers, providers, and deployers of AI systems share a common challenge: they are exceedingly difficult to meet without clear insight into the personal data contained in the training datasets. Absent such visibility, organizations will struggle to provide the necessary disclosures prior to collecting personal data, to request specific consent, or to effectively implement data subject rights. This is particularly evident with the right to erasure, where the worst case may require retraining the model, an effort that raises environmental concerns due to the resource-intensive training process and imposes financial burdens on businesses. Reporting obligations, fundamental rights impact assessments, and systemic risk assessments are equally impossible without knowing what personal data is included in the training data sets.
Model developers employ various strategies to tackle privacy issues, but their effectiveness is limited. Some exclude websites containing substantial amounts of personal data from data scraping efforts, while others engage independent privacy experts in Reinforcement Learning from Human Feedback (RLHF) efforts to align the model with the objective of safeguarding personal information. However, these approaches still leave gaps in personal data protection, exposing organizations to potential liability under the GDPR and the EU AI Act.
Enter Private AI. Private AI’s technology identifies and reports on personal identifiers in large unstructured data sets and replaces them with synthetic data or placeholders. For many use cases, this approach, which relies on context-aware algorithms trained by data experts in multiple languages, preserves data utility while maximizing data privacy.
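To illustrate the general idea of placeholder-based de-identification (not Private AI’s actual implementation or API), here is a minimal sketch that uses spaCy’s open-source named entity recognizer as a stand-in detector. A purpose-built, multilingual, context-aware system will catch far more identifiers than this example.

```python
# Minimal sketch of placeholder-based de-identification of unstructured text.
# spaCy's open-source NER is used as a generic stand-in; it is not Private AI's
# technology and will miss many identifiers a purpose-built system detects.
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes this spaCy model is installed

def redact(text: str) -> tuple[str, dict[str, str]]:
    """Replace detected entities with labeled placeholders; return text and mapping."""
    doc = nlp(text)
    mapping = {}
    redacted = text
    ents = list(doc.ents)  # entities in document order, non-overlapping
    # Replace from the last entity backwards so earlier character offsets stay valid.
    for i in range(len(ents) - 1, -1, -1):
        ent = ents[i]
        placeholder = f"[{ent.label_}_{i + 1}]"
        mapping[placeholder] = ent.text
        redacted = redacted[:ent.start_char] + placeholder + redacted[ent.end_char:]
    return redacted, mapping

redacted, mapping = redact("Maria Schmidt emailed Acme GmbH from Berlin on 3 March 2024.")
print(redacted)  # e.g. "[PERSON_1] emailed [ORG_2] from [GPE_3] on [DATE_4]."
print(mapping)   # placeholders -> original values, retained for reporting or re-identification
```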
This technology is useful not only for model developers but also further down the value chain. Where businesses are concerned that employees may include personal data in prompts sent to an external model, Private AI’s PrivateGPT can be deployed to intercept the prompt, filter out the personal data, and automatically re-inject it into the response for a seamless user experience. Test PrivateGPT for free here.
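As a rough sketch of that intercept-and-re-inject pattern, the flow looks roughly like the snippet below. It reuses the hypothetical redact() helper from the previous example, with call_llm() standing in for any external model API; PrivateGPT’s actual interface may differ.

```python
# Sketch of the proxy pattern: redact the prompt, send only placeholders to the
# external model, then re-insert the original values into the response locally.
# call_llm() is a hypothetical stand-in for a real LLM API call.

def call_llm(prompt: str) -> str:
    # Placeholder for the external model call; echoes the prompt for demonstration.
    return f"Here is a draft based on your request: {prompt}"

def privacy_preserving_completion(prompt: str) -> str:
    redacted_prompt, mapping = redact(prompt)       # personal data never leaves the proxy
    response = call_llm(redacted_prompt)            # the external model sees placeholders only
    for placeholder, original in mapping.items():   # re-inject originals into the response
        response = response.replace(placeholder, original)
    return response

print(privacy_preserving_completion(
    "Draft a reply to Maria Schmidt about her 3 March meeting in Berlin."
))
```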
In addition, with the help of Private AI, privacy can be preserved during fine-tuning, when creating embeddings for Retrieval Augmented Generation (RAG), and during bias reduction.