Anonymous Data in the Age of AI: Hidden Risks and Safer Practices

Privacy’s Newest Threat

It’s no secret that in today’s digital economy, where data is often called the new oil, entities collect, share, and analyze personal information at an unprecedented scale. Privacy laws impose strict obligations on how entities safeguard personal data. However, data protection laws often do not apply to the lesser-known category of “anonymous data.” This gap has given rise to a growing trend in which some entities exaggerate or misrepresent their anonymization practices to bypass privacy obligations, justify data sharing, or instill a false sense of security among customers. Analogous to greenwashing, in which companies overstate their sustainability claims to appear eco-friendly, the term “anonymity-washing” was coined in 2025 to describe a similar practice in the context of data privacy.

Additionally, given rapid advances in technology, many datasets marketed as “anonymous” can often be re-identified using modern data analytics, cross-referencing techniques, and machine learning. This discrepancy between claimed and actual privacy protection poses significant legal, ethical, and technical risks for these entities and, ultimately, compromises users’ privacy.

Anonymization vs. Pseudonymization

Data anonymization is the process of irreversibly transforming personal data so that individuals cannot be identified, either directly or indirectly. Once data is properly anonymized, there is no way to reverse the process or link the information back to a specific person. Data pseudonymization, on the other hand, involves replacing personal identifiers such as names, email addresses, or ID numbers with pseudonyms or artificial labels. While this masks identities, the original individuals can still be re-identified by anyone with access to the key or mapping table connecting the pseudonyms to their real identities.
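
To make the distinction concrete, the minimal Python sketch below contrasts the two approaches. The record fields, values, and pseudonym scheme are illustrative assumptions, not a description of any particular system.

```python
import secrets

# Illustrative records; all names and values are fabricated.
records = [
    {"name": "Alice Example", "email": "alice@example.com", "purchase_total": 120},
    {"name": "Bob Example",   "email": "bob@example.com",   "purchase_total": 75},
]

# Pseudonymization: swap direct identifiers for random tokens, but keep a
# mapping table so the data controller can reverse the process.
mapping = {}          # pseudonym -> original identifiers
pseudonymized = []
for rec in records:
    token = secrets.token_hex(8)
    mapping[token] = {"name": rec["name"], "email": rec["email"]}
    pseudonymized.append({"user_id": token, "purchase_total": rec["purchase_total"]})

# Anyone holding `mapping` can re-identify a pseudonymized record.
print(mapping[pseudonymized[0]["user_id"]])   # -> original name and email

# Anonymization in the strict sense requires discarding the mapping *and*
# ensuring the remaining attributes cannot single anyone out, for example by
# releasing only aggregates rather than row-level data.
print({"num_customers": len(pseudonymized),
       "total_spend": sum(r["purchase_total"] for r in pseudonymized)})
```

The mapping table here plays the role of the “additional information” that keeps pseudonymized data within the reach of data protection laws, as discussed below.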

Privacy laws such as the European Union’s General Data Protection Regulation (GDPR), the California Consumer Privacy Act (CCPA), and the U.S. Health Insurance Portability and Accountability Act (HIPAA) apply only to personal data, meaning information relating to an identified or identifiable natural person (see, for example, Art. 4(1) GDPR; Cal. Civ. Code § 1798.140(v)(1)). The challenge lies in determining when a person can no longer be considered identifiable from the available data. With regard to the GDPR, the European Court of Justice (ECJ) has stated that, in determining whether a person is identifiable, “account should be taken of all the means likely reasonably to be used either by the controller or by any other person to identify the said person.” The ECJ did not, however, treat every conceivable means of identification as “likely reasonably to be used.” Where the risk of identification “appears in reality to be insignificant,” for example, because identification is prohibited by law or practically impossible due to excessive effort, cost, or technical limitations, the information may not be regarded as personal data under the GDPR. Properly anonymized data therefore falls outside the GDPR’s scope, relieving entities of data protection obligations for that data and of the associated risk of fines for noncompliance.

Pseudonymized data, on the other hand, is treated differently, and its classification depends on the context, as the European Court of Justice recently clarified with regard to the application of the GDPR. If pseudonymization is implemented in such a way that only the data controller can re-identify individuals, then for third parties who lack access to the additional information, the data subject is not considered identifiable. Consequently, whether a data subject is identifiable must be assessed separately for each person and each controller involved in processing the relevant information. For any entity that retains the means to re-identify individuals, however, pseudonymized data remains within the scope of data protection laws and must be handled in compliance with their requirements.

Misleading Anonymization Claims 

To bypass privacy law obligations and gain user trust, some entities claim to anonymize or de-identify user data when in fact they are merely pseudonymizing it, for example, by replacing names with random IDs while retaining unique behavioral patterns that still enable re-identification. A well-known case is the Netflix Prize: in October 2006, Netflix launched a $1 million contest to improve its movie recommendation system and, as part of it, released a dataset containing 100 million movie ratings from 500,000 subscribers collected over six years. Although usernames were replaced with random IDs to anonymize the data, two University of Texas researchers later demonstrated that the dataset could be de-anonymized by cross-referencing it with publicly available information, such as ratings posted on IMDb. The researchers successfully identified individual users and exposed details such as their viewing habits and even potential political affiliations. Thus, simply removing names and email addresses may be insufficient to achieve true data anonymization.
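
The underlying technique is a linkage attack: records in the “anonymized” release are matched against outside information on attributes the two sources share. The toy sketch below uses entirely fabricated data and a deliberately simplified matching rule to show the general shape of such an attack.

```python
# Toy linkage attack; all data is fabricated for illustration only.
# "Anonymized" release: names removed, but (title, date) -> rating retained.
released = [
    {"user_id": "u1", "ratings": {("MovieA", "2006-03-01"): 5, ("MovieB", "2006-03-04"): 2}},
    {"user_id": "u2", "ratings": {("MovieA", "2006-03-02"): 4, ("MovieC", "2006-03-09"): 5}},
]

# Public side information, e.g. reviews someone posted under their real name.
public_reviews = {
    "Jane Doe": {("MovieA", "2006-03-01"): 5, ("MovieB", "2006-03-04"): 2},
}

def best_match(known_ratings, candidates):
    """Return the released record whose (title, date) -> rating entries
    overlap the most with the publicly attributed ratings."""
    def overlap(record):
        return sum(1 for key, score in known_ratings.items()
                   if record["ratings"].get(key) == score)
    return max(candidates, key=overlap)

for person, ratings in public_reviews.items():
    match = best_match(ratings, released)
    print(f"{person} likely corresponds to pseudonymous record {match['user_id']}")
```

The actual study used a statistical scoring method that tolerated imprecise dates and ratings, but the principle is the same: the more distinctive the retained attributes, the less protection the removal of names provides.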

A 2000 research study by Latanya Sweeney at Carnegie Mellon University’s Data Privacy Lab demonstrated the scale of this privacy challenge. It reported that 87% of the U.S. population (216 million of 248 million people) could likely be uniquely identified using just three attributes: five-digit ZIP code, gender, and date of birth. Moreover, about 53% of the population (132 million people) could be uniquely identified with only place of residence (city, town, or municipality), gender, and date of birth. Even at the county level, combining county, gender, and date of birth could uniquely identify approximately 18% of individuals.
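
Sweeney’s figures are, at bottom, a statement about how often a combination of quasi-identifiers is unique within a population. Before declaring a release “anonymous,” an entity can measure this directly on its own data. The sketch below does so for a tiny, fabricated table; the column names are assumptions chosen only for illustration.

```python
from collections import Counter

# Hypothetical released rows: direct identifiers removed, quasi-identifiers kept.
rows = [
    {"zip": "15213", "gender": "F", "dob": "1975-04-02", "diagnosis": "A"},
    {"zip": "15213", "gender": "M", "dob": "1975-04-02", "diagnosis": "B"},
    {"zip": "15232", "gender": "F", "dob": "1981-09-17", "diagnosis": "A"},
    {"zip": "15232", "gender": "F", "dob": "1981-09-17", "diagnosis": "C"},
]

QUASI_IDENTIFIERS = ("zip", "gender", "dob")

# Count how many rows share each quasi-identifier combination.
counts = Counter(tuple(row[q] for q in QUASI_IDENTIFIERS) for row in rows)

unique_rows = sum(1 for row in rows
                  if counts[tuple(row[q] for q in QUASI_IDENTIFIERS)] == 1)
print(f"{unique_rows}/{len(rows)} rows are unique on {QUASI_IDENTIFIERS}")
```

A high fraction of unique combinations means the release can likely be linked to an external source, such as a voter roll, that carries the same quasi-identifiers alongside names.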

Growing Re-Identification Risks with Modern Technology

With advances in re-identification techniques and the capabilities of modern data science tools, even small datasets that appear anonymous or pseudonymous can often be re-identified. For example, information such as location traces, browsing behavior, public datasets, and Internet of Things (IoT) metadata, including device IDs, timestamps, sensor activity logs, and network connection details, can be cross-referenced to uniquely identify individuals. These risks grow with the rapid progress of Artificial Intelligence (AI), which can process and correlate vast amounts of data at unprecedented speed. In particular, advances in machine learning (ML) make it possible for models to infer missing identifiers, cluster individuals based on subtle behavioral patterns, and even extract sensitive traits from seemingly harmless data. As a result, datasets once considered safely anonymized may still lead to the identification of individuals when analyzed with advanced predictive algorithms.
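
As a highly simplified illustration of this kind of statistical matching, the sketch below links “de-identified” behavioral fingerprints to labeled profiles by similarity alone. A real attack would use far richer features and trained models; all data and names here are fabricated.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Fabricated "behavioral fingerprints": share of activity in four daily time slots.
deidentified = {
    "record_1": [0.70, 0.10, 0.05, 0.15],
    "record_2": [0.05, 0.20, 0.60, 0.15],
}
labeled_profiles = {  # e.g. activity patterns observed on a public platform
    "jane@example.com": [0.68, 0.12, 0.04, 0.16],
    "john@example.com": [0.10, 0.15, 0.55, 0.20],
}

# Match each unlabeled record to the most similar labeled profile.
for rec_id, fingerprint in deidentified.items():
    best = max(labeled_profiles,
               key=lambda who: cosine(fingerprint, labeled_profiles[who]))
    print(f"{rec_id} most closely matches {best}")
```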

Legal and Ethical Consequences

This creates significant legal, compliance, and ethical risks. Mislabeling personal data as anonymous can lead to penalties, enforcement actions, and lawsuits, with GDPR fines reaching up to €20 million or 4% of global annual turnover, whichever is higher. New regulatory frameworks, including the EU Data Act, India’s Digital Personal Data Protection Act (DPDPA), and stricter enforcement under Brazil’s General Data Protection Law (LGPD), are raising the standards for anonymization practices and closing potential compliance loopholes. Beyond regulatory consequences, the irresponsible use of personal data undermines consumer trust and transparency. When entities fail to protect privacy, users may feel misled or exploited, ultimately losing confidence in both digital services and data-driven innovation.

Best Practices 

To evaluate potential privacy exposure, entities should distinguish between anonymization and pseudonymization and perform risk-based re-identification assessments. Rather than relying on weak anonymization claims, entities should adopt provable privacy-preserving techniques such as:

  • Differential Privacy: a technique that introduces controlled statistical noise into datasets or query results to prevent the re-identification of individuals (see the sketch after this list).
  • Zero-Knowledge Proofs (ZKPs): a cryptographic method that allows one party (prover) to demonstrate the truth of a statement to another party (verifier) without revealing any underlying data. Applied to LLMs, ZKPs could verify outputs and compliance without disclosing prompts, training data, or model weights. 
  • Secure Multiparty Computation (sMPC) and Fully Homomorphic Encryption (FHE): cryptographic approaches that allow sensitive data to be used in joint computations or encrypted processing, enabling results to be derived without ever exposing raw inputs.
  • Mixnets and Tor-style Routing: network-layer techniques that hide communication patterns by obfuscating metadata such as sender, receiver, and traffic timing.
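
To make the first of these techniques concrete, the sketch below applies the classic Laplace mechanism to a simple counting query. The epsilon value, data, and query are illustrative assumptions rather than a production-ready implementation; real deployments also need careful privacy-budget accounting.

```python
import random

def laplace_noise(scale: float) -> float:
    """Laplace(0, scale) noise, drawn as the difference of two exponentials."""
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

def dp_count(values, predicate, epsilon: float) -> float:
    """Differentially private count: a counting query has sensitivity 1,
    so Laplace noise with scale 1/epsilon satisfies epsilon-DP."""
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1.0 / epsilon)

# Hypothetical query: how many users in the dataset are over 40?
ages = [23, 35, 41, 52, 29, 67, 44, 38]   # fabricated data
noisy = dp_count(ages, lambda age: age > 40, epsilon=0.5)
print(f"Noisy count of users over 40: {noisy:.1f}")
```

Because any single person’s presence or absence changes the true count by at most one, the added noise bounds what the published result can reveal about any individual.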

In practice, approaches like anonymous credential systems illustrate how entities can balance secure authentication with strong privacy protection. To avoid misleading anonymization claims, best practices should include:

  • Implementing proven privacy-enhancing technologies (PETs) to provide robust and reliable data protection;
  • Disclosing the anonymization methods and their limitations to users to promote transparency and informed decision-making; and
  • Conducting independent and routine reviews of the anonymization techniques to assess and verify their effectiveness and to support continued compliance with evolving standards and regulations.

How We Can Help

Developers, researchers, and entities working with sensitive data should use verifiable anonymization techniques to avoid legal risk and the loss of customer trust; this is especially true for those working in the privacy-enhancing technology space. At Least Authority, we are committed to helping engineers, entities, and communities create systems that are secure, trustworthy, and privacy-preserving. Through our audits and expertise, we help teams embed privacy into system architecture so that user privacy is not merely promised but protected in practice.