Personally Identifiable Information in AI Models

I’m currently working on a customer support chatbot. We have a large archive of historical conversations between real agents and customers, which looks like an ideal dataset for fine-tuning a large language model.

That said, this dataset contains Personally Identifiable Information (PII), which must under no circumstances leak into the parametric knowledge of the model. A model that has memorized PII is a security risk, because it can disclose that information to unintended recipients.

Before training AI models, it is therefore crucial to remove PII from datasets, both to safeguard individual privacy and to comply with data protection regulations such as GDPR, CCPA, and HIPAA.

In essence, PII encompasses data capable of directly or indirectly identifying an individual, such as names, addresses, phone numbers, email addresses, social security numbers, and IP addresses.

Cleaning PII is relatively straightforward for structured data with known fields, since the sensitive fields can simply be dropped or masked. Unstructured data, such as emails or customer support tickets, is much harder, because PII can appear anywhere in free text and has to be detected automatically. Interestingly, AI models are themselves employed to detect and remove PII in these cases.
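
To make the difference between the two cases concrete, here is a minimal Python sketch. It is not the model-based approach mentioned above: for the unstructured case it swaps in plain regular expressions, which only catch PII with a predictable shape (emails, phone numbers, IP addresses, US social security numbers) and cannot find names or street addresses. The field names, the patterns, and the placeholder format are all assumptions made for illustration.

```python
import re

# Hypothetical field names; a real ticket schema will differ.
PII_FIELDS = {"name", "email", "phone", "address"}

# Illustrative patterns only. They are deliberately simple, will miss many
# real-world formats, and cannot detect names or street addresses.
# Order matters: the broad PHONE pattern runs last so it does not swallow
# IP addresses or social security numbers.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "IP": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}


def redact_record(record: dict) -> dict:
    """Structured case: mask the values of fields known to hold PII."""
    return {
        key: "[REDACTED]" if key in PII_FIELDS else value
        for key, value in record.items()
    }


def redact_text(text: str) -> str:
    """Unstructured case: replace each pattern match with a type placeholder."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text
```

With these assumptions, redact_text("Contact jane.doe@example.com or +1 555-123-4567 from 192.168.0.1") returns "Contact [EMAIL] or [PHONE] from [IP]". A customer who writes "my name is Jane Doe" would pass through untouched, which is exactly the gap that model-based detection is meant to close.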
