Personally Identifiable Information in AI Models

I’m currently working on a customer support chatbot. We have a huge archive of historical conversations between real agents and customers, which makes an ideal dataset to fine-tune a large language model.

That said, this dataset contains Personally Identifiable Information (PII) that must under no circumstances leak into the model’s parametric knowledge. This poses a security risk: the model could disclose PII to unintended recipients.

Before training AI models, it is therefore crucial to remove PII from datasets, both to safeguard individual privacy and to comply with data protection regulations such as GDPR, CCPA, and HIPAA.

In essence, PII encompasses data capable of directly or indirectly identifying an individual, such as names, addresses, phone numbers, email addresses, social security numbers, and IP addresses.

Cleaning PII is relatively straightforward for structured data with known fields. For unstructured data, such as emails or customer support tickets, automatically detecting PII is much harder. Interestingly, AI models are themselves employed to detect and remove PII in these cases.
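As a minimal illustration of the idea (not a description of any particular production pipeline), a first pass over unstructured text can redact the pattern-like PII types mentioned above, emails, phone numbers, social security numbers, and IP addresses, with regular expressions. Names and addresses have no fixed shape, which is exactly why NER-style AI models are used on top of such rules. The function and pattern names below are illustrative:

```python
import re

# Regex-based redaction sketch (illustration only). Regexes catch
# pattern-like PII; names and addresses require NER models on top.
PII_PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b(?:\+?1[ .-]?)?\(?\d{3}\)?[ .-]?\d{3}[ .-]?\d{4}\b"),
    "IP": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}

def redact_pii(text: str) -> str:
    """Replace each detected PII span with a typed placeholder."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

ticket = "Hi, I'm reachable at jane.doe@example.com or 555-123-4567."
print(redact_pii(ticket))
# → Hi, I'm reachable at [EMAIL] or [PHONE].
```

Typed placeholders such as `[EMAIL]` are preferable to plain deletion: the fine-tuned model still learns the conversational structure (e.g. that agents confirm contact details) without memorizing the actual values.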
