
Overview

<aside> 💡 This is an evolving threat vector and probably one of the biggest concerns around companies’ full-throated adoption of LLMs: namely, that personal or confidential company information makes its way into the model’s training data and eventually into the outputs of user (or attacker) prompts.

Private data ‘leaks’ out in three main ways: (1) the model memorizes training data and later reproduces it in its outputs; (2) user prompts and inputs are retained, reused for training, or otherwise exposed by the provider; and (3) the provider’s systems are breached outright. A minimal sketch of the first vector follows this callout.

For more of the current research into this issue, see: https://www.semanticscholar.org/search?q=privacy%20loss%20TRAINING%20DATA%20LEAKAGE%20in%20large%20language%20models&sort=relevance

</aside>
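As a rough illustration of the first vector (memorization), here is a minimal sketch, assuming a local GPT-2 model via Hugging Face transformers stands in for a production LLM: prompt the model with the prefix of a record suspected to be in the training data and check whether it completes the rest verbatim, in the spirit of published training-data extraction attacks. The sample record is purely hypothetical.

```python
# Minimal memorization probe: does the model complete a "sensitive" prefix verbatim?
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def appears_memorized(prefix: str, secret_suffix: str, max_new_tokens: int = 40) -> bool:
    """Return True if the model reproduces secret_suffix when prompted with prefix."""
    inputs = tokenizer(prefix, return_tensors="pt")
    output_ids = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=False,  # greedy decoding: memorized text tends to surface this way
        pad_token_id=tokenizer.eos_token_id,
    )
    completion = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    return secret_suffix in completion

# Hypothetical "sensitive" record, used only for illustration.
print(appears_memorized("Contact Jane Doe at jane.doe@", "example.com, phone 555-0100"))
```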

A Recent Example of a Privacy Faux Pas (and Backlash)

<aside> 💡

Microsoft had to do an about-face on Recall, an AI feature that was meant to be the killer feature of the next iteration of Windows. Ironically, they had to recall Recall. Why? User privacy objections.

It worked by snapping screenshots of your PC screen every few seconds and feeding them to an on-device AI for near-constant image-to-text analysis, letting you query just about anything you had seen or done on your PC. Search results were presented in a timeline format so you could review your interactions over time.

In principle, a useful-sounding and potentially powerful capability… until you factor in the creep factor and the “uncanny valley” of it all… and you realize your great idea is a flop.

https://www.windowscentral.com/software-apps/windows-11/microsoft-postpones-windows-recall-after-major-backlash-will-launch-copilot-pcs-without-headlining-ai-feature

</aside>
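To make the mechanism concrete, here is a minimal sketch of a Recall-style pipeline, not Microsoft’s implementation: periodic screenshots, image-to-text extraction, and a locally searchable timeline. It assumes the third-party Python packages mss and pytesseract (with Tesseract OCR installed) and uses a SQLite full-text index as the store.

```python
# Sketch of a Recall-style pipeline: screenshot -> OCR -> searchable timeline.
import sqlite3
import time

import mss
import pytesseract
from PIL import Image

db = sqlite3.connect("timeline.db")
db.execute("CREATE VIRTUAL TABLE IF NOT EXISTS snapshots USING fts5(taken_at, content)")

def capture_and_index() -> None:
    """Grab the primary screen, OCR it, and store the text with a timestamp."""
    with mss.mss() as sct:
        shot = sct.grab(sct.monitors[1])                 # primary monitor
    image = Image.frombytes("RGB", shot.size, shot.rgb)  # raw pixels to a PIL image
    text = pytesseract.image_to_string(image)            # image-to-text analysis
    db.execute(
        "INSERT INTO snapshots VALUES (?, ?)",
        (time.strftime("%Y-%m-%d %H:%M:%S"), text),
    )
    db.commit()

def search_timeline(query: str):
    """Return matching snapshots in chronological order: the 'timeline' view."""
    return db.execute(
        "SELECT taken_at, snippet(snapshots, 1, '[', ']', '…', 12) "
        "FROM snapshots WHERE snapshots MATCH ? ORDER BY taken_at",
        (query,),
    ).fetchall()

# Capture every few seconds (the cadence that raised privacy alarms), then search:
# for _ in range(10):
#     capture_and_index()
#     time.sleep(5)
# print(search_timeline("password"))
```

Even this toy version makes the privacy problem obvious: everything on screen, including passwords, messages, and medical or financial details, ends up OCR’d into a plain local database.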

More Examples of Privacy Violations

  1. Data Breaches: The ChatGPT data breach in March 2023 exposed conversation histories and personal information of some users, including payment details of subscribers[6].
  2. Exposure of Sensitive Information: There have been instances where LLMs have reproduced private information, such as email addresses or phone numbers, in their outputs[7].
  3. Corporate Data Leaks: A study found that 3.1% of employees have input confidential company data into ChatGPT, risking exposure of proprietary information[6].
  4. Medical Data Exposure: Cases where healthcare professionals have shared patient information with ChatGPT to draft letters, potentially violating patient privacy[6].
  5. Unauthorized Data Usage: Concerns about companies using user inputs for purposes beyond the immediate interaction, such as improving their models or selling data to third parties[2][5].
  6. Bias and Discrimination: LLMs trained on biased data can perpetuate and amplify societal biases, leading to privacy violations for marginalized groups[1].
  7. Malicious Use: The potential for LLMs to be used in creating sophisticated phishing attempts or generating false information that appears credible[1][7].

To mitigate these risks, experts recommend implementing robust data protection measures, enforcing stricter privacy regulations, ensuring transparency in AI systems, and educating users about the potential risks of sharing sensitive information with LLMs[3][6]. As the technology continues to evolve, it's crucial to balance the benefits of LLMs with the fundamental right to privacy.
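One concrete, if simplistic, data-protection measure along these lines is to scrub obvious PII from prompts before they ever leave the organization. The sketch below uses a few illustrative regular expressions of my own choosing; a real deployment would rely on a proper DLP or PII-detection service rather than hand-rolled patterns.

```python
# Minimal PII-redaction filter applied to a prompt before it is sent to an external LLM.
import re

PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\b(?:\+?\d{1,2}[ .-]?)?(?:\(\d{3}\)|\d{3})[ .-]?\d{3}[ .-]?\d{4}\b"),
    "SSN":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact(text: str) -> str:
    """Replace anything that looks like PII with a labeled placeholder."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label} REDACTED]", text)
    return text

# Illustrative prompt with fictitious personal details.
prompt = "Draft a letter to John Smith (john.smith@example.com, 555-123-4567) about his claim."
print(redact(prompt))
# -> Draft a letter to John Smith ([EMAIL REDACTED], [PHONE REDACTED]) about his claim.
```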

Citations:

[1] https://arxiv.org/abs/2310.10383
[2] https://thenewstack.io/llms-and-data-privacy-navigating-the-new-frontiers-of-ai/
[3] https://stackoverflow.blog/2023/10/23/privacy-in-the-age-of-generative-ai/
[4] https://www.tonic.ai/blog/safeguarding-data-privacy-while-using-llms
[5] https://www.sentra.io/blog/safeguarding-data-integrity-and-privacy-in-the-age-of-ai-powered-large-language-models-llms
[6] https://www.linkedin.com/pulse/chatgpt-data-breach-wake-up-call-privacy-security-large
[7] https://hiddenlayer.com/research/the-dark-side-of-large-language-models/
[8] https://hiddenlayer.com/research/the-dark-side-of-large-language-models-2/