Model Inversion: Extracting Secrets from AI (Part 4)

When AI Remembers Too Much

You’d think that training data gets “digested” into model weights—abstracted into patterns, not memorized wholesale. Unfortunately, models—especially LLMs—can memorize significant portions of their training data.

Model inversion and data extraction attacks exploit this to recover private information. If your social security number was in the training data, an attacker might be able to extract it.


The Privacy Problem

What Models Memorize

Research has shown that LLMs can reproduce:

  • Phone numbers and addresses.
  • API keys and passwords.
  • Copyrighted text (books, articles).
  • Private conversations.
  • Personally identifiable information (PII).

Why This Happens

  1. Overparameterization: Models have more parameters than needed, enabling memorization.
  2. Repetition in training data: Frequently occurring patterns are strongly encoded.
  3. Unique sequences: Rare, distinctive text may be memorized verbatim.

Types of Privacy Attacks

1. Membership Inference

Question: “Was sample X in the training data?”

Impact: Reveals if someone was in a sensitive dataset (medical, financial, etc.).

2. Model Inversion

Question: “What does the training data look like?”

Impact: Reconstruct representative inputs (faces, text, records) from the model.

3. Attribute Inference

Question: “What private attributes does person X have?”

Impact: Learn sensitive details about individuals in training data.

4. Training Data Extraction

Question: “Can we extract verbatim training examples?”

Impact: Recover private text, PII, copyrighted content.


Membership Inference Attacks

The Intuition

Models behave differently on data they’ve seen vs. unseen:

  • Training data: high confidence, low loss.
  • Unseen data: lower confidence, higher loss.

Shadow Model Attack

┌─────────────────────────────────────────────────────────────┐
│ 1. Train "shadow models" mimicking target model             │
│                                                             │
│ 2. Create dataset of (sample, in/out) labels                │
│    - Samples from shadow training data → "in"               │
│    - Samples not in shadow training data → "out"            │
│                                                             │
│ 3. Train attack classifier:                                 │
│    Input: model confidence scores                           │
│    Output: "in training" or "not in training"               │
│                                                             │
│ 4. Apply attack classifier to target model                  │
└─────────────────────────────────────────────────────────────┘
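
A minimal sketch of steps 2-4, assuming sklearn-style shadow models; shadow_models, shadow_splits (pairs of in/out sample sets), target_model, and candidate_samples are placeholders:

import numpy as np
from sklearn.linear_model import LogisticRegression

def build_attack_dataset(shadow_models, shadow_splits):
    # Label each shadow model's own training samples "in" (1), held-out samples "out" (0)
    X, y = [], []
    for shadow, (in_samples, out_samples) in zip(shadow_models, shadow_splits):
        X.extend(shadow.predict_proba(in_samples)); y.extend([1] * len(in_samples))
        X.extend(shadow.predict_proba(out_samples)); y.extend([0] * len(out_samples))
    return np.array(X), np.array(y)

# Train the attack classifier on shadow confidences, then query the target model
X_attack, y_attack = build_attack_dataset(shadow_models, shadow_splits)
attack_clf = LogisticRegression(max_iter=1000).fit(X_attack, y_attack)
membership_guess = attack_clf.predict(target_model.predict_proba(candidate_samples))

(The original Shokri et al. attack trains one attack classifier per output class; a single classifier is used here for brevity.)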

Simple Threshold Attack

def membership_inference(model, sample, threshold=0.7):
    confidence = model.predict_proba(sample).max()
    
    if confidence > threshold:
        return "MEMBER"  # Likely in training data
    else:
        return "NON-MEMBER"

Model Inversion Attacks

Reconstructing Faces

Classic attack on facial recognition systems:

Target: Facial recognition model
Query: "What does person X look like?"

Process:
1. Start with a random noise image.
2. Optimize the image to maximize the model's confidence for person X.
3. The recovered image approximates the training images of person X.

Gradient-Based Reconstruction

import torch

def model_inversion(model, target_class, input_shape, iterations=1000):
    # Start from random noise and optimize the input itself
    x = torch.randn(1, *input_shape, requires_grad=True)
    
    optimizer = torch.optim.Adam([x], lr=0.01)
    
    for i in range(iterations):
        optimizer.zero_grad()
        
        # Forward pass
        output = model(x)
        
        # Maximize the target class score (minimize its negative)
        loss = -output[0, target_class]
        
        # Regularize toward plausible inputs (simple L2 penalty here)
        loss = loss + 0.01 * x.pow(2).mean()
        
        # Backpropagate and update the input
        loss.backward()
        optimizer.step()
    
    return x.detach()  # Reconstructed input

Training Data Extraction from LLMs

Verbatim Memorization

GPT-2 was shown to memorize and reproduce:

  • A specific person’s name, email, phone, and address.
  • Copyrighted text from books.
  • API keys and URLs.

Extraction Techniques

1. Prefix Prompting

Prompt: "John Smith can be reached at"
LLM: "555-123-4567" (if this was in training data)

2. Divergence Attacks

Prompt: "Repeat the word 'poem' forever"
LLM: "poem poem poem poem [begins outputting random training 
data] ...John Smith's SSN is 123-45-6789..."

3. Extracting with Context

Prompt: "From the book 'Harry Potter and the Sorcerer's Stone':
Mr. and Mrs. Dursley, of number four, Privet Drive, were 
proud to say that they were perfectly normal,"
LLM: [continues with copyrighted text]

Scale of the Problem

Research on GPT-2 found:

  • Around 600 memorized training examples were extractable.
  • Larger models memorize MORE.
  • GPT-3/4 likely memorize vastly more.

Attribute Inference

Given partial information, infer sensitive attributes:

Known: Person X's medical record shows they take metformin.
Infer: Person X likely has diabetes.

Attack: Use model's associations to infer private attributes 
that weren't explicitly provided.

Defenses Against Privacy Attacks

1. Differential Privacy

Add mathematical noise during training to limit what any single sample can influence.

# Differentially private SGD (DP-SGD)
for batch in data:
    # Compute a separate gradient for each example in the batch
    per_example_grads = [compute_gradients(example) for example in batch]
    
    # Clip each per-example gradient norm to at most C
    clipped = [clip_norm(g, max_norm=C) for g in per_example_grads]
    
    # Average the clipped gradients, then add Gaussian noise scaled to C
    noisy_grads = average(clipped) + gaussian_noise(std=sigma * C)
    
    # Update model with the noisy gradient
    model.update(noisy_grads)

Guarantee: An attacker learns almost the same thing whether or not any single individual's data is in the training set.

Trade-off: Privacy vs. model utility.
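
In practice, PyTorch users rarely hand-roll DP-SGD; a library such as Opacus wires it into an existing training loop. A rough usage sketch, assuming a standard model, optimizer, and train_loader (check the current Opacus docs for exact parameters):

from opacus import PrivacyEngine

privacy_engine = PrivacyEngine()

# Wrap the model, optimizer, and data loader so training runs DP-SGD
model, optimizer, train_loader = privacy_engine.make_private(
    module=model,
    optimizer=optimizer,
    data_loader=train_loader,
    noise_multiplier=1.0,  # sigma: noise scale relative to the clipping bound
    max_grad_norm=1.0,     # C: per-example gradient clipping norm
)
# The training loop itself stays unchanged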

2. Training Data Deduplication

Remove duplicates and near-duplicates from training data:

  • Reduces memorization of specific sequences.
  • Common practice for LLM training; a minimal sketch follows.
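
A minimal sketch of exact deduplication by normalized hash; catching near-duplicates typically requires MinHash or suffix-array matching on top of this:

import hashlib

def deduplicate(documents):
    seen, unique = set(), []
    for doc in documents:
        # Normalize case and whitespace so trivial variants hash identically
        key = hashlib.sha256(" ".join(doc.lower().split()).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique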

3. Memorization-Resistant Architectures

  • Smaller context windows.
  • Regularization techniques.
  • Selective attention mechanisms.

4. Output Filtering

import re

def filter_pii(response):
    # Detect and redact common PII patterns
    patterns = {
        'ssn': r'\d{3}-\d{2}-\d{4}',
        'email': r'\S+@\S+\.\S+',
        'phone': r'\d{3}[-.]?\d{3}[-.]?\d{4}'
    }
    
    for name, pattern in patterns.items():
        response = re.sub(pattern, '[REDACTED]', response)
    
    return response
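
For example:

print(filter_pii("Reach John at 555-123-4567 or john@example.com"))
# -> "Reach John at [REDACTED] or [REDACTED]"

Regex filters only catch obvious patterns; production guardrails usually layer them with trained PII detectors.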

5. Knowledge Unlearning

Techniques to “forget” specific training data:

  • Fine-tune to reduce memorization.
  • Gradient ascent on the samples to be forgotten (a rough sketch follows).
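
A rough sketch of the gradient-ascent idea, assuming a Hugging Face causal LM, an optimizer, and tokenized forget_batches (all placeholders):

# Push the model's loss UP on the samples to be forgotten
for batch in forget_batches:
    optimizer.zero_grad()
    loss = model(input_ids=batch["input_ids"], labels=batch["input_ids"]).loss
    (-loss).backward()  # negate the loss so the optimizer increases it
    optimizer.step()

Unconstrained ascent quickly damages general capability, so practical methods limit the number of steps or interleave fine-tuning on data to retain.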

6. Access Controls

  • Rate limiting API calls (sketched below).
  • Logging and anomaly detection.
  • Require authentication.
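
As an illustration of the first point, a minimal sliding-window rate limiter per API key (limits and window are arbitrary):

import time
from collections import defaultdict

_calls = defaultdict(list)

def allow_request(api_key, limit=60, window=60.0):
    # Keep only calls inside the sliding window, then check the count
    now = time.time()
    _calls[api_key] = [t for t in _calls[api_key] if now - t < window]
    if len(_calls[api_key]) >= limit:
        return False  # over the limit: reject or queue the query
    _calls[api_key].append(now)
    return True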

Measuring Privacy Risk

Exposure Metric

Quantify how easily text can be extracted:

Exposure = log2(perplexity_random / perplexity_target)

High exposure = model has memorized the sequence
Low exposure = model hasn't memorized it
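
A minimal sketch of this perplexity-based proxy, assuming GPT-2 via Hugging Face transformers; the candidate and reference strings are placeholders:

import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

def log2_perplexity(text):
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean negative log-likelihood (nats/token)
    return loss.item() / math.log(2)        # convert to bits per token

target = "John Smith's SSN is 123-45-6789"  # candidate memorized sequence
reference = "A generic sentence the model never saw verbatim."

exposure = log2_perplexity(reference) - log2_perplexity(target)
print(f"exposure proxy: {exposure:.2f} bits/token")  # large positive gap -> likely memorized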

Audit Tools

  • What Do NLP Models Memorize? (paper + tools)
  • Extraction attacks notebooks
  • Membership inference benchmarks

GDPR and Right to Erasure

If a model memorizes personal data:

  • Does “right to be forgotten” apply?
  • Can you truly delete data from a trained model?
  • Some regulators say yes, which may mean retraining the model.

Copyright

  • Models trained on copyrighted data can reproduce it.
  • Legal status is still being determined (NYT v. OpenAI, etc.).

Medical and Financial Data

  • HIPAA, SOC2, PCI-DSS implications.
  • Memorized data = potential compliance violation.

Practical Recommendations

For AI Developers

  1. Audit training data—remove PII, duplicates, sensitive content.
  2. Implement differential privacy—where feasible.
  3. Test for memorization—before deployment.
  4. Filter outputs—catch accidental disclosures.
  5. Document training data sources—for compliance.

For Organizations Using AI

  1. Understand what data was in training—ask vendors.
  2. Don’t input sensitive data to third-party models—assume it’s logged.
  3. Implement guardrails—detect and block PII in outputs.
  4. Consider fine-tuning—for sensitive domains, use models you control.

Summary

  • Models memorize training data—sometimes verbatim.
  • Membership inference: detect if sample was in training.
  • Model inversion: reconstruct training inputs.
  • Data extraction: directly retrieve memorized text.
  • Differential privacy provides mathematical guarantees.
  • Defense requires multiple layers: data, training, inference.

Next Part: How do we put it all together? AI Guardrails and Secure Deployment.
