Joe Leon

The Dig

February 27, 2025

Research finds 12,000 ‘Live’ API Keys and Passwords in DeepSeek's Training Data


tl;dr We scanned Common Crawl - a massive dataset used to train LLMs like DeepSeek - and found ~12,000 hardcoded live API keys and passwords. This highlights a growing issue: LLMs trained on insecure code may inadvertently generate unsafe outputs.



Last month, we published a post about Large Language Models (LLMs) instructing developers to hardcode API keys. That got us wondering: why is this happening at scale and across different LLMs? A logical starting point: the training data itself. 

While we can’t access proprietary datasets, many are publicly available. Popular LLMs, including DeepSeek, are trained on Common Crawl, a massive dataset containing website snapshots. Given our experience finding exposed secrets on the public internet, we suspected that hardcoded credentials might be present in the training data, potentially influencing model behavior.

To test this, we downloaded the December 2024 Common Crawl archive (400 terabytes of web data from 2.67 billion web pages) and scanned it with TruffleHog, our open-source secret scanner.
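For reference, fetching the list of archive files for that crawl looks roughly like the following - a minimal sketch that assumes the AWS CLI is installed and that the December 2024 crawl is published under the ID CC-MAIN-2024-51:


# Download the index of WARC files for the crawl from the public Common Crawl bucket (no credentials required)
aws s3 cp s3://commoncrawl/crawl-data/CC-MAIN-2024-51/warc.paths.gz . --no-sign-request
gunzip warc.paths.gz   # one S3 object key per line, roughly 90,000 in total
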

Note: We understand that LLM behavior is influenced by multiple factors - training data is just one. Our goal is to highlight how frequently hardcoded credentials appear in one of the most widely-used LLM training datasets and spark a discussion on securing AI-generated code.

Key Findings

  • 11,908 Live Secrets were detected using TruffleHog in 400TB of web data. (Note: A ‘live’ secret authenticates successfully.)

  • 2.76 Million Web Pages contained live secrets. 

  • High Reuse Rate among secrets: 63% were repeated across multiple web pages. In one extreme case, a single WalkScore API key appeared 57,029 times across 1,871 subdomains!

Methodology Overview

Understanding Common Crawl

Common Crawl provides a massive, publicly available dataset representing a broad cross-section of the internet, making it a valuable resource for training LLMs. The December 2024 dataset we analyzed included:

  • 400TB of compressed web data

  • 90,000 WARC files (Web ARChive format)

  • Data from 47.5 million hosts across 38.3 million registered domains

What’s a WARC file?

WARC files store web crawl data in the same format archive.org uses to preserve web request/response data. A WARC file contains one or more WARC records.

Here’s an illustrative, abridged WARC response record in the format Common Crawl uses (the values below are placeholders, not an actual record from the dataset):
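
WARC/1.0
WARC-Type: response
WARC-Target-URI: https://www.example.com/
WARC-Date: 2024-12-15T08:12:03Z
WARC-Record-ID: <urn:uuid:00000000-0000-0000-0000-000000000000>
Content-Type: application/http; msgtype=response
Content-Length: 312

HTTP/1.1 200 OK
Content-Type: text/html

<html>
  <script>
    var apiKey = "EXAMPLE-NOT-A-REAL-KEY"; // hardcoded credential in the response body
  </script>
</html>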



The WARC record format preserves client request and server response data alongside metadata about the interaction. Since Common Crawl doesn’t send secrets to servers or record them in metadata, we focused only on scanning server responses.

Processing 400TB of Data

This was Truffle Security’s most extensive research scan to date. While TruffleHog is fast, processing 400 terabytes of data requires serious infrastructure.



How We Scanned It

We built a distributed job queue with 20 high-performance servers (16 CPU/32GB RAM each). Each node followed these steps:

  1. Download a ~4GB Common Crawl file.

  2. Decompress and split the file using awk along WARC record headers (WARC/1.0); see the splitting sketch after this list.

  3. Run TruffleHog on the extracted content:


trufflehog filesystem --only-verified --json --no-update


  4. Store the results in a database.

  5. Repeat 90,000 times.
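
As a rough sketch of steps 2 and 3 on a single node - assuming a decompressed archive named segment.warc; the file and directory names are illustrative, not our exact pipeline:


# Split the decompressed archive into one file per WARC record at each "WARC/1.0" header line.
# (The first line of a WARC file is itself "WARC/1.0", so the output path is set before the first print.)
mkdir -p records
awk '/^WARC\/1\.0\r?$/ { close(out); out = "records/rec_" ++n ".warc" } { print > out }' segment.warc

# Scan the split records; only secrets that verify against their service are reported.
trufflehog filesystem records/ --only-verified --json --no-update
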

Optimizations and Challenges

  • Filtering WARC records slowed us down.

    • We initially tried skipping non-response records (i.e., WARC request and metadata records), but filtering took longer than scanning everything.

  • WARC streaming was inefficient.

    • We built a custom WARC file handler (like we did for APK files). But we quickly discovered that streaming WARC files sequentially was significantly slower than splitting the entire file with awk and then scanning the split files with TruffleHog via OS command. 

  • Running on AWS saved time.

    • Since Common Crawl data is hosted on AWS, running our scan on AWS infrastructure sped up downloads 5-6x.
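
For example, copying one archive object from inside AWS might look like this (a sketch that reuses the warc.paths index mentioned earlier; the commoncrawl bucket is public, so the request is unsigned):


# Grab the first object key from the index and copy it from the public bucket.
# Running from infrastructure in the bucket's own region is what gave us the download speedup.
KEY=$(head -n 1 warc.paths)
aws s3 cp "s3://commoncrawl/${KEY}" ./segment.warc.gz --no-sign-request
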

Reporting Live Secrets Only

‘Live’ secrets are API keys, passwords, and other credentials that successfully authenticate with their respective services. 

For this research, a secret was considered ‘live’ only if TruffleHog’s automated verification process (which includes service-specific authentication checks) confirmed its validity. 

While we found thousands of live secrets, the number of strings that resemble secrets but lack verification in Common Crawl is far higher.
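
Concretely, the distinction comes down to the verification flag we ran with - a sketch against the same split-records directory used above:


# Report every candidate string that matches a secret pattern, verified or not.
trufflehog filesystem records/ --json --no-update

# Report only secrets that successfully authenticate ('live' secrets) - what this research counted.
trufflehog filesystem records/ --only-verified --json --no-update
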


A secret that fails verification - for example, AWS's documented placeholder access key ID, AKIAIOSFODNN7EXAMPLE - would not have been counted in our research.

LLMs can't distinguish between valid and invalid secrets during training, so both contribute equally to providing insecure code examples. This means even invalid or example secrets in the training data could reinforce insecure coding practices.

Implications & Next Steps

Our research confirms that LLMs are exposed to millions of examples of code containing hardcoded secrets in the Common Crawl dataset. 

While this exposure likely contributes to LLMs suggesting hardcoded secrets, model outputs are also shaped by other training datasets, fine-tuning, alignment techniques, and prompt context.

What can you do?

  1. Use Copilot Instructions or Cursor Rules to provide additional context to your LLM messages inside VS Code or Cursor.


//Example Rule:
You are a security engineer with 30+ years of secure coding experience. 
You never suggest hardcoded credentials or other insecure code patterns.


  2. Expand your secret scanning to include public web pages and archived datasets (e.g., Common Crawl, Archive.org); a minimal sketch follows this list.
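
Here is what that can look like for a site you own (wget and TruffleHog are assumed to be installed; the URL is a placeholder):


# Mirror your own public site to disk, then scan the snapshot for live secrets.
wget --mirror --no-parent --directory-prefix=./site-snapshot https://www.example.com/
trufflehog filesystem ./site-snapshot --only-verified --json --no-update
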

What can the industry do?

LLMs may benefit from improved alignment and additional safeguards - potentially through techniques like Constitutional AI - to reduce the risk of inadvertently reproducing or exposing sensitive information. 

A Word About Disclosures

Common Crawl’s dataset is a snapshot of the public internet. The exposure of live keys on the public internet has been well-documented, including by us.

Leaked keys in Common Crawl’s dataset should not reflect poorly on Common Crawl; it’s not their fault developers hardcode keys in front-end HTML and JavaScript on web pages they don’t control. Nor should Common Crawl be tasked with redacting secrets; their goal is to provide a free, public dataset of the public internet so that organizations like Truffle Security can conduct this type of research.

As a policy, when Truffle Security finds exposed secrets, we always attempt to help impacted organizations revoke their keys. During this research, we faced a pretty significant challenge - how do we (1) educate ~12,000 different website owners (most of whom don’t understand how their website was deployed in the first place) about what a secret is, and (2) convince them to rotate the impacted secret? 

Given the scale of the disclosures and the potential for our outreach to be flagged as spam, we adopted a different strategy. We contacted the vendors whose users were most impacted and worked with them to revoke their users' keys. We successfully helped those organizations collectively rotate/revoke several thousand keys.

Bonus: Extra Insights from Common Crawl

  • Secret Diversity

    • TruffleHog detected 219 different secret types in Common Crawl!

  • Notable Exposures

    • AWS Root Keys in Front-End Code?!

      • One AWS root key was used for S3 Basic Authentication.

      • We tested it - S3 Basic Auth doesn't work (thankfully!) But why was it there in the first place?


Example of a root AWS key exposed in front-end HTML.


  • A single webpage contained 17 unique live Slack webhooks.

    • A live chat feature routed user messages to one of 17 Slack channels, depending on the topic. And for some reason, the developers decided to hardcode all 17 Slack Webhooks.



  • Mailchimp API keys were the most frequently leaked. 

    • Nearly 1,500 unique Mailchimp API keys were hardcoded in front-end HTML and JavaScript. 

    • Developers hardcoded them into HTML forms and JavaScript snippets instead of using server-side environment variables.



    • Impact: Attackers could use these keys for phishing campaigns, data exfiltration, and brand impersonation.

  • Reused keys revealed client lists.

    • Some software development firms use the same API key across multiple client sites, making it trivial to identify their customers. 

    • This could be an interesting avenue for additional security research. If this topic or any others appeal to you, Truffle Security has an open CFP.