AI agents leak secrets faster than you can fix them - schedule a demo with us at Black Hat 2026, Booth 5727

TRUFFLEHOG

COMPANY

RESOURCES

AI agents leak secrets faster than you can fix them - schedule a demo with us at Black Hat 2026, Booth 5727

Dylan Ayrey

The Dig

June 1, 2026

Scanning 7.6 Petabytes of AI Training Data for Secrets

Dylan Ayrey

June 1, 2026

tl;dr We scanned every public dataset on Hugging Face, which is where most open AI training data lives. That came to 7.6 petabytes across 187 million files, the largest secret scan of AI training data we know of. We found 221,303 live, unique credentials sitting in 6,003 datasets.

One of the highest-impact secrets we found had access to 393 GB of PII covering what we estimate to be roughly 3.7% of the global population. More on this will come in a dedicated follow-up. The rest of the scan shows how broad the problem is: cloud storage buckets, hosted databases, cloud-admin keys, and tokens that can push code into software a lot of people install.

We shared the findings with Hugging Face before publication; the company partnered closely with us, and CTO Julien Chaumond contributed native storage-bucket scanning support to TruffleHog.

Tokens that can push code into things you install

The scariest credentials here let you change software that other people run. Inside the training data we found 349 live GitHub personal access tokens: 223 with full repo write, 130 that can rewrite CI workflows, 112 with admin:org, and 110 that can publish packages. On top of that, 318 Docker Hub tokens that can push images. We checked npm and PyPI specifically and found zero live, so we’re not claiming those.

A single repo or admin:org token rewrites every repository its owner can push to, and that change ships to everyone who installs the result. Some of these tokens sit on accounts wired into software that millions of people run. Others belonged to accounts positioned deep in the software supply chain.

GitHub PATDockerHugging Face

Docker Hub: push images318

Hugging Face: write237

GitHub: full repo write223

GitHub: rewrite CI (workflow)130

GitHub: admin:org112

GitHub: publish packages110

Hugging Face: org-admin70

Live, verified tokens — hover a bar

Live, verified write-capable credentials found in public training data, counted by what they actually authorize.

One live repo-scoped token belonged to the founder of a widely used Model Context Protocol registry whose account was connected to the official MCP organization. That organization’s repositories hold servers and SDKs used by major AI coding tools and have more than 178,000 GitHub stars between them. Other examples included a highly privileged token held by an engineer at a large technology company, a developer at a bank, and a researcher at an AI lab. We are withholding the names of the people and organizations involved, and have responsibly disclosed our findings.

Keys with real blast radius

The scan also turned up live keys that open real infrastructure: cloud accounts, hosted databases, storage buckets, and messaging platforms. We used these credentials only for verification and metadata-only impact checks, meaning database size stats, Redis memory counters, and CloudWatch S3 bucket-size metrics. We did not read database rows, list object keys, download files, or modify anything. Here is what they unlock.

Cloud takeover8,557GCP service-account keysAcross 3,811 projects

Private storage51.7 TBin non-public S3 bucketsConfirmed from bucket metadata

Live databases8,594working database logins3.5 TB measured by metadata

Impersonation5,885Slack and Mailgun keysMany tied to named workspaces or domains

Chatbot spread18×copies of one pasted AWS keyCaptured once, then mirrored

Cloud takeover · GCP — 8,557 live Google service-account keys, across 3,811 projects

A service-account key is a non-interactive credential for a cloud project. Of the verified examples, 1,926 were Firebase admin keys with database access, one carried the explicit Owner role, and one was a Kubernetes cluster-admin. Project metadata indicated that some were associated with healthcare and payment applications. We are withholding project names and account identifiers.

Cloud storage · AWS S3 — 51.7 TB in buckets with public access blocked

The full S3 StandardStorage lower bound was 185 TB, but raw byte count is not enough: S3 can hold public assets, logs, backups, or almost anything. So we checked only bucket-level metadata for the largest accounts. Bucket policy and public-access-block settings confirmed 51.7 TB in buckets configured to block public access. Bucket-name tokens pointed at prod, backup, cloudtrail, invoice, customer, billing, rds, mongo, and terraform. We did not list object keys or read object contents.

AWS keys passing STS identity checks — 3,343
Keys able to list S3 buckets — 907
Bucket count visible through metadata — 8,676
Buckets with all public-access-block flags enabled — 51.7 TB; largest measured account — 66.9 TB

S3 lower bound (all buckets)185 TB

Confirmed non-public S351.7 TB

Live databases3.5 TB

Storage reachable by leaked keys — TB, from size metadata only

Live databases — 8,594 verified-live database logins, 3.5 TB by metadata

Connection strings that still authenticate to hosted databases. The target names lean heavily toward tutorials and side projects (test, myfirstdatabase, todo apps), and the median MongoDB cluster was only 2.8 MB. But the tail is real: 89 MongoDB clusters and 5 Postgres databases exceeded 1 GB, and the largest MongoDB cluster exposed 617.7 GB by database-size metadata alone. 6,121 of 6,802 MongoDB credentials still connected; the tail included a SQL Server tied to a US defense contractor and Postgres sets tied to a Brazilian federal agency.

617.7 GBLargest single exposed database, by size metadata

6,121 / 6,802MongoDB logins that still authenticate

94MongoDB & Postgres clusters over 1 GB

Impersonation · Comms — 5,885 live Slack tokens and Mailgun keys

We found 231 Slack tokens, 99.6% of which identified the associated workspace, plus 5,654 Mailgun keys with 2,470 tied to a custom sending domain. The examples included a Fortune 500 technology workspace and sending domains associated with or resembling major technology and consumer brands. We are withholding the workspace and domain names.

A new leak path · Chatbots — 18× mirrors from one key pasted into a chatbot

A live AWS key tied to a Brazilian lending fintech reached the training data because someone pasted their boto3 code into a chatbot. The conversation was captured by LMSYS-Chat-1M and mirrored about 18 times. We are withholding the company name.

What an attacker walks away with

Live, verified credentials grouped by what they control. Each credential type sits in one bucket, counted once.

Email & messaging14.5k

Cloud infrastructure13.1k

AI provider accounts10.7k

Hosted databases8.6k

Live, verified credentials — hover a bar for detail

The risk, quantified: at least $920,000 a year in stolen AI inference

The training data is full of keys to the AI providers themselves: 11,496 live across 1,210 datasets, covering OpenAI, Azure OpenAI, Anthropic, Gemini, Groq, and more. We never used any of them, so I can’t tell you the real balances, but I don’t need to. Every provider publishes a default spend cap, and a verified-live key sits on an account with at least that cap.

742 + 26OpenAI + Anthropic keys

$100entry-tier monthly cap

$76,800/moconservative exposure floor

This is a floor, not an estimate of actual balances or unauthorized usage. The keys were verified but never used.

Up to $200K/moAnthropic’s published cap for a single top build-tier account—more than the entire scan cost.

A new OpenAI account that has entered billing gets a $100/month usage limit by default. Anthropic’s entry tier is the same, $100 a month. We found 742 live OpenAI keys and 26 live Anthropic keys. Drain each one to just its default cap and that’s $76,800 a month, about $920,000 a year, of inference billed to people who have no idea their key is in a dataset. That’s the floor, the number you get if every account is stuck on the lowest tier.

And it climbs fast. OpenAI’s caps run $100, $500, $1,000, $5,000, then $50,000 a month at the top tier. Anthropic’s top build tier caps at $200,000 a month. We found 160 organization-owned OpenAI keys and 34 machine service-account keys, which are exactly the credentials that sit on funded, high-tier billing and almost never get rotated. A single top-tier key drained to its cap bills more in one month than this entire 7.6-petabyte scan cost to run.

All of that is before Gemini (1,429 keys), DeepSeek (667), xAI’s Grok (162, with no free tier at all), and 174 enterprise Azure OpenAI deployments, several of them baked into AllenAI’s Dolma 3. Every one of these keys is a live invoice pointed at its owner, and a free seat at a frontier model for whoever finds it first.

The biggest scan we’ve ever run

We cloned the public dataset hub end to end: every repository, every branch, every large-file object. Then we flattened Parquet, Arrow, JSONL, archives, and binaries into scannable text and ran TruffleHog with verification on. It worked out to 186.9 million unique files and about 7.6 petabytes of content across roughly 815,000 dataset repositories. About 670,000 of those finished cleanly. The largest web-scale scan we’d done before this topped out around 400 terabytes. This one was about nineteen times bigger.

The size is only half the story. These are the training sets behind models people actually use. The worst-hit ones are named, card-documented pretraining corpora that open models were built on. We verified every credential we cite against its provider, so they were live when we looked.

186.9M unique files scanned
7.6 PB of AI training data

How big is 7.6 petabytes?

7.6 petabytes is hard to picture. A single DVD holds 4.7 GB, so this scan would fill about 1.6 million of them. Stacked into a tower, those discs would stand roughly 1.9 kilometers tall, about as high as 4.4 Empire State Buildings on top of each other.

Sometimes the company leaked its own key

There are two ways a live key ends up in a dataset. Most of the time it belongs to a stranger. It leaked somewhere public, got scraped, and rode a corpus into a dataset whose publisher has never heard of them. The rarer case is also the more awkward one: the people who built and published the dataset leaked their own working key into it. Usually it’s the exact token they use to push to Hugging Face, sitting in a notebook, a cache file, or a backup they uploaded by accident.

The training data holds 787 live Hugging Face tokens, 237 with write access and 70 with org-admin. A live write token in a public dataset is a key to push malicious model weights or poison datasets across a whole org. Of the ones we could trace, about 700 had been scraped into someone else’s corpus and 63 were self-leaks, dropped by the owner into their own namespace. Either way, the AI supply chain is leaking the exact keys an attacker would need to poison it, and a tampered model or dataset can ride the same pipeline straight to everyone downstream.

Scraped from a stranger700

Self-leaked by the owner63

How 763 traceable Hugging Face write & org-admin tokens reached a dataset

Some self-leaks were tied directly to people with publishing access. The head of product at an AI infrastructure startup left an account-write Hub token in the company’s own text-to-SQL benchmark dataset. Another developer committed a highly privileged GitHub token with organization administration, repository write, CI workflow, package publishing, and repository deletion permissions into a speech dataset. An administrator across several pretraining-data organizations exposed a write token in one of those organizations’ datasets. We also found corporate upload tokens in datasets published by a data-labeling vendor and researchers at two AI labs. In each case, the credential was published by someone connected to the account or organization it could modify.

The cloud keys break the other way. Every live AWS, GCP, Azure, Docker, and database credential we could attribute traced back to someone other than the publisher. Almost nobody pastes their own live cloud key into a public dataset on purpose, so those got there by being scraped. That’s the next story.

One leak, copied into hundreds of datasets

Most of these secrets leaked somewhere else first, in a GitHub repo or a web page or a chat log. From there they got vacuumed into an upstream corpus like The Stack or Common Crawl, and then rode every derivative of that corpus downstream. 44% of all unique live secrets show up in more than one dataset. 19,380 of them appear in ten or more. The Stack and its forks alone carry 51,571 distinct live keys, and AllenAI’s Dolma family carries 28,110. Of the keys that reach ten or more datasets, 99.3% pass through one of these scrape corpora. The amplification is mechanical. Scrape a corpus once, and every dedup, filter, and fine-tuning remix republishes the same live keys.

The same verified credentials flow from upstream source families into the big pretraining corpora. As an example, two near-identical stack-edu re-uploads share 19,977 of 19,977 keys, and Stack code keys cross straight into AllenAI’s Dolma 3 web mixes.

Source family → dataset. Width ∝ shared live secrets. Hover to trace a flow.

Flow weight is the number of identical verified-live secrets (by credential hash) an upstream family shares with a dataset. Showing the four source families and the fifteen most secret-laden derivative datasets.

1,131 datasets contained the same single live key. One Infura key, pasted once into a ChatGPT conversation, got captured by the WildChat chat-log dataset and then copied into 1,131 public datasets and 10,162 file locations. Revoking it at the source does nothing about the other 1,130 copies. A leak in training data spreads on its own, and every copy is another place it keeps working.

Models you’ve heard of, trained on live keys

The household-name chatbots like ChatGPT, Claude, Gemini, and Llama keep their training data private, so I can’t tell you what’s in them. The open models that publish their data are full of live keys.

Model / corpus	Live keys▼	Downloads
Huginn-0125huginn-dataset · UMD	55,876	557,268
StarCoder / The StackStack_Tokenized · BigCode lineage	25,217	85,917
Swallowswallow-code-v2 · Institute of Science Tokyo	22,484	483,788
OLMo 3 (32B)dolma3_mix-6T · Allen Institute for AI	21,278	365,531
SmolLM2 (Stack-Edu)stack-edu · Hugging Face lineage	20,035	6,974
Pleias 1.0common_corpus · Pleias	14,399	1,489,914
Lucie-7BLucie-Training-Dataset · OpenLLM-France	10,699	351,590
Comma v0.1comma_v0.1_training_dataset · EleutherAI	6,479	155,762
Zamba / Zamba2Zyda · Zyphra	6,473	252,375

Unique live secrets per dataset, deduplicated by credential. Model attribution confirmed from each dataset’s public card. Downloads are Hugging Face all-time.

If you touch any part of this pipeline

Dataset authors & AI labs

Scan before you publish. Run a secret scanner over a corpus before it goes up. It’s the cheapest step in the pipeline, and it stops you from shipping live keys into every downstream remix.
Scan before you train. If you’re pulling a public dataset into a run, assume it has live credentials until you’ve checked. The big code and web corpora clearly do.

Developers & providers

Rotate, don’t hide. Any key that ever hit a public repo, a web page, or a chatbot should be treated as burned. Rotation is the only fix that survives being copied into a dataset.
Offer bulk revocation. Providers with revocation APIs and proactive scanning let researchers help at this scale. Without them, 221,303 keys is just a spreadsheet nobody can act on.

Training data is the most permanent leak there is

You can scrub a secret out of git history with a force-push. A training set has no undo. By the time a key lands in a published corpus, it’s been copied into derivative datasets, downloaded onto thousands of machines, and folded into model weights. These datasets are valuable because they’re permanent, versioned, and remixed constantly. That is exactly what makes a leaked key in one impossible to clean up.

The fix hasn’t changed: rotate the key. Once you revoke it, the copy in the dataset is just a harmless string. Until then it keeps working every time the data gets reused, and reuse is the whole point of training data.

221,303 keys is hard to disclose

We always try to help people revoke what we find, and at this scale we started before publishing. The highest-impact findings, including the exposure behind that 393 GB of PII, went to the affected parties and their providers ahead of this post so they could revoke and lock things down first, and we are holding publication until the most critical of them confirm receipt. We are not naming any of the people, companies, or datasets involved. Emailing 221,303 owners one by one isn’t realistic, and most of them have no idea what a training dataset even is, let alone that they’re in one, so we are also working the provider side: notifying the vendors whose customers are most affected so they can revoke in bulk, and routing verified findings through the partner channels we already have.

Hugging Face and the dataset authors didn’t cause this. They’re publishing snapshots of public code and the public web, which is what they’re supposed to do. The secrets leaked upstream, from people who’ll probably never read this. Anyone training on this data should still know it’s in there. Nothing here is a how-to: we’ve withheld the dataset names, file paths, and key material that would let a reader pull live credentials out of these corpora.

We shared these findings with Hugging Face ahead of publication, and they’ve been a real partner in getting ahead of the problem. Their CTO, Julien Chaumond, went a step further and contributed code directly to TruffleHog: native scanning support for Hugging Face’s new storage buckets, so secret scans of an org or user now cover object storage alongside models, datasets, and Spaces. That support lands in an upcoming TruffleHog release, and we’ll share more about the collaboration in a future post.

Takeaways

AI training data is full of live credentials. 221,303 unique, verified keys across 6,003 public datasets, including the named training corpora of models people actually use.
What a key unlocks matters more than its raw permission level. The damage here comes from credentials that push code into installed software, open hosted databases, take over cloud accounts, and send mail as real brands.
Leaks multiply. 44% of these keys live in more than one dataset, and one reached 1,131. Chat logs are now their own leak path, capturing keys people paste into chatbots.
Rotation is still the only fix. Scan corpora before publishing, scan before training, and revoke anything that ever touched a public surface.

Research and analysis by the Truffle Security research team. Scanning powered by TruffleHog. Live credentials were used only for verification and metadata-only impact checks; no stored data was read, copied, or modified.