tl;dr We scanned every public dataset on Hugging Face, which is where most open AI training data lives. That came to 7.6 petabytes across 187 million files, the largest secret scan of AI training data we know of. We found 221,303 live, unique credentials sitting in 6,003 datasets. The clearest human-impact case was a live key to an India-scale KYC database tied to inVOID, a video identity-verification company later acquired by Bureau: 393 GB, 143 databases, and a governmentCheck collection with 16.1 million documents. Owner-run summaries confirmed fields for document numbers, PAN, names, dates of birth, user IDs, transaction IDs, face and liveness checks, OCR, and verification status. The rest of the scan shows how broad the problem is: cloud storage buckets, hosted databases, cloud-admin keys, and tokens that can push code into software a lot of people install.
The biggest scan we’ve ever run
We cloned the public dataset hub end to end: every repository, every branch, every large-file object. Then we flattened Parquet, Arrow, JSONL, archives, and binaries into scannable text and ran TruffleHog with verification on. It worked out to 186.9 million unique files and about 7.6 petabytes of content across roughly 815,000 dataset repositories. About 670,000 of those finished cleanly. The largest web-scale scan we’d done before this topped out around 400 terabytes. This one was about nineteen times bigger.
The size is only half the story. These are the training sets behind models people actually use. The worst-hit ones are named, card-documented pretraining corpora that open models were built on. We verified every credential below against its provider, so they were live when we looked.
186.9M unique files scanned
7.6 PB of AI training data
How big is 7.6 petabytes?
7.6 petabytes is hard to picture. A single DVD holds 4.7 GB and is about 1.2 mm thick. To store this scan you’d need 1.6 million of them, and stacked into a tower they’d reach nearly 2 kilometers.
It cost about $15,000 in compute to scan all 7.6 petabytes
The fun part of “biggest scan ever” is how cheap it was. The whole thing ran on ARM Graviton spot instances (c7g, c6g, m7g) on ECS, spread across up to six AWS regions so we could grab spare spot capacity wherever it was going for the lowest price. It ran for 109 days straight, 2,616 hours of continuous scanning.
The fleet averaged about 0.8 GB/s and peaked at 6.5 GB/s, which is 23.5 terabytes chewed through in a single hour. Spot pricing on Graviton is cheap enough that the compute came out to roughly $15,000, somewhere around $2,000 per petabyte. Scanning the entire open AI training-data frontier costs less than one engineer-month. That’s the whole reason nobody had checked what’s in there.
Runtime: 109 days
Peak throughput: 6.5 GB/s
Est. compute cost: ~$15K
Models you’ve heard of, trained on live keys
The household-name chatbots like ChatGPT, Claude, Gemini, and Llama keep their training data private, so I can’t tell you what’s in them. The open models that publish their data are full of live keys. BigCode’s The Stack, the code corpus behind StarCoder, carries 73,467 live keys across its public copies. Hugging Face’s Stack-Edu, the data behind SmolLM2, adds tens of thousands more. Pleias’ Common Corpus, the most-downloaded dataset in this whole set at 1.5 million pulls, ships with 14,399. AI2’s Dolma 3, behind OLMo 3, has 28,121. And the single worst dataset, the corpus behind the Huginn reasoning model, holds 55,876 working credentials on its own.
| Model / corpus | Live keys▼ | Downloads |
|---|---|---|
| Huginn-0125huginn-dataset · UMD | 55,876 | 557,268 |
| StarCoder / The StackStack_Tokenized · BigCode lineage | 25,217 | 85,917 |
| Swallowswallow-code-v2 · Institute of Science Tokyo | 22,484 | 483,788 |
| OLMo 3 (32B)dolma3_mix-6T · Allen Institute for AI | 21,278 | 365,531 |
| SmolLM2 (Stack-Edu)stack-edu · Hugging Face lineage | 20,035 | 6,974 |
| Pleias 1.0common_corpus · Pleias | 14,399 | 1,489,914 |
| Lucie-7BLucie-Training-Dataset · OpenLLM-France | 10,699 | 351,590 |
| Comma v0.1comma_v0.1_training_dataset · EleutherAI | 6,479 | 155,762 |
| Zamba / Zamba2Zyda · Zyphra | 6,473 | 252,375 |
Unique live secrets per dataset, deduplicated by credential. Model attribution confirmed from each dataset’s public card. Downloads are Hugging Face all-time.
Tokens that can push code into things you install
The scariest credentials here let you change software that other people run. Inside the training data we found 349 live GitHub personal access tokens: 223 with full repo write, 130 that can rewrite CI workflows, 112 with admin:org, and 110 that can publish packages. On top of that, 318 Docker Hub tokens that can push images. We checked npm and PyPI specifically and found zero live, so we’re not claiming those.
What matters is placement more than headcount. A single repo or admin:org token rewrites every repository its owner can push to, and that change ships to everyone who installs the result. Some of these tokens sit on accounts wired into software that millions of people run.
Live, verified write-capable credentials found in public training data, counted by what they actually authorize.
Some of these have names on them. The one that got me: a live repo-scoped GitHub token belonging to the founder of Smithery AI, the largest registry of Model Context Protocol servers. He’s a public member of the official modelcontextprotocol org, the 42 repositories with 178,000+ GitHub stars between them that hold the MCP servers and SDKs Claude, Cursor, and most AI coding tools are built on. Any repo that token can push to sits right at the root of the AI-tooling supply chain. Others trace back to a SberDevices engineer (a near-maximal token with repo, workflow, admin:org, write:packages, and delete_repo), a developer at a bank, and VinAI Research.
Sometimes the company leaked its own key
There are two ways a live key ends up in a dataset. Most of the time it belongs to a stranger. It leaked somewhere public, got scraped, and rode a corpus into a dataset whose publisher has never heard of them. The smaller and more uncomfortable group is first-party: the people who built and published the dataset leaked their own working key into it. Usually it’s the exact token they use to push to Hugging Face, sitting in a notebook, a cache file, or a backup they uploaded by accident.
The training data holds 787 live Hugging Face tokens, 237 with write access and 70 with org-admin. A live write token in a public dataset is a key to push malicious model weights or poison datasets across a whole org. Of the ones we could trace, about 700 had been scraped into someone else’s corpus and 63 were self-leaks, dropped by the owner into their own namespace. You can stage an attack on the AI supply chain with credentials you found inside the AI supply chain.
The self-leaks have names on them. ThirdAI’s head of product left an account-write Hub token in his own text-to-SQL benchmark dataset. A Mongolian developer committed a near-maximal GitHub token, admin over his orgs plus full repo write, CI, package publish, and repo delete, straight into his own Common Voice speech dataset. Fan Zhou, who holds write and admin across a cluster of pretraining-data orgs including finemath and GAIR, dropped his write token into one of their datasets, which is poisoning reach at the source. The data-labeling vendor Unidata left its corporate upload token in its own chest-x-ray set, and researchers at VinAI Research and MBZUAI each did the same.
The cloud keys break the other way. Every live AWS, GCP, Azure, Docker, and database credential we could attribute traced back to someone other than the publisher. Almost nobody pastes their own live cloud key into a public dataset on purpose, so those got there by being scraped. That’s the next story.
One leak, copied into hundreds of datasets
Most of these secrets leaked somewhere else first, in a GitHub repo or a web page or a chat log. From there they got vacuumed into an upstream corpus like The Stack or Common Crawl, and then rode every derivative of that corpus downstream. 44% of all unique live secrets show up in more than one dataset. 19,380 of them appear in ten or more. The Stack and its forks alone carry 51,571 distinct live keys, and AllenAI’s Dolma family carries 28,110. Of the keys that reach ten or more datasets, 99.3% pass through one of these scrape corpora. The amplification is mechanical. Scrape a corpus once, and every dedup, filter, and fine-tuning remix republishes the same live keys.
The diagram below traces it. The same verified credentials flow from upstream source families into the big pretraining corpora. Two near-identical stack-edu re-uploads share 19,977 of 19,977 keys. Stack code keys cross straight into AllenAI’s Dolma 3 web mixes.
Flow weight is the number of identical verified-live secrets (by credential hash) an upstream family shares with a dataset. Showing the four source families and the fifteen most secret-laden derivative datasets.
1,131 datasets contained the same single live key. One Infura key, pasted once into a ChatGPT conversation, got captured by the WildChat chat-log dataset and then copied into 1,131 public datasets and 10,162 file locations. Revoking it at the source does nothing about the other 1,130 copies. A leak in training data spreads on its own. Every copy is one more place it lives forever.
What an attacker walks away with
Live, verified credentials grouped by what they control. Each credential type sits in one bucket, counted once.
A live key to an India-scale identity database
The clearest human-risk case was not an email token or a cloud admin key. It was a live MongoDB credential tied to inVOID, an India-focused identity-verification company later acquired by Bureau. Public sources describe inVOID as a video-KYC provider used by banks, fintechs, NBFCs, crypto, gaming, education, dating, transport, rental, and shared-economy companies. AWS said inVOID was handling more than 15,000 KYC checks per day; YC described it as processing more than 5 million KYCs per month.
The leaked credential was sitting in a public Django source file, invoid/login/views.py, then got scraped into Stack-derived Python corpora and showed up in multiple Hugging Face datasets. The code was a login flow that queried face-verification records by a user’s auth key. The original GitHub file is still public.
We used the credential only for metadata checks, then asked the owner to confirm record shapes without sharing values. The database measured 393 GB across 143 databases. The largest collection, governmentCheck, had 16.1 million documents by metadata count. Owner-run summaries confirmed fields for document numbers, document types, PAN-related fields, names, dates of birth, user IDs, transaction IDs, face and liveness confidence, OCR, document processing, and verification status. We are not publishing raw records or personal values.
Scale: 393 GB, 143 databases, 16.1M documents in the largest government-check collection.
Data class: identity verification, government-document checks, liveness, OCR, DigiLocker-style flows, and face matching.
Leak path: public GitHub code to Stack-derived corpora to Hugging Face training datasets.
Other keys with real blast radius
We used credentials only for verification and metadata-only impact checks. For size, that meant database stats, Redis memory counters, and CloudWatch S3 bucket-size metrics. We did not read database rows, list object keys, download files, or modify anything. The India KYC case is the clearest human-impact example. The rest show how broad the credential problem gets once public datasets absorb live secrets.
Cloud takeover · GCP — 8,557 live Google service-account keys, across 3,811 projects
A service-account key is a non-interactive master key to a cloud project. 1,926 are Firebase admin keys with full database control, one carries the explicit Owner role, and one is a Kubernetes cluster-admin. Evidence: firebase-adminsdk@glob-medicals (patient data), firebase-adminsdk@icpayment-878f1 (payments), projectowner@fabled-orbit (role: Owner), kubernetes-admin@elliptical-feat (cluster-admin).
Cloud storage · AWS S3 — 51.7 TB in buckets with public access blocked
The full S3 StandardStorage lower bound was 185 TB, but raw byte count is not enough: S3 can hold public assets, logs, backups, or almost anything. So we checked only bucket-level metadata for the largest accounts. Bucket policy and public-access-block settings confirmed 51.7 TB in buckets configured to block public access. Bucket-name tokens pointed at prod, backup, cloudtrail, invoice, customer, billing, rds, mongo, and terraform. We did not list object keys or read object contents.
AWS keys passing STS identity checks — 3,343
Keys able to list S3 buckets — 907
Bucket count visible through metadata — 8,676
Buckets with all public-access-block flags enabled — 51.7 TB; largest measured account — 66.9 TB
Live databases — 8,594 verified-live database logins, 3.5 TB by metadata
Connection strings that still authenticate to hosted databases. The target names lean heavily toward tutorials and side projects (test, myfirstdatabase, todo apps), and the median MongoDB cluster was only 2.8 MB. But the tail is real: 89 MongoDB clusters and 5 Postgres databases exceeded 1 GB, and the largest MongoDB cluster exposed 617.7 GB by database-size metadata alone. 6,121 of 6,802 MongoDB credentials still connected; the tail included a SQL Server tied to a US defense contractor and Postgres sets tied to a Brazilian federal agency.
Impersonation · Comms — 5,885 live Slack tokens and Mailgun keys
231 Slack tokens, 99.6% naming the exact workspace, plus 5,654 Mailgun keys with 2,470 on a custom sending domain. A phishing kit that arrives pre-addressed as real brands: Slack workspace DXC.technology (Fortune 500), Mailgun on cisco-collaboratenow.com, and brand domains for sap, autodesk, and dkny.
A new leak path · Chatbots — 18× mirrors from one key pasted into a chatbot
A live AWS key for Credoro, a Brazilian lending fintech, reached the training data because someone pasted their boto3 code into a chatbot. The conversation was captured by LMSYS-Chat-1M and mirrored about 18 times.
11,496 live AI keys, and a $76,800-a-month floor
The training data is full of keys to the AI providers themselves: 11,496 live across 1,210 datasets, covering OpenAI, Azure OpenAI, Anthropic, Gemini, Groq, and more. We never used any of them, so I can’t tell you the real balances. I don’t need to. Every provider publishes a default spend cap, and a verified-live key sits on an account with at least that cap.
A new OpenAI account that has entered billing gets a $100/month usage limit by default. Anthropic’s entry tier is the same, $100 a month. We found 742 live OpenAI keys and 26 live Anthropic keys. Drain each one to just its default cap and that’s $76,800 a month, about $920,000 a year, of inference billed to people who have no idea their key is in a dataset. That’s the floor, the number you get if every account is stuck on the lowest tier.
And it climbs fast. OpenAI’s caps run $100, $500, $1,000, $5,000, then $50,000 a month at the top tier. Anthropic’s top build tier caps at $200,000 a month. We found 160 organization-owned OpenAI keys and 34 machine service-account keys, which are exactly the credentials that sit on funded, high-tier billing and almost never get rotated. A single top-tier key drained to its cap bills more in one month than this entire 7.6-petabyte scan cost to run.
All of that is before Gemini (1,429 keys), DeepSeek (667), xAI’s Grok (162, with no free tier at all), and 174 enterprise Azure OpenAI deployments, several of them baked into AllenAI’s Dolma 3. Every one of these keys is a live invoice pointed at its owner, and a free seat at a frontier model for whoever finds it first.
Training data is the most permanent leak there is
You can scrub a secret out of git history with a force-push. A training set has no undo. By the time a key lands in a published corpus, it’s been copied into derivative datasets, downloaded onto thousands of machines, and folded into model weights. There’s no recall button. These datasets are valuable because they’re permanent, versioned, and remixed constantly. That is exactly what makes a leaked key in one impossible to clean up.
The fix hasn’t changed: rotate the key. Once you revoke it, the copy in the dataset is just a harmless string. Until then it keeps working every time the data gets reused, and reuse is the whole point of training data.
221,303 keys is hard to disclose
We always try to help people revoke what we find. At this scale, emailing owners one by one isn’t realistic, and most of them have no idea what a training dataset even is, let alone that they’re in one. So we’re working the provider side: notifying the vendors whose customers are most affected so they can revoke in bulk, and routing verified findings through the partner channels we already have.
Hugging Face and the dataset authors didn’t cause this. They’re publishing snapshots of public code and the public web, which is what they’re supposed to do. The secrets leaked upstream, from people who’ll probably never read this. Anyone training on this data should still know it’s in there.
If you touch any part of this pipeline
Dataset authors & AI labs
Scan before you publish. Run a secret scanner over a corpus before it goes up. It’s the cheapest step in the pipeline, and it stops you from shipping live keys into every downstream remix.
Scan before you train. If you’re pulling a public dataset into a run, assume it has live credentials until you’ve checked. The big code and web corpora clearly do.
Developers & providers
Rotate, don’t hide. Any key that ever hit a public repo, a web page, or a chatbot should be treated as burned. Rotation is the only fix that survives being copied into a dataset.
Offer bulk revocation. Providers with revocation APIs and proactive scanning let researchers help at this scale. Without them, 221,303 keys is just a spreadsheet nobody can act on.
Takeaways
AI training data is full of live credentials. 221,303 unique, verified keys across 6,003 public datasets, including the named training corpora of models people actually use.
What a key unlocks matters more than its raw permission level. The damage here comes from credentials that push code into installed software, open hosted databases, take over cloud accounts, and send mail as real brands.
Leaks multiply. 44% of these keys live in more than one dataset, and one reached 1,131. Chat logs are now their own leak path, capturing keys people paste into chatbots.
Rotation is still the only fix. Scan corpora before publishing, scan before training, and revoke anything that ever touched a public surface.
Research and analysis by Dylan Ayrey for The Dig. Scanning powered by TruffleHog. Live credentials were used only for verification and metadata-only impact checks; no stored data was read, copied, or modified.


