Truffle Security

Millions of India KYC Records Exposed by a Leaked Database Key

A live MongoDB credential tied to inVOID/Bureau opened an internet-reachable identity-verification cluster: 143 databases, 393 GiB by metadata, and a governmentCheck collection with 16.1 million documents. Record checks confirmed KYC field types, including government-document, identity-profile, OCR, transaction, and face/liveness fields.

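Bureau, which announced in 2023 that it had acquired inVOID, has said its platform verified more than 300 million identities. That is roughly 3.7% of the global human population. This is an exposure, not a confirmed breach. We did not confirm that this database contains all 300 million identities, but the figure gives a useful sense of scale for the potential impact.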

What Was Exposed

The exposed record shapes included KYC and identity-verification fields from collections such as masterRecord, governmentCheck, digilocker, liveness, photocopy, mask, cardLiveness, webocr, and facedetect.

Some document-ID fields in the sampled records were classified as not-masked. The classification is value-free: it means only that a document-ID-like string field did not contain masking characters such as *, x, or #. Field paths classified this way included documentNumber.value, panNumber.value, and voterIdNumber.value. The sampled records therefore included raw document-ID fields, not only redacted document IDs. The values themselves are withheld.
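A value-free masking check of this kind can be sketched in a few lines. This is a minimal illustration, not the tooling used for this research: the function name, the label strings, and the choice to treat an uppercase X as a masking character are all assumptions.

```python
# Masking characters named in the text. Matching uppercase "X" too is an
# assumption made for this sketch.
MASK_CHARS = frozenset("*x#")


def classify_masking(value: str) -> str:
    """Return a label for the field; the value itself is never returned or logged."""
    lowered = value.lower()
    if any(ch in MASK_CHARS for ch in lowered):
        return "masked"
    return "not-masked"
```

A check like this is deliberately conservative: a genuine identifier that happens to contain the letter x would be labeled masked, so a not-masked result is the stronger signal.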

Metadata Footprint

  • 393.2 GiB total MongoDB footprint by metadata.

  • 143 databases visible to the credential.

  • 313.4 GiB in the largest observed database, rapidoinvoid.

  • 16.1 million documents in the largest observed collection, governmentCheck, by collection metadata.

  • Collection names included governmentCheck, masterRecord, liveness, digilocker, webocr, photocopy, mask, and facedetect.

  • Multiple collections exposed transactionId indexes.

Who inVOID Was

Public sources describe inVOID as a New Delhi identity-verification and KYC startup founded in 2018. An AWS case study said inVOID conducted more than 15,000 KYCs per day across sectors in India, including finance, NBFCs, cryptocurrency, gaming, education, online dating, and shared-economy providers. The Y Combinator profile said inVOID processed more than 5 million KYCs per month and worked with more than 35 companies in India. Bureau announced in 2023 that it had acquired inVOID. Bureau said its platform had verified more than 300 million identities, roughly 3.7% of the global human population.

How the Key Spread

The source was a public GitHub repository: abinash-singh/test_repo. The repository was created and pushed on May 8, 2019. It had four commits within minutes, no stars, no forks, and a generic name. The commit author shown by GitHub was inVOID. There is no README, so the public context comes from the repository tree, commit metadata, source files, and templates.

The public repo was a small Django project snapshot under abinash-singh/test_repo.

The repo was not a single pasted connection string. It contained manage.py, Django settings, a local SQLite database, migrations, login templates, static assets, and compiled Python cache files. The application defined a custom inVoidUsers model with an authkey field.

In invoid/login/views.py, the app imported MongoClient from PyMongo and created a MongoDB client when the module loaded. The same file selected a video_kyc_demo database and a face_verification collection. Its login handler checked a username and password, read the logged-in user's authkey, and queried face-verification records for that key. The details template displayed columns named transaction_id, utc_timestamp, primary_id_file, auth_key, name, and phone_number.

Redacted file view: the Django view hardcoded a MongoDB client and queried face_verification by authkey.
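The usual alternative to a hardcoded connection string is to read it from the environment at startup. The sketch below shows that pattern using only the standard library. The helper name and the MONGODB_URI variable name are hypothetical, and nothing here is taken from the repository.

```python
import os


def mongo_uri_from_env(var_name: str = "MONGODB_URI") -> str:
    """Read the connection string from the environment instead of source code.

    The variable name is a hypothetical default chosen for this sketch.
    """
    uri = os.environ.get(var_name)
    if not uri:
        # Fail loudly rather than fall back to a credential baked into the code.
        raise RuntimeError(f"{var_name} is not set; refusing to start without it")
    return uri
```

In a Django view, the result would be passed to MongoClient inside a function rather than at module import, so the source file never contains the username or password. Keeping a secret out of source does not replace rotating it or restricting network access to the database.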

Public code datasets scrape public repositories. This repository was absorbed into Stack-derived Python corpora and then appeared in multiple Hugging Face datasets, including Stack V2 derivatives and tokyotech-llm/swallow-code-v2. We saw the same credential in at least five findings across four datasets.
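Because the credential travels with the file, the same pattern match that finds it on GitHub also finds it in a dataset row. The sketch below is a simplified illustration of that idea, not the scanner used for these findings: the regular expression, function names, and redaction format are assumptions, and a production scanner would also verify whether a match is live.

```python
import re

# mongodb:// or mongodb+srv:// followed by user:password@host.
# A deliberately simple pattern for this sketch.
MONGO_URI_WITH_CREDS = re.compile(
    r"mongodb(?:\+srv)?://[^\s:/@'\"]+:[^\s/@'\"]+@[^\s/'\"]+"
)


def redact(uri: str) -> str:
    """Drop the username and password, keeping only scheme and host."""
    scheme, rest = uri.split("://", 1)
    host = rest.rsplit("@", 1)[1]
    return f"{scheme}://<redacted>@{host}"


def find_mongo_credentials(source: str) -> list[tuple[int, str]]:
    """Return (line_number, redacted_uri) for each credential-bearing URI."""
    findings = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for match in MONGO_URI_WITH_CREDS.finditer(line):
            findings.append((lineno, redact(match.group(0))))
    return findings
```

Running the function over each file body in a dataset shard would yield one finding per occurrence, which is how a single leaked string can surface as several findings across several datasets.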

Hugging Face dataset context: common-pile/stackv2, a Stack V2 derivative.

Hugging Face dataset context: tokyotech-llm/swallow-code-v2.

At the time of verification, the credential still worked. The MongoDB host accepted network connections from our research environment and the leaked username/password authenticated.

What the Evidence Does Not Explain

The public evidence does not explain why this MongoDB cluster accepted connections from the public internet, why the password remained valid after the repository was created, or how the credential ended up in abinash-singh/test_repo. The repository name, short commit history, lack of README, and bundled Django project files show what was exposed, but not why it was published.

The Django code also leaves context open. It looks like a login and demo-style view into video_kyc_demo and face_verification, but the repository does not say whether it was internal tooling, a customer demo, a test app, or an accidental project snapshot. The database contents show the data was non-public and sensitive; the repo alone does not explain the operational purpose.

Exposure Conditions

The public repository contained an application file with a hardcoded database credential. The credential remained valid years after the repository was created. The database accepted network connections from the public internet and authenticated the leaked username/password. After the repository was scraped into public code datasets, the same credential appeared outside the original GitHub repository.

The repository was created on May 8, 2019. The verification checks described here occurred after the credential had already propagated into multiple public datasets.

Methodology

The first checks were metadata-only: authentication, database names, database sizes, collection names, collection statistics, and index fields. Those checks did not dump collections, list document values, or export records.
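For operators auditing a cluster they own, a metadata-only survey along these lines might look like the sketch below. It is an illustration under stated assumptions, not the procedure used in this research. The function name and report layout are hypothetical. The client argument is expected to behave like a PyMongo MongoClient, and the calls used (list_database_names, list_collection_names, index_information, and the dbStats and collStats commands) never run a query against documents.

```python
from typing import Any


def metadata_footprint(client: Any) -> dict[str, dict[str, Any]]:
    """Collect sizes, collection names, counts, and index fields only."""
    report: dict[str, dict[str, Any]] = {}
    for db_name in client.list_database_names():
        db = client[db_name]
        db_stats = db.command("dbStats")
        collections = {}
        for coll_name in db.list_collection_names():
            # collStats reports counts and sizes from collection metadata.
            coll_stats = db.command("collStats", coll_name)
            # index_information() maps index names to dicts whose "key"
            # entry is a list of (field, direction) pairs.
            index_fields = sorted({
                field
                for index in db[coll_name].index_information().values()
                for field, _direction in index["key"]
            })
            collections[coll_name] = {
                "count": coll_stats.get("count"),
                "index_fields": index_fields,
            }
        report[db_name] = {
            "data_size_bytes": db_stats.get("dataSize"),
            "collections": collections,
        }
    return report
```

Index field names alone can be revealing. An index on transactionId, for example, indicates how records are looked up without exposing any record.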

Metadata established the database footprint and KYC collection names. It did not establish which fields were populated inside documents. To confirm impact without broad database access, individual record samples were used to classify field paths and record shapes. The public write-up reports field classes and masking classifications, not names, dates of birth, document numbers, phone numbers, image URLs, raw payloads, or full records.

Several sampled masterRecord shapes combined identity profile fields, government-document fields, biometric/face fields, transaction identifiers, and verification-status fields in the same record shape. Sampled webocr shapes combined OCR/document fields with contact/location and government-document fields. Sampled governmentCheck shapes combined government-document, identity-profile, transaction, and verification-status fields.
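Reducing a record to its shape means keeping the field paths and discarding every value. The sketch below shows one way to do that for a nested document. The function name is hypothetical, and the treatment of lists (items contribute paths under the list's own name) is an assumption made for this illustration.

```python
from typing import Any


def field_paths(document: Any, prefix: str = "") -> set[str]:
    """Return dotted field paths of a nested document, discarding all values."""
    paths: set[str] = set()
    if isinstance(document, dict):
        for key, value in document.items():
            path = f"{prefix}.{key}" if prefix else key
            if isinstance(value, (dict, list)):
                # Empty containers still count as a present field.
                paths |= field_paths(value, path) or {path}
            else:
                paths.add(path)
    elif isinstance(document, list):
        for item in document:
            paths |= field_paths(item, prefix)
    elif prefix:
        # Scalar inside a list: record the list's path itself.
        paths.add(prefix)
    return paths
```

Two records with the same set of paths have the same shape, so a report can count shapes and name field classes without ever carrying a name, number, or image URL.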

Limits

This post does not include raw credentials, hostnames, database passwords, document values, screenshots of records, or raw sampled documents. It does not claim that every document is a unique person, that every field was populated, or that every stored value was real. The confirmed claim is that a live credential exposed metadata for a large inVOID-linked KYC cluster, and targeted sampling confirmed identity-verification record shapes.
