Joe Leon

The Dig

August 2, 2024

TruffleHog now finds all Deleted & Private Commits on GitHub

Last week we defined a new term, Cross Fork Object Reference (CFOR), which occurs when one repository fork can access sensitive commit data from another fork (including data from private and deleted forks). This week we’re announcing that TruffleHog can discover all of these commits and scan them for secrets.

We’re open-sourcing an alpha version of a new TruffleHog module that enumerates Cross Fork Object References (and deleted git history), and then scans them for secrets. The process is slow, and beholden to GitHub’s rate limits, because it involves enumerating thousands of potential commit hashes. Here’s how to use it:


trufflehog github-experimental --repo https://github.com/<USER>/<REPO>.git --object-discovery

Importantly, the --object-discovery flag requires users to set a valid GitHub Access Token in their environment variables (export GITHUB_TOKEN=ghp_) or to pass it in on the command line (--token ghp_). 

New Output!

If TruffleHog discovers any secrets, it will print those out just like before. But, there are a few new additions to the output. 

First, since this is a long-running process, the output includes a progress bar indicating how much longer it will take to enumerate all of the valid commit SHA hashes. 


Second, running TruffleHog with the --object-discovery flag creates two files in a new $HOME/.trufflehog directory: valid_hidden.txt and invalid.txt.

These files serve two purposes:

  1. They keep track of the valid and invalid commit hashes during scanning. If the scan is restarted for any reason, TruffleHog will read in those hashes and pick up where it left off.

  2. Users interested in reviewing CFOR commits for more than just secrets can easily do so by reviewing the valid_hidden.txt file. 


If you’d like TruffleHog to automatically remove those files upon exit, pass in the flag --delete-cached-data.
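The resume behavior described above can be pictured with a small Python sketch. This is not TruffleHog’s actual code: the one-hash-per-line file format and the helper names load_hashes and remaining_candidates are assumptions made for illustration. Only the directory and file names come from this post.

```python
from pathlib import Path

# Directory and file names are from the post; the file format is an assumption.
CACHE_DIR = Path.home() / ".trufflehog"


def load_hashes(path: Path) -> set[str]:
    """Read one hash per line (assumed format); a missing file yields an empty set."""
    if not path.exists():
        return set()
    return {line.strip().lower() for line in path.read_text().splitlines() if line.strip()}


def remaining_candidates(candidates: list[str], cache_dir: Path = CACHE_DIR) -> list[str]:
    """Drop candidate Short SHA-1 prefixes that a previous run already resolved."""
    if not candidates:
        return []
    checked = load_hashes(cache_dir / "valid_hidden.txt") | load_hashes(cache_dir / "invalid.txt")
    # All candidates are assumed to share one length, so compare on that prefix length.
    n = len(candidates[0])
    done = {h[:n] for h in checked}
    return [c for c in candidates if c not in done]
```

Under those assumptions, a restarted scan would only query the prefixes returned by remaining_candidates instead of starting over.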

How it Works

GitHub’s GraphQL service allows users to query for Commit objects by Short SHA-1 commit hashes. As an example, to check the existence of a commit hash starting with AAAA on the TruffleHog repository, you send this query:


query {
  repository(owner: "trufflesecurity", name: "trufflehog") {
    commitAAAA: object(expression: "AAAA") {
      ... on Commit {
        oid
      }
    }
  }
}

If the server finds a valid commit, it will return the full SHA-1 commit hash (oid). 

Admittedly, this isn’t perfect. If there is a collision, meaning two (or more) commit objects share the same first characters in the expression, the server returns null for that expression instead of a commit.
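Because each lookup is an aliased field, many Short SHA-1 expressions can share a single query. The Python sketch below shows one way to build such a query and read the result. The function names are hypothetical, and the response shape is assumed to follow the standard GraphQL JSON layout (a "data" object keyed by alias); it is not taken from TruffleHog’s source.

```python
def build_batch_query(owner: str, name: str, short_shas: list[str]) -> str:
    """Build one GraphQL query with an aliased object() lookup per Short SHA-1."""
    fields = "\n".join(
        f'    commit{sha}: object(expression: "{sha}") {{ ... on Commit {{ oid }} }}'
        for sha in short_shas
    )
    return (
        "query {\n"
        f'  repository(owner: "{owner}", name: "{name}") {{\n'
        f"{fields}\n"
        "  }\n"
        "}"
    )


def extract_oids(response: dict) -> dict[str, str]:
    """Map each Short SHA-1 to its full hash, skipping misses and collisions."""
    repo = response["data"]["repository"]
    return {
        alias[len("commit"):]: obj["oid"]
        for alias, obj in repo.items()
        # None means no match (or an ambiguous prefix); {} means a non-commit object.
        if obj and "oid" in obj
    }
```

The alias prefix (commit) keeps every field name a valid GraphQL identifier even when the Short SHA-1 starts with a digit.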

To ensure we’re able to enumerate all valid commit hashes, we take the following approach:

  1. Estimate the total number of Git objects in the repository network. Git objects are tags, blobs, commits, and trees, and each one is referred to by a unique SHA-1 hash. We clone all accessible objects using a typical git clone command and then count the total number of used hashes. We then add to this count based on the number of forks in the network and an assumed average number of commits per fork.

  2. Estimate how many collisions will occur at Short SHA-1 lengths of 4, 5, and 6 characters. The minimum number of characters in a Short SHA-1 is 4. That yields 65,536 possibilities (16^4). At 5 characters, we have just over 1 million possibilities (16^5). At 6 characters, that leads to nearly 17 million possibilities (16^6). We then use the Birthday Paradox to estimate the total number of collisions at different Short SHA-1 lengths. Why not extend beyond 6? You could, but each extra character multiplies the keyspace by 16, so enumeration would take forever.

  3. Select the Short SHA-1 length that meets the collision threshold. Our collision threshold defaults to 1, meaning that in the worst case, you’ll miss out on 2 commits (due to the 1 collision). However, users can override this collision threshold with the --collision-threshold <int> flag. TruffleHog will select the shortest Short SHA-1 length that stays within the collision threshold.

  4. Build the keyspace for the selected Short SHA-1 hash length, remove the keys that we know have been used, then query GraphQL. We query the GraphQL service in batches of a few hundred Short SHA-1 hash expressions to determine whether the commit exists.
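Steps 2 through 4 can be sketched in Python as follows. This is a rough illustration, not TruffleHog’s implementation: it assumes the common birthday approximation n(n-1)/(2N) for the expected number of colliding pairs, falls back to the longest length when no length meets the threshold, and uses hypothetical function names throughout.

```python
from itertools import product
from typing import Iterator

HEX_DIGITS = "0123456789abcdef"


def expected_collisions(num_objects: int, length: int) -> float:
    """Birthday approximation: expected colliding pairs among num_objects hashes."""
    space = 16 ** length
    return num_objects * (num_objects - 1) / (2 * space)


def select_length(num_objects: int, threshold: float = 1.0,
                  lengths: tuple[int, ...] = (4, 5, 6)) -> int:
    """Pick the shortest Short SHA-1 length whose estimate stays within the threshold."""
    for length in lengths:
        if expected_collisions(num_objects, length) <= threshold:
            return length
    # Assumption: if no length qualifies, fall back to the longest one considered.
    return lengths[-1]


def build_keyspace(length: int, used_hashes: set[str]) -> Iterator[str]:
    """Yield every hex prefix of the given length that is not already in use."""
    used_prefixes = {h[:length].lower() for h in used_hashes}
    for chars in product(HEX_DIGITS, repeat=length):
        key = "".join(chars)
        if key not in used_prefixes:
            yield key
```

The keyspace is a generator on purpose: at 6 characters it holds nearly 17 million entries, so it is cheaper to stream prefixes into batches than to hold them all in memory.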

This process works. We’ve found extremely sensitive keys for very large enterprises using this approach. But there’s a big caveat: it’s slow, and the bottleneck isn’t TruffleHog. Enumerating all of the Cross Fork Object References is a lengthy process that is bound by GitHub’s rate limits. For a smaller repository, this process can take 20 minutes. For a larger repository, it could take hours.

We know for a fact that there are ways to enumerate this data faster. But none of those methods respect GitHub’s servers and could easily cause a major spike in traffic when deployed by thousands of community members.

What’s the future of this feature?

At the moment, it’s hard to say. Our goal with releasing this is to provide a stop-gap measure for organizations that are realizing they have a hidden attack surface and no way to see it. 

If GitHub takes steps to restrict access to this data, then this feature can be retired. If GitHub develops an API endpoint for repository owners to query for their own Cross Fork Object References, then we can alter the logic to use that endpoint. And if GitHub takes no action, then we’ll iterate on the design based on community contributions.