tl;dr There are many ways developers unintentionally leak data when open-sourcing a new tool. Jump to our recommendations here.
If you’ve ever released a tool on GitHub, you’ve likely followed a process like this:
Create a private repository on GitHub.
Commit data to the repository.
“Sanitize” the files before making them public (i.e., remove secrets and other sensitive info).
Change the visibility of the repository from “Private” to “Public”.
Savvy git users might rebase the repository and squash the entire git history into one commit, believing that this prevents users from viewing that git history in the public version.
Unfortunately, in both cases, users are exposing much more information in the public version than they believe.
Take a look at the video below: we squash (technically rebase) an entire repo down to one commit, but are still able to access our old commits.
Why is my data still visible?
Rebasing + Dangling Commits
During the rebase process, git removed the references to your squashed commits but did not delete the underlying data. The result is a set of “dangling” commits: commits that are not reachable from any branch, tag, or other reference. However, if you know a commit’s SHA-1 hash, you can still access it with a command such as git cat-file -p <sha>.
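To make this concrete, here is a minimal local sketch rather than the exact commands from the video. It builds a throwaway repository, squashes its history into one commit, and then reads a file out of the old commit by hash. The file name, the fake secret, and the temporary directory are all assumptions made for illustration, and the squash uses git reset --soft plus git commit --amend as a non-interactive stand-in for an interactive rebase.

```shell
#!/bin/sh
# Sketch: a squashed-away commit is still readable if you know its hash.
set -e

# Throwaway identity so the demo does not depend on global git config.
export GIT_AUTHOR_NAME="Demo" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo" GIT_COMMITTER_EMAIL="demo@example.com"

repo=$(mktemp -d)
cd "$repo"
git init -q .

# Commit 1 contains a fake secret (hypothetical file and value).
echo "API_KEY=hunter2" > config.txt
git add config.txt
git commit -q -m "add config"
secret_sha=$(git rev-parse HEAD)

# Commit 2 "sanitizes" the file.
echo "API_KEY=REDACTED" > config.txt
git commit -q -am "remove secret"

# Squash the whole history into a single new root commit.
git reset -q --soft "$secret_sha"
git commit -q --amend -m "initial public release"
commit_count=$(git rev-list --count HEAD)

# The old commit is no longer on the branch, but its objects were not deleted.
leaked=$(git show "$secret_sha:config.txt")

echo "commits on branch: $commit_count"
echo "recovered from old commit: $leaked"
```

Locally, git keeps such objects until they are eventually pruned by garbage collection, so the branch history looks clean while the old content remains readable by hash.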
Since GitHub is built on git, it follows a similar process and continues to store these commits, even though users do not see them in the standard UI. Again, if you know the SHA-1 hash, you can still access the commit on GitHub.
Dangling commits are not only created via rebase operations, but also from force push events and pull requests.
Let’s say you commit data to a repository (and push it to GitHub) and then immediately recognize that you committed something you didn’t want. So you Google “how to delete the last commit” and run commands such as git reset --hard HEAD~1 followed by git push --force.
This resets the HEAD of your local git repo to the previous commit and then force pushes that state to GitHub. Now your sensitive commit no longer appears in the branch history on GitHub.
However, that commit still exists on GitHub (and your local git repo). It’s just dangling and no other git object references it. Read more here.
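The reset-and-force-push scenario above can be sketched locally. This is an illustration under stated assumptions, not a reproduction of GitHub's internals: a local bare repository stands in for GitHub, and the file names, fake password, and paths are hypothetical.

```shell
#!/bin/sh
# Sketch: "delete" the last commit with reset + force push, then show it survives.
set -e
export GIT_AUTHOR_NAME="Demo" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo" GIT_COMMITTER_EMAIL="demo@example.com"

work=$(mktemp -d)
# A local bare repository stands in for GitHub (assumption for this demo).
git init -q --bare "$work/remote.git"
git init -q "$work/local"
cd "$work/local"
git remote add origin "$work/remote.git"

echo "hello" > README.txt
git add README.txt
git commit -q -m "first commit"
git branch -M main
git push -q origin main

# Oops: commit and push a fake secret.
echo "PASSWORD=hunter2" > secrets.txt
git add secrets.txt
git commit -q -m "add secrets by mistake"
git push -q origin main
bad_sha=$(git rev-parse HEAD)

# The commonly suggested "fix": drop the last commit and force push.
git reset -q --hard HEAD~1
git push -q --force origin main

# The remote branch no longer points at the bad commit...
remote_tip=$(git -C "$work/remote.git" rev-parse main)
# ...but the commit object is still stored on both sides.
local_type=$(git cat-file -t "$bad_sha")
remote_type=$(git -C "$work/remote.git" cat-file -t "$bad_sha")

echo "remote tip:    $remote_tip"
echo "bad commit:    $bad_sha"
echo "local object:  $local_type"
echo "remote object: $remote_type"
```

The force push only moves the branch pointer. Nothing in the sequence asks either repository to delete the object, which is why it can still be looked up by hash.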
Pull Requests + Dangling Commits
When a user creates a Pull Request, GitHub will create a pseudo-branch named refs/pull/<something>. The purpose of this branch is to determine what would happen if the PR’s feature branch were merged into the base branch (typically main).
If a developer deletes the PR’s feature branch, GitHub maintains a copy of the pseudo-branch, including all commit data from the now-deleted branch. Read more here.
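The effect of such a ref can be simulated locally. In the sketch below, a bare repository stands in for GitHub and the pull ref is pushed by hand, which GitHub itself would do for you; the ref name refs/pull/1/head (including the number 1), the branch name feature, and the file names are assumptions for illustration.

```shell
#!/bin/sh
# Sketch: a ref outside refs/heads keeps commits alive after the branch is deleted.
set -e
export GIT_AUTHOR_NAME="Demo" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo" GIT_COMMITTER_EMAIL="demo@example.com"

work=$(mktemp -d)
git init -q --bare "$work/remote.git"   # stand-in for GitHub (assumption)
git init -q "$work/local"
cd "$work/local"
git remote add origin "$work/remote.git"

echo "hello" > README.txt
git add README.txt
git commit -q -m "first commit"
git branch -M main
git push -q origin main

# A feature branch with one commit, pushed as a normal branch.
git checkout -q -b feature
echo "draft" > notes.txt
git add notes.txt
git commit -q -m "feature work"
feature_sha=$(git rev-parse HEAD)
git push -q origin feature

# Simulate the extra ref created for a pull request (hypothetical PR number 1).
git push -q origin feature:refs/pull/1/head

# Delete the feature branch on the remote.
git push -q origin --delete feature

# Only main remains as a branch, yet the pull ref still points at the feature commit.
branches=$(git ls-remote --heads origin | cut -f2)
pull_ref=$(git ls-remote origin 'refs/pull/*/head' | cut -f1)

echo "remaining branches: $branches"
echo "pull ref target:    $pull_ref"
```

Against a real GitHub remote, the same git ls-remote call with the refs/pull/*/head pattern is one way to check which pull request refs a repository still exposes.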
As we’ve seen, there are multiple ways dangling commits are created in a GitHub repository. While documented, many users don’t know their dangling commit data is accessible. And unfortunately, when a private repository is made public, dangling commits carry over to the public version.
As a result, anyone can access all of the repository’s commit history data, including “dangling” commits from rebasing, force pushing, pull requests and more.
But that’s not all. There’s one more way your private repository data might inadvertently leak into the public version.
Cross Fork Object References
We wrote about this in July, but the basic idea is this: a Cross Fork Object Reference (CFOR) occurs when one fork can access sensitive data from another fork (including data from private and deleted forks).
Here’s the tie-in with open-sourcing: it’s not uncommon for organizations to create a private, internal version (fork) of a public tool, prior to making it public.
The way this works is a user will fork the repository (before it’s made public) and then commit additional features to the fork. Eventually, the private repository is made public and the organization continues development work on their private fork.
Any data committed to the private fork before the original “upstream” repository was made public is also public. It looks kind of like this.
So in addition to dangling commits, data committed to private forks before the tool was open-sourced is also available to the public.
What should I do?
To securely open-source a new project on GitHub, we recommend the following:
Create a new public repository on GitHub.
On your local machine, delete the existing .git folder in your project’s repository.
Initialize a new git repository in that same folder and then push it to the new public repository on GitHub.
Now your new public GitHub repository will not carry over any of the existing git history, nor provide public users with access to any data you did not intend to make public.
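The steps above might look roughly like the following. This is a sketch under stated assumptions: a local bare repository stands in for the new public GitHub repository, and the project contents, file names, and commit messages are hypothetical. With a real remote you would pass its URL to git remote add instead.

```shell
#!/bin/sh
# Sketch: publish a project with a brand-new history.
set -e
export GIT_AUTHOR_NAME="Demo" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo" GIT_COMMITTER_EMAIL="demo@example.com"

work=$(mktemp -d)
# Step 1: the new, empty public repository (a local bare repo stands in for it).
git init -q --bare "$work/public.git"

# An existing private project whose history contains a fake secret.
git init -q "$work/project"
cd "$work/project"
echo "TOKEN=hunter2" > app.conf
git add app.conf
git commit -q -m "private work"
old_sha=$(git rev-parse HEAD)
echo "TOKEN=REDACTED" > app.conf
git commit -q -am "sanitize"

# Step 2: delete the existing .git folder (this discards all local history).
rm -rf .git

# Step 3: initialize a fresh repository and push it to the new remote.
git init -q .
git add .
git commit -q -m "Initial public release"
git branch -M main
git remote add origin "$work/public.git"
git push -q -u origin main

# Check: the public repo has one commit and does not contain the old one.
public_commits=$(git -C "$work/public.git" rev-list --count main)
if git -C "$work/public.git" cat-file -e "$old_sha" 2>/dev/null; then
  old_present=yes
else
  old_present=no
fi

echo "public commits: $public_commits"
echo "old commit present in public repo: $old_present"
```

Note that deleting .git is irreversible for the local copy, so keep a separate clone of the private repository if you still need its history.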
If you can’t delete the existing git repository, an alternative is to rebase your existing local git project, and then push to a new remote repository on GitHub. Those steps would look like this:
Create a new public repository on GitHub.
Rebase your local git project.
Change the remote origin URL to your new GitHub repository.
Push the changes.
While our initial testing shows that this method does not transfer over any commits from your initial repository, we still suggest the first method of fully deleting and re-initializing a local git repository.
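The alternative steps above can be sketched as follows. The same caveats apply: local bare repositories stand in for the original private remote and the new public one, file names and messages are hypothetical, and the squash uses git reset --soft plus git commit --amend as a non-interactive stand-in for an interactive rebase.

```shell
#!/bin/sh
# Sketch: squash local history, repoint origin, and push to a new remote.
set -e
export GIT_AUTHOR_NAME="Demo" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo" GIT_COMMITTER_EMAIL="demo@example.com"

work=$(mktemp -d)
git init -q --bare "$work/old.git"      # stand-in for the original private remote
git init -q --bare "$work/public.git"   # step 1: stand-in for the new public repo
git init -q "$work/project"
cd "$work/project"
git remote add origin "$work/old.git"

# Private history containing a fake secret.
echo "TOKEN=hunter2" > app.conf
git add app.conf
git commit -q -m "private work"
git branch -M main
root_sha=$(git rev-parse HEAD)
echo "TOKEN=REDACTED" > app.conf
git commit -q -am "sanitize"
git push -q origin main

# Step 2: squash the history into a single commit.
git reset -q --soft "$root_sha"
git commit -q --amend -m "Initial public release"

# Step 3: change the remote origin URL to the new repository.
git remote set-url origin "$work/public.git"

# Step 4: push the changes.
git push -q -u origin main

# Check: only the squashed commit was transferred.
public_commits=$(git -C "$work/public.git" rev-list --count main)
if git -C "$work/public.git" cat-file -e "$root_sha" 2>/dev/null; then
  old_present=yes
else
  old_present=no
fi

echo "public commits: $public_commits"
echo "old commit present in public repo: $old_present"
```

In this local simulation the push transfers only objects reachable from the pushed branch, which matches the testing result described above. The old commits do remain in the local repository, which is one reason the first method is the more conservative choice.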
Concluding Thoughts
Taking a step back, I think we should take into account GitHub’s position on committing sensitive data to a repository (public or private).
GitHub clearly states that any sensitive data committed to GitHub should be considered compromised. And it doesn’t matter if the commit starts private and ends public.
So please, next time you open-source a private GitHub repository, create a new repository first and put your code in there.