We just pushed a change to TruffleHog that makes each detector faster, the more detectors we add. You’re probably wondering how this is possible?
There’s no doubt Trufflehog V3 is faster than previous versions. The speedup seen in v2 to v3 could be attributed to the introduction of keyword preflighting on data to be scanned prior to pattern matching. Now there has been even more performance improvements with the release of Trufflehog v3.28.6. This release introduced a keyword optimization using the Aho-Corasick algorithm.
In total, this led to a 2x speedup on average in the overall scanner.
Before going into the optimization, let’s go over how keyword preflights are used in Trufflehog. First, the data to be scanned is divided up into chunks. Those chunks are then iterated over by a number of “detectors”. These detectors have one or more keywords associated with them which will determine if the chunk of data has potential for containing a secret. I.e, each detector needs to iterate through every keyword associated with that detector and search the chunk of data for said keyword(s) until it hits a match.
Check out how this was implemented here, or some pseudo code below:
The redundant part of this algorithm is “searching the chunk of data for said keywords(s)”. Given k keywords, a data chunk of n characters, and the total number of characters in all keywords, m, the search is then conducted in O(n*k + m) time, or polynomial time (check out this blog that explains the time complexity a bit more).
If only there was some way to reduce the time complexity, and do multiple checks at the same time!
Enter Aho-Corasick.
What this algorithm allows Trufflehog to accomplish is a keyword preflight search that can be accomplished in linear time. Again, given a data chunk of n characters, total number of characters in all keywords, m, and z being the total count of keyword matches in the chunk of data we can accomplish a keyword search in O(n+m+z), or linear time. The implementation of this algorithm in Trufflehog can be found here, or pseudo code below:
Note that we are only looking at “data” once vs. looking at “data” for every keyword like we were before.
Show us the numbers!
We ran a benchmark against 3 different repos varying in size; croc (small), rails (medium), and gitlab (large). In order to get an accurate comparison between the old and new (Aho-Corasick optimized), we ran `trufflehog` in `–filesystem` mode and turned off verifications since verifying a secret requires network requests which add uncertainty. Note `trufflehog` is loaded with the Aho-Corasick optimization.
The first benchmark with `–concurrency=1`:
Timing Concurrency=1
Repo ./trufflehog-no-ahoc (s). /trufflehog Speedup
croc 3.2 2.6 91.19x
rails 43.44 14.51 2.99x
gitlab 629.66 175.85 3.58x
Second benchmark w/ `–concurrency=10`:
Timing Concurrency=10
Repo . /trufflehog-no-ahoc (s) ./trufflehog Speedup
croc 2.51 2.28 1.10x
rails 13.17 8.5 11.55x
gitlab 96.28 51.7 1.86x
Summary
In addition to the 2x speedup, the coolest part about all of this is, for every new detector we add, the time spent on each detector per detector is reduced.
We’ll be continuing to push updates that improve the performance and scalability of TruffleHog, so stay tuned for what’s to come. We’d also like to give a big shout out to Petar Dambovaliev, the maintainer of the aho-corasick library we used.