TruffleHog uses the Aho-Corasick algorithm to efficiently scan data for sensitive information. At the core of our system are detectors that rely on identifying specific keywords in data chunks, making fast and efficient keyword matching critical to performance. It’s like a highly efficient reading machine that never backtracks unnecessarily.
This post is the first in a two-part series exploring how we improved TruffleHog’s performance:
Part 1 (this post): We’ll dive into CPU optimizations, focusing on how precomputing failure transitions can speed up the scanning process.
Part 2: We’ll talk about memory optimizations, showing how to trim and manage the Aho-Corasick data structures for even better resource utilization.
Let’s start by exploring how one might naively approach the problem of pattern matching, and then see how we can make it more efficient.
The Naive Approach: Using a Trie
A Trie (pronounced “try”) is a tree-like data structure that helps you quickly find strings based on their prefixes, where traversing a path from the root builds a sequence of letters. For example, consider a trie built using the keywords "HE", "HERS", "HIS", and "SHE". Starting at the root, if the first character in the data is 'H', we move from the root to the node labeled 'H'. From there, if the next character is 'E', we proceed to the 'HE' node, where a '*' indicates a complete keyword match for "HE". Continuing further, if the next character is 'R', we follow the path to the 'HER' node. Finally, if the next character is 'S', we reach the 'HERS' node, marking a complete match for the keyword "HERS". The position in the trie and the path followed through it determine when a keyword match occurs.
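The Trie described above can be sketched in a few lines of Go. This is a minimal illustration of the idea, not TruffleHog’s actual types; names like `node` and `isKeyword` are ours:

```go
package main

import "fmt"

// node is a single Trie node: children keyed by byte, plus a flag
// marking the end of a complete keyword (the '*' in the diagram).
type node struct {
	children map[byte]*node
	isWord   bool
}

func newNode() *node { return &node{children: map[byte]*node{}} }

// insert walks from the root, creating nodes as needed, and marks
// the final node as a complete keyword.
func insert(root *node, word string) {
	cur := root
	for i := 0; i < len(word); i++ {
		next, ok := cur.children[word[i]]
		if !ok {
			next = newNode()
			cur.children[word[i]] = next
		}
		cur = next
	}
	cur.isWord = true
}

// isKeyword reports whether s traces a path from the root to a node
// marked as a complete keyword.
func isKeyword(root *node, s string) bool {
	cur := root
	for i := 0; i < len(s); i++ {
		next, ok := cur.children[s[i]]
		if !ok {
			return false
		}
		cur = next
	}
	return cur.isWord
}

func main() {
	root := newNode()
	for _, kw := range []string{"HE", "HERS", "HIS", "SHE"} {
		insert(root, kw)
	}
	fmt.Println(isKeyword(root, "HE"))   // true
	fmt.Println(isKeyword(root, "HER"))  // false: a prefix, not a keyword
	fmt.Println(isKeyword(root, "HERS")) // true
}
```

Note that "HER" is reachable in the Trie but not marked, which is exactly the prefix-versus-keyword distinction the diagram captures.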

Key Features of a Trie:
Efficient Prefix Matching: Instantly checks if your input string is a prefix of any stored pattern
Common Uses: Often appears in autocomplete, spell checkers, dictionary lookups, and more

Why the Naive Approach Falls Short
While a Trie is great for handling prefixes, it faces significant challenges when used to search for multiple patterns in large text, particularly in cases like these:
Overlapping Patterns: If two patterns share common prefixes, a simple Trie might force you to re-check states repeatedly.
Redundant Backtracking: Without extra help, you might end up repeatedly returning to the root to start new searches.
For example, when searching for the patterns [“he”, “she”, “his”, “hers”] in the text “ushers”, a naive approach would:
Find “he” but miss the overlapping “she”
Or find “she” but have to restart to find “hers”
Either way, waste time repeatedly returning to the root
Enhancing Our Approach with Finite State Machines
To address these limitations, we can transform our Trie into a more sophisticated structure by adding extra transitions that tell us exactly where to go when a match fails. These additional transitions eliminate the need to return to the root, making our pattern matching more efficient.

The key insight is that when we fail to match at a certain point, instead of starting over, we can jump to a state that represents the longest valid suffix of what we’ve seen so far. This is where the Aho-Corasick algorithm shines.
How Aho-Corasick Makes It All Work
Aho-Corasick does its job in two main phases:
Build the pattern-matching machine (enhance the Trie with additional transitions and failure links)
Scan the text once, identifying all matches as you go, without constantly restarting
Building the Pattern Matching Machine
The machine is defined by three key functions:
(a) Goto Function (g)
This is essentially our Trie structure, defining how to move between states based on input characters.
Example:
For the keywords [“he”, “she”, “his”, “hers”]:
Insert all keywords into the Trie.
Each complete keyword corresponds to a path from the root to a leaf or a node marked as an endpoint.
For instance, the node at the end of “he” is an output state, indicating that the keyword “he” ends there.

(b) Failure Function (f)
This is where the magic happens. The failure function enables our FSM to handle mismatches efficiently. If we can’t continue on the current path, the failure function tells us where to “fall back” without restarting from the root.
We compute these failure links using a breadth-first search (BFS) over the Trie. Why BFS? BFS is a traversal technique that explores nodes level-by-level, ensuring we process all nodes at a given depth before moving to the next. This is important because we need to compute failure links starting from the shortest patterns and moving on to longer ones. By processing states in this order, we ensure that when we assign a failure link to a node representing a longer prefix, we’ve already correctly set the failure links for all of its shorter prefix “ancestors”.

What does the failure link represent?
The failure link points to another state that represents the longest valid suffix of the current input string that also exists somewhere in the Trie. Think of it as saying, “If we can’t continue here, jump to the longest matching portion we’ve seen before.”

For example, suppose we’ve matched “she” so far, but the next character doesn’t continue this pattern:
The failure link for “she” points to “he”, because “he” is a suffix of “she” and exists in our pattern set
If we fail at “he”, we might end up at the root if no shorter suffix matches
This chain of failure links ensures that we don’t waste time re-checking from scratch. We’re effectively shifting to a “next best” position in the Trie that could still lead to a match.
(c) Output Function (output)
This simply indicates which keywords end at each state.
Example:
In the state for “he”, the output is “he”.
In the state for “she”, the output is “she”.

Searching the Text
Once our enhanced machine is built, it processes input text in a single pass. At each character:
Goto: Move to the next state based on the current character
Failure: If there’s no direct path, follow failure links until you find a suitable state
Output: Emit any keywords associated with the current state
Example:
Let’s search for the keywords [“he”, “she”, “his”, “hers”] in the text "ushers" step by step. We’ll highlight where suffix-based matches appear as well, to avoid confusion.
Start at the root state:
Input: "u": No direct match for "u" at the root. Since we’re at the root, the failure link also points back to the root, so we stay at the root.
Input: "s": From the root, we have a transition on "s". Move to the state representing "s".
Input: "h": From "s", reading "h" moves us to the "sh" state.
Input: "e": From "sh", reading "e" transitions us to the "she" state. At this point, we’ve matched "she". Because "he" is a suffix of "she" and is also a keyword, the failure link from the "she" state leads to the "he" state, and we emit both "she" and "he".
Input: "r": From "she", there’s no direct transition on "r", so we follow the failure link back to "he". From "he", reading "r" takes us to "her". (This is possible because we know "he" transitions on "r" to "her" in our Trie.)
Input: "s": From "her", reading "s" leads to "hers". Since "hers" is a keyword, we emit ["hers"].
By following this process, we scan the entire input only once, efficiently identifying "she", "he", and "hers" in the string "ushers" without restarting the search for each keyword.

A Note on Performance from the Original Aho-Corasick Paper
One of the most impressive aspects of the Aho-Corasick algorithm, at least in my opinion, is how gracefully it scales with more patterns. According to the original 1975 paper, once you’ve built the automaton (our enhanced Trie with failure links and output functions), the number of state transitions made while scanning the text is independent of how many patterns you have!
What does that mean in practice?
If we use our example patterns [“he”, “she”, “his”, “hers”] and scan through the text "she sells seashells", the algorithm will pass over each character exactly once. Even though we’re looking for four different patterns instead of just one, the algorithm doesn’t slow down. It might discover more matches as it moves through the text, but it won’t take more time per character. The state transitions are designed to handle all the patterns simultaneously, no matter how many there are!
This approach ensures that adding more patterns only affects the build phase—where we set up the Trie and failure links—but not the actual search phase. Once the automaton is ready, it’s optimized to handle all patterns together efficiently. As a result, the runtime performance remains predictable, even as your list of patterns grows. I don’t know about you, but I find that pretty nifty!
Bringing It All Together
With Aho-Corasick:
You build a Trie of all keywords.
Compute failure transitions to handle mismatches smoothly.
Scan through the text, recognizing patterns without unnecessary restarts.
The result is a linear-time search relative to the length of your text and the patterns combined. It’s efficient, elegant, and widely used in applications like string searching, bioinformatics, and intrusion detection systems.
Original Paper Reference
The Aho-Corasick algorithm was introduced in the classic paper Efficient String Matching: An Aid to Bibliographic Search (Aho & Corasick, 1975). They demonstrated how to build a finite state machine for multiple pattern searching, achieving significant performance improvements over naive methods.
How TruffleHog Leverages Aho-Corasick
Alright, enough theory—let’s see how TruffleHog actually puts this algorithm to use. TruffleHog relies on detectors, specialized components designed to scan data for specific types of secrets, like AWS keys, API tokens, or database credentials. Each detector implements a common interface, enabling it to scan chunks of text, identify potential matches, and describe what it’s detecting.
Detectors work by using keywords—short, unique strings associated with the secrets they’re designed to find. For example, the AWS detector might look for keywords like AKIA, ABIA, or ACCA, which are common prefixes for AWS access keys. These keywords are the building blocks for our Trie, and if you’ve been following along, this is where the magic happens. By feeding all the keywords from TruffleHog’s ~800 detectors into the Aho-Corasick algorithm, we construct an efficient automaton that can scan large volumes of text and rapidly identify matches.
When a keyword is identified during scanning, it allows us to tie the match back to its corresponding detector using a pre-defined mapping of keywords to detectors. With this mapping, TruffleHog narrows the search space to a smaller chunk of data around the keyword and passes it to the detector. The detector then uses a precise regex to determine whether a valid secret is present in that chunk. In other words, once the automaton is built, TruffleHog can process data rapidly, identifying keywords and efficiently routing relevant chunks to the appropriate detectors for validation.
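The keyword-to-detector routing described above might look roughly like the following. This is a deliberately simplified, hypothetical sketch: the map, the window size, and the regex are all illustrative, not TruffleHog’s actual code (though AKIA is a real AWS access-key prefix):

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// detector pairs a name with the precise regex used for validation.
type detector struct {
	name    string
	pattern *regexp.Regexp
}

// Hypothetical keyword -> detector mapping; the real mapping covers
// hundreds of detectors.
var keywordToDetector = map[string]*detector{
	"AKIA": {name: "aws", pattern: regexp.MustCompile(`AKIA[0-9A-Z]{16}`)},
}

// scanChunk stands in for the Aho-Corasick pass: find a keyword, then
// narrow to a window around it and run the detector's regex.
func scanChunk(chunk string) []string {
	var findings []string
	for kw, det := range keywordToDetector {
		idx := strings.Index(chunk, kw)
		if idx < 0 {
			continue
		}
		// Only the small window near the keyword is handed to the
		// detector, shrinking the search space for the regex.
		end := idx + 64
		if end > len(chunk) {
			end = len(chunk)
		}
		if det.pattern.MatchString(chunk[idx:end]) {
			findings = append(findings, det.name)
		}
	}
	return findings
}

func main() {
	fmt.Println(scanChunk("key=AKIAIOSFODNN7EXAMPLE more text")) // [aws]
}
```

The key design point is the two-stage pipeline: cheap keyword matching narrows the data, and the comparatively expensive regex runs only on small windows around hits.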
CPU Optimization: Precomputing Failure Transitions
Originally, TruffleHog relied on the BobuSumisu/aho-corasick library to implement the Aho-Corasick algorithm. This library already does a fantastic job of building a performant pattern-matching automaton—it’s a solid piece of engineering on its own. However, we identified an opportunity for optimization in how it handled state transitions.
By moving certain computations from runtime to the build phase, we were able to significantly improve performance without changing the core logic of the algorithm. This optimization allows TruffleHog to process text more efficiently while maintaining all the pattern-matching capabilities of the original implementation.
Show me the numbers!

As the benchmarks show, precomputing failure transitions at build time gives us an 11-17% performance boost, depending on input size. Since keyword matching is core to TruffleHog's secret detection, these gains add up significantly when scanning millions of lines of code.
Quick Benefits Summary
Lower CPU Overhead: All transitions are resolved during the build phase, so runtime is just lookups
No Loss of Accuracy: We get the same results, just faster
Coming Up: Memory Optimizations (Part 2)
Stay tuned for Part 2, where the optimization journey continues. While the precomputed transitions gave us an 11-17% performance boost, our memory optimizations push that to 22-40% while reducing memory allocations by 99%. We'll show you how reorganizing the Aho-Corasick data structures made this possible.

🎯 If you’re mainly interested in understanding how TruffleHog’s scanning got faster, you can stop here! The rest of this post dives deep into implementation details for those interested in the technical nitty-gritty.
Implementation Deep Dive
Building the Foundation: States and the FSM
First, let’s look at how we represent each state in our finite state machine:
This structure directly maps to our FSM diagram from earlier. Each state knows its transitions (trans), what to do on failure (failLink), and whether it represents a complete pattern (dict).
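A pared-down sketch of such a state might look like this. The field names follow the post (trans, failLink, dict); the exact types in the real implementation may differ:

```go
package main

import "fmt"

// state is one node of the FSM.
type state struct {
	trans    map[byte]int // goto function: input byte -> next state index
	failLink int          // state to fall back to on a mismatch
	dict     bool         // true if a complete pattern ends here
}

func main() {
	s := state{trans: map[byte]int{}}
	fmt.Println(s.failLink, s.dict) // 0 false: fresh, not yet linked
}
```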
The Builder Pattern
To construct our FSM, we use a builder pattern. Here’s its structure:
Think of this as setting up our workbench. We’re creating a space to build our pattern-matching machine, starting with a root state.
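A sketch of that workbench, with illustrative names (and a pared-down state type repeated so the snippet stands alone):

```go
package main

import "fmt"

type state struct {
	trans    map[byte]int
	failLink int
	dict     bool
}

// trieBuilder accumulates states as patterns are added; index 0 is
// reserved for the root.
type trieBuilder struct {
	states []*state
}

func newTrieBuilder() *trieBuilder {
	root := &state{trans: map[byte]int{}}
	return &trieBuilder{states: []*state{root}}
}

func main() {
	b := newTrieBuilder()
	fmt.Println(len(b.states)) // 1: just the root, ready for patterns
}
```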
Creating New States
When we need a new state in our FSM, here’s how we create it:
Remember those nodes in our Trie diagram? This is how we create each one. Each state starts fresh with no transitions or failure links - we’ll add those as we build the machine.
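State creation can be sketched as a small method on the builder (names are ours, not the library’s):

```go
package main

import "fmt"

type state struct {
	trans    map[byte]int
	failLink int
	dict     bool
}

type trieBuilder struct {
	states []*state
}

// newState appends a fresh state with no transitions or failure link
// and returns its index; the links are filled in by later build steps.
func (b *trieBuilder) newState() int {
	b.states = append(b.states, &state{trans: map[byte]int{}})
	return len(b.states) - 1
}

func main() {
	b := &trieBuilder{states: []*state{{trans: map[byte]int{}}}} // root at 0
	idx := b.newState()
	fmt.Println(idx, len(b.states[idx].trans)) // 1 0
}
```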
Adding Patterns
Here’s where we build the actual Trie structure for our patterns:
This is how we built that Trie structure we saw in the diagrams. For a pattern like “he”, we:
Start at the root
Create/follow a transition for ‘h’
Create/follow a transition for ‘e’
Mark the final state as a pattern match
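Those four steps can be sketched as follows (again with illustrative names; the root lives at index 0 here):

```go
package main

import "fmt"

type state struct {
	trans map[byte]int
	dict  bool
}

type trieBuilder struct {
	states []*state
}

func (b *trieBuilder) newState() int {
	b.states = append(b.states, &state{trans: map[byte]int{}})
	return len(b.states) - 1
}

// addPattern starts at the root, creates or follows a transition per
// byte, and marks the final state as a complete pattern.
func (b *trieBuilder) addPattern(pattern string) {
	cur := 0
	for i := 0; i < len(pattern); i++ {
		next, ok := b.states[cur].trans[pattern[i]]
		if !ok {
			next = b.newState()
			b.states[cur].trans[pattern[i]] = next
		}
		cur = next
	}
	b.states[cur].dict = true
}

func main() {
	b := &trieBuilder{states: []*state{{trans: map[byte]int{}}}}
	b.addPattern("he")
	b.addPattern("hers")
	// "hers" reuses the "he" prefix: root, h, he, her, hers = 5 states
	fmt.Println(len(b.states)) // 5
}
```

Note how adding "hers" after "he" creates only two new states, because the shared "he" prefix is followed rather than duplicated.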
Building the Pattern Matching Machine
Now let’s look at how we transform our Trie into an optimized Aho-Corasick automaton:
The Build function begins our optimization process. Instead of using linked states, we’ll use dense arrays that allow O(1) lookups.
Here’s how we convert our states into this optimized format:
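The conversion idea can be sketched like this: with 256 possible byte values, state s’s transition on byte c lives at `trans[s*256+int(c)]`, and 0 serves as “no transition”. This is an illustration of the dense-array layout, not the library’s exact code:

```go
package main

import "fmt"

// flatten turns per-state transition maps into one dense array that
// supports O(1) lookups without pointer chasing.
func flatten(mapTrans []map[byte]int) []int64 {
	trans := make([]int64, len(mapTrans)*256)
	for s, m := range mapTrans {
		for c, next := range m {
			trans[s*256+int(c)] = int64(next)
		}
	}
	return trans
}

func main() {
	// Tiny Trie: root(0) -h-> 1 -e-> 2
	mapTrans := []map[byte]int{{'h': 1}, {'e': 2}, {}}
	trans := flatten(mapTrans)
	fmt.Println(trans[0*256+'h'], trans[1*256+'e'], trans[0*256+'x']) // 1 2 0
}
```

The trade-off is 256 slots per state regardless of how many transitions the state actually has; Part 2 revisits exactly this kind of memory cost.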
Computing Failure Links
Now for the heart of Aho-Corasick - the failure function. Remember our earlier discussion about handling mismatches? Here's how we implement it:
The base cases are straightforward - the root and its immediate children fail back to the root. This matches our earlier diagram where we saw all level-1 nodes pointing back to the root.
The interesting part comes next:
Remember our example with "ushers" matching "she" and "hers"? This is where that magic happens. We follow the parent's failure links until we find a state that has a transition matching our current character.
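Both the base cases and the parent-chasing step can be sketched in one BFS (map-based for brevity, with index 0 as the root; names are ours):

```go
package main

import "fmt"

type state struct {
	trans    map[byte]int
	failLink int
}

// computeFailLinks assigns failure links in BFS order, so every
// shorter prefix is resolved before any longer one.
func computeFailLinks(states []*state) {
	queue := []int{}
	// Base case: the root's immediate children fail back to the root.
	for _, child := range states[0].trans {
		states[child].failLink = 0
		queue = append(queue, child)
	}
	for len(queue) > 0 {
		cur := queue[0]
		queue = queue[1:]
		for c, child := range states[cur].trans {
			// Follow the parent's failure links until some state has a
			// transition on the same byte (a missing map key yields 0,
			// the root, which is the correct final fallback).
			f := states[cur].failLink
			for f != 0 && states[f].trans[c] == 0 {
				f = states[f].failLink
			}
			states[child].failLink = states[f].trans[c]
			queue = append(queue, child)
		}
	}
}

func main() {
	// Trie for ["he", "she"]: 0=root, 1="h", 2="he", 3="s", 4="sh", 5="she"
	states := []*state{
		{trans: map[byte]int{'h': 1, 's': 3}},
		{trans: map[byte]int{'e': 2}},
		{trans: map[byte]int{}},
		{trans: map[byte]int{'h': 4}},
		{trans: map[byte]int{'e': 5}},
		{trans: map[byte]int{}},
	}
	computeFailLinks(states)
	fmt.Println(states[4].failLink, states[5].failLink) // 1 2
}
```

The output confirms the earlier example: the "sh" state falls back to "h", and the "she" state falls back to "he".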
Handling Overlapping Patterns
Finally, here’s how we handle cases where one pattern might be contained within another:
This handles cases like our earlier example where finding “she” also means we’ve found “he”. By following failure links and checking for dictionary matches, we ensure we don’t miss any overlapping patterns.
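One common way to implement this is a dictionary link per state: the nearest state along the failure chain that ends a keyword. A sketch of that idea (assuming failure links are already computed, and that BFS ordering guarantees each state’s failure link has a lower index):

```go
package main

import "fmt"

type state struct {
	failLink int
	dict     bool
	dictLink int // nearest dict state on the failure chain; 0 = none
}

// computeDictLinks resolves each state's dictionary link in one pass,
// reusing the already-computed link of the failure target.
func computeDictLinks(states []*state) {
	for i := 1; i < len(states); i++ {
		f := states[i].failLink
		if states[f].dict {
			states[i].dictLink = f
		} else {
			states[i].dictLink = states[f].dictLink
		}
	}
}

func main() {
	// States for ["he", "she"]: 0=root, 1="h", 2="he"(*), 3="s",
	// 4="sh", 5="she"(*)
	states := []*state{
		{}, {failLink: 0}, {failLink: 0, dict: true},
		{failLink: 0}, {failLink: 1}, {failLink: 2, dict: true},
	}
	computeDictLinks(states)
	fmt.Println(states[5].dictLink) // 2: matching "she" also yields "he"
}
```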
Runtime Pattern Matching
Once we’ve built our optimized Trie structure, here’s how we actually use it to find patterns:
This is the original implementation that computed failure transitions at runtime. Notice how when a direct transition fails (t == nilState), we have to traverse failure links to find the next valid state. This traversal happens for every character that doesn’t have a direct transition, adding overhead to our pattern matching.
The Optimization: Precomputed Failure Transitions
Here’s how we optimize the build phase to precompute all possible failure transitions:
This changes our runtime Walk function from a complex failure link traversal to a simple array lookup:
You can find the complete code here.
The Key Differences
Build Time vs Runtime
Before: Computed failure transitions on-demand during pattern matching
After: Precomputes all possible transitions during Trie construction
State Transitions
Before: Required traversing failure links to find valid transitions
After: Single array lookup gives us the next state immediately
Memory vs CPU Trade-off
Before: Lower memory usage but higher CPU cost per character
After: Slightly higher memory usage but constant-time operation per character
By moving the failure transition computation to the build phase, we’ve eliminated the need to traverse failure links at runtime. Each character in the input now requires exactly one array lookup to determine the next state, regardless of how many patterns we’re matching.
Wrapping Up
In this first part, we explored how the Aho-Corasick algorithm works, introduced the concepts of Tries and FSMs, and showed how precomputing failure transitions can cut down on CPU costs. By applying these techniques, TruffleHog not only detects secrets more efficiently but does so while handling large volumes of input with ease.
Check back soon for Part 2 of this series, where we’ll tackle memory usage optimizations. Until then, happy scanning!