Ahrav Dutta

The Dig

January 23, 2025

Under the Hood: The Algorithmic Power Behind TruffleHog’s Secret Scanning (Part 1 of 2)


TruffleHog uses the Aho-Corasick algorithm to efficiently scan data for sensitive information. At the core of our system are detectors that rely on identifying specific keywords in data chunks, making fast and efficient keyword matching critical to performance. It’s like a highly efficient reading machine that never backtracks unnecessarily.

This post is the first in a two-part series exploring how we improved TruffleHog’s performance:

  • Part 1 (this post): We’ll dive into CPU optimizations, focusing on how precomputing failure transitions can speed up the scanning process.

  • Part 2: We’ll talk about memory optimizations, showing how to trim and manage the Aho-Corasick data structures for even better resource utilization.

Let’s start by exploring how one might naively approach the problem of pattern matching, and then see how we can make it more efficient.

The Naive Approach: Using a Trie

A Trie (pronounced “try”) is a tree-like data structure that helps you quickly find strings based on their prefixes, where traversing a path from the root builds a sequence of letters. For example, consider a trie built using the keywords "HE", "HERS", "HIS", and "SHE", and suppose we read input one character at a time:

  • If the first character in the data is 'H', we move from the root to the node labeled 'H'.

  • From there, if the next character is 'E', we proceed to the 'HE' node, where a '*' indicates a complete keyword match for "HE".

  • Continuing further, if the next character is 'R', we follow the path to the 'HER' node.

  • Finally, if the next character is 'S', we reach the 'HERS' node, marking a complete match for the keyword "HERS".

The position in the trie and the path followed through it determine when a keyword match occurs.



Key Features of a Trie:

  • Efficient Prefix Matching: Instantly checks if your input string is a prefix of any stored pattern

  • Common Uses: Often appears in autocomplete, spell checkers, dictionary lookups, and more
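To make this concrete, here is a minimal, self-contained trie sketch in plain Go (illustrative only, not TruffleHog's actual code), with insert and exact-keyword lookup:

```go
package main

import "fmt"

// node is one trie state; children maps the next character to the
// next state, and terminal marks the '*' nodes from the diagram.
type node struct {
	children map[byte]*node
	terminal bool
}

func newNode() *node { return &node{children: make(map[byte]*node)} }

// insert walks from the root, creating states as needed.
func (n *node) insert(word string) {
	cur := n
	for i := 0; i < len(word); i++ {
		next, ok := cur.children[word[i]]
		if !ok {
			next = newNode()
			cur.children[word[i]] = next
		}
		cur = next
	}
	cur.terminal = true
}

// contains reports whether word was inserted as a complete keyword.
func (n *node) contains(word string) bool {
	cur := n
	for i := 0; i < len(word); i++ {
		next, ok := cur.children[word[i]]
		if !ok {
			return false
		}
		cur = next
	}
	return cur.terminal
}

func main() {
	root := newNode()
	for _, kw := range []string{"HE", "HERS", "HIS", "SHE"} {
		root.insert(kw)
	}
	fmt.Println(root.contains("HERS"), root.contains("HER")) // true false
}
```

Note that "HER" reports false: it exists in the trie only as a prefix on the way to "HERS", not as a marked keyword.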



Why the Naive Approach Falls Short

While a Trie is great for handling prefixes, it faces significant challenges when used to search for multiple patterns in large text, particularly in cases like these:

  • Overlapping Patterns: If two patterns share common prefixes, a simple Trie might force you to re-check states repeatedly.

  • Redundant Backtracking: Without extra help, you might end up repeatedly returning to the root to start new searches.

For example, when searching for the patterns [“he”, “she”, “his”, “hers”] in the text “ushers”, a naive approach would:

  1. Find “he” but miss “she”

  2. Or find “she” but need to restart to find “hers”

  3. Waste time returning to the root repeatedly
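The cost of those restarts is easy to sketch. This toy naive search (illustrative only, not TruffleHog code) rescans the text once per pattern, restarting from every offset:

```go
package main

import (
	"fmt"
	"strings"
)

// naiveSearch scans the text once per pattern, restarting at every
// offset -- exactly the redundant work Aho-Corasick eliminates.
func naiveSearch(text string, patterns []string) map[string][]int {
	matches := make(map[string][]int)
	for _, p := range patterns {
		for i := 0; i+len(p) <= len(text); i++ { // restart at every position
			if strings.HasPrefix(text[i:], p) {
				matches[p] = append(matches[p], i)
			}
		}
	}
	return matches
}

func main() {
	m := naiveSearch("ushers", []string{"he", "she", "his", "hers"})
	fmt.Println(m["she"], m["he"], m["hers"]) // [1] [2] [2]
}
```

This does find every match, but the work is proportional to the text length times the number of patterns, since each pattern triggers its own pass.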

Enhancing Our Approach with Finite State Machines

To address these limitations, we can transform our Trie into a more sophisticated structure by adding extra transitions that tell us exactly where to go when a match fails. These additional transitions eliminate the need to return to the root, making our pattern matching more efficient.



The key insight is that when we fail to match at a certain point, instead of starting over, we can jump to a state that represents the longest valid suffix of what we’ve seen so far. This is where the Aho-Corasick algorithm shines.

How Aho-Corasick Makes It All Work

Aho-Corasick does its job in two main phases:

  1. Build the pattern-matching machine (enhance the Trie with additional transitions and failure links)

  2. Scan the text once, identifying all matches as you go, without constantly restarting

Building the Pattern Matching Machine

The machine is defined by three key functions:

(a) Goto Function (g)

This is essentially our Trie structure, defining how to move between states based on input characters. For the keywords [“he”, “she”, “his”, “hers”], each complete keyword corresponds to a path from root to a marked node.

Example:
For the keywords [“he”, “she”, “his”, “hers”]:

  • Insert all keywords into the Trie.

  • Each complete keyword corresponds to a path from the root to a leaf or a node marked as an endpoint.

  • For instance, the node at the end of “he” is an output state, indicating that the keyword “he” ends there.



(b) Failure Function (f)

This is where the magic happens. The failure function enables our FSM to handle mismatches efficiently. If we can’t continue on the current path, the failure function tells us where to “fall back” without restarting from the root.

We compute these failure links using a breadth-first search (BFS) over the Trie. Why BFS? BFS is a traversal technique that explores nodes level-by-level, ensuring we process all nodes at a given depth before moving to the next. This is important because we need to compute failure links starting from the shortest patterns and moving on to longer ones. By processing states in this order, we ensure that when we assign a failure link to a node representing a longer prefix, we’ve already correctly set the failure links for all of its shorter prefix “ancestors”.



What does the failure link represent?


The failure link points to another state that represents the longest valid suffix of the current input string that also exists somewhere in the Trie. Think of it as saying, “If we can’t continue here, jump to the longest matching portion we’ve seen before.”



For example, suppose we’ve matched “she” so far, but the next character doesn’t continue this pattern:

  • The failure link for “she” points to “he”, because “he” is a suffix of “she” and exists in our pattern set

  • If we fail at “he”, we might end up at the root if no shorter suffix matches

This chain of failure links ensures that we don’t waste time re-checking from scratch. We’re effectively shifting to a “next best” position in the Trie that could still lead to a match.
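The definition above can be brute-forced directly. This toy helper (not part of TruffleHog) computes where a failure link should point: the longest proper suffix of what we've matched that is also a prefix of some keyword, with the empty string meaning "back to the root":

```go
package main

import (
	"fmt"
	"strings"
)

// failTarget brute-forces the failure-link definition: the longest
// proper suffix of `matched` that is also a prefix of some keyword
// (i.e., a node that exists in the Trie). "" means the root.
func failTarget(matched string, keywords []string) string {
	for i := 1; i <= len(matched); i++ { // longest suffixes first
		suffix := matched[i:]
		for _, kw := range keywords {
			if strings.HasPrefix(kw, suffix) {
				return suffix
			}
		}
	}
	return "" // no suffix exists in the Trie: fall back to the root
}

func main() {
	kws := []string{"he", "she", "his", "hers"}
	fmt.Println(failTarget("she", kws))  // he
	fmt.Println(failTarget("hers", kws)) // s
}
```

The real algorithm never recomputes this per character, of course; it stores each answer once as a link, and the BFS ordering guarantees the shorter answers exist before the longer ones need them.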

(c) Output Function (output)

This simply indicates which keywords end at each state.

Example:

  • In the state for “he”, the output is “he”.

  • In the state for “she”, the output is “she”.



Searching the Text

Once our enhanced machine is built, it processes input text in a single pass. At each character:

  1. Goto: Move to the next state based on the current character

  2. Failure: If there’s no direct path, follow failure links until you find a suitable state

  3. Output: Emit any keywords associated with the current state

Example:
Let’s search for the keywords [“he”, “she”, “his”, “hers”] in the text "ushers" step by step. We’ll highlight where suffix-based matches appear as well, to avoid confusion.

Start at the root state:

  • Input: "u":
    No direct match for "u" at the root. Since we’re at the root, the failure link also points back to root, so we stay at the root.

  • Input: "s":
    From the root, we have a transition on "s". Move to the state representing "s".

  • Input: "h":
    From "s", reading "h" moves us to the "sh" state.

  • Input: "e":
    From "sh", reading "e" transitions us to the "she" state.
    At this point, we’ve matched "she". Because "he" is a suffix of "she" and is also a keyword, the failure link from the "she" state leads to the "he" state, and we emit both "she" and "he".

  • Input: "r":
    From "she", there’s no direct transition on "r", so we follow the failure link back to "he".
    From "he", reading "r" takes us to "her". (This is possible because we know "he" transitions on "r" to "her" in our Trie.)

  • Input: "s":
    From "her", reading "s" leads to "hers". Since "hers" is a keyword, we emit "hers".

By following this process, we scan the entire input only once, efficiently identifying "she", "he", and "hers" in the string "ushers" without restarting the search for each keyword.



A Note on Performance from the Original Aho-Corasick Paper

One of the most impressive aspects of the Aho-Corasick algorithm, at least in my opinion, is how gracefully it scales with more patterns. According to the original 1975 paper, once you’ve built the automaton (our enhanced Trie with failure links and output functions), the number of state transitions made while scanning the text is independent of how many patterns you have!

What does that mean in practice?
If we use our example patterns [“he”, “she”, “his”, “hers”] and scan through the text "she sells seashells", the algorithm will pass over each character exactly once. Even though we’re looking for four different patterns instead of just one, the algorithm doesn’t slow down. It might discover more matches as it moves through the text, but it won’t take more time per character. The state transitions are designed to handle all the patterns simultaneously, no matter how many there are!

This approach ensures that adding more patterns only affects the build phase—where we set up the Trie and failure links—but not the actual search phase. Once the automaton is ready, it’s optimized to handle all patterns together efficiently. As a result, the runtime performance remains predictable, even as your list of patterns grows. I don’t know about you, but I find that pretty nifty!

Bringing It All Together

With Aho-Corasick:

  • You build a Trie of all keywords.

  • Compute failure transitions to handle mismatches smoothly.

  • Scan through the text, recognizing patterns without unnecessary restarts.

The result is a linear-time search relative to the length of your text and the patterns combined. It’s efficient, elegant, and widely used in applications like string searching, bioinformatics, and intrusion detection systems.

Original Paper Reference

The Aho-Corasick algorithm was introduced in the classic paper Efficient String Matching: An Aid to Bibliographic Search (Aho & Corasick, 1975). They demonstrated how to build a finite state machine for multiple pattern searching, achieving significant performance improvements over naive methods.

How TruffleHog Leverages Aho-Corasick

Alright, enough theory—let’s see how TruffleHog actually puts this algorithm to use. TruffleHog relies on detectors, specialized components designed to scan data for specific types of secrets, like AWS keys, API tokens, or database credentials. Each detector implements a common interface, enabling it to scan chunks of text, identify potential matches, and describe what it’s detecting.

Detectors work by using keywords—short, unique strings associated with the secrets they’re designed to find. For example, the AWS detector might look for keywords like AKIA, ABIA, or ACCA, which are common prefixes for AWS access keys. These keywords are the building blocks for our Trie, and if you’ve been following along, this is where the magic happens. By feeding all the keywords from TruffleHog’s ~800 detectors into the Aho-Corasick algorithm, we construct an efficient automaton that can scan large volumes of text and rapidly identify matches.

When a keyword is identified during scanning, it allows us to tie the match back to its corresponding detector using a pre-defined mapping of keywords to detectors. With this mapping, TruffleHog narrows the search space to a smaller chunk of data around the keyword and passes it to the detector. The detector then uses a precise regex to determine whether a valid secret is present in that chunk. In other words, once the automaton is built, TruffleHog can process data rapidly, identifying keywords and efficiently routing relevant chunks to the appropriate detectors for validation.
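That routing step can be sketched in a few lines. The names below (detector, keywordToDetector, route) and the regex are illustrative stand-ins, not TruffleHog's real API; they just show the shape of the keyword-to-detector flow:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// detector pairs an illustrative name with the precise regex used to
// confirm a secret once a keyword has narrowed the search space.
type detector struct {
	name    string
	pattern *regexp.Regexp
}

// keywordToDetector is a toy version of the pre-defined mapping from
// keyword to detector described above.
var keywordToDetector = map[string]detector{
	"AKIA": {"AWS", regexp.MustCompile(`\bAKIA[0-9A-Z]{16}\b`)},
}

// route finds a keyword hit, carves a small window around it, and hands
// that chunk to the matching detector's regex for validation.
func route(chunk string) []string {
	var found []string
	for kw, d := range keywordToDetector {
		idx := strings.Index(chunk, kw)
		if idx < 0 {
			continue
		}
		// Narrow to a window around the keyword before running the
		// (comparatively expensive) regex.
		end := idx + 64
		if end > len(chunk) {
			end = len(chunk)
		}
		if d.pattern.MatchString(chunk[idx:end]) {
			found = append(found, d.name)
		}
	}
	return found
}

func main() {
	fmt.Println(route(`aws_key = "AKIAABCDEFGHIJKLMNOP"`)) // [AWS]
}
```

In the real system, the keyword hit comes from the Aho-Corasick automaton rather than strings.Index, so all ~800 detectors' keywords are checked in a single pass.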


CPU Optimization: Precomputing Failure Transitions

Originally, TruffleHog relied on the BobuSumisu/aho-corasick library to implement the Aho-Corasick algorithm. This library already does a fantastic job of building a performant pattern-matching automaton—it’s a solid piece of engineering on its own. However, we identified an opportunity for optimization in how it handled state transitions.

By moving certain computations from runtime to the build phase, we were able to significantly improve performance without changing the core logic of the algorithm. This optimization allows TruffleHog to process text more efficiently while maintaining all the pattern-matching capabilities of the original implementation.

Show me the numbers!



As the benchmarks show, precomputing failure transitions at build time gives us an 11-17% performance boost, depending on input size. Since keyword matching is core to TruffleHog's secret detection, these gains add up significantly when scanning millions of lines of code.

Quick Benefits Summary

  • Lower CPU Overhead: All transitions are resolved during the build phase, so runtime is just lookups

  • No Loss of Accuracy: We get the same results, just faster

Coming Up: Memory Optimizations (Part 2)

Stay tuned for Part 2, where our optimization journey continues. While the precomputed transitions gave us an 11-17% performance boost, our memory optimizations push that to 22-40% while reducing memory allocations by 99%. We'll show you how reorganizing the Aho-Corasick data structures made this possible.



🎯 If you’re mainly interested in understanding how TruffleHog’s scanning got faster, you can stop here! The rest of this post dives deep into implementation details for those interested in the technical nitty-gritty.

Implementation Deep Dive

Building the Foundation: States and the FSM

First, let’s look at how we represent each state in our finite state machine:


type state struct {
    id       int64              // Unique identifier for the state
    value    byte              // Character value for this state
    parent   *state            // Parent state in the Trie
    trans    map[byte]*state   // Goto function: maps characters to next states
    dict     int64             // Length of pattern if this state is a match
    failLink *state            // Failure function: where to go on mismatch
    dictLink *state            // Output function: links to other matching patterns
    pattern  int64             // Pattern identifier for matching
}


This structure directly maps to our FSM diagram from earlier. Each state knows its transitions (trans), what to do on failure (failLink), and whether it represents a complete pattern (dict).

The Builder Pattern

To construct our FSM, we use a builder pattern. Here’s its structure:


type TrieBuilder struct {
    states      []*state       // All states in our FSM
    root        *state         // Root state where all patterns begin
    numPatterns int64          // Count of patterns added
}
func NewTrieBuilder() *TrieBuilder {
    tb := &TrieBuilder{
        states:      make([]*state, 0),
        root:        nil,
        numPatterns: 0,
    }
    // State 0 acts as a sentinel "nil" state; state 1 is the root.
    tb.addState(0, nil) // id 0: nil state
    tb.addState(0, nil) // id 1: root state
    tb.root = tb.states[1]
    return tb
}


Think of this as setting up our workbench. We’re creating a space to build our pattern-matching machine, starting with a root state.

Creating New States

When we need a new state in our FSM, here’s how we create it:


func (tb *TrieBuilder) addState(value byte, parent *state) *state {
    s := &state{
        id:       int64(len(tb.states)),
        value:    value,
        parent:   parent,
        trans:    make(map[byte]*state),
        dict:     0,
        failLink: nil,
        dictLink: nil,
        pattern:  0,
    }
    tb.states = append(tb.states, s)
    return s
}


Remember those nodes in our Trie diagram? This is how we create each one. Each state starts fresh with no transitions or failure links - we’ll add those as we build the machine.

Adding Patterns

Here’s where we build the actual Trie structure for our patterns:


func (tb *TrieBuilder) AddPattern(pattern []byte) *TrieBuilder {
    s := tb.root
    var t *state
    var ok bool
    // Follow/create path in Trie for each character
    for _, c := range pattern {
        if t, ok = s.trans[c]; !ok {
            // Create new state if transition doesn't exist
            t = tb.addState(c, s)
            s.trans[c] = t
        }
        s = t
    }
    // Mark pattern endpoint with its length and ID
    s.dict = int64(len(pattern))
    s.pattern = tb.numPatterns
    tb.numPatterns++
    return tb
}


This is how we built that Trie structure we saw in the diagrams. For a pattern like “he”, we:

  1. Start at the root

  2. Create/follow a transition for ‘h’

  3. Create/follow a transition for ‘e’

  4. Mark the final state as a pattern match

Building the Pattern Matching Machine

Now let’s look at how we transform our Trie into an optimized Aho-Corasick automaton:


func (tb *TrieBuilder) Build() *Trie {
    // Compute failure and dictionary links
    tb.computeFailLinks(tb.root)
    tb.computeDictLinks(tb.root)
    // Prepare arrays for our optimized representation
    numStates := len(tb.states)
    dict := make([]int64, numStates)          // Output function
    trans := make([][256]int64, numStates)    // Goto function
    failLink := make([]int64, numStates)      // Failure function
    dictLink := make([]int64, numStates)      // Dictionary links
    pattern := make([]int64, numStates)       // Pattern IDs


The Build function begins our optimization process. Instead of using linked states, we’ll use dense arrays that allow O(1) lookups.

Here’s how we convert our states into this optimized format:


    // Convert states to arrays for better performance
    for i, s := range tb.states {
        dict[i] = s.dict
        pattern[i] = s.pattern
        // Store transitions in a dense array for O(1) lookup
        for c, t := range s.trans {
            trans[i][c] = t.id
        }
        if s.failLink != nil {
            failLink[i] = s.failLink.id
        }
        if s.dictLink != nil {
            dictLink[i] = s.dictLink.id
        }
    }
    return &Trie{dict, trans, failLink, dictLink, pattern}
}


Computing Failure Links

Now for the heart of Aho-Corasick - the failure function. Remember our earlier discussion about handling mismatches? Here's how we implement it:


func (tb *TrieBuilder) computeFailLinks(s *state) {
    if s.failLink != nil {
        return  // Already computed
    }
    // Base cases: root and its children fail back to root
    if s == tb.root || s.parent == tb.root {
        s.failLink = tb.root


The base cases are straightforward - the root and its immediate children fail back to the root. This matches our earlier diagram where we saw all level-1 nodes pointing back to the root.

The interesting part comes next:


    } else {
        var ok bool
        // Follow parent's failure links until we find a matching transition
        for t := s.parent.failLink; t != tb.root; t = t.failLink {
            if t.failLink == nil {
                tb.computeFailLinks(t)
            }
            // Try to find a transition on the same character
            if s.failLink, ok = t.trans[s.value]; ok {
                break
            }
        }
        // If no ancestor offered a fallback, try the root's transition
        // on this character; otherwise fall back to the root itself.
        if s.failLink == nil {
            if s.failLink, ok = tb.root.trans[s.value]; !ok {
                s.failLink = tb.root
            }
        }
    }
    // Recurse so every state in the Trie gets a failure link
    for _, t := range s.trans {
        tb.computeFailLinks(t)
    }
}


Remember our example with "ushers" matching "she" and "hers"? This is where that magic happens. We follow the parent's failure links until we find a state that has a transition matching our current character.

Handling Overlapping Patterns

Finally, here’s how we handle cases where one pattern might be contained within another:


func (tb *TrieBuilder) computeDictLinks(s *state) {
    if s != tb.root {
        // Follow failure links to find matching patterns
        for t := s.failLink; t != tb.root; t = t.failLink {
            if t.dict != 0 {  // Found a matching pattern
                s.dictLink = t
                break
            }
        }
    }
    // Recursively compute dictionary links
    for _, t := range s.trans {
        tb.computeDictLinks(t)
    }
}


This handles cases like our earlier example where finding “she” also means we’ve found “he”. By following failure links and checking for dictionary matches, we ensure we don’t miss any overlapping patterns.

Runtime Pattern Matching

Once we’ve built our optimized Trie structure, here’s how we actually use it to find patterns:


// Trie represents a trie of patterns with extra links as per the Aho-Corasick algorithm.
type Trie struct {
    dict     []int64     // Marks states that are pattern endpoints
    trans    [][256]int64 // Dense array of transitions
    failLink []int64     // Where to go on pattern mismatch
    dictLink []int64     // Links to other matching patterns
    pattern  []int64     // Pattern identifiers
}
// WalkFn is called for each pattern match
// end: position where match ends, n: length of match, pattern: pattern identifier
type WalkFn func(end, n, pattern int64) bool
// Walk processes input text and finds all matching patterns.
// rootState (1) and nilState (0) are package-level sentinel state indices.
func (tr *Trie) Walk(input []byte, fn WalkFn) {
    s := rootState
    for i, c := range input {
        // Try direct transition first
        t := tr.trans[s][c]
        if t == nilState {
            // On failure, follow failure links until we find a valid transition
            for u := tr.failLink[s]; u != rootState; u = tr.failLink[u] {
                if t = tr.trans[u][c]; t != nilState {
                    break
                }
            }
            // If still no valid transition, try from root or reset to root
            if t == nilState {
                if t = tr.trans[rootState][c]; t == nilState {
                    t = rootState
                }
            }
        }
        s = t
        // Check for pattern matches at current state
        if tr.dict[s] != 0 {
            if !fn(int64(i), tr.dict[s], tr.pattern[s]) {
                return
            }
        }
        // Check for additional matches through dictionary links
        if tr.dictLink[s] != nilState {
            for u := tr.dictLink[s]; u != nilState; u = tr.dictLink[u] {
                if !fn(int64(i), tr.dict[u], tr.pattern[u]) {
                    return
                }
            }
        }
    }
}


This is the original implementation that computed failure transitions at runtime. Notice how when a direct transition fails (t == nilState), we have to traverse failure links to find the next valid state. This traversal happens for every character that doesn’t have a direct transition, adding overhead to our pattern matching.

The Optimization: Precomputed Failure Transitions

Here’s how we optimize the build phase to precompute all possible failure transitions:


func (tb *TrieBuilder) Build() *Trie {
    ...
    trie := &Trie{
        failTrans: make([][256]uint32, numStates),  // Precomputed transitions
        ...
    }
    ...
    for i, s := range tb.states {
        ...
        // Precompute all possible failure transitions
        for b := range 256 {
            c := byte(b)
            trie.failTrans[i][c] = tb.computeFailTransition(s, c)
        }
    }
    return trie
}
// computeFailTransition finds the proper fallback state once, during the build step
func (tb *TrieBuilder) computeFailTransition(s *state, c byte) uint32 {
    for t := s; t != nil; t = t.failLink {
        if next, exists := t.trans[c]; exists {
            return uint32(next.id) // cast: the optimized Trie stores uint32 ids
        }
    }
    return uint32(tb.root.id)
}


This changes our runtime Walk function from a complex failure link traversal to a simple array lookup:


func (tr *Trie) Walk(input []byte, fn WalkFn) {
    s := rootState
    inputLen := len(input)
    for i := 0; i < inputLen; i++ {
        // Direct lookup of next state using precomputed transitions
        s = tr.failTrans[s][input[i]]

        // Handle matches
        if tr.dict[s] != 0 || tr.dictLink[s] != nilState {
            if tr.dict[s] != 0 && !fn(uint32(i), tr.dict[s], tr.pattern[s]) {
                return
            }
            for u := tr.dictLink[s]; u != nilState; u = tr.dictLink[u] {
                if !fn(uint32(i), tr.dict[u], tr.pattern[u]) {
                    return
                }
            }
        }
    }
}


You can find the complete code here.

The Key Differences

  1. Build Time vs Runtime

    • Before: Computed failure transitions on-demand during pattern matching

    • After: Precomputes all possible transitions during Trie construction

  2. State Transitions

    • Before: Required traversing failure links to find valid transitions

    • After: Single array lookup gives us the next state immediately

  3. Memory vs CPU Trade-off

    • Before: Lower memory usage but higher CPU cost per character

    • After: Slightly higher memory usage but constant-time operation per character

By moving the failure transition computation to the build phase, we’ve eliminated the need to traverse failure links at runtime. Each character in the input now requires exactly one array lookup to determine the next state, regardless of how many patterns we’re matching.
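To tie the whole pipeline together, here is a compact, self-contained miniature in plain Go. It is a sketch independent of TruffleHog's actual implementation (for instance, it folds overlapping matches into each state at build time instead of using dictionary links), but it walks the same three stages: build a trie, compute failure links with BFS, precompute every failure transition, then scan with exactly one table lookup per byte:

```go
package main

import "fmt"

// acState is one automaton state: trie edges, a failure link, and the
// keywords that end (directly or via a suffix) at this state.
type acState struct {
	next    map[byte]int
	fail    int
	outputs []string
}

type automaton struct {
	states    []*acState
	failTrans [][256]int // precomputed: failTrans[state][byte] = next state
}

func build(keywords []string) *automaton {
	a := &automaton{states: []*acState{{next: map[byte]int{}}}} // state 0 = root
	// Phase 1: plain trie (the goto function).
	for _, kw := range keywords {
		s := 0
		for i := 0; i < len(kw); i++ {
			t, ok := a.states[s].next[kw[i]]
			if !ok {
				a.states = append(a.states, &acState{next: map[byte]int{}})
				t = len(a.states) - 1
				a.states[s].next[kw[i]] = t
			}
			s = t
		}
		a.states[s].outputs = append(a.states[s].outputs, kw)
	}
	// Phase 2: failure links via BFS, shortest prefixes first.
	var order []int
	queue := []int{}
	for _, t := range a.states[0].next {
		queue = append(queue, t) // depth-1 states fail to the root (0)
	}
	for len(queue) > 0 {
		s := queue[0]
		queue = queue[1:]
		order = append(order, s)
		for c, t := range a.states[s].next {
			queue = append(queue, t)
			f := a.states[s].fail
			for f != 0 {
				if _, ok := a.states[f].next[c]; ok {
					break
				}
				f = a.states[f].fail
			}
			if n, ok := a.states[f].next[c]; ok {
				a.states[t].fail = n
			}
			// Fold suffix matches in (instead of separate dict links).
			a.states[t].outputs = append(a.states[t].outputs,
				a.states[a.states[t].fail].outputs...)
		}
	}
	// Phase 3: precompute every failure transition at build time.
	a.failTrans = make([][256]int, len(a.states))
	for c := 0; c < 256; c++ {
		if t, ok := a.states[0].next[byte(c)]; ok {
			a.failTrans[0][c] = t
		}
	}
	for _, s := range order { // BFS order: fail(s) is always ready first
		for c := 0; c < 256; c++ {
			if t, ok := a.states[s].next[byte(c)]; ok {
				a.failTrans[s][c] = t
			} else {
				a.failTrans[s][c] = a.failTrans[a.states[s].fail][c]
			}
		}
	}
	return a
}

// scan is the payoff: exactly one table lookup per input byte.
func (a *automaton) scan(text string) []string {
	var found []string
	s := 0
	for i := 0; i < len(text); i++ {
		s = a.failTrans[s][text[i]]
		found = append(found, a.states[s].outputs...)
	}
	return found
}

func main() {
	a := build([]string{"he", "she", "his", "hers"})
	fmt.Println(a.scan("ushers")) // [she he hers]
}
```

Notice that scan contains no failure-link loop at all; every fallback decision was resolved once in phase 3, which is the essence of the optimization described above.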

Wrapping Up

In this first part, we explored how the Aho-Corasick algorithm works, introduced the concepts of Tries and FSMs, and showed how precomputing failure transitions can cut down on CPU costs. By applying these techniques, TruffleHog not only detects secrets more efficiently but does so while handling large volumes of input with ease.

Check back soon for Part 2 of this series, where we’ll tackle memory usage optimizations. Until then, happy scanning!