[algorithms] Aho-Corasick 自動機：同時搜尋所有關鍵字

KMP 解決一個 pattern 的問題；AC 自動機解決同一個問題在有幾萬個 pattern 時的版本。

為什麼不用 k 次 KMP？

你要在一篇文章裡找出所有出現的敏感詞，總計有 10 萬個 pattern，文章長度 n = 10^6。

逐個跑 KMP：O(k × n) = 10^11——不可能。

AC 自動機把所有 pattern 組合起來，只掃一遍文本：O(n + Σm + z)，m = 各 pattern 長度，z = 總匹配次數。

三個核心結構

Trie：把所有 pattern 插入字典樹，共享前綴。pattern = [“he”, “she”, “his”, “hers”]：

root
├── h → e → [he] → r → s → [hers]
│         └── i → s → [his]
└── s → h → e → [she]

失敗鏈結（Failure Link）：類似 KMP 的失敗函數，但作用在 Trie 節點上。節點 u 的 failLink 指向 u 代表字串的「最長真後綴」在 Trie 中對應的節點。

"she" 末尾節點的 failLink = "he" 末尾節點
  （"she" 的最長真後綴 "he" 存在於 Trie 中）

匹配失敗時，沿 failLink 回退——不從根重新開始，能繼承前面已匹配的後綴。

輸出鏈結（Output Link）：當某節點不是任何 pattern 的結尾，但沿 failLink 路徑上有結尾節點，輸出鏈結直接指向最近的那個，讓匹配報告更方便。

建構：BFS 建 failLink

void buildFailLinks() {
    Queue<Integer> queue = new LinkedList<>();
    // 根的直接子節點：failLink 指根
    for (int child : trie[0].children.values()) {
        trie[child].fail = 0;
        queue.add(child);
    }
    while (!queue.isEmpty()) {
        int u = queue.poll();
        for (var entry : trie[u].children.entrySet()) {
            char c = entry.getKey();
            int v = entry.getValue();
            // v 的 failLink = u 的 failLink 沿著 c 走到的節點
            int f = trie[u].fail;
            while (f != 0 && !trie[f].children.containsKey(c))
                f = trie[f].fail;
            trie[v].fail = trie[f].children.getOrDefault(c, 0);
            queue.add(v);
        }
    }
}

搜尋

掃描文本時，當前節點沿字元轉移；若無對應轉移，沿 failLink 回退（等同 KMP 回退）：

void search(String text) {
    int cur = 0;
    for (char c : text.toCharArray()) {
        while (cur != 0 && !trie[cur].children.containsKey(c))
            cur = trie[cur].fail;
        cur = trie[cur].children.getOrDefault(c, 0);
 
        // 報告所有匹配（沿輸出鏈結）
        int tmp = cur;
        while (tmp != 0) {
            if (trie[tmp].patternIdx != -1)
                report(trie[tmp].patternIdx, textPos);
            tmp = trie[tmp].outputLink;
        }
    }
}

AC 自動機 = Trie + KMP 的失敗函數——如果你懂這兩個結構，AC 自動機只是把它們合在一起。

Terry Yao's Blog

目錄

Aho-Corasick 自動機：同時搜尋所有關鍵字

為什麼不用 k 次 KMP？

三個核心結構

建構：BFS 建 failLink

搜尋

關係圖譜