📖 Chapter 3.7 ⏱️ ~50 min read 🎯 Intermediate

Chapter 3.7: Hashing Techniques

📝 Before You Continue: You should know STL containers (Chapter 3.1) and string basics (Chapter 2.3). This chapter covers hashing principles and advanced competitive programming usage.

Hashing is one of the most important "tools" in competitive programming: it turns complex comparison problems into O(1) numeric comparisons. But hashing is also the easiest technique to get "hacked"—this chapter teaches both how to use it well and how to prevent being hacked.


3.7.1 unordered_map vs map: Internals & Performance

Internal Implementation Comparison

Featuremapunordered_map
Internal structureRed-black tree (balanced BST)Hash table
Lookup timeO(log N)O(1) avg, O(N) worst
Insert timeO(log N)O(1) avg, O(N) worst
Iteration orderOrdered (ascending by key)Unordered
Memory usageO(N), smaller constantO(N), larger constant
Worst caseO(log N) (stable)O(N) (hash collision)
#include <bits/stdc++.h>
using namespace std;

int main() {
    // map: ordered, O(log N)
    map<int, int> m;
    m[3] = 30; m[1] = 10; m[2] = 20;
    for (auto [k, v] : m) cout << k << ":" << v << " ";
    // output: 1:10 2:20 3:30  ← ordered!

    // unordered_map: unordered, O(1) average
    unordered_map<int, int> um;
    um[3] = 30; um[1] = 10; um[2] = 20;
    // iteration order undefined, but lookup is very fast

    // performance difference: N=10^6 operations
    // map: ~300ms; unordered_map: ~80ms (roughly)
}

When to Choose Which?

  • Use map: need ordered iteration, need lower_bound/upper_bound, extreme key range (high hash collision risk)
  • Use unordered_map: pure lookup/insert, key is integer or string, large N (> 10^5)

3.7.2 Anti-Hack: Custom Hash

Problem: unordered_map's default integer hash is essentially hash(x) = x, allowing attackers to construct many hash collisions, degrading operations to O(N) and causing TLE.

On platforms like Codeforces, this is a common hack technique.

Solution: splitmix64 Hash

// Anti-hack custom hasher — uses splitmix64
struct custom_hash {
    static uint64_t splitmix64(uint64_t x) {
        x += 0x9e3779b97f4a7c15;
        x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9;
        x = (x ^ (x >> 27)) * 0x94d049bb133111eb;
        return x ^ (x >> 31);
    }

    size_t operator()(uint64_t x) const {
        static const uint64_t FIXED_RANDOM =
            chrono::steady_clock::now().time_since_epoch().count();
        return splitmix64(x + FIXED_RANDOM);
    }
};

// Usage:
unordered_map<int, int, custom_hash> safe_map;
unordered_set<int, custom_hash> safe_set;

⚠️ Contest tip: When using unordered_map on Codeforces, always add custom_hash. USACO test data won't deliberately construct hacks, but it's a good habit.


3.7.3 String Hashing (Polynomial Hash)

String hashing maps a string to an integer, turning string comparison into numeric comparison (O(1)).

Core Formula

For string s[0..n-1], define the hash value as:

hash(s) = s[0]·B^(n-1) + s[1]·B^(n-2) + ... + s[n-1]·B^0  (mod M)

where B is the base (typically 131 or 131117) and M is a large prime (typically 10⁹+7 or 10⁹+9).

Prefix Hash + Substring Hash O(1)

// String hashing: O(N) preprocessing, O(1) substring hash
#include <bits/stdc++.h>
using namespace std;
typedef unsigned long long ull;

const ull BASE = 131;
// Use unsigned long long natural overflow (equivalent to mod 2^64)
// Or specify MOD manually:
// const ull MOD = 1e9 + 7;

struct StringHash {
    int n;
    vector<ull> h, pw;

    StringHash(const string& s) : n(s.size()), h(n + 1, 0), pw(n + 1, 1) {
        for (int i = 0; i < n; i++) {
            h[i + 1] = h[i] * BASE + (s[i] - 'a' + 1);  // 1-indexed prefix hash
            pw[i + 1] = pw[i] * BASE;                      // BASE^(i+1)
        }
    }

    // Get hash of substring s[l..r] (0-indexed)
    ull get(int l, int r) {
        return h[r + 1] - h[l] * pw[r - l + 1];  // ← KEY formula
    }
};

int main() {
    ios_base::sync_with_stdio(false);
    cin.tie(NULL);

    string s = "abcabc";
    StringHash sh(s);

    // Compare if two substrings are equal
    // s[0..2] = "abc", s[3..5] = "abc"
    cout << (sh.get(0, 2) == sh.get(3, 5) ? "Equal" : "Not Equal") << "\n";  // Equal

    // Compare s[0..1] = "ab" vs s[3..4] = "ab"
    cout << (sh.get(0, 1) == sh.get(3, 4) ? "Equal" : "Not Equal") << "\n";  // Equal
}

Hash Formula Derivation:

h[r+1] = s[0]*B^r + s[1]*B^(r-1) + ... + s[r]*B^0
h[l]   = s[0]*B^(l-1) + ... + s[l-1]*B^0

h[r+1] - h[l] * B^(r-l+1)
= (s[0]*B^r + ... + s[r]*B^0)
  - (s[0]*B^r + ... + s[l-1]*B^(r-l+1))
= s[l]*B^(r-l) + s[l+1]*B^(r-l-1) + ... + s[r]*B^0
= hash(s[l..r]) ✓

下图直观展示了前缀哈希数组的构建过程,以及如何用 get(l, r) 公式在 O(1) 内提取任意子串的哈希值:

String Polynomial Hash


3.7.4 Double Hashing (Avoiding Collisions)

Single hash (mod M) has collision probability ≈ 1/M. For N substring comparisons, expected collisions ≈ N²/(2M).

  • If M = 10⁹+7, N = 10⁶: collision probability ≈ 10¹²/(2×10⁹) = 500 times! Not safe.
  • Solution: double hashing, using two different (B, M) pairs simultaneously, collision probability drops to 1/(M₁×M₂) ≈ 10⁻¹⁸.
// Double hashing: two (BASE, MOD) pairs used simultaneously, extremely low collision probability
struct DoubleHash {
    static const ull B1 = 131, M1 = 1e9 + 7;
    static const ull B2 = 137, M2 = 1e9 + 9;

    int n;
    vector<ull> h1, h2, pw1, pw2;

    DoubleHash(const string& s) : n(s.size()),
        h1(n+1,0), h2(n+1,0), pw1(n+1,1), pw2(n+1,1) {
        for (int i = 0; i < n; i++) {
            ull c = s[i] - 'a' + 1;
            h1[i+1] = (h1[i] * B1 + c) % M1;
            h2[i+1] = (h2[i] * B2 + c) % M2;
            pw1[i+1] = pw1[i] * B1 % M1;
            pw2[i+1] = pw2[i] * B2 % M2;
        }
    }

    // Return pair<ull,ull> as the hash "fingerprint" of substring s[l..r]
    pair<ull,ull> get(int l, int r) {
        ull v1 = (h1[r+1] - h1[l] * pw1[r-l+1] % M1 + M1) % M1;
        ull v2 = (h2[r+1] - h2[l] * pw2[r-l+1] % M2 + M2) % M2;
        return {v1, v2};
    }
};

3.7.5 Application: String Matching (Rabin-Karp)

// Rabin-Karp string matching: find all occurrences of pattern P in text T
// Time: O(N+M) average, O(NM) worst case (but extremely fast in practice)
#include <bits/stdc++.h>
using namespace std;
typedef unsigned long long ull;

vector<int> rabinKarp(const string& T, const string& P) {
    int n = T.size(), m = P.size();
    if (m > n) return {};

    const ull BASE = 131;
    ull patHash = 0, textHash = 0, pow_m = 1;

    // Compute BASE^m (natural overflow)
    for (int i = 0; i < m - 1; i++) pow_m *= BASE;

    // Initial hash
    for (int i = 0; i < m; i++) {
        patHash = patHash * BASE + P[i];
        textHash = textHash * BASE + T[i];
    }

    vector<int> result;
    for (int i = 0; i + m <= n; i++) {
        if (textHash == patHash) {
            // Verify when hashes match (avoid false positives from collision)
            if (T.substr(i, m) == P) result.push_back(i);
        }
        if (i + m < n) {
            // Rolling hash: remove leftmost char, add rightmost char
            textHash = textHash - T[i] * pow_m;   // remove leftmost
            textHash = textHash * BASE + T[i + m]; // add rightmost
        }
    }
    return result;
}

3.7.6 Application: Longest Common Substring

Problem: Given strings S and T, find the length of their longest common substring.

Approach: Binary search on the answer (length L of longest common substring), then use a hash set to check if any substring of length L appears in both strings.

// Longest common substring: O(N log N) — binary search + hashing
int longestCommonSubstring(const string& S, const string& T) {
    StringHash hs(S), ht(T);
    int ns = S.size(), nt = T.size();

    auto check = [&](int len) -> bool {
        unordered_set<ull> setS;
        for (int i = 0; i + len <= ns; i++)
            setS.insert(hs.get(i, i + len - 1));
        for (int j = 0; j + len <= nt; j++)
            if (setS.count(ht.get(j, j + len - 1)))
                return true;
        return false;
    };

    int lo = 0, hi = min(ns, nt);
    while (lo < hi) {
        int mid = (lo + hi + 1) / 2;
        if (check(mid)) lo = mid;
        else hi = mid - 1;
    }
    return lo;
}

⚠️ Common Mistakes

  1. Bad modulus choice: Don't use numbers other than 10⁹+7; especially avoid non-prime moduli (high collision rate). Recommended: 10⁹+7 and 10⁹+9 as a double hash pair.

  2. unordered_map hacked: On platforms like Codeforces, the default hash can be attacked. Always use custom_hash.

  3. Substring hash subtraction underflow: h[r+1] - h[l] * pw[r-l+1] may be negative (with signed integers). Use unsigned long long natural overflow, or (... % M + M) % M to ensure non-negative.

  4. BASE doesn't match character set: For lowercase letters (26 types), BASE must be > 26 (typically 31 or 131). For all ASCII characters (128 types), BASE must be > 128 (use 131 or 137).

  5. Hash collision causing WA: Even with double hashing, collisions are theoretically possible. If uncertain, add direct string comparison when hashes match.


Chapter Summary

📌 Core Comparison Table

ToolTime ComplexityUse Case
map<K,V>O(log N)Need ordering, need range queries
unordered_map<K,V>O(1) amortizedOnly need lookup/insert, key order not required
String hash (single)O(N) preprocess, O(1) querySubstring comparison, pattern matching
String hash (double)O(N) preprocess, O(1) queryHigh-precision scenarios, avoid collisions

❓ FAQ

Q1: Which is better — unsigned long long natural overflow double hash or manual mod hash?

A: ull natural overflow (equivalent to mod 2⁶⁴) is simpler to code, and 2⁶⁴ is large enough that single-hash collision probability is already very low (≈ 10⁻¹⁸). But crafted data can deliberately cause collisions — double hashing is safer then. Both work in contests; ull is more common.

Q2: What can string hashing do that KMP cannot?

A: String hashing excels at multi-string comparison (e.g., finding longest common substring, palindromic substrings), while KMP only excels at single-pattern matching. Hash + binary search can solve many string problems in O(N log N) that would require more complex KMP implementations.

Q3: Should I use BASE 31 or 131?

A: Use 31 for lowercase letters only (a prime less than 37, avoids too-small hash space). Use 131 for mixed case or digits (a prime greater than 128, covers full ASCII). The key is: BASE must be larger than the character set size and ideally a prime.


Practice Problems

Problem 3.7.1 — Two Sum with Hash 🟢 Easy Given array A, find if any two distinct elements sum to target X. Use unordered_set.

Hint For each A[i], check if (X - A[i]) is already in the hash set. Insert A[i] after checking.

Problem 3.7.2 — Substring Check 🟢 Easy Given string T and pattern P, check if P appears in T. Print all starting indices.

Hint Use Rabin-Karp rolling hash, or just use `string::find` for practice, then implement manually.

Problem 3.7.3 — Longest Palindromic Substring 🟡 Medium Find the length of the longest palindromic substring.

Hint A palindrome s[l..r] satisfies: hash(s[l..r]) == hash(reverse(s)[n-1-r..n-1-l]). Binary search + hash on both forward and reversed string.

Problem 3.7.4 — Count Distinct Substrings 🟡 Medium Given string S of length N (N ≤ 5000), count the number of distinct substrings.

Hint Insert all O(N²) substring hashes into an unordered_set, count distinct values. Use double hash to avoid collisions.

Problem 3.7.5 — String Periods 🔴 Hard Find the smallest period of a string S (smallest k such that S is a repetition of S[0..k-1]).

Hint Try each k that divides n. For each candidate k, verify using string hash comparison in O(1) per check. Total O(d(n) × n) where d(n) is number of divisors.