HyperLogLog: Count Billions with 12 KB
How a single probabilistic trick lets you estimate 10⁹ unique elements using less memory than a thumbnail image.
Exact distinct counting requires memory proportional to cardinality.
seen = set()
for item in stream:
seen.add(item)
print(len(seen)) # exact — but `seen` can be gigabytesHyperLogLog estimates the same count with ~12 KB of memory, regardless of whether you have a million or a trillion distinct elements — with ~0.81% error.
The Insight: Rare Patterns Reveal Crowd Size
You hash each input element — a user ID, an IP address, a URL, anything — using a hash function like MurmurHash3. The output is a fixed-size integer that looks uniformly random in binary, regardless of what the input was.
import mmh3
mmh3.hash("user:1", signed=False) # → 2318604284 → 10001010001...
mmh3.hash("user:2", signed=False) # → 3715833618 → 11011101011...
mmh3.hash("user:1000", signed=False) # → 0475209131 → 00011100010...
mmh3.hash("user:1001", signed=False) # → 2089657174 → 01111100100...Because the output is uniformly distributed, each bit is independently 0 or 1 with 50/50 probability. So the probability of a hash starting with k consecutive zeros is 1/2^k — same as flipping k heads in a row.
hash binary output → leading zeros (k) → probability
1............... → 0 → 1/2 (50%)
01.............. → 1 → 1/4 (25%)
001............. → 2 → 1/8 (12.5%)
0000000001...... → 9 → 1/512 (~0.2%)
00000000000001.. → 13 → 1/8192 (~0.01%)
The rarest leading-zero pattern you've seen is a fingerprint of how large your input set is. If your max k is 13, you've almost certainly seen around 2^13 = 8192 distinct elements — because you'd only encounter a hash with 13 leading zeros once in ~8192 tries.
One register, one observation:
def naive_estimate(stream):
max_lz = 0
for item in stream:
h = mmh3.hash(item, signed=False) # hash the element → uniform int
lz = count_leading_zeros(h) # how many 0s before the first 1?
max_lz = max(max_lz, lz) # keep the rarest pattern seen
return 2 ** max_lz # estimate = 2^(rarest pattern)This works, but variance is too high — one lucky hash can skew the result 2× off. The fix is splitting into many registers and averaging.
The Algorithm: Many Registers, Harmonic Mean
Split every hash into two parts: the first b bits select a register, the rest count leading zeros.
hash(item) = [ register index (b bits) | leading-zero counter (remaining bits) ]
Track the max leading zeros per register, then take the harmonic mean across all m = 2^b registers.
m = 16384 # 2^14 registers → 0.81% error
registers = [0] * m
b = 14 # bits used for register index
def add(item):
h = hash_64bit(item)
j = h >> (64 - b) # which register
w = h & ((1 << (64 - b)) - 1) # remaining bits
registers[j] = max(registers[j], leading_zeros(w) + 1)
def count():
alpha = 0.7213 / (1 + 1.079 / m) # empirical bias constant
Z = sum(2.0 ** -r for r in registers)
return int(alpha * m * m / Z)Error formula: ±1.04 / √m
m=256 → ±6.5% error, ~192 B
m=1024 → ±3.2% error, ~768 B
m=4096 → ±1.6% error, ~3 KB
m=16384 → ±0.81% error, ~12 KB ← Redis default
Halving the error costs 4× the memory.
Bonus: two sketches merge for free — just take the element-wise max of both register arrays.
merged = [max(a, b) for a, b in zip(registers_a, registers_b)]This is why HLL is ideal for distributed systems — compute sketches independently, merge at query time.
In Practice: Redis
You never implement this. Redis has shipped it since v2.8.
# Add elements
PFADD visitors:2026-06-17 user:1 user:2 user:3 user:2 # duplicate ignored
PFCOUNT visitors:2026-06-17
# (integer) 3
# Merge across multiple days
PFMERGE visitors:week \
visitors:2026-06-11 \
visitors:2026-06-12 \
visitors:2026-06-17
PFCOUNT visitors:week
# (integer) ~41200 ← estimated, ±0.81%Same idea in analytics databases:
-- BigQuery
SELECT APPROX_COUNT_DISTINCT(user_id) FROM events WHERE date = '2026-06-17';
-- Snowflake
SELECT APPROX_COUNT_DISTINCT(user_id) FROM events;
-- PostgreSQL (pg_hll extension)
SELECT hll_cardinality(hll_add_agg(hll_hash_text(user_id::text))) FROM events;Elasticsearch cardinality aggregation, Apache Flink, Spark, Redshift, and Vertica all use HLL internally too.
When NOT to Use It
| Situation | Why HLL fails | Use instead |
|---|---|---|
| Exact billing / compliance counts | Estimate only, no hard guarantee | Exact COUNT(DISTINCT …) |
| "Was element X ever added?" | Stores no elements, only sketch state | Bloom filter or a Set |
| Tiny sets (< ~100 items) | Relative error is high at small cardinalities | Exact set |
| Need to list/retrieve the elements | One-way sketch — no element retrieval possible | Set |
| Adversarial inputs (rate-limit, DDoS detection) | Attacker who knows your hash function can craft inputs that collapse the estimate | Randomized seed per instance |
Summary
Problem: COUNT(DISTINCT x) on billions of rows is slow and memory-hungry.
HLL idea: hash each element → track max leading zeros per register
→ harmonic mean → cardinality estimate.
Cost: 12 KB memory, O(1) add, O(1) count, free merges.
Error: ~0.81% (Redis default).
Use when: approximate count is fine, scale is large, mergeability matters.
Skip when: you need exact results, element lookup, or the input is adversarial.