[common] Add HyperLogLogSketch: reusable cardinality estimation library#2664
sushantmane wants to merge 3 commits into linkedin:main
Conversation
Pull request overview
Adds a reusable, dependency-free HyperLogLog (HLL) sketch implementation to venice-common for approximate distinct counting, intended to be shared across components (e.g., VPJ accumulators and server-side verification).
Changes:
- Introduces `HyperLogLogSketch` with configurable precision, `add`/`addHash`, `merge`, and byte/ByteBuffer serialization APIs.
- Adds a comprehensive `HyperLogLogSketchTest` suite covering accuracy, merge properties, serialization round-trips, and error handling.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| internal/venice-common/src/main/java/com/linkedin/venice/utils/HyperLogLogSketch.java | New standalone HLL sketch implementation with merge + serialization + hashing. |
| internal/venice-common/src/test/java/com/linkedin/venice/utils/HyperLogLogSketchTest.java | New unit tests validating correctness/accuracy/merge behavior and serialization. |
pthirun left a comment:
Overall looks good — clean implementation with solid tests. Left a few comments on correctness and serialization size.
Force-pushed 600ba71 to 150eb90.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
```java
public static HyperLogLogSketch fromBytes(byte[] bytes) {
  if (bytes == null || bytes.length < 2) {
    throw new IllegalArgumentException("Invalid HLL bytes: null or too short");
  }
  int precision = bytes[0] & 0xFF;
  if (precision < MIN_PRECISION || precision > MAX_PRECISION) {
    throw new IllegalArgumentException("Invalid HLL precision in serialized data: " + precision);
  }
  int expectedLength = 1 + (1 << precision);
  if (bytes.length != expectedLength) {
    throw new IllegalArgumentException(
        "Invalid HLL bytes length: expected " + expectedLength + " for p=" + precision + ", got " + bytes.length);
  }
  byte[] registers = new byte[1 << precision];
  System.arraycopy(bytes, 1, registers, 0, registers.length);
  return new HyperLogLogSketch(precision, registers);
}
```
fromBytes(...) fully trusts the per-register payload and copies it into the sketch without documenting (or enforcing) the valid register range. Because registers are stored as signed bytes, corrupted/malicious inputs can introduce negative values and cause estimate() to compute shifts with unexpected counts, producing nonsensical results. Either (a) validate/clamp register values during deserialization, or (b) at minimum document the valid range in the serialization contract/Javadoc and ensure estimate() treats register bytes as unsigned (e.g., registers[i] & 0xFF) to avoid negative shift behavior.
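To make the signed-byte hazard concrete, here is a minimal standalone illustration (not the PR's code) of why a register byte must be masked with `& 0xFF` before it is used in arithmetic such as shift counts:

```java
// Minimal illustration (not the PR's code): a corrupted register byte of 0xFF
// reads as -1 when treated as a signed Java byte, but as 255 once masked.
// Java masks long shift distances to their low 6 bits, so a shift by -1 silently
// becomes a shift by 63 — producing a nonsensical estimate instead of an error.
public class UnsignedRegisterDemo {
  public static void main(String[] args) {
    byte corrupted = (byte) 0xFF;
    int signed = corrupted;          // -1: a negative shift count misbehaves silently
    int unsigned = corrupted & 0xFF; // 255: non-negative, so it can be range-checked
    System.out.println(signed + " vs " + unsigned);
  }
}
```

Running this prints `-1 vs 255`, which is exactly the failure mode the comment describes: without masking (or validation), a corrupted payload flows straight into the estimator.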
```java
// Verify hash distributes across all 4 quadrants of the 64-bit space
boolean hasPositive = false, hasNegative = false;
for (int i = 0; i < 100; i++) {
  long h = HyperLogLogSketch.hash64(("key-" + i).getBytes(StandardCharsets.UTF_8));
  if (h >= 0)
    hasPositive = true;
  if (h < 0)
    hasNegative = true;
}
assertTrue(hasPositive && hasNegative, "Hash should produce both positive and negative values");
```
The test comment says it verifies distribution across "all 4 quadrants of the 64-bit space", but the assertions only check that the hash produces both positive and negative values (2 halves, based on the sign bit). Please either update the comment to match what’s actually being tested, or expand the test to check the intended 4-quadrant distribution.
Suggested change:

```java
// Verify hash distributes across all 4 quadrants of the 64-bit space, as defined by the top 2 bits
boolean[] quadrants = new boolean[4];
for (int i = 0; i < 100; i++) {
  long h = HyperLogLogSketch.hash64(("key-" + i).getBytes(StandardCharsets.UTF_8));
  int quadrant = (int) ((h >>> 62) & 0x3); // Use the two most significant bits to select one of 4 quadrants
  quadrants[quadrant] = true;
}
assertTrue(
    quadrants[0] && quadrants[1] && quadrants[2] && quadrants[3],
    "Hash should produce values in all 4 high-order-bit quadrants of the 64-bit space");
```
```java
  return Math.round(estimate);
}
```
🟡 [IMPORTANT] Missing large-range correction in estimate() — systematic underestimation above ~143M unique keys

HyperLogLogSketch.java:162–167

The standard Flajolet HLL algorithm requires three correction regions; only two are implemented:
- ✅ Small range: linear counting (implemented)
- ❌ Large range: correction when the raw estimate exceeds 2^32 / 30 ≈ 143M (missing)
- ✅ Mid range: raw estimate used (implicit)

Without the large-range correction, estimates for datasets with more than ~143M unique keys saturate and undercount. The PR's stated goal is batch push record count verification in VPJ/PCS — false-negative mismatches on large Venice stores are a real risk.
Fix to add after the small-range block:

```java
if (estimate > (1.0 / 30.0) * (1L << 32)) {
  estimate = -(1L << 32) * Math.log(1.0 - estimate / (1L << 32));
}
```

```java
// ---- Serialization ----

/**
 * Serializes this sketch to a byte array.
 */
```
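For intuition, the threshold and the direction of the correction can be checked numerically with a standalone sketch (not the PR's estimate() method):

```java
// Standalone numeric check of the Flajolet large-range correction (not the PR's code).
public class LargeRangeDemo {
  public static void main(String[] args) {
    double twoPow32 = Math.pow(2, 32);
    // The correction kicks in above 2^32 / 30, roughly 143.2M
    double threshold = twoPow32 / 30.0;
    // A raw estimate of 3B is corrected upward, counteracting 32-bit hash saturation
    double raw = 3.0e9;
    double corrected = -twoPow32 * Math.log(1.0 - raw / twoPow32);
    System.out.printf("threshold=%.0f corrected>raw=%b%n", threshold, corrected > raw);
  }
}
```

This prints `threshold=143165577 corrected>raw=true`: above the threshold the logarithmic correction inflates the raw estimate, which is exactly what the missing branch would do.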
💡 [SUGGESTION] copy() — use Arrays.copyOf instead of manual arraycopy

HyperLogLogSketch.java:198–201

```java
// current
byte[] registersCopy = new byte[m];
System.arraycopy(registers, 0, registersCopy, 0, m);

// simpler
byte[] registersCopy = Arrays.copyOf(registers, m);
```

`Arrays` is already imported.
🟡 [IMPORTANT] No test for the serialize-each-split-then-merge path

All split-merge tests operate on in-memory sketches. In the actual production code path (Spark accumulators → driver merge, or PCS with persisted sketches), each partition's sketch is serialized with toBytes() and restored with fromBytes() before merging — that round trip is untested.

```java
@Test
public void testSplitMergeWithSerializationRoundTrip() {
  HyperLogLogSketch a = new HyperLogLogSketch(), b = new HyperLogLogSketch();
  for (int i = 0; i < 5000; i++) {
    a.add(("a-" + i).getBytes(UTF_8));
    b.add(("b-" + i).getBytes(UTF_8));
  }
  HyperLogLogSketch aRestored = HyperLogLogSketch.fromBytes(a.toBytes());
  HyperLogLogSketch bRestored = HyperLogLogSketch.fromBytes(b.toBytes());
  aRestored.merge(bRestored);
  double err = Math.abs((double) (aRestored.estimate() - 10_000)) / 10_000;
  assertTrue(err < 0.03);
}
```
```java
 */
public HyperLogLogSketch(int precision) {
  if (precision < MIN_PRECISION || precision > MAX_PRECISION) {
    throw new IllegalArgumentException(
```
Instead of throwing an exception here, can we log a warning and clamp the precision to the desired range? I.e., if it is less than min, set it to min; if it is greater than max, set it to max:
```java
if (lgK < 4 || lgK > 21) {
  int clamped = Math.max(4, Math.min(21, lgK));
  LOGGER
      .warn("Partition {} HLL lgK={} is out of valid range [4, 21]. Clamping to {}.", getPartition(), lgK, clamped);
  lgK = clamped;
}
```

Pure-Java HyperLogLog implementation in venice-common for estimating the number of distinct elements in a multiset. Designed for use across VPJ (Spark accumulators), server-side (PCS), and other components.

Features:
- Configurable precision (p=4..18, default 14 = ~0.8% error, 16KB)
- add(byte[]) and addHash(long) for flexible key input
- merge() for combining sketches (associative, commutative, idempotent)
- Compact serialization: toBytes/fromBytes, toByteBuffer/fromByteBuffer. Format: [1 byte precision][2^p bytes registers]
- Static hash64() method (FNV-1a + MurmurHash3 fmix64)

46 tests: accuracy, merge (disjoint/overlap/identical), split-merge simulation at max precision, serialization round-trip, custom precision, error handling, hash quality.
- Extract shared register logic into private updateRegister(long hash) method
- Add null checks in merge() and hash64() with clear error messages
- Chain private deserialization constructor to public one for precision validation
- Add large-range correction (Flajolet et al.) in estimate()
- Document valid register range in fromBytes() and dense serialization format
- Add testHashQualityAt10MKeys to validate hash quality at scale
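As a sanity check on the dense layout documented above ([1 byte precision][2^p bytes registers]), here is a hypothetical standalone encoder; HllFormatDemo and its toBytes are illustrative stand-ins, not the PR's API:

```java
// Illustrative encoder for the documented dense layout: [1 byte precision][2^p register bytes].
// HllFormatDemo and toBytes are hypothetical stand-ins, not the PR's code.
public class HllFormatDemo {
  static byte[] toBytes(int precision, byte[] registers) {
    byte[] out = new byte[1 + registers.length];
    out[0] = (byte) precision;                                // header: precision p
    System.arraycopy(registers, 0, out, 1, registers.length); // payload: 2^p registers
    return out;
  }

  public static void main(String[] args) {
    int p = 4;                      // smallest supported precision per the commit message
    byte[] regs = new byte[1 << p]; // 2^4 = 16 one-byte registers
    regs[3] = 7;
    byte[] bytes = toBytes(p, regs);
    System.out.println(bytes.length + " " + bytes[0] + " " + bytes[4]); // 17 4 7
  }
}
```

The one-byte header also explains the `bytes.length < 2` and `1 + (1 << precision)` checks in fromBytes(): the payload length is fully determined by the precision byte.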
Force-pushed 3ece0f6 to b9a7273.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
```java
/** Internal constructor for deserialization — takes ownership of the registers array. */
private HyperLogLogSketch(int precision, byte[] registers) {
  this(precision);
  System.arraycopy(registers, 0, this.registers, 0, this.m);
```
The private deserialization constructor comment says it “takes ownership of the registers array”, but it actually allocates a new array via this(precision) and then copies the input. This is misleading for readers and also makes copy()/fromBytes() incur extra allocations/copies. Either update the comment to reflect the copy semantics or refactor the constructor to truly take ownership (and validate length) to avoid redundant copying.
Suggested change:

```java
/**
 * Internal constructor for deserialization — takes ownership of the registers array.
 *
 * @throws IllegalArgumentException if precision is out of range or the register array length
 *         does not match the expected register count for the precision
 */
private HyperLogLogSketch(int precision, byte[] registers) {
  if (precision < MIN_PRECISION || precision > MAX_PRECISION) {
    throw new IllegalArgumentException(
        "Precision must be between " + MIN_PRECISION + " and " + MAX_PRECISION + ", got " + precision);
  }
  if (registers == null) {
    throw new IllegalArgumentException("Registers array must not be null");
  }
  int expectedRegisters = 1 << precision;
  if (registers.length != expectedRegisters) {
    throw new IllegalArgumentException(
        "Registers array length must be " + expectedRegisters + " for precision " + precision + ", got "
            + registers.length);
  }
  this.p = precision;
  this.m = expectedRegisters;
  this.alphaM = computeAlpha(m) * m * m;
  this.registers = registers;
```
```java
byte[] registersCopy = new byte[m];
System.arraycopy(registers, 0, registersCopy, 0, m);
return new HyperLogLogSketch(p, registersCopy);
```
copy() creates registersCopy and then calls the private constructor which copies again, resulting in two arrays and two full copies per copy() call (particularly expensive at high precision like p=18). Consider using a constructor/factory path that assigns the already-copied array directly (or avoids the intermediate array) so copy() does only one allocation and one copy.
Suggested change:

```java
HyperLogLogSketch copy = new HyperLogLogSketch(p);
System.arraycopy(registers, 0, copy.registers, 0, m);
return copy;
```
```java
 * Validates hash quality at scale: 10M distinct keys should produce an estimate within 1%
 * of the true cardinality at p=14 (~0.8% standard error). This catches systematic bias
 * in the FNV-1a + fmix64 hash combination that might not appear at smaller scales.
 */
@Test
public void testHashQualityAt10MKeys() {
  int n = 10_000_000;
  HyperLogLogSketch hll = new HyperLogLogSketch();
  for (int i = 0; i < n; i++) {
    hll.add(("key-" + i).getBytes(StandardCharsets.UTF_8));
  }
  long estimate = hll.estimate();
  double relativeError = Math.abs((double) (estimate - n)) / n;
  assertTrue(relativeError < 0.01, "At 10M keys, relative error " + relativeError + " exceeds 1%");
```
testHashQualityAt10MKeys adds 10,000,000 string-derived keys, which is likely to make the unit test suite very slow and potentially unstable in CI (CPU + allocation heavy due to per-iteration string/byte[] creation). Consider reducing n substantially, generating hashes without allocating strings/arrays, or moving/marking this as a slower performance/integration test so it doesn’t run in the default unit-test target.
Suggested change:

```java
/**
 * Validates hash quality at a unit-test-safe scale: 1M distinct keys should produce an estimate within 1%
 * of the true cardinality at p=14 (~0.8% standard error). Reusing a fixed byte buffer avoids the heavy
 * per-iteration String and byte[] allocation cost of generating string-derived keys in the hot loop.
 */
@Test
public void testHashQualityAt1MKeys() {
  int n = 1_000_000;
  HyperLogLogSketch hll = new HyperLogLogSketch();
  ByteBuffer keyBuffer = ByteBuffer.allocate(Integer.BYTES);
  for (int i = 0; i < n; i++) {
    keyBuffer.putInt(0, i);
    hll.add(keyBuffer.array());
  }
  long estimate = hll.estimate();
  double relativeError = Math.abs((double) (estimate - n)) / n;
  assertTrue(relativeError < 0.01, "At 1M keys, relative error " + relativeError + " exceeds 1%");
}
```
Summary
Pure-Java HyperLogLog implementation in venice-common for estimating distinct element count. Designed for reuse across VPJ (Spark accumulators), server-side (PCS), and other components.
Part 2 of 4 in the batch push record count verification series. Independent of Part 1.
Features
- `add(byte[])` and `addHash(long)` for flexible key input
- `merge()` for combining sketches (associative, commutative, idempotent)
- `toBytes()`/`fromBytes()`, `toByteBuffer()`/`fromByteBuffer()`. Format: `[1 byte precision][2^p bytes registers]`
- `hash64()` method (FNV-1a + MurmurHash3 fmix64)
- No external dependencies
Self-contained ~300 lines. No library additions needed.
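As a quick plausibility check of the defaults quoted in this PR (p=14, ~0.8% error, 16KB), assuming the standard HLL error bound of 1.04/sqrt(m) rather than anything specific to this implementation:

```java
// Checks the quoted defaults: p=14 gives m = 2^14 = 16384 one-byte registers (16KB)
// and a standard error of 1.04 / sqrt(m) = 1.04 / 128 ≈ 0.81%.
public class HllErrorBound {
  public static void main(String[] args) {
    int p = 14;
    int m = 1 << p; // register count, also the payload size in bytes
    double stdError = 1.04 / Math.sqrt(m);
    System.out.printf("registers=%d bytes=%d stdError=%.4f%n", m, m, stdError);
  }
}
```

This prints `registers=16384 bytes=16384 stdError=0.0081`, matching the "~0.8% error, 16KB" figures in the commit message.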
Test plan