[common] Add HyperLogLogSketch: reusable cardinality estimation library#2664
sushantmane wants to merge 3 commits into linkedin:main
Conversation
Pull request overview
Adds a reusable, dependency-free HyperLogLog (HLL) sketch implementation to venice-common for approximate distinct counting, intended to be shared across components (e.g., VPJ accumulators and server-side verification).
Changes:
- Introduces `HyperLogLogSketch` with configurable precision, `add`/`addHash`, `merge`, and byte/ByteBuffer serialization APIs.
- Adds a comprehensive `HyperLogLogSketchTest` suite covering accuracy, merge properties, serialization round-trips, and error handling.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| internal/venice-common/src/main/java/com/linkedin/venice/utils/HyperLogLogSketch.java | New standalone HLL sketch implementation with merge + serialization + hashing. |
| internal/venice-common/src/test/java/com/linkedin/venice/utils/HyperLogLogSketchTest.java | New unit tests validating correctness/accuracy/merge behavior and serialization. |
pthirun left a comment:
Overall looks good — clean implementation with solid tests. Left a few comments on correctness and serialization size.
Force-pushed 600ba71 to 150eb90.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
```java
public static HyperLogLogSketch fromBytes(byte[] bytes) {
  if (bytes == null || bytes.length < 2) {
    throw new IllegalArgumentException("Invalid HLL bytes: null or too short");
  }
  int precision = bytes[0] & 0xFF;
  if (precision < MIN_PRECISION || precision > MAX_PRECISION) {
    throw new IllegalArgumentException("Invalid HLL precision in serialized data: " + precision);
  }
  int expectedLength = 1 + (1 << precision);
  if (bytes.length != expectedLength) {
    throw new IllegalArgumentException(
        "Invalid HLL bytes length: expected " + expectedLength + " for p=" + precision + ", got " + bytes.length);
  }
  byte[] registers = new byte[1 << precision];
  System.arraycopy(bytes, 1, registers, 0, registers.length);
  return new HyperLogLogSketch(precision, registers);
}
```
fromBytes(...) fully trusts the per-register payload and copies it into the sketch without documenting (or enforcing) the valid register range. Because registers are stored as signed bytes, corrupted/malicious inputs can introduce negative values and cause estimate() to compute shifts with unexpected counts, producing nonsensical results. Either (a) validate/clamp register values during deserialization, or (b) at minimum document the valid range in the serialization contract/Javadoc and ensure estimate() treats register bytes as unsigned (e.g., registers[i] & 0xFF) to avoid negative shift behavior.
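To make the signed-byte hazard concrete, here is a minimal standalone illustration (not the PR's code) of why a register byte must be masked with `& 0xFF` before it is used in arithmetic such as shift counts:

```java
// Minimal illustration (not the PR's code): a corrupted register byte of 0xFF
// reads as -1 when treated as a signed Java byte, but as 255 once masked.
// Java masks long shift distances to their low 6 bits, so a shift by -1 silently
// becomes a shift by 63 — producing a nonsensical estimate instead of an error.
public class UnsignedRegisterDemo {
  public static void main(String[] args) {
    byte corrupted = (byte) 0xFF;
    int signed = corrupted;          // -1: a negative shift count misbehaves silently
    int unsigned = corrupted & 0xFF; // 255: non-negative, so it can be range-checked
    System.out.println(signed + " vs " + unsigned);
  }
}
```

Running this prints `-1 vs 255`, which is exactly the failure mode the comment describes: without masking (or validation), a corrupted payload flows straight into the estimator.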
```java
// Verify hash distributes across all 4 quadrants of the 64-bit space
boolean hasPositive = false, hasNegative = false;
for (int i = 0; i < 100; i++) {
  long h = HyperLogLogSketch.hash64(("key-" + i).getBytes(StandardCharsets.UTF_8));
  if (h >= 0)
    hasPositive = true;
  if (h < 0)
    hasNegative = true;
}
assertTrue(hasPositive && hasNegative, "Hash should produce both positive and negative values");
```
The test comment says it verifies distribution across "all 4 quadrants of the 64-bit space", but the assertions only check that the hash produces both positive and negative values (2 halves, based on the sign bit). Please either update the comment to match what’s actually being tested, or expand the test to check the intended 4-quadrant distribution.
Suggested change:

```java
// Verify hash distributes across all 4 quadrants of the 64-bit space, as defined by the top 2 bits
boolean[] quadrants = new boolean[4];
for (int i = 0; i < 100; i++) {
  long h = HyperLogLogSketch.hash64(("key-" + i).getBytes(StandardCharsets.UTF_8));
  int quadrant = (int) ((h >>> 62) & 0x3); // Use the two most significant bits to select one of 4 quadrants
  quadrants[quadrant] = true;
}
assertTrue(
    quadrants[0] && quadrants[1] && quadrants[2] && quadrants[3],
    "Hash should produce values in all 4 high-order-bit quadrants of the 64-bit space");
```
```java
  return Math.round(estimate);
}
```
🟡 [IMPORTANT] Missing large-range correction in estimate() — systematic underestimation above ~143M unique keys

HyperLogLogSketch.java:162–167

The standard Flajolet HLL algorithm requires three correction regions; only two are implemented:
- ✅ Small range: linear counting (implemented)
- ❌ Large range: correction when the raw estimate exceeds 2^32 / 30 ≈ 143M (missing)
- ✅ Mid range: raw estimate used (implicit)

Without the large-range correction, estimates for datasets with more than ~143M unique keys saturate and undercount. The PR's stated goal is batch push record count verification in VPJ/PCS — false-negative mismatches on large Venice stores are a real risk.
Fix to add after the small-range block:

```java
if (estimate > (1.0 / 30.0) * (1L << 32)) {
  estimate = -(1L << 32) * Math.log(1.0 - estimate / (1L << 32));
}
```

```java
// ---- Serialization ----

/**
 * Serializes this sketch to a byte array.
 */
```
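For intuition, the threshold and the direction of the correction can be checked numerically with a standalone sketch (not the PR's estimate() method):

```java
// Standalone numeric check of the Flajolet large-range correction (not the PR's code).
public class LargeRangeDemo {
  public static void main(String[] args) {
    double twoPow32 = Math.pow(2, 32);
    // The correction kicks in above 2^32 / 30, roughly 143.2M
    double threshold = twoPow32 / 30.0;
    // A raw estimate of 3B is corrected upward, counteracting 32-bit hash saturation
    double raw = 3.0e9;
    double corrected = -twoPow32 * Math.log(1.0 - raw / twoPow32);
    System.out.printf("threshold=%.0f corrected>raw=%b%n", threshold, corrected > raw);
  }
}
```

This prints `threshold=143165577 corrected>raw=true`: above the threshold the logarithmic correction inflates the raw estimate, which is exactly what the missing branch would do.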
💡 [SUGGESTION] copy() — use Arrays.copyOf instead of manual arraycopy

HyperLogLogSketch.java:198–201

```java
// current
byte[] registersCopy = new byte[m];
System.arraycopy(registers, 0, registersCopy, 0, m);

// simpler
byte[] registersCopy = Arrays.copyOf(registers, m);
```

`Arrays` is already imported.
🟡 [IMPORTANT] No test for the serialize-each-split-then-merge path

All split-merge tests operate on in-memory sketches. In the actual production code path (Spark accumulators → driver merge, or PCS with persisted sketches), each partition's sketch is serialized with toBytes() and restored with fromBytes() before merging — that round trip is untested.

```java
@Test
public void testSplitMergeWithSerializationRoundTrip() {
  HyperLogLogSketch a = new HyperLogLogSketch(), b = new HyperLogLogSketch();
  for (int i = 0; i < 5000; i++) {
    a.add(("a-" + i).getBytes(UTF_8));
    b.add(("b-" + i).getBytes(UTF_8));
  }
  HyperLogLogSketch aRestored = HyperLogLogSketch.fromBytes(a.toBytes());
  HyperLogLogSketch bRestored = HyperLogLogSketch.fromBytes(b.toBytes());
  aRestored.merge(bRestored);
  double err = Math.abs((double) (aRestored.estimate() - 10_000)) / 10_000;
  assertTrue(err < 0.03);
}
```
```java
 */
public HyperLogLogSketch(int precision) {
  if (precision < MIN_PRECISION || precision > MAX_PRECISION) {
    throw new IllegalArgumentException(
```
Instead of throwing an exception here, can we log a warning and clamp the precision to the desired range? I.e., if it is less than min, set it to min; if it is greater than max, set it to max:
```java
if (lgK < 4 || lgK > 21) {
  int clamped = Math.max(4, Math.min(21, lgK));
  LOGGER
      .warn("Partition {} HLL lgK={} is out of valid range [4, 21]. Clamping to {}.", getPartition(), lgK, clamped);
  lgK = clamped;
}
```

Pure-Java HyperLogLog implementation in venice-common for estimating the number of distinct elements in a multiset. Designed for use across VPJ (Spark accumulators), server-side (PCS), and other components.

Features:
- Configurable precision (p=4..18, default 14 = ~0.8% error, 16KB)
- add(byte[]) and addHash(long) for flexible key input
- merge() for combining sketches (associative, commutative, idempotent)
- Compact serialization: toBytes/fromBytes, toByteBuffer/fromByteBuffer. Format: [1 byte precision][2^p bytes registers]
- Static hash64() method (FNV-1a + MurmurHash3 fmix64)

46 tests: accuracy, merge (disjoint/overlap/identical), split-merge simulation at max precision, serialization round-trip, custom precision, error handling, hash quality.
- Extract shared register logic into private updateRegister(long hash) method
- Add null checks in merge() and hash64() with clear error messages
- Chain private deserialization constructor to public one for precision validation
- Add large-range correction (Flajolet et al.) in estimate()
- Document valid register range in fromBytes() and dense serialization format
- Add testHashQualityAt10MKeys to validate hash quality at scale
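As a sanity check on the dense layout documented above ([1 byte precision][2^p bytes registers]), here is a hypothetical standalone encoder; HllFormatDemo and its toBytes are illustrative stand-ins, not the PR's API:

```java
// Illustrative encoder for the documented dense layout: [1 byte precision][2^p register bytes].
// HllFormatDemo and toBytes are hypothetical stand-ins, not the PR's code.
public class HllFormatDemo {
  static byte[] toBytes(int precision, byte[] registers) {
    byte[] out = new byte[1 + registers.length];
    out[0] = (byte) precision;                                // header: precision p
    System.arraycopy(registers, 0, out, 1, registers.length); // payload: 2^p registers
    return out;
  }

  public static void main(String[] args) {
    int p = 4;                      // smallest supported precision per the commit message
    byte[] regs = new byte[1 << p]; // 2^4 = 16 one-byte registers
    regs[3] = 7;
    byte[] bytes = toBytes(p, regs);
    System.out.println(bytes.length + " " + bytes[0] + " " + bytes[4]); // 17 4 7
  }
}
```

The one-byte header also explains the `bytes.length < 2` and `1 + (1 << precision)` checks in fromBytes(): the payload length is fully determined by the precision byte.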
Force-pushed 3ece0f6 to b9a7273.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
```java
/** Internal constructor for deserialization — takes ownership of the registers array. */
private HyperLogLogSketch(int precision, byte[] registers) {
  this(precision);
  System.arraycopy(registers, 0, this.registers, 0, this.m);
```
The private deserialization constructor comment says it “takes ownership of the registers array”, but it actually allocates a new array via this(precision) and then copies the input. This is misleading for readers and also makes copy()/fromBytes() incur extra allocations/copies. Either update the comment to reflect the copy semantics or refactor the constructor to truly take ownership (and validate length) to avoid redundant copying.
Suggested change:

```java
/**
 * Internal constructor for deserialization — takes ownership of the registers array.
 *
 * @throws IllegalArgumentException if precision is out of range or the register array length
 *         does not match the expected register count for the precision
 */
private HyperLogLogSketch(int precision, byte[] registers) {
  if (precision < MIN_PRECISION || precision > MAX_PRECISION) {
    throw new IllegalArgumentException(
        "Precision must be between " + MIN_PRECISION + " and " + MAX_PRECISION + ", got " + precision);
  }
  if (registers == null) {
    throw new IllegalArgumentException("Registers array must not be null");
  }
  int expectedRegisters = 1 << precision;
  if (registers.length != expectedRegisters) {
    throw new IllegalArgumentException(
        "Registers array length must be " + expectedRegisters + " for precision " + precision + ", got "
            + registers.length);
  }
  this.p = precision;
  this.m = expectedRegisters;
  this.alphaM = computeAlpha(m) * m * m;
  this.registers = registers;
```
```java
byte[] registersCopy = new byte[m];
System.arraycopy(registers, 0, registersCopy, 0, m);
return new HyperLogLogSketch(p, registersCopy);
```
copy() creates registersCopy and then calls the private constructor which copies again, resulting in two arrays and two full copies per copy() call (particularly expensive at high precision like p=18). Consider using a constructor/factory path that assigns the already-copied array directly (or avoids the intermediate array) so copy() does only one allocation and one copy.
Suggested change:

```java
HyperLogLogSketch copy = new HyperLogLogSketch(p);
System.arraycopy(registers, 0, copy.registers, 0, m);
return copy;
```
```java
 * Validates hash quality at scale: 10M distinct keys should produce an estimate within 1%
 * of the true cardinality at p=14 (~0.8% standard error). This catches systematic bias
 * in the FNV-1a + fmix64 hash combination that might not appear at smaller scales.
 */
@Test
public void testHashQualityAt10MKeys() {
  int n = 10_000_000;
  HyperLogLogSketch hll = new HyperLogLogSketch();
  for (int i = 0; i < n; i++) {
    hll.add(("key-" + i).getBytes(StandardCharsets.UTF_8));
  }
  long estimate = hll.estimate();
  double relativeError = Math.abs((double) (estimate - n)) / n;
  assertTrue(relativeError < 0.01, "At 10M keys, relative error " + relativeError + " exceeds 1%");
```
testHashQualityAt10MKeys adds 10,000,000 string-derived keys, which is likely to make the unit test suite very slow and potentially unstable in CI (CPU + allocation heavy due to per-iteration string/byte[] creation). Consider reducing n substantially, generating hashes without allocating strings/arrays, or moving/marking this as a slower performance/integration test so it doesn’t run in the default unit-test target.
Suggested change:

```java
/**
 * Validates hash quality at a unit-test-safe scale: 1M distinct keys should produce an estimate within 1%
 * of the true cardinality at p=14 (~0.8% standard error). Reusing a fixed byte buffer avoids the heavy
 * per-iteration String and byte[] allocation cost of generating string-derived keys in the hot loop.
 */
@Test
public void testHashQualityAt1MKeys() {
  int n = 1_000_000;
  HyperLogLogSketch hll = new HyperLogLogSketch();
  ByteBuffer keyBuffer = ByteBuffer.allocate(Integer.BYTES);
  for (int i = 0; i < n; i++) {
    keyBuffer.putInt(0, i);
    hll.add(keyBuffer.array());
  }
  long estimate = hll.estimate();
  double relativeError = Math.abs((double) (estimate - n)) / n;
  assertTrue(relativeError < 0.01, "At 1M keys, relative error " + relativeError + " exceeds 1%");
}
```
Summary
Pure-Java HyperLogLog implementation in venice-common for estimating distinct element count. Designed for reuse across VPJ (Spark accumulators), server-side (PCS), and other components.
Part 2 of 4 in the batch push record count verification series. Independent of Part 1.
Features
- `add(byte[])` and `addHash(long)` for flexible key input
- `merge()` for combining sketches (associative, commutative, idempotent)
- `toBytes()`/`fromBytes()`, `toByteBuffer()`/`fromByteBuffer()`. Format: `[1 byte precision][2^p bytes registers]`
- `hash64()` method (FNV-1a + MurmurHash3 fmix64)
- No external dependencies
Self-contained ~300 lines. No library additions needed.
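As a quick plausibility check of the defaults quoted in this PR (p=14, ~0.8% error, 16KB), assuming the standard HLL error bound of 1.04/sqrt(m) rather than anything specific to this implementation:

```java
// Checks the quoted defaults: p=14 gives m = 2^14 = 16384 one-byte registers (16KB)
// and a standard error of 1.04 / sqrt(m) = 1.04 / 128 ≈ 0.81%.
public class HllErrorBound {
  public static void main(String[] args) {
    int p = 14;
    int m = 1 << p; // register count, also the payload size in bytes
    double stdError = 1.04 / Math.sqrt(m);
    System.out.printf("registers=%d bytes=%d stdError=%.4f%n", m, m, stdError);
  }
}
```

This prints `registers=16384 bytes=16384 stdError=0.0081`, matching the "~0.8% error, 16KB" figures in the commit message.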
Test plan