
[common] Add HyperLogLogSketch: reusable cardinality estimation library#2664

Open
sushantmane wants to merge 3 commits into linkedin:main from sushantmane:sumane/pr2-hll-algorithm-library

Conversation

@sushantmane
Contributor

Summary

Pure-Java HyperLogLog implementation in venice-common for estimating distinct element count. Designed for reuse across VPJ (Spark accumulators), server-side (PCS), and other components.

Part 2 of 4 in the batch push record count verification series. Independent of Part 1.

Features

  • Configurable precision (p=4..18, default 14 = ~0.8% error, 16KB)
  • add(byte[]) and addHash(long) for flexible key input
  • merge() for combining sketches (associative, commutative, idempotent)
  • Compact serialization: toBytes()/fromBytes(), toByteBuffer()/fromByteBuffer()
    Format: [1 byte precision][2^p bytes registers]
  • Static hash64() method (FNV-1a + MurmurHash3 fmix64)

No external dependencies

Self-contained ~300 lines. No library additions needed.
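The API surface described above can be sketched as a small, self-contained re-implementation. The method names mirror the PR's description (addHash, merge, estimate, toBytes, and the [1 byte precision][2^p bytes registers] format), but this is an illustrative textbook version, not the PR's ~300-line class:

```java
/**
 * Minimal illustrative HyperLogLog shaped like the API described above.
 * Hedged sketch of the standard algorithm; constants and hashing are the
 * textbook ones, not necessarily what the PR implements.
 */
public class MiniHll {
  private final int p;           // precision: top p bits of the hash pick a register
  private final byte[] registers;

  public MiniHll(int p) {
    this.p = p;
    this.registers = new byte[1 << p];
  }

  /** MurmurHash3 fmix64 finalizer, used here as a stand-in 64-bit mixer. */
  public static long fmix64(long h) {
    h ^= h >>> 33;
    h *= 0xff51afd7ed558ccdL;
    h ^= h >>> 33;
    h *= 0xc4ceb9fe1a85ec53L;
    h ^= h >>> 33;
    return h;
  }

  public void addHash(long hash) {
    int idx = (int) (hash >>> (64 - p));            // register index from top p bits
    long rest = (hash << p) | (1L << (p - 1));      // sentinel bit caps rank at 64 - p + 1
    int rank = Long.numberOfLeadingZeros(rest) + 1; // position of first 1-bit in the remainder
    if (rank > registers[idx]) {
      registers[idx] = (byte) rank;
    }
  }

  /** Register-wise max: associative, commutative, and idempotent, as the PR notes. */
  public void merge(MiniHll other) {
    for (int i = 0; i < registers.length; i++) {
      if (other.registers[i] > registers[i]) {
        registers[i] = other.registers[i];
      }
    }
  }

  public long estimate() {
    int m = registers.length;
    double sum = 0;
    int zeros = 0;
    for (byte r : registers) {
      sum += 1.0 / (1L << r);
      if (r == 0) {
        zeros++;
      }
    }
    double alpha = 0.7213 / (1 + 1.079 / m); // bias-correction constant for m >= 128
    double e = alpha * m * m / sum;
    if (e <= 2.5 * m && zeros > 0) {
      e = m * Math.log((double) m / zeros);  // small-range linear counting
    }
    return Math.round(e);
  }

  /** Serializes as [1 byte precision][2^p register bytes]. */
  public byte[] toBytes() {
    byte[] out = new byte[1 + registers.length];
    out[0] = (byte) p;
    System.arraycopy(registers, 0, out, 1, registers.length);
    return out;
  }
}
```

At p=14 this holds 16384 one-byte registers, matching the ~16KB footprint quoted above; merging two sketches built from overlapping key ranges still estimates the distinct count of the union.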

Test plan

  • 46 tests: accuracy, merge (disjoint/overlap/identical), split-merge simulation at max precision (p=18), serialization round-trip, custom precision, error handling, hash quality


Copilot AI left a comment


Pull request overview

Adds a reusable, dependency-free HyperLogLog (HLL) sketch implementation to venice-common for approximate distinct counting, intended to be shared across components (e.g., VPJ accumulators and server-side verification).

Changes:

  • Introduces HyperLogLogSketch with configurable precision, add/addHash, merge, and byte/ByteBuffer serialization APIs.
  • Adds a comprehensive HyperLogLogSketchTest suite covering accuracy, merge properties, serialization round-trips, and error handling.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

  • internal/venice-common/src/main/java/com/linkedin/venice/utils/HyperLogLogSketch.java — New standalone HLL sketch implementation with merge + serialization + hashing.
  • internal/venice-common/src/test/java/com/linkedin/venice/utils/HyperLogLogSketchTest.java — New unit tests validating correctness/accuracy/merge behavior and serialization.


Contributor

@pthirun pthirun left a comment


Overall looks good — clean implementation with solid tests. Left a few comments on correctness and serialization size.

Copilot AI review requested due to automatic review settings March 31, 2026 21:20
@sushantmane sushantmane force-pushed the sumane/pr2-hll-algorithm-library branch from 600ba71 to 150eb90 Compare March 31, 2026 21:20

Copilot AI left a comment


Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.



Comment on lines +227 to +242
public static HyperLogLogSketch fromBytes(byte[] bytes) {
  if (bytes == null || bytes.length < 2) {
    throw new IllegalArgumentException("Invalid HLL bytes: null or too short");
  }
  int precision = bytes[0] & 0xFF;
  if (precision < MIN_PRECISION || precision > MAX_PRECISION) {
    throw new IllegalArgumentException("Invalid HLL precision in serialized data: " + precision);
  }
  int expectedLength = 1 + (1 << precision);
  if (bytes.length != expectedLength) {
    throw new IllegalArgumentException(
        "Invalid HLL bytes length: expected " + expectedLength + " for p=" + precision + ", got " + bytes.length);
  }
  byte[] registers = new byte[1 << precision];
  System.arraycopy(bytes, 1, registers, 0, registers.length);
  return new HyperLogLogSketch(precision, registers);

Copilot AI Mar 31, 2026


fromBytes(...) fully trusts the per-register payload and copies it into the sketch without documenting (or enforcing) the valid register range. Because registers are stored as signed bytes, corrupted/malicious inputs can introduce negative values and cause estimate() to compute shifts with unexpected counts, producing nonsensical results. Either (a) validate/clamp register values during deserialization, or (b) at minimum document the valid range in the serialization contract/Javadoc and ensure estimate() treats register bytes as unsigned (e.g., registers[i] & 0xFF) to avoid negative shift behavior.
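Option (a) can be made concrete with a standalone validation sketch. It assumes one rank per register byte, where a rank is "leading zeros + 1" over the 64 - p hash bits left after indexing, so a register can legitimately hold only 0..(64 - p + 1); the class and method names here are illustrative, not from the PR:

```java
/** Illustrative register-range check for deserialized HLL payloads. */
public class RegisterValidation {
  public static void validateRegisters(int precision, byte[] registers) {
    int maxRank = 64 - precision + 1; // largest rank 64 - p hash bits can produce
    for (int i = 0; i < registers.length; i++) {
      int v = registers[i] & 0xFF;    // read unsigned: a raw signed byte could look negative
      if (v > maxRank) {
        throw new IllegalArgumentException(
            "Register " + i + " holds " + v + ", above max rank " + maxRank + " for p=" + precision);
      }
    }
  }
}
```

Reading the byte as `registers[i] & 0xFF` also addresses the signed-byte concern directly, since the unsigned value is what both the range check and estimate() should operate on.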

Comment on lines +622 to +631
// Verify hash distributes across all 4 quadrants of the 64-bit space
boolean hasPositive = false, hasNegative = false;
for (int i = 0; i < 100; i++) {
  long h = HyperLogLogSketch.hash64(("key-" + i).getBytes(StandardCharsets.UTF_8));
  if (h >= 0)
    hasPositive = true;
  if (h < 0)
    hasNegative = true;
}
assertTrue(hasPositive && hasNegative, "Hash should produce both positive and negative values");

Copilot AI Mar 31, 2026


The test comment says it verifies distribution across "all 4 quadrants of the 64-bit space", but the assertions only check that the hash produces both positive and negative values (2 halves, based on the sign bit). Please either update the comment to match what’s actually being tested, or expand the test to check the intended 4-quadrant distribution.

Suggested change

// Verify hash distributes across all 4 quadrants of the 64-bit space
boolean hasPositive = false, hasNegative = false;
for (int i = 0; i < 100; i++) {
  long h = HyperLogLogSketch.hash64(("key-" + i).getBytes(StandardCharsets.UTF_8));
  if (h >= 0)
    hasPositive = true;
  if (h < 0)
    hasNegative = true;
}
assertTrue(hasPositive && hasNegative, "Hash should produce both positive and negative values");

// Verify hash distributes across all 4 quadrants of the 64-bit space, as defined by the top 2 bits
boolean[] quadrants = new boolean[4];
for (int i = 0; i < 100; i++) {
  long h = HyperLogLogSketch.hash64(("key-" + i).getBytes(StandardCharsets.UTF_8));
  int quadrant = (int) ((h >>> 62) & 0x3); // Use the two most significant bits to select one of 4 quadrants
  quadrants[quadrant] = true;
}
assertTrue(
    quadrants[0] && quadrants[1] && quadrants[2] && quadrants[3],
    "Hash should produce values in all 4 high-order-bit quadrants of the 64-bit space");


return Math.round(estimate);
}

Contributor


🟡 [IMPORTANT] Missing large-range correction in estimate() — systematic underestimation above ~143M unique keys

HyperLogLogSketch.java:162–167

The standard Flajolet HLL algorithm requires three correction regions; only two are implemented:

  • ✅ Small range: linear counting (implemented)
  • ❌ Large range: correction when the raw estimate exceeds 2^32 / 30 ≈ 143M (missing)
  • ✅ Mid range: raw estimate used (implicit)

Without the large-range correction, estimates for datasets with >143M unique keys saturate and undercount. The PR's stated goal is batch push record count verification in VPJ/PCS — false-negative mismatches on large Venice stores are a real risk.

Fix to add after the small-range block:

if (estimate > (1.0 / 30.0) * (1L << 32)) {
    estimate = -(1L << 32) * Math.log(1.0 - estimate / (1L << 32));
}
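As a quick sanity check on the proposed fix, the correction can be isolated into a standalone helper (the class and method names here are illustrative, not from the PR): for raw estimates above the 2^32 / 30 ≈ 143M threshold it inflates the estimate, compensating for the saturation the comment describes.

```java
/** Standalone version of the proposed large-range correction. */
public class LargeRangeCorrection {
  static final double TWO_32 = 4294967296.0; // 2^32

  /** Applies the Flajolet large-range correction above the 2^32 / 30 threshold. */
  public static double correct(double rawEstimate) {
    if (rawEstimate > TWO_32 / 30.0) {
      return -TWO_32 * Math.log(1.0 - rawEstimate / TWO_32);
    }
    return rawEstimate; // mid-range: raw estimate used as-is
  }
}
```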

// ---- Serialization ----

/**
* Serializes this sketch to a byte array.
Contributor


💡 [SUGGESTION] copy() — use Arrays.copyOf instead of manual arraycopy

HyperLogLogSketch.java:198–201

// current
byte[] registersCopy = new byte[m];
System.arraycopy(registers, 0, registersCopy, 0, m);

// simpler
byte[] registersCopy = Arrays.copyOf(registers, m);

Arrays is already imported.


@m-nagarajan
Contributor

🟡 [IMPORTANT] No test for the serialize-each-split-then-merge path

HyperLogLogSketchTest.java — missing test

All split-merge tests operate on in-memory sketches. In the actual production code path (Spark accumulators → driver merge, or PCS with persisted sketches), each partition's sketch is serialized (toBytes()), transported, and then deserialized (fromBytes()) before merging. A corrupted register during serialization round-trip would silently produce wrong estimates. None of the 46 tests cover this path:

@Test
public void testSplitMergeWithSerializationRoundTrip() {
    HyperLogLogSketch a = new HyperLogLogSketch(), b = new HyperLogLogSketch();
    for (int i = 0; i < 5000; i++) {
        a.add(("a-" + i).getBytes(UTF_8));
        b.add(("b-" + i).getBytes(UTF_8));
    }
    HyperLogLogSketch aRestored = HyperLogLogSketch.fromBytes(a.toBytes());
    HyperLogLogSketch bRestored = HyperLogLogSketch.fromBytes(b.toBytes());
    aRestored.merge(bRestored);
    double err = Math.abs((double)(aRestored.estimate() - 10_000)) / 10_000;
    assertTrue(err < 0.03);
}

*/
public HyperLogLogSketch(int precision) {
if (precision < MIN_PRECISION || precision > MAX_PRECISION) {
throw new IllegalArgumentException(
Contributor


Instead of throwing an exception here, can we log a warning and clamp the precision to the desired range?
i.e. if it is less than min, set it to min, and if it is greater than max, set to max

    if (lgK < 4 || lgK > 21) {
      int clamped = Math.max(4, Math.min(21, lgK));
      LOGGER
          .warn("Partition {} HLL lgK={} is out of valid range [4, 21]. Clamping to {}.", getPartition(), lgK, clamped);
      lgK = clamped;
    }

- Extract shared register logic into private updateRegister(long hash) method
- Add null checks in merge() and hash64() with clear error messages
- Chain private deserialization constructor to public one for precision
validation
- Add large-range correction (Flajolet et al.) in estimate()
- Document valid register range in fromBytes() and dense serialization format
- Add testHashQualityAt10MKeys to validate hash quality at scale
Copilot AI review requested due to automatic review settings April 3, 2026 16:52
@sushantmane sushantmane force-pushed the sumane/pr2-hll-algorithm-library branch from 3ece0f6 to b9a7273 Compare April 3, 2026 16:52

Copilot AI left a comment


Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.



Comment on lines +78 to +81
/** Internal constructor for deserialization — takes ownership of the registers array. */
private HyperLogLogSketch(int precision, byte[] registers) {
  this(precision);
  System.arraycopy(registers, 0, this.registers, 0, this.m);

Copilot AI Apr 3, 2026


The private deserialization constructor comment says it “takes ownership of the registers array”, but it actually allocates a new array via this(precision) and then copies the input. This is misleading for readers and also makes copy()/fromBytes() incur extra allocations/copies. Either update the comment to reflect the copy semantics or refactor the constructor to truly take ownership (and validate length) to avoid redundant copying.

Suggested change

/** Internal constructor for deserialization — takes ownership of the registers array. */
private HyperLogLogSketch(int precision, byte[] registers) {
  this(precision);
  System.arraycopy(registers, 0, this.registers, 0, this.m);

/**
 * Internal constructor for deserialization — takes ownership of the registers array.
 *
 * @throws IllegalArgumentException if precision is out of range or the register array length
 *         does not match the expected register count for the precision
 */
private HyperLogLogSketch(int precision, byte[] registers) {
  if (precision < MIN_PRECISION || precision > MAX_PRECISION) {
    throw new IllegalArgumentException(
        "Precision must be between " + MIN_PRECISION + " and " + MAX_PRECISION + ", got " + precision);
  }
  if (registers == null) {
    throw new IllegalArgumentException("Registers array must not be null");
  }
  int expectedRegisters = 1 << precision;
  if (registers.length != expectedRegisters) {
    throw new IllegalArgumentException(
        "Registers array length must be " + expectedRegisters + " for precision " + precision + ", got "
            + registers.length);
  }
  this.p = precision;
  this.m = expectedRegisters;
  this.alphaM = computeAlpha(m) * m * m;
  this.registers = registers;

Comment on lines +195 to +197
byte[] registersCopy = new byte[m];
System.arraycopy(registers, 0, registersCopy, 0, m);
return new HyperLogLogSketch(p, registersCopy);
Copilot AI Apr 3, 2026


copy() creates registersCopy and then calls the private constructor which copies again, resulting in two arrays and two full copies per copy() call (particularly expensive at high precision like p=18). Consider using a constructor/factory path that assigns the already-copied array directly (or avoids the intermediate array) so copy() does only one allocation and one copy.

Suggested change

byte[] registersCopy = new byte[m];
System.arraycopy(registers, 0, registersCopy, 0, m);
return new HyperLogLogSketch(p, registersCopy);

HyperLogLogSketch copy = new HyperLogLogSketch(p);
System.arraycopy(registers, 0, copy.registers, 0, m);
return copy;

Comment on lines +613 to +626
* Validates hash quality at scale: 10M distinct keys should produce an estimate within 1%
* of the true cardinality at p=14 (~0.8% standard error). This catches systematic bias
* in the FNV-1a + fmix64 hash combination that might not appear at smaller scales.
*/
@Test
public void testHashQualityAt10MKeys() {
int n = 10_000_000;
HyperLogLogSketch hll = new HyperLogLogSketch();
for (int i = 0; i < n; i++) {
hll.add(("key-" + i).getBytes(StandardCharsets.UTF_8));
}
long estimate = hll.estimate();
double relativeError = Math.abs((double) (estimate - n)) / n;
assertTrue(relativeError < 0.01, "At 10M keys, relative error " + relativeError + " exceeds 1%");
Copilot AI Apr 3, 2026


testHashQualityAt10MKeys adds 10,000,000 string-derived keys, which is likely to make the unit test suite very slow and potentially unstable in CI (CPU + allocation heavy due to per-iteration string/byte[] creation). Consider reducing n substantially, generating hashes without allocating strings/arrays, or moving/marking this as a slower performance/integration test so it doesn’t run in the default unit-test target.

Suggested change

 * Validates hash quality at scale: 10M distinct keys should produce an estimate within 1%
 * of the true cardinality at p=14 (~0.8% standard error). This catches systematic bias
 * in the FNV-1a + fmix64 hash combination that might not appear at smaller scales.
 */
@Test
public void testHashQualityAt10MKeys() {
  int n = 10_000_000;
  HyperLogLogSketch hll = new HyperLogLogSketch();
  for (int i = 0; i < n; i++) {
    hll.add(("key-" + i).getBytes(StandardCharsets.UTF_8));
  }
  long estimate = hll.estimate();
  double relativeError = Math.abs((double) (estimate - n)) / n;
  assertTrue(relativeError < 0.01, "At 10M keys, relative error " + relativeError + " exceeds 1%");

 * Validates hash quality at a unit-test-safe scale: 1M distinct keys should produce an estimate within 1%
 * of the true cardinality at p=14 (~0.8% standard error). Reusing a fixed byte buffer avoids the heavy
 * per-iteration String and byte[] allocation cost of generating string-derived keys in the hot loop.
 */
@Test
public void testHashQualityAt1MKeys() {
  int n = 1_000_000;
  HyperLogLogSketch hll = new HyperLogLogSketch();
  ByteBuffer keyBuffer = ByteBuffer.allocate(Integer.BYTES);
  for (int i = 0; i < n; i++) {
    keyBuffer.putInt(0, i);
    hll.add(keyBuffer.array());
  }
  long estimate = hll.estimate();
  double relativeError = Math.abs((double) (estimate - n)) / n;
  assertTrue(relativeError < 0.01, "At 1M keys, relative error " + relativeError + " exceeds 1%");
