Make read_records return deterministic hash #1638

Draft
ilongin wants to merge 2 commits into main from ilongin/1636-deterministic-hash-for-read-records

ilongin commented Mar 11, 2026

read_records() creates a temp dataset with a random UUID name. The starting checkpoint hash is derived from this name, so it differs on every run and checkpoints can never be reused for read_values, read_pandas, read_hf, and direct read_records calls with list data.

This adds deterministic content-based hashing: when read_records receives a concrete list (not a generator/iterator), a SHA256 hash is computed from the actual data and set as the starting checkpoint hash, overriding the random temp dataset name. read_values similarly computes a hash from its keyword arguments. This makes checkpoint reuse work for read_values, read_pandas, read_hf, and read_records with list input.
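
As a rough illustration of the idea (not the code in this PR), content-based hashing of list input could look like the sketch below. The function name `content_hash`, the JSON-based serialization, and the length prefix are assumptions made for the example; the actual serialization used by DataChain may differ.

```python
import hashlib
import json


def content_hash(records):
    """Hypothetical helper: SHA-256 hex digest of a concrete list of records.

    Returns None for anything that is not a list (e.g. a generator),
    mirroring the "concrete list only" rule described above.
    """
    if not isinstance(records, list):
        return None

    hasher = hashlib.sha256()
    for record in records:
        # sort_keys makes the encoding independent of dict key order;
        # default=str is a simplistic fallback for non-JSON types.
        encoded = json.dumps(record, sort_keys=True, default=str).encode("utf-8")
        # Length prefix keeps record boundaries unambiguous.
        hasher.update(len(encoded).to_bytes(8, "big"))
        hasher.update(encoded)
    return hasher.hexdigest()


if __name__ == "__main__":
    rows = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
    # Same data gives the same digest on every run.
    print(content_hash(rows) == content_hash(list(rows)))
    # Lazy input is not hashed.
    print(content_hash(r for r in rows))
```

The digest depends only on the data, so two runs with the same list start from the same checkpoint hash, while a generator falls back to the existing (random) behavior.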

Generator/iterator-based inputs (e.g. read_database, read_records with a generator) still produce non-deterministic hashes, since the data is consumed lazily and can't be hashed without buffering it all in memory. Fixing this would require changing how DataChain approaches hash calculation: currently the hash can be calculated without applying steps, but to support generators the hash calculation would have to move into the step application itself.
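
To make the limitation concrete, here is a minimal sketch of hashing a lazy input while it is being consumed. `HashingIterator` is a hypothetical name, not part of DataChain; the point is only that the digest exists after the data has been fully read, which is why it cannot be computed before steps are applied.

```python
import hashlib
import json


class HashingIterator:
    """Hypothetical wrapper: hashes records as they are pulled from a lazy source."""

    def __init__(self, source):
        self._source = iter(source)
        self._hasher = hashlib.sha256()
        self._exhausted = False

    def __iter__(self):
        return self

    def __next__(self):
        try:
            record = next(self._source)
        except StopIteration:
            # Only now is the hash complete.
            self._exhausted = True
            raise
        encoded = json.dumps(record, sort_keys=True, default=str).encode("utf-8")
        self._hasher.update(len(encoded).to_bytes(8, "big"))
        self._hasher.update(encoded)
        return record

    @property
    def digest(self):
        if not self._exhausted:
            raise RuntimeError("hash is only known after the source is fully consumed")
        return self._hasher.hexdigest()


if __name__ == "__main__":
    def rows():
        yield {"id": 1}
        yield {"id": 2}

    wrapped = HashingIterator(rows())
    consumed = list(wrapped)  # records flow through without being buffered by the wrapper
    print(len(consumed), wrapped.digest)
```

No buffering is needed here, but the hash is a by-product of consumption rather than something available up front.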

ilongin linked an issue Mar 11, 2026 that may be closed by this pull request

cloudflare-workers-and-pages Bot commented Mar 11, 2026

Deploying datachain with Cloudflare Pages

Latest commit: 98fcc8e
Status: ✅  Deploy successful!
Preview URL: https://ce1bdbc6.datachain-2g6.pages.dev
Branch Preview URL: https://ilongin-1636-deterministic-h.datachain-2g6.pages.dev

ilongin marked this pull request as draft March 11, 2026 15:11

codecov Bot commented Mar 11, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

Development

Successfully merging this pull request may close these issues.

Deterministic checkpoint hashes for read_values, read_pandas, read_hf
