fix: add url_hash column for cross-scraper lead deduplication #20

Open: DeryFerd wants to merge 1 commit into vasu-devs:main from DeryFerd:fix/lead-dedup-by-url

DeryFerd commented May 6, 2026

What's the problem?

Right now, lead deduplication only works within a single scraper. Each scraper computes a job_id by hashing the URL in its own way — scout.py uses hashlib.md5(url)[:16], free_scout.py uses lead_id(platform, url), x_scout.py uses a different hash. They all check url_exists(jid) before saving, but since the jid is derived differently per scraper, the same job posting from two different sources gets saved as two separate leads.

This means the leads table accumulates duplicates — same job URL, different job_ids, no way to detect them.
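
To make the failure mode concrete, here is a minimal sketch of two scrapers deriving different IDs from the same URL. The bodies of both functions are assumptions: `scout_job_id` mirrors the MD5-prefix scheme described above, and `free_scout_job_id` is a hypothetical stand-in for `lead_id(platform, url)`, whose real implementation is not shown in this PR.

```python
import hashlib

URL = "https://jobs.example.com/posting/123"


def scout_job_id(url: str) -> str:
    # Assumed shape of the scout.py scheme: first 16 hex chars of the MD5 digest.
    return hashlib.md5(url.encode("utf-8")).hexdigest()[:16]


def free_scout_job_id(platform: str, url: str) -> str:
    # Hypothetical stand-in for lead_id(platform, url): the platform name
    # is mixed into the hashed string, so the ID depends on the scraper.
    return hashlib.sha1(f"{platform}:{url}".encode("utf-8")).hexdigest()[:16]


id_from_scout = scout_job_id(URL)
id_from_free_scout = free_scout_job_id("remoteok", URL)

# Same posting, two different job_ids, so url_exists(jid) misses the duplicate.
print(id_from_scout == id_from_free_scout)
```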

What this PR does

Adds a url_hash column to the leads table that stores a normalized hash of the URL, independent of which scraper found it. This makes cross-scraper dedup possible: any scraper can call url_exists_by_url(url) to check if a lead with that URL already exists, regardless of who saved it.

URL normalization in _url_hash():

  • Strips trailing slashes (/posting/ → /posting)
  • Drops URL fragments (#section removed)
  • Lowercases scheme and hostname (HTTPS://Jobs.Example.COM → https://jobs.example.com)
  • Sorts query parameters (?b=2&a=1 → ?a=1&b=2)
  • Returns first 32 chars of SHA-256 of the normalized URL
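
The normalization steps listed above could look roughly like the following. This is a sketch built on the standard library, not the code from db/client.py: the name `url_hash_sketch` is hypothetical, and returning an empty string for an empty URL is an assumption chosen to match the column's `DEFAULT ''`.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def url_hash_sketch(url: str) -> str:
    """Return a scraper-independent hash of a normalized URL."""
    if not url:
        # Assumption: empty URLs map to an empty hash.
        return ""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    # Note: this lowercases the whole network location, including any port or userinfo.
    netloc = parts.netloc.lower()
    path = parts.path.rstrip("/")  # drop trailing slashes
    # Sort query parameters so that ordering does not affect the hash.
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    # Passing an empty fragment drops "#section".
    normalized = urlunsplit((scheme, netloc, path, query, ""))
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:32]


print(url_hash_sketch("HTTPS://Jobs.Example.COM/posting/?b=2&a=1#apply")
      == url_hash_sketch("https://jobs.example.com/posting?a=1&b=2"))
```

One design consequence worth noting: re-encoding the query with `urlencode` also canonicalizes percent-encoding of parameter values, which goes slightly beyond the four steps in the list.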

Files changed:

  • db/client.py — New url_hash TEXT DEFAULT '' column (auto-migration). New _url_hash(url) function with the normalization logic above. New url_exists_by_url(url) that queries by url_hash. save_lead() now stores _url_hash(u) on INSERT. New backfill_url_hashes() helper to populate the column for existing leads.
  • tests/test_regressions.py — New TestUrlHashDedup class with 6 tests: trailing slash normalization, fragment stripping, case-insensitive host, query param order, empty URL handling, and different paths producing different hashes.

How to use

Scrapers can now add a url_exists_by_url(url) check before saving, which will catch duplicates even when the job_id differs. The existing url_exists(jid) check still works as before for same-scraper dedup.
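
As an illustration of that check, here is a minimal sketch against an in-memory SQLite database. Several things are assumptions: that the project uses SQLite, the simplified three-column `leads` schema, the explicit `conn` parameter (the PR's `url_exists_by_url(url)` takes only the URL), and the hypothetical helper `save_if_new`.

```python
import hashlib
import sqlite3
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def url_hash_sketch(url: str) -> str:
    # Compact version of the normalization described in this PR.
    if not url:
        return ""
    p = urlsplit(url.strip())
    query = urlencode(sorted(parse_qsl(p.query, keep_blank_values=True)))
    normalized = urlunsplit((p.scheme.lower(), p.netloc.lower(),
                             p.path.rstrip("/"), query, ""))
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:32]


def url_exists_by_url(conn: sqlite3.Connection, url: str) -> bool:
    h = url_hash_sketch(url)
    if not h:
        return False  # nothing to compare against for an empty URL
    row = conn.execute(
        "SELECT 1 FROM leads WHERE url_hash = ? LIMIT 1", (h,)
    ).fetchone()
    return row is not None


def save_if_new(conn: sqlite3.Connection, job_id: str, url: str) -> bool:
    # Hypothetical scraper-side wrapper: skip the insert on a URL match.
    if url_exists_by_url(conn, url):
        return False
    conn.execute(
        "INSERT INTO leads (job_id, url, url_hash) VALUES (?, ?, ?)",
        (job_id, url, url_hash_sketch(url)),
    )
    return True


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE leads (job_id TEXT PRIMARY KEY, url TEXT, url_hash TEXT DEFAULT '')"
)

# Two scrapers report the same posting under different job_ids.
saved_first = save_if_new(conn, "scout-abc", "https://jobs.example.com/posting/1/")
saved_second = save_if_new(conn, "x-scout-xyz", "HTTPS://Jobs.Example.COM/posting/1#apply")
lead_count = conn.execute("SELECT COUNT(*) FROM leads").fetchone()[0]
print(saved_first, saved_second, lead_count)
```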

For existing databases, run backfill_url_hashes() once to populate the new column.
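
A backfill along these lines might look like the sketch below. It assumes a SQLite database and a simplified `leads(job_id, url)` table; `ensure_url_hash_column` and `backfill_sketch` are hypothetical names, not the functions in db/client.py.

```python
import hashlib
import sqlite3
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def url_hash_sketch(url: str) -> str:
    # Compact version of the normalization described in this PR.
    if not url:
        return ""
    p = urlsplit(url.strip())
    query = urlencode(sorted(parse_qsl(p.query, keep_blank_values=True)))
    normalized = urlunsplit((p.scheme.lower(), p.netloc.lower(),
                             p.path.rstrip("/"), query, ""))
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:32]


def ensure_url_hash_column(conn: sqlite3.Connection) -> None:
    # Add the column only if it is missing (a simple auto-migration).
    columns = [row[1] for row in conn.execute("PRAGMA table_info(leads)")]
    if "url_hash" not in columns:
        conn.execute("ALTER TABLE leads ADD COLUMN url_hash TEXT DEFAULT ''")


def backfill_sketch(conn: sqlite3.Connection) -> int:
    # Populate url_hash for rows that do not have one yet; return the count.
    rows = conn.execute(
        "SELECT job_id, url FROM leads WHERE url_hash = '' OR url_hash IS NULL"
    ).fetchall()
    updated = 0
    for job_id, url in rows:
        h = url_hash_sketch(url or "")
        if h:
            conn.execute(
                "UPDATE leads SET url_hash = ? WHERE job_id = ?", (h, job_id)
            )
            updated += 1
    conn.commit()
    return updated


# Simulate a pre-migration database that already holds a duplicate.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE leads (job_id TEXT PRIMARY KEY, url TEXT)")
conn.executemany(
    "INSERT INTO leads (job_id, url) VALUES (?, ?)",
    [("a1", "https://jobs.example.com/posting/1/"),
     ("b2", "https://jobs.example.com/posting/1")],
)
ensure_url_hash_column(conn)
first_run = backfill_sketch(conn)
second_run = backfill_sketch(conn)  # idempotent: nothing left to fill
print(first_run, second_run)
```

Note that a backfill only populates the column; existing duplicates such as the two rows above stay in the table and simply end up sharing a url_hash.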

DeryFerd force-pushed the fix/lead-dedup-by-url branch from 6c7c90d to cc62af2 on May 6, 2026 20:22
vasu-devs self-requested a review as a code owner on May 14, 2026 20:13