fix: add url_hash column for cross-scraper lead deduplication by DeryFerd · Pull Request #20 · vasu-devs/JustHireMe

DeryFerd · 2026-05-06T20:15:14Z

What's the problem?

Right now, lead deduplication only works within a single scraper. Each scraper computes a job_id by hashing the URL in its own way: scout.py uses hashlib.md5(url)[:16], free_scout.py uses lead_id(platform, url), and x_scout.py uses a different hash. They all check url_exists(jid) before saving, but because the jid is derived differently per scraper, the same job posting found by two different scrapers gets saved as two separate leads.

As a result, the leads table accumulates duplicates: rows with the same job URL but different job_ids, and no way to detect them.
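
To make the failure mode concrete, here is a minimal sketch. Both ID functions are hypothetical stand-ins for the per-scraper schemes described above (the real lead_id(platform, url) and the x_scout.py hash may well differ); the only point is that two different derivations do not agree on the same URL, so an ID-based existence check misses the duplicate.

```python
import hashlib


def scout_style_id(url: str) -> str:
    # Hypothetical stand-in for the scout.py scheme: first 16 hex chars of MD5(url).
    return hashlib.md5(url.encode("utf-8")).hexdigest()[:16]


def platform_style_id(platform: str, url: str) -> str:
    # Hypothetical stand-in for lead_id(platform, url): the platform is mixed
    # into the hashed string, so the result depends on which scraper ran.
    return hashlib.sha256(f"{platform}:{url}".encode("utf-8")).hexdigest()[:16]


url = "https://jobs.example.com/posting/123"

id_a = scout_style_id(url)
id_b = platform_style_id("hn", url)

# Simulates url_exists(jid): a lookup keyed on job_id only.
saved_job_ids = {id_a}
duplicate_detected = id_b in saved_job_ids  # the second scraper's check misses
```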

What this PR does

Adds a url_hash column to the leads table that stores a normalized hash of the URL, independent of which scraper found it. This makes cross-scraper dedup possible: any scraper can call url_exists_by_url(url) to check if a lead with that URL already exists, regardless of who saved it.

URL normalization in _url_hash():

Strips trailing slashes (/posting/ → /posting)
Drops URL fragments (#section removed)
Lowercases scheme and hostname (HTTPS://Jobs.Example.COM → https://jobs.example.com)
Sorts query parameters (?b=2&a=1 → ?a=1&b=2)
Returns first 32 chars of SHA-256 of the normalized URL
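
The steps above can be sketched roughly as follows, using only the standard library. This is an illustration, not the PR's actual code: the name url_hash_sketch, the empty-string result for an empty URL, and the choice of hex digits for the 32 characters are all assumptions.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def url_hash_sketch(url: str) -> str:
    """Return a scraper-independent hash of a normalized URL (assumed details)."""
    if not url:
        # Assumption: an empty URL maps to the column default ''.
        return ""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    # Lowercases the whole authority (host, and any port or userinfo with it).
    netloc = parts.netloc.lower()
    # Strip trailing slashes: /posting/ becomes /posting.
    path = parts.path.rstrip("/")
    # Sort query parameters so ?b=2&a=1 and ?a=1&b=2 normalize identically.
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    # The empty last component drops the fragment.
    normalized = urlunsplit((scheme, netloc, path, query, ""))
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:32]
```

Only the scheme and host are case-folded here; the path is left as-is because paths can be case-sensitive on the server side.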

Files changed:

db/client.py — New url_hash TEXT DEFAULT '' column (auto-migration). New _url_hash(url) function with the normalization logic above. New url_exists_by_url(url) that queries by url_hash. save_lead() now stores _url_hash(u) on INSERT. New backfill_url_hashes() helper to populate the column for existing leads.
tests/test_regressions.py — New TestUrlHashDedup class with 6 tests: trailing slash normalization, fragment stripping, case-insensitive host, query param order, empty URL handling, and different paths producing different hashes.

How to use

Scrapers can now add a url_exists_by_url(url) check before saving, which will catch duplicates even when the job_id differs. The existing url_exists(jid) check still works as before for same-scraper dedup.
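
A scraper-side check might look like the sketch below. It assumes a SQLite leads table with job_id, url, and url_hash columns and passes the connection explicitly; the functions url_hash_sketch, url_exists_by_url_sketch, and save_if_new are hypothetical illustrations, not the signatures in db/client.py.

```python
import hashlib
import sqlite3
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def url_hash_sketch(url: str) -> str:
    # Assumed normalization, following the rules listed in this PR.
    if not url:
        return ""
    p = urlsplit(url.strip())
    query = urlencode(sorted(parse_qsl(p.query, keep_blank_values=True)))
    normalized = urlunsplit(
        (p.scheme.lower(), p.netloc.lower(), p.path.rstrip("/"), query, "")
    )
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:32]


def url_exists_by_url_sketch(conn: sqlite3.Connection, url: str) -> bool:
    h = url_hash_sketch(url)
    if not h:
        return False  # never match rows that still hold the '' default
    row = conn.execute(
        "SELECT 1 FROM leads WHERE url_hash = ? LIMIT 1", (h,)
    ).fetchone()
    return row is not None


def save_if_new(conn: sqlite3.Connection, job_id: str, url: str) -> bool:
    # Cross-scraper check first; a per-scraper job_id check could follow it.
    if url_exists_by_url_sketch(conn, url):
        return False
    conn.execute(
        "INSERT INTO leads (job_id, url, url_hash) VALUES (?, ?, ?)",
        (job_id, url, url_hash_sketch(url)),
    )
    return True


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE leads (job_id TEXT PRIMARY KEY, url TEXT, url_hash TEXT DEFAULT '')"
)

# Two scrapers find the same posting under different job_ids and URL spellings.
first = save_if_new(conn, "scout-abc", "https://jobs.example.com/posting/42")
second = save_if_new(conn, "x-def", "HTTPS://Jobs.Example.COM/posting/42/#apply")
lead_count = conn.execute("SELECT COUNT(*) FROM leads").fetchone()[0]
```

In this sketch the second call is rejected even though its job_id is new, because both URLs normalize to the same hash.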

For existing databases, run backfill_url_hashes() once to populate the new column.
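
As a rough picture of what a one-off backfill involves, the sketch below adds the column to a pre-existing table and fills it for rows that still hold the default. The SQLite schema, the explicit connection argument, and the names url_hash_sketch and backfill_url_hashes_sketch are assumptions for illustration; the real helper in db/client.py may be structured differently.

```python
import hashlib
import sqlite3
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def url_hash_sketch(url: str) -> str:
    # Assumed normalization, following the rules listed in this PR.
    if not url:
        return ""
    p = urlsplit(url.strip())
    query = urlencode(sorted(parse_qsl(p.query, keep_blank_values=True)))
    normalized = urlunsplit(
        (p.scheme.lower(), p.netloc.lower(), p.path.rstrip("/"), query, "")
    )
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:32]


def backfill_url_hashes_sketch(conn: sqlite3.Connection) -> int:
    """Fill url_hash for rows that lack one; return how many rows were filled."""
    rows = conn.execute(
        "SELECT job_id, url FROM leads WHERE url_hash IS NULL OR url_hash = ''"
    ).fetchall()
    filled = 0
    for job_id, url in rows:
        h = url_hash_sketch(url or "")
        if not h:
            continue  # rows without a URL keep the '' default
        conn.execute("UPDATE leads SET url_hash = ? WHERE job_id = ?", (h, job_id))
        filled += 1
    conn.commit()
    return filled


# An "old" database created before the url_hash column existed.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE leads (job_id TEXT PRIMARY KEY, url TEXT)")
conn.executemany(
    "INSERT INTO leads VALUES (?, ?)",
    [
        ("a1", "https://jobs.example.com/posting/1/"),
        ("b2", "https://jobs.example.com/posting/2?b=2&a=1"),
        ("c3", ""),
    ],
)

# Migration step, then the one-off backfill.
conn.execute("ALTER TABLE leads ADD COLUMN url_hash TEXT DEFAULT ''")
filled_first_run = backfill_url_hashes_sketch(conn)
filled_second_run = backfill_url_hashes_sketch(conn)
```

Because it only touches rows with an empty hash, the sketch is safe to run more than once. It does not merge rows that turn out to share a hash after the backfill.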

fix: add url_hash column for cross-scraper lead deduplication

cc62af2

DeryFerd force-pushed the fix/lead-dedup-by-url branch from 6c7c90d to cc62af2 on May 6, 2026 20:22

vasu-devs self-requested a review as a code owner on May 14, 2026 20:13

DeryFerd wants to merge 1 commit into vasu-devs:main from DeryFerd:fix/lead-dedup-by-url