Context
PR #174 (opened 2024-06-04 by @realmarcin, now closed) identified that the BacDive transform could produce duplicate edges when multiple BacDive strain records mapped to the same NCBITaxon ID. The PR proposed an accumulator pattern to deduplicate, but was never merged and the file has been rewritten extensively since.
The original problem
When BacDive has multiple strain entries for the same species (e.g. multiple E. coli strains), each strain record would generate its own edges from the shared NCBITaxon ID to its media/assay nodes. Strains that share a medium or assay therefore emit identical edges, which results in duplicate rows in edges.tsv like:
NCBITaxon:562 biolink:grows_in MEDIADIVE:1
NCBITaxon:562 biolink:grows_in MEDIADIVE:1 (duplicate from second strain)
What needs to happen
- Check whether the current bacdive.py (2,809 lines as of Mar 2026) still produces duplicate edges
- If so, add deduplication, either via an accumulator pattern or by running drop_duplicates() on the output (which is already called at the end of run() for nodes)
- Verify edge deduplication covers all edge types: growth media, enzyme activities, metabolite utilization, isolation sources
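For reference, the accumulator option could look roughly like the sketch below. It is not the code from PR #174 or from the current bacdive.py. The `dedupe_edges` helper and the three-field edge tuple are hypothetical, and `MEDIADIVE:2` is made up for illustration. If the real edge rows carry more columns, the key would need to include whichever of them define edge identity.

```python
from typing import Iterable, Iterator, Tuple

# Hypothetical edge shape: (subject, predicate, object).
Edge = Tuple[str, str, str]


def dedupe_edges(edges: Iterable[Edge]) -> Iterator[Edge]:
    """Yield each distinct edge once, keeping first-seen order."""
    seen = set()  # accumulator of edges already emitted
    for edge in edges:
        if edge not in seen:
            seen.add(edge)
            yield edge


edges = [
    ("NCBITaxon:562", "biolink:grows_in", "MEDIADIVE:1"),
    ("NCBITaxon:562", "biolink:grows_in", "MEDIADIVE:1"),  # second strain, same taxon
    ("NCBITaxon:562", "biolink:grows_in", "MEDIADIVE:2"),  # illustrative ID only
]
unique_edges = list(dedupe_edges(edges))
```

Because every edge type would pass through the same set, this approach covers growth media, enzyme activities, metabolite utilization and isolation sources without per-type handling, provided they all share one write path.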
How to test
poetry run kg transform -s bacdive
# Check for duplicate edges:
sort data/transformed/bacdive/edges.tsv | uniq -d | head
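The `uniq -d` check compares whole lines, so it would miss duplicates that differ only in a non-key column such as an edge ID. A per-predicate count on the key columns may be a useful complement. The sketch below assumes edges.tsv has a header row with `subject`, `predicate` and `object` columns (not confirmed here); `duplicate_edges_by_predicate` is a hypothetical helper.

```python
import csv
import io
from collections import Counter


def duplicate_edges_by_predicate(tsv_file, key_cols=("subject", "predicate", "object")):
    """Count surplus copies of each edge, grouped by predicate.

    Assumes a tab-separated file with a header row containing key_cols.
    """
    reader = csv.DictReader(tsv_file, delimiter="\t")
    counts = Counter(tuple(row[col] for col in key_cols) for row in reader)
    pred_idx = key_cols.index("predicate")
    dupes = Counter()
    for key, n in counts.items():
        if n > 1:
            dupes[key[pred_idx]] += n - 1  # copies beyond the first
    return dupes


# Small in-memory sample standing in for edges.tsv.
sample = io.StringIO(
    "subject\tpredicate\tobject\n"
    "NCBITaxon:562\tbiolink:grows_in\tMEDIADIVE:1\n"
    "NCBITaxon:562\tbiolink:grows_in\tMEDIADIVE:1\n"
)
dupes = duplicate_edges_by_predicate(sample)
```

To run it against the real output, pass `open("data/transformed/bacdive/edges.tsv", newline="")` in place of `sample`. An empty result would suggest every edge type is already deduplicated.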
Original PR
#174 by @realmarcin — closed 2026-03-18, see closing comment for full analysis.