Verify BacDive transform does not produce duplicate edges for multi-strain taxa #521

@turbomam

Context

PR #174 (opened 2024-06-04 by @realmarcin, now closed) identified that the BacDive transform could produce duplicate edges when multiple BacDive strain records mapped to the same NCBITaxon ID. The PR proposed an accumulator pattern to deduplicate, but was never merged and the file has been rewritten extensively since.

The original problem

When BacDive has multiple strain entries for the same species (e.g. several E. coli strains), every strain record emits its own edges from the shared NCBITaxon node. Any media/assay node that two or more of those strains have in common therefore appears more than once, producing duplicate rows in edges.tsv like:

NCBITaxon:562  biolink:grows_in  MEDIADIVE:1
NCBITaxon:562  biolink:grows_in  MEDIADIVE:1   (duplicate from second strain)

What needs to happen

  1. Check whether the current bacdive.py (2,809 lines as of Mar 2026) still produces duplicate edges
  2. If so, add deduplication, either via an accumulator pattern or by running drop_duplicates() on the edge output (drop_duplicates() is already called at the end of run(), but only for nodes)
  3. Verify edge deduplication covers all edge types: growth media, enzyme activities, metabolite utilization, isolation sources
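
For reference, the accumulator option in step 2 could look roughly like the sketch below. It is not the code from PR #174 or from the current bacdive.py; the `Edge` tuple and `dedupe_edges` helper are hypothetical names, and it assumes an edge is fully identified by its subject, predicate, and object.

```python
from typing import Iterable, Iterator, NamedTuple


class Edge(NamedTuple):
    # Hypothetical minimal edge record; the real transform writes more columns.
    subject: str
    predicate: str
    object: str


def dedupe_edges(edges: Iterable[Edge]) -> Iterator[Edge]:
    """Yield each (subject, predicate, object) triple once, in first-seen order."""
    seen: set[Edge] = set()  # accumulator of edges already emitted
    for edge in edges:
        if edge not in seen:
            seen.add(edge)
            yield edge


# Two illustrative strain records that resolve to the same taxon and medium.
strain_edges = [
    Edge("NCBITaxon:562", "biolink:grows_in", "MEDIADIVE:1"),  # first strain
    Edge("NCBITaxon:562", "biolink:grows_in", "MEDIADIVE:1"),  # second strain
    Edge("NCBITaxon:562", "biolink:grows_in", "MEDIADIVE:2"),
]
unique_edges = list(dedupe_edges(strain_edges))
```

If edges carry extra columns that legitimately differ between strains (for example provenance), the key would need to be chosen deliberately rather than taken from the whole row.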

How to test

poetry run kg transform -s bacdive
# Check for duplicate edges:
sort data/transformed/bacdive/edges.tsv | uniq -d | head
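
Note that `sort | uniq -d` only reports rows that are identical in every column. If edges.tsv carries an id or provenance column that differs between strains, duplicates on the triple itself would be missed. A possible complementary check, assuming the file has a header row with KGX-style `subject`, `predicate`, and `object` columns (an assumption to confirm against the actual output; `find_duplicate_edges` is a hypothetical helper):

```python
import csv
from collections import Counter

# Assumed header names; adjust if edges.tsv uses different ones.
KEY_COLUMNS = ("subject", "predicate", "object")


def find_duplicate_edges(path, key_columns=KEY_COLUMNS):
    """Return {key_tuple: count} for every key that occurs more than once."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        counts = Counter(tuple(row[col] for col in key_columns) for row in reader)
    return {key: n for key, n in counts.items() if n > 1}


# Example call (path taken from the command above):
# find_duplicate_edges("data/transformed/bacdive/edges.tsv")
```

An empty result means no triple is repeated; grouping the returned keys by predicate gives a quick view of which edge types are affected.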

Original PR

#174 by @realmarcin, closed 2026-03-18; see the closing comment for the full analysis.
