Add genome symlink to allow different genome names without mapping #5

@brainstorm

This is meant to solve the following situation:

b97pla@f5f59a5

That change re-downloads the same data into different directories and re-runs all the alignments.

As @b97pla discussed:

(...) what is presented to the user does not need to match what goes into the pipeline (i.e. the user could be presented with e.g. "Arabidopsis thaliana (tair9)" but your application could still write "araTha_tair9" to the csv, which would guarantee no disruption to the pipeline), or am I missing the point?

The problem is that the code pointed to by Valentine is not the only place where this is used, e.g. bcbio/pipeline/alignment.py also uses this value but without having any alias hash. There are of course many places where we can handle this: in the samplesheet generator, in the csv2yaml conversion, with aliases when fetching the reference file or with multiple entries in the reference mapping .loc-file. Implementing an alias hash is probably the most flexible and future-proof solution. I'll log this as an issue.
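For reference, the alias hash mentioned above could be as small as the sketch below. The names `GENOME_ALIASES` and `canonical_dbkey` are made up for illustration and are not existing bcbio code; the idea is only that every place that reads the genome value would pass it through one shared lookup first.

```python
# Hypothetical alias table: alternative genome names -> canonical dbkey.
# The entries are taken from the example in this issue.
GENOME_ALIASES = {
    "tair9": "araTha_tair9",
    "Arabidopsis thaliana (tair9)": "araTha_tair9",
}


def canonical_dbkey(name, aliases=GENOME_ALIASES):
    """Return the canonical dbkey for a name, or the name itself if it has no alias."""
    return aliases.get(name, name)
```

Names that are already canonical pass through unchanged, so callers would not need to know whether they hold an alias or a dbkey.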

I've been thinking that defining the following structure in biodata.yml would solve the issue:

genomes:
  - dbkey: araTha_tair9
    name: Arabidopsis thaliana (TAIR9)
  - dbkey: tair9
    name: Arabidopsis thaliana (TAIR9)
    type: symlink_to(araTha_tair9)

Then, symlink accordingly on the filesystem, instead of re-downloading the same genomes.
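A rough sketch of what the symlinking step might look like, assuming the `genomes` list has already been loaded from biodata.yml into plain Python dicts and that each genome lives in a directory named after its dbkey. The `symlink_to(...)` parsing, the function names and the directory layout are all assumptions for illustration, not existing code.

```python
import os
import re

# Matches the proposed "type: symlink_to(<dbkey>)" syntax (an assumption of this sketch).
SYMLINK_RE = re.compile(r"^symlink_to\((?P<target>[^)]+)\)$")

# Mirrors the structure proposed above, as a YAML loader would return it.
GENOMES = [
    {"dbkey": "araTha_tair9", "name": "Arabidopsis thaliana (TAIR9)"},
    {"dbkey": "tair9", "name": "Arabidopsis thaliana (TAIR9)",
     "type": "symlink_to(araTha_tair9)"},
]


def symlink_targets(genomes):
    """Map each alias dbkey to the dbkey it should point at."""
    links = {}
    for entry in genomes:
        match = SYMLINK_RE.match(entry.get("type", ""))
        if match:
            links[entry["dbkey"]] = match.group("target")
    return links


def link_genomes(genome_dir, genomes):
    """Create alias -> target symlinks inside genome_dir; return the aliases created."""
    created = []
    for alias, target in symlink_targets(genomes).items():
        target_path = os.path.join(genome_dir, target)
        link_path = os.path.join(genome_dir, alias)
        if not os.path.isdir(target_path):
            raise FileNotFoundError("missing genome directory: " + target_path)
        if not os.path.lexists(link_path):
            # Relative link, so the genome directory can be moved as a whole.
            os.symlink(target, link_path)
            created.append(alias)
    return created
```

Calling `link_genomes(genome_dir, GENOMES)` a second time is a no-op, since existing links are left alone.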
