Add genome symlink to allow different genome names without mapping

This is meant to solve the following situation:

https://github.com/b97pla/cloudbiolinux/commit/f5f59a5e9296d1883ae83f83ad2d9f6c1f9dfa4a

That change causes the same data to be re-downloaded into different directories and all the alignments to be re-run.

As @b97pla discussed:

(...) what is presented to the user does not need to match what goes into the pipeline (i.e. the user could be presented with e.g. "Arabidopsis thaliana (tair9)" but your application could still write "araTha_tair9" to the csv, which would guarantee no disruption to the pipeline), or am I missing the point? 

The problem is that the code pointed to by Valentine is not the only place where this is used, e.g. bcbio/pipeline/alignment.py also uses this value but without having any alias hash. There are of course many places where we can handle this: in the samplesheet generator, in the csv2yaml conversion, with aliases when fetching the reference file or with multiple entries in the reference mapping .loc-file. Implementing an alias hash is probably the most flexible and future-proof solution. I'll log this as an issue.

I've been thinking that defining the following structure in biodata.yml would solve the issue:

<pre>
genomes:
  - dbkey: araTha_tair9
    name: Arabidopsis thaliana (TAIR9)
  - dbkey: tair9
    name: Arabidopsis thaliana (TAIR9)
    type: symlink_to(araTha_tair9)
</pre>


Then, symlink accordingly on the filesystem, instead of re-downloading the same genomes.
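
As a rough sketch of what that symlinking step could look like (not existing cloudbiolinux code): the snippet below assumes the `genomes` entries have already been parsed into dicts, that each genome lives in a directory named after its `dbkey` under a common base directory, and that `type: symlink_to(...)` is written exactly as proposed above. The function names `symlink_target` and `link_genome_aliases` are hypothetical.

```python
import os
import re
import tempfile

# Entries mirroring the proposed biodata.yml structure, already parsed into dicts.
GENOMES = [
    {"dbkey": "araTha_tair9", "name": "Arabidopsis thaliana (TAIR9)"},
    {"dbkey": "tair9", "name": "Arabidopsis thaliana (TAIR9)",
     "type": "symlink_to(araTha_tair9)"},
]

# Matches the proposed "symlink_to(<dbkey>)" syntax (an assumption of this sketch).
_SYMLINK_RE = re.compile(r"^symlink_to\((?P<target>[^)]+)\)$")


def symlink_target(genome):
    """Return the dbkey an entry aliases, or None for a regular genome."""
    match = _SYMLINK_RE.match(genome.get("type", ""))
    return match.group("target") if match else None


def link_genome_aliases(genomes, genome_dir):
    """Create <genome_dir>/<alias> -> <target> symlinks; return the aliases handled."""
    linked = []
    for genome in genomes:
        target = symlink_target(genome)
        if target is None:
            continue  # regular genome, downloaded as usual
        if not os.path.isdir(os.path.join(genome_dir, target)):
            raise ValueError("symlink target %r is not installed" % target)
        alias_path = os.path.join(genome_dir, genome["dbkey"])
        if not os.path.lexists(alias_path):
            # Relative link, so the genome directory can be moved as a whole.
            os.symlink(target, alias_path)
        linked.append(genome["dbkey"])
    return linked


if __name__ == "__main__":
    # Demo in a throwaway directory; the layout <base>/<dbkey> is assumed.
    base = tempfile.mkdtemp()
    os.mkdir(os.path.join(base, "araTha_tair9"))
    print(link_genome_aliases(GENOMES, base))
```

With this approach the alias directory `tair9` resolves to the same files as `araTha_tair9`, so nothing is downloaded or indexed twice, and code that looks genomes up by directory name needs no alias table.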


Add genome symlink to allow different genome names without mapping #5
