To solve the following situation:
b97pla@f5f59a5
That re-downloads the same data into different directories and re-runs all the alignments.
As @b97pla discussed:
(...) what is presented to the user does not need to match what goes into the pipeline (i.e. the user could be presented with e.g. "Arabidopsis thaliana (tair9)" but your application could still write "araTha_tair9" to the csv, which would guarantee no disruption to the pipeline), or am I missing the point?
The problem is that the code pointed to by Valentine is not the only place where this is used, e.g. bcbio/pipeline/alignment.py also uses this value but without having any alias hash. There are of course many places where we can handle this: in the samplesheet generator, in the csv2yaml conversion, with aliases when fetching the reference file or with multiple entries in the reference mapping .loc-file. Implementing an alias hash is probably the most flexible and future-proof solution. I'll log this as an issue.
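For reference, an alias hash of the kind suggested above could look roughly like the sketch below. The names `GENOME_ALIASES` and `canonical_dbkey` are hypothetical and are not taken from the bcbio code; the only mapping shown is the tair9 case discussed here.

```python
# Hypothetical alias table: maps alternative genome names to the
# canonical dbkey the pipeline already knows about.
GENOME_ALIASES = {
    "tair9": "araTha_tair9",
}


def canonical_dbkey(name):
    """Return the canonical dbkey for a genome name.

    Names without a registered alias are returned unchanged.
    """
    return GENOME_ALIASES.get(name, name)
```

Every place that reads the genome value (samplesheet generator, csv2yaml conversion, reference lookup) would have to go through the same helper for this to be consistent.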
I've been thinking that defining the following structure in biodata.yml would solve the issue:
genomes:
  - dbkey: araTha_tair9
    name: Arabidopsis thaliana (TAIR9)
  - dbkey: tair9
    name: Arabidopsis thaliana (TAIR9)
    type: symlink_to(araTha_tair9)
Then, symlink accordingly on the filesystem, instead of re-downloading the same genomes.
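A minimal sketch of how the symlinking step might work, under a few assumptions: each genome lives in its own directory named after its dbkey under a common `genome_dir`, and the `type: symlink_to(...)` value is parsed with a simple regular expression. The function names and the directory layout are hypothetical, and the parsed entries are written inline so the sketch does not depend on a YAML library.

```python
import os
import re

# Parsed form of the proposed biodata.yml entries (assumed structure).
GENOMES = [
    {"dbkey": "araTha_tair9", "name": "Arabidopsis thaliana (TAIR9)"},
    {"dbkey": "tair9", "name": "Arabidopsis thaliana (TAIR9)",
     "type": "symlink_to(araTha_tair9)"},
]

# Matches the proposed "symlink_to(<dbkey>)" value of the type field.
SYMLINK_RE = re.compile(r"^symlink_to\((?P<target>[^)]+)\)$")


def symlink_target(genome):
    """Return the dbkey this entry points to, or None if it is a real genome."""
    match = SYMLINK_RE.match(genome.get("type", ""))
    return match.group("target") if match else None


def link_alias_genomes(genomes, genome_dir):
    """Create <genome_dir>/<alias> -> <target> symlinks for alias entries.

    Returns the list of (alias, target) pairs that were newly linked.
    """
    created = []
    for genome in genomes:
        target = symlink_target(genome)
        if target is None:
            continue  # regular genome, nothing to link
        link_path = os.path.join(genome_dir, genome["dbkey"])
        if os.path.lexists(link_path):
            continue  # already linked (or downloaded) earlier
        # Relative link, so the genome tree can be moved as a whole.
        os.symlink(target, link_path)
        created.append((genome["dbkey"], target))
    return created
```

Calling `link_alias_genomes(GENOMES, "/path/to/genomes")` once after the real genomes are in place would leave `tair9` pointing at the existing `araTha_tair9` directory, so nothing is downloaded twice.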