Add species-specific parameters for splice-aware index generation (e.g., average exon length, sjdbOverhang) #129

@anilthanki

Description of feature

Currently, nf-core/references builds genome indices for splice-aware aligners (STAR, HISAT2, etc.) using mostly fixed or default parameters.

However, some of these parameters should depend on the genome being indexed, for example its size and exon/intron structure, which differ markedly between large and compact genomes.

Motivation
When building references across multiple species (e.g., mammals vs. insects vs. plants), the same hardcoded STAR parameters can lead to suboptimal or even invalid splice junction indices.

Allowing per-species or per-asset parameterization (e.g. via YAML keys in assets.yaml or a separate JSON schema) would make the pipeline far more general and biologically robust.

Proposed implementation
Extend the asset schema to include a params: section, e.g.:

genomes:
  - id: Homo_sapiens.GRCh38
    fasta: path/to/genome.fa
    gtf: path/to/annotation.gtf
    params:
      star:
        sjdbOverhang: 99
        genomeSAindexNbases: 14
      notes:
        avg_exon_length: 170
  - id: Drosophila_melanogaster.BDGP6
    fasta: path/to/genome.fa
    gtf: path/to/annotation.gtf
    params:
      star:
        sjdbOverhang: 74
        genomeSAindexNbases: 11
      notes:
        avg_exon_length: 280
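
As a rough illustration of where such per-genome values could come from, the sketch below derives the two STAR settings from simple inputs. It assumes the scaling that STAR's manual suggests for small genomes, min(14, log2(GenomeLength)/2 - 1), and the manual's guidance that sjdbOverhang is ideally the read length minus 1. The function names, the choice to round down, and the genome lengths used are hypothetical and not part of nf-core/references.

```python
import math


def suggest_genome_sa_index_nbases(genome_length: int) -> int:
    """Return min(14, log2(genome_length) / 2 - 1), rounded down.

    Rounding down (and clamping to at least 1) is a choice made for this sketch.
    """
    if genome_length < 1:
        raise ValueError("genome_length must be a positive number of bases")
    scaled = math.log2(genome_length) / 2 - 1
    return max(1, min(14, math.floor(scaled)))


def suggest_sjdb_overhang(read_length: int) -> int:
    """Return read_length - 1, the value STAR's manual describes as ideal."""
    if read_length < 2:
        raise ValueError("read_length must be at least 2")
    return read_length - 1


# Hypothetical genome lengths (powers of two keep the arithmetic easy to check).
for length in (2**16, 2**20, 2**32):
    print(length, suggest_genome_sa_index_nbases(length))

# 100 bp reads give the sjdbOverhang used in the Homo_sapiens entry above.
print(suggest_sjdb_overhang(100))
```

Values computed this way could serve as defaults, with the params: section acting as an explicit per-genome override.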

Expose these through the pipeline as ext.args or --star_* overrides in the relevant modules.

The existing --kallisto_make_unique flag shows how such parameters can be exposed consistently.
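
The ext.args route could be sketched roughly as follows: take one genome entry shaped like the proposed params: section and render its star keys as a command-line argument string. The dictionary mirrors the proposed schema, which does not exist yet, and star_args is a hypothetical helper. The only real names assumed here are STAR's --sjdbOverhang and --genomeSAindexNbases options.

```python
import shlex

# Mirrors the proposed (not yet existing) params: section for one genome entry.
genome = {
    "id": "Homo_sapiens.GRCh38",
    "params": {
        "star": {"sjdbOverhang": 99, "genomeSAindexNbases": 14},
        "notes": {"avg_exon_length": 170},
    },
}


def star_args(entry: dict) -> str:
    """Render params.star as '--key value' pairs; return '' if absent."""
    star_params = entry.get("params", {}).get("star", {})
    parts = []
    for key, value in star_params.items():
        # Quote values so the string stays safe to splice into a shell command.
        parts.extend([f"--{key}", shlex.quote(str(value))])
    return " ".join(parts)


print(star_args(genome))
```

Entries without a params: section yield an empty string, so existing assets would keep the current default behaviour. The notes keys are ignored because they are descriptive rather than STAR options.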

Labels: enhancement
