zavolanlab/SeqMetrics

SeqMetrics

SeqMetrics computes per-sequence metrics for coding sequences, combining:

  • nucleotide composition
  • tissue-specific codon usage metrics such as CAI and fraction of optimal codons
  • species-specific tRNA adaptation index from tRNA gene copy tables
  • protein-level properties after translation
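To make the codon usage metrics concrete, here is a minimal sketch of the textbook codon adaptation index (the geometric mean of per-codon relative adaptiveness weights) and a simple fraction of optimal codons. It is not taken from the SeqMetrics source: the function names, the toy count table, and the rule that a codon is "optimal" when its weight equals 1 are all assumptions made for illustration.

```python
import math

# Hypothetical toy table: amino acid -> {codon: observed count}.
# A real tissue table would cover every degenerate codon family.
TOY_USAGE = {
    "Leu": {"CTG": 40, "CTC": 20, "TTA": 20},
    "Lys": {"AAG": 80, "AAA": 20},
}


def relative_adaptiveness(usage_by_aa):
    """Weight each codon by its count relative to the most used synonym."""
    weights = {}
    for codon_counts in usage_by_aa.values():
        best = max(codon_counts.values())
        for codon, count in codon_counts.items():
            weights[codon] = count / best
    return weights


def split_codons(cds):
    """Split a CDS into complete codons, dropping any trailing partial codon."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return [cds[i:i + 3] for i in range(0, usable, 3)]


def cai(cds, weights):
    """Geometric mean of the weights of all scored codons in the CDS.

    Codons missing from the table (e.g. ATG, TGG, stops) and codons with
    zero weight are skipped, which also avoids log(0).
    """
    logs = [math.log(weights[c]) for c in split_codons(cds) if weights.get(c, 0) > 0]
    if not logs:
        return float("nan")
    return math.exp(sum(logs) / len(logs))


def frac_optimal(cds, weights):
    """Fraction of scored codons that are the most used synonym (weight 1)."""
    scored = [c for c in split_codons(cds) if c in weights]
    if not scored:
        return float("nan")
    return sum(1 for c in scored if weights[c] == 1.0) / len(scored)


weights = relative_adaptiveness(TOY_USAGE)
print(cai("CTGAAA", weights), frac_optimal("CTGAAA", weights))
```

Working in log space keeps the geometric mean numerically stable for long sequences, where a direct product of weights would underflow.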

The package exposes a Python API and a seqmetrics command-line interface.

Installation

Install from a local checkout:

pip install .

For development:

pip install -e ".[test]"
pytest
python -m build

CLI usage

Basic analysis from a CDS FASTA file:

seqmetrics data/coding_regions.fasta > results.tsv

With tissue-specific codon usage and species-specific tRNA counts:

seqmetrics data/coding_regions.fasta \
  --tissue brain \
  --species H_sapiens \
  --tissue-usage-file helpers/tissue_codon_usage.tsv \
  --trna-counts helpers/hg19-tRNAs-confidence-set.out \
  --output results.tsv

Python usage

from seqmetrics import CodonUsageTable, SequenceAnalyzer, TRNAWeightTable
from seqmetrics.codon_definitions import SYN_CODONS_BY_AA

tissue_usage = CodonUsageTable.load_from_table(
    "helpers/tissue_codon_usage.tsv",
    syn_codons_by_aa=SYN_CODONS_BY_AA,
)

trna_table = TRNAWeightTable.from_trna_gene_file(
    species="H_sapiens",
    path="helpers/hg19-tRNAs-confidence-set.out",
)

analyzer = SequenceAnalyzer(
    tissue_usage=tissue_usage,
    trna_table=trna_table,
    default_tissue="brain",
)

rows = analyzer.analyze_fasta("data/coding_regions.fasta")

Each resulting row is a plain dictionary, so the list can be passed directly to pandas.DataFrame for downstream analysis.
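Because the rows are plain dictionaries, they can also be written out with the standard library alone. The sketch below uses stand-in rows rather than real analyzer output; the `id` key and the `rows_to_tsv` helper are hypothetical, and the actual rows carry the full column set listed under Output columns.

```python
import csv
import io

# Stand-in rows; real rows would come from analyzer.analyze_fasta(...).
# The keys are an illustrative subset, and "id" is an assumed column name.
rows = [
    {"id": "seq1", "length_aa": 120, "tissue_cai": 0.81},
    {"id": "seq2", "length_aa": 95, "tissue_cai": 0.74},
]


def rows_to_tsv(rows, handle):
    """Write a list of dictionaries as tab-separated text with a header row."""
    if not rows:
        return
    writer = csv.DictWriter(
        handle,
        fieldnames=list(rows[0].keys()),  # column order follows the first row
        delimiter="\t",
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)


buffer = io.StringIO()
rows_to_tsv(rows, buffer)
print(buffer.getvalue(), end="")
```

To write to disk instead, pass a file opened with `open("results.tsv", "w", newline="")` in place of the in-memory buffer.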

Output columns

Each analyzed sequence returns:

  • identifiers and description
  • nucleotide composition fractions (A, C, G, T)
  • tissue-specific metrics (tissue, tissue_cai, frac_opt_codons)
  • species-specific tAI (species, species_tai)
  • protein properties (length_aa, mw, pI, gravy, aromaticity)
  • secondary-structure propensity fractions (helix_frac, sheet_frac, coil_frac)
  • amino-acid composition columns (aa_A through aa_Y) as percentages

Reference data

The bundled helper files are based on these sources:

About

Package to compute metrics over protein-coding sequences
