GeneLM/run-as-script/README.md at main · Bioinformatics-UM6P/GeneLM

GeneLM: Gene Language Model for Translation Initiation Site Prediction in Bacteria

GeneLM-Script Runner (split - parallel - merge)

Quick start

First, set up the GeneLM environment so that you can annotate genes with GeneLM.

Download: you only need this folder (GeneLM/run-as-script); it is enough to run the annotation scripts.
Setup:

python -m venv .genelm_env
source .genelm_env/bin/activate
pip install -r genelm/requirements.txt

Usage

1. Single sequence

Use this when your FASTA contains a single record, or when you want to process a FASTA file sequentially. It runs the full AnnotatorPipeline once and writes a single GFF/CSV result. --device lets you force CPU or select a GPU (e.g., cuda:0). If you pass --out_dir, the produced file is copied there; otherwise it stays in __files__/results.

python run_single.py \
  --in_fasta smoke-test/sequence_tiny.fasta \
  --format GFF \
  --device cpu \
  --out_dir __files__/results

Optional args: --filename my_result or --filename my_result.gff sets a custom output file name (the script automatically adds .gff or .csv based on the chosen format). If you omit this option, the output file name is generated from the contig names in your FASTA file, joined by underscores (_), with the appropriate .gff or .csv extension.
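
The naming rule above can be illustrated with a short sketch. This is not the code used by run_single.py; the function names are hypothetical, and it assumes the contig name is the first whitespace-delimited token of each FASTA header.

```python
# Hypothetical helpers illustrating the output-naming rule described above.
EXTENSIONS = {"GFF": ".gff", "CSV": ".csv"}


def with_extension(filename: str, fmt: str) -> str:
    """Append the format's extension unless the name already ends with it."""
    ext = EXTENSIONS[fmt.upper()]
    return filename if filename.lower().endswith(ext) else filename + ext


def default_output_name(fasta_text: str, fmt: str = "GFF") -> str:
    """Join contig names from the FASTA headers with underscores."""
    names = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:
                # Assumption: contig name = first token of the header line.
                names.append(tokens[0])
    return with_extension("_".join(names), fmt)


example = ">contigA some description\nACGT\n>contigB\nGGCC\n"
print(default_output_name(example, "GFF"))   # contigA_contigB.gff
print(with_extension("my_result", "CSV"))    # my_result.csv
print(with_extension("my_result.gff", "GFF"))  # my_result.gff
```

With many contigs the generated name can get long, which is one reason to pass --filename explicitly.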

New: add --extract_orf_protein to also write predicted_orfs.fasta and predicted_proteins.fasta to the output folder. The default codon table is 11; you can change it with --codon_table <ID>.

2. Batch (multi-FASTA)

Use this for a multi-FASTA: it automatically splits the input into per-record FASTAs, runs each one in parallel (either on CPU or GPU), and then merges all per-record outputs into a single file.

python run_batch.py \
  --input_fasta smoke-test/sequence_tiny_mixt.fasta \
  --format GFF \
  --device cpu \
  --workers 3 \
  --job_name test-smoke \
  --output __files__/results/t-smoke.gff
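
The split, parallel, merge flow can be sketched in a few lines. This is a simplified illustration, not the implementation of run_batch.py: it works on in-memory text instead of temporary files, and `annotate` is a hypothetical stand-in for one run_single.py invocation.

```python
from concurrent.futures import ThreadPoolExecutor


def split_fasta(text):
    """Split multi-FASTA text into a list of (record_id, record_text)."""
    records, rec_id, lines = [], None, []
    for line in text.splitlines():
        if line.startswith(">"):
            if rec_id is not None:
                records.append((rec_id, "\n".join(lines) + "\n"))
            tokens = line[1:].split()
            rec_id = tokens[0] if tokens else "unnamed"
            lines = [line]
        elif rec_id is not None and line.strip():
            lines.append(line)
    if rec_id is not None:
        records.append((rec_id, "\n".join(lines) + "\n"))
    return records


def run_batch(text, annotate, workers=1):
    """Annotate every record in parallel threads and merge in input order."""
    records = split_fasta(text)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map preserves input order, so the merged output
        # follows the order of records in the FASTA file.
        results = list(pool.map(lambda rec: annotate(*rec), records))
    return "".join(results)


def fake_annotate(rec_id, record_text):
    """Placeholder for a per-record annotation; reports sequence length."""
    length = sum(len(l) for l in record_text.splitlines()[1:])
    return f"{rec_id}\t{length}\n"


merged = run_batch(">a\nACGT\n>b extra\nGG\nTT\n", fake_annotate, workers=2)
print(merged, end="")
```

Threads are used here because the table below describes --workers as Python threads; the real per-record work would happen in a separate run_single.py process.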

Argument table for the high-level runner (run_batch.py)

Argument	Required	Description	Allowed Values / Notes	Default
`--input_fasta`	Yes	Path to the multi-FASTA file containing multiple sequences.	Any valid FASTA file path.	None
`--format`	No	Output format used for per-record results and the final merged output.	`GFF`, `CSV`	`GFF`
`--device`	Yes	Device to run run_single.py on.	`cpu` → force CPU mode; `gpu` → auto-GPU (not tested)	`None`
`--workers`	No	Number of parallel threads (Python threads) used to run multiple FASTA chunks simultaneously.	GPU: must be `1` (avoids GPU memory collisions); CPU: can be >1 (parallel CPU processing)	`1`
`--job_name`	No	Name placed in merged GFF header for tracking the batch job.	Any string	`"batch_job"`
`--output`	Yes	Final merged output file path (`.gff` or `.csv`).	Path must end with `.gff` or `.csv`.	None
`--keep_temp`	No	Keep temporary directory containing split FASTAs + per-record outputs.	Flag; no value. If present → temp dirs kept.	`False`
`--extract_orf_protein`	No	After merging, run ORF & protein FASTA extraction	Flag	`False`
`--codon_table`	No	NCBI codon table ID used for translation	Integer (11 recommended for bacteria/archaea)	`11`
`--verbose`	No	Enable verbose logging passed to run_single.py.	Flag; if set → verbose mode.	`False`
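
To see how the constraints in the table combine, here is a minimal argparse sketch. It mirrors the documented options only; it is not the parser shipped in run_batch.py, and `check_args` is a hypothetical helper. Whether the real script enforces these rules or merely expects them is an assumption.

```python
import argparse


def build_parser():
    """Parser mirroring the options documented in the table above."""
    p = argparse.ArgumentParser(description="Sketch of the run_batch.py options")
    p.add_argument("--input_fasta", required=True)
    p.add_argument("--format", choices=["GFF", "CSV"], default="GFF")
    p.add_argument("--device", required=True)
    p.add_argument("--workers", type=int, default=1)
    p.add_argument("--job_name", default="batch_job")
    p.add_argument("--output", required=True)
    p.add_argument("--keep_temp", action="store_true")
    p.add_argument("--extract_orf_protein", action="store_true")
    p.add_argument("--codon_table", type=int, default=11)
    p.add_argument("--verbose", action="store_true")
    return p


def check_args(args):
    """Return a list of human-readable problems (empty if all is fine)."""
    problems = []
    if not args.output.lower().endswith((".gff", ".csv")):
        problems.append("--output must end with .gff or .csv")
    if args.workers < 1:
        problems.append("--workers must be at least 1")
    # Per the table: on GPU, use a single worker to avoid memory collisions.
    if args.device != "cpu" and args.workers != 1:
        problems.append("--workers must be 1 when running on a GPU")
    return problems


args = build_parser().parse_args([
    "--input_fasta", "smoke-test/sequence_tiny_mixt.fasta",
    "--device", "cpu", "--workers", "3",
    "--output", "__files__/results/t-smoke.gff",
])
print(check_args(args))  # []
```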

For multi-GPU runs, use run_batch_gpu.py. Below is a minimal test using a tiny mixed FASTA (sequence_tiny_mixt.fasta) to quickly verify GPU scheduling and batch processing.

1 - Use all GPUs automatically

python run_batch_gpu.py \
  --input_fasta smoke-test/sequence_tiny_mixt.fasta \
  --output __files__/results/t-smoke.gff

2 - Use only GPU 0 and GPU 1

python run_batch_gpu.py \
  --input_fasta smoke-test/sequence_tiny_mixt.fasta \
  --output __files__/results/t-smoke.gff \
  --device "cuda:0,cuda:1"

Argument table for the high-level GPU runner (run_batch_gpu.py)

Argument	Required	Description	Allowed Values / Notes	Default
`--input_fasta`	Yes	Multi-FASTA file containing multiple sequences. Script splits this into per-sequence FASTAs.	Any valid FASTA file path	None
`--format`	No	Output format for per-chunk results and final merged file.	`GFF`, `CSV`	`GFF`
`--device`	No	Controls GPU behavior. If set to a specific GPU (`cuda:0`), only that GPU is used. If not set, script auto-detects all GPUs.	Not set → auto-detect all GPUs; `cuda:0` → use GPU cuda:0 only; `"cuda:0,cuda:1"` → use only GPU 0 and 1	`None`
`--job_name`	No	Name inserted into GFF output header and used by the merger.	Any string	`"batch_job"`
`--output`	Yes	Final merged output file (`.gff` or `.csv`).	Must end with `.gff` or `.csv`.	None
`--keep_temp`	No	Keep temporary directory with split FASTAs and per-chunk results. Useful for debugging.	Flag; if present → keep temp folder	`False`
`--extract_orf_protein`	No	After merging, run ORF & protein FASTA extraction	Flag	`False`
`--codon_table`	No	NCBI codon table ID used for translation	Integer (11 recommended for bacteria/archaea)	`11`
`--verbose`	No	Enables verbose mode for each GPU job (passed to run_single.py).	Flag; if present → verbose	`False`
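
The --device behavior described above can be sketched as follows. The README does not say how run_batch_gpu.py distributes records across GPUs, so the round-robin assignment here is purely an assumption for illustration, and both function names are hypothetical.

```python
from itertools import cycle


def parse_devices(device_arg, detected):
    """Use the explicit --device list if given, else every detected GPU."""
    if device_arg is None:
        return list(detected)
    return [d.strip() for d in device_arg.split(",") if d.strip()]


def assign_round_robin(record_ids, devices):
    """Pair each record with a device, cycling through the device list.

    Assumption: round-robin is only one plausible scheduling strategy.
    """
    if not devices:
        raise ValueError("no GPU available")
    return list(zip(record_ids, cycle(devices)))


detected = ["cuda:0", "cuda:1", "cuda:2"]  # stand-in for auto-detection
devices = parse_devices("cuda:0,cuda:1", detected)
print(assign_round_robin(["recA", "recB", "recC"], devices))
# [('recA', 'cuda:0'), ('recB', 'cuda:1'), ('recC', 'cuda:0')]
```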

2.1. Debug

When running run_batch.py or run_batch_gpu.py, the script normally removes all temporary working files after finishing. To help with debugging, you can preserve all intermediate files by adding: --keep_temp. This keeps the full temporary workspace so you can inspect every step of processing.

When enabled, the script prints a detailed warning like:

================================================================================
⚠️ DEBUG MODE ENABLED (--keep_temp)
Temporary folder was NOT deleted.
Debug folder: /tmp/genelm_gpu_batch_abcd1234

Inside this folder you will find:
  • split/          — one FASTA file per sequence
  • chunk_results/  — per-sequence GFF/CSV results (from run_single)
  • merge_logs/     — merged output (if created)
  • /tmp/genelm_gpu_batch_abcd1234/**/* — additional GeneLM-related temporary files
================================================================================

The preserved folder looks like:

genelm_gpu_batch_xxxxxx/
│
├── split/
│      00001_recordA.fasta
│      00002_recordB.fasta
│      ...
│
└── chunk_results/
       00001_recordA.gff
       00002_recordB.gff

Logs are the exception; they are written here:

__files__/results
│
└── ***.log

3. Extract ORFs & Proteins from FASTA + GFF

After running GeneLM and obtaining the GFF annotation, you can optionally extract:

Predicted ORF nucleotide sequences
Predicted protein amino-acid sequences

This helper script targets bacterial genomes (genetic code 11). If no output directory is provided, files are written to the current directory.

Usage

python extract_orf_protein.py \
  path/to/genome.fasta \
  path/to/annotations.gff

Or specify an output directory:

python extract_orf_protein.py \
  path/to/genome.fasta \
  path/to/annotations.gff \
  path/to/outputdir/where/to/save/ \
  --codon_table 11

This script produces: predicted_orfs.fasta and predicted_proteins.fasta
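
For intuition, the extraction step can be sketched in pure Python. This is not the code in extract_orf_protein.py; the function names are hypothetical. It relies on two known facts: GFF coordinates are 1-based and inclusive, and NCBI table 11 assigns amino acids to codons exactly as the standard code does (it differs in the set of permitted start codons, which this sketch does not handle).

```python
BASES = "TCAG"
# Amino acids for all 64 codons in NCBI order (first base T, C, A, G; ...).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Reverse-complement a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def extract_orf(genome, start, end, strand):
    """Slice a feature using 1-based inclusive GFF coordinates."""
    orf = genome[start - 1:end]
    return reverse_complement(orf) if strand == "-" else orf


def translate(orf):
    """Translate codon by codon; unknown codons become X, trailing stop dropped."""
    orf = orf.upper()
    protein = [CODON_TABLE.get(orf[i:i + 3], "X") for i in range(0, len(orf) - 2, 3)]
    return "".join(protein).rstrip("*")


genome = "CCATGAAATAGGG"
orf = extract_orf(genome, 3, 11, "+")
print(orf)             # ATGAAATAG
print(translate(orf))  # MK
```

A feature on the minus strand is handled the same way: the slice is taken from the forward sequence and then reverse-complemented before translation.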

4. HPC (SLURM)

Use the provided SLURM helper to run the batch workflow on a cluster. Arguments: INPUT_FASTA FORMAT DEVICE WORKERS OUT_FILE JOB_NAME. Pick DEVICE=cpu for CPU nodes or cuda:0 on a GPU partition (and set #SBATCH --gres=gpu:1 inside the script if needed).

sbatch hpc_slurm_example.sh data/multi.fna GFF cpu 16 __files__/results/merged.gff my_job
