zavolanlab/APA_localization

APA localization - Analysis and Pipelines

This repository contains the computational workflows and downstream analysis notebooks related to the analysis of alternative polyadenylation isoforms in subcellular compartments.

The repository is designed so that both the workflows and the analysis can be run from a Jupyter notebook on an HPC cluster.

On the sciCORE HPC, the OnDemand service makes it easy to run a Jupyter notebook on a compute node.

We utilize a hybrid approach: Snakemake for robust, scalable data processing on HPC clusters (sciCORE), and Jupyter Notebooks for interactive downstream analysis and visualization.

Current state

So far, we have reanalyzed the bulk RNA-seq data from the study "System-wide analysis of RNA and protein subcellular localization dynamics". We ran basic processing, alignment, and quantification of gene expression using a custom-prepared .gtf file and the featureCounts utility. We then quantified the relative usage of polyadenylation sites (PASs) with a PAQR2 workflow, using a modified version of the workflow from the paper "Leveraging multi-omics data to infer regulators of mRNA 3’ end processing in glioblastoma". As input for PAQR2, we used the human PolyASite Atlas v3.0 filtered at the 62% stringency level (see the paper for an explanation of why that threshold is optimal).

We then focused on the NFYA gene and alternative polyadenylation at its terminal exon, in a collaborative project with the group of Prof. Dr. Paolo Gandellini.

Action plan

In general, different sub-projects related to the subcellular localization of APA isoforms will correspond to different Jupyter notebooks. For now, only the NFYA project code and data are present.

Repository Structure

.
├── NFYA_project.ipynb                  # a Jupyter notebook dedicated to NFYA project, includes analysis and workflow configuration
├── APA_localization.template.env       # Template for required environment variables/paths
└── WF/                                 # Snakemake Workflow Engine
    ├── Snakefile-prepare-faster        # Pipeline Step 1: RNA-seq data processing (alignment, FastQC)
    ├── Snakefile-quantification-faster # Pipeline Step 2: Quantification of gene expression with FeatureCounts, separating .bam files by chromosomes for efficiency, preparation of coverages for PAQR quantification
    ├── Snakefile-PAQR-quantify         # Pipeline Step 3: Running final PAQR quantification to obtain PAS-vs-sample count matrix
    ├── config.template.yaml            # Template configuration for Snakemake parameters
    ├── envs/                           # Conda environments isolated for specific Snakemake rules
    ├── profile/                        # SLURM execution profile for the HPC
    └── scripts/                        # Python and R scripts utilized by both Snakemake and Jupyter

Quick Start & Setup

To ensure strict reproducibility and security, this project uses .env files to manage all absolute paths (data directories, genome annotations, etc.). Do not hardcode paths into the Python or Snakemake files.

1. Clone the Repository

Clone this repository into your local user space ($HOME):

git clone https://github.com/zavolanlab/APA_localization.git
cd APA_localization

2. Configure Environment Paths

You must map the project to your local HPC paths. Copy the template under a new name and fill in your absolute paths, for example:

cp APA_localization.template.env APA_localization.scicore.env
# Open .env and edit the "Base Directories" section to match your system
  • Recommended if you are a Zavolan group member on sciCORE: move APA_localization.scicore.env to the project GROUP folder and symlink it into your local repository directory:
    ln -s <a file with specified sciCORE paths> APA_localization.scicore.env

This way, APA_localization.scicore.env is automatically accessible to group members but is not tracked by git. (Note: *.env files are ignored by git to protect private cluster paths, except for APA_localization.template.env.) APA_localization.scicore.env already exists in the GROUP folder of the project on sciCORE; look for the README there.
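Before launching anything, it can help to confirm that every path in the copied file was actually filled in. The sketch below is an illustration and not part of the repository: it parses simple KEY=VALUE lines with the standard library only and assumes that every entry is meant to be an absolute path. The variable names in the example (DATA_DIR, RESULTS_DIR) are hypothetical placeholders, not the keys used in APA_localization.template.env.

```python
from pathlib import PurePosixPath


def parse_env_text(text):
    """Parse simple KEY=VALUE lines; skip blank lines and '#' comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def non_absolute_paths(values):
    """Return the keys whose value is empty or not an absolute POSIX path."""
    return sorted(
        key
        for key, value in values.items()
        if not value or not PurePosixPath(value).is_absolute()
    )


# Hypothetical content; the real keys come from APA_localization.template.env.
example = """
# Base Directories
DATA_DIR=/path/to/data
RESULTS_DIR=
"""
print(non_absolute_paths(parse_env_text(example)))
```

In the example, RESULTS_DIR is reported because it was left empty; a relative path would be reported in the same way.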

3. Install the conda environment with zavolab_pyutils

The analysis in the notebook relies largely on functions from the zavolab_pyutils repository. Follow the instructions in that repository under "Developer Setup from source, with conda environment", then use the resulting conda environment, "zavolab_pyutils", to run the Jupyter notebook.

4. Essential for developers! Install nbstripout

When in the APA_localization directory, run:

nbstripout --install

This automatically strips cell outputs from Jupyter notebooks before they are committed, so the outputs never reach GitHub. Otherwise, you risk exposing your HPC cluster paths publicly.

5. Use the Jupyter notebook to configure the workflow and prepare the input table

The workflows are configured inside the Jupyter notebook: the input .tsv with the sample specification and the .yaml config are both created there.
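As a rough picture of what this configuration step produces, the sketch below writes a tab-separated sample table and a flat YAML config using only the standard library. It is an illustration under stated assumptions, not the notebook's code: the column names (sample, fastq1), the config key (samples), and the paths are hypothetical, and the real schema is defined by the notebook and WF/config.template.yaml.

```python
import csv
import io
import json


def samples_to_tsv(samples, columns):
    """Render a list of sample dicts as tab-separated text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=columns, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(samples)
    return buffer.getvalue()


def flat_config_to_yaml(config):
    """Render a flat mapping of strings as YAML lines.

    json.dumps yields double-quoted strings, which are also valid YAML scalars.
    """
    return "".join(f"{key}: {json.dumps(value)}\n" for key, value in config.items())


# Hypothetical sample table and config; column and key names are placeholders.
samples = [
    {"sample": "cyto_rep1", "fastq1": "/path/to/cyto_rep1_R1.fastq.gz"},
    {"sample": "nuc_rep1", "fastq1": "/path/to/nuc_rep1_R1.fastq.gz"},
]
print(samples_to_tsv(samples, ["sample", "fastq1"]))
print(flat_config_to_yaml({"samples": "/path/to/samples.tsv"}))
```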

6. Executing the Workflows

The heavy lifting is divided into separate Snakemake workflows (currently three) located in the WF/ directory.

The Bash commands are also prepared inside the Jupyter notebook. Copy them to the command line and execute them there.

On an HPC cluster like sciCORE, run the workflows from a login node; Snakemake then automatically submits jobs to the compute nodes.
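To illustrate the order of the three steps, the sketch below assembles one Snakemake command per Snakefile. It is not the command set generated by the notebook: the config file name (WF/config.yaml) is hypothetical, running from the repository root is an assumption, and the real commands may carry additional options. Only the standard Snakemake options --snakefile, --configfile, and --profile are used.

```python
import shlex

# Snakefiles in execution order, as listed under "Repository Structure".
SNAKEFILES = [
    "WF/Snakefile-prepare-faster",
    "WF/Snakefile-quantification-faster",
    "WF/Snakefile-PAQR-quantify",
]


def build_commands(configfile, profile="WF/profile"):
    """Return one shell-quoted snakemake command per workflow step."""
    return [
        shlex.join(
            [
                "snakemake",
                "--snakefile", snakefile,
                "--configfile", configfile,
                "--profile", profile,
            ]
        )
        for snakefile in SNAKEFILES
    ]


# Hypothetical config path; the notebook creates the actual config file.
for command in build_commands("WF/config.yaml"):
    print(command)
```

Each step consumes the outputs of the previous one, so the commands are meant to be run one after another rather than in parallel.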

7. Downstream Analysis

Once the Snakemake workflows are complete, all results are routed to the shared group directories defined in your .env file.

Use the respective sections of the Jupyter notebook to analyze the outputs.

The notebook automatically loads your .env paths using python-dotenv, allowing it to dynamically locate all workflow results, figures, and metadata regardless of where you cloned this repository.
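A minimal sketch of this pattern follows; it is not the notebook's actual code. It assumes python-dotenv is installed in the active environment, and the key name (RESULTS_DIR) and the file layout in the usage note are hypothetical.

```python
import os
from pathlib import Path


def load_env_file(env_file):
    """Load KEY=VALUE pairs from env_file into os.environ via python-dotenv."""
    # Imported here so the path helper below works without python-dotenv.
    from dotenv import load_dotenv

    return load_dotenv(env_file)


def result_path(key, *parts, environ=None):
    """Join the base directory stored under `key` with relative path parts."""
    environ = os.environ if environ is None else environ
    if key not in environ:
        raise KeyError(f"{key} is not set; check your .env file")
    return Path(environ[key]).joinpath(*parts)
```

In a notebook cell, this could be used as `load_env_file("APA_localization.scicore.env")` followed by `result_path("RESULTS_DIR", "PAQR", "counts.tsv")`, where both the key and the relative path are placeholders. Because every location is derived from the .env file, the same cell works from any clone of the repository.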

About

This repository is dedicated to the lab's projects on the subcellular localization of alternative polyadenylation (APA) isoforms.
