APA-Net

APA-Net is a deep learning model designed for learning context-specific APA (Alternative Polyadenylation) usage. This guide covers the steps necessary to set up and run APA-Net.

Requirements

  • Python 3.8 or higher
  • PyTorch 1.8.0 or higher
  • NumPy
  • Pandas
  • SciPy
  • tqdm
  • wandb (optional, for experiment tracking)

Installation

Option 1: Install from source (Recommended)

  1. Clone this repository to your local machine:
git clone https://github.com/BaderLab/APA-Net.git
cd APA-Net
  2. Install dependencies manually for better control:
# For CPU-only version (smaller download)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu

# For GPU version (if you have CUDA)
pip install torch torchvision torchaudio

# Install other dependencies
pip install numpy pandas scipy tqdm wandb
  3. Install the package:
pip install .

Option 2: One-command installation

From the root of the cloned repository, run:

pip install .

Note: This installs the full PyTorch build with CUDA support, which is a large download (~2 GB).

Data Format

APA-Net expects input data in .npy format with the following structure:

  • Shape: (n_samples, 9) where each row represents one sample
  • Columns:
    • Column 0: Float value (sample ID/index)
    • Column 1: String (cell type name)
    • Column 2: String (additional metadata)
    • Column 3: Float value
    • Column 4: String (additional metadata)
    • Column 5: String (genomic coordinates/switch name)
    • Column 6: NumPy array of shape (4, 4000) - one-hot encoded DNA sequence
    • Column 7: Float (target APA usage value)
    • Column 8: NumPy array of shape (327,) - cell type profile features
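
As a rough illustration of the layout above, the sketch below builds a tiny synthetic dataset and round-trips it through .npy. All values are placeholders: the cell type name, metadata strings, and coordinate string are invented for the example, and the helper names `make_record` and `make_dataset` are hypothetical, not part of APA-Net.

```python
import os
import tempfile

import numpy as np

N_COLS, SEQ_SHAPE, PROFILE_DIM = 9, (4, 4000), 327


def make_record(rng, idx):
    """Build one synthetic row following the nine-column layout (placeholder values)."""
    # One-hot sequence: exactly one active channel per position.
    bases = rng.integers(0, 4, size=SEQ_SHAPE[1])
    seq = np.zeros(SEQ_SHAPE, dtype=np.float32)
    seq[bases, np.arange(SEQ_SHAPE[1])] = 1.0
    profile = rng.random(PROFILE_DIM).astype(np.float32)
    return [float(idx), "celltype_A", "meta_1", 0.0, "meta_2",
            "chr1:1000-5000", seq, float(rng.random()), profile]


def make_dataset(n_samples, seed=0):
    """Assemble records into an object array of shape (n_samples, 9)."""
    rng = np.random.default_rng(seed)
    data = np.empty((n_samples, N_COLS), dtype=object)
    for i in range(n_samples):
        for j, value in enumerate(make_record(rng, i)):
            data[i, j] = value
    return data


data = make_dataset(4)
path = os.path.join(tempfile.mkdtemp(), "toy_data.npy")
np.save(path, data, allow_pickle=True)

# Object arrays are pickled on disk, so loading needs allow_pickle=True.
loaded = np.load(path, allow_pickle=True)
print(loaded.shape, loaded[0, 6].shape, loaded[0, 8].shape)
```

Because the rows mix strings, floats, and nested arrays, the container has to be an object array, which is why `allow_pickle=True` is needed when loading.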

Usage

Training the Model

To train the APA-Net model, use the train_script.py script:

cd apamodel
python train_script.py \
  --train_data "/path/to/train_data.npy" \
  --valid_data "/path/to/valid_data.npy" \
  --modelfile "/path/to/model_output.pt" \
  --batch_size 64 \
  --epochs 200 \
  --device "cpu" \
  --use_wandb "False"

Testing the Model

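You can run a quick smoke test that loads your data and builds the model on CPU (replace your_data.npy with the path to your own data file):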

# Create a simple test script
python -c "
import sys
sys.path.append('./apamodel')
from model import APANET, APAData
import numpy as np
import torch

# Load your data
data = np.load('your_data.npy', allow_pickle=True)

# Configure model (using CPU)
config = {
    'device': 'cpu',
    'opt': 'Adam',
    'loss': 'mse',
    'lr': 2.5e-05,
    'adam_weight_decay': 0.09,
    'conv1kc': 128,
    'conv1ks': 12,
    'conv1st': 1,
    'pool1ks': 16,
    'pool1st': 16,
    'cnvpdrop1': 0,
    'Matt_heads': 8,
    'Matt_drop': 0.2,
    'fc1_dims': [8192, 4048, 1024, 512, 256],
    'fc1_dropouts': [0.25, 0.25, 0.25, 0, 0],
    'fc2_dims': [128, 32, 16, 1],
    'fc2_dropouts': [0.2, 0.2, 0, 0],
    'psa_query_dim': 128,
    'psa_num_layers': 1,
    'psa_nhead': 1,
    'psa_dim_feedforward': 1024,
    'psa_dropout': 0
}

# Create and test model
model = APANET(config)
model.compile()
print('Model created successfully!')
"

Command Line Arguments

  • --train_data: Path to the training data file (required)
  • --valid_data: Path to the validation data file (required)
  • --modelfile: Path where the trained model will be saved (required)
  • --batch_size: Batch size for training (default: 64)
  • --epochs: Number of training epochs (default: 200)
  • --project_name: Name of the project for wandb logging (default: "APA-Net_Training")
  • --device: Device to run the training on - use "cpu" or "cuda:0" (default: "cuda:0")
  • --use_wandb: Enable wandb logging - "True" or "False" (default: "True")
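
For orientation, here is a minimal argparse sketch that mirrors the flags and defaults listed above. It is not the code of train_script.py; the `str_to_bool` and `build_parser` helpers are hypothetical and only show one way such flags could be parsed.

```python
import argparse


def str_to_bool(value):
    """Interpret the quoted "True"/"False" strings used on the command line."""
    lowered = value.strip().lower()
    if lowered in ("true", "1", "yes"):
        return True
    if lowered in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError(f"expected True or False, got {value!r}")


def build_parser():
    """Parser with the documented flags; defaults copied from the list above."""
    parser = argparse.ArgumentParser(description="Sketch of the documented APA-Net training flags")
    parser.add_argument("--train_data", required=True, help="path to training .npy file")
    parser.add_argument("--valid_data", required=True, help="path to validation .npy file")
    parser.add_argument("--modelfile", required=True, help="where to save the trained model")
    parser.add_argument("--batch_size", type=int, default=64)
    parser.add_argument("--epochs", type=int, default=200)
    parser.add_argument("--project_name", default="APA-Net_Training")
    parser.add_argument("--device", default="cuda:0")
    parser.add_argument("--use_wandb", type=str_to_bool, default=True)
    return parser


args = build_parser().parse_args([
    "--train_data", "train.npy",
    "--valid_data", "valid.npy",
    "--modelfile", "model.pt",
    "--use_wandb", "False",
])
print(args)
```

The explicit string-to-boolean conversion matters in a sketch like this because `type=bool` in argparse would treat any non-empty string, including "False", as true.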

Model Architecture

APA-Net is a deep neural network that combines:

  • Convolutional layers for sequence feature extraction
  • Self-attention mechanism for capturing long-range dependencies
  • Fully connected layers for prediction
  • Cell type profile integration for context-specific modeling

The model has approximately 301M parameters and processes:

  • Input: DNA sequences (4×4000) + cell type profiles (327 features)
  • Output: APA usage prediction (single value)
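
To get a feel for how the sequence length shrinks through the first convolution and pooling stage, the sketch below applies the standard output-length formula used by PyTorch's 1-D layers to the kernel and stride values from the example config (conv1ks=12, conv1st=1, pool1ks=16, pool1st=16). It assumes no padding and a dilation of 1, which the README does not state, so the actual model may produce different lengths.

```python
def out_length(length, kernel, stride=1, padding=0, dilation=1):
    """Output length of a 1-D convolution or pooling layer (PyTorch convention)."""
    return (length + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1


seq_len = 4000  # input sequence length from the data format

# Values taken from the example config; padding=0 and dilation=1 are assumptions.
after_conv = out_length(seq_len, kernel=12, stride=1)       # conv1ks, conv1st
after_pool = out_length(after_conv, kernel=16, stride=16)   # pool1ks, pool1st

print(after_conv, after_pool)  # prints: 3989 249
```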

Troubleshooting

Common Issues

  1. CUDA errors: If you encounter CUDA-related errors, install the CPU-only version of PyTorch:

    pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
  2. Memory issues: Reduce batch size if you encounter out-of-memory errors:

    --batch_size 32
  3. Data format errors: Ensure your data has the correct shape (n_samples, 9) with sequences of shape (4, 4000) and cell type profiles of shape (327,).
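
A small check along these lines can help locate data format errors before training. This is a sketch under the shapes documented above; `check_dataset` is a hypothetical helper, not something shipped with APA-Net.

```python
import numpy as np


def check_dataset(data):
    """Return a list of problems; an empty list means the layout looks right."""
    if data.ndim != 2 or data.shape[1] != 9:
        return [f"expected shape (n_samples, 9), got {data.shape}"]
    problems = []
    for i, row in enumerate(data):
        if np.shape(row[6]) != (4, 4000):
            problems.append(f"row {i}: sequence has shape {np.shape(row[6])}, expected (4, 4000)")
        if np.shape(row[8]) != (327,):
            problems.append(f"row {i}: profile has shape {np.shape(row[8])}, expected (327,)")
    return problems


# Toy arrays for demonstration; only columns 6 and 8 are filled in.
good = np.empty((2, 9), dtype=object)
for i in range(2):
    good[i, 6] = np.zeros((4, 4000), dtype=np.float32)
    good[i, 8] = np.zeros(327, dtype=np.float32)

bad = good.copy()
bad[1, 6] = np.zeros((4, 100), dtype=np.float32)  # deliberately too short

print(check_dataset(good))
print(check_dataset(bad))
```

With real data, the array would come from `np.load(path, allow_pickle=True)` instead of the toy arrays built here.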

CPU vs GPU Usage

  • CPU: Slower but more compatible. Use --device "cpu"
  • GPU: Faster training. Use --device "cuda:0" (requires CUDA-compatible PyTorch installation)
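
If you want a script to fall back to the CPU automatically when CUDA is unavailable, something like the following could be used to choose the value passed to --device. `resolve_device` is a hypothetical helper; the only PyTorch call it relies on is `torch.cuda.is_available()`.

```python
def resolve_device(requested="cuda:0"):
    """Return the requested device string, or "cpu" when CUDA is not usable."""
    if not requested.startswith("cuda"):
        return requested
    try:
        import torch  # imported lazily so the check also works without PyTorch
    except ImportError:
        return "cpu"
    return requested if torch.cuda.is_available() else "cpu"


print(resolve_device("cuda:0"))
print(resolve_device("cpu"))
```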

Example

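Here is a complete example of training APA-Net. It uses the same file for training and validation, so treat it as a check that the pipeline runs rather than as a real evaluation: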

# Navigate to the model directory
cd APA-Net/apamodel

# Train the model
python train_script.py \
  --train_data "../test_fold_0.npy" \
  --valid_data "../test_fold_0.npy" \
  --modelfile "./trained_model.pt" \
  --batch_size 32 \
  --epochs 50 \
  --device "cpu" \
  --use_wandb "False" \
  --project_name "APA-Net_Test"

Analysis and Figures

The analysis_and_figures/ directory contains the code and notebooks used to reproduce the results and figures from the APA-Net paper. The pipeline covers data processing, model evaluation, comparative analysis, and visualization.

Directory Structure

analysis_and_figures/
├── model_performance/          # APA-Net model evaluation and performance analysis
├── data_processing/            # Data preparation and preprocessing for APA-Net
├── comparative_analysis/       # Comparative studies (APA vs DE, correlations)
├── gene_expression/            # Differential gene expression analysis
├── pathway_analysis/           # Gene set enrichment and pathway analysis
├── preprocessing/              # Single-cell RNA-seq data preprocessing pipeline
└── functions/                  # Utility functions and helper scripts

Getting Started with Analysis

  1. Prerequisites: Make sure you have the following R and Python packages installed:

R packages:

install.packages(c("dplyr", "ggplot2", "tidyr", "viridis", "patchwork", 
                   "readxl", "gridExtra", "ggpubr", "ggrepel", "reshape2", 
                   "corrplot", "pheatmap", "boot", "Seurat", "scCustomize"))

Python packages:

pip install pandas numpy scipy matplotlib seaborn scikit-learn statsmodels
  2. Data Requirements: The analysis scripts expect data in specific locations. You may need to adjust file paths in the notebooks to match your data directory structure.

Analysis Modules

1. Model Performance (model_performance/)

  • APA-NET_performance_plots.ipynb: Generates correlation plots showing model performance across cell types
  • APA-Net_filter_interactions.ipynb: Analyzes convolutional filter interactions and RBP binding patterns
  • APA-Net_heatmap_for_filter_interactions.ipynb: Creates heatmaps showing filter-RBP interactions

2. Data Processing (data_processing/)

  • Process_inputs_for_APA-Net.ipynb: Main data preprocessing pipeline for APA-Net training data
    • Processes RNA sequences and APA usage data
    • Generates one-hot encoded sequences
    • Creates 5-fold cross-validation splits
    • Formats data for model training
  • APA_quantification_maaper_apalog_Dec2024.ipynb: APA event quantification using MAAPER
  • emprical_fdr_thresholds_maaper_apalog.ipynb: Determines empirical FDR thresholds for significance testing
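
The one-hot encoding and 5-fold split steps listed for Process_inputs_for_APA-Net.ipynb can be sketched roughly as follows. This is not the notebook's code: the A, C, G, T channel order, the all-zero encoding of unknown bases and padding, and the random per-sample split are all assumptions made for illustration, and the helper names are hypothetical.

```python
import numpy as np

BASES = "ACGT"  # assumed channel order; the notebook may use a different one


def one_hot(seq, length=4000):
    """Encode a nucleotide string as a (4, length) array.

    Unknown bases (e.g. N) and positions beyond the sequence stay all-zero.
    """
    encoded = np.zeros((4, length), dtype=np.float32)
    cleaned = seq.upper().replace("U", "T")[:length]  # treat RNA U as T
    for pos, base in enumerate(cleaned):
        channel = BASES.find(base)
        if channel >= 0:
            encoded[channel, pos] = 1.0
    return encoded


def five_fold_indices(n_samples, n_folds=5, seed=0):
    """Yield (train_idx, test_idx) pairs that partition shuffled sample indices."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), n_folds)
    for k in range(n_folds):
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        yield train, folds[k]


x = one_hot("ACGTN")
print(x[:, :5])
for train_idx, test_idx in five_fold_indices(23):
    print(len(train_idx), len(test_idx))
```

A random per-sample split is the simplest choice; splitting by gene or chromosome instead would avoid related sequences landing in both the training and test folds.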

3. Comparative Analysis (comparative_analysis/)

  • APA_vs_DE.ipynb: Compares APA changes with differential expression
    • Correlation analysis between APA usage and gene expression changes
    • Cell-type-specific comparisons
    • Statistical significance testing
  • apa_correlation_across_celltypes.ipynb: Cross-cell-type APA correlation analysis
  • rbp_co_occurance_dissimilarity.ipynb: RNA-binding protein co-occurrence analysis

4. Gene Expression (gene_expression/)

  • DEG_ALS_genes.R: Analysis of ALS-associated gene expression
  • DEG_MAST_analysis.R: MAST-based differential expression analysis
  • DEG_pathway_analysis.R: Pathway enrichment analysis for DEGs
  • DEG_visualization.R: Visualization of differential expression results

5. Pathway Analysis (pathway_analysis/)

  • APA_pathway_analysis.R: Gene set enrichment analysis for APA-affected genes
    • GO term enrichment
    • Reactome pathway analysis
    • Custom gene set analysis

6. Preprocessing (preprocessing/)

  • processing_annotation/: Single-cell RNA-seq processing pipeline
    • 01_snRNA_cellranger_preprocess.sh: Cell Ranger preprocessing
    • 02_snRNA_process_QC.R: Quality control and filtering
    • 03_snRNA_clustering_annotation.R: Cell clustering and annotation
    • 04a_snRNA_NSForest1.ipynb & 04b_snRNA_NSForest2.ipynb: NSForest cell type classification
  • independent_datasets/: Processing of additional validation datasets
    • 01_read_matrices.R: Matrix reading and preprocessing
    • 02_harmony_int.R: Harmony integration for batch correction
    • 03_doublet_removal_annotation.R: Doublet detection and removal

Reproducing Key Results

Figure Generation

To reproduce the main figures from the paper:

  1. Model Performance Plots:

    cd analysis_and_figures/model_performance
    jupyter notebook APA-NET_performance_plots.ipynb
  2. APA Usage Analysis:

    cd analysis_and_figures/visualization  
    jupyter notebook maaper_volcanos_barplots_figure6.ipynb
  3. Comparative Analysis:

    cd analysis_and_figures/comparative_analysis
    jupyter notebook APA_vs_DE.ipynb

Data Processing Pipeline

To process your own data through the complete pipeline:

  1. Start with raw single-cell data:

    cd analysis_and_figures/preprocessing/processing_annotation
    bash 01_snRNA_cellranger_preprocess.sh
  2. Process and prepare for APA-Net:

    cd analysis_and_figures/data_processing
    jupyter notebook Process_inputs_for_APA-Net.ipynb

Key Results and Interpretations

  • Model Performance: APA-Net achieves correlation coefficients of 0.56-0.67 across cell types
  • Cell-Type Specificity: Microglia show the highest model performance, which may indicate stronger APA regulatory patterns in this cell type
  • Condition Comparison: Strong correlations (0.65-0.84) between C9ALS and sALS APA changes across cell types
  • Biological Validation: APA changes correlate with known ALS pathways and RBP targets
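
Per-cell-type correlations like those reported above could be computed along these lines. The sketch assumes Pearson correlation between predicted and observed APA usage, which the README does not specify, and it runs on synthetic numbers only; the group label "celltype_B" and the helper `pearson_by_group` are placeholders.

```python
import numpy as np


def pearson_by_group(y_true, y_pred, groups):
    """Pearson r between targets and predictions, computed separately per group label."""
    y_true, y_pred, groups = (np.asarray(a) for a in (y_true, y_pred, groups))
    result = {}
    for label in np.unique(groups):
        mask = groups == label
        result[str(label)] = float(np.corrcoef(y_true[mask], y_pred[mask])[0, 1])
    return result


# Synthetic data only; these numbers say nothing about APA-Net's real performance.
rng = np.random.default_rng(0)
groups = np.repeat(["Microglia", "celltype_B"], 50)
y_true = rng.normal(size=100)
y_pred = y_true + rng.normal(scale=1.0, size=100)

for cell_type, r in pearson_by_group(y_true, y_pred, groups).items():
    print(f"{cell_type}: r = {r:.2f}")
```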

Data Availability

The analysis scripts reference several data sources:

  • Single-cell RNA-seq count matrices
  • APA usage quantification results
  • Cell type annotations
  • RBP expression profiles
  • Reference genome and annotations

Please ensure you have access to the appropriate datasets before running the analysis scripts.

Citation

If you use this analysis pipeline, please cite our paper:

Paper citation to be added upon publication. Preprint: https://www.biorxiv.org/content/10.1101/2023.12.22.573083v2

For questions about the analysis pipeline, please open an issue in the GitHub repository.
