VIRGO2 Mapping and Taxonomy Pipeline
Overview
The VIRGO2 pipeline maps paired-end sequencing reads to the VIRGO2 database and annotates them with taxonomic and functional information.
VIRGO2 is a resource developed by the Ravel lab at the University of Maryland School of Medicine. The manuscript is available at doi: 10.1101/2025.03.04.641479
This workflow assumes your reads have already been quality-filtered, adapter-trimmed, and host-filtered.
Pipeline Steps
The workflow performs the following steps:
- Maps sequencing reads to the VIRGO2 database
- Generates taxonomic annotations
- Calculates relative abundances
- Produces summary reports and annotated output files
Prerequisites
- conda
- Snakemake (version 8.20.0 or later)
For O2 setup instructions, visit here.
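As a quick sanity check of the prerequisites listed above, the following is a minimal sketch that looks for conda and Snakemake on PATH and compares the reported Snakemake version against 8.20.0. The function names are hypothetical and not part of this workflow; the only external call assumed is `snakemake --version`.

```python
import re
import shutil
import subprocess

MIN_SNAKEMAKE = (8, 20, 0)  # minimum version stated in the prerequisites


def parse_version(text):
    """Extract the first major.minor.patch triple from a version string."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.groups())


def meets_minimum(version_text, minimum=MIN_SNAKEMAKE):
    """Return True if version_text is at least the minimum version."""
    return parse_version(version_text) >= minimum


def check_prerequisites():
    """Return 'ok' or a short message describing the first problem found."""
    for tool in ("conda", "snakemake"):
        if shutil.which(tool) is None:
            return f"{tool} not found on PATH"
    result = subprocess.run(
        ["snakemake", "--version"], capture_output=True, text=True, check=True
    )
    if not meets_minimum(result.stdout):
        return f"Snakemake {result.stdout.strip()} is older than 8.20.0"
    return "ok"
```

Calling `check_prerequisites()` from an activated environment returns either `"ok"` or a message naming the missing or outdated tool. Versions are compared as integer tuples rather than strings so that 8.9.0 correctly sorts below 8.20.0.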
Installation
To install the workflow, run:
curl -L https://github.com/kwondry/virgo2_mapping_and_taxonomy/archive/refs/heads/main.zip -o main.zip
unzip main.zip && rm main.zip
Database Setup
The VIRGO2 database is currently available via Dropbox. After publication, the files will be available from the Ravel lab on Zenodo.
To set up the database:
- Get the Dropbox link from Michael France
- Download and extract the database files to your preferred location
- Update the virgo2 section in config/config.yaml with the absolute path to your database location
Note: The database location can be anywhere on your system - it does not need to be within this workflow directory.
Running the Pipeline
Test Data
A test dataset containing three sample pairs is provided in resources/test_data/. To run the workflow with test data:
Ensure you’re in the workflow directory:
cd virgo2_mapping_and_taxonomy
Run the workflow with test data:
snakemake --use-conda --configfile config/config.yaml
Running with Your Own Data
To run the workflow with your own data:
Prepare a samplesheet in CSV format with the following columns:
- sample: Sample identifier
- fastq_1: Path to first read file
- fastq_2: Path to second read file
Update the configuration in config/config.yaml:
- Set the path to your samplesheet
- Adjust resource requirements as needed
Run the workflow:
snakemake --use-conda --configfile config/config.yaml
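If your FASTQ files follow a regular naming scheme, the samplesheet described above can be generated rather than typed by hand. The sketch below is one way to do that, under the assumption that paired files sit in a single directory and are named `<sample>_R1.fastq.gz` and `<sample>_R2.fastq.gz`. The function name and the suffix defaults are hypothetical, so adjust them to your own file names. Writing absolute paths is a choice made here for robustness, not a documented requirement of the workflow.

```python
import csv
from pathlib import Path


def build_samplesheet(fastq_dir, out_csv,
                      r1_suffix="_R1.fastq.gz", r2_suffix="_R2.fastq.gz"):
    """Write a CSV with sample, fastq_1, fastq_2 columns and return its rows."""
    fastq_dir = Path(fastq_dir)
    rows = []
    for r1 in sorted(fastq_dir.glob(f"*{r1_suffix}")):
        # The sample identifier is the file name with the R1 suffix removed.
        sample = r1.name[: -len(r1_suffix)]
        r2 = fastq_dir / f"{sample}{r2_suffix}"
        if not r2.exists():
            raise FileNotFoundError(
                f"missing mate for {r1.name}: expected {r2.name}"
            )
        rows.append({
            "sample": sample,
            "fastq_1": str(r1.resolve()),
            "fastq_2": str(r2.resolve()),
        })
    with open(out_csv, "w", newline="") as handle:
        writer = csv.DictWriter(
            handle, fieldnames=["sample", "fastq_1", "fastq_2"]
        )
        writer.writeheader()
        writer.writerows(rows)
    return rows
```

For example, `build_samplesheet("reads/", "samplesheet.csv")` writes one row per read pair found in `reads/` and raises an error if any R1 file lacks its R2 mate.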
Running on O2 Cluster
To submit the workflow to the O2 cluster, use the provided submission script:
sbatch ./submit_jobs.sh
This is currently configured for the Kwon lab on the O2 cluster.
Output Files
The workflow generates several output files:
- *.summary.NR.txt: Summary of mapping results at the gene level
- *_virgo2_NR_anno.csv: Results with the gene lengths and annotations added
- *_virgo2_metagenomic_taxa.csv: Taxonomic relative abundances calculated from the gene counts corrected for gene length
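To illustrate what correcting gene counts for gene length means for the taxonomic output, here is a minimal sketch of one common approach: divide each gene's read count by its length, sum the results per taxon, and normalize so the values add up to 1. This README does not spell out the workflow's exact formula, so treat this as an assumption rather than the pipeline's implementation. The function name and the dictionary inputs are hypothetical.

```python
from collections import defaultdict


def length_corrected_relative_abundance(gene_counts, gene_lengths, gene_taxa):
    """Return {taxon: fraction} from per-gene counts, lengths (bp), and taxa."""
    coverage = defaultdict(float)
    for gene, count in gene_counts.items():
        taxon = gene_taxa.get(gene)
        if taxon is None:
            continue  # genes without a taxonomic label are skipped in this sketch
        # Dividing by length keeps long genes from dominating the totals.
        coverage[taxon] += count / gene_lengths[gene]
    total = sum(coverage.values())
    if total == 0:
        return {}
    return {taxon: value / total for taxon, value in coverage.items()}
```

With counts `{"g1": 100, "g2": 300, "g3": 50}`, lengths `{"g1": 1000, "g2": 3000, "g3": 500}`, and taxa `{"g1": "A", "g2": "B", "g3": "A"}`, every gene has the same per-base coverage, so taxon A (two genes) receives two thirds of the abundance and taxon B one third, even though B has the most raw reads.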