VIRGO2 Mapping and Taxonomy Pipeline
Overview
The VIRGO2 pipeline maps paired-end sequencing reads to the VIRGO2 database and annotates them with taxonomic and functional information.
VIRGO2 is a resource developed by the Ravel lab at the University of Maryland School of Medicine. The manuscript is available at doi: 10.1101/2025.03.04.641479
This workflow assumes you have already quality filtered, adapter trimmed, and host filtered your reads.
Pipeline Steps
The workflow performs the following steps:
- Maps sequencing reads to the VIRGO2 database
 - Generates taxonomic annotations
 - Calculates relative abundances
 - Produces summary reports and annotated output files
 
Prerequisites
- conda
 - Snakemake (version 8.20.0 or later)
 
For O2 setup instructions, visit here.
Installation
To install the workflow, run:
curl -L https://github.com/kwondry/virgo2_mapping_and_taxonomy/archive/refs/heads/main.zip -o main.zip
unzip main.zip && rm main.zip
Database Setup
The VIRGO2 database is currently available via Dropbox. After publication, the files will be available from the Ravel lab on Zenodo.
To set up the database:
- Get the Dropbox link from Michael France
 - Download and extract the database files to your preferred location
 - Update the virgo2 section in config/config.yaml with the absolute path to your database location
Note: The database location can be anywhere on your system - it does not need to be within this workflow directory.
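Because the configuration expects an absolute path, it can help to resolve the database location before pasting it into config/config.yaml. The following is a minimal sketch; the function name and the example location are hypothetical and are not part of the workflow.

```python
from pathlib import Path


def absolute_database_path(location):
    """Expand '~' and turn a possibly relative location into an absolute path."""
    return Path(location).expanduser().resolve()


# Hypothetical location; replace with wherever you extracted the database.
db_path = absolute_database_path("~/databases/virgo2")
print(db_path)
```

The printed path is the value to copy into the virgo2 section of the config file.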
Running the Pipeline
Test Data
A test dataset is provided in resources/test_data/ containing three sample pairs. To run the workflow with test data:
Ensure you’re in the workflow directory:
cd virgo2_mapping_and_taxonomy
Run the workflow with test data:
snakemake --use-conda --configfile config/config.yaml
Running with Your Own Data
To run the workflow with your own data:
Prepare a samplesheet in CSV format with the following columns:
- sample: Sample identifier
 - fastq_1: Path to first read file
 - fastq_2: Path to second read file
 
Update the configuration in config/config.yaml:
- Set the path to your samplesheet
 - Adjust resource requirements as needed
 
Run the workflow:
snakemake --use-conda --configfile config/config.yaml
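The samplesheet described above can be written by hand, but for many samples a short script is less error-prone. This sketch uses only the Python standard library and writes the three columns listed above (sample, fastq_1, fastq_2). The sample names and file paths are invented placeholders, and the helper name is hypothetical.

```python
import csv
import io


def write_samplesheet(rows, handle):
    """Write (sample, fastq_1, fastq_2) tuples as CSV with the expected header."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["sample", "fastq_1", "fastq_2"])
    for sample, fastq_1, fastq_2 in rows:
        writer.writerow([sample, fastq_1, fastq_2])


# Placeholder samples and paths, for illustration only.
rows = [
    ("sampleA", "/data/reads/sampleA_R1.fastq.gz", "/data/reads/sampleA_R2.fastq.gz"),
    ("sampleB", "/data/reads/sampleB_R1.fastq.gz", "/data/reads/sampleB_R2.fastq.gz"),
]

buffer = io.StringIO()
write_samplesheet(rows, buffer)
samplesheet_text = buffer.getvalue()
print(samplesheet_text)
```

To write to disk instead, pass a file opened with `open("samplesheet.csv", "w", newline="")` in place of the in-memory buffer.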
Running on O2 Cluster
To submit the workflow to the O2 cluster, use the provided submission script:
sbatch ./submit_jobs.sh
This is currently configured for the Kwon lab on the O2 cluster.
Output Files
The workflow generates several output files:
- *.summary.NR.txt: Summary of mapping results at the gene level
 - *_virgo2_NR_anno.csv: Results with the gene lengths and annotations added
 - *_virgo2_metagenomic_taxa.csv: Taxonomic relative abundances calculated from the gene counts corrected for gene length
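The exact formula the workflow uses for the taxonomic table is not spelled out here. As an illustration of what "gene counts corrected for gene length" can mean, the sketch below divides each gene's read count by its length, sums those values per taxon, and normalizes so the taxa sum to 1. This is one common approach and is an assumption, not necessarily the workflow's implementation; the function name and toy values are invented.

```python
from collections import defaultdict


def length_corrected_relative_abundance(genes):
    """Compute per-taxon relative abundance from (taxon, read_count, gene_length_bp) tuples.

    Each count is divided by its gene length, summed per taxon,
    and then scaled so that all taxa sum to 1.
    """
    coverage = defaultdict(float)
    for taxon, count, length in genes:
        if length <= 0:
            raise ValueError("gene length must be positive")
        coverage[taxon] += count / length
    total = sum(coverage.values())
    if total == 0:
        return {taxon: 0.0 for taxon in coverage}
    return {taxon: value / total for taxon, value in coverage.items()}


# Toy input: two genes from one taxon and one gene from another.
toy_genes = [
    ("taxon_A", 300, 1000),
    ("taxon_A", 150, 500),
    ("taxon_B", 200, 1000),
]
abundances = length_corrected_relative_abundance(toy_genes)
for taxon, fraction in abundances.items():
    print(f"{taxon}\t{fraction:.3f}")
```

Without the length correction, taxon_A would be weighted by raw counts alone (450 of 650 reads); dividing by gene length first keeps long genes from inflating a taxon's share.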