VIRGO2 Mapping and Taxonomy Pipeline

Overview

The VIRGO2 pipeline maps paired-end sequencing reads to the VIRGO2 database and annotates them with taxonomic and functional information.

Note

VIRGO2 is a resource developed by the Ravel lab at the University of Maryland School of Medicine. The manuscript is available at doi: 10.1101/2025.03.04.641479

This workflow assumes your reads have already been quality-filtered, adapter-trimmed, and host-filtered.

Pipeline Steps

The workflow performs the following steps:

- Maps sequencing reads to the VIRGO2 database
- Generates taxonomic annotations
- Calculates relative abundances
- Produces summary reports and annotated output files

Prerequisites

- conda
- Snakemake (version 8.20.0 or later)
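
As a quick sanity check of the prerequisites above, the following sketch tests whether conda and Snakemake are on your PATH and whether the installed Snakemake meets the 8.20.0 minimum. The helper name version_ge is made up for this example, and the comparison assumes a sort that supports -V (version sort), as GNU coreutils does.

```shell
# Succeeds when version $1 is greater than or equal to version $2.
# Assumes `sort -V` (version sort) is available, as in GNU coreutils.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

required="8.20.0"

if ! command -v conda >/dev/null 2>&1; then
  echo "conda not found on PATH"
fi

if command -v snakemake >/dev/null 2>&1; then
  installed="$(snakemake --version)"
  if version_ge "$installed" "$required"; then
    echo "Snakemake $installed meets the $required minimum"
  else
    echo "Snakemake $installed is older than $required"
  fi
else
  echo "snakemake not found on PATH"
fi
```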

For O2 setup instructions, visit here.

Installation

To install the workflow, run:

curl -L https://github.com/kwondry/virgo2_mapping_and_taxonomy/archive/refs/heads/main.zip -o main.zip
unzip main.zip && rm main.zip

Database Setup

Important

The VIRGO2 database is currently available via Dropbox. After publication, the files will be available from the Ravel lab on Zenodo.

To set up the database:

- Get the Dropbox link from Michael France
- Download and extract the database files to your preferred location
- Update the virgo2 section in config/config.yaml with the absolute path to your database location

Note: The database can be stored anywhere on your system; it does not need to be inside this workflow directory.
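
Because the configuration expects an absolute path, it may help to resolve the database directory before editing config/config.yaml. The sketch below is only an illustration: the VIRGO2_DB_DIR variable, the default location, and the abs_dir helper are all hypothetical and not part of the workflow.

```shell
# Hypothetical location: replace with the directory where you extracted the database.
db_dir="${VIRGO2_DB_DIR:-$HOME/databases/VIRGO2}"

# Print the absolute, symlink-resolved form of a directory.
# Fails (non-zero status) if the directory does not exist.
abs_dir() (
  cd "$1" 2>/dev/null && pwd -P
)

if db_abs="$(abs_dir "$db_dir")"; then
  echo "Use this path in config/config.yaml: $db_abs"
else
  echo "Directory not found: $db_dir" >&2
fi
```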

Running the Pipeline

Test Data

A test dataset is provided in resources/test_data/ containing three sample pairs. To run the workflow with test data:

Ensure you’re in the workflow directory:
```
cd virgo2_mapping_and_taxonomy-main
```

Run the workflow with test data:

snakemake --use-conda --configfile config/config.yaml

Running with Your Own Data

To run the workflow with your own data:

Prepare a samplesheet in CSV format with the following columns:
- sample: Sample identifier
- fastq_1: Path to the first read file
- fastq_2: Path to the second read file

Update the configuration in config/config.yaml:
- Set the path to your samplesheet
- Adjust resource requirements as needed
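
If your reads sit in a single directory, a samplesheet with the three columns described above can be generated rather than typed by hand. This is a minimal sketch that assumes files are named <sample>_R1.fastq.gz and <sample>_R2.fastq.gz; that naming pattern and the make_samplesheet function are assumptions for illustration, so adjust them to match your data.

```shell
# Print a samplesheet (sample,fastq_1,fastq_2) for every R1/R2 pair in a directory.
# Assumes the hypothetical naming pattern <sample>_R1.fastq.gz / <sample>_R2.fastq.gz.
make_samplesheet() (
  fastq_dir="$1"
  echo "sample,fastq_1,fastq_2"
  for r1 in "$fastq_dir"/*_R1.fastq.gz; do
    # If nothing matches, the glob stays literal and the file does not exist.
    [ -e "$r1" ] || continue
    r2="${r1%_R1.fastq.gz}_R2.fastq.gz"
    if [ ! -e "$r2" ]; then
      echo "warning: no R2 file for $r1, skipping" >&2
      continue
    fi
    sample="$(basename "$r1" _R1.fastq.gz)"
    echo "$sample,$r1,$r2"
  done
)

# Example (paths are placeholders):
#   make_samplesheet /path/to/fastqs > samplesheet.csv
```

Paths are written exactly as they are passed in, so give the function an absolute directory if you want absolute paths in the samplesheet.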

Run the workflow:

snakemake --use-conda --configfile config/config.yaml
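
Before launching a long run, it can be worth confirming that the samplesheet has the expected header and that every listed FASTQ file exists. The check below is a sketch, not part of the workflow: check_samplesheet is a hypothetical helper, and it assumes the columns appear in the order sample, fastq_1, fastq_2 with no quoted fields.

```shell
# Return non-zero if the header is unexpected or any listed FASTQ file is missing.
# Assumes the column order sample,fastq_1,fastq_2 and no quoted or comma-containing fields.
check_samplesheet() (
  sheet="$1"
  header="$(head -n 1 "$sheet" | tr -d '\r')"
  if [ "$header" != "sample,fastq_1,fastq_2" ]; then
    echo "unexpected header: $header" >&2
    exit 1
  fi
  tail -n +2 "$sheet" | tr -d '\r' | {
    status=0
    # `|| [ -n "$sample" ]` keeps a final row that lacks a trailing newline.
    while IFS=',' read -r sample fastq_1 fastq_2 || [ -n "$sample" ]; do
      [ -n "$sample" ] || continue   # skip blank lines
      for f in "$fastq_1" "$fastq_2"; do
        if [ ! -f "$f" ]; then
          echo "$sample: missing file $f" >&2
          status=1
        fi
      done
    done
    exit "$status"
  }
)

# Example (file name is a placeholder):
#   check_samplesheet samplesheet.csv && echo "samplesheet looks OK"
```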

Running on O2 Cluster

To submit the workflow to the O2 cluster, use the provided submission script:

sbatch ./submit_jobs.sh

The submission script is currently configured for the Kwon lab's setup on the O2 cluster.

Output Files

The workflow generates several output files:

- *.summary.NR.txt: Summary of mapping results at the gene level
- *_virgo2_NR_anno.csv: Results with the gene lengths and annotations added
- *_virgo2_metagenomic_taxa.csv: Taxonomic relative abundances calculated from the gene counts corrected for gene length
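
To illustrate what "gene counts corrected for gene length" can mean, the sketch below divides each gene's read count by its length, sums the result per taxon, and normalizes so the taxa sum to 1. It is one common way to do this and not necessarily the workflow's exact calculation. The input layout (a CSV with header gene,taxon,length,count and no quoted fields) and the function name are hypothetical and do not reflect the real output file format.

```shell
# Print "taxon,relative_abundance" from a CSV with the hypothetical columns
# gene,taxon,length,count. Each count is divided by gene length before summing.
length_corrected_abundance() {
  awk -F',' '
    NR == 1 { next }                 # skip the header row
    $3 > 0 {
      cov = $4 / $3                  # reads per base of gene
      taxon_cov[$2] += cov
      total += cov
    }
    END {
      if (total == 0) exit           # nothing mapped: print nothing
      for (t in taxon_cov) printf "%s,%.4f\n", t, taxon_cov[t] / total
    }
  ' "$1" | sort
}

# Example (file name is a placeholder):
#   length_corrected_abundance gene_counts.csv
```

For instance, if one taxon has genes with 300 reads over 1000 bp and 50 reads over 500 bp (0.3 + 0.1 = 0.4) and another has a single gene with 400 reads over 2000 bp (0.2), the relative abundances come out to about 0.667 and 0.333, even though the second taxon has more raw reads.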