VIRGO2 Mapping and Taxonomy Pipeline

Overview

The VIRGO2 pipeline is used to map reads to VIRGO2 and annotate the reads with taxonomic and functional information. It processes paired-end sequencing data and generates comprehensive taxonomic and functional annotations using the VIRGO2 database.

Note

VIRGO2 is a resource developed by the Ravel lab at University of Maryland School of Medicine. The manuscript is available here: doi: 10.1101/2025.03.04.641479

This workflow assumes you have already quality filtered, adapter trimmed, and host filtered your reads.

Pipeline Steps

The workflow performs the following steps:

  1. Maps sequencing reads to the VIRGO2 database
  2. Generates taxonomic annotations
  3. Calculates relative abundances
  4. Produces summary reports and annotated output files

Prerequisites

  • conda
  • Snakemake (version 8.20.0 or later)

For O2 setup instructions, visit here.

Installation

To install the workflow, run:

curl -L https://github.com/kwondry/virgo2_mapping_and_taxonomy/archive/refs/heads/main.zip -o main.zip
unzip main.zip && rm main.zip

Database Setup

Important

The VIRGO2 database is currently available via Dropbox. After publication, the files will be available from the Ravel lab on Zenodo.

To set up the database:

  1. Get the Dropbox link from Michael France
  2. Download and extract the database files to your preferred location
  3. Update the virgo2 section in config/config.yaml with the absolute path to your database location

Note: The database location can be anywhere on your system - it does not need to be within this workflow directory.

Running the Pipeline

Test Data

A test dataset is provided in resources/test_data/ containing three sample pairs. To run the workflow with test data:

  1. Ensure you’re in the workflow directory:

    cd virgo2_mapping_and_taxonomy
  2. Run the workflow with test data:

    snakemake --use-conda --configfile config/config.yaml

Running with Your Own Data

To run the workflow with your own data:

  1. Prepare a samplesheet in CSV format with the following columns:

    • sample: Sample identifier
    • fastq_1: Path to first read file
    • fastq_2: Path to second read file
  2. Update the configuration in config/config.yaml:

    • Set the path to your samplesheet
    • Adjust resource requirements as needed
  3. Run the workflow:

    snakemake --use-conda --configfile config/config.yaml

Running on O2 Cluster

To submit the workflow to the O2 cluster, use the provided submission script:

sbatch ./submit_jobs.sh

This is currently configured for the Kwon lab on the O2 cluster.

Output Files

The workflow generates several output files:

  • *.summary.NR.txt: Summary of mapping results at the gene level
  • *_virgo2_NR_anno.csv: Results with the gene lengths and annotations added
  • *_virgo2_metagenomic_taxa.csv: Taxonomic relative abundances calculated from the gene counts corrected for gene length