16S Pipeline Tutorial

A Snakemake workflow for processing 16S rRNA sequencing data using DADA2, VALENCIA for community state type (CST) classification, and custom scripts for spike-in adjustments.

This workflow assumes your reads are demultiplexed.

Prerequisites

Required Software:
- Snakemake (v8.20 or higher)
- Conda or Mamba
- Python 3.10 or higher
- R 4.3 or higher
Required Databases:
- HISAT2 human reference database
- GTDB reference databases
- SpeciateIT VALENCIA database

Installation

To install the workflow, run:

# Clone the repository without history and enter the project directory
git clone --depth=1 https://github.com/kwondry/16s_dada2_valencia.git

cd 16s_dada2_valencia

This will create a fresh copy of the workflow without any git history.

Configuration and set up

Database Paths: Edit config/config.yaml to set the paths to your reference databases:
```
hisat_db: "/path/to/hisat2/human/reference"  # Path to HISAT2 index of T2T human genome
gtdb_tax_db: "/path/to/GTDB/taxonomy/database"  # DADA2 GTDB taxonomy database
gtdb_species_db: "/path/to/GTDB/species/database"  # DADA2 GTDB species database 
speciateit_db: "/path/to/speciateit_valencia"  # SpeciateIT VALENCIA database
```
To obtain these databases:
- HISAT2 human reference: Download the T2T human genome from NCBI and index it using HISAT2’s hisat2-build command
- GTDB databases: Download the DADA2-formatted GTDB reference files from Zenodo
- SpeciateIT database: Clone the speciateIT databases and VALENCIA scripts from the Ravel Lab GitHub

DADA2 Parameters: The workflow includes default parameters for DADA2 processing. You can modify these in config/config.yaml:

dada2:
  pe:
    maxEE: 3
    truncLen: [286, 260]
    trimLeft: [10, 10]
    merge:
      minOverlap: 8
      maxMismatch: 1
  se:
    maxEE: 1
    truncLen: 230
    trimLeft: 10

Primer Sequences: Update the primer sequences in config/config.yaml if using different primers:

primers:
  v3v4:
    forward: "ACTCCTRCGGGAGGCAGCAG"
    reverse: "GGACTACHVGGGTWTCTAAT"

Input Data

This workflow can run process multiple sequencing runs in parralel.

Note the folder path of demultiplexed .fastq files and mapping file for each run.
Create/edit a sequencing runsheet CSV file with the following columns:
- run_id: Run identifier
- mapping_file: Path to mapping file containing sample IDs
- fastq_dir: Path to directory containing FASTQ files
- region: Sequencing region used (V4 or V3V4)
The mapping file should contain unique sample identifiers in a column #SampleID that match the FASTQ filenames in the fastq_dir.
The workflow will automatically process both single-end and paired-end data based on the FASTQ files present.

Running the Workflow

Test Run:
```
snakemake -n
```
Local Execution:
```
snakemake --use-conda --cores all
```
SLURM Execution:
```
 sbatch submit_jobs.sh
```

Output

The workflow generates two main outputs in the outputs/results directory:

Two versions of a phyloseq object containing ASV counts, taxonomy assignments, CST assignment, and sample metadata with either a vaginal-specific taxonomy db, speciateIT or GTDB.
A MultiQC report summarizing read quality metrics and taxonomic composition