16S Pipeline Tutorial
A Snakemake workflow for processing 16S rRNA sequencing data using DADA2, VALENCIA for community state type (CST) classification, and custom scripts for spike-in adjustments.
This workflow assumes your reads are demultiplexed.
Prerequisites
- Required Software:
  - Snakemake (v8.20 or higher)
  - Conda or Mamba
  - Python 3.10 or higher
  - R 4.3 or higher
- Required Databases:
  - HISAT2 human reference database
  - GTDB reference databases
  - SpeciateIT VALENCIA database
Installation
To install the workflow, run:
```shell
# Clone the repository without history and enter the project directory
git clone --depth=1 https://github.com/kwondry/16s_dada2_valencia.git
cd 16s_dada2_valencia
```

This creates a fresh copy of the workflow without any git history.
Configuration and Setup
Database Paths: Edit `config/config.yaml` to set the paths to your reference databases:

```yaml
hisat_db: "/path/to/hisat2/human/reference"        # Path to HISAT2 index of T2T human genome
gtdb_tax_db: "/path/to/GTDB/taxonomy/database"     # DADA2 GTDB taxonomy database
gtdb_species_db: "/path/to/GTDB/species/database"  # DADA2 GTDB species database
speciateit_db: "/path/to/speciateit_valencia"      # SpeciateIT VALENCIA database
```

To obtain these databases:
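Before launching the pipeline, it can help to confirm that the configured database paths actually exist. A minimal sketch, using the placeholder paths shown above (these are not real locations; substitute your own):

```python
from pathlib import Path

# Placeholder paths mirroring the keys in config/config.yaml;
# replace these with your actual database locations.
db_paths = {
    "hisat_db": "/path/to/hisat2/human/reference",
    "gtdb_tax_db": "/path/to/GTDB/taxonomy/database",
    "gtdb_species_db": "/path/to/GTDB/species/database",
    "speciateit_db": "/path/to/speciateit_valencia",
}

# Collect any configured paths that are missing on this machine
missing = [key for key, p in db_paths.items() if not Path(p).exists()]
if missing:
    print("Missing database paths:", ", ".join(missing))
```

Running this before a full pipeline launch catches path typos early, rather than partway through a long Snakemake run.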
DADA2 Parameters: The workflow includes default parameters for DADA2 processing. You can modify these in `config/config.yaml`:

```yaml
dada2:
  pe:
    maxEE: 3
    truncLen: [286, 260]
    trimLeft: [10, 10]
    merge:
      minOverlap: 8
      maxMismatch: 1
  se:
    maxEE: 1
    truncLen: 230
    trimLeft: 10
```

Primer Sequences: Update the primer sequences in `config/config.yaml` if using different primers:

```yaml
primers:
  v3v4:
    forward: "ACTCCTRCGGGAGGCAGCAG"
    reverse: "GGACTACHVGGGTWTCTAAT"
```
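When adjusting `truncLen` and `trimLeft`, make sure the truncated reads still overlap by at least `minOverlap` bases, or DADA2 will fail to merge read pairs. A quick back-of-the-envelope check, assuming a roughly 465 bp V3V4 amplicon (the exact length depends on your primers and organisms):

```python
def expected_overlap(trunc_len, trim_left, amplicon_len):
    """Rough overlap remaining after truncation and left-trimming,
    assuming forward and reverse reads each start at one end of the amplicon."""
    fwd = trunc_len[0] - trim_left[0]  # usable forward-read length
    rev = trunc_len[1] - trim_left[1]  # usable reverse-read length
    return fwd + rev - amplicon_len

# With the workflow defaults and an assumed ~465 bp V3V4 amplicon:
overlap = expected_overlap([286, 260], [10, 10], 465)
print(overlap)  # 61, comfortably above the default minOverlap of 8
```

If this number drops near or below your `minOverlap` setting, relax the truncation lengths before rerunning.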
Input Data
This workflow can process multiple sequencing runs in parallel.
Note the folder path of the demultiplexed `.fastq` files and the mapping file for each run. Create or edit a sequencing runsheet CSV file with the following columns:

- `run_id`: Run identifier
- `mapping_file`: Path to the mapping file containing sample IDs
- `fastq_dir`: Path to the directory containing FASTQ files
- `region`: Sequencing region used (V4 or V3V4)
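A minimal runsheet might look like the following (the run IDs and paths here are hypothetical examples, not files shipped with the workflow):

```csv
run_id,mapping_file,fastq_dir,region
run01,mapping/run01_mapping.txt,fastq/run01,V3V4
run02,mapping/run02_mapping.txt,fastq/run02,V4
```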
The mapping file should contain unique sample identifiers in a `#SampleID` column that match the FASTQ filenames in the `fastq_dir`. The workflow automatically processes both single-end and paired-end data based on the FASTQ files present.
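The paired-versus-single-end decision comes down to whether both mates are present for a sample. A sketch of that logic, assuming the common `_R1`/`_R2` filename convention (the actual workflow's detection code may differ):

```python
from pathlib import Path

def classify_samples(fastq_dir):
    """Group FASTQ files per sample and flag a sample as paired-end
    when both _R1 and _R2 files are present (naming convention assumed)."""
    mates = {}
    for f in Path(fastq_dir).glob("*.fastq*"):
        name = f.name
        if "_R1" in name:
            sample, mate = name.split("_R1")[0], "R1"
        elif "_R2" in name:
            sample, mate = name.split("_R2")[0], "R2"
        else:
            # No mate tag: treat the whole stem as a single-end sample
            sample, mate = name.split(".fastq")[0], "R1"
        mates.setdefault(sample, set()).add(mate)
    return {s: ("paired" if m == {"R1", "R2"} else "single")
            for s, m in mates.items()}
```

For example, a directory containing `sampleA_R1.fastq.gz` and `sampleA_R2.fastq.gz` would classify `sampleA` as paired, while a lone `sampleB.fastq.gz` would be single-end.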
Running the Workflow
Test Run:

```shell
snakemake -n
```

Local Execution:

```shell
snakemake --use-conda --cores all
```

SLURM Execution:

```shell
sbatch submit_jobs.sh
```
Output
The workflow generates two main outputs in the `outputs/results` directory:

- Two versions of a phyloseq object containing ASV counts, taxonomy assignments, CST assignments, and sample metadata: one built with the vaginal-specific SpeciateIT taxonomy database and one with GTDB.
- A MultiQC report summarizing read quality metrics and taxonomic composition.