16S Pipeline Tutorial
A Snakemake workflow for processing 16S rRNA sequencing data using DADA2, VALENCIA for community state type (CST) classification, and custom scripts for spike-in adjustments.
This workflow assumes your reads are demultiplexed.
Prerequisites
- Required Software:
  - Snakemake (v8.20 or higher)
  - Conda or Mamba
  - Python 3.10 or higher
  - R 4.3 or higher
- Required Databases:
  - HISAT2 human reference database
  - GTDB reference databases
  - SpeciateIT VALENCIA database
Installation
To install the workflow, run:

```bash
# Clone the repository without history and enter the project directory
git clone --depth=1 https://github.com/kwondry/16s_dada2_valencia.git
cd 16s_dada2_valencia
```

This creates a fresh copy of the workflow without any git history.
Configuration and Setup
Database Paths: Edit `config/config.yaml` to set the paths to your reference databases:

```yaml
hisat_db: "/path/to/hisat2/human/reference"       # Path to HISAT2 index of T2T human genome
gtdb_tax_db: "/path/to/GTDB/taxonomy/database"    # DADA2 GTDB taxonomy database
gtdb_species_db: "/path/to/GTDB/species/database" # DADA2 GTDB species database
speciateit_db: "/path/to/speciateit_valencia"     # SpeciateIT VALENCIA database
```
Refer to each tool's documentation for instructions on obtaining these databases.
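A typo in any of these paths will only surface once the corresponding rule runs, so it can help to check them up front. A minimal sketch, assuming the paths are collected into a plain dict (the `missing_databases` helper and the placeholder paths are illustrative, not part of the workflow):

```python
from pathlib import Path

# Keys mirror the config/config.yaml entries; the values are placeholders.
db_paths = {
    "hisat_db": "/path/to/hisat2/human/reference",
    "gtdb_tax_db": "/path/to/GTDB/taxonomy/database",
    "gtdb_species_db": "/path/to/GTDB/species/database",
    "speciateit_db": "/path/to/speciateit_valencia",
}

def missing_databases(paths):
    """Return the config keys whose path does not exist on disk."""
    return [key for key, p in paths.items() if not Path(p).exists()]

# With the placeholder values above, every key is reported as missing.
print(missing_databases(db_paths))
```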
DADA2 Parameters: The workflow includes default parameters for DADA2 processing. You can modify these in `config/config.yaml`:

```yaml
dada2:
  pe:
    maxEE: 3
    truncLen: [286, 260]
    trimLeft: [10, 10]
    merge:
      minOverlap: 8
      maxMismatch: 1
  se:
    maxEE: 1
    truncLen: 230
    trimLeft: 10
```
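When changing `truncLen`, the truncated reads must still overlap by at least `minOverlap` bases for merging. A quick sanity check, assuming a roughly 465 bp V3V4 amplicon (an approximation; the `merge_overlap` helper is illustrative, not part of the workflow):

```python
# Bases of overlap left in the middle of the amplicon after truncation.
# trimLeft removes bases at the outer ends of the amplicon, so only
# truncLen affects the overlap between the two truncated reads.
def merge_overlap(amplicon_len, trunc_len_fwd, trunc_len_rev):
    return trunc_len_fwd + trunc_len_rev - amplicon_len

# Defaults from config/config.yaml; ~465 bp V3V4 amplicon is an assumption.
overlap = merge_overlap(465, 286, 260)
print(overlap)      # 81
assert overlap >= 8  # must meet the configured minOverlap
```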
Primer Sequences: Update the primer sequences in `config/config.yaml` if using different primers:

```yaml
primers:
  v3v4:
    forward: "ACTCCTRCGGGAGGCAGCAG"
    reverse: "GGACTACHVGGGTWTCTAAT"
```
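These primer sequences contain IUPAC degenerate bases (R, H, V, W), each standing for a set of nucleotides. If you need to locate a primer in raw reads outside the workflow, a degenerate primer can be compiled into a regular expression; a sketch of the idea (the `primer_to_regex` helper is illustrative, not part of the workflow):

```python
import re

# IUPAC degenerate-base codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "W": "[AT]", "S": "[CG]",
    "K": "[GT]", "M": "[AC]",
    "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]",
    "N": "[ACGT]",
}

def primer_to_regex(primer: str) -> re.Pattern:
    """Compile a degenerate primer into a regex matching any expansion."""
    return re.compile("".join(IUPAC[base] for base in primer.upper()))

fwd = primer_to_regex("ACTCCTRCGGGAGGCAGCAG")
print(bool(fwd.match("ACTCCTACGGGAGGCAGCAG")))  # True: R matches A
print(bool(fwd.match("ACTCCTGCGGGAGGCAGCAG")))  # True: R matches G
```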
Input Data
This workflow can process multiple sequencing runs in parallel.
Note the folder path of the demultiplexed `.fastq` files and the mapping file for each run. Then create or edit a sequencing runsheet CSV file with the following columns:
- `run_id`: Run identifier
- `mapping_file`: Path to the mapping file containing sample IDs
- `fastq_dir`: Path to the directory containing FASTQ files
- `region`: Sequencing region used (V4 or V3V4)

The mapping file should contain unique sample identifiers in a `#SampleID` column that match the FASTQ filenames in the `fastq_dir`. The workflow will automatically process both single-end and paired-end data based on the FASTQ files present.
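A runsheet row that points at a missing mapping file or FASTQ directory will only fail once the workflow reaches it, so validating the CSV up front can save a wasted run. A minimal sketch of such a check (the `validate_runsheet` helper is illustrative, not part of the workflow):

```python
import csv
from pathlib import Path

REQUIRED_COLUMNS = {"run_id", "mapping_file", "fastq_dir", "region"}

def validate_runsheet(runsheet_path):
    """Return a list of problems found in the runsheet (empty if it is OK)."""
    problems = []
    with open(runsheet_path, newline="") as fh:
        reader = csv.DictReader(fh)
        missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
        if missing:
            return [f"missing columns: {sorted(missing)}"]
        for lineno, row in enumerate(reader, start=2):
            if row["region"] not in {"V4", "V3V4"}:
                problems.append(f"row {lineno}: unknown region {row['region']!r}")
            if not Path(row["mapping_file"]).is_file():
                problems.append(f"row {lineno}: mapping file not found")
            if not Path(row["fastq_dir"]).is_dir():
                problems.append(f"row {lineno}: fastq_dir not found")
    return problems
```

Running it before `snakemake` turns a mid-pipeline failure into an immediate, readable error list.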
Running the Workflow
Dry Run (preview the jobs without executing them):

```bash
snakemake -n
```

Local Execution:

```bash
snakemake --use-conda --cores all
```

SLURM Execution:

```bash
sbatch submit_jobs.sh
```
Output
The workflow generates two main outputs in the `outputs/results` directory:
- Two versions of a phyloseq object containing ASV counts, taxonomy assignments, CST assignments, and sample metadata: one built with the vaginal-specific SpeciateIT taxonomy database and one with GTDB.
- A MultiQC report summarizing read quality metrics and taxonomic composition.