library(DESeq2)
library(tidyverse)
library(microViz)
<- read_tsv("salmon.merged.gene_counts_length_scaled.tsv")
counts
<- counts %>% select(-gene_name) %>%
counts_for_deseq as.data.frame() %>%
column_to_rownames("gene_id") %>%
as.matrix()
<- counts %>% select(gene_id, gene_name) %>%
gene_names_for_deseq as.data.frame() %>%
column_to_rownames("gene_id")
<- tibble(sample_id = counts %>% select(-gene_id, -gene_name) %>% colnames()) %>% left_join (read_csv("SKH_Expt_P10_metadata.csv")) %>%
sample_metadata_for_deseq as.data.frame() %>%
column_to_rownames("sample_id")
<- DESeqDataSetFromMatrix(countData=round(counts_for_deseq), colData=sample_metadata_for_deseq, rowData=gene_names_for_deseq, design=~treatment_code) deseq_dataset
Background
Work in progress
Introduction to the pipeline
The goal of this pipeline is to take raw short-reads generated from an RNASeq library generation and sequencing protocol and do all the basic QC steps as well as quantify gene expression. Currently it only works for human RNAseq data - the reference data base is hard coded.
Inputs
This pipeline takes a sample sheet and fastq files (ideally in /n/groups/kwon/data1/sequencing_run_archive_DO_NOT_EDIT/
) as input and works for both single and paired-end reads.
input sample_sheet.csv
columns
sample | fastq_1 | fastq_2 | strandedness |
---|---|---|---|
Sample IDs | File path for forward reads | File path for reverse reads. If your sequences are single end, leave this column empty | Set as auto for each sample. However, for samples with high contaminants coming from smartseq2, it should be set as both . |
See the walkthrough for information on how to make your samplesheet.
outputs: QC
After successfully running the pipeline, you will have a results
folder in your project directory with the output from the pipeline.
First, take a look at the MultiQC report in /n/groups/kwon/johnathan/projects/simran_rnaseq/my_rnaseq_run/rnaseq_gene_expression-main/results/multiqc/star_salmon/nfcore_rnaseq_multiqc_report.html
You’ll see a summary table at the beginning:
Here you want to check how many aligned sequences you had for each sample. You want 20-30 million reads per sample.
Also take a look at the breakdown of the types of sequences you got:
Depending on the type of library prep you did the might be different expectations here. But most library preps for gene expression analysis enrich for mature mRNA, so you should see mostly exonic reads.
For a comprehensive breakdown of the different QC plots you expect, check https://nf-co.re/rnaseq/dev/docs/output.
outputs: gene expression.
In the results/star_salmon folder, you’ll see a ton of folders and files. The main ones to focus on are:
star_salmon.merged.gene_counts.tsv
: Matrix of gene-level raw counts across all samples.
star_salmon.gene_tpm.tsv
: Matrix of gene-level TPM values across all samples.
star_salmon.merged.gene_lengths.tsv
: Matrix of average within-sample transcript lengths for each gene across all samples.
star_salmon.merged.gene_counts_scaled.tsv
: Matrix of gene-level library size-scaled estimated counts across all samples.
For import into DESeq2 we’ll use star_salmon.merged.gene_lengths.tsv
. If you want to plot heatmaps and nromalized gene expression boxplots, use star_salmon.gene_tpm.tsv
.
Beyond the pipeline
Typically you’ll want to explore your RNAseq dataset for more quality checks and sanity checks, and look for differences in gene expression with different treatments.
The following code is an example of how to get your data into deseq2. (Make sure you have the recommended packages installed by following the R installation instructions).
Then, read through and follow along with the DEseq2 vignette, starting with the “Pre-filtering” section.
As an example, here is a code chunk that combines the pre-filtering example and the PCA code from the vignette.
<- 3
smallestGroupSize <- rowSums(counts(deseq_dataset) >= 10) >= smallestGroupSize
keep <- deseq_dataset[keep,]
dds
<- vst(dds, blind=FALSE)
vsd <- plotPCA(vsd, intgroup=c("polyamine", "inhibitor"), returnData=TRUE)
pcaData <- round(100 * attr(pcaData, "percentVar"))
percentVar ggplot(pcaData, aes(PC1, PC2, color=polyamine, shape=inhibitor)) +
geom_point(size=3) +
xlab(paste0("PC1: ",percentVar[1],"% variance")) +
ylab(paste0("PC2: ",percentVar[2],"% variance")) +
coord_fixed()
microViz
hack, largely untested.
This following code is a hack to get your already normalized data into a phyloseq
object, which is typically only used for 16S analysis. However, there are very nice ordination functions and heatmap functions available for phyloseq objects in the microViz
package, so it might be worth the hackiness.
library(microViz)
<- read_tsv("salmon.merged.gene_tpm.tsv")
gene_abundance
<- gene_abundance %>%
gene_count_table select(-gene_name) %>%
as.data.frame() %>%
column_to_rownames("gene_id") %>%
as.matrix() %>%
::otu_table(taxa_are_rows = TRUE)
phyloseq
<- gene_abundance %>% select(gene_id, gene_name) %>%
gene_annotations as.data.frame() %>%
column_to_rownames("gene_id") %>%
as.matrix() %>%
::tax_table()
phyloseq
<- tibble(sample_id = counts %>% select(-gene_id, -gene_name) %>% colnames()) %>% left_join (read_csv("SKH_Expt_P10_metadata.csv")) %>%
sample_metadata as.data.frame() %>%
column_to_rownames("sample_id")
::sam_data()
phyloseq
<- phyloseq::phyloseq(gene_count_table, gene_annotations, sample_metadata) %>% tax_fix() rnaseq_dataset