# Downloading the pipeline
//github.com/kwondry/rnaseq_gene_expression/archive/refs/heads/main.zip
wget https:
# Unzip and rename the folder
zip -d my_rnaseq_run && rm main.zip
unzip main.
# Navigate into your rnaseq pipeline folder
/rnaseq_gene_expression-main cd my_rnaseq_run
RNA-Seq Pipeline Tutorial
Detailed steps on processing RNA Sequencing data
Here we walk through the Kwon Lab RNA-Seq pipeline. The pipeline integrates steps from an nf-core rnaseq pipeline into an executable snakemake
pipeline. The end product is a gene expression matrix and a QC report for further downstream analysis.
Estimated time for completion ~1 hour
Prequisites
Ensure you have O2 access with
mamba
andsnakemake
installed. For O2 setup instructions, visit here.We begin with raw sequencing data ( must be .gz files) on O2 in
/n/groups/kwon/data1/sequencing_run_archive_DO_NOT_EDIT/<your sequencing run>
. Please note where the sequences for your samples are located.
Downloading the pipeline
Launch VSCode on an O2 compute node. More instructions can be found here.
Open a terminal within VSCode. Click terminal in the top bar and new terminal. You should be on a compute node and within your groups folder. You can check where you currently are in the file system with pwd
.
Download the pipeline by running the following commands. This is where your pipeline will be located and you can go into the folder with cd
. Running ls
will show all pipeline files.
You will now have a rnaseq_gene_expression-main
folder that contains all the required files.
Creating a sample sheet
A sample sheet will define which samples you want to include in the run and provide file paths for each sample.
Download an example sample sheet here.
Edit this sample sheet with explicit file paths for each sample that want included in the run. You can edit the raw csv in Excel or any editor you’d like and add your samples with explicit, absolute filepaths starting from /n/
.
input sample_sheet.csv
columns
sample | fastq_1 | fastq_2 | strandedness |
---|---|---|---|
Sample IDs | File path for forward reads | File path for reverse reads. If your sequences are single end, leave this column empty | Set as auto for each sample. However, for samples with high contaminants coming from smartseq2, it should be set as both . |
The filepaths should be similar in structure to:
/n/groups/kwon/data1/sequencing_run_archive_DO_NOT_EDIT/<your sequencing run>/<filename_for_sample>
.
All files must be gzipped (have .gz extension.)
After updating the example sample sheet, save it as a comma-delimited file or .csv
.
Uploading your samplesheet
Next, we need to upload the updated sample sheet into the pipeline folder. This can be done by dragging that the updated folder from its current location into the VSCode sidebar in the input folder. VSCode may prompt you asking to if you want to replace the current file and confirm.
You can click the sample_sheet.csv to confirm it has your updated samples.
Running the pipeline
Testing the pipeline
To ensure everything is set up properly, we can run the pipeline on a test dataset.
Navigate back to the VSCode terminal and check that you are in your pipeline folder. To run the test data, activate the snakemake environment and run snakemake with the following.
# Activates the mamba environment
mamba activate snakemake
# Runs the test data through the pipeline
--use-conda test snakemake
This will run the test data interactivately and you will be able to watch the pipeline run each step. This may up to 40 minutes, so now is a good time for a tea break.
After the pipeline finishes, you should see a folder generated in your pipeline folder called test_results
. If you see a multiqc/star_salmon/nfcore_rnaseq_test_multiqc_report.html
file inside the two sub folders, the pipeline has successfully finished.
Running the pipeline
Almost there! All that’s left is running your samples through the pipeline.
Once you have the updated sample sheet with your samples in the input folder inside the pipeline directory, navigate back again to the VSCode terminal. Submit the pipeline with the following.
# Activates the mamba environment
mamba activate snakemake
# Submits a job with slurm
sbatch submit_jobs.sh
This will submit the snakemake pipeline as a job, so you do not need to keep everything open. Depending on the number of samples and reads, the pipeline may take a few hours to days to finish.