Data analysis step 2: quality control of RNA-seq data

In a previous post, we downloaded RNA-seq data from GEO (GSE55123) . Lets continue with the processing of this data by performing QC analysis. In a previous post, I went into a bit more detail, but here we will simply use fastx_quality_stats from the fastx toolkit to have a look at quality scores among the data sets.

The general strategy was to unzip the data on the fly, convert to tabular format and then select a random 1 million sequences and then submit these to fastx_quality_stats. So having a look through the output file, shows very high quality scores with median scores >36 which suggests this dataset is very high quality. Below see the code used and a graph of median quality scores throughout the run.

#!/bin/bash
for FQZ in *bz2
do echo $FQZ
pbzip2 -dc $FQZ | paste - - - - | shuf | head -1000000 \
| tr '\t' '\n' | fastx_quality_stats | cut -f1,6-9
done | tee quality_analysis.txt

Mean cycle base quality for GSE55123 RNA-seq data shows very high quality sequence reads.

Popular posts from this blog

Data analysis step 8: Pathway analysis with GSEA

Two subtle problems with over-representation analysis

Uploading data to GEO - which method is faster?