Showing posts with the label quality control

Screen for rRNA contamination in RNA-seq data

Ribosomal RNA (rRNA) is very abundant in cells (~80% of total RNA), so it is useful to deplete rRNA in genome-wide assays to obtain sufficient coverage of other transcripts, both protein-coding and non-coding.

There are two major strategies for removing rRNA: (1) poly(A) enrichment and (2) rRNA depletion. Poly(A) enrichment uses oligo(dT)-coupled magnetic beads to pull out RNA molecules with a poly(A) tail, a common feature of protein-coding transcripts. rRNA depletion can be achieved using kits such as Ribo-Zero (Illumina), RiboMinus (LifeTech) and the NEBNext® rRNA Depletion Kit (NEB). These kits contain oligonucleotide probes that hybridize to the unwanted rRNA so that it can be either immobilised and removed or degraded.

Once you have some sequence data, you will need to check whether the rRNA depletion has worked. This is somewhat different to a genome-wide analysis I've mentioned in earlier posts because rRNA genes exist in multiple copies and r…

Data set curation - quality trimming and adapter clipping

After demultiplexing (covered in the last post), you'll need to perform a few other steps before aligning Illumina sequence data to the reference genome. Primarily, these are quality filtering and adapter clipping. These steps matter less for short reads, but they become important with longer reads, where base quality starts to drop off and the read might contain some adapter sequence.

Quality filtering can be done in a few ways: by discarding entire reads that have poor base quality scores, by converting poor-quality base calls to "N", or by hard-trimming reads to a fixed length before the Q scores start to decline rapidly. I'd much rather use a quality-based trimming tool that starts at the 3' end of the read and removes bases below a certain Q threshold. This can be done using fastq_quality_trimmer on the command line or in Galaxy. You set the threshold you want to use, in this case Q30, as well as the minimum length of sequence to keep, w…
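
As a rough illustration of what a 3' quality trimmer does, here is a minimal sketch in awk wrapped in a shell function. It is not the fastq_quality_trimmer implementation. The function name trim3, the Phred+33 quality offset, the 4-line FASTQ layout and the minimum length of 4 in the example call are all assumptions made for illustration.

```shell
# Sketch only: trim low-quality bases from the 3-prime end of each read.
# Assumptions (not from the post): Phred+33 qualities, 4-line FASTQ records.
trim3() {
    # $1 = quality threshold, $2 = minimum read length to keep
    awk -v q="$1" -v minlen="$2" -v offset=33 '
        # build a character-to-ASCII-code lookup table
        BEGIN { for (i = 33; i < 127; i++) ord[sprintf("%c", i)] = i }
        NR % 4 == 1 { header = $0 }
        NR % 4 == 2 { bases = $0 }
        NR % 4 == 0 {
            qual = $0
            n = length(qual)
            # walk back from the 3-prime end while the base quality is below q
            while (n > 0 && ord[substr(qual, n, 1)] - offset < q) n--
            # keep the read only if enough sequence survives
            if (n >= minlen)
                printf "%s\n%s\n+\n%s\n", header, substr(bases, 1, n), substr(qual, 1, n)
        }
    '
}

# Example: the last three bases have quality "#" (Q2) and are removed.
printf '@r1\nACGTACGT\n+\nIIIII###\n' | trim3 30 4
```

With the real tool, the threshold and the minimum length are set with the -t and -l options of fastq_quality_trimmer. Older Illumina pipelines wrote Phred+64 qualities, so the offset here would need changing for such files.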

Demultiplexing Illumina Sequence data

Demultiplexing is a key step in many sequencing-based applications, but it isn't always necessary, as the newer Illumina pipeline software provides demultiplexed data as standard. If you need to do this yourself, here is an example using fastx_toolkit, designed for sequence data with a 6 nt barcode (Illumina barcode sequences 1-12). After a run, the Genome Analyzer software provides sequence files like this for read 1 (insert sequence):
FC123_1_1_sequence.txt
And for the barcode/index read:
FC123_1_2_sequence.txt
So here goes:
#Enter dataset parameters
FC='FC123 FC124'
LANES='1 2 3 4 5 6 7 8'
#Create the bcfile
echo 'BC01_ATCACG     ATCACG
BC02_CGATGT     CGATGT
BC03_TTAGGC     TTAGGC
BC04_TGACCA     TGACCA
BC05_ACAGTG     ACAGTG
BC06_GCCAAT     GCCAAT
BC07_CAGATC     CAGATC
BC08_ACTTGA     ACTTGA
BC09_GATCAG     GATCAG
BC10_TAGCTT     TAGCTT
BC11_GGCTAC     GGCTAC
BC12_CTTGTA     CTTGTA' > bcfile.txt
for flowcell in $FC
do
for lane in $LANES
do
paste ${flowcell}_${lane}_1…

Quality control of sequence datasets

Before investing a lot of time in analysis of a sequencing run, it is ALWAYS a good idea to ensure that your sequence data quality is up to scratch. If you have a high proportion of low-quality bases in your dataset, you're likely to have many spurious alignments. These can cause problems for virtually all NGS applications, from mRNA-seq through to SNP detection and de novo assembly.

There are two main types of QC analysis for sequencing runs: the first describes only the quality of the fastq file, and the second describes the quality of the alignment (sam/bam file format). Let's begin with the simple fastq file analysis and we'll cover the alignment QC at a later stage.

The fastq file format has become the standard raw data format for high throughput sequence datasets. Each base is given a Phred quality score represented as an ASCII character. The higher the Qscore, the more confidence you can have in the identity of the base. But there are other things…
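
To make the ASCII encoding concrete, here is a small sketch that decodes a quality string into Phred scores and the corresponding error probabilities, using the standard relation P = 10^(-Q/10). The function name phred_decode is made up for this example, and the default offset of 33 is an assumption. Older Illumina pipelines used an offset of 64, so check which one your files use.

```shell
# Sketch only: decode a FASTQ quality string into Q scores and error probabilities.
# Assumption: ASCII offset of 33 unless a second argument (e.g. 64) is given.
phred_decode() {
    # $1 = quality string, $2 = ASCII offset (optional, default 33)
    printf '%s\n' "$1" | awk -v offset="${2:-33}" '
        # build a character-to-ASCII-code lookup table
        BEGIN { for (i = 33; i < 127; i++) ord[sprintf("%c", i)] = i }
        {
            for (i = 1; i <= length($0); i++) {
                c = substr($0, i, 1)
                q = ord[c] - offset
                # columns: character, Q score, probability that the base call is wrong
                printf "%s\tQ%d\t%.4f\n", c, q, 10 ^ (-q / 10)
            }
        }
    '
}

# Example: "I" decodes to Q40, "5" to Q20 and "!" to Q0 with offset 33.
phred_decode 'I5!'
```

So a base with Q20 has a 1 in 100 chance of being wrong, and a base with Q40 a 1 in 10,000 chance.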