Showing posts with the label fastq

SRA toolkit tips and workarounds

The Sequence Read Archive (SRA, formerly the Short Read Archive) is the main repository for raw next generation sequencing (NGS) data. Considering the sheer, and accelerating, rate at which NGS data is generated, the team at NCBI should be congratulated for providing this service to the scientific community. Take a look at the growth of SRA:

SRA, however, doesn't directly provide the fastq files that we commonly work with; it uses the .sra archive format, which requires specialised software (sra-toolkit) to extract. Sra-toolkit has been described as buggy and painful, and I've had my frustrations with it. In this post, I'll share some of the best sra-toolkit tips that I've found.

Get the right version of the software and configure it

When downloading, make sure you download the newest version from the NCBI website (link). Don't download it from GitHub or from Ubuntu software centre (or apt-get), as it will probably be an older version. In the binary directory (looks like /path/to/sratoolkit.2.4.3-ubuntu64/bin…

Test-driving parallel compression software

Next generation sequencing is driving genomics into the realm of big data, and this will only continue as NGS moves into the hands of more and more researchers and clinicians. Genome sequence datasets are huge and place an immense demand on informatics infrastructure, including disk space, memory and CPU usage. Bioinformaticians need to be aware of the performance of the different compression software available to make the most of their hardware.

In this post, we'll test the performance of some compression algorithms in terms of their disk-saving ability as well as speed of decompression/compression.
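To make the two measurements concrete (space saved and compression/decompression time), here is a rough Python sketch of how such a comparison could be set up. It is not the benchmark used in this post: it swaps in the single-threaded gzip, bzip2 and xz codecs from the Python standard library rather than parallel tools, and it runs on a small synthetic fastq-like sample. The helper names `make_fastq` and `benchmark` are made up for illustration.

```python
import bz2
import gzip
import lzma
import random
import time


def make_fastq(n_reads, read_len=36, seed=0):
    """Build synthetic fastq-like bytes (a hypothetical stand-in for a real file)."""
    rng = random.Random(seed)
    records = []
    for i in range(1, n_reads + 1):
        seq = "".join(rng.choice("ACGT") for _ in range(read_len))
        qual = "".join(rng.choice("<>?@ABCD") for _ in range(read_len))
        # Four lines per read: header, sequence, separator, qualities.
        records.append(f"@read.{i} length={read_len}\n{seq}\n+\n{qual}\n")
    return "".join(records).encode("ascii")


def benchmark(data):
    """Time compression and decompression of `data` with three stdlib codecs."""
    codecs = {
        "gzip": (gzip.compress, gzip.decompress),
        "bzip2": (bz2.compress, bz2.decompress),
        "xz": (lzma.compress, lzma.decompress),
    }
    results = {}
    for name, (compress, decompress) in codecs.items():
        start = time.perf_counter()
        packed = compress(data)
        compress_s = time.perf_counter() - start

        start = time.perf_counter()
        restored = decompress(packed)
        decompress_s = time.perf_counter() - start

        # Guard against a silent round-trip failure before trusting the timings.
        if restored != data:
            raise ValueError(f"{name} round trip failed")
        results[name] = {
            "ratio": len(packed) / len(data),  # compressed size / original size
            "compress_s": compress_s,
            "decompress_s": decompress_s,
        }
    return results


if __name__ == "__main__":
    sample = make_fastq(20000)
    for name, stats in benchmark(sample).items():
        print(f"{name}: ratio={stats['ratio']:.3f} "
              f"compress={stats['compress_s']:.3f}s "
              f"decompress={stats['decompress_s']:.3f}s")
```

Random synthetic reads compress differently from real sequence data, and in-memory timings ignore disk I/O, so treat any numbers from this sketch as a way to check the method rather than as results.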

The file that we're working on is an RNA-seq fastq file that is 5.0 GB in size and contains 25.8 million sequence reads on 103.0 million lines. Here are the first 10 lines.

@SRR1171523.1 Y40-ILLUMINA:15:FC:5:1:8493:1211 length=36
ACTTGCTGCTAATTAAAACCAACAATAGAACAGTGA
+SRR1171523.1 Y40-ILLUMINA:15:FC:5:1:8493:1211 length=36
B55??4024/13425>;>>CDD@DBB<<<BB<4<<9
@SRR117…