SRA toolkit tips and workarounds
The Sequence Read Archive (SRA, formerly the Short Read Archive) is the main repository for raw next-generation sequencing (NGS) data. Considering the sheer, and still accelerating, rate at which NGS data is generated, the team at NCBI should be congratulated for providing this service to the scientific community. Take a look at the growth of SRA:
Growth of SRA data (http://www.ncbi.nlm.nih.gov/Traces/sra/i/g.png)
SRA, however, doesn't directly provide the fastq files that we commonly work with; it prefers the .sra archive format, which requires specialised software (sra-toolkit) to extract. Sra-toolkit has been described as buggy and painful, and I've had my frustrations with it. In this post, I'll share the best sra-toolkit tips that I've found.
Get the right version of the software and configure it
When downloading, make sure you get the newest version from the NCBI website. Don't install it from GitHub or from the Ubuntu Software Centre (or apt-get), as that will probably be an older version. In the binary directory (it looks like /path/to/sratoolkit.2.4.3-ubuntu64/bin) there is a file called sratoolkit.jar. On Linux, run "java -jar sratoolkit.jar" from that directory to open the graphical interface. In the preferences menu, enable the local repository and select a path for it. Once that is done, you can use sra-toolkit to "stream" fastq data (see below).
EDIT: if you are seeing an error like this one:
/data/app/sratoolkit.2.4.3-ubuntu64/bin/fastq-dump --split-files -A ERR366438
2015-02-15T21:47:01 fastq-dump.2.4.3 err: binary large object corrupt while reading binary large object within virtual database module - failed ERR366438
=============================================================
An error occurred during processing.
A report was generated into the file '/data/home/ncbi_error_report.xml'.
If the problem persists, you may consider sending the file
to 'sra@ncbi.nlm.nih.gov' for assistance.
=============================================================
Then grab the newer sra-toolkit version 2.4.4, which seems to fix problems with SRA archives that use reference-based compression (this happens when submitters provide data in aligned BAM format).
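If you aren't sure whether the fastq-dump on your PATH predates that fix, a quick version comparison can help. This is only a minimal sketch: version_ge is a hypothetical helper name of mine, and it assumes plain dotted version strings with up to three numeric parts, such as 2.4.3.

```shell
# Hypothetical helper: succeeds if version $1 is at least version $2.
# Assumes dotted numeric versions with up to three parts (e.g. 2.4.3).
version_ge() {
    # Sort the two versions numerically, field by field. If the required
    # version ($2) comes out first (or ties), then $1 is new enough.
    lowest=$(printf '%s\n%s\n' "$1" "$2" \
        | sort -t. -k1,1n -k2,2n -k3,3n \
        | head -n 1)
    [ "$lowest" = "$2" ]
}

if version_ge 2.4.3 2.4.4; then
    echo "2.4.3 is new enough"
else
    echo "2.4.3 is older than 2.4.4, time to upgrade"
fi
```

To compare against the installed version you would first need to extract the number from the output of fastq-dump --version; the exact format of that output is something to check on your own system rather than take from me.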
Try streaming the data
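You can convert sra to fastq on the fly with either of the following. The first writes SRR1722641.fastq into the directory given by -O (which names an output directory, not a file); the second streams to standard output with -Z:
fastq-dump -A SRR1722641 -O .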
fastq-dump -A SRR900186 -Z > SRR900186.fastq
Streaming paired-end data can be problematic. Use the following to save forward and reverse reads as separate files (-X 10 limits the dump to the first 10 spots, so drop it for a full run). Thanks to the folks at Biostars for this idea.
SRR=SRR1041311 ; fastq-dump -X 10 --split-files -I -Z $SRR \
| tee >(grep '@.*\.1\s' -A3 --no-group-separator \
> ${SRR}_1.fastq) >(grep '@.*\.2\s' -A3 --no-group-separator \
> ${SRR}_2.fastq) >/dev/null
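The grep approach relies on bash process substitution and GNU grep's --no-group-separator. If either is unavailable, the same split can be sketched in awk. This is an alternative technique, not the Biostars one: split_mates is a hypothetical helper name, and it assumes each header's first field ends in .1 or .2 (the read id that -I appends) and that every record is exactly four lines.

```shell
# Hypothetical helper: read interleaved fastq on stdin and write
# <prefix>_1.fastq and <prefix>_2.fastq according to the read id
# at the end of each header's first field (e.g. @SRR1041311.7.2).
split_mates() {
    awk -v prefix="$1" '
        NR % 4 == 1 {
            # Header line: take the last dot-separated part of field 1.
            n = split($1, parts, ".")
            mate = parts[n]
            # Anything that is not mate 1 or 2 goes to a catch-all file.
            if (mate != "1" && mate != "2") mate = "other"
            out = prefix "_" mate ".fastq"
        }
        # All four lines of the record follow the header to the same file.
        { print > out }
    '
}

# Usage (needs sra-toolkit, so it is left commented out here):
# fastq-dump --split-files -I -Z SRR1041311 | split_mates SRR1041311
```

Because the decision is made once per four-line record, a quality line that happens to start with @ can't be mistaken for a header.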
Use a download accelerator
The SRA team actually recommends using Aspera Connect to speed up the download of SRA files. If the stream isn't working for you, give Aspera a try using this script. If you struggle to get Aspera configured, you can try a download accelerator such as axel or aria2c. Here's an example with axel:
axel -n5 ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByExp/sra/SRX/SRX709/SRX709649/SRR1585277/SRR1585277.sra
After downloading the SRA archive, dump the fastq from the file you just fetched:
fastq-dump -A SRR1585277.sra -Z --split-files
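The axel example uses a ByExp path, which needs the experiment accession as well. Elsewhere in this post the same server is addressed by run (ByRun), and that form can be derived from the run accession alone. Here is a rough sketch: sra_byrun_url is a hypothetical helper name, and it assumes the ByRun layout of 3-character prefix, then the first 6 characters, then the full accession, which you should confirm still holds on the server.

```shell
# Hypothetical helper: build the NCBI FTP URL for a run accession,
# assuming the ByRun layout <3-char prefix>/<first 6 chars>/<accession>.
sra_byrun_url() {
    acc=$1
    prefix3=$(printf '%s' "$acc" | cut -c1-3)   # e.g. SRR
    prefix6=$(printf '%s' "$acc" | cut -c1-6)   # e.g. SRR158
    printf 'ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/%s/%s/%s/%s.sra\n' \
        "$prefix3" "$prefix6" "$acc" "$acc"
}

# Print the URL for the run used in the axel example.
sra_byrun_url SRR1585277
```

The result can be handed straight to the downloader, for example axel -n5 "$(sra_byrun_url SRR1585277)".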
Via the browser
Here are two useful approaches suggested by SeqAnswers. You can download each fastq.gz file individually from your web browser (not the command line) by replacing the digits after SRR in this link:
http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?cmd=dload&run_list=SRR515925&format=fastq
or batch download with a link like:
http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?cmd=dload&run_list=SRR294514,SRR352727,SRR364895&format=fastq
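If you have a long list of runs, typing the comma-separated run_list by hand gets tedious. A small sketch for assembling the link, where sra_batch_url is a hypothetical helper name and the URL pattern is simply copied from the examples above:

```shell
# Hypothetical helper: join run accessions with commas and drop them
# into the batch download link shown above.
sra_batch_url() {
    # The subshell keeps the IFS change local; "$*" joins the arguments
    # using the first character of IFS, here a comma.
    (
        IFS=,
        printf 'http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?cmd=dload&run_list=%s&format=fastq\n' "$*"
    )
}

sra_batch_url SRR294514 SRR352727 SRR364895
```

Paste the printed link into your browser as before.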
Alternatively, find your study accession number (e.g. SRP013698) and go to the SRA run selector:
http://trace.ncbi.nlm.nih.gov/Traces/study/?go=home
Search with your SRP number, then click on the "Run" link. Click on the "Reads" tab, then click "Filtered Download", change the format to "FASTQ" and hit "Download".
SRA mirrors
Most of the data on SRA is mirrored at ENA or DNAnexus. You can download the compressed fastq files from ENA for forward and reverse reads:
ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR504/SRR504687/SRR504687_1.fastq.gz
ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR504/SRR504687/SRR504687_2.fastq.gz
You can download the SRA archive from DNAnexus too.
ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR504/SRR504687/SRR504687.sra
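The ENA links above follow a regular pattern (first 6 characters of the accession, then the accession itself), so they can be generated rather than looked up. A minimal sketch, where ena_fastq_url is a hypothetical helper name. It deliberately handles only 9-character accessions like the ones shown here; as far as I know, longer accessions sit under an extra subdirectory on ENA, so check the ENA browser for those.

```shell
# Hypothetical helper: build the ENA fastq.gz URL for a run accession.
# $1 = run accession, $2 = mate (1 or 2), or empty for single-end data.
ena_fastq_url() {
    acc=$1
    mate=$2
    # Only the layout shown above (9-character accessions) is handled.
    if [ "${#acc}" -ne 9 ]; then
        echo "ena_fastq_url: unhandled accession length: $acc" >&2
        return 1
    fi
    prefix6=$(printf '%s' "$acc" | cut -c1-6)
    if [ -n "$mate" ]; then
        suffix="_$mate"
    else
        suffix=""
    fi
    printf 'ftp://ftp.sra.ebi.ac.uk/vol1/fastq/%s/%s/%s%s.fastq.gz\n' \
        "$prefix6" "$acc" "$acc" "$suffix"
}

ena_fastq_url SRR504687 1
ena_fastq_url SRR504687 2
```

Each printed URL can then be fetched with curl, axel or whichever downloader you prefer.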
Stream directly into your analysis pipeline
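You can send the data straight through your QC and alignment pipeline without saving intermediate files. Here is an example using SRA toolkit with Olego for alignment (the "samtools sort - prefix" form is the older samtools syntax; from version 1.3 onwards samtools expects "-o output.bam" instead):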
fastq-dump -A SRR764858 -Z \
| fastq_quality_trimmer -l 25 -t 20 -Q33 \
| olego -t 8 Athaliana.TAIR10.23.dna_sm.genome.fa - \
| samtools view -uSh - \
| samtools sort - SRR764858_sra.sort
And another using curl from EBI:
curl ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR764/SRR764858/SRR764858.fastq.gz \
| pigz -d | fastq_quality_trimmer -t 20 -l 25 -Q33 \
| olego -t 8 Athaliana.TAIR10.23.dna_sm.genome.fa - \
| samtools view -uSh - \
| samtools sort - SRR764858_ebi.sort
Dump color-space sequence
Occasionally you'll come across data in color-space format. After downloading the SRA archive, do the following:
abi-dump -A SRR1657115.sra
That will dump the sequence in fasta format (SRR1657115.sra.csfasta) and the quality string (SRR1657115.sra.qual) into separate files. Then I use solid-trimmer.py to do quality trimming. Here's an example:
solid-trimmer.py -c SRR1657115.sra.csfasta -q SRR1657115.sra.qual --moving-average 7:12 --min-read-length 25 > SRR1657115.fasta
Happy data mining!