Data analysis step 1: download from GEO and convert to fastq

August 29, 2014

In this post we will be downloading human RNA-seq data from GEO accession GSE55123. Now you would have thought that this would be easy, but you have to understand that the data we download from GEO is in NCBI's short read archive format (SRA). To unpack the original sequence files can be a bit tricky at first, even the size of the SRA toolkit manual is enough to make you cringe.

So start (in linux) by making a text file containing all the SRA file links fron the NCBI ftp site. Let's call it "url.txt".
http://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/sra/SRP/SRP038/SRP038101/SRR1171523/SRR1171523.sra
http://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/sra/SRP/SRP038/SRP038101/SRR1171524/SRR1171524.sra
http://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/sra/SRP/SRP038/SRP038101/SRR1171525/SRR1171525.sra
http://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/sra/SRP/SRP038/SRP038101/SRR1171526/SRR1171526.sra
http://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/sra/SRP/SRP038/SRP038101/SRR1171527/SRR1171527.sra
http://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/sra/SRP/SRP038/SRP038101/SRR1171528/SRR1171528.sra

OK now we'll make a little script that goes and downloads each of these URLs. Notice that we use axel, which is a really cool downloading utility that can make multiple server connections and get the most out of your bandwidth. axel has a few bugs. For instance it doesn't handle long URLs well and in those cases you might want to try aria2c as an alternative.

URLFILE=url.txt
NUM=8
for URL in `cat $URLFILE`
do axel --num-connections=$NUM $URL
done

OK so now you'll have a bunch of SRA files and you need to get them into fastq format. In most cases, I would recommend downloading the CentOS version, even if you're running Ubuntu.

wget http://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/2.3.5-2/sratoolkit.2.3.5-2-centos_linux64.tar.gz
tar xvfz sratoolkit.2.3.5-2-centos_linux64.tar.gzcd sratoolkit.2.3.5-2-centos_linux64/bin/
sudo ln * /usr/local/bin

Now go back to the directory with the SRA files and create a script that converts them (dump) to compressed fastq.

for SRA in *sra
do
fastq-dump --gzip --split-3 $SRA
done

Search This Blog

Genome Spot

Data analysis step 1: download from GEO and convert to fastq

Popular posts from this blog

Uploading data to GEO - which method is faster?

Using GTF tools to get gene lengths

Data analysis step 8: Pathway analysis with GSEA