Showing posts with the label RNA-Seq

DEE2 gets published

The dee2.io project has been a labor of love since 2013/2014. It has undergone a major overhaul and has finally been published online in GigaScience. The great thing about this journal is that not only the articles but also the reviewers' comments are open access. We received great suggestions, and they improved the resource tremendously.

It's great that it has finally been published, but publication is not the end goal of the project. The goal is to democratize omics data analysis to the point where it can be done by biologists without any coding experience, undergrad students, high school students, practically anyone with a smartphone and an internet connection. So instead of being the end of the project, this is really the end of the beginning. Not only will we be keeping up with new SRA submissions over the next year or so, we will also be incorporating new features, new species and perhaps some new data types.

If you have suggestions, feedback or comments, I would be very grateful!

Using the DEE2 bulk data dumps

The DEE2 project makes freely available bulk data dumps of expression for each of the nine species. 
The data is organised as tab separated values (tsv) in "long" format. This is different to a standard gene expression matrix - see below.

The long format is preferred for really huge datasets because, compared to the wide matrix format, it can be readily indexed and converted to a database.
You'll notice that there are 4 files for each species, each with a different suffix. They are all compressed with bz2. You can use pbzip2 to decompress these quickly.
*accessions.tsv.bz2: A list of runs included in DEE2, along with other SRA/GEO accession numbers.
*se.tsv.bz2: The STAR-based gene expression counts in 'long format' tables with the columns 'dataset', 'gene', 'count'.
*ke.tsv.bz2: The Kallisto-estimated transcript expression counts, also in long format.
*qc.tsv.bz2: The QC metrics, also available in long …
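For a small subset of runs, a long table can be pivoted back into a conventional gene-by-dataset matrix. Here is a minimal sketch using awk, assuming the table has exactly the three tab-separated columns listed above (dataset, gene, count) and no header line. The function name long_to_wide and the demo rows are made up for illustration, and filling missing gene/dataset pairs with 0 is an assumption rather than anything DEE2 prescribes.

```shell
# long_to_wide: read "dataset<TAB>gene<TAB>count" rows on stdin and
# print a gene x dataset matrix (hypothetical helper, not part of DEE2).
long_to_wide() {
  awk -F'\t' -v OFS='\t' '
    {
      # Remember datasets and genes in order of first appearance.
      if (!($1 in seen_ds)) { seen_ds[$1] = 1; ds[++nd] = $1 }
      if (!($2 in seen_g))  { seen_g[$2] = 1;  g[++ng] = $2 }
      count[$2 SUBSEP $1] = $3
    }
    END {
      header = "gene"
      for (j = 1; j <= nd; j++) header = header OFS ds[j]
      print header
      for (i = 1; i <= ng; i++) {
        row = g[i]
        for (j = 1; j <= nd; j++) {
          key = g[i] SUBSEP ds[j]
          # Assumption: pairs absent from the long table are reported as 0.
          row = row OFS ((key in count) ? count[key] : 0)
        }
        print row
      }
    }'
}

# Demo with three made-up rows (two runs, two genes).
printf 'SRR1\tgeneA\t5\nSRR1\tgeneB\t0\nSRR2\tgeneA\t7\n' | long_to_wide
```

On a real dump the input could come from something like `pbzip2 -dc <a file ending in se.tsv.bz2> | long_to_wide`. The whole table is held in memory, so it is worth filtering down to the runs of interest first (for example with grep) rather than pivoting an entire species.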

Incorporate dee2 data into your R-based RNA-seq workflow

Dee2.io is a portal for accessing gene expression data derived from public RNA-seq datasets. So far there are over 400k available datasets and it's growing every day. While there are existing databases such as Expression Atlas, Recount2 and ARCHS4, dee2.io offers a number of unique benefits. For instance, dee2 includes gene-wise counts from STAR as well as transcript-wise quantifications from Kallisto. There are a few ways you can access these data. Firstly, there is a nice web interface that is mobile friendly. Secondly, there are data dumps available if you are running a large scale analysis. But the purpose of this post is to demonstrate the improved R interface in action together with SRAdbv2 and statistics with edgeR and DESeq. The official documentation is available on GitHub.

Getting started

This tutorial provides a walkthrough of how to work with dee2 expression data, starting with dataset searches, obtaining the data from dee2.io and then performing a differential analysi…

Get the newest Reactome gene sets for pathway analysis

For first-pass pathway analysis I find Reactome to be the most useful database of gene sets for biologists to understand. For a long time I have been using Reactome gene sets as deposited to the GSEA/MSigDB website. Recently a colleague pointed me to the gene matrix file offered directly on the Reactome webpage (Thanks Dr Okabe).

There are some differences. Firstly, there are more gene sets in the one from the Reactome webpage (accessed 2018-05-09):
$ wc -l *gmt
     674 c2.cp.reactome.v6.1.symbols.gmt
    2022 ReactomePathways.gmt
    2696 total
Secondly, there are more genes included in one or more gene sets:
$ cut -f3- c2.cp.reactome.v6.1.symbols.gmt | tr '\t' '\n' | sort -u | wc -l
6025
$ cut -f3- ReactomePathways.gmt | tr '\t' '\n' | sort -u | wc -l
10852
And overall there are about threefold more gene-pathway entries:
$ cut -f3- ReactomePathways.gmt | wc -w
106405
$ cut -f3- c2.cp.reactome.v6.1.symbols.gmt | wc -w
37601
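The three counts above can be wrapped into one small helper so that any GMT file can be summarised the same way. This is only a sketch: the function name gmt_stats and the two demo gene sets are invented for illustration, and it assumes the usual GMT layout of set name, description, then tab-separated gene symbols.

```shell
# gmt_stats FILE: print file name, number of gene sets, number of unique
# genes and number of gene-pathway entries (hypothetical helper).
gmt_stats() {
  # $(( ... )) strips the padding that some wc implementations add.
  n_sets=$(( $(wc -l < "$1") ))
  n_genes=$(( $(cut -f3- "$1" | tr '\t' '\n' | sort -u | wc -l) ))
  n_entries=$(( $(cut -f3- "$1" | wc -w) ))
  printf '%s\t%d\t%d\t%d\n' "$1" "$n_sets" "$n_genes" "$n_entries"
}

# Demo on a made-up two-set GMT file.
demo_gmt=$(mktemp)
printf 'SET_A\tdesc\tTP53\tBRCA1\nSET_B\tdesc\tTP53\tMYC\tEGFR\n' > "$demo_gmt"
gmt_stats "$demo_gmt"
rm -f "$demo_gmt"
```

Running it over both files, for example `for f in *.gmt; do gmt_stats "$f"; done`, gives one comparable line per gene set collection.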
I also looked at whether the gen…

Publishing datasets on the dat network - benefits and pitfalls

As I mentioned in an earlier post, Dat is a new data sharing tool that uses concepts from BitTorrent and git to enable peer-to-peer sharing of versioned data. This is cool for sharing datasets that change over time, because when you sync the dataset, only the changes are retrieved, sort of like git. As it uses peer-to-peer technology, it is fairly resilient to node failures because the datasets are mirrored between peers. The "dat publish" command registers the repository on datbase.org, meaning that the files can be retrieved by anyone via a normal browser.

To demonstrate, I have released the bulk data dumps from my RNA-seq data processing project, DEE2, which consists of 158 GB of gene expression data. These data are freely available via a browser at https://datbase.org/dee2/bulk or by using the dat command-line tool.


If you're after a single file, then you can use the following syntax to retrieve over https:

wget https://datbase.org/download/<long dat address>/<file…

Has RNA-seq overtaken microarrays?

We know RNA-seq has a number of advantages over array based analyses, but is RNA-seq taking over in terms of number of datasets published? I got curious and thought I'd investigate with some PubMed searches. I searched for "RNA-seq" and "microarray" and downloaded the CSV file which summarises the number of citations per year. As a type of control, I also searched "gene expression".

I divided the yearly "RNA-seq" and "Microarray" citation counts by the "Gene expression" counts then multiplied by 1000 to give the numbers seen below.
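The normalisation is simple enough to script. Below is a rough sketch, assuming each PubMed export has been saved as plain "year,count" lines; the function name normalise_counts, the file layout and the demo numbers are all made up for illustration and are not the real PubMed counts.

```shell
# normalise_counts TERM_CSV CONTROL_CSV
# Prints "year,value", where value = 1000 * term count / control count.
normalise_counts() {
  awk -F',' '
    $1 !~ /^[0-9]+$/ { next }                     # skip header or title lines
    NR == FNR { control[$1] = $2; next }          # first file: control counts by year
    ($1 in control) && control[$1] + 0 > 0 {      # avoid dividing by zero
      printf "%s,%.1f\n", $1, 1000 * $2 / control[$1]
    }' "$2" "$1"
}

# Demo with invented counts for two years.
term_csv=$(mktemp); control_csv=$(mktemp)
printf '2016,50\n2017,120\n' > "$term_csv"
printf '2016,2000\n2017,2400\n' > "$control_csv"
normalise_counts "$term_csv" "$control_csv"
rm -f "$term_csv" "$control_csv"
```

Dividing by the "gene expression" counts corrects for the overall growth of the literature, so the two series can be compared as citations per 1000 gene expression papers.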

You can see that microarray is still more frequent in PubMed than RNA-seq, but the gap is getting much narrower and the crossover will likely occur within the next two years.

Next, I will look at the rate of GEO data deposition. (Updates soon)

Update on DEE2 project for Jan 2018

Today I'd like to share some updates on the DEE2 project, which I wrote about in an earlier post. The project code can be viewed on GitHub here.

Pipeline and images

As the pipeline was recently finalised, I was able to roll out the working Docker image. To facilitate users without root access, this image was ported to Singularity. This took a lot of effort and some expertise from our local HPC team to get things working (many thanks to the Massive/M3 team). The Singularity image is available from the webserver (link) and instructions for running it are available on GitHub here. I have started testing a heavyweight Singularity image that includes the genome indexes, which will be more efficient for running jobs with large genomes, and I will make it available once testing is complete.

Queue management

It may sound simple to write a script to determine which datasets have been completed and add new datasets to the queue, but when talking about tens of thousands of datasets it's …
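To make the basic idea concrete, here is a minimal sketch of the "what is still to do" step. It is not the actual DEE2 queue code: the function name pending_runs and the two input files (one run accession per line) are assumptions for illustration.

```shell
# pending_runs ALL_FILE DONE_FILE
# Prints accessions listed in ALL_FILE that are not yet in DONE_FILE.
pending_runs() {
  all_sorted=$(mktemp); done_sorted=$(mktemp)
  sort -u "$1" > "$all_sorted"     # comm needs sorted, de-duplicated input
  sort -u "$2" > "$done_sorted"
  comm -23 "$all_sorted" "$done_sorted"   # keep lines unique to the first file
  rm -f "$all_sorted" "$done_sorted"
}

# Demo with made-up accessions: three known runs, one already finished.
all_runs=$(mktemp); done_runs=$(mktemp)
printf 'SRR3\nSRR1\nSRR2\n' > "$all_runs"
printf 'SRR2\n' > "$done_runs"
pending_runs "$all_runs" "$done_runs"
rm -f "$all_runs" "$done_runs"
```

Sorting plus comm runs in roughly linear time after the sort, so it copes with tens of thousands of accessions far better than checking each run against the finished list one at a time.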

Update on the DEE project Dec 2017

Back in 2015, our group described DEE, a user friendly repository of uniformly processed RNA-seq data, which I covered in detail in a previous post. Ours was the first such repository that wasn't limited to human or mouse and included sequencing data from a variety of instruments and library types. The purpose of this post is to reflect on the mixed success of DEE and outline where this project is going in future.

Overall I've received a lot of positive feedback from users and a number of citations to our poster. Thanks to everyone who used it and sent suggestions, comments, bug reports, etc! However, our attempt to have the repository published wasn't so successful, due to reviewer niggles over what I consider minor points that are nonetheless hard to implement quickly. The main points raised by reviewers were:

Is it reasonable to treat all data sets as if they were single end? For this one, the reviewers were split: one said it was OK and the other was adamant that it was unacceptable despite my …

Screen for mycoplasma contamination in DNA-seq and RNA-seq data

If you work in a lab dealing with mammalian cell cultures, then you've probably heard of Mycoplasma. These are obligate intracellular parasitic bacteria that not only cause human infection and disease, but are also common contaminants in cell culture experiments. Mycoplasma infection can cause many changes to cell biology that can invalidate experimental results, which is alarming. These bacteria are also resistant to most antibiotics used in culture experiments, like streptomycin and penicillin. Mycoplasma detection is also not straightforward, as these bugs are not visible with the light microscopes used by most researchers. PCR and ELISA tests can be used, but there are many researchers out there who simply don't perform these tests. Last year, a study published in NAR showed that about 11% of RNA-seq studies were affected by Mycoplasma contamination. The study also identified a panel of 61 host genes that were strongly correlated with the presence of Mycoplasma.

In thi…