DEE2 database gets HDF5

Here I’ll show you how to download and work with the new HDF5 datasets from DEE2 (dee2.io).

HDF5 files are provide fast random access to large and complex datasets while occupying less disk space. Overall the bulk data files are 50% smaller than the previously used BZ2. It also makes selecting datasets of interest quicker and obviates the need to convert data from "long" to "wide" formats, which takes a long time and lots of RAM. In short, this is a big upgrade in end user accessibility to power large scale analysis of DEE2 transcriptome data.

The materials here are mostly based on the rhdf5 package here.

I’ll demonstrate with E. coli, but this should also work for other organisms.

First step is to load the rhdf5 library and download the h5 file.

library("rhdf5")
library("tictoc")

if ( file.exists("ecoli_se.h5") ) {
  message("HDF5 file exists")
} else {
  message("Downloading HDF5 file")
  download.file("https://dee2.io/mx/ecoli_se.h5",destfile="ecoli_se.h5")
  download.file("https://dee2.io/mx/ecoli_qc.h5",destfile="ecoli_qc.h5")
}
## HDF5 file exists

Now we can take a look at the contents of the two HDF5 files. The first one called “ecoli_se.h5” contains the STAR gene counts, which are integers. The rows represent SRR runs and the columns represent genes. The “ecoli_qc.h5” file contains the quality control data.

h5ls("ecoli_se.h5")
## Warning in h5checktypeOrOpenLoc(file, readonly = TRUE, fapl = NULL, native =
## native): An open HDF5 file handle exists. If the file has changed on disk
## meanwhile, the function may not work properly. Run 'h5closeAll()' to close all
## open HDF5 object handles.
##   group      name       otype  dclass          dim
## 0     / bigmatrix H5I_DATASET INTEGER 17293 x 4497
## 1     /  colnames H5I_DATASET  STRING         4497
## 2     /  rownames H5I_DATASET  STRING        17293
h5ls("ecoli_qc.h5")
## Warning in h5checktypeOrOpenLoc(file, readonly = TRUE, fapl = NULL, native =
## native): An open HDF5 file handle exists. If the file has changed on disk
## meanwhile, the function may not work properly. Run 'h5closeAll()' to close all
## open HDF5 object handles.
##   group      name       otype dclass        dim
## 0     / bigmatrix H5I_DATASET STRING 17293 x 29
## 1     /  colnames H5I_DATASET STRING         29
## 2     /  rownames H5I_DATASET STRING      17293

If you want to do a large scale analysis, then you can load the data into memory.

Warning, this will occupy a large amount of RAM which could exceed your system memory.

For E. coli, the 17293 SRR runs occupies 298MB of RAM, which is manageable for most systems.

yy1 <- h5read(file = "ecoli_se.h5", name = "/bigmatrix")
## Warning in h5checktypeOrOpenLoc(file, readonly = TRUE, fapl = NULL, native =
## native): An open HDF5 file handle exists. If the file has changed on disk
## meanwhile, the function may not work properly. Run 'h5closeAll()' to close all
## open HDF5 object handles.
dim(yy1)
## [1] 17293  4497
format(object.size(yy1),units= "Mb")
## [1] "296.7 Mb"

The STAR gene count matrix is loaded, but it doesn’t have row or column names. Let’s fix that.

yy1[1:10,1:4]
##       [,1]  [,2] [,3] [,4]
##  [1,] 3519  5342  887 2282
##  [2,]  676   779  129  507
##  [3,] 3399  9331 2152 5793
##  [4,] 1480  3326  995 1479
##  [5,] 1459  3353  159  721
##  [6,] 5189  6788  835 2498
##  [7,] 2186  7111 2105 3335
##  [8,]  387   621  115  378
##  [9,] 6751 16233 3058 7130
## [10,] 5983 17701 5327 7233
mycolnames <- h5read(file = "ecoli_se.h5", name = "/colnames")
## Warning in h5checktypeOrOpenLoc(file, readonly = TRUE, fapl = NULL, native =
## native): An open HDF5 file handle exists. If the file has changed on disk
## meanwhile, the function may not work properly. Run 'h5closeAll()' to close all
## open HDF5 object handles.
myrownames <- h5read(file = "ecoli_se.h5", name = "/rownames")
## Warning in h5checktypeOrOpenLoc(file, readonly = TRUE, fapl = NULL, native =
## native): An open HDF5 file handle exists. If the file has changed on disk
## meanwhile, the function may not work properly. Run 'h5closeAll()' to close all
## open HDF5 object handles.
colnames(yy1) <- mycolnames
rownames(yy1) <- myrownames
yy1[1:6,1:4]
##           b0001 b0002 b0003 b0004
## DRR021383  3519  5342   887  2282
## DRR021384   676   779   129   507
## DRR021385  3399  9331  2152  5793
## DRR021386  1480  3326   995  1479
## DRR021387  1459  3353   159   721
## DRR021388  5189  6788   835  2498

That’s not really practical when working with a truly huge dataset like over 100GB, so here is a more efficient way.

This method allows us to select data before it gets read from the disk, which makes it extremely efficient.

See here that the yy2 object doesn’t occupy any memory, yet allows us to select rows and columns.

h5f <- H5Fopen("ecoli_se.h5")
yy2 <- h5f&'bigmatrix'
format(object.size(yy2),units= "Mb")
## [1] "0 Mb"
dim(yy2) # doesn't work
## NULL
format(object.size(yy2),units= "Mb")
## [1] "0 Mb"
yy2[1:10,1:4]
##       [,1]  [,2] [,3] [,4]
##  [1,] 3519  5342  887 2282
##  [2,]  676   779  129  507
##  [3,] 3399  9331 2152 5793
##  [4,] 1480  3326  995 1479
##  [5,] 1459  3353  159  721
##  [6,] 5189  6788  835 2498
##  [7,] 2186  7111 2105 3335
##  [8,]  387   621  115  378
##  [9,] 6751 16233 3058 7130
## [10,] 5983 17701 5327 7233

Still, the data doesn’t have row or column names, so we should get those. Note that here I’m loading the row and column names into the memory. These will only occupy a few MB, so aren’t a big problem.

mycolnames <- h5read(file = "ecoli_se.h5", name = "/colnames")
## Warning in h5checktypeOrOpenLoc(file, readonly = TRUE, fapl = NULL, native =
## native): An open HDF5 file handle exists. If the file has changed on disk
## meanwhile, the function may not work properly. Run 'h5closeAll()' to close all
## open HDF5 object handles.
myrownames <- h5read(file = "ecoli_se.h5", name = "/rownames")
## Warning in h5checktypeOrOpenLoc(file, readonly = TRUE, fapl = NULL, native =
## native): An open HDF5 file handle exists. If the file has changed on disk
## meanwhile, the function may not work properly. Run 'h5closeAll()' to close all
## open HDF5 object handles.

A typical workflow requires sselection of SRR runs of interest. In this toy example we will select some randomly.

set.seed(42)
mySRR <- sort(sample(myrownames,10,replace=FALSE))
mySRR
##  [1] "ERR4763003"  "SRR13080361" "SRR18399252" "SRR19643516" "SRR19643982"
##  [6] "SRR21776468" "SRR5456731"  "SRR7057997"  "SRR8517438"  "SRR9163261"

Now get the row indexes.

myrowindexes <- which(myrownames %in% mySRR)
myrowindexes
##  [1]  1252  3954  7700  8826  9290 10289 13440 14360 15506 16740

Now it is simple enough to select the rows we need, and in this case, retrieving these rows takes less than one second.

tic()
dat <- yy2[myrowindexes,]
toc()
## 0.062 sec elapsed
str(dat)
##  int [1:10, 1:4497] 197 32 20 427 1439 0 1160 590 1 28 ...
dat[1:10,1:6]
##       [,1]  [,2] [,3] [,4] [,5] [,6]
##  [1,]  197   593  163  323   27  202
##  [2,]   32   211   77  156   25  495
##  [3,]   20   114   44   64    6   23
##  [4,]  427   973  323  423   33  156
##  [5,] 1439 18821 3944 3728   33   91
##  [6,]    0    25    2    4    0   11
##  [7,] 1160   698  278  431   14  217
##  [8,]  590 18776 4037 6165  218  478
##  [9,]    1     2    1    2    2   11
## [10,]   28  8771 3598 4436  133  127

But we still need row andd column names.

We can also transpose the data, so it complies with the standard count matrix format with rows representing genes and column representing samples.

rownames(dat) <- mySRR
colnames(dat) <- mycolnames
dat[1:10,1:6]
##             b0001 b0002 b0003 b0004 b0005 b0006
## ERR4763003    197   593   163   323    27   202
## SRR13080361    32   211    77   156    25   495
## SRR18399252    20   114    44    64     6    23
## SRR19643516   427   973   323   423    33   156
## SRR19643982  1439 18821  3944  3728    33    91
## SRR21776468     0    25     2     4     0    11
## SRR5456731   1160   698   278   431    14   217
## SRR7057997    590 18776  4037  6165   218   478
## SRR8517438      1     2     1     2     2    11
## SRR9163261     28  8771  3598  4436   133   127
tdat <- t(dat)
tdat[1:10,1:6]
##       ERR4763003 SRR13080361 SRR18399252 SRR19643516 SRR19643982 SRR21776468
## b0001        197          32          20         427        1439           0
## b0002        593         211         114         973       18821          25
## b0003        163          77          44         323        3944           2
## b0004        323         156          64         423        3728           4
## b0005         27          25           6          33          33           0
## b0006        202         495          23         156          91          11
## b0007         28          34           9          56         203           4
## b0008       4632        3186         202        3602        6336          39
## b0009        315         188          21         131         473           8
## b0010        219          52           5          61         424           0

For completeness, let’s take a look at the quality control information for these data. You may want to eliminate any poor runs.

h5f2 <- H5Fopen("ecoli_qc.h5")
h5ls(h5f2)
##   group      name       otype dclass        dim
## 0     / bigmatrix H5I_DATASET STRING 17293 x 29
## 1     /  colnames H5I_DATASET STRING         29
## 2     /  rownames H5I_DATASET STRING      17293
qc1 <- h5read(file = "ecoli_qc.h5", name = "/bigmatrix")
## Warning in h5checktypeOrOpenLoc(file, readonly = TRUE, fapl = NULL, native =
## native): An open HDF5 file handle exists. If the file has changed on disk
## meanwhile, the function may not work properly. Run 'h5closeAll()' to close all
## open HDF5 object handles.
mycolnames <- h5read(file = "ecoli_qc.h5", name = "/colnames")
## Warning in h5checktypeOrOpenLoc(file, readonly = TRUE, fapl = NULL, native =
## native): An open HDF5 file handle exists. If the file has changed on disk
## meanwhile, the function may not work properly. Run 'h5closeAll()' to close all
## open HDF5 object handles.
myrownames <- h5read(file = "ecoli_qc.h5", name = "/rownames")
## Warning in h5checktypeOrOpenLoc(file, readonly = TRUE, fapl = NULL, native =
## native): An open HDF5 file handle exists. If the file has changed on disk
## meanwhile, the function may not work properly. Run 'h5closeAll()' to close all
## open HDF5 object handles.
myrowindexes <- which(myrownames %in% mySRR)

qc1[myrowindexes,]
Researchers using the HDF5 files should keep a copy of the bulk data files for their records. 
Thanks to A/Prof Alexey Sergushichev for the suggestion to use HDF5 format!

Popular posts from this blog

Data analysis step 8: Pathway analysis with GSEA

Uploading data to GEO - which method is faster?

Generate an RNA-seq count matrix with featureCounts