DEE2 database gets HDF5
Here I’ll show you how to download and work with the new HDF5 datasets from DEE2 (dee2.io).
HDF5 files are provide fast random access to large and complex datasets while occupying less disk space. Overall the bulk data files are 50% smaller than the previously used BZ2. It also makes selecting datasets of interest quicker and obviates the need to convert data from "long" to "wide" formats, which takes a long time and lots of RAM. In short, this is a big upgrade in end user accessibility to power large scale analysis of DEE2 transcriptome data.
The materials here are mostly based on the rhdf5 package here.
I’ll demonstrate with E. coli, but this should also work for other organisms.
First step is to load the rhdf5
library and download the h5 file.
library("rhdf5")
library("tictoc")
if ( file.exists("ecoli_se.h5") ) {
message("HDF5 file exists")
} else {
message("Downloading HDF5 file")
download.file("https://dee2.io/mx/ecoli_se.h5",destfile="ecoli_se.h5")
download.file("https://dee2.io/mx/ecoli_qc.h5",destfile="ecoli_qc.h5")
}
## HDF5 file exists
Now we can take a look at the contents of the two HDF5 files. The first one called “ecoli_se.h5” contains the STAR gene counts, which are integers. The rows represent SRR runs and the columns represent genes. The “ecoli_qc.h5” file contains the quality control data.
h5ls("ecoli_se.h5")
## Warning in h5checktypeOrOpenLoc(file, readonly = TRUE, fapl = NULL, native =
## native): An open HDF5 file handle exists. If the file has changed on disk
## meanwhile, the function may not work properly. Run 'h5closeAll()' to close all
## open HDF5 object handles.
## group name otype dclass dim
## 0 / bigmatrix H5I_DATASET INTEGER 17293 x 4497
## 1 / colnames H5I_DATASET STRING 4497
## 2 / rownames H5I_DATASET STRING 17293
h5ls("ecoli_qc.h5")
## Warning in h5checktypeOrOpenLoc(file, readonly = TRUE, fapl = NULL, native =
## native): An open HDF5 file handle exists. If the file has changed on disk
## meanwhile, the function may not work properly. Run 'h5closeAll()' to close all
## open HDF5 object handles.
## group name otype dclass dim
## 0 / bigmatrix H5I_DATASET STRING 17293 x 29
## 1 / colnames H5I_DATASET STRING 29
## 2 / rownames H5I_DATASET STRING 17293
If you want to do a large scale analysis, then you can load the data into memory.
Warning, this will occupy a large amount of RAM which could exceed your system memory.
For E. coli, the 17293 SRR runs occupies 298MB of RAM, which is manageable for most systems.
yy1 <- h5read(file = "ecoli_se.h5", name = "/bigmatrix")
## Warning in h5checktypeOrOpenLoc(file, readonly = TRUE, fapl = NULL, native =
## native): An open HDF5 file handle exists. If the file has changed on disk
## meanwhile, the function may not work properly. Run 'h5closeAll()' to close all
## open HDF5 object handles.
dim(yy1)
## [1] 17293 4497
format(object.size(yy1),units= "Mb")
## [1] "296.7 Mb"
The STAR gene count matrix is loaded, but it doesn’t have row or column names. Let’s fix that.
yy1[1:10,1:4]
## [,1] [,2] [,3] [,4]
## [1,] 3519 5342 887 2282
## [2,] 676 779 129 507
## [3,] 3399 9331 2152 5793
## [4,] 1480 3326 995 1479
## [5,] 1459 3353 159 721
## [6,] 5189 6788 835 2498
## [7,] 2186 7111 2105 3335
## [8,] 387 621 115 378
## [9,] 6751 16233 3058 7130
## [10,] 5983 17701 5327 7233
mycolnames <- h5read(file = "ecoli_se.h5", name = "/colnames")
## Warning in h5checktypeOrOpenLoc(file, readonly = TRUE, fapl = NULL, native =
## native): An open HDF5 file handle exists. If the file has changed on disk
## meanwhile, the function may not work properly. Run 'h5closeAll()' to close all
## open HDF5 object handles.
myrownames <- h5read(file = "ecoli_se.h5", name = "/rownames")
## Warning in h5checktypeOrOpenLoc(file, readonly = TRUE, fapl = NULL, native =
## native): An open HDF5 file handle exists. If the file has changed on disk
## meanwhile, the function may not work properly. Run 'h5closeAll()' to close all
## open HDF5 object handles.
colnames(yy1) <- mycolnames
rownames(yy1) <- myrownames
yy1[1:6,1:4]
## b0001 b0002 b0003 b0004
## DRR021383 3519 5342 887 2282
## DRR021384 676 779 129 507
## DRR021385 3399 9331 2152 5793
## DRR021386 1480 3326 995 1479
## DRR021387 1459 3353 159 721
## DRR021388 5189 6788 835 2498
That’s not really practical when working with a truly huge dataset like over 100GB, so here is a more efficient way.
This method allows us to select data before it gets read from the disk, which makes it extremely efficient.
See here that the yy2
object doesn’t occupy any memory, yet allows us to select rows and columns.
h5f <- H5Fopen("ecoli_se.h5")
yy2 <- h5f&'bigmatrix'
format(object.size(yy2),units= "Mb")
## [1] "0 Mb"
dim(yy2) # doesn't work
## NULL
format(object.size(yy2),units= "Mb")
## [1] "0 Mb"
yy2[1:10,1:4]
## [,1] [,2] [,3] [,4]
## [1,] 3519 5342 887 2282
## [2,] 676 779 129 507
## [3,] 3399 9331 2152 5793
## [4,] 1480 3326 995 1479
## [5,] 1459 3353 159 721
## [6,] 5189 6788 835 2498
## [7,] 2186 7111 2105 3335
## [8,] 387 621 115 378
## [9,] 6751 16233 3058 7130
## [10,] 5983 17701 5327 7233
Still, the data doesn’t have row or column names, so we should get those. Note that here I’m loading the row and column names into the memory. These will only occupy a few MB, so aren’t a big problem.
mycolnames <- h5read(file = "ecoli_se.h5", name = "/colnames")
## Warning in h5checktypeOrOpenLoc(file, readonly = TRUE, fapl = NULL, native =
## native): An open HDF5 file handle exists. If the file has changed on disk
## meanwhile, the function may not work properly. Run 'h5closeAll()' to close all
## open HDF5 object handles.
myrownames <- h5read(file = "ecoli_se.h5", name = "/rownames")
## Warning in h5checktypeOrOpenLoc(file, readonly = TRUE, fapl = NULL, native =
## native): An open HDF5 file handle exists. If the file has changed on disk
## meanwhile, the function may not work properly. Run 'h5closeAll()' to close all
## open HDF5 object handles.
A typical workflow requires sselection of SRR runs of interest. In this toy example we will select some randomly.
set.seed(42)
mySRR <- sort(sample(myrownames,10,replace=FALSE))
mySRR
## [1] "ERR4763003" "SRR13080361" "SRR18399252" "SRR19643516" "SRR19643982"
## [6] "SRR21776468" "SRR5456731" "SRR7057997" "SRR8517438" "SRR9163261"
Now get the row indexes.
myrowindexes <- which(myrownames %in% mySRR)
myrowindexes
## [1] 1252 3954 7700 8826 9290 10289 13440 14360 15506 16740
Now it is simple enough to select the rows we need, and in this case, retrieving these rows takes less than one second.
tic()
dat <- yy2[myrowindexes,]
toc()
## 0.062 sec elapsed
str(dat)
## int [1:10, 1:4497] 197 32 20 427 1439 0 1160 590 1 28 ...
dat[1:10,1:6]
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 197 593 163 323 27 202
## [2,] 32 211 77 156 25 495
## [3,] 20 114 44 64 6 23
## [4,] 427 973 323 423 33 156
## [5,] 1439 18821 3944 3728 33 91
## [6,] 0 25 2 4 0 11
## [7,] 1160 698 278 431 14 217
## [8,] 590 18776 4037 6165 218 478
## [9,] 1 2 1 2 2 11
## [10,] 28 8771 3598 4436 133 127
But we still need row andd column names.
We can also transpose the data, so it complies with the standard count matrix format with rows representing genes and column representing samples.
rownames(dat) <- mySRR
colnames(dat) <- mycolnames
dat[1:10,1:6]
## b0001 b0002 b0003 b0004 b0005 b0006
## ERR4763003 197 593 163 323 27 202
## SRR13080361 32 211 77 156 25 495
## SRR18399252 20 114 44 64 6 23
## SRR19643516 427 973 323 423 33 156
## SRR19643982 1439 18821 3944 3728 33 91
## SRR21776468 0 25 2 4 0 11
## SRR5456731 1160 698 278 431 14 217
## SRR7057997 590 18776 4037 6165 218 478
## SRR8517438 1 2 1 2 2 11
## SRR9163261 28 8771 3598 4436 133 127
tdat <- t(dat)
tdat[1:10,1:6]
## ERR4763003 SRR13080361 SRR18399252 SRR19643516 SRR19643982 SRR21776468
## b0001 197 32 20 427 1439 0
## b0002 593 211 114 973 18821 25
## b0003 163 77 44 323 3944 2
## b0004 323 156 64 423 3728 4
## b0005 27 25 6 33 33 0
## b0006 202 495 23 156 91 11
## b0007 28 34 9 56 203 4
## b0008 4632 3186 202 3602 6336 39
## b0009 315 188 21 131 473 8
## b0010 219 52 5 61 424 0
For completeness, let’s take a look at the quality control information for these data. You may want to eliminate any poor runs.
h5f2 <- H5Fopen("ecoli_qc.h5")
h5ls(h5f2)
## group name otype dclass dim
## 0 / bigmatrix H5I_DATASET STRING 17293 x 29
## 1 / colnames H5I_DATASET STRING 29
## 2 / rownames H5I_DATASET STRING 17293
qc1 <- h5read(file = "ecoli_qc.h5", name = "/bigmatrix")
## Warning in h5checktypeOrOpenLoc(file, readonly = TRUE, fapl = NULL, native =
## native): An open HDF5 file handle exists. If the file has changed on disk
## meanwhile, the function may not work properly. Run 'h5closeAll()' to close all
## open HDF5 object handles.
mycolnames <- h5read(file = "ecoli_qc.h5", name = "/colnames")
## Warning in h5checktypeOrOpenLoc(file, readonly = TRUE, fapl = NULL, native =
## native): An open HDF5 file handle exists. If the file has changed on disk
## meanwhile, the function may not work properly. Run 'h5closeAll()' to close all
## open HDF5 object handles.
myrownames <- h5read(file = "ecoli_qc.h5", name = "/rownames")
## Warning in h5checktypeOrOpenLoc(file, readonly = TRUE, fapl = NULL, native =
## native): An open HDF5 file handle exists. If the file has changed on disk
## meanwhile, the function may not work properly. Run 'h5closeAll()' to close all
## open HDF5 object handles.
myrowindexes <- which(myrownames %in% mySRR)
qc1[myrowindexes,]
Researchers using the HDF5 files should keep a copy of the bulk data files for their records.