OPTIMIZING SPARSE ARRAY STORAGE FOR GENOMICS

TECHNOLOGY

01 / FAST

Using high-level APIs in C++, Java, and Spark, users can write and read variant records to and from shared-nothing GenomicsDB instances in parallel, with multiple processes running in a Single Program, Multiple Data (SPMD) manner.
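
To make the SPMD idea concrete, here is a minimal Python sketch. It does not use the GenomicsDB API: the `Record` layout, `column_range_for_rank`, and `run_rank` are hypothetical names chosen for illustration, and the ranks are simulated in a simple loop rather than launched as separate processes. It only shows how every process can run the same program and use its rank to pick a disjoint slice of the columns.

```python
from typing import List, Tuple

# Hypothetical record layout for illustration: (sample_row, genome_column, genotype)
Record = Tuple[int, int, str]


def column_range_for_rank(rank: int, world_size: int, num_columns: int) -> Tuple[int, int]:
    """Return the half-open column interval [start, end) owned by `rank`.

    Columns are split as evenly as possible; the first `extra` ranks
    each take one additional column.
    """
    base, extra = divmod(num_columns, world_size)
    start = rank * base + min(rank, extra)
    end = start + base + (1 if rank < extra else 0)
    return start, end


def run_rank(rank: int, world_size: int, num_columns: int,
             records: List[Record]) -> List[Record]:
    """The single program every process runs; only `rank` differs."""
    start, end = column_range_for_rank(rank, world_size, num_columns)
    # Each rank keeps only the records that fall in its own column slice,
    # so no two ranks ever touch the same data (shared-nothing).
    return [r for r in records if start <= r[1] < end]


records: List[Record] = [(0, 5, "0/1"), (1, 42, "1/1"), (2, 77, "0/0"), (0, 99, "0/1")]
world_size = 4
num_columns = 100

# Assumption: ranks are simulated sequentially here. In a real deployment
# each rank would be its own process, possibly on a different machine.
per_rank = [run_rank(rank, world_size, num_columns, records)
            for rank in range(world_size)]
```

Because the column slices are disjoint and together cover the whole range, the per-rank results can be written or read independently and combined without coordination between processes.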

02 / SCALABLE

GenomicsDB uses columnar sparse arrays in which samples are mapped to rows and genome positions (the sites of variants) are mapped to columns. These columns are partitioned in a shared-nothing fashion across thousands of machines, enabling the joint genotyping workflow in the Broad Institute’s Genome Analysis Toolkit (GATK) to scale to 100,000 samples and beyond.
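
The row and column mapping described above can be illustrated with a toy Python sketch. This is not GenomicsDB's storage format or API: `SparseVariantArray`, `write`, and `read_columns` are hypothetical names, and the in-memory dictionary stands in for on-disk storage. The sketch only shows the two ideas from the text: empty cells are never stored, and cells are ordered by column so that a range of genome positions can be read across all samples.

```python
from bisect import bisect_left
from typing import Dict, List, Tuple


class SparseVariantArray:
    """Toy sparse 2D array: rows are samples, columns are genome positions."""

    def __init__(self) -> None:
        # Only non-empty cells are stored. Keys are (column, row) so that
        # sorting the keys yields column-major ("columnar") order.
        self._cells: Dict[Tuple[int, int], str] = {}

    def write(self, sample_row: int, position_col: int, value: str) -> None:
        """Store one variant call for a sample at a genome position."""
        self._cells[(position_col, sample_row)] = value

    def read_columns(self, start_col: int, end_col: int) -> List[Tuple[int, int, str]]:
        """Return (column, row, value) for start_col <= column < end_col."""
        # Sorting on every read keeps the sketch short; a real store
        # would keep the cells sorted on disk instead.
        keys = sorted(self._cells)
        # Rows are non-negative, so (col, -1) sorts before every cell in `col`.
        lo = bisect_left(keys, (start_col, -1))
        hi = bisect_left(keys, (end_col, -1))
        return [(c, r, self._cells[(c, r)]) for c, r in keys[lo:hi]]

    def __len__(self) -> int:
        return len(self._cells)


array = SparseVariantArray()
array.write(sample_row=0, position_col=1000, value="0/1")
array.write(sample_row=2, position_col=1000, value="1/1")
array.write(sample_row=1, position_col=5000, value="0/1")

# All calls at position 1000, across every sample that has one.
calls_at_1000 = array.read_columns(1000, 1001)
```

Keying cells by column first means a column range is a contiguous run in sorted order, which is also what makes it natural to hand different column ranges to different machines.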

03 / EFFICIENT

GenomicsDB allows bioinformaticians to obtain analysis results with high statistical confidence. Its low-level storage format enables faster and more efficient retrieval from disk than file-based storage. In addition, by using libraries optimized for Intel® architecture to compress data on disk, GenomicsDB achieves a cumulative performance improvement of orders of magnitude over existing tools.

CHARTER

OUR STORY

GenomicsDB was initially developed by Intel in collaboration with the Broad Institute of MIT and Harvard. It is an open-source library and set of tools focused on optimizing sparse array storage for genomic data. It is currently hosted and developed by an open-source community sponsored by dātma Health Science.