Genetic Information and the Data Avalanche in Biology
The human genome comprises 6 billion pairs of its 4 constituent units (“bases”) of DNA. It programs the development of an exquisitely sculpted organism with 100 trillion cells arranged into a myriad of organs, muscles and bones, as well as a brain with 100 billion neurons, each with an average of 10,000 connections.
It is 57 years since Watson and Crick solved the structure of DNA. The field gathered speed in the mid-1970s with the introduction of DNA cloning and sequencing, which enabled decoding of the information, much of which, surprisingly, lay outside conventional genes and was assumed (incorrectly) to be junk. These technologies became progressively more sophisticated over the ensuing years, to the point that by 1990 it was feasible to consider decoding the entire genome. The first draft sequence was completed ten years later. It involved thousands of DNA sequencing machines and cost several billion dollars. It was a tour de force, whose attempt, let alone achievement, was inconceivable just 20 years earlier.
However, this was just the beginning, and the pace of change since has been dizzying. Over the past few years a beautiful intersection of nanotechnologies, optical technologies and DNA technologies has revolutionized DNA sequencing. The volume of data has exploded, and the cost is dropping like a stone. The latest generation of automated machines can generate 200 billion bases (“gigabases”) of DNA sequence per run (over 60 human genome equivalents) in just a few days.
The cost of sequencing a human genome is now less than $10,000. The amount of DNA sequence being deposited in databases is growing 5- to 10-fold per annum, which makes Moore’s Law of computing look lethargic. New technologies are coming, such as reading sequence by electrical charge disturbance as molecules pass through nanopores, and it is expected that sequencing costs will fall much further in the near future.
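As a rough illustration of these comparisons, the sketch below works through the arithmetic. Only the 200-gigabase run output and the 5- to 10-fold annual growth come from the text above; the haploid genome size of ~3.1 gigabases and the 18-month Moore's Law doubling period are my assumptions for the sketch.

```python
# Illustrative arithmetic for the claims above. Assumed figures (not from
# the article): a haploid human reference genome of ~3.1 gigabases, and
# Moore's Law approximated as a doubling every 18 months.

HUMAN_GENOME_GBASES = 3.1  # haploid reference genome, in gigabases (assumed)
RUN_OUTPUT_GBASES = 200.0  # sequencer output per run, from the text

print(f"Human genome equivalents per run: "
      f"~{RUN_OUTPUT_GBASES / HUMAN_GENOME_GBASES:.0f}")

def cumulative_growth(annual_factor: float, years: int) -> float:
    """Fold increase after `years` of compounding at a fixed annual factor."""
    return annual_factor ** years

YEARS = 5
moore = cumulative_growth(2 ** (12 / 18), YEARS)  # 18-month doubling time
print(f"Over {YEARS} years, Moore's Law yields ~{moore:.0f}x growth,")
print(f"versus {cumulative_growth(5, YEARS):,.0f}x at 5-fold/yr "
      f"and {cumulative_growth(10, YEARS):,.0f}x at 10-fold/yr.")
```

On these assumptions, even the conservative 5-fold annual growth outpaces Moore's Law by more than two orders of magnitude over five years, which is the sense in which Moore's Law looks "lethargic" by comparison.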
This information avalanche is transforming our understanding of biology. Sooner rather than later, every genome of scientific or practical interest will be sequenced. Individual genome sequencing will soon become standard in medicine. This will be expanded by data concerning the epigenome (contextual chemical changes to DNA and the proteins around which it is wrapped) and the transcriptome (the repertoire of RNAs produced from the genome), both of which are far more complex than the genome itself.
This is creating huge challenges for present and future global collaboration, and a dependency on network connectivity between researchers, clinicians and HPC facilities well beyond that which is available today. These challenges include a massive increase in data storage, which will soon rise to exabytes, and then zettabytes, as well as requirements for data repositories that are scalable in both disk capacity and I/O performance. There will be equal challenges (and opportunities) associated with the development and integration of new software and visualization tools to store and interrogate this data deluge: the “Fourth Paradigm” enunciated by Jim Gray. More exciting is the emerging realization that evolution discovered the power of advanced information systems well before we did: the human genome is the most sophisticated zip file of hardware and software specification yet known, with many lessons to be learnt for both biology and IT.
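To make the storage trajectory concrete, here is a back-of-envelope sketch. The per-genome figure of ~100 GB of raw sequence data (reads plus quality scores at roughly 30x coverage) is an assumption for illustration, not a figure from the text, as are the population sizes.

```python
# Back-of-envelope storage estimate for routine individual genome sequencing.
# The ~100 GB raw-data footprint per genome is an assumption for this sketch,
# not a figure from the article.

GB = 10 ** 9   # gigabyte
EB = 10 ** 18  # exabyte

RAW_BYTES_PER_GENOME = 100 * GB  # assumed raw footprint per individual

for people in (1_000_000, 100_000_000, 7_000_000_000):
    total_bytes = people * RAW_BYTES_PER_GENOME
    print(f"{people:>13,} genomes -> {total_bytes / EB:>7.2f} EB")
```

On these assumptions, a million genomes already approaches an exabyte-scale archive, and population-scale sequencing nears a zettabyte, before the even larger epigenomic and transcriptomic datasets are counted.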