Ten years ago, when scientists amassed raw data for the first mapped human genome, it required years of collaborative effort. Today, BGI Shenzhen, the world's largest genomics research institute, can do it in two hours or less.
Genomes have made biology a big-data science, with technicians sifting through billions of nucleotides, the building blocks of DNA, to map a single genome. While sequencing has accelerated, the ability to analyze, store and transmit data has lagged. Even BGI, which has over 3,000 technicians, is overwhelmed.
"For many projects, they are generating more data than they can handle," said Scott Edmunds, executive editor at GigaScience, an open-data publication recently launched by BGI and publisher BioMed Central.
The dilemma has pushed BGI to build infrastructure, streamline its computing processes and make its data more widely available, Edmunds said.
"We want to offer this infrastructure to others," Edmunds added. "We want to expand access to the data we have."
An abundance of data
There are approximately three billion data points in a human genome. Within this long string of data points are sequences that code for certain proteins, which we call genes. With only 22,000 genes identified in the human genome, we've barely scratched the surface.
To add to the challenge, the data that BGI's sequencing machines spit out is fragmented and unordered. Instead of yielding a single string containing billions of properly organized nucleotides (a task no sequencing machine is capable of yet), machines produce shorter sections, each only a few hundred or a few thousand nucleotides long. It's up to computers and technicians to organize them in a meaningful way, and that step has become a bottleneck for researchers.
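To give a rough sense of what "organizing" short sections involves, here is a toy sketch in Python that stitches overlapping fragments back into one string by repeatedly merging the two pieces that share the longest overlap. It is an illustration only, not BGI's pipeline: the function names, the `min_len` threshold and the sample reads are all made up for this example, and real assemblers must also cope with sequencing errors, repeated regions and vastly larger inputs.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`.

    Overlaps shorter than `min_len` are ignored and reported as 0.
    """
    max_len = min(len(a), len(b))
    for n in range(max_len, min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Toy greedy assembly: keep merging the best-overlapping pair of reads.

    Returns the remaining pieces once no pair overlaps by at least
    `min_len` characters.
    """
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_len, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            return contigs
        # Join the two reads, writing the shared region only once.
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


# Hypothetical reads, given out of order, that cover one short sequence.
reads = ["ACGTTAGC", "ATGGCGTA", "GCGTACGT"]
print(greedy_assemble(reads))  # ['ATGGCGTACGTTAGC']
```

Even this simple version compares every fragment against every other fragment on each pass, which hints at why the computing side struggles to keep pace when the input is billions of nucleotides rather than three short strings.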
BGI hopes to overcome the bottleneck by sharing data. For World Hunger Day in May, for instance, BGI made over 13 terabytes of rice genome data available. Researchers can download data on individual strains of rice from BGI's website. Once analyzed, the information will help scientists breed and engineer higher-yield strains of rice.