Ten years ago, when scientists amassed raw data for the first mapped human genome, it required years of collaborative effort. Today, BGI Shenzhen, the world's largest genomics research institute, can do it in two hours or less.
Genomes have made biology a big-data science, with technicians sifting through billions of nucleotides, the building blocks of DNA, to map a single genome. While sequencing has accelerated, the ability to analyze, store and transmit data has lagged. Even BGI, which has over 3,000 technicians, is overwhelmed.
"For many projects, they are generating more data than they can handle," said Scott Edmunds, executive editor at GigaScience, an open-data publication recently launched by BGI and publisher BioMedCentral.
The dilemma has pushed BGI to build infrastructure, streamline its computing processes and make its data more widely available, Edmunds said.
"We want to offer this infrastructure to others," Edmunds added. "We want to expand access to the data we have."
An abundance of data
There are approximately three billion data points in a human genome. Within this long string are genes: sequences that code for certain proteins. With only 22,000 genes identified in the human genome, we've barely scratched the surface.
To add to the challenge, the data that BGI's sequencing machines spit out is fragmented and unordered. Instead of yielding a single string containing billions of properly organized nucleotides (a task no sequencing machine is capable of yet), machines produce shorter sections, each only a few hundred or thousand nucleotides long. It's up to computers and technicians to organize them in a meaningful way, which has left researchers with a bottleneck.
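The reassembly step can be illustrated with a toy example. The sketch below is a simple greedy overlap merge, chosen for brevity; it is not BGI's pipeline, and production assemblers use far more sophisticated methods to cope with sequencing errors, repeats and billions of reads. The function names, the minimum overlap of three and the sample reads are all hypothetical.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`.

    Returns 0 when the best overlap is shorter than `min_len`.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of fragments with the largest overlap.

    Stops when one fragment remains or no pair overlaps by `min_len`.
    """
    fragments = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(fragments) > 1:
        best_n, best_i, best_j = 0, -1, -1
        for i, a in enumerate(fragments):
            for j, b in enumerate(fragments):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # nothing left to merge
        merged = fragments[best_i] + fragments[best_j][best_n:]
        fragments = [f for k, f in enumerate(fragments) if k not in (best_i, best_j)]
        fragments.append(merged)
    return fragments


if __name__ == "__main__":
    # Three unordered, overlapping reads from a made-up 12-letter sequence.
    print(greedy_assemble(["ACCTGA", "ATGCGT", "CGTACC"]))  # ['ATGCGTACCTGA']
```

Even in this toy form, every pair of fragments is compared on every pass, which hints at why the organizing step, rather than the sequencing itself, becomes the bottleneck as the number of reads grows.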
BGI hopes to overcome the bottleneck by sharing data. For World Hunger Day in May, for instance, BGI made over 13 terabytes of rice genome data available. Researchers can download data on individual strains of rice from BGI's website. Once analyzed, the information will help scientists breed and engineer higher-yield strains of rice.
"We want to get it out there early so others can work on it," Edmunds said.
With large quantities of data, storage and transportation are an issue. When BGI offers open-access data, it relies on in-house cloud storage to facilitate sharing. However, many of BGI's large commercial projects are still transported via hard drives. In an ideal world, BGI would be able to offer access to data remotely.
"The problem we face is bandwidth," said Xu Xing, the director of BGI's cloud computing project. "Things move too slowly."
This is a common problem for China-based companies, for which bandwidth is expensive, he said. BGI chose to locate GigaScience in Hong Kong, where data-transfer options are better.
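To get a rough sense of why bandwidth matters at this scale, the following back-of-the-envelope sketch estimates how long the 13 terabytes of rice data mentioned above would take to move over a network link. The link speeds are illustrative assumptions, not figures from BGI, and the calculation assumes decimal terabytes and a fully utilized link with no protocol overhead.

```python
SECONDS_PER_DAY = 86_400


def transfer_days(size_terabytes: float, link_mbit_per_s: float) -> float:
    """Estimate the days needed to move a dataset over a fully utilized link."""
    size_bits = size_terabytes * 1e12 * 8          # decimal terabytes -> bits
    seconds = size_bits / (link_mbit_per_s * 1e6)  # Mbit/s -> bits per second
    return seconds / SECONDS_PER_DAY


if __name__ == "__main__":
    # Hypothetical link speeds, chosen for illustration only.
    for speed in (10, 100, 1000):
        print(f"{speed:>5} Mbit/s: {transfer_days(13, speed):.1f} days")
```

At an assumed 100 Mbit/s, 13 terabytes would take roughly 12 days under these idealized conditions, which helps explain why shipping hard drives can still compete with a network transfer.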
Delivering data in a manageable way is another issue. One reason for sharing data is that benchmarks provided by previously coded genomes make it easier for computers and technicians to put together subsequent genomes. The more genomes are coded, the easier it is to compare and contrast.
"You need to include all the metadata," for this to be effective, Xu said. "You must include where the sample is from and what parameters you used to analyze, so researchers both know where data is and where it comes from."
Making analysis easier
BGI is developing a number of platforms to deal with other big-data issues by making analytical tools easily available.
Its Easy Genomics platform aims to offer data access through cloud storage. The platform enables researchers to perform analysis in a web browser using built-in computation algorithms and data management tools and doesn't require local data storage.
BGI isn't the only company determined to make the expanding mass of genomic data publicly available. Companies like DNA Nexus and Biospace offer access on public cloud networks.