From the beginning, we designed BaseSpace to be a place where things moved quickly. We did this largely so that our customers and partners could easily deploy their own apps on the platform (more to come on that in a later post), but we also did it out of necessity, to keep pace with the breakneck innovation from our sequencing systems. As many of you no doubt have seen, Illumina recently announced a new high-throughput sequencing instrument called the HiSeq 2500, which will produce high-quality, high-coverage human genomes in a single day. This week, at the Advances in Genome Biology and Technology (AGBT) meeting, Illumina is releasing the first public dataset to be run on this new system, described in the Application Note here.
The Coriell sample NA18507 was prepared using a modified version of the TruSeq DNA sample prep protocol and sequenced at 40X depth in “rapid run” mode. It yielded >90% of reads above Q30, for an output of ~135 Gb. BCL files generated by the HiSeq 2500 were converted to FASTQ files and aligned against human reference build hg19; BAM files and variant calls were generated using CASAVA v1.8.2 and showed >95% dbSNP concordance. A few additional secondary build metrics:
In keeping with our rapid deployment ethos, this week the results of this first dataset are being made available for navigation within BaseSpace using a prototype genome browser. You can view histograms of both SNPs and coverage when zoomed out, move seamlessly from the chromosome level to the base pair level, view directional stacked reads showing variants relative to the reference, use the gene track to link out to NCBI, and download variants in VCF format.
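If you download the variants in VCF format, you can build a rough version of the zoomed-out SNP histogram yourself. The sketch below is only an illustration and not how the BaseSpace browser is implemented: the VCF excerpt is made up (it is not from the NA18507 dataset), and the function names and the 10 kb bin size are assumptions chosen for the example. It relies only on the standard tab-separated VCF column order (CHROM, POS, ID, REF, ALT).

```python
from collections import Counter

# Hypothetical VCF excerpt for illustration only; not real NA18507 calls.
SAMPLE_VCF = """\
##fileformat=VCFv4.1
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO
chr1\t10150\t.\tA\tG\t50\tPASS\t.
chr1\t10900\t.\tC\tT\t45\tPASS\t.
chr1\t25010\t.\tG\tGA\t60\tPASS\t.
chr1\t31000\t.\tT\tC,G\t70\tPASS\t.
chr2\t5000\t.\tA\tT\t30\tPASS\t.
"""


def is_snp(ref, alt):
    """Return True when REF and every ALT allele are single bases."""
    alleles = [ref] + alt.split(",")
    return all(len(a) == 1 and a in "ACGT" for a in alleles)


def snp_histogram(vcf_text, bin_size=10_000):
    """Count SNPs per (chromosome, bin start) in fixed-width bins.

    bin_size is an assumed value; a browser would pick it from the zoom level.
    """
    counts = Counter()
    for line in vcf_text.splitlines():
        # Skip blank lines, meta lines (##...), and the header line (#CHROM...).
        if not line or line.startswith("#"):
            continue
        chrom, pos, _id, ref, alt = line.split("\t")[:5]
        if is_snp(ref.upper(), alt.upper()):
            # VCF positions are 1-based; convert to a 0-based bin start.
            bin_start = (int(pos) - 1) // bin_size * bin_size
            counts[(chrom, bin_start)] += 1
    return counts


HIST = snp_histogram(SAMPLE_VCF)
for (chrom, start), n in sorted(HIST.items()):
    print(chrom, start, n)
```

For a real download you would read the file line by line rather than holding it in a string, and the insertion at position 25010 in the sample is skipped because it is not a single-base substitution.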
More importantly, it’s fast enough that serving a 135Gbyte dataset to a thin client feels about as smooth as running through MiSeq data. The fact that data from the world’s fastest human genome sequencing machine can be analyzed this way shows that BaseSpace is on its way to dramatically reducing the complexity of sequencing humans.
The browser highlights what’s possible in terms of “big data” storage and visualization in BaseSpace. But note that it is a prototype we’ve stood up just for the purpose of examining this dataset; we’ll have something far more feature-forward in the coming months, and that browser will go into general release for all users. So with that disclaimer, please enjoy the dataset, explore the nooks and crannies of BaseSpace, and use the feedback button on the right to let us know how we can make it better.
Access Genome-in-a-Day dataset:
– Jason Blue-Smith, Product Manager