Hap.py app on BaseSpace® Sequence Hub: GA4GH benchmarking of small variant calls

Bioinformatics tools are a key component of the next-generation sequencing (NGS) workflow and can have a significant impact on the results. Alignment and variant calling, in particular, involve complex algorithms, each with unique strengths and weaknesses. The Broad Institute’s BWA+GATK application is among the most popular, but over the last few years more alignment and variant calling methods have been released by companies including Illumina, Edico Genome, and Sentieon. With multiple methods available, there is a clear need to compare their results so that users can select the best tool for their purpose.

The new Hap.py app available on BaseSpace Sequence Hub enables users to compare diploid genotypes at the haplotype level by generating and matching alternate sequences in a small region of the genome that contains one or more variants. Hap.py makes it easy to compare any variant call set against a range of packaged gold-standard truth sets [1, 2] to perform routine benchmarking.

Benchmarking variant calls has an important role in a variety of applications, such as validating a sequencing pipeline, testing new software, and routine quality assurance. In these contexts, the benchmarking workflow involves sequencing a known reference sample with a corresponding gold-standard truth set available, processing this sequence data with an alignment and variant calling pipeline, then comparing the resulting call set against the gold standard.

GA4GH Benchmarking

The Global Alliance for Genomics and Health (GA4GH) is an international collaborative effort which aims to establish standards that in turn encourage the development of interoperable bioinformatics tools. This standardization effort is vital as the field of genomics moves from academic research into clinical research applications. Among the many GA4GH initiatives, a benchmarking group was formed with the specific aim of establishing best practices for how variant benchmarking should be done to ensure accurate and reproducible results.

This GA4GH benchmarking initiative has developed best practices for small variant benchmarking which are now implemented for BaseSpace Sequence Hub users by the BaseSpace Hap.py Benchmarking application. Specifically, these best practices recommend:

  • RTG-tools vcfeval as the variant comparison engine, with its ability to match even partial haplotypes between truth set and query alleles
  • Quantification by the hap.py command line tool of true positive, false positive, false negative and unassessed calls
  • Optional stratification of benchmarking metrics into difficult regions such as regions of low complexity or biased sequence composition
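
To make the counted quantities concrete, here is a minimal Python sketch of how precision, recall, and related summary metrics follow from true positive, false positive, false negative, and unassessed counts. The `Counts` class and `summarize` function are hypothetical names used only for illustration. The sketch uses a single true positive count, which is a simplification: a haplotype-aware comparison may count matches separately on the truth and query sides.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    tp: int       # query calls that match the truth set
    fp: int       # assessed query calls with no match in the truth set
    fn: int       # truth variants that the query missed
    unk: int = 0  # query calls that could not be assessed


def summarize(c: Counts) -> dict:
    """Derive summary metrics from benchmarking counts (illustrative only)."""
    nan = float("nan")
    assessed = c.tp + c.fp        # unassessed calls are excluded from precision
    truth = c.tp + c.fn
    precision = c.tp / assessed if assessed else nan
    recall = c.tp / truth if truth else nan
    if precision + recall > 0:    # False when either value is NaN
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = nan
    total_query = assessed + c.unk
    frac_unassessed = c.unk / total_query if total_query else nan
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "frac_unassessed": frac_unassessed,
    }


print(summarize(Counts(tp=90, fp=10, fn=30, unk=25)))
```

In this made-up example, 90 of 100 assessed calls are correct (precision 0.9) and 90 of 120 truth variants are recovered (recall 0.75). The 25 unassessed calls affect neither figure.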

Documentation and links to the outputs from the GA4GH Benchmarking team can be found in their benchmarking tools GitHub repository. Further details of the benchmarking best practices developed by the GA4GH team will be described in an upcoming manuscript.

Hap.py application features

Latest truth sets: Hap.py includes the latest 2017 version of the Illumina Platinum Genomes truth sets (for samples NA12878 and NA12877) as well as the most recent truth sets from Genome in a Bottle (v3.3.2), which cover four alternative reference samples (the Ashkenazim trio and the Chinese son) in addition to NA12878. There is also the option to use a custom truth set if your chosen reference material is not covered by these truth sets.

Precision-recall curves: Variant calling pipelines often apply filtering at a level chosen by the user. To compare pipelines fairly, we must compare the full range of precision and recall values that can be achieved by varying a threshold from its most sensitive (high recall but lower precision) to its most conservative (lower recall with high precision). Precision-recall curves offer a broad view of this tradeoff and are useful, for example, when choosing thresholding levels for further analysis.
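
The threshold sweep described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the app's implementation: each query call is assumed to carry a single numeric quality score and a flag saying whether it matched the truth set, and `pr_curve` is a hypothetical helper name.

```python
def pr_curve(calls, n_truth):
    """Return (threshold, precision, recall) points, most sensitive first.

    calls   -- list of (score, matched_truth) pairs, one per query call
    n_truth -- number of variants in the truth set (must be > 0)
    """
    points = []
    for threshold in sorted({score for score, _ in calls}):
        # Keep only calls that pass the current filtering level.
        kept = [matched for score, matched in calls if score >= threshold]
        tp = sum(kept)
        fp = len(kept) - tp
        # kept is never empty: the threshold is itself one of the scores.
        points.append((threshold, tp / (tp + fp), tp / n_truth))
    return points


# Made-up example: five calls, four variants in the truth set.
calls = [(10, False), (20, True), (30, False), (40, True), (50, True)]
for threshold, precision, recall in pr_curve(calls, n_truth=4):
    print(f"score >= {threshold}: precision={precision:.2f} recall={recall:.2f}")
```

Raising the threshold can only remove calls, so recall never increases along the curve. Precision usually rises, but it is not guaranteed to do so at every step.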

Figure 1: Precision-recall plots produced by the Hap.py Benchmarking app. Insertions and deletions (indels) and single nucleotide variants (SNVs) are shown side-by-side. The curve shows how precision and recall vary as the level of filtering is increased from ALL (the unfiltered input VCF, bottom-right point on each curve) up to PASS (the query VCF with all filters applied, second point) and further to more conservative levels of filtering.

Stratification regions: By selecting a set of regions by which to stratify benchmarking metrics, you can pinpoint where pipelines differ. For example, the effect of performing polymerase chain reaction (PCR) prior to sequencing, relative to a PCR-free protocol, could be investigated by comparing precision and recall against the truth set in repetitive regions, where PCR cycles may introduce additional false positive indel calls in long homopolymers.
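
As a rough illustration of stratification, the Python sketch below assigns already-classified calls to named regions and reports precision and recall per region. The region name, coordinates, and calls are invented for the example, and the function names are hypothetical. Regions follow the BED convention of 0-based, half-open intervals. Each call is assigned to the first matching region only, which is a simplification because real stratification sets can overlap.

```python
from collections import defaultdict


def stratum_of(chrom, pos, regions):
    """Return the name of the first region containing the position."""
    for name, r_chrom, start, end in regions:
        if chrom == r_chrom and start <= pos < end:  # BED-style half-open
            return name
    return "other"


def stratified_metrics(calls, regions):
    """calls: (chrom, pos, status) tuples with status 'TP', 'FP' or 'FN'."""
    counts = defaultdict(lambda: {"TP": 0, "FP": 0, "FN": 0})
    for chrom, pos, status in calls:
        counts[stratum_of(chrom, pos, regions)][status] += 1

    metrics = {}
    for name, c in counts.items():
        called = c["TP"] + c["FP"]
        truth = c["TP"] + c["FN"]
        metrics[name] = {
            "precision": c["TP"] / called if called else None,
            "recall": c["TP"] / truth if truth else None,
        }
    return metrics


# Invented data: one homopolymer-rich region on chr1.
regions = [("homopolymer", "chr1", 100, 200)]
calls = [
    ("chr1", 150, "TP"), ("chr1", 160, "FP"), ("chr1", 170, "FP"),
    ("chr1", 500, "TP"), ("chr1", 510, "TP"), ("chr1", 520, "FN"),
]
print(stratified_metrics(calls, regions))
```

Here the extra false positives all fall inside the homopolymer region, so precision drops there while the rest of the genome is unaffected. A single genome-wide number would hide this difference.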

Relation to VCAT

BaseSpace Sequence Hub already offers the Variant Calling Assessment Tool (VCAT) which includes Hap.py. These applications are complementary and choosing one depends on your purpose. See Table 1 to determine support provided for each particular case.

Table 1: A comparison of the Hap.py and VCAT apps

| I want to: | Hap.py | VCAT |
|---|---|---|
| Run a single variant call file (VCF) against the latest Platinum Genomes truth set using GA4GH benchmarking best practices | Supported | Not supported |
| Run 2-10 VCFs at once and compare them against each other too | Not supported | Supported |
| Run 100s to 1000s of samples using the BaseSpace Sequence Hub Command Line Interface | Supported | Not supported |
| Choose an optimal threshold for filtering my VCF by seeing how precision and recall vary over a range of cutoffs | Supported | Not supported |
| Restrict the analysis to targeted regions on my data that have been sequenced with an Illumina targeted panel | Supported, but requires you to upload a browser extensible data (BED) file | Supported |

Why is it called Hap.py?

Hap.py Benchmarking wraps the open-source hap.py command-line tool. Rather than simply comparing VCF records position by position, hap.py compares local haplotypes to identify matching truth and query records even when their representation differs. The tool is partly written in Python, whose source files use the .py extension, so the haplotype comparison tool became hap.py.
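
The idea of comparing haplotypes rather than records can be shown with a small Python sketch. It is not how hap.py itself is implemented: it handles a single haplotype, assumes non-overlapping variants, and ignores genotypes and phasing. The function names and the toy sequences are made up for illustration.

```python
def apply_variants(ref, variants):
    """Build the alternate sequence for one haplotype.

    variants -- (pos, ref_allele, alt_allele) tuples, 0-based, non-overlapping
    """
    pieces = []
    cursor = 0
    for pos, ref_allele, alt_allele in sorted(variants):
        if pos < cursor:
            raise ValueError("overlapping variants are not supported here")
        if ref[pos:pos + len(ref_allele)] != ref_allele:
            raise ValueError(f"reference allele mismatch at position {pos}")
        pieces.append(ref[cursor:pos])   # unchanged reference up to the variant
        pieces.append(alt_allele)        # substitute the alternate allele
        cursor = pos + len(ref_allele)
    pieces.append(ref[cursor:])
    return "".join(pieces)


def same_haplotype(ref, truth, query):
    """True if both variant sets produce the same local sequence."""
    return apply_variants(ref, truth) == apply_variants(ref, query)


ref = "TTCACACAGG"
left_aligned = [(1, "TCA", "T")]    # deletes one CA unit at the repeat start
right_aligned = [(5, "ACA", "A")]   # deletes one CA unit at the repeat end
print(left_aligned == right_aligned)                     # records differ
print(same_haplotype(ref, left_aligned, right_aligned))  # sequences agree

# A two-base substitution written as one record or as two SNVs.
print(same_haplotype("ACGT", [(1, "CG", "TA")], [(1, "C", "T"), (2, "G", "A")]))
```

A position-by-position comparison would score the two deletion records as a false negative plus a false positive, even though both describe the same sequence, TTCACAGG.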


  1. Eberle MA et al. (2017) A reference data set of 5.4 million phased human variants validated by genetic inheritance from sequencing a three-generation 17-member pedigree. Genome Research, 27: 157-164. http://dx.doi.org/10.1101/gr.210500.116
  2. Zook JM et al. (2014) Integrating human sequence data sets provides a resource of benchmark SNP and indel genotype calls. Nature Biotechnology, 32: 246-251. http://dx.doi.org/10.1038/nbt.2835


