A brief introduction to xHMMER3x2

Background & Motivation

Over the past decade, the OMICs frenzy from arrays to sequencing has swarmed genomic research with voluminous amount of data and elucidated lists of candidate genes/proteins. Yet, many of these genes/proteins remained not well-understood in relation to the observed phenotype. The major gap stems from our lack in complete gene/protein functions which impedes researchers from assembling the sets of biomolecular mechanisms that can sufficiently explain the observed phenotype.

Despite so, experimental characterization of gene/protein function still receives insufficient attention nowadays. This can be attested by the dwindling number of characterized genes/proteins reported over the past decade. Sadly, one can be expected this number to continue to grow at a slow rate.

On this basis, the only viable approach to characterize genes/proteins is to computationally transfer function annotation of the well-studied gene/protein sequences to the less-studied or novel ones for functional hints. Besides this, in-silico function annotations must always be kept consistent regardless of the revisions made to function annotation databases or associated software tools used during the computations.

A widely popular method of functional annotations of proteins is through the domain finding using HMMER software, with HMMER3 being the latest release. While the local-mode HMMER3 is notable for its massive speed improvement, the slower glocal-mode HMMER2 is more exact for domain annotation by enforcing full sequence-to-domain alignments. Since a unit of domain necessarily implies a unit of function, local-mode HMMER3 alone remains insufficient for precise function annotation tasks. In addition, the incomparable E-values ranges of the same domain model by different HMMER builds (ie HMMER2 & HMMER3) creates difficulty when checking for domain annotation consistency on a large-scale basis.

Thus the aim of xHMMER3x2 is to utilize HMMER3's speed and HMMER2's glocal alignment mode for large-scale protein domain annotation. To put it in simple term, we want to have a "fast" glocal-mode HMMER2.


The "fast" glocal-mode HMMER2 is achived through a three-stages xHMMER3x2 workflow as shown in Figure 1 below.

Figure 1. The xHMMER3x2 workflow (click on figure itself for larger version)

Stage 1: Domain scanning with the bulk of domains scanned by HMMER3
The first stage of xHMMER3x2 is to scan for most probably domains using both HMMER2 & HMMER3. Pfam domains will first be partitioned into 2 portions: one portion of it will be HMMER3 built models while another portion of it will be HMMER2 built models. Using the recommended settings based on our findings, about 90% of domains (pfam 29) will be assigned for HMMER3 built models, and to be searched by HMMER3, while the rest 10% of the domains will be assigned for HMMER2 built models, to be searched by HMMER2. As for how the domains are assigned, they are through the AUC comparison of calibrated HMMER2/HMMER3 ROCs (Figure 2).

Since the total number of utilized HMMER3 models far exceeds that of HMMER2 (90% versus 10%), the speed of xHMMER3x2 is expected to be closer to the faster HMMER3 than the slower HMMER2.
Detection sensitivity is maintained since a portion of domains (found to have low sensitivity with HMMER3) is searched by HMMER2 instead.
Stage 2: Full domain re-alignment step with HMMER2
In second stage, domains scanned by HMMER3 is of local alignmnet only. These domains will thus be searched against the same domain model but in glocal-mode HMMER2 to perform the full domain re-alignment step.

Glocal alignment is prefered in function annotation tasks.
Stage 3: E-values to FPR quantification
The third stage of xHMMER3x2 estimates the FPR(FP) values of the domain hits from their E-values using the previously calibrated HMMER2/HMMER3 ROCs (Figure 2).

This is used to overcome the problem of incomparable E-values ranges of the same domain model by different HMMER builds (HMMER2,HMMER3).

The FPR intervals are bounded by these FPR(FP) points:

Figure 2. Calibration procedure between HMMER2 and HMMER3 domain model builds (click on figure itself for larger version)

Stage 1
The "hmmsearch" results for the two HMMER variants are first produced on a positive set of α seed sequences of the model with the rest of the 16294 models (pfam 29) as consensus sequences.
Stage 2
A total of β glocal HMMER2 and γ local HMMER3 sequence-to-domain alignments are paired HMMER2/HMMER3 hits or orphaned accordingly. For the paired hits, the HMMER2 hit is reconstructed to mimic the local alignment of its HMMER3 counterpart to derived a new HMMER2 E-value.
Stage 3
The orphaned and reconstructed E-values of the HMMER2 are partitioned into their positive and negative sets (see red histograms). The same is done for the HMMER3 (see blue histograms). For a particular E-value cutoff, the respective false-positive rates (FPR) and true-positive (TPR) rates can be quantified from the two sets of histograms.

Calibration of HMMER2 & HMMER3 is important in xHMMER3x2 for the following purposes:
  1. To identified which domains to be of HMMER2/HMMER3 built models (stage 1 of xHMMER3x2). This is done through the AUC comparison of HMMER2 and HMMER3 ROC. If AUCH3>AUCH2, this particular domain will have a better search result with HMMER3, and vice versa.

  2. To serve as common metric between HMMER2 and HMMER3 output, eliminating incomparable E-values between the HMMER builds (stage 3 of xHMMER3x2).


In this work, the speed of HMMER3 and glocal-mode alignment of HMMER2 are combined to create the xHMMER3x2 framework (Figure 1) to tackle the large-scale domain annotation task. Briefly, HMMER3 is utilized for initial domain detection so that HMMER2 can subsequently perform the glocal-mode sequence-to-domain alignments for the detected HMMER3 hits. A calibration procedure is required to ensure that the search space by HMMER2 is sufficiently replicated by HMMER3. Generally, the latter is true for 80% of the Pfam domain library (release 29). However, about 20% of HMMER3 domain models has to be replaced by HMMER2 counterparts if the search space of HMMER2 is to be closely replicated.

When tested on a set of UniProt/SwissProt human sequences, xHMMER3x2 can be configured to be between 7 to 201 times faster than HMMER2, but with descending domain detection sensitivity from 99.8% to 95.7% with respect to HMMER2. At extremes, xHMMER3x2 is either the slow glocal-mode HMMER2 or fast HMMER3 with glocal-mode. At recommended setting, xHMMER3x2 can achieve 99.1% sensitivity with 18x speed up.

Finally, the E-values to false-positive rates quantification by xHMMER3x2 allows E-values of different model builds to be compared, so that any annotation discrepancies in a large-scale annotation exercise can be flagged for manual examination.

Overall, the xHMMER3x2 workflow allows large-scale domain annotation speed to be improved over HMMER2 without compromising for domain-detection sensitivity and sequence-to-domain alignment incompleteness.