Abstrakt: | Hidden Markov models (HMMs) are an important tool for modeling biological
sequences and their annotations. By sequence annotation we mean an assignment
of labels to each symbol according to its function. For example, in gene finding
we want to distinguish the regions of DNA that encode proteins from surrounding
non-coding sequence. A hidden Markov model defines a probability distribution
over all annotations of a given sequence $X$.
HMMs are traditionally decoded by the Viterbi algorithm. Viterbi algorithm
finds the most probable annotation for a subset of HMMs. In general, the
sequence annotation is NP-hard and the Viterbi algorithm is used as a heuristic
algorithm.
Recently it has been shown that other decoding methods give better
results than Viterbi in specific applications. We propose a new decoding method
that allows uncertainty in region boundaries by considering all annotations
with slightly shifted boundaries as the same. Our method (the highest expected
reward decoding -- HERD) is an extension of the framework of maximum expected boundary
accuracy decoding introduced by Gross \emph{et al.}.
We evaluate HERD on the problem of detecting viral recombination in the HIV genome
and compare it with an existing tool called jumping HMM which uses the Viterbi
algorithm. HERD has better prediction accuracy in terms of the feature
sensitivity and specificity, where feature is a single-colored block in an
annotation.
|
---|