- S. Hochreiter, M. Heusel, and K. Obermayer. Fast Model-based Protein
Homology Detection without Alignment. Bioinformatics, 23(14):1728-1736, 2007.
Motivation: As more genomes are sequenced, the demand for fast gene
classification techniques is increasing. To analyze a newly sequenced genome,
first the genes are identified and translated into amino acid sequences which
are then classified into structural or functional classes. The best
performing protein classification methods are based on protein homology
detection using sequence alignment methods. Alignment methods have recently
been enhanced by discriminative methods like support vector machines as well
as by position specific scoring matrices (PSSM) as obtained from PSI-BLAST.
However, alignment methods are time-consuming if a new sequence must be
compared to many known sequences, and the same holds for support vector
machines. Constructing a PSSM for the new sequence is even more
time-consuming. The best-performing methods would take about 25 days on
present-day computers to classify the sequences of a new genome (20,000
genes) as belonging to just one specific class, yet there are hundreds of
classes.
Another shortcoming of alignment algorithms is that they do not build a model
of the positive class but measure the mutual distance between sequences or
profiles. Multiple alignment and hidden Markov models are the only popular
classification methods that build a model of the positive class, but they
show low classification performance. The advantage of a model is that it can
be analyzed for chemical properties common to the class members to obtain new
insights into protein function and structure. We propose a fast model-based
recurrent neural network for protein homology detection, the "Long
Short-Term Memory" (LSTM). LSTM automatically extracts indicative patterns
for the positive class, but in contrast to profile methods it also extracts
negative patterns and uses correlations between all detected patterns for
classification. LSTM can automatically extract useful local and global
sequence statistics such as hydrophobicity, polarity, volume, and
polarizability and combine them with a pattern. These properties make LSTM
complementary to alignment-based approaches, as it does not use predefined
similarity measures like BLOSUM or PAM matrices.
Results: We have applied
LSTM to a well-known benchmark for remote protein homology detection, where a
protein must be classified as belonging to a SCOP superfamily. LSTM reaches
state-of-the-art classification performance but is considerably faster for
classification than other approaches with comparable classification
performance. LSTM is 5 orders of magnitude faster than methods which perform
slightly better in classification and 2 orders of magnitude faster than the
fastest SVM-based approaches (which, however, have lower classification
performance than LSTM). Only PSI-BLAST and HMM-based methods show time
complexity comparable to LSTM, but they cannot compete with LSTM in
classification performance. To test the modeling capabilities of LSTM, we
applied it to
PROSITE classes and interpreted the extracted patterns. In 8 out of 15
classes, LSTM automatically extracted the PROSITE motif. In the remaining 7
cases, alternative motifs were generated which give better classification
results on average than the PROSITE motifs.