IEEE/ACM Trans Comput Biol Bioinform 2019 Apr 1. Epub 2019 Apr 1.
Engineering stable proteins is crucial for many industrial applications. Several machine learning methods have been developed to predict changes in protein stability upon single point mutations. To improve prediction accuracy, we propose a new unsupervised descriptor for protein sequences that is based on a sequence-to-sequence (seq2seq) neural network model combined with a sequence-compression method called byte-pair encoding (BPE). Our results show that BPE can encode a protein sequence into a shorter sequence of tokens, thereby enabling efficient training of the seq2seq model. Furthermore, we implement a basic predictor using the proposed descriptor, and our experimental results demonstrate that the predictor achieves state-of-the-art accuracy in tests on proteins that are not included in the training data.
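To illustrate how BPE can shorten a protein sequence, the following is a minimal sketch of generic byte-pair encoding applied to amino-acid strings. It is not the implementation used in this work: the function names (`learn_bpe`, `merge_pair`, `encode`), the toy sequences, and the number of merges are all assumptions chosen for illustration, and the abstract does not state the actual vocabulary size or training corpus.

```python
from collections import Counter


def merge_pair(tokens, pair):
    """Replace each non-overlapping occurrence of `pair` with one merged token."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def learn_bpe(sequences, num_merges):
    """Learn up to `num_merges` merge rules from amino-acid strings (toy sketch)."""
    corpus = [list(seq) for seq in sequences]  # start from single residues
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for tokens in corpus:
            pair_counts.update(zip(tokens, tokens[1:]))
        if not pair_counts:
            break  # every sequence has collapsed to a single token
        # Pick the most frequent adjacent pair; ties are broken lexicographically.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        corpus = [merge_pair(tokens, best) for tokens in corpus]
    return merges


def encode(sequence, merges):
    """Apply the learned merge rules, in order, to one sequence."""
    tokens = list(sequence)
    for pair in merges:
        tokens = merge_pair(tokens, pair)
    return tokens


# Hypothetical toy sequences, not real training data.
toy_sequences = ["MKVLAAGIVGMKVLAAG", "GIVGMKVLSAAG"]
toy_merges = learn_bpe(toy_sequences, num_merges=5)
toy_tokens = encode(toy_sequences[0], toy_merges)
print(len(toy_sequences[0]), "residues ->", len(toy_tokens), "tokens")
```

Because each merge replaces two adjacent tokens with one, the encoded sequence is never longer than the original, and concatenating the tokens recovers the original residues. A shorter input is what would make a seq2seq model cheaper to train in this setting.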