Publications by authors named "Jingbo Xia"

19 Publications


OryzaGP 2021 update: a rice gene and protein dataset for named-entity recognition.

Genomics Inform 2021 Sep 30;19(3):e27. Epub 2021 Sep 30.

Hubei Provincial Key Laboratory of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan 430070, Hubei Province, China.

Due to the rapid evolution of high-throughput technologies, a tremendous amount of data is being produced in the biological domain, which poses a challenging task for information extraction and natural language understanding. Biological named entity recognition (NER) and named entity normalisation (NEN) are two common tasks aiming at identifying and linking biologically important entities such as genes or gene products mentioned in the literature to biological databases. In this paper, we present an updated version of OryzaGP, a gene and protein dataset for rice species created to help natural language processing (NLP) tools in processing NER and NEN tasks. To create the dataset, we selected more than 15,000 abstracts associated with articles previously curated for rice genes. We developed four dictionaries of gene and protein names associated with database identifiers. We used these dictionaries to annotate the dataset. We also annotated the dataset using pre-trained NLP models. Finally, we analysed the annotation results and discussed how to improve OryzaGP.
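The dictionary-based annotation step described above can be illustrated with a minimal sketch. It is not the OryzaGP pipeline itself: the function name `annotate`, the toy dictionary, and the placeholder identifiers are all hypothetical, and matching is a plain case-insensitive lookup.

```python
import re

def annotate(text, dictionary):
    """Return (start, end, surface form, identifier) for each dictionary name found in text."""
    # Try longer names first so that a long name wins over a shorter overlapping one.
    names = sorted(dictionary, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(name) for name in names) + r")\b",
        re.IGNORECASE,
    )
    # Case-insensitive lookup table from name to database identifier.
    lowered = {name.lower(): ident for name, ident in dictionary.items()}
    return [
        (m.start(), m.end(), m.group(0), lowered[m.group(0).lower()])
        for m in pattern.finditer(text)
    ]

# Hypothetical toy dictionary; the identifiers are placeholders, not real database IDs.
toy_dictionary = {"Xa21": "GENE:0001", "OsWRKY45": "GENE:0002"}
sentence = "Overexpression of OsWRKY45 and XA21 enhanced resistance."
spans = annotate(sentence, toy_dictionary)
```

Each returned span carries character offsets, so the annotations can be stored as standoff annotations alongside the abstract text.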

Source
http://dx.doi.org/10.5808/gi.21015
September 2021

LitCovid-AGAC: cellular and molecular level annotation data set based on COVID-19.

Genomics Inform 2021 Sep 30;19(3):e23. Epub 2021 Sep 30.

Hubei Key Lab of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, 430070 Wuhan, China.

The coronavirus disease 2019 (COVID-19) literature has been growing dramatically, and the increased amount of text makes it possible to perform large-scale text mining and knowledge discovery. Curation of these texts has therefore become a crucial issue for the biomedical natural language processing (BioNLP) community, so as to retrieve important information about the mechanism of COVID-19. PubAnnotation is an aligned annotation system which provides an efficient platform for biological curators to upload their annotations or merge other external annotations. Inspired by the integration among multiple useful COVID-19 annotations, we merged three annotation resources into the LitCovid data set and constructed a cross-annotated corpus, LitCovid-AGAC. This corpus consists of 12 labels (Mutation, Species, Gene, and Disease from PubTator; GO and CHEBI from OGER; Var, MPA, CPA, NegReg, PosReg, and Reg from AGAC) over 50,018 COVID-19 abstracts in LitCovid. The corpus contains abundant information that may help unveil hidden knowledge about the pathological mechanism of COVID-19.

Source
http://dx.doi.org/10.5808/gi.21013
September 2021

A Novel Metric to Quantify the Effect of Pathway Enrichment Evaluation With Respect to Biomedical Text-Mined Terms: Development and Feasibility Study.

JMIR Med Inform 2021 Jun 18;9(6):e28247. Epub 2021 Jun 18.

Hubei Key Lab of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan, China.

Background: Natural language processing has long been applied in various applications for biomedical knowledge inference and discovery. Enrichment analysis based on named entity recognition is a classic application for inferring enriched associations in terms of specific biomedical entities such as gene, chemical, and mutation.

Objective: The aim of this study was to investigate the effect of pathway enrichment evaluation with respect to biomedical text-mining results and to develop a novel metric to quantify the effect.

Methods: Four biomedical text mining methods were selected to represent natural language processing methods on drug-related gene mining. Subsequently, a pathway enrichment experiment was performed by using the mined genes, and a series of inverse pathway frequency (IPF) metrics was proposed accordingly to evaluate the effect of pathway enrichment. Thereafter, 7 IPF metrics and traditional P value metrics were compared in simulation experiments to test the robustness of the proposed metrics.

Results: IPF metrics were evaluated in a case study of rapamycin-related gene set. By applying the best IPF metrics in a pathway enrichment simulation test, a novel discovery of drug efficacy of rapamycin for breast cancer was replicated from the data chosen prior to the year 2000. Our findings show the effectiveness of the best IPF metric in support of knowledge discovery in new drug use. Further, the mechanism underlying the drug-disease association was visualized by Cytoscape.

Conclusions: The results of this study suggest the effectiveness of the proposed IPF metrics in pathway enrichment evaluation as well as its application in drug use discovery.

Source
http://dx.doi.org/10.2196/28247
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC8277388
June 2021

Bridging heterogeneous mutation data to enhance disease gene discovery.

Brief Bioinform 2021 09;22(5)

Hubei Key Lab of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan, Hubei Province, P.R. China.

Bridging heterogeneous mutation data fills the gap between various data categories and propels the discovery of disease-related genes. Genome-wide association studies (GWAS) are known to infer significant mutation associations that link genotype and phenotype. However, due to differences in size and quality between GWAS studies, not all de facto vital variations are able to pass multiple testing. Meanwhile, mutation events widely reported in the literature unveil typical functional biological processes, including mutation types such as gain of function and loss of function. To bring together the heterogeneous mutation data, we propose a 'Gene-Disease Association prediction by Mutation Data Bridging (GDAMDB)' pipeline with a statistical generative model. The model learns the distribution parameters of mutation associations and mutation types and recovers false-negative GWAS mutations that fail to pass the significance test but are supported by evidence of functional biological processes in the literature. We applied GDAMDB to Alzheimer's disease (AD) and predicted 79 AD-associated genes. Of these, 12 come from the original GWAS, 60 are supported as AD-related by other GWAS or literature reports, and the rest are newly predicted genes. Our model is capable of enhancing GWAS-based gene association discovery by combining it with text mining results. The positive result indicates that bridging heterogeneous mutation data contributes to the discovery of novel disease-related genes.

Source
http://dx.doi.org/10.1093/bib/bbab079
September 2021

Hybrid phenotype mining method for investigating off-target protein and underlying side effects of anti-tumor immunotherapy.

BMC Med Inform Decis Mak 2020 07 9;20(Suppl 3):133. Epub 2020 Jul 9.

Hubei Key Lab of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan, 430070, China.

Background: It is of utmost importance to investigate novel therapies for cancer, as it is a major cause of death. In recent years, immunotherapies, especially those against immune checkpoints, have been developed and have brought significant improvement in cancer management. However, immune checkpoint blockade (ICB) by monoclonal antibodies may cause common and severe adverse reactions (ADRs), the cause of which remains largely undetermined. We hypothesize that ICB agents may induce adverse reactions through off-target protein interactions, similar to the ADR-causing off-target effects of small molecules. In this study, we propose a hybrid phenotype mining approach which integrates molecular level information and provides new mechanistic insights for ICB-associated ADRs.

Methods: We trained a conditional random fields model on the TAC 2017 benchmark training data, then used it to extract all drug-centric phenotypes for the five anti-PD-1/PD-L1 drugs from the drug labels of the DailyMed database. Proteins with structure similar to the drugs were obtained by using BlastP, and the gene targets of drugs were obtained from the STRING database. The target-centric phenotypes were extracted from the human phenotype ontology database. Finally, a screening module was designed to investigate off-target proteins, by making use of gene ontology analysis and pathway analysis.

Results: Eventually, through the cross-analysis of the drug and target gene phenotypes, the off-target effect caused by the mutation of gene BTK was found, and the candidate side-effect off-target site was analyzed.

Conclusions: This research provided a hybrid method of biomedical natural language processing and bioinformatics to investigate the off-target-based mechanism of ICB treatment. The method can also be applied for the investigation of ADRs related to other large molecule drugs.

Source
http://dx.doi.org/10.1186/s12911-020-1105-4
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7346346
July 2020

A multimodal deep learning framework for predicting drug-drug interaction events.

Bioinformatics 2020 08;36(15):4316-4322

College of Informatics, Huazhong Agricultural University, Wuhan 430070, China.

Motivation: Drug-drug interactions (DDIs) are one of the major concerns in pharmaceutical research. Many machine learning based methods have been proposed for the DDI prediction, but most of them predict whether two drugs interact or not. The studies revealed that DDIs could cause different subsequent events, and predicting DDI-associated events is more useful for investigating the mechanism hidden behind the combined drug usage or adverse reactions.

Results: In this article, we collect DDIs from DrugBank database, and extract 65 categories of DDI events by dependency analysis and events trimming. We propose a multimodal deep learning framework named DDIMDL that combines diverse drug features with deep learning to build a model for predicting DDI-associated events. DDIMDL first constructs deep neural network (DNN)-based sub-models, respectively, using four types of drug features: chemical substructures, targets, enzymes and pathways, and then adopts a joint DNN framework to combine the sub-models to learn cross-modality representations of drug-drug pairs and predict DDI events. In computational experiments, DDIMDL produces high-accuracy performances and has high efficiency. Moreover, DDIMDL outperforms state-of-the-art DDI event prediction methods and baseline methods. Among all the features of drugs, the chemical substructures seem to be the most informative. With the combination of substructures, targets and enzymes, DDIMDL achieves an accuracy of 0.8852 and an area under the precision-recall curve of 0.9208.

Availability And Implementation: The source code and data are available at https://github.com/YifanDengWHU/DDIMDL.

Supplementary Information: Supplementary data are available at Bioinformatics online.

Source
http://dx.doi.org/10.1093/bioinformatics/btaa501
August 2020

HPO-Shuffle: an associated gene prioritization strategy and its application in drug repurposing for the treatment of canine epilepsy.

Biosci Rep 2019 09 6;39(9). Epub 2019 Sep 6.

Hubei Key Lab of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Hubei Province, P.R. China

Epilepsy is a common neurological disorder that affects mammalian species including human beings and dogs. In order to discover novel drugs for the treatment of canine epilepsy, multiomics data were analyzed to identify epilepsy-associated genes. In this research, the original ranking of associated genes was obtained from two high-throughput bioinformatics experiments: a genome-wide association study (GWAS) and microarray analysis. The association ranking of genes was enhanced by a re-ranking system, HPO-Shuffle, which integrated information from GWAS, microarray analysis, and the Human Phenotype Ontology database for further downstream analysis. After applying HPO-Shuffle, the association ranking of epilepsy genes was improved. Afterward, a weighted gene coexpression network analysis (WGCNA) led to a set of gene modules, which hinted at a clear relevance between the highly ranked genes and the target disease. Finally, WGCNA and connectivity map (CMap) analysis were performed on the integrated dataset to discover a list of potential drugs, which included a well-known anticonvulsant.

Source
http://dx.doi.org/10.1042/BSR20191247
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6732366
September 2019

A review of drug knowledge discovery using BioNLP and tensor or matrix decomposition.

Genomics Inform 2019 Jun 27;17(2):e18. Epub 2019 Jun 27.

Hubei Key Laboratory of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan 430070, China.

Prediction of the relations among drugs and other molecular or social entities is the main pattern of drug-related knowledge discovery. Computational approaches have combined information from different sources and levels for drug-related knowledge discovery, which provides a sophisticated comprehension of the relationships among drugs, targets, diseases, and targeted genes at the molecular level, or the relationships among drugs, usage, side effects, safety, and user preference at a social level. In this research, previous work from the BioNLP community and on tensor or matrix decomposition was reviewed, compared, and summarized, and the BioNLP open shared task was introduced as a promising case study representing this area.

Source
http://dx.doi.org/10.5808/GI.2019.17.2.e18
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6808632
June 2019

Three Dimensions of Reproducibility in Natural Language Processing.

LREC Int Conf Lang Resour Eval 2018 May;2018:156-165

Computational Bioscience Program, University of Colorado School of Medicine.

Despite considerable recent attention to problems with reproducibility of scientific research, there is a striking lack of agreement about the definition of the term. That is a problem, because the lack of a consensus definition makes it difficult to compare studies of reproducibility, and thus to have even a broad overview of the state of the issue in natural language processing. This paper proposes an ontology of reproducibility in that field. Its goal is to enhance both future research and communication about the topic, and retrospective meta-analyses. We show that three dimensions of reproducibility, corresponding to three kinds of claims in natural language processing papers, can account for a variety of types of research reports. These dimensions are reproducibility of a conclusion, of a finding, and of a value. Three biomedical natural language processing papers by the authors of this paper are analyzed with respect to these dimensions.

Source
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5998676
May 2018

Exploring the pathogenesis of canine epilepsy using a systems genetics method and implications for anti-epilepsy drug discovery.

Oncotarget 2018 Mar 27;9(17):13181-13192. Epub 2017 Dec 27.

Hubei Key Laboratory of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Hubei, Wuhan, China.

Epilepsy is a common neurological disorder in domestic dogs. However, its complex mechanism involves multiple genetic and environmental factors that make it challenging to identify the real pathogenic factors contributing to epilepsy, particularly for idiopathic epilepsy. Conventional genome-wide association studies (GWASs) can detect various genes associated with epilepsy, although they primarily detect the effects of single-site mutations in epilepsy while ignoring their interactions. In this study, we used a systems genetics method combining both GWAS and gene interactions and obtained 26 significantly mutated subnetworks. Among these subnetworks, seven genes were reported to be involved in neurological disorders. Combined with gene ontology enrichment analysis, we focused on 4 subnetworks that included traditional GWAS-neglected genes. Moreover, we performed a drug enrichment analysis for each subnetwork and identified significantly enriched candidate anti-epilepsy drugs using a hypergeometric test. We discovered 22 potential drug combinations that induced possible synergistic effects for epilepsy treatment, and one of these drug combinations has been confirmed in the Drug Combination database (DCDB) to have beneficial anti-epileptic effects. The method proposed in this study provides deep insight into the pathogenesis of canine epilepsy and implications for anti-epilepsy drug discovery.
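The hypergeometric enrichment test mentioned above can be sketched as follows. This is a generic upper-tail test, not the authors' code; the function name and parameter names are hypothetical, and the mapping to drug targets in the comments is an assumption about how such a test is typically set up.

```python
from math import comb

def hypergeom_enrichment_p(population, successes, draws, observed):
    """Upper-tail probability P(X >= observed) for sampling without replacement.

    population: total number of genes considered
    successes:  genes in the population annotated to the drug (assumed meaning)
    draws:      genes in the subnetwork being tested
    observed:   subnetwork genes that are annotated to the drug
    """
    upper = min(successes, draws)
    total = comb(population, draws)
    tail = sum(
        comb(successes, k) * comb(population - successes, draws - k)
        for k in range(observed, upper + 1)
    )
    return tail / total

# Toy numbers only: 10 genes, 4 are drug targets, the subnetwork has 3 genes, 2 are targets.
p_value = hypergeom_enrichment_p(population=10, successes=4, draws=3, observed=2)
```

A small p-value suggests that the subnetwork contains more targets of the drug than expected by chance. When many drugs are tested, the p-values would normally be corrected for multiple testing.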

Source
http://dx.doi.org/10.18632/oncotarget.23719
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5862570
March 2018

Universal Feature Extraction for Traffic Identification of the Target Category.

PLoS One 2016 10;11(11):e0165993. Epub 2016 Nov 10.

Institute of Information and Navigation, Air Force Engineering University, Xi'an, Shaanxi, China.

Traffic identification of the target category is currently a significant challenge for network monitoring and management. To identify the target category specifically, this paper presents a feature extraction algorithm based on the subset with the highest proportion. The method can be applied to the identification of any category that is assigned as the target, rather than being restricted to a specific category. We divide the process of feature extraction into two stages. In the stage of primary feature extraction, the feature subset is extracted from the dataset which has the highest proportion of the target category. In the stage of secondary feature extraction, the features that can distinguish the target and interfering categories are added to the feature subset. Our theoretical analysis and experimental observations reveal that the proposed algorithm is able to extract fewer features with greater ability to identify the target category. Moreover, the universality of the proposed algorithm is supported by an experiment in which every category is set as the target in turn.

Source
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0165993
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5104389
June 2017

Reproducibility in Natural Language Processing: A Case Study of Two R Libraries for Mining PubMed/MEDLINE.

LREC Int Conf Lang Resour Eval 2016 May;2016(W23):6-12

Biomedical Text Mining Group Computational Bioscience Program, University of Colorado School of Medicine.

There is currently a crisis in science related to highly publicized failures to reproduce large numbers of published studies. The current work proposes, by way of case studies, a methodology for moving the study of reproducibility in computational work to a full stage beyond that of earlier work. Specifically, it presents a case study in attempting to reproduce the reports of two R libraries for doing text mining of the PubMed/MEDLINE repository of scientific publications. The main findings are that a rational paradigm for reproduction of natural language processing papers can be established; the advertised functionality was difficult, but not impossible, to reproduce; and reproducibility studies can produce additional insights into the functioning of the published system. Additionally, the work on reproducibility led to the production of novel user-centered documentation that has been accessed 260 times since its publication, an average of once a day per library.

Source
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5860830
May 2016

Systems Genetic Validation of the SNP-Metabolite Association in Rice Via Metabolite-Pathway-Based Phenome-Wide Association Scans.

Front Plant Sci 2015 27;6:1027. Epub 2015 Nov 27.

Hubei Key Laboratory of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan, China.

In the post-GWAS (Genome-Wide Association Scan) era, the interpretation of GWAS results is crucial to screen for highly relevant phenotype-genotype association pairs. Based on the single genotype-phenotype association test and a pathway enrichment analysis, we propose a Metabolite-pathway-based Phenome-Wide Association Scan (M-PheWAS) to analyze the key metabolite-SNP pairs in rice and determine the regulatory relationship by assessing similarities in the changes of enzymes and downstream products in a pathway. Two SNPs, sf0315305925 and sf0315308337, were selected using this approach, and their molecular function and regulatory relationship with Enzyme EC:5.5.1.6 and with flavonoids, a significant downstream regulatory metabolite product, were demonstrated. Moreover, a total of 105 crucial SNPs were screened using M-PheWAS, which may be important for metabolite associations.

Source
http://dx.doi.org/10.3389/fpls.2015.01027
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4661230
December 2015

A novel feature selection strategy for enhanced biomedical event extraction using the Turku system.

Biomed Res Int 2014 6;2014:205239. Epub 2014 Apr 6.

Department of Chinese, Translation and Linguistics, City University of Hong Kong, Kowloon, Hong Kong ; The Halliday Centre for Intelligent Applications of Language Studies, City University of Hong Kong, Kowloon, Hong Kong.

Feature selection is of paramount importance for text-mining classifiers with high-dimensional features. The Turku Event Extraction System (TEES) is the best performing tool in the GENIA BioNLP 2009/2011 shared tasks, which relies heavily on high-dimensional features. This paper describes research which, based on an implementation of an accumulated effect evaluation (AEE) algorithm applying the greedy search strategy, analyses the contribution of every single feature class in TEES with a view to identify important features and modify the feature set accordingly. With an updated feature set, a new system is acquired with enhanced performance which achieves an increased F-score of 53.27% up from 51.21% for Task 1 under strict evaluation criteria and 57.24% according to the approximate span and recursive criterion.

Source
http://dx.doi.org/10.1155/2014/205239
January 2015

Gene prioritization of resistant rice gene against Xanthomas oryzae pv. oryzae by using text mining technologies.

Biomed Res Int 2013 25;2013:853043. Epub 2013 Nov 25.

Department of Chinese, Translation and Linguistics, City University of Hong Kong, Kowloon, Hong Kong ; The Halliday Centre for Intelligent Applications of Language Studies, City University of Hong Kong, Kowloon, Hong Kong.

To effectively assess the possibility of the unknown rice protein resistant to Xanthomonas oryzae pv. oryzae, a hybrid strategy is proposed to enhance gene prioritization by combining text mining technologies with a sequence-based approach. The text mining technique of term frequency inverse document frequency is used to measure the importance of distinguished terms which reflect biomedical activity in rice before candidate genes are screened and vital terms are produced. Afterwards, a built-in classifier under the chaos games representation algorithm is used to sieve the best possible candidate gene. Our experiment results show that the combination of these two methods achieves enhanced gene prioritization.
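The term frequency inverse document frequency (TF-IDF) weighting named above can be sketched in a few lines. This is one common textbook variant (relative term frequency times the natural log of the inverse document frequency), which may differ from the exact weighting used in the paper; the function name and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Compute TF-IDF scores for a list of tokenized documents.

    Returns one dict per document, mapping each term to
    (count / document length) * ln(number of documents / document frequency).
    """
    n_docs = len(documents)
    # Number of documents in which each term appears at least once.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    scores = []
    for doc in documents:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores

# Toy tokenized abstracts, for illustration only.
toy_documents = [["rice", "blight", "resistance"], ["rice", "yield"]]
toy_scores = tf_idf(toy_documents)
```

A term that occurs in every document, such as "rice" here, receives a score of zero, while terms specific to one document are weighted up. This is why the measure is useful for picking out distinguishing terms.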

Source
http://dx.doi.org/10.1155/2013/853043
August 2014

Using the concept of Chou's pseudo amino acid composition to predict protein solubility: an approach with entropies in information theory.

J Theor Biol 2013 Sep 21;332:211-7. Epub 2013 Mar 21.

College of Science, Huazhong Agricultural University, Wuhan, PR China.

Protein solubility plays a major role and has strong implications in proteomics. Predicting the propensity of a protein to be soluble or to form inclusion bodies is a fundamental and not yet fully resolved problem. In order to predict protein solubility, almost 10,000 protein sequences were downloaded from NCBI. Sequences with high homologous similarity were then removed with CD-HIT, leaving 5692 sequences. Based on the protein sequences, amino acid and dipeptide compositions were extracted to predict protein solubility. In this study, the entropy in information theory was introduced as another predictive factor in the model. Experiments involving nine different feature vector combinations, including the above-mentioned three kinds of factors, were conducted with support vector machines (SVMs) as the prediction engine. Each combination was evaluated by a re-substitution test and a 10-fold cross-validation test. According to the evaluation results, the accuracies and Matthews correlation coefficient (MCC) values were boosted by the introduction of the entropy. The best combination was the one with amino acid compositions, dipeptide compositions, and their entropies. Its accuracy reached 90.34% and its MCC value was 0.7494 in the re-substitution test, while the corresponding values were 88.12% and 0.7945 for 10-fold cross-validation. In conclusion, the introduction of the entropy significantly improved the performance of the predictive method.
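As an illustration of the feature types named above, the following sketch computes an amino acid composition vector and its Shannon entropy. The abstract does not give the exact entropy formulation, so the use of Shannon entropy in bits over the 20 standard residues is an assumption, and the function names are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes

def composition(sequence):
    """Return the relative frequency of each standard amino acid in the sequence."""
    counts = Counter(res for res in sequence.upper() if res in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]

def shannon_entropy(probabilities):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# Toy sequence with four equally frequent residues.
toy_composition = composition("ACDE")
toy_entropy = shannon_entropy(toy_composition)
```

The same two functions could be applied to dipeptide frequencies by counting overlapping residue pairs instead of single residues. The resulting vectors and entropy values would then be concatenated into the feature vector passed to a classifier.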

Source
http://dx.doi.org/10.1016/j.jtbi.2013.03.010
September 2013

Using the concept of pseudo amino acid composition to predict resistance gene against Xanthomonas oryzae pv. oryzae in rice: an approach from chaos games representation.

J Theor Biol 2011 Sep 16;284(1):16-23. Epub 2011 Jun 16.

College of Science, Huazhong Agricultural University, Wuhan, Hubei, China.

To evaluate the possibility that an unknown protein corresponds to a resistance gene against Xanthomonas oryzae pv. oryzae, a different mode of pseudo amino acid composition (PseAAC) is proposed to formulate the protein samples by integrating the amino acid composition with the chaos games representation (CGR) method. Numerical comparisons of triangle, quadrangle and 12-vertex polygon CGR are carried out to evaluate the efficiency of using these fractal figures in classifiers. The numerical results show that, among the three polygon methods, the triangle method offers a good fractal visualization and performs best in classifier construction. By using triangle + 12-vertex polygon CGR as the mathematical feature, the classifier achieves 98.13% in the jackknife test and an MCC of 0.8462.

Source
http://dx.doi.org/10.1016/j.jtbi.2011.06.003
September 2011

Prediction of thermophilic protein with pseudo amino Acid composition: an approach from combined feature selection and reduction.

Protein Pept Lett 2011 Jul;18(7):684-9

College of Science, Huazhong Agricultural University, Wuhan, PR of China.

Prediction of thermophilic and mesophilic proteins plays a crucial role in both biochemistry and bioengineering. In this study, a different mode of pseudo amino acid composition (PseAAC) was proposed to formulate the protein samples by integrating the amino acid composition, the physicochemical features, and the composition, transition and distribution features, where each of the protein samples was represented by a numerical vector through the sequence-based approach. Using the support vector machine algorithm, an accurate and reliable classifier was constructed to predict the thermophilic and mesophilic proteins. Moreover, three feature reduction algorithms were applied to locate the most vital features and reduce the size of the feature space. Among the three feature reduction algorithms, the genetic algorithm performed best. Finally, with the reduced features extracted by the genetic algorithm, the new classifier achieved a high accuracy of 95.93% on the selected dataset, with a Matthews correlation coefficient of 0.9187.

Source
http://dx.doi.org/10.2174/092986611795446085
July 2011

Predicting protein solubility with a hybrid approach by pseudo amino acid composition.

Protein Pept Lett 2010 Dec;17(12):1466-72

College of Science, Huazhong Agricultural University, Wuhan, P.R. of China.

Protein solubility plays a major role in understanding the crystal growth and crystallization process of proteins. How to predict the propensity of a protein to be soluble or to form inclusion bodies is a long-standing but not yet fully resolved problem. After choosing almost 10,000 protein sequences from the NCBI database and eliminating the sequences with 90% homologous similarity by CD-HIT, 5692 sequences remained. Using Chou's pseudo amino acid composition features, we predict soluble proteins with three methods: support vector machine (SVM), back propagation neural network (BP Neural Network), and a hybrid method based on SVM and BP Neural Network. Each method is evaluated by a re-substitution test and a 10-fold cross-validation test. In the re-substitution test, the BP Neural Network gives the best results, with an accuracy of 0.9288 and a Matthews correlation coefficient (MCC) of 0.8513. Meanwhile, the other two methods are better than the BP Neural Network in the 10-fold cross-validation test, where the hybrid method based on SVM and BP Neural Network is the best, with an average accuracy of 0.8678 and an average MCC of 0.7233. Although all three methods achieve considerable evaluation scores, the hybrid method is deemed to be the best according to the performance comparison.
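Several abstracts in this list report the Matthews correlation coefficient (MCC) alongside accuracy. A minimal sketch of the standard binary-classification formula is given here for reference; the function name is hypothetical, and returning 0.0 when the denominator is zero is a common convention rather than something stated in the abstracts.

```python
import math

def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts.

    MCC = (tp*tn - fp*fn) / sqrt((tp+fp)(tp+fn)(tn+fp)(tn+fn))
    """
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Assumed convention: an undefined MCC is reported as 0.0.
        return 0.0
    return (tp * tn - fp * fn) / denominator

# Toy confusion matrices, for illustration only.
perfect = matthews_corrcoef(tp=50, tn=50, fp=0, fn=0)
chance = matthews_corrcoef(tp=5, tn=5, fp=5, fn=5)
```

Unlike accuracy, the MCC uses all four cells of the confusion matrix, so it remains informative when the soluble and insoluble classes are imbalanced.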

Source
http://dx.doi.org/10.2174/0929866511009011466
December 2010