Publications by authors named "Quan Zou"

237 Publications

SgRNA-RF: identification of SgRNA on-target activity with imbalanced datasets.

IEEE/ACM Trans Comput Biol Bioinform 2021 May 12;PP. Epub 2021 May 12.

Single-guide RNA (sgRNA) is a guide RNA (gRNA) that guides the insertion or deletion of uridine residues into kinetoplastid during RNA editing. It is a small non-coding RNA that can pair with pre-mRNA. SgRNA is a critical component of the CRISPR/Cas9 gene knockout system and plays an important role in gene editing and gene regulation. It is important to accurately and quickly identify sgRNAs with high on-target activity. Due to its importance, several computational predictors have been proposed to predict sgRNA on-target activity. All these methods have clearly contributed to the development of this field; however, they also have certain limitations. In this paper, we developed a new classifier, SgRNA-RF, which extracts nucleic acid composition and structure features from on-target activity sgRNA sequences and classifies them with a random forest algorithm. In addition, to address the imbalanced dataset, this paper proposed a new method called CS-Smote. We compared SgRNA-RF with state-of-the-art predictors on the five datasets and found that SgRNA-RF significantly improved the identification accuracy, with accuracies of 0.8636, 0.9161, 0.894, 0.938, 0.965, 0.77, 0.979, and 0.973, respectively. The user-friendly web server that implements SgRNA-RF is freely available at http://server.malab.cn/sgRNA-RF/.
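The abstract does not describe how CS-Smote works, so the following is only a minimal sketch of the plain SMOTE idea it presumably builds on: synthetic minority samples are created by interpolating between a minority sample and one of its nearest minority neighbours. The function name `smote_like`, its parameters, and the toy data are all hypothetical.

```python
import random

def smote_like(minority, n_new, k=2, seed=0):
    """Plain SMOTE-style oversampling (NOT the paper's CS-Smote).

    Each synthetic point lies on the segment between a randomly chosen
    minority sample and one of its k nearest minority neighbours.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance to base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        t = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([a + t * (b - a) for a, b in zip(base, neighbour)])
    return synthetic

# Toy minority class: the four corners of the unit square.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_like(minority, n_new=3)
```

Because every synthetic point is a convex combination of two existing minority samples, it stays inside the region already spanned by the minority class.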

Source
http://dx.doi.org/10.1109/TCBB.2021.3079116
May 2021

Ecological and network analyses identify four microbial species with potential significance for the diagnosis/treatment of ulcerative colitis (UC).

BMC Microbiol 2021 May 4;21(1):138. Epub 2021 May 4.

Department of Gastroenterology, The First Affiliated Hospital of Kunming Medical University, Yunnan Institute of Digestive Disease, Kunming, Yunnan, China.

Background: Ulcerative colitis (UC) is one of the primary types of inflammatory bowel disease (IBD), the occurrence of which has been increasing worldwide. Although IBD is an intensively studied human microbiome-associated disease, research on Chinese populations remains relatively limited, particularly on the mucosal microbiome. The present study aimed to analyze the changes in the mucosal microbiome associated with UC from the perspectives of medical ecology and complex network analysis.

Results: In total, 56 mucosal microbiome samples were collected from 28 Chinese UC patients and their healthy family partners, followed by amplicon sequencing. Based on sequencing data, we analyzed species diversity, shared species, and inter-species interactions at the whole community, main phyla, and core/periphery species levels. We identified four opportunistic "pathogens" (i.e., Clostridium tertium, Odoribacter splanchnicus, Ruminococcus gnavus, and Flavonifractor plautii) with potential significance for the diagnosis and treatment of UC, which were inhibited in healthy individuals, but unrestricted in the UC patients. In addition, we also discovered in this study: (i) The positive-to-negative links (P/N) ratio, which measures the balance of species interactions or inhibition effects in microbiome networks, was significantly higher in UC patients, indicating loss of inhibition against potentially opportunistic "pathogens" associated with dysbiosis. (ii) Previous studies have reported conflicting evidence regarding species diversity and composition between UC patients and healthy controls. Here, significant differences were found at the major phylum and core/periphery scales, but not at the whole community level. Thus, we argue that the paradoxical results found in existing studies are due to the scale effect.

Conclusions: Our results reveal changes in the ecology and network structure of the gut mucosal microbiome that might be associated with UC, and these changes might provide potential therapeutic mechanisms of UC. The four opportunistic pathogens that were identified in the present study deserve further investigation in future studies.
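The positive-to-negative links (P/N) ratio described in the Results can be illustrated with a small sketch. It assumes a network has already been reduced to a list of retained links, each carrying a signed correlation; the function name `pn_ratio` and the toy edge lists are hypothetical and are not data from the study.

```python
def pn_ratio(edges):
    """Ratio of positive to negative links in a signed co-occurrence network.

    edges: list of (species_a, species_b, correlation) tuples for retained links.
    Returns float('inf') when the network has no negative links.
    """
    positive = sum(1 for _, _, r in edges if r > 0)
    negative = sum(1 for _, _, r in edges if r < 0)
    if negative == 0:
        return float("inf")
    return positive / negative

# Toy networks: one inhibitory link turns cooperative in the second network.
healthy = [("A", "B", 0.6), ("A", "C", -0.5), ("B", "C", -0.7), ("C", "D", 0.4)]
patient = [("A", "B", 0.6), ("A", "C", 0.5), ("B", "C", -0.7), ("C", "D", 0.4)]

healthy_ratio = pn_ratio(healthy)   # 2 positive / 2 negative
patient_ratio = pn_ratio(patient)   # 3 positive / 1 negative
```

A higher ratio in the second network reflects fewer inhibitory links, which is the pattern the abstract reports for UC patients.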

Source
http://dx.doi.org/10.1186/s12866-021-02201-6
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC8097971
May 2021

MMFGRN: a multi-source multi-model fusion method for gene regulatory network reconstruction.

Brief Bioinform 2021 May 3. Epub 2021 May 3.

College of Intelligence and Computing, Tianjin University, Tianjin, China.

Many biological processes, such as the growth and differentiation of cells and the occurrence and development of diseases, are controlled by gene regulatory networks (GRNs). Therefore, sustained research on GRNs is important. Determining gene-gene relationships from gene expression data is a complex problem. Because it is difficult to efficiently uncover the regularity behind gene-gene relationships by relying only on biochemical experimental methods, various computational methods have been used to construct GRNs, and some achievements have been made. In this paper, we propose a novel method, MMFGRN (for "Multi-source Multi-model Fusion for Gene Regulatory Network reconstruction"), to reconstruct GRNs. To make full use of the limited datasets and explore the potential regulatory relationships contained in different data types, we construct the MMFGRN model from three perspectives: a single time-series data model, a single steady-data model, and a joint time-series and steady-data model. We then use a weighted fusion strategy to obtain the final global regulatory link ranking. MMFGRN yields the best performance on the DREAM4 InSilico_Size10 data, outperforming other popular inference algorithms, with an overall area under the receiver operating characteristic curve score of 0.909 and an area under the precision-recall curve (AUPR) score of 0.770 on the 10-gene network. Additionally, as the network scale increases, our method retains certain advantages, with an overall AUPR score of 0.335 on the DREAM4 InSilico_Size100 data. These results demonstrate the robustness of MMFGRN on networks of different scales. At the same time, the integration strategy proposed in this paper provides a new idea for reconstructing biological network models without prior knowledge, which can help researchers decipher the elusive mechanisms of life.
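The abstract does not give the fusion weights or the exact scoring scheme, so the following is only a generic sketch of a weighted fusion of link scores from several models into one ranking. The function `fuse_rankings`, the gene names, the scores, and the weights are all hypothetical.

```python
def fuse_rankings(score_tables, weights):
    """Combine per-model link scores into one ranking by weighted averaging.

    score_tables: list of dicts mapping (regulator, target) -> score.
    weights: one non-negative weight per table (normalised internally).
    Returns links sorted from highest to lowest fused score.
    """
    total = sum(weights)
    fused = {}
    for table, weight in zip(score_tables, weights):
        for link, score in table.items():
            fused[link] = fused.get(link, 0.0) + (weight / total) * score
    return sorted(fused, key=fused.get, reverse=True)

# Hypothetical scores from three models (time series, steady data, joint).
time_series = {("G1", "G2"): 0.9, ("G2", "G3"): 0.2}
steady = {("G1", "G2"): 0.4, ("G2", "G3"): 0.8}
joint = {("G1", "G2"): 0.7, ("G2", "G3"): 0.5}

ranking = fuse_rankings([time_series, steady, joint], weights=[0.5, 0.25, 0.25])
```

With these made-up numbers the link G1 to G2 receives a fused score of 0.725 and G2 to G3 a score of 0.425, so G1 to G2 is ranked first.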

Source
http://dx.doi.org/10.1093/bib/bbab166
May 2021

CarSite-II: an integrated classification algorithm for identifying carbonylated sites based on K-means similarity-based undersampling and synthetic minority oversampling techniques.

BMC Bioinformatics 2021 Apr 26;22(1):216. Epub 2021 Apr 26.

Department of Computer Science, Xiamen University, Xiamen, 361005, China.

Background: Carbonylation is a non-enzymatic irreversible protein post-translational modification, and refers to the side chain of amino acid residues being attacked by reactive oxygen species and finally converted into carbonyl products. Studies have shown that protein carbonylation caused by reactive oxygen species is involved in the etiology and pathophysiological processes of aging, neurodegenerative diseases, inflammation, diabetes, amyotrophic lateral sclerosis, Huntington's disease, and tumor. Current experimental approaches used to predict carbonylation sites are expensive, time-consuming, and limited in protein processing abilities. Computational prediction of the carbonylation residue location in protein post-translational modifications enhances the functional characterization of proteins.

Results: In this study, an integrated classifier algorithm, CarSite-II, was developed to identify K, P, R, and T carbonylated sites. The resampling method K-means similarity-based undersampling and the synthetic minority oversampling technique (SMOTE-KSU) were incorporated to balance the proportions of K, P, R, and T carbonylated training samples. Next, the integrated classifier system Rotation Forest uses "support vector machine" subclassifications to divide three types of feature spaces into several subsets. CarSite-II achieved Matthews correlation coefficient (MCC) values of 0.2287/0.3125/0.2787/0.2814, false positive rate values of 0.2628/0.1084/0.1383/0.1313, and false negative rate values of 0.2252/0.0205/0.0976/0.0608 for K/P/R/T carbonylation sites by tenfold cross-validation, respectively. On our independent test dataset, CarSite-II yielded MCC values of 0.6358/0.2910/0.4629/0.3685, false positive rate values of 0.0165/0.0203/0.0188/0.0094, and false negative rate values of 0.1026/0.1875/0.2037/0.3333 for K/P/R/T carbonylation sites. The results show that CarSite-II achieves remarkably better performance than all currently available prediction tools.

Conclusion: The results show that CarSite-II achieved better performance than the five currently available programs and demonstrate the usefulness of the SMOTE-KSU resampling approach and the integration algorithm. For the convenience of experimental scientists, the CarSite-II web tool is available at http://47.100.136.41:8081/.
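The three metrics reported above follow directly from a binary confusion matrix. The sketch below uses the standard definitions; the confusion-matrix counts in the example are made up for illustration and are not taken from the paper.

```python
import math

def binary_metrics(tp, fp, tn, fn):
    """Return (MCC, false positive rate, false negative rate).

    MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
    FPR = FP / (FP + TN)
    FNR = FN / (FN + TP)
    Undefined ratios (zero denominator) are reported as 0.0.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    return mcc, fpr, fnr

# Illustrative counts only.
mcc, fpr, fnr = binary_metrics(tp=40, fp=10, tn=90, fn=10)
```

With these counts the MCC is 0.7, the false positive rate is 0.1, and the false negative rate is 0.2. MCC is often preferred on imbalanced data because it uses all four cells of the confusion matrix.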

Source
http://dx.doi.org/10.1186/s12859-021-04134-3
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC8077735
April 2021

Identifying DNA N4-methylcytosine sites in the Rosaceae genome with a deep learning model relying on distributed feature representation.

Comput Struct Biotechnol J 2021 19;19:1612-1619. Epub 2021 Mar 19.

Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju 54896, South Korea.

DNA N4-methylcytosine (4mC), an epigenetic modification found in prokaryotic and eukaryotic species, is involved in numerous biological functions, including host defense, transcription regulation, gene expression, and DNA replication. To identify 4mC sites, previous computational studies mostly focused on finding hand-crafted features. This area of research, therefore, would benefit from the development of a computational approach that relies on automatic feature selection to identify relevant sites. We here report 4mC-w2vec, a computational method that learned automatic feature discrimination in Rosaceae genomes, based on distributed feature representation and the word embedding technique 'word2vec'. While a few bioinformatics tools are currently employed to identify 4mC sites in these genomes, their prediction performance is inadequate. Our system processed 4mC and non-4mC sites through a word embedding process, including sub-word information of its biological words through k-mers, which then served as features that were fed into a two-layer convolutional neural network (CNN) to classify whether the sample sequences contained 4mC or non-4mC sites. Our tool demonstrated performance superior to current tools that use the same genomic datasets. Additionally, 4mC-w2vec is effective for balanced and imbalanced class datasets alike, and the online web server is currently available at: http://nsclbio.jbnu.ac.kr/tools/4mC-w2vec/.
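The "biological words" fed to a word embedding model are typically overlapping k-mers. The sketch below shows only that tokenization step, assuming a stride of one and k = 3; both choices and the function name `kmer_tokens` are illustrative assumptions, not settings confirmed by the abstract.

```python
def kmer_tokens(sequence, k=3):
    """Split a DNA sequence into overlapping k-mers (stride 1)."""
    sequence = sequence.upper()
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

tokens = kmer_tokens("ACGTAC", k=3)
# tokens == ["ACG", "CGT", "GTA", "TAC"]
```

Each token list can then be treated like a sentence of words when training an embedding model such as word2vec.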

Source
http://dx.doi.org/10.1016/j.csbj.2021.03.015
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC8042287
March 2021

A comprehensive review of the imbalance classification of protein post-translational modifications.

Brief Bioinform 2021 Apr 8. Epub 2021 Apr 8.

Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China.

Post-translational modifications (PTMs) play significant roles in regulating protein structure, activity and function, and they are closely involved in various pathologies. Therefore, the identification of associated PTMs is the foundation of in-depth research on related biological mechanisms, disease treatments and drug design. Due to the high cost and time consumption of high-throughput sequencing techniques, developing machine learning-based predictors has been considered an effective approach to rapidly recognize potential modified sites. However, the imbalanced distribution of true and false PTM sites, namely, the data imbalance problem, largely affects the reliability and application of prediction tools. In this article, we conduct a systematic survey of research progress in imbalanced PTM classification. First, we describe the modeling process in detail and outline useful data imbalance solutions. Then, we summarize the recently proposed bioinformatics tools based on imbalanced PTM data and build a convenient website, ImClassi_PTMs (available at lab.malab.cn/~dlj/ImbClassi_PTMs/), for researchers to browse. Moreover, we analyze the challenges of current computational predictors and propose some suggestions to improve the efficiency of imbalance learning. We hope that this work will provide comprehensive knowledge of imbalanced PTM recognition and contribute to advanced predictors in the future.

Source
http://dx.doi.org/10.1093/bib/bbab089
April 2021

DisBalance: a platform to automatically build balance-based disease prediction models and discover microbial biomarkers from microbiome data.

Brief Bioinform 2021 Apr 8. Epub 2021 Apr 8.

Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China.

How best to utilize microbial taxonomic abundances for the prediction and explanation of human diseases remains appealing and challenging, and the relative nature of microbiome data necessitates a proper feature selection method to resolve the compositional problem. In this study, we developed an all-in-one platform to address a series of issues in microbiome-based human disease prediction and taxonomic biomarker discovery. We prioritize the interpretation, runtime and classification accuracy of the distal discriminative balances analysis (DBA-distal) method in selecting a set of distal discriminative balances, and develop DisBalance, a comprehensive platform, to integrate and streamline the workflows of disease model building, disease risk prediction and disease-related biomarker discovery for microbiome-based binary classifications. DisBalance allows de novo model building and disease risk prediction in a fast and convenient way. To facilitate model-driven and knowledge-driven discoveries, DisBalance provides multiple strategies for mining microbial biomarkers. The independent validation of the models constructed by the DisBalance pipeline is performed on seven microbiome datasets from the original article of DBA-distal. The implementation of the DisBalance platform is demonstrated by a complete analysis of a shotgun metagenomic dataset of ulcerative colitis (UC). As a free and open-source platform, DisBalance can be accessed at http://lab.malab.cn/soft/DisBalance. The source code and demo data for DisBalance are available at https://github.com/yangfenglong/DisBalance.

Source
http://dx.doi.org/10.1093/bib/bbab094
April 2021

Critical downstream analysis steps for single-cell RNA sequencing data.

Brief Bioinform 2021 Apr 5. Epub 2021 Apr 5.

Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China.

Single-cell RNA sequencing (scRNA-seq) has enabled us to study biological questions at the single-cell level. Currently, many analysis tools are available to better utilize these relatively noisy data. In this review, we summarize the most widely used methods for critical downstream analysis steps (i.e. clustering, trajectory inference, cell-type annotation and integrating datasets). The advantages and limitations are comprehensively discussed, and we provide suggestions for choosing proper methods in different situations. We hope this paper will be useful for scRNA-seq data analysts and bioinformatics tool developers.

Source
http://dx.doi.org/10.1093/bib/bbab105
April 2021

Machine learning for phytopathology: from the molecular scale towards the network scale.

Brief Bioinform 2021 Mar 31. Epub 2021 Mar 31.

Shenzhen Polytechnic, China.

With the increasing volume of high-throughput sequencing data from a variety of omics techniques in the field of plant-pathogen interactions, sorting, retrieving, processing and visualizing biological information have become a great challenge. Within the explosion of data, machine learning offers powerful tools to process these complex omics data by various algorithms, such as Bayesian reasoning, support vector machine and random forest. Here, we introduce the basic frameworks of machine learning in dissecting plant-pathogen interactions and discuss the applications and advances of machine learning in plant-pathogen interactions from molecular to network biology, including the prediction of pathogen effectors, plant disease resistance protein monitoring and the discovery of protein-protein networks. The aim of this review is to provide a summary of advances in plant defense and pathogen infection and to indicate the important developments of machine learning in phytopathology.

Source
http://dx.doi.org/10.1093/bib/bbab037
March 2021

rBPDL: Predicting RNA-binding proteins using deep learning.

IEEE J Biomed Health Inform 2021 Mar 29;PP. Epub 2021 Mar 29.

RNA-binding protein (RBP) is a powerful and wide-ranging regulator that plays an important role in cell development, differentiation, metabolism, health and disease. The prediction of RBPs provides valuable guidance for biologists; although the wet test RBP has made good progress, it is time-consuming and not flexible. Therefore, we developed a network model, rBPDL, by combining a convolutional neural network and long short-term memory for multilabel classification of RBPs. Moreover, to achieve better prediction results, we used a voting algorithm for ensemble learning of the model. We compared rBPDL with state-of-the-art methods and found that rBPDL significantly improved identification performance for the RBP68 dataset, with a macro-Area Under Curve (AUC), micro-AUC, and weighted AUC of 0.936, 0.962, and 0.946, respectively. Furthermore, we analyzed the performance of rBPDL on a single RBP and found, through AUC statistical analysis of the RBP domain, that the RBP identification performance in the same domain was similar. In addition, we analyzed the performance preferences and physicochemical properties of the binding protein amino acids and explored the characteristics that affect the binding by using the RBP86 dataset. The code and datasets can be found at the link: https://github.com/nmt315320/rBPDL.git.

Source
http://dx.doi.org/10.1109/JBHI.2021.3069259
March 2021

Single-cell RNA sequencing analysis identifies key genes in brain metastasis from lung adenocarcinoma.

Curr Gene Ther 2021 Mar 18. Epub 2021 Mar 18.

Second Affiliated Hospital of Harbin Medical University, Harbin Medical University, Harbin 150080, China.

Background: Lung adenocarcinoma (LADC) is the most common type of lung cancer and is a subtype of non-small-cell lung cancer (NSCLC). Approximately 40% of LADC patients experience brain metastases (BMs) during the course of the disease. In this study, integrated bioinformatics methods were applied to identify key genes related to brain metastasis in lung adenocarcinoma.

Methods: We derived and characterized genes differentially expressed between the primary tumour and brain metastases using tumour cells isolated from two lung cancer patient-derived xenograft (PDX) cases (GSE69405). Gene ontology (GO) and KEGG pathway enrichment analyses were applied, and protein-protein interaction (PPI) networks and Cytoscape software were utilized to identify key genes.

Results: Four key genes including CKAP4 (Cytoskeleton Associated Protein 4), SERPINA1 (Serpin Family A Member 1), SDC2 (Syndecan 2) and GNG11 (G Protein Subunit Gamma 11) were identified for BM-LADC by the Venn diagram.

Conclusion: We believe these key genes may be potential biomarkers for improved prognosis and treatment of lung adenocarcinoma.

Source
http://dx.doi.org/10.2174/1566523221666210319104752
March 2021

Computational biology and chemistry Special section editorial: Computational analyses for miRNA.

Comput Biol Chem 2021 Apr 30;91:107448. Epub 2021 Jan 30.

Hainan Key Laboratory for Computational Science and Application, Hainan Normal University, Haikou 571158, China; Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 610054, China; Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou 324000, China.

Source
http://dx.doi.org/10.1016/j.compbiolchem.2021.107448
April 2021

A comprehensive overview and critical evaluation of gene regulatory network inference technologies.

Brief Bioinform 2021 Feb 5. Epub 2021 Feb 5.

School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, China.

A gene regulatory network (GRN) is an important mechanism for maintaining life processes, controlling biochemical reactions and regulating compound levels, and it plays an important role in various organisms and systems. Reconstructing GRNs can help us understand the molecular mechanisms of organisms and reveal the essential rules of a large number of biological processes and reactions. To deal with the uncertainty of processing, network reconstruction algorithms rely on specific assumptions that affect prediction accuracy. To study why a certain method is more suitable for a specific research problem or type of experimental data, we examine model-based, information-based and machine learning-based methods. There are obviously different types of computational tools that can be generated to distinguish GRNs. Furthermore, we discuss several classical, representative and recent methods in each category and analyze their core ideas, general steps and characteristics. We compare the performance of state-of-the-art GRN reconstruction technologies on simulated networks and real networks under different scaling conditions. Through standardized performance metrics and common benchmarks, we quantitatively evaluate the stability of various methods and the sensitivity of the same algorithm when applied to networks of different scales. The aim of this study is to explore the most appropriate method for a specific GRN, which helps biologists and medical scientists discover potential drug targets and identify cancer biomarkers.

Source
http://dx.doi.org/10.1093/bib/bbab009
February 2021

Anticancer peptides prediction with deep representation learning features.

Brief Bioinform 2021 Feb 3. Epub 2021 Feb 3.

School of Electronic and Communication Engineering, Shenzhen Polytechnic.

Anticancer peptides constitute one of the most promising therapeutic agents for combating common human cancers. Using wet experiments to verify whether a peptide displays anticancer characteristics is time-consuming and costly. Hence, in this study, we proposed a computational method named identify anticancer peptides via deep representation learning features (iACP-DRLF), which uses the light gradient boosting machine algorithm and deep representation learning features. Two kinds of sequence embedding technologies were used, namely soft symmetric alignment embedding and unified representation (UniRep) embedding, both of which involve deep neural network models based on long short-term memory networks and their derived networks. The results showed that the use of deep representation learning features greatly improved the capability of the models to discriminate anticancer peptides from other peptides. Also, UMAP (uniform manifold approximation and projection for dimension reduction) and SHAP (Shapley additive explanations) analyses indicated that UniRep features have an advantage over other features for anticancer peptide identification. The Python script and pretrained models can be downloaded from https://github.com/zhibinlv/iACP-DRLF or from http://public.aibiochem.net/iACP-DRLF/.

Source
http://dx.doi.org/10.1093/bib/bbab008
February 2021

Sequence representation approaches for sequence-based protein prediction tasks that use deep learning.

Brief Funct Genomics 2021 Mar;20(1):61-73

Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, Sichuan, China.

Deep learning has been increasingly used in bioinformatics, especially in sequence-based protein prediction tasks, as large amounts of biological data are available and deep learning techniques have been developed rapidly in recent years. For sequence-based protein prediction tasks, the selection of a suitable model architecture is essential, whereas sequence data representation is a major factor in controlling model performance. Here, we summarized all the main approaches that are used to represent protein sequence data (amino acid sequence encoding or embedding), which include end-to-end embedding methods, non-contextual embedding methods and embedding methods that use transfer learning and others that are applied for some specific tasks (such as protein sequence embedding based on extracted features for protein structure predictions and graph convolutional network-based embedding for drug discovery tasks). We have also reviewed the architectures of various types of embedding models theoretically and the development of these types of sequence embedding approaches to facilitate researchers and users in selecting the model that best suits their requirements.
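As a concrete instance of the simplest amino acid sequence encoding mentioned above, the sketch below builds a one-hot matrix over the 20 standard residues. The alphabet ordering and the choice to encode unknown residues as all-zero rows are assumptions made for this example, not conventions taken from the review.

```python
# The 20 standard amino acids in one-letter code (ordering is arbitrary).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot(sequence):
    """Encode a protein sequence as a list of 20-dimensional one-hot rows."""
    rows = []
    for aa in sequence.upper():
        row = [0] * len(AMINO_ACIDS)
        if aa in INDEX:  # unknown residues (e.g. X) stay all-zero
            row[INDEX[aa]] = 1
        rows.append(row)
    return rows

encoded = one_hot("ACX")
```

One-hot rows carry no notion of residue similarity, which is one reason learned embeddings are often preferred for deep models.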

Source
http://dx.doi.org/10.1093/bfgp/elaa030
March 2021

Prediction of RNA-binding protein and alternative splicing event associations during epithelial-mesenchymal transition based on inductive matrix completion.

Brief Bioinform 2021 Feb 1. Epub 2021 Feb 1.

College of Mathematics and Statistics, Shenzhen University, 518000, Guangdong, China.

Motivation: The developmental process of epithelial-mesenchymal transition (EMT) is abnormally activated during breast cancer metastasis. Transcriptional regulatory networks that control EMT have been well studied; however, alternative RNA splicing plays a vital regulatory role during this process and the regulating mechanism needs further exploration. Because of the huge cost and complexity of biological experiments, the underlying mechanisms of alternative splicing (AS) and associated RNA-binding proteins (RBPs) that regulate the EMT process remain largely unknown. Thus, there is an urgent need to develop computational methods for predicting potential RBP-AS event associations during EMT.

Results: We developed a novel model for RBP-AS target prediction during EMT that is based on inductive matrix completion (RAIMC). Integrated RBP similarities were calculated based on RBP regulating similarity, and RBP Gaussian interaction profile (GIP) kernel similarity, while integrated AS event similarities were computed based on AS event module similarity and AS event GIP kernel similarity. Our primary objective was to complete missing or unknown RBP-AS event associations based on known associations and on integrated RBP and AS event similarities. In this paper, we identify significant RBPs for AS events during EMT and discuss potential regulating mechanisms. Our computational results confirm the effectiveness and superiority of our model over other state-of-the-art methods. Our RAIMC model achieved AUC values of 0.9587 and 0.9765 based on leave-one-out cross-validation (CV) and 5-fold CV, respectively, which are larger than the AUC values from the previous models. RAIMC is a general matrix completion framework that can be adopted to predict associations between other biological entities. We further validated the prediction performance of RAIMC on the genes CD44 and MAP3K7. RAIMC can identify the related regulating RBPs for isoforms of these two genes.

Availability And Implementation: The source code for RAIMC is available at https://github.com/yushanqiu/RAIMC.

Contact: zouquan@nclab.net
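The Gaussian interaction profile (GIP) kernel similarity used in the Results can be sketched as follows. This uses the commonly cited formulation, in which the bandwidth is normalised by the mean squared norm of the interaction profiles; whether RAIMC uses exactly this normalisation is an assumption, and the function name `gip_kernel`, the parameter `gamma_prime`, and the toy profiles are hypothetical.

```python
import math

def gip_kernel(profiles, gamma_prime=1.0):
    """Gaussian interaction profile kernel between rows of a binary matrix.

    K[i][j] = exp(-gamma * ||p_i - p_j||^2),
    gamma   = gamma_prime / mean_i(||p_i||^2).
    """
    n = len(profiles)
    mean_sq_norm = sum(sum(v * v for v in p) for p in profiles) / n
    gamma = gamma_prime / mean_sq_norm if mean_sq_norm else 0.0
    kernel = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            sq_dist = sum((a - b) ** 2 for a, b in zip(profiles[i], profiles[j]))
            kernel[i][j] = math.exp(-gamma * sq_dist)
    return kernel

# Toy association matrix: rows are RBPs, columns are AS events (1 = known link).
profiles = [[1, 0, 1], [1, 0, 1], [0, 1, 0]]
K = gip_kernel(profiles)
```

Rows with identical association profiles get similarity 1, and similarity decays as the profiles diverge. The same function applied to the transposed matrix would give AS event similarities.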

Source
http://dx.doi.org/10.1093/bib/bbaa440
February 2021

GutBalance: a server for the human gut microbiome-based disease prediction and biomarker discovery with compositionality addressed.

Brief Bioinform 2021 Jan 30. Epub 2021 Jan 30.

Department of Radiology, The Second Affiliated Hospital, Harbin Medical University, Harbin 150001, China.

The compositionality of the microbiome data is well-known but often neglected. The compositional transformation pertains to the supervised learning of microbiome data and is a critical step that decides the performance and reliability of the disease classifiers. We value the excellent performance of the distal discriminative balance analysis (DBA) method, which selects distal balances of pairs and trios of bacteria, in addressing the classification of high-dimensional microbiome data. By applying this method to the species-level abundances of all the disease phenotypes in the GMrepo database, we build a balance-based model repository for the classification of human gut microbiome-related diseases. The model repository supports the prediction of disease risks for new sample(s). More importantly, we highlight the concept of balance-disease associations rather than the conventional microbe-disease associations and develop the human Gut Balance-Disease Association Database (GBDAD). Each predictable balance for each disease model indicates a potential biomarker-disease relationship and can be interpreted as a bacteria ratio positively or negatively correlated with the disease. Furthermore, by linking the balance-disease associations to the evidenced microbe-disease associations in MicroPhenoDB, we surprisingly found that most species-disease associations inferred from the shotgun metagenomic datasets can be validated by external evidence beyond MicroPhenoDB. The balance-based species-disease association inference will accelerate the generation of new microbe-disease association hypotheses in gastrointestinal microecology research and clinical trials. The model repository and the GBDAD database are deployed on the GutBalance server, which supports interactive visualization and systematic interrogation of the disease models, disease-related balances and disease-related species of interest.
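A balance between two groups of taxa is usually computed as a scaled log-ratio of their geometric mean abundances. The sketch below uses that standard isometric log-ratio form and assumes zeros have already been replaced (for example with a pseudocount); the function name `balance`, the taxon names, and the abundances are hypothetical.

```python
import math

def balance(abundances, numerator, denominator):
    """Balance between two taxon groups in one sample.

    B = sqrt(r*s / (r+s)) * ln(gmean(numerator) / gmean(denominator)),
    where r and s are the group sizes. All abundances must be > 0.
    """
    r, s = len(numerator), len(denominator)
    log_gm_num = sum(math.log(abundances[t]) for t in numerator) / r
    log_gm_den = sum(math.log(abundances[t]) for t in denominator) / s
    return math.sqrt(r * s / (r + s)) * (log_gm_num - log_gm_den)

# A trio: one taxon against two others (relative abundances are made up).
sample = {"taxon_a": 0.4, "taxon_b": 0.1, "taxon_c": 0.1}
b = balance(sample, ["taxon_a"], ["taxon_b", "taxon_c"])

# The same sample expressed as counts gives the same balance.
counts = {name: value * 1000 for name, value in sample.items()}
b_counts = balance(counts, ["taxon_a"], ["taxon_b", "taxon_c"])
```

Because the balance depends only on ratios, it is unchanged when the whole sample is rescaled, which is what makes it suitable for compositional data.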

Source
http://dx.doi.org/10.1093/bib/bbaa436
January 2021

Application of learning to rank in bioinformatics tasks.

Brief Bioinform 2021 Jan 18. Epub 2021 Jan 18.

University of Electronic Science and Technology of China.

Over the past decades, learning to rank (LTR) algorithms have been gradually applied to bioinformatics. Such methods have shown significant advantages in multiple research tasks in this field. Therefore, it is necessary to summarize and discuss the application of these algorithms so that these algorithms are convenient and contribute to bioinformatics. In this paper, the characteristics of LTR algorithms and their strengths over other types of algorithms are analyzed based on the application of multiple perspectives in bioinformatics. Finally, the paper further discusses the shortcomings of the LTR algorithms, the methods and means to better use the algorithms and some open problems that currently exist.

Source
http://dx.doi.org/10.1093/bib/bbaa394
January 2021

Identify RNA-associated subcellular localizations based on multi-label learning using Chou's 5-steps rule.

BMC Genomics 2021 Jan 15;22(1):56. Epub 2021 Jan 15.

School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, China.

Background: The biological functions of biomolecules depend on the cellular compartments in which they are located. Importantly, RNAs are assigned to specific locations in a cell, enabling the cell to carry out diverse biochemical processes concurrently. However, many existing RNA subcellular localization classifiers solve only the single-label classification problem. It is of great practical significance to extend RNA subcellular localization to a multi-label classification problem.

Results: In this study, we extract multi-label classification datasets of RNA-associated subcellular localizations for various types of RNAs, and then construct subcellular localization datasets for four RNA categories. To study Homo sapiens, we further establish human RNA subcellular localization datasets. Furthermore, we use different nucleotide property composition models to extract effective features that adequately represent the important information in nucleotide sequences. In the most critical part, we address a major challenge, namely fusing the multivariate information through multiple kernel learning based on the Hilbert-Schmidt independence criterion. The optimal combined kernel can be put into an integration support vector machine model for identifying multi-label RNA subcellular localizations. Our method obtained average precision values of 0.703, 0.757, 0.787 and 0.800 on the four RNA datasets, respectively.

Conclusion: Our method outperforms other prediction tools on the novel benchmark datasets. Moreover, we establish a user-friendly web server that implements our method.
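
As a rough illustration of kernel fusion guided by the Hilbert-Schmidt independence criterion (HSIC), the sketch below scores each feature kernel by its empirical HSIC with a label kernel and uses the normalized scores as combination weights. This weighting heuristic, the function names and the toy data are assumptions; the paper's actual multiple kernel learning optimization is not described in the abstract and is not reproduced here.

```python
def center(kernel):
    """Double-center a square kernel matrix (equivalent to H K H)."""
    n = len(kernel)
    row_means = [sum(row) / n for row in kernel]
    col_means = [sum(kernel[i][j] for i in range(n)) / n for j in range(n)]
    grand_mean = sum(row_means) / n
    return [[kernel[i][j] - row_means[i] - col_means[j] + grand_mean
             for j in range(n)] for i in range(n)]

def hsic(kernel_a, kernel_b):
    """Empirical HSIC: trace(K H L H) / (n - 1)^2 for symmetric kernels."""
    n = len(kernel_a)
    centered = center(kernel_a)
    total = sum(centered[i][j] * kernel_b[i][j] for i in range(n) for j in range(n))
    return total / (n - 1) ** 2

def linear_kernel(rows):
    return [[sum(x * y for x, y in zip(a, b)) for b in rows] for a in rows]

def combine_kernels(kernels, label_kernel):
    """Weight each kernel by its (non-negative) HSIC with the label kernel."""
    scores = [max(hsic(k, label_kernel), 0.0) for k in kernels]
    total = sum(scores)
    weights = [s / total for s in scores] if total > 0 else [1 / len(kernels)] * len(kernels)
    n = len(label_kernel)
    combined = [[sum(w * k[i][j] for w, k in zip(weights, kernels))
                 for j in range(n)] for i in range(n)]
    return weights, combined

# Toy data: four RNAs, two hypothetical feature views, two location labels.
labels = [[1, 0], [1, 0], [0, 1], [0, 1]]
view_informative = [[1.0], [0.9], [-1.0], [-1.1]]
view_unrelated = [[0.5], [-0.5], [0.5], [-0.5]]

weights, combined = combine_kernels(
    [linear_kernel(view_informative), linear_kernel(view_unrelated)],
    linear_kernel(labels),
)
print(weights)
```

The combined matrix could then be passed to any classifier that accepts a precomputed kernel; the view that carries no label information receives a weight close to zero.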

Source
http://dx.doi.org/10.1186/s12864-020-07347-7
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7811227
January 2021

HSM6AP: a high-precision predictor for the Homo N6-methyladenosine (m6A) based on multiple weights and feature stitching.

RNA Biol 2021 Feb 12:1-11. Epub 2021 Feb 12.

Bioinformatics Laboratory, Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China.

Recent studies have shown that RNA methylation modification can affect RNA transcription, metabolism, splicing and stability. In addition, RNA methylation modification has been associated with cancer, obesity and other diseases. Based on human genome information and machine learning, this paper discusses the effect of fusing sequence and gene-level feature extraction on the accuracy of methylation site recognition. The discovery of new features exposed significant limitations of existing computational tools: (1) most prediction models are based solely on sequence features and use SVM or random forest as classification methods; (2) limited by the number of samples, the models may not achieve good performance. To establish a better prediction model for methylation sites, we must set specific weighting strategies for training samples and find more powerful and informative feature matrices to establish a comprehensive model. In this paper, we present HSM6AP, a high-precision predictor for the N6-methyladenosine (m6A) based on multiple weights and feature stitching. Unlike existing methods, HSM6AP weights the samples during training and explores a wide range of features. Max-Relevance-Max-Distance (MRMD) is employed for feature selection, and the feature matrix is generated by fusing the single features. Extreme gradient boosting (XGBoost), an ensemble machine learning algorithm based on decision trees, is used for model training, and model performance is improved through parameter adjustment. Two rigorous independent data sets demonstrated the superiority of HSM6AP in identifying methylation sites. HSM6AP is an advanced predictor that can be directly employed by users (especially non-professional users) to predict methylation sites. Users can access our related tools and data sets at the following website: The code of our tool is publicly accessible at .

Source
http://dx.doi.org/10.1080/15476286.2021.1875180
February 2021

Regulator Network Analysis of Rice and Maize Yield-Related Genes.

Front Cell Dev Biol 2020 3;8:621464. Epub 2020 Dec 3.

Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China.

Rice and maize are the principal food crop species worldwide. The mechanism of gene regulation underlying the yield of rice and maize remains a research focus. Seed size, weight and shape are important yield traits in rice and maize. Most members of three gene families, APETALA2/ethylene response factor, auxin response factors and MADS, have been identified as being involved in yield traits in rice and maize. Analysis of the molecular regulation mechanisms related to yield traits provides theoretical support for the improvement of crop yield. With the improvement of sequencing technology, genetic regulatory network analysis can provide new insights into gene families. Here, we analyzed the evolutionary relationships and the genetic regulatory network of the gene family members to predict genes that may be involved in yield-related traits in rice and maize. The results may provide some theoretical and application guidelines for future investigations in molecular biology, which may be helpful for developing new high-yield rice and maize varieties.

Source
http://dx.doi.org/10.3389/fcell.2020.621464
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7793993
December 2020

Editorial: Computational Learning Models and Methods Driven by Omics for Precision Medicine.

Front Genet 2020 23;11:620976. Epub 2020 Dec 23.

Faculty of Computing, Engineering and the Built Environment, School of Computing, Engineering and Intelligent Systems, Ulster University, Coleraine, United Kingdom.

Source
http://dx.doi.org/10.3389/fgene.2020.620976
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7785880
December 2020

SubLocEP: a novel ensemble predictor of subcellular localization of eukaryotic mRNA based on machine learning.

Brief Bioinform 2021 Jan 4. Epub 2021 Jan 4.

Tianjin University.

Motivation: mRNA location corresponds to the location of protein translation and contributes to precise spatial and temporal management of protein function. However, the current assignment of subcellular localization of eukaryotic mRNA has important limitations: (1) turning a multi-class problem into multiple binary classifications makes the training process tedious; (2) the majority of models trained with classical algorithms are based on the extraction of a single type of sequence information; (3) the existing state-of-the-art models have not reached an ideal level in terms of prediction and generalization ability. To achieve better assignment of subcellular localization of eukaryotic mRNA, a better and more comprehensive model must be developed.

Results: In this paper, SubLocEP is proposed as a two-layer integrated prediction model for accurate prediction of the location of sequence samples. Unlike the existing models based on limited features, SubLocEP comprehensively considers additional feature attributes and is combined with LightGBM to generate single-feature classifiers. The initial integration model (single-layer model) is generated according to the categories of a feature. Subsequently, two single-layer integration models are weighted (sequence-based: physicochemical properties = 3:2) to produce the final two-layer model. The performance of SubLocEP on independent datasets indicates that it is an accurate and stable prediction model with strong generalization ability. Additionally, an online tool has been developed that contains experimental data and makes it convenient for users to estimate the subcellular localization of eukaryotic mRNA.
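
The 3:2 weighting of the two single-layer models can be sketched as a weighted average of their outputs. The sketch below assumes that each single-layer model returns one probability per location class and that the weights are applied to those probabilities; the function name and the toy numbers are hypothetical and do not come from SubLocEP itself.

```python
def fuse_two_layers(sequence_probs, physchem_probs, weights=(3, 2)):
    """Weighted average of two per-class probability vectors (sketch)."""
    w_seq, w_phys = weights
    total = w_seq + w_phys
    fused = [(w_seq * p + w_phys * q) / total
             for p, q in zip(sequence_probs, physchem_probs)]
    # Index of the class with the highest fused probability.
    best = max(range(len(fused)), key=fused.__getitem__)
    return fused, best

# Hypothetical outputs of the two single-layer models for three locations.
fused, best = fuse_two_layers([0.6, 0.3, 0.1], [0.2, 0.7, 0.1])
print(fused, best)
```

In this toy case the physicochemical model is confident enough in the second class to overturn the sequence-based model despite its smaller weight.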

Source
http://dx.doi.org/10.1093/bib/bbaa401
January 2021

sgRNACNN: identifying sgRNA on-target activity in four crops using ensembles of convolutional neural networks.

Plant Mol Biol 2021 Mar 1;105(4-5):483-495. Epub 2021 Jan 1.

Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China.

Key Message: We proposed an ensemble convolutional neural network model to identify sgRNA high on-target activity in four crops and we used one-hot encoding and k-mers for sequence encoding. As an important component of the CRISPR/Cas9 system, single-guide RNA (sgRNA) plays an important role in gene redirection and editing. sgRNA has played an important role in the improvement of agronomic species, but there is a lack of effective bioinformatics tools to identify the activity of sgRNA in agronomic species. Therefore, it is necessary to develop a method based on machine learning to identify sgRNA high on-target activity. In this work, we proposed a simple convolutional neural network method to identify sgRNA high on-target activity. Our study used one-hot encoding and k-mers for sequence data conversion and a voting algorithm for constructing the convolutional neural network ensemble model sgRNACNN for the prediction of sgRNA activity. The ensemble model sgRNACNN was used for predictions in four crops: Glycine max, Zea mays, Sorghum bicolor and Triticum aestivum. The accuracy rates of the sgRNACNN model for the four crops were 82.43%, 80.33%, 78.25% and 87.49%, respectively. The experimental results showed that sgRNACNN can identify high on-target activity sgRNAs in agronomic data and can meet the demands of sgRNA activity prediction in agronomy to a certain extent. These results have certain significance for guiding crop gene editing and academic research. The source code and relevant dataset can be found at the following link: https://github.com/nmt315320/sgRNACNN.git.
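
The two sequence encodings and the voting step named above can be illustrated with a short sketch. The function names, the choice of k and the use of a simple majority vote are assumptions for illustration; the network architecture and the actual voting rule of sgRNACNN are not reproduced here.

```python
from collections import Counter
from itertools import product

def one_hot(sequence, alphabet="ACGT"):
    """Encode each base as a 4-element indicator vector."""
    return [[1 if base == letter else 0 for letter in alphabet]
            for base in sequence.upper()]

def kmer_counts(sequence, k=3, alphabet="ACGT"):
    """Count overlapping k-mers, in lexicographic order of all 4**k k-mers."""
    sequence = sequence.upper()
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    return [counts["".join(kmer)] for kmer in product(alphabet, repeat=k)]

def majority_vote(predictions):
    """Return the most frequent label (first-seen label wins a tie)."""
    return Counter(predictions).most_common(1)[0][0]

encoded = one_hot("ACGT")
dimers = kmer_counts("ACGTAC", k=2)
# Hypothetical binary outputs of three ensemble members for one sgRNA.
label = majority_vote([1, 0, 1])
print(encoded, dimers, label)
```

One-hot encoding preserves the position of every base, whereas k-mer counts summarize local composition, which is one reason the two are often used together.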

Source
http://dx.doi.org/10.1007/s11103-020-01102-y
March 2021

Genome-Wide Analysis of LysM-Containing Gene Family in Wheat: Structural and Phylogenetic Analysis during Development and Defense.

Genes (Basel) 2020 Dec 29;12(1). Epub 2020 Dec 29.

Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, No.4 Block 2 North Jianshe Road, Chengdu 610054, China.

The lysin motif (LysM) family comprises a number of defense proteins that play important roles in plant immunity. The LysM family includes LysM-containing receptor-like proteins (LYP) and LysM-containing receptor-like kinases (LYK). LysM generally recognizes the chitin and peptidoglycan derived from bacteria and fungi. Approximately 4000 proteins with the lysin motif (Pfam PF01476) are found in prokaryotes and eukaryotes. Our study identified 57 LysM genes and 60 LysM proteins in wheat and renamed these genes and proteins based on chromosome distribution. According to phylogenetic analysis and analysis of the intron-exon gene structure, the 60 LysM proteins were classified into seven groups. Gene duplication events occurred among the LysM family members during evolution, resulting in an expansion of the LysM gene family. Synteny analysis suggested the characteristics of evolution of the LysM family in wheat and other species. Systematic analysis of these species provided a foundation for studying LysM genes in crop defense. A comprehensive analysis of the expression and cis-elements of LysM gene family members suggested that they play an essential role in defending against plant pathogens. The present study provides an overview of the LysM family in the wheat genome as well as information on systematic, phylogenetic, gene duplication, and intron-exon distribution analyses that will be helpful for future functional analysis of this important protein family, especially in Gramineae species.

Source
http://dx.doi.org/10.3390/genes12010031
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7823900
December 2020

Identification of Sub-Golgi protein localization by use of deep representation learning features.

Bioinformatics 2020 Dec 26. Epub 2020 Dec 26.

Center for Bioinformatics, School of Life Science and Technology, Harbin Institute of Technology, Harbin, 150000, China.

Motivation: The Golgi apparatus has a key functional role in protein biosynthesis within the eukaryotic cell, and its malfunction results in various neurodegenerative diseases. For a better understanding of the Golgi apparatus, it is essential to identify sub-Golgi protein localization. Although some machine learning methods have been used to identify sub-Golgi localization proteins by sequence representation fusion, more accurate sub-Golgi protein identification remains challenging with existing methodology.

Results: We developed a protein sub-Golgi localization identification protocol using deep representation learning features with 107 dimensions. With this protocol, we demonstrated that, instead of the multi-type protein sequence feature representation fusion used in previous state-of-the-art sub-Golgi protein localization classifiers, it is sufficient to exploit only one type of feature representation for more accurate identification of sub-Golgi proteins. Judged by independent testing results on benchmark datasets, our protocol performs generally, reliably and robustly for sub-Golgi protein localization prediction.

Availability: A user-friendly webserver is freely accessible at http://isGP-DRLF.aibiochem.net and the prediction code is accessible at https://github.com/zhibinlv/isGP-DRLF.

Supplementary Information: Supplementary data are available at Bioinformatics online.

Source
http://dx.doi.org/10.1093/bioinformatics/btaa1074
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC8023683
December 2020

Analysis of Cyclin-Dependent Kinase 1 as an Independent Prognostic Factor for Gastric Cancer Based on Statistical Methods.

Front Cell Dev Biol 2020 7;8:620164. Epub 2020 Dec 7.

School of Management, Shenzhen Polytechnic, Shenzhen, China.

Objective: The aim of this study was to investigate the expression of cyclin-dependent kinase 1 (CDK1) in gastric cancer (GC), evaluate its relationship with the clinicopathological features and prognosis of GC, and analyze the advantage of CDK1 as a potential independent prognostic factor for GC.

Methods: The Cancer Genome Atlas (TCGA) data and corresponding clinical features of GC were collected. First, the target gene was selected by combining five topological analysis methods, and its expression in paracancerous and GC tissues was analyzed with the limma package and the Wilcoxon test. Second, the correlation between gene expression and clinical features was analyzed by logistic regression. Finally, survival analysis was carried out using the Kaplan-Meier method. The prognostic value of the gene was evaluated by univariate and multivariate Cox analyses, and its potential biological function was explored by gene set enrichment analysis (GSEA).

Results: CDK1 was selected as one of the most important genes associated with GC. The expression level of CDK1 in GC tissues was significantly higher than that in paracancerous tissues, which was significantly correlated with pathological stage and grade. The survival rate of the CDK1 high expression group was significantly lower than that of the low expression group. CDK1 expression was significantly correlated with overall survival (OS). CDK1 expression was mainly involved in prostate cancer, small cell lung cancer, and GC and was enriched in the WNT signaling pathway and T cell receptor signaling pathway.

Conclusion: CDK1 may serve as an independent prognostic factor for GC. It is also expected to be a new target for molecular targeted therapy of GC.
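
The Kaplan-Meier step in the Methods can be illustrated with a minimal product-limit estimator. The follow-up times and event flags below are invented toy values, not TCGA data, and the function name is hypothetical; a real analysis would normally use an established survival analysis package.

```python
def kaplan_meier(durations, events):
    """Return [(time, survival probability)] at each distinct event time.

    durations: follow-up time per patient
    events:    1 if death was observed at that time, 0 if censored
    """
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for time in sorted(set(durations)):
        deaths = sum(1 for d, e in zip(durations, events) if d == time and e)
        leaving = sum(1 for d in durations if d == time)  # deaths + censored
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((time, survival))
        at_risk -= leaving
    return curve

# Toy follow-up data (months) for a hypothetical high-expression group.
high_expression = kaplan_meier([5, 8, 8, 12, 15], [1, 1, 0, 1, 0])
print(high_expression)
```

Computing one such curve per expression group and comparing them is the usual way a survival difference between high and low expression groups is visualized.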

Source
http://dx.doi.org/10.3389/fcell.2020.620164
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7750425
December 2020

Goals and approaches for each processing step for single-cell RNA sequencing data.

Brief Bioinform 2020 Dec 15. Epub 2020 Dec 15.

Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China.

Single-cell RNA sequencing (scRNA-seq) has enabled researchers to study gene expression at the cellular level. However, due to the extremely low levels of transcripts in a single cell and technical losses during reverse transcription, gene expression at a single-cell resolution is usually noisy and highly dimensional; thus, statistical analyses of single-cell data are a challenge. Although many scRNA-seq data analysis tools are currently available, a gold standard pipeline is not available for all datasets. Therefore, a general understanding of bioinformatics and associated computational issues would facilitate the selection of appropriate tools for a given set of data. In this review, we provide an overview of the goals and most popular computational analysis tools for the quality control, normalization, imputation, feature selection and dimension reduction of scRNA-seq data.
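
As one concrete instance of the normalization step listed above, the sketch below applies library-size scaling followed by a log transform. The target sum of 10,000, the function name and the toy counts are assumptions chosen for illustration; the review covers many alternative normalization tools and this is not a recommendation of one of them.

```python
import math

def normalize_counts(counts_per_cell, target_sum=10_000):
    """Scale each cell to the same total count, then apply log(1 + x)."""
    normalized = []
    for counts in counts_per_cell:
        total = sum(counts)
        if total == 0:
            # An empty cell stays all zeros; quality control would usually drop it.
            normalized.append([0.0 for _ in counts])
            continue
        normalized.append([math.log1p(c / total * target_sum) for c in counts])
    return normalized

# Two toy cells with the same expression profile but different sequencing depth.
cells = [[10, 0, 90], [1, 0, 9]]
normalized = normalize_counts(cells)
print(normalized)
```

After scaling, the two cells receive the same values, showing how this step removes differences that are due only to sequencing depth.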

Source
http://dx.doi.org/10.1093/bib/bbaa314
December 2020

ITP-Pred: an interpretable method for predicting therapeutic peptides with fused features low-dimension representation.

Brief Bioinform 2020 Dec 14. Epub 2020 Dec 14.

University of Electronic Science and Technology of China.

The peptide therapeutics market is providing new opportunities for the biotechnology and pharmaceutical industries. Therefore, identifying therapeutic peptides and exploring their properties are important. Although several studies have proposed different machine learning methods to predict whether peptides are therapeutic peptides, most do not explain the decision factors of the model in detail. In this work, an Interpretable Therapeutic Peptide Prediction (ITP-Pred) model based on efficient feature fusion was developed. First, we proposed three kinds of feature descriptors based on sequence and physicochemical property encodings, namely amino acid composition (AAC), group AAC and coding autocorrelation, and concatenated them to obtain the feature representation of a therapeutic peptide. Then, we input it into the CNN-Bi-directional Long Short-Term Memory (BiLSTM) model to automatically learn to recognize therapeutic peptides. The results of cross-validation and independent verification experiments indicated that ITP-Pred has a higher prediction performance on the benchmark dataset than the other methods compared. Finally, we analyzed the output of the model from two aspects, sequence order and physicochemical properties, mining important features as guidance for the design of better models that can complement existing methods.
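
Two of the descriptors named above, amino acid composition (AAC) and group AAC, can be sketched in a few lines. The five physicochemical groups used here are a commonly used grouping and are an assumption, as are the function names and the example peptide; the coding autocorrelation descriptor and the CNN-BiLSTM model are not shown.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Assumed physicochemical grouping of the 20 standard amino acids.
GROUPS = {
    "aliphatic": "GAVLMI",
    "aromatic": "FYW",
    "positive": "KRH",
    "negative": "DE",
    "uncharged": "STCPNQ",
}

def aac(peptide):
    """Fraction of each of the 20 standard amino acids in the peptide."""
    peptide = peptide.upper()
    return [peptide.count(residue) / len(peptide) for residue in AMINO_ACIDS]

def group_aac(peptide):
    """Fraction of residues falling into each physicochemical group."""
    peptide = peptide.upper()
    return [sum(peptide.count(residue) for residue in members) / len(peptide)
            for members in GROUPS.values()]

peptide = "GLFDIVKKV"  # hypothetical example sequence
features = aac(peptide) + group_aac(peptide)  # concatenated feature vector
print(len(features), features)
```

Concatenating the descriptors, as in the last line, mirrors the feature fusion described in the abstract before the representation is passed to a learning model.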

Source
http://dx.doi.org/10.1093/bib/bbaa367
December 2020

Prediction of bio-sequence modifications and the associations with diseases.

Brief Funct Genomics 2021 Mar;20(1):1-18

Modifications of protein, RNA and DNA play an important role in many biological processes and are related to some diseases. Therefore, accurate identification and comprehensive understanding of protein, RNA and DNA modification sites can promote research on disease treatment and prevention. With the development of sequencing technology, the number of known sequences has continued to increase. In the past decade, many computational tools that can be used to predict protein, RNA and DNA modification sites have been developed. In this review, we comprehensively summarize the modification site predictors for the three types of biological sequences and their associations with diseases. The relevant web server is accessible at http://lab.malab.cn/~acy/PTM_data/, and some sample data on protein, RNA and DNA modification can be downloaded from that website.

Source
http://dx.doi.org/10.1093/bfgp/elaa023
March 2021