Publications by authors named "Hanxu Wang"

2 Publications

  • Page 1 of 1

Identifying Plant Pentatricopeptide Repeat Proteins Using a Variable Selection Method.

Front Plant Sci 2021 1;12:506681. Epub 2021 Mar 1.

College of Information and Computer Engineering, Northeast Forestry University, Harbin, China.

Pentatricopeptide repeat (PPR), which is a triangular pentapeptide repeat domain, plays an important role in plant growth. Features extracted from sequences are applicable to PPR protein identification using certain classification methods. However, which components of a multidimensional feature (namely variables) are more effective for protein discrimination has never been discussed. Therefore, we seek to select variables from a multidimensional feature for identifying PPR proteins. A framework of variable selection for identifying PPR proteins is proposed. Samples representing PPR positive proteins and negative ones are equally split into a training and a testing set. Variable importance is regarded as scores derived from an iteration of resampling, training, and scoring step on the training set. A model selection method based on Gaussian mixture model is applied to automatic choice of variables which are effective to identify PPR proteins. Measurements are used on the testing set to show the effectiveness of the selected variables. Certain variables other than the multidimensional feature they belong to do work for discrimination between PPR positive proteins and those negative ones. In addition, the content of methionine may play an important role in predicting PPR proteins.
View Article and Find Full Text PDF

Download full-text PDF

Source
http://dx.doi.org/10.3389/fpls.2021.506681DOI Listing
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7957076PMC
March 2021

ECFS-DEA: an ensemble classifier-based feature selection for differential expression analysis on expression profiles.

BMC Bioinformatics 2020 Feb 5;21(1):43. Epub 2020 Feb 5.

College of Information and Computer Engineering, Northeast Forestry University, No.26 Hexing Road, Harbin, 150040, China.

Background: Various methods for differential expression analysis have been widely used to identify features which best distinguish between different categories of samples. Multiple hypothesis testing may leave out explanatory features, each of which may be composed of individually insignificant variables. Multivariate hypothesis testing holds a non-mainstream position, considering the large computation overhead of large-scale matrix operation. Random forest provides a classification strategy for calculation of variable importance. However, it may be unsuitable for different distributions of samples.

Results: Based on the thought of using an ensemble classifier, we develop a feature selection tool for differential expression analysis on expression profiles (i.e., ECFS-DEA for short). Considering the differences in sample distribution, a graphical user interface is designed to allow the selection of different base classifiers. Inspired by random forest, a common measure which is applicable to any base classifier is proposed for calculation of variable importance. After an interactive selection of a feature on sorted individual variables, a projection heatmap is presented using k-means clustering. ROC curve is also provided, both of which can intuitively demonstrate the effectiveness of the selected feature.

Conclusions: Feature selection through ensemble classifiers helps to select important variables and thus is applicable for different sample distributions. Experiments on simulation and realistic data demonstrate the effectiveness of ECFS-DEA for differential expression analysis on expression profiles. The software is available at http://bio-nefu.com/resource/ecfs-dea.
View Article and Find Full Text PDF

Download full-text PDF

Source
http://dx.doi.org/10.1186/s12859-020-3388-yDOI Listing
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7003361PMC
February 2020