Publications by authors named "Stephen S-T Yau"

52 Publications

Analysis of the Genomic Distance Between Bat Coronavirus RaTG13 and SARS-CoV-2 Reveals Multiple Origins of COVID-19.

Acta Math Sci 2021 19;41(3):1017-1022. Epub 2021 Apr 19.

Department of Mathematical Sciences, Tsinghua University, Beijing, 100084 China.

The severe acute respiratory syndrome COVID-19 was discovered on December 31, 2019 in China. Subsequently, many COVID-19 cases were reported in many other countries. However, some positive COVID-19 samples had been reported earlier than those officially accepted by health authorities in other countries, such as France and Italy. Thus, it is of great importance to determine the place where SARS-CoV-2 was first transmitted to human. To this end, we analyze genomes of SARS-CoV-2 using k-mer natural vector method and compare the similarities of global SARS-CoV-2 genomes by a new natural metric. Because it is commonly accepted that SARS-CoV-2 is originated from bat coronavirus RaTG13, we only need to determine which SARS-CoV-2 genome sequence has the closest distance to bat coronavirus RaTG13 under our natural metric. From our analysis, SARS-CoV-2 most likely has already existed in other countries such as France, India, Netherland, England and United States before the outbreak at Wuhan, China.
View Article and Find Full Text PDF

Download full-text PDF

Source
http://dx.doi.org/10.1007/s10473-021-0323-xDOI Listing
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC8054123PMC
April 2021

Amino acid torsion angles enable prediction of protein fold classification.

Sci Rep 2020 12 10;10(1):21773. Epub 2020 Dec 10.

Department of Mathematical Sciences, Tsinghua University, Beijing, 100084, People's Republic of China.

Protein structure can provide insights that help biologists to predict and understand protein functions and interactions. However, the number of known protein structures has not kept pace with the number of protein sequences determined by high-throughput sequencing. Current techniques used to determine the structure of proteins are complex and require a lot of time to analyze the experimental results, especially for large protein molecules. The limitations of these methods have motivated us to create a new approach for protein structure prediction. Here we describe a new approach to predict of protein structures and structure classes from amino acid sequences. Our prediction model performs well in comparison with previous methods when applied to the structural classification of two CATH datasets with more than 5000 protein domains. The average accuracy is 92.5% for structure classification, which is higher than that of previous research. We also used our model to predict four known protein structures with a single amino acid sequence, while many other existing methods could only obtain one possible structure for a given sequence. The results show that our method provides a new effective and reliable tool for protein structure prediction research.
View Article and Find Full Text PDF

Download full-text PDF

Source
http://dx.doi.org/10.1038/s41598-020-78465-1DOI Listing
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7729947PMC
December 2020

Classification of genomic components and prediction of genes of based on subsequence natural vector and support vector machine.

PeerJ 2020 3;8:e9625. Epub 2020 Aug 3.

Department of Mathematical Sciences, Tsinghua University, Beijing, China.

Background: Begomoviruses are widely distributed and causing devastating diseases in many crops. According to the number of genomic components, a begomovirus is known as either monopartite or bipartite begomovirus. Both the monopartite and bipartite begomoviruses have the DNA-A component which encodes all essential proteins for virus functions, while the bipartite begomoviruses still contain the DNA-B component. The satellite molecules, known as betasatellites, alphasatellites or deltasatellites, sometimes exist in the begomoviruses. So, the genomic components of begomoviruses are complex and varied. Different genomic components have different gene structures and functions. Classifying the components of begomoviruses is important for studying the virus origin and pathogenic mechanism.

Methods: We propose a model combining Subsequence Natural Vector (SNV) method with Support Vector Machine (SVM) algorithm, to classify the genomic components of begomoviruses and predict the genes of begomoviruses. First, the genome sequence is represented as a vector numerically by the SNV method. Then SVM is applied on the datasets to build the classification model. At last, recursive feature elimination (RFE) is used to select essential features of the subsequence natural vectors based on the importance of features.

Results: In the investigation, DNA-A, DNA-B, and different satellite DNAs are selected to build the model. To evaluate our model, the homology-based method BLAST and two machine learning algorithms Random Forest and Naive Bayes method are used to compare with our model. According to the results, our classification model can classify DNA-A, DNA-B, and different satellites with high accuracy. Especially, we can distinguish whether a DNA-A component is from a monopartite or a bipartite begomovirus. Then, based on the results of classification, we can also predict the genes of different genomic components. According to the selected features, we find that the content of four nucleotides in the second and tenth segments (approximately 150-350 bp and 1,450-1,650 bp) are the most different between DNA-A components of monopartite and bipartite begomoviruses, which may be related to the pre-coat protein (AV2) and the transcriptional activator protein (AC2) genes. Our results advance the understanding of the unique structures of the genomic components of begomoviruses.
View Article and Find Full Text PDF

Download full-text PDF

Source
http://dx.doi.org/10.7717/peerj.9625DOI Listing
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7409808PMC
August 2020

A novel numerical representation for proteins: Three-dimensional Chaos Game Representation and its Extended Natural Vector.

Comput Struct Biotechnol J 2020 15;18:1904-1913. Epub 2020 Jul 15.

Department of Mathematical Sciences, Tsinghua University, Beijing, PR China.

Chaos Game Representation (CGR) was first proposed to be an image representation method of DNA and have been extended to the case of other biological macromolecules. Compared with the CGR images of DNA, where DNA sequences are converted into a series of points in the unit square, the existing CGR images of protein are not so elegant in geometry and the implications of the distribution of points in the CGR image are not so obvious. In this study, by naturally distributing the twenty amino acids on the vertices of a regular dodecahedron, we introduce a novel three-dimensional image representation of protein sequences with CGR method. We also associate each CGR image with a vector in high dimensional Euclidean space, called the extended natural vector (ENV), in order to analyze the information contained in the CGR images. Based on the results of protein classification and phylogenetic analysis, our method could serve as a precise method to discover biological relationships between proteins.
View Article and Find Full Text PDF

Download full-text PDF

Source
http://dx.doi.org/10.1016/j.csbj.2020.07.004DOI Listing
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7390779PMC
July 2020

Analysis of the Hosts and Transmission Paths of SARS-CoV-2 in the COVID-19 Outbreak.

Genes (Basel) 2020 06 9;11(6). Epub 2020 Jun 9.

Department of Mathematical Sciences, Tsinghua University, Beijing 100084, China.

The severe respiratory disease COVID-19 was initially reported in Wuhan, China, in December 2019, and spread into many provinces from Wuhan. The corresponding pathogen was soon identified as a novel coronavirus named SARS-CoV-2 (formerly, 2019-nCoV). As of 2 May, 2020, over 3 million COVID-19 cases had been confirmed, and 235,290 deaths had been reported globally, and the numbers are still increasing. It is important to understand the phylogenetic relationship between SARS-CoV-2 and known coronaviruses, and to identify its hosts for preventing the next round of emergency outbreak. In this study, we employ an effective alignment-free approach, the Natural Vector method, to analyze the phylogeny and classify the coronaviruses based on genomic and protein data. Our results show that SARS-CoV-2 is closely related to, but distinct from the SARS-CoV branch. By analyzing the genetic distances from the SARS-CoV-2 strain to the coronaviruses residing in animal hosts, we establish that the most possible transmission path originates from bats to pangolins to humans.
View Article and Find Full Text PDF

Download full-text PDF

Source
http://dx.doi.org/10.3390/genes11060637DOI Listing
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7349679PMC
June 2020

Positional Correlation Natural Vector: A Novel Method for Genome Comparison.

Int J Mol Sci 2020 May 29;21(11). Epub 2020 May 29.

Department of Mathematical Sciences, Tsinghua University, Beijing 100084, China.

Advances in sequencing technology have made large amounts of biological data available. Evolutionary analysis of data such as DNA sequences is highly important in biological studies. As alignment methods are ineffective for analyzing large-scale data due to their inherently high costs, alignment-free methods have recently attracted attention in the field of bioinformatics. In this paper, we introduce a new positional correlation natural vector (PCNV) method that involves converting a DNA sequence into an 18-dimensional numerical feature vector. Using frequency and position correlation to represent the nucleotide distribution, it is possible to obtain a PCNV for a DNA sequence. This new numerical vector design uses six suitable features to characterize the correlation among nucleotide positions in sequences. PCNV is also very easy to compute and can be used for rapid genome comparison. To test our novel method, we performed phylogenetic analysis with several viral and bacterial genome datasets with PCNV. For comparison, an alignment-based method, Bayesian inference, and two alignment-free methods, feature frequency profile and natural vector, were performed using the same datasets. We found that the PCNV technique is fast and accurate when used for phylogenetic analysis and classification of viruses and bacteria.
View Article and Find Full Text PDF

Download full-text PDF

Source
http://dx.doi.org/10.3390/ijms21113859DOI Listing
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7312176PMC
May 2020

A New Method Based on Coding Sequence Density to Cluster Bacteria.

J Comput Biol 2020 12 11;27(12):1688-1698. Epub 2020 May 11.

Department of Mathematical Sciences, Tsinghua University, Beijing, China.

Bacterial evolution is an important study field, biological sequences are often used to construct phylogenetic relationships. Multiple sequence alignment is very time-consuming and cannot deal with large scales of bacterial genome sequences in a reasonable time. Hence, a new mathematical method, joining density vector method, is proposed to cluster bacteria, which characterizes the features of coding sequence (CDS) in a DNA sequence. Coding sequences carry genetic information that can synthesize proteins. The correspondence between a genomic sequence and its joining density vector (JDV) is one-to-one. JDV reflects the statistical characteristics of genomic sequence and large amounts of data can be analyzed using this new approach. We apply the novel method to do phylogenetic analysis on four bacterial data sets at hierarchies of genus and species. The phylogenetic trees prove that our new method accurately describes the evolutionary relationships of bacterial coding sequences, and is faster than ClustalW and the existing alignment-free methods.
View Article and Find Full Text PDF

Download full-text PDF

Source
http://dx.doi.org/10.1089/cmb.2019.0509DOI Listing
December 2020

Splice sites detection using chaos game representation and neural network.

Genomics 2020 03 5;112(2):1847-1852. Epub 2019 Nov 5.

Department of Mathematical Sciences, Tsinghua University, Beijing 100084, P.R. China. Electronic address:

A novel method is proposed to detect the acceptor and donor splice sites using chaos game representation and artificial neural network. In order to achieve high accuracy, inputs to the neural network, or feature vector, shall reflect the true nature of the DNA segments. Therefore it is important to have one-to-one numerical representation, i.e. a feature vector should be able to represent the original data. Chaos game representation (CGR) is an iterative mapping technique that assigns each nucleotide in a DNA sequence to a respective position on the plane in a one-to-one manner. Using CGR, a DNA sequence can be mapped to a numerical sequence that reflects the true nature of the original sequence. In this research, we propose to use CGR as feature input to a neural network to detect splice sites on the NN269 dataset. Computational experiments indicate that this approach gives good accuracy while being simpler than other methods in the literature, with only one neural network component. The code and data for our method can be accessed from this link: https://github.com/thoang3/portfolio/tree/SpliceSites_ANN_CGR.
View Article and Find Full Text PDF

Download full-text PDF

Source
http://dx.doi.org/10.1016/j.ygeno.2019.10.018DOI Listing
March 2020

A novel alignment-free method for HIV-1 subtype classification.

Infect Genet Evol 2020 01 1;77:104080. Epub 2019 Nov 1.

Department of Mathematical Sciences, Tsinghua University, Beijing 100084, PR China. Electronic address:

HIV-1 is the most common and pathogenic strain of human immunodeficiency virus consisting of many subtypes. To study the difference among HIV-1 subtypes in infection, diagnosis and drug design, it is important to identify HIV-1 subtypes from clinical HIV-1 samples. In this work, we propose an effective numeric representation called Subsequence Natural Vector (SNV) to encode HIV-1 sequences. Using the representation, we introduce an improved linear discriminant analysis method to classify HIV-1 viruses correctly. SNV is based on distribution of nucleotides in HIV-1 viral sequences. It not only computes the number of nucleotides, but also describes the position and variance of nucleotides in viruses. To validate our alignment-free method, 6902 complete genomes and 11,668 pol gene sequences of HIV-1 subtypes were collected from the up-to-date Los Alamos HIV database. SNV outperforms the three popular methods, Kameris, Comet and REGA, with almost 100% Sensitivity and Specificity, also with much less time. Our subtyping algorithm especially works better for circulating recombinant forms (CRFs) consisting of a few sequences. Our approach is also powerful to separate unique recombinant forms (URFs) from other subtypes with 100% Sensitivity and Specificity. Moreover, phylogenetic trees based on SNV representation are constructed using full-length HIV-1 genomes and pol genes respectively, where viruses from the same subtype are clustered together correctly.
View Article and Find Full Text PDF

Download full-text PDF

Source
http://dx.doi.org/10.1016/j.meegid.2019.104080DOI Listing
January 2020

Fast and accurate genome comparison using genome images: The Extended Natural Vector Method.

Mol Phylogenet Evol 2019 12 26;141:106633. Epub 2019 Sep 26.

Department of Mathematical Sciences, Tsinghua University, Beijing 100084, PR China. Electronic address:

Using numerical methods for genome comparison has always been of importance in bioinformatics. The Chaos Game Representation (CGR) is an effective genome sequence mapping technology, which converts genome sequences to CGR images. To each CGR image, we associate a vector called an Extended Natural Vector (ENV). The ENV is based on the distribution of intensity values. This mapping produces a one-to-one correspondence between CGR images and their ENVs. We define the distance between two DNA sequences as the distance between their associated ENVs. We cluster and classify several datasets including Influenza A viruses, Bacillus genomes, and Conoidea mitochondrial genomes to build their phylogenetic trees. Results show that our ENV combining CGR method (CGR-ENV) compares favorably in classification accuracy and efficiency against the multiple sequence alignment (MSA) method and other alignment-free methods. The research provides significant insights into the study of phylogeny, evolution, and efficient DNA comparison algorithms for large genomes.
View Article and Find Full Text PDF

Download full-text PDF

Source
http://dx.doi.org/10.1016/j.ympev.2019.106633DOI Listing
December 2019

Large-Scale Genome Comparison Based on Cumulative Fourier Power and Phase Spectra: Central Moment and Covariance Vector.

Comput Struct Biotechnol J 2019 11;17:982-994. Epub 2019 Jul 11.

Department of Mathematical Sciences, Tsinghua University, Beijing, PR China.

Genome comparison is a vital research area of bioinformatics. For large-scale genome comparisons, the Multiple Sequence Alignment (MSA) methods have been impractical to use due to its algorithmic complexity. In this study, we propose a novel alignment-free method based on the one-to-one correspondence between a DNA sequence and its complete central moment vector of the cumulative Fourier power and phase spectra. In addition, the covariance between the four nucleotides in the power and phase spectra is included. We use the cumulative Fourier power and phase spectra to define a 28-dimensional vector for each DNA sequence. Euclidean distances between the vectors can measure the dissimilarity between DNA sequences. We perform testing with datasets of different sizes and types including simulated DNA sequences, exon-intron and complete genomes. The results show that our method is more accurate and efficient for performing hierarchical clustering than other alignment-free methods and MSA methods.
View Article and Find Full Text PDF

Download full-text PDF

Source
http://dx.doi.org/10.1016/j.csbj.2019.07.003DOI Listing
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6661692PMC
July 2019

A Novel Approach to Clustering Genome Sequences Using Inter-nucleotide Covariance.

Front Genet 2019 9;10:234. Epub 2019 Apr 9.

Department of Mathematical Sciences, Tsinghua University, Beijing, China.

Classification of DNA sequences is an important issue in the bioinformatics study, yet most existing methods for phylogenetic analysis including Multiple Sequence Alignment (MSA) are time-consuming and computationally expensive. The alignment-free methods are popular nowadays, whereas the manual intervention in those methods usually decreases the accuracy. Also, the interactions among nucleotides are neglected in most methods. Here we propose a new Accumulated Natural Vector (ANV) method which represents each DNA sequence by a point in ℝ. By calculating the Accumulated Indicator Functions of nucleotides, we can further find an Accumulated Natural Vector for each sequence. This new Accumulated Natural Vector not only can capture the distribution of each nucleotide, but also provide the covariance among nucleotides. Thus global comparison of DNA sequences or genomes can be done easily in ℝ. The tests of ANV of datasets of different sizes and types have proved the accuracy and time-efficiency of the new proposed ANV method.
View Article and Find Full Text PDF

Download full-text PDF

Source
http://dx.doi.org/10.3389/fgene.2019.00234DOI Listing
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6465635PMC
April 2019

Protein Sequence Classification Using Natural Vector and Convex Hull Method.

J Comput Biol 2019 04 14;26(4):315-321. Epub 2019 Feb 14.

Department of Mathematical Sciences, Tsinghua University, Beijing, P.R. China.

Protein kinase C (PKC) is a superfamily of enzymes, which regulate numerous cellular responses. The specific function of PKC protein family is mainly governed by its individual protein domains. However, existing protein sequence classification methods based on sequence alignment and sequence analysis models focused little on the domain analysis. In this study, we introduce a novel protein kinase classification method that considers both domain sequence similarity and whole sequence similarity to quantify the evolutionary distance from a specific protein to a protein family. Using the natural vector method, we establish a 60-dimensional space, where each protein is uniquely represented by a vector. We also define a convex hull, consisting of the natural vectors corresponding to all members of a protein family. The sequence similarity between a protein and a protein family, therefore, can be quantified as the distance between the protein vector and the protein family convex hull. We have applied this method in a PKC sample library and the results showed a higher accuracy of classification compared with other alignment-free methods.
View Article and Find Full Text PDF

Download full-text PDF

Source
http://dx.doi.org/10.1089/cmb.2018.0216DOI Listing
April 2019

A new efficient method for analyzing fungi species using correlations between nucleotides.

BMC Evol Biol 2018 12 27;18(1):200. Epub 2018 Dec 27.

Department of Mathematical Sciences, Tsinghua University, Beijing 100084, People's Republic of China.

Background: In recent years, DNA barcoding has become an important tool for biologists to identify species and understand their natural biodiversity. The complexity of barcode data makes it difficult to analyze quickly and effectively. Manual classification of this data cannot keep up to the rate of increase of available data.

Results: In this study, we propose a new method for DNA barcode classification based on the distribution of nucleotides within the sequence. By adding the covariance of nucleotides to the original natural vector, this augmented 18-dimensional natural vector makes good use of the available information in the DNA sequence. The accurate classification results we obtained demonstrate that this new 18-dimensional natural vector method, together with the random forest classifier algorthm, can serve as a computationally efficient identification tool for DNA barcodes. We performed phylogenetic analysis on the genus Megacollybia to validate our method. We also studied how effective our method was in determining the genetic distance within and between species in our barcoding dataset.

Conclusions: The classification performs well on the fungi barcode dataset with high and robust accuracy. The reasonable phylogenetic trees we obtained further validate our methods. This method is alignment-free and does not depend on any model assumption, and it will become a powerful tool for classification and evolutionary analysis.
View Article and Find Full Text PDF

Download full-text PDF

Source
http://dx.doi.org/10.1186/s12862-018-1330-yDOI Listing
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6307163PMC
December 2018

Convex hull principle for classification and phylogeny of eukaryotic proteins.

Genomics 2019 12 5;111(6):1777-1784. Epub 2018 Dec 5.

Department of Mathematical Sciences, Tsinghua University, Beijing 100084, PR China. Electronic address:

This study quantitatively validates the principle that the biological properties associated with a given genotype are determined by the distribution of amino acids. In order to visualize this central law of molecular biology, each protein was represented by a point in 250-dimensional space based on its amino acid distribution. Proteins from the same family are found to cluster together, leading to the principle that the convex hull surrounding protein points from the same family do not intersect with the convex hulls of other protein families. This principle was verified computationally for all available and reliable protein kinases and human proteins. In addition, we generated 2,328,761 figures to show that the convex hulls of different families were disjoint from each other. The classification performs well with high and robust accuracy (95.75% and 97.5%) together with reasonable phylogenetic trees validate our methods further.
View Article and Find Full Text PDF

Download full-text PDF

Source
http://dx.doi.org/10.1016/j.ygeno.2018.11.033DOI Listing
December 2019

Phylogenetic analysis of protein sequences based on a novel k-mer natural vector method.

Genomics 2019 12 5;111(6):1298-1305. Epub 2018 Sep 5.

Department of Mathematical Sciences, Tsinghua University, Beijing 100084, China. Electronic address:

Based on the k-mer model for protein sequence, a novel k-mer natural vector method is proposed to characterize the features of k-mers in a protein sequence, in which the numbers and distributions of k-mers are considered. It is proved that the relationship between a protein sequence and its k-mer natural vector is one-to-one. Phylogenetic analysis of protein sequences therefore can be easily performed without requiring evolutionary models or human intervention. In addition, there exists no a criterion to choose a suitable k, and k has a great influence on obtaining results as well as computational complexity. In this paper, a compound k-mer natural vector is utilized to quantify each protein sequence. The results gotten from phylogenetic analysis on three protein datasets demonstrate that our new method can precisely describe the evolutionary relationships of proteins, and greatly heighten the computing efficiency.
View Article and Find Full Text PDF

Download full-text PDF

Source
http://dx.doi.org/10.1016/j.ygeno.2018.08.010DOI Listing
December 2019

Convex hull analysis of evolutionary and phylogenetic relationships between biological groups.

J Theor Biol 2018 11 27;456:34-40. Epub 2018 Jul 27.

Department of Mathematical Sciences, Tsinghua University, Beijing 100084, P.R. China. Electronic address:

Comparing DNA and protein sequence groups plays an important role in biological evolutionary relationship research. Despite many methods available for sequence comparison, only a few can be used for group comparison. In this study, we propose a novel approach using convex hulls. We use statistical information contained within the sequences to represent each sequence as a point in high dimensional space. We find that the points belonging to one biological group are located in a different region of space than points belonging to other biological groups. To be more precise, the convex hull of the points from one group are disjoint from the convex hulls of points from other groups. This finding allows us to do phylogenetic analysis for groups in an efficient way. Five different theorems are presented for checking whether two convex hulls intersect or are disjoint. Test results for datasets related to HRV, HPV, Ebolavirus, PKC and protein phosphatase domains demonstrate that our method performs well and provides a new tool for studying group phylogeny. More significantly, the convex analysis presents a new way to search for sequences belonging to a biological group by examining points within the group's convex hull.
View Article and Find Full Text PDF

Download full-text PDF

Source
http://dx.doi.org/10.1016/j.jtbi.2018.07.035DOI Listing
November 2018

A new method to cluster genomes based on cumulative Fourier power spectrum.

Gene 2018 Oct 20;673:239-250. Epub 2018 Jun 20.

Department of Mathematical Sciences, Tsinghua University, Beijing 100084, China. Electronic address:

Analyzing phylogenetic relationships using mathematical methods has always been of importance in bioinformatics. Quantitative research may interpret the raw biological data in a precise way. Multiple Sequence Alignment (MSA) is used frequently to analyze biological evolutions, but is very time-consuming. When the scale of data is large, alignment methods cannot finish calculation in reasonable time. Therefore, we present a new method using moments of cumulative Fourier power spectrum in clustering the DNA sequences. Each sequence is translated into a vector in Euclidean space. Distances between the vectors can reflect the relationships between sequences. The mapping between the spectra and moment vector is one-to-one, which means that no information is lost in the power spectra during the calculation. We cluster and classify several datasets including Influenza A, primates, and human rhinovirus (HRV) datasets to build up the phylogenetic trees. Results show that the new proposed cumulative Fourier power spectrum is much faster and more accurately than MSA and another alignment-free method known as k-mer. The research provides us new insights in the study of phylogeny, evolution, and efficient DNA comparison algorithms for large genomes. The computer programs of the cumulative Fourier power spectrum are available at GitHub (https://github.com/YaulabTsinghua/cumulative-Fourier-power-spectrum).
View Article and Find Full Text PDF

Download full-text PDF

Source
http://dx.doi.org/10.1016/j.gene.2018.06.042DOI Listing
October 2018

Establishing the phylogeny of Prochlorococcus with a new alignment-free method.

Ecol Evol 2017 12 15;7(24):11057-11065. Epub 2017 Nov 15.

Department of Mathematical Sciences Tsinghua University Beijing China.

marinus, one of the most abundant marine cyanobacteria in the global ocean, is classified into low-light (LL) and high-light (HL) adapted ecotypes. These two adapted ecotypes differ in their ecophysiological characteristics, especially whether adapted for growth at high-light or low-light intensities. However, some evolutionary relationships of phylogeny remain to be resolved, such as whether the strains SS120 and MIT9211 form a monophyletic group. We use the Natural Vector (NV) method to represent the sequence in order to identify the phylogeny of the . The natural vector method is alignment free without any model assumptions. This study added the covariances of amino acids in protein sequence to the natural vector method. Based on these new natural vectors, we can compute the Hausdorff distance between the two clades which represents the dissimilarity. This method enables us to systematically analyze both the dataset of ribosomal proteomes and the dataset of 16s-23s rRNA sequences in order to reconstruct the phylogeny of . Furthermore, we apply classification to inspect the relationship of SS120 and MIT9211. From the reconstructed phylogenetic trees and classification results, we may conclude that the SS120 does not cluster with MIT9211. This study demonstrates a new method for performing phylogenetic analysis. The results confirm that these two strains do not form a monophyletic clade in the phylogeny of .
View Article and Find Full Text PDF

Download full-text PDF

Source
http://dx.doi.org/10.1002/ece3.3535DOI Listing
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5743538PMC
December 2017

A novel fast vector method for genetic sequence comparison.

Sci Rep 2017 09 22;7(1):12226. Epub 2017 Sep 22.

Department of Mathematical Sciences, Tsinghua University, Beijing, 100084, PR China.

With sharp increasing in biological sequences, the traditional sequence alignment methods become unsuitable and infeasible. It motivates a surge of fast alignment-free techniques for sequence analysis. Among these methods, many sorts of feature vector methods are established and applied to reconstruction of species phylogeny. The vectors basically consist of some typical numerical features for certain biological problems. The features may come from the primary sequences, secondary or three dimensional structures of macromolecules. In this study, we propose a novel numerical vector based on only primary sequences of organism to build their phylogeny. Three chemical and physical properties of primary sequences: purine, pyrimidine and keto are also incorporated to the vector. Using each property, we convert the nucleotide sequence into a new sequence consisting of only two kinds of letters. Therefore, three sequences are constructed according to the three properties. For each letter of each sequence we calculate the number of the letter, the average position of the letter and the variation of the position of the letter appearing in the sequence. Tested on several datasets related to mammals, viruses and bacteria, this new tool is fast in speed and accurate for inferring the phylogeny of organisms.
View Article and Find Full Text PDF

Download full-text PDF

Source
http://dx.doi.org/10.1038/s41598-017-12493-2DOI Listing
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5610321PMC
September 2017

A novel alignment-free vector method to cluster protein sequences.

J Theor Biol 2017 08 3;427:41-52. Epub 2017 Jun 3.

Department of Mathematical Sciences, Tsinghua University, Beijing 100084, PR China. Electronic address:

Classification of protein are crucial topics in biology. The number of protein sequences stored in databases increases sharply in the past decade. Traditionally, comparison of protein sequences is usually carried out through multiple sequence alignment methods. However, these methods may be unsuitable for clustering of protein sequences when gene rearrangements occur such as in viral genomes. The computation is also very time-consuming for large datasets with long genomes. In this paper, based on three important biochemical properties of amino acids: the hydropathy index, polar requirement and chemical composition of the side chain, we propose a 24 dimensional feature vector describing the composition of amino acids in protein sequences. Our method not only utilizes the chemical properties of amino acids but also counts on their numbers and positions. The results on beta-globin, mammals, and three virus datasets show that this new tool is fast and accurate for classifying proteins and inferring the phylogeny of organisms.
View Article and Find Full Text PDF

Download full-text PDF

Source
http://dx.doi.org/10.1016/j.jtbi.2017.06.002DOI Listing
August 2017

A coevolution analysis for identifying protein-protein interactions by Fourier transform.

PLoS One 2017 21;12(4):e0174862. Epub 2017 Apr 21.

Department of Mathematical Sciences, Tsinghua University, Beijing 100084, China.

Protein-protein interactions (PPIs) play key roles in life processes, such as signal transduction, transcription regulations, and immune response, etc. Identification of PPIs enables better understanding of the functional networks within a cell. Common experimental methods for identifying PPIs are time consuming and expensive. However, recent developments in computational approaches for inferring PPIs from protein sequences based on coevolution theory avoid these problems. In the coevolution theory model, interacted proteins may show coevolutionary mutations and have similar phylogenetic trees. The existing coevolution methods depend on multiple sequence alignments (MSA); however, the MSA-based coevolution methods often produce high false positive interactions. In this paper, we present a computational method using an alignment-free approach to accurately detect PPIs and reduce false positives. In the method, protein sequences are numerically represented by biochemical properties of amino acids, which reflect the structural and functional differences of proteins. Fourier transform is applied to the numerical representation of protein sequences to capture the dissimilarities of protein sequences in biophysical context. The method is assessed for predicting PPIs in Ebola virus. The results indicate strong coevolution between the protein pairs (NP-VP24, NP-VP30, NP-VP40, VP24-VP30, VP24-VP40, and VP30-VP40). The method is also validated for PPIs in influenza and E.coli genomes. Since our method can reduce false positive and increase the specificity of PPI prediction, it offers an effective tool to understand mechanisms of disease pathogens and find potential targets for drug design. The Python programs in this study are available to public at URL (https://github.com/cyinbox/PPI).
View Article and Find Full Text PDF

Download full-text PDF

Source
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0174862PLOS
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5400233PMC
September 2017

An information-based network approach for protein classification.

PLoS One 2017 28;12(3):e0174386. Epub 2017 Mar 28.

Department of Mathematical Sciences, Tsinghua University, Beijing, China.

Protein classification is one of the critical problems in bioinformatics. Early studies used geometric distances and polygenetic-tree to classify proteins. These methods use binary trees to present protein classification. In this paper, we propose a new protein classification method, whereby theories of information and networks are used to classify the multivariate relationships of proteins. In this study, protein universe is modeled as an undirected network, where proteins are classified according to their connections. Our method is unsupervised, multivariate, and alignment-free. It can be applied to the classification of both protein sequences and structures. Nine examples are used to demonstrate the efficiency of our new method.
View Article and Find Full Text PDF

Download full-text PDF

Source
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0174386PLOS
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5370107PMC
September 2017

Zika and Flaviviruses Phylogeny Based on the Alignment-Free Natural Vector Method.

DNA Cell Biol 2017 Feb 15;36(2):109-116. Epub 2016 Dec 15.

1 Department of Mathematical Sciences, Tsinghua University , Beijing, People's Republic of China .

Zika virus (ZIKV) is a mosquito-borne flavivirus. It was first isolated from Uganda in 1947 and has become an emergent event since 2007. However, because of the inconsistency of alignment methods, the evolution of ZIKV remains poorly understood. In this study, we first use the complete protein and an alignment-free method to build a phylogenetic tree of 87 Zika strains in which Asian, East African, and West African lineages are characterized. We also use the NS5 protein to construct the genetic relationship among 44 Zika strains. For the first time, these strains are divided into two clades: African 1 and African 2. This result suggests that ZIKV originates from Africa, then spread to Asia, Pacific islands, and throughout the Americas. We also perform the phylogeny analysis for 53 viruses in genus Flavivirus to which ZIKV belongs using complete proteins. Our conclusion is consistent with the classification by the hosts and transmission vectors.
View Article and Find Full Text PDF

Download full-text PDF

Source
http://dx.doi.org/10.1089/dna.2016.3532DOI Listing
February 2017

Numerical encoding of DNA sequences by chaos game representation with application in similarity comparison.

Genomics 2016 10 15;108(3-4):134-142. Epub 2016 Aug 15.

Department of Mathematical Sciences, Tsinghua University, Beijing 100084, China. Electronic address:

Numerical encoding plays an important role in DNA sequence analysis via computational methods, in which numerical values are associated with corresponding symbolic characters. After numerical representation, digital signal processing methods can be exploited to analyze DNA sequences. To reflect the biological properties of the original sequence, it is vital that the representation is one-to-one. Chaos Game Representation (CGR) is an iterative mapping technique that assigns each nucleotide in a DNA sequence to a respective position on the plane that allows the depiction of the DNA sequence in the form of image. Using CGR, a biological sequence can be transformed one-to-one to a numerical sequence that preserves the main features of the original sequence. In this research, we propose to encode DNA sequences by considering 2D CGR coordinates as complex numbers, and apply digital signal processing methods to analyze their evolutionary relationship. Computational experiments indicate that this approach gives comparable results to the state-of-the-art multiple sequence alignment method, Clustal Omega, and is significantly faster. The MATLAB code for our method can be accessed from: www.mathworks.com/matlabcentral/fileexchange/57152.
View Article and Find Full Text PDF

Download full-text PDF

Source
http://dx.doi.org/10.1016/j.ygeno.2016.08.002DOI Listing
October 2016

Virus classification in 60-dimensional protein space.

Mol Phylogenet Evol 2016 06 15;99:53-62. Epub 2016 Mar 15.

Department of Mathematical Sciences, Tsinghua University, Beijing 100084, PR China. Electronic address:

Due to vast sequence divergence among different viral groups, sequence alignment is not directly applicable to genome-wide comparative analysis of viruses. More and more attention has been paid to alignment-free methods for whole genome comparison and phylogenetic tree reconstruction. Among alignment-free methods, the recently proposed "Natural Vector (NV) representation" has successfully been used to study the phylogeny of multi-segmented viruses based on a 12-dimensional genome space derived from the nucleotide sequence structure. But the preference of proteomes over genomes for the determination of viral phylogeny was not deeply investigated. As the translated products of genes, proteins directly form the shape of viral structure and are vital for all metabolic pathways. In this study, using the NV representation of a protein sequence along with the Hausdorff distance suitable to compare point sets, we construct a 60-dimensional protein space to analyze the evolutionary relationships of 4021 viruses by whole-proteomes in the current NCBI Reference Sequence Database (RefSeq). We also take advantage of the previously developed natural graphical representation to recover viral phylogeny. Our results demonstrate that the proposed method is efficient and accurate for classifying viruses. The accuracy rates of our predictions such as for Baltimore II viruses are as high as 95.9% for family labels, 95.7% for subfamily labels and 96.5% for genus labels. Finally, we discover that proteomes lead to better viral classification when reliable protein sequences are abundant. In other cases, the accuracy rates using proteomes are still comparable to that of genomes.
View Article and Find Full Text PDF

Download full-text PDF

Source
http://dx.doi.org/10.1016/j.ympev.2016.03.009DOI Listing
June 2016

A new method for studying the evolutionary origin of the SAR11 clade marine bacteria.

Mol Phylogenet Evol 2016 May 27;98:271-9. Epub 2016 Feb 27.

Department of Mathematical Sciences, Tsinghua University, Beijing 100084, PR China. Electronic address:

The free-living SAR11 clade is a globally abundant group of oceanic Alphaproteobacteria, with small genome sizes and rich genomic A+T content. However, the taxonomy of SAR11 has become controversial recently. Some researchers argue that the position of SAR11 is a sister group to Rickettsiales. Other researchers advocate that SAR11 is located within free-living lineages of Alphaproteobacteria. Here, we use the natural vector representation method to identify the evolutionary origin of the SAR11 clade. This alignment-free method does not depend on any model assumptions. With this approach, the correspondence between proteome sequences and their natural vectors is one-to-one. After fixing a set of proteins, each bacterium is represented by a set of vectors. The Hausdorff distance is then used to compute the dissimilarity distance between two bacteria. The phylogenetic tree can be reconstructed based on these distances. Using our method, we systematically analyze four data sets of alphaproteobacterial proteomes in order to reconstruct the phylogeny of Alphaproteobacteria. From this we can see that the phylogenetic position of the SAR11 group is within a group of other free-living lineages of Alphaproteobacteria.
View Article and Find Full Text PDF

Download full-text PDF

Source
http://dx.doi.org/10.1016/j.ympev.2016.02.015DOI Listing
May 2016

Two Dimensional Yau-Hausdorff Distance with Applications on Comparison of DNA and Protein Sequences.

PLoS One 2015 18;10(9):e0136577. Epub 2015 Sep 18.

Department of Mathematical Sciences, Tsinghua University, Beijing 100084, China.

Comparing DNA or protein sequences plays an important role in the functional analysis of genomes. Despite many methods available for sequences comparison, few methods retain the information content of sequences. We propose a new approach, the Yau-Hausdorff method, which considers all translations and rotations when seeking the best match of graphical curves of DNA or protein sequences. The complexity of this method is lower than that of any other two dimensional minimum Hausdorff algorithm. The Yau-Hausdorff method can be used for measuring the similarity of DNA sequences based on two important tools: the Yau-Hausdorff distance and graphical representation of DNA sequences. The graphical representations of DNA sequences conserve all sequence information and the Yau-Hausdorff distance is mathematically proved as a true metric. Therefore, the proposed distance can preciously measure the similarity of DNA sequences. The phylogenetic analyses of DNA sequences by the Yau-Hausdorff distance show the accuracy and stability of our approach in similarity comparison of DNA or protein sequences. This study demonstrates that Yau-Hausdorff distance is a natural metric for DNA and protein sequences with high level of stability. The approach can be also applied to similarity analysis of protein sequences by graphic representations, as well as general two dimensional shape matching.
View Article and Find Full Text PDF

Download full-text PDF

Source
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0136577PLOS
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4575136PMC
May 2016

An improved model for whole genome phylogenetic analysis by Fourier transform.

J Theor Biol 2015 Oct 4;382:99-110. Epub 2015 Jul 4.

Department of Mathematical Sciences, Tsinghua University, Beijing 100084, China. Electronic address:

DNA sequence similarity comparison is one of the major steps in computational phylogenetic studies. The sequence comparison of closely related DNA sequences and genomes is usually performed by multiple sequence alignments (MSA). While the MSA method is accurate for some types of sequences, it may produce incorrect results when DNA sequences undergone rearrangements as in many bacterial and viral genomes. It is also limited by its computational complexity for comparing large volumes of data. Previously, we proposed an alignment-free method that exploits the full information contents of DNA sequences by Discrete Fourier Transform (DFT), but still with some limitations. Here, we present a significantly improved method for the similarity comparison of DNA sequences by DFT. In this method, we map DNA sequences into 2-dimensional (2D) numerical sequences and then apply DFT to transform the 2D numerical sequences into frequency domain. In the 2D mapping, the nucleotide composition of a DNA sequence is a determinant factor and the 2D mapping reduces the nucleotide composition bias in distance measure, and thus improving the similarity measure of DNA sequences. To compare the DFT power spectra of DNA sequences with different lengths, we propose an improved even scaling algorithm to extend shorter DFT power spectra to the longest length of the underlying sequences. After the DFT power spectra are evenly scaled, the spectra are in the same dimensionality of the Fourier frequency space, then the Euclidean distances of full Fourier power spectra of the DNA sequences are used as the dissimilarity metrics. The improved DFT method, with increased computational performance by 2D numerical representation, can be applicable to any DNA sequences of different length ranges. We assess the accuracy of the improved DFT similarity measure in hierarchical clustering of different DNA sequences including simulated and real datasets. The method yields accurate and reliable phylogenetic trees and demonstrates that the improved DFT dissimilarity measure is an efficient and effective similarity measure of DNA sequences. Due to its high efficiency and accuracy, the proposed DFT similarity measure is successfully applied on phylogenetic analysis for individual genes and large whole bacterial genomes.
View Article and Find Full Text PDF

Download full-text PDF

Source
http://dx.doi.org/10.1016/j.jtbi.2015.06.033DOI Listing
October 2015

Ebolavirus classification based on natural vectors.

DNA Cell Biol 2015 Jun 24;34(6):418-28. Epub 2015 Mar 24.

3Department of Mathematical Sciences, Tsinghua University, Beijing, People's Republic of China.

According to the WHO, ebolaviruses have resulted in 8818 human deaths in West Africa as of January 2015. To better understand the evolutionary relationship of the ebolaviruses and infer virulence from the relationship, we applied the alignment-free natural vector method to classify the newest ebolaviruses. The dataset includes three new Guinea viruses as well as 99 viruses from Sierra Leone. For the viruses of the family of Filoviridae, both genus label classification and species label classification achieve an accuracy rate of 100%. We represented the relationships among Filoviridae viruses by Unweighted Pair Group Method with Arithmetic Mean (UPGMA) phylogenetic trees and found that the filoviruses can be separated well by three genera. We performed the phylogenetic analysis on the relationship among different species of Ebolavirus by their coding-complete genomes and seven viral protein genes (glycoprotein [GP], nucleoprotein [NP], VP24, VP30, VP35, VP40, and RNA polymerase [L]). The topology of the phylogenetic tree by the viral protein VP24 shows consistency with the variations of virulence of ebolaviruses. The result suggests that VP24 be a pharmaceutical target for treating or preventing ebolaviruses.
View Article and Find Full Text PDF

Download full-text PDF

Source
http://dx.doi.org/10.1089/dna.2014.2678DOI Listing
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4484716PMC
June 2015