Publications by authors named "Shaojun Pei"

8 Publications

Analysis of the Genomic Distance Between Bat Coronavirus RaTG13 and SARS-CoV-2 Reveals Multiple Origins of COVID-19.

Acta Math Sci 2021 19;41(3):1017-1022. Epub 2021 Apr 19.

Department of Mathematical Sciences, Tsinghua University, Beijing, 100084 China.

The severe acute respiratory syndrome COVID-19 was discovered on December 31, 2019 in China. Subsequently, COVID-19 cases were reported in many other countries. However, some positive COVID-19 samples had been reported earlier than those officially accepted by health authorities in other countries, such as France and Italy. Thus, it is of great importance to determine the place where SARS-CoV-2 was first transmitted to humans. To this end, we analyze genomes of SARS-CoV-2 using the k-mer natural vector method and compare the similarities of global SARS-CoV-2 genomes using a new natural metric. Because it is commonly accepted that SARS-CoV-2 originated from bat coronavirus RaTG13, we only need to determine which SARS-CoV-2 genome sequence has the closest distance to bat coronavirus RaTG13 under our natural metric. From our analysis, SARS-CoV-2 most likely already existed in other countries such as France, India, the Netherlands, England and the United States before the outbreak in Wuhan, China.

Source
DOI: http://dx.doi.org/10.1007/s10473-021-0323-x
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC8054123
April 2021

New Genome Sequence Detection via Natural Vector Convex Hull Method.

IEEE/ACM Trans Comput Biol Bioinform 2020 Nov 25;PP. Epub 2020 Nov 25.

It remains challenging to find existing but undiscovered genome sequence mutations or to predict potential genome sequence mutations based on real sequence data. Motivated by this, we develop approaches to detect new, undiscovered genome sequences. Because discovering new genome sequences through biological experiments is resource-intensive, we want to achieve the new genome sequence detection task mathematically. However, little literature tells us how to detect new, undiscovered genome sequence mutations mathematically. We form a new framework based on the natural vector convex hull method that conducts alignment-free sequence analysis. Our two newly developed approaches, Random-permutation Algorithm with Penalty (RAP) and Random-permutation Algorithm with Penalty and COnstrained Search (RAPCOS), use the geometric properties captured by natural vectors. In our experiment, we discover a mathematically new human immunodeficiency virus (HIV) genome sequence using some real HIV genome sequences. Significantly, the proposed methods are applicable to the new genome sequence detection challenge and have many good properties, such as robustness, rapid convergence, and fast computation.

Source
DOI: http://dx.doi.org/10.1109/TCBB.2020.3040706
November 2020

Classification of genomic components and prediction of genes of Begomovirus based on subsequence natural vector and support vector machine.

PeerJ 2020 3;8:e9625. Epub 2020 Aug 3.

Department of Mathematical Sciences, Tsinghua University, Beijing, China.

Background: Begomoviruses are widely distributed and cause devastating diseases in many crops. According to the number of genomic components, a begomovirus is classified as either monopartite or bipartite. Both monopartite and bipartite begomoviruses have the DNA-A component, which encodes all essential proteins for virus functions, while bipartite begomoviruses additionally contain the DNA-B component. Satellite molecules, known as betasatellites, alphasatellites or deltasatellites, sometimes exist in begomoviruses. The genomic components of begomoviruses are therefore complex and varied. Different genomic components have different gene structures and functions. Classifying the components of begomoviruses is important for studying the virus origin and pathogenic mechanism.

Methods: We propose a model combining the Subsequence Natural Vector (SNV) method with the Support Vector Machine (SVM) algorithm to classify the genomic components of begomoviruses and predict the genes of begomoviruses. First, the genome sequence is represented numerically as a vector by the SNV method. Then SVM is applied to the datasets to build the classification model. Finally, recursive feature elimination (RFE) is used to select essential features of the subsequence natural vectors based on feature importance.

Results: In the investigation, DNA-A, DNA-B, and different satellite DNAs are selected to build the model. To evaluate our model, we compare it with the homology-based method BLAST and two machine learning algorithms, Random Forest and Naive Bayes. According to the results, our classification model can classify DNA-A, DNA-B, and different satellites with high accuracy. In particular, we can distinguish whether a DNA-A component is from a monopartite or a bipartite begomovirus. Then, based on the results of classification, we can also predict the genes of different genomic components. According to the selected features, we find that the content of the four nucleotides in the second and tenth segments (approximately 150-350 bp and 1,450-1,650 bp) differs most between DNA-A components of monopartite and bipartite begomoviruses, which may be related to the pre-coat protein (AV2) and the transcriptional activator protein (AC2) genes. Our results advance the understanding of the unique structures of the genomic components of begomoviruses.

Source
DOI: http://dx.doi.org/10.7717/peerj.9625
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7409808
August 2020

A novel numerical representation for proteins: Three-dimensional Chaos Game Representation and its Extended Natural Vector.

Comput Struct Biotechnol J 2020 15;18:1904-1913. Epub 2020 Jul 15.

Department of Mathematical Sciences, Tsinghua University, Beijing, PR China.

Chaos Game Representation (CGR) was first proposed as an image representation method for DNA and has been extended to other biological macromolecules. Compared with the CGR images of DNA, where DNA sequences are converted into a series of points in the unit square, the existing CGR images of proteins are not so elegant in geometry, and the implications of the distribution of points in the CGR image are not so obvious. In this study, by naturally distributing the twenty amino acids on the vertices of a regular dodecahedron, we introduce a novel three-dimensional image representation of protein sequences with the CGR method. We also associate each CGR image with a vector in high-dimensional Euclidean space, called the extended natural vector (ENV), in order to analyze the information contained in the CGR images. Based on the results of protein classification and phylogenetic analysis, our method could serve as a precise method to discover biological relationships between proteins.

Source
DOI: http://dx.doi.org/10.1016/j.csbj.2020.07.004
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7390779
July 2020

Analysis of the Hosts and Transmission Paths of SARS-CoV-2 in the COVID-19 Outbreak.

Genes (Basel) 2020 06 9;11(6). Epub 2020 Jun 9.

Department of Mathematical Sciences, Tsinghua University, Beijing 100084, China.

The severe respiratory disease COVID-19 was initially reported in Wuhan, China, in December 2019, and spread into many provinces from Wuhan. The corresponding pathogen was soon identified as a novel coronavirus named SARS-CoV-2 (formerly, 2019-nCoV). As of 2 May 2020, over 3 million COVID-19 cases had been confirmed and 235,290 deaths had been reported globally, and the numbers are still increasing. It is important to understand the phylogenetic relationship between SARS-CoV-2 and known coronaviruses, and to identify its hosts in order to prevent the next emergency outbreak. In this study, we employ an effective alignment-free approach, the Natural Vector method, to analyze the phylogeny and classify the coronaviruses based on genomic and protein data. Our results show that SARS-CoV-2 is closely related to, but distinct from, the SARS-CoV branch. By analyzing the genetic distances from the SARS-CoV-2 strain to the coronaviruses residing in animal hosts, we establish that the most likely transmission path runs from bats to pangolins to humans.
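
The abstract above does not spell out the Natural Vector construction, so the following is only a minimal sketch of the commonly cited 12-dimensional form for DNA: for each nucleotide, its count, its mean position, and its normalized second central moment of position. The function names `natural_vector` and `euclidean`, the 1-based position convention, and the choice of Euclidean distance are assumptions made for illustration and may differ from what the paper uses.

```python
import math


def natural_vector(seq, alphabet="ACGT"):
    """Sketch of a 12-dimensional natural vector for a DNA string.

    Returns counts, then mean positions, then normalized second moments,
    one value per letter of `alphabet`. Positions are 1-based (assumption).
    """
    seq = seq.upper()
    n = len(seq)
    counts, means, moments = [], [], []
    for letter in alphabet:
        positions = [i + 1 for i, ch in enumerate(seq) if ch == letter]
        n_k = len(positions)
        if n_k == 0:
            # A letter that never occurs contributes zeros.
            counts.append(0)
            means.append(0.0)
            moments.append(0.0)
            continue
        mu_k = sum(positions) / n_k
        # Second central moment, normalized by n_k * n.
        d2_k = sum((p - mu_k) ** 2 for p in positions) / (n_k * n)
        counts.append(n_k)
        means.append(mu_k)
        moments.append(d2_k)
    return counts + means + moments


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
```

Under these assumptions, the distance between two genomes would be computed as `euclidean(natural_vector(a), natural_vector(b))`, with no alignment step involved.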

Source
DOI: http://dx.doi.org/10.3390/genes11060637
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7349679
June 2020

A New Method Based on Coding Sequence Density to Cluster Bacteria.

J Comput Biol 2020 12 11;27(12):1688-1698. Epub 2020 May 11.

Department of Mathematical Sciences, Tsinghua University, Beijing, China.

Bacterial evolution is an important field of study, and biological sequences are often used to construct phylogenetic relationships. Multiple sequence alignment is very time-consuming and cannot deal with large numbers of bacterial genome sequences in a reasonable time. Hence, a new mathematical method, the joining density vector method, is proposed to cluster bacteria; it characterizes the features of the coding sequence (CDS) in a DNA sequence. Coding sequences carry the genetic information used to synthesize proteins. The correspondence between a genomic sequence and its joining density vector (JDV) is one-to-one. The JDV reflects the statistical characteristics of the genomic sequence, and large amounts of data can be analyzed using this new approach. We apply the novel method to phylogenetic analysis on four bacterial data sets at the genus and species levels. The phylogenetic trees show that our new method accurately describes the evolutionary relationships of bacterial coding sequences and is faster than ClustalW and the existing alignment-free methods.

Source
DOI: http://dx.doi.org/10.1089/cmb.2019.0509
December 2020

Fast and accurate genome comparison using genome images: The Extended Natural Vector Method.

Mol Phylogenet Evol 2019 12 26;141:106633. Epub 2019 Sep 26.

Department of Mathematical Sciences, Tsinghua University, Beijing 100084, PR China.

Using numerical methods for genome comparison has always been of importance in bioinformatics. The Chaos Game Representation (CGR) is an effective genome sequence mapping technology, which converts genome sequences to CGR images. To each CGR image, we associate a vector called an Extended Natural Vector (ENV). The ENV is based on the distribution of intensity values. This mapping produces a one-to-one correspondence between CGR images and their ENVs. We define the distance between two DNA sequences as the distance between their associated ENVs. We cluster and classify several datasets including Influenza A viruses, Bacillus genomes, and Conoidea mitochondrial genomes to build their phylogenetic trees. Results show that our method combining ENV with CGR (CGR-ENV) compares favorably in classification accuracy and efficiency against the multiple sequence alignment (MSA) method and other alignment-free methods. The research provides significant insights into the study of phylogeny, evolution, and efficient DNA comparison algorithms for large genomes.
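
As a rough illustration of the CGR step only (not the ENV itself), the sketch below maps a DNA string to points in the unit square using the standard midpoint rule and then bins those points into a small count grid that plays the role of an intensity image. The corner assignment, the starting point (0.5, 0.5), the grid resolution, and the names `cgr_points` and `cgr_image` are assumptions for illustration; the abstract does not state the paper's exact choices.

```python
# Corner assignment for the four nucleotides (one common convention; assumption).
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(seq):
    """Return the CGR trajectory of a DNA string as a list of (x, y) points.

    Each point is the midpoint between the previous point and the corner
    of the current nucleotide. Characters outside A/C/G/T are skipped.
    """
    x, y = 0.5, 0.5  # start at the centre of the unit square
    points = []
    for ch in seq.upper():
        if ch not in CORNERS:
            continue
        cx, cy = CORNERS[ch]
        x, y = (x + cx) / 2, (y + cy) / 2
        points.append((x, y))
    return points


def cgr_image(seq, resolution=4):
    """Bin CGR points into a resolution x resolution grid of counts."""
    grid = [[0] * resolution for _ in range(resolution)]
    for x, y in cgr_points(seq):
        col = min(int(x * resolution), resolution - 1)
        row = min(int(y * resolution), resolution - 1)
        grid[row][col] += 1
    return grid
```

A vector summarizing the distribution of the grid values, such as the ENV described above, would then be computed from the output of `cgr_image`; that step is not reproduced here.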

Source
DOI: http://dx.doi.org/10.1016/j.ympev.2019.106633
December 2019

Large-Scale Genome Comparison Based on Cumulative Fourier Power and Phase Spectra: Central Moment and Covariance Vector.

Comput Struct Biotechnol J 2019 11;17:982-994. Epub 2019 Jul 11.

Department of Mathematical Sciences, Tsinghua University, Beijing, PR China.

Genome comparison is a vital research area of bioinformatics. For large-scale genome comparisons, Multiple Sequence Alignment (MSA) methods have been impractical to use because of their algorithmic complexity. In this study, we propose a novel alignment-free method based on the one-to-one correspondence between a DNA sequence and its complete central moment vector of the cumulative Fourier power and phase spectra. In addition, the covariance between the four nucleotides in the power and phase spectra is included. We use the cumulative Fourier power and phase spectra to define a 28-dimensional vector for each DNA sequence. Euclidean distances between the vectors can measure the dissimilarity between DNA sequences. We perform testing with datasets of different sizes and types including simulated DNA sequences, exon-intron and complete genomes. The results show that our method is more accurate and efficient for performing hierarchical clustering than other alignment-free methods and MSA methods.
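
To make the starting point of this construction concrete, here is a hedged sketch of how a Fourier power spectrum and its running (cumulative) sum can be obtained for one nucleotide of a DNA string. It uses a binary indicator sequence and a direct discrete Fourier transform. The function names are hypothetical, and the sketch stops well short of the paper's 28-dimensional vector: the phase spectra, central moments, covariance terms, and any normalization or treatment of the zero-frequency term are not reproduced because the abstract does not specify them.

```python
import cmath


def indicator(seq, letter):
    """Binary indicator sequence: 1 where `letter` occurs in `seq`, else 0."""
    return [1 if ch == letter else 0 for ch in seq.upper()]


def power_spectrum(signal):
    """Power spectrum |U(k)|^2 of a real sequence via a direct O(n^2) DFT."""
    n = len(signal)
    spectrum = []
    for k in range(n):
        coeff = sum(
            x * cmath.exp(-2j * cmath.pi * k * t / n)
            for t, x in enumerate(signal)
        )
        spectrum.append(abs(coeff) ** 2)
    return spectrum


def cumulative(values):
    """Running sum of a list of numbers."""
    total, out = 0.0, []
    for v in values:
        total += v
        out.append(total)
    return out
```

For example, `cumulative(power_spectrum(indicator(seq, "A")))` gives a cumulative power spectrum for adenine under these assumptions. The direct transform is quadratic in sequence length, so a real large-scale comparison would use an FFT routine instead.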

Source
DOI: http://dx.doi.org/10.1016/j.csbj.2019.07.003
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6661692
July 2019