**8** Publications

Acta Math Sci 2021 19;41(3):1017-1022. Epub 2021 Apr 19.

Department of Mathematical Sciences, Tsinghua University, Beijing, 100084 China.

The severe acute respiratory syndrome COVID-19 was discovered on December 31, 2019 in China. Subsequently, COVID-19 cases were reported in many other countries. However, some positive COVID-19 samples had been reported earlier than those officially accepted by health authorities in other countries, such as France and Italy. Thus, it is of great importance to determine the place where SARS-CoV-2 was first transmitted to humans. To this end, we analyze SARS-CoV-2 genomes using the k-mer natural vector method and compare the similarities of global SARS-CoV-2 genomes using a new natural metric. Because it is commonly accepted that SARS-CoV-2 originated from bat coronavirus RaTG13, we only need to determine which SARS-CoV-2 genome sequence has the closest distance to bat coronavirus RaTG13 under our natural metric. According to our analysis, SARS-CoV-2 most likely already existed in other countries, such as France, India, the Netherlands, England, and the United States, before the outbreak in Wuhan, China.
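To make the idea of a k-mer natural vector concrete, here is a minimal Python sketch. It is not the authors' implementation. It assumes that each k-mer contributes its count, mean position, and a second central moment normalized by the count times the number of k-mer positions. The abstract does not define its "natural metric", so plain Euclidean distance is used here as a stand-in. The function names `kmer_natural_vector` and `closest_to_reference` are hypothetical.

```python
from collections import defaultdict
from itertools import product
from math import sqrt


def kmer_natural_vector(seq, k=1, alphabet="ACGT"):
    """Return a k-mer natural vector of seq as a flat list of floats.

    For every k-mer w over the alphabet, three numbers are stored:
    the count n_w, the mean 1-based position mu_w, and the normalized
    second central moment sum((p - mu_w)^2) / (n_w * m), where m is
    the number of k-mer positions. The normalization is an assumption.
    K-mers that do not occur contribute (0, 0, 0).
    """
    seq = seq.upper()
    m = len(seq) - k + 1
    positions = defaultdict(list)
    for i in range(m):
        positions[seq[i:i + k]].append(i + 1)  # 1-based start position

    vector = []
    for kmer in map("".join, product(alphabet, repeat=k)):
        pos = positions.get(kmer, [])
        n = len(pos)
        if n == 0:
            vector += [0.0, 0.0, 0.0]
            continue
        mu = sum(pos) / n
        d2 = sum((p - mu) ** 2 for p in pos) / (n * m)
        vector += [float(n), mu, d2]
    return vector


def euclidean(u, v):
    """Euclidean distance, used here only as a stand-in metric."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def closest_to_reference(reference, candidates, k=3):
    """Return the name of the candidate sequence nearest to the reference.

    candidates maps a name to a sequence string.
    """
    ref_vec = kmer_natural_vector(reference, k)
    return min(
        candidates,
        key=lambda name: euclidean(ref_vec, kmer_natural_vector(candidates[name], k)),
    )
```

With `k=1` this reduces to a per-nucleotide vector of counts, mean positions, and second moments. Larger `k` gives a vector of length `3 * 4**k`, which captures more local sequence context at the cost of dimensionality.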

| Download full-text PDF | Source |
|---|---|
| http://dx.doi.org/10.1007/s10473-021-0323-x | DOI Listing |
| http://www.ncbi.nlm.nih.gov/pmc/articles/PMC8054123 | PMC |

April 2021

IEEE/ACM Trans Comput Biol Bioinform 2020 Nov 25;PP. Epub 2020 Nov 25.

It remains challenging to find existing but undiscovered genome sequence mutations or to predict potential genome sequence mutations based on real sequence data. Motivated by this, we develop approaches to detect new, undiscovered genome sequences. Because discovering new genome sequences through biological experiments is resource-intensive, we want to accomplish the new genome sequence detection task mathematically. However, little of the existing literature addresses how to detect new, undiscovered genome sequence mutations mathematically. We form a new framework based on the natural vector convex hull method, which conducts alignment-free sequence analysis. Our two newly developed approaches, Random-permutation Algorithm with Penalty (RAP) and Random-permutation Algorithm with Penalty and COnstrained Search (RAPCOS), use the geometric properties captured by natural vectors. In our experiment, we discover a mathematically new human immunodeficiency virus (HIV) genome sequence using some real HIV genome sequences. Significantly, the proposed methods are applicable to the new genome sequence detection challenge and have many good properties, such as robustness, rapid convergence, and fast computation.

| Download full-text PDF | Source |
|---|---|
| http://dx.doi.org/10.1109/TCBB.2020.3040706 | DOI Listing |

November 2020

PeerJ 2020 3;8:e9625. Epub 2020 Aug 3.

Department of Mathematical Sciences, Tsinghua University, Beijing, China.

| Download full-text PDF | Source |
|---|---|
| http://dx.doi.org/10.7717/peerj.9625 | DOI Listing |
| http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7409808 | PMC |

August 2020

Comput Struct Biotechnol J 2020 15;18:1904-1913. Epub 2020 Jul 15.

Department of Mathematical Sciences, Tsinghua University, Beijing, PR China.

Chaos Game Representation (CGR) was first proposed as an image representation method for DNA and has since been extended to other biological macromolecules. Compared with the CGR images of DNA, where DNA sequences are converted into a series of points in the unit square, the existing CGR images of proteins are not as elegant geometrically, and the implications of the distribution of points in the CGR image are not as obvious. In this study, by naturally distributing the twenty amino acids on the vertices of a regular dodecahedron, we introduce a novel three-dimensional image representation of protein sequences using the CGR method. We also associate each CGR image with a vector in high-dimensional Euclidean space, called the extended natural vector (ENV), in order to analyze the information contained in the CGR images. Based on the results of protein classification and phylogenetic analysis, our method could serve as a precise method to discover biological relationships between proteins.
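The three-dimensional construction can be sketched as follows. A regular dodecahedron has exactly twenty vertices, so each amino acid can own one. The abstract does not say which residue goes on which vertex or what contraction ratio is used, so the alphabetical assignment in `VERTEX_OF` and the midpoint rule below are assumptions made purely for illustration, and all names are hypothetical.

```python
from math import sqrt

PHI = (1 + sqrt(5)) / 2  # golden ratio


def dodecahedron_vertices():
    """Return the 20 vertices of a regular dodecahedron centered at the origin."""
    # Eight cube vertices (+-1, +-1, +-1).
    verts = [(x, y, z) for x in (-1, 1) for y in (-1, 1) for z in (-1, 1)]
    # Twelve more: cyclic permutations of (0, +-1/phi, +-phi).
    for a in (-1 / PHI, 1 / PHI):
        for b in (-PHI, PHI):
            verts += [(0, a, b), (a, b, 0), (b, 0, a)]
    return verts


# Hypothetical assignment: residues in alphabetical order of their
# one-letter codes, vertices in the order generated above.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
VERTEX_OF = dict(zip(AMINO_ACIDS, dodecahedron_vertices()))


def protein_cgr(sequence):
    """Map a protein sequence to a list of 3D chaos-game points.

    Starting at the origin, each residue moves the current point
    halfway toward that residue's vertex (assumed contraction ratio 1/2).
    Raises KeyError for residues outside the 20 standard amino acids.
    """
    point = (0.0, 0.0, 0.0)
    points = []
    for residue in sequence.upper():
        vertex = VERTEX_OF[residue]
        point = tuple((p + v) / 2 for p, v in zip(point, vertex))
        points.append(point)
    return points
```

Calling `protein_cgr("MKV")` returns three points inside the dodecahedron, one per residue. A vector such as the ENV would then be computed from this point cloud, which the sketch does not attempt.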

| Download full-text PDF | Source |
|---|---|
| http://dx.doi.org/10.1016/j.csbj.2020.07.004 | DOI Listing |
| http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7390779 | PMC |

July 2020

Genes (Basel) 2020 06 9;11(6). Epub 2020 Jun 9.

Department of Mathematical Sciences, Tsinghua University, Beijing 100084, China.

The severe respiratory disease COVID-19 was initially reported in Wuhan, China, in December 2019, and spread from Wuhan into many provinces. The corresponding pathogen was soon identified as a novel coronavirus named SARS-CoV-2 (formerly 2019-nCoV). As of 2 May 2020, over 3 million COVID-19 cases had been confirmed and 235,290 deaths had been reported globally, and the numbers are still increasing. It is important to understand the phylogenetic relationship between SARS-CoV-2 and known coronaviruses, and to identify its hosts in order to prevent the next outbreak. In this study, we employ an effective alignment-free approach, the Natural Vector method, to analyze the phylogeny and classify the coronaviruses based on genomic and protein data. Our results show that SARS-CoV-2 is closely related to, but distinct from, the SARS-CoV branch. By analyzing the genetic distances from the SARS-CoV-2 strain to the coronaviruses residing in animal hosts, we establish that the most likely transmission path runs from bats to pangolins to humans.

| Download full-text PDF | Source |
|---|---|
| http://dx.doi.org/10.3390/genes11060637 | DOI Listing |
| http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7349679 | PMC |

June 2020

J Comput Biol 2020 12 11;27(12):1688-1698. Epub 2020 May 11.

Department of Mathematical Sciences, Tsinghua University, Beijing, China.

Bacterial evolution is an important field of study, and biological sequences are often used to construct phylogenetic relationships. Multiple sequence alignment is very time-consuming and cannot handle large numbers of bacterial genome sequences in a reasonable time. Hence, a new mathematical method, the joining density vector method, is proposed to cluster bacteria; it characterizes the features of the coding sequences (CDS) in a DNA sequence. Coding sequences carry the genetic information used to synthesize proteins. The correspondence between a genomic sequence and its joining density vector (JDV) is one-to-one. The JDV reflects the statistical characteristics of a genomic sequence, and large amounts of data can be analyzed using this new approach. We apply the novel method to phylogenetic analysis of four bacterial data sets at the genus and species levels. The phylogenetic trees show that our new method accurately describes the evolutionary relationships of bacterial coding sequences and is faster than ClustalW and the existing alignment-free methods.

| Download full-text PDF | Source |
|---|---|
| http://dx.doi.org/10.1089/cmb.2019.0509 | DOI Listing |

December 2020

Mol Phylogenet Evol 2019 12 26;141:106633. Epub 2019 Sep 26.

Department of Mathematical Sciences, Tsinghua University, Beijing 100084, PR China.

Numerical methods for genome comparison have always been important in bioinformatics. The Chaos Game Representation (CGR) is an effective genome sequence mapping technique that converts genome sequences to CGR images. To each CGR image, we associate a vector called an Extended Natural Vector (ENV). The ENV is based on the distribution of intensity values. This mapping produces a one-to-one correspondence between CGR images and their ENVs. We define the distance between two DNA sequences as the distance between their associated ENVs. We cluster and classify several datasets, including Influenza A viruses, Bacillus genomes, and Conoidea mitochondrial genomes, to build their phylogenetic trees. Results show that our method combining the ENV with CGR (CGR-ENV) compares favorably in classification accuracy and efficiency against the multiple sequence alignment (MSA) method and other alignment-free methods. The research provides significant insights into the study of phylogeny, evolution, and efficient DNA comparison algorithms for large genomes.
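As an illustration of the first step, the sketch below builds a CGR image for a DNA sequence. It assumes one common corner convention (A, C, G, T at the corners of the unit square, starting from the center) and bins the resulting points into a grid of counts that plays the role of intensity values. The ENV itself is not reproduced, since the abstract does not give its formula. The names `cgr_points` and `cgr_image` are hypothetical.

```python
# Assumed corner convention for the unit square.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(seq):
    """Return the chaos-game points for a DNA sequence.

    Each base moves the current point halfway toward its corner.
    Characters other than A, C, G, T (for example N) are skipped.
    """
    x, y = 0.5, 0.5
    points = []
    for base in seq.upper():
        if base not in CORNERS:
            continue
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2, (y + cy) / 2
        points.append((x, y))
    return points


def cgr_image(seq, resolution=8):
    """Bin the CGR points into a resolution x resolution grid of counts."""
    image = [[0] * resolution for _ in range(resolution)]
    for x, y in cgr_points(seq):
        # Clamp so that a coordinate rounding to 1.0 stays inside the grid.
        col = min(int(x * resolution), resolution - 1)
        row = min(int(y * resolution), resolution - 1)
        image[row][col] += 1
    return image
```

With `resolution = 2**k`, each cell collects the points whose last `k` bases form one particular k-mer, so the image doubles as a k-mer frequency table.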

| Download full-text PDF | Source |
|---|---|
| http://dx.doi.org/10.1016/j.ympev.2019.106633 | DOI Listing |

December 2019

Comput Struct Biotechnol J 2019 11;17:982-994. Epub 2019 Jul 11.

Department of Mathematical Sciences, Tsinghua University, Beijing, PR China.

Genome comparison is a vital research area of bioinformatics. For large-scale genome comparisons, Multiple Sequence Alignment (MSA) methods are impractical due to their algorithmic complexity. In this study, we propose a novel alignment-free method based on the one-to-one correspondence between a DNA sequence and its complete central moment vector of the cumulative Fourier power and phase spectra. In addition, the covariance between the four nucleotides in the power and phase spectra is included. We use the cumulative Fourier power and phase spectra to define a 28-dimensional vector for each DNA sequence. Euclidean distances between the vectors measure the dissimilarity between DNA sequences. We test the method on datasets of different sizes and types, including simulated DNA sequences, exon-intron sequences, and complete genomes. The results show that our method is more accurate and efficient for hierarchical clustering than other alignment-free methods and MSA methods.
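The building blocks named here, Fourier power and phase spectra of a DNA sequence and their cumulative sums, can be sketched as below. The sketch assumes the usual binary indicator sequence per nucleotide and a direct discrete Fourier transform. It stops short of the 28-dimensional moment vector, whose exact definition the abstract does not give. All function names are hypothetical.

```python
import cmath
from itertools import accumulate


def indicator(seq, base):
    """Binary indicator sequence: 1.0 where seq has the given base, else 0.0."""
    return [1.0 if ch == base else 0.0 for ch in seq.upper()]


def dft(signal):
    """Direct O(N^2) discrete Fourier transform, fine for short examples."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def power_and_phase(seq, base):
    """Return the power spectrum and phase spectrum of one nucleotide."""
    spectrum = dft(indicator(seq, base))
    power = [abs(c) ** 2 for c in spectrum]
    # Phase is numerically unstable where the coefficient is close to zero.
    phase = [cmath.phase(c) for c in spectrum]
    return power, phase


def cumulative(values):
    """Running sums of a spectrum, e.g. a cumulative power spectrum."""
    return list(accumulate(values))
```

For real genomes an FFT routine would replace the quadratic `dft`. Whether the zero-frequency term is included in the cumulative sums is a design choice the sketch leaves to the caller, who can pass `power[1:]` to exclude it.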

| Download full-text PDF | Source |
|---|---|
| http://dx.doi.org/10.1016/j.csbj.2019.07.003 | DOI Listing |
| http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6661692 | PMC |

July 2019