Publications by authors named "Zhengqiao Zhao"

8 Publications

Critical Assessment of Metagenome Interpretation: the second round of challenges.

Nat Methods 2022 Apr 8;19(4):429-440. Epub 2022 Apr 8.

University of California, Davis, Davis, CA, USA.

Evaluating metagenomic software is key for optimizing metagenome interpretation and is the focus of the Initiative for the Critical Assessment of Metagenome Interpretation (CAMI). The CAMI II challenge engaged the community to assess methods on realistic and complex datasets with long- and short-read sequences, created computationally from around 1,700 new and known genomes, as well as 600 new plasmids and viruses. Here we analyze 5,002 results by 76 program versions. Substantial improvements were seen in assembly, some due to long-read data. Related strains were still challenging for assembly and genome recovery through binning, as was assembly quality for the latter. Profilers markedly matured, with taxon profilers and binners excelling at higher bacterial ranks, but underperforming for viruses and Archaea. Clinical pathogen detection results revealed a need to improve reproducibility. Runtime and memory usage analyses identified efficient programs, including top performers with other metrics. The results identify challenges and guide researchers in selecting methods for analyses.

Source
DOI: http://dx.doi.org/10.1038/s41592-022-01431-4
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC9007738
April 2022

Learning, visualizing and exploring 16S rRNA structure using an attention-based deep neural network.

PLoS Comput Biol 2021 Sep 22;17(9):e1009345. Epub 2021 Sep 22.

Ecological and Evolutionary Signal-Processing and Informatics Laboratory, Department of Electrical and Computer Engineering, College of Engineering, Drexel University, Philadelphia, Pennsylvania, United States of America.

Recurrent neural networks with memory and attention mechanisms are widely used in natural language processing because they can capture short and long term sequential information for diverse tasks. We propose an integrated deep learning model for microbial DNA sequence data, which exploits convolutional neural networks, recurrent neural networks, and attention mechanisms to predict taxonomic classifications and sample-associated attributes, such as the relationship between the microbiome and host phenotype, on the read/sequence level. In this paper, we develop this novel deep learning approach and evaluate its application to amplicon sequences. We apply our approach to short DNA reads and full sequences of 16S ribosomal RNA (rRNA) marker genes, which identify the heterogeneity of a microbial community sample. We demonstrate that our implementation of a novel attention-based deep network architecture, Read2Pheno, achieves read-level phenotypic prediction. Training Read2Pheno models will encode sequences (reads) into dense, meaningful representations: learned embedded vectors output from the intermediate layer of the network model, which can provide biological insight when visualized. The attention layer of Read2Pheno models can also automatically identify nucleotide regions in reads/sequences which are particularly informative for classification. As such, this novel approach can avoid pre/post-processing and manual interpretation required with conventional approaches to microbiome sequence classification. We further show, as proof-of-concept, that aggregating read-level information can robustly predict microbial community properties, host phenotype, and taxonomic classification, with performance at least comparable to conventional approaches. An implementation of the attention-based deep learning network is available at https://github.com/EESI/sequence_attention (a python package) and https://github.com/EESI/seq2att (a command line tool).
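
The abstract does not spell out how the attention layer is formulated, so the following is only a minimal sketch of generic dot-product attention pooling, not the Read2Pheno implementation. The names `attention_pool`, `hidden`, and `context` are hypothetical. It shows how per-position weights can both summarize a read into one vector and indicate which positions contributed most.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(hidden, context):
    """Pool per-position vectors into one vector using attention weights.

    hidden:  list of per-position feature vectors (e.g. outputs of a
             convolutional/recurrent encoder; assumed, not from the paper).
    context: a vector of the same dimension that would be learned in a
             real model (hypothetical here).
    """
    # Score each position by its dot product with the context vector.
    scores = [sum(h_d * c_d for h_d, c_d in zip(h, context)) for h in hidden]
    weights = softmax(scores)
    dim = len(hidden[0])
    # Weighted sum of the position vectors gives the read-level embedding.
    pooled = [sum(w * h[d] for w, h in zip(weights, hidden)) for d in range(dim)]
    return pooled, weights
```

In a sketch like this, the returned `weights` play the role of the interpretable output: positions with large weights are the ones the model treats as informative for the prediction.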

Source
DOI: http://dx.doi.org/10.1371/journal.pcbi.1009345
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC8496832
September 2021

Amino Acid k-mer Feature Extraction for Quantitative Antimicrobial Resistance (AMR) Prediction by Machine Learning and Model Interpretation for Biological Insights.

Biology (Basel) 2020 Oct 28;9(11). Epub 2020 Oct 28.

Ecological and Evolutionary Signal-Processing and Informatics Laboratory, Department of Electrical and Computer Engineering, College of Engineering, Drexel University, Philadelphia, PA 19104, USA.

Machine learning algorithms can learn mechanisms of antimicrobial resistance from DNA sequence data without any a priori information. Interpreting a trained machine learning algorithm can be exploited for validating the model and obtaining new information about resistance mechanisms. Different feature extraction methods, such as SNP calling and counting nucleotide k-mers, have been proposed for presenting DNA sequences to the model. However, there are trade-offs between interpretability, computational complexity and accuracy for different feature extraction methods. In this study, we have proposed a new feature extraction method, counting amino acid k-mers or oligopeptides, which provides easier model interpretation compared to counting nucleotide k-mers and reaches the same or even better accuracy in comparison with different methods. Additionally, we have trained machine learning algorithms using different feature extraction methods and compared the results in terms of accuracy, model interpretability and computational complexity. We have built a new feature selection pipeline for extraction of important features so that new AMR determinants can be discovered by analyzing these features. This pipeline allows the construction of models that only use a small number of features and can predict resistance accurately.
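
As a rough illustration of counting amino acid k-mers, the sketch below translates a coding DNA sequence with the standard genetic code and counts overlapping oligopeptides. It assumes a single reading frame starting at position 0 and stops at the first stop codon; the paper's actual preprocessing (gene calling, frames, choice of k) is not described in the abstract, and the function names are hypothetical.

```python
from collections import Counter
from itertools import product

# Standard genetic code in the conventional TCAG codon ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(dna: str) -> str:
    """Translate a coding sequence in frame 0, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE.get(dna[i:i + 3].upper(), "X")  # X marks unknown codons
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

def count_kmers(seq: str, k: int) -> Counter:
    """Count all overlapping k-mers in a sequence (nucleotide or amino acid)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

# Example: feature vector of amino acid 2-mers for one gene.
features = count_kmers(translate("ATGGCCAAAATGGCCAAATAA"), k=2)
```

Each distinct oligopeptide becomes one feature, so a feature that a model weights heavily can be traced back to a short stretch of protein sequence.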

Source
DOI: http://dx.doi.org/10.3390/biology9110365
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7694136
October 2020

Keeping up with the genomes: efficient learning of our increasing knowledge of the tree of life.

BMC Bioinformatics 2020 Sep 21;21(1):412. Epub 2020 Sep 21.

Ecological and Evolutionary Signal-process and Informatics (EESI) Lab, Department of Electrical and Computer Engineering, Drexel University, Market Street, Philadelphia, US.

Background: It is a computational challenge for current metagenomic classifiers to keep up with the pace of training data generated from genome sequencing projects, such as the exponentially-growing NCBI RefSeq bacterial genome database. When new reference sequences are added to training data, statically trained classifiers must be rerun on all data, resulting in a highly inefficient process. The rich literature of "incremental learning" addresses the need to update an existing classifier to accommodate new data without sacrificing much accuracy compared to retraining the classifier with all data.

Results: We demonstrate how classification improves over time by incrementally training a classifier on progressive RefSeq snapshots and testing it on: (a) all known current genomes (as a ground truth set) and (b) a real experimental metagenomic gut sample. We demonstrate that as a classifier model's knowledge of genomes grows, classification accuracy increases. The proof-of-concept naïve Bayes implementation, when updated yearly, now runs in 1/4 of the non-incremental time with no accuracy loss.

Conclusions: It is evident that classification improves by having the most current knowledge at its disposal. Therefore, it is of utmost importance to make classifiers computationally tractable to keep up with the data deluge. The incremental learning classifier can be efficiently updated without the cost of reprocessing or the need to access the existing database, and therefore saves storage as well as computational resources.
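
The idea of updating a naïve Bayes classifier without reprocessing old genomes can be sketched as follows. This is a simplified illustration, not the authors' implementation: the class name `IncrementalKmerNB`, the uniform class prior, and the Laplace smoothing are all assumptions. The point it shows is that only per-class k-mer counts need to be stored, so a new genome is folded in by adding its counts.

```python
import math
from collections import Counter, defaultdict

class IncrementalKmerNB:
    """Multinomial naive Bayes over k-mer counts that can be updated in place."""

    def __init__(self, k: int = 3):
        self.k = k
        self.counts = defaultdict(Counter)  # class label -> k-mer counts
        self.totals = Counter()             # class label -> total k-mers seen

    def _kmers(self, seq):
        return [seq[i:i + self.k] for i in range(len(seq) - self.k + 1)]

    def update(self, label, seq):
        """Fold one new reference sequence into the stored counts.

        Earlier sequences are never revisited; only the counts change.
        """
        kmers = self._kmers(seq)
        self.counts[label].update(kmers)
        self.totals[label] += len(kmers)

    def classify(self, read):
        """Return the label with the highest smoothed log-likelihood."""
        vocab = 4 ** self.k  # Laplace smoothing over all possible DNA k-mers

        def score(label):
            return sum(
                math.log((self.counts[label][km] + 1) / (self.totals[label] + vocab))
                for km in self._kmers(read)
            )

        return max(self.totals, key=score)

model = IncrementalKmerNB(k=3)
model.update("A_rich", "AAAAAAAAAAAA")
model.update("GC_rich", "GCGCGCGCGCGC")
# Later, a new genome arrives; only its own counts are added.
model.update("T_rich", "TTTTTTTTTTTT")
```

Because `update` touches only the counts of the class being added, the cost of a yearly update in this sketch scales with the new data rather than with the whole database.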

Source
DOI: http://dx.doi.org/10.1186/s12859-020-03744-7
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7507296
September 2020

Genetic grouping of SARS-CoV-2 coronavirus sequences using informative subtype markers for pandemic spread visualization.

PLoS Comput Biol 2020 Sep 17;16(9):e1008269. Epub 2020 Sep 17.

Ecological and Evolutionary Signal-Processing and Informatics Laboratory, Department of Electrical and Computer Engineering, College of Engineering, Drexel University, Philadelphia, PA, USA.

We propose an efficient framework for genetic subtyping of SARS-CoV-2, the novel coronavirus that causes the COVID-19 pandemic. Efficient viral subtyping enables visualization and modeling of the geographic distribution and temporal dynamics of disease spread. Subtyping thereby advances the development of effective containment strategies and, potentially, therapeutic and vaccine strategies. However, identifying viral subtypes in real-time is challenging: SARS-CoV-2 is a novel virus, and the pandemic is rapidly expanding. Viral subtypes may be difficult to detect due to rapid evolution; founder effects are more significant than selection pressure; and the clustering threshold for subtyping is not standardized. We propose to identify mutational signatures of available SARS-CoV-2 sequences using a population-based approach: an entropy measure followed by frequency analysis. These signatures, Informative Subtype Markers (ISMs), define a compact set of nucleotide sites that characterize the most variable (and thus most informative) positions in the viral genomes sequenced from different individuals. Through ISM compression, we find that certain distant nucleotide variants covary, including non-coding and ORF1ab sites covarying with the D614G spike protein mutation which has become increasingly prevalent as the pandemic has spread. ISMs are also useful for downstream analyses, such as spatiotemporal visualization of viral dynamics. By analyzing sequence data available in the GISAID database, we validate the utility of ISM-based subtyping by comparing spatiotemporal analyses using ISMs to epidemiological studies of viral transmission in Asia, Europe, and the United States. In addition, we show the relationship of ISMs to phylogenetic reconstructions of SARS-CoV-2 evolution, and therefore, ISMs can play an important complementary role to phylogenetic tree-based analysis, such as is done in the Nextstrain project. 
The developed pipeline dynamically generates ISMs for newly added SARS-CoV-2 sequences and updates the visualization of pandemic spatiotemporal dynamics, and is available on GitHub at https://github.com/EESI/ISM (Jupyter notebook), https://github.com/EESI/ncov_ism (command line tool) and via an interactive website at https://covid19-ism.coe.drexel.edu/.
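
The entropy step behind ISMs can be sketched as below. This is a simplified illustration under stated assumptions: sequences are already aligned and of equal length, sites are selected with a fixed entropy threshold (the `threshold` parameter is hypothetical), and the frequency analysis and ambiguity handling mentioned in the abstract are omitted.

```python
import math
from collections import Counter

def site_entropy(column):
    """Shannon entropy (in bits) of the nucleotides observed at one alignment column."""
    counts = Counter(column)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def informative_sites(aligned_seqs, threshold):
    """Return the positions whose entropy exceeds the threshold."""
    return [
        pos
        for pos, column in enumerate(zip(*aligned_seqs))
        if site_entropy(column) > threshold
    ]

def ism(seq, sites):
    """Concatenate the nucleotides at the selected sites into a short marker string."""
    return "".join(seq[pos] for pos in sites)

aligned = ["ACGT", "ACGA", "ATGT", "ATGA"]
sites = informative_sites(aligned, threshold=0.5)
markers = [ism(seq, sites) for seq in aligned]
```

Conserved columns have zero entropy and are dropped, so each genome is reduced to a compact string over the variable positions, which can then be counted and tracked over time and location.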

Source
DOI: http://dx.doi.org/10.1371/journal.pcbi.1008269
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7523987
September 2020

Emerging Priorities for Microbiome Research.

Front Microbiol 2020 Feb 19;11:136. Epub 2020 Feb 19.

School of Biomedical Engineering, Science and Health Systems, Drexel University, Philadelphia, PA, United States.

Microbiome research has increased dramatically in recent years, driven by advances in technology and significant reductions in the cost of analysis. Such research has unlocked a wealth of data, which has yielded tremendous insight into the nature of the microbial communities, including their interactions and effects, both within a host and in an external environment as part of an ecological community. Understanding the role of microbiota, including their dynamic interactions with their hosts and other microbes, can enable the engineering of new diagnostic techniques and interventional strategies that can be used in a diverse spectrum of fields, spanning from ecology and agriculture to medicine and from forensics to exobiology. From June 19-23 in 2017, the NIH and NSF jointly held an Innovation Lab on . This review is inspired by some of the topics that arose as priority areas from this unique, interactive workshop. The goal of this review is to summarize the Innovation Lab's findings by introducing the reader to emerging challenges, exciting potential, and current directions in microbiome research. The review is broken into five key topic areas: (1) interactions between microbes and the human body, (2) evolution and ecology of microbes, including the role played by the environment and microbe-microbe interactions, (3) analytical and mathematical methods currently used in microbiome research, (4) leveraging knowledge of microbial composition and interactions to develop engineering solutions, and (5) interventional approaches and engineered microbiota that may be enabled by selectively altering microbial composition. As such, this review seeks to arm the reader with a broad understanding of the priorities and challenges in microbiome research today and provide inspiration for future investigation and multi-disciplinary collaboration.

Source
DOI: http://dx.doi.org/10.3389/fmicb.2020.00136
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7042322
February 2020

Exploring thematic structure and predicted functionality of 16S rRNA amplicon data.

PLoS One 2019 Dec 11;14(12):e0219235. Epub 2019 Dec 11.

Department of Electrical and Computer Engineering, Drexel University, Philadelphia, Pennsylvania, United States of America.

Analysis of microbiome data involves identifying co-occurring groups of taxa associated with sample features of interest (e.g., disease state). Elucidating such relations is often difficult as microbiome data are compositional, sparse, and have high dimensionality. Also, the configuration of co-occurring taxa may represent overlapping subcommunities that contribute to sample characteristics such as host status. Preserving the configuration of co-occurring microbes rather than detecting specific indicator species is more likely to facilitate biologically meaningful interpretations. Additionally, analyses that use taxonomic relative abundances to predict the abundances of different gene functions aggregate predicted functional profiles across taxa. This precludes straightforward identification of predicted functional components associated with subsets of co-occurring taxa. We provide an approach to explore co-occurring taxa using "topics" generated via a topic model and link these topics to specific sample features (e.g., disease state). Rather than inferring predicted functional content based on overall taxonomic relative abundances, we instead focus on inference of functional content within topics, which we parse by estimating interactions between topics and pathways through a multilevel, fully Bayesian regression model. We apply our methods to three publicly available 16S amplicon sequencing datasets: an inflammatory bowel disease dataset, an oral cancer dataset, and a time-series dataset. Using our topic model approach to uncover latent structure in 16S rRNA amplicon surveys, investigators can (1) capture groups of co-occurring taxa termed topics; (2) uncover within-topic functional potential; (3) link taxa co-occurrence, gene function, and environmental/host features; and (4) explore the way in which sets of co-occurring taxa behave and evolve over time. 
These methods have been implemented in a freely available R package: https://cran.r-project.org/package=themetagenomics, https://github.com/EESI/themetagenomics.

Source
PLOS: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0219235
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6905537
March 2020

16S rRNA sequence embeddings: Meaningful numeric feature representations of nucleotide sequences that are convenient for downstream analyses.

PLoS Comput Biol 2019 Feb 26;15(2):e1006721. Epub 2019 Feb 26.

Department of Electrical and Computer Engineering, Drexel University, Philadelphia, Pennsylvania, United States of America.

Advances in high-throughput sequencing have increased the availability of microbiome sequencing data that can be exploited to characterize microbiome community structure in situ. We explore using word and sentence embedding approaches for nucleotide sequences since they may be a suitable numerical representation for downstream machine learning applications (especially deep learning). This work involves first encoding ("embedding") each sequence into a dense, low-dimensional, numeric vector space. Here, we use Skip-Gram word2vec to embed k-mers, obtained from 16S rRNA amplicon surveys, and then leverage an existing sentence embedding technique to embed all sequences belonging to specific body sites or samples. We demonstrate that these representations are meaningful, and hence the embedding space can be exploited as a form of feature extraction for exploratory analysis. We show that sequence embeddings preserve relevant information about the sequencing data such as k-mer context, sequence taxonomy, and sample class. Specifically, the sequence embedding space resolved differences among phyla, as well as differences among genera within the same family. Distances between sequence embeddings had similar qualities to distances between alignment identities, and embedding multiple sequences can be thought of as generating a consensus sequence. In addition, embeddings are versatile features that can be used for many downstream tasks, such as taxonomic and sample classification. Using sample embeddings for body site classification resulted in negligible performance loss compared to using OTU abundance data, and clustering embeddings yielded high fidelity species clusters. Lastly, the k-mer embedding space captured distinct k-mer profiles that mapped to specific regions of the 16S rRNA gene and corresponded with particular body sites. 
Together, our results show that embedding sequences results in meaningful representations that can be used for exploratory analyses or for downstream machine learning applications that require numeric data. Moreover, because the embeddings are trained in an unsupervised manner, unlabeled data can be embedded and used to bolster supervised machine learning tasks.
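
To make the "k-mers as words" framing concrete, the sketch below tokenizes a sequence into overlapping k-mers, generates the (center, context) pairs a Skip-Gram model would be trained on, and averages k-mer vectors into a sequence vector. The plain average is a deliberately crude stand-in for the sentence embedding technique the abstract mentions, and all function names and the toy vectors are hypothetical.

```python
def kmer_tokens(seq, k):
    """Split a nucleotide sequence into overlapping k-mer 'words'."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def skipgram_pairs(tokens, window):
    """List the (center, context) pairs within a symmetric window."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        pairs.extend((center, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs

def mean_embedding(tokens, vectors):
    """Average the vectors of known k-mers (simplified sequence embedding)."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        raise ValueError("no k-mer in the sequence has an embedding")
    dim = len(known[0])
    return [sum(v[d] for v in known) / len(known) for d in range(dim)]

tokens = kmer_tokens("ACGTA", k=3)
pairs = skipgram_pairs(tokens, window=1)
# Toy 2-dimensional vectors; in practice these would come from a trained model.
toy_vectors = {"ACG": [1.0, 0.0], "CGT": [0.0, 1.0]}
seq_vector = mean_embedding(tokens, toy_vectors)
```

Because the training pairs come only from k-mer co-occurrence, no labels are needed at this stage, which is what allows unlabeled reads to be embedded alongside labeled ones.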

Source
DOI: http://dx.doi.org/10.1371/journal.pcbi.1006721
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6407789
February 2019