Publications by authors named "Gail Rosen"

59 Publications

Critical Assessment of Metagenome Interpretation: the second round of challenges.

Nat Methods 2022 Apr 8;19(4):429-440. Epub 2022 Apr 8.

University of California, Davis, Davis, CA, USA.

Evaluating metagenomic software is key for optimizing metagenome interpretation and is the focus of the Initiative for the Critical Assessment of Metagenome Interpretation (CAMI). The CAMI II challenge engaged the community to assess methods on realistic and complex datasets with long- and short-read sequences, created computationally from around 1,700 new and known genomes, as well as 600 new plasmids and viruses. Here we analyze 5,002 results by 76 program versions. Substantial improvements were seen in assembly, some due to long-read data. Related strains remained challenging for assembly and for genome recovery through binning, as did assembly quality for the latter. Profilers markedly matured, with taxon profilers and binners excelling at higher bacterial ranks, but underperforming for viruses and Archaea. Clinical pathogen detection results revealed a need to improve reproducibility. Runtime and memory usage analyses identified efficient programs, including top performers on other metrics. The results identify challenges and guide researchers in selecting methods for analyses.

Source
http://dx.doi.org/10.1038/s41592-022-01431-4 (DOI)
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC9007738 (PMC)
April 2022

Mapping Data to Deep Understanding: Making the Most of the Deluge of SARS-CoV-2 Genome Sequences.

mSystems 2022 Apr 21;7(2):e0003522. Epub 2022 Mar 21.

Drexel University, Ecological and Evolutionary Signal-Processing and Informatics Laboratory, Department of Electrical & Computer Engineering, College of Engineering, Philadelphia, Pennsylvania, USA.

Next-generation sequencing has been essential to the global response to the COVID-19 pandemic. As of January 2022, nearly 7 million severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) sequences are available to researchers in public databases. Sequence databases are an abundant resource from which to extract biologically relevant and clinically actionable information. As the pandemic has gone on, SARS-CoV-2 has rapidly evolved, involving complex genomic changes that challenge current approaches to classifying SARS-CoV-2 variants. Deep sequence learning could be a powerful way to build complex sequence-to-phenotype models. Unfortunately, while it can be predictive, deep learning typically produces "black box" models that cannot directly provide biological and clinical insight. Researchers should therefore consider implementing emerging methods for visualizing and interpreting deep sequence models. Finally, researchers should address important data limitations, including (i) global sequencing disparities, (ii) insufficient sequence metadata, and (iii) screening artifacts due to poor sequence quality control.

Source
http://dx.doi.org/10.1128/msystems.00035-22 (DOI)
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC9040592 (PMC)
April 2022

MetaMutationalSigs: Comparison of mutational signature refitting results made easy.

Bioinformatics 2022 Feb 14. Epub 2022 Feb 14.

Ecological and Evolutionary Signal-processing and Informatics Laboratory, Department of Electrical and Computer Engineering, College of Engineering, Drexel University, Philadelphia, PA, USA.

Motivation: The analysis of mutational signatures is becoming increasingly common in cancer genetics, with emerging implications in cancer evolution, classification, treatment decision and prognosis. Recently, several packages have been developed for mutational signature analysis, with each using different methodology and yielding significantly different results. Because of the nontrivial differences in tools' refitting results, researchers may desire to survey and compare the available tools, in order to objectively evaluate the results for their specific research question, such as which mutational signatures are prevalent in different cancer types.

Results: To enable effective comparison of mutational signature refitting results, we introduce user-friendly software that aggregates and visually presents results from different refitting packages.

Availability: MetaMutationalSigs is implemented in R and Python, can be installed using Docker, and is available at https://github.com/EESI/MetaMutationalSigs.

Supplementary Information: More information about the package, including test data and results, is available at https://github.com/EESI/MetaMutationalSigs.

Source
http://dx.doi.org/10.1093/bioinformatics/btac091 (DOI)
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC9004636 (PMC)
February 2022

Learning, visualizing and exploring 16S rRNA structure using an attention-based deep neural network.

PLoS Comput Biol 2021 Sep 22;17(9):e1009345. Epub 2021 Sep 22.

Ecological and Evolutionary Signal-Processing and Informatics Laboratory, Department of Electrical and Computer Engineering, College of Engineering, Drexel University, Philadelphia, Pennsylvania, United States of America.

Recurrent neural networks with memory and attention mechanisms are widely used in natural language processing because they can capture short and long term sequential information for diverse tasks. We propose an integrated deep learning model for microbial DNA sequence data, which exploits convolutional neural networks, recurrent neural networks, and attention mechanisms to predict taxonomic classifications and sample-associated attributes, such as the relationship between the microbiome and host phenotype, on the read/sequence level. In this paper, we develop this novel deep learning approach and evaluate its application to amplicon sequences. We apply our approach to short DNA reads and full sequences of 16S ribosomal RNA (rRNA) marker genes, which identify the heterogeneity of a microbial community sample. We demonstrate that our implementation of a novel attention-based deep network architecture, Read2Pheno, achieves read-level phenotypic prediction. Training Read2Pheno models will encode sequences (reads) into dense, meaningful representations: learned embedded vectors output from the intermediate layer of the network model, which can provide biological insight when visualized. The attention layer of Read2Pheno models can also automatically identify nucleotide regions in reads/sequences which are particularly informative for classification. As such, this novel approach can avoid pre/post-processing and manual interpretation required with conventional approaches to microbiome sequence classification. We further show, as proof-of-concept, that aggregating read-level information can robustly predict microbial community properties, host phenotype, and taxonomic classification, with performance at least comparable to conventional approaches. An implementation of the attention-based deep learning network is available at https://github.com/EESI/sequence_attention (a python package) and https://github.com/EESI/seq2att (a command line tool).
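
As a rough illustration of how an attention layer can highlight informative positions in a read, the sketch below scores each position, normalizes the scores with a softmax, and pools the per-position vectors into one read-level vector. It is a generic attention-pooling example and not the Read2Pheno implementation; the `hidden` vectors and the `query` vector are made-up stand-ins for values a trained network would produce.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(hidden, query):
    """Return (weights, context) for a list of per-position vectors.

    hidden: one equal-length vector per nucleotide position (assumed input)
    query:  hypothetical learned vector used to score each position
    """
    # Dot-product score for every position.
    scores = [sum(h_i * q_i for h_i, q_i in zip(h, query)) for h in hidden]
    weights = softmax(scores)
    dim = len(hidden[0])
    # Weighted sum of the position vectors gives the read-level vector.
    context = [sum(w * h[d] for w, h in zip(weights, hidden)) for d in range(dim)]
    return weights, context


# Toy values only: three positions with two-dimensional hidden states.
hidden = [[0.1, 0.0], [2.0, 1.0], [0.2, 0.1]]
query = [1.0, 1.0]
weights, context = attention_pool(hidden, query)
top_position = max(range(len(weights)), key=weights.__getitem__)
```

Here `top_position` is the index with the largest attention weight, which is the kind of signal that can be plotted along a sequence to see which regions drive a prediction.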

Source
http://dx.doi.org/10.1371/journal.pcbi.1009345 (DOI)
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC8496832 (PMC)
September 2021

Examining Cultural Structures and Functions in Biology.

Integr Comp Biol 2022 Feb;61(6):2282-2293

Friday Harbor Laboratories, University of Washington, Friday Harbor, WA 98250, USA.

Scientific culture and structure organize biological sciences in many ways. We make choices concerning the systems and questions we study. Our research then amplifies these choices into factors that influence the directions of future research by shaping our hypotheses, data analyses, interpretation, publication venues, and dissemination via other methods. But our choices are shaped by more than objective curiosity: we are influenced by cultural paradigms reinforced by societal upbringing and scientific indoctrination during training. This extends to the systems and data that we consider to be ethically obtainable or available for study, and who is considered qualified to do research, ask questions, and communicate about research. It is also influenced by the profitability of concepts like open access, a system designed to improve equity, but which enacts gatekeeping in unintended but foreseeable ways. Creating truly integrative biology programs will require more than intentionally developing departments or institutes that allow overlapping expertise in two or more subfields of biology. Interdisciplinary work requires the expertise of large and diverse teams of scientists working together; this is impossible without an authentic commitment to addressing, not denying, racism when practiced by individuals, institutions, and cultural aspects of academic science. We have identified starting points for remedying how our field has discouraged and caused harm, but we acknowledge there is a long path forward. This path must be paved with field-wide solutions and institutional buy-in: our solutions must match the scale of the problem. Together, we can integrate (not reintegrate) the nuances of biology into our field.

Source
http://dx.doi.org/10.1093/icb/icab140 (DOI)
February 2022

RRM2B Is Frequently Amplified Across Multiple Tumor Types: Implications for DNA Repair, Cellular Survival, and Cancer Therapy.

Front Genet 2021 Mar 12;12:628758. Epub 2021 Mar 12.

Cancer Prevention and Control Program, Fox Chase Cancer Center, Philadelphia, PA, United States.

RRM2B plays a crucial role in DNA replication, repair and oxidative stress. While germline mutations have been implicated in mitochondrial disorders, its relevance to cancer has not been established. Here, using TCGA studies, we investigated RRM2B alterations in cancer. We found that RRM2B is highly amplified in multiple tumor types, particularly in MYC-amplified tumors, and is associated with increased mRNA expression. We also observed that the chromosomal region 8q22.3-8q24 is amplified in multiple tumors and includes RRM2B and MYC, along with several other cancer-associated genes. An analysis of genes within this 8q-amplicon showed that cancers in which both RRM2B and MYC are amplified have a distinct pattern of amplification compared to cancers that are unaltered or those that have amplifications in RRM2B or MYC only. Investigation of curated biological interactions revealed that gene products of the amplified 8q22.3-8q24 region have important roles in DNA repair, DNA damage response, oxygen sensing, and apoptosis pathways and interact functionally. Notably, RRM2B-amplified cancers are characterized by mutation signatures of defective DNA repair and oxidative stress, and at least RRM2B-amplified breast cancers are associated with poor clinical outcome. These data suggest alterations in RRM2B and possibly the interacting 8q-proteins could have a profound effect on regulatory pathways such as DNA repair and cellular survival, highlighting therapeutic opportunities in these cancers.

Source
http://dx.doi.org/10.3389/fgene.2021.628758 (DOI)
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC8045241 (PMC)
March 2021

Teaching Microbiome Analysis: From Design to Computation Through Inquiry.

Front Microbiol 2020 Oct 29;11:528051. Epub 2020 Oct 29.

School of Education, Drexel University, Philadelphia, PA, United States.

In this article, we present our three-class course sequence to educate students about microbiome analysis and metagenomics through experiential learning by taking them from inquiry to analysis of the microbiome: Molecular Ecology Lab, Bioinformatics, and Computational Microbiome Analysis. Students developed hypotheses, designed lab experiments, sequenced the DNA from microbiomes, learned basic Python/R scripting, became proficient in at least one microbiome analysis software package, and were able to analyze data generated from the microbiome experiments. While over 150 students (graduate and undergraduate) were impacted by the development of the series of courses, our assessment covered only undergraduate learning, where 45 students enrolled in at least one of the three courses and 4 students took all three. Students gained skills in bioinformatics through the courses, and several positive comments were received through surveys and private correspondence. Through a summative assessment, general trends show that students became more proficient in comparative genomic techniques and had positive attitudes toward their abilities to bridge biology and bioinformatics. While most students took only one or two of the courses, we show that pre- and post-surveys of these individual classes still showed progress toward learning objectives. It is expected that students trained will enter the workforce with skills needed to innovate in the biotechnology, health, and environmental industries. Students are trained to maximize impact and tackle real world problems in biology and medicine with their learned knowledge of data science and machine learning. The course materials for the new microbiome analysis course are available on GitHub: https://github.com/EESI/Comp_Metagenomics_resources.

Source
http://dx.doi.org/10.3389/fmicb.2020.528051 (DOI)
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7658192 (PMC)
October 2020

Healthcare Shift Workers' Temporal Habits for Eating, Sleeping, and Light Exposure: A Multi-Instrument Pilot Study.

J Circadian Rhythms 2020 Oct 21;18. Epub 2020 Oct 21.

School of Nursing, University at Buffalo, Buffalo, NY, US.

Background: Circadian misalignment can impair healthcare shift workers' physical and mental health, resulting in sleep deprivation, obesity, and chronic disease. This multidisciplinary research team assessed eating patterns and sleep/physical activity of healthcare workers on three different shifts (day, night, and rotating-shift). To date, no study of real-world shift workers' daily eating and sleep has utilized a largely-objective measurement.

Method: During this fourteen-day observational study, participants wore two devices (Actiwatch and Bite Technologies counter) to measure physical activity, sleep, light exposure, and eating time. Participants also reported food intake via food diaries on personal mobile devices.

Results: In fourteen (5 day-, 5 night-, and 4 rotating-shift) participants, no baseline difference in BMI was observed. Overall, rotating-shift workers consumed fewer calories and had less activity and sleep than day- and night-shift workers. For eating patterns, compared to night- and rotating-shift, day-shift workers ate more frequently during work days. Night workers, however, consumed more calories at work relative to day and rotating workers. For physical activity and sleep, night-shift workers had the highest activity and least sleep on work days.

Conclusion: This pilot study utilized primarily objective measurement to examine shift workers' habits outside the laboratory. Although no association between BMI and eating patterns/activity/sleep was observed across groups, a small, homogeneous sample may have influenced this. Overall, shift work was associated with 1) increased calorie intake and higher-fat and -carbohydrate diets and 2) sleep deprivation. A larger, more diverse sample can participate in future studies that objectively measure shift workers' real-world habits.

Source
http://dx.doi.org/10.5334/jcr.199 (DOI)
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7583716 (PMC)
October 2020

Amino Acid k-mer Feature Extraction for Quantitative Antimicrobial Resistance (AMR) Prediction by Machine Learning and Model Interpretation for Biological Insights.

Biology (Basel) 2020 Oct 28;9(11). Epub 2020 Oct 28.

Ecological and Evolutionary Signal-Processing and Informatics Laboratory, Department of Electrical and Computer Engineering, College of Engineering, Drexel University, Philadelphia, PA 19104, USA.

Machine learning algorithms can learn mechanisms of antimicrobial resistance from DNA sequence data without any a priori information. Interpreting a trained machine learning algorithm can be exploited for validating the model and obtaining new information about resistance mechanisms. Different feature extraction methods, such as SNP calling and counting nucleotide k-mers, have been proposed for presenting DNA sequences to the model. However, there are trade-offs between interpretability, computational complexity and accuracy for different feature extraction methods. In this study, we have proposed a new feature extraction method, counting amino acid k-mers or oligopeptides, which provides easier model interpretation compared to counting nucleotide k-mers and reaches the same or even better accuracy in comparison with different methods. Additionally, we have trained machine learning algorithms using different feature extraction methods and compared the results in terms of accuracy, model interpretability and computational complexity. We have built a new feature selection pipeline for extraction of important features so that new AMR determinants can be discovered by analyzing these features. This pipeline allows the construction of models that only use a small number of features and can predict resistance accurately.
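
To make the feature extraction idea concrete, the following minimal sketch counts overlapping amino acid k-mers in a protein string and lays the counts out as a fixed-order feature vector. It is an illustration only, not the authors' pipeline; the peptide, the choice of k = 3, and the helper names are assumptions.

```python
from collections import Counter


def count_kmers(sequence, k):
    """Count overlapping k-mers (oligopeptides) in a sequence string."""
    if k <= 0:
        raise ValueError("k must be positive")
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def feature_vector(counts, vocabulary):
    """Arrange k-mer counts in the order given by `vocabulary`."""
    return [counts.get(kmer, 0) for kmer in vocabulary]


# Made-up peptide used purely as an example input.
protein = "MKTAYIAKQRMKTA"
counts = count_kmers(protein, 3)
vocab = sorted(counts)
vec = feature_vector(counts, vocab)
```

Each entry of `vec` corresponds to a named oligopeptide in `vocab`, which is what makes a feature that a model weights heavily easy to trace back to a protein region.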

Source
http://dx.doi.org/10.3390/biology9110365 (DOI)
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7694136 (PMC)
October 2020

Keeping up with the genomes: efficient learning of our increasing knowledge of the tree of life.

BMC Bioinformatics 2020 Sep 21;21(1):412. Epub 2020 Sep 21.

Ecological and Evolutionary Signal-process and Informatics (EESI) Lab, Department of Electrical and Computer Engineering, Drexel University, Market Street, Philadelphia, US.

Background: It is a computational challenge for current metagenomic classifiers to keep up with the pace of training data generated from genome sequencing projects, such as the exponentially-growing NCBI RefSeq bacterial genome database. When new reference sequences are added to training data, statically trained classifiers must be rerun on all data, resulting in a highly inefficient process. The rich literature of "incremental learning" addresses the need to update an existing classifier to accommodate new data without sacrificing much accuracy compared to retraining the classifier with all data.

Results: We demonstrate how classification improves over time by incrementally training a classifier on progressive RefSeq snapshots and testing it on: (a) all known current genomes (as a ground truth set) and (b) a real experimental metagenomic gut sample. We demonstrate that as a classifier model's knowledge of genomes grows, classification accuracy increases. The proof-of-concept naïve Bayes implementation, when updated yearly, now runs in 1/4 of the non-incremental time with no accuracy loss.

Conclusions: It is evident that classification improves by having the most current knowledge at its disposal. Therefore, it is of utmost importance to make classifiers computationally tractable to keep up with the data deluge. The incremental learning classifier can be efficiently updated without the cost of reprocessing or the need to access the existing database, and it therefore saves storage as well as computation resources.
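
The reason a naïve Bayes classifier suits incremental updates is that its model is just a table of counts, so new genomes can be folded in by addition without revisiting old data. The sketch below shows that idea with k-mer counts; it is not the published implementation, and the class name, k = 3, Laplace smoothing, and uniform class prior are all assumptions made for illustration.

```python
import math
from collections import Counter


class IncrementalNaiveBayes:
    """Toy k-mer naive Bayes whose model is only a set of running counts."""

    def __init__(self, k=3, alpha=1.0):
        self.k = k
        self.alpha = alpha       # Laplace smoothing constant (assumed)
        self.kmer_counts = {}    # label -> Counter of k-mers
        self.totals = Counter()  # label -> total number of k-mers seen
        self.vocab = set()

    def _kmers(self, seq):
        return [seq[i:i + self.k] for i in range(len(seq) - self.k + 1)]

    def update(self, seq, label):
        """Fold one new reference sequence into the stored counts."""
        kmers = self._kmers(seq)
        self.kmer_counts.setdefault(label, Counter()).update(kmers)
        self.totals[label] += len(kmers)
        self.vocab.update(kmers)

    def predict(self, seq):
        """Return the label with the highest smoothed log-likelihood."""
        kmers = self._kmers(seq)
        v = len(self.vocab)

        def log_likelihood(label):
            counts = self.kmer_counts[label]
            total = self.totals[label]
            return sum(
                math.log((counts[km] + self.alpha) / (total + self.alpha * v))
                for km in kmers
            )

        return max(self.totals, key=log_likelihood)


clf = IncrementalNaiveBayes()
clf.update("AAAAAAAAAC", "taxonA")  # made-up reference sequences
clf.update("GGGGGGGGGT", "taxonB")
first = clf.predict("AAAAAG")

# Later snapshot: a new taxon is added without retraining on old data.
clf.update("CCCCCCCCCC", "taxonC")
second = clf.predict("CCCCC")
```

Only the count tables are kept between updates, which is why the original sequences do not need to be stored or reprocessed.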

Source
http://dx.doi.org/10.1186/s12859-020-03744-7 (DOI)
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7507296 (PMC)
September 2020

Genetic grouping of SARS-CoV-2 coronavirus sequences using informative subtype markers for pandemic spread visualization.

PLoS Comput Biol 2020 Sep 17;16(9):e1008269. Epub 2020 Sep 17.

Ecological and Evolutionary Signal-Processing and Informatics Laboratory, Department of Electrical and Computer Engineering, College of Engineering, Drexel University, Philadelphia, PA, USA.

We propose an efficient framework for genetic subtyping of SARS-CoV-2, the novel coronavirus that causes the COVID-19 pandemic. Efficient viral subtyping enables visualization and modeling of the geographic distribution and temporal dynamics of disease spread. Subtyping thereby advances the development of effective containment strategies and, potentially, therapeutic and vaccine strategies. However, identifying viral subtypes in real-time is challenging: SARS-CoV-2 is a novel virus, and the pandemic is rapidly expanding. Viral subtypes may be difficult to detect due to rapid evolution; founder effects are more significant than selection pressure; and the clustering threshold for subtyping is not standardized. We propose to identify mutational signatures of available SARS-CoV-2 sequences using a population-based approach: an entropy measure followed by frequency analysis. These signatures, Informative Subtype Markers (ISMs), define a compact set of nucleotide sites that characterize the most variable (and thus most informative) positions in the viral genomes sequenced from different individuals. Through ISM compression, we find that certain distant nucleotide variants covary, including non-coding and ORF1ab sites covarying with the D614G spike protein mutation which has become increasingly prevalent as the pandemic has spread. ISMs are also useful for downstream analyses, such as spatiotemporal visualization of viral dynamics. By analyzing sequence data available in the GISAID database, we validate the utility of ISM-based subtyping by comparing spatiotemporal analyses using ISMs to epidemiological studies of viral transmission in Asia, Europe, and the United States. In addition, we show the relationship of ISMs to phylogenetic reconstructions of SARS-CoV-2 evolution, and therefore, ISMs can play an important complementary role to phylogenetic tree-based analysis, such as is done in the Nextstrain project. 
The developed pipeline dynamically generates ISMs for newly added SARS-CoV-2 sequences and updates the visualization of pandemic spatiotemporal dynamics, and is available on GitHub at https://github.com/EESI/ISM (Jupyter notebook), https://github.com/EESI/ncov_ism (command line tool) and via an interactive website at https://covid19-ism.coe.drexel.edu/.
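
The "entropy measure followed by frequency analysis" can be sketched as follows: compute Shannon entropy at each column of an alignment, keep the most variable columns, and read off the bases at those columns as a short marker string per genome. This is a simplified illustration under stated assumptions (a tiny made-up alignment, a fixed number of sites, no handling of gaps or ambiguous bases) and is not the ISM pipeline itself.

```python
import math
from collections import Counter


def site_entropy(column):
    """Shannon entropy (in bits) of the bases observed at one alignment column."""
    counts = Counter(column)
    n = len(column)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def informative_sites(aligned, top_n):
    """Return the indices of the `top_n` highest-entropy columns, in genome order."""
    length = len(aligned[0])
    entropies = [site_entropy([s[i] for s in aligned]) for i in range(length)]
    ranked = sorted(range(length), key=lambda i: entropies[i], reverse=True)
    return sorted(ranked[:top_n])


def marker(seq, sites):
    """Concatenate the bases at the selected sites into a subtype marker."""
    return "".join(seq[i] for i in sites)


# Made-up, already-aligned toy sequences.
aligned = ["ACGTA", "ACGTT", "ATGTA", "ATGTT"]
sites = informative_sites(aligned, 2)
# Frequency analysis: how often each marker occurs in the sample.
marker_counts = Counter(marker(s, sites) for s in aligned)
```

Counting markers by region and collection date would then give the kind of table that a spatiotemporal visualization can be built on.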

Source
http://dx.doi.org/10.1371/journal.pcbi.1008269 (DOI)
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7523987 (PMC)
September 2020

Emerging Priorities for Microbiome Research.

Front Microbiol 2020 Feb 19;11:136. Epub 2020 Feb 19.

School of Biomedical Engineering, Science and Health Systems, Drexel University, Philadelphia, PA, United States.

Microbiome research has increased dramatically in recent years, driven by advances in technology and significant reductions in the cost of analysis. Such research has unlocked a wealth of data, which has yielded tremendous insight into the nature of the microbial communities, including their interactions and effects, both within a host and in an external environment as part of an ecological community. Understanding the role of microbiota, including their dynamic interactions with their hosts and other microbes, can enable the engineering of new diagnostic techniques and interventional strategies that can be used in a diverse spectrum of fields, spanning from ecology and agriculture to medicine and from forensics to exobiology. From June 19-23 in 2017, the NIH and NSF jointly held an Innovation Lab on . This review is inspired by some of the topics that arose as priority areas from this unique, interactive workshop. The goal of this review is to summarize the Innovation Lab's findings by introducing the reader to emerging challenges, exciting potential, and current directions in microbiome research. The review is broken into five key topic areas: (1) interactions between microbes and the human body, (2) evolution and ecology of microbes, including the role played by the environment and microbe-microbe interactions, (3) analytical and mathematical methods currently used in microbiome research, (4) leveraging knowledge of microbial composition and interactions to develop engineering solutions, and (5) interventional approaches and engineered microbiota that may be enabled by selectively altering microbial composition. As such, this review seeks to arm the reader with a broad understanding of the priorities and challenges in microbiome research today and provide inspiration for future investigation and multi-disciplinary collaboration.

Source
http://dx.doi.org/10.3389/fmicb.2020.00136 (DOI)
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7042322 (PMC)
February 2020

Exploring thematic structure and predicted functionality of 16S rRNA amplicon data.

PLoS One 2019 Dec 11;14(12):e0219235. Epub 2019 Dec 11.

Department of Electrical and Computer Engineering, Drexel University, Philadelphia, Pennsylvania, United States of America.

Analysis of microbiome data involves identifying co-occurring groups of taxa associated with sample features of interest (e.g., disease state). Elucidating such relations is often difficult as microbiome data are compositional, sparse, and have high dimensionality. Also, the configuration of co-occurring taxa may represent overlapping subcommunities that contribute to sample characteristics such as host status. Preserving the configuration of co-occurring microbes rather than detecting specific indicator species is more likely to facilitate biologically meaningful interpretations. Additionally, analyses that use taxonomic relative abundances to predict the abundances of different gene functions aggregate predicted functional profiles across taxa. This precludes straightforward identification of predicted functional components associated with subsets of co-occurring taxa. We provide an approach to explore co-occurring taxa using "topics" generated via a topic model and link these topics to specific sample features (e.g., disease state). Rather than inferring predicted functional content based on overall taxonomic relative abundances, we instead focus on inference of functional content within topics, which we parse by estimating interactions between topics and pathways through a multilevel, fully Bayesian regression model. We apply our methods to three publicly available 16S amplicon sequencing datasets: an inflammatory bowel disease dataset, an oral cancer dataset, and a time-series dataset. Using our topic model approach to uncover latent structure in 16S rRNA amplicon surveys, investigators can (1) capture groups of co-occurring taxa termed topics; (2) uncover within-topic functional potential; (3) link taxa co-occurrence, gene function, and environmental/host features; and (4) explore the way in which sets of co-occurring taxa behave and evolve over time. 
These methods have been implemented in a freely available R package: https://cran.r-project.org/package=themetagenomics, https://github.com/EESI/themetagenomics.

Source
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0219235 (PLOS)
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6905537 (PMC)
March 2020

Correction to: Comprehensive benchmarking and ensemble approaches for metagenomic classifiers.

Genome Biol 2019 Apr 5;20(1):72. Epub 2019 Apr 5.

Department of Physiology and Biophysics, Weill Cornell Medicine, New York, NY, 10021, USA.

Following publication of the original article [1], the authors would like to highlight the following two corrections.

Source
http://dx.doi.org/10.1186/s13059-019-1687-2 (DOI)
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6450011 (PMC)
April 2019

16S rRNA sequence embeddings: Meaningful numeric feature representations of nucleotide sequences that are convenient for downstream analyses.

PLoS Comput Biol 2019 Feb 26;15(2):e1006721. Epub 2019 Feb 26.

Department of Electrical and Computer Engineering, Drexel University, Philadelphia, Pennsylvania, United States of America.

Advances in high-throughput sequencing have increased the availability of microbiome sequencing data that can be exploited to characterize microbiome community structure in situ. We explore using word and sentence embedding approaches for nucleotide sequences since they may be a suitable numerical representation for downstream machine learning applications (especially deep learning). This work involves first encoding ("embedding") each sequence into a dense, low-dimensional, numeric vector space. Here, we use Skip-Gram word2vec to embed k-mers, obtained from 16S rRNA amplicon surveys, and then leverage an existing sentence embedding technique to embed all sequences belonging to specific body sites or samples. We demonstrate that these representations are meaningful, and hence the embedding space can be exploited as a form of feature extraction for exploratory analysis. We show that sequence embeddings preserve relevant information about the sequencing data such as k-mer context, sequence taxonomy, and sample class. Specifically, the sequence embedding space resolved differences among phyla, as well as differences among genera within the same family. Distances between sequence embeddings had similar qualities to distances between alignment identities, and embedding multiple sequences can be thought of as generating a consensus sequence. In addition, embeddings are versatile features that can be used for many downstream tasks, such as taxonomic and sample classification. Using sample embeddings for body site classification resulted in negligible performance loss compared to using OTU abundance data, and clustering embeddings yielded high fidelity species clusters. Lastly, the k-mer embedding space captured distinct k-mer profiles that mapped to specific regions of the 16S rRNA gene and corresponded with particular body sites. 
Together, our results show that embedding sequences results in meaningful representations that can be used for exploratory analyses or for downstream machine learning applications that require numeric data. Moreover, because the embeddings are trained in an unsupervised manner, unlabeled data can be embedded and used to bolster supervised machine learning tasks.
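
The preprocessing behind a Skip-Gram k-mer embedding can be sketched without any machine learning library: split a sequence into overlapping k-mers, then pair each k-mer with its neighbors inside a context window. The final helper averages k-mer vectors into a single sequence vector; a plain average is a simplification chosen for this sketch and is not the sentence embedding technique the abstract refers to. The toy sequence, k = 3, window size, and vectors are all assumptions.

```python
def kmers(seq, k):
    """Split a nucleotide sequence into overlapping k-mers ("words")."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def skipgram_pairs(tokens, window):
    """Build (target, context) training pairs as used by Skip-Gram word2vec."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        pairs.extend((target, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs


def mean_embedding(tokens, vectors):
    """Average the vectors of known k-mers into one sequence-level vector."""
    dim = len(next(iter(vectors.values())))
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    return [sum(v[d] for v in known) / len(known) for d in range(dim)]


tokens = kmers("ACGTAC", 3)        # ACG, CGT, GTA, TAC
pairs = skipgram_pairs(tokens, 1)  # neighbors one step to either side

# Hypothetical two-dimensional vectors standing in for trained embeddings.
toy_vectors = {"ACG": [1.0, 0.0], "CGT": [0.0, 1.0]}
seq_vector = mean_embedding(tokens, toy_vectors)
```

In practice the pairs would be fed to a word2vec trainer, and the resulting sequence vectors used as numeric features for classification or clustering.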

Source
http://dx.doi.org/10.1371/journal.pcbi.1006721 (DOI)
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6407789 (PMC)
February 2019

Opportunities and obstacles for deep learning in biology and medicine.

J R Soc Interface 2018 Apr;15(141)

Division of Biomedical Informatics and Personalized Medicine, University of Colorado School of Medicine, Aurora, CO, USA.

Deep learning describes a class of machine learning algorithms that are capable of combining raw inputs into layers of intermediate features. These algorithms have recently shown impressive results across a variety of domains. Biology and medicine are data-rich disciplines, but the data are complex and often ill-understood. Hence, deep learning techniques may be particularly well suited to solve problems of these fields. We examine applications of deep learning to a variety of biomedical problems (patient classification, fundamental biological processes and treatment of patients) and discuss whether deep learning will be able to transform these tasks or if the biomedical sphere poses unique challenges. Following from an extensive literature review, we find that deep learning has yet to revolutionize biomedicine or definitively resolve any of the most pressing challenges in the field, but promising advances have been made on the prior state of the art. Even though improvements over previous baselines have been modest in general, the recent progress indicates that deep learning methods will provide valuable means for speeding up or aiding human investigation. Though progress has been made linking a specific neural network's prediction to input features, understanding how users should interpret these models to make testable hypotheses about the system under study remains an open challenge. Furthermore, the limited amount of labelled data for training presents problems in some domains, as do legal and privacy constraints on work with sensitive health records. Nonetheless, we foresee deep learning enabling changes at both bench and bedside with the potential to transform several areas of biology and medicine.

Source
http://dx.doi.org/10.1098/rsif.2017.0387 (DOI Listing)
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5938574 (PMC)
April 2018

Extensions to Online Feature Selection Using Bagging and Boosting.

IEEE Trans Neural Netw Learn Syst 2018 09 11;29(9):4504-4509. Epub 2017 Oct 11.

Feature subset selection can be used to sieve through large volumes of data and discover the most informative subset of variables for a particular learning problem. Yet, due to memory and other resource constraints (e.g., CPU availability), many of the state-of-the-art feature subset selection methods cannot be extended to high dimensional data, or data sets with an extremely large volume of instances. In this brief, we extend online feature selection (OFS), a recently introduced approach that uses partial feature information, by developing an ensemble of online linear models to make predictions. The OFS approach employs a linear model as the base classifier, which allows the $l_{0}$-norm of the parameter vector to be constrained to perform feature selection leading to sparse linear models. We demonstrate that the proposed ensemble model typically yields a smaller error rate than any single linear model, while maintaining the same level of sparsity and complexity at the time of testing.

Source
http://dx.doi.org/10.1109/TNNLS.2017.2746107 (DOI Listing)
September 2018

Metagenomic characterization of ambulances across the USA.

Microbiome 2017 09 22;5(1):125. Epub 2017 Sep 22.

Department of Physiology and Biophysics, Weill Cornell Medicine, New York, NY, USA.

Background: Microbial communities in our built environments have great influence on human health and disease. A variety of built environments have been characterized using a metagenomics-based approach, including some healthcare settings. However, there has been no study to date that has used this approach in pre-hospital settings, such as ambulances, an important first point-of-contact between patients and hospitals.

Results: We sequenced 398 samples from 137 ambulances across the USA using shotgun sequencing. We analyzed these data to explore the microbial ecology of ambulances including characterizing microbial community composition, nosocomial pathogens, patterns of diversity, presence of functional pathways and antimicrobial resistance, and potential spatial and environmental factors that may contribute to community composition. We found that the top 10 most abundant species are either common built environment microbes, microbes associated with the human microbiome (e.g., skin), or are species associated with nosocomial infections. We also found widespread evidence of antimicrobial resistance markers (hits ~ 90% samples). We identified six factors that may influence the microbial ecology of ambulances including ambulance surfaces, geographical-related factors (including region, longitude, and latitude), and weather-related factors (including temperature and precipitation).

Conclusions: While the vast majority of microbial species classified were beneficial, we also found widespread evidence of species associated with nosocomial infections and antimicrobial resistance markers. This study indicates that metagenomics may be useful to characterize the microbial ecology of pre-hospital ambulance settings and that more rigorous testing and cleaning of ambulances may be warranted.

Source
http://dx.doi.org/10.1186/s40168-017-0339-6 (DOI Listing)
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5610413 (PMC)
September 2017

Comprehensive benchmarking and ensemble approaches for metagenomic classifiers.

Genome Biol 2017 09 21;18(1):182. Epub 2017 Sep 21.

Department of Physiology and Biophysics, Weill Cornell Medicine, New York, NY, 10021, USA.

Background: One of the main challenges in metagenomics is the identification of microorganisms in clinical and environmental samples. While an extensive and heterogeneous set of computational tools is available to classify microorganisms using whole-genome shotgun sequencing data, comprehensive comparisons of these methods are limited.

Results: In this study, we use the largest-to-date set of laboratory-generated and simulated controls across 846 species to evaluate the performance of 11 metagenomic classifiers. Tools were characterized on the basis of their ability to identify taxa at the genus, species, and strain levels, quantify relative abundances of taxa, and classify individual reads to the species level. Strikingly, the number of species identified by the 11 tools can differ by over three orders of magnitude on the same datasets. Various strategies can ameliorate taxonomic misclassification, including abundance filtering, ensemble approaches, and tool intersection. Nevertheless, these strategies were often insufficient to completely eliminate false positives from environmental samples, which are especially important where they concern medically relevant species. Overall, pairing tools with different classification strategies (k-mer, alignment, marker) can combine their respective advantages.

Conclusions: This study provides positive and negative controls, titrated standards, and a guide for selecting tools for metagenomic analyses by comparing ranges of precision, accuracy, and recall. We show that proper experimental design and analysis parameters can reduce false positives, provide greater resolution of species in complex metagenomic samples, and improve the interpretation of results.

Source
http://dx.doi.org/10.1186/s13059-017-1299-7 (DOI Listing)
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5609029 (PMC)
September 2017

Nanopore sequencing in microgravity.

NPJ Microgravity 2016 20;2:16035. Epub 2016 Oct 20.

Department of Physiology and Biophysics, Weill Cornell Medical College, New York, NY, USA.

Rapid DNA sequencing and analysis has been a long-sought goal in remote research and point-of-care medicine. In microgravity, DNA sequencing can facilitate novel astrobiological research and close monitoring of crew health, but spaceflight places stringent restrictions on the mass and volume of instruments, crew operation time, and instrument functionality. The recent emergence of portable, nanopore-based tools with streamlined sample preparation protocols finally enables DNA sequencing on missions in microgravity. As a first step toward sequencing in space and aboard the International Space Station (ISS), we tested the Oxford Nanopore Technologies MinION during a parabolic flight to understand the effects of variable gravity on the instrument and data. In a successful proof-of-principle experiment, we found that the instrument generated DNA reads over the course of the flight, including the first ever sequenced in microgravity, and additional reads measured after the flight concluded its parabolas. Here we detail modifications to the sample-loading procedures to facilitate nanopore sequencing aboard the ISS and in other microgravity environments. We also evaluate existing analysis methods and outline two new approaches, the first based on a wave-fingerprint method and the second on entropy signal mapping. Computationally light analysis methods offer the potential for species identification, but are limited by the error profiles (stays, skips, and mismatches) of older nanopore data. Higher accuracies attainable with modified sample processing methods and the latest version of flow cells will further enable the use of nanopore sequencers for diagnostics and research in space.

Source
http://dx.doi.org/10.1038/npjmgrav.2016.35 (DOI Listing)
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5515536 (PMC)
October 2016

A Sequential Learning Approach for Scaling Up Filter-Based Feature Subset Selection.

IEEE Trans Neural Netw Learn Syst 2018 06 11;29(6):2530-2544. Epub 2017 May 11.

Increasingly, many machine learning applications are now associated with very large data sets whose sizes were almost unimaginable just a short time ago. As a result, many of the current algorithms cannot handle, or do not scale to, today's extremely large volumes of data. Fortunately, not all features that make up a typical data set carry information that is relevant or useful for prediction, and identifying and removing such irrelevant features can significantly reduce the total data size. The unfortunate dilemma, however, is that some of the current data sets are so large that common feature selection algorithms, whose very goal is to reduce the dimensionality, cannot handle such large data sets, creating a vicious cycle. We describe a sequential learning framework for feature subset selection (SLSS) that can scale with both the number of features and the number of observations. The proposed framework uses multiarm bandit algorithms to sequentially search a subset of variables, and assign a level of importance for each feature. The novel contribution of SLSS is its ability to naturally scale to large data sets, evaluate such data in a very small amount of time, and be performed independently of the optimization of any classifier to reduce unnecessary complexity. We demonstrate the capabilities of SLSS on synthetic and real-world data sets.

Source
http://dx.doi.org/10.1109/TNNLS.2017.2697407 (DOI Listing)
June 2018

Marker genes that are less conserved in their sequences are useful for predicting genome-wide similarity levels between closely related prokaryotic strains.

Microbiome 2016 05 3;4(1):18. Epub 2016 May 3.

Rachel & Menachem Mendelovitch Evolutionary Processes of Mutation & Natural Selection Research Laboratory, Department of Genetics and Developmental Biology, the Ruth and Bruce Rappaport Faculty of Medicine, Technion-Israel Institute of Technology, 31096, Haifa, Israel.

Background: The 16s rRNA gene is so far the most widely used marker for taxonomical classification and separation of prokaryotes. Since it is universally conserved among prokaryotes, it is possible to use this gene to classify a broad range of prokaryotic organisms. At the same time, it has often been noted that the 16s rRNA gene is too conserved to separate between prokaryotes at finer taxonomic levels.

Results: In this paper, we examine how well levels of similarity of 16s rRNA and 73 additional universal or nearly universal marker genes correlate with genome-wide levels of gene sequence similarity. We demonstrate that the percent identity of 16s rRNA predicts genome-wide levels of similarity very well for distantly related prokaryotes, but not for closely related ones. In closely related prokaryotes, we find that there are many other marker genes for which levels of similarity are much more predictive of genome-wide levels of gene sequence similarity. Finally, we show that the identities of the markers that are most useful for predicting genome-wide levels of similarity within closely related prokaryotic lineages vary greatly between lineages. However, the most useful markers are always those that are least conserved in their sequences within each lineage.

Conclusions: Our results show that by choosing markers that are less conserved in their sequences within a lineage of interest, it is possible to better predict genome-wide gene sequence similarity between closely related prokaryotes than is possible using the 16s rRNA gene. We point readers towards a database we have created (POGO-DB) that can be used to easily establish which markers show lowest levels of sequence conservation within different prokaryotic lineages.

Source
http://dx.doi.org/10.1186/s40168-016-0162-5 (DOI Listing)
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4853863 (PMC)
May 2016

Fizzy: feature subset selection for metagenomics.

BMC Bioinformatics 2015 Nov 4;16:358. Epub 2015 Nov 4.

Department of Electrical & Computer Engineering, Drexel University, 3141 Chestnut St., Philadelphia, 19104, PA, USA.

Background: Some of the current software tools for comparative metagenomics provide ecologists with the ability to investigate and explore bacterial communities using α- & β-diversity. Feature subset selection--a sub-field of machine learning--can also provide a unique insight into the differences between metagenomic or 16S phenotypes. In particular, feature subset selection methods can obtain the operational taxonomic units (OTUs), or functional features, that have a high level of influence on the condition being studied. For example, in a previous study we have used information-theoretic feature selection to understand the differences between protein family abundances that best discriminate between age groups in the human gut microbiome.

Results: We have developed a new Python command line tool, which is compatible with the widely adopted BIOM format, for microbial ecologists that implements information-theoretic subset selection methods for biological data formats. We demonstrate the software tool's capabilities on publicly available datasets.

Conclusions: We have made the software implementation of Fizzy available to the public under the GNU GPL license. The standalone implementation can be found at http://github.com/EESI/Fizzy.

Source
http://dx.doi.org/10.1186/s12859-015-0793-8 (DOI Listing)
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4634798 (PMC)
November 2015

Multi-Layer and Recursive Neural Networks for Metagenomic Classification.

IEEE Trans Nanobioscience 2015 Sep 24;14(6):608-16. Epub 2015 Aug 24.

Recent advances in machine learning, specifically in deep learning with neural networks, have made a profound impact on fields such as natural language processing, image classification, and language modeling; however, the feasibility and potential benefits of these approaches to metagenomic data analysis have been largely under-explored. Deep learning exploits many layers of learning nonlinear feature representations, typically in an unsupervised fashion, and recent results have shown outstanding generalization performance on previously unseen data. Furthermore, some deep learning methods can also represent the structure in a data set. Consequently, deep learning and neural networks may prove to be an appropriate approach for metagenomic data. To determine whether such approaches are indeed appropriate for metagenomics, we experiment with two deep learning methods: i) a deep belief network, and ii) a recursive neural network, the latter of which provides a tree representing the structure of the data. We compare these approaches to the standard multi-layer perceptron, which has been well-established in the machine learning community as a powerful prediction algorithm, though its presence is largely missing in metagenomics literature. We find that traditional neural networks can be quite powerful classifiers on metagenomic data compared to baseline methods, such as random forests. On the other hand, while the deep learning approaches did not result in improvements to the classification accuracy, they do provide the ability to learn hierarchical representations of a data set that standard classification methods do not allow. Our goal in this effort is not to determine the best algorithm in terms of accuracy, as that depends on the specific application, but rather to highlight the benefits and drawbacks of each of the approaches we discuss and provide insight on how they can be improved for predictive metagenomic analysis.

Source
http://dx.doi.org/10.1109/TNB.2015.2461219 (DOI Listing)
September 2015

Prokaryotic nucleotide composition is shaped by both phylogeny and the environment.

Genome Biol Evol 2015 Apr 9;7(5):1380-9. Epub 2015 Apr 9.

Rachel and Menachem Mendelovitch Evolutionary Processes of Mutation and Natural Selection Research Laboratory, Department of Genetics and Developmental Biology, The Ruth and Bruce Rappaport Faculty of Medicine, Technion-Israel Institute of Technology, Haifa, Israel.

The causes of the great variation in nucleotide composition of prokaryotic genomes have long been disputed. Here, we use extensive metagenomic and whole-genome data to demonstrate that both phylogeny and the environment shape prokaryotic nucleotide content. We show that across environments, various phyla are characterized by different mean guanine and cytosine (GC) values as well as by the extent of variation on that mean value. At the same time, we show that GC-content varies greatly as a function of environment, in a manner that cannot be entirely explained by disparities in phylogenetic composition. We find environmentally driven differences in nucleotide content not only between highly diverged environments (e.g., soil vs. aquatic vs. human gut) but also within a single type of environment. More specifically, we demonstrate that some human guts are associated with a microbiome that is consistently more GC-rich across phyla, whereas others are associated with a more AT-rich microbiome. These differences appear to be driven both by variations in phylogenetic composition and by environmental differences, which are independent of these phylogenetic composition differences. Combined, our results demonstrate that both phylogeny and the environment significantly affect nucleotide composition and that the environmental differences affecting nucleotide composition are far subtler than previously appreciated.

Source
http://dx.doi.org/10.1093/gbe/evv063 (DOI Listing)
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4453058 (PMC)
April 2015

A bootstrap based Neyman-Pearson test for identifying variable importance.

IEEE Trans Neural Netw Learn Syst 2015 Apr;26(4):880-6

Selection of the most informative features that lead to a small loss on future data is arguably one of the most important steps in classification, data analysis and model selection. Several feature selection (FS) algorithms are available; however, due to noise present in any data set, FS algorithms are typically accompanied by an appropriate cross-validation scheme. In this brief, we propose a statistical hypothesis test derived from the Neyman-Pearson lemma for determining if a feature is statistically relevant. The proposed approach can be applied as a wrapper to any FS algorithm, regardless of the FS criteria used by that algorithm, to determine whether a feature belongs in the relevant set. Perhaps more importantly, this procedure efficiently determines the number of relevant features given an initial starting point. We provide freely available software implementations of the proposed methodology.

Source
http://dx.doi.org/10.1109/TNNLS.2014.2320415 (DOI Listing)
April 2015

Adenovirus and herpesvirus diversity in free-ranging great apes in the Sangha region of the Republic of Congo.

PLoS One 2015 17;10(3):e0118543. Epub 2015 Mar 17.

Center for Infection and Immunity, Columbia University, New York, New York, United States of America.

Infectious diseases have caused die-offs in both free-ranging gorillas and chimpanzees. Understanding pathogen diversity and disease ecology is therefore critical for conserving these endangered animals. To determine viral diversity in free-ranging, non-habituated gorillas and chimpanzees in the Republic of Congo, genetic testing was performed on great-ape fecal samples collected near Odzala-Kokoua National Park. Samples were analyzed to determine ape species, identify individuals in the population, and to test for the presence of herpesviruses, adenoviruses, poxviruses, bocaviruses, flaviviruses, paramyxoviruses, coronaviruses, filoviruses, and simian immunodeficiency virus (SIV). We identified 19 DNA viruses representing two viral families, Herpesviridae and Adenoviridae, of which three herpesviruses had not been previously described. Co-detections of multiple herpesviruses and/or adenoviruses were present in both gorillas and chimpanzees. Cytomegalovirus (CMV) and lymphocryptovirus (LCV) were found primarily in the context of co-association with each other and adenoviruses. Using viral discovery curves for herpesviruses and adenoviruses, the total viral richness in the sample population of gorillas and chimpanzees was estimated to be a minimum of 23 viruses, corresponding to a detection rate of 83%. These findings represent the first description of DNA viral diversity in feces from free-ranging gorillas and chimpanzees in or near the Odzala-Kokoua National Park and form a basis for understanding the types of viruses circulating among great apes in this region.

Source
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0118543 (PLOS)
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4362762 (PMC)
February 2016

A toolkit for ARB to integrate custom databases and externally built phylogenies.

PLoS One 2015 21;10(1):e0109277. Epub 2015 Jan 21.

Department of Electrical & Computer Engineering, Drexel University, Philadelphia, Pennsylvania, United States of America.

Researchers are perpetually amassing biological sequence data. The computational approaches employed by ecologists for organizing this data (e.g. alignment, phylogeny, etc.) typically scale nonlinearly in execution time with the size of the dataset. This often serves as a bottleneck for processing experimental data since many molecular studies are characterized by massive datasets. To keep up with experimental data demands, ecologists are forced to choose between continually upgrading expensive in-house computer hardware or outsourcing the most demanding computations to the cloud. Outsourcing is attractive since it is the least expensive option, but does not necessarily allow direct user interaction with the data for exploratory analysis. Desktop analytical tools such as ARB are indispensable for this purpose, but they do not necessarily offer a convenient solution for the coordination and integration of datasets between local and outsourced destinations. Therefore, researchers are currently left with an undesirable tradeoff between computational throughput and analytical capability. To mitigate this tradeoff we introduce a software package to leverage the utility of the interactive exploratory tools offered by ARB with the computational throughput of cloud-based resources. Our pipeline serves as middleware between the desktop and the cloud allowing researchers to form local custom databases containing sequences and metadata from multiple resources and a method for linking data outsourced for computation back to the local database. A tutorial implementation of the toolkit is provided in the supporting information, S1 Tutorial.

Availability: http://www.ece.drexel.edu/gailr/EESI/tutorial.php.

Source
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0109277 (PLOS)
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4301908 (PMC)
December 2015