Publications by authors named "Anton Nekrutenko"

76 Publications

Fostering accessible online education using Galaxy as an e-learning platform.

PLoS Comput Biol 2021 May 13;17(5):e1008923. Epub 2021 May 13.

Bioinformatics Group, Department of Computer Science, Albert-Ludwigs-University Freiburg, Freiburg, Germany.

The COVID-19 pandemic is shifting teaching to an online setting all over the world. The Galaxy framework facilitates the online learning process and makes it accessible by providing a library of high-quality community-curated training materials, enabling easy access to data and tools, and facilitating the sharing of achievements and progress between students and instructors. By combining Galaxy with robust communication channels, effective instruction can be designed inclusively, regardless of the students' environments.
DOI: http://dx.doi.org/10.1371/journal.pcbi.1008923
May 2021

Sequencing error profiles of Illumina sequencing instruments.

NAR Genom Bioinform 2021 Mar 27;3(1):lqab019. Epub 2021 Mar 27.

Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, PA 16802, USA.

Sequencing technology has achieved great advances in the past decade. Studies have previously shown the quality of specific instruments in controlled conditions. Here, we developed a method able to retroactively determine the error rate of most public sequencing datasets. To do this, we utilized the overlaps between reads that are a feature of many sequencing libraries. With this method, we surveyed 1943 different datasets from seven different sequencing instruments produced by Illumina. We show that among public datasets, the more expensive platforms like HiSeq and NovaSeq have a lower error rate and less variation. But we also discovered that there is great variation within each platform, with the accuracy of a sequencing experiment depending greatly on the experimenter. We show the importance of sequence context, especially the phenomenon where preceding bases bias the following bases toward the same identity. We also show the difference in patterns of sequence bias between instruments. Contrary to expectations based on the underlying chemistry, HiSeq X Ten and NovaSeq 6000 share notable exceptions to the preceding-base bias. Our results demonstrate the importance of the specific circumstances of every sequencing experiment, and the importance of evaluating the quality of each one.
DOI: http://dx.doi.org/10.1093/nargab/lqab019
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC8002175
March 2021
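
The overlap idea in the abstract above can be illustrated with a minimal Python sketch. This is not the authors' method: it assumes the two mates of a pair were read from opposite ends of one fragment and that the overlap length is already known, and the function names are hypothetical. Where the two mates cover the same fragment positions, any disagreement means at least one of the two base calls is wrong.

```python
from typing import Tuple

# Translation table for complementing DNA bases (N stays N).
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def overlap_mismatches(read1: str, read2: str, overlap: int) -> Tuple[int, int]:
    """Count disagreements where two mates of a read pair overlap.

    Assumption (illustrative only): the last `overlap` bases of read1 and the
    first `overlap` bases of the reverse-complemented read2 cover the same
    fragment positions. Positions where either mate has an N are skipped.
    Returns (mismatches, positions_compared).
    """
    if overlap <= 0:
        return (0, 0)
    tail = read1[-overlap:]
    head = reverse_complement(read2)[:overlap]
    mismatches = 0
    compared = 0
    for a, b in zip(tail, head):
        if a == "N" or b == "N":
            continue
        compared += 1
        if a != b:
            mismatches += 1
    return (mismatches, compared)
```

Summing both counts over many pairs and dividing mismatches by positions compared gives a rough disagreement rate for a dataset. Converting that into a per-base error rate requires further assumptions that this sketch does not make.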

Freely accessible ready to use global infrastructure for SARS-CoV-2 monitoring.

bioRxiv 2021 Mar 25. Epub 2021 Mar 25.

The COVID-19 pandemic is the first global health crisis to occur in the age of big genomic data. Although data generation capacity is well established and sufficiently standardized, analytical capacity is not. To establish analytical capacity it is necessary to pull together global computational resources and deliver the best open source tools and analysis workflows within a ready-to-use, universally accessible resource. Such a resource should not be controlled by a single research group, institution, or country. Instead it should be maintained by a community of users and developers who ensure that the system remains operational and populated with current tools. A community is also essential for facilitating the types of discourse needed to establish best analytical practices. Bringing together public computational research infrastructure from the USA, Europe, and Australia, we developed a distributed data analysis platform that accomplishes these goals. It is immediately accessible to anyone in the world and is designed for the analysis of rapidly growing collections of deep sequencing datasets. We demonstrate its utility by detecting allelic variants in high-quality existing SARS-CoV-2 sequencing datasets and by continuous reanalysis of COG-UK data. All workflows, data, and documentation are available at https://covid19.galaxyproject.org.
DOI: http://dx.doi.org/10.1101/2021.03.25.437046
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC8010728
March 2021

Erratum: Increased yields of duplex sequencing data by a series of quality control tools.

NAR Genom Bioinform 2021 Mar 1;3(1):lqab014. Epub 2021 Mar 1.

Institute of Biophysics, Johannes Kepler University, 4020 Linz, Austria.

[This corrects the article DOI: 10.1093/nargab/lqab002.].
DOI: http://dx.doi.org/10.1093/nargab/lqab014
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7936659
March 2021

Using Galaxy to Perform Large-Scale Interactive Data Analyses-An Update.

Curr Protoc 2021 Feb;1(2):e31

Penn State University, University Park, Pennsylvania.

Modern biology continues to become increasingly computational. Datasets are becoming progressively larger, more complex, and more abundant. The computational savviness necessary to analyze these data creates an ongoing obstacle for experimental biologists. Galaxy (galaxyproject.org) provides access to computational biology tools in a web-based interface. It also provides access to major public biological data repositories, allowing private data to be combined with public datasets. Galaxy is hosted on high-capacity servers worldwide and is accessible for free, with an option to be installed locally. This article demonstrates how to employ Galaxy to perform biologically relevant analyses on publicly available datasets. These protocols use both standard and custom tools, serving as a tutorial and jumping-off point for more intensive and/or more specific analyses using Galaxy. © 2021 Wiley Periodicals LLC. Basic Protocol 1: Finding human coding exons with highest SNP density Basic Protocol 2: Calling peaks for ChIP-seq data Basic Protocol 3: Compare datasets using genomic coordinates Basic Protocol 4: Working with multiple alignments Basic Protocol 5: Single cell RNA-seq.
DOI: http://dx.doi.org/10.1002/cpz1.31
February 2021

Increased yields of duplex sequencing data by a series of quality control tools.

NAR Genom Bioinform 2021 Mar 9;3(1):lqab002. Epub 2021 Feb 9.

Institute of Biophysics, Johannes Kepler University, 4020 Linz, Austria.

Duplex sequencing is currently the most reliable method to identify ultra-low frequency DNA variants by grouping sequence reads derived from the same DNA molecule into families with information on the forward and reverse strand. However, only a small proportion of reads are assembled into duplex consensus sequences (DCS), and reads with potentially valuable information are discarded at different steps of the bioinformatics pipeline, especially reads without a family. We developed a bioinformatics toolset that analyses the tag and family composition in order to understand data loss and implement modifications that maximize the data output for variant calling. Specifically, our tools show that tags contain polymerase chain reaction and sequencing errors that contribute to data loss and lower DCS yields. Our tools also identified chimeras, which likely reflect barcode collisions. Finally, we also developed a tool that re-examines variant calls from raw reads and provides different summary data that categorizes the confidence level of a variant call by a tier-based system. With this tool, we can include reads without a family and check the reliability of the call, which substantially increases the sequencing depth for variant calling, a particularly important advantage for low-input samples or low-coverage regions.
DOI: http://dx.doi.org/10.1093/nargab/lqab002
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7872198
March 2021

A single-cell RNA-sequencing training and analysis suite using the Galaxy framework.

Gigascience 2020 10;9(10)

Department of Bioinformatics, University of Freiburg, Georges-Köhler-Allee 106, 79110 Freiburg, Germany.

Background: The vast ecosystem of single-cell RNA-sequencing tools has until recently been plagued by an excess of diverging analysis strategies, inconsistent file formats, and compatibility issues between different software suites. The uptake of 10x Genomics datasets has begun to calm this diversity, and the bioinformatics community leans once more towards the large computing requirements and the statistically driven methods needed to process and understand these ever-growing datasets.

Results: Here we outline several Galaxy workflows and learning resources for single-cell RNA-sequencing, with the aim of providing a comprehensive analysis environment paired with a thorough user learning experience that bridges the knowledge gap between the computational methods and the underlying cell biology. The Galaxy reproducible bioinformatics framework provides tools, workflows, and trainings that not only enable users to perform 1-click 10x preprocessing but also empower them to demultiplex raw sequencing from custom tagged and full-length sequencing protocols. The downstream analysis supports a range of high-quality interoperable suites separated into common stages of analysis: inspection, filtering, normalization, confounder removal, and clustering. The teaching resources cover concepts from computer science to cell biology. Access to all resources is provided at the singlecell.usegalaxy.eu portal.

Conclusions: The reproducible and training-oriented Galaxy framework provides a sustainable high-performance computing environment for users to run flexible analyses on both 10x and alternative platforms. The tutorials from the Galaxy Training Network along with the frequent training workshops hosted by the Galaxy community provide a means for users to learn, publish, and teach single-cell RNA-sequencing analysis.
DOI: http://dx.doi.org/10.1093/gigascience/giaa102
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7574357
October 2020

No more business as usual: Agile and effective responses to emerging pathogen threats require open data and open analytics.

PLoS Pathog 2020 08 13;16(8):e1008643. Epub 2020 Aug 13.

Temple University, Philadelphia, Pennsylvania, United States of America.

The current state of much of the Wuhan pneumonia virus (severe acute respiratory syndrome coronavirus 2 [SARS-CoV-2]) research shows a regrettable lack of data sharing and considerable analytical obfuscation. This impedes global research cooperation, which is essential for tackling public health emergencies and requires unimpeded access to data, analysis tools, and computational infrastructure. Here, we show that community efforts in developing open analytical software tools over the past 10 years, combined with national investments into scientific computational infrastructure, can overcome these deficiencies and provide an accessible platform for tackling global health emergencies in an open and transparent manner. Specifically, we use all SARS-CoV-2 genomic data available in the public domain so far to (1) underscore the importance of access to raw data and (2) demonstrate that existing community efforts in curation and deployment of biomedical software can reliably support rapid, reproducible research during global health crises. All our analyses are fully documented at https://github.com/galaxyproject/SARS-CoV-2.
DOI: http://dx.doi.org/10.1371/journal.ppat.1008643
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7425854
August 2020

The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2020 update.

Nucleic Acids Res 2020 07;48(W1):W395-W402

Department of Biochemistry and Molecular Biology, Penn State University, University Park, PA, USA.

Galaxy (https://galaxyproject.org) is a web-based computational workbench used by tens of thousands of scientists across the world to analyze large biomedical datasets. Since 2005, the Galaxy project has fostered a global community focused on achieving accessible, reproducible, and collaborative research. Together, this community develops the Galaxy software framework, integrates analysis tools and visualizations into the framework, runs public servers that make Galaxy available via a web browser, performs and publishes analyses using Galaxy, leads bioinformatics workshops that introduce and use Galaxy, and develops interactive training materials for Galaxy. Over the last two years, all aspects of the Galaxy project have grown: code contributions, tools integrated, users, and training materials. Key advances in Galaxy's user interface include enhancements for analyzing large dataset collections as well as interactive tools for exploratory data analysis. Extensions to Galaxy's framework include support for federated identity and access management and increased ability to distribute analysis jobs to remote resources. New community resources include large public servers in Europe and Australia, an increasing number of regional and local Galaxy communities, and substantial growth in the Galaxy Training Network.
DOI: http://dx.doi.org/10.1093/nar/gkaa434
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7319590
July 2020

In memory of James Taylor: the birth of Galaxy.

Genome Biol 2020 04 30;21(1):105. Epub 2020 Apr 30.

Departments of Computer Science and Biology, Johns Hopkins University, Baltimore, MD, USA.

DOI: http://dx.doi.org/10.1186/s13059-020-02016-0
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7193333
April 2020

Family reunion via error correction: an efficient analysis of duplex sequencing data.

BMC Bioinformatics 2020 Mar 4;21(1):96. Epub 2020 Mar 4.

Graduate Program in Bioinformatics and Genomics, The Huck Institutes for Life Sciences, The Pennsylvania State University, University Park, PA, USA.

Background: Duplex sequencing is the most accurate approach for identification of sequence variants present at very low frequencies. Its power comes from pooling together multiple descendants of both strands of original DNA molecules, which allows distinguishing true nucleotide substitutions from PCR amplification and sequencing artifacts. This strategy comes at a cost: sequencing the same molecule multiple times increases dynamic range but significantly diminishes coverage, making whole genome duplex sequencing prohibitively expensive. Furthermore, every duplex experiment produces a substantial proportion of singleton reads that cannot be used in the analysis and are thrown away.

Results: In this paper we demonstrate that a significant fraction of these reads contains PCR or sequencing errors within duplex tags. Correction of such errors allows "reuniting" these reads with their respective families, increasing the output of the method and making it more cost-effective.

Conclusions: We combine an error correction strategy with a number of algorithmic improvements in a new version of the duplex analysis software, Du Novo 2.0. It is written in Python, C, AWK, and Bash. It is open source and readily available through Galaxy, Bioconda, and Github: https://github.com/galaxyproject/dunovo.
DOI: http://dx.doi.org/10.1186/s12859-020-3419-8
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7057607
March 2020
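
The "reuniting" step described in the abstract above can be sketched in Python under simplifying assumptions. This is not the Du Novo 2.0 algorithm: it uses a plain Hamming-distance comparison between tags, and the function names and the `max_dist` parameter are hypothetical. A singleton tag is reassigned only when exactly one larger family lies within the allowed number of mismatches, so ambiguous cases are left alone.

```python
from collections import Counter
from typing import Dict


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length tags differ."""
    if len(a) != len(b):
        raise ValueError("tags must have equal length")
    return sum(x != y for x, y in zip(a, b))


def correct_tags(tag_counts: Dict[str, int], max_dist: int = 1) -> Dict[str, str]:
    """Map each singleton tag to a unique nearby family tag, if one exists.

    Illustrative assumption: a tag seen once that is within `max_dist`
    mismatches of exactly one tag seen more than once is treated as an
    erroneous copy of that tag.
    """
    families = [tag for tag, n in tag_counts.items() if n > 1]
    corrections: Dict[str, str] = {}
    for tag, n in tag_counts.items():
        if n != 1:
            continue
        near = [f for f in families
                if len(f) == len(tag) and hamming(tag, f) <= max_dist]
        if len(near) == 1:  # skip ambiguous or unmatched singletons
            corrections[tag] = near[0]
    return corrections


def merge_counts(tag_counts: Dict[str, int],
                 corrections: Dict[str, str]) -> Dict[str, int]:
    """Recompute family sizes after applying the tag corrections."""
    merged: Counter = Counter()
    for tag, n in tag_counts.items():
        merged[corrections.get(tag, tag)] += n
    return dict(merged)
```

For example, with counts `{"AAAA": 5, "AAAT": 1, "CCCC": 3, "GGGG": 1}` the singleton `AAAT` joins the `AAAA` family, while `GGGG` has no close neighbor and stays a singleton.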

Bottleneck and selection in the germline and maternal age influence transmission of mitochondrial DNA in human pedigrees.

Proc Natl Acad Sci U S A 2019 12 22;116(50):25172-25178. Epub 2019 Nov 22.

Department of Biology, Penn State University, University Park, PA 16802.

Heteroplasmy, the presence of multiple mitochondrial DNA (mtDNA) haplotypes in an individual, can lead to numerous mitochondrial diseases. The presentation of such diseases depends on the frequency of the heteroplasmic variant in tissues, which, in turn, depends on the dynamics of mtDNA transmissions during germline and somatic development. Thus, understanding and predicting these dynamics between generations and within individuals is medically relevant. Here, we study patterns of heteroplasmy in 2 tissues from each of 345 humans in 96 multigenerational families, each with at least 2 siblings (a total of 249 mother-child transmissions). This experimental design has allowed us to estimate the timing of mtDNA mutations, drift, and selection with unprecedented precision. Our results are remarkably concordant between 2 complementary population-genetic approaches. We find evidence for a severe germline bottleneck (7-10 mtDNA segregating units) that occurs independently in different oocyte lineages from the same mother, while somatic bottlenecks are less severe. We demonstrate that divergence between mother and offspring increases with the mother's age at childbirth, likely due to continued drift of heteroplasmy frequencies in oocytes under meiotic arrest. We show that this period is also accompanied by mutation accumulation leading to more de novo mutations in children born to older mothers. We show that heteroplasmic variants at intermediate frequencies can segregate for many generations in the human population, despite the strong germline bottleneck. We show that selection acts during germline development to keep the frequency of putatively deleterious variants from rising. Our findings have important applications for clinical genetics and genetic counseling.
DOI: http://dx.doi.org/10.1073/pnas.1906331116
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6911200
December 2019

A High-Resolution View of Adaptive Event Dynamics in a Plasmid.

Genome Biol Evol 2019 10;11(10):3022-3034

Department of Biochemistry and Molecular Biology, The Pennsylvania State University.

Coadaptation between bacterial hosts and plasmids frequently results in adaptive changes restricted exclusively to the host genome, leaving plasmids unchanged. To better understand this remarkable stability, we transformed naïve Escherichia coli cells with a plasmid carrying an antibiotic-resistance gene and forced them to adapt in a turbidostat environment. We then drew population samples at regular intervals and subjected them to duplex sequencing, a technique specifically designed for identification of low-frequency mutations. Variants at ten sites implicated in plasmid copy number control emerged almost immediately, tracked consistently across the experiment's time points, and faded below detectable frequencies toward the end. This variation crash coincided with the emergence of mutations on the host chromosome. Mathematical modeling of trajectories for adaptive changes affecting plasmid copy number showed that such mutations cannot readily fix or even reach appreciable frequencies. We conclude that there is a strong selection against alterations of copy number even if it can provide a degree of growth advantage. This incentive is likely rooted in the complex interplay between mutated and wild-type plasmids constrained within a single cell and underscores the importance of understanding intracellular plasmid variability.
DOI: http://dx.doi.org/10.1093/gbe/evz197
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6827461
October 2019

HyPhy 2.5-A Customizable Platform for Evolutionary Hypothesis Testing Using Phylogenies.

Mol Biol Evol 2020 Jan;37(1):295-299

Department of Statistics and Bioinformatics Research Center, North Carolina State University, Raleigh, NC.

HYpothesis testing using PHYlogenies (HyPhy) is a scriptable, open-source package for fitting a broad range of evolutionary models to multiple sequence alignments, and for conducting subsequent parameter estimation and hypothesis testing, primarily in the maximum likelihood statistical framework. It has become a popular choice for characterizing various aspects of the evolutionary process: natural selection, evolutionary rates, recombination, and coevolution. The 2.5 release (available from www.hyphy.org) includes a completely re-engineered computational core and analysis library that introduces new classes of evolutionary models and statistical tests, delivers substantial performance and stability enhancements, improves usability, streamlines end-to-end analysis workflows, makes it easier to develop custom analyses, and is mostly backward compatible with previous HyPhy releases.
DOI: http://dx.doi.org/10.1093/molbev/msz197
January 2020

Predicting runtimes of bioinformatics tools based on historical data: five years of Galaxy usage.

Bioinformatics 2019 09;35(18):3453-3460

Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, USA.

Motivation: One of the many technical challenges that arise when scheduling bioinformatics analyses at scale is determining the appropriate amount of memory and processing resources. Both over- and under-allocation lead to an inefficient use of computational infrastructure. Over-allocation locks resources that could otherwise be used for other analyses. Under-allocation causes job failure and requires analyses to be repeated with a larger memory or runtime allowance. We address this challenge by using a historical dataset of bioinformatics analyses run on the Galaxy platform to demonstrate the feasibility of an online service for resource requirement estimation.

Results: Here we introduced the Galaxy job run dataset and tested popular machine learning models on the task of resource usage prediction. We include three popular forest models: the extra trees regressor, the gradient boosting regressor and the random forest regressor, and find that random forests perform best in the runtime prediction task. We also present two methods of choosing walltimes for previously unseen jobs. Quantile regression forests are more accurate in their predictions, and grant the ability to improve performance by changing the confidence of the estimates. However, the sizes of the confidence intervals are variable and cannot be absolutely constrained. Random forest classifiers address this problem by providing control over the size of the prediction intervals with an accuracy that is comparable to that of the regressor. We show that estimating the memory requirements of a job is possible using the same methods, which as far as we know, has not been done before. Such estimation can be highly beneficial for accurate resource allocation.

Availability And Implementation: Source code available at https://github.com/atyryshkina/algorithm-performance-analysis, implemented in Python.

Supplementary Information: Supplementary data are available at Bioinformatics online.
DOI: http://dx.doi.org/10.1093/bioinformatics/btz054
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6931352
September 2019
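
The trade-off between confidence and walltime described in the abstract above can be illustrated with a deliberately simpler stand-in. The sketch below does not use quantile regression forests or random forest classifiers. It assumes only a list of historical runtimes for one tool and picks the walltime as an empirical quantile (nearest-rank method). The function name and the `confidence` parameter are hypothetical.

```python
import math
from typing import Sequence


def walltime_for(runtimes: Sequence[float], confidence: float = 0.95) -> float:
    """Pick a walltime that would have covered `confidence` of past runs.

    Stand-in for a learned model: uses the nearest-rank empirical quantile
    of historical runtimes, ignoring all job attributes.
    """
    if not runtimes:
        raise ValueError("need at least one historical runtime")
    if not 0 < confidence <= 1:
        raise ValueError("confidence must be in (0, 1]")
    ordered = sorted(runtimes)
    rank = math.ceil(confidence * len(ordered))  # 1-based nearest rank
    return ordered[rank - 1]
```

Raising `confidence` lowers the share of jobs that would be killed for exceeding their walltime, at the cost of reserving more time per job. That is the same tension the abstract describes for over- and under-allocation.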

Child Weight Gain Trajectories Linked To Oral Microbiota Composition.

Sci Rep 2018 09 19;8(1):14030. Epub 2018 Sep 19.

Center for Medical Genomics, Penn State University, University Park, PA, 16802, USA.

Gut and oral microbiota perturbations have been observed in obese adults and adolescents; less is known about their influence on weight gain in young children. Here we analyzed the gut and oral microbiota of 226 two-year-olds with 16S rRNA gene sequencing. Weight and length were measured at seven time points and used to identify children with rapid infant weight gain (a strong risk factor for childhood obesity), and to derive growth curves with innovative Functional Data Analysis (FDA) techniques. We showed that growth curves were associated negatively with diversity, and positively with the Firmicutes-to-Bacteroidetes ratio, of the oral microbiota. We also demonstrated an association between the gut microbiota and child growth, even after controlling for the effect of diet on the microbiota. Lastly, we identified several bacterial genera that were associated with child growth patterns. These results suggest that by the age of two, the oral microbiota of children with rapid infant weight gain may have already begun to establish patterns often seen in obese adults. They also suggest that the gut microbiota at age two, while strongly influenced by diet, does not harbor obesity signatures many researchers identified in later life stages.
DOI: http://dx.doi.org/10.1038/s41598-018-31866-9
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6145887
September 2018
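
The two microbiota summaries named in the abstract above, diversity and the Firmicutes-to-Bacteroidetes ratio, can be computed from count tables as in this minimal Python sketch. The abstract does not say which diversity measure was used. The Shannon index here is an assumption chosen for illustration, and the function names and input layout are hypothetical.

```python
import math
from typing import Mapping


def shannon_diversity(counts: Mapping[str, int]) -> float:
    """Shannon index H = -sum(p * ln p) over taxa with nonzero counts."""
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("counts must sum to a positive number")
    return -sum((n / total) * math.log(n / total)
                for n in counts.values() if n > 0)


def firmicutes_bacteroidetes_ratio(phylum_counts: Mapping[str, int]) -> float:
    """Ratio of Firmicutes to Bacteroidetes read counts in one sample."""
    bacteroidetes = phylum_counts.get("Bacteroidetes", 0)
    if bacteroidetes == 0:
        raise ValueError("no Bacteroidetes counts; ratio is undefined")
    return phylum_counts.get("Firmicutes", 0) / bacteroidetes
```

A sample split evenly between two taxa has a Shannon index of ln 2, and a sample containing a single taxon has an index of 0.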

Community-Driven Data Analysis Training for Biology.

Cell Syst 2018 06;6(6):752-758.e1

Department of Systems Biology and Bioinformatics, University of Rostock, Ulmenstraße 69, Rostock 18051, Germany.

The primary problem with the explosion of biomedical datasets is not the data, not computational resources, and not the required storage space, but the general lack of trained and skilled researchers to manipulate and analyze these data. Eliminating this problem requires development of comprehensive educational resources. Here we present a community-driven framework that enables modern, interactive teaching of data analytics in life sciences and facilitates the development of training materials. The key feature of our system is that it is not a static but a continuously improved collection of tutorials. By coupling tutorials with a web-based analysis framework, biomedical researchers can learn by performing computation themselves through a web browser without the need to install software or search for example datasets. Our ultimate goal is to expand the breadth of training materials to include fundamental statistical and data science topics and to precipitate a complete re-engineering of undergraduate and graduate curricula in life sciences. This project is accessible at https://training.galaxyproject.org.
DOI: http://dx.doi.org/10.1016/j.cels.2018.05.012
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6296361
June 2018

Practical Computational Reproducibility in the Life Sciences.

Cell Syst 2018 06;6(6):631-635

Johns Hopkins University, Baltimore, MD, USA.

Many areas of research suffer from poor reproducibility, particularly in computationally intensive domains where results rely on a series of complex methodological decisions that are not well captured by traditional publication approaches. Various guidelines have emerged for achieving reproducibility, but implementation of these practices remains difficult due to the challenge of assembling software tools plus associated libraries, connecting tools together into pipelines, and specifying parameters. Here, we discuss a suite of cutting-edge technologies that make computational reproducibility not just possible, but practical in both time and effort. This suite combines three well-tested components (a system for building highly portable packages of bioinformatics software, containerization and virtualization technologies for isolating reusable execution environments for these packages, and workflow systems that automatically orchestrate the composition of these packages for entire pipelines) to achieve an unprecedented level of computational reproducibility. We also provide a practical implementation and five recommendations to help set a typical researcher on the path to performing data analyses reproducibly.
DOI: http://dx.doi.org/10.1016/j.cels.2018.03.014
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6263957
June 2018

The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2018 update.

Nucleic Acids Res 2018 07;46(W1):W537-W544

Genomic Medicine Institute, Lerner Research Institute, Cleveland Clinic, Cleveland, OH, USA.

Galaxy (homepage: https://galaxyproject.org, main public server: https://usegalaxy.org) is a web-based scientific analysis platform used by tens of thousands of scientists across the world to analyze large biomedical datasets such as those found in genomics, proteomics, metabolomics and imaging. Started in 2005, Galaxy continues to focus on three key challenges of data-driven biomedical science: making analyses accessible to all researchers, ensuring analyses are completely reproducible, and making it simple to communicate analyses so that they can be reused and extended. During the last two years, the Galaxy team and the open-source community around Galaxy have made substantial improvements to Galaxy's core framework, user interface, tools, and training materials. Framework and user interface improvements now enable Galaxy to be used for analyzing tens of thousands of datasets, and >5500 tools are now available from the Galaxy ToolShed. The Galaxy community has led an effort to create numerous high-quality tutorials focused on common types of genomic analyses. The Galaxy developer and user communities continue to grow and be integral to Galaxy's development. The number of Galaxy public servers, developers contributing to the Galaxy framework and its tools, and users of the main Galaxy server have all increased substantially.
DOI: http://dx.doi.org/10.1093/nar/gky379
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6030816
July 2018

Biology Needs Evolutionary Software Tools: Let's Build Them Right.

Mol Biol Evol 2018 06;35(6):1372-1375

Lerner Research Institute, Cleveland Clinic, Cleveland, OH.

Research in population genetics and evolutionary biology has always provided a computational backbone for life sciences as a whole. Today evolutionary and population biology reasoning is essential for interpretation of large complex datasets that are characteristic of all domains of today's life sciences ranging from cancer biology to microbial ecology. This situation makes algorithms and software tools developed by our community more important than ever before. This means that we, developers of software tools for molecular evolutionary analyses, now have a shared responsibility to make these tools accessible using modern technological developments as well as provide adequate documentation and training.
DOI: http://dx.doi.org/10.1093/molbev/msy084
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5967460
June 2018

Jupyter and Galaxy: Easing entry barriers into complex data analyses for biomedical researchers.

PLoS Comput Biol 2017 05 25;13(5):e1005425. Epub 2017 May 25.

Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, Pennsylvania, United States of America.

What does it take to convert a heap of sequencing data into a publishable result? First, common tools are employed to reduce primary data (sequencing reads) to a form suitable for further analyses (i.e., the list of variable sites). The subsequent exploratory stage is much more ad hoc and requires the development of custom scripts and pipelines, making it problematic for biomedical researchers. Here, we describe a hybrid platform combining common analysis pathways with the ability to explore data interactively. It aims to fully encompass and simplify the "raw data-to-publication" pathway and make it reproducible.

Source
DOI: http://dx.doi.org/10.1371/journal.pcbi.1005425
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5444614
May 2017

Streamlined analysis of duplex sequencing data with Du Novo.

Genome Biol 2016 08 26;17(1):180. Epub 2016 Aug 26.

Graduate Program in Bioinformatics and Genomics, The Huck Institutes for the Life Sciences, Penn State University, 505 Wartik Lab, University Park, PA, 16802, USA.

Duplex sequencing was originally developed to detect rare nucleotide polymorphisms normally obscured by the noise of high-throughput sequencing. Here we describe a new, streamlined, reference-free approach for the analysis of duplex sequencing data. We show that the approach performs well on simulated data and precisely reproduces previously published results, and we apply it to a newly produced dataset, enabling us to type low-frequency variants in human mitochondrial DNA. Finally, we provide all necessary tools as stand-alone components and integrate them into the Galaxy platform. All analyses performed in this manuscript can be repeated exactly as described at http://usegalaxy.org/duplex.

Source
DOI: http://dx.doi.org/10.1186/s13059-016-1039-4
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5000403
August 2016

The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2016 update.

Nucleic Acids Res 2016 07 2;44(W1):W3-W10. Epub 2016 May 2.

The Computational Biology Institute, George Washington University, Washington DC, USA.

High-throughput data production technologies, particularly 'next-generation' DNA sequencing, have ushered in widespread and disruptive changes to biomedical research. Making sense of the large datasets produced by these technologies requires sophisticated statistical and computational methods, as well as substantial computational power. This has led to an acute crisis in life sciences, as researchers without informatics training attempt to perform computation-dependent analyses. Since 2005, the Galaxy project has worked to address this problem by providing a framework that makes advanced computational tools usable by non-experts. Galaxy seeks to make data-intensive research more accessible, transparent and reproducible by providing a Web-based environment in which users can perform computational analyses and have all of the details automatically tracked for later inspection, publication, or reuse. In this report we highlight recently added features enabling biomedical analyses on a large scale.

Source
DOI: http://dx.doi.org/10.1093/nar/gkw343
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4987906
July 2016

Integrative genomic analysis by interoperation of bioinformatics tools in GenomeSpace.

Nat Methods 2016 Mar 18;13(3):245-247. Epub 2016 Jan 18.

The Broad Institute of MIT and Harvard, Cambridge, MA, USA.

Complex biomedical analyses require the use of multiple software tools in concert and remain challenging for much of the biomedical research community. We introduce GenomeSpace (http://www.genomespace.org), a cloud-based, cooperative community resource that currently supports the streamlined interaction of 20 bioinformatics tools and data resources. To facilitate integrative analysis by non-programmers, it offers a growing set of 'recipes', short workflows to guide investigators through high-utility analysis tasks.

Source
DOI: http://dx.doi.org/10.1038/nmeth.3732
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4767623
March 2016

StructureFold: genome-wide RNA secondary structure mapping and reconstruction in vivo.

Bioinformatics 2015 Aug 16;31(16):2668-75. Epub 2015 Apr 16.

Department of Biology, Center for RNA Molecular Biology, Bioinformatics and Genomics Graduate Program, Plant Biology Graduate Program, Pennsylvania State University, University Park, Pennsylvania 16802, USA.

Motivation: RNAs fold into complex structures that are integral to the diverse mechanisms underlying RNA regulation of gene expression. Recent development of transcriptome-wide RNA structure profiling through the application of structure-probing enzymes or chemicals combined with high-throughput sequencing has opened a new field that greatly expands the amount of in vitro and in vivo RNA structural information available. The resultant datasets provide the opportunity to investigate RNA structural information on a global scale. However, the analysis of high-throughput RNA structure profiling data requires considerable computational effort and expertise.

Results: We present a new platform, StructureFold, that provides an integrated computational solution designed specifically for large-scale RNA structure mapping and reconstruction across any transcriptome. StructureFold automates the processing and analysis of raw high-throughput RNA structure profiling data, allowing the seamless incorporation of wet-bench structural information from chemical probes and/or ribonucleases to restrain RNA secondary structure prediction via the RNAstructure and ViennaRNA package algorithms. StructureFold performs read mapping and alignment, normalization and reactivity derivation, and RNA structure prediction in a single user-friendly web interface or via local installation. It accounts for the variation in transcript abundance and length that prevails in living cells and consequently causes variation in the counts of structure-probing events between transcripts. Accordingly, StructureFold is applicable to RNA structural profiling data obtained in vivo as well as to in vitro or in silico datasets. StructureFold is deployed via the Galaxy platform.

Availability and Implementation: StructureFold is freely available as a component of Galaxy at https://usegalaxy.org/.

Contact: yxt148@psu.edu or sma3@psu.edu

Supplementary Information: Supplementary data are available at Bioinformatics online.

Source
DOI: http://dx.doi.org/10.1093/bioinformatics/btv213
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6280868
August 2015

Online resources for genomic analysis using high-throughput sequencing.

Cold Spring Harb Protoc 2015 Feb 5;2015(4):324-35. Epub 2015 Feb 5.

Department of Biochemistry and Molecular Biology, Penn State University, University Park, Pennsylvania 16802;

The availability of high-throughput sequencing has created enormous possibilities for scientific discovery. However, the massive amount of data being generated has resulted in a severe informatics bottleneck. A large number of tools exist for analyzing next-generation sequencing (NGS) data, yet often there remains a disconnect between these research tools and the ability of many researchers to use them. As a consequence, several online resources and communities have been developed to assist researchers with both the management and the analysis of sequencing data sets. Here we describe the use and applications of common file formats for coding and storing genomic data, consider several web-accessible open-source resources for the visualization and analysis of NGS data, and provide examples of typical analyses with links to further detailed exercises.

Source
DOI: http://dx.doi.org/10.1101/pdb.top083667
February 2015

Maternal age effect and severe germ-line bottleneck in the inheritance of human mitochondrial DNA.

Proc Natl Acad Sci U S A 2014 Oct 13;111(43):15474-9. Epub 2014 Oct 13.

Biology, and

The manifestation of mitochondrial DNA (mtDNA) diseases depends on the frequency of heteroplasmy (the presence of several alleles in an individual), yet its transmission across generations cannot be readily predicted owing to a lack of data on the size of the mtDNA bottleneck during oogenesis. For deleterious heteroplasmies, a severe bottleneck may abruptly transform a benign (low) frequency in a mother into a disease-causing (high) frequency in her child. Here we present a high-resolution study of heteroplasmy transmission conducted on blood and buccal mtDNA of 39 healthy mother-child pairs of European ancestry (a total of 156 samples, each sequenced at ∼20,000× per site). On average, each individual carried one heteroplasmy, and one in eight individuals carried a disease-associated heteroplasmy, with minor allele frequency ≥1%. We observed frequent drastic heteroplasmy frequency shifts between generations and estimated the effective size of the germ-line mtDNA bottleneck at only ∼30-35 (interquartile range from 9 to 141). Accounting for heteroplasmies, we estimated the mtDNA germ-line mutation rate at 1.3 × 10⁻⁸ (interquartile range from 4.2 × 10⁻⁹ to 4.1 × 10⁻⁸) mutations per site per year, an order of magnitude higher than for nuclear DNA. Notably, we found a positive association between the number of heteroplasmies in a child and maternal age at fertilization, likely attributable to oocyte aging. This study also took advantage of droplet digital PCR (ddPCR) to validate heteroplasmies and confirm a de novo mutation. Our results can be used to predict the transmission of disease-causing mtDNA variants and illuminate evolutionary dynamics of the mitochondrial genome.

Source
DOI: http://dx.doi.org/10.1073/pnas.1409328111
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4217420
October 2014

Dissemination of scientific software with Galaxy ToolShed.

Genome Biol 2014 Feb 20;15(2):403. Epub 2014 Feb 20.

The proliferation of web-based integrative analysis frameworks has enabled users to perform complex analyses directly through the web. Unfortunately, it also revoked the freedom to easily select the most appropriate tools. To address this, we have developed Galaxy ToolShed.

Source
DOI: http://dx.doi.org/10.1186/gb4161
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4038738
February 2014

Controlling for contamination in re-sequencing studies with a reproducible web-based phylogenetic approach.

Biotechniques 2014 1;56(3):134-141. Epub 2014 Mar 1.

Department of Biochemistry and Molecular Biology, Penn State University, University Park, PA.

Polymorphism discovery is a routine application of next-generation sequencing technology where multiple samples are sent to a service provider for library preparation, subsequent sequencing, and bioinformatic analyses. The decreasing cost and advances in multiplexing approaches have made it possible to analyze hundreds of samples at a reasonable cost. However, because of the manual steps involved in the initial processing of samples and handling of sequencing equipment, cross-contamination remains a significant challenge. It is especially problematic in cases where polymorphism frequencies do not adhere to diploid expectation, for example, heterogeneous tumor samples, organellar genomes, as well as during bacterial and viral sequencing. In these instances, low levels of contamination may be readily mistaken for polymorphisms, leading to false results. Here we describe practical steps designed to reliably detect contamination and uncover its origin, and also provide new, Galaxy-based, readily accessible computational tools and workflows for quality control. All results described in this report can be reproduced interactively on the web as described at http://usegalaxy.org/contamination.

Source
DOI: http://dx.doi.org/10.2144/000114146
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4377138
December 2014