Publications by authors named "Daniel Blankenberg"

34 Publications

SimText: A text mining framework for interactive analysis and visualization of similarities among biomedical entities.

Bioinformatics 2021 May 25. Epub 2021 May 25.

Cologne Center for Genomics (CCG), Medical Faculty of the University of Cologne, University Hospital of Cologne, Cologne, 50931, Germany.

Literature exploration in PubMed on a large number of biomedical entities (e.g., genes, diseases or experiments) can be time-consuming and challenging, especially when assessing associations between entities. Here, we describe SimText, a user-friendly toolset that provides customizable and systematic workflows for the analysis of similarities among a set of entities based on text. SimText can be used for (i) text collection from PubMed and extraction of words with different text mining approaches, and (ii) interactive analysis and visualization of data using unsupervised learning techniques in an interactive app.

Availability And Implementation: We developed SimText as open-source R software and integrated it into Galaxy (https://usegalaxy.eu), an online data analysis platform with supporting self-learning training material available at https://training.galaxyproject.org. A command-line version of the toolset is available for download from GitHub (https://github.com/dlal-group/simtext) or as a Docker image (https://hub.docker.com/r/dlalgroup/simtext/tags).

Supplementary Information: Supplementary data are available at Bioinformatics online.

Source
http://dx.doi.org/10.1093/bioinformatics/btab365
May 2021

Using Galaxy to Perform Large-Scale Interactive Data Analyses - An Update.

Curr Protoc 2021 Feb;1(2):e31

Penn State University, University Park, Pennsylvania.

Modern biology continues to become increasingly computational. Datasets are becoming progressively larger, more complex, and more abundant. The computational savviness necessary to analyze these data creates an ongoing obstacle for experimental biologists. Galaxy (galaxyproject.org) provides access to computational biology tools in a web-based interface. It also provides access to major public biological data repositories, allowing private data to be combined with public datasets. Galaxy is hosted on high-capacity servers worldwide and is accessible for free, with an option to be installed locally. This article demonstrates how to employ Galaxy to perform biologically relevant analyses on publicly available datasets. These protocols use both standard and custom tools, serving as a tutorial and jumping-off point for more intensive and/or more specific analyses using Galaxy. © 2021 Wiley Periodicals LLC.

Basic Protocol 1: Finding human coding exons with highest SNP density
Basic Protocol 2: Calling peaks for ChIP-seq data
Basic Protocol 3: Compare datasets using genomic coordinates
Basic Protocol 4: Working with multiple alignments
Basic Protocol 5: Single cell RNA-seq

Source
http://dx.doi.org/10.1002/cpz1.31
February 2021

A single-cell RNA-sequencing training and analysis suite using the Galaxy framework.

Gigascience 2020 10;9(10)

Department of Bioinformatics, University of Freiburg, Georges-Köhler-Allee 106, 79110 Freiburg, Germany.

Background: The vast ecosystem of single-cell RNA-sequencing tools has until recently been plagued by an excess of diverging analysis strategies, inconsistent file formats, and compatibility issues between different software suites. The uptake of 10x Genomics datasets has begun to calm this diversity, and the bioinformatics community leans once more towards the large computing requirements and the statistically driven methods needed to process and understand these ever-growing datasets.

Results: Here we outline several Galaxy workflows and learning resources for single-cell RNA-sequencing, with the aim of providing a comprehensive analysis environment paired with a thorough user learning experience that bridges the knowledge gap between the computational methods and the underlying cell biology. The Galaxy reproducible bioinformatics framework provides tools, workflows, and trainings that not only enable users to perform 1-click 10x preprocessing but also empower them to demultiplex raw sequencing from custom tagged and full-length sequencing protocols. The downstream analysis supports a range of high-quality interoperable suites separated into common stages of analysis: inspection, filtering, normalization, confounder removal, and clustering. The teaching resources cover concepts from computer science to cell biology. Access to all resources is provided at the singlecell.usegalaxy.eu portal.

Conclusions: The reproducible and training-oriented Galaxy framework provides a sustainable high-performance computing environment for users to run flexible analyses on both 10x and alternative platforms. The tutorials from the Galaxy Training Network along with the frequent training workshops hosted by the Galaxy community provide a means for users to learn, publish, and teach single-cell RNA-sequencing analysis.

Source
http://dx.doi.org/10.1093/gigascience/giaa102
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7574357
October 2020

No more business as usual: Agile and effective responses to emerging pathogen threats require open data and open analytics.

PLoS Pathog 2020 08 13;16(8):e1008643. Epub 2020 Aug 13.

Temple University, Philadelphia, Pennsylvania, United States of America.

The current state of much of the Wuhan pneumonia virus (severe acute respiratory syndrome coronavirus 2 [SARS-CoV-2]) research shows a regrettable lack of data sharing and considerable analytical obfuscation. This impedes global research cooperation, which is essential for tackling public health emergencies and requires unimpeded access to data, analysis tools, and computational infrastructure. Here, we show that community efforts in developing open analytical software tools over the past 10 years, combined with national investments into scientific computational infrastructure, can overcome these deficiencies and provide an accessible platform for tackling global health emergencies in an open and transparent manner. Specifically, we use all SARS-CoV-2 genomic data available in the public domain so far to (1) underscore the importance of access to raw data and (2) demonstrate that existing community efforts in curation and deployment of biomedical software can reliably support rapid, reproducible research during global health crises. All our analyses are fully documented at https://github.com/galaxyproject/SARS-CoV-2.

Source
http://dx.doi.org/10.1371/journal.ppat.1008643
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7425854
August 2020

A crowdsourced set of curated structural variants for the human genome.

PLoS Comput Biol 2020 06 19;16(6):e1007933. Epub 2020 Jun 19.

Biosystems and Biomaterials Division, Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, Maryland, United States of America.

A high-quality benchmark for small variants encompassing 88 to 90% of the reference genome has been developed for seven Genome in a Bottle (GIAB) reference samples. However, a reliable benchmark for large indels and structural variants (SVs) is more challenging. In this study, we manually curated 1235 SVs, which can ultimately be used to evaluate SV callers or train machine learning models. We developed a crowdsourcing app, SVCurator, to help GIAB curators manually review large indels and SVs within the human genome, and report their genotype and size accuracy. SVCurator displays images from short, long, and linked read sequencing data from the GIAB Ashkenazi Jewish Trio son [NIST RM 8391/HG002]. We asked curators to assign labels describing SV type (deletion or insertion), size accuracy, and genotype for 1235 putative insertions and deletions sampled from different size bins between 20 and 892,149 bp. 'Expert' curators were 93% concordant with each other, and 37 of the 61 curators had at least 78% concordance with a set of 'expert' curators. The curators were least concordant for complex SVs and SVs that had inaccurate breakpoints or size predictions. After filtering events with low concordance among curators, we produced high confidence labels for 935 events. The SVCurator crowdsourced labels were 94.5% concordant with the heuristic-based draft benchmark SV callset from GIAB. We found that curators can successfully evaluate putative SVs when given evidence from multiple sequencing technologies.
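
As a rough illustration of what "concordance" between curators can mean in practice, the sketch below computes the fraction of shared events on which two curators gave the same label, and keeps only events whose most common label reaches an agreement threshold. The data layout, the label values ("DEL", "INS"), the function names, and the 0.8 threshold are all assumptions made for illustration; the abstract does not describe how SVCurator actually computes or filters concordance.

```python
from collections import Counter


def pairwise_concordance(labels_a, labels_b):
    """Fraction of events labeled by both curators on which they agree."""
    shared = [event for event in labels_a if event in labels_b]
    if not shared:
        return 0.0
    agree = sum(1 for event in shared if labels_a[event] == labels_b[event])
    return agree / len(shared)


def high_confidence_labels(labels_by_curator, min_agreement=0.8):
    """Keep events whose most common label reaches min_agreement.

    min_agreement is a hypothetical threshold, not a value from the study.
    """
    votes = {}
    for labels in labels_by_curator.values():
        for event, label in labels.items():
            votes.setdefault(event, []).append(label)

    kept = {}
    for event, event_votes in votes.items():
        label, count = Counter(event_votes).most_common(1)[0]
        if count / len(event_votes) >= min_agreement:
            kept[event] = label
    return kept


# Hypothetical toy data: three curators labeling three putative SVs.
curators = {
    "c1": {"sv1": "DEL", "sv2": "INS", "sv3": "DEL"},
    "c2": {"sv1": "DEL", "sv2": "INS", "sv3": "INS"},
    "c3": {"sv1": "DEL", "sv2": "INS", "sv3": "DEL"},
}
```

With this toy data, curators c1 and c2 agree on two of three events, and the event "sv3" is dropped by the filter because only two of three curators agree on its label.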

Source
http://dx.doi.org/10.1371/journal.pcbi.1007933
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7329145
June 2020

The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2020 update.

Nucleic Acids Res 2020 07;48(W1):W395-W402

Department of Biochemistry and Molecular Biology, Penn State University, University Park, PA, USA.

Galaxy (https://galaxyproject.org) is a web-based computational workbench used by tens of thousands of scientists across the world to analyze large biomedical datasets. Since 2005, the Galaxy project has fostered a global community focused on achieving accessible, reproducible, and collaborative research. Together, this community develops the Galaxy software framework, integrates analysis tools and visualizations into the framework, runs public servers that make Galaxy available via a web browser, performs and publishes analyses using Galaxy, leads bioinformatics workshops that introduce and use Galaxy, and develops interactive training materials for Galaxy. Over the last two years, all aspects of the Galaxy project have grown: code contributions, tools integrated, users, and training materials. Key advances in Galaxy's user interface include enhancements for analyzing large dataset collections as well as interactive tools for exploratory data analysis. Extensions to Galaxy's framework include support for federated identity and access management and increased ability to distribute analysis jobs to remote resources. New community resources include large public servers in Europe and Australia, an increasing number of regional and local Galaxy communities, and substantial growth in the Galaxy Training Network.

Source
http://dx.doi.org/10.1093/nar/gkaa434
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7319590
July 2020

Galaxy External Display Applications: closing a dataflow interoperability loop.

Nat Methods 2020 02;17(2):123-124

Department of Biochemistry and Molecular Biology, Penn State University, University Park, Pennsylvania, USA.

Source
http://dx.doi.org/10.1038/s41592-019-0727-x
February 2020

Sptlc1 is essential for myeloid differentiation and hematopoietic homeostasis.

Blood Adv 2019 11;3(22):3635-3649

Cancer and Developmental Biology Laboratory, National Cancer Institute, National Institutes of Health, Frederick, MD.

Serine palmitoyltransferase (SPT) long-chain base subunit 1 (SPTLC1) is 1 of the 2 main catalytic subunits of the SPT complex, which catalyzes the first and rate-limiting step of sphingolipid biosynthesis. Here, we show that Sptlc1 deletion in adult bone marrow (BM) cells results in defective myeloid differentiation. In chimeric mice from noncompetitive BM transplant assays, there was an expansion of the Lin- c-Kit+ Sca-1+ compartment due to increased multipotent progenitor production, but myeloid differentiation was severely compromised. We also show that defective biogenesis of sphingolipids in the endoplasmic reticulum (ER) leads to ER stress that affects myeloid differentiation. Furthermore, we demonstrate that transient accumulation of fatty acid, a substrate for sphingolipid biosynthesis, could be partially responsible for the ER stress. Independently, we find that ER stress in general, such as that induced by the chemical thapsigargin or the fatty acid palmitic acid, compromises myeloid differentiation in culture. These results identify perturbed sphingolipid metabolism as a source of ER stress, which may produce diverse pathological effects related to differential cell-type sensitivity.

Source
http://dx.doi.org/10.1182/bloodadvances.2019000729
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6880889
November 2019

Software engineering for scientific big data analysis.

Gigascience 2019 05;8(5)

Genomic Medicine Institute, Lerner Research Institute, Cleveland Clinic, 9500 Euclid Avenue / NE50, Cleveland, OH, USA.

The increasing complexity of data and analysis methods has created an environment where scientists, who may not have formal training, are finding themselves playing the impromptu role of software engineer. While several resources are available for introducing scientists to the basics of programming, researchers have been left with little guidance on approaches needed to advance to the next level for the development of robust, large-scale data analysis tools that are amenable to integration into workflow management systems, tools, and frameworks. The integration into such workflow systems necessitates additional requirements on computational tools, such as adherence to standard conventions for robustness, data input, output, logging, and flow control. Here we provide a set of 10 guidelines to steer the creation of command-line computational tools that are usable, reliable, extensible, and in line with standards of modern coding practices.
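
To make the kind of conventions mentioned above more concrete, here is a minimal sketch of a command-line tool that takes its input as an argument, writes only its result to standard output, logs diagnostics to standard error, and reports failure through a non-zero return code. It is not taken from the paper and does not reproduce its 10 guidelines; the tool itself (a non-empty line counter), the names `count_records` and `main`, and the option names are hypothetical.

```python
import argparse
import logging
import sys

log = logging.getLogger("count_records")


def count_records(lines):
    """Count non-empty lines; a stand-in for the tool's real work."""
    return sum(1 for line in lines if line.strip())


def main(argv=None):
    parser = argparse.ArgumentParser(
        description="Count non-empty lines in a text file."
    )
    parser.add_argument("input", help="path to the input text file")
    parser.add_argument(
        "-v", "--verbose", action="store_true", help="log progress to stderr"
    )
    args = parser.parse_args(argv)

    # Diagnostics go to stderr so that stdout carries only the result.
    logging.basicConfig(
        stream=sys.stderr,
        level=logging.INFO if args.verbose else logging.WARNING,
    )

    try:
        with open(args.input, encoding="utf-8") as handle:
            total = count_records(handle)
    except OSError as exc:
        log.error("cannot read %s: %s", args.input, exc)
        return 1  # non-zero status lets a calling workflow detect the failure

    log.info("finished reading %s", args.input)
    print(total)
    return 0
```

In a standalone script, the entry point would typically end with `sys.exit(main())` so that the return value becomes the process exit status, which is what a workflow system would inspect.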

Source
http://dx.doi.org/10.1093/gigascience/giz054
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6532757
May 2019

Child Weight Gain Trajectories Linked To Oral Microbiota Composition.

Sci Rep 2018 09 19;8(1):14030. Epub 2018 Sep 19.

Center for Medical Genomics, Penn State University, University Park, PA, 16802, USA.

Gut and oral microbiota perturbations have been observed in obese adults and adolescents; less is known about their influence on weight gain in young children. Here we analyzed the gut and oral microbiota of 226 two-year-olds with 16S rRNA gene sequencing. Weight and length were measured at seven time points and used to identify children with rapid infant weight gain (a strong risk factor for childhood obesity), and to derive growth curves with innovative Functional Data Analysis (FDA) techniques. We showed that growth curves were associated negatively with diversity, and positively with the Firmicutes-to-Bacteroidetes ratio, of the oral microbiota. We also demonstrated an association between the gut microbiota and child growth, even after controlling for the effect of diet on the microbiota. Lastly, we identified several bacterial genera that were associated with child growth patterns. These results suggest that by the age of two, the oral microbiota of children with rapid infant weight gain may have already begun to establish patterns often seen in obese adults. They also suggest that the gut microbiota at age two, while strongly influenced by diet, does not harbor obesity signatures many researchers identified in later life stages.

Source
http://dx.doi.org/10.1038/s41598-018-31866-9
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6145887
September 2018

Recommendations for the packaging and containerizing of bioinformatics software.

F1000Res 2018 14;7. Epub 2018 Jun 14.

EMBL European Bioinformatics Institute, Cambridge, UK.

Software containers are changing the way scientists and researchers develop, deploy and exchange scientific software. They allow labs of all sizes to easily install bioinformatics software, maintain multiple versions of the same software and combine tools into powerful analysis pipelines. However, containers and software packages should be produced under certain rules and standards in order to be reusable, compatible and easy to integrate into pipelines and analysis workflows. Here, we present a set of recommendations developed by the BioContainers Community to produce standardized bioinformatics packages and containers. These recommendations provide practical guidelines to make bioinformatics software more discoverable, reusable and transparent. They aim to guide developers, organisations, journals and funders to increase the quality and sustainability of research software.

Source
http://dx.doi.org/10.12688/f1000research.15140.2
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6738188
November 2019

The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2018 update.

Nucleic Acids Res 2018 07;46(W1):W537-W544

Genomic Medicine Institute, Lerner Research Institute, Cleveland Clinic, Cleveland, OH, USA.

Galaxy (homepage: https://galaxyproject.org, main public server: https://usegalaxy.org) is a web-based scientific analysis platform used by tens of thousands of scientists across the world to analyze large biomedical datasets such as those found in genomics, proteomics, metabolomics and imaging. Started in 2005, Galaxy continues to focus on three key challenges of data-driven biomedical science: making analyses accessible to all researchers, ensuring analyses are completely reproducible, and making it simple to communicate analyses so that they can be reused and extended. During the last two years, the Galaxy team and the open-source community around Galaxy have made substantial improvements to Galaxy's core framework, user interface, tools, and training materials. Framework and user interface improvements now enable Galaxy to be used for analyzing tens of thousands of datasets, and >5500 tools are now available from the Galaxy ToolShed. The Galaxy community has led an effort to create numerous high-quality tutorials focused on common types of genomic analyses. The Galaxy developer and user communities continue to grow and be integral to Galaxy's development. The number of Galaxy public servers, developers contributing to the Galaxy framework and its tools, and users of the main Galaxy server have all increased substantially.

Source
http://dx.doi.org/10.1093/nar/gky379
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6030816
July 2018

Biology Needs Evolutionary Software Tools: Let's Build Them Right.

Mol Biol Evol 2018 06;35(6):1372-1375

Lerner Research Institute, Cleveland Clinic, Cleveland, OH.

Research in population genetics and evolutionary biology has always provided a computational backbone for life sciences as a whole. Today, evolutionary and population biology reasoning is essential for interpretation of large complex datasets that are characteristic of all domains of today's life sciences, ranging from cancer biology to microbial ecology. This situation makes algorithms and software tools developed by our community more important than ever before. This means that we, developers of software tools for molecular evolutionary analyses, now have a shared responsibility to make these tools accessible using modern technological developments as well as provide adequate documentation and training.

Source
http://dx.doi.org/10.1093/molbev/msy084
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5967460
June 2018

The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2016 update.

Nucleic Acids Res 2016 07 2;44(W1):W3-W10. Epub 2016 May 2.

The Computational Biology Institute, George Washington University, Washington DC, USA.

High-throughput data production technologies, particularly 'next-generation' DNA sequencing, have ushered in widespread and disruptive changes to biomedical research. Making sense of the large datasets produced by these technologies requires sophisticated statistical and computational methods, as well as substantial computational power. This has led to an acute crisis in life sciences, as researchers without informatics training attempt to perform computation-dependent analyses. Since 2005, the Galaxy project has worked to address this problem by providing a framework that makes advanced computational tools usable by non-experts. Galaxy seeks to make data-intensive research more accessible, transparent and reproducible by providing a Web-based environment in which users can perform computational analyses and have all of the details automatically tracked for later inspection, publication, or reuse. In this report we highlight recently added features enabling biomedical analyses on a large scale.

Source
http://dx.doi.org/10.1093/nar/gkw343
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4987906
July 2016

Integrative genomic analysis by interoperation of bioinformatics tools in GenomeSpace.

Nat Methods 2016 Mar 18;13(3):245-247. Epub 2016 Jan 18.

The Broad Institute of MIT and Harvard, Cambridge, MA, USA.

Complex biomedical analyses require the use of multiple software tools in concert and remain challenging for much of the biomedical research community. We introduce GenomeSpace (http://www.genomespace.org), a cloud-based, cooperative community resource that currently supports the streamlined interaction of 20 bioinformatics tools and data resources. To facilitate integrative analysis by non-programmers, it offers a growing set of 'recipes', short workflows to guide investigators through high-utility analysis tasks.

Source
http://dx.doi.org/10.1038/nmeth.3732
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4767623
March 2016

Online resources for genomic analysis using high-throughput sequencing.

Cold Spring Harb Protoc 2015 Feb 5;2015(4):324-35. Epub 2015 Feb 5.

Department of Biochemistry and Molecular Biology, Penn State University, University Park, Pennsylvania 16802.

The availability of high-throughput sequencing has created enormous possibilities for scientific discovery. However, the massive amount of data being generated has resulted in a severe informatics bottleneck. A large number of tools exist for analyzing next-generation sequencing (NGS) data, yet often there remains a disconnect between these research tools and the ability of many researchers to use them. As a consequence, several online resources and communities have been developed to assist researchers with both the management and the analysis of sequencing data sets. Here we describe the use and applications of common file formats for coding and storing genomic data, consider several web-accessible open-source resources for the visualization and analysis of NGS data, and provide examples of typical analyses with links to further detailed exercises.

Source
http://dx.doi.org/10.1101/pdb.top083667
February 2015

Maternal age effect and severe germ-line bottleneck in the inheritance of human mitochondrial DNA.

Proc Natl Acad Sci U S A 2014 Oct 13;111(43):15474-9. Epub 2014 Oct 13.

Biology, and

The manifestation of mitochondrial DNA (mtDNA) diseases depends on the frequency of heteroplasmy (the presence of several alleles in an individual), yet its transmission across generations cannot be readily predicted owing to a lack of data on the size of the mtDNA bottleneck during oogenesis. For deleterious heteroplasmies, a severe bottleneck may abruptly transform a benign (low) frequency in a mother into a disease-causing (high) frequency in her child. Here we present a high-resolution study of heteroplasmy transmission conducted on blood and buccal mtDNA of 39 healthy mother-child pairs of European ancestry (a total of 156 samples, each sequenced at ∼20,000× per site). On average, each individual carried one heteroplasmy, and one in eight individuals carried a disease-associated heteroplasmy, with minor allele frequency ≥1%. We observed frequent drastic heteroplasmy frequency shifts between generations and estimated the effective size of the germ-line mtDNA bottleneck at only ∼30-35 (interquartile range from 9 to 141). Accounting for heteroplasmies, we estimated the mtDNA germ-line mutation rate at 1.3 × 10^-8 (interquartile range from 4.2 × 10^-9 to 4.1 × 10^-8) mutations per site per year, an order of magnitude higher than for nuclear DNA. Notably, we found a positive association between the number of heteroplasmies in a child and maternal age at fertilization, likely attributable to oocyte aging. This study also took advantage of droplet digital PCR (ddPCR) to validate heteroplasmies and confirm a de novo mutation. Our results can be used to predict the transmission of disease-causing mtDNA variants and illuminate evolutionary dynamics of the mitochondrial genome.

Source
http://dx.doi.org/10.1073/pnas.1409328111
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4217420
October 2014

Dissemination of scientific software with Galaxy ToolShed.

Genome Biol 2014 Feb 20;15(2):403. Epub 2014 Feb 20.

The proliferation of web-based integrative analysis frameworks has enabled users to perform complex analyses directly through the web. Unfortunately, it also revoked the freedom to easily select the most appropriate tools. To address this, we have developed Galaxy ToolShed.

Source
http://dx.doi.org/10.1186/gb4161
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4038738
February 2014

Analysis of next-generation sequencing data using Galaxy.

Methods Mol Biol 2014 ;1150:21-43

Department of Biochemistry and Molecular Biology, Penn State University, 505 Wartik Laboratory, University Park, PA, 16802, USA.

The extraordinary throughput of next-generation sequencing (NGS) technology is outpacing our ability to analyze and interpret the data. This chapter will focus on practical informatics methods, strategies, and software tools for transforming NGS data into usable information through the use of a web-based platform, Galaxy. The Galaxy interface is explored through several different types of example analyses. Instructions for running one's own Galaxy server on local hardware or on cloud computing resources are provided. Installing new tools into a personal Galaxy instance is also demonstrated.

Source
http://dx.doi.org/10.1007/978-1-4939-0512-6_2
November 2014

Controlling for contamination in re-sequencing studies with a reproducible web-based phylogenetic approach.

Biotechniques 2014 1;56(3):134-141. Epub 2014 Mar 1.

Department of Biochemistry and Molecular Biology, Penn State University, University Park, PA.

Polymorphism discovery is a routine application of next-generation sequencing technology where multiple samples are sent to a service provider for library preparation, subsequent sequencing, and bioinformatic analyses. The decreasing cost and advances in multiplexing approaches have made it possible to analyze hundreds of samples at a reasonable cost. However, because of the manual steps involved in the initial processing of samples and handling of sequencing equipment, cross-contamination remains a significant challenge. It is especially problematic in cases where polymorphism frequencies do not adhere to diploid expectation, for example, heterogeneous tumor samples, organellar genomes, as well as during bacterial and viral sequencing. In these instances, low levels of contamination may be readily mistaken for polymorphisms, leading to false results. Here we describe practical steps designed to reliably detect contamination and uncover its origin, and also provide new, Galaxy-based, readily accessible computational tools and workflows for quality control. All results described in this report can be reproduced interactively on the web as described at http://usegalaxy.org/contamination.

Source
http://dx.doi.org/10.2144/000114146
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4377138
December 2014

Wrangling Galaxy's reference data.

Bioinformatics 2014 Jul 28;30(13):1917-9. Epub 2014 Feb 28.

Department of Biochemistry and Molecular Biology, Penn State University, University Park, PA 16802, USA, http://www.galaxyproject.org, Minnesota Supercomputing Institute, University of Minnesota, Minneapolis, MN 55455, USA, Department of Biology and Department of Mathematics and Computer Science, Emory University, Atlanta, GA 30322, USA.

The Galaxy platform has developed into a fully featured collaborative workbench, with goals of inherently capturing provenance to enable reproducible data analysis, and of making it straightforward to run one's own server. However, many Galaxy platform tools rely on the presence of reference data, such as alignment indexes, to function efficiently. Until now, the building of this cache of data for Galaxy has been an error-prone manual process lacking reproducibility and provenance. The Galaxy Data Manager framework is an enhancement that changes the management of Galaxy's built-in data cache from a manual procedure to an automated graphical user interface (GUI) driven process, which contains the same openness, reproducibility and provenance that is afforded to Galaxy's analysis tools. Data Manager tools allow the Galaxy administrator to download, create and install additional datasets for any type of reference data in real time.

Availability And Implementation: The Galaxy Data Manager framework is implemented in Python and has been integrated as part of the core Galaxy platform. Individual Data Manager tools can be defined locally or installed from a ToolShed, allowing the Galaxy community to define additional Data Manager tools as needed, with full versioning and dependency support.

Source
http://dx.doi.org/10.1093/bioinformatics/btu119
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4071198
July 2014

CloudMap: a cloud-based pipeline for analysis of mutant genome sequences.

Genetics 2012 Dec 10;192(4):1249-69. Epub 2012 Oct 10.

Department of Biochemistry and Molecular Biophysics, Howard Hughes Medical Institute, Columbia University Medical Center, New York, NY 10032, USA.

Whole genome sequencing (WGS) allows researchers to pinpoint genetic differences between individuals and significantly shortcuts the costly and time-consuming part of forward genetic analysis in model organism systems. Currently, the most effort-intensive part of WGS is the bioinformatic analysis of the relatively short reads generated by second generation sequencing platforms. We describe here a novel, easily accessible and cloud-based pipeline, called CloudMap, which greatly simplifies the analysis of mutant genome sequences. Available on the Galaxy web platform, CloudMap requires no software installation when run on the cloud, but it can also be run locally or via Amazon's Elastic Compute Cloud (EC2) service. CloudMap uses a series of predefined workflows to pinpoint sequence variations in animal genomes, such as those of premutagenized and mutagenized Caenorhabditis elegans strains. In combination with a variant-based mapping procedure, CloudMap allows users to sharply define genetic map intervals graphically and to retrieve very short lists of candidate variants with a few simple clicks. Automated workflows and extensive video user guides are available to detail the individual analysis steps performed (http://usegalaxy.org/cloudmap). We demonstrate the utility of CloudMap for WGS analysis of C. elegans and Arabidopsis genomes and describe how other organisms (e.g., Zebrafish and Drosophila) can easily be accommodated by this software platform. To accommodate rapid analysis of many mutants from large-scale genetic screens, CloudMap contains an in silico complementation testing tool that allows users to rapidly identify instances where multiple alleles of the same gene are present in the mutant collection. Lastly, we describe the application of a novel mapping/WGS method ("Variant Discovery Mapping") that does not rely on a defined polymorphic mapping strain, and we integrate the application of this method into CloudMap. 
CloudMap tools and documentation are continually updated at http://usegalaxy.org/cloudmap.

Source
DOI: http://dx.doi.org/10.1534/genetics.112.144204
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3512137
December 2012

An encyclopedia of mouse DNA elements (Mouse ENCODE).

Genome Biol 2012 Aug 13;13(8):418. Epub 2012 Aug 13.

To complement the human Encyclopedia of DNA Elements (ENCODE) project and to enable a broad range of mouse genomics efforts, the Mouse ENCODE Consortium is applying the same experimental pipelines developed for human ENCODE to annotate the mouse genome.

Source
DOI: http://dx.doi.org/10.1186/gb-2012-13-8-418
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3491367
August 2012

Using Galaxy to perform large-scale interactive data analyses.

Curr Protoc Bioinformatics 2012 Jun;Chapter 10:Unit10.5

Penn State University, University Park, PA, USA.

Innovations in biomedical research technologies continue to provide experimental biologists with novel and increasingly large genomic and high-throughput data resources to be analyzed. As creating and obtaining data has become easier, the key decision faced by many researchers is a practical one: where and how should an analysis be performed? Datasets are large and analysis tool set-up and use is riddled with complexities outside of the scope of core research activities. The authors believe that Galaxy provides a powerful solution that simplifies data acquisition and analysis in an intuitive Web application, granting all researchers access to key informatics tools previously only available to computational specialists working in Unix-based environments. We will demonstrate through a series of biomedically relevant protocols how Galaxy specifically brings together (1) data retrieval from public and private sources, for example, UCSC's Eukaryote and Microbial Genome Browsers, (2) custom tools (wrapped Unix functions, format standardization/conversions, interval operations), and 3rd-party analysis tools.

Source
DOI: http://dx.doi.org/10.1002/0471250953.bi1005s38
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4282168
June 2012

Making whole genome multiple alignments usable for biologists.

Bioinformatics 2011 Sep 19;27(17):2426-8. Epub 2011 Jul 19.

The Huck Institutes for the Life Sciences and Department of Biochemistry and Molecular Biology, Penn State University, University Park, PA, USA.

Summary: Here we describe a set of tools implemented within the Galaxy platform designed to make analysis of multiple genome alignments truly accessible for biologists. These tools are available through both a web-based graphical user interface and a command-line interface.

Availability And Implementation: This open-source toolset was implemented in Python and has been integrated into the online data analysis platform Galaxy (public web access: http://usegalaxy.org; download: http://getgalaxy.org). Additional help is available as a live supplement from http://usegalaxy.org/u/dan/p/maf.

Supplementary Information: Supplementary data are available at Bioinformatics online.

Source
DOI: http://dx.doi.org/10.1093/bioinformatics/btr398
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3157923
September 2011

Integrating diverse databases into an unified analysis framework: a Galaxy approach.

Database (Oxford) 2011 Apr 29;2011:bar011. Epub 2011 Apr 29.

The Galaxy Project, University Park, PA, USA.

Recent technological advances have led to the ability to generate large amounts of data for model and non-model organisms. Whereas, in the past, there have been a relatively small number of central repositories that serve genomic data, an increasing number of distinct specialized data repositories and resources have been established. Here, we describe a generic approach that provides for the integration of a diverse spectrum of data resources into a unified analysis framework, Galaxy (http://usegalaxy.org). This approach allows the simplified coupling of external data resources with the data analysis tools available to Galaxy users, while leveraging the native data mining facilities of the external data resources. DATABASE URL: http://usegalaxy.org.

Source
DOI: http://dx.doi.org/10.1093/database/bar011
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3092608
September 2011

Manipulation of FASTQ data with Galaxy.

Bioinformatics 2010 Jul 18;26(14):1783-5. Epub 2010 Jun 18.

Huck Institute for the Life Sciences, Penn State University, University Park, PA 16803, USA.

Summary: Here, we describe a tool suite that functions on all of the commonly known FASTQ format variants and provides a pipeline for manipulating next generation sequencing data taken from a sequencing machine all the way through the quality filtering steps.

Availability And Implementation: This open-source toolset was implemented in Python and has been integrated into the online data analysis platform Galaxy (public web access: http://usegalaxy.org; download: http://getgalaxy.org). Two short movies that highlight the functionality of tools described in this manuscript as well as results from testing components of this tool suite against a set of previously published files are available at http://usegalaxy.org/u/dan/p/fastq
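
To illustrate the kind of processing the summary above describes, here is a minimal Python sketch of two common FASTQ manipulations: re-encoding Phred+64 quality strings as Phred+33 (Sanger) and filtering reads by a minimum per-base quality. It is not the Galaxy tool suite's code. The function names (`parse_fastq`, `to_sanger`, `filter_by_min_quality`) and the record layout are hypothetical, and it assumes plain four-line records and Phred-scaled scores only (Solexa-scaled scores would need a different conversion).

```python
from typing import Iterable, Iterator, List, Tuple

# Hypothetical record layout for this sketch: (identifier, sequence, quality string)
FastqRecord = Tuple[str, str, str]


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield records from four-line FASTQ text (no line-wrapped sequences)."""
    stripped = [line.rstrip("\n") for line in lines if line.strip()]
    for i in range(0, len(stripped) - 3, 4):
        header, seq, plus, qual = stripped[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at non-blank line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        yield header[1:], seq, qual


def decode_quality(qual: str, offset: int = 33) -> List[int]:
    """Convert an ASCII quality string to integer Phred scores."""
    scores = [ord(ch) - offset for ch in qual]
    if any(score < 0 for score in scores):
        raise ValueError("quality character below the given ASCII offset")
    return scores


def to_sanger(qual: str, offset: int = 64) -> str:
    """Re-encode a Phred-scaled quality string from `offset` to Phred+33."""
    return "".join(chr(score + 33) for score in decode_quality(qual, offset))


def filter_by_min_quality(
    records: Iterable[FastqRecord], min_score: int, offset: int = 33
) -> Iterator[FastqRecord]:
    """Keep only reads in which every base has a score of at least min_score."""
    for name, seq, qual in records:
        if min(decode_quality(qual, offset), default=0) >= min_score:
            yield name, seq, qual


if __name__ == "__main__":
    text = "@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\nII#I\n"
    kept = list(filter_by_min_quality(parse_fastq(text.splitlines()), min_score=20))
    print([name for name, _, _ in kept])
```

In this sketch a read is dropped if any single base falls below the threshold. Other criteria, such as mean quality or trimming low-quality ends, are equally plausible design choices.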

Source
DOI: http://dx.doi.org/10.1093/bioinformatics/btq281
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2894519
July 2010

Galaxy: a web-based genome analysis tool for experimentalists.

Curr Protoc Mol Biol 2010 Jan;Chapter 19:Unit 19.10.1-21

The Huck Institutes for the Life Sciences, Pennsylvania State University, University Park, Pennsylvania, USA.

High-throughput data production has revolutionized molecular biology. However, massive increases in data generation capacity require analysis approaches that are more sophisticated, and often very computationally intensive. Thus, making sense of high-throughput data requires informatics support. Galaxy (http://galaxyproject.org) is a software system that provides this support through a framework that gives experimentalists simple interfaces to powerful tools, while automatically managing the computational details. Galaxy is distributed both as a publicly available Web service, which provides tools for the analysis of genomic, comparative genomic, and functional genomic data, and as a downloadable package that can be deployed in individual laboratories. Either way, it allows experimentalists without informatics or programming expertise to perform complex large-scale analysis with just a Web browser.

Source
DOI: http://dx.doi.org/10.1002/0471142727.mb1910s89
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4264107
January 2010
-->