Publications by authors named "Ben Busby"

27 Publications

Predicting drug-metagenome interactions: Variation in the microbial β-glucuronidase level in the human gut metagenomes.

PLoS One 2021 7;16(1):e0244876. Epub 2021 Jan 7.

National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, United States of America.

Characterizing the gut microbiota in terms of their capacity to interfere with drug metabolism is necessary to achieve drug efficacy and safety. Although examples of drug-microbiome interactions are well-documented, little has been reported about a computational pipeline for systematically identifying and characterizing bacterial enzymes that process particular classes of drugs. The goal of our study is to develop a computational approach that compiles drugs whose metabolism may be influenced by a particular class of microbial enzymes and that quantifies the variability in the collective level of those enzymes among individuals. The present paper describes this approach, with microbial β-glucuronidases as an example, which break down drug-glucuronide conjugates and reactivate the drugs or their metabolites. We identified 100 medications that may be metabolized by β-glucuronidases from the gut microbiome. These medications included morphine, estrogen, ibuprofen, midazolam, and their structural analogues. The analysis of metagenomic data available through the Sequence Read Archive (SRA) showed that the level of β-glucuronidase in the gut metagenomes was higher in males than in females, which provides a potential explanation for the sex-based differences in efficacy and toxicity for several drugs, reported in previous studies. Our analysis also showed that infant gut metagenomes at birth and 12 months of age have higher levels of β-glucuronidase than the metagenomes of their mothers and the implication of this observed variability was discussed in the context of breastfeeding as well as infant hyperbilirubinemia. Overall, despite important limitations discussed in this paper, our analysis provided useful insights on the role of the human gut metagenome in the variability in drug response among individuals. 
Importantly, this approach exploits drug and metagenome data available in public databases as well as open-source cheminformatics and bioinformatics tools to predict drug-metagenome interactions.

Source
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0244876 (PLOS)
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7790408 (PMC)
January 2021

NCBI's Virus Discovery Codeathon: Building "FIVE" - The Federated Index of Viral Experiments API Index.

Viruses 2020 12 10;12(12). Epub 2020 Dec 10.

National Center for Biotechnology Information, U.S. National Library of Medicine, National Institutes of Health, 9000 Rockville Pike, Bethesda, MD 20894, USA.

Viruses represent important test cases for data federation due to their genome size and the rapid increase in sequence data in publicly available databases. However, some consequences of previously decentralized (unfederated) data are lack of consensus or comparisons between feature annotations. Unifying or displaying alternative annotations should be a priority both for communities with robust entry representation and for nascent communities with burgeoning data sources. To this end, during this three-day continuation of the Virus Hunting Toolkit codeathon series (VHT-2), a new integrated and federated viral index was elaborated. This Federated Index of Viral Experiments (FIVE) integrates pre-existing and novel functional and taxonomy annotations and virus-host pairings. Variability in the context of viral genomic diversity is often overlooked in virus databases. As a proof-of-concept, FIVE was the first attempt to include viral genome variation for HIV, the most well-studied human pathogen, through viral genome diversity graphs. As of the publication of this manuscript, FIVE is the first implementation of a virus-specific federated index of such scope. FIVE is coded in BigQuery for optimal access of large quantities of data and is publicly accessible. Many projects of database or index federation fail to provide easier alternatives to access or query information. To this end, a Python API query system was developed to enhance the accessibility of FIVE.

Source
http://dx.doi.org/10.3390/v12121424 (DOI Listing)
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7764237 (PMC)
December 2020

Jupyter notebook-based tools for building structured datasets from the Sequence Read Archive.

F1000Res 2020 19;9:376. Epub 2020 May 19.

National Center for Biotechnology Information NLM, Bethesda, Maryland, 20894, USA.

The Sequence Read Archive (SRA) is a large public repository that stores raw next-generation sequencing data from thousands of diverse scientific investigations. Despite its promise, reuse and re-analysis of SRA data have been challenged by the heterogeneity and poor quality of the metadata that describe its biological samples. Recently, the MetaSRA project standardized these metadata by annotating each sample with terms from biomedical ontologies. In this work, we present a pair of Jupyter notebook-based tools that utilize the MetaSRA for building structured datasets from the SRA in order to facilitate secondary analyses of the SRA's human RNA-seq data. The first tool finds suitable case and control samples for a given disease or condition, where the cases and controls are matched by tissue or cell type. The second tool finds ordered sets of samples for the purpose of addressing biological questions pertaining to changes over a numerical property such as time. These tools were the result of a three-day-long NCBI Codeathon in March 2019 held at the University of North Carolina at Chapel Hill.
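
The idea of matching cases and controls by tissue can be illustrated with a minimal Python sketch. It is not the notebooks' actual code: the sample records, the field names (`id`, `tissue`, `conditions`), and the function name are hypothetical, and real MetaSRA annotations use ontology terms rather than plain strings.

```python
from collections import defaultdict


def match_by_tissue(samples, condition):
    """Split samples into cases and controls, grouped by tissue label.

    `samples` is a list of dicts with hypothetical keys:
    "id", "tissue", and "conditions" (a set of condition labels).
    """
    by_tissue = defaultdict(lambda: {"cases": [], "controls": []})
    for sample in samples:
        group = "cases" if condition in sample["conditions"] else "controls"
        by_tissue[sample["tissue"]][group].append(sample["id"])
    # Keep only tissues that have at least one case and one control,
    # because unmatched samples cannot be compared within a tissue.
    return {
        tissue: groups
        for tissue, groups in by_tissue.items()
        if groups["cases"] and groups["controls"]
    }


# Illustrative, made-up sample records.
samples = [
    {"id": "S1", "tissue": "liver", "conditions": {"hepatitis"}},
    {"id": "S2", "tissue": "liver", "conditions": set()},
    {"id": "S3", "tissue": "blood", "conditions": {"hepatitis"}},
]
matched = match_by_tissue(samples, "hepatitis")
```

In this toy input only liver has both a case and a control, so blood is dropped from the result.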

Source
http://dx.doi.org/10.12688/f1000research.23180.2 (DOI Listing)
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7445559 (PMC)
May 2020

Integrated Informatics Analysis of Cancer-Related Variants.

JCO Clin Cancer Inform 2020 03;4:310-317

The Institute for Computational Medicine, The Johns Hopkins University, Baltimore, MD.

Purpose: The modern researcher is confronted with hundreds of published methods to interpret genetic variants. There are databases of genes and variants, phenotype-genotype relationships, algorithms that score and rank genes, and in silico variant effect prediction tools. Because variant prioritization is a multifactorial problem, a welcome development in the field has been the emergence of decision support frameworks, which make it easier to integrate multiple resources in an interactive environment. Current decision support frameworks are typically limited by closed proprietary architectures, access to a restricted set of tools, lack of customizability, Web dependencies that expose protected data, or limited scalability.

Methods: We present the Open Custom Ranked Analysis of Variants Toolkit (OpenCRAVAT), a new open-source, scalable decision support system for variant and gene prioritization. We have designed the resource catalog to be open and modular to maximize community and developer involvement, and as a result, the catalog is being actively developed and growing every month. Resources made available via the store are well suited for analysis of cancer, as well as Mendelian and complex diseases.

Results: OpenCRAVAT offers both a command-line utility and a dynamic graphical user interface, allowing users to install with a single command, easily download tools from an extensive resource catalog, create customized pipelines, and explore results in a richly detailed viewing environment. We present several case studies to illustrate the design of custom workflows to prioritize genes and variants.

Conclusion: OpenCRAVAT is distinguished from similar tools by its capabilities to access and integrate an unprecedented amount of diverse data resources and computational prediction methods, which span germline, somatic, common, rare, coding, and noncoding variants.

Source
http://dx.doi.org/10.1200/CCI.19.00132 (DOI Listing)
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7113103 (PMC)
March 2020

Iron Hack - A symposium/hackathon focused on porphyrias, Friedreich's ataxia, and other rare iron-related diseases.

F1000Res 2019 19;8:1135. Epub 2019 Jul 19.

Global and Planetary Health, College of Public Health, University of South Florida, USF Genomics Program, 3720 Spectrum Blvd, Tampa, FL, 33612, USA.

Basic and clinical scientific research at the University of South Florida (USF) have intersected to support a multi-faceted approach around a common focus on rare iron-related diseases. We proposed a modified version of the National Center for Biotechnology Information's (NCBI) Hackathon-model to take full advantage of local expertise in building "Iron Hack", a rare disease-focused hackathon. As the collaborative, problem-solving nature of hackathons tends to attract participants of highly-diverse backgrounds, organizers facilitated a symposium on rare iron-related diseases, specifically porphyrias and Friedreich's ataxia, pitched at general audiences. The hackathon was structured to begin each day with presentations by expert clinicians, genetic counselors, researchers focused on molecular and cellular biology, public health/global health, genetics/genomics, computational biology, bioinformatics, biomolecular science, bioengineering, and computer science, as well as guest speakers from the American Porphyria Foundation (APF) and Friedreich's Ataxia Research Alliance (FARA) to inform participants as to the human impact of these diseases. As a result of this hackathon, we developed resources that are relevant not only to these specific disease-models, but also to other rare diseases and general bioinformatics problems. Within two and a half days, "Iron Hack" participants successfully built collaborative projects to visualize data, build databases, improve rare disease diagnosis, and study rare-disease inheritance. The purpose of this manuscript is to demonstrate the utility of a hackathon model to generate prototypes of generalizable tools for a given disease and train clinicians and data scientists to interact more effectively.

Source
http://dx.doi.org/10.12688/f1000research.19140.1 (DOI Listing)
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6894363 (PMC)
June 2020

NCBI's Virus Discovery Hackathon: Engaging Research Communities to Identify Cloud Infrastructure Requirements.

Genes (Basel) 2019 09 16;10(9). Epub 2019 Sep 16.

National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda MD 20894, USA.

A wealth of viral data sits untapped in publicly available metagenomic data sets when it might be extracted to create a usable index for the virological research community. We hypothesized that work of this complexity and scale could be done in a hackathon setting. Ten teams, comprising over 40 participants from six countries, assembled to create a crowd-sourced set of analysis and processing pipelines for a complex biological data set in a three-day event on the San Diego State University campus starting 9 January 2019. Prior to the hackathon, 141,676 metagenomic data sets from the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) were pre-assembled into contiguous assemblies (contigs) by NCBI staff. During the hackathon, a subset consisting of 2953 SRA data sets (approximately 55 million contigs) was selected, which were further filtered for a minimal length of 1 kb. This resulted in 4.2 million (Mio) contigs, which were aligned using BLAST against all known virus genomes, phylogenetically clustered and assigned metadata. Out of the 4.2 Mio contigs, 360,000 contigs were labeled with domains and an additional subset containing 4400 contigs was screened for virus or virus-like genes. The work yielded valuable insights into both SRA data and the cloud infrastructure required to support such efforts, revealing analysis bottlenecks and possible workarounds. Mainly: (i) conservative assemblies of SRA data improve initial analysis steps; (ii) existing bioinformatic software with weak multithreading/multicore support can be elevated by wrapper scripts to use all cores within a computing node; (iii) existing bioinformatic algorithms should be redesigned for a cloud infrastructure to facilitate their use by a wider audience; and (iv) a cloud infrastructure allows a diverse group of researchers to collaborate effectively. The scientific findings will be extended during a follow-up event. Here, we present the applied workflows, initial results, and lessons learned from the hackathon.
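
The length-filtering step mentioned in this abstract (keeping contigs of at least 1 kb) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the hackathon's pipeline: the function names are hypothetical, the input is assumed to be plain FASTA text, and 1 kb is taken to mean 1000 bases.

```python
def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def filter_by_length(records, min_len=1000):
    """Keep only records whose sequence has at least `min_len` bases."""
    return [(name, seq) for name, seq in records if len(seq) >= min_len]


# Illustrative, made-up contigs: one of 1200 bases, one of 999 bases.
fasta_lines = [">contig_a", "A" * 600, "C" * 600, ">contig_b", "G" * 999]
kept = filter_by_length(read_fasta(fasta_lines))
```

Here only `contig_a` survives the filter, since `contig_b` falls one base short of the threshold.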

Source
http://dx.doi.org/10.3390/genes10090714 (DOI Listing)
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6771016 (PMC)
September 2019

Magic-BLAST, an accurate RNA-seq aligner for long and short reads.

BMC Bioinformatics 2019 Jul 25;20(1):405. Epub 2019 Jul 25.

National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD, 20894, USA.

Background: Next-generation sequencing technologies can produce tens of millions of reads, often paired-end, from transcripts or genomes. But few programs can align RNA on the genome and accurately discover introns, especially with long reads. We introduce Magic-BLAST, a new aligner based on ideas from the Magic pipeline.

Results: Magic-BLAST uses innovative techniques that include the optimization of a spliced alignment score and selective masking during seed selection. We evaluate the performance of Magic-BLAST to accurately map short or long sequences and its ability to discover introns on real RNA-seq data sets from PacBio, Roche and Illumina runs, and on six benchmarks, and compare it to other popular aligners. Additionally, we look at alignments of human idealized RefSeq mRNA sequences perfectly matching the genome.

Conclusions: We show that Magic-BLAST is the best at intron discovery over a wide range of conditions and the best at mapping reads longer than 250 bases, from any platform. It is versatile and robust to high levels of mismatches or extreme base composition, and reasonably fast. It can align reads to a BLAST database or a FASTA file. It can accept a FASTQ file as input or automatically retrieve an accession from the SRA repository at the NCBI.

Source
http://dx.doi.org/10.1186/s12859-019-2996-x (DOI Listing)
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6659269 (PMC)
July 2019

geneHummus: an R package to define gene families and their expression in legumes and beyond.

BMC Genomics 2019 Jul 18;20(1):591. Epub 2019 Jul 18.

National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD, 20894, USA.

Background: During the last decade, plant biotechnological laboratories have sparked a monumental revolution with the rapid development of next-generation sequencing technologies at affordable prices. Soon, these sequencing technologies and assembling of whole genomes will extend beyond the plant computational biologists and become commonplace within the plant biology disciplines. The current availability of large-scale genomic resources for non-traditional plant model systems (the so-called 'orphan crops') is enabling the construction of high-density integrated physical and genetic linkage maps with potential applications in plant breeding. The newly available fully sequenced plant genomes represent an incredible opportunity for comparative analyses that may reveal new aspects of genome biology and evolution. The analysis of the expansion and evolution of gene families across species is a common approach to infer biological functions. To date, the extent and role of gene families in plants has only been partially addressed and many gene families remain to be investigated. Manual identification of gene families is highly time-consuming and laborious, requiring an iterative process of manual and computational analysis to identify members of a given family, typically combining numerous BLAST searches and manually cleaning data. Due to the increasing abundance of genome sequences and the agronomical interest in plant gene families, the field needs a clear, automated annotation tool.

Results: Here, we present the geneHummus package, an R-based pipeline for the identification and characterization of plant gene families. The impact of this pipeline comes from a reduction in hands-on annotation time combined with high specificity and sensitivity in extracting only proteins from the RefSeq database and providing the conserved domain architectures based on SPARCLE. As a case study we focused on the auxin response factor (ARF) gene family in Cicer arietinum (chickpea) and other legumes.

Conclusion: We anticipate that our pipeline should be suitable for any taxonomic plant family, and likely other gene families, vastly improving the speed and ease of genomic data processing.

Source
http://dx.doi.org/10.1186/s12864-019-5952-2 (DOI Listing)
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6639926 (PMC)
July 2019

Box, stalked, and upside-down? Draft genomes from diverse jellyfish (Cnidaria, Acraspeda) lineages: Alatina alata (Cubozoa), Calvadosia cruxmelitensis (Staurozoa), and Cassiopea xamachana (Scyphozoa).

Gigascience 2019 07;8(7)

Whitney Laboratory for Marine Bioscience, University of Florida, 9505 Ocean Shore Boulevard, St. Augustine, FL, 32080, USA.

Background: Anthozoa, Endocnidozoa, and Medusozoa are the 3 major clades of Cnidaria. Medusozoa is further divided into 4 clades: Hydrozoa, Staurozoa, Cubozoa, and Scyphozoa; the latter 3 lineages make up the clade Acraspeda. Acraspeda encompasses extraordinary diversity in terms of life history, numerous nuisance species, taxa with complex eyes rivaling other animals, and some of the most venomous organisms on the planet. Genomes have recently become available within Scyphozoa and Cubozoa, but there are currently no published genomes within Staurozoa and Cubozoa.

Findings: Here we present 3 new draft genomes of Calvadosia cruxmelitensis (Staurozoa), Alatina alata (Cubozoa), and Cassiopea xamachana (Scyphozoa) for which we provide a preliminary orthology analysis that includes an inventory of their respective venom-related genes. Additionally, we identify synteny between POU and Hox genes that had previously been reported in a hydrozoan, suggesting this linkage is highly conserved, possibly dating back to at least the last common ancestor of Medusozoa, yet likely independent of vertebrate POU-Hox linkages.

Conclusions: These draft genomes provide a valuable resource for studying the evolutionary history and biology of these extraordinary animals, and for identifying genomic features underlying venom, vision, and life history traits in Acraspeda.

Source
http://dx.doi.org/10.1093/gigascience/giz069 (DOI Listing)
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6599738 (PMC)
July 2019

NovoGraph: Human genome graph construction from multiple long-read assemblies.

F1000Res 2018 3;7:1391. Epub 2018 Sep 3.

National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, 20817, USA.

Genome graphs are emerging as an important novel approach to the analysis of high-throughput human sequencing data. By explicitly representing genetic variants and alternative haplotypes in a mappable data structure, they can enable the improved analysis of structurally variable and hyperpolymorphic regions of the genome. In most existing approaches, graphs are constructed from variant call sets derived from short-read sequencing. As long-read sequencing becomes more cost-effective and enables assembly for increasing numbers of whole genomes, a method for the direct construction of a genome graph from sets of assembled human genomes would be desirable. Such assembly-based genome graphs would encompass the wide spectrum of genetic variation accessible to long-read-based assembly, including large structural variants and divergent haplotypes. Here we present NovoGraph, a method for the construction of a human genome graph directly from a set of assemblies. NovoGraph constructs a genome-wide multiple sequence alignment of all input contigs and creates a graph by merging the input sequences at positions that are both homologous and sequence-identical. NovoGraph outputs resulting graphs in VCF format that can be loaded into third-party genome graph toolkits. To demonstrate NovoGraph, we construct a genome graph with 23,478,835 variant sites and 30,582,795 variant alleles from assemblies of seven ethnically diverse human genomes (AK1, CHM1, CHM13, HG003, HG004, HX1, NA19240). Initial evaluations show that mapping against the constructed graph reduces the average mismatch rate of reads from sample NA12878 by approximately 0.2%, albeit at a slightly increased rate of reads that remain unmapped.

Source
http://dx.doi.org/10.12688/f1000research.15895.2 (DOI Listing)
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6305223 (PMC)
November 2019

Reply to the paper: Misunderstood parameters of NCBI BLAST impacts the correctness of bioinformatics workflows.

Bioinformatics 2019 08;35(15):2699-2700

National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA.

Source
http://dx.doi.org/10.1093/bioinformatics/bty1026 (DOI Listing)
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6662297 (PMC)
August 2019

SeqAcademy: an educational pipeline for RNA-Seq and ChIP-Seq analysis.

F1000Res 2018 22;7. Epub 2018 May 22.

National Center for Biotechnology Information, National Institutes of Health, Bethesda, MD, 20894, USA.

Quantification of gene expression and characterization of gene transcript structures are central problems in molecular biology. RNA sequencing (RNA-Seq) and chromatin immunoprecipitation sequencing (ChIP-Seq) are important methods, but can be cumbersome and difficult for beginners to learn. To teach interested students and scientists how to analyze RNA-Seq and ChIP-Seq data, we present a start-to-finish tutorial for analyzing RNA-Seq and ChIP-Seq data: SeqAcademy (https://github.com/NCBI-Hackathons/seqacademy, http://www.seqacademy.org/). This user-friendly pipeline, fully written in markdown language, emphasizes the use of publicly available RNA-Seq and ChIP-Seq data and strings together popular tools that bridge the gap between raw sequencing reads and biological insight. We demonstrate practical and conceptual considerations for various RNA-Seq and ChIP-Seq analysis steps with a biological use case - a previously published yeast experiment. This work complements existing sophisticated RNA-Seq and ChIP-Seq pipelines designed for advanced users by gently introducing the critical components of RNA-Seq and ChIP-Seq analysis to the novice bioinformatician. In conclusion, this well-documented pipeline will introduce state-of-the-art RNA-Seq and ChIP-Seq analysis tools to beginning bioinformaticians and help facilitate the analysis of the burgeoning amounts of public RNA-Seq and ChIP-Seq data.

Source
http://dx.doi.org/10.12688/f1000research.14880.4 (DOI Listing)
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7525341 (PMC)
May 2018

Matchmaking in Bioinformatics.

F1000Res 2018 9;7. Epub 2018 Feb 9.

Department of Biological Sciences and School of Biomedical Sciences, Kent State University, Kent, OH, 44242, USA.

Ever return from a meeting feeling elated by all those exciting talks, yet unsure how all those presented glamorous and/or exciting tools can be useful in your research? Or do you have a great piece of software you want to share, yet only a handful of people visited your poster? We have all been there, and that is why we organized the Matchmaking for Computational and Experimental Biologists Session at the latest ISCB/GLBIO'2017 meeting in Chicago (May 15-17, 2017). The session exemplifies a novel approach, mimicking "matchmaking", to encouraging communication, making connections and fostering collaborations between computational and non-computational biologists. More specifically, the session facilitates face-to-face communication between researchers with similar or differing research interests, which we feel is critical for promoting productive discussions and collaborations. To accomplish this, three short scheduled talks were delivered, focusing on RNA-seq, integration of clinical and genomic data, and chromatin accessibility analyses. Next, small-table developer-led discussions, modeled after speed-dating, enabled each developer (including the speakers) to introduce a specific tool and to engage potential users or other developers around the table. Notably, we asked the audience whether any other tool developers would want to showcase their tool and we thus added four developers as moderators of these small-table discussions. Given the positive feedback from the tool developers, we feel that this type of session is an effective approach for promoting valuable scientific discussion, and is particularly helpful in the context of conferences where the number of participants and activities could hamper such interactions.

Source
http://dx.doi.org/10.12688/f1000research.13705.1 (DOI Listing)
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5871941 (PMC)
February 2018

PubRunner: A light-weight framework for updating text mining results.

F1000Res 2017 2;6:612. Epub 2017 May 2.

National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA.

Biomedical text mining promises to assist biologists in quickly navigating the combined knowledge in their domain. This would allow improved understanding of the complex interactions within biological systems and faster hypothesis generation. New biomedical research articles are published daily and text mining tools are only as good as the corpus from which they work. Many text mining tools are underused because their results are static and do not reflect the constantly expanding knowledge in the field. In order for biomedical text mining to become an indispensable tool used by researchers, this problem must be addressed. To this end, we present PubRunner, a framework for regularly running text mining tools on the latest publications. PubRunner is lightweight, simple to use, and can be integrated with an existing text mining tool. The workflow involves downloading the latest abstracts from PubMed, executing a user-defined tool, pushing the resulting data to a public FTP or Zenodo dataset, and publicizing the location of these results on the public PubRunner website. We illustrate the use of this tool by re-running the commonly used word2vec tool on the latest PubMed abstracts to generate up-to-date word vector representations for the biomedical domain. This shows a proof of concept that we hope will encourage text mining developers to build tools that truly will aid biologists in exploring the latest publications.

Source
http://dx.doi.org/10.12688/f1000research.11389.2 (DOI Listing)
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5664974 (PMC)
May 2017

Viewing RNA-seq data on the entire human genome.

F1000Res 2017 28;6:596. Epub 2017 Apr 28.

National Center for Biotechnology Information, U.S. National Library of Medicine, Bethesda, MD, 20894, USA.

RNA-Seq Viewer is a web application that enables users to visualize genome-wide expression data from NCBI's Sequence Read Archive (SRA) and Gene Expression Omnibus (GEO) databases. The application prototype was created by a small team during a three-day hackathon facilitated by NCBI at Brandeis University. The backend data pipeline was developed and deployed on a shared AWS EC2 instance. Source code is available at https://github.com/NCBI-Hackathons/rnaseqview.

Source
http://dx.doi.org/10.12688/f1000research.9762.1 (DOI Listing)
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5605993 (PMC)
April 2017

Extending TCGA queries to automatically identify analogous genomic data from dbGaP.

F1000Res 2017 24;6:319. Epub 2017 Mar 24.

National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, USA.

Data sharing is critical to advance genomic research by reducing the demand to collect new data by reusing and combining existing data and by promoting reproducible research. The Cancer Genome Atlas (TCGA) is a popular resource for individual-level genotype-phenotype cancer-related data. The Database of Genotypes and Phenotypes (dbGaP) contains many datasets similar to those in TCGA. We have created a software pipeline that will allow researchers to discover relevant genomic data from dbGaP, based on matching TCGA metadata. The resulting research provides an easy-to-use tool to connect these two data sources.

Source
http://dx.doi.org/10.12688/f1000research.9837.1 (DOI Listing)
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5538035 (PMC)
March 2017

DangerTrack: A scoring system to detect difficult-to-assess regions.

F1000Res 2017 7;6:443. Epub 2017 Apr 7.

National Center for Biotechnology Information (NCBI), National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA.

Over recent years, multiple groups have shown that a large number of structural variants, repeats, or problems with the underlying genome assembly have dramatic effects on the mapping, calling, and overall reliability of single nucleotide polymorphism calls. This project endeavored to develop an easy-to-use track for looking at structural variant and repeat regions. This track, DangerTrack, can be displayed alongside the existing Genome Reference Consortium assembly tracks to warn clinicians and biologists when variants of interest may be incorrectly called, of dubious quality, or on an insertion or copy number expansion. While mapping and variant calling can be automated, it is our opinion that when these regions are of interest to a particular clinical or research group, they warrant a careful examination, potentially involving localized reassembly. DangerTrack is available at https://github.com/DCGenomics/DangerTrack.

Source
http://dx.doi.org/10.12688/f1000research.11254.1 (DOI Listing)
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5405793 (PMC)
April 2017

dbVar structural variant cluster set for data analysis and variant comparison.

F1000Res 2016 13;5:673. Epub 2016 Apr 13.

National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA.

dbVar houses over 3 million submitted structural variants (SSV) from 120 human studies including copy number variations (CNV), insertions, deletions, inversions, translocations, and complex chromosomal rearrangements. Users can submit multiple SSVs to dbVar that are presumably identical, but were ascertained by different platforms and samples, to calculate whether the variant is rare or common in the population and allow for cross validation. However, because SSV genomic location reporting can vary - including fuzzy locations where the start and/or end points are not precisely known - analysis, comparison, annotation, and reporting of SSVs across studies can be difficult. This project was initiated by the Structural Variant Comparison Group for the purpose of generating a non-redundant set of genomic regions defined by counts of concordance for all human SSVs placed on RefSeq assembly GRCh38 (RefSeq accession GCF_000001405.26). We intend that the availability of these regions, called structural variant clusters (SVCs), will facilitate the analysis, annotation, and exchange of SV data and allow for simplified display in genomic sequence viewers for improved variant interpretation. Sets of SVCs were generated by variant type for each of the 120 studies as well as for a combined set across all studies. Starting from 3.64 million SSVs, 2.5 million and 3.4 million non-redundant SVCs with count >=1 were generated by variant type for each study and across all studies, respectively. In addition, we have developed utilities for annotating, searching, and filtering SVC data in GVF format for computing summary statistics, exporting data for genomic viewers, and annotating the SVC using external data sources.
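
As a rough illustration of what a non-redundant set of regions with concordance counts could look like, the Python sketch below collapses variants that share chromosome, start, end, and variant type, and counts how many submissions support each region. This is a simplifying assumption rather than the project's method: the abstract does not spell out the clustering rule, fuzzy locations are ignored here, and the field names and coordinates are made up.

```python
from collections import Counter


def cluster_counts(ssvs):
    """Count submitted variants per (chrom, start, end, type) region.

    Exact-coordinate matching is an assumption of this sketch; it does
    not model fuzzy start or end points.
    """
    return Counter(
        (v["chrom"], v["start"], v["end"], v["type"]) for v in ssvs
    )


# Illustrative, made-up submitted structural variants.
ssvs = [
    {"chrom": "chr1", "start": 1000, "end": 5000, "type": "deletion"},
    {"chrom": "chr1", "start": 1000, "end": 5000, "type": "deletion"},
    {"chrom": "chr1", "start": 1000, "end": 5000, "type": "inversion"},
    {"chrom": "chr2", "start": 200, "end": 900, "type": "deletion"},
]
svcs = cluster_counts(ssvs)
```

Four submitted variants collapse to three regions in this toy input, with the chr1 deletion supported by two submissions.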

Source
http://dx.doi.org/10.12688/f1000research.8290.2
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5345777
April 2016

MetaNetVar: Pipeline for applying network analysis tools for genomic variants analysis.

F1000Res 2016 13;5:674. Epub 2016 Apr 13.

National Center for Biotechnology Information, National Library of Medicine, Bethesda, USA.

Network analysis can make variant analysis better. There are existing tools like HotNet2 and dmGWAS that can provide various analytical methods. We developed a prototype of a pipeline called MetaNetVar that allows execution of multiple tools. The code is published at https://github.com/NCBI-Hackathons/Network_SNPs. A working prototype is published as an Amazon Machine Image: ami-4510312f.

Source
http://dx.doi.org/10.12688/f1000research.8288.1
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4857755
May 2016

Closing gaps between open software and public data in a hackathon setting: User-centered software prototyping.

F1000Res 2016 13;5:672. Epub 2016 Apr 13.

NIH Library, Division of Library Services, Office of Research Services, National Institutes of Health, Bethesda, MD, USA.

In genomics, bioinformatics and other areas of data science, gaps exist between extant public datasets and the open-source software tools built by the community to analyze similar data types. The purpose of biological data science hackathons is to assemble groups of genomics or bioinformatics professionals and software developers to rapidly prototype software to address these gaps. The only two rules for the NCBI-assisted hackathons run so far are that 1) data either must be housed in public data repositories or be deposited to such repositories shortly after the hackathon's conclusion, and 2) all software comprising the final pipeline must be open-source or open-use. Proposed topics, as well as suggested tools and approaches, are distributed to participants at the beginning of each hackathon and refined during the event. Software, scripts, and pipelines are developed and published on GitHub, a web service providing publicly available, free-usage tiers for collaborative software development. The code resulting from each hackathon is published at https://github.com/NCBI-Hackathons/ with separate directories or repositories for each team.

Source
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4837979
http://dx.doi.org/10.12688/f1000research.8382.2
May 2016

Mitogen-activated protein kinase signaling causes malignant melanoma cells to differentially alter extracellular matrix biosynthesis to promote cell survival.

BMC Cancer 2016 Mar 5;16:186. Epub 2016 Mar 5.

Laboratory of Cell Biology, Center for Cancer Research, National Cancer Institute, National Institutes of Health, 37 Convent Dr., Bethesda, MD, 20892, USA.

Background: Intrinsic and acquired resistance to drug therapies remains a challenge for malignant melanoma patients. Intratumoral heterogeneities within the tumor microenvironment contribute additional complexity to the determinants of drug efficacy and acquired resistance.

Methods: We use 3D biomimetic platforms to understand dynamics in extracellular matrix (ECM) biogenesis following pharmaceutical intervention against mitogen-activated protein kinases (MAPK) signaling. We further determined temporal evolution of secreted ECM components by isogenic melanoma cell clones.

Results: We found that the cell clones differentially secrete and assemble a myriad of ECM molecules into dense fibrillar and globular networks. We show that cells can modulate their ECM biosynthesis in response to external insults. Fibronectin (FN) is one of the key architectural components, modulating the efficacy of a broad spectrum of drug therapies. Stable cell lines engineered to secrete minimal levels of FN showed a concomitant increase in secretion of Tenascin-C and became sensitive to BRAF(V600E) and ERK inhibition as clonally derived 3D tumor aggregates. These cells failed to assemble exogenous FN despite maintaining the integrin machinery to facilitate cell-ECM cross-talk. We determined that only clones that increased FN production via p38 MAPK and β1 integrin survived drug treatment.

Conclusions: These data suggest that tumor cells engineer drug resistance by altering their ECM biosynthesis. Therefore, drug treatment may induce ECM biosynthesis, contributing to de novo resistance.

Source
http://dx.doi.org/10.1186/s12885-016-2211-7
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4779217
March 2016

Effect of domestication on the spread of the [PIN+] prion in Saccharomyces cerevisiae.

Genetics 2014 Jul 8;197(3):1007-24. Epub 2014 May 8.

Laboratory of Biochemistry and Genetics, National Institute of Diabetes, Digestive and Kidney Diseases, National Institutes of Health, Bethesda, Maryland 20892

Prions (infectious proteins) cause fatal neurodegenerative diseases in mammals. In the yeast Saccharomyces cerevisiae, many toxic and lethal variants of the [PSI+] and [URE3] prions have been identified in laboratory strains, although some commonly studied variants do not seem to impair cell growth. Phylogenetic analysis has revealed four major clades of S. cerevisiae that share histories of two prion proteins and largely correspond to different ecological niches of yeast. The [PIN+] prion was most prevalent in commercialized niches, infrequent among wine/vineyard strains, and not observed in ancestral isolates. As previously reported, the [PSI+] and [URE3] prions are not found in any of these strains. Patterns of heterozygosity revealed genetic mosaicism and indicated extensive outcrossing among divergent strains in commercialized environments. In contrast, ancestral isolates were all homozygous and wine/vineyard strains were closely related to each other and largely homozygous. Cellular growth patterns were highly variable within and among clades, although ancestral isolates were the most efficient sporulators and domesticated strains showed greater tendencies for flocculation. [PIN+]-infected strains had a significantly higher likelihood of polyploidy, showed a higher propensity for flocculation compared to uninfected strains, and had higher sporulation efficiencies compared to domesticated, uninfected strains. Extensive phenotypic variability among strains from different environments suggests that S. cerevisiae is a niche generalist and that most wild strains are able to switch from asexual to sexual and from unicellular to multicellular growth in response to environmental conditions. Our data suggest that outbreeding and multicellular growth patterns adapted for domesticated environments are ecological risk factors for the [PIN+] prion in wild yeast.

Source
http://dx.doi.org/10.1534/genetics.114.165670
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4096356
July 2014

Contribution of phage-derived genomic islands to the virulence of facultative bacterial pathogens.

Environ Microbiol 2013 Feb 4;15(2):307-12. Epub 2012 Oct 4.

National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA.

Facultative pathogens have extremely dynamic pan-genomes, to a large extent derived from bacteriophages and other mobile elements. We developed a simple approach to identify phage-derived genomic islands and apply it to show that pathogens from diverse bacterial genera are significantly enriched in clustered phage-derived genes compared with related benign strains. These findings show that genome expansion by integration of prophages containing virulence factors is a major route of evolution of facultative bacterial pathogens.

Source
http://dx.doi.org/10.1111/j.1462-2920.2012.02886.x
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5866053
February 2013

Hydrophobic residues in small ankyrin 1 participate in binding to obscurin.

Mol Membr Biol 2012 Mar;29(2):36-51

Program in Biochemistry and Molecular Biology, University of Maryland, Baltimore, Baltimore, MD, USA.

Small ankyrin-1 is a splice variant of the ANK1 gene that binds to obscurin A. Previous studies have identified electrostatic interactions that contribute to this interaction. In addition, molecular dynamics (MD) simulations predict four hydrophobic residues in a 'hot spot' on the surface of the ankyrin-like repeats of sAnk1, near the charged residues involved in binding. We used site-directed mutagenesis, blot overlays and surface plasmon resonance assays to study the contribution of the hydrophobic residues, V70, F71, I102 and I103, to two different 30-mers of obscurin that bind sAnk1, Obsc₆₃₁₆₋₆₃₄₅ and Obsc₆₂₃₁₋₆₂₆₀. Alanine mutations of each of the hydrophobic residues disrupted binding to the high affinity binding site, Obsc₆₃₁₆₋₆₃₄₅. In contrast, V70A and I102A mutations had no effect on binding to the lower affinity site, Obsc₆₂₃₁₋₆₂₆₀. Alanine mutagenesis of the five hydrophobic residues present in Obsc₆₃₁₆₋₆₃₄₅ showed that V6328, I6332, and V6334 were critical to sAnk1 binding. Individual alanine mutants of the six hydrophobic residues of Obsc₆₂₃₁₋₆₂₆₀ had no effect on binding to sAnk1, although a triple alanine mutant of residues V6233/I6234/I6235 decreased binding. We also examined a model of the Obsc₆₃₁₆₋₆₃₄₅-sAnk1 complex in MD simulations and found I102 of sAnk1 to be within 2.2 Å of V6334 of Obsc₆₃₁₆₋₆₃₄₅. In contrast to the I102A mutation, mutating I102 of sAnk1 to other hydrophobic amino acids such as phenylalanine or leucine did not disrupt binding to obscurin. Our results suggest that hydrophobic interactions contribute to the higher affinity of Obsc₆₃₁₆₋₆₃₄₅ for sAnk1 and to the dominant role exhibited by this sequence in binding.

Source
http://dx.doi.org/10.3109/09687688.2012.660709
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3377380
March 2012

Electrostatic interactions mediate binding of obscurin to small ankyrin 1: biochemical and molecular modeling studies.

J Mol Biol 2011 Apr 17;408(2):321-34. Epub 2011 Feb 17.

Department of Physiology, University of Maryland School of Medicine, Baltimore, MD, USA.

Small ankyrin 1 (sAnk1; also known as Ank1.5) is an integral protein of the sarcoplasmic reticulum (SR) in skeletal and cardiac muscle cells, where it is thought to bind to the C-terminal region of obscurin, a large modular protein that surrounds the contractile apparatus. Using fusion proteins in vitro, in combination with site-directed mutagenesis and surface plasmon resonance measurements, we previously showed that the binding site on sAnk1 for obscurin consists, in part, of six lysine and arginine residues. Here we show that four charged residues in the high-affinity binding site on obscurin for sAnk1 (between residues 6316 and 6345), consisting of three glutamates and a lysine, are necessary, but not sufficient, for this site on obscurin to bind to sAnk1 with high affinity. We also identify specific complementary mutations in sAnk1 that can partially or completely compensate for the changes in binding caused by charge-switching mutations in obscurin. We used molecular modeling to develop structural models of residues 6322-6339 of obscurin bound to sAnk1. The models, based on a combination of Brownian and molecular dynamics simulations, predict that the binding site on sAnk1 for obscurin is organized as two ankyrin-like repeats, with the last α-helical segment oriented at an angle to nearby helices, allowing lysine 6338 of obscurin to form an ionic interaction with aspartate 111 of sAnk1. This prediction was validated by double-mutant cycle experiments. Our results are consistent with a model in which electrostatic interactions between specific pairs of side chains on obscurin and sAnk1 promote binding and complex formation.

Source
http://dx.doi.org/10.1016/j.jmb.2011.01.053
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3367564
April 2011

Characterization and comparison of two binding sites on obscurin for small ankyrin 1.

Biochemistry 2010 Nov 1;49(46):9948-56. Epub 2010 Nov 1.

Department of Biochemistry and Molecular Biology, University of Maryland, Baltimore, Baltimore, Maryland 21201, United States.

Obscurin A, an ∼720 kDa modular protein of striated muscles, binds to small ankyrin 1 (sAnk1, Ank 1.5), an integral protein of the sarcoplasmic reticulum, through two distinct carboxy-terminal sequences, Obsc(6316-6436) and Obsc(6236-6260). We hypothesized that these sequences differ in affinity but that they compete for the same binding site on sAnk1. We show that the sequence within Obsc(6316-6436) that binds to sAnk1 is limited to residues 6316-6345. Comparison of Obsc(6231-6260) to Obsc(6316-6345) reveals that Obsc(6316-6345) binds sAnk1 with an affinity (133 ± 43 nM) comparable to that of the Obsc(6316-6436) fusion protein, whereas Obsc(6231-6260) binds with lower affinity (384 ± 53 nM). Oligopeptides of each sequence compete for binding with both sites at half-maximal inhibitory concentrations consistent with the affinities measured directly. Five of six site-directed mutants of sAnk1 showed similar reductions in binding to each binding site on obscurin, suggesting that they dock to many of the same residues of sAnk1. Circular dichroism (CD) analysis of the synthetic oligopeptides revealed a 2-fold greater α-helical content in Obsc(6316-6346), ∼35%, than Obsc(6231-6260), ∼17%. Using these data, structural prediction algorithms, and homology modeling, we predict that Obsc(6316-6345) contains a bent α-helix of 12 amino acids, flanked by short disordered regions, and that Obsc(6231-6260) has a short, N-terminal α-helix of 4-5 residues followed by a long disordered region. Our results are consistent with a model in which both sequences of obscurin differ significantly in structure but bind to the ankyrin-like repeat motifs of sAnk1 in a similar though not identical manner.

Source
http://dx.doi.org/10.1021/bi101165p
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3000613
November 2010

A global benchmark study using affinity-based biosensors.

Authors:
Rebecca L Rich Giuseppe A Papalia Peter J Flynn Jamie Furneisen John Quinn Joshua S Klein Phini S Katsamba M Brent Waddell Michael Scott Joshua Thompson Judie Berlier Schuyler Corry Mireille Baltzinger Gabrielle Zeder-Lutz Andreas Schoenemann Anca Clabbers Sebastien Wieckowski Mary M Murphy Phillip Page Thomas E Ryan Jay Duffner Tanmoy Ganguly John Corbin Satyen Gautam Gregor Anderluh Andrej Bavdek Dana Reichmann Satya P Yadav Eric Hommema Ewa Pol Andrew Drake Scott Klakamp Trevor Chapman Dawn Kernaghan Ken Miller Jason Schuman Kevin Lindquist Kara Herlihy Michael B Murphy Richard Bohnsack Bruce Andrien Pietro Brandani Danny Terwey Rohn Millican Ryan J Darling Liann Wang Quincy Carter Joe Dotzlaf Jacinto Lopez-Sagaseta Islay Campbell Paola Torreri Sylviane Hoos Patrick England Yang Liu Yasmina Abdiche Daniel Malashock Alanna Pinkerton Melanie Wong Eileen Lafer Cynthia Hinck Kevin Thompson Carmelo Di Primo Alison Joyce Jonathan Brooks Federico Torta Anne Birgitte Bagge Hagel Janus Krarup Jesper Pass Monica Ferreira Sergei Shikov Malgorzata Mikolajczyk Yuki Abe Gaetano Barbato Anthony M Giannetti Ganeshram Krishnamoorthy Bianca Beusink Daulet Satpaev Tiffany Tsang Eric Fang James Partridge Stephen Brohawn James Horn Otto Pritsch Gonzalo Obal Sanjay Nilapwar Ben Busby Gerardo Gutierrez-Sanchez Ruchira Das Gupta Sylvie Canepa Krista Witte Zaneta Nikolovska-Coleska Yun Hee Cho Roberta D'Agata Kristian Schlick Rosy Calvert Eva M Munoz Maria Jose Hernaiz Tsafir Bravman Monica Dines Min-Hsiang Yang Agnes Puskas Erica Boni Jiejin Li Martin Wear Asya Grinberg Jason Baardsnes Olan Dolezal Melicia Gainey Henrik Anderson Jinlin Peng Mark Lewis Peter Spies Quyhn Trinh Sergei Bibikov Jill Raymond Mohammed Yousef Vidya Chandrasekaran Yuguo Feng Anne Emerick Suparna Mundodo Rejane Guimaraes Katy McGirr Yue-Ji Li Heather Hughes Hubert Mantz Rostislav Skrabana Mark Witmer Joshua Ballard Loic Martin Petr Skladal George Korza Ite Laird-Offringa Charlene S Lee Abdelkrim Khadir Frank Podlaski Phillippe Neuner Julie Rothacker Ashique Rafique Nico Dankbar Peter Kainz Erk Gedig Momchilo Vuyisich Christina Boozer Nguyen Ly Mark Toews Aykut Uren Oleksandr Kalyuzhniy Kenneth Lewis Eugene Chomey Brian J Pak David G Myszka

Anal Biochem 2009 Mar 27;386(2):194-216. Epub 2008 Nov 27.

Center for Biomolecular Interaction Analysis, School of Medicine, University of Utah, Salt Lake City, UT 84132, USA.

To explore the variability in biosensor studies, 150 participants from 20 countries were given the same protein samples and asked to determine kinetic rate constants for the interaction. We chose a protein system that was amenable to analysis using different biosensor platforms as well as by users of different expertise levels. The two proteins (a 50-kDa Fab and a 60-kDa glutathione S-transferase [GST] antigen) form a relatively high-affinity complex, so participants needed to optimize several experimental parameters, including ligand immobilization and regeneration conditions as well as analyte concentrations and injection/dissociation times. Although most participants collected binding responses that could be fit to yield kinetic parameters, the quality of a few data sets could have been improved by optimizing the assay design. Once these outliers were removed, the average reported affinity across the remaining panel of participants was 620 pM with a standard deviation of 980 pM. These results demonstrate that when this biosensor assay was designed and executed appropriately, the reported rate constants were consistent, and independent of which protein was immobilized and which biosensor was used.

Source
http://dx.doi.org/10.1016/j.ab.2008.11.021
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3793259
March 2009