Publications by authors named "Stephen Ficklin"

43 Publications

Tripal, a community update after 10 years of supporting open source, standards-based genetic, genomic and breeding databases.

Brief Bioinform 2021 Jul 12. Epub 2021 Jul 12.

Washington State University, Pullman, WA USA.

Online, open access databases for biological knowledge serve as central repositories for research communities to store, find and analyze integrated, multi-disciplinary datasets. With increasing volumes, complexity and the need to integrate genomic, transcriptomic, metabolomic, proteomic, phenomic and environmental data, community databases face tremendous challenges in ongoing maintenance, expansion and upgrades. A common infrastructure framework using community standards shared by many databases can reduce development burden, provide interoperability, ensure use of common standards and support long-term sustainability. Tripal is a mature, open source platform built to meet this need. With ongoing improvement since its first release in 2009, Tripal provides full functionality for searching, browsing, loading and curating numerous types of data and is a primary technology powering at least 31 publicly available databases spanning plants, animals and human data, primarily storing genomics, genetics and breeding data. Tripal software development is managed by a shared, inclusive governance structure including both project management and advisory teams. Here, we report on the most important and innovative aspects of Tripal after 11 years of development, including integration of diverse types of biological data, successful collaborative projects across member databases, and support for implementing FAIR principles.

Source
DOI: http://dx.doi.org/10.1093/bib/bbab238
July 2021

Transcriptomics of Differential Ripening in 'd'Anjou' Pear (Pyrus communis L.).

Front Plant Sci 2021 Jun 16;12:609684. Epub 2021 Jun 16.

USDA, ARS, Tree Fruit Research Laboratory, Wenatchee, WA, United States.

Estimating maturity in pome fruits is a critical task that directs virtually all postharvest supply chain decisions. This is especially important for European pear (Pyrus communis) cultivars because losses due to spoilage and senescence must be minimized while ensuring proper ripening capacity is achieved (in part by satisfying a fruit chilling requirement). Reliable methods are lacking for accurate estimation of pear fruit maturity, and because ripening is maturity dependent, predicting ripening capacity is a challenge. In this study of the European pear cultivar 'd'Anjou', we sorted fruit at harvest based upon on-tree fruit position to build contrasts of maturity. Our sorting scheme showed clear contrasts of maturity between canopy positions, yet there was substantial overlap in the distribution of values for the index of absorbance difference (I_AD), a non-destructive spectroscopic measurement that has been used as a proxy for pome fruit maturity. This presented an opportunity to explore a contrast of maturity that was more subtle than I_AD could differentiate, and thus guided our subsequent transcriptome analysis of tissue samples taken at harvest and during storage. Using a novel approach that tests for condition-specific differences of co-expressed genes, we discovered genes with a phased character that mirrored our sorting scheme. The expression patterns of these genes are associated with fruit quality and ripening differences across the experiment. Functional profiles of these co-expressed genes are concordant with previous findings, and also offer new clues, and thus hypotheses, about genes involved in pear fruit quality, maturity, and ripening. This work may lead to new tools for enhanced postharvest management based on activity of gene co-expression modules, rather than individual genes. Further, our results indicate that modules may have utility within specific windows of time during postharvest management of 'd'Anjou' pear.

Source
DOI: http://dx.doi.org/10.3389/fpls.2021.609684
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC8243007
June 2021

Chromosomal characteristics of salt stress heritable gene expression in the rice genome.

BMC Genom Data 2021 05 27;22(1):17. Epub 2021 May 27.

Molecular Plant Sciences Program, Washington State University, French Ad 324G, Pullman, WA, 99164, USA.

Background: Gene expression is potentially an important heritable quantitative trait that mediates between genetic variation and higher-level complex phenotypes through time and condition-dependent regulatory interactions. Therefore, we sought to explore both the genomic and condition-specific characteristics of gene expression heritability within the context of chromosomal structure.

Results: Heritability was estimated for biological gene expression using a diverse, 84-line, Oryza sativa (rice) population under optimal and salt-stressed conditions. Overall, 5936 genes were found to have heritable expression regardless of condition and 1377 genes were found to have heritable expression only during salt stress. These genes with salt-specific heritable expression are enriched for functional terms associated with response to stimulus and transcription factor activity. Additionally, we discovered that highly and lowly expressed genes, and genes with heritable expression, are distributed differently along the chromosomes in patterns that follow previously identified high-throughput chromosomal conformation capture (Hi-C) A/B chromatin compartments. Furthermore, multiple genomic hotspots enriched for genes with salt-specific heritability were identified on chromosomes 1, 4, 6, and 8. These hotspots were found to contain genes functionally enriched for transcriptional regulation and to overlap with a previously identified major QTL for salt tolerance in rice.

Conclusions: Investigating the heritability of traits, and in particular gene expression traits, is important for developing a basic understanding of how regulatory networks behave across a population. This work provides insights into spatial patterns of heritable gene expression at the chromosomal level.

Source
DOI: http://dx.doi.org/10.1186/s12863-021-00970-7
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC8162008
May 2021

Named Data Networking for Genomics Data Management and Integrated Workflows.

Front Big Data 2021 Feb 15;4:582468. Epub 2021 Feb 15.

Department of Computer Science, Tennessee Tech University, Cookeville, TN, United States.

Advanced imaging and DNA sequencing technologies now enable the diverse biology community to routinely generate and analyze terabytes of high resolution biological data. The community is rapidly heading toward the petascale in single investigator laboratory settings. As evidence, the single NCBI SRA central DNA sequence repository contains over 45 petabytes of biological data. Given the geometric growth of this and other genomics repositories, an exabyte of mineable biological data is imminent. The challenges of effectively utilizing these datasets are enormous, as they are not only large in size but also stored in geographically distributed repositories such as the National Center for Biotechnology Information (NCBI), DNA Data Bank of Japan (DDBJ), European Bioinformatics Institute (EBI), and NASA's GeneLab. In this work, we first systematically point out the data-management challenges of the genomics community. We then introduce Named Data Networking (NDN), a novel but well-researched Internet architecture that is capable of solving these challenges at the network layer. NDN performs all operations such as forwarding requests to data sources, content discovery, access, and retrieval using content names (that are similar to traditional filenames or filepaths) and eliminates the need for a location layer (the IP address) for data management. Utilizing NDN for genomics workflows simplifies data discovery, speeds up data retrieval using in-network caching of popular datasets, and allows the community to create infrastructure that supports operations such as creating federations of content repositories, retrieval from multiple sources, remote data subsetting, and others. Name-based operations also streamline deployment and integration of workflows with various cloud platforms.
Our contributions in this work are as follows: 1) we enumerate the cyberinfrastructure challenges of the genomics community that NDN can alleviate; 2) we describe our efforts in applying NDN for a contemporary genomics workflow (GEMmaker) and quantify the improvements (the preliminary evaluation shows a sixfold speedup in data insertion into the workflow); and 3) as a pilot, we have used an NDN naming scheme (agreed upon by the community and discussed in Section 4) to publish data from broadly used data repositories including the NCBI SRA. We have loaded the NDN testbed with these pre-processed genomes that can be accessed over NDN and used by anyone interested in those datasets. Finally, we discuss our continued effort in integrating NDN with cloud computing platforms, such as the Pacific Research Platform (PRP). The reader should note that the goal of this paper is to introduce NDN to the genomics community and discuss NDN's properties that can benefit the genomics community. We do not present an extensive performance evaluation of NDN; we are working on extending and evaluating our pilot deployment and will present systematic results in future work.
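
The name-based retrieval and in-network caching described in this abstract can be illustrated with a toy model. This is a minimal sketch, not NDN itself and not the paper's implementation: the `CachingForwarder` class, the `toy_repository` function, and the example name are all hypothetical, chosen only to show a request addressed by content name rather than by host location.

```python
from typing import Callable, Dict


class CachingForwarder:
    """Toy stand-in for a forwarder that caches content by name (hypothetical)."""

    def __init__(self, fetch_from_source: Callable[[str], bytes]) -> None:
        self._fetch = fetch_from_source
        self._content_store: Dict[str, bytes] = {}  # name -> cached data
        self.upstream_requests = 0

    def express_interest(self, name: str) -> bytes:
        # Serve from the cache when this name has been seen before.
        if name in self._content_store:
            return self._content_store[name]
        # Otherwise go upstream once and remember the result.
        self.upstream_requests += 1
        data = self._fetch(name)
        self._content_store[name] = data
        return data


def toy_repository(name: str) -> bytes:
    # Hypothetical data source: the payload is derived from the name alone.
    return f"payload for {name}".encode()


forwarder = CachingForwarder(toy_repository)
name = "/genomics/example/sample-1/reads"  # hypothetical, file-path-like name
first = forwarder.express_interest(name)
second = forwarder.express_interest(name)
```

The second request is answered from the cache, so the repository is contacted only once. The requester never supplies a server address, which is the property the abstract attributes to name-based operation.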

Source
DOI: http://dx.doi.org/10.3389/fdata.2021.582468
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7968724
February 2021

A Fixed Cohort Field Study of Gene Expression in Circulating Leukocytes From Dairy Cows With and Without Mastitis.

Front Vet Sci 2020 Sep 30;7:559279. Epub 2020 Sep 30.

Department of Animal Sciences, College of Agriculture, Human, and Natural Resource Sciences, Washington State University, Pullman, WA, United States.

Specifically designed gene expression studies can be used to prioritize candidate genes and identify novel biomarkers affecting resilience against mastitis and other diseases in dairy cattle. The primary goal of this study was to assess whether specific peripheral leukocyte genes expressed differentially in a previous study of dairy cattle with postpartum disease would also be expressed differentially in peripheral leukocytes from a diverse set of different dairy cattle with moderate to severe clinical mastitis. Four genes were selected for this study due to their differential expression in a previous transcriptomic analysis of circulating leukocytes from dairy cows with and without evidence of early postpartum disease. An additional 15 genes were included based on their cellular, immunologic, and inflammatory functions associated with resistance and tolerance to mastitis. This fixed cohort study was conducted on a conventional dairy in Washington state. Cows >50 days in milk (DIM) with mastitis (n = 12) were enrolled along with healthy cows (n = 8) selected to match the DIM and lactation numbers of mastitic cows. Blood was collected for a complete blood count (CBC), serum biochemistry, leukocyte isolation, and RNA extraction on the day of enrollment and twice more at 6- to 8-day intervals. Latent class analysis was performed to discriminate healthy vs. mastitic cows and to describe disease resolution. RNA samples were processed by the Primate Diagnostic Services Laboratory (University of Washington, Seattle, WA). Gene expression analysis was performed using the Nanostring System (Nanostring Technologies, Seattle, Washington, USA). Of the four genes (, and ) with evidence of upregulation in cows with mastitis, three of those genes (, and ) were investigated due to their previously identified association with postpartum disease.
These genes are responsible for immunomodulatory molecules that selectively enhance or alter host innate immune defense mechanisms and modulate pathogen-induced inflammatory responses. Although further research is warranted to explain their functional mechanisms and bioactivity in cattle, our findings suggest that these conserved elements of innate immunity have the potential to bridge disease states and target tissues in diverse dairy populations.

Source
DOI: http://dx.doi.org/10.3389/fvets.2020.559279
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7554338
September 2020

Tripal and Galaxy: supporting reproducible scientific workflows for community biological databases.

Database (Oxford) 2020 01;2020

Dept of Horticulture, Washington State University, 149 Johnson Hall 646414, Pullman, WA 99164-6414, USA.

Online biological databases housing genomics, genetic and breeding data can be constructed using the Tripal toolkit. Tripal is an open-source, internationally developed framework that implements FAIR data principles and is meant to ease the burden of constructing such websites for research communities. Use of a common, open framework improves the sustainability and manageability of such a site. Site developers can create extensions for their site and in turn share those extensions with others. One challenge that community databases often face is the need to provide tools for their users that analyze increasingly large datasets using multiple software tools strung together in a scientific workflow on complicated computational resources. The Tripal Galaxy module, a 'plug-in' for Tripal, meets this need through integration of Tripal with the Galaxy Project workflow management system. Site developers can create workflows appropriate to the needs of their community using Galaxy and then share those for execution on their Tripal sites via automatically constructed, but configurable, web forms or using an application programming interface to power web-based analytical applications. The Tripal Galaxy module helps reduce duplication of effort by allowing site developers to spend time constructing workflows and building their applications rather than rebuilding infrastructure for job management of multi-step applications.

Source
DOI: http://dx.doi.org/10.1093/database/baaa032
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7334887
January 2020

Dissecting the Genetic Architecture of Aphanomyces Root Rot Resistance in Lentil by QTL Mapping and Genome-Wide Association Study.

Int J Mol Sci 2020 Mar 20;21(6). Epub 2020 Mar 20.

USDA-ARS Grain Legume Genetics and Physiology Research Unit, Pullman, WA 99164, USA.

Lentil (Lens culinaris Medikus) is an important source of protein for people in developing countries. Aphanomyces root rot (ARR) has emerged as one of the most devastating diseases affecting lentil production. In this study, we applied two complementary quantitative trait loci (QTL) analysis approaches to unravel the genetic architecture underlying this complex trait. A recombinant inbred line (RIL) population and an association mapping population were genotyped using genotyping by sequencing (GBS) to discover novel single nucleotide polymorphisms (SNPs). QTL mapping identified 19 QTL associated with ARR resistance, while association mapping detected 38 QTL and highlighted accumulation of favorable haplotypes in most of the resistant accessions. Seven QTL clusters were discovered on six chromosomes, and 15 putative genes were identified within the QTL clusters. To validate QTL mapping and genome-wide association study (GWAS) results, expression analysis of five selected genes was conducted on partially resistant and susceptible accessions. Three of the genes were differentially expressed at early stages of infection, two of which may be associated with ARR resistance. Our findings provide valuable insight into the genetic control of ARR, and genetic and genomic resources developed here can be used to accelerate development of lentil cultivars with high levels of partial resistance to ARR.

Source
DOI: http://dx.doi.org/10.3390/ijms21062129
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7139309
March 2020

Tripal MapViewer: A tool for interactive visualization and comparison of genetic maps.

Database (Oxford) 2019 01;2019

Department of Horticulture, Washington State University, Pullman, WA 99164-6414, USA.

Tripal is an open-source, resource-efficient toolkit for construction of genomic, genetic and breeding databases. It facilitates development of biological websites by providing tools to integrate and display biological data using the generic database schema, Chado, together with Drupal, a popular website creation and content management system. Tripal MapViewer is a new interactive tool for visualizing genetic map data. Developed as a Tripal replacement for Comparative Map Viewer (CMap), it enables visualization of entire maps or linkage groups and features such as molecular markers, quantitative trait loci (QTLs) and heritable phenotypic markers. It also provides graphical comparison of maps sharing the same markers as well as dot plot and correspondence matrices. MapViewer integrates directly with the Tripal application programming interface framework, improving data searching capability and providing a more seamless experience for site visitors. The Tripal MapViewer interface can be integrated in any Tripal map page and linked from any Tripal page for markers, QTLs, heritable morphological markers or genes. Configuration of the display is available through a control panel and the administration interface. The administration interface also allows configuration of the custom database query for building materialized views, providing better performance and flexibility in the way data is stored in the Chado database schema. MapViewer is implemented with the D3.js technology and is currently being used at the Genome Database for Rosaceae (https://www.rosaceae.org), CottonGen (https://www.cottongen.org), Citrus Genome Database (https://citrusgenomedb.org), Vaccinium Genome Database (https://www.vaccinium.org) and Cool Season Food Legume Database (https://www.coolseasonfoodlegume.org). It is also currently in development on the Hardwood Genomics Web (https://hardwoodgenomics.org) and TreeGenes (https://treegenesdb.org). 
Database URL: https://gitlab.com/mainlabwsu/tripal_map.

Source
DOI: http://dx.doi.org/10.1093/database/baz100
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6829499
January 2019

Tripal v3: an ontology-based toolkit for construction of FAIR biological community databases.

Database (Oxford) 2019 01;2019

Department of Horticulture, Washington State University, Pullman, WA, USA.

Community biological databases provide an important online resource for both public and private data, analysis tools and community engagement. These sites house genomic, transcriptomic, genetic, breeding and ancillary data for specific species, families or clades. Due to the complexity and increasing quantities of these data, construction of online resources is increasingly difficult, especially with limited funding and access to technical expertise. Furthermore, online repositories are expected to promote FAIR data principles (findable, accessible, interoperable and reusable), which presents additional challenges. The open-source Tripal database toolkit seeks to mitigate these challenges by creating both the software and an interactive community of developers for construction of online community databases. Additionally, through coordinated, distributed co-development, Tripal sites encourage community-wide sustainability. Here, we report the release of Tripal version 3, which improves data accessibility and data sharing through systematic use of controlled vocabularies (CVs). Tripal uses the community-developed Chado database as a default data store, but now provides tools to support other data stores, while ensuring that CVs remain the central organizational structure for the data. A new site developer can use Tripal to develop a basic site with little to no programming, with the ability to integrate other data types using extension modules and the Tripal application programming interface. A thorough online User's Guide and Developer's Handbook are available at http://tripal.info, providing download, installation and step-by-step setup instructions.

Source
DOI: http://dx.doi.org/10.1093/database/baz077
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6643302
January 2019

Cyberinfrastructure to Improve Forest Health and Productivity: The Role of Tree Databases in Connecting Genomes, Phenomes, and the Environment.

Front Plant Sci 2019 Jun 25;10:813. Epub 2019 Jun 25.

Department of Horticulture, Washington State University, Pullman, WA, United States.

Despite tremendous advancements in high throughput sequencing, the vast majority of tree genomes, and in particular those of forest trees, remain elusive. Although primary databases store genetic resources for just over 2,000 forest tree species, these are largely focused on sequence storage, basic genome assemblies, and functional assignment through existing pipelines. The tree databases reviewed here serve as secondary repositories for community data. They vary in their focal species, the data they curate, and the analytics provided, but they are united in moving toward a goal of centralizing both data access and analysis. They provide frameworks to view and update annotations for complex genomes, interrogate systems-level expression profiles, curate data for comparative genomics, and perform real-time analysis with genotype and phenotype data. The organism databases of today are no longer simply catalogs or containers of genetic information. These repositories represent integrated cyberinfrastructure that supports cross-site queries and analysis in web-based environments. These resources are striving to integrate across diverse experimental designs, sequence types, and related measures through ontologies, community standards, and web services. Efficient, simple, and robust platforms that enhance the data generated by the research community contribute to improving forest health and productivity.

Source
DOI: http://dx.doi.org/10.3389/fpls.2019.00813
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6603172
June 2019

Structural and Functional Annotation of Eukaryotic Genomes with GenSAS.

Methods Mol Biol 2019 ;1962:29-51

Department of Horticulture, Washington State University, Pullman, WA, USA.

The Genome Sequence Annotation Server (GenSAS, https://www.gensas.org) is a secure, web-based genome annotation platform for structural and functional annotation, as well as manual curation. Requiring no installation by users, GenSAS integrates popular command-line-based annotation tools under a single, easy-to-use, online interface. GenSAS integrates JBrowse and Apollo, so users can view annotation data and manually curate gene models. Users are guided step by step through the annotation process by embedded instructions and a more in-depth GenSAS User's Guide. In addition to a genome assembly file, users can also upload organism-specific transcript, protein, and RNA-seq read evidence for use in the annotation process. The latest versions of the NCBI RefSeq transcript and protein databases and the SwissProt and TrEMBL protein databases are provided for all users. GenSAS projects can be shared with other GenSAS users, enabling collaborative annotation. Once annotation is complete, GenSAS generates the final files of the annotated gene models in common file formats for use with other annotation tools, submission to a repository, and use in publications.

Source
DOI: http://dx.doi.org/10.1007/978-1-4939-9173-0_3
August 2019

High-density multi-population consensus genetic linkage map for peach.

PLoS One 2018 Nov 21;13(11):e0207724. Epub 2018 Nov 21.

Clemson University, Department of Plant and Environmental Sciences, Clemson, SC, United States of America.

Highly saturated genetic linkage maps are extremely helpful to breeders and are an essential prerequisite for many biological applications such as the identification of marker-trait associations, mapping quantitative trait loci (QTL), candidate gene identification, development of molecular markers for marker-assisted selection (MAS) and comparative genetic studies. Several high-density genetic maps, constructed using the 9K SNP peach array, are available for peach. However, each of these maps is based on a single mapping population and has limited use for QTL discovery and comparative studies. A consensus genetic linkage map developed from multiple populations provides not only a higher marker density and a greater genome coverage when compared to the individual maps, but also serves as a valuable tool for estimating genetic positions of unmapped markers. In this study, a previously developed linkage map from the cross between two peach cultivars 'Zin Dai' and 'Crimson Lady' (ZC2) was improved by genotyping additional progenies. In addition, a peach consensus map was developed based on the combination of the improved ZC2 genetic linkage map with three existing high-density genetic maps of peach and a reference map of Prunus. A total of 1,476 SNPs representing 351 unique marker positions were mapped across eight linkage groups on the ZC2 genetic map. The ZC2 linkage map spans 483.3 cM with an average distance between markers of 1.38 cM/marker. The MergeMap and LPmerge tools were used for the construction of a consensus map based on markers shared across five genetic linkage maps. The consensus linkage map contains a total of 3,092 molecular markers, consisting of 2,975 SNPs, 116 SSRs and 1 morphological marker associated with slow ripening in peach (SR). The consensus map provides valuable information on marker order and genetic position for QTL identification in peach and other genetic studies within Prunus and Rosaceae.
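
The summary figures quoted in this abstract can be cross-checked with simple arithmetic. The sketch below uses only numbers stated in the abstract. It assumes that the average marker spacing was obtained by dividing the map length by the number of unique marker positions; the abstract does not say how the average was computed, so that divisor is an assumption.

```python
# Figures taken from the abstract above.
map_length_cm = 483.3    # total length of the ZC2 linkage map, in cM
unique_positions = 351   # unique marker positions on the ZC2 map

# Assumption: average spacing = map length / unique marker positions.
avg_spacing = map_length_cm / unique_positions

# Marker classes reported for the consensus map.
snps, ssrs, morphological = 2975, 116, 1
total_markers = snps + ssrs + morphological

print(f"average spacing: {avg_spacing:.2f} cM/marker")
print(f"consensus map markers: {total_markers}")
```

Under that assumption the spacing rounds to the reported 1.38 cM/marker, and the three marker classes sum to the reported total of 3,092.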

Source
PLOS: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0207724
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6248993
April 2019

15 years of GDR: New data and functionality in the Genome Database for Rosaceae.

Nucleic Acids Res 2019 01;47(D1):D1137-D1145

Department of Horticulture, Washington State University, Pullman, WA 99164-6414, USA.

The Genome Database for Rosaceae (GDR, https://www.rosaceae.org) is an integrated web-based community database resource providing access to publicly available genomics, genetics and breeding data and data-mining tools to facilitate basic, translational and applied research in Rosaceae. The volume of data in GDR has increased greatly over the last 5 years. The GDR now houses multiple versions of whole genome assembly and annotation data from 14 species, made available by recent advances in sequencing technology. Annotated and searchable reference transcriptomes, RefTrans, combining peer-reviewed published RNA-Seq as well as EST datasets, are newly available for major crop species. Significantly more quantitative trait loci, genetic maps and markers are available in MapViewer, a new visualization tool that better integrates with other pages in GDR. Pathways can be accessed through the new GDR Cyc Pathways databases, and synteny among the newest genome assemblies from eight species can be viewed through the new synteny browser, SynView. Collated single-nucleotide polymorphism diversity data and phenotypic data from publicly available breeding datasets are integrated with other relevant data. Also, the new Breeding Information Management System allows breeders to upload, manage and analyze their private breeding data within the secure GDR server with an option to release data publicly.

Source
DOI: http://dx.doi.org/10.1093/nar/gky1000
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6324069
January 2019

AgBioData consortium recommendations for sustainable genomics and genetics databases for agriculture.

Database (Oxford) 2018 01 1;2018. Epub 2018 Jan 1.

Boyce Thompson Institute, Ithaca, NY, USA.

The future of agricultural research depends on data. The sheer volume of agricultural biological data being produced today makes excellent data management essential. Governmental agencies, publishers and science funders require data management plans for publicly funded research. Furthermore, the value of data increases exponentially when they are properly stored, described, integrated and shared, so that they can be easily utilized in future analyses. AgBioData (https://www.agbiodata.org) is a consortium of people working at agricultural biological databases, data archives and knowledgebases who strive to identify common issues in database development, curation and management, with the goal of creating database products that are more Findable, Accessible, Interoperable and Reusable. We strive to promote authentic, detailed, accurate and explicit communication between all parties involved in scientific data. As a step toward this goal, we present the current state of biocuration, ontologies, metadata and persistence, database platforms, programmatic (machine) access to data, communication and sustainability with regard to data curation. Each section describes challenges and opportunities for these topics, along with recommendations and best practices.

Source
DOI: http://dx.doi.org/10.1093/database/bay088
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6146126
January 2018

Growing and cultivating the forest genomics database, TreeGenes.

Database (Oxford) 2018 01 1;2018:1-11. Epub 2018 Jan 1.

Department of Ecology and Evolutionary Biology, University of Connecticut, Storrs, CT, USA.

Forest trees are valued sources of pulp, timber and biofuels, and serve a role in carbon sequestration, biodiversity maintenance and watershed stability. Examining the relationships among genetic, phenotypic and environmental factors for these species provides insight on the areas of concern for breeders and researchers alike. The TreeGenes database is a web-based repository that is home to 1790 tree species and over 1500 registered users. The database provides a curated archive for high-throughput genomics, including reference genomes, transcriptomes, genetic maps and variant data. These resources are paired with extensive phenotypic information and environmental layers. TreeGenes recently migrated to Tripal, an integrated and open-source database schema and content management system. This migration enabled developments focused on data exchange, data transfer and improved analytical capacity, as well as providing TreeGenes the opportunity to communicate with the following partner databases: Hardwood Genomics Web, Genome Database for Rosaceae, and the Citrus Genome Database. Recent development in TreeGenes has focused on coordinating information for georeferenced accessions, including metadata acquisition and ontological frameworks, to improve integration across studies combining genetic, phenotypic and environmental data. This focus was paired with the development of tools to enable comparative genomics and data visualization. By combining advanced data importers, relevant metadata standards and integrated analytical frameworks, TreeGenes provides a platform for researchers to store, submit and analyze forest tree data.

Source
DOI: http://dx.doi.org/10.1093/database/bay084
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6146132
January 2018

Discovery and validation of a glioblastoma co-expressed gene module.

Oncotarget 2018 Feb 13;9(13):10995-11008. Epub 2018 Jan 13.

Department of Genetics and Biochemistry, Clemson University, Clemson, SC 29634, USA.

Tumors exhibit complex patterns of aberrant gene expression. Using a knowledge-independent, noise-reducing gene co-expression network construction software called KINC, we created multiple RNAseq-based gene co-expression networks relevant to brain and glioblastoma biology. In this report, we describe the discovery and validation of a glioblastoma-specific gene module that contains 22 co-expressed genes. The genes are upregulated in glioblastoma relative to normal brain and lower grade glioma samples; they are also hypo-methylated in glioblastoma relative to lower grade glioma tumors. Among the proneural, neural, mesenchymal, and classical glioblastoma subtypes, these genes are most highly expressed in the mesenchymal subtype. Furthermore, high expression of these genes is associated with decreased survival across each glioblastoma subtype. These genes are of interest to glioblastoma biology, and our gene interaction discovery and validation workflow can be used to discover and validate co-expressed gene modules derived from any co-expression network.

Source
DOI: http://dx.doi.org/10.18632/oncotarget.24228
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5834250
February 2018

New extension software modules to enhance searching and display of transcriptome data in Tripal databases.

Database (Oxford) 2017 01;2017

Department of Entomology and Plant Pathology, University of Tennessee, Knoxville, TN, USA.

Database URL: Tripal Elasticsearch Module: https://github.com/tripal/tripal_elasticsearch; Tripal Analysis Expression Module: https://github.com/tripal/tripal_analysis_expression.

Source
DOI: http://dx.doi.org/10.1093/database/bax052
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5532966
January 2017

Discovering Condition-Specific Gene Co-Expression Patterns Using Gaussian Mixture Models: A Cancer Case Study.

Sci Rep 2017 08 17;7(1):8617. Epub 2017 Aug 17.

Department of Genetics & Biochemistry, Clemson University, Clemson, SC, 29631, USA.

A gene co-expression network (GCN) describes associations between genes and points to genetic coordination of biochemical pathways. However, genetic correlations in a GCN are only detectable if they are present in the sampled conditions. With the increasing quantity of gene expression samples available in public repositories, there is greater potential for discovery of genetic correlations from a variety of biologically interesting conditions. However, even if gene correlations are present, their discovery can be masked by noise. Noise is introduced from natural variation (intrinsic and extrinsic), systematic variation (caused by sample measurement protocols and instruments), and algorithmic and statistical variation created by selection of data processing tools. A variety of published studies, approaches and methods attempt to address each of these contributions of variation to reduce noise. Here we describe an approach using Gaussian Mixture Models (GMMs) to address natural extrinsic (condition-specific) variation during network construction from mixed input conditions. To demonstrate utility, we build and analyze a condition-annotated GCN from a compendium of 2,016 mixed gene expression data sets from five tumor subtypes obtained from The Cancer Genome Atlas. Our results show that GMMs help discover tumor subtype specific gene co-expression patterns (modules) that are significantly enriched for clinical attributes.

Source
DOI: http://dx.doi.org/10.1038/s41598-017-09094-4
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5561081
August 2017

blend4php: a PHP API for galaxy.

Database (Oxford) 2017 10;2017. Epub 2017 Jan 10.

Department of Horticulture and.

Galaxy is a popular framework for execution of complex analytical pipelines, typically for large data sets, and is commonly used for (but not limited to) genomic, genetic and related biological analysis. It provides a web front-end and integrates with high performance computing resources. Here we report the development of the blend4php library that wraps Galaxy's RESTful API into a PHP-based library. PHP-based web applications can use blend4php to automate execution, monitoring and management of a remote Galaxy server, including its users, workflows, jobs and more. The blend4php library was specifically developed for the integration of Galaxy with Tripal, the open-source toolkit for the creation of online genomic and genetic web sites. However, it was designed as an independent library for use by any application, and is freely available under version 3 of the GNU Lesser General Public License (LGPL v3.0) at https://github.com/galaxyproject/blend4php. Database URL: https://github.com/galaxyproject/blend4php.

Source
DOI: http://dx.doi.org/10.1093/database/baw154
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5225400
November 2017

Extension modules for storage, visualization and querying of genomic, genetic and breeding data in Tripal databases.

Database (Oxford) 2017 Jan;2017

Department of Horticulture, Washington State University, Pullman, WA, 99164, USA.

Tripal is an open-source database platform primarily used for development of genomic, genetic and breeding databases. We report here on the release of the Chado Loader, Chado Data Display and Chado Search modules to extend the functionality of the core Tripal modules. These new extension modules provide additional tools for (1) data loading, (2) customized visualization and (3) advanced search functions for supported data types such as organism, marker, QTL/Mendelian Trait Loci, germplasm, map, project, phenotype, genotype and their respective metadata. The Chado Loader module provides data collection templates in Excel with defined metadata and data loaders with front end forms. The Chado Data Display module contains tools to visualize each data type and its metadata, which can be used as is or customized as desired. The Chado Search module provides search and download functionality for the supported data types. Tools to visualize maps and species summaries are also included. The use of materialized views in the Chado Search module enables better performance as well as flexibility of data modeling in Chado, allowing existing Tripal databases with different metadata types to utilize the module. These Tripal Extension modules are implemented in the Genome Database for Rosaceae (rosaceae.org), CottonGen (cottongen.org), Citrus Genome Database (citrusgenomedb.org), Genome Database for Vaccinium (vaccinium.org) and the Cool Season Food Legume Database (coolseasonfoodlegume.org). Database URL: https://www.citrusgenomedb.org/, https://www.coolseasonfoodlegume.org/, https://www.cottongen.org/, https://www.rosaceae.org/, https://www.vaccinium.org/.

Source
DOI: http://dx.doi.org/10.1093/database/bax092
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5727400
January 2017

Large-Scale Gene Relocations following an Ancient Genome Triplication Associated with the Diversification of Core Eudicots.

PLoS One 2016 19;11(5):e0155637. Epub 2016 May 19.

Plant Genome Mapping Laboratory, University of Georgia, Athens, Georgia, United States of America.

Different modes of gene duplication, including whole-genome duplication (WGD) and tandem, proximal and dispersed duplications, are widespread in angiosperm genomes. Small-scale, stochastic gene relocations and transposed gene duplications are widely accepted to be the primary mechanisms for the creation of dispersed duplicates. However, here we show that most surviving ancient dispersed duplicates in core eudicots originated from large-scale gene relocations within a narrow window of time following a genome triplication (γ) event that occurred in the stem lineage of core eudicots. We name these surviving ancient dispersed duplicates relocated γ duplicates. In Arabidopsis thaliana, relocated γ, WGD and single-gene duplicates have distinct features with regard to gene functions, essentiality, and protein interactions. Relative to γ duplicates, relocated γ duplicates have higher non-synonymous substitution rates, but comparable levels of expression and regulation divergence. Thus, relocated γ duplicates should be distinguished from WGD and single-gene duplicates for evolutionary investigations. Our results suggest large-scale gene relocations following the γ event were associated with the diversification of core eudicots.

Source
PLOS: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0155637
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4873151
July 2017

Chado use case: storing genomic, genetic and breeding data of Rosaceae and Gossypium crops in Chado.

Database (Oxford) 2016 17;2016. Epub 2016 Mar 17.

Department of Horticulture, Washington State University, Pullman, WA, USA.

The Genome Database for Rosaceae (GDR) and CottonGen are comprehensive online data repositories that provide access to integrated genomic, genetic and breeding data through search, visualization and analysis tools for Rosaceae crops and Gossypium (cotton). These online databases use Chado, an open-source, generic and ontology-driven database schema for biological data, as the primary data storage platform. Chado is highly normalized and uses ontologies to indicate the 'types' of data. Chado is therefore flexible enough to house the genomic, genetic and breeding data for GDR and CottonGen. These data include whole genome sequence and annotation, transcripts, molecular markers, genetic maps, Quantitative Trait Loci, Mendelian Trait Loci, traits, germplasm, pedigrees, large scale phenotypic and genotypic data, ontologies and publications. We provide information about how to store these types of data in Chado using GDR and CottonGen as example sites that were converted from an older legacy infrastructure. Database URL: GDR (www.rosaceae.org), CottonGen (www.cottongen.org).

Source
DOI: http://dx.doi.org/10.1093/database/baw010
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4795932
October 2016

Development and preliminary evaluation of a 90 K Axiom® SNP array for the allo-octoploid cultivated strawberry Fragaria × ananassa.

BMC Genomics 2015 Mar 7;16:155. Epub 2015 Mar 7.

Wageningen-UR Plant Breeding, Wageningen, The Netherlands.

Background: A high-throughput genotyping platform is needed to enable marker-assisted breeding in the allo-octoploid cultivated strawberry Fragaria × ananassa. Short-read sequences from one diploid and 19 octoploid accessions were aligned to the diploid Fragaria vesca 'Hawaii 4' reference genome to identify single nucleotide polymorphisms (SNPs) and indels for incorporation into a 90 K Affymetrix® Axiom® array. We report the development and preliminary evaluation of this array.

Results: About 36 million sequence variants were identified in a 19 member, octoploid germplasm panel. Strategies and filtering pipelines were developed to identify and incorporate markers of several types: di-allelic SNPs (66.6%), multi-allelic SNPs (1.8%), indels (10.1%), and ploidy-reducing "haploSNPs" (11.7%). The remaining SNPs included those discovered in the diploid progenitor F. iinumae (3.9%), and speculative "codon-based" SNPs (5.9%). In genotyping 306 octoploid accessions, SNPs were assigned to six classes with Affymetrix's "SNPolisher" R package. The highest quality classes, PolyHigh Resolution (PHR), No Minor Homozygote (NMH), and Off-Target Variant (OTV) comprised 25%, 38%, and 1% of array markers, respectively. These markers were suitable for genetic studies as demonstrated in the full-sib family 'Holiday' × 'Korona' with the generation of a genetic linkage map consisting of 6,594 PHR SNPs evenly distributed across 28 chromosomes with an average density of approximately one marker per 0.5 cM, thus exceeding our goal of one marker per cM.

Conclusions: The Affymetrix IStraw90 Axiom array is the first high-throughput genotyping platform for cultivated strawberry and is commercially available to the worldwide scientific community. The array's high success rate is likely driven by the presence of naturally occurring variation in ploidy level within the nominally octoploid genome, and by effectiveness of the employed array design and ploidy-reducing strategies. This array enables genetic analyses including generation of high-density linkage maps, identification of quantitative trait loci for economically important traits, and genome-wide association studies, thus providing a basis for marker-assisted breeding in this high value crop.

Source
DOI: http://dx.doi.org/10.1186/s12864-015-1310-1
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4374422
March 2015

The Genome Database for Rosaceae (GDR): year 10 update.

Nucleic Acids Res 2014 Jan 12;42(Database issue):D1237-44. Epub 2013 Nov 12.

Department of Horticulture, Washington State University, Pullman, WA 99164-6414, USA, Department of Genetics and Biochemistry, Clemson University, Clemson, SC 29634, USA, Boyce Thompson Institute for Plant Research, Tower Road, Ithaca, NY 14853, USA, Department of Computer Science, Saginaw Valley State University, University Center, MI 48710, USA and Horticultural Sciences Department, University of Florida, Gainesville, FL 32611, USA.

The Genome Database for Rosaceae (GDR, http://www.rosaceae.org), the long-standing central repository and data mining resource for Rosaceae research, has been enhanced with new genomic, genetic and breeding data, and improved functionality. Whole genome sequences of apple, peach and strawberry are available to browse or download with a range of annotations, including gene model predictions, aligned transcripts, repetitive elements, polymorphisms, mapped genetic markers, mapped NCBI Rosaceae genes, gene homologs and association of InterPro protein domains, GO terms and Kyoto Encyclopedia of Genes and Genomes pathway terms. Annotated sequences can be queried using search interfaces and visualized using GBrowse. New expressed sequence tag unigene sets are available for major genera, and pathway data are available through FragariaCyc, AppleCyc and PeachCyc databases. Synteny among the three sequenced genomes can be viewed using GBrowse_Syn. New markers, genetic maps and extensively curated qualitative/Mendelian and quantitative trait loci are available. Phenotype and genotype data from breeding projects and genetic diversity projects are also included. Improved search pages are available for marker, trait locus, genetic diversity and publication data. New search tools for breeders enable selection comparison and assistance with breeding decision making.

Source
DOI: http://dx.doi.org/10.1093/nar/gkt1012
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3965003
January 2014

CottonGen: a genomics, genetics and breeding database for cotton research.

Nucleic Acids Res 2014 Jan 6;42(Database issue):D1229-36. Epub 2013 Nov 6.

Department of Horticulture, Washington State University, Pullman, WA 99164-6414, USA, Cotton Incorporated, Cary, NC 27513, USA and Crop Germplasm Research Unit, USDA-ARS-SPARC, College Station, TX 77845, USA.

CottonGen (http://www.cottongen.org) is a curated and integrated web-based relational database providing access to publicly available genomic, genetic and breeding data for cotton. CottonGen supersedes CottonDB and the Cotton Marker Database, with enhanced tools for easier data sharing, mining, visualization and data retrieval of cotton research data. CottonGen contains annotated whole genome sequences, unigenes from expressed sequence tags (ESTs), markers, trait loci, genetic maps, genes, taxonomy, germplasm, publications and communication resources for the cotton community. Annotated whole genome sequences of Gossypium raimondii are available with aligned genetic markers and transcripts. These whole genome data can be accessed through genome pages, search tools and GBrowse, a popular genome browser. Most of the published cotton genetic maps can be viewed and compared using CMap, a comparative map viewer, and are searchable via map search tools. Search tools also exist for markers, quantitative trait loci (QTLs), germplasm, publications and trait evaluation data. CottonGen also provides online analysis tools such as NCBI BLAST and Batch BLAST.

Source
DOI: http://dx.doi.org/10.1093/nar/gkt1064
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3964939
January 2014

Tripal v1.1: a standards-based toolkit for construction of online genetic and genomic databases.

Database (Oxford) 2013 25;2013:bat075. Epub 2013 Oct 25.

Department of Plant Sciences, University of Saskatchewan, Saskatoon, SK, Canada; Department of Horticulture, Washington State University, Pullman, WA, USA; and Department of Genetics and Biochemistry, Clemson University, Clemson, SC, USA.

Tripal is an open-source, freely available toolkit for construction of online genomic and genetic databases. It aims to facilitate development of community-driven biological websites by integrating the GMOD Chado database schema with Drupal, a popular website creation and content management software. Tripal provides a suite of tools for interaction with a Chado database and display of content therein. The tools are designed to be generic to support the various ways in which data may be stored in Chado. Previous releases of Tripal have supported organisms, genomic libraries, biological stocks, stock collections and genomic features, their alignments and annotations. Also, Tripal and its extension modules provided loaders for commonly used file formats such as FASTA, GFF, OBO, GAF, BLAST XML, KEGG heir files and InterProScan XML. Default generic templates were provided for common views of biological data, which could be customized using an open Application Programming Interface to change the way data are displayed. Here, we report additional tools and functionality that are part of release v1.1 of Tripal. These include (i) a new bulk loader that allows a site curator to import data stored in a custom tab-delimited format; (ii) full support of every Chado table for Drupal Views (a powerful tool allowing site developers to construct novel displays and search pages); and (iii) new modules including the 'Feature Map', 'Genetic', 'Publication', 'Project', 'Contact' and 'Natural Diversity' modules. Tutorials, mailing lists, download and set-up instructions, extension modules and other documentation can be found at the Tripal website located at http://tripal.info. Database URL: http://tripal.info/.

Source
DOI: http://dx.doi.org/10.1093/database/bat075
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3808541
March 2014

A systems-genetics approach and data mining tool to assist in the discovery of genes underlying complex traits in Oryza sativa.

PLoS One 2013 16;8(7):e68551. Epub 2013 Jul 16.

Plant and Environmental Sciences, Clemson University, Clemson, South Carolina, United States of America.

Many traits of biological and agronomic significance in plants are controlled in a complex manner where multiple genes and environmental signals affect the expression of the phenotype. In Oryza sativa (rice), thousands of quantitative genetic signals have been mapped to the rice genome. In parallel, thousands of gene expression profiles have been generated across many experimental conditions. Through the discovery of networks with real gene co-expression relationships, it is possible to identify co-localized genetic and gene expression signals that implicate complex genotype-phenotype relationships. In this work, we used a knowledge-independent, systems genetics approach to discover a high-quality set of co-expression networks, termed Gene Interaction Layers (GILs). Twenty-two GILs were constructed from 1,306 Affymetrix microarray rice expression profiles that were pre-clustered to allow for improved capture of gene co-expression relationships. Functional genomic and genetic data, including over 8,000 QTLs and 766 phenotype-tagged SNPs (p-value ≤ 0.001) from genome-wide association studies, both covering over 230 different rice traits, were integrated with the GILs. An online systems genetics data-mining resource, the GeneNet Engine, was constructed to enable dynamic discovery of gene sets (i.e. network modules) that overlap with genetic traits. GeneNet Engine does not provide the exact set of genes underlying a given complex trait, but through the evidence of gene-marker correspondence, co-expression, and functional enrichment, site visitors can identify genes with potential shared causality for a trait which could then be used for experimental validation. A set of 2 million SNPs was incorporated into the database and serves as a potential set of testable biomarkers for genes in modules that overlap with genetic traits. Herein, we describe two modules found using GeneNet Engine, one with significant overlap with the trait amylose content and another with significant overlap with blast disease resistance.

Source
PLOS: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0068551
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3713027
February 2014

Maximizing capture of gene co-expression relationships through pre-clustering of input expression samples: an Arabidopsis case study.

BMC Syst Biol 2013 Jun 5;7:44. Epub 2013 Jun 5.

Department of Genetics & Biochemistry, Clemson University, 105 Collings Street, Clemson, SC 29634, USA.

Background: In genomics, highly relevant gene interaction (co-expression) networks have been constructed by finding significant pair-wise correlations between genes in expression datasets. These networks are then mined to elucidate biological function at the polygenic level. In some cases networks may be constructed from input samples that measure gene expression under a variety of different conditions, such as for different genotypes, environments, disease states and tissues. When large sets of samples are obtained from public repositories it is often unmanageable to associate samples into condition-specific groups, and combining samples from various conditions has a negative effect on network size. A fixed significance threshold is often applied, which also limits the size of the final network. Therefore, we propose pre-clustering of input expression samples to approximate condition-specific grouping of samples and individual network construction of each group as a means for dynamic significance thresholding. The net effect is increased sensitivity, thus maximizing the total co-expression relationships in the final co-expression network compendium.

Results: A total of 86 Arabidopsis thaliana co-expression networks were constructed after k-means partitioning of 7,105 publicly available ATH1 Affymetrix microarray samples. We term each pre-sorted network a Gene Interaction Layer (GIL). Random Matrix Theory (RMT), an un-supervised thresholding method, was used to threshold each of the 86 networks independently, effectively providing a dynamic (non-global) threshold for the network. The overall gene count across all GILs reached 19,588 genes (94.7% measured gene coverage) and 558,022 unique co-expression relationships. In comparison, network construction without pre-sorting of input samples yielded only 3,297 genes (15.9%) and 129,134 relationships in the global network.

Conclusions: Here we show that pre-clustering of microarray samples helps approximate condition-specific networks and allows for dynamic thresholding using un-supervised methods. Because RMT ensures only highly significant interactions are kept, the GIL compendium consists of 558,022 unique high quality A. thaliana co-expression relationships across almost all of the measurable genes on the ATH1 array. For A. thaliana, these networks represent the largest compendium to date of significant gene co-expression relationships, and are a means to explore complex pathway, polygenic, and pleiotropic relationships for this focal model plant. The networks can be explored at sysbio.genome.clemson.edu. Finally, this method is applicable to any large expression profile collection for any organism and is best suited where a knowledge-independent network construction method is desired.

Source
DOI: http://dx.doi.org/10.1186/1752-0509-7-44
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3679940
June 2013