Publications by authors named "Ian Foster"

60 Publications

Enabling deeper learning on big data for materials informatics applications.

Sci Rep 2021 Feb 19;11(1):4244. Epub 2021 Feb 19.

Department of Electrical and Computer Engineering, Northwestern University, Evanston, USA.

The application of machine learning (ML) techniques in materials science has attracted significant attention in recent years because of their impressive ability to extract data-driven linkages from various input materials representations to their output properties. While traditional ML techniques have become quite ubiquitous, applications of more advanced deep learning (DL) techniques have been limited, primarily because big materials datasets are relatively rare. Given the demonstrated potential and advantages of DL and the increasing availability of big materials datasets, it is attractive to build deeper neural networks in a bid to boost model performance, but in practice simply adding layers degrades performance because of the vanishing gradient problem. In this paper, we address the question of how to enable deeper learning for cases where big materials data are available. We present a general deep learning framework based on Individual Residual learning (IRNet), composed of very deep neural networks that can work with any vector-based materials representation as input to build accurate property prediction models. We find that the proposed IRNet models not only successfully alleviate the vanishing gradient problem and enable deeper learning, but also lead to significantly (up to 47%) better model accuracy than plain deep neural networks and traditional ML techniques for a given input materials representation in the presence of big data.
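
The per-layer ("individual") residual idea is easier to see in code. The sketch below is a minimal illustration in PyTorch; the framework choice, layer widths, block counts, and input dimension are placeholders rather than the published IRNet architecture.

    # Minimal sketch of individual residual learning on a vector-based
    # materials representation; sizes are illustrative only.
    import torch
    import torch.nn as nn

    class ResidualBlock(nn.Module):
        def __init__(self, in_dim, out_dim):
            super().__init__()
            self.fc = nn.Linear(in_dim, out_dim)
            self.bn = nn.BatchNorm1d(out_dim)
            self.act = nn.ReLU()
            # Project the shortcut when the layer changes dimensionality.
            self.proj = nn.Linear(in_dim, out_dim) if in_dim != out_dim else nn.Identity()

        def forward(self, x):
            return self.act(self.bn(self.fc(x))) + self.proj(x)

    class IRNetSketch(nn.Module):
        def __init__(self, n_features, hidden=(1024, 512, 256), n_blocks_each=4):
            super().__init__()
            dims, blocks = n_features, []
            for width in hidden:
                for _ in range(n_blocks_each):
                    blocks.append(ResidualBlock(dims, width))
                    dims = width
            self.body = nn.Sequential(*blocks)
            self.head = nn.Linear(dims, 1)        # single property target

        def forward(self, x):
            return self.head(self.body(x))

    model = IRNetSketch(n_features=145)           # e.g., a hypothetical composition vector
    y = model(torch.randn(8, 145))                # batch of 8 made-up inputs

Giving every layer its own shortcut (with a projection when the width changes) is what keeps a usable gradient path through very deep stacks, which is the property the abstract credits for avoiding vanishing gradients.
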
Source
http://dx.doi.org/10.1038/s41598-021-83193-1
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7895970
February 2021

Sediment source fingerprinting: benchmarking recent outputs, remaining challenges and emerging themes.

J Soils Sediments 2020 16;20(12):4160-4193. Epub 2020 Sep 16.

Sustainable Agriculture Sciences, Rothamsted Research, North Wyke, Okehampton, Devon EX20 2SB UK.

Purpose: This review of sediment source fingerprinting assesses the current state-of-the-art, remaining challenges and emerging themes. It combines inputs from international scientists either with track records in the approach or with expertise relevant to progressing the science.

Methods: Web of Science and Google Scholar were used to review published papers spanning the period 2013-2019, inclusive, to confirm publication trends in the number of papers by study-area country and in the types of tracers used. The most recent (2018-2019, inclusive) papers were also benchmarked using a methodological decision-tree published in 2017.

Scope: Areas requiring further research and international consensus on methodological detail are reviewed, and these comprise spatial variability in tracers and corresponding sampling implications for end-members, temporal variability in tracers and sampling implications for end-members and target sediment, tracer conservation and knowledge-based pre-selection, the physico-chemical basis for source discrimination and dissemination of fingerprinting results to stakeholders. Emerging themes are also discussed: novel tracers, concentration-dependence for biomarkers, combining sediment fingerprinting and age-dating, applications to sediment-bound pollutants, incorporation of supportive spatial information to augment discrimination and modelling, aeolian sediment source fingerprinting, integration with process-based models and development of open-access software tools for data processing.

Conclusions: The popularity of sediment source fingerprinting continues on an upward trend globally, but with this growth comes issues surrounding lack of standardisation and procedural diversity. Nonetheless, the last 2 years have also evidenced growing uptake of critical requirements for robust applications and this review is intended to signpost investigators, both old and new, towards these benchmarks and remaining research challenges for, and emerging options for different applications of, the fingerprinting approach.
Source
http://dx.doi.org/10.1007/s11368-020-02755-4
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7679299
September 2020

Atlas of Transcription Factor Binding Sites from ENCODE DNase Hypersensitivity Data across 27 Tissue Types.

Cell Rep 2020 08;32(7):108029

Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD 21201, USA; Department of Psychiatry, University of Maryland School of Medicine, Baltimore, MD 21201, USA.

Characterizing the tissue-specific binding sites of transcription factors (TFs) is essential to reconstruct gene regulatory networks and predict functions for non-coding genetic variation. DNase-seq footprinting enables the prediction of genome-wide binding sites for hundreds of TFs simultaneously. Despite the public availability of high-quality DNase-seq data from hundreds of samples, a comprehensive, up-to-date resource for the locations of genomic footprints is lacking. Here, we develop a scalable footprinting workflow using two state-of-the-art algorithms: Wellington and HINT. We apply our workflow to detect footprints in 192 ENCODE DNase-seq experiments and predict the genomic occupancy of 1,515 human TFs in 27 human tissues. We validate that these footprints overlap true-positive TF binding sites from ChIP-seq. We demonstrate that the locations, depth, and tissue specificity of footprints predict effects of genetic variants on gene expression and capture a substantial proportion of genetic risk for complex traits.
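
As an illustration of the validation step described above (not the paper's actual pipeline), the snippet below computes the fraction of predicted footprints that overlap at least one ChIP-seq peak on the same chromosome; the intervals are invented BED-like tuples.

    # Hypothetical BED-like intervals: (chromosome, start, end).
    footprints = [("chr1", 100, 120), ("chr1", 5000, 5030), ("chr2", 40, 60)]
    chip_peaks = [("chr1", 90, 130), ("chr2", 10, 50)]

    def overlap_fraction(footprints, peaks):
        """Fraction of footprints overlapping at least one peak on the same chromosome."""
        by_chrom = {}
        for chrom, start, end in peaks:
            by_chrom.setdefault(chrom, []).append((start, end))
        hits = sum(
            any(ps < e and s < pe for ps, pe in by_chrom.get(c, []))
            for c, s, e in footprints
        )
        return hits / len(footprints) if footprints else 0.0

    print(overlap_fraction(footprints, chip_peaks))   # 2 of 3 overlap -> ~0.67
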
Source
http://dx.doi.org/10.1016/j.celrep.2020.108029
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7462736
August 2020

Quantum-Chemically Informed Machine Learning: Prediction of Energies of Organic Molecules with 10 to 14 Non-hydrogen Atoms.

J Phys Chem A 2020 Jul 2;124(28):5804-5811. Epub 2020 Jul 2.

Materials Science Division, Argonne National Laboratory, Lemont, Illinois 60439, United States.

High-fidelity quantum-chemical calculations can provide accurate predictions of molecular energies, but their high computational costs limit their utility, especially for larger molecules. We have shown in previous work that machine learning models trained on high-level quantum-chemical calculations (G4MP2) for organic molecules with one to nine non-hydrogen atoms can provide accurate predictions for other molecules of comparable size at much lower costs. Here we demonstrate that such models can also be used to effectively predict energies of molecules larger than those in the training set. To implement this strategy, we first established a set of 191 molecules with 10-14 non-hydrogen atoms having reliable experimental enthalpies of formation. We then assessed the accuracy of computed G4MP2 enthalpies of formation for these 191 molecules. The error in the G4MP2 results was somewhat larger than that for smaller molecules, and the reason for this increase is discussed. Two density functional methods, B3LYP and ωB97X-D, were also used on this set of molecules, with ωB97X-D found to perform better than B3LYP at predicting energies. The G4MP2 energies for the 191 molecules were then predicted using these two functionals with two machine learning methods, the FCHL-Δ and SchNet-Δ models, with the learning done on calculated energies of the one to nine non-hydrogen atom molecules. The better-performing model, FCHL-Δ, gave atomization energies of the 191 organic molecules with 10-14 non-hydrogen atoms within 0.4 kcal/mol of their G4MP2 energies. Thus, this work demonstrates that quantum-chemically informed machine learning can be used to successfully predict the energies of large organic molecules whose size is beyond that in the training set.
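
A hedged sketch of the Δ-learning strategy described above: a regression model is trained on the difference between high-level (G4MP2) and low-level (DFT) energies for small molecules, and the predicted correction is then added to a cheap DFT energy for a larger molecule. The descriptors and energies below are synthetic placeholders, not the FCHL or SchNet representations used in the paper.

    import numpy as np
    from sklearn.kernel_ridge import KernelRidge

    rng = np.random.default_rng(0)
    X_small = rng.normal(size=(500, 64))            # placeholder descriptors, 1-9 heavy atoms
    e_dft_small = rng.normal(size=500)              # low-level (DFT) energies
    e_g4mp2_small = e_dft_small + 0.1 * X_small[:, 0] + 0.01 * rng.normal(size=500)

    # Learn only the correction (high-level minus low-level).
    delta_model = KernelRidge(kernel="rbf", alpha=1e-3, gamma=0.1)
    delta_model.fit(X_small, e_g4mp2_small - e_dft_small)

    X_large = rng.normal(size=(10, 64))             # placeholder descriptors, 10-14 heavy atoms
    e_dft_large = rng.normal(size=10)
    e_pred = e_dft_large + delta_model.predict(X_large)   # G4MP2-level estimate
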
Source
http://dx.doi.org/10.1021/acs.jpca.0c01777
July 2020

Massive formation of early diagenetic dolomite in the Ediacaran ocean: Constraints on the "dolomite problem".

Proc Natl Acad Sci U S A 2020 06 8;117(25):14005-14014. Epub 2020 Jun 8.

Institute for Geology, Mineralogy, and Geophysics, Ruhr University Bochum, D-44801 Bochum, Germany.

Paleozoic and Precambrian sedimentary successions frequently contain massive dolomicrite [CaMg(CO₃)₂] units despite kinetic inhibitions to nucleation and precipitation of dolomite at Earth surface temperatures (<60 °C). This paradoxical observation is known as the "dolomite problem." Accordingly, the genesis of these dolostones is usually attributed to burial-hydrothermal dolomitization of primary limestones (CaCO₃) at temperatures of >100 °C, thus raising doubt about the validity of these deposits as archives of Earth surface environments. We present a high-resolution, >63-My-long clumped-isotope temperature (TΔ47) record of shallow-marine dolomicrites from two drillcores of the Ediacaran (635 to 541 Ma) Doushantuo Formation in South China. Our TΔ47 record indicates that a majority (87%) of these dolostones formed at temperatures of <100 °C. When considering the regional thermal history, modeling of the influence of solid-state reordering on our TΔ47 record further suggests that most of the studied dolostones formed at temperatures of <60 °C, providing direct evidence of a low-temperature origin of these dolostones. Furthermore, calculated δ¹⁸O values of diagenetic fluids, rare earth element plus yttrium compositions, and petrographic observations of these dolostones are consistent with an early diagenetic origin in a rock-buffered environment. We thus propose that a precursor precipitate from seawater was subsequently dolomitized during early diagenesis in a near-surface setting to produce the large volume of dolostones in the Doushantuo Formation. Our findings suggest that the preponderance of dolomite in Paleozoic and Precambrian deposits likely reflects oceanic conditions specific to those eras and that dolostones can be faithful recorders of environmental conditions in the early oceans.
Source
http://dx.doi.org/10.1073/pnas.1916673117
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7321997
June 2020

A regional nuclear conflict would compromise global food security.

Proc Natl Acad Sci U S A 2020 03 16;117(13):7071-7081. Epub 2020 Mar 16.

Goddard Institute for Space Studies, National Aeronautics and Space Administration, New York, NY 10025.

A limited nuclear war between India and Pakistan could ignite fires large enough to emit more than 5 Tg of soot into the stratosphere. Climate model simulations have shown severe resulting climate perturbations with declines in global mean temperature by 1.8 °C and precipitation by 8%, for at least 5 y. Here we evaluate impacts for the global food system. Six harmonized state-of-the-art crop models show that global caloric production from maize, wheat, rice, and soybean falls by 13 (±1)%, 11 (±8)%, 3 (±5)%, and 17 (±2)% over 5 y. Total single-year losses of 12 (±4)% quadruple the largest observed historical anomaly and exceed impacts caused by historic droughts and volcanic eruptions. Colder temperatures drive losses more than changes in precipitation and solar radiation, leading to strongest impacts in temperate regions poleward of 30°N, including the United States, Europe, and China for 10 to 15 y. Integrated food trade network analyses show that domestic reserves and global trade can largely buffer the production anomaly in the first year. Persistent multiyear losses, however, would constrain domestic food availability and propagate to the Global South, especially to food-insecure countries. By year 5, maize and wheat availability would decrease by 13% globally and by more than 20% in 71 countries with a cumulative population of 1.3 billion people. In view of increasing instability in South Asia, this study shows that a regional conflict using <1% of the worldwide nuclear arsenal could have adverse consequences for global food security unmatched in modern history.
Source
http://dx.doi.org/10.1073/pnas.1919049117
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7132296
March 2020

TomoGAN: low-dose synchrotron x-ray tomography with generative adversarial networks: discussion.

J Opt Soc Am A Opt Image Sci Vis 2020 Mar;37(3):422-434

Synchrotron-based x-ray tomography is a noninvasive imaging technique that allows for reconstructing the internal structure of materials at high spatial resolutions from tens of micrometers to a few nanometers. In order to resolve sample features at smaller length scales, however, a higher radiation dose is required. Therefore, the limitation on the achievable resolution is set primarily by noise at these length scales. We present TomoGAN, a denoising technique based on generative adversarial networks, for improving the quality of reconstructed images for low-dose imaging conditions. We evaluate our approach in two photon-budget-limited experimental conditions: (1) sufficient number of low-dose projections (based on Nyquist sampling), and (2) insufficient or limited number of high-dose projections. In both cases, the angular sampling is assumed to be isotropic, and the photon budget throughout the experiment is fixed based on the maximum allowable radiation dose on the sample. Evaluation with both simulated and experimental datasets shows that our approach can significantly reduce noise in reconstructed images, improving the structural similarity score of simulation and experimental data from 0.18 to 0.9 and from 0.18 to 0.41, respectively. Furthermore, the quality of the reconstructed images with filtered back projection followed by our denoising approach exceeds that of reconstructions with the simultaneous iterative reconstruction technique, showing the computational superiority of our approach.
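
The structural similarity scores quoted above can be reproduced in spirit with a few lines. The sketch below evaluates a stand-in "denoiser" against a reference image with scikit-image's SSIM implementation; the images and the averaging step are synthetic placeholders, not the TomoGAN model or the experimental reconstructions.

    import numpy as np
    from skimage.metrics import structural_similarity

    rng = np.random.default_rng(1)
    clean = rng.random((256, 256)).astype(np.float32)
    noisy = clean + 0.3 * rng.normal(size=clean.shape).astype(np.float32)
    denoised = 0.5 * (noisy + clean)              # placeholder for a learned denoiser

    for name, img in [("noisy", noisy), ("denoised", denoised)]:
        data_range = float(max(clean.max(), img.max()) - min(clean.min(), img.min()))
        score = structural_similarity(clean, img, data_range=data_range)
        print(f"SSIM({name} vs clean) = {score:.2f}")
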
Source
http://dx.doi.org/10.1364/JOSAA.375595
March 2020

Exascale applications: skin in the game.

Philos Trans A Math Phys Eng Sci 2020 Mar 20;378(2166):20190056. Epub 2020 Jan 20.

Lawrence Berkeley National Laboratory, Berkeley, CA, USA.

As noted in Wikipedia, 'skin in the game' refers to having 'incurred risk by being involved in achieving a goal', where 'skin' is a synecdoche for the person involved and 'game' is the metaphor for actions on the field of play under discussion. For exascale applications under development in the US Department of Energy Exascale Computing Project, nothing could be more apt, with the 'skin' being exascale applications and the 'game' being delivering comprehensive science-based computational applications that effectively exploit exascale high-performance computing technologies to provide breakthrough modelling and simulation and data science solutions. These solutions will yield high-confidence insights and answers to the most critical problems and challenges for the USA in scientific discovery, national security, energy assurance, economic competitiveness and advanced healthcare. This article is part of a discussion meeting issue 'Numerical algorithms for high-performance computational science'.
Source
http://dx.doi.org/10.1098/rsta.2019.0056
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7015298
March 2020

Prevalence of Inherited Mutations in Breast Cancer Predisposition Genes among Women in Uganda and Cameroon.

Cancer Epidemiol Biomarkers Prev 2020 02 23;29(2):359-367. Epub 2019 Dec 23.

Center for Clinical Cancer Genetics and Global Health, Department of Medicine, The University of Chicago, Chicago, Illinois.

Background: Sub-Saharan Africa (SSA) has a high proportion of premenopausal hormone receptor negative breast cancer. Previous studies reported a strikingly high prevalence of germline mutations in BRCA1 and BRCA2 among Nigerian patients with breast cancer. It is unknown whether this pattern exists in other SSA countries.

Methods: Breast cancer cases, unselected for age at diagnosis and family history, were recruited from tertiary hospitals in Kampala, Uganda and Yaoundé, Cameroon. Controls were women without breast cancer recruited from the same hospitals and age-matched to cases. A multigene sequencing panel was used to test for germline mutations.

Results: There were 196 cases and 185 controls with a mean age of 46.2 and 46.6 years for cases and controls, respectively. Among cases, 15.8% carried a pathogenic or likely pathogenic mutation in a breast cancer susceptibility gene: 5.6% in BRCA1, 5.6% in BRCA2, and the remaining carriers (1.5%, 1%, 0.5%, 0.5%, and 0.5%) in five other genes on the panel. Among controls, 1.6% carried a mutation in one of these genes. Cases were 11-fold more likely to carry a mutation compared with controls (OR = 11.34; 95% confidence interval, 3.44-59.06; P < 0.001). The mean age of cases with mutations was 38.3 years compared with 46.7 years among cases without such mutations (P = 0.03).
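
For readers unfamiliar with the statistic, the odds ratio can be illustrated as follows. The carrier counts are reconstructed approximately from the stated percentages (15.8% of 196 cases, 1.6% of 185 controls), so they are assumptions used only to show the calculation; the paper's own estimate and confidence interval come from its analysis, not from this sketch.

    # Worked 2x2 illustration of an odds ratio; counts below are approximate
    # reconstructions from the reported percentages, not values from the paper.
    from scipy.stats import fisher_exact

    carriers_cases, noncarriers_cases = 31, 196 - 31
    carriers_controls, noncarriers_controls = 3, 185 - 3

    table = [[carriers_cases, noncarriers_cases],
             [carriers_controls, noncarriers_controls]]
    odds_ratio, p_value = fisher_exact(table)
    print(f"OR = {odds_ratio:.2f}, Fisher exact p = {p_value:.2g}")   # OR on the order of 11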

Conclusions: Our findings replicate the earlier report of a high proportion of mutations in BRCA1 and BRCA2 among patients with symptomatic breast cancer in SSA.

Impact: Given the high burden of inherited breast cancer in SSA countries, genetic risk assessment could be integrated into national cancer control plans.
Source
http://dx.doi.org/10.1158/1055-9965.EPI-19-0506
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7007381
February 2020

Virtual Excited State Reference for the Discovery of Electronic Materials Database: An Open-Access Resource for Ground and Excited State Properties of Organic Molecules.

J Phys Chem Lett 2019 Nov 23;10(21):6835-6841. Epub 2019 Oct 23.

Northeastern University, Boston, Massachusetts 02115, United States.

This letter announces the Virtual Excited State Reference for the Discovery of Electronic Materials Database (VERDE materials DB), the first database to include downloadable excited-state structures (S0, S1, T1) and photophysical properties. VERDE materials DB is searchable, open-access via www.verdedb.org, and focused on light-responsive π-conjugated organic molecules with applications in green chemistry, organic solar cells, and organic redox flow batteries. It includes results of our active and past virtual screening studies; to date, more than 13,000 density functional theory (DFT) calculations have been performed on 1,500 molecules to obtain frontier molecular orbitals and photophysical properties, including excitation energies, dipole moments, and redox potentials. To improve community access, we have made the database available via an integration with the Materials Data Facility. We are leveraging these data to train machine learning algorithms to identify new materials and structure-property relationships between molecular ground and excited states. We present a case study involving photoaffinity labels, including predictions of new diazirine-based photoaffinity labels anticipated to have high photostabilities.
Source
http://dx.doi.org/10.1021/acs.jpclett.9b02577
November 2019

Reproducible big data science: A case study in continuous FAIRness.

PLoS One 2019 11;14(4):e0213013. Epub 2019 Apr 11.

Globus, University of Chicago, Chicago, Illinois, United States of America.

Big biomedical data create exciting opportunities for discovery, but make it difficult to capture analyses and outputs in forms that are findable, accessible, interoperable, and reusable (FAIR). In response, we describe tools that make it easy to capture, and assign identifiers to, data and code throughout the data lifecycle. We illustrate the use of these tools via a case study involving a multi-step analysis that creates an atlas of putative transcription factor binding sites from terabytes of ENCODE DNase I hypersensitive sites sequencing data. We show how the tools automate routine but complex tasks, capture analysis algorithms in understandable and reusable forms, and harness fast networks and powerful cloud computers to process data rapidly, all without sacrificing usability or reproducibility-thus ensuring that big data are not hard-to-(re)use data. We evaluate our approach via a user study, and show that 91% of participants were able to replicate a complex analysis involving considerable data volumes.
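
The "capture and assign identifiers" idea can be sketched generically: record a checksum-based manifest of every file an analysis touches so the exact inputs and outputs can be referenced later. This is an illustration of the concept only, not the identifier and packaging tools the paper describes.

    # Generic sketch: build a content-addressed manifest for an analysis directory.
    import hashlib
    import json
    import pathlib

    def manifest(directory):
        entries = {}
        for path in sorted(pathlib.Path(directory).rglob("*")):
            if path.is_file():
                digest = hashlib.sha256(path.read_bytes()).hexdigest()
                entries[str(path)] = {"sha256": digest, "bytes": path.stat().st_size}
        return entries

    if __name__ == "__main__":
        print(json.dumps(manifest("."), indent=2)[:500])   # preview of the manifest
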
Source
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0213013
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6459504
December 2019

BioWorkbench: a high-performance framework for managing and analyzing bioinformatics experiments.

PeerJ 2018 29;6:e5551. Epub 2018 Aug 29.

National Laboratory for Scientific Computing, Petrópolis, Rio de Janeiro, Brazil.

Advances in sequencing techniques have led to exponential growth in biological data, demanding the development of large-scale bioinformatics experiments. Because these experiments are computation- and data-intensive, they require high-performance computing techniques and can benefit from specialized technologies such as Scientific Workflow Management Systems and databases. In this work, we present BioWorkbench, a framework for managing and analyzing bioinformatics experiments. This framework automatically collects provenance data, including both performance data from workflow execution and data from the scientific domain of the workflow application. Provenance data can be analyzed through a web application that abstracts a set of queries to the provenance database, simplifying access to provenance information. We evaluate BioWorkbench using three case studies: SwiftPhylo, a phylogenetic tree assembly workflow; SwiftGECKO, a comparative genomics workflow; and RASflow, a RASopathy analysis workflow. We analyze each workflow from both computational and scientific domain perspectives by using queries to a provenance and annotation database. Some of these queries are available as a pre-built feature of the BioWorkbench web application. Through the provenance data, we show that the framework is scalable and achieves high performance, reducing the execution time of the case studies by up to 98%. We also show how the application of machine learning techniques can enrich the analysis process.
Source
http://dx.doi.org/10.7717/peerj.5551
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6119457
August 2018

Convergent downstream candidate mechanisms of independent intergenic polymorphisms between co-classified diseases implicate epistasis among noncoding elements.

Pac Symp Biocomput 2018;23:524-535

Center for Biomedical Informatics and Biostatistics (CB2) and Departments of Medicine and of Systems and Industrial Engineering, The University of Arizona, Tucson, AZ 85721, USA.

Eighty percent of DNA outside protein-coding regions was shown to be biochemically functional by the ENCODE project, enabling studies of their interactions. Studies have since explored how convergent downstream mechanisms arise from independent genetic risks of one complex disease. However, the cross-talk and epistasis between intergenic risks associated with distinct complex diseases have not been comprehensively characterized. Our recent integrative genomic analysis unveiled downstream biological effectors of disease-specific polymorphisms buried in intergenic regions, and we then validated their genetic synergy and antagonism in distinct GWAS. We extend this approach to characterize convergent downstream candidate mechanisms of distinct intergenic SNPs across distinct diseases within the same clinical classification. We construct a multipartite network consisting of 467 diseases organized in 15 classes, 2,358 disease-associated SNPs, 6,301 SNP-associated mRNAs by eQTL, and mRNA annotations to 4,538 Gene Ontology mechanisms. Functional similarity between two SNPs (similar SNP pairs) is imputed using a nested information-theoretic distance model for which p-values are assigned by conservative scale-free permutation of network edges without replacement (node degrees constant). At FDR≤5%, we prioritized 3,870 intergenic SNP pairs, among which 755 are associated with distinct diseases sharing the same disease class, implicating 167 intergenic SNPs, 14 classes, 230 mRNAs, and 134 GO terms. Co-classified SNP pairs were more likely to be prioritized than those of distinct classes, confirming a noncoding genetic underpinning to clinical classification (odds ratio ∼3.8; p≤10⁻²⁵). The prioritized pairs were also enriched in regions bound to the same or interacting transcription factors and/or engaged in long-range chromatin interactions suggestive of epistasis (odds ratio ∼2,500; p≤10⁻²⁵). This prioritized network implicates complex epistasis between intergenic polymorphisms of co-classified diseases and offers a roadmap for a novel therapeutic paradigm: repositioning medications that target proteins within downstream mechanisms of intergenic disease-associated SNPs. Supplementary information and software: http://lussiergroup.org/publications/disease_class.
Source
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5730078
August 2018

Toward a new generation of agricultural system data, models, and knowledge products: State of agricultural systems science.

Agric Syst 2017 Jul;155:269-288

University of Reading, UK.

We review the current state of agricultural systems science, focusing in particular on the capabilities and limitations of agricultural systems models. We discuss the state of models relative to five different Use Cases spanning field, farm, landscape, regional, and global spatial scales and engaging questions in past, current, and future time periods. Contributions from multiple disciplines have made major advances relevant to a wide range of agricultural system model applications at various spatial and temporal scales. Although current agricultural systems models have features that are needed for the Use Cases, we found that all of them have limitations and need to be improved. We identified common limitations across all Use Cases, namely 1) a scarcity of data for developing, evaluating, and applying agricultural system models and 2) inadequate knowledge systems that effectively communicate model results to society. We argue that these limitations are greater obstacles to progress than gaps in conceptual theory or available methods for using system models. New initiatives on open data show promise for addressing the data problem, but there also needs to be a cultural change among agricultural researchers to ensure that data for addressing the range of Use Cases are available for future model improvements and applications. We conclude that multiple platforms and multiple models are needed for model applications for different purposes. The Use Cases provide a useful framework for considering capabilities and limitations of existing models and data.
Source
http://dx.doi.org/10.1016/j.agsy.2016.09.021
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5485672
July 2017

Brief history of agricultural systems modeling.

Agric Syst 2017 Jul;155:240-254

University of Reading, UK.

Agricultural systems science generates knowledge that allows researchers to consider complex problems or take informed agricultural decisions. The rich history of this science exemplifies the diversity of systems and scales over which they operate and have been studied. Modeling, an essential tool in agricultural systems science, has been accomplished by scientists from a wide range of disciplines, who have contributed concepts and tools over more than six decades. As agricultural scientists now consider the "next generation" models, data, and knowledge products needed to meet the increasingly complex systems problems faced by society, it is important to take stock of this history and its lessons to ensure that we avoid re-invention and strive to consider all dimensions of associated challenges. To this end, we summarize here the history of agricultural systems modeling and identify lessons learned that can help guide the design and development of next generation of agricultural system tools and methods. A number of past events combined with overall technological progress in other fields have strongly contributed to the evolution of agricultural system modeling, including development of process-based bio-physical models of crops and livestock, statistical models based on historical observations, and economic optimization and simulation models at household and regional to global scales. Characteristics of agricultural systems models have varied widely depending on the systems involved, their scales, and the wide range of purposes that motivated their development and use by researchers in different disciplines. Recent trends in broader collaboration across institutions, across disciplines, and between the public and private sectors suggest that the stage is set for the major advances in agricultural systems science that are needed for the next generation of models, databases, knowledge products and decision support systems. The lessons from history should be considered to help avoid roadblocks and pitfalls as the community develops this next generation of agricultural systems models.
Source
http://dx.doi.org/10.1016/j.agsy.2016.05.014
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5485640
July 2017

Towards a new generation of agricultural system data, models and knowledge products: Information and communication technology.

Agric Syst 2017 Jul;155:200-212

Oregon State University, Corvallis, OR, USA.

Agricultural modeling has long suffered from fragmentation in model implementation. Many models are developed, there is much redundancy, models are often poorly coupled, model component re-use is rare, and it is frequently difficult to apply models to generate real solutions for the agricultural sector. To improve this situation, we argue that an open, self-sustained, and committed community is required to co-develop agricultural models and associated data and tools as a common resource. Such a community can benefit from recent developments in information and communications technology (ICT). We examine how such developments can be leveraged to design and implement the next generation of data, models, and decision support tools for agricultural production systems. Our objective is to assess relevant technologies for their maturity, expected development, and potential to benefit the agricultural modeling community. The technologies considered encompass methods for collaborative development and for involving stakeholders and users in development in a transdisciplinary manner. Our qualitative evaluation suggests that, as an overall research challenge, the interoperability of data sources, modular and granular open models, reference data sets for applications, and methodologies for analyzing specific user requirements need to be addressed to allow agricultural modeling to enter the big data era. This will enable much higher analytical capacities and the integrated use of new data sources. Overall, agricultural systems modeling needs to rapidly adopt and absorb state-of-the-art data and ICT technologies with a focus on the needs of beneficiaries and on facilitating those who develop applications of their models. This adoption requires the widespread uptake of a set of best practices as standard operating procedures.
Source
http://dx.doi.org/10.1016/j.agsy.2016.09.017
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5485661
July 2017

Trace: a high-throughput tomographic reconstruction engine for large-scale datasets.

Adv Struct Chem Imaging 2017 28;3(1). Epub 2017 Jan 28.

Mathematics and Computer Science Division, Argonne National Laboratory, 9700 South Cass Ave., Lemont, IL 60439 USA.

Background: Modern synchrotron light sources and detectors produce data at such scale and complexity that large-scale computation is required to unleash their full power. One of the widely used imaging techniques that generates data at tens of gigabytes per second is computed tomography (CT). Although CT experiments result in rapid data generation, the analysis and reconstruction of the collected data may require hours or even days of computation time with a medium-sized workstation, which hinders the scientific progress that relies on the results of analysis.

Methods: We present Trace, a data-intensive computing engine that we have developed to enable high-performance implementation of iterative tomographic reconstruction algorithms for parallel computers. Trace provides fine-grained reconstruction of tomography datasets using both (thread-level) shared memory and (process-level) distributed memory parallelization. Trace utilizes a special data structure called replicated reconstruction object to maximize application performance. We also present the optimizations that we apply to the replicated reconstruction objects and evaluate them using tomography datasets collected at the Advanced Photon Source.
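
A hedged sketch of the replicated-reconstruction-object pattern described above, using mpi4py: each process accumulates updates from its share of the projection angles into its own copy of the reconstruction, and the copies are then combined with a reduction. The array sizes and the "update" are placeholders, not Trace's actual kernels.

    # Run with, e.g.:  mpiexec -n 4 python replicated_recon_sketch.py
    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank, nprocs = comm.Get_rank(), comm.Get_size()

    n_slices, n_cols = 4, 128
    local_recon = np.zeros((n_slices, n_cols, n_cols), dtype=np.float64)  # this rank's replica

    # Each rank processes a disjoint subset of projection angles.
    angles = np.arange(rank, 360, nprocs)
    for theta in angles:
        # Placeholder "update": a real code would back-project measured data here.
        local_recon += 1.0 / 360.0

    global_recon = np.empty_like(local_recon)
    comm.Allreduce(local_recon, global_recon, op=MPI.SUM)   # combine the replicas
    if rank == 0:
        print("combined reconstruction mean:", global_recon.mean())

Keeping a private replica per process (and per thread within a process) avoids fine-grained locking during the updates, at the cost of the final reduction, which is the trade-off the abstract's optimizations target.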

Results: Our experimental evaluations show that our optimizations and parallelization techniques can provide 158× speedup using 32 compute nodes (384 cores) over a single-core configuration and decrease the end-to-end processing time of a large sinogram (with 4501 × 1 × 22,400 dimensions) from 12.5 h to <5 min per iteration.

Conclusion: The proposed tomographic reconstruction engine can efficiently process large-scale tomographic data using many compute nodes and minimize reconstruction times.
Source
http://dx.doi.org/10.1186/s40679-017-0040-7
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5313579
January 2017

Blending Education and Polymer Science: Semi Automated Creation of a Thermodynamic Property Database.

J Chem Educ 2016 09 15;93(9):1561-1568. Epub 2016 Aug 15.

The Institute for Molecular Engineering, The University of Chicago, Illinois 60637 United States; The Computation Institute, The University of Chicago, Illinois 60637 United States; Materials Science Division, Argonne National Laboratory, Illinois 60439 United States.

Structured databases of chemical and physical properties play a central role in the everyday research activities of scientists and engineers. In materials science, researchers and engineers turn to these databases to quickly query, compare, and aggregate various properties, thereby allowing for the development or application of new materials. The vast majority of these databases have been generated manually, through decades of labor-intensive harvesting of information from the literature; yet, while there are many examples of commonly used databases, a significant number of important properties remain locked within the tables, figures, and text of publications. The question addressed in our work is whether, and to what extent, the process of data collection can be automated. Students of the physical sciences and engineering are often confronted with the challenge of finding and applying property data from the literature, and a central aspect of their education is to develop the critical skills needed to identify such data and discern their meaning or validity. To address shortcomings associated with automated information extraction, while simultaneously preparing the next generation of scientists for their future endeavors, we developed a novel course-based approach in which students develop skills in polymer chemistry and physics and apply their knowledge by assisting with the semi-automated creation of a thermodynamic property database.
Source
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5082748
http://dx.doi.org/10.1021/acs.jchemed.5b01032
September 2016

Use of sediment source fingerprinting to assess the role of subsurface erosion in the supply of fine sediment in a degraded catchment in the Eastern Cape, South Africa.

J Environ Manage 2017 Jun 4;194:27-41. Epub 2016 Aug 4.

Department of Sustainable Soil and Grassland Systems, Rothamsted Research, North Wyke, EX20 2SB, UK.

Sediment source fingerprinting has been successfully deployed to provide information on the surface and subsurface sources of sediment in many catchments around the world. However, there is still scope to re-examine some of the major assumptions of the technique with reference to the number of fingerprint properties used in the model, the number of model iterations and the potential uncertainties of using more than one sediment core collected from the same floodplain sink. We investigated the role of subsurface erosion in the supply of fine sediment to two sediment cores collected from a floodplain in a small degraded catchment in the Eastern Cape, South Africa. The results showed that increasing the number of individual fingerprint properties in the composite signature did not improve the model goodness-of-fit. This is still a much debated issue in sediment source fingerprinting. To test the goodness-of-fit further, the number of model repeat iterations was increased from 5000 to 30,000. However, this did not reduce uncertainty ranges in modelled source proportions nor improve the model goodness-of-fit. The estimated sediment source contributions were not consistent with the available published data on erosion processes in the study catchment. The temporal pattern of sediment source contributions predicted for the two sediment cores was very different despite the cores being collected in close proximity from the same floodplain. This highlights some of the potential limitations associated with using floodplain cores to reconstruct catchment erosion processes and associated sediment source contributions. For the source tracing approach in general, the findings here suggest the need for further investigations into uncertainties related to the number of fingerprint properties included in un-mixing models. The findings support the current widespread use of ≤5000 model repeat iterations for estimating the key sources of sediment samples.
Source
http://dx.doi.org/10.1016/j.jenvman.2016.07.019
June 2017

Predictive Big Data Analytics: A Study of Parkinson's Disease Using Large, Complex, Heterogeneous, Incongruent, Multi-Source and Incomplete Observations.

PLoS One 2016 5;11(8):e0157077. Epub 2016 Aug 5.

Stevens Neuroimaging and Informatics Institute, University of Southern California, Los Angeles, California, United States of America.

Background: A unique archive of Big Data on Parkinson's Disease is collected, managed and disseminated by the Parkinson's Progression Markers Initiative (PPMI). The integration of such complex and heterogeneous Big Data from multiple sources offers unparalleled opportunities to study the early stages of prevalent neurodegenerative processes, track their progression and quickly identify the efficacies of alternative treatments. Many previous human and animal studies have examined the relationship of Parkinson's disease (PD) risk to trauma, genetics, environment, co-morbidities, or lifestyle. The defining characteristics of Big Data (large size, incongruency, incompleteness, complexity, multiplicity of scales, and heterogeneity of information-generating sources) all pose challenges to the classical techniques for data management, processing, visualization and interpretation. We propose, implement, test and validate complementary model-based and model-free approaches for PD classification and prediction. To explore PD risk using Big Data methodology, we jointly processed complex PPMI imaging, genetics, clinical and demographic data.

Methods And Findings: Collective representation of the multi-source data facilitates the aggregation and harmonization of complex data elements. This enables joint modeling of the complete data, leading to the development of Big Data analytics, predictive synthesis, and statistical validation. Using heterogeneous PPMI data, we developed a comprehensive protocol for end-to-end data characterization, manipulation, processing, cleaning, analysis and validation. Specifically, we (i) introduce methods for rebalancing imbalanced cohorts, (ii) utilize a wide spectrum of classification methods to generate consistent and powerful phenotypic predictions, and (iii) generate reproducible machine-learning based classification that enables the reporting of model parameters and diagnostic forecasting based on new data. We evaluated several complementary model-based predictive approaches, which failed to generate accurate and reliable diagnostic predictions. However, the results of several machine-learning based classification methods indicated significant power to predict Parkinson's disease in the PPMI subjects (consistent accuracy, sensitivity, and specificity exceeding 96%, confirmed using statistical n-fold cross-validation). Clinical (e.g., Unified Parkinson's Disease Rating Scale (UPDRS) scores), demographic (e.g., age), genetics (e.g., rs34637584, chr12), and derived neuroimaging biomarker (e.g., cerebellum shape index) data all contributed to the predictive analytics and diagnostic forecasting.
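
The kind of rebalanced, cross-validated classification summarized above can be sketched generically as follows. Synthetic features stand in for the PPMI imaging, genetic, and clinical data, and class_weight="balanced" stands in for the paper's cohort-rebalancing step; none of the numbers reproduce the reported results.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.svm import SVC
    from sklearn.model_selection import StratifiedKFold
    from sklearn.metrics import confusion_matrix

    # Synthetic, imbalanced stand-in for a case/control cohort.
    X, y = make_classification(n_samples=600, n_features=40,
                               weights=[0.75, 0.25], random_state=0)

    acc, sens, spec = [], [], []
    for train, test in StratifiedKFold(n_splits=5, shuffle=True, random_state=0).split(X, y):
        clf = SVC(kernel="rbf", class_weight="balanced").fit(X[train], y[train])
        tn, fp, fn, tp = confusion_matrix(y[test], clf.predict(X[test])).ravel()
        acc.append((tp + tn) / len(test))
        sens.append(tp / (tp + fn))
        spec.append(tn / (tn + fp))
    print(f"accuracy={np.mean(acc):.2f} "
          f"sensitivity={np.mean(sens):.2f} specificity={np.mean(spec):.2f}")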

Conclusions: Model-free Big Data machine learning-based classification methods (e.g., adaptive boosting, support vector machines) can outperform model-based techniques in terms of predictive precision and reliability (e.g., forecasting patient diagnosis). We observed that statistical rebalancing of cohort sizes yields better discrimination of group differences, specifically for predictive analytics based on heterogeneous and incomplete PPMI data. UPDRS scores play a critical role in predicting diagnosis, which is expected based on the clinical definition of Parkinson's disease. Even without longitudinal UPDRS data, however, the accuracy of model-free machine learning based classification is over 80%. The methods, software and protocols developed here are openly shared and can be employed to study other neurodegenerative disorders (e.g., Alzheimer's, Huntington's, amyotrophic lateral sclerosis), as well as for other predictive Big Data analytics applications.
Source
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0157077
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4975403
August 2017

Integrative genomics analyses unveil downstream biological effectors of disease-specific polymorphisms buried in intergenic regions.

NPJ Genom Med 2016;1. Epub 2016 Apr 27.

BIO5 institute, University of Arizona, Tucson, AZ, USA; Department of Medicine, University of Arizona, Tucson, AZ, USA; Department of Medicine, University of Illinois at Chicago, IL, USA; Section of Genetic Medicine, Department of Medicine, University of Chicago, IL, USA; Center for Biomedical Informatics, Department of Medicine, University of Chicago, IL, USA; Computation Institute, Argonne National Laboratory and University of Chicago, IL, USA; Institute for Genomics and Systems Biology, Argonne National Laboratory & University of Chicago, IL, USA; University of Arizona Cancer Center, University of Arizona, Tucson, AZ, USA; Section for Bioinformatics, Department of Bioengineering, University of Illinois at Chicago, IL, USA; Department of Biopharmaceutical Sciences, University of Illinois at Chicago, IL, USA.

Functionally altered biological mechanisms arising from disease-associated polymorphisms remain difficult to characterize when those variants are intergenic, or fall between genes. We sought to identify shared downstream mechanisms by which inter- and intragenic single nucleotide polymorphisms (SNPs) contribute to a specific physiopathology. Using computational modeling of 2 million pairs of disease-associated SNPs drawn from genome-wide association studies (GWAS), integrated with expression Quantitative Trait Loci (eQTL) and Gene Ontology functional annotations, we predicted 3,870 inter-inter and inter-intra SNP pairs with convergent biological mechanisms (FDR<0.05). These prioritized SNP pairs with overlapping mRNA targets or similar functional annotations were more likely to be associated with the same disease than with unrelated pathologies (OR>12). We additionally confirmed synergistic and antagonistic genetic interactions for a subset of prioritized SNP pairs in independent studies of Alzheimer's disease (entropy p=0.046), bladder cancer (entropy p=0.039), and rheumatoid arthritis (PheWAS case-control p<10). Using ENCODE datasets, we further statistically validated that the biological mechanisms shared within prioritized SNP pairs are frequently governed by matching transcription factor binding sites and long-range chromatin interactions. These results provide a "roadmap" of disease mechanisms emerging from GWAS and further identify candidate therapeutic targets among downstream effectors of intergenic SNPs.
Source
http://dx.doi.org/10.1038/npjgenmed.2016.6
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4966659
April 2016

Optimization of tomographic reconstruction workflows on geographically distributed resources.

J Synchrotron Radiat 2016 07 15;23(Pt 4):997-1005. Epub 2016 Jun 15.

Mathematics and Computer Science Division, Argonne National Laboratory, USA.

New technological advancements in synchrotron light sources enable data acquisitions at unprecedented levels. This emergent trend affects not only the size of the generated data but also the need for larger computational resources. Although beamline scientists and users have access to local computational resources, these are typically limited and can result in extended execution times. Applications that are based on iterative processing, as in tomographic reconstruction methods, require high-performance compute clusters for timely analysis of data. This work focuses on time-sensitive analysis and processing of Advanced Photon Source data on geographically distributed resources. Two main challenges are considered: (i) modeling the performance of tomographic reconstruction workflows and (ii) transparent execution of these workflows on distributed resources. For the former, three main stages are considered: (i) data transfer between storage and computational resources, (ii) wait/queue time of reconstruction jobs at compute resources, and (iii) computation of reconstruction tasks. These performance models allow evaluation and estimation of the execution time of any given iterative tomographic reconstruction workflow that runs on geographically distributed resources. For the latter challenge, a workflow management system is built, which can automate the execution of workflows and minimize user interaction with the underlying infrastructure. The system utilizes Globus to perform secure and efficient data transfer operations. The proposed models and the workflow management system are evaluated using three high-performance computing and two storage resources, all of which are geographically distributed. Workflows were created with different computational requirements using two compute-intensive tomographic reconstruction algorithms. Experimental evaluation shows that the proposed models and system can be used for selecting the optimum resources, which in turn can provide up to 3.13× speedup (on the evaluated resources). Moreover, the error rates of the models range between 2.1 and 23.3% (considering workflow execution times), and the accuracy of the model estimations increases with higher computational demands in reconstruction tasks.
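
The three-stage execution-time model lends itself to a toy illustration: estimate transfer, queue, and compute time for each candidate resource and choose the smallest end-to-end estimate. The bandwidths, wait times, and compute rates below are invented for illustration, not measurements from the paper.

    # Toy resource-selection sketch: end-to-end time = transfer + queue + compute.
    dataset_gb = 120.0
    resources = {
        # name: (network bandwidth GB/s, expected queue wait s, compute rate GB/s)
        "cluster_A": (0.5, 600.0, 2.0),
        "cluster_B": (1.2, 3600.0, 4.0),
        "cluster_C": (0.2, 60.0, 1.0),
    }

    def end_to_end_seconds(bw, wait, rate, size_gb=dataset_gb):
        return size_gb / bw + wait + size_gb / rate

    estimates = {name: end_to_end_seconds(*params) for name, params in resources.items()}
    for name, t in sorted(estimates.items(), key=lambda kv: kv[1]):
        print(f"{name}: {t / 60:.1f} min")
    print("selected:", min(estimates, key=estimates.get))
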
Source
http://dx.doi.org/10.1107/S1600577516007980
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5315096
July 2016

The impact of catchment source group classification on the accuracy of sediment fingerprinting outputs.

J Environ Manage 2017 Jun 6;194:16-26. Epub 2016 May 6.

Sustainable Soils and Grassland Systems Department, Rothamsted Research, North Wyke, Okehampton EX20 2 SB, UK.

The objective classification of sediment source groups is at present an under-investigated aspect of source tracing studies, which has the potential to statistically improve discrimination between sediment sources and reduce uncertainty. This paper investigates this potential using three different source group classification schemes. The first classification scheme was simple surface and subsurface groupings (Scheme 1). The tracer signatures were then used in a two-step cluster analysis to identify the sediment source groupings naturally defined by the tracer signatures (Scheme 2). The cluster source groups were then modified by splitting each one into a surface and subsurface component to suit catchment management goals (Scheme 3). The schemes were tested using artificial mixtures of sediment source samples. Controlled corruptions were made to some of the mixtures to mimic the potential causes of tracer non-conservatism present when using tracers in natural fluvial environments. It was determined how accurately the known proportions of sediment sources in the mixtures were identified after unmixing modelling using the three classification schemes. The cluster analysis derived source groups (Scheme 2) significantly increased tracer variability ratios (inter-/intra-source group variability) (up to 2122%, median 194%) compared to the surface and subsurface groupings (Scheme 1). As a result, the composition of the artificial mixtures was identified an average of 9.8% more accurately on the 0-100% contribution scale. It was found that the cluster groups could be reclassified into a surface and subsurface component (Scheme 3) with no significant increase in composite uncertainty (a 0.1% increase over Scheme 2). The far smaller effects of simulated tracer non-conservatism for the cluster-analysis-based schemes (Schemes 2 and 3) were primarily attributed to the increased inter-group variability producing a far larger sediment source signal than the non-conservatism noise (Scheme 1). Modified cluster-analysis-based classification methods have the potential to reduce composite uncertainty significantly in future source tracing studies.
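
For readers new to fingerprinting, the un-mixing step being evaluated can be sketched as a small constrained least-squares problem: find non-negative source proportions that sum to one and best reproduce a mixture's tracer values. The tracer numbers below are invented, and real applications also weight tracers and repeat the fit thousands of times with resampled inputs.

    import numpy as np
    from scipy.optimize import minimize

    sources = np.array([[12.0, 3.5, 240.0],     # surface source tracer means (made up)
                        [ 4.0, 9.0,  80.0]])    # subsurface source tracer means (made up)
    mixture = np.array([ 8.8, 5.7, 176.0])      # tracer values of a sediment sample

    def misfit(p):
        # Sum of squared relative residuals between modelled and measured tracers.
        return np.sum(((p @ sources - mixture) / mixture) ** 2)

    result = minimize(misfit, x0=np.full(2, 0.5), method="SLSQP",
                      bounds=[(0.0, 1.0)] * 2,
                      constraints={"type": "eq", "fun": lambda p: p.sum() - 1.0})
    print("estimated source proportions:", result.x.round(2))   # ~[0.6, 0.4]
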
Source
http://dx.doi.org/10.1016/j.jenvman.2016.04.048
June 2017

Data publication with the structural biology data grid supports live analysis.

Nat Commun 2016 Mar 7;7:10882. Epub 2016 Mar 7.

Department of Biological Chemistry and Molecular Pharmacology, Boston, Massachusetts 02115, USA.

Access to experimental X-ray diffraction image data is fundamental for validation and reproduction of macromolecular models and indispensable for development of structural biology processing methods. Here, we established a diffraction data publication and dissemination system, Structural Biology Data Grid (SBDG; data.sbgrid.org), to preserve primary experimental data sets that support scientific publications. Data sets are accessible to researchers through a community driven data grid, which facilitates global data access. Our analysis of a pilot collection of crystallographic data sets demonstrates that the information archived by SBDG is sufficient to reprocess data to statistics that meet or exceed the quality of the original published structures. SBDG has extended its services to the entire community and is used to develop support for other types of biomedical data sets. It is anticipated that access to the experimental data sets will enhance the paradigm shift in the community towards a much more dynamic body of continuously improving data analysis.
Source
http://dx.doi.org/10.1038/ncomms10882
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4786681
March 2016

A case study for cloud based high throughput analysis of NGS data using the globus genomics system.

Comput Struct Biotechnol J 2015 7;13:64-74. Epub 2014 Nov 7.

Innovation Center for Biomedical Informatics (ICBI), Georgetown University, Washington, DC 20007, USA.

Next-generation sequencing (NGS) technologies produce massive amounts of data requiring a powerful computational infrastructure, high-quality bioinformatics software, and skilled personnel to operate the tools. We present a case study of a practical solution to this data management and analysis challenge that simplifies terabyte-scale data handling and provides advanced tools for NGS data analysis. These capabilities are implemented using the "Globus Genomics" system, which is an enhanced Galaxy workflow system made available as a service that offers users the capability to process and transfer data easily, reliably and quickly to address end-to-end NGS analysis requirements. The Globus Genomics system is built on Amazon's cloud computing infrastructure. The system takes advantage of elastic scaling of compute resources to run multiple workflows in parallel, and it also helps meet the scale-out analysis needs of modern translational genomics research.
Source
http://dx.doi.org/10.1016/j.csbj.2014.11.001
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4720014
February 2016

A Hybrid Human-Computer Approach to the Extraction of Scientific Facts from the Literature.

Procedia Comput Sci 2016 1;80:386-397. Epub 2016 Jun 1.

Department of Computer Science, The University of Chicago, Chicago, IL, USA.

A wealth of valuable data is locked within the millions of research articles published each year. Reading and extracting pertinent information from those articles has become an unmanageable task for scientists. This problem hinders scientific progress by making it hard to build on results buried in literature. Moreover, these data are loosely structured, encoded in manuscripts of various formats, embedded in different content types, and are, in general, not machine accessible. We present a hybrid human-computer solution for semi-automatically extracting scientific facts from literature. This solution combines an automated discovery, download, and extraction phase with a semi-expert crowd assembled from students to extract specific scientific facts. To evaluate our approach we apply it to a challenging molecular engineering scenario, extraction of a polymer property: the Flory-Huggins interaction parameter. We demonstrate useful contributions to a comprehensive database of polymer properties.
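
A hypothetical first pass of such a hybrid pipeline might look like the following: a regular expression proposes candidate Flory-Huggins χ values from raw text, and every hit is queued for human verification rather than trusted. The text, pattern, and record fields are illustrative assumptions, not the paper's extraction code.

    import re

    text = ("The Flory-Huggins interaction parameter for the PS/PMMA blend was "
            "determined to be chi = 0.037 at 160 C, while chi = 0.012 was reported "
            "for a compatibilised system.")

    # Match "chi = <number>" or the Greek letter form.
    pattern = re.compile(r"(?:chi|\u03c7)\s*=\s*(-?\d+(?:\.\d+)?)", re.IGNORECASE)
    candidates = [{"value": float(m.group(1)),
                   "context": text[max(0, m.start() - 40): m.end() + 40],
                   "needs_review": True}          # left for the semi-expert crowd
                  for m in pattern.finditer(text)]
    for record in candidates:
        print(record)
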
Source
http://dx.doi.org/10.1016/j.procs.2016.05.338
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5482373
June 2016

Globus Nexus: A Platform-as-a-Service Provider of Research Identity, Profile, and Group Management.

Future Gener Comput Syst 2016 Mar;56:571-583

Computation Institute, University of Chicago and Argonne National Laboratory, Chicago, IL, USA ; Department of Computer Science, University of Chicago, Chicago, IL, USA ; Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, IL, USA.

Globus Nexus is a professionally hosted Platform-as-a-Service that provides identity, profile and group management functionality for the research community. Many collaborative e-Science applications need to manage large numbers of user identities, profiles, and groups. However, developing and maintaining such capabilities is often challenging given the complexity of modern security protocols and requirements for scalable, robust, and highly available implementations. By outsourcing this functionality to Globus Nexus, developers can leverage best-practice implementations without incurring development and operations overhead. Users benefit from enhanced capabilities such as identity federation, flexible profile management, and user-oriented group management. In this paper we present Globus Nexus, describe its capabilities and architecture, summarize how several e-Science applications leverage these capabilities, and present results that characterize its scalability, reliability, and availability.
Source
http://dx.doi.org/10.1016/j.future.2015.09.006
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4681010
March 2016

Choosing experiments to accelerate collective discovery.

Proc Natl Acad Sci U S A 2015 Nov 9;112(47):14569-74. Epub 2015 Nov 9.

Computation Institute, University of Chicago and Argonne National Laboratory, Chicago, IL 60637; Department of Sociology, University of Chicago, Chicago, IL 60637

A scientist's choice of research problem affects his or her personal career trajectory. Scientists' combined choices affect the direction and efficiency of scientific discovery as a whole. In this paper, we infer preferences that shape problem selection from patterns of published findings and then quantify their efficiency. We represent research problems as links between scientific entities in a knowledge network. We then build a generative model of discovery informed by qualitative research on scientific problem selection. We map salient features from this literature to key network properties: an entity's importance corresponds to its degree centrality, and a problem's difficulty corresponds to the network distance it spans. Drawing on millions of papers and patents published over 30 years, we use this model to infer the typical research strategy used to explore chemical relationships in biomedicine. This strategy generates conservative research choices focused on building up knowledge around important molecules. These choices become more conservative over time. The observed strategy is efficient for initial exploration of the network and supports scientific careers that require steady output, but is inefficient for science as a whole. Through supercomputer experiments on a sample of the network, we study thousands of alternatives and identify strategies much more efficient at exploring mature knowledge networks. We find that increased risk-taking and the publication of experimental failures would substantially improve the speed of discovery. We consider institutional shifts in grant making, evaluation, and publication that would help realize these efficiencies.
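
The mapping described above (importance as degree centrality, difficulty as network distance) can be rendered as a toy scoring function over a small knowledge graph. The graph, the entities, and the weighting are invented for illustration and are not the paper's fitted model.

    import itertools
    import networkx as nx

    g = nx.Graph([("aspirin", "COX1"), ("COX1", "inflammation"),
                  ("ibuprofen", "COX1"), ("ibuprofen", "fever"),
                  ("fever", "inflammation")])
    centrality = nx.degree_centrality(g)

    def candidate_score(u, v, risk_aversion=2.0):
        """Higher score = more 'conservative' choice: important, nearby entities."""
        distance = nx.shortest_path_length(g, u, v)
        return centrality[u] + centrality[v] - risk_aversion * (distance - 1)

    candidates = [(u, v) for u, v in itertools.combinations(g.nodes, 2) if not g.has_edge(u, v)]
    for u, v in sorted(candidates, key=lambda e: candidate_score(*e), reverse=True)[:3]:
        print(u, "-", v, round(candidate_score(u, v), 2))

Lowering the risk_aversion weight in a sketch like this corresponds to the riskier, longer-jump strategies the paper finds more efficient for mature knowledge networks.
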
Source
http://dx.doi.org/10.1073/pnas.1509757112
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4664375
November 2015

Big biomedical data as the key resource for discovery science.

J Am Med Inform Assoc 2015 Nov 21;22(6):1126-31. Epub 2015 Jul 21.

Institute for Systems Biology, Seattle, WA, USA.

Modern biomedical data collection is generating exponentially more data in a multitude of formats. This flood of complex data poses significant opportunities to discover and understand the critical interplay among such diverse domains as genomics, proteomics, metabolomics, and phenomics, including imaging, biometrics, and clinical data. The Big Data for Discovery Science Center is taking an "-ome to home" approach to discover linkages between these disparate data sources by mining existing databases of proteomic and genomic data, brain images, and clinical assessments. In support of this work, the authors developed new technological capabilities that make it easy for researchers to manage, aggregate, manipulate, integrate, and model large amounts of distributed data. Guided by biological domain expertise, the Center's computational resources and software will reveal relationships and patterns, aiding researchers in identifying biomarkers for the most confounding conditions and diseases, such as Parkinson's and Alzheimer's.
Source
http://dx.doi.org/10.1093/jamia/ocv077
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5009918
November 2015

Zodiac: A Comprehensive Depiction of Genetic Interactions in Cancer by Integrating TCGA Data.

J Natl Cancer Inst 2015 Aug 8;107(8). Epub 2015 May 8.

Program of Computational Genomics & Medicine (YZ, SY, SS, YJ), Center for Molecular Medicine (DLH Jr, KG), and Center for Biomedical Research Informatics (JCS, NP), NorthShore University HealthSystem, Evanston, IL; Department of Mathematics, The University of Texas at Austin, Austin, TX (YX, PM); Computation Institute (LLP, IF) and Institute for Genomics and Systems Biology (KPW), The University of Chicago and Argonne National Laboratory, Chicago IL; Department of Bioinformatics & Biostatistics, University of Louisville, Louisville, KY (RM); School of Public Health, Fudan University, Shanghai, P. R. China (WG); Department of Human Genetics and Department of Ecology & Evolution (KPW) and Department of Public Health Sciences (YJ), The University of Chicago, Chicago, IL.

Background: Genetic interactions play a critical role in cancer development. Existing knowledge about cancer genetic interactions is incomplete, especially lacking evidences derived from large-scale cancer genomics data. The Cancer Genome Atlas (TCGA) produces multimodal measurements across genomics and features of thousands of tumors, which provide an unprecedented opportunity to investigate the interplays of genes in cancer.

Methods: We introduce Zodiac, a computational tool and resource that integrates existing knowledge about cancer genetic interactions with new information contained in TCGA data. Zodiac treats existing knowledge as a prior graph, integrates it with a likelihood model derived from a Bayesian graphical model fitted to TCGA data, and produces a posterior graph as updated, data-enhanced knowledge. In short, Zodiac realizes "Prior interaction map + TCGA data → Posterior interaction map."
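
A toy numerical illustration of the prior-to-posterior update for a single gene-gene edge, using Bayes' rule in odds form. The prior probability and likelihood ratio are invented, and Zodiac's actual model is a Bayesian graphical model over many genes and data modalities, not this one-edge calculation.

    def posterior_edge_probability(prior, likelihood_ratio):
        """Bayes' rule in odds form: posterior odds = prior odds * likelihood ratio."""
        prior_odds = prior / (1.0 - prior)
        post_odds = prior_odds * likelihood_ratio
        return post_odds / (1.0 + post_odds)

    prior = 0.30               # edge present in the prior (pathway-derived) graph
    likelihood_ratio = 8.0     # data favour an interaction 8:1 for this gene pair
    print(f"posterior edge probability = "
          f"{posterior_edge_probability(prior, likelihood_ratio):.2f}")   # ~0.77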

Results: Zodiac provides molecular interactions for about 200 million pairs of genes. All the results are generated from a big-data analysis and organized into a comprehensive database allowing customized search. In addition, Zodiac provides data processing and analysis tools that allow users to customize the prior networks and update the genetic pathways of their interest. Zodiac is publicly available at www.compgenome.org/ZODIAC.

Conclusions: Zodiac recapitulates and extends existing knowledge of molecular interactions in cancer. It can be used to explore novel gene-gene interactions, transcriptional regulation, and other types of molecular interplays in cancer.
View Article and Find Full Text PDF

Download full-text PDF

Source
http://dx.doi.org/10.1093/jnci/djv129DOI Listing
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4554190PMC
August 2015