Redundancy control in pathway databases (ReCiPa): an application for improving gene-set enrichment analysis in Omics studies and "Big data" biology.

Authors:
Juan C Vivar
Biomedical Biotechnology Research Institute
Priscilla Pemu
Morehouse School of Medicine
United States
Ruth McPherson
University of Ottawa Heart Institute
Canada
Sujoy Ghosh
Duke-NUS Medical School
Singapore

OMICS 2013 Aug 11;17(8):414-22. Epub 2013 Jun 11.

Biomedical Biotechnology Research Institute, North Carolina Central University, Durham, North Carolina, USA.

Abstract:
Unparalleled technological advances have fueled an explosive growth in the scope and scale of biological data and have propelled life sciences into the realm of "Big Data" that cannot be managed or analyzed by conventional approaches. Big Data in the life sciences are driven primarily via a diverse collection of 'omics'-based technologies, including genomics, proteomics, metabolomics, transcriptomics, metagenomics, and lipidomics. Gene-set enrichment analysis is a powerful approach for interrogating large 'omics' datasets, leading to the identification of biological mechanisms associated with observed outcomes. While several factors influence the results from such analysis, the impact from the contents of pathway databases is often under-appreciated. Pathway databases often contain variously named pathways that overlap with one another to varying degrees. Ignoring such redundancies during pathway analysis can lead to the designation of several pathways as being significant due to high content-similarity, rather than truly independent biological mechanisms. Statistically, such dependencies also result in correlated p values and overdispersion, leading to biased results. We investigated the level of redundancies in multiple pathway databases and observed large discrepancies in the nature and extent of pathway overlap. This prompted us to develop the application, ReCiPa (Redundancy Control in Pathway Databases), to control redundancies in pathway databases based on user-defined thresholds. Analysis of genomic and genetic datasets, using ReCiPa-generated overlap-controlled versions of KEGG and Reactome pathways, led to a reduction in redundancy among the top-scoring gene-sets and allowed for the inclusion of additional gene-sets representing possibly novel biological mechanisms. Using obesity as an example, bioinformatic analysis further demonstrated that gene-sets identified from overlap-controlled pathway databases show stronger evidence of prior association to obesity compared to pathways identified from the original databases.
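The idea of controlling pathway overlap against a user-defined threshold can be illustrated with a minimal Python sketch. This is not the ReCiPa implementation: the overlap measure (shared genes divided by the size of the smaller gene-set), the greedy merge rule, and the names `overlap` and `merge_redundant` are all assumptions made for illustration, and the pathway and gene names are hypothetical.

```python
from itertools import combinations


def overlap(a, b):
    """Shared genes as a fraction of the smaller set (assumed measure)."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def merge_redundant(pathways, threshold):
    """Greedily merge the most-overlapping pair until no pair exceeds threshold.

    pathways: dict mapping pathway name -> iterable of gene identifiers.
    Returns a new dict; merged pathways are named "A + B".
    """
    merged = {name: set(genes) for name, genes in pathways.items()}
    while True:
        best = None
        # Scan every pair and remember the one with the highest overlap.
        for a, b in combinations(sorted(merged), 2):
            score = overlap(merged[a], merged[b])
            if score > threshold and (best is None or score > best[0]):
                best = (score, a, b)
        if best is None:
            return merged
        _, a, b = best
        # Replace the two redundant pathways with their union.
        merged[a + " + " + b] = merged.pop(a) | merged.pop(b)


# Hypothetical toy database: A and B share two of three genes.
toy = {
    "A": {"g1", "g2", "g3"},
    "B": {"g2", "g3", "g4"},
    "C": {"g7", "g8"},
}
print(sorted(merge_redundant(toy, threshold=0.5)))  # ['A + B', 'C']
```

With a threshold of 0.5, pathways A and B (overlap 2/3) collapse into one gene-set while C is left untouched, so an enrichment analysis would score the shared biology once rather than twice.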

Source
DOI: http://dx.doi.org/10.1089/omi.2012.0083
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3727566
