Publications by authors named "Zhaohui S Qin"

86 Publications

Disease category-specific annotation of variants using an ensemble learning framework.

Brief Bioinform 2021 Oct 13. Epub 2021 Oct 13.

NCMIS, CEMS, RCSDS, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China.

Understanding the impact of non-coding sequence variants on complex diseases is an essential problem. We present a novel ensemble learning framework, CASAVA, to predict genomic loci in terms of disease category-specific risk. Using disease-associated variants identified by GWAS as training data, and diverse sequencing-based genomics and epigenomics profiles as features, CASAVA provides risk prediction of 24 major categories of diseases throughout the human genome. Our studies showed that CASAVA scores at a genomic locus provide a reasonable prediction of the disease-specific and disease category-specific risk for non-coding variants located within the locus. Taking MHC2TA and immune system diseases as an example, we demonstrate the potential of CASAVA in revealing variant-disease associations. A website (http://zhanglabtools.org/CASAVA) has been built to facilitate easy access to CASAVA scores.

Source
http://dx.doi.org/10.1093/bib/bbab438
October 2021

Systematic Evaluation of DNA Sequence Variations on Transcription Factor Binding Affinity.

Front Genet 2021 9;12:667866. Epub 2021 Sep 9.

Department of Biostatistics and Bioinformatics, Emory University, Atlanta, GA, United States.

The majority of the single nucleotide variants (SNVs) identified by genome-wide association studies (GWAS) fall outside of the protein-coding regions. Elucidating the functional implications of these variants has been a major challenge. A possible mechanism for functional non-coding variants is that they disrupt canonical transcription factor (TF) binding sites and thereby affect the binding of the TF. However, their impact varies since many positions within a TF binding motif are not well conserved. Therefore, simply annotating all variants located in putative TF binding sites may overestimate the functional impact of these SNVs. We conducted a comprehensive survey to study the effect of SNVs on TF binding affinity. A sequence-based machine learning method was used to estimate the change in binding affinity for each SNV located inside a putative motif site. From the results obtained on 18 TF binding motifs, we found that there is substantial variation in an SNV's impact on TF binding affinity. We found that only about 20% of SNVs located inside putative TF binding sites are likely to have a significant impact on TF-DNA binding.
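
The point that an SNV's impact depends on how well conserved its motif position is can be illustrated with a small sketch. This is not the sequence-based machine learning method used in the study; it swaps in a plain position weight matrix (PWM) log-odds score as a stand-in, and the toy matrix, the uniform background of 0.25, and the function names `log_odds` and `snv_delta` are all assumptions made for illustration.

```python
import math

def log_odds(site, pwm, background=0.25):
    """Sum of per-position log2 odds of the site under a PWM versus a uniform background."""
    return sum(math.log2(pwm[i][base] / background) for i, base in enumerate(site))

def snv_delta(ref_site, pos, alt_base, pwm):
    """Change in PWM score (alt minus ref) when the base at `pos` is replaced by `alt_base`."""
    alt_site = ref_site[:pos] + alt_base + ref_site[pos + 1:]
    return log_odds(alt_site, pwm) - log_odds(ref_site, pwm)

# Hypothetical 3-bp motif: position 0 is strongly conserved, position 1 is not conserved at all.
toy_pwm = [
    {"A": 0.97, "C": 0.01, "G": 0.01, "T": 0.01},
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    {"A": 0.10, "C": 0.10, "G": 0.70, "T": 0.10},
]

conserved_hit = snv_delta("ACG", 0, "C", toy_pwm)    # large negative change
unconserved_hit = snv_delta("ACG", 1, "T", toy_pwm)  # no change in score
```

Under this simplified model, a substitution at the conserved position lowers the score sharply while one at the unconserved position leaves it unchanged, which is why annotating every variant inside a putative site as functional can overstate its effect.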

Source
http://dx.doi.org/10.3389/fgene.2021.667866
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC8458901
September 2021

Plasma Metabolic Phenotypes of HPV-Associated versus Smoking-Associated Head and Neck Cancer and Patient Survival.

Cancer Epidemiol Biomarkers Prev 2021 Oct 10;30(10):1858-1866. Epub 2021 Aug 10.

Nell Hodgson Woodruff School of Nursing, Emory University, Atlanta, Georgia.

Background: Metabolic differences between human papillomavirus (HPV)-associated head and neck squamous cell carcinoma (HNSCC) and smoking-associated HNSCC may partially explain differences in prognosis. The former relies on mitochondrial oxidative phosphorylation (OXPHOS) while the latter relies on glycolysis. These differences have not been studied in blood.

Methods: We extracted metabolites using untargeted liquid chromatography high-resolution mass spectrometry from pretreatment plasma in a cohort of 55 HPV-associated and 82 smoking-associated HNSCC subjects. Metabolic pathway enrichment analysis of differentially expressed metabolites produced pathway-based signatures. Significant pathways (P < 0.05) were reduced via principal component analysis and assessed for association with overall survival via Cox models. We classified each subject as glycolytic or OXPHOS phenotype and assessed the association of this phenotype with survival.

Results: Of 2,410 analyzed metabolites, 191 were differentially expressed. Relative to smoking-associated HNSCC, bile acid biosynthesis (P < 0.0001) and octadecatrienoic acid beta-oxidation (P = 0.01) were upregulated in HPV-associated HNSCC, while galactose metabolism (P = 0.001) and vitamin B6 metabolism (P = 0.01) were downregulated; the first two suggest an OXPHOS phenotype while the latter two suggest a glycolytic phenotype. First principal components of bile acid biosynthesis [HR = 0.52 per SD; 95% confidence interval (CI), 0.38-0.72; P < 0.001] and octadecatrienoic acid beta-oxidation (HR = 0.54 per SD; 95% CI, 0.38-0.78; P < 0.001) were significantly associated with overall survival independent of HPV and smoking. The glycolytic versus OXPHOS phenotype was also independently associated with survival (HR = 3.17; 95% CI, 1.07-9.35; P = 0.04).

Conclusions: Plasma metabolites related to glycolysis and mitochondrial OXPHOS may be biomarkers of HNSCC patient prognosis independent of HPV or smoking. Future investigations should determine whether they predict treatment efficacy.

Impact: Blood metabolomics may be a useful marker to aid HNSCC patient prognosis.

Source
http://dx.doi.org/10.1158/1055-9965.EPI-21-0576
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC8492502
October 2021

A machine learning approach to brain epigenetic analysis reveals kinases associated with Alzheimer's disease.

Nat Commun 2021 07 22;12(1):4472. Epub 2021 Jul 22.

Department of Biostatistics and Bioinformatics, Rollins School of Public Health, Emory University, Atlanta, GA, USA.

Alzheimer's disease (AD) is influenced by both genetic and environmental factors; thus, brain epigenomic alterations may provide insights into AD pathogenesis. Multiple array-based Epigenome-Wide Association Studies (EWASs) have identified robust brain methylation changes in AD; however, array-based assays only test about 2% of all CpG sites in the genome. Here, we develop EWASplus, a computational method that uses a supervised machine learning strategy to extend EWAS coverage to the entire genome. Application to six AD-related traits predicts hundreds of new significant brain CpGs associated with AD, some of which are further validated experimentally. EWASplus also performs well on data collected from independent cohorts and different brain regions. Genes found near top EWASplus loci are enriched for kinases and for genes with evidence for physical interactions with known AD genes. In this work, we show that EWASplus implicates additional epigenetic loci for AD that are not found using array-based AD EWASs.

Source
http://dx.doi.org/10.1038/s41467-021-24710-8
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC8298578
July 2021

An empirical bayesian approach for testing gene expression fold change and its application in detecting global dosage effects.

NAR Genom Bioinform 2020 Sep 18;2(3):lqaa072. Epub 2020 Sep 18.

Department of Statistics, University of Missouri at Columbia, Columbia, MO 65211, USA.

We are motivated by biological studies intended to understand global gene expression fold change. Biologists have generally adopted a fixed cutoff to determine the significance of fold changes in gene expression studies (e.g. by using an observed fold change equal to two as a fixed threshold). Scientists can also use a t-test or a modified differential expression test to assess the significance of fold changes. However, these methods either fail to take advantage of the high dimensionality of gene expression data or fail to test fold change directly. Our research develops a new empirical Bayesian approach to substantially improve the power and accuracy of fold-change detection. Specifically, we more accurately estimate gene-wise error variation in the log of fold change. We then adopt a t-test with adjusted degrees of freedom for significance assessment. We apply our method to a dosage study in Arabidopsis and a Down syndrome study in humans to illustrate the utility of our approach. We also present a simulation study based on real datasets to demonstrate the accuracy of our method relative to error variance estimation and power in fold-change detection. Our developed R package with a detailed user manual is publicly available on GitHub at https://github.com/cuiyingbeicheng/Foldseq.

Source
http://dx.doi.org/10.1093/nargab/lqaa072
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7671412
September 2020

Author Correction: Truncation of mutant huntingtin in knock-in mice demonstrates exon1 huntingtin is a key pathogenic form.

Nat Commun 2020 Nov 19;11(1):5989. Epub 2020 Nov 19.

Guangdong-Hongkong-Macau Institute of CNS Regeneration, Ministry of Education CNS Regeneration Collaborative Joint Laboratory, Jinan University, 510632, Guangzhou, China.

A Correction to this paper has been published: https://doi.org/10.1038/s41467-020-19873-9.

Source
http://dx.doi.org/10.1038/s41467-020-19873-9
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7677395
November 2020

Signatures of somatic mutations and gene expression from p16INK4A positive head and neck squamous cell carcinomas (HNSCC).

PLoS One 2020 28;15(9):e0238497. Epub 2020 Sep 28.

Winship Cancer Institute, Emory University, Atlanta, GA, United States of America.

Human papillomavirus (HPV) causes a subset of head and neck squamous cell carcinomas (HNSCC) of the oropharynx. We combined targeted DNA- and genome-wide RNA-sequencing to identify genetic variants and gene expression signatures, respectively, from patients with HNSCC including oropharyngeal squamous cell carcinomas (OPSCC). DNA and RNA were purified from 35 formalin-fixed and paraffin-embedded (FFPE) HNSCC tumor samples. Immunohistochemical evaluation of tumors was performed to determine the expression levels of p16INK4A, and tumor samples were classified as either p16+ or p16-. Using the ClearSeq Comprehensive Cancer panel, we examined the distribution of somatic mutations. Somatic single-nucleotide variants (SNV) were called using the GATK-Mutect2 ("tumor-only" mode) approach. Using RNA-seq, we identified a catalog of 1,044 and 8 genes as significantly differentially expressed between p16+ and p16-, respectively at FDR 0.05 (5%) and 0.1 (10%). The clinicopathological characteristics of the patients including anatomical site, smoking and survival were analyzed when comparing p16+ and p16- tumors. The majority of tumors (65%) were p16+. Population sequence variant databases, including gnomAD, ExAC, COSMIC and dbSNP, were used to identify the mutational landscape of somatic sequence variants within sequenced genes. Hierarchical clustering of The Cancer Genome Atlas (TCGA) samples based on HPV-status was observed using differentially expressed genes. Using RNA-seq in parallel with targeted DNA-seq, we identified mutational and gene expression signatures characteristic of p16+ and p16- HNSCC. Our gene signatures are consistent with previously published data including TCGA and support the need to further explore the biologic relevance of these alterations in HNSCC.

Source
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0238497
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7521680
October 2020

Proceedings of the 2019 MidSouth Computational Biology and Bioinformatics Society (MCBIOS) Conference.

BMC Bioinformatics 2020 07 6;21(Suppl 4):254. Epub 2020 Jul 6.

Foundational Medical Studies, Oakland University William Beaumont School of Medicine, Rochester, MI, 48309-4482, USA.

Source
http://dx.doi.org/10.1186/s12859-020-03580-9
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7336605
July 2020

DeconPeaker, a Deconvolution Model to Identify Cell Types Based on Chromatin Accessibility in ATAC-Seq Data of Mixture Samples.

Front Genet 2020 8;11:392. Epub 2020 May 8.

State Key Laboratory of Bioelectronics, School of Biological Science and Medical Engineering, Southeast University, Nanjing, China.

While our understanding of cellular and molecular processes has grown exponentially, issues related to the cell microenvironment and cellular heterogeneity have sparked a new debate concerning the cell identity. Cell composition (chromatin and nuclear architecture) poses a strong risk for dynamic changes in the diseased condition. Since chromatin accessibility patterns play a major role in human diseases, it is therefore anticipated that a deconvolution tool based on open chromatin data will provide better performance in identifying cell composition. Herein, we have designed the deconvolution tool "DeconPeaker," which can precisely define the uniqueness among subpopulations of cells using open chromatin datasets. Using this tool, we simultaneously evaluated chromatin accessibility and gene expression datasets to estimate cell types and their respective proportions in a mixture of samples. In comparison to other known deconvolution methods, we observed the lowest average root-mean-square error (RMSE = 0.042) and the highest average correlation coefficient (0.919) between the prediction and "true" proportion. As a proof-of-concept, we also tested chromatin accessibility data from acute myeloid leukemia (AML) and successfully obtained unique cell types associated with AML progression. Furthermore, we showed that chromatin accessibility represents more essential characteristics in the identification of cell types than gene expression. Taken together, DeconPeaker as a powerful tool has the potential to combine different datasets (primarily, chromatin accessibility and gene expression) and define different cell types in mixtures. The Python package of DeconPeaker is now available at https://github.com/lihuamei/DeconPeaker.
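
The two accuracy measures quoted in this abstract, root-mean-square error and the correlation between predicted and "true" cell-type proportions, can be computed as in the minimal sketch below. It is not taken from the DeconPeaker package; the function names and the example proportions are made up for illustration, and a Pearson correlation is assumed because the abstract does not say which coefficient was used.

```python
import math

def rmse(predicted, true):
    """Root-mean-square error between predicted and true proportions."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(predicted, true)) / len(predicted))

def pearson(predicted, true):
    """Pearson correlation coefficient between predicted and true proportions."""
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_t = sum(true) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(predicted, true))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_t = sum((t - mean_t) ** 2 for t in true)
    return cov / math.sqrt(var_p * var_t)

# Hypothetical proportions of three cell types in one mixture sample.
predicted = [0.5, 0.3, 0.2]
true = [0.4, 0.4, 0.2]
error = rmse(predicted, true)
agreement = pearson(predicted, true)
```

A lower RMSE and a correlation closer to 1 both indicate that the estimated proportions track the known mixture more closely.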

Source
http://dx.doi.org/10.3389/fgene.2020.00392
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7269180
May 2020

Truncation of mutant huntingtin in knock-in mice demonstrates exon1 huntingtin is a key pathogenic form.

Nat Commun 2020 05 22;11(1):2582. Epub 2020 May 22.

Guangdong-Hongkong-Macau Institute of CNS Regeneration, Ministry of Education CNS Regeneration Collaborative Joint Laboratory, Jinan University, 510632, Guangzhou, China.

Polyglutamine expansion in proteins can cause selective neurodegeneration, although the mechanisms are not fully understood. In Huntington's disease (HD), proteolytic processing generates toxic N-terminal huntingtin (HTT) fragments that preferentially kill striatal neurons. Here, using CRISPR/Cas9 to truncate full-length mutant HTT in HD140Q knock-in (KI) mice, we show that exon 1 HTT is stably present in the brain, regardless of truncation sites in full-length HTT. This N-terminal HTT leads to similar HD-like phenotypes and age-dependent HTT accumulation in the striatum in different KI mice. We find that exon 1 HTT is constantly generated but its selective accumulation in the striatum is associated with the age-dependent expression of striatum-enriched HspBP1, a chaperone inhibitory protein. Our findings suggest that tissue-specific chaperone function contributes to the selective neuropathology in HD, and highlight the therapeutic potential in blocking generation of exon 1 HTT.

Source
http://dx.doi.org/10.1038/s41467-020-16318-1
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7244548
May 2020

Application of topic models to a compendium of ChIP-Seq datasets uncovers recurrent transcriptional regulatory modules.

Bioinformatics 2020 04;36(8):2352-2358

Department of Medicine, Indiana University School of Medicine, Indianapolis, IN 46202, USA.

Motivation: The availability of thousands of genome-wide coupling chromatin immunoprecipitation (ChIP)-Seq datasets across hundreds of transcription factors (TFs) and cell lines provides an unprecedented opportunity to jointly analyze large-scale TF-binding in vivo, making possible the discovery of the potential interaction and cooperation among different TFs. Interacting and cooperating TFs can potentially form a transcriptional regulatory module (TRM) (e.g. co-binding TFs), which helps decipher the combinatorial regulatory mechanisms.

Results: We develop a computational method, tfLDA, that applies state-of-the-art topic models to multiple ChIP-Seq datasets to decipher the combinatorial binding events of multiple TFs. tfLDA is able to learn high-order combinatorial binding patterns of TFs from multiple ChIP-Seq profiles, and to interpret and visualize the combinatorial patterns. We apply tfLDA to two cell lines with a rich collection of TFs and identify combinatorial binding patterns that show well-known TRMs and related TF co-binding events.

Availability And Implementation: The R package tfLDA is freely available at https://github.com/lichen-lab/tfLDA.

Supplementary Information: Supplementary data are available at Bioinformatics online.

Source
http://dx.doi.org/10.1093/bioinformatics/btz975
April 2020

An Integrated System Biology Approach Yields Drug Repositioning Candidates for the Treatment of Heart Failure.

Front Genet 2019 25;10:916. Epub 2019 Sep 25.

Department of Biostatistics and Bioinformatics, Emory University, Atlanta, GA, United States.

Identifying effective pharmacological treatments for heart failure (HF) patients remains critically important. Given that the development of drugs de novo is expensive and time consuming, drug repositioning has become an increasingly important branch. In the present study, we propose a two-step drug repositioning pipeline and investigate the novel therapeutic effects of existing drugs approved by the US Food and Drug Administration to discover potential therapeutic drugs for HF. In the first step, we compared the gene expression pattern of HF patients with drug-induced gene expression profiles to obtain preliminary candidates. In the second step, we performed a systems biology approach based on the known protein-protein interaction information and targets of drugs to narrow down preliminary candidates to obtain final candidates. Drug set enrichment analysis and literature search were applied to assess the performance of our repositioning approach. We also constructed a mode of action network for each candidate and performed pathway analysis for each candidate using genes contained in their mode of action network to uncover pathways that potentially reflect the mechanisms of candidates' therapeutic efficacy for HF. We discovered numerous preliminary candidates, some of which are used in clinical practice and supported by the literature. The final candidates contained nearly all of the preliminary candidates supported by previous studies. Drug set enrichment analysis and literature search support the validity of our repositioning approach. The mode of action network for each candidate not only displayed the underlying mechanisms of drug efficacy but also uncovered potential biomarkers and therapeutic targets for HF. Our two-step drug repositioning approach is efficient at finding candidates with potential therapeutic efficacy for HF that are supported by the literature, and might be of particular use in the discovery of novel effective pharmacological therapies for HF.

Source
http://dx.doi.org/10.3389/fgene.2019.00916
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6773955
September 2019

Regulatory annotation of genomic intervals based on tissue-specific expression QTLs.

Bioinformatics 2020 02;36(3):690-697

Department of Biostatistics and Bioinformatics, Rollins School of Public Health, Emory University, Atlanta, GA 30322, USA.

Motivation: Annotating a given genomic locus or a set of genomic loci is an important yet challenging task. This is especially true for the non-coding part of the genome, which is enormous yet poorly understood. Since gene set enrichment analyses have been demonstrated to be an effective approach to annotate a set of genes, the same idea can be extended to explore the enrichment of functional elements or features in a set of genomic intervals to reveal potential functional connections.

Results: In this study, we describe a novel computational strategy named loci2path that takes advantage of the newly emerged, genome-wide and tissue-specific expression quantitative trait loci (eQTL) information to help annotate a set of genomic intervals in terms of transcription regulation. By checking the presence or the absence of millions of eQTLs in a set of input genomic intervals, combined with grouping eQTLs by the pathways or gene sets that their target genes belong to, loci2path builds a bridge connecting genomic intervals to functional pathways and pre-defined biologically meaningful gene sets, revealing potential regulatory connections. Our method enjoys two key advantages over existing methods: first, we no longer rely on proximity to link a locus to a gene, which has been shown to be unreliable; second, eQTLs allow us to provide the regulatory annotation in the context of specific tissue types. To demonstrate its utility, we apply loci2path to sets of genomic intervals harboring disease-associated variants as queries. Using 1 702 612 eQTLs discovered by the Genotype-Tissue Expression (GTEx) project across 44 tissues and 6320 pathways or gene sets cataloged in MSigDB as the annotation resource, our method successfully identifies highly relevant biological pathways and reveals disease mechanisms for psoriasis and other immune-related diseases. Tissue specificity analysis of associated eQTLs provides additional evidence of the distinct roles that different tissues play in the disease mechanisms.

Availability And Implementation: loci2path is published as an open source Bioconductor package, and it is available at http://bioconductor.org/packages/release/bioc/html/loci2path.html.

Supplementary Information: Supplementary data are available at Bioinformatics online.
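
The core lookup described in the Results of this abstract, checking which eQTLs fall inside the query intervals and grouping them by the pathways of their target genes, can be sketched as follows. This is an illustrative toy and not the loci2path API: the function name `pathways_hit`, the tuple layouts, and the example data are assumptions, and the sketch omits the enrichment statistics and tissue stratification that the published package provides.

```python
from collections import defaultdict

def pathways_hit(intervals, eqtls, gene_to_pathways):
    """Count, per pathway, the eQTLs that lie inside any query interval.

    intervals: list of (chrom, start, end), inclusive coordinates
    eqtls: list of (chrom, position, target_gene)
    gene_to_pathways: dict mapping a gene name to the pathways it belongs to
    """
    counts = defaultdict(int)
    for chrom, pos, gene in eqtls:
        inside = any(c == chrom and start <= pos <= end for c, start, end in intervals)
        if inside:
            # Link the interval to pathways through the eQTL's target gene,
            # not through the gene that happens to be nearest.
            for pathway in gene_to_pathways.get(gene, ()):
                counts[pathway] += 1
    return dict(counts)

# Hypothetical query: one interval, three eQTLs, two pathways.
query = [("chr1", 100, 200)]
toy_eqtls = [("chr1", 150, "G1"), ("chr1", 300, "G2"), ("chr2", 150, "G1")]
toy_pathways = {"G1": ["immune"], "G2": ["metabolic"]}
hits = pathways_hit(query, toy_eqtls, toy_pathways)
```

The linear scan over intervals keeps the sketch short; with millions of eQTLs an interval index would be the more practical choice.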

Source
http://dx.doi.org/10.1093/bioinformatics/btz669
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC8215915
February 2020

Rapid Irreversible Transcriptional Reprogramming in Human Stem Cells Accompanied by Discordance between Replication Timing and Chromatin Compartment.

Stem Cell Reports 2019 07 20;13(1):193-206. Epub 2019 Jun 20.

Department of Biological Science, Florida State University, 319 Stadium Drive, Tallahassee, FL 32306, USA.

The temporal order of DNA replication is regulated during development and is highly correlated with gene expression, histone modifications and 3D genome architecture. We tracked changes in replication timing, gene expression, and chromatin conformation capture (Hi-C) A/B compartments over the first two cell cycles during differentiation of human embryonic stem cells to definitive endoderm. Remarkably, transcriptional programs were irreversibly reprogrammed within the first cell cycle and were largely but not universally coordinated with replication timing changes. Moreover, changes in A/B compartment and several histone modifications that normally correlate strongly with replication timing showed weak correlation during the early cell cycles of differentiation but showed increased alignment in later differentiation stages and in terminally differentiated cell lines. Thus, epigenetic cell fate transitions during early differentiation can occur despite dynamic and discordant changes in otherwise highly correlated genomic properties.

Source
http://dx.doi.org/10.1016/j.stemcr.2019.05.021
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6627004
July 2019

Integrative characterization of G-Quadruplexes in the three-dimensional chromatin structure.

Epigenetics 2019 09 10;14(9):894-911. Epub 2019 Jun 10.

State Key Laboratory of Bioelectronics, School of Biological Science and Medical Engineering, Southeast University, Nanjing, Jiangsu, China.

DNA molecules are highly compacted in the eukaryotic nucleus where distal regulatory elements reach their targets through three-dimensional chromosomal interactions. G-quadruplexes, stable four-stranded non-canonical DNA structures, can change local chromatin organization through the exclusion of nucleosomes. However, the relationship between G-quadruplexes and higher-order genome organization remains unknown. Here, we found that G-quadruplexes are significantly enriched at boundaries of topological associated domains (TADs). Architectural protein occupancy, which plays critical roles in the formation of TADs, was highly correlated with the content of G-quadruplexes at TAD boundaries. Moreover, adjacent boundaries containing G-quadruplexes frequently interacted with each other because of the high enrichment of architectural protein binding sites. Similar to CCCTC-binding factor (CTCF) binding sites, G-quadruplexes also showed strong insulation ability in the separation of adjacent regions. Additionally, the insulation ability of CTCF binding sites and TAD boundaries was significantly reinforced by G-quadruplexes. Furthermore, G-quadruplex motifs on different strands were associated with the orientation of CTCF binding sites. These findings suggest a potential role for G-quadruplexes in loop extrusion. The enrichment of transcription factor binding sites (TFBSs) around regulatory elements containing G-quadruplexes led to frequent interactions between regulatory elements containing G-quadruplexes. Intriguingly, more than 99% of G-quadruplexes overlapped with TFBSs. The binding sites of CTCF and cohesin proteins were preferentially located surrounding G-quadruplexes. Accordingly, we proposed a new mechanism of long-distance gene regulation in which G-quadruplexes are involved in distal interactions between enhancers and promoters.

Source
http://dx.doi.org/10.1080/15592294.2019.1621140
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6691997
September 2019

Association study in African-admixed populations across the Americas recapitulates asthma risk loci in non-African populations.

Nat Commun 2019 02 20;10(1):880. Epub 2019 Feb 20.

National Heart, Lung and Blood Institute, National Institutes of Health, Bethesda, MD, 20892, USA.

Asthma is a complex disease with striking disparities across racial and ethnic groups. Despite its relatively high burden, representation of individuals of African ancestry in asthma genome-wide association studies (GWAS) has been inadequate, and true associations in these underrepresented minority groups have been inconclusive. We report the results of a genome-wide meta-analysis from the Consortium on Asthma among African Ancestry Populations (CAAPA; 7009 asthma cases, 7645 controls). We find strong evidence for association at four previously reported asthma loci whose discovery was driven largely by non-African populations, including the chromosome 17q12-q21 locus and the chr12q13 region, a novel (and not previously replicated) asthma locus recently identified by the Trans-National Asthma Genetic Consortium (TAGC). An additional seven loci reported by TAGC show marginal evidence for association in CAAPA. We also identify two novel loci (8p23 and 8q24) that may be specific to asthma risk in African ancestry populations.

Source
http://dx.doi.org/10.1038/s41467-019-08469-7
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6382865
February 2019

RT States: systematic annotation of the human genome using cell type-specific replication timing programs.

Bioinformatics 2019 07;35(13):2167-2176

Department of Biostatistics and Bioinformatics, Rollins School of Public Health, Emory University, Atlanta, GA, USA.

Motivation: The replication timing (RT) program has been linked to many key biological processes including cell fate commitment, 3D chromatin organization and transcription regulation. Significant technological progress now allows the RT program to be characterized across the entire human genome in a high-throughput and high-resolution fashion. These experiments suggest that RT changes dynamically during development in coordination with gene activity. Since RT is such a fundamental biological process, we believe that an effective quantitative profile of the local RT program from a diverse set of cell types in various developmental stages and lineages can provide crucial biological insights for a genomic locus.

Results: In this study, we explored recurrent and spatially coherent combinatorial profiles from 42 RT programs collected from multiple lineages at diverse differentiation states. We found that a Hidden Markov Model with 15 hidden states provides a good model to describe these genome-wide RT profiling data. Each hidden state represents a unique combination of RT profiles across different cell types, which we refer to as 'RT states'. To understand the biological properties of these RT states, we inspected their relationship with chromatin states, gene expression, functional annotation and 3D chromosomal organization. We found that the newly defined RT states possess interesting genome-wide functional properties that add complementary information to the existing annotation of the human genome.

Availability And Implementation: R scripts for inferring HMM models and Perl scripts for further analysis are available at https://github.com/PouletAxel/script_HMM_Replication_timing.

Supplementary Information: Supplementary data are available at Bioinformatics online.

Source
http://dx.doi.org/10.1093/bioinformatics/bty957
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6681175
July 2019

EWS/ETS-Driven Ewing Sarcoma Requires BET Bromodomain Proteins.

Cancer Res 2018 08 13;78(16):4760-4773. Epub 2018 Jun 13.

Department of Cancer Biology, Perelman School of Medicine, University of Pennsylvania, BRBII/III, Philadelphia, Pennsylvania.

The EWS/ETS fusion transcription factors drive Ewing sarcoma (EWS) by orchestrating an oncogenic transcription program. Therapeutic targeting of EWS/ETS has been unsuccessful; however, identifying mediators of the EWS/ETS function could offer new therapeutic options. Here, we describe the dependency of EWS/ETS-driven transcription upon chromatin reader BET bromodomain proteins and investigate the potential of BET inhibitors in treating EWS. EWS/FLI1 and EWS/ERG were found in a transcriptional complex with BRD4, and knockdown of BRD2/3/4 significantly impaired the oncogenic phenotype of EWS cells. RNA-seq analysis following BRD4 knockdown or inhibition with JQ1 revealed an attenuated EWS/ETS transcriptional signature. In contrast to previous reports, JQ1 reduced proliferation and induced apoptosis through MYC-independent mechanisms without affecting EWS/ETS protein levels; this was confirmed by depleting BET proteins using PROTAC-BET degrader (BETd). Polycomb repressive complex 2 (PRC2)-associated factor PHF19 was downregulated by JQ1/BETd or BRD4 knockdown in multiple EWS lines. EWS/FLI1 bound a distal regulatory element of PHF19, and EWS/FLI1 knockdown resulted in downregulation of PHF19 expression. Deletion of PHF19 via CRISPR-Cas9 resulted in a decreased tumorigenic phenotype, a transcriptional signature that overlapped with JQ1 treatment, and increased sensitivity to JQ1. PHF19 expression was also associated with worse prognosis in patients with EWS. In vivo, JQ1 demonstrated antitumor efficacy in multiple mouse xenograft models of EWS. Together these results indicate that EWS/ETS requires BET epigenetic reader proteins for its transcriptional program and can be mitigated by BET inhibitors. This study provides a clear rationale for the clinical utility of BET inhibitors in treating EWS. These findings reveal the dependency of EWS/ETS transcription factors on BET epigenetic reader proteins and demonstrate the potential of BET inhibitors for the treatment of EWS.

Source
DOI: http://dx.doi.org/10.1158/0008-5472.CAN-18-0484
August 2018

Optimized distributed systems achieve significant performance improvement on sorted merging of massive VCF files.

Gigascience 2018 06;7(6)

Department of Medical Informatics, Emory University School of Medicine, Atlanta, GA 30322, USA.

Background: Sorted merging of genomic data is a common data operation necessary in many sequencing-based studies. It involves sorting and merging genomic data from different subjects by their genomic locations. In particular, merging a large number of variant call format (VCF) files is frequently required in large-scale whole-genome sequencing or whole-exome sequencing projects. Traditional single-machine based methods become increasingly inefficient when processing large numbers of files due to the excessive computation time and Input/Output bottleneck. Distributed systems and more recent cloud-based systems offer an attractive solution. However, carefully designed and optimized workflow patterns and execution plans (schemas) are required to take full advantage of the increased computing power while overcoming bottlenecks to achieve high performance.

Findings: In this study, we custom-design optimized schemas for three Apache big data platforms, Hadoop (MapReduce), HBase, and Spark, to perform sorted merging of a large number of VCF files. These schemas all adopt the divide-and-conquer strategy to split the merging job into sequential phases/stages consisting of subtasks that are conquered in an ordered, parallel, and bottleneck-free way. In two illustrative examples, we test the performance of our schemas on merging multiple VCF files into either a single TPED or a single VCF file, benchmarking them against the traditional single/parallel multiway-merge methods, a message passing interface (MPI)-based high-performance computing (HPC) implementation, and the popular VCFTools.

Conclusions: Our experiments suggest that all three schemas either deliver a significant improvement in efficiency or offer much better strong and weak scalability than traditional methods. Our findings provide generalized scalable schemas for performing sorted merging on genetics and genomics data using these Apache distributed systems.
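
For orientation, the single-machine multiway merge that the abstract uses as a baseline can be sketched as follows. This is a minimal illustration and not the authors' Hadoop, HBase, or Spark schemas: the record layout, the chromosome ordering, and the names `sort_key` and `sorted_merge` are all assumptions made for the example.

```python
import heapq
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical minimal record: (chromosome, position, sample_id, genotype).
Record = Tuple[str, int, str, str]

# Assumed chromosome ordering; a real VCF header defines its own contig order.
CHROM_ORDER = {str(i): i for i in range(1, 23)}
CHROM_ORDER.update({"X": 23, "Y": 24, "MT": 25})


def sort_key(rec: Record) -> Tuple[int, int]:
    """Order records by chromosome rank, then by position."""
    chrom, pos, _sample, _genotype = rec
    return (CHROM_ORDER[chrom], pos)


def sorted_merge(
    streams: List[Iterable[Record]],
) -> Iterator[Tuple[str, int, Dict[str, str]]]:
    """Merge per-sample streams (each already sorted) into one row per locus."""
    merged = heapq.merge(*streams, key=sort_key)
    current = None
    genotypes: Dict[str, str] = {}
    for chrom, pos, sample, genotype in merged:
        locus = (chrom, pos)
        if locus != current:
            # A new locus starts: emit the row collected for the previous one.
            if current is not None:
                yield current[0], current[1], genotypes
            current, genotypes = locus, {}
        genotypes[sample] = genotype
    if current is not None:
        yield current[0], current[1], genotypes
```

Because `heapq.merge` reads its inputs lazily, memory use stays proportional to the number of input streams rather than to the total number of records; the per-file reading that feeds it is the Input/Output cost the abstract identifies as the single-machine bottleneck.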

Source
DOI: http://dx.doi.org/10.1093/gigascience/giy052
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6007233
June 2018

Probabilistic and machine learning-based retrieval approaches for biomedical dataset retrieval.

Database (Oxford) 2018 01;2018

Department of Computer Science, Mathematics & Science Center, Emory University, Suite W401, 400 Dowman Drive NE, Atlanta, Georgia 30322, USA.

The bioCADDIE dataset retrieval challenge brought together different approaches to retrieval of biomedical datasets relevant to a user’s query, expressed as a text description of a needed dataset. We describe experiments in applying a data-driven, machine learning-based approach to biomedical dataset retrieval as part of this challenge. We report on a series of experiments carried out to evaluate the performance of both probabilistic and machine learning-driven techniques from information retrieval, as applied to this challenge. Our experiments with probabilistic information retrieval methods, such as query term weight optimization, automatic query expansion and simulated user relevance feedback, demonstrate that automatically boosting the weights of important keywords in a verbose query is more effective than other methods. We also show that although there is a rich space of potential representations and features available in this domain, machine learning-based re-ranking models are not able to improve on probabilistic information retrieval techniques with the currently available training data. The models and algorithms presented in this paper can serve as a viable implementation of a search engine to provide access to biomedical datasets. The retrieval performance is expected to be further improved by using additional training data that is created by expert annotation, or gathered through usage logs, clicks and other processes during natural operation of the system. Database URL: https://github.com/emory-irlab/biocaddie
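
The idea of boosting important keywords in a verbose query can be illustrated with a toy scorer. This is only a sketch under stated assumptions: the weighted term-frequency formula, the function name `weighted_score`, and the example weights are hypothetical and are not taken from the paper or its repository.

```python
from collections import Counter
from typing import Dict, List


def weighted_score(doc_tokens: List[str], query_weights: Dict[str, float]) -> float:
    """Sum of term frequencies in the document, each scaled by its query weight."""
    term_freq = Counter(doc_tokens)
    # Counter returns 0 for query terms that do not occur in the document.
    return sum(weight * term_freq[term] for term, weight in query_weights.items())


def rank(docs: Dict[str, List[str]], query_weights: Dict[str, float]) -> List[str]:
    """Return document ids ordered from highest to lowest score."""
    return sorted(docs, key=lambda d: weighted_score(docs[d], query_weights), reverse=True)
```

With uniform weights, a document that repeats a generic word such as "data" can outrank one that mentions the key entity; raising the weight of the key term reverses that ordering.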

Source
DOI: http://dx.doi.org/10.1093/database/bax104
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5887275
January 2018

Using DIVAN to assess disease/trait-associated single nucleotide variants in genome-wide scale.

BMC Res Notes 2017 Oct 30;10(1):530. Epub 2017 Oct 30.

Department of Biostatistics and Bioinformatics, Rollins School of Public Health, Emory University, Atlanta, GA, 30322, USA.

Objective: The majority of sequence variants identified by Genome-wide association studies (GWASs) fall outside of the protein-coding regions. Unlike coding variants, it is challenging to connect these noncoding variants to the pathophysiology of complex diseases/traits due to the lack of functional annotations in the non-coding regions. To overcome this, by leveraging the rich collection of genomic and epigenomic profiles, we have developed DIVAN, or Disease/trait-specific Variant ANnotation, which enables the assignment of a measurement (D-score) for each base of the human genome in a disease/trait-specific manner. To facilitate the utilization of DIVAN, we pre-computed D-scores for every base of the human genome (hg19) for 45 different diseases/traits.

Results: In this work, we present a detailed protocol on how to use the DIVAN software toolkit to retrieve D-scores either by variant identifiers or by genomic regions for a disease/trait of interest. We also demonstrate the utility of the D-scores using real data examples. We believe that the pre-computed D-scores for 45 diseases/traits are a useful resource for following up on the discoveries made by GWASs, and the DIVAN software toolkit provides a convenient way to access this resource. DIVAN is freely available at https://sites.google.com/site/emorydivan/software.

Source
DOI: http://dx.doi.org/10.1186/s13104-017-2851-y
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5663107
October 2017

Improving Hierarchical Models Using Historical Data with Applications in High-Throughput Genomics Data Analysis.

Stat Biosci 2017 Jun 8;9(1):73-90. Epub 2016 Jul 8.

Department of Biostatistics and Bioinformatics, Rollins School of Public Health, Emory University, Atlanta, GA 30322, USA.

Modern high-throughput biotechnologies such as microarray and next generation sequencing produce a massive amount of information for each sample assayed. However, in a typical high-throughput experiment, only a limited amount of data is observed for each individual feature, hence the classical 'large p, small n' problem. The Bayesian hierarchical model, capable of borrowing strength across features within the same dataset, has been recognized as an effective tool in analyzing such data. However, the shrinkage effect, the most prominent feature of hierarchical models, can lead to undesirable over-correction for some features. In this work, we discuss possible causes of the over-correction problem and propose several alternative solutions. Our strategy is rooted in the fact that in the Big Data era, large amounts of historical data are available and should be exploited. Our strategy presents a new framework to enhance the Bayesian hierarchical model. Through simulation and real data analysis, we demonstrated superior performance of the proposed strategy. Our new strategy also enables borrowing information across different platforms, which could be extremely useful with the emergence of new technologies and the accumulation of data from different platforms in the Big Data era. Our method has been implemented in the R package "adaptiveHM", which is freely available from https://github.com/benliemory/adaptiveHM.
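
The shrinkage effect mentioned above can be made concrete with the textbook normal-normal hierarchical model. The sketch below is a generic illustration and not the adaptiveHM method: it assumes y_i ~ N(theta_i, s_i^2) with theta_i ~ N(mu, tau^2), and it assumes the prior mean and variance are simply supplied by the caller; the function name `shrink` is hypothetical.

```python
from typing import List


def shrink(
    estimates: List[float],
    variances: List[float],
    prior_mean: float,
    prior_var: float,
) -> List[float]:
    """Posterior means under y_i ~ N(theta_i, s_i^2), theta_i ~ N(mu, tau^2)."""
    shrunk = []
    for y, s2 in zip(estimates, variances):
        # Fraction of the estimate that is pulled toward the prior mean.
        weight = s2 / (s2 + prior_var)
        shrunk.append(weight * prior_mean + (1.0 - weight) * y)
    return shrunk
```

A feature measured with high variance is pulled almost entirely to the prior mean, even when its true effect is large. That is the kind of over-correction the abstract refers to, and it is why the choice of prior parameters matters.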

Source
DOI: http://dx.doi.org/10.1007/s12561-016-9156-x
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5599104
June 2017

Identifying tagging SNPs for African specific genetic variation from the African Diaspora Genome.

Sci Rep 2017 04 21;7:46398. Epub 2017 Apr 21.

Department of Biostatistics and Bioinformatics, Emory University, Atlanta, GA, USA.

A primary goal of The Consortium on Asthma among African-ancestry Populations in the Americas (CAAPA) is to develop an 'African Diaspora Power Chip' (ADPC), a genotyping array consisting of tagging SNPs, useful in comprehensively identifying African-specific genetic variation. This array is designed based on the novel variation identified in 642 CAAPA samples of African ancestry with high coverage whole genome sequence data (~30× depth). This novel variation extends the pattern of variation catalogued in the 1000 Genomes and Exome Sequencing Projects to a spectrum of populations representing the wide range of West African genomic diversity. These individuals from CAAPA also comprise a large swath of the African Diaspora population and incorporate historical genetic diversity covering nearly the entire Atlantic coast of the Americas. Here we show the results of designing and producing such a microchip array. This novel array covers African-specific variation far better than other commercially available arrays, and will enable better GWAS analyses for researchers with individuals of African descent in their study populations. A recent study cataloging variation in continental African populations suggests this type of African-specific genotyping array is both necessary and valuable for facilitating large-scale GWAS in populations of African ancestry.

Source
DOI: http://dx.doi.org/10.1038/srep46398
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5399604
April 2017

Omicseq: a web-based search engine for exploring omics datasets.

Nucleic Acids Res 2017 07;45(W1):W445-W452

Department of Biostatistics and Bioinformatics, Rollins School of Public Health, Emory University, 1518 Clifton Road NE, Atlanta, GA 30322, USA.

The development and application of high-throughput genomics technologies has resulted in massive quantities of diverse omics data that continue to accumulate rapidly. These rich datasets offer unprecedented and exciting opportunities to address long-standing questions in biomedical research. However, our ability to explore and query the content of diverse omics data is very limited. Existing dataset search tools rely almost exclusively on the metadata. A text-based query for gene name(s) does not work well on datasets wherein the vast majority of their content is numeric. To overcome this barrier, we have developed Omicseq, a novel web-based platform that facilitates the easy interrogation of omics datasets holistically to improve 'findability' of relevant data. The core component of Omicseq is trackRank, a novel algorithm for ranking omics datasets that fully uses the numerical content of the dataset to determine relevance to the query entity. The Omicseq system is supported by a scalable and elastic NoSQL database that hosts a large collection of processed omics datasets. In the front end, a simple, web-based interface allows users to enter queries and instantly receive search results as a list of ranked datasets deemed to be the most relevant. Omicseq is freely available at http://www.omicseq.org.

Source
DOI: http://dx.doi.org/10.1093/nar/gkx258
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5793835
July 2017

DIVAN: accurate identification of non-coding disease-specific risk variants using multi-omics profiles.

Genome Biol 2016 12 6;17(1):252. Epub 2016 Dec 6.

Department of Biostatistics and Bioinformatics, Rollins School of Public Health, Emory University, Atlanta, GA, 30322, USA.

Understanding the link between non-coding sequence variants, identified in genome-wide association studies, and the pathophysiology of complex diseases remains challenging due to a lack of annotations in non-coding regions. To overcome this, we developed DIVAN, a novel feature selection and ensemble learning framework, which identifies disease-specific risk variants by leveraging a comprehensive collection of genome-wide epigenomic profiles across cell types and factors, along with other static genomic features. DIVAN accurately and robustly recognizes non-coding disease-specific risk variants under multiple testing scenarios; among all the features, histone marks, especially those marks associated with repressed chromatin, are often more informative than others.

Source
DOI: http://dx.doi.org/10.1186/s13059-016-1112-z
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5139035
December 2016

Genome-Wide STAT3 Binding Analysis after Histone Deacetylase Inhibition Reveals Novel Target Genes in Dendritic Cells.

J Innate Immun 2017 19;9(2):126-144. Epub 2016 Nov 19.

Division of Hematology/Oncology, Department of Internal Medicine, University of Michigan, Ann Arbor, Mich., USA.

STAT3 is a master transcriptional regulator that plays an important role in the induction of both immune activation and immune tolerance in dendritic cells (DCs). The transcriptional targets of STAT3 in promoting DC activation are becoming increasingly understood; however, the mechanisms underpinning its role in causing DC suppression remain largely unknown. To determine the functional gene targets of STAT3, we compared the genome-wide binding of STAT3 using ChIP sequencing coupled with gene expression microarrays to determine STAT3-dependent gene regulation in DCs after histone deacetylase (HDAC) inhibition. HDAC inhibition boosted the ability of STAT3 to bind to distinct DNA targets and regulate gene expression. Among the top 500 STAT3 binding sites, the frequency of canonical motifs was significantly higher than that of noncanonical motifs. Functional analysis revealed that after treatment with an HDAC inhibitor, the upregulated STAT3 target genes were those that were primarily the negative regulators of proinflammatory cytokines and those in the IL-10 signaling pathway. The downregulated STAT3-dependent targets were those involved in immune effector processes and antigen processing/presentation. The expression and functional relevance of these genes were validated. Specifically, functional studies confirmed that the upregulation of IL-10Ra by STAT3 contributed to the suppressive function of DCs following HDAC inhibition.

Source
DOI: http://dx.doi.org/10.1159/000450681
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5330838
November 2017

The single-species metagenome: subtyping Staphylococcus aureus core genome sequences from shotgun metagenomic data.

PeerJ 2016 18;4:e2571. Epub 2016 Oct 18.

Department of Medicine, Division of Infectious Diseases, Emory University School of Medicine, Atlanta, GA, USA; Department of Human Genetics, Emory University School of Medicine, Atlanta, GA, USA.

In this study we developed a genome-based method for detecting Staphylococcus aureus subtypes from metagenome shotgun sequence data. We used a binomial mixture model and the coverage counts at >100,000 known S. aureus SNP (single nucleotide polymorphism) sites derived from prior comparative genomic analysis to estimate the proportion of 40 subtypes in metagenome samples. We were able to obtain >87% sensitivity and >94% specificity at 0.025X coverage for S. aureus. We found that 321 and 149 metagenome samples from the Human Microbiome Project and metaSUB analysis of the New York City subway, respectively, contained S. aureus at genome coverage >0.025. In both projects, CC8 and CC30 were the most common S. aureus clonal complexes encountered. We found evidence that the subtype composition at different body sites of the same individual was more similar than expected under random sampling, and more limited evidence that certain body sites were enriched for particular subtypes. One surprising finding was the apparent high frequency of CC398, a lineage often associated with livestock, in samples from the tongue dorsum. Epidemiologic analysis of the HMP subject population suggested that high BMI (body mass index) and health insurance are possibly associated with S. aureus carriage, but there was limited power to identify factors linked to carriage of even the most common subtype. In the NYC subway data, we found a small signal of geographic distance affecting subtype clustering, but other unknown factors influence taxonomic distribution of the species around the city.
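
To show how mixture proportions can be recovered from allele counts at known SNP sites, here is a small expectation-maximization sketch. It is a simplified illustration and not the authors' implementation: it assumes each read comes from one subtype, that each subtype's allele at every site is known, and that reads are miscalled at a fixed `error_rate`; the function name and parameters are hypothetical.

```python
from typing import List


def estimate_proportions(
    alt_counts: List[int],
    depths: List[int],
    alleles: List[List[int]],
    error_rate: float = 0.01,
    iterations: int = 200,
) -> List[float]:
    """EM estimate of subtype proportions from allele counts at known SNP sites.

    alleles[i][j] is 1 if subtype j carries the alternate allele at site i, else 0.
    error_rate must be strictly between 0 and 1 so that no division by zero occurs.
    """
    n_subtypes = len(alleles[0])
    pi = [1.0 / n_subtypes] * n_subtypes
    total_reads = sum(depths)
    for _ in range(iterations):
        expected = [0.0] * n_subtypes
        for k, n, row in zip(alt_counts, depths, alleles):
            # Probability that a read from subtype j shows the alternate allele.
            p_alt = [(1.0 - error_rate) if a else error_rate for a in row]
            mix_alt = sum(w * p for w, p in zip(pi, p_alt))
            mix_ref = 1.0 - mix_alt
            for j in range(n_subtypes):
                # E-step: expected number of reads at this site drawn from subtype j.
                expected[j] += k * pi[j] * p_alt[j] / mix_alt
                expected[j] += (n - k) * pi[j] * (1.0 - p_alt[j]) / mix_ref
        # M-step: proportions are the normalized expected read counts.
        pi = [e / total_reads for e in expected]
    return pi
```

At very low coverage most sites have zero or one read, so the estimate depends on pooling evidence across many sites, which is consistent with the abstract's use of more than 100,000 SNP positions.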

Source
DOI: http://dx.doi.org/10.7717/peerj.2571
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5075713
October 2016

A continuum of admixture in the Western Hemisphere revealed by the African Diaspora genome.

Nat Commun 2016 10 11;7:12522. Epub 2016 Oct 11.

Department of Medicine, Northwestern University, Chicago, Illinois 60637, USA.

The African Diaspora in the Western Hemisphere represents one of the largest forced migrations in history and had a profound impact on genetic diversity in modern populations. To date, the fine-scale population structure of descendants of the African Diaspora remains largely uncharacterized. Here we present genetic variation from deeply sequenced genomes of 642 individuals from North and South American, Caribbean and West African populations, substantially increasing the lexicon of human genomic variation and suggesting much variation remains to be discovered in African-admixed populations in the Americas. We summarize genetic variation in these populations, quantifying the postcolonial sex-biased European gene flow across multiple regions. Moreover, we refine estimates on the burden of deleterious variants carried across populations and how this varies with African ancestry. Our data are an important resource for empowering disease mapping studies in African-admixed individuals and will facilitate gene discovery for diseases disproportionately affecting individuals of African ancestry.

Source
DOI: http://dx.doi.org/10.1038/ncomms12522
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5062574
October 2016

MLL1 and MLL1 fusion proteins have distinct functions in regulating leukemic transcription program.

Cell Discov 2016 17;2:16008. Epub 2016 May 17.

Department of Pathology, University of Michigan, Ann Arbor, MI, USA; Department of Biological Chemistry, University of Michigan, Ann Arbor, MI, USA.

Mixed lineage leukemia protein-1 (MLL1) has a critical role in human MLL1 rearranged leukemia (MLLr) and is a validated therapeutic target. However, its role in regulating global gene expression in MLLr cells, as well as its interplay with MLL1 fusion proteins remains unclear. Here we show that despite shared DNA-binding and cofactor interacting domains at the N terminus, MLL1 and MLL-AF9 are recruited to distinct chromatin regions and have divergent functions in regulating the leukemic transcription program. We demonstrate that MLL1, probably through C-terminal interaction with WDR5, is recruited to regulatory enhancers that are enriched for binding sites of E-twenty-six (ETS) family transcription factors, whereas MLL-AF9 binds to chromatin regions that have no H3K4me1 enrichment. Transcriptome-wide changes induced by different small molecule inhibitors also highlight the distinct functions of MLL1 and MLL-AF9. Taken together, our studies provide novel insights on how MLL1 and MLL fusion proteins contribute to leukemic gene expression, which have implications for developing effective therapies in the future.

Source
DOI: http://dx.doi.org/10.1038/celldisc.2016.8
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4869169
July 2016
-->