Publications by authors named "Momiao Xiong"

113 Publications

Forecasting and Evaluating Multiple Interventions for COVID-19 Worldwide.

Front Artif Intell 2020 May 22;3:41. Epub 2020 May 22.

School of Public Health, The University of Texas Health Science Center at Houston, Houston, TX, United States.

As the COVID-19 pandemic surges around the world, questions arise about the number of global cases at the pandemic's peak, the length of the pandemic before it recedes, and the timing of intervention strategies to effectively stop the spread of COVID-19. We have developed artificial intelligence (AI)-inspired methods for modeling the transmission dynamics of the epidemic and evaluating interventions to curb the spread and impact of COVID-19. The developed methods were applied to the surveillance data of cumulative and new COVID-19 cases and deaths reported by WHO as of March 16th, 2020. Both the timing and the degree of intervention were evaluated. The average error of five-step ahead forecasting was 2.5%. With complete intervention implemented 4 weeks after the beginning date (March 16th, 2020), the total peak number of cumulative cases, the peak number of new cases, and the maximum number of cumulative cases in the world reached 75,249,909, 10,086,085, and 255,392,154, respectively. With complete intervention after 1 week, these figures were reduced to 951,799, 108,853, and 1,530,276, respectively. The duration of the COVID-19 spread was reduced from 356 days to 232 days between later and earlier interventions. We observed that delaying intervention for 1 month increased the maximum number of cumulative cases to about 166.89 times that of earlier complete intervention, and the number of deaths increased from 53,560 to 8,938,725. Earlier and complete intervention is necessary to stem the tide of COVID-19 infection.

Source
DOI: http://dx.doi.org/10.3389/frai.2020.00041
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7861333
May 2020
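
The headline comparison in the COVID-19 abstract above can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted in that abstract; the variable names are illustrative and do not come from the paper.

```python
# Figures quoted in the abstract: complete intervention starting
# 4 weeks ("late") versus 1 week ("early") after March 16th, 2020.
late = {"max_cumulative_cases": 255_392_154, "deaths": 8_938_725, "duration_days": 356}
early = {"max_cumulative_cases": 1_530_276, "deaths": 53_560, "duration_days": 232}

# How many times larger the late-intervention outcomes are.
case_ratio = late["max_cumulative_cases"] / early["max_cumulative_cases"]
death_ratio = late["deaths"] / early["deaths"]

# How much shorter the epidemic is under early intervention.
days_saved = late["duration_days"] - early["duration_days"]

print(f"maximum cumulative cases, late / early: {case_ratio:.2f}")  # 166.89
print(f"deaths, late / early: {death_ratio:.2f}")                   # 166.89
print(f"days saved by early intervention: {days_saved}")            # 124
```

Both ratios come out at roughly 166.89, which matches the multiplier reported in the abstract.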

Gene-based analysis of bi-variate survival traits via functional regressions with applications to eye diseases.

Genet Epidemiol 2021 Mar 1. Epub 2021 Mar 1.

Department of Biostatistics, Bioinformatics, and Biomathematics, Georgetown University Medical Center, Washington, District of Columbia, USA.

Genetic studies of two related survival outcomes of a pleiotropic gene are commonly encountered but statistical models to analyze them are rarely developed. To analyze sequencing data, we propose mixed effect Cox proportional hazard models by functional regressions to perform gene-based joint association analysis of two survival traits motivated by our ongoing real studies. These models extend fixed effect Cox models of univariate survival traits by incorporating variations and correlation of multivariate survival traits into the models. The associations between genetic variants and two survival traits are tested by likelihood ratio test statistics. Extensive simulation studies suggest that type I error rates are well controlled and power performances are stable. The proposed models are applied to analyze bivariate survival traits of left and right eyes in the age-related macular degeneration progression.

Source
DOI: http://dx.doi.org/10.1002/gepi.22381
March 2021

Conditional Generative Adversarial Networks for Individualized Treatment Effect Estimation and Treatment Selection.

Front Genet 2020 Dec 11;11:585804. Epub 2020 Dec 11.

Department of Biostatistics and Data Science, School of Public Health, The University of Texas Health Science Center at Houston, Houston, TX, United States.

Treatment response is heterogeneous. However, classical methods treat the treatment response as homogeneous and estimate average treatment effects. These traditional methods are difficult to apply to precision oncology. Artificial intelligence (AI) is a powerful tool for precision oncology. It can accurately estimate the individualized treatment effects and learn optimal treatment choices. Therefore, the AI approach can substantially improve progress and treatment outcomes of patients. One AI approach, conditional generative adversarial nets for inference of individualized treatment effects (GANITE), has been developed. However, GANITE can only deal with binary treatment and does not provide a tool for optimal treatment selection. To overcome these limitations, we modify conditional generative adversarial networks (MCGANs) to allow estimation of individualized effects of any type of treatment, including binary, categorical, and continuous treatments. We propose to use sparse techniques for selection of biomarkers that predict the best treatment for each patient. Simulations show that MCGANs outperform seven other state-of-the-art methods: linear regression (LR), Bayesian linear ridge regression (BLR), k-Nearest Neighbor (KNN), random forest classification [RF (C)], random forest regression [RF (R)], logistic regression (LogR), and support vector machine (SVM). To illustrate their applications, the proposed MCGANs were applied to 256 patients with newly diagnosed acute myeloid leukemia (AML) who were treated with high dose ara-C (HDAC), Idarubicin (IDA), or both of these treatments (HDAC+IDA) at M. D. Anderson Cancer Center. Our results showed that MCGANs can estimate the individualized treatment effects more accurately and robustly than other state-of-the-art methods. Several biomarkers, such as GSK3, BILIRUBIN, and SMAC, were identified, and a total of 30 biomarkers can explain 36.8% of treatment effect variation.

Source
DOI: http://dx.doi.org/10.3389/fgene.2020.585804
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7759680
December 2020

Shared Causal Paths underlying Alzheimer's dementia and Type 2 Diabetes.

Sci Rep 2020 03 5;10(1):4107. Epub 2020 Mar 5.

Department of Biostatistics and Data Science, School of Public Health, University of Texas Health Science Center at Houston, Houston, Texas, USA.

Although Alzheimer's disease (AD) is a central nervous system disease and type 2 diabetes mellitus (T2DM) is a metabolic disorder, an increasing number of genetic epidemiological studies show a clear link between AD and T2DM. The current approach to uncovering the shared pathways between AD and T2DM involves association analysis; however, such analyses lack power to discover the mechanisms of the diseases. As an alternative, we developed novel causal inference methods for genetic studies of AD and T2DM and pipelines for systematic multi-omic causal analysis to infer multilevel omics causal networks for the discovery of common paths from genetic variants to AD and T2DM. The proposed pipelines were applied to 448 individuals from the ROSMAP Project. We identified 13 shared causal genes, 16 shared causal pathways between AD and T2DM, and 754 gene expression and 101 gene methylation nodes that were connected to both AD and T2DM in multi-omics causal networks.

Source
DOI: http://dx.doi.org/10.1038/s41598-020-60682-3
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7058072
March 2020

Deep Feature Selection and Causal Analysis of Alzheimer's Disease.

Front Neurosci 2019 Nov 15;13:1198. Epub 2019 Nov 15.

Department of Biostatistics and Data Science, School of Public Health, The University of Texas Health Science Center, Houston, TX, United States.

Deep convolutional neural networks (DCNNs) have achieved great success for image classification in medical research. Deep learning with brain imaging is the imaging method of choice for the diagnosis and prediction of Alzheimer's disease (AD). However, it is also well known that DCNNs are "black boxes" owing to their low interpretability to humans. The lack of transparency of deep learning compromises its application to the prediction and mechanism investigation in AD. To overcome this limitation, we develop a novel general framework that integrates deep learning, feature selection, causal inference, and genetic-imaging data analysis for predicting and understanding AD. The proposed algorithm not only improves the prediction accuracy but also identifies the brain regions underlying the development of AD and causal paths from genetic variants to AD via image mediation. The proposed algorithm is applied to the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset with diffusion tensor imaging (DTI) in 151 subjects (51 AD and 100 non-AD) who were measured at four time points: baseline, 6 months, 12 months, and 24 months. The algorithm identified brain regions underlying AD consisting of the temporal lobes (including the hippocampus) and the ventricular system.

Source
DOI: http://dx.doi.org/10.3389/fnins.2019.01198
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6872503
November 2019

Gene-based association analysis of survival traits via functional regression-based mixed effect cox models for related samples.

Genet Epidemiol 2019 12 10;43(8):952-965. Epub 2019 Sep 10.

Department of Biostatistics, Bioinformatics, and Biomathematics, Georgetown University Medical Center, Washington, District of Columbia.

The importance of integrating survival analysis into genetics and genomics is widely recognized, but only a small number of statisticians have produced relevant work in this direction. For unrelated population data, functional regression (FR) models have been developed to test for association between a quantitative/dichotomous/survival trait and genetic variants in a gene region. In major gene association analysis, these models have higher power than sequence kernel association tests. In this paper, we extend this approach to analyze censored traits for family data or related samples using FR-based mixed effect Cox models (FamCoxME). The FamCoxME models the effect of a major gene as a fixed mean via functional data analysis techniques, the local gene or polygene variations or both as random, and the correlation of pedigree members by kinship coefficients or genetic relationship matrix or both. The association between the censored trait and the major gene is tested by likelihood ratio tests (FamCoxME FR LRT). Simulation results indicate that the LRTs control the type I error rates accurately/conservatively and have good power levels when both local gene and polygene variations are modeled. The proposed methods were applied to analyze a breast cancer data set from the Consortium of Investigators of Modifiers of BRCA1 and BRCA2 (CIMBA). The FamCoxME provides a new tool for gene-based analysis of family-based studies or related samples.

Source
DOI: http://dx.doi.org/10.1002/gepi.22254
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6829032
December 2019

A Frameshift Variant in the CHST9 Gene Identified by Family-Based Whole Genome Sequencing Is Associated with Schizophrenia in Chinese Population.

Sci Rep 2019 09 3;9(1):12717. Epub 2019 Sep 3.

410 AI, LLC, Germantown, MD, 20876, USA.

Recent studies imply that rare variants contribute to the risk of schizophrenia; however, the exact variants or genes responsible for this condition are largely unknown. In this study, we conducted whole genome sequencing (WGS) of 20 Chinese families. Each family consisted of at least two affected siblings diagnosed with schizophrenia and at least one unaffected sibling. We examined functional variants that were found in affected sibling(s) but not in unaffected sibling(s) within a family. Matching this criterion, a frameshift heterozygous deletion of CA (-/CA) at chromosome 18:24722722, also referred to as rs752084147, in the Carbohydrate Sulfotransferase 9 (CHST9) gene, was detected in two families. This deletion was confirmed by PCR-based Sanger sequencing. With the observed frequency of 0.00076 in the Han Chinese population, we performed both case-control and family-based analyses to evaluate its association with schizophrenia. In the case-control analyses, the Chi-square test P-value was 6.80e-12 and the P-value was 0.0008 after one million simulations. In family-based segregation analyses, the segregation P-value was 7.72e-7 and the simulated P-value was 5.70e-6. In both the case-control and family-based analyses, the CA deletion was significantly associated with schizophrenia in the Chinese population. Further investigation of the role of this gene in the development of schizophrenia, using larger and more ethnically diverse samples, is warranted.

Source
DOI: http://dx.doi.org/10.1038/s41598-019-49052-w
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6722128
September 2019

Role of Immune Response, Inflammation, and Tumor Immune Response-Related Cytokines/Chemokines in Melanoma Progression.

J Invest Dermatol 2019 11 7;139(11):2352-2358.e3. Epub 2019 Jun 7.

Department of Surgical Oncology, The University of Texas MD Anderson Cancer Center, Houston, Texas, USA.

To investigate the role of tumor cytokines/chemokines in melanoma immune response, we estimated the proportions of immune cell subsets in melanoma tumors from The Cancer Genome Atlas, followed by evaluation of the association between cytokine/chemokine expression and these subsets. We then investigated the association of immune cell subsets, chemokines, and cytokines with patient survival. Finally, we evaluated the immune cell tumor-infiltrating lymphocyte (TIL) score for correlation with melanoma patient outcome in a separate cohort. There was good agreement between RNA sequencing estimation of T-cell subset and pathologist-determined TIL score. Expression levels of cytokines IL-12A, IFNG, and IL-10, and chemokines CXCL9 and CXCL10 were positively correlated with PDCD1, CTLA-4, and CD8 T-cell subset, but negatively correlated with tumor purity (Bonferroni-corrected P < 0.05). In multivariable analysis, higher expression levels of cytokines IFN-γ and TGFB1, but not chemokines, were associated with improved overall survival. A higher expression level of CD8 T-cell subset was also associated with improved overall survival (hazard ratio [HR] = 0.06, 95% confidence interval [CI] = 0.01-0.35, P = 0.002). Finally, multivariable analysis showed that patients with a brisk TIL score had better melanoma-specific survival than those with a nonbrisk score (HR = 0.51, 95% CI = 0.27-0.98, P = 0.0423). These results suggest that the expression levels of specific tumor cytokines are important biomarkers of melanoma immune response.

Source
DOI: http://dx.doi.org/10.1016/j.jid.2019.03.1158
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6814532
November 2019

Robust Reference Powered Association Test of Genome-Wide Association Studies.

Front Genet 2019 Apr 9;10:319. Epub 2019 Apr 9.

Human Phenome Institute, Fudan University, Shanghai, China.

Although genome-wide association studies (GWASs) have identified abundant genetic susceptibility loci, GWASs of small sample size fall far short of previous expectations due to low statistical power and false positive results. Effective statistical methods are required to further improve the analyses of massive GWAS data. Here we present a new statistic (Robust Reference Powered Association Test) that uses a large public database (gnomAD) as a reference to reduce concerns about potential population stratification. To evaluate the performance of this statistic in various situations, we simulated multiple sets of sample sizes and frequencies to compute statistical power. Furthermore, we applied our method to several real datasets (psoriasis genome-wide association datasets and a schizophrenia genome-wide association dataset) to evaluate the performance. Careful analyses indicated that our newly developed statistic outperformed several previously developed GWAS applications. Importantly, this statistic is more robust than the naive merging method in the presence of small control-reference differentiation, and is therefore likely to detect more association signals.

Source
DOI: http://dx.doi.org/10.3389/fgene.2019.00319
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6465778
April 2019

Linear mixed models for association analysis of quantitative traits with next-generation sequencing data.

Genet Epidemiol 2019 03 9;43(2):189-206. Epub 2018 Dec 9.

Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health (NIH), Bethesda, Maryland.

We develop linear mixed models (LMMs) and functional linear mixed models (FLMMs) for gene-based tests of association between a quantitative trait and genetic variants on pedigrees. The effects of a major gene are modeled as a fixed effect, the contributions of polygenes are modeled as a random effect, and the correlations of pedigree members are modeled via inbreeding/kinship coefficients. F-statistics and χ² likelihood ratio test (LRT) statistics based on the LMMs and FLMMs are constructed to test for association. We show empirically that the F-distributed statistics provide a good control of the type I error rate. The F-test statistics of the LMMs have similar or higher power than the FLMMs, kernel-based famSKAT (family-based sequence kernel association test), and burden test famBT (family-based burden test). The F-statistics of the FLMMs perform well when analyzing a combination of rare and common variants. For small samples, the LRT statistics of the FLMMs control the type I error rate well at the nominal levels α = 0.01 and 0.05. For moderate/large samples, the LRT statistics of the FLMMs control the type I error rates well. The LRT statistics of the LMMs can lead to inflated type I error rates. The proposed models are useful in whole genome and whole exome association studies of complex traits.

Source
DOI: http://dx.doi.org/10.1002/gepi.22177
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6375753
March 2019

knnAUC: an open-source R package for detecting nonlinear dependence between one continuous variable and one binary variable.

BMC Bioinformatics 2018 Nov 22;19(1):448. Epub 2018 Nov 22.

Six Industrial Research Institute, Fudan University, Shanghai, China.

Background: Testing the dependence of two variables is one of the fundamental tasks in statistics. In this work, we developed an open-source R package (knnAUC) for detecting nonlinear dependence between one continuous variable X and one binary dependent variable Y (0 or 1).

Results: We addressed this problem by using knnAUC (k-nearest neighbors AUC test; the R package is available at https://sourceforge.net/projects/knnauc/). In the knnAUC software framework, we first resampled a dataset to get the training and testing datasets according to the sample ratio (from 0 to 1), and then constructed a k-nearest neighbors classifier to get the yhat estimator (the probability of y = 1) of testy (the true label of the testing dataset). Finally, we calculated the AUC (area under the receiver operating characteristic curve) estimator and tested whether the AUC estimator is greater than 0.5. To evaluate the advantages of knnAUC compared to seven other popular methods, we performed extensive simulations to explore the relationships between eight different methods and compared the false positive rates and statistical power using both simulated and real datasets (chronic hepatitis B datasets and kidney cancer RNA-seq datasets).

Conclusions: We concluded that knnAUC is an efficient R package to test nonlinear dependence between one continuous variable and one binary dependent variable, especially in computational biology.

Source
DOI: http://dx.doi.org/10.1186/s12859-018-2427-4
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6249767
November 2018
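
The knnAUC steps described in the abstract above (split the data, fit a k-nearest neighbors classifier, score the test set, compute the AUC, and test whether it exceeds 0.5) can be sketched in plain Python. This is not the knnAUC R package: the function names, the default `ratio`, `k`, and `n_perm` values, and the use of a label-permutation test for the final step are all assumptions made for illustration, since the abstract does not say how the AUC is tested against 0.5.

```python
import random

def knn_prob(train, x, k):
    """Fraction of the k nearest training points (by |x - xi|) with y = 1."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)

def auc(scores, labels):
    """AUC as the chance that a positive outscores a negative (ties count 1/2)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need both classes in the test set")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def knn_auc_test(x, y, ratio=0.5, k=5, n_perm=200, seed=0):
    """Hypothetical sketch: returns (AUC on the test split, permutation p-value)."""
    rng = random.Random(seed)
    idx = list(range(len(x)))
    rng.shuffle(idx)
    cut = int(ratio * len(idx))           # 'ratio' plays the role of the sample ratio
    train = [(x[i], y[i]) for i in idx[:cut]]
    test_x = [x[i] for i in idx[cut:]]
    test_y = [y[i] for i in idx[cut:]]    # 'testy' in the abstract
    yhat = [knn_prob(train, xi, k) for xi in test_x]
    observed = auc(yhat, test_y)
    # Assumption: one-sided permutation test of AUC > 0.5 by shuffling test labels.
    exceed = 0
    for _ in range(n_perm):
        perm = test_y[:]
        rng.shuffle(perm)
        if auc(yhat, perm) >= observed:
            exceed += 1
    return observed, (exceed + 1) / (n_perm + 1)

# Toy data with a nonlinear (non-monotone) dependence: y = 1 when |x| > 1.
data_rng = random.Random(1)
x = [data_rng.uniform(-2, 2) for _ in range(200)]
y = [1 if abs(v) > 1 else 0 for v in x]
observed_auc, p_value = knn_auc_test(x, y)
print(observed_auc, p_value)
```

On this toy example a linear correlation between X and Y would be near zero, while the k-nearest neighbors scores still separate the two classes, which is the kind of dependence the abstract targets.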

Bivariate Causal Discovery and Its Applications to Gene Expression and Imaging Data Analysis.

Front Genet 2018 Aug 31;9:347. Epub 2018 Aug 31.

Department of Biostatistics and Data Science, The University of Texas School of Public Health, Houston, TX, United States.

The mainstream of research in genetics, epigenetics, and imaging data analysis focuses on statistical association or exploring statistical dependence between variables. Despite significant progress in genetic research, understanding the etiology and mechanism of complex phenotypes remains elusive. Using association analysis as a major analytical platform for complex data analysis is a key issue that hampers the theoretical development of genomic science and its application in practice. Causal inference is an essential component for the discovery of mechanistic relationships among complex phenotypes. Many researchers suggest making the transition from association to causation. Despite its fundamental role in science, engineering, and biomedicine, the traditional methods for causal inference require at least three variables. However, quantitative genetic analysis such as QTL, eQTL, mQTL, and genomic-imaging data analysis requires exploring the causal relationships between two variables. This paper will focus on bivariate causal discovery with continuous variables. We will introduce independence of cause and mechanism (ICM) as a basic principle for causal inference, and algorithmic information theory and the additive noise model (ANM) as major tools for bivariate causal discovery. Large-scale simulations will be performed to evaluate the feasibility of the ANM for bivariate causal discovery. To further evaluate its performance for causal inference, the ANM will be applied to the construction of gene regulatory networks. Also, the ANM will be applied to trait-imaging data analysis to illustrate three scenarios: presence of both causation and association, presence of association in the absence of causation, and presence of causation in the absence of association between two variables. Telling cause from effect between two continuous variables from observational data is one of the fundamental and challenging problems in omics and imaging data analysis. Our preliminary simulations and real data analysis will show that the ANM is a method of choice for bivariate causal discovery in genomic and imaging data analysis.

Source
DOI: http://dx.doi.org/10.3389/fgene.2018.00347
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6127271
August 2018

Application of Causal Inference to Genomic Analysis: Advances in Methodology.

Front Genet 2018 Jul 10;9:238. Epub 2018 Jul 10.

Department of Biostatistics and Data Science, School of Public Health, University of Texas Health Science Center at Houston, Houston, TX, United States.

The current paradigm of genomic studies of complex diseases is association and correlation analysis. Despite significant progress in dissecting the genetic architecture of complex diseases by genome-wide association studies (GWAS), the identified genetic variants by GWAS can only explain a small proportion of the heritability of complex diseases. A large fraction of genetic variants is still hidden. Association analysis has limited power to unravel mechanisms of complex diseases. It is time to shift the paradigm of genomic analysis from association analysis to causal inference. Causal inference is an essential component for the discovery of mechanism of diseases. This paper will review the major platforms of the genomic analysis in the past and discuss the perspectives of causal inference as a general framework of genomic analysis. In genomic data analysis, we usually consider four types of associations: association of discrete variables (DNA variation) with continuous variables (phenotypes and gene expressions), association of continuous variables (expressions, methylations, and imaging signals) with continuous variables (gene expressions, imaging signals, phenotypes, and physiological traits), association of discrete variables (DNA variation) with binary trait (disease status) and association of continuous variables (gene expressions, methylations, phenotypes, and imaging signals) with binary trait (disease status). In this paper, we will review algorithmic information theory as a general framework for causal discovery and the recent development of statistical methods for causal inference on discrete data, and discuss the possibility of extending the association analysis of discrete variable with disease to the causal analysis for discrete variable and disease.

Source
DOI: http://dx.doi.org/10.3389/fgene.2018.00238
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6048229
July 2018

Nuclear Norm Clustering: a promising alternative method for clustering tasks.

Sci Rep 2018 07 18;8(1):10873. Epub 2018 Jul 18.

State Key Laboratory of Genetic Engineering, Collaborative Innovation Center for Genetics and Development, School of Life Sciences, Fudan University, Shanghai, China.

Clustering techniques are widely used in many applications. The goal of clustering is to identify patterns or groups of similar objects within a dataset of interest. However, many clustering methods are not robust and are sensitive to noise and outliers in real data. In this paper, we present Nuclear Norm Clustering (NNC, available at https://sourceforge.net/projects/nnc/), an algorithm that can be used in various fields as a promising alternative to the k-means clustering method. The NNC algorithm requires users to provide a data matrix M and a desired number of clusters K. We employed simulated annealing techniques to choose an optimal label vector that minimizes the nuclear norm of the pooled within-cluster residual matrix. To evaluate the performance of the NNC algorithm, we compared it with other classic methods on 15 public datasets and 2 genome-wide association study (GWAS) datasets on psoriasis. The results indicate that the NNC method has a competitive performance in terms of F-score on the 15 benchmark public datasets and the 2 psoriasis GWAS datasets. NNC is therefore a promising alternative method for clustering tasks.

Source
DOI: http://dx.doi.org/10.1038/s41598-018-29246-4
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6052164
July 2018

A quadratically regularized functional canonical correlation analysis for identifying the global structure of pleiotropy with NGS data.

PLoS Comput Biol 2017 Oct 17;13(10):e1005788. Epub 2017 Oct 17.

Department of Biostatistics and Data Science, School of Public Health, The University of Texas Health Science Center at Houston, Houston, TX, United States of America.

Investigating the pleiotropic effects of genetic variants can increase statistical power, provide important information to achieve deep understanding of the complex genetic structures of disease, and offer powerful tools for designing effective treatments with fewer side effects. However, the current multiple phenotype association analysis paradigm lacks breadth (number of phenotypes and genetic variants jointly analyzed at the same time) and depth (hierarchical structure of phenotype and genotypes). A key issue for high dimensional pleiotropic analysis is to effectively extract informative internal representation and features from high dimensional genotype and phenotype data. To explore correlation information of genetic variants, effectively reduce data dimensions, and overcome critical barriers in advancing the development of novel statistical methods and computational algorithms for genetic pleiotropic analysis, we proposed a new statistical method, referred to as quadratically regularized functional CCA (QRFCCA), for association analysis, which combines three approaches: (1) quadratically regularized matrix factorization, (2) functional data analysis, and (3) canonical correlation analysis (CCA). Large-scale simulations show that the QRFCCA has a much higher power than that of the ten competing statistics while retaining the appropriate type 1 errors. To further evaluate performance, the QRFCCA and ten other statistics are applied to the whole genome sequencing dataset from the TwinsUK study. We identify a total of 79 genes with rare variants and 67 genes with common variants significantly associated with the 46 traits using QRFCCA. The results show that the QRFCCA substantially outperforms the ten other statistics.

Source
DOI: http://dx.doi.org/10.1371/journal.pcbi.1005788
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5659802
October 2017

Bagging Nearest-Neighbor Prediction independence Test: an efficient method for nonlinear dependence of two continuous variables.

Sci Rep 2017 10 6;7(1):12736. Epub 2017 Oct 6.

State Key Laboratory of Genetic Engineering, Collaborative Innovation Center for Genetics and Development, School of Life Sciences, Fudan University, Shanghai, China.

Testing dependence/correlation of two variables is one of the fundamental tasks in statistics. In this work, we proposed an efficient method for testing nonlinear dependence of two continuous variables (X and Y). We addressed this research question by using BNNPT (Bagging Nearest-Neighbor Prediction independence Test, software available at https://sourceforge.net/projects/bnnpt/). In the BNNPT framework, we first used the value of X to construct a bagging neighborhood structure. We then obtained the out-of-bag estimator of Y based on the bagging neighborhood structure. The squared error was calculated to measure how well Y is predicted by X. Finally, a permutation test was applied to determine the significance of the observed squared error. To evaluate the strength of BNNPT compared to seven other methods, we performed extensive simulations to explore the relationship between various methods and compared the false positive rates and statistical power using both simulated and real datasets (Rugao longevity cohort mitochondrial DNA haplogroups and kidney cancer RNA-seq datasets). We concluded that BNNPT is an efficient computational approach to test nonlinear correlation in real-world applications.

Source
DOI: http://dx.doi.org/10.1038/s41598-017-12783-9
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5630623
October 2017
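
The four BNNPT steps in the abstract above (build a bagging neighborhood structure from X, form out-of-bag estimates of Y, compute the squared error, run a permutation test) can be illustrated with a small Python sketch. This is not the BNNPT software. It assumes a one-nearest-neighbor rule inside each bootstrap bag, and the function names and the `n_bags` and `n_perm` defaults are hypothetical; the abstract does not specify any of these details.

```python
import random

def bagging_neighbors(x, n_bags, rng):
    """For each bootstrap bag, pair every out-of-bag index with its nearest in-bag index in X."""
    n = len(x)
    pairs = []
    for _ in range(n_bags):
        in_bag = {rng.randrange(n) for _ in range(n)}   # distinct indices drawn with replacement
        for i in range(n):
            if i not in in_bag:
                j = min(in_bag, key=lambda b: abs(x[b] - x[i]))
                pairs.append((i, j))
    return pairs

def oob_squared_error(y, pairs):
    """Mean squared error between each Y value and the mean of its out-of-bag predictions."""
    if not pairs:
        raise ValueError("no out-of-bag points; use more data or more bags")
    sums, counts = {}, {}
    for i, j in pairs:
        sums[i] = sums.get(i, 0.0) + y[j]
        counts[i] = counts.get(i, 0) + 1
    return sum((y[i] - sums[i] / counts[i]) ** 2 for i in sums) / len(sums)

def bnnpt_sketch(x, y, n_bags=30, n_perm=199, seed=0):
    """Hypothetical sketch: returns (observed squared error, permutation p-value)."""
    rng = random.Random(seed)
    pairs = bagging_neighbors(x, n_bags, rng)   # depends on X only, so it is built once
    observed = oob_squared_error(y, pairs)
    # Under independence, shuffling Y should give errors as small as the observed one.
    hits = 0
    for _ in range(n_perm):
        perm = y[:]
        rng.shuffle(perm)
        if oob_squared_error(perm, pairs) <= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)

# Toy data with a nonlinear dependence: Y = X^2 plus Gaussian noise.
data_rng = random.Random(1)
x = [data_rng.uniform(-2, 2) for _ in range(100)]
y = [v * v + data_rng.gauss(0, 0.2) for v in x]
observed_mse, p_value = bnnpt_sketch(x, y)
print(observed_mse, p_value)
```

Because the neighborhood structure is computed from X alone, it can be reused unchanged for every permutation of Y, which keeps the permutation test cheap in this sketch.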

Functional regression method for whole genome eQTL epistasis analysis with sequencing data.

BMC Genomics 2017 05 18;18(1):385. Epub 2017 May 18.

State Key Laboratory of Genetic Engineering and Ministry of Education Key Laboratory of Contemporary Anthropology, Collaborative Innovation Center for Genetics and Development, School of Life Sciences, Fudan University, Shanghai, 200438, China.

Background: Epistasis plays an essential role in understanding the regulation mechanisms and is an essential component of the genetic architecture of gene expression. However, interaction analysis of gene expression remains fundamentally unexplored due to great computational challenges and limited data availability. Due to variation in splicing, transcription start sites, polyadenylation sites, post-transcriptional RNA editing across the entire gene, and transcription rates of the cells, RNA-seq measurements generate large expression variability and collectively create the observed position-level read count curves. A single number for measuring gene expression, which is widely used for microarray-measured gene expression analysis, is highly unlikely to sufficiently account for large expression variation across the gene. Simultaneously analyzing epistatic architecture using RNA-seq and whole genome sequencing (WGS) data poses enormous challenges.

Methods: We develop a nonlinear functional regression model (FRGM) with functional responses where the position-level read counts within a gene are taken as a function of genomic position, and functional predictors where genotype profiles are viewed as a function of genomic position, for epistasis analysis with RNA-seq data. Instead of testing the interaction of all possible pairwise SNPs, the FRGM takes a gene as a basic unit for epistasis analysis, which tests for the interaction of all possible pairs of genes and uses all the information that can be accessed to collectively test interaction between all possible pairs of SNPs within two genome regions.

Results: By large-scale simulations, we demonstrate that the proposed FRGM for epistasis analysis can achieve the correct type 1 error and has higher power to detect the interactions between genes than the existing methods. The proposed methods are applied to the RNA-seq and WGS data from the 1000 Genomes Project. The numbers of pairs of significantly interacting genes after Bonferroni correction identified using FRGM, RPKM and DESeq were 16,2361, 260 and 51, respectively, from the 350 European samples.

Conclusions: The proposed FRGM for epistasis analysis of RNA-seq can capture isoform and position-level information and will have a broad application. Both simulations and real data analysis highlight the potential for the FRGM to be a good choice for epistasis analysis with sequencing data.

Source
DOI: http://dx.doi.org/10.1186/s12864-017-3777-4
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5436462
May 2017

Meta-analysis of quantitative pleiotropic traits for next-generation sequencing with multivariate functional linear models.

Eur J Hum Genet 2017 02 21;25(3):350-359. Epub 2016 Dec 21.

Biostatistics and Bioinformatics Branch, Division of Intramural Population Health Research, Eunice Kennedy Shriver National Institute of Child Health and Human Development, National Institutes of Health, Bethesda, MD, USA.

To analyze next-generation sequencing data, multivariate functional linear models are developed for a meta-analysis of multiple studies to connect genetic variant data to multiple quantitative traits adjusting for covariates. The goal is to take advantage of both meta-analysis and pleiotropic analysis in order to improve power and to carry out a unified association analysis of multiple studies and multiple traits of complex disorders. Three types of approximate F-distributions based on Pillai-Bartlett trace, Hotelling-Lawley trace, and Wilks's Lambda are introduced to test for association between multiple quantitative traits and multiple genetic variants. Simulation analysis is performed to evaluate false-positive rates and power of the proposed tests. The proposed methods are applied to analyze lipid traits in eight European cohorts. It is shown that it is more advantageous to perform multivariate analysis than univariate analysis in general, and it is more advantageous to perform meta-analysis of multiple studies instead of analyzing the individual studies separately. The proposed models require individual observations. The value of the current paper can be seen for at least two reasons: (a) the proposed methods can be applied to studies that have individual genotype data; (b) the proposed methods can be used as a criterion for future work that uses summary statistics to build test statistics to meta-analyze the data.
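The three multivariate statistics named in this abstract can be sketched for an ordinary multivariate linear regression. This is only an illustration of the standard definitions, not the functional linear models of the paper: the function name, the intercept-only null model, and the omission of the approximate F conversions are assumptions.

```python
import numpy as np

def manova_statistics(X, Y):
    """X: (n, p) predictors without intercept; Y: (n, q) traits.
    Jointly tests all predictors against an intercept-only null model."""
    n = X.shape[0]
    full = np.hstack([np.ones((n, 1)), X])
    # Error SSCP matrix E: residual cross-products under the full model.
    coef, *_ = np.linalg.lstsq(full, Y, rcond=None)
    resid_full = Y - full @ coef
    E = resid_full.T @ resid_full
    # Hypothesis SSCP matrix H: extra variation explained by the predictors.
    resid_null = Y - Y.mean(axis=0)
    H = resid_null.T @ resid_null - E
    # All three statistics are functions of the eigenvalues of E^{-1} H.
    eig = np.linalg.eigvals(np.linalg.solve(E, H)).real
    eig = np.clip(eig, 0.0, None)  # guard against tiny negative round-off
    return {
        "pillai": float(np.sum(eig / (1.0 + eig))),
        "hotelling_lawley": float(np.sum(eig)),
        "wilks": float(np.prod(1.0 / (1.0 + eig))),
    }
```

Wilks's Lambda computed this way equals det(E) / det(E + H), which gives a quick consistency check on the eigenvalue-based formulas.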

Source
DOI: http://dx.doi.org/10.1038/ejhg.2016.170
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5315507
February 2017

A comparison study of multivariate fixed models and Gene Association with Multiple Traits (GAMuT) for next-generation sequencing.

Genet Epidemiol 2017 Jan 5;41(1):18-34. Epub 2016 Dec 5.

Biostatistics and Bioinformatics Branch, Eunice Kennedy Shriver National Institute of Child Health and Human Development, National Institutes of Health (NIH), Bethesda, MD, USA.

In this paper, extensive simulations are performed to compare two statistical methods to analyze multiple correlated quantitative phenotypes: (1) approximate F-distributed tests of multivariate functional linear models (MFLM) and additive models of multivariate analysis of variance (MANOVA), and (2) Gene Association with Multiple Traits (GAMuT) for association testing of high-dimensional genotype data. It is shown that approximate F-distributed tests of MFLM and MANOVA have higher power and are more appropriate for major gene association analysis (i.e., scenarios in which some genetic variants have relatively large effects on the phenotypes); GAMuT has higher power and is more appropriate for analyzing polygenic effects (i.e., effects from a large number of genetic variants each of which contributes a small amount to the phenotypes). MFLM and MANOVA are very flexible and can be used to perform association analysis for (i) rare variants, (ii) common variants, and (iii) a combination of rare and common variants. Although GAMuT was designed to analyze rare variants, it can be applied to analyze a combination of rare and common variants and it performs well when (1) the number of genetic variants is large and (2) each variant contributes a small amount to the phenotypes (i.e., polygenes). MFLM and MANOVA are fixed effect models that perform well for major gene association analysis. GAMuT can be viewed as an extension of sequence kernel association tests (SKAT). Both GAMuT and SKAT are more appropriate for analyzing polygenic effects and they perform well not only in the rare variant case, but also in the case of a combination of rare and common variants. Data analyses of European cohorts and the Trinity Students Study are presented to compare the performance of the two methods.

Source
DOI: http://dx.doi.org/10.1002/gepi.22014
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5154843
January 2017

A new statistical framework for genetic pleiotropic analysis of high dimensional phenotype data.

BMC Genomics 2016 11 7;17(1):881. Epub 2016 Nov 7.

Human Genetics Center, Department of Biostatistics, University of Texas School of Public Health, Houston, TX, 77030, USA.

Background: The widely used genetic pleiotropic analyses of multiple phenotypes are often designed for examining the relationship between common variants and a few phenotypes. They are not suited for both high dimensional phenotypes and high dimensional genotype (next-generation sequencing) data. To overcome limitations of the traditional genetic pleiotropic analysis of multiple phenotypes, we develop sparse structural equation models (SEMs) as a general framework for a new paradigm of genetic analysis of multiple phenotypes. To incorporate both common and rare variants into the analysis, we extend the traditional multivariate SEMs to sparse functional SEMs. To deal with high dimensional phenotype and genotype data, we employ functional data analysis and the alternating direction method of multipliers (ADMM) techniques to reduce data dimension and improve computational efficiency.

Results: Using large scale simulations we showed that the proposed methods have higher power to detect true causal genetic pleiotropic structure than other existing methods. Simulations also demonstrate that the gene-based pleiotropic analysis has higher power than the single variant-based pleiotropic analysis. The proposed method is applied to exome sequence data from the NHLBI's Exome Sequencing Project (ESP) with 11 phenotypes, which identifies a network with 137 genes connected to 11 phenotypes and 341 edges. Among them, 114 genes showed pleiotropic genetic effects and 45 genes were reported to be associated with phenotypes in the analysis or other cardiovascular disease (CVD) related phenotypes in the literature.

Conclusions: Our proposed sparse functional SEMs can incorporate both common and rare variants into the analysis and the ADMM algorithm can efficiently solve the penalized SEMs. Using this model we can jointly infer genetic architecture and causal phenotype network structure, and decompose the genetic effect into direct, indirect and total effect. Using large scale simulations we showed that the proposed methods have higher power to detect true causal genetic pleiotropic structure than other existing methods.
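To show how ADMM handles an L1-penalized problem, the following sketch applies it to a plain lasso regression. This is a simplified stand-in, not the sparse functional SEM solver of the paper: the objective 0.5 * ||A x - b||^2 + lam * ||x||_1, the function names, and the default rho and iteration count are assumptions.

```python
import numpy as np

def soft_threshold(v, k):
    """Proximal operator of the L1 norm: shrink each entry toward zero by k."""
    return np.sign(v) * np.maximum(np.abs(v) - k, 0.0)

def admm_lasso(A, b, lam, rho=1.0, n_iter=500):
    """Minimize 0.5 * ||A x - b||^2 + lam * ||z||_1 subject to x = z."""
    p = A.shape[1]
    lhs = A.T @ A + rho * np.eye(p)   # fixed across iterations
    Atb = A.T @ b
    x = np.zeros(p)
    z = np.zeros(p)
    u = np.zeros(p)                   # scaled dual variable
    for _ in range(n_iter):
        # x-update: ridge-like least squares pulled toward z - u.
        x = np.linalg.solve(lhs, Atb + rho * (z - u))
        # z-update: soft-thresholding enforces sparsity.
        z = soft_threshold(x + u, lam / rho)
        # Dual update: accumulate the constraint violation x - z.
        u = u + x - z
    return z
```

Splitting the variable into x and z separates the smooth least-squares term from the non-smooth penalty, so each update has a closed form. Returning z rather than x gives coefficients that are exactly zero where the penalty is active.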

Source
DOI: http://dx.doi.org/10.1186/s12864-016-3169-1
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5100198
November 2016

Random Bits Forest: a Strong Classifier/Regressor for Big Data.

Sci Rep 2016 07 22;6:30086. Epub 2016 Jul 22.

Ministry of Education Key Laboratory of Contemporary Anthropology, Collaborative Innovation Center for Genetics and Development, School of Life Sciences, Fudan University, Shanghai 200433, China.

Efficiency, memory consumption, and robustness are common problems with many popular methods for data analysis. As a solution, we present Random Bits Forest (RBF), a classification and regression algorithm that integrates neural networks (for depth), boosting (for width), and random forests (for prediction accuracy). Through a gradient boosting scheme, it first generates and selects ~10,000 small, 3-layer random neural networks. These networks are then fed into a modified random forest algorithm to obtain predictions. Testing with datasets from the UCI (University of California, Irvine) Machine Learning Repository shows that RBF outperforms other popular methods in both accuracy and robustness, especially with large datasets (N > 1000). The algorithm also performed well in testing with an independent data set, a real psoriasis genome-wide association study (GWAS).

Source
DOI: http://dx.doi.org/10.1038/srep30086
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4957112
July 2016

A Comparison Study of Fixed and Mixed Effect Models for Gene Level Association Studies of Complex Traits.

Genet Epidemiol 2016 Dec 4;40(8):702-721. Epub 2016 Jul 4.

Human Genetics Center, University of Texas-Houston, Houston, Texas, United States of America.

In association studies of complex traits, fixed-effect regression models are usually used to test for association between traits and major gene loci. In recent years, variance-component tests based on mixed models were developed for region-based genetic variant association tests. In the mixed models, the association is tested by a null hypothesis of zero variance via a sequence kernel association test (SKAT), its optimal unified test (SKAT-O), and a combined sum test of rare and common variant effect (SKAT-C). Although there are some comparison studies to evaluate the performance of mixed and fixed models, there is no systematic analysis to determine when the mixed models perform better and when the fixed models perform better. Here we evaluated, based on extensive simulations, the performance of the fixed and mixed model statistics, using genetic variants located in 3, 6, 9, 12, and 15 kb simulated regions. We compared the performance of three models: (i) mixed models that lead to SKAT, SKAT-O, and SKAT-C, (ii) traditional fixed-effect additive models, and (iii) fixed-effect functional regression models. To evaluate the type I error rates of the tests of fixed models, we generated genotype data by two methods: (i) using all variants, (ii) using only rare variants. We found that the fixed-effect tests accurately control or have low false positive rates. We performed simulation analyses to compare power for two scenarios: (i) all causal variants are rare, (ii) some causal variants are rare and some are common. Either one or both of the fixed-effect models performed better than or similar to the mixed models except when (1) the region sizes are 12 and 15 kb and (2) effect sizes are small. Therefore, the assumption of mixed models could be satisfied and SKAT/SKAT-O/SKAT-C could perform better if the number of causal variants is large and each causal variant contributes a small amount to the traits (i.e., polygenes). 
In major gene association studies, we argue that the fixed-effect models perform better than or similarly to mixed models in most cases because some variants should have relatively large effects on the traits. In practice, it makes sense to perform analysis by both the fixed and mixed effect models and to make a comparison, and this can be readily done using our R codes and the SKAT packages.
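The variance-component idea behind the mixed-model tests can be sketched with a SKAT-type statistic for a quantitative trait. This is a simplified illustration and not the SKAT package: it assumes an intercept-only null model, a weighted linear kernel, and a permutation p-value in place of the mixture of chi-squared distributions used in practice. All names are hypothetical.

```python
import numpy as np

def skat_like_statistic(y, G, weights):
    """Q = r' G W^2 G' r, where r is the trait centered under the null
    model and W is the diagonal matrix of per-variant weights."""
    resid = y - y.mean()
    score = (G * weights).T @ resid   # one weighted score per variant
    return float(score @ score)

def skat_like_pvalue(y, G, weights, n_perm=500, seed=0):
    """Upper-tail permutation p-value for the statistic above."""
    rng = np.random.default_rng(seed)
    q_obs = skat_like_statistic(y, G, weights)
    q_null = np.array([
        skat_like_statistic(rng.permutation(y), G, weights)
        for _ in range(n_perm)
    ])
    return (1 + np.sum(q_null >= q_obs)) / (1 + n_perm)
```

Because the statistic sums squared per-variant scores, effects of opposite sign do not cancel. This property fits the polygenic setting discussed in the abstract, where many variants each contribute a small amount.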

Source
DOI: http://dx.doi.org/10.1002/gepi.21984
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5567849
December 2016

Functional Regression Models for Epistasis Analysis of Multiple Quantitative Traits.

PLoS Genet 2016 Apr 22;12(4):e1005965. Epub 2016 Apr 22.

Human Genetics Center, Division of Biostatistics, The University of Texas School of Public Health, Houston, Texas, United States of America.

To date, most genetic analyses of phenotypes have focused on analyzing single traits or analyzing each phenotype independently. However, joint epistasis analysis of multiple complementary traits will increase statistical power and improve our understanding of the complicated genetic structure of the complex diseases. Despite their importance in uncovering the genetic structure of complex traits, the statistical methods for identifying epistasis in multiple phenotypes remain fundamentally unexplored. To fill this gap, we formulate a test for interaction between two genes in multiple quantitative trait analysis as a multiple functional regression (MFRG) in which the genotype functions (genetic variant profiles) are defined as a function of the genomic position of the genetic variants. We use large-scale simulations to calculate Type I error rates for testing interaction between two genes with multiple phenotypes and to compare the power with multivariate pairwise interaction analysis and single trait interaction analysis by a single variate functional regression model. To further evaluate performance, the MFRG for epistasis analysis is applied to five phenotypes of exome sequence data from the NHLBI's Exome Sequencing Project (ESP) to detect pleiotropic epistasis. A total of 267 pairs of genes that formed a genetic interaction network showed significant evidence of epistasis influencing five traits. The results demonstrate that the joint interaction analysis of multiple phenotypes has a much higher power to detect interaction than the interaction analysis of a single trait and may open a new direction to fully uncovering the genetic structure of multiple phenotypes.

Source
DOI: http://dx.doi.org/10.1371/journal.pgen.1005965
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4841563
April 2016

Copy Number Variation of HLA-DQA1 and APOBEC3A/3B Contribute to the Susceptibility of Systemic Sclerosis in the Chinese Han Population.

J Rheumatol 2016 05 1;43(5):880-6. Epub 2016 Apr 1.

From the State Key Laboratory of Genetic Engineering and Ministry of Education Key Laboratory of Contemporary Anthropology, Collaborative Innovation Center for Genetics and Development, School of Life Sciences and Institutes of Biomedical Sciences, Fudan University; Institute of Rheumatology, Immunology and Allergy, Fudan University; Shanghai Traditional Chinese Medicine (TCM)-Integrated Hospital; Division of Dermatology, and Division of Rheumatology, Huashan Hospital, Fudan University, Shanghai; Yiling Hospital, Shijiazhuang; Division of Rheumatology, Teaching Hospital of Chengdu University of TCM, Chengdu; Department of Dermatology, Second Xiangya Hospital, Central South University, Changsha, China; School of Public Health, and Medical School at Houston, University of Texas, Houston, Texas, USA.S. Guo, PhD, State Key Laboratory of Genetic Engineering and Ministry of Education Key Laboratory of Contemporary Anthropology, Collaborative Innovation Center for Genetics and Development, School of Life Sciences and Institutes of Biomedical Sciences, Fudan University, and Institute of Rheumatology, Immunology and Allergy, Fudan University; Y. Li, MS, State Key Laboratory of Genetic Engineering and Ministry of Education Key Laboratory of Contemporary Anthropology, Collaborative Innovation Center for Genetics and Development, School of Life Sciences and Institutes of Biomedical Sciences, Fudan University; Y. Wang, PhD, State Key Laboratory of Genetic Engineering and Ministry of Education Key Laboratory of Contemporary Anthropology, Collaborative Innovation Center for Genetics and Development, School of Life Sciences and Institutes of Biomedical Sciences, Fudan University; H. Chu, PhD, State Key Laboratory of Genetic Engineering and Ministry of Education Key Laboratory of Contemporary Anthropology, Collaborative Innovation Center for Genetics and Development, School of Life Sciences and Institutes of Biomedical Sciences, Fudan University; Y. Chen, PhD, State Key Laboratory o

Objective: Systemic sclerosis (SSc) is a systemic connective tissue disease caused by a genetic aberration. The involvement of the copy number variations (CNV) in the pathogenesis of SSc is unclear. We tried to identify some CNV that are involved with the susceptibility to SSc.

Methods: A genome-wide CNV screening was performed in 20 patients with SSc. Five SSc-associated common CNV that included HLA-DRB5, HLA-DQA1, IRGM, CDC42EP3, and APOBEC3A/3B were identified from the screening and were then validated in 365 patients with SSc and 369 matched healthy controls.

Results: Three hundred forty-four CNV (140 gains and 204 losses) and 2 CNV hotspots (6q21.3 and 22q11.2) were found in the SSc genomes (covering 24.2 megabases), suggesting that CNV were ubiquitous in the SSc genome and played important roles in the pathogenesis of SSc. The high copy number of HLA-DQA1 was a significantly protective factor for SSc (OR 0.07, p = 2.99 × 10^-17), while the high copy number of APOBEC3A/B was a significant risk factor (OR 3.45, p = 6.4 × 10^-18), adjusted for sex and age. The risk prediction model based on genetic factors in logistic regression showed moderate prediction ability, with area under the curve = 0.80 (95% CI 0.77-0.83), which demonstrated that APOBEC3A/B and HLA-DQA1 were powerful biomarkers for SSc risk evaluation and contributed to the susceptibility to SSc.

Conclusion: CNV of HLA-DQA1 and APOBEC3A/B contribute to the susceptibility to SSc in a Chinese Han population.

Source
DOI: http://dx.doi.org/10.3899/jrheum.150945
May 2016

An estimating equation approach to dimension reduction for longitudinal data.

Biometrika 2016 Mar 16;103(1):189-203. Epub 2016 Feb 16.

State Key Laboratory of Genetic Engineering, School of Life Sciences, Fudan University, 2005 Songhu Road, Shanghai 200438, China.

Sufficient dimension reduction has been extensively explored in the context of independent and identically distributed data. In this article we generalize sufficient dimension reduction to longitudinal data and propose an estimating equation approach to estimating the central mean subspace. The proposed method accounts for the covariance structure within each subject and improves estimation efficiency when the covariance structure is correctly specified. Even if the covariance structure is misspecified, our estimator remains consistent. In addition, our method relaxes distributional assumptions on the covariates and is doubly robust. To determine the structural dimension of the central mean subspace, we propose a Bayesian-type information criterion. We show that the estimated structural dimension is consistent and that the estimated basis directions are root-n consistent, asymptotically normal and locally efficient. Simulations and an analysis of the Framingham Heart Study data confirm the effectiveness of our approach.

Source
DOI: http://dx.doi.org/10.1093/biomet/asv066
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4803001
March 2016

Gene-Based Association Analysis for Censored Traits Via Fixed Effect Functional Regressions.

Genet Epidemiol 2016 Feb 18;40(2):133-43. Epub 2016 Jan 18.

Division of Pulmonary Medicine, Allergy and Immunology, Children's Hospital of Pittsburgh at The University of Pittsburgh, Pittsburgh, Pennsylvania, United States of America.

Genetic studies of survival outcomes have been proposed and conducted recently, but statistical methods for identifying genetic variants that affect disease progression are rarely developed. Motivated by our ongoing real studies, here we develop Cox proportional hazards models using functional regression (FR) to perform gene-based association analysis of survival traits while adjusting for covariates. The proposed Cox models are fixed effect models where the genetic effects of multiple genetic variants are assumed to be fixed. We introduce likelihood ratio test (LRT) statistics to test for associations between the survival traits and multiple genetic variants in a genetic region. Extensive simulation studies demonstrate that the proposed Cox FR LRT statistics have well-controlled type I error rates. To evaluate power, we compare the Cox FR LRT with the previously developed burden test (BT) in a Cox model and sequence kernel association test (SKAT), which is based on mixed effect Cox models. The Cox FR LRT statistics have higher power than or similar power to Cox SKAT LRT except when 50%/50% causal variants had negative/positive effects and all causal variants are rare. In addition, the Cox FR LRT statistics have higher power than Cox BT LRT. The models and related test statistics can be useful in the whole genome and whole exome association studies. An age-related macular degeneration dataset was analyzed as an example.

Source
DOI: http://dx.doi.org/10.1002/gepi.21947
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4724326
February 2016

Multiple functional linear model for association analysis of RNA-seq with imaging.

Quant Biol 2015 Jun 15;3(2):90-102. Epub 2015 Aug 15.

Human Genetics Center, Division of Biostatistics, The University of Texas School of Public Health, Houston, TX 77030, USA.

Emerging integrative analysis of genomic and anatomical imaging data, which has not been well developed, provides invaluable information for the holistic discovery of the genomic structure of disease and has the potential to open a new avenue for discovering novel disease susceptibility genes which cannot be identified if they are analyzed separately. A key issue to the success of imaging and genomic data analysis is how to reduce their dimensions. Most previous methods for imaging information extraction and RNA-seq data reduction do not explore imaging spatial information and often ignore gene expression variation at the genomic positional level. To overcome these limitations, we extend functional principal component analysis from one dimension to two dimensions (2DFPCA) for representing imaging data and develop a multiple functional linear model (MFLM) in which functional principal scores of images are taken as multiple quantitative traits and RNA-seq profile across a gene is taken as a function predictor for assessing the association of gene expression with images. The developed method has been applied to image and RNA-seq data of ovarian cancer and kidney renal clear cell carcinoma (KIRC) studies. We identified 24 and 84 genes whose expressions were associated with imaging variations in ovarian cancer and KIRC studies, respectively. Our results showed that many significantly associated genes with images were not differentially expressed, but revealed their morphological and metabolic functions. The results also demonstrated that the peaks of the estimated regression coefficient function in the MFLM often allowed the discovery of splicing sites and multiple isoforms of gene expressions.

Source
DOI: http://dx.doi.org/10.1007/s40484-015-0048-8
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4705549
June 2015

Meta-analysis of Complex Diseases at Gene Level with Generalized Functional Linear Models.

Genetics 2016 Feb 29;202(2):457-70. Epub 2015 Dec 29.

Department of Biostatistics, School of Public Health, University of Michigan, Ann Arbor, Michigan 48109.

We developed generalized functional linear models (GFLMs) to perform a meta-analysis of multiple case-control studies to evaluate the relationship of genetic data to dichotomous traits adjusting for covariates. Unlike the previously developed meta-analysis for sequence kernel association tests (MetaSKATs), which are based on mixed-effect models to make the contributions of major gene loci random, GFLMs are fixed models; i.e., genetic effects of multiple genetic variants are fixed. Based on GFLMs, we developed chi-squared-distributed Rao's efficient score test and likelihood-ratio test (LRT) statistics to test for an association between a complex dichotomous trait and multiple genetic variants. We then performed extensive simulations to evaluate the empirical type I error rates and power performance of the proposed tests. The Rao's efficient score test statistics of GFLMs are very conservative and have higher power than MetaSKATs when some causal variants are rare and some are common. When the causal variants are all rare [i.e., minor allele frequencies (MAF) < 0.03], the Rao's efficient score test statistics have similar or slightly lower power than MetaSKATs. The LRT statistics generate accurate type I error rates for homogeneous genetic-effect models and may inflate type I error rates for heterogeneous genetic-effect models owing to the large numbers of degrees of freedom and have similar or slightly higher power than the Rao's efficient score test statistics. GFLMs were applied to analyze genetic data of 22 gene regions of type 2 diabetes data from a meta-analysis of eight European studies and detected significant association for 18 genes (P < 3.10 × 10^-6), tentative association for 2 genes (HHEX and HMGA2; P ≈ 10^-5), and no association for 2 genes, while MetaSKATs detected none. In addition, the traditional additive-effect model detects association at gene HHEX. GFLMs and related tests can analyze rare or common variants or a combination of the two and can be useful in whole-genome and whole-exome association studies.

Source
DOI: http://dx.doi.org/10.1534/genetics.115.180869
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4788228
February 2016

Statistical Analysis of High-Dimensional Genetic Data in Complex Traits.

Biomed Res Int 2015 4;2015:564273. Epub 2015 Aug 4.

Division of Biostatistics, Human Genetics Center, School of Public Health, The University of Texas Health Science Center at Houston, 1200 Herman Pressler, Houston, TX 77030, USA.


Source
DOI: http://dx.doi.org/10.1155/2015/564273
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4539419
June 2016

Dynamic Model for RNA-seq Data Analysis.

Biomed Res Int 2015 4;2015:916352. Epub 2015 Aug 4.

Human Genetics Center, Division of Biostatistics, School of Public Health, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA.

By measuring messenger RNA levels for all genes in a sample, RNA-seq provides an attractive option to characterize the global changes in transcription. RNA-seq is becoming the widely used platform for gene expression profiling. However, real transcription signals in the RNA-seq data are confounded with measurement and sequencing errors and other random biological/technical variation. To extract the biologically useful transcription process from the RNA-seq data, we propose to use a second-order ordinary differential equation (ODE) for modeling the RNA-seq data. We use differential principal analysis to develop statistical methods for estimation of location-varying coefficients of the ODE. We validate the accuracy of the ODE model to fit the RNA-seq data by prediction analysis and 5-fold cross validation. To further evaluate the performance of the ODE model for RNA-seq data analysis, we used the location-varying coefficients of the second-order ODE as features to classify the normal and tumor cells. We demonstrate that even using the ODE model for a single gene we can achieve high classification accuracy. We also conduct response analysis to investigate how the transcription process responds to the perturbation of the external signals and identify dozens of genes that are related to cancer.
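The idea of describing a curve by the coefficients of a second-order ODE can be illustrated in a much simpler setting than the paper's. The sketch below assumes a homogeneous equation x''(t) + a x'(t) + b x(t) = 0 with constant coefficients, and estimates a and b by least squares on finite-difference derivatives. The abstract's location-varying coefficients and differential principal analysis are not reproduced here, and the function name is hypothetical.

```python
import numpy as np

def fit_second_order_ode(t, x):
    """Estimate constant a, b in x'' + a x' + b x = 0 from a sampled curve."""
    dx = np.gradient(x, t)      # first derivative (central differences)
    d2x = np.gradient(dx, t)    # second derivative
    inner = slice(2, -2)        # drop edge points, where differences are one-sided
    # Rearranged model: x'' = a * (-x') + b * (-x), linear in (a, b).
    design = np.column_stack([-dx[inner], -x[inner]])
    (a, b), *_ = np.linalg.lstsq(design, d2x[inner], rcond=None)
    return float(a), float(b)

# Usage: a damped oscillation that satisfies x'' + x' + 4.25 x = 0.
t = np.linspace(0.0, 5.0, 501)
x = np.exp(-0.5 * t) * np.cos(2.0 * t)
a_hat, b_hat = fit_second_order_ode(t, x)
```

The fitted pair (a_hat, b_hat) summarizes the shape of the whole curve in two numbers, which is the sense in which ODE coefficients can serve as classification features.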

Source
DOI: http://dx.doi.org/10.1155/2015/916352
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4539434
June 2016