Publications by authors named "Manuel A Rivas"

82 Publications

Corrigendum to: Fast Lasso method for large-scale and ultrahigh-dimensional Cox model with applications to UK Biobank.

Biostatistics 2021 Jul 7. Epub 2021 Jul 7.

Department of Statistics and Department of Biomedical Data Science, Stanford University, Stanford, CA 94305, USA.

Source
http://dx.doi.org/10.1093/biostatistics/kxab019 (DOI)
July 2021

Nonsense-mediated decay is highly stable across individuals and tissues.

Am J Hum Genet 2021 08 2;108(8):1401-1408. Epub 2021 Jul 2.

Department of Pathology, School of Medicine, Stanford University, Stanford, CA 94305, USA; Department of Genetics, School of Medicine, Stanford University, Stanford, CA 94305, USA.

Precise interpretation of the effects of rare protein-truncating variants (PTVs) is important for accurate determination of variant impact. Current methods for assessing the ability of PTVs to induce nonsense-mediated decay (NMD) focus primarily on the position of the variant in the transcript. We used RNA sequencing of the Genotype Tissue Expression v.8 cohort to compute the efficiency of NMD using allelic imbalance for 2,320 rare (genome aggregation database minor allele frequency ≤ 1%) PTVs across 809 individuals in 49 tissues. We created an interpretable predictive model using penalized logistic regression in order to evaluate the comprehensive influence of variant annotation, tissue, and inter-individual variation on NMD. We found that variant position, allele frequency, the inclusion of ultra-rare and singleton variants, and conservation were predictive of allelic imbalance. Furthermore, we found that NMD effects were highly concordant across tissues and individuals. Due to this high consistency, we demonstrate in silico that utilizing peripheral tissues or cell lines provides accurate prediction of NMD for PTVs.

Source
http://dx.doi.org/10.1016/j.ajhg.2021.06.008 (DOI)
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC8387471 (PMC)
August 2021

Fast Numerical Optimization for Genome Sequencing Data in Population Biobanks.

Bioinformatics 2021 Jun 19. Epub 2021 Jun 19.

Department of Statistics, Stanford University, Stanford, 94305, United States.

Motivation: Large-scale and high-dimensional genome sequencing data poses computational challenges. General purpose optimization tools are usually not optimal in terms of computational and memory performance for genetic data.

Results: We develop two efficient solvers for optimization problems arising from large-scale regularized regressions on millions of genetic variants sequenced from hundreds of thousands of individuals. These genetic variants are encoded by the values in the set {0, 1, 2, NA}. We take advantage of this fact and use two bits to represent each entry in a genetic matrix, which reduces the memory requirement by a factor of 32 compared to a double-precision floating-point representation. Using this representation, we implemented an iteratively reweighted least squares algorithm to solve Lasso regressions on genetic matrices, which we name snpnet-2.0. When the dataset contains many rare variants, the predictors can be encoded in a sparse matrix. We utilize the sparsity in the predictor matrix to further reduce the memory requirement and computation time. Our sparse genetic matrix implementation uses both the compact 2-bit representation and a simplified version of compressed sparse block format so that matrix-vector multiplications can be effectively parallelized on multiple CPU cores. To demonstrate the effectiveness of this representation, we implement an accelerated proximal gradient method to solve group Lasso on these sparse genetic matrices. This solver is named sparse-snpnet and will also be included as part of the snpnet R package. Our implementation is able to solve Lasso and group Lasso, linear, logistic and Cox regression problems on sparse genetic matrices that contain 1,000,000 variants and almost 100,000 individuals within 10 minutes and using less than 32 GB of memory.

Availability: https://github.com/rivas-lab/snpnet/tree/compact.
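
The 2-bit encoding described in the abstract above can be illustrated with a minimal Python sketch. This is not the snpnet implementation: the function names, the choice of code 3 for NA, and the bit order within each byte are all assumptions made for illustration. Four genotype codes fit in one byte, so each entry costs 2 bits instead of the 64 bits of a double, which is where the factor of 32 comes from.

```python
import numpy as np

NA_CODE = 3  # assumption: the spare 2-bit value 3 marks a missing genotype (NA)


def pack_genotypes(codes):
    """Pack genotype codes in {0, 1, 2, 3} into bytes, four codes per byte."""
    g = np.asarray(codes, dtype=np.uint8)
    if g.size and g.max() > 3:
        raise ValueError("codes must be in {0, 1, 2, 3}")
    pad = (-g.size) % 4  # pad with zeros so the length is a multiple of four
    g = np.concatenate([g, np.zeros(pad, dtype=np.uint8)]).reshape(-1, 4)
    # Entry k of each group of four occupies bits 2k and 2k+1 of its byte.
    packed = g[:, 0] | (g[:, 1] << 2) | (g[:, 2] << 4) | (g[:, 3] << 6)
    return packed.astype(np.uint8)


def unpack_genotypes(packed, n):
    """Recover the first n genotype codes from a packed byte array."""
    packed = np.asarray(packed, dtype=np.uint8)
    shifts = np.array([0, 2, 4, 6], dtype=np.uint8)
    codes = (packed[:, None] >> shifts) & 3
    return codes.reshape(-1)[:n]


codes = [0, 1, 2, NA_CODE, 2, 1]
packed = pack_genotypes(codes)
restored = unpack_genotypes(packed, len(codes))
```

Here six genotypes occupy two bytes, and `restored` holds the original codes. A real solver would operate on the packed bytes directly rather than unpacking them first.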

Source
http://dx.doi.org/10.1093/bioinformatics/btab452 (DOI)
June 2021

Time trajectories in the transcriptomic response to exercise - a meta-analysis.

Nat Commun 2021 06 9;12(1):3471. Epub 2021 Jun 9.

Center for Inherited Cardiovascular Disease, School of Medicine, Stanford University, Stanford, CA, USA.

Exercise training prevents multiple diseases, yet the molecular mechanisms that drive exercise adaptation are incompletely understood. To address this, we create a computational framework comprising data from skeletal muscle or blood from 43 studies, including 739 individuals before and after exercise or training. Using linear mixed effects meta-regression, we detect specific time patterns and regulatory modulators of the exercise response. Acute and long-term responses are transcriptionally distinct and we identify SMAD3 as a central regulator of the exercise response. Exercise induces a more pronounced inflammatory response in skeletal muscle of older individuals and our models reveal multiple sex-associated responses. We validate seven of our top genes in a separate human cohort. In this work, we provide a powerful resource (www.extrameta.org) that expands the transcriptional landscape of exercise adaptation by extending previously known responses and their regulatory networks, and identifying novel modality-, time-, age-, and sex-associated changes.

Source
http://dx.doi.org/10.1038/s41467-021-23579-x (DOI)
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC8190306 (PMC)
June 2021

Combining Clinical and Polygenic Risk Improves Stroke Prediction Among Individuals With Atrial Fibrillation.

Circ Genom Precis Med 2021 Jun 15;14(3):e003168. Epub 2021 Jun 15.

Division of Cardiology, Department of Medicine (J.W.O., M.T., M.P., H.W., C.T., S.L.C., E.A.A.), Stanford University School of Medicine, Stanford, CA.

Background: Atrial fibrillation (AF) is associated with a five-fold increased risk of ischemic stroke. A portion of this risk is heritable; however, current risk stratification tools (CHA2DS2-VASc) do not include family history or genetic risk. We hypothesized that we could improve ischemic stroke prediction in patients with AF by incorporating polygenic risk scores (PRS).

Methods: Using data from the largest available genome-wide association study in Europeans, we combined over half a million genetic variants to construct a PRS to predict ischemic stroke in patients with AF. We externally validated this PRS in independent data from the UK Biobank, both independently and integrated with clinical risk factors. The integrated PRS and clinical risk factors risk tool had the greatest predictive ability.

Results: Compared with the currently recommended risk tool (CHA2DS2-VASc), the integrated tool significantly improved Net Reclassification Index (2.3% [95% CI, 1.3%-3.0%]) and fit (χ² P=0.002). Using this improved tool, >115 000 people with AF would have improved risk classification in the United States. Independently, PRS was a significant predictor of ischemic stroke in patients with AF prospectively (hazard ratio, 1.13 per 1 SD [95% CI, 1.06-1.23]). Lastly, polygenic risk scores were uncorrelated with clinical risk factors (Pearson correlation coefficient, -0.018).

Conclusions: In patients with AF, there appears to be a significant association between PRS and risk of ischemic stroke. The greatest predictive ability was found with the integration of PRS and clinical risk factors; however, the prediction of stroke remains challenging.

Source
http://dx.doi.org/10.1161/CIRCGEN.120.003168 (DOI)
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC8212575 (PMC)
June 2021

Exome sequencing in patient-parent trios suggests new candidate genes for early-onset primary sclerosing cholangitis.

Liver Int 2021 05 11;41(5):1044-1057. Epub 2021 Mar 11.

Department of Genetics, University of Groningen, University Medical Center Groningen, Groningen, The Netherlands.

Background & Aims: Primary sclerosing cholangitis (PSC) is a rare bile duct disease strongly associated with inflammatory bowel disease (IBD). Whole-exome sequencing (WES) has contributed to understanding the molecular basis of very early-onset IBD, but rare protein-altering genetic variants have not been identified for early-onset PSC. We performed WES in patients diagnosed with PSC ≤ 12 years to investigate the contribution of rare genetic variants to early-onset PSC.

Methods: In this multicentre study, WES was performed on 87 DNA samples from 29 patient-parent trios with early-onset PSC. We selected rare (minor allele frequency < 2%) coding and splice-site variants that matched recessive (homozygous and compound heterozygous variants) and dominant (de novo) inheritance in the index patients. Variant pathogenicity was predicted by an in-house developed algorithm (GAVIN), and PSC-relevant variants were selected using gene expression data and gene function.

Results: In 22 of 29 trios we identified at least 1 possibly pathogenic variant. We prioritized 36 genes, harbouring a total of 54 variants with predicted pathogenic effects. In 18 genes, we identified 36 compound heterozygous variants, whereas in the other 18 genes we identified 18 de novo variants. Twelve of 36 candidate risk genes are known to play a role in transmembrane transport, adaptive and innate immunity, and epithelial barrier function.

Conclusions: The 36 candidate genes for early-onset PSC need further verification in other patient cohorts and evaluation of gene function before a causal role can be attributed to their variants.

Source
http://dx.doi.org/10.1111/liv.14831 (DOI)
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC8252477 (PMC)
May 2021

Survival Analysis on Rare Events Using Group-Regularized Multi-Response Cox Regression.

Bioinformatics 2021 Feb 9. Epub 2021 Feb 9.

Department of Biomedical Data Science, Stanford University, Stanford, United States.

Motivation: The prediction performance of the Cox proportional hazards model suffers when there are only a few uncensored events in the training data.

Results: We propose a Sparse-Group regularized Cox regression method to improve the prediction performance of large-scale and high-dimensional survival data with few observed events. Our approach is applicable when there are one or more other survival responses that (1) have a large number of observed events and (2) share a common set of associated predictors with the rare event response. This scenario is common in the UK Biobank (Sudlow et al., 2015) dataset, where records for a large number of common and less prevalent diseases of the same set of individuals are available. By analyzing these responses together, we hope to achieve higher prediction performance than when they are analyzed individually. To make this approach practical for large-scale data, we developed an accelerated proximal gradient optimization algorithm as well as a screening procedure inspired by Qian et al. (2020).

Availability: https://github.com/rivas-lab/multisnpnet-Cox.

Supplementary Information: Supplementary data are available at Bioinformatics online.
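
The key building block of a proximal gradient method for a sparse-group penalty is the proximal operator, which can be sketched as follows. This is a minimal sketch and not the multisnpnet-Cox code: it assumes a penalty of the form lam1 * ||v||_1 + lam2 * ||v||_2 on a single group v (for example, one predictor's coefficients across responses), and the function and parameter names are hypothetical. For this penalty the operator is an elementwise soft-threshold followed by a shrinkage of the whole group.

```python
import numpy as np


def soft_threshold(v, t):
    """Elementwise soft-thresholding: the proximal operator of t * ||v||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)


def prox_sparse_group(v, step, lam1, lam2):
    """Proximal operator of step * (lam1 * ||v||_1 + lam2 * ||v||_2) for one group."""
    u = soft_threshold(np.asarray(v, dtype=float), step * lam1)
    norm = np.linalg.norm(u)
    if norm == 0.0:
        return u  # the whole group is already zero
    # Shrink the group towards zero; a group with a small norm is removed entirely.
    return max(0.0, 1.0 - step * lam2 / norm) * u


v = np.array([3.0, -4.0, 0.5])
shrunk = prox_sparse_group(v, step=1.0, lam1=1.0, lam2=1.0)
removed = prox_sparse_group(v, step=1.0, lam1=1.0, lam2=10.0)
```

In a full solver, each iteration would take a gradient step on the smooth loss (here, the negative log partial likelihood) and then apply this operator to every group. The acceleration and screening steps mentioned in the abstract are not shown.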

Source
http://dx.doi.org/10.1093/bioinformatics/btab095 (DOI)
February 2021

Polygenic risk modeling with latent trait-related genetic components.

Eur J Hum Genet 2021 Jul 8;29(7):1071-1081. Epub 2021 Feb 8.

Department of Biomedical Data Science, School of Medicine, Stanford University, Stanford, CA, USA.

Polygenic risk models have led to significant advances in understanding complex diseases and their clinical presentation. While polygenic risk scores (PRS) can effectively predict outcomes, they do not generally account for disease subtypes or pathways which underlie within-trait diversity. Here, we introduce a latent factor model of genetic risk based on components from Decomposition of Genetic Associations (DeGAs), which we call the DeGAs polygenic risk score (dPRS). We compute DeGAs using genetic associations for 977 traits and find that dPRS performs comparably to standard PRS while offering greater interpretability. We show how to decompose an individual's genetic risk for a trait across DeGAs components, with examples for body mass index (BMI) and myocardial infarction (heart attack) in 337,151 white British individuals in the UK Biobank, with replication in a further set of 25,486 non-British white individuals. We find that BMI polygenic risk factorizes into components related to fat-free mass, fat mass, and overall health indicators like physical activity. Most individuals with high dPRS for BMI have strong contributions from both a fat-mass component and a fat-free mass component, whereas a few "outlier" individuals have strong contributions from only one of the two components. Overall, our method enables fine-scale interpretation of the drivers of genetic risk for complex traits.
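
For orientation, the standard PRS that dPRS is compared against is, in its simplest form, a weighted sum of allele counts. The sketch below illustrates only that baseline and not the DeGAs decomposition. The function name, the mean imputation of missing genotypes, and the toy numbers are assumptions made for illustration.

```python
import numpy as np


def polygenic_score(dosages, weights):
    """Weighted sum of allele counts per individual.

    dosages: (individuals x variants) array of allele counts, NaN where missing.
    weights: per-variant effect sizes (hypothetical values in the example below).
    """
    G = np.asarray(dosages, dtype=float)
    w = np.asarray(weights, dtype=float)
    # Assumption: a missing genotype is replaced by the variant's mean allele count.
    col_mean = np.nanmean(G, axis=0)
    G = np.where(np.isnan(G), col_mean, G)
    return G @ w


dosages = [[0, 1, 2],
           [2, np.nan, 0]]
weights = [0.1, -0.2, 0.3]
scores = polygenic_score(dosages, weights)
```

Because the score is linear in the weights, any split of the weight vector into additive parts splits an individual's score the same way, which is the general property that component-wise interpretations of genetic risk rely on.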

Source
http://dx.doi.org/10.1038/s41431-021-00813-0 (DOI)
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC8298449 (PMC)
July 2021

Genetics of 35 blood and urine biomarkers in the UK Biobank.

Nat Genet 2021 02 18;53(2):185-194. Epub 2021 Jan 18.

Department of Biomedical Data Science, School of Medicine, Stanford University, Stanford, CA, USA.

Clinical laboratory tests are a critical component of the continuum of care. We evaluate the genetic basis of 35 blood and urine laboratory measurements in the UK Biobank (n = 363,228 individuals). We identify 1,857 loci associated with at least one trait, containing 3,374 fine-mapped associations and additional sets of large-effect (>0.1 s.d.) protein-altering, human leukocyte antigen (HLA) and copy number variant (CNV) associations. Through Mendelian randomization (MR) analysis, we discover 51 causal relationships, including previously known agonistic effects of urate on gout and cystatin C on stroke. Finally, we develop polygenic risk scores (PRSs) for each biomarker and build 'multi-PRS' models for diseases using 35 PRSs simultaneously, which improved chronic kidney disease, type 2 diabetes, gout and alcoholic cirrhosis genetic risk stratification in an independent dataset (FinnGen; n = 135,500) relative to single-disease PRSs. Together, our results delineate the genetic basis of biomarkers and their causal influences on diseases and improve genetic risk stratification for common diseases.

Source
http://dx.doi.org/10.1038/s41588-020-00757-z (DOI)
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7867639 (PMC)
February 2021

Graphical analysis for phenome-wide causal discovery in genotyped population-scale biobanks.

Nat Commun 2021 01 13;12(1):350. Epub 2021 Jan 13.

Department of Biomedical Data Science, Stanford University, Stanford, CA, USA.

Causal inference via Mendelian randomization requires making strong assumptions about horizontal pleiotropy, where genetic instruments are connected to the outcome not only through the exposure. Here, we present causal Graphical Analysis Using Genetics (cGAUGE), a pipeline that overcomes these limitations using instrument filters with provable properties. This is achievable by identifying conditional independencies while examining multiple traits. cGAUGE also uses ExSep (Exposure-based Separation), a novel test for the existence of causal pathways that does not require selecting instruments. In simulated data we illustrate how cGAUGE can reduce the empirical false discovery rate by up to 30%, while retaining the majority of true discoveries. On 96 complex traits from 337,198 subjects from the UK Biobank, our results cover expected causal links and many new ones that were previously suggested by correlation-based observational studies. Notably, we identify multiple risk factors for cardiovascular disease, including red blood cell distribution width.

Source
http://dx.doi.org/10.1038/s41467-020-20516-2 (DOI)
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7806647 (PMC)
January 2021

A phenome-wide association study of 26 mendelian genes reveals phenotypic expressivity of common and rare variants within the general population.

PLoS Genet 2020 11 23;16(11):e1008802. Epub 2020 Nov 23.

Stanford Cardiovascular Institute, Stanford University, Stanford, California, United States of America.

The clinical evaluation of a genetic syndrome relies upon recognition of a characteristic pattern of signs or symptoms to guide targeted genetic testing for confirmation of the diagnosis. However, individuals displaying a single phenotype of a complex syndrome may not meet criteria for clinical diagnosis or genetic testing. Here, we present a phenome-wide association study (PheWAS) approach to systematically explore the phenotypic expressivity of common and rare alleles in genes associated with four well-described syndromic diseases (Alagille (AS), Marfan (MS), DiGeorge (DS), and Noonan (NS) syndromes) in the general population. Using human phenotype ontology (HPO) terms, we systematically mapped 60 phenotypes related to AS, MS, DS and NS in 337,198 unrelated white British individuals from the UK Biobank (UKBB) based on their hospital admission records, self-administered questionnaires, and physiological measurements. We performed logistic regression adjusting for age, sex, and the first 5 genetic principal components, for each phenotype and each variant in the target genes (JAG1, NOTCH2, FBN1, PTPN11 and RASopathy genes, and genes in the 22q11.2 locus) and performed a gene burden test. Overall, we observed multiple phenotype-genotype correlations, such as the association between variation in JAG1, FBN1, PTPN11 and SOS2 with diastolic and systolic blood pressure; and pleiotropy among multiple variants in syndromic genes. For example, rs11066309 in PTPN11 was significantly associated with a lower body mass index, an increased risk of hypothyroidism and a smaller size for gestational age, all in concordance with NS-related phenotypes. Similarly, rs589668 in FBN1 was associated with an increase in body height and blood pressure, and a reduced body fat percentage as observed in Marfan syndrome. Our findings suggest that the spectrum of associations of common and rare variants in genes involved in syndromic diseases can be extended to individual phenotypes within the general population.

Source
http://dx.doi.org/10.1371/journal.pgen.1008802 (DOI)
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7735621 (PMC)
November 2020

Cardiac Imaging of Aortic Valve Area From 34 287 UK Biobank Participants Reveals Novel Genetic Associations and Shared Genetic Comorbidity With Multiple Disease Phenotypes.

Circ Genom Precis Med 2020 12 30;13(6):e003014. Epub 2020 Oct 30.

Department of Pediatrics, Division of Pediatric Cardiology, Stanford University School of Medicine, Stanford, CA (A.C.-P., C.T., K.X., H.T., J.R.P.).

Background: The aortic valve is an important determinant of cardiovascular physiology and anatomic location of common human diseases.

Methods: From a sample of 34 287 white British ancestry participants, we estimated functional aortic valve area by planimetry from prospectively obtained cardiac magnetic resonance imaging sequences of the aortic valve. Aortic valve area measurements were submitted to genome-wide association testing, followed by polygenic risk scoring and phenome-wide screening, to identify genetic comorbidities.

Results: A genome-wide association study of aortic valve area in these UK Biobank participants showed 3 significant associations, indexed by rs71190365 (chr13:50764607, , =1.8×10), rs35991305 (chr12:94191968, , =3.4×10), and chr17:45013271:C:T (, =5.6×10). Replication on an independent set of 8145 unrelated European ancestry participants showed consistent effect sizes in all 3 loci, although rs35991305 did not meet nominal significance. We constructed a polygenic risk score for aortic valve area, which in a separate cohort of 311 728 individuals without imaging demonstrated that smaller aortic valve area is predictive of increased risk for aortic valve disease (odds ratio, 1.14; =2.3×10). After excluding subjects with a medical diagnosis of aortic valve stenosis (remaining n=308 683 individuals), phenome-wide association of >10 000 traits showed multiple links between the polygenic score for aortic valve disease and key health-related comorbidities involving the cardiovascular system and autoimmune disease. Genetic correlation analysis supports a shared genetic etiology between aortic valve area and birth weight along with other cardiovascular conditions.

Conclusions: These results illustrate the use of automated phenotyping of cardiac imaging data from the general population to investigate the genetic etiology of aortic valve disease, perform clinical prediction, and uncover new clinical and genetic correlates of cardiac anatomy.

Source
http://dx.doi.org/10.1161/CIRCGEN.120.003014 (DOI)
December 2020

A fast and scalable framework for large-scale and ultrahigh-dimensional sparse regression with application to the UK Biobank.

PLoS Genet 2020 10 23;16(10):e1009141. Epub 2020 Oct 23.

Department of Statistics, Stanford University, Stanford, CA, United States of America.

The UK Biobank is a very large, prospective population-based cohort study across the United Kingdom. It provides unprecedented opportunities for researchers to investigate the relationship between genotypic information and phenotypes of interest. Multiple regression methods, compared with genome-wide association studies (GWAS), have already been shown to greatly improve the prediction performance for a variety of phenotypes. In high-dimensional settings, the lasso, since its first proposal in statistics, has proven to be an effective method for simultaneous variable selection and estimation. However, the large scale and ultrahigh dimension seen in the UK Biobank pose new challenges for applying the lasso method, as many existing algorithms and their implementations are not scalable to large applications. In this paper, we propose a computational framework called batch screening iterative lasso (BASIL) that can take advantage of any existing lasso solver and easily build a scalable solution for very large data, including those that are larger than the memory size. We introduce snpnet, an R package that implements the proposed algorithm on top of glmnet and optimizes for single nucleotide polymorphism (SNP) datasets. It currently supports the ℓ1-penalized linear model, logistic regression, and Cox model, and also extends to the elastic net with ℓ1/ℓ2 penalty. We demonstrate results on the UK Biobank dataset, where we achieve competitive predictive performance for all four phenotypes considered (height, body mass index, asthma, high cholesterol) using only a small fraction of the variants compared with other established polygenic risk score methods.
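
The idea of screening a batch of variables and then checking the fit can be sketched in a few lines. The following is a simplified illustration for the squared-error lasso, not the snpnet code: the function names, the in-memory matrix, and the toy data are assumptions, and the step that fits the lasso on the working set with an existing solver is omitted. The first function picks the inactive variables most correlated with the current residual; the second checks the lasso optimality (KKT) condition |x_j' r| / n <= lam for the variables left out.

```python
import numpy as np


def screen(X, residual, active, batch_size):
    """Indices of the batch_size inactive columns with the largest |x_j' r| / n."""
    score = np.abs(X.T @ residual) / X.shape[0]
    # Variables already in the working set are never selected again.
    score[np.fromiter(active, dtype=int)] = -np.inf
    order = np.argsort(-score)
    return [int(j) for j in order[:batch_size] if np.isfinite(score[j])]


def kkt_violations(X, residual, working, lam):
    """Columns outside the working set that violate |x_j' r| / n <= lam."""
    score = np.abs(X.T @ residual) / X.shape[0]
    return [j for j in range(X.shape[1]) if j not in working and score[j] > lam]


X = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 0.0],
              [1.0, 0.0, -2.0],
              [0.0, -1.0, 0.0]])
residual = np.array([1.0, 0.0, 1.0, 0.0])

first_batch = screen(X, residual, active=set(), batch_size=1)
violations = kkt_violations(X, residual, working={1}, lam=0.25)
```

If the check returns no violations, the solution found on the working set is also a solution of the full problem at that penalty value; otherwise the violating variables are added and the fit is repeated. For data larger than memory, the products with X would be computed in chunks read from disk.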

Source
http://dx.doi.org/10.1371/journal.pgen.1009141 (DOI)
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7641476 (PMC)
October 2020

Fast Lasso method for large-scale and ultrahigh-dimensional Cox model with applications to UK Biobank.

Biostatistics 2020 Sep 29. Epub 2020 Sep 29.

Department of Statistics and Department of Biomedical Data Science, Stanford University, Stanford, CA 94305, USA.

We develop a scalable and highly efficient algorithm to fit a Cox proportional hazards model by maximizing the $L^1$-regularized (Lasso) partial likelihood function, based on the Batch Screening Iterative Lasso (BASIL) method developed in Qian and others (2019). Our algorithm is particularly suitable for large-scale and high-dimensional data that do not fit in memory. The output of our algorithm is the full Lasso path, that is, the parameter estimates at all predefined regularization parameters, as well as their validation accuracy measured using the concordance index (C-index) or the validation deviance. To demonstrate the effectiveness of our algorithm, we analyze a large genotype-survival time dataset across 306 disease outcomes from the UK Biobank (Sudlow and others, 2015). We provide a publicly available implementation of the proposed approach for genetics data on top of the PLINK2 package and name it snpnet-Cox.
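
The concordance index used as a validation metric can be computed directly from its definition. The sketch below is a plain quadratic-time version of Harrell's C-index and is not taken from snpnet-Cox: the function name and toy data are assumptions, pairs with tied event times are skipped, and tied risk scores count as half concordant.

```python
def concordance_index(time, event, risk):
    """Fraction of comparable pairs in which the earlier event has the higher risk score."""
    concordant = 0.0
    comparable = 0
    n = len(time)
    for i in range(n):
        if not event[i]:
            continue  # a censored subject cannot be the earlier member of a comparable pair
        for j in range(n):
            if time[i] < time[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data: follow-up times, event indicators (0 = censored), predicted risk scores.
time = [2, 4, 6, 8]
event = [1, 1, 0, 1]
risk = [0.9, 0.3, 0.5, 0.1]
c_index = concordance_index(time, event, risk)
```

In this example five pairs are comparable and four of them are ordered correctly, so `c_index` is 0.8. A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering.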

Source
http://dx.doi.org/10.1093/biostatistics/kxaa038 (DOI)
September 2020

Sex-specific genetic effects across biomarkers.

Eur J Hum Genet 2021 01 1;29(1):154-163. Epub 2020 Sep 1.

Biomedical Informatics Training Program, Stanford University, Stanford, CA, USA.

Sex differences have been shown in laboratory biomarkers; however, the extent to which this is due to genetics is unknown. In this study, we infer sex-specific genetic parameters (heritability and genetic correlation) across 33 quantitative biomarker traits in 181,064 females and 156,135 males from the UK Biobank study. We apply a Bayesian Mixture Model, Sex Effects Mixture Model (SEMM), to Genome-wide Association Study summary statistics in order to (1) estimate the contributions of sex to the genetic variance of these biomarkers and (2) identify variants whose statistical association with these traits is sex-specific. We find that the genetics of most biomarker traits are shared between males and females, with the notable exception of testosterone, where we identify 119 female-specific and 445 male-specific variants. These include protein-altering variants in steroid hormone production genes (POR, UGT2B7). Using the sex-specific variants as genetic instruments for Mendelian randomization, we find evidence for causal links between testosterone levels and height, body mass index, waist and hip circumference, and type 2 diabetes. We also show that sex-specific polygenic risk score models for testosterone outperform a combined model. Overall, these results demonstrate that while sex has a limited role in the genetics of most biomarker traits, sex plays an important role in testosterone genetics.

Source
http://dx.doi.org/10.1038/s41431-020-00712-w (DOI)
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7794464 (PMC)
January 2021

High-throughput SARS-CoV-2 and host genome sequencing from single nasopharyngeal swabs.

medRxiv 2020 Sep 1. Epub 2020 Sep 1.

During COVID-19 and other viral pandemics, rapid generation of host and pathogen genomic data is critical to tracking infection and informing therapies. There is an urgent need for efficient approaches to this data generation at scale. We have developed a scalable, high-throughput approach to generate high-fidelity low-pass whole genome and HLA sequencing, viral genomes, and representation of the human transcriptome from single nasopharyngeal swabs of COVID-19 patients.

Source
http://dx.doi.org/10.1101/2020.07.27.20163147 (DOI)
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7402057 (PMC)
September 2020

Whole exome sequencing analyses reveal gene-microbiota interactions in the context of IBD.

Gut 2021 02 10;70(2):285-296. Epub 2020 Jul 10.

Department of Gastroenterology and Hepatology, University of Groningen and University Medical Center Groningen, Groningen, The Netherlands

Objective: Both the gut microbiome and host genetics are known to play significant roles in the pathogenesis of IBD. However, the interaction between these two factors and its implications in the aetiology of IBD remain underexplored. Here, we report on the influence of host genetics on the gut microbiome in IBD.

Design: To evaluate the impact of host genetics on the gut microbiota of patients with IBD, we combined whole exome sequencing of the host genome and whole genome shotgun sequencing of 1464 faecal samples from 525 patients with IBD and 939 population-based controls. We followed a four-step analysis: (1) exome-wide microbial quantitative trait loci (mbQTL) analyses, (2) a targeted approach focusing on IBD-associated genomic regions and protein truncating variants (PTVs, minor allele frequency (MAF) >5%), (3) gene-based burden tests on PTVs with MAF <5% and exome copy number variations (CNVs) with site frequency <1%, (4) joint analysis of both cohorts to identify the interactions between disease and host genetics.

Results: We identified 12 mbQTLs, including variants in the IBD-associated genes , , and . For example, the decrease of the pathway acetyl-coenzyme A biosynthesis, which is involved in short chain fatty acids production, was associated with variants in the gene (false discovery rate <0.05). Changes in functional pathways involved in the metabolic potential were also observed in participants carrying rare PTVs or CNVs in , and genes. These genes are known for their function in the immune system. Moreover, interaction analyses confirmed previously known IBD disease-specific mbQTLs in .

Conclusion: This study highlights that both common and rare genetic variants affecting the immune system are key factors in shaping the gut microbiota in the context of IBD and points towards potential mechanisms for disease treatment.

Source
http://dx.doi.org/10.1136/gutjnl-2019-319706 (DOI)
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7815889 (PMC)
February 2021

Race, socioeconomic deprivation, and hospitalization for COVID-19 in English participants of a national biobank.

Int J Equity Health 2020 07 6;19(1):114. Epub 2020 Jul 6.

Center for Genomic Medicine and Division of Cardiology, Department of Medicine, Massachusetts General Hospital, Boston, MA, USA.

Preliminary reports suggest that the Coronavirus Disease 2019 (COVID-19) pandemic has led to disproportionate morbidity and mortality among historically disadvantaged populations. We investigate the racial and socioeconomic associations of COVID-19 hospitalization among 418,794 participants of the UK Biobank, of whom 549 (0.13%) had been hospitalized. Both Black participants (odds ratio 3.7; 95% CI 2.5-5.3) and Asian participants (odds ratio 2.2; 95% CI 1.5-3.2) were at substantially increased risk as compared to White participants. We further observed a striking gradient in COVID-19 hospitalization rates according to the Townsend Deprivation Index - a composite measure of socioeconomic deprivation - and household income. Adjusting for socioeconomic factors and cardiorespiratory comorbidities led to only modest attenuation of the increased risk in Black participants, adjusted odds ratio 2.4 (95% CI 1.5-3.7). These observations confirm and extend earlier preliminary and lay press reports of higher morbidity in non-White individuals in the context of a large population of participants in a national biobank. The extent to which this increased risk relates to variation in pre-existing comorbidities, differences in testing or hospitalization patterns, or additional disparities in social determinants of health warrants further study.

Source
http://dx.doi.org/10.1186/s12939-020-01227-y (DOI)
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7336098 (PMC)
July 2020

FasTag: Automatic text classification of unstructured medical narratives.

PLoS One 2020 22;15(6):e0234647. Epub 2020 Jun 22.

Department of Biomedical Data Science, School of Medicine, Stanford University, Stanford, CA, United States of America.

Unstructured clinical narratives are continuously being recorded as part of delivery of care in electronic health records, and dedicated tagging staff spend considerable effort manually assigning clinical codes for billing purposes. Despite these efforts, however, label availability and accuracy are both suboptimal. In this retrospective study, we aimed to automate the assignment of top-level International Classification of Diseases version 9 (ICD-9) codes to clinical records from human and veterinary data stores using minimal manual labor and feature curation. Automating top-level annotations could in turn enable rapid cohort identification, especially in a veterinary setting. To this end, we trained long short-term memory (LSTM) recurrent neural networks (RNNs) on 52,722 human and 89,591 veterinary records. We investigated the accuracy of both separate-domain and combined-domain models and probed model portability. We established relevant baseline classification performances by training Decision Trees (DT) and Random Forests (RF). We also investigated whether transforming the data using MetaMap Lite, a clinical natural language processing tool, affected classification performance. We showed that the LSTM-RNNs accurately classify veterinary and human text narratives into top-level categories with an average weighted macro F1 score of 0.74 and 0.68 respectively. In the "neoplasia" category, the model trained on veterinary data had a high validation accuracy in veterinary data and moderate accuracy in human data, with F1 scores of 0.91 and 0.70 respectively. Our LSTM method scored slightly higher than that of the DT and RF models. The use of LSTM-RNN models represents a scalable structure that could prove useful in cohort identification for comparative oncology studies. Digitization of human and veterinary health information will continue to be a reality, particularly in the form of unstructured narratives. Our approach is a step forward for these two domains to learn from and inform one another.

Source
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0234647 PLOS
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7307763 PMC
August 2020

Race, Socioeconomic Deprivation, and Hospitalization for COVID-19 in English participants of a National Biobank.

medRxiv 2020 May 2. Epub 2020 May 2.

Preliminary reports suggest that the Coronavirus Disease 2019 (COVID-19) pandemic has led to disproportionate morbidity and mortality among historically disadvantaged populations. The extent to which these disparities are related to socioeconomic versus biologic factors is largely unknown. We investigate the racial and socioeconomic associations of COVID-19 hospitalization among 418,794 participants of the UK Biobank, of whom 549 (0.13%) had been hospitalized. Both black participants (odds ratio 3.4; 95%CI 2.4-4.9) and Asian participants (odds ratio 2.1; 95%CI 1.5-3.2) were at substantially increased risk as compared to white participants. We further observed a striking gradient in COVID-19 hospitalization rates according to the Townsend Deprivation Index - a composite measure of socioeconomic deprivation - and household income. Adjusting for such factors led to only modest attenuation of the increased risk in black participants, adjusted odds ratio 3.1 (95%CI 2.0-4.8). These observations confirm and extend earlier preliminary and lay press reports of higher morbidity in non-white individuals in the context of a large population of participants in a national biobank. The extent to which this increased risk relates to variation in pre-existing comorbidities, differences in testing or hospitalization patterns, or additional disparities in social determinants of health warrants further study.

Source
http://dx.doi.org/10.1101/2020.04.27.20082107 DOI Listing
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7276998 PMC
May 2020

Rare protein-altering variants in ANGPTL7 lower intraocular pressure and protect against glaucoma.

PLoS Genet 2020 05 5;16(5):e1008682. Epub 2020 May 5.

Department of Biomedical Data Science, School of Medicine, Stanford University, Stanford, California, United States of America.

Protein-altering variants that are protective against human disease provide in vivo validation of therapeutic targets. Here we use genotyping data from UK Biobank (n = 337,151 unrelated White British individuals) and FinnGen (n = 176,899) to conduct a search for protein-altering variants conferring lower intraocular pressure (IOP) and protection against glaucoma. Through rare protein-altering variant association analysis, we find a missense variant in ANGPTL7 in UK Biobank (rs28991009, p.Gln175His, MAF = 0.8%, genotyped in 82,253 individuals with measured IOP and an independent set of 4,238 glaucoma patients and 250,660 controls) that significantly lowers IOP (β = -0.53 and -0.67 mmHg for heterozygotes, -3.40 and -2.37 mmHg for homozygotes, P = 5.96 × 10⁻⁹ and 1.07 × 10⁻¹³ for corneal compensated and Goldman-correlated IOP, respectively) and is associated with 34% reduced risk of glaucoma (P = 0.0062). In FinnGen, we identify an ANGPTL7 missense variant at a greater than 50-fold increased frequency in Finland compared with other populations (rs147660927, p.Arg220Cys, MAF Finland = 4.3%), which was genotyped in 6,537 glaucoma patients and 170,362 controls and is associated with a 29% lower glaucoma risk (P = 1.9 × 10⁻¹² for all glaucoma types and also protection against its subtypes including exfoliation, primary open-angle, and primary angle-closure). We further find three rarer variants in UK Biobank, including a protein-truncating variant, which confer a strong composite lowering of IOP (P = 0.0012 and 0.24 for Goldman-correlated and corneal compensated IOP, respectively), suggesting the protective mechanism likely resides in the loss of interaction or function. Our results support inhibition or down-regulation of ANGPTL7 as a therapeutic strategy for glaucoma.

Source
http://dx.doi.org/10.1371/journal.pgen.1008682 DOI Listing
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7199928 PMC
May 2020

Assessing Digital Phenotyping to Enhance Genetic Studies of Human Diseases.

Am J Hum Genet 2020 05 9;106(5):611-622. Epub 2020 Apr 9.

Department of Biomedical Data Science, Stanford University, Stanford, CA, USA. Electronic address:

Population-scale biobanks that combine genetic data and high-dimensional phenotyping for a large number of participants provide an exciting opportunity to perform genome-wide association studies (GWAS) to identify genetic variants associated with diverse quantitative traits and diseases. A major challenge for GWAS in population biobanks is ascertaining disease cases from heterogeneous data sources such as hospital records, digital questionnaire responses, or interviews. In this study, we use genetic parameters, including genetic correlation, to evaluate whether GWAS performed using cases in the UK Biobank ascertained from hospital records, questionnaire responses, and family history of disease implicate similar disease genetics across a range of effect sizes. We find that hospital record and questionnaire GWAS largely identify similar genetic effects for many complex phenotypes and that combining together both phenotyping methods improves power to detect genetic associations. We also show that family history GWAS using cases ascertained on family history of disease agrees with combined hospital record and questionnaire GWAS and that family history GWAS has better power to detect genetic associations for some phenotypes. Overall, this work demonstrates that digital phenotyping and unstructured phenotype data can be combined with structured data such as hospital records to identify cases for GWAS in biobanks and improve the ability of such studies to identify genetic associations.

Source
http://dx.doi.org/10.1016/j.ajhg.2020.03.007 DOI Listing
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7212271 PMC
May 2020

Homogeneity in the association of body mass index with type 2 diabetes across the UK Biobank: A Mendelian randomization study.

PLoS Med 2019 12 10;16(12):e1002982. Epub 2019 Dec 10.

Department of Biomedical Data Science, Stanford University, Stanford, California, United States of America.

Background: Lifestyle interventions to reduce body mass index (BMI) are critical public health strategies for type 2 diabetes prevention. While weight loss interventions have shown demonstrable benefit for high-risk and prediabetic individuals, we aimed to determine whether the same benefits apply to those at lower risk.

Methods and Findings: We performed a multi-stratum Mendelian randomization study of the effect size of BMI on diabetes odds in 287,394 unrelated individuals of self-reported white British ancestry in the UK Biobank, who were recruited from across the United Kingdom from 2006 to 2010 when they were between the ages of 40 and 69 years. Individuals were stratified on the following diabetes risk factors: BMI, diabetes family history, and genome-wide diabetes polygenic risk score. The main outcome measure was the odds ratio of diabetes per 1-kg/m² BMI reduction, in the full cohort and in each stratum. Diabetes prevalence increased sharply with BMI, family history of diabetes, and genetic risk. Conversely, predicted risk reduction from weight loss was strikingly similar across BMI and genetic risk categories. Weight loss was predicted to substantially reduce diabetes odds even among lower-risk individuals: for instance, a 1-kg/m² BMI reduction was associated with a 1.37-fold reduction (95% CI 1.12-1.68) in diabetes odds among non-overweight individuals (BMI < 25 kg/m²) without a family history of diabetes, similar to that in obese individuals (BMI ≥ 30 kg/m²) with a family history (1.21-fold reduction, 95% CI 1.13-1.29). A key limitation of this analysis is that the BMI-altering DNA sequence polymorphisms it studies represent cumulative predisposition over an individual's entire lifetime, and may consequently incorrectly estimate the risk modification potential of weight loss interventions later in life.

Conclusions: In a population-scale cohort, lower BMI was consistently associated with reduced diabetes risk across BMI, family history, and genetic risk categories, suggesting all individuals can substantially reduce their diabetes risk through weight loss. Our results support the broad deployment of weight loss interventions to individuals at all levels of diabetes risk.

Source
http://dx.doi.org/10.1371/journal.pmed.1002982 DOI Listing
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6903707 PMC
December 2019

Components of genetic associations across 2,138 phenotypes in the UK Biobank highlight adipocyte biology.

Nat Commun 2019 09 6;10(1):4064. Epub 2019 Sep 6.

Department of Biomedical Data Science, School of Medicine, Stanford University, Stanford, CA, USA.

Population-based biobanks with genomic and dense phenotype data provide opportunities for generating effective therapeutic hypotheses and understanding the genomic role in disease predisposition. To characterize latent components of genetic associations, we apply truncated singular value decomposition (DeGAs) to matrices of summary statistics derived from genome-wide association analyses across 2,138 phenotypes measured in 337,199 White British individuals in the UK Biobank study. We systematically identify key components of genetic associations and the contributions of variants, genes, and phenotypes to each component. As an illustration of the utility of the approach to inform downstream experiments, we report putative loss of function variants, rs114285050 (GPR151) and rs150090666 (PDE3B), that substantially contribute to obesity-related traits and experimentally demonstrate the role of these genes in adipocyte biology. Our approach to dissect components of genetic associations across the human phenome will accelerate biomedical hypothesis generation by providing insights on previously unexplored latent structures.

Source
http://dx.doi.org/10.1038/s41467-019-11953-9 DOI Listing
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6731283 PMC
September 2019

Rare and common variant discovery in complex disease: the IBD case study.

Hum Mol Genet 2019 11;28(R2):R162-R169

Department of Biomedical Data Science, School of Medicine, Stanford University, Stanford, CA, USA.

Complex diseases such as inflammatory bowel disease (IBD), which consists of ulcerative colitis and Crohn's disease, are a significant medical burden: 70,000 new cases of IBD are diagnosed in the United States annually. In this review, we examine the history of genetic variant discovery in complex disease with a focus on IBD. We cover methods that have been applied to microsatellite, common variant, targeted resequencing and whole-exome and -genome data, specifically focusing on the progression of technologies towards rare-variant discovery. The inception of these methods combined with better availability of population level variation data has led to rapid discovery of IBD-causative and/or -associated variants at over 200 loci; over time, these methods have grown exponentially in both power and ascertainment to detect rare variation. We highlight rare-variant discoveries critical to the elucidation of the pathogenesis of IBD, including those in NOD2, IL23R, CARD9, RNF186 and ADCY7. We additionally identify the major areas of rare-variant discovery that will evolve in the coming years. A better understanding of the genetic basis of IBD and other complex diseases will lead to improved diagnosis, prognosis, treatment and surveillance.

Source
http://dx.doi.org/10.1093/hmg/ddz189 DOI Listing
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6872431 PMC
November 2019

Phenome-wide Burden of Copy-Number Variation in the UK Biobank.

Am J Hum Genet 2019 08 25;105(2):373-383. Epub 2019 Jul 25.

Department of Pediatrics, School of Medicine, Stanford University, Stanford, CA 94305, USA; Stanford Cardiovascular Institute, Stanford University, Stanford, CA 94305, USA. Electronic address:

Copy-number variations (CNVs) represent a significant proportion of the genetic differences between individuals and many CNVs associate causally with syndromic disease and clinical outcomes. Here, we characterize the landscape of copy-number variation and their phenome-wide effects in a sample of 472,228 array-genotyped individuals from the UK Biobank. In addition to population-level selection effects against genic loci conferring high mortality, we describe genetic burden from potentially pathogenic and previously uncharacterized CNV loci across more than 3,000 quantitative and dichotomous traits, with separate analyses for common and rare classes of variation. Specifically, we highlight the effects of CNVs at two well-known syndromic loci 16p11.2 and 22q11.2, previously uncharacterized variation at 9p23, and several genic associations in the context of acute coronary artery disease and high body mass index. Our data constitute a deeply contextualized portrait of population-wide burden of copy-number variation, as well as a series of dosage-mediated genic associations across the medical phenome.

Source
http://dx.doi.org/10.1016/j.ajhg.2019.07.001 DOI Listing
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6699064 PMC
August 2019

DeepTag: inferring diagnoses from veterinary clinical notes.

NPJ Digit Med 2018 24;1:60. Epub 2018 Oct 24.

Department of Biomedical Data Science, Stanford University, Stanford, CA 94305 USA.

Large scale veterinary clinical records can become a powerful resource for patient care and research. However, clinicians lack the time and resource to annotate patient records with standard medical diagnostic codes and most veterinary visits are captured in free-text notes. The lack of standard coding makes it challenging to use the clinical data to improve patient care. It is also a major impediment to cross-species translational research, which relies on the ability to accurately identify patient cohorts with specific diagnostic criteria in humans and animals. In order to reduce the coding burden for veterinary clinical practice and aid translational research, we have developed a deep learning algorithm, DeepTag, which automatically infers diagnostic codes from veterinary free-text notes. DeepTag is trained on a newly curated dataset of 112,558 veterinary notes manually annotated by experts. DeepTag extends multitask LSTM with an improved hierarchical objective that captures the semantic structures between diseases. To foster human-machine collaboration, DeepTag also learns to abstain in examples when it is uncertain and defers them to human experts, resulting in improved performance. DeepTag accurately infers disease codes from free-text even in challenging cross-hospital settings where the text comes from different clinical settings than the ones used for training. It enables automated disease annotation across a broad range of clinical diagnoses with minimal preprocessing. The technical framework in this work can be applied in other medical domains that currently lack medical coding resources.

Source
http://dx.doi.org/10.1038/s41746-018-0067-8 DOI Listing
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6550285 PMC
October 2018

Opportunities and challenges for transcriptome-wide association studies.

Nat Genet 2019 04 29;51(4):592-599. Epub 2019 Mar 29.

Department of Computer Science, Stanford University, Stanford, CA, USA.

Transcriptome-wide association studies (TWAS) integrate genome-wide association studies (GWAS) and gene expression datasets to identify gene-trait associations. In this Perspective, we explore properties of TWAS as a potential approach to prioritize causal genes at GWAS loci, by using simulations and case studies of literature-curated candidate causal genes for schizophrenia, low-density-lipoprotein cholesterol and Crohn's disease. We explore risk loci where TWAS accurately prioritizes the likely causal gene as well as loci where TWAS prioritizes multiple genes, some likely to be non-causal, owing to sharing of expression quantitative trait loci (eQTL). TWAS is especially prone to spurious prioritization with expression data from non-trait-related tissues or cell types, owing to substantial cross-cell-type variation in expression levels and eQTL strengths. Nonetheless, TWAS prioritizes candidate causal genes more accurately than simple baselines. We suggest best practices for causal-gene prioritization with TWAS and discuss future opportunities for improvement. Our results showcase the strengths and limitations of using eQTL datasets to determine causal genes at GWAS loci.

Source
http://dx.doi.org/10.1038/s41588-019-0385-z DOI Listing
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6777347 PMC
April 2019
-->