Publications by authors named "Melissa Gymrek"

41 Publications

Analysis of Brugada syndrome loci reveals that fine-mapping clustered GWAS hits enhances the annotation of disease-relevant variants.

Cell Rep Med 2021 Apr 20;2(4):100250. Epub 2021 Apr 20.

Department of Medical Sciences, School of Medicine, Universitat de Girona, Girona, Spain.

Genome-wide association studies (GWASs) are instrumental in identifying loci harboring common single-nucleotide variants (SNVs) that affect human traits and diseases. GWAS hits emerge in clusters, but the focus is often on the most significant hit in each trait- or disease-associated locus. The remaining hits represent SNVs in linkage disequilibrium (LD) and are considered redundant and thus frequently marginally reported or exploited. Here, we interrogate the value of integrating the full set of GWAS hits in a locus repeatedly associated with cardiac conduction traits and arrhythmia, -. Our analysis reveals 5 common 7-SNV haplotypes (Hap1-5) with 2 combinations associated with life-threatening arrhythmia-Brugada syndrome (the risk Hap and protective Hap genotypes). Hap1 and Hap2 share 3 SNVs; thus, this analysis suggests that assuming redundancy among clustered GWAS hits can lead to confounding disease-risk associations and supports the need to deconstruct GWAS data in the context of haplotype composition.

DOI: http://dx.doi.org/10.1016/j.xcrm.2021.100250
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC8080235
April 2021

A flexible ChIP-sequencing simulation toolkit.

BMC Bioinformatics 2021 Apr 20;22(1):201. Epub 2021 Apr 20.

Department of Computer Science and Engineering, University of California San Diego, 9500 Gilman Drive, La Jolla, CA, 92093, USA.

Background: A major challenge in evaluating quantitative ChIP-seq analyses, such as peak calling and differential binding, is a lack of reliable ground truth data. Accurate simulation of ChIP-seq data can mitigate this challenge, but existing frameworks are either too cumbersome to apply genome-wide or unable to model a number of important experimental conditions in ChIP-seq.

Results: We present ChIPs, a toolkit for rapidly simulating ChIP-seq data using statistical models of key experimental steps. We demonstrate how ChIPs can be used for a range of applications, including benchmarking analysis tools and evaluating the impact of various experimental parameters. ChIPs is implemented as a standalone command-line program written in C++ and is available from https://github.com/gymreklab/chips .

Conclusions: ChIPs is an efficient ChIP-seq simulation framework that generates realistic datasets over a flexible range of experimental conditions. It can serve as an important component in various ChIP-seq analyses where ground truth data are needed.

DOI: http://dx.doi.org/10.1186/s12859-021-04097-5
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC8056602
April 2021

Variable number tandem repeats mediate the expression of proximal genes.

Nat Commun 2021 Apr 6;12(1):2075. Epub 2021 Apr 6.

Department of Computer Science & Engineering, University of California, San Diego, La Jolla, CA, USA.

Variable number tandem repeats (VNTRs) account for significant genetic variation in many organisms. In humans, VNTRs have been implicated in both Mendelian and complex disorders, but are largely ignored by genomic pipelines due to the complexity of genotyping and the computational expense. We describe adVNTR-NN, a method that uses shallow neural networks to genotype a VNTR in 18 seconds on 55X whole genome data, while maintaining high accuracy. We use adVNTR-NN to genotype 10,264 VNTRs in 652 GTEx individuals. Associating VNTR length with gene expression in 46 tissues, we identify 163 "eVNTRs". Of the 22 eVNTRs in blood where independent data is available, 21 (95%) are replicated in terms of significance and direction of association. 49% of the eVNTR loci show a strong and likely causal impact on the expression of genes and 80% have maximum effect size at least 0.3. The impacted genes are involved in diseases including Alzheimer's, obesity and familial cancers, highlighting the importance of VNTRs for understanding the genetic basis of complex diseases.

DOI: http://dx.doi.org/10.1038/s41467-021-22206-z
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC8024321
April 2021

Deep neural networks identify sequence context features predictive of transcription factor binding.

Nat Mach Intell 2021 Feb 18;3(2):172-180. Epub 2021 Jan 18.

Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA, USA.

Transcription factors (TFs) bind DNA by recognizing specific sequence motifs, typically of length 6-12 bp. A motif can occur many thousands of times in the human genome, but only a subset of those sites are actually bound. Here we present a machine learning framework leveraging existing convolutional neural network architectures and model interpretation techniques to identify and interpret sequence context features most important for predicting whether a particular motif instance will be bound. We apply our framework to predict binding at motifs for 38 TFs in a lymphoblastoid cell line, score the importance of context sequences at base-pair resolution, and characterize context features most predictive of binding. We find that the choice of training data heavily influences classification accuracy and the relative importance of features such as open chromatin. Overall, our framework enables novel insights into features predictive of TF binding and is likely to inform future deep learning applications to interpret non-coding genetic variants.

DOI: http://dx.doi.org/10.1038/s42256-020-00282-y
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC8009085
February 2021

Patterns of de novo tandem repeat mutations and their role in autism.

Nature 2021 Jan 13;589(7841):246-250. Epub 2021 Jan 13.

Department of Medicine, University of California San Diego, La Jolla, CA, USA.

Autism spectrum disorder (ASD) is an early-onset developmental disorder characterized by deficits in communication and social interaction and restrictive or repetitive behaviours. Family studies demonstrate that ASD has a substantial genetic basis with contributions both from inherited and de novo variants. It has been estimated that de novo mutations may contribute to 30% of all simplex cases, in which only a single child is affected per family. Tandem repeats (TRs), defined here as sequences of 1 to 20 base pairs in size repeated consecutively, comprise one of the major sources of de novo mutations in humans. TR expansions are implicated in dozens of neurological and psychiatric disorders. Yet, de novo TR mutations have not been characterized on a genome-wide scale, and their contribution to ASD remains unexplored. Here we develop new bioinformatics methods for identifying and prioritizing de novo TR mutations from sequencing data and perform a genome-wide characterization of de novo TR mutations in ASD-affected probands and unaffected siblings. We infer specific mutation events and their precise changes in repeat number, and primarily focus on more prevalent stepwise copy number changes rather than large expansions. Our results demonstrate a significant genome-wide excess of TR mutations in ASD probands. Mutations in probands tend to be larger, enriched in fetal brain regulatory regions, and are predicted to be more evolutionarily deleterious. Overall, our results highlight the importance of considering repeat variants in future studies of de novo mutations.

DOI: http://dx.doi.org/10.1038/s41586-020-03078-7
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7810352
January 2021

TRTools: a toolkit for genome-wide analysis of tandem repeats.

Bioinformatics 2021 May;37(5):731-733

Department of Medicine, University of California San Diego, La Jolla, 92093, USA.

Summary: A rich set of tools have recently been developed for performing genome-wide genotyping of tandem repeats (TRs). However, standardized tools for downstream analysis of these results are lacking. To facilitate TR analysis applications, we present TRTools, a Python library and suite of command line tools for filtering, merging and quality control of TR genotype files. TRTools utilizes an internal harmonization module, making it compatible with outputs from a wide range of TR genotypers.

Availability And Implementation: TRTools is freely available at https://github.com/gymreklab/TRTools. Detailed documentation is available at https://trtools.readthedocs.io.

Supplementary Information: Supplementary data are available at Bioinformatics online.

DOI: http://dx.doi.org/10.1093/bioinformatics/btaa736
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC8097685
May 2021

Autism risk in offspring can be assessed through quantification of male sperm mosaicism.

Nat Med 2020 Jan 23;26(1):143-150. Epub 2019 Dec 23.

Department of Neurosciences, Howard Hughes Medical Institute, University of California, San Diego, La Jolla, CA, USA.

De novo mutations arising on the paternal chromosome make the largest known contribution to autism risk, and correlate with paternal age at the time of conception. The recurrence risk for autism spectrum disorders is substantial, leading many families to decline future pregnancies, but the potential impact of assessing parental gonadal mosaicism has not been considered. We measured sperm mosaicism using deep-whole-genome sequencing, for variants both present in an offspring and evident only in father's sperm, and identified single-nucleotide, structural and short tandem-repeat variants. We found that mosaicism quantification can stratify autism spectrum disorders recurrence risk due to de novo mutations into a vast majority with near 0% recurrence and a small fraction with a substantially higher and quantifiable risk, and we identify novel mosaic variants at risk for transmission to a future offspring. This suggests, therefore, that genetic counseling would benefit from the addition of sperm mosaicism assessment.

DOI: http://dx.doi.org/10.1038/s41591-019-0711-0
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7032648
January 2020

The impact of short tandem repeat variation on gene expression.

Nat Genet 2019 Nov 1;51(11):1652-1659. Epub 2019 Nov 1.

Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA, USA.

Short tandem repeats (STRs) have been implicated in a variety of complex traits in humans. However, genome-wide studies of the effects of STRs on gene expression thus far have had limited power to detect associations and provide insights into putative mechanisms. Here, we leverage whole-genome sequencing and expression data for 17 tissues from the Genotype-Tissue Expression Project to identify more than 28,000 STRs for which repeat number is associated with expression of nearby genes (eSTRs). We use fine-mapping to quantify the probability that each eSTR is causal and characterize the top 1,400 fine-mapped eSTRs. We identify hundreds of eSTRs linked with published genome-wide association study signals and implicate specific eSTRs in complex traits, including height, schizophrenia, inflammatory bowel disease and intelligence. Overall, our results support the hypothesis that eSTRs contribute to a range of human phenotypes, and our data should serve as a valuable resource for future studies of complex traits.

DOI: http://dx.doi.org/10.1038/s41588-019-0521-9
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6917484
November 2019

Cooperation of cancer drivers with regulatory germline variants shapes clinical outcomes.

Nat Commun 2019 Sep 11;10(1):4128. Epub 2019 Sep 11.

INSERM U830, Équipe Labellisée LNCC Genetics and Biology of Pediatric Cancers, PSL Research University, SIREDO Oncology Centre, Institut Curie Research Centre, Paris, France.

Pediatric malignancies including Ewing sarcoma (EwS) feature a paucity of somatic alterations except for pathognomonic driver-mutations that cannot explain overt variations in clinical outcome. Here, we demonstrate in EwS how cooperation of dominant oncogenes and regulatory germline variants determine tumor growth, patient survival and drug response. Binding of the oncogenic EWSR1-FLI1 fusion transcription factor to a polymorphic enhancer-like DNA element controls expression of the transcription factor MYBL2 mediating these phenotypes. Whole-genome and RNA sequencing reveals that variability at this locus is inherited via the germline and is associated with variable inter-tumoral MYBL2 expression. High MYBL2 levels sensitize EwS cells for inhibition of its upstream activating kinase CDK2 in vitro and in vivo, suggesting MYBL2 as a putative biomarker for anti-CDK2-therapy. Collectively, we establish cooperation of somatic mutations and regulatory germline variants as a major determinant of tumor progression and highlight the importance of integrating the regulatory genome in precision medicine.

DOI: http://dx.doi.org/10.1038/s41467-019-12071-2
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6739408
September 2019

Profiling the genome-wide landscape of tandem repeat expansions.

Nucleic Acids Res 2019 Sep;47(15):e90

Department of Medicine, University of California San Diego, 9500 Gilman Drive, MC 0639, La Jolla, CA 92093, USA.

Tandem repeat (TR) expansions have been implicated in dozens of genetic diseases, including Huntington's Disease, Fragile X Syndrome, and hereditary ataxias. Furthermore, TRs have recently been implicated in a range of complex traits, including gene expression and cancer risk. While the human genome harbors hundreds of thousands of TRs, analysis of TR expansions has been mainly limited to known pathogenic loci. A major challenge is that expanded repeats are beyond the read length of most next-generation sequencing (NGS) datasets and are not profiled by existing genome-wide tools. We present GangSTR, a novel algorithm for genome-wide genotyping of both short and expanded TRs. GangSTR extracts information from paired-end reads into a unified model to estimate maximum likelihood TR lengths. We validate GangSTR on real and simulated data and show that GangSTR outperforms alternative methods in both accuracy and speed. We apply GangSTR to a deeply sequenced trio to profile the landscape of TR expansions in a healthy family and validate novel expansions using orthogonal technologies. Our analysis reveals that healthy individuals harbor dozens of long TR alleles not captured by current genome-wide methods. GangSTR will likely enable discovery of novel disease-associated variants not currently accessible from NGS.

DOI: http://dx.doi.org/10.1093/nar/gkz501
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6735967
September 2019

A reference haplotype panel for genome-wide imputation of short tandem repeats.

Nat Commun 2018 Oct 23;9(1):4397. Epub 2018 Oct 23.

Department of Computer Science and Engineering, University of California San Diego, 9500 Gilman Drive, La Jolla, CA, 92093, USA.

Short tandem repeats (STRs) are involved in dozens of Mendelian disorders and have been implicated in complex traits. However, genotyping arrays used in genome-wide association studies focus on single nucleotide polymorphisms (SNPs) and do not readily allow identification of STR associations. We leverage next-generation sequencing (NGS) from 479 families to create a SNP + STR reference haplotype panel. Our panel enables imputing STR genotypes into SNP array data when NGS is not available for directly genotyping STRs. Imputed genotypes achieve mean concordance of 97% with observed genotypes in an external dataset compared to 71% expected under a naive model. Performance varies widely across STRs, with near perfect concordance at bi-allelic STRs vs. 70% at highly polymorphic repeats. Imputation increases power over individual SNPs to detect STR associations with gene expression. Imputing STRs into existing SNP datasets will enable the first large-scale STR association studies across a range of complex traits.

DOI: http://dx.doi.org/10.1038/s41467-018-06694-0
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6199332
October 2018

Targeted genotyping of variable number tandem repeats with adVNTR.

Genome Res 2018 Nov 23;28(11):1709-1719. Epub 2018 Oct 23.

Department of Computer Science and Engineering, University of California, San Diego, La Jolla, California 92093, USA.

Whole-genome sequencing is increasingly used to identify Mendelian variants in clinical pipelines. These pipelines focus on single-nucleotide variants (SNVs) and also structural variants, while ignoring more complex repeat sequence variants. Here, we consider the problem of genotyping variable number tandem repeats (VNTRs), composed of inexact tandem duplications of short (6-100 bp) repeating units. VNTRs span 3% of the human genome, are frequently present in coding regions, and have been implicated in multiple Mendelian disorders. Although existing tools recognize VNTR-carrying sequence, genotyping VNTRs (determining repeat unit count and sequence variation) from whole-genome sequencing reads remains challenging. We describe a method, adVNTR, that uses hidden Markov models to model each VNTR, count repeat units, and detect sequence variation. adVNTR models can be developed for short-read (Illumina) and single-molecule (Pacific Biosciences [PacBio]) whole-genome and whole-exome sequencing, and show good results on multiple simulated and real data sets.

DOI: http://dx.doi.org/10.1101/gr.235119.118
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6211647
November 2018

Quantitative analysis of population-scale family trees with millions of relatives.

Science 2018 Apr 1;360(6385):171-175. Epub 2018 Mar 1.

New York Genome Center, New York, NY 10013, USA.

Family trees have vast applications in fields as diverse as genetics, anthropology, and economics. However, the collection of extended family trees is tedious and usually relies on resources with limited geographical scope and complex data usage restrictions. We collected 86 million profiles from publicly available online data shared by genealogy enthusiasts. After extensive cleaning and validation, we obtained population-scale family trees, including a single pedigree of 13 million individuals. We leveraged the data to partition the genetic architecture of human longevity and to provide insights into the geographical dispersion of families. We also report a simple digital procedure to overlay other data sets with our resource.

DOI: http://dx.doi.org/10.1126/science.aam9309
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6593158
April 2018

Journal of Open Source Software (JOSS): design and first-year review.

PeerJ Prepr 2018 12;4:e147. Epub 2018 Feb 12.

eScience Institute, University of Washington, Seattle, WA, United States of America.

This article describes the motivation, design, and progress of the Journal of Open Source Software (JOSS). JOSS is a free and open-access journal that publishes articles describing research software. It has the dual goals of improving the quality of the software submitted and providing a mechanism for research software developers to receive credit. While designed to work within the current merit system of science, JOSS addresses the dearth of rewards for key contributions to science made in the form of software. JOSS publishes articles that encapsulate scholarship contained in the software itself, and its rigorous peer review targets the software components: functionality, documentation, tests, continuous integration, and the license. A JOSS article contains an abstract describing the purpose and functionality of the software, references, and a link to the software archive. The article is the entry point of a JOSS submission, which encompasses the full set of software artifacts. Submission and review proceed in the open, on GitHub. Editors, reviewers, and authors work collaboratively and openly. Unlike other journals, JOSS does not reject articles requiring major revision; while not yet accepted, articles remain visible and under review until the authors make adequate changes (or withdraw, if unable to meet requirements). Once an article is accepted, JOSS gives it a digital object identifier (DOI), deposits its metadata in Crossref, and the article can begin collecting citations on indexers like Google Scholar and other services. Authors retain copyright of their JOSS article, releasing it under a Creative Commons Attribution 4.0 International License. In its first year, starting in May 2016, JOSS published 111 articles, with more than 40 additional articles under review. JOSS is a sponsored project of the nonprofit organization NumFOCUS and is an affiliate of the Open Source Initiative (OSI).

DOI: http://dx.doi.org/10.7717/peerj-cs.147
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7340488
February 2018

Interpreting short tandem repeat variations in humans using mutational constraint.

Nat Genet 2017 Oct 11;49(10):1495-1501. Epub 2017 Sep 11.

New York Genome Center, New York, New York, USA.

Identifying regions of the genome that are depleted of mutations can distinguish potentially deleterious variants. Short tandem repeats (STRs), also known as microsatellites, are among the largest contributors of de novo mutations in humans. However, per-locus studies of STR mutations have been limited to highly ascertained panels of several dozen loci. Here we harnessed bioinformatics tools and a novel analytical framework to estimate mutation parameters for each STR in the human genome by correlating STR genotypes with local sequence heterozygosity. We applied our method to obtain robust estimates of the impact of local sequence features on mutation parameters and used these estimates to create a framework for measuring constraint at STRs by comparing observed versus expected mutation rates. Constraint scores identified known pathogenic variants with early-onset effects. Our metric will provide a valuable tool for prioritizing pathogenic STRs in medical genetics studies.

DOI: http://dx.doi.org/10.1038/ng.3952
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5679271
October 2017

Type 2 Diabetes Variants Disrupt Function of SLC16A11 through Two Distinct Mechanisms.

Cell 2017 Jun;170(1):199-212.e20

Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA; Department of Biology, MIT, Cambridge, MA 02139, USA; Department of Systems Biology, Harvard Medical School, Boston, MA 02115, USA.

Type 2 diabetes (T2D) affects Latinos at twice the rate seen in populations of European descent. We recently identified a risk haplotype spanning SLC16A11 that explains ∼20% of the increased T2D prevalence in Mexico. Here, through genetic fine-mapping, we define a set of tightly linked variants likely to contain the causal allele(s). We show that variants on the T2D-associated haplotype have two distinct effects: (1) decreasing SLC16A11 expression in liver and (2) disrupting a key interaction with basigin, thereby reducing cell-surface localization. Both independent mechanisms reduce SLC16A11 function and suggest SLC16A11 is the causal gene at this locus. To gain insight into how SLC16A11 disruption impacts T2D risk, we demonstrate that SLC16A11 is a proton-coupled monocarboxylate transporter and that genetic perturbation of SLC16A11 induces changes in fatty acid and lipid metabolism that are associated with increased T2D risk. Our findings suggest that increasing SLC16A11 function could be therapeutically beneficial for T2D.

DOI: http://dx.doi.org/10.1016/j.cell.2017.06.011
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5562285
June 2017

Genome-wide profiling of heritable and de novo STR variations.

Nat Methods 2017 Jun 24;14(6):590-592. Epub 2017 Apr 24.

New York Genome Center, New York, New York, USA.

Short tandem repeats (STRs) are highly variable elements that play a pivotal role in multiple genetic diseases, population genetics applications, and forensic casework. However, it has proven problematic to genotype STRs from high-throughput sequencing data. Here, we describe HipSTR, a novel haplotype-based method for robustly genotyping and phasing STRs from Illumina sequencing data, and we report a genome-wide analysis and validation of de novo STR mutations. HipSTR is freely available at https://hipstr-tool.github.io/HipSTR.

DOI: http://dx.doi.org/10.1038/nmeth.4267
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5482724
June 2017

A genomic view of short tandem repeats.

Authors:
Melissa Gymrek

Curr Opin Genet Dev 2017 Jun 16;44:9-16. Epub 2017 Feb 16.

Department of Medicine, University of California San Diego, La Jolla, CA 92093, USA; Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA 92093, USA.

Short tandem repeats (STRs) are some of the fastest mutating loci in the genome. Tools for accurately profiling STRs from high-throughput sequencing data have enabled genome-wide interrogation of more than a million STRs across hundreds of individuals. These catalogs have revealed that STRs are highly multiallelic and may contribute more de novo mutations than any other variant class. Recent studies have leveraged these catalogs to show that STRs play a widespread role in regulating gene expression and other molecular phenotypes. These analyses suggest that STRs are an underappreciated but rich reservoir of variation that likely make significant contributions to Mendelian diseases, complex traits, and cancer.

DOI: http://dx.doi.org/10.1016/j.gde.2017.01.012
June 2017

The Simons Genome Diversity Project: 300 genomes from 142 diverse populations.

Nature 2016 Oct 21;538(7624):201-206. Epub 2016 Sep 21.

Department of Zoology, University of Oxford, Oxford OX1 3PS, UK.

Here we report the Simons Genome Diversity Project data set: high quality genomes from 300 individuals from 142 diverse populations. These genomes include at least 5.8 million base pairs that are not present in the human reference genome. Our analysis reveals key features of the landscape of human genome variation, including that the rate of accumulation of mutations has accelerated by about 5% in non-Africans compared to Africans since divergence. We show that the ancestors of some pairs of present-day human populations were substantially separated by 100,000 years ago, well before the archaeologically attested onset of behavioural modernity. We also demonstrate that indigenous Australians, New Guineans and Andamanese do not derive substantial ancestry from an early dispersal of modern humans; instead, their modern human ancestry is consistent with coming from the same source as that of other non-Africans.

DOI: http://dx.doi.org/10.1038/nature18964
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5161557
October 2016

Recommendations for open data science.

Gigascience 2016 18;5:22. Epub 2016 May 18.

Data Science and Data Engineering, Broad Institute of MIT and Harvard, Cambridge, MA, USA.

Life science research increasingly relies on large-scale computational analyses. However, the code and data used for these analyses are often lacking in publications. To maximize scientific impact, reproducibility, and reuse, it is crucial that these resources are made publicly available and are fully transparent. We provide recommendations for improving the openness of data-driven studies in life sciences.

DOI: http://dx.doi.org/10.1186/s13742-016-0127-4
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4870738
October 2017

Population-Scale Sequencing Data Enable Precise Estimates of Y-STR Mutation Rates.

Am J Hum Genet 2016 May 25;98(5):919-933. Epub 2016 Apr 25.

New York Genome Center, New York, NY 10013, USA; Whitehead Institute for Biomedical Research, 9 Cambridge Center, Cambridge, MA 02139, USA; Department of Computer Science, Fu Foundation School of Engineering, Columbia University, New York, NY 10027, USA; Center for Computational Biology and Bioinformatics, Columbia University, New York, NY 10032, USA.

Short tandem repeats (STRs) are mutation-prone loci that span nearly 1% of the human genome. Previous studies have estimated the mutation rates of highly polymorphic STRs by using capillary electrophoresis and pedigree-based designs. Although this work has provided insights into the mutational dynamics of highly mutable STRs, the mutation rates of most others remain unknown. Here, we harnessed whole-genome sequencing data to estimate the mutation rates of Y chromosome STRs (Y-STRs) with 2-6 bp repeat units that are accessible to Illumina sequencing. We genotyped 4,500 Y-STRs by using data from the 1000 Genomes Project and the Simons Genome Diversity Project. Next, we developed MUTEA, an algorithm that infers STR mutation rates from population-scale data by using a high-resolution SNP-based phylogeny. After extensive intrinsic and extrinsic validations, we harnessed MUTEA to derive mutation-rate estimates for 702 polymorphic STRs by tracing each locus over 222,000 meioses, resulting in the largest collection of Y-STR mutation rates to date. Using our estimates, we identified determinants of STR mutation rates and built a model to predict rates for STRs across the genome. These predictions indicate that the load of de novo STR mutations is at least 75 mutations per generation, rivaling the load of all other known variant types. Finally, we identified Y-STRs with potential applications in forensics and genetic genealogy, assessed the ability to differentiate between the Y chromosomes of father-son pairs, and imputed Y-STR genotypes.

DOI: http://dx.doi.org/10.1016/j.ajhg.2016.04.001
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4863667
May 2016

Punctuated bursts in human male demography inferred from 1,244 worldwide Y-chromosome sequences.

Nat Genet 2016 Jun 25;48(6):593-9. Epub 2016 Apr 25.

Department of Genetics, Harvard Medical School, Boston, Massachusetts, USA.

We report the sequences of 1,244 human Y chromosomes randomly ascertained from 26 worldwide populations by the 1000 Genomes Project. We discovered more than 65,000 variants, including single-nucleotide variants, multiple-nucleotide variants, insertions and deletions, short tandem repeats, and copy number variants. Of these, copy number variants contribute the greatest predicted functional impact. We constructed a calibrated phylogenetic tree on the basis of binary single-nucleotide variants and projected the more complex variants onto it, estimating the number of mutations for each class. Our phylogeny shows bursts of extreme expansion in male numbers that have occurred independently among each of the five continental superpopulations examined, at times of known migrations and technological innovations.

Source
http://dx.doi.org/10.1038/ng.3559 (DOI Listing)
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4884158 (PMC)
June 2016

Polymorphic tandem repeats within gene promoters act as modifiers of gene expression and DNA methylation in humans.

Nucleic Acids Res 2016 05 7;44(8):3750-62. Epub 2016 Apr 7.

Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA.

Despite representing an important source of genetic variation, tandem repeats (TRs) remain poorly studied due to technical difficulties. We hypothesized that TRs can operate as expression (eQTLs) and methylation (mQTLs) quantitative trait loci. To test this, we analyzed the effect of variation at 4849 promoter-associated TRs, genotyped in 120 individuals, on neighboring gene expression and DNA methylation. Polymorphic promoter TRs were associated with increased variance in local gene expression and DNA methylation, suggesting functional consequences related to TR variation. We identified >100 TRs associated with expression/methylation levels of adjacent genes. These potential eQTL/mQTL TRs were enriched for overlaps with transcription factor binding and DNaseI hypersensitivity sites, providing a rationale for their effects. Moreover, we showed that most TR variants are poorly tagged by nearby single nucleotide polymorphism (SNP) markers, indicating that many functional TR variants are not effectively assayed by SNP-based approaches. Our study assigns biological significance to TR variations in the human genome, and suggests that a significant fraction of TR variations exert functional effects via alterations of local gene expression or epigenetics. We conclude that targeted studies that focus on genotyping TR variants are required to fully ascertain functional variation in the genome.

Source
http://dx.doi.org/10.1093/nar/gkw219 (DOI Listing)
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4857002 (PMC)
May 2016

Abundant contribution of short tandem repeats to gene expression variation in humans.

Nat Genet 2016 Jan 7;48(1):22-9. Epub 2015 Dec 7.

Whitehead Institute for Biomedical Research, Cambridge, Massachusetts, USA.

The contribution of repetitive elements to quantitative human traits is largely unknown. Here we report a genome-wide survey of the contribution of short tandem repeats (STRs), which constitute one of the most polymorphic and abundant repeat classes, to gene expression in humans. Our survey identified 2,060 significant expression STRs (eSTRs). These eSTRs were replicable in orthogonal populations and expression assays. We used variance partitioning to disentangle the contribution of eSTRs from that of linked SNPs and indels and found that eSTRs contribute 10-15% of the cis heritability mediated by all common variants. Further functional genomic analyses showed that eSTRs are enriched in conserved regions, colocalize with regulatory elements and may modulate certain histone modifications. By analyzing known genome-wide association study (GWAS) signals and searching for new associations in 1,685 whole genomes from deeply phenotyped individuals, we found that eSTRs are enriched in various clinically relevant conditions. These results highlight the contribution of STRs to the genetic architecture of quantitative human traits.

Source
http://dx.doi.org/10.1038/ng.3461 (DOI Listing)
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4909355 (PMC)
January 2016

EWS-FLI1 utilizes divergent chromatin remodeling mechanisms to directly activate or repress enhancer elements in Ewing sarcoma.

Cancer Cell 2014 Nov 30;26(5):668-681. Epub 2014 Oct 30.

Department of Pathology and Center for Cancer Research, Massachusetts General Hospital and Harvard Medical School, Boston, MA, USA.

The aberrant transcription factor EWS-FLI1 drives Ewing sarcoma, but its molecular function is not completely understood. We find that EWS-FLI1 reprograms gene regulatory circuits in Ewing sarcoma by directly inducing or repressing enhancers. At GGAA repeat elements, which lack evolutionary conservation and regulatory potential in other cell types, EWS-FLI1 multimers induce chromatin opening and create de novo enhancers that physically interact with target promoters. Conversely, EWS-FLI1 inactivates conserved enhancers containing canonical ETS motifs by displacing wild-type ETS transcription factors. These divergent chromatin-remodeling patterns repress tumor suppressors and mesenchymal lineage regulators while activating oncogenes and potential therapeutic targets, such as the kinase VRK1. Our findings demonstrate how EWS-FLI1 establishes an oncogenic regulatory program governing both tumor survival and differentiation.

Source
http://dx.doi.org/10.1016/j.ccell.2014.10.004 (DOI Listing)
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4492343 (PMC)
November 2014

PyBamView: a browser-based application for viewing short read alignments.

Authors:
Melissa Gymrek

Bioinformatics 2014 Dec 21;30(23):3405-7. Epub 2014 Aug 21.

Whitehead Institute for Biomedical Research, 9 Cambridge Center, Cambridge, MA 02142, Harvard-MIT Division of Health Sciences and Technology, MIT, Cambridge, MA 02139 and Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA.

Current sequence alignment browsers allow visualization of large and complex next-generation sequencing datasets. However, most of these tools provide inadequate display of insertions and can be cumbersome to use on large datasets. I implemented PyBamView, a lightweight Web application for visualizing short read alignments. It provides an easy-to-use Web interface for viewing alignments across multiple samples, with a focus on accurate visualization of insertions.

Availability and Implementation: PyBamView is available as a standard Python package. The source code is freely available under the MIT license at https://mgymrek.github.io/pybamview.

Source
http://dx.doi.org/10.1093/bioinformatics/btu565 (DOI Listing)
December 2014

The landscape of human STR variation.

Genome Res 2014 Nov 18;24(11):1894-904. Epub 2014 Aug 18.

Whitehead Institute for Biomedical Research, Cambridge, Massachusetts 02142, USA.

Short tandem repeats are among the most polymorphic loci in the human genome. These loci play a role in the etiology of a range of genetic diseases and have been frequently utilized in forensics, population genetics, and genetic genealogy. Despite this plethora of applications, little is known about the variation of most STRs in the human population. Here, we report the largest-scale analysis of human STR variation to date. We collected information for nearly 700,000 STR loci across more than 1000 individuals in Phase 1 of the 1000 Genomes Project. Extensive quality controls show that reliable allelic spectra can be obtained for close to 90% of the STR loci in the genome. We utilize this call set to analyze determinants of STR variation, assess the human reference genome's representation of STR alleles, find STR loci with common loss-of-function alleles, and obtain initial estimates of the linkage disequilibrium between STRs and common SNPs. Overall, these analyses further elucidate the scale of genetic variation beyond classical point mutations.

Source
http://dx.doi.org/10.1101/gr.177774.114 (DOI Listing)
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4216929 (PMC)
November 2014

OTX2 duplication is implicated in hemifacial microsomia.

PLoS One 2014 9;9(5):e96788. Epub 2014 May 9.

Whitehead Institute for Biomedical Research, Cambridge, Massachusetts, United States of America.

Hemifacial microsomia (HFM) is the second most common facial anomaly after cleft lip and palate. The phenotype is highly variable and most cases are sporadic. We investigated the disorder in a large pedigree with five affected individuals spanning eight meioses. Whole-exome sequencing results indicated the absence of a pathogenic coding point mutation. A genome-wide survey of segmental variations identified a 1.3 Mb duplication of chromosome 14q22.3 in all affected individuals that was absent in more than 1000 chromosomes of ethnically matched controls. The duplication was absent in seven additional sporadic HFM cases, which is consistent with the known heterogeneity of the disorder. To find the critical gene in the duplicated region, we analyzed signatures of human craniofacial disease networks, mouse expression data, and predictions of dosage sensitivity. All of these approaches implicated OTX2 as the most likely causal gene. Moreover, OTX2 is a known oncogenic driver in medulloblastoma, a condition that was diagnosed in the proband during the course of the study. Our findings suggest a role for OTX2 dosage sensitivity in human craniofacial development and raise the possibility of a shared etiology between a subtype of hemifacial microsomia and medulloblastoma.

Source
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0096788 (PLOS)
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4016008 (PMC)
January 2015

LoQAtE--Localization and Quantitation ATlas of the yeast proteomE. A new tool for multiparametric dissection of single-protein behavior in response to biological perturbations in yeast.

Nucleic Acids Res 2014 Jan 22;42(Database issue):D726-30. Epub 2013 Oct 22.

Department of Molecular Genetics, Weizmann Institute of Science, Rehovot 7610001, Israel and Whitehead Institute for Biomedical Research, Nine Cambridge Center, Cambridge, MA 02142, USA.

Living organisms change their proteome dramatically to sustain a stable internal milieu in fluctuating environments. To study the dynamics of proteins during stress, we measured the localization and abundance of the Saccharomyces cerevisiae proteome under various growth conditions and genetic backgrounds using the GFP collection. We created a database (DB) called 'LoQAtE' (Localization and Quantitation Atlas of the yeast proteomE), available online at http://www.weizmann.ac.il/molgen/loqate/, to provide easy access to these data. Using LoQAtE DB, users can get a profile of changes for proteins of interest as well as query advanced intersections by abundance changes, primary localization or localization shifts over the tested conditions. Currently, the DB hosts information on 5330 yeast proteins under three external perturbations (DTT, H₂O₂ and nitrogen starvation) and two genetic mutations [in the chaperonin containing TCP1 (CCT) complex and in the proteasome]. Additional conditions will be uploaded regularly. The data demonstrate hundreds of localization and abundance changes, many of which were not detected at the level of mRNA. LoQAtE is designed to allow easy navigation for non-experts in high-content microscopy and data are available for download. These data should open up new perspectives on the significant role of proteins while combating external and internal fluctuations.

Source
http://dx.doi.org/10.1093/nar/gkt933 (DOI Listing)
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3965041 (PMC)
January 2014

Profiling short tandem repeats from short reads.

Methods Mol Biol 2013;1038:113-35.

Harvard-MIT Division of Health Sciences and Technology, MIT, Cambridge, MA, USA.

Short tandem repeats (STRs), also known as microsatellites, have a wide range of applications, including medical genetics, forensics, and population genetics. High-throughput sequencing has the potential to profile large numbers of STRs, but cumbersome gapped alignment and STR-specific noise patterns hamper this task. We recently developed an algorithm, called lobSTR, to overcome these challenges and to accurately profile STRs from short reads. Here we describe how to use lobSTR to call STR variations from high-throughput sequencing datasets and to diagnose the quality of the calls.

Source
http://dx.doi.org/10.1007/978-1-62703-514-9_7 (DOI Listing)
February 2014