Publications by authors named "Eric Banks"

45 Publications

A validated lineage-derived somatic truth data set enables benchmarking in cancer genome analysis.

Commun Biol 2020 Dec 8;3(1):744. Epub 2020 Dec 8.

Broad Institute of Harvard and MIT, Cambridge, MA, USA.

Existing cancer benchmark data sets for human sequencing data use germline variants, synthetic methods, or expensive validations, none of which are satisfactory for providing a large collection of true somatic variation across a whole genome. Here we propose a data set, Lineage derived Somatic Truth (LinST), of short somatic mutations in the HT115 colon cancer cell-line, that are validated using a known cell lineage that includes thousands of mutations and a high confidence region covering 2.7 gigabases per sample.
DOI: http://dx.doi.org/10.1038/s42003-020-01460-9
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7722876
December 2020

The mutational constraint spectrum quantified from variation in 141,456 humans.

Nature 2020 May 27;581(7809):434-443. Epub 2020 May 27.

Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA.

Genetic variants that inactivate protein-coding genes are a powerful source of information about the phenotypic consequences of gene disruption: genes that are crucial for the function of an organism will be depleted of such variants in natural populations, whereas non-essential genes will tolerate their accumulation. However, predicted loss-of-function variants are enriched for annotation errors, and tend to be found at extremely low frequencies, so their analysis requires careful variant annotation and very large sample sizes. Here we describe the aggregation of 125,748 exomes and 15,708 genomes from human sequencing studies into the Genome Aggregation Database (gnomAD). We identify 443,769 high-confidence predicted loss-of-function variants in this cohort after filtering for artefacts caused by sequencing and annotation errors. Using an improved model of human mutation rates, we classify human protein-coding genes along a spectrum that represents tolerance to inactivation, validate this classification using data from model organisms and engineered human cells, and show that it can be used to improve the power of gene discovery for both common and rare diseases.
DOI: http://dx.doi.org/10.1038/s41586-020-2308-7
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7334197
May 2020

A structural variation reference for medical and population genetics.

Nature 2020 May 27;581(7809):444-451. Epub 2020 May 27.

Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA.

Structural variants (SVs) rearrange large segments of DNA and can have profound consequences in evolution and human disease. As national biobanks, disease-association studies, and clinical genetic testing have grown increasingly reliant on genome sequencing, population references such as the Genome Aggregation Database (gnomAD) have become integral in the interpretation of single-nucleotide variants (SNVs). However, there are no reference maps of SVs from high-coverage genome sequencing comparable to those for SNVs. Here we present a reference of sequence-resolved SVs constructed from 14,891 genomes across diverse global populations (54% non-European) in gnomAD. We discovered a rich and complex landscape of 433,371 SVs, from which we estimate that SVs are responsible for 25-29% of all rare protein-truncating events per genome. We found strong correlations between natural selection against damaging SNVs and rare SVs that disrupt or duplicate protein-coding sequence, which suggests that genes that are highly intolerant to loss-of-function are also sensitive to increased dosage. We also uncovered modest selection against noncoding SVs in cis-regulatory elements, although selection against protein-truncating SVs was stronger than all noncoding effects. Finally, we identified very large (over one megabase), rare SVs in 3.9% of samples, and estimate that 0.13% of individuals may carry an SV that meets the existing criteria for clinically important incidental findings. This SV resource is freely distributed via the gnomAD browser and will have broad utility in population genetics, disease-association studies, and diagnostic screening.
DOI: http://dx.doi.org/10.1038/s41586-020-2287-8
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7334194
May 2020

Lean and deep models for more accurate filtering of SNP and INDEL variant calls.

Bioinformatics 2020 Apr;36(7):2060-2067

Data Sciences Platform, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA.

Summary: We investigate convolutional neural networks (CNNs) for filtering small genomic variants in short-read DNA sequence data. Errors created during sequencing and library preparation make variant calling a difficult task. Encoding the reference genome and aligned reads covering sites of genetic variation as numeric tensors allows us to leverage CNNs for variant filtration. Convolutions over these tensors learn to detect motifs useful for classifying variants. Variant filtering models are trained to classify variants as artifacts or real variation. Visualizing the learned weights of the CNN confirmed it detects familiar DNA motifs known to correlate with real variation, like homopolymers and short tandem repeats (STR). After confirmation of the biological plausibility of the learned features we compared our model to current state-of-the-art filtration methods like Gaussian Mixture Models, Random Forests and CNNs designed for image classification, like DeepVariant. We demonstrate improvements in both sensitivity and precision. The tensor encoding was carefully tailored for processing genomic data, respecting the qualitative differences in structure between DNA and natural images. Ablation tests quantitatively measured the benefits of our tensor encoding strategy. Bayesian hyper-parameter optimization confirmed our notion that architectures designed with DNA data in mind outperform off-the-shelf image classification models. Our cross-generalization analysis identified idiosyncrasies in truth resources pointing to the need for new methods to construct genomic truth data. Our results show that models trained on heterogenous data types and diverse truth resources generalize well to new datasets, negating the need to train separate models for each data type.

Availability And Implementation: This work is available in the Genome Analysis Toolkit (GATK) with the tool name CNNScoreVariants (https://github.com/broadinstitute/gatk).

Supplementary Information: Supplementary data are available at Bioinformatics online.
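
The tensor encoding described in the summary above can be illustrated with a minimal sketch. It is not the encoding used by CNNScoreVariants; the function names, the four-channel one-hot layout, and the `(offset, sequence)` read representation are all assumptions chosen for illustration.

```python
BASES = "ACGT"

def one_hot(base):
    """Return a 4-element one-hot vector; other symbols (e.g. N) map to all zeros."""
    return [1.0 if base == b else 0.0 for b in BASES]

def encode_pileup(reference, reads):
    """Stack the reference and each aligned read into a (rows, positions, 4) nested list.

    `reads` is a list of (offset, sequence) pairs, where offset is the read's
    start position relative to the reference window (hypothetical format).
    Positions a read does not cover are left as all-zero vectors.
    """
    width = len(reference)
    tensor = [[one_hot(b) for b in reference]]  # row 0 holds the reference
    for offset, seq in reads:
        row = [[0.0] * 4 for _ in range(width)]
        for i, base in enumerate(seq):
            pos = offset + i
            if 0 <= pos < width:
                row[pos] = one_hot(base)
        tensor.append(row)
    return tensor

# Two toy reads over a six-base window; the second read carries a T>A mismatch.
tensor = encode_pileup("ACGTAC", [(0, "ACGT"), (2, "GAAC")])
```

A convolution sliding over the position axis of such a tensor sees the reference and the reads side by side, which is the property the abstract relies on for learning sequence motifs.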
DOI: http://dx.doi.org/10.1093/bioinformatics/btz901
April 2020

Functional equivalence of genome sequencing analysis pipelines enables harmonized variant calling across human genetics projects.

Nat Commun 2018 Oct 2;9(1):4038. Epub 2018 Oct 2.

McDonnell Genome Institute, Washington University School of Medicine, St. Louis, 63108, MO, USA.

Hundreds of thousands of human whole genome sequencing (WGS) datasets will be generated over the next few years. These data are more valuable in aggregate: joint analysis of genomes from many sources increases sample size and statistical power. A central challenge for joint analysis is that different WGS data processing pipelines cause substantial differences in variant calling in combined datasets, necessitating computationally expensive reprocessing. This approach is no longer tenable given the scale of current studies and data volumes. Here, we define WGS data processing standards that allow different groups to produce functionally equivalent (FE) results, yet still innovate on data processing pipelines. We present initial FE pipelines developed at five genome centers and show that they yield similar variant calling results and produce significantly less variability than sequencing replicates. This work alleviates a key technical bottleneck for genome aggregation and helps lay the foundation for community-wide human genetics studies.
DOI: http://dx.doi.org/10.1038/s41467-018-06159-4
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6168605
October 2018

Sequence data and association statistics from 12,940 type 2 diabetes cases and controls.

Sci Data 2017 Dec 19;4:170179. Epub 2017 Dec 19.

Wellcome Trust Centre for Human Genetics, Nuffield Department of Medicine, University of Oxford, Oxford, UK.

To investigate the genetic basis of type 2 diabetes (T2D) to high resolution, the GoT2D and T2D-GENES consortia catalogued variation from whole-genome sequencing of 2,657 European individuals and exome sequencing of 12,940 individuals of multiple ancestries. Over 27M SNPs, indels, and structural variants were identified, including 99% of low-frequency (minor allele frequency [MAF] 0.1-5%) non-coding variants in the whole-genome sequenced individuals and 99.7% of low-frequency coding variants in the whole-exome sequenced individuals. Each variant was tested for association with T2D in the sequenced individuals, and, to increase power, most were tested in larger numbers of individuals (>80% of low-frequency coding variants in ~82 K Europeans via the exome chip, and ~90% of low-frequency non-coding variants in ~44 K Europeans via genotype imputation). The variants, genotypes, and association statistics from these analyses provide the largest reference to date of human genetic information relevant to T2D, for use in activities such as T2D-focused genotype imputation, functional characterization of variants or genes, and other novel analyses to detect associations between sequence variation and T2D.
DOI: http://dx.doi.org/10.1038/sdata.2017.179
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5735917
December 2017

A Low-Frequency Inactivating Variant Enriched in the Finnish Population Is Associated With Fasting Insulin Levels and Type 2 Diabetes Risk.

Diabetes 2017 Jul 24;66(7):2019-2032. Epub 2017 Mar 24.

Diabetes and Endocrinology Unit, Department of Clinical Sciences Malmö, Lund University Diabetes Centre, Malmö, Sweden.

To identify novel coding association signals and facilitate characterization of mechanisms influencing glycemic traits and type 2 diabetes risk, we analyzed 109,215 variants derived from exome array genotyping together with an additional 390,225 variants from exome sequence in up to 39,339 normoglycemic individuals from five ancestry groups. We identified a novel association between the coding variant (p.Pro50Thr) in AKT2 and fasting plasma insulin (FI), a gene in which rare fully penetrant mutations are causal for monogenic glycemic disorders. The low-frequency allele is associated with a 12% increase in FI levels. This variant is present at 1.1% frequency in Finns but virtually absent in individuals from other ancestries. Carriers of the FI-increasing allele had increased 2-h insulin values, decreased insulin sensitivity, and increased risk of type 2 diabetes (odds ratio 1.05). In cellular studies, the AKT2-Thr50 protein exhibited a partial loss of function. We extend the allelic spectrum for coding variants in AKT2 associated with disorders of glucose homeostasis and demonstrate bidirectional effects of variants within the pleckstrin homology domain of AKT2.
DOI: http://dx.doi.org/10.2337/db16-1329
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5482074
July 2017

A framework for the detection of de novo mutations in family-based sequencing data.

Eur J Hum Genet 2017 Feb 23;25(2):227-233. Epub 2016 Nov 23.

Department of Medical Genetics, Center for Molecular Medicine, University Medical Center Utrecht, Utrecht, The Netherlands.

Germline mutation detection from human DNA sequence data is challenging due to the rarity of such events relative to the intrinsic error rates of sequencing technologies and the uneven coverage across the genome. We developed PhaseByTransmission (PBT) to identify de novo single nucleotide variants and short insertions and deletions (indels) from sequence data collected in parent-offspring trios. We compute the joint probability of the data given the genotype likelihoods in the individual family members, the known familial relationships and a prior probability for the mutation rate. Candidate de novo mutations (DNMs) are reported along with their posterior probability, providing a systematic way to prioritize them for validation. Our tool is integrated in the Genome Analysis Toolkit and can be used together with the ReadBackedPhasing module to infer the parental origin of DNMs based on phase-informative reads. Using simulated data, we show that PBT outperforms existing tools, especially in low coverage data and on the X chromosome. We further show that PBT displays high validation rates on empirical parent-offspring sequencing data for whole-exome data from 104 trios and X-chromosome data from 249 parent-offspring families. Finally, we demonstrate an association between father's age at conception and the number of DNMs in female offspring's X chromosome, consistent with previous literature reports.
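
The posterior calculation described in the abstract above can be sketched for a single biallelic site in one trio. This is an illustrative simplification, not the PhaseByTransmission implementation: the function names, the flat prior over parental genotypes, and the per-allele mutation model are assumptions. Genotypes are coded as the number of alternate alleles (0, 1, 2), and each likelihood argument is a three-element list indexed by that count.

```python
from itertools import product

def transmit_alt(genotype, mu):
    """Probability that a parent with `genotype` alt alleles passes on an alt allele,
    allowing the transmitted allele to flip with probability mu (assumed model)."""
    p = genotype / 2.0
    return p * (1.0 - mu) + (1.0 - p) * mu

def child_given_parents(child, mother, father, mu):
    """P(child genotype | parental genotypes) under the per-allele mutation model."""
    qm, qf = transmit_alt(mother, mu), transmit_alt(father, mu)
    return [(1 - qm) * (1 - qf), qm * (1 - qf) + (1 - qm) * qf, qm * qf][child]

def de_novo_posterior(lik_mother, lik_father, lik_child, mu=1e-8):
    """Posterior mass on trio genotype configurations that violate Mendelian inheritance."""
    total = 0.0
    de_novo = 0.0
    for m, f, c in product(range(3), repeat=3):
        weight = (lik_mother[m] * lik_father[f] * lik_child[c]
                  * child_given_parents(c, m, f, mu))
        total += weight
        # A configuration counts as de novo if it is impossible without mutation.
        if child_given_parents(c, m, f, 0.0) == 0.0:
            de_novo += weight
    return de_novo / total

# Toy likelihoods: both parents confidently homozygous reference, child heterozygous.
confident_ref = [1.0, 1e-12, 1e-20]
het_child = [1e-12, 1.0, 1e-12]
posterior = de_novo_posterior(confident_ref, confident_ref, het_child)
```

Because the mutation prior is tiny, the posterior only becomes large when the parental likelihoods rule out an inherited allele very strongly, which is why ranking candidates by posterior helps prioritize them for validation.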
DOI: http://dx.doi.org/10.1038/ejhg.2016.147
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5255947
February 2017

Analysis of protein-coding genetic variation in 60,706 humans.

Nature 2016 Aug;536(7616):285-91

Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA.

Large-scale reference data sets of human genetic variation are critical for the medical and functional interpretation of DNA sequence changes. Here we describe the aggregation and analysis of high-quality exome (protein-coding region) DNA sequence data for 60,706 individuals of diverse ancestries generated as part of the Exome Aggregation Consortium (ExAC). This catalogue of human genetic diversity contains an average of one variant every eight bases of the exome, and provides direct evidence for the presence of widespread mutational recurrence. We have used this catalogue to calculate objective metrics of pathogenicity for sequence variants, and to identify genes subject to strong selection against various classes of mutation; identifying 3,230 genes with near-complete depletion of predicted protein-truncating variants, with 72% of these genes having no currently established human disease phenotype. Finally, we demonstrate that these data can be used for the efficient filtering of candidate disease-causing variants, and for the discovery of human 'knockout' variants in protein-coding genes.
DOI: http://dx.doi.org/10.1038/nature19057
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5018207
August 2016

The genetic architecture of type 2 diabetes.

Nature 2016 Aug 11;536(7614):41-47. Epub 2016 Jul 11.

Wellcome Trust Centre for Human Genetics, Nuffield Department of Medicine, University of Oxford, Oxford, UK.

The genetic architecture of common traits, including the number, frequency, and effect sizes of inherited variants that contribute to individual risk, has been long debated. Genome-wide association studies have identified scores of common variants associated with type 2 diabetes, but in aggregate, these explain only a fraction of the heritability of this disease. Here, to test the hypothesis that lower-frequency variants explain much of the remainder, the GoT2D and T2D-GENES consortia performed whole-genome sequencing in 2,657 European individuals with and without diabetes, and exome sequencing in 12,940 individuals from five ancestry groups. To increase statistical power, we expanded the sample size via genotyping and imputation in a further 111,548 subjects. Variants associated with type 2 diabetes after sequencing were overwhelmingly common and most fell within regions previously identified by genome-wide association studies. Comprehensive enumeration of sequence variation is necessary to identify functional alleles that provide important clues to disease pathophysiology, but large-scale sequencing does not support the idea that lower-frequency variants have a major role in predisposition to type 2 diabetes.
DOI: http://dx.doi.org/10.1038/nature18642
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5034897
August 2016

Connexin Controls Cell-Cycle Exit and Cell Differentiation by Directly Promoting Cytosolic Localization and Degradation of E3 Ligase Skp2.

Dev Cell 2015 Nov 12;35(4):483-96. Epub 2015 Nov 12.

Department of Biochemistry, University of Texas Health Science Center, San Antonio, TX 78229-3900, USA.

Connexins and connexin channels play important roles in cell growth/differentiation and tumorigenesis. Here, we identified a relationship between a connexin molecule and a critical cell-cycle regulator. Our data show that connexin (Cx) 50 regulated lens cell-cycle progression and differentiation by modulating expression of cyclin-dependent kinase inhibitor p27/p57 and E3 ubiquitin ligase Skp2. Cx50 directly interacted with and retained Skp2 in the cytosol by masking the nuclear targeting domain of Skp2, and this effect was supported by an increased nuclear localization of Skp2, disruption of Skp2 interaction with importin-7, and decreased levels of p27/p57 in mouse lenses lacking Cx50. As a result, Cx50 increased auto-ubiquitination and subsequent degradation of Skp2. A mutation (V362E) on the C terminus of Cx50 disrupted the interaction between Cx50 and Skp2 and completely abolished such effects. Therefore, this study identifies a role for connexins in regulating cell-cycle modulators and, consequently, cell growth and differentiation.
DOI: http://dx.doi.org/10.1016/j.devcel.2015.10.014
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4957694
November 2015

Tools and best practices for data processing in allelic expression analysis.

Genome Biol 2015 Sep 17;16:195. Epub 2015 Sep 17.

New York Genome Center, New York, NY, USA.

Allelic expression analysis has become important for integrating genome and transcriptome data to characterize various biological phenomena such as cis-regulatory variation and nonsense-mediated decay. We analyze the properties of allelic expression read count data and technical sources of error, such as low-quality or double-counted RNA-seq reads, genotyping errors, allelic mapping bias, and technical covariates due to sample preparation and sequencing, and variation in total read depth. We provide guidelines for correcting such errors, show that our quality control measures improve the detection of relevant allelic expression, and introduce tools for the high-throughput production of allelic expression data from RNA-sequencing data.
DOI: http://dx.doi.org/10.1186/s13059-015-0762-6
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4574606
September 2015

Human genomics. Effect of predicted protein-truncating genetic variants on the human transcriptome.

Science 2015 May;348(6235):666-9

Broad Institute of MIT and Harvard, Cambridge, MA, USA. Analytical and Translational Genetics Unit, Massachusetts General Hospital, Boston, MA, USA.

Accurate prediction of the functional effect of genetic variation is critical for clinical genome interpretation. We systematically characterized the transcriptome effects of protein-truncating variants, a class of variants expected to have profound effects on gene function, using data from the Genotype-Tissue Expression (GTEx) and Geuvadis projects. We quantitated tissue-specific and positional effects on nonsense-mediated transcript decay and present an improved predictive model for this decay. We directly measured the effect of variants both proximal and distal to splice junctions. Furthermore, we found that robustness to heterozygous gene inactivation is not due to dosage compensation. Our results illustrate the value of transcriptome data in the functional interpretation of genetic variants.
DOI: http://dx.doi.org/10.1126/science.1261877
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4537935
May 2015

The distribution and mutagenesis of short coding INDELs from 1,128 whole exomes.

BMC Genomics 2015 Feb 28;16:143. Epub 2015 Feb 28.

Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, 77030, USA.

Background: Identifying insertion/deletion polymorphisms (INDELs) with high confidence has been intrinsically challenging in short-read sequencing data. Here we report our approach for improving INDEL calling accuracy by using a machine learning algorithm to combine call sets generated with three independent methods, and by leveraging the strengths of each individual pipeline. Utilizing this approach, we generated a consensus exome INDEL call set from a large dataset generated by the 1000 Genomes Project (1000G), maximizing both the sensitivity and the specificity of the calls.

Results: This consensus exome INDEL call set features 7,210 INDELs, from 1,128 individuals across 13 populations included in the 1000 Genomes Phase 1 dataset, with a false discovery rate (FDR) of about 7.0%.

Conclusions: In our study we further characterize the patterns and distributions of these exonic INDELs with respect to density, allele length, and site frequency spectrum, as well as the potential mutagenic mechanisms of coding INDELs in humans.
DOI: http://dx.doi.org/10.1186/s12864-015-1333-7
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4352271
February 2015

Microbial diversity in a Venezuelan orthoquartzite cave is dominated by the Chloroflexi (Class Ktedonobacterales) and Thaumarchaeota Group I.1c.

Front Microbiol 2014 26;5:615. Epub 2014 Nov 26.

Biotechnology and Planetary Protection Group, Jet Propulsion Laboratory, California Institute of Technology Pasadena, CA, USA.

The majority of caves are formed within limestone rock and hence our understanding of cave microbiology comes from carbonate-buffered systems. In this paper, we describe the microbial diversity of Roraima Sur Cave (RSC), an orthoquartzite (SiO4) cave within Roraima Tepui, Venezuela. The cave contains a high level of microbial activity when compared with other cave systems, as determined by an ATP-based luminescence assay and cell counting. Molecular phylogenetic analysis of microbial diversity within the cave demonstrates the dominance of Actinomycetales and Alphaproteobacteria in endolithic bacterial communities close to the entrance, while communities from deeper in the cave are dominated (82-84%) by a unique clade of Ktedonobacterales within the Chloroflexi. While members of this phylum are commonly found in caves, this is the first identification of members of the Class Ktedonobacterales. An assessment of archaeal species demonstrates the dominance of phylotypes from the Thaumarchaeota Group I.1c (100%), which have previously been associated with acidic environments. While the Thaumarchaeota have been seen in numerous cave systems, the dominance of Group I.1c in RSC is unique and a departure from the traditional archaeal community structure. Geochemical analysis of the cave environment suggests that water entering the cave, rather than the nutrient-limited orthoquartzite rock, provides the carbon and energy necessary for microbial community growth and subsistence, while the poor buffering capacity of quartzite or the low pH of the environment may be selecting for this unusual community structure. Together these data suggest that pH, imparted by the geochemistry of the host rock, can play as important a role in niche-differentiation in caves as in other environmental systems.
DOI: http://dx.doi.org/10.3389/fmicb.2014.00615
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4244709
December 2014

From FastQ data to high confidence variant calls: the Genome Analysis Toolkit best practices pipeline.

Curr Protoc Bioinformatics 2013;43:11.10.1-11.10.33

Genome Sequencing and Analysis Group, Broad Institute, Cambridge, Massachusetts.

This unit describes how to use BWA and the Genome Analysis Toolkit (GATK) to map genome sequencing data to a reference and produce high-quality variant calls that can be used in downstream analyses. The complete workflow includes the core NGS data processing steps that are necessary to make the raw data suitable for analysis by the GATK, as well as the key methods involved in variant discovery using the GATK.
DOI: http://dx.doi.org/10.1002/0471250953.bi1110s43
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4243306
July 2016

A polygenic burden of rare disruptive mutations in schizophrenia.

Nature 2014 Feb 22;506(7487):185-90. Epub 2014 Jan 22.

Departments of Genetics and Psychiatry, University of North Carolina, Chapel Hill, North Carolina 27599-7264, USA.

Schizophrenia is a common disease with a complex aetiology, probably involving multiple and heterogeneous genetic factors. Here, by analysing the exome sequences of 2,536 schizophrenia cases and 2,543 controls, we demonstrate a polygenic burden primarily arising from rare (less than 1 in 10,000), disruptive mutations distributed across many genes. Particularly enriched gene sets include the voltage-gated calcium ion channel and the signalling complex formed by the activity-regulated cytoskeleton-associated scaffold protein (ARC) of the postsynaptic density, sets previously implicated by genome-wide association and copy-number variation studies. Similar to reports in autism, targets of the fragile X mental retardation protein (FMRP, product of FMR1) are enriched for case mutations. No individual gene-based test achieves significance after correction for multiple testing and we do not detect any alleles of moderately low frequency (approximately 0.5 to 1 per cent) and moderately large effect. Taken together, these data suggest that population-based exome sequencing can discover risk alleles and complements established gene-mapping paradigms in neuropsychiatric disease.
DOI: http://dx.doi.org/10.1038/nature12975
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4136494
February 2014

De novo mutations in schizophrenia implicate synaptic networks.

Nature 2014 Feb 22;506(7487):179-84. Epub 2014 Jan 22.

Medical Research Council Centre for Neuropsychiatric Genetics and Genomics, Institute of Psychological Medicine and Clinical Neurosciences, Cardiff University, Cardiff CF24 4HQ, UK.

Inherited alleles account for most of the genetic risk for schizophrenia. However, new (de novo) mutations, in the form of large chromosomal copy number changes, occur in a small fraction of cases and disproportionally disrupt genes encoding postsynaptic proteins. Here we show that small de novo mutations, affecting one or a few nucleotides, are overrepresented among glutamatergic postsynaptic proteins comprising activity-regulated cytoskeleton-associated protein (ARC) and N-methyl-d-aspartate receptor (NMDAR) complexes. Mutations are additionally enriched in proteins that interact with these complexes to modulate synaptic strength, namely proteins regulating actin filament dynamics and those whose messenger RNAs are targets of fragile X mental retardation protein (FMRP). Genes affected by mutations in schizophrenia overlap those mutated in autism and intellectual disability, as do mutation-enriched synaptic pathways. Aligning our findings with a parallel case-control study, we demonstrate reproducible insights into aetiological mechanisms for schizophrenia and reveal pathophysiology shared with other neurodevelopmental disorders.
DOI: http://dx.doi.org/10.1038/nature12929
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4237002
February 2014

Analysis of rare, exonic variation amongst subjects with autism spectrum disorders and population controls.

PLoS Genet 2013 Apr 11;9(4):e1003443. Epub 2013 Apr 11.

Department of Statistics, Carnegie Mellon University, Pittsburgh, Pennsylvania, United States of America.

We report on results from whole-exome sequencing (WES) of 1,039 subjects diagnosed with autism spectrum disorders (ASD) and 870 controls selected from the NIMH repository to be of similar ancestry to cases. The WES data came from two centers using different methods to produce sequence and to call variants from it. Therefore, an initial goal was to ensure the distribution of rare variation was similar for data from different centers. This proved straightforward by filtering called variants by fraction of missing data, read depth, and balance of alternative to reference reads. Results were evaluated using seven samples sequenced at both centers and by results from the association study. Next we addressed how the data and/or results from the centers should be combined. Gene-based analyses of association was an obvious choice, but should statistics for association be combined across centers (meta-analysis) or should data be combined and then analyzed (mega-analysis)? Because of the nature of many gene-based tests, we showed by theory and simulations that mega-analysis has better power than meta-analysis. Finally, before analyzing the data for association, we explored the impact of population structure on rare variant analysis in these data. Like other recent studies, we found evidence that population structure can confound case-control studies by the clustering of rare variants in ancestry space; yet, unlike some recent studies, for these data we found that principal component-based analyses were sufficient to control for ancestry and produce test statistics with appropriate distributions. After using a variety of gene-based tests and both meta- and mega-analysis, we found no new risk genes for ASD in this sample. Our results suggest that standard gene-based tests will require much larger samples of cases and controls before being effective for gene discovery, even for a disorder like ASD.
DOI: http://dx.doi.org/10.1371/journal.pgen.1003443
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3623759
April 2013

A TALEN genome-editing system for generating human stem cell-based disease models.

Cell Stem Cell 2013 Feb 13;12(2):238-51. Epub 2012 Dec 13.

Department of Stem Cell and Regenerative Biology, Harvard University, Cambridge, MA 02138, USA.

Transcription activator-like effector nucleases (TALENs) are a new class of engineered nucleases that are easier to design to cleave at desired sites in a genome than previous types of nucleases. We report here the use of TALENs to rapidly and efficiently generate mutant alleles of 15 genes in cultured somatic cells or human pluripotent stem cells, the latter for which we differentiated both the targeted lines and isogenic control lines into various metabolic cell types. We demonstrate cell-autonomous phenotypes directly linked to disease: dyslipidemia, insulin resistance, hypoglycemia, lipodystrophy, motor-neuron death, and hepatitis C infection. We found little evidence of TALEN off-target effects, but each clonal line nevertheless harbors a significant number of unique mutations. Given the speed and ease with which we were able to derive and characterize these cell lines, we anticipate TALEN-mediated genome editing of human cells becoming a mainstay for the investigation of human biology and disease.

Source
DOI: http://dx.doi.org/10.1016/j.stem.2012.11.011
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3570604
February 2013

Discovery and statistical genotyping of copy-number variation from whole-exome sequencing depth.

Am J Hum Genet 2012 Oct;91(4):597-607

Division of Psychiatric Genomics, Mount Sinai School of Medicine, New York, NY 10029, USA.

Sequencing of gene-coding regions (the exome) is increasingly used for studying human disease, for which copy-number variants (CNVs) are a critical genetic component. However, detecting copy number from exome sequencing is challenging because of the noncontiguous nature of the captured exons. This is compounded by the complex relationship between read depth and copy number; this results from biases in targeted genomic hybridization, sequence factors such as GC content, and batching of samples during collection and sequencing. We present a statistical tool (exome hidden Markov model [XHMM]) that uses principal-component analysis (PCA) to normalize exome read depth and a hidden Markov model (HMM) to discover exon-resolution CNV and genotype variation across samples. We evaluate performance on 90 schizophrenia trios and 1,017 case-control samples. XHMM detects a median of two rare (<1%) CNVs per individual (one deletion and one duplication) and has 79% sensitivity to similarly rare CNVs overlapping three or more exons discovered with microarrays. With sensitivity similar to state-of-the-art methods, XHMM achieves higher specificity by assigning quality metrics to the CNV calls to filter out bad ones, as well as to statistically genotype the discovered CNV in all individuals, yielding a trio call set with Mendelian-inheritance properties highly consistent with expectation. We also show that XHMM breakpoint quality scores enable researchers to explicitly search for novel classes of structural variation. For example, we apply XHMM to extract those CNVs that are highly likely to disrupt (delete or duplicate) only a portion of a gene.
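
To make the HMM step concrete, here is a minimal Python sketch of Viterbi decoding over per-exon read-depth z-scores with three states (deletion, diploid, duplication). It is a toy model under stated assumptions, not the XHMM implementation: the z-scores are assumed to be already normalized (the PCA step is omitted), and the state means, the unit variance, and the transition parameters `p` and `q` are all illustrative values chosen here.

```python
import math

STATES = ("DEL", "DIP", "DUP")
# Assumed mean z-score per state (illustrative values).
MEANS = {"DEL": -3.0, "DIP": 0.0, "DUP": 3.0}


def log_emission(state, z):
    """Unit-variance Gaussian log density, up to a shared constant."""
    return -0.5 * (z - MEANS[state]) ** 2


def log_transition(prev, cur, p=1e-3, q=0.8):
    """Log transition probability; p and q are hypothetical parameters.

    p: chance of entering a deletion (or duplication) from diploid.
    q: chance of staying inside a CNV at the next exon.
    """
    if prev == "DIP":
        return math.log(1 - 2 * p) if cur == "DIP" else math.log(p)
    if cur == prev:
        return math.log(q)
    if cur == "DIP":
        return math.log(1 - q)
    return -math.inf  # no direct DEL <-> DUP jump in this toy model


def viterbi(zscores):
    """Return the most likely state path for a list of exon z-scores."""
    if not zscores:
        return []
    # Treat the position before the first exon as diploid.
    score = {s: log_transition("DIP", s) + log_emission(s, zscores[0])
             for s in STATES}
    back = []
    for z in zscores[1:]:
        new_score, pointers = {}, {}
        for cur in STATES:
            best_prev = max(
                STATES, key=lambda prev: score[prev] + log_transition(prev, cur))
            new_score[cur] = (score[best_prev]
                              + log_transition(best_prev, cur)
                              + log_emission(cur, z))
            pointers[cur] = best_prev
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path
```

Because entering a CNV state is penalized, a single noisy exon tends to stay diploid, while a run of consistently low or high z-scores is called as a deletion or duplication.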

Source
DOI: http://dx.doi.org/10.1016/j.ajhg.2012.08.005
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3484655
October 2012

Reducing multiples: a mathematical formula that accurately predicts rates of singletons, twins, and higher-order multiples in women undergoing in vitro fertilization.

Fertil Steril 2012 Dec 15;98(6):1474-80.e2. Epub 2012 Sep 15.

Ronald O. Perelman and Claudia Cohen Center for Reproductive Medicine, Weill Cornell Medical College, New York, New York 10021, USA.

Objective: To develop a mathematical formula that accurately predicts the probability of a singleton, twin, and higher-order multiple pregnancy according to implantation rate and number of embryos transferred.

Design: A total of 12,003 IVF cycles from a single center resulting in ET were analyzed. Using mathematical modeling, we developed a formula, the Combined Formula, and tested its ability to accurately predict outcomes.

Setting: Academic hospital.

Patient(s): Patients undergoing IVF.

Intervention(s): None.

Main Outcome Measure(s): Goodness of fit of data from our center and previously published data to the Combined Formula and three previous mathematical models.

Result(s): The Combined Formula predicted the probability of singleton, twin, and higher-order pregnancies more accurately than three previous formulas (1.4% vs. 2.88%, 4.02%, and 5%, respectively) and accurately predicted outcomes from five previously published studies from other centers. An online applet is provided (https://secure.ivf.org/ivf-calculator.html).

Conclusion(s): The probability of pregnancy with singletons, twins, and higher-order multiples according to number of embryos transferred is predictable and not random and can be accurately modeled using the Combined Formula. The embryo itself is the major predictor of pregnancy outcomes, but there is an influence from "barriers," such as the endometrium and collaboration between embryos (embryo-embryo interaction). This model can be used to guide the decision regarding number of embryos to transfer after IVF.
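
The abstract does not reproduce the Combined Formula, so the Python sketch below shows only the simplest baseline that such a model would be compared against: a binomial distribution in which each transferred embryo implants independently with the same probability. This independence assumption is ours for illustration; per the abstract, the Combined Formula additionally reflects "barriers" and embryo-embryo interaction, which this sketch ignores. The function and parameter names are hypothetical.

```python
from math import comb


def implantation_distribution(n_embryos, implantation_rate):
    """Return [P(0 implant), P(1 implants), ..., P(n implant)].

    Plain binomial baseline: embryos implant independently, each with
    probability `implantation_rate`. Not the paper's Combined Formula.
    """
    if n_embryos < 0:
        raise ValueError("n_embryos must be non-negative")
    if not 0.0 <= implantation_rate <= 1.0:
        raise ValueError("implantation_rate must be in [0, 1]")
    p = implantation_rate
    return [comb(n_embryos, k) * p ** k * (1 - p) ** (n_embryos - k)
            for k in range(n_embryos + 1)]
```

Under this baseline, index 1 of the returned list is the singleton probability, index 2 the twin probability, and the sum of the remaining entries from index 3 onward is the probability of higher-order multiples.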

Source
DOI: http://dx.doi.org/10.1016/j.fertnstert.2012.08.014
December 2012

Efficiency and power as a function of sequence coverage, SNP array density, and imputation.

PLoS Comput Biol 2012 12;8(7):e1002604. Epub 2012 Jul 12.

Broad Institute of Harvard and MIT, Cambridge, Massachusetts, United States of America.

High-coverage whole-genome sequencing provides near-complete information about genetic variation. However, other technologies can be more efficient in some settings by (a) reducing redundant coverage within samples and (b) exploiting patterns of genetic variation across samples. To characterize as many samples as possible, many genetic studies therefore employ lower-coverage sequencing or SNP array genotyping coupled to statistical imputation. To compare these approaches individually and in conjunction, we developed a statistical framework to estimate genotypes jointly from sequence reads, array intensities, and imputation. In European samples, we find similar sensitivity (89%) and specificity (99.6%) from imputation with either 1× sequencing or 1 M SNP arrays. Sensitivity is increased, particularly for low-frequency polymorphisms (MAF < 5%), when low-coverage sequence reads are added to dense genome-wide SNP arrays; the converse, however, is not true. At sites where sequence reads and array intensities produce different sample genotypes, joint analysis reduces genotype errors and identifies novel error modes. Our joint framework informs the use of next-generation sequencing in genome-wide association studies and supports development of improved methods for genotype calling.
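
One simple way to picture joint genotype estimation is as a Bayesian update over the three diploid genotypes (0, 1, or 2 copies of the alternative allele). The Python sketch below multiplies an imputation-derived prior by likelihoods from sequence reads and from array intensities, then normalizes. It assumes the two data sources are conditionally independent given the genotype; that assumption, and all names here, are illustrative and are not taken from the paper's framework.

```python
def joint_genotype_posterior(imputation_prior, read_likelihood,
                             array_likelihood):
    """Combine three evidence sources into a genotype posterior.

    Each argument is a length-3 sequence indexed by alternative-allele
    count (0, 1, 2). Assumes reads and array intensities are
    conditionally independent given the true genotype.
    """
    if not (len(imputation_prior) == len(read_likelihood)
            == len(array_likelihood) == 3):
        raise ValueError("each input must have three entries")
    unnormalized = [p * r * a for p, r, a in
                    zip(imputation_prior, read_likelihood, array_likelihood)]
    total = sum(unnormalized)
    if total == 0:
        raise ValueError("evidence sources are mutually incompatible")
    return [u / total for u in unnormalized]


def call_genotype(posterior):
    """Return the alternative-allele count with the highest posterior."""
    return max(range(len(posterior)), key=lambda g: posterior[g])
```

A flat likelihood such as `[1, 1, 1]` for a missing data source leaves the posterior driven by the remaining sources, which lets the same function cover array-only, sequence-only, and combined designs.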

Source
DOI: http://dx.doi.org/10.1371/journal.pcbi.1002604
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3395607
January 2013

Antibiotic resistance is prevalent in an isolated cave microbiome.

PLoS One 2012 11;7(4):e34953. Epub 2012 Apr 11.

MG DeGroote Institute for Infectious Disease Research, Department of Biochemistry and Biomedical Sciences, McMaster University, Hamilton, Ontario, Canada.

Antibiotic resistance is a global challenge that impacts all pharmaceutically used antibiotics. The origin of the genes associated with this resistance is of significant importance to our understanding of the evolution and dissemination of antibiotic resistance in pathogens. A growing body of evidence implicates environmental organisms as reservoirs of these resistance genes; however, the role of anthropogenic use of antibiotics in the emergence of these genes is controversial. We report a screen of a sample of the culturable microbiome of Lechuguilla Cave, New Mexico, in a region of the cave that has been isolated for over 4 million years. We report that, like surface microbes, these bacteria were highly resistant to antibiotics; some strains were resistant to 14 different commercially available antibiotics. Resistance was detected to a wide range of structurally different antibiotics including daptomycin, an antibiotic of last resort in the treatment of drug resistant Gram-positive pathogens. Enzyme-mediated mechanisms of resistance were also discovered for natural and semi-synthetic macrolide antibiotics via glycosylation and through a kinase-mediated phosphorylation mechanism. Sequencing of the genome of one of the resistant bacteria identified a macrolide kinase encoding gene and characterization of its product revealed it to be related to a known family of kinases circulating in modern drug resistant pathogens. The implications of this study are significant to our understanding of the prevalence of resistance, even in microbiomes isolated from human use of antibiotics. This supports a growing understanding that antibiotic resistance is natural, ancient, and hard wired in the microbial pangenome.

Source
PLOS: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0034953
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3324550
August 2012

Patterns and rates of exonic de novo mutations in autism spectrum disorders.

Nature 2012 Apr 4;485(7397):242-5. Epub 2012 Apr 4.

Analytic and Translational Genetics Unit, Department of Medicine, Massachusetts General Hospital and Harvard Medical School, Boston, Massachusetts 02114, USA.

Autism spectrum disorders (ASD) are believed to have genetic and environmental origins, yet in only a modest fraction of individuals can specific causes be identified. To identify further genetic risk factors, here we assess the role of de novo mutations in ASD by sequencing the exomes of ASD cases and their parents (n = 175 trios). Fewer than half of the cases (46.3%) carry a missense or nonsense de novo variant, and the overall rate of mutation is only modestly higher than the expected rate. In contrast, the proteins encoded by genes that harboured de novo missense or nonsense mutations showed a higher degree of connectivity among themselves and to previous ASD genes as indexed by protein-protein interaction screens. The small increase in the rate of de novo events, when taken together with the protein interaction results, is consistent with an important but limited role for de novo point mutations in ASD, similar to that documented for de novo copy number variants. Genetic models incorporating these data indicate that most of the observed de novo events are unconnected to ASD; those that do confer risk are distributed across many genes and are incompletely penetrant (that is, not necessarily sufficient for disease). Our results support polygenic models in which spontaneous coding mutations in any of a large number of genes increase risk by 5- to 20-fold. Despite the challenge posed by such models, results from de novo events and a large parallel case-control study provide strong evidence in favour of CHD8 and KATNAL2 as genuine autism risk factors.

Source
DOI: http://dx.doi.org/10.1038/nature11011
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3613847
April 2012

A systematic survey of loss-of-function variants in human protein-coding genes.

Science 2012 Feb;335(6070):823-8

Wellcome Trust Sanger Institute, Hinxton, UK.

Genome-sequencing studies indicate that all humans carry many genetic variants predicted to cause loss of function (LoF) of protein-coding genes, suggesting unexpected redundancy in the human genome. Here we apply stringent filters to 2951 putative LoF variants obtained from 185 human genomes to determine their true prevalence and properties. We estimate that human genomes typically contain ~100 genuine LoF variants with ~20 genes completely inactivated. We identify rare and likely deleterious LoF alleles, including 26 known and 21 predicted severe disease-causing variants, as well as common LoF variants in nonessential genes. We describe functional and evolutionary differences between LoF-tolerant and recessive disease genes and a method for using these differences to prioritize candidate genes found in clinical sequencing studies.

Source
DOI: http://dx.doi.org/10.1126/science.1215040
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3299548
February 2012