Publications by authors named "Paul Flicek"

230 Publications

The FAANG Data Portal: Global, Open-Access, "FAIR", and Richly Validated Genotype to Phenotype Data for High-Quality Functional Annotation of Animal Genomes.

Front Genet 2021 17;12:639238. Epub 2021 Jun 17.

European Molecular Biology Laboratory, European Bioinformatics Institute, Cambridge, United Kingdom.

The Functional Annotation of ANimal Genomes (FAANG) project is a worldwide coordinated action creating high-quality functional annotation of farmed and companion animal genomes. The generation of a rich genome-to-phenome resource and supporting informatic infrastructure advances the scope of comparative genomics and furthers the understanding of functional elements. The project also provides terrestrial and aquatic animal agriculture community powerful resources for supporting improvements to farmed animal production, disease resistance, and genetic diversity. The FAANG Data Portal (https://data.faang.org) ensures Findable, Accessible, Interoperable and Reusable (FAIR) open access to the wealth of sample, sequencing, and analysis data produced by an ever-growing number of FAANG consortia. It is developed and maintained by the FAANG Data Coordination Centre (DCC) at the European Molecular Biology Laboratory's European Bioinformatics Institute (EMBL-EBI). FAANG projects produce a standardised set of multi-omic assays with resulting data placed into a range of specialised open data archives. To ensure this data is easily findable and accessible by the community, the portal automatically identifies and collates all submitted FAANG data into a single easily searchable resource. The Data Portal supports direct download from the multiple underlying archives to enable seamless access to all FAANG data from within the portal itself. The portal provides a range of predefined filters, powerful predictive search, and a catalogue of sampling and analysis protocols and automatically identifies publications associated with any dataset. To ensure all FAANG data submissions are high-quality, the portal includes powerful contextual metadata validation and data submissions brokering to the underlying EMBL-EBI archives. 
The portal will incorporate extensive new technical infrastructure to effectively deliver and standardise FAANG's shift to single-cellomics, cell atlases, pangenomes, and novel phenotypic prediction models. The Data Portal plays a key role for FAANG by supporting high-quality functional annotation of animal genomes, through open FAIR sharing of data, complete with standardised rich metadata. Future Data Portal features developed by the DCC will support new technological developments for continued improvement for FAANG projects.

Source
DOI: http://dx.doi.org/10.3389/fgene.2021.639238
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC8248360
June 2021

Accessing Livestock Resources in Ensembl.

Front Genet 2021 28;12:650228. Epub 2021 Apr 28.

European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Cambridge, United Kingdom.

Genome assembly is cheaper, more accurate and more automated than it has ever been. This is due to a combination of more cost-efficient chemistries, new sequencing technologies and better algorithms. The livestock community has been at the forefront of this new wave of genome assembly, generating some of the highest quality vertebrate genome sequences. Ensembl's goal is to add functional and comparative annotation to these genomes, through our gene annotation, genomic alignments, gene trees, regulatory, and variation data. We run computationally complex analyses in a high throughput and consistent manner to help accelerate downstream science. Our livestock resources are continuously growing in both breadth and depth. We annotate reference genome assemblies for newly sequenced species and regularly update annotation for existing genomes. We are the only major resource to support the annotation of breeds and other non-reference assemblies. We currently provide resources for 13 pig breeds, maternal and paternal haplotypes for hybrid cattle and various other non-reference or wild type assemblies for livestock species. Here, we describe the livestock data present in Ensembl and provide protocols for how to view data in our genome browser, download it via our FTP site, manipulate it via our tools and interact with it programmatically via our REST API.
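As an illustration of the programmatic access mentioned in this abstract, the following is a minimal sketch, not taken from the article. It assumes the public Ensembl REST server at https://rest.ensembl.org and its `lookup/symbol/:species/:symbol` endpoint; the helper names and the pig/MSTN example are hypothetical choices made here.

```python
import json
from urllib.parse import quote
from urllib.request import Request, urlopen

# Assumed base URL of the public Ensembl REST service.
ENSEMBL_REST = "https://rest.ensembl.org"


def lookup_symbol_url(species, symbol):
    """Build the URL for a gene-symbol lookup (hypothetical helper)."""
    return (
        f"{ENSEMBL_REST}/lookup/symbol/{quote(species)}/{quote(symbol)}"
        "?content-type=application/json"
    )


def lookup_symbol(species, symbol, timeout=30):
    """Fetch the lookup record as a dict. Requires network access."""
    request = Request(
        lookup_symbol_url(species, symbol),
        headers={"Accept": "application/json"},
    )
    with urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

For example, `lookup_symbol("sus_scrofa", "MSTN")` would request the record for the pig myostatin gene, assuming that species name and symbol are recognised by the server.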

Source
DOI: http://dx.doi.org/10.3389/fgene.2021.650228
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC8115729
April 2021

Towards complete and error-free genome assemblies of all vertebrate species.

Nature 2021 Apr 28;592(7856):737-746. Epub 2021 Apr 28.

UQ Genomics, University of Queensland, Brisbane, Queensland, Australia.

High-quality and complete reference genome assemblies are fundamental for the application of genomics to biology, disease, and biodiversity conservation. However, such assemblies are available for only a few non-microbial species. To address this issue, the international Genome 10K (G10K) consortium has worked over a five-year period to evaluate and develop cost-effective methods for assembling highly accurate and nearly complete reference genomes. Here we present lessons learned from generating assemblies for 16 species that represent six major vertebrate lineages. We confirm that long-read sequencing technologies are essential for maximizing genome quality, and that unresolved complex repeats and haplotype heterozygosity are major sources of assembly error when not handled correctly. Our assemblies correct substantial errors, add missing sequence in some of the best historical reference genomes, and reveal biological discoveries. These include the identification of many false gene duplications, increases in gene sizes, chromosome rearrangements that are specific to lineages, a repeated independent chromosome breakpoint in bat genomes, and a canonical GC-rich pattern in protein-coding genes and their regulatory regions. Adopting these lessons, we have embarked on the Vertebrate Genomes Project (VGP), an international effort to generate high-quality, complete reference genomes for all of the roughly 70,000 extant vertebrate species and to help to enable a new era of discovery across the life sciences.

Source
DOI: http://dx.doi.org/10.1038/s41586-021-03451-0
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC8081667
April 2021

Genetic perturbation of PU.1 binding and chromatin looping at neutrophil enhancers associates with autoimmune disease.

Nat Commun 2021 04 16;12(1):2298. Epub 2021 Apr 16.

Human Genetics, Wellcome Sanger Institute, Genome Campus, Hinxton, UK.

Neutrophils play fundamental roles in innate immune response, shape adaptive immunity, and are a potentially causal cell type underpinning genetic associations with immune system traits and diseases. Here, we profile the binding of myeloid master regulator PU.1 in primary neutrophils across nearly a hundred volunteers. We show that variants associated with differential PU.1 binding underlie genetically-driven differences in cell count and susceptibility to autoimmune and inflammatory diseases. We integrate these results with other multi-individual genomic readouts, revealing coordinated effects of PU.1 binding variants on the local chromatin state, enhancer-promoter contacts and downstream gene expression, and providing a functional interpretation for 27 genes underlying immune traits. Collectively, these results demonstrate the functional role of PU.1 and its target enhancers in neutrophil transcriptional control and immune disease susceptibility.

Source
DOI: http://dx.doi.org/10.1038/s41467-021-22548-8
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC8052402
April 2021

Functional annotations of three domestic animal genomes provide vital resources for comparative and agricultural research.

Nat Commun 2021 03 23;12(1):1821. Epub 2021 Mar 23.

Department of Animal Science, University of California, Davis, Davis, CA, USA.

Gene regulatory elements are central drivers of phenotypic variation and thus of critical importance towards understanding the genetics of complex traits. The Functional Annotation of Animal Genomes consortium was formed to collaboratively annotate the functional elements in animal genomes, starting with domesticated animals. Here we present an expansive collection of datasets from eight diverse tissues in three important agricultural species: chicken (Gallus gallus), pig (Sus scrofa), and cattle (Bos taurus). Comparative analysis of these datasets and those from the human and mouse Encyclopedia of DNA Elements projects reveal that a core set of regulatory elements are functionally conserved independent of divergence between species, and that tissue-specific transcription factor occupancy at regulatory elements and their predicted target genes are also conserved. These datasets represent a unique opportunity for the emerging field of comparative epigenomics, as well as the agricultural research community, including species that are globally important food resources.

Source
DOI: http://dx.doi.org/10.1038/s41467-021-22100-8
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7988148
March 2021

Haplotype-resolved diverse human genomes and integrated analysis of structural variation.

Science 2021 04 25;372(6537). Epub 2021 Feb 25.

New York Genome Center, New York, NY 10013, USA.

Long-read and strand-specific sequencing technologies together facilitate the de novo assembly of high-quality haplotype-resolved human genomes without parent-child trio data. We present 64 assembled haplotypes from 32 diverse human genomes. These highly contiguous haplotype assemblies (average minimum contig length needed to cover 50% of the genome: 26 million base pairs) integrate all forms of genetic variation, even across complex loci. We identified 107,590 structural variants (SVs), of which 68% were not discovered with short-read sequencing, and 278 SV hotspots (spanning megabases of gene-rich sequence). We characterized 130 of the most active mobile element source elements and found that 63% of all SVs arise through homology-mediated mechanisms. This resource enables reliable graph-based genotyping from short reads of up to 50,340 SVs, resulting in the identification of 1526 expression quantitative trait loci as well as SV candidates for adaptive selection within the human population.
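The contiguity figure quoted in this abstract (the minimum contig length needed to cover 50% of the genome) is commonly called N50. A minimal sketch of that calculation follows; it is not the authors' code, and it assumes the denominator is the summed length of the contigs supplied rather than an external genome-size estimate.

```python
def n50(contig_lengths):
    """Return the length L such that contigs of length >= L
    together cover at least half of the total assembled length."""
    if not contig_lengths:
        raise ValueError("need at least one contig length")
    total = sum(contig_lengths)
    running = 0
    # Walk from the longest contig down, accumulating covered bases.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
```

With contigs of lengths 2 through 10 (total 54), the three longest contigs (10, 9, 8) reach exactly half the total, so the function returns 8.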

Source
DOI: http://dx.doi.org/10.1126/science.abf7117
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC8026704
April 2021

LINE retrotransposons characterize mammalian tissue-specific and evolutionarily dynamic regulatory regions.

Genome Biol 2021 Feb 18;22(1):62. Epub 2021 Feb 18.

European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK.

Background: To investigate the mechanisms driving regulatory evolution across tissues, we experimentally mapped promoters, enhancers, and gene expression in the liver, brain, muscle, and testis from ten diverse mammals.

Results: The regulatory landscape around genes included both tissue-shared and tissue-specific regulatory regions, where tissue-specific promoters and enhancers evolved most rapidly. Genomic regions switching between promoters and enhancers were more common across species, and less common across tissues within a single species. Long Interspersed Nuclear Elements (LINEs) played recurrent evolutionary roles: LINE L1s were associated with tissue-specific regulatory regions, whereas more ancient LINE L2s were associated with tissue-shared regulatory regions and with those switching between promoter and enhancer signatures across species.

Conclusions: Our analyses of the tissue-specificity and evolutionary stability among promoters and enhancers reveal how specific LINE families have helped shape the dynamic mammalian regulome.

Source
DOI: http://dx.doi.org/10.1186/s13059-021-02260-y
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7890895
February 2021

A spatially resolved brain region- and cell type-specific isoform atlas of the postnatal mouse brain.

Nat Commun 2021 01 19;12(1):463. Epub 2021 Jan 19.

Brain and Mind Research Institute and Center for Neurogenetics, Weill Cornell Medicine, New York, NY, USA.

Splicing varies across brain regions, but the single-cell resolution of regional variation is unclear. We present a single-cell investigation of differential isoform expression (DIE) between brain regions using single-cell long-read sequencing in mouse hippocampus and prefrontal cortex in 45 cell types at postnatal day 7 (www.isoformAtlas.com). Isoform tests for DIE show better performance than exon tests. We detect hundreds of DIE events traceable to cell types, often corresponding to functionally distinct protein isoforms. Mostly, one cell type is responsible for brain-region specific DIE. However, for fewer genes, multiple cell types influence DIE. Thus, regional identity can, although rarely, override cell-type specificity. Cell types indigenous to one anatomic structure display distinctive DIE, e.g. the choroid plexus epithelium manifests distinct transcription-start-site usage. Spatial transcriptomics and long-read sequencing yield a spatially resolved splicing map. Our methods quantify isoform expression with cell-type and spatial resolution and contribute to furthering our understanding of how the brain integrates molecular and cellular complexity.

Source
DOI: http://dx.doi.org/10.1038/s41467-020-20343-5
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7815907
January 2021

Mouse mutant phenotyping at scale reveals novel genes controlling bone mineral density.

PLoS Genet 2020 12 28;16(12):e1009190. Epub 2020 Dec 28.

Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas, United States of America.

The genetic landscape of diseases associated with changes in bone mineral density (BMD), such as osteoporosis, is only partially understood. Here, we explored data from 3,823 mutant mouse strains for BMD, a measure that is frequently altered in a range of bone pathologies, including osteoporosis. A total of 200 genes were found to significantly affect BMD. This pool of BMD genes comprised 141 genes with previously unknown functions in bone biology and was complementary to pools derived from recent human studies. Nineteen of the 141 genes also caused skeletal abnormalities. Examination of the BMD genes in osteoclasts and osteoblasts underscored BMD pathways, including vesicle transport, in these cells and together with in silico bone turnover studies resulted in the prioritization of candidate genes for further investigation. Overall, the results add novel pathophysiological and molecular insight into bone health and disease.

Source
DOI: http://dx.doi.org/10.1371/journal.pgen.1009190
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7822523
December 2020

GENCODE 2021.

Nucleic Acids Res 2021 01;49(D1):D916-D923

European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK.

The GENCODE project annotates human and mouse genes and transcripts supported by experimental data with high accuracy, providing a foundational resource that supports genome biology and clinical genomics. GENCODE annotation processes make use of primary data and bioinformatic tools and analysis generated both within the consortium and externally to support the creation of transcript structures and the determination of their function. Here, we present improvements to our annotation infrastructure, bioinformatics tools, and analysis, and the advances they support in the annotation of the human and mouse genomes including: the completion of first pass manual annotation for the mouse reference genome; targeted improvements to the annotation of genes associated with SARS-CoV-2 infection; collaborative projects to achieve convergence across reference annotation databases for the annotation of human and mouse protein-coding genes; and the first GENCODE manually supervised automated annotation of lncRNAs. Our annotation is accessible via Ensembl, the UCSC Genome Browser and https://www.gencodegenes.org.

Source
DOI: http://dx.doi.org/10.1093/nar/gkaa1087
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7778937
January 2021

The European Bioinformatics Institute: empowering cooperation in response to a global health crisis.

Nucleic Acids Res 2021 01;49(D1):D29-D37

European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK.

The European Bioinformatics Institute (EMBL-EBI; https://www.ebi.ac.uk/) provides freely available data and bioinformatics services to the scientific community, alongside its research activity and training provision. The 2020 COVID-19 pandemic has brought to the forefront a need for the scientific community to work even more cooperatively to effectively tackle a global health crisis. EMBL-EBI has been able to build on its position to contribute to the fight against COVID-19 in a number of ways. Firstly, EMBL-EBI has used its infrastructure, expertise and network of international collaborations to help build the European COVID-19 Data Platform (https://www.covid19dataportal.org/), which brings together COVID-19 biomolecular data and connects it to researchers, clinicians and public health professionals. By September 2020, the COVID-19 Data Platform has integrated in excess of 170 000 COVID-19 biomolecular data and literature records, collected through a number of EMBL-EBI resources. Secondly, EMBL-EBI has strived to continue its support of the life science communities through the crisis, with updated Training provision and improved service provision throughout its resources. The COVID-19 pandemic has highlighted the importance of EMBL-EBI's core principles, including international cooperation, resource sharing and central data brokering, and has further empowered scientific cooperation.

Source
DOI: http://dx.doi.org/10.1093/nar/gkaa1077
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7778996
January 2021

Gramene 2021: harnessing the power of comparative genomics and pathways for plant research.

Nucleic Acids Res 2021 01;49(D1):D1452-D1463

Cold Spring Harbor Laboratory, Cold Spring Harbor, NY 11724, USA.

Gramene (http://www.gramene.org), a knowledgebase founded on comparative functional analyses of genomic and pathway data for model plants and major crops, supports agricultural researchers worldwide. The resource is committed to open access and reproducible science based on the FAIR data principles. Since the last NAR update, we made nine releases; doubled the genome portal's content; expanded curated genes, pathways and expression sets; and implemented the Domain Informational Vocabulary Extraction (DIVE) algorithm for extracting gene function information from publications. The current release, #63 (October 2020), hosts 93 reference genomes, with over 3.9 million genes in 122 947 families with orthologous and paralogous classifications. Plant Reactome portrays pathway networks using a combination of manual biocuration in rice (320 reference pathways) and orthology-based projections to 106 species. The Reactome platform facilitates comparison between reference and projected pathways, gene expression analyses and overlays of gene-gene interactions. Gramene integrates ontology-based protein structure-function annotation; information on genetic, epigenetic, expression, and phenotypic diversity; and gene functional annotations extracted from plant-focused journals using DIVE. We train plant researchers in biocuration of genes and pathways; host curated maize gene structures as tracks in the maize genome browser; and integrate curated rice genes and pathways in the Plant Reactome.

Source
DOI: http://dx.doi.org/10.1093/nar/gkaa979
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7779000
January 2021

Ensembl 2021.

Nucleic Acids Res 2021 01;49(D1):D884-D891

European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK.

The Ensembl project (https://www.ensembl.org) annotates genomes and disseminates genomic data for vertebrate species. We create detailed and comprehensive annotation of gene structures, regulatory elements and variants, and enable comparative genomics by inferring the evolutionary history of genes and genomes. Our integrated genomic data are made available in a variety of ways, including genome browsers, search interfaces, specialist tools such as the Ensembl Variant Effect Predictor, download files and programmatic interfaces. Here, we present recent Ensembl developments including two new website portals. Ensembl Rapid Release (http://rapid.ensembl.org) is designed to provide core tools and services for genomes as soon as possible and has been deployed to support large biodiversity sequencing projects. Our SARS-CoV-2 genome browser (https://covid-19.ensembl.org) integrates our own annotation with publicly available genomic data from numerous sources to facilitate the use of genomics in the international scientific response to the COVID-19 pandemic. We also report on other updates to our annotation resources, tools and services. All Ensembl data and software are freely available without restriction.

Source
DOI: http://dx.doi.org/10.1093/nar/gkaa942
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7778975
January 2021

Functional signatures of evolutionarily young CTCF binding sites.

BMC Biol 2020 09 23;18(1):132. Epub 2020 Sep 23.

European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK.

Background: The introduction of novel CTCF binding sites in gene regulatory regions in the rodent lineage is partly the effect of transposable element expansion, particularly in the murine lineage. The exact mechanism and functional impact of evolutionarily novel CTCF binding sites are not yet fully understood. We investigated the impact of novel subspecies-specific CTCF binding sites in two Mus genus subspecies, Mus musculus domesticus and Mus musculus castaneus, that diverged 0.5 million years ago.

Results: CTCF binding site evolution is influenced by the action of the B2-B4 family of transposable elements independently in both lineages, leading to the proliferation of novel CTCF binding sites. A subset of evolutionarily young sites may harbour transcriptional functionality as evidenced by the stability of their binding across multiple tissues in M. musculus domesticus (BL6), while overall the distance of subspecies-specific CTCF binding to the nearest transcription start sites and/or topologically associated domains (TADs) is largely similar to musculus-common CTCF sites. Remarkably, we discovered a recurrent regulatory architecture consisting of a CTCF binding site and an interferon gene that appears to have been tandemly duplicated to create a 15-gene cluster on chromosome 4, thus forming a novel BL6 specific immune locus in which CTCF may play a regulatory role.

Conclusions: Our results demonstrate that thousands of CTCF binding sites show multiple functional signatures rapidly after incorporation into the genome.

Source
DOI: http://dx.doi.org/10.1186/s12915-020-00863-8
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7520972
September 2020

Chromatin activation as a unifying principle underlying pathogenic mechanisms in multiple myeloma.

Genome Res 2020 09 20;30(9):1217-1227. Epub 2020 Aug 20.

European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton CB10 1SD, United Kingdom.

Multiple myeloma (MM) is a plasma cell neoplasm associated with a broad variety of genetic lesions. In spite of this genetic heterogeneity, MMs share a characteristic malignant phenotype whose underlying molecular basis remains poorly characterized. In the present study, we examined plasma cells from MM using a multi-epigenomics approach and demonstrated that, when compared to normal B cells, malignant plasma cells showed an extensive activation of regulatory elements, in part affecting coregulated adjacent genes. Among target genes up-regulated by this process, we found members of the NOTCH, NF-kB, MTOR signaling, and TP53 signaling pathways. Other activated genes included sets involved in osteoblast differentiation and response to oxidative stress, all of which have been shown to be associated with the MM phenotype and clinical behavior. We functionally characterized MM-specific active distant enhancers controlling the expression of thioredoxin (), a major regulator of cellular redox status and, in addition, identified as a novel essential gene for MM. Collectively, our data indicate that aberrant chromatin activation is a unifying feature underlying the malignant plasma cell phenotype.

Source
DOI: http://dx.doi.org/10.1101/gr.265520.120
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7545147
September 2020

The tuatara genome reveals ancient features of amniote evolution.

Nature 2020 08 5;584(7821):403-409. Epub 2020 Aug 5.

Department of Anatomy, University of Otago, Dunedin, New Zealand.

The tuatara (Sphenodon punctatus), the only living member of the reptilian order Rhynchocephalia (Sphenodontia), once widespread across Gondwana, is an iconic species that is endemic to New Zealand. A key link to the now-extinct stem reptiles (from which dinosaurs, modern reptiles, birds and mammals evolved), the tuatara provides key insights into the ancestral amniotes. Here we analyse the genome of the tuatara, which, at approximately 5 Gb, is among the largest of the vertebrate genomes yet assembled. Our analyses of this genome, along with comparisons with other vertebrate genomes, reinforce the uniqueness of the tuatara. Phylogenetic analyses indicate that the tuatara lineage diverged from that of snakes and lizards around 250 million years ago. This lineage also shows moderate rates of molecular evolution, with instances of punctuated evolution. Our genome sequence analysis identifies expansions of proteins, non-protein-coding RNA families and repeat elements, the latter of which show an amalgam of reptilian and mammalian features. The sequencing of the tuatara genome provides a valuable resource for deep comparative analyses of tetrapods, as well as for tuatara biology and conservation. Our study also provides important insights into both the technical challenges and the cultural obligations that are associated with genome sequencing.

Source
DOI: http://dx.doi.org/10.1038/s41586-020-2561-9
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7116210
August 2020

Perspectives on ENCODE.

Nature 2020 07 29;583(7818):693-698. Epub 2020 Jul 29.

HudsonAlpha Institute for Biotechnology, Huntsville, AL, USA.

The Encyclopedia of DNA Elements (ENCODE) Project launched in 2003 with the long-term goal of developing a comprehensive map of functional elements in the human genome. These included genes, biochemical regions associated with gene regulation (for example, transcription factor binding sites, open chromatin, and histone marks) and transcript isoforms. The marks serve as sites for candidate cis-regulatory elements (cCREs) that may serve functional roles in regulating gene expression. The project has been extended to model organisms, particularly the mouse. In the third phase of ENCODE, nearly a million and more than 300,000 cCRE annotations have been generated for human and mouse, respectively, and these have provided a valuable resource for the scientific community.

Source
DOI: http://dx.doi.org/10.1038/s41586-020-2449-8
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7410827
July 2020

Transcriptional activity and strain-specific history of mouse pseudogenes.

Nat Commun 2020 07 29;11(1):3695. Epub 2020 Jul 29.

Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, 06520, USA.

Pseudogenes are ideal markers of genome remodelling. In turn, the mouse is an ideal platform for studying them, particularly with the recent availability of strain-sequencing and transcriptional data. Here, combining both manual curation and automatic pipelines, we present a genome-wide annotation of the pseudogenes in the mouse reference genome and 18 inbred mouse strains (available via the mouse.pseudogene.org resource). We also annotate 165 unitary pseudogenes in mouse, and 303, in human. The overall pseudogene repertoire in mouse is similar to that in human in terms of size, biotype distribution, and family composition (e.g. with GAPDH and ribosomal proteins being the largest families). Notable differences arise in the pseudogene age distribution, with multiple retro-transpositional bursts in mouse evolutionary history and only one in human. Furthermore, in each strain about a fifth of all pseudogenes are unique, reflecting strain-specific evolution. Finally, we find that ~15% of the mouse pseudogenes are transcribed, and that highly transcribed parent genes tend to give rise to many processed pseudogenes.

Source
DOI: http://dx.doi.org/10.1038/s41467-020-17157-w
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7392758
July 2020

Cell type specific novel lncRNAs and circRNAs in the BLUEPRINT haematopoietic transcriptomes atlas.

Haematologica 2020 07 23. Epub 2020 Jul 23.

Department of Haematology, University of Cambridge, School of Clinical Medicine.

Transcriptional profiling of hematopoietic cell subpopulations has helped characterize the developmental stages of the hematopoietic system and the molecular bases of malignant and non-malignant blood diseases for the past three decades. Previously, only the genes targeted by expression microarrays could be profiled genome wide. High-throughput RNA sequencing (RNA-seq), however, encompasses a broader repertoire of RNA molecules, without restriction to previously annotated genes. We analysed the BLUEPRINT consortium RNA-seq data for mature hematopoietic cell types. The data comprised 90 total RNA-seq samples, each composed of one of 27 cell types, and 32 small RNA-seq samples, each composed of one of 11 cell types. We estimated gene and isoform expression levels for each cell type using existing annotations from Ensembl. We then used guided transcriptome assembly to discover unannotated transcripts. We identified hundreds of novel non-coding RNA genes and showed that the majority have cell type dependent expression. We also characterized the expression of circular RNAs and found that these are also cell type specific. These analyses refine the active transcriptional landscape of mature hematopoietic cells, highlight abundant genes and transcriptional isoforms for each blood cell type, and provide a valuable resource for researchers of hematological development and diseases. Finally, we made the data accessible via a web-based interface: https://blueprint.haem.cam.ac.uk/bloodatlas/.

Source
DOI: http://dx.doi.org/10.3324/haematol.2019.238147
July 2020

Pervasive lesion segregation shapes cancer genome evolution.

Nature 2020 07 24;583(7815):265-270. Epub 2020 Jun 24.

MRC Human Genetics Unit, MRC Institute of Genetics and Molecular Medicine, University of Edinburgh, Edinburgh, UK.

Cancers arise through the acquisition of oncogenic mutations and grow by clonal expansion. Here we reveal that most mutagenic DNA lesions are not resolved into a mutated DNA base pair within a single cell cycle. Instead, DNA lesions segregate, unrepaired, into daughter cells for multiple cell generations, resulting in the chromosome-scale phasing of subsequent mutations. We characterize this process in mutagen-induced mouse liver tumours and show that DNA replication across persisting lesions can produce multiple alternative alleles in successive cell divisions, thereby generating both multiallelic and combinatorial genetic diversity. The phasing of lesions enables accurate measurement of strand-biased repair processes, quantification of oncogenic selection and fine mapping of sister-chromatid-exchange events. Finally, we demonstrate that lesion segregation is a unifying property of exogenous mutagens, including UV light and chemotherapy agents in human cells and tumours, which has profound implications for the evolution and adaptation of cancer genomes.

Source
DOI: http://dx.doi.org/10.1038/s41586-020-2435-1
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7116693
July 2020

An improved pig reference genome sequence to enable pig genetics and genomics research.

Gigascience 2020 06;9(6)

Department of Pathology, University of Cambridge, Tennis Court Road, Cambridge CB2 1QP, UK.

Background: The domestic pig (Sus scrofa) is important both as a food source and as a biomedical model given its similarity in size, anatomy, physiology, metabolism, pathology, and pharmacology to humans. The draft reference genome (Sscrofa10.2) of a purebred Duroc female pig established using older clone-based sequencing methods was incomplete, and unresolved redundancies, short-range order and orientation errors, and associated misassembled genes limited its utility.

Results: We present 2 annotated highly contiguous chromosome-level genome assemblies created with more recent long-read technologies and a whole-genome shotgun strategy, 1 for the same Duroc female (Sscrofa11.1) and 1 for an outbred, composite-breed male (USMARCv1.0). Both assemblies are of substantially higher (>90-fold) continuity and accuracy than Sscrofa10.2.

Conclusions: These highly contiguous assemblies plus annotation of a further 11 short-read assemblies provide an unprecedented view of the genetic make-up of this important agricultural and biomedical model species. We propose that the improved Duroc assembly (Sscrofa11.1) become the reference genome for genomic research in pigs.

Source
DOI: http://dx.doi.org/10.1093/gigascience/giaa051
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7448572
June 2020

Progress, Challenges, and Surprises in Annotating the Human Genome.

Annu Rev Genomics Hum Genet 2020 08 18;21:55-79. Epub 2020 May 18.

European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton CB10 1SD, United Kingdom.

Our understanding of the human genome has continuously expanded since its draft publication in 2001. Over the years, novel assays have allowed us to progressively overlay layers of knowledge above the raw sequence of A's, T's, G's, and C's. The reference human genome sequence is now a complex knowledge base maintained under the shared stewardship of multiple specialist communities. Its complexity stems from the fact that it is simultaneously a template for transcription, a record of evolution, a vehicle for genetics, and a functional molecule. In short, the human genome serves as a frame of reference at the intersection of a diversity of scientific fields. In recent years, the progressive fall in sequencing costs has given increasing importance to the quality of the human reference genome, as hundreds of thousands of individuals are being sequenced yearly, often for clinical applications. Also, novel sequencing-based assays shed light on novel functions of the genome, especially with respect to gene expression regulation. Keeping the human genome annotation up to date and accurate is therefore an ongoing partnership between reference annotation projects and the greater community worldwide.

Source
DOI: http://dx.doi.org/10.1146/annurev-genom-121119-083418
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7116059
August 2020

Identification of male heterogametic sex-determining regions on the Atlantic herring Clupea harengus genome.

J Fish Biol 2020 Jul 22;97(1):190-201. Epub 2020 May 22.

Faculty of Science and Technology, University of the Faroe Islands, Tórshavn, Faroe Islands.

The sex determination system of Atlantic herring Clupea harengus L., a commercially important fish, was investigated. Low-coverage whole-genome sequencing of 48 females and 55 males and a genome-wide association study revealed two regions on chromosomes 8 and 21 associated with sex. The genotyping data of the single nucleotide polymorphisms associated with sex showed that 99.4% of the available female genotypes were homozygous, whereas 68.6% of the available male genotypes were heterozygous. This is close to the theoretical expectation of the homozygous/heterozygous distribution at low sequencing coverage when the males are in fact heterozygous. This suggested a male heterogametic sex determination system in C. harengus, consistent with other species within the Clupeiformes group. There were 76 protein-coding genes in the sex regions, but none of these were previously reported master sex regulation genes or obviously related to sex determination. Many of these genes are expressed in testis or ovary in other species, but the exact genes controlling sex determination in C. harengus could not be identified.

Source
DOI: http://dx.doi.org/10.1111/jfb.14349
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7115899
July 2020

Variant calling on the GRCh38 assembly with the data from phase three of the 1000 Genomes Project.

Wellcome Open Res 2019 30;4:50. Epub 2019 Dec 30.

European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK.

We present a set of biallelic SNVs and INDELs, from 2,548 samples spanning 26 populations from the 1000 Genomes Project, called on GRCh38. We believe this will be a useful reference resource for those using GRCh38. It represents an improvement over the "lift-overs" of the 1000 Genomes Project data that have been available to date by encompassing all of the GRCh38 primary assembly autosomes and pseudo-autosomal regions, including novel, medically relevant loci. Here, we describe how the data set was created and benchmark our call set against that produced by the final phase of the 1000 Genomes Project on GRCh37 and the lift-over of that data to GRCh38.

Source
DOI: http://dx.doi.org/10.12688/wellcomeopenres.15126.2
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7059836
December 2019

Clustered CTCF binding is an evolutionary mechanism to maintain topologically associating domains.

Genome Biol 2020 01 7;21(1). Epub 2020 Jan 7.

European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Cambridge, CB10 1SD, UK.

Background: CTCF binding contributes to the establishment of a higher-order genome structure by demarcating the boundaries of large-scale topologically associating domains (TADs). However, despite the importance and conservation of TADs, the role of CTCF binding in their evolution and stability remains elusive.

Results: We carry out an experimental and computational study that exploits the natural genetic variation across five closely related species to assess how CTCF binding patterns stably fixed by evolution in each species contribute to the establishment and evolutionary dynamics of TAD boundaries. We perform CTCF ChIP-seq in multiple mouse species to create genome-wide binding profiles and associate them with TAD boundaries. Our analyses reveal that CTCF binding is maintained at TAD boundaries by a balance of selective constraints and dynamic evolutionary processes. Regardless of their conservation across species, CTCF binding sites at TAD boundaries are subject to stronger sequence and functional constraints compared to other CTCF sites. TAD boundaries frequently harbor dynamically evolving clusters containing both evolutionarily old and young CTCF sites as a result of the repeated acquisition of new species-specific sites close to conserved ones. The overwhelming majority of clustered CTCF sites colocalize with cohesin and are significantly closer to gene transcription start sites than nonclustered CTCF sites, suggesting that CTCF clusters particularly contribute to cohesin stabilization and transcriptional regulation.

Conclusions: Dynamic conservation of CTCF site clusters is an apparently important feature of CTCF binding evolution that is critical to the functional stability of a higher-order chromatin structure.

Source
DOI: http://dx.doi.org/10.1186/s13059-019-1894-x
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6945661
January 2020

Re-annotation of 191 developmental and epileptic encephalopathy-associated genes unmasks de novo variants in .

NPJ Genom Med 2019 2;4:31. Epub 2019 Dec 2.

Department of Medical Genetics, Cambridge Institute for Medical Research, University of Cambridge, Cambridge, CB2 0XY, UK.

The developmental and epileptic encephalopathies (DEE) are a group of rare, severe neurodevelopmental disorders, where even the most thorough sequencing studies leave 60-65% of patients without a molecular diagnosis. Here, we explore the incompleteness of transcript models used for exome and genome analysis as one potential explanation for a lack of current diagnoses. Therefore, we have updated the GENCODE gene annotation for 191 epilepsy-associated genes, using human brain-derived transcriptomic libraries and other data to build 3,550 putative transcript models. Our annotations increase the transcriptional 'footprint' of these genes by over 674 kb. Using as a case study, due to its close phenotype/genotype correlation with Dravet syndrome, we screened 122 people with Dravet syndrome or a similar phenotype with a panel of exon sequences representing eight established genes and identified two de novo variants that now - through improved gene annotation - are ascribed to residing among our exons. These two (from 122 screened people, 1.6%) molecular diagnoses carry significant clinical implications. Furthermore, we identified a previously classified intronic Dravet syndrome-associated variant that now lies within a deeply conserved exon. Our findings illustrate the potential gains of thorough gene annotation in improving diagnostic yields for genetic disorders.

Source
DOI: http://dx.doi.org/10.1038/s41525-019-0106-7
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6889285
December 2019

Using long and linked reads to improve an Atlantic herring (Clupea harengus) genome assembly.

Sci Rep 2019 11 27;9(1):17716. Epub 2019 Nov 27.

Amplexa Genetics A/S, Hoyvíksvegur 51, FO-100, Tórshavn, Faroe Islands.

Atlantic herring (Clupea harengus) is one of the most abundant fish species in the world. It is an important economic and nutritional resource, as well as a crucial part of the North Atlantic ecosystem. In 2016, a draft herring genome assembly was published. Given the importance of this species, we sought to independently verify and potentially improve the herring genome assembly. We sequenced the herring genome, generating paired-end, mate-pair, linked and long reads. Three assembly versions of the herring genome were generated based on a de novo assembly (A1), which was scaffolded using linked and long reads (A2) and then merged with the previously published assembly (A3). The resulting assemblies were compared using parameters describing the size, fragmentation, correctness, and completeness of the assemblies. Results showed that the A2 assembly was less fragmented, more complete and more correct than A1. A3 showed improvement in fragmentation and correctness compared with A2 and the published assembly but was slightly less complete than the published assembly. Thus, we here confirmed the previously published herring assembly and improved it by further scaffolding, by removing low-quality sequences using linked and long reads, and by merging assemblies.

Source
DOI: http://dx.doi.org/10.1038/s41598-019-54151-9
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6881392
November 2019

Ensembl 2020.

Nucleic Acids Res 2020 01;48(D1):D682-D688

European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK.

Ensembl (https://www.ensembl.org) is a system for generating and distributing genome annotation such as genes, variation, regulation and comparative genomics across the vertebrate subphylum and key model organisms. The Ensembl annotation pipeline is capable of integrating experimental and reference data from multiple providers into a single integrated resource. Here, we present 94 newly annotated and re-annotated genomes, bringing the total number of genomes offered by Ensembl to 227. This represents the single largest expansion of the resource since its inception. We also detail our continued efforts to improve human annotation, developments in our epigenome analysis and display, a new tool for imputing causal genes from genome-wide association studies and visualisation of variation within a 3D protein model. Finally, we present information on our new website. Both software and data are made available without restriction via our website, online tools platform and programmatic interfaces (available under an Apache 2.0 license), with data updates made available four times a year.

Source
DOI: http://dx.doi.org/10.1093/nar/gkz966
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7145704
January 2020