Publications by authors named "Sarah Djebali"

30 Publications

Correlation Networks Provide New Insights into the Architecture of Testicular Steroid Pathways in Pigs.

Genes (Basel) 2021 Apr 9;12(4). Epub 2021 Apr 9.

GenPhySE, Université de Toulouse, INRAE, ENVT, 31326 Castanet Tolosan, France.

Steroid metabolism is a fundamental process in the porcine testis to provide testosterone but also estrogens and androstenone, which are essential for the physiology of the boar. This study concerns boars at an early stage of puberty. Using an RT-qPCR approach, we showed that the transcriptional activities of several genes encoding key enzymes involved in this metabolism (such as ) are correlated. Surprisingly, , a key gene for testosterone production, was absent from this group. An additional weighted gene co-expression network analysis was performed on two large sets of mRNA-seq to identify co-expression modules. Of these modules, two containing either or were further analyzed. This comprehensive correlation meta-analysis identified a group of 85 genes with as hub gene, but did not allow the characterization of a robust correlation network around . As the CYP11A1-group includes most of the genes involved in steroid synthesis pathways (including encoding for the LH receptor), it may control the synthesis of most of the testicular steroids. The independent expression of probably allows part of the production of testosterone to escape this control. This CYP11A1-group also contained and genes encoding a peptide hormone and an angiotensin peptide precursor, respectively.
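
The co-expression step mentioned in this abstract can be illustrated with a minimal sketch. It assumes the usual WGCNA-style unsigned adjacency (the absolute Pearson correlation raised to a soft-threshold power) and takes the hub as the gene with the highest summed adjacency. The gene names, expression values, and the power `beta=6` are hypothetical and are not taken from the study.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def soft_threshold_adjacency(expr, beta=6):
    """Unsigned adjacency |cor|**beta for every unordered gene pair.

    expr maps a gene name to its expression values across samples.
    beta=6 is an assumed default, not a value reported in the abstract.
    """
    genes = sorted(expr)
    adjacency = {}
    for i, g in enumerate(genes):
        for h in genes[i + 1:]:
            adjacency[(g, h)] = abs(pearson(expr[g], expr[h])) ** beta
    return adjacency


def connectivity(adjacency):
    """Sum of adjacencies per gene; the largest value marks a candidate hub."""
    k = {}
    for (g, h), a in adjacency.items():
        k[g] = k.get(g, 0.0) + a
        k[h] = k.get(h, 0.0) + a
    return k


# Hypothetical toy data: geneA, geneB and geneC rise together, geneD does not.
expr = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0],
    "geneB": [2.1, 3.9, 6.2, 8.0, 9.9],
    "geneC": [1.2, 2.1, 2.9, 4.2, 4.8],
    "geneD": [3.0, 1.0, 4.0, 1.5, 2.5],
}
adj = soft_threshold_adjacency(expr)
k = connectivity(adj)
hub = max(k, key=k.get)
print("candidate hub:", hub)
```

Raising the correlation to a power pushes weak correlations toward zero while keeping strong ones, which is why the uncorrelated gene contributes almost nothing to any module in this toy example.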

DOI: http://dx.doi.org/10.3390/genes12040551
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC8069258
April 2021

An integrative atlas of chicken long non-coding genes and their annotations across 25 tissues.

Sci Rep 2020 11 24;10(1):20457. Epub 2020 Nov 24.

PEGASE UMR 1348, INRA, AGROCAMPUS OUEST, 35590, Saint-Gilles, France.

Long non-coding RNAs (LNC) regulate numerous biological processes. In contrast to human, the identification of LNC in farm species, like chicken, is still incomplete. We propose a catalogue of 52,075 chicken genes enriched in LNC (http://www.fragencode.org/), built from the Ensembl reference extended using novel LNC modelled here from 364 RNA-seq and LNC from four public databases. The Ensembl reference grew from 4,643 to 30,084 LNC, of which 59% and 41% had expression ≥ 0.5 and ≥ 1 TPM, respectively. Characterization of these LNC relative to the closest protein coding genes (PCG) revealed that 79% of LNC are in intergenic regions, as in other species. Expression analysis across 25 tissues revealed an enrichment of co-expressed LNC:PCG pairs, suggesting co-regulation and/or co-function. As expected, LNC were more tissue-specific than PCG (25% vs. 10%). As in human, 16% of chicken LNC hosted one or more miRNA. We highlighted a new chicken LNC, hosting miR155, conserved in human, highly expressed in immune tissues like miR155, and correlated with immunity-related PCG in both species. Among LNC:PCG pairs tissue-specific in the same tissue, we revealed an enrichment of divergent pairs with the PCG coding transcription factors, for example LHX5, HXD3 and TBX4, in both human and chicken.

DOI: http://dx.doi.org/10.1038/s41598-020-77586-x
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7686352
November 2020

A limited set of transcriptional programs define major cell types.

Genome Res 2020 07 29;30(7):1047-1059. Epub 2020 Jul 29.

Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, E-08003 Barcelona, Catalonia, Spain.

We have produced RNA sequencing data for 53 primary cells from different locations in the human body. The clustering of these primary cells reveals that most cells in the human body share a few broad transcriptional programs, which define five major cell types: epithelial, endothelial, mesenchymal, neural, and blood cells. These act as basic components of many tissues and organs. Based on gene expression, these cell types redefine the basic histological types by which tissues have been traditionally classified. We identified genes whose expression is specific to these cell types, and from these genes, we estimated the contribution of the major cell types to the composition of human tissues. We found this cellular composition to be a characteristic signature of tissues and to reflect tissue morphological heterogeneity and histology. We identified changes in cellular composition in different tissues associated with age and sex, and found that departures from the normal cellular composition correlate with histological phenotypes associated with disease.

DOI: http://dx.doi.org/10.1101/gr.263186.120
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7397875
July 2020

Multi-species annotation of transcriptome and chromatin structure in domesticated animals.

BMC Biol 2019 12 30;17(1):108. Epub 2019 Dec 30.

GenPhySE, Université de Toulouse, INRA, INPT, ENVT, Chemin de Borde Rouge, Castanet-Tolosan Cedex, F-31326, France.

Background: Comparative genomics studies are central in identifying the coding and non-coding elements associated with complex traits, and the functional annotation of genomes is a critical step to decipher the genotype-to-phenotype relationships in livestock animals. As part of the Functional Annotation of Animal Genomes (FAANG) action, the FR-AgENCODE project aimed to create reference functional maps of domesticated animals by profiling the landscape of transcription (RNA-seq), chromatin accessibility (ATAC-seq) and conformation (Hi-C) in species representing ruminants (cattle, goat), monogastrics (pig) and birds (chicken), using three target samples related to metabolism (liver) and immunity (CD4+ and CD8+ T cells).

Results: RNA-seq assays considerably extended the available catalog of annotated transcripts and identified differentially expressed genes with unknown function, including new syntenic lncRNAs. ATAC-seq highlighted an enrichment for transcription factor binding sites in differentially accessible regions of the chromatin. Comparative analyses revealed a core set of conserved regulatory regions across species. Topologically associating domains (TADs) and epigenetic A/B compartments annotated from Hi-C data were consistent with RNA-seq and ATAC-seq data. Multi-species comparisons showed that conserved TAD boundaries had stronger insulation properties than species-specific ones and that the genomic distribution of orthologous genes in A/B compartments was significantly conserved across species.

Conclusions: We report the first multi-species and multi-assay genome annotation results obtained by a FAANG project. Beyond the generation of reference annotations and the confirmation of previous findings on model animals, the integrative analysis of data from multiple assays and species sheds new light on the multi-scale selective pressure shaping genome organization from birds to mammals. Overall, these results emphasize the value of FAANG for research on domesticated animals and reinforce the importance of future meta-analyses of the reference datasets being generated by this community on different species.

DOI: http://dx.doi.org/10.1186/s12915-019-0726-5
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6936065
December 2019

Analysis of pig transcriptomes suggests a global regulation mechanism enabling temporary bursts of circular RNAs.

RNA Biol 2019 09 3;16(9):1190-1204. Epub 2019 Jun 3.

Institute of Genome Biology, Leibniz Institute for Farm Animal Biology (FBN), Dummerstorf, Germany.

To investigate the dynamics of circRNA expression in pig testes, we designed specific strategies to individually study circRNA production from intron lariats and circRNAs originating from back-splicing of two exons. By applying these methods to seven Total-RNA-seq datasets sampled during testicular puberty, we detected 126 introns in 114 genes able to produce circRNAs and 5,236 exonic circRNAs produced by 2,516 genes. Comparing our RNA-seq datasets to datasets from the literature (embryonic cortex and postnatal muscle stages) revealed highly abundant intronic and exonic circRNAs in one sample each in pubertal testis and embryonic cortex, respectively. This abundance was due to higher production of circRNA by the same genes in comparison to other testis samples, rather than to the recruitment of new genes. No global relationship between circRNA and mRNA production was found. We propose ExoCirc-9244 () as a marker of a particular stage in testis, which is characterized by a very low plasma estradiol level and a high abundance of circRNA in testis. We hypothesize that the abundance of testicular circRNA is associated with an abrupt switch of the cellular process to overcome a particular challenge that may have arisen in the early stages of steroid production. We also hypothesize that, in certain circumstances, isoforms and circular transcripts from different genes share functions and that a global regulation of circRNA production is established. Our data indicate that this massive production of circRNAs is much more related to the structure of the genes generating circRNAs than to their function.
Abbreviations: PE: paired ends; CR: chimeric read; SR: split read; circRNA: circular RNA; NC: non-conventional; ExoCirc-RNA: exonic circular RNA; IntroLCirc-: name of a porcine intronic lariat circRNA; ExoCirc-: name of a porcine exonic circRNA; IntronCircle-: name of a porcine intron circle; sisRNA: stable intronic sequence RNA; P: porcine breed Pietrain; LW: porcine breed Large White; RT: reverse transcription/reverse transcriptase; Total-RNA-seq: RNA-seq obtained from total RNA after ribosomal depletion; mRNA-seq: RNA-seq of poly(A) transcripts; TPM: transcripts per million; CR-PM: chimeric reads per million; RBP: RNA binding protein; miRNA: micro RNA; E2: estradiol; DHT: dihydrotestosterone.

DOI: http://dx.doi.org/10.1080/15476286.2019.1621621
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6693536
September 2019

An atlas of human long non-coding RNAs with accurate 5' ends.

Nature 2017 03 1;543(7644):199-204. Epub 2017 Mar 1.

Australian Institute for Bioengineering and Nanotechnology, The University of Queensland, Brisbane 4072, Australia.

Long non-coding RNAs (lncRNAs) are largely heterogeneous and functionally uncharacterized. Here, using FANTOM5 cap analysis of gene expression (CAGE) data, we integrate multiple transcript collections to generate a comprehensive atlas of 27,919 human lncRNA genes with high-confidence 5' ends and expression profiles across 1,829 samples from the major human primary cell types and tissues. Genomic and epigenomic classification of these lncRNAs reveals that most intergenic lncRNAs originate from enhancers rather than from promoters. Incorporating genetic and expression data, we show that lncRNAs overlapping trait-associated single nucleotide polymorphisms are specifically expressed in cell types relevant to the traits, implicating these lncRNAs in multiple diseases. We further demonstrate that lncRNAs overlapping expression quantitative trait loci (eQTL)-associated single nucleotide polymorphisms of messenger RNAs are co-expressed with the corresponding messenger RNAs, suggesting their potential roles in transcriptional regulation. Combining these findings with conservation data, we identify 19,175 potentially functional lncRNAs in the human genome.

DOI: http://dx.doi.org/10.1038/nature21374
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6857182
March 2017

Long noncoding RNA repertoire in chicken liver and adipose tissue.

Genet Sel Evol 2017 01 10;49(1). Epub 2017 Jan 10.

UMR PEGASE, INRA, 35042, Rennes, France.

Background: Improving functional annotation of the chicken genome is a key challenge in bridging the gap between genotype and phenotype. Among all transcribed regions, long noncoding RNAs (lncRNAs) are a major component of the transcriptome and its regulation, and whole-transcriptome sequencing (RNA-Seq) has greatly improved their identification and characterization. We performed an extensive profiling of the lncRNA transcriptome in the chicken liver and adipose tissue by RNA-Seq. We focused on these two tissues because of their importance in various economically important traits for which energy storage and mobilization play key roles and also because of their high cell homogeneity. To predict lncRNAs, we used a recently developed tool called FEELnc, which also classifies them with respect to their distance and strand orientation to the closest protein-coding genes. Moreover, to confidently identify the genes/transcripts expressed in each tissue (a complex task for weakly expressed molecules such as lncRNAs), we probed a particularly large number of biological replicates (16 per tissue) compared to common multi-tissue studies with a larger set of tissues but less sampling.

Results: We predicted 2193 lncRNA genes, among which 1670 were robustly expressed across replicates in the liver and/or adipose tissue and which were classified into 1493 intergenic and 177 intragenic lncRNAs located between and within protein-coding genes, respectively. We observed similar structural features between chickens and mammals, with strong synteny conservation but without sequence conservation. As previously reported, we confirm that lncRNAs have a lower and more tissue-specific expression than mRNAs. Finally, we showed that adjacent lncRNA-mRNA genes in divergent orientation have a higher co-expression level when separated by less than 1 kb compared to more distant divergent pairs. Among these, we highlighted for the first time a novel lncRNA candidate involved in lipid metabolism, lnc_DHCR24, which is highly correlated with the DHCR24 gene that encodes a key enzyme of cholesterol biosynthesis.

Conclusions: We provide a comprehensive lncRNA repertoire in the chicken liver and adipose tissue, which shows interesting patterns of co-expression between mRNAs and lncRNAs. It contributes to improving the structural and functional annotation of the chicken genome and provides a basis for further studies on energy storage and mobilization traits in the chicken.
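
The classification of lncRNA-mRNA pairs by distance and strand orientation can be sketched as follows. This is an illustrative sketch and not FEELnc's implementation: it assumes 1-based inclusive coordinates, treats a pair as divergent when the two genes lie on opposite strands with the minus-strand transcription start site (TSS) upstream of the plus-strand TSS, and measures the separation between the two TSSs. The `Gene` class, function names, and coordinates are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    name: str
    chrom: str
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    strand: str  # "+" or "-"

    @property
    def tss(self) -> int:
        """Transcription start site: start on '+', end on '-'."""
        return self.start if self.strand == "+" else self.end


def is_divergent(a: Gene, b: Gene) -> bool:
    """True when a and b are transcribed away from each other (head to head)."""
    if a.chrom != b.chrom or a.strand == b.strand:
        return False
    plus, minus = (a, b) if a.strand == "+" else (b, a)
    # Assumed definition: the minus-strand TSS lies upstream of the plus-strand TSS.
    return minus.tss < plus.tss


def tss_distance(a: Gene, b: Gene) -> int:
    """Absolute distance in bp between the two TSSs."""
    return abs(a.tss - b.tss)


def is_close_divergent(a: Gene, b: Gene, max_dist: int = 1000) -> bool:
    """Divergent pair whose TSSs are separated by less than max_dist bp."""
    return is_divergent(a, b) and tss_distance(a, b) < max_dist


# Hypothetical coordinates for illustration only.
lnc = Gene("lnc_example", "chr1", 1000, 5000, "-")
mrna_near = Gene("mRNA_near", "chr1", 5400, 9000, "+")
mrna_far = Gene("mRNA_far", "chr1", 8000, 12000, "+")

print(is_close_divergent(lnc, mrna_near))  # TSSs 400 bp apart
print(is_close_divergent(lnc, mrna_far))   # TSSs 3000 bp apart
```

Pairs selected this way could then be compared for co-expression against more distant divergent pairs, which is the contrast the Results paragraph reports.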

DOI: http://dx.doi.org/10.1186/s12711-016-0275-0
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5225574
January 2017

ChimPipe: accurate detection of fusion genes and transcription-induced chimeras from RNA-seq data.

BMC Genomics 2017 01 3;18(1). Epub 2017 Jan 3.

Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Dr. Aiguader 88, Barcelona, 08003, Spain.

Background: Chimeric transcripts are commonly defined as transcripts linking two or more different genes in the genome, and can be explained by various biological mechanisms such as genomic rearrangement, read-through or trans-splicing, but also by technical or biological artefacts. Several studies have shown their importance in cancer, cell pluripotency and motility. Many programs have recently been developed to identify chimeras from Illumina RNA-seq data (mostly fusion genes in cancer). However, the outputs of different programs on the same dataset can be widely inconsistent, and tend to include many false positives. Other issues relate to simulated datasets restricted to fusion genes, real datasets with limited numbers of validated cases, result inconsistencies between simulated and real datasets, and gene rather than junction level assessment.

Results: Here we present ChimPipe, a modular and easy-to-use method to reliably identify fusion genes and transcription-induced chimeras from paired-end Illumina RNA-seq data. We have also produced realistic simulated datasets for three different read lengths, and enhanced two gold-standard cancer datasets by associating exact junction points to validated gene fusions. Benchmarking ChimPipe together with four other state-of-the-art tools on this data showed ChimPipe to be the top program at identifying exact junction coordinates for both kinds of datasets, and the one showing the best trade-off between sensitivity and precision. Applied to 106 ENCODE human RNA-seq datasets, ChimPipe identified 137 high confidence chimeras connecting the protein coding sequence of their parent genes. In subsequent experiments, three out of four predicted chimeras, two of which were recurrently expressed in a large majority of the samples, could be validated. Cloning and sequencing of the three cases revealed several new chimeric transcript structures, three of which have the potential to encode a chimeric protein for which we hypothesized a new role. Applying ChimPipe to human and mouse ENCODE RNA-seq data led to the identification of 131 recurrent chimeras common to both species, and therefore potentially conserved.

Conclusions: ChimPipe combines discordant paired-end reads and split-reads to detect any kind of chimera, including those originating from polymerase read-through, and shows an excellent trade-off between sensitivity and precision. The chimeras found by ChimPipe can be validated in vitro with high accuracy.

DOI: http://dx.doi.org/10.1186/s12864-016-3404-9
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5209911
January 2017

Erratum to: Bioinformatics Pipeline for Transcriptome Sequencing Analysis.

Methods Mol Biol 2017 ;1468:E1

CNRS UMR6290 Dog Genetic Team, 2 av du Pr. Léon Bernard, Rennes, 35043, France.

DOI: http://dx.doi.org/10.1007/978-1-4939-4035-6_17
January 2017

Erratum to: A benchmark for RNA-seq quantification pipelines.

Genome Biol 2016 09 30;17(1):203. Epub 2016 Sep 30.

Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute, 450 Brookline Avenue, Boston, MA, 02215, USA.

DOI: http://dx.doi.org/10.1186/s13059-016-1060-7
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5045616
September 2016

Bioinformatics Pipeline for Transcriptome Sequencing Analysis.

Methods Mol Biol 2017 ;1468:201-19

CNRS UMR6290 Dog Genetic Team, 2 av du Pr. Léon Bernard, Rennes, 35043, France.

The development of High Throughput Sequencing (HTS) for RNA profiling (RNA-seq) has shed light on the diversity of transcriptomes. While RNA-seq is becoming a de facto standard for monitoring the population of expressed transcripts in a given condition at a specific time, processing the huge amount of data it generates requires dedicated bioinformatics programs. Here, we describe a standard bioinformatics protocol using state-of-the-art tools: the STAR mapper to align reads onto a reference genome, Cufflinks to reconstruct the transcriptome, and RSEM to quantify expression levels of genes and transcripts. We present the workflow using human transcriptome sequencing data from two biological replicates of the K562 cell line produced as part of the ENCODE3 project.
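
The three-step workflow named in this abstract (STAR, then Cufflinks, then RSEM) could be wired together roughly as in the sketch below. It only builds the command lines and does not run them. The options shown are ones these tools document, but the parameters actually used in the chapter are not given in the abstract, so every file name, index path, thread count, and option choice here is an assumption.

```python
import shlex


def build_pipeline(sample, fastq1, fastq2, star_index, rsem_ref,
                   annotation_gtf, threads=8):
    """Return the STAR, Cufflinks and RSEM commands for one paired-end sample.

    All paths are placeholders supplied by the caller; uncompressed FASTQ
    input and pre-built STAR and RSEM references are assumed.
    """
    prefix = f"{sample}."
    star = [
        "STAR",
        "--runThreadN", str(threads),
        "--genomeDir", star_index,
        "--readFilesIn", fastq1, fastq2,
        "--outFileNamePrefix", prefix,
        "--outSAMtype", "BAM", "SortedByCoordinate",
        # Adds the XS strand attribute Cufflinks expects for unstranded data.
        "--outSAMstrandField", "intronMotif",
        # Also writes alignments in transcript coordinates for RSEM.
        "--quantMode", "TranscriptomeSAM",
    ]
    genome_bam = f"{prefix}Aligned.sortedByCoord.out.bam"
    transcriptome_bam = f"{prefix}Aligned.toTranscriptome.out.bam"

    cufflinks = [
        "cufflinks",
        "-p", str(threads),
        "-o", f"{sample}_cufflinks",
        "-g", annotation_gtf,  # annotation-guided assembly
        genome_bam,
    ]
    rsem = [
        "rsem-calculate-expression",
        "--bam", "--paired-end",
        "-p", str(threads),
        transcriptome_bam, rsem_ref, f"{sample}_rsem",
    ]
    return [star, cufflinks, rsem]


# Hypothetical inputs for one replicate.
commands = build_pipeline(
    "K562_rep1", "rep1_R1.fastq", "rep1_R2.fastq",
    "star_index", "rsem_ref/human", "annotation.gtf", threads=4,
)
for cmd in commands:
    print(shlex.join(cmd))
```

To execute the steps in order, each list could be passed to `subprocess.run(cmd, check=True)`, which stops the workflow as soon as one tool exits with an error.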

DOI: http://dx.doi.org/10.1007/978-1-4939-4035-6_14
January 2018

Gene-specific patterns of expression variation across organs and species.

Genome Biol 2016 07 8;17(1):151. Epub 2016 Jul 8.

Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Dr. Aiguader 88, Barcelona, 08003, Spain.

Background: A comparison of transcriptional profiles derived from different tissues in a given species or among different species assumes that commonalities reflect evolutionarily conserved programs and that differences reflect species or tissue responses to environmental conditions or developmental program staging. Apparently conflicting results have been published regarding whether organ-specific transcriptional patterns dominate over species-specific patterns, or vice versa, making it unclear to what extent the biology of a given organism can be extrapolated to another. These studies have in common that they treat the transcriptomes monolithically, implicitly ignoring that each gene is likely to have a specific pattern of transcriptional variation across organs and species.

Results: We use linear models to quantify this pattern. We find a continuum in the spectrum of expression variation: the expression of some genes varies considerably across species and little across organs, and simply reflects evolutionary distance. At the other extreme are genes whose expression varies considerably across organs and little across species; these genes are much more likely to be associated with diseases than are genes whose expression varies predominantly across species.

Conclusions: Whether transcriptomes, when considered globally, cluster preferentially according to one component or the other may not be a property of the transcriptomes, but rather a consequence of the dominant behavior of a subset of genes. Therefore, the values of the components of the variance of expression for each gene could become a useful resource when planning, interpreting, and extrapolating experimental data from mouse to humans.
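
One simple way to picture the per-gene linear model described in the Results is an additive two-factor decomposition of a gene's expression variance into an organ component, a species component, and a residual. The sketch below assumes a balanced design with one value per organ and species and no interaction term; it is not the authors' exact model, and the expression values are hypothetical.

```python
def variance_fractions(expr):
    """Split one gene's expression variance into organ, species and residual.

    expr[organ][species] holds one expression value (for example log TPM).
    A balanced design (every organ measured in every species) is assumed.
    """
    organs = sorted(expr)
    species = sorted(expr[organs[0]])
    n_o, n_s = len(organs), len(species)

    grand = sum(expr[o][s] for o in organs for s in species) / (n_o * n_s)
    organ_mean = {o: sum(expr[o][s] for s in species) / n_s for o in organs}
    species_mean = {s: sum(expr[o][s] for o in organs) / n_o for s in species}

    ss_total = sum((expr[o][s] - grand) ** 2 for o in organs for s in species)
    ss_organ = n_s * sum((organ_mean[o] - grand) ** 2 for o in organs)
    ss_species = n_o * sum((species_mean[s] - grand) ** 2 for s in species)
    ss_resid = ss_total - ss_organ - ss_species

    if ss_total == 0:
        return {"organ": 0.0, "species": 0.0, "residual": 0.0}
    return {
        "organ": ss_organ / ss_total,
        "species": ss_species / ss_total,
        "residual": ss_resid / ss_total,
    }


# Hypothetical gene whose expression tracks the organ.
organ_gene = {
    "brain": {"human": 8.0, "mouse": 8.2},
    "heart": {"human": 5.0, "mouse": 5.1},
    "liver": {"human": 2.0, "mouse": 2.1},
}
# Hypothetical gene whose expression tracks the species.
species_gene = {
    "brain": {"human": 8.0, "mouse": 2.0},
    "heart": {"human": 7.9, "mouse": 2.1},
    "liver": {"human": 8.1, "mouse": 2.2},
}
organ_fr = variance_fractions(organ_gene)
species_fr = variance_fractions(species_gene)
print(organ_fr)
print(species_fr)
```

Plotting the organ fraction against the species fraction for every gene would place genes along the continuum the abstract describes, from species-dominated to organ-dominated variation.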

DOI: http://dx.doi.org/10.1186/s13059-016-1008-y
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4937605
July 2016

Erratum to: A benchmark for RNA-seq quantification pipelines.

Genome Biol 2016 05 23;17(1):107. Epub 2016 May 23.

Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute, 450 Brookline Avenue, Boston, MA, 02215, USA.

DOI: http://dx.doi.org/10.1186/s13059-016-0986-0
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4877800
May 2016

A benchmark for RNA-seq quantification pipelines.

Genome Biol 2016 Apr 23;17:74. Epub 2016 Apr 23.

Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute, 450 Brookline Avenue, Boston, MA, 02215, USA.

Obtaining RNA-seq measurements involves a complex data analytical process with a large number of competing algorithms as options. There is much debate about which of these methods provides the best approach. Unfortunately, it is currently difficult to evaluate their performance due in part to a lack of sensitive assessment metrics. We present a series of statistical summaries and plots to evaluate the performance in terms of specificity and sensitivity, available as an R/Bioconductor package (http://bioconductor.org/packages/rnaseqcomp). Using two independent datasets, we assessed seven competing pipelines. Performance was generally poor, with two methods clearly underperforming and RSEM slightly outperforming the rest.

DOI: http://dx.doi.org/10.1186/s13059-016-0940-1
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4842274
April 2016

Human genomics. The human transcriptome across tissues and individuals.

Science 2015 May;348(6235):660-5

Center for Genomic Regulation (CRG), Barcelona, Catalonia, Spain. Universitat Pompeu Fabra (UPF), Barcelona, Catalonia, Spain. Institut Hospital del Mar d'Investigacions Mèdiques (IMIM), Barcelona, Catalonia, Spain. Joint CRG-Barcelona Super Computing Center (BSC)-Institut de Recerca Biomedica (IRB) Program in Computational Biology, Barcelona, Catalonia, Spain.

Transcriptional regulation and posttranscriptional processing underlie many cellular and organismal phenotypes. We used RNA sequence data generated by the Genotype-Tissue Expression (GTEx) project to investigate the patterns of transcriptome variation across individuals and tissues. Tissues exhibit characteristic transcriptional signatures that show stability in postmortem samples. These signatures are dominated by a relatively small number of genes (most clearly seen in blood), though few are exclusive to a particular tissue, and vary more across tissues than across individuals. Genes exhibiting high interindividual expression variation include disease candidates associated with sex, ethnicity, and age. Primary transcription is the major driver of cellular specificity, with splicing playing mostly a complementary role, except for the brain, which exhibits a more divergent splicing program. Variation in splicing, despite its stochasticity, may in contrast play a comparatively greater role in defining individual phenotypes.

DOI: http://dx.doi.org/10.1126/science.aaa0355
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4547472
May 2015

Enhanced transcriptome maps from multiple mouse tissues reveal evolutionary constraint in gene expression.

Nat Commun 2015 Jan 13;6:5903. Epub 2015 Jan 13.

Functional Genomics Group, Cold Spring Harbor Laboratory, 1 Bungtown Road, Cold Spring Harbor, New York 11724, USA.

Mice have been a long-standing model for human biology and disease. Here we characterize, by RNA sequencing, the transcriptional profiles of a large and heterogeneous collection of mouse tissues, augmenting the mouse transcriptome with thousands of novel transcript candidates. Comparison with transcriptome profiles in human cell lines reveals substantial conservation of transcriptional programmes, and uncovers a distinct class of genes with levels of expression that have been constrained early in vertebrate evolution. This core set of genes captures a substantial fraction of the transcriptional output of mammalian cells, and participates in basic functional and structural housekeeping processes common to all cell types. Perturbation of these constrained genes is associated with significant phenotypes including embryonic lethality and cancer. Evolutionary constraint in gene expression levels is not reflected in the conservation of the genomic sequences, but is associated with conserved epigenetic marking, as well as with a characteristic post-transcriptional regulatory programme, in which sub-cellular localization and alternative splicing play comparatively large roles.

DOI: http://dx.doi.org/10.1038/ncomms6903
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4308717
January 2015

A comparative encyclopedia of DNA elements in the mouse genome.

Nature 2014 Nov;515(7527):355-64

Bioinformatics and Genomics, Centre for Genomic Regulation (CRG) and UPF, Doctor Aiguader, 88, 08003 Barcelona, Catalonia, Spain.

The laboratory mouse shares the majority of its protein-coding genes with humans, making it the premier model organism in biomedical research, yet the two mammals differ in significant ways. To gain greater insights into both shared and species-specific transcriptional and cellular regulatory programs in the mouse, the Mouse ENCODE Consortium has mapped transcription, DNase I hypersensitivity, transcription factor binding, chromatin modifications and replication domains throughout the mouse genome in diverse cell and tissue types. By comparing with the human genome, we not only confirm substantial conservation in the newly annotated potential functional sequences, but also find a large degree of divergence of sequences involved in transcriptional regulation, chromatin state and higher order chromatin organization. Our results illuminate the wide range of evolutionary forces acting on genes and their regulatory regions, and provide a general resource for research into mammalian biology and mechanisms of human diseases.

DOI: http://dx.doi.org/10.1038/nature13992
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4266106
November 2014

Comparative analysis of the transcriptome across distant species.

Nature 2014 Aug;512(7515):445-8

Functional Genomics, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York 11724, USA.

The transcriptome is the readout of the genome. Identifying common features in it across distant species can reveal fundamental principles. To this end, the ENCODE and modENCODE consortia have generated large amounts of matched RNA-sequencing data for human, worm and fly. Uniform processing and comprehensive annotation of these data allow comparison across metazoan phyla, extending beyond earlier within-phylum transcriptome comparisons and revealing ancient, conserved features. Specifically, we discover co-expression modules shared across animals, many of which are enriched in developmental genes. Moreover, we use expression patterns to align the stages in worm and fly development and find a novel pairing between worm embryo and fly pupae, in addition to the embryo-to-embryo and larvae-to-larvae pairings. Furthermore, we find that the extent of non-canonical, non-coding transcription is similar in each organism, per base pair. Finally, we find in all three organisms that the gene-expression levels, both coding and non-coding, can be quantitatively predicted from chromatin features at the promoter using a 'universal model' based on a single set of organism-independent parameters.

DOI: http://dx.doi.org/10.1038/nature13424
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4155737
August 2014

Transcriptome characterization by RNA sequencing identifies a major molecular and clinical subdivision in chronic lymphocytic leukemia.

Genome Res 2014 Feb 21;24(2):212-26. Epub 2013 Nov 21.

Bioinformatics and Genomics Programme, Centre for Genomic Regulation (CRG), 08003 Barcelona, Catalonia, Spain.

Chronic lymphocytic leukemia (CLL) has heterogeneous clinical and biological behavior. Whole-genome and -exome sequencing has contributed to the characterization of the mutational spectrum of the disease, but the underlying transcriptional profile is still poorly understood. We have performed deep RNA sequencing in different subpopulations of normal B-lymphocytes and CLL cells from a cohort of 98 patients, and characterized the CLL transcriptional landscape with unprecedented resolution. We detected thousands of transcriptional elements differentially expressed between the CLL and normal B cells, including protein-coding genes, noncoding RNAs, and pseudogenes. Transposable elements are globally derepressed in CLL cells. In addition, two thousand genes, most of which are not differentially expressed, exhibit CLL-specific splicing patterns. Genes involved in metabolic pathways showed higher expression in CLL, while genes related to spliceosome, proteasome, and ribosome were among the most down-regulated in CLL. Clustering of the CLL samples according to RNA-seq derived gene expression levels unveiled two robust molecular subgroups, C1 and C2. C1/C2 subgroups and the mutational status of the immunoglobulin heavy variable (IGHV) region were the only independent variables in predicting time to treatment in a multivariate analysis with main clinico-biological features. This subdivision was validated in an independent cohort of patients monitored through DNA microarrays. Further analysis shows that B-cell receptor (BCR) activation in the microenvironment of the lymph node may be at the origin of the C1/C2 differences.

DOI: http://dx.doi.org/10.1101/gr.152132.112
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3912412
February 2014

Unravelling the hidden DNA structural/physical code provides novel insights on promoter location.

Nucleic Acids Res 2013 Aug 12;41(15):7220-30. Epub 2013 Jun 12.

Institute for Research in Biomedicine (IRB Barcelona), Barcelona 08028, Spain, Joint IRB-BSC Research Program on Computational Biology, Barcelona 08028, Spain, Bioinformatics and Genomics Group, Center for Genomic Regulation and Universitat Pompeu Fabra, Barcelona 08003, Spain, Barcelona Supercomputing Center, Barcelona 08034, Spain and Department of Biochemistry and Molecular Biology, University of Barcelona, Barcelona 08028, Spain.

Although protein recognition of DNA motifs in promoter regions has traditionally been considered a critical regulatory element in transcription, locating promoters, and in particular transcription start sites (TSSs), remains a challenge. Here we perform a comprehensive analysis of putative core promoter sequences relative to non-annotated predicted TSSs along the human genome, which were defined by distinct DNA physical properties implemented in our ProStar computational algorithm. A representative sampling of predicted regions was subjected to extensive experimental validation and analyses. Interestingly, the vast majority proved to be transcriptionally active despite the lack of specific sequence motifs, indicating that physical signaling is indeed able to detect promoter activity beyond conventional TSS prediction methods. Furthermore, highly active regions displayed typical chromatin features associated with promoters of housekeeping genes. Our results make it possible to redefine promoter signatures and to analyze the diversity, evolutionary conservation and dynamic regulation of human core promoters at large scale. Moreover, the present study strongly supports the hypothesis of an ancient regulatory mechanism encoded by the intrinsic physical properties of the DNA that may contribute to the complexity of transcription regulation in the human genome.

Source
DOI: http://dx.doi.org/10.1093/nar/gkt511
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3753636
August 2013

The GENCODE v7 catalog of human long noncoding RNAs: analysis of their gene structure, evolution, and expression.

Genome Res 2012 Sep;22(9):1775-89

Bioinformatics and Genomics, Centre for Genomic Regulation and UPF, 08003 Barcelona, Catalonia, Spain.

The human genome contains many thousands of long noncoding RNAs (lncRNAs). While several studies have demonstrated compelling biological and disease roles for individual examples, analytical and experimental approaches to investigate these genes have been hampered by the lack of comprehensive lncRNA annotation. Here, we present and analyze the most complete human lncRNA annotation to date, produced by the GENCODE consortium within the framework of the ENCODE project and comprising 9277 manually annotated genes producing 14,880 transcripts. Our analyses indicate that lncRNAs are generated through pathways similar to those of protein-coding genes, with similar histone-modification profiles, splicing signals, and exon/intron lengths. In contrast to protein-coding genes, however, lncRNAs display a striking bias toward two-exon transcripts, they are predominantly localized in the chromatin and nucleus, and a fraction appear to be preferentially processed into small RNAs. They are under stronger selective pressure than neutrally evolving sequences, particularly in their promoter regions, which display levels of selection comparable to protein-coding genes. Importantly, about one-third seem to have arisen within the primate lineage. Comprehensive analysis of their expression in multiple human organs and brain regions shows that lncRNAs are generally expressed at lower levels than protein-coding genes, and display more tissue-specific expression patterns, with a large fraction of tissue-specific lncRNAs expressed in the brain. Expression correlation analysis indicates that lncRNAs show particularly striking positive correlation with the expression of antisense coding genes. This GENCODE annotation represents a valuable resource for future studies of lncRNAs.

Source
DOI: http://dx.doi.org/10.1101/gr.132159.111
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3431493
September 2012

Understanding transcriptional regulation by integrative analysis of transcription factor binding data.

Genome Res 2012 Sep;22(9):1658-67

Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut 06520, USA.

Statistical models have been used to quantify the relationship between gene expression and transcription factor (TF) binding signals. Here we apply the models to the large-scale data generated by the ENCODE project to study transcriptional regulation by TFs. Our results reveal a notable difference in the prediction accuracy of expression levels of transcription start sites (TSSs) captured by different technologies and RNA extraction protocols. In general, the expression levels of TSSs with high CpG content are more predictable than those with low CpG content. For genes with alternative TSSs, the expression levels of downstream TSSs are more predictable than those of the upstream ones. Different TF categories and specific TFs vary substantially in their contributions to predicting expression. Between two cell lines, the differential expression of TSS can be precisely reflected by the difference of TF-binding signals in a quantitative manner, arguing against the conventional on-and-off model of TF binding. Finally, we explore the relationships between TF-binding signals and other chromatin features such as histone modifications and DNase hypersensitivity for determining expression. The models imply that these features regulate transcription in a highly coordinated manner.

Source
DOI: http://dx.doi.org/10.1101/gr.136838.111
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3431483
September 2012

Deep sequencing of subcellular RNA fractions shows splicing to be predominantly co-transcriptional in the human genome but inefficient for lncRNAs.

Genome Res 2012 Sep;22(9):1616-25

Centre for Genomic Regulation and UPF, E-08003, Barcelona, Catalonia, Spain.

Splicing remains an incompletely understood process. Recent findings suggest that chromatin structure participates in its regulation. Here, we analyze the RNA from subcellular fractions obtained through RNA-seq in the cell line K562. We show that in the human genome, splicing occurs predominantly during transcription. We introduce the coSI measure, based on RNA-seq reads mapping to exon junctions and borders, to assess the degree of splicing completion around internal exons. We show that, as expected, splicing is almost fully completed in cytosolic polyA+ RNA. In chromatin-associated RNA (which includes the RNA that is being transcribed), for 5.6% of exons, the removal of the surrounding introns is fully completed, compared with 0.3% of exons for which no intron-removal has occurred. The remaining exons exist as a mixture of spliced and fewer unspliced molecules, with a median coSI of 0.75. Thus, most RNAs undergo splicing while being transcribed: "co-transcriptional splicing." Consistent with co-transcriptional spliceosome assembly and splicing, we have found significant enrichment of spliceosomal snRNAs in chromatin-associated RNA compared with other cellular RNA fractions and other nonspliceosomal snRNAs. CoSI scores decrease along the gene, pointing to a "first transcribed, first spliced" rule, yet more downstream exons carry other characteristics, favoring rapid, co-transcriptional intron removal. Exons with low coSI values, that is, in the process of being spliced, are enriched with chromatin marks, consistent with a role for chromatin in splicing during transcription. For alternative exons and long noncoding RNAs, splicing tends to occur later, and the latter might remain unspliced in some cases.

Source
DOI: http://dx.doi.org/10.1101/gr.134445.111
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3431479
September 2012

Landscape of transcription in human cells.

Nature 2012 Sep;489(7414):101-8

Centre for Genomic Regulation and UPF, Doctor Aiguader 88, Barcelona 08003, Catalonia, Spain.

Eukaryotic cells make many types of primary and processed RNAs that are found either in specific subcellular compartments or throughout the cells. A complete catalogue of these RNAs is not yet available and their characteristic subcellular localizations are also poorly understood. Because RNA represents the direct output of the genetic information encoded by genomes and a significant proportion of a cell's regulatory capabilities are focused on its synthesis, processing, transport, modification and translation, the generation of such a catalogue is crucial for understanding genome function. Here we report evidence that three-quarters of the human genome is capable of being transcribed, as well as observations about the range and levels of expression, localization, processing fates, regulatory regions and modifications of almost all currently annotated and thousands of previously unannotated RNAs. These observations, taken together, prompt a redefinition of the concept of a gene.

Source
DOI: http://dx.doi.org/10.1038/nature11233
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3684276
September 2012

Modeling gene expression using chromatin features in various cellular contexts.

Genome Biol 2012 Jun 13;13(9):R53. Epub 2012 Jun 13.

Program in Bioinformatics and Integrative Biology, Department of Biochemistry and Molecular Pharmacology, University of Massachusetts Medical School, Worcester, MA 01605, USA.

Background: Previous work has demonstrated that chromatin feature levels correlate with gene expression. The ENCODE project enables us to further explore this relationship using an unprecedented volume of data. Expression levels from more than 100,000 promoters were measured using a variety of high-throughput techniques applied to RNA extracted by different protocols from different cellular compartments of several human cell lines. ENCODE also generated the genome-wide mapping of eleven histone marks, one histone variant, and DNase I hypersensitivity sites in seven cell lines.

Results: We built a novel quantitative model to study the relationship between chromatin features and expression levels. Our study not only confirms that the general relationships found in previous studies hold across various cell lines, but also makes new suggestions about the relationship between chromatin features and gene expression levels. We found that expression status and expression levels can be predicted by different groups of chromatin features, both with high accuracy. We also found that expression levels measured by CAGE are better predicted than by RNA-PET or RNA-Seq, and different categories of chromatin features are the most predictive of expression for different RNA measurement methods. Additionally, PolyA+ RNA is overall more predictable than PolyA- RNA among different cell compartments, and PolyA+ cytosolic RNA measured with RNA-Seq is more predictable than PolyA+ nuclear RNA, while the opposite is true for PolyA- RNA.

Conclusions: Our study provides new insights into transcriptional regulation by analyzing chromatin features in different cellular contexts.

Source
DOI: http://dx.doi.org/10.1186/gb-2012-13-9-r53
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3491397
June 2012

An encyclopedia of mouse DNA elements (Mouse ENCODE).

Genome Biol 2012 Aug 13;13(8):418. Epub 2012 Aug 13.

To complement the human Encyclopedia of DNA Elements (ENCODE) project and to enable a broad range of mouse genomics efforts, the Mouse ENCODE Consortium is applying the same experimental pipelines developed for human ENCODE to annotate the mouse genome.

Source
DOI: http://dx.doi.org/10.1186/gb-2012-13-8-418
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3491367
August 2012

Evidence for transcript networks composed of chimeric RNAs in human cells.

PLoS One 2012 4;7(1):e28213. Epub 2012 Jan 4.

Bioinformatics and Genomics, Centre for Genomic Regulation and Universitat Pompeu Fabra, Barcelona, Catalonia, Spain.

The classic organization of a gene structure has followed the Jacob and Monod bacterial gene model proposed more than 50 years ago. Since then, empirical determinations of the complexity of the transcriptomes found in organisms from yeast to human have blurred the definition and physical boundaries of genes. Using multiple analysis approaches, we have characterized individual gene boundaries mapping on human chromosomes 21 and 22. Analyses of the locations of the 5' and 3' transcriptional termini of 492 protein-coding genes revealed that for 85% of these genes the boundaries extend beyond the current annotated termini, most often connecting with exons of transcripts from other well-annotated genes. The biological and evolutionary importance of these chimeric transcripts is underscored by (1) the non-random interconnections of genes involved, (2) the greater phylogenetic depth of the genes involved in many chimeric interactions, (3) the coordination of the expression of connected genes and (4) the close in vivo and three-dimensional proximity of the genomic regions being transcribed and contributing to parts of the chimeric RNAs. The non-random nature of the connection of the genes involved suggests that chimeric transcripts should not be studied in isolation, but together, as an RNA network.

Source
PLOS: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0028213
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3251577
May 2012

Efficient targeted transcript discovery via array-based normalization of RACE libraries.

Nat Methods 2008 Jul 25;5(7):629-35. Epub 2008 May 25.

Grup de Recerca en Informàtica Biomèdica, Institut Municipal d'Investigació Mèdica/Universitat Pompeu Fabra, Dr. Aiguader 88, 08003 Barcelona, Spain.

Rapid amplification of cDNA ends (RACE) is a widely used approach for transcript identification. Random clone selection from the RACE mixture, however, is an ineffective sampling strategy if the dynamic range of transcript abundances is large. To improve sampling efficiency of human transcripts, we hybridized the products of the RACE reaction onto tiling arrays and used the detected exons to delineate a series of reverse-transcriptase (RT)-PCRs, through which the original RACE transcript population was segregated into simpler transcript populations. We independently cloned the products and sequenced randomly selected clones. This approach, RACEarray, is superior to direct cloning and sequencing of RACE products because it specifically targets new transcripts and often results in overall normalization of transcript abundance. We show theoretically and experimentally that this strategy leads indeed to efficient sampling of new transcripts, and we investigated multiplexing the strategy by pooling RACE reactions from multiple interrogated loci before hybridization.

Source
DOI: http://dx.doi.org/10.1038/nmeth.1216
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2713501
July 2008

Exogean: a framework for annotating protein-coding genes in eukaryotic genomic DNA.

Genome Biol 2006 7;7 Suppl 1:S7.1-10. Epub 2006 Aug 7.

Dyogen Lab, CNRS UMR8541, Ecole Normale Supérieure, 46 rue d'Ulm, 75005 Paris, France.

Background: Accurate and automatic gene identification in eukaryotic genomic DNA is more than ever of crucial importance to efficiently exploit the large volume of assembled genome sequences available to the community. Automatic methods have always been considered less reliable than human expertise. This is illustrated in the EGASP project, where reference annotations against which all automatic methods are measured are generated by human annotators and experimentally verified. We hypothesized that replicating the accuracy of human annotators in an automatic method could be achieved by formalizing the rules and decisions that they use, in a mathematical formalism.

Results: We have developed Exogean, a flexible framework based on directed acyclic colored multigraphs (DACMs) that can represent biological objects (for example, mRNA, ESTs, protein alignments, exons) and relationships between them. Graphs are analyzed to process the information according to rules that replicate those used by human annotators. Simple individual starting objects given as input to Exogean are thus combined and synthesized into complex objects such as protein coding transcripts.

Conclusion: We show here, in the context of the EGASP project, that Exogean is currently the method that best reproduces protein coding gene annotations from human experts, in terms of identifying at least one exact coding sequence per gene. We discuss current limitations of the method and several avenues for improvement.

Source
DOI: http://dx.doi.org/10.1186/gb-2006-7-s1-s7
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1810556
September 2006