Publications by authors named "Ilya Vorontsov"

18 Publications

Assessing Ribosome Distribution Along Transcripts with Polarity Scores and Regression Slope Estimates.

Methods Mol Biol 2021;2252:269-294

Vavilov Institute of General Genetics, Russian Academy of Sciences, Moscow, Russia.

During translation, the rate of ribosome movement along mRNA varies. This leads to a non-uniform ribosome distribution along the transcript, depending on local mRNA sequence, structure, tRNA availability, and translation factor abundance, as well as the relationship between the overall rates of initiation, elongation, and termination. Stress, antibiotics, and genetic perturbations affecting composition and properties of translation machinery can alter the ribosome positional distribution dramatically. Here, we offer a computational protocol for analyzing positional distribution profiles using ribosome profiling (Ribo-Seq) data. The protocol uses papolarity, a new Python toolkit for the analysis of transcript-level short read coverage profiles. For a single sample, papolarity computes for each transcript the classic polarity metric, which, in the case of Ribo-Seq, reflects ribosome positional preferences. For comparison versus a control sample, papolarity estimates an improved metric, the relative linear regression slope of coverage along transcript length. This involves de-noising by profile segmentation with a Poisson model and aggregation of Ribo-Seq coverage within segments, thus achieving reliable estimates of the regression slope. The papolarity software and the associated protocol can be conveniently used for Ribo-Seq data analysis in the command-line Linux environment. The papolarity package is available through the Python pip package manager. The source code is available at https://github.com/autosome-ru/papolarity .

Source
DOI: http://dx.doi.org/10.1007/978-1-0716-1150-0_13
January 2021
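
As a rough illustration of the polarity metric mentioned in the papolarity abstract above, the sketch below computes one common formulation: a coverage-weighted mean of positions rescaled to the range from -1 (5' end) to +1 (3' end). The function name and the handling of empty profiles are assumptions made for this example; this is not papolarity's API, and it does not implement the segmentation-based regression slope.

```python
def polarity_score(coverage):
    """Coverage-weighted mean of normalized positions, in [-1, 1].

    `coverage` is a list of per-position read counts along one transcript.
    Hypothetical helper for illustration only (not part of papolarity).
    """
    n = len(coverage)
    total = sum(coverage)
    if n < 2 or total == 0:
        return None  # undefined for empty, single-position, or zero-coverage profiles
    # Normalized position: -1 at the first position, +1 at the last one.
    weighted = sum(d * (2 * i - (n - 1)) / (n - 1) for i, d in enumerate(coverage))
    return weighted / total


# Uniform coverage has no positional preference.
print(polarity_score([1, 1, 1]))
# Coverage concentrated at the 5' end gives a negative score.
print(polarity_score([5, 1, 0, 0]))
```

Negative values indicate coverage skewed towards the start of the profile, positive values towards its end.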

A holistic view of mouse enhancer architectures reveals analogous pleiotropic effects and correlation with human disease.

BMC Genomics 2020 Nov 2;21(1):754. Epub 2020 Nov 2.

Mammalian Genetics Unit, MRC Harwell Institute, Oxfordshire, OX11 0RD, UK.

Background: Efforts to elucidate the function of enhancers in vivo are underway but their vast numbers alongside differing enhancer architectures make it difficult to determine their impact on gene activity. By systematically annotating multiple mouse tissues with super- and typical-enhancers, we have explored their relationship with gene function and phenotype.

Results: Though super-enhancers drive high total- and tissue-specific expression of their associated genes, we find that typical-enhancers also contribute heavily to the tissue-specific expression landscape on account of their large numbers in the genome. Unexpectedly, we demonstrate that both enhancer types are preferentially associated with relevant 'tissue-type' phenotypes and exhibit no difference in phenotype effect size or pleiotropy. Modelling regulatory data alongside molecular data, we built a predictive model to infer gene-phenotype associations and use this model to predict potentially novel disease-associated genes.

Conclusion: Overall our findings reveal that differing enhancer architectures have a similar impact on mammalian phenotypes whilst harbouring differing cellular and expression effects. Together, our results systematically characterise enhancers with predicted phenotypic traits endorsing the role for both types of enhancers in human disease and disorders.

Source
DOI: http://dx.doi.org/10.1186/s12864-020-07109-5
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7607678
November 2020

Functional annotation of human long noncoding RNAs via molecular phenotyping.

Authors:
Jordan A Ramilowski Chi Wai Yip Saumya Agrawal Jen-Chien Chang Yari Ciani Ivan V Kulakovskiy Mickaël Mendez Jasmine Li Ching Ooi John F Ouyang Nick Parkinson Andreas Petri Leonie Roos Jessica Severin Kayoko Yasuzawa Imad Abugessaisa Altuna Akalin Ivan V Antonov Erik Arner Alessandro Bonetti Hidemasa Bono Beatrice Borsari Frank Brombacher Christopher JF Cameron Carlo Vittorio Cannistraci Ryan Cardenas Melissa Cardon Howard Chang Josée Dostie Luca Ducoli Alexander Favorov Alexandre Fort Diego Garrido Noa Gil Juliette Gimenez Reto Guler Lusy Handoko Jayson Harshbarger Akira Hasegawa Yuki Hasegawa Kosuke Hashimoto Norihito Hayatsu Peter Heutink Tetsuro Hirose Eddie L Imada Masayoshi Itoh Bogumil Kaczkowski Aditi Kanhere Emily Kawabata Hideya Kawaji Tsugumi Kawashima S Thomas Kelly Miki Kojima Naoto Kondo Haruhiko Koseki Tsukasa Kouno Anton Kratz Mariola Kurowska-Stolarska Andrew Tae Jun Kwon Jeffrey Leek Andreas Lennartsson Marina Lizio Fernando López-Redondo Joachim Luginbühl Shiori Maeda Vsevolod J Makeev Luigi Marchionni Yulia A Medvedeva Aki Minoda Ferenc Müller Manuel Muñoz-Aguirre Mitsuyoshi Murata Hiromi Nishiyori Kazuhiro R Nitta Shuhei Noguchi Yukihiko Noro Ramil Nurtdinov Yasushi Okazaki Valerio Orlando Denis Paquette Callum J C Parr Owen J L Rackham Patrizia Rizzu Diego Fernando Sánchez Martinez Albin Sandelin Pillay Sanjana Colin A M Semple Youtaro Shibayama Divya M Sivaraman Takahiro Suzuki Suzannah C Szumowski Michihira Tagami Martin S Taylor Chikashi Terao Malte Thodberg Supat Thongjuea Vidisha Tripathi Igor Ulitsky Roberto Verardo Ilya E Vorontsov Chinatsu Yamamoto Robert S Young J Kenneth Baillie Alistair R R Forrest Roderic Guigó Michael M Hoffman Chung Chau Hon Takeya Kasukawa Sakari Kauppinen Juha Kere Boris Lenhard Claudio Schneider Harukazu Suzuki Ken Yagi Michiel J L de Hoon Jay W Shin Piero Carninci

Genome Res 2020 Jul 27;30(7):1060-1072. Epub 2020 Jul 27.

RIKEN Center for Integrative Medical Sciences, Yokohama, Kanagawa 230-0045, Japan.

Long noncoding RNAs (lncRNAs) constitute the majority of transcripts in the mammalian genomes, and yet, their functions remain largely unknown. As part of the FANTOM6 project, we systematically knocked down the expression of 285 lncRNAs in human dermal fibroblasts and quantified cellular growth, morphological changes, and transcriptomic responses using Capped Analysis of Gene Expression (CAGE). Antisense oligonucleotides targeting the same lncRNAs exhibited global concordance, and the molecular phenotype, measured by CAGE, recapitulated the observed cellular phenotypes while providing additional insights on the affected genes and pathways. Here, we disseminate the largest-to-date lncRNA knockdown data set with molecular phenotyping (over 1000 CAGE deep-sequencing libraries) for further exploration and highlight functional roles for and .

Source
DOI: http://dx.doi.org/10.1101/gr.254219.119
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7397864
July 2020

Insights gained from a comprehensive all-against-all transcription factor binding motif benchmarking study.

Genome Biol 2020 May 11;21(1):114. Epub 2020 May 11.

School of Life Sciences, Ecole Polytechnique Fédérale de Lausanne (EPFL), CH-1015, Lausanne, Switzerland.

Background: Positional weight matrix (PWM) is a de facto standard model to describe transcription factor (TF) DNA binding specificities. PWMs inferred from in vivo or in vitro data are stored in many databases and used in a plethora of biological applications. This calls for comprehensive benchmarking of public PWM models with large experimental reference sets.

Results: Here we report results from all-against-all benchmarking of PWM models for DNA binding sites of human TFs on a large compilation of in vitro (HT-SELEX, PBM) and in vivo (ChIP-seq) binding data. We observe that the best performing PWM for a given TF often belongs to another TF, usually from the same family. Occasionally, binding specificity is correlated with the structural class of the DNA binding domain, indicated by good cross-family performance measures. Benchmarking-based selection of family-representative motifs is more effective than motif clustering-based approaches. Overall, there is good agreement between in vitro and in vivo performance measures. However, for some in vivo experiments, the best performing PWM is assigned to an unrelated TF, indicating a binding mode involving protein-protein cooperativity.

Conclusions: In an all-against-all setting, we compute more than 18 million performance measure values for different PWM-experiment combinations and offer these results as a public resource to the research community. The benchmarking protocols are provided via a web interface and as docker images. The methods and results from this study may help others make better use of public TF specificity models, as well as public TF binding data sets.

Source
DOI: http://dx.doi.org/10.1186/s13059-020-01996-3
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7212583
May 2020
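
To make the PWM model from the benchmarking abstract above concrete, here is a minimal sketch of how a position weight matrix scores a DNA sequence: each window receives the sum of per-position weights, and the best window is reported. The matrix values, the function name, and the forward-strand-only scan are assumptions made for illustration; the benchmarking protocols of the study are not reproduced here.

```python
def best_pwm_hit(pwm, sequence):
    """Return (best_score, offset) over all forward-strand windows, or None.

    `pwm` is a list of rows, one per motif position, with weights ordered
    A, C, G, T. Hypothetical helper for illustration only.
    """
    index = {"A": 0, "C": 1, "G": 2, "T": 3}
    width = len(pwm)
    best = None
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # PWM scores are additive over motif positions.
        score = sum(pwm[pos][index[base]] for pos, base in enumerate(window))
        if best is None or score > best[0]:
            best = (score, start)
    return best


# Toy matrix that prefers the word "ACG" (weights are made up).
toy_pwm = [
    [1, -1, -1, -1],
    [-1, 1, -1, -1],
    [-1, -1, 1, -1],
]
print(best_pwm_hit(toy_pwm, "TTACGT"))
```

In an all-against-all setting, a score of this kind would be computed for every PWM on every experimental data set before any performance measure is derived.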

What Do Neighbors Tell About You: The Local Context of Cis-Regulatory Modules Complicates Prediction of Regulatory Variants.

Front Genet 2019 Oct 31;10:1078. Epub 2019 Oct 31.

Vavilov Institute of General Genetics, Russian Academy of Sciences, Moscow, Russia.

Many problems of modern genetics and functional genomics require the assessment of functional effects of sequence variants, including gene expression changes. Machine learning is considered to be a promising approach for solving this task, but its practical applications remain a challenge due to the insufficient volume and diversity of training data. A promising source of valuable data is a saturation mutagenesis massively parallel reporter assay, which quantitatively measures changes in transcription activity caused by sequence variants. Here, we explore the computational predictions of the effects of individual single-nucleotide variants on gene transcription measured in the massively parallel reporter assays, based on the data from the recent "Regulation Saturation" Critical Assessment of Genome Interpretation challenge. We show that the estimated prediction quality strongly depends on the structure of the training and validation data. Particularly, training on the sequence segments located next to the validation data results in the "information leakage" caused by the local context. This information leakage allows reproducing the prediction quality of the best CAGI challenge submissions with a fairly simple machine learning approach, and even obtaining notably better-than-random predictions using irrelevant genomic regions. Validation scenarios preventing such information leakage dramatically reduce the measured prediction quality. The performance at independent regulatory regions entirely excluded from the training set appears to be much lower than needed for practical applications, and even the performance estimation will become reliable only in the future with richer data from multiple reporters. The source code and data are available at https://bitbucket.org/autosomeru_cagi2018/cagi2018_regsat and https://genomeinterpretation.org/content/expression-variants.

Source
DOI: http://dx.doi.org/10.3389/fgene.2019.01078
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6834773
October 2019
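
The information leakage described in the abstract above arises when training and validation variants come from neighbouring segments of the same region. One way to avoid it is to hold out entire regions, as sketched below. The record layout and the region labels are hypothetical and are not taken from the CAGI data or the authors' code.

```python
def split_by_region(variants, held_out_regions):
    """Partition variant records so that no region contributes to both sets.

    Each record is a (region, position, effect) tuple; the layout is an
    assumption made for this example.
    """
    held_out = set(held_out_regions)
    train = [v for v in variants if v[0] not in held_out]
    validation = [v for v in variants if v[0] in held_out]
    return train, validation


# Made-up records: two variants in region "R1", two in region "R2".
records = [
    ("R1", 10, 0.2),
    ("R1", 11, -0.1),
    ("R2", 5, 0.4),
    ("R2", 6, 0.0),
]
train, validation = split_by_region(records, ["R2"])
print(len(train), len(validation))
```

A random per-variant split of the same records would typically place adjacent positions of one region on both sides of the split, which is the scenario the abstract warns about.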

Genome-wide map of human and mouse transcription factor binding sites aggregated from ChIP-Seq data.

BMC Res Notes 2018 Oct 23;11(1):756. Epub 2018 Oct 23.

Vavilov Institute of General Genetics, Russian Academy of Sciences, GSP-1, Gubkina 3, Moscow, Russia, 119991.

Objectives: Mammalian genomics studies, especially those focusing on transcriptional regulation, require information on genomic locations of regulatory regions, particularly, transcription factor (TF) binding sites. There are plenty of published ChIP-Seq data on in vivo binding of transcription factors in different cell types and conditions. However, handling of thousands of separate data sets is often impractical and it is desirable to have a single global map of genomic regions potentially bound by a particular TF in any of studied cell types and conditions.

Data Description: Here we report human and mouse cistromes, the maps of genomic regions that are routinely identified as TF binding sites, organized by TF. We provide cistromes for 349 mouse and 599 human TFs. Given a TF, its cistrome regions are supported by evidence from several ChIP-Seq experiments or several computational tools, and, as an optional filter, contain occurrences of sequence motifs recognized by the TF. Using the cistrome, we provide an annotation of TF binding sites in the vicinity of human and mouse transcription start sites. This information is useful for selecting potential gene targets of transcription factors and detecting co-regulated genes in differential gene expression data.

Source
DOI: http://dx.doi.org/10.1186/s13104-018-3856-x
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6199713
October 2018

HOCOMOCO: towards a complete collection of transcription factor binding models for human and mouse via large-scale ChIP-Seq analysis.

Nucleic Acids Res 2018 Jan;46(D1):D252-D259

Engelhardt Institute of Molecular Biology, Russian Academy of Sciences, 119991, GSP-1, Vavilova 32, Moscow, Russia.

We present a major update of the HOCOMOCO collection that consists of patterns describing DNA binding specificities for human and mouse transcription factors. In this release, we profited from a nearly doubled volume of published in vivo experiments on transcription factor (TF) binding to expand the repertoire of binding models, replace low-quality models previously based on in vitro data only and cover more than a hundred TFs with previously unknown binding specificities. This was achieved by systematic motif discovery from more than five thousand ChIP-Seq experiments uniformly processed within the BioUML framework with several ChIP-Seq peak calling tools and aggregated in the GTRD database. HOCOMOCO v11 contains binding models for 453 mouse and 680 human transcription factors and includes 1302 mononucleotide and 576 dinucleotide position weight matrices, which describe primary binding preferences of each transcription factor and reliable alternative binding specificities. An interactive interface and bulk downloads are available on the web: http://hocomoco.autosome.ru and http://www.cbrc.kaust.edu.sa/hocomoco11. In this release, we complement HOCOMOCO by MoLoTool (Motif Location Toolbox, http://molotool.autosome.ru) that applies HOCOMOCO models for visualization of binding sites in short DNA sequences.

Source
DOI: http://dx.doi.org/10.1093/nar/gkx1106
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5753240
January 2018

The single nucleotide variant rs12722489 determines differential estrogen receptor binding and enhancer properties of an IL2RA intronic region.

PLoS One 2017 Feb 24;12(2):e0172681. Epub 2017 Feb 24.

Engelhardt Institute of Molecular Biology, Russian Academy of Sciences, Moscow, Russia.

We studied the functional effect of the rs12722489 single nucleotide polymorphism, located in the first intron of the human IL2RA gene, on transcriptional regulation. This polymorphism is associated with multiple autoimmune conditions (rheumatoid arthritis, multiple sclerosis, Crohn's disease, and ulcerative colitis). Analysis in silico suggested a significant difference in the affinity of the estrogen receptor (ER) binding site between alternative allelic variants, with stronger predicted affinity for the risk (G) allele. Electrophoretic mobility shift assay showed that purified human ERα bound only the G variant of a 32-bp genomic sequence containing rs12722489. Chromatin immunoprecipitation demonstrated that endogenous human ERα interacted with the rs12722489 genomic region in vivo, and a DNA pull-down assay confirmed differential allelic binding of amplified 189-bp genomic fragments containing rs12722489 with endogenous human ERα. In a luciferase reporter assay, a kilobase-long genomic segment containing the G but not the A allele of rs12722489 demonstrated enhancer properties in the MT-2 cell line, an HTLV-1 transformed human cell line with a regulatory T cell phenotype.

Source
PLOS: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0172681
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5325477
August 2017

Multiple single nucleotide polymorphisms in the first intron of the IL2RA gene affect transcription factor binding and enhancer activity.

Gene 2017 Feb 19;602:50-56. Epub 2016 Nov 19.

Engelhardt Institute of Molecular Biology, Russian Academy of Sciences, Moscow, Russia; Moscow Institute of Physics and Technology, Department of Molecular and Biological Physics, Moscow, Russia; Faculty of Biology, Lomonosov Moscow State University, Moscow, Russia.

The IL2RA gene encodes the alpha subunit of a high-affinity receptor for interleukin-2 which is expressed by several distinct populations of lymphocytes involved in autoimmune processes. A large number of polymorphic alleles of the IL2RA locus are associated with the development of various autoimmune diseases. Using bioinformatics analysis, we dissected the first intron of the IL2RA gene and selected several single nucleotide polymorphisms (SNPs) that may influence the regulation of the IL2RA gene in cell types relevant to autoimmune pathology. We described five enhancers containing the selected SNPs that stimulated activity of the IL2RA promoter in a cell-type specific manner, and tested the effect of specific SNP alleles on activity of the respective enhancers (E1 to E5, labeled according to the distance to the promoter). The E4 enhancer with the minor T variant of the rs61839660 SNP demonstrated reduced activity due to disrupted binding of MEF2A/C transcription factors (TFs). Neither rs706778 nor rs706779 SNPs, both associated with a number of autoimmune diseases, had any effect on the activity of the enhancer E2. However, rare variants of several SNPs (rs139767239, rs115133228, rs12722502, rs12722635) genetically linked to rs706778 and/or rs706779 significantly influenced the activity of E1, E3 and E5 enhancers, presumably by disrupting EBF1, GABPA and ELF1 binding sites.

Source
DOI: http://dx.doi.org/10.1016/j.gene.2016.11.032
February 2017

Early B-cell factor 1 (EBF1) is critical for transcriptional control of SLAMF1 gene in human B cells.

Biochim Biophys Acta 2016 Oct 14;1859(10):1259-68. Epub 2016 Jul 14.

Engelhardt Institute of Molecular Biology, Russian Academy of Sciences, Moscow, Russia; Faculty of Biology, Lomonosov Moscow State University, Moscow, Russia.

Signaling lymphocytic activation molecule family member 1 (SLAMF1)/CD150 is a co-stimulatory receptor expressed on a variety of hematopoietic cells, in particular on mature lymphocytes activated by specific antigen, costimulation and cytokines. Changes in CD150 expression level have been reported in association with autoimmunity and with B-cell chronic lymphocytic leukemia. We characterized the core promoter for SLAMF1 gene in human B-cell lines and explored binding sites for a number of transcription factors involved in B cell differentiation and activation. Mutations of SP1, STAT6, IRF4, NF-kB, ELF1, TCF3, and SPI1/PU.1 sites resulted in significantly decreased promoter activity of varying magnitude, depending on the cell line tested. The most profound effect on the promoter strength was observed upon mutation of the binding site for Early B-cell factor 1 (EBF1). This mutation produced a 10-20 fold drop in promoter activity and pinpointed EBF1 as the master regulator of human SLAMF1 gene in B cells. We also identified three potent transcriptional enhancers in human SLAMF1 locus, each containing functional EBF1 binding sites. Thus, EBF1 interacts with specific binding sites located both in the promoter and in the enhancer regions of the SLAMF1 gene and is critical for its expression in human B cells.

Source
DOI: http://dx.doi.org/10.1016/j.bbagrm.2016.07.004
October 2016

Negative selection maintains transcription factor binding motifs in human cancer.

BMC Genomics 2016 Jun 23;17 Suppl 2:395. Epub 2016 Jun 23.

Vavilov Institute of General Genetics, Russian Academy of Sciences, 119991, GSP-1, Gubkina 3, Moscow, Russia.

Background: Somatic mutations in cancer cells affect various genomic elements disrupting important cell functions. In particular, mutations in DNA binding sites recognized by transcription factors can alter regulator binding affinities and, consequently, expression of target genes. A number of promoter mutations have been linked with an increased risk of cancer. Cancer somatic mutations in binding sites of selected transcription factors have been found under positive selection. However, action and significance of negative selection in non-coding regions remain controversial.

Results: Here we present an analysis of transcription factor binding motifs co-localized with non-coding variants. To avoid statistical bias we account for mutation signatures of different cancer types. For many transcription factors, including multiple members of FOX, HOX, and NR families, we show that human cancers accumulate fewer mutations than expected by chance that increase or decrease affinity of predicted binding sites. Such stability of binding motifs is even more pronounced in DNase accessible regions.

Conclusions: Our data demonstrate negative selection against binding sites alterations and suggest that such selection pressure protects cancer cells from rewiring of regulatory circuits. Further analysis of transcription factors with conserved binding motifs can reveal cell regulatory pathways crucial for the survivability of various human cancers.

Source
DOI: http://dx.doi.org/10.1186/s12864-016-2728-9
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4928157
June 2016
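
The abstract above refers to mutations that increase or decrease the affinity of predicted binding sites. With an additive PWM and a fixed site alignment, the predicted effect of a single-nucleotide substitution reduces to a difference within one matrix column, as in the sketch below. The matrix values and the function name are made up for illustration, and the study's statistical treatment of mutation signatures is not covered.

```python
def allele_score_change(pwm, site, offset, alt_base):
    """PWM score of the mutated site minus the score of the reference site.

    `pwm` rows hold weights ordered A, C, G, T; `site` is the reference
    sequence aligned to the motif; `offset` is the mutated position.
    Hypothetical helper for illustration only.
    """
    index = {"A": 0, "C": 1, "G": 2, "T": 3}
    ref_base = site[offset]
    # Only the mutated column differs between the two additive scores.
    return pwm[offset][index[alt_base]] - pwm[offset][index[ref_base]]


toy_pwm = [
    [1, -1, -1, -1],
    [-1, 1, -1, -1],
]
# Replacing the preferred C at position 1 with G lowers the score.
print(allele_score_change(toy_pwm, "AC", 1, "G"))
```

A negative value corresponds to a predicted loss of affinity and a positive value to a predicted gain, under the stated fixed-alignment assumption.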

HOCOMOCO: expansion and enhancement of the collection of transcription factor binding sites models.

Nucleic Acids Res 2016 Jan 19;44(D1):D116-25. Epub 2015 Nov 19.

Engelhardt Institute of Molecular Biology, Russian Academy of Sciences, 119991, GSP-1, Vavilova 32, Moscow, Russia; Vavilov Institute of General Genetics, Russian Academy of Sciences, 119991, GSP-1, Gubkina 3, Moscow, Russia; Moscow Institute of Physics and Technology, 141700, Institutskiy per. 9, Dolgoprudny, Moscow Region, Russia.

Models of transcription factor (TF) binding sites provide a basis for a wide spectrum of studies in regulatory genomics, from reconstruction of regulatory networks to functional annotation of transcripts and sequence variants. While TFs may recognize different sequence patterns in different conditions, it is pragmatic to have a single generic model for each particular TF as a baseline for practical applications. Here we present the expanded and enhanced version of HOCOMOCO (http://hocomoco.autosome.ru and http://www.cbrc.kaust.edu.sa/hocomoco10), the collection of models of DNA patterns, recognized by transcription factors. HOCOMOCO now provides position weight matrix (PWM) models for binding sites of 601 human TFs and, in addition, PWMs for 396 mouse TFs. Furthermore, we introduce the largest up to date collection of dinucleotide PWM models for 86 (52) human (mouse) TFs. The update is based on the analysis of massive ChIP-Seq and HT-SELEX datasets, with the validation of the resulting models on in vivo data. To facilitate a practical application, all HOCOMOCO models are linked to gene and protein databases (Entrez Gene, HGNC, UniProt) and accompanied by precomputed score thresholds. Finally, we provide command-line tools for PWM and diPWM threshold estimation and motif finding in nucleotide sequences.

Source
DOI: http://dx.doi.org/10.1093/nar/gkv1249
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4702883
January 2016

EpiFactors: a comprehensive database of human epigenetic factors and complexes.

Database (Oxford) 2015 Jul 7;2015:bav067. Epub 2015 Jul 7.

Department of Cancer Research and Molecular Medicine, Norwegian University of Science and Technology, NO-7489 Trondheim, Norway.

Epigenetics refers to stable and long-term alterations of cellular traits that are not caused by changes in the DNA sequence per se. Rather, covalent modifications of DNA and histones affect gene expression and genome stability via proteins that recognize and act upon such modifications. Many enzymes that catalyse epigenetic modifications or are critical for enzymatic complexes have been discovered, and this is encouraging investigators to study the role of these proteins in diverse normal and pathological processes. Rapidly growing knowledge in the area has resulted in the need for a resource that compiles, organizes and presents curated information to the researchers in an easily accessible and user-friendly form. Here we present EpiFactors, a manually curated database providing information about epigenetic regulators, their complexes, targets and products. EpiFactors contains information on 815 proteins, including 95 histones and protamines. For 789 of these genes, we include expression values across several samples, in particular a collection of 458 human primary cell samples (for approximately 200 cell types, in many cases from three individual donors), covering most mammalian cell steady states, 255 different cancer cell lines (representing approximately 150 cancer subtypes) and 134 human postmortem tissues. Expression values were obtained by the FANTOM5 consortium using the Cap Analysis of Gene Expression technique. EpiFactors also contains information on 69 protein complexes that are involved in epigenetic regulation. The resource is practical for a wide range of users, including biologists, pharmacologists and clinicians.

Source
DOI: http://dx.doi.org/10.1093/database/bav067
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4494013
March 2016

A promoter-level mammalian expression atlas.

Authors:
Alistair R R Forrest Hideya Kawaji Michael Rehli J Kenneth Baillie Michiel J L de Hoon Vanja Haberle Timo Lassmann Ivan V Kulakovskiy Marina Lizio Masayoshi Itoh Robin Andersson Christopher J Mungall Terrence F Meehan Sebastian Schmeier Nicolas Bertin Mette Jørgensen Emmanuel Dimont Erik Arner Christian Schmidl Ulf Schaefer Yulia A Medvedeva Charles Plessy Morana Vitezic Jessica Severin Colin A Semple Yuri Ishizu Robert S Young Margherita Francescatto Intikhab Alam Davide Albanese Gabriel M Altschuler Takahiro Arakawa John A C Archer Peter Arner Magda Babina Sarah Rennie Piotr J Balwierz Anthony G Beckhouse Swati Pradhan-Bhatt Judith A Blake Antje Blumenthal Beatrice Bodega Alessandro Bonetti James Briggs Frank Brombacher A Maxwell Burroughs Andrea Califano Carlo V Cannistraci Daniel Carbajo Yun Chen Marco Chierici Yari Ciani Hans C Clevers Emiliano Dalla Carrie A Davis Michael Detmar Alexander D Diehl Taeko Dohi Finn Drabløs Albert S B Edge Matthias Edinger Karl Ekwall Mitsuhiro Endoh Hideki Enomoto Michela Fagiolini Lynsey Fairbairn Hai Fang Mary C Farach-Carson Geoffrey J Faulkner Alexander V Favorov Malcolm E Fisher Martin C Frith Rie Fujita Shiro Fukuda Cesare Furlanello Masaaki Furino Jun-ichi Furusawa Teunis B Geijtenbeek Andrew P Gibson Thomas Gingeras Daniel Goldowitz Julian Gough Sven Guhl Reto Guler Stefano Gustincich Thomas J Ha Masahide Hamaguchi Mitsuko Hara Matthias Harbers Jayson Harshbarger Akira Hasegawa Yuki Hasegawa Takehiro Hashimoto Meenhard Herlyn Kelly J Hitchens Shannan J Ho Sui Oliver M Hofmann Ilka Hoof Furni Hori Lukasz Huminiecki Kei Iida Tomokatsu Ikawa Boris R Jankovic Hui Jia Anagha Joshi Giuseppe Jurman Bogumil Kaczkowski Chieko Kai Kaoru Kaida Ai Kaiho Kazuhiro Kajiyama Mutsumi Kanamori-Katayama Artem S Kasianov Takeya Kasukawa Shintaro Katayama Sachi Kato Shuji Kawaguchi Hiroshi Kawamoto Yuki I Kawamura Tsugumi Kawashima Judith S Kempfle Tony J Kenna Juha Kere Levon M Khachigian Toshio Kitamura S Peter Klinken Alan J Knox Miki 
Kojima Soichi Kojima Naoto Kondo Haruhiko Koseki Shigeo Koyasu Sarah Krampitz Atsutaka Kubosaki Andrew T Kwon Jeroen F J Laros Weonju Lee Andreas Lennartsson Kang Li Berit Lilje Leonard Lipovich Alan Mackay-Sim Ri-ichiroh Manabe Jessica C Mar Benoit Marchand Anthony Mathelier Niklas Mejhert Alison Meynert Yosuke Mizuno David A de Lima Morais Hiromasa Morikawa Mitsuru Morimoto Kazuyo Moro Efthymios Motakis Hozumi Motohashi Christine L Mummery Mitsuyoshi Murata Sayaka Nagao-Sato Yutaka Nakachi Fumio Nakahara Toshiyuki Nakamura Yukio Nakamura Kenichi Nakazato Erik van Nimwegen Noriko Ninomiya Hiromi Nishiyori Shohei Noma Shohei Noma Tadasuke Noazaki Soichi Ogishima Naganari Ohkura Hiroko Ohimiya Hiroshi Ohno Mitsuhiro Ohshima Mariko Okada-Hatakeyama Yasushi Okazaki Valerio Orlando Dmitry A Ovchinnikov Arnab Pain Robert Passier Margaret Patrikakis Helena Persson Silvano Piazza James G D Prendergast Owen J L Rackham Jordan A Ramilowski Mamoon Rashid Timothy Ravasi Patrizia Rizzu Marco Roncador Sugata Roy Morten B Rye Eri Saijyo Antti Sajantila Akiko Saka Shimon Sakaguchi Mizuho Sakai Hiroki Sato Suzana Savvi Alka Saxena Claudio Schneider Erik A Schultes Gundula G Schulze-Tanzil Anita Schwegmann Thierry Sengstag Guojun Sheng Hisashi Shimoji Yishai Shimoni Jay W Shin Christophe Simon Daisuke Sugiyama Takaai Sugiyama Masanori Suzuki Naoko Suzuki Rolf K Swoboda Peter A C 't Hoen Michihira Tagami Naoko Takahashi Jun Takai Hiroshi Tanaka Hideki Tatsukawa Zuotian Tatum Mark Thompson Hiroo Toyodo Tetsuro Toyoda Elvind Valen Marc van de Wetering Linda M van den Berg Roberto Verado Dipti Vijayan Ilya E Vorontsov Wyeth W Wasserman Shoko Watanabe Christine A Wells Louise N Winteringham Ernst Wolvetang Emily J Wood Yoko Yamaguchi Masayuki Yamamoto Misako Yoneda Yohei Yonekura Shigehiro Yoshida Susan E Zabierowski Peter G Zhang Xiaobei Zhao Silvia Zucchelli Kim M Summers Harukazu Suzuki Carsten O Daub Jun Kawai Peter Heutink Winston Hide Tom C Freeman Boris Lenhard Vladimir B Bajic 
Martin S Taylor Vsevolod J Makeev Albin Sandelin David A Hume Piero Carninci Yoshihide Hayashizaki

Nature 2014 Mar;507(7493):462-70

Regulated transcription controls the diversity, developmental pathways and spatial organization of the hundreds of cell types that make up a mammal. Using single-molecule cDNA sequencing, we mapped transcription start sites (TSSs) and their usage in human and mouse primary cells, cell lines and tissues to produce a comprehensive overview of mammalian gene expression across the human body. We find that few genes are truly 'housekeeping', whereas many mammalian promoters are composite entities composed of several closely separated TSSs, with independent cell-type-specific expression profiles. TSSs specific to different cell types evolve at different rates, whereas promoters of broadly expressed genes are the most conserved. Promoter-based expression analysis reveals key transcription factors defining cell states and links them to binding-site motifs. The functions of identified novel transcripts can be predicted by coexpression and sample ontology enrichment analyses. The functional annotation of the mammalian genome 5 (FANTOM5) project provides comprehensive expression profiles and functional annotation of mammalian cell-type-specific transcriptomes with wide applications in biomedical research.

Source
DOI: http://dx.doi.org/10.1038/nature13182
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4529748
March 2014

Jaccard index based similarity measure to compare transcription factor binding site models.

Algorithms Mol Biol 2013 Sep 30;8(1):23. Epub 2013 Sep 30.

Laboratory of Bioinformatics and Systems Biology, Engelhardt Institute of Molecular Biology, Russian Academy of Sciences, Vavilov str. 32, Moscow 119991, GSP-1, Russia.

Background: The positional weight matrix (PWM) remains the most popular model for quantifying transcription factor (TF) binding. A PWM supplied with a score threshold defines a set of putative transcription factor binding sites (TFBS), thus providing a TFBS model. TF-binding DNA fragments obtained by different experimental methods usually give similar but not identical PWMs. This is also common for different TFs from the same structural family. Thus it is often necessary to measure the similarity between PWMs. Popular tools compare PWMs directly using matrix elements. Yet, for log-odds PWMs, negative elements do not contribute to the scores of high-scoring TFBS and thus may differ without affecting the sets of the best-recognized binding sites. Moreover, the two TFBS sets recognized by a given pair of PWMs can be more or less different depending on the score thresholds.

Results: We propose a practical approach for comparing two TFBS models, each consisting of a PWM and the respective scoring threshold. The proposed measure is a variant of the Jaccard index between two TFBS sets. The measure defines a metric space for TFBS models of all finite lengths. The algorithm can compare TFBS models constructed using substantially different approaches, such as PWMs with raw positional counts and log-odds PWMs. We present an efficient software implementation: MACRO-APE (MAtrix CompaRisOn by Approximate P-value Estimation).

Conclusions: MACRO-APE can be effectively used to compute the Jaccard index based similarity for two TFBS models. A two-pass scanning algorithm is presented to scan a given collection of PWMs for PWMs similar to a given query.

Availability And Implementation: MACRO-APE is implemented in Ruby 1.9; the software, including source code and a manual, is freely available at http://autosome.ru/macroape/ and in supplementary materials.
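
To illustrate the idea of comparing TFBS models by the sets of sites they recognize, here is a minimal Python sketch. It is not MACRO-APE's algorithm: it enumerates all words by brute force instead of estimating P-values, and it assumes equal-width PWMs, a uniform background, and no shifts or reverse complements. All function names and the toy matrices are hypothetical.

```python
from itertools import product

ALPHABET = "ACGT"


def score(pwm, word):
    """Sum the per-position weights of `word` under `pwm` (a list of dicts)."""
    return sum(column[base] for column, base in zip(pwm, word))


def recognized_words(pwm, threshold):
    """Return every word of the PWM's width scoring at or above the threshold."""
    width = len(pwm)
    return {
        "".join(word)
        for word in product(ALPHABET, repeat=width)
        if score(pwm, word) >= threshold
    }


def jaccard_similarity(model_a, model_b):
    """Jaccard index of the word sets of two (pwm, threshold) models.

    Brute-force illustration only: the cost grows as 4**width.
    """
    set_a = recognized_words(*model_a)
    set_b = recognized_words(*model_b)
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)


# Toy width-2 matrices (hypothetical values, not taken from any database).
pwm_a = [
    {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0},
    {"A": 0.0, "C": 1.0, "G": 0.0, "T": 0.0},
]
pwm_b = [
    {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0},
    {"A": 0.0, "C": 0.0, "G": 1.0, "T": 0.0},
]

# The same pair of matrices is more or less similar depending on the thresholds.
print(jaccard_similarity((pwm_a, 1.0), (pwm_b, 1.0)))  # 0.4
print(jaccard_similarity((pwm_a, 2.0), (pwm_b, 2.0)))  # 0.0
```

With the permissive threshold the two models share four of ten recognized words; with the strict threshold each model recognizes a single, different word, so the similarity drops to zero.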

Source
DOI: http://dx.doi.org/10.1186/1748-7188-8-23
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3851813
September 2013

From binding motifs in ChIP-Seq data to improved models of transcription factor binding sites.

J Bioinform Comput Biol 2013 Feb 16;11(1):1340004. Epub 2013 Jan 16.

Laboratory of Bioinformatics and Systems Biology, Engelhardt Institute of Molecular Biology, Russian Academy of Sciences, Vavilov Street 32, Moscow 119991, GSP-1, Russia.

Chromatin immunoprecipitation followed by deep sequencing (ChIP-Seq) has become the method of choice for locating DNA segments bound by different regulatory proteins. ChIP-Seq produces extremely valuable information for studying transcriptional regulation. The wet-lab workflow is often supported by downstream computational analysis, including construction of models of nucleotide sequences of transcription factor binding sites in DNA, which can be used to detect binding sites in ChIP-Seq data at single-base-pair resolution. The most popular TFBS model is a positional weight matrix (PWM) with statistically independent positional weights of nucleotides in different columns; such PWMs are constructed from a gapless multiple local alignment of sequences containing experimentally identified TFBSs. Modern high-throughput techniques, including ChIP-Seq, provide enough data for careful training of advanced models containing more parameters than a PWM. Yet, many suggested multiparametric models provide only incremental improvement of TFBS recognition quality compared to traditional PWMs trained on ChIP-Seq data. We present a novel computational tool, diChIPMunk, that constructs TFBS models as optimal dinucleotide PWMs, thus accounting for correlations between neighboring nucleotides in input sequences. diChIPMunk utilizes many advantages of ChIPMunk, its ancestor algorithm, accounting for ChIP-Seq base coverage profiles ("peak shape") and using the effective subsampling-based core procedure, which allows processing of large datasets. We demonstrate that diPWMs constructed by diChIPMunk outperform traditional PWMs constructed by ChIPMunk from the same ChIP-Seq data. Software website: http://autosome.ru/dichipmunk/
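
As a rough illustration of how a dinucleotide PWM differs from a conventional one, the Python sketch below scores a site by summing weights over overlapping dinucleotides, so each column can reward or penalize a specific pair of neighboring bases. It shows scoring only, not how diChIPMunk trains a matrix; the data layout, the function names, and the toy weights are assumptions made for this example.

```python
from itertools import product

DINUCLEOTIDES = ["".join(pair) for pair in product("ACGT", repeat=2)]


def score_dipwm(dipwm, site):
    """Score a site of width len(dipwm) + 1 by summing overlapping dinucleotide weights."""
    if len(site) != len(dipwm) + 1:
        raise ValueError("site width must be one more than the number of diPWM columns")
    return sum(column[site[i:i + 2]] for i, column in enumerate(dipwm))


def best_hit(dipwm, sequence):
    """Return (position, score) of the best-scoring window on the given strand."""
    width = len(dipwm) + 1
    if len(sequence) < width:
        raise ValueError("sequence is shorter than the motif")
    hits = [
        (start, score_dipwm(dipwm, sequence[start:start + width]))
        for start in range(len(sequence) - width + 1)
    ]
    # max() keeps the first window among ties because it compares scores only.
    return max(hits, key=lambda hit: hit[1])


def make_column(favored, bonus=2.0, penalty=-1.0):
    """Build a toy column that rewards one dinucleotide and penalizes the rest."""
    return {d: (bonus if d == favored else penalty) for d in DINUCLEOTIDES}


# Hypothetical two-column diPWM describing the three-base site "CAT".
toy_dipwm = [make_column("CA"), make_column("AT")]

print(score_dipwm(toy_dipwm, "CAT"))    # 4.0
print(best_hit(toy_dipwm, "GGCATGG"))  # (2, 4.0)
```

A mononucleotide PWM has 4 weights per position; the layout above has 16 per pair of adjacent positions, which is what lets it capture correlations between neighbors.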

Source
DOI: http://dx.doi.org/10.1142/S0219720013400040
February 2013

In silico motif analysis suggests an interplay of transcriptional and translational control in mTOR response.

Translation (Austin) 2013 11;1(2):e27469. Epub 2013 Dec 11.

Department of Computational Systems Biology; Vavilov Institute of General Genetics; Russian Academy of Sciences; Moscow, Russia; Engelhardt Institute of Molecular Biology; Russian Academy of Sciences; Moscow, Russia.

The short 5'-terminal oligopyrimidine tract (TOP) of 5' UTRs is a well-known regulatory sequence motif of mRNAs that are subject to growth-dependent translation. Specifically, translation of TOP mRNAs is regulated by the mTOR signaling pathway, which is involved in cell proliferation, cancer development and aging. High-throughput data permit detailed study of specific features of the mRNA TOP motif and its DNA origins at transcription start sites (TSS). Recently, ribosome profiling was used to identify mRNA targets of the mTOR pathway in PC3 cells. A novel pyrimidine-rich translational element (PRTE) was reported to play a key role without positional preferences within the 5' UTRs, unlike the 5' TOP, which is strictly located at the 5' end. In this study, we couple recently reported ribosome profiling data on the mTOR mRNA targets with the annotation of TSS obtained by HeliScopeCAGE. We confirm the canonical TOP and strong positional preferences of the respective oligopyrimidine tracts (OP) straddling the experimentally validated TSS regions at the DNA level. Such OP localization ensures that transcription from OP segments creates the 5'-terminal TOP in the corresponding mRNAs. We demonstrate that OP are not overrepresented in downstream regions of 5' UTRs of mTOR targets. Finally, we highlight several mTOR target genes with broad and multimodal TSS spanning dozens of nucleotides that are only partially covered with an OP. Therefore, in such cases only a fraction of all produced mRNAs carry a TOP regulatory motif and thus respond to mTOR via the TOP mechanism. We hypothesize that the interplay between transcription and translation may play a crucial role in the regulation of the mTOR response.
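
The point about broad TSS regions that are only partially covered by an OP can be made concrete with a small Python sketch. It takes a DNA region and per-position TSS read counts and estimates which share of the transcripts would begin with a pyrimidine tract. This is an illustration, not the analysis used in the study: the minimum tract length of 5, the input layout, the function names, and the toy sequence and counts are all assumptions.

```python
PYRIMIDINES = frozenset("CTU")


def leading_pyrimidine_run(sequence):
    """Length of the uninterrupted pyrimidine run at the 5' end of `sequence`."""
    run = 0
    for base in sequence.upper():
        if base not in PYRIMIDINES:
            break
        run += 1
    return run


def top_fraction(region, tss_counts, min_length=5):
    """Share of transcripts whose 5' end starts a pyrimidine tract.

    `region` is the sense-strand DNA around a TSS cluster and `tss_counts`
    maps 0-based start positions in `region` to read counts (for example,
    CAGE tag counts). `min_length` is an assumed cutoff for this sketch.
    """
    total = sum(tss_counts.values())
    if total == 0:
        return 0.0
    with_top = sum(
        count
        for position, count in tss_counts.items()
        if leading_pyrimidine_run(region[position:]) >= min_length
    )
    return with_top / total


# Toy example: the tract CTCTTTCC occupies positions 3-10 of the region.
region = "GAGCTCTTTCCGAGGA"
tss_counts = {1: 10, 3: 60, 8: 30}  # hypothetical read counts per start site

print(top_fraction(region, tss_counts))  # 0.6
```

Here transcripts starting at position 3 carry a full tract, those starting at position 1 begin with a purine, and those starting at position 8 retain only three pyrimidines, so only 60% of the transcripts would count as TOP mRNAs under the assumed cutoff.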

Source
DOI: http://dx.doi.org/10.4161/trla.27469
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4718056
March 2016

HOCOMOCO: a comprehensive collection of human transcription factor binding sites models.

Nucleic Acids Res 2013 Jan 21;41(Database issue):D195-202. Epub 2012 Nov 21.

Laboratory of Bioinformatics and Systems Biology, Engelhardt Institute of Molecular Biology, Russian Academy of Sciences, Vavilov Street 32, Moscow 119991, GSP-1, Russia.

Transcription factor (TF) binding site (TFBS) models are crucial for computational reconstruction of transcription regulatory networks. In existing repositories, a TF often has several models (also called binding profiles or motifs), obtained from different experimental data. Having a single TFBS model for a TF is more pragmatic for practical applications. We show that integration of TFBS data from various types of experiments into a single model typically results in improved model quality, probably due to partial correction of source-specific technique bias. We present the Homo sapiens comprehensive model collection (HOCOMOCO, http://autosome.ru/HOCOMOCO/, http://cbrc.kaust.edu.sa/hocomoco/) containing carefully hand-curated TFBS models constructed by integration of binding sequences obtained by both low- and high-throughput methods. To construct position weight matrices to represent these TFBS models, we used ChIPMunk software in four computational modes, including a newly developed periodic positional prior mode associated with DNA helix pitch. We selected only one TFBS model per TF, unless there was clear experimental evidence for two rather distinct TFBS models. We assigned a quality rating to each model. HOCOMOCO contains 426 systematically curated TFBS models for 401 human TFs, where 172 models are based on more than one data source.

Source
DOI: http://dx.doi.org/10.1093/nar/gks1089
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3531053
January 2013