Publications by authors named "Philip Cayting"

14 Publications

  • Page 1 of 1

Principles of regulatory information conservation between mouse and human.

Nature 2014 Nov;515(7527):371-375

Department of Genetics, Stanford University, Stanford, CA 94305, USA.

To broaden our understanding of the evolution of gene regulation mechanisms, we generated occupancy profiles for 34 orthologous transcription factors (TFs) in human-mouse erythroid progenitor, lymphoblast and embryonic stem-cell lines. By combining the genome-wide transcription factor occupancy repertoires, associated epigenetic signals, and co-association patterns, here we deduce several evolutionary principles of gene regulatory features operating since the mouse and human lineages diverged. The genomic distribution profiles, primary binding motifs, chromatin states, and DNA methylation preferences are well conserved for TF-occupied sequences. However, the extent to which orthologous DNA segments are bound by orthologous TFs varies both among TFs and with genomic location: binding at promoters is more highly conserved than binding at distal elements. Notably, occupancy-conserved TF-occupied sequences tend to be pleiotropic; they function in several tissues and also co-associate with many TFs. Single nucleotide variants at sites with potential regulatory functions are enriched in occupancy-conserved TF-occupied sequences.
View Article and Find Full Text PDF

Download full-text PDF

Source
http://dx.doi.org/10.1038/nature13985DOI Listing
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4343047PMC
November 2014

A comparative encyclopedia of DNA elements in the mouse genome.

Nature 2014 Nov;515(7527):355-64

Bioinformatics and Genomics, Centre for Genomic Regulation (CRG) and UPF, Doctor Aiguader, 88, 08003 Barcelona, Catalonia, Spain.

The laboratory mouse shares the majority of its protein-coding genes with humans, making it the premier model organism in biomedical research, yet the two mammals differ in significant ways. To gain greater insights into both shared and species-specific transcriptional and cellular regulatory programs in the mouse, the Mouse ENCODE Consortium has mapped transcription, DNase I hypersensitivity, transcription factor binding, chromatin modifications and replication domains throughout the mouse genome in diverse cell and tissue types. By comparing with the human genome, we not only confirm substantial conservation in the newly annotated potential functional sequences, but also find a large degree of divergence of sequences involved in transcriptional regulation, chromatin state and higher order chromatin organization. Our results illuminate the wide range of evolutionary forces acting on genes and their regulatory regions, and provide a general resource for research into mammalian biology and mechanisms of human diseases.
View Article and Find Full Text PDF

Download full-text PDF

Source
http://dx.doi.org/10.1038/nature13992DOI Listing
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4266106PMC
November 2014

Comparative analysis of regulatory information and circuits across distant species.

Nature 2014 Aug;512(7515):453-6

National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland, 20892, USA.

Despite the large evolutionary distances between metazoan species, they can show remarkable commonalities in their biology, and this has helped to establish fly and worm as model organisms for human biology. Although studies of individual elements and factors have explored similarities in gene regulation, a large-scale comparative analysis of basic principles of transcriptional regulatory features is lacking. Here we map the genome-wide binding locations of 165 human, 93 worm and 52 fly transcription regulatory factors, generating a total of 1,019 data sets from diverse cell types, developmental stages, or conditions in the three species, of which 498 (48.9%) are presented here for the first time. We find that structural properties of regulatory networks are remarkably conserved and that orthologous regulatory factor families recognize similar binding motifs in vivo and show some similar co-associations. Our results suggest that gene-regulatory properties previously observed for individual factors are general principles of metazoan regulation that are remarkably well-preserved despite extensive functional divergence of individual network connections. The comparative maps of regulatory circuitry provided here will drive an improved understanding of the regulatory underpinnings of model organism biology and how these relate to human biology, development and disease.
View Article and Find Full Text PDF

Download full-text PDF

Source
http://dx.doi.org/10.1038/nature13668DOI Listing
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4336544PMC
August 2014

ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia.

Genome Res 2012 Sep;22(9):1813-31

Department of Genetics, Stanford University, Stanford, California 94305, USA.

Chromatin immunoprecipitation (ChIP) followed by high-throughput DNA sequencing (ChIP-seq) has become a valuable and widely used approach for mapping the genomic location of transcription-factor binding and histone modifications in living cells. Despite its widespread use, there are considerable differences in how these experiments are conducted, how the results are scored and evaluated for quality, and how the data and metadata are archived for public use. These practices affect the quality and utility of any global ChIP experiment. Through our experience in performing ChIP-seq experiments, the ENCODE and modENCODE consortia have developed a set of working standards and guidelines for ChIP experiments that are updated routinely. The current guidelines address antibody validation, experimental replication, sequencing depth, data and metadata reporting, and data quality assessment. We discuss how ChIP quality, assessed in these ways, affects different uses of ChIP-seq data. All data sets used in the analysis have been deposited for public viewing and downloading at the ENCODE (http://encodeproject.org/ENCODE/) and modENCODE (http://www.modencode.org/) portals.
View Article and Find Full Text PDF

Download full-text PDF

Source
http://dx.doi.org/10.1101/gr.136184.111DOI Listing
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3431496PMC
September 2012

Architecture of the human regulatory network derived from ENCODE data.

Nature 2012 Sep;489(7414):91-100

Department of Genetics, Stanford University, 300 Pasteur Dr., M-344 Stanford, CA 94305, USA.

Transcription factors bind in a combinatorial fashion to specify the on-and-off states of genes; the ensemble of these binding events forms a regulatory network, constituting the wiring diagram for a cell. To examine the principles of the human transcriptional regulatory network, we determined the genomic binding information of 119 transcription-related factors in over 450 distinct experiments. We found the combinatorial, co-association of transcription factors to be highly context specific: distinct combinations of factors bind at specific genomic locations. In particular, there are significant differences in the binding proximal and distal to genes. We organized all the transcription factor binding into a hierarchy and integrated it with other genomic information (for example, microRNA regulation), forming a dense meta-network. Factors at different levels have different properties; for instance, top-level transcription factors more strongly influence expression and middle-level ones co-regulate targets to mitigate information-flow bottlenecks. Moreover, these co-regulations give rise to many enriched network motifs (for example, noise-buffering feed-forward loops). Finally, more connected network components are under stronger selection and exhibit a greater degree of allele-specific activity (that is, differential binding to the two parental alleles). The regulatory information obtained in this study will be crucial for interpreting personal genome sequences and understanding basic principles of human biology and disease.
View Article and Find Full Text PDF

Download full-text PDF

Source
http://dx.doi.org/10.1038/nature11245DOI Listing
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4154057PMC
September 2012

An encyclopedia of mouse DNA elements (Mouse ENCODE).

Genome Biol 2012 Aug 13;13(8):418. Epub 2012 Aug 13.

To complement the human Encyclopedia of DNA Elements (ENCODE) project and to enable a broad range of mouse genomics efforts, the Mouse ENCODE Consortium is applying the same experimental pipelines developed for human ENCODE to annotate the mouse genome.
View Article and Find Full Text PDF

Download full-text PDF

Source
http://dx.doi.org/10.1186/gb-2012-13-8-418DOI Listing
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3491367PMC
August 2012

Segmental duplications in the human genome reveal details of pseudogene formation.

Nucleic Acids Res 2010 Nov 8;38(20):6997-7007. Epub 2010 Jul 8.

Program in Computational Biology and Bioinformatics, Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT, USA.

Duplicated pseudogenes in the human genome are disabled copies of functioning parent genes. They result from block duplication events occurring throughout evolutionary history. Relatively recent duplications (with sequence similarity≥90% and length≥1 kb) are termed segmental duplications (SDs); here, we analyze the interrelationship of SDs and pseudogenes. We present a decision-tree approach to classify pseudogenes based on their (and their parents') characteristics in relation to SDs. The classification identifies 140 novel pseudogenes and makes possible improved annotation for the 3172 pseudogenes located in SDs. In particular, it reveals that many pseudogenes in SDs likely did not arise directly from parent genes, but are the result of a multi-step process. In these cases, the initial duplication or retrotransposition of a parent gene gives rise to a 'parent pseudogene', followed by further duplication creating duplicated-duplicated or duplicated-processed pseudogenes, respectively. Moreover, we can precisely identify these parent pseudogenes by overlap with ancestral SD loci. Finally, a comparison of nucleotide substitutions per site in a pseudogene with its surrounding SD region allows us to estimate the time difference between duplication and disablement events, and this suggests that most duplicated pseudogenes in SDs were likely disabled around the time of the original duplication.
View Article and Find Full Text PDF

Download full-text PDF

Source
http://dx.doi.org/10.1093/nar/gkq587DOI Listing
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2978362PMC
November 2010

Nucleotide-resolution analysis of structural variants using BreakSeq and a breakpoint library.

Nat Biotechnol 2010 Jan 27;28(1):47-55. Epub 2009 Dec 27.

Program in Computational Biology and Bioinformatics, Yale University, New Haven, Connecticut, USA.

Structural variants (SVs) are a major source of human genomic variation; however, characterizing them at nucleotide resolution remains challenging. Here we assemble a library of breakpoints at nucleotide resolution from collating and standardizing ~2,000 published SVs. For each breakpoint, we infer its ancestral state (through comparison to primate genomes) and its mechanism of formation (e.g., nonallelic homologous recombination, NAHR). We characterize breakpoint sequences with respect to genomic landmarks, chromosomal location, sequence motifs and physical properties, finding that the occurrence of insertions and deletions is more balanced than previously reported and that NAHR-formed breakpoints are associated with relatively rigid, stable DNA helices. Finally, we demonstrate an approach, BreakSeq, for scanning the reads from short-read sequenced genomes against our breakpoint library to accurately identify previously overlooked SVs, which we then validate by PCR. As new data become available, we expect our BreakSeq approach will become more sensitive and facilitate rapid SV genotyping of personal genomes.
View Article and Find Full Text PDF

Download full-text PDF

Source
http://dx.doi.org/10.1038/nbt.1600DOI Listing
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2951730PMC
January 2010

PEMer: a computational framework with simulation-based error models for inferring genomic structural variants from massive paired-end sequencing data.

Genome Biol 2009 Feb 23;10(2):R23. Epub 2009 Feb 23.

Gene Expression Unit, European Molecular Biology Laboratory (EMBL), Meyerhofstr, Heidelberg, 69117, Germany.

Personal-genomics endeavors, such as the 1000 Genomes project, are generating maps of genomic structural variants by analyzing ends of massively sequenced genome fragments. To process these we developed Paired-End Mapper (PEMer; http://sv.gersteinlab.org/pemer). This comprises an analysis pipeline, compatible with several next-generation sequencing platforms; simulation-based error models, yielding confidence-values for each structural variant; and a back-end database. The simulations demonstrated high structural variant reconstruction efficiency for PEMer's coverage-adjusted multi-cutoff scoring-strategy and showed its relative insensitivity to base-calling errors.
View Article and Find Full Text PDF

Download full-text PDF

Source
http://dx.doi.org/10.1186/gb-2009-10-2-r23DOI Listing
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2688268PMC
February 2009

Comparative analysis of processed ribosomal protein pseudogenes in four mammalian genomes.

Genome Biol 2009 5;10(1):R2. Epub 2009 Jan 5.

Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT 06520, USA.

Background: The availability of genome sequences of numerous organisms allows comparative study of pseudogenes in syntenic regions. Conservation of pseudogenes suggests that they might have a functional role in some instances.

Results: We report the first large-scale comparative analysis of ribosomal protein pseudogenes in four mammalian genomes (human, chimpanzee, mouse and rat). To this end, we have assigned these pseudogenes in the four organisms using an automated pipeline and make the results available online. Each organism has a large number of ribosomal protein pseudogenes (approximately 1,400 to 2,800). The majority of them are processed (generated by retrotransposition). However, we do not see a correlation between the number of pseudogenes associated with a ribosomal protein gene and its mRNA abundance. Analysis of pseudogenes in syntenic regions between species shows that most are conserved between human and chimpanzee, but very few are conserved between primates and rodents. Interestingly, syntenic pseudogenes have a lower rate of nucleotide substitution than their surrounding intergenic DNA. Moreover, evidence from expressed sequence tags indicates that two pseudogenes conserved between human and mouse are transcribed. Detailed analysis shows that one of them, the pseudogene of RPS27, is likely to be a protein-coding gene. This is significant as previous reports indicated there are exactly 80 ribosomal protein genes encoded by the human genome.

Conclusions: Our analysis indicates that processed ribosomal protein pseudogenes abound in mammalian genomes, but few of these are conserved between primates and rodents. This highlights the large amount of recent retrotranspositional activity in mammals and a relatively larger amount of it in the rodent lineage.
View Article and Find Full Text PDF

Download full-text PDF

Source
http://dx.doi.org/10.1186/gb-2009-10-1-r2DOI Listing
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2687790PMC
September 2009

Pseudofam: the pseudogene families database.

Nucleic Acids Res 2009 Jan 28;37(Database issue):D738-43. Epub 2008 Oct 28.

Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT 06520, USA.

Pseudofam (http://pseudofam.pseudogene.org) is a database of pseudogene families based on the protein families from the Pfam database. It provides resources for analyzing the family structure of pseudogenes including query tools, statistical summaries and sequence alignments. The current version of Pseudofam contains more than 125,000 pseudogenes identified from 10 eukaryotic genomes and aligned within nearly 3000 families (approximately one-third of the total families in PfamA). Pseudofam uses a large-scale parallelized homology search algorithm (implemented as an extension of the PseudoPipe pipeline) to identify pseudogenes. Each identified pseudogene is assigned to its parent protein family and subsequently aligned to each other by transferring the parent domain alignments from the Pfam family. Pseudogenes are also given additional annotation based on an ontology, reflecting their mode of creation and subsequent history. In particular, our annotation highlights the association of pseudogene families with genomic features, such as segmental duplications. In addition, pseudogene families are associated with key statistics, which identify outlier families with an unusual degree of pseudogenization. The statistics also show how the number of genes and pseudogenes in families correlates across different species. Overall, they highlight the fact that housekeeping families tend to be enriched with a large number of pseudogenes.
View Article and Find Full Text PDF

Download full-text PDF

Source
http://dx.doi.org/10.1093/nar/gkn758DOI Listing
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2686518PMC
January 2009

Uncovering trends in gene naming.

Genome Biol 2008 Jan 31;9(1):401. Epub 2008 Jan 31.

Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT 06520, USA.

We take stock of current genetic nomenclature and attempt to organize strange and notable gene names. We categorize, for instance, those that involve a naming system transferred from another context (for example, Pavlov's dogs). We hope this analysis provides clues to better steer gene naming in the future.
View Article and Find Full Text PDF

Download full-text PDF

Source
http://dx.doi.org/10.1186/gb-2008-9-1-401DOI Listing
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2395243PMC
January 2008

Analysis of nuclear receptor pseudogenes in vertebrates: how the silent tell their stories.

Mol Biol Evol 2008 Jan 7;25(1):131-43. Epub 2007 Dec 7.

Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut, USA.

Transcription factor pseudogenes have not been systematically studied before. Nuclear receptors (NRs) constitute one of the largest groups of transcription factors in animals (e.g., 48 NRs in human). The availability of whole-genome sequences enables a global inventory of the NR pseudogenes in a number of vertebrate model organisms. Here we identify the NR pseudogenes in 8 vertebrate organisms and make our results available online at http://www.pseudogene.org/nr. The assignments reveal that NR pseudogenes as a group have characteristics related to generation and distribution contrary to expectations derived from previous large-scale pseudogene studies. In particular, 1) despite its large size, the NR gene family has only a very small number of pseudogenes in each of the vertebrate genomes examined; 2) despite the low transcription levels of NR genes, except for one, all other NR pseudogenes identified in this study are retropseudogenes; and 3) no duplicated NR pseudogenes are found, contrary to the fact that the NR gene family was expanded through several waves of gene duplication events. Our analyses further reveal a number of interesting aspects of NR pseudogenes. Specifically, through careful sequence analysis, we identify remnant introns in 2 mouse retropseudogenes, psiRev-erbbeta and psiLRH1. Generated from partially processed pre-mRNAs, they appear to be rare examples of highly unusual "semiprocessed" pseudogenes. Second, by comparing the genomic sequences, we uncover a pseudogene that is unique to the human lineage relative to chimpanzee. Generated by a recent duplication of a segment in the human genome, this pseudogene is a "duplicated-processed" pseudogene, belonging to a new pseudogene species. Finally, FXRbeta was nonfunctionalized in the human lineage and thus appears to be an example of a rare unitary pseudogene. By comparing orthologous sequences, we dated the FXR-FXRbeta duplication and the nonfunctionalization of FXRbeta in primates.
View Article and Find Full Text PDF

Download full-text PDF

Source
http://dx.doi.org/10.1093/molbev/msm251DOI Listing
January 2008

Pseudogene.org: a comprehensive database and comparison platform for pseudogene annotation.

Nucleic Acids Res 2007 Jan 11;35(Database issue):D55-60. Epub 2006 Nov 11.

Center for Comparative Genomics and Bioinformatics, 506B Wartik, Pennsylvania State University, University Park, PA 16802, USA.

The Pseudogene.org knowledgebase serves as a comprehensive repository for pseudogene annotation. The definition of a pseudogene varies within the literature, resulting in significantly different approaches to the problem of identification. Consequently, it is difficult to maintain a consistent collection of pseudogenes in detail necessary for their effective use. Our database is designed to address this issue. It integrates a variety of heterogeneous resources and supports a subset structure that highlights specific groups of pseudogenes that are of interest to the research community. Tools are provided for the comparison of sets and the creation of layered set unions, enabling researchers to derive a current 'consensus' set of pseudogenes. Additional features include versatile search, the capacity for robust interaction with other databases, the ability to reconstruct older versions of the database (accounting for changing genome builds) and an underlying object-oriented interface designed for researchers with a minimal knowledge of programming. At the present time, the database contains more than 100,000 pseudogenes spanning 64 prokaryote and 11 eukaryote genomes, including a collection of human annotations compiled from 16 sources.
View Article and Find Full Text PDF

Download full-text PDF

Source
http://dx.doi.org/10.1093/nar/gkl851DOI Listing
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1669708PMC
January 2007
-->