Publications by authors named "Baikang Pei"

13 Publications

GENCODE 2021.

Nucleic Acids Res 2021 Jan;49(D1):D916-D923

European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK.

The GENCODE project annotates human and mouse genes and transcripts supported by experimental data with high accuracy, providing a foundational resource that supports genome biology and clinical genomics. GENCODE annotation processes make use of primary data and bioinformatic tools and analysis generated both within the consortium and externally to support the creation of transcript structures and the determination of their function. Here, we present improvements to our annotation infrastructure, bioinformatics tools, and analysis, and the advances they support in the annotation of the human and mouse genomes including: the completion of first pass manual annotation for the mouse reference genome; targeted improvements to the annotation of genes associated with SARS-CoV-2 infection; collaborative projects to achieve convergence across reference annotation databases for the annotation of human and mouse protein-coding genes; and the first GENCODE manually supervised automated annotation of lncRNAs. Our annotation is accessible via Ensembl, the UCSC Genome Browser and https://www.gencodegenes.org.

Source
DOI: http://dx.doi.org/10.1093/nar/gkaa1087
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7778937
January 2021

IConMHC: a deep learning convolutional neural network model to predict peptide and MHC-I binding affinity.

Immunogenetics 2020 Jul 24;72(5):295-304. Epub 2020 Jun 24.

Amgen Research, Cambridge, MA, USA.

Tumor-specific neoantigens are mutated self-peptides presented by tumor cell major histocompatibility complex (MHC) molecules and are necessary to elicit the host's anti-cancer cytotoxic T cell responses. They can be specifically recognized by neoantigen-specific T cell receptors (TCRs). However, current wet-lab assays for identifying peptide-MHC binding are too expensive and time-consuming to meet clinical needs. In this study, we developed an in silico method with a deep convolutional neural network (CNN) model, iConMHC, to predict peptide-MHC binding affinity. Unlike other in silico methods that learn only from properties of amino acids in neoantigen peptides alone and/or MHCs alone, iConMHC learns from physical and chemical interaction properties between pairwise amino acids from the two molecules. These properties, such as contact potentials and distances in folded proteins, directly affect neoantigen-MHC binding affinity. In addition, iConMHC is a pan-allele model that is capable of making predictions for all MHC alleles. Even for rare MHC alleles without training data, iConMHC can make predictions with reasonable accuracy. We benchmarked iConMHC against other commonly used MHC-I binding predictors and found that our model performs better than most of the pan-allele models.

Source
DOI: http://dx.doi.org/10.1007/s00251-020-01163-9
July 2020

GENCODE reference annotation for the human and mouse genomes.

Nucleic Acids Res 2019 Jan;47(D1):D766-D773

European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK.

The accurate identification and description of the genes in the human and mouse genomes is a fundamental requirement for high quality analysis of data informing both genome biology and clinical genomics. Over the last 15 years, the GENCODE consortium has been producing reference quality gene annotations to provide this foundational resource. The GENCODE consortium includes both experimental and computational biology groups who work together to improve and extend the GENCODE gene annotation. Specifically, we generate primary data, create bioinformatics tools and provide analysis to support the work of expert manual gene annotators and automated gene annotation pipelines. In addition, manual and computational annotation workflows use any and all publicly available data and analysis, along with the research literature to identify and characterise gene loci to the highest standard. GENCODE gene annotations are accessible via the Ensembl and UCSC Genome Browsers, the Ensembl FTP site, Ensembl Biomart, Ensembl Perl and REST APIs as well as https://www.gencodegenes.org.

Source
DOI: http://dx.doi.org/10.1093/nar/gky955
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6323946
January 2019

Enhanced transcriptome maps from multiple mouse tissues reveal evolutionary constraint in gene expression.

Nat Commun 2015 Jan 13;6:5903. Epub 2015 Jan 13.

Functional Genomics Group, Cold Spring Harbor Laboratory, 1 Bungtown Road, Cold Spring Harbor, New York 11724, USA.

Mice have been a long-standing model for human biology and disease. Here we characterize, by RNA sequencing, the transcriptional profiles of a large and heterogeneous collection of mouse tissues, augmenting the mouse transcriptome with thousands of novel transcript candidates. Comparison with transcriptome profiles in human cell lines reveals substantial conservation of transcriptional programmes, and uncovers a distinct class of genes with levels of expression that have been constrained early in vertebrate evolution. This core set of genes captures a substantial fraction of the transcriptional output of mammalian cells, and participates in basic functional and structural housekeeping processes common to all cell types. Perturbation of these constrained genes is associated with significant phenotypes including embryonic lethality and cancer. Evolutionary constraint in gene expression levels is not reflected in the conservation of the genomic sequences, but is associated with conserved epigenetic marking, as well as with a characteristic post-transcriptional regulatory programme, in which sub-cellular localization and alternative splicing play comparatively large roles.

Source
DOI: http://dx.doi.org/10.1038/ncomms6903
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4308717
January 2015

Comparative analysis of the transcriptome across distant species.

Nature 2014 Aug;512(7515):445-8

Functional Genomics, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York 11724, USA.

The transcriptome is the readout of the genome. Identifying common features in it across distant species can reveal fundamental principles. To this end, the ENCODE and modENCODE consortia have generated large amounts of matched RNA-sequencing data for human, worm and fly. Uniform processing and comprehensive annotation of these data allow comparison across metazoan phyla, extending beyond earlier within-phylum transcriptome comparisons and revealing ancient, conserved features. Specifically, we discover co-expression modules shared across animals, many of which are enriched in developmental genes. Moreover, we use expression patterns to align the stages in worm and fly development and find a novel pairing between worm embryo and fly pupae, in addition to the embryo-to-embryo and larvae-to-larvae pairings. Furthermore, we find that the extent of non-canonical, non-coding transcription is similar in each organism, per base pair. Finally, we find in all three organisms that the gene-expression levels, both coding and non-coding, can be quantitatively predicted from chromatin features at the promoter using a 'universal model' based on a single set of organism-independent parameters.

Source
DOI: http://dx.doi.org/10.1038/nature13424
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4155737
August 2014

Comparative analysis of pseudogenes across three phyla.

Proc Natl Acad Sci U S A 2014 Sep 25;111(37):13361-6. Epub 2014 Aug 25.

Program in Computational Biology and Bioinformatics and Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT 06520; Department of Computer Science, Yale University, New Haven, CT 06511.

Pseudogenes are degraded fossil copies of genes. Here, we report a comparison of pseudogenes spanning three phyla, leveraging the completed annotations of the human, worm, and fly genomes, which we make available as an online resource. We find that pseudogenes are lineage specific, much more so than protein-coding genes, reflecting the different remodeling processes marking each organism's genome evolution. The majority of human pseudogenes are processed, resulting from a retrotranspositional burst at the dawn of the primate lineage. This burst can be seen in the largely uniform distribution of pseudogenes across the genome, their preservation in areas with low recombination rates, and their preponderance in highly expressed gene families. In contrast, worm and fly pseudogenes tell a story of numerous duplication events. In worm, these duplications have been preserved through selective sweeps, so we see a large number of pseudogenes associated with highly duplicated families such as chemoreceptors. However, in fly, the large effective population size and high deletion rate resulted in a depletion of the pseudogene complement. Despite large variations between these species, we also find notable similarities. Overall, we identify a broad spectrum of biochemical activity for pseudogenes, with the majority in each organism exhibiting varying degrees of partial activity. In particular, we identify a consistent amount of transcription (∼15%) across all species, suggesting a uniform degradation process. Also, we see a uniform decay of pseudogene promoter activity relative to their coding counterparts and identify a number of pseudogenes with conserved upstream sequences and activity, hinting at potential regulatory roles.

Source
DOI: http://dx.doi.org/10.1073/pnas.1407293111
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4169933
September 2014

Analysis of variable retroduplications in human populations suggests coupling of retrotransposition to cell division.

Genome Res 2013 Dec 11;23(12):2042-52. Epub 2013 Sep 11.

Program in Computational Biology and Bioinformatics.

In primates and other animals, reverse transcription of mRNA followed by genomic integration creates retroduplications. Expressed retroduplications are either "retrogenes" coding for functioning proteins, or expressed "processed pseudogenes," which can function as noncoding RNAs. To date, little is known about the variation in retroduplications in terms of their presence or absence across individuals in the human population. We have developed new methodologies that allow us to identify "novel" retroduplications (i.e., those not present in the reference genome), to find their insertion points, and to genotype them. Using these methods, we catalogued and analyzed 174 retroduplication variants in almost one thousand humans, which were sequenced as part of Phase 1 of The 1000 Genomes Project Consortium. The accuracy of our data set was corroborated by (1) multiple lines of sequencing evidence for retroduplication (e.g., depth of coverage in exons vs. introns), (2) experimental validation, and (3) the fact that we can reconstruct a correct phylogenetic tree of human subpopulations based solely on retroduplications. We also show that parent genes of retroduplication variants tend to be expressed at the M-to-G1 transition in the cell cycle and that M-to-G1 expressed genes have more copies of fixed retroduplications than genes expressed at other times. These findings suggest that cell division is coupled to retrotransposition and, perhaps, is even a requirement for it.

Source
DOI: http://dx.doi.org/10.1101/gr.154625.113
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3847774
December 2013

A Bayesian Approach to Pathway Analysis by Integrating Gene-Gene Functional Directions and Microarray Data.

Stat Biosci 2012 May 29;4(1):105-131. Epub 2011 Dec 29.

Department of Statistics, University of Connecticut, Storrs, CT 06269, USA.

Many statistical methods have been developed to screen for differentially expressed genes associated with specific phenotypes in microarray data. However, it remains a major challenge to synthesize the observed expression patterns with abundant biological knowledge for a more complete understanding of the biological functions among genes. Various methods including clustering analysis on genes, neural networks, Bayesian networks and pathway analysis have been developed toward this goal. In most of these procedures, the activation and inhibition relationships among genes have hardly been utilized in the modeling steps. We propose two novel Bayesian models to integrate the microarray data with the putative pathway structures obtained from the KEGG database and the directional gene-gene interactions in the medical literature. We define the symmetric Kullback-Leibler divergence of a pathway, and use it to identify the pathway(s) most supported by the microarray data. A Markov chain Monte Carlo sampling algorithm is given for posterior computation in the hierarchical model. The proposed method is shown to select the most supported pathway in an illustrative example. Finally, we apply the methodology to a real microarray data set to understand the gene expression profile of the osteoblast lineage at defined stages of differentiation. We observe that our method correctly identifies the pathways that are reported to play essential roles in modulating bone mass.
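
For readers unfamiliar with the quantity named in this abstract, the following is a minimal sketch of the textbook symmetric Kullback-Leibler divergence between two discrete distributions. The abstract does not give the paper's pathway-level definition, so this is not the authors' formulation; the function names `kl_divergence` and `symmetric_kl` are hypothetical and chosen only for illustration.

```python
import math


def kl_divergence(p, q):
    """Return KL(p || q) for two discrete distributions of equal length.

    Assumption: p and q are sequences of probabilities that each sum to 1.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue  # the term 0 * log(0 / q) is taken as 0 by convention
        if qi == 0:
            return math.inf  # p puts mass where q has none
        total += pi * math.log(pi / qi)
    return total


def symmetric_kl(p, q):
    """Return the symmetric divergence KL(p || q) + KL(q || p)."""
    return kl_divergence(p, q) + kl_divergence(q, p)


if __name__ == "__main__":
    # Two hypothetical distributions over two outcomes.
    print(symmetric_kl([0.5, 0.5], [0.9, 0.1]))
```

Unlike the one-directional divergence, the symmetric form does not depend on which distribution is treated as the reference. That property is convenient when comparing candidate models, although how the paper uses it to rank pathways is not specified here.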

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3592971
DOI: http://dx.doi.org/10.1007/s12561-011-9046-1
May 2012

Reconstruction of biological networks by incorporating prior knowledge into Bayesian network models.

J Comput Biol 2012 Dec;19(12):1324-34

Program in Computational Biology and Bioinformatics, Yale University, New Haven, Connecticut, USA.

The Bayesian network model is widely used for reverse engineering of biological network structures. An advantage of this model is its capability to integrate prior knowledge into the model learning process, which can improve the quality of the network reconstruction outcome. Some previous works have explored this area with a focus on using prior knowledge of the direct molecular links, except for a few recent ones proposing to examine the effects of molecular orderings. In this study, we propose a Bayesian network model that can integrate both direct links and orderings into the model. Random weights are assigned to these two types of prior knowledge to alleviate bias toward certain types of information. We evaluate our model performance using both synthetic data and biological data for the RAF signaling network, and illustrate the significant improvement in network structure reconstruction of the proposed models over the existing methods. We also examine the correlation between the improvement and the abundance of ordering prior knowledge. To address the issue of generating prior knowledge, we propose an approach to automatically extract potential molecular orderings from knowledge resources such as the Kyoto Encyclopedia of Genes and Genomes (KEGG) database and Gene Ontology (GO) annotation.

Source
DOI: http://dx.doi.org/10.1089/cmb.2011.0194
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3513982
December 2012

GENCODE: the reference human genome annotation for The ENCODE Project.

Genome Res 2012 Sep;22(9):1760-74

Wellcome Trust Sanger Institute, Wellcome Trust Campus, Hinxton, Cambridge CB10 1SA, United Kingdom.

The GENCODE Consortium aims to identify all gene features in the human genome using a combination of computational analysis, manual annotation, and experimental validation. Since the first public release of this annotation data set, few new protein-coding loci have been added, yet the number of alternative splicing transcripts annotated has steadily increased. The GENCODE 7 release contains 20,687 protein-coding and 9640 long noncoding RNA loci and has 33,977 coding transcripts not represented in UCSC genes and RefSeq. It also has the most comprehensive annotation of long noncoding RNA (lncRNA) loci publicly available with the predominant transcript form consisting of two exons. We have examined the completeness of the transcript annotation and found that 35% of transcriptional start sites are supported by CAGE clusters and 62% of protein-coding genes have annotated polyA sites. Over one-third of GENCODE protein-coding genes are supported by peptide hits derived from mass spectrometry spectra submitted to Peptide Atlas. New models derived from the Illumina Body Map 2.0 RNA-seq data identify 3689 new loci not currently in GENCODE, of which 3127 consist of two exon models indicating that they are possibly unannotated long noncoding loci. GENCODE 7 is publicly available from gencodegenes.org and via the Ensembl and UCSC Genome Browsers.

Source
DOI: http://dx.doi.org/10.1101/gr.135350.111
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3431492
September 2012

The GENCODE pseudogene resource.

Genome Biol 2012 Sep 26;13(9):R51. Epub 2012 Sep 26.

Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT 06520, USA.

Background: Pseudogenes have long been considered as nonfunctional genomic sequences. However, recent evidence suggests that many of them might have some form of biological activity, and the possibility of functionality has increased interest in their accurate annotation and integration with functional genomics data.

Results: As part of the GENCODE annotation of the human genome, we present the first genome-wide pseudogene assignment for protein-coding genes, based on both large-scale manual annotation and in silico pipelines. A key aspect of this coupled approach is that it allows us to identify pseudogenes in an unbiased fashion as well as untangle complex events through manual evaluation. We integrate the pseudogene annotations with the extensive ENCODE functional genomics information. In particular, we determine the expression level, transcription-factor and RNA polymerase II binding, and chromatin marks associated with each pseudogene. Based on their distribution, we develop simple statistical models for each type of activity, which we validate with large-scale RT-PCR-Seq experiments. Finally, we compare our pseudogenes with conservation and variation data from primate alignments and the 1000 Genomes project, producing lists of pseudogenes potentially under selection.

Conclusions: At one extreme, some pseudogenes possess conventional characteristics of functionality; these may represent genes that have recently died. On the other hand, we find interesting patterns of partial activity, which may suggest that dead genes are being resurrected as functioning non-coding RNAs. The activity data of each pseudogene are stored in an associated resource, psiDR, which will be useful for the initial identification of potentially functional pseudogenes.

Source
DOI: http://dx.doi.org/10.1186/gb-2012-13-9-r51
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3491395
September 2012

Learning Bayesian networks with integration of indirect prior knowledge.

Int J Data Min Bioinform 2010 ;4(5):505-19

Department of Computer Science and Engineering, University of Connecticut, Storrs, CT 06269, USA.

A Bayesian network model can be used to study the structures of gene regulatory networks. It has the ability to integrate information from both prior knowledge and experimental data. In this study, we propose an approach to efficiently integrate global ordering information into model learning, where the ordering information specifies the indirect relationships among genes. We demonstrate that, compared with a traditional Bayesian network model that uses only local prior knowledge, utilising additional global ordering knowledge can significantly improve the model's performance. The magnitude of this improvement depends on the abundance of global ordering information and on data quality.

Source
DOI: http://dx.doi.org/10.1504/ijdmb.2010.035897
March 2011

Computing consistency between microarray data and known gene regulation relationships.

IEEE Trans Inf Technol Biomed 2009 Nov 25;13(6):1075-82. Epub 2009 Sep 25.

Department of Computer Science and Engineering, University of Connecticut, Storrs, CT 06269, USA.

Microarray experiments produce expression patterns for thousands of genes at once. Meanwhile, the biomedical literature contains large amounts of gene regulation relationship information accumulated over the years. One obvious requirement is an automated way of comparing microarray data with the collection of known gene regulation relationships. Such an automated comparison is imperative because it can help biologists rapidly understand the context of a given microarray experiment. In addition, the consistency measure can be used to either validate or refute the hypothesis being tested using the microarray experiment. In this paper we present a systematic way of examining the consistency between a given set of microarray data and known gene regulation relationships. We first introduce a simple gene regulation network model with two separate algorithms designed to isolate a maximally consistent network. Subsequently, we extend the model to take into account multiple regulating factors for a single gene while highlighting both consistencies and inconsistencies. We illustrate the effectiveness of our approach with two practical examples, one that picks the peroxisome proliferator-activated receptor (PPAR) pathway as highly consistent from multiple pathways of the Kyoto Encyclopedia of Genes and Genomes (KEGG), and another that isolates key regulatory relationships involving nfkb1 and others known for the macrophage's counter-response to inflammation.
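
The idea of checking expression data against known regulation relationships can be illustrated with a deliberately simplified sketch. It assumes each gene's expression change is reduced to a sign (+1 up, -1 down, 0 unchanged) and each known relationship to a signed edge (+1 activation, -1 inhibition). This is not the paper's algorithms, which isolate a maximally consistent network; the names `classify_relationships` and `consistency_score` and the gene labels are hypothetical.

```python
def classify_relationships(changes, relationships):
    """Split signed regulator->target edges by agreement with expression changes.

    changes: dict mapping gene name to +1 (up), -1 (down) or 0 (unchanged).
    relationships: iterable of (regulator, target, effect) tuples, where
        effect is +1 for activation and -1 for inhibition.
    Returns three lists: consistent, inconsistent, undetermined.
    """
    consistent, inconsistent, undetermined = [], [], []
    for regulator, target, effect in relationships:
        reg_change = changes.get(regulator, 0)
        tgt_change = changes.get(target, 0)
        edge = (regulator, target, effect)
        if reg_change == 0 or tgt_change == 0:
            # No observed change on one side: the data cannot judge this edge.
            undetermined.append(edge)
        elif reg_change * effect == tgt_change:
            consistent.append(edge)
        else:
            inconsistent.append(edge)
    return consistent, inconsistent, undetermined


def consistency_score(changes, relationships):
    """Fraction of decidable edges that agree with the data, or None if none are decidable."""
    consistent, inconsistent, _ = classify_relationships(changes, relationships)
    decided = len(consistent) + len(inconsistent)
    return len(consistent) / decided if decided else None


if __name__ == "__main__":
    # Hypothetical toy data, not taken from the paper.
    changes = {"geneA": 1, "geneB": 1, "geneC": -1, "geneD": 0}
    edges = [
        ("geneA", "geneB", 1),   # activation, both up
        ("geneA", "geneC", -1),  # inhibition, regulator up and target down
        ("geneB", "geneC", 1),   # activation, but target went down
        ("geneA", "geneD", 1),   # target unchanged
    ]
    print(consistency_score(changes, edges))
```

A per-pathway score of this kind could, in principle, be compared across pathways to see which one the data support best. Treating each edge independently ignores the multiple-regulator case that the abstract says the extended model handles.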

Source
DOI: http://dx.doi.org/10.1109/TITB.2009.2032540
November 2009