Publications by authors named "Minmei Hou"

9 Publications

  • Page 1 of 1

Alignathon: a competitive assessment of whole-genome alignment methods.

Genome Res 2014 Dec 1;24(12):2077-89. Epub 2014 Oct 1.

Center for Biomolecular Science and Engineering, University of California Santa Cruz, Santa Cruz, California 95064, USA; Biomolecular Engineering Department, University of California Santa Cruz, Santa Cruz, California 95064, USA;

Multiple sequence alignments (MSAs) are a prerequisite for a wide variety of evolutionary analyses. Published assessments and benchmark data sets for protein and, to a lesser extent, global nucleotide MSAs are available, but less effort has been made to establish benchmarks in the more general problem of whole-genome alignment (WGA). Using the same model as the successful Assemblathon competitions, we organized a competitive evaluation in which teams submitted their alignments and then assessments were performed collectively after all the submissions were received. Three data sets were used: Two were simulated and based on primate and mammalian phylogenies, and one was comprised of 20 real fly genomes. In total, 35 submissions were assessed, submitted by 10 teams using 12 different alignment pipelines. We found agreement between independent simulation-based and statistical assessments, indicating that there are substantial accuracy differences between contemporary alignment tools. We saw considerable differences in the alignment quality of differently annotated regions and found that few tools aligned the duplications analyzed. We found that many tools worked well at shorter evolutionary distances, but fewer performed competitively at longer distances. We provide all data sets, submissions, and assessment programs for further study and provide, as a resource for future benchmarking, a convenient repository of code and data for reproducing the simulation assessments.
View Article and Find Full Text PDF

Download full-text PDF

Source
http://dx.doi.org/10.1101/gr.174920.114DOI Listing
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4248324PMC
December 2014

Pico-inplace-inversions between human and chimpanzee.

Bioinformatics 2011 Dec 12;27(23):3266-75. Epub 2011 Oct 12.

Department of Computer Science, Northern Illinois University, DeKalb, IL 60115, USA.

Motivation: There have been several studies on the micro-inversions between human and chimpanzee, but there are large discrepancies among their results. Furthermore, all of them rely on alignment procedures or existing alignment results to identify inversions. However, the core alignment procedures do not take very small inversions into consideration. Therefore, their analyses cannot find inversions that are too small to be detected by a classic aligner. We call such inversions pico-inversions.

Results: We re-analyzed human-chimpanzee alignment from the UCSC Genome Browser for micro-inplace-inversions and screened for pico-inplace-inversions using a likelihood ratio test. We report that the quantity of inplace-inversions between human and chimpanzee is substantially greater than what had previously been discovered. We also present the software tool PicoInversionMiner to detect pico-inplace-inversions between closely related species.

Availability: Software tools, scripts and result data are available at http://faculty.cs.niu.edu/~hou/PicoInversion.html.

Contact: mhou@cs.niu.edu.
View Article and Find Full Text PDF

Download full-text PDF

Source
http://dx.doi.org/10.1093/bioinformatics/btr566DOI Listing
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3223364PMC
December 2011

28-way vertebrate alignment and conservation track in the UCSC Genome Browser.

Genome Res 2007 Dec 5;17(12):1797-808. Epub 2007 Nov 5.

Center for Comparative Genomics and Bioinformatics, Penn State University, University Park, Pennsylvania 16802, USA.

This article describes a set of alignments of 28 vertebrate genome sequences that is provided by the UCSC Genome Browser. The alignments can be viewed on the Human Genome Browser (March 2006 assembly) at http://genome.ucsc.edu, downloaded in bulk by anonymous FTP from http://hgdownload.cse.ucsc.edu/goldenPath/hg18/multiz28way, or analyzed with the Galaxy server at http://g2.bx.psu.edu. This article illustrates the power of this resource for exploring vertebrate and mammalian evolution, using three examples. First, we present several vignettes involving insertions and deletions within protein-coding regions, including a look at some human-specific indels. Then we study the extent to which start codons and stop codons in the human sequence are conserved in other species, showing that start codons are in general more poorly conserved than stop codons. Finally, an investigation of the phylogenetic depth of conservation for several classes of functional elements in the human genome reveals striking differences in the rates and modes of decay in alignability. Each functional class has a distinctive period of stringent constraint, followed by decays that allow (for the case of regulatory regions) or reject (for coding regions and ultraconserved elements) insertions and deletions.
View Article and Find Full Text PDF

Download full-text PDF

Source
http://dx.doi.org/10.1101/gr.6761107DOI Listing
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2099589PMC
December 2007

Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project.

Nature 2007 Jun;447(7146):799-816

We report the generation and analysis of functional data from multiple, diverse experiments performed on a targeted 1% of the human genome as part of the pilot phase of the ENCODE Project. These data have been further integrated and augmented by a number of evolutionary and computational analyses. Together, our results advance the collective knowledge about human genome function in several major areas. First, our studies provide convincing evidence that the genome is pervasively transcribed, such that the majority of its bases can be found in primary transcripts, including non-protein-coding transcripts, and those that extensively overlap one another. Second, systematic examination of transcriptional regulation has yielded new understanding about transcription start sites, including their relationship to specific regulatory sequences and features of chromatin accessibility and histone modification. Third, a more sophisticated view of chromatin structure has emerged, including its inter-relationship with DNA replication and transcriptional regulation. Finally, integration of these new sources of information, in particular with respect to mammalian evolution based on inter- and intra-species sequence comparisons, has yielded new mechanistic and evolutionary insights concerning the functional landscape of the human genome. Together, these studies are defining a path for pursuit of a more comprehensive characterization of human genome function.
View Article and Find Full Text PDF

Download full-text PDF

Source
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2212820PMC
http://dx.doi.org/10.1038/nature05874DOI Listing
June 2007

Analyses of deep mammalian sequence alignments and constraint predictions for 1% of the human genome.

Genome Res 2007 Jun;17(6):760-74

Genome Technology Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD 20892, USA.

A key component of the ongoing ENCODE project involves rigorous comparative sequence analyses for the initially targeted 1% of the human genome. Here, we present orthologous sequence generation, alignment, and evolutionary constraint analyses of 23 mammalian species for all ENCODE targets. Alignments were generated using four different methods; comparisons of these methods reveal large-scale consistency but substantial differences in terms of small genomic rearrangements, sensitivity (sequence coverage), and specificity (alignment accuracy). We describe the quantitative and qualitative trade-offs concomitant with alignment method choice and the levels of technical error that need to be accounted for in applications that require multisequence alignments. Using the generated alignments, we identified constrained regions using three different methods. While the different constraint-detecting methods are in general agreement, there are important discrepancies relating to both the underlying alignments and the specific algorithms. However, by integrating the results across the alignments and constraint-detecting methods, we produced constraint annotations that were found to be robust based on multiple independent measures. Analyses of these annotations illustrate that most classes of experimentally annotated functional elements are enriched for constrained sequences; however, large portions of each class (with the exception of protein-coding sequences) do not overlap constrained regions. The latter elements might not be under primary sequence constraint, might not be constrained across all mammals, or might have expendable molecular functions. Conversely, 40% of the constrained sequences do not overlap any of the functional elements that have been experimentally identified. Together, these findings demonstrate and quantify how many genomic functional elements await basic molecular characterization.
View Article and Find Full Text PDF

Download full-text PDF

Source
http://dx.doi.org/10.1101/gr.6034307DOI Listing
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1891336PMC
June 2007

HomologMiner: looking for homologous genomic groups in whole genomes.

Bioinformatics 2007 Apr 18;23(8):917-25. Epub 2007 Feb 18.

Department of Computer Science & Engineering, Penn State University, PA, USA.

Motivation: Complex genomes contain numerous repeated sequences, and genomic duplication is believed to be a main evolutionary mechanism to obtain new functions. Several tools are available for de novo repeat sequence identification, and many approaches exist for clustering homologous protein sequences. We present an efficient new approach to identify and cluster homologous DNA sequences with high accuracy at the level of whole genomes, excluding low-complexity repeats, tandem repeats and annotated interspersed repeats. We also determine the boundaries of each group member so that it closely represents a biological unit, e.g. a complete gene, or a partial gene coding a protein domain.

Results: We developed a program called HomologMiner to identify homologous groups applicable to genome sequences that have been properly marked for low-complexity repeats and annotated interspersed repeats. We applied it to the whole genomes of human (hg17), macaque (rheMac2) and mouse (mm8). Groups obtained include gene families (e.g. olfactory receptor gene family, zinc finger families), unannotated interspersed repeats and additional homologous groups that resulted from recent segmental duplications. Our program incorporates several new methods: a new abstract definition of consistent duplicate units, a new criterion to remove moderately frequent tandem repeats, and new algorithmic techniques. We also provide preliminary analysis of the output on the three genomes mentioned above, and show several applications including identifying boundaries of tandem gene clusters and novel interspersed repeat families.

Availability: All programs and datasets are downloadable from www.bx.psu.edu/miller_lab.
View Article and Find Full Text PDF

Download full-text PDF

Source
http://dx.doi.org/10.1093/bioinformatics/btm048DOI Listing
April 2007

Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes.

Genome Res 2005 Aug 15;15(8):1034-50. Epub 2005 Jul 15.

Center for Biomolecular Science and Engineering, University of California, Santa Cruz, Santa Cruz, California 95064, USA.

We have conducted a comprehensive search for conserved elements in vertebrate genomes, using genome-wide multiple alignments of five vertebrate species (human, mouse, rat, chicken, and Fugu rubripes). Parallel searches have been performed with multiple alignments of four insect species (three species of Drosophila and Anopheles gambiae), two species of Caenorhabditis, and seven species of Saccharomyces. Conserved elements were identified with a computer program called phastCons, which is based on a two-state phylogenetic hidden Markov model (phylo-HMM). PhastCons works by fitting a phylo-HMM to the data by maximum likelihood, subject to constraints designed to calibrate the model across species groups, and then predicting conserved elements based on this model. The predicted elements cover roughly 3%-8% of the human genome (depending on the details of the calibration procedure) and substantially higher fractions of the more compact Drosophila melanogaster (37%-53%), Caenorhabditis elegans (18%-37%), and Saccharaomyces cerevisiae (47%-68%) genomes. From yeasts to vertebrates, in order of increasing genome size and general biological complexity, increasing fractions of conserved bases are found to lie outside of the exons of known protein-coding genes. In all groups, the most highly conserved elements (HCEs), by log-odds score, are hundreds or thousands of bases long. These elements share certain properties with ultraconserved elements, but they tend to be longer and less perfectly conserved, and they overlap genes of somewhat different functional categories. In vertebrates, HCEs are associated with the 3' UTRs of regulatory genes, stable gene deserts, and megabase-sized regions rich in moderately conserved noncoding sequences. Noncoding HCEs also show strong statistical evidence of an enrichment for RNA secondary structure.
View Article and Find Full Text PDF

Download full-text PDF

Source
http://dx.doi.org/10.1101/gr.3715005DOI Listing
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1182216PMC
August 2005

Mulan: multiple-sequence local alignment and visualization for studying function and evolution.

Genome Res 2005 Jan 8;15(1):184-94. Epub 2004 Dec 8.

Energy, Environment, Biology and Institutional Computing, Lawrence Livermore National Laboratory, Livermore, California 94550, USA.

Multiple-sequence alignment analysis is a powerful approach for understanding phylogenetic relationships, annotating genes, and detecting functional regulatory elements. With a growing number of partly or fully sequenced vertebrate genomes, effective tools for performing multiple comparisons are required to accurately and efficiently assist biological discoveries. Here we introduce Mulan (http://mulan.dcode.org/), a novel method and a network server for comparing multiple draft and finished-quality sequences to identify functional elements conserved over evolutionary time. Mulan brings together several novel algorithms: the TBA multi-aligner program for rapid identification of local sequence conservation, and the multiTF program for detecting evolutionarily conserved transcription factor binding sites in multiple alignments. In addition, Mulan supports two-way communication with the GALA database; alignments of multiple species dynamically generated in GALA can be viewed in Mulan, and conserved transcription factor binding sites identified with Mulan/multiTF can be integrated and overlaid with extensive genome annotation data using GALA. Local multiple alignments computed by Mulan ensure reliable representation of short- and large-scale genomic rearrangements in distant organisms. Mulan allows for interactive modification of critical conservation parameters to differentially predict conserved regions in comparisons of both closely and distantly related species. We illustrate the uses and applications of the Mulan tool through multispecies comparisons of the GATA3 gene locus and the identification of elements that are conserved in a different way in avians than in other genomes, allowing speculation on the evolution of birds. Source code for the aligners and the aligner-evaluation software can be freely downloaded from http://www.bx.psu.edu/miller_lab/.
View Article and Find Full Text PDF

Download full-text PDF

Source
http://dx.doi.org/10.1101/gr.3007205DOI Listing
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC540288PMC
January 2005

Genome sequence of the Brown Norway rat yields insights into mammalian evolution.

Nature 2004 Apr;428(6982):493-521

Human Genome Sequencing Center, Department of Molecular and Human Genetics, Baylor College of Medicine, MS BCM226, One Baylor Plaza, Houston, Texas 77030, USA. http://www.hgsc.bcm.tmc.edu

The laboratory rat (Rattus norvegicus) is an indispensable tool in experimental medicine and drug development, having made inestimable contributions to human health. We report here the genome sequence of the Brown Norway (BN) rat strain. The sequence represents a high-quality 'draft' covering over 90% of the genome. The BN rat sequence is the third complete mammalian genome to be deciphered, and three-way comparisons with the human and mouse genomes resolve details of mammalian evolution. This first comprehensive analysis includes genes and proteins and their relation to human disease, repeated sequences, comparative genome-wide studies of mammalian orthologous chromosomal regions and rearrangement breakpoints, reconstruction of ancestral karyotypes and the events leading to existing species, rates of variation, and lineage-specific and lineage-independent evolutionary events such as expansion of gene families, orthology relations and protein evolution.
View Article and Find Full Text PDF

Download full-text PDF

Source
http://dx.doi.org/10.1038/nature02426DOI Listing
April 2004