Publications by authors named "Michael F Lin"

30 Publications

Accurate, scalable cohort variant calls using DeepVariant and GLnexus.

Bioinformatics 2021 Jan 5. Epub 2021 Jan 5.

Google Health, Cambridge, MA 02142 and Palo Alto, CA, USA.

Motivation: Population-scale sequenced cohorts are foundational resources for genetic analyses, but processing raw reads into analysis-ready cohort-level variants remains challenging.

Results: We introduce an open-source cohort-calling method that uses the highly-accurate caller DeepVariant and scalable merging tool GLnexus. Using callset quality metrics based on variant recall and precision in benchmark samples and Mendelian consistency in father-mother-child trios, we optimized the method across a range of cohort sizes, sequencing methods, and sequencing depths. The resulting callsets show consistent quality improvements over those generated using existing best practices with reduced cost. We further evaluate our pipeline in the deeply sequenced 1000 Genomes Project (1KGP) samples and show superior callset quality metrics and imputation reference panel performance compared to an independently-generated GATK Best Practices pipeline.

Availability And Implementation: We publicly release the 1KGP individual-level variant calls and cohort callset (https://console.cloud.google.com/storage/browser/brain-genomics-public/research/cohort/1KGP) to foster additional development and evaluation of cohort merging methods as well as broad studies of genetic variation. Both DeepVariant (https://github.com/google/deepvariant) and GLnexus (https://github.com/dnanexus-rnd/GLnexus) are open-sourced, and the optimized GLnexus setup discovered in this study is also integrated into GLnexus public releases v1.2.2 and later.

Supplementary Information: Supplementary data are available at Bioinformatics online.
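
The Mendelian-consistency metric mentioned in the Results can be illustrated with a minimal sketch. It assumes diploid genotypes given as allele tuples in (child, father, mother) order, and it ignores de novo mutations, missing calls, and sex chromosomes. The function names and example calls are hypothetical and are not taken from DeepVariant or GLnexus.

```python
from itertools import product


def is_mendelian_consistent(child, father, mother):
    """True if the child's genotype can be formed from one paternal
    allele and one maternal allele."""
    child_sorted = tuple(sorted(child))
    return any(
        tuple(sorted((f, m))) == child_sorted
        for f, m in product(father, mother)
    )


def mendelian_violation_rate(trios):
    """Fraction of (child, father, mother) genotype triples that
    violate Mendelian inheritance."""
    trios = list(trios)
    if not trios:
        return 0.0
    violations = sum(
        not is_mendelian_consistent(c, f, m) for c, f, m in trios
    )
    return violations / len(trios)


# Hypothetical calls at three sites (0 = reference, 1 = alternate allele)
example_trios = [
    ((0, 1), (0, 0), (0, 1)),  # consistent
    ((1, 1), (0, 1), (0, 1)),  # consistent
    ((1, 1), (0, 0), (0, 1)),  # violation: father carries no alternate allele
]
rate = mendelian_violation_rate(example_trios)
```

A lower violation rate across many trio sites suggests a cleaner callset, which is the sense in which the abstract uses trios as a quality signal.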

DOI: http://dx.doi.org/10.1093/bioinformatics/btaa1081
January 2021

Sparse project VCF: efficient encoding of population genotype matrices.

Bioinformatics 2020 Dec 10. Epub 2020 Dec 10.

Regeneron Genetics Center, Tarrytown, NY, USA.

Summary: Variant Call Format (VCF), the prevailing representation for germline genotypes in population sequencing, suffers rapid size growth as larger cohorts are sequenced and more rare variants are discovered. We present Sparse Project VCF (spVCF), an evolution of VCF with judicious entropy reduction and run-length encoding, delivering >10X size reduction for modern studies with practically minimal information loss. spVCF interoperates with VCF efficiently, including tabix-based random access. We demonstrate its effectiveness with the DiscovEHR and UK Biobank whole-exome sequencing cohorts.

Availability And Implementation: Apache-licensed reference implementation: github.com/mlin/spVCF.

Supplementary Information: Supplementary data are available at Bioinformatics online.
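
To give a feel for why run-length encoding helps with sparse genotype matrices, here is a generic sketch. It is not the spVCF encoding itself (see the reference implementation for that); it only shows how long runs of identical cells, such as homozygous-reference genotypes, collapse into a few (value, count) pairs. The function names and the example row are hypothetical.

```python
def run_length_encode(cells):
    """Collapse consecutive identical cells into (value, count) pairs."""
    runs = []
    for cell in cells:
        if runs and runs[-1][0] == cell:
            runs[-1][1] += 1
        else:
            runs.append([cell, 1])
    return [(value, count) for value, count in runs]


def run_length_decode(runs):
    """Expand (value, count) pairs back into the original cell list."""
    cells = []
    for value, count in runs:
        cells.extend([value] * count)
    return cells


# Hypothetical genotype row: one heterozygous carrier among twelve samples
row = ["0/0"] * 6 + ["0/1"] + ["0/0"] * 5
encoded = run_length_encode(row)
decoded = run_length_decode(encoded)
```

For rare variants, almost every cell in a row is identical, so the encoded row stays short even as the cohort grows.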

DOI: http://dx.doi.org/10.1093/bioinformatics/btaa1004
December 2020

IDseq-An open source cloud-based pipeline and analysis service for metagenomic pathogen detection and monitoring.

Gigascience 2020 Oct;9(10)

Chan Zuckerberg Biohub, 499 Illinois St, San Francisco, CA 94158, USA.

Background: Metagenomic next-generation sequencing (mNGS) has enabled the rapid, unbiased detection and identification of microbes without pathogen-specific reagents, culturing, or a priori knowledge of the microbial landscape. mNGS data analysis requires a series of computationally intensive processing steps to accurately determine the microbial composition of a sample. Existing mNGS data analysis tools typically require bioinformatics expertise and access to local server-class hardware resources. For many research laboratories, this presents an obstacle, especially in resource-limited environments.

Findings: We present IDseq, an open source cloud-based metagenomics pipeline and service for global pathogen detection and monitoring (https://idseq.net). The IDseq Portal accepts raw mNGS data, performs host and quality filtration steps, then executes an assembly-based alignment pipeline, which results in the assignment of reads and contigs to taxonomic categories. The taxonomic relative abundances are reported and visualized in an easy-to-use web application to facilitate data interpretation and hypothesis generation. Furthermore, IDseq supports environmental background model generation and automatic internal spike-in control recognition, providing statistics that are critical for data interpretation. IDseq was designed with the specific intent of detecting novel pathogens. Here, we benchmark novel virus detection capability using both synthetically evolved viral sequences and real-world samples, including IDseq analysis of a nasopharyngeal swab sample acquired and processed locally in Cambodia from a tourist from Wuhan, China, infected with the recently emergent SARS-CoV-2.

Conclusion: The IDseq Portal reduces the barrier to entry for mNGS data analysis and enables bench scientists, clinicians, and bioinformaticians to gain insight from mNGS datasets for both known and novel pathogens.

DOI: http://dx.doi.org/10.1093/gigascience/giaa111
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7566497
October 2020

A Color Flow Tract in Ultrasound-Guided Random Renal Core Biopsy Predicts Complications.

J Ultrasound Med 2020 Jul 29;39(7):1335-1342. Epub 2020 Jan 29.

Mallinckrodt Institute of Radiology.

Objectives: To determine patient and procedural risk factors for major complications in ultrasound (US)-guided random renal core biopsy.

Methods: Random renal biopsies performed by radiologists in the US department at a single institution between 2014 and 2018 were retrospectively reviewed. The patient's age, sex, race, and estimated glomerular filtration rate (eGFR) were recorded. The biopsy approach, needle gauge, length of cores, number of throws, and presence of a color flow tract were recorded. Outcome data included minor and major complications. Associations between variables were tested with χ² analyses and univariable/multivariable logistic regression models.

Results: A total of 231 biopsies (167 native and 64 allografts) were reviewed. There was no significant difference in the sex, age, race, or eGFR between native and allograft groups. The overall rate for any complication was 18.2%, with a 4.3% rate of major complications, which was significantly greater in native compared to allograft biopsies (6% versus 0%; P = .045). A risk analysis in native biopsies only showed that major complications were significantly associated with a low eGFR such that patients with stage 4 or 5 kidney disease had higher odds of complications (odds ratio [95% confidence interval]: stage 4, 9.405 [1.995-44.338]; P = .0393; stage 5, 10.749 [2.218-52.080]; P = .0203) than patients with normal function (eGFR >60 mL/min). The presence of a color flow tract portended a 10.7 times greater risk of having any complication (95% confidence interval, 4.595-24.994; P < .001). Other procedural factors were not significantly associated with complications.

Conclusions: There is an increased risk of major complications in US-guided random native kidney biopsy in patients with a low eGFR (<30 mL/min) and a patent color flow tract in the immediate postbiopsy setting.

DOI: http://dx.doi.org/10.1002/jum.15227
July 2020

Variation graph toolkit improves read mapping by representing genetic variation in the reference.

Nat Biotechnol 2018 Oct 20;36(9):875-879. Epub 2018 Aug 20.

Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK.

Reference genomes guide our interpretation of DNA sequence data. However, conventional linear references represent only one version of each locus, ignoring variation in the population. Poor representation of an individual's genome sequence impacts read mapping and introduces bias. Variation graphs are bidirected DNA sequence graphs that compactly represent genetic variation across a population, including large-scale structural variation such as inversions and duplications. Previous graph genome software implementations have been limited by scalability or topological constraints. Here we present vg, a toolkit of computational methods for creating, manipulating, and using these structures as references at the scale of the human genome. vg provides an efficient approach to mapping reads onto arbitrary variation graphs using generalized compressed suffix arrays, with improved accuracy over alignment to a linear reference, and effectively removing reference bias. These capabilities make using variation graphs as references for DNA sequencing practical at a gigabase scale, or at the topological complexity of de novo assemblies.

DOI: http://dx.doi.org/10.1038/nbt.4227
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6126949
October 2018

Formulating a Treatment Plan in Suspected Lymphoma: Ultrasound-Guided Core Needle Biopsy Versus Core Needle Biopsy and Fine-Needle Aspiration of Peripheral Lymph Nodes.

J Ultrasound Med 2019 Mar 25;38(3):581-586. Epub 2018 Jul 25.

Washington University, St Louis, Missouri, USA.

Objectives: Image-guided tissue sampling in the workup of suspected lymphoma can be performed by core needle biopsy (CNB) or CNB with fine-needle aspiration (FNA). We compared the yield of clinically actionable diagnoses between these methods of tissue sampling.

Methods: All ultrasound-guided percutaneous peripheral lymph node biopsies from 2010 to 2017 at a single institution were retrospectively reviewed for biopsy type (CNB versus CNB + FNA), prior diagnosis of lymphoma, size of the target lymph node, number of cores, length of core specimens, and pathologic diagnosis. Lymphoma and lymphoid tissue were included; metastatic disease and nonlymphoid tissue were excluded. An oncologist specializing in lymphoma independently determined whether an actionable diagnosis could be made with the pathologic results in the context of the patient's medical record. χ² analyses and univariable/multivariable logistic regression models were used for statistical analyses.

Results: Of 578 lymph node biopsies, 306 (53%) had a prior diagnosis of lymphoma; 273 (47%) were CNB, and 305 (53%) were CNB + FNA. There was no significant difference between biopsy types (CNB versus CNB + FNA) in the number of cores (median [25th, 75th percentiles], 3 [3, 4] versus 4 [3, 4]; P = .47) or total length of tissue (4.1 [2.5, 6.1] versus 3.7 [2.3, 6] cm; P = .09). There was no difference in obtaining an actionable diagnosis between biopsy types after controlling for a known history of lymphoma (P = .271) or after controlling for the number of core specimens (P = .826).

Conclusions: In cases of suspected lymphoma, CNB without FNA was sufficient to obtain an actionable diagnosis.

DOI: http://dx.doi.org/10.1002/jum.14724
March 2019

Evolutionary Dynamics of Abundant Stop Codon Readthrough.

Mol Biol Evol 2016 Dec 7;33(12):3108-3132. Epub 2016 Sep 7.

MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, MA

Translational stop codon readthrough emerged as a major regulatory mechanism affecting hundreds of genes in animal genomes, based on recent comparative genomics and ribosomal profiling evidence, but its evolutionary properties remain unknown. Here, we leverage comparative genomic evidence across 21 Anopheles mosquitoes to systematically annotate readthrough genes in the malaria vector Anopheles gambiae, and to provide the first study of abundant readthrough evolution, by comparison with 20 Drosophila species. Using improved comparative genomics methods for detecting readthrough, we identify evolutionary signatures of conserved, functional readthrough of 353 stop codons in the malaria vector, Anopheles gambiae, and of 51 additional Drosophila melanogaster stop codons, including several cases of double and triple readthrough and of readthrough of two adjacent stop codons. We find that most differences between the readthrough repertoires of the two species arose from readthrough gain or loss in existing genes, rather than birth of new genes or gene death; that readthrough-associated RNA structures are sometimes gained or lost while readthrough persists; that readthrough is more likely to be lost at TAA and TAG stop codons; and that readthrough is under continued purifying evolutionary selection in mosquito, based on population genetic evidence. We also determine readthrough-associated gene properties that predate readthrough, and identify differences in the characteristic properties of readthrough genes between clades. We estimate more than 600 functional readthrough stop codons in mosquito and 900 in fruit fly, provide evidence of readthrough control of peroxisomal targeting, and refine the phylogenetic extent of abundant readthrough as following divergence from centipede.

DOI: http://dx.doi.org/10.1093/molbev/msw189
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5100048
December 2016

Ebola Virus Epidemiology, Transmission, and Evolution during Seven Months in Sierra Leone.

Cell 2015 Jun;161(7):1516-26

Broad Institute of Harvard and MIT, 75 Ames Street, Cambridge, MA 02142, USA; Harvard University, 52 Oxford Street, Cambridge, MA 02138, USA.

The 2013-2015 Ebola virus disease (EVD) epidemic is caused by the Makona variant of Ebola virus (EBOV). Early in the epidemic, genome sequencing provided insights into virus evolution and transmission and offered important information for outbreak response. Here, we analyze sequences from 232 patients sampled over 7 months in Sierra Leone, along with 86 previously released genomes from earlier in the epidemic. We confirm sustained human-to-human transmission within Sierra Leone and find no evidence for import or export of EBOV across national borders after its initial introduction. Using high-depth replicate sequencing, we observe both host-to-host transmission and recurrent emergence of intrahost genetic variants. We trace the increasing impact of purifying selection in suppressing the accumulation of nonsynonymous mutations over time. Finally, we note changes in the mucin-like domain of EBOV glycoprotein that merit further investigation. These findings clarify the movement of EBOV within the region and describe viral evolution during prolonged human-to-human transmission.

DOI: http://dx.doi.org/10.1016/j.cell.2015.06.007
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4503805
June 2015

FRESCo: finding regions of excess synonymous constraint in diverse viruses.

Genome Biol 2015 Feb 17;16:38. Epub 2015 Feb 17.

Background: The increasing availability of sequence data for many viruses provides power to detect regions under unusual evolutionary constraint at a high resolution. One approach leverages the synonymous substitution rate as a signature to pinpoint genic regions encoding overlapping or embedded functional elements. Protein-coding regions in viral genomes often contain overlapping RNA structural elements, reading frames, regulatory elements, microRNAs, and packaging signals. Synonymous substitutions in these regions would be selectively disfavored and thus these regions are characterized by excess synonymous constraint. Codon choice can also modulate transcriptional efficiency, translational accuracy, and protein folding.

Results: We developed a phylogenetic codon model-based framework, FRESCo, designed to find regions of excess synonymous constraint in short, deep alignments, such as individual viral genes across many sequenced isolates. We demonstrated the high specificity of our approach on simulated data and applied our framework to the protein-coding regions of approximately 30 distinct species of viruses with diverse genome architectures.

Conclusions: FRESCo recovers known multifunctional regions in well-characterized viruses such as hepatitis B virus, poliovirus, and West Nile virus, often at a single-codon resolution, and predicts many novel functional elements overlapping viral genes, including in Lassa and Ebola viruses. In a number of viruses, the synonymously constrained regions that we identified also display conserved, stable predicted RNA structures, including putative novel elements in multiple viral species.

DOI: http://dx.doi.org/10.1186/s13059-015-0603-7
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4376164
February 2015

Histogram analysis for characterization of indeterminate adrenal nodules on noncontrast CT.

Abdom Imaging 2015 Aug;40(6):1666-74

Mallinckrodt Institute of Radiology, Washington University in St. Louis, 510 South Kingshighway, Box 8131, St. Louis, MO, 63131, USA.

Objective: To determine the effectiveness of the CT histogram method to characterize indeterminate adrenal nodules above 10 Hounsfield units (HU) on noncontrast CT.

Materials And Methods: Retrospective review of clinical CT data from January 2005 through 2008 identified 194 indeterminate adrenal nodules (>10 HU on noncontrast CT) in 175 patients. 20 nodules in 18 patients were excluded due to large standard deviation (SD > 30) of HU values. Of the remaining 174 nodules, 131 were classified as benign lipid-poor nodules based on size stability for ≥1 year (104), in- and opposed-phase MRI (17), adrenal washout CT (3), or biopsy (7). 43 were classified as malignant by size increase over a short time (30), avid FDG uptake on PET/CT (15), or biopsy (5). Histogram analysis was performed by drawing a circular region of interest on all adrenal nodules. Mean attenuation, total number of pixels, number of negative pixels, and percentage of negative pixels were recorded for each nodule.

Results: At the threshold value of >10% negative pixels, 59/131 benign nodules were correctly characterized, but 1/43 malignant nodules was falsely characterized as benign (sensitivity 45%, specificity 98%, positive predictive value 98%). With a slightly higher threshold value of >15% negative pixels, there were no false benign judgments. 36 nodules had more than 15% negative pixels, all of which were benign (sensitivity 27%, specificity 100%, positive predictive value 100%). In the subgroup of benign nodules measuring 11-20 HU, 80% and 54% were identified with threshold values of >10% and >15% negative pixels, respectively.

Conclusion: The CT histogram method with a threshold value of >10% negative pixels can identify many benign adrenal nodules with attenuation values >10 HU on unenhanced CT with extremely high specificity. A threshold of >15% negative pixels can achieve 100% specificity. This method is highly robust provided very "noisy" CT examinations (SD > 30) are eliminated.
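
The percentages quoted in the Results follow directly from the reported counts, with a "positive" call meaning a nodule characterized as benign. The sketch below reproduces that arithmetic; the function and variable names are hypothetical, and only the counts come from the abstract.

```python
def diagnostic_metrics(true_pos, false_pos, n_positive, n_negative):
    """Sensitivity, specificity, and positive predictive value, where
    'positive' means a nodule called benign by the histogram threshold."""
    true_neg = n_negative - false_pos
    return {
        "sensitivity": true_pos / n_positive,
        "specificity": true_neg / n_negative,
        "ppv": true_pos / (true_pos + false_pos),
    }


# Counts from the abstract: 131 benign and 43 malignant nodules
at_10 = diagnostic_metrics(true_pos=59, false_pos=1, n_positive=131, n_negative=43)
at_15 = diagnostic_metrics(true_pos=36, false_pos=0, n_positive=131, n_negative=43)

for label, metrics in (("> 10%", at_10), ("> 15%", at_15)):
    summary = ", ".join(f"{name} {value:.0%}" for name, value in metrics.items())
    print(f"threshold {label} negative pixels: {summary}")
```

Raising the threshold trades sensitivity (45% to 27%) for the removal of the single false benign call.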

DOI: http://dx.doi.org/10.1007/s00261-014-0307-6
August 2015

The effect of donor kidney volume on recipient outcomes: "dose" matters.

Transplantation 2013 Apr;95(7):e46

DOI: http://dx.doi.org/10.1097/TP.0b013e3182865624
April 2013

Renal measurements on CT angiograms: correlation with graft function at living donor renal transplantation.

Radiology 2012 Oct 12;265(1):151-7. Epub 2012 Jul 12.

Mallinckrodt Institute of Radiology, Washington University/Barnes Jewish Hospital, 510 S Kingshighway Blvd, Campus Box 8131, St Louis, MO 63110, USA.

Purpose: To determine which measurement of donor renal size on computed tomographic (CT) angiograms has the greatest correlation with renal function preoperatively in the donor and postoperatively in the transplant recipient.

Materials And Methods: Informed consent was waived for this retrospective HIPAA-compliant study approved by the institutional review board. Renal length, total volume, and cortical volume were measured on renal donor CT angiograms in 111 patients. Preoperative serum creatinine values for donors and postoperative creatinine values for recipients at hospital discharge and 6, 12, 24, and 36 months after transplant were collected, and estimated glomerular filtration rate (eGFR) was calculated. Correlation coefficients with 95% confidence intervals (CIs) were obtained for renal measures and donor eGFR and for renal measures adjusted to recipient body habitus and posttransplant creatinine level in the recipient. Thresholds were set for adjusted length and volumes, and the odds ratio (OR) for creatinine level less than 1.5 mg/dL at 36 months was calculated.

Results: Renal volumes and length were correlated with donor eGFR (r=0.58 [95% CI: 0.44, 0.69] for cortical volume, 0.56 [95% CI: 0.42, 0.68] for total volume, and 0.43 [95% CI: 0.27, 0.57] for renal length). All three measures, adjusted to recipient body habitus, were correlated with recipient renal function from discharge (r=-0.41 to -0.43) up to 36 months after transplantation (r=-0.33 to -0.41). By using a threshold of 1.5 for cortical volume to recipient weight, 2.25 for total volume to recipient weight, and 0.175 for renal length to recipient weight, the odds of creatinine level greater than 1.5 mg/dL were four times as great for smaller kidney-to-recipient weight ratios, a statistically significant pattern for cortical volume (OR, 4.07; 95% CI: 1.10, 15.09) but not total volume (OR, 4.24; 95% CI: 0.90, 20.01) or renal length (OR, 4.08; 95% CI: 0.48, 34.29).

Conclusion: Renal length and volumes correlated with recipient renal function up to 36 months after transplant. A low ratio of cortical volume to recipient weight was associated with diminished renal function at 36 months after transplant.

DOI: http://dx.doi.org/10.1148/radiol.12112338
October 2012

Systematic identification of long noncoding RNAs expressed during zebrafish embryogenesis.

Genome Res 2012 Mar 22;22(3):577-91. Epub 2011 Nov 22.

Department of Molecular and Cellular Biology, Harvard University, Cambridge, MA 02138, USA.

Long noncoding RNAs (lncRNAs) comprise a diverse class of transcripts that structurally resemble mRNAs but do not encode proteins. Recent genome-wide studies in humans and the mouse have annotated lncRNAs expressed in cell lines and adult tissues, but a systematic analysis of lncRNAs expressed during vertebrate embryogenesis has been elusive. To identify lncRNAs with potential functions in vertebrate embryogenesis, we performed a time-series of RNA-seq experiments at eight stages during early zebrafish development. We reconstructed 56,535 high-confidence transcripts in 28,912 loci, recovering the vast majority of expressed RefSeq transcripts while identifying thousands of novel isoforms and expressed loci. We defined a stringent set of 1133 noncoding multi-exonic transcripts expressed during embryogenesis. These include long intergenic ncRNAs (lincRNAs), intronic overlapping lncRNAs, exonic antisense overlapping lncRNAs, and precursors for small RNAs (sRNAs). Zebrafish lncRNAs share many of the characteristics of their mammalian counterparts: relatively short length, low exon number, low expression, and conservation levels comparable to that of introns. Subsets of lncRNAs carry chromatin signatures characteristic of genes with developmental functions. The temporal expression profile of lncRNAs revealed two novel properties: lncRNAs are expressed in narrower time windows than are protein-coding genes and are specifically enriched in early-stage embryos. In addition, several lncRNAs show tissue-specific expression and distinct subcellular localization patterns. Integrative computational analyses associated individual lncRNAs with specific pathways and functions, ranging from cell cycle regulation to morphogenesis. Our study provides the first systematic identification of lncRNAs in a vertebrate embryo and forms the foundation for future genetic, genomic, and evolutionary studies.

DOI: http://dx.doi.org/10.1101/gr.133009.111
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3290793
March 2012

Locating protein-coding sequences under selection for additional, overlapping functions in 29 mammalian genomes.

Genome Res 2011 Nov 12;21(11):1916-28. Epub 2011 Oct 12.

Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA.

The degeneracy of the genetic code allows protein-coding DNA and RNA sequences to simultaneously encode additional, overlapping functional elements. A sequence in which both protein-coding and additional overlapping functions have evolved under purifying selection should show increased evolutionary conservation compared to typical protein-coding genes--especially at synonymous sites. In this study, we use genome alignments of 29 placental mammals to systematically locate short regions within human ORFs that show conspicuously low estimated rates of synonymous substitution across these species. The 29-species alignment provides statistical power to locate more than 10,000 such regions with resolution down to nine-codon windows, which are found within more than a quarter of all human protein-coding genes and contain ∼2% of their synonymous sites. We collect numerous lines of evidence that the observed synonymous constraint in these regions reflects selection on overlapping functional elements including splicing regulatory elements, dual-coding genes, RNA secondary structures, microRNA target sites, and developmental enhancers. Our results show that overlapping functional elements are common in mammalian genes, despite the vast genomic landscape.

DOI: http://dx.doi.org/10.1101/gr.108753.110
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3205576
November 2011

Evidence of abundant stop codon readthrough in Drosophila and other metazoa.

Genome Res 2011 Dec 12;21(12):2096-113. Epub 2011 Oct 12.

MIT Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA.

While translational stop codon readthrough is often used by viral genomes, it has been observed for only a handful of eukaryotic genes. We previously used comparative genomics evidence to recognize protein-coding regions in 12 species of Drosophila and showed that for 149 genes, the open reading frame following the stop codon has a protein-coding conservation signature, hinting that stop codon readthrough might be common in Drosophila. We return to this observation armed with deep RNA sequence data from the modENCODE project, an improved higher-resolution comparative genomics metric for detecting protein-coding regions, comparative sequence information from additional species, and directed experimental evidence. We report an expanded set of 283 readthrough candidates, including 16 double-readthrough candidates; these were manually curated to rule out alternatives such as A-to-I editing, alternative splicing, dicistronic translation, and selenocysteine incorporation. We report experimental evidence of translation using GFP tagging and mass spectrometry for several readthrough regions. We find that the set of readthrough candidates differs from other genes in length, composition, conservation, stop codon context, and in some cases, conserved stem-loops, providing clues about readthrough regulation and potential mechanisms. Lastly, we expand our studies beyond Drosophila and find evidence of abundant readthrough in several other insect species and one crustacean, and several readthrough candidates in nematode and human, suggesting that functionally important translational stop codon readthrough is significantly more prevalent in Metazoa than previously recognized.
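
The "open reading frame following the stop codon" can be located with a simple in-frame scan. The sketch below only extracts that candidate extension, assuming a DNA-alphabet transcript and a known offset for the annotated stop codon. It does not attempt the conservation scoring the study uses to judge whether the region is protein coding. The function name and example sequence are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def readthrough_region(transcript, stop_index):
    """Return the in-frame sequence between the annotated stop codon at
    stop_index and the next in-frame stop codon, or None if the
    transcript ends before another in-frame stop is reached."""
    if transcript[stop_index:stop_index + 3] not in STOP_CODONS:
        raise ValueError("stop_index does not point at a stop codon")
    start = stop_index + 3
    for i in range(start, len(transcript) - 2, 3):
        if transcript[i:i + 3] in STOP_CODONS:
            return transcript[start:i]
    return None


# Hypothetical transcript: ATG GCC TGA | GGA TCC AAA | TAA GC
seq = "ATGGCCTGAGGATCCAAATAAGC"
extension = readthrough_region(seq, 6)
```

In a double-readthrough candidate, the same scan could be repeated starting from the second stop codon.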

DOI: http://dx.doi.org/10.1101/gr.119974.110
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3227100
December 2011

A high-resolution map of human evolutionary constraint using 29 mammals.

Nature 2011 Oct 12;478(7370):476-82. Epub 2011 Oct 12.

Broad Institute of Harvard and Massachusetts Institute of Technology, 7 Cambridge Center, Cambridge, Massachusetts 02142, USA.

The comparison of related genomes has emerged as a powerful lens for genome interpretation. Here we report the sequencing and comparative analysis of 29 eutherian genomes. We confirm that at least 5.5% of the human genome has undergone purifying selection, and locate constrained elements covering ∼4.2% of the genome. We use evolutionary signatures and comparisons with experimental data sets to suggest candidate functions for ∼60% of constrained bases. These elements reveal a small number of new coding exons, candidate stop codon readthrough events and over 10,000 regions of overlapping synonymous constraint within protein-coding exons. We find 220 candidate RNA structural families, and nearly a million elements overlapping potential promoter, enhancer and insulator regions. We report specific amino acid residues that have undergone positive selection, 280,000 non-coding elements exapted from mobile elements and more than 1,000 primate- and human-accelerated elements. Overlap with disease-associated variants indicates that our findings will be relevant for studies of human biology, health and disease.

DOI: http://dx.doi.org/10.1038/nature10530
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3207357
October 2011

PhyloCSF: a comparative genomics method to distinguish protein coding and non-coding regions.

Bioinformatics 2011 Jul;27(13):i275-82

Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, 32 Vassar Street 32-D510, Cambridge, MA 02139, USA.

Motivation: As high-throughput transcriptome sequencing provides evidence for novel transcripts in many species, there is a renewed need for accurate methods to classify small genomic regions as protein coding or non-coding. We present PhyloCSF, a novel comparative genomics method that analyzes a multispecies nucleotide sequence alignment to determine whether it is likely to represent a conserved protein-coding region, based on a formal statistical comparison of phylogenetic codon models.

Results: We show that PhyloCSF's classification performance in 12-species Drosophila genome alignments exceeds all other methods we compared in a previous study. We anticipate that this method will be widely applicable as the transcriptomes of many additional species, tissues and subcellular compartments are sequenced, particularly in the context of ENCODE and modENCODE, and as interest grows in long non-coding RNAs, often initially recognized by their lack of protein coding potential rather than conserved RNA secondary structures.

Availability And Implementation: The Objective Caml source code and executables for GNU/Linux and Mac OS X are freely available at http://compbio.mit.edu/PhyloCSF

Contact: mlin@mit.edu; manoli@mit.edu.

DOI: http://dx.doi.org/10.1093/bioinformatics/btr209
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3117341
July 2011

Extensive and coordinated transcription of noncoding RNAs within cell-cycle promoters.

Nat Genet 2011 Jun 5;43(7):621-9. Epub 2011 Jun 5.

Program in Epithelial Biology, Stanford University School of Medicine, Stanford, California, USA.

Transcription of long noncoding RNAs (lncRNAs) within gene regulatory elements can modulate gene activity in response to external stimuli, but the scope and functions of such activity are not known. Here we use an ultrahigh-density array that tiles the promoters of 56 cell-cycle genes to interrogate 108 samples representing diverse perturbations. We identify 216 transcribed regions that encode putative lncRNAs, many with RT-PCR-validated periodic expression during the cell cycle, show altered expression in human cancers and are regulated in expression by specific oncogenic stimuli, stem cell differentiation or DNA damage. DNA damage induces five lncRNAs from the CDKN1A promoter, and one such lncRNA, named PANDA, is induced in a p53-dependent manner. PANDA interacts with the transcription factor NF-YA to limit expression of pro-apoptotic genes; PANDA depletion markedly sensitized human fibroblasts to apoptosis by doxorubicin. These findings suggest potentially widespread roles for promoter lncRNAs in cell-growth control.

Source
http://dx.doi.org/10.1038/ng.848 (DOI Listing)
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3652667 (PMC)
June 2011

Comparative functional genomics of the fission yeasts.

Science 2011 May 21;332(6032):930-6. Epub 2011 Apr 21.

Biochemistry and Molecular Pharmacology, University of Massachusetts Medical School, 364 Plantation Street, Worcester, MA 01605, USA.

The fission yeast clade--comprising Schizosaccharomyces pombe, S. octosporus, S. cryophilus, and S. japonicus--occupies the basal branch of Ascomycete fungi and is an important model of eukaryote biology. A comparative annotation of these genomes identified a near extinction of transposons and the associated innovation of transposon-free centromeres. Expression analysis established that meiotic genes are subject to antisense transcription during vegetative growth, which suggests a mechanism for their tight regulation. In addition, trans-acting regulators control new genes within the context of expanded functional modules for meiosis and stress response. Differences in gene content and regulation also explain why, unlike the budding yeast of Saccharomycotina, fission yeasts cannot use ethanol as a primary carbon source. These analyses elucidate the genome structure and gene regulation of fission yeast and provide tools for investigation across the Schizosaccharomyces clade.

Source
http://dx.doi.org/10.1126/science.1203357 (DOI Listing)
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3131103 (PMC)
May 2011

Error and error mitigation in low-coverage genome assemblies.

PLoS One 2011 Feb 14;6(2):e17034. Epub 2011 Feb 14.

Department of Biological Statistics and Computational Biology, Cornell University, Ithaca, New York, United States of America.

The recent release of twenty-two new genome sequences has dramatically increased the data available for mammalian comparative genomics, but twenty of these new sequences are currently limited to ∼2× coverage. Here we examine the extent of sequencing error in these 2× assemblies, and its potential impact in downstream analyses. By comparing 2× assemblies with high-quality sequences from the ENCODE regions, we estimate the rate of sequencing error to be 1-4 errors per kilobase. While this error rate is fairly modest, sequencing error can still have surprising effects. For example, an apparent lineage-specific insertion in a coding region is more likely to reflect sequencing error than a true biological event, and the length distribution of coding indels is strongly distorted by error. We find that most errors are contributed by a small fraction of bases with low quality scores, in particular, by the ends of reads in regions of single-read coverage in the assembly. We explore several approaches for automatic sequencing error mitigation (SEM), making use of the localized nature of sequencing error, the fact that it is well predicted by quality scores, and information about errors that comes from comparisons across species. Our automatic methods for error mitigation cannot replace the need for additional sequencing, but they do allow substantial fractions of errors to be masked or eliminated at the cost of modest amounts of over-correction, and they can reduce the impact of error in downstream phylogenomic analyses. Our error-mitigated alignments are available for download.
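
Illustrative sketch (not the authors' SEM method): the abstract notes that sequencing error is well predicted by quality scores and that errors can be masked at the cost of some over-correction. Assuming Phred-scaled quality scores, where error probability is 10^(-Q/10), a minimal masking step might look like the following. The function names and the threshold of 20 are hypothetical choices for illustration.

```python
def phred_to_error_prob(q: int) -> float:
    """Error probability implied by a Phred-scaled quality score."""
    return 10 ** (-q / 10)


def mask_low_quality(seq: str, quals: list[int], min_q: int = 20) -> str:
    """Replace bases whose quality is below min_q with 'N'."""
    if len(seq) != len(quals):
        raise ValueError("sequence and quality list must have equal length")
    return "".join(base if q >= min_q else "N" for base, q in zip(seq, quals))


def expected_errors_per_kb(quals: list[int]) -> float:
    """Expected number of errors per 1,000 bases given per-base qualities."""
    if not quals:
        return 0.0
    return 1000 * sum(phred_to_error_prob(q) for q in quals) / len(quals)
```

A stricter threshold masks more true errors but also discards more correct bases, which is the over-correction trade-off the abstract refers to.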

Source
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0017034 (PLOS)
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3038916 (PMC)
February 2011

Identification of functional elements and regulatory circuits by Drosophila modENCODE.

Science 2010 Dec 22;330(6012):1787-97. Epub 2010 Dec 22.

Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology (MIT), Cambridge, MA 02139, USA.

To gain insight into how genomic information is translated into cellular and developmental programs, the Drosophila model organism Encyclopedia of DNA Elements (modENCODE) project is comprehensively mapping transcripts, histone modifications, chromosomal proteins, transcription factors, replication proteins and intermediates, and nucleosome properties across a developmental time course and in multiple cell lines. We have generated more than 700 data sets and discovered protein-coding, noncoding, RNA regulatory, replication, and chromatin elements, more than tripling the annotated portion of the Drosophila genome. Correlated activity patterns of these elements reveal a functional regulatory network, which predicts putative new functions for genes, reveals stage- and tissue-specific regulators, and enables gene-expression prediction. Our results provide a foundation for directed experimental and computational studies in Drosophila and related species and also a model for systematic data integration toward comprehensive genomic and functional annotation.

Source
http://dx.doi.org/10.1126/science.1198374 (DOI Listing)
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3192495 (PMC)
December 2010

The consensus coding sequence (CCDS) project: Identifying a common protein-coding gene set for the human and mouse genomes.

Genome Res 2009 Jul 4;19(7):1316-23. Epub 2009 Jun 4.

National Center for Biotechnology Information, National Library of Medicine, Bethesda, Maryland 20894, USA.

Effective use of the human and mouse genomes requires reliable identification of genes and their products. Although multiple public resources provide annotation, different methods are used that can result in similar but not identical representation of genes, transcripts, and proteins. The collaborative consensus coding sequence (CCDS) project tracks identical protein annotations on the reference mouse and human genomes with a stable identifier (CCDS ID), and ensures that they are consistently represented on the NCBI, Ensembl, and UCSC Genome Browsers. Importantly, the project coordinates on manually reviewing inconsistent protein annotations between sites, as well as annotations for which new evidence suggests a revision is needed, to progressively converge on a complete protein-coding set for the human and mouse reference genomes, while maintaining a high standard of reliability and biological accuracy. To date, the project has identified 20,159 human and 17,707 mouse consensus coding regions from 17,052 human and 16,893 mouse genes. Three evaluation methods indicate that the entries in the CCDS set are highly likely to represent real proteins, more so than annotations from contributing groups not included in CCDS. The CCDS database thus centralizes the function of identifying well-supported, identically-annotated, protein-coding regions.

Source
http://dx.doi.org/10.1101/gr.080531.108 (DOI Listing)
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2704439 (PMC)
July 2009

Evolution of pathogenicity and sexual reproduction in eight Candida genomes.

Nature 2009 Jun;459(7247):657-62

UCD School of Biomolecular and Biomedical Science, Conway Institute, University College Dublin, Belfield, Dublin 4, Ireland.

Candida species are the most common cause of opportunistic fungal infection worldwide. Here we report the genome sequences of six Candida species and compare these and related pathogens and non-pathogens. There are significant expansions of cell wall, secreted and transporter gene families in pathogenic species, suggesting adaptations associated with virulence. Large genomic tracts are homozygous in three diploid species, possibly resulting from recent recombination events. Surprisingly, key components of the mating and meiosis pathways are missing from several species. These include major differences at the mating-type loci (MTL); Lodderomyces elongisporus lacks MTL, and components of the a1/α2 cell identity determinant were lost in other species, raising questions about how mating and cell types are controlled. Analysis of the CUG leucine-to-serine genetic-code change reveals that 99% of ancestral CUG codons were erased and new ones arose elsewhere. Lastly, we revise the Candida albicans gene catalogue, identifying many new genes.

Source
http://dx.doi.org/10.1038/nature08064 (DOI Listing)
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2834264 (PMC)
June 2009

Chromatin signature reveals over a thousand highly conserved large non-coding RNAs in mammals.

Nature 2009 Mar 1;458(7235):223-7. Epub 2009 Feb 1.

Broad Institute of MIT and Harvard, 7 Cambridge Center, Cambridge, Massachusetts 02142, USA.

There is growing recognition that mammalian cells produce many thousands of large intergenic transcripts. However, the functional significance of these transcripts has been particularly controversial. Although there are some well-characterized examples, most (>95%) show little evidence of evolutionary conservation and have been suggested to represent transcriptional noise. Here we report a new approach to identifying large non-coding RNAs using chromatin-state maps to discover discrete transcriptional units intervening known protein-coding loci. Our approach identified approximately 1,600 large multi-exonic RNAs across four mouse cell types. In sharp contrast to previous collections, these large intervening non-coding RNAs (lincRNAs) show strong purifying selection in their genomic loci, exonic sequences and promoter regions, with greater than 95% showing clear evolutionary conservation. We also developed a functional genomics approach that assigns putative functions to each lincRNA, demonstrating a diverse range of roles for lincRNAs in processes from embryonic stem cell pluripotency to cell proliferation. We obtained independent functional validation for the predictions for over 100 lincRNAs, using cell-based assays. In particular, we demonstrate that specific lincRNAs are transcriptionally regulated by key transcription factors in these processes such as p53, NF-κB, Sox2, Oct4 (also known as Pou5f1) and Nanog. Together, these results define a unique collection of functional lincRNAs that are highly conserved and implicated in diverse biological processes.

Source
http://dx.doi.org/10.1038/nature07672 (DOI Listing)
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2754849 (PMC)
March 2009

Performance and scalability of discriminative metrics for comparative gene identification in 12 Drosophila genomes.

PLoS Comput Biol 2008 Apr 18;4(4):e1000067. Epub 2008 Apr 18.

Broad Institute of MIT and Harvard University, Cambridge, Massachusetts, United States of America.

Comparative genomics of multiple related species is a powerful methodology for the discovery of functional genomic elements, and its power should increase with the number of species compared. Here, we use 12 Drosophila genomes to study the power of comparative genomics metrics to distinguish between protein-coding and non-coding regions. First, we study the relative power of different comparative metrics and their relationship to single-species metrics. We find that even relatively simple multi-species metrics robustly outperform advanced single-species metrics, especially for shorter exons (≤240 nt), which are common in animal genomes. Moreover, the two capture largely independent features of protein-coding genes, with different sensitivity/specificity trade-offs, such that their combinations lead to even greater discriminatory power. In addition, we study how discovery power scales with the number and phylogenetic distance of the genomes compared. We find that species at a broad range of distances are comparably effective informants for pairwise comparative gene identification, but that these are surpassed by multi-species comparisons at similar evolutionary divergence. In particular, while pairwise discovery power plateaued at larger distances and never outperformed the most advanced single-species metrics, multi-species comparisons continued to benefit even from the most distant species with no apparent saturation. Last, we find that genes in functional categories typically considered fast-evolving can nonetheless be recovered at very high rates using comparative methods. Our results have implications for comparative genomics analyses in any species, including the human.

Source
http://dx.doi.org/10.1371/journal.pcbi.1000067 (DOI Listing)
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2291194 (PMC)
April 2008

Developing role of magnetic resonance imaging in Crohn's disease.

Curr Opin Gastroenterol 2008 Mar;24(2):135-40

Mallinckrodt Institute of Radiology, Washington University in St Louis, Missouri, USA.

Purpose Of Review: There is growing concern among the medical community that diagnostic radiation adds to the already increased risk of developing lymphoma that may be inherent in, or related to the treatment of, inflammatory bowel disease. This article describes recent progress in magnetic resonance enterography techniques, and examines the role of MRI in the evaluation of Crohn's disease.

Recent Findings: Recent advancements in magnetic resonance technology and imaging protocol have made MRI of the small bowel feasible. With improved coils, breath-hold sequences and faster acquisition techniques, MRI capably depicts disease location, extent, and complications. Most of the current literature recognizes MRI as an excellent tool in characterizing transmural and extraluminal changes of Crohn's disease.

Summary: The lack of ionizing radiation is the main driving force for MRI of Crohn's disease. This advantage is magnified by the relatively young age of Crohn's disease patients. While intrinsic susceptibility to air and motion may limit its use in some patients, MRI shows promising potential as an alternative to computed tomography in monitoring disease progression or response to therapy.

Source
http://dx.doi.org/10.1097/MOG.0b013e3282f49b14 (DOI Listing)
March 2008

Distinguishing protein-coding and noncoding genes in the human genome.

Proc Natl Acad Sci U S A 2007 Dec 26;104(49):19428-33. Epub 2007 Nov 26.

Broad Institute of Massachusetts Institute of Technology and Harvard, 7 Cambridge Center, Cambridge, MA 02142, USA.

Although the Human Genome Project was completed 4 years ago, the catalog of human protein-coding genes remains a matter of controversy. Current catalogs list a total of approximately 24,500 putative protein-coding genes. It is broadly suspected that a large fraction of these entries are functionally meaningless ORFs present by chance in RNA transcripts, because they show no evidence of evolutionary conservation with mouse or dog. However, there is currently no scientific justification for excluding ORFs simply because they fail to show evolutionary conservation: the alternative hypothesis is that most of these ORFs are actually valid human genes that reflect gene innovation in the primate lineage or gene loss in the other lineages. Here, we reject this hypothesis by carefully analyzing the nonconserved ORFs, specifically their properties in other primates. We show that the vast majority of these ORFs are random occurrences. The analysis yields, as a by-product, a major revision of the current human catalogs, cutting the number of protein-coding genes to approximately 20,500. Specifically, it suggests that nonconserved ORFs should be added to the human gene catalog only if there is clear evidence of an encoded protein. It also provides a principled methodology for evaluating future proposed additions to the human gene catalog. Finally, the results indicate that there has been relatively little true innovation in mammalian protein-coding genes.

Source
http://dx.doi.org/10.1073/pnas.0709013104 (DOI Listing)
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2148306 (PMC)
December 2007

Discovery of functional elements in 12 Drosophila genomes using evolutionary signatures.

Nature 2007 Nov;450(7167):219-32

The Broad Institute, Massachusetts Institute of Technology and Harvard University, Cambridge, Massachusetts 02140, USA.

Sequencing of multiple related species followed by comparative genomics analysis constitutes a powerful approach for the systematic understanding of any genome. Here, we use the genomes of 12 Drosophila species for the de novo discovery of functional elements in the fly. Each type of functional element shows characteristic patterns of change, or 'evolutionary signatures', dictated by its precise selective constraints. Such signatures enable recognition of new protein-coding genes and exons, spurious and incorrect gene annotations, and numerous unusual gene structures, including abundant stop-codon readthrough. Similarly, we predict non-protein-coding RNA genes and structures, and new microRNA (miRNA) genes. We provide evidence of miRNA processing and functionality from both hairpin arms and both DNA strands. We identify several classes of pre- and post-transcriptional regulatory motifs, and predict individual motif instances with high confidence. We also study how discovery power scales with the divergence and number of species compared, and we provide general guidelines for comparative studies.

Source
http://dx.doi.org/10.1038/nature06340 (DOI Listing)
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2474711 (PMC)
November 2007

Evolution of genes and genomes on the Drosophila phylogeny.

Authors:
Andrew G Clark Michael B Eisen Douglas R Smith Casey M Bergman Brian Oliver Therese A Markow Thomas C Kaufman Manolis Kellis William Gelbart Venky N Iyer Daniel A Pollard Timothy B Sackton Amanda M Larracuente Nadia D Singh Jose P Abad Dawn N Abt Boris Adryan Montserrat Aguade Hiroshi Akashi Wyatt W Anderson Charles F Aquadro David H Ardell Roman Arguello Carlo G Artieri Daniel A Barbash Daniel Barker Paolo Barsanti Phil Batterham Serafim Batzoglou Dave Begun Arjun Bhutkar Enrico Blanco Stephanie A Bosak Robert K Bradley Adrianne D Brand Michael R Brent Angela N Brooks Randall H Brown Roger K Butlin Corrado Caggese Brian R Calvi A Bernardo de Carvalho Anat Caspi Sergio Castrezana Susan E Celniker Jean L Chang Charles Chapple Sourav Chatterji Asif Chinwalla Alberto Civetta Sandra W Clifton Josep M Comeron James C Costello Jerry A Coyne Jennifer Daub Robert G David Arthur L Delcher Kim Delehaunty Chuong B Do Heather Ebling Kevin Edwards Thomas Eickbush Jay D Evans Alan Filipski Sven Findeiss Eva Freyhult Lucinda Fulton Robert Fulton Ana C L Garcia Anastasia Gardiner David A Garfield Barry E Garvin Greg Gibson Don Gilbert Sante Gnerre Jennifer Godfrey Robert Good Valer Gotea Brenton Gravely Anthony J Greenberg Sam Griffiths-Jones Samuel Gross Roderic Guigo Erik A Gustafson Wilfried Haerty Matthew W Hahn Daniel L Halligan Aaron L Halpern Gillian M Halter Mira V Han Andreas Heger LaDeana Hillier Angie S Hinrichs Ian Holmes Roger A Hoskins Melissa J Hubisz Dan Hultmark Melanie A Huntley David B Jaffe Santosh Jagadeeshan William R Jeck Justin Johnson Corbin D Jones William C Jordan Gary H Karpen Eiko Kataoka Peter D Keightley Pouya Kheradpour Ewen F Kirkness Leonardo B Koerich Karsten Kristiansen Dave Kudrna Rob J Kulathinal Sudhir Kumar Roberta Kwok Eric Lander Charles H Langley Richard Lapoint Brian P Lazzaro So-Jeong Lee Lisa Levesque Ruiqiang Li Chiao-Feng Lin Michael F Lin Kerstin Lindblad-Toh Ana Llopart Manyuan Long Lloyd Low Elena Lozovsky Jian Lu Meizhong Luo 
Carlos A Machado Wojciech Makalowski Mar Marzo Muneo Matsuda Luciano Matzkin Bryant McAllister Carolyn S McBride Brendan McKernan Kevin McKernan Maria Mendez-Lago Patrick Minx Michael U Mollenhauer Kristi Montooth Stephen M Mount Xu Mu Eugene Myers Barbara Negre Stuart Newfeld Rasmus Nielsen Mohamed A F Noor Patrick O'Grady Lior Pachter Montserrat Papaceit Matthew J Parisi Michael Parisi Leopold Parts Jakob S Pedersen Graziano Pesole Adam M Phillippy Chris P Ponting Mihai Pop Damiano Porcelli Jeffrey R Powell Sonja Prohaska Kim Pruitt Marta Puig Hadi Quesneville Kristipati Ravi Ram David Rand Matthew D Rasmussen Laura K Reed Robert Reenan Amy Reily Karin A Remington Tania T Rieger Michael G Ritchie Charles Robin Yu-Hui Rogers Claudia Rohde Julio Rozas Marc J Rubenfield Alfredo Ruiz Susan Russo Steven L Salzberg Alejandro Sanchez-Gracia David J Saranga Hajime Sato Stephen W Schaeffer Michael C Schatz Todd Schlenke Russell Schwartz Carmen Segarra Rama S Singh Laura Sirot Marina Sirota Nicholas B Sisneros Chris D Smith Temple F Smith John Spieth Deborah E Stage Alexander Stark Wolfgang Stephan Robert L Strausberg Sebastian Strempel David Sturgill Granger Sutton Granger G Sutton Wei Tao Sarah Teichmann Yoshiko N Tobari Yoshihiko Tomimura Jason M Tsolas Vera L S Valente Eli Venter J Craig Venter Saverio Vicario Filipe G Vieira Albert J Vilella Alfredo Villasante Brian Walenz Jun Wang Marvin Wasserman Thomas Watts Derek Wilson Richard K Wilson Rod A Wing Mariana F Wolfner Alex Wong Gane Ka-Shu Wong Chung-I Wu Gabriel Wu Daisuke Yamamoto Hsiao-Pei Yang Shiaw-Pyng Yang James A Yorke Kiyohito Yoshida Evgeny Zdobnov Peili Zhang Yu Zhang Aleksey V Zimin Jennifer Baldwin Amr Abdouelleil Jamal Abdulkadir Adal Abebe Brikti Abera Justin Abreu St Christophe Acer Lynne Aftuck Allen Alexander Peter An Erica Anderson Scott Anderson Harindra Arachi Marc Azer Pasang Bachantsang Andrew Barry Tashi Bayul Aaron Berlin Daniel Bessette Toby Bloom Jason Blye Leonid Boguslavskiy Claude Bonnet 
Boris Boukhgalter Imane Bourzgui Adam Brown Patrick Cahill Sheridon Channer Yama Cheshatsang Lisa Chuda Mieke Citroen Alville Collymore Patrick Cooke Maura Costello Katie D'Aco Riza Daza Georgius De Haan Stuart DeGray Christina DeMaso Norbu Dhargay Kimberly Dooley Erin Dooley Missole Doricent Passang Dorje Kunsang Dorjee Alan Dupes Richard Elong Jill Falk Abderrahim Farina Susan Faro Diallo Ferguson Sheila Fisher Chelsea D Foley Alicia Franke Dennis Friedrich Loryn Gadbois Gary Gearin Christina R Gearin Georgia Giannoukos Tina Goode Joseph Graham Edward Grandbois Sharleen Grewal Kunsang Gyaltsen Nabil Hafez Birhane Hagos Jennifer Hall Charlotte Henson Andrew Hollinger Tracey Honan Monika D Huard Leanne Hughes Brian Hurhula M Erii Husby Asha Kamat Ben Kanga Seva Kashin Dmitry Khazanovich Peter Kisner Krista Lance Marcia Lara William Lee Niall Lennon Frances Letendre Rosie LeVine Alex Lipovsky Xiaohong Liu Jinlei Liu Shangtao Liu Tashi Lokyitsang Yeshi Lokyitsang Rakela Lubonja Annie Lui Pen MacDonald Vasilia Magnisalis Kebede Maru Charles Matthews William McCusker Susan McDonough Teena Mehta James Meldrim Louis Meneus Oana Mihai Atanas Mihalev Tanya Mihova Rachel Mittelman Valentine Mlenga Anna Montmayeur Leonidas Mulrain Adam Navidi Jerome Naylor Tamrat Negash Thu Nguyen Nga Nguyen Robert Nicol Choe Norbu Nyima Norbu Nathaniel Novod Barry O'Neill Sahal Osman Eva Markiewicz Otero L Oyono Christopher Patti Pema Phunkhang Fritz Pierre Margaret Priest Sujaa Raghuraman Filip Rege Rebecca Reyes Cecil Rise Peter Rogov Keenan Ross Elizabeth Ryan Sampath Settipalli Terry Shea Ngawang Sherpa Lu Shi Diana Shih Todd Sparrow Jessica Spaulding John Stalker Nicole Stange-Thomann Sharon Stavropoulos Catherine Stone Christopher Strader Senait Tesfaye Talene Thomson Yama Thoulutsang Dawa Thoulutsang Kerri Topham Ira Topping Tsamla Tsamla Helen Vassiliev Andy Vo Tsering Wangchuk Tsering Wangdi Michael Weiand Jane Wilkinson Adam Wilson Shailendra Yadav Geneva Young Qing Yu Lisa Zembek 
Danni Zhong Andrew Zimmer Zac Zwirko David B Jaffe Pablo Alvarez Will Brockman Jonathan Butler CheeWhye Chin Sante Gnerre Manfred Grabherr Michael Kleber Evan Mauceli Iain MacCallum

Nature 2007 Nov;450(7167):203-18

Department of Molecular Biology and Genetics, Cornell University, Ithaca, New York 14853, USA.

Comparative analysis of multiple genomes in a phylogenetic framework dramatically improves the precision and sensitivity of evolutionary inference, producing more robust results than single-genome analyses can provide. The genomes of 12 Drosophila species, ten of which are presented here for the first time (sechellia, simulans, yakuba, erecta, ananassae, persimilis, willistoni, mojavensis, virilis and grimshawi), illustrate how rates and patterns of sequence divergence across taxa can illuminate evolutionary processes on a genomic scale. These genome sequences augment the formidable genetic tools that have made Drosophila melanogaster a pre-eminent model for animal genetics, and will further catalyse fundamental research on mechanisms of development, cell biology, genetics, disease, neurobiology, behaviour, physiology and evolution. Despite remarkable similarities among these Drosophila species, we identified many putatively non-neutral changes in protein-coding genes, non-coding RNA genes, and cis-regulatory regions. These may prove to underlie differences in the ecology and behaviour of these diverse species.

Source
http://dx.doi.org/10.1038/nature06341 (DOI Listing)
November 2007

Revisiting the protein-coding gene catalog of Drosophila melanogaster using 12 fly genomes.

Genome Res 2007 Dec 7;17(12):1823-36. Epub 2007 Nov 7.

Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02139, USA.

The availability of sequenced genomes from 12 Drosophila species has enabled the use of comparative genomics for the systematic discovery of functional elements conserved within this genus. We have developed quantitative metrics for the evolutionary signatures specific to protein-coding regions and applied them genome-wide, resulting in 1193 candidate new protein-coding exons in the D. melanogaster genome. We have reviewed these predictions by manual curation and validated a subset by directed cDNA screening and sequencing, revealing both new genes and new alternative splice forms of known genes. We also used these evolutionary signatures to evaluate existing gene annotations, resulting in the validation of 87% of genes lacking descriptive names and identifying 414 poorly conserved genes that are likely to be spurious predictions, noncoding, or species-specific genes. Furthermore, our methods suggest a variety of refinements to hundreds of existing gene models, such as modifications to translation start codons and exon splice boundaries. Finally, we performed directed genome-wide searches for unusual protein-coding structures, discovering 149 possible examples of stop codon readthrough, 125 new candidate ORFs of polycistronic mRNAs, and several candidate translational frameshifts. These results affect >10% of annotated fly genes and demonstrate the power of comparative genomics to enhance our understanding of genome organization, even in a model organism as intensively studied as Drosophila melanogaster.

Source
http://dx.doi.org/10.1101/gr.6679507 (DOI Listing)
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2099591 (PMC)
December 2007