Publications by authors named "Chen-Shan Chin"

47 Publications

Chromosome-scale, haplotype-resolved assembly of human genomes.

Nat Biotechnol 2020 Dec 7. Epub 2020 Dec 7.

Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA, USA.

Haplotype-resolved or phased genome assembly provides a complete picture of genomes and their complex genetic variations. However, current algorithms for phased assembly either do not generate chromosome-scale phasing or require pedigree information, which limits their application. We present a method named diploid assembly (DipAsm) that uses long, accurate reads and long-range conformation data for single individuals to generate a chromosome-scale phased assembly within 1 day. Applied to four public human genomes, PGP1, HG002, NA12878 and HG00733, DipAsm produced haplotype-resolved assemblies with minimum contig length needed to cover 50% of the known genome (NG50) up to 25 Mb and phased ~99.5% of heterozygous sites at 98-99% accuracy, outperforming other approaches in terms of both contiguity and phasing completeness. We demonstrate the importance of chromosome-scale phased assemblies for the discovery of structural variants (SVs), including thousands of new transposon insertions, and of highly polymorphic and medically important regions such as the human leukocyte antigen (HLA) and killer cell immunoglobulin-like receptor (KIR) regions. DipAsm will facilitate high-quality precision medicine and studies of individual haplotype variation and population diversity.

DOI: http://dx.doi.org/10.1038/s41587-020-0711-0
December 2020
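
The NG50 statistic quoted in the DipAsm abstract above is defined there as the minimum contig length needed to cover 50% of the known genome. The following is a minimal sketch of that definition, not code from the paper; the function name `nx50` and the toy contig lengths are illustrative assumptions. When no genome size is supplied, the same loop yields the ordinary N50, which is measured against the total assembly length instead.

```python
def nx50(contig_lengths, genome_size=None):
    """Return NG50 if genome_size is given, otherwise N50.

    Hypothetical helper for illustration only. Contigs are added from
    longest to shortest until they cover half of the reference length;
    the length of the contig that crosses the halfway mark is returned.
    """
    total = genome_size if genome_size is not None else sum(contig_lengths)
    target = total / 2
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= target:
            return length
    return 0  # the contigs cover less than half of the genome


# Toy contig lengths in Mb (made-up values, not data from the paper).
contigs = [40, 30, 20, 10]
n50 = nx50(contigs)                     # measured against assembly size (100)
ng50 = nx50(contigs, genome_size=150)   # measured against a 150 Mb genome
```

Because NG50 is measured against a fixed genome size rather than the assembly's own total, it can be compared across assemblies of the same genome, whereas N50 changes whenever the assembly total changes.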

Amplification-free long-read sequencing reveals unforeseen CRISPR-Cas9 off-target activity.

Genome Biol 2020 Dec 1;21(1):290. Epub 2020 Dec 1.

Science for Life Laboratory, Department of Immunology, Genetics and Pathology, Uppsala University, Uppsala, Sweden.

Background: One ongoing concern about CRISPR-Cas9 genome editing is that unspecific guide RNA (gRNA) binding may induce off-target mutations. However, accurate prediction of CRISPR-Cas9 off-target activity is challenging. Here, we present SMRT-OTS and Nano-OTS, two novel, amplification-free, long-read sequencing protocols for detection of gRNA-driven digestion of genomic DNA by Cas9 in vitro.

Results: The methods are assessed using the human cell line HEK293, re-sequenced at 18x coverage using highly accurate HiFi SMRT reads. SMRT-OTS and Nano-OTS are first applied to three different gRNAs targeting HEK293 genomic DNA, resulting in a set of 55 high-confidence gRNA cleavage sites identified by both methods. Twenty-five of these sites are not reported by off-target prediction software, either because they contain four or more single nucleotide mismatches or insertion/deletion mismatches, as compared with the human reference. Additional experiments reveal that 85% of Cas9 cleavage sites are also found by other in vitro-based methods and that on- and off-target sites are detectable in gene bodies where short-reads fail to uniquely align. Even though SMRT-OTS and Nano-OTS identify several sites with previously validated off-target editing activity in cells, our own CRISPR-Cas9 editing experiments in human fibroblasts do not give rise to detectable off-target mutations at the in vitro-predicted sites. However, indel and structural variation events are enriched at the on-target sites.

Conclusions: Amplification-free long-read sequencing reveals Cas9 cleavage sites in vitro that would have been difficult to predict using computational tools, including in dark genomic regions inaccessible by short-read sequencing.

DOI: http://dx.doi.org/10.1186/s13059-020-02206-w
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7706270
December 2020

A diploid assembly-based benchmark for variants in the major histocompatibility complex.

Nat Commun 2020 09 22;11(1):4794. Epub 2020 Sep 22.

Material Measurement Laboratory, National Institute of Standards and Technology, 100 Bureau Dr, MS8312, Gaithersburg, MD, 20899, USA.

Most human genomes are characterized by aligning individual reads to the reference genome, but accurate long reads and linked reads now enable us to construct accurate, phased de novo assemblies. We focus on a medically important, highly variable, 5 million base-pair (bp) region where diploid assembly is particularly useful: the Major Histocompatibility Complex (MHC). Here, we develop a human genome benchmark derived from a diploid assembly for the openly consented Genome in a Bottle sample HG002. We assemble a single contig for each haplotype, align them to the reference, call phased small and structural variants, and define a small variant benchmark for the MHC, covering 94% of the MHC and 22,368 variants smaller than 50 bp, 49% more variants than a mapping-based benchmark. This benchmark reliably identifies errors in mapping-based callsets, and enables performance assessment in regions with much denser, more complex variation than regions covered by previous benchmarks.

DOI: http://dx.doi.org/10.1038/s41467-020-18564-9
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7508831
September 2020

Trajectories of glomerular filtration rate and progression to end stage kidney disease after kidney transplantation.

Kidney Int 2021 01 8;99(1):186-197. Epub 2020 Aug 8.

Université de Paris, INSERM, PARCC, Paris Translational Research Centre for Organ Transplantation, Paris, France; Kidney Transplant Department, Necker Hospital, Assistance Publique - Hôpitaux de Paris, Paris, France.

Although the gold standard of monitoring kidney transplant function relies on glomerular filtration rate (GFR), little is known about GFR trajectories after transplantation, their determinants, and their association with outcomes. To evaluate these parameters we examined kidney transplant recipients receiving care at 15 academic centers. Patients underwent prospective monitoring of estimated GFR (eGFR) measurements, with assessment of clinical, functional, histological and immunological parameters. Additional validation took place in seven randomized controlled trials that included a total of 14,132 patients with 403,497 eGFR measurements. After a median follow-up of 6.5 years, 1,688 patients developed end-stage kidney disease. Using unsupervised latent class mixed models, we identified eight distinct eGFR trajectories. Multinomial regression models identified seven significant determinants of eGFR trajectories including donor age, eGFR, proteinuria, and several significant histological features: graft scarring, graft interstitial inflammation and tubulitis, microcirculation inflammation, and circulating anti-HLA donor specific antibodies. The eGFR trajectories were associated with progression to end-stage kidney disease. These trajectories, their determinants and respective associations with end-stage kidney disease were similar across cohorts, as well as in diverse clinical scenarios, therapeutic eras and in the seven randomized controlled trials. Thus, our results provide the basis for a trajectory-based assessment of kidney transplant patients for risk stratification and monitoring.

DOI: http://dx.doi.org/10.1016/j.kint.2020.07.025
January 2021

Ribbon: Intuitive visualization for complex genomic variation.

Bioinformatics 2020 Aug 7. Epub 2020 Aug 7.

Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA.

Summary: Ribbon is an alignment visualization tool that shows how alignments are positioned within both the reference and read contexts, giving an intuitive view that enables a better understanding of structural variants and the read evidence supporting them. Ribbon was born out of a need to curate complex structural variant calls and determine whether each was well supported by long-read evidence, and it uses the same intuitive visualization method to shed light on contig alignments from genome-to-genome comparisons.

Availability And Implementation: Ribbon is freely available online at http://genomeribbon.com/ and is open-source at https://github.com/marianattestad/ribbon.

Supplementary Information: Supplementary data are available at Bioinformatics online.

DOI: http://dx.doi.org/10.1093/bioinformatics/btaa680
August 2020

Effect of sequence depth and length in long-read assembly of the maize inbred NC358.

Nat Commun 2020 05 8;11(1):2288. Epub 2020 May 8.

Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, 11724, USA.

Improvements in long-read data and scaffolding technologies have enabled rapid generation of reference-quality assemblies for complex genomes. Still, an assessment of critical sequence depth and read length is important for allocating limited resources. To this end, we have generated eight assemblies for the complex genome of the maize inbred line NC358 using PacBio datasets ranging from 20× to 75× genomic depth and with N50 subread lengths of 11-21 kb. Assemblies with ≤30× depth and N50 subread length of 11 kb are highly fragmented, with even low-copy genic regions showing degradation at 20× depth. Distinct sequence-quality thresholds are observed for complete assembly of genes, transposable elements, and highly repetitive genomic features such as telomeres, heterochromatic knobs, and centromeres. In addition, we show high-quality optical maps can dramatically improve contiguity in even our most fragmented base assembly. This study provides a useful resource allocation reference to the community as long-read technologies continue to mature.

DOI: http://dx.doi.org/10.1038/s41467-020-16037-7
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7211024
May 2020

Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome.

Nat Biotechnol 2019 10 12;37(10):1155-1162. Epub 2019 Aug 12.

Pacific Biosciences, Menlo Park, CA, USA.

The DNA sequencing technologies in use today produce either highly accurate short reads or less-accurate long reads. We report the optimization of circular consensus sequencing (CCS) to improve the accuracy of single-molecule real-time (SMRT) sequencing (PacBio) and generate highly accurate (99.8%) long high-fidelity (HiFi) reads with an average length of 13.5 kilobases (kb). We applied our approach to sequence the well-characterized human HG002/NA24385 genome and obtained precision and recall rates of at least 99.91% for single-nucleotide variants (SNVs), 95.98% for insertions and deletions <50 bp (indels) and 95.99% for structural variants. Our CCS method matches or exceeds the ability of short-read sequencing to detect small variants and structural variants. We estimate that 2,434 discordances are correctable mistakes in the 'genome in a bottle' (GIAB) benchmark set. Nearly all (99.64%) variants can be phased into haplotypes, further improving variant detection. De novo genome assembly using CCS reads alone produced a contiguous and accurate genome with a contig N50 of >15 megabases (Mb) and concordance of 99.997%, substantially outperforming assembly with less-accurate long reads.

DOI: http://dx.doi.org/10.1038/s41587-019-0217-9
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6776680
October 2019
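
The accuracy and concordance percentages in the CCS abstract above are often restated on the Phred scale, where Q = -10 log10(error rate). The snippet below is a small sketch of that standard conversion; the function name is an illustrative assumption and the abstract itself does not report Q values.

```python
import math


def accuracy_to_phred(accuracy):
    """Convert a fractional accuracy (e.g. 0.998) to a Phred-scaled Q value.

    Hypothetical helper for illustration. Q = -10 * log10(1 - accuracy),
    so an accuracy of exactly 1.0 has no finite Q and is rejected.
    """
    if not 0.0 <= accuracy < 1.0:
        raise ValueError("accuracy must be in the range [0, 1)")
    return -10.0 * math.log10(1.0 - accuracy)


# Read accuracy and assembly concordance as quoted in the abstract.
read_q = accuracy_to_phred(0.998)        # roughly Q27
assembly_q = accuracy_to_phred(0.99997)  # roughly Q45
```

On this scale each additional 10 points corresponds to a tenfold reduction in error rate, which makes the gap between read-level and assembly-level accuracy easier to see than the raw percentages do.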

Multi-platform discovery of haplotype-resolved structural variation in human genomes.

Nat Commun 2019 04 16;10(1):1784. Epub 2019 Apr 16.

The Jackson Laboratory for Genomic Medicine, Farmington, CT, 06032, USA.

The incomplete identification of structural variants (SVs) from whole-genome sequencing data limits studies of human genetic diversity and disease association. Here, we apply a suite of long-read, short-read, strand-specific sequencing technologies, optical mapping, and variant discovery algorithms to comprehensively analyze three trios to define the full spectrum of human genetic variation in a haplotype-resolved manner. We identify 818,054 indel variants (<50 bp) and 27,622 SVs (≥50 bp) per genome. We also discover 156 inversions per genome and 58 of the inversions intersect with the critical regions of recurrent microdeletion and microduplication syndromes. Taken together, our SV callsets represent a three to sevenfold increase in SV detection compared to most standard high-throughput sequencing studies, including those from the 1000 Genomes Project. The methods and the dataset presented serve as a gold standard for the scientific community allowing us to make recommendations for maximizing structural variation sensitivity for future genome sequencing studies.

DOI: http://dx.doi.org/10.1038/s41467-018-08148-z
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6467913
April 2019

Complex rearrangements and oncogene amplifications revealed by long-read DNA and RNA sequencing of a breast cancer cell line.

Genome Res 2018 08 28;28(8):1126-1135. Epub 2018 Jun 28.

Cold Spring Harbor Laboratory, Cold Spring Harbor, New York 11724, USA.

The SK-BR-3 cell line is one of the most important models for HER2+ breast cancers, which affect one in five breast cancer patients. SK-BR-3 is known to be highly rearranged, although much of the variation is in complex and repetitive regions that may be underreported. Addressing this, we sequenced SK-BR-3 using long-read single molecule sequencing from Pacific Biosciences and developed one of the most detailed maps of structural variations (SVs) in a cancer genome available, with nearly 20,000 variants present, most of which were missed by short-read sequencing. Surrounding the important ERBB2 oncogene (also known as HER2), we discover a complex sequence of nested duplications and translocations, suggesting a punctuated progression. Full-length transcriptome sequencing further revealed several novel gene fusions within the nested genomic variants. Combining long-read genome and transcriptome sequencing enables an in-depth analysis of how SVs disrupt the genome and sheds new light on the complex mechanisms involved in cancer genome evolution.

DOI: http://dx.doi.org/10.1101/gr.231100.117
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6071638
August 2018

Comprehensive analysis of single molecule sequencing-derived complete genome and whole transcriptome of Hyposidra talaca nuclear polyhedrosis virus.

Sci Rep 2018 06 12;8(1):8924. Epub 2018 Jun 12.

Department of Molecular Biology, Genentech Inc., 1 DNA WAY, South San Francisco, CA, 94080, USA.

We sequenced the Hyposidra talaca NPV (HytaNPV) double-stranded circular DNA genome using PacBio single molecule sequencing technology. We found that the HytaNPV genome is 139,089 bp long with a GC content of 39.6%. It encodes 141 open reading frames (ORFs) including the 37 baculovirus core genes, 25 genes conserved among lepidopteran baculoviruses, 72 genes known in baculoviruses, and 7 genes unique to the HytaNPV genome. It is a group II alphabaculovirus that codes for the F protein and lacks the gp64 gene found in group I alphabaculoviruses. Using RNA-seq, we confirmed the expression of the ORFs identified in the HytaNPV genome. Phylogenetic analysis showed HytaNPV to be closest to BusuNPV, SujuNPV and EcobNPV, which infect other tea pests, Buzura suppressaria, Sucra jujuba, and Ectropis obliqua, respectively. We identified repeat elements and a conserved non-coding baculovirus element in the genome. Analysis of the putative promoter sequences identified motifs consistent with the temporal expression of the genes observed in the RNA-seq data.

DOI: http://dx.doi.org/10.1038/s41598-018-27084-y
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5997678
June 2018

De novo PacBio long-read and phased avian genome assemblies correct and add to reference genes generated with intermediate and short reads.

Gigascience 2017 10;6(10):1-16

Laboratory of Neurogenetics of Language, Box 54, The Rockefeller University, New York, NY 10065, USA.

Reference-quality genomes are expected to provide a resource for studying gene structure, function, and evolution. However, often genes of interest are not completely or accurately assembled, leading to unknown errors in analyses or additional cloning efforts for the correct sequences. A promising solution is long-read sequencing. Here we tested PacBio-based long-read sequencing and diploid assembly for potential improvements to the Sanger-based intermediate-read zebra finch reference and Illumina-based short-read Anna's hummingbird reference, 2 vocal learning avian species widely studied in neuroscience and genomics. With DNA of the same individuals used to generate the reference genomes, we generated diploid assemblies with the FALCON-Unzip assembler, resulting in contigs with no gaps in the megabase range, representing 150-fold and 200-fold improvements over the current zebra finch and hummingbird references, respectively. These long-read and phased assemblies corrected and resolved what we discovered to be numerous misassemblies in the references, including missing sequences in gaps, erroneous sequences flanking gaps, base call errors in difficult-to-sequence regions, complex repeat structure errors, and allelic differences between the 2 haplotypes. These improvements were validated by single long-genome and transcriptome reads and resulted for the first time in completely resolved protein-coding genes widely studied in neuroscience and specialized in vocal learning species. These findings demonstrate the impact of long reads, sequencing of previously difficult-to-sequence regions, and phasing of haplotypes on generating the high-quality assemblies necessary for understanding gene structure, function, and evolution.

DOI: http://dx.doi.org/10.1093/gigascience/gix085
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5632298
October 2017

Scaffolding of long read assemblies using long range contact information.

BMC Genomics 2017 07 12;18(1):527. Epub 2017 Jul 12.

Pacific Biosciences, 94205 Menlo Park, California, USA.

Background: Long-read technologies have revolutionized de novo genome assembly by generating contigs orders of magnitude longer than those of short-read assemblies. Although assembly contiguity has increased, it usually does not reconstruct a full chromosome or an arm of the chromosome, resulting in an unfinished chromosome-level assembly. To increase the contiguity of the assembly to the chromosome level, different strategies are used which exploit long-range contact information between chromosomes in the genome.

Methods: We develop a scalable and computationally efficient scaffolding method that can boost the assembly contiguity to a large extent using genome-wide chromatin interaction data such as Hi-C.

Results: We demonstrate an algorithm that uses Hi-C data for longer-range scaffolding of de novo long read genome assemblies. We tested our methods on the human and goat genome assemblies. We compare our scaffolds with the scaffolds generated by LACHESIS based on various metrics.

Conclusion: Our new algorithm SALSA produces more accurate scaffolds compared to the existing state of the art method LACHESIS.

DOI: http://dx.doi.org/10.1186/s12864-017-3879-z
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5508778
July 2017

Improved maize reference genome with single-molecule technologies.

Nature 2017 06 12;546(7659):524-527. Epub 2017 Jun 12.

Cold Spring Harbor Laboratory, Cold Spring Harbor, New York 11724, USA.

Complete and accurate reference genomes and annotations provide fundamental tools for characterization of genetic and functional variation. These resources facilitate the determination of biological processes and support translation of research findings into improved and sustainable agricultural technologies. Many reference genomes for crop plants have been generated over the past decade, but these genomes are often fragmented and missing complex repeat regions. Here we report the assembly and annotation of a reference genome of maize, a genetic and agricultural model species, using single-molecule real-time sequencing and high-resolution optical mapping. Relative to the previous reference genome, our assembly features a 52-fold increase in contig length and notable improvements in the assembly of intergenic spaces and centromeres. Characterization of the repetitive portion of the genome revealed more than 130,000 intact transposable elements, allowing us to identify transposable element lineage expansions that are unique to maize. Gene annotations were updated using 111,000 full-length transcripts obtained by single-molecule real-time sequencing. In addition, comparative optical mapping of two other inbred maize lines revealed a prevalence of deletions in regions of low gene density and maize lineage-specific genes.

DOI: http://dx.doi.org/10.1038/nature22971
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7052699
June 2017

Heterogeneous resistance to quizartinib in acute myeloid leukemia revealed by single-cell analysis.

Blood 2017 07 10;130(1):48-58. Epub 2017 May 10.

Division of Hematology/Oncology and.

Genomic studies have revealed significant branching heterogeneity in cancer. Studies of resistance to tyrosine kinase inhibitor therapy have not fully reflected this heterogeneity because resistance in individual patients has been ascribed to largely mutually exclusive on-target or off-target mechanisms in which tumors either retain dependency on the target oncogene or subvert it through a parallel pathway. Using targeted sequencing from single cells and colonies from patient samples, we demonstrate tremendous clonal diversity in the majority of acute myeloid leukemia (AML) patients with activating internal tandem duplication mutations at the time of acquired resistance to the FLT3 inhibitor quizartinib. These findings establish that clinical resistance to quizartinib is highly complex and reflects the underlying clonal heterogeneity of AML.

DOI: http://dx.doi.org/10.1182/blood-2016-04-711820
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5501146
July 2017

Evaluation of GRCh38 and de novo haploid genome assemblies demonstrates the enduring quality of the reference assembly.

Genome Res 2017 05 10;27(5):849-864. Epub 2017 Apr 10.

National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, USA.

The human reference genome assembly plays a central role in nearly all aspects of today's basic and clinical research. GRCh38 is the first coordinate-changing assembly update since 2009; it reflects the resolution of roughly 1000 issues and encompasses modifications ranging from thousands of single base changes to megabase-scale path reorganizations, gap closures, and localization of previously orphaned sequences. We developed a new approach to sequence generation for targeted base updates and used data from new genome mapping technologies and single haplotype resources to identify and resolve larger assembly issues. For the first time, the reference assembly contains sequence-based representations for the centromeres. We also expanded the number of alternate loci to create a reference that provides a more robust representation of human population variation. We demonstrate that the updates render the reference an improved annotation substrate, alter read alignments in unchanged regions, and impact variant interpretation at clinically relevant loci. We additionally evaluated a collection of new de novo long-read haploid assemblies and conclude that although the new assemblies compare favorably to the reference with respect to continuity, error rate, and gene completeness, the reference still provides the best representation for complex genomic regions and coding sequences. We assert that the collected updates in GRCh38 make the newer assembly a more robust substrate for comprehensive analyses that will promote our understanding of human biology and advance our efforts to improve health.

DOI: http://dx.doi.org/10.1101/gr.213611.116
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5411779
May 2017

Phased diploid genome assembly with single-molecule real-time sequencing.

Nat Methods 2016 Dec 17;13(12):1050-1054. Epub 2016 Oct 17.

Department of Computer Science, Johns Hopkins University, Baltimore, Maryland, USA.

While genome assembly projects have been successful in many haploid and inbred species, the assembly of noninbred or rearranged heterozygous genomes remains a major challenge. To address this challenge, we introduce the open-source FALCON and FALCON-Unzip algorithms (https://github.com/PacificBiosciences/FALCON/) to assemble long-read sequencing data into highly accurate, contiguous, and correctly phased diploid genomes. We generate new reference sequences for heterozygous samples including an F1 hybrid of Arabidopsis thaliana, the widely cultivated Vitis vinifera cv. Cabernet Sauvignon, and the coral fungus Clavicorona pyxidata, samples that have challenged short-read assembly approaches. The FALCON-based assemblies are substantially more contiguous and complete than alternate short- or long-read approaches. The phased diploid assembly enabled the study of haplotype structure and heterozygosities between homologous chromosomes, including the identification of widespread heterozygous structural variation within coding sequences.

DOI: http://dx.doi.org/10.1038/nmeth.4035
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5503144
December 2016

Alpha-CENTAURI: assessing novel centromeric repeat sequence variation with long read sequencing.

Bioinformatics 2016 07 24;32(13):1921-1924. Epub 2016 Feb 24.

Center for Biomolecular Science and Engineering, University of California, Santa Cruz, CA 95064, USA.

Motivation: Long arrays of near-identical tandem repeats are a common feature of centromeric and subtelomeric regions in complex genomes. These sequences present a source of repeat structure diversity that is commonly ignored by standard genomic tools. Unlike reads shorter than the underlying repeat structure, which rely on indirect inference methods such as assembly, long reads allow direct inference of satellite higher order repeat structure. To automate characterization of local centromeric tandem repeat sequence variation we have designed Alpha-CENTAURI (ALPHA satellite CENTromeric AUtomated Repeat Identification), which takes advantage of Pacific Biosciences long reads from whole-genome sequencing datasets. By operating on reads prior to assembly, our approach provides a more comprehensive set of repeat-structure variants and is not impacted by rearrangements or sequence underrepresentation due to misassembly.

Results: We demonstrate the utility of Alpha-CENTAURI in characterizing repeat structure for alpha satellite containing reads in the hydatidiform mole (CHM1, haploid-like) genome. The pipeline is designed to report local repeat organization summaries for each read, thereby monitoring rearrangements in repeat units, shifts in repeat orientation and sites of array transition into non-satellite DNA, typically defined by transposable element insertion. We validate the method by showing consistency with existing centromere high order repeat references. Alpha-CENTAURI can, in principle, run on any sequence data, offering a method to generate a sequence repeat resolution that could be readily performed using consensus sequences available for other satellite families in genomes without high-quality reference assemblies.

Availability And Implementation: Documentation and source code for Alpha-CENTAURI are freely available at http://github.com/volkansevim/alpha-CENTAURI

Contact: ali.bashir@mssm.edu

Supplementary Information: Supplementary data are available at Bioinformatics online.

DOI: http://dx.doi.org/10.1093/bioinformatics/btw101
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4920115
July 2016

Chromosomal-Level Assembly of the Asian Seabass Genome Using Long Sequence Reads and Multi-layered Scaffolding.

PLoS Genet 2016 Apr 15;12(4):e1005954. Epub 2016 Apr 15.

Reproductive Genomics Group, Temasek Life Sciences Laboratory, Singapore.

We report here the ~670 Mb genome assembly of the Asian seabass (Lates calcarifer), a tropical marine teleost. We used long-read sequencing augmented by transcriptomics, optical and genetic mapping along with shared synteny from closely related fish species to derive a chromosome-level assembly with a contig N50 size over 1 Mb and scaffold N50 size over 25 Mb that span ~90% of the genome. The population structure of L. calcarifer species complex was analyzed by re-sequencing 61 individuals representing various regions across the species' native range. SNP analyses identified high levels of genetic diversity and confirmed earlier indications of a population stratification comprising three clades with signs of admixture apparent in the South-East Asian population. The quality of the Asian seabass genome assembly far exceeds that of any other fish species, and will serve as a new standard for fish genomics.

DOI: http://dx.doi.org/10.1371/journal.pgen.1005954
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4833346
April 2016

Long-read sequence assembly of the gorilla genome.

Science 2016 Apr;352(6281):aae0344

Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA 98195, USA. Howard Hughes Medical Institute, University of Washington, Seattle, WA 98195, USA.

Accurate sequence and assembly of genomes is a critical first step for studies of genetic variation. We generated a high-quality assembly of the gorilla genome using single-molecule, real-time sequence technology and a string graph de novo assembly algorithm. The new assembly improves contiguity by two to three orders of magnitude with respect to previously released assemblies, recovering 87% of missing reference exons and incomplete gene models. Although regions of large, high-identity segmental duplications remain largely unresolved, this comprehensive assembly provides new biological insight into genetic diversity, structural variation, gene loss, and representation of repeat structures within the gorilla genome. The approach provides a path forward for the routine assembly of mammalian genomes at a level approaching that of the current quality of the human genome.

DOI: http://dx.doi.org/10.1126/science.aae0344
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4920363
April 2016

Impact of aspirin on clinical outcomes for African American men with prostate cancer undergoing radiation.

Tumori 2016 Jan-Feb;102(1):65-70. Epub 2015 Sep 30.

Department of Veterans Affairs, New York Harbor Healthcare System, Brooklyn, New York - USA.

Aims And Background: Preclinical and clinical studies have suggested that aspirin (ASA) may exhibit antineoplastic activity. Particularly in prostate cancer, several reports have suggested that ASA plays a role in improved outcomes. Therefore, we studied the role of ASA in a uniquely African American population, which is known to harbor more aggressive and biologically different disease compared to the general population.

Methods: We identified 289 African American men with prostate cancer who were treated with definitive radiation therapy to a dose of ≥7560 cGy. The median follow-up was 76 months. Kaplan-Meier analysis was used to analyze biochemical failure-free survival (bFFS), distant progression-free survival (DMPFS), and prostate cancer-specific survival (PCSS). Multivariate Cox regression was used to analyze the impact of covariates on all endpoints.

Results: There were 147 men who were ASA+ and 142 who were ASA-. The 7-year bFFS was 80.9% for ASA+ men and 70.3% for ASA- men (p = 0.03). On multivariate analysis, ASA use was associated with a significant reduction in biochemical recurrences (hazard ratio [HR] 0.56, 95% confidence interval [CI] 0.34-0.93, p = 0.03). The 7-year DMPFS was 98.4% for ASA+ and 91.8% for ASA- men (p = 0.04). On multivariate analysis, ASA use was associated with a decreased risk of distant metastases (HR 0.23, 95% CI 0.06-0.91, p = 0.04). The 7-year PCSS was 99.3% for ASA+ and 96.9% for ASA- men (p = 0.07).

Conclusions: In this study, ASA use was associated with improved biochemical outcomes and reduced distant metastases. This indicates that ASA appears to play an important antineoplastic role in African American men.

Source
DOI: http://dx.doi.org/10.5301/tj.5000424
July 2016

Assembly and diploid architecture of an individual human genome via single-molecule technologies.

Nat Methods 2015 Aug 29;12(8):780-6. Epub 2015 Jun 29.

Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, New York, USA.

We present the first comprehensive analysis of a diploid human genome that combines single-molecule sequencing with single-molecule genome maps. Our hybrid assembly markedly improves upon the contiguity observed from traditional shotgun sequencing approaches, with scaffold N50 values approaching 30 Mb, and we identified complex structural variants (SVs) missed by other high-throughput approaches. Furthermore, by combining Illumina short-read data with long reads, we phased both single-nucleotide variants and SVs, generating haplotypes with over 99% consistency with previous trio-based studies. Our work shows that it is now possible to integrate single-molecule and high-throughput sequence data to generate de novo assembled genomes that approach reference quality.
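The scaffold N50 quoted above is a standard contiguity metric: the length L such that scaffolds of length L or longer together contain at least half of the total assembled bases. A minimal sketch of the calculation follows, assuming the scaffold lengths are already available as a list of integers; the function name `n50` is illustrative and does not come from the paper.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold (or contig) lengths.

    N50 is the largest length L such that sequences of length >= L
    account for at least half of the summed length of all sequences.
    """
    if not lengths:
        raise ValueError("need at least one sequence length")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down, accumulating bases until
    # at least half of the assembly has been covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


if __name__ == "__main__":
    # Toy lengths, not data from the study.
    print(n50([10, 9, 8, 7, 6, 5, 4, 3, 2]))
```

Because the metric depends only on the sorted lengths, a few very long scaffolds can raise N50 sharply, which is why it is usually reported alongside total assembly size.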

Source
DOI: http://dx.doi.org/10.1038/nmeth.3454
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4646949
August 2015

HLA Typing for the Next Generation.

PLoS One 2015 27;10(5):e0127153. Epub 2015 May 27.

Anthony Nolan Research Institute, Royal Free Hospital, London, United Kingdom; UCL Cancer Institute, Royal Free Campus, London, United Kingdom.

Allele-level resolution data at primary HLA typing is the ideal for most histocompatibility testing laboratories. Many high-throughput molecular HLA typing approaches are unable to determine the phase of observed DNA sequence polymorphisms, leading to ambiguous results. The use of higher resolution methods is often restricted due to cost and time limitations. Here we report on the feasibility of using Pacific Biosciences' Single Molecule Real-Time (SMRT) DNA sequencing technology for high-resolution and high-throughput HLA typing. Seven DNA samples were typed for HLA-A, -B and -C. The results showed that SMRT DNA sequencing technology was able to generate sequences that spanned entire HLA Class I genes that allowed for accurate allele calling. Eight novel genomic HLA class I sequences were identified: four were novel alleles, three were confirmed as genomic sequence extensions and one corrected an existing genomic reference sequence. This method has the potential to revolutionize the field of HLA typing. The clinical impact of achieving this level of resolution HLA typing data is likely to be considerable, particularly in applications such as organ and blood stem cell transplantation where matching donors and recipients for their HLA is of utmost importance.

Source
PLOS: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0127153
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4446346
April 2016

Assembling large genomes with single-molecule sequencing and locality-sensitive hashing.

Nat Biotechnol 2015 Jun 25;33(6):623-30. Epub 2015 May 25.

National Biodefense Analysis and Countermeasures Center, Frederick, Maryland, USA.

Long-read, single-molecule real-time (SMRT) sequencing is routinely used to finish microbial genomes, but available assembly methods have not scaled well to larger genomes. We introduce the MinHash Alignment Process (MHAP) for overlapping noisy, long reads using probabilistic, locality-sensitive hashing. Integrating MHAP with the Celera Assembler enabled reference-grade de novo assemblies of Saccharomyces cerevisiae, Arabidopsis thaliana, Drosophila melanogaster and a human hydatidiform mole cell line (CHM1) from SMRT sequencing. The resulting assemblies are highly continuous, include fully resolved chromosome arms and close persistent gaps in these reference genomes. Our assembly of D. melanogaster revealed previously unknown heterochromatic and telomeric transition sequences, and we assembled low-complexity sequences from CHM1 that fill gaps in the human GRCh38 reference. Using MHAP and the Celera Assembler, single-molecule sequencing can produce de novo near-complete eukaryotic assemblies that are 99.99% accurate when compared with available reference genomes.
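The idea behind MinHash-based overlap detection can be illustrated with a small sketch: each read is reduced to a fixed-size signature of minimum k-mer hash values, and the fraction of matching signature positions estimates the Jaccard similarity of the two reads' k-mer sets. The code below is a toy illustration only, not the MHAP implementation; the function names, the choice of k = 8, the 64 hash functions, and the use of salted BLAKE2b as a stand-in hash family are all assumptions made for the example.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def minhash_sketch(seq, k=8, num_hashes=64):
    """Build a MinHash signature: one minimum hash value per hash function.

    Each 'hash function' here is BLAKE2b with a different salt (an
    illustrative choice, not what MHAP itself uses).
    """
    kmer_set = kmers(seq, k)
    if not kmer_set:
        raise ValueError("sequence is shorter than k")
    sketch = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")
        sketch.append(min(
            int.from_bytes(
                hashlib.blake2b(km.encode(), digest_size=8, salt=salt).digest(),
                "big",
            )
            for km in kmer_set
        ))
    return sketch


def estimate_jaccard(sketch_a, sketch_b):
    """Estimate Jaccard similarity as the fraction of matching minima."""
    matches = sum(a == b for a, b in zip(sketch_a, sketch_b))
    return matches / len(sketch_a)


if __name__ == "__main__":
    # Toy sequences, not real reads: they share a 16-base stretch.
    read_a = "ACGTACGTTAGCCGATAGGCTTAACGGATCCA"
    read_b = "TAGCCGATAGGCTTAACGGATCCATTGACCGT"
    est = estimate_jaccard(minhash_sketch(read_a), minhash_sketch(read_b))
    print(f"estimated k-mer Jaccard similarity: {est:.2f}")
```

Comparing fixed-size signatures avoids aligning every pair of reads directly, and read pairs whose estimated similarity exceeds a chosen threshold can then be passed to a more precise overlap step.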

Source
DOI: http://dx.doi.org/10.1038/nbt.3238
June 2015

Extending reference assembly models.

Genome Biol 2015 Jan 24;16:13. Epub 2015 Jan 24.

The human genome reference assembly is crucial for aligning and analyzing sequence data, and for genome annotation, among other roles. However, the models and analysis assumptions that underlie the current assembly need revising to fully represent human sequence diversity. Improved analysis tools and updated data reporting formats are also required.

Source
DOI: http://dx.doi.org/10.1186/s13059-015-0587-3
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4305238
January 2015

Assessment of risk of late rectal bleeding for patients with prostate cancer started on anticoagulation before or after radiation treatment.

Anticancer Res 2014 Dec;34(12):7367-72

Department of Veterans Affairs, New York Harbor Healthcare System, Brooklyn, New York, NY, U.S.A. Department of Radiation Oncology, SUNY Downstate Medical Center, Brooklyn, NY, U.S.A.

Aim: To evaluate the risk of late rectal bleeding and its association with the timing and type of anticoagulation use in patients receiving dose-escalated radiation therapy (RT) (≥ 7,560 cGy) for prostate cancer.

Patients And Methods: Between 2003 and 2010, 465 patients were treated at our institution with dose-escalated RT and included in this analysis. Patients were placed into the following categories: no anticoagulation use, aspirin during RT, clopidogrel/warfarin during RT, aspirin after completion of RT, clopidogrel/warfarin after completion of RT.

Results: The overall bleeding rate was 7.5%. For those on aspirin during RT, the 4-year freedom from rectal bleeding (FFBS) rate was 91%, compared to 94.7% for patients who were never on anticoagulation (p=0.16). For those on warfarin/clopidogrel during RT the 4-year FFBS rate was 78.2%, compared to 94.7% in those never on anticoagulation (p<0.001). On multivariate analysis, use of warfarin/clopidogrel during radiation treatment was strongly associated with an increased risk of rectal bleeding (multivariate HR=4.84, 95% CI=1.84-12.68, p=0.001). However, initiation of anticoagulation after completion of radiation treatment did not significantly increase the risk of rectal bleeding (multivariate HR=0.78, 95% CI=0.21-2.91, p=0.71).

Conclusion: The use of clopidogrel or warfarin during radiation is associated with significantly increased risk of rectal bleeding. However, initiation of these medications after completion of radiation does not appear to impact such risk.

Source
December 2014

Long-read, whole-genome shotgun sequence data for five model organisms.

Sci Data 2014 25;1:140045. Epub 2014 Nov 25.

Pacific Biosciences of California Inc., 1380 Willow Road, Menlo Park, California 94025, USA.

Single molecule, real-time (SMRT) sequencing from Pacific Biosciences is increasingly used in many areas of biological research including de novo genome assembly, structural-variant identification, haplotype phasing, mRNA isoform discovery, and base-modification analyses. High-quality, public datasets of SMRT sequences can spur development of analytic tools that can accommodate unique characteristics of SMRT data (long read lengths, lack of GC or amplification bias, and a random error profile leading to high consensus accuracy). In this paper, we describe eight high-coverage SMRT sequence datasets from five organisms (Escherichia coli, Saccharomyces cerevisiae, Neurospora crassa, Arabidopsis thaliana, and Drosophila melanogaster) that have been publicly released to the general scientific community (NCBI Sequence Read Archive ID SRP040522). Data were generated using two sequencing chemistries (P4C2 and P5C3) on the PacBio RS II instrument. The datasets reported here can be used without restriction by the research community to generate whole-genome assemblies, test new algorithms, investigate genome structure and evolution, and identify base modifications in some of the most widely-studied model systems in biological research.

Source
DOI: http://dx.doi.org/10.1038/sdata.2014.45
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4365909
December 2015