Publications by authors named "Birte Kehr"

24 Publications

GAMIBHEAR: whole-genome haplotype reconstruction from Genome Architecture Mapping data.

Bioinformatics 2021 Apr 8. Epub 2021 Apr 8.

Berlin Institute for Medical Systems Biology, Max Delbrück Center for Molecular Medicine, Hannoversche Str. 28, 10115 Berlin, Germany.

Motivation: Genome Architecture Mapping (GAM) was recently introduced as a digestion- and ligation-free method to detect chromatin conformation. Orthogonal to existing approaches based on chromatin conformation capture (3C), GAM's ability to capture both inter- and intra-chromosomal contacts from low amounts of input data makes it particularly well suited for allele-specific analyses in a clinical setting. Allele-specific analyses are powerful tools to investigate the effects of genetic variants on many cellular phenotypes including chromatin conformation, but require the haplotypes of the individuals under study to be known a priori. So far, however, no algorithm exists for haplotype reconstruction and phasing of genetic variants from GAM data, hindering the allele-specific analysis of chromatin contact points in non-model organisms or individuals with unknown haplotypes.

Results: We present GAMIBHEAR, a tool for accurate haplotype reconstruction from GAM data. GAMIBHEAR aggregates allelic co-observation frequencies from GAM data and employs a GAM-specific probabilistic model of haplotype capture to optimise phasing accuracy. Using a hybrid mouse embryonic stem cell line with known haplotype structure as a benchmark dataset, we assess correctness and completeness of the reconstructed haplotypes, and demonstrate the power of GAMIBHEAR to infer accurate genome-wide haplotypes from GAM data.
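
The idea of aggregating allelic co-observations can be illustrated with a toy example. This is only a sketch under simplifying assumptions and not GAMIBHEAR's probabilistic model: each sample is assumed to report at most one allele (0 = reference, 1 = alternative) per heterozygous SNP, and a pair of SNPs is phased by a simple majority vote. The function name `phase_pair` and the data layout are hypothetical.

```python
from collections import Counter

def phase_pair(samples, snp_a, snp_b):
    """Majority-vote phasing of two heterozygous SNPs (toy model).

    samples: list of dicts mapping SNP id -> observed allele (0 or 1).
    Returns ("cis" | "trans" | None, counts). "cis" means that equal
    allele labels (0 with 0, 1 with 1) tend to be observed together.
    """
    counts = Counter()
    for obs in samples:
        # Only samples that observed both SNPs are informative.
        if snp_a in obs and snp_b in obs:
            key = "cis" if obs[snp_a] == obs[snp_b] else "trans"
            counts[key] += 1
    if counts["cis"] == counts["trans"]:
        return None, counts  # no evidence, or a tie: leave unphased
    call = "cis" if counts["cis"] > counts["trans"] else "trans"
    return call, counts

# Hypothetical observations from four samples.
samples = [
    {"snp1": 0, "snp2": 0},
    {"snp1": 1, "snp2": 1},
    {"snp1": 0, "snp2": 1},
    {"snp1": 1},  # snp2 not observed, so this sample is ignored
]
call, counts = phase_pair(samples, "snp1", "snp2")
```

A majority vote ignores how likely a sample is to capture both haplotypes at once, which is the kind of effect a model of haplotype capture would need to account for.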

Availability: GAMIBHEAR is available as an R package under the open source GPL-2 license at https://bitbucket.org/schwarzlab/gamibhear.

Maintainer: julia.markowski@mdc-berlin.de.

Supplementary Information: Supplementary information is available at Bioinformatics online.
http://dx.doi.org/10.1093/bioinformatics/btab238
April 2021

PopDel identifies medium-size deletions simultaneously in tens of thousands of genomes.

Nat Commun 2021 02 1;12(1):730. Epub 2021 Feb 1.

Regensburg Center for Interventional Immunology (RCI), Regensburg, Germany.

Thousands of genomic structural variants (SVs) segregate in the human population and can impact phenotypic traits and diseases. Their identification in whole-genome sequence data of large cohorts is a major computational challenge. Most current approaches identify SVs in single genomes and afterwards merge the identified variants into a joint call set across many genomes. We describe the approach PopDel, which directly identifies deletions of about 500 to at least 10,000 bp in length in data of many genomes jointly, eliminating the need for subsequent variant merging. PopDel scales to tens of thousands of genomes as we demonstrate in evaluations on up to 49,962 genomes. We show that PopDel reliably reports common, rare and de novo deletions. On genomes with available high-confidence reference call sets PopDel shows excellent recall and precision. Genotype inheritance patterns in up to 6794 trios indicate that genotypes predicted by PopDel are more reliable than those of previous SV callers. Furthermore, PopDel's running time is competitive with the fastest tested previous tools. The demonstrated scalability and accuracy of PopDel enable routine scans for deletions in large-scale sequencing studies.
http://dx.doi.org/10.1038/s41467-020-20850-5
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7851401
February 2021

Lifelong Reduction in LDL (Low-Density Lipoprotein) Cholesterol due to a Gain-of-Function Mutation in LDLR.

Circ Genom Precis Med 2021 Feb 14;14(1):e003029. Epub 2020 Dec 14.

deCODE genetics/Amgen, Inc (E.B., K.G., G.H.H., A.S., G.A.A., H.J., G.S., S.G., A.H., G. Thorleifsson, J.S., I.J., O.T.M., G.M., H.S., D.F.G., G. Thorgeirsson, H.H., B.V.H., P.M., G.L.N., P.S., U.T., K.S.), University of Iceland.

Background: Loss-of-function mutations in the LDL (low-density lipoprotein) receptor gene (LDLR) cause elevated levels of LDL cholesterol and premature cardiovascular disease. To date, a gain-of-function mutation in LDLR with a large effect on LDL cholesterol levels has not been described. Here, we searched for sequence variants in LDLR that have a large effect on LDL cholesterol levels.

Methods: We analyzed whole-genome sequencing data from 43 202 Icelanders. Single-nucleotide polymorphisms and structural variants including deletions, insertions, and duplications were genotyped using whole-genome sequencing-based data. LDL cholesterol associations were carried out in a sample of >100 000 Icelanders with genetic information (imputed or whole-genome sequencing). Molecular analyses were performed using RNA sequencing and protein expression assays in Epstein-Barr virus-transformed lymphocytes.

Results: We discovered a 2.5-kb deletion (del2.5) overlapping the 3' untranslated region of LDLR in 7 heterozygous carriers from a single family. Mean level of LDL cholesterol was 74% lower in del2.5 carriers than in 101 851 noncarriers, a difference of 2.48 mmol/L (96 mg/dL; P=8.4×10). Del2.5 results in production of an alternative mRNA isoform with a truncated 3' untranslated region. The truncation leads to a loss of target sites for microRNAs known to repress translation of LDLR. In Epstein-Barr virus-transformed lymphocytes derived from del2.5 carriers, expression of alternative mRNA isoform was 1.84-fold higher than the wild-type isoform (P=0.0013), and there was 1.79-fold higher surface expression of the LDL receptor than in noncarriers (P=0.0086). We did not find a highly penetrant detrimental impact of lifelong very low levels of LDL cholesterol due to del2.5 on health of the carriers.

Conclusions: Del2.5 is the first reported gain-of-function mutation in LDLR causing a large reduction in LDL cholesterol. These data point to a role for alternative polyadenylation of mRNA as a potent regulator of LDL receptor expression in humans.
http://dx.doi.org/10.1161/CIRCGEN.120.003029
February 2021

Sequence variants associating with urinary biomarkers.

Hum Mol Genet 2019 04;28(7):1199-1211

Faculty of Medicine, University of Iceland, Reykjavik, Iceland.

Urine dipstick tests are widely used in routine medical care to diagnose kidney and urinary tract and metabolic diseases. Several environmental factors are known to affect the test results, whereas the effects of genetic diversity are largely unknown. We tested 32.5 million sequence variants for association with urinary biomarkers in a set of 150 274 Icelanders with urine dipstick measurements. We detected 20 association signals, of which 14 are novel, associating with at least one of five clinical entities defined by the urine dipstick: glucosuria, ketonuria, proteinuria, hematuria and urine pH. These include three independent glucosuria variants at SLC5A2, the gene encoding the sodium-dependent glucose transporter (SGLT2), a protein targeted pharmacologically to increase urinary glucose excretion in the treatment of diabetes. Two variants associating with proteinuria are in LRP2 and CUBN, encoding the co-transporters megalin and cubilin, respectively, that mediate proximal tubule protein uptake. One of the hematuria-associated variants is a rare, previously unreported 2.5 kb exonic deletion in COL4A3. Of the four signals associated with urine pH, we note that the pH-increasing alleles of two variants (POU2AF1, WDR72) associate significantly with increased risk of kidney stones. Our results reveal that genetic factors affect variability in urinary biomarkers, in both a disease dependent and independent context.
http://dx.doi.org/10.1093/hmg/ddy409
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6423415
April 2019

Multiple transmissions of de novo mutations in families.

Nat Genet 2018 12 5;50(12):1674-1680. Epub 2018 Nov 5.

deCODE genetics/Amgen Inc., Reykjavik, Iceland.

De novo mutations (DNMs) cause a large proportion of severe rare diseases of childhood. DNMs that occur early may result in mosaicism of both somatic and germ cells. Such early mutations can cause recurrence of disease. We scanned 1,007 sibling pairs from 251 families and identified 878 DNMs shared by siblings (ssDNMs) at 448 genomic sites. We estimated DNM recurrence probability based on parental mosaicism, sharing of DNMs among siblings, parent-of-origin, mutation type and genomic position. We detected 57.2% of ssDNMs in the parental blood. The recurrence probability of a DNM decreases by 2.27% per year for paternal DNMs and 1.78% per year for maternal DNMs. Maternal ssDNMs are more likely to be T>C mutations than paternal ssDNMs, and less likely to be C>T mutations. Depending on the properties of the DNM, the recurrence probability ranges from 0.011% to 28.5%. We have launched an online calculator to allow estimation of DNM recurrence probability for research purposes.
http://dx.doi.org/10.1038/s41588-018-0259-9
December 2018

Author Correction: The rate of meiotic gene conversion varies by sex and age.

Nat Genet 2018 11;50(11):1616

deCODE Genetics/Amgen, Inc., Reykjavik, Iceland.

In the version of this article published, statements about the impact of insertions and deletions on gene conversions were incorrect. We reported a bias toward deletions, whereas in fact the bias was toward insertions. We are deeply indebted to Laurent Duret and Brice Letcher for noticing this mistake in our manuscript. The following statements are incorrect in the published manuscript.
http://dx.doi.org/10.1038/s41588-018-0228-3
November 2018

Parental influence on human germline de novo mutations in 1,548 trios from Iceland.

Nature 2017 09 20;549(7673):519-522. Epub 2017 Sep 20.

deCODE genetics/Amgen Inc., 101 Reykjavik, Iceland.

The characterization of mutational processes that generate sequence diversity in the human genome is of paramount importance both to medical genetics and to evolutionary studies. To understand how the age and sex of transmitting parents affect de novo mutations, here we sequence 1,548 Icelanders, their parents, and, for a subset of 225, at least one child, to 35× genome-wide coverage. We find 108,778 de novo mutations, both single nucleotide polymorphisms and indels, and determine the parent of origin of 42,961. The number of de novo mutations from mothers increases by 0.37 per year of age (95% CI 0.32-0.43), a quarter of the 1.51 per year from fathers (95% CI 1.45-1.57). The number of clustered mutations increases faster with the mother's age than with the father's, and the genomic span of maternal de novo mutation clusters is greater than that of paternal ones. The types of de novo mutation from mothers change substantially with age, with a 0.26% (95% CI 0.19-0.33%) decrease in cytosine-phosphate-guanine to thymine-phosphate-guanine (CpG>TpG) de novo mutations and a 0.33% (95% CI 0.28-0.38%) increase in C>G de novo mutations per year, respectively. Remarkably, these age-related changes are not distributed uniformly across the genome. A striking example is a 20 megabase region on chromosome 8p, with a maternal C>G mutation rate that is up to 50-fold greater than the rest of the genome. The age-related accumulation of maternal non-crossover gene conversions also mostly occurs within these regions. Increased sequence diversity and linkage disequilibrium of C>G variants within regions affected by excess maternal mutations indicate that the underlying mutational process has persisted in humans for thousands of years. Moreover, the regional excess of C>G variation in humans is largely shared by chimpanzees, less by gorillas, and is almost absent from orangutans. This demonstrates that sequence diversity in humans results from evolving interactions between age, sex, mutation type, and genomic location.
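
The two age slopes quoted in this abstract can be turned into a small worked calculation. The sketch below uses only the point estimates 0.37 and 1.51 additional de novo mutations per year of maternal and paternal age. The abstract gives no intercept, so the hypothetical helper `extra_dnms` computes only the expected difference for a given number of additional parental years and not an absolute count.

```python
# Point estimates from the abstract: additional de novo mutations (DNMs)
# per year of parental age.
MATERNAL_SLOPE = 0.37
PATERNAL_SLOPE = 1.51

def extra_dnms(mother_years, father_years):
    """Expected additional DNMs when the mother is `mother_years` older
    and the father is `father_years` older (linear model, slopes only)."""
    return MATERNAL_SLOPE * mother_years + PATERNAL_SLOPE * father_years

# The maternal slope is roughly a quarter of the paternal slope.
slope_ratio = MATERNAL_SLOPE / PATERNAL_SLOPE

# Ten additional years for both parents.
delta = extra_dnms(10, 10)
print(f"maternal/paternal slope ratio: {slope_ratio:.3f}")
print(f"extra DNMs for 10 more years of both parents: {delta:.1f}")
```

The ratio of about 0.245 matches the abstract's statement that the maternal increase is a quarter of the paternal one.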
http://dx.doi.org/10.1038/nature24018
September 2017

Graphtyper enables population-scale genotyping using pangenome graphs.

Nat Genet 2017 Nov 25;49(11):1654-1660. Epub 2017 Sep 25.

deCODE Genetics/Amgen, Inc., Reykjavik, Iceland.

A fundamental requirement for genetic studies is an accurate determination of sequence variation. While human genome sequence diversity is increasingly well characterized, there is a need for efficient ways to use this knowledge in sequence analysis. Here we present Graphtyper, a publicly available novel algorithm and software for discovering and genotyping sequence variants. Graphtyper realigns short-read sequence data to a pangenome, a variation-aware graph structure that encodes sequence variation within a population by representing possible haplotypes as graph paths. Our results show that Graphtyper is fast, highly scalable, and provides sensitive and accurate genotype calls. Graphtyper genotyped 89.4 million sequence variants in the whole genomes of 28,075 Icelanders using less than 100,000 CPU days, including detailed genotyping of six human leukocyte antigen (HLA) genes. We show that Graphtyper is a valuable tool in characterizing sequence variation in both small and population-scale sequencing studies.
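
The notion of representing possible haplotypes as paths through a variation-aware graph can be shown with a toy example. This is a minimal sketch and not Graphtyper's data structure: it assumes non-overlapping biallelic variants given as hypothetical `(position, ref_allele, alt_allele)` tuples, and it enumerates every path, which is feasible only for tiny inputs.

```python
from itertools import product

def haplotype_paths(reference, variants):
    """Enumerate all haplotype sequences of a simple variation graph.

    reference: reference sequence as a string.
    variants: sorted, non-overlapping (pos, ref, alt) tuples, 0-based.
    Each variant site becomes a pair of alternative nodes; the shared
    sequence between sites becomes a single node.
    """
    nodes = []
    cursor = 0
    for pos, ref, alt in variants:
        if reference[pos:pos + len(ref)] != ref:
            raise ValueError(f"ref allele mismatch at position {pos}")
        nodes.append([reference[cursor:pos]])  # shared segment
        nodes.append([ref, alt])               # alternative alleles
        cursor = pos + len(ref)
    nodes.append([reference[cursor:]])         # trailing shared segment
    # A path picks exactly one node at every level of the graph.
    return ["".join(path) for path in product(*nodes)]

# One SNP (G>T at position 2) and one 1-bp deletion (C at position 5).
paths = haplotype_paths("ACGTACGT", [(2, "G", "T"), (5, "C", "")])
```

With two biallelic sites the graph has four paths; the first one spells the reference sequence itself.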
http://dx.doi.org/10.1038/ng.3964
November 2017

Whole genome characterization of sequence diversity of 15,220 Icelanders.

Sci Data 2017 09 21;4:170115. Epub 2017 Sep 21.

deCODE genetics/Amgen Inc., Sturlugata 8, Reykjavik 101, Iceland.

Understanding of sequence diversity is the cornerstone of analysis of genetic disorders, population genetics, and evolutionary biology. Here, we present an update of our sequencing set to 15,220 Icelanders who we sequenced to an average genome-wide coverage of 34X. We identified 39,020,168 autosomal variants passing GATK filters: 31,079,378 SNPs and 7,940,790 indels. Calling de novo mutations (DNMs) is a formidable challenge given the high false positive rate in sequencing datasets relative to the mutation rate. Here we addressed this issue by using segregation of alleles in three-generation families. Using this transmission assay, we controlled the false positive rate and identified 108,778 high quality DNMs. Furthermore, we used our extended family structure and read pair tracing of DNMs to a panel of phased SNPs, to determine the parent of origin of 42,961 DNMs.
http://dx.doi.org/10.1038/sdata.2017.115
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5607473
September 2017

Comprehensive population-wide analysis of Lynch syndrome in Iceland reveals founder mutations in MSH6 and PMS2.

Nat Commun 2017 05 3;8:14755. Epub 2017 May 3.

University of Iceland, Sæmundargata 2, 101 Reykjavík, Iceland.

Lynch syndrome, caused by germline mutations in the mismatch repair genes, is associated with increased cancer risk. Here, using a large whole-genome sequencing data bank, cancer registry and colorectal tumour bank, we determine the prevalence of Lynch syndrome, associated cancer risks and pathogenicity of several variants in the Icelandic population. We use colorectal cancer samples from 1,182 patients diagnosed between 2000 and 2009. One-hundred and thirty-two (11.2%) tumours are mismatch repair deficient per immunohistochemistry. Twenty-one (1.8%) have Lynch syndrome while 106 (9.0%) have somatic hypermethylation or mutations in the mismatch repair genes. The population prevalence of Lynch syndrome is 0.442%. We discover a translocation disrupting MLH1 and three mutations in MSH6 and PMS2 that increase endometrial, colorectal, brain and ovarian cancer risk. We find thirteen mismatch repair variants of uncertain significance that are not associated with cancer risk. We find that founder mutations in MSH6 and PMS2 prevail in Iceland unlike most other populations.
http://dx.doi.org/10.1038/ncomms14755
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5418568
May 2017

A rare splice donor mutation in the haptoglobin gene associates with blood lipid levels and coronary artery disease.

Hum Mol Genet 2017 06;26(12):2364-2376

deCODE Genetics/Amgen, Inc., Reykjavik, Iceland.

Common sequence variants at the haptoglobin gene (HP) have been associated with blood lipid levels. Through whole-genome sequencing of 8,453 Icelanders, we discovered a splice donor founder mutation in HP (NM_001126102.1:c.190+1G>C, minor allele frequency = 0.56%). This mutation occurs on the HP1 allele of the common copy number variant in HP and leads to a loss of function of HP1. It associates with lower levels of haptoglobin (P = 2.1 × 10(-54)), higher levels of non-high density lipoprotein cholesterol (β = 0.26 mmol/l, P = 2.6 × 10(-9)) and greater risk of coronary artery disease (odds ratio = 1.30, 95% confidence interval: 1.10-1.54, P = 0.0024). Through haplotype analysis and with RNA sequencing, we provide evidence of a causal relationship between one of the two haptoglobin isoforms, namely Hp1, and lower levels of non-HDL cholesterol. Furthermore, we show that the HP1 allele associates with various other quantitative biological traits.
http://dx.doi.org/10.1093/hmg/ddx123
June 2017

Diversity in non-repetitive human sequences not found in the reference genome.

Nat Genet 2017 Apr 27;49(4):588-593. Epub 2017 Feb 27.

deCODE Genetics/Amgen, Inc., Reykjavik, Iceland.

Genomes usually contain some non-repetitive sequences that are missing from the reference genome and occur only in a population subset. Such non-repetitive, non-reference (NRNR) sequences have remained largely unexplored in terms of their characterization and downstream analyses. Here we describe 3,791 breakpoint-resolved NRNR sequence variants called using PopIns from whole-genome sequence data of 15,219 Icelanders. We found that over 95% of the 244 NRNR sequences that are 200 bp or longer are present in chimpanzees, indicating that they are ancestral. Furthermore, 149 variant loci are in linkage disequilibrium (r > 0.8) with a genome-wide association study (GWAS) catalog marker, suggesting disease relevance. Additionally, we report an association (P = 3.8 × 10, odds ratio (OR) = 0.92) with myocardial infarction (23,360 cases, 300,771 controls) for a 766-bp NRNR sequence variant. Our results underline the importance of including variation of all complexity levels when searching for variants that associate with disease.
http://dx.doi.org/10.1038/ng.3801
April 2017

A sequence variant associating with educational attainment also affects childhood cognition.

Sci Rep 2016 11 4;6:36189. Epub 2016 Nov 4.

deCODE Genetics/Amgen, Inc., Reykjavik, Iceland.

Only a few common variants in the sequence of the genome have been shown to impact cognitive traits. Here we demonstrate that polygenic scores of educational attainment predict specific aspects of childhood cognition, as measured with IQ. Recently, three sequence variants were shown to associate with educational attainment, a confluence phenotype of genetic and environmental factors contributing to academic success. We show that one of these variants associating with educational attainment, rs4851266-T, also associates with Verbal IQ in dyslexic children (P = 4.3 × 10, β = 0.16 s.d.). The effect of 0.16 s.d. corresponds to 1.4 IQ points for heterozygotes and 2.8 IQ points for homozygotes. We verified this association in independent samples consisting of adults (P = 8.3 × 10, β = 0.12 s.d., combined P = 2.2 × 10, β = 0.14 s.d.). Childhood cognition is unlikely to be affected by education attained later in life, and the variant explains a greater fraction of the variance in verbal IQ than in educational attainment (0.7% vs 0.12%, P = 1.0 × 10).
http://dx.doi.org/10.1038/srep36189
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5095652
November 2016

The rate of meiotic gene conversion varies by sex and age.

Nat Genet 2016 11 19;48(11):1377-1384. Epub 2016 Sep 19.

deCODE Genetics/Amgen, Inc., Reykjavik, Iceland.

Meiotic recombination involves a combination of gene conversion and crossover events that, along with mutations, produce germline genetic diversity. Here we report the discovery of 3,176 SNP and 61 indel gene conversions. Our estimate of the non-crossover (NCO) gene conversion rate (G) is 7.0 for SNPs and 5.8 for indels per megabase per generation, and the GC bias is 67.6%. For indels, we demonstrate a 65.6% preference for the shorter allele. NCO gene conversions from mothers are longer than those from fathers, and G is 2.17 times greater in mothers. Notably, G increases with the age of mothers, but not the age of fathers. A disproportionate number of NCO gene conversions in older mothers occur outside double-strand break (DSB) regions and in regions with relatively low GC content. This points to age-related changes in the mechanisms of meiotic gene conversion in oocytes.
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5083143
http://dx.doi.org/10.1038/ng.3669
November 2016

popSTR: population-scale detection of STR variants.

Bioinformatics 2017 Dec;33(24):4041-4048

deCODE genetics/Amgen.

Motivation: Microsatellites, also known as short tandem repeats (STRs), are tracts of repetitive DNA sequences containing motifs ranging from two to six bases. Microsatellites are one of the most abundant types of variation in the human genome, after single nucleotide polymorphisms (SNPs) and indels. Microsatellite analysis has a wide range of applications, including medical genetics, forensics and construction of genetic genealogy. However, microsatellite variations are rarely considered in whole-genome sequencing studies, in large part due to a lack of tools capable of analyzing them.
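
The definition of a microsatellite as a tandem repeat of a two- to six-base motif can be illustrated with a toy scanner. This is only a sketch of the definition and not the popSTR genotyping method: the minimum of four consecutive copies is an arbitrary assumption, only perfect repeats are found, and the function name `find_strs` is hypothetical.

```python
import re

# A motif of 2-6 bases (shortest motif preferred) followed by at least
# three further copies of itself, i.e. four or more copies in total.
STR_PATTERN = re.compile(r"([ACGT]{2,6}?)\1{3,}")

def find_strs(sequence):
    """Return (start, motif, copies) for each perfect tandem repeat."""
    hits = []
    for match in STR_PATTERN.finditer(sequence):
        motif = match.group(1)
        copies = len(match.group(0)) // len(motif)
        hits.append((match.start(), motif, copies))
    return hits

# A (CA)5 repeat embedded in flanking sequence.
hits = find_strs("TTGCACACACACAGG")
```

Note that this pattern would also report a long homopolymer run as a dinucleotide repeat (for example `AA` repeated), so a real scanner would need additional filtering.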

Results: Here we present a microsatellite genotyper, optimized for Illumina WGS data, which is both faster and more accurate than other methods previously presented. There are two main ingredients to our improvements. First we reduce the amount of sequencing data necessary for creating microsatellite profiles by using previously aligned sequencing data. Second, we use population information to train microsatellite and individual specific error profiles. By comparing our genotyping results to genotypes generated by capillary electrophoresis we show that our error rates are 50% lower than those of lobSTR, another program specifically developed to determine microsatellite genotypes.

Availability And Implementation: Source code is available on Github: https://github.com/DecodeGenetics/popSTR.

Contact: snaedis.kristmundsdottir@decode.is or bjarni.halldorsson@decode.is.
http://dx.doi.org/10.1093/bioinformatics/btw568
December 2017

chopBAI: BAM index reduction solves I/O bottlenecks in the joint analysis of large sequencing cohorts.

Bioinformatics 2016 07 18;32(14):2202-4. Epub 2016 Mar 18.

deCODE Genetics/Amgen, Reykjavík, Iceland; Faculty of Industrial Engineering, Mechanical Engineering and Computer Science, University of Iceland, Reykjavík, Iceland.

Advances in sequencing capacity have led to the generation of unprecedented amounts of genomic data. The processing of this data frequently leads to I/O bottlenecks, e.g. when analyzing a small genomic region across a large number of samples. The largest I/O burden is, however, often not imposed by the amount of data needed for the analysis but rather by index files that help retrieve this data. We have developed chopBAI, a program that can chop a BAM index (BAI) file into small pieces. The program outputs a list of BAI files each indexing a specified genomic interval. The output files are much smaller in size but maintain compatibility with existing software tools. We show how preprocessing BAI files with chopBAI can lead to a reduction of I/O by more than 95% during the analysis of 10 kb genomic regions, eventually enabling the joint analysis of more than 10 000 individuals.
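
Why an index restricted to a small interval can be much smaller than the full index follows from the hierarchical binning scheme that BAI files use. The sketch below is a Python port of the `reg2bins` routine given in the SAM/BAM format specification and is not taken from chopBAI's source code; it lists the bins that can contain reads overlapping a region and compares that number with the total number of bins per reference sequence. The BAI linear index is not covered here.

```python
def reg2bins(beg, end):
    """Bins that may overlap the 0-based, half-open interval [beg, end).

    Port of reg2bins from the SAM/BAM specification: one 512 Mbp root
    bin (bin 0) plus five finer levels of 64 Mbp, 8 Mbp, 1 Mbp, 128 kbp
    and 16 kbp bins.
    """
    end -= 1  # work with the last covered position
    bins = [0]
    for offset, shift in ((1, 26), (9, 23), (73, 20), (585, 17), (4681, 14)):
        first = offset + (beg >> shift)
        last = offset + (end >> shift)
        bins.extend(range(first, last + 1))
    return bins

# Total number of bins per reference sequence across the six levels.
TOTAL_BINS = sum(8 ** level for level in range(6))

region_bins = reg2bins(0, 10_000)  # a 10 kb region
print(f"{len(region_bins)} of {TOTAL_BINS} bins are relevant")
```

For a 10 kb region only a handful of the 37,449 bins are relevant, so the index entries for all other bins are not needed to answer queries on that region.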

Availability And Implementation: The software is implemented in C++, GPL licensed and available at http://github.com/DecodeGenetics/chopBAI.

Contact: birte.kehr@decode.is.
http://dx.doi.org/10.1093/bioinformatics/btw149
July 2016

Insertion of an SVA-E retrotransposon into the CASP8 gene is associated with protection against prostate cancer.

Hum Mol Genet 2016 Mar 5;25(5):1008-18. Epub 2016 Jan 5.

deCODE genetics/AMGEN, Sturlugata 8, 101 Reykjavik, Iceland.

Transcriptional and splicing anomalies have been observed in intron 8 of the CASP8 gene (encoding procaspase-8) in association with cutaneous basal-cell carcinoma (BCC) and linked to a germline SNP rs700635. Here, we show that the rs700635[C] allele, which is associated with increased risk of BCC and breast cancer, is protective against prostate cancer [odds ratio (OR) = 0.91, P = 1.0 × 10(-6)]. rs700635[C] is also associated with failures to correctly splice out CASP8 intron 8 in breast and prostate tumours and in corresponding normal tissues. Investigation of rs700635[C] carriers revealed that they have a human-specific short interspersed element-variable number of tandem repeat-Alu (SINE-VNTR-Alu), subfamily-E retrotransposon (SVA-E) inserted into CASP8 intron 8. The SVA-E shows evidence of prior activity, because it has transduced some CASP8 sequences during subsequent retrotransposition events. Whole-genome sequence (WGS) data were used to tag the SVA-E with a surrogate SNP rs1035142[T] (r(2) = 0.999), which showed associations with both the splicing anomalies (P = 6.5 × 10(-32)) and with protection against prostate cancer (OR = 0.91, P = 3.8 × 10(-7)).
http://dx.doi.org/10.1093/hmg/ddv622
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4754045
March 2016

PopAlu: population-scale detection of Alu polymorphisms.

PeerJ 2015 22;3:e1269. Epub 2015 Sep 22.

deCODE genetics/Amgen, Reykjavík, Iceland; Institute of Biomedical and Neural Engineering, School of Science and Engineering, Reykjavik University, Reykjavík, Iceland.

Alu elements are sequences of approximately 300 basepairs that together comprise more than 10% of the human genome. Due to their recent origin in primate evolution some Alu elements are polymorphic in humans, present in some individuals while absent in others. We present PopAlu, a tool to detect polymorphic Alu elements on a population scale from paired-end sequencing data. PopAlu uses read pair distance and orientation as well as split reads to identify the location and precise breakpoints of polymorphic Alus. Genotype calling enables us to differentiate between homozygous and heterozygous carriers, making the output of PopAlu suitable for use in downstream analyses such as genome-wide association studies (GWAS). We show on a simulated dataset that PopAlu calls Alu elements inserted and deleted with respect to a reference genome with high accuracy and high precision. Our analysis of real data of a human trio from the 1000 Genomes Project confirms that PopAlu is able to produce highly accurate genotype calls. To our knowledge, PopAlu is the first tool that identifies polymorphic Alu elements from multiple individuals simultaneously, pinpoints the precise breakpoints and calls genotypes with high accuracy.
http://dx.doi.org/10.7717/peerj.1269
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4582951
September 2015

PopIns: population-scale detection of novel sequence insertions.

Bioinformatics 2016 04 28;32(7):961-7. Epub 2015 Apr 28.

deCODE genetics/Amgen, Reykjavík, Iceland; Institute of Biomedical and Neural Engineering, Reykjavík University, Reykjavík, Iceland.

Motivation: The detection of genomic structural variation (SV) has advanced tremendously in recent years due to progress in high-throughput sequencing technologies. Novel sequence insertions, insertions without similarity to a human reference genome, have received less attention than other types of SVs due to the computational challenges in their detection from short read sequencing data, which inherently involves de novo assembly. De novo assembly is not only computationally challenging, but also requires high-quality data. Although the reads from a single individual may not always meet this requirement, using reads from multiple individuals can increase power to detect novel insertions.

Results: We have developed the program PopIns, which can discover and characterize non-reference insertions of 100 bp or longer on a population scale. In this article, we describe the approach we implemented in PopIns. It takes as input a reads-to-reference alignment, assembles unaligned reads using a standard assembly tool, merges the contigs of different individuals into high-confidence sequences, anchors the merged sequences into the reference genome, and finally genotypes all individuals for the discovered insertions. Our tests on simulated data indicate that the merging step greatly improves the quality and reliability of predicted insertions and that PopIns shows significantly better recall and precision than the recent tool MindTheGap. Preliminary results on a dataset of 305 Icelanders demonstrate the practicality of the new approach.

Availability And Implementation: The source code of PopIns is available from http://github.com/bkehr/popins

Contact: birte.kehr@decode.is

Supplementary Information: Supplementary data are available at Bioinformatics online.
http://dx.doi.org/10.1093/bioinformatics/btv273
April 2016

New basal cell carcinoma susceptibility loci.

Nat Commun 2015 Apr 9;6:6825. Epub 2015 Apr 9.

deCODE Genetics/AMGEN, Sturlugata 8, Reykjavik 101, Iceland.

In an ongoing screen for DNA sequence variants that confer risk of cutaneous basal cell carcinoma (BCC), we conduct a genome-wide association study (GWAS) of 24,988,228 SNPs and small indels detected through whole-genome sequencing of 2,636 Icelanders and imputed into 4,572 BCC patients and 266,358 controls. Here we show the discovery of four new BCC susceptibility loci: 2p24 MYCN (rs57244888[C], OR=0.76, P=4.7 × 10(-12)), 2q33 CASP8-ALS2CR12 (rs13014235[C], OR=1.15, P=1.5 × 10(-9)), 8q21 ZFHX4 (rs28727938[G], OR=0.70, P=3.5 × 10(-12)) and 10p14 GATA3 (rs73635312[A], OR=0.74, P=2.4 × 10(-16)). Fine mapping reveals that two variants correlated with rs73635312[A] occur in conserved binding sites for the GATA3 transcription factor. In addition, expression microarrays and RNA-seq show that rs13014235[C] and a related SNP rs700635[C] are associated with expression of CASP8 splice variants in which sequences from intron 8 are retained.
http://dx.doi.org/10.1038/ncomms7825
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4403348
April 2015

Genome alignment with graph data structures: a comparison.

BMC Bioinformatics 2014 Apr 9;15:99. Epub 2014 Apr 9.

Department of Computer Science, Freie Universität Berlin, Takustr. 9, 14195 Berlin, Germany.

Background: Recent advances in rapid, low-cost sequencing have opened up the opportunity to study complete genome sequences. The computational approach of multiple genome alignment allows investigation of evolutionarily related genomes in an integrated fashion, providing a basis for downstream analyses such as rearrangement studies and phylogenetic inference. Graphs have proven to be a powerful tool for coping with the complexity of genome-scale sequence alignments. The potential of graphs to intuitively represent all aspects of genome alignments led to the development of graph-based approaches for genome alignment. These approaches construct a graph from a set of local alignments, and derive a genome alignment through identification and removal of graph substructures that indicate errors in the alignment.

Results: We compare the structures of commonly used graphs in terms of their abilities to represent alignment information. We describe how the graphs can be transformed into each other, and identify and classify graph substructures common to one or more graphs. Based on previous approaches, we compile a list of modifications that remove these substructures.

Conclusion: We show that crucial pieces of alignment information, associated with inversions and duplications, are not visible in the structure of all graphs. If we neglect vertex or edge labels, the graphs differ in their information content. Still, many ideas are shared among all graph-based approaches. Based on these findings, we outline a conceptual framework for graph-based genome alignment that can assist in the development of future genome alignment tools.

Source
DOI: http://dx.doi.org/10.1186/1471-2105-15-99
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4020321
April 2014

NetCoffee: a fast and accurate global alignment approach to identify functionally conserved proteins in multiple networks.

Bioinformatics 2014 Feb 13;30(4):540-8. Epub 2013 Dec 13.

Department of Mathematics and Computer Science, Freie Universität Berlin, Takustrasse 9, 14195 Berlin, Germany and Max Planck Institute for Molecular Genetics, Ihnestrasse 63-73, 14195 Berlin, Germany.

Motivation: Owing to recent advancements in high-throughput technologies, protein-protein interaction networks of more and more species are becoming available in public databases. The question of how to identify functionally conserved proteins across species attracts a lot of attention in computational biology. Network alignments provide a systematic way to solve this problem. However, most existing alignment tools encounter limitations in tackling this problem. Therefore, the demand for faster and more efficient alignment tools is growing.

Results: We present a fast and accurate algorithm, NetCoffee, which finds a global alignment of multiple protein-protein interaction networks. NetCoffee searches for a global alignment by maximizing a target function using simulated annealing on a set of weighted bipartite graphs that are constructed using a triplet approach similar to T-Coffee. To assess its performance, NetCoffee was applied to four real datasets. Our results suggest that NetCoffee remedies several limitations of previous algorithms and outperforms all existing alignment tools in terms of speed, while still identifying biologically meaningful alignments.
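
The optimization idea in the Results paragraph can be illustrated with a generic simulated-annealing sketch. This is not NetCoffee's target function or its triplet-based graph construction: the function name `anneal_matching`, the cooling schedule, the parameters, and the simplification to a single weighted bipartite graph with positive edge weights and a one-to-one matching constraint are all assumptions made for illustration.

```python
import math
import random


def anneal_matching(edges, steps=5000, t_start=1.0, t_end=0.01, seed=0):
    """Hypothetical sketch: pick a one-to-one matching with high total weight.

    edges maps (left_node, right_node) -> positive weight.
    Returns (set of chosen edges, total weight of that set).
    """
    rng = random.Random(seed)
    candidates = sorted(edges)
    if not candidates:
        return set(), 0.0

    chosen, used_left, used_right = set(), set(), set()
    score = 0.0
    best, best_score = set(), 0.0

    for step in range(steps):
        # Geometric cooling from t_start down to t_end (an assumed schedule).
        fraction = step / max(steps - 1, 1)
        temperature = t_start * (t_end / t_start) ** fraction

        edge = rng.choice(candidates)
        left, right = edge
        if edge in chosen:
            delta = -edges[edge]          # proposal: drop this edge
        elif left in used_left or right in used_right:
            continue                      # adding it would break one-to-one
        else:
            delta = edges[edge]           # proposal: add this edge

        # Metropolis rule: always accept improvements, sometimes accept losses.
        if delta >= 0 or rng.random() < math.exp(delta / temperature):
            if edge in chosen:
                chosen.remove(edge)
                used_left.remove(left)
                used_right.remove(right)
            else:
                chosen.add(edge)
                used_left.add(left)
                used_right.add(right)
            score += delta
            if score > best_score:
                best, best_score = set(chosen), score

    return best, sum(edges[e] for e in best)
```

Occasionally accepting a worse state lets the search drop an edge that blocks a better pairing, which a purely greedy edge selection cannot do.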

Availability: The source code and data are freely available for download under the GNU GPL v3 license at https://code.google.com/p/netcoffee/.

Source
DOI: http://dx.doi.org/10.1093/bioinformatics/btt715
February 2014

STELLAR: fast and exact local alignments.

BMC Bioinformatics 2011 Oct 5;12 Suppl 9:S15. Epub 2011 Oct 5.

Department of Computer Science, Free University Berlin, Takustr. 9, 14195 Berlin, Germany.

Background: Large-scale comparison of genomic sequences requires reliable tools for the search of local alignments. Practical local aligners are in general fast, but heuristic, and hence sometimes miss significant matches.

Results: We present here the local pairwise aligner STELLAR that has full sensitivity for ε-alignments, i.e. guarantees to report all local alignments of a given minimal length and maximal error rate. The aligner is composed of two steps, filtering and verification. We apply the SWIFT algorithm for lossless filtering, and have developed a new verification strategy that we prove to be exact. Our results on simulated and real genomic data confirm and quantify the conjecture that heuristic tools like BLAST or BLAT miss a large percentage of significant local alignments.
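
The acceptance criterion described above (a minimal length and a maximal error rate) can be made concrete with a small check. This is a sketch under stated assumptions, not STELLAR's code: the function names are hypothetical, the alignment is given as two gapped rows of equal length, and the error rate is taken to be the number of mismatch and gap columns divided by the number of alignment columns.

```python
def error_rate(row_a: str, row_b: str, gap: str = "-") -> float:
    """Fraction of alignment columns that are mismatches or gaps (assumed definition)."""
    if len(row_a) != len(row_b) or not row_a:
        raise ValueError("alignment rows must be non-empty and of equal length")
    errors = 0
    for a, b in zip(row_a, row_b):
        if a == gap and b == gap:
            raise ValueError("a column cannot contain a gap in both rows")
        if a != b:
            errors += 1  # mismatch, insertion, or deletion
    return errors / len(row_a)


def is_epsilon_alignment(row_a: str, row_b: str,
                         min_length: int, max_error_rate: float) -> bool:
    """True if the alignment is long enough and its error rate is within the bound."""
    return (len(row_a) >= min_length
            and error_rate(row_a, row_b) <= max_error_rate)
```

For example, the rows `ACGTACGTAC` and `ACGTTCG-AC` have one mismatch and one gap in 10 columns, so they pass with a maximal error rate of 0.2 and fail with 0.1.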

Conclusions: STELLAR is very practical and fast on very long sequences, which makes it a suitable new tool for finding local alignments between genomic sequences under the edit distance model. Binaries are freely available for Linux, Windows, and Mac OS X at http://www.seqan.de/projects/stellar. The source code is freely distributed with the SeqAn C++ library version 1.3 and later at http://www.seqan.de.

Source
DOI: http://dx.doi.org/10.1186/1471-2105-12-S9-S15
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3283304
October 2011

Determination of glycan structure from tandem mass spectra.

IEEE/ACM Trans Comput Biol Bioinform 2011 Jul-Aug;8(4):976-86

Faculty for Mathematics and Computer Science, Friedrich-Schiller-Universität Jena, Ernst-Abbe-Platz 2, Jena 07743, Germany.

Glycans are molecules made from simple sugars that form complex tree structures. Glycans constitute one of the most important protein modifications and identification of glycans remains a pressing problem in biology. Unfortunately, the structure of glycans is hard to predict from the genome sequence of an organism. In this paper, we consider the problem of deriving the topology of a glycan solely from tandem mass spectrometry (MS) data. We study how to generate glycan tree candidates that sufficiently match the sample mass spectrum while avoiding the combinatorial explosion of glycan structures. Unfortunately, the resulting problem is known to be computationally hard. We present an efficient exact algorithm for this problem based on fixed-parameter algorithmics that can process a spectrum in a matter of seconds. We also report some preliminary results of our method on experimental data, combining it with a preliminary candidate evaluation scheme. We show that our approach is fast in applications, and that we can reach very good de novo identification results. Finally, we show how to count the number of glycan topologies for a fixed size or a fixed mass. We generalize this result to count the number of (labeled) trees with bounded out-degree, improving on results obtained using Pólya's enumeration theorem.

Source
DOI: http://dx.doi.org/10.1109/TCBB.2010.129
September 2011