Publications by authors named "Jan Grau"

33 Publications

Insights gained from a comprehensive all-against-all transcription factor binding motif benchmarking study.

Genome Biol 2020 05 11;21(1):114. Epub 2020 May 11.

School of Life Sciences, Ecole Polytechnique Fédérale de Lausanne (EPFL), CH-1015, Lausanne, Switzerland.

Background: The positional weight matrix (PWM) is a de facto standard model for describing transcription factor (TF) DNA binding specificities. PWMs inferred from in vivo or in vitro data are stored in many databases and used in a plethora of biological applications. This calls for comprehensive benchmarking of public PWM models with large experimental reference sets.

Results: Here we report results from all-against-all benchmarking of PWM models for DNA binding sites of human TFs on a large compilation of in vitro (HT-SELEX, PBM) and in vivo (ChIP-seq) binding data. We observe that the best performing PWM for a given TF often belongs to another TF, usually from the same family. Occasionally, binding specificity is correlated with the structural class of the DNA binding domain, indicated by good cross-family performance measures. Benchmarking-based selection of family-representative motifs is more effective than motif clustering-based approaches. Overall, there is good agreement between in vitro and in vivo performance measures. However, for some in vivo experiments, the best performing PWM is assigned to an unrelated TF, indicating a binding mode involving protein-protein cooperativity.

Conclusions: In an all-against-all setting, we compute more than 18 million performance measure values for different PWM-experiment combinations and offer these results as a public resource to the research community. The benchmarking protocols are provided via a web interface and as docker images. The methods and results from this study may help others make better use of public TF specificity models, as well as public TF binding data sets.
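
As an illustration of the PWM model this abstract refers to, the following is a minimal sketch of building a log-odds PWM from per-position base counts and scanning a sequence for its best-scoring window. It is not the benchmarking protocol of the study. The function names, the pseudocount of 1, the uniform background of 0.25 and the toy counts are all assumptions made for this example, and the input is assumed to contain only uppercase A, C, G and T.

```python
from math import log2

BASES = "ACGT"

def pwm_from_counts(counts, pseudocount=1.0, background=0.25):
    """Turn per-position base counts into log-odds scores (hypothetical helper)."""
    pwm = []
    for column in counts:
        total = sum(column.get(b, 0) for b in BASES) + len(BASES) * pseudocount
        pwm.append({
            b: log2(((column.get(b, 0) + pseudocount) / total) / background)
            for b in BASES
        })
    return pwm

def best_hit(pwm, seq):
    """Return (start, score) of the highest-scoring window, or None if seq is too short."""
    width = len(pwm)
    best = None
    for start in range(len(seq) - width + 1):
        # A PWM assumes independent positions, so the window score is a plain sum.
        score = sum(pwm[k][seq[start + k]] for k in range(width))
        if best is None or score > best[1]:
            best = (start, score)
    return best

# Toy motif "ACG" observed 8 times; the counts are invented for illustration.
toy_pwm = pwm_from_counts([{"A": 8}, {"C": 8}, {"G": 8}])
hit = best_hit(toy_pwm, "TTACGTT")
```

An all-against-all benchmark in the sense of the abstract would repeat such scoring for every PWM on every data set and summarize each pairing with a performance measure; that layer is omitted here.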

Source
DOI: http://dx.doi.org/10.1186/s13059-020-01996-3
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7212583
May 2020

PrediTALE: A novel model learned from quantitative data allows for new perspectives on TALE targeting.

PLoS Comput Biol 2019 07 11;15(7):e1007206. Epub 2019 Jul 11.

Institute of Computer Science, Martin Luther University Halle-Wittenberg, Halle, Germany.

Plant-pathogenic Xanthomonas bacteria secrete transcription activator-like effectors (TALEs) into host cells, where they act as transcriptional activators on plant target genes to support bacterial virulence. TALEs have a unique modular DNA-binding domain composed of tandem repeats. Two amino acids within each tandem repeat, termed repeat-variable diresidues (RVDs), bind to contiguous nucleotides on the DNA sequence and determine target specificity. In this paper, we propose a novel approach for TALE target prediction to identify potential virulence targets. Our approach accounts for recent findings concerning TALE targeting, including frame-shift binding by repeats of aberrant lengths, and the flexible strand orientation of target boxes relative to the transcription start of the downstream target gene. The computational model can account for dependencies between adjacent RVD positions. Model parameters are learned from the wealth of quantitative data that have been generated over the last years. We benchmark the novel approach, termed PrediTALE, using RNA-seq data after Xanthomonas infection in rice, and find an overall improvement of prediction performance compared with previous approaches. Using PrediTALE, we are able to predict several novel putative virulence targets. However, we also observe that no target genes are predicted by any prediction tool for several TALEs, which we term orphan TALEs for this reason. We postulate that one explanation for orphan TALEs is incomplete gene annotations and, hence, propose to replace promoterome-wide scans with genome-wide scans for target boxes. We demonstrate that known targets from promoterome-wide scans may be recovered by genome-wide scans, whereas the latter, combined with RNA-seq data, are able to detect putative targets independent of existing gene annotations.

Source
DOI: http://dx.doi.org/10.1371/journal.pcbi.1007206
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6650089
July 2019

DepLogo: visualizing sequence dependencies in R.

Bioinformatics 2019 11;35(22):4812-4814

Institute for Biosafety in Plant Biotechnology, Julius Kühn-Institut (JKI), Quedlinburg, Germany.

Summary: Statistical dependencies are present in a variety of sequence data, but are not discernible from traditional sequence logos. Here, we present the R package DepLogo for visualizing inter-position dependencies in aligned sequence data as dependency logos. Dependency logos make dependency structures, which correspond to regular co-occurrences of symbols at dependent positions, visually perceptible. To this end, sequences are partitioned based on their symbols at highly dependent positions as measured by mutual information, and each partition obtains its own visual representation. We illustrate the utility of the DepLogo package in several use cases generating dependency logos from DNA, RNA and protein sequences.
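
The idea of measuring dependence between alignment positions by mutual information and then partitioning sequences by the symbol at a dependent position can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the DepLogo implementation (which is an R package): the function names and the toy alignment are hypothetical, and mutual information is estimated from raw frequencies without any correction.

```python
from collections import Counter
from math import log2

def mutual_information(seqs, i, j):
    """Empirical mutual information (in bits) between alignment columns i and j."""
    n = len(seqs)
    count_i = Counter(s[i] for s in seqs)
    count_j = Counter(s[j] for s in seqs)
    count_ij = Counter((s[i], s[j]) for s in seqs)
    mi = 0.0
    for (a, b), c in count_ij.items():
        p_ab = c / n
        mi += p_ab * log2(p_ab / ((count_i[a] / n) * (count_j[b] / n)))
    return mi

def partition_by_symbol(seqs, i):
    """Group sequences by their symbol at column i (hypothetical helper)."""
    parts = {}
    for s in seqs:
        parts.setdefault(s[i], []).append(s)
    return parts

# Toy alignments, invented for illustration.
dependent = ["AC", "AC", "GT", "GT"]    # column 1 is fully determined by column 0
independent = ["AC", "AT", "GC", "GT"]  # columns vary independently
mi_dep = mutual_information(dependent, 0, 1)
mi_ind = mutual_information(independent, 0, 1)
parts = partition_by_symbol(dependent, 0)
```

In this toy case the dependent pair carries one bit of mutual information and the independent pair none, and each partition would receive its own visual block in a dependency logo.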

Availability And Implementation: The DepLogo R package is available from CRAN and its source code is available at https://github.com/Jstacs/DepLogo.

Supplementary Information: Supplementary data are available at Bioinformatics online.

Source
DOI: http://dx.doi.org/10.1093/bioinformatics/btz507
November 2019

GeMoMa: Homology-Based Gene Prediction Utilizing Intron Position Conservation and RNA-seq Data.

Methods Mol Biol 2019 ;1962:161-177

Institute of Computer Science, Martin Luther University Halle-Wittenberg, Halle (Saale), Germany.

GeMoMa is a homology-based gene prediction program that predicts gene models in target species based on gene models in evolutionary related reference species. GeMoMa utilizes amino acid sequence conservation, intron position conservation, and RNA-seq data to accurately predict protein-coding transcripts. Furthermore, GeMoMa supports the combination of predictions based on several reference species allowing to transfer high-quality annotation of different reference species to a target species. Here, we present a detailed description of GeMoMa modules and the GeMoMa pipeline and how they can be used on the command line to address particular biological problems.

Source
DOI: http://dx.doi.org/10.1007/978-1-4939-9173-0_9
August 2019

Transcriptional Reprogramming of Rice Cells by TALEs.

Front Plant Sci 2019 25;10:162. Epub 2019 Feb 25.

Department of Plant Biotechnology, Institute of Plant Genetics, Leibniz Universität Hannover, Hanover, Germany.

Rice-pathogenic bacteria cause severe harvest loss and challenge a stable food supply. The pathogen virulence relies strongly on bacterial TALE (transcription activator-like effector) proteins that function as transcriptional activators inside the plant cell. To understand the plant targets of TALEs, we determined the genome sequences of the Indian Xanthomonas oryzae pv. oryzae (Xoo) type strain ICMP 3125 and the strain PXO142 from the Philippines. Their complete TALE repertoire was analyzed and genome-wide TALE targets in rice were characterized. Integrating computational target predictions and rice transcriptomics data, we were able to verify 12 specifically induced target rice genes. The TALEs of the strains were reconstructed and expressed in a TALE-free strain to attribute specific induced genes to individual TALEs. Using reporter assays, we could show that individual TALEs act directly on their target promoters. In particular, we show that TALE classes assigned by AnnoTALE reflect common target genes, and that TALE classes of Xoo and the related pathogen Xanthomonas oryzae pv. oryzicola share more common target genes than previously believed. Taken together, we establish a detailed picture of TALE-induced plant processes that significantly expands our understanding of virulence strategies and will facilitate the development of novel resistances to overcome this important rice disease.

Source
DOI: http://dx.doi.org/10.3389/fpls.2019.00162
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6397873
February 2019

Allele specific chromatin signals, 3D interactions, and motif predictions for immune and B cell related diseases.

Sci Rep 2019 02 25;9(1):2695. Epub 2019 Feb 25.

Science for Life Laboratory, Department of Immunology, Genetics and Pathology, Uppsala University, Uppsala, Sweden.

Several Genome Wide Association Studies (GWAS) have reported variants associated to immune diseases. However, the identified variants are rarely the drivers of the associations and the molecular mechanisms behind the genetic contributions remain poorly understood. ChIP-seq data for TFs and histone modifications provide snapshots of protein-DNA interactions allowing the identification of heterozygous SNPs showing significant allele specific signals (AS-SNPs). AS-SNPs can change a TF binding site resulting in altered gene regulation and are primary candidates to explain associations observed in GWAS and expression studies. We identified 17,293 unique AS-SNPs across 7 lymphoblastoid cell lines. In this set of cell lines we interrogated 85% of common genetic variants in the population for potential regulatory effect and we identified 237 AS-SNPs associated to immune GWAS traits and 714 to gene expression in B cells. To elucidate possible regulatory mechanisms we integrated long-range 3D interactions data to identify putative target genes and motif predictions to identify TFs whose binding may be affected by AS-SNPs yielding a collection of 173 AS-SNPs associated to gene expression and 60 to B cell related traits. We present a systems strategy to find functional gene regulatory variants, the TFs that bind differentially between alleles and novel strategies to detect the regulated genes.

Source
DOI: http://dx.doi.org/10.1038/s41598-019-39633-0
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6389883
February 2019

Accurate prediction of cell type-specific transcription factor binding.

Genome Biol 2019 01 10;20(1). Epub 2019 Jan 10.

Institute of Computer Science, Martin Luther University Halle-Wittenberg, Von-Seckendorff-Platz 1, Halle (Saale), 06120, Germany.

Prediction of cell type-specific, in vivo transcription factor binding sites is one of the central challenges in regulatory genomics. Here, we present our approach that earned a shared first rank in the "ENCODE-DREAM in vivo Transcription Factor Binding Site Prediction Challenge" in 2017. In post-challenge analyses, we benchmark the influence of different feature sets and find that chromatin accessibility and binding motifs are sufficient to yield state-of-the-art performance. Finally, we provide 682 lists of predicted peaks for a total of 31 transcription factors in 22 primary cell types and tissues and a user-friendly version of our approach, Catchitt, for download.

Source
DOI: http://dx.doi.org/10.1186/s13059-018-1614-y
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6327544
January 2019

A conserved motif promotes HpaB-regulated export of type III effectors from Xanthomonas.

Mol Plant Pathol 2018 11 16;19(11):2473-2487. Epub 2018 Oct 16.

Institute for Biology, Department of Genetics, Martin Luther University Halle-Wittenberg, Halle (Saale), 06120, Germany.

The type III secretion (T3S) system, an essential pathogenicity factor in most Gram-negative plant-pathogenic bacteria, injects bacterial effector proteins directly into the plant cell cytosol. Here, the type III effectors (T3Es) manipulate host cell processes to suppress defence and establish appropriate conditions for bacterial multiplication in the intercellular spaces of the plant tissue. T3E export depends on a secretion signal which is also present in 'non-effectors'. The latter are secreted extracellular components of the T3S apparatus, but are not translocated into the plant cell. How the T3S system discriminates between T3Es and non-effectors is still enigmatic. Previously, we have identified a putative translocation motif (TrM) in several T3Es from Xanthomonas campestris pv. vesicatoria (Xcv). Here, we analysed the TrM of the Xcv effector XopB in detail. Mutation studies showed that the proline/arginine-rich motif is required for efficient type III-dependent secretion and translocation of XopB and determines the dependence of XopB transport on the general T3S chaperone HpaB. Similar results were obtained for other effectors from Xcv. As the arginine residues of the TrM mediate specific binding of XopB to cardiolipin, one of the major lipid components in Xanthomonas membranes, we assume that the association of T3Es to the bacterial membrane prior to secretion supports type III-dependent export.

Source
DOI: http://dx.doi.org/10.1111/mpp.12725
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6638074
November 2018

Combining RNA-seq data and homology-based gene prediction for plants, animals and fungi.

BMC Bioinformatics 2018 05 30;19(1):189. Epub 2018 May 30.

Institute of Computer Science, Martin Luther University Halle-Wittenberg, Halle (Saale), D-06120, Germany.

Background: Genome annotation is of key importance in many research questions. The identification of protein-coding genes is often based on transcriptome sequencing data, ab-initio or homology-based prediction. Recently, it was demonstrated that intron position conservation improves homology-based gene prediction, and that experimental data improves ab-initio gene prediction.

Results: Here, we present an extension of the gene prediction program GeMoMa that utilizes amino acid sequence conservation, intron position conservation and optionally RNA-seq data for homology-based gene prediction. We show on published benchmark data for plants, animals and fungi that GeMoMa performs better than the gene prediction programs BRAKER1, MAKER2, and CodingQuarry, and purely RNA-seq-based pipelines for transcript identification. In addition, we demonstrate that using multiple reference organisms may help to further improve the performance of GeMoMa. Finally, we apply GeMoMa to four nematode species and to the recently published barley reference genome indicating that current annotations of protein-coding genes may be refined using GeMoMa predictions.

Conclusions: GeMoMa might be of great utility for annotating newly sequenced genomes but also for finding homologs of a specific gene or gene family. GeMoMa has been published under GNU GPL3 and is freely available at http://www.jstacs.de/index.php/GeMoMa .

Source
DOI: http://dx.doi.org/10.1186/s12859-018-2203-5
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5975413
May 2018

Evolution of Transcription Activator-Like Effectors in Xanthomonas oryzae.

Genome Biol Evol 2017 06;9(6):1599-1615

Institute of Computer Science, Martin Luther University Halle-Wittenberg, Halle (Saale), Germany.

Transcription activator-like effectors (TALEs) are secreted by plant-pathogenic Xanthomonas bacteria into plant cells where they act as transcriptional activators and, hence, are major drivers in reprogramming the plant for the benefit of the pathogen. TALEs possess a highly repetitive DNA-binding domain of typically 34 amino acid (AA) tandem repeats, where AA 12 and 13, termed repeat variable di-residue (RVD), determine target specificity. Different Xanthomonas strains possess different repertoires of TALEs. Here, we study the evolution of TALEs from the level of RVDs determining target specificity down to the level of DNA sequence with focus on rice-pathogenic Xanthomonas oryzae pv. oryzae (Xoo) and Xanthomonas oryzae pv. oryzicola (Xoc) strains. We observe that codon pairs coding for individual RVDs are conserved to a similar degree as the flanking repeat sequence. We find strong indications that TALEs may evolve 1) by base substitutions in codon pairs coding for RVDs, 2) by recombination of N-terminal or C-terminal regions of existing TALEs, or 3) by deletion of individual TALE repeats, and we propose possible mechanisms. We find indications that the reassortment of TALE genes in clusters is mediated by an integron-like mechanism in Xoc. We finally study the effect of the presence/absence and evolutionary modifications of TALEs on transcriptional activation of putative target genes in rice, and find that even single RVD swaps may lead to considerable differences in activation. This correlation allowed a refined prediction of TALE targets, which is the crucial step to decipher their virulence activity.

Source
DOI: http://dx.doi.org/10.1093/gbe/evx108
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5512977
June 2017

Dissection of TALE-dependent gene activation reveals that they induce transcription cooperatively and in both orientations.

PLoS One 2017 16;12(3):e0173580. Epub 2017 Mar 16.

Institute of Plant Genetics, Leibniz Universität Hannover, Hannover, Germany.

Plant-pathogenic Xanthomonas bacteria inject transcription activator-like effector proteins (TALEs) into host cells to specifically induce transcription of plant genes and enhance susceptibility. Although the DNA-binding mode is well understood, it is still unclear how TALEs initiate transcription and whether additional promoter elements are needed to support this. To systematically dissect prerequisites for transcriptional initiation, the activity of one TALE was compared on different synthetic Bs4 promoter fragments. In addition, a large collection of artificial TALEs spanning the OsSWEET14 promoter was compared. We show that the presence of a TALE alone is not sufficient to initiate transcription, suggesting the requirement of additional supporting promoter elements. At the OsSWEET14 promoter, TALEs can initiate transcription from various positions, in a synergistic manner when multiple TALEs bind in parallel to the promoter, and even by binding in reverse orientation. TALEs are known to shift the transcriptional start site, but our data show that this shift depends on the individual position of a TALE within a promoter context. Our results imply that TALEs function like classical enhancer-binding proteins and initiate transcription in both orientations, which has consequences for in planta target gene prediction and design of artificial activators.

Source
PLOS: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0173580
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5354296
September 2017

InMoDe: tools for learning and visualizing intra-motif dependencies of DNA binding sites.

Bioinformatics 2017 02;33(4):580-582

Institute of Computer Science, Martin Luther University Halle-Wittenberg, Halle (Saale), Germany.

Summary: Recent studies have shown that the traditional position weight matrix model is often insufficient for modeling transcription factor binding sites, as intra-motif dependencies play a significant role for an accurate description of binding motifs. Here, we present the Java application InMoDe, a collection of tools for learning, leveraging and visualizing such dependencies of putative higher order. The distinguishing feature of InMoDe is a robust model selection from a class of parsimonious models, taking into account dependencies only if justified by the data while choosing for simplicity otherwise.

Availability And Implementation: InMoDe is implemented in Java and is available as command line application, as application with a graphical user-interface, and as an integration into Galaxy on the project website at http://www.jstacs.de/index.php/InMoDe .

Contact: ralf.eggeling@cs.helsinki.fi.

Source
DOI: http://dx.doi.org/10.1093/bioinformatics/btw689
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5408807
February 2017

Auxin-induced expression divergence between Arabidopsis species may originate within the TIR1/AFB-AUX/IAA-ARF module.

J Exp Bot 2017 01;68(3):539-552

Institute of Agricultural and Nutritional Sciences, Martin Luther University Halle-Wittenberg, Betty-Heimann, Halle (Saale), Germany.

Auxin is an essential regulator of plant growth and development, and auxin signaling components are conserved among land plants. Yet, a remarkable degree of natural variation in physiological and transcriptional auxin responses has been described among Arabidopsis thaliana accessions. As intraspecies comparisons offer only limited genetic variation, we here inspect the variation of auxin responses between A. thaliana and A. lyrata. This approach allowed the identification of conserved auxin response genes including novel genes with potential relevance for auxin biology. Furthermore, promoter divergences were analyzed for putative sources of variation. De novo motif discovery identified novel and variants of known elements with potential relevance for auxin responses, emphasizing the complex, and yet elusive, code of element combinations accounting for the diversity in transcriptional auxin responses. Furthermore, network analysis revealed correlations of interspecies differences in the expression of AUX/IAA gene clusters and classic auxin-related genes. We conclude that variation in general transcriptional and physiological auxin responses may originate substantially from functional or transcriptional variations in the TIR1/AFB, AUX/IAA, and ARF signaling network. In that respect, AUX/IAA gene expression divergence potentially reflects differences in the manner in which different species transduce identical auxin signals into gene expression responses.

Source
DOI: http://dx.doi.org/10.1093/jxb/erw457
January 2017

Using intron position conservation for homology-based gene prediction.

Nucleic Acids Res 2016 05 17;44(9):e89. Epub 2016 Feb 17.

Institute for Biosafety in Plant Biotechnology, Julius Kühn-Institut (JKI) - Federal Research Centre for Cultivated Plants, D-06484 Quedlinburg, Germany.

Annotation of protein-coding genes is very important in bioinformatics and biology and has a decisive influence on many downstream analyses. Homology-based gene prediction programs allow for transferring knowledge about protein-coding genes from an annotated organism to an organism of interest. Here, we present a homology-based gene prediction program called GeMoMa. GeMoMa utilizes the conservation of intron positions within genes to predict related genes in other organisms. We assess the performance of GeMoMa and compare it with state-of-the-art competitors on plant and animal genomes using an extended best reciprocal hit approach. We find that GeMoMa often makes more precise predictions than its competitors yielding a substantially increased number of correct transcripts. Subsequently, we exemplarily validate GeMoMa predictions using Sanger sequencing. Finally, we use RNA-seq data to compare the predictions of homology-based gene prediction programs, and find again that GeMoMa performs well. Hence, we conclude that exploiting intron position conservation improves homology-based gene prediction, and we make GeMoMa freely available as command-line tool and Galaxy integration.

Source
DOI: http://dx.doi.org/10.1093/nar/gkw092
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4872089
May 2016

AnnoTALE: bioinformatics tools for identification, annotation, and nomenclature of TALEs from Xanthomonas genomic sequences.

Sci Rep 2016 Feb 15;6:21077. Epub 2016 Feb 15.

Department of Genetics, Martin Luther University Halle-Wittenberg, Weinbergweg 10, D-06120 Halle (Saale), Germany.

Transcription activator-like effectors (TALEs) are virulence factors, produced by the bacterial plant-pathogen Xanthomonas, that function as gene activators inside plant cells. Although the contribution of individual TALEs to infectivity has been shown, the specific roles of most TALEs, and the overall TALE diversity in Xanthomonas spp. is not known. TALEs possess a highly repetitive DNA-binding domain, which is notoriously difficult to sequence. Here, we describe an improved method for characterizing TALE genes by the use of PacBio sequencing. We present 'AnnoTALE', a suite of applications for the analysis and annotation of TALE genes from Xanthomonas genomes, and for grouping similar TALEs into classes. Based on these classes, we propose a unified nomenclature for Xanthomonas TALEs that reveals similarities pointing to related functionalities. This new classification enables us to compare related TALEs and to identify base substitutions responsible for the evolution of TALE specificities.

Source
DOI: http://dx.doi.org/10.1038/srep21077
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4753510
February 2016

DiffLogo: a comparative visualization of sequence motifs.

BMC Bioinformatics 2015 Nov 17;16:387. Epub 2015 Nov 17.

Institute of Computer Science, Martin Luther University Halle-Wittenberg, Halle (Saale), Germany.

Background: For three decades, sequence logos have been the de facto standard for the visualization of sequence motifs in biology and bioinformatics. Reasons for this success story are their simplicity and clarity. The number of inferred and published motifs grows with the number of data sets and motif extraction algorithms. Hence, it becomes more and more important to perceive differences between motifs. However, motif differences are hard to detect from individual sequence logos in the case of multiple motifs for one transcription factor, highly similar binding motifs of different transcription factors, or multiple motifs for one protein domain.

Results: Here, we present DiffLogo, a freely available, extensible, and user-friendly R package for visualizing motif differences. DiffLogo is capable of showing differences between DNA motifs as well as protein motifs in a pair-wise manner resulting in publication-ready figures. In case of more than two motifs, DiffLogo is capable of visualizing pair-wise differences in a tabular form. Here, the motifs are ordered by similarity, and the difference logos are colored for clarity. We demonstrate the benefit of DiffLogo on CTCF motifs from different human cell lines, on E-box motifs of three basic helix-loop-helix transcription factors as examples for comparison of DNA motifs, and on F-box domains from three different families as example for comparison of protein motifs.

Conclusions: DiffLogo provides an intuitive visualization of motif differences. It enables the illustration and investigation of differences between highly similar motifs such as binding patterns of transcription factors for different cell types, treatments, and algorithmic approaches.

Source
DOI: http://dx.doi.org/10.1186/s12859-015-0767-x
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4650857
November 2015

Genome-Wide Identification and Validation of Reference Genes in Infected Tomato Leaves for Quantitative RT-PCR Analyses.

PLoS One 2015 27;10(8):e0136499. Epub 2015 Aug 27.

Institute for Biology, Department of Genetics, Martin Luther University Halle-Wittenberg, Halle (Saale), Germany.

The Gram-negative bacterium Xanthomonas campestris pv. vesicatoria (Xcv) causes bacterial spot disease of pepper and tomato by direct translocation of type III effector proteins into the plant cell cytosol. Once in the plant cell the effectors interfere with host cell processes and manipulate the plant transcriptome. Quantitative RT-PCR (qRT-PCR) is usually the method of choice to analyze transcriptional changes of selected plant genes. Reliable results depend, however, on measuring stably expressed reference genes that serve as internal normalization controls. We identified the most stably expressed tomato genes based on microarray analyses of Xcv-infected tomato leaves and evaluated the reliability of 11 genes for qRT-PCR studies in comparison to four traditionally employed reference genes. Three different statistical algorithms, geNorm, NormFinder and BestKeeper, concordantly determined the superiority of the newly identified reference genes. The most suitable reference genes encode proteins with homology to PHD finger family proteins and the U6 snRNA-associated protein LSm7. In addition, we identified pepper orthologs and validated several genes as reliable normalization controls for qRT-PCR analysis of Xcv-infected pepper plants. The newly identified reference genes will be beneficial for future qRT-PCR studies of the Xcv-tomato and Xcv-pepper pathosystems, as well as for the identification of suitable normalization controls for qRT-PCR studies of other plant-pathogen interactions, especially, if related plant species are used in combination with bacterial pathogens.

Source
PLOS: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0136499
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4552032
May 2016

Varying levels of complexity in transcription factor binding motifs.

Nucleic Acids Res 2015 Oct 26;43(18):e119. Epub 2015 Jun 26.

Institute of Computer Science, Martin Luther University Halle-Wittenberg, D-06099 Halle (Saale), Germany.

Binding of transcription factors to DNA is one of the keystones of gene regulation. The existence of statistical dependencies between binding site positions is widely accepted, while their relevance for computational predictions has been debated. Building probabilistic models of binding sites that may capture dependencies is still challenging, since the most successful motif discovery approaches require numerical optimization techniques, which are not suited for selecting dependency structures. To overcome this issue, we propose sparse local inhomogeneous mixture (Slim) models that combine putative dependency structures in a weighted manner allowing for numerical optimization of dependency structure and model parameters simultaneously. We find that Slim models yield a substantially better prediction performance than previous models on genomic context protein binding microarray data sets and on ChIP-seq data sets. To elucidate the reasons for the improved performance, we develop dependency logos, which allow for visual inspection of dependency structures within binding sites. We find that the dependency structures discovered by Slim models are highly diverse and highly transcription factor-specific, which emphasizes the need for flexible dependency models. The observed dependency structures range from broad heterogeneities to sparse dependencies between neighboring and non-neighboring binding site positions.

Source
DOI: http://dx.doi.org/10.1093/nar/gkv577
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4605289
October 2015

PRROC: computing and visualizing precision-recall and receiver operating characteristic curves in R.

Bioinformatics 2015 Aug 24;31(15):2595-7. Epub 2015 Mar 24.

Institute for Biosafety in Plant Biotechnology, Julius Kühn-Institut (JKI) - Federal Research Centre for Cultivated Plants, Quedlinburg, Germany.

Precision-recall (PR) and receiver operating characteristic (ROC) curves are valuable measures of classifier performance. Here, we present the R-package PRROC, which allows for computing and visualizing both PR and ROC curves. In contrast to available R-packages, PRROC allows for computing PR and ROC curves and areas under these curves for soft-labeled data using a continuous interpolation between the points of PR curves. In addition, PRROC provides a generic plot function for generating publication-quality graphics of PR and ROC curves.

Source
DOI: http://dx.doi.org/10.1093/bioinformatics/btv153
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4514923
August 2015

Area under precision-recall curves for weighted and unweighted data.

PLoS One 2014 Mar 20;9(3):e92209. Epub 2014 Mar 20.

Institute of Computer Science, Martin Luther University Halle-Wittenberg, Halle (Saale), Germany.

Precision-recall curves are highly informative about the performance of binary classifiers, and the area under these curves is a popular scalar performance measure for comparing different classifiers. However, for many applications class labels are not provided with absolute certainty, but with some degree of confidence, often reflected by weights or soft labels assigned to data points. Computing the area under the precision-recall curve requires interpolating between adjacent supporting points, but previous interpolation schemes are not directly applicable to weighted data. Hence, even in cases where weights were available, they had to be neglected for assessing classifiers using precision-recall curves. Here, we propose an interpolation for precision-recall curves that can also be used for weighted data, and we derive conditions for classification scores yielding the maximum and minimum area under the precision-recall curve. We investigate similarities and differences between the proposed interpolation and previous ones, and we demonstrate that taking into account existing weights of test data is important for the comparison of classifiers.
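
As a rough illustration of how soft labels enter precision and recall, the following Python sketch treats each weight as the confidence that a data point is positive. Once a point's score passes the threshold, it contributes its weight to the true positives and the remainder to the false positives. The function name `weighted_pr_points` and this reading of the weights are assumptions made for illustration. The sketch only produces the supporting points of the curve; it does not implement the interpolation proposed in the article.

```python
from typing import List, Tuple


def weighted_pr_points(scores: List[float], weights: List[float]) -> List[Tuple[float, float]]:
    """Return (recall, precision) points for soft-labeled data.

    Assumption: weights[i] in [0, 1] is the confidence that item i is positive.
    With weights restricted to 0 and 1 this reduces to the usual hard-label case.
    """
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    total_pos = sum(weights)
    if total_pos <= 0:
        raise ValueError("at least some positive weight is required")

    # Visit data points from the highest to the lowest classification score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = 0.0  # weighted true positives above the current threshold
    fp = 0.0  # weighted false positives above the current threshold
    points = []
    i = 0
    while i < len(order):
        threshold = scores[order[i]]
        # Points with tied scores cross the threshold together.
        while i < len(order) and scores[order[i]] == threshold:
            w = weights[order[i]]
            tp += w
            fp += 1.0 - w
            i += 1
        points.append((tp / total_pos, tp / (tp + fp)))
    return points


if __name__ == "__main__":
    for recall, precision in weighted_pr_points([0.9, 0.8, 0.3], [1.0, 0.5, 0.0]):
        print(f"recall={recall:.3f} precision={precision:.3f}")
```

Turning these supporting points into an area still requires choosing an interpolation between them, which is the question the article addresses.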

Source
PLOS: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0092209
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3961324
December 2014

A TAL effector repeat architecture for frameshift binding.

Nat Commun 2014 Mar 11;5:3447. Epub 2014 Mar 11.

Department of Genetics, Martin Luther University Halle-Wittenberg, Weinbergweg 10, D-06120 Halle (Saale), Germany.

Transcription activator-like effectors (TALEs) are important Xanthomonas virulence factors that bind DNA via a unique tandem 34-amino-acid repeat domain to induce expression of plant genes. So far, TALE repeats are described to bind as a consecutive array to a consecutive DNA sequence, in which each repeat independently recognizes a single DNA base. This modular protein architecture enables the design of any desired DNA-binding specificity for biotechnology applications. Here we report that natural TALE repeats of unusual amino-acid sequence length break the strict one repeat-to-one base pair binding mode and introduce a local flexibility to TALE-DNA binding. This flexibility allows TALEs and TALE nucleases to recognize target sequence variants with single nucleotide deletions. The flexibility also allows TALEs to activate transcription at allelic promoters that otherwise confer resistance to the host plant.

Source
DOI: http://dx.doi.org/10.1038/ncomms4447
March 2014

A general approach for discriminative de novo motif discovery from high-throughput data.

Nucleic Acids Res 2013 Nov 20;41(21):e197. Epub 2013 Sep 20.

Institute of Computer Science, Martin Luther University Halle-Wittenberg, D-06099 Halle, Saale, Germany; Institute for Biosafety in Plant Biotechnology, Julius Kühn-Institut (JKI) - Federal Research Centre for Cultivated Plants, D-06484 Quedlinburg, Germany; and Department of Molecular Genetics, Leibniz Institute of Plant Genetics and Crop Plant Research (IPK), D-06466 Seeland OT Gatersleben, Germany.

De novo motif discovery has been an important challenge of bioinformatics for the past two decades. Since the emergence of high-throughput techniques like ChIP-seq, ChIP-exo and protein-binding microarrays (PBMs), the focus of de novo motif discovery has shifted to runtime and accuracy on large data sets. For this purpose, specialized algorithms have been designed for discovering motifs in ChIP-seq or PBM data. However, none of the existing approaches work perfectly for all three high-throughput techniques. In this article, we propose Dimont, a general approach for fast and accurate de novo motif discovery from high-throughput data. We demonstrate that Dimont yields a higher number of correct motifs from ChIP-seq data than any of the specialized approaches and achieves a higher accuracy for predicting PBM intensities from probe sequence than any of the approaches specifically designed for that purpose. Dimont also reports the expected motifs for several ChIP-exo data sets. Investigating differences between in vitro and in vivo binding, we find that for most transcription factors, the motifs discovered by Dimont are in good accordance between techniques, but we also find notable exceptions. We also observe that modeling intra-motif dependencies may increase accuracy, which indicates that more complex motif models are a worthwhile field of research.

Source
DOI: http://dx.doi.org/10.1093/nar/gkt831
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3834837
November 2013

TALENoffer: genome-wide TALEN off-target prediction.

Bioinformatics 2013 Nov 30;29(22):2931-2. Epub 2013 Aug 30.

Institute of Computer Science and Department of Genetics, Institute of Biology, Martin Luther University Halle-Wittenberg, D-06099 Halle (Saale), Germany.

Summary: Transcription activator-like effector nucleases (TALENs) have become an accepted tool for targeted mutagenesis, but undesired off-targets remain an important issue. We present TALENoffer, a novel tool for the genome-wide prediction of TALEN off-targets. We show that TALENoffer successfully predicts known off-targets of engineered TALENs and yields a competitive runtime, scanning complete mammalian genomes within a few minutes.

Availability: TALENoffer is available as a command line program from http://www.jstacs.de/index.php/TALENoffer and as a Galaxy server at http://galaxy.informatik.uni-halle.de.

Contact: grau@informatik.uni-halle.de

Source
DOI: http://dx.doi.org/10.1093/bioinformatics/btt501
November 2013

Computational predictions provide insights into the biology of TAL effector target sites.

PLoS Comput Biol 2013 Mar 14;9(3):e1002962. Epub 2013 Mar 14.

Institute of Computer Science, Martin Luther University Halle-Wittenberg, Halle (Saale), Germany.

Transcription activator-like (TAL) effectors are injected into host plant cells by Xanthomonas bacteria to function as transcriptional activators for the benefit of the pathogen. The DNA binding domain of TAL effectors is composed of conserved amino acid repeat structures containing repeat-variable diresidues (RVDs) that determine DNA binding specificity. In this paper, we present TALgetter, a new approach for predicting TAL effector target sites based on a statistical model. In contrast to previous approaches, the parameters of TALgetter are estimated from training data computationally. We demonstrate that TALgetter successfully predicts known TAL effector target sites and often yields a greater number of predictions that are consistent with up-regulation in gene expression microarrays than an existing approach, Target Finder of the TALE-NT suite. We study the binding specificities estimated by TALgetter and confirm that different RVDs are differently important for transcriptional activation. In subsequent studies, the predictions of TALgetter indicate a previously unreported positional preference of TAL effector target sites relative to the transcription start site. In addition, several TAL effectors are predicted to bind to the TATA-box, which might constitute one general mode of transcriptional activation by TAL effectors. Scrutinizing the predicted target sites of TALgetter, we propose several novel TAL effector virulence targets in rice and sweet orange. TAL-mediated induction of the candidates is supported by gene expression microarrays. Validity of these targets is also supported by functional analogy to known TAL effector targets, by an over-representation of TAL effector targets with similar function, or by a biological function related to pathogen infection. Hence, these predicted TAL effector virulence targets are promising candidates for studying the virulence function of TAL effectors. TALgetter is implemented as part of the open-source Java library Jstacs, and is freely available as a web-application and a command line program.

Source
DOI: http://dx.doi.org/10.1371/journal.pcbi.1002962
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3597551
November 2013

Dispom: a discriminative de-novo motif discovery tool based on the Jstacs library.

J Bioinform Comput Biol 2013 Feb 21;11(1):1340006. Epub 2013 Jan 21.

Institute of Computer Science, Martin Luther University Halle-Wittenberg, D-06099 Halle/Saale, Germany.

DNA-binding proteins are a main component of gene regulation as they activate or repress gene expression by binding to specific binding sites in target regions of genomic DNA. However, de-novo discovery of these binding sites in target regions obtained by wet-lab experiments is a challenging problem in computational biology, which has not yet been solved satisfactorily. Here, we present a detailed description and analysis of the de-novo motif discovery tool Dispom, which has been developed for finding binding sites of DNA-binding proteins that are differentially abundant in a set of target regions compared to a set of control regions. Two additional features of Dispom are its capability of modeling positional preferences of binding sites and adjusting the length of the motif in the learning process. Dispom yields an increased prediction accuracy compared to existing tools for de-novo motif discovery, suggesting that the combination of searching for differentially abundant motifs, inferring their positional distributions, and adjusting the motif lengths is beneficial for de-novo motif discovery. When applying Dispom to promoters of auxin-responsive genes and those of ABI3 target genes from Arabidopsis thaliana, we identify relevant binding motifs with pronounced positional distributions. These results suggest that learning motifs, their positional distributions, and their lengths by a discriminative learning principle may aid motif discovery from ChIP-chip and gene expression data. We make Dispom freely available as part of Jstacs, an open-source Java library that is tailored to statistical sequence analysis. To facilitate extensions of Dispom, we describe its implementation using Jstacs in this manuscript. In addition, we provide a stand-alone application of Dispom at http://www.jstacs.de/index.php/Dispom for instant use.

Source
DOI: http://dx.doi.org/10.1142/S0219720013400064
February 2013

De-novo discovery of differentially abundant transcription factor binding sites including their positional preference.

PLoS Comput Biol 2011 Feb 10;7(2):e1001070. Epub 2011 Feb 10.

Molecular Genetics, Leibniz Institute of Plant Genetics and Crop Plant Research (IPK), Gatersleben, Germany.

Transcription factors are a main component of gene regulation as they activate or repress gene expression by binding to specific binding sites in promoters. The de-novo discovery of transcription factor binding sites in target regions obtained by wet-lab experiments is a challenging problem in computational biology, which has not been fully solved yet. Here, we present a de-novo motif discovery tool called Dispom for finding differentially abundant transcription factor binding sites that models existing positional preferences of binding sites and adjusts the length of the motif in the learning process. Evaluating Dispom, we find that its prediction performance is superior to existing tools for de-novo motif discovery for 18 benchmark data sets with planted binding sites, and for a metazoan compendium based on experimental data from micro-array, ChIP-chip, ChIP-DSL, and DamID as well as Gene Ontology data. Finally, we apply Dispom to find binding sites differentially abundant in promoters of auxin-responsive genes extracted from Arabidopsis thaliana microarray data, and we find a motif that can be interpreted as a refined auxin responsive element predominately positioned in the 250-bp region upstream of the transcription start site. Using an independent data set of auxin-responsive genes, we find in genome-wide predictions that the refined motif is more specific for auxin-responsive genes than the canonical auxin-responsive element. In general, Dispom can be used to find differentially abundant motifs in sequences of any origin. However, the positional distribution learned by Dispom is especially beneficial if all sequences are aligned to some anchor point like the transcription start site in case of promoter sequences. We demonstrate that the combination of searching for differentially abundant motifs and inferring a position distribution from the data is beneficial for de-novo motif discovery. Hence, we make the tool freely available as a component of the open-source Java framework Jstacs and as a stand-alone application at http://www.jstacs.de/index.php/Dispom.

Source
DOI: http://dx.doi.org/10.1371/journal.pcbi.1001070
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3037384
February 2011

Probabilistic approaches to transcription factor binding site prediction.

Methods Mol Biol 2010;674:97-119

Institute of Computer Science, Martin Luther University, Halle-Wittenberg, Germany.

Many different computer programs for the prediction of transcription factor binding sites have been developed over the last decades. These programs differ from each other by pursuing different objectives and by taking into account different sources of information. For methods based on statistical approaches, these programs differ at an elementary level from each other by the statistical models used for individual binding sites and flanking sequences and by the learning principles employed for estimating the model parameters. According to our experience, both the models and the learning principles should be chosen with great care, depending on the specific task at hand, but many existing programs do not allow the user to choose them freely. Hence, we developed Jstacs, an object-oriented Java framework for sequence analysis, which allows the user to combine different statistical models and different learning principles in a modular manner with little effort. In this chapter we explain how Jstacs can be used for the recognition of transcription factor binding sites.
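
To make the idea of a statistical model for a binding site concrete, here is a minimal Python sketch of one of the simplest such models: a position weight matrix scored against a uniform background. It is a generic illustration and does not use the Jstacs API. The function name `log_odds_scores`, the toy matrix, and the uniform background probability of 0.25 are assumptions made for this example.

```python
import math
from typing import Dict, List


def log_odds_scores(sequence: str, pwm: List[Dict[str, float]], background: float = 0.25) -> List[float]:
    """Score every window of a DNA sequence with a position weight matrix.

    pwm[j][base] is the probability of `base` at motif position j.
    Assumption: all probabilities are strictly positive (e.g. after pseudocounts),
    and the background is uniform over A, C, G and T.
    """
    width = len(pwm)
    scores = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # Positions are treated as independent, so log-odds simply add up.
        score = sum(math.log(pwm[j][base] / background) for j, base in enumerate(window))
        scores.append(score)
    return scores


if __name__ == "__main__":
    # Toy two-position motif preferring "AG" (hypothetical values).
    toy_pwm = [
        {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
        {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    ]
    scores = log_odds_scores("TAGC", toy_pwm)
    best = max(range(len(scores)), key=lambda i: scores[i])
    print(f"best window starts at index {best} with score {scores[best]:.3f}")
```

Richer models replace the independence assumption with dependencies between positions, and the learning principle determines how the matrix entries are estimated from data.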

Source
DOI: http://dx.doi.org/10.1007/978-1-60761-854-6_7
December 2010

Apples and oranges: avoiding different priors in Bayesian DNA sequence analysis.

BMC Bioinformatics 2010 Mar 22;11:149. Epub 2010 Mar 22.

Molecular Genetics, Leibniz Institute of Plant Genetics and Crop Plant Research (IPK), Gatersleben, Germany.

Background: One of the challenges of bioinformatics remains the recognition of short signal sequences in genomic DNA such as donor or acceptor splice sites, splicing enhancers or silencers, translation initiation sites, transcription start sites, transcription factor binding sites, nucleosome binding sites, miRNA binding sites, or insulator binding sites. During the last decade, a wealth of algorithms for the recognition of such DNA sequences has been developed and compared with the goal of improving their performance and deepening our understanding of the underlying cellular processes. Most of these algorithms are based on statistical models belonging to the family of Markov random fields such as position weight matrix models, weight array matrix models, Markov models of higher order, or moral Bayesian networks. While in many comparative studies different learning principles or different statistical models have been compared, the influence of choosing different prior distributions for the model parameters when using different learning principles has been overlooked, which may have led to questionable conclusions.

Results: With the goal of allowing direct comparisons of different learning principles for models from the family of Markov random fields based on the same a-priori information, we derive a generalization of the commonly-used product-Dirichlet prior. We find that the derived prior behaves like a Gaussian prior close to the maximum and like a Laplace prior in the far tails. In two case studies, we illustrate the utility of the derived prior for a direct comparison of different learning principles with different models for the recognition of binding sites of the transcription factor Sp1 and human donor splice sites.

Conclusions: We find that comparisons of different learning principles using the same a-priori information can lead to conclusions different from those of previous studies in which the effect resulting from different priors has been neglected. The derived prior is implemented in the open-source library Jstacs to enable an easy application to comparative studies of different learning principles in the field of sequence analysis.

Source
DOI: http://dx.doi.org/10.1186/1471-2105-11-149
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2859755
March 2010

Unifying generative and discriminative learning principles.

BMC Bioinformatics 2010 Feb 22;11:98. Epub 2010 Feb 22.

Leibniz Institute of Plant Genetics and Crop Plant Research (IPK), Gatersleben, Germany.

Background: The recognition of functional binding sites in genomic DNA remains one of the fundamental challenges of genome research. During the last decades, a plethora of different and well-adapted models has been developed, but little attention has been paid to the development of different and similarly well-adapted learning principles. Only recently was it noticed that discriminative learning principles can be superior to generative ones in diverse bioinformatics applications, too.

Results: Here, we propose a generalization of generative and discriminative learning principles containing the maximum likelihood, maximum a posteriori, maximum conditional likelihood, maximum supervised posterior, generative-discriminative trade-off, and penalized generative-discriminative trade-off learning principles as special cases, and we illustrate its efficacy for the recognition of vertebrate transcription factor binding sites.

Conclusions: We find that the proposed learning principle helps to improve the recognition of transcription factor binding sites, enabling better computational approaches for extracting as much information as possible from valuable wet-lab data. We make all implementations available in the open-source library Jstacs so that this learning principle can be easily applied to other classification problems in the field of genome and epigenome analysis.
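
One way to picture how a single objective can contain several learning principles as special cases is a weighted sum of the conditional log-likelihood, the joint log-likelihood, and the log-prior. The Python sketch below follows that reading. The function name `unified_objective`, the weight vector `beta`, and the constraint that the weights are non-negative and sum to one are assumptions made for illustration, not details taken from the abstract.

```python
from typing import Tuple


def unified_objective(log_cond_lik: float, log_lik: float, log_prior: float,
                      beta: Tuple[float, float, float]) -> float:
    """Weighted combination of three terms evaluated at fixed model parameters.

    log_cond_lik: log P(classes | sequences, parameters)  (discriminative term)
    log_lik:      log P(sequences, classes | parameters)  (generative term)
    log_prior:    log of the prior density of the parameters
    beta:         hypothetical weights, assumed non-negative and summing to one
    """
    if min(beta) < 0 or abs(sum(beta) - 1.0) > 1e-9:
        raise ValueError("beta must be non-negative and sum to one")
    b_cond, b_lik, b_prior = beta
    return b_cond * log_cond_lik + b_lik * log_lik + b_prior * log_prior


# Weight choices whose maximizers coincide with familiar principles
# (scaling an objective by a positive constant does not move its maximum):
MAXIMUM_LIKELIHOOD = (0.0, 1.0, 0.0)
MAXIMUM_CONDITIONAL_LIKELIHOOD = (1.0, 0.0, 0.0)
MAXIMUM_A_POSTERIORI = (0.0, 0.5, 0.5)         # log-likelihood + log-prior, halved
MAXIMUM_SUPERVISED_POSTERIOR = (0.5, 0.0, 0.5)  # conditional log-likelihood + log-prior, halved

if __name__ == "__main__":
    # Toy values for the three terms (hypothetical numbers).
    print(unified_objective(-2.0, -10.0, -1.0, MAXIMUM_A_POSTERIORI))
```

In practice each term is a function of the model parameters, and the objective would be handed to a numerical optimizer rather than evaluated at a single point.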

Source
DOI: http://dx.doi.org/10.1186/1471-2105-11-98
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2848239
February 2010