Publications by authors named "Johannes Söding"

74 Publications

Protein Sequence Analysis Using the MPI Bioinformatics Toolkit.

Curr Protoc Bioinformatics 2020 12;72(1):e108

Department of Protein Evolution, Max Planck Institute for Developmental Biology, Tübingen, Germany.

The MPI Bioinformatics Toolkit (https://toolkit.tuebingen.mpg.de) provides interactive access to a wide range of the best-performing bioinformatics tools and databases, including the state-of-the-art protein sequence comparison methods HHblits and HHpred. The Toolkit currently includes 35 external and in-house tools, covering functionalities such as sequence similarity searching, prediction of sequence features, and sequence classification. Due to this breadth of functionality, the tight interconnection of its constituent tools, and its ease of use, the Toolkit has become an important resource for biomedical research and for teaching protein sequence analysis to students in the life sciences. In this article, we provide detailed information on utilizing the three most widely accessed tools within the Toolkit: HHpred for the detection of homologs, HHpred in conjunction with MODELLER for structure prediction and homology modeling, and CLANS for the visualization of relationships in large sequence datasets. © 2020 The Authors. Basic Protocol 1: Sequence similarity searching using HHpred Alternate Protocol: Pairwise sequence comparison using HHpred Support Protocol: Building a custom multiple sequence alignment using PSI-BLAST and forwarding it as input to HHpred Basic Protocol 2: Calculation of homology models using HHpred and MODELLER Basic Protocol 3: Cluster analysis using CLANS.

Source
DOI: http://dx.doi.org/10.1002/cpbi.108
December 2020

DescribePROT: database of amino acid-level protein structure and function predictions.

Nucleic Acids Res 2021 01;49(D1):D298-D308

Department of Computer Science, Virginia Commonwealth University, Richmond, VA, USA.

We present DescribePROT, the database of predicted amino acid-level descriptors of structure and function of proteins. DescribePROT delivers a comprehensive collection of 13 complementary descriptors predicted using 10 popular and accurate algorithms for 83 complete proteomes that cover key model organisms. The current version includes 7.8 billion predictions for close to 600 million amino acids in 1.4 million proteins. The descriptors encompass sequence conservation, position specific scoring matrix, secondary structure, solvent accessibility, intrinsic disorder, disordered linkers, signal peptides, MoRFs and interactions with proteins, DNA and RNAs. Users can search DescribePROT by the amino acid sequence and the UniProt accession number and entry name. The pre-computed results are made available instantaneously. The predictions can be accessed via an interactive graphical interface that allows simultaneous analysis of multiple descriptors and can also be downloaded in structured formats at the protein, proteome and whole database scale. The putative annotations included in DescribePROT are useful for a broad range of studies, including investigations of protein function, applied projects focusing on therapeutics and diseases, and the development of predictors for other protein sequence descriptors. Future releases will expand the coverage of DescribePROT. DescribePROT can be accessed at http://biomine.cs.vcu.edu/servers/DESCRIBEPROT/.

Source
DOI: http://dx.doi.org/10.1093/nar/gkaa931
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7778963
January 2021

A High-Throughput Screen for Transcription Activation Domains Reveals Their Sequence Features and Permits Prediction by Deep Learning.

Mol Cell 2020 06 15;78(5):890-902.e6. Epub 2020 May 15.

Fred Hutchinson Cancer Research Center, Seattle, WA, USA.

Acidic transcription activation domains (ADs) are encoded by a wide range of seemingly unrelated amino acid sequences, making it difficult to recognize features that promote their dynamic behavior, "fuzzy" interactions, and target specificity. We screened a large set of random 30-mer peptides for AD function in yeast and trained a deep neural network (ADpred) on the AD-positive and -negative sequences. ADpred identifies known acidic ADs within transcription factors and accurately predicts the consequences of mutations. Our work reveals that strong acidic ADs contain multiple clusters of hydrophobic residues near acidic side chains, explaining why ADs often have a biased amino acid composition. ADs likely use a binding mechanism similar to avidity where a minimum number of weak dynamic interactions are required between activator and target to generate biologically relevant affinity and in vivo function. This mechanism explains the basis for fuzzy binding observed between acidic ADs and targets.

Source
DOI: http://dx.doi.org/10.1016/j.molcel.2020.04.020
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7275923
June 2020

MetaEuk: sensitive, high-throughput gene discovery, and annotation for large-scale eukaryotic metagenomics.

Microbiome 2020 04 3;8(1):48. Epub 2020 Apr 3.

Quantitative and Computational Biology, Max-Planck Institute for Biophysical Chemistry, 37077, Göttingen, Germany.

Background: Metagenomics is revolutionizing the study of microorganisms and their involvement in biological, biomedical, and geochemical processes, allowing us to investigate by direct sequencing a tremendous diversity of organisms without the need for prior cultivation. Unicellular eukaryotes play essential roles in most microbial communities as chief predators, decomposers, phototrophs, bacterial hosts, symbionts, and parasites to plants and animals. Investigating their roles is therefore of great interest to ecology, biotechnology, human health, and evolution. However, the generally lower sequencing coverage, their more complex gene and genome architectures, and a lack of eukaryote-specific experimental and computational procedures have kept them on the sidelines of metagenomics.

Results: MetaEuk is a toolkit for high-throughput, reference-based discovery, and annotation of protein-coding genes in eukaryotic metagenomic contigs. It performs fast searches with 6-frame-translated fragments covering all possible exons and optimally combines matches into multi-exon proteins. We used a benchmark of seven diverse, annotated genomes to show that MetaEuk is highly sensitive even under conditions of low sequence similarity to the reference database. To demonstrate MetaEuk's power to discover novel eukaryotic proteins in large-scale metagenomic data, we assembled contigs from 912 samples of the Tara Oceans project. MetaEuk predicted >12,000,000 protein-coding genes in 8 days on ten 16-core servers. Most of the discovered proteins are highly diverged from known proteins and originate from very sparsely sampled eukaryotic supergroups.

Conclusion: The open-source (GPLv3) MetaEuk software (https://github.com/soedinglab/metaeuk) enables large-scale eukaryotic metagenomics through reference-based, sensitive taxonomic and functional annotation.
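
The six-frame translation mentioned in the Results can be illustrated with a short sketch. This is not MetaEuk's implementation, only a minimal Python example of the general idea, assuming the standard genetic code and an input restricted to A, C, G and T. The function names (`reverse_complement`, `translate`, `six_frame_translations`) are hypothetical and chosen for illustration.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (an assumption:
# MetaEuk itself may support other translation tables).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate complete codons; unknown codons become 'X', stops '*'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna: str) -> dict:
    """Translate both strands in all three reading frames."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for name, protein in six_frame_translations("ATGGCCTAA").items():
        print(name, protein)
```

A real pipeline would additionally split each frame at stop codons and keep only fragments above a minimum length before searching them against a reference database; those steps are omitted here.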

Source
DOI: http://dx.doi.org/10.1186/s40168-020-00808-x
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7126354
April 2020

Mechanisms for Active Regulation of Biomolecular Condensates.

Trends Cell Biol 2020 01 18;30(1):4-14. Epub 2019 Nov 18.

Max Planck Institute for Dynamics and Self-Organization, Am Fassberg 17, 37077 Göttingen, Germany.

Liquid-liquid phase separation is a key organizational principle in eukaryotic cells, on par with intracellular membranes. It allows cells to concentrate specific proteins into condensates, increasing reaction rates and achieving switch-like regulation. We propose two active mechanisms that can explain how cells regulate condensate formation and size. In both, the cell regulates the activity of an enzyme, often a kinase, that adds post-translational modifications to condensate proteins. In enrichment inhibition, the enzyme enriches in the condensate and weakens interactions, as seen in stress granules (SGs), Cajal bodies, and P granules. In localization-induction, condensates form around immobilized enzymes that strengthen interactions, as observed in DNA repair, transmembrane signaling, and microtubule assembly. These models can guide studies into the many emerging roles of biomolecular condensates.

Source
DOI: http://dx.doi.org/10.1016/j.tcb.2019.10.006
January 2020

HH-suite3 for fast remote homology detection and deep protein annotation.

BMC Bioinformatics 2019 Sep 14;20(1):473. Epub 2019 Sep 14.

Quantitative and Computational Biology Group, Max-Planck Institute for Biophysical Chemistry, Am Fassberg 11, 37077, Göttingen, Germany.

Background: HH-suite is a widely used open source software suite for sensitive sequence similarity searches and protein fold recognition. It is based on pairwise alignment of profile Hidden Markov models (HMMs), which represent multiple sequence alignments of homologous proteins.

Results: We developed a single-instruction multiple-data (SIMD) vectorized implementation of the Viterbi algorithm for profile HMM alignment and introduced various other speed-ups. These accelerated the search methods HHsearch by a factor of 4 and HHblits by a factor of 2 over the previous version 2.0.16. HHblits3 is ∼10× faster than PSI-BLAST and ∼20× faster than HMMER3. Jobs to perform HHsearch and HHblits searches with many query profile HMMs can be parallelized over cores and over cluster servers using OpenMP and message passing interface (MPI). The free, open-source, GPLv3-licensed software is available at https://github.com/soedinglab/hh-suite.

Conclusion: The added functionalities and increased speed of HHsearch and HHblits should facilitate their use in large-scale protein structure and function prediction, e.g. in metagenomics and genomics projects.

Source
DOI: http://dx.doi.org/10.1186/s12859-019-3019-7
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6744700
September 2019

Lysine/RNA-interactions drive and regulate biomolecular condensation.

Nat Commun 2019 07 2;10(1):2909. Epub 2019 Jul 2.

German Center for Neurodegenerative Diseases (DZNE), Von-Siebold-Strasse 3a, 37075, Göttingen, Germany.

Cells form and use biomolecular condensates to execute biochemical reactions. The molecular properties of non-membrane-bound condensates are directly connected to the amino acid content of disordered protein regions. Lysine plays an important role in cellular function, but little is known about its role in biomolecular condensation. Here we show that protein disorder is abundant in protein/RNA granules and lysine is enriched in disordered regions of proteins in P-bodies compared to the entire human disordered proteome. Lysine-rich polypeptides phase separate into lysine/RNA-coacervates that are more dynamic and differ at the molecular level from arginine/RNA-coacervates. Consistent with the ability of lysine to drive phase separation, lysine-rich variants of the Alzheimer's disease-linked protein tau undergo coacervation with RNA in vitro and bind to stress granules in cells. Acetylation of lysine reverses liquid-liquid phase separation and reduces colocalization of tau with stress granules. Our study establishes lysine as an important regulator of cellular condensation.

Source
DOI: http://dx.doi.org/10.1038/s41467-019-10792-y
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6606616
July 2019

Protein-level assembly increases protein sequence recovery from metagenomic samples manyfold.

Nat Methods 2019 07 24;16(7):603-606. Epub 2019 Jun 24.

Quantitative and Computational Biology Group, Max-Planck Institute for Biophysical Chemistry, Göttingen, Germany.

The open-source de novo protein-level assembler, Plass (https://plass.mmseqs.com), assembles six-frame-translated sequencing reads into protein sequences. It recovers 2-10 times more protein sequences from complex metagenomes and can assemble huge datasets. We assembled two redundancy-filtered reference protein catalogs, 2 billion sequences from 640 soil samples (soil reference protein catalog) and 292 million sequences from 775 marine eukaryotic metatranscriptomes (marine eukaryotic reference catalog), the largest free collections of protein sequences.

Source
DOI: http://dx.doi.org/10.1038/s41592-019-0437-4
July 2019

PROSSTT: probabilistic simulation of single-cell RNA-seq data for complex differentiation processes.

Bioinformatics 2019 09;35(18):3517-3519

Quantitative and Computational Biology, Max Planck Institute for Biophysical Chemistry, Göttingen, Germany.

Summary: Cellular lineage trees can be derived from single-cell RNA sequencing snapshots of differentiating cells. Currently, only datasets with simple topologies are available. To test and further develop tools for lineage tree reconstruction, we need test datasets with known complex topologies. PROSSTT can simulate scRNA-seq datasets for differentiation processes with lineage trees of any desired complexity, noise level, noise model and size. PROSSTT also provides scripts to quantify the quality of predicted lineage trees.

Availability And Implementation: https://github.com/soedinglab/prosstt.

Supplementary Information: Supplementary data are available at Bioinformatics online.

Source
DOI: http://dx.doi.org/10.1093/bioinformatics/btz078
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6748774
September 2019

MMseqs2 desktop and local web server app for fast, interactive sequence searches.

Bioinformatics 2019 08;35(16):2856-2858

Quantitative and Computational Biology, Max Planck Institute for Biophysical Chemistry, Göttingen, Germany.

Summary: The MMseqs2 desktop and web server app facilitates interactive sequence searches through custom protein sequence and profile databases on personal workstations. By eliminating MMseqs2's runtime overhead, we reduced response times to a few seconds at sensitivities close to BLAST.

Availability And Implementation: The app is easy to install for non-experts. GPLv3-licensed code, pre-built desktop app packages for Windows, MacOS and Linux, Docker images for the web server application and a demo web server are available at https://search.mmseqs.com.

Supplementary Information: Supplementary data are available at Bioinformatics online.

Source
DOI: http://dx.doi.org/10.1093/bioinformatics/bty1057
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6691333
August 2019

Bayesian multiple logistic regression for case-control GWAS.

PLoS Genet 2018 12 31;14(12):e1007856. Epub 2018 Dec 31.

Max Planck Institute for Biophysical Chemistry, Göttingen, Germany.

Genetic variants in genome-wide association studies (GWAS) are tested for disease association mostly using simple regression, one variant at a time. Standard approaches to improve power in detecting disease-associated SNPs use multiple regression with Bayesian variable selection in which a sparsity-enforcing prior on effect sizes is used to avoid overtraining and all effect sizes are integrated out for posterior inference. For binary traits, the logistic model has not yielded clear improvements over the linear model. For multi-SNP analysis, the logistic model required costly and technically challenging MCMC sampling to perform the integration. Here, we introduce the quasi-Laplace approximation to solve the integral and avoid MCMC sampling. We expect the logistic model to perform much better than multiple linear regression except when predicted disease risks are spread closely around 0.5, because only close to its inflection point can the logistic function be well approximated by a linear function. Indeed, in extensive benchmarks with simulated phenotypes and real genotypes, our Bayesian multiple LOgistic REgression method (B-LORE) showed considerable improvements (1) when regressing on many variants in multiple loci at heritabilities ≥ 0.4 and (2) for unbalanced case-control ratios. B-LORE also enables meta-analysis by approximating the likelihood functions of individual studies by multivariate normal distributions, using their means and covariance matrices as summary statistics. Our work should make sparse multiple logistic regression attractive also for other applications with binary target variables. B-LORE is freely available from: https://github.com/soedinglab/b-lore.
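
The remark that the logistic function is well approximated by a linear function only near its inflection point can be checked numerically. The sketch below is not part of B-LORE. It simply compares the logistic function with its first-order Taylor expansion at 0 (value 1/2, slope 1/4); the function names are hypothetical.

```python
import math


def logistic(x: float) -> float:
    """Logistic function mapping a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def linear_at_inflection(x: float) -> float:
    """First-order Taylor expansion of the logistic function at x = 0."""
    return 0.5 + x / 4.0


def approximation_error(x: float) -> float:
    """Absolute gap between the logistic curve and its linear surrogate."""
    return abs(logistic(x) - linear_at_inflection(x))


if __name__ == "__main__":
    for x in (0.0, 0.2, 1.0, 3.0):
        print(f"x={x:4.1f}  logistic={logistic(x):.4f}  "
              f"linear={linear_at_inflection(x):.4f}  "
              f"error={approximation_error(x):.4f}")
```

For predictors close to 0 (risks near 0.5) the two curves are nearly indistinguishable, while for larger predictors the linear surrogate leaves the interval [0, 1], which is consistent with the abstract's expectation about when the logistic model should help.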

Source
DOI: http://dx.doi.org/10.1371/journal.pgen.1007856
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6329526
December 2018

Synthetic protein alignments by CCMgen quantify noise in residue-residue contact prediction.

PLoS Comput Biol 2018 11 5;14(11):e1006526. Epub 2018 Nov 5.

Quantitative and Computational Biology Group, Max-Planck Institute for Biophysical Chemistry, Göttingen, Germany.

Compensatory mutations between protein residues in physical contact can manifest themselves as statistical couplings between the corresponding columns in a multiple sequence alignment (MSA) of the protein family. Conversely, large coupling coefficients predict residue contacts. Methods for de-novo protein structure prediction based on this approach are becoming increasingly reliable. Their main limitation is the strong systematic and statistical noise in the estimation of coupling coefficients, which has so far limited their application to very large protein families. While most research has focused on improving predictions by adding external information, little progress has been made to improve the statistical procedure at the core, because our lack of understanding of the sources of noise poses a major obstacle. First, we show theoretically that the expectation value of the coupling score assuming no coupling is proportional to the product of the square roots of the column entropies, and we propose a simple entropy bias correction (EntC) that subtracts out this expectation value. Second, we show that the average product correction (APC) includes the correction of the entropy bias, partly explaining its success. Third, we have developed CCMgen, the first method for simulating protein evolution and generating realistic synthetic MSAs with pairwise statistical residue couplings. Fourth, to learn exact statistical models that reliably reproduce observed alignment statistics, we developed CCMpredPy, an implementation of the persistent contrastive divergence (PCD) method for exact inference. Fifth, we demonstrate how CCMgen and CCMpredPy can facilitate the development of contact prediction methods by analysing the systematic noise contributions from phylogeny and entropy. Using the entropy bias correction, we can disentangle both sources of noise and find that entropy contributes roughly twice as much noise as phylogeny.
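
The average product correction (APC) mentioned above is commonly written as S_ij - (mean_i * mean_j) / mean, where mean_i is the average score of column i with all other columns and mean is the average over all column pairs. The following is a minimal pure-Python sketch of that widely used formula, not code from CCMgen or CCMpredPy. Taking the means over off-diagonal entries only is an assumption made here, and the function name is hypothetical.

```python
def average_product_correction(scores):
    """Apply APC to a symmetric matrix of coupling scores.

    `scores` is a list of n lists of n floats; the diagonal is ignored.
    Returns a new n x n matrix with zeros on the diagonal.
    """
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two alignment columns")
    # Mean score of each column with every other column (off-diagonal only).
    row_mean = [
        sum(scores[i][j] for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    ]
    # Mean over all off-diagonal entries equals the mean of the row means.
    overall = sum(row_mean) / n
    if overall == 0:
        raise ValueError("mean coupling score is zero; APC is undefined")
    corrected = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                corrected[i][j] = (
                    scores[i][j] - row_mean[i] * row_mean[j] / overall
                )
    return corrected


if __name__ == "__main__":
    raw = [
        [0.0, 1.0, 2.0],
        [1.0, 0.0, 3.0],
        [2.0, 3.0, 0.0],
    ]
    for row in average_product_correction(raw):
        print(row)
```

A matrix in which every pair has the same score is corrected to zero everywhere, which illustrates that APC removes background signal shared by whole columns and keeps only pair-specific excess.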

Source
DOI: http://dx.doi.org/10.1371/journal.pcbi.1006526
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6237422
November 2018

An analysis and evaluation of the WeFold collaborative for protein structure prediction and its pipelines in CASP11 and CASP12.

Sci Rep 2018 07 2;8(1):9939. Epub 2018 Jul 2.

Lawrence Berkeley National Laboratory, Berkeley, CA, USA.

Every two years, groups worldwide participate in the Critical Assessment of Protein Structure Prediction (CASP) experiment to blindly test the strengths and weaknesses of their computational methods. CASP has significantly advanced the field but many hurdles still remain, which may require new ideas and collaborations. In 2012, a web-based effort called WeFold was initiated to promote collaboration within the CASP community and attract researchers from other fields to contribute new ideas to CASP. Members of the WeFold coopetition (cooperation and competition) participated in CASP as individual teams, but also shared components of their methods to create hybrid pipelines and actively contributed to this effort. We assert that the scale and diversity of integrative prediction pipelines could not have been achieved by any individual lab or even by any collaboration among a few partners. The models contributed by the participating groups and generated by the pipelines are publicly available at the WeFold website, providing a wealth of data that remains to be tapped. Here, we analyze the results of the 2014 and 2016 pipelines, showing improvements according to the CASP assessment as well as areas that require further adjustments and research.

Source
DOI: http://dx.doi.org/10.1038/s41598-018-26812-8
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6028396
July 2018

Clustering huge protein sequence sets in linear time.

Nat Commun 2018 06 29;9(1):2542. Epub 2018 Jun 29.

Quantitative and Computational Biology group, Max-Planck Institute for Biophysical Chemistry, Am Fassberg 11, 37077, Göttingen, Germany.

Metagenomic datasets contain billions of protein sequences that could greatly enhance large-scale functional annotation and structure prediction. Utilizing this enormous resource would require reducing its redundancy by similarity clustering. However, clustering hundreds of millions of sequences is impractical using current algorithms because their runtimes scale as the input set size N times the number of clusters K, which is typically of similar order as N, resulting in runtimes that increase almost quadratically with N. We developed Linclust, the first clustering algorithm whose runtime scales as N, independent of K. It can also cluster datasets several times larger than the available main memory. We cluster 1.6 billion metagenomic sequence fragments in 10 h on a single server to 50% sequence identity, >1000 times faster than has been possible before. Linclust will help to unlock the great wealth contained in metagenomic and genomic sequence databases.

Source
DOI: http://dx.doi.org/10.1038/s41467-018-04964-5
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6026198
June 2018

The BaMM web server for de-novo motif discovery and regulatory sequence analysis.

Nucleic Acids Res 2018 07;46(W1):W215-W220

Quantitative and Computational Biology, Max Planck Institute for Biophysical Chemistry, Am Fassberg 11, 37077 Göttingen, Germany.

The BaMM web server offers four tools: (i) de-novo discovery of enriched motifs in a set of nucleotide sequences, (ii) scanning a set of nucleotide sequences with motifs to find motif occurrences, (iii) searching with an input motif for similar motifs in our BaMM database with motifs for >1000 transcription factors, trained from the GTRD ChIP-seq database and (iv) browsing and keyword searching the motif database. In contrast to most other servers, we represent sequence motifs not by position weight matrices (PWMs) but by Bayesian Markov Models (BaMMs) of order 4, which we showed previously to perform substantially better in ROC analyses than PWMs or first order models. To address the inadequacy of P- and E-values as measures of motif quality, we introduce the AvRec score, the average recall over the TP-to-FP ratio between 1 and 100. The BaMM server is freely accessible without registration at https://bammmotif.mpibpc.mpg.de.

Source
DOI: http://dx.doi.org/10.1093/nar/gky431
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6030882
July 2018

A Completely Reimplemented MPI Bioinformatics Toolkit with a New HHpred Server at its Core.

J Mol Biol 2018 07 16;430(15):2237-2243. Epub 2017 Dec 16.

Department of Protein Evolution, Max Planck Institute for Developmental Biology, Tübingen D-72076, Germany.

The MPI Bioinformatics Toolkit (https://toolkit.tuebingen.mpg.de) is a free, one-stop web service for protein bioinformatic analysis. It currently offers 34 interconnected external and in-house tools, whose functionality covers sequence similarity searching, alignment construction, detection of sequence features, structure prediction, and sequence classification. This breadth has made the Toolkit an important resource for experimental biology and for teaching bioinformatic inquiry. Recently, we replaced the first version of the Toolkit, which was released in 2005 and had served around 2.5 million queries, with an entirely new version, focusing on improved features for the comprehensive analysis of proteins, as well as on promoting teaching. For instance, our popular remote homology detection server, HHpred, now allows pairwise comparison of two sequences or alignments and offers additional profile HMMs for several model organisms and domain databases. Here, we introduce the new version of our Toolkit and its application to the analysis of proteins.

Source
DOI: http://dx.doi.org/10.1016/j.jmb.2017.12.007
July 2018

MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets.

Nat Biotechnol 2017 11 16;35(11):1026-1028. Epub 2017 Oct 16.

Quantitative and Computational Biology group, Max-Planck Institute for Biophysical Chemistry, Göttingen, Germany.

Source
DOI: http://dx.doi.org/10.1038/nbt.3988
November 2017

WIsH: who is the host? Predicting prokaryotic hosts from metagenomic phage contigs.

Bioinformatics 2017 Oct;33(19):3113-3114

Quantitative and Computational Biology Group, Max-Planck Institute for Biophysical Chemistry, 37077 Göttingen, Germany.

Summary: WIsH predicts prokaryotic hosts of phages from their genomic sequences. It achieves 63% mean accuracy when predicting the host genus among 20 genera for 3 kbp-long phage contigs. Compared with the best current tool, WIsH shows much improved accuracy on phage sequences of a few kbp in length and runs hundreds of times faster, making it suited for metagenomics studies.

Availability And Implementation: OpenMP-parallelized GPL-licensed C++ code available at https://github.com/soedinglab/wish.

Contact: clovis.galiez@mpibpc.mpg.de or soeding@mpibpc.mpg.de.

Supplementary Information: Supplementary data are available at Bioinformatics online.

Source
DOI: http://dx.doi.org/10.1093/bioinformatics/btx383
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5870724
October 2017

Genome-wide Analysis of RNA Polymerase II Termination at Protein-Coding Genes.

Mol Cell 2017 Apr 16;66(1):38-49.e6. Epub 2017 Mar 16.

Max-Planck-Institute for Biophysical Chemistry, Am Fassberg 11, 37077 Göttingen, Germany.

At the end of protein-coding genes, RNA polymerase (Pol) II undergoes a concerted transition that involves 3'-processing of the pre-mRNA and transcription termination. Here, we present a genome-wide analysis of the 3'-transition in budding yeast. We find that the 3'-transition globally requires the Pol II elongation factor Spt5 and factors involved in the recognition of the polyadenylation (pA) site and in endonucleolytic RNA cleavage. Pol II release from DNA occurs in a narrow termination window downstream of the pA site and requires the "torpedo" exonuclease Rat1 (XRN2 in human). The Rat1-interacting factor Rai1 contributes to RNA degradation downstream of the pA site. Defects in the 3'-transition can result in increased transcription at downstream genes.

Source
DOI: http://dx.doi.org/10.1016/j.molcel.2017.02.009
April 2017

Big-data approaches to protein structure prediction.

Authors:
Johannes Söding

Science 2017 01;355(6322):248-249

Quantitative and Computational Biology, Max-Planck Institute for Biophysical Chemistry, Am Fassberg 11, 37077 Göttingen, Germany.

Source
DOI: http://dx.doi.org/10.1126/science.aal4512
January 2017

Uniclust databases of clustered and deeply annotated protein sequences and alignments.

Nucleic Acids Res 2017 01 28;45(D1):D170-D176. Epub 2016 Nov 28.

Quantitative and Computational Biology Group, Max Planck Institute for Biophysical Chemistry, Göttingen, Germany

We present three clustered protein sequence databases, Uniclust90, Uniclust50, Uniclust30 and three databases of multiple sequence alignments (MSAs), Uniboost10, Uniboost20 and Uniboost30, as a resource for protein sequence analysis, function prediction and sequence searches. The Uniclust databases cluster UniProtKB sequences at the level of 90%, 50% and 30% pairwise sequence identity. Uniclust90 and Uniclust50 clusters showed better consistency of functional annotation than those of UniRef90 and UniRef50, owing to an optimised clustering pipeline that runs with our MMseqs2 software for fast and sensitive protein sequence searching and clustering. Uniclust sequences are annotated with matches to Pfam, SCOP domains, and proteins in the PDB, using our HHblits homology detection tool. Due to its high sensitivity, Uniclust contains 17% more Pfam domain annotations than UniProt. Uniboost MSAs of three diversities are built by enriching the Uniclust30 MSAs with local sequence matches from MMseqs2 profile searches through Uniclust30. All databases can be downloaded from the Uniclust server at uniclust.mmseqs.com. Users can search clusters by keywords and explore their MSAs, taxonomic representation, and annotations. Uniclust is updated every two months with the new UniProt release.

Source
DOI: http://dx.doi.org/10.1093/nar/gkw1081
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5614098
January 2017

Bayesian Markov models consistently outperform PWMs at predicting motifs in nucleotide sequences.

Nucleic Acids Res 2016 07 9;44(13):6055-69. Epub 2016 Jun 9.

Quantitative and Computational Biology, Max Planck Institute for Biophysical Chemistry, Am Fassberg 11, 37077 Göttingen, Germany

Position weight matrices (PWMs) are the standard model for DNA and RNA regulatory motifs. In PWMs, nucleotide probabilities are independent of nucleotides at other positions. Models that account for dependencies need many parameters and are prone to overfitting. We have developed a Bayesian approach for motif discovery using Markov models in which conditional probabilities of order k - 1 act as priors for those of order k. This Bayesian Markov model (BaMM) training automatically adapts model complexity to the amount of available data. We also derive an EM algorithm for de-novo discovery of enriched motifs. For transcription factor binding, BaMMs achieve significantly (P = 1/16) higher cross-validated partial AUC than PWMs in 97% of 446 ChIP-seq ENCODE datasets and improve performance by 36% on average. BaMMs also learn complex multipartite motifs, improving predictions of transcription start sites, polyadenylation sites, bacterial pause sites, and RNA binding sites by 26-101%. BaMMs never performed worse than PWMs. These robust improvements argue in favour of generally replacing PWMs by BaMMs.
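
The idea that order k - 1 probabilities act as priors for order k can be sketched with pseudocount-style interpolation: the order-k estimate adds alpha_k pseudocounts distributed according to the order k - 1 estimate. The code below is an illustrative assumption about how such a recursion can look, not the BaMM software. It models a homogeneous (position-independent) Markov chain rather than a positional motif, and the function names and alpha values are hypothetical.

```python
from collections import Counter

ALPHABET = "ACGT"


def count_kmers(sequences, max_order):
    """Count all k-mers of length 1 .. max_order + 1 in the sequences."""
    counts = Counter()
    for seq in sequences:
        for k in range(1, max_order + 2):
            for i in range(len(seq) - k + 1):
                counts[seq[i:i + k]] += 1
    return counts


def cond_prob(counts, context, x, alphas):
    """P(x | context), with the next-lower order estimate as prior.

    alphas[k] is the pseudocount strength for order k (hypothetical values).
    With no observations of the context, the lower-order estimate is returned.
    """
    k = len(context)
    if k == 0:
        prior = 1.0 / len(ALPHABET)  # uniform prior below order 0
    else:
        prior = cond_prob(counts, context[1:], x, alphas)
    n_context_x = counts[context + x]
    n_context = sum(counts[context + y] for y in ALPHABET)
    return (n_context_x + alphas[k] * prior) / (n_context + alphas[k])


if __name__ == "__main__":
    counts = count_kmers(["ACGTACGT"], max_order=2)
    alphas = [1.0, 2.0, 4.0]
    for x in ALPHABET:
        print(x, cond_prob(counts, "CG", x, alphas))
```

Because the pseudocounts have fixed strength, contexts with many observations are dominated by their own counts while rare contexts fall back to the lower order, which is one way to read the abstract's statement that model complexity adapts to the amount of available data.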

Source
DOI: http://dx.doi.org/10.1093/nar/gkw521
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5291271
July 2016

The MPI bioinformatics Toolkit as an integrative platform for advanced protein sequence and structure analysis.

Nucleic Acids Res 2016 Jul 29;44(W1):W410-5. Epub 2016 Apr 29.

Department of Protein Evolution, Max Planck Institute for Developmental Biology, Tübingen D-72076, Germany

The MPI Bioinformatics Toolkit (http://toolkit.tuebingen.mpg.de) is an open, interactive web service for comprehensive and collaborative protein bioinformatic analysis. It offers a wide array of interconnected, state-of-the-art bioinformatics tools to experts and non-experts alike, developed both externally (e.g. BLAST+, HMMER3, MUSCLE) and internally (e.g. HHpred, HHblits, PCOILS). While a beta version of the Toolkit was released 10 years ago, the current production-level release has been available since 2008 and has serviced more than 1.6 million external user queries. The usage of the Toolkit has continued to increase linearly over the years, reaching more than 400 000 queries in 2015. In fact, through the breadth of its tools and their tight interconnection, the Toolkit has become an excellent platform for experimental scientists as well as a useful resource for teaching bioinformatic inquiry to students in the life sciences. In this article, we report on the evolution of the Toolkit over the last ten years, focusing on the expansion of the tool repertoire (e.g. CS-BLAST, HHblits) and on infrastructural work needed to remain operative in a changing web environment.

Source
http://dx.doi.org/10.1093/nar/gkw348
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4987908
July 2016

Modulations of DNA Contacts by Linker Histones and Post-translational Modifications Determine the Mobility and Modifiability of Nucleosomal H3 Tails.

Mol Cell 2016 Jan 14;61(2):247-59. Epub 2016 Jan 14.

Laboratory of Chromatin Biochemistry, Max Planck Institute for Biophysical Chemistry, Am Fassberg 11, 37077 Göttingen, Germany.

Post-translational histone modifications and linker histone incorporation regulate chromatin structure and genome activity. How these systems interface on a molecular level is unclear. Using biochemistry and NMR spectroscopy, we deduced mechanistic insights into the modification behavior of N-terminal histone H3 tails in different nucleosomal contexts. We find that linker histones generally inhibit modifications of different H3 sites and reduce H3 tail dynamics in nucleosomes. These effects are caused by modulations of electrostatic interactions of H3 tails with linker DNA and largely depend on the C-terminal domains of linker histones. In agreement, linker histone occupancy and H3 tail modifications segregate on a genome-wide level. Charge-modulating modifications such as phosphorylation and acetylation weaken transient H3 tail-linker DNA interactions, increase H3 tail dynamics, and, concomitantly, enhance general modifiability. We propose that alterations of H3 tail-linker DNA interactions by linker histones and charge-modulating modifications execute basal control mechanisms of chromatin function.

Source
http://dx.doi.org/10.1016/j.molcel.2015.12.015
January 2016

MMseqs software suite for fast and deep clustering and searching of large protein sequence sets.

Bioinformatics 2016 05 6;32(9):1323-30. Epub 2016 Jan 6.

Gene Center, Ludwig-Maximilians-Universität München, Munich 81377, Germany, Computational Biology, Max Planck Institute for Biophysical Chemistry, Göttingen 37077, Germany and.

Motivation: Sequence databases are growing fast, challenging existing analysis pipelines. Reducing the redundancy of sequence databases by similarity clustering improves speed and sensitivity of iterative searches. But existing tools cannot efficiently cluster databases of the size of UniProt to 50% maximum pairwise sequence identity or below. Furthermore, in metagenomics experiments typically large fractions of reads cannot be matched to any known sequence anymore because searching with sensitive but relatively slow tools (e.g. BLAST or HMMER3) through comprehensive databases such as UniProt is becoming too costly.

Results: MMseqs (Many-against-Many sequence searching) is a software suite for fast and deep clustering and searching of large datasets, such as UniProt, or 6-frame translated metagenomics sequencing reads. MMseqs contains three core modules: a fast and sensitive prefiltering module that sums up the scores of similar k-mers between query and target sequences, an SSE2- and multi-core-parallelized local alignment module, and a clustering module. In our homology detection benchmarks, MMseqs is much more sensitive and 4-30 times faster than UBLAST and RAPsearch, respectively, although it does not reach BLAST sensitivity yet. Using its cascaded clustering workflow, MMseqs can cluster large databases down to ∼30% sequence identity at hundreds of times the speed of BLASTclust and much deeper than CD-HIT and USEARCH. MMseqs can also update a database clustering in linear instead of quadratic time. Its much improved sensitivity-speed trade-off should make MMseqs attractive for a wide range of large-scale sequence analysis tasks.

Availability And Implementation: MMseqs is open-source software available under GPL at https://github.com/soedinglab/MMseqs

Contact: martin.steinegger@mpibpc.mpg.de, soeding@mpibpc.mpg.de

Supplementary Information: Supplementary data are available at Bioinformatics online.
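
The prefiltering step described in the Results, which sums the scores of similar k-mers between query and target, can be sketched in a few lines. This is a toy illustration, not MMseqs code: the function names, the match/mismatch scoring (a real tool would use an amino acid substitution matrix), the threshold, and the brute-force comparison against every target k-mer are all assumptions made here for readability.

```python
def kmers(seq, k):
    """All overlapping k-mers of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def kmer_score(a, b, match=2, mismatch=-1):
    """Toy similarity score between two k-mers of equal length (assumed scoring)."""
    return sum(match if x == y else mismatch for x, y in zip(a, b))


def prefilter_score(query, target, k=3, min_kmer_score=3):
    """Sum, over query k-mers, the best score of a sufficiently similar target k-mer."""
    target_kmers = set(kmers(target, k))
    total = 0
    for q in kmers(query, k):
        best = max((kmer_score(q, t) for t in target_kmers), default=0)
        if best >= min_kmer_score:  # only "similar" k-mers contribute
            total += best
    return total


def prefilter(query, database, k=3, min_kmer_score=3, top=2):
    """Return the names of the best-scoring targets, to be passed on to alignment."""
    scored = sorted(
        ((prefilter_score(query, seq, k, min_kmer_score), name)
         for name, seq in database.items()),
        reverse=True,
    )
    return [name for score, name in scored[:top] if score > 0]
```

Only the targets that pass such a cheap filter would then be handed to the slower local alignment module, which is what makes the overall search fast.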

Source
http://dx.doi.org/10.1093/bioinformatics/btw006
May 2016

A vocabulary of ancient peptides at the origin of folded proteins.

Elife 2015 Dec 14;4:e09410. Epub 2015 Dec 14.

Department of Protein Evolution, Max Planck Institute for Developmental Biology, Tübingen, Germany.

The seemingly limitless diversity of proteins in nature arose from only a few thousand domain prototypes, but the origin of these themselves has remained unclear. We are pursuing the hypothesis that they arose by fusion and accretion from an ancestral set of peptides active as co-factors in RNA-dependent replication and catalysis. Should this be true, contemporary domains may still contain vestiges of such peptides, which could be reconstructed by a comparative approach in the same way in which ancient vocabularies have been reconstructed by the comparative study of modern languages. To test this, we compared domains representative of known folds and identified 40 fragments whose similarity is indicative of common descent, yet which occur in domains currently not thought to be homologous. These fragments are widespread in the most ancient folds and enriched for iron-sulfur- and nucleic acid-binding. We propose that they represent the observable remnants of a primordial RNA-peptide world.

Source
http://dx.doi.org/10.7554/eLife.09410
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4739770
December 2015

Automatic Prediction of Protein 3D Structures by Probabilistic Multi-template Homology Modeling.

PLoS Comput Biol 2015 Oct 23;11(10):e1004343. Epub 2015 Oct 23.

Quantitative and Computational Biology, Max Planck Institute for Biophysical Chemistry, Göttingen, Germany; Gene Center, Ludwig-Maximilians-Universität München, Munich, Germany.

Homology modeling predicts the 3D structure of a query protein based on the sequence alignment with one or more template proteins of known structure. Its great importance for biological research is owed to its speed, simplicity, reliability and wide applicability, covering more than half of the residues in protein sequence space. Although multiple templates have been shown to generally increase model quality over single templates, the information from multiple templates has so far been combined using empirically motivated, heuristic approaches. We present here a rigorous statistical framework for multi-template homology modeling. First, we find that the query proteins' atomic distance restraints can be accurately described by two-component Gaussian mixtures. This insight allowed us to apply the standard laws of probability theory to combine restraints from multiple templates. Second, we derive theoretically optimal weights to correct for the redundancy among related templates. Third, a heuristic template selection strategy is proposed. We improve the average GDT-ha model quality score by 11% over single template modeling and by 6.5% over a conventional multi-template approach on a set of 1000 query proteins. Robustness with respect to wrong constraints is likewise improved. We have integrated our multi-template modeling approach with the popular MODELLER homology modeling software in our free HHpred server http://toolkit.tuebingen.mpg.de/hhpred and also offer open source software for running MODELLER with the new restraints at https://bitbucket.org/soedinglab/hh-suite.
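
The combination of distance restraints from several templates can be illustrated with a small numerical sketch. It is a simplified reading of the abstract, not the published method: each template contributes a two-component Gaussian mixture over the query distance, and the per-template log-densities are summed with weights. The mixture parameters, the choice to center both components on the template distance, the weighted log-sum as the combination rule, the weights themselves, and all function names are assumptions introduced here.

```python
import math


def gaussian_pdf(x, mean, sigma):
    """Density of a normal distribution."""
    z = (x - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def mixture_pdf(d, template_distance, sigma_tight=1.0, sigma_broad=3.0, w_tight=0.7):
    """Two-component mixture: a tight component for well-conserved distances and
    a broad one for unreliable restraints (all parameter values are assumed)."""
    return (w_tight * gaussian_pdf(d, template_distance, sigma_tight)
            + (1.0 - w_tight) * gaussian_pdf(d, template_distance, sigma_broad))


def combined_log_density(d, templates):
    """Weighted sum of per-template log-densities.

    templates: list of (template_distance, weight); a weight below 1
    down-weights a template, e.g. because it is redundant with another.
    """
    return sum(weight * math.log(mixture_pdf(d, dist)) for dist, weight in templates)


def most_likely_distance(templates, lo=3.0, hi=8.0, steps=50):
    """Grid search for the distance that maximizes the combined log-density."""
    grid = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    return max(grid, key=lambda d: combined_log_density(d, templates))
```

Two equally weighted templates pull the preferred distance to a compromise between them, while down-weighting one template shifts the optimum toward the other. The broad component keeps a single badly wrong template from dominating the result.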

Source
http://dx.doi.org/10.1371/journal.pcbi.1004343
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4619893
October 2015

bbcontacts: prediction of β-strand pairing from direct coupling patterns.

Bioinformatics 2015 Jun 23;31(11):1729-37. Epub 2015 Jan 23.

Gene Center, LMU Munich, Feodor-Lynen-Strasse 25, 81377 Munich, Germany and Max Planck Institute for Biophysical Chemistry, Am Fassberg 11, 37077 Göttingen, Germany.

Motivation: It has recently become possible to build reliable de novo models of proteins if a multiple sequence alignment (MSA) of at least 1000 homologous sequences can be built. Methods of global statistical network analysis can explain the observed correlations between columns in the MSA by a small set of directly coupled pairs of columns. Strong couplings are indicative of residue-residue contacts, and from the predicted contacts a structure can be computed. Here, we exploit the structural regularity of paired β-strands that leads to characteristic patterns in the noisy matrices of couplings. The β-β contacts should be detected more reliably than single contacts, reducing the required number of sequences in the MSAs.

Results: bbcontacts predicts β-β contacts by detecting these characteristic patterns in the 2D map of coupling scores using two hidden Markov models (HMMs), one for parallel and one for antiparallel contacts. β-bulges are modelled as indel states. In contrast to existing methods, bbcontacts uses predicted instead of true secondary structure. On a standard set of 916 test proteins, 34% of which have MSAs with < 1000 sequences, bbcontacts achieves 50% precision for contacting β-β residue pairs at 50% recall using predicted secondary structure and 64% precision at 64% recall using true secondary structure, while existing tools achieve around 45% precision at 45% recall using true secondary structure.

Availability And Implementation: bbcontacts is open source software (GNU Affero GPL v3) available at https://bitbucket.org/soedinglab/bbcontacts.

Source
http://dx.doi.org/10.1093/bioinformatics/btv041
June 2015

Context similarity scoring improves protein sequence alignments in the midnight zone.

Bioinformatics 2015 Mar 22;31(5):674-81. Epub 2014 Oct 22.

Gene Center, LMU Munich, 81377 Munich and Max Planck Institute for Biophysical Chemistry, 37077 Göttingen, Germany.

Motivation: High-quality protein sequence alignments are essential for a number of downstream applications such as template-based protein structure prediction. In addition to the similarity score between sequence profile columns, many current profile-profile alignment tools use extra terms that compare 1D-structural properties such as secondary structure and solvent accessibility, which are predicted from short profile windows around each sequence position. Such scores add non-redundant information by evaluating the conservation of local patterns of hydrophobicity and other amino acid properties and thus exploiting correlations between profile columns.

Results: Here, instead of predicting and comparing known 1D properties, we follow an agnostic approach. We learn in an unsupervised fashion a set of maximally conserved patterns represented by 13-residue sequence profiles, without the need to know the cause of the conservation of these patterns. We use a maximum likelihood approach to train a set of 32 such profiles that can best represent patterns conserved within pairs of remotely homologous, structurally aligned training profiles. We include the new context score in our HMM-HMM alignment tool HHsearch and significantly improve the quality of alignments, especially difficult ones.

Conclusion: The context similarity score improves the quality of homology models and other methods that depend on accurate pairwise alignments.

Source
http://dx.doi.org/10.1093/bioinformatics/btu697
March 2015

Transcriptome maps of mRNP biogenesis factors define pre-mRNA recognition.

Mol Cell 2014 Sep;55(5):745-57

Max-Planck-Institute for Biophysical Chemistry, Am Faßberg 11, 37077 Göttingen, Germany; Gene Center Munich and Department of Biochemistry, Center for Integrated Protein Science CIPSM, Ludwig-Maximilians-Universität München, Feodor-Lynen-Straße 25, 81377 Munich, Germany.

Biogenesis of eukaryotic messenger ribonucleoprotein complexes (mRNPs) involves the synthesis, splicing, and 3' processing of pre-mRNA, and the assembly of mature mRNPs for nuclear export. We mapped 23 mRNP biogenesis factors onto the yeast transcriptome, providing 10^4-10^6 high-confidence RNA interaction sites per factor. The data reveal how mRNP biogenesis factors recognize pre-mRNA elements in vivo. They define conserved interactions between splicing factors and pre-mRNA introns, including the recognition of intron-exon junctions and the branchpoint. They also identify a unified arrangement of 3' processing factors at pre-mRNA polyadenylation (pA) sites in yeast and human, which results from an A-U sequence bias at pA sites. Global data analysis indicates that 3' processing factors have roles in splicing and RNA surveillance, and that they couple mRNP biogenesis events to restrict nuclear export to mature mRNPs.

Source
http://dx.doi.org/10.1016/j.molcel.2014.08.005
September 2014