Publications by authors named "William Stafford Noble"

140 Publications

HiCRep.py: Fast comparison of Hi-C contact matrices in Python.

Bioinformatics 2021 Feb 11. Epub 2021 Feb 11.

Department of Genome Sciences, University of Washington, Seattle, WA 98195, United States.

Motivation: Hi-C is the most widely used assay for investigating genome-wide 3D organization of chromatin. When working with Hi-C data, it is often useful to calculate the similarity between contact matrices in order to assess experimental reproducibility or to quantify relationships among Hi-C data from related samples. The HiCRep algorithm has been widely adopted for this task, but the existing R implementation suffers from run-time limitations on high-resolution Hi-C data or on large single-cell Hi-C datasets.

Results: We introduce a Python implementation of HiCRep and demonstrate that it is much faster and consumes much less memory than the existing R implementation. Furthermore, we give examples of HiCRep's ability to accurately distinguish replicates from non-replicates and to reveal cell type structure among collections of Hi-C data.

Availability: HiCRep.py and its documentation are available with a GPL license at https://github.com/Noble-Lab/hicrep. The software may be installed automatically using the pip package installer.

Supplementary Information: Supplementary methods and results are included in an appendix at Bioinformatics online.
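
For intuition, the core quantity HiCRep computes can be sketched in a few lines of NumPy: correlate the two matrices within each genomic-distance stratum (i.e., each diagonal) and combine the per-stratum correlations in a weighted average. The weighting and lack of smoothing below are simplifications of the published method, so treat this as an illustration rather than a substitute for the package.

```python
import numpy as np

def scc_sketch(mat1, mat2, max_dist=100):
    """Simplified stratum-adjusted correlation: a weighted average of
    per-diagonal Pearson correlations. (Illustrative only; HiCRep.py
    also smooths the matrices and uses the published SCC weights.)"""
    num = den = 0.0
    for d in range(1, max_dist + 1):          # stratify by genomic distance
        x = np.diagonal(mat1, offset=d).astype(float)
        y = np.diagonal(mat2, offset=d).astype(float)
        if x.std() == 0 or y.std() == 0:      # constant stratum: skip
            continue
        r = np.corrcoef(x, y)[0, 1]           # Pearson r within the stratum
        w = len(x) * x.std() * y.std()        # size- and spread-based weight
        num += w * r
        den += w
    return num / den if den else 0.0

# Toy check: two noisy replicates of the same contact matrix score near 1.
rng = np.random.default_rng(0)
base = rng.poisson(5, size=(400, 400))
rep1 = base + rng.poisson(1, size=base.shape)
rep2 = base + rng.poisson(1, size=base.shape)
print(scc_sketch(rep1, rep2))
```
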
DOI: http://dx.doi.org/10.1093/bioinformatics/btab097

A pitfall for machine learning methods aiming to predict across cell types.

Genome Biol 2020 Nov 19;21(1):282. Epub 2020 Nov 19.

Paul G. Allen School of Computer Science & Engineering, University of Washington, Seattle, USA.

Machine learning models that predict genomic activity are most useful when they make accurate predictions across cell types. Here, we show that when the training and test sets contain the same genomic loci, the resulting model may falsely appear to perform well by effectively memorizing the average activity associated with each locus across the training cell types. We demonstrate this phenomenon in the context of predicting gene expression and chromatin domain boundaries, and we suggest methods to diagnose and avoid the pitfall. We anticipate that, as more data becomes available, future projects will increasingly risk suffering from this issue.
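
The diagnostic suggested here is easy to emulate on synthetic data: compare any cross-cell-type model against a baseline that predicts each locus's mean activity across the training cell types. In the hypothetical simulation below, the locus effect dominates the signal, so the memorization baseline alone looks deceptively good on a held-out cell type; a genuinely useful model must beat it.

```python
import numpy as np

# Hypothetical setup: rows are genomic loci, columns are cell types, and
# activity is mostly a per-locus effect plus cell-type-specific noise.
rng = np.random.default_rng(1)
n_loci, n_train = 2000, 10
locus_effect = rng.normal(size=n_loci)
train = locus_effect[:, None] + rng.normal(scale=0.5, size=(n_loci, n_train))
test = locus_effect + rng.normal(scale=0.5, size=n_loci)  # held-out cell type

# The "memorization" baseline: predict each locus's mean activity across
# the training cell types, ignoring the held-out cell type entirely.
baseline = train.mean(axis=1)
print(np.corrcoef(baseline, test)[0, 1])  # high despite learning nothing
# cell-type-specific; a useful cross-cell-type model must beat this number.
```
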
DOI: http://dx.doi.org/10.1186/s13059-020-02177-y
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7678316

Prioritizing transcriptomic and epigenomic experiments by using an optimization strategy that leverages imputed data.

Bioinformatics 2020 Sep 23. Epub 2020 Sep 23.

Department of Genome Sciences, University of Washington, Seattle, United States.

Motivation: Successful science often involves not only performing experiments well, but also choosing well among many possible experiments. In a hypothesis generation setting, choosing an experiment well means choosing an experiment whose results are interesting or novel. In this work, we formalize this selection procedure in the context of genomics and epigenomics data generation. Specifically, we consider the task faced by a scientific consortium such as the National Institutes of Health ENCODE Consortium, whose goal is to characterize all of the functional elements in the human genome. Given a list of possible cell types or tissue types ("biosamples") and a list of possible high-throughput sequencing assays, where at least one experiment has been performed in each biosample and for each assay, we ask "Which experiments should ENCODE perform next?"

Results: We demonstrate how to represent this task as a submodular optimization problem, where the goal is to choose a panel of experiments that maximizes the facility location function. A key aspect of our approach is that we use imputed data, rather than experimental data, to directly answer the posed question. We find that, across several evaluations, our method chooses a panel of experiments that spans a diversity of biochemical activity. Finally, we propose two modifications to the facility location function, including a novel submodular-supermodular function, that allow incorporation of domain knowledge or constraints into the optimization procedure.
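
The facility location objective and its classic greedy maximizer are compact enough to sketch. The version below operates on an arbitrary nonnegative similarity matrix and is purely illustrative; Kiwano itself applies this machinery to a precomputed experiment similarity matrix at scale.

```python
import numpy as np

def greedy_facility_location(sim, k):
    """Greedily choose k items maximizing f(S) = sum_i max_{j in S} sim[i, j].
    Facility location is submodular, so greedy is within (1 - 1/e) of optimal."""
    best = np.zeros(sim.shape[0])   # best[i]: max similarity to the chosen set
    selected = []
    for _ in range(k):
        # Marginal gain of each candidate j over the current selection.
        gains = np.maximum(sim, best[:, None]).sum(axis=0) - best.sum()
        j = int(np.argmax(gains))
        selected.append(j)
        best = np.maximum(best, sim[:, j])
    return selected

rng = np.random.default_rng(2)
x = rng.normal(size=(50, 8))             # stand-in experiment embeddings
x /= np.linalg.norm(x, axis=1, keepdims=True)
sim = (x @ x.T + 1) / 2                  # cosine similarity rescaled to [0, 1]
print(greedy_facility_location(sim, k=5))
```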

Availability And Implementation: Our method is available as a Python package at https://github.com/jmschrei/kiwano and can be installed using the command pip install kiwano. The source code used here and the similarity matrix can be found at http://doi.org/10.5281/zenodo.3708538.

Supplementary Information: Supplementary data are available at Bioinformatics online.
DOI: http://dx.doi.org/10.1093/bioinformatics/btaa830

Capturing cell type-specific chromatin compartment patterns by applying topic modeling to single-cell Hi-C data.

PLoS Comput Biol 2020 09 18;16(9):e1008173. Epub 2020 Sep 18.

Department of Genome Sciences, University of Washington, Seattle, Washington, United States of America.

Single-cell Hi-C (scHi-C) interrogates genome-wide chromatin interactions in individual cells, allowing us to gain insights into 3D genome organization. However, the extremely sparse nature of scHi-C data poses a significant barrier to analysis, limiting our ability to tease out hidden biological information. In this work, we approach this problem by applying topic modeling to scHi-C data. Topic modeling is well-suited for discovering latent topics in a collection of discrete data. For our analysis, we generate nine different single-cell combinatorial indexed Hi-C (sci-Hi-C) libraries from five human cell lines (GM12878, H1Esc, HFF, IMR90, and HAP1), comprising over 19,000 cells. We demonstrate that topic modeling is able to successfully capture cell type differences from sci-Hi-C data in the form of "chromatin topics." We further show enrichment of particular compartment structures associated with locus pairs in these topics.
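
As a cartoon of the approach, each cell can be treated as a "document" whose "words" are binned locus pairs, after which off-the-shelf latent Dirichlet allocation yields per-cell topic mixtures. The data below are random and the pipeline is a sketch, not the paper's actual preprocessing.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical cells x locus-pairs contact count matrix (sparse, like scHi-C).
rng = np.random.default_rng(3)
counts = rng.poisson(0.05, size=(300, 5000))

lda = LatentDirichletAllocation(n_components=10, random_state=0)
cell_topics = lda.fit_transform(counts)   # per-cell mixture over topics
# lda.components_ gives each topic's weights over locus pairs: the
# "chromatin topics" whose top locus pairs can be inspected for
# compartment-like patterns.
print(cell_topics.shape, lda.components_.shape)
```
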
DOI: http://dx.doi.org/10.1371/journal.pcbi.1008173
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7526900

Completing the ENCODE3 compendium yields accurate imputations across a variety of assays and human biosamples.

Genome Biol 2020 03 30;21(1):82. Epub 2020 Mar 30.

Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, USA.

Recent efforts to describe the human epigenome have yielded thousands of epigenomic and transcriptomic datasets. However, due primarily to cost, the total number of such assays that can be performed is limited. Accordingly, we applied an imputation approach, Avocado, to a dataset of 3814 tracks of data derived from the ENCODE compendium, including measurements of chromatin accessibility, histone modification, transcription, and protein binding. Avocado shows significant improvements in imputing protein binding compared to the top models in the ENCODE-DREAM challenge. Additionally, we show that the Avocado model allows for efficient addition of new assays and biosamples to a pre-trained model.
DOI: http://dx.doi.org/10.1186/s13059-020-01978-5
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7104481

Avocado: a multi-scale deep tensor factorization method learns a latent representation of the human epigenome.

Genome Biol 2020 03 30;21(1):81. Epub 2020 Mar 30.

Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, USA.

The human epigenome has been experimentally characterized by thousands of measurements for every basepair in the human genome. We propose a deep neural network tensor factorization method, Avocado, that compresses this epigenomic data into a dense, information-rich representation. We use this learned representation to impute epigenomic data more accurately than previous methods, and we show that machine learning models that exploit this representation outperform those trained directly on epigenomic data on a variety of genomics tasks. These tasks include predicting gene expression, promoter-enhancer interactions, replication timing, and an element of 3D chromatin architecture.
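
The "deep tensor factorization" idea can be sketched in a few lines of PyTorch: learn an embedding for each cell type, assay, and genomic position, and let a small network map the concatenated embeddings to a signal value. The dimensions and architecture below are placeholders; Avocado additionally embeds the genome at multiple scales.

```python
import torch
import torch.nn as nn

class TensorFactorizer(nn.Module):
    """Sketch in the spirit of Avocado: per-axis embeddings feed an MLP
    that predicts epigenomic signal for a (cell, assay, position) triple."""
    def __init__(self, n_cells, n_assays, n_positions, d=32):
        super().__init__()
        self.cell = nn.Embedding(n_cells, d)
        self.assay = nn.Embedding(n_assays, d)
        self.pos = nn.Embedding(n_positions, d)
        self.mlp = nn.Sequential(nn.Linear(3 * d, 64), nn.ReLU(),
                                 nn.Linear(64, 1))

    def forward(self, c, a, p):
        z = torch.cat([self.cell(c), self.assay(a), self.pos(p)], dim=-1)
        return self.mlp(z).squeeze(-1)

model = TensorFactorizer(n_cells=50, n_assays=30, n_positions=10000)
c = torch.randint(0, 50, (256,))
a = torch.randint(0, 30, (256,))
p = torch.randint(0, 10000, (256,))
print(model(c, a, p).shape)  # imputed signal for 256 (cell, assay, pos) triples
```
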
DOI: http://dx.doi.org/10.1186/s13059-020-01977-6
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7104480

Measuring significant changes in chromatin conformation with ACCOST.

Nucleic Acids Res 2020 03;48(5):2303-2311

Department of Genome Sciences, University of Washington, Seattle, WA 98195-5065, USA.

Chromatin conformation assays such as Hi-C cannot directly measure differences in 3D architecture between cell types or cell states. For this purpose, two or more Hi-C experiments must be carried out, but direct comparison of the resulting Hi-C matrices is confounded by several features of Hi-C data. Most notably, the genomic distance effect, whereby pairs of genomic loci that are proximal along the chromosome exhibit many more Hi-C contacts than distal pairs of loci, dominates every Hi-C matrix. Furthermore, the form that this distance effect takes often varies between different Hi-C experiments, even between replicate experiments. Thus, a statistical confidence measure designed to identify differential Hi-C contacts must accurately account for the genomic distance effect or risk being misled by large-scale but artifactual differences. ACCOST (Altered Chromatin COnformation STatistics) accomplishes this goal by extending the statistical model employed by DESeq, re-purposing the "size factors," which were originally developed to account for differences in read depth between samples, to instead model the genomic distance effect. We show via analysis of simulated and real data that ACCOST provides unbiased statistical confidence estimates that compare favorably with competing methods such as diffHic, FIND and HiCcompare. ACCOST is freely available with an Apache license at https://bitbucket.org/noblelab/accost.
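
The central trick, repurposing size factors to absorb the distance effect, can be sketched as follows: within each genomic distance (diagonal), compute a DESeq-style median-ratio factor per sample relative to the across-sample geometric mean. This is a toy version; ACCOST embeds such factors in a full negative binomial testing framework.

```python
import numpy as np

def distance_size_factors(mats, max_dist=100):
    """Per-distance, per-sample size factors via DESeq-style median ratios.
    (Sketch of the idea only; not ACCOST's implementation.)"""
    factors = np.ones((len(mats), max_dist + 1))
    for d in range(1, max_dist + 1):
        # Pseudocount of 1 avoids log(0) in sparse strata.
        diags = [np.diagonal(m, offset=d).astype(float) + 1 for m in mats]
        gmean = np.exp(np.mean([np.log(x) for x in diags], axis=0))
        for s, x in enumerate(diags):
            factors[s, d] = np.median(x / gmean)
    return factors

rng = np.random.default_rng(4)
shallow = rng.poisson(5, (300, 300))
deep = rng.poisson(10, (300, 300))       # same contacts, ~2x read depth
f = distance_size_factors([shallow, deep], max_dist=50)
print(f[:, 1:4])  # the deeper sample gets larger factors at every distance
```
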
DOI: http://dx.doi.org/10.1093/nar/gkaa069
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7049724

Predicting gene expression in the human malaria parasite Plasmodium falciparum using histone modification, nucleosome positioning, and 3D localization features.

PLoS Comput Biol 2019 09 11;15(9):e1007329. Epub 2019 Sep 11.

Department of Genome Sciences, University of Washington, Seattle, Washington, United States of America.

Empirical evidence suggests that the malaria parasite Plasmodium falciparum employs a broad range of mechanisms to regulate gene transcription throughout the organism's complex life cycle. To better understand this regulatory machinery, we assembled a rich collection of genomic and epigenomic data sets, including information about transcription factor (TF) binding motifs, patterns of covalent histone modifications, nucleosome occupancy, GC content, and global 3D genome architecture. We used these data to train machine learning models to discriminate between high-expression and low-expression genes, focusing on three distinct stages of the red blood cell phase of the Plasmodium life cycle. Our results highlight the importance of histone modifications and 3D chromatin architecture in Plasmodium transcriptional regulation and suggest that AP2 transcription factors may play a limited regulatory role, perhaps operating in conjunction with epigenetic factors.
DOI: http://dx.doi.org/10.1371/journal.pcbi.1007329
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6756558

A unified encyclopedia of human functional DNA elements through fully automated annotation of 164 human cell types.

Genome Biol 2019 08 28;20(1):180. Epub 2019 Aug 28.

Department of Genome Sciences, University of Washington, Seattle, USA.

Semi-automated genome annotation methods such as Segway take as input a set of genome-wide measurements, such as histone modification or DNA accessibility data, and output an annotation of genomic activity in the target cell type. Here we present annotations of 164 human cell types using 1615 data sets. To produce these annotations, we automated the label interpretation step to produce a fully automated annotation strategy. Using these annotations, we developed a measure of the importance of each genomic position called the "conservation-associated activity score." We further combined all annotations into a single, cell type-agnostic encyclopedia that catalogs all human regulatory elements.
DOI: http://dx.doi.org/10.1186/s13059-019-1784-2
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6714098

Extremely Fast and Accurate Open Modification Spectral Library Searching of High-Resolution Mass Spectra Using Feature Hashing and Graphics Processing Units.

J Proteome Res 2019 10 30;18(10):3792-3799. Epub 2019 Aug 30.

Department of Genome Sciences, University of Washington, Seattle, Washington 98195, United States.

Open modification searching (OMS) is a powerful search strategy to identify peptides with any type of modification. OMS works by using a very wide precursor mass window to allow modified spectra to match against their unmodified variants, after which the modification types can be inferred from the corresponding precursor mass differences. A disadvantage of this strategy, however, is the large computational cost, because each query spectrum has to be compared against a multitude of candidate peptides. We have previously introduced the ANN-SoLo tool for fast and accurate open spectral library searching. ANN-SoLo uses approximate nearest neighbor indexing to speed up OMS by selecting only a limited number of the most relevant library spectra to compare to an unknown query spectrum. Here we demonstrate how this candidate selection procedure can be further optimized using graphics processing units. Additionally, we introduce a feature hashing scheme to convert high-resolution spectra to low-dimensional vectors. On the basis of these algorithmic advances, along with low-level code optimizations, the new version of ANN-SoLo is up to an order of magnitude faster than its initial version. This makes it possible to efficiently perform open searches on a large scale to gain a deeper understanding of the protein modification landscape. We demonstrate the computational efficiency and identification performance of ANN-SoLo based on a large data set of the draft human proteome. ANN-SoLo is implemented in Python and C++. It is freely available under the Apache 2.0 license at https://github.com/bittremieux/ANN-SoLo.
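
Feature hashing itself is simple to sketch: bin each peak at high m/z resolution, then hash the fine bin index into a fixed-length vector, preserving fragment resolution while keeping dimensionality bounded. The bin width, dimensionality, and hash function below are placeholders rather than ANN-SoLo's actual parameters.

```python
import numpy as np

def hash_spectrum(mz, intensity, dim=800, bin_width=0.05):
    """Toy feature hashing of a spectrum into a fixed-length vector."""
    vec = np.zeros(dim)
    for m, i in zip(mz, intensity):
        fine_bin = int(m / bin_width)              # high-resolution m/z bin
        vec[(fine_bin * 2654435761) % dim] += i    # cheap multiplicative hash
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

spec = hash_spectrum(mz=[301.18, 402.25, 503.31], intensity=[10.0, 25.0, 5.0])
print(spec.shape)  # (800,): ready for ANN indexing and dot-product scoring
```
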
DOI: http://dx.doi.org/10.1021/acs.jproteome.9b00291
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6886738

Speeding Up Percolator.

J Proteome Res 2019 09 23;18(9):3353-3359. Epub 2019 Aug 23.

Science for Life Laboratory, KTH-Royal Institute of Technology, Solna 171 65, Sweden.

The processing of peptide tandem mass spectrometry data involves matching observed spectra against a sequence database. The ranking and calibration of these peptide-spectrum matches can be improved substantially using a machine learning postprocessor. Here, we describe our efforts to speed up one widely used postprocessor, Percolator. The improved software is dramatically faster than the previous version of Percolator, even when using relatively few processors. We tested the new version of Percolator on a data set containing over 215 million spectra and recorded an overall reduction to 23% of the running time as compared to the unoptimized code. We also show that the memory footprint required by these speedups is modest relative to that of the original version of Percolator.
DOI: http://dx.doi.org/10.1021/acs.jproteome.9b00288
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6884961

Massively parallel profiling and predictive modeling of the outcomes of CRISPR/Cas9-mediated double-strand break repair.

Nucleic Acids Res 2019 09;47(15):7989-8003

Department of Genome Sciences, University of Washington, Seattle, WA 98195, USA.

Non-homologous end-joining (NHEJ) plays an important role in double-strand break (DSB) repair of DNA. Recent studies have shown that the error patterns of NHEJ are strongly biased by sequence context, but these studies were based on relatively few templates. To investigate this more thoroughly, we systematically profiled ∼1.16 million independent mutational events resulting from CRISPR/Cas9-mediated cleavage and NHEJ-mediated DSB repair of 6872 synthetic target sequences, introduced into a human cell line via lentiviral infection. We find that: (i) insertions are dominated by 1 bp events templated by sequence immediately upstream of the cleavage site, (ii) deletions are predominantly associated with microhomology and (iii) targets exhibit variable but reproducible diversity with respect to the number and relative frequency of the mutational outcomes to which they give rise. From these data, we trained a model that uses local sequence context to predict the distribution of mutational outcomes. Exploiting the bias of NHEJ outcomes towards microhomology mediated events, we demonstrate the programming of deletion patterns by introducing microhomology to specific locations in the vicinity of the DSB site. We anticipate that our results will inform investigations of DSB repair mechanisms as well as the design of CRISPR/Cas9 experiments for diverse applications including genome-wide screens, gene therapy, lineage tracing and molecular recording.
DOI: http://dx.doi.org/10.1093/nar/gkz487
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6735782

Response to comments on 'Empirical comparison of web-based antimicrobial peptide prediction tools'.

Bioinformatics 2019 08;35(15):2695-2696

Department of Genome Sciences, University of Washington, Seattle, WA, USA.

DOI: http://dx.doi.org/10.1093/bioinformatics/bty1024
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6662286

Averaging Strategy To Reduce Variability in Target-Decoy Estimates of False Discovery Rate.

J Proteome Res 2019 02 3;18(2):585-593. Epub 2019 Jan 3.

Department of Genome Sciences, University of Washington, Foege Building S220B, 3720 15th Avenue NE, Seattle, Washington 98195-5065, United States.

Decoy database search with target-decoy competition (TDC) provides an intuitive, easy-to-implement method for estimating the false discovery rate (FDR) associated with spectrum identifications from shotgun proteomics data. However, the procedure can yield different results for a fixed data set analyzed with different decoy databases, and this decoy-induced variability is particularly problematic for smaller FDR thresholds, data sets, or databases. The average TDC (aTDC) protocol combats this problem by exploiting multiple independently shuffled decoy databases to provide an FDR estimate with reduced variability. We provide a tutorial introduction to aTDC, describe an improved variant of the protocol that offers increased statistical power, and discuss how to deploy aTDC in practice using the Crux software toolkit.
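
In miniature, TDC estimates the FDR at a score threshold from the ratio of passing decoys to passing targets, and the averaging strategy repeats that estimate over several independently shuffled decoy sets. The sketch below is a cartoon of that idea; the published aTDC protocol involves per-spectrum competition and a refined estimator.

```python
import numpy as np

def tdc_fdr(targets, decoys, threshold):
    """Basic target-decoy FDR estimate: decoys passing / targets passing."""
    t = np.sum(targets >= threshold)
    d = np.sum(decoys >= threshold)
    return min(1.0, (d + 1) / max(t, 1))

def atdc_fdr(targets, decoy_sets, threshold):
    """Averaging idea in miniature: estimate the FDR with each independently
    shuffled decoy set, then average to reduce decoy-induced variability."""
    return float(np.mean([tdc_fdr(targets, d, threshold) for d in decoy_sets]))

rng = np.random.default_rng(10)
targets = rng.normal(1.0, 1.0, 5000)                  # mixed correct/incorrect
decoy_sets = [rng.normal(0.0, 1.0, 5000) for _ in range(10)]
print(atdc_fdr(targets, decoy_sets, threshold=2.0))
```
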
DOI: http://dx.doi.org/10.1021/acs.jproteome.8b00802
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6919216

Controlling the FDR in imperfect matches to an incomplete database.

J Am Stat Assoc 2018 28;113(523):973-982. Epub 2018 Jun 28.

Departments of Genome Sciences and of Computer Science and Engineering, University of Washington.

We consider the problem of controlling the FDR among discoveries from searching an incomplete database. This problem differs from the classical multiple testing setting because there are two different types of false discoveries: those arising from objects that have no match in the database and those that are incorrectly matched. We show that commonly used FDR controlling procedures are inadequate for this setup, a special case of which is tandem mass spectrum identification. We then derive a novel FDR controlling approach which extensive simulations suggest is unbiased. We also compare its performance with problem-specific as well as general FDR controlling procedures using both simulated and real mass spectrometry data.
DOI: http://dx.doi.org/10.1080/01621459.2017.1375931
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6287756

Joint Precursor Elution Profile Inference via Regression for Peptide Detection in Data-Independent Acquisition Mass Spectra.

J Proteome Res 2019 01 26;18(1):86-94. Epub 2018 Oct 26.

In data-independent acquisition (DIA) mass spectrometry, precursor scans are interleaved with wide-window fragmentation scans, resulting in complex fragmentation spectra containing multiple coeluting peptide species. In this setting, detecting the isotope distribution profiles of intact peptides in the precursor scans can be a critical initial step in accurate peptide detection and quantification. This peak detection step is particularly challenging when the isotope peaks associated with two different peptide species overlap, or interfere, with one another. We propose a regression model, called Siren, to detect isotopic peaks in precursor DIA data that can explicitly account for interference. We validate Siren's peak-calling performance on a variety of data sets by counting how many of the peaks Siren identifies are associated with confidently detected peptides. In particular, we demonstrate that substituting the Siren regression model in place of the existing peak-calling step in DIA-Umpire leads to improved overall rates of peptide detection.
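
The regression view can be illustrated with a toy deconvolution: model the observed precursor-scan signal as a nonnegative combination of candidate isotope-distribution templates, so interfering peptides are fit jointly rather than peak by peak. Templates, bins, and abundances below are invented for illustration; Siren's actual model is richer.

```python
import numpy as np
from scipy.optimize import nnls

bins = 40

def template(start, spacing, heights):
    """Place an idealized isotope envelope on an m/z bin grid."""
    t = np.zeros(bins)
    for k, h in enumerate(heights):
        t[start + k * spacing] = h
    return t

# Two hypothetical, overlapping candidate peptides (design matrix columns).
A = np.column_stack([
    template(5, 2, [1.0, 0.8, 0.4]),
    template(7, 2, [1.0, 0.9, 0.5]),
])
observed = (3.0 * A[:, 0] + 1.5 * A[:, 1]
            + 0.05 * np.random.default_rng(8).normal(size=bins))

coef, _ = nnls(A, observed)   # non-negative least squares fit
print(coef)  # ~[3.0, 1.5]: per-peptide abundances despite interference
```
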
DOI: http://dx.doi.org/10.1021/acs.jproteome.8b00365
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6465123

Calibration Using a Single-Point External Reference Material Harmonizes Quantitative Mass Spectrometry Proteomics Data between Platforms and Laboratories.

Anal Chem 2018 11 23;90(21):13112-13117. Epub 2018 Oct 23.

Mass spectrometry (MS) measurements are not inherently calibrated. Researchers use various calibration methods to assign meaning to arbitrary signal intensities and improve precision. Internal calibration (IC) methods use internal standards (IS) such as synthesized or recombinant proteins or peptides to calibrate MS measurements by comparing endogenous analyte signal to the signal from known IS concentrations spiked into the same sample. However, recent work suggests that using IS as IC introduces quantitative biases that affect comparison across studies because of the inability of IS to capture all sources of variation present throughout an MS workflow. Here, we describe a single-point external calibration strategy to calibrate signal intensity measurements to a common reference material, placing MS measurements on the same scale and harmonizing signal intensities between instruments, acquisition methods, and sites. We demonstrate data harmonization between laboratories and methodologies using this generalizable approach.
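
The arithmetic of single-point external calibration is simple enough to show directly: dividing each analyte's signal by the signal of a common reference material cancels the platform-specific response scale. The numbers below are fabricated solely to illustrate the cancellation.

```python
import numpy as np

true_conc = np.array([1.0, 2.0, 4.0])    # same analytes at both sites
site_a = 5.0 * true_conc                 # site A's arbitrary response scale
site_b = 12.0 * true_conc                # site B's different scale
ref_a, ref_b = 5.0 * 2.0, 12.0 * 2.0     # shared reference, measured per site

print(site_a / ref_a)  # [0.5 1. 2.]
print(site_b / ref_b)  # [0.5 1. 2.] -> harmonized, comparable scale
```
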
DOI: http://dx.doi.org/10.1021/acs.analchem.8b04581
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6854904

Combining High-Resolution and Exact Calibration To Boost Statistical Power: A Well-Calibrated Score Function for High-Resolution MS2 Data.

J Proteome Res 2018 11 18;17(11):3644-3656. Epub 2018 Oct 18.

Department of Genome Sciences, University of Washington, Seattle, Washington 98195, United States.

To achieve accurate assignment of peptide sequences to observed fragmentation spectra, a shotgun proteomics database search tool must make good use of the very high-resolution information produced by state-of-the-art mass spectrometers. However, making use of this information while also ensuring that the search engine's scores are well calibrated, that is, that the score assigned to one spectrum can be meaningfully compared to the score assigned to a different spectrum, has proven to be challenging. Here we describe a database search score function, the "residue evidence" (res-ev) score, that achieves both of these goals simultaneously. We also demonstrate how to combine calibrated res-ev scores with calibrated XCorr scores to produce a "combined p value" score function. We provide a benchmark consisting of four mass spectrometry data sets, which we use to compare the combined p value to the score functions used by several existing search engines. Our results suggest that the combined p value achieves state-of-the-art performance, generally outperforming MS Amanda and Morpheus and performing comparably to MS-GF+. The res-ev and combined p-value score functions are freely available as part of the Tide search engine in the Crux mass spectrometry toolkit (http://crux.ms).
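
For intuition, one standard way to combine two independent, well-calibrated p values is Fisher's method, shown below. This is an illustration of p value combination in general; the paper's combined p value for res-ev plus XCorr may use a different combiner.

```python
import numpy as np
from scipy.stats import chi2

def fisher_combine(p1, p2):
    """Fisher's method: -2*sum(log p) ~ chi-squared with 2k df under the null."""
    stat = -2.0 * (np.log(p1) + np.log(p2))
    return chi2.sf(stat, df=4)

print(fisher_combine(0.01, 0.03))  # smaller than either input p value
```
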
DOI: http://dx.doi.org/10.1021/acs.jproteome.8b00206
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6342018

Integrative detection and analysis of structural variation in cancer genomes.

Nat Genet 2018 10 10;50(10):1388-1398. Epub 2018 Sep 10.

Department of Biochemistry and Molecular Biology, College of Medicine, The Pennsylvania State University, Hershey, PA, USA.

Structural variants (SVs) can contribute to oncogenesis through a variety of mechanisms. Despite their importance, the identification of SVs in cancer genomes remains challenging. Here, we present a framework that integrates optical mapping, high-throughput chromosome conformation capture (Hi-C), and whole-genome sequencing to systematically detect SVs in a variety of normal or cancer samples and cell lines. We identify the unique strengths of each method and demonstrate that only integrative approaches can comprehensively identify SVs in the genome. By combining Hi-C and optical mapping, we resolve complex SVs and phase multiple SV events to a single haplotype. Furthermore, we observe widespread structural variation events affecting the functions of noncoding sequences, including the deletion of distal regulatory sequences, alteration of DNA replication timing, and the creation of novel three-dimensional chromatin structural domains. Our results indicate that noncoding SVs may be underappreciated mutational drivers in cancer genomes.
DOI: http://dx.doi.org/10.1038/s41588-018-0195-8
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6301019

Fast Open Modification Spectral Library Searching through Approximate Nearest Neighbor Indexing.

J Proteome Res 2018 10 13;17(10):3463-3474. Epub 2018 Sep 13.

Department of Mathematics and Computer Science, University of Antwerp, 2020 Antwerp, Belgium.

Open modification searching (OMS) is a powerful search strategy that identifies peptides carrying any type of modification by allowing a modified spectrum to match against its unmodified variant by using a very wide precursor mass window. A drawback of this strategy, however, is that it leads to a large increase in search time. Although performing an open search can be done using existing spectral library search engines by simply setting a wide precursor mass window, none of these tools have been optimized for OMS, leading to excessive runtimes and suboptimal identification results. We present the ANN-SoLo tool for fast and accurate open spectral library searching. ANN-SoLo uses approximate nearest neighbor indexing to speed up OMS by selecting only a limited number of the most relevant library spectra to compare to an unknown query spectrum. This approach is combined with a cascade search strategy to maximize the number of identified unmodified and modified spectra while strictly controlling the false discovery rate, as well as a shifted dot product score to sensitively match modified spectra to their unmodified counterparts. ANN-SoLo achieves state-of-the-art performance in terms of speed and the number of identifications. On a previously published human cell line data set, ANN-SoLo confidently identifies more spectra than SpectraST or MSFragger and achieves a speedup of an order of magnitude compared with SpectraST. ANN-SoLo is implemented in Python and C++. It is freely available under the Apache 2.0 license at https://github.com/bittremieux/ANN-SoLo.
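
The shifted dot product is the distinctive scoring piece and is easy to sketch: a query peak may match a library peak either directly or offset by the precursor mass difference, so fragments carrying the modification still contribute. The quadratic matching loop and tolerance below are illustrative simplifications, not ANN-SoLo's implementation.

```python
def shifted_dot(query_mz, query_int, lib_mz, lib_int, mass_diff, tol=0.02):
    """Toy shifted dot product over centroided peak lists."""
    score = 0.0
    for qm, qi in zip(query_mz, query_int):
        for lm, li in zip(lib_mz, lib_int):
            # Match directly, or shifted by the precursor mass difference.
            if abs(qm - lm) < tol or abs(qm - (lm + mass_diff)) < tol:
                score += qi * li
                break
    return score

# Oxidation-like example: two fragments shifted by ~16 Da still match.
print(shifted_dot([200.10, 331.18, 500.27], [0.5, 1.0, 0.3],
                  [200.10, 315.17, 484.26], [0.6, 0.9, 0.4], mass_diff=16.01))
```
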
DOI: http://dx.doi.org/10.1021/acs.jproteome.8b00359
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6173621

Unsupervised embedding of single-cell Hi-C data.

Bioinformatics 2018 07;34(13):i96-i104

Department of Genome Sciences, University of Washington, Seattle, WA, USA.

Motivation: Single-cell Hi-C (scHi-C) data promises to enable scientists to interrogate the 3D architecture of DNA in the nucleus of the cell, studying how this structure varies stochastically or along developmental or cell-cycle axes. However, Hi-C data analysis requires methods that take into account the unique characteristics of this type of data. In this work, we explore whether methods developed previously for the analysis of bulk Hi-C data can be applied to scHi-C data in conjunction with unsupervised embedding.

Results: We find that one of these methods, HiCRep, when used in conjunction with multidimensional scaling (MDS), strongly outperforms three other methods, including a technique that has been used previously for scHi-C analysis. We also provide evidence that the HiCRep/MDS method is robust to extremely low per-cell sequencing depth, that this robustness is improved even further when high-coverage and low-coverage cells are projected together, and that the method can be used to jointly embed cells from multiple published datasets.
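
The HiCRep/MDS recipe reduces to two steps once pairwise similarities are in hand: convert the cell-by-cell SCC matrix into a dissimilarity and feed it to multidimensional scaling. The SCC matrix below is random filler; in practice it would come from HiCRep applied to every pair of cells.

```python
import numpy as np
from sklearn.manifold import MDS

# Hypothetical precomputed cell-by-cell SCC similarity matrix.
rng = np.random.default_rng(5)
n = 100
scc_matrix = (rng.uniform(0.5, 1.0, (n, n)) + np.eye(n)) / 2
scc_matrix = (scc_matrix + scc_matrix.T) / 2   # symmetrize
np.fill_diagonal(scc_matrix, 1.0)

dist = 1.0 - scc_matrix                  # similarity -> dissimilarity
embedding = MDS(n_components=2, dissimilarity="precomputed",
                random_state=0).fit_transform(dist)
print(embedding.shape)                   # 2-D coordinates, one row per cell
```
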
DOI: http://dx.doi.org/10.1093/bioinformatics/bty285
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6022597

Changes in genome organization of parasite-specific gene families during the Plasmodium transmission stages.

Nat Commun 2018 05 15;9(1):1910. Epub 2018 May 15.

Department of Molecular, Cell and Systems Biology, University of California Riverside, 900 University Ave, Riverside, CA, 92521, USA.

The development of malaria parasites throughout their various life cycle stages is coordinated by changes in gene expression. We previously showed that the three-dimensional organization of the Plasmodium falciparum genome is strongly associated with gene expression during its replication cycle inside red blood cells. Here, we analyze genome organization in the P. falciparum and P. vivax transmission stages. Major changes occur in the localization and interactions of genes involved in pathogenesis and immune evasion, host cell invasion, sexual differentiation, and master regulation of gene expression. Furthermore, we observe reorganization of subtelomeric heterochromatin around genes involved in host cell remodeling. Depletion of heterochromatin protein 1 (PfHP1) resulted in loss of interactions between virulence genes, confirming that PfHP1 is essential for maintenance of the repressive center. Our results suggest that the three-dimensional genome structure of human malaria parasites is strongly connected with transcriptional activity of specific gene families throughout the life cycle.
DOI: http://dx.doi.org/10.1038/s41467-018-04295-5
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5954139

PREDICTD PaRallel Epigenomics Data Imputation with Cloud-based Tensor Decomposition.

Nat Commun 2018 04 11;9(1):1402. Epub 2018 Apr 11.

Department of Genome Sciences, University of Washington, Foege Building S-250, Box 355065, 3720 15th Ave NE, Seattle, WA, 98195, USA.

The Encyclopedia of DNA Elements (ENCODE) and the Roadmap Epigenomics Project seek to characterize the epigenome in diverse cell types using assays that identify, for example, genomic regions with modified histones or accessible chromatin. These efforts have produced thousands of datasets but cannot possibly measure each epigenomic factor in all cell types. To address this, we present a method, PaRallel Epigenomics Data Imputation with Cloud-based Tensor Decomposition (PREDICTD), to computationally impute missing experiments. PREDICTD leverages an elegant model called "tensor decomposition" to impute many experiments simultaneously. Compared with the current state-of-the-art method, ChromImpute, PREDICTD produces lower overall mean squared error, and combining the two methods yields further improvement. We show that PREDICTD data captures enhancer activity at noncoding human accelerated regions. PREDICTD provides reference imputed data and open-source software for investigating new cell types, and demonstrates the utility of tensor decomposition and cloud computing, both promising technologies for bioinformatics.
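
The tensor-decomposition view is worth a tiny worked example: if the (cell type x assay x genomic position) tensor is approximately low rank, an unobserved experiment is imputed directly from the learned factor rows. The rank and dimensions below are invented, and the factors are random rather than learned.

```python
import numpy as np

rng = np.random.default_rng(6)
rank, C, A, G = 4, 20, 15, 1000        # cell types, assays, genomic bins
U = rng.normal(size=(C, rank))         # cell-type factors
V = rng.normal(size=(A, rank))         # assay factors
W = rng.normal(size=(G, rank))         # genomic-position factors
tensor = np.einsum('cr,ar,gr->cag', U, V, W)   # noiseless low-rank "data"

# Imputing one missing experiment (cell c, assay a) is just the inner
# product of its factor rows with every genomic-position factor:
c, a = 3, 7
imputed_track = (U[c] * V[a]) @ W.T    # length-G signal track
print(np.allclose(imputed_track, tensor[c, a]))  # True for noiseless data
```
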
DOI: http://dx.doi.org/10.1038/s41467-018-03635-9
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5895786

GenomeDISCO: a concordance score for chromosome conformation capture experiments using random walks on contact map graphs.

Bioinformatics 2018 08;34(16):2701-2707

Department of Genetics, Stanford University School of Medicine, Stanford, CA, USA.

Motivation: The three-dimensional organization of chromatin plays a critical role in gene regulation and disease. High-throughput chromosome conformation capture experiments such as Hi-C are used to obtain genome-wide maps of three-dimensional chromatin contacts. However, robust estimation of data quality and systematic comparison of these contact maps is challenging due to the multi-scale, hierarchical structure of chromatin contacts and the resulting properties of experimental noise in the data. Measuring concordance of contact maps is important for assessing reproducibility of replicate experiments and for modeling variation between different cellular contexts.

Results: We introduce a concordance measure called DIfferences between Smoothed COntact maps (GenomeDISCO) for assessing the similarity of a pair of contact maps obtained from chromosome conformation capture experiments. The key idea is to smooth contact maps using random walks on the contact map graph, before estimating concordance. We use simulated datasets to benchmark GenomeDISCO's sensitivity to different types of noise that affect chromatin contact maps. When applied to a large collection of Hi-C datasets, GenomeDISCO accurately distinguishes biological replicates from samples obtained from different cell types. GenomeDISCO also generalizes to other chromosome conformation capture assays, such as HiChIP.
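
The random-walk smoothing at GenomeDISCO's core is a short computation: row-normalize each contact map into a transition matrix, take t random-walk steps via a matrix power, and compare the smoothed maps. The scoring below is a simplified stand-in for the published normalization and choice of t.

```python
import numpy as np

def disco_like_score(m1, m2, t=3):
    """Sketch of the GenomeDISCO idea: smooth by t random-walk steps,
    then score the (per-row-averaged) L1 difference of the smoothed maps."""
    def smooth(m):
        p = m / np.maximum(m.sum(axis=1, keepdims=True), 1e-12)
        return np.linalg.matrix_power(p, t)
    d = np.abs(smooth(m1) - smooth(m2)).sum() / m1.shape[0]
    return 1.0 - d   # higher means more concordant

rng = np.random.default_rng(7)
base = rng.poisson(5, (200, 200))
base = base + base.T                     # symmetric "true" contact map
rep1 = base + rng.poisson(1, (200, 200))
rep2 = base + rng.poisson(1, (200, 200))
print(disco_like_score(rep1, rep1), disco_like_score(rep1, rep2))
```
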

Availability And Implementation: Software implementing GenomeDISCO is available at https://github.com/kundajelab/genomedisco.

Supplementary Information: Supplementary data are available at Bioinformatics online.
DOI: http://dx.doi.org/10.1093/bioinformatics/bty164
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6084597

Choosing non-redundant representative subsets of protein sequence data sets using submodular optimization.

Proteins 2018 04 1;86(4):454-466. Epub 2018 Feb 1.

Department of Genome Sciences, University of Washington, Seattle, Washington.

Selecting a non-redundant representative subset of sequences is a common step in many bioinformatics workflows, such as the creation of non-redundant training sets for sequence and structural models or selection of "operational taxonomic units" from metagenomics data. Previous methods for this task, such as CD-HIT, PISCES, and UCLUST, apply a heuristic threshold-based algorithm that has no theoretical guarantees. We propose a new approach based on submodular optimization. Submodular optimization, a discrete analogue to continuous convex optimization, has been used with great success for other representative set selection problems. We demonstrate that the submodular optimization approach results in representative protein sequence subsets with greater structural diversity than sets chosen by existing methods, using as a gold standard the SCOPe library of protein domain structures. In this setting, submodular optimization consistently yields protein sequence subsets that include more SCOPe domain families than sets of the same size selected by competing approaches. We also show how the optimization framework allows us to design a mixture objective function that performs well for both large and small representative sets. The framework we describe is the best possible in polynomial time (under some assumptions), and it is flexible and intuitive because it applies a suite of generic methods to optimize one of a variety of objective functions.
DOI: http://dx.doi.org/10.1002/prot.25461
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5835207

Progressive calibration and averaging for tandem mass spectrometry statistical confidence estimation: Why settle for a single decoy?

Res Comput Mol Biol 2017 May 12;10229:99-116. Epub 2017 Apr 12.

Department of Genome Sciences, Department of Computer Science and Engineering, University of Washington.

Estimating the false discovery rate (FDR) among a list of tandem mass spectrum identifications is mostly done through target-decoy competition (TDC). Here we offer two new methods that can use an arbitrarily small number of additional randomly drawn decoy databases to improve TDC. Specifically, "Partial Calibration" utilizes a new meta-scoring scheme that allows us to gradually benefit from the increase in the number of identifications that calibration yields, and "Averaged TDC" (a-TDC) reduces the liberal bias of TDC for small FDR values and its variability throughout. Combining a-TDC with "Progressive Calibration" (PC), which attempts to find the "right" number of decoys required for calibration, we see substantial impact in real datasets: when analyzing the data, it typically yields almost the entire 17% increase in discoveries that "full calibration" yields (at FDR level 0.05) using 60 times fewer decoys. Our methods are further validated using a novel realistic simulation scheme and, importantly, they apply more generally to the problem of controlling the FDR among discoveries from searching an incomplete database.
DOI: http://dx.doi.org/10.1007/978-3-319-56970-3_7
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5758044

MetaGOmics: A Web-Based Tool for Peptide-Centric Functional and Taxonomic Analysis of Metaproteomics Data.

Proteomes 2017 Dec 27;6(1). Epub 2017 Dec 27.

Department of Genome Sciences, University of Washington, Seattle, WA 98195, USA.

Metaproteomics is the characterization of all proteins being expressed by a community of organisms in a complex biological sample at a single point in time. Applications of metaproteomics range from the comparative analysis of environmental samples (such as ocean water and soil) to microbiome data from multicellular organisms (such as the human gut). Metaproteomics research is often focused on the quantitative functional makeup of the metaproteome and which organisms are making those proteins. That is: What are the functions of the currently expressed proteins? How much of the metaproteome is associated with those functions? And, which microorganisms are expressing the proteins that perform those functions? However, traditional protein-centric functional analysis is greatly complicated by the large size, redundancy, and lack of biological annotations for the protein sequences in the database used to search the data. To help address these issues, we have developed an algorithm and web application (dubbed "MetaGOmics") that automates the quantitative functional (using Gene Ontology) and taxonomic analysis of metaproteomics data and subsequent visualization of the results. MetaGOmics is designed to overcome the shortcomings of traditional proteomics analysis when used with metaproteomics data. It is easy to use, requires minimal input, and fully automates most steps of the analysis-including comparing the functional makeup between samples. MetaGOmics is freely available at https://www.yeastrc.org/metagomics/.
DOI: http://dx.doi.org/10.3390/proteomes6010002
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5874761

Segway 2.0: Gaussian mixture models and minibatch training.

Bioinformatics 2018 02;34(4):669-671

Princess Margaret Cancer Centre, Toronto, ON M5G 1L7, Canada.

Summary: Segway performs semi-automated genome annotation, discovering joint patterns across multiple genomic signal datasets. We discuss a major new version of Segway and highlight its ability to model data with substantially greater accuracy. Major enhancements in Segway 2.0 include the ability to model data with a mixture of Gaussians, enabling capture of arbitrarily complex signal distributions, and minibatch training, leading to better learned parameters.
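
The headline modeling change is the move from a single Gaussian to a mixture of Gaussians per label, which a toy example makes concrete: bimodal signal that one Gaussian fits poorly is captured cleanly by a two-component mixture. Segway fits such mixtures inside a dynamic Bayesian network rather than standalone, so this is only an analogy on made-up data.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Bimodal "genomic signal": background near 0, enriched regions near 3.
rng = np.random.default_rng(9)
signal = np.concatenate([rng.normal(0, 0.3, 500),
                         rng.normal(3, 0.5, 500)])[:, None]

gmm = GaussianMixture(n_components=2, random_state=0).fit(signal)
print(gmm.means_.ravel())   # recovers the two signal modes near 0 and 3
```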

Availability And Implementation: Segway and its source code are freely available for download at http://segway.hoffmanlab.org. We have made available scripts (https://doi.org/10.5281/zenodo.802939) and datasets (https://doi.org/10.5281/zenodo.802906) for this paper's analysis.

Contact: michael.hoffman@utoronto.ca.

Supplementary Information: Supplementary data are available at Bioinformatics online.
DOI: http://dx.doi.org/10.1093/bioinformatics/btx603
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5860603

Ten simple rules for writing a response to reviewers.

PLoS Comput Biol 2017 Oct 12;13(10):e1005730. Epub 2017 Oct 12.

Departments of Genome Sciences and Computer Science and Engineering, University of Washington, Seattle, Washington, United States of America.

DOI: http://dx.doi.org/10.1371/journal.pcbi.1005730
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5638205

HiCRep: assessing the reproducibility of Hi-C data using a stratum-adjusted correlation coefficient.

Genome Res 2017 11 30;27(11):1939-1949. Epub 2017 Aug 30.

Bioinformatics and Genomics Program, Pennsylvania State University, University Park, Pennsylvania 16802, USA.

Hi-C is a powerful technology for studying genome-wide chromatin interactions. However, current methods for assessing Hi-C data reproducibility can produce misleading results because they ignore spatial features in Hi-C data, such as domain structure and distance dependence. We present HiCRep, a framework for assessing the reproducibility of Hi-C data that systematically accounts for these features. In particular, we introduce a novel similarity measure, the stratum adjusted correlation coefficient (SCC), for quantifying the similarity between Hi-C interaction matrices. Not only does SCC provide a statistically sound and reliable evaluation of reproducibility, but it can also be used to quantify differences between Hi-C contact matrices and to determine the optimal sequencing depth for a desired resolution. The measure consistently shows higher accuracy than existing approaches in distinguishing subtle differences in reproducibility and depicting interrelationships of cell lineages. The proposed measure is straightforward to interpret and easy to compute, making it well-suited for providing standardized, interpretable, automatable, and scalable quality control. The freely available R package HiCRep implements our approach.
DOI: http://dx.doi.org/10.1101/gr.220640.117
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5668950