Publications by authors named "Srinivas Aluru"

40 Publications

A comprehensive evaluation of long read error correction methods.

BMC Genomics 2020 Dec 21;21(Suppl 6):889. Epub 2020 Dec 21.

School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, GA 30332, USA.

Background: Third-generation single-molecule sequencing technologies can sequence long reads, which is advancing the frontiers of genomics research. However, their high error rates prohibit accurate and efficient downstream analysis. This difficulty has motivated the development of many long read error correction tools, which tackle this problem through sampling redundancy and/or leveraging accurate short reads of the same biological samples. Existing studies to assess these tools use simulated data sets, and are not sufficiently comprehensive in the range of software covered or the diversity of evaluation measures used.

Results: In this paper, we present a categorization and review of long read error correction methods, and provide a comprehensive evaluation of the corresponding long read error correction tools. Leveraging recent real sequencing data, we establish benchmark data sets and set up evaluation criteria for a comparative assessment that includes the quality of error correction as well as run-time and memory usage. We study how trimming and long read sequencing depth affect error correction in terms of length distribution and genome coverage post-correction, as well as the impact of error correction on an important application of long reads: genome assembly. We provide guidelines to help practitioners choose among the available error correction tools, and identify directions for future research.

Conclusions: Despite the high error rate of long reads, the state-of-the-art correction tools can achieve high correction quality. When short reads are available, the best hybrid methods outperform non-hybrid methods in terms of correction quality and computing resource usage. When choosing tools, practitioners should be cautious with the few correction tools that discard reads, and should check the effect of error correction on downstream analysis. Our evaluation code is available as open source at https://github.com/haowenz/LRECE.
http://dx.doi.org/10.1186/s12864-020-07227-0
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7751105
December 2020

An alignment-free heuristic for fast sequence comparisons with applications to phylogeny reconstruction.

BMC Bioinformatics 2020 Nov 18;21(Suppl 6):404. Epub 2020 Nov 18.

Institute for Data Engineering and Science, Georgia Institute of Technology, 756 W Peachtree Street NW, Atlanta, USA.

Background: Alignment-free methods for sequence comparison have become popular in many bioinformatics applications, specifically in the estimation of sequence similarity measures for constructing phylogenetic trees. Recently, the average common substring measure, ACS, and its k-mismatch counterpart, ACS_k, have been shown to produce results as effective as multiple-sequence-alignment-based methods for the reconstruction of phylogenetic trees. Since computing ACS_k exactly takes O(n log^k n) time and is hence impractical for large datasets, multiple heuristics that can approximate ACS_k have been introduced.

Results: In this paper, we present a novel linear-time heuristic to approximate ACS_k, which is faster than computing the exact ACS_k while producing values closer to the exact ACS_k than previously published linear-time greedy heuristics. Using four real datasets, containing both DNA and protein sequences, we evaluate our algorithm in terms of accuracy and runtime, and demonstrate its applicability to phylogeny reconstruction. Our algorithm provides better accuracy than previously published heuristic methods, while being comparable in its applications to phylogeny reconstruction.
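For reference, the exact ACS measure itself (without mismatches) is easy to state: for every position of one sequence, find the longest prefix of the suffix starting there that occurs anywhere in the other sequence, and average those lengths. Below is a deliberately naive quadratic Python sketch of that definition; the function name and strategy are ours for illustration, not the paper's linear-time heuristic:

```python
def acs(x, y):
    """Average common substring: for each position i of x, the length of
    the longest prefix of x[i:] that occurs somewhere in y, averaged over
    all i.  A naive version written for clarity, not speed."""
    total = 0
    for i in range(len(x)):
        l = 0
        # grow the match greedily while the prefix still occurs in y
        while i + l < len(x) and x[i:i + l + 1] in y:
            l += 1
        total += l
    return total / len(x)
```

For example, `acs("ACGT", "ACGT")` averages match lengths 4, 3, 2, 1 and yields 2.5.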

Conclusions: Our method produces a better approximation of ACS_k and is applicable to the alignment-free comparison of biological sequences at highly competitive speed. The algorithm is implemented in the Rust programming language and the source code is available at https://github.com/srirampc/adyar-rs.
http://dx.doi.org/10.1186/s12859-020-03738-5
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7672814
November 2020

An SVM-based method for assessment of transcription factor-DNA complex models.

BMC Bioinformatics 2018 Dec 21;19(Suppl 20):506. Epub 2018 Dec 21.

Department of Bioinformatics and Genomics, University of North Carolina at Charlotte, 9201 University City Blvd, Charlotte, NC, 28223, USA.

Background: Atomic details of protein-DNA complexes can provide insightful information for better understanding of the function and binding specificity of DNA binding proteins. In addition to experimental methods for solving protein-DNA complex structures, protein-DNA docking can be used to predict native or near-native complex models. A docking program typically generates a large number of complex conformations and predicts the complex model(s) based on interaction energies between protein and DNA. However, the prediction accuracy is hampered by current approaches to model assessment, especially when docking simulations fail to produce any near-native models.

Results: We present here a Support Vector Machine (SVM)-based approach for quality assessment of predicted transcription factor (TF)-DNA complex models. Besides a knowledge-based protein-DNA interaction potential, DDNA3, we applied to model quality assessment several structural features that have been shown to play important roles in the binding specificity between transcription factors and DNA molecules. To address the issue of unbalanced positive and negative cases in the training dataset, we applied hard-negative mining, an iterative training process that selects an initial training dataset by combining all of the positive cases with a random sample from the negative cases, and then repeatedly augments it with the negatives misclassified in the previous round. Results show that the SVM model greatly improves prediction accuracy (84.2%) over two knowledge-based protein-DNA interaction potentials, orientation potential (60.8%) and DDNA3 (68.4%). The improvement is achieved by reducing the number of false positive predictions, especially for hard docking cases, in which a docking algorithm fails to produce any near-native complex models.
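The hard-negative mining loop described above can be sketched generically. In the sketch below, a toy nearest-centroid threshold rule stands in for the SVM purely to keep the example self-contained; only the mining structure (train on all positives plus a small negative sample, then repeatedly fold misclassified negatives back into the training set) is the part being illustrated:

```python
def train(pos, neg):
    # Toy stand-in for the SVM: centroid of the positives, with the
    # acceptance radius set by the nearest training negative.
    cx = sum(p[0] for p in pos) / len(pos)
    cy = sum(p[1] for p in pos) / len(pos)
    thr = min((n[0] - cx) ** 2 + (n[1] - cy) ** 2 for n in neg)
    return cx, cy, thr

def predict(model, x):
    cx, cy, thr = model
    return (x[0] - cx) ** 2 + (x[1] - cy) ** 2 < thr

def hard_negative_mining(pos, neg, rounds=5):
    train_neg = neg[:1]              # initial (small) negative sample
    model = train(pos, train_neg)
    for _ in range(rounds):
        # negatives the current model misclassifies are "hard" examples
        hard = [n for n in neg if predict(model, n) and n not in train_neg]
        if not hard:
            break
        train_neg += hard            # fold them back in and retrain
        model = train(pos, train_neg)
    return model
```

After mining, the model accepts the positive cluster while rejecting every negative, including those the initial negative sample left misclassified.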

Conclusions: A learning-based SVM scoring model with structural features for specific protein-DNA binding and an atomic-level protein-DNA interaction potential DDNA3 significantly improves prediction accuracy of complex models by successfully identifying cases without near-native structural models.
http://dx.doi.org/10.1186/s12859-018-2538-y
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6302363
December 2018

High throughput ANI analysis of 90K prokaryotic genomes reveals clear species boundaries.

Nat Commun 2018 11 30;9(1):5114. Epub 2018 Nov 30.

School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, GA, 30332, USA.

A fundamental question in microbiology is whether there is a continuum of genetic diversity among genomes, or whether clear species boundaries prevail instead. Whole-genome similarity metrics such as Average Nucleotide Identity (ANI) help address this question by facilitating high resolution taxonomic analysis of thousands of genomes from diverse phylogenetic lineages. To scale to available genomes and beyond, we present FastANI, a new method to estimate ANI using alignment-free approximate sequence mapping. FastANI is accurate for both finished and draft genomes, and is up to three orders of magnitude faster than alignment-based approaches. We leverage FastANI to compute pairwise ANI values among all prokaryotic genomes available in the NCBI database. Our results reveal clear genetic discontinuity, with 99.8% of the total 8 billion genome pairs analyzed conforming to >95% intra-species and <83% inter-species ANI values. This discontinuity is manifested with or without the most frequently sequenced species, and is robust to historical additions to the genome databases.
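FastANI itself derives ANI from alignment-free sequence mappings; as a much-simplified illustration of the same underlying idea (estimating identity without alignment), the Mash-style formula converts a k-mer Jaccard index j into an identity estimate via ANI ≈ 1 + ln(2j/(1+j))/k. A minimal sketch under that assumption, with toy parameters and function names of our own:

```python
import math

def kmers(s, k):
    """Set of all length-k substrings of s."""
    return {s[i:i + k] for i in range(len(s) - k + 1)}

def mash_ani(a, b, k=5):
    """Identity estimate from the k-mer Jaccard index j, using the
    Mash distance D = -(1/k) * ln(2j / (1 + j)) and ANI ~ 1 - D."""
    ka, kb = kmers(a, k), kmers(b, k)
    j = len(ka & kb) / len(ka | kb)
    if j == 0:
        return 0.0   # no shared k-mers: no meaningful estimate
    return 1.0 + math.log(2 * j / (1 + j)) / k
```

Identical sequences give j = 1 and hence an estimate of exactly 1.0; real use would employ much larger k and sketched k-mer sets.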
http://dx.doi.org/10.1038/s41467-018-07641-9
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6269478
November 2018

A fast adaptive algorithm for computing whole-genome homology maps.

Bioinformatics 2018 09;34(17):i748-i756

School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, GA, USA.

Motivation: Whole-genome alignment is an important problem in genomics for comparing different species, mapping draft assemblies to reference genomes and identifying repeats. However, for large plant and animal genomes, this task remains compute and memory intensive. In addition, current practical methods lack any guarantee on the characteristics of output alignments, thus making them hard to tune for different application requirements.

Results: We introduce an approximate algorithm for computing local alignment boundaries between long DNA sequences. Given a minimum alignment length and an identity threshold, our algorithm computes the desired alignment boundaries and identity estimates using k-mer-based statistics, and maintains sufficient probabilistic guarantees on the output sensitivity. Further, to prioritize higher-scoring alignment intervals, we develop a plane-sweep-based filtering technique that is theoretically optimal and practically efficient. Implementation of these ideas resulted in a fast and accurate assembly-to-genome and genome-to-genome mapper. As a result, we were able to map an error-corrected whole-genome NA12878 human assembly to the hg38 human reference genome in about 1 min total execution time and <4 GB memory using eight CPU threads, achieving a significant improvement in memory usage over competing methods. The recall accuracy of computed alignment boundaries was consistently found to be >97% on multiple datasets. Finally, we performed a sensitive self-alignment of the human genome to compute all duplications of length ≥1 Kbp and ≥90% identity. The reported output achieves good recall and covers twice as many bases as the current UCSC browser's segmental duplication annotation.
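The plane-sweep filtering idea can be illustrated on toy intervals: an alignment interval survives only if it is the best-scoring one covering at least one reference position. The sketch below is a deliberately naive version of that criterion (checking every event point against every interval), not the paper's optimized sweep:

```python
def plane_sweep_filter(intervals):
    """Keep every interval that is the highest-scoring interval covering
    at least one point.  intervals: list of (start, end, score) tuples
    with inclusive endpoints."""
    # candidate event points: every interval start and every end + 1
    events = sorted({p for s, e, _ in intervals for p in (s, e + 1)})
    keep = set()
    for x in events:
        live = [iv for iv in intervals if iv[0] <= x <= iv[1]]
        if live:
            keep.add(max(live, key=lambda iv: iv[2]))
    return [iv for iv in intervals if iv in keep]
```

An interval fully contained in a higher-scoring one is never the best anywhere and is dropped; an interval that is best over even a single position survives.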

Availability And Implementation: https://github.com/marbl/MashMap.
http://dx.doi.org/10.1093/bioinformatics/bty597
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6129286
September 2018

Fast de Bruijn Graph Compaction in Distributed Memory Environments.

IEEE/ACM Trans Comput Biol Bioinform 2020 Jan-Feb;17(1):136-148. Epub 2018 Jul 31.

De Bruijn graph based genome assembly has gained popularity as short-read sequencers have become ubiquitous. A core assembly operation is the generation of unitigs, which are sequences corresponding to chains in the graph. Unitigs are used as building blocks for generating longer sequences in many assemblers, and can facilitate graph compression. Chain compaction, by which unitigs are generated, remains a critical computational task. In this paper, we present a distributed memory parallel algorithm for simultaneous compaction of all chains in bi-directed de Bruijn graphs. The key advantages of our algorithm include bounding the chain compaction run-time to a logarithmic number of iterations in the length of the longest chain, and the ability to differentiate cycles from chains within a logarithmic number of iterations in the length of the longest cycle. Our algorithm scales to thousands of computational cores, and can compact a whole-genome de Bruijn graph from a human sequence read set in 7.3 seconds using 7680 distributed memory cores, and in 12.9 minutes using 64 shared memory cores. It is 3.7× and 2.0× faster than the equivalent steps in the state-of-the-art tools for distributed and shared memory environments, respectively. An implementation of the algorithm is available at https://github.com/ParBLiSS/bruno.
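A sequential toy version of chain compaction makes the operation concrete: build a de Bruijn graph whose nodes are (k-1)-mers, then merge every maximal non-branching path into a single unitig. This sketch is ours and ignores reverse complements and isolated cycles, both of which the paper's bi-directed, parallel algorithm handles:

```python
from collections import defaultdict

def unitigs(reads, k):
    """Compact maximal non-branching chains of a de Bruijn graph built
    from the k-mers of `reads` (nodes are (k-1)-mers)."""
    out_e, in_e = defaultdict(set), defaultdict(set)
    for r in reads:
        for i in range(len(r) - k + 1):
            u, v = r[i:i + k - 1], r[i + 1:i + k]
            out_e[u].add(v)
            in_e[v].add(u)
    nodes = set(out_e) | set(in_e)

    def chain_internal(v):
        # exactly one way in and one way out: interior of a chain
        return len(in_e[v]) == 1 and len(out_e[v]) == 1

    tigs = []
    for v in sorted(nodes):
        if not chain_internal(v):        # a unitig can only start here
            for w in sorted(out_e[v]):
                tig = v + w[-1]
                while chain_internal(w):  # extend along the chain
                    w = next(iter(out_e[w]))
                    tig += w[-1]
                tigs.append(tig)
    return sorted(tigs)
```

A single read with no branches compacts back to itself; a branch point (here, two reads converging on the same (k-1)-mer) ends every unitig that touches it.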
http://dx.doi.org/10.1109/TCBB.2018.2858797
January 2021

A Fast Approximate Algorithm for Mapping Long Reads to Large Reference Databases.

J Comput Biol 2018 07 30;25(7):766-779. Epub 2018 Apr 30.

National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland.

Emerging single-molecule sequencing technologies from Pacific Biosciences and Oxford Nanopore have revived interest in long-read mapping algorithms. Alignment-based seed-and-extend methods demonstrate good accuracy, but face limited scalability, while faster alignment-free methods typically trade decreased precision for efficiency. In this article, we combine a fast approximate read mapping algorithm based on minimizers with a novel MinHash identity estimation technique to achieve both scalability and precision. In contrast to prior methods, we develop a mathematical framework that defines the types of mapping targets we uncover, establish probabilistic estimates of p-value and sensitivity, and demonstrate tolerance for alignment error rates up to 20%. With this framework, our algorithm automatically adapts to different minimum length and identity requirements and provides both positional and identity estimates for each mapping reported. For mapping human PacBio reads to the hg38 reference, our method is 290 × faster than Burrows-Wheeler Aligner-MEM with a lower memory footprint and recall rate of 96%. We further demonstrate the scalability of our method by mapping noisy PacBio reads (each ≥5 kbp in length) to the complete NCBI RefSeq database containing 838 Gbp of sequence and >60,000 genomes.
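The MinHash identity estimation at the core of the method can be illustrated with a bottom-s sketch over k-mer hashes: the Jaccard similarity of two k-mer sets is estimated from the shared elements of the merged sketch. Toy parameters only; the real algorithm combines this with minimizer-based mapping and converts the Jaccard estimate into an identity estimate:

```python
import hashlib

def sketch(seq, k=4, s=8):
    """Bottom-s MinHash sketch: the s smallest hash values of the
    sequence's k-mer set."""
    hashes = {int(hashlib.md5(seq[i:i + k].encode()).hexdigest(), 16)
              for i in range(len(seq) - k + 1)}
    return set(sorted(hashes)[:s])

def jaccard_estimate(a, b, k=4, s=8):
    """Estimate Jaccard(a, b) from the bottom-s sketch of the union:
    the fraction of its elements present in both sketches."""
    sa, sb = sketch(a, k, s), sketch(b, k, s)
    merged = sorted(sa | sb)[:s]
    return sum(1 for h in merged if h in sa and h in sb) / len(merged)
```

Identical sequences yield an estimate of 1.0 and disjoint k-mer sets yield 0.0; the estimate's accuracy in between grows with the sketch size s.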
http://dx.doi.org/10.1089/cmb.2018.0036
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6067103
July 2018

Kmerind: A Flexible Parallel Library for K-mer Indexing of Biological Sequences on Distributed Memory Systems.

IEEE/ACM Trans Comput Biol Bioinform 2019 Jul-Aug;16(4):1117-1131. Epub 2017 Oct 9.

Counting and indexing fixed length substrings, or k-mers, in biological sequences is a key step in many bioinformatics tasks including genome alignment and mapping, genome assembly, and error correction. While advances in next generation sequencing technologies have dramatically reduced the cost and improved latency and throughput, few bioinformatics tools can efficiently process the datasets at the current generation rate of 1.8 terabases per 3-day experiment from a single sequencer. We present Kmerind, a high performance parallel k-mer indexing library for distributed memory environments. The Kmerind library provides a set of simple and consistent APIs with sequential semantics and parallel implementations that are designed to be flexible and extensible. Kmerind's k-mer counter performs similarly or better than the best existing k-mer counting tools even on shared memory systems. In a distributed memory environment, Kmerind counts k-mers in a 120 GB sequence read dataset in less than 13 seconds on 1024 Xeon CPU cores, and fully indexes their positions in approximately 17 seconds. Querying for 1 percent of the k-mers in these indices can be completed in 0.23 seconds and 28 seconds, respectively. Kmerind is the first k-mer indexing library for distributed memory environments, and the first extensible library for general k-mer indexing and counting. Kmerind is available at https://github.com/ParBLiSS/kmerind.
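A sequential k-mer counter is only a few lines; what Kmerind adds is the distributed-memory machinery around it. The sketch below shows the counting step plus a hash-based partitioning function of the kind used to assign each k-mer to an owning process (names and details are illustrative, not Kmerind's API):

```python
from collections import Counter

def count_kmers(reads, k):
    """Count all length-k substrings across a set of reads."""
    counts = Counter()
    for r in reads:
        for i in range(len(r) - k + 1):
            counts[r[i:i + k]] += 1
    return counts

def owner_rank(kmer, nprocs):
    """In a distributed setting, every process routes each k-mer to the
    rank that owns it, so each rank holds a disjoint slice of the index."""
    return sum(kmer.encode()) % nprocs   # deterministic stand-in hash
```

Because ownership is a pure function of the k-mer, duplicate k-mers from different processes always meet at the same rank, where their counts are merged.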
http://dx.doi.org/10.1109/TCBB.2017.2760829
February 2020

A greedy alignment-free distance estimator for phylogenetic inference.

BMC Bioinformatics 2017 Jun 7;18(Suppl 8):238. Epub 2017 Jun 7.

Institute for Data Engineering and Science, Georgia Institute of Technology, Atlanta, GA 30332, USA.

Background: Alignment-free sequence comparison approaches have been garnering increasing interest in various data- and compute-intensive applications such as phylogenetic inference for large-scale sequences. While k-mer based methods are predominantly used in real applications, the average common substring (ACS) approach is emerging as one of the prominent alignment-free approaches. This ACS approach has been further generalized by some recent work, either greedily or exactly, by allowing a bounded number of mismatches in the common substrings.

Results: We present ALFRED-G, a greedy alignment-free distance estimator for phylogenetic tree reconstruction based on the concept of the generalized ACS approach. In this algorithm, we have investigated a new heuristic to efficiently compute the lengths of common strings with mismatches allowed, and have further applied this heuristic to phylogeny reconstruction. Performance evaluation using real sequence datasets shows that our heuristic is able to reconstruct comparable, or even more accurate, phylogenetic tree topologies than the kmacs heuristic algorithm at highly competitive speed.

Conclusions: ALFRED-G is an alignment-free heuristic for evolutionary distance estimation between two biological sequences. This algorithm is implemented in C++ and has been incorporated into our open-source ALFRED software package ( http://alurulab.cc.gatech.edu/phylo ).
http://dx.doi.org/10.1186/s12859-017-1658-0
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5471951
June 2017

Efficient detection of viral transmissions with Next-Generation Sequencing data.

BMC Genomics 2017 05 24;18(Suppl 4):372. Epub 2017 May 24.

Molecular Epidemiology and Bioinformatics, Division of Viral Hepatitis, Centers for Disease Control and Prevention, Atlanta, GA, USA.

Background: Hepatitis C is a major public health problem in the United States and worldwide. Outbreaks of hepatitis C virus (HCV) infections associated with unsafe injection practices, drug diversion, and other exposures to blood are difficult to detect and investigate. Molecular analysis has frequently been used in the study of HCV outbreaks and transmission chains, helping to identify a cluster of sequences as linked by transmission if their genetic distances fall below a previously defined threshold. However, HCV exists as a population of numerous variants in each infected individual, and it has been observed that minority variants in the source are often the ones responsible for transmission; this precludes the use of a single sequence per individual, because many such transmissions would be missed. The use of Next-Generation Sequencing immensely increases the sensitivity of transmission detection but brings a considerable computational challenge, because all sequences need to be compared across all pairs of samples.

Methods: We developed a three-step strategy that filters pairs of samples according to different criteria: (i) a k-mer Bloom filter, (ii) a Levenshtein filter and (iii) a filter of identical sequences. We applied these three filters to a set of samples that covers the spectrum of genetic relationships among HCV cases, from being part of the same transmission cluster to belonging to different subtypes.
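The first two filters can be sketched directly. In the sketch below a plain Python set stands in for the Bloom filter of step (i) (a real Bloom filter uses far less memory at the cost of false positives), and step (ii) is the classic dynamic-programming edit distance; parameter values are illustrative:

```python
def kmer_prefilter(a_seqs, b_seqs, k=6, min_shared=1):
    """Step (i) analogue: two samples sharing too few k-mers cannot
    contain a closely related sequence pair, so skip the expensive
    comparisons entirely."""
    ka = {s[i:i + k] for s in a_seqs for i in range(len(s) - k + 1)}
    kb = {s[i:i + k] for s in b_seqs for i in range(len(s) - k + 1)}
    return len(ka & kb) >= min_shared

def levenshtein(a, b):
    """Step (ii): edit distance via the standard two-row DP."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[-1] + 1,          # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]
```

Only sample pairs that pass the cheap k-mer check proceed to the quadratic-time Levenshtein comparisons, which is what makes the cascade fast.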

Results: Our three-step filtering strategy rapidly removes 85.1% of all the pairwise sample comparisons and 91.0% of all pairwise sequence comparisons, accurately establishing which pairs of HCV samples are below the relatedness threshold.

Conclusions: We present a fast and efficient three-step filtering strategy that removes most sequence comparisons and accurately establishes transmission links for any threshold-based method. This highly efficient workflow will allow faster response and molecular detection capacity, improving the rate of detection of viral transmissions with molecular data.
http://dx.doi.org/10.1186/s12864-017-3732-4
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5461558
May 2017

RD26 mediates crosstalk between drought and brassinosteroid signalling pathways.

Nat Commun 2017 02 24;8:14573. Epub 2017 Feb 24.

Department of Genetics, Development and Cell Biology, Iowa State University, Ames, Iowa 50011, USA.

Brassinosteroids (BRs) regulate plant growth and stress responses via the BES1/BZR1 family of transcription factors, which regulate the expression of thousands of downstream genes. BRs are involved in the response to drought; however, a mechanistic understanding of the interactions between BR signalling and drought response remains to be established. Here we show that the transcription factor RD26 mediates crosstalk between drought and BR signalling. When overexpressed, the BES1 target gene RD26 can inhibit BR-regulated growth. Global gene expression studies suggest that RD26 can act antagonistically to BR to regulate the expression of a subset of BES1-regulated genes, thereby inhibiting BR function. We show that RD26 can interact with the BES1 protein and antagonize BES1 transcriptional activity on BR-regulated genes, and that BR signalling can also repress expression of RD26 and its homologues and inhibit drought responses. Our results thus reveal a mechanism coordinating plant growth and drought tolerance.
http://dx.doi.org/10.1038/ncomms14573
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5333127
February 2017

Microarray Data Processing Techniques for Genome-Scale Network Inference from Large Public Repositories.

Microarrays (Basel) 2016 Sep 19;5(3). Epub 2016 Sep 19.

School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, GA 30332, USA.

Pre-processing of microarray data is a well-studied problem. Furthermore, all popular platforms come with their own recommended best practices for differential analysis of genes. However, for genome-scale network inference using microarray data collected from large public repositories, these methods filter out a considerable number of genes. This is primarily due to the effects of aggregating a diverse array of experiments with different technical and biological scenarios. Here we introduce a pre-processing pipeline suitable for inferring genome-scale gene networks from large microarray datasets. We show that partitioning of the available microarray datasets according to biological relevance into tissue- and process-specific categories significantly extends the limits of downstream network construction. We demonstrate the effectiveness of our pre-processing pipeline by inferring genome-scale networks for the model plant Arabidopsis thaliana using two different construction methods and a collection of 11,760 Affymetrix ATH1 microarray chips. Our pre-processing pipeline and the datasets used in this paper are made available at http://alurulab.cc.gatech.edu/microarray-pp.
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5040970
http://dx.doi.org/10.3390/microarrays5030023
September 2016

SNVSniffer: an integrated caller for germline and somatic single-nucleotide and indel mutations.

BMC Syst Biol 2016 08 1;10 Suppl 2:47. Epub 2016 Aug 1.

Institute of Computer Science, Johannes Gutenberg University Mainz, Mainz, 55128, Germany.

Background: Various approaches to calling single-nucleotide variants (SNVs) or insertion-or-deletion (indel) mutations have been developed based on next-generation sequencing (NGS). However, most of them are dedicated to a particular type of mutation, e.g. germline SNVs in normal cells, somatic SNVs in cancer/tumor cells, or indels only. In the literature, efficient and integrated callers for both germline and somatic SNVs/indels have not yet been extensively investigated.

Results: We present SNVSniffer, an efficient and integrated caller identifying both germline and somatic SNVs/indels from NGS data. In this algorithm, we propose the use of Bayesian probabilistic models to identify SNVs and investigate a multiple ungapped alignment approach to call indels. For germline variant calling, we model allele counts per site as following a multinomial conditional distribution. For somatic variant calling, we rely on tumor-normal sample pairs from the same individual and introduce a hybrid subtraction and joint sample analysis approach, modeling tumor-normal allele counts per site as following a joint multinomial conditional distribution. A comprehensive performance evaluation has been conducted using a diversity of variant calling benchmarks. For germline variant calling, SNVSniffer demonstrates highly competitive accuracy with superior speed in comparison with the state-of-the-art FaSD, GATK and SAMtools. For somatic variant calling, our algorithm achieves comparable or even better accuracy than the leading VarScan2, SomaticSniper, JointSNVMix2 and MuTect, at faster speed.
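As a toy version of the germline side, per-site allele counts can be scored under each diploid genotype with a multinomial-style likelihood: each read draws one of the two genotype alleles uniformly and is then miscalled with some error probability. Function names, the implicit flat genotype prior, and the error model details below are illustrative assumptions, not SNVSniffer's exact model:

```python
import math

GENOTYPES = ["AA", "CC", "GG", "TT", "AC", "AG", "AT", "CG", "CT", "GT"]

def genotype_loglik(counts, genotype, err=0.01):
    """Log-likelihood of the observed allele counts at one site,
    omitting the multinomial coefficient (constant across genotypes)."""
    ll = 0.0
    for base, n in counts.items():
        # each read picks one haplotype at random, then may be miscalled
        # uniformly to one of the other three bases
        p = sum(0.5 * ((1 - err) if h == base else err / 3)
                for h in genotype)
        ll += n * math.log(p)
    return ll

def call_genotype(counts):
    """Maximum-likelihood genotype under a flat prior."""
    return max(GENOTYPES, key=lambda g: genotype_loglik(counts, g))
```

A balanced pile of A and C reads is called heterozygous, while a pure pile of G reads is called homozygous, because the error term makes any other explanation far less likely.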

Conclusions: SNVSniffer demonstrates the feasibility of developing integrated solutions for fast and efficient identification of germline and somatic variants. Nonetheless, accurate discovery of genetic variations is critical yet challenging, and still requires substantial further research effort. SNVSniffer and synthetic samples are publicly available at http://snvsniffer.sourceforge.net.
http://dx.doi.org/10.1186/s12918-016-0300-5
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4977481
August 2016

ALFRED: A Practical Method for Alignment-Free Distance Computation.

J Comput Biol 2016 06 3;23(6):452-60. Epub 2016 May 3.

College of Computing, Georgia Institute of Technology, Atlanta, Georgia.

Alignment-free approaches are gaining persistent interest in many sequence analysis applications such as phylogenetic inference and metagenomic classification/clustering, especially for large-scale sequence datasets. Besides the widely used k-mer methods, the average common substring (ACS) approach has emerged as one of the well-known alignment-free approaches. Two recent works further generalize this ACS approach by allowing a bounded number k of mismatches in the common substrings, relying on approximation (linear time) and exact computation, respectively. Albeit having a good worst-case time complexity [Formula: see text], the exact approach is complex and unlikely to be efficient in practice. Herein, we present ALFRED, an alignment-free distance computation method that solves the generalized common substring search problem via exact computation. Compared to the theoretical approach, our algorithm is easier to implement and more practical to use, while still providing highly competitive theoretical performance, with an expected run-time of [Formula: see text]. Applying our program to phylogenetic inference as a case study, we find that it exactly reconstructs the topology of the reference phylogenetic tree for a set of 27 primate mitochondrial genomes, at reasonably acceptable speed. ALFRED is implemented in the C++ programming language and the source code is freely available online.
http://dx.doi.org/10.1089/cmb.2015.0217
June 2016

A Provably Efficient Algorithm for the k-Mismatch Average Common Substring Problem.

J Comput Biol 2016 06 8;23(6):472-82. Epub 2016 Apr 8.

College of Computing, Georgia Institute of Technology, Atlanta, Georgia.

Alignment-free sequence comparison methods are attracting persistent interest, driven by data-intensive applications in genome-wide molecular taxonomy and phylogenetic reconstruction. Among all the methods based on substring composition, the average common substring (ACS) measure admits a straightforward linear-time sequence comparison algorithm, while yielding impressive results in multiple applications. An important direction of this research is to extend the approach to permit a bounded edit/Hamming distance between substrings, so as to reflect the evolutionary process more accurately. To date, however, algorithms designed to incorporate k ≥ 1 mismatches have O(n^2) worst-case time complexity, where n is the total length of the input sequences. On the other hand, accounting for mismatches has been shown to lead to much improved classification, while heuristics can improve practical performance. In this article, we close the gap by presenting the first provably efficient algorithm for the k-mismatch average common substring (ACS_k) problem that takes O(n) space and O(n log^k n) time in the worst case for any constant k. Our method extends the generalized suffix tree model to incorporate a carefully selected bounded set of perturbed suffixes, and can be applied to other complex approximate sequence matching problems.
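A direct quadratic implementation of ACS_k is a useful reference point against which the paper's O(n log^k n) algorithm can be understood: for every pair of starting positions, extend the match until the mismatch budget k is exhausted. A naive sketch with variable names of our own:

```python
def acs_k(x, y, k):
    """Quadratic reference for the k-mismatch ACS: for each position i,
    the longest prefix of x[i:] matching some y[j:] with at most k
    mismatches, averaged over all i."""
    total = 0
    for i in range(len(x)):
        best = 0
        for j in range(len(y)):
            mm = l = 0
            while i + l < len(x) and j + l < len(y):
                if x[i + l] != y[j + l]:
                    mm += 1
                    if mm > k:       # mismatch budget exhausted
                        break
                l += 1
            best = max(best, l)
        total += best
    return total / len(x)
```

With k = 0 this reduces to the plain ACS value; with a large enough budget, even fully mismatching strings yield maximal match lengths.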
http://dx.doi.org/10.1089/cmb.2015.0235
June 2016

Discovering Motifs in Biological Sequences Using the Micron Automata Processor.

IEEE/ACM Trans Comput Biol Bioinform 2016 Jan-Feb;13(1):99-111

Finding approximately conserved sequences, called motifs, across multiple DNA or protein sequences is an important problem in computational biology. In this paper, we consider the (l, d) motif search problem of identifying one or more motifs of length l present in at least q of the n given sequences, with each occurrence differing from the motif in at most d substitutions. The problem is known to be NP-complete, and the largest solved instance reported to date is (26, 11). We propose a novel algorithm for the (l, d) motif search problem using streaming execution over a large set of non-deterministic finite automata (NFA). This solution is designed to take advantage of the Micron Automata Processor, a new technology close to deployment that can simultaneously execute multiple NFA in parallel. We demonstrate the capability to solve much larger instances of the (l, d) motif search problem using the resources available within a single automata processor board, by estimating run-times for the problem instances (39, 18) and (40, 17). The paper serves as a useful guide to solving problems using this new accelerator technology.
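The decision at the heart of the search — does a candidate motif occur, with at most d substitutions, in at least q of the sequences? — is easy to state in software, and it is exactly this kind of approximate-match check that the automata processor evaluates for huge numbers of candidates in parallel. A brute-force sketch of the check for a single candidate:

```python
def hamming(a, b):
    """Number of positions where two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def occurs(motif, seq, d):
    """True if some length-|motif| window of seq is within Hamming
    distance d of the motif."""
    l = len(motif)
    return any(hamming(motif, seq[i:i + l]) <= d
               for i in range(len(seq) - l + 1))

def is_motif(motif, seqs, d, q):
    """True if the motif occurs (with <= d substitutions) in at least
    q of the given sequences."""
    return sum(occurs(motif, s, d) for s in seqs) >= q
```

The hard part of (l, d) motif search is not this check but the astronomical number of candidate motifs it must be applied to, which is where massively parallel NFA execution pays off.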
http://dx.doi.org/10.1109/TCBB.2015.2430313
December 2016

In search of perfect reads.

BMC Bioinformatics 2015 7;16 Suppl 17:S7. Epub 2015 Dec 7.

Background: Continued advances in next generation short-read sequencing technologies are increasing throughput and read lengths, while driving down error rates. Taking advantage of the high coverage sampling used in many applications, several error correction algorithms have been developed to improve data quality further. However, correcting errors in high coverage sequence data requires significant computing resources.

Methods: We propose a different approach to handling erroneous sequence data. Presently, error rates of high-throughput platforms such as the Illumina HiSeq are within 1%. Moreover, the errors are not uniformly distributed across all reads, and a large percentage of reads are indeed error-free. The ability to predict such perfect reads can significantly impact the run-time complexity of applications. We present a simple and fast k-spectrum-analysis-based method to identify error-free reads. The filtration process to identify and weed out erroneous reads can be customized at several levels of stringency, depending upon the needs of the downstream application.
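The k-spectrum idea can be sketched in a few lines: count all k-mers in the dataset, call a k-mer trusted if it occurs at least some minimum number of times, and keep exactly the reads whose every k-mer is trusted. Parameter values and names here are illustrative; the paper's filter offers several stringency levels:

```python
from collections import Counter

def perfect_reads(reads, k, min_count=2):
    """Flag a read as likely error-free when every one of its k-mers is
    'trusted', i.e. seen at least min_count times across the dataset.
    A k-mer created by a sequencing error is usually rare, so a read
    containing any untrusted k-mer is suspect."""
    counts = Counter(r[i:i + k]
                     for r in reads for i in range(len(r) - k + 1))
    trusted = {m for m, c in counts.items() if c >= min_count}
    return [r for r in reads
            if all(r[i:i + k] in trusted
                   for i in range(len(r) - k + 1))]
```

With high coverage, correct k-mers recur across many reads while error k-mers stay rare, which is what makes this simple threshold surprisingly effective.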

Results: Our experiments show that if around 80% of the reads in a dataset are perfect, then our method retains almost 99.9% of them with a precision rate above 90%. Though filtering out reads identified as erroneous by our method reduces the average coverage by about 7%, we found that the remaining reads provide as uniform a coverage as the original dataset. We demonstrate the effectiveness of our approach on an example downstream application: when Reptile, an error correction algorithm that relies on collectively analyzing the reads in a dataset to identify and correct erroneous bases, instead uses the reads our method predicts to be perfect to correct the remaining reads, overall accuracy improves by up to 10%.

Conclusions: Thanks to continuous technological improvements, the coverage and accuracy of reads from dominant sequencing platforms have now reached an extent where we can envision simply filtering out reads with errors, thus making error correction less important. Our algorithm is a first attempt to propose and demonstrate this new paradigm. Moreover, because our demonstration is applicable to any error correction algorithm as a downstream application, it in turn yields a new class of error-correcting algorithms as a by-product.
Source
http://dx.doi.org/10.1186/1471-2105-16-S17-S7
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4674851
August 2016

Guest Editors’ Introduction: Selected Papers from ACM-BCB 2013.

IEEE/ACM Trans Comput Biol Bioinform 2015 Jan-Feb;12(1):2-3

Source
http://dx.doi.org/10.1109/tcbb.2015.2389551
June 2016

Parallel Mutual Information Based Construction of Genome-Scale Networks on the Intel® Xeon Phi™ Coprocessor.

IEEE/ACM Trans Comput Biol Bioinform 2015 Sep-Oct;12(5):1008-20

Construction of whole-genome networks from large-scale gene expression data is an important problem in systems biology. While several techniques have been developed, most cannot handle network reconstruction at the whole-genome scale, and the few that can, require large clusters. In this paper, we present a solution on the Intel Xeon Phi coprocessor, taking advantage of its multi-level parallelism including many x86-based cores, multiple threads per core, and vector processing units. We also present a solution on the Intel® Xeon® processor. Our solution is based on TINGe, a fast parallel network reconstruction technique that uses mutual information and permutation testing for assessing statistical significance. We demonstrate the first ever inference of a plant whole genome regulatory network on a single chip by constructing a 15,575 gene network of the plant Arabidopsis thaliana from 3,137 microarray experiments in only 22 minutes. In addition, our optimization for parallelizing mutual information computation on the Intel Xeon Phi coprocessor holds out lessons that are applicable to other domains.
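For readers unfamiliar with the underlying quantity, the pairwise statistic driving such network construction can be sketched with a simple histogram estimator. The paper's solution uses a B-spline estimator with heavy vectorization; the binned version below is only a stand-in to show what is computed for each gene pair:

```python
import numpy as np

def mutual_information(x, y, bins=8):
    """Binned (histogram) estimate of mutual information, in nats."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)   # marginal of x, shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)   # marginal of y, shape (1, bins)
    nz = pxy > 0                          # 0 * log 0 contributes nothing
    return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
mi_dependent = mutual_information(x, x + 0.1 * rng.normal(size=1000))
mi_independent = mutual_information(x, rng.normal(size=1000))
```

The estimator is a Kullback-Leibler divergence between the joint and product-of-marginals distributions, so strongly dependent profiles score far above independent ones.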
Source
http://dx.doi.org/10.1109/TCBB.2015.2415931
July 2016

Fast and accurate construction of ultra-dense consensus genetic maps using evolution strategy optimization.

PLoS One 2015 13;10(4):e0122485. Epub 2015 Apr 13.

Institute of Evolution, University of Haifa, Haifa, Israel.

Our aim was to develop a fast and accurate algorithm for constructing consensus genetic maps for chip-based SNP genotyping data with a high proportion of shared markers between mapping populations. Chip-based genotyping of SNP markers allows producing high-density genetic maps with a relatively standardized set of marker loci for different mapping populations. The availability of a standard high-throughput mapping platform simplifies consensus analysis by ignoring unique markers at the stage of consensus mapping, thereby reducing the mathematical complexity of the problem and in turn allowing larger mapping datasets to be analyzed using global, rather than local, optimization criteria. Our three-phase analytical scheme includes automatic selection of ~100-300 of the most informative (resolvable by recombination) markers per linkage group, building a stable skeletal marker order for each data set and verifying it using jackknife re-sampling, and consensus mapping analysis based on a global optimization criterion. The novel Evolution Strategy optimization algorithm with a global optimization criterion presented in this paper is able to generate high-quality, ultra-dense consensus maps with many thousands of markers per genome. The algorithm utilizes "potentially good orders" both in the initial solution and in the new mutation procedures that generate trial solutions, enabling it to obtain a consensus order in reasonable time. The developed algorithm, tested on a wide range of simulated data and real-world data (Arabidopsis), outperformed two state-of-the-art algorithms in mapping accuracy and computation time.
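The core of an evolution-strategy search for a marker order can be sketched in a few lines: mutate a candidate order (here by reversing a random segment) and keep the child whenever it is no worse under a global criterion (here, total adjacent distance). The paper's criterion, mutation procedures, and "potentially good orders" heuristics are considerably more elaborate; this is only a minimal (1+1)-ES illustration:

```python
import random

def order_cost(order, dist):
    """Sum of distances between adjacent markers in the candidate order."""
    return sum(dist[a][b] for a, b in zip(order, order[1:]))

def es_order(dist, iters=5000, seed=1):
    """(1+1)-ES sketch: mutate by reversing a random segment of the
    order; keep the child only if it is no worse than the parent."""
    rng = random.Random(seed)
    n = len(dist)
    best = list(range(n))
    rng.shuffle(best)
    best_cost = order_cost(best, dist)
    for _ in range(iters):
        i, j = sorted(rng.sample(range(n), 2))
        child = best[:i] + best[i:j + 1][::-1] + best[j + 1:]
        child_cost = order_cost(child, dist)
        if child_cost <= best_cost:
            best, best_cost = child, child_cost
    return best, best_cost

# Six markers truly ordered 0..5; pairwise distance = index gap.
dist = [[abs(i - j) for j in range(6)] for i in range(6)]
order, cost = es_order(dist)
```

Accepting equal-cost children lets the search walk across plateaus, a standard ES design choice.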
Source
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0122485
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4395089
March 2016

Large-scale metagenomic sequence clustering on map-reduce clusters.

J Bioinform Comput Biol 2013 Feb 25;11(1):1340001. Epub 2012 Dec 25.

Broad Institute of Harvard and MIT, Cambridge, Massachusetts 02142, USA.

Taxonomic clustering of species from millions of DNA fragments sequenced from their genomes is an important and frequently arising problem in metagenomics. In this paper, we present a parallel algorithm for taxonomic clustering of large metagenomic samples with support for overlapping clusters. We develop sketching techniques, akin to those created for web document clustering, to deduce significant similarities between pairs of sequences without resorting to expensive all-vs-all comparison. We formulate the metagenomic classification problem as that of maximal quasi-clique enumeration in the resulting similarity graph, at multiple levels of the hierarchy as prescribed by different similarity thresholds. We cast the execution of the underlying algorithmic steps as applications of the map-reduce framework to achieve a cloud-ready implementation. We show that the resulting framework can produce high-quality clusterings of metagenomic samples consisting of millions of reads in reasonable time when executed on a modest-size cluster.
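The sketching step can be illustrated with a minimal MinHash: for each of several salted hash functions, keep the minimum hash over a sequence's k-mers; the fraction of matching minima between two sketches estimates their k-mer Jaccard similarity without all-vs-all comparison. The parameters and helper names below are illustrative assumptions, not the paper's:

```python
import hashlib

def kmers(seq, k=8):
    """Set of k-mers occurring in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def minhash_sketch(seq, k=8, num_hashes=32):
    """For each salted hash function, keep the minimum hash value over
    the sequence's k-mers."""
    sketch = []
    for h in range(num_hashes):
        salt = str(h).encode()
        sketch.append(min(hashlib.md5(salt + km.encode()).hexdigest()
                          for km in kmers(seq, k)))
    return sketch

def estimate_similarity(s1, s2):
    """Fraction of matching minima estimates k-mer Jaccard similarity."""
    return sum(a == b for a, b in zip(s1, s2)) / len(s1)

a = "ACGTACGTGGCCTTAAACGTACGTGG"
b = a[:-1] + "C"                      # near-identical: one substitution
c = "TTGGCCAATTGGCCAATTGGCCAATT"      # unrelated k-mer content
sa, sb, sc = (minhash_sketch(s) for s in (a, b, c))
```

Similar sequences then share sketch entries, so candidate pairs for the similarity graph can be found by grouping on those entries instead of comparing every pair.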
Source
http://dx.doi.org/10.1142/S0219720013400015
February 2013

Reverse engineering and analysis of large genome-scale gene networks.

Nucleic Acids Res 2013 Jan 4;41(1):e24. Epub 2012 Oct 4.

Department of Genetics, Iowa State University, Ames, IA 50011, USA.

Reverse engineering the whole-genome networks of complex multicellular organisms remains a challenge. While simpler models easily scale to large numbers of genes and gene expression datasets, more accurate models are compute-intensive, limiting their scale of applicability. To enable fast and accurate reconstruction of large networks, we developed the Tool for Inferring Network of Genes (TINGe), a parallel mutual information (MI)-based program. The novel features of our approach include: (i) a B-spline-based formulation for linear-time computation of MI, (ii) a novel algorithm for direct permutation testing and (iii) parallel algorithms to reduce run-time and facilitate construction of large networks. We assess the quality of our method by comparison with ARACNe (Algorithm for the Reconstruction of Accurate Cellular Networks) and GeneNet, and demonstrate its unique capability by reverse engineering the whole-genome network of Arabidopsis thaliana from 3137 Affymetrix ATH1 GeneChips in just 9 min on a 1024-core cluster. We further report on the development of new software, Gene Network Analyzer (GeNA), for extracting context-specific subnetworks from a given set of seed genes. Using TINGe and GeNA, we analyzed 241 Arabidopsis AraCyc 8.0 pathways, and the results are made available through the web.
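The permutation-testing idea can be sketched as follows: shuffle one expression profile to destroy any dependence, recompute the pairwise statistic, and use the rank of the observed value among the shuffled ones as a p-value. For brevity the sketch scores pairs with absolute correlation rather than the B-spline MI estimator TINGe actually uses, and the sample data are invented:

```python
import numpy as np

def permutation_pvalue(x, y, stat, n_perm=200, seed=0):
    """Shuffle y to break any dependence with x; the p-value is the
    fraction of shuffles whose statistic reaches the observed one
    (with the +1 correction so p is never exactly zero)."""
    rng = np.random.default_rng(seed)
    observed = stat(x, y)
    null = [stat(x, rng.permutation(y)) for _ in range(n_perm)]
    return (1 + sum(s >= observed for s in null)) / (1 + n_perm)

def abs_corr(x, y):
    """Stand-in pairwise statistic (TINGe would use a B-spline MI)."""
    return abs(np.corrcoef(x, y)[0, 1])

rng = np.random.default_rng(1)
x = rng.normal(size=300)
p_dependent = permutation_pvalue(x, 2 * x + 0.1 * rng.normal(size=300), abs_corr)
p_independent = permutation_pvalue(x, rng.normal(size=300), abs_corr)
```

Edges are kept only when the observed statistic is significant against this permutation null, which is what separates genuine dependence from estimator noise.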
Source
http://dx.doi.org/10.1093/nar/gks904
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3592423
January 2013

A survey of error-correction methods for next-generation sequencing.

Brief Bioinform 2013 Jan 6;14(1):56-66. Epub 2012 Apr 6.

The Broad institute, 7 Cambridge Center, Cambridge, MA 02142, USA.

Unlabelled: Error correction is important for most next-generation sequencing applications because highly accurate sequenced reads will likely lead to higher-quality results. Many techniques for error correction of sequencing data from next-gen platforms have been developed in recent years. However, compared with the fast development of sequencing technologies, there is a lack of standardized evaluation procedures for different error-correction methods, making it difficult to assess their relative merits and demerits. In this article, we provide a comprehensive review of many error-correction methods, and establish a common set of benchmark data and evaluation criteria to provide a comparative assessment. We present experimental results on quality, run-time, memory usage and scalability of several error-correction methods. Apart from providing explicit recommendations useful to practitioners, the review serves to identify the current state of the art and promising directions for future research.

Availability: All error-correction programs used in this article are downloaded from hosting websites. The evaluation tool kit is publicly available at: http://aluru-sun.ece.iastate.edu/doku.php?id=ecr.
Source
http://dx.doi.org/10.1093/bib/bbs015
January 2013

Repeat-aware modeling and correction of short read errors.

BMC Bioinformatics 2011 Feb 15;12 Suppl 1:S52. Epub 2011 Feb 15.

Department of Electrical and Computer Engineering, Iowa State University, Ames, Iowa 50011, USA.

Background: High-throughput short read sequencing is revolutionizing genomics and systems biology research by enabling cost-effective deep coverage sequencing of genomes and transcriptomes. Error detection and correction are crucial to many short read sequencing applications including de novo genome sequencing, genome resequencing, and digital gene expression analysis. Short read error detection is typically carried out by counting the observed frequencies of kmers in reads and validating those with frequencies exceeding a threshold. In the case of genomes with high repeat content, an erroneous kmer may be frequently observed if it has few nucleotide differences from valid kmers with multiple occurrences in the genome. Error detection and correction have mostly been applied to genomes with low repeat content, and this remains a challenging problem for genomes with high repeat content.

Results: We develop a statistical model and a computational method for error detection and correction in the presence of genomic repeats. We propose a method to infer genomic frequencies of kmers from their observed frequencies by analyzing the misread relationships among observed kmers. We also propose a method to estimate the threshold for validating kmers, i.e., a kmer is validated if its estimated genomic frequency exceeds the threshold. We demonstrate that superior error detection is achieved using these methods. Furthermore, we break away from the common assumption of uniformly distributed errors within a read, and provide a framework to model the position-dependent error occurrence frequencies common to many short read platforms. Lastly, we achieve better error correction in genomes with high repeat content.
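A one-step sketch of the misread-relationship idea: a kmer's observed count includes mass leaked from errors in its Hamming-distance-1 neighbors, so subtracting that expected leakage before thresholding can demote erroneous kmers that sit next to high-copy repeats. The error rate and counts below are toy values, and the paper's actual method infers genomic frequencies with a statistical model rather than in one pass:

```python
from collections import Counter

def hamming1_neighbors(kmer, alphabet="ACGT"):
    """All k-mers at Hamming distance 1 from `kmer`."""
    for i, base in enumerate(kmer):
        for b in alphabet:
            if b != base:
                yield kmer[:i] + b + kmer[i + 1:]

def corrected_counts(observed, per_kmer_error=0.01):
    """Subtract the misread mass each k-mer is expected to receive from
    errors in its Hamming-1 neighbors (one pass; the paper's model
    is iterative and position-aware)."""
    leak = per_kmer_error / 3          # a substitution yields one of 3 bases
    adjusted = {}
    for km, n in observed.items():
        spill = sum(observed.get(nb, 0) * leak
                    for nb in hamming1_neighbors(km))
        adjusted[km] = max(n - spill, 0.0)
    return adjusted

# "AAAA" is a high-copy repeat k-mer; its erroneous neighbor "AAAT"
# collects enough misreads to pass a naive count cutoff of 3.
observed = Counter({"AAAA": 500, "AAAT": 4, "CGCG": 48})
adjusted = corrected_counts(observed)
```

After the adjustment, the repeat-adjacent erroneous kmer falls below the cutoff while genuine kmers are essentially unchanged.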

Availability: The software is implemented in C++ and is freely available under GNU GPL3 license and Boost Software V1.0 license at "http://aluru-sun.ece.iastate.edu/doku.php?id=redeem".

Conclusions: We introduce a statistical framework to model sequencing errors in next-generation reads, which led to promising results in detecting and correcting errors for genomes with high repeat content.
Source
http://dx.doi.org/10.1186/1471-2105-12-S1-S52
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3044310
February 2011

A brassinosteroid transcriptional network revealed by genome-wide identification of BES1 target genes in Arabidopsis thaliana.

Plant J 2011 Feb 10;65(4):634-46. Epub 2011 Jan 10.

Department of Genetics, Development and Cell Biology, Iowa State University, Ames, IA 50011, USA.

Brassinosteroids (BRs) are important regulators of plant growth and development. BRs signal to control the activities of the BES1 and BZR1 family transcription factors. The transcriptional network through which BES1 and BZR1 regulate a large number of target genes is mostly unknown. By combining chromatin immunoprecipitation coupled with Arabidopsis tiling arrays (ChIP-chip) and gene expression studies, we have identified 1609 putative BES1 target genes, 404 of which are regulated by BRs and/or in the gain-of-function bes1-D mutant. BES1 targets contribute to BR responses and interactions with other hormonal or light signaling pathways. Computational modeling of gene expression data using the Algorithm for the Reconstruction of Accurate Cellular Networks (ARACNe) reveals that BES1-targeted transcription factors form a gene regulatory network (GRN). Mutants of many genes in the network displayed defects in BR responses. Moreover, we found that BES1 functions to inhibit chloroplast development by repressing the expression of the GLK1 and GLK2 transcription factors, confirming a hypothesis generated from the GRN. Our results thus provide a global view of BR-regulated gene expression and a GRN that guides future studies in understanding BR-regulated plant growth.
Source
http://dx.doi.org/10.1111/j.1365-313X.2010.04449.x
February 2011

Reptile: representative tiling for short read error correction.

Bioinformatics 2010 Oct 16;26(20):2526-33. Epub 2010 Aug 16.

Department of Electrical and Computer Engineering, Iowa State University, Ames IA 50011, USA.

Motivation: Error correction is critical to the success of next-generation sequencing applications, such as resequencing and de novo genome sequencing. It is especially important for high-throughput short-read sequencing, where reads are much shorter and more abundant, and errors more frequent than in traditional Sanger sequencing. Processing massive numbers of short reads with existing error correction methods is both compute and memory intensive, yet the results are far from satisfactory when applied to real datasets.

Results: We present a novel approach, termed Reptile, for error correction in short-read data from next-generation sequencing. Reptile works with the spectrum of k-mers from the input reads, and corrects errors by simultaneously examining: (i) Hamming distance-based correction possibilities for potentially erroneous k-mers; and (ii) neighboring k-mers from the same read for correct contextual information. By not needing to store the input data, Reptile has the favorable property that it can handle data that does not fit in main memory. In addition to sequence data, Reptile can make use of available quality score information. Our experiments show that Reptile outperforms previous methods in the percentage of errors removed from the data and the accuracy of true base assignment. In addition, a significant reduction in run time and memory usage has been achieved compared with previous methods, making Reptile more practical for short-read error correction when sampling larger genomes.
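A minimal k-spectrum corrector in the spirit of the Hamming-distance idea above can be sketched as follows: for each untrusted k-mer in a read, try single-base substitutions and adopt one that yields a trusted k-mer. Reptile itself additionally exploits neighboring k-mers and quality scores, so this is a simplified stand-in with invented parameter values:

```python
from collections import Counter

def correct_read(read, counts, k, threshold):
    """Greedy sketch: for each untrusted k-mer in the read, try single-base
    substitutions and adopt one that yields a trusted k-mer."""
    read = list(read)
    for i in range(len(read) - k + 1):
        km = "".join(read[i:i + k])
        for j in range(k):
            if counts[km] >= threshold:
                break                   # k-mer is trusted; move on
            for b in "ACGT":
                if b == km[j]:
                    continue
                cand = km[:j] + b + km[j + 1:]
                if counts[cand] >= threshold:
                    read[i + j] = b     # commit the substitution
                    km = cand
                    break
    return "".join(read)

reads = ["ACGTACGTAC"] * 9 + ["ACGTACCTAC"]   # one read with a G->C error
counts = Counter(r[i:i + 4] for r in reads for i in range(len(r) - 3))
fixed = correct_read("ACGTACCTAC", counts, k=4, threshold=5)
```

Because only the k-mer counts are kept, not the reads themselves, the memory footprint is independent of dataset size, which mirrors the property the abstract highlights.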

Availability: Reptile is implemented in C++ and is available through the link: http://aluru-sun.ece.iastate.edu/doku.php?id=software

Contact: aluru@iastate.edu.
Source
http://dx.doi.org/10.1093/bioinformatics/btq468
October 2010

The B73 maize genome: complexity, diversity, and dynamics.

Authors:
Patrick S Schnable Doreen Ware Robert S Fulton Joshua C Stein Fusheng Wei Shiran Pasternak Chengzhi Liang Jianwei Zhang Lucinda Fulton Tina A Graves Patrick Minx Amy Denise Reily Laura Courtney Scott S Kruchowski Chad Tomlinson Cindy Strong Kim Delehaunty Catrina Fronick Bill Courtney Susan M Rock Eddie Belter Feiyu Du Kyung Kim Rachel M Abbott Marc Cotton Andy Levy Pamela Marchetto Kerri Ochoa Stephanie M Jackson Barbara Gillam Weizu Chen Le Yan Jamey Higginbotham Marco Cardenas Jason Waligorski Elizabeth Applebaum Lindsey Phelps Jason Falcone Krishna Kanchi Thynn Thane Adam Scimone Nay Thane Jessica Henke Tom Wang Jessica Ruppert Neha Shah Kelsi Rotter Jennifer Hodges Elizabeth Ingenthron Matt Cordes Sara Kohlberg Jennifer Sgro Brandon Delgado Kelly Mead Asif Chinwalla Shawn Leonard Kevin Crouse Kristi Collura Dave Kudrna Jennifer Currie Ruifeng He Angelina Angelova Shanmugam Rajasekar Teri Mueller Rene Lomeli Gabriel Scara Ara Ko Krista Delaney Marina Wissotski Georgina Lopez David Campos Michele Braidotti Elizabeth Ashley Wolfgang Golser HyeRan Kim Seunghee Lee Jinke Lin Zeljko Dujmic Woojin Kim Jayson Talag Andrea Zuccolo Chuanzhu Fan Aswathy Sebastian Melissa Kramer Lori Spiegel Lidia Nascimento Theresa Zutavern Beth Miller Claude Ambroise Stephanie Muller Will Spooner Apurva Narechania Liya Ren Sharon Wei Sunita Kumari Ben Faga Michael J Levy Linda McMahan Peter Van Buren Matthew W Vaughn Kai Ying Cheng-Ting Yeh Scott J Emrich Yi Jia Ananth Kalyanaraman An-Ping Hsia W Brad Barbazuk Regina S Baucom Thomas P Brutnell Nicholas C Carpita Cristian Chaparro Jer-Ming Chia Jean-Marc Deragon James C Estill Yan Fu Jeffrey A Jeddeloh Yujun Han Hyeran Lee Pinghua Li Damon R Lisch Sanzhen Liu Zhijie Liu Dawn Holligan Nagel Maureen C McCann Phillip SanMiguel Alan M Myers Dan Nettleton John Nguyen Bryan W Penning Lalit Ponnala Kevin L Schneider David C Schwartz Anupma Sharma Carol Soderlund Nathan M Springer Qi Sun Hao Wang Michael Waterman Richard Westerman Thomas K Wolfgruber Lixing Yang Yeisoo Yu Lifang Zhang Shiguo Zhou Qihui Zhu Jeffrey L Bennetzen R Kelly Dawe Jiming Jiang Ning Jiang Gernot G Presting Susan R Wessler Srinivas Aluru Robert A Martienssen Sandra W Clifton W Richard McCombie Rod A Wing Richard K Wilson

Science 2009 Nov;326(5956):1112-5

Center for Plant Genomics, Iowa State University, Ames, IA 50011, USA.

We report an improved draft nucleotide sequence of the 2.3-gigabase genome of maize, an important crop plant and model for biological research. Over 32,000 genes were predicted, of which 99.8% were placed on reference chromosomes. Nearly 85% of the genome is composed of hundreds of families of transposable elements, dispersed nonuniformly across the genome. These were responsible for the capture and amplification of numerous gene fragments and affect the composition, sizes, and positions of centromeres. We also report on the correlation of methylation-poor regions with Mu transposon insertions and recombination, and copy number variants with insertions and/or deletions, as well as how uneven gene losses between duplicated regions were involved in returning an ancient allotetraploid to a genetically diploid state. These analyses inform and set the stage for further investigations to improve our understanding of the domestication and agricultural improvements of maize.
Source
http://dx.doi.org/10.1126/science.1178534
November 2009

Detailed analysis of a contiguous 22-Mb region of the maize genome.

PLoS Genet 2009 Nov 20;5(11):e1000728. Epub 2009 Nov 20.

Arizona Genomics Institute, School of Plant Sciences and Department of Ecology, BIO5 Institute for Collaborative Research, University of Arizona, Tucson, Arizona, United States of America.

Most of our understanding of plant genome structure and evolution has come from the careful annotation of small (e.g., 100 kb) sequenced genomic regions or from automated annotation of complete genome sequences. Here, we sequenced and carefully annotated a contiguous 22 Mb region of maize chromosome 4 using an improved pseudomolecule for annotation. The sequence segment was comprehensively ordered, oriented, and confirmed using the maize optical map. Nearly 84% of the sequence is composed of transposable elements (TEs) that are mostly nested within each other, of which most families are low-copy. We identified 544 gene models using multiple levels of evidence, as well as five miRNA genes. Gene fragments, many captured by TEs, are prevalent within this region. Elimination of gene redundancy from a tetraploid maize ancestor that originated a few million years ago is responsible in this region for most disruptions of synteny with sorghum and rice. Consistent with other sub-genomic analyses in maize, small RNA mapping showed that many small RNAs match TEs and that most TEs match small RNAs. These results, performed on approximately 1% of the maize genome, demonstrate the feasibility of refining the B73 RefGen_v1 genome assembly by incorporating optical map, high-resolution genetic map, and comparative genomic data sets. Such improvements, along with those of gene and repeat annotation, will serve to promote future functional genomic and phylogenomic research in maize and other grasses.
Source
http://dx.doi.org/10.1371/journal.pgen.1000728
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2773423
November 2009

Guest Editors' Introduction, Special Issue on High-Performance Computational Biology.

Parallel Comput 2008 Nov;34(11):613-615

College of Computing, Georgia Institute of Technology, Atlanta, GA 30332.

Source
http://dx.doi.org/10.1016/j.parco.2008.10.001
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2621331
November 2008

Parallel short sequence assembly of transcriptomes.

BMC Bioinformatics 2009 Jan 30;10 Suppl 1:S14. Epub 2009 Jan 30.

Department of Electrical and Computer Engineering, Iowa State University, Ames, IA 50011, USA.

Background: The de novo assembly of genomes and transcriptomes from short sequences is a challenging problem. Because of the high coverage needed to assemble short sequences as well as the overhead of modeling the assembly problem as a graph problem, the methods for short sequence assembly are often validated using data from BACs or small sized prokaryotic genomes.

Results: We present a parallel method for transcriptome assembly from large short sequence data sets. Our solution uses a rigorous graph theoretic framework and tames the computational and space complexity using parallel computers. First, we construct a distributed bidirected graph that captures overlap information. Next, we compact all chains in this graph to determine long unique contigs using undirected parallel list ranking, a problem for which we present an algorithm. Finally, we process this compacted distributed graph to resolve unique regions that are separated by repeats, exploiting the naturally occurring coverage variations arising from differential expression.
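The chain-compaction step can be illustrated serially: any node with exactly one predecessor and one successor lies inside an unambiguous chain and can be merged away, leaving one compacted path per branch-free stretch. The sketch below works on a plain directed graph in memory, whereas the paper performs this on a distributed bidirected graph via parallel list ranking:

```python
from collections import defaultdict

def compact_chains(edges):
    """Compact maximal unambiguous chains: every node with exactly one
    predecessor and one successor is interior to a chain and gets merged
    into it. (Cycles made entirely of interior nodes are not handled
    in this sketch.)"""
    succ, pred = defaultdict(list), defaultdict(list)
    for u, v in edges:
        succ[u].append(v)
        pred[v].append(u)
    nodes = set(succ) | set(pred)
    interior = {n for n in nodes
                if len(succ[n]) == 1 and len(pred[n]) == 1}
    chains = []
    for n in nodes - interior:          # chains start at non-interior nodes
        for nxt in succ[n]:
            chain = [n]
            while nxt in interior:      # walk through the branch-free stretch
                chain.append(nxt)
                nxt = succ[nxt][0]
            chain.append(nxt)
            chains.append(chain)
    return chains

# A -> B -> C -> D with a branch D -> E, D -> F
chains = compact_chains([("A", "B"), ("B", "C"), ("C", "D"),
                         ("D", "E"), ("D", "F")])
```

Each returned chain corresponds to a candidate contig; branch nodes (here D) terminate chains because the path through them is ambiguous.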

Conclusion: We demonstrate the validity of our method using a synthetic high coverage data set generated from the predicted coding regions of Zea mays. We assemble 925 million sequences consisting of 40 billion nucleotides in a few minutes on a 1024 processor Blue Gene/L. Our method is the first fully distributed method for assembling a non-hierarchical short sequence data set and can scale to large problem sizes.
Source
http://dx.doi.org/10.1186/1471-2105-10-S1-S14
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2648799
January 2009