Publications by authors named "Martin C Herbordt"

10 Publications

Real-time data analysis for medical diagnosis using FPGA-accelerated neural networks.

BMC Bioinformatics 2018 Dec 21;19(Suppl 18):490. Epub 2018 Dec 21.

Computer Architecture and Automated Design Lab, Boston University, Boston, MA, USA.

Background: Real-time analysis of patient data during medical procedures can provide vital diagnostic feedback that significantly improves chances of success. With sensors becoming increasingly fast, frameworks such as Deep Neural Networks are required to perform calculations within the strict timing constraints for real-time operation. However, traditional computing platforms responsible for running these algorithms incur a large overhead due to communication protocols, memory accesses, and static (often generic) architectures. In this work, we implement a low-latency Multi-Layer Perceptron (MLP) processor using Field Programmable Gate Arrays (FPGAs). Unlike CPUs and Graphics Processing Units (GPUs), our FPGA-based design can directly interface sensors, storage devices, display devices and even actuators, thus reducing the delays of data movement between ports and compute pipelines. Moreover, the compute pipelines themselves are tailored specifically to the application, improving resource utilization and reducing idle cycles. We demonstrate the effectiveness of our approach using mass-spectrometry data sets for real-time cancer detection.

Results: We demonstrate that correct parameter sizing, based on the application, can reduce latency by 20% on average. Furthermore, we show that in an application with tightly coupled data-path and latency constraints, having a large amount of computing resources can actually reduce performance. Using mass-spectrometry benchmarks, we show that our proposed FPGA design outperforms both CPU and GPU implementations, with an average speedup of 144x and 21x, respectively.

Conclusion: In our work, we demonstrate the importance of application-specific optimizations in order to minimize latency and maximize resource utilization for MLP inference. By directly interfacing and processing sensor data with ultra-low latency, FPGAs can perform real-time analysis during procedures and provide diagnostic feedback that can be critical to achieving higher percentages of successful patient outcomes.
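
As a point of reference for the MLP inference discussed in this abstract, the computation reduces to a chain of matrix-vector products with a nonlinearity between layers. The sketch below is illustrative only: the function names, the ReLU activation, and the toy weights are assumptions and are not taken from the paper, whose FPGA pipeline is not reproduced here.

```python
def dense(weights, bias, x):
    """One fully connected layer: y = W x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(v):
    """Element-wise rectifier (an assumed activation, chosen for simplicity)."""
    return [max(0.0, x) for x in v]


def mlp_forward(layers, x):
    """Run inference through a list of (weights, bias) layers.

    The activation is applied after every layer except the last one.
    """
    for idx, (weights, bias) in enumerate(layers):
        x = dense(weights, bias, x)
        if idx < len(layers) - 1:
            x = relu(x)
    return x


# Hypothetical 2-2-1 network with hand-picked weights.
toy_layers = [
    ([[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0]),
    ([[1.0, 1.0]], [0.5]),
]
toy_output = mlp_forward(toy_layers, [2.0, 1.0])
```

Each layer depends on the output of the previous one, so the layers form a serial chain. This is one reason latency, rather than throughput, is the quantity of interest for a single real-time inference.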

Source
DOI: http://dx.doi.org/10.1186/s12859-018-2505-7
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6302367
December 2018

GPU Optimizations for a Production Molecular Docking Code.

IEEE Conf High Perform Extreme Comput 2014 Sep;2014

Modeling molecular docking is critical to both understanding life processes and designing new drugs. In previous work we created the first published GPU-accelerated docking code (PIPER), which achieved a roughly 5× speed-up over a contemporaneous 4-core CPU. Advances in GPU architecture and in the CPU code, however, have since reduced this relative performance by a factor of 10. In this paper we describe the upgrade of GPU PIPER. This required an entire rewrite, including algorithm changes and moving most remaining non-accelerated CPU code onto the GPU. The result is a 7× improvement in GPU performance and a 3.3× speedup over the CPU-only code. We find that this difference in time is almost entirely due to the difference in run times of the 3D FFT library functions on CPU (MKL) and GPU (cuFFT), respectively. The GPU code has been integrated into the ClusPro docking server which has over 4000 active users.

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4652941
DOI: http://dx.doi.org/10.1109/HPEC.2014.7040981
September 2014

3D FFTs on a Single FPGA.

Proc IEEE Int Symp Field Program Cust Comput Mach 2014 May;2014:68-71

The 3D FFT is critical in many physical simulations and image processing applications. On FPGAs, however, the 3D FFT was thought to be inefficient relative to other methods such as convolution-based implementations of multi-grid. We find the opposite: a simple design, operating at a conservative frequency, takes 4 μs for 16³, 21 μs for 32³, and 215 μs for 64³ single precision data points. The first two of these compare favorably with the 25 μs and 29 μs obtained running on a current Nvidia GPU. Some broader significance is that this is a critical piece in implementing a large scale FPGA-based MD engine: even a single FPGA is capable of keeping the FFT off of the critical path for a large fraction of possible MD simulations.
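
For readers unfamiliar with the operation, a 3D FFT is commonly computed as three passes of 1D FFTs, one along each axis. The sketch below shows that standard row-column decomposition in plain Python. It is not the FPGA design described in the abstract, and the function names and the nested-list data layout are assumptions made for illustration.

```python
import cmath


def fft1d(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft1d(x[0::2])
    odd = fft1d(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def fft3d(a):
    """3D FFT of a cubic nested list a[x][y][z], one axis at a time."""
    n = len(a)
    # Pass 1 (z axis): each innermost row is contiguous, so transform it directly.
    a = [[fft1d(a[x][y]) for y in range(n)] for x in range(n)]
    # Pass 2 (y axis): gather a strided column, transform it, scatter it back.
    for x in range(n):
        for z in range(n):
            col = fft1d([a[x][y][z] for y in range(n)])
            for y in range(n):
                a[x][y][z] = col[y]
    # Pass 3 (x axis): same gather, transform, scatter pattern.
    for y in range(n):
        for z in range(n):
            col = fft1d([a[x][y][z] for x in range(n)])
            for x in range(n):
                a[x][y][z] = col[x]
    return a
```

Passes 2 and 3 access memory with a stride, which is the part of the computation that tends to stress the memory system on any platform.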

Source
DOI: http://dx.doi.org/10.1109/FCCM.2014.28
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4652940
May 2014

Parallel Discrete Molecular Dynamics Simulation With Speculation and In-Order Commitment.

J Comput Phys 2011 Jul;230(17):6563-6582

Computer Architecture and Automated Design Laboratory, Department of Electrical and Computer Engineering, Boston University; Boston, MA 02215, www.bu.edu/caadlab.

Discrete molecular dynamics simulation (DMD) uses simplified and discretized models enabling simulations to advance by event rather than by timestep. DMD is an instance of discrete event simulation and so is difficult to scale: even in this multi-core era, all reported DMD codes are serial. In this paper we discuss the inherent difficulties of scaling DMD and present our method of parallelizing DMD through event-based decomposition. Our method is microarchitecture inspired: speculative processing of events exposes parallelism, while in-order commitment ensures correctness. We analyze the potential of this parallelization method for shared-memory multiprocessors. Achieving scalability required extensive experimentation with scheduling and synchronization methods to mitigate serialization. The speed-up achieved for a variety of system sizes and complexities is nearly 6× on an 8-core and over 9× on a 12-core processor. We present and verify analytical models that account for the achieved performance as a function of available concurrency and architectural limitations.
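
The idea of speculative processing with in-order commitment can be illustrated with a toy event loop. The sketch below is not the authors' DMD code and contains no physics: each hypothetical event `(time, src, dst)` simply sets `state[dst] = state[src] + 1`. Events in a window are evaluated against a shared snapshot, as if in parallel, and are then committed in timestamp order. An event whose input was overwritten by an earlier commit in the same window is squashed and re-executed. All names and the window size are assumptions.

```python
def run_serial(state, events):
    """Reference behaviour: process events strictly in timestamp order."""
    state = dict(state)
    for _, src, dst in sorted(events):
        state[dst] = state[src] + 1
    return state


def run_speculative(state, events, window=4):
    """Speculate on a window of events, then commit them in order."""
    state = dict(state)
    ordered = sorted(events)
    squashed = 0
    for start in range(0, len(ordered), window):
        batch = ordered[start:start + window]
        # Speculation phase: every event reads the same snapshot, so these
        # evaluations are independent and could run concurrently.
        speculative = [state[src] + 1 for _, src, _ in batch]
        # Commit phase: strictly in timestamp order.
        written = set()
        for (_, src, dst), value in zip(batch, speculative):
            if src in written:
                # An earlier commit changed this event's input: squash the
                # speculative result and re-execute against current state.
                value = state[src] + 1
                squashed += 1
            state[dst] = value
            written.add(dst)
    return state, squashed


initial = {"a": 0, "b": 0, "c": 0, "d": 0}
chain = [(1.0, "a", "b"), (2.0, "b", "c"), (3.0, "c", "d"), (4.0, "a", "a")]
final_state, squash_count = run_speculative(initial, chain)
```

The toy omits what makes real DMD hard, such as events that cancel or create other events. It shows only why in-order commitment preserves the serial result while dependent events limit the usable parallelism.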

Source
DOI: http://dx.doi.org/10.1016/j.jcp.2011.05.001
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3148765
July 2011

Molecular Dynamics Simulations on High-Performance Reconfigurable Computing Systems.

ACM Trans Reconfigurable Technol Syst 2010 Nov;3(4)

Computer Architecture and Automated Design Laboratory; Department of Electrical and Computer Engineering; Boston University; Boston, MA 02215; Web: http://www.bu.edu/caadlab.

The acceleration of molecular dynamics (MD) simulations using high-performance reconfigurable computing (HPRC) has been much studied. Given the intense competition from multicore and GPUs, there is now a question whether MD on HPRC can be competitive. We concentrate here on the MD kernel computation: determining the short-range force between particle pairs. In one part of the study, we systematically explore the design space of the force pipeline with respect to arithmetic algorithm, arithmetic mode, precision, and various other optimizations. We examine simplifications and find that some have little effect on simulation quality. In the other part, we present the first FPGA study of the filtering of particle pairs with nearly zero mutual force, a standard optimization in MD codes. There are several innovations, including a novel partitioning of the particle space, and new methods for filtering and mapping work onto the pipelines. As a consequence, highly efficient filtering can be implemented with only a small fraction of the FPGA's resources. Overall, we find that, for an Altera Stratix-III EP3SE260, 8 force pipelines running at nearly 200 MHz can fit on the FPGA, and that they can perform at 95% efficiency. This results in an 80-fold per-core speed-up for the short-range force, which is likely to make FPGAs highly competitive for MD.

Source
DOI: http://dx.doi.org/10.1145/1862648.1862653
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3109751
November 2010

Explicit Design of FPGA-Based Coprocessors for Short-Range Force Computations in Molecular Dynamics Simulations.

Parallel Comput 2008 May;34(4-5):261-277

Computer Architecture and Automated Design Lab, Department of Electrical and Computer Engineering, Boston University; Boston, MA 02215.

FPGA-based acceleration of molecular dynamics simulations (MD) has been the subject of several recent studies. The short-range force computation, which dominates the execution time, is the primary focus. Here we combine: a high level of FPGA-specific design including cell lists, systematically determined interpolation and precision, handling of exclusion, and support for MD simulations of up to 256K particles. The target system consists of a standard PC with a 2004-era COTS FPGA board. There are several innovations: new microarchitectures for several major components, including the cell list processor and the off-chip memory controller; and a novel arithmetic mode. Extensive experimentation was required to optimize precision, interpolation order, interpolation mode, table sizes, and simulation quality. We obtain a substantial speed-up over a highly tuned production MD code.
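
The cell-list technique mentioned in this abstract can be sketched in software as follows. Particles are binned into cubic cells whose side equals the cutoff radius, so any pair within the cutoff must lie in the same or in adjacent cells. Only those candidate pairs are distance-checked. This is a generic illustration, not the paper's hardware cell-list processor. The function name is hypothetical and periodic boundaries are omitted for brevity.

```python
from collections import defaultdict
from itertools import product


def pairs_within_cutoff(positions, cutoff):
    """Return the set of index pairs (i, j), i < j, closer than cutoff."""
    # Bin every particle into a cell of side `cutoff`.
    cells = defaultdict(list)
    for i, (x, y, z) in enumerate(positions):
        key = (int(x // cutoff), int(y // cutoff), int(z // cutoff))
        cells[key].append(i)

    cutoff_sq = cutoff * cutoff
    found = set()
    for (cx, cy, cz), members in cells.items():
        # Visit the home cell and its 26 neighbours.
        for dx, dy, dz in product((-1, 0, 1), repeat=3):
            neighbours = cells.get((cx + dx, cy + dy, cz + dz), ())
            for i in members:
                for j in neighbours:
                    if i < j:
                        dist_sq = sum((a - b) ** 2
                                      for a, b in zip(positions[i], positions[j]))
                        # Filter: discard pairs beyond the cutoff radius.
                        if dist_sq < cutoff_sq:
                            found.add((i, j))
    return found
```

Compared with checking all pairs, the number of distance tests scales with the number of particles times the local density rather than with the square of the particle count.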

Source
DOI: http://dx.doi.org/10.1016/j.parco.2008.01.007
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2440579
May 2008

Single Pass Streaming BLAST on FPGAs.

Parallel Comput 2007 Nov;33(10-11):741-756

Department of Electrical and Computer Engineering, Boston University; Boston, MA 02215, Web: http://www.bu.edu/caadlab.

Approximate string matching is fundamental to bioinformatics and has been the subject of numerous FPGA acceleration studies. We address issues with respect to FPGA implementations of both BLAST- and dynamic-programming- (DP) based methods. Our primary contribution is a new algorithm for emulating the seeding and extension phases of BLAST. This operates in a single pass through a database at streaming rate, and with no preprocessing other than loading the query string. Moreover, it emulates parameters turned to maximum possible sensitivity with no slowdown. While current DP-based methods also operate at streaming rate, generating results can be cumbersome. We address this with a new structure for data extraction. We present results from several implementations showing order of magnitude acceleration over serial reference code. A simple extension assures compatibility with NCBI BLAST.
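
For context, the seeding and extension phases that this abstract refers to can be sketched with the textbook BLAST heuristic: index the query's k-letter words, stream once through the database looking for exact word hits, and extend each hit along its diagonal until the score drops too far below its best value. The sketch is not the paper's streaming algorithm. The function names, scoring values, and the right-only extension are simplifying assumptions.

```python
def find_seeds(query, database, k=3):
    """Return (query_pos, database_pos) for every shared k-letter word."""
    index = {}
    for q in range(len(query) - k + 1):
        index.setdefault(query[q:q + k], []).append(q)
    hits = []
    # Single pass over the database; only the query was indexed beforehand.
    for d in range(len(database) - k + 1):
        for q in index.get(database[d:d + k], ()):
            hits.append((q, d))
    return hits


def extend_ungapped(query, database, q, d, k=3,
                    match=1, mismatch=-2, x_drop=3):
    """Extend a seed to the right without gaps; return the best score seen."""
    score = best = k * match
    i, j = q + k, d + k
    # Stop at either sequence end or once the score falls more than
    # x_drop below the best score seen so far.
    while i < len(query) and j < len(database) and best - score <= x_drop:
        score += match if query[i] == database[j] else mismatch
        best = max(best, score)
        i += 1
        j += 1
    return best


seed_hits = find_seeds("ACGTAC", "TTACGTACTT")
best_score = extend_ungapped("ACGTAC", "TTACGTACTT", 0, 2)
```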

Source
DOI: http://dx.doi.org/10.1016/j.parco.2007.09.003
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2598392
November 2007

Computing Models for FPGA-Based Accelerators.

Comput Sci Eng 2008 Oct;10(6):35-45

Boston University.

Field-programmable gate arrays are widely considered as accelerators for compute-intensive applications. A critical phase of FPGA application development is finding and mapping to the appropriate computing model. FPGA computing enables models with highly flexible fine-grained parallelism and associative operations such as broadcast and collective response. Several case studies demonstrate the effectiveness of using these computing models in developing FPGA applications for molecular modeling.

Source
DOI: http://dx.doi.org/10.1109/MCSE.2008.143
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3096930
October 2008

Families of FPGA-Based Accelerators for Approximate String Matching.

Microprocess Microsyst 2007 Mar;31(2):135-145

Department of Electrical and Computer Engineering, Boston University.

Dynamic programming for approximate string matching is a large family of different algorithms, which vary significantly in purpose, complexity, and hardware utilization. Many implementations have reported impressive speed-ups, but have typically been point solutions: highly specialized and addressing only one or a few of the many possible options. The problem to be solved is creating a hardware description that implements a broad range of behavioral options without losing efficiency due to feature bloat. We report a set of three component types that address different parts of the approximate string matching problem. This allows each application to choose the feature set required, then make maximum use of the FPGA fabric according to that application's specific resource requirements. Multiple, interchangeable implementations are available for each component type. We show that these methods allow the efficient generation of a large, if not complete, family of accelerators for this application. This flexibility was obtained while retaining high performance: we have evaluated a sample against serial reference codes and found speed-ups of 150× to 400× over a high-end PC.
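
One well-known member of this algorithm family is local alignment scoring in the style of Smith-Waterman. The sketch below is a serial reference version with a linear gap penalty. It is offered only to make the recurrence concrete and is not one of the paper's hardware components. The function name and scoring parameters are assumptions.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of strings a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)  # previous row of the score matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Each cell depends only on its left, upper, and upper-left
            # neighbours; scores are clamped at zero for local alignment.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best
```

Because each cell depends only on three neighbours, all cells on one anti-diagonal are mutually independent, which is the property hardware implementations typically exploit. Changing the scoring rule, the gap model, or the zero clamp yields other members of the family, such as global alignment or edit distance.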

Source
DOI: http://dx.doi.org/10.1016/j.micpro.2006.04.001
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3096528
March 2007

Achieving High Performance with FPGA-Based Computing.

Computer (Long Beach Calif) 2007 Mar;40(3):50-57

Boston University.

Numerous application areas, including bioinformatics and computational biology, demand increasing amounts of processing capability. In many cases, the computation cores and data types are suited to field-programmable gate arrays. The challenge is identifying the design techniques that can extract high performance potential from the FPGA fabric.

Source
DOI: http://dx.doi.org/10.1109/MC.2007.79
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3098506
March 2007