
Fused Floating Point Arithmetic For Application Specific Processors







Fused Floating Point Arithmetic For Application Specific Processors


Author: Jae Hong Min
Language: en
Release Date: 2013


Floating-point arithmetic units are used in modern computers for 2D/3D graphics and scientific applications because they offer a wider dynamic range than a fixed-point number system of the same word length. However, a floating-point arithmetic unit has larger area, power consumption, and latency than a fixed-point arithmetic unit, which has become a significant issue in modern low-power processors with limited power and performance margins. Fused architectures have therefore been developed to improve floating-point operations. This dissertation introduces new, improved fused architectures for add-subtract, sum-of-squares, and magnitude operations for graphics, scientific, and signal processing applications.

A low-power dual-path fused floating-point add-subtract unit is introduced and compared with previous fused add-subtract units, namely the single-path and the high-speed dual-path designs. The high-speed dual-path fused add-subtract unit has lower latency than the single-path unit at the cost of higher power consumption. To reduce the power consumption, an alternative dual-path architecture is applied to the fused add-subtract unit in which the significand addition, subtraction, and rounding are performed after the far/close path selection. The proposed design consumes less power than the high-speed dual-path fused add-subtract unit at some cost in latency, yet it remains faster than the single-path fused unit.

High-performance and low-power fused floating-point architectures for a two-term sum-of-squares computation are introduced and compared with discrete units. The fused architectures cover pre/post-alignment, partial carry-sum width, and enhanced rounding. The fused floating-point sum-of-squares unit with post-alignment, a 26-bit partial carry-sum width, and the enhanced rounding scheme has lower power consumption, area, and latency than discrete parallel dot-product and sum-of-squares units. Hardware tradeoffs between the fused designs are presented in terms of power consumption, area, and latency; for example, the enhanced rounding processing reduces latency at a moderate cost in power consumption and area.

A new type of fused architecture for magnitude computation with lower power consumption, area, and latency than conventional discrete floating-point units is also proposed. Compared with a discrete parallel magnitude unit realized with conventional floating-point squarers, an adder, and a square-root unit, the fused floating-point magnitude unit has less area, latency, and power consumption. The new design includes new enhanced exponent, compound add/round, and normalization units. In addition, a pipelined structure for the fused magnitude unit is shown.
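The magnitude operation computes sqrt(x*x + y*y) with a single normalization and rounding at the end of the fused datapath. As a rough software analogue (not the hardware architecture proposed in the dissertation), the intermediate roundings can be deferred by keeping the squares and their sum in double precision, where the squares of single-precision inputs are exact, and converting to single precision only after the square root. A minimal C sketch, assuming ordinary IEEE-754 float/double types:

    #include <math.h>
    #include <stdio.h>

    /* Discrete path: every intermediate result is rounded to single precision. */
    static float magnitude_discrete(float x, float y) {
        float xx = x * x;    /* rounded */
        float yy = y * y;    /* rounded */
        float s  = xx + yy;  /* rounded */
        return sqrtf(s);     /* rounded */
    }

    /* Deferred-rounding path: the squares of floats are exact in double, so
     * only the sum, the square root, and the final conversion round.  A true
     * fused magnitude unit rounds exactly once, so this emulation is only an
     * approximation of its numerical behavior. */
    static float magnitude_deferred(float x, float y) {
        double xx = (double)x * (double)x;  /* exact */
        double yy = (double)y * (double)y;  /* exact */
        return (float)sqrt(xx + yy);
    }

    int main(void) {
        float x = 0.3f, y = 0.4f;
        printf("discrete: %.9g\n", magnitude_discrete(x, y));
        printf("deferred: %.9g\n", magnitude_deferred(x, y));
        return 0;
    }

For most inputs the two paths agree; the deferred version occasionally differs in the last bit because it avoids three of the intermediate roundings.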



Improved Architectures For Fused Floating Point Arithmetic Units


Author: Jongwook Sohn
Language: en
Release Date: 2013


Most general-purpose processors (GPPs) and application-specific processors (ASPs) use floating-point arithmetic because of its wide and precise number system. However, floating-point operations require complex steps such as alignment, normalization, and rounding. To reduce this overhead, fused floating-point arithmetic units have been introduced. This dissertation proposes improved architectures for three fused floating-point arithmetic units: 1) a fused floating-point add-subtract unit, 2) a fused floating-point two-term dot-product unit, and 3) a fused floating-point three-term adder. The three fused floating-point units are implemented for both single and double precision and evaluated in terms of area, power consumption, latency, and throughput.

To improve the performance of the fused floating-point add-subtract unit, a new alignment scheme, fast rounding, two dual-path algorithms, and pipelining are applied. The improved fused floating-point two-term dot-product unit applies several optimizations: a new alignment scheme, early normalization and fast rounding, four-input leading-zero anticipation (LZA), a dual-path algorithm, and pipelining. The proposed fused floating-point three-term adder applies a new exponent compare and significand alignment scheme, double reduction, early normalization and fast rounding, three-input LZA, and pipelining to improve the performance.
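The fused two-term dot product evaluates a*b + c*d with one normalization and rounding. A well-known software analogue on machines that expose only a fused multiply-add is Kahan's algorithm, which recovers the rounding error of one product with a second FMA. The sketch below uses the standard C99 fma() function, assumes double expressions are evaluated in IEEE-754 double precision, and illustrates the operation rather than the architecture proposed in the dissertation:

    #include <math.h>
    #include <stdio.h>

    /* Kahan's two-term dot product: e captures the exact rounding error of
     * the product c*d, which is folded back in at the end.  The result stays
     * within about one ulp of the exact value of a*b + c*d. */
    static double dot2(double a, double b, double c, double d) {
        double w = c * d;          /* rounded product */
        double e = fma(c, d, -w);  /* exact error of that rounding */
        double f = fma(a, b, w);   /* a*b + w with a single rounding */
        return f + e;
    }

    int main(void) {
        /* The two products nearly cancel, so a naive a*b + c*d loses all
         * significant digits while dot2 keeps the small residual. */
        double a = 1.0 + 0x1p-27, b = 1.0 - 0x1p-27;
        double c = -1.0, d = 1.0;
        printf("naive: %.17g\n", a * b + c * d);    /* prints 0 */
        printf("dot2 : %.17g\n", dot2(a, b, c, d)); /* about -5.55e-17 */
        return 0;
    }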



Fused Floating Point Arithmetic For Dsp


Author: Hani Hasan Mustafa Saleh
Language: en
Release Date: 2009
Category: Floating-point arithmetic


Floating-point arithmetic is attractive for the implementation of a variety of digital signal processing (DSP) applications because it allows the designer and user to concentrate on the algorithms and architecture without worrying about numerical issues. In the past, many DSP applications used fixed-point arithmetic due to the high cost (in delay, silicon area, and power consumption) of floating-point arithmetic units. In modern general-purpose processors, fused floating-point multiply-add units have become attractive since their delay and silicon area are often less than those of a discrete floating-point multiplier followed by a floating-point adder. Further, accuracy is improved by the fused implementation since rounding is performed only once (after the multiplication and addition).

This work extends fused floating-point arithmetic to operations that are frequently encountered in DSP. The Fast Fourier Transform is a case in point, since it uses a complex butterfly operation. For a radix-2 implementation, the butterfly consists of a complex multiply and the complex addition and subtraction of the same pair of data. For a radix-4 implementation, the butterfly consists of three complex multiplications and eight complex additions and subtractions. Both butterfly operations can be implemented with two fused primitives: a fused two-term dot-product unit and a fused add-subtract unit.

The fused two-term dot product multiplies two sets of operands and adds the products as a single operation. The two products do not need to be rounded (only the sum is normalized and rounded), which reduces the delay by about 15% while reducing the silicon area by about 33%. For the add-subtract unit, much of the complexity of a discrete implementation comes from the need to compare the operand exponents and align the significands prior to the add and subtract operations. For the fused implementation, sharing the comparison and alignment greatly reduces the complexity. The delay and the arithmetic results are the same as if the operations were performed in the conventional manner with a floating-point adder and a separate floating-point subtracter; in this case, the fused implementation is about 20% smaller than the discrete equivalent.
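The accuracy point about fused multiply-add, that rounding happens once instead of twice, can be seen directly with the C99 fma() function. In the small example below the separately rounded product drops its low-order bit before the addition, while the fused operation keeps it (assuming ordinary IEEE-754 double arithmetic):

    #include <math.h>
    #include <stdio.h>

    int main(void) {
        double a = 134217729.0;           /* 2^27 + 1 */
        double b = 134217729.0;           /* 2^27 + 1 */
        double c = -18014398509481984.0;  /* -(2^54) */

        /* Discrete multiply then add: a*b is rounded before the addition,
         * losing the trailing +1 of the exact product 2^54 + 2^28 + 1. */
        double separate = a * b + c;      /* 268435456 */

        /* Fused multiply-add: the exact product participates in the sum and
         * only the final result is rounded. */
        double fused = fma(a, b, c);      /* 268435457 */

        printf("separate = %.17g\n", separate);
        printf("fused    = %.17g\n", fused);
        return 0;
    }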



Handbook Of Floating Point Arithmetic


Author: Jean-Michel Muller
Language: en
Publisher: Springer Science & Business Media
Release Date: 2009-11-11
Category: Mathematics


Floating-point arithmetic is the most widely used way of implementing real-number arithmetic on modern computers. However, making such an arithmetic reliable and portable, yet fast, is a very difficult task. As a result, floating-point arithmetic is far from being exploited to its full potential. This handbook aims to provide a complete overview of modern floating-point arithmetic. So that the techniques presented can be put directly into practice in actual coding or design, they are illustrated, whenever possible, by a corresponding program. The handbook is designed for programmers of numerical applications, compiler designers, programmers of floating-point algorithms, designers of arithmetic operators, and more generally, students and researchers in numerical analysis who wish to better understand a tool used in their daily work and research.



Handbook Of Floating Point Arithmetic


Author: Jean-Michel Muller
Language: en
Publisher: Birkhäuser
Release Date: 2018-05-02
Category: Mathematics


Floating-point arithmetic is the most widely used way of implementing real-number arithmetic on modern computers. However, making such an arithmetic reliable and portable, yet fast, is a very difficult task. As a result, floating-point arithmetic is far from being exploited to its full potential. This handbook aims to provide a complete overview of modern floating-point arithmetic. So that the techniques presented can be put directly into practice in actual coding or design, they are illustrated, whenever possible, by a corresponding program. The handbook is designed for programmers of numerical applications, compiler designers, programmers of floating-point algorithms, designers of arithmetic operators, and more generally, students and researchers in numerical analysis who wish to better understand a tool used in their daily work and research.



Application Specific Arithmetic


Author: Florent de Dinechin
Language: en
Publisher: Springer Nature




Custom Floating Point Arithmetic For Integer Processors


Author: Jingyan Jourdan
Language: en
Release Date: 2012


Media processing applications typically involve numerical blocks that exhibit regular floating-point computation patterns. For processors whose architecture supports only integer arithmetic, these patterns can be profitably turned into custom operators that come in addition to the five basic ones (+, -, ×, /, and √) but achieve better performance by handling more operations at once. This thesis addresses the design of such custom operators as well as the techniques developed in the compiler to select them in application codes.

We have designed optimized implementations for a set of custom operators which includes squaring, scaling, adding two nonnegative terms, fused multiply-add, fused square-add (x*x + z, with z >= 0), two-dimensional dot products (DP2), sums of two squares, as well as simultaneous addition/subtraction and sine/cosine. With novel algorithms targeting high instruction-level parallelism, detailed here for squaring, scaling, DP2, and sin/cos, we achieve speedups of up to 4.2x for individual custom operators even when subnormal numbers are fully supported.

Furthermore, we introduce the optimizations developed in the ST231 C/C++ compiler for selecting such operators. Most of the selections are achieved at a high level, using syntactic criteria. However, for fused square-add, we also enhance the framework of integer range analysis to support floating-point variables in order to prove the required positivity condition z >= 0. Finally, we provide quantitative evidence of the benefits of supporting this selection of custom operations: on DSP kernels and benchmarks, our approach is up to 1.59x faster than using only the basic operators.
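One of the listed operators, squaring, gives a feel for how such custom operators work on an integer-only datapath: the IEEE-754 encoding is unpacked, the significand is squared with an integer multiply, and the result is renormalized and rounded using only integer operations. The C sketch below is a simplified illustration, not the thesis's ST231 algorithm: it assumes a normal, finite input whose square is also a normal number (no subnormal, overflow, or NaN handling) and rounds to nearest, ties to even:

    #include <stdint.h>
    #include <string.h>
    #include <stdio.h>

    /* Square a single-precision float using only integer operations.
     * Round-to-nearest-even; normal inputs and normal results only. */
    static uint32_t float_square_bits(uint32_t x) {
        uint32_t exp  = (x >> 23) & 0xFF;            /* biased exponent */
        uint64_t mant = (x & 0x7FFFFF) | 0x800000;   /* 24-bit significand */

        uint64_t p = mant * mant;                    /* exact 47/48-bit product */
        int32_t  e = 2 * ((int32_t)exp - 127);       /* unbiased result exponent */
        int shift  = 23;                             /* renormalize to 24 bits */
        if (p >= (1ULL << 47)) { shift = 24; e += 1; }

        uint64_t r    = p >> shift;
        uint64_t rem  = p & ((1ULL << shift) - 1);
        uint64_t half = 1ULL << (shift - 1);
        if (rem > half || (rem == half && (r & 1))) r++;  /* round to nearest even */
        if (r == (1ULL << 24)) { r >>= 1; e += 1; }       /* rounding carried out */

        return ((uint32_t)(e + 127) << 23) | ((uint32_t)r & 0x7FFFFF);
    }

    int main(void) {
        float x = 3.1415927f, y;
        uint32_t xb, yb;
        memcpy(&xb, &x, sizeof xb);
        yb = float_square_bits(xb);
        memcpy(&y, &yb, sizeof y);
        printf("integer-only square: %.9g\n", y);
        printf("hardware x*x:        %.9g\n", x * x);  /* should match */
        return 0;
    }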



Embedded Systems Design With Special Arithmetic And Number Systems


Author: Amir Sabbagh Molahosseini
Language: en
Publisher: Springer
Release Date: 2017-03-20
Category: Technology & Engineering


This book introduces readers to alternative approaches to designing efficient embedded systems using unconventional number systems. The authors describe various systems that can be used for designing efficient embedded and application-specific processors, such as the Residue Number System, Logarithmic Number System, Redundant Binary Number System, Double-Base Number System, Decimal Floating-Point Number System, and Continuous Valued Number System. Readers will learn the strategies and trade-offs of using unconventional number systems in application-specific processors and be able to apply and design appropriate arithmetic operations from these number systems to boost the performance of digital systems.
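To give a flavor of one of these systems, the Residue Number System represents an integer by its residues modulo a set of pairwise coprime moduli, so addition and multiplication decompose into small, carry-free digit operations. In the C sketch below the moduli {5, 7, 9} are an arbitrary illustrative choice (dynamic range 315), and the brute-force Chinese Remainder Theorem reconstruction stands in for the constant-based converters used in real designs:

    #include <stdio.h>

    /* Pairwise coprime moduli; the dynamic range is 5*7*9 = 315. */
    static const int M[3] = {5, 7, 9};
    #define RANGE 315

    /* Convert an integer in [0, RANGE) to its residue digits. */
    static void to_rns(int x, int r[3]) {
        for (int i = 0; i < 3; i++) r[i] = x % M[i];
    }

    /* Digit-wise RNS addition: no carries propagate between digits. */
    static void rns_add(const int a[3], const int b[3], int s[3]) {
        for (int i = 0; i < 3; i++) s[i] = (a[i] + b[i]) % M[i];
    }

    /* Recover the integer by searching the range (Chinese Remainder Theorem). */
    static int from_rns(const int r[3]) {
        for (int x = 0; x < RANGE; x++)
            if (x % M[0] == r[0] && x % M[1] == r[1] && x % M[2] == r[2])
                return x;
        return -1;
    }

    int main(void) {
        int a[3], b[3], s[3];
        to_rns(123, a);
        to_rns(87, b);
        rns_add(a, b, s);
        printf("123 + 87 = %d\n", from_rns(s));  /* prints 210 */
        return 0;
    }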



Reconfigurable Computing Architectures Tools And Applications


Author: Oliver Choy
Language: en
Publisher: Springer Science & Business Media
Release Date: 2012-03-02
Category: Computers


This book constitutes the refereed proceedings of the 8th International Symposium on Reconfigurable Computing: Architectures, Tools and Applications, ARC 2012, held in Hong Kong, China, in March 2012. The 35 revised papers presented, consisting of 25 full papers and 10 poster papers, were carefully reviewed and selected from 44 submissions. The topics covered are applied RC design methods and tools, applied RC architectures, applied RC applications, and critical issues in applied RC.



Methods For Reducing Floating Point Computation Overhead


Author: Jon Pimentel
Language: en
Release Date: 2017


Despite floating-point (FP) being the most commonly used method for real-number representation, certain architectures are still limited to fixed-point arithmetic due to the large area and power requirements of FP hardware. A software library that emulates FP functions is typically used when FP calculations must be performed on a platform with a fixed-point datapath. Software implementations of FP operations, however, suffer from low throughput despite requiring no additional area, while hardware FP implementations provide high throughput but require a large amount of additional area and consequently increase leakage. It is therefore desirable to increase the FP throughput of a software implementation without incurring the area overhead of a full hardware floating-point unit (FPU). Furthermore, the widths of data words in digital processors have a direct impact on area in application-specific ICs (ASICs) and field-programmable gate arrays (FPGAs), and circuit area in turn affects energy dissipation per workload and chip cost. Graphics and image processing workloads are very FP intensive; however, little exploration has been done into modifying FP word width and observing its effect on image quality and chip area.

This dissertation first presents hybrid FP implementations, which improve software FP performance without incurring the area overhead of full hardware FPUs. The proposed implementations are synthesized in 65 nm complementary metal oxide semiconductor (CMOS) technology and integrated into small fixed-point processors that use a reduced instruction set computing (RISC)-like architecture. Unsigned, shift-carry, and leading zero detection (USL) support is added to the processors to augment the existing instruction set architecture (ISA) and increase FP throughput with little area overhead. Two variations of hybrid implementations are created: USL support is additional general-purpose hardware that is not specific to FP workloads (e.g., unsigned operation support), whereas custom FP-specific (CFP) hardware accelerates FP workloads directly (e.g., exponent calculation logic). The hybrid implementations with USL support increase software FP throughput per core by 2.18x for addition/subtraction, 1.29x for multiplication, 3.07–4.05x for division, and 3.11–3.81x for square root, and use 90.7–94.6% less area than dedicated fused multiply-add (FMA) hardware. The hybrid implementations with CFP hardware increase throughput per core over a fixed-point software kernel by 3.69–7.28x for addition/subtraction, 1.22–2.03x for multiplication, 14.4x for division, and 31.9x for square root, and use 77.3–97.0% less area than dedicated FMA hardware. The circuit area and throughput are reported for 38 multiply-add, 8 addition/subtraction, 6 multiplication, 45 division, and 45 square root designs; 33 multiply-add implementations are presented which improve throughput per core versus a fixed-point software implementation by 1.11–15.9x and use 38.2–95.3% less area than dedicated FMA hardware.

The second portion of this dissertation investigates the effects of modifying FP word width. FP exponent and mantissa widths are independently varied for the seven major computational blocks of an airborne synthetic aperture radar (SAR) image formation engine based on the backprojection algorithm. SAR imaging uses pulses of microwave energy to provide day, night, and all-weather imaging and can be used for reconnaissance, navigation, and environment monitoring. The backprojection algorithm is a frequently used tomographic reconstruction method similar to that used in computed tomography (CT) imaging; trigonometric function evaluation, interpolation, and Fourier transforms are common to SAR backprojection and other biomedical image formation algorithms. The circuit area in 65 nm CMOS and the peak signal-to-noise ratio (PSNR) and structural similarity index metric (SSIM) are found for 572 design points. With word width reductions of 46.9–79.7%, images with a 0.99 SSIM are produced with imperceptible image quality degradation and a 1.9–11.4x area reduction.

The third portion of this dissertation covers the physical design of two many-core chips in 32 nm PD-SOI, KiloCore and KiloCore2: first the design of KiloCore, then the adjustments made to the flow for the tape-out of KiloCore2. KiloCore features 1000 cores capable of independent program execution; the maximum clock frequencies of its cores range from 1.70 GHz to 1.87 GHz at 1.10 V, and it compares favorably against many other many-core and multi-core chips as well as low-power processors. At a supply voltage of 0.56 V, processors require 5.8 pJ per operation at a clock frequency of 115 MHz. KiloCore2 has 700 cores, 697 of which are programmable processor tiles and three of which are hardware accelerators (a fast Fourier transform (FFT) accelerator and two Viterbi decoders). The assembled printed circuit boards (PCBs) with packaged KiloCore2 chips are expected to be ready in July.

The fourth portion of this dissertation explores implementing a scientific kernel, sparse matrix-vector multiplication, on a many-core array. Twenty-three functionally equivalent sparse matrix times dense vector multiplication implementations are created for a fine-grained many-core platform with FP capabilities and compared against two central processing unit (CPU) chips and two graphics processing unit (GPU) chips. The designs for the many-core array, CPUs, and GPUs are evaluated using the metrics of throughput per area and throughput per watt when operating on a set of five unstructured sparse matrices of varying dimensions, sourced from a wide range of domains including directed weighted graphs, computational fluid dynamics, circuit simulation, thermal problems (e.g., heat exchanger design), and eigenvalue/model reduction problems. Results using unscheduled and unoptimized code show that the implementations on the many-core platform increase power efficiency by up to 14.0x versus the CPU implementations and by up to 27.9x versus the GPU implementations, and increase area efficiency by as much as 17.8x versus the CPU implementations and up to 36.6x versus the GPU implementations.
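The sparse matrix-vector kernel evaluated in the last part is conventionally written over a compressed sparse row (CSR) layout. The minimal C sketch below shows only the kernel on a tiny hand-made matrix; the many-core mapping, scheduling, and the five benchmark matrices studied in the dissertation are not reproduced here:

    #include <stdio.h>

    /* y = A*x for a sparse matrix A in compressed sparse row (CSR) form:
     * the nonzeros of row i live in val[row_ptr[i] .. row_ptr[i+1]-1],
     * with their column indices in col[] at the same positions. */
    static void spmv_csr(int n_rows, const int *row_ptr, const int *col,
                         const float *val, const float *x, float *y) {
        for (int i = 0; i < n_rows; i++) {
            float sum = 0.0f;
            for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
                sum += val[k] * x[col[k]];
            y[i] = sum;
        }
    }

    int main(void) {
        /* 3x3 example:  [10  0  2]
         *               [ 0  3  0]
         *               [ 1  0  4]   (5 nonzeros) */
        int   row_ptr[] = {0, 2, 3, 5};
        int   col[]     = {0, 2, 1, 0, 2};
        float val[]     = {10.0f, 2.0f, 3.0f, 1.0f, 4.0f};
        float x[]       = {1.0f, 2.0f, 3.0f};
        float y[3];

        spmv_csr(3, row_ptr, col, val, x, y);
        printf("y = [%g, %g, %g]\n", y[0], y[1], y[2]);  /* [16, 6, 13] */
        return 0;
    }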