
Combat computing systems

milstar: http://drops.dagstuhl.de/opus/volltexte/2006/732/pdf/06141.AthanasPeter.Paper.732.pdf

Although an FPGA's clock rate rarely exceeds one-tenth that of a PC, hardware-implemented digital filters can process data at rates many times that of software implementations [4]. Additional performance gains have been described for cryptography [5], network packet filtering [6], target recognition [7] and pattern matching [8], among other applications.

A. Present-Day Cost-Performance Comparison

Owing to the prevalence of IEEE standard floating point in a wide range of applications, several researchers have designed IEEE 754 compliant floating-point accelerator cores constructed out of the Xilinx Virtex-II Pro FPGA's configurable logic and dedicated integer multipliers [16-18]. Dou et al. published one of the highest performance benchmarks, 15.6 GFLOPS, by placing 39 floating-point processing elements on a theoretical Xilinx XC2VP125 FPGA [19]. Interpolating their results for the largest production Virtex-II Pro device, the XC2VP100, produces 12.4 GFLOPS, compared to the peak 6.4 GFLOPS achievable for a 3.2 GHz Intel Pentium processor. Assuming that the Pentium can sustain 50% of its peak, the FPGA outperforms the processor by a factor of four for matrix multiplication.

One of the earlier projects demonstrated a 23x speedup on a 2-D FFT through the use of a custom 18-bit floating-point format [26]. More recent work has focused on parameterizable libraries of floating-point units that can be tailored to the task at hand [27-29]. By using a custom floating-point format sized to match the widths of the FPGA's internal integer multipliers, a speedup of 44 was achieved for a hydrodynamics simulation [30] using four large FPGAs.

Nakasato and Hamada's 38 GFLOPS of performance is impressive, even from a cost-performance standpoint. For the cost of their PROGRAPE-3 board, estimated at $15,000, it is likely that a 15-node processor cluster could be constructed producing 196 single-precision peak GFLOPS. Even in the unlikely scenario that this cluster could sustain only the same 10% of peak performance that Nakasato and Hamada obtained for their software implementation, the PROGRAPE-3 design would still achieve a 2x speedup.

As in many FPGA-to-CPU comparisons, it is likely that the analysis unfairly favors the FPGA solution. Hardware implementations require specialized skills in digital design and vendor-specific tool flows, and development time and costs are significantly higher than for software. Many comparisons in the literature spend significantly more time optimizing the hardware implementations than they do optimizing their software implementations. Previous research has demonstrated significant compiler inefficiency for common HPC functions [31]. For the DGEMM matrix multiplication function, a hand-coded version outperformed the compiler by greater than eight times.

A total of 39 PEs can be integrated into the xc2vp125-7 FPGA, reaching a performance of, e.g., 15.6 GFLOPS with 1600 KB of local memory and 400 MB/s of external memory bandwidth.

It is a chip with 1700 pins and a high price, on the order of $8,000 today.

http://ce.et.tudelft.nl/~george/publications/Conf/FPGA05/FPGA05Dou.pd
http://www.xilinx.com/publications/matrix/virtexmatrix.pd

Xilinx Virtex FPGA
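To make the quoted speedup claims easier to check, here is a minimal Python sketch of the arithmetic behind them. The GFLOPS figures and the sustained-fraction assumptions (50% for the Pentium, 10% for the hypothetical cluster) are taken from the excerpt above, not from any measurement of mine.

# Back-of-the-envelope check of the speedup arithmetic quoted above.
# Figures come from the cited papers; the sustained fractions are the
# assumptions stated in the text, not measurements.

def speedup(fpga_gflops, cpu_peak_gflops, cpu_sustained_fraction):
    """Ratio of FPGA throughput to the CPU's sustained throughput."""
    return fpga_gflops / (cpu_peak_gflops * cpu_sustained_fraction)

# Virtex-II Pro XC2VP100 (12.4 GFLOPS, interpolated from Dou et al.)
# vs a 3.2 GHz Pentium with 6.4 GFLOPS peak, assumed 50% sustained:
print(speedup(12.4, 6.4, 0.50))    # ~3.9x -> the quoted "factor of four"

# PROGRAPE-3 (Nakasato & Hamada, 38 GFLOPS) vs a hypothetical $15,000
# 15-node cluster with 196 single-precision peak GFLOPS at 10% sustained:
print(speedup(38.0, 196.0, 0.10))  # ~1.9x -> the quoted "2x speedup"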



