
"FPGA vs GPU in Machine Learning"

 

Machine learning is a method of data analysis that automates analytical model building. A model learns from data and predicts outputs. The field grew out of pattern recognition and the idea that computers can learn directly from data: systems learn from previous computations to produce reliable, repeatable decisions and results.
The power consumed by heavy computation has become a major constraint on the performance of machine learning algorithms. To overcome this problem, accelerators such as FPGAs and GPUs are used. A five-level strategy was used to program the FPGA, which also makes it easier for developers to understand and master the language. The performance of the FPGA and GPU accelerator designs was compared using normalized operations per cycle per pipeline and an effective parallelism factor ("effective para factor"), respectively. The study concluded that today's FPGAs can perform better while consuming roughly 1/10 of the GPU's power.
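As a rough sketch of how such a comparison can be set up, the snippet below normalizes throughput to operations per cycle and to operations per watt; the metric definitions are simplified and the device numbers are made-up placeholders, not figures from the study.

```python
# Hypothetical illustration of normalizing accelerator throughput and power.
# The device figures below are placeholders, not measurements from the study.

def ops_per_cycle(total_ops, runtime_s, clock_hz):
    """Normalized operations completed per clock cycle."""
    return total_ops / (runtime_s * clock_hz)

def perf_per_watt(total_ops, runtime_s, power_w):
    """Throughput (ops/s) divided by average power draw."""
    return (total_ops / runtime_s) / power_w

workload_ops = 2e12  # total operations in a fixed workload (placeholder)
devices = {
    "GPU":  {"runtime_s": 1.0, "clock_hz": 1.4e9, "power_w": 250.0},
    "FPGA": {"runtime_s": 1.8, "clock_hz": 3.0e8, "power_w": 25.0},
}

for name, d in devices.items():
    print(f"{name}: "
          f"ops/cycle={ops_per_cycle(workload_ops, d['runtime_s'], d['clock_hz']):.1f}, "
          f"ops/W={perf_per_watt(workload_ops, d['runtime_s'], d['power_w']):.2e}")
```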


Table: performance of the systems, with the comparison parameters for FPGA and GPU.

The cost of high-end FPGAs limits them to specific niche applications, while the power consumption of high-end GPUs rules them out for a number of markets and critical systems. This suggests that the choice between FPGA and GPU remains tied to the end-user application of the machine learning algorithm.

A comparative analysis for non-standard precision

A compute-intensive program, matrix-matrix multiply, is selected as a benchmark and implemented for a range of matrix sizes. The results show that for large enough matrices GPUs outperform FPGA-based implementations, but for some smaller matrix sizes, specialized FPGA floating-point operators for half and double-double precision can deliver higher throughput than a GPU implementation.

1) An investigation into how half (16-bit) and double-double (128-bit) precision floating-point operations can be efficiently implemented on both GPUs and FPGAs.
2) A comparison of the performance achievable by GPUs and FPGAs for half and double-double precision computations.
GPU implementations outperform FPGA for larger data sizes but underperform for smaller sizes, where memory latency and kernel launch overhead become significant. FPGAs have good vendor support for custom floating-point formats, and this gap would be expected to widen further in favor of the FPGA implementation if even more exotic number representations were selected.
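A minimal sketch of such a benchmark is shown below; it times matrix-matrix multiply at several sizes and precisions with NumPy on a CPU, purely as a stand-in for the GPU/FPGA measurements (half precision is software-emulated here, and double-double is not available in NumPy, so it is omitted).

```python
# Minimal matrix-multiply benchmark sketch (CPU/NumPy stand-in for the
# GPU and FPGA measurements discussed above).
import time
import numpy as np

def bench_gemm(n, dtype, repeats=5):
    """Time C = A @ B for n x n matrices of the given dtype, return GFLOP/s."""
    a = np.random.rand(n, n).astype(dtype)
    b = np.random.rand(n, n).astype(dtype)
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        a @ b
        best = min(best, time.perf_counter() - t0)
    flops = 2.0 * n ** 3          # multiply-adds in a dense GEMM
    return flops / best / 1e9

for n in (64, 256, 1024):
    for dtype in (np.float16, np.float32, np.float64):
        print(f"n={n:5d} {np.dtype(dtype).name:8s} {bench_gemm(n, dtype):8.2f} GFLOP/s")
```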

High Productivity Computing
Heterogeneous or co-processor architectures play an important role in high productivity computing. High performance computing is the use of parallel processing for the fast execution of advanced application programs; around 2004 the emphasis on raw performance was replaced with productivity, giving rise to high productivity computing. One study compared the performance of an NVIDIA GPU product against a multiple-FPGA supercomputer released in 2009.

The benchmark set was as follows (a brief sketch of the first three kernels appears after the list):

1) Batch generation of pseudo-random numbers.

2) Dense square matrix multiplication.

3) Sum of large vectors of random numbers.

4) Second order N-body simulation.
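The following NumPy snippet sketches what the first three kernels look like (placeholder sizes, not the original benchmark code); the N-body benchmark is omitted for brevity.

```python
# Brief NumPy sketch of three of the four benchmark kernels (sizes are
# placeholders; this is not the original benchmark code).
import numpy as np

rng = np.random.default_rng(0)

# 1) Batch generation of pseudo-random numbers.
batch = rng.random(1_000_000)

# 2) Dense square matrix multiplication.
n = 512
a, b = rng.random((n, n)), rng.random((n, n))
c = a @ b

# 3) Sum of large vectors of random numbers.
total = np.sum(rng.random(10_000_000))

print(batch[:3], c[0, 0], total)
```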

A Hybrid GPU-FPGA-based Computing Platform for Machine Learning:

A hybrid GPU-FPGA computing platform is proposed to tackle the high-density computing demands of machine learning: the training part of a machine learning application is implemented on the GPU, and the inferencing part is implemented on the FPGA.

To evaluate this design methodology, LeNet-5 is chosen as the benchmark algorithm. By adopting this methodology, the LeNet-5 model's accuracy improved from 99.05% to 99.13%, and that accuracy (99.13%) was successfully preserved when transplanting the model from the GPU platform to the FPGA platform.

The experimental results show that GPU training is on average 8.8x faster than the CPU, while FPGA inferencing is on average 44.4x faster than the CPU and on average 6342x faster than the GPU.
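For reference, below is a compact PyTorch sketch of a LeNet-5-style network; it illustrates the model that would be trained on the GPU before porting inference to the FPGA, and is not the authors' actual implementation.

```python
# LeNet-5-style network sketch in PyTorch (illustrative only, not the
# authors' GPU/FPGA implementation).
import torch
import torch.nn as nn

class LeNet5(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5, padding=2),   # 28x28 -> 28x28
            nn.Tanh(),
            nn.AvgPool2d(2),                             # -> 14x14
            nn.Conv2d(6, 16, kernel_size=5),             # -> 10x10
            nn.Tanh(),
            nn.AvgPool2d(2),                             # -> 5x5
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 5 * 5, 120), nn.Tanh(),
            nn.Linear(120, 84), nn.Tanh(),
            nn.Linear(84, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

# Training would run on the GPU if available; the trained weights would
# then be exported for an FPGA inference engine.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = LeNet5().to(device)
dummy = torch.randn(1, 1, 28, 28, device=device)
print(model(dummy).shape)   # torch.Size([1, 10])
```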


Can FPGA beat GPU in accelerating next generation deep neural networks?

Current-generation Deep Neural Networks (DNNs), such as AlexNet and VGG, rely heavily on dense floating-point matrix multiplication (GEMM), which maps well to GPUs thanks to its regular parallelism. Current FPGAs offer superior energy efficiency (Ops/Watt), but they do not match the performance of today's GPUs on these DNNs.

A case study is presented on Ternary ResNet, which relies on sparse GEMM with 2-bit weights and achieves accuracy within ~1% of the full-precision ResNet. On Ternary ResNet, the Stratix 10 FPGA is projected to deliver 60% better performance than the Titan X Pascal GPU while being 2.3x better in performance/watt. After this comparison, we can say that FPGAs may become the platform of choice for accelerating DNNs.
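To make the 2-bit/ternary weight idea concrete, the sketch below shows one common threshold-based way to ternarize a weight tensor with a per-tensor scale; it illustrates the general technique, not the exact quantization scheme used in the cited work.

```python
# Threshold-based ternarization sketch: map weights to {-1, 0, +1} times a
# scale factor. Illustrative only; not the exact scheme from the cited work.
import numpy as np

def ternarize(w, threshold_ratio=0.7):
    """Return (ternary weights in {-1, 0, +1}, per-tensor scale)."""
    delta = threshold_ratio * np.mean(np.abs(w))      # threshold
    t = np.zeros_like(w)
    t[w > delta] = 1.0
    t[w < -delta] = -1.0
    nonzero = np.abs(w[t != 0])
    scale = nonzero.mean() if nonzero.size else 0.0   # scale for the +/-1 entries
    return t, scale

w = np.random.randn(4, 4).astype(np.float32)
t, s = ternarize(w)
print(t)                 # sparse matrix of -1/0/+1 (feeds a sparse GEMM)
print("scale:", s)
```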

The evaluation of a selection of emerging DNN algorithms on two generations of FPGAs (Arria 10 and Stratix 10) and the latest Titan X GPU shows that current trends in DNN algorithms may favor FPGAs, and that FPGAs may even offer superior performance.


Deep-Learning Inferencing with High-Performance Hardware Accelerators

As FPGAs become more readily available on cloud services such as the Amazon Web Services F1 platform, FPGA frameworks for accelerating convolutional neural networks (CNNs), which are used in many machine-learning applications, have begun to emerge for accelerated-application development.

A machine-learning inferencing application was developed to leverage many different HPC architectures and frameworks and was designed to compare these technologies to one another. CNNs such as AlexNet and a custom 14-layer version of GoogLeNet were used to classify handwritten Chinese characters. The Caffe framework was used to target Xilinx FPGAs and NVIDIA GPUs. In this study the GPU performed best compared to the FPGA.
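The throughput numbers in such studies come down to timing the forward pass at different batch sizes; the generic sketch below does this in PyTorch with a small placeholder CNN, rather than the Caffe, AlexNet/GoogLeNet, and FPGA setups of the original work.

```python
# Generic batch-size throughput sweep (placeholder CNN; the original work
# used Caffe on Xilinx FPGAs and NVIDIA GPUs with AlexNet/GoogLeNet).
import time
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Sequential(                       # stand-in CNN, not AlexNet/GoogLeNet
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 10),
).to(device).eval()

with torch.no_grad():
    for batch in (1, 8, 32, 128):
        x = torch.randn(batch, 3, 224, 224, device=device)
        model(x)                             # warm-up pass
        if device == "cuda":
            torch.cuda.synchronize()
        t0 = time.perf_counter()
        model(x)
        if device == "cuda":
            torch.cuda.synchronize()
        dt = time.perf_counter() - t0
        print(f"batch={batch:4d}  {batch / dt:8.1f} images/s")
```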

Results:

1) Maximum throughput performance of frameworks/devices at the batch size for maximum performance.

2) Efficiency of frameworks/devices at the batch size for maximum performance.

Object Detection on FPGAs and GPUs by Using Accelerated Deep Learning:
Object detection and recognition procedures were performed on an FPGA to address high power consumption and large computational load problems. Real-time object detection was carried out using both a Movidius USB accelerator and an FPGA. The results show that both the FPGA and the Movidius device can be used successfully for object detection and recognition.

Application-based performance measurement:
Comparison of GPU- and FPGA-based hardware platforms for ultrasonic flaw detection using a support vector machine
A study of an ultrasonic flaw detection algorithm based on the Support Vector Machine (SVM) classifier: the proposed algorithm performs subband decomposition of ultrasonic signals followed by classification with a trained SVM model that uses the subband filter outputs as feature inputs. Target host platforms include an FPGA-based Xilinx ZedBoard, a GPU-based Tegra System-on-Chip (SoC), and a high-performance computing (HPC) server with GPU accelerators.

Ultrasonic non-destructive testing (NDT) applications often comprise computationally demanding algorithms that require dedicated and specialized hardware architectures for real-time operation in the field. In the ultrasonic flaw detection study using SVM, execution time and scalability were lower for the FPGA than for the embedded GPU.
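A minimal sketch of the general approach (subband filtering of the signal, then an SVM on the filter-output energies) is given below using SciPy and scikit-learn; the filter bank, sampling rate, and data are illustrative placeholders, not the pipeline or dataset from the cited study.

```python
# Subband-energy features + SVM classifier sketch (placeholders only; not
# the pipeline or data from the cited ultrasonic NDT study).
import numpy as np
from scipy.signal import butter, sosfilt
from sklearn.svm import SVC

rng = np.random.default_rng(0)
FS = 100e6                                      # sampling rate (placeholder)
BANDS = [(1e6, 3e6), (3e6, 6e6), (6e6, 10e6)]   # subbands (placeholders)

def subband_energies(signal):
    """Energy of the signal in each subband, used as SVM feature inputs."""
    feats = []
    for lo, hi in BANDS:
        sos = butter(4, [lo, hi], btype="bandpass", fs=FS, output="sos")
        feats.append(np.sum(sosfilt(sos, signal) ** 2))
    return feats

# Synthetic "flaw" vs "no flaw" A-scans just to exercise the pipeline.
X = np.array([subband_energies(rng.standard_normal(2048) * (1 + label))
              for label in (0, 1) for _ in range(50)])
y = np.array([label for label in (0, 1) for _ in range(50)])

clf = SVC(kernel="rbf").fit(X, y)
print("training accuracy:", clf.score(X, y))
```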

Comparison of FPGA and GPU implementations of Real-time Stereo Vision:

       The FPGA implementation also uses a custom circuit to back-track in parallel with cost computation for the succeeding line. In contrast, the GPU only makes use of a single thread for backtracking to save some costly memory transfers. 

The study compared the performance and energy of sliding-window applications implemented on FPGAs, GPUs, and multicore devices under a variety of use cases. For most cases the FPGA provided significantly faster performance, except for small input sizes, with speedups of up to 11x and 57x compared to GPUs and multicores, respectively. GPUs provided the best performance when the basic sliding-window functionality could be replaced by frequency-domain algorithms. FPGAs provided the best energy efficiency in almost all situations, and were in some cases orders of magnitude better than the other devices.
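For reference, the snippet below sketches a tiny sliding-window stereo block-matching kernel (sum-of-absolute-differences cost per candidate disparity) in NumPy; it only illustrates the class of computation being accelerated, not the FPGA or GPU implementations that were compared.

```python
# Tiny sliding-window stereo block-matching sketch (SAD cost over a window
# per candidate disparity). Illustrative only; not the compared designs.
import numpy as np

def sad_disparity(left, right, max_disp=8, win=5):
    """Winner-take-all disparity map from window-based SAD costs."""
    h, w = left.shape
    half = win // 2
    costs = np.full((max_disp, h, w), np.inf)
    for d in range(max_disp):
        # per-pixel absolute difference at this candidate disparity
        diff = np.abs(left[:, d:] - right[:, :w - d])
        padded = np.pad(diff, half, mode="edge")
        # sliding-window sum over a win x win neighbourhood
        for i in range(diff.shape[0]):
            for j in range(diff.shape[1]):
                costs[d, i, j + d] = padded[i:i + win, j:j + win].sum()
    return np.argmin(costs, axis=0)    # best disparity per pixel

left = np.random.rand(24, 32)
right = np.roll(left, -3, axis=1)      # synthetic 3-pixel horizontal shift
print(sad_disparity(left, right)[:, 8:-8].mean())   # ~3 away from borders
```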

FPGAs to Face the IoT Revolution:
FPGAs have been rapidly adopted for the acceleration of Deep Neural Networks (DNNs), offering improved latency and energy efficiency compared to CPU- and GPU-based implementations. Different techniques exist for implementing DNNs on FPGAs with high performance and energy efficiency. FPGA design is complex and time-consuming, but the advent of high-level synthesis (HLS) has significantly reduced the design and verification effort. HLS-based design techniques not only improve design productivity but also the ability to implement and explore architectural optimizations, including data quantization, parallelism extraction, pipelining, unrolling, and memory partitioning.
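Of the optimizations listed, data quantization is the easiest to show in a few lines; the sketch below applies simple symmetric 8-bit quantization to a weight tensor as a generic illustration, not a specific HLS flow.

```python
# Simple symmetric 8-bit weight quantization sketch (generic illustration of
# the "data quantization" optimization, not a specific HLS flow).
import numpy as np

def quantize_int8(w):
    """Map float weights to int8 plus a per-tensor scale."""
    scale = np.max(np.abs(w)) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(8, 8).astype(np.float32)
q, s = quantize_int8(w)
print("max abs error:", np.max(np.abs(dequantize(q, s) - w)))
```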
Conclusion
A machine learning algorithm can be implemented using either an FPGA or a GPU. Different studies have compared FPGAs and GPUs across different parameters. In several evaluations the FPGA was found to be more efficient than the GPU, provided the FPGA hardware supplied by the vendor is easy to use. The selection of FPGA versus GPU in machine learning will remain linked to the end-user application, the available budget, the development capacity, and the ease of use of the vendor's hardware.


