COMPARISON OF GPU AND CPU EFFICIENCY WHILE SOLVING HEAT CONDUCTION PROBLEMS

Abstract. An overview of GPU usage in solving different engineering problems, a comparison between CPU and GPU computations, and an overview of the heat conduction problem are provided in this paper. The Jacobi iterative algorithm was implemented using Python, the TensorFlow GPU library, and NVIDIA CUDA technology. Numerical experiments were conducted with 6 CPUs and 4 GPUs. The fastest GPU used completed the calculations 19 times faster than the slowest CPU. On average, the GPU was 9 to 11 times faster than the CPU. A significant relative speed-up in GPU calculations starts when the matrix contains at least 400² floating-point numbers.


Introduction
The first generation of Graphics Processing Units (GPUs) was created at the end of the 20th century to fulfil the demands of computer games. Starting from shadowing algorithms, such as Shadow Mapping (Williams, 1978) and Shadow Volume (Crow, 1977), devices and coding possibilities became more and more sophisticated, giving birth to the second generation of GPUs with shaders, small programs consisting of 20 lines of GPU assembler code. Loops, branching, and other constructs made their way into GPU programming with the understanding that the GPU can be used not only for game graphics but also as a powerful calculation tool that reduces the execution time of computationally intensive applications.
General-Purpose computing on Graphics Processing Units (GPGPU) is now supported by many platforms. The GPU manufacturers NVIDIA and AMD provide the necessary functions and libraries to enable GPU calculations. These calculations can be performed on a GPU only if the problem can be split into smaller parts that can be solved concurrently. It is also important to mention that CPUs are usually more efficient than GPUs when the data size is not big enough to use all GPU cores effectively. The primary task of this paper is to compare GPU and CPU calculation efficiency while solving heat conduction problems with different amounts of data.

Prior and related works
GPUs are widely used in Machine Learning because they allow models to be trained in parallel. Kuckuk and Köstler (2018) used GPUs to model the shallow water equations, which made it possible to compute large, time-consuming systems on the Piz Daint supercomputer. Filonenko et al. (2018) applied GPUs to detect fumes from a real-time camera. Lu et al. (2019) used GPUs to serve a medical Drug-Drug Interaction (DDI) system, which collects information from a PubMed database of 150,000 publications. Warrena et al. (2019) enhanced Finite-Difference Time-Domain (FDTD) electromagnetic modelling. Fambrini et al. (2018) used GPU calculations to optimize the JSEG algorithm. Bohacek et al. (2019) used CUDA to solve the inverse heat conduction problem; they suggested three solutions and compared them with the classical OpenFOAM (FDIC) and ANSYS Fluent (AMG) solvers. The GPU solution proved to be the best one and increased the calculation speed by up to 15 times.

Heat conduction problem
The heat conduction problem arises when a body is not heated uniformly. The heat equation allows us to find the temperature at each point of the observed body at a specified point in time. Based on the dimensions of the body, the problem has 3 types:
- One-dimensional, where only the x coordinate of a uniform rod and time are used: T = f(x, t);
- Two-dimensional, where the heated object is planar and the x and y coordinates are used accordingly: T = f(x, y, t) (Figure 1);
- Three-dimensional, where spatial bodies are heated: T = f(x, y, z, t).
The heat conduction problem also has 2 more types:
- Stationary, where the temperature does not depend on time (the thermal field does not evolve over time); the main purpose of such problems is to find the temperature at each point of the body;
- Non-stationary, where the temperature is not constant over time; the task is to determine how the temperature changes at each point of the body.
In this paper, the stationary two-dimensional problem is solved.

Heat equation
The following equation is named after Poisson and is widely used in physics to calculate different potential fields, for example, electric or pressure fields:

    ∂²u/∂x² + ∂²u/∂y² = f(x, y),                 (1)
    u(x, y) = μ(x, y), (x, y) ∈ γ,               (2)

where u(x, y) is the temperature at the point (x, y), γ marks the boundary, μ(x, y) is the temperature at the boundary point (x, y), and f(x, y) defines a heat source. The equation is solved by the finite-difference method. To apply the finite-difference approximation, a uniform discrete grid has been chosen for this problem:

    x_i = i·h, y_j = j·h, i, j = 0, 1, …, N,

where h is the grid step. A discrete solution U_ij = U(x_i, y_j) needs to be found. The temperature at the boundary grid points is given by equation (2). In order to calculate the temperature at the inner points, the differential equation (1) at each point is replaced by an algebraic equation. This is achieved by approximating the derivatives with finite differences, which are calculated using a three-point stencil in the vertical and horizontal directions (Figure 2).
Figure 2. Discrete grid and scheme stencil

Thus, the system of linear equations is the following:

    (U_{i-1,j} + U_{i+1,j} + U_{i,j-1} + U_{i,j+1} - 4·U_{i,j}) / h² = f(x_i, y_j),  i, j = 1, …, N - 1.   (3)

The system is made up of (N - 1)² equations.
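Each equation of system (3) follows from the standard three-point approximations of the second derivatives in (1), written out here for clarity:

    ∂²u/∂x² ≈ (U_{i-1,j} - 2·U_{i,j} + U_{i+1,j}) / h²,
    ∂²u/∂y² ≈ (U_{i,j-1} - 2·U_{i,j} + U_{i,j+1}) / h².

Summing the two approximations and equating the result to f(x_i, y_j) gives one equation of (3) for each inner grid point.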

Jacobi method
Eventually, to solve the heat conduction problem, a system of linear equations needs to be solved. This can be done in many ways, but the Jacobi method has been selected for this paper. The Jacobi method is an iterative algorithm for determining the solution of a diagonally dominant system of linear equations. Each element is calculated approximately by using this equation:

    U_{i,j}^(k+1) = (U_{i-1,j}^(k) + U_{i+1,j}^(k) + U_{i,j-1}^(k) + U_{i,j+1}^(k) - h²·f(x_i, y_j)) / 4.   (4)

The process is iterated until it converges. The Jacobi method converges more slowly than, for example, Krylov or Gauss-Seidel methods (Amador & Gomes, 2012). On the other hand, a big advantage of this algorithm is its suitability for concurrent calculations (Margaris et al., 2014), which makes it an effective option for GPU calculations. Although the algorithm was created by Jacobi in the 19th century (Jacobi, 2009), it began to be used widely only about a hundred years later, when computers were invented.
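A minimal NumPy sketch of this element-wise update (the array layout, the fixed-boundary handling, and the stopping rule are illustrative assumptions, not the authors' exact code):

import numpy as np

def jacobi_step(u, f, h):
    # Equation (4): every interior point becomes the average of its four
    # neighbours minus h**2 * f at that point; boundary values stay fixed.
    u_new = u.copy()
    u_new[1:-1, 1:-1] = 0.25 * (u[:-2, 1:-1] + u[2:, 1:-1] +
                                u[1:-1, :-2] + u[1:-1, 2:] -
                                h ** 2 * f[1:-1, 1:-1])
    return u_new

def solve(u0, f, h, eps):
    # Iterate until the largest change between two iterations drops below eps.
    u = u0
    while True:
        u_new = jacobi_step(u, f, h)
        if np.max(np.abs(u_new - u)) < eps:
            return u_new
        u = u_new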

Experiments
The code has been written in Python, using the NumPy and TensorFlow GPU libraries. Mainly NVIDIA GPUs were used. To run TensorFlow GPU on an NVIDIA device, the Compute Capability of the processing unit must be 3.0 or higher, and CUDA and cuDNN (the NVIDIA CUDA Deep Neural Network library) have to be installed.
One of the advantages of TensorFlow GPU is that the code is suitable not only for NVIDIA but also for AMD devices: if the TensorFlow ROCm library is installed instead of TensorFlow GPU, the same code can be run without any changes. The same applies to the CPU and GPU: the selected device is passed to the function as a parameter, and no code changes are needed. Two experiments were conducted: a heat conduction problem with and without a heat source.
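A minimal sketch of this device-selection pattern, assuming the TensorFlow 2.x API (the paper does not state the exact TensorFlow version; the device names '/CPU:0' and '/GPU:0' and the timing wrapper are illustrative):

import time
import tensorflow as tf

# CUDA and cuDNN must be installed for NVIDIA GPUs to appear in this list.
print(tf.config.list_physical_devices())

def run_experiment(device, n=1000):
    # The selected device is passed as a parameter; the computation itself
    # is identical for the CPU and the GPU.
    with tf.device(device):
        start = time.perf_counter()
        u = tf.zeros((n, n))
        # ... Jacobi iterations would run here ...
        return time.perf_counter() - start

cpu_time = run_experiment('/CPU:0')
gpu_time = run_experiment('/GPU:0')  # requires a visible GPU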

Equipment
The experiments were conducted with 6 CPU and 4 GPU devices; the individual devices are listed in Tables 2 and 3. In order to get the most out of vectorized computation on both the CPU and the GPU, the matrix form of the computation was chosen. The single-equation form (4) can be rewritten in matrix form by "extracting" 4 submatrices from u (L: without the left column, R: without the right column, T: without the top row, B: without the bottom row). The update then becomes

    u_new = (L + R + T + B - f·h²) / 4.

Note that TensorFlow does not recompute f·h² on each iteration, due to the constant-propagation technique.
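A minimal sketch of this matrix-form update in TensorFlow (assuming the TensorFlow 2.x API; the exact slicing and the reassembly of the boundary are illustrative assumptions, not the authors' code):

import tensorflow as tf

@tf.function
def jacobi_matrix_step(u, f, h):
    # One vectorized Jacobi step: the interior becomes
    # (L + R + T + B - f * h**2) / 4, the boundary values are kept.
    left = u[1:-1, :-2]     # left neighbours of the interior points
    right = u[1:-1, 2:]     # right neighbours
    top = u[:-2, 1:-1]      # neighbours above
    bottom = u[2:, 1:-1]    # neighbours below
    interior = (left + right + top + bottom - f[1:-1, 1:-1] * h ** 2) / 4.0
    # Reassemble the full grid with the original boundary rows and columns.
    middle = tf.concat([u[1:-1, :1], interior, u[1:-1, -1:]], axis=1)
    return tf.concat([u[:1, :], middle, u[-1:, :]], axis=0)

Because f * h ** 2 does not depend on u, it can be computed once and reused across iterations, which matches the constant-propagation behaviour noted above.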

First experiment: heat conduction problem without a heat source
First experiment parameters:
1. N, where N² is the grid size: N = 100i, where i = [1; 10];
2. Boundary condition: µ(x, y) ≡ 1;
3. Without a heat source: f(x, y) ≡ 0;
4. Accuracy of the calculations: ε = 2·10⁻⁵;
5. Device: CPU or GPU.

Table 1 and Figure 3 show that if N < 400, the CPU solves this problem faster than the GPU, since there is no need to transfer data to the device; the CPU is up to 2 times faster, but the difference amounts to only a few seconds. The larger N is, the faster the GPU solves this problem compared to the CPU. The maximum GPU speed-up for N = [100; 1000] is 6.4 times. Figure 4 compares the CPU and GPU calculation speed when N is between 500 and 1000.

The first experiment with N = 1000 was run on several different devices. Even the slowest GPU, an NVIDIA GTX 860M, solves the problem 1.3 times faster than the fastest CPU, an Intel i7-6700K. The fastest GPU solves the problem 19.6 times faster than the slowest CPU. On average, the GPU devices solve this problem 5.9 times faster than the CPU devices. Tables 2 and 3 show the average calculation time for the CPU and GPU devices when N = 1000.

If the grid size is further increased to N = 5000, the problem is solved in ~3 hours on the CPU and in ~16 minutes on the GPU; thus, the GPU is 11 times faster than the CPU at this grid size (Table 4 and Figure 5).
The maximum calculation speed increase for the [1000; 10000] range is 11.7 times. With N = 10000, the CPU needed ~12 hours and the GPU ~1 hour to solve the problem (Figure 6).
The NVIDIA GTX 1060 belongs to a gaming series. The calculations would be faster with a specialized unit from the Tesla or Quadro series, for example, a Tesla M40.

Second experiment: heat conduction problem with a heat source
Second experiment parameters:
- N, where N² is the grid size: N = 100i, where i = [1; 10];
- Boundary condition: µ(x, y) ≡ 0;
- With a heat source: f(x, y) = -exp(-10((x - 0.5)² + (y - 0.5)²)), evaluated on the grid in the sketch below;
- Accuracy of the calculations: ε = 10⁻⁷;
- Device: CPU or GPU.
The second experiment is very similar to the first one, but a heat source function is now added, which increases the number of calculations per step. The accuracy threshold ε is also smaller, so the calculations run longer and the comparison can be made more precisely.
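For illustration, the heat source of the second experiment can be evaluated on the grid with NumPy as sketched below (the unit square domain and the grid resolution are assumptions; the paper only specifies the source centred at (0.5, 0.5)):

import numpy as np

N = 1000
x = np.linspace(0.0, 1.0, N + 1)   # assumed unit square domain, step h = 1/N
y = np.linspace(0.0, 1.0, N + 1)
X, Y = np.meshgrid(x, y, indexing='ij')

# Heat source of the second experiment, centred at (0.5, 0.5).
f = -np.exp(-10.0 * ((X - 0.5) ** 2 + (Y - 0.5) ** 2))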
Table 5 and Figure 7 show that the CPU is faster than the GPU only for N < 300. The larger number of operations per step means that the calculation time is longer for both the CPU and the GPU. For example, when N = 100, the second experiment runs ~1.5 times longer on both devices compared to the first experiment.
The ratio of CPU to GPU calculation time is very similar in the first and second experiments. For example, for N = 1000 the calculations on the NVIDIA GTX 1060 are 6.3 times faster than on the Intel i7-7700, even though they run 8.5 times longer than in the first experiment (Figure 8).

Results comparison
The results of these experiments are similar to those of Bohacek et al. (2019): for small matrices, the CPU was about 2 times faster than the GPU. They also mention that for small matrices, commercial packages such as ANSYS FLUENT 14.5 or OpenFOAM can perform the calculations 40 to 50 times faster than simple CPU code. Since these packages support parallelism, they are also well suited for bigger grids. We compared GPU and CPU calculation speed without using such specialized packages and achieved an average tenfold increase in calculation speed for the same code with different amounts of data and different processing units.

Conclusions
The more data has to be processed, the more efficient GPU calculations become compared to CPU calculations. After increasing the size of the grid, the calculations on the NVIDIA GTX 1060 ran up to 11.7 times faster than on the Intel i7-7700. On the other hand, GPU calculations involve additional operations with the data, such as uploading it to GPU memory before the calculations and moving the results back afterwards. When a heat source function was used, GPU calculations became more efficient than CPU calculations once the number of matrix elements reached 300², compared to 400² without the heat source. With a small amount of data, CPU calculations might be a better option, because a CPU usually consists of several cores and also supports parallelism.