A THREE-LEVEL PARALLELISATION SCHEME AND APPLICATION TO THE NELDER-MEAD ALGORITHM

We consider a three-level parallelisation scheme. The second and third levels define a classical two-level parallelisation scheme, and a load balancing algorithm is used to distribute tasks among processes. It is well known that for many applications the efficiency of parallel algorithms of the second and third levels starts to drop after some critical parallelisation degree is reached. This weakness of the two-level template is addressed by introducing one additional parallelisation level. On this level, new or modified algorithms are considered as alternatives to the basic solver. The idea of the proposed methodology is to increase the parallelisation degree by using algorithms that are less efficient than the basic solver. As an example we investigate two modified Nelder-Mead methods. For the selected application, a few partial differential equations are solved numerically on the second level, and on the third level the parallel Wang's algorithm is used to solve systems of linear equations with tridiagonal matrices. A greedy workload balancing heuristic is proposed, which is oriented to the case of a large number of available processors. The complexity estimates of the computational tasks are model-based, i.e. they use empirical computational data.


Introduction
Current trends in supercomputing show that, in order to accumulate high computing power, computers with more, but not faster, processors are used. This trend induces changes in the development of parallel algorithms. The important challenge is to develop parallelisation techniques which enable exploitation of substantially more computational resources than the standard existing methods.
This paper deals with problems that can be split into a collection of independent subproblems, where this splitting step is repeated iteratively. The solutions of the subproblems define the solution of the initial problem. Thus, an additional splitting step increases the potential parallelisation degree of a parallel algorithm.
Any multi-level parallelisation can be considered as a way to generate a pool of tasks. After the pool of tasks is obtained, it is not important how many parallelisation levels were used. However, such a final simplification of the template often leads to a loss of important information and, as a consequence, to degraded efficiency of the parallel algorithm. This is especially true if different levels of the scheme are characterised by different properties of an algorithm that should be properly addressed.
In this paper, we consider a special case of three-level parallelisation. The template of this approach is given in Fig. 1:
• At the first level of parallelisation we assume that there exist a few parallel alternatives $A_j$ (see Figure 1) to the original modelling algorithm. The first level of parallelisation becomes a part of a new parallel algorithm, and the degree of the first-level parallelism can be selected dynamically during the computations, i.e. a selection of the best algorithm is performed. In this paper, as an example, we consider two new parallel modifications of the Nelder-Mead method (Nelder and Mead, 1965).
• On the second level, a set of computational tasks $V^j = \{v^j_1, v^j_2, \ldots, v^j_{M_j}\}$ (see Figure 1) with different computational complexities is defined. These tasks are solved in parallel. As an example we investigate the case when the computation of one value of the objective function requires the numerical solution of $M$ partial differential equations. The computational complexities of the tasks are not equal because different discretisation steps must be used for different equations in order to achieve the same accuracy for each equation.
• The third level defines parallel algorithms to solve tasks from the second level. As an example we use Wang's algorithm to parallelise the solution of systems of linear equations with tridiagonal matrices (Wang, 1981).
The second and the third levels define a well-investigated two-level parallelisation template. We note that load balancing techniques for two-level parallelisation are widely used in applications, see, e.g., (Ciegis and Baravykaite, 2007), (Huismann et al., 2015).
The scheduling problem can be formulated by representing a parallel algorithm as a directed acyclic graph (DAG). The vertices define computational tasks, and the edges define connections and ordering among tasks. Then a set of partially ordered computational tasks is scheduled onto a multiprocessor system to minimise the computational time (or to optimise some other performance criterion). It is well known that the scheduling problem is NP-complete. Many interesting heuristics have been proposed to solve it; we mention greedy algorithms (Čiegis and Šilko, 2002), genetic algorithms (Sharma and Kaur, 2015), (Singh, 2014), simulated annealing and tabu search algorithms (Kirkpatrick et al., 1983), (Glover, 1989), (Glover, 1990). Such algorithms include the possibility of dynamic scheduling, allow tasks to arrive continuously, and can consider computational resources that vary in time.
A scheduling task can be very challenging due to the specificity of a given application problem and the necessity to parallelise it on modern parallel architectures.
As an example we mention particle simulation, which is solved by appropriate domain decomposition techniques (Furuichi and Nishiura, 2017). Another example is dynamic load balancing on heterogeneous clusters for parallel ant colony optimisation (Llanes et al., 2016). The recent work (Datta et al., 2019) concentrates on the problem of high dimensionality of the data when solving the subspace clustering problem.
In this article we focus on the scheduling problem in which all tasks in the set are independent and can be solved in parallel. It is well known that the given optimisation problem can be redefined as the problem of equalising the computational times of all processes. The simplest load balancing algorithm is based on the assumption that the computation time is proportional to the sizes of sub-tasks. Then the domain decomposition algorithm is applied to guarantee that the sizes of subtasks scheduled for each group of processors are equal (Ciegis and Baravykaite, 2007).
Quasi-optimal distributions of tasks can be obtained by using the greedy strategy to distribute the work on demand, i.e. by applying dynamic load balancing techniques such as work-stealing (Imam and Sarkar, 2015) or self-organising process rescheduling (Righi et al., 2018).
However, the efficiency of the two-level approach is limited due to a typical saturation of the speed-up of parallel algorithms for increased numbers of processors and fixed sizes of tasks.
Exactly this situation has motivated us to introduce an additional level of the parallelisation template. In most cases its usage leads to a less efficient algorithm than the initial state-of-the-art algorithm. But the additional degree of parallelism on the second level gives a large overall speed-up if the number of available resources is large.
Recent developments of new architectures of parallel processors make the task of building accurate theoretical performance models even more challenging. The empirical data shows that for some advanced algorithms the efficiency of parallel computations can depend non-monotonically on the size of a task. Thus the model-based load balancing method is becoming the main tool in developing efficient and accurate task scheduling algorithms. In our work we build the model for prediction of computation time empirically, by solving specialised benchmarks for a wide range of problem sizes and numbers of processors. In fact this analysis resembles the classical experimental strong scalability analysis of a given parallel algorithm. We note that these measurements are always done with all processes working simultaneously in order to reflect their actual performance during the execution of real applications (see also (Lastovetsky and Manumachu, 2017; Lastovetsky et al., 2017)).
Here we mention two interesting papers, where the model-based task scheduling algorithms are considered.
The work (Lastovetsky et al., 2017) concentrates on Xeon Phi multicore co-processors, where empirical computation time curves are used to find optimal parameters for a workload distribution. The obtained model predicts a non-monotonic dependence of computation speed on the sizes of problems. The authors call their approach "load imbalancing"; however, it can be considered as an advanced balancing which adapts the scheduling algorithm to the specificity of Xeon Phi processors. Obviously, in this case the assumption that computation time is proportional to the task size is not valid. In a similar research (Lastovetsky and Manumachu, 2017), computations were performed on a non-uniform memory access (NUMA) parallel platform with various shared on-chip resources such as the Last Level Cache. Again the model-based approach makes it possible to take into account the specific properties of the algorithm and processors. Matrix multiplication and the Fast Fourier Transform are used as benchmark problems. It is interesting to note that, according to the presented results, the globally optimal solutions may not load-balance the sizes of sub-tasks. The authors pay special attention to the energy efficiency of calculations. We note that there are some papers that are specifically dedicated to load balancing for energy efficiency (Perez et al., 2017). In our work we also formulate some restrictions that are connected to energy efficiency: we do not use additional available computational resources if the parallelisation efficiency drops below some specific level.
In this paper we propose a general methodology for parallelisation of algorithms. As an example we use it to solve some applied optimisation problems. The superiority of the three-level parallelisation scheme is shown by comparing it with the two-level parallelisation scheme. On the second level a set of different-size tasks is defined, which is a typical situation for the computation of one value of a black box objective function. In most cases these tasks (or groups of tasks) are independent but computationally costly. Thus each task should also be solved in parallel. This fact leads to the necessity of the third level. The second and third levels of the template define a set of tasks solved in parallel, and some load balancing algorithm should be used to take into account the different sizes of subtasks. The necessity of the additional first level comes from the assumption of having more computational resources than can be utilised by the two-level parallelisation approach. It is a consequence of the efficiency saturation for parallel algorithms when the size of the problem is fixed and the number of processes is increased. We select a different optimisation method (or a modification of the basic solver) which gives additional degrees of parallelisation, thus making it possible to use more processors. At the first level of the template the optimal algorithm is selected. This part requires finding a compromise between the increased parallelisation degree and the decreased convergence rate of the modified parallel optimisation algorithm.
In this work we are also interested in addressing some green computing (GC) challenges. In a broader sense, GC is the set of practices and procedures for designing, manufacturing and using computing resources in an environmentally friendly way while maintaining overall computing performance, and finally disposing of them in a way that reduces their environmental impact (Saha, 2018). Research in green computing is done in many areas (Nemalikanti et al., 2011): Energy Consumption; E-Waste Recycling; Data Center Consolidation and Optimization; Virtualization; IT Products and Eco-labeling. One of the approaches to optimisation of energy consumption on the software level is autotuning software, which is able to optimise its own execution parameters with respect to a specific objective function (usually, the execution time) (Carretero et al., 2015). Well-known examples of autotuning software are: FFTW (Frigo and Johnson, 2005) (fast Fourier transformations); ATLAS (Whaley and Dongarra, 1998), PHiPAC (Bilmes et al., 1997) (dense matrix computations); OSKI (Vuduc et al., 2005), SPARSITY (Im and Yelick, 2001) (sparse matrix computations).
Usually, the goal of any autotuning software is to achieve the same result with the same resources while reducing the computation time; in terms of parallelisation this means increasing the parallelisation efficiency. Another way to decrease the power consumption is to increase the efficiency by avoiding inefficient calculations; this may slightly increase the execution time, but it gives a reasonable increase of parallel efficiency, which leads to energy savings. We propose to control the efficiency of the parallel algorithm at the load balancing stage of the parallelisation template. In many cases this strategy reduces the amount of computational resources used in computations. This analysis is done a priori, meaning that the user knows how many cores should be used for solving a specific parallel task even before starting real computations.
This paper makes the following contributions:
1. We propose to extend the typical two-level parallelisation, which is usually accompanied by some load balancing technique, by adding one additional level. Also, we investigate the possibility of limiting the number of used processors to sustain the parallelisation efficiency at the selected level. This approach lets us avoid inefficient calculations, supporting the green computing technology.
As an example, two different families of parallel Nelder-Mead methods were investigated: the family of the generalised parallel Nelder-Mead method (Lee and Wiswall, 2007) and the parallel versions of the classical Nelder-Mead method. In order to perform the load balancing on the second and third levels of the proposed template, we use a complexity model of tasks which is based on computational data (also known as model-based), as is done in recent state-of-the-art works (Lastovetsky et al., 2017), (Lastovetsky and Manumachu, 2017), (Rico-Gallego et al., 2017). We demonstrate the large potential of this new technique.
2. A parallel version of the Nelder-Mead method is proposed, which does not change the convergence properties of the sequential optimisation algorithm. We note that there were some attempts to parallelise this optimisation method before (Lee and Wiswall, 2007), (Klein and Neira, 2014). However, in these papers the convergence properties are changed and these changes are not studied comprehensively enough. Moreover, it is questionable whether these parallel algorithms are applicable in the case of small-dimension problems.
Our parallel algorithm increases the parallelisation degree by up to a factor of three. However, the introduced changes do not affect the convergence of the sequential optimisation algorithm. An experimental comparison of this new parallelisation algorithm with the state-of-the-art technique (Lee and Wiswall, 2007) is provided. The obtained experimental results show that in the case of the Rosenbrock function the convergence properties of the parallel algorithm (Lee and Wiswall, 2007) are much worse than those of the classical sequential Nelder-Mead algorithm.
The rest of this paper is organised as follows. In Section 2 the workload balancing problem is formulated, the selection of the optimal algorithm is described, and a general strategy for workload distribution is presented along with the efficient workload distribution algorithm. In Section 3 a detailed description of the three parallelisation levels is given for the studied case. We consider the approximation of boundary conditions of the Schrödinger equation. The modified Nelder-Mead method is used to solve local optimisation problems on the first level, on the second level a set of partial differential equations is solved numerically, and on the third level Wang's algorithm is used to solve systems of linear equations in parallel.
In Section 4 the results of computational experiments are provided and the efficiency of the proposed three-level parallelisation template is analysed. In Section 5 a comparison of different Nelder-Mead parallelisation methods is presented. The final conclusions are drawn in Section 6.

Workload balancing problem
In this section we formulate the workload balancing problem for the two-level parallelisation. We also present a greedy scheduling algorithm to distribute the processes among tasks. Next, we introduce the additional level: the first and second levels of the two-level parallelisation technique become the second and the third levels, respectively, and the first level is a new parallelisation level. On the first level the selection of the optimal algorithm is performed.
First, we present the two-level parallelisation template. Assume that we solve a given problem by using the basic method $A$. The solution process consists of $K$ blocks of tasks (a simple DAG) and all blocks must be solved sequentially one after another. Each block consists of $M$ tasks
$$V_k = \{v_1(X_k), v_2(X_k), \ldots, v_M(X_k)\}, \quad k = 1, \ldots, K, \qquad (1)$$
where $X_k$ defines a set of parameters for the block $V_k$. $V_k$ defines the first level of the two-level parallelisation scheme. Each task $v_m$ can be solved by a parallel algorithm; this is the second level of the scheme. The complexities of the tasks $v_m$ are different; however, they are known in advance and do not depend on $k$. For each task $v_m$ a prediction of the computation time $t_m(p)$, $p \le P$, $m = 1, \ldots, M$, is given. It is based on the modelling results, and $P$ is the number of processors in the parallel system. We assume that up to $P_m$ processes the computation time monotonically decreases:
$$t_m(p+1) < t_m(p), \quad 1 \le p < P_m. \qquad (2)$$
For $P_m$ the predicted computation time function $t_m(p)$ reaches its minimum value:
$$t_m(P_m) = \min_{1 \le p \le P} t_m(p). \qquad (3)$$
Such a model of the computation time $t_m(p)$ is important for algorithms with limited scalability such as Wang's algorithm. In Fig. 2 we present speed-ups of this algorithm for different sizes of linear systems. It is important to mention that the provided results include some additional costs for the computation of the objective function along with the computational costs of Wang's algorithm. These additional calculations slightly increase the overall parallelisation scalability; thus the provided figure represents an optimistic scenario for the general Wang's algorithm and a realistic scenario for the actual computations that were done in this paper.
In our specific case this data was derived from a simple benchmark implementing Wang's algorithm. This benchmark performs computations using different numbers of processes and different problem complexity parameters $J$. It is important to note that the nodes were artificially loaded with calculations to imitate the real situation. For example, with the number of processes $p = 4$ there were 32 tasks that were solved by 128 processes at the same time. Thus this benchmark must be run once, using all available processes.
From Figure 2 it follows that the computation time monotonically decreases up to some critical number of processes, and therefore the efficient usage of processes is limited to this number. Even for large systems, when the number of equations is $J = 16000$, the maximum number of processes $P_m$ does not exceed 80. This analysis justifies our motivation to use the multi-level approach in order to solve the given applied problem. In the two-level parallelisation scheme, for each block of tasks $V_k$ we select the numbers of processes such that the overall solution time is minimised:
$$\min_{(p_1, \ldots, p_M) \in Q} \; \max_{1 \le m \le M} t_m(p_m),$$
where the set of feasible processor distributions $Q$ is defined as
$$Q = \Big\{ (p_1, \ldots, p_M) : \; p_m \ge 1, \; \sum_{m=1}^{M} p_m \le P \Big\}.$$
Remark 1. In the case when we solve only a few large tasks, the remaining tasks are much smaller, and the number of processes $P$ is not very big, the optimal scheduling is obtained when a few smaller tasks are combined into one group $v_m$. Then the sub-task $v_m$ consists of tasks $v_{l_1}, \ldots, v_{l_n}$. The computation time for this combined task is predicted by the model:
$$t_m(p_m) = \sum_{i=1}^{n} t_{l_i}(\tilde p_{l_i}), \quad \tilde p_{l_i} = \min(p_m, P_{l_i}).$$
In this work we are interested in solving the scheduling problem when the number of processes is large, so the aggregation step is not used.
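For very small instances the min-max formulation above can be checked by exhaustive search. The following Python sketch is only an illustration and is not part of the original method: the function name `exhaustive_minimax` and the representation of each model $t_m(p)$ as a list `models[m][p-1]` of tabulated times are assumptions introduced here.

```python
from itertools import product


def exhaustive_minimax(models, total):
    """Search all feasible distributions for the smallest maximal task time.

    models[m][p - 1] is the (hypothetical, tabulated) predicted time t_m(p);
    total is the number of available processes P.
    Returns (best_time, best_procs), or (None, None) if nothing is feasible.
    """
    best_time, best_procs = None, None
    ranges = [range(1, len(times) + 1) for times in models]
    for procs in product(*ranges):
        # Feasibility: at least one process per task, at most P in total.
        if sum(procs) > total:
            continue
        # The block is finished when its slowest task is finished.
        time = max(times[p - 1] for times, p in zip(models, procs))
        if best_time is None or time < best_time:
            best_time, best_procs = time, procs
    return best_time, best_procs
```

The cost of this search grows exponentially with the number of tasks, which is why a heuristic is needed for realistic numbers of tasks and processes.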
Next, we propose a simple greedy partitioning algorithm, which is described in Algorithm 1. It aims to find a near-optimal distribution of $M$ tasks of different sizes among $P$ homogeneous processes by using the model-based complexity estimates $t_m(p)$ (similar ideas are also used in (Lastovetsky and Manumachu, 2017)). We assume that $P \ge M$. An interesting feature of the presented algorithm is that, for a given number of processes $P$, the number of active processes can be taken smaller than $P$ in order to minimise the overall execution time of the parallel algorithm.
The algorithm starts from the initial distribution in which one process is assigned to each task, and the predictions of parallel execution times are calculated using the selected performance model. Then, a greedy iterative procedure is applied to distribute the remaining processes. At each iteration, one additional process is assigned to the task which has the largest predicted computation time. Then its parallel execution time is updated. Iterations are repeated until all processes are distributed or the number of processes for some task reaches the limit $P_m$.
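Algorithm 1. The algorithm for distribution of $P$ processes between $M$ tasks
set $p_m = 1$ and compute $t_m(p_m)$ for $m = 1, \ldots, M$
set $P_{free} = P - M$
while $P_{free} > 0$ do
    find the task $k$ with the largest predicted time $t_k(p_k)$
    if $p_k < P_k$ then
        $p_k = p_k + 1$, $P_{free} = P_{free} - 1$, update $t_k(p_k)$
    else
        stop
    end if
end while
Note that before $t_m(p)$ reaches its minimum, its value starts to decrease slowly, and thus the parallelisation efficiency drops. Therefore, it may be wise to restrict the number of processes by taking into account the efficiency value.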
We define the maximum number of processes $\tilde P_m$ for which the efficiency condition is still satisfied:
$$\tilde P_m = \max \Big\{ p : \; E_m(p) = \frac{t_m(1)}{p \, t_m(p)} \ge E_{min} \Big\}, \qquad (4)$$
where $E_{min} \in [0, 1]$ is a given efficiency lower bound. Estimate (4) is used to modify the limit (3) of the maximum number of processes that can be used to solve the $m$-th task:
$$P_m = \min (P_m, \tilde P_m). \qquad (5)$$
Therefore, in the presented technique $P_m$ includes two restrictions:
• The number of processes cannot exceed the number after which the speed-up drops (see Fig. 2).
• The number of processes is limited by the efficiency requirement (4), which states that the number of processes per block of tasks $V_k$ is not allowed to be increased if the efficiency of the parallel algorithm on the third level reaches the critical value $E_{min}$.
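The two restrictions and the greedy distribution described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the names `process_limit` and `greedy_distribution` are hypothetical, each model $t_m(p)$ is assumed to be given as a list of tabulated times, and the efficiency is assumed to be $t_m(1) / (p \, t_m(p))$.

```python
def process_limit(times, e_min=0.0):
    """Limit on the number of processes for one task.

    times[p - 1] is the tabulated prediction t_m(p) (an assumed format).
    """
    counts = range(1, len(times) + 1)
    # Restriction 1: the smallest p at which the predicted time is minimal.
    p_best = min(counts, key=lambda p: times[p - 1])
    # Restriction 2: the largest p whose efficiency is at least e_min.
    p_eff = max(p for p in counts
                if times[0] / (p * times[p - 1]) >= e_min)
    return min(p_best, p_eff)


def greedy_distribution(models, total, e_min=0.0):
    """Distribute `total` processes among tasks, one process at a time."""
    if total < len(models):
        raise ValueError("at least one process per task is required")
    limits = [process_limit(times, e_min) for times in models]
    procs = [1] * len(models)          # initial distribution
    free = total - len(models)         # processes not yet assigned
    while free > 0:
        # Task with the largest predicted time under the current distribution.
        k = max(range(len(models)), key=lambda m: models[m][procs[m] - 1])
        if procs[k] >= limits[k]:
            break                      # the slowest task cannot be sped up
        procs[k] += 1
        free -= 1
    return procs
```

With `e_min = 0` only the first restriction is active; increasing `e_min` can leave some processes unused, which corresponds to the energy-saving behaviour discussed in the text.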
In fact the second level of the two-level scheme can be used alone; however, it is limited due to Amdahl's law (Amdahl, 1967), i.e. the efficiency begins to drop as the number of processes increases for a fixed problem size. The two-level approach lets us solve this issue up to some point.
Exactly this situation has motivated us to introduce an additional level of parallelisation template.
In the new three-level parallelisation scheme, the second and third levels represent the two-level scheme described before. Additionally, we add a new first level of the template. We assume that there exist parallel alternative algorithms $A_j$. Each block $V^j_k$ consists of $M_j$ independent tasks
$$V^j_k = \{v^j_1, v^j_2, \ldots, v^j_{M_j}\}.$$
The numbers of blocks of tasks $K_j$, the numbers of tasks per block $M_j$, and the sizes of tasks $|v^j_m|$ may be different for different $j$.
Next, we select the optimal algorithm according to the number of resources available. We denote by $T_P(A_j) = T_P(V^j) K_j$ the total solution time for algorithm $A_j$. The block of tasks $V^j$ is solved by using the heuristic proposed above. Then the optimal algorithm is defined as
$$A_{opt} = \arg\min_{j} T_P(A_j).$$
The usage of $j > 1$ may lead to a less efficient algorithm than the initial basic algorithm. But the additional degree of parallelism gives a large overall speed-up.
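The selection step can be illustrated by a short Python sketch. The function name `select_algorithm` and the dictionary inputs are hypothetical; the block times $T_P(V^j)$ are assumed to have been predicted beforehand by the load balancing heuristic, and the numbers of blocks $K_j$ are assumed to be known or estimated.

```python
def select_algorithm(block_times, num_blocks):
    """Pick the alternative with the smallest total predicted time.

    block_times[j] is the predicted time T_P(V^j) of one block of tasks,
    num_blocks[j] is the number of blocks K_j (both are assumed inputs).
    Returns the index of the selected algorithm and all total times.
    """
    totals = {j: block_times[j] * num_blocks[j] for j in block_times}
    best = min(totals, key=totals.get)
    return best, totals
```

In this sketch a modified algorithm is preferred only when the shorter time per block outweighs the larger number of blocks it needs; when few processes are available and the time per block hardly improves, the basic algorithm with the smallest $K_j$ is selected.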

Application of the three-level parallelisation scheme
First, we briefly present the problem which is used to test our methodology.We solve an initial-boundary value Schrödinger problem formulated in a finite space domain (Bugajev et al., 2017): where operators L l , L r define the nonlocal/transparent boundary conditions.
Let $\omega_h$ and $\omega_\tau$ be discrete uniform grids with space and time steps $h$, $\tau$: Let $U^n_j$ be a numerical approximation of the exact solution $u^n_j = u(x_j, t^n)$ at the grid points $(x_j, t^n) \in \omega_h \times \omega_\tau$. For functions defined on the grid we introduce the forward and backward difference quotients with respect to $x$ and similarly the backward difference quotient and the averaging operator with respect to $t$. We approximate the differential equation (6) by the Crank-Nicolson finite difference scheme (Radziunas et al., 2014). A very interesting approach to constructing approximate local artificial boundary conditions is based on the approximation of the transparent boundary condition by rational functions. The discrete boundary conditions can be written as: where $\partial_n u$ is the normal derivative and $\varphi_k$ are solutions of the initial value problem for ODEs (Bugajev et al., 2017): Our aim is to find optimal values of the parameters $\{a_0, a_1, \ldots, a_l, d_1, d_2, \ldots, d_l\}$, when the following minimisation problem is solved and $M$ specially selected benchmark PDEs are solved.
In all examples we use $l = 3$, i.e., the dimensionality of the optimisation problem (11) is equal to 7. Here discrete approximations of PDEs represent the tasks $v_m$ in (1). To solve $v_m$ we must find solutions of $N$ systems of linear equations with tridiagonal matrices (Bugajev et al., 2017). According to our three-level parallelisation scheme, the calculations of a single point in the minimisation problem (11) define the block of tasks $V_k$.
The systems of linear equations with tridiagonal matrices are solved using Wang's algorithm. It is well known that if the size of a system is $J$ and $p$ processes are used, then the computation time can be estimated by the theoretical model (12), where $T_{c1}(p)$ defines communication costs. The time to compute a value of the objective function $f$ for the specified equation can be estimated by the model (13). In this work, instead of the theoretical complexity models (12) and (13), we use $t_m(p)$, $m = 1, \ldots, M$, based on empirical computations for a selected set of benchmark problems. Such an approach takes into account all specific details of the parallel algorithm and the computer system. It is interesting to note that the complexity of the computational task $v_m$ depends on two parameters: the number of linear equations $J_m$ of the system and the number of time integration steps $N_m$. The computation time $T_{mp}$ is equal to $N_m t_m(p)$, but the scalability of the parallel algorithm depends on $J_m$ only, since the integration in time is done sequentially step by step.
Next, we consider the problem (11) as a local optimisation problem, which can be solved using an iterative algorithm with a given initial starting point. The Nelder-Mead algorithm is used as a local optimiser (Nelder and Mead, 1965).
We propose a family of modifications of the original Nelder-Mead algorithm in order to increase its parallelisation degree.
At each iteration the following four different scenarios can be obtained:
• Reflection: compute the value $f_R$ of the objective function at the point $X_R$. Depending on the value $f_R$, this can be the end of the iteration.
• Expansion: depending on $f_R$, an additional computation of the objective function at the point $X_E$ is done, meaning the total computation of two objective function values: $f_R$, $f_E$.
• Contraction: depending on $f_R$, an additional computation of the objective function at the point $X_C$ is done, meaning the total computation of two objective function values: $f_R$, $f_C$.
• Compression: compute $m$ objective function values, as well as $f_R$ and $f_C$. Here $m$ is the number of simplex dimensions.
The first three scenarios require the computation of one or two values of the objective function from the set $f_R$, $f_E$, $f_C$. We can neglect the last scenario, because it occurs very rarely. For the first three scenarios we propose to compute two or three points simultaneously. Algorithmically this means that we change the order of computations, which lets us parallelise the Nelder-Mead method. In most cases only two of the three points will be used. Therefore, some redundant calculations will be performed; however, this modification gives an additional parallelisation of computations.
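The simultaneous computation of the candidate points can be sketched in Python as follows. This is an illustration only: the function names are hypothetical, the standard Nelder-Mead coefficients (reflection 1, expansion 2, contraction 0.5) and the inside-contraction variant are assumptions that may differ from the variant used in the paper, and threads stand in for the groups of processes that would each evaluate the objective function in a real run.

```python
from concurrent.futures import ThreadPoolExecutor


def candidate_points(simplex, values, alpha=1.0, gamma=2.0, rho=0.5):
    """Return the reflection (R), expansion (E) and contraction (C) points."""
    order = sorted(range(len(simplex)), key=lambda i: values[i])
    worst = simplex[order[-1]]
    kept = [simplex[i] for i in order[:-1]]
    dim = len(worst)
    # Centroid of all vertices except the worst one.
    centroid = [sum(x[d] for x in kept) / len(kept) for d in range(dim)]
    x_r = [c + alpha * (c - w) for c, w in zip(centroid, worst)]
    x_e = [c + gamma * (r - c) for c, r in zip(centroid, x_r)]
    x_c = [c + rho * (w - c) for c, w in zip(centroid, worst)]
    return {"R": x_r, "E": x_e, "C": x_c}


def evaluate_speculatively(f, points):
    """Evaluate the objective at all candidate points at the same time."""
    with ThreadPoolExecutor(max_workers=len(points)) as pool:
        futures = {name: pool.submit(f, x) for name, x in points.items()}
        return {name: future.result() for name, future in futures.items()}
```

After the values are available, the usual sequential acceptance rules can be applied unchanged, so only the order of the objective function evaluations differs from the sequential method.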
Thus, two modifications of the sequential ($A_1$) Nelder-Mead method are defined. For $A_2$ we compute in parallel the values $f_R$, $f_E$, and for $A_3$ we compute in parallel all three values $f_R$, $f_E$, $f_C$. As a test case we assume that the first scenario is relatively rare, the expansion step is done with probability 2/3 and the contraction step occurs with probability 1/3. Then we get that the algorithmic efficiencies of the proposed parallel modifications are equal to $\gamma_2 = 0.75$ and $\gamma_3 = 2/3$, respectively. We note that these values can be estimated more precisely for specific applications, and one example is given for the computational experiments with the Rosenbrock objective function in Section 5.
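The values of $\gamma_2$ and $\gamma_3$ can be reproduced with a small Python calculation. The cost model used here is an assumption made for illustration: every objective function evaluation takes one time unit, a parallel step evaluates all speculated values at once, and any value that is needed but was not speculated costs one further sequential step.

```python
from fractions import Fraction

# Objective values needed in each scenario (the rare scenarios are ignored).
NEEDED = {"expansion": {"R", "E"}, "contraction": {"R", "C"}}


def algorithmic_efficiency(speculated, probabilities):
    """Expected sequential time divided by (groups * expected parallel time)."""
    sequential = sum(p * len(NEEDED[s]) for s, p in probabilities.items())
    parallel = sum(p * (1 + len(NEEDED[s] - speculated))
                   for s, p in probabilities.items())
    return sequential / (len(speculated) * parallel)


probabilities = {"expansion": Fraction(2, 3), "contraction": Fraction(1, 3)}
gamma_2 = algorithmic_efficiency({"R", "E"}, probabilities)
gamma_3 = algorithmic_efficiency({"R", "E", "C"}, probabilities)
```

Under these assumptions the sequential method needs two time units per iteration, $A_2$ needs on average $4/3$ units with two groups of processes, and $A_3$ needs one unit with three groups, which gives `gamma_2 = 3/4` and `gamma_3 = 2/3`.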
On the first level different parallel algorithms can be used; however, the proposed approach is oriented to the cases when the increased degree of parallelisation gives a speed-up at the cost of efficiency, which is a typical situation in the theory of parallel algorithms (Amdahl's law). As one more example we mention new algorithms developed to solve global optimisation problems. A modification of the well-known DIRECT method (Finkel, 2003), called DIRECT-GL, was presented in (Stripinis et al., 2018). The new modification is based on the idea of analysing more potentially optimal rectangles at each iteration. This approach increases the global sensitivity of the method, but in many cases this property is achieved at the cost of additional computations. The potential parallelisation degree of the DIRECT-GL algorithm can increase up to 2-3 times. But the results of computational experiments in (Stripinis et al., 2018) show that for many benchmark problems (in (Stripinis et al., 2018) these cases are numbered 1, 2, 5, 6, 20, 21, 22, 24, 35, 37, 38, 47, 48, 49, 52) the DIRECT-GL algorithm increases the computational costs needed to achieve the same accuracy of approximations as the DIRECT algorithm. Thus, the classical DIRECT algorithm and its modification DIRECT-GL fit well into the proposed three-level parallelisation template. The degree of parallelisation should then be increased only if this increase compensates for the reduced efficiency of the modified algorithm. Thus we state that, in order to apply the proposed three-level parallelisation scheme, the computations of one point should first be parallelised by a two-level parallelisation approach. Then alternative parallel algorithms with additional degrees of parallelisation should be identified and the optimal algorithm should be selected.

Experimental results
In this section we present results of the parallel scalability tests. All parallel numerical tests in this work were performed on the computer cluster "HPC Sauletekis" at the High Performance Computing Center of Vilnius University, Faculty of Physics.
We have used up to 8 nodes with Intel Xeon E5-2670 processors with 16 cores (2.60 GHz) and 128 GB of RAM per node. Computational nodes are interconnected via the InfiniBand network.
Our main goal is to investigate the efficiency of the proposed three level template of workload distribution between processes.
First, we have selected three specific benchmarks with different discretisations (7), when $M = 4$ discrete approximations of PDEs (9) are solved numerically to compute one value of the objective function. The sizes $(J_m \times N_m)$, $m = 1, \ldots, 4$, of the discrete problems are given in Table 1. In the first benchmark the size of one task $v_1$ is much bigger than the sizes of the remaining three tasks. In the second benchmark two changes are made, which make this set of tasks better suited for parallelisation on a large number of processes: the size of task $v_1$ is halved by taking a smaller number of time steps $N_1$; the size of task $v_3$ remains the same, but the number of points $J_3$ is doubled, and therefore the scalability of Wang's algorithm is improved for this task. In the third benchmark the relative sizes of tasks $v_m$ are more homogeneous than in the first benchmark, but this result is achieved by reducing the numbers of space grid points $J_2$, $J_4$; therefore the scalability of Wang's algorithm is decreased for these two tasks, especially for $v_4$.
Next, we exclude the efficiency condition from the load balancing algorithm by taking E_min = 0 in (4). The distributions of processors between tasks are presented in Tables 2-4. We also provide the actual computation time T_p along with the time T_p^M predicted by the theoretical complexity model. As we can see from Table 2, the model and experimental times are close to each other. In the cases when there is no interpolation error the experimental time is smaller than the model time. This is the expected result: the model times (see Figure 2) are based on a benchmark that imitates a pessimistic scenario in which, as was mentioned before, all nodes were artificially loaded at the same time. The prediction accuracy depends on many parameters, such as the cluster architecture and the network load during computations.
For comparison purposes we also provide the results obtained by using the two-level parallelisation template: if K = 1, then the first level of the three-level template is not used.
It is important to note that in Tables 2-5 we present the CPU time needed to compute one useful point (11), i.e., the actual time is divided by γ_k k, which represents the usefulness of computations. The optimal algorithm A_k is selected automatically using the approach described above.
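The normalisation used in Tables 2-5 can be illustrated by a minimal Python sketch. It is not the implementation used in the experiments: the function names, the toy time model `toy_model_time` and the values of γ_k in the usage example are hypothetical, and we assume that each of the k points is computed on P // k processes.

```python
from typing import Callable, Dict, Tuple


def time_per_useful_point(t_actual: float, gamma_k: float, k: int) -> float:
    """Return the actual time divided by gamma_k * k (time per useful point)."""
    if k < 1 or gamma_k <= 0.0:
        raise ValueError("k must be >= 1 and gamma_k must be positive")
    return t_actual / (gamma_k * k)


def select_degree(
    total_procs: int,
    gammas: Dict[int, float],
    model_time: Callable[[int], float],
) -> Tuple[int, float]:
    """Pick the degree k with the smallest time per useful point.

    Assumption: the k points are computed simultaneously, each on
    total_procs // k processes; model_time(p) predicts the time of one point.
    """
    best: Tuple[int, float] = (0, float("inf"))
    for k, gamma_k in sorted(gammas.items()):
        procs_per_point = total_procs // k
        if procs_per_point < 1:
            continue  # not enough processes for this degree
        t = time_per_useful_point(model_time(procs_per_point), gamma_k, k)
        if t < best[1]:
            best = (k, t)
    if best[0] == 0:
        raise ValueError("no feasible parallelisation degree")
    return best


def toy_model_time(p: int) -> float:
    """Hypothetical Amdahl-like model: 2% of 100 time units is sequential."""
    return 100.0 * (0.02 + 0.98 / p)


if __name__ == "__main__":
    # Hypothetical usefulness coefficients, not taken from the paper.
    k_best, t_best = select_degree(128, {1: 1.0, 3: 0.8}, toy_model_time)
    print(k_best, round(t_best, 3))
```

In this toy setting the degree k = 3 is preferred, because the loss of usefulness (γ_3 = 0.8) is outweighed by the poor scalability of a single point on 128 processes.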
As follows from Table 2, the use of the first level with k = 3 and P = 128 processes increases the potential speed-up from 38.75 to 60.44. If P = 128 and k = 1, then only 70 processes are used. However, the result is very similar to the case when P = 64 processes are used, which means that these additional resources are used very inefficiently. In Fig. 3 the Gantt charts show the theoretical model time t_m(p) that is needed to obtain the solutions of the different equations. The workload distribution becomes closer to uniform as the number of processes is increased.

4.1. The control of efficiency
The reduction of energy consumption is an important goal, especially when the increase in computational speed-up gained from additional processes is small. The presented results indicate that in some cases the computational resources are used highly inefficiently.
To control the efficiency of calculations, condition (4) was introduced in Algorithm 1. This condition guarantees that the efficiency of the numerical solution of each block of tasks will be at least E_min. It is important to note that we are not attempting to generate optimal mappings of processors; we have developed a heuristic that provides a quality of task distribution that is sufficient for most practical purposes. The quality of the algorithm improves when more processors are available.
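The role of the efficiency threshold can be illustrated by the following Python sketch of a greedy distribution. It is not Algorithm 1 itself: we assume here that the efficiency of a task on p processes is t(1) / (p t(p)), that one process at a time is added to the currently slowest task, and that the search stops as soon as the slowest task cannot accept another process. The names `greedy_distribution` and `amdahl_time` and the toy time model are hypothetical.

```python
from typing import Callable, List


def greedy_distribution(
    total_procs: int,
    times: List[Callable[[int], float]],
    e_min: float = 0.0,
) -> List[int]:
    """Greedily distribute processes between tasks.

    times[m](p) is the model time of task m on p processes.
    A process is added to the slowest task only if its time decreases
    and its efficiency t(1) / (p * t(p)) stays at least e_min.
    """
    n_tasks = len(times)
    if total_procs < n_tasks:
        raise ValueError("at least one process per task is required")
    procs = [1] * n_tasks
    free = total_procs - n_tasks
    while free > 0:
        # The slowest task defines the time of the whole block of tasks.
        m = max(range(n_tasks), key=lambda i: times[i](procs[i]))
        p_new = procs[m] + 1
        t_new = times[m](p_new)
        efficiency = times[m](1) / (p_new * t_new)
        if t_new >= times[m](procs[m]) or efficiency < e_min:
            break  # the slowest task cannot be improved any further
        procs[m] = p_new
        free -= 1
    return procs


def amdahl_time(work: float, seq_fraction: float) -> Callable[[int], float]:
    """Hypothetical time model with a fixed sequential fraction."""
    return lambda p: work * (seq_fraction + (1.0 - seq_fraction) / p)


if __name__ == "__main__":
    tasks = [amdahl_time(100.0, 0.05), amdahl_time(10.0, 0.05)]
    print(greedy_distribution(64, tasks, e_min=0.6))
```

With this toy model and E_min = 0.6 only 15 of the 64 available processes are assigned, i.e. the efficiency condition leaves the remaining processes unused, whereas for E_min = 0 all 64 processes are distributed.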
Next, a more detailed analysis of Benchmark 1 is provided. In Table 5 the results for E_min > 0 are presented. Comparing the results in Table 5 with the results in Table 2, we see that for K = 1 and E_min = 0.6 the number of processes for the first equation is decreased by 14; however, the computation times are almost the same as in the case of E_min = 0. Also, for K = 3 the efficiency requirement begins to limit the number of processes for E_min = 0.75, and this number decreases further with E_min = 0.8. However, even then the three-level approach with K = 3 is superior to the standard two-level approach in terms of the final speed-up. The results in Table 5 indicate that even for the efficiency limitation E_min = 0.75 the proposed three-level approach keeps a large number of parallel processes active; this number is equal to (26 + 7 + 4 + 2) × 3 = 117. The speed-up is 56 and the efficiency of the parallel algorithm is 56/117 ≈ 0.48. The last column in Table 5, with K = 1, presents the results for the two-level approach (without the first level). A straightforward two-level parallelisation approach would have limited parallelisation possibilities, especially for problems of size J = 2000. For such small subproblems it would be possible to utilise only up to 32 processes (Fig. 2), and the speed-up would be quite limited as well.
Note that all previous results represent the analysis based on a single Nelder-Mead iteration. Next, we solve the actual real-world optimisation problem (11). For the maximum number of processes P = 128 the load balancing algorithm has selected k = 1. The number of Nelder-Mead iterations was fixed to 1000. The parallel and sequential versions gave the same results: the minimum value of the error E C ∞ = 0.0806. The sequential version of the computations took 180286 seconds, and the parallel version took 2232 seconds. Thus, a speed-up factor of 81.8 was achieved. The selection of k = 1 indicates that the number of processes can be greatly increased; the algorithm has selected k = 1 automatically for the given number of processes.
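The number of active processes and the efficiency quoted above for E_min = 0.75 and K = 3 can be checked with a few lines of Python. The numbers are those given in the text; the function name `parallel_efficiency` is hypothetical, and the efficiency is taken as the speed-up divided by the number of active processes.

```python
def parallel_efficiency(speed_up: float, active_processes: int) -> float:
    """Efficiency defined as speed-up per active process."""
    if active_processes < 1:
        raise ValueError("at least one active process is required")
    return speed_up / active_processes


# Processes assigned to the four tasks of one point, as quoted in the text.
procs_per_task = [26, 7, 4, 2]
k = 3  # number of points computed simultaneously on the first level
active = sum(procs_per_task) * k
efficiency = parallel_efficiency(56.0, active)
print(active, round(efficiency, 2))  # prints: 117 0.48
```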

The comparison of different Nelder-Mead parallelisation methods
Here we present the analysis of the convergence properties of different modifications of the Nelder-Mead method.
As was mentioned before, the convergence rate of the selected algorithm directly affects the parallelisation efficiency, which is represented by γ_k, where k is the parallelisation degree. In this section we measure γ_k by measuring the experimental parallel efficiency of the algorithms.
The detailed analysis of convergence behaviour for different objective functions is beyond the scope of this research. However, the objective function from the previous sections is suitable only for a narrow class of applications. Thus, to compare different parallel versions of the Nelder-Mead method, we minimise the Rosenbrock objective function, which is widely used by researchers in the field of optimisation theory (Fajfar et al., 2018; Stripinis et al., 2018).
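For reference, a common d-dimensional form of the Rosenbrock function is sketched below in Python. The text does not specify which variant is used in the experiments, so the chained form with the coefficient 100 is an assumption, and the function name `rosenbrock` is hypothetical.

```python
from typing import Sequence


def rosenbrock(x: Sequence[float]) -> float:
    """Chained Rosenbrock function (assumed variant).

    f(x) = sum_{i=1}^{d-1} [100 (x_{i+1} - x_i^2)^2 + (1 - x_i)^2],
    with the global minimum f = 0 at x = (1, ..., 1).
    """
    if len(x) < 2:
        raise ValueError("the dimension d must be at least 2")
    return sum(
        100.0 * (x[i + 1] - x[i] ** 2) ** 2 + (1.0 - x[i]) ** 2
        for i in range(len(x) - 1)
    )


if __name__ == "__main__":
    print(rosenbrock([1.0] * 7))   # value at the global minimum for d = 7
    print(rosenbrock([-1.2, 1.0])) # value at a frequently used starting point
```

The long, curved and nearly flat valley of this function is what makes it a demanding test for simplex-based methods such as Nelder-Mead.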
We show that in the case of the Rosenbrock function the real experimental values of γ_k differ from those assumed in the experiments of the previous sections. The reason is that a significant number of iterations require the computation of only one point F_R.
We compare the results of our parallel modification of the Nelder-Mead method with the state-of-the-art technique proposed in (Lee and Wiswall, 2007). As a benchmark we use the Rosenbrock function, which makes the optimisation problem challenging. It should be noted that the parallel algorithm of (Lee and Wiswall, 2007) can achieve a parallelisation degree K equal to the dimension d of the optimisation problem.
Thus, potentially this algorithm is well suited for parallel computers with a large number of processes. In Table 6 we compare three cases d = 3, 6, 7: d = 3 is the minimum that is needed for parallelisation with both methods, d = 7 is the case that was studied in the previous section, and d = 6 shows the tendency for smaller d. We provide results obtained when the Rosenbrock function of dimensions d = 3, 6, 7 was minimised by using our parallel modification of the Nelder-Mead method. The values of the efficiency coefficients γ_k are presented. They show that this parallel algorithm is quite stable and is well suited for use in the three-level template solver for objective functions of small dimension.
Table 7 presents results obtained by using the state-of-the-art parallel Nelder-Mead algorithm from (Lee and Wiswall, 2007). It follows that in all investigated cases the parallelisation degree is very limited, since the convergence drops significantly when the parallelisation degree is increased. This method is mainly targeted to solve problems when the dimension of the objective

Fig. 2. The speed-ups of Wang's parallel algorithm for different numbers of processes p and sizes J of systems. The detailed specification of processors is presented in Section 3.

Table 1. Benchmarks with different sizes J_m × N_m of the dis-

Table 2. The results for Benchmark 1. T_p is the CPU time in seconds required to compute one useful point (11).

Table 3. The results for Benchmark 2. T_p is the CPU time in seconds required to compute one useful point (11).

Table 4. The results for Benchmark 3. T_p is the CPU time in seconds required to compute one useful point (11).

Table 5. The results for Benchmark 1 with E_min > 0. T_p is the CPU time in seconds required to compute one useful point (11).