1. Introduction
Many metaheuristic algorithms have been proposed to obtain near-optimal solutions in finite time when solving NP-hard problems [1,2,3,4,5]. Among them, the genetic algorithm (GA) is one of the most popular and widely studied; it was originally proposed by Holland [6] and De Jong [7]. GA is a powerful, domain-independent search technique inspired by Darwinian theory [8]. As a global search algorithm, GA has been widely used for many years to solve optimization problems in various fields [9,10,11,12,13,14,15,16,17,18].
Several GA variants have also been proposed. One of the most promising variants is the island model GA (IMGA) [19]. IMGA is a multiple-population, coarse-grained model consisting of several distributed subpopulations (islands) that occasionally exchange individuals. Because individuals migrate between the independent islands, this model is more likely to explore different search regions and find better-quality solutions [20,21]. For instance, Palomo-Romero et al. [20] applied IMGA to solve the unequal area facility layout problem (UA-FLP); according to their experimental results, most instances obtained better solutions in fewer generations when IMGA was used.
UA-FLP is an important and widely studied facility layout problem in industrial engineering [22]. A good placement of facilities can reduce total manufacturing expenses by up to 50% [23]. The problem also arises in many fields such as industrial facility design, warehouse organization, and VLSI placement. Most researchers build optimization models with quantitative performance criteria, such as minimization of the material handling cost. UA-FLP has been proven to be an NP-hard combinatorial optimization problem, meaning that the calculation time increases exponentially as the number of facilities grows [24]. More and more metaheuristic algorithms [25,26,27,28,29,30] have been proposed to find better near-optimal solutions for UA-FLP. Among them, GA is the most popular, owing to its effectiveness in improving the chance of reaching globally optimal solutions. Lately, IMGA has proven to be an efficient method for solving UA-FLP, obtaining good solutions in a reasonable computational time [20].
Although IMGA is effective in solving many NP-hard problems, it takes a much longer execution time on a CPU as the problem size grows, since the individuals in IMGA must be selected, evaluated, crossed over, mutated, and migrated. Modern GPUs (graphics processing units) are popular and flexible processors in high-performance computing, especially since CUDA (compute unified device architecture) was introduced [31]. IMGA can take full advantage of GPU computing by exploiting both coarse-grained and fine-grained parallelism. However, only a very limited number of studies have focused on how to reduce the execution time of IMGA on GPUs [32,33,34]. In our previous work [32], we proposed two different parallel algorithms for each individual step of IMGA when solving UA-FLP on a GPU, and reported which parallel algorithm performs better for each step. The best performance ratio of our suggested GPU version over the CPU counterpart is as high as 84.
Most previous studies about GPUs focus on performance, because the solutions found by the CPU and GPU implementations of the same method are exactly the same; therefore, related work usually does not report solution quality. In practice, when implementing IMGA on a GPU, the settings of various parameters greatly influence both the solution quality and the execution time. Since there are many combinations of parameter settings, finding the best setting is very time consuming. The easiest way to obtain a better solution is to assign large values to key parameters, e.g., the number of generations. However, a larger parameter value usually implies a longer execution time, and it is not guaranteed that the solution quality will improve noticeably after a parameter is enlarged. To address this problem, in this paper we investigate how each key parameter influences the solution quality and the execution time when running IMGA on a GPU. The results provide a guideline, in the form of a selection order of parameters, for users who aim at better solutions without significantly increasing the execution time.
In this paper, we explore how to obtain a better solution within a reasonable time when parallelizing IMGA on a GPU, taking UA-FLP as an example. Firstly, we focus on how to implement the tournament selection of IMGA on a GPU. The key issue is how random numbers are generated by each thread at runtime. Since thousands of threads execute in parallel, the aggregated distribution of all random numbers influences the search space that is actually explored. Next, we investigate three important parameters closely related to solution quality and execution time in IMGA: the number of islands, the number of generations (iterations), and the number of chromosomes. Tuning each of these parameters separately influences both the solution quality and the execution time. We address the challenge of how to trade off between the two for each parameter; in other words, which parameter improves the solution quality the most if the execution time is increased by the same ratio. Experiments are conducted on our selected data sets. Through analysis, we observe the order of influence of the three parameters on solution quality, which can help researchers adjust IMGA parameters to find better-quality solutions on a GPU. Finally, we recommend a group of parameter settings for solving UA-FLP with IMGA on a GPU. According to the experimental results, if we give higher priority to reducing the execution time on a GPU, the quality of the best solution can be improved by about 3%, with an acceleration 29 times faster than the CPU counterpart after applying our suggested parameter settings. If we give solution quality the higher priority, the solution quality can be improved by up to 8% while the GPU takes almost the same execution time as the CPU. This research is valuable for helping users set parameters more efficiently when using a GPU to solve UA-FLP with IMGA, in order to get better solutions within an execution time comparable to that of a CPU. It also helps researchers take advantage of GPUs to find more appropriate solutions for optimization problems.
The paper is organized as follows. Section 2 introduces the related work, including UA-FLP, CUDA GPUs, and the parallelization of IMGA for UA-FLP on a GPU. Section 3 describes our improved parallel tournament selection on a GPU. Section 4 presents the experiments and evaluations of our proposed method, together with the quality-oriented discussions, in detail. Finally, Section 5 gives the conclusions.
2. Related Work
In this section, the related work is introduced, including the formulation and layout representation of UA-FLP, a brief introduction to CUDA GPUs, and GPU-based IMGA for solving UA-FLP.
2.1. UA-FLP
UA-FLP is formulated based on some assumptions. Given a fixed rectangular region with an area of $W \times H$, in which the facilities or departments should be located, the objective of the optimization is to minimize the total material handling cost (MHC), which is commonly represented by the sum, over all facility pairs, of the products of the material handling flow and the weighted rectilinear distance between centroids. Assuming the number of facilities is $n$, the cost of moving materials from facility $i$ to facility $j$ is $c_{ij}$, and the weighted rectilinear distance between facilities $i$ and $j$ is $d_{ij}$, the general model is formulated by the following Equation (1):

$$\min \ \mathrm{MHC} = \sum_{i=1}^{n} \sum_{j=1}^{n} c_{ij}\, d_{ij} \qquad (1)$$
The distance $d_{ij}$ can be calculated as Euclidean (Equation (2)) or rectilinear (Equation (3)), where the point defined by $x_i$ and $y_i$ is the center of facility $i$:

$$d_{ij} = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2} \qquad (2)$$

$$d_{ij} = |x_i - x_j| + |y_i - y_j| \qquad (3)$$
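As a hypothetical numerical illustration, consider two facility centers at $(x_i, y_i) = (2, 3)$ and $(x_j, y_j) = (5, 7)$: the Euclidean distance is $\sqrt{(2-5)^2 + (3-7)^2} = \sqrt{25} = 5$, while the rectilinear distance is $|2-5| + |3-7| = 7$.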
The general model of UA-FLP is not shape constrained; that is, no minimum side length or maximum aspect ratio is specified for any facility. To ensure realistic layouts, we impose a constraint of maximum allowable aspect ratio ($\alpha_{\max}$). If the aspect ratio of a facility is more than $\alpha_{\max}$, the facility is infeasible. Let $l_i$ and $w_i$ be the length and width of facility $i$; the aspect ratio $\alpha_i$ of the facility is calculated by Equation (4) below:

$$\alpha_i = \frac{\max(l_i, w_i)}{\min(l_i, w_i)} \qquad (4)$$
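For example (with hypothetical numbers), a facility with $l_i = 6$ and $w_i = 2$ has $\alpha_i = \max(6, 2)/\min(6, 2) = 3$; under a constraint of $\alpha_{\max} = 4$ the facility is feasible, whereas $\alpha_{\max} = 2$ would render it infeasible.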
The most commonly used continuous representation structure for UA-FLP is the flexible bay structure (FBS), defined by Tong [35]. The layout of the plant is divided into bays of varying widths in one direction. Each bay can hold one or more facilities, and its width is variable, depending on the facilities it contains. In our UA-FLP, an individual of a facility layout with FBS is divided into two parts: (i) the facility sequence, composed of the n facilities organized bay by bay, from left to right and from top to bottom, and (ii) the bay division, with n − 1 binary elements, where a value of 0 indicates that the facility is placed in the same bay as the previous one, and a value of 1 indicates that the facility is the last facility in the current bay.
Figure 1 shows an example of an individual represented with FBS, and its FBS codes are presented in Figure 2.
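To make the encoding concrete, the following sketch shows one possible in-memory representation of an FBS chromosome in CUDA C++; the structure, field names, and the helper function are our own illustration under these assumptions, not the exact data layout of our implementation.

// Minimal sketch of an FBS-encoded chromosome (hypothetical layout).
// facility_seq: n facility IDs, listed bay by bay, left to right, top to bottom.
// bay_break:    n-1 flags; bay_break[k] == 1 means facility k is the last
//               facility of its bay, 0 means the next facility shares the bay.
struct FbsChromosome {
    int    n;               // number of facilities
    int   *facility_seq;    // length n
    char  *bay_break;       // length n-1
    float  fitness;         // MHC of the decoded layout, Equation (1)
};

// Count the bays encoded by a chromosome: each '1' flag closes a bay,
// and the last facility implicitly closes the final bay.
__host__ __device__ int count_bays(const FbsChromosome &c) {
    int bays = 1;
    for (int k = 0; k < c.n - 1; ++k)
        if (c.bay_break[k] == 1) ++bays;
    return bays;
}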
2.2. Introduction to GPU and CUDA
With the introduction of the CUDA platform, more and more researchers have paid attention to NVIDIA GPUs, since CUDA makes GPU coding easier [36]. A GPU chip is composed of streaming multiprocessors (SMs), each of which contains several streaming processors (SPs). During each cycle, every SP in the same SM executes the same instruction on different data.
Figure 3 shows a simplified GPU structure.
A CUDA program consists of sequential and parallel code segments. The sequential code runs on the host (CPU), and the parallel code, embedded in kernel functions, is offloaded to the device (GPU). When a kernel is invoked, a large number of threads are launched to exploit data parallelism after thread blocks are distributed to SMs. After a kernel finishes its execution on the device, the results can be copied from GPU memory to CPU memory. In a CUDA application, threads running the same instructions on different data are grouped into blocks. One active thread runs on an SP, and one block is assigned to an SM. The maximum number of threads in a block depends on the compute capability of the GPU. Threads in each block are further divided into warps. A warp contains 32 threads of a block and adopts the single-instruction, multiple-data (SIMD) execution model, meaning all threads within a warp must execute the same instruction at any given time. If the threads in a warp diverge into two paths after a data-dependent conditional branch, the warp serially executes the paths one by one, while the threads on the other path stay idle. This is known as branch divergence, and it can degrade GPU performance significantly [37].
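The toy kernel below illustrates branch divergence; it is our own example, not code from the cited work. Threads of the same warp whose inputs fall on different sides of the data-dependent branch force the warp to execute both paths serially.

// Toy kernel with a data-dependent branch: within a warp, the negative
// and non-negative cases are serialized, idling the threads on the
// path that is not currently executing.
__global__ void divergent_scale(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (in[i] < 0.0f)          // source of divergence
        out[i] = in[i] * 2.0f;
    else
        out[i] = in[i] * 0.5f;
}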
Several types of memory reside on a GPU, including global memory, shared memory, constant memory, texture memory, local memory, and registers. These memory types differ in size, access scope, access time, and whether they are read-only or cached. Figure 4 shows the CUDA memory hierarchy model.
Global memory, the largest memory on a GPU (up to several gigabytes), is used for communication between the host and the device. Its access time is the longest, since completing a read or a write operation requires 400–800 clock cycles. Memory coalescing is a very important mechanism for hiding such memory latency. The host is responsible for the allocation and deallocation of buffers in global memory. When a kernel is launched, the device takes over the access rights to the buffers in global memory until the kernel execution is complete. Blocks can communicate with each other through global memory.
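A minimal host-side sketch of this workflow, under our own naming assumptions, is shown below: the host allocates global-memory buffers, copies input data in, launches a kernel, and copies the results back. Consecutive threads touch consecutive addresses, so the accesses coalesce.

#include <cuda_runtime.h>

__global__ void scale(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * in[i];  // coalesced read and write
}

void run_scale(const float *host_in, float *host_out, int n) {
    float *dev_in, *dev_out;
    cudaMalloc(&dev_in,  n * sizeof(float));  // host allocates global memory
    cudaMalloc(&dev_out, n * sizeof(float));
    cudaMemcpy(dev_in, host_in, n * sizeof(float), cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks  = (n + threads - 1) / threads;
    scale<<<blocks, threads>>>(dev_in, dev_out, n);  // device owns the buffers now

    cudaMemcpy(host_out, dev_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev_in);                         // host releases the buffers
    cudaFree(dev_out);
}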
Shared memory is a very fast memory in a GPU, requiring only two to four cycles per access if no bank conflicts occur. It can be accessed only by threads in the same block, which use it to communicate with each other. Shared memory is available for both reading and writing, but it is not cached and its space is limited.
Constant memory is a read-only, cached memory. The CPU can read and write data in constant memory, but a GPU thread can only read it. Constant memory is as fast as shared memory unless cache misses occur. Texture memory is similar to constant memory in that it is also cached and read-only, but it has a larger space, and its access latency depends on whether cache misses occur.
Local memory is used to hold large automatic variables of each thread. It is as slow as global memory and is not cached. Registers, the fastest memories, also hold automatic variables of each thread, but the number of 32-bit registers is limited.
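The kernel below sketches how threads of one block cooperate through shared memory; it is a generic block-level sum reduction of our own, assuming a block size of 256 threads, not code from our implementation.

// Block-level sum reduction: each block accumulates its slice of the
// input in fast on-chip shared memory and writes one partial sum.
__global__ void block_sum(const float *in, float *out, int n) {
    __shared__ float buf[256];            // visible to this block only
    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid;
    buf[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                      // make all writes visible

    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) buf[tid] += buf[tid + stride];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = buf[0];  // one partial sum per block
}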
2.3. Parallelization of IMGA for UA-FLP on a GPU
There are only a few projects studying how to parallelize IMGA on GPUs. Melab and Talbi [33] proposed three different schemes to build an efficient IMGA on GPUs. According to their experimental results, GPU-based IMGA can accelerate the search process, especially for large-scale optimization problems. Limmer and Fey [34] studied two models for evolutionary algorithms, including the island model genetic algorithm, and discussed the implementation and performance under different parallel architectures such as CPU clusters and GPUs. Li et al. [39] took advantage of GPUs to parallelize IMGA, replacing the mutation operator with simulated annealing when solving the travelling salesman problem (TSP).
In the latest literature, we were the first to propose how to use GPUs to parallelize IMGA for solving UA-FLP [32]. The structure of our parallel IMGA on a GPU is illustrated in Figure 5. There are N islands, organized as a ring topology, and each of them holds n individuals. Each individual of the population is represented by a chromosome that corresponds to a facility layout, and it has a fitness value representing the quality of the solution, calculated by Equation (1). FBS is used to encode a chromosome, consisting of a facility sequence and a bay division. Each island is assigned to one block on the GPU, and the information about all individuals is stored in global memory. Each thread performs the procedures of IMGA, including initialization, evaluation, selection, crossover, mutation, and migration. For selection, frequently accessed data are stored in shared memory to increase the access speed. In addition, migration is carried out through global memory, where threads in different blocks can communicate with each other efficiently using memory coalescing, as sketched below.
In our previous implementation, we proposed various parallel algorithms for each step of IMGA when solving UA-FLP. Experiments were conducted to compare the performance of the different parallel versions against the CPU version. According to the experimental results, applying the basic parallelization methods to the main steps of IMGA yields a performance improvement ratio over the CPU version of more than 20 at best. Furthermore, applying our suggested parallel methods to implement IMGA on a GPU, the best performance improvement ratio over the CPU version is as high as 84. That is, our GPU version provides much higher performance than both the conventional GPU version and the CPU version.
The previous literature discusses how to use a GPU to accelerate IMGA in order to reduce the execution time and thus improve system performance. Since the CPU and GPU versions adopt the same IMGA algorithm, their solutions are the same; consequently, solution quality was not usually discussed in those reports. Nevertheless, when executing IMGA on a GPU, different parameter settings result in different solution qualities and execution times. Especially for large-scale optimization problems, increasing parameter values usually leads to a much longer execution time, while whether the solution quality will improve remains uncertain. In addition, each parameter has its own impact on solution quality and execution time. For some parameters, increasing the value improves the best solution only slightly while the execution time grows exponentially; by contrast, increasing another parameter value might improve the best solution significantly while the execution time grows only polynomially. Therefore, if users want to find a better solution on a GPU without a much longer execution time, it is crucial to give them a guideline on how to adjust individual parameters. Instead of performance tuning alone, in this paper we focus on how to adjust parameters efficiently, in order to obtain a better solution at a lower cost in execution time when solving UA-FLP with IMGA on a GPU.
In IMGA, selection is one of the most time-consuming and quality-influencing operations. The parallelization of the two most common selection methods on a GPU was discussed in our previous work [32]. In the following sections, we first propose an effective method of generating random seeds for the tournament method on a GPU, which helps find solutions of better quality within a shorter execution time. We then focus on how to tune the relevant parameters for better solution quality, after trading off between solution quality and execution time when GPUs are used to accelerate IMGA. The results will help researchers set parameters in a more efficient and effective way to obtain better-quality solutions without increasing the execution time too much.
5. Conclusions
The island model is a popular and efficient way to implement the genetic algorithm on a parallel architecture. In recent work, IMGA has been applied to solve UA-FLP, improving the solution quality for most of the problem data sets. However, because UA-FLP is an NP-hard problem, the amount of calculation grows rapidly as the problem size increases, and the execution time keeps getting longer even though metaheuristic approaches are used. Modern GPUs, highly parallel computing processors, can manipulate large amounts of data efficiently with their many embedded cores. So far, almost all research on using GPUs to execute IMGA has investigated how to reduce the execution time. For researchers who work on practical problems, however, finding a solution of higher quality is definitely the key concern.
In this paper, we studied how to achieve better solution quality in a more cost-effective way when using a GPU to accelerate IMGA for solving large-scale problems, taking UA-FLP as an example. Firstly, we addressed the problem of how random seed generation on a GPU influences the solution quality. Random seeds can be generated in two conventional ways: (1) generate only one seed from the system clock, or (2) generate one seed from the system clock for each round of comparisons. The first method suffers from the problem of repeated comparisons, which significantly limits the explored search space. The second method can provide a better solution, at the cost of a much longer execution time. Therefore, we proposed an efficient method that generates only one random seed but combines multiple parameters: the thread ID, the block ID, and the system clock. Consequently, each thread has a unique sequence of random numbers in each round. According to the experimental results, our method provides a solution as good as that of the second method, with an execution time as short as that of the first method.
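A minimal sketch of this seeding scheme is shown below, using the cuRAND device API. It is our illustration of the idea rather than the exact code of the implementation: one clock-based seed is combined with the per-thread ID so that every thread draws from its own random stream.

#include <curand_kernel.h>

// Each thread derives a unique random stream from a single host-supplied
// clock seed plus its block and thread IDs, so no two threads repeat the
// same sequence of tournament comparisons.
__global__ void init_rng(curandState *states, unsigned long long clock_seed) {
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    // Same seed, distinct subsequence per thread: cuRAND generates
    // statistically independent streams for different subsequences.
    curand_init(clock_seed, gid, 0, &states[gid]);
}

__device__ int pick_opponent(curandState *state, int pop_size) {
    return curand(state) % pop_size;  // random index for one tournament round
}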
Next, we addressed the challenge of how to trade off between solution quality and execution time when setting parameters. Three important parameters of IMGA were investigated: the number of islands, the number of generations, and the number of chromosomes per island. According to the experimental results, to obtain a better solution at a low cost in execution time, the order of influence on solution quality is: the number of chromosomes per island, the number of generations, and then the number of islands. Moreover, based on a series of experiments, we also recommend a set of parameter settings for finding a more cost-effective solution when executing IMGA on a GPU for UA-FLP: 16 islands, 128 generations, and 256 chromosomes per island. With this parameter setting, the solution quality on the GPU is improved by about 3% over the CPU baseline version, and the GPU version is 29 times faster than the CPU baseline. Furthermore, if the GPU and the CPU spend almost the same execution time, the quality improvement by the GPU can be up to 8%.
In the future, we will further investigate how to improve the solution quality of other NP-hard problems on GPUs when IMGA is adopted. We believe that our parallelization strategy provides an elegant and cost-effective way of solving these problems on a desktop computer simply equipped with a graphics card, instead of relying on a large computational grid.