The lattice Boltzmann method, a petaflop and beyond
Robertsén, Fredrik (2018-04-20)
Turku Centre for Computer Science (TUCS)
This publication is subject to copyright regulations. The work may be read and printed for personal use. Use for commercial purposes is prohibited.
The permanent address of the publication is
https://urn.fi/URN:ISBN:978-952-12-3692-1
Abstract
Computer simulations allow real-world phenomena to be analyzed in great detail. Computational fluid dynamics, for example, allows the simulation of fluid flow phenomena that might otherwise be unobservable, or that researchers lack the resources to observe. Researchers want to analyze larger and more complex systems in a shorter time, so that they can get more work done with the resources they have. To achieve this, simulation codes need to be able to use large computer systems efficiently.
This thesis focuses on the lattice Boltzmann method (LBM) and the use of computational accelerators to run LB simulations, but also covers some optimization and performance results for regular CPU-based systems. The higher memory bandwidth of computational accelerators has a significant impact on the performance of an LB code, allowing the accelerators to easily outperform contemporary CPU systems.
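As context for why memory bandwidth dominates, the following is a minimal sketch of a single-relaxation-time (BGK) D2Q9 stream-and-collide step in C. It is not the solver described in the thesis: the function names, the two-grid (read f_src, write f_dst) scheme, and the array-of-structures layout are illustrative assumptions.

#include <stddef.h>

#define Q 9  /* D2Q9: nine discrete velocities per lattice site */

/* One BGK time step over an nx x ny periodic lattice. Per site, Q
 * values are loaded and Q are stored, but only a handful of flops are
 * done per value: the kernel is memory-bandwidth bound. */
void lbm_step(size_t nx, size_t ny, double tau,
              const double *f_src, double *f_dst,
              const int cx[Q], const int cy[Q], const double w[Q])
{
    for (size_t y = 0; y < ny; ++y) {
        for (size_t x = 0; x < nx; ++x) {
            /* pull scheme: gather populations streaming in from neighbours */
            double f[Q], rho = 0.0, ux = 0.0, uy = 0.0;
            for (int i = 0; i < Q; ++i) {
                size_t xs = (x + nx - (size_t)cx[i]) % nx;  /* periodic wrap */
                size_t ys = (y + ny - (size_t)cy[i]) % ny;
                f[i] = f_src[(ys * nx + xs) * Q + i];
                rho += f[i];
                ux  += f[i] * cx[i];
                uy  += f[i] * cy[i];
            }
            ux /= rho; uy /= rho;
            /* BGK collision: relax towards the local equilibrium */
            for (int i = 0; i < Q; ++i) {
                double cu  = 3.0 * (cx[i] * ux + cy[i] * uy);
                double feq = w[i] * rho * (1.0 + cu + 0.5 * cu * cu
                                           - 1.5 * (ux * ux + uy * uy));
                f_dst[(y * nx + x) * Q + i] = f[i] - (f[i] - feq) / tau;
            }
        }
    }
}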
The hardware architectures of the HPC systems used for these kinds of simulations are briefly presented, as well as the programming methods available for them. This thesis examines how the OpenACC programming standard makes it easier to create GPU programs. The benefit of OpenACC compared to CUDA is that it lets the user add directives to the code; these directives control which parts of the execution are offloaded to the accelerator. This thesis describes how OpenACC directives can be applied to an LB solver, and compares the performance of that OpenACC solver with a native CUDA solver, both implemented using the same optimization techniques.
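To illustrate the directive-based style, here is a hedged sketch of how the purely local BGK collision step could be annotated with OpenACC; the function and parameter names are assumptions, not the thesis code. The data clauses manage host-device transfers at the region boundary, and the parallel loop directive maps lattice sites onto GPU threads, while the loop body remains ordinary C that also runs on the CPU.

#include <stddef.h>

#define Q 9

/* BGK collision with OpenACC offload: copy(f) moves the populations
 * to the device and back; the site loop becomes the GPU thread grid. */
void collide_acc(size_t n_sites, double tau, double *restrict f,
                 const int cx[Q], const int cy[Q], const double w[Q])
{
    #pragma acc parallel loop copy(f[0:n_sites*Q]) \
                              copyin(cx[0:Q], cy[0:Q], w[0:Q])
    for (size_t s = 0; s < n_sites; ++s) {
        double rho = 0.0, ux = 0.0, uy = 0.0;
        for (int i = 0; i < Q; ++i) {
            double fi = f[s * Q + i];
            rho += fi;
            ux  += fi * cx[i];
            uy  += fi * cy[i];
        }
        ux /= rho; uy /= rho;
        for (int i = 0; i < Q; ++i) {
            double cu  = 3.0 * (cx[i] * ux + cy[i] * uy);
            double feq = w[i] * rho * (1.0 + cu + 0.5 * cu * cu
                                       - 1.5 * (ux * ux + uy * uy));
            f[s * Q + i] -= (f[s * Q + i] - feq) / tau;
        }
    }
}

In a full solver the copy clauses would sit on an enclosing data region spanning all time steps, so that the populations stay resident on the GPU instead of being transferred every step.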
For large-scale GPU-accelerated systems it is important that any program running on them can efficiently utilize the resources. In this thesis, we examine the performance achievable on large-scale GPU-accelerated systems running lattice Boltzmann simulations. Specific attention is given to the scalability of our GPU-accelerated LB solver on the Titan supercomputer, running on 16384 GPUs in parallel. The highlight of these simulations is that porous media fluid flow simulations on this system can achieve over 1 petaflop of sustained computational performance. Basic implementation details, such as the data layouts and algorithms used, are also covered, and their impact on performance is discussed. The results from the large-scale simulations show that, even with rather homogeneous porous media samples, the workload can become unevenly distributed among the computing units. In this thesis we demonstrate that even a simple domain decomposition scheme based on recursive bisection can effectively improve the load balance for the porous media case used.
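The idea behind recursive bisection can be shown with a minimal one-dimensional sketch; the real decomposition would cut a 3D domain, and all names here are illustrative. The point is that cuts are placed by accumulated work (fluid-node counts) rather than by equal volume, so solid-heavy regions get wider slabs.

#include <stddef.h>

/* Recursive bisection over lattice slabs. work[x] is the number of
 * fluid (non-solid) nodes in slab x, so the cut positions adapt to
 * the pore geometry. out[r] receives the first slab owned by rank r;
 * rank r then owns slabs [out[r], out[r+1]). Assumes hi - lo >= ranks. */
void bisect(const long *work, size_t lo, size_t hi,
            int ranks, size_t *out, int first_rank)
{
    if (ranks == 1) { out[first_rank] = lo; return; }

    long total = 0;
    for (size_t x = lo; x < hi; ++x) total += work[x];

    /* place the cut where accumulated work best matches the share of
     * ranks assigned to the left half */
    int  left   = ranks / 2;
    long target = (total * left) / ranks, acc = 0;
    size_t cut = lo;
    while (cut < hi && acc + work[cut] <= target) acc += work[cut++];

    /* clamp so each half keeps at least one slab per rank */
    if (cut < lo + (size_t)left)           cut = lo + (size_t)left;
    if (cut > hi - (size_t)(ranks - left)) cut = hi - (size_t)(ranks - left);

    bisect(work, lo, cut, left, out, first_rank);
    bisect(work, cut, hi, ranks - left, out, first_rank + left);
}

A call such as bisect(work, 0, nx, nranks, out, 0) then yields one slab range per rank with roughly equal fluid-node counts.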
The solver used on Titan is implemented with asynchronous communication, which allows the GPUs to continue working uninterrupted while the communication takes place. This thesis discusses how asynchronous communication is handled on GPU systems, and the steps needed to enable it while still maintaining memory access patterns that are well suited to the GPU.
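The overlap pattern this refers to can be sketched with CUDA streams and nonblocking MPI. This is a schematic, not the Titan solver: the kernel stubs, buffer names, grid sizes, and the single left/right neighbour pair are all assumptions made for illustration.

#include <mpi.h>
#include <cuda_runtime.h>

/* Illustrative stand-ins for the real boundary and interior LB kernels. */
__global__ void update_boundary(double *f) { /* ... */ }
__global__ void update_interior(double *f) { /* ... */ }

/* One overlapped time step: the subdomain boundary is updated first so
 * its halo can travel (device -> pinned host buffer -> MPI) while a
 * second stream keeps the GPU busy with the interior update. */
void step_overlapped(double *d_f, double *d_halo,
                     double *h_send, double *h_recv, int halo_n,
                     int left, int right, MPI_Comm comm,
                     cudaStream_t s_bnd, cudaStream_t s_int)
{
    MPI_Request req[2];

    /* boundary sites first, then stage their halo into pinned memory */
    update_boundary<<<128, 256, 0, s_bnd>>>(d_f);
    cudaMemcpyAsync(h_send, d_halo, halo_n * sizeof(double),
                    cudaMemcpyDeviceToHost, s_bnd);

    /* the interior update overlaps with everything below */
    update_interior<<<1024, 256, 0, s_int>>>(d_f);

    /* exchange halos with nonblocking MPI once the copy has finished */
    cudaStreamSynchronize(s_bnd);
    MPI_Irecv(h_recv, halo_n, MPI_DOUBLE, left,  0, comm, &req[0]);
    MPI_Isend(h_send, halo_n, MPI_DOUBLE, right, 0, comm, &req[1]);
    MPI_Waitall(2, req, MPI_STATUSES_IGNORE);

    /* push the received halo back to the device and join both streams */
    cudaMemcpyAsync(d_halo, h_recv, halo_n * sizeof(double),
                    cudaMemcpyHostToDevice, s_bnd);
    cudaStreamSynchronize(s_bnd);
    cudaStreamSynchronize(s_int);
}

The memory-access caveat mentioned above arises because the boundary sites must be gathered into contiguous send buffers; done naively, this splits the lattice traversal into strided accesses that the GPU handles poorly.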
Finally, newer processors derive more and more of their computational power from SIMD vectorization. The thesis examines the effect vectorization has on an LB solver running on a regular Intel Xeon processor and on the manycore Xeon Phi processors, including an in-depth analysis of the key optimization methods applied to the code for the Xeon Phi. Most of these optimizations center on how the fast, on-package memory is used. Design choices, such as the data layout and the addition of manual prefetch instructions, increase how efficiently the memory bandwidth can be utilized.
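A hedged sketch of these two ideas together: with a structure-of-arrays layout, population i of site s lives at f[i*n + s], so consecutive sites are contiguous and the site loop maps directly onto SIMD lanes, while software prefetches request data a fixed distance ahead. The function name and the prefetch distance are illustrative guesses, not values from the thesis.

#include <stddef.h>
#include <xmmintrin.h>   /* _mm_prefetch */

#define Q 9
#define PF_DIST 64       /* prefetch distance in sites; illustrative */

/* Vectorization-friendly BGK collision over a structure-of-arrays
 * layout. The prefetch hints never fault, so running past the end of
 * the arrays near s = n is harmless. Compile with OpenMP SIMD support
 * (e.g. -fopenmp-simd). */
void collide_soa(size_t n, double tau, double *restrict f,
                 const int *restrict cx, const int *restrict cy,
                 const double *restrict w)
{
    #pragma omp simd
    for (size_t s = 0; s < n; ++s) {
        double rho = 0.0, ux = 0.0, uy = 0.0;
        for (int i = 0; i < Q; ++i) {
            /* pull the same population for sites PF_DIST ahead */
            _mm_prefetch((const char *)&f[i * n + s + PF_DIST], _MM_HINT_T1);
            double fi = f[i * n + s];
            rho += fi;
            ux  += fi * cx[i];
            uy  += fi * cy[i];
        }
        ux /= rho; uy /= rho;
        for (int i = 0; i < Q; ++i) {
            double cu  = 3.0 * (cx[i] * ux + cy[i] * uy);
            double feq = w[i] * rho * (1.0 + cu + 0.5 * cu * cu
                                       - 1.5 * (ux * ux + uy * uy));
            f[i * n + s] -= (f[i * n + s] - feq) / tau;
        }
    }
}

On the Xeon Phi, the arrays would additionally be placed in the on-package MCDRAM, for example by running under numactl --membind or by allocating them with memkind's hbw_malloc.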