Concurr Comput. 2020 Mar 10;32(5):e5528. doi: 10.1002/cpe.5528. Epub 2019 Oct 23.

Accelerating simulations of cardiac electrical dynamics through a multi-GPU platform and an optimized data structure

Eduardo C Vasconcellos et al. Concurr Comput.

Abstract

Simulations of cardiac electrophysiological models in tissue, particularly in 3D, require the solution of billions of differential equations even for just a few milliseconds of simulated activity, and are therefore highly demanding in computational resources. In fact, even studies on small domains with very complex models may take several hours to reproduce seconds of electrical cardiac behavior. Today's Graphics Processing Units (GPUs) are becoming a practical way to accelerate such simulations, with the added possibility of running them locally, without the need for supercomputers. Nevertheless, when using GPUs, bottlenecks in global memory access, caused by the spatial discretization of the large tissue domains being simulated, become a major challenge. For simulations on a single GPU, we propose a strategy to accelerate the computation of the diffusion term through a data structure and memory access pattern designed to maximize coalesced memory transactions and minimize branch divergence, achieving results approximately 1.4 times faster than a standard GPU method. We also combine this data structure with a tailored communication strategy to take advantage of multi-GPU platforms. We demonstrate that, with the multi-GPU approach, simulations of 3D tissue can run only about 4× slower than real time.
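As a concrete illustration of the memory access issue the abstract describes, the following minimal CUDA sketch (our illustration, not the paper's towerDS code) implements one explicit time step of the standard 7-point second-order FDM diffusion stencil of Figure 1. Storing the mesh with x as the fastest-varying index lets consecutive threads of a warp load consecutive addresses, so the central-point and y/z-neighbor reads coalesce, while the x-neighbor reads are offset by one element; these are exactly the accesses the paper's data structure is designed to improve. The mesh size, kernel name, and coefficient r are illustrative assumptions.

// Minimal sketch (illustrative, not the authors' code): one explicit step of a
// 7-point second-order FDM diffusion stencil on a regular 3D grid.
#include <cuda_runtime.h>

// Linear index with x fastest-varying, so warps read consecutive addresses.
#define IDX(x, y, z, nx, ny) ((size_t)(z) * (nx) * (ny) + (size_t)(y) * (nx) + (x))

__global__ void diffusionStep(const float *u, float *uNext,
                              int nx, int ny, int nz,
                              float r /* assumed r = D*dt/h^2 */)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int z = blockIdx.z * blockDim.z + threadIdx.z;
    // Interior points only; borders would be handled separately
    // (eg, with the mirror scheme of Figure 22).
    if (x < 1 || y < 1 || z < 1 || x >= nx - 1 || y >= ny - 1 || z >= nz - 1)
        return;
    float u0 = u[IDX(x, y, z, nx, ny)];
    float lap = u[IDX(x - 1, y, z, nx, ny)] + u[IDX(x + 1, y, z, nx, ny)]  // offset by +/-1 element
              + u[IDX(x, y - 1, z, nx, ny)] + u[IDX(x, y + 1, z, nx, ny)]  // coalesced
              + u[IDX(x, y, z - 1, nx, ny)] + u[IDX(x, y, z + 1, nx, ny)]  // coalesced
              - 6.0f * u0;
    uNext[IDX(x, y, z, nx, ny)] = u0 + r * lap;
}

int main()
{
    const int nx = 64, ny = 64, nz = 64;  // illustrative mesh size
    size_t bytes = (size_t)nx * ny * nz * sizeof(float);
    float *u, *uNext;
    cudaMalloc(&u, bytes);
    cudaMalloc(&uNext, bytes);
    cudaMemset(u, 0, bytes);
    dim3 block(32, 4, 1);  // 32 threads along x so each warp spans one x-row
    dim3 grid((nx + block.x - 1) / block.x, (ny + block.y - 1) / block.y, nz);
    diffusionStep<<<grid, block>>>(u, uNext, nx, ny, nz, 0.1f);
    cudaDeviceSynchronize();
    cudaFree(u);
    cudaFree(uNext);
    return 0;
}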

Keywords: GPU Computing; cardiac electrophysiology models; memory access optimization; parallel cardiac dynamics simulations.

Conflict of interest statement

The authors declare no potential conflict of interest.

Figures

FIGURE 1
3D stencil representing data required for calculating the value at the next numerical time step t + 1 at each domain point U0 using a standard second-order FDM
FIGURE 2
3D mesh used in FDM. A, non-partitioned domain; B, CUDA blocks partition
FIGURE 3
Representation of memory access pattern for sequential threads when computing Equation (5). From left to right, we show data required by each thread from points (x, y, z − 1), (x, y − 1, z), (x, y, z), (x, y + 1, z), and (x, y, z + 1)
FIGURE 4
Representation of memory access pattern for sequential threads when accessing data from (x − 1, y, z) and (x + 1, y, z) to solve Equation (5). A, (x − 1, y, z) points; B, (x + 1, y, z) points
FIGURE 5
Geometric representation of data required by a 2D CUDA block. Colored cells highlight neighboring data required for computation
FIGURE 6
Access pattern for core data on global memory
FIGURE 7
Access pattern for y neighborhoods
FIGURE 8
Access pattern for x neighborhoods
FIGURE 9
3D data structure representation for a 32 × 32 × 32 mesh. A, Mesh division in towers; B, Tower neighborhood; C, Proposed data structure
FIGURE 10
A simple representation of data positions to be accessed by an 8 × 4 thread block. A, Geometric data distribution in the mesh; B, Geometric data distribution in the proposed structure; C, Data distribution in global memory
FIGURE 11
Global memory access pattern
FIGURE 12
Global memory write pattern
FIGURE 13
Computation time as a function of block size for the GTX 1080 Ti. This experiment computed 20 000 time steps
FIGURE 14
Computation time as a function of block size for the GTX Titan X. This experiment computed 20 000 time steps
FIGURE 15
Computation time as a function of block size for the Tesla P100. This experiment computed 20 000 time steps
FIGURE 16
Sub-domain buffers used to communicate z borders between different GPUs. The right side of the figure shows the scheme used with the multi-GPU/multi-stream strategy
FIGURE 17
Communication strategy for three sequential GPUs during the computation of two consecutive time steps (a simplified code sketch follows this figure list)
FIGURE 18
Performance evaluation of different multi-GPU strategies. In the legends, S denotes the streams strategy and SP the streams + page-locked memory strategy
FIGURE 19
Average number of cells processed per second for the fastest block setup for each experiment
FIGURE 20
Electrical stimulus propagation along the z direction
FIGURE 21
Spiral wave simulated with our towerDS implementation for 2 seconds of physical time (40 000 time steps)
FIGURE 22
Mirror scheme applied as boundary conditions at mesh borders
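Figures 16 to 18 describe the multi-GPU scheme: each GPU owns a z-slab of the mesh and, at every time step, exchanges its border z-planes with its neighbors through dedicated buffers, using per-GPU CUDA streams and page-locked (pinned) host memory to overlap transfers with computation. The sketch below is a hypothetical two-GPU reduction of that idea; the buffer names, sizes, and the staging-through-host path are our assumptions, not the authors' code (with peer access enabled, cudaMemcpyPeerAsync could move the plane directly between devices).

// Hypothetical sketch (not the authors' code): sending GPU 0's top interior
// z-plane into GPU 1's bottom ghost plane via a page-locked staging buffer.
#include <cuda_runtime.h>

int main()
{
    const int nx = 64, ny = 64, nzLocal = 32;              // per-GPU sub-domain (assumed)
    const size_t plane = (size_t)nx * ny * sizeof(float);  // one z-plane
    float *d[2];        // device fields: nzLocal interior planes + 2 ghost planes
    float *halo;        // page-locked host staging buffer for one z-plane
    cudaStream_t s[2];  // one stream per GPU, so copies can overlap kernels

    for (int g = 0; g < 2; ++g) {
        cudaSetDevice(g);
        cudaMalloc(&d[g], plane * (nzLocal + 2));
        cudaStreamCreate(&s[g]);
    }
    cudaHostAlloc(&halo, plane, cudaHostAllocDefault);  // pinned => truly async copies

    // Planes are indexed 0 (bottom ghost), 1..nzLocal (interior), nzLocal+1 (top ghost).
    // Copy GPU 0's top interior plane out, then into GPU 1's bottom ghost plane.
    cudaSetDevice(0);
    cudaMemcpyAsync(halo, d[0] + (size_t)nzLocal * nx * ny,
                    plane, cudaMemcpyDeviceToHost, s[0]);
    cudaStreamSynchronize(s[0]);
    cudaSetDevice(1);
    cudaMemcpyAsync(d[1], halo, plane, cudaMemcpyHostToDevice, s[1]);
    cudaStreamSynchronize(s[1]);
    // While border planes are in flight, each GPU could launch its interior
    // stencil kernel on its own stream, as in the multi-stream strategy of
    // Figure 17.

    for (int g = 0; g < 2; ++g) {
        cudaSetDevice(g);
        cudaFree(d[g]);
        cudaStreamDestroy(s[g]);
    }
    cudaFreeHost(halo);
    return 0;
}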
