Concurr Comput. 2020 Mar 10;32(5):e5528. doi: 10.1002/cpe.5528. Epub 2019 Oct 23.

Accelerating simulations of cardiac electrical dynamics through a multi-GPU platform and an optimized data structure

Eduardo C Vasconcellos et al. Concurr Comput.

Abstract

Simulations of cardiac electrophysiological models in tissue, particularly in 3D, require the solution of billions of differential equations even for just a few milliseconds of simulated activity, and are therefore highly demanding in computational resources. In fact, even studies on small domains with very complex models may take several hours to reproduce seconds of electrical cardiac behavior. Today's Graphics Processing Units (GPUs) are becoming a practical way to accelerate such simulations, with the added possibility of running them locally, without the need for supercomputers. Nevertheless, when using GPUs, bottlenecks in global memory access, caused by the spatial discretization of the large tissue domains being simulated, become a major challenge. For simulations on a single GPU, we propose a strategy to accelerate the computation of the diffusion term through a data structure and memory access pattern designed to maximize coalesced memory transactions and minimize branch divergence, achieving results approximately 1.4 times faster than a standard GPU method. We also combine this data structure with a tailored communication strategy to take advantage of multi-GPU platforms. We demonstrate that, with the multi-GPU approach, simulations of 3D tissue can run only about 4× slower than real time.
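As a concrete illustration of the memory access issue the abstract describes, the following minimal CUDA sketch (our illustration, not the paper's towerDS code) implements one explicit time step of the standard 7-point second-order FDM diffusion stencil of Figure 1. Storing the mesh with x as the fastest-varying index lets consecutive threads of a warp load consecutive addresses, so the central-point and y/z-neighbor reads coalesce, while the x-neighbor reads are offset by one element; these are exactly the accesses the paper's data structure is designed to improve. The mesh size, kernel name, and coefficient r are illustrative assumptions.

// Minimal sketch (illustrative, not the authors' code): one explicit step of a
// 7-point second-order FDM diffusion stencil on a regular 3D grid.
#include <cuda_runtime.h>

// Linear index with x fastest-varying, so warps read consecutive addresses.
#define IDX(x, y, z, nx, ny) ((size_t)(z) * (nx) * (ny) + (size_t)(y) * (nx) + (x))

__global__ void diffusionStep(const float *u, float *uNext,
                              int nx, int ny, int nz,
                              float r /* assumed r = D*dt/h^2 */)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int z = blockIdx.z * blockDim.z + threadIdx.z;
    // Interior points only; borders would be handled separately
    // (eg, with the mirror scheme of Figure 22).
    if (x < 1 || y < 1 || z < 1 || x >= nx - 1 || y >= ny - 1 || z >= nz - 1)
        return;
    float u0 = u[IDX(x, y, z, nx, ny)];
    float lap = u[IDX(x - 1, y, z, nx, ny)] + u[IDX(x + 1, y, z, nx, ny)]  // offset by +/-1 element
              + u[IDX(x, y - 1, z, nx, ny)] + u[IDX(x, y + 1, z, nx, ny)]  // coalesced
              + u[IDX(x, y, z - 1, nx, ny)] + u[IDX(x, y, z + 1, nx, ny)]  // coalesced
              - 6.0f * u0;
    uNext[IDX(x, y, z, nx, ny)] = u0 + r * lap;
}

int main()
{
    const int nx = 64, ny = 64, nz = 64;  // illustrative mesh size
    size_t bytes = (size_t)nx * ny * nz * sizeof(float);
    float *u, *uNext;
    cudaMalloc(&u, bytes);
    cudaMalloc(&uNext, bytes);
    cudaMemset(u, 0, bytes);
    dim3 block(32, 4, 1);  // 32 threads along x so each warp spans one x-row
    dim3 grid((nx + block.x - 1) / block.x, (ny + block.y - 1) / block.y, nz);
    diffusionStep<<<grid, block>>>(u, uNext, nx, ny, nz, 0.1f);
    cudaDeviceSynchronize();
    cudaFree(u);
    cudaFree(uNext);
    return 0;
}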

Keywords: GPU Computing; cardiac electrophysiology models; memory access optimization; parallel cardiac dynamics simulations.

Conflict of interest statement

The authors declare no potential conflict of interest.

Figures

FIGURE 1
3D stencil representing data required for calculating the value at the next numerical time step t + 1 at each domain point U0 using a standard second-order FDM
FIGURE 2
3D mesh used in FDM. A, non-partitioned domain; B, CUDA blocks partition
FIGURE 3
Representation of memory access pattern for sequential threads when computing Equation (5). From left to right, we show data required by each thread from points (x, y, z − 1), (x, y − 1, z), (x, y, z), (x, y + 1, z), and (x, y, z + 1)
FIGURE 4
Representation of memory access pattern for sequential threads when accessing data from (x − 1, y, z) and (x + 1, y, z) to solve Equation (5). A, (x − 1, y, z) points; B, (x + 1, y, z) points
FIGURE 5
Geometric representation of data required by a 2D CUDA block. Colored cells highlight neighboring data required for computation
FIGURE 6
Access pattern for core data on global memory
FIGURE 7
Access pattern for y neighborhoods
FIGURE 8
Access pattern for x neighborhoods
FIGURE 9
3D data structure representation for a 32 × 32 × 32 mesh. A, Mesh division in towers; B, Tower neighborhood; C, Proposed data structure
FIGURE 10
A simple representation of data positions to be accessed by an 8 × 4 thread block. A, Geometric data distribution in the mesh; B, Geometric data distribution in the proposed structure; C, Data distribution in global memory
FIGURE 11
Global memory access pattern
FIGURE 12
Global memory write pattern
FIGURE 13
Computation time as a function of block size for the GTX 1080 Ti. This experiment computed 20 000 time steps
FIGURE 14
Computation time as a function of block size for the GTX Titan X. This experiment computed 20 000 time steps
FIGURE 15
Computation time as a function of block size for the Tesla P100. This experiment computed 20 000 time steps
FIGURE 16
Sub-domain buffers used to communicate z borders between different GPUs. The right side of the figure shows the scheme used with the multi-GPU/multi-stream strategy
FIGURE 17
Communication strategy for three sequential GPUs during the computation of two consecutive time steps (a simplified code sketch follows this figure list)
FIGURE 18
Performance evaluation of different multi-GPU strategies. In the legends, S denotes the streams strategy and SP the streams + page-locked memory strategy
FIGURE 19
Average number of cells processed per second for the fastest block setup for each experiment
FIGURE 20
Electrical stimulus propagation along the z direction
FIGURE 21
Spiral wave simulated with our towerDS implementation for 2 seconds of physical time (40 000 time steps)
FIGURE 22
Mirror scheme applied as boundary conditions at mesh borders
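Figures 16 to 18 describe the multi-GPU scheme: each GPU owns a z-slab of the mesh and, at every time step, exchanges its border z-planes with its neighbors through dedicated buffers, using per-GPU CUDA streams and page-locked (pinned) host memory to overlap transfers with computation. The sketch below is a hypothetical two-GPU reduction of that idea; the buffer names, sizes, and the staging-through-host path are our assumptions, not the authors' code (with peer access enabled, cudaMemcpyPeerAsync could move the plane directly between devices).

// Hypothetical sketch (not the authors' code): sending GPU 0's top interior
// z-plane into GPU 1's bottom ghost plane via a page-locked staging buffer.
#include <cuda_runtime.h>

int main()
{
    const int nx = 64, ny = 64, nzLocal = 32;              // per-GPU sub-domain (assumed)
    const size_t plane = (size_t)nx * ny * sizeof(float);  // one z-plane
    float *d[2];        // device fields: nzLocal interior planes + 2 ghost planes
    float *halo;        // page-locked host staging buffer for one z-plane
    cudaStream_t s[2];  // one stream per GPU, so copies can overlap kernels

    for (int g = 0; g < 2; ++g) {
        cudaSetDevice(g);
        cudaMalloc(&d[g], plane * (nzLocal + 2));
        cudaStreamCreate(&s[g]);
    }
    cudaHostAlloc(&halo, plane, cudaHostAllocDefault);  // pinned => truly async copies

    // Planes are indexed 0 (bottom ghost), 1..nzLocal (interior), nzLocal+1 (top ghost).
    // Copy GPU 0's top interior plane out, then into GPU 1's bottom ghost plane.
    cudaSetDevice(0);
    cudaMemcpyAsync(halo, d[0] + (size_t)nzLocal * nx * ny,
                    plane, cudaMemcpyDeviceToHost, s[0]);
    cudaStreamSynchronize(s[0]);
    cudaSetDevice(1);
    cudaMemcpyAsync(d[1], halo, plane, cudaMemcpyHostToDevice, s[1]);
    cudaStreamSynchronize(s[1]);
    // While border planes are in flight, each GPU could launch its interior
    // stencil kernel on its own stream, as in the multi-stream strategy of
    // Figure 17.

    for (int g = 0; g < 2; ++g) {
        cudaSetDevice(g);
        cudaFree(d[g]);
        cudaStreamDestroy(s[g]);
    }
    cudaFreeHost(halo);
    return 0;
}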
