Mach Learn Sci Technol. 2024 Jun 1;5(2):027001.
doi: 10.1088/2632-2153/ad51c9. Epub 2024 Jun 13.

GPU optimization techniques to accelerate optiGAN-a particle simulation GAN


Anirudh Srikanth et al. Mach Learn Sci Technol. 2024.

Abstract

The demand for specialized hardware to train AI models has increased in tandem with the growth in model complexity over recent years. The graphics processing unit (GPU) is one such piece of hardware, capable of parallelizing operations performed on large chunks of data. Companies like Nvidia, AMD, and Google have been constantly scaling up hardware performance as fast as they can. Nevertheless, there is still a gap between the required processing power and the processing capacity of the hardware. To increase hardware utilization, the software has to be optimized too. In this paper, we present general GPU optimization techniques we used to efficiently train the optiGAN model, a generative adversarial network capable of generating multidimensional probability distributions of optical photons at the photodetector face in radiation detectors, on an 8 GB Nvidia Quadro RTX 4000 GPU. We analyze and compare the performance of all the optimizations based on execution time and memory consumption using the Nvidia Nsight Systems profiler tool. The optimizations yielded an approximately 4.5x increase in runtime performance compared to naive training on the GPU, without compromising model performance. Finally, we discuss future work on optiGAN and how we plan to scale the model across GPUs.

Keywords: Monte-Carlo simulation; generative adversarial networks; graphics processing unit; multidimensional probability distributions; performance optimization; radiation detector.


Figures

Figure 1. CPU and GPU architectures.
Figure 2. OptiGAN training dataset. Accurate optical simulations were performed at different emission positions inside a crystal to train and test the conditional generative adversarial network optiGAN. Optical photon distributions (positions, directions, and energy) were stored for 140 emission points in a multidimensional matrix that included the source 3D emission positions. This tabular data was used as the high-fidelity training dataset of optiGAN.
Figure 3. OptiGAN architecture (from Trigila et al (2023)). It consists of a generator (a) and a discriminator/critic network (b) with H = 128 hidden nodes and a ReLU activation function.
Figure 4. Automatic mixed precision training pipeline.
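The pipeline in figure 4 relies on loss scaling to keep small half-precision gradients from underflowing to zero. As a minimal, framework-free sketch of that idea (not the authors' actual training pipeline), the underflow problem and its fix can be demonstrated with NumPy half-precision arithmetic:

```python
import numpy as np

# A gradient too small for fp16: the smallest fp16 subnormal is ~6e-8,
# so 1e-8 flushes to zero when stored in half precision.
grad_fp32 = np.float32(1e-8)
assert np.float16(grad_fp32) == 0.0  # underflow: the update would be lost

# Loss scaling, as used in automatic mixed precision: multiply the loss
# (and hence its gradients) by a large constant before the fp16 part of
# the backward pass, then unscale in fp32 before the optimizer step.
scale = np.float32(2.0 ** 16)
scaled_grad_fp16 = np.float16(grad_fp32 * scale)   # now representable in fp16
recovered = np.float32(scaled_grad_fp16) / scale   # unscaled back in fp32
print(recovered)  # close to the original 1e-8, up to fp16 rounding
```

In practice, frameworks pick and adjust the scale factor dynamically, but the principle is the same: shift small gradients into the representable fp16 range for the half-precision portion of the computation.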
Figure 5. GPU profiling results using Nvidia Nsight Systems. It shows the runtime performance of different sections of the model training (such as data loading and the generator and discriminator training processes), GPU memory usage, GPU utilization (SM active and SM instructions), and tensor core activity. These events were sampled at a rate of 10 kHz. The pink regions consist of several vertical lines representing the percentage of GPU cores active at each moment, and the blue regions represent the percentage of instructions issued by the SMs at each moment.
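Profiles like the one in figure 5 are collected from the command line with the Nsight Systems CLI. A representative invocation (the training script name here is hypothetical) looks like:

```shell
# Trace CUDA API calls, kernels, NVTX ranges, and OS runtime calls,
# with CPU sampling enabled; writes report.nsys-rep for the Nsight GUI.
nsys profile --trace=cuda,nvtx,osrt \
             --sample=cpu \
             -o report \
             python train_optigan.py
```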
Figure 6. Dataloader optimization results. The runtime (22.9 s) dropped to roughly half the previous GPU runtime shown in figure 5. The GPU is active during most of the training duration. However, memory usage was not optimized with this technique.
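A common way to realize the dataloader gains shown above is to overlap batch preparation with computation so the GPU is not left idle between steps. A stdlib-only sketch of a background prefetcher (a simplification for illustration, not the paper's actual dataloading code) could look like:

```python
import queue
import threading

def prefetch(batches, depth=2):
    """Yield batches while a background thread prepares the next ones."""
    q = queue.Queue(maxsize=depth)
    _END = object()  # sentinel marking the end of the stream

    def producer():
        for b in batches:
            q.put(b)  # blocks once `depth` batches are buffered ahead
        q.put(_END)

    threading.Thread(target=producer, daemon=True).start()
    while (item := q.get()) is not _END:
        yield item

# Usage: iterate as usual; loading overlaps with the work in the loop body.
totals = [sum(batch) for batch in prefetch([[1, 2], [3, 4], [5, 6]])]
print(totals)  # [3, 7, 11]
```

The bounded queue caps memory growth, which matters here since, as the caption notes, this technique improves runtime but not memory usage.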
Figure 7. Automatic mixed precision optimization results. The runtime was further reduced (10.4 s) and memory usage dropped to 3.37 GiB by using tensor cores, which specialize in deep learning computation. This also makes it possible to increase the batch size.
Figure 8. optiGAN model execution time and batch size comparison on CPU and GPU.
Figure 9. Execution time comparison of the GPU optimizations.
