PLoS One. 2021 Sep 14;16(9):e0257047. doi: 10.1371/journal.pone.0257047. eCollection 2021.

Off-chip prefetching based on Hidden Markov Model for non-volatile memory architectures


Adrián Lamela et al. PLoS One. 2021.

Abstract

Non-volatile memory technology is now available in commodity hardware. This technology can be used as a backing memory behind an external DRAM cache without any software modification. However, the higher read and write latencies of non-volatile memory may exacerbate the memory wall problem. In this work we present a novel off-chip prefetch technique based on a Hidden Markov Model that specifically deals with the latency problem caused by the complexity of off-chip memory access patterns. First, we present a thorough analysis of off-chip memory access patterns to characterize their complexity in multicore processors. Based on this study, we propose a prefetching module located in the LLC that uses two small tables and whose computational complexity is linear in the number of computing threads. Our Markov-based technique is able to track and cluster several simultaneous groups of memory accesses coming from multiple threads running on a multicore processor. It can quickly identify complex address groups and trigger prefetches with very high accuracy. Our simulations show an improvement of up to 76% in the hit ratio of an off-chip DRAM cache for a multicore architecture over the conventional prefetch technique (G/DC). In addition, the overhead of prefetch requests (failed prefetches) is reduced by 48% in single-core simulations and by 83% in multicore simulations.
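As a rough illustration of the clustering idea described in the abstract (a minimal sketch, not the authors' hardware design), the following C++ fragment keeps a small table of address groups and assigns each LLC miss address to the nearest existing group, or replaces an old group when none is close enough. The table size, join distance, centroid update and replacement policy are illustrative assumptions rather than parameters taken from the paper.

    // Illustrative sketch only: clustering off-chip miss addresses into a small,
    // fixed number of simultaneously tracked "groups", one table entry per group.
    // All names and thresholds below are assumptions, not the paper's design.
    #include <cstddef>
    #include <cstdint>
    #include <cstdio>
    #include <limits>
    #include <vector>

    struct Group {
        uint64_t centroid = 0;   // running estimate of the group's address region
        uint64_t last_use = 0;   // timestamp for simple LRU-style replacement
        bool     valid    = false;
    };

    class GroupTable {
        std::vector<Group> groups_;
        uint64_t clock_ = 0;
        static constexpr uint64_t kJoinDistance = 1u << 16;  // assumed clustering radius (bytes)

    public:
        explicit GroupTable(std::size_t n_groups) : groups_(n_groups) {}

        // Assign an LLC miss address to the nearest existing group, or
        // allocate/replace a group if no current group is close enough.
        std::size_t assign(uint64_t miss_addr) {
            ++clock_;
            std::size_t best = groups_.size();
            uint64_t best_dist = std::numeric_limits<uint64_t>::max();
            for (std::size_t i = 0; i < groups_.size(); ++i) {
                if (!groups_[i].valid) continue;
                uint64_t d = miss_addr > groups_[i].centroid
                                 ? miss_addr - groups_[i].centroid
                                 : groups_[i].centroid - miss_addr;
                if (d < best_dist) { best_dist = d; best = i; }
            }
            if (best != groups_.size() && best_dist <= kJoinDistance) {
                // Pull the centroid toward the new access (simple running average).
                groups_[best].centroid = (groups_[best].centroid + miss_addr) / 2;
                groups_[best].last_use = clock_;
                return best;
            }
            // No close group: take a free slot, or evict the least recently used one.
            std::size_t victim = 0;
            for (std::size_t i = 0; i < groups_.size(); ++i) {
                if (!groups_[i].valid) { victim = i; break; }
                if (groups_[i].last_use < groups_[victim].last_use) victim = i;
            }
            groups_[victim] = {miss_addr, clock_, true};
            return victim;
        }
    };

    int main() {
        GroupTable table(4);  // e.g., four simultaneously tracked groups
        const uint64_t misses[] = {0x1000, 0x1200, 0x800000, 0x1100, 0x800400};
        for (uint64_t m : misses)
            std::printf("miss %#llx -> group %zu\n",
                        static_cast<unsigned long long>(m), table.assign(m));
    }

In this toy run the two interleaved access regions around 0x1000 and 0x800000 end up in separate table entries, which is the kind of simultaneous-group separation the paper attributes to its HMM-based tracker.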


Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1
Fig 1. Temporal locality (TL) with four LLC configurations.
Temporal locality of lbm, libquantum and omnetpp, represented as the probability of the time interval between consecutive accesses (100 is the forget threshold in this figure), with four different LLC configurations: sizes of 16 MB and 32 MB and associativities of 1 and 8. The number of off-chip accesses is included, showing the ability of the LLC to filter cache misses.
Fig 2
Fig 2. Example of two groups identified by the HMM in the mcf benchmark.
The figures represent the off-chip accessed lines over time and the address intervals (red lines) that our prefetcher identifies and may use to trigger DRAM cache prefetches. These groups appear simultaneously and are identified, isolated and grouped by our HMM proposal, so prefetches may be individualized for each group.
Fig 3
Fig 3. Analysis of spatial locality in the mcf benchmark.
Spatial locality is modeled using off-chip OPKC, so when OPKC increases, the hit ratio of the external cache becomes critical. (a) shows how, starting at cycle 1100 million, OPKC increases in all LLC configurations. (b) shows how, starting at cycle 1100 million, our HMM proposal achieves a very good hit ratio when off-chip cache pressure increases, clearly beating the G/DC prefetch technique in this scenario.
Fig 4
Fig 4. Frequency analysis with four LLC configurations.
In this figure we use off-chip OPKC to describe the types of misses related to LLC size and associativity. lbm OPKC is independent of the LLC configuration; libquantum misses decrease with LLC size, so they are capacity misses; omnetpp misses diminish when associativity increases, so they are conflict misses; and, finally, milc shows both types.
Fig 5
Fig 5. Schematic architecture.
Proposed virtual address (VA) based architecture for off-chip prefetching. The use of VAs allows the prefetcher to exploit all the locality information at the cost of increased memory and energy use to store tags and ASID information. The number of VA-to-PA translations is greatly reduced due to the positive effect of the cache hierarchy in reducing off-chip accesses. Off-chip prefetchers move data/instructions in advance from NVM-RAM to the DRAM cache.
Fig 6
Fig 6. Symbolic representation of Hidden Markov Model (HMM).
Fig 7
Fig 7. Example of astar spatial locality.
Algorithmic complexity in astar, with multiple simultaneous groups that are isolated and identified by our HMM proposal. In (c), the different groups are represented by colors.
Fig 8
Fig 8. Main areas of spatial locality in astar.
Example of four groups identified in astar by the HMM. This information is fed to the prefetcher to obtain intervals of addresses with a high probability of future use.
Fig 9
Fig 9. Main areas of spatial locality in astar with recognized intervals.
Based on the group identification, the HMM obtains address intervals with a high probability of use, shown in this figure between the red lines.
Fig 10
Fig 10. Prefetch circuit.
Schematic description of the on-chip implementation of the prefetcher. The LLC miss address (q) is used to identify a group and generate the prefetch address interval, or to create a new group based on the nearest current group. This implementation of the HMM allows for precise identification of the simultaneous off-chip memory groups accessed by the different processes running on a multicore chip.
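As a complement to the caption of Fig 10, and continuing the illustrative sketch given after the abstract, a minimal version of the interval-generation step could look as follows; the window size and cache-line size are assumptions, not parameters reported in the paper.

    // Illustrative sketch only: turn the centroid of the group that the LLC miss
    // address q was assigned to into a cache-line-aligned prefetch interval.
    #include <cstdint>
    #include <utility>

    constexpr uint64_t kLineBytes   = 64;        // assumed cache-line size
    constexpr uint64_t kWindowBytes = 8 * 1024;  // assumed prefetch window per group

    // Returns the [first_byte, last_byte) address interval to prefetch into the DRAM cache.
    std::pair<uint64_t, uint64_t> prefetch_interval(uint64_t group_centroid) {
        uint64_t lo = group_centroid > kWindowBytes ? group_centroid - kWindowBytes : 0;
        uint64_t hi = group_centroid + kWindowBytes;
        lo &= ~(kLineBytes - 1);                           // align down to a line boundary
        hi = (hi + kLineBytes - 1) & ~(kLineBytes - 1);    // align up to a line boundary
        return {lo, hi};
    }

A real implementation would also filter out lines already resident in the DRAM cache (or already in flight) before issuing the prefetches.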
Fig 11
Fig 11. Hit ratio and overhead of base, HMM, G/DC and G/AC in a single-core architecture.
The hit ratio for the base experiment is represented by a horizontal line (0-6.3% for all benchmarks). The last plot is the geometric mean of all benchmarks.
Fig 12
Fig 12. Hit ratio and overhead of base, HMM, G/DC and G/AC in a 9-core architecture.
The hit ratio of the base experiment is shown in the box. Each mix consists of nine benchmarks.
Fig 13
Fig 13. Hit ratio and overhead of base, HMM and G/DC in a 16-core architecture and in a multiprogrammed 4-core architecture.
The hit ratio of the base experiment is shown in the box.
