PLoS One. 2021 Sep 14;16(9):e0257047. doi: 10.1371/journal.pone.0257047. eCollection 2021.

Off-chip prefetching based on Hidden Markov Model for non-volatile memory architectures


Adrián Lamela et al. PLoS One. 2021.

Abstract

Non-volatile memory technology is now available in commodity hardware. This technology can be used as a backing memory behind an external DRAM cache without any software modification. However, the higher read and write latencies of non-volatile memory may exacerbate the memory wall problem. In this work we present a novel off-chip prefetch technique based on a Hidden Markov Model that specifically deals with the latency problem caused by the complexity of off-chip memory access patterns. First, we present a thorough analysis of off-chip memory access patterns to characterize their complexity in multicore processors. Based on this study, we propose a prefetching module located in the LLC that uses two small tables and whose computational complexity is linear in the number of computing threads. Our Markov-based technique is able to track and cluster several simultaneous groups of memory accesses coming from multiple threads running on a multicore processor. It can quickly identify complex address groups and trigger prefetches with very high accuracy. Our simulations show an improvement of up to 76% in the hit ratio of an off-chip DRAM cache for a multicore architecture over the conventional prefetch technique (G/DC). In addition, the overhead of prefetch requests (failed prefetches) is reduced by 48% in single-core simulations and by 83% in multicore simulations.
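As a rough illustration of the clustering idea described in the abstract (a minimal sketch, not the authors' hardware design), the following C++ fragment keeps a small table of address groups and assigns each LLC miss address to the nearest existing group, or replaces an old group when none is close enough. The table size, join distance, centroid update and replacement policy are illustrative assumptions rather than parameters taken from the paper.

    // Illustrative sketch only: clustering off-chip miss addresses into a small,
    // fixed number of simultaneously tracked "groups", one table entry per group.
    // All names and thresholds below are assumptions, not the paper's design.
    #include <cstddef>
    #include <cstdint>
    #include <cstdio>
    #include <limits>
    #include <vector>

    struct Group {
        uint64_t centroid = 0;   // running estimate of the group's address region
        uint64_t last_use = 0;   // timestamp for simple LRU-style replacement
        bool     valid    = false;
    };

    class GroupTable {
        std::vector<Group> groups_;
        uint64_t clock_ = 0;
        static constexpr uint64_t kJoinDistance = 1u << 16;  // assumed clustering radius (bytes)

    public:
        explicit GroupTable(std::size_t n_groups) : groups_(n_groups) {}

        // Assign an LLC miss address to the nearest existing group, or
        // allocate/replace a group if no current group is close enough.
        std::size_t assign(uint64_t miss_addr) {
            ++clock_;
            std::size_t best = groups_.size();
            uint64_t best_dist = std::numeric_limits<uint64_t>::max();
            for (std::size_t i = 0; i < groups_.size(); ++i) {
                if (!groups_[i].valid) continue;
                uint64_t d = miss_addr > groups_[i].centroid
                                 ? miss_addr - groups_[i].centroid
                                 : groups_[i].centroid - miss_addr;
                if (d < best_dist) { best_dist = d; best = i; }
            }
            if (best != groups_.size() && best_dist <= kJoinDistance) {
                // Pull the centroid toward the new access (simple running average).
                groups_[best].centroid = (groups_[best].centroid + miss_addr) / 2;
                groups_[best].last_use = clock_;
                return best;
            }
            // No close group: take a free slot, or evict the least recently used one.
            std::size_t victim = 0;
            for (std::size_t i = 0; i < groups_.size(); ++i) {
                if (!groups_[i].valid) { victim = i; break; }
                if (groups_[i].last_use < groups_[victim].last_use) victim = i;
            }
            groups_[victim] = {miss_addr, clock_, true};
            return victim;
        }
    };

    int main() {
        GroupTable table(4);  // e.g., four simultaneously tracked groups
        const uint64_t misses[] = {0x1000, 0x1200, 0x800000, 0x1100, 0x800400};
        for (uint64_t m : misses)
            std::printf("miss %#llx -> group %zu\n",
                        static_cast<unsigned long long>(m), table.assign(m));
    }

In this toy run the two interleaved access regions around 0x1000 and 0x800000 end up in separate table entries, which is the kind of simultaneous-group separation the paper attributes to its HMM-based tracker.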


Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1
Fig 1. Temporal locality (TL) with four LLC configurations.
Temporal locality of lbm, libquantum and omnetpp, represented as the probability of the time interval between consecutive accesses (100 is the forget threshold in this figure), with four different LLC configurations: sizes of 16 MB and 32 MB and associativities of 1 and 8. The number of off-chip accesses is included, showing the ability of the LLC to filter cache misses.
Fig 2
Fig 2. Example of two groups identified by the HMM in the mcf benchmark.
The figures represent the off-chip accessed lines over time and the address intervals (red lines) that our prefetcher identifies and may use to trigger DRAM cache prefetches. These groups appear simultaneously and are identified, isolated and grouped by our HMM proposal, so prefetches may be individualized for each group.
Fig 3
Fig 3. Analysis of spatial locality in the mcf benchmark.
Spatial locality is modeled using off-chip OPKC, so when OPKC increases, the hit ratio of the external cache becomes critical. (a) shows how, starting at cycle 1100 million, OPKC increases in all LLC configurations. (b) shows how, starting at cycle 1100 million, our HMM proposal achieves a very good hit ratio when off-chip cache pressure increases, clearly beating the G/DC prefetch technique in this scenario.
Fig 4
Fig 4. Frequency analysis with four LLC configurations.
In this figure we use off-chip OPKC to describe the types of misses related to LLC size and associativity. lbm OPKC is independent of the LLC configuration; libquantum misses decrease with LLC size, so they are capacity misses; omnetpp misses diminish when associativity increases, so they are conflict misses; and, finally, milc shows both types.
Fig 5
Fig 5. Schematic architecture.
Proposed virtual address (VA) based architecture for off-chip prefetching. The use of VAs allows the prefetcher to exploit all the locality information at the cost of increased memory and energy use to store tags and ASID information. The number of VA-to-PA translations is greatly reduced due to the positive effect of the cache hierarchy in reducing off-chip accesses. Off-chip prefetchers move data/instructions in advance from NVM-RAM to the DRAM cache.
Fig 6
Fig 6. Symbolic representation of Hidden Markov Model (HMM).
Fig 7
Fig 7. Example of astar spatial locality.
Algorithmic complexity in astar, with multiple simultaneous groups that are isolated and identified by our HMM proposal. In (c), the different groups are represented by colors.
Fig 8
Fig 8. Main areas of spatial locality in astar.
Example of four groups identified in astar by the HMM. This information is fed to the prefetcher to obtain intervals of addresses with a high probability of future use.
Fig 9
Fig 9. Main areas of spatial locality in astar with recognized intervals.
Based on the group identification, the HMM obtains address intervals with a high probability of use, shown in this figure between the red lines.
Fig 10
Fig 10. Prefetch circuit.
Schematic description of the on-chip implementation of the prefetcher. The LLC miss address (q) is used to identify a group and generate the prefetch address interval, or to create a new group based on the nearest current group. This implementation of the HMM allows for precise identification of the simultaneous off-chip memory groups accessed by the different processes running on a multicore chip.
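As a complement to the caption of Fig 10, and continuing the illustrative sketch given after the abstract, a minimal version of the interval-generation step could look as follows; the window size and cache-line size are assumptions, not parameters reported in the paper.

    // Illustrative sketch only: turn the centroid of the group that the LLC miss
    // address q was assigned to into a cache-line-aligned prefetch interval.
    #include <cstdint>
    #include <utility>

    constexpr uint64_t kLineBytes   = 64;        // assumed cache-line size
    constexpr uint64_t kWindowBytes = 8 * 1024;  // assumed prefetch window per group

    // Returns the [first_byte, last_byte) address interval to prefetch into the DRAM cache.
    std::pair<uint64_t, uint64_t> prefetch_interval(uint64_t group_centroid) {
        uint64_t lo = group_centroid > kWindowBytes ? group_centroid - kWindowBytes : 0;
        uint64_t hi = group_centroid + kWindowBytes;
        lo &= ~(kLineBytes - 1);                           // align down to a line boundary
        hi = (hi + kLineBytes - 1) & ~(kLineBytes - 1);    // align up to a line boundary
        return {lo, hi};
    }

A real implementation would also filter out lines already resident in the DRAM cache (or already in flight) before issuing the prefetches.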
Fig 11
Fig 11. Hit ratio and overhead of base, HMM, G/DC and G/AC in a single-core architecture.
The hit ratio for the base experiment is represented by a horizontal line (0-6.3% for all benchmarks). The last plot is the geometric mean of all benchmarks.
Fig 12
Fig 12. Hit ratio and overhead of base, HMM, G/DC and G/AC in a 9-core architecture.
The hit ratio of the base experiment is shown in the box. Each mix consists of nine benchmarks.
Fig 13
Fig 13. Hit ratio and overhead of base, HMM and G/DC in a 16-core architecture and in a multiprogrammed 4-core architecture.
The hit ratio of the base experiment is shown in the box.
