A self-adaptive hardware with resistive switching synapses for experience-based neurocomputing

S Bianchi et al. Nat Commun. 2023 Mar 21;14(1):1565. doi: 10.1038/s41467-023-37097-5.

Abstract

Neurobiological systems continually interact with the surrounding environment to refine their behaviour toward the best possible reward. Achieving such learning by experience is one of the main challenges of artificial intelligence, but currently it is hindered by the lack of hardware capable of plastic adaptation. Here, we propose a bio-inspired recurrent neural network, mastered by a digital system on chip with resistive-switching synaptic arrays of memory devices, which exploits homeostatic Hebbian learning for improved efficiency. All the results are discussed experimentally and theoretically, proposing a conceptual framework for benchmarking the main outcomes in terms of accuracy and resilience. To test the proposed architecture for reinforcement learning tasks, we study the autonomous exploration of continually evolving environments and verify the results for the Mars rover navigation. We also show that, compared to conventional deep learning techniques, our in-memory hardware has the potential to achieve a significant boost in speed and power-saving.


Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1. Electrical characterization of the RRAM synaptic devices.
a Scanning Electron Microscope image of the SiOx RRAM devices and a sample photo of the packaged RRAM arrays used in this work. b I-V characteristics of the 1T1R RRAM devices (device-to-device measurements) at fixed VSTOP as a function of the compliance current IC, used to study the switching mechanism of the synapses under different operating conditions. Note that the compliance current is directly controlled by acting on the gate voltage VG of the cell selector, an NMOS transistor. By sending pre-neuronal spikes to the gate of the selector and biasing the top electrode of the RRAM synaptic element, a post-synaptic current is generated and used for the post-neuronal computation. During fire events, a programming signal is superimposed on the top-electrode bias to set or reset the memory device. c Typical low-resistive (LRS) and high-resistive (HRS) state distributions using IC = 74 μA and VSTOP = −1.5 V. d Multilevel LRS at increasing IC, with the average resistance μR (e) and the corresponding standard deviation σR (f): note that the precision of the synaptic weight depends on the magnitude of the programming current (higher power, higher precision of the synaptic weight). g Modulation of the HRS as a function of the VSTOP sweep, with the extracted σ error bars: the larger the magnitude of the stop voltage, the higher the resulting resistance.
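The trend in panels d-f (mean LRS decreasing and spread tightening as IC grows) can be mimicked with a toy statistical model. The sketch below is purely illustrative: the constants k and sigma0 and the scaling laws are assumptions for demonstration, not values fitted to the measurements.

```python
import numpy as np

# Illustrative model (assumed, not the paper's data) of multilevel programming
# in a 1T1R RRAM synapse: the compliance current IC, set through the gate
# voltage of the NMOS selector, fixes the mean low-resistance state, while the
# relative spread shrinks as IC grows (higher power, higher precision).
def program_lrs(ic_amps, n_devices=1000, k=0.4, sigma0=0.25, rng=None):
    """Sample LRS resistances [ohm] for a given compliance current [A]."""
    rng = rng or np.random.default_rng(0)
    mu_r = k / ic_amps                         # mean LRS falls as IC rises (e)
    sigma = sigma0 * (50e-6 / ic_amps) ** 0.5  # tighter spread at high IC (f)
    return rng.lognormal(np.log(mu_r), sigma, size=n_devices)

for ic in (25e-6, 50e-6, 74e-6, 100e-6):
    r = program_lrs(ic)
    print(f"IC = {ic * 1e6:5.1f} uA -> mu_R ~ {r.mean() / 1e3:6.1f} kOhm, "
          f"sigma_R / mu_R ~ {r.std() / r.mean():.2f}")
```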
Fig. 2. Flow-chart of the reinforcement learning procedure implemented in hardware.
a Representation of high-level reinforcement learning for autonomous navigation considering 8 main directions of movement: an agent (e.g., a robot) interacts with the environment by means of decision-making events which eventually lead to penalties or rewards that modulate the next actions. The direction of movement between two positions is governed by spike-timing-dependent plasticity (STDP). The pre-neuronal signal (current position of the agent) excites the gate of the selector of the synaptic RRAM element with a sequence of rectangular pulses while the top electrode (TE) of the synapse is biased at a read voltage (between 50 and 150 mV). The resulting post-synaptic current is integrated in the post-neuron and compared with the internal threshold (set by an additional “state” device), eventually inducing fire events which potentiate the synaptic element and mark the direction of movement. Note also that LFSR registers can select random neurons for sending stochastic depression signals. b High-level description of the bio-inspired reinforcement learning procedure implemented in hardware. Note that, for best operation, the initial configuration of the RRAM matrices is bimodal. c Block scheme of the hardware, with the “synaptic” and “internal state” RRAM arrays, the FPGA and the 8 neurons that represent the 8 cardinal directions. d Example of the operating condition of the firing neuron NW with respect to the synaptic and internal-state arrays. The internal threshold is modulated by the resistive state of the internal RRAM device, which changes as a function of the fire activity: an analogue front-end is also necessary for a correct definition of the post-synaptic currents. e Top view of the memory array and of the integrated circuit periphery for the management of the memory addresses. f Example of a dynamic maze used to test the system on reinforcement learning tasks.
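As a rough software illustration of this decision loop, the sketch below simulates one movement step: the 8 direction synapses of the current position are read with a pulse train, each post-neuron integrates its synaptic current against its internal threshold, and the first neuron to fire selects the move and is potentiated. All device values (read voltage, capacitance, conductances, thresholds) are placeholder assumptions, not the parameters of the actual hardware.

```python
import numpy as np

DIRECTIONS = ["N", "NE", "E", "SE", "S", "SW", "W", "NW"]
V_READ = 0.1        # top-electrode read voltage [V], in the 50-150 mV range
DT = 1e-6           # duration of one rectangular gate pulse [s] (assumed)
C_MEM = 1e-10       # integration capacitance of the post-neuron [F] (assumed)
G_POT = 1 / 10e3    # conductance after potentiation [S] (assumed LRS ~ 10 kOhm)

def decision_step(g_syn, v_th, n_pulses=100):
    """One STDP-driven move: integrate-and-fire over the 8 direction neurons.

    g_syn: conductances [S] of the 8 synapses leaving the current position.
    v_th:  internal thresholds [V] read from the 'state' array.
    Returns (index of the chosen direction or None, updated conductances).
    """
    v_int = np.zeros(8)                          # integrated membrane voltages
    for _ in range(n_pulses):
        v_int += g_syn * V_READ * DT / C_MEM     # post-synaptic charge per pulse
        fired = np.flatnonzero(v_int >= v_th)
        if fired.size:
            winner = fired[np.argmax(v_int[fired])]
            g_syn = g_syn.copy()
            g_syn[winner] = G_POT                # potentiate the winning synapse
            return winner, g_syn
    return None, g_syn                           # no fire within the pulse train

rng = np.random.default_rng(1)
g0 = rng.uniform(1 / 100e3, 1 / 20e3, 8)         # toy initial conductances
move, g1 = decision_step(g0, v_th=np.full(8, 1.0))
print("chosen direction:", DIRECTIONS[move] if move is not None else "none")
```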
Fig. 3. Step-by-step description of the main signals ruling the autonomous navigation.
a The FPGA records the current position of the agent and b triggers the gate voltage signal of the synaptic devices to start the integration phase of the nearest neurons. Once a neuron fires, all the integration signals are discharged by switching on a transistor in parallel with the capacitor used for integration. After the fire event, the corresponding synaptic connection is brought high (c) and the current position of the agent is updated (d); the threshold of the new internal state rises as a consequence of the partial set of the internal-state device (e); the procedure (a-e) is repeated at every movement of the agent. f If a position (i, j) is accessed several consecutive times, the corresponding internal threshold plastically adapts, causing a gradual increase of the threshold VTH; this neuronal threshold plasticity is also used to map penalties, by increasing the corresponding VTH (g), and rewards, by decreasing the corresponding VTH (h). Note that the gradual increase of the neuronal threshold is bounded by the effective multilevel capability of the RRAM devices (i). During ordinary movement, the synaptic connections from one position to another are potentiated or depressed through the STDP mechanism (j), while penalty positions always undergo depression, as required by reinforcement learning (k). Note that the synaptic connections are always potentiated if the agent does not backtrack. If rewarded positions run into a penalty due to the dynamic evolution of the environment, the corresponding internal thresholds rise more slowly than those of ordinary positions, owing to the firing history and the different fire excitability (l).
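Under the same caveat, the threshold-plasticity rules of panels f-i can be summarised in a few lines; the step sizes and bounds below are arbitrary placeholders standing in for the multilevel states of the internal RRAM device.

```python
# Sketch of the neuronal-threshold plasticity of panels f-i (assumed step sizes;
# in hardware the increments correspond to multilevel states of the RRAM device).
V_TH_MIN, V_TH_MAX = 0.5, 2.0    # bounds set by the device's multilevel window (i)
DV_VISIT   = 0.05                # gradual rise when a position keeps firing (f)
DV_PENALTY = 0.40                # strong rise when a penalty is received (g)
DV_REWARD  = -0.40               # drop when a reward is received (h)

def update_threshold(v_th, event):
    """Homeostatic update of the internal threshold of one position neuron."""
    dv = {"visit": DV_VISIT, "penalty": DV_PENALTY, "reward": DV_REWARD}[event]
    return min(max(v_th + dv, V_TH_MIN), V_TH_MAX)   # saturate at device limits

v = 1.0
for event in ["visit", "visit", "penalty", "visit", "reward"]:
    v = update_threshold(v, event)
    print(f"{event:8s} -> V_TH = {v:.2f} V")
```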
Fig. 4. Recall property in dynamic environments and power efficiency of the system.
a Experimental results for 9 successive trials of a maze whose topological configuration changes every 3 trials. The system explores the environment to find the reward and recalls the first solution once the previous configuration is presented again. b The time to reach the solution improves from trial to trial along with the optimization of the policy. However, once the maze changes shape, the reward time increases accordingly since a new solution must be found. When the maze returns to the previous configuration, the first solution is recalled. c Energy consumption trend for each core of the system. d When the initial point is changed from trial to trial, the energy consumption stays high, but a policy map of the whole environment is retrieved. e Map of the firing rate of the neurons, showing that the highest values are, on average, in the vicinity of the final reward. f Colour maps of the accuracy for standard Python-based deep Q-learning and the proposed bio-inspired approach under the same benchmarking conditions. Note that the bio-inspired hardware achieves better accuracy for every combination of exploration parameters (number of trials per experiment and number of steps per single trial, i.e., exploration time). g Comparison in terms of memory computing elements between the deep Q-learning procedure and the bio-inspired solution at increasing sizes of the environment to explore. Note that the power consumption is further improved in the bio-inspired solution thanks to the use of RRAM memory devices built in the back end of line, which avoid the von Neumann bottleneck typical of standard computing platforms.
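The memory-element scaling of panel g can be illustrated with a back-of-the-envelope count. The deep Q-network layout below (two hidden layers of assumed width) is hypothetical and only shows why a per-position synaptic array grows more slowly than a dense Q-network as the environment gets larger.

```python
# Hypothetical comparison of memory computing elements versus environment size:
# a dense deep Q-network (assumed two hidden layers of width `hidden`) against
# the bio-inspired array, which needs 8 direction synapses plus 1 state device
# per maze position. The network shape is an assumption, not the paper's setup.
def dqn_weights(n_positions, n_actions=8, hidden=64):
    # one-hot position input -> hidden -> hidden -> Q-values over actions
    return n_positions * hidden + hidden * hidden + hidden * n_actions

def rram_devices(n_positions, n_actions=8):
    return n_positions * (n_actions + 1)   # synaptic array + internal-state array

for side in (8, 16, 32, 64):
    n = side * side
    print(f"{side:>2}x{side:<2} maze: DQN ~ {dqn_weights(n):>7} weights, "
          f"RRAM ~ {rram_devices(n):>6} devices")
```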
Fig. 5. Reconfigurability and scalability of the hardware under the Mars Rover navigation test.
a Custom Mars environment readapted from HiRISE imagery, with the initial (start) and final (global reward) points selected for testing the algorithm highlighted. b The agent explores the environment over 100 trials, eventually finding the target: note that successful trials improve the strategy step by step, reaching the solution faster. c Various exploration trials lead to the creation of a complete policy map of the whole environment, with higher equivalent thresholds for the positions that received penalties. d Time improvement of the exploration path: after selecting a starting point, the policy map drives the system to optimize the number of steps needed to reach the final reward. e Iterative selection of small sections of the previous policy map, by reading the integrated current of the state array, to record generic shapes of the penalty-related objects. Note that this procedure can be iterated as a function of the shape size by choosing proper boundaries to avoid misleading cases (e.g., sections of the policy map in which no shapes are detected). f Exploration of a new environment taking into consideration different sizes of penalty shapes: once the agent receives a penalty, it is possible to inhibit a generic pre-recorded area of the environment, thus avoiding a one-to-one memory-position mapping. Furthermore, the memory array can be continually exploited by the bio-inspired computation until the memory resource is fully allocated; afterwards, the direction of movement and further information can be saved in peripheral registers as pure coordinates. g Study over 100 experiments of the time improvement of the exploration path, comparing the policy-free exploration with the optimized policy using recorded penalty shapes, highlighting the benefits of transfer learning from previous explorations.
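Panels e-f lend themselves to a small illustration of the shape-transfer idea: a penalty shape is cut out of the learned policy map as a boolean patch and later stamped around a newly hit penalty to inhibit a whole area at once. The code below is a conceptual sketch with invented thresholds and map sizes, not the on-chip procedure.

```python
import numpy as np

PENALTY_LEVEL = 1.5                      # assumed threshold marking penalty cells

def record_shape(policy_map, top_left, size):
    """Cut a boolean penalty shape out of a section of the learned policy map."""
    r, c = top_left
    patch = policy_map[r:r + size, c:c + size] > PENALTY_LEVEL
    return patch if patch.any() else None       # skip sections with no shape (e)

def apply_shape(threshold_map, hit_pos, shape):
    """Inhibit a pre-recorded area around a penalty hit in a new environment (f)."""
    r, c = hit_pos
    view = threshold_map[r:r + shape.shape[0], c:c + shape.shape[1]]
    view[shape[:view.shape[0], :view.shape[1]]] = np.inf   # never re-enter
    return threshold_map

old_map = np.ones((10, 10)); old_map[2:5, 3:6] = 2.0   # toy learned policy map
shape = record_shape(old_map, top_left=(2, 3), size=3)
new_map = apply_shape(np.ones((10, 10)), hit_pos=(6, 6), shape=shape)
print(shape.astype(int))
print(new_map[5:10, 5:10])
```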
