Applying Deep Learning Techniques to Estimate Patterns of Musical Gesture

David Dalmazzo¹, George Waddell²,³, Rafael Ramírez¹

Affiliations

¹ Music Technology Group, Department of Information and Communication Technologies, Universitat Pompeu Fabra, Barcelona, Spain.
² Centre for Performance Science, Royal College of Music, London, United Kingdom.
³ Faculty of Medicine, Imperial College London, London, United Kingdom.

PMID: 33469435
PMCID: PMC7813937
DOI: 10.3389/fpsyg.2020.575971

Front Psychol. 2021 Jan 5;11:575971. eCollection 2020.

Abstract

Repetitive practice is one of the most important factors in improving the performance of motor skills. This paper focuses on the analysis and classification of forearm gestures in the context of violin playing. We recorded five experts and three students performing eight traditional classical violin bow-strokes: martelé, staccato, détaché, ricochet, legato, tremolo, collé, and col legno. To record inertial motion information, we used the Myo sensor, which reports a multidimensional time-series signal. We synchronized inertial motion recordings with audio data to extract the spatiotemporal dynamics of each gesture. Applying state-of-the-art deep neural networks, we implemented and compared different architectures: convolutional neural network (CNN) models reached recognition rates of 97.147%, 3DMultiHeaded_CNN models 98.553%, and CNN_LSTM models 99.234%. The collected data (quaternion of the bowing arm of a violinist) contained sufficient information to distinguish the bowing techniques studied, and deep learning methods were capable of learning the movement patterns that distinguish these techniques. Each of the learning algorithms investigated (CNN, 3DMultiHeaded_CNN, and CNN_LSTM) produced high classification accuracies, which supports the feasibility of training such classifiers. The resulting classifiers may provide the foundation of a digital assistant to enhance musicians' time spent practicing alone, providing real-time feedback on the accuracy and consistency of their musical gestures in performance.

Keywords: CNN; CNN_LSTM; ConvLSTM; LSTM; bow-strokes; gesture recognition; music education; music interaction.

Conflict of interest statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figures

**Figure 1**
Musical excerpts performed for each of the eight violin bowing gestures.

**Figure 2**
Trajectory samples of all gestures. The bow-stroke examples displayed were chosen randomly from the expert performers. The shapes can be understood as temporal signatures with specific speeds and sounds. The performer's samples are similar in speed and shape but not identical. The color bar is the reference of the depth shown as the “y” axis.

**Figure 3**
The cluster visualization serves to check if the data distribution is centralized and normalized.

**Figure 4**
The three-dimensional data is organized in sets of 150 samples which contain two bow-strokes per sample. The x-axis is the observation of those paired bow-strokes, the y-axis is the number of observations defined as “samples,” and the z-axis is the number of features, which in this case is 3 × 3 sensor axes (gyroscope, accelerometer, and Euler-angles). Hence, each of the features is itself a file of time-steps × samples stored in a folder.
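
The layout in this caption can be sketched as stacking nine per-feature matrices into one three-dimensional array. This is a minimal sketch under stated assumptions, not the authors' loading code: 150 is read here as the number of time-steps per observation, the observation count `N_OBS`, the feature names, and the random data are placeholders, and the per-feature files are stood in for by in-memory arrays.

```python
import numpy as np

# Assumed sizes for illustration only; the caption does not fix the observation count.
N_OBS = 40        # observations ("samples"), each holding a pair of bow-strokes
N_STEPS = 150     # time-steps per observation (one reading of the caption)
SENSORS = ("gyro", "acc", "euler")   # hypothetical names for the three sensor streams
AXES = ("x", "y", "z")

rng = np.random.default_rng(0)

# One 2D matrix per feature, standing in for the per-feature files in the caption.
# Stored as (observations, time-steps); transpose first if the files are saved
# the other way round.
per_feature = {
    f"{s}_{a}": rng.normal(size=(N_OBS, N_STEPS))
    for s in SENSORS
    for a in AXES
}


def stack_features(matrices):
    """Stack 2D (observations, time-steps) matrices along a new last axis."""
    return np.stack(list(matrices.values()), axis=-1)


X = stack_features(per_feature)
print(X.shape)  # (40, 150, 9)
```

The resulting shape (observations, time-steps, features) is the input layout that one-dimensional convolutional and recurrent layers commonly expect.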

**Figure 5**
The shuffled data gives an insight into how the features and the labeled datasets have to be reorganized with the same sample order. By shuffling, we ensure that each data observation creates an independent unbiased change on the model, learning all gestures in the same proportion.
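
Keeping features and labels in the same sample order while shuffling can be sketched with a single shared permutation. The function name `shuffle_together`, the toy array sizes, and the seed are assumptions for illustration; the source does not describe how the shuffle was implemented.

```python
import numpy as np


def shuffle_together(X, y, seed=0):
    """Return X and y reordered by one shared random permutation of the first axis."""
    if len(X) != len(y):
        raise ValueError("X and y must have the same number of observations")
    order = np.random.default_rng(seed).permutation(len(X))
    return X[order], y[order]


# Toy data: 6 observations, 4 time-steps, 9 features. Each observation is filled
# with its own label value so the pairing can be checked after shuffling.
y = np.array([0, 1, 2, 3, 4, 5])
X = np.zeros((6, 4, 9)) + y[:, None, None]

X_s, y_s = shuffle_together(X, y, seed=42)
print(np.array_equal(X_s[:, 0, 0], y_s))  # True: features and labels stayed aligned
```

Indexing both arrays with the same `order` is what keeps each observation attached to its label; shuffling the two arrays separately would break that pairing.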

**Figure 6**
**(A)** CNN architecture: After the filtering layers, with a dropout of 0.5, the first Dense layer has 100 neurons and projects to an output layer of eight neurons. **(B)** 3DMultiHeaded_CNN: Each head processes the whole package of data at a different resolution; the heads are concatenated in the layer concatenate_1.

**Figure 7**
**(A)** CNN_LSTM is a hybrid model in which six layers of CNN processing extract temporal features of the gestures and project them to a standard (vanilla) LSTM. **(B)** ConvLSTM is a recurrent LSTM network that handles 3D tensors, receiving at its input gates the matrices processed by its internal CNN.

**Figure 8**
Boxplots of the accuracy reported for **(A)** CNN filter configurations of 8, 16, 32, 64, 128, 256; **(B)** CNN kernel configurations of 2, 3, 5, 7, 9; **(C)** CNN_LSTM batch configurations of 32, 64, 128, 256, 512; and **(D)** ConvLSTM filter configurations of 8, 16, 32, 64, 128, 256. All models were run 10 times to determine their range of accuracy.
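
The repeated-run protocol behind these boxplots can be sketched as a loop that evaluates each configuration 10 times and collects summary statistics. The function `train_and_evaluate` is a hypothetical stand-in that returns a pseudo-random number so the sketch runs; its values are not results from the study, and the real training code is not given in the source.

```python
import random
import statistics


def train_and_evaluate(n_filters, seed):
    """Hypothetical stand-in for training one model and returning its accuracy.

    Returns a pseudo-random value in [0.90, 0.99); not a real result.
    """
    rng = random.Random(seed * 1000 + n_filters)
    return 0.90 + 0.09 * rng.random()


def summarize_runs(configs, n_runs=10):
    """Run each configuration n_runs times and collect boxplot-style statistics."""
    summary = {}
    for cfg in configs:
        scores = [train_and_evaluate(cfg, seed) for seed in range(n_runs)]
        summary[cfg] = {
            "min": min(scores),
            "median": statistics.median(scores),
            "max": max(scores),
            "stdev": statistics.stdev(scores),
        }
    return summary


results = summarize_runs([8, 16, 32, 64, 128, 256])
for cfg, stats in results.items():
    print(cfg, {k: round(v, 3) for k, v in stats.items()})
```

Reporting the spread over several runs, rather than one accuracy figure, separates the effect of a hyperparameter from run-to-run variation caused by random initialization and shuffling.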

**Figure 9**
The figure shows 10 experiment runs per parameter value (the percentage of the data used in this study), with 20 epochs per test run.
