This project is in collaboration with the
This project is partially supported by the National Science Foundation under grant No. 1741472, titled "BIGDATA: F: Audio-Visual Scene Understanding".
We aim to train a system that generates a virtual pianist animation with expressive performance motions from symbolic music in MIDI format.
We first use two CNN structures to parse the raw inputs, namely the MIDI note stream and the metric structure. We then feed the extracted feature representations to an LSTM network, which generates the body movements as a sequence of upper-body joint coordinates forming a skeleton.
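To make the two raw inputs more concrete, here is a minimal sketch of how a MIDI note stream and a metric structure could be turned into frame-level matrices before being passed to the CNNs. The encoding is an assumption for illustration only: the piano-roll layout, the `[is_beat, is_downbeat]` features, the frame rate, and the function names `piano_roll` and `metric_features` are hypothetical and not taken from the project.

```python
from typing import List, Tuple

# Hypothetical note format: (MIDI pitch 0-127, onset in seconds, offset in seconds)
Note = Tuple[int, float, float]


def piano_roll(notes: List[Note], fps: int, n_frames: int) -> List[List[int]]:
    """Return an n_frames x 128 binary matrix; roll[t][p] is 1 while pitch p sounds."""
    roll = [[0] * 128 for _ in range(n_frames)]
    for pitch, onset, offset in notes:
        start = max(0, int(round(onset * fps)))
        # Every note covers at least one frame, even if it is very short.
        end = max(start + 1, int(round(offset * fps)))
        for t in range(start, min(end, n_frames)):
            roll[t][pitch] = 1
    return roll


def metric_features(beat_times: List[float], beats_per_bar: int,
                    fps: int, n_frames: int) -> List[List[int]]:
    """Return an n_frames x 2 binary matrix of [is_beat, is_downbeat] per frame."""
    feats = [[0, 0] for _ in range(n_frames)]
    for i, beat_time in enumerate(beat_times):
        t = int(round(beat_time * fps))
        if 0 <= t < n_frames:
            feats[t][0] = 1
            # Assumes the first listed beat is a downbeat.
            if i % beats_per_bar == 0:
                feats[t][1] = 1
    return feats


# Usage: two half-second notes and two beats, sampled at 10 frames per second.
roll = piano_roll([(60, 0.0, 0.5), (64, 0.5, 1.0)], fps=10, n_frames=10)
feats = metric_features([0.0, 0.5], beats_per_bar=2, fps=10, n_frames=10)
```

In a setup like this, each CNN would read one of the two matrices over a window of frames, and the resulting feature vectors would be combined per frame as the LSTM input.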
We conduct subjective evaluations to rate the expressiveness and naturalness of the generated skeleton movements compared with those extracted from real human players. More specifically, we recruited 18 subjects from Yamaha to watch 32 10-second video excerpts of "skeleton plays piano", 16 generated and 16 real. The rating results are plotted in the following figure, where tracks with significantly different ratings are marked with "*".
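This page does not say which statistical test produced the "*" marks. As an illustration of how a per-track comparison of paired ratings could be checked for significance, the sketch below swaps in an exact two-sided sign-flip permutation test. The function name `paired_permutation_pvalue` and the ratings in the usage lines are hypothetical placeholders, not data from the study.

```python
from itertools import product
from typing import Sequence


def paired_permutation_pvalue(generated: Sequence[float],
                              real: Sequence[float]) -> float:
    """Exact two-sided sign-flip permutation test on paired ratings.

    Each subject contributes one rating for the generated excerpt and one for
    the real excerpt of the same track. Under the null hypothesis, the sign of
    each per-subject difference is equally likely to be positive or negative.
    """
    diffs = [g - r for g, r in zip(generated, real)]
    observed = abs(sum(diffs))
    hits = 0
    total = 0
    # Enumerate all 2^n sign assignments (about 262k for 18 subjects).
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        permuted = abs(sum(s * d for s, d in zip(signs, diffs)))
        if permuted >= observed - 1e-12:  # small tolerance for float rounding
            hits += 1
    return hits / total


# Placeholder ratings for one track from four hypothetical subjects.
placeholder_generated = [3, 3, 3, 3]
placeholder_real = [4, 4, 4, 4]
p_value = paired_permutation_pvalue(placeholder_generated, placeholder_real)
print(p_value)  # prints 0.125
```

A track would be marked as significantly different when its p-value falls below a chosen threshold such as 0.05. With 16 tracks tested, a correction for multiple comparisons may also be worth considering.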
All the generated skeleton movements (compared with those of real human players) for the 16 tracks are listed here:
Visit the YouTube playlist for the above 16 demo videos <here>
Visit the YouTube playlist for demo videos without the comparison with real human players <here>
Bochen Li, Akira Maezawa, and Zhiyao Duan, "Skeleton plays piano: online generation of pianist body movements from MIDI performance," in Proc. International Society for Music Information Retrieval Conference (ISMIR), 2018.