We created a dataset to facilitate audio-visual analysis of musical performances. The dataset comprises a number of simple multi-instrument musical pieces assembled from coordinated but separately recorded performances of individual tracks. We anticipate that the dataset will be useful as “ground truth” for evaluating audio-visual techniques for music source separation, transcription, and performance analysis. A more detailed description and sample data are available here.
The Bach10 dataset is a polyphonic music dataset that can be used for a variety of research problems, such as multi-pitch estimation and tracking, audio-score alignment, and source separation. It consists of audio recordings of each part and of the full ensemble for ten four-part J.S. Bach chorales, together with their MIDI scores, the ground-truth alignment between the audio and the score, the ground-truth pitch values of each part, and the ground-truth notes of each piece. The four parts of each piece (Soprano, Alto, Tenor and Bass) are performed by violin, clarinet, saxophone and bassoon, respectively. A more detailed description is here. Dataset Download
The Pitch-Tracking Database from Graz University of Technology (PTDB-TUG) is a speech database for pitch tracking. It contains microphone and laryngograph signals of 20 native English speakers reading the TIMIT corpus. The database also provides reference pitch trajectories, which were calculated from the laryngograph signals using the RAPT pitch tracking algorithm [1]. Here, we provide another version of the reference pitch trajectories, calculated using the Praat pitch tracking algorithm [2] on the microphone signals. We found that about 85% of the Praat-generated ground-truth pitches agree with the RAPT-generated ground-truth pitches. Praat-generated Reference Pitch Trajectories Download
[1] D. Talkin, “A robust algorithm for pitch tracking (RAPT),” in Speech Coding and Synthesis (W.B. Kleijn and K.K. Paliwal, eds.), pp. 495–518, Elsevier Science B.V., 1995.
[2] P. Boersma, “Praat, a system for doing phonetics by computer,” Glot International, vol. 5, no. 9/10, pp. 341–345, 2001.
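For readers who want to compare two sets of reference pitch trajectories themselves, the following is a minimal sketch of one possible frame-level agreement measure. It is not the procedure used to obtain the 85% figure above, whose criterion is not stated here. The function name `agreement_rate`, the convention that 0 marks an unvoiced frame, and the 50-cent tolerance are all assumptions made for illustration.

```python
import math

def agreement_rate(track_a, track_b, tol_cents=50.0):
    """Fraction of frames on which two pitch tracks (in Hz) agree.

    Assumptions (not from the dataset description): both tracks are
    frame-aligned, a value <= 0 marks an unvoiced frame, and two voiced
    frames agree when they differ by at most tol_cents.
    """
    if len(track_a) != len(track_b):
        raise ValueError("tracks must have the same number of frames")
    if not track_a:
        return 0.0
    agree = 0
    for a, b in zip(track_a, track_b):
        if a <= 0 and b <= 0:
            agree += 1  # both trackers say "unvoiced"
        elif a > 0 and b > 0:
            if abs(1200.0 * math.log2(a / b)) <= tol_cents:
                agree += 1  # both voiced and close in pitch
    return agree / len(track_a)

# Made-up values for illustration only
rapt_like = [0.0, 120.0, 121.0, 0.0, 200.0]
praat_like = [0.0, 119.0, 242.0, 110.0, 201.0]
print(agreement_rate(rapt_like, praat_like))  # 0.6
```

In this toy example the third frame is an octave error and the fourth is a voicing disagreement, so three of the five frames agree.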
For research on speech enhancement, we collected recordings of ten kinds of non-stationary noise: birds, casino, cicadas, computer keyboard, eating chips, frogs, jungle, machine guns, motorcycles, and ocean. The recording of each noise is between one and three minutes long. Dataset Download.
This code performs sound search by vocal imitation using the Semi-Siamese Convolutional Network (SCN) described in the paper: Yichi Zhang and Zhiyao Duan, "IMINET: convolutional semi-siamese networks for sound search by vocal imitation," in Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2017, pp. 304-308. Given the spectrogram of a vocal imitation, the network compares it with the spectrogram of each sound candidate in the dataset. The candidates that form the highest-similarity imitation-recording pairs are returned to the user. SYMM-IMINET_WASPAA2017_Code.rar
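The retrieval step described above, returning the candidates with the highest similarity, can be sketched as follows. This is only an illustration of the ranking step, not the released code: the function name `rank_candidates` is hypothetical, and the similarity scores, which would come from the network, are replaced here by made-up numbers.

```python
import numpy as np

def rank_candidates(similarities, top_k=3):
    """Return the indices of the top_k most similar candidates, best first.

    `similarities` holds one score per sound candidate for a single vocal
    imitation; higher means more similar (an assumption for this sketch).
    """
    scores = np.asarray(similarities, dtype=float)
    # Negate so that argsort, which sorts ascending, puts the best first.
    order = np.argsort(-scores, kind="stable")
    return order[:top_k].tolist()

# Hypothetical scores between one imitation and five candidate sounds
scores = [0.12, 0.87, 0.45, 0.91, 0.30]
print(rank_candidates(scores, top_k=2))  # [3, 1]
```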
This code performs Multi-pitch Estimation (MPE) and Multi-pitch Streaming (MPS) on polyphonic music or multi-talker speech. For a piece of polyphonic audio composed of monophonic harmonic sound sources, this program first estimates pitches in each time frame, then it streams these pitch estimates across time into pitch trajectories (streams), each of which corresponds to a sound source. mpe_mps.zip
The MPE and MPS code is also available separately. mpe.zip, mps.zip
This toolbox is for evaluating multi-pitch analysis results. It compares the estimated pitch content with the ground-truth pitch content and outputs several error measures. Run help on each file for details of the measures it computes. mpa_eval.zip
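As a rough illustration of what comparing estimated and ground-truth pitch content can look like, here is a minimal sketch of frame-level precision, recall, and accuracy. It is not the toolbox's implementation and may differ from its measures. The function names, the greedy one-to-one matching, and the 50-cent (quarter-tone) tolerance are assumptions chosen for this example.

```python
import math

def count_matches(est, ref, tol_cents=50.0):
    """Count estimated pitches (Hz) that match a distinct reference pitch."""
    unused = list(ref)
    hits = 0
    for f in est:
        best = None
        for r in unused:
            dist = abs(1200.0 * math.log2(f / r))
            if dist <= tol_cents and (best is None or dist < best[0]):
                best = (dist, r)
        if best is not None:
            unused.remove(best[1])  # each reference pitch is matched once
            hits += 1
    return hits

def evaluate(est_frames, ref_frames, tol_cents=50.0):
    """Return (precision, recall, accuracy) summed over all frames."""
    tp = n_est = n_ref = 0
    for est, ref in zip(est_frames, ref_frames):
        tp += count_matches(est, ref, tol_cents)
        n_est += len(est)
        n_ref += len(ref)
    fp, fn = n_est - tp, n_ref - tp
    precision = tp / n_est if n_est else 0.0
    recall = tp / n_ref if n_ref else 0.0
    accuracy = tp / (tp + fp + fn) if (tp + fp + fn) else 0.0
    return precision, recall, accuracy

# Made-up three-frame example: each inner list holds the pitches in one frame
est_frames = [[440.0, 660.0], [445.0], []]
ref_frames = [[440.0, 554.37], [440.0], [330.0]]
precision, recall, accuracy = evaluate(est_frames, ref_frames)
print(round(precision, 3), round(recall, 3), round(accuracy, 3))  # 0.667 0.5 0.4
```

Here two of the three estimated pitches are correct and two of the four reference pitches are found, which gives the printed values.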
This code implements the Soundprism online score-informed source separation system. soundprism.zip