Robust Singing Voice Transcription Serves Synthesis
Abstract
Note-level Automatic Singing Voice Transcription (AST) converts singing recordings into note sequences, facilitating the automatic annotation of singing datasets for Singing Voice Synthesis (SVS) applications. Current AST methods, however, struggle with accuracy and robustness when used for practical annotation. This paper presents ROSVOT, the first robust AST model that serves SVS, incorporating a multi-scale framework that effectively captures coarse-grained note information and ensures fine-grained frame-level segmentation, coupled with an attention-based pitch decoder for reliable pitch prediction. We also establish a comprehensive annotation-and-training pipeline for SVS to test the model in real-world settings. Experiments reveal that the proposed model achieves state-of-the-art transcription accuracy with either clean or noisy inputs. Moreover, when trained on enlarged, automatically annotated datasets, the SVS model outperforms its baseline, affirming its capability for practical application.
SVS Results with Different Ratios of Pseudo Annotations
We train ROSVOT on the M4Singer dataset and use it to generate pseudo annotations for dataset $D_1$. Real and pseudo annotations are then mixed at different ratios to train the SVS model, RMSSinger. Inference is performed on the test set of $D_1$. A minimal sketch of the mixing step follows the model list below.
- Models:
- GT: Ground Truth samples generated with the vocoder.
- 100%D1: the SVS model trained with 100% of the real annotations.
- 50%D1: trained with 50% of the real and 50% of the pseudo annotations.
- 10%D1: 10% real and 90% pseudo.
- 5%D1: 5% real and 95% pseudo.
- 0%D1: 100% pseudo annotations.
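For reference, here is a minimal Python sketch of the mixing step. The `mix_annotations` helper, its dict-based annotation format, and the fixed seed are illustrative assumptions for this demo page, not the released training code.

```python
import random

def mix_annotations(real_annos, pseudo_annos, real_ratio, seed=42):
    """Mix real and pseudo note annotations over the same utterance set.

    real_annos / pseudo_annos: dicts mapping utterance id -> note sequence
    (both cover the same ids). real_ratio is the fraction of utterances that
    keep their human annotation; the rest use ROSVOT's pseudo annotation.
    """
    ids = sorted(real_annos)
    random.Random(seed).shuffle(ids)  # fixed seed for a reproducible split
    n_real = round(len(ids) * real_ratio)
    mixed = {uid: real_annos[uid] for uid in ids[:n_real]}
    mixed.update((uid, pseudo_annos[uid]) for uid in ids[n_real:])
    return mixed

# e.g. the 10%D1 setting: 10% real, 90% pseudo annotations
# train_annos = mix_annotations(real_annos, pseudo_annos, real_ratio=0.1)
```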
- Samples (each sample provides audio for GT, 100%D1, 50%D1, 10%D1, 5%D1, and 0%D1; the players are omitted from this text version):
- Sample 1: <breath> 如 果 那 两 个 字 没 有 颤 抖 <breath> 我 不 会 发 现 我 难 受 <breath> 怎 么 说 出 口 <silence>
- Sample 2: 再 一 次 沾 染 你 <breath> 若 生 命 <breath> 如 过 场 电 影 <breath>
- Sample 3: 能 够 握 紧 的 就 别 放 了 <breath> 能 够 拥 抱 的 就 别 拉 扯 <breath> 时 间 着 急 地 <breath>
- Sample 4: 去 寻 找 遗 失 了 的 思 念 <breath> 如 果 你 在 眼 前 <breath> 我 会 让 你 看 见 <silence>
SVS Results with Expanded Datasets
We use MFA and the same ROSVOT model trained on M4Singer to re-align and re-annotate OpenSinger, a multi-singer dataset originally designed for training vocoders, which ships without note annotations. We use M4Singer as the base training set and gradually add dataset $D_1$ and OpenSinger to observe the improvement of the SVS model. Inference is performed on the test sets of $D_1$ and M4Singer. Note that model M4 is only tested on M4Singer (samples 5 and 6), since we do not investigate the SVS model's generalization capabilities here. A minimal sketch of the dataset-merging step follows the model list below.
- Models:
- GT: Ground Truth samples generated with the vocoder.
- M4: the SVS model trained with M4Singer, as the baseline.
- M4+100%D1: trained with M4Singer and dataset $D_1$ with 100% real annotations.
- M4+0%D1: trained with M4Singer and dataset $D_1$ with 0% real annotations (100% pseudo annotations).
- M4+0%D1+OP: a large version of RMSSinger, trained with M4Singer, dataset $D_1$ (0% real, 100% pseudo annotations), and OpenSinger.
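As a rough illustration of how the training set is expanded, the sketch below concatenates per-dataset manifests into one list. The `metadata.json` layout, field names, and directory names are assumptions for illustration; the actual preprocessing format may differ.

```python
import json
from pathlib import Path

def build_training_manifest(dataset_dirs, out_path):
    """Merge per-dataset manifests (assumed: one JSON list of utterance
    dicts per dataset) into a single training manifest."""
    items = []
    for d in map(Path, dataset_dirs):
        with open(d / "metadata.json", encoding="utf-8") as f:
            for item in json.load(f):
                item["dataset"] = d.name  # keep provenance for later analysis
                items.append(item)
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(items, f, ensure_ascii=False, indent=2)
    return len(items)

# e.g. the M4+0%D1+OP setting: M4Singer plus ROSVOT-annotated D1 and OpenSinger
# build_training_manifest(["m4singer", "d1_pseudo", "opensinger_pseudo"],
#                         "train_manifest.json")
```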
- Samples (samples 1-4 are from $D_1$ and provide audio for GT, M4+100%D1, M4+0%D1, and M4+0%D1+OP; samples 5 and 6 are from M4Singer and additionally include M4):
- Sample 1 ($D_1$): <breath> 如 果 那 两 个 字 没 有 颤 抖 <breath> 我 不 会 发 现 我 难 受 <breath> 怎 么 说 出 口 <silence>
- Sample 2 ($D_1$): 再 一 次 沾 染 你 <breath> 若 生 命 <breath> 如 过 场 电 影 <breath>
- Sample 3 ($D_1$): 能 够 握 紧 的 就 别 放 了 <breath> 能 够 拥 抱 的 就 别 拉 扯 <breath> 时 间 着 急 地 <breath>
- Sample 4 ($D_1$): 去 寻 找 遗 失 了 的 思 念 <breath> 如 果 你 在 眼 前 <breath> 我 会 让 你 看 见 <silence>
- Sample 5 (M4Singer): 我 想 唱 一 首 歌 给
- Sample 6 (M4Singer): 却 是 下 落 不 详 <breath> 心 好 <silence> 空 荡 <breath>
SVS with English Transcriptions
We use MFA and the same ROSVOT model trained on M4Singer to re-align and re-annotate a small English singing dataset, $D_2$. We finetune the M4+0%D1+OP instance from the previous section on this English singing dataset to test the cross-lingual annotation capability of ROSVOT. Inference is performed on English transcriptions. A minimal finetuning sketch follows the model list below.
- Models:
- GT: Ground Truth samples generated with the vocoder.
- RMSSinger: a large version, pre-trained with M4Singer + $D_1$ + OpenSinger and finetuned with $D_2$.
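The finetuning step can be sketched in generic PyTorch as below; the checkpoint layout (a `model` key), `strict=False` loading, and the reduced learning rate are illustrative assumptions, not RMSSinger's actual training configuration.

```python
import torch

def load_for_finetuning(model: torch.nn.Module, ckpt_path: str, lr: float = 1e-5):
    """Load pre-trained weights and return a low-LR optimizer for finetuning.

    Assumes the checkpoint stores weights under a "model" key; strict=False
    tolerates mismatched modules, e.g. newly added English phoneme embeddings.
    """
    state = torch.load(ckpt_path, map_location="cpu")
    model.load_state_dict(state["model"], strict=False)
    # A smaller learning rate than in pre-training helps preserve the
    # Chinese-trained acoustic model while adapting to English lyrics in D2.
    return torch.optim.AdamW(model.parameters(), lr=lr)
```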
- Samples (each sample provides audio for GT and RMSSinger):
- Sample 1: I wouldn’t change a thing about it
- Sample 2: They’ve all been said before you know <breath> so why don’t we <breath> just play pretend <breath>