Title:
OBTAINING A SINGING VOICE DETECTION MODEL
Document Type and Number:
WIPO Patent Application WO/2021/021305
Kind Code:
A1
Abstract:
The present disclosure provides methods and apparatuses for obtaining a singing voice detection model. A plurality of speech clips and a plurality of instrumental music clips may be synthesized into a plurality of audio clips. A speech detection model may be trained with the plurality of audio clips. At least a part of the speech detection model may be transferred to a singing voice detection model. The singing voice detection model may be trained with a set of polyphonic music clips.

Inventors:
HOU YUANBO (US)
LUAN JIAN (US)
SOONG KAO-PING (US)
Application Number:
PCT/US2020/036869
Publication Date:
February 04, 2021
Filing Date:
June 10, 2020
Assignee:
MICROSOFT TECHNOLOGY LICENSING LLC (US)
International Classes:
G10H1/00; G06N3/02; G10L25/30; G10L25/51
Other References:
SWAMINATHAN RUPAK VIGNESH ET AL: "Improving Singing Voice Separation Using Attribute-Aware Deep Network", 2019 INTERNATIONAL WORKSHOP ON MULTILAYER MUSIC REPRESENTATION AND PROCESSING (MMRP), IEEE, 23 January 2019 (2019-01-23), pages 60 - 65, XP033529291, DOI: 10.1109/MMRP.2019.8665379
STOLLER DANIEL ET AL: "Jointly Detecting and Separating Singing Voice: A Multi-Task Approach", 6 June 2018, ANNUAL INTERNATIONAL CONFERENCE ON THE THEORY AND APPLICATIONS OF CRYPTOGRAPHIC TECHNIQUES, EUROCRYPT 2018; [LECTURE NOTES IN COMPUTER SCIENCE; LECT.NOTES COMPUTER], SPRINGER, BERLIN, HEIDELBERG, PAGE(S) 329 - 339, ISBN: 978-3-642-17318-9, XP047474369
ALE KORETZKY: "Audio AI: isolating vocals from stereo music using Convolutional Neural Networks | by Ale Koretzky | Towards Data Science", 4 February 2019 (2019-02-04), XP055728359, Retrieved from the Internet [retrieved on 20200907]
MAVADDATI SAMIRA ED - KHATEB FABIAN ET AL: "A Novel Singing Voice Separation Method Based on a Learnable Decomposition Technique", CIRCUITS, SYSTEMS AND SIGNAL PROCESSING, CAMBRIDGE, MS, US, vol. 39, no. 7, 8 January 2020 (2020-01-08), pages 3652 - 3681, XP037127830, ISSN: 0278-081X, [retrieved on 20200108], DOI: 10.1007/S00034-019-01338-0
TAKAHASHI NAOYA ET AL: "Improving Voice Separation by Incorporating End-To-End Speech Recognition", ICASSP 2020 - 2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), IEEE, 4 May 2020 (2020-05-04), pages 41 - 45, XP033793497, DOI: 10.1109/ICASSP40776.2020.9053845
WEI TSUNG LU ET AL: "Vocal Melody Extraction with Semantic Segmentation and Audio-symbolic Domain Transfer Learning", ISMIR 2018, 26 September 2018 (2018-09-26), XP055728084, DOI: 10.5281/zenodo.1492466
YIN-JYUN LUO ET AL: "Learning Domain-Adaptive Latent Representations of Music Signals Using Variational Autoencoders", ISMIR 2018, 26 September 2018 (2018-09-26), XP055728801, DOI: 10.5281/zenodo.1492501
ARORA PRERNA ET AL: "A study on transfer learning for acoustic event detection in a real life scenario", 2017 IEEE 19TH INTERNATIONAL WORKSHOP ON MULTIMEDIA SIGNAL PROCESSING (MMSP), IEEE, 16 October 2017 (2017-10-16), pages 1 - 6, XP033271590, DOI: 10.1109/MMSP.2017.8122258
SHINGCHERND YOU ET AL: "Comparative study of singing voice detection based on deep neural networks and ensemble learning", HUMAN-CENTRIC COMPUTING AND INFORMATION SCIENCES, BIOMED CENTRAL LTD, LONDON, UK, vol. 8, no. 1, 26 November 2018 (2018-11-26), pages 1 - 18, XP021263029, DOI: 10.1186/S13673-018-0158-1
PO-SEN HUANG ET AL: "Singing-Voice Separation From Monaural Recordings Using Deep Recurrent Neural Networks", ISMIR 2014, 31 October 2014 (2014-10-31), XP055729050, DOI: 10.5281/zenodo.1415678
Attorney, Agent or Firm:
SWAIN, Cassandra T. et al. (US)
Claims:
CLAIMS

1. A method for obtaining a singing voice detection model, comprising:

synthesizing a plurality of speech clips and a plurality of instrumental music clips into a plurality of audio clips;

training a speech detection model with the plurality of audio clips;

transferring at least a part of the speech detection model to a singing voice detection model; and

training the singing voice detection model with a set of polyphonic music clips.

2. The method of claim 1, wherein the speech detection model performs a source task for detecting speech in an audio clip.

3. The method of claim 2, wherein each of the plurality of audio clips comprises a plurality of frame-level labels indicating whether there exists speech.

4. The method of claim 1, wherein the speech detection model is based on a convolutional neural network (CNN) comprising one or more convolutional layers.

5. The method of claim 4, wherein the transferring comprises: transferring at least one convolutional layer in the one or more convolutional layers to the singing voice detection model.

6. The method of claim 5, wherein the at least one convolutional layer is located at a bottom level of the one or more convolutional layers.

7. The method of claim 1, wherein the singing voice detection model performs a target task for detecting singing voice in a polyphonic music clip.

8. The method of claim 7, wherein each of the set of polyphonic music clips comprises a plurality of frame-level labels indicating whether there exists singing voice.

9. The method of claim 1, wherein the singing voice detection model performs a target task for detecting singing voice, accompaniment and silence in a polyphonic music clip.

10. The method of claim 9, wherein each of the set of polyphonic music clips comprises a plurality of frame-level labels indicating whether there exists singing voice, accompaniment and/or silence.

11. The method of claim 1, wherein the singing voice detection model is based on a convolutional recurrent neural network (CRNN), the CRNN comprising a convolutional neural network (CNN) and a recurrent neural network (RNN).

12. The method of claim 11, wherein the CNN comprises at least one convolutional layer transferred from the speech detection model.

13. The method of claim 12, wherein the training the singing voice detection model comprises: fixing parameters of the at least one convolutional layer; or adapting parameters of the at least one convolutional layer with the set of polyphonic music clips.

14. An apparatus for obtaining a singing voice detection model, comprising:

an audio clip synthesizing module, for synthesizing a plurality of speech clips and a plurality of instrumental music clips into a plurality of audio clips;

a speech detection model training module, for training a speech detection model with the plurality of audio clips;

a transferring module, for transferring at least a part of the speech detection model to a singing voice detection model; and

a singing voice detection model training module, for training the singing voice detection model with a set of polyphonic music clips.

15. An apparatus for obtaining a singing voice detection model, comprising:

at least one processor; and

a memory storing computer-executable instructions that, when executed, cause the at least one processor to:

synthesize a plurality of speech clips and a plurality of instrumental music clips into a plurality of audio clips,

train a speech detection model with the plurality of audio clips,

transfer at least a part of the speech detection model to a singing voice detection model, and

train the singing voice detection model with a set of polyphonic music clips.

Description:
OBTAINING A SINGING VOICE DETECTION MODEL

BACKGROUND

[0001] Singing voice detection techniques may be used for determining endpoints of singing voice in music clips, e.g., determining singing voice regions and non-singing voice regions in polyphonic music clips, etc. Herein, a polyphonic music clip may refer to an audio clip containing singing voices and accompaniments that are mixed together. Successful detection of singing voice regions in polyphonic music clips is critical to Music Information Retrieval (MIR) tasks. Typical MIR tasks may comprise, e.g., music summarization, music retrieval, music annotation, music genre classification, singing voice separation, etc.

SUMMARY

[0002] This Summary is provided to introduce a selection of concepts that are further described below in the Detailed Description. It is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.

[0003] Embodiments of the present disclosure propose methods and apparatuses for obtaining a singing voice detection model. A plurality of speech clips and a plurality of instrumental music clips may be synthesized into a plurality of audio clips. A speech detection model may be trained with the plurality of audio clips. At least a part of the speech detection model may be transferred to a singing voice detection model. The singing voice detection model may be trained with a set of polyphonic music clips.

[0004] It should be noted that the above one or more aspects comprise the features hereinafter fully described and particularly pointed out in the claims. The following description and the drawings set forth in detail certain illustrative features of the one or more aspects. These features are only indicative of the various ways in which the principles of various aspects may be employed, and this disclosure is intended to include all such aspects and their equivalents.

BRIEF DESCRIPTION OF THE DRAWINGS

[0005] The disclosed aspects will hereinafter be described in connection with the appended drawings that are provided to illustrate and not to limit the disclosed aspects.

[0006] FIG.1 illustrates an exemplary application of singing voice detection according to an embodiment.

[0007] FIG.2 illustrates an exemplary application of singing voice detection according to an embodiment.

[0008] FIG.3 illustrates an exemplary process for obtaining a singing voice detection model based on transfer learning according to an embodiment.

[0009] FIG.4 illustrates an exemplary implementation of a speech detection model according to an embodiment.

[0010] FIG.5 illustrates an exemplary implementation of a singing voice detection model according to an embodiment.

[0011] FIG.6 illustrates a flowchart of an exemplary method for obtaining a singing voice detection model according to an embodiment.

[0012] FIG.7 illustrates an exemplary apparatus for obtaining a singing voice detection model according to an embodiment.

[0013] FIG.8 illustrates an exemplary apparatus for obtaining a singing voice detection model according to an embodiment.

DETAILED DESCRIPTION

[0014] The present disclosure will now be discussed with reference to several example implementations. It is to be understood that these implementations are discussed only for enabling those skilled in the art to better understand and thus implement the embodiments of the present disclosure, rather than suggesting any limitations on the scope of the present disclosure.

[0015] Recently, deep learning techniques have been applied to singing voice detection. Deep Neural Networks may be used for estimating an Ideal Binary Spectrogram Mask that represents spectrogram bins in which singing voices are more prominent than accompaniments. A temporal and timbre feature-based model may be established based on a Convolutional Neural Network (CNN), for boosting the performance in MIR. Recurrent Neural Networks (RNN) may be employed to predict soft masks that are multiplied with the original signal to obtain a desired isolated region. Training the above systems requires a large-scale, accurately labeled polyphonic music clip dataset in which endpoints of singing voices, accompaniments, etc. are annotated at the frame level. However, such a large-scale labeled dataset is usually not available, and manual labeling is time-consuming and expensive. Therefore, only a small-scale labeled polyphonic music clip dataset may be practically used for training these systems.

[0016] To overcome the problem of insufficient training data, transfer learning has been proposed to extract knowledge learned from a source task and apply that knowledge to a similar but different target task. Transfer learning may alleviate the problem of insufficient training data for the target task and tends to yield a model that generalizes better. Attempts have been made to apply transfer learning to singing voice detection. For example, a CNN for music annotation may be trained on a dataset containing different genres of songs, and then be transferred to other music-related classification and regression tasks, e.g., singing voice detection. However, such transfer learning-based singing voice detection may only transfer singing voice knowledge among different genres of songs.

[0017] Embodiments of the present disclosure propose knowledge transfer from speech to singing voice. For example, a speech detection model for a source task of speech detection may be trained first, then a part of the speech detection model may be transferred to a singing voice detection model for a target task of singing voice detection, and the singing voice detection model may be further trained with a small amount of labeled polyphonic music clips. Although there are differences between speaking and singing, and acoustic characteristics may also vary with the change of accompaniments, there are still useful similarities between speech and singing voice that can be exploited. Transferring latent representations learned from speech clips may improve the performance of singing voice detection. The learned latent representations retain relevant information from the source task of speech detection and transfer that information to the target task of singing voice detection. Moreover, sharing knowledge between speech in the source task and singing voices in the target task may enable the singing voice detection model to understand human voices, including speech, singing voice, etc., in a more general and robust manner.

[0018] Clean speech clips and instrumental music clips are both widely available, e.g., on the Internet, and it is easy to detect endpoints of speech clips through various existing techniques and to further provide frame-level speech labels. Herein, a speech clip may comprise only voices of human speaking, and an instrumental music clip may comprise only sounds of instruments being played. Speech clips and instrumental music clips may be synthesized together to form a large-scale audio clip training dataset for training the speech detection model. Considering the possibly different phonations and vibration levels of vocal cords between speaking and singing, after transferring a part of the trained speech detection model to the singing voice detection model, the singing voice detection model may be further trained or optimized with a polyphonic music clip training dataset containing a small amount of labeled polyphonic music clips. Benefiting from the knowledge transferred from speech detection, although only a small amount of labeled polyphonic music clips are used, the obtained singing voice detection model will still have higher accuracy than conventional singing voice detection models.

[0019] In an aspect, the speech detection model may employ, e.g., a CNN to perform the source task of distinguishing between speech and non-speech in an audio clip. The singing voice detection model may employ, e.g., a convolutional recurrent neural network (CRNN) to perform the target task of singing voice detection in a polyphonic music clip. When performing the transfer, at least a part of the CNN, e.g., at least some convolutional layers, in the speech detection model may be transferred to the CRNN of the singing voice detection model. Different knowledge transfer modes may be employed. In one mode, when the singing voice detection model is trained with the polyphonic music clip training dataset, the part of the singing voice detection model that was transferred from the speech detection model may retain its original parameters. In another mode, the parameters of the part transferred from the speech detection model may be adapted or refined with the polyphonic music clip training dataset.

[0020] The embodiments of the present disclosure overcome the problem of insufficient training data for training a singing voice detection model, provide the obtained singing voice detection model with voice knowledge from both speech and singing voice, and enable the feature extraction to represent voices more effectively. The proposed transfer learning approach may enable the feature extraction trained on the source task to be efficiently adapted to the target task, and different knowledge transfer modes may be employed.

[0021] The singing voice detection model obtained according to the embodiments of the present disclosure may be applied in various scenarios. In one scenario, the singing voice detection model may be applied in an intelligent singing assistance system that automatically assists singing. When a singer is singing, if the system detects, through comparison with the original song, that the singer has stopped singing because of forgotten lyrics or other reasons, the system may prompt the lyrics in real time or automatically play the next sentence of the original song. In one scenario, the singing voice detection model may be applied as a pre-processing step for separating singing voices from accompaniments. For example, as such a pre-processing step, the singing voice detection model may detect at least the regions that need not be separated in a polyphonic music clip, e.g., singing-voice-only regions or accompaniment-only regions, thereby reducing the amount of processing in the separation and improving its efficiency. In one scenario, the singing voice detection model may be applied for music structure decomposition. For example, singing voice parts, accompaniment parts, silence or mute parts, etc. in a target piece of music may be identified with at least the singing voice detection model. In one scenario, the singing voice detection model may be applied as a pre-processing step for music recommendation, song library management, etc. For example, the singing voice detection model may be used for segmenting music or songs in a music library or song library in advance to extract a series of regions containing singing voice. These extracted singing voice regions facilitate efficient retrieval of the corresponding music or songs in music recommendation, song library management, etc.

[0022] FIG.1 illustrates an exemplary application 100 of singing voice detection according to an embodiment. A singing voice detection model obtained according to an embodiment of the present disclosure may be used for detecting singing voice regions and non-singing voice regions in a polyphonic music clip. A singing voice region may refer to a region including a singing voice of a singer in a polyphonic music clip, and a non-singing voice region may refer to a region not including a singing voice of a singer in a polyphonic music clip. Each singing voice region may be defined by corresponding singing voice endpoints, e.g., defined by a singing voice start timepoint and a singing voice end timepoint. Each non-singing voice region may be defined by corresponding non-singing voice endpoints, e.g., defined by a non-singing voice start timepoint and a non-singing voice end timepoint. In an implementation, the singing voice detection model may perform singing voice detection based on spectrograms.

[0023] As shown in FIG.1, a waveform of the polyphonic music clip to be detected may first be converted into a spectrogram. The spectrogram may be further provided to the singing voice detection model as input. The singing voice detection model may generate a detection result by processing the spectrogram, wherein the detection result identifies singing voice regions and non-singing voice regions in the polyphonic music clip. In an implementation, the singing voice detection model may achieve binary classification of frames in the polyphonic music clip, e.g., classifying each frame as singing voice or non-singing voice. After classifying the frames, adjacent frames having the same category may be collectively identified as a singing voice region or a non-singing voice region, thereby forming the final detection result. For example, the detection result may comprise: identifying a region from time t1 to time t2 as a non-singing voice region; identifying a region from time t2 to time t3 as a singing voice region; identifying a region from time t3 to time t4 as a non-singing voice region; and identifying a region from time t4 to time t5 as a singing voice region, etc.
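As a non-limiting illustration of how such frame-level decisions may be merged into regions, the following Python sketch (not part of the original disclosure) groups adjacent frames with the same predicted category into regions with start and end timepoints; the 10 ms frame hop and the 0/1 label encoding are assumptions made for the example only.

```python
from itertools import groupby

def frames_to_regions(frame_labels, hop_seconds=0.01):
    """Merge per-frame binary labels (1 = singing voice, 0 = non-singing voice)
    into contiguous regions with start/end timepoints in seconds."""
    regions = []
    frame_idx = 0
    for label, group in groupby(frame_labels):
        length = len(list(group))
        start = frame_idx * hop_seconds
        end = (frame_idx + length) * hop_seconds
        regions.append(("singing voice" if label == 1 else "non-singing voice", start, end))
        frame_idx += length
    return regions

# Frames 0-2 non-singing, 3-6 singing, 7-9 non-singing, with a 10 ms hop:
print(frames_to_regions([0, 0, 0, 1, 1, 1, 1, 0, 0, 0]))
```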

[0024] FIG.2 illustrates an exemplary application 200 of singing voice detection according to an embodiment. A singing voice detection model obtained according to an embodiment of the present disclosure may be used for detecting singing voice regions, accompaniment regions and silence regions in a polyphonic music clip. A singing voice region may refer to a region including a singing voice of a singer in a polyphonic music clip, an accompaniment region may refer to a region including sounds of instruments being played in a polyphonic music clip, and a silence region may refer to a region not including any sounds in a polyphonic music clip. Each singing voice region may be defined by corresponding singing voice endpoints, e.g., defined by a singing voice start timepoint and a singing voice end timepoint. Each accompaniment region may be defined by corresponding accompaniment endpoints, e.g., defined by an accompaniment start timepoint and an accompaniment end timepoint. Each silence region may be defined by corresponding silence endpoints, e.g., defined by a silence start timepoint and a silence end timepoint. In an implementation, the singing voice detection model may perform singing voice detection based on spectrograms.

[0025] As shown in FIG.2, a waveform of the polyphonic music clip to be detected may first be converted into a spectrogram. The spectrogram may be further provided to the singing voice detection model as an input feature. The singing voice detection model may generate a detection result by processing the spectrogram, wherein the detection result identifies singing voice regions, accompaniment regions and silence regions in the polyphonic music clip. In an implementation, the singing voice detection model may achieve triple classification of frames in the polyphonic music clip, e.g., classifying each frame as at least one of singing voice, accompaniment and silence. It should be appreciated that each frame may have one or more categories, e.g., if the current frame corresponds to a singer’s singing with accompaniment, this frame may have two categories of singing voice and accompaniment. After classifying the frames, adjacent frames having the same category may be collectively identified as a singing voice region, an accompaniment region or a silence region, thereby forming the final detection result. For example, the detection result may comprise: identifying a region from time t1 to time t3 as an accompaniment region; identifying a region from time t2 to time t4 as a singing voice region; identifying a region from time t4 to time t5 as a silence region; identifying a region from time t5 to time t7 as an accompaniment region; and identifying a region from time t6 to time t7 as a singing voice region, etc. Moreover, as shown in the figure, there may be overlapping parts between different types of regions, e.g., the accompaniment region overlapping with the singing voice region between time t2 and time t3, which indicates that the polyphonic music clip comprises both singing voice and accompaniment between time t2 and time t3.
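Because each frame may carry more than one category in this case, region extraction may be performed per class so that regions of different classes can overlap. The following sketch (an illustrative assumption, not the disclosed implementation) thresholds per-class frame scores independently and returns possibly overlapping regions.

```python
import numpy as np

def per_class_regions(frame_probs, class_names, hop_seconds=0.01, threshold=0.5):
    """frame_probs: array of shape (num_frames, num_classes) with per-class scores.
    Each class is thresholded independently, so regions of different classes
    may overlap (e.g., singing voice on top of accompaniment)."""
    regions = []
    for c, name in enumerate(class_names):
        active = frame_probs[:, c] >= threshold
        # Rising/falling edges of the boolean activity curve mark region boundaries.
        edges = np.diff(active.astype(int), prepend=0, append=0)
        starts = np.where(edges == 1)[0]
        ends = np.where(edges == -1)[0]
        for s, e in zip(starts, ends):
            regions.append((name, s * hop_seconds, e * hop_seconds))
    return sorted(regions, key=lambda r: r[1])

probs = np.array([[0.1, 0.9, 0.0],
                  [0.8, 0.9, 0.0],
                  [0.8, 0.2, 0.0],
                  [0.0, 0.0, 0.9]])
print(per_class_regions(probs, ["singing voice", "accompaniment", "silence"]))
```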

[0026] It should be appreciated that although exemplary applications included in the singing voice detection tasks according to the embodiments are discussed above in conjunction with FIG.1 and FIG.2, the singing voice detection tasks involved in the present disclosure are not limited to these exemplary applications, but may also cover any applications that aim to detect singing voice regions and one or more types of other annotated regions in a polyphonic music clip.

[0027] FIG.3 illustrates an exemplary process 300 for obtaining a singing voice detection model based on transfer learning according to an embodiment. According to an embodiment of the present disclosure, transfer learning is used for extracting voice knowledge from a source task of speech detection and applying the extracted voice knowledge to a target task of singing voice detection. By utilizing transfer learning, the problem that the training data for the target task of singing voice detection is insufficient to train a good singing voice detection model may be overcome. In an implementation, in the source task, a CNN in a speech detection model may be trained for detecting speech regions in a synthesized audio clip. Voice knowledge learned from a large-scale audio clip training dataset in the source task may be transferred to the target task. Then, a small-scale polyphonic music clip training dataset containing a small amount of labeled polyphonic music clips collected for the target task may be used for further training or optimizing a CRNN in the singing voice detection model, so as to perform singing voice detection in a polyphonic music clip.

[0028] A large number of speech clips 302 and instrumental music clips 304 may be obtained respectively. The speech clips 302 may be collected on the Internet or obtained from any content sources, and may be any type of recording containing only voices of human speaking, e.g., speech recordings, news broadcast recordings, storytelling recordings, etc. The instrumental music clips 304 may be collected on the Internet or obtained from any content sources, and may be any type of recording containing only sounds of instruments being played, e.g., pure instrumental music, etc. Moreover, the instrumental music clips 304 may also broadly comprise any non-speech sound recordings, e.g., recordings of sounds existing in nature, recordings of artificially simulated sounds, etc.

[0029] The speech clips 302 and the instrumental music clips 304 may be synthesized into a plurality of audio clips 306. For example, one or more speech clips and one or more instrumental music clips may be provided to a plurality of different audio tracks according to a specific timing, so as to synthesize an audio clip.

[0030] A large-scale audio clip training dataset 308 for training the speech detection model may be formed based on the synthesized audio clips 306. Each audio clip in the audio clip training dataset 308 may comprise a plurality of frame-level labels indicating whether there exists speech. In an implementation, speech regions, in which there exists speech, in the speech clips may be determined first. Each speech region is identified by a pair of speech endpoints including, e.g., a speech start timepoint and a speech end timepoint. Then, frame-level speech labels are added to frames in the speech clips based on the determined speech regions. For example, a label indicating the existence of speech is added to frames located in the speech regions, and a label indicating the absence of speech is added to frames not located in any speech region. Accordingly, the audio clips synthesized with the labeled speech clips also have a plurality of frame-level labels indicating the existence or absence of speech.
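A minimal sketch of this synthesis and frame-level labeling, assuming a fixed mixing offset, a single speech clip per mixture and a 10 ms label hop (none of which are mandated by the disclosure), is given below; real speech and instrumental recordings and an existing endpoint detector would be used in practice.

```python
import numpy as np

def synthesize_audio_clip(speech, music, sr, offset_s, hop_s=0.01, speech_gain=1.0):
    """Mix a clean speech clip into an instrumental music clip at a given offset
    and return the mixture plus frame-level speech labels (1 = speech present)."""
    offset = int(offset_s * sr)
    mixture = music.copy()
    end = min(len(mixture), offset + len(speech))
    mixture[offset:end] += speech_gain * speech[: end - offset]

    hop = int(hop_s * sr)
    n_frames = len(mixture) // hop
    labels = np.zeros(n_frames, dtype=np.int64)
    labels[offset // hop : end // hop] = 1    # frames inside the speech region
    return mixture, labels

sr = 16000
speech = 0.1 * np.random.randn(2 * sr)        # stand-ins for real recordings
music = 0.05 * np.random.randn(10 * sr)
mix, labels = synthesize_audio_clip(speech, music, sr, offset_s=3.0)
print(mix.shape, labels.shape, int(labels.sum()))
```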

[0031] The audio clip training dataset 308, containing a large number of labeled synthesized audio clips, may be used for training a speech detection model 310. The speech detection model 310 may perform a source task for detecting speech in an audio clip. For example, the speech detection model 310 may classify each frame in an audio clip as speech or non-speech, and may further determine speech regions and non-speech regions in the audio clip. In an implementation, the speech detection model 310 may be based on a CNN comprising one or more convolutional layers. The CNN may be trained for recognizing speech regions in an audio clip.

[0032] After the speech detection model 310 is trained, a singing voice detection model 320 may be constructed. The singing voice detection model 320 may perform a target task of singing voice detection. For example, in an implementation, the singing voice detection model 320 may perform a target task for detecting singing voice in a polyphonic music clip. The singing voice detection model 320 may classify each frame in a polyphonic music clip as singing voice or not, and may further determine singing voice regions and non-singing voice regions in the polyphonic music clip. For example, in another implementation, the singing voice detection model 320 may perform a target task for detecting singing voice, accompaniment and silence in a polyphonic music clip. The singing voice detection model 320 may classify each frame in a polyphonic music clip as singing voice, accompaniment and/or silence, and may further determine singing voice regions, accompaniment regions and silence regions in the polyphonic music clip.

[0033] The singing voice detection model 320 may be based on a CRNN. The CRNN may comprise, e.g., a CNN 322 and an RNN 324. According to the process 300, when constructing the singing voice detection model 320, at least a part of the CNN 312 in the speech detection model 310 may be transferred to the CNN 322 in the singing voice detection model 320. In one case, the entire CNN 312, e.g., all of its convolutional layers, may be transferred to the singing voice detection model 320 as the CNN 322. In another case, only a part of the CNN 312, e.g., one or more convolutional layers, may be transferred to the CNN 322 as a part of the CNN 322.

[0034] After the singing voice detection model 320 is constructed, it may be further trained or optimized. A set of polyphonic music clips 326 may be obtained and used for forming a polyphonic music clip training dataset 328 for training or optimizing the singing voice detection model 320. The polyphonic music clip training dataset 328 may comprise only a small number of labeled polyphonic music clips. Depending on the target task of singing voice detection performed by the singing voice detection model 320, the polyphonic music clips 326 may have corresponding frame-level labels. If the singing voice detection model 320 performs a target task for detecting singing voice in a polyphonic music clip, each polyphonic music clip in the polyphonic music clip training dataset 328 may comprise a plurality of frame-level labels indicating whether there exists singing voice. For example, a label indicating the existence of singing voice is added to frames located in singing voice regions in a polyphonic music clip, and a label indicating the absence of singing voice is added to frames not located in any singing voice region. If the singing voice detection model 320 performs a target task for detecting singing voice, accompaniment and silence in a polyphonic music clip, each polyphonic music clip in the polyphonic music clip training dataset 328 may comprise a plurality of frame-level labels indicating whether there exists singing voice, accompaniment and/or silence. For example, a label indicating the existence of singing voice is added to frames located in singing voice regions in a polyphonic music clip, a label indicating the existence of accompaniment is added to frames located in accompaniment regions, and a label indicating the existence of silence is added to frames located in silence regions. The polyphonic music clip training dataset 328 containing labeled polyphonic music clips may then be used for training or optimizing the singing voice detection model 320. Through the transfer process described above, the singing voice detection model 320 may obtain the knowledge about speech learned in the source task, and through further training or optimization with the polyphonic music clip training dataset 328, the singing voice detection model 320 may be better adapted to the singing voice data in the target task, thereby mitigating the mismatch problem in which a detection model trained only with synthesized audio clips does not match the data of the target task well.

[0035] The singing voice detection model 320 obtained through the process 300 may be used for performing a singing voice detection task on input polyphonic music clips with high accuracy.

[0036] FIG.4 illustrates an exemplary implementation of a speech detection model according to an embodiment. The speech detection model 420 shown in FIG.4 may correspond to the speech detection model 310 in FIG.3.

[0037] Input 410 of the speech detection model 420 may be an audio clip. In an implementation, a waveform of the audio clip may be converted into a spectrogram, and the spectrogram is used as the input 410. During the training process, the audio clip may be an audio clip synthesized from speech clips and instrumental music clips. The spectrogram converted from the waveform of the audio clip may be a Mel spectrogram, e.g., a log Mel spectrogram, which is a 2D representation that approximates human auditory perception and has high computational efficiency. As an example, the following discussion takes an audio clip representation in the form of a log Mel spectrogram as the input feature of the speech detection model 420.
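As an illustration, a log Mel spectrogram input feature may be computed as in the following sketch; the sampling rate, FFT size, hop length and number of Mel bins are assumptions for the example and are not specified by the disclosure.

```python
import librosa
import numpy as np

def log_mel_spectrogram(wav_path, sr=16000, n_fft=1024, hop_length=160, n_mels=64):
    """Convert a waveform into a log Mel spectrogram of shape (num_frames, n_mels)."""
    y, _ = librosa.load(wav_path, sr=sr, mono=True)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel, ref=np.max)     # log compression
    return log_mel.T

# feature = log_mel_spectrogram("audio_clip.wav")      # hypothetical input file
```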

[0038] In an implementation, the speech detection model 420 may be based on a CNN. For example, the speech detection model 420 may comprise a CNN 430. The CNN 430 may comprise one or more convolutional layers stacked in sequence, e.g., convolutional layer 432, convolutional layer 436, convolutional layer 440, etc. Moreover, optionally, each convolutional layer may be further attached with a corresponding pooling layer, e.g., pooling layer 434, pooling layer 438, pooling layer 442, etc. These pooling layers may be, e.g., max-pooling layers. It should be appreciated that the structure of the CNN 430 shown in FIG.4 is only exemplary, and depending on specific application requirements or design constraints, the CNN 430 may also have any other structure, e.g., comprising more or fewer convolutional layers, omitting pooling layers, adding layers for other processes, etc.

[0039] In an implementation, in order to comprehensively understand the contextual information of an audio clip, the input of the CNN 430 may adopt a moving data block. The moving data block may comprise the current frame, the preceding L frames of the current frame, and the succeeding L frames of the current frame. The shift between consecutive blocks may be, e.g., one frame. Each moving data block may contain 2L + 1 frames. The value of L determines the range of context visible at each frame, which may be set empirically.
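The moving data block may be assembled as in the sketch below; L = 7 and edge padding for the first and last frames are assumptions for the example (the disclosure only states that L is set empirically).

```python
import numpy as np

def moving_blocks(log_mel, L=7):
    """Build one (2L+1)-frame context block per frame, with a shift of one frame
    between consecutive blocks. The spectrogram is padded by repeating its edge
    frames so the first and last frames also get full blocks."""
    num_frames, n_mels = log_mel.shape
    padded = np.pad(log_mel, ((L, L), (0, 0)), mode="edge")
    blocks = np.stack([padded[i:i + 2 * L + 1] for i in range(num_frames)])
    return blocks                                       # (num_frames, 2L+1, n_mels)

feature = np.random.randn(1000, 64)                     # stand-in log Mel spectrogram
print(moving_blocks(feature).shape)                     # (1000, 15, 64)
```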

[0040] The convolutional layers in the CNN 430 may be used for extracting spatial location information. For example, the convolutional layers may learn local shift-invariant patterns from the input log Mel spectrogram feature. Optionally, to preserve the time resolution of the input, pooling may be applied to the frequency axis only. A convolutional layer may be represented by (filters, (receptive field in time, receptive field in frequency)), e.g., (64, (3, 3)). A pooling layer may be represented by (pooling length in time, pooling length in frequency), e.g., (1, 4). In all convolutional layers, batch normalization may be used to accelerate training convergence. In an implementation, to reduce the gradient vanishing problem in training deep networks, gated linear units (GLUs) may be used in the convolutional layers. The GLUs provide a linear path for gradient propagation while retaining non-linear capabilities through, e.g., a sigmoid operation. Given W and V as convolutional filters, b and c as biases, X as the input features or the feature maps of the intermediate layers, and σ as the sigmoid function, the GLU may be defined as:

Y = (W * X + b) ⊙ σ(V * X + c)     Equation (1)

where ⊙ is the element-wise product and * is the convolution operator. It should be appreciated that another benefit of using GLUs is that, by weighting time-frequency units separately according to their unique time positions, GLUs may help the network concentrate on speech and ignore unrelated instrumental music, etc.
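Equation (1) may be realized as a gated convolutional block, for example as in the PyTorch sketch below; the kernel size, channel count and placement of batch normalization after the gating are assumptions for the example.

```python
import torch
import torch.nn as nn

class GLUConv2d(nn.Module):
    """Gated linear unit convolution per Equation (1):
    Y = (W * X + b) ⊙ σ(V * X + c)."""
    def __init__(self, in_ch, out_ch, kernel_size=(3, 3)):
        super().__init__()
        padding = (kernel_size[0] // 2, kernel_size[1] // 2)  # keep time/frequency size
        self.linear_conv = nn.Conv2d(in_ch, out_ch, kernel_size, padding=padding)
        self.gate_conv = nn.Conv2d(in_ch, out_ch, kernel_size, padding=padding)
        self.bn = nn.BatchNorm2d(out_ch)                      # batch normalization

    def forward(self, x):
        y = self.linear_conv(x) * torch.sigmoid(self.gate_conv(x))
        return self.bn(y)

x = torch.randn(8, 1, 15, 64)              # (batch, channel, 2L+1 frames, Mel bins)
print(GLUConv2d(1, 64)(x).shape)           # torch.Size([8, 64, 15, 64])
```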

[0041] The speech detection model 420 may further comprise an output layer 444. The output layer 444 may comprise two output units with, e.g., a softmax activation, which may indicate whether the current input corresponds to speech. It should be appreciated that although not shown in FIG.4, a ReLU-based fully-connected layer may optionally be included between the pooling layer 442 and the output layer 444.

[0042] The speech detection model 420 may classify frames in an audio clip as speech or non-speech, and these classification results may form the final speech detection result 450. In one case, the speech detection result 450 may be represented as frame-level speech or non-speech labels for frames in the audio clip. In one case, the speech detection result 450 may be an integration of frame-level speech or non-speech labels, and is represented as speech regions and non-speech regions as identified in the audio clip.
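Putting the pieces together, a minimal speech detection CNN in the spirit of FIG.4 might look like the sketch below; the number of blocks, filter counts, the time-averaging before the output layer, and the plain ReLU blocks (the GLU block sketched above could be substituted) are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class SpeechDetectionCNN(nn.Module):
    """Minimal CNN for the source task: classify one context block as speech or
    non-speech. Pooling is applied along the frequency axis only, e.g., (1, 4)."""
    def __init__(self, n_mels=64, n_blocks=3):
        super().__init__()
        layers, in_ch = [], 1
        for _ in range(n_blocks):
            layers += [nn.Conv2d(in_ch, 64, (3, 3), padding=1),
                       nn.BatchNorm2d(64), nn.ReLU(),
                       nn.MaxPool2d((1, 4))]          # frequency-only pooling
            in_ch = 64
        self.cnn = nn.Sequential(*layers)
        self.fc = nn.Linear(64 * (n_mels // 4 ** n_blocks), 128)
        self.out = nn.Linear(128, 2)                  # two output units (softmax at inference)

    def forward(self, x):                             # x: (batch, 1, 2L+1, n_mels)
        h = self.cnn(x)                               # (batch, 64, 2L+1, 1)
        h = h.mean(dim=2)                             # summarize the time axis
        h = torch.relu(self.fc(h.flatten(1)))         # ReLU-based fully-connected layer
        return self.out(h)                            # logits for speech / non-speech

x = torch.randn(8, 1, 15, 64)
print(SpeechDetectionCNN()(x).shape)                  # torch.Size([8, 2])
```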

[0043] FIG.5 illustrates an exemplary implementation of a singing voice detection model according to an embodiment. The singing voice detection model 520 shown in FIG.5 may correspond to the singing voice detection model 320 in FIG.3.

[0044] Input 510 of the singing voice detection model 520 may be a polyphonic music clip. In an implementation, a waveform of the polyphonic music clip may be converted into a spectrogram, and the spectrogram is used as the input 510. The spectrogram converted from the waveform of the polyphonic music clip may be a Mel spectrogram, e.g., log Mel spectrogram. As an example, the following discussion takes a polyphonic music clip representation in the form of log Mel spectrogram as an input feature of the singing voice detection model 520.

[0045] In an implementation, the singing voice detection model 520 may be based on a CRNN. For example, the singing voice detection model 520 may comprise a CNN 530. The CNN 530 may comprise one or more convolutional layers stacked in sequence, e.g., convolutional layer 532, convolutional layer 536, convolutional layer 540, etc. The convolutional layers in the CNN 530 may be used for extracting spatial location information. Moreover, optionally, each convolutional layer may be further attached with a corresponding pooling layer, e.g., pooling layer 534, pooling layer 538, pooling layer 542, etc. These pooling layers may be, e.g., max-pooling layers. It should be appreciated that the structure of the CNN 530 shown in FIG.5 is only exemplary, and depending on specific application requirements or design constraints, the CNN 530 may also have any other structure, e.g., comprising more or fewer convolutional layers, omitting pooling layers, adding layers for other processes, etc. In an implementation, in order to comprehensively understand the contextual information of a polyphonic music clip, similar to the above discussion in conjunction with FIG.4, the input of the CNN 530 may also adopt a moving data block. The moving data block may comprise the current frame, the preceding L frames of the current frame, and the succeeding L frames of the current frame. The shift between consecutive blocks may be, e.g., one frame. Each moving data block may contain 2L + 1 frames. The value of L determines the range of context visible at each frame, which may be set empirically.

[0046] The singing voice detection model 520 may further comprise an RNN 550. The RNN 550 may learn timing information and capture long-term temporal contextual information. The RNN 550 may utilize recurrent neurons, e.g., simple RNN, gated recurrent unit (GRU), long short-term memory (LSTM) network, etc., for learning the timing information. A recurrent neuron in the RNN 550 may have a feedback loop for feeding the learned information back to its own neuron in order to record historical information. Therefore, at the next instant, the current information and the existing historical information may be combined to jointly make a decision. In an implementation, in order to jointly make a decision in combination with contextual information, the RNN 550 may also be based on a bidirectional recurrent neural network. In each recurrent neuron in the bidirectional recurrent neural network, the information flow propagates not only from front to back, but also from back to front, so that the recurrent neuron may know past information and future information within a certain time range, thereby making better decisions.
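A minimal CRNN in the spirit of FIG.5 is sketched below: a convolutional front-end (whose bottom layers are candidates for transfer from the speech detection model) followed by a bidirectional GRU and a per-frame output layer. Layer sizes are illustrative assumptions; num_classes would be 2 for singing/non-singing, or 3 (with sigmoid outputs) for singing voice, accompaniment and silence.

```python
import torch
import torch.nn as nn

class SingingVoiceCRNN(nn.Module):
    """Minimal CRNN for the target task: conv layers extract local spectral
    patterns, a bidirectional GRU captures long-term temporal context, and a
    linear layer produces per-frame class scores."""
    def __init__(self, n_mels=64, num_classes=2):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 64, (3, 3), padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d((1, 4)),                       # frequency-only pooling
            nn.Conv2d(64, 64, (3, 3), padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d((1, 4)),
        )
        self.rnn = nn.GRU(64 * (n_mels // 16), 64,
                          batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * 64, num_classes)

    def forward(self, x):                               # x: (batch, 1, frames, n_mels)
        h = self.cnn(x)                                 # (batch, 64, frames, n_mels // 16)
        h = h.permute(0, 2, 1, 3).flatten(2)            # (batch, frames, features)
        h, _ = self.rnn(h)                              # bidirectional temporal context
        return self.out(h)                              # per-frame logits

x = torch.randn(4, 1, 500, 64)                          # a 500-frame polyphonic clip
print(SingingVoiceCRNN()(x).shape)                      # torch.Size([4, 500, 2])
```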

[0047] The singing voice detection model 520 may further comprise an output layer 552. The output layer 552 may generate a classification result for the current input. Depending on different specific singing voice detection tasks, the classification result may be singing voice or non-singing voice, or may be singing voice, accompaniment, or silence.

[0048] The classification result generated by the singing voice detection model 520 may form a final singing voice detection result 560. In one case, the singing voice detection result 560 may be represented as frame-level classification labels for frames in the polyphonic music clip, e.g., singing voice or non-singing voice, or, e.g., singing voice, accompaniment or silence. In one case, the singing voice detection result 560 may be an integration of frame-level classification results, and is represented as singing voice regions and non-singing voice regions, or singing voice regions, accompaniment regions and silence regions, as identified in the polyphonic music clip.

[0049] As described above, the CNN 530 in the singing voice detection model 520 may be constructed by transferring from the CNN 430 of the speech detection model 420. For example, at least one of the convolutional layer 532, the convolutional layer 536, and the convolutional layer 540 in the CNN 530 may come from the corresponding convolutional layers in the CNN 430. The CNN 530 may be constructed in various ways. In one construction approach, all the convolutional layers in the CNN 430 may be transferred to the CNN 530, and accordingly, the convolutional layer 532, the convolutional layer 536, and the convolutional layer 540 may correspond to the convolutional layer 432, the convolutional layer 436 and the convolutional layer 440 respectively. In another construction approach, a part of the convolutional layers in the CNN 430 may be transferred to the CNN 530. For example, only the convolutional layer 432 is transferred to the CNN 530 as the convolutional layer 532, or only the convolutional layer 432 and the convolutional layer 436 are transferred to the CNN 530 as the convolutional layer 532 and the convolutional layer 536. In this case, preferably, one or more convolutional layers located at the bottom level of the CNN 430 may be transferred to the CNN 530 as the corresponding bottom-level convolutional layers of the CNN 530, wherein the bottom-level convolutional layers refer to those convolutional layers closer to the input 410 or 510. The bottom-level convolutional layers contain more generic features that are useful for both the source task and the target task. The bottom-level convolutional layers learn basic, local features of sound, while high-level convolutional layers learn high-level representations and knowledge that may be less relevant to the target task. The singing voice in the target task is more complicated than the speech in the source task, because the singing voice changes with the accompaniment. Therefore, high-level representations of sound learned from speech by the high-level convolutional layers in the CNN 430 may not match the target task, so transferring this knowledge is less helpful for the target task. Accordingly, transferring one or more convolutional layers located at the bottom level of the CNN 430 to the CNN 530, rather than convolutional layers located at higher levels, may help to further improve the performance of the CNN 530.
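In PyTorch terms, transferring only the bottom-level layers can be done by copying the state of the corresponding modules, as in the following self-contained sketch (the two-block stand-in networks are assumptions for the example, not the disclosed architectures).

```python
import torch
import torch.nn as nn

# Stand-ins for the conv front-ends of the speech model (CNN 430) and the
# singing voice model (CNN 530); both start with the same bottom-level block.
def conv_block(in_ch):
    return nn.Sequential(nn.Conv2d(in_ch, 64, 3, padding=1),
                         nn.BatchNorm2d(64), nn.ReLU())

speech_cnn = nn.Sequential(conv_block(1), conv_block(64))    # trained on audio clips
singing_cnn = nn.Sequential(conv_block(1), conv_block(64))   # to be trained on music

# Transfer only the bottom-level block (the block closest to the input).
singing_cnn[0].load_state_dict(speech_cnn[0].state_dict())

# The bottom-level blocks now share parameters; the higher blocks do not.
print(torch.equal(singing_cnn[0][0].weight, speech_cnn[0][0].weight))   # True
print(torch.equal(singing_cnn[1][0].weight, speech_cnn[1][0].weight))   # False
```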

[0050] The above transfer from the CNN 430 to the CNN 530 may employ various knowledge transfer modes. In a transfer mode that may be called a fixed mode, knowledge from the source task may be applied directly to the target task. For example, parameters learned by the convolutional layers in the CNN 430 are directly transferred to the CNN 530, and these parameters are fixed or retained in subsequent training of the singing voice detection model 520. Specifically, assuming that the convolutional layer 432 in the CNN 430 is transferred to the CNN 530 as the convolutional layer 532, the convolutional layer 532 will fix those parameters previously learned by the convolutional layer 432, and will not change these parameters in the subsequent training process. In another transfer mode that may be called a fine-tuning mode, the CNN 530 considers new knowledge learned from the target task domain, in addition to the knowledge from the source task. For example, parameters learned by the convolutional layers in the CNN 430 are firstly transferred to the CNN 530 as initial values of the corresponding convolutional layers, and then during the training of the singing voice detection model 520 with a polyphonic music clip training dataset, the transferred parameters are adapted or fine-tuned continuously, so that new knowledge in the target task of singing voice detection may be learned and knowledge from both the source task and the target task may be integrated, thus obtaining a more generic and more robust model.
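The two transfer modes may be expressed as follows in PyTorch; the per-group learning rates are a common practice added for illustration and are not specified by the disclosure.

```python
import torch
import torch.nn as nn

model = nn.Sequential(                        # stand-in for the singing voice model
    nn.Conv2d(1, 64, 3, padding=1),           # transferred bottom-level layer
    nn.Conv2d(64, 64, 3, padding=1),          # layer trained from scratch
)

# Fixed mode: freeze the transferred layer so it retains the parameters
# learned in the source task of speech detection.
for p in model[0].parameters():
    p.requires_grad = False
optimizer = torch.optim.Adam(
    [p for p in model.parameters() if p.requires_grad], lr=1e-3)

# Fine-tuning mode: adapt the transferred layer with the polyphonic music clips,
# here with a smaller learning rate than the newly added layer.
for p in model[0].parameters():
    p.requires_grad = True
optimizer = torch.optim.Adam([
    {"params": model[0].parameters(), "lr": 1e-4},    # transferred layer
    {"params": model[1].parameters(), "lr": 1e-3},    # newly added layer
])
```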

[0051] The knowledge transfer modes and the various construction approaches of the CNN 530 described above may be arbitrarily combined. For example, after transferring one or more convolutional layers located at the bottom level of the CNN 430 to the CNN 530, the fine-tuning mode may be employed to adapt or fine-tune the parameters of the transferred convolutional layers.

[0052] It should be appreciated that the CNN 530 may have a structure similar to that of the CNN 430. Those convolutional layers in the CNN 530 that are not transferred from the CNN 430 may be trained in the process of training the singing voice detection model with polyphonic music clips. Moreover, optionally, the pooling layers in the CNN 530 may be transferred from the CNN 430 along with the corresponding convolutional layers, or may be reconstructed.

[0053] FIG.6 illustrates a flowchart of an exemplary method 600 for obtaining a singing voice detection model according to an embodiment.

[0054] At 610, a plurality of speech clips and a plurality of instrumental music clips may be synthesized into a plurality of audio clips.

[0055] At 620, a speech detection model may be trained with the plurality of audio clips.

[0056] At 630, at least a part of the speech detection model may be transferred to a singing voice detection model.

[0057] At 640, the singing voice detection model may be trained with a set of polyphonic music clips.
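A self-contained toy sketch of steps 610-640 is given below; the tiny stand-in networks, random tensors and clip-level labels replace the real models, synthesized audio clips and frame-level labels described above, and serve only to show the order of operations.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def front_end():                                    # shared conv front-end layout
    return nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                         nn.AdaptiveAvgPool2d((1, 1)), nn.Flatten())

speech_model = nn.Sequential(front_end(), nn.Linear(16, 2))    # source task model
singing_model = nn.Sequential(front_end(), nn.Linear(16, 2))   # target task model

def train(model, loader, epochs=1):
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()

# 610: synthesized, labeled audio clips (random stand-ins here).
audio = TensorDataset(torch.randn(32, 1, 15, 64), torch.randint(0, 2, (32,)))
# 620: train the speech detection model with the audio clips.
train(speech_model, DataLoader(audio, batch_size=8))
# 630: transfer the front-end of the speech model to the singing voice model.
singing_model[0].load_state_dict(speech_model[0].state_dict())
# 640: train the singing voice model with a small set of polyphonic music clips.
music = TensorDataset(torch.randn(16, 1, 15, 64), torch.randint(0, 2, (16,)))
train(singing_model, DataLoader(music, batch_size=8))
```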

[0058] In an implementation, the speech detection model may perform a source task for detecting speech in an audio clip. Each of the plurality of audio clips may comprise a plurality of frame-level labels indicating whether there exists speech.

[0059] In an implementation, the speech detection model is based on a CNN comprising one or more convolutional layers. The transferring may comprise: transferring at least one convolutional layer in the one or more convolutional layers to the singing voice detection model. The at least one convolutional layer may be located at a bottom level of the one or more convolutional layers. Each of the one or more convolutional layers may be connected to a corresponding pooling layer.

[0060] In an implementation, the singing voice detection model may perform a target task for detecting singing voice in a polyphonic music clip. Each of the set of polyphonic music clips may comprise a plurality of frame-level labels indicating whether there exists singing voice.

[0061] In an implementation, the singing voice detection model may perform a target task for detecting singing voice, accompaniment and silence in a polyphonic music clip. Each of the set of polyphonic music clips may comprise a plurality of frame-level labels indicating whether there exists singing voice, accompaniment and/or silence.

[0062] In an implementation, the singing voice detection model may be based on a CRNN, the CRNN comprising a CNN and an RNN. The CNN may comprise at least one convolutional layer transferred from the speech detection model. The training the singing voice detection model may comprise: fixing parameters of the at least one convolutional layer. Optionally, the training the singing voice detection model may comprise: adapting parameters of the at least one convolutional layer with the set of polyphonic music clips.

[0063] In an implementation, inputs to the speech detection model and the singing voice detection model may be in a Mel spectrogram form.

[0064] It should be appreciated that the method 600 may further comprise any steps/processes for obtaining a singing voice detection model according to the above embodiments of the present disclosure.

[0065] FIG.7 illustrates an exemplary apparatus 700 for obtaining a singing voice detection model according to an embodiment.

[0066] The apparatus 700 may comprise: an audio clip synthesizing module 710, for synthesizing a plurality of speech clips and a plurality of instrumental music clips into a plurality of audio clips; a speech detection model training module 720, for training a speech detection model with the plurality of audio clips; a transferring module 730, for transferring at least a part of the speech detection model to a singing voice detection model; and a singing voice detection model training module 740, for training the singing voice detection model with a set of polyphonic music clips.

[0067] Moreover, the apparatus 700 may further comprise any other modules configured for obtaining a singing voice detection model according to the above embodiments of the present disclosure.

[0068] FIG.8 illustrates an exemplary apparatus 800 for obtaining a singing voice detection model according to an embodiment.

[0069] The apparatus 800 may comprise at least one processor 810 and a memory 820 storing computer-executable instructions. When the computer-executable instructions are executed, the processor 810 may: synthesize a plurality of speech clips and a plurality of instrumental music clips into a plurality of audio clips; train a speech detection model with the plurality of audio clips; transfer at least a part of the speech detection model to a singing voice detection model; and train the singing voice detection model with a set of polyphonic music clips. Moreover, the processor 810 may further perform any steps/processes for obtaining a singing voice detection model according to the above embodiments of the present disclosure.

[0070] The embodiments of the present disclosure may be embodied in a non-transitory computer-readable medium. The non-transitory computer-readable medium may comprise instructions that, when executed, cause one or more processors to perform any operations of the methods for obtaining a singing voice detection model according to the above embodiments of the present disclosure.

[0071] It should be appreciated that all the operations in the methods described above are merely exemplary, and the present disclosure is not limited to any operations in the methods or sequence orders of these operations, and should cover all other equivalents under the same or similar concepts.

[0072] It should also be appreciated that all the modules in the apparatuses described above may be implemented in various approaches. These modules may be implemented as hardware, software, or a combination thereof. Moreover, any of these modules may be further functionally divided into sub-modules or combined together.

[0073] Processors are described in connection with various apparatus and methods. These processors may be implemented using electronic hardware, computer software, or any combination thereof. Whether these processors are implemented as hardware or software will depend on the specific application and the overall design constraints imposed on the system. By way of example, a processor, any portion of a processor, or any combination of processors presented in this disclosure may be implemented as a microprocessor, a micro-controller, a digital signal processor (DSP), a field programmable gate array (FPGA), a programmable logic device (PLD), state machine, gate logic, discrete hardware circuitry, and other suitable processing components configured to perform the various functions described in this disclosure. The functions of a processor, any portion of a processor, or any combination of processors presented in this disclosure may be implemented as software executed by a microprocessor, a micro-controller, a DSP, or other suitable platforms.

[0074] Software should be considered broadly to represent instructions, instruction sets, code, code segments, program code, programs, subroutines, software modules, applications, software applications, software packages, routines, subroutines, objects, running threads, processes, functions, and the like. Software may reside on a computer-readable medium. A computer-readable medium may include, for example, a memory, which may be, for example, a magnetic storage device (e.g., a hard disk, a floppy disk, a magnetic strip), an optical disk, a smart card, a flash memory device, a random access memory (RAM), a read only memory (ROM), a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM), a register, or a removable disk. Although a memory is shown as being separate from the processor in various aspects presented in this disclosure, a memory may also be internal to the processor (e.g., a cache or a register).

[0075] The previous description is provided to enable any person skilled in the art to practice the various aspects described herein. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects. Thus, the claims are not intended to be limited to the aspects shown herein. All structural and functional equivalents to the elements of the various aspects described throughout the present disclosure that are known or later come to be known to those of ordinary skill in the art are intended to be encompassed by the claims.