


Title:
SINGING VOICE SYNTHESIS
Document Type and Number:
WIPO Patent Application WO/2021/101665
Kind Code:
A1
Abstract:
The present disclosure provides methods and apparatuses for singing voice synthesis. First music score phoneme information extracted from a music score may be received, the first music score phoneme information comprising a first phoneme, and a pitch and a beat of a note corresponding to the first phoneme. A fundamental frequency residual and spectral parameters corresponding to the first phoneme may be generated based on the first music score phoneme information. A fundamental frequency corresponding to the first phoneme may be obtained through regulating the pitch of the note with the fundamental frequency residual. An acoustic waveform corresponding to the first phoneme may be generated based at least in part on the fundamental frequency and the spectral parameters.

Inventors:
LU PEILING (US)
LUAN JIAN (US)
WU JIE (US)
Application Number:
PCT/US2020/057268
Publication Date:
May 27, 2021
Filing Date:
October 26, 2020
Assignee:
MICROSOFT TECHNOLOGY LICENSING LLC (US)
International Classes:
G10H7/08; G10H1/06; G10L13/033; G10L25/30
Foreign References:
EP2276019A12011-01-19
Other References:
MERLIJN BLAAUW ET AL: "A Neural Parametric Singing Synthesizer Modeling Timbre and Expression from Natural Songs", APPLIED SCIENCES, vol. 7, no. 12, 18 December 2017 (2017-12-18), pages 1313, XP055627719, DOI: 10.3390/app7121313
TAKESHI SAITOU ET AL: "Speech-to-Singing Synthesis: Converting Speaking Voices to Singing Voices by Controlling Acoustic Features Unique to Singing Voices", APPLICATIONS OF SIGNAL PROCESSING TO AUDIO AND ACOUSTICS, 2007 IEEE WORKSHOP ON, IEEE, PI, 1 October 2007 (2007-10-01), pages 215 - 218, XP031167096, ISBN: 978-1-4244-1618-9
KAZUHIRO NAKAMURA ET AL: "Singing voice synthesis based on convolutional neural networks", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 15 April 2019 (2019-04-15), XP081169221
Attorney, Agent or Firm:
SWAIN, Cassandra T. et al. (US)
Claims:
CLAIMS

1. A method for singing voice synthesis, comprising: receiving a first music score phoneme information extracted from a music score, the first music score phoneme information comprising a first phoneme, and a pitch and a beat of a note corresponding to the first phoneme; generating a fundamental frequency residual and spectral parameters corresponding to the first phoneme based on the first music score phoneme information; obtaining a fundamental frequency corresponding to the first phoneme through regulating the pitch of the note with the fundamental frequency residual; and generating an acoustic waveform corresponding to the first phoneme based at least in part on the fundamental frequency and the spectral parameters.

2. The method of claim 1, wherein the generating a fundamental frequency residual and spectral parameters corresponding to the first phoneme comprises: generating a first vector representation based on the first music score phoneme information; determining, by a duration predictor, a phoneme duration of the first phoneme based on the first vector representation, the duration predictor being configured for predicting a phoneme duration under a constraint by a note beat; expanding the first vector representation to a second vector representation based on the phoneme duration of the first phoneme; and generating the fundamental frequency residual and the spectral parameters corresponding to the first phoneme based at least on the second vector representation.

3. The method of claim 2, wherein training data for the duration predictor at least comprises: a reference phoneme duration of each reference phoneme and a beat of each reference note, extracted from a reference audio.

4. The method of claim 3, wherein training of the duration predictor adopts a first loss function, the first loss function being for calculating a difference between: a phoneme duration predicted by the duration predictor for a reference phoneme; and a reference phoneme duration of the reference phoneme.

5. The method of claim 4, wherein the training of the duration predictor further adopts a second loss function, the second loss function being for calculating a difference between: a sum of a plurality of phoneme durations predicted by the duration predictor for a plurality of reference phonemes corresponding to a reference note; and a beat of the reference note.

6. The method of claim 2, further comprising: receiving an indication of a singing style, and wherein the determining a phoneme duration of the first phoneme is further based on the singing style, and the generating a fundamental frequency residual and spectral parameters corresponding to the first phoneme is further based on the singing style.

7. The method of claim 1, further comprising: receiving an indication of voice of a target singer, and wherein the generating spectral parameters corresponding to the first phoneme is further based on the voice of the target singer.

8. The method of claim 2, further comprising: receiving an indication of a singing style of a first target singer; and receiving an indication of voice of a second target singer, and wherein the determining a phoneme duration of the first phoneme is further based on the singing style of the first target singer, the generating a fundamental frequency residual corresponding to the first phoneme is further based on the singing style of the first target singer, and the generating spectral parameters corresponding to the first phoneme is further based on the singing style of the first target singer and the voice of the second target singer.

9. The method of claim 1, wherein the fundamental frequency residual and the spectral parameters corresponding to the first phoneme are generated through a self-attention based feed-forward neural network.

10. The method of claim 1, wherein the fundamental frequency residual and the spectral parameters corresponding to the first phoneme are generated in a non-autoregressive approach.

11. The method of claim 1, wherein the music score is generated based on at least one of: image music score data, audio music data, symbolic music score data, and text music score data.

12. An apparatus for singing voice synthesis, comprising: an acoustic feature predictor, for: receiving a first music score phoneme information extracted from a music score, the first music score phoneme information comprising a first phoneme, and a pitch and a beat of a note corresponding to the first phoneme; and generating a fundamental frequency residual and spectral parameters corresponding to the first phoneme based on the first music score phoneme information; a pitch regulator, for obtaining a fundamental frequency corresponding to the first phoneme through regulating the pitch of the note with the fundamental frequency residual; and a vocoder, for generating an acoustic waveform corresponding to the first phoneme based at least in part on the fundamental frequency and the spectral parameters.

13. The apparatus of claim 12, wherein the acoustic feature predictor comprises: a music score encoder, for generating a first vector representation based on the first music score phoneme information; a duration predictor, for determining a phoneme duration of the first phoneme based on the first vector representation, the duration predictor being configured for predicting a phoneme duration under a constraint by a note beat; a length regulator, for expanding the first vector representation to a second vector representation based on the phoneme duration of the first phoneme; and a spectrum decoder, for generating the fundamental frequency residual and the spectral parameters corresponding to the first phoneme based at least on the second vector representation.

14. The apparatus of claim 13, wherein training data for the duration predictor at least comprises a reference phoneme duration of each reference phoneme and a beat of each reference note extracted from a reference audio, and the duration predictor is trained based at least on a loss function for calculating a difference between: a sum of a plurality of phoneme durations predicted by the duration predictor for a plurality of reference phonemes corresponding to a reference note; and a beat of the reference note.

15. An apparatus for singing voice synthesis, comprising: at least one processor; and a memory storing computer-executable instructions that, when executed, cause the at least one processor to: receive a first music score phoneme information extracted from a music score, the first music score phoneme information comprising a first phoneme, and a pitch and a beat of a note corresponding to the first phoneme, generate a fundamental frequency residual and spectral parameters corresponding to the first phoneme based on the first music score phoneme information, obtain a fundamental frequency corresponding to the first phoneme through regulating the pitch of the note with the fundamental frequency residual, and generate an acoustic waveform corresponding to the first phoneme based at least in part on the fundamental frequency and the spectral parameters.

Description:
SINGING VOICE SYNTHESIS

BACKGROUND

[0001] Singing Voice Synthesis (SVS) is a technique for generating virtual singing voices based on a music score including information of, e.g., lyrics, tempo, pitch, etc. The singing voice synthesis may comprise predicting acoustic features based on music scores, and then generating speech waveforms based on the acoustic features. The singing voice synthesis aims to automatically generate singing voices that simulate real human singing voices.

SUMMARY

[0002] This Summary is provided to introduce a selection of concepts that are further described below in the Detailed Description. It is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.

[0003] Embodiments of the present disclosure provide methods and apparatuses for singing voice synthesis. First music score phoneme information extracted from a music score may be received, the first music score phoneme information comprising a first phoneme, and a pitch and a beat of a note corresponding to the first phoneme. A fundamental frequency residual and spectral parameters corresponding to the first phoneme may be generated based on the first music score phoneme information. A fundamental frequency corresponding to the first phoneme may be obtained through regulating the pitch of the note with the fundamental frequency residual. An acoustic waveform corresponding to the first phoneme may be generated based at least in part on the fundamental frequency and the spectral parameters.

[0004] It should be noted that the above one or more aspects comprise the features hereinafter fully described and particularly pointed out in the claims. The following description and the drawings set forth in detail certain illustrative features of the one or more aspects. These features are only indicative of the various ways in which the principles of various aspects may be employed, and this disclosure is intended to include all such aspects and their equivalents.

BRIEF DESCRIPTION OF THE DRAWINGS

[0005] The disclosed aspects will hereinafter be described in connection with the appended drawings that are provided to illustrate and not to limit the disclosed aspects.

[0006] FIG.1 illustrates an existing exemplary TTS system architecture.

[0007] FIG.2 illustrates an exemplary process of parsing a music score according to an embodiment of the present invention.

[0008] FIG.3 illustrates an exemplary SVS system architecture according to an embodiment of the present invention.

[0009] FIG.4 illustrates an exemplary process of generating a music score according to an embodiment of the present invention.

[0010] FIG.5 illustrates an exemplary architecture of a music score encoder according to an embodiment of the present invention.

[0011] FIG.6 illustrates an exemplary architecture of a spectrum decoder according to an embodiment of the present invention.

[0012] FIG.7 illustrates an exemplary application scenario of singing voice synthesis according to an embodiment of the present invention.

[0013] FIG.8 illustrates an exemplary process of performing singing voice synthesis based on a music score according to an embodiment of the present invention.

[0014] FIG.9 illustrates an exemplary training process of an acoustic feature predictor according to an embodiment of the present invention.

[0015] FIG.10 illustrates a flowchart of an exemplary method for singing voice synthesis according to an embodiment of the present invention.

[0016] FIG.11 illustrates a diagram of an exemplary apparatus for singing voice synthesis according to an embodiment of the present invention.

[0017] FIG.12 illustrates a diagram of an exemplary apparatus for singing voice synthesis according to an embodiment of the present invention.

DETAILED DESCRIPTION

[0018] The present disclosure will now be discussed with reference to several example implementations. It is to be understood that these implementations are discussed only for enabling those skilled in the art to better understand and thus implement the embodiments of the present disclosure, rather than suggesting any limitations on the scope of the present disclosure.

[0019] Traditional singing voice synthesis techniques employ approaches such as waveform unit concatenation or statistical parametric synthesis to simulate human singing voices. However, there still exists a large gap between the quality of synthesized singing voices and that of human recordings. Recently, deep learning-based models, e.g., deep neural networks (DNN), long short-term memory (LSTM), etc., have been introduced into the SVS field. In an approach, a WaveNet model is proposed to predict acoustic features, wherein duration, pitch and spectral models are trained independently. In an approach, a deep auto-regressive neural network is proposed to model fundamental frequency F0 and spectral features. In addition, an adversarially-trained end-to-end SVS system which is based on an auto-regressive model is proposed. The auto-regressive model has forward dependency. In an approach, a strategy for post-processing predicted fundamental frequency based on note pitch is proposed to ensure the fundamental frequency is in tune. However, there is still a need for a singing voice synthesis model leading to synthesized singing voices with high naturalness, fast processing speed, and good audio quality.

[0020] Singing voice synthesis techniques proposed by embodiments of the present disclosure at least partially refer to a Text-to-Speech (TTS) model, e.g., Fastspeech. Fastspeech is a feed-forward network which is based on the transformer architecture and generates Mel spectrograms in parallel. The embodiments of the present disclosure may modify the Fastspeech model to adapt to the SVS task of generating singing voices based on music scores. Unlike plain text, a music score comprises information associated with lyrics and notes, wherein a note has a corresponding beat and pitch.

[0021] In an aspect, considering that note beats may bring rhythmic feeling and enhance auditory experience, and that the human brain is more sensitive to the rhythm of singing voices, the embodiments of the present disclosure consider not only phoneme-level duration alignment but also note-level duration alignment during the process of predicting acoustic features, so that the finally generated singing voices can be more in line with note beats, thereby enhancing rhythmic feeling and bringing a smoother auditory experience.

[0022] In an aspect, the embodiments of the present disclosure propose a residual connection between an input note pitch and an output fundamental frequency, so that an acoustic feature prediction model only needs to predict a deviation, or a fundamental frequency residual, from a note pitch in a music score. This can not only overcome the difficulty of covering all pitch ranges with training data and avoid the need to perform data augmentation by applying pitch shifts to the training data, but can also flexibly enhance dynamic ranges of the fundamental frequency, e.g., vibrato, to convey emotions more expressively.

[0023] In an aspect, the embodiments of the present disclosure may implement flexible switching among singers' voices and/or singing styles based on settings of different singers' voices and/or different singing styles. For example, the embodiments of the present disclosure may implement synthesizing singing voices with a specified singer's voice, synthesizing singing voices with a specified singing style, etc.

[0024] Although the following discussions of the embodiments of the present disclosure are directed to singing voice synthesis, it should be understood that the innovative concepts of the present disclosure may be applied to any other scenarios in the field of speech synthesis in a similar approach.

[0025] FIG.1 illustrates an existing exemplary TTS system architecture. The TTS system architecture may comprise a speech synthesizer 100 and a phoneme extraction module 110.

[0026] The phoneme extraction module 110 may be used for extracting phoneme data from a text 102 one by one, and providing the extracted phoneme data to the speech synthesizer 100. The text 102 may be recognized from any types of data, e.g., electronic documents, images, symbol data, etc. The phoneme data includes names of phonemes. A phoneme is the smallest phonetic unit that constitutes a syllable. Generally, a syllable may be divided into multiple phonemes. For example, in Chinese, a Chinese character is a syllable, and a Chinese character may be divided into, e.g., 3 phonemes. As an example, the combination of the initial consonant and the final sound of the syllable of the Chinese character "我" is "woo", and this Chinese character may be decomposed into three phonemes, e.g., "w", "o", and "o". "w" is the phoneme name of the first phoneme, and so on. The number of phonemes into which a syllable is divided may be referred to as phoneme granularity. The larger the phoneme granularity, the more distinct phoneme combinations may constitute a syllable. Continuing the previous example, in the case of decomposing the Chinese character into three phonemes, the phoneme granularity is 3.

[0027] The speech synthesizer 100 is used for converting the phoneme data from the phoneme extraction module 110 into a speech waveform 104 characterizing a virtual speech corresponding to the phoneme data. The speech synthesizer 100 comprises an acoustic feature predictor 120 and a vocoder 130.

[0028] The acoustic feature predictor 120 is configured for predicting acoustic feature parameters corresponding to the phoneme data based on the phoneme data. The acoustic feature parameters may comprise Mel spectrogram parameters, fundamental frequencies, etc. The acoustic feature predictor 120 comprises a phoneme encoder 122, a duration predictor 124, a length regulator 126, and a spectrum decoder 128. The phoneme encoder 122 is configured for encoding the phoneme data from the phoneme extraction module 110 into a corresponding phoneme side vector representation. The duration predictor 124 may predict phoneme duration associated with the phoneme data based on the phoneme side vector representation. The phoneme duration characterizes a length of a frequency spectrum corresponding to the phoneme in time. The phoneme duration may be, e.g., in units of time frames of audio. The duration predictor 124 considers phoneme duration in a real human speech during prediction, to provide a more accurate prediction result. The length regulator 126 expands the phoneme side vector representation to a spectrum side vector representation according to the phoneme duration predicted by the duration predictor 124, so as to adapt to subsequent spectrum prediction processing. The spectrum side vector representation is provided to the spectrum decoder 128. The spectrum decoder 128 generates acoustic feature parameters corresponding to the spectrum side vector representation based on the received spectrum side vector representation.

[0029] The vocoder 130 may convert the acoustic feature parameters generated by the spectrum decoder 128 into the speech waveform 104. The vocoder 130 may be a vocoder that generates a speech waveform based on Mel spectrogram parameters, e.g., WaveGlow, Griffin-Lim, WaveNet, etc.

[0030] Although the speech synthesizer 100 shown in FIG.1 can synthesize virtual speech based on the input text 102, the speech synthesizer 100 cannot be directly used for synthesizing virtual singing voices based on a music score, because a music score usually comprises not only lyrics in a text form, but also various note information.

[0031] FIG.2 illustrates an exemplary process 200 of parsing a music score according to an embodiment of the present invention. In FIG.2, a segment of a Chinese song is taken as an example for illustrating the process of parsing a music score.

[0032] A music score 210 comprises the Chinese lyrics "我和你" and corresponding notes. It may also be learned from the music score that the tempo is specified as 120 beats per minute. In other words, the duration of each beat is 0.5 seconds. Through parsing the music score 210, each syllable in the lyrics may be divided into multiple phonemes, and a pitch and a beat of a note corresponding to each phoneme may be obtained.

[0033] The lyrics in the music score 210 comprise 3 syllables “woo”, “hee”, and “nii”, and each syllable may be exemplarily divided into 3 phonemes. For example, the syllable “woo” may be divided into 3 phonemes "w", "o", and "o". The first phoneme "w" corresponds to an initial consonant of the syllable “woo”, and the second phoneme and the third phoneme both correspond to a final sound of the syllable “woo”. Accordingly, the phonemes included in the syllables “woo”, “hee”, and “nii” form a phoneme sequence [w, o, o, h, e, e, n, i, i].

[0034] A note of each phoneme may be determined. For example, if a syllable corresponds to a note, it may be determined that all the multiple phonemes of the syllable correspond to that note. Taking the syllable “woo” as an example, in the music score 210, the syllable “woo” corresponds to a note 211, thus all the three phonemes "w", "o", and "o" of the syllable “woo” correspond to the note 211.

[0035] Since a pitch of the note 211 is "C4", the phonemes "w", "o", and "o" may be all marked with the pitch "C4" accordingly. Alternatively, pitches may also be quantified according to specific musical standard specifications. For example, according to the MIDI standard specification, the pitch C4 may be quantized as the number 60. Therefore, the phonemes "w", "o" and "o" may also be marked with a pitch "60".

[0036] According to the music score 210, a beat of the note 211 is "1", thus the phonemes "w", "o", and "o" may be all marked with beat "1". Alternatively, beats may be quantified in terms of time, e.g., 1 beat corresponds to 0.5 second. Therefore, the phonemes "w", "o", and "o" may be all marked with a beat "0.5" in units of seconds. Moreover, alternatively, beats may be quantified in terms of the number of frames, e.g., 1 beat corresponds to 33 frames. Therefore, the phonemes "w", "o", and "o" may be all marked with a beat "33" in units of frames.

[0037] In the process of parsing a music score in FIG.2, the values of pitch and beat are duplicated according to the number of phonemes included in the syllable. For example, since the syllable “woo” includes 3 phonemes, all the corresponding pitch values ("C4" or "60") and beat values ("1", "0.5", or "33") are duplicated three times, and are associated with the 3 phonemes respectively. Herein, music score phoneme information may comprise various information associated with phonemes in a music score, e.g., phoneme name, pitch, beat, etc. The phoneme name is a phoneme indicator. For example, "w" is the phoneme name of the first phoneme in the syllable “woo”. The pitch refers to a pitch of a note corresponding to a phoneme. The beat refers to a beat of a note corresponding to a phoneme. It should be understood that the above description of the music score phoneme information is only for the purpose of examples; in practice, the music score phoneme information may comprise any information indicating phoneme name, pitch, and beat.
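For illustration only, the parsing of FIG.2 can be sketched in a few lines of Python. This is a minimal sketch, not the parser of the disclosure: the syllable-to-phoneme table, the use of D4/E4 for the second and third notes, the 120 BPM tempo, and the frame rate (chosen so that 1 beat corresponds to roughly 33 frames) are all assumptions taken from or extrapolating the example above.

```python
# Minimal sketch of the parsing in FIG.2: split each syllable into phonemes and
# duplicate the note pitch/beat values for every phoneme of that syllable.
SYLLABLE_TO_PHONEMES = {"woo": ["w", "o", "o"],
                        "hee": ["h", "e", "e"],
                        "nii": ["n", "i", "i"]}

NOTE_TO_MIDI = {"C4": 60, "D4": 62, "E4": 64}  # MIDI-standard pitch quantization

def parse_score(syllables, notes, tempo_bpm=120, frames_per_second=66.7):
    """syllables: list of syllable strings; notes: list of (pitch_name, beats) tuples."""
    seconds_per_beat = 60.0 / tempo_bpm                 # 0.5 s per beat at 120 BPM
    phoneme_info = []
    for syllable, (pitch_name, beats) in zip(syllables, notes):
        duration_s = beats * seconds_per_beat
        duration_frames = round(duration_s * frames_per_second)   # ~33 frames per beat
        for phoneme in SYLLABLE_TO_PHONEMES[syllable]:
            # pitch and beat are duplicated for every phoneme of the syllable
            phoneme_info.append({"phoneme": phoneme,
                                 "pitch": NOTE_TO_MIDI[pitch_name],
                                 "beat_frames": duration_frames})
    return phoneme_info

# Example: the syllables "woo", "hee", "nii", each sung on a one-beat note.
info = parse_score(["woo", "hee", "nii"], [("C4", 1), ("D4", 1), ("E4", 1)])
print([p["phoneme"] for p in info])   # ['w', 'o', 'o', 'h', 'e', 'e', 'n', 'i', 'i']
```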

[0038] The parsing approach of a music score by the process 200 is only exemplary, and the embodiments of the present disclosure may also adopt any other parsing approaches to parse a music score.

[0039] It should be understood that although a Chinese song is taken as an example in FIG.2, the process 200 is also applicable to any other language. Different languages may have different basic phonemes. Taking English as an example, 48 English international phonetic symbols may be used as basic phonemes. For an exemplary English word "you", this word is a syllable with phonetic symbols "ju:", which may be divided into, e.g., 3 phonemes "j", "u:", and "u:". The first phoneme "j" corresponds to a consonant of the syllable, and both the second phoneme and the third phoneme "u:" correspond to a vowel of the syllable. After dividing syllables in lyrics in other languages into phonemes, each phoneme may be marked with information of, e.g., pitch, beat, etc. of a note, similar to the process 200. It should be understood that any processes of the embodiments of the present disclosure are not limited by specific language categories.

[0040] FIG.3 illustrates an exemplary SVS system architecture according to an embodiment of the present invention. The SVS system architecture may comprise a singing voice synthesizer 300, a music score parser 310, etc. It should be understood that although the music score parser 310 is shown in FIG.3 as independent of the singing voice synthesizer 300, the music score parser 310 may alternatively be included in the singing voice synthesizer 300 as a part of the singing voice synthesizer 300.

[0041] The music score parser 310 may extract music score phoneme information from a music score 302 on a phoneme-by-phoneme basis in an approach as described in FIG.2, and provide the extracted music score phoneme information to the singing voice synthesizer 300.

[0042] The singing voice synthesizer 300 is used for predicting an acoustic waveform 304 of virtual singing voices corresponding to the music score phoneme information based on the music score phoneme information from the music score parser 310. The singing voice synthesizer 300 may comprise an acoustic feature predictor 320, a pitch regulator 330, a vocoder 340, etc.

[0043] The SVS system architecture may comprise a voice encoder 350, which is configured for providing a voice vector representation based on a voice ID. The voice is an attribute of sound, which depends on overtones of sounds made by a person. The voice ID may be an indicator indicating an inherent voice of a specific singer, e.g., an index, the name of a singer, etc. One voice ID may uniquely correspond to the voice of one singer. In an implementation, when receiving a voice ID, the voice encoder 350 may generate a voice vector representation for characterizing the voice of a singer corresponding to the voice ID based on audio data of the singer. In an implementation, a voice vector representation corresponding to each voice ID may be generated in advance, and a voice ID and a corresponding voice vector representation may be stored in a voice database associated with the voice encoder 350. When receiving a voice ID, the voice encoder 350 may retrieve a voice vector representation corresponding to the voice ID from the voice database. It should be understood that the voice ID input to the voice encoder 350 may be provided by a user. For example, when a user wants to obtain a song sung with the voice of a specific singer, the user may provide a voice ID of the specific singer to the SVS system architecture.

[0044] The SVS system architecture may comprise a style encoder 360, which is configured for providing a style vector representation based on a singing style ID. The singing style may indicate singing approaches adopted by a singer to sing a song, e.g., vocalization approach, vocalization skills, etc. The singing style may be associated with phoneme durations and/or fundamental frequencies corresponding to phonemes. For example, different singers may have different vocalization duration habits for initial consonants or final sounds when singing, thus resulting in different vocalization approaches. Moreover, for example, if a singer uses a vocalization skill, e.g., vibrato, the fundamental frequency will also reflect corresponding characteristics. The singing style ID may be an indicator indicating a specific singing style, e.g., an index, the name of a singing style, etc. In some cases, a singing style may refer to the type of a song, e.g., rock song, folk song, etc. In some cases, a singing style may refer to a singing approach of a specific singer. In an implementation, when receiving a singing style ID, the style encoder 360 may generate a style vector representation for characterizing a singing style corresponding to the singing style ID based on audio data of the singing style. In an implementation, a style vector representation corresponding to each singing style ID may be generated in advance, and a singing style ID and a corresponding style vector representation may be stored in a style database associated with the style encoder 360. When receiving a singing style ID, the style encoder 360 may retrieve a style vector representation corresponding to the singing style ID from the style database. It should be understood that the singing style ID input to the style encoder 360 may be provided by a user. For example, when a user wants to obtain a song in a specific singing style, the user may provide a singing style ID of this specific singing style to the SVS system architecture.
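For illustration only, one minimal way to realize the ID-to-vector lookup described for the voice encoder 350 and the style encoder 360 is a learned embedding table. The sketch below assumes PyTorch; the table size and vector dimension are placeholder values, not values from the disclosure.

```python
import torch
import torch.nn as nn

class IDEncoder(nn.Module):
    """Sketch of a voice/style encoder that maps an ID index to a fixed vector.
    num_ids and dim are arbitrary placeholder values."""
    def __init__(self, num_ids=100, dim=256):
        super().__init__()
        self.table = nn.Embedding(num_ids, dim)   # plays the role of the voice/style database

    def forward(self, id_index: torch.Tensor) -> torch.Tensor:
        return self.table(id_index)

voice_encoder = IDEncoder()
style_encoder = IDEncoder()
voice_vec = voice_encoder(torch.tensor([3]))   # voice ID 3 -> voice vector representation
style_vec = style_encoder(torch.tensor([7]))   # style ID 7 -> style vector representation
```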

[0045] Although the voice encoder 350 and the style encoder 360 are shown outside the singing voice synthesizer 300 in FIG.3, this is only for the purpose of clarity and examples. It should be understood that the voice encoder 350 and/or the style encoder 360 may also be embedded into the singing voice synthesizer 300 or the acoustic feature predictor 320. Alternatively, in an implementation, the voice encoder 350 and the style encoder 360 may output a fixed voice vector representation and a fixed style vector representation, instead of depending on inputs of voice ID and style ID. Thus, the generated singing voices will adopt a voice corresponding to the fixed voice vector representation and a singing style corresponding to the fixed style vector representation. Moreover, alternatively, either or both of the voice encoder 350 and the style encoder 360 may also be omitted in the SVS system architecture of FIG.3.

[0046] The acoustic feature predictor 320 is configured for predicting acoustic feature parameters corresponding to the music score phoneme information based on the music score phoneme information, the possible voice vector representation from the voice encoder 350, and the possible style vector representation from the style encoder 360. The acoustic feature parameters may comprise spectral parameters, fundamental frequency residual, etc. The spectral parameters may be Mel Generalized Cepstrum (MGC) parameters, Band Aperiodicity (BAP) parameters, Mel spectrogram parameters, etc. The acoustic feature predictor 320 may comprise a music score encoder 322, a vector combination module 324, a duration predictor 326, a length regulator 328, a spectrum decoder 329, etc.

[0047] The music score encoder 322 is configured for encoding the music score phoneme information from the music score parser 310 into a corresponding phoneme side vector representation. The music score encoder 322 may generate a phoneme side vector representation in a non-autoregressive approach. In an implementation, the music score encoder 322 may comprise a feed-forward neural network structure which is based on the self-attention mechanism in a transformer and one-dimensional (1D) convolution. The phoneme side vector representation may be a hidden state that is related to the music score phoneme information and generated by the feed-forward neural network structure. The music score phoneme information may be provided to the music score encoder 322 on a phoneme-by-phoneme basis, so that the music score encoder 322 may generate the phoneme side vector representations on a phoneme-by-phoneme basis. Taking the music score shown in FIG.2 as an example, for the first phoneme “w” of the syllable “woo”, the music score encoder 322 may generate a phoneme side vector representation corresponding to the phoneme “w” based on the phoneme information of the phoneme “w”. Similarly, the music score encoder 322 may further generate a phoneme side vector representation for the second phoneme "o" of the syllable "woo", a phoneme side vector representation for the third phoneme "o" of the syllable "woo", etc. Assuming that a set of all the phonemes is represented as a phoneme sequence, a phoneme side vector representation sequence corresponding to the phoneme sequence may be obtained through the music score encoder 322, which may also be referred to as a hidden state sequence. Taking the syllable "woo" as an example, the three phonemes "w", "o", and "o" of the syllable form a phoneme sequence [w, o, o], and the music score encoder 322 may generate a hidden state sequence Hpho = [h1, h2, h3] corresponding to the phoneme sequence based on the phoneme information of the three phonemes, wherein h1 is a hidden state corresponding to the first phoneme "w", h2 is a hidden state corresponding to the second phoneme "o", and h3 is a hidden state corresponding to the third phoneme "o". Exemplary architecture of the music score encoder 322 will be described in detail later in conjunction with FIG.5.

[0048] As described above, the acoustic feature predictor 320 will finally predict spectral parameters corresponding to each phoneme, and multiple spectral parameters corresponding to multiple phonemes will form a spectrum sequence. Since each phoneme generally corresponds to multiple time frames and thus multiple segments of spectrum, the length of the spectrum sequence will be longer than the length of its corresponding phoneme sequence. The length regulator 328 may up-sample the phoneme sequence according to a phoneme duration predicted by the duration predictor 326, to match the length of the spectrum sequence.

[0049] Since a singing style of a song is related to a phoneme duration, when the acoustic feature predictor 320 is expected to be capable of generating acoustic feature parameters according to a specified singing style ID, the vector combination module 324 may be used for combining the style vector representation output by the style encoder 360 and the phoneme side vector representation from the music score encoder 322 to obtain a phoneme side combined vector representation. In an implementation, the combining operation may refer to performing vector concatenation to the phoneme side vector representation and the style vector representation, and thus the dimension of the resulting phoneme side combined vector representation will be the sum of the dimension of the phoneme side vector representation and the dimension of the style vector representation. In an implementation, the combining operation may refer to performing vector summation to the phoneme side vector representation and the style vector representation. In this case, the phoneme side vector representation, the style vector representation, and the phoneme side combined vector representation will all have the same dimension. The phoneme side combined vector representation containing singing style information may be input to the duration predictor 326.
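For illustration only, the two combining options described for the vector combination module 324 differ mainly in the dimension of their output. The lines below sketch both, assuming PyTorch tensors and an arbitrary dimension of 256.

```python
import torch

phoneme_vec = torch.randn(1, 256)   # phoneme side vector representation (assumed dim)
style_vec = torch.randn(1, 256)     # style vector representation (assumed dim)

# Option 1: concatenation -> dimension is the sum of the two dimensions (here 512).
combined_concat = torch.cat([phoneme_vec, style_vec], dim=-1)

# Option 2: summation -> all three representations keep the same dimension (here 256).
combined_sum = phoneme_vec + style_vec
```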

[0050] The duration predictor 326 may predict a phoneme duration associated with a phoneme based on a phoneme side combined vector representation for the phoneme. A set of multiple phoneme side combined vector representations provided to the duration predictor 326 may be represented as a hidden state sequence, e.g., Hpho = [h1, h2, h3]. The duration predictor 326 may predict a corresponding phoneme duration sequence D = [d1, d2, d3], wherein d1, d2, and d3 represent predicted phoneme durations corresponding to the hidden states h1, h2, and h3 respectively. As an example, when the phoneme duration predicted by the duration predictor 326 for the hidden state h1 is 3 frames, the value of d1 is 3. Unlike the duration predictor 124 in FIG.1, the duration predictor 326 considers, during predicting, not only phoneme duration in real human singing voices, but also standard beats of notes associated with phonemes in a music score, so that a prediction result by the duration predictor 326 facilitates achieving virtual singing voices with more rhythmic feeling and a smoother auditory experience. Meanwhile, since the duration predictor 326 performs prediction based on the phoneme side combined vector, and the phoneme side combined vector contains information about the singing style, it may be construed that the prediction of the phoneme duration by the duration predictor 326 is based at least on the singing style. An exemplary training process of the duration predictor 326 will be described in detail later in conjunction with FIG.9.
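For illustration only, the note-beat constraint can be expressed as two training losses consistent with claims 4 and 5: one compares each predicted phoneme duration with its reference duration, and the other compares the sum of the predicted durations of a note's phonemes with the note's beat. The sketch below assumes PyTorch, MSE losses, and an arbitrary loss weight; it illustrates the two-loss idea and is not the training procedure of FIG.9.

```python
import torch
import torch.nn.functional as F

def duration_losses(pred_durations, ref_durations, note_index, note_beats, beat_weight=1.0):
    """pred_durations, ref_durations: [num_phonemes] durations in frames.
    note_index: [num_phonemes] index of the note each phoneme belongs to.
    note_beats: [num_notes] note beats expressed in frames."""
    # First loss: per-phoneme difference between predicted and reference duration.
    phoneme_loss = F.mse_loss(pred_durations, ref_durations)

    # Second loss: per-note difference between the summed predicted durations
    # of the note's phonemes and the beat of that note.
    summed = torch.zeros(note_beats.shape[0]).index_add_(0, note_index, pred_durations)
    beat_loss = F.mse_loss(summed, note_beats)

    return phoneme_loss + beat_weight * beat_loss

# Example: three phonemes of one note whose beat corresponds to 33 frames.
pred = torch.tensor([10.0, 12.0, 9.0])
ref = torch.tensor([8.0, 14.0, 11.0])
loss = duration_losses(pred, ref, torch.tensor([0, 0, 0]), torch.tensor([33.0]))
```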

[0051] The length regulator 328 is configured for adjusting or expanding the phoneme side combined vector representation of a phoneme to a spectrum side vector representation according to a phoneme duration predicted by the duration predictor 326 for the phoneme, so as to be suitable for the subsequent spectrum prediction processing. For ease of explanation, according to the above example, a set of multiple phoneme side combined vector representations received by the length regulator 328 is represented as a hidden state sequence Hpho = [h1, h2, h3], and a set of multiple phoneme durations predicted by the duration predictor 326 based on these phoneme side combined vector representations is represented as a predicted phoneme duration sequence D = [d1, d2, d3]. The length regulator 328 may expand the hidden state of each phoneme by its predicted duration, i.e., h1 is repeated d1 times, h2 is repeated d2 times, and h3 is repeated d3 times. The spectrum side vector representation sequence Hspec obtained by the length regulator 328 may be calculated as:

Hspec = LR(Hpho, D, α)    Equation (1)

wherein LR represents the processing by the length regulator, and α is a hyper-parameter that decides the length of the Hspec sequence obtained through expansion and thus can control the speed of the singing voices. Given Hpho = [h1, h2, h3] and the corresponding predicted phoneme duration sequence D = [2, 2, 3], in the case of α = 1, then Hspec = [h1, h1, h2, h2, h3, h3, h3]. In the case of the parameter α = 1.3, i.e., in the case of slow speed, the phoneme duration sequence D is updated to D(α=1.3) = [2.6, 2.6, 3.9] ≈ [3, 3, 4], then Hspec = [h1, h1, h1, h2, h2, h2, h3, h3, h3, h3]. In the case of the parameter α = 0.5, i.e., in the case of fast speed, the phoneme duration sequence D is updated to D(α=0.5) = [1, 1, 1.5] ≈ [1, 1, 2], then Hspec = [h1, h2, h3, h3].

[0052] The spectrum decoder 329 receives the spectrum side vector representation corresponding to one phoneme from the length regulator 328, and generates corresponding acoustic feature parameters based at least on the spectrum side vector representation, e.g., spectral parameters, fundamental frequency residual, etc. Alternatively, the process of generating the acoustic feature parameters by the spectrum decoder 329 may be further based on the possible voice vector representation from the voice encoder 350. Since a voice is associated with spectral parameters, the generation of spectral parameters may be further based on a voice of a target singer characterized by a voice vector representation. In another aspect, since a singing style is associated with a fundamental frequency and thus with a fundamental frequency residual output by the spectrum decoder 329, and a spectrum side vector representation contains singing style information, it may be construed that the generation of the fundamental frequency residual is further based on a singing style characterized by a style vector representation. Moreover, since the spectrum side vector representation containing singing style information is also used by the spectrum decoder 329 for generating spectral parameters, it may be construed that the generation of the spectral parameters is further based on a singing style characterized by a style vector representation. The spectrum decoder 329 may generate the acoustic feature parameters in a non-autoregressive approach. In an implementation, the spectrum decoder 329 may comprise a feed-forward neural network structure which is based on the self-attention mechanism in a transformer and one-dimensional convolution.
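Returning to Equation (1), the expansion performed by the length regulator can be sketched directly: each hidden state is repeated according to its predicted duration scaled by α. The code below is a minimal illustration assuming PyTorch tensors; rounding of the scaled durations follows the example above.

```python
import torch

def length_regulator(h_pho: torch.Tensor, durations: torch.Tensor, alpha: float = 1.0) -> torch.Tensor:
    """h_pho: [num_phonemes, dim] hidden states; durations: [num_phonemes] predicted
    phoneme durations in frames. Returns the expanded spectrum side sequence Hspec."""
    scaled = torch.round(durations.float() * alpha).long()    # D scaled by alpha, rounded to frames
    return torch.repeat_interleave(h_pho, scaled, dim=0)      # repeat each h_i d_i times

# Reproducing the example above with D = [2, 2, 3] and a 4-dimensional toy hidden state.
h = torch.stack([torch.full((4,), 1.0), torch.full((4,), 2.0), torch.full((4,), 3.0)])
print(length_regulator(h, torch.tensor([2, 2, 3]), alpha=1.0).shape[0])   # 7 frames
print(length_regulator(h, torch.tensor([2, 2, 3]), alpha=1.3).shape[0])   # 10 frames (slower)
print(length_regulator(h, torch.tensor([2, 2, 3]), alpha=0.5).shape[0])   # 4 frames (faster)
```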

[0053] Frequency coverage by human singing voices is much wider than that of normal speech, e.g., the frequency coverage of human singing voices may be 80Hz-3400Hz. Moreover, human singers may adopt dynamic singing skills when singing songs, which will further lead to changes in frequency. Therefore, the frequency coverage of different songs has large variability, which makes it difficult to completely cover all pitch ranges with training data for traditional acoustic feature predictors. Moreover, deviations of note pitches in singing voices relative to standard pitches, e.g., out-of-tune singing, will greatly affect the auditory experience of the singing voices. Therefore, the embodiments of the present disclosure introduce, in the singing voice synthesizer 300, a residual between a pitch in the input music score phoneme information and an output fundamental frequency.

[0054] The spectrum decoder 329 in the acoustic feature predictor 320 may be trained for predicting a fundamental frequency residual. The fundamental frequency residual indicates a deviation between: a standard fundamental frequency corresponding to a standard pitch of the current phoneme parsed from the music score 302 by the music score parser 310; and a fundamental frequency corresponding to the phoneme that is to be used in the synthesized singing voice. Since the acoustic feature predictor 320 only needs to predict the fundamental frequency residual, instead of the fundamental frequency itself, the acoustic feature predictor 320 does not require training data to cover all pitch ranges. In an implementation, the fundamental frequency residual may be set to be no higher than a semitone, so as to avoid an out-of-tune issue in the synthesized singing voice. An exemplary architecture of the spectrum decoder 329 will be described in detail later in conjunction with FIG.6, and an exemplary training process of the spectrum decoder 329 will be described in detail in conjunction with FIG.9.

[0055] The pitch regulator 330 is configured for adjusting the standard pitch of the current phoneme from the music score parser 310 with the fundamental frequency residual output by the spectrum decoder 329 for the current phoneme, so as to generate a fundamental frequency that is to be adopted for the current phoneme in the synthesized singing voice. In an implementation, the pitch regulator 330 may be an adder which adds the standard fundamental frequency corresponding to the standard pitch of the current phoneme and the fundamental frequency residual from the spectrum decoder 329, to generate the fundamental frequency that is to be adopted.
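For illustration only, the pitch regulation step is a simple addition around the note's standard fundamental frequency. The sketch below converts a quantized MIDI pitch to its standard fundamental frequency and adds a residual; the disclosure does not specify whether the residual is expressed in Hz or in semitones, so treating it as a semitone offset clamped to one semitone is an assumption based on the limit mentioned above.

```python
import math

def midi_to_hz(midi_pitch: float) -> float:
    """Standard fundamental frequency of a quantized MIDI pitch (A4 = 69 = 440 Hz)."""
    return 440.0 * 2.0 ** ((midi_pitch - 69) / 12.0)

def regulate_pitch(midi_pitch: float, residual_semitones: float) -> float:
    """Sketch of the pitch regulator: add the predicted fundamental frequency residual
    to the note's standard pitch. The residual is assumed to be in semitones and is
    limited to one semitone, per the limit mentioned above."""
    residual_semitones = max(-1.0, min(1.0, residual_semitones))
    return midi_to_hz(midi_pitch + residual_semitones)

# Example: C4 (MIDI 60, about 261.6 Hz) with a slight upward deviation.
print(regulate_pitch(60, 0.2))   # ~264.7 Hz
```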

[0056] The vocoder 340 may generate a corresponding acoustic waveform 304 based on the fundamental frequency from the pitch regulator 330 and the spectral parameters generated by the spectrum decoder 329. The vocoder 340 may be any type of vocoder, e.g., a vocoder that generates acoustic waveforms based on Mel Generalized Cepstrum parameters, such as a WORLD vocoder.

[0057] It should be understood that although the above discussion involves generating acoustic waveforms with the singing voice synthesizer on a phoneme-by-phoneme basis, since there is no dependency among the processing of different phonemes by the SVS system architecture in FIG.3, the singing voice synthesizer may also be deployed to process multiple phonemes in parallel, and thus acoustic waveforms may be generated in parallel.

[0058] It should be understood that although FIG.3 shows that the style vector representation is provided to the acoustic feature predictor 320 by the style encoder 360, alternatively, when the singing voice synthesizer 300 does not need to adopt a specified singing style to synthesize singing voices, the acoustic feature predictor 320 does not need to receive the style vector representation and thus may not comprise the vector combination module 324. In this case, the duration predictor 326 may directly predict the phoneme duration based on the phoneme side vector representation output by the music score encoder 322, and the length regulator 328 may expand the phoneme side vector representation output by the music score encoder 322 according to the phoneme duration.

[0059] FIG.4 illustrates an exemplary process of generating a music score according to an embodiment of the present invention. In FIG.4, a music score generator 410 may generate a music score 420 based on various types of music score data related to information about a music score.

[0060] In one case, the music score data may be image music score data 402. The image music score data 402 presents information about a music score in the form of image, e.g., a photo of a music score, etc. The music score generator 410 may comprise an image music score recognition module 412, which may recognize the music score 420 from the image music score data 402 through any existing image recognition techniques. In one case, the music score data may be audio music data 404. The audio music data 404 presents information about a music score in the form of audio, e.g., an audio of a song, etc. The music score generator 410 may comprise an audio music score recognition module 414, which may recognize the music score 420 from the audio music data 404 through any existing audio parsing techniques. In one case, the music score data may be symbolic music score data 406. The symbolic music score data 406 presents information about a music score in the form of symbol following a predetermined standard or format, e.g., a music score file in the MIDI format, etc. The music score generator 410 may comprise a symbolic music score recognition module 416, which may recognize the music score 420 from the symbolic music score data 406 based on a predetermined standard or format. In one case, the music score data may be text music score data 408. The text music score data 408 presents information about a music score in the form of text, e.g., a music score file in the format of text, etc. The music score generator 410 may comprise a text music score recognition module 418, which may recognize the music score 420 from the text music score data 408 through any existing text recognition techniques.

[0061] Moreover, the music score generator 410 may also construct a complete music score by combining information in different types of music score data. For example, assuming that note information is recognized with high confidence from the image music score data 402 and lyrics information is recognized with high confidence from the audio music data 404, then the note information and the lyrics information may be combined to form a complete music score.

[0062] FIG.5 illustrates an exemplary architecture of a music score encoder 520 according to an embodiment of the present invention. The music score encoder 520 may correspond to the music score encoder 322 in FIG.3.

[0063] The music score encoder 520 may comprise a phoneme embedding module 522, a beat embedding module 524, a pitch embedding module 526, a position encoding module 528, a plurality of stacked feed-forward transformer (FFT) modules 530-532, etc. Although only two FFT modules are shown in FIG.5, it should be understood that this is only for exemplary purposes, and the music score encoder 520 may comprise more or fewer FFT modules.

[0064] Referring to the description of FIG.3, the music score encoder 520 receives music score phoneme information 510 from the music score parser 310. The music score phoneme information 510 includes a phoneme name 512, and a note beat 514 and a note pitch 516 of a note corresponding to the phoneme name 512. The phoneme name 512, the note beat 514, and the note pitch 516 are input to the phoneme embedding module 522, the beat embedding module 524, and the pitch embedding module 526 respectively. The phoneme embedding module 522 may perform an embedding process on the phoneme name 512 to generate a phoneme embedding vector. The beat embedding module 524 may perform an embedding process on the note beat 514 to generate a beat embedding vector. The pitch embedding module 526 may perform an embedding process on the note pitch 516 to generate a pitch embedding vector. The phoneme embedding vector, the beat embedding vector, and the pitch embedding vector may have the same dimension. The position encoding module 528 may sum up the phoneme embedding vector, the beat embedding vector, and the pitch embedding vector together with position encoding to obtain a sum vector. The sum vector is passed to the stacked FFT module 530 and FFT module 532 to obtain the final encoded output, i.e., the phoneme side vector representation 540. In an implementation, the FFT module may comprise a self-attention network and a 1D convolutional network with ReLU activation. The self-attention network may comprise a multi-head attention mechanism for extracting cross-position information.
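For illustration only, a minimal PyTorch sketch of such an encoder is given below. The layer sizes, vocabulary sizes, number of FFT blocks, and the sinusoidal positional encodings added to the summed embeddings are assumptions; the FFTBlock here is a simplified stand-in for the self-attention-plus-1D-convolution block described above.

```python
import torch
import torch.nn as nn

class FFTBlock(nn.Module):
    """Sketch of a feed-forward transformer block: self-attention + 1D convolution."""
    def __init__(self, dim=256, heads=2, kernel_size=3):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.conv = nn.Sequential(
            nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2),
            nn.ReLU(),
            nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2),
        )
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):                                   # x: [batch, seq, dim]
        a, _ = self.attn(x, x, x)
        x = self.norm1(x + a)
        c = self.conv(x.transpose(1, 2)).transpose(1, 2)    # 1D conv over the sequence axis
        return self.norm2(x + c)

class MusicScoreEncoder(nn.Module):
    """Sketch of the music score encoder 520; vocabulary sizes and dim are assumptions."""
    def __init__(self, num_phonemes=100, num_beats=64, num_pitches=128, dim=256, num_blocks=2):
        super().__init__()
        self.phoneme_emb = nn.Embedding(num_phonemes, dim)
        self.beat_emb = nn.Embedding(num_beats, dim)
        self.pitch_emb = nn.Embedding(num_pitches, dim)
        self.blocks = nn.ModuleList(FFTBlock(dim) for _ in range(num_blocks))
        self.dim = dim

    def positional_encoding(self, length):
        pos = torch.arange(length).unsqueeze(1).float()
        i = torch.arange(0, self.dim, 2).float()
        angle = pos / torch.pow(10000.0, i / self.dim)
        pe = torch.zeros(length, self.dim)
        pe[:, 0::2] = torch.sin(angle)
        pe[:, 1::2] = torch.cos(angle)
        return pe

    def forward(self, phonemes, beats, pitches):            # each: [batch, seq] of indices
        x = self.phoneme_emb(phonemes) + self.beat_emb(beats) + self.pitch_emb(pitches)
        x = x + self.positional_encoding(x.shape[1])
        for block in self.blocks:
            x = block(x)
        return x                                            # phoneme side vector representations

# Example: the phoneme sequence [w, o, o] of one syllable, all on MIDI pitch 60, one-beat notes.
encoder = MusicScoreEncoder()
h_pho = encoder(torch.tensor([[1, 2, 2]]), torch.tensor([[1, 1, 1]]), torch.tensor([[60, 60, 60]]))
print(h_pho.shape)   # torch.Size([1, 3, 256])
```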

[0065] FIG.6 illustrates an exemplary architecture of a spectrum decoder 620 according to an embodiment of the present invention. The spectrum decoder 620 may correspond to the spectrum decoder 329 in FIG.3.

[0066] The spectrum decoder 620 may comprise a vector combination module 622, a position encoding module 624, a plurality of stacked FFT modules 626-628, a linear layer 630, etc. Although only two FFT modules are shown in FIG.6, it should be understood that this is only for exemplary purposes, and the spectrum decoder 620 may comprise more or fewer FFT modules.

[0067] Referring to the description of FIG.3, the spectrum decoder 620 receives a spectrum side vector representation 612 from the length regulator 328 and a possible voice vector representation 614 from the voice encoder 350. The vector combination module 622 may combine the spectrum side vector representation 612 and the voice vector representation 614 to obtain a combined vector representation. In an implementation, the combining operation may refer to performing vector concatenation of the spectrum side vector representation 612 and the voice vector representation 614, and thus the dimension of a resulting combined vector representation will be the sum of the dimension of the spectrum side vector representation 612 and the dimension of the voice vector representation 614. In an implementation, the combining operation may refer to performing vector summation of the spectrum side vector representation 612 and the voice vector representation 614. In this case, the spectrum side vector representation 612, the voice vector representation 614, and the combined vector representation will all have the same dimension.

[0068] The position encoding module 624 performs position encoding on the combined vector representation from the vector combination module 622 to generate a position-encoded combined vector representation. The position-encoded combined vector representation is passed to the stacked FFT module 626 and FFT module 628. Similar to the FFT modules 530-532 in the music score encoder 520, the FFT modules 626-628 may comprise a self-attention network and a 1D convolutional network. The linear layer 630 may linearly transform an output vector representation from the last FFT module 628 to obtain a fundamental frequency residual 632 and spectral parameters 634. As described above, the voice vector representation 614 will affect at least the generation of the spectral parameters 634.
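For illustration only, and reusing the FFTBlock class from the encoder sketch above, a minimal spectrum decoder could look as follows. The output sizes (1 residual value plus 60 spectral parameters per frame), the use of summation as the combining operation, and the omission of position encoding are assumptions made to keep the sketch short.

```python
import torch
import torch.nn as nn

class SpectrumDecoder(nn.Module):
    """Sketch of the spectrum decoder 620 (FFTBlock as defined in the encoder sketch above).
    Output sizes and the summation-based combining operation are assumptions."""
    def __init__(self, dim=256, num_blocks=2, num_spectral_params=60):
        super().__init__()
        self.blocks = nn.ModuleList(FFTBlock(dim) for _ in range(num_blocks))
        self.linear = nn.Linear(dim, 1 + num_spectral_params)

    def forward(self, h_spec, voice_vec=None):   # h_spec: [batch, frames, dim]; voice_vec: [batch, 1, dim]
        if voice_vec is not None:                # combine by summation (same dimension)
            h_spec = h_spec + voice_vec
        for block in self.blocks:
            h_spec = block(h_spec)
        out = self.linear(h_spec)                # per-frame linear projection
        f0_residual, spectral_params = out[..., :1], out[..., 1:]
        return f0_residual, spectral_params
```

When no voice vector representation is provided (as discussed in the next paragraph), the combining step is simply skipped.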

[0069] It should be understood that although FIG.6 shows that the spectrum decoder 620 may obtain the voice vector representation 614, depending on actual application scenarios, the spectrum decoder 620 may not receive the voice vector representation 614, and thus the vector combination module 622 may be omitted.

[0070] FIG.7 illustrates an exemplary application scenario of singing voice synthesis according to an embodiment of the present invention. In this exemplary application scenario, a user may want to use a singing voice synthesizer 700 to synthesize a specific song that is sung based on a style of a singer A and a voice of a singer B.

[0071] The user may input a style ID indicating the singing style of the singer A. The style encoder 710 may provide a style vector representation corresponding to the singer A according to the style ID. At the same time, the user may input a voice ID indicating the voice of the singer B. The voice encoder 720 may provide a voice vector representation corresponding to the singer B according to the voice ID. The style vector representation corresponding to the singer A and the voice vector representation corresponding to the singer B are provided to the singing voice synthesizer 700 as parameters. At the same time, the user may input music score data of a specific song C. The music score generator 730 may generate a music score based on the music score data, and provide the music score to the singing voice synthesizer 700. The singing voice synthesizer 700 may correspond to the singing voice synthesizer 300 of FIG.3, and can synthesize a song C that is sung in the voice of the singer B and in the singing style of the singer A.

[0072] It should be understood that FIG.7 only shows an exemplary scenario to which the embodiments of the present disclosure can be applied, the application scenario may change according to specific application requirements, and the embodiments of the present disclosure may also be applied to a variety of other scenarios.

[0073] In an application scenario, in order to enable a user to synthesize a song by using his own singing style or voice, the user's own corpus may be obtained in advance, and the corpus may be used for training the style encoder and/or the voice encoder in order to obtain a style vector representation and/or a voice vector representation associated with the user. When the user wants to use his own singing style, the user may provide a "style ID" corresponding to himself, so that the singing voice synthesizer may obtain the style vector representation of the user, and further synthesize singing voices in the singing style of the user. When the user wants to use his own voice, the user may provide a "voice ID" corresponding to himself, so that the singing voice synthesizer may obtain the voice vector representation of the user, and further synthesize singing voices in the voice of the user.

[0074] In an application scenario, a user may expect to adapt a demo song audio segment. The demo song audio segment may be a recording of a song sung by the user himself, or a singing recording by any other singers. The user may want to replace a singer's voice in the original audio segment by a voice of a specified singer, replace a singing style of the original audio segment by a specified singing style, etc. In this case, the demo song audio segment provided by the user may be input to, e.g., the music score generator 410 of FIG.4 as audio music data, and the music score generator may generate a corresponding music score based on the demo song audio segment. Moreover, the user may also provide a desired voice ID and/or a singing style ID. Accordingly, the singing voice synthesizer may perform singing voice synthesis based on the generated music score, a voice vector representation corresponding to the voice ID and/or a style vector representation corresponding to the singing style ID.

[0075] FIG.8 illustrates an exemplary process 800 of performing singing voice synthesis based on a music score according to an embodiment of the present invention.

[0076] At 810, first music score phoneme information associated with, e.g., the first phoneme may be extracted from the music score. The first music score phoneme information may comprise a first phoneme, and a pitch and a beat of a note corresponding to the first phoneme. For example, the first music score phoneme information may be extracted from the music score through a music score parser.
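
For concreteness, the first music score phoneme information may be pictured as a simple record such as the one sketched below. The field names and types are illustrative assumptions only and do not reflect a format defined by this disclosure.

    from dataclasses import dataclass

    @dataclass
    class ScorePhonemeInfo:
        """Illustrative container for music score phoneme information produced by a music score parser."""
        phoneme: str       # the phoneme itself, e.g. "a"
        note_pitch: int    # pitch of the corresponding note, e.g. as a MIDI note number
        note_beat: float   # beat of the corresponding note, e.g. in seconds or in frames

    first_info = ScorePhonemeInfo(phoneme="a", note_pitch=67, note_beat=0.5)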

[0077] At 815, a first vector representation, e.g., a phoneme side vector representation, may be generated based on the first music score phoneme information. For example, the first vector representation corresponding to the first phoneme may be generated by a music score encoder based on the first music score phoneme information.

[0078] At 820, optionally, an indication of a singing style may be received. The indication of the singing style may be a singing style ID or a style vector representation obtained based on the singing style ID.

[0079] At 825, a phoneme side combined vector representation may be generated based on the first vector representation and the style vector representation corresponding to the singing style. For example, the first vector representation and the style vector representation may be added, concatenated, or cascaded.
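
A minimal sketch of the combination at 825, assuming fixed-size vectors; whether the vectors are added or concatenated is a design choice, so both variants are shown, and the dimensions are illustrative.

    import numpy as np

    phoneme_vec = np.random.randn(256).astype(np.float32)   # first (phoneme side) vector representation
    style_vec = np.random.randn(256).astype(np.float32)     # style vector representation for the singing style

    combined_add = phoneme_vec + style_vec                    # element-wise addition, keeps dimension 256
    combined_cat = np.concatenate([phoneme_vec, style_vec])   # concatenation, yields dimension 512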

[0080] At 830, a phoneme duration of the first phoneme may be determined by a duration predictor based on the phoneme side combined vector representation. The duration predictor may be configured for predicting a phoneme duration at least under a constraint by a note beat.

[0081] At 835, the phoneme side combined vector representation may be expanded to a second vector representation, e.g., a spectrum side vector representation, based on the phoneme duration of the first phoneme. For example, the phoneme side combined vector representation may be expanded to the second vector representation by a length regulator based on the phoneme duration of the first phoneme.
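
One simple way to realize the expansion at 835 is to repeat each phoneme side vector according to its predicted duration in frames, as in the sketch below; the dimensions and duration values are illustrative assumptions.

    import numpy as np

    def length_regulate(phoneme_vectors, durations_in_frames):
        """Expand phoneme-level vectors into a frame-level (spectrum side) sequence by repetition."""
        return np.repeat(phoneme_vectors, durations_in_frames, axis=0)

    phoneme_vectors = np.random.randn(3, 256).astype(np.float32)   # combined vectors for 3 phonemes
    durations = np.array([12, 7, 20])                               # predicted phoneme durations in frames
    spectrum_side = length_regulate(phoneme_vectors, durations)     # frame-level sequence of shape (39, 256)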

[0082] At 840, optionally, an indication of a voice of a singer may be received. The indication of the voice of the singer may be a voice ID or a voice vector representation obtained based on the voice ID.

[0083] At 845, a fundamental frequency residual and spectral parameters corresponding to the first phoneme may be generated based on the second vector representation and a possible voice vector representation corresponding to the voice of the singer. For example, the fundamental frequency residual and the spectral parameters may be generated by a spectrum decoder.

[0084] At 850, a fundamental frequency corresponding to the first phoneme may be obtained through regulating the pitch of the note with the fundamental frequency residual. For example, the fundamental frequency to be adopted may be obtained by a pitch regulator through superimposing the fundamental frequency residual and a standard fundamental frequency of the pitch corresponding to the first phoneme.
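
As an illustration of the regulation at 850, the sketch below assumes that the note pitch is given as a MIDI note number and that the residual is additive in Hertz; both assumptions are illustrative rather than prescribed by this disclosure.

    def midi_to_hz(midi_note):
        """Standard fundamental frequency of a note given as a MIDI note number (A4 = 69 = 440 Hz)."""
        return 440.0 * 2.0 ** ((midi_note - 69) / 12.0)

    note_pitch = 67                               # pitch of the note corresponding to the first phoneme (G4)
    f0_residual = [1.5, -0.8, 0.3]                # predicted fundamental frequency residual per frame, in Hz
    f0 = [midi_to_hz(note_pitch) + r for r in f0_residual]   # fundamental frequency adopted per frame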

[0085] At 860, an acoustic waveform corresponding to the first phoneme may be generated based at least in part on the fundamental frequency obtained at 850 and the spectral parameters obtained at 845. For example, the acoustic waveform may be generated by a vocoder.

[0086] By performing the above process 800 on each of the phonemes recognized from the music score, a plurality of acoustic waveforms respectively corresponding to these phonemes may be obtained. Together, these acoustic waveforms form the singing voices to be synthesized.
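
A small sketch of assembling the full song from per-phoneme waveforms is given below; the per-phoneme synthesis function is a placeholder standing in for the vocoder, and the shapes and values are illustrative only.

    import numpy as np

    def synthesize_phoneme(f0, spectral_params):
        """Placeholder for per-phoneme waveform generation by a vocoder; returns a silent dummy segment."""
        return np.zeros(2400, dtype=np.float32)

    # Illustrative (f0, spectral parameters) pairs for 5 phonemes, e.g. obtained through the process 800.
    phoneme_outputs = [(np.full(30, 220.0), np.random.randn(30, 80)) for _ in range(5)]
    song = np.concatenate([synthesize_phoneme(f0, sp) for f0, sp in phoneme_outputs])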

[0087] It should be understood that all the processing in the process 800 is exemplary, and the process 800 may be modified in any form according to specific application requirements. For example, in the case that no indication of a singing style is received, the steps 820 and 825 may be omitted, and the phoneme duration may be directly predicted based on the first vector representation at 830. For example, in the case that no indication of a singer's voice is received, the step 840 may be omitted, and the fundamental frequency residual and the spectral parameters may be generated based on the second vector representation at 845.

[0088] FIG.9 illustrates an exemplary training process 900 of an acoustic feature predictor according to an embodiment of the present invention. The exemplary training process 900 may adopt a large number of reference audios as training data. These reference audios may be singing voice audios collected from different sources in advance, and each singing voice audio is sung by a reference singer in a reference singing style. Based on an input reference audio, a loss value may be calculated for the whole acoustic feature predictor through at least one predefined loss function, and the whole acoustic feature predictor may be optimized according to the loss value, thereby imposing constraints on a music score encoder, a spectrum decoder, a duration predictor, etc. in the acoustic feature predictor. FIG.9 takes an exemplary reference audio 902 as input to illustrate the calculation of the loss value during the training process.

[0089] A reference audio 902 may be input to a music score parser 920. The music score parser 920 parses reference music score phoneme information of each reference phoneme from the reference audio 902 and feeds it to a music score encoder 932. The reference music score phoneme information includes a reference phoneme name, and a reference note beat and a reference note pitch of a reference note corresponding to the reference phoneme name.

[0090] The music score encoder 932 generates a phoneme side vector representation based on the input reference music score phoneme information. When it is desired that an acoustic feature predictor can generate acoustic feature parameters according to a specified singing style ID, the phoneme side vector representation is input to a vector combination module 934. The vector combination module 934 may combine a style vector representation from a style encoder 950 with the phoneme side vector representation to obtain a phoneme side combined vector representation. The phoneme side combined vector representation is input to a duration predictor 936. The duration predictor 936 generates a predicted phoneme duration of the reference phoneme based on the phoneme side combined vector representation for the reference phoneme.

[0091] The reference audio 902 may also be input to an audio processing module 910. The audio processing module 910 analyzes the reference audio 902 by using a speech recognition algorithm to obtain phoneme alignment data associated with the reference phoneme in the reference audio 902. The phoneme alignment data may include information related to: the name of the reference phoneme, and the start time of the reference phoneme in the reference audio 902. Based on the start time of each reference phoneme, the reference phoneme duration of the reference phoneme may be calculated. The reference phoneme duration characterizes an actual phoneme duration of the reference phoneme in the reference audio 902.
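
As a small illustration, reference phoneme durations can be derived from consecutive start times in the phoneme alignment data, assuming the end time of the last phoneme is also known; the time values below are illustrative.

    import numpy as np

    # Start times (in seconds) of consecutive reference phonemes from the phoneme alignment data,
    # followed by an assumed end time of the last phoneme.
    start_times = np.array([0.00, 0.12, 0.31, 0.58])
    end_time = 0.75

    boundaries = np.append(start_times, end_time)
    reference_durations = np.diff(boundaries)   # actual phoneme durations: [0.12, 0.19, 0.27, 0.17]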

[0092] A phoneme loss function 992 may be defined, which is used for calculating a phoneme duration loss value L_pd based on a difference between a predicted phoneme duration of the current reference phoneme and a corresponding reference phoneme duration. Thus, when the duration predictor 936 is constrained through training, alignment of phoneme-level durations may be considered. The alignment of phoneme-level durations may refer to alignment in time between the predicted phoneme duration and the reference phoneme duration. Therefore, the trained duration predictor 936 may predict a phoneme duration at least under the constraints of phoneme-level durations.

[0093] Moreover, in addition to phoneme durations, note durations are also important for singing voice synthesis (SVS), because the melody conveyed by the note durations produces a stronger rhythmic feeling, thus making the synthesized singing voices more realistic. Therefore, a note loss function 994 is further proposed in the process 900. The note loss function 994 adds control over note-level durations. Assuming that the phoneme granularity is 3, i.e., one syllable is divided into 3 phonemes, and the syllable has a corresponding note, then this note is also associated with the 3 phonemes. First, the 3 predicted phoneme durations derived by the duration predictor 936 from the 3 reference phonemes associated with a reference note may be accumulated. Then, the note loss function 994 may calculate a note duration loss value L_sd based on a difference between the total duration of the 3 predicted phoneme durations and a reference note beat of the reference note in the reference music score phoneme information.

[0094] A length regulator 938 generates a spectrum side vector representation based on the predicted phoneme duration and the phoneme side combined vector representation output by the vector combination module 934.

[0095] A voice encoder 960 may provide a voice vector representation associated with an input voice ID to a spectrum decoder 939.

[0096] The spectrum decoder 939 may generate predicted spectrum parameters and a predicted fundamental frequency residual based on the spectrum side vector representation and the voice vector representation. A pitch regulator 940 may generate a predicted fundamental frequency based on the predicted fundamental frequency residual and a reference note pitch associated with the current reference phoneme from the music score parser 920.

[0097] The audio processing module 910 may analyze the reference audio 902 by using a speech recognition algorithm to obtain a reference fundamental frequency and reference spectral parameters associated with the current reference phoneme. The reference fundamental frequency and the reference spectral parameters characterize an actual fundamental frequency and actual spectral parameters of the current reference phoneme in the reference audio 902.

[0098] A pitch loss function 996 may be defined, which is used for calculating a pitch loss value L_f based on a difference between the predicted fundamental frequency from the pitch regulator 940 and the reference fundamental frequency. A spectrum loss function 998 may be defined, which is used for calculating a spectrum loss value L_sp based on a difference between the predicted spectral parameters from the spectrum decoder 939 and the reference spectral parameters.

[0099] Therefore, a loss value for the whole acoustic feature predictor may be calculated as:

L_o = w_f * L_f + w_sp * L_sp + w_pd * L_pd + w_sd * L_sd        Equation (2)

wherein w indicates a weight for a corresponding loss value, f represents pitch or fundamental frequency, sp represents spectral parameters, pd represents phoneme duration, and sd represents note duration.

[00100] Through setting the weight w_sd to a larger value, e.g., greater than the weight w_pd, a prediction result by the acoustic feature predictor may be constrained, during the training process, to a rhythm that more closely conforms to the music score, and thus singing voices that are more melodic and closer to human naturalness can be synthesized.
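
The sketch below assembles the loss of Equation (2) from the four component losses, assuming mean-squared-error loss forms and durations expressed in frames; the loss forms, the sample values, and the weights are illustrative assumptions only, with w_sd chosen larger than w_pd as suggested above.

    import numpy as np

    def mse(pred, ref):
        return float(np.mean((np.asarray(pred) - np.asarray(ref)) ** 2))

    # Illustrative predictions and references for one reference note spanning 3 phonemes.
    pred_phoneme_dur = np.array([10.0, 6.0, 21.0])   # predicted phoneme durations (frames)
    ref_phoneme_dur = np.array([12.0, 7.0, 20.0])    # reference phoneme durations (frames)
    ref_note_beat = 40.0                              # reference note beat, expressed in frames

    l_pd = mse(pred_phoneme_dur, ref_phoneme_dur)                    # phoneme duration loss
    l_sd = (pred_phoneme_dur.sum() - ref_note_beat) ** 2             # note duration loss
    l_f = mse([219.0, 221.0], [220.0, 220.0])                        # pitch loss (predicted vs. reference f0)
    l_sp = mse(np.zeros((2, 80)), 0.1 * np.ones((2, 80)))            # spectrum loss (predicted vs. reference spectra)

    w_f, w_sp, w_pd, w_sd = 1.0, 1.0, 0.5, 1.0    # w_sd set larger than w_pd, per the discussion above
    l_total = w_f * l_f + w_sp * l_sp + w_pd * l_pd + w_sd * l_sd    # Equation (2)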

[00101] In an implementation, an average loss value may be calculated for a set of reference audios. Furthermore, the acoustic feature predictor may be trained multiple times based on multiple average loss values for multiple sets of reference audios.

[00102] It should be understood that although the process 900 involves the phoneme loss function 992, the note loss function 994, the pitch loss function 996, and the spectrum loss function 998, the process 900 may adopt more or fewer loss functions according to specific application requirements.

[00103] It should be understood that although the process 900 involves the style encoder 950, the style encoder 950 may also be omitted, and the vector combination module 934 may be omitted accordingly. Moreover, the voice encoder 960 may also be omitted.

[00104] FIG.10 illustrates a flowchart of an exemplary method 1000 for singing voice synthesis according to an embodiment of the present invention.

[00105] At 1010, first music score phoneme information extracted from a music score may be received. The first music score phoneme information may comprise a first phoneme, and a pitch and a beat of a note corresponding to the first phoneme.

[00106] At 1020, a fundamental frequency residual and spectral parameters corresponding to the first phoneme may be generated based on the first music score phoneme information.

[00107] At 1030, a fundamental frequency corresponding to the first phoneme may be obtained through regulating the pitch of the note with the fundamental frequency residual.

[00108] At 1040, an acoustic waveform corresponding to the first phoneme may be generated based at least in part on the fundamental frequency and the spectral parameters.

[00109] In an implementation, the generating a fundamental frequency residual and spectral parameters corresponding to the first phoneme may comprise: generating a first vector representation based on the first music score phoneme information; determining, by a duration predictor, a phoneme duration of the first phoneme based on the first vector representation, the duration predictor being configured for predicting a phoneme duration under a constraint by a note beat; expanding the first vector representation to a second vector representation based on the phoneme duration of the first phoneme; and generating the fundamental frequency residual and the spectral parameters corresponding to the first phoneme based at least on the second vector representation.

[00110] In an implementation, training data for the duration predictor may at least comprise: a reference phoneme duration of each reference phoneme and a beat of each reference note, extracted from a reference audio.

[00111] In an implementation, training of the duration predictor adopts a first loss function, the first loss function being for calculating a difference between: a phoneme duration predicted by the duration predictor for a reference phoneme; and a reference phoneme duration of the reference phoneme.

[00112] In an implementation, the training of the duration predictor further adopts a second loss function, the second loss function being for calculating a difference between: a sum of a plurality of phoneme durations predicted by the duration predictor for a plurality of reference phonemes corresponding to a reference note; and a beat of the reference note.

[00113] In an implementation, the first loss function and the second loss function may have different weights in the training of the duration predictor.

[00114] In an implementation, the weight of the first loss function may be less than the weight of the second loss function.

[00115] In an implementation, the method 1000 may further comprise: receiving an indication of a singing style. The determining a phoneme duration of the first phoneme may be further based on the singing style. The generating a fundamental frequency residual and spectral parameters corresponding to the first phoneme may be further based on the singing style.

[00116] In an implementation, the method 1000 may further comprise: receiving an indication of voice of a target singer. The generating spectral parameters corresponding to the first phoneme may be further based on the voice of the target singer.

[00117] In an implementation, the method 1000 may further comprise: receiving an indication of a singing style of a first target singer; and receiving an indication of voice of a second target singer. The determining a phoneme duration of the first phoneme may be further based on the singing style of the first target singer, the generating a fundamental frequency residual corresponding to the first phoneme may be further based on the singing style of the first target singer, and the generating spectral parameters corresponding to the first phoneme may be further based on the singing style of the first target singer and the voice of the second target singer.

[00118] In an implementation, the fundamental frequency residual and the spectral parameters corresponding to the first phoneme may be generated through a self-attention based feed-forward neural network.

[00119] In an implementation, the fundamental frequency residual and the spectral parameters corresponding to the first phoneme may be generated in a non-autoregressive approach.
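
One of many possible realizations of such a network is sketched below using PyTorch: a single self-attention based feed-forward block that maps the frame-level (spectrum side) sequence to spectral parameters and a fundamental frequency residual in one parallel, non-autoregressive pass. The layer sizes and output dimensions are illustrative assumptions, not the specific architecture of this disclosure.

    import torch
    import torch.nn as nn

    class FFTBlock(nn.Module):
        """Minimal self-attention based feed-forward block; the whole frame sequence is
        processed in a single forward pass (non-autoregressive)."""
        def __init__(self, dim=256, heads=4, n_spec=80):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
            self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
            self.to_spec = nn.Linear(dim, n_spec)   # predicted spectral parameters per frame
            self.to_f0res = nn.Linear(dim, 1)       # predicted fundamental frequency residual per frame

        def forward(self, x):                        # x: (batch, frames, dim), the spectrum side sequence
            h = self.norm1(x + self.attn(x, x, x)[0])
            h = self.norm2(h + self.ffn(h))
            return self.to_spec(h), self.to_f0res(h).squeeze(-1)

    spec, f0_res = FFTBlock()(torch.randn(1, 39, 256))   # all 39 frames are predicted in parallel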

[00120] In an implementation, the music score may be generated based on at least one of: image music score data, audio music data, symbolic music score data, and text music score data.

[00121] In an implementation, the phoneme duration may be in units of time frames.
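
When durations are counted in time frames, they relate to wall-clock time through the analysis hop size, as in the small example below; the sample rate and hop size are illustrative assumptions.

    sample_rate = 24000    # illustrative audio sample rate, in Hz
    hop_size = 240         # illustrative analysis hop size, in samples (10 ms per frame)

    duration_seconds = 0.19
    duration_frames = round(duration_seconds * sample_rate / hop_size)   # 19 frames
    recovered_seconds = duration_frames * hop_size / sample_rate         # 0.19 seconds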

[00122] It should be understood that the method 1000 may further comprise any steps/processes for singing voice synthesis according to the above embodiments of the present disclosure.

[00123] FIG.11 illustrates a diagram of an exemplary apparatus 1100 for singing voice synthesis according to an embodiment of the present invention.

[00124] The apparatus 1100 may comprise: an acoustic feature predictor 1110, for receiving first music score phoneme information extracted from a music score, the first music score phoneme information comprising a first phoneme and a pitch and a beat of a note corresponding to the first phoneme, and generating a fundamental frequency residual and spectral parameters corresponding to the first phoneme based on the first music score phoneme information; a pitch regulator 1120, for obtaining a fundamental frequency corresponding to the first phoneme through regulating the pitch of the note with the fundamental frequency residual; and a vocoder 1130, for generating an acoustic waveform corresponding to the first phoneme based at least in part on the fundamental frequency and the spectral parameters.

[00125] In an implementation, the acoustic feature predictor 1110 may further comprise: a music score encoder 1112, for generating a first vector representation based on the first music score phoneme information; a duration predictor 1114, for determining a phoneme duration of the first phoneme based on the first vector representation, the duration predictor being configured for predicting a phoneme duration under a constraint by a note beat; a length regulator 1116, for expanding the first vector representation to a second vector representation based on the phoneme duration of the first phoneme; and a spectrum decoder 1118, for generating the fundamental frequency residual and the spectral parameters corresponding to the first phoneme based at least on the second vector representation.

[00126] In an implementation, training data for the duration predictor 1114 may at least comprise: a reference phoneme duration of each reference phoneme and a beat of each reference note, extracted from a reference audio. The duration predictor 1114 may be trained based at least on a loss function for calculating a difference between: a sum of a plurality of phoneme durations predicted by the duration predictor 1114 for a plurality of reference phonemes corresponding to a reference note; and a beat of the reference note.

[00127] In an implementation, the spectrum decoder 1118 may be for: receiving an indication of a singing style; and generating the fundamental frequency residual and the spectral parameters corresponding to the first phoneme based at least on the second vector representation and the singing style.

[00128] In an implementation, the spectrum decoder 1118 may be for: receiving an indication of voice of a target singer; and generating the spectral parameters corresponding to the first phoneme based at least on the second vector representation and the voice of the target singer.

[00129] Moreover, the apparatus 1100 may further comprise any other modules that perform any steps/processes in the methods for singing voice synthesis according to the above embodiments of the present disclosure.

[00130] FIG.12 illustrates a diagram of an exemplary apparatus 1200 for singing voice synthesis according to an embodiment of the present invention.

[00131] The apparatus 1200 may comprise at least one processor 1210 and a memory 1220 storing computer-executable instructions. The computer-executable instructions, when executed, cause the at least one processor 1210 to: receive first music score phoneme information extracted from a music score, the first music score phoneme information comprising a first phoneme, and a pitch and a beat of a note corresponding to the first phoneme; generate a fundamental frequency residual and spectral parameters corresponding to the first phoneme based on the first music score phoneme information; obtain a fundamental frequency corresponding to the first phoneme through regulating the pitch of the note with the fundamental frequency residual; and generate an acoustic waveform corresponding to the first phoneme based at least in part on the fundamental frequency and the spectral parameters. Moreover, the processor 1210 may further perform any steps/processes for singing voice synthesis according to the above embodiments of the present disclosure.

[00132] The embodiments of the present disclosure may be embodied in a non-transitory computer-readable medium. The non-transitory computer-readable medium may comprise instructions that, when executed, cause one or more processors to perform any operations of the methods for singing voice synthesis according to the above embodiments of the present disclosure.

[00133] It should be understood that all the operations in the methods described above are merely exemplary, and the present disclosure is not limited to any operations in the methods or sequence orders of these operations, and should cover all other equivalents under the same or similar concepts.

[00134] It should also be understood that all the modules in the apparatuses described above may be implemented in various approaches. These modules may be implemented as hardware, software, or a combination thereof. Moreover, any of these modules may be further functionally divided into sub-modules or combined together.

[00135] Processors are described in connection with various apparatuses and methods. These processors can be implemented using electronic hardware, computer software, or any combination thereof. Whether these processors are implemented as hardware or software will depend on the specific application and the overall design constraints imposed on the system. By way of example, a processor, any portion of a processor, or any combination of processors presented in this disclosure may be implemented as a microprocessor, a microcontroller, a digital signal processor (DSP), a field programmable gate array (FPGA), a programmable logic device (PLD), a state machine, gate logic, discrete hardware circuitry, and other suitable processing components configured to perform the various functions described in this disclosure. The functions of a processor, any portion of a processor, or any combination of processors presented in this disclosure may be implemented as software executed by a microprocessor, a microcontroller, a DSP, or other suitable platforms.

[00136] Software should be considered broadly to represent instructions, instruction sets, code, code segments, program code, programs, subroutines, software modules, applications, software applications, software packages, routines, objects, running threads, processes, functions, and the like. Software can reside on a computer-readable medium. A computer-readable medium may include, e.g., a memory, which may be, e.g., a magnetic storage device (e.g., a hard disk, a floppy disk, a magnetic strip), an optical disk, a smart card, a flash memory device, a random access memory (RAM), a read only memory (ROM), a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM), a register, or a removable disk. Although a memory is shown as being separate from the processor in various aspects presented in this disclosure, a memory may also be internal to the processor (e.g., a cache or a register).

[00137] The previous description is provided to enable any person skilled in the art to practice the various aspects described herein. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects. Thus, the claims are not intended to be limited to the aspects shown herein. All structural and functional equivalent transformations to the elements of the various aspects of the present disclosure, which are known or will later become apparent to those skilled in the art, are hereby explicitly incorporated, and are intended to be covered by the claims.