

Title:
AUTOMATIC SONG GENERATION
Document Type and Number:
WIPO Patent Application WO/2018/200268
Kind Code:
A1
Abstract:
In accordance with implementations of the subject matter described herein, there is provided a solution for supporting a machine to automatically generate a song. In this solution, an input from a user is used to determine a creation intention of the user with respect to a song to be generated. A template for the song is generated based on the creation intention, the template indicating a melody of the song and a distribution of lyrics relative to the melody. Then, the lyrics of the song are generated based at least in part on the template. In this way, it is feasible to automatically create the melody and lyrics which not only conform to the creation intention of the user but also match with each other.

Inventors:
LIAO QINYING (US)
YANG NAN (US)
LUAN JIAN (US)
WEI FURU (US)
LIU ZHEN (US)
YANG ZIQI (US)
HUANG BIN (US)
Application Number:
PCT/US2018/028044
Publication Date:
November 01, 2018
Filing Date:
April 18, 2018
Assignee:
MICROSOFT TECHNOLOGY LICENSING LLC (US)
International Classes:
G06F17/30
Foreign References:
US 7977560 B2 (2011-07-12)
US 2013/0218929 A1 (2013-08-22)
US 2011/0231193 A1 (2011-09-22)
Other References:
None
Attorney, Agent or Firm:
MINHAS, Sandip S. et al. (US)
Claims:
CLAIMS

1. A computer-implemented method, comprising:

in response to receiving an input from a user, determining, based on the input, a creation intention of the user with respect to a song to be generated;

generating a template for the song based on the creation intention, the template indicating a melody of the song and a distribution of lyrics relative to the melody; and

generating the lyrics of the song based at least in part on the template.

2. The method of claim 1, wherein generating the lyrics further comprises: generating the lyrics further based on the creation intention.

3. The method of claim 1, further comprising:

combining the lyrics and the melody indicated by the template to generate the song.

4. The method of claim 1, further comprising:

obtaining a voice model representing a voice characteristic of a singer;

generating a voice spectrum trajectory for the lyrics using the voice model;

synthesizing the voice spectrum trajectory and the melody indicated by the template into a singing waveform of the song; and

playing the song based on the singing waveform.

5. The method of claim 4, wherein obtaining the voice model comprises:

receiving a voice segment of the singer; and

obtaining the voice model by adjusting a predefined average voice model with the received voice segment, the average voice model being obtained with voice segments of a plurality of different singers.

6. The method of claim 1, wherein generating the template based on the creation intention comprises:

selecting, based on the creation intention, the template from a plurality of candidate templates.

7. The method of claim 1, wherein generating the template based on the creation intention comprises:

dividing at least one existing song melody into a plurality of melody segments;

selecting, based on the creation intention, a plurality of candidate melody segments from the plurality of melody segments;

concatenating, based on smoothness among the plurality of candidate melody segments, at least two of the plurality of candidate melody segments to form the melody indicated by the template; and

determining the distribution of the lyrics relative to the melody indicated by the template by analyzing lyrics in songs corresponding to the concatenated at least two candidate melody segments.

8. The method of claim 1, wherein generating the lyrics comprises:

generating candidate lyrics based at least in part on the template; and

modifying the candidate lyrics based on a further input received from the user to obtain the lyrics.

9. The method of claim 1, wherein generating the lyrics comprises:

obtaining a predefined lyrics generation model, the lyrics generation model being obtained with a plurality of pieces of existing lyrics; and

generating the lyrics based on the template using the lyrics generation model.

10. The method of claim 1, wherein the input includes at least one of an image, a word, a video, or an audio.

11. A device, comprising:

a processing unit; and

a memory coupled to the processing unit and including instructions stored thereon which, when executed by the processing unit, cause the device to perform acts including:

in response to receiving an input from a user, determining, based on the input, a creation intention of the user with respect to a song to be generated;

generating a template for the song based on the creation intention, the template indicating a melody of the song and a distribution of lyrics relative to the melody; and

generating the lyrics of the song based at least in part on the template.

12. The device of claim 11, wherein generating the lyrics further comprises: generating the lyrics further based on the creation intention.

13. The device of claim 11, wherein the acts further include:

combining the lyrics and the melody indicated by the template to generate the song.

14. The device of claim 11, wherein the acts further include:

obtaining a voice model representing a voice characteristic of a singer;

generating a voice spectrum trajectory for the lyrics using the voice model;

synthesizing the voice spectrum trajectory and the melody indicated by the template into a singing waveform of the song; and

playing the song based on the singing waveform.

15. The device of claim 14, wherein obtaining the voice model comprises:

receiving a voice segment of the singer; and

obtaining the voice model by adjusting a predefined average voice model with the received voice segment, the average voice model being obtained with voice segments of a plurality of different singers.

Description:
AUTOMATIC SONG GENERATION

BACKGROUND

[0001] Songs are an artistic form appreciated and loved by people and have become part of people's lives. However, song creation is still a complex process. Generally speaking, a song creation process includes two major phases, that is, lyrics writing (namely, generating lyrics) and melody composition (namely, generating a melody). Conventional melody composition requires composers to have music theory knowledge and to create a complete song melody from inspiration and creation experience. Creating a sweet-sounding melody has many requirements in music theory, for example, keeping the melody and rhythm uniform, representing a certain theme, and reflecting various music styles or combinations of styles, and/or the like. In addition, lyrics, as an important part of songs, are also required to express certain meanings, correspond to the themes, and match the melody of the songs. In this sense, high music theory requirements are imposed on a creator to generate songs having specific styles and emotions and representing specific themes.

SUMMARY

[0002] In accordance with implementations of the subject matter described herein, there is provided a solution for supporting a machine to automatically generate a song. In this solution, an input from a user is used to determine a creation intention of the user with respect to a song to be generated. A template for the song is generated based on the creation intention, the template indicating a melody of the song and a distribution of lyrics relative to the melody. Then, the lyrics of the song are generated based at least in part on the template. In this way, it is feasible to automatically create the melody and lyrics which not only conform to the creation intention of the user but also match with each other.

[0003] This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.

BRIEF DESCRIPTION OF THE DRAWINGS

[0004] Fig. 1 illustrates a block diagram of a computing environment in which implementations of the subject matter described herein can be implemented;

[0005] Fig. 2 illustrates a block diagram of a system for automatic song generation in accordance with some implementations of the subject matter described herein;

[0006] Fig. 3 illustrates a schematic diagram of analysis of creation intention from a user input in accordance with some implementations of the subject matter described herein;

[0007] Fig. 4 illustrates a block diagram of a system for automatic song generation in accordance with some other implementations of the subject matter described herein; and

[0008] Fig. 5 illustrates a flowchart of a process of generating a song in accordance with some implementations of the subject matter described herein.

[0009] Throughout the drawings, the same or similar reference symbols refer to the same or similar elements.

DETAILED DESCRIPTION

[0010] The subject matter described herein will now be discussed with reference to several example implementations. It is to be understood that these implementations are discussed only for the purpose of enabling those skilled in the art to better understand and thus implement the subject matter described herein, rather than suggesting any limitations on the scope of the subject matter.

[0011] As used herein, the term "includes" and its variants are to be read as open terms that mean "includes, but is not limited to." The term "based on" is to be read as "based at least in part on." The terms "one implementation" and "an implementation" are to be read as "at least one implementation." The term "another implementation" is to be read as "at least one other implementation." The terms "first," "second," and the like may refer to different or same objects. Other definitions, explicit and implicit, may be included below.

[0012] As discussed above, creating a song imposes many requirements on the melody and/or lyrics, and these requirements limit the ability of ordinary people or organizations to create personalized songs. In many cases, ordinary people or organizations need to resort to professionals and organizations capable of writing lyrics and composing melodies if they want to obtain customized songs. As the computer era arrives, and especially as artificial intelligence develops, it is desirable to generate desired songs automatically, for example, by generating the melody and/or lyrics of the songs automatically.

[0013] In accordance with some implementations of the subject matter described herein, there is provided a computer-implemented solution for automatic song generation. In this solution, an input from a user, such as an image, a word, a video, and/or an audio, is used to determine a creation intention of the user with respect to song creation. The determined creation intention is further used to guide generation of a template for the song so that the generated template can indicate the melody of the song and a distribution of lyrics relative to the melody. Furthermore, the lyrics of the song are generated based on the melody and the distribution of lyrics indicated by the template. According to the solution of the subject matter described herein, since the generated lyrics match the melody in the template for the song, the lyrics may be directly combined with the melody into a song that can be sung. In addition, the lyrics, melody, and/or song generated based on the input from the user can all reflect the creation intention of the user; thus, a personalized and high-quality song, lyrics, and/or melody can be provided to the user.

[0014] Basic principles and various example implementations of the subject matter described here will now be described with reference to the drawings.

Example Environment

[0015] Fig. 1 illustrates a block diagram of a computing environment 100 in which implementations of the subject matter described herein can be implemented. It would be appreciated that the computing environment 100 shown in Fig. 1 is merely for the purpose of illustration and does not limit the function and scope of the implementations of the subject matter described herein in any way. As shown in Fig. 1, the computing environment 100 includes a computing device 102 in the form of a general-purpose computing device. Components of the computing device 102 may include, but are not limited to, one or more processors or processing units 110, a memory 120, a storage device 130, one or more communication units 140, one or more input devices 150, and one or more output devices 160.

[0016] In some implementations, the computing device 102 may be implemented as various user terminals or service terminals. The service terminals may be servers or large-scale computing devices, and other devices provided by various service providers. The user terminals are, for example, any type of mobile terminals, fixed terminals, or portable terminals, including mobile phones, stations, units, devices, multimedia computers, multimedia tablets, Internet nodes, communicators, desktop computers, laptop computers, notebook computers, netbook computers, tablet computers, Personal Communication System (PCS) devices, personal navigation devices, Personal Digital Assistants (PDAs), audio/video players, digital cameras/camcorders, positioning devices, television receivers, radio broadcast receivers, electronic book devices, game devices, or any combination thereof, including the accessories and peripherals of these devices or any combination thereof. It is also contemplated that the computing device 102 can support any type of interface to the user (such as "wearable" circuitry and the like).

[0017] The processing unit 110 can be a physical or virtual processor and perform various processes based on programs stored in the memory 120. In a multiprocessor system, multiple processing units execute computer executable instructions in parallel to improve the parallel processing capability of the computing device 102. The processing unit 110 can also be referred to as a central processing unit (CPU), microprocessor, controller, or microcontroller.

[0018] The computing device 102 typically includes various computer storage media. Such media can be any media accessible to the computing device 102, including but not limited to volatile and non-volatile media, and removable and non-removable media. The memory 120 can be a volatile memory (for example, a register, cache, or Random Access Memory (RAM)), non-volatile memory (for example, a Read-Only Memory (ROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), or flash memory), or any combination thereof. The memory 120 may include one or more program modules 122 configured to perform functions of various implementations described herein. The module 122 may be accessed and operated by the processing unit 110 to implement the corresponding functions. The storage device 130 can be any removable or non-removable media and may include machine-readable media, which can be used for storing information and/or data and can be accessed within the computing device 102.

[0019] Functions of components in the computing device 102 can be implemented in a single computing cluster or a plurality of computing machines that communicate with each other via communication connections. Therefore, the computing device 102 can be operated in a networking environment using a logical link with one or more other servers, personal computers (PCs), or other general network nodes. The computing device 102 can further communicate, via the communication unit 140, with one or more external devices (not shown) such as a database 170, other storage devices, a server, a display device, and the like, or communicate with one or more devices enabling the user to interact with the computing device 102, or communicate with any devices (for example, a network card, modem, and the like) that enable the computing device 102 to communicate with one or more other computing devices. Such communication can be performed via input/output (I/O) interfaces (not shown).

[0020] The input device 150 may include one or more input devices such as a mouse, keyboard, tracking ball, voice input device, and the like. The output device 160 may include one or more output devices such as a display, loudspeaker, printer, and the like. In some implementations of automatic song generation, the input device 150 receives an input 104 from a user. Depending on the type of content that the user desires to input, different types of input devices 150 may be used to receive the input 104. The input 104 is provided to the module 122 so that the module 122 determines, based on the input 104, a creation intention of the user with respect to the song and thus generates the corresponding melody and/or lyrics of the song. In some implementations, the module 122 provides the generated lyrics, melody, and/or the song formed by the lyrics and melody as an output 106 to the output device 160 for output. The output device 160 may provide the output 106 in one or more forms such as a word, an image, an audio, and/or a video.

[0021] Example implementations for automatically generating lyrics, melody and song in the module 122 will be discussed in detail below.

Generation of Melody and Lyrics

[0022] Fig. 2 illustrates a block diagram of a system for automatic song generation in accordance with some implementations of the subject matter described herein. In some implementations, the system may be implemented as the module 122 in the computing device 102. In the implementation of Fig. 2, the module 122 is implemented for automatically generating a melody and lyrics. As shown, the module 122 includes a creation intention analyzing module 210, a lyrics generating module 220, and a template generating module 230. According to the implementations of the subject matter described herein, the creation intention analyzing module 210 is configured to receive the input 104 from a user, and determine, based on the input 104, a creation intention 202 of the user with respect to a song to be generated. The input 104 may be received from the user via the input device 150 of the computing device 102, and provided to the creation intention analyzing module 210.

[0023] In some implementations, the creation intention analyzing module 210 may analyze or determine the creation intention 202 based on a specific type of the input 104 or various different types of the input 104. Examples of the input 104 may be words input by the user, such as key words, dialogue between characters, labels, and various documents including words. Alternatively, or in addition, the input 104 may include images in various formats, videos and/or audios of various lengths and formats, or the like. The input may be received from the user via a user interface provided by the input device 150. In this way, according to the implementations of the subject matter described herein, the user can control the song to be generated (including the lyrics and/or melody of the song) through simple input, without being required to have music theory knowledge to guide the generation of the lyrics, melody, and/or song.

[0024] The creation intention of the user with respect to the song refers to one or more features in the input 104 of the user that are expected to be expressed by the song to be generated, including the theme, feeling, tone, style, key elements of the song, and/or the like. For example, if the input 104 is a family photo and the facial expressions of the family members in the photo show happiness, the creation intention analyzing module 210 may determine that the creation intention of the user is to generate a song with a theme of "family," with an overall "happy" emotion, and the like.

[0025] Depending on the type of the input 104, the creation intention analyzing module 210 may apply different analysis technologies to extract the creation intention 202 from the input 104. For example, if the input 104 includes a word(s), the creation intention analyzing module 210 may employ a natural language processing or text analysis technology to analyze the theme, emotion, key elements, and the like described in the input word(s).
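By way of illustration only, the following is a minimal sketch of such text-based intention analysis; the tiny keyword lexicons stand in for a full natural language processing pipeline, and the lexicon contents, field names, and matching rule are assumptions made for this sketch, not the patent's method.

```python
from dataclasses import dataclass, field

# Tiny keyword lexicons standing in for a real NLP pipeline (hypothetical).
THEME_LEXICON = {"family": {"family", "home", "mother", "father"},
                 "love": {"love", "heart", "darling"}}
EMOTION_LEXICON = {"happy": {"happy", "joy", "smile", "celebrate"},
                   "sad": {"sad", "tears", "lonely"}}

@dataclass
class CreationIntention:
    theme: str = ""
    emotion: str = ""
    key_elements: list = field(default_factory=list)

def analyze_text(text: str) -> CreationIntention:
    # Normalize the input words and match them against the lexicons.
    tokens = {t.strip(".,!?").lower() for t in text.split()}
    intention = CreationIntention()
    for theme, words in THEME_LEXICON.items():
        if tokens & words:
            intention.theme = theme
            break
    for emotion, words in EMOTION_LEXICON.items():
        if tokens & words:
            intention.emotion = emotion
            break
    intention.key_elements = sorted(tokens & {"outdoor", "beach", "night"})
    return intention

print(analyze_text("A happy day outdoor with my family"))
# CreationIntention(theme='family', emotion='happy', key_elements=['outdoor'])
```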

[0026] As another example, if the input 104 is an image, the creation intention analyzing module 210 may apply various image analysis technologies, such as image recognition, human face recognition, posture recognition, emotion detection, and gender and age detection, to analyze the objects and characters included in the image and information such as the expressions, postures, and emotions of those objects and characters, so as to determine the overall theme, emotion, and key elements shown in the image (for example, human beings, objects, environment, events, and the like included in the image).

[0027] Alternatively, or in addition, the creation intention analyzing module 210 may obtain other features associated with the image, such as the size, format, type (for example, an oil painting, line drawing, clip picture, or black-and-white image), overall tone, associated labels (which may be added by the user or automatically), and metadata of the image. The creation intention 202 is then analyzed and determined based on the obtained information.

[0028] Fig. 3 illustrates a schematic diagram of analysis of creation intention from the input 104. In this example, the input 104 is an image. After reception of the image 104, the creation intention analyzing module 210 may employ human face recognition and posture recognition technologies to determine that the image 104 includes multiple characters, and then determine that the category of the image 104 is "crowd," as shown by the label 302 in Fig. 3. Furthermore, the creation intention analyzing module 210 may further analyze the age and gender of each character in the image 104 (as shown in the label 304) based on gender and age detection and human face recognition, and may then determine, based on the ages, genders, and other information (such as human face similarity), that the crowd included in the image 104 is a family.

[0029] In addition, it is possible to determine, using expression detection, image recognition, and image analysis technologies, that the overall emotion of the people in the image 104 is happiness and that the people are in an outdoor environment. Therefore, the creation intention analyzing module 210 may determine that the creation intention of the user is to create a song to celebrate the happiness of the family. The song may include elements such as "outdoor," "closed," and "individuals." Of course, the creation intention analyzing module 210 may continue to determine information such as the type, format, and size of the image 104 to further assist the determination of the creation intention.

[0030] In other examples, if the input 104 includes an audio and/or video, the creation intention analyzing module 210 may apply speech analysis technology (for the audio and video) and image analysis technology (for the video) to determine specific content included in the input audio and/or video. For example, it is possible to perform the analysis by converting speech in the audio and/or video into words and then using the above-mentioned natural language processing or text analyzing technology. It is also feasible to use the above-mentioned image analysis technology to analyze one or more frames of the video. In addition, spectrum properties of the speech in the audio and/or video may be analyzed to determine emotions of characters expressed by the audio and/or video or to identify the theme related to the speech.

[0031] It would be appreciated that the task of analyzing the creation intention can be performed using various analysis technologies that are currently available or to be developed in the future, as long as those technologies can analyze the corresponding types of input (words, images, audio, and/or video) in one or more aspects to facilitate the song creation. In these implementations, the input 104 may include many types of input, and a corresponding analysis technology may be employed to analyze each of the types of input. Analysis results obtained from the different types of input may be combined to determine the creation intention 202. In some implementations, if the input 104 includes an explicit creation intention indication, for example, an indication of some aspects of the song such as the style and emotion, an indication of some key elements of the song, or an indication of a partial melody and/or lyrics distribution of the song, the explicit creation intention may be extracted from the input 104. Although some examples of creation intention have been listed, it would be appreciated that other aspects that would affect the features of the song may also be analyzed from the input of the user, and the scope of the subject matter described herein is not limited in this regard.

[0032] Further referring to Fig. 2, the creation intention 202 determined by the creation intention analyzing module 210 may be provided as a key word(s) to the template generating module 230. The template generating module 230 is configured to generate a template 204 for the song based on the creation intention 202. The template 204 for the song may at least indicate a melody of the song, which may be expressed as a duration of a phoneme, a pitch trajectory, a power trajectory, and other various parameters for generating the melody. In addition, the song template 204 may also indicate a distribution of lyrics relative to the melody, including the number of words in each section of the lyrics, and a duration of each phoneme in a word, a pitch trajectory, and a power trajectory. Therefore, the distribution of the lyrics matches the melody in the template 204, and thus the song formed by the melody and the lyrics generated therewith may be easily sung.
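As an illustration of the template contents just described, a template might be represented as plain data along the following lines; the class and field names (Note, LyricsSlot, SongTemplate) are assumptions chosen for readability, not the patent's schema.

```python
from dataclasses import dataclass

@dataclass
class Note:
    pitch: float      # pitch trajectory point, in Hz
    duration: float   # duration of the phoneme, in seconds
    power: float      # power trajectory point (relative loudness)

@dataclass
class LyricsSlot:
    section: int      # which section of the lyrics the word belongs to
    notes: list       # per-phoneme Note targets for the word

@dataclass
class SongTemplate:
    melody: list        # the melody, as a list of Notes
    distribution: list  # LyricsSlots placing words relative to the melody

    def words_per_section(self) -> dict:
        # Derive the number of words in each section of the lyrics.
        counts = {}
        for slot in self.distribution:
            counts[slot.section] = counts.get(slot.section, 0) + 1
        return counts
```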

[0033] In some implementations, a plurality of predefined song templates may be determined and stored as "candidate templates." In this case, the template generating module 230 may be configured to select, based on the creation intention 202, the template 204 from the plurality of candidate templates for generation of the current song. The plurality of candidate templates may be obtained from existing songs. For example, one or more candidate templates may be determined by directly using, or manually adjusting, the melodies and the distributions of the lyrics relative to the melodies in the existing songs. In another example, one or more candidate templates may be created by professionals having music theory knowledge. In addition, one or more candidate templates may be provided by the user, for example, created or obtained by the user from other sources. The plurality of candidate templates may be obtained in advance and stored in a storage device for use. For example, the plurality of candidate templates may be stored in the storage device 130 of the computing device 102 as local data, and/or may be stored in an external database 170 accessible to the computing device 102.

[0034] The music styles, tunes, rhythms, and emotions of the candidate templates are known and thus may be recorded, for example, in the form of labels. In this way, the template generating module 230 may select a candidate template as the template 204 from the plurality of candidate templates based on information included in the creation intention, such as the theme, emotion, elements, and the like. The template generating module 230 may select the template 204 to be used based on a comparison of the label information associated with the candidate templates (which records the music styles, tunes, rhythms, emotions, and the like of the candidate templates) and the creation intention 202. For example, if the creation intention 202 indicates that the theme of the song to be generated is "family" and the emotion is "happiness," a candidate template with an emotion of happiness and a brisk tune and rhythm may be selected. In some implementations, two or more candidate templates may be determined based on the creation intention 202 for selection by the user, and the template 204 to be used is then determined by receiving a user selection.
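A hedged sketch of such label-based selection follows; the overlap-count scoring rule and the candidate labels are assumptions for illustration, not the patent's comparison method.

```python
def select_template(intention_labels: set, candidates: list) -> dict:
    # Score each candidate by how many of its labels overlap the intention.
    return max(candidates, key=lambda c: len(intention_labels & c["labels"]))

candidates = [
    {"name": "brisk-major", "labels": {"happy", "family", "brisk"}},
    {"name": "slow-minor", "labels": {"sad", "lonely"}},
]
print(select_template({"family", "happy"}, candidates)["name"])  # brisk-major
```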

[0035] As an alternative or addition to the predefined candidate templates, in some other implementations, the template generating module 230 may generate the template 204 to be used in real time based on the creation intention 202. Specifically, the template generating module 230 may pre-divide one or more existing song melodies into a plurality of melody segments. The division of such melody segments may be performed on the basis of one or more syllables of a melody, and the segments may have any identical or different lengths. It is also possible that the existing songs are divided manually by a professional(s). The plurality of melody segments obtained from the division may be used as a basis for subsequent melody generation, and may be stored partially or totally in the local storage device 130 of the computing device 102 and/or in an accessible external device such as the database 170. After receiving the creation intention 202 from the creation intention analyzing module 210, the template generating module 230 may select, based on the creation intention 202, melody segments to form a complete melody. Combining the melody segments requires not only that the melody comply with the creation intention 202 but also that the transitions between melody segments be smooth so that the whole melody sounds pleasant to the ear. Criteria and determination of "smoothness" will be described in detail below.

[0036] Specifically, the template generating module 230 may select two or more candidate melody segments from the pre-divided melody segments, and concatenate, based on smoothness among the candidate melody segments, at least two of the candidate melody segments into the melody. The candidate melody segments are selected on the basis of the creation intention 202 so that the selected one or more candidate melody segments can express the creation intention 202 individually and/or in combination. For example, if the creation intention 202 indicates that the emotion of the song to be generated is "happiness," a melody segment capable of expressing a happy emotion may be selected from the pre-divided melody segments as a candidate melody segment. If the creation intention 202 further indicates other aspects that affect the song creation, one or more melody segments may also be selected based on those aspects correspondingly.

[0037] In some implementations, the pre-divided melody segments may be classified or labeled, and the candidate melody segments are then determined based on a comparison of the classifications and labels with the creation intention 202. In some other implementations, a selection model may be predefined or trained to perform the selection of the candidate melody segments. The selection model may be trained in such a way that corresponding candidate melody segments may be selected according to the creation intention 202 as an input (for example, in the form of a key word(s)). Different training creation intentions and known melody segments matching these creation intentions may be used as training data to train the selection model. In addition, some negative samples (namely, some creation intentions and melody segments not matching these creation intentions) may also be used to train the model so that the model has a capability of distinguishing correct from incorrect results. The selection model may be stored partially or totally in the local storage device 130 of the computing device 102, and/or in an accessible external device such as the database 170.

[0038] As mentioned above, a smooth transition between the melody segments is important for the quality of the created song. Among the candidate melody segments, the template generating module 230 may determine the smoothness between every two candidate melody segments to decide whether those two candidate melody segments may be concatenated together. The smoothness between adjacent candidate melody segments may be determined using various technologies, examples of which include, but are not limited to, analyzing the pitch trajectories of the melody in the melody segments, measuring the consistency between corresponding pitch trajectories, and/or evaluating other aspects that may affect the perception of listeners.

[0039] In some implementations, the template generating module 230 may use a predefined smoothness determining model to determine whether two candidate melody segments have a smooth auditory transition. The smoothness determining model may be designed to output the smoothness based on various acoustic parameters of the input melody segments, such as the spectrum, frequency, loudness, duration, and the like. The output may be a smoothness metric in a certain range or an indication (with a value of 1 or 0) of whether the two input melody segments are smooth. Training data used for training such a smoothness determining model may include adjacent melody segments in existing songs (as positive samples) and melody segments randomly selected from various segments of existing songs (as negative samples). In some examples, such a model may, for example, be any of various neural network-based models (for example, DNN-based or long short-term memory (LSTM)-based models) or any other models capable of performing the smoothness determination. The template generating module 230 may input two candidate melody segments into the smoothness determining model, and determine, based on a comparison of the result output by the model with a predetermined threshold (or based on whether the result indicates a smooth result), whether the two candidate melody segments are smooth and can be concatenated together.
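The patent describes a trained DNN/LSTM smoothness model; as a hedged, hand-written stand-in, the following sketch scores only the pitch jump and slope change at the boundary of two pitch trajectories and compares the score against an assumed threshold.

```python
def smoothness(seg_a: list, seg_b: list) -> float:
    """Segments are pitch trajectories (Hz); 1.0 means perfectly smooth."""
    jump = abs(seg_a[-1] - seg_b[0])                    # pitch jump at boundary
    drift = abs((seg_a[-1] - seg_a[-2]) - (seg_b[1] - seg_b[0]))  # slope change
    return 1.0 / (1.0 + 0.01 * jump + 0.01 * drift)

SMOOTH_THRESHOLD = 0.5  # assumed threshold for the comparison described above

def can_concatenate(seg_a: list, seg_b: list) -> bool:
    return smoothness(seg_a, seg_b) >= SMOOTH_THRESHOLD

print(can_concatenate([440.0, 442.0], [443.0, 441.0]))  # True: small jump
```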

[0040] Alternatively, or in addition, a concatenating path of candidate melody segments (namely, a concatenated sequence of candidate melody segments) may be planned by the template generating module 230 through Viterbi searching. Thus, the template generating module 230 may determine, based on the smoothness and/or the result of the Viterbi searching, two or more candidate melody segments to be concatenated and their concatenating sequence. These concatenated candidate melody segments may form the melody indicated by the template 204.
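A minimal sketch of such a Viterbi-style search follows; it assumes a pairwise smoothness scorer like the one sketched above (here passed in as a simple boundary-distance lambda) and, for brevity, allows a segment to be reused along the path.

```python
def best_path(segments: list, n_slots: int, smooth) -> list:
    """Pick one segment per slot so cumulative boundary smoothness is maximal."""
    n = len(segments)
    score = [0.0] * n    # best score of a partial path ending at segment i
    back = []            # back-pointers, one list per transition
    for _ in range(n_slots - 1):
        new_score = [float("-inf")] * n
        pointers = [0] * n
        for j in range(n):
            for i in range(n):
                s = score[i] + smooth(segments[i], segments[j])
                if s > new_score[j]:
                    new_score[j], pointers[j] = s, i
        score = new_score
        back.append(pointers)
    # Backtrack from the best final segment.
    j = max(range(n), key=lambda k: score[k])
    path = [j]
    for pointers in reversed(back):
        j = pointers[j]
        path.append(j)
    return list(reversed(path))

segments = [[440.0, 442.0], [443.0, 441.0], [300.0, 500.0]]
boundary = lambda a, b: 1.0 / (1.0 + abs(a[-1] - b[0]))  # simple boundary score
print(best_path(segments, 3, boundary))
```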

[0041] Furthermore, in some implementations, the template generating module 230 may further determine, based on the generated melody, the distribution of lyrics indicated by the template 204. In some implementations, since the melody segments forming the melody are all obtained by dividing existing songs, the template generating module 230 may analyze the lyrics in the songs corresponding to the concatenated candidate melody segments, so as to determine the distribution of lyrics indicated by the template. It would be appreciated that the lyrics and melody segments in the existing songs may be considered as matching one another. Therefore, it is possible to easily derive the distribution of lyrics matching the concatenated candidate melody segments. In other implementations, the distribution of the lyrics relative to the melody may be determined based on the creation intention 202 and the melody that is formed. After determining the melody and the distribution of the lyrics relative to the melody, the template generating module 230 may obtain the corresponding template 204.

[0042] In some implementations, if the creation intention 202 includes an explicit indication of the user with respect to the melody and/or lyrics distribution, the template generating module 230 may also take these into account in generating the template so that the obtained template 204 can explicitly represent these creation intentions. To further improve the user experience, the template selected or generated based on the creation intention 202 may be first presented to the user as an intermediate template. The template generating module 230 then receives from the user a modification to the melody and/or the distribution of the lyrics indicated by the intermediate template, and obtains the final template 204 based on these modifications.

[0043] The template 204 determined by the template generating module 230 may be used to instruct the lyrics generating module 220 to generate the lyrics. Specifically, the lyrics generating module 220 is configured to generate the lyrics of the song based on the template 204. Since the template 204 indicates the distribution of the lyrics relative to the melody, the lyrics generating module 220 may generate lyrics matching that distribution. For example, the number of words in each section of the lyrics, the duration of each phoneme in a word, the pitch trajectory, and the power trajectory all match those indicated by the distribution, so that the generated lyrics and melody can form a song that can be sung. In addition, the lyrics generating module 220 may further obtain the creation intention 202 from the creation intention analyzing module 210, and generate the lyrics further based on the creation intention 202. The creation intention may guide the lyrics generated by the lyrics generating module 220 to represent the corresponding theme, emotion, and/or various key elements.

[0044] In some implementations, the lyrics generating module 220 may compare one or more pieces of existing lyrics with the distribution indicated in the template 204. The existing lyrics may include lyrics included in various existing songs, or text that can be sung, such as composed poems. If a certain piece of existing lyrics matches the distribution indicated in the template 204, those lyrics may be selected. In some cases, the lyrics generating module 220 may further divide one or more pieces of existing lyrics into a plurality of lyrics segments, and determine whether the respective lyrics segments match a portion of the distribution indicated in the template. The lyrics of the song are then formed by combining the matched lyrics segments. When the creation intention 202 is additionally taken into account, the lyrics generating module 220 may further select lyrics segments based on the creation intention 202 so that the selected lyrics segments represent one or more aspects of the creation intention 202 individually or in combination.
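For illustration, matching a piece of existing lyrics against the template's distribution might look like the following sketch, which checks only per-line word counts; a real implementation would also check phoneme durations and the other distribution parameters, and the corpus here is made up.

```python
def matches_distribution(lyrics_lines: list, words_per_line: list) -> bool:
    # A candidate matches if every line has the word count the template allots.
    return (len(lyrics_lines) == len(words_per_line) and
            all(len(line.split()) == n
                for line, n in zip(lyrics_lines, words_per_line)))

def pick_lyrics(corpus: list, words_per_line: list):
    for candidate in corpus:
        if matches_distribution(candidate, words_per_line):
            return candidate
    return None  # no existing lyrics fit; fall back to segment combination

corpus = [["home is where the heart sings", "we are together tonight"],
          ["short line", "another short line"]]
print(pick_lyrics(corpus, [6, 4]))  # the first candidate matches
```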

[0045] In some other implementations, the lyrics generating module 220 may use a predefined lyrics generation model to generate the lyrics. Such a lyrics generation model may be trained to generate different lyrics according to different song templates (for example, different distributions of lyrics). The lyrics generation model is used to obtain lyrics matching the distribution of lyrics indicated in the template 204. For example, the number of words in each section of the lyrics, the duration of each phoneme in a word, the pitch trajectory, and the power trajectory all match those indicated by the distribution, so that the generated lyrics and melody can form a song that can be sung.
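The patent builds this model on a recurrent neural network (see below); as a dependency-free stand-in for illustration only, the following bigram (Markov chain) sketch generates lines that respect the template's word counts. The training lines are made up.

```python
import random

def train_bigrams(lyrics: list) -> dict:
    # Record which word follows which across the training lyrics.
    model = {}
    for line in lyrics:
        words = line.split()
        for a, b in zip(words, words[1:]):
            model.setdefault(a, []).append(b)
    return model

def generate_line(model: dict, n_words: int) -> str:
    # Walk the bigram chain, falling back to a random word at dead ends.
    word = random.choice(list(model))
    out = [word]
    for _ in range(n_words - 1):
        word = random.choice(model.get(word) or list(model))
        out.append(word)
    return " ".join(out)

model = train_bigrams(["home is where the heart sings",
                       "the heart sings when we are together"])
print(generate_line(model, 6))  # a 6-word line, per the template's distribution
```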

[0046] Alternatively, or in addition, the lyrics generation model may further be trained to generate corresponding lyrics based on various different aspects of the input creation intention 202 so that the lyrics can represent one or more aspects of the creation intention, for example, comply with a corresponding song theme, express a song mood, and/or include some key elements. In some implementations, if the creation intention 202 obtained by the lyrics generating module 220 from the creation intention analyzing module 210 cannot cover all aspects of the creation intention required by the lyrics generation model due to the limited user input 104, the values of the other aspects may be left empty so that the lyrics generating module 220 may use the limited creation intention 202 (as well as the template 204 for the song) as input to the lyrics generation model to generate the lyrics. It would be appreciated that in some implementations, if the creation intention 202 includes an explicit indication of the user with respect to the lyrics, for example, some key elements or words to be included in the lyrics, the lyrics generating module 220 may also take the indication into consideration when generating the lyrics, so as to obtain lyrics that explicitly reflect the creation intention.

[0047] In some examples, the lyrics generation model may be built based on a neural network-based model such as a recurrent neural network (RNN)-based model, or any other learning models. The lyrics generation model may be trained using a plurality of pieces of existing lyrics. The existing lyrics may include lyrics included in various existing songs, or texts that can be sung, such as composed poems. For training, the existing lyrics may be classified into different themes, styles, and/or contents. The lyrics generation model is trained in such a way that corresponding lyrics may be generated when a specific template and/or creation intention is received. In this case, specific templates and creation intentions may be used as training data so that the lyrics generation model can learn, from the training data, a capability of generating lyrics for any specific template and/or creation intention. The trained lyrics generation model may be stored partially or totally in the local storage device 130 of the computing device 102, and/or in an accessible external device such as the database 170. It would be appreciated that the lyrics generation model may be obtained using various model structures and/or training methods that are currently known or to be developed in the future. The scope of the subject matter described herein is not limited in this regard.

[0048] After the lyrics are selected from the existing lyrics and/or generated by the lyrics generation model, in some implementations, the lyrics generating module 220 may directly provide the lyrics as the output 106. Alternatively, the user may be allowed to modify the automatically generated lyrics. The lyrics generating module 220 may first output the lyrics selected from the existing lyrics and/or generated by the lyrics generation model to the user as candidate lyrics, which may, for example, be displayed to the user by the output device 160 in the form of text and/or played to the user in the form of audio. The user may input a modification indication 206 for the candidate lyrics via the input device 150. Such a modification indication 206 may indicate an adjustment of one or more words in the candidate lyrics, for example, replacement of the words with other words or modification of the order of the words. Upon receiving the input modification indication 206 for the lyrics from the user, the lyrics generating module 220 modifies the candidate lyrics based on the input modification indication 206 to obtain the lyrics 106 of the song for output.

[0049] The lyrics 106 may be provided to the output device 160 of the computing device 102, and may be output to the user in the form of text and/or audio. In some implementations, the melody in the template 204 generated by the template generating module 230 may also be provided to the output device 160 as the output 106. For example, the melody 106 may be composed in the form of numbered musical notation and/or a five-line staff and output to the user.

[0050] The automatic generation of the melody and lyrics has been discussed above. In some alternative implementations, the lyrics may further be combined with the melody indicated by the template 204 to generate the song. Such a song may also be played to the user. Example implementations of automatic song synthesis will be discussed below in detail.

Song Synthesis

[0051] Fig. 4 illustrates a block diagram of the module 122 according to implementations of automatic song synthesis. In the example shown in Fig. 4, in addition to automatic lyrics generation, the module 122 may further be used to perform automatic song synthesis based on the lyrics and melody. As shown in Fig. 4, the module 122 further includes a song synthesizing module 410. The song synthesizing module 410 receives the lyrics from the lyrics generating module 220 and the melody indicated by the template from the template generating module 230, and then combines the received lyrics and melody to generate the song that can be sung.

[0052] It would be appreciated that the song synthesizing module 410 shown in Fig. 4 is optional. In some cases, the module 122 may provide only the separate lyrics and/or melody as shown in Fig. 2. In other cases, the song synthesizing module 410 combines the generated lyrics and melody into the song automatically or in response to a user input (for example, an instruction from the user to synthesize the song).

[0053] In some implementations, the song synthesizing module 410 may simply match the lyrics with the melody, and then output the song 106 to the user. For example, the melody is composed in the form of numbered musical notation or a five-line staff and displayed on the display device, with the lyrics displayed in association with the melody. The user may sing the song by reading the melody and lyrics.

[0054] In some other implementations, the song synthesizing module 410 may further determine a corresponding voice of a singer for the song so that the song 106 may be played directly. Specifically, the song synthesizing module 410 may obtain a voice model that is capable of representing a voice characteristic of the singer, and then use the lyrics as input to the voice model to generate a voice spectrum trajectory for the lyrics. In this way, the lyrics may be read in the voice of the singer indicated by the voice model. To make the singer's reading of the lyrics sound rhythmic, the song synthesizing module 410 further synthesizes the voice spectrum trajectory and the melody indicated by the template into a singing waveform, which represents a performance of the song matching the melody.

[0055] In some implementations, the song synthesizing module 410 may use a vocoder to synthesize the voice spectrum trajectory with the melody. The resulting singing waveform may be provided to the output device 160 (for example, a loudspeaker) of the computing device 102 to play the song. Alternatively, the singing waveform may be provided by the computing device 102 to other external devices to play the song.
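As a toy stand-in for this vocoder step, the following sketch renders only the template's per-note pitch and duration as a sine waveform with numpy; a real system would instead drive a vocoder with the voice model's spectrum trajectory, and the sample rate and amplitude here are assumptions.

```python
import numpy as np

SAMPLE_RATE = 16000  # assumed sample rate, in Hz

def render_melody(notes: list) -> np.ndarray:
    """notes: (pitch_hz, duration_s) pairs -> mono waveform in [-1, 1]."""
    chunks = []
    for pitch, dur in notes:
        t = np.arange(int(dur * SAMPLE_RATE)) / SAMPLE_RATE
        chunks.append(0.5 * np.sin(2 * np.pi * pitch * t))  # pure tone per note
    return np.concatenate(chunks)

waveform = render_melody([(440.0, 0.4), (494.0, 0.4), (523.0, 0.8)])
print(waveform.shape)  # (25600,) samples at 16 kHz, ready for playback
```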

[0056] The voice model used by the song synthesizing module 410 to generate the voice spectrum trajectory of the lyrics may be a predefined voice model, which may be trained using a number of voice segments so that the corresponding voice spectrum trajectory can be generated based on input words or lyrics. The voice model may be constructed based on, for example, a hidden Markov model (HMM) or various neural network-based models (e.g., a DNN-based or long short-term memory (LSTM)-based model). In some implementations, the voice model may be trained using a plurality of voice segments of a certain singer. In some other implementations, the voice model may be trained using a plurality of voice segments of different singers so that the voice model can represent the average speech features of these singers. Such a voice model may also be referred to as an average voice model. The predefined voice model may be stored partially or totally in the local storage device 130 of the computing device 102, and/or in an accessible external device such as the database 170.

[0057] In some cases, the user might expect the song to be sung with a personalized voice. Therefore, in some implementations, the song synthesizing module 410 may receive one or more voice segments 402 of a specific singer input by the user, and train the voice model based on those voice segments. Usually, the user-input voice segments might be limited and insufficient to train a voice model that works well. Hence, the song synthesizing module 410 may use the received voice segments 402 to adjust the predefined average voice model so that the adjusted average voice model can represent a voice characteristic of the singer in the voice segments 402. Of course, in other implementations, it is also possible to require the user to input sufficient voice segments of one or more specific singers so that a corresponding voice model can be trained for the voice of the singer(s).
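A hedged sketch of such adaptation follows, reducing the "model" to a vector of mean voice statistics and adapting it by simple interpolation; real systems adapt HMM or neural network parameters, and the feature choice and weight here are assumptions.

```python
import numpy as np

def adapt_average_model(average_features: np.ndarray,
                        user_features: np.ndarray,
                        weight: float = 0.7) -> np.ndarray:
    # Interpolate the average model's statistics toward the user's voice.
    return (1.0 - weight) * average_features + weight * user_features

average = np.array([200.0, 0.3])  # e.g., mean pitch (Hz), mean spectral tilt
user = np.array([260.0, 0.5])     # statistics from the user's voice segment
print(adapt_average_model(average, user))  # [242.    0.44]
```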

Example Process

[0058] Fig. 5 illustrates a flowchart of a process 500 of automatic song generation in accordance with some implementations of the subject matter described herein. The process 500 may be implemented by the computing device 102, for example, in the module 122 of the computing device 102.

[0059] At 510, in response to receiving an input from a user, the computing device 102 determines, based on the input, a creation intention of the user with respect to a song to be generated. At 520, the computing device 102 generates a template for the song based on the creation intention. The template indicates a melody of the song and a distribution of lyrics relative to the melody. At 530, the computing device 102 generates the lyrics of the song based at least in part on the template. Furthermore, in some implementations, the computing device 102 may generate the lyrics further based on the creation intention.
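For illustration, a minimal end-to-end stub of the 510/520/530 flow follows; every helper is a placeholder standing in for the modules described above, and all names and values are assumptions made for this sketch.

```python
def determine_intention(user_input: str) -> set:               # step 510
    return {"family", "happy"} if "family" in user_input else {"generic"}

def generate_template(intention: set) -> dict:                 # step 520
    return {"melody": [(440.0, 0.5), (494.0, 0.5)],            # (pitch, dur)
            "words_per_line": [6, 6]}                          # distribution

def generate_lyrics(template: dict, intention: set) -> list:   # step 530
    # Placeholder lyrics that respect the template's word counts.
    return [" ".join(["la"] * n) for n in template["words_per_line"]]

def generate_song(user_input: str):
    intention = determine_intention(user_input)
    template = generate_template(intention)
    return template, generate_lyrics(template, intention)

print(generate_song("a happy family picnic"))
```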

[0060] In some implementations, the process 500 may further include combining the lyrics and the melody indicated by the template to generate the song.

[0061] In some implementations, the process 500 may further include obtaining a voice model representing a voice characteristic of a singer; generating a voice spectrum trajectory for the lyrics using the voice model; synthesizing the voice spectrum trajectory and the melody indicated by the template into a singing waveform of the song; and playing the song based on the singing waveform.

[0062] In some implementations, obtaining the voice model includes receiving a voice segment of the singer; and obtaining the voice model by adjusting a predefined average voice model with the received voice segment, the average voice model being obtained with voice segments of a plurality of different singers.

[0063] In some implementations, generating the template based on the creation intention includes selecting, based on the creation intention, the template from a plurality of candidate templates.

[0064] In some implementations, generating the template based on the creation intention includes: dividing at least one existing song melody into a plurality of melody segments; selecting, based on the creation intention, a plurality of candidate melody segments from the plurality of melody segments; concatenating, based on smoothness among the plurality of candidate melody segments, at least two of the plurality of candidate melody segments to form the melody indicated by the template; and determining the distribution of the lyrics relative to the melody indicated by the template by analyzing lyrics in songs corresponding to the concatenated at least two candidate melody segments.

[0065] In some implementations, generating the lyrics includes: generating candidate lyrics based at least in part on the template; and modifying the candidate lyrics based on a further input received from the user to obtain the lyrics.

[0066] In some implementations, generating the lyrics includes: obtaining a predefined lyrics generation model, the lyrics generation model being obtained with a plurality of pieces of existing lyrics; and generating the lyrics based on the template using the lyrics generation model.

[0067] In some implementations, the input of the user includes at least one of an image, a word, a video, or an audio.

Example Implementations

[0068] Some example implementations of the subject matter described herein are listed below.

[0069] In one aspect, the subject matter described herein provides a computer- implemented method, comprising: in response to receiving an input from a user, determining, based on the input, a creation intention of the user with respect to a song to be generated; generating a template for the song based on the creation intention, the template indicating a melody of the song and a distribution of lyrics relative to the melody; and generating the lyrics of the song based at least in part on the template.

[0070] In some implementations, generating the lyrics further comprises: generating the lyrics further based on the creation intention.

[0071] In some implementations, the method further comprises combining the lyrics and the melody indicated by the template to generate the song.

[0072] In some implementations, the method further comprises obtaining a voice model representing a voice characteristic of a singer; generating a voice spectrum trajectory for the lyrics using the voice model; synthesizing the voice spectrum trajectory and the melody indicated by the template into a singing waveform of the song; and playing the song based on the singing waveform.

[0073] In some implementations, obtaining the voice model comprises receiving a voice segment of the singer; and obtaining the voice model by adjusting a predefined average voice model with the received voice segment, the average voice model being obtained with voice segments of a plurality of different singers.

[0074] In some implementations, generating the template based on the creation intention comprises selecting, based on the creation intention, the template from a plurality of candidate templates.

[0075] In some implementations, generating the template based on the creation intention comprises: dividing at least one existing song melody into a plurality of melody segments; selecting, based on the creation intention, a plurality of candidate melody segments from the plurality of melody segments; concatenating, based on smoothness among the plurality of candidate melody segments, at least two of the plurality of candidate melody segments to form the melody indicated by the template; and determining the distribution of the lyrics relative to the melody indicated by the template by analyzing lyrics in songs corresponding to the concatenated at least two candidate melody segments.

[0076] In some implementations, generating the lyrics comprises: generating candidate lyrics based at least in part on the template; and modifying the candidate lyrics based on a further input received from the user to obtain the lyrics.

[0077] In some implementations, generating the lyrics comprises: obtaining a predefined lyrics generation model, the lyrics generation model being obtained with a plurality of pieces of existing lyrics; and generating the lyrics based on the template using the lyrics generation model.

[0078] In some implementations, the input includes at least one of an image, a word, a video, or an audio.

[0079] In another aspect, the subject matter described herein provides a device. The device comprises: a processing unit; and a memory coupled to the processing unit and including instructions stored thereon which, when executed by the processing unit, cause the device to perform acts including: in response to receiving an input from a user, determining, based on the input, a creation intention of the user with respect to a song to be generated; generating a template for the song based on the creation intention, the template indicating a melody of the song and a distribution of lyrics relative to the melody; and generating the lyrics of the song based at least in part on the template.

[0080] In some implementations, generating the lyrics further comprises: generating the lyrics further based on the creation intention.

[0081] In some implementations, the acts further include combining the lyrics and the melody indicated by the template to generate the song.

[0082] In some implementations, the acts further include obtaining a voice model representing a voice characteristic of a singer; generating a voice spectrum trajectory for the lyrics using the voice model; synthesizing the voice spectrum trajectory and the melody indicated by the template into a singing waveform of the song; and playing the song based on the singing waveform.

[0083] In some implementations, obtaining the voice model comprises receiving a voice segment of the singer; and obtaining the voice model by adjusting a predefined average voice model with the received voice segment, the average voice model being obtained with voice segments of a plurality of different singers.

[0084] In some implementations, generating the template based on the creation intention comprises selecting, based on the creation intention, the template from a plurality of candidate templates.

[0085] In some implementations, generating the template based on the creation intention comprises: dividing at least one existing song melody into a plurality of melody segments; selecting, based on the creation intention, a plurality of candidate melody segments from the plurality of melody segments; concatenating, based on smoothness among the plurality of candidate melody segments, at least two of the plurality of candidate melody segments to form the melody indicated by the template; and determining the distribution of the lyrics relative to the melody indicated by the template by analyzing lyrics in songs corresponding to the concatenated at least two candidate melody segments.

[0086] In some implementations, generating the lyrics comprises: generating candidate lyrics based at least in part on the template; and modifying the candidate lyrics based on a further input received from the user to obtain the lyrics.

[0087] In some implementations, generating the lyrics comprises: obtaining a predefined lyrics generation model, the lyrics generation model being obtained with a plurality of pieces of existing lyrics; and generating the lyrics based on the template using the lyrics generation model.

[0088] In some implementations, the input includes at least one of an image, a word, a video, or an audio.

[0089] In a further aspect, the subject matter described herein provides a computer program product. The computer program product is tangibly stored on a non-transitory computer-readable medium and comprises machine-executable instructions which, when executed by a device, cause the device to: in response to receiving an input from a user, determine, based on the input, a creation intention of the user with respect to a song to be generated; generate a template for the song based on the creation intention, the template indicating a melody of the song and a distribution of lyrics relative to the melody; and generate the lyrics of the song based at least in part on the template.

[0090] In some implementations, the machine-executable instructions, when executed by a device, further cause the device to generate the lyrics further based on the creation intention.

[0091] In some implementations, the machine-executable instructions, when executed by a device, further cause the device to combine the lyrics and the melody indicated by the template to generate the song.

[0092] In some implementations, the machine-executable instructions, when executed by a device, further cause the device to obtain a voice model representing a voice characteristic of a singer; generate a voice spectrum trajectory for the lyrics using the voice model; synthesize the voice spectrum trajectory and the melody indicated by the template into a singing waveform of the song; and play the song based on the singing waveform.

[0093] In some implementations, the machine-executable instructions, when executed by a device, cause the device to receive a voice segment of the singer; and obtain the voice model by adjusting a predefined average voice model with the received voice segment, the average voice model being obtained with voice segments of a plurality of different singers.

[0094] In some implementations, the machine-executable instructions, when executed by a device, cause the device to select, based on the creation intention, the template from a plurality of candidate templates.

[0095] In some implementations, the machine-executable instructions, when executed by a device, cause the device to divide at least one existing song melody into a plurality of melody segments; select, based on the creation intention, a plurality of candidate melody segments from the plurality of melody segments; concatenate, based on smoothness among the plurality of candidate melody segments, at least two of the plurality of candidate melody segments to form the melody indicated by the template; and determine the distribution of the lyrics relative to the melody indicated by the template by analyzing lyrics in songs corresponding to the concatenated at least two candidate melody segments.

[0096] In some implementations, the machine-executable instructions, when executed by a device, cause the device to generate candidate lyrics based at least in part on the template; and modify the candidate lyrics based on a further input received from the user to obtain the lyrics.

[0097] In some implementations, the machine-executable instructions, when executed by a device, cause the device to obtain a predefined lyrics generation model, the lyrics generation model being obtained with a plurality of pieces of existing lyrics; and generate the lyrics based on the template using the lyrics generation model.

[0098] In some implementations, the input includes at least one of an image, a word, a video, or an audio.

[0099] The functionality described herein can be performed, at least in part, by one or more hardware logic components. For example, and without limitation, illustrative types of hardware logic components that can be used include Field-Programmable Gate Arrays (FPGAs), Application-Specific Integrated Circuits (ASICs), Application-Specific Standard Products (ASSPs), System-on-a-Chip systems (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.

[00100] Program code for carrying out methods of the subject matter described herein may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general-purpose computer, special-purpose computer, or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code may execute entirely on a machine, partly on the machine as a stand-alone software package, partly on the machine and partly on a remote machine, or entirely on the remote machine or server.

[00101] In the context of this disclosure, a machine-readable medium may be any tangible medium that may contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of the machine-readable storage medium would include an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

[00102] Further, while operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are contained in the above discussions, these should not be construed as limitations on the scope of the subject matter described herein, but rather as descriptions of features that may be specific to particular implementations. Certain features that are described in the context of separate implementations may also be implemented in combination in a single implementation. Conversely, various features that are described in the context of a single implementation may also be implemented in multiple implementations separately or in any suitable subcombination.

[00103] Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter specified in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.