

Title:
METHOD FOR GENERATING ARTIFICIAL SOUND EFFECTS BASED ON EXISTING SOUND CLIPS
Document Type and Number:
WIPO Patent Application WO/2018/077364
Kind Code:
A1
Abstract:
The invention provides a method for generating a sound effect audio clip based on a mix of audible characteristics of two existing audio clips. The method comprises selecting first and second audio clips, and mapping evolution of time of a plurality of predetermined audible characteristics of the first audio clip to arrive at first mapping data accordingly. The second audio clip is then modified based on the first mapping data, so as to at least partially apply evolution of time of audible characteristics from the first audio clip to the second audio clip, and the sound effect audio clip is output in response to the modified second audio clip. Preferred audible characteristics are amplitude, pitch, and spectral envelope (e.g. formant), which are each represented in mapping data as values representing the audible characteristics for the duration of the first audio clip at a given time resolution, where each value represents a value or a set of values representing the result of an analysis over a predetermined time window. Especially, the second audio clip may also be mapped with respect to evolution of time of corresponding audible characteristics, and the modification of the second audio clip can then be performed in response to a mix of the two mapping data sets, e.g. by a frame-by-frame processing. A time alignment of the first and second audio clips may be performed, so that the two audio clips have the same duration prior to being processed.

Inventors:
KJÆR LARS-BO (DK)
Application Number:
PCT/DK2017/050351
Publication Date:
May 03, 2018
Filing Date:
October 27, 2017
Assignee:
TRANSF APS (DK)
International Classes:
G11B27/031; G10H1/36; G10L21/04; G10L21/055; G10L25/03
Foreign References:
US20060165240A12006-07-27
US20140229831A12014-08-14
US6266003B12001-07-24
Other References:
BONADA J ET AL: "Spectral Approach to the Modeling of the Singing Voice", AUDIO ENGINEERING SOCIETY CONVENTION PAPER, NEW YORK, NY, US, 21 September 2001 (2001-09-21), pages 1 - 10, XP002375996
Attorney, Agent or Firm:
PLOUGMANN VINGTOFT A/S (DK)
Claims:
CLAIMS

1. A method for generating a sound effect audio clip, the method comprising

- selecting first and second audio clips,

- mapping evolution of time of a plurality of predetermined audible characteristics of the first audio clip to arrive at first mapping data accordingly,

- modifying the second audio clip based on the first mapping data, so as to at least partially apply evolution of time of audible characteristics from the first audio clip to the second audio clip, and

- outputting the sound effect audio clip in response to the modified second audio clip.

2. Method according to claim 1, comprising mapping evolution of time of a plurality of predetermined audible characteristics of the second audio clip to arrive at second mapping data accordingly, such as mapping evolution of time of identical predetermined audible characteristics for the first and second audio clips.

3. Method according to claim 1 or 2, wherein said plurality of predetermined audible characteristics comprise characteristics descriptive of at least one of: amplitude, pitch, and spectral envelope, such as formant.

4. Method according to any of the preceding claims, wherein said plurality of predetermined audible characteristics comprise at least two audible characteristics descriptive of respective two of: amplitude, pitch, and spectral envelope, such as formant.

5. Method according to any of the preceding claims, wherein said plurality of predetermined audible characteristics comprise at least three audible characteristics descriptive of respective of: amplitude, pitch, and spectral envelope, such as formant.

6. Method according to claim 5, further comprising a fourth audible characteristics descriptive of a feature different from amplitude, pitch, and formant.

7. Method according to any of the preceding claims, wherein modifying the second audio clip comprises modifying the second audio clip in accordance with the first mapping data, so as to apply evolution of time of audible characteristics from the first audio clip to the second audio clip with respect to at least one audible characteristics.

8. Method according to any of the preceding claims, wherein said mapping comprises analysing the first audio clip with respect to the plurality of predetermined audible characteristics over time using a predetermined time window.

9. Method according to any of claims 1-7, comprising user editing of the first mapping data prior to modifying the second audio clip.

10. Method according to claim 9, wherein said user editing comprises transposing and/or scaling at least one of the evolution of time of the predetermined plurality of audible characteristics up or down.

11. Method according to claim 2, wherein the step of modifying the second audio clip is performed in response to a combination of the first and second mapping data.

12. Method according to claim 2, wherein the step of modifying the second audio clip is performed in response to a mixing or averaging of data values of the first and second mapping data.

13. Method according to claim 12, wherein said mixing or averaging is performed in response to a user input, such as based on displaying the first and second mapping data versus time to the user.

14. Method according to any of the preceding claims, wherein the step of modifying the second audio clip is performed by a frame-by-frame processing of the second audio clip.

15. Method according to any of the preceding claims, comprising performing a time stretching or time compression on the second audio clip, if durations of the first and second audio clips differ by more than a preset value, so as to match its duration to the first duration.

16. Method according to claim 15, wherein said time stretching or time compression comprises performing a frame remapping of the second audio clip.

17. Method according to any of the preceding claims, comprising repeating step 1) for a plurality of different first audio clips, and storing in a database data representative of the first mapping data or modified version of the first mapping data obtained for the respective first audio clips, along with data representative of the first duration.

18. Method according to claim 17, comprising a user selecting one of the first mapping data from the database prior to modifying the second audio clip.

19. Method according to claim 18, comprising displaying a visual representation of one or more of the plurality of first mapping data by means of a graphical representation of at least one of said plurality of audible characteristics versus time, so as to allow the user to select one of the first mapping data.

20. Method according to any of the preceding claims, comprising allowing the user to select the second audio clip from among a plurality of prestored audio clips.

21. Method according to any of the preceding claims, comprising playing the sound effect audio clip to the user time aligned with displaying a prestored video sequence.

22. Method according to any of the preceding claims, comprising storing information to allow synchronisation of the modifying of the second audio clip in relation to a time grid of a video sequence.

23. Method according to any of the preceding claims, wherein the first and second audio clips are received in a digital format.

24. Method according to any of the preceding claims, comprising the step of automatically generating one or more random parameters to be used in the generation of the modified second audio clip, such as in response to an input from a user.

25. Method according to claim 24, comprising the step of generating a number of random sets of parameters to be used in the generation of a number of different modified second audio clips, such as in response to the number being input from the user.

26. Method according to any of the preceding claims, comprising the step of synchronizing a tempo between the first and second audio clips in response to an input from a user.

27. An apparatus, such as a digital audio workstation, comprising a processor and a memory and being configured for carrying out the method according to any one of the preceding claims, such as comprising a plurality of processors arranged for parallel processing.

28. Computer program product having instructions which when executed cause a computing device or a computing system, such as the apparatus according to claim 27, to perform the method according to any one of claims 1-26.

29. Computer program product according to claim 28, being one of: an audio application, a digital audio workstation plug-in, and a stand-alone software product for a general computer.

30. Computer program product according to claim 28 or 29, designed for movie, tv, video or online audio production.

31. Computer program product according to claim 28 or 29, designed for musical composition or production, such as designed as a DJ tool.

32. Computer program product according to claim 28 or 29, designed for gaming audio production.

33. Computer program product according to claim 28 or 29, designed for Virtual Reality or Augmented Reality audio production.

34. Use of the method according to any of claims 1-26 for movie, tv, video or online audio production.

35. Use of the method according to any of claims 1-26 for musical composition or production, such as used by a DJ.

36. Use of the method according to any of claims 1-26 for gaming audio production.

37. Use of the method according to any of claims 1-26 for Virtual Reality or Augmented Reality audio production.

38. A computer readable medium having stored thereon a computer program product according to any of claims 28-33.

Description:
METHOD FOR GENERATING ARTIFICIAL SOUND EFFECTS BASED ON EXISTING SOUND CLIPS

FIELD OF THE INVENTION

The present invention relates to the field of signal processing, especially audio signal processing, and more specifically to a method and software for generating new sound effects, such as artificial sounds, e.g. sound effects for movies, based on existing audio clips. More specifically, the invention relates to modifying or morphing audio signals with the purpose of creating new sound clips based on two or more sound clips, e.g. from a large database, in an easy way.

BACKGROUND OF THE INVENTION

For sound engineers, sound designers or producers, a lot of different audio signal processing tools or Digital Audio Workstation (DAW) plugins and applications are available for modifying, tweaking or "sweetening" sound and sound effects in movies or cartoons, elements in music productions or game audio. There is an increasing demand for manipulation of audio taken from e.g. databases. A wide range of tools exist which each can apply audio signal processing to modify audio, e.g. by applying various manipulations in the frequency, amplitude, formant or time domain. Most often, if a change requires adjustments to frequency, amplitude, formant or time to obtain the desired final sound, a number of different signal processing steps with different software are required. Since each signal processing step results in its own audible artefacts, subsequent processing steps may produce unwanted results due to artefacts created by a preceding processing step, which then needs to be redone in a different way, or preceding steps may be discarded. This results in a very time consuming and non-creative process in providing the desired sound.

Ways of analysing, manipulating, synthesizing and generating sound are well known, e.g. from Waves Audio, Serato Pitch'n'Time, or US 6,266,003.

However, in spite of a large number of existing tools, it may be challenging to generate artificial sounds which cannot be generated directly from a sound recording, but need to be generated by significant modification of e.g. a recording of the sound produced by a natural sound source. Furthermore, if the sound is intended to match the timing of a sequence in a movie or cartoon, the task is even harder. E.g. to provide the artificial sound of an accelerating supernatural motorbike in a science fiction cartoon, the audio engineer is faced with a considerable amount of work with several DAW plugins to achieve the result which has the desired audible characteristics and matches the timing of the sequence in the pre-produced visual part of the cartoon.

SUMMARY OF THE INVENTION

Thus, according to the above description, it is an object of the present invention to provide a method, and software including a graphical user interface, to allow a user to easily, creatively and intuitively manipulate an audio clip in real time into a modified audio clip with desired behavioural characteristics of a parent section (or master) audio clip.

In a first aspect, the invention provides a method for generating a sound effect audio clip, the method comprising

- selecting first and second audio clips,

- mapping evolution of time of a plurality of predetermined audible characteristics of the first audio clip to arrive at first mapping data accordingly,

- modifying the second audio clip based on the first mapping data, so as to at least partially apply evolution of time of audible characteristics from the first audio clip to the second audio clip, and

- outputting the sound effect audio clip in response to the modified second audio clip.

The 'first audio clip' referred to can be understood as a parent section audio clip or a master audio clip, which it will be called in the following. In the same way, the 'second audio clip' referred to can be understood as a child section audio clip or a slave audio clip, which it will be called in the following.

The method according to the invention enables anyone from the non-skilled to the top professional user to rapidly and interactively explore a multi-dimensional space within the field of sound effect creation in real time in one single tool.

Hence, intuition and the creative focus are maintained by getting unique results much quicker than previously possible. It is possible to generate an endless amount of variations of sounds rapidly, by manipulating various parameters in conjunction.

The method allows listening to the result of combining a short audio clip, e.g. of a car motor sound, with a number of different other sound clips from a database, and arriving at a desired sound. The resulting mapping data can then be stored for later use, e.g. by applying the same mapping to a longer clip of the car sound, e.g. matching a scene of a movie, and then applying the stored mapping to the longer clip, thereby saving considerable time compared to fully manually creating the desired audible characteristics during the long audio clip. Especially, the method allows a user to take a master audio clip, which is then analyzed with respect to predetermined audible characteristics (also called fingerprint or sound texture). This fingerprint can then be dynamically applied, e.g. in a user modified form, to a slave audio clip. As a result, the modified audio clip has the basic features of the slave audio clip, but with the behavioural characteristics of the master audio clip. This provides a valuable tool for a user in creating new sounds in a new and intuitively simpler way, and the method provides an easy way of copying audible features from one audio clip and applying them to another audio clip in an intuitive way without the need to manually sequentially apply a plurality of different signal processing tools. Especially, if the mapping includes all of the evolution of time of amplitude, pitch and spectral envelope in the form of formant, a good mapping of the audible characteristics or sound texture of an audio clip is obtained.

It is to be understood that mapping the evolution of time of an audible characteristic is to be performed by analysing the characteristic with respect to a predetermined analysis algorithm using a suitable analysis time window length for the given audible characteristic. Mapping data can then be stored as values (or sets of values) representing a result of an analysis of the master audio clip at a given time. E.g. with respect to pitch, it is to be understood that this may be analyzed in a simple way in case of monophonic audio clips, while more advanced analysis algorithms may be applied in case of non-monophonic audio clips. With respect to spectral envelope, preferably formant, the mapping data as a function of time may be in the form of predetermined filter coefficients of a predetermined spectral filter representation, e.g. a second order IIR filter or taps of a FIR filter etc., or as values representing amplitudes and bandwidths at a predetermined set of formant frequencies.
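As an illustration only (not part of the claimed method), the following is a minimal Python/numpy sketch of how per-frame mapping data for amplitude, pitch and formant envelope could be extracted; frame length, hop size, the RMS amplitude measure, the simple spectral-peak pitch estimate and the cepstral envelope are all assumptions for the sketch:

    # Minimal sketch of per-frame mapping data ("fingerprint") extraction.
    # Assumptions (not from the patent): numpy, a Hanning window, RMS amplitude,
    # a crude monophonic spectral-peak pitch, and a cepstrally smoothed envelope.
    import numpy as np

    def analyse_clip(samples, sample_rate, frame_size=2048, hop=512, n_cepstral=16):
        """Return mapping data: one value (or vector) per analysis frame."""
        window = np.hanning(frame_size)
        amplitudes, pitches, envelopes = [], [], []
        for start in range(0, len(samples) - frame_size, hop):
            frame = samples[start:start + frame_size] * window
            amplitudes.append(np.sqrt(np.mean(frame ** 2)))        # amplitude vs. time
            spectrum = np.abs(np.fft.rfft(frame))
            peak_bin = np.argmax(spectrum[1:]) + 1                  # crude monophonic pitch
            pitches.append(peak_bin * sample_rate / frame_size)
            # cepstrally smoothed spectral envelope as a formant representation
            cepstrum = np.fft.irfft(np.log(spectrum + 1e-12))
            cepstrum[n_cepstral:-n_cepstral] = 0.0
            envelopes.append(np.exp(np.fft.rfft(cepstrum).real))
        return {"amplitude": np.array(amplitudes),
                "pitch": np.array(pitches),
                "formant": np.array(envelopes)}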

For modifying according to the mapped audible characteristics, the skilled person knows which processing techniques to apply so as to arrive at a sound effect audio clip with the desired characteristics corresponding to the mapping data.

Especially, the method can preferably be implemented as a software product which may be further implemented with features allowing the sound effect audio clips generated to be in time synchronisation with sequences of a video production.

It may be preferred to utilize the method according to the invention together with a database or library of mapping data (fingerprints or sound textures), e.g. with pitch and formant values stored as a representation relative to C4 (=261.6 Hz), to allow transposing to other values, e.g. to match a slave audio clip which is very different with respect to pitch, e.g. to match the sound of a large truck and a high-pitched turbine.
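A hedged sketch of one way such a C4-relative representation could work (an illustration, not the stored format defined by the patent): pitch curves are converted to semitone offsets relative to C4 = 261.6 Hz, so a stored fingerprint can later be transposed to match a slave clip in a very different register.

    # Sketch (assumption): pitch curves stored as semitone offsets relative to C4.
    import numpy as np

    C4_HZ = 261.6

    def hz_to_semitones_rel_c4(pitch_hz):
        return 12.0 * np.log2(np.asarray(pitch_hz, dtype=float) / C4_HZ)

    def transpose(semitone_curve, semitones):
        """Shift a stored pitch curve up or down before applying it to a slave clip."""
        return np.asarray(semitone_curve, dtype=float) + semitones

    def semitones_rel_c4_to_hz(semitone_curve):
        return C4_HZ * 2.0 ** (np.asarray(semitone_curve, dtype=float) / 12.0)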

It is to be understood that the method can be used in addition to existing spectral and dynamic filtering techniques for modifying or tweaking sound.

By 'duration' of an audio clip is understood the exact length of an audio clip, i.e. its temporal length when played back at the intended sample rate. In the following, preferred features and/or embodiments of the method will be defined.

By 'audio clip' is understood a piece of audio in an analog or digital representation, having a certain duration in time, when played back as intended. E.g. the audio clip can be a recorded acoustic signal or a synthetically generated audio signal. E.g. an audio clip may preferably have a duration of such as 1-10 seconds, however it may be shorter or it may be longer. An audio clip preferably comprises a representation of a plurality of frames each comprising a plurality of samples.

By 'amplitude' versus time is understood the magnitude of an audio clip analyzed in small time segments.

By 'pitch' versus time is understood a representation of base frequency of an audio clip analyzed in small time segments.

By 'formant' is understood a prominent peak in the spectral envelope and/or a resonance in sound sources, notably the human voice or musical instruments. In a preferred implementation, the formants may be a spectral envelope applied as a spectral weighting.

It is known in the art how to apply signal processing algorithms so as to analyse an audio clip with respect to all of these audible characteristics. It is preferred that the audible characteristics comprise at least two audible characteristics descriptive of respective two of: amplitude, pitch, and formant. More preferably, the audible characteristics comprise at least three audible descriptors versus time, descriptive of each of: amplitude, pitch, and formant. Even further, a fourth audible characteristic descriptive of a feature different from amplitude, pitch, and formant may be comprised in the first mapping data.

Generating mapping data for use in modifying the second audio clip may comprise setting the mapping data equal to the first mapping data, i.e. in such an embodiment, the predetermined audible characteristics of the first audio clip, the master audio clip, are merely copied and applied to the second audio clip, the slave audio clip. Alternative to this, generating modified mapping data may comprise modifying at least one of the audible characteristics of the first mapping data. This allows the user to influence the final modified sound effect audio clip in a new and creative way. More alternatively, or additionally, generating the modified mapping data may comprise transposing at least one audible descriptor versus time of the first mapping data up or down, e.g. to match two very different audio clips. The step of modifying the second audio clip may comprise modifying the second audio clip in accordance with the first mapping data, so as to apply evolution of time of audible characteristics from the first audio clip to the second audio clip with respect to at least one audible characteristic. The mapping preferably comprises analyzing the first audio clip with respect to the plurality of predetermined audible characteristics over time using a predetermined time window.

The user may edit the first mapping data prior to the step of modifying the second audio clip. Especially, the user editing may comprise transposing and/or scaling at least one of the evolution of time of the predetermined plurality of audible characteristics up or down.

The method may comprise the possibility of the user choosing a specific frequency spectrum range for tracking, also called a 'frequency eye'. Hereby is understood a specific frequency spectrum range of the analysed first (master) audio clip, which the selected second (slave) audio clip will track from. An example could be that the 'frequency eye' is set to 200 - 500 Hz. This means that the second (slave) audio clip will only track the portion of the first (master) audio clip in the 200 - 500 Hz range.
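A minimal sketch of how a 'frequency eye' might be realised (an interpretation, assuming the analysis works on magnitude spectra): only the bins inside the selected band of the master's spectrum contribute to the value that the slave tracks.

    # Hypothetical sketch of a 'frequency eye': only the 200-500 Hz band of the
    # master's magnitude spectrum is used when deriving the value the slave tracks.
    import numpy as np

    def frequency_eye_energy(magnitude_spectrum, sample_rate, fft_size,
                             low_hz=200.0, high_hz=500.0):
        freqs = np.fft.rfftfreq(fft_size, d=1.0 / sample_rate)
        band = (freqs >= low_hz) & (freqs <= high_hz)
        return float(np.sum(magnitude_spectrum[band] ** 2))  # energy in the selected range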

The method may comprise mapping evolution of time of a plurality of predetermined audible characteristics of the second audio clip to arrive at second mapping data accordingly. Especially, the step of modifying the second audio clip may be performed in response to a combination of the first and second mapping data. Especially, the step of modifying the second audio clip may be performed in response to a mixing or averaging of data values of the first and second mapping data. Specifically, said mixing or averaging may be performed in response to a user input, such as based on displaying the first and second mapping data versus time to the user. Generating the modified mapping data may comprise comparing values of time evolution of at least one audible characteristic of the first and second mapping data, i.e. in embodiments where second mapping data is generated for the second audio clip, the slave audio clip, as well. Especially, generating the modified mapping data may comprise mixing or averaging values of at least one audible characteristic of the first and second mapping data, so as to arrive at at least one modified audible characteristic versus time for the modified mapping data. This mixing between the two mapping data sets may be performed in response to a user input, such as based on displaying descriptors of the first and second audible characteristics versus time to the user.

The sound effect audio clip may be generated by a frame-by-frame application of the modified first mapping data to the second audio clip, wherein each frame comprises a plurality of consecutive time samples of audio. However, it is to be understood that other methods for modifying the second audio clip, so that it matches the preferred time evolution of preferred audible characteristics, may be used, as known in the art. The method may comprise performing a time alignment of the first and second audio clips, e.g. in the form of performing a time stretching or time compression on the second audio clip, if the durations of the first and second audio clips differ by more than a preset value, so as to match their durations. Especially, said time alignment, e.g. time stretching or time compression, may comprise performing a sample or frame remapping of the second audio clip. Said time alignment may be performed in response to a user input, or it may be performed automatically in response to identifying that the first and second durations are different, e.g. different by more than a preset value.

The method may comprise repeating the mapping step for a plurality of different first audio clips, and storing in a database data representative of a plurality of first mapping data obtained for the respective first audio clips, e.g. along with data representative of the duration of the first audio clips. In this way a valuable library of audible mapping data or sound textures can be built, which allows the user to efficiently select among different sound textures in the work with creating new sounds. The method may comprise a user selecting one of the first mapping data from the database prior to performing subsequent steps of generating the sound effect audio clip. More specifically, the method may comprise displaying a visual representation of one or more of the plurality of audible characteristics by means of a graphical representation of the at least one parameter or descriptor of the characteristics versus time, so as to allow the user to select one of the first mapping data to be applied.
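As an illustration of the frame remapping mentioned above (a sketch of one possible implementation, not the patent's code): a per-frame curve of the slave clip is stretched or compressed onto the master's frame count by interpolation, so the two mapping data sets can be combined frame-by-frame.

    # Sketch of duration matching by frame remapping (assumed implementation).
    import numpy as np

    def remap_frames(slave_curve, n_master_frames):
        """Stretch or compress a per-frame curve to n_master_frames by interpolation."""
        slave_curve = np.asarray(slave_curve, dtype=float)
        src = np.linspace(0.0, 1.0, num=len(slave_curve))
        dst = np.linspace(0.0, 1.0, num=n_master_frames)
        return np.interp(dst, src, slave_curve)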

The method may comprise allowing the user to select the second audio clip between a plurality of prestored audio clips, preferably with a GUI that allows visualization of loaded audio clips, e.g. comprising a visualization of the audio clip signal level versus time.

The method may comprise playing the sound effect audio clip to the user time aligned with displaying a prestored video sequence.

The method may comprise storing information to allow synchronisation of the sound effect audio clip in relation to a time grid of a video sequence. Especially, the first and second audio clips may be in a digital format, so as to allow digital processing.

In a second aspect, the invention provides an apparatus, such as a DAW, comprising a processor and a memory and being configured for carrying out the method according to any one of the preceding claims. Especially, the apparatus may comprise a computer or server with at least one multicore processor and/or multiple processors, so as to allow parallel processing. Further, the apparatus preferably comprises a user interface preferably comprising a GUI, e.g. comprising a keyboard, a computer mouse, a color display, e.g. with a touch sensitive screen.

In a third aspect, the invention provides a computer program product having instructions which when executed cause a computing device or a computing system, such as the apparatus according to the second aspect, to perform the method according to the first aspect. Especially, the computer program product may be one of: an audio application, a digital audio workstation plug-in, and a stand-alone software product for a general computer. It is to be understood that the computer program product comprises instructions in the form of program code which may be implemented on any type of audio processing platform, e.g. a sound card in a computer, or a general processor in a mobile device, e.g. in the form of a downloadable application for a programmable device. Preferably, the computer program product implements a GUI.

The computer program product may be specially designed for generation of audio clips for various applications, especially film or video production, musical production, such as designed as a DJ tool, gaming audio production, or Virtual Reality or Augmented Reality audio production.

In a fourth aspect, the invention provides a computer readable medium having stored thereon a computer program product according to the third aspect.

It is appreciated that the same advantages and embodiments described for the first aspect apply as well for the second, third, and fourth aspects. Further, it is appreciated that the described embodiments can be intermixed in any way between all the mentioned aspects.

BRIEF DESCRIPTION OF THE FIGURES

The invention will now be described in more detail with regard to the accompanying figures, of which

Fig. 1 illustrates a block diagram of a simple embodiment,

Fig. 2 illustrates a block diagram of another embodiment,

Fig. 3 illustrates examples of master and slave audio clips with different durations,

Figs. 4-6 illustrate examples of audible descriptors versus time of master and slave fingerprints in the form of amplitude, pitch, and formant versus time, respectively,

Fig. 7 illustrates steps of a method embodiment,

Figs. 8-10 show different examples of Graphical User Interface (GUI) screen shots, and

Figs. 11a-11c show details of a GUI embodiment.

The figures illustrate specific ways of implementing the present invention and are not to be construed as being limiting to other possible embodiments falling within the scope of the attached claim set.

DETAILED DESCRIPTION OF THE INVENTION

In the following, basic elements of the method according to the invention will be described. It is to be understood that the method can preferably be implemented as DAW software to be used on a dedicated audio workstation or a general computer, such as a personal computer or a tablet or the like. Preferably, such implementation comprises a user interface comprising audio and graphical interface elements, e.g. to allow a user to select between stored first audio clips to be used as a master audio clip, and stored second audio clips to be used as a slave audio clip. Such implementation will provide a creative and intuitive tool for an audio engineer or producer in generating a modified audio clip being a modified version of the slave audio clip but with a sound texture caught from the master clip. The modification introduces artefacts, which are in fact intended and form part of the process of creating new artificial sound effects, still having the sound texture in common with the two original sounds.

Fig. 1 shows a method embodiment, where a master audio clip M_AC in digital format with a first duration is analysed FP_A, so as to determine a master mapping FPM comprising data representing the time evolution of audible characteristics of the master audio clip versus time, preferably analysed over a time window and represented as data values frame-by-frame. Preferably, the master mapping FPM comprises data representing the time evolution of three audible characteristics, namely: amplitude, pitch, and formant. In this way a good description of the sound texture of the master audio clip M_AC is obtained. As already explained, the way to precisely analyse with respect to these characteristics as a function of time and represent data values accordingly is out of the scope of the present application, but this is within the knowledge of the skilled person.

The slave audio clip S_AC is also analysed with the same audible characteristics FP_A analysing algorithm as the master audio clip M_AC, i.e. to obtain for the slave audio clip S_AC slave mapping data FPS including data values corresponding to the same audible characteristics versus time as for the master audio clip M_AC. Next, the difference between the two mapping data FPM and FPS is generated, and the resulting modified mapping data M_FP is then applied to a modification algorithm, which receives a slave audio clip S_AC with a second duration in digital format and modifies it into a modified audio clip MOD_AC, i.e. the resulting sound effect audio clip. The modification algorithm generates a modified audio clip MOD_AC by applying signal processing to the slave audio clip S_AC, in order to arrive at the modified audio clip MOD_AC which hereby receives the modified M_FP audible characteristics, and thus the audible characteristics of the master audio clip M_AC, i.e. with audible characteristics in accordance with the master characteristics FPM, namely identical to, or close to identical to, all of the audible characteristics versus time of the master mapping FPM, i.e. amplitude, pitch, and formant. The modified mapping data M_FP is generated as a mapping data difference between the master and slave audio clips in order to nullify the effect of the slave audio clip fingerprint FPS upon modification of the slave audio clip, thereby resulting in a modified audio clip MOD_AC which has the fingerprint FPM from the master audio clip M_AC. Preferably, separate modifications are performed, frame-by-frame, for each of the audible descriptors amplitude, pitch, and formant.

It is to be understood that the process of generating the modified mapping data M_FP may be performed in alternative ways, so as to reduce the full effect of the characteristics in the mapping data FPM of the master audio clip M_AC. E.g. the difference between FPS and FPM may instead be calculated as M_FP = FPS - k*FPM, where k is between 0 and 1, thus allowing the user to adjust the amount of influence from the master audio clip M_AC on the resulting modified audio clip MOD_AC to be output, i.e. the resulting sound effect audio clip.

Fig. 2 illustrates an alternative embodiment, which to a large extent is similar to the embodiment illustrated in Fig. 1, however where the modified mapping data of the master M_FP is modified in accordance with a user input U_INP, e.g. the user may, based on a visual representation of the audio characteristics versus time, modify or manipulate the resulting characteristics versus time, e.g. one of or all of the audible descriptors therein, after the difference has been generated, so as to generate a modified mapping data M_FP which is then used in the final step of generating the modified audio clip MOD_AC by modifying the slave audio clip S_AC, thereby arriving at the sound effect audio clip.
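For illustration, a minimal sketch of the combination of mapping data just described (the arrays stand for per-frame values of one audible characteristic; k is the user-controlled influence factor mentioned above):

    # Minimal sketch of combining master and slave mapping data (Figs. 1 and 2).
    # k = 1 gives the plain difference FPS - FPM; smaller k reduces the master's influence.
    import numpy as np

    def modified_mapping(fps, fpm, k=1.0):
        """Return M_FP = FPS - k * FPM for equally long per-frame curves."""
        fps = np.asarray(fps, dtype=float)
        fpm = np.asarray(fpm, dtype=float)
        return fps - k * fpm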

Fig. 3 shows an example of a master audio clip M_AC and a slave audio clip S_AC, illustrated as time signals, and each having a duration, M_D and S_D. As seen, the lengths of the two clips are different, here S_D > M_D. Preferably, in the step of generating the modified audio clip, the master and slave audio clips M_AC, S_AC have the same duration, so as to allow a frame-by-frame modification based on their fingerprints. This can be obtained in different ways, e.g. by a truncation of the one with the longest duration, here the slave audio clip S_AC. The dashed vertical line indicates a possible truncation of the slave audio clip S_AC to match the duration of the master audio clip M_AC.

However, in case the user prefers to use the entire audio clips for some reason, a time stretching or time compression algorithm may be applied to either the master or slave audio clip M_AC, S_AC. Such algorithms are known in the art. Such time matching may especially involve a frame remapping of the slave audio clip S_AC. In case the master and slave audio clips M_AC, S_AC are digital representations at different sample rates, an upsampling algorithm may be applied to either one or both, prior to performing a frame-by-frame matching.

It is to be understood that such additional step of matching the duration of the master audio clip M_AC to the slave audio clip S_AC may be added to the embodiments shown in Figs. 1 and 2.

Figs. 4-6 illustrate examples of graphs showing preferred audio characteristics to be analysed versus time, preferably for both the master and the slave audio clips. Fig. 4 shows a graphical representation of amplitude versus time for a master audio clip M_AC and a slave audio clip S_AC, to illustrate the amount of alteration. The values of the amplitude parameter are compared, frame-by-frame. Especially, the difference between the two compared may be registered and stored, thus allowing in a subsequent step the slave audio clip S_AC to be modified or changed into the modified audio clip, so as to match that of the master audio clip M_AC, frame-by-frame.

Fig. 5 shows a graphical representation of pitch for a small portion of monophonic master and slave audio clips M_AC, S_AC, to illustrate the amount of alteration at a specific point in time. The values of the pitch parameter are compared frame-by-frame. As described above for the amplitude, the difference between the two compared is registered and stored, and the parameter, in this case the pitch, of the slave audio clip is changed to match that of the master audio clip, frame-by-frame.

Fig. 6 shows a graphical representation of formant for a small portion of a master audio clip M_AC and a slave audio clip S_AC, to illustrate the amount of alteration at a specific point in time. As explained for amplitude and pitch, the values of the formant parameter, in each of the frequency bins, are compared, frame-by-frame. The difference between the two compared is registered and stored, and the formant of the slave audio clip is changed to match that of the master audio clip, frame-by-frame. It is known in the art how to apply signal processing so as to analyse an audio clip to arrive at an audible descriptor versus time for amplitude, pitch as well as formant, and further it is known in the art how to modify an audio clip, so as to change characteristics of the audio clip to match each of the audible descriptors, e.g. frame-by-frame.

Fig. 7 illustrates steps of a method embodiment. First, a slave audio clip is received R_A_SL, and next, a master audio clip is received R_A_MST. The master audio clip and the slave audio clip are then analyzed AN_MST, AN_SL with respect to one or more selected audible descriptors versus time, so as to arrive at fingerprints for both the master and the slave audio clips, preferably with these fingerprints comprising all of: amplitude, pitch, and formant.

Further, a time aligning TMA_MST_SL of the master and slave audio clips is performed on the fly during analysis, in case it is detected that the duration of the master audio clip is different from the duration of the slave audio clip. This may include applying a time stretching and/or a truncation so as to make the lengths or durations of the master and slave audio clips match. Then a modified mapping data is generated G_MFGP in response to the characteristics generated for the master audio clip. E.g. the user may modify the characteristics of the master audio clip by editing one or more of the audible descriptors of the characteristics, e.g. the user may edit only a temporal portion of one audible descriptor. Another option is to allow the user to mix the values for one audible descriptor between the values contained in the characteristics determined for the master and slave audio clips, e.g. to avoid a too dramatic modification occurring if the master and slave audio clips differ significantly with respect to one audible descriptor in a specific sample interval. This allows the user to influence the final step, which is applying the modified mapping data to an algorithm which generates a modified audio clip G_A_MDF by modifying the time aligned slave audio clip in a processing step to ensure that the slave audio clip is modified so that the resulting modified audio clip has the same time evolution of audible characteristics as the audible characteristics values contained in the modified mapping data.

In preferred software implementations of the method for generating a modified audio clip, a highly multithreaded implementation is utilized to split long-lasting operations into small chunks of work, allowing full utilization of all the cores on modern computer or server CPUs. This helps to speed up processing time, as the analysis may require an intense amount of operations, especially in case the audio clips are long. Running the analysis in parallel therefore allows an uninterrupted workflow, which is of great benefit to the user, whether the software implementation involves a standalone application or a plug-in. This is preferred because of the large sizes of audio blocks used during the processing in order to attain a high frequency resolution and audio quality, and further due to the potentially large number of streams that will run in parallel if the algorithm is used on multi-channel material.
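A minimal sketch of this job-splitting idea (the patent describes the principle, not this API; Python's concurrent.futures and a per-frame job granularity are assumptions of the sketch):

    # Sketch: one small analysis job per frame, scheduled on a thread pool so
    # all cores can be used (numpy's FFT releases the GIL for the heavy work).
    from concurrent.futures import ThreadPoolExecutor
    import numpy as np

    def analyse_frame(samples, start, frame_size=2048):
        frame = samples[start:start + frame_size] * np.hanning(frame_size)
        spectrum = np.abs(np.fft.rfft(frame))
        return np.sqrt(np.mean(frame ** 2)), spectrum    # (amplitude, magnitude spectrum)

    def analyse_parallel(samples, frame_size=2048, hop=512):
        starts = range(0, len(samples) - frame_size, hop)
        with ThreadPoolExecutor() as pool:
            futures = [pool.submit(analyse_frame, samples, s, frame_size) for s in starts]
            return [f.result() for f in futures]         # results kept in frame order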

In some software implementations, the audio clips are stereo audio clips, i.e. including two channels. However, in other software implementations, the audio clips may include more than two channels, e.g. four stereo channels, e.g. to match demands from the movie industry. In such multiple stream versions, the software is preferably designed with algorithms arranged for parallel processing.

In the following, a description with specific details of an example implementation of a computer program product to perform the method of the present invention is given. In this implementation embodiment, the algorithm is split into two parts: an analysis part and a synthesis part. In the program, in order to enable maximal parallelism, the associated memory may be completely separated.

When importing an audio clip, the program starts a new job that:

1) Reads the whole audio clip (file) into memory, and

2) Splits the data into small, equally sized partitions.

After importing the whole audio clip, the audio data is split up into multiple frames that are analyzed in separate jobs. Due to the orientation towards a high-performance implementation using SSE, both the analysis and synthesis algorithms operate on stereo audio signals. Audio clips with higher channel counts would just result in more jobs being created. In a preferred implementation of the analysis algorithm, each job comprises performing the following steps (a condensed sketch in code is given after the list):

1) Extract a frame of data from the PCM sample at the given time and multiply it by a Hanning window. The length of the window, and consequently the length of the audio extracted at that time, is defined by the kFrameSize template argument. However, the length of the Fast Fourier Transform (FFT) used is considerably larger than this, as it is preferred to have a good extraction of deep frequencies, so the windowed frame is zero-padded to match the longer FFT size.

2) Perform a forward FFT and convert to a magnitude spectrum.

3) Extract the centroid and/or energy of this magnitude spectrum. The resulting value is used for calculating the "Energy" pitch-tracking mode, which works best for machine-type sounds that have a dominant resonant structure.

4) Calculate the Harmonic Product, e.g. over the frequency range 10 to 600 Hz on the spectrum. This may simply be the product of integer-spaced amplitudes in the magnitude spectrum and therefore shows a peak where there is the highest amount of overlap in the overtones of the sound. The resulting value is used for calculating the values used by the "Tone" pitch-tracking mode, which works better than the "Energy" mode for vocals or musical instruments.

5) Perform another forward FFT, this time using a smaller kFrameSize. This generates the data that is used for the formant analysis. The resulting complex spectrum is converted to a magnitude spectrum, and additionally a log-mapped version of this is stored in a temporary buffer.

6) Perform a forward FFT on the temporary buffer. This results in the so-called cepstrum. The first 16 coefficients from this are kept, but the first is set to 0.5 for normalization; then an inverse FFT is run on this buffer and the log mapping is undone by mapping the result by the exponential function. The result is a smoothed version of the magnitude spectrum, which is referred to as the formant spectrum.

7) Store the amplitude and instantaneous phase spectrum, the formant magnitude spectrum and the single amplitude and pitch-values in the frame data. For pitch, the energy- and tone-based values are stored separately in order to be able to cross-fade between them later in the synthesis part.
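The following is a condensed, illustrative sketch of steps 1)-7) in Python/numpy; the frame size, FFT size and sample rate are placeholders standing in for kFrameSize and the longer FFT length, and the centroid/harmonic-product details are simplified assumptions rather than the exact implementation:

    # Condensed sketch of one analysis job (steps 1-7 above).
    import numpy as np

    def analyse_job(pcm, start, frame_size=1024, fft_size=8192, sample_rate=48000):
        # 1) windowed frame, zero-padded to the longer FFT size
        frame = pcm[start:start + frame_size] * np.hanning(frame_size)
        padded = np.zeros(fft_size)
        padded[:frame_size] = frame

        # 2) forward FFT -> magnitude spectrum
        spectrum = np.fft.rfft(padded)
        mag = np.abs(spectrum)
        freqs = np.fft.rfftfreq(fft_size, 1.0 / sample_rate)

        # 3) spectral centroid, used for the "Energy" pitch-tracking mode
        energy_pitch = float(np.sum(freqs * mag) / (np.sum(mag) + 1e-12))

        # 4) harmonic product over 10-600 Hz, used for the "Tone" pitch mode
        candidates = np.where((freqs >= 10.0) & (freqs <= 600.0))[0]
        hp = [np.prod(mag[c::c][:5]) for c in candidates]   # integer-spaced overtones
        tone_pitch = float(freqs[candidates[int(np.argmax(hp))]])

        # 5)-6) smaller-frame spectrum, cepstrally smoothed formant spectrum
        small = np.abs(np.fft.rfft(frame))
        cep = np.fft.rfft(np.log(small + 1e-12))
        cep[16:] = 0.0
        cep[0] = 0.5                                         # normalise first coefficient
        formant = np.exp(np.fft.irfft(cep, n=len(small)))

        # 7) frame data to be stored
        return {"amplitude": float(np.sqrt(np.mean(frame ** 2))),
                "phase": np.angle(spectrum), "magnitude": mag,
                "energy_pitch": energy_pitch, "tone_pitch": tone_pitch,
                "formant": formant}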

After all the jobs have finished, a final pass runs sequentially over all frames in order to convert the instantaneous phase information into instantaneous frequencies through standard phase-unwrapping methods as used in phase vocoders. It also extracts the maximum spectral peak and calculates hierarchically pre-smoothed versions of the amplitude and pitch curves. The smoothing just applies repeated convolution with a custom low pass filter kernel to obtain increasing amounts of smoothing. This processing is similar to processing known within synthesizers where a wavetable is used to suppress aliasing, except that the resulting wavetables here are not decimated. The smoothing is preferably only applied to the single amplitude, energy and tone pitch values across all frames, not for the formant spectra.
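A sketch of this final pass under simple assumptions (hop size and smoothing kernel are illustrative; the phase unwrapping follows the standard phase-vocoder formulation rather than the patent's exact code):

    # Sketch: phase differences between consecutive frames -> instantaneous
    # frequencies; amplitude/pitch curves pre-smoothed by repeated convolution.
    import numpy as np

    def instantaneous_freqs(phases, hop, fft_size, sample_rate):
        """phases: array of shape (n_frames, n_bins) of per-frame instantaneous phase."""
        bin_freqs = np.arange(phases.shape[1]) * sample_rate / fft_size
        expected = 2 * np.pi * bin_freqs * hop / sample_rate       # expected phase advance
        dphi = np.diff(phases, axis=0) - expected
        dphi = (dphi + np.pi) % (2 * np.pi) - np.pi                 # wrap to [-pi, pi)
        return bin_freqs + dphi * sample_rate / (2 * np.pi * hop)

    def hierarchical_smooth(curve, levels=4, kernel=(0.25, 0.5, 0.25)):
        """Return progressively smoother copies of a per-frame curve."""
        out, kernel = [np.asarray(curve, dtype=float)], np.asarray(kernel)
        for _ in range(levels):
            out.append(np.convolve(out[-1], kernel, mode="same"))
        return out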

In the synthesis algorithm, most of the computationally intensive work has already been done by the analysis, but some cases are still preferably handled by the synthesis algorithm part. First, if the playback pitch is left at 100%, the algorithm will leave the master signal untouched and just use triangular windows to recombine the frames in the time domain. For other pitch values, the master will be applied to the phase vocoder based algorithm like any other slave signal.

The main driver of time is the master signal. There are 3 modes by which analysis frames of the slaves can be read during synthesis which affect which frames get read at what time. These are controlled via the "Stretch" parameter:

1) Stretch=0%: In this mode, the frame read-out speed follows that of the master, which means that if the slave is shorter than the master, it will emit silence between its end and the point where the master loops.

2) Stretch>0%: In this mode, the slave length will be stretched by the amount specified to match the master's length. This is effectively implemented as a scale factor applied to the frame index of the master.

3) Stretch<0%: In this mode, the slave is running on its own time-base and performs a crossfade between the amplitude, pitch and formant curves defined by the negated percentage. This means that the extracted parameters read near the end are gradually faded out while parameters from the beginning are fading in.

The way these modes are implemented in the code is such that the code is always set up to do a weighted mix between different frames, but for case 1 and 2, the read-out positions just happen to be the same and the weights are adjusted to let just one through. This simplifies the following code.
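The following sketch is one interpretation of the three "Stretch" modes (not the patent's code): a master frame index is mapped to two weighted slave read-out positions, where modes 1) and 2) degenerate to a single position with full weight and a negative stretch cross-fades end and beginning on the slave's own time-base.

    # Interpretation sketch of the "Stretch" parameter (frame counts as inputs).
    def slave_readout(master_index, n_master, n_slave, stretch_pct):
        if stretch_pct == 0:                       # 1) follow master's read-out speed
            pos = master_index
            if pos >= n_slave:
                return None                        # slave ended before master: silence
            return [(pos, 1.0), (pos, 0.0)]
        if stretch_pct > 0:                        # 2) stretch slave toward master's length
            full_scale = n_slave / n_master        # scale needed for a 100% stretch
            scale = 1.0 + (stretch_pct / 100.0) * (full_scale - 1.0)
            pos = min(master_index * scale, n_slave - 1)
            return [(pos, 1.0), (pos, 0.0)]
        # 3) negative stretch: slave on its own time-base, cross-fade end -> start
        fade = -stretch_pct / 100.0
        pos = master_index % n_slave
        w_out = max(0.0, (pos / n_slave - (1.0 - fade)) / fade)   # grows toward the end
        return [(pos, 1.0 - w_out), ((pos + fade * n_slave) % n_slave, w_out)]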

Finally, the implementation comprises a standard phase vocoder. For each frame, instantaneous frequency values are read out by scaling the bin values by the pitch ratio defined by the ratio between the master and the slave. The pitch values, like the amplitude values, are read from the smoothed versions of the analyzed pitch values. Applying this value directly would result in the slave following the master 100%, as the division by the slave's pitch takes out the pitch of the slave completely. Therefore, this ratio is preferably interpolated, e.g. linearly interpolated with a value of 1, and the amount of linear interpolation is defined by the m_PitchTransfer variable which is controlled by the "Pitch" parameter on the UI. Finally, that pitch value is scaled by 2^(PitchTranspose/12) in order to apply musical transposition. The resulting pitch value is used as a scale factor for the bin lookup, so finally, the result is adding amplitudes from the slave signals at those particular bins to the resulting spectrum. This works fine for amplitudes, however for the bin frequencies, these are just overwritten, which is known to introduce artifacts. Finally, before actually adding the amplitudes, they are multiplied by the ratio between the master and slave formant magnitude spectra: they are multiplied by the master's formant amplitude at the target bin and divided by the slave's formant amplitude at the mapped source bin in order to erase the influence of the formant that is already incorporated into the slave's amplitude.
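A hedged per-bin sketch of the adjustments just described (an illustration of the idea; parameter names echo the description, values and the direction of the bin lookup are assumptions):

    # Sketch: pitch ratio interpolated toward 1 by the pitch-transfer amount,
    # scaled for musical transposition, then used for the bin lookup; slave
    # amplitude re-weighted by the master/slave formant ratio.
    def synth_bin(target_bin, slave_mag, slave_formant, master_formant,
                  master_pitch, slave_pitch, pitch_transfer=1.0, transpose_semitones=0.0):
        ratio = master_pitch / max(slave_pitch, 1e-9)
        ratio = (1.0 - pitch_transfer) + pitch_transfer * ratio   # interpolate toward 1
        ratio *= 2.0 ** (transpose_semitones / 12.0)              # musical transposition
        source_bin = int(round(target_bin / ratio))
        if not (0 <= source_bin < len(slave_mag)):
            return 0.0
        amp = slave_mag[source_bin]
        # erase the slave's own formant and impose the master's at the target bin
        amp *= master_formant[target_bin] / max(slave_formant[source_bin], 1e-12)
        return amp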

The phase vocoder mixes the windowed output of the inverse FFT into a sliding output buffer (m_SlidingOutput) in order to do classic overlap-add synthesis. Before mixing the output of the sliding buffer to a master output (the one that accumulates signals from master and slaves), each phase vocoder output is preferably filtered by a cascade of 2 pairs of 12 dB Butterworth lowpass and highpass filters.

In practical implementations of the method, the software implements a Graphical User Interface (GUI) allowing visualization of the analysis data and extracted feature curves. Figs. 8-10 show examples of screen shots of an implementation of a GUI.

In Fig. 8, the screen is shown in the state before any audio clips have been loaded. The GUI allows the user to drag and drop master and slave audio clips (files) onto the marked areas, in order to load audio files into the program. As seen, in the shown implementation, the user can load one master clip MASTER and three slave clips SLAVE A, SLAVE B, and SLAVE C. Preferably, the program is arranged to accept WAV and AIFF type audio clips, however it may be prepared for accepting most other common formats as well.

In Fig. 9, a master clip has been loaded into the MASTER panel. To the left, a graph indicates various audible descriptors versus time for the master clip, as well as the raw time signal of the master clip. In the middle part of the MASTER panel, the user can adjust pitch mode and speed. To the right in the MASTER panel, a small arrow below a vertical bar marked M indicates the length of the master clip.

In Fig. 10, two slave audio clips have been loaded into the SLAVE A and SLAVE B panels. As in the MASTER panel, a graph to the left indicates raw time signal as well as various audible descriptors of the slave clips. In the middle, the user can adjust Loop X-fade to stretch, the degree of pitch following, the degree of amplitude following, and the degree of formant following. To the right, the user can access and adjust pitch mode, pitch transpose, pitch smooth, and amplitude smooth.

In Fig. 11a, a possible GUI screen layout is shown where a master panel occupies the upper screen row, namely fields M_1, M_2, M_3. One screen row shows a first slave panel, SLA_1, SLA_2, and SLA_3, and another screen row shows a second slave panel, SLB_1, SLB_2, and SLB_3, while yet another screen row SLC is free for loading of yet a third slave panel. The left column fields, i.e. M_1, SLA_1, SLB_1, preferably show graphs indicating the amplitude of the audio clip versus time, as well as various other parameters over time for the audio clip. The other columns may be dedicated to respective sliders to adjust various parameters. In the middle column, the user may adjust such as Loop X-fade to stretch, the degree of pitch following, the degree of amplitude following, and the degree of formant following. To the right, the user may adjust such as pitch mode, pitch transpose, pitch smooth, and amplitude smooth.

In Fig. 11b, a specific example of the content of field M_1 from Fig. 11a is shown.

In Fig. 11c, a specific example of the content of field SLA_2 from Fig. 11a is shown.

In a practical implementation of a computer program product, it may be preferred to implement a number of automatic functions. A few examples will be given below.

In some embodiments, a randomize function may be added to each respective child section. The idea is that the software, at the touch of a button (a 'random' button), makes different 'random' suggestions for settings of all parameters. It is done by making settings with the values of the individual parameters increased by a few percent, and saving these as a setting or snapshot. When prompted by pressing the (random) button, these settings are mixed randomly by an algorithm scrambling the settings for each parameter based on the incremented settings, hence creating a random setting for the whole section. In other words, a randomize function. This should work separately on each child section.
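A small illustrative sketch of such a per-section randomize function (the parameter names, the few-percent spread and the dict representation are assumptions of the sketch):

    # Sketch: nudge each parameter of one child section by a few percent and
    # collect the nudged values as a new random snapshot for that section.
    import random

    def randomize_settings(settings, spread_pct=5.0):
        """settings: dict of parameter name -> value for one child section."""
        snapshot = {}
        for name, value in settings.items():
            factor = 1.0 + random.uniform(-spread_pct, spread_pct) / 100.0
            snapshot[name] = value * factor
        return snapshot

    def generate_variations(settings, count=10):
        """Automatic audio-clip variation: one randomized snapshot per rendering."""
        return [randomize_settings(settings) for _ in range(count)]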

In some embodiments, the described randomize function may be used in each section together with a global option, e.g. in a window, that defines a number which indicates a desired number of different versions of a sound, e.g. 2-20, such as 10. This will be done automatically by the software by replaying the play selection of the master 10 times, and for each time the software will engage the randomize button for the respective child sections, hence yielding a different result 10 times in a row and generating separate audio clip files which are thus different, i.e. an automatic audio clip generation function. Such automatic generating of different audio clips may be especially useful for gaming audio production.

In some embodiments, an automatic pitch and/or tempo difference detection may be implemented. By comparing the pitch of the sound in the parent and child section, a readout in the pitch transpose parameter of the child section will indicate the pitch of the sound/beat in the parent section. An option for automatic lock of pitch is possible by clicking a square with a little lock next to the pitch transpose parameter of the child section. The application will then transpose the needed change in pitch to the sound in the slave section to maintain equal pitch with that of the master.

In some embodiments, the software may be configured to synchronize a music audio clip with the tempo of the DAW (Digital Audio Workstation) used, by locking to the general tempo of the sequencer/DAW or a piece of music, e.g. events, beats or music can be synchronized to the time of a particular BPM (Beats Per Minute). Tempo may be indicated graphically in the waveform window (corresponding to the left columns of Fig. 11a) by BPM indicators. By using peak detection, e.g. like beat detective in known audio tools, a tempo of the content in the parent section of the software is calculated or can be set. When the user activates the BPM lock indicator from the user interface of the software, the tempo of the DAW and that of the software are compared, the difference in tempo is calculated, and the time stretch or compression needed is applied, to match the length of the audio in the parent section window in time with that of the tempo in the used DAW. E.g. a function may be implemented to allow the user to double the tempo to the double of that in the DAW. Thus, if the tempo of the DAW is 120 bpm, by pressing an 'x2' button, it will be calculated as if the tempo was 240 bpm. This is also possible to do with a selected portion of audio within the play section start and end cursors, a play selection. This is non-destructive and can be undone in steps. Such tempo locking functionality may especially be useful for music composition or music production.
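For illustration, a minimal sketch of the tempo-lock calculation described above (the factor convention, i.e. treating it as a duration multiplier, is an assumption of the sketch):

    # Sketch: derive the non-destructive time-stretch factor needed to lock a
    # clip of a detected tempo to the DAW tempo, with an optional 'x2' doubling.
    def tempo_stretch_factor(clip_bpm, daw_bpm, double=False):
        target_bpm = daw_bpm * 2.0 if double else daw_bpm
        return clip_bpm / target_bpm     # duration multiplier: <1.0 compresses, >1.0 stretches

    # Example: a 100 BPM clip locked to a 120 BPM DAW is compressed to 100/120 of
    # its length; with the 'x2' option it is treated as if the DAW ran at 240 BPM.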

In some embodiments, the software may include play selection markers. The default play selection may be set as the full time width of the audio waveform display. The markers may be set by either grabbing the letter [S] for start or [E] for end, or by grabbing the line connected to the S/E. By shift/double clicking either the letter S or E or the line connected to the S/E, i.e. the start and end positions of the play selection, the markers will swap. The software may be implemented so that it will always play from S to E, meaning that if E is before S, the file will be played in reverse. Double clicking either the letter S or E or the line connected to the S/E will return the start and end to the beginning or the ending of the parent audio file respectively. The software may be configured so that double clicking within the range in the parent section will make the software play back from where the user has clicked. Pressing [S] or [E] will place the start or the end of a play selection at the playback cursor respectively.
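A tiny sketch of the play-from-S-to-E behaviour, including the reverse case (illustrative only; sample indices are assumed):

    # Sketch: playback always runs from S to E, so if E lies before S the
    # selected samples are returned in reverse order.
    def play_selection(samples, s_index, e_index):
        if e_index >= s_index:
            return samples[s_index:e_index]
        return samples[e_index:s_index][::-1]     # reversed playback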

Especially, the mentioned function where the playback is reversed will help the user to create surprising new sounds, and thus such a function in the software will help the user to provide new creative audio clips.

Some embodiments of the method or computer program product may involve artificial intelligence, which is used to "automatically" modify audio clips corresponding to the multiplicity of times they are used. This may e.g. be used in gaming audio, where it can be applied when the sounds for games are being programmed, or where a generative combination of how a sound actually sounds, customized to the gamer's behaviour, can be created with the method or computer program product according to the invention. Especially, such gaming sound functionality may be generated on the basis of a randomize audio clip generation algorithm as described in the foregoing.

It is to be understood that the method according to the invention as well as a computer program product according to the invention may be used for various applications. Especially, the fields of application can be divided into e.g.: 1) movie, tv, video or online content sound design and production, 2) music composition and production, e.g. designed as a DJ tool, 3) gaming audio production, or 4) Virtual Reality or Augmented Reality audio production. It is to be understood that the GUI as well as the included variety of functions may be suited to various needs in the mentioned application areas 1)-4), thus the computer program product may be specially designed for each application area.

To sum up: The invention provides a method for generating a sound effect audio clip based on a mix of audible characteristics of two existing audio clips. The method comprises selecting first and second audio clips, and mapping evolution of time of a plurality of predetermined audible characteristics of the first audio clip to arrive at first mapping data accordingly. The second audio clip is then modified based on the first mapping data, so as to at least partially apply evolution of time of audible characteristics from the first audio clip to the second audio clip, and the sound effect audio clip is output in response to the modified second audio clip. Preferred audible characteristics are amplitude, pitch, and spectral envelope (e.g. formant), which are each represented in mapping data as values representing the audible characteristics for the duration of the first audio clip at a given time resolution, where each value represents a value or a set of values representing the result of an analysis over a predetermined time window.

Especially, the second audio clip may also be mapped with respect to evolution of time of corresponding audible characteristics, and the modification of the second audio clip can then be performed in response to a mix of the two mapping data sets, e.g. by a frame-by-frame processing. A time alignment of the first and second audio clips may be performed, so that the two audio clips have the same duration prior to being processed.

As a further option, a looping mode may be included in a practical implementation. If the slave audio clip is shorter than the master audio clip, it can be stretched to fit the length of the master audio clip, or as long as desired from 0 - 100% of the master. If the slave audio clip is longer, it will be truncated starting from the start, and the cross fade option can be chosen by setting a value in percent from 0 - 80% cross fade. This is typically used for long monotonous audio clips.

Although the present invention has been described in connection with the specified embodiments, it should not be construed as being in any way limited to the presented examples. The scope of the present invention is to be interpreted in the light of the accompanying claim set. In the context of the claims, the terms "including" or "includes" do not exclude other possible elements or steps. Also, the mentioning of references such as "a" or "an" etc. should not be construed as excluding a plurality. The use of reference signs in the claims with respect to elements indicated in the figures shall also not be construed as limiting the scope of the invention. Furthermore, individual features mentioned in different claims may possibly be advantageously combined, and the mentioning of these features in different claims does not exclude that a combination of features is possible and advantageous.