Title:
METHOD, DEVICE AND SOFTWARE FOR APPLYING AN AUDIO EFFECT, IN PARTICULAR PITCH SHIFTING
Document Type and Number:
WIPO Patent Application WO/2021/175460
Kind Code:
A1
Abstract:
The present invention provides a method for processing music audio data, comprising the steps of providing input audio data representing a first piece of music containing a mixture of predetermined musical timbres, decomposing the input audio data to generate at least a first audio track representing a first musical timbre selected from the predetermined musical timbres, and a second audio track representing a second musical timbre selected from the predetermined musical timbres, applying a predetermined first audio effect to the first audio track, applying no audio effect or a predetermined second audio effect, which is different from the first audio effect, to the second audio track, and recombining the first audio track with the second audio track to obtain recombined audio data.

Inventors:
MORSY KARIEM (DE)
Application Number:
PCT/EP2020/074034
Publication Date:
September 10, 2021
Filing Date:
August 27, 2020
Assignee:
ALGORIDDIM GMBH (DE)
International Classes:
G10H1/36; G10H1/00; G10H1/02; G10H1/057; G10H1/46
Domestic Patent References:
WO2015066204A1 2015-05-07
Foreign References:
US20160358594A1 2016-12-08
US20140018947A1 2014-01-16
Other References:
LAURE PRETET ET AL: "Singing voice separation: a study on training data", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 6 June 2019 (2019-06-06), XP081374010, DOI: 10.1109/ICASSP.2019.8683555
PRETET: "Singing Voice Separation: A study on training data", ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, pages 506 - 510, XP033566106, DOI: 10.1109/ICASSP.2019.8683555
Attorney, Agent or Firm:
WEICKMANN & WEICKMANN PARTMBB (DE)
Claims:
1. Method for processing music audio data, comprising the steps of a. providing input audio data representing a first piece of music containing a mixture of predetermined musical timbres, b. decomposing the input audio data to generate at least a first audio track representing a first musical timbre selected from the predetermined musical timbres, and a second audio track representing a second musical timbre selected from the predetermined musical timbres, c. applying a predetermined first audio effect to the first audio track, d. applying no audio effect or a predetermined second audio effect, which is different from the first audio effect, to the second audio track, e. recombining the first audio track with the second audio track to obtain recombined audio data.

2. Method of claim 1, wherein the first audio effect is a pitch scaling effect changing the pitch of audio data of the first audio track while maintaining its playback duration.

3. Method of claim 1 or claim 2, wherein the pitch scaling effect shifts the pitch of the audio data of the first audio track up or down by a predetermined number of semitones.

4. Method of at least one of the preceding claims, wherein step b of decomposing the audio data generates a first audio track and a second audio track which are complements, such that their sum substantially equals the input audio data.

5. Method of at least one of the preceding claims, wherein the first musical timbre is a harmonic vocal timbre or a harmonic instrumental timbre and/or wherein the second musical timbre is a non-harmonic vocal timbre or a non-harmonic instrumental timbre, preferably a drum timbre.

6. Method of at least one of the preceding claims, wherein in step b of decomposing the audio data, there is generated the first audio track, the second audio track, and a third audio track representing a third musical timbre, wherein the first audio track, the second audio track and the third audio track are complements, such that their sum substantially equals the input audio data, wherein in step c, the predetermined first audio effect is applied to the first audio track, but not to the second audio track and not to the third audio track, and wherein in step d, the first audio track, the second audio track and the third audio track are recombined to obtain the recombined audio data.

7. Method of at least one of the preceding claims, wherein step b of decomposing the input audio data includes processing the input audio data by an AI system containing a trained neural network.

8. Method of at least one of the preceding claims, wherein output data obtained from the recombined audio data are further processed, preferably stored in a storage unit, and/or played back by a playback unit and/or mixed with second-song output data.

9. Method of at least one of the preceding claims, wherein obtaining the recombined audio data and/or further processing the output data is performed within a time smaller than 5 seconds, preferably smaller than 200 milliseconds, after the start of decomposing the input audio data.

10. Method of at least one of the preceding claims, further including the steps of determining a first key of the first piece of music of the input audio data, providing second-song input data representing a second piece of music, determining a second key of the second piece of music of the second-song audio data, determining a pitch shift value based on the first key and the second key, wherein in step c, the pitch of the first audio track is shifted by the pitch shift value, while maintaining the pitch of the second track, wherein the method preferably further comprises a step of mixing output data obtained from the recombined audio data with second-song output data obtained from the second-song input data, such as to obtain mixed output data, and wherein the method preferably further comprises a step of playing back playback data obtained from the mixed output data.

11. Device for processing music audio data, comprising an input unit for receiving input audio data representing a first piece of music containing a mixture of predetermined musical timbres, a decomposition unit for decomposing the input audio data received from the input unit to generate at least a first audio track representing a first musical timbre selected from the predetermined musical timbres, and a second audio track representing a second musical timbre selected from the predetermined musical timbres, a first effect unit for applying a predetermined first audio effect to the first audio track, but not to the second audio track, a recombination unit for recombining the first audio track with the second audio track to obtain recombined audio data.

12. Device of claim 11, wherein the first effect unit is a pitch scaling unit for changing the pitch of audio data of the first audio track while maintaining its playback duration.

13. Device of claim 11 or claim 12, wherein the decomposition unit includes an AI system containing a trained neural network, wherein the neural network is trained to separate audio data of a predetermined musical timbre from audio data containing a mixture of different musical timbres.

14. Device of at least one of claims 11 to 13, further comprising a storage unit adapted to store the output data, and/or playback unit adapted to play back the output data, and/or a mixing unit adapted to mix the output data with second-song output data.

15. Device of at least one of claims 11 to 14, further comprising a first key detection unit for determining a first key of the first piece of music of the input audio data, a second-song input unit for providing second-song input data representing a second piece of music, a second key detection unit for determining a second key of the second piece of music of the second-song audio data, a pitch shift calculation unit for determining a pitch shift value based on the first key and the second key, wherein the first effect unit is a pitch scaling unit adapted to shift the pitch of the first audio track by the pitch shift value, while maintaining the pitch of the second track.

16. Device of claim 15, further comprising a mixing unit adapted to mix output data obtained from the recombined audio data with second-song output data obtained from the second-song input data, such as to obtain mixed output data, and preferably a playback unit adapted to play back playback data obtained from the mixed output data.

17. Device of at least one of claims 11 to 16, further comprising a second-song input unit for providing second-song input data representing a second piece of music, a mixing unit adapted to mix output data obtained from the recombined audio data with second-song output data obtained from the second-song input data, such as to obtain mixed output data, and a crossfading unit having a crossfading controller that can be manipulated by a user to assume a control position within a control range, wherein the crossfading unit sets a first volume level of the output data and a second volume level of the second-song output data depending on the control position of the crossfading controller, such that the first volume level is maximum and the second volume level is minimum when the crossfading controller is at one end point of the control range, and the first volume level is minimum and the second volume level is maximum when the crossfading controller is at the other end point of the control range.

18. Device of at least one of claims 11 to 17, comprising a computer having a microprocessor, a storage unit, an input interface and an output interface, wherein at least the input unit, the decomposition unit, the first effect unit and the recombination unit are formed by a software program running on the computer, wherein the software is preferably adapted to control the computer such as to carry out a method of at least one of claims 1 to 11.

19. Software adapted to run on a computer to control the computer such as to carry out a method of at least one of claims 1 to 10.

Description:
The present invention relates to a method for processing music audio data comprising the steps of providing input audio data representing a piece of music containing a mixture of predetermined musical timbres and applying an audio effect to the input audio data. Furthermore, the present invention relates to a device for processing music audio data and to software suitable to run on a computer to control the computer to process audio data.

Methods, devices and software of the above described type are conventionally known for various applications in the fields of music production and recording, live mixing, DJ mixing, music broadcasting, etc. The processing of audio mostly implies the application of one or more audio effects, which modify certain sound parameters of the music such as to change the character of the sound without substantially changing the musical composition as such. Examples of known audio effects are reverb effects, delay effects, chorus effects, equalizers, filters, pitch shifting or pitch scaling effects, and tempo shifts (time-stretching / resampling). By virtue of such audio effects, the character of the sound is changed, which differentiates audio effects from mere volume changes. Namely, while volume changes just scale the amplitude of the audio signal by a constant factor leaving the character of the sound unchanged, audio effects typically modify the shape of the waveform of the audio signal.

Another audio processing application is a sound editing environment such as a digital audio workstation (DAW) or similar software, which allows import of a mixed mono or stereo audio file and editing the audio file by application of one or more audio effects.

Such audio effects include editing effects such as time stretching, resampling, pitch shifting, reverb, delay, chorus, equalizer (EQ) etc. Digital audio workstations are used by producers or mixing/mastering engineers, in recording studios, postproduction studios or the like.

In most audio processing applications, the input audio data are mono or stereo audio files containing one (mono) or two (stereo) mixed audio tracks of a piece of music. The mixed audio tracks may be produced in recording studios by mixing a plurality of source tracks, which are programmed on a computer (for example a drum computer) or obtained from directly recording individual instruments or vocals. In other cases, mixed audio tracks are obtained from live recording of a concert or from recording the output of a playback device, for example a vinyl player. Mixed audio tracks are often distributed by music distributors via streaming or downloading or broadcast by radio or TV broadcasting services.

It has been found that the application of audio effects can sometimes distort the character of the sound such that the music sounds less natural and the presence of the audio effect becomes audible more than desired. In particular, if the audio effect is applied for the purpose of correcting some acoustic shortfall or for the purpose of matching the sound of one song to that of another song, such as in a DJ environment in which a smooth transition from one song to another song is desired, it is generally an aim to apply the effect in such a manner that the listener will not recognize the presence of the effect or will at least not perceive a significant change of the character of the piece of music.

For example, the audio effect may be a pitch scaling effect changing the pitch of audio data while maintaining its playback duration, which might be desired by DJs to match the key of one song to that of another song such as to smoothly crossfade between the two songs (without the clashing of different keys). Conventional pitch scaling will lead to an unnatural distortion of the music, when the pitch is shifted by more than one or two semitones. This results in a limitation of the creative freedom of the DJ.

It is therefore an object of the present invention to improve the results of audio effects applied to mixed audio tracks and to avoid unnatural distortion of the music due to audio effects or to provide new options for modifying the character of a piece of music by virtue of audio effects. Specifically, it is an object of the invention, to provide a method, a device and a software for processing audio data, which allow pitch scaling by more than one or two semitones without unnatural distortion of the music.

In order to achieve the above object, according to a first aspect of the present invention there is provided a method for processing music audio data, comprising the steps of (a) providing input audio data representing a first piece of music containing a mixture of predetermined musical timbres, (b) decomposing the input audio data to generate at least a first audio track representing a first musical timbre selected from the predetermined musical timbres, and a second audio track representing a second musical timbre selected from the predetermined musical timbres, (c) applying a predetermined first audio effect to the first audio track, (d) applying no audio effect or a predetermined second audio effect, which is different from the first audio effect, to the second audio track, (e) recombining the first audio track with the second audio track to obtain recombined audio data.

Thus, according to an important feature of the present invention the input audio data are decomposed to obtain at least two different audio tracks of different musical timbres, the first audio effect is applied to only one of the two audio tracks, and the audio tracks are then recombined again to obtain recombined audio data. As a result, it becomes possible to apply the first audio effect in a more sophisticated and differentiated manner to affect only selected musical timbres.
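The sequence of steps (a) to (e) can be illustrated by a minimal sketch. Everything named here is an assumption made for illustration: the function decompose is a hypothetical stand-in that splits a signal into a smoothed part and its residual (it is not a timbre separator of the kind contemplated by the invention), and echo is a toy effect. The sketch only shows the data flow: decompose, treat the tracks differently, and sum them again.

```python
import numpy as np

def decompose(mix, width=8):
    """Hypothetical placeholder for step (b): returns two complementary tracks.

    The first track is a moving average of the input and the second track is
    the residual, so first + second reproduces the input by construction.
    """
    kernel = np.ones(width) / width
    first = np.convolve(mix, kernel, mode="same")
    second = mix - first
    return first, second

def echo(track, delay=100, gain=0.5):
    """Toy audio effect (assumption): adds one delayed, attenuated copy."""
    out = track.copy()
    out[delay:] += gain * track[:-delay]
    return out

def process(mix, first_effect=None, second_effect=None):
    """Steps (b) to (e): decompose, apply effects per track, recombine."""
    first, second = decompose(mix)          # step (b)
    if first_effect is not None:
        first = first_effect(first)         # step (c)
    if second_effect is not None:
        second = second_effect(second)      # step (d), optional
    return first + second                   # step (e)

# Step (a): input audio data, here a synthetic 5 Hz sine of 1000 samples.
mix = np.sin(2 * np.pi * 5 * np.linspace(0.0, 1.0, 1000, endpoint=False))
recombined = process(mix, first_effect=echo)
```

With no effect on either track, process returns the input unchanged up to floating-point rounding, which mirrors the case in which the decomposed tracks are complements.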

For example, a reverb effect may be applied to only a vocal component but not, or only with reduced intensity, to a drum component of the audio track, such as to provide new options for modifying the character of the sound of a piece of music by virtue of a reverb effect. In another example, when a PA system for music entertainment is controlled by a DJ, it becomes possible to apply a reverb effect to only a specific instrument, for example a drum, if this instrument is found to cause acoustic problems in the specific surrounding or room of the venue.

The second audio track may receive no audio effect at all such as to remain unchanged, i.e. audio data of the second audio track at the time of its generation in step (b) and at the time of its recombination in step (e) are equal. Alternatively, the second audio track may receive a predetermined second audio effect, which is different from the first audio effect.

In the present disclosure, an audio effect is defined by an effect type, such as reverb, chorus, delay, pitch scaling, tempo shifts, etc., and at least one effect parameter, such as a chorus intensity, delay time/intensity, pitch shift value (e.g. number of semitones or cents up/down), or a tempo shift value (e.g. sample rate change ratio). Furthermore, in the present disclosure, two audio effects are different if they differ in effect type or in at least one effect parameter. Thus the feature that the second audio effect is different from the first audio effect includes cases in which the second audio effect has an effect type which is different from the effect type of the first audio effect, as well as cases in which first and second audio effects have the same effect type but different effect parameters. In addition, in the present disclosure, although some audio effects may involve volume changes, mere volume changes do not qualify as audio effects.
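The definition above can be expressed as a small data structure. The following is a sketch only; the class name AudioEffect, the function effects_differ and the parameter keys are hypothetical names chosen for illustration and do not appear in the disclosure.

```python
from dataclasses import dataclass, field

@dataclass
class AudioEffect:
    """An effect type plus its effect parameters (illustrative names)."""
    effect_type: str
    parameters: dict = field(default_factory=dict)

def effects_differ(a: AudioEffect, b: AudioEffect) -> bool:
    """Two effects are different if the type or any parameter differs."""
    return a.effect_type != b.effect_type or a.parameters != b.parameters

shift_up_3 = AudioEffect("pitch_scaling", {"semitones": 3})
shift_up_5 = AudioEffect("pitch_scaling", {"semitones": 5})
reverb = AudioEffect("reverb", {"intensity": 0.4})

print(effects_differ(shift_up_3, shift_up_5))  # same type, different parameter
print(effects_differ(shift_up_3, reverb))      # different type
```

Under this reading, a pitch shift by 3 semitones on the first track and a pitch shift by 5 semitones on the second track count as two different audio effects.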

It should be noted that the first audio effect or any audio effect according to the invention may be applied to the entire audio track or only to a time interval of the audio track. Also effect automations are possible in which effect parameters are changed over the playing time.

In an embodiment of the invention, the method according to the first aspect of the invention may be used in a DJ equipment (such as a DJ software, a DJ device etc.) in order to allow the application of audio effects to only selected musical timbres of a song or to allow different audio effects to be applied to different musical timbres of a song.

In a further embodiment of the invention, the method according to the first aspect of the invention may be used in a sound editing environment such as a digital audio workstation (DAW) or similar software, which has a functionality to import a mixed mono or stereo audio file as input audio data and to edit the input audio data by application of one or more audio effects. The decomposed first and second audio tracks may then be edited differently and separately from one another, by applying (or not applying) audio effects such as time stretching, resampling, pitch shifting, reverb, delay, chorus, equalizer (EQ) etc. Such digital audio workstation may be used by producers or mixing/mastering engineers, in recording studios, postproduction studios or the like, and it allows to process mixed audio files (for example mixed songs obtained from music distribution services or record labels or from live recording a mixture of different instruments or other sound sources). Thus, even if individual tracks of certain musical timbres of a mixed song are not available, the user may obtain access to individual audio tracks of specific musical timbres for the purpose of applying desired audio effects in a more targeted and sophisticated manner.

After application of the first audio effect to the individual audio tracks, in particular the first audio track, the first audio track (with the first audio effect applied) and the second audio track (with no audio effect applied or a different audio effect applied) are recombined again to form a single audio track, which may be stored to a storage medium or further processed or played back.

In a preferred embodiment of the invention, the first audio effect is a pitch scaling effect changing the pitch of audio data of the first audio track while maintaining its playback duration / rate. The inventors have found that a pitch scaling effect achieves a much more natural result when applied only to some of the musical timbres of the piece of music. For example, drum timbres do not have a musical pitch and thus do not need to be pitch shifted, which avoids distortion of the drums, in particular when shifting the pitch by more than one or two semitones up or down. Thus, in such example, only harmonic instrumental timbres (timbres having melodic components or containing actual notes of different pitches according to the key/harmonies of the music) may be pitch shifted such as to shift the key of the piece of music to the desired key, while other timbres, such as drums or maybe spoken, non-melodic vocals, such as in Rap music, may remain unchanged with regard to their pitch.

The advantages of the present invention with regard to pitch scaling become particularly prominent, if, in a preferred embodiment, the pitch is shifted by more than 2 semitones, more preferably more than 5 semitones, even more preferably more than 11 semitones. In particular, pitch shifts by more than 5 semitones or even more than 11 semitones allow great freedom for matching the keys of two different songs.

The pitch scaling effect may shift the pitch of the audio data of the first audio track up or down by a predetermined number of semitones. This allows pitch shifts for musical purposes, such as to transpose a song to a different key, which might be useful for a DJ for matching the key of one song to the key of another song, in order to allow simultaneous playback of both songs for several artistic reasons, such as smooth crossfades between the two songs (without clashing of different harmonies).
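For orientation, a shift expressed in semitones or cents corresponds to a frequency ratio. The sketch below assumes twelve-tone equal temperament, which the disclosure does not state explicitly; the function names are illustrative. It computes only the ratio and does not implement a duration-preserving pitch scaling algorithm.

```python
def semitones_to_ratio(semitones: float) -> float:
    """Frequency ratio for a shift of the given number of semitones.

    Assumes equal temperament: 12 semitones per octave, octave = factor 2.
    """
    return 2.0 ** (semitones / 12.0)

def cents_to_ratio(cents: float) -> float:
    """Frequency ratio for a shift given in cents (100 cents per semitone)."""
    return 2.0 ** (cents / 1200.0)

# A shift of +12 semitones doubles every frequency; -12 halves it.
octave_up = semitones_to_ratio(12)
octave_down = semitones_to_ratio(-12)
```

A pitch scaling unit would apply such a ratio to the first audio track while keeping its playback duration, for example by combining time stretching with resampling.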

In another embodiment of the invention, the first audio effect may be a time shifting effect, in particular a quantization effect, which is adapted to insert time stretchings or time compressions or to cut out time intervals of the audio track at selected positions within an audio track in order to shift certain portions of the audio track such as to match a beat of the piece of music (timing corrections). For example, if one of the musical timbres is found to have incorrect timing or if timing of one of the timbres is to be modified for any other purposes, the user may do such timing changes on the desired audio track, for example the first audio track, without affecting the timing of the audio tracks of the other musical timbres. This feature is particularly relevant when the method is implemented in a digital audio workstation. For example, such method allows to correct or modify the timing of a vocal part of a song without changing the timing of the accompaniment part (remaining or non-vocal timbres of the song). In general, the present invention allows post production of mixed songs by granting access to the original (or near original) audio tracks representing the individual musical timbres (instruments, vocals, etc.) that make up the mixed song, even if, in a post-production situation, such original audio tracks are no longer available to the user.

Preferably, step b of decomposing the audio data generates a first audio track and a second audio track which are complements, such that their sum substantially equals the input audio data. This allows, in step (e) of recombining the first and second audio tracks, to easily return to the audio signal of the original input audio data by removing the audio effect applied to the first or second audio track, respectively.

In a further embodiment of the invention, the first musical timbre is a harmonic vocal timbre (a vocal timbre having melodic components or containing actual notes of different pitches according to the key/harmonies of the music) or a harmonic instrumental timbre (an instrumental timbre having melodic components or containing actual notes of different pitches according to the key/harmonies of the music) and/or the second musical timbre is a non-harmonic vocal timbre or a non-harmonic instrumental timbre, preferably a drum timbre. This allows to apply different audio effect settings to harmonic timbres and non-harmonic timbres, respectively, which improves the quality of effects that influence harmonic parameters of the piece of music, for example pitch scaling effects, harmonizer effects or flanger effects. Such effect types have been found by the inventors to achieve much more naturally sounding results when applied to only harmonic timbres of the music, such as guitars, vocals, bass, piano or synthesizer sounds, while keeping the remaining, non-harmonic timbres essentially free from such effect or applying the effect with reduced intensity.

In a further embodiment of the invention, in step b of decomposing the audio data, there is generated the first audio track, the second audio track, and a third audio track representing a third musical timbre, wherein the first audio track, the second audio track and the third audio track are complements, such that their sum substantially equals the input audio data, wherein in step c, the predetermined first audio effect is applied to the first audio track, but not to the second audio track and not to the third audio track, and wherein in step d, the first audio track with the first audio effect applied, the second audio track and the third audio track are recombined to obtain the recombined audio data. In this embodiment, the input audio data are separated into three audio tracks of different musical timbres, which allows different effect settings to be applied to three different components of the music.

Methods according to the first aspect of the invention use a step of decomposing input audio data to obtain first and second audio tracks containing different musical timbres. Several decomposing algorithms and services are known in the art as such, which allow decomposing audio signals to separate therefrom one or more signal components of different timbres, such as vocal components, drum components or instrumental components. Such decomposed signals and decomposed tracks have been used in the past to create certain artificial effects such as removing vocals from a song to create a karaoke version of a song, and they could be used in step (b) of the method of the present invention.

However, in preferred embodiments of the present invention, step b of decomposing the input audio data may include processing the input audio data by an AI system containing a trained neural network. An AI system may implement a convolutional neural network (CNN), which has been trained by a plurality of data sets for example including a vocal track, a harmonic/instrumental track and a mix of the vocal track and the harmonic/instrumental track. Examples of conventional AI systems capable of separating source tracks such as a singing voice track from a mixed audio signal include: Pretet, “Singing Voice Separation: A study on training data”, Acoustics, Speech and Signal Processing (ICASSP), 2019, pages 506-510; “spleeter” - an open-source tool provided by the music streaming company Deezer based on the teaching of Pretet above; “PhonicMind” (https://phonicmind.com) - a voice and source separator based on deep neural networks; “Open-Unmix” - a music source separator based on deep neural networks in the frequency domain; or “Demucs” by Facebook AI Research - a music source separator based on deep neural networks in the waveform domain. These tools accept music files in standard formats (for example MP3, WAV, AIFF) and decompose the song to provide decomposed/separated tracks of the song, for example a vocal track, a bass track, a drum track, an accompaniment track or any mixture thereof.

In a further preferred embodiment of the invention, output data obtained from the recombined audio data are further processed, preferably stored in a storage unit, and/or played back by a playback unit and/or mixed with second-song output data, wherein obtaining the recombined audio data and/or further processing the output data is preferably performed within a time smaller than 5 seconds, preferably smaller than 200 milliseconds, after the start of decomposing the input audio data. This has the advantage that the method may run as a continuous process at the time at which the effect is actually needed, for example during a live performance of a DJ. For example, if the time between decomposition and further processing of the audio data is smaller than 200 milliseconds, a DJ can perform a pitch shift basically immediately during a live performance.

In another embodiment of the invention, the method may further comprise the steps of determining a first key of the first piece of music of the input audio data, providing second-song input data representing a second piece of music, determining a second key of the second piece of music of the second-song audio data, and determining a pitch shift value based on the first key and the second key, wherein in step (c), the pitch of the first audio track is shifted by the pitch shift value, while maintaining the pitch of the second track, wherein the method preferably further comprises a step of mixing output data obtained from the recombined audio data with second-song output data obtained from the second-song input data, such as to obtain mixed output data, and wherein the method preferably further comprises a step of playing back playback data obtained from the mixed output data. In such embodiment, the method is specifically suited for an application by a DJ, for example in a DJ equipment, when the keys of two songs are to be matched automatically in order to allow for smooth transitions between the two songs. According to an advantage of the invention, sound artefacts or distortions can be avoided or substantially reduced even when the key of a song is shifted by more than one or two semitones.
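The disclosure does not specify how the pitch shift value is derived from the two keys. One plausible approach, given here purely as an assumption, is to represent each key by the pitch class of its tonic and to choose the signed semitone shift of smallest magnitude that maps the first tonic onto the second. The names NOTE_NAMES and pitch_shift_value are illustrative, and the sketch ignores mode (major/minor) and enharmonic spellings such as "Db".

```python
# Pitch classes in ascending semitone order (sharps only, for simplicity).
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def pitch_shift_value(first_key: str, second_key: str) -> int:
    """Signed semitone shift moving first_key's tonic onto second_key's tonic.

    The result lies in the range -6..5, i.e. the shortest way around the
    circle of twelve pitch classes (a tritone is reported as -6).
    """
    first = NOTE_NAMES.index(first_key)
    second = NOTE_NAMES.index(second_key)
    return (second - first + 6) % 12 - 6

# Example: matching a song in C to a song in D requires a shift of +2,
# while matching C to B requires -1 rather than +11.
shift_c_to_d = pitch_shift_value("C", "D")
shift_c_to_b = pitch_shift_value("C", "B")
```

The resulting value would then be used as the effect parameter of the pitch scaling effect applied to the first audio track only.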

According to a second aspect of the present invention, the above object is achieved by a device for processing music audio data, comprising an input unit for receiving input audio data representing a first piece of music containing a mixture of predetermined musical timbres, a decomposition unit for decomposing the input audio data received from the input unit to generate at least a first audio track representing a first musical timbre selected from the predetermined musical timbres, and a second audio track representing a second musical timbre selected from the predetermined musical timbres, an effect unit for applying a predetermined first audio effect to the first audio track and for applying no audio effect or a predetermined second audio effect, which is different from the first audio effect, to the second audio track, and a recombination unit for recombining the first audio track with the second audio track to obtain recombined audio data.

A device of the second aspect can be formed by a computer having a microprocessor, a storage unit, an input interface and an output interface, wherein at least the input unit, the decomposition unit, the first effect unit and the recombination unit are formed by a software program running on the computer. In this manner, the computer is preferably adapted to carry out a method according to the first aspect of the invention.

In a device of the second aspect of the invention, the first effect unit may be a pitch scaling unit for changing the pitch of audio data of the first audio track while maintaining its playback duration or playback rate. Such device may show particular advantages when forming part of DJ equipment in which transposition of a song from one key to another is desired. It has been found that sound distortions by pitch scaling can be reduced or avoided, if the pitch scaling effect is applied only to some of the musical timbres included in a piece of music.

The decomposition unit preferably includes an AI system containing a trained neural network, wherein the neural network is trained to separate audio data of a predetermined musical timbre from audio data containing a mixture of different musical timbres. As described above, such AI systems are able to separate different musical timbres of a song with high quality.

A device of the second aspect of the invention may further comprise a storage unit adapted to store the output data, which allows further processing of the output data, for example at any later point in time. In another embodiment, the device may have a playback unit adapted to play back the output data, such that the device is prepared to be used as a music player or for public audition of music through connection to a PA system. In another embodiment, the device may have a mixing unit adapted to mix the output data with second-song output data, which allows the use of the device as DJ equipment.

In another embodiment, the device may further comprise a first key detection unit for determining a first key of the first piece of music of the input audio data, a second-song input unit for providing second-song input data representing a second piece of music, a second key detection unit for determining a second key of the second piece of music of the second-song audio data, a pitch shift calculation unit for determining a pitch shift value based on the first key and the second key, wherein the first effect unit is a pitch scaling unit adapted to shift the pitch of the first audio track by the pitch shift value, while maintaining the pitch of the second track. In this manner it is possible to match the keys of two songs automatically to enable simultaneous playback of both songs or parts thereof in a DJ environment without incurring sound distortions due to pitch scaling even if the keys of the songs differ from one another by more than one semitone.

In an embodiment of the invention the device is a DJ device. For use as a DJ device, the device may then further comprise a mixing unit adapted to mix output data obtained from the recombined audio data with second-song output data obtained from the second-song input data, such as to obtain mixed output data, and preferably a playback unit adapted to play back playback data obtained from the mixed output data. To obtain a fully integrated DJ system, in which the automatic pitch scaling described above is directly available as a feature, the device may further comprise a second-song input unit for providing second-song input data representing a second piece of music, a mixing unit adapted to mix output data obtained from the recombined audio data with second-song output data obtained from the second-song input data, such as to obtain mixed output data, and a crossfading unit having a crossfading controller that can be manipulated by a user to assume a control position within a control range, wherein the crossfading unit sets a first volume level of the output data and a second volume level of the second-song output data depending on the control position of the crossfading controller, such that the first volume level is maximum and the second volume level is minimum when the crossfading controller is at one end point of the control range, and the first volume level is minimum and the second volume level is maximum when the crossfading controller is at the other end point of the control range.

In another embodiment of the invention, the device of the second aspect may be a computer running a digital audio workstation (DAW).

According to a third aspect of the present invention, the above-mentioned object of the invention is achieved by software adapted to run on a computer to control the computer such as to carry out a method of the first aspect of the invention. Such software may be executed/run on known operating systems and platforms, in particular iOS, macOS, Android or Windows running on computers, tablets, and smartphones. The software may be a digital audio workstation (DAW) or a DJ software.

The invention will be further explained by way of a specific embodiment shown in the attached drawing in which

Figure 1 shows a function diagram of a device according to the specific embodiment.

In Figure 1, components of a device according to the specific embodiment are shown, which may all be integrated as hardware or software modules installed on a computer, for example a tablet computer or a smartphone. Alternatively, these hardware or software modules may be parts of a stand-alone DJ device, which includes a housing on which control elements such as control knobs or sliders are mounted to control functions of the device.

The device may include an input interface 12 for receiving input audio data or audio signals. The input interface 12 may be adapted to receive digital audio data as audio files via a network or from a storage medium. Furthermore, the input interface 12 may be configured to decode or decompress audio data, when they are received as encoded or compressed data files. Alternatively, the input interface 12 may comprise an analog-digital converter to sample analog data received from an analog audio input (for example a vinyl player or a microphone) and to obtain digital audio data as input audio data.

The input audio data provided by input interface 12 are then routed to an input section 14 which contains a first-song input unit 16 and a second-song input unit 18, which are adapted to provide audio data of two different songs according to a user selection. In particular, the device may have a user input interface, for example a touch screen, to allow a user to choose songs from a song database and to load them into the first-song input unit 16 or the second-song input unit 18. The audio file of the selected song may be completely loaded into a local memory of the device or portions of the audio file may be continuously streamed (for example via internet from a remote music distribution platform) and further processed before receiving the entire file. In this way, the first-song input unit 16 provides first-song audio input data according to a first song selected by a user, and the second-song input unit 18 provides second-song audio input data according to a second song selected by a user.

The first-song audio input data may then be routed to a first key detection unit 20 to detect a first key of the first song, while the second-song audio input data are routed to a second key detection unit 22 to detect a second key of the second song. First and second key detection units 20, 22 are preferably arranged to detect a key or root or fundamental tone of the piece of music according to the 12 semitones of the chromatic scale (e.g. one of C, C sharp, D, D sharp, E, F, F sharp, G, G sharp, A, A sharp, B), including the mode (major or minor). A conventional key detection module may be used as first and second key detection unit, respectively. Furthermore, first and second keys may be detected one after another by one and the same key detection unit.
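
The source leaves the key detection method open ("a conventional key detection module"). Purely as an illustration, the following sketch uses template matching against the Krumhansl-Kessler key profiles, which is one commonly used technique and is not stated in the source. It assumes that a 12-bin chroma vector (energy per pitch class, index 0 = C) has already been computed from the audio data; the names `estimate_key`, `NOTE_NAMES` and the chroma input are hypothetical.

```python
import math

# Pitch classes of the chromatic scale ("C#" stands for C sharp, and so on).
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

# Krumhansl-Kessler key profiles, index 0 = tonic (assumed technique, not from the source).
MAJOR_PROFILE = [6.35, 2.23, 3.48, 2.33, 4.38, 4.09, 2.52, 5.19, 2.39, 3.66, 2.29, 2.88]
MINOR_PROFILE = [6.33, 2.68, 3.52, 5.38, 2.60, 3.53, 2.54, 4.75, 3.98, 2.69, 3.34, 3.17]


def _correlation(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def estimate_key(chroma):
    """Return (tonic_name, mode) for a 12-bin chroma vector (index 0 = C)."""
    if len(chroma) != 12:
        raise ValueError("chroma must have exactly 12 bins")
    best = None
    for tonic in range(12):
        # Rotate the chroma so that the candidate tonic sits at index 0.
        rotated = [chroma[(tonic + i) % 12] for i in range(12)]
        for mode, profile in (("major", MAJOR_PROFILE), ("minor", MINOR_PROFILE)):
            score = _correlation(rotated, profile)
            if best is None or score > best[0]:
                best = (score, NOTE_NAMES[tonic], mode)
    return best[1], best[2]
```

The same function can be called once per song, which corresponds to the option of detecting the first and second keys one after another with one and the same key detection unit.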

First and second keys may be input into a pitch shift calculation unit 24, which calculates a pitch shift value based on a difference between the two keys. The pitch shift value may be a number of semitones by which the first key needs to be shifted up or down in order to match the second key. Alternatively, the pitch shift value may be a number of semitones by which the first key needs to be shifted up or down in order to assume a key that differs from the second key by a fifth. It has been found that two songs may be mixed and played simultaneously without audible harmonic interference, for example during a crossfading between the two songs, if both songs are in the same key or if their keys differ by a fifth.

After passing the key detection unit 20, the first-song audio input data are routed to a decomposition unit 26 which contains an AI system having a trained neural network adapted to decompose the first-song audio input data to generate at least a first audio track representing a first musical timbre, a second audio track representing a second musical timbre, and a third audio track representing a third musical timbre. In the present example, the first musical timbre may be a harmonic timbre (e.g. including a sum of vocals, guitars, keys, synthesizers, etc.), the second musical timbre may be a non-harmonic timbre, such as a percussion timbre, and the third musical timbre may be another non-harmonic timbre, such as a drum timbre.
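
The calculation performed by the pitch shift calculation unit 24 can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of the source: keys are reduced to their root pitch class (the mode is ignored), the smallest shift in the range of -6 to +5 semitones is preferred, and "differs by a fifth" is taken to mean a fifth above or below the second key. The function names and the `allow_fifth` parameter are hypothetical.

```python
# Pitch classes of the chromatic scale ("C#" stands for C sharp, and so on).
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def signed_interval(from_pc, to_pc):
    """Smallest signed semitone shift between two pitch classes, in [-6, 5]."""
    return (to_pc - from_pc + 6) % 12 - 6


def pitch_shift_value(first_key, second_key, allow_fifth=False):
    """Semitones by which the first song is shifted to fit the second song.

    With allow_fifth=True, target keys a fifth above or below the second key
    are accepted as well (assumption), and the smallest shift is returned.
    """
    first_pc = NOTE_NAMES.index(first_key)
    second_pc = NOTE_NAMES.index(second_key)
    targets = [second_pc]
    if allow_fifth:
        targets += [(second_pc + 7) % 12, (second_pc - 7) % 12]
    shifts = [signed_interval(first_pc, target) for target in targets]
    return min(shifts, key=abs)
```

For example, matching a song in C to a song in A requires a shift of -3 semitones, whereas accepting a key a fifth away reduces the shift to +2 semitones (C to D, with A a fifth above D).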

Only the first audio track representing the first musical timbre is then routed into a pitch shifting unit 28, which shifts the pitch of the audio data by a predetermined number of semitones up or down, based on the pitch shift value received from the pitch shift calculation unit 24. The second audio track and the third audio track are not routed to the pitch shifting unit 28 but rather bypass the pitch shifting unit 28. Thus, in the present example, only the first audio track including the harmonic timbres is subjected to pitch shifting, whereas the second and third tracks, which include the non-harmonic timbres, maintain their pitch.
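
A pitch shifter is typically parameterised by a frequency ratio rather than by a number of semitones. Assuming twelve-tone equal temperament (which the source implies through its reference to the 12 semitones of the chromatic scale but does not state), the conversion is a factor of 2^(n/12) for n semitones. The function name below is hypothetical.

```python
def semitones_to_ratio(semitones):
    """Frequency ratio for a shift of `semitones` in twelve-tone equal temperament."""
    return 2.0 ** (semitones / 12.0)
```

A shift of +12 semitones doubles every frequency, and a shift of +7 semitones (a fifth) corresponds to a ratio of approximately 1.498.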

The first audio track (now pitch-shifted), the second audio track and the third audio track are then routed into a recombination unit 30, in which they are recombined into a single audio track (mono or stereo track). Recombination may be performed by simply mixing the audio data.
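
The routing around the pitch shifting unit 28 and the recombination in unit 30 can be sketched as follows. The sketch assumes that each decomposed track is a list of mono samples of equal length, and it does not implement pitch scaling itself: `pitch_shifter` is a caller-supplied callable, and all names are hypothetical.

```python
def recombine(tracks):
    """Mix equal-length tracks by summing them sample by sample."""
    lengths = {len(track) for track in tracks}
    if len(lengths) != 1:
        raise ValueError("all tracks must have the same length")
    return [sum(samples) for samples in zip(*tracks)]


def process_stems(stems, shifted_names, pitch_shifter, semitones):
    """Apply pitch_shifter only to the named stems; all other stems bypass it."""
    processed = []
    for name, samples in stems.items():
        if name in shifted_names:
            # Routed through the pitch shifter (e.g. the harmonic track).
            processed.append(pitch_shifter(samples, semitones))
        else:
            # Bypass: the track keeps its original pitch.
            processed.append(list(samples))
    return recombine(processed)
```

In the example of the source, `shifted_names` would contain only the harmonic track, so that the percussion and drum tracks reach the recombination step unchanged.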

The recombined audio data obtained from recombination unit 30 may then be passed through a first-song effect unit 32 in order to apply some other audio effect, such as a high pass or low pass filter, or an EQ filter, if desired, and to output the result as first-song output data.

On the other hand, the second-song audio input data obtained from the second-song input unit 18 may be passed to any desired effect units as well, similar to those described for the first embodiments. In the illustrated example, the second-song audio input data are passed through a second-song effect unit 34 in order to apply an audio effect, such as a high pass or low pass filter, or an EQ filter, and to output the result as second-song output data. First-song and second-song output data may then be passed through a tempo matching unit 36 which detects a tempo (BPM value) of both songs and changes the tempo of at least one of the two songs (without changing its pitch) such that both songs have matching tempi. Matching tempi means that the BPM value of one of the two songs equals the BPM value or a multiple of the BPM value of the other song. Such tempo matching units are known in the art as such.
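
The definition of matching tempi given above can be illustrated with a short calculation. The sketch assumes that both BPM values are already known, that only half, equal and double tempo are considered as candidate targets, and that the candidate requiring the smallest relative tempo change is chosen; none of these choices is taken from the source, and the function name is hypothetical.

```python
import math


def tempo_stretch_factor(bpm, reference_bpm, multiples=(0.5, 1.0, 2.0)):
    """Tempo factor that brings `bpm` to the nearest allowed multiple of `reference_bpm`."""
    if bpm <= 0 or reference_bpm <= 0:
        raise ValueError("BPM values must be positive")
    targets = [reference_bpm * m for m in multiples]
    # Compare on a log scale so that speeding up and slowing down count equally.
    target = min(targets, key=lambda t: abs(math.log(t / bpm)))
    return target / bpm
```

A song at 70 BPM played against a song at 128 BPM would thus be slowed to 64 BPM (factor 64/70) rather than sped up to 128 BPM.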

Afterwards, first-song and second-song output data (matched in tempo, if applicable) may be routed into a mixing unit 38, in which they are mixed with one another to obtain mixed output data (mono or stereo) that contain a sum of both signals. Mixing unit 38 may contain or may be connected to a crossfader, which can be manipulated by a user to assume a control position within a control range, wherein the crossfader sets a first volume level of the first-song output data and a second volume level of the second-song output data depending on the control position of the crossfading controller, such that the first volume level is maximum and the second volume level is minimum when the crossfading controller is at one end point of the control range, and the first volume level is minimum and the second volume level is maximum when the crossfading controller is at the other end point of the control range. Mixing unit 38 then mixes (sums) the first-song and second-song output data according to the first volume level and the second volume level, respectively, to obtain mixed output data (mono or stereo).
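
The behaviour of the crossfader and of mixing unit 38 can be sketched as follows. The source only fixes the volume levels at the two end points of the control range; the linear curve in between, the normalised control range of 0.0 to 1.0, and the function names are assumptions made for illustration.

```python
def crossfade_levels(position):
    """Map a control position in [0.0, 1.0] to (first_level, second_level).

    Position 0.0 gives the first song at full volume, position 1.0 the second.
    A linear curve is assumed between the end points.
    """
    if not 0.0 <= position <= 1.0:
        raise ValueError("position must lie within the control range [0.0, 1.0]")
    return 1.0 - position, position


def mix(first, second, position):
    """Sum two sample sequences weighted by the crossfader levels.

    The result is as long as the shorter of the two inputs.
    """
    first_level, second_level = crossfade_levels(position)
    return [first_level * a + second_level * b for a, b in zip(first, second)]
```

Other curves, for example an equal-power curve, would satisfy the same end-point conditions.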

The mixed output data may then be passed through a sum effect unit 40 to apply any further audio effect, if desired. The output of the sum effect unit 40 may be denoted as playback data and may be played back by an output audio interface 42. Output audio interface 42 may include an audio buffer and a digital-to-analog converter to generate a sound signal. Alternatively, the playback data may be transmitted to another device for playback, storage or further processing.