

Title:
IDENTIFICATION OF AUDIO COMPONENTS IN AN AUDIO MIX
Document Type and Number:
WIPO Patent Application WO/2019/053544
Kind Code:
A1
Abstract:
A method for audio processing includes training a computerized classifier (56, 58) to recognize respective audio outputs of a predefined set of musical instruments (30, 32, 34). Upon receiving audio data, the classifier outputs a vector (92) of respective scores (94) for the musical instruments, indicating a likelihood that each musical instrument played in the audio data. An audio segment is input to the classifier, which outputs the vector of the respective scores for the audio segment. Different, respective threshold values are set for the different musical instruments. The respective scores of the musical instruments for the audio segment are compared to the respective threshold values, and one or more of the musical instruments for which the respective scores are no less than the respective threshold values are identified as having played in the audio segment.

Inventors:
MOR YOAV (IL)
KOHN BENJAMIN (IL)
Application Number:
PCT/IB2018/056693
Publication Date:
March 21, 2019
Filing Date:
September 02, 2018
Assignee:
INTUITIVE AUDIO LABS LTD (IL)
International Classes:
G06F3/16; G06N3/02; G06N3/08; G10L15/16; G10L25/30
Foreign References:
US20170054779A12017-02-23
Other References:
KUBERA, E. ET AL.: "Recognition of Instrument Timbres in Real Polytimbral Audio Recordings", JOINT EUROPEAN CONFERENCE ON MACHINE LEARNING AND KNOWLEDGE DISCOVERY IN DATABASES, 31 August 2010 (2010-08-31), Berlin , Heidelberg, pages 97 - 110, XP047463271, Retrieved from the Internet
KUBERA, E. ET AL.: "Mining Audio Data for Multiple Instrument Recognition in Classical Music", INTERNATIONAL WORKSHOP ON NEW FRONTIERS IN MINING COMPLEX PATTERNS, 6 July 2014 (2014-07-06), pages 246 - 260, XP055582947, Retrieved from the Internet
Attorney, Agent or Firm:
D. KLIGLER I. P. SERVICES LTD. (IL)
Claims:
CLAIMS

1. A method for audio processing, comprising:

training a computerized classifier to recognize respective audio outputs of a predefined set of musical instruments, such that upon receiving audio data, the classifier outputs a vector of respective scores for the musical instruments in the set, each score indicating a likelihood that a corresponding musical instrument played in the audio data;

inputting to the classifier an audio segment comprising music played by one or more of the musical instruments, and receiving from the classifier the vector of the respective scores for the audio segment;

categorizing a musical genre of the audio segment;

setting different, respective threshold values for the musical instruments in the set responsively to the musical genre of the audio segment; and

comparing the respective scores of the musical instruments for the audio segment to the respective threshold values, and identifying one or more of the musical instruments for which the respective scores are no less than the respective threshold values as having played in the audio segment.

2. The method according to claim 1, wherein categorizing the musical genre comprises calculating a signature of the audio segment by finding spectral peaks in the audio segment and temporal spacing between the peaks, and looking up the genre using the calculated signature.

3. The method according to claim 1, wherein identifying the one or more of the musical instruments comprises:

identifying a first musical instrument as having played in the audio segment based on a respective score and threshold value for the first musical instrument;

finding one or more second musical instruments having an affinity with the first musical instrument;

responsively to the affinity, reducing the respective threshold values that were set for the one or more second musical instruments; and

identifying one or more of the second musical instruments having respective scores no less than the reduced respective threshold values as having played in the audio segment.

4. The method according to claim 1, wherein inputting the audio segment comprises:

inputting a stereophonic segment having left and right audio channels;

upmixing the left and right audio channels to derive a mid-channel and two side-channel signals; and

applying the classifier to each of the mid-channel and two side-channel signals.

5. A method for audio processing, comprising:

training a computerized classifier to recognize respective audio outputs of a predefined set of musical instruments, such that upon receiving audio data, the classifier outputs a vector of respective scores for the musical instruments in the set, each score indicating a likelihood that a corresponding musical instrument played in the audio data;

inputting to the classifier an audio segment comprising music played by one or more of the musical instruments, and receiving from the classifier the vector of the respective scores for the audio segment;

comparing the respective scores of the musical instruments for the audio segment to a threshold value, and identifying a first musical instrument having a respective score that is no less than the threshold value as having played in the audio segment;

finding one or more second musical instruments having an affinity with the first musical instrument;

responsively to the affinity, reducing the threshold value for the one or more second musical instruments; and

identifying one or more of the second musical instruments for which the respective scores are no less than the reduced respective threshold value as having played in the audio segment.

6. The method according to claim 5, wherein inputting the audio segment comprises:

inputting a stereophonic segment having left and right audio channels;

upmixing the left and right audio channels to derive a mid-channel and two side-channel signals; and

applying the classifier to each of the mid-channel and two side-channel signals.

7. A method for audio processing, comprising:

training a computerized classifier to recognize respective audio outputs of a predefined set of musical instruments, such that upon receiving audio data, the classifier outputs a vector of respective scores for the musical instruments in the set, each score indicating a likelihood that a corresponding musical instrument played in the audio data;

receiving a stereophonic audio segment having left and right audio channels and comprising music played by one or more of the musical instruments;

upmixing the left and right audio channels to derive a mid-channel and two side-channel signals;

inputting to the classifier each of the mid-channel and two side-channel signals, and receiving from the classifier the vector of the respective scores for the audio segment based on the mid-channel and two side-channel signals; and

comparing the respective scores of the musical instruments for the audio segment to at least one threshold value, and identifying one or more of the musical instruments for which the respective scores are no less than the at least one threshold value as having played in the audio segment.

8. The method according to any of claims 1-7, and comprising:

storing a plurality of audio segments and the musical instruments identified as having played in each of the audio segments in a database; and

searching the database to find the audio segments in which a given musical instrument played.

9. The method according to any of claims 1-7, wherein the computerized classifier comprises a convolutional neural network.

10. The method according to claim 9, wherein inputting the audio segment comprises:

dividing the audio segment into a sequence of time windows;

computing a respective frequency spectrum of the audio data in each of the time windows; and

extracting amplitude values for a predefined series of frequencies in the respective frequency spectrum of each of the time windows for input to the convolutional neural network.

11. Apparatus for audio processing, comprising:

an interface, which is configured to receive an audio segment comprising music played by one or more musical instruments within a predefined set; and

a processor, which is configured to run a computerized classifier that has been trained to recognize respective audio outputs of the predefined set of musical instruments, such that upon receiving the audio segment from the interface, the classifier outputs a vector of respective scores for the musical instruments in the set, each score indicating a likelihood that a corresponding musical instrument played in the audio segment,

wherein the processor is further configured to categorize a musical genre of the audio segment, to set different, respective threshold values for the musical instruments in the set responsively to the musical genre of the audio segment, to compare the respective scores of the musical instruments for the audio segment to the respective threshold values, and to identify one or more of the musical instruments for which the respective scores are no less than the respective threshold values as having played in the audio segment.

12. The apparatus according to claim 11, wherein the processor is configured to categorize the musical genre by calculating a signature of the audio segment by finding spectral peaks in the audio segment and temporal spacing between the peaks, and looking up the genre using the calculated signature.

13. The apparatus according to claim 11, wherein the processor is configured to identify a first musical instrument as having played in the audio segment based on a respective score and threshold value for the first musical instrument, to find one or more second musical instruments having an affinity with the first musical instrument, and responsively to the affinity, to reduce the respective threshold values that were set for the one or more second musical instruments and to identify one or more of the second musical instruments having respective scores no less than the reduced respective threshold values as having played in the audio segment.

14. The apparatus according to claim 11, wherein the audio segment comprises a stereophonic segment having left and right audio channels, and wherein the processor is configured to upmix the left and right audio channels to derive a mid-channel and two side-channel signals, and to apply the classifier to each of the mid-channel and two side-channel signals.

15. Apparatus for audio processing, comprising:

an interface, which is configured to receive an audio segment comprising music played by one or more musical instruments within a predefined set; and

a processor, which is configured to run a computerized classifier that has been trained to recognize respective audio outputs of the predefined set of musical instruments, such that upon receiving the audio segment from the interface, the classifier outputs a vector of respective scores for the musical instruments in the set, each score indicating a likelihood that a corresponding musical instrument played in the audio segment,

wherein the processor is further configured to compare the respective scores of the musical instruments for the audio segment to a threshold value, to identify a first musical instrument having a respective score that is no less than the threshold value as having played in the audio segment, to find one or more second musical instruments having an affinity with the first musical instrument and responsively to the affinity, to reduce the threshold value for the one or more second musical instruments, and to identify one or more of the second musical instruments for which the respective scores are no less than the reduced respective threshold value as having played in the audio segment.

16. The apparatus according to claim 15, wherein the audio segment comprises a stereophonic segment having left and right audio channels, and wherein the processor is configured to upmix the left and right audio channels to derive a mid-channel and two side-channel signals, and to apply the classifier to each of the mid-channel and two side-channel signals.

17. Apparatus for audio processing, comprising:

an interface, which is configured to receive a stereophonic audio segment having left and right audio channels and comprising music played by one or more musical instruments within a predefined set; and

a processor, which is configured to upmix the left and right audio channels to derive a mid- channel and two side-channel signals, and to input each of the mid-channel and two side-channel signals to a computerized classifier that has been trained to recognize respective audio outputs of the predefined set of musical instruments, such that upon receiving the audio segment from the interface, the classifier outputs a vector of respective scores for the musical instruments in the set, each score indicating a likelihood that a corresponding musical instrument played in the audio segment,

wherein the processor is configured to receive from the classifier the vector of the respective scores for the audio segment based on the mid-channel and two side-channel signals, to compare the respective scores of the musical instruments for the audio segment to at least one threshold value, and to identify one or more of the musical instruments for which the respective scores are no less than the at least one threshold value as having played in the audio segment.

18. The apparatus according to any of claims 11-17, wherein the processor is configured to store a plurality of audio segments and the musical instruments identified as having played in each of the audio segments in a database, and to search the database to find the audio segments in which a given musical instrument played.

19. The apparatus according to any of claims 11-17, wherein the computerized classifier comprises a convolutional neural network.

20. The apparatus according to claim 19, wherein the processor is configured to divide the audio segment into a sequence of time windows, to compute a respective frequency spectrum of the audio data in each of the time windows, and to extract amplitude values for a predefined series of frequencies in the respective frequency spectrum of each of the time windows for input to the convolutional neural network.

21. A computer software product, comprising a non-transitory computer-readable medium in which program instructions are stored, which instructions, when read by a computer, cause the computer to receive an audio segment comprising music played by one or more musical instruments within a predefined set, and to run a computerized classifier that has been trained to recognize respective audio outputs of the predefined set of musical instruments, such that upon receiving the audio segment from the interface, the classifier outputs a vector of respective scores for the musical instruments in the set, each score indicating a likelihood that a corresponding musical instrument played in the audio segment,

wherein the instructions further cause the computer to categorize a musical genre of the audio segment, to set different, respective threshold values for the musical instruments in the set responsively to the musical genre of the audio segment, to compare the respective scores of the musical instruments for the audio segment to the respective threshold values, and to identify one or more of the musical instruments for which the respective scores are no less than the respective threshold values as having played in the audio segment.

22. The product according to claim 21, wherein the instructions cause the computer to categorize the musical genre by calculating a signature of the audio segment by finding spectral peaks in the audio segment and temporal spacing between the peaks, and looking up the genre using the calculated signature.

23. The product according to claim 21, wherein the instructions cause the computer to identify a first musical instrument as having played in the audio segment based on a respective score and threshold value for the first musical instrument, to find one or more second musical instruments having an affinity with the first musical instrument, and responsively to the affinity, to reduce the respective threshold values that were set for the one or more second musical instruments and to identify one or more of the second musical instruments having respective scores no less than the reduced respective threshold values as having played in the audio segment.

24. The product according to claim 21, wherein the audio segment comprises a stereophonic segment having left and right audio channels, and wherein the instructions cause the computer to upmix the left and right audio channels to derive a mid-channel and two side-channel signals, and to apply the classifier to each of the mid-channel and two side-channel signals.

25. A computer software product, comprising a non-transitory computer-readable medium in which program instructions are stored, which instructions, when read by a computer, cause the computer to receive an audio segment comprising music played by one or more musical instruments within a predefined set and to run a computerized classifier that has been trained to recognize respective audio outputs of the predefined set of musical instruments, such that upon receiving the audio segment from the interface, the classifier outputs a vector of respective scores for the musical instruments in the set, each score indicating a likelihood that a corresponding musical instrument played in the audio segment,

wherein the instructions further cause the computer to compare the respective scores of the musical instruments for the audio segment to a threshold value, to identify a first musical instrument having a respective score that is no less than the threshold value as having played in the audio segment, to find one or more second musical instruments having an affinity with the first musical instrument and responsively to the affinity, to reduce the threshold value for the one or more second musical instruments, and to identify one or more of the second musical instruments for which the respective scores are no less than the reduced respective threshold value as having played in the audio segment.

26. The product according to claim 25, wherein the audio segment comprises a stereophonic segment having left and right audio channels, and wherein the instructions cause the computer to upmix the left and right audio channels to derive a mid-channel and two side-channel signals, and to apply the classifier to each of the mid-channel and two side-channel signals.

27. A computer software product, comprising a non-transitory computer-readable medium in which program instructions are stored, which instructions, when read by a computer, cause the computer to receive a stereophonic audio segment having left and right audio channels and comprising music played by one or more musical instruments within a predefined set, and to upmix the left and right audio channels to derive a mid-channel and two side-channel signals, and to input each of the mid-channel and two side-channel signals to a computerized classifier that has been trained to recognize respective audio outputs of the predefined set of musical instruments, such that upon receiving the audio segment from the interface, the classifier outputs a vector of respective scores for the musical instruments in the set, each score indicating a likelihood that a corresponding musical instrument played in the audio segment,

wherein the instructions further cause the computer to receive from the classifier the vector of the respective scores for the audio segment based on the mid-channel and two side-channel signals, to compare the respective scores of the musical instruments for the audio segment to at least one threshold value, and to identify one or more of the musical instruments for which the respective scores are no less than the at least one threshold value as having played in the audio segment.

28. The product according to any of claims 21-27, wherein the instructions cause the computer to store a plurality of audio segments and the musical instruments identified as having played in each of the audio segments in a database, and to search the database to find the audio segments in which a given musical instrument played.

29. The product according to any of claims 21-27, wherein the computerized classifier comprises a convolutional neural network.

30. The product according to claim 29, wherein the instructions cause the computer to divide the audio segment into a sequence of time windows, to compute a respective frequency spectrum of the audio data in each of the time windows, and to extract amplitude values for a predefined series of frequencies in the respective frequency spectrum of each of the time windows for input to the convolutional neural network.

Description:
IDENTIFICATION OF AUDIO COMPONENTS IN AN AUDIO MIX

CROSS-REFERENCE TO RELATED APPLICATION

This application claims the benefit of U.S. Provisional Patent Application 62/557,768, filed September 13, 2017, which is incorporated herein by reference.

FIELD OF THE INVENTION

The present invention relates generally to computerized processing of audio data, and particularly to methods, systems and software for classifying and searching through audio data files.

BACKGROUND

Although Web-based search engines have traditionally used semantic searching, some search engines are now capable of searching through databases of musical works based on musical features, such as the melody and changes in amplitude. For example, the "Shazam" application, developed by Shazam Entertainment Ltd., is able to identify music, movies, advertising, and television shows, based on short audio segments. Shazam identifies songs by computing an audio "fingerprint," based on a time-frequency spectrogram, of an audio input provided by a user, and then matching this fingerprint against a database of fingerprints of known songs.

Some methods for automatic classification of audio sources, including music, make use of neural networks. For example, U.S. Patent Application Publication 2007/0083365 describes a neural network classifier, which is said to provide the ability to separate and categorize multiple arbitrary and previously unknown audio sources down-mixed to a single monophonic audio signal. This is accomplished by breaking the monophonic audio signal into baseline frames (possibly overlapping), windowing the frames, extracting a number of descriptive features in each frame, and employing a pre-trained nonlinear neural network as a classifier.

SUMMARY

Embodiments of the present invention that are described hereinbelow provide improved methods, systems and software for identifying audio components, such as passages played by different instruments in audio recordings.

There is therefore provided, in accordance with an embodiment of the invention, a method for audio processing, which includes training a computerized classifier to recognize respective audio outputs of a predefined set of musical instruments, such that upon receiving audio data, the classifier outputs a vector of respective scores for the musical instruments in the set. Each score indicates a likelihood that a corresponding musical instrument played in the audio data. An audio segment is input to the classifier, including music played by one or more of the musical instruments, and the vector of the respective scores for the audio segment is received from the classifier. A musical genre of the audio segment is categorized, and different, respective threshold values are set for the musical instruments in the set responsively to the musical genre of the audio segment. The respective scores of the musical instruments for the audio segment are compared to the respective threshold values, and one or more of the musical instruments for which the respective scores are no less than the respective threshold values are identified as having played in the audio segment.

In some embodiments, categorizing the musical genre includes calculating a signature of the audio segment by finding spectral peaks in the audio segment and temporal spacing between the peaks, and looking up the genre using the calculated signature.

Additionally or alternatively, identifying the one or more of the musical instruments includes identifying a first musical instrument as having played in the audio segment based on a respective score and threshold value for the first musical instrument and finding one or more second musical instruments having an affinity with the first musical instrument. Responsively to the affinity, the respective threshold values that were set for the one or more second musical instruments are reduced, and one or more of the second musical instruments having respective scores no less than the reduced respective threshold values are identified as having played in the audio segment.

There is also provided, in accordance with an embodiment of the invention, a method for audio processing, which includes training a computerized classifier to recognize respective audio outputs of a predefined set of musical instruments, such that upon receiving audio data, the classifier outputs a vector of respective scores for the musical instruments in the set. An audio segment is input to the classifier, including music played by one or more of the musical instruments, and the vector of the respective scores for the audio segment is received from the classifier. The respective scores of the musical instruments for the audio segment are compared to a threshold value, and a first musical instrument having a respective score that is no less than the threshold value is identified as having played in the audio segment. One or more second musical instruments having an affinity with the first musical instrument are found. Responsively to the affinity, the threshold value for the one or more second musical instruments is reduced. One or more of the second musical instruments for which the respective scores are no less than the reduced respective threshold value are identified as having played in the audio segment.

There is additionally provided, in accordance with an embodiment of the invention, a method for audio processing, which includes training a computerized classifier to recognize respective audio outputs of a predefined set of musical instruments, such that upon receiving audio data, the classifier outputs a vector of respective scores for the musical instruments in the set. A stereophonic audio segment is received, having left and right audio channels and including music played by one or more of the musical instruments. The left and right audio channels are upmixed to derive a mid-channel and two side-channel signals. Each of the mid-channel and two side-channel signals is input to the classifier, and the vector of the respective scores for the audio segment is received from the classifier, based on the mid-channel and two side-channel signals. The respective scores of the musical instruments for the audio segment are compared to at least one threshold value, and one or more of the musical instruments for which the respective scores are no less than the at least one threshold value are identified as having played in the audio segment.

In some embodiments, the method includes storing a plurality of audio segments and the musical instruments identified as having played in each of the audio segments in a database, and searching the database to find the audio segments in which a given musical instrument played.

In some embodiments, the computerized classifier includes a convolutional neural network. In a disclosed embodiment, inputting the audio segment includes dividing the audio segment into a sequence of time windows, computing a respective frequency spectrum of the audio data in each of the time windows, and extracting amplitude values for a predefined series of frequencies in the respective frequency spectrum of each of the time windows for input to the convolutional neural network.

There is further provided in accordance with an embodiment of the invention, apparatus for audio processing, including an interface, which is configured to receive an audio segment including music played by one or more musical instruments within a predefined set, and a processor, which is configured to run a computerized classifier that has been trained to recognize respective audio outputs of the predefined set of musical instruments, such that upon receiving the audio segment from the interface, the classifier outputs a vector of respective scores for the musical instruments in the set, each score indicating a likelihood that a corresponding musical instrument played in the audio segment. The processor is further configured to categorize a musical genre of the audio segment, to set different, respective threshold values for the musical instruments in the set responsively to the musical genre of the audio segment, to compare the respective scores of the musical instruments for the audio segment to the respective threshold values, and to identify one or more of the musical instruments for which the respective scores are no less than the respective threshold values as having played in the audio segment.

There is moreover provided, in accordance with an embodiment of the invention, apparatus for audio processing, in which the processor is configured to compare the respective scores of the musical instruments for the audio segment to a threshold value, to identify a first musical instrument having a respective score that is no less than the threshold value as having played in the audio segment, to find one or more second musical instruments having an affinity with the first musical instrument and responsively to the affinity, to reduce the threshold value for the one or more second musical instruments, and to identify one or more of the second musical instruments for which the respective scores are no less than the reduced respective threshold value as having played in the audio segment.

There is furthermore provided, in accordance with an embodiment of the invention, apparatus for audio processing, including an interface, which is configured to receive a stereophonic audio segment having left and right audio channels and including music played by one or more musical instruments within a predefined set, and a processor, which is configured to upmix the left and right audio channels to derive a mid-channel and two side-channel signals, and to input each of the mid-channel and two side-channel signals to a computerized classifier that has been trained to recognize respective audio outputs of the predefined set of musical instruments. The processor is configured to receive from the classifier a vector of respective scores for the audio segment based on the mid-channel and two side-channel signals, to compare the respective scores of the musical instruments for the audio segment to at least one threshold value, and to identify one or more of the musical instruments for which the respective scores are no less than the at least one threshold value as having played in the audio segment.

There are also provided, in accordance with embodiments of the invention, computer software products, including a non-transitory computer-readable medium in which program instructions are stored, which instructions, when read by a computer, cause the computer to carry out the methods described above.

The present invention will be more fully understood from the following detailed description of the embodiments thereof, taken together with the drawings in which:

BRIEF DESCRIPTION OF THE DRAWINGS

Fig. 1 is a block diagram that schematically illustrates a system for audio analysis, in accordance with an embodiment of the invention;

Fig. 2 is a flow chart that schematically illustrates a method for audio analysis, in accordance with an embodiment of the invention;

Fig. 3 is a flow chart that schematically illustrates a method for computing a spectrogram, in accordance with an embodiment of the invention;

Fig. 4 is a block diagram that schematically illustrates a method for analysis of an audio segment by a neural network, in accordance with an embodiment of the invention; and

Fig. 5 is a flow chart that schematically illustrates a method for classifying and searching audio passages, in accordance with an embodiment of the invention.

DETAILED DESCRIPTION OF EMBODIMENTS

OVERVIEW

Although applications such as the above-mentioned Shazam enable users to search for and identify a wide range of musical works by their audio content, in some cases musicians and audiophiles need to search for music with finer granularity, for example to find audio segments in which a specific instrument is playing. Identifying a solo instrument may not be too difficult, but most recorded music contains a mix of different instruments, possibly together with vocal components. The user may wish to search not only for solo passages of a chosen instrument, but also for ensemble and even orchestral pieces in which the instrument is playing. Identifying the instrumental components of a musical mix is a difficult task, which is not well covered by existing search tools.

Embodiments of the present invention that are described herein address this need by refining the abilities of a classifier to identify the musical instruments that are playing in a given audio segment. In the disclosed embodiments, a computerized classifier is trained to recognize the respective audio outputs of a predefined set of musical instruments, for example by training a convolutional neural network (CNN) using a training group of musical passages played by known instruments. Following this training, the classifier, upon receiving a new audio segment, will output a vector of respective scores for the musical instruments in the set. Each score in this vector indicates the likelihood that the corresponding musical instrument played in the audio data under analysis. To decide which instrument or instruments played in a given audio segment, the computer compares the scores in the vector to a threshold value or values and will identify a given instrument as having played in the audio segment when its respective score is no less than the applicable threshold. In the disclosed embodiments, the ability of the computer to reach a precise decision based on the classification results - meaning that as many of the participating instruments as possible are correctly identified - can be enhanced in a number of ways:

In some embodiments, the computer categorizes the musical genre of the audio segment, which is a good indicator of the types of instruments that are likely to be taking part (for example, violin in a classical piece, as opposed to guitar in popular music). For purposes of identifying the genre, the computer may calculate a signature of the audio segment, for example by finding spectral peaks in the audio segment and the temporal spacing between the peaks. This signature can be used in looking up both the name and the genre of the musical piece in question. Once the genre is known, it can be used in adjusting the threshold values, for example by lowering the threshold values for instruments that are commonly played in the identified genre, or adjusting the weights in the neural network to favor such instruments.

Additionally or alternatively, the computer can define and make use of affinity among groups of instruments, based on the observation that certain instruments (such as violin, viola and cello) often appear together. Thus, once the computer has identified one of the instruments playing in an audio segment, it can reduce the threshold values for other instruments in the same affinity group. The affinities among instruments may be used as independent factors in setting threshold values, or they may be influenced by the choice of genre.

In many cases, the computer will receive stereophonic audio segments for classification, in which a number of instrumental components, and possibly vocal components, as well, are mixed together in each of the left and right stereo channels. In some embodiments, the computer upmixes the stereo channels to derive a mid-channel signal (which is typically based on the sum of the left and right stereo channels) and two side-channel signals (based on the difference). Each of these upmixed channels is input to the classifier, which is then able to generate more precise scores for the participating instruments than would be achieved using the original stereo channels. This upmixing technique can be used together with or separately from the genre- and affinity-based methods described above.

SYSTEM DESCRIPTION

Fig. 1 is a block diagram that schematically illustrates a system 20 for audio analysis, in accordance with an embodiment of the invention. A computer processor 22 receives an audio segment via an input interface 24, for example a recording of an instrumental ensemble 26 made by stereo microphones 28. In the pictured example, ensemble 26 comprises a group of instruments including a violin 30, a viola 32, and a trombone 34, inter alia. Interface 24 may comprise an actual audio input interface, which converts analog audio signals into a sequence of samples. Alternatively or additionally, interface 24 may comprise a digital interface, such as a network interface, which receives the audio segment in digital form, for example as digital audio data contained in a sequence of packets.

Processor 22 comprises a general-purpose processing unit, which is programmed in software to carry out the functions that are described herein. The software may be downloaded to processor 22 in digital form, for example over a network. Additionally or alternatively, the software may be stored on tangible, non-transitory computer-readable media, such as electronic, optical or magnetic memory media.

In processing input audio segments, processor 22 makes use of metadata that are stored in a memory 36. In the pictured embodiment, memory 36 is attached locally to processor 22; alternatively or additionally, processor 22 may access metadata that are stored in remote memories, for example in on-line databases. Memory 36 should therefore be viewed as encompassing both local and remote sources of musical data and metadata.

Specifically, processor 22 is able to compute fingerprints of musical segments and to look up the results in a fingerprint database 38, which identifies the musical works to which the fingerprints belong. The fingerprint identifications in database 38 often include tags indicating the genre of each of the musical works. Various on-line sources of fingerprinting techniques and fingerprint data can be used in this regard. For example, the AcoustID service (available on line at acoustid.org) provides an application programming interface (API) for generating fingerprints from audio segments by detecting the strength of each of twelve pitch classes, corresponding to musical notes.

Multiple open databases are available containing records of musical tracks along with their respective fingerprints and musical genres. One example of such a database is MusicBrainz (available on line at musicbrainz.org). Using this sort of on-line database, processor 22 can correlate each known musical track with a genre, including sub-genres. For example, the genre "Classical Music" includes such sub-genres as "Arabic Classical Music," "Brazilian Classical Music," "East Asian Classical Music," "Latin Classical Music," "Modern Classical Music," "Western Classical Music," etc. The term "genre," as used in the context of the present description and in the claims, includes sub-genres, as well.

A genre database 40 in memory 36 indicates which musical instruments are typically associated with each genre. For each genre, database 40 may assign different, respective weights to different instruments, depending on the likelihood that a particular instrument will be used in playing the genre in question. For example, "Classical Music" will give higher weights to violin, viola and cello, and lower weights to electric guitar and accordion. The use of these weights is described further hereinbelow.

In addition, an affinity database 42 in memory 36 indicates affinities between different instruments, i.e., which combinations or groups of instruments are likely to be played together. For example, violin 30 and viola 32 will have high affinity weights for one another, while their affinity weights with respect to trombone 34 will probably be lower. Instrument groupings are often a function of genre, and databases 40 and 42 may therefore be integrated together. Alternatively, processor 22 may maintain and access affinity database 42 separately, irrespective of the musical genre.
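
By way of a hedged illustration only (none of the instrument names, weights, or threshold values below are taken from the source), genre database 40 and affinity database 42 could be realized as simple keyed tables in Python, with the genre weights translated into per-instrument threshold values:

    # Hypothetical contents of genre database 40: per-genre instrument weights
    # (higher weight = instrument more likely to appear in this genre).
    GENRE_WEIGHTS = {
        "Western Classical Music": {"violin": 0.9, "viola": 0.85, "cello": 0.9,
                                    "piano": 0.8, "electric guitar": 0.1},
        "Rock": {"electric guitar": 0.9, "bass guitar": 0.85, "drums": 0.9,
                 "violin": 0.2},
    }

    # Hypothetical contents of affinity database 42: pairwise affinity weights.
    AFFINITY_WEIGHTS = {
        ("violin", "viola"): 0.9,
        ("violin", "cello"): 0.8,
        ("violin", "trombone"): 0.2,
    }

    BASE_THRESHOLD = 0.7  # illustrative default decision threshold

    def thresholds_for_genre(genre, instruments):
        """Derive per-instrument thresholds: instruments weighted highly for the
        genre receive a lower (more permissive) threshold."""
        weights = GENRE_WEIGHTS.get(genre, {})
        return {inst: BASE_THRESHOLD - 0.2 * weights.get(inst, 0.0)
                for inst in instruments}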

Once processor 22 has identified one or more of the musical instruments playing in a given audio segment, such as a musical track, it stores the audio segment and the corresponding instrument identifications in a database 44 of classified tracks. Database 44 may be dedicated to this sort of instrument identification. Alternatively or additionally, the instrument identifications may be added to an existing on-line music database as tags, in addition to tags identifying the composer, genre, etc., for example. A client computer 46 can then search database 44 in order to find the audio segments in which a given musical instrument has played.
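
Database 44 can likewise be realized with any conventional store. The sketch below uses Python's built-in sqlite3 module purely for illustration; the table, column, and file names are invented for the example, not taken from the source:

    import sqlite3

    conn = sqlite3.connect("classified_tracks.db")  # hypothetical store for database 44
    conn.execute("""CREATE TABLE IF NOT EXISTS segment_instruments (
                        segment_id TEXT,
                        instrument TEXT)""")

    def store_classification(segment_id, instruments):
        # Tag the audio segment with every instrument identified in it.
        conn.executemany("INSERT INTO segment_instruments VALUES (?, ?)",
                         [(segment_id, inst) for inst in instruments])
        conn.commit()

    def find_segments(instrument):
        # Search for all segments in which the given instrument played.
        rows = conn.execute("SELECT segment_id FROM segment_instruments "
                            "WHERE instrument = ?", (instrument,))
        return [r[0] for r in rows]

    store_classification("track_0001", ["violin", "viola"])
    print(find_segments("violin"))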

METHODS FOR AUDIO ANALYSIS

Fig. 2 is a flow chart that schematically illustrates a method for audio analysis, in accordance with an embodiment of the invention. The method is described here, for the sake of concreteness and clarity, with reference to the elements of system 20, as shown in Fig. 1 and described above. Alternatively, the principles of this method may be applied in other sorts of systems having suitable computing and data resources.

Processor 22 receives an audio segment 50 through interface 24 and, as an initial step, computes a spectrogram 52 of the segment. At this step, processor 22 divides the audio segment into a sequence of time windows and computes a respective frequency spectrum of the audio data in each of the time windows. Thus, processor 22 transforms the input time-domain signal, such as a two-channel stereo input signal x_l/x_r, into a corresponding sequence of two-channel spectrograms X_l(f, t), X_r(f, t). Further details of this step are described hereinbelow with reference to Fig. 3.

Processor 22 upmixes the left and right audio channels to derive a mid-channel signal (X_m) and two side-channel signals (X_s), at an upmixing step 54. In the present example, these signals are calculated by summation and subtraction of the resulting spectrograms, respectively:

X_m(f, t) = X_l(f, t) + X_r(f, t)

X_s(f, t) = ½ · |X_l(f, t) − X_r(f, t)|
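
A minimal numpy sketch of this basic upmix at step 54 is given below. It follows the formulas as reconstructed above and derives a single side-difference channel, whereas the text refers to two side-channel signals, so it should be read as an assumption-laden simplification rather than the exact procedure of the source:

    import numpy as np

    def upmix_mid_side(X_l, X_r):
        """Derive mid- and side-channel spectrograms from left/right spectrograms.

        X_l, X_r: spectrograms of shape (n_freqs, n_windows)."""
        X_m = X_l + X_r                   # mid channel: summation
        X_s = 0.5 * np.abs(X_l - X_r)     # side channel: half the absolute difference
        return X_m, X_s

    # Example with random placeholder spectrograms:
    X_l = np.random.rand(96, 128)
    X_r = np.random.rand(96, 128)
    X_m, X_s = upmix_mid_side(X_l, X_r)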

In another embodiment, processor 22 adjusts the mid- and side-channel signals to improve the separation of the three channels. In this embodiment, the center channel (C1) is calculated by multiplying the mid-channel signal (in the frequency domain) by a magnitude coefficient α, which amplifies the intensity of the center while decreasing the relative intensity of the side channels. The value of α is typically set to a value in the range between 0.5 and 1, giving:

C1(f, t_i) = α · (1 − …) · (X_l(f, t_i) + X_r(f, t_i))

Based on the extracted C1, the left and right channels are given by the formulas:

L(f, t_i) = X_l(f, t_i) − 0.5 · α⁻¹ · C1(f, t_i)

R(f, t_i) = X_r(f, t_i) − 0.5 · α⁻¹ · C1(f, t_i)

The spectrogram output of each of the channels following steps 52 and 54 has the form of a matrix of amplitude values for a predefined series of frequencies in the respective frequency spectrum of each of the sequence of time windows. (A matrix of this sort is shown in Fig. 4, for example.) Processor 22 inputs these matrices to one or more classifiers. In the pictured example, the classifier comprises two convolutional neural networks 56, 58; alternatively, a larger or smaller number of neural networks may be used. Additionally or alternatively, other types of computerized classifiers may be applied, as are known in the art. Neural networks 56 and 58 may be trained, for example, to operate on time windows of different sizes (such as respective sequences of windows of lengths 0.1 sec, 0.25 sec, and 0.68 sec) or to recognize different groups of instruments (such as strings, wind, and keyboard instruments). Details of an example neural network and its operation are described hereinbelow with reference to Fig. 4.

Each neural network 56, 58 outputs a vector of respective scores for the musical instruments in a predefined set. Each score indicates the likelihood, as inferred by the neural network, that the corresponding musical instrument played in the audio segment under analysis. Neural networks 56, 58 typically operate on each of the three up-mixed audio channels that result from step 54 and thus output separate results for each channel. In one embodiment, the different results obtained from the three up-mixed audio channels are added to one another to give a total score for each instrument. In another embodiment, the results from the three channels are combined using a weighted sum, with different weights for the center and side channels. In an alternative embodiment, processor 22 does not immediately combine the scores, but rather continues to process each of the three channels independently.
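
As one possible implementation of the weighted combination described above, the per-channel score vectors can be merged as follows; the channel weights are illustrative assumptions, not values specified in the text:

    import numpy as np

    def combine_channel_scores(scores_mid, scores_side1, scores_side2,
                               w_mid=0.5, w_side=0.25):
        """Weighted sum of the score vectors obtained for the mid channel and the
        two side channels; each input is a length-X vector of instrument scores."""
        return (w_mid * np.asarray(scores_mid)
                + w_side * np.asarray(scores_side1)
                + w_side * np.asarray(scores_side2))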

Processor 22 compares the scores of the musical instruments that are output by neural networks 56 and 58 to respective threshold values 60, 62. Although these thresholds may be set to constant values, in the present embodiment processor 22 adjusts the threshold values using applicable metadata in order to improve the precision of the results. For example, the processor may apply the information in fingerprint database 38, genre database 40, and affinity database 42 (Fig. 1) for this purpose. Thus, the scores of different instruments in the score vectors output by the neural networks will be compared to different, respective threshold values. Methods for setting and adjusting the threshold values are described further hereinbelow with reference to Fig. 5.

Comparing the respective scores of the musical instruments for the audio segment to respective threshold values 60, 62 gives results 68 and 70, which identify one or more of the musical instruments for which the respective scores are no less than the applicable threshold values as having played in the audio segment. Processor 22 can enter these results in database 44, as well as outputting the classification results for other uses.
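
The comparison that yields results 68 and 70 can be expressed compactly, for example as follows (the instrument names and numbers are placeholders, not values from the source):

    def identify_instruments(scores, thresholds):
        """Return the instruments whose scores are no less than their thresholds.

        scores, thresholds: dicts mapping instrument name -> value."""
        return [inst for inst, score in scores.items()
                if score >= thresholds.get(inst, 1.0)]

    # Illustrative numbers only:
    scores = {"violin": 0.92, "viola": 0.55, "trombone": 0.10}
    thresholds = {"violin": 0.70, "viola": 0.70, "trombone": 0.70}
    print(identify_instruments(scores, thresholds))   # -> ['violin']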

SPECTROGRAM COMPUTATION

Fig. 3 is a flow chart that schematically illustrates details of a method for computation of spectrogram 52, in accordance with an embodiment of the invention. Processor 22 divides audio segment 50 into a sequence of time windows, at a time division step 72. Typically, although not necessarily, each audio channel in the segment is divided into N overlapping windows. For example, segments of duration 0.68 sec, sampled at a rate of 48K samples/sec, may be divided into N = 128 windows that are 512 data points in length, with 50% overlap between successive windows, for a total of 32,768 data points per segment. Alternatively, different sizes and numbers of windows may be used, with different overlap ratios or no overlap.

Processor 22 transforms the data in each window to the frequency domain, for example using a short-time Fourier transform (STFT), at a transformation step 74. Alternatively, other methods of time-frequency decomposition may be applied at this step. Processor 22 then reduces the transformed data by extracting amplitude values for a predefined series of frequencies in the respective frequency spectrum of each of the time windows, at a value extraction step 76. In the present example, processor 22 uses the Mel scale in extracting these values. The Mel scale is a perceptual scale of pitches that are judged by listeners to be equal in distance from one another. In the present embodiment, processor 22 converts the values of frequency f in the spectrogram (in units of Hz) to the Mel scale using the formula:

mel = 2595 · log10(1 + f/700)

The result of this process is a matrix of amplitude values for input to neural network 56, with size equal to the number of frequencies times the number of windows in each segment, for example 96 frequencies for each of 128 windows.
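
For illustration, the windowing, STFT and Mel-scale reduction of steps 72-76 can be reproduced with the librosa library. The parameter values (512-point windows with 50% overlap, 96 Mel frequencies, 48 kHz sampling) follow the example in the text, while the use of librosa itself is an implementation choice assumed here, not something specified in the source:

    import numpy as np
    import librosa

    def mel_spectrogram_matrix(samples, sr=48000, n_fft=512, hop=256, n_mels=96):
        """Compute an (n_mels x n_windows) matrix of log-amplitude values for one
        audio channel, analogous to the matrix that is input to the neural network."""
        S = librosa.feature.melspectrogram(y=samples, sr=sr, n_fft=n_fft,
                                           hop_length=hop, n_mels=n_mels)
        return librosa.power_to_db(S, ref=np.max)

    # Example with 0.68 s of synthetic audio at 48 kHz:
    x = np.random.randn(int(0.68 * 48000)).astype(np.float32)
    M = mel_spectrogram_matrix(x)
    print(M.shape)   # roughly (96, 128), depending on padding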

Alternatively, the number of frequencies for which corresponding amplitude values are input to the classifier can be reduced using other techniques, such as principal component analysis (PCA). Further alternatively, the number of frequencies is not reduced, and the entire output of the STFT (or other time/frequency transform) is input to the neural network.

CLASSIFICATION OF AUDIO SEGMENTS

Fig. 4 is a block diagram that schematically shows details of the structure and operation of neural network 56, in accordance with an embodiment of the invention. As explained above, the input to the neural network is a matrix 80 of amplitude values 82 at M different frequencies in each of N successive windows, for example M = 96 and N = 128. Matrix 80 is input to an initial layer 84 of neural network 56, which convolves matrix 80 with a set of kernels and applies an activation function to the convolution results, such as activation based on a leaky rectified linear unit (Leaky ReLU), as is known in the art. The results of initial layer 84 are output to a second layer 86 of the neural network, followed by a sequence of intermediate layers 88, leading to an output layer 90.

In an example embodiment, neural network 56 includes the following layers:

• First layer: 2D convolution and LeakyReLU activation.

• Second layer: 2D convolution, LeakyReLU activation, Max Pooling and Dropout.

• Third layer: 2D convolution and LeakyReLU activation.

• Fourth layer: 2D convolution, LeakyReLU activation, Max Pooling and Dropout.

• Fifth layer: 2D convolution and LeakyReLU activation.

• Sixth layer: 2D convolution, LeakyReLU activation, Max Pooling and Dropout.

• Seventh layer: 2D convolution, LeakyReLU activation and Global Max-Pooling.

• Eighth layer: Dense-connected, LeakyReLU, Dropout and prediction by sigmoid activation.

This sort of convolutional neural network can be implemented, for example, using the MATLAB Neural Network Toolbox™ or any other suitable software or hardware-based tools that are known in the art.
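
A minimal Keras sketch of a network with the layer structure listed above follows. It is an assumption-laden illustration: the kernel sizes, filter counts, pooling sizes, dropout rates and number of instruments are not given in the text and are chosen here only to make the example runnable.

    from tensorflow.keras import layers, models

    def build_instrument_cnn(n_mels=96, n_windows=128, n_instruments=12):
        """Stack of 2D convolutions with LeakyReLU, max pooling and dropout,
        ending in global max pooling and a sigmoid multi-label output."""
        def conv(x, filters, pool=False):
            x = layers.Conv2D(filters, (3, 3), padding="same")(x)
            x = layers.LeakyReLU(0.1)(x)
            if pool:
                x = layers.MaxPooling2D((2, 2))(x)
                x = layers.Dropout(0.25)(x)
            return x

        inp = layers.Input(shape=(n_mels, n_windows, 1))   # one spectrogram matrix
        x = conv(inp, 32)             # first layer
        x = conv(x, 32, pool=True)    # second layer
        x = conv(x, 64)               # third layer
        x = conv(x, 64, pool=True)    # fourth layer
        x = conv(x, 128)              # fifth layer
        x = conv(x, 128, pool=True)   # sixth layer
        x = layers.Conv2D(256, (3, 3), padding="same")(x)  # seventh layer
        x = layers.LeakyReLU(0.1)(x)
        x = layers.GlobalMaxPooling2D()(x)
        x = layers.Dense(256)(x)      # eighth layer
        x = layers.LeakyReLU(0.1)(x)
        x = layers.Dropout(0.5)(x)
        out = layers.Dense(n_instruments, activation="sigmoid")(x)  # score vector
        return models.Model(inp, out)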

The output of layer 90 is a vector 92 containing X scores 94, labeled I_1, I_2, ..., I_X, corresponding to the X musical instruments that neural network 56 has been trained to recognize. As explained above, each score 94 indicates the likelihood that the corresponding musical instrument played in the current audio segment. Scores 94 may have the form of probabilities, for example, with values between zero and one.

Processor 22 optionally places scores 94 that are output by neural network 56 for each matrix 80 and for each instrument in a buffer, and may then smooth the scores using a moving average algorithm or another suitable smoothing technique. This sort of smoothing reduces abrupt changes, which are improbable in common sound tracks and may lead to artifacts in the classification results.
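
One simple way to implement this smoothing is a moving average over the buffered scores of each instrument, as in the following sketch (the window length is an arbitrary illustration):

    import numpy as np

    def smooth_scores(score_sequence, window=5):
        """Moving-average smoothing of buffered score vectors.

        score_sequence: array of shape (n_segments, n_instruments), the scores
        output by the network for consecutive input matrices."""
        kernel = np.ones(window) / window
        return np.apply_along_axis(
            lambda col: np.convolve(col, kernel, mode="same"), 0,
            np.asarray(score_sequence, dtype=float))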

Fig. 5 is a flow chart that schematically shows details of the above methods for classifying and searching audio passages, in accordance with an embodiment of the invention. The method of Fig. 5 comprises three main stages: a training stage 100, a classification stage 102, and a search stage 104. In training stage 100, neural networks 56, 58 are trained to recognize respective audio outputs of a predefined set of musical instruments, typically using a training set of musical passages played by known instruments. In addition, in stage 100, genre database 40 and affinity database 42 are populated with weights for each of the musical instruments, depending on the genres of the musical passages in the training set and the instrument combinations that occur in these musical passages.
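
Training stage 100 can be sketched as multi-label supervised learning: each training passage is labeled with a multi-hot vector marking which instruments played in it, and a sigmoid-output network, such as the build_instrument_cnn() sketch shown earlier, is fitted with a binary cross-entropy loss. The code below is an assumed illustration, not the procedure prescribed by the source; the data arrays are random placeholders.

    import numpy as np
    from tensorflow.keras import optimizers

    # Placeholder training data: spectrogram matrices and multi-hot instrument labels.
    X_train = np.random.rand(1000, 96, 128, 1).astype("float32")
    y_train = (np.random.rand(1000, 12) > 0.8).astype("float32")  # 12 instruments

    # build_instrument_cnn() is the sketch defined earlier in this description.
    model = build_instrument_cnn(n_mels=96, n_windows=128, n_instruments=12)
    model.compile(optimizer=optimizers.Adam(1e-3),
                  loss="binary_crossentropy",
                  metrics=["binary_accuracy"])
    model.fit(X_train, y_train, batch_size=32, epochs=10, validation_split=0.1)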

Classification stage 102 is initiated by input of an unknown musical passage, i.e., a passage whose instrumental makeup is not known in advance, at a passage input step 106. Neural networks 56, 58 generate vectors 92 of instrument scores 94, at a score generation step 108, as described above. In addition, processor 22 computes a fingerprint of the musical passage and looks up the fingerprint in fingerprint database 38, at a fingerprint lookup step 110. Fingerprint database 38 provides a tag indicating the genre of the passage. Alternatively, processor 22 may look up the fingerprint and/or some other feature or features of the passage directly in genre database 40. In either case, processor 22 reads a set of threshold values for the appropriate genre from genre database 40, at a threshold lookup step 112. As explained above, processor 22 thus sets different, respective threshold values for the musical instruments depending upon the genre of the passage undergoing classification.

Processor 22 compares the respective scores 94 of the X musical instruments in vector 92 for the present audio passage to the respective threshold values, at a comparison step 114. When the score for one or more of the instruments (for example, instrument J) is greater than or equal to the respective threshold, processor 22 identifies the corresponding musical instrument or instruments as having played in the audio passage. The processor saves this classification of the passage in database 44, at an instrument classification step 116.

In addition, after identifying a first musical instrument as having played in the audio passage at step 114, processor 22 may update the threshold values for other musical instruments that are listed in database 42 as having an affinity with the first musical instrument, at an affinity update step 118. Thus, for example, once processor 22 has identified violin 30 as having played in a certain audio segment, it reduces the threshold value for viola 32, since there is a high likelihood of these two instruments playing together. Processor 22 returns to step 114 to compare scores 94 to these reduced threshold values, and may then identify an additional instrument or instruments as having played in the audio passage if their respective scores are no less than the respective reduced threshold values.

As an example of the operation of the above classification process, assume that neural network 56 has classified a certain segment of classical music in which both cello and piano are playing, and has returned a score 94 of 99% for the cello, while the score for the piano is only 65%. A reasonable threshold of 70%, for example, would still exclude the piano and lead to a false negative result. In steps 110 and 112, however, processor 22 identifies the audio segment as belonging to a particular classical music genre, for which databases 40 and 42 indicate that piano and cello often appear together. Given the high-confidence result returned for the cello (99%), processor 22 will lower the specific threshold for the piano at step 118 to a value below 65%, which will result in the inclusion of both instruments in the final classification.

The rule applied by processor 22 at step 118 can be expressed as follows, for example:

If genre = classical music and cello > cello_threshold and piano < piano_threshold, then piano_threshold = piano_threshold − 10.
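
That rule, together with the genre condition, might be written in code as follows; the threshold values and the reduction step of 10 percentage points follow the worked example above, while the score representation is an assumption:

    def apply_affinity_rule(genre, scores, thresholds):
        """Hedged sketch of the threshold-update rule at step 118, using the
        cello/piano example from the text (scores and thresholds in percent)."""
        if (genre == "classical music"
                and scores["cello"] >= thresholds["cello"]
                and scores["piano"] < thresholds["piano"]):
            thresholds["piano"] -= 10
        return thresholds

    # Worked example from the text: cello 99%, piano 65%, both thresholds 70%.
    scores = {"cello": 99, "piano": 65}
    thresholds = {"cello": 70, "piano": 70}
    thresholds = apply_affinity_rule("classical music", scores, thresholds)
    identified = [i for i, s in scores.items() if s >= thresholds[i]]
    print(identified)   # -> ['cello', 'piano']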

Alternatively or additionally, processor 22 may dynamically change the weights in neural networks 56 and 58 for groups of instruments. Thus, for example, detection of a violin playing in a given segment will cause higher weights to be assigned to the viola, cello and tuba in the neural network, while lower weights are assigned to the electric guitar. Alternatively or additionally, genre information can be used in adjusting the neural network weights.

After classification of a sufficient number of musical pieces in stage 102, client computer 46 can search database 44 in stage 104 in order to find the audio segments in which a given musical instrument has played.

Although the embodiments described above are directed specifically to classification of instrumental components in an audio mix, the principles of the present invention may also be applied, mutatis mutandis, to other sorts of audio sources, such as vocal music or even natural sounds (for example, rain, thunder, and birdsong). It will thus be appreciated that the embodiments described above are cited by way of example, and that the present invention is not limited to what has been particularly shown and described hereinabove. Rather, the scope of the present invention includes both combinations and subcombinations of the various features described hereinabove, as well as variations and modifications thereof which would occur to persons skilled in the art upon reading the foregoing description and which are not disclosed in the prior art.