

Title:
AUDIO ANALYSIS SYSTEM AND METHOD USING AUDIO SEGMENT CHARACTERISATION
Document Type and Number:
WIPO Patent Application WO/2014/096832
Kind Code:
A1
Abstract:
A method of matching an input audio signal to one or more audio segments within a plurality of audio segments, the method comprising: receiving the input audio signal; processing the input audio signal to determine structural parameter feature data related to the received input audio signal; analysing the determined structural parameter feature data to extract semantic feature data; comparing the feature data of the input audio signal to pre-processed feature data relating to the plurality of audio segments in order to match one or more audio segments within a similarity threshold of the input audio signal; outputting a search result on the basis of the matched one or more audio segments wherein semantic feature data is extracted from the structural parameter data using a supervised learning technique.

Inventors:
MAGAS MICHELA (GB)
LAURIER CYRIL (ES)
Application Number:
PCT/GB2013/053362
Publication Date:
June 26, 2014
Filing Date:
December 19, 2013
Assignee:
MAGAS MICHELA (GB)
LAURIER CYRIL (ES)
International Classes:
G10H1/00
Foreign References:
US20070240557A12007-10-18
US20060075886A12006-04-13
US20080300702A12008-12-04
Other References:
CASEY M A ET AL: "Content-Based Music Information Retrieval: Current Directions and Future Challenges", PROCEEDINGS OF THE IEEE, IEEE. NEW YORK, US, vol. 96, no. 4, 2 April 2008 (2008-04-02), pages 668 - 696, XP011206028, ISSN: 0018-9219
MICHELA MAGAS: "Michela Magas on music search and discovery -YouTube", BIG AWARDS' AT RAVENSBOURNE COLLEGE, GREENWICH, LONDON, 6 March 2012 (2012-03-06), Internet, XP055107138, Retrieved from the Internet [retrieved on 20140312]
Attorney, Agent or Firm:
RICHARDSON, Mark et al. (Fleet Place House2 Fleet Place, London EC4M 7ET, GB)
Claims:
CLAIMS

1. A method of matching an input audio signal to one or more audio segments within a plurality of audio segments, the method comprising:

receiving the input audio signal;

processing the input audio signal to determine structural parameter feature data related to the received input audio signal;

analysing the determined structural parameter feature data to extract semantic feature data;

comparing the feature data of the input audio signal to pre-processed feature data relating to the plurality of audio segments in order to match one or more audio segments within a similarity threshold of the input audio signal;

outputting a search result on the basis of the matched one or more audio segments wherein semantic feature data is extracted from the structural parameter data using a supervised learning technique.

2. A method as claimed in Claim 1, wherein the comparing step comprises comparing the determined structural parameter feature data and extracted semantic audio feature data to pre-processed structural parameter feature data and semantic feature data relating to the plurality of audio segments in order to match one or more audio segments that have structural parameter feature data and semantic feature data that is within a similarity threshold of the determined structural parameter feature data and extracted semantic feature data.

3. A method as claimed in Claim 1 or Claim 2, wherein the pre-processed feature data relating to the plurality of audio segments is stored in a data store.

4. A method as claimed in Claim 3, wherein the data store comprises a plurality of audio segment characterisations, each audio segment characterisation comprising the pre-processed feature data relating to an audio segment.

5. A method as claimed in Claim 4, wherein the audio segment characterisation comprises metadata identifying start/end points of the audio segment within a longer audio file.

6. A method as claimed in any one of claims 3 to 5, wherein the audio segment is stored in the data store.

7. A method as claimed in any one of claims 3 to 6, wherein the audio segment is stored in a further data store and the audio segment characterisation comprises an identifier of, or a link to, the audio segment in the further data store.

8. A method as claimed in any preceding claim, further comprising segmenting the input audio signal to identify one or more audio segments.

9. A method as claimed in Claim 8, wherein segmenting comprises determining feature data within the input audio signal.

10. A method as claimed in Claim 9, wherein segmenting comprises identifying candidate audio segments based on changes in the determined feature data over time.

11. A method as claimed in Claim 9 or 10, wherein segmenting comprises identifying candidate audio segments using a novelty curve technique.

12. A method as claimed in any one of Claims 9 to 11, wherein segmenting comprises using a peak detection algorithm to identify novelty peaks in order to identify candidate audio segments.

13. A method as claimed in any one of Claims 8 to 12, further comprising a further segmentation step arranged to fine tune start and end points of identified audio segments by analysing changes in the determined feature data over a finer time scale than during the initial segmentation step.

14. A method as claimed in any one of claims 8 to 13, wherein segmenting the input audio signal comprises filtering the identified audio segments on the basis of one or more heuristic rules.

15. A method as claimed in any preceding claim, wherein processing the input audio signal comprises analysing the input audio signal waveform to extract temporal feature data.

16. A method as claimed in Claim 15, wherein analysing the input audio signal waveform comprises measuring loudness with RMS.

17. A method as claimed in any preceding claim, wherein processing the input audio signal comprises performing a fast Fourier transform on the input audio signal in order to extract spectral feature data.

18. A method as claimed in Claim 17, further comprising analysing components of the fast Fourier transform to determine changes in frequency.

19. A method as claimed in any preceding claim, wherein processing the input audio signal comprises generating a chromagram in order to extract tonal feature data.

20. A method as claimed in Claim 19, further comprising analysing chromas within the generated chromagram and extracting tonal feature data based on the distribution of chromas within the input audio signal.

21. A method as claimed in any preceding claim, further comprising conducting a statistical analysis of the extracted feature data in order to determine structural parameter feature data.

22. A method as claimed in any one of Claims 15 to 21 when dependent on Claim 8, wherein processing of the input audio signal comprises processing identified audio segments.

23. A method as claimed in any preceding claim, wherein analysing the determined structural parameter feature data comprises inputting the determined structural feature data into a supervised learning based classifier model in order to extract semantic feature data.

24. A method as claimed in Claim 23, wherein the supervised learning model comprises a Support Vector Machine.

25. A method as claimed in Claim 23 or Claim 24, wherein the classifier model outputs semantic feature data including one or more from the group of: musical style; mood of music; instruments used within the input audio signal.

26. A method as claimed in any preceding claim, wherein comparing the feature data comprises weighting feature data that is assessed using a similarity algorithm.

27. A method as claimed in Claim 26, wherein two or more types of feature data are used to compare the feature data of the input audio signal with the pre-processed feature data and the weighting given to each type of feature data is customisable.

28. A method as claimed in any preceding claim, further comprising pre-processing the identified structural feature data in order to normalise the feature data.

29. A system for matching an input audio signal to one or more audio segments within a plurality of audio segments, the system comprising:

an input arranged to receive the input audio signal;

a processor arranged to: process the input audio signal to determine structural parameter feature data related to the received input audio signal; analyse the determined structural parameter feature data to extract semantic feature data; and compare the feature data of the input audio signal to pre-processed feature data relating to the plurality of audio segments in order to match one or more audio segments within a similarity threshold of the input audio signal;

an output arranged to output a search result on the basis of the matched one or more audio segments

wherein the processor is arranged to extract semantic feature data from the structural parameter data using a supervised learning technique.

30. A method of matching an input audio file, the input audio file being associated with structural parameter feature data and semantic feature data, to one or more audio segments within a plurality of audio segments, the method comprising:

receiving the input audio file;

determining the structural parameter feature data and semantic feature data associated with the input audio file, the semantic feature data having been extracted from the structural parameter data using a supervised learning technique;

comparing the feature data of the input audio file to pre-processed feature data relating to the plurality of audio segments in order to match one or more audio segments within a similarity threshold of the input audio file;

outputting a search result on the basis of the matched one or more audio segments.

31. A system for matching an input audio file, the input audio file being associated with structural parameter feature data and semantic feature data, to one or more audio segments within a plurality of audio segments, the system comprising:

an input arranged to receive the input audio file;

a processor arranged to determine the structural parameter feature data and semantic feature data associated with the input audio file, the semantic feature data having been extracted from the structural parameter data using a supervised learning technique; and to compare the feature data of the input audio file to pre-processed feature data relating to the plurality of audio segments in order to match one or more audio segments within a similarity threshold of the input audio file; an output arranged to output a search result on the basis of the matched one or more audio segments.

32. A method of building a data store of audio segment characterisations, comprising

receiving an input audio file;

processing the input audio to determine structural parameter feature data related to the received input audio signal;

analysing the determined structural parameter feature data to extract semantic feature data wherein semantic feature data is extracted from the structural parameter data using a supervised learning technique;

storing the determined structural parameter feature data and extracted semantic feature data in a data store.

33. A method of segmenting an input audio signal, the input audio file being associated with structural parameter feature data and semantic feature data, into one or more audio segments, the method comprising:

receiving the input audio signal;

processing the input audio signal to determine structural parameter feature data related to the received input audio signal;

segmenting the input audio signal to identify one or more audio segments

wherein segmenting comprises identifying candidate audio segments based on changes in the determined feature data over time.

Description:
AUDIO ANALYSIS SYSTEM AND METHOD USING AUDIO SEGMENT CHARACTERISATION

Technical Field

The present invention relates to an audio analysis system and method. In particular, the present invention relates to a method of analysing audio sources in order to extract audio features and parameters that may be used to search for similar audio data. The present invention comprises a method of building a database of automatically extracted audio segment files and a method of analysing an input audio stream (a "seed" query) against the database in order to identify audio segment files that are similar to the seed query.

Background to the Invention

Known methods addressing audio similarity identification may rely exclusively on low-level feature extraction derived from spectral analysis using Fourier transform (e.g. US8158871, US20080300702), or other particular methods such as Wavelets (e.g. WO2004/075093 A2).

However, in order to produce culturally relevant results, such approaches usually require further combination with user-generated metadata. Where a combination of low-level acoustic metadata and high-level cultural metadata is used, "cultural metadata" often refers to text-based information describing listeners' reactions to a track or song (e.g. US8073854), as opposed to automatised methods of identification of high-level data.

Methods utilising high-level data to map clusters of music for the purpose of efficient music searches may therefore rely on considerable human input of semantic descriptors and assessment values, including human-edited music similarity graphs, similarity graphs generated using collaborative filtering techniques, similarity graphs generated as a function of monitored radio or network broadcast playlists, and similarity graphs constructed from music metadata (e.g. US7571183). This may include focusing on a way to capture emotional data from users via a specific device (e.g. US2010145892); or focusing on a playlist generation method based on high-level descriptors (e.g. EP2410444). None of the above methods focus on the description of an individual musical phrase within a music piece, and instead attempt to assign an emotional measure to the track as a whole.

Methods of audio similarity measurement based on the automatised generation of music metadata by segmentation focus on the identification of specific classes for the purpose of making corresponding metadata markers available to music search engines. This may include identifying repetition of set classes, such as 'stanza' or 'refrain' (e.g. US7345233); or creating a music summary, which makes classifiers such as 'sad' and 'jazz' available as metadata for a similarity music search, and clustering music pieces according to this classification (e.g. US7626111). The above methods rely on high-level semantic metadata derived from audio analysis being served to search engines, rather than on a high-level, audio-feature-led search.

The object of the present invention is to provide a more effective audio search system.

Summary of the invention

According to a first aspect of the present invention there is provided a method of matching an input audio signal to audio files stored in a data store, the method comprising: receiving the input audio signal; processing the input audio signal to determine structural parameter feature data related to the received input audio signal; analysing the determined structural parameter feature data to extract semantic feature data; comparing the feature data of the input audio signal to pre- processed feature data relating to audio files stored in the data store in order to match one or more audio files within a similarity threshold of the input audio signal; outputting a search result on the basis of the matched one or more audio files wherein semantic feature data is extracted from the structural parameter data using a supervised learning technique.

None of the prior art examples discussed above builds an effective audio search system by combining automatic segmentation algorithms with a weighted similarity measure that uses both low-level and high-level analysis. The present invention, however, provides an improved system and a series of methods for identifying and sampling similar segments of music.

An audio signal is taken and is then analysed in order to extract audio-related feature data from the input audio. The extracted features may be matched to features extracted from a plurality of segments of audio stored in a database and audio samples may be returned in order of best result. The input audio may be an audio signal (e.g. of a music track playing on the radio/TV), may be a segment of audio (e.g. an extract of an audio track that is played) or may be a segment of audio that has been returned as an earlier search result and which is reused as a starting query. The segment may be analysed in real time and audio features are extracted.

The system may generate audio segments marked by timestamps that have been arrived at using an automatic segmentation algorithm. Such a segmentation algorithm may employ a principle such as significant change in a feature of audio to determine start and end points of an audio segment. These audio segments (or "musical phrases") may be analysed to extract salient high and low level feature data (semantic and structural feature data respectively) which is associated with those audio segments and their value relationships, in order to describe those audio segments with semantic and human-related contextual meaning. The value relationships may comprise relationships between various extracted feature data (e.g. where the extraction of feature data has identified a loudness rating and a musical type, a value relationship between high values of "loud" and "metal" may be derived; by weighting results, further value relationships may be derived, e.g. defining a "young" or "old" target audience for the search results).

Thus, an audio segment, once fully analysed, may be associated with an audio segment characterisation that either contains or references: (1) an automatically extracted segment of audio of varied length, which differs from the preceding or following segment within the input audio source by means of at least one salient, human-meaningful differing feature, described by means of its start and end points; and (2) that audio segment's related set of human-meaningful feature descriptors and their value relationships, stored as metadata. The features of each audio segment may be analysed according to a series of low level (e.g. objective measures such as tempo, rhythm descriptors, tonal descriptors etc.) and high level (e.g. subjective measures such as mood, style etc.) audio feature extraction algorithms.

The present invention provides an improved audio analysis system compared to known systems. In particular, it is noted that the present invention provides improved effectiveness by focussing on parts of audio tracks instead of complete audio tracks combined with the use of automatised weighting of high-level comparison measures, e.g. the mood of an audio extract/track. It is also noted that the analysis methods according to embodiments of the present invention allow "on the fly" similarity queries to be run.

The first aspect of the present invention provides a method in which an input audio signal is processed to determine structural parameter feature data such as descriptors that are extracted from temporal and spectral representations of the input audio. Such data is also referred to as "low level" feature data and includes, by way of example, the following audio features grouped by type:

-Timbre: Bark bands, Mel Frequency Cepstrum Coefficients (MFCCs), pitch salience, High Frequency Component (HFC), Loudness, Spectral flatness, Spectral flux, rolloff, complexity, centroid, kurtosis, skewness, crest, decrease, spread.

-Tonal: dissonance, chords change rate, mode, key strength, tuning diatonic strength, tristimulus.

-Rhythm: bpm, bpm confidence, silence rate, onset rate.

Having determined structural parameter feature data, semantic feature data is then extracted from the "low level" feature data. Semantic feature data (also referred to as "High-level features") may be defined as descriptors capturing the semantics of an audio sample, and their corresponding value relationships. These features can model concepts such as mood, style and a variety of others. High level feature data is extracted via a supervised learning technique.
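By way of a non-limiting illustration only, the sketch below shows how a handful of the low level descriptors listed above (timbre, tonal and rhythm) might be extracted and summarised for an input audio signal. It assumes the open-source Python libraries librosa and numpy, which are illustrative choices rather than part of the claimed method, and the function and field names are assumptions made for the example.

```python
# Illustrative sketch only: extract a few of the low level ("structural parameter")
# descriptors named above using the librosa library (an assumed dependency).
import numpy as np
import librosa

def extract_low_level(path):
    y, sr = librosa.load(path, mono=True)             # mono mixture of the input audio

    # Timbre descriptors (frame-based)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr)
    flatness = librosa.feature.spectral_flatness(y=y)
    loudness = librosa.feature.rms(y=y)                # RMS used as a loudness proxy

    # Tonal descriptors (chromagram-based)
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)

    # Rhythm descriptors
    tempo, _ = librosa.beat.beat_track(y=y, sr=sr)     # bpm estimate
    onsets = librosa.onset.onset_detect(y=y, sr=sr)
    onset_rate = len(onsets) / (len(y) / sr)

    # Summarise frame-based descriptors with simple statistics (mean shown here)
    return {
        "mfcc_mean": mfcc.mean(axis=1).tolist(),
        "centroid_mean": float(centroid.mean()),
        "flatness_mean": float(flatness.mean()),
        "loudness_mean": float(loudness.mean()),
        "chroma_mean": chroma.mean(axis=1).tolist(),
        "bpm": float(np.atleast_1d(tempo)[0]),
        "onset_rate": onset_rate,
    }
```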

The low and high level feature data determined and extracted from the input audio signal is then compared to pre-processed feature data relating to audio segments stored in the data store in order to determine one or more matching audio segments.

It is noted that the input audio signal may comprise a complete audio/music track or an extract thereof, subject to the automatic identification of the length of the audio segment. Search results output by the method may also be fed back in as the input audio signal.

It is further noted that a single data store may store both the pre-processed feature data relating to the stored audio segments and the audio segment itself. Alternatively, the data store may instead store the pre-processed feature data relating to the audio segment and the audio segment itself may be stored in a further data store (i.e. the actual sound file containing the audible, playable component may be located in a different location). In this alternative example, the pre-processed feature data relating to the audio segment may further include links to the actual related audio segments/complete audio files.

Conveniently, the comparing step may comprise comparing the determined structural parameter feature data and extracted semantic audio feature data to pre-processed structural parameter feature data and semantic feature data relating to the plurality of audio segments in order to match one or more audio segments that have structural parameter feature data and semantic feature data that is within a similarity threshold of the determined structural parameter feature data and extracted semantic feature data.

The pre-processed feature data relating to the plurality of audio segments may conveniently be stored in a data store such as a database. Furthermore, the data store may comprise a plurality of audio segment characterisations, each audio segment characterisation comprising the pre-processed feature data relating to an audio segment. The audio segment characterisation may comprise further data identifying start/end points of the audio segment within a longer audio file. The audio segment characterisation may also comprise metadata that defines the relationships between one or more types of feature data. In one arrangement, the audio segment may be stored in the same data store as the audio segment characterisation. In another arrangement, the audio segment may be stored in a further data store (e.g. a third party's music library) and the audio segment characterisation may be stored in the data store with an identifier of the actual sound file (the audible, playable component) or a link or other direction to the third party database.
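Purely as an illustration of such an audio segment characterisation record, a minimal sketch is given below. The field names are assumptions chosen for the example; the audio itself may either be embedded or only referenced via a link into a third-party library, as described above.

```python
# Illustrative sketch of an audio segment characterisation record; field names
# are assumptions, not part of the disclosed system.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class AudioSegmentCharacterisation:
    segment_id: str
    source_uri: str                      # identifier/link to the playable sound file
    start_s: float                       # start point within the longer audio file
    end_s: float                         # end point within the longer audio file
    structural_features: dict = field(default_factory=dict)  # low-level descriptors
    semantic_features: dict = field(default_factory=dict)    # high-level descriptors
    metadata: dict = field(default_factory=dict)              # e.g. relationships between feature types
    audio_bytes: Optional[bytes] = None  # present only if the segment is stored in the same data store
```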

The search result output may comprise an actual audio segment (i.e. an actual sound file representing the audio segment, if available) or a link or direction to a further data store/database (if stored elsewhere).

In the event that the input audio signal is an extended audio stream the method may further conveniently comprise segmenting the input audio signal to identify one or more audio segments. Segmenting the input audio in this manner may conveniently reduce processor burden when determining feature data and improve effectiveness of analysis and comparison.

Segmenting the input audio in this manner may comprise determining feature data within the input audio signal. Furthermore, segmenting may comprise identifying candidate segments based on changes in the determined feature data over time. Segmenting may also comprise identifying candidate segments using a novelty curve technique, or using a peak detection algorithm to identify novelty peaks and thereby identify candidate audio segments.

Conveniently, a first segmentation process may determine an initial audio segment which may then be normalised in a further segmentation process. Normalisation may comprise analysing the feature data over a finer time scale than during the initial/first segmentation step.

Segmenting the input audio signal may also comprise filtering the identified audio segments on the basis of one or more heuristic rules.

Processing the input audio signal may comprise analysing the input audio signal waveform to extract temporal feature data. Analysing the input audio signal waveform may comprise measuring loudness with RMS.

Processing the input audio signal may also comprise performing a fast Fourier transform on the input audio signal in order to extract spectral feature data. The method may further comprise analysing components of the fast Fourier transform to determine changes in frequency. Processing the input audio signal may comprise generating a chromagram in order to extract tonal feature data. The method may further comprise analysing chromas within the generated chromagram and extracting tonal feature data based on the distribution of chromas within the input audio signal.
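Purely for illustration of the preceding paragraph, the sketch below derives spectral feature data from a short-time Fourier transform and tonal feature data from a chromagram. It assumes the librosa and numpy Python libraries, and the input file name is hypothetical.

```python
# Illustrative sketch only: spectral features via the (fast) Fourier transform and
# tonal features via a chromagram (librosa assumed; file name is hypothetical).
import numpy as np
import librosa

y, sr = librosa.load("input_audio.wav", mono=True)

# Spectral representation: magnitude of the short-time Fourier transform
S = np.abs(librosa.stft(y, n_fft=2048, hop_length=512))
# Frame-to-frame change in frequency content (a simple spectral flux measure)
spectral_flux = np.sqrt((np.diff(S, axis=1) ** 2).sum(axis=0))

# Chromagram: energy per pitch class (chroma) per analysis frame
chroma = librosa.feature.chroma_stft(y=y, sr=sr, hop_length=512)
chroma_distribution = chroma.mean(axis=1)              # distribution of chromas over the signal
dominant_chroma = int(np.argmax(chroma_distribution))  # crude tonal descriptor
```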

The method may further comprise conducting a statistical analysis of the extracted feature data in order to determine structural parameter feature data.

Where audio segments have been identified in the input audio stream the method may comprise processing identified audio segments.

Analysing the determined structural parameter feature data may comprise inputting the determined structural feature data into a supervised learning based classifier model in order to extract semantic feature data. The supervised learning model may comprise a Support Vector Machine (SVM). The classifier model may be arranged to output semantic feature data including one or more from the group of: musical style; mood of music; instruments used within the input audio signal.

Conveniently, comparing the feature data may comprise weighting feature data that is assessed using a similarity algorithm.

Two or more types of feature data may be used to compare the feature data of the input audio signal with the pre-processed feature data and the weighting given to each type of feature data may be customisable. For example, two types of semantic feature data (such as mood and style) and one type of structural feature data (such as tone) may be used to compare input audio to pre-processed feature data using a comparison algorithm. The various types (groups) of feature data may be weighted relative to one another depending on the context of the search.
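For illustration only, the sketch below shows one way such a customisable, weighted comparison might be expressed. The group names ("mood", "style", "tone") and the weight values are assumptions taken from the example above, not a prescribed implementation.

```python
# Illustrative sketch: customisable weighting of feature-data groups when
# comparing an input audio profile against a pre-processed candidate profile.
import numpy as np

def group_distance(a, b):
    """Euclidean distance between two feature vectors of the same group."""
    return float(np.linalg.norm(np.asarray(a) - np.asarray(b)))

def weighted_score(query, candidate, weights):
    """Lower score = more similar; weights are customisable per feature group."""
    return sum(w * group_distance(query[g], candidate[g]) for g, w in weights.items())

# Example context: emphasise mood over style and tone for this particular search
weights = {"mood": 0.5, "style": 0.3, "tone": 0.2}
```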

The method may further comprise pre-processing the identified structural feature data in order to normalise the feature data.

According to a second aspect of the present invention there is provided a system for matching an input audio signal to one or more audio segments within a plurality of audio segments, the system comprising: an input arranged to receive the input audio signal; a processor arranged to: process the input audio signal to determine structural parameter feature data related to the received input audio signal; analyse the determined structural parameter feature data to extract semantic feature data; and compare the feature data of the input audio signal to pre-processed feature data relating to the plurality of audio segments in order to match one or more audio segments within a similarity threshold of the input audio signal; an output arranged to output a search result on the basis of the matched one or more audio segments wherein the processor is arranged to extract semantic feature data from the structural parameter data using a supervised learning technique.

According to a third aspect of the present invention there is provided a method of matching an input audio file, the input audio file being associated with structural parameter feature data and semantic feature data, to one or more audio segments within a plurality of audio segments, the method comprising: receiving the input audio file; determining the structural parameter feature data and semantic feature data associated with the input audio file, the semantic feature data having been extracted from the structural parameter data using a supervised learning technique; comparing the feature data of the input audio file to pre-processed feature data relating to the plurality of audio segments in order to match one or more audio segments within a similarity threshold of the input audio file; outputting a search result on the basis of the matched one or more audio segments.

According to a fourth aspect of the present invention there is provided a system for matching an input audio file, the input audio file being associated with structural parameter feature data and semantic feature data, to one or more audio segments within a plurality of audio segments, the system comprising: an input arranged to receive the input audio file; a processor arranged to determine the structural parameter feature data and semantic feature data associated with the input audio file, the semantic feature data having been extracted from the structural parameter data using a supervised learning technique; and to compare the feature data of the input audio file to pre-processed feature data relating to the plurality of audio segments in order to match one or more audio segments within a similarity threshold of the input audio file; an output arranged to output a search result on the basis of the matched one or more audio segments.

According to a fifth aspect of the present invention there is provided a method of building a data store of audio segment characterisations, comprising: receiving an input audio file; processing the input audio to determine structural parameter feature data related to the received input audio signal; analysing the determined structural parameter feature data to extract semantic feature data wherein semantic feature data is extracted from the structural parameter data using a supervised learning technique; storing the determined structural parameter feature data and extracted semantic feature data in a data store.

According to a sixth aspect of the present invention there is provided a method of segmenting an input audio signal, the input audio file being associated with structural parameter feature data and semantic feature data, into one or more audio segments, the method comprising: receiving the input audio signal; processing the input audio signal to determine structural parameter feature data related to the received input audio signal; segmenting the input audio signal to identify one or more audio segments wherein segmenting comprises identifying candidate audio segments based on changes in the determined feature data over time.

It will be appreciated that any of the preferred and/or optional features of the first aspect of the invention may be incorporated alone or in appropriate combination in any of the second, third, fourth, fifth and sixth aspects of the invention. It is also noted that references to audio files within the third to sixth aspects of the present invention may include complete musical tracks or fragments thereof.

The present invention provides a method and system for analysing audio data. The invention extends to the following:

1) a system for recognising similar features in at least two audio segments, comprising: a sampler of an existing audio stream, a real time feature extractor, a media search engine connected to the feature extractor, a media search engine able to match the features to a database of audio segment characterisations, including both low level and high level descriptors, a media search engine able to return audio samples of matching audio segments, and an audio-visual display which allows the matching audio segments to be sampled.

Examples of embodiment of such a system include using a known audio sample to locate similar but unknown alternatives (such as, but not exclusively, music by unknown artists); using a catalogued audio sample to find versions of the same (such as, but not exclusively, live recordings of commercial music); using an audio sample to locate linking points in other audio (such as, but not exclusively, in music mixing); using an audio sample to discover music relationships between audio samples (such as, but not exclusively, in relationships between music from different world cultures).

2) a method for identifying meaningful audio segments in an audio file.

It is recommended to have consistent segments so that the statistical moments derived from the frame-based analysis are more representative. As an example, the more consistent the segment is, the more specific the statistical mean of its audio descriptors and the more relevant the similarity matching will be. This consistency is defined in terms of statistical consistency of audio descriptors.

The main purpose of this module is to segment audio into meaningful and consistent segments. The relevant "cutting point" is located to maximise the consistency of each part. Although the present invention is not limited to any particular type of segmentation, it is recommended to use a classifier-based segmentation algorithm.

The audio is first divided into frames from which descriptors are extracted. When salient changes are identified, a segmentation marker is generated. The salient changes are evaluated using the accuracy of a classifier, by comparing descriptors prior to and following the candidate segmentation marker.

The segmentation marker method may use the following steps (steps 2 and 4 are illustrated in the sketch following the list):

1) Compute the novelty curve at each point in time (using a classifier-based method)

2) Detect peaks in the curve

3) Search locally for the exact position of a segmentation marker

4) Validate with decision heuristics (normalise according to length constraints and other predefined rules)
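The sketch below illustrates, under stated assumptions, steps 2) and 4) of the above list: peak detection on an already-computed novelty curve followed by a simple length heuristic. It assumes the numpy and scipy Python libraries; the minimum segment length and peak-height rule are illustrative parameters, not prescribed values.

```python
# Illustrative sketch: detect novelty peaks and validate candidate segments
# against a minimum-length heuristic (scipy/numpy assumed).
import numpy as np
from scipy.signal import find_peaks

def candidate_segments(novelty, hop_s, min_len_s=3.0):
    """novelty: 1-D novelty curve (one value per analysis frame);
    hop_s: time in seconds between consecutive novelty values."""
    # Step 2: detect peaks above the mean novelty, separated by at least min_len_s
    peaks, _ = find_peaks(novelty,
                          height=float(np.mean(novelty)),
                          distance=max(1, int(min_len_s / hop_s)))
    boundaries = [0.0] + [p * hop_s for p in peaks] + [len(novelty) * hop_s]

    # Step 4: validate with decision heuristics (here: discard over-short segments)
    return [(s, e) for s, e in zip(boundaries[:-1], boundaries[1:])
            if (e - s) >= min_len_s]
```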

An embodiment of such a method may allow a very precise search for the desired mood in a particular piece of music, regardless of whether the audio segment comes from a 30-second production piece or from a concept album.

3) a method comprising a feature extractor, or a series of feature extractors, to identify high level and low level mood-sensitive features.

The audio features (also called audio descriptors or feature vectors) are variables extracted from the audio signal describing some aspect of the information it contains. A rich set of audio features is extracted based on temporal and spectral representations of the audio signal. The resulting values are stored in audio clips. An audio clip is linked to an audio sample and contains all the audio features extracted from it. After the audio feature extraction step, each audio sample therefore has an associated audio profile.

The audio features are divided into two types: low-level and high-level.

Low-level stands for descriptors that are closer to the signal, extracted from temporal and spectral representations. See below for examples.

High-level stands for descriptors that are based on the output from a trained machine learning algorithm, using curated databases as described below.

Low-Level Features

To extract low-level descriptors, for each audio stream, the stereo channels are merged into a mono mixture. Then frame-based features are summarised with their component-wise statistics across the audio stream. Here is an example set of audio features grouped by type:

-Timbre: Bark bands, Mel Frequency Cepstrum Coefficients (MFCCs), pitch salience, High Frequency Component (HFC), Loudness, Spectral flatness, Spectral flux, rolloff, complexity, centroid, kurtosis, skewness, crest, decrease, spread.

-Tonal: dissonance, chords change rate, mode, key strength, tuning diatonic strength, tristimulus.

-Rhythm: bpm, bpm confidence, silence rate, onset rate.

Following this procedure, for each audio sample, feature statistics are computed (such as, but not limited to: minimum, maximum, mean, variance and derivatives). It is recommended then to standardise those values across the whole music collection, easing their combination to build similarity measures.
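As a minimal sketch of this procedure (assuming numpy and scikit-learn, with random placeholder arrays standing in for real frame-based descriptors), per-sample statistics may be computed and then standardised across the collection as follows.

```python
# Illustrative sketch: component-wise statistics per audio sample, then
# standardisation across the whole collection (scikit-learn assumed).
import numpy as np
from sklearn.preprocessing import StandardScaler

def summarise(frames):
    """frames: array of shape (n_descriptors, n_frames) for one audio sample."""
    stats = [frames.min(axis=1), frames.max(axis=1),
             frames.mean(axis=1), frames.var(axis=1),
             np.diff(frames, axis=1).mean(axis=1)]   # mean of the first derivative
    return np.concatenate(stats)

# Placeholder collection: 10 samples, each with 20 descriptors over 400 frames
collection_frames = [np.random.rand(20, 400) for _ in range(10)]
summaries = np.vstack([summarise(f) for f in collection_frames])
standardised = StandardScaler().fit_transform(summaries)  # zero mean, unit variance per component
```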

High-Level Features

High-level features are defined as descriptors capturing the semantics of an audio sample. These features can model concepts such as mood, style and others.

Using the low-level features extracted previously, predictive models (classifiers) are built for each of these high-level categories based on ground truth databases. Ground truth databases are curated databases made up of representative examples of each category. From these databases, each predictive model can learn specific classes. This is achieved using supervised learning algorithms from the machine learning field. Efficient models such as Support Vector Machines (SVMs) are recommended. The main requisite is that the model can produce classification probabilities for each category, since this property allows those probabilities to be used as descriptors.

The outputs of this process are classifiers of mood, style, instruments and others (the set can be extended indefinitely). Each classifier can be applied to obtain a set of probability estimations. For each audio sample, classification probabilities for all high-level features are computed and included in the audio profile.
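By way of illustration (and assuming scikit-learn, with random placeholder arrays in place of a curated ground truth database), one such probabilistic classifier, here a hypothetical mood model, might be built and applied as follows; its class probabilities then serve as high-level descriptors.

```python
# Illustrative sketch: a probabilistic classifier per high-level category,
# trained on placeholder "ground truth" data (scikit-learn assumed).
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Placeholder ground truth: 200 examples x 40 low-level descriptors, 3 mood classes
X_train = np.random.rand(200, 40)
y_train = np.random.choice(["happy", "sad", "aggressive"], size=200)

mood_model = make_pipeline(StandardScaler(),
                           SVC(probability=True))   # probability outputs are required
mood_model.fit(X_train, y_train)

# For a new audio sample, the classification probabilities become descriptors
sample = np.random.rand(1, 40)                       # placeholder low-level vector
mood_probabilities = mood_model.predict_proba(sample)[0]
high_level_descriptors = dict(zip(mood_model.classes_, mood_probabilities))
```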

The feature extraction process may group all audio descriptors (low-level and high-level feature vectors) into vectors, matrices or other data structures and store them in audio profiles. Audio profiles can be saved in memory, databases, files or any other available way to store data.

4) a method for identifying similar features between audio segments.

The main objective of this system is to provide a good match to a query. In this context, the method used to compute the match is essential. It also has to be flexible enough to be adapted (manually or automatically) to as many use cases as possible.

The method employed is a mood matching similarity algorithm computing a measure used to compare instances and find the closest one (most similar).

The mood matching similarity measure is computed based on the extracted features contained in the audio profiles (both low-level and high-level).

This similarity measure is a weighted sum of several similarity measures. With a linear combination, a final similarity score is used to retrieve similar audio samples. This similarity measure can be customised with the coefficients of the linear combination for each component.

The importance of each descriptor is customisable with the coefficients of the weighted Pearson correlation measure on high-level features. This can be customised automatically, optimising the coefficients according to a set of rules and constraints.

By comparing audio profiles, and especially the high-level descriptors, the similarity measure is computed and used to retrieve similar audio samples. A Nearest Neighbour Search Algorithm can be used to find the most similar results.
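A minimal sketch of such a measure is given below, assuming numpy: a weighted Pearson correlation between two high-level descriptor vectors and a brute-force nearest-neighbour ranking over candidate profiles. The weights, array shapes and placeholder data are illustrative assumptions.

```python
# Illustrative sketch: weighted Pearson correlation on high-level descriptors
# plus a brute-force nearest-neighbour ranking (numpy assumed).
import numpy as np

def weighted_pearson(x, y, w):
    x, y, w = map(np.asarray, (x, y, w))
    mx, my = np.average(x, weights=w), np.average(y, weights=w)
    cov_xy = np.average((x - mx) * (y - my), weights=w)
    cov_xx = np.average((x - mx) ** 2, weights=w)
    cov_yy = np.average((y - my) ** 2, weights=w)
    return cov_xy / np.sqrt(cov_xx * cov_yy)

def rank_candidates(query_profile, candidate_profiles, weights):
    """Return candidate indices ordered by descending weighted-Pearson similarity."""
    scores = [weighted_pearson(query_profile, c, weights) for c in candidate_profiles]
    order = np.argsort(scores)[::-1]                    # most similar first
    return [(int(i), float(scores[i])) for i in order]

# Placeholder usage: three candidate profiles of five high-level descriptors each
query = np.random.rand(5)
candidates = np.random.rand(3, 5)
descriptor_weights = np.array([0.4, 0.3, 0.1, 0.1, 0.1])  # illustrative importance weights
print(rank_candidates(query, candidates, descriptor_weights))
```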

5) a server and database search system.

The system can be used with a client/server approach. The audio segment characterisations may be stored on a server, in a database, the audio segment characterisations containing both the audio segments and metadata related to those segments.

The database is linked to the audio information and the server is able to stream audio data to the user.

Before any query can be made, the database needs to be analysed with the previously described algorithms (segmentation and feature extraction).

When a similarity query is issued from an audio segment already included in the database, the similarity matching is done directly using the computed audio profiles.

However, a similarity query may be issued from any audio sample uploaded by the user or provided by any other means. In this case, there is a first step of feature extraction to obtain the audio profile. This is done "on-the-fly" by the server and then matched against the database.

6) a method for extracting features from streaming audio in real time, comprising a real-time feature extractor.

The present invention may extend to a desktop application (or smart device application) which sits in the background and "listens" to what the user is listening to (e.g. YouTube, Spotify, iTunes Library), analyses a few seconds at a time, and sends real-time analysed data to a data store to find similar audio segments from a collection it is connected to.

Brief description of the drawings

In order that the invention may be more readily understood, reference will now be made, by way of example, to the accompanying drawings in which:

Figure 1 shows a flow chart according to an embodiment of the present invention detailing how a user may search for audio samples;

Figure 2 shows a flow chart according to an embodiment of the present invention that shows the search procedure of Figure 1 in more detail;

Figure 3 shows a flow chart according to an embodiment of the present invention detailing how a database of audio segment characterisations may be created;

Figure 4 shows the process of segmenting an audio stream according to an embodiment of the present invention;

Figure 5 shows the process according to an embodiment of the present invention of extracting low level audio features from an audio segment;

Figure 6 shows the process according to an embodiment of the present invention of extracting high level audio features from an audio segment;

Figure 7 shows the process of defining a feature vector for an audio segment according to an embodiment of the present invention;

Figure 8 shows the process of comparing audio segment characterisations according to an embodiment of the present invention;

Figure 9 illustrates the various components of an input audio signal and the associated storage within a database in accordance with an embodiment of the present invention.

Detailed description of the invention

In the description below various different terms will be used to refer to audio data at different points within the system/method according to the present invention. The following terminology is used:

As described above the present invention is arranged to process an input audio signal and to match the input audio to one or more stored audio segments. The input audio signal may comprise a complete music track, a part of a track or may be a constant input that is processed to match audio segments continuously (i.e. the method of the present invention may constantly "sniff" an input audio signal). Within the description below the input audio signal is also referred to as an "audio source" or "audio stream".

Inputs to a search engine in accordance with embodiments of the present invention are variously referred to as "seed" inputs. In particular, with reference to Figure 1 below, a "seed query" refers to an audio signal that has been processed to derive audio data. For example, an input audio signal (or "seed audio stream") may be processed to generate the seed query.

The seed query may not necessarily relate to a complete audio track but to a portion or segment thereof. Generation of a "seed query" may therefore involve determining an audio segment which is then analysed to extract a number of feature vectors (both high level and low level features as mentioned above), the resulting audio segment being used as the seed query. Extracted feature vectors may also be used as the basis of a "seed query".

Low level features within the feature vectors described below relate to structural features of the input audio. "Low level" features and "low level feature vectors" are therefore equivalent terms to "structural parameter feature data" as used above. High level features within the feature vectors described below relate to semantic features (such as mood or genre). "High level" features and "high level feature vectors" are therefore equivalent terms to "semantic feature data" as used above.

Within the discussion below an "audio segment characterisation" refers to an (automatically) extracted audio segment and its associated high and low level feature vectors and any related metadata.

A similarity database (as described below) may comprise a selection of audio segment characterisations.

Turning to Figure 1, an example of the user experience of searching for audio samples according to an embodiment of the present invention is shown.

In step 10, a seed audio stream is derived from an audio source (e.g. online streaming source or music library). A similarity search is triggered (step 20) from the seed query into the similarity database of pre-processed audio segment characterisations. Similarity results are returned (step 30) by the audio segment characterisation database by order of relevance to the seed query and aligned according to the resulting clip starting point. Results afford sampling of matching audio segments in real time with fast audio loops. Any resulting audio segment can serve as the starting point for a new query (step 40).

It is noted that in step 10 above the seed query takes the form of a segment of audio that has been processed to extract feature vectors. An audio segment may be identified either by real-time analysis of a limited streaming window or by pre-processing a longer stream of audio with an automatic segmentation algorithm.

It is further noted that in step 40 if a resulting audio segment characterisation is used as the starting point for a new query then the subsequent search will occur more quickly than the initial search because the seed query in the new search is using feature vector information retrieved from the database as part of the preceding search rather than having to process a new audio stream to derive such data.

Figure 2 shows the process of searching in more detail. An initial starting query comes, in step 50, from an audio stream (e.g. online streaming source or music library). It is noted that the audio stream is most likely an audio sample rather than a whole track. However, in some embodiments an entire track may be used. If the audio stream is from an outside source (step 60) then the stream is segmented and similarity features are extracted in real time (step 70) to form the seed query (step 75). The similarity features extracted in step 70 may take the form of one or more audio feature vectors. The seed query is then sent to the similarity database for matching with pre-processed audio segments (step 20) and results are returned in step 30.

If the audio stream is from an inside source (Step 80, i.e. the seed query has come from an earlier search as per step 40 above), then the corresponding audio feature vectors relating to the audio stream may be retrieved from an audio segment/audio segment characterisation database and matched to other audio segments from the database. Similarity measures may be performed and results sent back to the front end interface, aligned along the starting point of each matching audio segment.

Figure 3 relates to the process of creating a database of audio segment characterisations, that is to say a database of pre-processed feature data relating to audio segments that have been analysed and characterised for use in the search processes of Figures 1 and 2 above.

In step 85, an audio stream is used as a starting point for building the audio segment characterisation database. The audio stream is segmented, in step 90, into audio segments using an automatic segmentation algorithm, indicating each segment start and end point. Low level and high level similarity features are extracted, in step 100, from each segment and stored in the audio segment characterisation database (120) as an audio segment characterisation (the audio segment plus associated feature vectors). The database is indexed, in step 110, according to each feature type for each audio segment characterisation.

It is noted that in the following description reference is made to the audio segment characterisations comprising a sound file that is stored in the same database along with associated feature vectors. It is noted however that the audio segment characterisation does not necessarily require the sound file to be stored in the same database.

As an alternative the start and end points of the audio segment may be stored as part of the audio segment characterisation and the actual sound file (that the audible, playable component is within) may be located in a different location (e.g. the audio segment characterisation may be contained within the audio segment characterisation database and the original sound file may be stored within a third party's music library). In the following description therefore the term "audio segment characterisation" should be taken to encompass:

(i) A similarity database where the audio segment characterisation comprises the audio segment and associated feature vectors; and

(ii) A similarity database where the audio segment characterisation comprises details of the start and end points of a segment within an audio track (the start/end points defining the audio segment) which is stored in a different location to the audio segment characterisation database and the feature vectors associated with the audio segment.

The processes described above in relation to Figures 1 to 3 operate to match an audio seed query with a similar audio segment stored as an audio segment characterisation in a database (the audio segment characterisation comprising the audio segment, or reference to it, and associated feature vectors). Figure 4 illustrates the process of audio segmentation. It is noted that the input audio stream (50, 85) may relate to a sampled audio stream (50) to be searched and matched against a database of pre-processed audio segment characterisations. However, Figure 4 would more commonly relate to the process of building the database of pre-processed audio segment characterisations and consequently the audio stream (85) in the description of Figure 4 relates to a new audio stream that is to be analysed and included in the audio segment characterisations database 120.

The process of Figure 4 aims to identify meaningful audio segments within the input audio stream by segmenting the input audio according to the most consistent groupings of musical features, along lines of significant transitions, including, but not exclusively, spectral distribution, tonal sequences, rhythm analysis, genre and mood.

The segmentation process comprises a feature extraction step 130 followed by a two stage segmentation determination process.

In Step 130 the input audio stream (85) is analysed to extract features within the audio input. For example, the mood of the audio may be analysed. A rhythm analysis may also be performed. It is noted that the extraction step 130 may draw upon some or all of the low level and high level analysis techniques described in relation to Figures 5 and 6 below. Once the audio input signal has passed through the feature extraction step 130 the input audio signal may be associated with an initial signal description according to the features that have been extracted.

In Step 140 a first pass through the audio input signal is made to identify candidate segmentation points. The signal description may be analysed throughout the time period (the length of the input audio signal) at a first level of temporal granularity of the input sample to identify changes in the signal description. For example, changes in mood, beat, harmony and rhythm descriptors (amongst other features) may be determined. Step 140 therefore represents a fast statistical analysis that shows time points with significant changes in the features, indicating novelty and potential segment candidates.

Novelty curves may be computed in order to detect potential segment candidates. Novelty curves are sequences of novelty estimations in time and may be computed using different techniques. A moving window centred on the current analysed time may be used to compute a novelty estimation. The two halves of the window are used to train a binary classifier (such as, but not limited to, a Support Vector Machine). The novelty estimation, at this particular timestamp, relies on the classifier cross-validation value (how well the classifier can separate those two halves). At each timestamp, this process allows a novelty estimation to be computed. The combined novelty estimations create a novelty curve. Multiple novelty curves may be analysed to combine different feature types. For instance, a mood novelty curve (based on mood features) may be used to detect changes in mood, combined with a rhythm novelty curve (based on rhythm features) to detect changes in rhythm. Depending on the context, the segmentation process can be tuned to work on one or several aspects. Segment candidates can be identified and aggregated from both independent and combined novelty curves. A peak detection algorithm may be performed on the novelty curve to identify novelty peaks and consequently detect segment candidates.
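The following sketch illustrates the classifier-based novelty estimation described above, assuming scikit-learn and numpy: at each timestamp the two halves of a moving window supply the two classes for a binary SVM, and the cross-validation accuracy is taken as the novelty value. The window size and classifier choice are illustrative assumptions.

```python
# Illustrative sketch: classifier-based novelty curve. High cross-validation
# accuracy means the two window halves are easy to separate, i.e. high novelty.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def novelty_curve(frame_features, half_window=20):
    """frame_features: array of shape (n_frames, n_descriptors)."""
    n = len(frame_features)
    novelty = np.zeros(n)
    for t in range(half_window, n - half_window):
        left = frame_features[t - half_window:t]
        right = frame_features[t:t + half_window]
        X = np.vstack([left, right])
        y = np.array([0] * len(left) + [1] * len(right))
        # Cross-validation value of an SVM separating the two halves
        novelty[t] = cross_val_score(SVC(), X, y, cv=3).mean()
    return novelty
```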

The output of Step 140 is a series of initial audio segment candidates. These are filtered in Step 150 to remove unlikely candidates. The analysis in Step 150 is a heuristics-based analysis that filters segment candidates according to a set of pre-determined rules that relate to audio processing in the context of musical tracks. For example, if the input audio sample has been identified as a "live" recording then the heuristic rules of step 150 may be designed to ignore segments containing clapping at the end of the track. Further rules may define the length of the introduction to the musical track (in other words the rules may prevent introductions from being too long relative to the length of the track) and may also define a minimum length of segment (e.g. no segments to be less than 3 seconds in length).

In Step 160 a second pass through the audio input signal is made in order to fine tune the initial segmentation analysis of step 140. This second pass comprises a granular statistical analysis that fixes the start and end points with precision. The analysis in this step 160 may be at a second, finer level of time grain in order to fine tune the start and end points of the segment. Steps 140 and 160 may therefore be seen as analogous to the process of searching through a video clip for the start of a scene. An initial scan through the video at high speed may be made to identify the rough location of the start of the scene, before being followed by a slower speed scan to identify the actual start point.

In step 180, the segments identified in step 170 may be cut according to the granular analysis, and individually labelled.

As part of the process of building a database of processed audio data, the audio segments (80) will form part of the audio segment characterisations in the database. The audio segment characterisations will also be associated with low level and high level feature vectors as described in Figures 5 to 7 below.

In Figure 5 an objective analysis of an input audio stream (particularly an audio segment identified in accordance with the method described in Figure 4) is made in order to extract objective audio features from the audio stream. Such objective features are hereinafter referred to as "Low level similarity features".

In step 190 an audio stream is provided. It is noted that although the input audio at step 190 could comprise the whole of a track (i.e. the input audio could be the same as step 85) it is preferred if the audio segment input at step 190 comprises a segment that has been identified in the process of Figure 4.

In step 200 the input audio segment is pre-processed. Pre-processing may include sample-rate conversion or audio normalisation, to adjust audio streams to a consistent amplitude and time representation.
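For illustration, and assuming the librosa Python library, the pre-processing of step 200 might be sketched as follows; the target sample rate is an arbitrary example value rather than a prescribed one.

```python
# Illustrative sketch of pre-processing: mono mix-down, sample-rate conversion
# and peak normalisation to a consistent amplitude representation.
import librosa

TARGET_SR = 22050                                    # illustrative common sample rate

def preprocess(path):
    y, sr = librosa.load(path, sr=None, mono=True)   # keep the native rate first
    if sr != TARGET_SR:
        y = librosa.resample(y, orig_sr=sr, target_sr=TARGET_SR)
    y = librosa.util.normalize(y)                    # consistent amplitude representation
    return y, TARGET_SR
```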

In step 210 a windowing process is undertaken in which the audio segment is subdivided further into audio frames, which are used as the unit of analysis.

Each frame 220 (sub-divided segment of audio) is described according to select parameters. For example:

temporal descriptors 230 measure levels of noise or loudness with RMS (root-mean-square);

spectral descriptors 240 require a Fourier transform to switch measurement from the temporal domain to the audio frequency domain. Spectral features map a change of bias, peaks or frequency;

tonal descriptors 250 analyse the chromagram of included chromas and extract tonal descriptors according to their distributions. A chromagram is the map of the frequencies of the tonal scale;

exact rhythm descriptors 260 are drawn from spectral 240 and chroma 250 frequency representations.

A statistical summary (in Step 270) of the descriptors is conducted by analysing spectral, tonal, temporal and rhythm values and indicating the mean, variance and derivatives, and storing (Step 280) the values as low level feature vectors that relate to the input audio segment.

In Figure 6 a subjective analysis of the low level feature vectors identified in Steps 270 and 280 above is made in order to extract subjective audio features related to the audio segment input in Step 190. Such subjective features are hereinafter referred to as "High level similarity features".

The process of high level analysis relies on classifiers previously trained with supervised learning algorithms to detect high level concepts (e.g. mood, style); the classifier models are applied to the low level feature vectors. This process requires some pre-processing of the feature vectors (e.g. scaling of feature values for comparison). The high level feature vectors are constructed from the probability predictions produced by each classifier model.

It is noted that any suitable supervised learning technique may be used in accordance with the process shown in Figure 6 in order to analyse the audio stream. Support Vector Machine processes are currently preferred as the mechanism for modelling such high level features, but it is noted that other suitable processes may be used, e.g. artificial neural networks, decision trees, logistic regression etc.

The process of Figure 6 comprises two parts - an offline process 300 in which a supervised learning algorithm is used to build a classifier model and an online process 302 in which the classifier model is applied to the features that are input for analysis.

The offline process 300 comprises training the model with "ground truth" data in which examples of music in a plurality of categories are provided [the model may be trained using low level feature vectors extracted from music in accordance with the process of Figure 5]. Crowd-sourced data using semantic tagging (e.g. "happy", "sad" music) may also be incorporated at this point for training the model with the supervised learning algorithm. Once the classifier model is built, any audio segment may be input into the model in order to determine subjective features regarding the music.

In the online process 302 the output of the objective analysis of Figure 5 is input (step 304) for analysis by the classifier model. A pre-processing step 306 normalises the input data and selects certain low level features for analysis. In step 308 the classifier model analyses the low level features and outputs a number of predictions, e.g. the likely style of the music (represented as a probability), the likely mood of the music (again represented as a probability) and the likely instruments that appear in the music segment (again represented as a probability).
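A much simplified sketch of the offline training and online prediction stages, using a Support Vector Machine from scikit-learn with probability outputs, is given below; the feature dimensionality, the mood labels and the random placeholder data are assumptions introduced only to make the example self-contained.

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Offline process 300: train a classifier model on "ground truth" data.
# X_train holds low level feature vectors (one row per labelled segment) and
# y_train holds semantic labels such as crowd-sourced mood tags; random
# placeholder data is used here purely to keep the example self-contained.
rng = np.random.default_rng(0)
X_train = rng.random((200, 40))
y_train = rng.choice(["happy", "sad"], size=200)

mood_model = make_pipeline(
    StandardScaler(),                     # pre-processing: scale feature values
    SVC(kernel="rbf", probability=True),  # SVM classifier with probability outputs
)
mood_model.fit(X_train, y_train)

# Online process 302: apply the classifier model to the low level features of
# a new audio segment and read out the probability predictions.
x_new = rng.random((1, 40))
probabilities = mood_model.predict_proba(x_new)[0]
high_level_entries = dict(zip(mood_model.classes_, probabilities))
print(high_level_entries)  # e.g. {'happy': 0.53, 'sad': 0.47}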

In step 310 a series of high level feature vectors are output.

Figure 7 illustrates how the low level (280, 404) and high level feature vectors (310, 406) from the processes of Figures 5 and 6 may be combined into a single feature vector that describes the audio segment that was originally input into the process of Figure 5 (the audio segment in turn being the segment identified by the process of Figure 4).

As shown in Figure 7, the methodology comprises extracting low-level features and generating low level feature vectors. These in turn allow high level features to be extracted and high level feature vectors to be generated. The feature vectors are then concatenated (step 320) and the final values are presented.
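A minimal sketch of the concatenation step, assuming that both sets of features are held as name/value mappings, might be:

def concatenate_feature_vectors(low_level: dict, high_level: dict) -> dict:
    """Concatenate low level and high level feature values into a single
    feature vector describing the audio segment (cf. step 320)."""
    combined = {}
    combined.update(low_level)   # e.g. spectral, tonal, temporal and rhythm statistics
    combined.update(high_level)  # e.g. mood, instrument and genre probabilities
    return combined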

The final values are stored in a feature vector containing all the descriptors. Feature vectors can be stored in different formats, including binary or text files. Below is an example of a simplified text file representing a single feature vector; it contains the low-level and high-level feature values of an audio segment. As noted above, the combination of an audio component (audio segment), or a reference to it, and the related feature vectors is stored as an audio segment characterisation (410).

Simplified example of a feature vector (the full version may contain hundreds of descriptors):

spectral_centroid_mean: 1175.602417
spectral_centroid_variance: 220702.578125
spectral_complexity_mean: 23.643412
spectral_complexity_variance: 4.345713
onset_rate: 14.6
bpm: 123
mode: 1
mfcc1_mean: 189.32753
mfcc1_variance: 10683.18457
mfcc2_mean: -77.99054
mfcc2_variance: 17217.40625
mfcc3_mean: 752.402161
mfcc3_variance: 27912.228516
mfcc4_mean: -17.700356
mfcc4_variance: 10118.831055
mfcc5_mean: 292.118134
mfcc5_variance: 10783.699219
mfcc6_mean: 32.26096
mfcc6_variance: 11426.730469
mfcc7_mean: 103.479568
mfcc7_variance: 5017.602051
mfcc8_mean: -1.559687
mfcc8_variance: 5904.831055
mfcc9_mean: 61.992264
mfcc9_variance: 4868.916016
mfcc10_mean: -9.912004
mfcc10_variance: 4470.294434
mfcc11_mean: -101.235558
mfcc11_variance: 6993.414551
mfcc12_mean: -68.505165
mfcc12_variance: 3775.445801
mood_angry: 0.72
mood_relaxed: 0.21
mood_happy: 0.5
mood_sad: 0.35
instrument_guitar: 0.92
instrument_drums: 0.84
instrument_voice: 0.75
instrument_sax: 0.01
genre_rock: 0.85
genre_metal: 0.99
genre_hiphop: 0.33
genre_classical: 0.01

In the example feature vector above the low level feature vectors comprise all the entries up to, but not including, the "mood", "instrument" and "genre" entries. The high level feature vectors comprise the "mood", "instrument" and "genre" entries.

Similarity comparisons

Similarity is measured between feature vectors A and feature vectors B using a similarity algorithm. The similarity algorithm is based on measuring the A and B vector values within each feature group (e.g. mood, style, rhythm or tone) and weighting each similarity measure according to its relevance to the application. The weighting allows the algorithm to be customised and adapted to a variety of user scenarios. The fusion of the weighted values generates an overall similarity measure.
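Purely by way of example, the sketch below computes a per-group similarity and fuses the group measures with weights; the use of cosine similarity within each group is an assumption, as the embodiment does not prescribe a particular per-group measure.

import numpy as np

def group_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Similarity of two feature vectors within one feature group, mapped to the
    range 0-1 (cosine similarity is used here as an assumed example measure)."""
    denom = float(np.linalg.norm(a) * np.linalg.norm(b))
    if denom == 0.0:
        return 0.0
    return (float(np.dot(a, b)) / denom + 1.0) / 2.0

def overall_similarity(vec_a: dict, vec_b: dict, weights: dict) -> float:
    """Fuse the per-group similarity measures into one value using the group weights."""
    assert abs(sum(weights.values()) - 1.0) < 1e-6, "weights should sum to 1"
    return sum(w * group_similarity(vec_a[group], vec_b[group])
               for group, w in weights.items())

# Example with hypothetical feature groups and weights:
vec_a = {"mood": np.array([0.72, 0.21, 0.50]), "tone": np.array([0.10, 0.90])}
vec_b = {"mood": np.array([0.65, 0.30, 0.40]), "tone": np.array([0.20, 0.80])}
print(overall_similarity(vec_a, vec_b, {"mood": 0.6, "tone": 0.4}))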

Figure 8 illustrates the process of comparing the feature vectors (feature vector A, 330) of a sampled audio segment (e.g. a seed audio stream 50, 400) with the feature vectors (feature vector B, 332) of a pre-processed audio segment in an audio segment characterisations database (120). It is noted that the seed segment (feature vector A) would be compared in a pair-wise manner with all the available segments within the database, although multiple such comparisons could be conducted in parallel in order to speed up the process. A similarity algorithm 334 then outputs a similarity measure 336.

The overall comparison is shown on the left hand side of Figure 8. The process is shown in more detail on the right hand side of Figure 8, where the similarity computations 334a, 334b are shown being performed against different comparison groups 330a, 330b, 332a, 332b (e.g. tempo or spectrum could be compared from the low level feature vectors). The results of the two similarity measures 334a and 334b are then fused together in step 338 to provide the overall similarity measure output.

In one preferred embodiment three feature groups are considered and compared: mood (a high level feature), tone (a low level feature) and style (a high level feature). The weighting of these groups may be determined automatically by providing a set of constraints and using a parameter optimisation algorithm (such as, but not limited to, grid search), each parameter being the weight of one feature group; a sketch of such a weight search is given after the constraint examples below. Constraints are a set of rules defining positive and negative results against which the search algorithm's results are evaluated.

Examples of constraints

Optimise the parameters so that:

- audio segment characterisations from the same song detected as similar

- audio segment characterisations by the same artist detected as similar

- audio segment characterisations from the same genre detected as similar

- audio segment characterisations matching other audio segment characterisations, using preselected similar audio segment characterisations

- audio segment characterisations from one known genre not matching with other audio segment characterisations from other pre-defined genres
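As foreshadowed above, a much simplified grid search over the weights of three feature groups might be sketched as follows; the evaluate function, which scores a weight combination against the constraints listed above, is assumed rather than specified by the embodiment.

import itertools

def grid_search_weights(evaluate, step: float = 0.1):
    """Search over weight combinations for three feature groups (mood, tone, style)
    and return the combination that best satisfies the constraints.

    evaluate(weights) is assumed to return a score reflecting how well the
    constraint rules listed above are met by the resulting search results."""
    best_weights, best_score = None, float("-inf")
    grid = [round(i * step, 10) for i in range(int(round(1 / step)) + 1)]
    for w_mood, w_tone in itertools.product(grid, grid):
        w_style = round(1.0 - w_mood - w_tone, 10)  # weights must sum to 1
        if w_style < 0:
            continue
        weights = {"mood": w_mood, "tone": w_tone, "style": w_style}
        score = evaluate(weights)
        if score > best_score:
            best_weights, best_score = weights, score
    return best_weights, best_score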

Further details on the optimisation of the search engine include:

Search engine optimisation

Search engine optimisation of comparisons between sets of similarity measures is based on the weighting of feature groupings:

w = weight value

similarity measure = (measure 1 * w1) + (measure 2 * w2) + (measure 3 * w3) + ...

where each measure lies in the range 0-1 and sum(wi) = 1.

E.g.

similarity measure = (mood measure * 0.5) + (style measure * 0.2) + (bpm measure * 0.2) + (mfcc measure * 0.1)

Real-time analysis efficiency

Real time analysis is aided by progress indicators at the front end and optimisation at the back end:

- fast-forwarding the analysis of the audio to the point where the results are still intelligible, by changing the parameters of the windowing process (a window is a small segment of audio data used as a unit of analysis)

- the ability to identify a meaningful segment of audio provides a more focused analysis and maximum efficiency with minimal resources

- tuning the classifiers' efficiency: optimising the training dataset to the bare essentials needed to achieve high accuracy measures

Figure 9 shows the components of an input audio signal 400. As described above the input signal 400 is segmented into a number of audio segments 402 from which low level feature data 404 is extracted. High level feature data 406 is then derived from the extracted low level data.

The processed audio signal may be used to populate a data store/database 408. As shown in Figure 9 the feature data 404, 406 may be stored as part of an audio segment characterisation entry 410. The audio segment characterisation 410 may further comprise a segment identifier 412 to identify the audio file from which the audio segment derives and other meta data 414.

The other meta data 414 may comprise start/end times to identify where the audio segment is located within the complete audio file (that contains the audio segment).

In one embodiment the audio file which contains the audio segment in question may be stored in a second data store (not shown) in which case the other meta data 414 may also comprise a hyperlink or other suitable link to the audio file that contains the audio segment in question.
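By way of illustration only, an audio segment characterisation entry might be represented by a structure along the following lines; the field names are assumptions, and the reference numerals in the comments correspond to Figure 9.

from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class AudioSegmentCharacterisation:
    """One entry (410) in the audio segment characterisation database (408)."""
    segment_identifier: str             # identifier (412) of the source audio file
    low_level: Dict[str, float]         # low level feature data (404)
    high_level: Dict[str, float]        # high level feature data (406)
    start_time: float = 0.0             # other meta data (414): segment start in seconds
    end_time: float = 0.0               # other meta data (414): segment end in seconds
    audio_link: Optional[str] = None    # optional link to the audio file in a second data store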

It will be understood that the embodiments described above are given by way of example only and are not intended to limit the invention. It will also be understood that the embodiments described may be used individually or in combination. For the sake of clarity the text in figures 1 to 3 is reproduced below:

Figure 1

Box 10 - seed audio stream - candidate identified from outside source or available (audio segment characterisations)

Box 20 - similarity search - search into audio segment characterisation database triggered by the seed query

Box 30 - similarity results - results returned by order of relevance and aligned by result starting point

Box 40 - audio samples - available for instant audio sampling from any starting point

Figure 2

Box 50 - audio stream

Box 60 - outside source - online database or private library

Box 70 - similarity features - extract audio features from audio segments

Box 75 - seed query - extracted data sent to database

Box 80 - inside source - from (audio segment characterisations) database

Box 20 - perform similarity measures on seed query within database (as per Figure 1)

Box 30 - return closest matched audio segment characterisations in order of relevance (as per Figure 1)

Figure 3

Box 85 - audio stream

Box 90 - audio segmentation - audio content divided into audio segments

Box 100 - similarity features - extraction of high level and low level features (for each audio segment)

Box 110 - database indexing

Box 120 -database




 