

Title:
AUDIO SIGNAL ANALYSIS
Document Type and Number:
WIPO Patent Application WO/2015/114216
Kind Code:
A2
Abstract:
A technique for audio processing is provided. According to an example embodiment, the technique comprises obtaining one or more sets of features descriptive of characteristics of a segment of audio signal representing a piece of music and deriving a club score that is indicative of at least beat strength associated with said segment of audio signal on basis of said features. According to another example embodiment, the technique comprises obtaining one or more audio attributes characterizing a segment of audio signal representing a piece of music, the audio attributes possibly including the club score, and selecting a switching pattern from a plurality of predetermined switching patterns based at least in part on said one or more audio attributes, wherein a switching pattern is arranged to indicate temporal locations for introduction of discontinuities in visual content associated with said segment of audio signal in relation to temporal locations of beats or downbeats identified for said segment of audio signal.

Inventors:
ERONEN ANTTI (FI)
CURCIO IGOR (FI)
OJANPERÄ JUHA (FI)
ROININEN MIKKO (FI)
Application Number:
PCT/FI2015/050059
Publication Date:
August 06, 2015
Filing Date:
January 30, 2015
Assignee:
NOKIA CORP (FI)
International Classes:
H04N5/262
Attorney, Agent or Firm:
NOKIA TECHNOLOGIES OY et al. (IPR Department, Karakaari 7, Espoo, FI)
Claims:
1. An apparatus comprising at least one processor and at least one memory including computer program code for one or more programs, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to: obtain one or more sets of features descriptive of characteristics of a segment of audio signal representing a piece of music, said one or more sets comprising at least a first set of features comprising one or more beat tracking, BT, features descriptive of periodicity of said segment of audio signal, a second set of features comprising one or more fluctuation pattern, FP, features descriptive of modulation energies at a set of modulation frequencies across a set of predetermined frequency bands in said segment of audio signal, a third set of features comprising one or more detrended fluctuation, DF, features descriptive of correlations across different time scales in said segment of audio signal, and a fourth set of features comprising one or more energy features descriptive of the signal energy within said segment of audio signal, and derive a club score on basis of the features in the first, second, third and fourth sets of features, which club score is indicative of at least beat strength associated with said segment of audio signal.

2. An apparatus according to claim 1, wherein the apparatus caused to derive the club score is further caused to derive the club score by applying a classification or regression model on the features of the first, second, third and fourth sets of features.

3. An apparatus according to claim 2, wherein the apparatus caused to derive the club score is further caused to derive the club score as a linear combination of the features of the first, second, third and fourth sets of features, each feature weighted by a respective predetermined weighting factor.

4. An apparatus according to claim 3, wherein the apparatus caused to derive the club score is further caused to normalize the features of the first, second, third and fourth sets of features by respective predetermined normalization parameters prior to derivation of said linear combination.

5. An apparatus according to any of claims 1 to 4, wherein the apparatus caused to derive the club score is further caused to normalize the derived club score by multiplying the derived club score by a predetermined normalizing factor.
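The derivation described in the preceding claims (per-feature normalization, a weighted linear combination, and normalization of the resulting score) can be illustrated with the following Python sketch. It is illustrative only and forms no part of the claims; the feature values, normalization parameters, weights and scale factor below are hypothetical placeholders.

```python
import numpy as np

def club_score(features, means, stds, weights, score_scale=1.0):
    """Illustrative club score: normalize each feature, combine linearly,
    then normalize the resulting score by a predetermined factor."""
    f = np.asarray(features, dtype=float)
    normalized = (f - np.asarray(means)) / np.asarray(stds)   # per-feature normalization
    score = float(np.dot(np.asarray(weights), normalized))    # weighted linear combination
    return score * score_scale                                # normalize the derived score

# Hypothetical example with four features (one per feature set of claim 1)
score = club_score([0.8, 0.5, 0.6, 0.7],
                   means=[0.5, 0.5, 0.5, 0.5],
                   stds=[0.2, 0.2, 0.2, 0.2],
                   weights=[0.4, 0.3, 0.2, 0.1],
                   score_scale=0.5)
```

In practice the weights and normalization parameters would be learned from labelled training data, e.g. by linear regression against human danceability ratings.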

6. An apparatus according to any of claims 1 to 5, wherein said BT features comprise one or more features extracted on basis of a sequence of periodicity vectors, each periodicity vector comprising a plurality of periodicity values descriptive of the strength of periodicity over a range of period lengths for a respective sub-segment of said segment of audio signal, wherein the periodicity vectors are computed on basis of a first accent signal descriptive of the harmonic and pitch information in said segment of audio signal, and one or more features extracted on basis of a second accent signal descriptive of the lowest frequency band in said segment of audio signal.

7. An apparatus according to claim 6, wherein said BT features comprise one or more of the following:

- a BT feature indicative of the average of said second accent signal,

- a BT feature indicative of standard deviation of said second accent signal,

- a BT feature indicative of the maximum periodicity value in the mean or median of said periodicity vectors,

- a BT feature indicative of the sum of the periodicity values in the mean or median of said periodicity vectors,

- a BT feature indicative of whether the tempo identified in said segment of audio signal is constant or non-constant.
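As an illustration of the periodicity vectors referred to in the claims above, the following Python sketch computes the strength of periodicity over a range of period lengths as the normalized autocorrelation of an accent signal. The accent signal, lag range and normalization are hypothetical choices, not part of the claims.

```python
import numpy as np

def periodicity_vector(accent, min_lag=20, max_lag=120):
    """Illustrative periodicity vector: normalized autocorrelation of a
    (mean-removed) accent signal over a range of candidate period lengths."""
    a = accent - np.mean(accent)
    denom = np.dot(a, a)                       # zero-lag energy for normalization
    lags = np.arange(min_lag, max_lag + 1)
    pv = np.array([np.dot(a[:-k], a[k:]) / denom for k in lags])
    return pv, lags

# An impulse train with period 50 samples yields a periodicity peak at lag 50.
accent = np.zeros(1000)
accent[::50] = 1.0
pv, lags = periodicity_vector(accent)
best_lag = int(lags[np.argmax(pv)])
```

Features such as the maximum or the sum of the periodicity values can then be read directly off `pv`, in the spirit of the BT features listed above.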

8. An apparatus according to any of claims 1 to 7, wherein said modulation energies are derived as a Fast Fourier Transform applied to frequency bands of a mel-frequency cepstral representation of said segment of audio signal rearranged into said set of predetermined frequency bands.
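The modulation-energy computation referred to in this claim, an FFT applied along the time axis of a band-energy representation, can be sketched as follows. This is illustrative only: the frame rate, band count and modulation-frequency range are hypothetical placeholders, and a plain band-energy matrix stands in for the mel-frequency cepstral representation.

```python
import numpy as np

def fluctuation_pattern(band_energies, frame_rate=100.0, max_mod_hz=10.0):
    """Illustrative fluctuation pattern: per frequency band, the magnitude
    spectrum of the band-energy trajectory over time, kept up to a maximum
    modulation frequency."""
    bands, frames = band_energies.shape
    spectrum = np.abs(np.fft.rfft(band_energies, axis=1))     # FFT over time, per band
    mod_freqs = np.fft.rfftfreq(frames, d=1.0 / frame_rate)   # modulation frequencies (Hz)
    keep = mod_freqs <= max_mod_hz                            # keep low modulation rates
    return spectrum[:, keep], mod_freqs[keep]

# A 4 Hz amplitude modulation in one band shows up as a peak near 4 Hz.
t = np.arange(200) / 100.0
mel = np.zeros((3, 200))
mel[1] = 1.0 + 0.5 * np.sin(2 * np.pi * 4.0 * t)
fp, freqs = fluctuation_pattern(mel)
peak_hz = freqs[np.argmax(fp[1, 1:]) + 1]  # skip the DC bin
```

The FP features listed in the following claim (bass, gravity, focus, maximum, sum, and so on) are then simple statistics over such a matrix of modulation energies.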

9. An apparatus according to claim 8, wherein said FP features comprise one or more of the following:

- an FP bass feature, descriptive of the combined modulation energy in a predetermined number of the lowest frequency bands of said set of frequency bands,

- an FP gravity feature, indicative of the modulation frequency derived as the sum of weighted sums of modulation frequencies divided by the sum of modulation energies across the modulation frequencies and the frequency bands, each modulation frequency weighted by the sum of modulation energies across frequency bands at respective modulation frequency,

- an FP focus feature, descriptive of the modulation energy distribution across the modulation frequencies and the frequency bands,

- an FP maximum feature, indicative of the maximum modulation energy across the modulation frequencies and the frequency bands,

- an FP sum feature, indicative of the sum of modulation energies across the modulation frequencies and the frequency bands,

- an FP aggressiveness feature, indicative of the sum of modulation energies at a first predetermined number of highest frequency bands in a second predetermined number of lowest modulation frequencies, divided by the maximum modulation energy across the modulation frequencies and the frequency bands,

- an FP low-frequency domination indicator, indicative of the ratio between the sum of modulation energies in a first predetermined number of highest frequency bands and the sum of modulation energies in a second predetermined number of lowest frequency bands.

10. An apparatus according to any of claims 1 to 9, wherein said correlations across different time scales are provided as one or more DF exponent values derived on basis of the energy of the mel-frequency cepstral representation of said segment of audio signal.

11. An apparatus according to claim 10, wherein said DF features comprise at least the average of the DF exponent values.
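A detrended fluctuation (DF) exponent of the kind referred to in claims 10 and 11 can be estimated as the log-log slope of fluctuation versus window size for the integrated, window-wise detrended signal. The following Python sketch is illustrative only; the window sizes are hypothetical choices.

```python
import numpy as np

def dfa_exponent(signal, window_sizes=(4, 8, 16, 32, 64)):
    """Illustrative detrended fluctuation analysis: slope of
    log(fluctuation) vs. log(window size)."""
    y = np.cumsum(signal - np.mean(signal))                # integrated profile
    flucts = []
    for n in window_sizes:
        n_windows = len(y) // n
        f2 = []
        for w in range(n_windows):
            seg = y[w * n:(w + 1) * n]
            x = np.arange(n)
            trend = np.polyval(np.polyfit(x, seg, 1), x)   # linear detrend per window
            f2.append(np.mean((seg - trend) ** 2))
        flucts.append(np.sqrt(np.mean(f2)))
    slope, _ = np.polyfit(np.log(window_sizes), np.log(flucts), 1)
    return slope

# White noise has a DF exponent near 0.5; strongly correlated signals score higher.
rng = np.random.default_rng(0)
alpha = dfa_exponent(rng.standard_normal(4096))
```

Averaging such exponents, e.g. over sub-segments or over several band-energy trajectories, gives a single DF feature of the kind claim 11 refers to.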

12. An apparatus according to any of claims 1 to 11, wherein said energy features comprise one or more of the following:

- the energy of said segment of audio signal,

- the energies of sub-segments of said segment of audio signal,

- the energies at frequency bands of said segment of audio signal,

- the energies at frequency bands of sub-segments of said audio signal.

13. An apparatus according to any of claims 1 to 12, wherein said energy features are derived on basis of a mel-frequency cepstral representation of said segment of audio signal.

14. An apparatus according to any of claims 1 to 13, further caused to select a switching pattern from a plurality of predetermined switching patterns based at least in part on the derived club score, wherein a switching pattern is arranged to indicate temporal locations for introduction of discontinuities in visual content associated with said segment of audio signal in relation to temporal locations of beats or downbeats identified for said segment of audio signal.

15. An apparatus according to claim 14, wherein a discontinuity in visual content comprises one of the following:

- a switch from one still image to another still image,

- a switch from a still image to a video source,

- a switch from a video source to a still image,

- a switch from one video source to another video source,

- insertion of a visual effect causing a temporary modification of the visual content without switching from one image or video source to another.

16. An apparatus according to claim 14 or 15, wherein the apparatus caused to select the switching pattern is further caused to select a switching pattern arranged to indicate more frequent discontinuities in visual content with increasing value of the club score.

17. An apparatus according to claim 14 or 15, wherein the apparatus caused to select the switching pattern is further caused to select a switching pattern arranged to indicate a high frequency of discontinuities in visual content in response to the club score exceeding a predetermined threshold value, and select a switching pattern arranged to indicate a low frequency of discontinuities in visual content in response to the club score failing to exceed the predetermined threshold.
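The threshold rule of claim 17 can be sketched as follows: a dense switching pattern (e.g. a cut on every beat) is chosen when the club score exceeds a threshold, a sparse one (e.g. a cut on every downbeat) otherwise. The pattern identifiers and the threshold value are hypothetical placeholders, not part of the claims.

```python
# Illustrative switching patterns; the claims leave their contents open.
DENSE_PATTERN = "cut_every_beat"
SPARSE_PATTERN = "cut_every_downbeat"

def select_switching_pattern(club_score, threshold=0.6):
    """Pick a high-frequency pattern above the threshold, else a low-frequency one."""
    return DENSE_PATTERN if club_score > threshold else SPARSE_PATTERN

chosen_high = select_switching_pattern(0.9)
chosen_low = select_switching_pattern(0.2)
```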

18. An apparatus according to claim 14 or 15, wherein the apparatus caused to select the switching pattern is further caused to select a switching pattern arranged to indicate more frequent discontinuities in visual content at a probability that increases with increasing value of the club score.

19. An apparatus according to claim 18, wherein the apparatus caused to select the switching pattern is further caused to apply a Markov chain model having a plurality of states, each state associated with one of the plurality of switching patterns, and set, in response to the club score exceeding a predetermined threshold value, one or more transition probabilities towards a state associated with a switching pattern arranged to indicate a high frequency of discontinuities in visual content to high values and set one or more transition probabilities towards a state associated with a switching pattern arranged to indicate a low frequency of discontinuities in visual content to low values.

20. An apparatus according to claim 18, wherein the apparatus caused to select the switching pattern is further caused to define a plurality of Markov chain models, each having a plurality of states with each state associated with one of the plurality of switching patterns, select the Markov chain model to be applied in accordance with the club score by selecting, in response to a club score exceeding a predetermined threshold value, a Markov chain model that involves high transition probabilities towards a state associated with a switching pattern arranged to indicate a high frequency of discontinuities in visual content and low transition probabilities towards a state associated with a switching pattern arranged to indicate a low frequency of discontinuities in visual content, and selecting, in response to a club score failing to exceed the predetermined threshold value, a Markov chain model that involves low transition probabilities towards said state associated with the switching pattern arranged to indicate the high frequency of discontinuities in visual content and high transition probabilities towards said state associated with the switching pattern arranged to indicate the low frequency of discontinuities in visual content.
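The Markov-chain selection of claims 19 and 20 can be sketched as follows: each state corresponds to a switching pattern, and the transition probabilities are biased towards a dense-pattern state when the club score exceeds a threshold. All probability values, pattern names and the threshold are hypothetical placeholders.

```python
import numpy as np

# Illustrative states; one switching pattern per state.
PATTERNS = ["sparse", "medium", "dense"]

def transition_matrix(club_score, threshold=0.6):
    """Bias transitions towards the dense-pattern state for high club scores,
    and towards the sparse-pattern state otherwise (hypothetical values)."""
    if club_score > threshold:
        return np.array([[0.1, 0.2, 0.7],
                         [0.1, 0.2, 0.7],
                         [0.1, 0.2, 0.7]])
    return np.array([[0.7, 0.2, 0.1],
                     [0.7, 0.2, 0.1],
                     [0.7, 0.2, 0.1]])

def next_pattern(current_index, club_score, rng):
    """Draw the next state from the current state's transition probabilities."""
    probs = transition_matrix(club_score)[current_index]
    return rng.choice(len(PATTERNS), p=probs)

rng = np.random.default_rng(1)
draws = [next_pattern(0, 0.9, rng) for _ in range(200)]
dense_share = draws.count(2) / len(draws)   # most draws land in the dense state
```

Claim 20's variant keeps several such matrices predefined and selects the whole matrix by the club score, rather than editing individual transition probabilities.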

21. An apparatus comprising at least one processor and at least one memory including computer program code for one or more programs, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to obtain one or more audio attributes characterizing a segment of audio signal representing a piece of music, and select a switching pattern from a plurality of predetermined switching patterns based at least in part on said one or more audio attributes, wherein a switching pattern is arranged to indicate temporal locations for introduction of discontinuities in visual content associated with said segment of audio signal in relation to temporal locations of beats or downbeats identified for said segment of audio signal.

22. An apparatus according to claim 21, wherein a discontinuity in visual content comprises one of the following:

- a switch from one still image to another still image,

- a switch from a still image to a video source,

- a switch from a video source to a still image,

- a switch from one video source to another video source,

- insertion of a visual effect causing a temporary modification of the visual content without switching from one image or video source to another.

23. An apparatus according to claim 21 or 22, wherein the apparatus caused to select the switching pattern is further caused to classify said segment of audio signal into one of a plurality of predetermined classes on basis of the value of said audio attribute, and select a switching pattern assigned to the class into which said segment of audio signal is classified.

24. An apparatus according to claim 21 or 22, wherein the apparatus caused to select the switching pattern is further caused to classify said segment of audio signal into one of a plurality of predetermined classes on basis of the value of said audio attribute, and select a switching pattern according to a predefined selection rule assigned to the class into which said segment of audio signal is classified.

25. An apparatus according to any of claims 21 to 24, wherein said one or more audio attributes comprise one or more of the following: a club score determined for said segment of audio signal, the musical genre said segment of audio signal belongs to, the mood of said segment of audio signal, the mode of said segment of audio signal, the key of said segment of audio signal, the energy of said segment of audio signal, the harmony of said segment of audio signal.

26. An apparatus according to claim 23 or 24, wherein the apparatus caused to select the switching pattern is further caused to classify said segment of audio signal on basis of the value of a first audio attribute, and select the switching pattern on basis of the value of a second audio attribute, which second audio attribute is different from the first audio attribute.

27. An apparatus according to claim 26, wherein the first audio attribute comprises one of the following: the musical genre of audio signal belongs to, the mood of said segment of audio signal, the mode of said segment of audio signal, the key of said segment of audio signal, and wherein the second audio attribute comprises one of the following: a club score determined for said segment of audio signal, the energy of said segment of audio signal.

28. An apparatus comprising means for obtaining one or more sets of features descriptive of characteristics of a segment of audio signal representing a piece of music, said one or more sets comprising at least a first set of features comprising one or more beat tracking, BT, features descriptive of periodicity of said segment of audio signal, a second set of features comprising one or more fluctuation pattern, FP, features descriptive of modulation energies at a set of modulation frequencies across a set of predetermined frequency bands in said segment of audio signal, a third set of features comprising one or more detrended fluctuation, DF, features descriptive of correlations across different time scales in said segment of audio signal, and a fourth set of features comprising one or more energy features descriptive of the signal energy within said segment of audio signal, and means for deriving a club score on basis of the features in the first, second, third and fourth sets of features, which club score is indicative of at least beat strength associated with said segment of audio signal.

29. An apparatus according to claim 28, further comprising means for selecting a switching pattern from a plurality of predetermined switching patterns based at least in part on the derived club score, wherein a switching pattern indicates temporal locations for introduction of discontinuities in visual content associated with said segment of audio signal in relation to temporal locations of beats or downbeats identified for said segment of audio signal.

30. An apparatus comprising means for obtaining one or more audio attributes characterizing a segment of audio signal representing a piece of music, and means for selecting a switching pattern from a plurality of predetermined switching patterns based at least in part on said one or more audio attributes, wherein a switching pattern indicates temporal locations for introduction of discontinuities in visual content associated with said segment of audio signal in relation to temporal locations of beats or downbeats identified for said segment of audio signal.

31. A method comprising obtaining one or more sets of features descriptive of characteristics of a segment of audio signal representing a piece of music, said one or more sets comprising at least a first set of features comprising one or more beat tracking, BT, features descriptive of periodicity of said segment of audio signal, a second set of features comprising one or more fluctuation pattern, FP, features descriptive of modulation energies at a set of modulation frequencies across a set of predetermined frequency bands in said segment of audio signal, a third set of features comprising one or more detrended fluctuation, DF, features descriptive of correlations across different time scales in said segment of audio signal, and a fourth set of features comprising one or more energy features descriptive of the signal energy within said segment of audio signal, and deriving a club score on basis of the features in the first, second, third and fourth sets of features, which club score is indicative of at least beat strength associated with said segment of audio signal.

32. A method according to claim 31, wherein deriving the club score comprises deriving the club score by applying a classification or regression model on the features of the first, second, third and fourth sets of features.

33. A method according to claim 32, wherein deriving the club score comprises deriving the club score as a linear combination of the features of the first, second, third and fourth sets of features, each feature weighted by a respective predetermined weighting factor.

34. A method according to claim 33, wherein deriving the club score comprises normalizing the features of the first, second, third and fourth sets of features by respective predetermined normalization parameters prior to derivation of said linear combination.

35. A method according to any of claims 31 to 34, wherein deriving the club score comprises normalizing the derived club score by multiplying the derived club score by a predetermined normalizing factor.

36. A method according to any of claims 31 to 35, wherein said BT features comprise one or more features extracted on basis of a sequence of periodicity vectors, each periodicity vector comprising a plurality of periodicity values descriptive of the strength of periodicity over a range of period lengths for a respective sub-segment of said segment of audio signal, wherein the periodicity vectors are computed on basis of a first accent signal descriptive of the harmonic and pitch information in said segment of audio signal, and one or more features extracted on basis of a second accent signal descriptive of the lowest frequency band in said segment of audio signal.

37. A method according to claim 36, wherein said BT features comprise one or more of the following:

- a BT feature indicative of the average of said second accent signal,

- a BT feature indicative of standard deviation of said second accent signal,

- a BT feature indicative of the maximum periodicity value in the mean or median of said periodicity vectors,

- a BT feature indicative of the sum of the periodicity values in the mean or median of said periodicity vectors,

- a BT feature indicative of whether the tempo identified in said segment of audio signal is constant or non-constant.

38. A method according to any of claims 31 to 37, wherein said modulation energies are derived as a Fast Fourier Transform applied to frequency bands of a mel-frequency cepstral representation of said segment of audio signal rearranged into said set of predetermined frequency bands.

39. A method according to claim 38, wherein said FP features comprise one or more of the following:

- an FP bass feature, descriptive of the combined modulation energy in a predetermined number of the lowest frequency bands of said set of frequency bands,

- an FP gravity feature, indicative of the modulation frequency derived as the sum of weighted sums of modulation frequencies divided by the sum of modulation energies across the modulation frequencies and the frequency bands, each modulation frequency weighted by the sum of modulation energies across frequency bands at respective modulation frequency,

- an FP focus feature, descriptive of the modulation energy distribution across the modulation frequencies and the frequency bands,

- an FP maximum feature, indicative of the maximum modulation energy across the modulation frequencies and the frequency bands,

- an FP sum feature, indicative of the sum of modulation energies across the modulation frequencies and the frequency bands,

- an FP aggressiveness feature, indicative of the sum of modulation energies at a first predetermined number of highest frequency bands in a second predetermined number of lowest modulation frequencies, divided by the maximum modulation energy across the modulation frequencies and the frequency bands,

- an FP low-frequency domination indicator, indicative of the ratio between the sum of modulation energies in a first predetermined number of highest frequency bands and the sum of modulation energies in a second predetermined number of lowest frequency bands.

40. A method according to any of claims 31 to 39, wherein said correlations across different time scales are provided as one or more DF exponent values derived on basis of the energy of the mel-frequency cepstral representation of said segment of audio signal.

41. A method according to claim 40, wherein said DF features comprise at least the average of the DF exponent values.

42. A method according to any of claims 31 to 41, wherein said energy features comprise one or more of the following:

- the energy of said segment of audio signal,

- the energies of sub-segments of said segment of audio signal,

- the energies at frequency bands of said segment of audio signal,

- the energies at frequency bands of sub-segments of said audio signal.

43. A method according to any of claims 31 to 42, wherein said energy features are derived on basis of a mel-frequency cepstral representation of said segment of audio signal.

44. A method according to any of claims 31 to 43, further comprising selecting a switching pattern from a plurality of predetermined switching patterns based at least in part on the derived club score, wherein a switching pattern is arranged to indicate temporal locations for introduction of discontinuities in visual content associated with said segment of audio signal in relation to temporal locations of beats or downbeats identified for said segment of audio signal.

45. A method according to claim 44, wherein a discontinuity in visual content comprises one of the following:

- a switch from one still image to another still image,

- a switch from a still image to a video source,

- a switch from a video source to a still image,

- a switch from one video source to another video source,

- insertion of a visual effect causing a temporary modification of the visual content without switching from one image or video source to another.

46. A method according to claim 44 or 45, wherein selecting the switching pattern comprises selecting a switching pattern arranged to indicate more frequent discontinuities in visual content with increasing value of the club score.

47. A method according to claim 44 or 45, wherein selecting the switching pattern comprises selecting a switching pattern arranged to indicate a high frequency of discontinuities in visual content in response to the club score exceeding a predetermined threshold value, and selecting a switching pattern arranged to indicate a low frequency of discontinuities in visual content in response to the club score failing to exceed the predetermined threshold.

48. A method according to claim 44 or 45, wherein selecting the switching pattern comprises selecting a switching pattern arranged to indicate more frequent discontinuities in visual content at a probability that increases with increasing value of the club score.

49. A method according to claim 48, wherein selecting the switching pattern comprises applying a Markov chain model having a plurality of states, each state associated with one of the plurality of switching patterns, and setting, in response to the club score exceeding a predetermined threshold value, one or more transition probabilities towards a state associated with a switching pattern arranged to indicate a high frequency of discontinuities in visual content to high values and setting one or more transition probabilities towards a state associated with a switching pattern arranged to indicate a low frequency of discontinuities in visual content to low values.

50. A method according to claim 48, wherein selecting the switching pattern comprises defining a plurality of Markov chain models, each having a plurality of states with each state associated with one of the plurality of switching patterns, selecting the Markov chain model to be applied in accordance with the club score by selecting, in response to a club score exceeding a predetermined threshold value, a Markov chain model that involves high transition probabilities towards a state associated with a switching pattern arranged to indicate a high frequency of discontinuities in visual content and low transition probabilities towards a state associated with a switching pattern arranged to indicate a low frequency of discontinuities in visual content, and selecting, in response to a club score failing to exceed the predetermined threshold value, a Markov chain model that involves low transition probabilities towards said state associated with the switching pattern arranged to indicate the high frequency of discontinuities in visual content and high transition probabilities towards said state associated with the switching pattern arranged to indicate the low frequency of discontinuities in visual content.

51. A method comprising obtaining one or more audio attributes characterizing a segment of audio signal representing a piece of music, and selecting a switching pattern from a plurality of predetermined switching patterns based at least in part on said one or more audio attributes, wherein a switching pattern is arranged to indicate temporal locations for introduction of discontinuities in visual content associated with said segment of audio signal in relation to temporal locations of beats or downbeats identified for said segment of audio signal.

52. A method according to claim 51, wherein a discontinuity in visual content comprises one of the following:

- a switch from one still image to another still image,

- a switch from a still image to a video source,

- a switch from a video source to a still image,

- a switch from one video source to another video source,

- insertion of a visual effect causing a temporary modification of the visual content without switching from one image or video source to another.

53. A method according to claim 51 or 52, wherein selecting the switching pattern comprises classifying said segment of audio signal into one of a plurality of predetermined classes on basis of the value of said audio attribute, and selecting a switching pattern assigned to the class into which said segment of audio signal is classified.

54. A method according to claim 51 or 52, wherein selecting the switching pattern comprises classifying said segment of audio signal into one of a plurality of predetermined classes on basis of the value of said audio attribute, and selecting a switching pattern according to a predefined selection rule assigned to the class into which said segment of audio signal is classified.

55. A method according to any of claims 51 to 54, wherein said one or more audio attributes comprise one or more of the following: a club score determined for said segment of audio signal, the musical genre said segment of audio signal belongs to, the mood of said segment of audio signal, the mode of said segment of audio signal, the key of said segment of audio signal, the energy of said segment of audio signal, the harmony of said segment of audio signal.

56. A method according to claim 53 or 54, wherein selecting the switching pattern comprises classifying said segment of audio signal on basis of the value of a first audio attribute, and selecting the switching pattern on basis of the value of a second audio attribute, which second audio attribute is different from the first audio attribute.

57. A method according to claim 56, wherein the first audio attribute comprises one of the following: the musical genre said segment of audio signal belongs to, the mood of said segment of audio signal, the mode of said segment of audio signal, the key of said segment of audio signal, and wherein the second audio attribute comprises one of the following: a club score determined for said segment of audio signal, the energy of said segment of audio signal.

58. A computer program comprising one or more sequences of one or more instructions which, when executed by one or more processors, cause an apparatus to at least perform the method according to any of claims 31 to 57.

59. A computer program product comprising at least one computer readable non-transitory medium having program code stored thereon, the program code, when executed by an apparatus, causing the apparatus at least to perform the method according to any of claims 31 to 57.

Description:
AUDIO SIGNAL ANALYSIS

TECHNICAL FIELD

The example and non-limiting embodiments of the present invention relate to audio signal analysis. In particular, at least some example embodiments relate to a method, an apparatus and/or a computer program for audio signal analysis for determination of one or more audio signal attributes, e.g. analysis for determining a danceability measure descriptive of at least the beat strength of the audio signal, and/or for making use of such audio signal attributes in selection of a switching pattern that defines temporal locations of discontinuities in a visual representation to accompany the audio signal.

BACKGROUND

In music terminology, the music meter comprises the recurring pattern of stresses or accents in the music. The musical meter can be described as comprising a measure pulse, a beat pulse and a tatum pulse, listed here in order from the longest to the shortest pulse duration.

Beat pulses provide the basic unit of time in music, and the rate of beat pulses, also referred to as the tempo of the music, is considered the rate at which most people would tap their foot on the floor when listening to a piece of music. Identifying the rate and/or temporal positions of the occurrences of beat pulses in a piece of music, known as beat tracking, is desirable in a number of practical applications. Such applications include music recommendation applications in which music similar to a reference track is searched for, Disk Jockey (DJ) applications where, for example, seamless beat-mixed transitions between songs in a playlist are required, and automatic looping techniques. Beat tracking techniques typically generate a beat sequence, comprising indications of the temporal positions of beats in a piece of music or part thereof.

The information derived on basis of beat tracking may be useful in determining a genre or type of a piece of music or part thereof. However, the information available from beat tracking alone is typically not sufficient to identify, with high enough reliability or accuracy, that a piece of music belongs to a certain genre.

The following terms are useful for understanding certain concepts to be described later.

Pitch: the physiological correlate of the fundamental frequency (F0) of a note.

Chroma, also known as pitch class: musical pitches separated by an integer number of octaves belong to a common pitch class. In Western music, twelve pitch classes are used.

Beat or tactus: the basic unit of time in music, it can be considered the rate at which most people would tap their foot on the floor when listening to a piece of music and hence it may also be referred to as the foot tapping rate. The word beat (or one of its equivalents) is also used to denote the part of the music belonging to a single beat.

Tempo: the rate of the beat or tactus pulse, usually represented in units of beats per minute (BPM).

Bar or measure: a segment of time defined as a given number of beats of given duration. For example, in music with a 4/4 time signature, each measure comprises four beats.

Downbeat: the first beat of a bar or measure.

Accent or accent-based audio analysis: analysis of an audio signal to detect events and/or changes in music, including but not limited to the beginnings of all discrete sound events, especially the onsets of long pitched sounds, sudden changes in loudness or timbre, and harmonic changes. Further detail is given below.

It is believed that humans perceive musical meter by inferring a regular pattern of pulses from accents, which are stressed moments in music. Different events in music cause accents. Examples include changes in loudness or timbre, harmonic changes, and in general the beginnings of all sound events. In particular, the onsets of long pitched sounds cause accents. Techniques for automatic estimation of tempo, beat, and/or downbeat may try to imitate the human perception of music meter to some extent. This may involve the steps of measuring musical accentuation, performing period estimation of one or more pulses, finding the phases of the estimated pulses, and choosing the metrical level corresponding to the tempo or some other metrical level of interest. Since accents relate to events in music, accent-based audio analysis refers to the detection of events and/or changes in music. Such changes may relate to changes in the loudness, changes in the spectrum and/or changes in the pitch content of the signal. As an example, accent-based analysis may relate to detecting spectral change from the signal, calculating a novelty or an onset detection function from the signal, detecting discrete onsets from the signal, or detecting changes in pitch and/or harmonic content of the signal, for example, using chroma features. When performing the spectral change detection, various transforms or filter bank decompositions may be used, such as the Fast Fourier Transform or multi-rate filter banks, or even fundamental frequency (F0) estimators or pitch salience estimators. As a simple example, accent detection might be performed by calculating the short-time energy of the signal over a set of frequency bands in short frames over the signal, and then calculating the difference, such as the Euclidean distance, between every two adjacent frames. To increase the robustness for various music types, many different accent signal analysis methods have been developed.
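The simple accent-detection example described above (short-time band energies, then the Euclidean distance between adjacent frames) can be sketched as follows. This is a minimal illustration, not the patented method; the frame length, hop size and band count are illustrative assumptions.

```python
import numpy as np

def accent_signal(x, frame_len=1024, hop=512, n_bands=8):
    """Toy accent detector: compute short-time energies over a set of
    frequency bands, then take the Euclidean distance between every
    two adjacent frames (all parameter values are illustrative)."""
    n_frames = 1 + (len(x) - frame_len) // hop
    win = np.hanning(frame_len)
    spectra = np.array([
        np.abs(np.fft.rfft(x[i * hop:i * hop + frame_len] * win))
        for i in range(n_frames)
    ])
    # Sum squared FFT-bin magnitudes into a small number of equal-width bands.
    bands = np.array_split(spectra ** 2, n_bands, axis=1)
    band_energy = np.stack([b.sum(axis=1) for b in bands], axis=1)
    # Accent value = distance between consecutive band-energy vectors.
    return np.linalg.norm(np.diff(band_energy, axis=0), axis=1)
```

Feeding a silent signal with a single burst produces an accent curve that peaks around the burst onset, i.e. where the band energies change most.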
The technique(s) to be described hereafter draws on background knowledge described in the following publications which are incorporated herein by reference.

[1] International patent application no. PCT/IB2012/053329.

[2] International patent application no. PCT/IB2012/052157.

[3] Streich, Herrera, "Detrended Fluctuation Analysis of Music Signals: Danceability Estimation and further Semantic Characterization", Proceedings of the Audio Engineering Society (AES) 118th Convention, Barcelona, Spain, May 28-31, 2005.

[4] Pampalk, E., "Computational Models of Music Similarity and their Application in Music Information Retrieval", Dissertation, Vienna University of Technology, Vienna, Austria, March 2006.

[5] Cemgil, A. et al., "On tempo tracking: tempogram representation and Kalman filtering", J. New Music Research, 2001.

[6] Eronen, A., Klapuri, A., "Music Tempo Estimation with k-NN regression", IEEE Trans. on Audio, Speech and Language Processing, Vol. 18, No. 1, Jan 2010.

[7] Seppänen, J., Eronen, A., Hiipakka, J., "Joint Beat & Tatum Tracking from Music Signals", International Conference on Music Information Retrieval, ISMIR 2006.

[8] Klapuri, A., Eronen, A., Astola, J., "Analysis of the meter of acoustic musical signals", IEEE Trans. Audio, Speech, and Language Processing, Vol. 14, No. 1, 2006.

[9] Ellis, D., "Beat Tracking by Dynamic Programming", J. New Music Research, Special Issue on Beat and Tempo Extraction, vol. 36, no. 1, March 2007, pp. 51-60, DOI: 10.1080/09298210701653344.

[10] UK patent application no. 1310861.8.

[11] UK patent application no. 1317204.4.

SUMMARY

According to an example embodiment, an apparatus is provided, the apparatus comprising at least one processor and at least one memory including computer program code for one or more programs, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to obtain one or more sets of features descriptive of characteristics of a segment of audio signal representing a piece of music, said one or more sets comprising at least a first set of features comprising one or more beat tracking, BT, features descriptive of periodicity of said segment of audio signal, a second set of features comprising one or more fluctuation pattern, FP, features descriptive of modulation energies at a set of modulation frequencies across a set of predetermined frequency bands in said segment of audio signal, a third set of features comprising one or more detrended fluctuation, DF, features descriptive of correlations across different time scales in said segment of audio signal, and a fourth set of features comprising one or more energy features descriptive of the signal energy within said segment of audio signal. The apparatus is further caused to derive a club score on basis of the features in the first, second, third and fourth sets of features, which club score is indicative of at least beat strength associated with said segment of audio signal. The apparatus may be further caused to select a switching pattern from a plurality of predetermined switching patterns based at least in part on the derived club score, wherein a switching pattern is arranged to indicate temporal locations for introduction of discontinuities in visual content associated with said segment of audio signal in relation to temporal locations of beats or downbeats identified for said segment of audio signal.
According to another example embodiment, an apparatus is provided, the apparatus comprising at least one processor and at least one memory including computer program code for one or more programs, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to obtain one or more audio attributes characterizing a segment of audio signal representing a piece of music and to select a switching pattern from a plurality of predetermined switching patterns based at least in part on said one or more audio attributes, wherein a switching pattern is arranged to indicate temporal locations for introduction of discontinuities in visual content associated with said segment of audio signal in relation to temporal locations of beats or downbeats identified for said segment of audio signal.

According to another example embodiment, an apparatus is provided, the apparatus comprising means for obtaining one or more sets of features descriptive of characteristics of a segment of audio signal representing a piece of music, said one or more sets comprising at least a first set of features comprising one or more beat tracking, BT, features descriptive of periodicity of said segment of audio signal, a second set of features comprising one or more fluctuation pattern, FP, features descriptive of modulation energies at a set of modulation frequencies across a set of predetermined frequency bands in said segment of audio signal, a third set of features comprising one or more detrended fluctuation, DF, features descriptive of correlations across different time scales in said segment of audio signal, and a fourth set of features comprising one or more energy features descriptive of the signal energy within said segment of audio signal. The apparatus further comprises means for deriving a club score on basis of the features in the first, second, third and fourth sets of features, which club score is indicative of at least beat strength associated with said segment of audio signal. The apparatus may further comprise means for selecting a switching pattern from a plurality of predetermined switching patterns based at least in part on the derived club score, wherein a switching pattern is arranged to indicate temporal locations for introduction of discontinuities in visual content associated with said segment of audio signal in relation to temporal locations of beats or downbeats identified for said segment of audio signal.
According to another example embodiment, an apparatus is provided, the apparatus comprising means for obtaining one or more audio attributes characterizing a segment of audio signal representing a piece of music, and means for selecting a switching pattern from a plurality of predetermined switching patterns based at least in part on said one or more audio attributes, wherein a switching pattern indicates temporal locations for introduction of discontinuities in visual content associated with said segment of audio signal in relation to temporal locations of beats or downbeats identified for said segment of audio signal.

According to another example embodiment, a method is provided, the method comprising obtaining one or more sets of features descriptive of characteristics of a segment of audio signal representing a piece of music, said one or more sets comprising at least a first set of features comprising one or more beat tracking, BT, features descriptive of periodicity of said segment of audio signal, a second set of features comprising one or more fluctuation pattern, FP, features descriptive of modulation energies at a set of modulation frequencies across a set of predetermined frequency bands in said segment of audio signal, a third set of features comprising one or more detrended fluctuation, DF, features descriptive of correlations across different time scales in said segment of audio signal, and a fourth set of features comprising one or more energy features descriptive of the signal energy within said segment of audio signal. The method further comprises deriving a club score on basis of the features in the first, second, third and fourth sets of features, which club score is indicative of at least beat strength associated with said segment of audio signal. The method may further comprise selecting a switching pattern from a plurality of predetermined switching patterns based at least in part on the derived club score, wherein a switching pattern is arranged to indicate temporal locations for introduction of discontinuities in visual content associated with said segment of audio signal in relation to temporal locations of beats or downbeats identified for said segment of audio signal.

According to another example embodiment, a method is provided, the method comprising obtaining one or more audio attributes characterizing a segment of audio signal representing a piece of music, and selecting a switching pattern from a plurality of predetermined switching patterns based at least in part on said one or more audio attributes, wherein a switching pattern is arranged to indicate temporal locations for introduction of discontinuities in visual content associated with said segment of audio signal in relation to temporal locations of beats or downbeats identified for said segment of audio signal.

According to another example embodiment, a computer program is provided, the computer program including one or more sequences of one or more instructions which, when executed by one or more processors, cause an apparatus at least to carry out the method according to an example embodiment described in the foregoing. The computer program referred to above may be embodied on a volatile or a non-volatile computer-readable record medium, for example as a computer program product comprising at least one computer readable non-transitory medium having program code stored thereon, the program code, when executed by an apparatus, causing the apparatus at least to perform the operations described hereinbefore for the computer program according to the fifth aspect of the invention.

The exemplifying embodiments of the invention presented in this patent application are not to be interpreted to pose limitations to the applicability of the appended claims. The verb "to comprise" and its derivatives are used in this patent application as an open limitation that does not exclude the existence of also unrecited features. The features described hereinafter are mutually freely combinable unless explicitly stated otherwise.

Some features of the invention are set forth in the appended claims. Aspects of the invention, however, both as to its construction and its method of operation, together with additional objects and advantages thereof, will be best understood from the following description of some example embodiments when read in connection with the accompanying drawings.

BRIEF DESCRIPTION OF FIGURES

The embodiments of the invention are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings.

Figure 1 schematically illustrates some components of an audio analysis arrangement according to an example embodiment.

Figure 2 schematically illustrates basic components of a mel-frequency cepstral coefficient (MFCC) analysis.

Figure 3 illustrates an example of framing and windowing within the MFCC analysis.

Figure 4 illustrates an example of an FFT-domain signal.

Figure 5 illustrates an example of weights applicable for mel-scaling within the MFCC analysis.

Figure 6 illustrates an example of logarithmic domain mel-frequency coefficients in context of the MFCC analysis.

Figure 7 schematically illustrates some components of a music analysis system serving as an exemplifying framework for application of embodiments of the present invention.

Figure 8 schematically illustrates some elements of a framework suitable for application of the exemplifying music analysis system (e.g. that of Figure 7).

Figure 9 schematically illustrates an exemplifying apparatus according to an example embodiment.

Figure 10 illustrates an exemplifying method according to an example embodiment.

DESCRIPTION OF SOME EMBODIMENTS

Embodiments described in the following relate to techniques for audio analysis in order to characterize the extent of danceability or club-likeness of a piece of music or part thereof. The concept of danceability or club-likeness, as applied herein, is related to the concept of beat strength, which may be loosely defined as a rhythmic characteristic that allows discriminating between pieces of music (or parts/segments thereof) having the same tempo. Briefly, a piece of music characterized by a higher beat strength can be assumed to exhibit perceptually stronger and more pronounced beats than another piece of music characterized by a lower beat strength. An indicator descriptive of the extent of danceability or club-likeness is herein referred to as a club score. As in the case of beat strength, a piece of music characterized by a high club score exhibits perceptually stronger and more pronounced beats in comparison to another piece of music characterized by a low club score. Moreover, a piece of music characterized by a high club score exhibits a relatively high and relatively constant tempo in comparison to the tempo of another piece of music characterized by a low club score. In other words, in more colloquial terms, the club score reflects the characteristics that may be perceived by a human observer as danceability of the piece of music.

The club score, or another measure of danceability or club-likeness, may be applied e.g. as a piece of information to be presented to a user as an indicator that characterizes a piece of music, e.g. a song available for listening and/or purchasing. As another example, the club score may be provided as input for an automated tool for mixing music, e.g. to enable identifying pieces of music exhibiting a desired (e.g. sufficiently high) extent of danceability or club-likeness. As a further example, the club score may be provided as input for an automated tool for generating a visual presentation to accompany the respective piece of music, where the club score is applied as a control parameter that at least partly affects the choice of a change/switch pattern that defines the temporal positions and/or frequency of changes of video source or image in the visual presentation. More detailed examples regarding application of the club score will be described hereinafter.

Figure 1 schematically illustrates some components of an exemplifying audio analysis arrangement 100 that may be applied in determination of the club score. In the audio analysis arrangement 100 an input audio signal x(n) is provided to a beat tracker 110 for periodicity analysis. The input audio signal x(n) represents the piece of music to be analyzed by the audio analysis arrangement 100. The input audio signal x(n) is preferably provided in an uncompressed format, e.g. as a Pulse Code Modulation (PCM) signal in 16-bit sample resolution at a sampling rate of 44.1 kHz. In case the input audio signal x(n) is provided in a compressed format, a respective decoder is applied to convert the input audio signal x(n) into an uncompressed format. In case the sample resolution and/or the sampling rate of the input audio signal x(n) is different from the one suitable for input to the beat tracker 110 and/or other components of the audio analysis arrangement 100, a conversion to the appropriate sample resolution and/or sampling rate is applied before providing the input audio signal x(n) for further processing.

The beat tracker 110 is configured to carry out a beat tracking (BT) analysis to extract a set of BT features on basis of the input audio signal x(n). As an example, the BT analysis may involve deriving one or more accent signals on basis of the input audio signal x(n) for detection of events and/or changes in a piece of music represented by the input audio signal x(n). The BT analysis may further comprise a tempo (or BPM) estimation for the piece of music represented by the input audio signal x(n). The tempo estimation comprises a periodicity analysis for extraction of a sequence of periodicity vectors on basis of the accent signal(s) for use in the tempo estimation. The BT analysis is typically carried out on a frame by frame basis. The frame duration may be fixed or it may vary from frame to frame. Typically, the time frame duration is in the range from 10 seconds to one minute, e.g. 30 seconds.

As an example, the BT analysis may be carried out according to the beat tracking technique described in detail in [1]. As a concise overview, this technique comprises generating three beat time sequences from the input audio signal x(n), specifically from accent signals derived from the input audio signal x(n). A selection stage then identifies which of the three beat time sequences is a best match or fit to one of the accent signals, this sequence being considered the most useful and accurate representation of the beats in the input audio signal x(n). The beat tracking technique of [1] comprises calculating a first accent signal a1 based on fundamental frequency (F0) salience estimation. This accent signal a1, which is a chroma accent signal, is extracted as described in [6]. The chroma accent signal a1 represents musical change as a function of time and, because it is extracted based on the F0 information, it emphasizes harmonic and pitch information in the signal. Note that, instead of calculating a chroma accent signal based on F0 salience estimation, alternative accent signal representations and calculation methods could be used. For example, the accent signals described in [8] or [9] could be utilized.

The beat tracking technique of [1] further comprises calculating a second accent signal a2 using the accent signal analysis method described in [7]. The second accent signal a2 is based on a computationally efficient multi-rate filter bank decomposition of the input audio signal x(n). Compared to the F0 salience based first accent signal a1, the second accent signal a2 is generated in such a way that it relates more to the percussive and/or low frequency content in the input audio signal x(n) and does not emphasize harmonic information. Specifically, the accent signal representing the lowest frequency band of the multi-rate decomposed signal may be selected as the second accent signal, as described in [7], so that the second accent signal a2 emphasizes bass drum hits and other low frequency events. The typical upper limit of this sub-band is 187.5 Hz or 200 Hz. This choice reflects the understanding that electronic dance music is often characterized by a stable beat produced by the bass drum.

Further details regarding calculation of the first accent signal a1 and the second accent signal a2 are found in [1]. The beat tracking technique of [1] further comprises a tempo estimation that involves computing a sequence of periodicity vectors on basis of the first accent signal a1, where each periodicity vector represents a time frame of the input audio signal x(n). Each periodicity vector comprises a plurality of periodicity values, each periodicity value describing the strength of periodicity for a respective period length (i.e. lag). The lags considered extend over a range of interest, covering e.g. lags from 0.02 to 5 seconds at desired intervals. To obtain a single representative tempo for the piece of music represented by the input audio signal x(n), a point-wise median or a point-wise mean of the periodicity vectors over time may be calculated. The median periodicity vector may be normalized to remove a possible trend therein. Instead of making use of the periodicity vectors in full, a subrange of the periodicity vector may be selected as the final periodicity vector. The subrange may be taken as the range of bins corresponding to periods from 0.06 to 2.2 s, for example. Furthermore, the final periodicity vector may be normalized by removing the scalar mean and normalizing the scalar standard deviation to unity for each periodicity vector. Consequently, the tempo estimation is then performed based on the (possibly normalized) periodicity vectors by using k-Nearest Neighbour regression. Other tempo estimation methods could be used as well, such as methods based on finding the maximum periodicity value, possibly weighted by the prior distribution of various tempi. Further details regarding determination of the sequence of periodicity vectors and estimation of the tempo are found in [1].
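The summarization of the per-frame periodicity vectors described above (point-wise median over time, then zero-mean/unit-variance normalization) can be sketched as follows. This is a minimal sketch; the trend removal and subrange selection mentioned in the text are omitted for brevity.

```python
import numpy as np

def summary_periodicity(P):
    """Collapse a sequence of per-frame periodicity vectors (rows of P,
    one row per time frame) into one representative vector: point-wise
    median over time, then removal of the scalar mean and normalization
    of the scalar standard deviation to unity."""
    v = np.median(P, axis=0)           # point-wise median over frames
    v = v - v.mean()                   # remove scalar mean
    std = v.std()
    return v / std if std > 0 else v   # unit standard deviation
```

The resulting vector could then be fed to a k-NN regressor (as in [6]) or its maximum could be located, possibly after tempo-prior weighting.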

According to an embodiment, the beat tracker 110 is arranged to extract at least one or more of the following beat tracking features (BT features) on basis of the accent signal and the sequence of periodicity values for subsequent use in analysis for determining one or more audio attributes, e.g. the club score: - The average of an accent signal in a low(est) frequency band, e.g. the average of the accent signal a2 representing the lowest frequency band of the multi-rate decomposed input audio signal x(n).

- The standard deviation of the low frequency band accent signal, e.g. the standard deviation of the second accent signal a2.

- The maximum value of the median or mean of the periodicity vectors. A high value serves as an indication of a strong beat in the input audio signal x(n) whereas a low(er) value suggests a less strong beat.

- The sum of the values of the mean or median of the periodicity vectors. A high value typically serves as an indication of a strong beat in the input audio signal x(n) whereas a low(er) value suggests a less strong beat.

- A tempo indicator for indicating whether the tempo identified for the input audio signal x(n) is considered constant or essentially constant, e.g. such that 1 denotes constant or essentially constant tempo (or beat or BPM) whereas 0 denotes non-constant or ambiguous tempo.
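Given a low-band accent signal, per-frame periodicity vectors and a per-frame tempo estimate, the five BT features listed above can be collected as in the following sketch. The constant-tempo tolerance `tol` is an illustrative assumption; the text does not specify how "essentially constant" is decided.

```python
import numpy as np

def bt_features(a2, P, tempo_curve, tol=1.0):
    """Collect the BT features listed above: mean and standard deviation
    of the low-band accent signal a2, max and sum of the median of the
    periodicity vectors P (one row per frame), and a binary indicator of
    (essentially) constant tempo. `tol` is in BPM and is a hypothetical
    threshold, not a value from the text."""
    med = np.median(P, axis=0)
    return {
        "accent_mean": float(np.mean(a2)),
        "accent_std": float(np.std(a2)),
        "periodicity_max": float(np.max(med)),
        "periodicity_sum": float(np.sum(med)),
        "constant_tempo": int(np.ptp(tempo_curve) <= tol),
    }
```

Such a feature dictionary would then feed the club-score derivation together with the FP, DF and energy features.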

Instead of applying the technique described in [1], the beat tracker 110 may be provided as any beat tracker known in the art that is capable of providing at least the features described above. As an example, the beat tracker 110 may be provided as the beat tracker described in [5], [6], [8], or [9]. If not all of the features described above are available from the alternative beat tracker, only the subset of available features may be used.

The input audio signal x(n) is further provided to a mel-frequency cepstral coefficient (MFCC) analyzer 120 for computation of the signal energy and MFCCs. The MFCC analyzer 120 is configured to carry out the MFCC analysis to extract the MFCCs - and hence the signal energies E_mel(i) in mel-frequency bands - on basis of the input audio signal x(n). The MFCCs are commonly applied in speech and music analysis and details of the MFCC analysis are hence known in the art. An overview of MFCC analysis suitable in context of the MFCC analyzer 120 is provided in the following with reference to Figure 2, which schematically illustrates basic components of an MFCC analysis. The input audio signal x(n) is provided for pre-emphasis processing (201) to derive the pre-emphasized audio signal x_pre(n). The pre-emphasis processing may involve applying a first-order finite impulse response (FIR) filter having the transfer function 1 - 0.98z^-1. Such a filter serves to flatten the spectrum of the input audio signal x(n) to account for the fact that natural audio signals tend to have relatively high energy content at low frequencies. This filter may also be considered to model the lower sensitivity of the human ear at low frequencies. The input audio signal x(n) may be optionally downsampled to a lower sampling frequency, e.g. from 44.1 kHz to 22.05 kHz, in order to reduce the computational load of the MFCC analyzer 120. The pre-emphasized audio signal x_pre(n) is subjected to framing (202) and windowing (203). The framing involves segmenting the pre-emphasized audio signal x_pre(n) into a sequence of frames of desired temporal length (i.e. desired frame duration). Temporally successive frames may exhibit temporal overlap, for example an overlap of 25 % of the frame duration. The frame duration may be a pre-selected duration, for example in the range 20 to 50 milliseconds (ms), e.g. 30 ms. The windowing, e.g. using a Hamming window (or another suitable window known in the art), is applied to reduce or avoid framing artifacts at or near frame boundaries. A frame of segmented and windowed audio signal may be denoted as x(t) with the index t indicating the temporal position of the frame. The framing and windowing are illustrated by an example in Figure 3.
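The pre-emphasis, framing and windowing steps (201)-(203) described above can be sketched as follows. The frame length of 660 samples (approximately 30 ms at 22.05 kHz) and the 25 % overlap follow the example values in the text; the exact defaults are otherwise illustrative.

```python
import numpy as np

def preprocess(x, frame_len=660, overlap=0.25):
    """Pre-emphasis with the FIR filter 1 - 0.98 z^-1, then framing
    with the given overlap and Hamming windowing. Returns an array of
    shape (n_frames, frame_len)."""
    # y[n] = x[n] - 0.98 * x[n-1]; keep the first sample unchanged.
    x_pre = np.append(x[0], x[1:] - 0.98 * x[:-1])
    hop = int(frame_len * (1 - overlap))
    n_frames = 1 + (len(x_pre) - frame_len) // hop
    win = np.hamming(frame_len)
    return np.stack([x_pre[i * hop:i * hop + frame_len] * win
                     for i in range(n_frames)])
```

Each returned row corresponds to one windowed frame x(t) ready for the FFT stage (204).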

The segmented and windowed frames of the input audio signal x(t) are subjected to the Fast Fourier Transform (FFT) (204) in order to derive respective FFT-domain frames X(t). In the FFT domain the frequency resolution is equal throughout the spectrum. Figure 4 illustrates an example of an FFT-domain signal having 1024 FFT (magnitude) bins. Since the frequency resolution of human perception is relatively inaccurate at high frequencies, the FFT-domain frames X(t) typically provide unnecessarily accurate resolution at high frequencies, resulting in increased computational complexity with only marginal gain in performance. Therefore, mel-scaling (205) is applied to the FFT-domain frames X(t) to model the non-linear frequency resolution of human perception. In this regard, the mel-scaling may involve subjecting the FFT-domain frames X(t) to a filter bank having equal bandwidth on the mel-frequency scale. As a result, the mel-band magnitudes X_mel(t, i) representing mel-bands (or mel-channels) i are obtained, each indicating a weighted sum of the FFT bins in the respective mel-band i. For example, 40 mel-bands (or mel-channels) may be employed. Figure 5 provides an example of the scaling/weighting applied by 40 mel-filters that may be applied in context of the above-mentioned filter bank. In order to model the nearly logarithmic perception of intensities by the human ear, the mel-band magnitudes X_mel(t, i) are further subjected to a logarithm operation (206) to derive mel-band magnitudes in the logarithmic domain, e.g. as X_log(t, i) = log(X_mel(t, i)). Figure 6 provides an exemplifying illustration of 40 log magnitude values of the mel-band magnitudes X_log(t, i) (representing the logarithmic domain mel-band magnitudes of the FFT bins illustrated in Figure 4 after mel-scaling using the scaling/weighting illustrated in Figure 5).
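The FFT, mel-scaling and logarithm steps (204)-(206) can be sketched as follows. The triangular filter shape is a common construction and an assumption here; the text only requires filters of equal bandwidth on the mel scale. FFT size and sampling rate are illustrative.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels=40, n_fft=2048, sr=22050):
    """Triangular filters with equal bandwidth on the mel scale, mapping
    the |FFT| bins of an rfft to n_mels mel-band magnitudes."""
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    bins = np.floor((n_fft + 1) * edges / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fb[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising slope
        fb[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling slope
    return fb

def log_mel(frame, fb):
    """X_log(t, i) = log(X_mel(t, i)) for one windowed frame; a small
    constant guards against log of zero."""
    X = np.abs(np.fft.rfft(frame, n=2 * (fb.shape[1] - 1)))
    return np.log(fb @ X + 1e-10)
```

The 40-element vectors returned by `log_mel` correspond to one frame of the logarithmic domain mel-band magnitudes used by the later FPA and MFCC stages.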
In some embodiments, the logarithmic domain mel-band magnitudes X_log(t, i) are further subjected to the Discrete Cosine Transform (DCT) (207) to compute the MFCCs C_mfcc(t, c) for the frame t. Here, c is the index of the cepstral coefficient, with c = 0 corresponding to the logarithmic frame energy. For example, 20 cepstral coefficients may be calculated. The MFCCs can in some embodiments be used for further classification of the piece of music represented by the input audio signal x(n), such as genre classification or audio categorization, but they are typically not required for the club score calculation.
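The DCT step (207) can be sketched with a plain DCT-II matrix (unnormalized, so that c = 0 is simply the sum of the log mel-band magnitudes, which relates to the logarithmic frame energy); normalization conventions vary and are an assumption here.

```python
import numpy as np

def mfcc(log_mel_frame, n_coeff=20):
    """DCT-II of one frame of logarithmic mel-band magnitudes, keeping
    the first n_coeff cepstral coefficients C_mfcc(t, c)."""
    n = len(log_mel_frame)
    k = np.arange(n_coeff)[:, None]
    # DCT-II basis: cos(pi * k * (2n + 1) / (2N))
    basis = np.cos(np.pi * k * (2 * np.arange(n) + 1) / (2 * n))
    return basis @ log_mel_frame
```

For a constant input vector, all coefficients vanish except c = 0, consistent with c = 0 carrying the (log) energy.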

The MFCC analyzer 120 is further arranged to calculate the energy of the input audio signal x(n). The energy may be calculated for each frame t to provide frame energies E(t). In particular, the energy calculation may involve calculating the energy on basis of the zeroth cepstral coefficient C_mfcc(t, 0).

The output from the MFCC analyzer 120 for frame t hence comprises at least the average frame energy E(t) and the logarithmic domain mel-band magnitudes X_log(t, i). In some embodiments, also the MFCCs C_mfcc(t, c) and optionally their first and/or second order time derivatives may be provided as further output from the MFCC analyzer 120. The average frame energy E(t) and the logarithmic domain mel-band magnitudes X_log(t, i), possibly together with other energy related parameters, may be jointly referred to as energy features.

The audio analysis arrangement 100 further comprises a fluctuation pattern analyzer 130. The fluctuation pattern (FP) analyzer 130 is configured to perform a fluctuation pattern analysis (FPA) on basis of the logarithmic domain mel-band magnitudes X_log(t, i) provided by the MFCC analyzer 120 in order to extract a set of FP features, which may also be referred to as a second set of features. An exemplifying overview of the FPA is available in [4], section 2.2.4 (on pages 36 to 40). Note, however, that [4] describes usage of the FPA for comparison of two pieces of music. In the following, the main steps of the FPA suitable in context of the FP analyzer 130 are described.

1. The sequence of logarithmic domain mel-band magnitude frames X_log(t, i) is arranged into segments of desired temporal duration.

For each segment the following steps are taken:

a. The logarithmic-domain mel-band magnitudes of each frame X_log(t, i) are arranged into a smaller number of frequency bands, e.g. the number of frequency bands may be reduced from 40 to 12. This may involve keeping the lowest mel-bands intact, while for the higher mel-bands two or several mel-bands are combined into a single band such that the frequency resolution at higher frequency bands is reduced in order to reduce the computational complexity without significantly affecting the accuracy of the FPA. Consequently, frequency bands represented by coefficients C_FP(t, k), k = 1, 2, ..., 12 are obtained.

b. For each frequency band in the reduced frequency resolution representation, the FFT is applied over the coefficients of the respective frequency band across frames of the respective segment (e.g. over coefficients C_FP(t_0 : t_e, k), where t_0 and t_e indicate the first and last frames of the segment, respectively) to compute amplitude modulation frequencies of the loudness in a desired range, e.g. in the range of 1-10 Hz. For example, 30 modulation bands may be used, denoted by band indices b = 1, 2, ..., 30.

c. The amplitude modulation frequencies are weighted using a model of perceived fluctuation strength, e.g. according to the curve illustrated in Figure 2.15 of [4].

d. Some filters are applied to emphasize certain types of patterns. As described in [4], this may involve applying a difference filter and two smoothing filters filt1 and filt2. The effect of the fluctuation strength weighting and smoothing filters is illustrated in Figure 2.16 of [4]. Generally, smoothing filters may be advantageous for removing noise and/or irregularities which are not relevant for the analysis of periodic amplitude modulation in the band magnitudes. The final fluctuation pattern feature may be denoted as FP(k, b), where k = 1, 2, ..., 12 is the index of the frequency band (after combining several mel-frequency bands in step 2a) and b = 1, 2, ..., 30 is the modulation frequency band index. The frame indices t_0 : t_e are omitted in the further discussion to simplify the notation.

The segments of desired temporal duration (step 1 above) may be provided e.g. by grouping a predetermined number of consecutive logarithmic domain mel-band magnitude (output) frames X_log(t, i) from the MFCC analysis into 'superframes' for the FPA. As an example, the temporal length of the 'superframe' applied in the FPA may be a few seconds, e.g. approximately 3 seconds. As a particular example, 128 consecutive logarithmic domain mel-band magnitude frames X_log(t, i), each representing 30 ms of the input audio signal x(n) and employing 25 % overlap with the preceding and following frame in sequence, may be grouped into a 'superframe' for the FPA, resulting in 'superframes' of 2.88 seconds in duration. A more detailed description of the processing suitable for the steps from 2a to 2d above is found in [4], section 2.2.4.1 (pages 38 to 40). It should be noted that steps 2c and 2d are optional, and either or both may be included in the FPA for improved modeling accuracy and reliability at a cost of increased computational load.

The FPA results, for each segment (i.e. 'superframe'), in an FP, which is preferably provided as a matrix whose columns correspond to frequency bands (of the reduced frequency resolution) and whose rows correspond to modulation frequencies. In some embodiments, the matrices derived for segments of a piece of music may each be re-arranged into respective vectors, and the resulting vectors may be averaged to provide a single FP vector descriptive of the piece of music, e.g. descriptive of amplitude modulation of loudness per frequency band (according to the reduced frequency resolution).
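The FPA steps above (band reduction in step 2a and modulation-spectrum FFT in step 2b) can be sketched as follows. The particular band grouping, the use of plain FFT magnitudes, and the omission of the optional weighting and smoothing steps 2c-2d are simplifying assumptions:

```python
import numpy as np

def fluctuation_pattern(X_log, band_map, n_mod=30):
    """Sketch of the FPA core for one 'superframe' of log mel-band
    magnitudes X_log (n_frames x n_mel_bands). band_map lists the
    mel-band index groups forming the reduced bands (e.g. 12 groups).
    Each reduced band is FFT'd across frames to obtain modulation
    amplitudes. Returns FP of shape (len(band_map), n_mod), indexed
    as FP[k, b] with k the frequency band and b the modulation band."""
    reduced = np.stack([X_log[:, idx].sum(axis=1) for idx in band_map], axis=1)
    spec = np.abs(np.fft.rfft(reduced, axis=0))  # modulation spectrum per band
    return spec[1:n_mod + 1, :].T                # drop DC, keep n_mod bins

# Hypothetical 128-frame superframe of 40 mel bands; keep the 8 lowest
# bands intact and merge the rest into 4 wider bands (8 + 4 = 12 bands).
band_map = [[i] for i in range(8)] + \
           [list(range(8 + 8 * j, 16 + 8 * j)) for j in range(4)]
X_log = np.random.default_rng(1).random((128, 40))
FP = fluctuation_pattern(X_log, band_map)
```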

The FPs indicated by the matrices (or vectors) for each segment, or the FP indicated by the averaged vector, may be employed for computation of the set of FP features for subsequent use in the club-score determination. The FP analyzer 130 is arranged to extract at least the following FP features:

- A FP bass feature, derived e.g. as a sum of the 'bass band' modulation energy indicated in the respective FP, where the 'bass band' may consist of e.g. the two lowest frequency bands of FP(k, b) (bands k = 1 and k = 2).

- A FP gravity feature, derived e.g. as the center of gravity of the respective FP on the modulation frequency axis. A low center of gravity indicates that the respective segment of the input audio signal x(n) is likely to be perceived as "slow" (not only having a low tempo, but e.g. vibrato/tremolo is also likely to contribute to such a perception). The FP gravity may be calculated as

FP_gravity = ( Σ_b Σ_k b · FP(k, b) ) / ( Σ_b Σ_k FP(k, b) ),

where k is the index of the frequency band and b the index of the modulation frequency.

- A FP maximum feature, derived e.g. as the maximum value in the respective FP, e.g. as FP_max = max_b ( max_k FP(k, b) ). Segments dominated by a strong beat typically have a high maximum value of the FP.

- A FP focus feature, derived e.g. as the energy distribution in the respective FP. For derivation of the FP focus feature, FP_max may be thresholded in accordance with a predefined minimum value, e.g. such that if FP_max < 2e-16, it is set equal to 2e-16. Note that the exact value may be changed from the one used as an example herein. The threshold may be applied to prevent problems if the FP(k, b) values are very small. FP focus may be calculated as

FP_focus = ( Σ_b Σ_k FP(k, b) ) / ( K · B · FP_max ),

where K = 12 is the number of frequency bands and B = 30 is the number of modulation frequency bands.

- A FP sum feature, derived e.g. as the sum of the values in the respective FP, e.g. as FP_sum = Σ_b Σ_k FP(k, b). The FP sum may be derived in addition to or instead of the FP maximum described above.

- A FP aggressiveness feature, which can be calculated as

FP_aggressiveness = ( Σ_k Σ_b FP(k, b) ) / FP_max,

where the summation runs across frequency bands k = 2, 3, ..., 12 and across modulation frequencies b = 1, 2, ..., 4, which correspond to modulation frequencies below 1 Hz.

- A FP LF domination feature, descriptive of low-frequency domination in the respective FP, derived e.g. as the ratio between the sum of the values in the four highest frequency bands and the sum of the values in the three lowest frequency bands in the respective FP.
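Assuming an FP matrix indexed as FP[k, b], the feature formulas listed above can be sketched as follows. The band ranges for the aggressiveness and LF-domination features follow the examples in the text, and the dictionary keys are hypothetical names:

```python
import numpy as np

def fp_features(FP):
    """Sketch of the FP feature set from an FP matrix with rows
    k = 0..11 (frequency bands, row 0 = lowest) and columns
    b = 0..29 (modulation frequency bands)."""
    K, B = FP.shape
    total = FP.sum()
    b_idx = np.arange(1, B + 1)                 # modulation band numbers 1..30
    fp_bass = FP[:2, :].sum()                   # two lowest frequency bands
    fp_gravity = (FP * b_idx[np.newaxis, :]).sum() / total
    fp_max = max(FP.max(), 2e-16)               # thresholded maximum
    fp_focus = total / (K * B * fp_max)         # energy distribution
    # Aggressiveness: bands k = 2..12, modulation bands b = 1..4 (1-based).
    fp_aggr = FP[1:, :4].sum() / fp_max
    # LF domination: four highest vs. three lowest frequency bands.
    fp_lf = FP[-4:, :].sum() / FP[:3, :].sum()
    return dict(bass=fp_bass, gravity=fp_gravity, max=fp_max,
                focus=fp_focus, sum=total, aggressiveness=fp_aggr,
                lf_domination=fp_lf)

feats = fp_features(np.random.default_rng(2).random((12, 30)))
```

Note that FP_focus is bounded by 1, since every element of the FP is at most FP_max.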

The audio analysis arrangement 100 further comprises a detrended fluctuation (DF) analyzer 140. The DF analyzer 140 is configured to carry out a detrended fluctuation analysis (DFA) on basis of the frame energies E(t) provided by the MFCC analyzer 120 in order to extract a DF feature, which may also be referred to as a DF exponent. An exemplifying overview of the DFA is available in [3]. The DFA finds its origin in fractal analysis and it has the ability to indicate correlations across different time scales, e.g. long-term correlations, in non-stationary time series. In the following, main steps of the DFA suitable in the context of the DF analyzer 140 are described.

1. First the energy is integrated: y(m) = Σ_{t ≤ m} exp(E(t)). Since the energy E(t) is logarithmic, the exponent of the energy E(t) is taken. Alternatively, the standard deviation of the amplitude may be used as an input instead of the energy E(t), as suggested in [3], but using the logarithmic frame energy E(t) is advantageous since it is readily available from the MFCC analysis.

2. Next, D(1, τ) is calculated as

D(1, τ) = (1/N) Σ_{m=0}^{N−1} ( y(τ + m) − ỹ(m) )²,

as suggested in [3], where N is the number of summation terms. The linear trend ỹ(m) is obtained by low-pass filtering y(m) with a Finite Impulse Response (FIR) filter having the coefficients b_p = 1/τ, p = 0, ..., τ − 1.

3. The time scale in focus τ obtains 36 different values on a nonlinear range, such that the first value is τ_0 = 13, and the relation between two adjacent values is τ_i = round(1.1 · τ_{i−1}), where round(x) denotes rounding of the argument x to the nearest integer. With a frame length of 1024 samples and a 44100 Hz sampling rate, the time scale obtains values between roughly 300 ms (τ_0 = 13) and 8.6 seconds (τ_35 = 372).

4. The detrended fluctuation DFA(τ) is then obtained as described in [3].

5. Next the DFA exponent α is calculated as

α(i) = log10( DFA(τ_{i+1}) / DFA(τ_i) ) / log10( (τ_{i+1} + 3) / (τ_i + 3) )

for i = 0, ..., 34.

Finally, to obtain a measure of the danceability for the segment of the input audio signal x(n) representing the music to be analyzed, the system returns

DF exponent = mean(α(i)).
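A compact sketch of the DFA steps 1-5 above, under the reconstruction of D(1, τ) as a mean squared deviation from a moving-average trend; the input energies are hypothetical random values:

```python
import numpy as np

def dfa_exponent(E, tau0=13, n_scales=36):
    """Sketch of the DFA: integrate exp(E(t)), detrend with a
    length-tau moving-average FIR filter, and average the slopes
    between successive time scales to obtain the DF exponent.
    E: logarithmic frame energies E(t)."""
    y = np.cumsum(np.exp(np.asarray(E, dtype=float)))
    taus = [tau0]
    while len(taus) < n_scales:                   # tau_i = round(1.1 * tau_{i-1})
        taus.append(int(round(1.1 * taus[-1])))
    dfa = []
    for tau in taus:
        trend = np.convolve(y, np.ones(tau) / tau, mode='valid')  # trend y~(m)
        n = len(y) - tau                          # number of summation terms
        dfa.append(np.sqrt(np.mean((y[tau:] - trend[:n]) ** 2)))
    dfa, taus = np.array(dfa), np.array(taus, dtype=float)
    alpha = (np.log10(dfa[1:] / dfa[:-1])
             / np.log10((taus[1:] + 3.0) / (taus[:-1] + 3.0)))
    return float(np.mean(alpha))

# Hypothetical logarithmic frame energies for 1000 frames.
E = np.log(np.random.default_rng(3).random(1000) + 0.1)
exponent = dfa_exponent(E)
```

The input must contain more frames than the largest time scale (here τ_35 ≈ 372), otherwise the detrended variance at the longest scales cannot be computed.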

The audio analysis arrangement 100 further comprises an audio attribute determiner 150 for determining the club score for the piece of music or part thereof represented by the input audio signal x(n). Additionally, the audio attribute determiner 150 may be configured to determine or obtain further audio attributes. The audio attribute determiner 150 is configured to provide a club score value to indicate the outcome of the analysis process. In order to determine the club score indicator, the audio attribute determiner 150 is arranged to obtain (e.g. receive) the BT features from the beat tracker 110, the energy features from the MFCC analyzer 120, the FP features from the FP analyzer 130 and the DF exponent from the DF analyzer 140 pertaining to the segment of the input audio signal x(n) for which the club score is being determined.

The audio attribute determiner 150 may be configured to compute the club score as a product of a feature vector v and a transform vector W. The feature vector v includes one or more BT features received from the beat tracker 110, one or more energy features received from the MFCC analyzer 120, one or more FP features received from the FP analyzer 130 and one or more DF features received from the DF analyzer 140, while the transform vector W represents weighting factors (pre-)assigned to each of the features in the feature vector v. In other words, the audio attribute determiner 150 may be configured to compute the club score as P = v * W, where v is a 1 by N vector and W is a N by 1 vector.

The features included in the feature vector v may comprise any combination of the exemplifying BT features, energy features, FP features and DF features described hereinbefore. Preferably, the feature vector v comprises at least one BT feature, at least one energy feature, at least one FP feature and at least one DF feature in order to take into account different aspects descriptive of characteristics of the input audio signal x(n) in the club score determination. In general, to some degree a larger set of features included in the feature vector v results in more reliable and accurate determination of the club score. However, if the dimensionality of the feature vector becomes too large, the accuracy of the determination will suffer. Moreover, irrelevant or non-informative features should not be included in the feature vector. As a non-limiting example, the feature vector may comprise the following features (described in more detail hereinbefore):

- the average of the accent signal in a low(est) frequency band;

- the standard deviation of said accent signal;

- the maximum value of the median or mean of the periodicity vectors;

- the sum of the values of the mean or median of the periodicity vectors;

- the tempo indicator for indicating whether the tempo identified for the input audio signal x(n) is considered constant or essentially constant (or non-constant/ambiguous);

- the FP bass feature;

- the FP gravity feature;

- the FP focus feature;

- the FP maximum feature;

- the FP aggressiveness feature;

- the FP LF domination feature;

- the DF exponent at least for one predetermined time scale;

- the average frame energy.

Instead of directly relying on the feature vector v, the audio attribute determiner 150 may be configured to normalize the features of the feature vector v prior to multiplication by the transform vector W. The normalization may comprise subtracting a respective predetermined mean value from each of the features to derive the normalized feature vector v_norm, e.g. as v_norm = v − m_norm, where the elements of the 1 by N mean vector m_norm are the mean values of the respective features of the feature vector v. Additionally, the normalization may comprise scaling the features with respective predetermined normalization factors, e.g. as v_norm = v * F_norm, where the diagonal elements of the matrix F_norm are the normalization factors to be applied for the respective features of the feature vector v, or as v_norm = (v − m_norm) * F_norm. The normalization factors of the matrix F_norm are, preferably, selected such that multiplication of a (possibly mean-removed) feature with the respective normalization factor results in feature values exhibiting or approximating a standard deviation at or close to unity. Consequently, regardless of the exact manner of deriving the normalized feature vector v_norm, the club score is computed as P = v_norm * W.
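The normalization and scoring chain above can be sketched as follows; all numeric values (features, weights, means, scale factors) are hypothetical placeholders:

```python
import numpy as np

def club_score(v, W, m_norm=None, F_norm=None, offset=0.0, scale=1.0):
    """Sketch of P = v_norm * W as described in the text.
    v: length-N feature vector; W: length-N weight vector;
    m_norm: per-feature means to subtract; F_norm: diagonal matrix of
    per-feature scale factors. offset/scale map P onto the desired
    final range (e.g. 1..5); their defaults here are placeholders."""
    v = np.asarray(v, dtype=float)
    if m_norm is not None:
        v = v - m_norm                  # mean removal
    if F_norm is not None:
        v = v @ F_norm                  # per-feature scaling
    P = float(v @ W)                    # inner product with transform vector
    return (P - offset) * scale

# Hypothetical 4-feature vector (one BT, one energy, one FP, one DF feature).
v = np.array([0.8, 2.5, 1.2, 0.6])
W = np.array([0.5, 0.1, 0.3, 0.4])
m_norm = np.array([0.5, 2.0, 1.0, 0.5])
F_norm = np.diag([2.0, 1.0, 0.5, 4.0])
P = club_score(v, W, m_norm, F_norm)   # → 0.54 for these placeholder values
```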

The club score P, which is a scalar value, may be further subjected to a further scaling (or normalization) in order to guarantee that the club score P lies within a desired predetermined scale. The scaling may involve subtraction of a predefined mean value and/or multiplication (or division). As an example, the desired scale may be from 1 to 5, with a higher club score P indicating a higher degree/extent of danceability or club-likeness.

The club score P is derived separately for a number of segments of the input audio signal x(n). Hence, a piece of music may have a sequence of club scores P assigned thereto. The sequence of club scores P may be applied to derive a single club score P that is representative of the danceability or club-likeness of the piece of music as a whole. Such a single club score P may be derived e.g. as a mean or median of the sequence of club scores P assigned to the piece of music. Hence, the club score P to be provided for further use (e.g. by the switching pattern selector 160 as will be described hereinafter) may be provided e.g. as a vector including the sequence of club scores P assigned to a certain piece of music, thereby segment-wise characterizing the certain piece of music. Alternatively, the club score P may be provided from the audio attribute determiner 150 as the single club score P characterizing the certain piece of music in its entirety.

Instead of analyzing a single input audio signal x(n) in order to derive the club score for a certain piece of music or part thereof, a number of input audio signals x(n) representing the certain piece of music or part thereof may be subjected to analysis by the audio analysis arrangement 100, and an average or median of the resulting club scores P (or another statistical value derived on basis of the resulting club scores P) may be assigned as the final club score(s) P characterizing the respective piece of music. In other words, the club score P may be derived on basis of a number of input audio signals x(n) that overlap in time. This may be advantageous, for example, in situations where several audio capturing devices have been capturing the same audio event. In this case, an overall club score estimate based on a plurality of captured audio signals from the same situation may be considered more reliable than an estimate based on a single captured audio signal only.

The transform vector W to be used in determination of the club score P may be derived on basis of experimental data. Such derivation may involve using a relatively large set of test items, preferably comprising tens, hundreds or even thousands of pieces of music. Each test item (e.g. each piece of music or part thereof) has a club score pre-assigned thereto. The test items, preferably, comprise test items exhibiting club scores extending over the whole range of possible club score values. The pre-assignment of club scores may have been performed manually by a single user/listener, or the pre-assigned club scores may be derived as an average of the club scores given (manually) by a plurality of users. The derivation further comprises extracting the feature vector v_test(i) for each of the test items and computing the parameters of interest from the feature vectors v_test(i) in view of the respective pre-assigned club scores. These parameters of interest comprise the transform vector W to be used in determination of the club score P. The transform vector W may be obtained by applying a suitable analysis technique to the feature vectors v_test(i) extracted on basis of the test items, such as Linear Discriminant Analysis (LDA) known in the art.

Other suitable analysis techniques include support vector machines (SVM), neural networks such as multilayer perceptrons (MLP), Bayesian classifiers using different parametric density models such as single Gaussians, Gaussian mixture models or hidden Markov models, decision trees, networks of binary classifiers, random forests, k-nearest neighbors, or learning vector quantization. Furthermore, instead of using a classification model, a regression model can be used as well. Note that when a different classifier is used, the calculation of the club score depends on the classifier used. For example, in the case of the nearest neighbor classifier there is no longer use for the transform vector W with which the feature vector v is multiplied, but the classification comprises calculating distances from the feature vector v to the feature vectors v_test(i) derived for the test items, and predicting the club score based on the club scores of the k nearest feature vectors of the test items v_test(i). Instead of applying the full set of feature vectors v_test(i) derived for the test items in determination of the club score, the feature vectors v_test(i) may be employed to determine a codebook of a desired number of codevectors (e.g. by using k-means clustering of the feature vectors v_test(i)), each codevector having a club score assigned thereto. Consequently, the determination of the club score may involve determining the club score as the club score assigned to the codevector closest to the feature vector v. In case normalization of the feature vector v is to be applied in the audio attribute determiner 150, normalization parameters, for example the mean values of the vector m_norm and/or the normalization factors of the matrix F_norm, may also be derived on basis of the feature vectors v_test(i).
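The k-nearest-neighbour alternative mentioned above can be sketched as follows; the test items, their scores, and the choice of Euclidean distance with mean aggregation are illustrative assumptions:

```python
import numpy as np

def knn_club_score(v, v_test, scores, k=3):
    """Sketch of the k-NN variant: predict the club score of feature
    vector v as the mean pre-assigned score of the k test-item
    feature vectors closest in Euclidean distance.
    v_test: M-by-N matrix of test-item feature vectors;
    scores: the M pre-assigned club scores."""
    d = np.linalg.norm(np.asarray(v_test) - np.asarray(v, dtype=float), axis=1)
    nearest = np.argsort(d)[:k]                 # indices of the k closest items
    return float(np.mean(np.asarray(scores)[nearest]))

# Hypothetical test items in a 2-D feature space with scores on a 1..5 scale.
v_test = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [5.1, 5.0]])
scores = [1.0, 1.0, 2.0, 5.0, 5.0]
```

A query near the first cluster averages the three low scores; a query near the second returns 5.0.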
For this purpose, the parameters derived from the feature vectors v_test(i) may further comprise the mean of the feature vectors m_test = avg(v_test(i)) in order to compute the mean value for each of the features for derivation of the mean vector m_norm for the normalization process. The mean vector for normalization in the audio attribute determiner 150 may be defined e.g. directly as the mean of the feature vectors, i.e. m_norm = m_test. As another example, the parameters derived from the feature vectors v_test(i) may comprise the standard deviation σ_test of the mean-removed feature vectors v_test(i) − m_test for derivation of the normalization factors, defined e.g. via a vector f_norm = 1 / stdev(v_test(i) − m_test) whose values are applied as the diagonal elements of the matrix F_norm (with the other elements of F_norm set to zeros).

As briefly pointed out hereinbefore, to name a few use scenarios for the club score, the club score may be applied e.g. as a piece of information to be presented to a user as an indicator that characterizes a piece of music, as input for an automated tool for mixing music, e.g. to enable identifying pieces of music exhibiting a desired (e.g. high enough) extent of danceability or club-likeness, or as an input for an automated tool for generating a visual presentation to accompany the respective piece of music, where the club score is applied as a control parameter that at least partly affects the choice of a change/switch pattern that defines the temporal positions and/or frequency of changes of video source or image in the visual presentation. In the following, a more detailed description regarding the latter example usage of the club score is provided for illustration purposes.

Referring to Figure 7, a music analysis server 310 (hereafter "analysis server") is shown connected to a network 320, which can be any data network such as a Local Area Network (LAN), Wide Area Network (WAN) or the Internet. The analysis server 310 is configured to analyze audio associated with received video clips in order to perform automated video editing. In this regard, the analysis server 310 may be configured to implement the audio analysis arrangement 100 described hereinbefore and to apply the audio associated with the received video clips as the input audio signal x(n). External terminals 330, 332, 334 may communicate with the analysis server 310 via the network 320 in order to upload video clips having an associated audio track. The terminals 330, 332, 334 incorporate video camera and audio capture (i.e. microphone) hardware and software for the capturing, storing, uploading and downloading of video data over the network 320.

The terminal 330, 332, 334 may be a mobile telephone or smartphone, a personal digital assistant (PDA), a portable media player (PMP), a portable computer or any other device capable of running software applications and providing audio outputs. In some embodiments, the terminal 330, 332, 334 may engage in cellular communications using a wireless communications module. The wireless communications module may be configured to communicate via several protocols such as Global System for Mobile Communications (GSM), Code Division Multiple Access (CDMA), Universal Mobile Telecommunications System (UMTS), Bluetooth and IEEE 802.11 (Wi-Fi). A memory of the terminal 330, 332, 334 may store multimedia files such as music and video files, including the captured video clips and their associated audio referred to above. The memory may further store a software application, which, when executed by a processor of the terminal 330, 332, 334, is configured to cause uploading of captured video clips, including their associated audio track, to the analysis server 310.

The analysis server 310 is configured to receive video clips from the terminals 330, 332, 334 and to carry out the processing described hereinbefore in context of the audio analysis arrangement 100 for at least some of the associated audio tracks for the purpose of supporting an automatic video processing and editing procedure, for example to join video clips together at musically meaningful points. Instead of carrying out such audio analysis for a number of audio tracks separately, the analysis server 310 may be configured to perform the audio analysis for a common audio track which has been obtained by combining parts from the audio track of one or more video clips.

Referring to Figure 8, a practical example will now be described. Each of the terminals 330, 332, 334 is shown in use at an event which is a music concert represented by a stage area 340 and speakers 350. Each terminal 330, 332, 334 is assumed to be capturing the event using their respective video cameras; given the different positions of the terminals 330, 332, 334 the respective video clips will be different, but there will be a common or essentially common audio track providing they are all capturing over a common time period.

Users of the terminals 330, 332, 334 subsequently upload their video clips to the analysis server 310, either using their above-mentioned software application or from a computer with which the terminal synchronizes. At the same time, users are prompted to identify the event, either by entering a description of the event, or by selecting an already-registered event from a pull-down menu. Alternative identification methods may be envisaged, for example by using associated GPS data from the terminals 330, 332, 334 to identify the capture location. At the analysis server 310, received video clips from the terminals 330, 332, 334 are identified as being associated with a common event. The analysis server 310 may be configured to carry out a dedicated beat tracking to identify beats in the audio clip(s). Alternatively, BT features from the beat tracker 110 may provide identification of the beats (or the beat tracker 110 may be configured to provide a dedicated output for providing the beat identification). The identified beats are, subsequently, used as useful video angle switching points for automated video editing. A second software application, executable in the analysis server 310, is configured to control and perform the video processing, including processing the associated audio signal to perform the beat tracking (in case the beat identification is not available from the beat tracker 110).

Instead of receiving video clips from the terminals 330, 332, 334 and combining them into a composite video to accompany the input audio signal x(n) (or an audio signal generated on basis of the input audio signal x(n)), the analysis server 310 may be arranged to receive one or more still images associated with an audio clip from the terminals 330, 332, 334 and to compose a slideshow on basis of the received still images by considering the identified beats in the input audio signal x(n) as useful image switching or changing points for automated slideshow generation.
Moreover, visual information received at the analysis server 310 may be a mixture of video clips and still images, and the visual presentation to accompany the input audio signal x(n) (or an audio signal generated on basis of the input audio signal x(n)) may comprise both still images and video clips such that switching from one visual source to another is arranged to take place in view of the identified beats in the input audio signal x(n).

Hence, in general the switching from one source of visual information to another introduces a discontinuity in the composite visual presentation, and the discontinuity may be e.g. one of the following: a switch from one still image to another still image, a switch from a still image to a video clip, a switch from a video clip to another video clip, a switch from a video source to a still image. Moreover, further types of discontinuities may be introduced without switching from a still image or a video clip to another, e.g. a temporary (short-term) modification or distortion of the visual content. Such modification/distortion may comprise e.g. a "flash of light", temporary distortion of the colors of the image/video, cropping to a certain sub-portion of the image/video, zooming in to the image/video, etc.

The discontinuities of the visual content are defined to take place in predefined temporal locations with respect to beats and/or downbeats identified in the input audio signal x(n). As pointed out above, the beats/downbeats may be indicated in the beat identification information that may be received from the beat tracker 110. These temporal locations and/or their relation to the beats/downbeats identified in the input audio signal x(n) are defined by a switching pattern. Characteristics of a switching pattern together with a few examples will be described in the following.

The audio analysis arrangement 100 may further comprise a switching pattern selector 160 arranged to select a switching pattern from a plurality of predefined switching patterns. The switching pattern is selected in accordance with the one or more audio attributes that characterize the piece of music represented by (the corresponding segment of) the input audio signal x(n). In the following, the switching pattern selection based on the club score received from the audio attribute determiner 150 is described. However, basing the switching pattern selection solely on the club score is provided as a non-limiting example in favor of brevity and clarity of description, and hence using additional and/or different audio attributes instead is equally applicable within various embodiments of the present invention.

In general, the switching pattern selector 160 is arranged to select a switching pattern resulting in more frequent discontinuities (e.g. video angle switching and/or image changes) with increasing value of the club score P. This may be provided e.g. by selecting a first switching pattern that involves a high frequency of discontinuities in response to the club score P exceeding a first predetermined threshold value Th_1 (e.g. P > Th_1), whereas in case the club score P fails to exceed the first threshold value Th_1, a switching pattern that involves a lower frequency of discontinuities is selected. The range of values of the club score P below the first threshold value may be further divided into sub-ranges by predetermined threshold values Th_i, where the club score P exceeding a threshold value Th_i results in selecting the i:th switching pattern, with the switching pattern i involving a higher frequency of discontinuities than a switching pattern j < i.
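The threshold logic above can be sketched as follows; the threshold values and pattern names are hypothetical placeholders:

```python
def select_switching_pattern(P, thresholds, patterns):
    """Sketch of the threshold-based selection: thresholds Th_1 >
    Th_2 > ... divide the club-score range, and a higher club score P
    selects a pattern with more frequent discontinuities.
    patterns[0] is the densest pattern, chosen when P > Th_1;
    patterns[-1] is the fallback when P clears no threshold."""
    for th, pattern in zip(thresholds, patterns):
        if P > th:
            return pattern
    return patterns[-1]

# Hypothetical thresholds on the 1..5 club-score scale.
thresholds = [4.0, 3.0, 2.0]
patterns = ["every_beat", "every_2nd_beat", "every_4th_beat", "downbeats_only"]
```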

Selecting a switching pattern that involves a high frequency of discontinuities serves to create a subjectively appealing effect, as the visual characteristics of the edited video are likely to match the music style of the input audio signal x(n). For example, introducing a beat-to-beat video switching or image switching pattern typically yields a highly energetic composite video that is similar to manually created music video productions of professional type that aim to convey a highly energetic expression to the audience and/or that are typically applied e.g. as video edits of club scenes or Disk Jockey (DJ) performances.

A switching pattern may cause introducing the discontinuities on beats, between beats, or a mixture of discontinuities introduced on beats and between beats. In this regard, the beat tracker 110 may be configured to apply the techniques described in [1] and [2] to analyze beats and downbeats in the input audio signal x(n). Instead of employing the beat tracker 110 for beat and/or downbeat analysis, another entity or component of the audio analysis arrangement 100 may be employed for this purpose. Furthermore, the input audio signal x(n) may be analyzed for patterns, or groupings of musical measures into groups of two, e.g. as described in [10]. This analysis may be carried out by the beat tracker 110 or another entity/component of the audio analysis arrangement 100. In the technique described in [10], the different beats in a musical measure may have different probabilities for a visual change to happen on that beat.

More specifically, in the technique described in [10] the following probabilities are assigned to the beats in a pattern:

0.7 for beat 1

0.25 for beat 5

0.05 for beat 8

Consequently, on average, in such a switching pattern 70 % of the discontinuities are caused to take place on the first beat of an 8-beat pattern, 25 % of the discontinuities are caused to take place on the fifth beat of an 8-beat pattern (the downbeat of the second measure in a group of two measures), and 5 % of the discontinuities are caused to take place on the last beat of an 8-beat pattern.
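The probabilistic assignment above can be sketched as a simple draw from the stated distribution; the sampling approach itself is an illustrative assumption:

```python
import random

def draw_switch_beat(rng=random):
    """Sketch of the probabilistic pattern from [10] as quoted in the
    text: within an 8-beat pattern, place the discontinuity on beat 1
    with probability 0.7, on beat 5 with probability 0.25, and on
    beat 8 with probability 0.05."""
    r = rng.random()
    if r < 0.70:
        return 1
    if r < 0.95:        # 0.70 + 0.25
        return 5
    return 8

# Empirical check of the long-run proportions over many draws.
counts = {1: 0, 5: 0, 8: 0}
random.seed(4)
for _ in range(10000):
    counts[draw_switch_beat()] += 1
```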

At least some switching patterns of the plurality of switching patterns available for the switching pattern selector 160 describe on which of the beats in a musical pattern (e.g. a sequence of two measures, or 8 beats, in 4/4 time signature) a discontinuity in the visual content is introduced.

The switching pattern selector 160 may be configured to select or re-select a switching pattern from the plurality of switching patterns for each musical pattern (e.g. a sequence of two measures, or 8 beats). The selection or re-selection is affected by the club score P assigned to the respective piece of music in its entirety or by the club score, from a sequence of club scores P, assigned to the respective musical pattern of the respective piece of music.

As a non-limiting example, the plurality of switching patterns available for the switching pattern selector 160 may comprise at least the following switching patterns (a first set of switching patterns) that involve introducing multiple discontinuities in a musical pattern: AL L_B E AT_TAG : [0, 1 , 2, 3, 4, 5, 6, 7],

EVEN_BEAT_TAG: [0, 2, 4, 6],

ODD_BEAT_TAG: [1, 3, 5, 7],

FIRST_3_BEATS_TAG: [0, 1, 2],

FIRST_4_BEATS_TAG: [0, 1, 2, 3],

LAST_4_BEATS_TAG: [4, 5, 6, 7],

where the beats inside a musical pattern (e.g. a sequence of two measures, or 8 beats) are numbered from 0 to 7, and, for example, the switching pattern ALL_BEAT_TAG denotes that a discontinuity is to be introduced on all beats of a musical pattern and the switching pattern ODD_BEAT_TAG implies that a discontinuity is to be introduced on beats 1, 3, 5 and 7 of a musical pattern. In general, if the index of a beat is included in the switching pattern, a discontinuity is to be introduced on the respective beat.
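The index-list representation described above can be sketched as follows. The dictionary-based structure and function name are illustrative assumptions, not part of the application; the beat index lists themselves follow the first set of switching patterns listed above.

```python
# Each switching pattern is a list of beat indices (0..7 within an 8-beat
# musical pattern); a discontinuity is introduced on a beat exactly when
# its index appears in the pattern.
SWITCHING_PATTERNS = {
    "ALL_BEAT_TAG":      [0, 1, 2, 3, 4, 5, 6, 7],
    "EVEN_BEAT_TAG":     [0, 2, 4, 6],
    "ODD_BEAT_TAG":      [1, 3, 5, 7],
    "FIRST_3_BEATS_TAG": [0, 1, 2],
    "FIRST_4_BEATS_TAG": [0, 1, 2, 3],
    "LAST_4_BEATS_TAG":  [4, 5, 6, 7],
}

def has_discontinuity(pattern_name, beat_index):
    """True if the given pattern introduces a discontinuity on this beat."""
    return beat_index in SWITCHING_PATTERNS[pattern_name]
```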

In addition to the above-described first set of switching patterns, the plurality of switching patterns available for the switching pattern selector 160 may include a set of simple switching patterns (a second set of switching patterns) that involve introduction of a single discontinuity allocated to a certain predefined beat of a musical measure. Moreover, the switching pattern selector 160 may be configured to apply one of these simple switching patterns for a first percentage of time while one of the switching patterns from the first set is used for the remaining percentage of time. As an example, the simple switching patterns (the second set of switching patterns) and the respective probabilities within the first percentage of time may include

FIRST_BEAT_TAG: [0] (70 %)

FIFTH_BEAT_TAG: [4] (25 %)

EIGHTH_BEAT_TAG: [7] (5 %)

During the remaining percentage of time the switching pattern selector 160 may be configured to randomly select one of the switching patterns from the first set. The first percentage of time may be set on basis of the club score assigned to the respective portion of the input audio signal x(n). The first percentage may decrease with increasing value of the club score P, e.g. such that the first percentage is high (e.g. 90 %) when the club score P is low (e.g. less than 4, assuming a range from 1 to 5 for the club score) whereas the first percentage is low (e.g. 30 %) when the club score is high (e.g. 4 or higher).
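A minimal sketch of this mixed strategy follows, using the example values given above (simple pattern weights 70/25/5, first percentage 90 % below club score 4 and 30 % at 4 or higher). All function names and the concrete data layout are assumptions made for this sketch.

```python
import random

# Second set: single-switch patterns with their within-percentage weights.
SIMPLE_PATTERNS = [([0], 0.70), ([4], 0.25), ([7], 0.05)]
# First set: patterns with multiple discontinuities per musical pattern.
FIRST_SET = [[0, 1, 2, 3, 4, 5, 6, 7], [0, 2, 4, 6], [1, 3, 5, 7],
             [0, 1, 2], [0, 1, 2, 3], [4, 5, 6, 7]]

def first_percentage(club_score):
    """High (90 %) for club score below 4, low (30 %) for 4 or higher."""
    return 0.30 if club_score >= 4.0 else 0.90

def select_pattern(club_score, rng):
    # Use a simple pattern for the "first percentage" of the time,
    # otherwise draw uniformly from the first set.
    if rng.random() < first_percentage(club_score):
        patterns, weights = zip(*SIMPLE_PATTERNS)
        return rng.choices(patterns, weights=weights)[0]
    return rng.choice(FIRST_SET)
```

A low club score thus yields mostly sparse, single-switch patterns, while a high club score favors denser patterns from the first set.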

In another example, for sections of the video where the club score of the audio signal is above a predetermined threshold, e.g. above 4 on a scale from 1 to 5, the switching pattern selector 160 may be configured to randomly select a switching pattern from the first set of switching patterns. Otherwise, it may be configured to select a switching pattern from the second set of switching patterns.

In a further example, the switching pattern selector 160 is provided with switching pattern sequence tables, from which the switching pattern selector 160 selects switching patterns sequentially starting again from the beginning of the table after the end has been reached. The table may contain a sequence of indices referring to various switching patterns, and the sequential ordering of indices determines the order of applying the respective switching patterns. Different switching pattern sequence tables may be provided for different ranges of club score values, and the choice of the switching pattern sequence table is based on the momentary value of the club score.
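The sequence-table variant described above can be sketched as follows. The concrete tables, pattern list and score threshold are invented for illustration; only the cycling behavior (restart from the beginning after the end of the table) and the score-dependent table choice come from the description.

```python
import itertools

# Illustrative pattern list and two sequence tables of indices into it,
# one per club score range (values are made up for this sketch).
PATTERNS = [[0, 1, 2, 3, 4, 5, 6, 7], [0, 2, 4, 6], [0], [4]]
SEQUENCE_TABLES = {
    "low":  [2, 3, 2, 2],   # mostly single-switch patterns
    "high": [0, 1, 0, 0],   # dense patterns for high club scores
}

def table_for_score(club_score):
    return SEQUENCE_TABLES["high" if club_score >= 4.0 else "low"]

def pattern_sequence(club_score):
    """Yield switching patterns, cycling through the selected table."""
    for index in itertools.cycle(table_for_score(club_score)):
        yield PATTERNS[index]
```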

As yet another example, selection of the switching pattern for a certain musical pattern may be implemented in accordance with a Markov chain model. As known in the art, a Markov chain model involves determining a set of states and transition probabilities between the states. In this context, each state of the Markov chain model corresponds to a predefined one of the switching patterns available for the switching pattern selector 160. As an example, we may assume the following switching patterns (a third set of switching patterns) to be available:

A=[0, 1, 2, 3, 4, 5, 6, 7],

B=[0, 2, 4, 6],

C=[1, 3, 5, 7],

D=[0, 1, 2],

E=[0, 1, 2, 3],

F=[4, 5, 6, 7],

G=[0],

H=[4],

I=[7],

J=[].

Moreover, the applied Markov chain model may have initial probabilities, which indicate the probability to start in a certain state. Furthermore, the Markov chain model has transition probabilities P(i|j) that indicate the probability of making a transition from state j to state i. The transition probabilities P(i|j) may be static or they may be dynamic, i.e. variable in dependence on the club score P assigned to the respective segment of the input audio signal x(n).

As an example, in case of dynamic transition probabilities P(i|j), the transition probabilities to the state corresponding to the switching pattern A (or to any state corresponding to one of the switching patterns from A to F) may be set to high value(s) in comparison to transition probabilities to the state corresponding to the switching pattern J (or to any state corresponding to one of the switching patterns from G to J) in response to a high club score P assigned to the respective segment of the input audio signal x(n) (e.g. in response to the club score exceeding a predetermined threshold, e.g. a club score of 4 or higher in the scale from 1 to 5). In contrast, a low club score (e.g. one failing to reach another predetermined threshold, e.g. a club score of 2 or lower) may result in setting the transition probabilities to the state corresponding to the switching pattern A (or to any state corresponding to one of the switching patterns from A to F) to low value(s) in comparison to transition probabilities to the state corresponding to the switching pattern J (or to any state corresponding to one of the switching patterns from G to J).

As another example of dynamic setting of the transition probabilities P(i|j), the transition probabilities to the state corresponding to the switching pattern A (or to any state corresponding to one of the switching patterns from A to F) may be incremented by a value d, where d = (P - 0.01) / P_max, where P_max denotes the maximum value of the club score P according to the applied scale (e.g. P_max = 5). After adding this increment, the transition probabilities may be normalized such that the outgoing transition probabilities from each state add up to unity. Consider a simple example: assume three switching patterns denoted by indices i = 1, 2, 3 and denote the transition probability matrix by A, whose element A_ij indicates the probability of a transition from switching pattern i to switching pattern j. Assume state 3 is related to high club score values, and assume the transition probabilities in the initial state are defined by

    [0.5 0.3 0.2]
A = [0.4 0.4 0.2]
    [0.2 0.3 0.5]

Now, assuming the club score P has the value 5 on a range from 1 to 5, the probabilities corresponding to the high club score values (in this case, state 3) are incremented by d = (5 - 0.01) / 5 = 0.998. This yields the updated transition probability matrix

     [0.5 0.3 1.198]
A' = [0.4 0.4 1.198]
     [0.2 0.3 1.498]

However, the probabilities to transition from state i to any of the other states j, including i itself, should add up to one. Therefore, after normalization the transition probabilities become

     [0.2503 0.1502 0.5996]
A" = [0.2002 0.2002 0.5996]
     [0.1001 0.1502 0.7497]

Now, the probabilities summed across the rows add up to one. After the new switching pattern has been selected according to the transition probability matrix A", the transition probabilities can be returned to the original values A. When the next switching pattern is to be selected, the transition probability values may again be modified based on the club score value P.
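The worked example above can be reproduced by the following sketch, which increments the column of the high-club-score state by d = (P - 0.01) / P_max and renormalizes each row. The function name is an invented convenience; the numbers match the example matrices A and A".

```python
def update_transition_matrix(A, high_state, club_score, p_max=5.0):
    """Increment transitions into high_state by d and renormalize rows."""
    d = (club_score - 0.01) / p_max
    A_prime = [row[:] for row in A]
    for row in A_prime:
        row[high_state] += d
    # normalize each row so the outgoing probabilities sum to unity
    return [[x / sum(row) for x in row] for row in A_prime]

A = [[0.5, 0.3, 0.2],
     [0.4, 0.4, 0.2],
     [0.2, 0.3, 0.5]]
A2 = update_transition_matrix(A, high_state=2, club_score=5.0)
# Each row of A2 sums to 1; the first row is roughly
# [0.2503, 0.1502, 0.5996], matching the example above.
```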

The relationship between the switching patterns and the club score P may be based on experimental data. This may involve, for example, obtaining/learning different switching patterns and transition frequencies therebetween for setting the states, prior probabilities and transition probabilities P(i|j) of a Markov chain model on basis of annotated learning data. The data annotation in this regard may comprise temporal locations of scene transitions in visual material (e.g. video clips) employed as the learning data, characterization of the corresponding switching patterns and indications of the club score(s) P assigned for the respective portion of the input audio signal x(n). For example, different Markov chain models may be learned/derived for different predetermined sub-ranges of the club score P value. This may be provided e.g. by analyzing a set of professional music videos used as the learning data. The analysis comprises identifying the beats, downbeats, music patterns and respective club score values for the audio signals representing the pieces of music associated with the music videos employed as the learning data. The identification may be done using automatic analysis or manual annotation. Moreover, the shot boundaries in the video signal are either annotated manually or analyzed with shot boundary detection algorithms. Consequently, parameters for one or more Markov chain models are estimated from the automatically analyzed and/or manually annotated learning data. For example, the estimation may start by identifying one or more switching patterns in the learning data. Such identification may be carried out, for example, by quantizing the discontinuities in the found switching patterns to the closest beats, clustering the found switching patterns, and retaining a subset of the most frequent switching patterns.
Prior to the clustering, all the found switching patterns can for instance be coded as binary vectors of length 8, with value(s) 0 in a vector indicating beats of the 8-beat music pattern with no switching and value(s) 1 in the vector indicating beats on which switching from one shot to another occurs (i.e. as indicated e.g. for the switching patterns A to J hereinbefore). Such a vector would present e.g. the switching pattern C above as the vector [0,1,0,1,0,1,0,1]. Another approach for defining the switching pattern for an 8-beat music pattern is to apply a beat distance vector including indications of the distance to the beginning of the 8-beat music pattern or to the previous switch within the switching pattern, padded with zeros to length 8 in order to have the same vector length for all switching patterns regardless of the number of switches therein. Such a beat distance vector would present e.g. the switching pattern C above as the beat distance vector [2,2,2,2,0,0,0,0]. The clustering can then be performed on the coded switching patterns with a clustering algorithm, such as k-means, k-medians or k-medoids, using an appropriate distance measure, such as the Hamming distance for the binary coding or the Euclidean distance for the beat distance vector coding. Other features, such as the total number of switches in the switching pattern, can also be concatenated or appended to or otherwise associated with the code vectors, possibly together with weighting to be applied between the different features to guide the clustering. Some of the resulting clusters can further be discarded by analyzing their properties, such as size or the distribution of different switching patterns assigned to the cluster. After the clustering, each switching pattern representing one of the clusters may be assigned to a Markov chain model state.
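The two codings described above can be sketched as follows, using pattern C = [1, 3, 5, 7] as the running example; the expected vectors match the ones given in the text. Function names are illustrative.

```python
def to_binary_vector(pattern, length=8):
    """Binary coding: 1 on beats with a switch, 0 elsewhere."""
    return [1 if beat in pattern else 0 for beat in range(length)]

def to_beat_distance_vector(pattern, length=8):
    """Distances from the pattern start / previous switch, zero-padded."""
    distances, previous = [], -1
    for beat in sorted(pattern):
        distances.append(beat - previous)
        previous = beat
    return distances + [0] * (length - len(distances))

def hamming_distance(u, v):
    """Distance measure suitable for the binary coding."""
    return sum(a != b for a, b in zip(u, v))
```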
Then, the system may estimate the transition probabilities between switching patterns by counting occurrences of transitions from state i to state j. The prior probabilities can be estimated from the total counts of occurrence of each state. Constraints such as a minimum or maximum number of switches in a switching pattern may be set before the Markov chain model estimation.
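The counting-based estimation can be sketched as below, assuming an annotated sequence of state indices is available from the learning data. The function and its interface are invented for illustration.

```python
from collections import Counter

def estimate_markov_parameters(state_sequence, n_states):
    """Estimate priors and row-normalized transition probabilities
    from an observed sequence of state indices."""
    transition_counts = Counter(zip(state_sequence, state_sequence[1:]))
    prior_counts = Counter(state_sequence)
    priors = [prior_counts[s] / len(state_sequence) for s in range(n_states)]
    transitions = []
    for i in range(n_states):
        total = sum(transition_counts[(i, j)] for j in range(n_states))
        # fall back to a uniform row if state i was never left
        transitions.append(
            [transition_counts[(i, j)] / total if total else 1.0 / n_states
             for j in range(n_states)])
    return priors, transitions
```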

The learning data can also be used to create models describing the deviation of the discontinuities from the exact beat positions, e.g. by forming a histogram of relative deviations from the closest beat. This can be done, for example, collectively for all beats, separately for each beat position in a musical pattern, or among all occurrences of a certain switching pattern. The deviation models can then be used to artificially deviate the discontinuities of switching patterns containing discontinuities only on exact beat times, which may improve the switching in the sense that it feels less like being generated by a computer and more like hand-made or created by a human director.
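A minimal sketch of applying such a deviation model follows: learned deviations from the nearest beat are resampled to jitter exactly beat-aligned switch times. The deviation values (in seconds) are made up for the demo, and the resampling approach is one simple choice among many.

```python
import random

# Hypothetical empirical deviations (seconds) of learned discontinuities
# from the closest beat, collected from annotated learning data.
learned_deviations = [-0.02, 0.0, 0.01, 0.03, -0.01, 0.0, 0.02]

def deviate_switch_times(beat_times, rng):
    """Offset each exact beat-aligned switch time by a sampled deviation."""
    return [t + rng.choice(learned_deviations) for t in beat_times]
```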

Furthermore, separate deviation models and separate sets of switching patterns or transition probabilities may be defined for different levels of the club score P. For example, different Markov chain models may be learned/derived from the learning data for a number of sub-ranges of the possible club score P values (e.g. sub-ranges from 1 to 3, 3 to 4 and 4 to 5, assuming a continuous range of club scores in scale 1 to 5, or the sub-ranges may be defined by each club score from 1 to 5, assuming integer values in scale 1 to 5 for the club score). When the learned Markov chain models are applied (e.g. by the switching pattern selector 160), the club score P assigned for a segment of input audio signal x(n) may be used to select the appropriate sets of switching patterns and transition probabilities as well as other parameters, such as the initial probability. The learning data subset partitioning and association to a given club score value or sub-range can also be done based on other automatically extracted or manually defined information, such as song tempo ranges, artist, genre, recording year, or video color statistics, or a combination of multiple such attributes. As an example, a switching model for high club scores could be learned by considering only a subset of the learning data consisting of music videos by electronic music artists with tempo ranging between 115 and 150 BPM.

The learned statistical Markov chain models can further be used in combination with a deterministic switching time determining method, such as the one described in [11]. The methods can be combined for instance as follows: whenever the deterministic method indicates one or more switching points (e.g. strong enough perceptual emphasis in case of [11]) during a musical pattern, a switching pattern with discontinuities aligned with the deterministic switching points is forced - otherwise switching patterns from the statistical model are used. The transition from a deterministic switching pattern to using a statistical model can be smoothed by initializing the statistical model state with the switching pattern most similar to the preceding deterministic switching pattern. Club score values can, for example, be used for adjusting the detection threshold of the deterministic switching pattern determining method, in order to balance between using the deterministic and statistical switching patterns according to the club score. In addition to or instead of the club score, one or more other audio attributes may be used in the switching pattern selection. Switching pattern selection on basis of one or more audio attributes may be carried out in the switching pattern selector 160, either in context of the audio processing arrangement 100 (e.g. to make use of the club score in the selection) or independently of (other) components of the audio processing arrangement 100. Such audio attributes may be derived or derivable on basis of the input audio signal x(n) by audio analysis techniques, derived or derivable by human subjects having listened to the input audio signal x(n) and used their own judgment to set the respective audio attribute, or the audio attributes may be information associated with but not necessarily directly derivable from the input audio signal x(n) itself.

In general, one or more audio attributes may be applied to classify the input audio signal x(n) into one of a plurality of predetermined classes or categories, and each of these categories may imply selecting a certain predetermined category-specific switching pattern or applying a certain category-specific rule to select one of the predetermined switching patterns. Note that selecting a certain predefined category-specific switching pattern on basis of the value of an audio attribute corresponds to directly selecting the switching pattern on basis of the value of the audio attribute. As an example, a category-specific rule may define that one of the predetermined switching patterns assigned to the respective category is selected randomly. The mapping between the value of the audio attribute and the category depends on the characteristics and/or type of the audio attribute. As an example of the classification process on basis of a given audio attribute, each of the categories may be associated with one or more predefined value ranges of the given audio attribute, and hence the input audio signal x(n) may be classified into a certain category if its value falls within one of the value ranges associated with said category.
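The value-range classification described above can be sketched as follows. The category names and numeric ranges are invented for the demo; only the mechanism (an attribute value falling in a category's range selects that category) follows the text.

```python
# Hypothetical category ranges for a single audio attribute on a 1-5 scale.
CATEGORY_RANGES = {
    "calm":      (1.0, 3.0),
    "moderate":  (3.0, 4.0),
    "energetic": (4.0, 5.0),
}

def classify(attribute_value):
    """Return the first category whose value range contains the value."""
    for category, (low, high) in CATEGORY_RANGES.items():
        if low <= attribute_value <= high:
            return category
    raise ValueError("attribute value outside all category ranges")
```

Each returned category would then trigger its own category-specific switching pattern or selection rule.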

As a further example, at least for some of the categories the respective category-specific rule may further apply the value of the audio attribute in selection of the switching pattern, e.g. such that a high value of the audio attribute (e.g. a value exceeding a predefined threshold) causes selection of a switching pattern that results in a high(er) frequency of discontinuities in the visual content (e.g. video angle switching and/or image changes) while a low value (e.g. a value not exceeding the predefined threshold) causes selection of a switching pattern that results in a low(er) frequency of discontinuities in the visual content. As yet another example, the category-specific rule may apply a Markov chain model where each state of the Markov chain model corresponds to a certain switching pattern, as described in more detail hereinbefore using the club score as an example of an audio attribute applied to control selection of the switching pattern.

Instead of using the same audio attribute both to classify the input audio signal x(n) into one of the predetermined categories and in selection of the switching pattern using the category-specific rule, different audio attributes may be applied for these purposes. For example, the audio attribute applied to classify the input audio signal x(n) into one of the predetermined categories may be the musical genre or the musical mood determined or assigned to the input audio signal x(n), while the audio attribute applied to select the switching pattern may be the club score determined or assigned to the input audio signal x(n). As an example, there may be different switching pattern(s) assigned for different musical genres or moods, and/or some parameters of the category-specific selection rules, such as the transition probabilities between states of the Markov chain model provided for the given category, may depend on the musical genre or the mood. As one example, there may be different switching pattern models trained for each musical genre, and within some of the genres the club score (or other audio attribute(s) or values such as energy) may control the switching pattern selection. For example, different switching pattern models may be trained or defined for pop music, rock music, dance music and classical music. Furthermore, within the pop/dance/rock categories, the club likeness value might further be used to enable faster switching patterns (i.e. switching patterns that result in a higher frequency of discontinuities in visual content) whenever the club likeness goes beyond a predetermined threshold. Within the classical category, club likeness might not be used at all, if faster switching is not desired in videos of this style. An audio attribute may characterize the audio content.
Such an audio attribute might be obtained from metadata associated with the video or audio clip, such as a Moving Picture Experts Group MPEG-2 Audio Layer III ID3 metadata container or other metadata. Instead of being contained in a metadata container within the same container media format, the metadata might be located separately from the audio or video file, such as in a separate file, database, or in a separate device such as a server from where it could be queried. In some embodiments, audio fingerprinting, such as the system provided by Shazam Inc., might be applied on the audio signal to determine an audio fingerprint, send it to a service, and in return obtain the identity of the audio file. As a result, metadata could be obtained describing the audio file. Metadata could be obtained, for example, from metadata providers such as All Music Guide. The contents of the metadata might further be based on either manual annotation by human experts or automatic methods performed by machines. The automatic methods could furthermore be based on analysis of textual data describing music and deriving metadata attributes from the text data, analysis of social tags provided by humans, or analysis of the audio signal, such as the club score determination method described earlier. Examples of audio attributes that may contribute to classification of the input audio signal x(n) and/or the switching pattern selection for the input audio signal x(n) include the following:

- musical genre, such as pop, dance, R&B, hip hop, rock, indie, reggae, metal, classical, jazz, blues, ...

- mood, such as happy, sad, angry, melancholic, chill out, mellow,...

- mode, such as major/minor

- key, such as C major, D minor, ...

- energy

- harmony

- audio category, such as speech or music

- basic attributes such as artist, album, track title

Some of these attributes can be automatically analyzed from the audio signal. For example, a method for music genre classification has been presented in [4]. A method for audio mood classification has been presented, for example, in Cyril Laurier, "Automatic Classification of Musical Mood by Content-Based Analysis", PhD Thesis, Universitat Pompeu Fabra, Barcelona, Spain, 2011. Methods to analyze audio key or mode have been described in Geoffroy Peeters, "MIREX-2012 "AUDIO KEY DETECTION" TASK: IRCAMKEYMODE", abstracts of the Music Information Retrieval Evaluation eXchange (MIREX 2012), in association with the 13th International Conference on Music Information Retrieval, ISMIR 2012, Porto, Portugal, 8-12 October 2012. Methods for audio energy analysis have been presented earlier in this text. Measures of harmony could relate to analysis of audio chroma or chords, see for example H. Papadopoulos, G. Peeters, "Joint estimation of chords and downbeats from an audio signal", IEEE Transactions on Audio, Speech and Language Processing 18(6), 2010. A method for classifying between speech and music has been presented, for example, in Scheirer, Slaney, "Construction and evaluation of a robust multifeature speech/music discriminator", in Proc. of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP-97), 1997, pages 1331-1334, vol. 2.

Operations, procedures, functions and/or methods described in context of the components of the audio analysis arrangement 100 may be distributed between the components in a manner different from the one(s) described hereinbefore. There may be, for example, further components within the audio analysis arrangement 100 for carrying out some of the operations, procedures, functions and/or methods assigned in the description hereinbefore to components of the audio analysis arrangement 100, or there may be a single component or unit for carrying out the operations, procedures, functions and/or methods described in context of the audio analysis arrangement 100.

In particular, the operations, procedures, functions and/or methods described in context of a component of the audio analysis arrangement 100 may be provided as respective software means, hardware means or combination of software means and hardware means. As an example in this regard, the audio analysis arrangement 100 may be provided by an apparatus comprising means for obtaining one or more sets of features descriptive of characteristics of a segment of audio signal representing a piece of music, said one or more sets comprising at least a first set of features comprising one or more BT features descriptive of periodicity of said segment of audio signal, a second set of features comprising one or more FP features descriptive of modulation energies at a set of modulation frequencies across a set of predetermined frequency bands in said segment of audio signal, a third set of features comprising one or more DF features descriptive of correlations across different time scales in said segment of audio signal, and a fourth set of features comprising one or more energy features descriptive of the signal energy within said segment of audio signal and means for deriving the club score on basis of the features in the first, second, third and fourth sets of features, which club score is indicative of at least beat strength associated with said segment of audio signal. The apparatus may further comprise means for selecting a switching pattern from a plurality of predetermined switching patterns based at least in part on the derived club score, wherein a switching pattern indicates temporal locations for introduction of discontinuities in visual content associated with said segment of audio signal in relation to temporal locations of beats or downbeats identified for said segment of audio signal. The above-described means for obtaining, means for deriving and/or means for selecting may be varied in a number of ways, e.g.
as described in the foregoing in context of corresponding elements of the audio processing arrangement 100. As another example in this regard, the operations, procedures, functions and/or methods described in context of applying one or more other audio attributes instead of or in addition to the club score (e.g. in the switching pattern selector 160) may be provided by an apparatus comprising means for obtaining one or more audio attributes characterizing a segment of audio signal representing a piece of music and means for selecting a switching pattern from a plurality of predetermined switching patterns based at least in part on said one or more audio attributes, wherein a switching pattern indicates temporal locations for introduction of discontinuities in visual content associated with said segment of audio signal in relation to temporal locations of beats or downbeats identified for said segment of audio signal. These means for obtaining and means for selecting may be varied in a number of ways, e.g. as described in the foregoing in context of the switching pattern selection on basis of one or more audio attributes.

Figure 10 depicts a flowchart illustrating an exemplifying method 500 for carrying out operations, procedures, functions and/or methods described in context of the components of the audio analysis arrangement 100. The method 500 comprises obtaining the BT features descriptive of periodicity of a segment of the input audio signal x(n), as indicated in block 510. The method 500 further comprises obtaining the FP features descriptive of modulation energies at a set of modulation frequencies across a set of predetermined frequency bands of the input audio signal x(n), as indicated in block 520. The method 500 further comprises obtaining the DF features descriptive of correlations across different time scales in the input audio signal x(n), as indicated in block 530. The method 500 further comprises obtaining the energy features descriptive of the energy of the input audio signal x(n), as indicated in block 540. The method 500 further comprises deriving the club score on basis of the BT features, the FP features, the DF features and the energy features, as indicated in block 550. The method 500 may further comprise selecting a switching pattern from a plurality of predetermined switching patterns at least in part on basis of the determined club score, as indicated in block 560. Examples regarding more detailed operation within the method steps referred to in blocks 510 to 560 are described hereinbefore in context of the audio analysis arrangement 100. Figure 9 schematically illustrates an exemplifying apparatus 900 upon which an embodiment of the invention may be implemented. The apparatus 900 as illustrated in Figure 9 provides a diagram of exemplary components of an apparatus which is capable of operating as or providing the audio analysis arrangement 100 according to an embodiment and/or capable of operating as or providing the switching pattern selector 160 for switching pattern selection on basis of one or more audio attributes.
The apparatus 900 comprises a processor 910 and a memory 920. The processor 910 is configured to read from and write to the memory 920. The memory 920 may, for example, act as the memory for storing the audio/voice signals and the noise/voice characteristics. The apparatus 900 may further comprise a communication interface 930, such as a network card or a network adapter enabling wireless or wireline communication with another apparatus and/or radio transceiver enabling wireless communication with another apparatus over radio frequencies. The apparatus 900 may further comprise a user interface 940 for providing data, commands and/or other input to the processor 910 and/or for receiving data or other output from the processor 910, the user interface 940 comprising for example one or more of a display, a keyboard or keys, a mouse or a respective pointing device, a touchscreen, a touchpad, etc. The apparatus 900 may comprise further components not illustrated in the example of Figure 9.

Although the processor 910 is presented in the example of Figure 9 as a single component, the processor 910 may be implemented as one or more separate components. Although the memory 920 in the example of Figure 9 is illustrated as a single component, the memory 920 may be implemented as one or more separate components, some or all of which may be integrated/removable and/or may provide permanent/semi-permanent/dynamic/cached storage. The apparatus 900 may be embodied, for example, as an electronic device equipped with processing capacity sufficient to carry out operations, procedures and/or functions described in context of the arrangement 100 and/or in context of the switching pattern selection on basis of one or more audio attributes. As a non-limiting example, such a device may be provided as a computer apparatus arranged to operate as a server. The computer apparatus may be a personal computer such as a laptop computer or a desktop computer or it may be a mainframe computer. Moreover, provided that a sufficient processing capacity is available, the apparatus 900 may be embodied e.g. as a mobile phone, a smartphone, a digital camera, a digital video camera, a music player, a media player, a gaming device, a personal digital assistant (PDA), a tablet computer, etc. The memory 920 may store a computer program 950 comprising computer-executable instructions that control the operation of the apparatus 900 when loaded into the processor 910. As an example, the computer program 950 may include one or more sequences of one or more instructions. The computer program 950 may be provided as a computer program code. The processor 910 is able to load and execute the computer program 950 by reading the one or more sequences of one or more instructions included therein from the memory 920.
The one or more sequences of one or more instructions may be configured to, when executed by one or more processors, cause an apparatus, for example the apparatus 900, to carry out operations, procedures and/or functions described hereinbefore in context of the audio analysis arrangement 100 and/or in context of the switching pattern selection on basis of one or more audio attributes.

Hence, the apparatus 900 may comprise at least one processor 910 and at least one memory 920 including computer program code for one or more programs, the at least one memory 920 and the computer program code configured to, with the at least one processor 910, cause the apparatus 900 to perform operations, procedures and/or functions described hereinbefore in context of the audio analysis arrangement 100 and/or in context of the switching pattern selection on basis of one or more audio attributes.

The computer program 950 may be provided at the apparatus 900 via any suitable delivery mechanism. As an example, the delivery mechanism may comprise at least one computer readable non-transitory medium having program code stored thereon, the program code which, when executed by an apparatus, causes the apparatus at least to carry out operations, procedures and/or functions described hereinbefore in context of the audio analysis arrangement 100 and/or in context of the switching pattern selection on basis of one or more audio attributes. The delivery mechanism may be for example a computer readable storage medium, a computer program product, a memory device, a record medium such as a CD-ROM, a DVD, a Blu-ray disc or another article of manufacture that tangibly embodies the computer program 950. As a further example, the delivery mechanism may be a signal configured to reliably transfer the computer program 950. Reference to a processor should not be understood to encompass only programmable processors, but also dedicated circuits such as field-programmable gate arrays (FPGA), application specific integrated circuits (ASIC), signal processors, etc. Features described in the preceding description may be used in combinations other than the combinations explicitly described. Although functions have been described with reference to certain features, those functions may be performable by other features whether described or not. Although features have been described with reference to certain embodiments, those features may also be present in other embodiments whether described or not.