Title:
METHOD AND APPARATUS FOR IDENTIFYING AUDIO SUCH AS MUSIC
Document Type and Number:
WIPO Patent Application WO/2005/101243
Kind Code:
A1
Abstract:
A song identification process has two stages: a database generation stage and a matching stage. In the database generation stage the system extracts an audio fingerprint and robust codes for a database of songs. A set of maps of all possible permutations of the codes is generated, storing all the locations of the different codes. In the matching stage, the system computes a similarity measure between the query song and the database songs by aligning the query at the locations indicated by the maps. A bit error rate (BER) is computed for every song, and the song with the lowest BER is the matching song.

Inventors:
Yeo, Kue Leong (Blk 115 Potong Pasir Avenue 1, #02-888, 5, 35011, SG)
Chong, Kok Seng (Blk 50 Choa Chu Kang North 7, #03-06, 7, 68952, SG)
Neo, Sua Hong (Blk 959 Hougang Street 91, #06-284, 9, 53095, SG)
Kuah, Kim Hann (Blk 145 Yishun Street 11, #04-43, 5, 76014, SG)
Application Number:
PCT/SG2004/000092
Publication Date:
October 27, 2005
Filing Date:
April 13, 2004
Assignee:
MATSUSHITA ELECTRIC INDUSTRIAL CO. LTD. (1006 Oaza Kadoma, Kadoma-shi, Osaka, 571-8501, JP)
Yeo, Kue Leong (Blk 115 Potong Pasir Avenue 1, #02-888, 5, 35011, SG)
Chong, Kok Seng (Blk 50 Choa Chu Kang North 7, #03-06, 7, 68952, SG)
Neo, Sua Hong (Blk 959 Hougang Street 91, #06-284, 9, 53095, SG)
Kuah, Kim Hann (Blk 145 Yishun Street 11, #04-43, 5, 76014, SG)
International Classes:
G06F17/30; (IPC1-7): G06F17/30
Domestic Patent References:
WO 2002/011123 A2, 2002-02-07
WO 2002/008943 A2, 2002-01-31
WO 1998/025269 A1, 1998-06-11
Foreign References:
US 2003/0191764 A1, 2003-10-09
US 2003/0086341 A1, 2003-05-08
US 6,453,252 B1, 2002-09-17
Other References:
HAITSMA J. ET AL.: "An Efficient Database Search Strategy For Audio Fingerprinting", IEEE WORKSHOP ON MULTIMEDIA SIGNAL PROCESSING, 9 December 2002 (2002-12-09), pages 178 - 181, XP010642541
CANO P. ET AL.: "A Review of Algorithms for Audio Fingerprinting", IEEE WORKSHOP ON MULTIMEDIA SIGNAL PROCESSING, 9 December 2002 (2002-12-09), pages 169 - 173, XP010642539, Retrieved from the Internet
FREEDB.HOWTO, 8 January 2004 (2004-01-08), Retrieved from the Internet
Attorney, Agent or Firm:
ELLA CHEONG SPRUSON & FERGUSON (SINGAPORE) PTE LTD (P.O. Box 1531, Robinson Road Post Office, 1, 90303, SG)
Claims:
CLAIMS
1. A method of generating a database from audio files, for use in audio identification of a query file, comprising: receiving one or more audio files; for individual received audio files extracting an audio fingerprint of the audio file; and generating a location table mapping the locations of data strings of a predetermined category within the audio fingerprint; and storing the one or more audio fingerprints and location tables in a database.
2. A method of matching an audio query against one or more known audio files, comprising: (a) receiving the query; (b) extracting a query fingerprint of the query; (c) receiving an audio fingerprint of an audio file; (d) identifying one or more data strings of a predetermined category from the query fingerprint; (e) identifying a location within the audio fingerprint having a data string corresponding to one of the one or more data strings of the query fingerprint; (f) determining an alignment of the query fingerprint and the audio fingerprint that matches the position of the identified location within the audio fingerprint with the location of the predetermined data string; and (g) determining a similarity measure between the query fingerprint and the audio fingerprint based on the alignment.
3. A method according to claim 2, wherein step (e) comprises using a location table mapping the locations of data strings of a predetermined category within the audio fingerprint.
4. A method according to claim 2, wherein the audio fingerprint of an audio file is received from a database of audio files.
5. A method according to claim 4, further comprising generating a database from audio files, comprising: receiving one or more audio files; for individual received audio files extracting an audio fingerprint of the audio file; and generating a location table mapping the locations of data strings of a predetermined category within the audio fingerprint; and storing the one or more audio fingerprints and location tables in the database; and wherein step (e) comprises using the location table for the received audio fingerprint.
6. A method according to any one of claims 2 to 5, wherein step (d) comprises identifying a set of one or more first data strings of a predetermined category, the first data strings being those data strings that are present at each of one or more first locations, the first locations being locations within the query fingerprint; and step (e) comprises for the or each identified first data string of the query fingerprint, identifying one or more second locations, the second locations being locations within the audio fingerprint having a second data string, a second data string being a data string corresponding to the identified first data string.
7. A method according to claim 6, wherein step (f) comprises, for the or each second location within the audio fingerprint, identified as having a second data string, aligning the query fingerprint against the audio fingerprint by matching the identified second location within the audio fingerprint with the first location of the corresponding first data string within the query fingerprint; step (g) comprises determining the similarity measure between the query fingerprint and the audio fingerprint for one or more different alignments generated in step (f); and further comprising (a) determining a best match measure for the received audio file from amongst the one or more determined similarity measures.
8. A method of matching an audio query against an audio file, comprising: (a) receiving the query file; (b) extracting a query fingerprint of the query file; (c) receiving an audio fingerprint of a first audio file; (d) identifying a set of one or more first data strings of a predetermined category, the first data strings being those data strings that are present at each of one or more first locations, the first locations being locations within the query fingerprint; (e) for the or each identified first data string of the query fingerprint, identifying one or more second locations, within the audio fingerprint, having a second data string corresponding to the identified first data string; (f) for the or each second location, within the audio fingerprint, identified as having a second data string, aligning the query fingerprint against the audio fingerprint by matching the identified second location, within the audio fingerprint, with the first location of the corresponding first data string within the query fingerprint; (g) determining a similarity measure between the query fingerprint and the audio fingerprint for one or more different alignments generated in step (f); and (h) determining a best match measure for the received audio file from amongst the one or more determined similarity measures.
9. A method according to claim 7 or 8, wherein step (g) further comprises determining a first similarity measure based on a reduced area of the query fingerprint and comprises determining a second similarity measure based on a fuller or complete area of the query fingerprint only if the first similarity measure is on a first side of a first threshold.
10. A method according to any one of claims 2 to 9, further comprising: repeating steps (c) and (e) to (g) for a plurality of different audio files; and comparing the similarity measures of the plurality of audio files to determine a best match.
11. A method according to claim 6, wherein step (g) comprises, for the or each identified second location within the audio fingerprint, determining a third location, the third location being a location within the audio fingerprint that would correspond to a predetermined fourth location, within the query fingerprint, if the query fingerprint were aligned against the audio fingerprint by matching the respective second location, within the audio fingerprint, with the first identified location of the corresponding first identified first data string within the query fingerprint; and further comprising prior to step (g), (i) identifying a fifth location and an associated count, within the audio fingerprint, the fifth location being the location of the audio fingerprint that is determined as the third location most often and the associated count being the number of times the fifth location is determined as the third location; and wherein step (g) is carried out on the alignment where the predetermined fourth location of the query fingerprint is aligned with the fifth location of the audio fingerprint.
12. A method of matching an audio query file against one or more known audio files, comprising: (a) receiving the query file; (b) extracting a query fingerprint of the query file; (c) receiving an audio fingerprint of a first audio file; (d) identifying a set of one or more first data strings of a predetermined category, the first data strings being those data strings that are present at each of one or more first locations, the first locations being locations within the query fingerprint; (e) for the or each identified first data string of the query fingerprint, identifying one or more second locations, the second locations being locations within the audio fingerprint having a second data string, a second data string being a data string corresponding to the first identified first data string; (f) for the or each identified second location within the audio fingerprint, determining a third location, the third location being a location within the audio fingerprint that would correspond to a predetermined fourth location, within the query fingerprint, if the query fingerprint were aligned against the audio fingerprint by matching the respective second location, within the audio fingerprint, with the first identified location of the corresponding first identified first data string within the query fingerprint; and (i) identifying a fifth location and an associated count, within the audio fingerprint, the fifth location being the location of the audio fingerprint that is determined as the third location most often and the associated count being the number of times the fifth location is determined as the third location.
13. A method according to claim 12, further comprising, after step (i), (g) determining a similarity measure between the query fingerprint and the audio fingerprint based on the alignment where the predetermined fourth location of the query fingerprint is aligned with the fifth location of the audio fingerprint.
14. A method according to claim 11 or 13, further comprising repeating steps (c), (e), (f) and (i) a plurality of times, for a plurality of different audio files, prior to step (i).
15. A method according to claim 14, further comprising comparing the count from step (i) from each of the plurality of different audio files and conducting step (g) in respect of those audio files for which the count is one of those within a predetermined number of the highest counts.
16. A method according to claim 15, further comprising comparing the similarity measures of the plurality of audio files to determine a best match.
17. A method according to any one of claims 11 to 16, wherein the predetermined fourth location is the start of the query fingerprint.
18. A method according to any one of the preceding claims, wherein extracting a fingerprint comprises extracting multiband energies from a file.
19. A method according to claim 18, wherein extracting a fingerprint further comprises obtaining differential energies from the multiband energies.
20. A method according to claim 18 or 19, wherein the multiband energies comprise energies in 6 bands.
21. A method according to any one of claims 18 to 20, wherein extracting a fingerprint further comprises dividing the file into time domain samples.
22. A method according to any one of claims 18 to 21, wherein extracting multiband energies comprises converting samples of the audio file into frequency samples.
23. A method according to any one of claims 18 to 22, wherein extracting multiband energies further comprises grouping samples within different frequency ranges into different frequency bands.
24. A method according to claim 23, wherein extracting multiband energies further comprises, in individual frequency bands, calculating the sum of squares of the samples.
25. A method according to any one of claims 18 to 24, wherein obtaining differential energies from the multiband energies comprises determining a difference value between the multiband energies of a current sample and the multiband energies of a preceding sample.
26. A method according to claim 25, wherein the preceding sample comprises the immediately preceding sample.
27. A method according to any one of claims 18 to 26, further comprising obtaining binary outputs from the differential multiband energies.
28. A method according to claim 27, wherein obtaining binary outputs from the differential multiband energies comprises comparing the differential multiband energies with thresholds to provide binary outputs.
29. A method according to claim 27 or 28, further comprising concatenating the binary outputs from individual samples to provide differential energy contours.
30. A method according to any one of claims 27 to 29, further comprising, for individual bands, concatenating the binary outputs from successive samples.
31. A method according to any one of the preceding claims, wherein the data strings of a predetermined category comprise data strings of a predetermined length.
32. A method according to claim 31, wherein the predetermined length is 9 bits.
33. A method according to any one of the preceding claims, wherein the data strings of a predetermined category comprise strings of contiguous data.
34. A method according to any one of the preceding claims, wherein the data strings of a predetermined category comprise strings of data in a time direction.
35. A method according to any one of the preceding claims, wherein the audio fingerprint comprises data derived from different frequency bands of the audio file and individual data strings of a predetermined category comprise strings of data from individual frequency bands.
36. A method according to claim 35 when dependent on at least claim 6, 8 or 11, wherein the first locations comprise, for at least one of the different frequency bands, every location in the audio fingerprint derived from the at least one frequency band.
37. A method according to any one of the preceding claims, wherein the data at individual locations within the audio fingerprint is present in a plurality of the data strings of a predetermined category.
38. A method according to claims 36 and 37, wherein the data at the starting location in the at least one frequency band is only used by one of the data strings of a predetermined category and the data at the end location in the at least one frequency band is only used by one of the data strings of a predetermined category.
39. A method according to any one of the preceding claims, wherein the data strings of a predetermined category comprise strings of data that overlap.
40. A method according to any one of claims 2 to 11 and 13 to 16 or according to any one of claims 17 to 39 when dependent on at least one of claims 2, 8 and 13, wherein the similarity measure is a bit error rate.
41. A method according to any one of the preceding claims, wherein the audio file comprises a music file.
42. A method according to any one of the preceding claims, wherein the query comprises a portion of a music file.
43. A database relating to a plurality of audio files, comprising: a plurality of audio fingerprints; and for individual audio fingerprints, a location table mapping the locations of data strings of a predetermined category within the audio fingerprint.
44. Apparatus operable according to the method of any one of the preceding claims.

45. Apparatus for generating a database from audio files, for use in audio identification of a query file, comprising: means for receiving one or more audio files; means for, for individual received audio files, extracting an audio fingerprint of the audio file; and generating a location table mapping the locations of data strings of a predetermined category within the audio fingerprint; and means for storing the one or more audio fingerprints and location tables in a database.

46. Apparatus for matching an audio query against one or more known audio files, comprising: (a) means for receiving the query; (b) means for extracting a query fingerprint of the query; (c) means for receiving an audio fingerprint of an audio file; (d) means for identifying one or more data strings of a predetermined category from the query fingerprint; (e) means for identifying a location within the audio fingerprint having a data string corresponding to one of the one or more data strings of the query fingerprint; (f) means for determining an alignment of the query fingerprint and the audio fingerprint that matches the position of the identified location within the audio fingerprint with the location of the predetermined data string; and (g) means for determining a similarity measure between the query fingerprint and the audio fingerprint based on the alignment.
47. Apparatus for matching an audio query file against an audio file, comprising: (a) means for receiving the query file; (b) means for extracting a query fingerprint of the query file; (c) means for receiving an audio fingerprint of a first audio file; (d) means for identifying a set of one or more first data strings of a predetermined category, the first data strings being those data strings that are present at each of one or more first locations, the first locations being locations within the query fingerprint; (e) means for, for the or each identified first data string of the query fingerprint, identifying one or more second locations, within the audio fingerprint, having a second data string corresponding to the identified first data string; (f) means for, for the or each second location, within the audio fingerprint, identified as having a second data string, aligning the query fingerprint against the audio fingerprint by matching the identified second location, within the audio fingerprint, with the first location of the corresponding first data string within the query fingerprint; (g) means for determining a similarity measure between the query fingerprint and the audio fingerprint for one or more different alignments generated in step (f); and (h) means for determining a best match measure for the received audio file from amongst the one or more determined similarity measures.
48. Apparatus for matching an audio query file against one or more known audio files, comprising: (a) means for receiving the query file; (b) means for extracting a query fingerprint of the query file; (c) means for receiving an audio fingerprint of a first audio file; (d) means for identifying a set of one or more first data strings of a predetermined category, the first data strings being those data strings that are present at each of one or more first locations, the first locations being locations within the query fingerprint; (e) means for, for the or each identified first data string of the query fingerprint, identifying one or more second locations, the second locations being locations within the audio fingerprint having a second data string, a second data string being a data string corresponding to the first identified first data string; (f) means for, for the or each identified second location within the audio fingerprint, determining a third location, the third location being a location within the audio fingerprint that would correspond to a predetermined fourth location, within the query fingerprint, if the query fingerprint were aligned against the audio fingerprint by matching the respective second location, within the audio fingerprint, with the first identified location of the corresponding first identified first data string within the query fingerprint; and (i) means for identifying a fifth location and an associated count, within the audio fingerprint, the fifth location being the location of the audio fingerprint that is determined as the third location most often and the associated count being the number of times the fifth location is determined as the third location.
49. A computer program product having a computer program recorded on a computer readable medium, for instructing a computer to operate according to the method of any one of the preceding claims.
50. A computer program product having a computer program recorded on a computer readable medium, for generating a database from audio files, for use in audio identification of a query file, said computer program product comprising: computer program code means for receiving one or more audio files; computer program code means for, for individual received audio files, extracting an audio fingerprint of the audio file; and generating a location table mapping the locations of data strings of a predetermined category within the audio fingerprint; and computer program code means for storing the one or more audio fingerprints and location tables in a database.
51. A computer program product having a computer program recorded on a computer readable medium, for matching an audio query against one or more known audio files, said computer program product comprising: (a) computer program code means for receiving the query; (b) computer program code means for extracting a query fingerprint of the query; (c) computer program code means for receiving an audio fingerprint of an audio file; (d) computer program code means for identifying one or more data strings of a predetermined category from the query fingerprint; (e) computer program code means for identifying a location within the audio fingerprint having a data string corresponding to one of the one or more data strings of the query fingerprint; (f) computer program code means for determining an alignment of the query fingerprint and the audio fingerprint that matches the position of the identified location within the audio fingerprint with the location of the predetermined data string; and (g) computer program code means for determining a similarity measure between the query fingerprint and the audio fingerprint based on the alignment.
52. A computer program product having a computer program recorded on a computer readable medium, for matching an audio query file against an audio file, said computer program product comprising: (a) computer program code means for receiving the query file; (b) computer program code means for extracting a query fingerprint of the query file; (c) computer program code means for receiving an audio fingerprint of a first audio file; (d) computer program code means for identifying a set of one or more first data strings of a predetermined category, the first data strings being those data strings that are present at each of one or more first locations, the first locations being locations within the query fingerprint; (e) computer program code means for, for the or each identified first data string of the query fingerprint, identifying one or more second locations, within the audio fingerprint, having a second data string corresponding to the identified first data string; (f) computer program code means for, for the or each second location, within the audio fingerprint, identified as having a second data string, aligning the query fingerprint against the audio fingerprint by matching the identified second location, within the audio fingerprint, with the first location of the corresponding first data string within the query fingerprint; (g) computer program code means for determining a similarity measure between the query fingerprint and the audio fingerprint for one or more different alignments generated in step (f); and (h) computer program code means for determining a best match measure for the received audio file from amongst the one or more determined similarity measures.
53. A computer program product having a computer program recorded on a computer readable medium, for matching an audio query against one or more known audio files, said computer program product comprising: (a) computer program code means for receiving the query file; (b) computer program code means for extracting a query fingerprint of the query file; (c) computer program code means for receiving an audio fingerprint of a first audio file; (d) computer program code means for identifying a set of one or more first data strings of a predetermined category, the first data strings being those data strings that are present at each of one or more first locations, the first locations being locations within the query fingerprint; (e) computer program code means for, for the or each identified first data string of the query fingerprint, identifying one or more second locations, the second locations being locations within the audio fingerprint having a second data string, a second data string being a data string corresponding to the first identified first data string; (f) computer program code means for, for the or each identified second location within the audio fingerprint, determining a third location, the third location being a location within the audio fingerprint that would correspond to a predetermined fourth location, within the query fingerprint, if the query fingerprint were aligned against the audio fingerprint by matching the respective second location, within the audio fingerprint, with the first identified location of the corresponding first identified first data string within the query fingerprint; and (i) computer program code means for identifying a fifth location and an associated count, within the audio fingerprint, the fifth location being the location of the audio fingerprint that is determined as the third location most often and the associated count being the number of times the fifth location is determined as the third location.
54. A method of generating a database from audio files, for use in audio identification of a query file, substantially as hereinbefore described with reference to and as illustrated in the accompanying drawings.
55. A method of matching an audio query file against an audio file, substantially as hereinbefore described with reference to and as illustrated in the accompanying drawings.
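Claims 18 to 30 outline a fingerprint-extraction chain: multiband energies per frame, differential energies against the preceding frame, and thresholding to binary contours. The following is a minimal sketch of that chain only; the frame and energy representation is illustrative, and the time-to-frequency conversion and band grouping of claims 21 to 24 are assumed to have already produced one energy value per band per frame:

```python
def binary_fingerprint(frames, bands=6):
    """Sketch of claims 18-30: per-frame multiband energies are differenced
    against the immediately preceding frame (claims 25-26) and thresholded
    to one bit per band (claims 27-28); bits from successive frames are
    concatenated to form differential energy contours (claims 29-30).
    `frames` is a list of per-frame energy lists, one energy per band."""
    bits = []
    prev = frames[0]
    for frame in frames[1:]:
        for b in range(bands):
            diff = frame[b] - prev[b]           # differential energy
            bits.append(1 if diff > 0 else 0)   # threshold at zero (illustrative)
        prev = frame
    return bits
```

The zero threshold is an assumption for illustration; the claims only require comparison with some threshold.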
Description:
METHOD AND APPARATUS FOR IDENTIFYING AUDIO SUCH AS MUSIC

Field of the Invention

The invention relates to identifying audio, such as music, in particular by comparing certain properties of the sound with corresponding properties in a database derived from known sounds.

Background

The revolution in Internet technologies provides convenience in communication between people in different parts of the globe. This revolution, coupled with new compression techniques, makes the transfer of media content across the globe easier. Tapping the infrastructure that has developed, the music industry is exploring ways to manage the distribution of media content, especially audio files. Effective management requires knowledge of the media being transferred. Several methods have been employed, the simplest being verification of the filename. However, the ease of changing a filename has rendered this method inadequate.

Another solution is file comparison. By comparing stored data with query data bit by bit, a file's identity can be determined. However, the same audio circulates on the Internet in several file formats, so bit-level comparison fails across formats. To reduce the storage this approach requires, the comparison can use checksums or one-way hash functions such as Message Digest 5 (MD5). Nevertheless, checksums and MD5 are independent of the audio properties of the file and this method may prove insecure.
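The fragility of hash-based file comparison can be illustrated as follows; this sketch is not part of the claimed method, and the byte payload is invented for the example:

```python
import hashlib

def file_digest(data: bytes) -> str:
    """Return the MD5 digest of a file's raw bytes."""
    return hashlib.md5(data).hexdigest()

original = b"RIFF...fake-audio-payload..."
copy = bytes(original)                 # bit-identical copy
reencoded = original[:-1] + b"\x00"    # one byte changed; same audio to a listener

assert file_digest(copy) == file_digest(original)       # exact copies match
assert file_digest(reencoded) != file_digest(original)  # any byte change breaks the match
```

A digest over raw bytes thus identifies only exact copies; any re-encoding or trivial edit defeats it, which is why the digest tells us nothing about the acoustic content.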

Descriptive information pertaining to audio data can be, and is, coupled with the media content. This is known as metadata. Such metadata may contain the artist name, composer name, title and other data. End users, however, can easily alter this data, thus rendering the method insecure. Watermarking technologies have emerged as an alternative. These embed digital information into the audio file without altering the acoustic content perceived by the end user. This digital information translates to a unique identifier of the song. The central issues have been the audio quality of the watermarked song versus the robustness of the watermarking algorithm against hacking. Moreover, watermarking requires prior embedding of digital information into an audio file, so songs and other audio already on the market cannot usefully be protected in this manner.

A content-based identification system differs from the above technologies. A content-based recognition system analyses the acoustic properties of the audio file and derives unique fingerprints. Training on the audio files is achieved by suitable extraction of unique acoustic properties, also known as audio fingerprints. Caution must be exercised in the selection of the acoustic properties, as it affects the system's robustness and the database size. Following the training process is the database building process, where all the audio fingerprints are indexed.

For such approaches, the matching process is similar to the training process. An audio fingerprint is extracted from a clip of an original song. The matching algorithm calculates the similarity between the extracted query audio fingerprint and the ones in the database by means of a similarity function. The best-matched audio fingerprint yields the highest similarity measure, and the corresponding song is deemed the matched song.
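One common similarity function, and the one used later in this document, is the bit error rate (BER) between binary fingerprints. The following sketch models fingerprints as plain bit lists and scores every possible alignment exhaustively; all names are illustrative, and it is this exhaustive alignment that the map-guided approach described in the abstract is designed to avoid:

```python
def bit_error_rate(fp_a, fp_b):
    """Fraction of positions at which two equal-length bit lists differ."""
    assert len(fp_a) == len(fp_b)
    errors = sum(a != b for a, b in zip(fp_a, fp_b))
    return errors / len(fp_a)

def best_match(query_fp, database):
    """Return the song whose fingerprint contains the lowest-BER alignment
    of the query. Every song fingerprint is assumed to be at least as long
    as the query fingerprint."""
    scores = {}
    for name, song_fp in database.items():
        # Slide the query over every alignment of the longer song fingerprint.
        scores[name] = min(
            bit_error_rate(query_fp, song_fp[i:i + len(query_fp)])
            for i in range(len(song_fp) - len(query_fp) + 1)
        )
    return min(scores, key=scores.get)
```

Note that the cost grows with the product of query length, song length and database size, which motivates the faster search strategies discussed below.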

The most thorough search strategy is an exhaustive search. However, due to constraints in processing power, an exhaustive search can be prohibitive for a large database.

A method of generating and matching hashes is described in U.S. Patent Application No. 2002/0178410 A1, published on 28 November 2002 in the name of Haitsma et al. The hashes are short summaries of data files, which can be used to identify the files. The method generates robust hashes for multimedia content. The audio clip is divided into successive frames. For each frame, the frequency spectrum is divided into bands, a robust property of each band is computed, and the result is represented as a concatenation of binary hash words, one for each frame. To identify a possibly compressed database audio file, a block of derived hash words is matched by a computer against a large database. The database includes a lookup table (LUT) that stores an entry for each hash word; each entry points to the song and the location where the respective hash word occurs. Via the lookup table, the Hamming distance between the query hash block and the database song is obtained, and this distance determines whether the query clip is similar to the database song.
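The lookup-table strategy just described can be sketched roughly as follows; hash words are modelled as integers, and the structure and names are illustrative rather than drawn from Haitsma et al.:

```python
from collections import defaultdict

def build_lut(database):
    """Map each hash word to every (song, frame index) where it occurs."""
    lut = defaultdict(list)
    for name, words in database.items():
        for pos, w in enumerate(words):
            lut[w].append((name, pos))
    return lut

def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hash words."""
    return bin(a ^ b).count("1")

def lookup(query_words, lut, database):
    """Each query word suggests candidate alignments via the LUT; the whole
    block is then scored by total Hamming distance at those alignments only,
    instead of at every possible alignment."""
    best = None
    for q_pos, w in enumerate(query_words):
        for name, s_pos in lut.get(w, []):
            start = s_pos - q_pos
            song = database[name]
            if start < 0 or start + len(query_words) > len(song):
                continue
            dist = sum(hamming(q, s) for q, s in
                       zip(query_words, song[start:start + len(query_words)]))
            if best is None or dist < best[0]:
                best = (dist, name)
    return best
```

The table restricts full comparison to alignments where at least one hash word matches exactly, which is the source of the speed-up over exhaustive search.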

An alternative method of recognising an audio clip is described in U.S. Patent Application No. 2002/0083060 A1, published on 27 June 2002 in the name of Wang et al. The method recognises the audio clip by storing a set of landmark time points and associated fingerprints in the database. The landmarks occur at reproducible locations within the file, and the fingerprints represent features of the signal at or near the landmark time points. To perform recognition, landmarks and fingerprints are computed for the query sample and used to retrieve matching fingerprints from the database. If a large number of corresponding landmarks are linearly related, i.e. if equivalent fingerprints of the sample and the retrieved file have the same time evolution, then the file is identified as similar to the sample.
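The linear-relation test can be sketched as a histogram of time offsets between matching fingerprints: if the query truly occurs in the file, many matches share one common offset. The tuple representation and names below are illustrative, not drawn from Wang et al.:

```python
from collections import Counter, defaultdict

def match_score(query_pairs, file_pairs):
    """Each pair is (landmark_time, fingerprint). Matching fingerprints that
    are linearly related share a common time offset, so for a true match one
    histogram bin dominates; the score is the size of the largest bin."""
    by_fp = defaultdict(list)
    for t, fp in file_pairs:
        by_fp[fp].append(t)
    offsets = Counter()
    for qt, fp in query_pairs:
        for ft in by_fp.get(fp, []):
            offsets[ft - qt] += 1
    return max(offsets.values()) if offsets else 0
```

A spurious fingerprint collision contributes to a random offset bin, so counting the dominant bin rather than raw fingerprint matches makes the test robust to isolated false matches.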

In U.S. Patent Application No. 2001/0044719 A1, published on 22 November 2001 in the name of Casey, a computerised method of extracting features from an acoustic signal and recognising the signal is described. A spectral envelope is produced by windowing the acoustic signal. The dimensionality of the spectral envelope is then reduced to produce a set of features for the acoustic signal. The features in the set are clustered to produce a group of features for each of the sources. Each group of features is a quantitative descriptor that is associated with a qualitative descriptor. Hidden Markov models are trained with a set of known features and stored in a database. Matching is done simply by retrieving features to determine the similarity.

For all recognition systems, there exists limited tolerance to the degree of variation in the input signal. This is termed the robustness of the system. In most applications, the input signal may be subjected only to compression distortions. However, in more severe cases, non-linear distortions due to air-microphone recording may affect the overall system performance. All recognition systems face the challenge of maintaining matching accuracy under the above distortions.

More practically, a recognition system should exhibit a fast matching speed. This becomes one of the important factors to consider in implementing a matching algorithm. Coupled to this problem is the consideration of storage requirements. Often, the storage requirements determine the exactness or the compactness of a given audio feature.

Summary

According to a first aspect of the present invention, there is provided a method of generating a database from audio files, for use in audio identification of a query file. The method comprises: receiving one or more audio files; for individual received audio files, extracting an audio fingerprint of the audio file and generating a location table mapping the locations of data strings of a predetermined category within the audio fingerprint; and storing the one or more audio fingerprints and location tables in a database.

According to a second aspect of the present invention, there is provided a method of matching an audio query against one or more known audio files. The method comprises: receiving the query; extracting a query fingerprint of the query; receiving an audio fingerprint of an audio file; identifying one or more data strings of a predetermined category from the query fingerprint; identifying a location within the audio fingerprint having a data string corresponding to one of the one or more data strings of the query fingerprint; determining an alignment of the query fingerprint and the audio fingerprint that matches the position of the identified location within the audio fingerprint with the location of the predetermined data string; and determining a similarity measure between the query fingerprint and the audio fingerprint based on the alignment.

According to a third aspect of the present invention, there is provided a method of matching an audio query file against an audio file. The method comprises: receiving the query file; extracting a query fingerprint of the query file; receiving an audio fingerprint of a first audio file; identifying a set of one or more first data strings of a predetermined category, the first data strings being those data strings that are present at each of one or more first locations, the first locations being locations within the query fingerprint; for the or each identified first data string of the query fingerprint, identifying one or more second locations, within the audio fingerprint, having a second data string corresponding to the identified first data string; for the or each second location, within the audio fingerprint, identified as having a second data string, aligning the query fingerprint against the audio fingerprint by matching the identified second location, within the audio fingerprint, with the first location of the corresponding first data string within the query fingerprint; determining a similarity measure between the query fingerprint and the audio fingerprint for the one or more different alignments; and determining a best match measure for the received audio file from amongst the one or more determined similarity measures.

According to a fourth aspect of the present invention, there is provided a method of matching an audio query file against one or more known audio files. The method comprises: receiving the query file; extracting a query fingerprint of the query file; receiving an audio fingerprint of a first audio file; identifying a set of one or more first data strings of a predetermined category, the first data strings being those data strings that are present at each of one or more first locations, the first locations being locations within the query fingerprint; for the or each identified first data string of the query fingerprint, identifying one or more second locations, the second locations being locations within the audio fingerprint having a second data string, a second data string being a data string corresponding to the first identified first data string; for the or each identified second location within the audio fingerprint, determining a third location, the third location being a location within the audio fingerprint that would correspond to a predetermined fourth location, within the query fingerprint, if the query fingerprint were aligned against the audio fingerprint by matching the respective second location, within the audio fingerprint, with the first identified location of the corresponding first identified first data string within the query fingerprint; and identifying a fifth location and an associated count, within the audio fingerprint, the fifth location being the location of the audio fingerprint that is determined as the third location most often and the associated count being the number of times the fifth location is determined as the third location.

According to a further aspect of the present invention, there is provided apparatus for operating according to the method of any of the first to fourth aspects.

According to yet a further aspect, there is provided a computer program product having a computer program recorded on a computer readable medium, for instructing a computer to operate according to the method of any of the first to fourth aspects.

Introduction to the Drawings

The invention is described by way of non-limitative example, with reference to the accompanying drawings in which:-

Figure 1 is a schematic view of sound identifying apparatus according to an embodiment of the invention;

Figure 2 is a flowchart illustrating an exemplary overview process of the training of an audio database according to an embodiment;

Figure 3 is a flowchart illustrating an exemplary overview process of the matching of audio according to an embodiment;

Figure 4 is a schematic block diagram illustrating components of an exemplary embodiment of the feature extraction module of Figure 1;

Figure 5 is a flowchart illustrating an exemplary process for extracting an audio fingerprint from an input training audio file;

Figure 6 is an illustration of an exemplary arrangement of an audio fingerprint;

Figure 7 is a schematic illustration of the use of an audio fingerprint in generating TTS maps;

Figure 8 is a flowchart relating to the generation of the TTS maps from an audio fingerprint;

Figure 9 is a flowchart illustrating an exemplary matching procedure for determining the best match between the query file and audio files in a database;

Figure 10 is an illustration of a query fingerprint and certain aspects of an exemplary method of determining possible locations against which to match an audio file with the query file;

Figure 11 is a flowchart relating to a process for determining a file BER;

Figure 12 is a flowchart of an alternative BER calculation procedure;

Figure 13 is a flowchart of a process for determining a best match using a statistical matching process;

Figure 14 is a flowchart relating to an example of a specific portion of the process of Figure 13;

Figure 15 is an illustration of an example of how the query fingerprint, TTS maps and a current count table interrelate in the processes of Figures 13 and 14; and

Figure 16 is an illustration of a computer system for implementing the apparatus and processes associated with the exemplary embodiments.

Description

The below-described embodiments use what may be described as a Long String of Sub-band Match (LSSM) method. They attempt to tackle the problems associated with an exhaustive search by improving the indexing process. Instead of conducting an exhaustive search by matching the query audio at every possible location in the database files, LSSM provides an efficient way of locating an audio file in a database matching a query, such as a query song.

An audio identification process has two stages: a database generation stage and a matching stage. In the database generation stage the system extracts an audio fingerprint and robust codes for a database of audio files. A set of maps of all possible permutations of the codes is generated, storing all the locations of the different codes. In the matching stage, the system performs a similarity measure between the query and the database files by aligning the query at the locations indicated by the maps. A bit error rate (BER) is computed for every audio file. The audio file with the lowest BER is the matching audio file.

Figure 1 is a schematic view of sound identifying apparatus 10 according to an embodiment of the invention.

The sound identifying apparatus 10 has a feature extraction module 12 with two inputs: a training audio file input 14 for inputting an audio input signal S1 and a query input 16 for inputting a query input signal S2. The feature extraction module 12 also has two outputs: one carries an extracted audio feature signal S3 as an input to a feature database 18 and the other carries an extracted query feature signal S4 as an input to a matching module 20. The feature database 18 contains an audio fingerprint database 22 and a Temporal Transition Code (TTS) map database 24. The feature database 18 is also arranged to provide a selected extracted audio feature signal S5 as an input to the matching module 20. The matching module 20 is arranged to output a match identification signal S6.

The audio identification process comprises two stages, a database generation stage, which is a training operation, and an identification stage. Usually the training process occurs only once, possibly with occasional updates, whilst the identification stage occurs every time a new item needs identifying. The two stages may also occur some time apart.

Figure 2 is a flowchart illustrating an exemplary overview process (S100) of the training of an audio database, for instance in the apparatus of Figure 1. This process (S100) generates the contents of the feature database 18 of Figure 1; this is the database generation stage. Figure 3 is a flowchart illustrating an exemplary overview process (S120) of the matching of audio, for instance using the apparatus of Figure 1. The process (S120) is used to identify an unknown piece of audio, a so-called query; this is the identification stage.

The sound identifying apparatus 10 receives a training audio file within the audio input signal S1 at the training audio file input 14 (step S102). The feature extraction module 12 processes the input training audio file and translates relevant information to extract an input audio fingerprint (step S104). To reduce storage requirements, the input training audio file may be converted into a mono training audio file and sampled at a certain sampling frequency, such as 44.1 kHz. A set of TTS maps for the input training audio file is generated based on the extracted audio fingerprint (step S106). The extracted audio fingerprint and the generated TTS maps are sent to storage in the audio fingerprint database 22 and the TTS map database 24, respectively, within the feature database 18. Metadata identifying the file and its author may also be sent to and stored in the feature database 18. A check is then made to determine if the current training audio file is the last one to be processed for training (step S110). If this is not the last file, the next file is selected (step S112) and the process returns to step S104 to process the next training audio file. This process is therefore repeated for as many training audio files as are input to the apparatus 10 or in the database from which the input training audio files are drawn. If the current training audio file is the last file, the training process ends.

When an identification is to be made, the sound identifying apparatus 10 receives an input query file in a query input signal S2 at the query input 16 (step S122). The feature extraction module 12 processes the input query file and translates relevant information to extract a query fingerprint (step S124). The matching module 20 obtains the necessary TTS maps and audio fingerprints from the feature database (step S126). The feature extraction module 12 also obtains certain query TTS codes from the query fingerprint (step S128). The query fingerprint and query TTS codes are input to the matching module 20 as the extracted query feature signal S4. Based on the TTS maps and audio fingerprints and the query fingerprint and query TTS codes, the matching module 20 determines a best match between the query file and the training audio files used to generate the contents of the feature database 18 (step S130) and outputs the result. The matching process then ends.

As is described later, various methods can be used to find the best match. A similarity measure, such as a Bit Error Rate (BER), is used to determine the similarity between the query file and the training audio files whose audio fingerprints are in the database 18 by means of comparing the extracted features. The BER for this purpose may be the ratio between the number of error bits and the total number of bits.
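As a minimal sketch of this similarity measure (illustrative code, not taken from the patent text), the BER between two equal-length bit sequences can be computed as:

```python
# Illustrative sketch: computing a bit error rate (BER) between a query
# fingerprint and an equally long slice of a database fingerprint, as the
# ratio of mismatched bits to the total number of bits.

def bit_error_rate(query_bits, audio_bits):
    """Return the fraction of positions at which the two bit sequences differ.

    Both arguments are equal-length sequences of 0/1 integers.
    """
    if len(query_bits) != len(audio_bits):
        raise ValueError("fingerprint slices must be the same length")
    errors = sum(q != a for q, a in zip(query_bits, audio_bits))
    return errors / len(query_bits)

# Example: 2 mismatches out of 8 bits -> BER of 0.25
print(bit_error_rate([1, 0, 1, 1, 0, 0, 1, 0],
                     [1, 1, 1, 1, 0, 1, 1, 0]))
```

A lower BER indicates a closer match between the query and the database file.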

Training

In the LSSM approach described below, an audio fingerprint is extracted by calculating the difference in multi-band energy between successive frames. Each frame consists of several data samples. Consecutive frames overlap in their data samples (previous samples and new samples) and each frame undergoes a Fast Fourier Transform (FFT). From the FFT, multi-band energies are extracted, and differential multi-band energies are calculated as the differences in energies between successive frames. These differential multi-band energies constitute the audio fingerprint contribution of the particular frame; in short, they represent the differential multi-band energies over time. The audio fingerprint generated is the concatenation of several frames of differential multi-band energies in sequential order.

The audio identification process consists of two stages: a database generation stage and a matching stage. In the database generation stage, the system extracts an audio fingerprint and robust codes for every audio file. These codes are robust but they are not unique. Thus a set of maps comprising all possible permutations of the said codes is generated, storing all the locations of the different codes.

Figure 4 is a schematic block diagram illustrating components of an exemplary embodiment of the feature extraction module 12 of Figure 1. Figure 5 is a flowchart illustrating an exemplary process for extracting an audio fingerprint from an input training audio file, for instance as may be involved in step S104 of Figure 2. This is also an example of the operation of the feature extraction module 12 of Figure 1, as shown in detail in Figure 4.

An audio file is received as a time domain, pulse code modulated (PCM) sample. The received input audio file is divided into sample frames (step S142) by a framing circuit 22, with a subsequent frame overlapping a current frame to ensure high correlation between frames. The preferred frame size is 1024 samples, with each sample in the frame multiplied by a window function such as a Hamming window. Each frame undergoes a spectral transformation (step S144), such as an FFT, by way of a transform module 24, to provide a spectral representation. For computational simplicity and efficiency, the spectral information may be band limited. A recommended range for audio is from 300 Hz to 3 kHz. The band limiting of the training audio file's spectral information to a specific frequency range has the advantage of reducing storage requirements by eliminating redundancy. A range that is most relevant to the human auditory system is generally selected.

The frequency range obtained is further processed into absolute values (step S146) by a magnitude generator 26 and separated into frequency bands (step S148) by a band separator 28. This involves dividing the band-limited spectrum, linearly or non-linearly, into frequency ranges. In a preferred embodiment, each band has the same bandwidth, for example 450 Hz.

Suitable audio properties are extracted (step S150) to represent the training audio file. In this embodiment energy is used, as it exhibits robustness to various processing distortions. To achieve this, the absolute spectral values in the individual frequency bands are squared and summed, in an energy extractor module 30, to obtain the energies of each band separately. This generates a set of multiband energies 32 for that particular frame. The multiband energies 32 are stored in a one-frame delay 34 for each band (step S152). The multiband energy 32 for each band in the current frame N is also subtracted (step S154) from the energy value for the same band of the previous frame N-1, at an adder 36, to produce a set of difference values 38. A threshold module 40 compares the difference value 38 for each band of the frame with a threshold value (step S156), for instance 0.1, to produce a binary representation of each band. If the energy difference value 38 is above the energy threshold, binary '1' is used to represent the particular band. If the energy difference value 38 is below the energy threshold, binary '0' is used to represent the particular band. The thresholds are all the same, although they may be adaptively controlled, based on the distortion or background noise in the environment in which the audio file is recorded. This might be achieved by sampling the audio file at various points.

A frame concatenator 42 generates an x-bit code word 44 for the frame (step S158) by concatenating the binary representations of the energy differences across bands. A fingerprint concatenator 46 forms the audio fingerprint 50 by concatenating the code words 44 from sequential frames sequentially (step S160) for the training audio file.
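The extraction pipeline of Figure 5 (framing, windowing, FFT, band energies, energy differences and thresholding) can be sketched roughly as follows. The frame size of 1024, the Hamming window, the six bands and the 0.1 threshold follow the text; the 50% frame overlap, the exact band edges between 300 Hz and 3 kHz, and the use of NumPy are illustrative assumptions:

```python
# A minimal sketch of the fingerprint extraction described above, assuming
# NumPy is available. Parameters not fixed by the text (hop size, band
# edges) are illustrative choices.
import numpy as np

def extract_fingerprint(samples, sample_rate=44100, frame_size=1024,
                        hop=512, n_bands=6, lo_hz=300.0, hi_hz=3000.0,
                        threshold=0.1):
    """samples: 1-D NumPy array of PCM samples. Returns a list of frames,
    each a list of n_bands 0/1 bits (the x-bit code words)."""
    window = np.hamming(frame_size)
    band_edges = np.linspace(lo_hz, hi_hz, n_bands + 1)
    freqs = np.fft.rfftfreq(frame_size, d=1.0 / sample_rate)
    prev_energies = None
    fingerprint = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = samples[start:start + frame_size] * window
        spectrum = np.abs(np.fft.rfft(frame))          # magnitude spectrum
        energies = np.array([
            np.sum(spectrum[(freqs >= band_edges[b]) &
                            (freqs < band_edges[b + 1])] ** 2)
            for b in range(n_bands)                    # per-band energy
        ])
        if prev_energies is not None:
            diff = energies - prev_energies            # differential energies
            fingerprint.append([1 if d > threshold else 0 for d in diff])
        prev_energies = energies
    return fingerprint  # rows = frames, columns = bands
```

The returned structure mirrors the two-dimensional fingerprint of Figure 6: rows correspond to frame indices and columns to bands.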

The number of bits, x, in the code word 44 is generally the same as the number of bands (although it may differ if the data from two or more bands merge or if a band splits). A typical number of bands is 6, as this balances robustness in the system against computation requirements. Figure 6 is an illustration of an exemplary arrangement of an audio fingerprint 50. In this embodiment, the fingerprint is a two-dimensional array of binary values. The horizontal direction 52 represents the differential energy contour for each frame and the vertical direction 54 represents the differential energy evolution across frames for a particular band. The vertical direction is thus synonymous with the time direction. For simple identification, the audio fingerprint's first column is labelled as band 0, followed by band 1, through to the last band, band 5. The band relates to the energy at different parts of the frequency spectrum. The first row is indexed as frame 0, followed by frame 1, etc. The frame index relates to the time dimension, with the number of frame indices depending on the length of the training audio file.

In principle, there exists high correlation between successive energies in a particular band. Exploiting this, the consecutive band energies are concatenated in the time direction to generate a code word, which is a string of data. This code captures the temporal envelope of the audio file at that particular time and is termed here the Temporal Transition Code (TTS code).

In order to create an efficient method of indexing these TTS codes, a map is generated to locate all possible locations of a particular TTS code in a particular audio file. Generation of this map, also known as the Temporal Transition Map (TTS map), is achieved by concatenating the multi-band energies in time and shifting the TTS codes incrementally in the time direction. The associated time and band information of the particular TTS code is recorded into the different bands' TTS maps.

The frame index gives a time representation of the TTS code and is continuous until the end of the audio fingerprint 50.

Figure 7 is a schematic illustration of the use of an audio fingerprint in generating TTS maps. Figure 7 illustrates an example of a set of TTS maps 60, 62, 64, 66, 68, 70, a separate one for each band of the audio fingerprint 50. The uppermost TTS map 60 is for band 0 of the audio fingerprint 50. The second TTS map 62 is for band 1 of the audio fingerprint 50, and so on. Each individual TTS map 60 - 70 is capable of recording all possible permutations of TTS codes of the respective band.

Each TTS map 60 - 70 contains details of the frame locations 72 of the individual TTS codes within the relevant band for the associated audio fingerprint 50. The TTS codes are extracted from the audio fingerprint 50 relying on the principle of high correlation across frames. For each frame, the respective TTS code is derived as a y-bit binary representation 74 consisting of a concatenation of the binary digit for that frame in the audio fingerprint 50 and the binary digits in the same band for the y-1 immediately succeeding frames.

The TTS maps 60 - 70 contain a list 80 of all possible values of the TTS code, from 0 to 2^y - 1. When the TTS code 74 for a frame is determined, the frame number of that frame is then associated with that code within the TTS map. The same TTS code may appear more than once in the same band of an audio fingerprint 50. Thus more than one frame number may be associated with each TTS code value.

Figure 8 is a flowchart relating to the generation of the TTS maps from the audio fingerprint 50. This is an example of the process of step S106 within the flowchart of Figure 2. Figure 8 is described with reference to the illustration of Figure 7.

The process is initiated by setting the current band number, BN = 0, and the current frame index, FI = 0 (step S172). For the current band number and the current frame index the binary representation of the TTS code 74 is extracted (step S174). For FI = 0 this is achieved by concatenating y consecutive bits from frame 0 to frame index y-1. In this embodiment y = 9, as this is a suitable TTS code length for handling air-microphone distortions. Thus for BN = 0 and FI = 0, the relevant binary representation of the TTS code 74 for the audio fingerprint 50 of Figure 6 is 101001101, being a concatenation of the values for FI = 0 to FI = 8 in band 0.

Within the TTS map 60 corresponding to the current band number, the TTS code value in the code value list 80 that corresponds to the current extracted TTS code for the current frame index is located. The current frame index 72 is recorded, associated with the TTS code value for that frame index, in the TTS map (step S178). In the example shown in Figures 6 and 7, for BN = 0 and FI = 0, the TTS code 74 for FI = 0 is binary 101001101 (written as decimal 333 for easier reading in this description). Thus the frame index value "0" is stored in the TTS map 60 associated with the TTS code value 333.

Once the current frame index value has been stored in the TTS map, a determination is made whether the current frame index is the last within the current band (step S180). If the current frame index is not the last within the current band, the frame index is incremented by one, i.e. FI = FI + 1 (step S182), and the process returns to step S174. On the other hand, if the current frame index is the last within the current band at step S180, then a determination is made whether the current band number is the last band within the audio fingerprint (in the example of Figures 6 and 7, whether BN = 5) (step S184). If the current band number is not the last band, the band number is incremented by one, i.e. BN = BN + 1, and the frame index reverts to zero, i.e. FI = 0 (step S186), and the process returns to step S174. On the other hand, if the current band number is the last within the current audio fingerprint, then this sub-process is finished, as this marks the end of the generation of TTS maps for that audio fingerprint 50.
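The map-generation loop described above can be sketched as follows; the dictionary-of-lists representation of a TTS map is an illustrative choice, not mandated by the text:

```python
# Hedged sketch of the TTS map generation of Figure 8: for each band, slide
# a y-bit window down that band's column of the fingerprint and record the
# frame index against the resulting TTS code value.

def build_tts_maps(fingerprint, y=9):
    """fingerprint: list of frames, each a list of 0/1 band bits.

    Returns one map per band: {tts_code_value: [frame indices]}.
    """
    n_frames = len(fingerprint)
    n_bands = len(fingerprint[0])
    maps = [{} for _ in range(n_bands)]
    for bn in range(n_bands):                       # current band number BN
        for fi in range(n_frames - y + 1):          # current frame index FI
            bits = [fingerprint[f][bn] for f in range(fi, fi + y)]
            code = int("".join(map(str, bits)), 2)  # y-bit TTS code value
            maps[bn].setdefault(code, []).append(fi)
    return maps
```

Because the same code value may recur, each map entry holds a list of frame indices, matching the behaviour described for code value 333 in the example.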

In the example of Figures 6 and 7, after the frame index value 0 has been stored with TTS code value 333 in the TTS map 60 for the first band, the TTS code 74 for the next frame, FI = 1, is extracted. This has a binary representation 010011010 (decimal 154), against which value the frame index value 1 is stored. The binary representation for FI = 2 is 100110100 (decimal 308), against which value the frame index value 2 is stored. In this example, the binary representation for FI = 6 is 101001101 (decimal 333), which is the same TTS code value as that of FI = 0. Thus this same value 333 has two frame locations stored against it.

The training processes described above with reference to Figures 1, 2 and 4 to 8 encompass the training phase of the recognition system. Although processing complexity is encountered at certain phases of the training process, the increased complexity does not affect the matching time during a recognition operation, as the relevant processes are conducted prior to matching. The main concern during the training phase is the selection of suitable audio fingerprints, as that contributes to the overall memory requirements of the system.

For each training audio file in the database, TTS maps and an audio fingerprint are created. Each TTS map stores the TTS codes of each band of an audio fingerprint. The TTS maps and the audio fingerprints make up the data of the features database 18 of the sound identifying apparatus 10 of this embodiment.

In the sound identifying apparatus 10, the TTS map may be loaded into a RAM to achieve fast searching speed. If there is a memory constraint, the audio fingerprints need not be stored in RAM.

Matching

In principle, LSSM exploits the similarity between the database file's TTS codes and the query file's TTS codes. Effectively, due to a-priori knowledge of the database audio file, a map of all possible 'correct' TTS codes is generated by concatenating the multi-band energies in time and shifting the TTS codes incrementally in time. The result is a TTS map populated with enough information to enable efficient identification of the best match location of an audio file in milliseconds. The TTS map streamlines the matching time required by the system as compared with an exhaustive search.

Matching is a quick and simple process of accessing the TTS map with the help of the extracted TTS codes of the query file. From the extracted TTS codes, matching locations are located using the TTS map. Subsequently, a similarity measure is calculated between the query file and the database audio file. This process is looped for all the other extracted TTS codes.

Finally, the audio file that yields the highest similarity measure is the matched audio file.

During the matching stage, a clip of unknown audio, the query audio, is input into the recognition system as a query file to identify that audio. Matching is defined as locating a best matching database (training) audio file by determining the similarity between the query and the audio files in the database. For this to occur, an audio fingerprint is generated for the query, termed here a query fingerprint 100. The query fingerprint 100 is extracted (in step S124 of Figure 3) by the same method as is used to extract the audio fingerprint 50. The flowchart of Figure 5 and the associated description are applicable here too, with the file being processed being the query file, rather than a database audio file. Likewise the TTS codes for the query fingerprint 100 are also determined. The query TTS codes are generated (in step S128 of Figure 3) by the same method as is used to extract the database audio TTS codes (step S174). They are extracted one by one as described below; TTS maps do not need to be generated for the query file. However, if desired, all the query TTS codes could be extracted initially before any of them is used, possibly to be stored in a table.

Prior to calculating a similarity measure between the query and the database audio files, the system determines the location at which the query best matches any database audio file. As the query may have been randomly extracted from an unknown source, the system has no idea where it should start its matching process. TTS codes are extracted from the query and possible locations for the query are determined by accessing the entries in the TTS maps.

Figure 9 is a flowchart illustrating an exemplary matching procedure for determining the best match, between the query file and audio files in the database. Figure 9 illustrates an example of the processes of steps S 126 to S 130 of Figure 3.

A current audio file number is initiated, that is j = 0 (step S202). The TTS maps and the audio fingerprint are retrieved for the audio file corresponding to the current audio file number (step S204). A file BER for the current audio file is determined (step S206), for instance as is described later. The file BER for the current database audio file is compared with the lowest file BER so far for this query (step S208). If the file BER for the current audio file is lower (for instance as would always happen if it is the first audio file being compared), then the file BER for the current audio file becomes the lowest BER (step S210). If the existing lowest file BER is lower, then that remains as the lowest file BER. The system also determines if there are any more audio files to compare, that is whether j = the last relevant number (step S212). If there are no more audio files to compare, the audio file with the current lowest file BER is set and output as the best matching audio file for that query (step S214), after which the process ends. On the other hand, if there are more audio files, the current audio file number is incremented by one, i.e. j = j + 1 (step S216), and the process returns to step S204 to retrieve features of the next audio file and determine the file BER for that next audio file.
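The loop of Figure 9 can be sketched as follows; the `compute_file_ber` callable stands in for the per-file procedure of Figure 11 and is assumed rather than defined here:

```python
# An illustrative sketch of the Figure 9 loop: a file BER is determined for
# each database audio file and the file with the lowest BER is returned as
# the best match. `compute_file_ber` is an assumed callable with the
# signature (query_fp, audio_fp, tts_maps) -> float.

def find_best_match(query_fp, database, compute_file_ber):
    """database: iterable of (file_id, audio_fingerprint, tts_maps) tuples."""
    best_id, lowest_ber = None, float("inf")
    for file_id, audio_fp, tts_maps in database:
        file_ber = compute_file_ber(query_fp, audio_fp, tts_maps)
        if file_ber < lowest_ber:            # keep the lowest file BER so far
            best_id, lowest_ber = file_id, file_ber
    return best_id, lowest_ber
```

The metadata stored during training would let the system report the title and author of the best-matching file along with its BER.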

Figure 10 is an illustration of a query fingerprint 100 and certain aspects of an exemplary method of determining possible locations against which to match a database audio file with the query file. Figure 11 is a flowchart relating to a process for determining a file BER (step S206 of Figure 9), including determining possible matching locations.

A query band number QBN, a query frame index QFI and the file BER are reset for a new audio file, that is QBN = 0, QFI = 0, file BER = max (step S222). For the current query band number and the current query frame index, a query TTS code is obtained from the query fingerprint (step S224). For audio file j, audio frame indices of the audio fingerprint having the same TTS code value as the query TTS code are identified, using the audio file TTS map for the same band (step S226). Thus if QBN = 0, the TTS map being used is also for band 0. Typically, several possible frame indices in the audio fingerprint correspond to any particular TTS code value. In the illustrated example in Figure 10, the TTS code for the first frame of the first band of the query fingerprint 100 is binary 010011010 (decimal 154). For the database audio file whose particular first band TTS map is shown in Figure 10, the TTS code value 154 is shared by frame indices 1, 13 and 300.

A determination is made (step S227) as to whether any audio fingerprint frame indices were identified for the TTS code. Assuming one or more potential frame locations have been identified, the first identified frame index from the audio fingerprint is set as the current audio match frame index (step S228). The query fingerprint is overlapped or compared with the audio fingerprint, with the current query frame index aligned with the current audio match frame index (step S230). These two frame indices have the same TTS code value as each other. A check is made as to whether the query fingerprint falls entirely within the audio fingerprint with the current alignment (step S231). If the query fingerprint falls entirely within the audio fingerprint, a bit error rate (BER) is generated from the comparison of the two fingerprints (step S232). The BER is defined as the number of mismatched bits divided by the total number of bits in the query fingerprint.
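The BER definition of step S232 can be illustrated directly in code. This is a sketch under an assumed data layout (the patent does not fix one): each fingerprint is represented as a sequence of fixed-width integer codes, one per frame.

```python
def bit_error_rate(query, audio, offset, width=32):
    """BER of `query` aligned so that its frame 0 sits at frame
    `offset` of `audio`. Both fingerprints are sequences of
    `width`-bit integers, one per frame (an assumed representation).
    Returns None when the query does not fall entirely within the
    audio fingerprint (the step S231 check)."""
    if offset < 0 or offset + len(query) > len(audio):
        return None  # query not entirely within the audio fingerprint
    # Mismatched bits: population count of the XOR of aligned frames.
    mismatched = sum(
        bin(q ^ a).count("1")
        for q, a in zip(query, audio[offset:offset + len(query)])
    )
    # Step S232: mismatched bits / total bits in the query fingerprint.
    return mismatched / (len(query) * width)
```

A BER of 0.0 indicates a bit-exact match at that alignment; 0.5 is what two unrelated random bit streams would produce on average.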

Once the BER is generated for a frame location with the shared TTS code, the system compares the just-determined, current BER with the file BER for this database audio file (step S234), that is with the lowest BER so far from the comparisons with this audio file. If the current BER is lower (as would always happen if it is the first BER determined for this particular pair of query and database audio file) then the current BER becomes the file BER (step S236). If the existing file BER is lower, then that remains as the file BER.

The system determines if there are any more locations to compare for the current query TTS code, that is whether the current audio match frame index is the last of the audio frame indices identified in step S226 (step S238). A similar check is made if step S231 determined that the query fingerprint did not fall entirely within the audio fingerprint.

If the current audio match frame index is not the last of the audio frame indices identified in step S226 and there are more locations to compare, the next of the audio frame indices identified in step S226 is set as the current audio match frame index (step S240). If the current audio match frame index is the last of the audio frame indices identified in step S226, then a check is made of whether the current query frame index is the last query frame index in the current query band (step S242). A similar check is made if step S227 determined that no frame indices had been identified in step S226. If the current query frame index is not the last query frame index in the current query band, the query frame index number is incremented by one, i.e. QFI = QFI + 1 (step S244) and the process returns to step S224. If the current query frame index is the last query frame index in the current query band, then a check is made of whether the current query band number is the last query band in the query fingerprint (step S246). If the current query band number is not the last query band number in the query fingerprint, the query band number is incremented by one, i.e. QBN = QBN + 1, and the query frame index number is reset, i.e. QFI = 0 (step S248). After this the process returns to step S224. If the current query band number is the last query band number in the query fingerprint, this part of the overall process ends. The next step in the main process of Figure 9 is step S208.

In this way, the operation proceeds for all frames and all bands of the query. For each audio fingerprint, the location that yields the lowest BER is the location at which the query best matches the current database audio file.
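The band/frame scan of Figure 11 can be sketched as the nested loop below. The data layout is an assumption for illustration: the query TTS codes are held as one list of codes per band, and each audio TTS map is a dictionary from code value to the list of audio frame indices sharing that code. The BER routine is passed in as a parameter and is expected to return None when the overlap is not complete (step S231).

```python
def best_match_ber(query_tts, audio_tts_maps, query_fp, audio_fp, ber_fn):
    """Inner matching loop of Figure 11 (sketch, steps S222-S248):
    for every query band (QBN) and frame (QFI), look up candidate
    audio frame indices sharing the TTS code, align the query there,
    and keep the lowest BER found."""
    file_ber = float("inf")                    # file BER = max (S222)
    for band, codes in enumerate(query_tts):   # loop over QBN
        band_map = audio_tts_maps[band]        # TTS map for same band
        for qfi, code in enumerate(codes):     # loop over QFI
            for frame in band_map.get(code, []):   # steps S226 / S228 / S240
                # Aligning query frame qfi with audio frame `frame`
                # means the query starts at audio frame (frame - qfi).
                ber = ber_fn(query_fp, audio_fp, frame - qfi)
                if ber is not None and ber < file_ber:
                    file_ber = ber             # step S236
    return file_ber
```

If no candidate alignment yields a complete overlap, the file BER remains at its initial maximum, matching the behaviour of step S222.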

In the above embodiment, comparison of fingerprints only occurs where the query fingerprint falls entirely within the audio fingerprint. In alternative embodiments, comparison may also occur where the query fingerprint falls partly outside the audio fingerprint (that is, it begins before the audio fingerprint or ends after the audio fingerprint). Such an operation might allow more robust processing for noisier queries. The comparison is made at the overlapping areas and is made for that part of the query fingerprint that falls outside the audio fingerprint as if the audio fingerprint existed in that area with every bit equal to zero.

In the example of Figure 10, the TTS code 154 from the first frame of the query fingerprint 100 matches the TTS code for frame indices 1, 13 and 300 of the audio fingerprint. Thus the query fingerprint 100 is first overlapped with the audio fingerprint such that frame 0 of the query fingerprint 100 is aligned with frame 1 of the audio fingerprint, and a first BER is generated. The query fingerprint 100 is next overlapped with the audio fingerprint such that frame 0 of the query fingerprint 100 is aligned with frame 13 of the audio fingerprint, and a second BER is generated. The query fingerprint 100 is next overlapped with the audio fingerprint such that frame 0 of the query fingerprint 100 is aligned with frame 300 of the audio fingerprint, and a third BER is generated.

In the matching stage as described with reference to Figures 9 to 11, the system performs a similarity measure between the query file and the database audio files by aligning the query at the locations indicated by the maps. A bit error rate (BER) is computed for every audio file. The audio file with the lowest audio file BER is the matching audio file.

In the exemplary process described with reference to Figure 11, the TTS code is extracted for every frame index of every band of the query fingerprint, and a BER is calculated for every match. This can be computationally expensive. To reduce the burden, the number of matches tested can be reduced. One way of doing this is to keep a record of every alignment made between the query fingerprint and the audio fingerprint, so that when a repeat alignment comes up, within the same band or between different bands, no BER is determined for the second or further occurrence of that alignment. Another way of reducing the processing used is to calculate the BER only for alignments where the query fingerprint falls wholly within the audio fingerprint (in a time sense). Further ways of reducing the amount of processing required might include testing only occasional query TTS codes, e.g. every second one, every third one, every fourth one, etc. Additionally or alternatively, the TTS codes might be taken from only one band, for instance the first, second or third band, etc.

As is mentioned above, the exemplary process described with reference to Figure 11 is computationally expensive. Moreover there is high processing complexity in calculating the file BER for the whole of the query for each possible location. An alternative or additional way of reducing the data processing needed is described now with reference to Figure 12. This method is a progressive thresholding method, for determining a BER for different overlap alignments of the query fingerprint and the audio fingerprint (a substitute for step S232 of Figure 11). The progressive thresholding method simplifies the calculation of the file BER by reducing the amount of unnecessary processing for BER calculations that takes place at clear non-matching locations.

The concept behind the progressive thresholding method is that, since the calculation of a BER involves the summation of all the mismatched bits in the designated region of matching, the process can be terminated prematurely if the number of mismatched bits exceeds a certain threshold within a smaller region. This threshold is termed here the progressive threshold. Figure 12 is a flowchart of a progressive threshold BER calculation procedure. The conceptual flow shown in Figure 12 is based on conducting an initial BER calculation on a smaller region of frames.

A number of frames, M, is initiated at a first predetermined level (step S252), which is less than the number of frames in the query fingerprint. The number M may be determined as a percentage of the full query, for instance 10 - 15%, or set as a fixed number, e.g. 50 frames. The frames that are used within the number of frames M do not need to include the matched frame. The more distorted the query is, the larger the number would ideally be. The number can be adaptively controlled based on a determination of distortion in the query, based on a sampling of background noise. For the current overlap position of the query fingerprint and the audio fingerprint, the BER is calculated based on just the M frames (step S254). Thus, where M is less than the total number of frames in the query fingerprint, the BER is calculated for a reduced area. The M-based BER is compared with a BER threshold (step S256). The BER threshold may also be fixed, for instance at 0.4, or adaptively controlled based on the level of distortion in the query, especially if there is no adaptive control of the frame number M. If the M-based BER is not lower than the BER threshold, this sub-process ends, proceeding to step S238 of the process of Figure 11. If the M-based BER is lower than the BER threshold, a determination is then made as to whether the number M covers less than the whole of the query fingerprint (step S258). If the number M does not cover less than the whole of the query fingerprint, then the process continues as with the rest of the operation of Figure 11 with step S234. If the number M covers less than the whole of the query fingerprint, the frame number M is incremented by the initial M (i.e. M = M + initial M) (step S260) and the process returns to step S254, where the BER is calculated for the increased number of frames.
This process continues until a BER is produced for all the frames in the query fingerprint, effectively as in step S232 of the process of Figure 11.
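The progressive thresholding of Figure 12 can be sketched as below. This is an illustration under the same assumed fingerprint layout as before (fixed-width integer codes, one per frame); the initial M of 50 frames and the 0.4 threshold are the example values given in the text, and the routine assumes the full-overlap check of step S231 has already passed.

```python
def progressive_ber(query, audio, offset, width=32,
                    init_m=50, threshold=0.4):
    """Progressive-threshold BER (Figure 12 sketch): compute the BER
    over the first M frames only; if it already meets or exceeds the
    threshold, reject this location early (returning None), otherwise
    grow M by the initial M and repeat until the whole query is
    covered, at which point the full BER is returned as in step S232."""
    total = len(query)
    m = min(init_m, total)               # step S252
    while True:
        # Step S254: BER over just the first m frames of the overlap.
        mismatched = sum(
            bin(q ^ a).count("1")
            for q, a in zip(query[:m], audio[offset:offset + m])
        )
        ber = mismatched / (m * width)
        if ber >= threshold:             # step S256: clear non-match
            return None                  # terminate prematurely
        if m == total:                   # step S258: whole query covered
            return ber
        m = min(m + init_m, total)       # step S260: M = M + initial M
```

On a clearly non-matching location, only the first M frames are ever examined, which is the source of the speed-up described above.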

The improved matching algorithm is used for every attempt at a match location and reduces the time required to perform BER calculations at locations that clearly do not match, by terminating the BER calculation prematurely. The overall effect is an increase in matching speed.

In the embodiment described with respect to Figure 12, where a match is rejected before the complete BER is calculated, the process proceeds from step S256 of Figure 12 to step S238 of Figure 11. In an alternative embodiment, where a match is rejected before the complete BER is calculated, the process proceeds from step S256 of Figure 12 to step S234 of Figure 11. The current BER still stands as the BER for the reduced comparison set, but that is not a problem in step S234, because it is still a high BER that would be bettered by any audio file for which the BER is calculated for the complete query fingerprint.

Additionally, in the embodiment described with reference to Figure 12, the number of frames M is incremented by M each time in step S260. In a further alternative embodiment, the number of frames M is incremented immediately to the full number of frames in the query fingerprint the first time through step S260.

The above approaches to determining the best matching audio file use two distinct steps for each file. Firstly, potential matching locations between the query and each database audio file are determined. Secondly, time is consumed by the calculation of some form of comparison BER for every (or almost every) potential match.

An alternative approach to calculating a BER for each potential match location is described with reference to Figures 13 to 15. Figure 13 is a flowchart of a process for determining a best match using a statistical matching process. Figure 14 is a flowchart relating to determining a best match score, for example for use in the process of Figure 13. Figure 15 is an illustration of an example of how the query fingerprint, TTS maps and a newly mentioned current count table interrelate in this alternative approach.

To increase matching speed, statistical matching is introduced. During the initial matching phase, the system locates the best matching location in every compared database audio file. A current count table records the similarity between the query and the database audio file. By storing the number of times each time location is indexed, the system selects a list of the top ten audio files. The top-ten audio files are chosen based on the magnitude of the entries in the audio index map. Finally, a BER computation is done on the top-ten audio files by aligning the query at the location stored by the audio index map. The matching audio file is identified as the audio file with the smallest BER.

Figure 13 is a flowchart of a process for determining a best match using a statistical matching process. The process of Figure 13 may be used as step S130 of the flowchart of Figure 3.

A current audio file number is initiated, that is j = 0 (step S202). The TTS maps and the audio fingerprint are retrieved for the audio file corresponding to the current audio file number (step S204). These first two steps are the same as the first two steps shown in the flowchart of Figure 9. A statistically determined best match is determined for the current database audio file number (step S272), providing a score for that database audio file. The current score is checked against the ten best scores so far (step S274) for the previously checked audio files, which, within the exemplary embodiment described below, are the ten highest scores so far. If the current score is better (e.g. higher) than any existing score within the ten best scores so far, the list of the ten best scores so far is updated (step S276), by removing the worst (e.g. lowest) of the scores within the top ten and storing the current score in the top ten. After the ten best scores so far is updated or if the current score is deemed no better than any existing score within the ten best scores so far, in step S274, a determination is made of whether all the database audio files have been checked, that is whether j = the last relevant number (step S278). If any have not yet been checked for a match, the database audio file number is incremented by one, i.e. j = j + 1 (step S280). If all the database audio files have been checked, the BER is generated for each of the ten database audio files for which an associated score is present in the final list of the ten best scores (step S282). The query fingerprint is matched to the associated audio fingerprints at the arrangement that yields the score within the ten best scores, that is at the locations of the respective audio fingerprints that produced the scores in the final list of the ten best scores.
The database audio file of these ten that produces the lowest BER is deemed to be the matching database audio file for the query and is output as the best match audio file (step S284).
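The shape of the Figure 13 process can be sketched as below. This is an illustration only: `score_fn` and `ber_fn` are hypothetical stand-ins for the statistical scoring of step S272 and the final BER computation of step S282, and the top-ten shortlist size is the example value from the text.

```python
import heapq

def statistical_best_match(query, database, score_fn, ber_fn, k=10):
    """Figure 13 sketch: score every database file statistically,
    keep the k (here ten) highest-scoring files, then compute a full
    BER only for that shortlist and return the index of the file with
    the smallest BER (steps S272-S284)."""
    # Steps S272-S280: shortlist the k files with the highest scores.
    shortlist = heapq.nlargest(
        k, range(len(database)),
        key=lambda j: score_fn(query, database[j]))
    # Steps S282-S284: full BER only on the shortlist; smallest wins.
    return min(shortlist, key=lambda j: ber_fn(query, database[j]))
```

The expensive BER computation is thus performed at most k times per query, rather than once per database file.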

The process of Figure 14 is an example of a process that corresponds to step S272 of the process of Figure 13. The approach used, in effect, matches frame 0 in the query fingerprint against a frame index in the audio fingerprint of a database audio file and determines how many matched TTS codes this results in for each frame index in the audio fingerprint.

The process of Figure 14 is initiated by resetting a query band number to zero, i.e. QBN = 0, a query frame index number to zero, i.e. QFI = 0, and resetting a file score and all frame index counts in a current count table to zero (step S302). For the current query frame index of the current query band, the TTS code in the query fingerprint is obtained (step S304). Using the TTS band map (for the same band number as the current query band number) of the database audio file being compared, the frame indices in the audio fingerprint of that database audio file that have the same TTS code are identified (step S306). A determination is made (step S307) as to whether any audio fingerprint frame indices were identified for the TTS code in step S306. Assuming there are any, the first of the identified frame indices is set as the current offset frame index (step S308). A determination is made of which frame index within the audio fingerprint the query fingerprint would have to start at, for the current query frame index to match with the current offset frame index (step S310). For the database audio file frame index within the audio fingerprint determined as the frame index at which the query fingerprint would start, in step S310, a count associated with that database audio file frame index (in a current count table) is incremented by one (step S312). A check is made as to whether the current offset frame index is the last identified frame index from step S306 (step S314). If there are one or more further identified frame indices from step S306, the next identified frame index from step S306 is set as the current offset frame index (step S316).

Once all the identified frame indices from step S306 have been processed, a determination is made as to whether the current query frame index number QFI is the number of the last query frame index (step S318). A similar check is made if step S307 determined that no frame indices had been identified in step S306. If the current query frame index number QFI is not the number of the last query frame index, the query frame index number is incremented by one, i.e. QFI = QFI + 1 (step S320), and the process returns to step S304 to obtain the TTS code for the new query frame index number. If the current query frame index number QFI is the number of the last query frame index, a determination is made as to whether the current query band number QBN is the number of the last query band (step S322). If the current query band number QBN is not the number of the last query band, the query band number is incremented by one, i.e. QBN = QBN + 1 and the query frame index number is reset to 0, i.e. QFI = 0 (step S324), and the process returns to step S304 to obtain the TTS code for the next band. Once all the bands have been processed, the highest count associated with any audio file frame index in the final current count table is set as the best match score for audio file j (step S326), and the best match score is stored together with the identity of the associated database audio file frame index. This best match score may then be compared with the ten best scores in step S274 of the process of Figure 13. In this instance the ten best scores will be the ten highest best match scores.
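The count-table voting of Figure 14 can be sketched as below, under the same assumed data layout as earlier (query TTS codes as one list per band; each audio TTS map a dictionary from code to frame indices). Discarding negative start frames follows the "not possible" case of the Figure 15 example.

```python
from collections import Counter

def best_match_score(query_tts, audio_tts_maps):
    """Statistical matching (Figure 14 sketch, steps S302-S326): for
    every query TTS code, vote for the audio frame index at which the
    query would have to start for the codes to align; the most-voted
    start frame gives this file's best match score and location."""
    counts = Counter()                         # current count table (S302)
    for band, codes in enumerate(query_tts):   # loop over QBN
        band_map = audio_tts_maps[band]
        for qfi, code in enumerate(codes):     # loop over QFI
            for frame in band_map.get(code, []):   # step S306
                start = frame - qfi            # step S310: start frame
                if start >= 0:                 # negative start: not possible
                    counts[start] += 1         # step S312
    if not counts:
        return 0, None
    start, score = counts.most_common(1)[0]    # step S326: highest count
    return score, start
```

The returned location is the alignment at which the final BER of step S282 would then be computed for this file.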

In Figure 15, the TTS code 102 for query frame 86 is extracted from the query fingerprint 100. The TTS code is binary 100110100 (decimal 308). For decimal 308 of the TTS map 60 for band 0 of the database audio file being compared, there are three entries: frame 3, frame 100 and frame 422. As the current query frame index is frame 86, the offset associated with query frame 86, relative to query frame 0 is 86 (that is 86 - 0). Thus for query frame 86 to match database audio file frame 3, query frame 0 would match with database audio file frame negative 83 (that is 3 - 86), which is not possible. The next identified frame index is frame 100. For query frame 86 to match database audio file frame 100, query frame 0 would match with database audio file frame 14 (that is frame index less the offset, i.e. 100 - 86). As such, in a current count table 104, the current count associated with database audio file frame 14 is incremented by one. In this case, the count previously associated with database audio file frame 14 was eight and so its new count is nine. Likewise for query frame 86 to match database audio file frame 422, query frame 0 would match with database audio file frame 336 (that is 422 - 86). As such, in the current count table 104, the current count associated with database audio file frame 336 is incremented by one. In this case, the count previously associated with database audio file frame 336 was twenty-four and so its new count is twenty-five. After this has been done for all query frame indices, a complete or final count table 106 is generated. In this particular case, the highest count number is 85, associated with database audio file frame 130.

For the statistical processing, as with the other methods of finding a best match, it is possible to reduce the amount of processing that occurs. For example, in the approach described with reference to the flowchart of Figure 14, the process could just rely upon a count obtained from fewer than all the bands, even a single band, such as band 0.

Another mechanism for reducing the amount of processing necessary is to classify the audio in the database, for example between the spoken word and music, with music classified into vocal or instrumental, classical or modern, etc. These categorisations can be downloaded with the music audio itself when generating the database. When a query is input for matching, pre-processing occurs to determine what type of audio it is, with similar categorisations. In this way, only audio files from the same general category need to be checked.

The apparatus and processes of the exemplary embodiments can be implemented on a computer system 400, for example as schematically shown in Figure 16. The embodiments may be implemented as software, such as a computer program being executed within the computer system 400, and instructing the computer system 400 to conduct the method of the example embodiment. The computer system 400 comprises a computer module 402, input modules such as a keyboard 404 and mouse 406 and a plurality of output devices such as a display 408, and printer 410.

The computer module 402 is connected to a computer network 412 via a suitable transceiver device 414, to enable access to e.g. the Internet or other network systems such as Local Area Network (LAN) or Wide Area Network (WAN).

The computer module 402 in the example includes a processor 418, a Random Access Memory (RAM) 420 and a Read Only Memory (ROM) 422. The computer module 402 also includes a number of input/output (I/O) interfaces, for example an I/O interface 424 to the display 408, and an I/O interface 426 to the keyboard 404.

The components of the computer module 402 typically communicate via an interconnected bus 428 and in a manner known to the person skilled in the relevant art.

The application program is typically supplied to the user of the computer system 400 encoded on a data storage medium such as a CD-ROM or floppy disc and read utilising a corresponding data storage medium drive of a data storage device 430. The application program is read and controlled in its execution by the processor 418. Intermediate storage of program data may be accomplished using the RAM 420. The application program supplied may include just the mechanism for identifying query files, without a mechanism for generating the database of audio files. That database may be supplied as a fixed database, for example on the same CD-ROM or available over the Internet or some other network.

The above description is of audio files in general. However, the embodiments are of particular use where the audio files are music files, for example songs, with the query also being of a song. The described methods provide systematic approaches to identify an audio clip from a song database efficiently, by the generation of TTS maps and audio fingerprints. The approaches also provide accurate and efficient methods of searching for the matching song from a song database. In addition, the embodiments have the ability to handle non-linear distorted queries without sacrificing search speed and storage requirements. The embodiments provide robust and efficient algorithms for an audio identification system.

The embodiments of the invention provide efficient search and identification methodologies for an audio identification system. Exemplary applications of the audio identification system include a broadcast monitoring system, a mobile retrieval system and a content monitoring system amongst others.

Broadcast monitoring system: A broadcast monitoring system has a central database server and a monitoring station. The monitoring station transmits audio fingerprints extracted from the broadcast audio content to the central server. The central server determines the identity of the song being broadcast through its large database of songs.

Mobile retrieval system: A clip of a song is recorded using a mobile telephone and transmitted through the mobile network. A song information server processes the clip and, in return, messages the result back to the user.

Content monitoring system: An audio file transferred through the Internet is first screened by an audio server. The audio server determines the audio content through audio identification rather than by filename or ID tag. With the information gathered, it matches the legal rights of the file on the distribution channel and blocks unauthorised transmission.

In the foregoing manner, a method and apparatus for identifying audio such as music are disclosed. Only several embodiments are described but it will be apparent to one skilled in the art in view of this disclosure that numerous changes and/or modifications may be made without departing from the scope of the invention.