

Title:
SOUND RESPONSIVE DEVICE AND METHOD
Document Type and Number:
WIPO Patent Application WO/2019/002417
Kind Code:
A1
Abstract:
A sound recognition method that is capable of distinguishing between real-world sounds and pre-recorded or broadcast sound by determining if the sound emanates from a designated location, such as the location of a loudspeaker, or by recognising characteristics of the sound indicating that it has been subjected to audio recording, audio broadcast and/or audio reproduction processes.

Inventors:
MOORHEAD PAUL (GB)
Application Number:
PCT/EP2018/067333
Publication Date:
January 03, 2019
Filing Date:
June 27, 2018
Assignee:
KRAYDEL LTD (GB)
International Classes:
G10L25/51; G10L17/26
Domestic Patent References:
WO1998034216A2, 1998-08-06
Foreign References:
JP2005250233A, 2005-09-15
Other References:
HANY FARID: "Detecting Digital Forgeries Using Bispectral Analysis", 1 January 1999 (1999-01-01), XP055499185, Retrieved from the Internet [retrieved on 20180813]
GRIGORAS ET AL: "Statistical Tools for Multimedia Forensics", CONFERENCE: 39TH INTERNATIONAL CONFERENCE: AUDIO FORENSICS: PRACTICES AND CHALLENGES; JUNE 2010, AES, 60 EAST 42ND STREET, ROOM 2520 NEW YORK 10165-2520, USA, 17 June 2010 (2010-06-17), XP040567050
HENNEQUIN ROMAIN ET AL: "Codec independent lossy audio compression detection", 2017 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), IEEE, 5 March 2017 (2017-03-05), pages 726 - 730, XP033258513, DOI: 10.1109/ICASSP.2017.7952251
BRIAN D'ALESSANDRO ET AL: "Mp3 bit rate quality detection through frequency spectrum analysis", MULTIMEDIA AND SECURITY, ACM, 2 PENN PLAZA, SUITE 701 NEW YORK NY 10121-0701 USA, 7 September 2009 (2009-09-07), pages 57 - 62, XP058088142, ISBN: 978-1-60558-492-8, DOI: 10.1145/1597817.1597828
Attorney, Agent or Firm:
FRKELLY (IE)
Claims:
CLAIMS:

1. A method of operating a sound responsive device comprising at least one microphone for receiving sounds from an environment, the method comprising: detecting a sound at said at least one microphone;

producing a corresponding audio signal from the or each microphone;

performing audio signal processing on the corresponding audio signal from the or each microphone;

determining from said audio signal processing if said sound is a real-world sound;

performing at least one action in response to detection of said sound only if said sound is determined to be a real-world sound.

2. The method of claim 1, wherein said determining if said sound is a real-world sound comprises determining that said sound is not a real-world sound if the corresponding audio signal from the or each microphone comprises one or more characteristic corresponding to one or more audio signal processing process.

3. The method of claim 2, wherein said determining comprises determining that said sound is not a real-world sound if the corresponding audio signal from the or each microphone comprises one or more characteristic associated with audio recording, audio broadcast and/or audio reproduction.

4. The method of claim 2 or 3 wherein said determining comprises determining that said sound is not a real-world sound if the corresponding audio signal from the or each microphone comprises one or more characteristic associated with audio encoding.

5. The method of any one of claims 2 to 4 wherein said determining comprises determining that said sound is not a real-world sound if the corresponding audio signal from the or each microphone comprises one or more characteristic associated with audio compression.

6. The method of claim 5, wherein said determining comprises determining that said sound is not a real-world sound if the corresponding audio signal from the or each microphone comprises one or more compression artefact.

7. The method of any one of claims 2 to 6 wherein said determining comprises determining that said sound is not a real-world sound if the corresponding audio signal from the or each microphone comprises one or more characteristic associated with audio rendering by an electronic amplifier and/or loudspeaker.

8. The method of any one of claims 2 to 7, wherein said audio signal processing comprises frequency analysis, and said determining comprises determining that said sound is not a real-world sound if the corresponding audio signal from the or each microphone comprises one or more frequency characteristic corresponding to one or more audio signal processing process.

9. The method of claim 8 wherein said one or more frequency characteristic comprises a spectral distribution of said audio signal.

10. The method of claim 9 wherein said one or more frequency characteristic comprises a spectral distribution of said audio signal in one or more frequency band that is common to both real-world and non real-world sounds.

11. The method of claim 10 wherein said one or more frequency band comprises the frequency band from 20Hz to 500Hz, or from 500Hz to 50kHz.

12. The method of any one of claims 8 to 11 wherein said one or more frequency characteristic comprises an absence of frequency components of said audio signal in one or more frequency bands.

13. The method of claim 12 wherein said one or more frequency characteristic comprises an absence of frequency components of said audio signal in a low frequency band, for example below 500Hz.

14. The method of claim 12 or 13 wherein said one or more frequency characteristic comprises an absence of frequency components of said audio signal in a high frequency band, for example above 10kHz.

15. The method of any one of claims 2 to 14 wherein said one or more characteristic comprises one or more bit rate characteristic.

16. The method of claim 15 wherein said one or more bit rate characteristic comprises a change in bit rate.

17. The method of claim 15 or 16 wherein said one or more bit rate characteristic comprises use of different bit rates for different frequency bands of the audio signal.

18. The method of claim 17 wherein said one or more bit rate characteristic comprises use of a relatively low bit rate for a relatively low frequency band, for example below 500Hz.

19. The method of claim 17 or 18 wherein said one or more bit rate characteristic comprises use of a relatively low bit rate for a relatively high frequency band, for example above 10kHz.

20. The method of claim 15 wherein said one or more bit rate characteristic comprises a change in bit rate, in particular a reduction of the bit rate, after a high intensity signal event.

21. The method of any one of claims 2 to 20, wherein said one or more characteristic comprises noise floor level.

22. The method of claim 21, wherein said one or more characteristic comprises the noise floor level being above a threshold level.

23. The method of any preceding claim wherein said determining comprises determining if said sound was rendered by a loudspeaker.

24. The method of any preceding claim wherein said determining from said audio signal processing if said sound is a real-world sound comprises comparing the audio signal from the or each microphone against at least one reference template, and determining that said sound is not a real world sound if said audio signal matches said at least one reference template.

25. The method of claim 24 wherein the or each template is a transfer function template, for example a transfer function template corresponding to any one of an audio recording process, an audio broadcast process and/or an audio reproduction process.

26. The method of any one of claims 2 to 25 wherein said one or more characteristics are derived empirically from training data.

27. The method of claim 26 wherein said training data comprises data representing pairs of non-processed and corresponding processed sound samples.

28. The method of claim 26 or 27 wherein said one or more characteristics are derived from said training data by machine-learning.

29. The method of any preceding claim wherein said determining if said sound is a real-world sound comprises determining that said sound is not a real-world sound if it emanated from any one of at least one designated location in said environment.

30. A sound responsive device comprising at least one microphone for receiving sounds from an environment, the device further comprising audio signal processing means configured to perform audio signal processing on audio signals produced by the or each microphone to determine if said sound is a real-world sound, the device being configured to perform at least one action in response to detection of said sound only if said sound is determined to be a real-world sound.

Description:
Sound Responsive Device and Method

Field of the Invention

The present invention relates to sound responsive devices. In particular the invention relates to electronic devices for responding to real-world sounds.

Background to the Invention

Electronic devices that understand and respond to spoken commands are becoming common, but issues are frequently encountered where the devices mistake audio from a TV or radio as sound from a live source.

There are also devices that attempt to classify noises in the home and respond appropriately including, for example, recognising gunfire, breaking glass, shouts etc., or even identifying coughs, sneezes, doorbells or telephones. Again, these devices may undesirably treat similar sounds from a TV program as being "real".

Summary of the Invention

A first aspect of the invention provides a method of operating a sound responsive device comprising at least one microphone for receiving sounds from an environment, the method comprising: detecting a sound at said at least one microphone;

producing a corresponding audio signal from the or each microphone;

performing audio signal processing on the corresponding audio signal from the or each microphone;

determining from said audio signal processing if said sound is a real-world sound;

performing at least one action in response to detection of said sound only if said sound is determined to be a real-world sound.

A second aspect of the invention provides a sound responsive device comprising at least one microphone for receiving sounds from an environment, the device further comprising audio signal processing means configured to perform audio signal processing on audio signals produced by the or each microphone to determine if said sound is a real-world sound, the device being configured to perform at least one action in response to detection of said sound only if said sound is determined to be a real-world sound.

Preferably determining if said sound is a real-world sound comprises determining that said sound is not a real-world sound if the corresponding audio signal from the or each microphone comprises one or more characteristic corresponding to one or more audio signal processing process. Said determining typically comprises determining that said sound is not a real-world sound if the corresponding audio signal from the or each microphone comprises one or more characteristic associated with audio recording, audio broadcast and/or audio reproduction.

Said determining may involve determining that said sound is not a real-world sound if the corresponding audio signal from the or each microphone comprises one or more characteristic associated with audio encoding. Said determining may involve determining that said sound is not a real-world sound if the corresponding audio signal from the or each microphone comprises one or more characteristic associated with audio compression.

Said determining may involve determining that said sound is not a real-world sound if the corresponding audio signal from the or each microphone comprises one or more compression artefact.

Said determining may involve determining that said sound is not a real-world sound if the corresponding audio signal from the or each microphone comprises one or more characteristic associated with audio rendering by an electronic amplifier and/or loudspeaker.

Said audio signal processing may comprise frequency analysis, and said determining may involve determining that said sound is not a real-world sound if the corresponding audio signal from the or each microphone comprises one or more frequency characteristic corresponding to one or more audio signal processing process. Said one or more frequency characteristic may comprise a spectral distribution of said audio signal. Said one or more frequency characteristic may comprise a spectral distribution of said audio signal in one or more frequency band that is common to both real-world and non real-world sounds. Said one or more frequency characteristic may comprise an absence of frequency components of said audio signal in one or more frequency bands. Said one or more frequency characteristic may comprise an absence of frequency components of said audio signal in a low frequency band, for example below 500Hz. Said one or more frequency characteristic may comprise an absence of frequency components of said audio signal in a high frequency band, for example above 10kHz. Said one or more characteristic may comprise one or more bit rate characteristic. Said one or more bit rate characteristic may comprise a change in bit rate. Said one or more bit rate characteristic may comprise use of different bit rates for different frequency bands of the audio signal. Said one or more bit rate characteristic may comprise use of a relatively low bit rate for a relatively low frequency band, for example below 500Hz. Said one or more bit rate characteristic may comprise use of a relatively low bit rate for a relatively high frequency band, for example above 10kHz. Said one or more bit rate characteristic may comprise a change in bit rate, in particular a reduction of the bit rate, after a high intensity signal event.

Said one or more characteristic may comprise noise floor level. Said one or more characteristic may comprise the noise floor level being above a threshold level.

Said determining may comprise determining if said sound was rendered by a loudspeaker.

Said determining from said audio signal processing if said sound is a real-world sound may involve comparing the audio signal from the or each microphone against at least one reference template, and determining that said sound is not a real world sound if said audio signal matches said at least one reference template. The or each template may comprise a transfer function template, for example a transfer function template corresponding to any one of an audio recording process, an audio broadcast process and/or an audio reproduction process.

Said one or more characteristics may be derived empirically from training data.

Said training data may comprise data representing pairs of non-processed and corresponding processed sound samples.

Said one or more characteristics may be derived from said training data by machine-learning.

Said determining if said sound is a real-world sound may comprise determining that said sound is not a real-world sound if it emanated from any one of at least one designated location in said environment.

Preferred embodiments employ either one or both of the following approaches to overcome the problem outlined above:

1) Recognition by spatial localisation of sound sources

2) Recognising characteristics of sound that indicate that the sound has been subjected to one or more processes, in particular processes associated with audio recording, audio broadcast and/or audio reproduction.

Preferred embodiments of the invention are capable of distinguishing between real-world sounds and pre-recorded or broadcast sound.

Further advantageous features of the invention will be apparent to those ordinarily skilled in the art upon review of the following description of a specific embodiment and with reference to the accompanying drawings.

Brief Description of the Drawings

An embodiment of the invention is now described by way of example and with reference to the accompanying drawings, in which:

Figure 1 is a schematic diagram of a room in which a sound responsive device embodying one aspect of the invention is installed;

Figure 2 is a block diagram of the sound responsive device of Figure 1; and

Figure 3 is a flow diagram illustrating a preferred operation of the device of Figure 1.

Detailed Description of the Drawings

Referring now to Figure 1 of the drawings there is shown a sound responsive device 10 embodying one aspect of the invention. The device 10 is shown installed in a room 12. In the illustrated example the room 12 is a typical living room but this is not limiting to the invention. At least one, but more typically a plurality of, loudspeakers 14 are provided in the room 12. The loudspeakers 14 may be part of, or connected to (via wired or wireless connection), one or more electronic device (e.g. a television, radio, audio player, media player, computer, smart speaker) that is capable of providing audio signals to the loudspeakers 14 for rendering to listeners (not shown) in the room. In Figure 1, a television 16 is shown as an example of such an electronic device. Each of the loudspeakers 14 shown in Figure 1 may for example be connected to the TV 16. More generally, the room may contain one or more electronic device connected to, or including, one or more loudspeakers 14. Ideally, the loudspeakers 14 occupy a fixed position in the room 12, or at least a position that does not change frequently. In typical embodiments, the loudspeakers 14 are not part of the sound responsive device 10, although the sound responsive device 10 may have one or more loudspeakers (not shown) of its own. Advantageously the sound responsive device 10 is connectable (by wired or wireless connection) to one or more of the loudspeakers 14.

The sound responsive device 10 may comprise any electronic apparatus or system (not illustrated) that supports speech and/or sound recognition as part of its overall functionality. For example the system/apparatus may comprise a smart speaker, or a voice-controlled TV, audio player, media player or computing device, or a monitoring system that detects sounds in its environment and responds accordingly (e.g. issues an alarm or operates itself or some other equipment accordingly, or takes any other responsive action(s)). The nature of the action(s) taken by the device 10 in response to detecting a sound depends on the overall functionality of the device 10 and may also depend on the type of the detected sound. Accordingly, the device 10 is typically configured to perform classification of received sounds. This may be achieved using any conventional speech recognition and/or sound recognition techniques. The device 10 may be configured to take one or more action only in response to sounds that it recognises as being of a known type as determined by the classification process. The device 10 may be configured to monitor the status of its environment depending on the detected recognised sounds (without necessarily taking action, or taking action depending on the determined status). The device 10 typically includes a controller 11 for controlling the overall operation of the device 10. The controller 11 may comprise any suitably configured or programmed processor(s), for example a microprocessor, microcontroller or multi-core processor. Typically the controller 11 causes the device 10 to take whichever action(s) are required in response to detection of recognised sounds. The controller 11 may also perform the sound classification or control the operation of a sound classification module as is convenient.
Typically the device 10 is implemented using a multi-core processor running a plurality of processes, one of which may be designated as the controller and the others performing the other tasks described herein as required. Each process may be performed in software, hardware or a combination of software and hardware as is convenient. One or more hardware digital signal processors may be provided to perform one or more of the processes as is convenient and applicable.

Advantageously, the device 10 is capable of distinguishing between real-world sounds and non-real-world sounds. In this context a real-world sound is a sound that is created, usually spontaneously, in the environment (which in this example comprises the room 12) in which the device 10 is located by a person, object or event in real time. As such, real-world sounds typically comprise sounds that have not been processed by any audio signal processing technique and/or that are not pre-recorded. Real-world sounds may also be said to comprise sounds that have not been rendered by a loudspeaker. Examples include live human and animal utterances, including live speech and other noises, crashes, bangs, alarms, bells and so on. In the present context therefore real-world sounds may be referred to as non-processed sounds, or sounds not emanating from a loudspeaker.

Non real-world sounds are typically sounds that have been processed by one or more audio signal processing technique, and may comprise pre-recorded or broadcast sounds. Non real-world sounds are usually rendered by a loudspeaker. Examples include sounds emanating from a TV, radio, audio or media player and so on. Non real-world sounds may be referred to as processed sounds or sounds emanating from a loudspeaker.

Advantageously, the device 10 is capable of distinguishing between real-world sounds and non-real-world sounds even if the sounds are of the same type, e.g. distinguishing between live speech or other sounds (e.g. coughs, sneezes or shouts) emanating from a person in the environment and recorded speech or other sounds (e.g. coughs, sneezes or shouts) emanating from a TV or media player.

In preferred embodiments the device 10 is configured to employ either one or both of the following methods to achieve the above aim:

1) Recognition of sounds by spatial localisation of sound sources

2) Recognising characteristics of sound that indicate that the sound has been subjected to one or more processes, in particular processes associated with audio recording, audio broadcast and/or audio reproduction, e.g. encoding, decoding, compression, decompression and/or rendering (or reproduction) via electronic amplifier and/or loudspeaker.

Either or both of the above techniques may be used by the device 10 to determine if a detected sound is a real-world sound or a non-real-world sound. In preferred embodiments, the device 10 is configured to respond only to sounds that it has determined to be real-world sounds.

Figure 2 is a block diagram of a typical embodiment of the sound responsive device 10. The device 10 comprises at least one microphone 18. Typical embodiments include two or more (4 or more is preferred) microphones 18 to facilitate determining the location of sound sources. The device 10 comprises an audio signal processor 20 for receiving and processing audio signals produced by the microphones 18 in response to detecting sounds in the room 12 or other environment. The audio signal processor 20 may take any convenient conventional form, being implemented in hardware, software or a combination of hardware and software. Accordingly, the audio signal processor 20 may be implemented by one or more suitably configured ASIC, FPGA or other integrated circuit, and/or a computing device with suitably programmed microprocessor(s). In preferred embodiments the audio signal processor 20 may be configured to perform any one or more of the following audio signal processing functions: frequency spectrum analysis; compression artefact detection; and/or location analysis. The audio signal processor 20 includes components or other means for performing the relevant audio signal processing functions, as indicated at 22, 24 and 26 in the example of Figure 2. Optionally, the audio signal processor 20 may be configured to perform classification of detected sounds using any conventional sound and/or speech recognition techniques.

Location analysis involves identifying one or more locations in the environment corresponding to the source of detected sounds, i.e. spatial localisation of sound sources within the environment. In the present example, this involves determining the location of the loudspeakers 14.

In preferred embodiments where the device 10 has two or more microphones 18, any one or more of several known techniques may be used to locate the source of a sound in space with accuracy, for example: using differential arrival times (phase difference) at each microphone; and/or using the difference in volume level at each microphone (optionally amplified by the use of highly directionally sensitive microphones).
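The differential-arrival-time approach can be illustrated as follows. This is a minimal sketch, not the patent's implementation: it assumes a two-microphone pair at a known spacing, a far-field source, and estimates the inter-microphone delay from the peak of the cross-correlation; the names `SPEED_OF_SOUND`, `estimate_delay` and `bearing_from_delay` are illustrative.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s in air at roughly room temperature

def estimate_delay(sig_a, sig_b, sample_rate):
    """Estimate the arrival-time difference (seconds) of a sound at two
    microphones from the peak of their cross-correlation. A positive
    result means sig_a lags sig_b."""
    corr = np.correlate(sig_a, sig_b, mode="full")
    lag = np.argmax(corr) - (len(sig_b) - 1)  # lag in samples
    return lag / sample_rate

def bearing_from_delay(delay, mic_spacing):
    """Convert a delay to a bearing (radians from broadside) for a
    microphone pair separated by mic_spacing metres, using the
    far-field plane-wave approximation."""
    s = np.clip(delay * SPEED_OF_SOUND / mic_spacing, -1.0, 1.0)
    return float(np.arcsin(s))
```

With more than two microphones, delays for several pairs can be intersected to localise the source in two or three dimensions rather than just a bearing.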

The preferred device 10 is operable in a training mode in which it learns the location of one or more non-real-world sound source in its environment. In the present example this involves determining the location of the loudspeakers 14. In the training mode, the device 10 detects sounds using the microphones 18 (or at least two of them) and performs location analysis on the output signals of the microphones 18 to determine the location of one or more loudspeaker or other sound source. Preferably, in the training mode each loudspeaker 14 or other sound source is operated individually (i.e. one at a time) to produce sound for detection by the device 10. Alternatively, two or more loudspeakers 14 or other sound sources may be operated simultaneously in the training mode (for example where two or more loudspeakers 14 are driven by the same TV or other electronic device). In the training mode, the loudspeakers 14 or other sound source may be operated to produce sounds that they would produce during normal operation, or may be operated to produce one or more test sounds. In preferred embodiments, the device 10 is connectable (by wired or wireless connection as is convenient) to one or more of the sound producing devices (e.g. TV, radio, media player or other device having or being connected to one or more loudspeaker 14) in the environment in order to cause them to generate the sounds during the training mode. Advantageously, the device 10 uses test sounds for this purpose and may store test signals for sending to the sound producing devices for this purpose. For example the test signals may include full 5.1 or 7.1 sound signals to deal with environments with cinema-like loudspeaker installations. The preferred device 10 is also operable in a listening mode in which it detects real-world sounds in the environment and may take one or more actions in response to detecting a real-world sound. 
The nature of the actions may depend on a wider functionality of the device 10, or of a system or apparatus of which the device 10 is part. The actions may comprise generating one or more output, for example an audio and/or visual output, and/or one or more output signal for operating one or more other device to which the device 10 is connected or of which it is part. For example the device 10 may be connected to (or be integrated with) a TV or other electronic device and may operate the TV/electronic device depending on one or more detected sounds. The device 10 may be configured to take different actions depending on what sounds are detected. The device 10 itself may be provided with one or more output device (e.g. a loudspeaker, lamp, video screen, klaxon, buzzer or other alarm device or telecommunications device), which it may operate depending on what sounds are detected.
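The training/listening behaviour described above can be sketched in simplified form. This is a toy illustration under stated assumptions, not the device's actual logic: it reduces each learned loudspeaker position to a single bearing, and the class name `LocationFilter` and the angular tolerance are hypothetical.

```python
class LocationFilter:
    """Remember bearings learned for loudspeakers during a training
    mode, then flag later sounds arriving from (near) a remembered
    bearing as likely non-real-world."""

    def __init__(self, tolerance_rad=0.1):
        # Assumed tuning: how close a bearing must be to a learned one.
        self.tolerance = tolerance_rad
        self.speaker_bearings = []

    def train(self, bearing):
        # Called once per loudspeaker while it plays a test sound.
        self.speaker_bearings.append(bearing)

    def is_from_speaker(self, bearing):
        # Listening mode: True if the sound comes from a learned location.
        return any(abs(bearing - b) <= self.tolerance
                   for b in self.speaker_bearings)
```

A real implementation would store full 2-D or 3-D positions and would need re-training when loudspeakers are moved, which is one of the limitations discussed below.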

Advantageously, the device 10, upon determining that a detected sound is not a real-world sound, can ignore the detected sound, e.g. take no action in response to the detected sound. Optionally, the device 10 may be configured to take one or more actions in response to detecting non-real-world sounds. Typically such actions are different from those taken in response to detected real-world sounds. In embodiments where the device 10 is configured to classify detected sounds according to multiple sound types (e.g. speech, bangs, doorbells, telephone rings and so on), the device 10 may be configured to take different action (including no action) for real-world sounds and non-real-world sounds even if the sounds are of the same type.

Limitations to the sound source localisation technique include: localising portable devices such as radios or wireless speakers, which may be moved regularly; incorrectly ignoring sounds from a person positioned close to one of the locations the device 10 has determined should be ignored; and locating sound sources that are close to the device 10 (e.g. speakers built into a TV set on which the device 10 is located). Such limitations can be mitigated by determining whether or not a detected sound has one or more characteristic indicating that it has been subjected to one or more processes, in particular processes associated with audio recording, audio broadcast and/or audio reproduction, e.g. encoding, decoding, compression and/or rendering via electronic amplifier and/or loudspeaker, rather than being a non-processed, or raw, real-world sound. This analysis can be achieved by performing audio signal processing of the output signals produced by at least one of the microphones 18 when a sound is detected. Analysis of detected sounds to differentiate between processed and non-processed sounds (and therefore between non-real-world and real-world sounds) can be performed in addition to, or instead of, the spatial localisation of sounds described above.

For example, sound broadcast via TV or radio, sounds produced from a CD, DVD or Blu-ray disc, or streamed media sounds have been subjected to one or more audio processes, including any one or more of the following:

A. Encoding

Almost all recorded and/or broadcast sound (barring analogue vinyl records and magnetic tape played directly through an amplifier) has gone through an encoding process. While this can involve high sampling rates and very high-fidelity capture of the original analogue waveform, it will in almost all cases have been subject to a process of band-pass filtering where sounds at a frequency above or below "normal" hearing ranges have been removed (usually from 20Hz to 20kHz). So, although sound encoded at the sampling rate of a CD or higher is often referred to as "lossless", in practice not all of the original information is present and inaudible frequencies and harmonics will be missing.
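One way to look for this band-limiting is to measure how much spectral energy survives above the usual 20 kHz encoding cut-off. The following is a sketch, not the patent's method: it assumes the microphones capture at a sample rate well above 40 kHz (so content above 20 kHz is observable at all), and the function names and the detection threshold are illustrative assumptions.

```python
import numpy as np

def energy_fraction_above(signal, sample_rate, cutoff_hz=20_000.0):
    """Fraction of total spectral energy above cutoff_hz. A processed
    (band-limited) sound typically shows almost none, whereas a live
    source captured at a high sample rate may retain some."""
    spectrum = np.abs(np.fft.rfft(signal)) ** 2
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    total = spectrum.sum()
    if total == 0:
        return 0.0
    return float(spectrum[freqs > cutoff_hz].sum() / total)

def looks_band_limited(signal, sample_rate, threshold=1e-4):
    """Flag a signal as likely processed if essentially no energy
    survives above the audible band (threshold is an assumed tuning)."""
    return energy_fraction_above(signal, sample_rate) < threshold
```

The same check can be applied at the low end (e.g. below 20 Hz) by comparing against a lower cutoff instead.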

B. Compression

For broadcast, recording and/or reproduction, audio signals will usually have undergone some form of audio compression, e.g. dynamic range compression. There are lossless forms of compression which can be restored to the full original encoding, but in practice audio signals tend to go through a lossy compression process using a codec (coder-decoder) which removes some of the audio information. For example, codecs commonly use a psychoacoustic technique that relies on knowledge of how humans perceive sound. Psychoacoustic codecs compress the sound by removing parts of the sound that humans do not pay attention to, and/or devoting fewer bits of the data stream to capturing parts of the signal which are less important to the human experience than the others. So, for example, a codec might:

1) divide up the audio signal into multiple frequency bands and devote fewer bits of the compressed encoding to the highest or lowest frequency bands where the human ear/brain is less discerning and more to the range in which normal speech occurs;

2) devote fewer bits to the sound immediately after a loud noise, during which time it is known that the brain is paying less attention;

3) devote fewer bits to frequency ranges with less acoustic energy in the signal - louder sounds are known to mask quieter sounds in human perception; and/or

4) further remove the highest and lowest frequency sounds, i.e. be more aggressive in removing those frequencies which few people can hear - especially as they get older.

Not all audio codecs make use of psycho-acoustics to an appreciable degree, e.g. the popular aptX (trade mark) codec provided by Qualcomm. In such cases other techniques such as "dithering" are used to mask the audibly unpleasant artefacts of the compression process, and that in turn raises the noise floor of the signal, which can be detected as an artefact in the audio signal. Hence, compression of an audio signal can lead to the presence of detectable artefacts in the signal that are not necessarily the result of psycho-acoustic compression techniques.
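The raised noise floor mentioned above is one of the simpler artefacts to test for. The sketch below is an illustration under stated assumptions, not the device's implementation: it estimates the noise floor as a low percentile of per-frame RMS, and the frame length, percentile and -60 dB threshold are all assumed tunings with illustrative names.

```python
import numpy as np

def noise_floor_db(signal, frame_len=1024, percentile=10):
    """Estimate the noise floor as a low percentile of per-frame RMS,
    in dB relative to full scale (signal assumed scaled to [-1, 1])."""
    n_frames = len(signal) // frame_len
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    floor = np.percentile(rms, percentile)
    return 20 * np.log10(max(float(floor), 1e-12))

def elevated_noise_floor(signal, threshold_db=-60.0):
    """Flag as likely processed if the noise floor exceeds an assumed
    threshold (e.g. raised by dithering, as described above)."""
    return noise_floor_db(signal) > threshold_db
```

Using a low percentile rather than the minimum makes the estimate robust to a few genuinely silent frames while still reflecting the quietest sustained level in the recording.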

C. Reproduction

When an encoded signal is decoded back to renderable sound, an amplifier generates a varying voltage/current to drive a loudspeaker. In practice, both the amplifier's electrical characteristics and the loudspeaker's mechanical characteristics leave an imprint on the sound being produced, often referred to as the "transfer function". In most cases loudspeakers associated with a TV have a limited frequency response, so yet more of the high and low frequencies will be lost.
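The loudspeaker imprint can be sketched as a convolution with an impulse response. The following is an illustration only: the FIR low-pass filter is a made-up stand-in for a small TV speaker's response, and the 8 kHz cutoff is an assumption, not a value from the source.

```python
import numpy as np

fs = 48_000
rng = np.random.default_rng(0)
source = rng.standard_normal(fs)  # broadband "real-world" source, 1 s

# Hypothetical band-limited loudspeaker impulse response h(t):
# a windowed-sinc low-pass FIR standing in for a small TV speaker.
n = np.arange(-64, 65)
cutoff = 8_000  # Hz; small TV speakers roll off well below 20 kHz
h = np.sinc(2 * cutoff / fs * n) * np.hamming(len(n))
h /= h.sum()

played = np.convolve(source, h, mode="same")  # y(t) = sig(t) * h(t)

def band_energy(x, lo, hi):
    spec = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(len(x), 1 / fs)
    return spec[(freqs >= lo) & (freqs < hi)].sum()

# The reproduced sound has lost most of its energy above the cutoff.
ratio_src = band_energy(source, 12_000, 20_000) / band_energy(source, 100, 8_000)
ratio_out = band_energy(played, 12_000, 20_000) / band_energy(played, 100, 8_000)
print(ratio_src, ratio_out)
```

The collapse of the high-band/low-band energy ratio after reproduction is exactly the kind of imprint the detector described in the text can exploit.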

In a typical broadcast chain, the audio signal is likely to undergo encoding, compression, decoding and decompression at least once, and often more than once, as it passes through the various network links from initial recording, to studio, to transmitter. Different codecs may be used at different stages, so the end result may bear traces of more than one kind of processing.

As a result of any one or more of the above (and/or other) processes, processed sounds commonly have one or more characteristics that non-processed real-world sounds do not have, and vice versa. For example, non-processed real-world sounds tend to include audio signal components at higher and/or lower frequencies than processed sounds such as those emanating from a television or audio system. Also, non-processed real-world sounds tend to have less inherent background noise than processed signals. Further, non-processed real-world sounds tend to have a more natural spread of frequency components than processed sounds. Hence the spectral distribution (which may be referred to as the spectral power distribution) of the or each audio signal representing a detected sound can provide an indication of whether the sound is a real-world sound or not.
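One possible numerical proxy for the "natural spread of frequency components" mentioned above is spectral flatness. The sketch below is illustrative only: the choice of flatness as the measure, and the stand-in signals, are assumptions rather than part of the disclosed method.

```python
import numpy as np

def spectral_flatness(x):
    """Geometric mean over arithmetic mean of the power spectrum (0..1):
    high for broadband 'natural' signals, near zero for tonal or
    heavily band-limited material."""
    p = np.abs(np.fft.rfft(x)) ** 2 + 1e-20
    return float(np.exp(np.mean(np.log(p))) / np.mean(p))

rng = np.random.default_rng(1)
fs = 16_000
t = np.arange(fs) / fs

broadband = rng.standard_normal(fs)   # stands in for a real-world sound
tonal = np.sin(2 * np.pi * 440 * t)   # stands in for narrow, processed content

print(spectral_flatness(broadband))   # high: broad, natural spread
print(spectral_flatness(tonal))       # near zero: energy in one component
```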

Even within frequency band(s) that are common to both processed and non-processed sounds, the frequency distribution, i.e. the distribution of the frequency components of the audio signal, and other characteristics of processed sound are detectably different from those of real-world non-processed sounds. Some of these characteristics are complex, e.g. changes in the bitrate of encoding (e.g. a lower bit rate after a loud noise, or for very high or low frequencies), and introduce identifiable artefacts into the processed audio signals. A processed audio signal may include detectable artefacts arising from any one or more of the processes described above. Accordingly, any sound (and more particularly any corresponding audio signal representing the sound) detected by the device 10 may be analysed in respect of any one or more signal characteristics in order to identify it as a processed sound or a non-processed sound. The relevant characteristics include, but are not limited to:

i. the frequency content of the audio signal, in particular the presence or absence of signal components in one or more frequency bands, especially a high frequency band (e.g. above 20kHz or above 500kHz) and/or a low frequency band (e.g. below 20Hz or below 50Hz);

ii. the spectral distribution of the audio signal, especially within one or more frequency bands, e.g. between 20Hz and 500kHz, or between 500Hz and 2kHz, or from 500Hz to 50kHz (or another frequency range, e.g. a frequency range deemed to correspond with the human voice);

iii. the bitrate of the audio signal, including the absolute bitrate and/or changes in bitrate. For example this may involve detecting different bitrates being used for different frequency components (in particular relatively low bit rates being used for high (e.g. >15kHz) and/or low (e.g. <500Hz) frequency bands), and/or relatively low bitrates being used after a signal event such as a loud noise (which may be referred to as a high intensity signal event);

iv. the noise floor level of the audio signal, in particular a relatively high noise floor level (e.g. above a threshold value that can be determined from reference data).
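Characteristic iv can be estimated in a simple way by taking a low percentile of per-frame signal levels. The sketch below is an illustration under stated assumptions: the frame size, the 10th percentile, the -50 dB threshold and the noise amplitude are all invented values, not part of the disclosure.

```python
import numpy as np

def noise_floor_db(x, frame=1024):
    """Estimate the noise floor as a low percentile of per-frame RMS, in dB.
    Frame size and the 10th percentile are assumptions, not source values."""
    nframes = len(x) // frame
    frames = x[: nframes * frame].reshape(nframes, frame)
    rms = np.sqrt(np.mean(frames ** 2, axis=1)) + 1e-12
    return float(20 * np.log10(np.percentile(rms, 10)))

rng = np.random.default_rng(2)
fs = 16_000
t = np.arange(2 * fs) / fs
clean = np.sin(2 * np.pi * 200 * t) * (t < 1.0)  # a sound followed by silence

# "Processed" copy: added broadband noise models a raised noise floor,
# e.g. from dithering during compression.
processed = clean + 0.01 * rng.standard_normal(len(t))

THRESHOLD_DB = -50.0  # illustrative reference threshold
print(noise_floor_db(clean), noise_floor_db(processed))
```

The processed copy sits above the reference threshold while the clean signal sits far below it, which is the comparison characteristic iv describes.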

To make efficient use of computational resources, it is preferred to trigger the sound analysis once a minimum sound level and/or duration has been reached. For example, a rolling window of sound (e.g. of up to a few seconds) may be captured continuously from each microphone 18, and once the trigger condition(s) has been met a sound segment of defined duration, commencing with the trigger sound, may be put into a queue for analysis. Any convenient discard technique, e.g. early, random or other discard, may be employed if the queue grows beyond acceptable limits.
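The rolling window, trigger and bounded queue can be sketched as follows. All constants are assumptions, and as a simplification the queued segment here ends at the trigger block, whereas the text has it commence with the trigger sound and run on.

```python
import numpy as np
from collections import deque

FS = 16_000
WINDOW = 2 * FS       # rolling window of about 2 s (assumed value)
SEGMENT = FS          # 1 s analysis segment per trigger (assumed value)
TRIGGER_RMS = 0.1     # assumed trigger intensity
MAX_QUEUE = 8         # assumed queue bound before discarding

ring = deque(maxlen=WINDOW)   # continuously updated rolling window
queue = deque()               # sound segments awaiting analysis

def feed(block):
    """Feed a block of samples; enqueue a segment when the trigger fires."""
    ring.extend(block)
    rms = float(np.sqrt(np.mean(np.square(block))))
    if rms >= TRIGGER_RMS and len(ring) >= SEGMENT:
        segment = np.array(ring)[-SEGMENT:]
        if len(queue) >= MAX_QUEUE:
            queue.popleft()   # simple early-discard policy when full
        queue.append(segment)

rng = np.random.default_rng(3)
feed(0.01 * rng.standard_normal(FS))       # 1 s of quiet: no trigger
feed(0.5 * rng.standard_normal(FS // 10))  # loud burst: one segment queued
print(len(queue))
```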

Figure 3 shows a preferred operation of the device 10 in the listening mode. In step 301 the device 10 captures a sample of detected sound from the output of one or more microphones 18 in response to the trigger condition(s) being met. In step 302 the device 10 performs location analysis on the detected sound as described above. This may involve determining the location of the sound's source using the phase difference between corresponding signals captured from at least two microphones 18 and/or the sound intensity difference between corresponding signals captured from at least two microphones 18, and may depend on the directional sensitivity of the or each relevant microphone 18. Sounds that are determined as having emanated from the location of a known loudspeaker 14 (as determined during the training mode) can be rejected, i.e. ignored, or marked as being suspected of being a non-real-world sound.
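The phase-difference part of step 302 amounts to estimating a time difference of arrival between microphones. A minimal cross-correlation sketch is shown below; the microphone delay values and the comparison against a "known loudspeaker" delay are invented for illustration.

```python
import numpy as np

FS = 48_000  # sample rate (assumed)

def lag_samples(later, earlier):
    """Number of samples by which `later` lags `earlier`,
    found at the cross-correlation peak."""
    corr = np.correlate(later, earlier, mode="full")
    return int(np.argmax(corr)) - (len(earlier) - 1)

rng = np.random.default_rng(4)
src = rng.standard_normal(FS // 10)  # broadband source burst

delay = 7  # samples: the source is closer to mic A than to mic B
mic_a = src
mic_b = np.concatenate([np.zeros(delay), src[:-delay]])

d = lag_samples(mic_b, mic_a)
# Compare against the delay recorded for the known loudspeaker
# during the training mode (value assumed for illustration).
KNOWN_SPEAKER_DELAY = 7
is_from_speaker = abs(d - KNOWN_SPEAKER_DELAY) <= 1
print(d, is_from_speaker)
```

A real implementation would convert the lag into a bearing using the microphone spacing and the speed of sound; the comparison against stored training-mode delays is the essence of the location check.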

In step 303 the device 10 performs transfer function, or frequency spectrum, analysis of the detected sound to identify one or more frequency characteristics that are indicative of it being either a real-world sound or a non-real-world sound. Typically this involves determining that the sound is a processed, or non-real-world, sound if it lacks high and/or low frequency components that are commonly removed by audio encoding and/or by rendering via an amplifier and/or loudspeaker. Sounds that are determined as having been processed can be rejected, i.e. ignored, or marked as being suspected of being a non-real-world sound. Alternatively or in addition, the transfer function analysis may involve comparing the sound sample (conveniently a transfer function representing the sound sample) against one or more transfer function templates associated with audio recording, audio broadcast and/or audio reproduction. Any audio playback system has a transfer response h(t) and a corresponding frequency domain response H(s). Playing the audio source signal sig(t) through the system convolves sig(t) with h(t), or, in a frequency domain representation, multiplies SIG(s) (the frequency domain representation of sig(t)) by H(s). For a given transient signal that has sufficient bandwidth across the region of H(s) where there is maximal variability, it is possible to recover an estimate of the multiplicative envelope of H(s) through parameter fitting, producing an estimate, with some measure of certainty, of whether the source signal was altered by reproduction through a rebroadcast system. The fitting technique can use any number of standard parametric techniques. The transfer functions for broadcast compression and Blu-ray encoding can, for example, be used as templates.
Such templates are best suited to transients such as gunshots, breaking glass or TV screams; they will be less effective for narrower-band sounds such as vehicle noise or human speech that does not have significant variability (inflection or emotion).
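The envelope-recovery idea above can be sketched numerically. In this toy version (assumptions throughout: the low-pass FIR stands in for a template such as broadcast compression, and the "parameter fitting" is reduced to a single normalised-correlation score), the estimated envelope |Y(s)|/|X(s)| is compared against the template envelope:

```python
import numpy as np

fs = 48_000
rng = np.random.default_rng(5)
sig = rng.standard_normal(fs // 4)  # broadband transient (gunshot-like burst)

# Assumed "rebroadcast system" response: a low-pass FIR standing in
# for a transfer function template such as broadcast compression.
n = np.arange(-64, 65)
h = np.sinc(2 * 6_000 / fs * n) * np.hamming(len(n))
h /= h.sum()
observed = np.convolve(sig, h)  # what the device's microphone would hear

# Estimate the multiplicative envelope |H(s)| ~ |Y(s)| / |X(s)| and compare
# it against the template envelope.
L = len(observed)
X = np.abs(np.fft.rfft(sig, L)) + 1e-12
Y = np.abs(np.fft.rfft(observed))
H_est = Y / X
H_tpl = np.abs(np.fft.rfft(h, L))

def similarity(a, b):
    """Normalised correlation between two spectral envelopes."""
    a = (a - a.mean()) / a.std()
    b = (b - b.mean()) / b.std()
    return float(np.mean(a * b))

score = similarity(H_est, H_tpl)
print(score)  # near 1 when the sound passed through the template system
```

In practice the clean source is unknown, so the fit would be made against an assumed broadband source model, which is why the text restricts the technique to wideband transients.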

In step 304 the device 10 looks for artefacts in the detected sound (i.e. in the corresponding frequency spectrum and/or waveform of the corresponding audio signal) which indicate that the sound has been subjected to audio compression, e.g. psychoacoustic compression or another compression technique. This may involve identifying relatively low bitrate encoding in high and/or low frequency bands, and/or a reduction in encoding quality after a loud noise, and/or a noise floor level that can be associated with compression. Sounds that are determined as having been subjected to compression can be rejected, i.e. ignored, or marked as being suspected of being a non-real-world sound. Sounds can be deemed to be processed, or non-real-world, sounds (and therefore ignored or rejected) upon being identified as such by any one of steps 302, 303 or 304, or alternatively upon being identified as such by any two or more of steps 302, 303 and 304. Any determinations made by the audio signal processor 20 in this regard may be communicated to the controller 11, which may make the decision on whether or not to ignore the detected sound and/or determine which actions are to be taken in response to the detected sound.
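One concrete compression artefact of the kind step 304 could target is the hard bandwidth cutoff many lossy codecs impose. The sketch below measures effective bandwidth; the -60 dB floor, the smoothing width and the simulated 16 kHz brick-wall cutoff are all assumptions for illustration.

```python
import numpy as np

def effective_bandwidth(x, fs, floor_db=-60.0):
    """Highest frequency whose smoothed spectral level is within `floor_db`
    of the spectral peak. Lossy codecs often impose a hard cutoff;
    the -60 dB floor and smoothing width are assumptions."""
    spec = np.abs(np.fft.rfft(x)) ** 2
    kernel = np.ones(32) / 32                      # smooth over 32 bins
    smoothed = np.convolve(spec, kernel, mode="same")
    level_db = 10 * np.log10(smoothed / smoothed.max() + 1e-20)
    freqs = np.fft.rfftfreq(len(x), 1 / fs)
    above = np.where(level_db > floor_db)[0]
    return float(freqs[above[-1]]) if len(above) else 0.0

fs = 48_000
rng = np.random.default_rng(6)
real = rng.standard_normal(fs)  # full-bandwidth "real-world" sound, 1 s

# Simulate a codec-style brick-wall cutoff at 16 kHz.
spec = np.fft.rfft(real)
spec[np.fft.rfftfreq(len(real), 1 / fs) > 16_000] = 0
coded = np.fft.irfft(spec, len(real))

print(effective_bandwidth(real, fs), effective_bandwidth(coded, fs))
```

An effective bandwidth that stops well short of the microphone's own limit is one indicator that the sound has been through a lossy codec.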

It is noted that the sequence of steps 302, 303, 304 in Figure 3 is illustrative and that in alternative embodiments these (and/or other) steps may be performed in different orders, merged and/or operated in parallel, depending on the requirements of the application and the capabilities of the device 10. In preferred embodiments the device 10 combines the techniques of location analysis, spectrum analysis and artefact detection. In alternative embodiments any one of these techniques may be used on its own, or in combination with any other of the techniques. For example, spectrum analysis and artefact detection in particular may each be sufficient on its own to achieve an effective level of specificity for a given use-case.

It is noted that there are at least two approaches to implementation of spectrum analysis and artefact detection:

1) Development of one or more specific algorithms to detect the or each relevant signal characteristic, for example based on analysis of reference data; and

2) Training using machine-learning techniques. This may, for example, involve training the device 10 with training data comprising pairs of sound samples: an original sound generated live (or a very high-fidelity, artefact-free recording), and the same sound after typical encoding, compression and/or reproduction.

The second approach does not generate an algorithm as such, and it may not be apparent how the system achieves its level of effective differentiation. The machine-learning approach may also collapse steps in the processing: in other words, it may not be necessary to look separately for spectrum differences and compression artefacts; a trained system may simply learn the difference between processed and non-processed sounds using whatever characteristics it finds most capable of allowing the distinction to be made. The machine-learning approach may involve providing the device 10 with reference real-world and non-real-world sounds, the device 10 being configured through machine-learning to develop its own criteria empirically for distinguishing between them. These criteria may involve elements of location, spectral distribution and artefacts, and may differ for different types of input sound.
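A deliberately tiny sketch of the paired-sample training idea follows. Everything here is an assumption for illustration: the two hand-picked features, the crude band-limit-plus-quantise "processing", and the nearest-centroid rule standing in for a real learned model.

```python
import numpy as np

rng = np.random.default_rng(7)
FS = 16_000

def features(x):
    """Two simple spectral features; a real system would learn richer ones."""
    p = np.abs(np.fft.rfft(x)) ** 2 + 1e-20
    flatness = np.exp(np.mean(np.log(p))) / np.mean(p)
    freqs = np.fft.rfftfreq(len(x), 1 / FS)
    hf_ratio = p[freqs > 6_000].sum() / p.sum()
    return np.array([flatness, hf_ratio])

def make_pair():
    """A 'live' broadband sound and a crudely 'processed' version of it
    (band-limited and quantised), standing in for a real training pair."""
    live = rng.standard_normal(FS // 4)
    spec = np.fft.rfft(live)
    spec[np.fft.rfftfreq(len(live), 1 / FS) > 5_000] = 0
    processed = np.round(np.fft.irfft(spec, len(live)) * 64) / 64
    return live, processed

pairs = [make_pair() for _ in range(20)]
X = np.array([features(s) for pair in pairs for s in pair])
y = np.array([0, 1] * len(pairs))  # 0 = live, 1 = processed

# Nearest-centroid "training": the simplest possible learned decision rule.
centroids = np.array([X[y == c].mean(axis=0) for c in (0, 1)])

def classify(x):
    d = np.linalg.norm(features(x) - centroids, axis=1)
    return int(np.argmin(d))

live, processed = make_pair()
print(classify(live), classify(processed))
```

As the text notes, a genuinely trained system would choose its own features; the point of the sketch is only the shape of the training data, i.e. live/processed pairs of the same sound.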

An example of a practical application of the device 10 is now described for illustration purposes. In this example it is assumed that the device 10 is installed in the home of a vulnerable person to monitor their health and safety. For maximum visibility of the room 12 being monitored the device 10 is positioned on top of the TV set 16.

The device 10 is intended to monitor for coughs, sneezes, cries for help, sounds of danger and other noises, but in a normal home the TV is likely to be active for several hours a day and to generate many similar artificial sound events.

The device 10 has a plurality of microphones 18 and audio signal processing circuitry 20 configured to perform the following:

• Separate processing of the audio signal from each microphone 18

• Capture of audio input samples in response to detection of a trigger signal, e.g. when sound exceeding a trigger intensity and/or duration is detected

• Measurement of the phase shift between corresponding sound samples from each (or at least two) microphones 18

• Audio signal analysis of each sample, which may involve transfer function analysis and/or artefact detection.

During the training mode the device 10 determines the position of the loudspeakers 14 within the room 12, preferably by playing test signals through the television (e.g. via HDMI or other connection) and detecting the corresponding sounds rendered by the loudspeakers 14 using the microphones 18. At a minimum it is preferred that alternate left and right channel test signals are used, but more preferably test signals for 2.1, 5.1 and 7.1 sound set-ups are used, selecting channels and frequencies as appropriate.

During the listening mode, the device 10 can perform location analysis and reject sounds from the designated speaker locations. Alternatively, sounds from those locations can simply be marked as "suspect" and further processed before a final decision is made, for example based on weighted probabilities from each phase of analysis.
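The weighted-probability combination can be sketched as a simple fusion rule. The weights and threshold below are invented for illustration; the source does not specify values.

```python
# Illustrative fusion of per-stage scores; all weights and the threshold
# are assumptions, not values from the source document.
def fuse(p_location, p_spectrum, p_artefact,
         weights=(0.4, 0.3, 0.3), threshold=0.5):
    """Each p_* is the estimated probability that the sound is
    non-real-world (e.g. 'suspect' because it came from a known
    speaker location). Returns True if the sound should be rejected."""
    score = sum(w * p for w, p in zip(weights,
                                      (p_location, p_spectrum, p_artefact)))
    return score >= threshold

print(fuse(0.9, 0.8, 0.7))  # suspect in every phase -> rejected
print(fuse(0.9, 0.1, 0.1))  # only location is suspect -> kept
```

Marking sounds as "suspect" rather than rejecting them outright lets a weak location cue be overridden by clean spectrum and artefact results, as in the second call above.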

The invention is not limited to the embodiment(s) described herein but can be amended or modified without departing from the scope of the present invention.