

Title:
DISEASE DETECTION, IDENTIFICATION AND/OR CHARACTERIZATION USING MULTIPLE REPRESENTATIONS OF AUDIO DATA
Document Type and Number:
WIPO Patent Application WO/2023/143995
Kind Code:
A1
Abstract:
Systems, methods, and computer programs disclosed herein relate to training and using a machine learning model for audio classification, particularly for the purpose of disease detection, identification and/or characterization.

Inventors:
LENGA MATTHIAS (DE)
MOHAMMADI SEYED SADEGH (DE)
DINH TRUONG TUAN (DE)
Application Number:
PCT/EP2023/051189
Publication Date:
August 03, 2023
Filing Date:
January 19, 2023
Assignee:
BAYER AG (DE)
International Classes:
G16H50/20
Domestic Patent References:
WO2020222985A12020-11-05
Foreign References:
EP3166049A12017-05-10
US20170262996A12017-09-14
Other References:
ANKIT PAL ET AL: "Pay Attention to the cough: Early Diagnosis of COVID-19 using Interpretable Symptoms Embeddings with Cough Sound Signal Processing", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 6 October 2020 (2020-10-06), XP081779264
JOHN MENDONÇA ET AL: "Using Self-Supervised Feature Extractors with Attention for Automatic COVID-19 Detection from Speech", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 30 June 2021 (2021-06-30), XP081996879
CHANG YI ET AL: "Transformer-based CNNs: Mining Temporal Context Information for Multi-sound COVID-19 Diagnosis", 2021 43RD ANNUAL INTERNATIONAL CONFERENCE OF THE IEEE ENGINEERING IN MEDICINE & BIOLOGY SOCIETY (EMBC), IEEE, 1 November 2021 (2021-11-01), pages 2335 - 2338, XP034042950, DOI: 10.1109/EMBC46164.2021.9629552
G. ALTAN ET AL.: "Deep Learning on Computerized Analysis of Chronic Obstructive Pulmonary Disease", IEEE JOURNAL OF BIOMEDICAL AND HEALTH INFORMATICS, vol. 24, no. 5, May 2020 (2020-05-01), pages 1344 - 1350
G. H. R. BOTHA ET AL.: "Detection of tuberculosis by automatic cough sound analysis", PHYSIOL. MEAS., vol. 39, no. 4, 2018, Retrieved from the Internet
G. DESPHANDE, B. W. SCHULLER: "An overview on audio, signal, speech, and language processing for COVID-19 detection", ARXIV, 18 May 2020 (2020-05-18)
P. SUPPAKITJANUSANT ET AL.: "Identifying individuals with recent COVID-19 through voice classification using deep learning", SCI REP, vol. 11, 2021, pages 19149, Retrieved from the Internet
J. LAGUARTA ET AL.: "COVID-19 Artificial Intelligence Diagnosis Using Only Cough Recordings", IEEE OPEN JOURNAL OF ENGINEERING IN MEDICINE AND BIOLOGY, vol. 1, 2020, pages 275 - 281
JOHN MENDONCA ET AL.: "Using Self-Supervised Feature Extractors with Attention for Automatic COVID-19 Detection from Speech", ARXIV, 30 June 2021 (2021-06-30)
F. RUMSEY: "Sound Recording - Application and Theory", 2021, FOCAL PRESS
G. A. AMBAYE: "Time and Frequency Domain Analysis of Signals: A Review", INTERNATIONAL JOURNAL OF ENGINEERING RESEARCH & TECHNOLOGY, vol. 09, 2020
P. HILL: "Audio and Speech processing with MATLAB", 2018, CRC PRESS
A. BAEVSKI ET AL.: "wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations", ARXIV, 2020
F. CADY: "Data Science - The Executive Summary", 2020, WILEY
Attorney, Agent or Firm:
BIP PATENTS (DE)
Claims:
CLAIMS

1. A computer-implemented method, the method comprising: receiving one or more sets of audio data, each set of audio data representing a body sound, generating at least two different representations from each set of audio data, extracting features from each representation, generating a joint representation on the basis of all features, generating a classification result on the basis of the joint representation, the classification result indicating whether and/or to what extent the one or more sets of audio data contain an indication of the presence of a disease, outputting the classification result.

2. The computer-implemented method according to claim 1, wherein the body sound is produced by a patient suffering from a respiratory disease.

3. The computer-implemented method according to claim 1 or 2, wherein the body sound is a sound resulting from one or more of the following: coughing, breathing, talking, singing, screaming, swallowing, wheezing, speech, voice, rales.

4. The computer-implemented method according to any one of claims 1 to 3, wherein different sets of audio data are created from at least two of the following: coughing, breathing, voice.

5. The computer-implemented method according to any one of claims 1 to 4, wherein from each set of audio data a time-domain representation and a time-frequency representation are generated.

6. The computer-implemented method according to any one of claims 1 to 5, wherein the different representations of audio data have different resolutions.

7. The computer-implemented method according to any one of claims 1 to 6, wherein an attention mechanism is used for generating the joint representation, the attention mechanism being a self-attention mechanism.

8. The computer-implemented method according to any one of claims 1 to 6, wherein an attention mechanism is used for generating the joint representation, the attention mechanism being an attention-based pooling.

9. The computer-implemented method according to any one of claims 1 to 8, wherein the steps of extracting features, generating the joint representation and generating a classification result are executed using a trained artificial neural network.

10. The computer-implemented method according to claim 9, wherein the artificial neural network comprises an input layer for each representation of the different representations, a feature extraction unit for each representation of the different representations, wherein each feature extraction unit is configured to extract features from the representation and generate a feature vector, a feature vector combination unit which is configured to generate a joint representation on the basis of all feature vectors, and a classifier which is configured to assign the joint representation to one of at least two classes.

11. The computer-implemented method according to any one of claims 1 to 10, wherein the joint representation is assigned to one of two classes, one class representing patients suffering from a respiratory disease, such as chronic obstructive pulmonary disease, corona virus disease 2019, bronchitis, chronic bronchitis, emphysema, cystic fibrosis, pneumonia, tuberculosis, interstitial lung disease, pulmonary hypertension, chronic cough, the other class representing patients not suffering from the respiratory disease, or one of a plurality of classes, each class representing a severity of a respiratory disease, or one of several classes, wherein one class represents patients who do not have any respiratory disease and each of the remaining classes represents a different respiratory disease.

12. The computer-implemented method according to any one of claims 1 to 11 comprising: receiving one or more sets of audio data, each set of audio data representing one or more body sounds, providing a trained artificial neural network, wherein the trained artificial neural network comprises an output, and, for each set of audio data, a first input, and a second input, for each set of audio data: o generating a time-domain representation of the audio data, o generating a time-frequency representation of the audio data, o inputting the time-domain representation into the first input of the trained artificial neural network and the time-frequency representation into the second input of the artificial neural network, wherein the artificial neural network is configured and trained to

■ generate a first feature vector on the basis of the time-domain representation,

■ generate a second feature vector on the basis of the time-frequency representation,

■ generate a joint representation on the basis of the first feature vector and the second feature vector, preferably using an attention mechanism, and

■ generate, on the basis of the joint representation, a classification result, the classification result indicating whether and/or to what extent the one or more sets of audio data contain an indication of the presence of a disease, receiving, from the trained machine learning model, the classification result, outputting the classification result.

13. The computer-implemented method according to any one of claims 1 to 12 comprising: receiving non-audio signal data, generating at least two different representations from the non-audio signal data, extracting features from each representation, generating the joint representation on the basis of all features, preferably using an attention mechanism, generating the classification result on the basis of the joint representation, the classification result indicating whether and/or to what extent the one or more sets of audio data and non-audio signal data contain an indication of the presence of a disease, outputting the classification result.

14. The computer-implemented method according to any one of claims 1 to 13 comprising: receiving further patient data, extracting features from the patient data, generating the joint representation on the basis of all features, preferably using an attention mechanism, generating the classification result on the basis of the joint representation, outputting the classification result.

15. A computer system comprising: a processor; and a memory storing an application program configured to perform, when executed by the processor, an operation, the operation comprising: receiving one or more sets of audio data, each set of audio data representing a body sound, generating at least two different representations from each set of audio data, extracting features from each representation, generating a joint representation on the basis of all features, generating a classification result on the basis of the joint representation, the classification result indicating whether and/or to what extent the one or more sets of audio data contain an indication of the presence of a disease, outputting the classification result.

16. A non-transitory computer readable medium having stored thereon software instructions that, when executed by a processor of a computer system, cause the computer system to execute the following steps: receiving one or more sets of audio data, each set of audio data representing a body sound, generating at least two different representations from each set of audio data, extracting features from each representation, generating a joint representation on the basis of all features, generating a classification result on the basis of the joint representation, the classification result indicating whether and/or to what extent the one or more sets of audio data contain an indication of the presence of a disease, outputting the classification result.

Description:
Disease detection, identification and/or characterization using multiple representations of audio data

FIELD

Systems, methods, and computer programs disclosed herein relate to training and using a machine learning model for audio classification, particularly for the purpose of disease detection, identification and/or characterization.

BACKGROUND

The use of machine learning models for disease detection and/or characterization is increasing. This applies, among other things, to the analysis of body sounds for diagnostic purposes.

For example, G. Altan et al. disclose a comparison of multiple machine-learning algorithms for early diagnosis of chronic obstructive pulmonary disease using multichannel lung sound ("Deep Learning on Computerized Analysis of Chronic Obstructive Pulmonary Disease," in IEEE Journal of Biomedical and Health Informatics, vol. 24, no. 5, pp. 1344-1350, May 2020, doi: 10.1109/JBHI.2019.2931395).

G. H. R. Botha et al. disclose a method of detection of tuberculosis by automatic cough sound analysis (Physiol. Meas., 2018, Vol. 39, Nr. 4, https://doi.org/10.1088/1361-6579/aab6d0).

G. Desphande and B. W. Schuller disclose an overview on audio, signal, speech, and language processing for COVID-19 detection (arXiv:2005.08579v1 [cs.CY] 18 May 2020).

P. Suppakitjanusant et al. disclose a method for identifying individuals with recent COVID-19 through voice classification using deep learning (Sci Rep 11, 19149 (2021), https://doi.org/10.1038/s41598-021-98742-x).

J. Laguarta et al. disclose an artificial intelligence (AI) speech processing framework that leverages acoustic biomarker feature extractors to pre-screen for COVID-19 from cough recordings (COVID-19 Artificial Intelligence Diagnosis Using Only Cough Recordings, in IEEE Open Journal of Engineering in Medicine and Biology, vol. 1, pp. 275-281, 2020, doi: 10.1109/OJEMB.2020.3026928).

John Mendonca et al. disclose a method for automatic COVID-19 detection from speech using self-supervised feature extractors with attention (arXiv:2107.00112v1 [eess.AS] 30 Jun 2021).

In the aforementioned publications, the detection of a disease is based on waveforms or spectrograms of audio data or on individual features extracted from audio data.

However, the reliable detection, identification and/or characterization of diseases remains a challenge. There is still a need for improvement.

SUMMARY

As shown in the present disclosure, disease detection, identification and/or characterization can be improved if it is based on multiple representations of data.

In a first aspect, the present disclosure provides a computer-implemented method, the method comprising: receiving one or more sets of audio data, each set of audio data representing a body sound, and optionally non-audio signal data and/or further patient data, generating at least two different representations from each set of audio data and optionally from each of the non-audio signal data, extracting features from each representation and optionally from the further patient data, generating a joint representation on the basis of all features, generating a classification result on the basis of the joint representation, the classification result indicating whether and/or to what extent the one or more sets of audio data and optionally the non-audio signal data and/or further patient data contain an indication of the presence of a disease, outputting the classification result.

In another aspect, the present disclosure provides a computer system, the computer system comprising: a processor; and a memory storing an application program configured to perform, when executed by the processor, an operation, the operation comprising: receiving one or more sets of audio data, each set of audio data representing a body sound, and optionally non-audio signal data and/or further patient data, generating at least two different representations from each set of audio data and optionally from each of the non-audio signal data, extracting features from each representation and optionally from the further patient data, generating a joint representation on the basis of all features, generating a classification result on the basis of the joint representation, the classification result indicating whether and/or to what extent the one or more sets of audio data and optionally the non-audio signal data and/or further patient data contain an indication of the presence of a disease, outputting the classification result.

In another aspect, the present disclosure provides a non-transitory computer readable medium having stored thereon software instructions that, when executed by a processor of a computer system, cause the computer system to execute the following steps: receiving one or more sets of audio data, each set of audio data representing a body sound, and optionally non-audio signal data and/or further patient data, generating at least two different representations from each set of audio data and optionally from each of the non-audio signal data, extracting features from each representation and optionally from the further patient data, generating a joint representation on the basis of all features, generating a classification result on the basis of the joint representation, the classification result indicating whether and/or to what extent the one or more sets of audio data and optionally the non-audio signal data and/or further patient data contain an indication of the presence of a disease, outputting the classification result.

Further aspects of the present invention are disclosed in the dependent claims, the specification, and the drawings.

DETAILED DESCRIPTION

The invention will be more particularly elucidated below without distinguishing between the aspects of the invention (method, computer system, computer-readable storage medium). On the contrary, the following elucidations are intended to apply analogously to all the aspects of the invention, irrespective of in which context (method, computer system, computer-readable storage medium) they occur.

If steps are stated in an order in the present description or in the claims, this does not necessarily mean that the invention is restricted to the stated order. On the contrary, it is conceivable that the steps can also be executed in a different order or else in parallel to one another, unless one step builds upon another step, which absolutely requires that the building step be executed subsequently (this being, however, clear in the individual case). The stated orders are thus preferred embodiments of the invention.

As used herein, the articles “a” and “an” are intended to include one or more items and may be used interchangeably with “one or more” and “at least one.” As used in the specification and the claims, the singular forms of “a”, “an”, and “the” include plural referents, unless the context clearly dictates otherwise. Where only one item is intended, the term “one” or similar language is used. Also, as used herein, the terms “has”, “have”, “having”, or the like are intended to be open-ended terms. Further, the phrase “based on” is intended to mean “based at least partially on”, unless explicitly stated otherwise. Further, the phrase “based on” may mean “in response to” and be indicative of a condition for automatically triggering a specified operation of an electronic device (e.g., a controller, a processor, a computing device, etc.) as appropriately referred to herein.

Some implementations of the present disclosure will be described more fully hereinafter with reference to the accompanying drawings, in which some, but not all implementations of the disclosure are shown. Indeed, various implementations of the disclosure may be embodied in many different forms and should not be construed as limited to the implementations set forth herein; rather, these example implementations are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.

The present disclosure provides means for the classification of audio data and optionally further data.

“Audio data” is a digital representation of one or more sounds. In particular, “audio data” is a digital representation of one or more sounds that can be used to analyze and/or reproduce the one or more sounds.

The term “digital” means that the data can be processed by a computer system.

“Sounds” are pressure variations in the air (or any other medium) that can be converted into an electrical signal with the help of a microphone and recorded mechanically or digitally. Other terms that are used for the term “sound” are “acoustic wave(s)” and “sound wave(s)”, which indicate that the pressure variations propagate through a transmission medium such as air.

The term “audio” indicates that the sound is usually a pressure variation that is within a range that is audible to (can be heard by) the human ear. The human hearing range is commonly given as 20 to 20,000 Hz, although there is considerable variation between individuals. However, the term “audio” should not be understood to mean that the present disclosure is limited to sound waves in the range of 20 to 20,000 Hz. In principle, the methods presented herein can also be applied to sound waves that are outside the range perceived by humans.

For capturing sound(s) as audio data, one or more microphones can be used. A microphone is a transducer that converts sound into an electrical signal. Several types of microphones can be used, which employ different methods to convert the air pressure variations of a sound wave to an electrical signal. The most common are dynamic microphones, which use a coil of wire suspended in a magnetic field, condenser microphones, which use a vibrating diaphragm as a capacitor plate, and contact microphones, which use a crystal of piezoelectric material. Microphones typically need to be connected to a preamplifier and/or an amplifier before the signal can be recorded.

The one or more microphones used to capture sound and convert it into an electrical signal can be worn on the (human) body as so-called “wearables”; they can be part of a device that a person carries with him/her, such as a mobile phone or a wristwatch; and/or they can be permanently installed in one or more rooms in which a person is occasionally or frequently present.

The electrical signal generated by the microphone(s) can be converted by an analog-to-digital converter into a digital audio signal. The digital audio signal can then be stored as audio data in a data memory.

One or more microphones, optionally one or more preamplifiers and/or amplifiers, one or more analog-to-digital converters and one or more data memories can be part of one device, or they can be part of separate devices which are connected to one another in order to generate audio data as described herein. The audio data can be saved in various audio file formats, e.g., in uncompressed waveform formats such as the waveform audio file format (WAV) and/or the audio interchange file format (AIFF), and/or with lossless compression such as FLAC (Free Lossless Audio Codec), and/or with lossy compression such as MP3.

Details about generating audio data can be found in various textbooks (see, e.g., F. Rumsey: Sound Recording - Application and Theory, 8th edition, Focal Press, 2021, ISBN 9780367553029).

Preferably, the one or more sound(s) which is/are represented by the audio data is/are caused by a human. Preferably, these are body sounds. The human is preferably a patient suffering from a respiratory disease such as chronic obstructive pulmonary disease (COPD), corona virus disease 2019 (COVID-19), (chronic) bronchitis, emphysema, cystic fibrosis, pneumonia, tuberculosis, interstitial lung disease (ILD), pulmonary hypertension, chronic cough, and/or others.

The sound(s) can be produced by the human consciously and/or unconsciously. The sound(s) can be produced by the human intentionally and/or unintentionally.

The sound(s) can be caused by an internal stimulus or it can be a reaction to an external stimulus.

The sound(s) may be from one or more of the following: coughing, snoring, sneezing, hiccupping, vomiting, breathing, talking, singing, making noises, screaming, swallowing, wheezing, shortness of breath, chewing, grinding teeth, speech, voice, rales and/or others.

In a preferred embodiment, various sounds made by a human are captured and stored as different sets of audio data on one or more storage media.

In a preferred embodiment, the sounds of coughing, breathing, and/or voice are captured separately and stored as separate audio data sets on one or more storage media.

The sound(s) can be recorded by a patient and/or a physician and/or a physician's assistant and stored as audio data on one or more storage media.

In a preferred embodiment, a patient is asked to breathe in and/or out deeply. The resulting breathing sounds can then be recorded and stored as audio data on one or more storage media.

The patient may be asked to form the vowels "a", "e", "i", "o" and/or "u" with his/her voice. The patient may be asked to pronounce one or more pre-defined words. Spoken sounds and/or words can be captured and stored as audio data on one or more storage media. The sounds and/or words to be pronounced can be spoken to the patient and/or displayed on a monitor.

The patient can be asked to cough, and the coughing sound can be captured and stored as audio data on one or more storage media. Similarly, spontaneous coughing can be recorded and stored as audio data on one or more storage media.

It is conceivable, for example, that body sounds from a patient are continuously recorded with the aid of one or more microphones and stored as audio data on one or more storage media.

Each audio recording represented by a set of audio data has a certain length, i.e., the time that passes when the audio recording is played, and the sound(s) contained in the audio recording are played back.

Preferably, audio recordings with a predefined length are used. The length is preferably greater than the length of the respective sound event (such as a single cough event). The length is preferably in the range of 1 second to 20 seconds, more preferably in the range of 2 seconds to 10 seconds.

If the length of an audio recording exceeds the predefined length, it can be divided into sections having the defined length.

The term “receiving one or more sets of audio data” preferably means that the one or more sets of audio data are read from one or more storage media and/or transmitted from one or more remote computer systems.
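
The division of a longer recording into fixed-length sections described above might, purely for illustration, look as follows. This is a minimal sketch, not code from the disclosure; it assumes the recording is available as a one-dimensional NumPy array and uses a hypothetical sample rate of 16 kHz and a section length of 5 seconds:

```python
import numpy as np

def split_into_sections(waveform: np.ndarray, sample_rate: int = 16000,
                        section_seconds: float = 5.0) -> list[np.ndarray]:
    """Divide a 1-D audio signal into consecutive sections of a predefined length.

    A final section shorter than the predefined length is zero-padded so that
    every section has the same number of samples.
    """
    section_len = int(section_seconds * sample_rate)
    sections = []
    for start in range(0, len(waveform), section_len):
        chunk = waveform[start:start + section_len]
        if len(chunk) < section_len:
            chunk = np.pad(chunk, (0, section_len - len(chunk)))
        sections.append(chunk)
    return sections
```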

Different representations are then generated from the one or more audio data sets. At least two different representations are generated from each set of audio data. The representations can be, for example, representations in different domains.

A domain can be, for example, a time domain, a frequency domain, or a time-frequency domain.

A “time-domain representation” represents a physical quantity as a function of time. In the case of an audio signal, the time-domain representation can represent the loudness (amplitude) of the captured sound(s) as a function of time. Examples of time-domain representations of captured (cough) sounds are given, e.g., in Figs. 1, 2 and 8 of US2020015709A1, and Figs. 1 and 2 of DOI: 10.1109/ACCESS.2020.3018028.

A “frequency-domain representation” provides information about the different frequencies present in an audio signal and the magnitude of the frequencies present. A frequency-domain representation can be obtained, e.g., by Fourier transformation of a time-domain representation of an audio signal. Conversely, a frequency-domain representation can be converted to a time-domain representation by inverse Fourier transformation. Fig. 4 of the following publication shows an example of a frequency-domain representation: G. A. Ambaye: Time and Frequency Domain Analysis of Signals: A Review, International Journal of Engineering Research & Technology, 2020, Volume 09, Issue 12.
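
As an illustration of the Fourier transformation mentioned above, an amplitude spectrum of a real-valued audio signal could be computed with NumPy as follows; this is a sketch for orientation, not code from the disclosure:

```python
import numpy as np

def amplitude_spectrum(waveform: np.ndarray, sample_rate: int):
    """Return the frequency bins and the magnitude of each frequency component."""
    spectrum = np.fft.rfft(waveform)                              # FFT of a real-valued signal
    freqs = np.fft.rfftfreq(len(waveform), d=1.0 / sample_rate)   # frequency of each bin in Hz
    return freqs, np.abs(spectrum)                                # magnitudes (amplitude spectrum)
```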

A “time-frequency representation” is a representation of a signal over both time and frequency. A time-frequency representation can represent the intensities of frequency components as a function of time. A time-frequency representation can be obtained from an audio signal, e.g., by short-time Fourier transform. Short-time Fourier transform (STFT) is a Fourier-related transform used to determine the sinusoidal frequency and phase content of local sections of a signal as it changes over time. In practice, the procedure for computing STFTs is to divide a longer time signal into shorter segments of equal length and then compute the Fourier transform separately on each shorter segment. This reveals the Fourier spectrum on each shorter segment. One then usually plots the changing spectra as a function of time, known as a spectrogram or waterfall plot. A spectrogram can be plotted, e.g., as a three-dimensional graphic in which one axis represents the time (t), the second axis represents frequencies (f), and the third axis represents the magnitude (m) of the observed frequencies at a particular time. Very often a spectrogram is displayed as a two-dimensional image in which one image axis represents the time, the other image axis represents frequencies, and the colors or grey values represent the magnitude (amplitude) of the observed frequencies at a particular time. Examples of spectrogram representations of captured sounds are given, e.g., in Figs. 1 and 2 of DOI: 10.1186/1745-9974-2-1 and Figs. 7 and 8 of US6436057.
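
A spectrogram of the kind described above could, for example, be computed with the short-time Fourier transform provided by SciPy; the window length, the overlap, and the conversion to decibels chosen below are illustrative assumptions, not details from the disclosure:

```python
import numpy as np
from scipy.signal import stft

def spectrogram_db(waveform: np.ndarray, sample_rate: int,
                   window_seconds: float = 0.025, overlap: float = 0.5):
    """Compute a magnitude spectrogram (in dB) via the short-time Fourier transform."""
    nperseg = int(window_seconds * sample_rate)          # samples per short segment
    noverlap = int(nperseg * overlap)                    # overlap between neighboring segments
    freqs, times, Zxx = stft(waveform, fs=sample_rate,
                             nperseg=nperseg, noverlap=noverlap)
    magnitude = np.abs(Zxx)                              # |STFT|: frequency magnitudes over time
    return freqs, times, 20.0 * np.log10(magnitude + 1e-10)  # convert amplitudes to decibels
```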

In a preferred embodiment of the present disclosure, the at least two representations of an audio data set are selected from the following list of representations: amplitude as a function of time, frequency spectrum, amplitude spectrum, phase spectrum, STFT representation, STFT amplitude representation, STFT phase representation, spectrogram.

Methods for generating time-domain representations, frequency-domain representations and time-frequency representations from audio signals are known and are described in the publications cited herein and/or are disclosed in various textbooks (see e.g.: P. Hill: Audio and Speech processing with MATLAB, CRC Press, 2018, ISBN: 9780429813962).

Fig. 1 shows schematically, by way of example, the generation of a time-domain representation TDR and a time-frequency representation TFR from an audio recording AR. In the example shown in Fig. 1, both the time-domain representation TDR and the time-frequency representation TFR are generated from the audio recording AR. However, it is also possible to first generate a time-domain representation from the audio recording and then use the time-domain representation in order to generate the time-frequency representation. It is also possible to generate representations with different resolutions from one set of audio data.

For example, if the original audio recording has a sample rate of 96 kHz and a 24-bit audio bit depth, two representations having different resolutions can be generated using down-sampling techniques, for example, a first representation having the original 96 kHz / 24-bit resolution and a second representation having a 44.1 kHz / 16-bit resolution.
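
As a rough sketch of such down-sampling, the conversion from 96 kHz to 44.1 kHz and from floating-point amplitudes to 16-bit resolution could be performed as follows; the polyphase resampling and the simple quantization scheme are one possible choice among several, not a procedure prescribed by the disclosure:

```python
import numpy as np
from scipy.signal import resample_poly

def downsample_96k_to_44k1(waveform: np.ndarray) -> np.ndarray:
    """Resample a 96 kHz signal to 44.1 kHz (44100/96000 = 147/320)."""
    return resample_poly(waveform, up=147, down=320)

def requantize_to_16bit(waveform: np.ndarray) -> np.ndarray:
    """Quantize a float signal in [-1, 1] to 16-bit integer resolution."""
    clipped = np.clip(waveform, -1.0, 1.0)
    return (clipped * 32767.0).astype(np.int16)
```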

The different representations of the audio data (and optionally further data) are used to classify the audio data. This can be done in three steps. In a first step, features are extracted from the different representations (and optionally from further data). The extracted features from the different representations (and optionally from further data) are combined in a second step, preferably by using an attention mechanism. In a third step, the joint representation of all audio data (and optionally further data) resulting from the combination is assigned to one of at least two classes.

This approach is shown schematically in Fig. 2 and Fig. 3 in the form of examples.

In the example depicted in Fig. 2, there are two sets of audio data, a first set of audio data AD1, and a second set of audio data AD2. The first set of audio data AD1 can be, for example, sound(s) originating from a patient's inhalation and/or exhalation, and/or from cough. The second set of audio data AD2 can be, for example, voice sound(s) of the patient (such as one or more pronounced vowels and/or words). In the example depicted in Fig. 2, two representations are generated from each set of audio data. From the first set of audio data AD1, a first time-domain representation TDR1, and a first time-frequency representation TFR1 are generated. From the second set of audio data AD2, a second time-domain representation TDR2, and a second time-frequency representation TFR2 are generated. From each representation, features are extracted. The extracted features are in the form of feature vectors. From the first time-domain representation TDR1 a first feature vector FV1 is generated; from the first time-frequency representation TFR1 a second feature vector FV2 is generated; from the second time-domain representation TDR2 a third feature vector FV3 is generated; from the second time-frequency representation TFR2 a fourth feature vector FV4 is generated. The feature vectors FV1, FV2, FV3, and FV4 are combined in a joint representation JR. The joint representation JR is assigned to one of the at least two classes C1, C2, ..., Cm, wherein m is an integer greater than 1. Note that in the example depicted in Fig. 2, the joint representation JR is generated directly from the feature vectors FV1, FV2, FV3, and FV4. However, it is also possible to first combine feature vectors FV1 and FV2 into a joint feature vector, combine feature vectors FV3 and FV4 into another joint feature vector, and then combine the resulting joint feature vectors into a joint representation JR. Such an approach is shown in Fig. 3. It is also possible to first combine feature vectors FV1 and FV3 into a joint feature vector, combine feature vectors FV2 and FV4 into another joint feature vector, and then combine the resulting joint feature vectors into a joint representation JR.

In the example depicted in Fig. 3, there are three sets of audio data, a first set of audio data AD1, a second set of audio data AD2, and a third set of audio data AD3. The first set of audio data AD1 can be, for example, cough sound(s) of a patient. The second set of audio data AD2 can be, for example, breathing sound(s) of the patient. The third set of audio data AD3 can be, for example, voice sound(s) of the patient (such as one or more pronounced vowels and/or words). In the example depicted in Fig. 3, two representations are generated from each set of audio data: a time-domain representation TDR, and a time-frequency representation TFR. There are two feature extraction units FE1 and FE2. The feature extraction unit FE1 is configured to extract time-domain features from each time-domain representation TDR. The result is three feature vectors FV1_1, FV1_2, and FV1_3. The feature vector FV1_1 represents the time-domain features of the first set of audio data AD1. The feature vector FV1_2 represents the time-domain features of the second set of audio data AD2. The feature vector FV1_3 represents the time-domain features of the third set of audio data AD3. The feature extraction unit FE2 is configured to extract time-frequency features from each time-frequency representation TFR. The result is three feature vectors FV2_1, FV2_2, and FV2_3. The feature vector FV2_1 represents the time-frequency features of the first set of audio data AD1. The feature vector FV2_2 represents the time-frequency features of the second set of audio data AD2. The feature vector FV2_3 represents the time-frequency features of the third set of audio data AD3. The feature vectors FV1_1 and FV2_1 are combined, e.g., concatenated. The feature vectors FV1_2 and FV2_2 are combined, e.g., concatenated. The feature vectors FV1_3 and FV2_3 are combined, e.g., concatenated. The combined feature vectors are then combined into a joint representation JR by a feature vector combination unit FU. The joint representation JR is used by the classifier C for classification.

As used herein, “feature extraction” is a process of dimensionality reduction by which an initial set of data is reduced to more manageable groups for processing. Feature extraction starts from the initial set of data and builds derived values (features) intended to be informative and non-redundant. A characteristic of these large data sets is a large number of variables that require a lot of computing resources to process. Feature extraction is the name for methods that select and/or combine variables into features, effectively reducing the amount of data that must be processed, while still accurately and completely describing the original data set.

In a preferred embodiment, a feature vector is generated from each representation of an audio data set. A “feature vector” is a p-dimensional vector of numerical features that represent an object (in the present case an audio signal and/or further data), wherein p is an integer greater than 0. The term “feature vector” shall also include single values, matrices, tensors, and the like.

In a preferred embodiment, an artificial neural network is used for feature extraction (hereinafter also referred to as feature extraction network/unit). An artificial neural network (ANN) is a biologically inspired computational model. An ANN usually comprises at least three layers of processing elements: a first layer with input neurons (nodes), a k-th layer with at least one output neuron (node), and k-2 inner (hidden) layers, where k is an integer greater than 2.

In such a network, the input neurons serve to receive the input data. The output neurons serve to output data such as a feature vector, a joint representation, or a classification result. The processing elements of the layers are interconnected in a predetermined pattern with predetermined connection weights therebetween. Each network node can represent a calculation of the weighted sum of inputs from prior nodes and a non-linear output function. The combined calculation of the network nodes relates the inputs to the outputs.

Fig. 4 shows schematically, by way of example, a feature extraction network. The feature extraction network comprises an input layer IL, a number n of hidden layers HL1 to HLn and an output layer. The input neurons of the input layer IL serve to receive a representation ADR of audio data. The output neurons serve to output a feature vector FV.

The neurons of the input layer IL and the hidden layer HL1 are connected by connection lines having a connection weight, and the neurons of the hidden layer HLn and the output layer are also connected by connection lines with a connection weight. Similarly, the neurons of the hidden layers are connected to the neurons of neighboring hidden layers in a predetermined manner (not shown in Fig. 4). The connection weights can be learned through training.

In a preferred embodiment of the present invention, one or more feature extraction networks used for feature extraction are or comprise a convolutional neural network (CNN).

A CNN is a class of deep neural networks that comprises an input layer with input neurons, an output layer with at least one output neuron, as well as multiple hidden layers between the input layer and the output layer. The hidden layers of a CNN typically comprise filters (convolutional layers) and aggregation layers (pooling layers) which are repeated alternately and, at the end, one layer or multiple layers of completely connected neurons (dense/fully connected layer(s)).

In a preferred embodiment, the convolutional neural network of the one or more feature extraction networks comprise(s) shortcut connections.

Usually, there is one feature extraction network for each representation of audio data, each feature extraction unit being configured to generate a feature vector from the respective representation of the audio data. The design of each feature extraction network can be different depending on the representation of audio data. For example, a convolutional neural network may use a 2D kernel on a time-frequency representation (spectrogram) but a 1D kernel on a time-domain representation.
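
For orientation, two small convolutional feature extraction networks of this kind, one applying 1D kernels to the waveform and one applying 2D kernels to the spectrogram, might be sketched in PyTorch as follows; the layer sizes and the output dimension feature_dim are illustrative assumptions, not an architecture taken from the disclosure:

```python
import torch
import torch.nn as nn

class TimeDomainFeatureExtractor(nn.Module):
    """1-D convolutions over the raw waveform, producing a fixed-size feature vector."""
    def __init__(self, feature_dim: int = 128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=9, stride=4), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=9, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),                      # pool over the time axis
        )
        self.proj = nn.Linear(64, feature_dim)

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, 1, samples)
        x = self.conv(waveform).squeeze(-1)               # (batch, 64)
        return self.proj(x)                               # (batch, feature_dim)

class TimeFrequencyFeatureExtractor(nn.Module):
    """2-D convolutions over a spectrogram, producing a fixed-size feature vector."""
    def __init__(self, feature_dim: int = 128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),                      # pool over time and frequency
        )
        self.proj = nn.Linear(64, feature_dim)

    def forward(self, spectrogram: torch.Tensor) -> torch.Tensor:
        # spectrogram: (batch, 1, freq_bins, time_frames)
        x = self.conv(spectrogram).flatten(1)             # (batch, 64)
        return self.proj(x)                               # (batch, feature_dim)
```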

All feature vectors are also collectively referred to as the set of feature vectors.

For example, for the extraction of features from a time-domain representation, a feature extraction network as disclosed by A. Baevski et al. (wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations, arXiv:2006.11477 [cs.CL], 2020) can be used.

For example, for the extraction of features from a time-frequency representation, a feature extraction network as disclosed by A. Dosovitskiy et al. (An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, arXiv:2010.11929 [cs.CV], 2021) can be used.

Preferably, the feature extraction networks are components of a higher-level classification model for audio data. Preferably, the feature extraction networks are trained together (end-to-end) with the higher-level classification model (details about the training can be found later in the description).

From the set of feature vectors, a joint representation is generated, preferably using an attention mechanism. It was found that the use of an attention mechanism leads to a higher prediction accuracy than a (simple) combination (e.g., concatenation) of the feature vectors.

In neural networks, attention is a technique that mimics cognitive attention. The effect enhances some parts of the input data while diminishing other parts, the idea being that the network should devote more focus to the small but important parts of the data. Which parts of the data are more important than others is learned during the training phase.

The joint representation can be a weighted combination of all of the feature vectors, with the most relevant features being attributed the highest weights.

The attention mechanism can be, e.g., an attention-based pooling or a self-attention mechanism.

Fig. 5 shows schematically, by way of example, the generation of a joint representation from a set of feature vectors using attention-based pooling. In the example depicted in Fig. 5, the set of feature vectors consists of three feature vectors FV1, FV2 and FV3. The number c indicates the number of feature vectors, the number d indicates the dimension of each feature vector. In a first step (110), a score vector SV is generated from the set of feature vectors. For example, a series of 1x1 convolutions can be applied on each feature vector to reduce the dimension to 1. In a second step (120), a softmax activation function is applied to the score vector SV to obtain a set of attention weights. In a third step, a weighted combination (e.g., a weighted sum) of the feature vectors is computed according to the attention weights to produce a joint representation JR.
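
A minimal sketch of such attention-based pooling is given below; a single learned scoring layer stands in here for the series of 1x1 convolutions mentioned above, and the tensor shapes follow the notation of Fig. 5 (c feature vectors of dimension d). This is an illustration under those assumptions, not an implementation of the disclosure:

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Combine c feature vectors of dimension d into one joint representation."""
    def __init__(self, d: int):
        super().__init__()
        self.score = nn.Linear(d, 1)                        # reduces each feature vector to a scalar score

    def forward(self, feature_vectors: torch.Tensor) -> torch.Tensor:
        # feature_vectors: (batch, c, d)
        scores = self.score(feature_vectors)                # (batch, c, 1) - score vector SV
        weights = torch.softmax(scores, dim=1)              # (batch, c, 1) - attention weights
        joint = (weights * feature_vectors).sum(dim=1)      # (batch, d)   - weighted sum
        return joint                                        # joint representation JR
```

The softmax ensures that the attention weights are non-negative and sum to one, so that the joint representation is a convex combination of the feature vectors.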

Fig. 6 shows schematically, by way of example, the generation of a joint representation from a set of feature vectors using a self-attention mechanism. The self-attention mechanism allows the inputs to interact with each other (“self”) and find out to which of the other inputs they should pay more attention (“attention”). The outputs are aggregates of these interactions and attention scores. In a first step (210), for each feature vector FVi a projected feature vector FVi' is computed, wherein i is an index that runs through the numbers 1 to 3. The projected feature vectors FVi' can be computed, e.g., by multiplying the respective feature vector with a learnable matrix M: FVi' = M x FVi. In a second step (220), attention weights W_11, W_12, and W_13 are computed for the first feature vector FV1. W_11 quantifies the attention that the first feature vector FV1 pays to itself, W_12 quantifies the attention that the first feature vector FV1 pays to the second feature vector FV2, and W_13 quantifies the attention that the first feature vector FV1 pays to the third feature vector FV3. The attention weights can denote the similarity between one vector and the others. The attention weights can be computed by any similarity measure, e.g., by calculating the dot product (scalar product) of the respective vectors. Analogously, attention weights can be computed for the second feature vector FV2, and the third feature vector FV3. In step (230), attention weights W_21, W_22, and W_23 are computed for the second feature vector FV2. W_21 quantifies the attention that the second feature vector FV2 pays to the first feature vector FV1, W_22 quantifies the attention that the second feature vector FV2 pays to itself, and W_23 quantifies the attention that the second feature vector FV2 pays to the third feature vector FV3. In step (240), attention weights W_31, W_32, and W_33 are computed for the third feature vector FV3. W_31 quantifies the attention that the third feature vector FV3 pays to the first feature vector FV1, W_32 quantifies the attention that the third feature vector FV3 pays to the second feature vector FV2, and W_33 quantifies the attention that the third feature vector FV3 pays to itself. In step (250), an attention-weighted feature vector FV1'' is computed from the projected feature vectors and the attention weights: FV1'' = FV1' · W_11 + FV2' · W_12 + FV3' · W_13. In step (260), an attention-weighted feature vector FV2'' is computed from the projected feature vectors and the attention weights: FV2'' = FV1' · W_21 + FV2' · W_22 + FV3' · W_23. In step (270), an attention-weighted feature vector FV3'' is computed from the projected feature vectors and the attention weights: FV3'' = FV1' · W_31 + FV2' · W_32 + FV3' · W_33. In step (280), the attention-weighted feature vectors FV1'', FV2'', and FV3'' are combined into a joint representation JR, e.g., by using a multi-layer perceptron.
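
The self-attention steps described above might be sketched as follows; the softmax normalization of the pairwise dot products and the layer sizes of the merging multi-layer perceptron are assumptions added to obtain a workable example, not details taken from the disclosure:

```python
import torch
import torch.nn as nn

class SelfAttentionCombiner(nn.Module):
    """Let c feature vectors attend to each other, then merge them into a joint representation."""
    def __init__(self, num_vectors: int, d: int, joint_dim: int = 128):
        super().__init__()
        self.M = nn.Linear(d, d, bias=False)               # learnable projection matrix M
        self.mlp = nn.Sequential(                          # merges the attention-weighted vectors
            nn.Linear(num_vectors * d, joint_dim), nn.ReLU(),
            nn.Linear(joint_dim, joint_dim),
        )

    def forward(self, fv: torch.Tensor) -> torch.Tensor:
        # fv: (batch, c, d) - the c feature vectors of one sample
        projected = self.M(fv)                                      # FVi' = M x FVi
        # pairwise dot products as similarity, normalized per row with a softmax
        weights = torch.softmax(fv @ fv.transpose(1, 2), dim=-1)    # (batch, c, c) - W_ij
        attended = weights @ projected                              # FVi'' = sum_j W_ij * FVj'
        return self.mlp(attended.flatten(1))                        # joint representation JR
```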

The attention mechanisms described herein and other attention mechanisms that may be used to carry out the present invention are described in scientific articles and patents / patent applications (see e.g.: arXiv:1807.03380v1 [cs.CV] 9 Jul 2018, arXiv:1911.07559v2 [cs.CV] 5 Dec 2019, arXiv:1904.02874v3 [cs.LG] 12 Jul 2021, arXiv:2006.03347v1 [cs.CV] 5 Jun 2020, DOI: 10.3390/fi11010009, EP3166049A1, US20170262996, WO2020/222985).

The joint representation can be used as an input of a classifier.

The classifier is configured to assign the joint representation to one of at least two groups (classes). For example, the classifier can be configured to assign the joint representation to one of two classes, one class representing patients suffering from a specific disease, such as chronic obstructive pulmonary disease (COPD), corona virus disease 2019 (COVID-19), (chronic) bronchitis, emphysema, cystic fibrosis, pneumonia, tuberculosis, interstitial lung disease (ILD), pulmonary hypertension, chronic cough, and/or others, the other class representing patients not suffering from the specific disease.

Alternatively, the classifier may be configured to assign the joint representation to one of several severity classes of a specific disease, where one class may represent patients who do not exhibit symptoms of the specific disease, one class may represent patients who exhibit a mild form of the disease that does not require medical intervention, and a third class may represent patients who exhibit a severe form of the disease that requires immediate medical intervention.

Alternatively, the classifier may be configured to assign the joint representation to one of several classes, wherein one class represents patients who do not have any of a plurality of different diseases and each of the remaining classes represents patients suffering from one of the plurality of different diseases.

The classifier can be an artificial neural network, a support vector machine, a random forest and/or any other model that allows classification.

In a preferred embodiment, the classifier is or comprises an artificial neural network. Such an artificial neural network can be, e.g., a multi-layer perceptron in which the last layer has a number of neurons corresponding to the number of classes.
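
Such a classifier head might, purely as an illustration, be defined as follows; the hidden layer size is an assumption, and the last layer has one neuron per class as described above:

```python
import torch.nn as nn

def make_classifier(joint_dim: int, num_classes: int, hidden: int = 64) -> nn.Module:
    """Multi-layer perceptron whose last layer has one neuron per class."""
    return nn.Sequential(
        nn.Linear(joint_dim, hidden),
        nn.ReLU(),
        nn.Linear(hidden, num_classes),   # raw class scores (logits), one per class
    )
```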

In a preferred embodiment, the feature extraction, the generation of the joint representation as well as the classification are performed by a single artificial neural network which is trained end-to-end to classify audio data (and optionally further data). Fig. 7 shows schematically, by way of example, an artificial neural network for the classification of audio data (and optionally further data).

The artificial neural network ANN comprises a number n of feature extraction units (networks) FE1, FE2, ..., FEn. Each feature extraction unit provides an input layer for inputting representations of audio data, and optionally further data. The input layers are indicated by the reference signs IL1, IL2, ..., ILn. The input data are indicated by the reference signs DR1, DR2, ..., DRn. Each feature extraction unit is configured to extract features from the input data and generate a feature vector. The feature vectors are indicated by the reference signs FV1, FV2, ..., FVn. The feature vectors are inputted into a feature vector combination unit FU. The feature vector combination unit is configured to generate a joint representation JR from the feature vectors. The joint representation JR is inputted into a classifier C. The classifier C is configured to generate a classification result CR; the classification result CR indicates whether and/or to what extent the input data contain an indication of the presence of a disease.
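
Putting the pieces of Fig. 7 together, an end-to-end network of this kind might be assembled as follows; this sketch reuses the hypothetical AttentionPooling and make_classifier helpers from the earlier sketches and is not an implementation of the disclosure:

```python
import torch
import torch.nn as nn

class AudioClassificationNetwork(nn.Module):
    """End-to-end network: n feature extraction units -> combination unit -> classifier."""
    def __init__(self, extractors: list[nn.Module], feature_dim: int, num_classes: int):
        super().__init__()
        self.extractors = nn.ModuleList(extractors)                    # FE1, ..., FEn (one per representation)
        self.combiner = AttentionPooling(feature_dim)                  # feature vector combination unit FU
        self.classifier = make_classifier(feature_dim, num_classes)    # classifier C

    def forward(self, representations: list[torch.Tensor]) -> torch.Tensor:
        # representations: one tensor DR_i per input layer IL_i
        feature_vectors = [fe(x) for fe, x in zip(self.extractors, representations)]
        stacked = torch.stack(feature_vectors, dim=1)                  # (batch, n, feature_dim)
        joint = self.combiner(stacked)                                 # joint representation JR
        return self.classifier(joint)                                  # classification result CR (logits)
```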

As already indicated, the input data may include further data about a patient besides the two or more representations of one or more sets of audio data.

Further data can be, e.g., data representing one or more signals other than audio signals (non-audio signal data).

In general, a “signal” is a function that conveys information about a phenomenon and/or an event. Preferably, a signal is a change in a physical quantity over time. Signal data is a digital representation of said signal.

Preferably, the signal is a representation of a body action. Body actions are measurable events that can provide information about a person's health status. Examples of body actions include breathing, cough, snoring, sneezing, hiccups, vomiting, shouting, swallowing, wheezing, shortness of breath, chewing, teeth grinding, chills, convulsions, spasm and/or the like.

For example, coughing produces sound that can be detected as an acoustic signal (audio signal). At the same time, coughing leads to movements of at least the upper body, which can be detected, e.g., as a visible signal by a camera and/or as an electrical signal by means of acceleration sensors. In addition, cough produces electromyographic signals.

The way in which a signal is generated, present and/or detected is also referred to as the “modality” of the signal. To stay with the example of coughing, a cough event usually generates an acoustic signal (audio signal), an acceleration signal and an electromyographic signal and other metrologically detectable signals that can be used for detection, identification and/or characterization of the event causing the signals and/or of a disease. The acoustic signal, the acceleration signal and the electromyographic signal are examples of signals of different modalities.

In a preferred embodiment, the one or more signals relate to the same event. An “event” is a recognizable occurrence at a specific time or during a specific time period. An event can, e.g., be a (specific) body action.

From each signal which is used as further data for the classification as described herein, two or more different representations can be generated, such as a time-domain representation, a frequency-domain representation and/or a time-frequency representation. From such representations, features can be extracted, and the resulting feature vector(s) can be combined, together with feature vectors generated from representations of audio data, into a joint representation.

Additionally or alternatively, further data can comprise patient data such as age, gender, body size, body weight, body mass index, ethnicity, resting heart rate, heart rate variability, blood pressure, sugar concentration in urine, body temperature, impedance (e.g., thoracic impedance), lifestyle information about the life of the patient, such as consumption of alcohol, smoking, and/or exercise and/or the patient’s diet, medical intervention parameters such as regular medication, occasional medication, or other previous or current medical interventions and/or other information about the patient’s previous and/or current treatments and/or reported health conditions and/or combinations thereof.

Further data may comprise information about a person's condition obtained from the person himself/herself (self-assessment data, (electronic) patient-reported outcome data ((e)PRO)).

Further data may comprise one or more medical images. A medical image is a visual representation of the human body or a part thereof.

Techniques for generating medical images include X-ray radiography, computerized tomography, fluoroscopy, magnetic resonance imaging, ultrasonography, endoscopy, elastography, tactile imaging, thermography, microscopy, positron emission tomography and others.

Examples of medical images include CT (computer tomography) scans, X-ray images, MRI (magnetic resonance imaging) scans, fluorescein angiography images, OCT (optical coherence tomography) scans, histopathological images, ultrasound images and others.

Preferably, the artificial neural network is trained in an end-to-end training procedure to assign the input data to one of at least two classes.

Such training is based on training data, the training data comprising, for each person of a multitude of persons, reference input data (including different representations of one or more sets of audio data), and an information whether or not the person suffers from a specific (e.g., respiratory) disease and/or information about the severity of the disease (target).

In the training process, reference input data are inputted into the artificial neural network and the artificial neural network generates an output. The output is compared with the (known) target. Parameters of the artificial neural network (such as network connections and/or other parameters) are modified in order to reduce the deviations between the output and the (known) target to a (defined) minimum.

During training, a loss function can be used to evaluate the prediction accuracy of the network. For example, a loss function can include a metric of comparison of the output and the target. The loss function may be chosen in such a way that it rewards a wanted relation between output and target and/or penalizes an unwanted relation between an output and a target. Such a relation can be, e.g., a similarity, or a dissimilarity, or another relation.

For example, a loss function can be used to calculate a loss value for a given pair of output and target. The aim of the training process can be to modify (adjust) parameters of the machine learning model in order to reduce the loss value to a (defined) minimum.

A loss function may for example quantify the deviation between the output of the machine learning model for a given input and the target. If, for example, the output and the target are numbers, the loss function can be the absolute difference between these numbers. In this case, a high value of the loss function can mean that a parameter of the model needs to undergo a strong change.

In the case of vector-valued outputs, for example, difference metrics between vectors such as the root mean square error, a cosine distance, a norm of the difference vector such as a Euclidean distance, a Chebyshev distance, an Lp-norm of a difference vector, a weighted norm or any other type of difference metric of two vectors can be chosen. These two vectors may for example be the desired output (target) and the actual output.

In the case of higher dimensional outputs, such as two-dimensional, three-dimensional or higher-dimensional outputs, for example an element-wise difference metric may be used. Alternatively or additionally, the output data may be transformed, for example to a one-dimensional vector, before computing a loss function. Common loss functions such as Cross-Entropy Loss, Focal Loss, Label Regression Loss (e.g., L1) and/or others and/or weighted combinations of loss functions can also be used to train the machine learning models described in this description. Modifying model parameters to reduce the loss value(s) calculated by one or more loss functions can be done using an optimization procedure, such as a gradient descent procedure.
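
A single training step of the kind described above, using a cross-entropy loss and a gradient-descent optimizer, might look as follows; the choice of loss function and optimizer interface is an assumption for illustration, not a training procedure prescribed by the disclosure:

```python
import torch
import torch.nn as nn

def training_step(model: nn.Module, optimizer: torch.optim.Optimizer,
                  representations: list[torch.Tensor], target: torch.Tensor) -> float:
    """One gradient-descent update: forward pass, loss against the target, backward pass."""
    loss_fn = nn.CrossEntropyLoss()          # common classification loss
    optimizer.zero_grad()
    output = model(representations)          # network output (class logits)
    loss = loss_fn(output, target)           # deviation between output and (known) target
    loss.backward()                          # gradients of the loss w.r.t. the network parameters
    optimizer.step()                         # modify parameters to reduce the loss value
    return loss.item()
```

A typical call site would construct, for example, a stochastic gradient descent optimizer over model.parameters() and invoke training_step once per batch of reference input data and targets until the loss reaches the defined minimum.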

Fig. 8 shows schematically, by way of example, the process of training an artificial neural network. The artificial neural network ANN is trained on the basis of training data TD. The training data comprise a multitude of data sets, each data set comprising input data ID and target data T. In the example shown in Fig. 8, only one data set comprising input data ID and a target T is shown. The input data ID is inputted into the artificial neural network ANN. The artificial neural network is configured to generate, at least partially on the basis of the input data ID and network parameters, an output O (e.g., a classification result). The output O is compared with the target T. This is done by using a loss function LF, the loss function quantifying the deviations between the output O and the target T. For each pair of an output O and the respective target T, a loss value is computed. During training the network parameters are modified in a way that reduces the loss values to a defined minimum. The aim of the training is to let the artificial neural network generate for each input data an output which comes as close to the corresponding target as possible. Once the defined minimum is reached, the (now fully trained) artificial neural network can be used to predict an output for new input data (input data which have not been used during training and for which the target is usually not (yet) known).

The machine learning model of the present disclosure can be trained, and the trained machine learning model can be used, for example, for the detection of a respiratory disease, such as Coronavirus Disease 2019 (COVID-19), which is caused by Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2).

There are several advantages of using body sounds for screening for a respiratory disease such as COVID-19. First, because PCR (polymerase chain reaction) testing capacities are limited, screening with body sounds, alone or in conjunction with antigen tests, can help prioritize who is eligible for PCR tests. If anyone with flu-like symptoms can order a PCR test, demand will soon exceed the testing capacity. Only the suspect cases indicated by body sound screening need to proceed with PCR tests. Body sound screening can rapidly identify suspect cases without asking them to quarantine while waiting for PCR results. Second, similar to antigen tests, body sound screening is fast, affordable, and can be conducted conveniently without medical professionals. The cost of running body sound screening can even be lower than that of antigen tests because it can be installed as software or as a mobile app on any device and can use the device microphone. Users do not need to buy additional kits and can use their device to record, analyze, and monitor their status an unlimited number of times. This is particularly useful in regions or countries where testing capacities are scarce, inaccessible, or expensive.

Body sounds such as cough, breath, and/or speech contain biomarkers that are indicative of numerous respiratory diseases such as COVID-19 and can be combined using an appropriate fusion rule such as attention.

Unlike research that typically studies each body sound independently, this disclosure combines different body sounds at the feature level. In other words, a machine learning model is trained that learns a joint feature vector for different body sounds. The joint feature vector is optimized to implicitly reflect the relative importance of each body sound in the final prediction.
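
A minimal sketch of such an attention-based fusion at the feature level is given below (PyTorch; the module name AttentionFusion, the feature dimension and the use of a single linear scoring layer are illustrative assumptions, not the specific fusion rule of this disclosure):

import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Attention-based pooling: fuses the feature vectors of the individual body
    sounds into a single joint feature vector using learned attention weights."""
    def __init__(self, feature_dim=256):
        super().__init__()
        self.score = nn.Linear(feature_dim, 1)   # learns one relevance score per feature vector

    def forward(self, features):                 # features: (batch, n_sounds, feature_dim)
        weights = torch.softmax(self.score(features), dim=1)   # relative importance of each sound
        return (weights * features).sum(dim=1)   # weighted sum = joint feature vector

# Usage: fuse cough, breath and speech feature vectors (batch of 4, 256-dimensional each).
cough, breath, speech = (torch.randn(4, 256) for _ in range(3))
joint = AttentionFusion(256)(torch.stack([cough, breath, speech], dim=1))
print(joint.shape)   # torch.Size([4, 256])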

The operations in accordance with the teachings herein may be performed by at least one computer system specially constructed for the desired purposes or at least one general-purpose computer system specially configured for the desired purpose by at least one computer program stored in a typically non-transitory computer readable storage medium.

The term “non-transitory” is used herein to exclude transitory, propagating signals or waves, but to otherwise include any volatile or non-volatile computer memory technology suitable to the application.

A “computer system” is a system for electronic data processing that processes data by means of programmable calculation rules. Such a system usually comprises a “computer”, that unit which comprises a processor for carrying out logical operations, and also peripherals. In computer technology, “peripherals” refer to all devices which are connected to the computer and serve for the control of the computer and/or as input and output devices. Examples thereof are monitor (screen), printer, scanner, mouse, keyboard, drives, camera, microphone, loudspeaker, etc. Internal ports and expansion cards are also considered to be peripherals in computer technology.

Computer systems of today are frequently divided into desktop PCs, portable PCs, laptops, notebooks, netbooks, tablet PCs and so-called handhelds (e.g., smartphones); all these systems can be utilized for carrying out the invention.

The term “process” as used above is intended to include any type of computation or manipulation or transformation of data represented as physical, e.g., electronic, phenomena which may occur or reside, e.g., within registers and/or memories of at least one computer or processor. The term processor includes a single processing unit or a plurality of distributed or remote such units.

Fig. 9 illustrates a computer system (1) according to some example implementations of the present disclosure in more detail.

Generally, a computer system of exemplary implementations of the present disclosure may be referred to as a computer and may comprise, include, or be embodied in one or more fixed or portable electronic devices. The computer may include one or more of each of a number of components such as, for example, a processing unit (20) connected to a memory (50) (e.g., storage device).

The processing unit (20) may be composed of one or more processors alone or in combination with one or more memories. The processing unit (20) is generally any piece of computer hardware that is capable of processing information such as, for example, data, computer programs and/or other suitable electronic information. The processing unit (20) is composed of a collection of electronic circuits some of which may be packaged as an integrated circuit or multiple interconnected integrated circuits (an integrated circuit at times more commonly referred to as a “chip”). The processing unit (20) may be configured to execute computer programs, which may be stored onboard the processing unit (20) or otherwise stored in the memory (50) of the same or another computer.

The processing unit (20) may be a number of processors, a multi-core processor or some other type of processor, depending on the particular implementation. For example, it may be a central processing unit (CPU), a field programmable gate array (FPGA), a graphics processing unit (GPU) and/or a tensor processing unit (TPU). Further, the processing unit (20) may be implemented using a number of heterogeneous processor systems in which a main processor is present with one or more secondary processors on a single chip. As another illustrative example, the processing unit (20) may be a symmetric multi-processor system containing multiple processors of the same type. In yet another example, the processing unit (20) may be embodied as or otherwise include one or more ASICs, FPGAs or the like. Thus, although the processing unit (20) may be capable of executing a computer program to perform one or more functions, the processing unit (20) of various examples may be capable of performing one or more functions without the aid of a computer program. In either instance, the processing unit (20) may be appropriately programmed to perform functions or operations according to example implementations of the present disclosure.

The memory (50) is generally any piece of computer hardware that is capable of storing information such as, for example, data, computer programs (e.g., computer-readable program code (60)) and/or other suitable information either on a temporary basis and/or a permanent basis. The memory (50) may include volatile and/or non-volatile memory, and may be fixed or removable. Examples of suitable memory include random access memory (RAM), read-only memory (ROM), a hard drive, a flash memory, a thumb drive, a removable computer diskette, an optical disk, a magnetic tape or some combination of the above. Optical disks may include compact disk - read only memory (CD-ROM), compact disk - read/write (CD-R/W), DVD, Blu-ray disk or the like. In various instances, the memory may be referred to as a computer-readable storage medium. The computer-readable storage medium is a non-transitory device capable of storing information, and is distinguishable from computer-readable transmission media such as electronic transitory signals capable of carrying information from one location to another. Computer-readable medium as described herein may generally refer to a computer-readable storage medium or computer-readable transmission medium.

The machine learning model, the trained machine learning model and the training data may be stored in the memory (50).

In addition to the memory (50), the processing unit (20) may also be connected to one or more interfaces for displaying, transmitting and/or receiving information. The interfaces may include one or more communications interfaces and/or one or more user interfaces. The communications interface(s) may be configured to transmit and/or receive information, such as to and/or from other computer(s), network(s), database(s) or the like. The communications interface may be configured to transmit and/or receive information by physical (wired) and/or wireless communications links. The communications interface(s) may include interface(s) (41) to connect to a network, such as using technologies such as cellular telephone, Wi-Fi, satellite, cable, digital subscriber line (DSL), fiber optics and the like. In some examples, the communications interface(s) may include one or more short-range communications interfaces (42) configured to connect devices using short-range communications technologies such as NFC, RFID, Bluetooth, Bluetooth LE, ZigBee, infrared (e.g., IrDA) or the like.

The user interfaces may include a display (30). The display may be configured to present or otherwise display information to a user, suitable examples of which include a liquid crystal display (LCD), light-emitting diode display (LED), plasma display panel (PDP) or the like. The user input interface(s) (11) may be wired or wireless, and may be configured to receive information from a user into the computer system (1), such as for processing, storage and/or display. Suitable examples of user input interfaces include a microphone, image or video capture device, keyboard or keypad, joystick, touch-sensitive surface (separate from or integrated into a touchscreen) or the like. In some examples, the user interfaces may include automatic identification and data capture (AIDC) technology (12) for machine-readable information. This may include barcode, radio frequency identification (RFID), magnetic stripes, optical character recognition (OCR), integrated circuit card (ICC), and the like. The user interfaces may further include one or more interfaces for communicating with peripherals such as printers and the like.

As indicated above, program code instructions (60) may be stored in memory (50), and executed by processing unit (20) that is thereby programmed, to implement functions of the systems, subsystems, tools and their respective elements described herein. As will be appreciated, any suitable program code instructions (60) may be loaded onto a computer or other programmable apparatus from a computer-readable storage medium to produce a particular machine, such that the particular machine becomes a means for implementing the functions specified herein. These program code instructions (60) may also be stored in a computer-readable storage medium that can direct a computer, processing unit or other programmable apparatus to function in a particular manner to thereby generate a particular machine or particular article of manufacture. The instructions stored in the computer-readable storage medium may produce an article of manufacture, where the article of manufacture becomes a means for implementing functions described herein. The program code instructions (60) may be retrieved from a computer-readable storage medium and loaded into a computer, processing unit or other programmable apparatus to configure the computer, processing unit or other programmable apparatus to execute operations to be performed on or by the computer, processing unit or other programmable apparatus.

Retrieval, loading and execution of the program code instructions (60) may be performed sequentially such that one instruction is retrieved, loaded and executed at a time. In some example implementations, retrieval, loading and/or execution may be performed in parallel such that multiple instructions are retrieved, loaded, and/or executed together. Execution of the program code instructions (60) may produce a computer-implemented process such that the instructions executed by the computer, processing circuitry or other programmable apparatus provide operations for implementing functions described herein. Execution of instructions by processing unit, or storage of instructions in a computer-readable storage medium, supports combinations of operations for performing the specified functions. In this manner, a computer system (1) may include processing unit (20) and a computer-readable storage medium or memory (50) coupled to the processing circuitry, where the processing circuitry is configured to execute computer-readable program code instructions (60) stored in the memory (50). It will also be understood that one or more functions, and combinations of functions, may be implemented by special purpose hardware-based computer systems and/or processing circuitry which perform the specified functions, or combinations of special purpose hardware and program code instructions.

Further embodiments of the present invention are:

1. A computer system configured to carry out the following steps: receiving one or more sets of audio data, each set of audio data representing a body sound, generating at least two different representations from each set of audio data, extracting features from each representation, generating a joint representation on the basis of all features, generating a classification result on the basis of the joint representation, the classification result indicating whether and/or to what extent the one or more sets of audio data contain an indication of the presence of a disease, outputting the classification result.

2. The computer system according to embodiment 1, wherein the body sound is produced by a patient suffering from a respiratory disease.

3. The computer system according to embodiment 1 or 2, wherein the body sound is a sound resulting from one or more of the following: coughing, breathing, talking, singing, screaming, swallowing, wheezing, speech, voice, rales.

4. The computer system according to any one of the embodiments 1 to 3, wherein different sets of audio data are created from at least two of the following: coughing, breathing, voice.

5. The computer system according to any one of the embodiments 1 to 4, wherein from each set of audio data a time-domain representation and a time-frequency representation are generated.

6. The computer system according to any one of the embodiments 1 to 5, wherein the different representations of audio data have different resolutions.

7. The computer system according to any one of the embodiments 1 to 6, wherein an attention mechanism is used for generating the joint representation, the attention mechanism being a self-attention mechanism.

8. The computer system according to any one of the embodiments 1 to 7, wherein an attention mechanism is used for generating the joint representation, the attention mechanism being an attention-based pooling.

9. The computer system according to any one of the embodiments 1 to 8, wherein the steps of extracting features, generating the joint representation and generating a classification result are executed using a trained artificial neural network.

10. The computer system according to the embodiment 9, wherein the artificial neural network comprises an input layer for each representation of the different representations, a feature extraction unit for each representation of the different representations, wherein each feature extraction unit is configured to extract features from the representation and generate a feature vector, a feature vector combination unit which is configured to generate a joint representation on the basis of all feature vectors, and a classifier which is configured to assign the joint representation to one of at least two classes.

11. The computer system according to any one of the embodiments 1 to 10, wherein the joint representation is assigned to one of two classes, one class representing patients suffering from a respiratory disease, such as chronic obstructive pulmonary disease, coronavirus disease 2019, bronchitis, chronic bronchitis, emphysema, cystic fibrosis, pneumonia, tuberculosis, interstitial lung disease, pulmonary hypertension, chronic cough, the other class representing patients not suffering from the respiratory disease, or one of a plurality of classes, each class representing a severity of a respiratory disease, or one of several classes, wherein one class represents patients who do not have any respiratory disease and each of the remaining classes represents a different respiratory disease.

12. A computer system that is configured to carry out the following steps:
receiving one or more sets of audio data, each set of audio data representing one or more body sounds,
providing a trained artificial neural network, wherein the trained artificial neural network comprises an output and, for each set of audio data, a first input and a second input,
for each set of audio data:
o generating a time-domain representation of the audio data,
o generating a time-frequency representation of the audio data,
o inputting the time-domain representation into the first input of the trained artificial neural network and the time-frequency representation into the second input of the artificial neural network, wherein the artificial neural network is configured and trained to
■ generate a first feature vector on the basis of the time-domain representation,
■ generate a second feature vector on the basis of the time-frequency representation,
■ generate a joint representation on the basis of the first feature vector and the second feature vector, preferably using an attention mechanism, and
■ generate, on the basis of the joint representation, a classification result, the classification result indicating whether and/or to what extent the one or more sets of audio data contain an indication of the presence of a disease,
receiving, from the trained machine learning model, the classification result,
outputting the classification result.

13. The computer system according to any one of the embodiments 1 to 12, wherein the computer system is further configured to carry out the following steps: receiving non-audio signal data, generating at least two different representations from the non-audio signal data, extracting features from each representation, generating the joint representation on the basis of all features, preferably using an attention mechanism, generating the classification result on the basis of the joint representation, the classification result indicating whether and/or to what extent the one or more sets of audio data and non-audio signal data contain an indication of the presence of a disease, outputting the classification result.

14. The computer system according to any one of the embodiments 1 to 13, wherein the computer system is further configured to carry out the following steps: receiving further patient data, extracting features from the patient data, generating the joint representation on the basis of all features, preferably using an attention mechanism, generating the classification result on the basis of the joint representation, outputting the classification result.

16. A non-transitory computer readable medium having stored thereon software instructions that, when executed by a processor of a computer system, cause the computer system to execute the following steps: receiving one or more sets of audio data, each set of audio data representing a body sound, generating at least two different representations from each set of audio data, extracting features from each representation, generating a joint representation on the basis of all features, generating a classification result on the basis of the joint representation, the classification result indicating whether and/or to what extent the one or more sets of audio data contain an indication of the presence of a disease, outputting the classification result.

17. The computer readable medium according to embodiment 16, wherein the body sound is produced by a patient suffering from a respiratory disease.

18. The computer readable medium according to embodiment 16 or 17, wherein the body sound is a sound resulting from one or more of the following: coughing, breathing, talking, singing, screaming, swallowing, wheezing, speech, voice, rales.

19. The computer readable medium according to any one of the embodiments 16 to 18, wherein different sets of audio data are created from at least two of the following: coughing, breathing, voice.

20. The computer readable medium according to any one of the embodiments 16 to 19, wherein from each set of audio data a time-domain representation and a time-frequency representation are generated.

21. The computer readable medium according to any one of the embodiments 16 to 20, wherein the different representations of audio data have different resolutions.

22. The computer readable medium according to any one of the embodiments 16 to 21, wherein an attention mechanism is used for generating the joint representation, the attention mechanism being a self-attention mechanism.

23. The computer readable medium according to any one of the embodiments 16 to 22, wherein an attention mechanism is used for generating the joint representation, the attention mechanism being an attention-based pooling.

24. The computer readable medium according to the embodiments 16 to 23, wherein the steps of extracting features, generating the joint representation and generating a classification result are executed using a trained artificial neural network.

25. The computer readable medium according to the embodiment 24, wherein the artificial neural network comprises an input layer for each representation of the different representations, a feature extraction unit for each representation of the different representations, wherein each feature extraction unit is configured to extract features from the representation and generate a feature vector, a feature vector combination unit which is configured to generate a joint representation on the basis of all feature vectors, and a classifier which is configured to assign the joint representation to one of at least two classes.

26. The computer readable medium according to any one of the embodiments 16 to 25, wherein the joint representation is assigned to one of two classes, one class representing patients suffering from a respiratory disease, such as chronic obstructive pulmonary disease, coronavirus disease 2019, bronchitis, chronic bronchitis, emphysema, cystic fibrosis, pneumonia, tuberculosis, interstitial lung disease, pulmonary hypertension, chronic cough, the other class representing patients not suffering from the respiratory disease, or one of a plurality of classes, each class representing a severity of a respiratory disease, or one of several classes, wherein one class represents patients who do not have any respiratory disease and each of the remaining classes represents a different respiratory disease.

27. A non-transitory computer readable medium having stored thereon software instructions that, when executed by a processor of a computer system, cause the computer system to execute the following steps:
receiving one or more sets of audio data, each set of audio data representing one or more body sounds,
providing a trained artificial neural network, wherein the trained artificial neural network comprises an output and, for each set of audio data, a first input and a second input,
for each set of audio data:
o generating a time-domain representation of the audio data,
o generating a time-frequency representation of the audio data,
o inputting the time-domain representation into the first input of the trained artificial neural network and the time-frequency representation into the second input of the artificial neural network, wherein the artificial neural network is configured and trained to
■ generate a first feature vector on the basis of the time-domain representation,
■ generate a second feature vector on the basis of the time-frequency representation,
■ generate a joint representation on the basis of the first feature vector and the second feature vector, preferably using an attention mechanism, and
■ generate, on the basis of the joint representation, a classification result, the classification result indicating whether and/or to what extent the one or more sets of audio data contain an indication of the presence of a disease,
receiving, from the trained machine learning model, the classification result,
outputting the classification result.

28. The computer readable medium according to any one of the embodiments 16 to 27, wherein the software instructions further cause the computer system to execute the following steps: receiving non-audio signal data, generating at least two different representations from the non-audio signal data, extracting features from each representation, generating the joint representation on the basis of all features, preferably using an attention mechanism, generating the classification result on the basis of the joint representation, the classification result indicating whether and/or to what extent the one or more sets of audio data and non-audio signal data contain an indication of the presence of a disease, outputting the classification result.

29. The computer readable medium according to any one of the embodiments 16 to 28, wherein the software instructions further cause the computer system to execute the following steps: receiving further patient data, extracting features from the patient data, generating the joint representation on the basis of all features, preferably using an attention mechanism, generating the classification result on the basis of the joint representation, outputting the classification result.
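
For illustration only, an architecture along the lines of embodiments 10, 12, 25 and 27 could be sketched as follows (PyTorch; the branch designs, layer sizes, the attention mechanism and all names are illustrative assumptions rather than the specific networks described herein):

import torch
import torch.nn as nn

class HybridAudioClassifier(nn.Module):
    """Two inputs per set of audio data: a time-domain branch (1D CNN on the waveform)
    and a time-frequency branch (2D CNN on the spectrogram), an attention-based
    feature vector combination unit, and a classifier."""
    def __init__(self, feature_dim=128, n_classes=2):
        super().__init__()
        # Feature extraction unit for the time-domain representation (waveform).
        self.waveform_branch = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=9, stride=4), nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=9, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(32, feature_dim),
        )
        # Feature extraction unit for the time-frequency representation (spectrogram).
        self.spectrogram_branch = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, feature_dim),
        )
        # Feature vector combination unit (attention-based pooling) and classifier.
        self.attention = nn.Linear(feature_dim, 1)
        self.classifier = nn.Linear(feature_dim, n_classes)

    def forward(self, waveform, spectrogram):
        # waveform: (batch, 1, samples); spectrogram: (batch, 1, mel_bins, frames)
        f1 = self.waveform_branch(waveform)        # first feature vector
        f2 = self.spectrogram_branch(spectrogram)  # second feature vector
        stacked = torch.stack([f1, f2], dim=1)     # (batch, 2, feature_dim)
        weights = torch.softmax(self.attention(stacked), dim=1)
        joint = (weights * stacked).sum(dim=1)     # joint representation
        return self.classifier(joint)              # classification result (logits)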

Fig. 10 shows schematically an embodiment of the computer-implemented method according to the present invention in the form of a flow chart.

The method (100) comprises:

(110) receiving one or more sets of audio data, each set of audio data representing a body sound, and optionally non-audio signal data and/or further patient data,

(120) generating at least two different representations from each set of audio data and optionally from each of the non-audio signal data,

(130) extracting features from each representation and optionally from the further patient data,

(140) generating a joint representation on the basis of all features,

(150) generating a classification result on the basis of the joint representation, the classification result indicating whether and/or to what extent the one or more sets of audio data and optionally the non-audio signal data and/or further patient data contain an indication of the presence of a disease,

(160) outputting the classification result.
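
Purely as an illustration of steps (110) to (160), the following sketch processes a single set of audio data with a trained two-input model such as the one sketched above (the librosa-based preprocessing, the sampling rate and the number of mel bands are illustrative assumptions):

import librosa
import torch

def classify_body_sound(audio_path, model, sr=16000):
    """Sketch of steps (110)-(160) for a single set of audio data: receive the audio,
    generate a time-domain and a time-frequency representation, let the trained model
    extract and fuse features and classify, then output the result."""
    waveform, _ = librosa.load(audio_path, sr=sr, mono=True)            # (110) receive audio data
    mel = librosa.feature.melspectrogram(y=waveform, sr=sr, n_mels=64)
    log_mel = librosa.power_to_db(mel)                                  # (120) two representations
    w = torch.tensor(waveform, dtype=torch.float32).view(1, 1, -1)
    s = torch.tensor(log_mel, dtype=torch.float32).view(1, 1, *log_mel.shape)
    with torch.no_grad():                      # (130)-(150) extract features, fuse, classify
        probs = torch.softmax(model(w, s), dim=-1)
    return probs.squeeze(0).tolist()           # (160) output the classification result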

Example

Tables 1, 2 and 3 summarize the classification results of different classification models in terms of the arithmetically averaged AUC values (AUC: area under the curve; see, e.g.: F. Cady: Data Science - The Executive Summary, Wiley 2020, ISBN: 9781119544173, in particular chapter 5.2.3).
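
For illustration, an AUC value can be computed from ground-truth labels and model scores, for example with scikit-learn (the labels and scores below are arbitrary illustrative values, not results from the tables):

import numpy as np
from sklearn.metrics import roc_auc_score

# Illustrative ground-truth labels (1 = disease present) and model scores.
y_true = np.array([0, 0, 1, 1, 0, 1])
y_score = np.array([0.10, 0.35, 0.80, 0.65, 0.20, 0.90])

auc = roc_auc_score(y_true, y_score)   # area under the ROC curve
print(f"AUC = {auc:.3f}")

# Arithmetic averaging of AUC values over several runs/folds (illustrative values only).
print(np.mean([0.78, 0.81, 0.80]))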

Different sets of audio data were used for classifying the audio data into one of two classes, one class representing patients suffering from COVID-19, the other class representing patients not suffering from COVID-19.

Persons were asked to cough (Cough), to breathe in and out deeply (Breath), and to pronounce defined words (Speech).

The column “Body Sound Fusion Technique” indicates how the feature vectors generated from the representations of the audio data have been combined.

The classification was performed on the basis of a time-domain representation (Waveform), on the basis of a time-frequency representation (Spectrogram), and on the basis of two representations (Hybrid): a time-frequency representation and a time-domain representation.

The following results were obtained:

The use of different audio data sets leads to a higher accuracy of the classification model than the use of only one audio data set.

Using two different representations of an audio dataset leads to higher accuracy than using only one representation.

Using an attention mechanism leads to higher accuracy of the classification model than combining (e.g., concatenating) feature vectors without attention.

Table 1: Performance comparison between using a single representation (waveform or spectrogram) vs. multi-representation (hybrid)

(*) A simple 1D convolutional neural network
(**) A Vision Transformer network

Table 2: Performance comparison between models using waveform representation with and without attention

Table 3: Performance comparison between models using spectrogram representation with and without attention (*) without attention = just concatenation