

Title:
SYSTEM FOR IMPROVING THE SPEECH INTELLIGIBILITY OF PEOPLE WITH TEMPORARY OR PERMANENT SPEECH DIFFICULTIES
Document Type and Number:
WIPO Patent Application WO/2024/056899
Kind Code:
A1
Abstract:
A system for improving the speech intelligibility of people with temporary or permanent speech difficulties is described, said system comprising a microphone, a speaker, a processor and a memory which stores a speech synthesis algorithm, said processor being configured to perform, by means of the speech synthesis algorithm, the following steps: - acquisition of a microphone signal at a predefined sampling frequency and storage of the samples acquired in blocks comprising a predefined number of samples; - filtering of the samples in each of said blocks; - subdivision of each block into frames containing a predefined number of samples, where each frame in a block has a predefined overlap of samples with at least one of a following frame and a preceding frame; - for each frame of samples acquired, generation of a synthesized frame with n partials; - processing the synthesized frames and output of the synthesized sound by means of the inverse overlap-add method.

Inventors:
CENCESCHI SONIA (CH)
DANI FRANCESCO ROBERTO (IT)
TRIVILINI ALESSANDRO (CH)
COLLETTI ELISA (CH)
Application Number:
PCT/EP2023/075531
Publication Date:
March 21, 2024
Filing Date:
September 15, 2023
Assignee:
SPINELLI HOLDING SA (CH)
International Classes:
G10L21/057
Foreign References:
US20060004569A12006-01-05
US20120150544A12012-06-14
Other References:
HANSON B. ET AL: "Speech enhancement with harmonic synthesis", Proc. IEEE ICASSP, vol. 8, 1 January 1983 (1983-01-01), pages 1122 - 1125, XP093017091, Retrieved from the Internet [retrieved on 20230202], DOI: 10.1109/ICASSP.1983.1171978
HAMID REZA SHARIFZADEH ET AL: "Reconstruction of Normal Sounding Speech for Laryngectomy Patients Through a Modified CELP Codec", IEEE TRANSACTIONS ON BIOMEDICAL ENGINEERING, IEEE, USA, vol. 57, no. 10, 1 October 2010 (2010-10-01), pages 2448 - 2458, XP011327030, ISSN: 0018-9294, DOI: 10.1109/TBME.2010.2053369
Attorney, Agent or Firm:
M. ZARDI & CO S.A. (CH)
Claims:
CLAIMS

1. System for improving the speech intelligibility of people with temporary or permanent speech difficulties, comprising a microphone, a speaker, a processor and a memory which stores a speech synthesis algorithm, said processor being configured to carry out, by means of the speech synthesis algorithm, the following steps:

- acquisition of a microphone signal at a predefined sampling frequency and storage of the samples acquired in blocks comprising a predefined number of samples;

- filtering of the samples in each of said blocks;

- subdivision of each block into frames containing a predefined number of samples, where each frame in a block has a predefined overlap of samples with at least one of a following frame and a preceding frame;

- for each of said sample frames, the processor being configured to perform the following processing:

* calculate the root mean square (RMS) of the samples in the frame and detect voice activity (VAD) in the frame on the basis of the calculated root mean square (RMS);

* if voice activity is detected (VAD) in the frame, carry out a linear predictive coding (LPC) in order to determine the characteristics of the samples in the frame, said characteristics comprising formant frequencies, and bandwidths and amplitudes of the formant frequencies;

* determine if the frame includes any one of vowels or any one of consonants;

- for each frame of acquired samples, generation of a synthesized frame having n partials, wherein the processor is configured to attribute to each partial a set volume based on the characteristics of the samples in the frames, wherein the synthesized frame is generated by means of additive synthesis of sinusoids with amplitudes and bandwidths of the formant frequencies, if the frame includes any one of said vowels, and the synthesized frame is generated by means of a random noise generation algorithm, if the frame includes any one of the consonants;

- calculation of a Fast Fourier Transform (FFT) on each synthesized frame and on each frame of acquired samples, providing as output two respective vectors of complex numbers, wherein the magnitudes are the moduli of the complex numbers;

- calculation of two magnitude masks of the transform (FFT) of the frame of acquired samples, said two magnitude masks being calculated as the average of the spectral magnitudes of the samples in the frame using two different average dimensions, and subsequent normalization;

- execution of spectral multiplications of the magnitudes of the transform (FFT) of the synthesized frame and the magnitudes in the two magnitude masks of the transform (FFT) of the frame of acquired samples, wherein the modulus of the complex numbers of the Fast Fourier Transform of the synthesized frame is substituted by the values obtained by the spectral multiplications;

- time translation of the synthesized and multiplied frame, by means of an inverse transform (IFFT);

- processing of the synthesized and time-translated frames by means of a Bartlett window and output of the synthesized sound by means of the inverse overlap-add method.

2. System according to Claim 1, wherein the processor is configured to determine if the frame includes any one of said vowels or any one of said consonants by means of:

-computation of power spectrum values as the square of the magnitudes in the vector output from the Fast Fourier Transform (FFT) on each frame, and storing the power spectrum values into a power spectrum vector;

-mapping the power spectrum values of the power spectrum vector into triangular windows, wherein each triangular window includes a number of values lower than the number of power spectrum values in the power spectrum vector, and each triangular window is associated to power spectrum values stored in a portion of the power spectrum vector among a plurality of portions, wherein each portion of the power spectrum vector includes at least one value of another portion, the power spectrum values being indexed in each portion and values in each triangular window being indexed in each triangular window; wherein said mapping multiplies values of each triangular window with the power spectrum values of the associated portion of the power spectrum vector, wherein a value with index i in the triangular window is multiplied with a value with index i in the associated portion of the power spectrum vector, and wherein results of said multiplication are stored in a power spectrum matrix;

-in the matrix, for each row in the matrix, and for each power spectrum value in said each row, applying the logarithmic function to obtain a corresponding logarithmic value of the power spectrum value;

-summing all the corresponding logarithmic values obtained from the row to obtain a new power spectrum value and

-determining a new vector including the new power spectrum values obtained by summing the corresponding logarithmic values of all the rows in the matrix;

-applying the discrete cosine transform to the power spectrum values of the new vector and obtaining a vector of LFCC features corresponding to the energy values;

-dividing the vector of the LFCC features into two sub-vectors of equal length corresponding, respectively, to the energy values in a first range and in a second range, the corresponding frequencies of the energy values in the first range being lower than the corresponding frequencies of the energy values in the second range;

-calculating the average of the energy values in the first and second sub-vectors to determine two corresponding first and second scalar values representative of the amount of power in the first range and second range; wherein if the first scalar value is greater than the second scalar value, the frame is determined to include any one of said vowels; if the first scalar value is lower than the second scalar value, the frame is determined to include any one of said consonants.

3. System according to Claim 1, characterized in that said predefined number of samples is 1024 samples per block and said sampling frequency is 44,100 Hz.

4. System according to Claim 1, characterized in that said processor is further configured to perform the step of eliminating the audio return from each block, said elimination step being carried out if the distance between the microphone and the loudspeaker is less than a predefined threshold value, said predefined threshold value being stored in the memory and corresponding to the physical distance between the microphone and the loudspeaker incorporated in an apparatus of the system which can be worn by persons with said speech difficulties.

5. System according to Claim 4, characterized in that it comprises an anti-Larsen filter and in that the processor is configured to carry out said step of eliminating the audio return by means of the anti-Larsen filter.

6. System according to Claim 1, characterized in that it comprises a high-pass filter with a cut-off frequency at 100 Hz and in that the processor is configured to perform said step of filtering each block by means of the high-pass filter.

7. System according to Claim 1, characterized in that each frame comprises 256 samples and in that each frame has a predefined overlap of samples with a following frame and preceding frame, the overlap being preferably 50% (128 samples).

8. System according to Claim 7, characterized in that the processor is configured to store each sample of a frame in a respective stack of samples, said stack being in said memory.

9. System according to Claim 7, characterized in that the processor is configured to perform said subdivision by means of a Bartlett window or Hamming window.

10. System according to Claim 1, characterized in that the number n of partials is 80.

11. System according to Claim 8, characterized in that the processor is configured to store the characteristics of the samples in the stack.

12. System according to Claim 11, characterized in that the processor is configured to apply moving average filters to the characteristics of the samples in the stack, in order to reduce the variability of the characteristics of the samples between overlapping frames.

13. System according to Claim 1, characterized in that the processor is configured to apply a noise filter to the n partials and a Bartlett window.

14. System according to Claim 1, characterized in that said processor is configured to perform a reduction of the F1 and F2 formant frequencies, by 100 Hz and 150 Hz respectively, of the samples in the frame.

Description:
Title: System for improving the speech intelligibility of people with temporary or permanent speech difficulties

Field of application

The present invention relates to a system for improving the speech intelligibility of people with temporary or permanent speech difficulties. In particular, the present invention relates to a system of the aforementioned type provided with a microphone, a speaker, a processor and a memory which stores a speech synthesis algorithm.

Prior art

A number of speech recognition systems which are based on a speech corpus, namely a database of speech audio files and corresponding text transcriptions, are known. The speech corpus is used to create an acoustic model which is then used for speech recognition. These systems are useful for recognizing speech which is audible to humans and are therefore used in various applications, including in the telecommunications sector.

However, the speech corpus for recognizing whispered speech, during which the vocal cords do not vibrate, is not available, at least in some languages, such as Italian, to mention just one. Therefore the aforementioned recognition systems are not suitable for improving the speech intelligibility of persons with temporary or permanent speech difficulties.

Other systems, based exclusively on the audio signal which can be acquired from persons with speech difficulties, are also known; these consist in increasing the intensity of the signal for the listener. However, it has been found that these systems too are entirely unsuitable for improving the speech intelligibility of persons with speech difficulties, which in fact requires the reconstruction of absent or very muted parts of the speech, which are not acquired with the audio signal of the speech-impaired person.

The absent parts are segment components such as vowels, consonants or silence or suprasegmental components such as intonation. These parts are absent in so-called whispered or more precisely soft-whispered speech, which is the speech which the present invention in particular intends to improve. Soft-whispered speech has certain specific acoustic characteristics which are partly indicated below:

- no or limited vibration of the vocal cords, which results in the consequent absence of fundamental peaks and the consequent absence or reduction of harmonic relations between the formants;

- near total absence of F0 formants and the associated acoustic characteristics;

- flatter frequency distribution of the formants between 500 and 2000 Hz;

- generally higher frequency of the formants, especially the F1 formant;

- the F2 formant is similar to the F1 formant;

- the F2 and F1 formants normally have a power at least 20 dB lower than in the phonated version;

- the vowels have a longer duration.

The limitations of the systems briefly discussed above are greater in noisy environments or during voice telephone calls; these systems lack echo equalization or cancellation and therefore only partially improve the quality of speech, without significantly improving its intelligibility for the interlocutor (listener).

Also known are other systems used, for example, for patients who have undergone a laryngectomy, but these are outside the application context of the present invention, since they are invasive to use, being intended for patients without vocal cords and not for patients with a vocal apparatus which is healthy but affected by muscular or pulmonary problems, for example.

The technical problem underlying the present invention is therefore to devise a system for improving the speech intelligibility of persons with temporary or permanent speech difficulties, which is able to overcome the limitations which currently affect the already known systems.

Summary of the invention

The idea underlying the present invention is to provide a system for improving the speech intelligibility of persons with temporary or permanent speech difficulties, which is able to reconstruct also parts of the speech which are substantially inaudible to humans.

The system recognizes the voice of patients who are able to perform only minimum muscular movements, for example because they have a pulmonary or muscular force which is insufficient to emit a sound (fully) audible to humans, or at least insufficient to emit a continuous vocalized sound. These patients are able to articulate speech well (i.e. move muscles, face, tongue, etc., correctly), but the continuous sound emitted by them is difficult to understand owing to the lack of power and volume. These patients in fact cannot make their vocal cords vibrate in the correct manner, and their speech is aphonic and barely audible.

The system according to the invention is able to convert non-vocalized speech into intelligible sound, and therefore assist patients who are capable of articulating speech well, but are unable to make their vocal cords vibrate, with resultant speech which is comparable to an extremely low-powered whisper. Recognition and synthesis of the patient’s speech are performed with a minimum delay (a few milliseconds), in keeping with commercially available speech recognition and processing systems used in other application environments, for example the Siri system.

The system comprises a microphone, a speaker, a processor and a memory which stores a speech synthesis algorithm. The processor, by executing the algorithm, reconstructs the parts of speech which are inaudible to humans and reproduces for the listener a synthesis signal which is perfectly intelligible for humans. The rapidity of execution of the algorithm and the quality (intelligibility) of the synthesis signal emitted are superior to the performance of the currently known systems which are used for similar purposes.

According to the proposed solution indicated above, the technical problem is solved by a system for improving speech intelligibility according to Claim 1.

Preferred embodiments of the system according to the present invention are described in Claims 2-14.

Further characteristic features and advantages of the system according to the present invention are described with reference to the attached drawings provided purely by way of a non-limiting example of the present invention.

Brief description of the attached drawings

Figure 1 is a block diagram of the main steps of an algorithm executed by the system for improving speech intelligibility according to the present invention.

Figure 2 is a photograph of a prototype of the system according to Figure 1.

Detailed description of the invention

An example of embodiment of the system for improving speech intelligibility according to the present invention is illustrated in the description below.

The improvement system is suitable for improving the speech intelligibility of persons with temporary or permanent speech difficulties.

The system comprises a microphone, a speaker, a processor and a memory which stores a speech synthesis algorithm.

The system is incorporated in an apparatus which may be worn by the user. For example, the microphone may be supported by a boom in the vicinity of the mouth and earphones may be fitted over the user’s ears. The earphones are optional and speakers may be used.

The processor is configured to execute, by means of the speech synthesis algorithm, the detailed steps indicated below.

Firstly, the microphone signal is acquired at a predefined sampling frequency.

The acquired samples are stored in memory in blocks comprising a predefined number of samples.

The samples in each of the blocks are preferably filtered.

Each block is then divided up into frames containing a predefined number of samples. Each frame in a block has a predefined overlap of samples with at least one of a following frame and a preceding frame.
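By way of a non-limiting illustration, the following minimal Python sketch (Python being the language of the first coding of the algorithm, as noted in the detailed description below) shows this subdivision; the block size, frame size and overlap are those of the preferred embodiment.

import numpy as np

# Sketch only: split one block into overlapping frames.
# 256-sample frames with a 50% overlap (128-sample hop), per the
# preferred embodiment; the block is 1024 samples at 44,100 Hz.
FRAME_LEN = 256
HOP = FRAME_LEN // 2

def split_into_frames(block):
    n_frames = 1 + (len(block) - FRAME_LEN) // HOP
    return np.stack([block[i * HOP:i * HOP + FRAME_LEN]
                     for i in range(n_frames)])

block = np.random.randn(1024)      # stands in for one acquired block
frames = split_into_frames(block)  # shape (7, 256)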

For each of the sample frames, the processor is configured to perform the following calculations:

* calculate the root mean square (RMS) of the samples in the frame and detect voice activity (VAD) in the frame on the basis of the calculated root mean square (RMS), as sketched in code after this list;

* if voice activity is detected (VAD) in the frame, carry out a linear predictive coding (LPC) in order to determine the characteristics of the samples in the frame, said characteristics comprising the formant frequencies, and the bandwidth and amplitude thereof;

* determine if the frame includes any one of vowels or any one of consonants.
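As referenced in the first item of the list above, the following is a minimal sketch of the RMS computation and the RMS-based voice activity detection; the threshold is a hypothetical tuning value, not specified in the application.

import numpy as np

RMS_THRESHOLD = 0.01  # assumed value; depends on microphone gain

def frame_rms(frame):
    # Root mean square of the samples in the frame.
    return float(np.sqrt(np.mean(frame ** 2)))

def detect_voice_activity(frame):
    # VAD: the frame is considered active when its RMS exceeds the threshold.
    return frame_rms(frame) > RMS_THRESHOLD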

For each frame of acquired samples, a synthesized frame having n partials is generated, on the basis of the characteristics of the samples. The processor is configured to attribute to each partial a set volume based on the characteristics of the samples in the frames.

The synthesized frame is generated by means of additive synthesis of sinusoids whose amplitude and frequency values are determined by the formant frequencies, bandwidths and amplitudes determined by the linear predictive coding (LPC), if the frame includes any one of said vowels.
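A sketch of one conventional way to obtain formant frequencies and bandwidths from linear predictive coding, assuming the autocorrelation method and an LPC order of 12 (the order is an assumption; the application does not state one). The formant amplitudes can then be read off the LPC spectral envelope at the estimated frequencies.

import numpy as np

FS = 44100       # sampling frequency from the description
LPC_ORDER = 12   # assumed model order

def lpc_coefficients(frame, order=LPC_ORDER):
    # Autocorrelation method; the normal equations are solved directly
    # (Levinson-Durbin would be faster, but this keeps the sketch short).
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:order + 1])
    return np.concatenate(([1.0], -a))  # A(z) = 1 - sum_k a_k z^-k

def formants(frame, fs=FS):
    # Pole angles give formant frequencies, pole radii give bandwidths.
    roots = np.roots(lpc_coefficients(frame))
    roots = roots[np.imag(roots) > 1e-6]              # one per conjugate pair
    freqs = np.angle(roots) * fs / (2 * np.pi)        # Hz
    bandwidths = -np.log(np.abs(roots)) * fs / np.pi  # Hz
    idx = np.argsort(freqs)
    return freqs[idx], bandwidths[idx]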

The synthesized frame is generated by means of a random noise generation algorithm, if the frame includes any one of the consonants.
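The two synthesis branches may be sketched as follows; F0 and the per-partial volumes are assumed to be computed elsewhere (the volume assignment is described further below), and the frame length matches the preferred embodiment.

import numpy as np

FS = 44100
FRAME_LEN = 256
N_PARTIALS = 80  # the preferred number of partials per the description

def synthesize_vowel_frame(f0, partial_volumes):
    # Additive synthesis: sum of sinusoids at integer multiples of F0,
    # each scaled by the volume attributed to that partial.
    t = np.arange(FRAME_LEN) / FS
    frame = np.zeros(FRAME_LEN)
    for i, vol in enumerate(partial_volumes, start=1):
        frame += vol * np.sin(2 * np.pi * f0 * i * t)
    return frame

def synthesize_consonant_frame():
    # Consonant frames: random noise generation.
    return np.random.randn(FRAME_LEN)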

A Fast Fourier Transform (FFT) is calculated on each synthesized frame and on each frame of acquired samples, providing as output two respective vectors of complex numbers, wherein the magnitudes are the moduli of the complex numbers.

Two magnitude masks of the transform (FFT) of the frame of acquired samples are calculated, said two magnitude masks being calculated as the average of the spectral magnitudes of the samples in the frame using two different average dimensions, with subsequent normalization.
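A sketch of this mask computation, interpreting the two "average dimensions" as two moving-average window sizes; the sizes used here (8 and 32 bins) are assumptions.

import numpy as np

def magnitude_mask(magnitudes, size):
    # Moving average of the magnitude spectrum, then normalization.
    kernel = np.ones(size) / size
    mask = np.convolve(magnitudes, kernel, mode="same")
    return mask / (np.max(mask) + 1e-12)

mags = np.abs(np.fft.rfft(np.random.randn(256)))  # magnitudes of an acquired frame
m1 = magnitude_mask(mags, 8)    # first mask (assumed size)
m2 = magnitude_mask(mags, 32)   # second mask (assumed size)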

Spectral multiplications of the magnitudes of the transform (FFT) of the synthesized frame and the magnitudes in the two magnitude masks of the transform (FFT) of the frame of acquired samples are executed, to obtain a modified version of the magnitude data of the synthesized frame's spectrum.

In other words, the modulus of the complex numbers of the Fast Fourier Transform of the synthesized frame is substituted by the values obtained by the spectral multiplications. By means of said substitution, the Fast Fourier Transform of the synthesized frame is transformed into a multiplied Fast Fourier Transform of the synthesized frame.

Time translation of the synthesized and multiplied frame (i.e. of the multiplied Fast Fourier Transform of the synthesized frame) is executed by means of an inverse transform (IFFT).

The synthesized and time-translated frames are processed by means of a Bartlett window and the synthesized sound is output by means of the inverse overlap-add method.
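The last three steps of the chain may be sketched together as follows; m1 and m2 are the two magnitude masks computed as above, and the hop size corresponds to the 50% overlap of the preferred embodiment.

import numpy as np

def mask_and_invert(synth_frame, m1, m2):
    # Substitute the moduli of the synthesized frame's FFT with the
    # spectral multiplications, keep the phase, return to the time
    # domain (IFFT) and apply the Bartlett window.
    spec = np.fft.rfft(synth_frame)
    new_mag = np.abs(spec) * m1 * m2
    spec = new_mag * np.exp(1j * np.angle(spec))
    out = np.fft.irfft(spec, n=len(synth_frame))
    return out * np.bartlett(len(synth_frame))

def inverse_overlap_add(frames, hop=128):
    # Sum the windowed frames at their hop positions to rebuild the sound.
    out = np.zeros(hop * (len(frames) - 1) + len(frames[0]))
    for k, f in enumerate(frames):
        out[k * hop:k * hop + len(f)] += f
    return out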

According to the present invention, the processor is configured to determine if the frame includes any one of said vowels or any one of said consonants by means of:

-computation of power spectrum values as the square of the magnitudes in the vector output from the Fast Fourier Transform (FFT) on each frame, and storing the power spectrum values into a power spectrum vector;

-mapping the power spectrum values of the power spectrum vector into triangular windows, wherein each triangular window includes a number of values lower than the number of power spectrum values in the power spectrum vector, and each triangular window is associated to power spectrum values stored in a portion of the power spectrum vector among a plurality of portions, wherein each portion of the power spectrum vector includes at least one value of another portion, the power spectrum values being indexed in each portion and values in each triangular window being indexed in each triangular window; wherein said mapping multiplies values of each triangular window with the power spectrum values of the associated portion of the power spectrum vector, wherein a value with index i in the triangular window is multiplied with a value with index i in the associated portion of the power spectrum vector, and wherein results of said multiplication are stored in a power spectrum matrix;

-in the matrix, for each row in the matrix, and for each power spectrum value in said each row, applying the logarithmic function to obtain a corresponding logarithmic value of the power spectrum value;

-summing all the corresponding logarithmic values obtained from the row to obtain a new power spectrum value and

-determining a new vector including the new power spectrum values obtained by summing the corresponding logarithmic values of all the rows in the matrix;

-applying the discrete cosine transform to the power spectrum values of the new vector and obtaining a vector of LFCC features corresponding to the energy values;

-dividing the vector of the LFCC features into two sub-vectors of equal length corresponding, respectively, to the energy values in a first range and in a second range, the corresponding frequencies of the energy values in the first range being lower than the corresponding frequencies of the energy values in the second range;

-calculating the average of the energy values in the first and second sub-vectors to determine two corresponding first and second scalar values representative of the amount of power in the first range and second range; wherein if the first scalar value is greater than the second scalar value, the frame is determined to include any one of said vowels; if the first scalar value is lower than the second scalar value, the frame is determined to include any one of said consonants.

The computation of the feature describing whether a single frame of whispered speech could be considered as consonant or vowel is also summarized as follows (a code sketch is provided after the list). The frame is assumed to have been preferably preprocessed (high-pass filtered) and windowed in the previous steps.

1) Calculate the Fast Fourier Transform of the frame and compute the power spectrum. The power spectrum of the Fast Fourier Transform of the frame is representative of the distribution of power into the frequency components composing the samples in the frame. Given the frame (the respective samples), the FFT of the frame determines a plurality of component signals, each with a respective frequency, amplitude and phase. The plurality of component signals composes the frame. Among the plurality of component signals so determined, a frequency may have more weight than another since the amplitude of its signal is higher. The power spectrum is the entire set of amplitudes of the component signals at the respective frequencies (the power spectrum may be represented as a function plotting the amplitudes as a function of frequency of the component signals determined by the FFT).

2) Map the resulting power spectrum values within linearly spaced triangular overlapping windows. The power spectrum values are the values of the amplitudes in the power spectrum of the component signals determined by the FFT. The linearly spaced triangular overlapping windows are a set of equally spaced triangular windows to be multiplied with the power spectrum values to form a set of N filterbanks (where N is the number of triangular windows). In particular, mapping the power spectrum values into the linearly spaced triangular overlapping windows consists in multiplying the values of each band of the power spectrum with a triangular window. This operation results in a matrix named TRI_MAT with shape (n_windows, len_window), where n_windows is the number of triangular windows and len_window is the length of a triangular window. Each vector of the matrix contains the data of each triangular band. The vector of the matrix is the one mentioned above as the "vector of power spectrum values".

3) Sum the logarithm of the values for each vector in TRI_MAT to form a new vector that will be treated as a signal, named DCT_VEC. The log function is applied to the power spectrum to provide the importance of each frequency of the component signal in the overall frame. The new vector is the one mentioned above as the "new vector of power spectrum values".

4) Apply the discrete cosine transform to DCT_VEC and name the output complex vector DCT_OUT.

5) The magnitudes of the complex values of DCT_OUT will be considered as the LFCC feature vector. This (the LFCC vector) is the one mentioned above as the "vector of LFCC features".

6) Split the LFCC feature vector into two equal sub-vectors called LFCC_low and LFCC_high (e.g., at a sampling rate of 16,000 Hz the first sub-vector LFCC_low will be representative of the frequency bands within 0-4000 Hz, while the second, LFCC_high, will describe frequencies between 4000-8000 Hz). The two equal sub-vectors are those mentioned above as the "first and second sub-vectors".

7) Mean the two sub-vectors LFCC_low and LFCC_high to retrieve two scalar numbers (LFCC_L and LFCC_H) describing the amount of power within each frequency band.

8) Apply a scalar comparison between LFCC_L and LFCC_H to compute the final value of the feature:

a. If LFCC_L is greater than LFCC_H, then the spectrum is largely concentrated in its lower side, thus giving an estimator of vocality of the whispered speech frame. In this case the single frame is categorized as "vowel".

b. If LFCC_L is lower than LFCC_H, then the spectrum is largely concentrated in its higher side, thus reflecting the behavior of consonant sounds. In this case the single frame is categorized as "consonant".
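A minimal Python sketch of steps 1-8 follows; the number of triangular windows (26) and their realization as 50%-overlapping Bartlett (triangular) windows are assumptions, and any even window count allows the equal split of step 6.

import numpy as np
from scipy.fftpack import dct

N_WINDOWS = 26  # assumed; must be even so the LFCC vector splits equally

def is_vowel(frame, n_windows=N_WINDOWS):
    power = np.abs(np.fft.rfft(frame)) ** 2         # step 1: power spectrum
    len_window = 2 * len(power) // (n_windows + 1)  # 50%-overlapping bands
    hop = len_window // 2
    tri = np.bartlett(len_window)                   # triangular window
    dct_vec = np.empty(n_windows)
    for w in range(n_windows):                      # step 2: rows of TRI_MAT
        band = power[w * hop:w * hop + len_window]
        row = tri[:len(band)] * band                # window value i * spectrum value i
        dct_vec[w] = np.sum(np.log(row + 1e-12))    # step 3: sum of the logs
    lfcc = np.abs(dct(dct_vec, norm="ortho"))       # steps 4-5: DCT -> LFCC features
    lfcc_low, lfcc_high = np.split(lfcc, 2)         # step 6: two equal sub-vectors
    return lfcc_low.mean() > lfcc_high.mean()       # steps 7-8: LFCC_L vs LFCC_H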

For each frame of acquired samples: if it is considered as a vowel frame, a synthesized frame is generated by means of additive synthesis of n partials (sinusoids) whose amplitude and frequency values are computed with respect to the frame's formant frequencies, bandwidths and amplitudes; if it is considered as a consonant frame, a synthesized frame is generated by means of a random noise generation algorithm. In either case, the synthesized frame is then multiplied with a triangular window.

In another embodiment, the processor is configured to determine if the frame includes any one of said vowels or any one of said consonants by executing an LFCC (Linear Frequency Cepstral Coefficients) algorithm in order to associate a vowel or a consonant with the characteristics determined in the frame. Such association is a coarse association, meaning that it does not determine a specific vowel or consonant but only distinguishes vowels from consonants.

As cited above, the processor is configured to attribute to each partial a volume which is set based on the characteristics of the samples in the frames. For example, an amplitude vector is generated.

For example, a vowel sound is recognized in the frame. Formant frequencies, amplitudes and bandwidths are computed by means of the LPC algorithm. A set of n sinusoids (where n is the number of partials) is then synthesized, each having a frequency value of F0 * i, i=1..n, and an amplitude value calculated with respect to the nearest formant's amplitude and to the proximity of the partial frequency to the nearest formant frequency, in relation to its bandwidth. The sinusoids are then summed (additive synthesis) to produce the synthesized frame.

The amplitude vector V is associated with the amplitude spectral curve of the samples in the frame (i.e. a frame derived from the subdivision of a block from the sampled or whispered microphone signal). In particular, the amplitude vector V is a smoothed (averaged) version of the samples of the frame. The vector V may comprise n amplitude values V[i], i=0..n.

The volume of each partial i of the n partials (i=0..n) in the synthesized frame is obtained by multiplying the amplitude of the partial i by the amplitude value V[i] in the amplitude vector V.
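A sketch of the amplitude vector V and of the volume assignment; the smoothing window size is an assumption.

import numpy as np

def amplitude_vector(frame, n_partials=80, smooth=9):
    # Smoothed (averaged) version of the frame's amplitude spectral
    # curve, resampled to one value V[i] per partial.
    mag = np.abs(np.fft.rfft(frame))
    kernel = np.ones(smooth) / smooth
    curve = np.convolve(mag, kernel, mode="same")
    idx = np.linspace(0, len(curve) - 1, n_partials).astype(int)
    v = curve[idx]
    return v / (np.max(v) + 1e-12)

def set_partial_volumes(partial_amplitudes, v):
    # Volume of partial i = amplitude of partial i times V[i].
    return np.asarray(partial_amplitudes) * v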

A Fast Fourier Transform (FFT) is calculated on each synthesized frame.

A Fast Fourier Transform (FFT) is also calculated on each frame of acquired samples.

Two magnitude masks of transforms (FFT) of the synthesized and acquired frames are also calculated.

Each magnitude mask is a vector M containing magnitude values M[j] of the FFT transforms, where each value M[j] is averaged by means of a moving average. In particular, the two magnitude masks M (for example M1, M2) are calculated as the average of the spectral magnitudes of the samples in the frame using two different average dimensions and subsequently applying a normalization.

On the basis of the magnitude values calculated above, the following processing operations are performed.

Firstly, spectral multiplication of the magnitudes of the transform (FFT) of the synthesized signal and the magnitudes in the two magnitude masks of the transform (FFT) of the acquired sample frame is performed. In particular, the multiplication of the magnitude values M[j] stored in corresponding positions j in the magnitude vector of the transform (FFT) of the synthesized signal and in the magnitude vector of the transform (FFT) of the signal of the acquired sample frames is carried out.

Then a time translation of the transform (FFT) of the synthesized and multiplied signal is carried out by means of an inverse transform (IFFT).

The synthesized and time-translated frames as described above are then processed to obtain the output signal.

A Bartlett window is used to process the synthesized and time-translated frames.

The synthesized sound is output by means of the inverse overlap-add method.

This sound is much more intelligible than the speech of the person with speech difficulties, namely the input received by the system according to the present invention, which would not be - at least not entirely - intelligible for humans.

Some characteristics of the system, according to embodiments, are indicated below.

These characteristics are optional and therefore do not limit the invention.

In one embodiment, the predefined number of samples is 1024 per block and the sampling frequency is 44,100 Hz.

In one embodiment, the processor is further configured to carry out the step of eliminating the audio return from each block.

The step of eliminating the audio return is carried out if the distance between the microphone and the loudspeaker is less than a predefined threshold value. The predefined threshold value is stored in the memory and corresponds to the physical distance between the microphone and the loudspeaker.

In one embodiment, an anti-Larsen filter is used to carry out the step of eliminating the audio return.

In one embodiment, a high-pass filter with cut-off frequency at 100 Hz is used to carry out the step of filtering each block.

In a preferred embodiment, each frame comprises 256 samples and each frame has a predefined overlap of samples with a following or preceding frame, preferably the overlap is 50% (128 samples).

The processor is configured to store each sample of a frame in a respective stack of samples, and the stack is kept in the memory.

In one embodiment, the processor is configured to perform the subdivision by means of a Bartlett window and a Hamming window.

Preferably, the number n of partials is 80.

Preferably, the processor is configured to store in a memory the characteristics (bandwidth, amplitude and reduced formant frequencies) of the samples in the stack itself.

In one embodiment, the processor is configured to apply moving average filters to the characteristics of the samples in the stack, in order to reduce the variability of the characteristics of the samples between overlapping frames.

In one embodiment, the processor is configured to apply a noise filter to the partials, with subsequent application of a Bartlett window.

The linear predictive coding (LPC) step determines further characteristics of the samples in the frame, the characteristics comprising the bandwidth, the amplitude and the formant frequencies.

In one embodiment, the processor is configured to carry out a reduction of the F1 and F2 formant frequencies, by 100 Hz and 150 Hz, respectively, of the samples in the frame.

The implementation details of one embodiment are further illustrated below, with reference to Figure 1.

A first coding of the algorithm was carried out in Python and a subsequent coding carried out in C++. Other programming languages and support environments may be used. The microphone signal was acquired in blocks of 1024 samples, with a sampling frequency of 44,100 Hz.

Each block was filtered by means of an anti-Larsen filter in order to avoid audio feedback. This filtering is optional and serves to eliminate the need for earphones, allowing the system to be used with audio speakers. The anti-Larsen filter is a spectral filter which compares the average of each amplitude band of the signal with its total magnitude spectrum; if the ratio between the power of a single band and the total power exceeds a threshold, a band-stop filter is set to the central frequency of the band, with a gain of -60 dB, in order to suppress the feedback.
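A sketch of this anti-Larsen step, under stated assumptions: the band count and the power-ratio threshold are hypothetical, and the -60 dB band-stop of the description is approximated here with a narrow IIR notch.

import numpy as np
from scipy.signal import iirnotch, lfilter

FS = 44100
N_BANDS = 32            # assumed number of amplitude bands
RATIO_THRESHOLD = 0.5   # assumed: a band holding >50% of total power triggers the notch

def anti_larsen(block):
    power = np.abs(np.fft.rfft(block)) ** 2
    total = np.sum(power) + 1e-12
    bands = np.array_split(power, N_BANDS)
    freqs = np.array_split(np.fft.rfftfreq(len(block), 1 / FS), N_BANDS)
    for band, f in zip(bands, freqs):
        if np.sum(band) / total > RATIO_THRESHOLD:
            f0 = float(np.mean(f))            # central frequency of the band
            b, a = iirnotch(f0, Q=30, fs=FS)  # band-stop at f0
            block = lfilter(b, a, block)
    return block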

The block is further filtered with a first order high-pass filter (cut-off frequency of 100 Hz) and then subdivided into overlapping frames of 256 samples each with a 50% overlap.
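The same two operations in a few lines of Python, using a first-order Butterworth high-pass as a stand-in for the unspecified filter type:

import numpy as np
from scipy.signal import butter, lfilter

b, a = butter(1, 100, btype="highpass", fs=44100)  # first order, 100 Hz cut-off
block = lfilter(b, a, np.random.randn(1024))       # filtered 1024-sample block
frames = [block[i:i + 256] for i in range(0, len(block) - 255, 128)]  # 50% overlap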

The frames are processed with windows, preferably (but not necessarily) using the two following methods:

• Bartlett window, for further audio use of the data;

• Hamming window, for FFT (Fast Fourier Transform) and LPC (Linear Predictive Coding) analysis.

The processed frames are stored in stacks together with the previously processed stacks so that a block of 1024 samples may always be reconstructed using the inverse overlap-add method.

Each new frame is further analyzed.

The RMS (Root Mean Square) values and the formant frequencies are calculated.

The bandwidths and the amplitudes are calculated by means of the LPC analysis.

Figure 1 is a block diagram of the procedural steps performed.

The frequencies of the F1 and F2 formants are lowered by a fixed factor, for example 100 Hz for F1 and 150 Hz for F2, while the frequencies of the F3 and F4 formants are unchanged.

Owing to the instability of LPC with whispered speech, the resultant formant frequencies are rounded off during the canonical approximation of the interval frequency of each formant, and these characteristics are grouped together in their respective data structures.

The following step is to smooth the values of the acquired characteristics with moving average filters, since there is too much variability of the characteristics between nearby frames.
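This smoothing may be sketched as a moving average over the recent history of each characteristic (a hypothetical F1 track is shown); the window length is an assumption.

import numpy as np

def smooth_track(track, window=5):
    # Moving average filter over a per-frame characteristic track.
    kernel = np.ones(window) / window
    return np.convolve(track, kernel, mode="same")

f1_track = np.array([612.0, 640.0, 598.0, 905.0, 620.0, 633.0])  # hypothetical F1 values (Hz)
f1_smoothed = smooth_track(f1_track)  # the 905 Hz outlier is attenuated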

F0 is then calculated from the filtered RMS value.

If a frame contains a whispered signal, a corresponding synthesized frame is generated based on the values of the frame characteristics.

The synthesizer uses the additive synthesis method to generate up to 80 partials and the volume of each one is set based on the formants of the frame.

A filtered noise (from 4000 Hz to 20,000 Hz) is then added and the result is multiplied with a Bartlett window.

From the whispered and synthesized frames, three overlapping frames (50% overlap) of 1024 samples are reconstructed using the inverse overlap-add method from the data stacks. This step is intended for subsequent processing since operating with 256 samples would generate too much variability in the resultant sound.

The FFT is calculated on each whispered frame and, for each one, two magnitude masks are calculated by averaging the magnitude spectrum with two different average dimensions and then normalizing them.

Each synthesized frame is transformed by FFT.

Each bin (each value in a position) of the transform of the synthesized frame is multiplied by two bins (the values in the corresponding position of each mask) of the magnitude masks of the whispered frame.

Then the resultant vector is brought back into the time domain by IFFT.

The synthesized frames are finally fine-processed with a Bartlett window and the output frame is calculated by means of the inverse overlap-add method.

The output of the system according to the present invention was evaluated by means of a preliminary test carried out on 10 listeners who were not informed beforehand. The output is clearly intelligible, almost in real time, and characterized by a “full” timbre similar to the sound of normal language.

Advantageously, the delicately whispered spoken input is clear in the synthesized sound: the vocal effort required of the person speaking is thus reduced to a minimum.

The users’ experience encourages them not to emit too much air and to minimize the pronunciation effort.

According to one embodiment, the processor of the system is incorporated in a microcomputer. The microphone may be positioned during use along the side of the mouth with a wire connection, supported by a flexible ultralight boom, which may be replaced by just a single earphone side support.