
Title:
METHOD FOR MODIFYING AN AUDIO SIGNAL WITHOUT PHASINESS
Document Type and Number:
WIPO Patent Application WO/2023/218028
Kind Code:
A1
Abstract:
The invention relates to a method for generating a modified audio signal with a reduced phasiness artifact, to a digital audio system which comprises at least one digital processor (503) adapted to modify the digitized audio signal by applying said method, and the computer program allowing implementing said method when executed.

Inventors:
DEGOTTEX GILLES (FR)
Application Number:
PCT/EP2023/062752
Publication Date:
November 16, 2023
Filing Date:
May 12, 2023
Assignee:
ALTA VOCE (FR)
International Classes:
G10L21/04; G10L19/022; G10L21/0332
Domestic Patent References:
WO2019038275A12019-02-28
Foreign References:
US7502733B22009-03-10
US20060206318A12006-09-14
US6954727B12005-10-11
US20140372131A12014-12-18
Other References:
LAROCHE, J.; DOLSON, M.: "Phase-vocoder: About this phasiness business", 1997, 4 pages
RÖBEL, A.: "Sinusoidal modeling", Summer lecture on analysis, modeling and transformation of audio signals, 25 August 2006 (2006-08-25)
Attorney, Agent or Firm:
A.P.I. CONSEIL (FR)
CLAIMS

1. A method (100) for generating a modified audio signal with a reduced phasiness artifact comprising the steps of:

- Acquiring (110) a digitized audio signal,

- Segmenting (120) the acquired audio signal into successive frames at a given time step, of a given width in the time domain,

- Resampling (140) at least one frame in either the time or the spectral domain,

- Aligning (160) each frame in phase with one another,

- Reconstructing (170) the modified audio signal using an overlap-add process,

characterized in that said step of aligning (160) comprises:

i. A frame windowing substep (161) wherein, for a given previous frame Fn-1[u] and a current subsequent frame Fn[u] to be aligned with the previous frame:

On the current frame Fn[u]:
o Segmenting a time window of a determined width in the time domain, shorter than the width in the time domain of the current frame Fn[u] and centered on the current frame Fn[u],
o Applying a window function to said time window, to obtain a windowed frame,
o Obtaining a frequency representation Cn[u] of the audio signal to be aligned in time,

In the previous frame Fn-1[u]:
o Segmenting a time window of a determined width in the time domain, shorter than the width in the time domain of the previous frame Fn-1[u], the center of which is time-shifted forward from the center of the previous frame Fn-1[u] using the time step of the segmenting step (120),
o Obtaining a reference frequency representation Rn[u] of said shifted windowed frame,

ii. Determining the time delay “a” (162) that maximizes the correlation between the frequency representations Rn[u] and Cn[u],

iii. Correcting the time position (163) of the current frame Fn[u] according to the time delay “a”, thereby obtaining a corrected current frame ~Fn[u],

iv. Segmenting a time window in ~Fn[u] (164), whose width is essentially the same as that of Cn[u], and whose center is the same as that of ~Fn[u],

v. Obtaining a windowed corrected current frame ~Cn[u] (165) by applying a window function to the time window of the previous step, said windowed corrected current frame ~Cn[u] being used in the reconstructing step (170),

vi. Optionally, repeating steps (161) to (165), the corrected current frame ~Fn[u] becoming the previous frame Fn-1[u], thereby resulting in a modified audio signal with no phasiness artifact.

2. The method according to claim 1 wherein at least one processing step (130) is applied before the step of resampling (140) at least one frame.

3. The method according to any one of the preceding claims wherein at least one processing step (150) is applied after the step of resampling (140) at least one frame.

4. The method according to any one of the preceding claims wherein the width in the time domain of the time windows is essentially the same in both Fn[u] and Fn-1[u].

5. The method according to any one of the preceding claims wherein the windowing function is selected from the rectangular window, B-spline windows (e.g. Triangular, Parzen windows), Welch window, Sine window (power-of-sine/cosine windows), cosine-sum windows (e.g. Hann, Hamming, Blackman, Nuttall, Blackman-Nuttall, Blackman-Harris, Flat top, Rife-Vincent windows), adjustable windows (e.g. Gaussian, Confined Gaussian, approximate confined Gaussian, Generalized normal, Tukey, DPSS or Slepian, Kaiser, Dolph-Chebyshev, Ultraspherical, Exponential or Poisson windows), hybrid windows (e.g. Bartlett-Hann, Planck-Bessel, Hann-Poisson windows), the Generalized adaptive polynomial (GAP) window and the Lanczos window, preferably selected from the Hamming, Hann and Blackman window functions.

6. The method according to any one of the preceding claims wherein the time position of the very first frame of the audio signal is not corrected.

7. The method according to any one of the preceding claims wherein each frame size is between 2 ms and 200 ms long and the frame time interval is such as to allow the frames to overlap.

8. The method according to any one of the preceding claims wherein the audio signal is a representation of at least a human voice, for example at least a spoken and/or at least a singing human voice, or a mix thereof.

9. A computer program for audio modification comprising instructions which, when the program is executed by a computer, implement step (160) of aligning in phase frames of a digitized audio signal which have been segmented into successive frames at a given time step, of a given width in the time domain, at least one of said frames having been resampled according to a predetermined desired effect, said aligning step (160) comprising implementing the following substeps:

i) A frame windowing substep (161) wherein, for a given previous frame Fn-1[u] and a current subsequent frame Fn[u] to be aligned with the previous frame:

On the current frame Fn[u]:
o Segmenting a time window of a determined width in the time domain, shorter than the width in the time domain of the current frame Fn[u] and centered on the current frame Fn[u],
o Applying a window function to said time window, to obtain a windowed frame,
o Obtaining a frequency representation Cn[u] of the audio signal to be aligned in time,

In the previous frame Fn-1[u]:
o Segmenting a time window of a determined width in the time domain, shorter than the width in the time domain of the previous frame Fn-1[u], the center of which is time-shifted forward from the center of the previous frame Fn-1[u] using the time step of the segmenting step (120),
o Obtaining a reference frequency representation Rn[u] of said shifted windowed frame,

ii) Determining the time delay “a” (162) that maximizes the correlation between the frequency representations Rn[u] and Cn[u],

iii) Correcting the time position (163) of the current frame Fn[u] according to the time delay “a”, thereby obtaining a corrected current frame ~Fn[u],

iv) Segmenting a time window in ~Fn[u] (164), whose width is essentially the same as that of Cn[u], and whose center is the same as that of ~Fn[u],

v) Obtaining a windowed corrected current frame ~Cn[u] (165) by applying a window function to the time window of the previous step, said windowed corrected current frame ~Cn[u] being used in a reconstructing step using an overlap-add process,

vi) Optionally, repeating steps (161) to (165), the corrected current frame ~Fn[u] becoming the previous frame Fn-1[u].

10. A digital audio system (500) which comprises at least one digital processor (503) adapted to modify the digitized audio signal by implementing step (160) of aligning in phase frames of a digitized audio signal which have been segmented into successive frames at a given time step, of a given width in the time domain, at least one of said frames having been resampled according to a predetermined desired effect, said aligning step (160) comprising implementing the following substeps:

i) A frame windowing substep (161) wherein, for a given previous frame Fn-1[u] and a current subsequent frame Fn[u] to be aligned with the previous frame:

On the current frame Fn[u]:
o Segmenting a time window of a determined width in the time domain, shorter than the width in the time domain of the current frame Fn[u] and centered on the current frame Fn[u],
o Applying a window function to said time window, to obtain a windowed frame,
o Obtaining a frequency representation Cn[u] of the audio signal to be aligned in time,

In the previous frame Fn-1[u]:
o Segmenting a time window of a determined width in the time domain, shorter than the width in the time domain of the previous frame Fn-1[u], the center of which is time-shifted forward from the center of the previous frame Fn-1[u] using the time step of the segmenting step (120),
o Obtaining a reference frequency representation Rn[u] of said shifted windowed frame,

ii) Determining the time delay “a” (162) that maximizes the correlation between the frequency representations Rn[u] and Cn[u],

iii) Correcting the time position (163) of the current frame Fn[u] according to the time delay “a”, thereby obtaining a corrected current frame ~Fn[u],

iv) Segmenting a time window in ~Fn[u] (164), whose width is essentially the same as that of Cn[u], and whose center is the same as that of ~Fn[u],

v) Obtaining a windowed corrected current frame ~Cn[u] (165) by applying a window function to the time window of the previous step, said windowed corrected current frame ~Cn[u] being used in a reconstructing step using an overlap-add process,

vi) Optionally, repeating steps (161) to (165), the corrected current frame ~Fn[u] becoming the previous frame Fn-1[u].

Description:
METHOD FOR MODIFYING AN AUDIO SIGNAL WITHOUT PHASINESS

FIELD OF THE INVENTION

[0001] The invention relates to the field of audio signal processing. The invention especially relates to a new method for generating a modified audio signal aiming at reducing the phasiness artifact which is often encountered while applying a desired transformation to an audio signal through, for example, vocoding. The invention also relates to a digital audio system adapted to implement said method.

BACKGROUND OF THE INVENTION

[0002] The phase vocoder is a known technique for audio modification, mainly used for pitch shifting and time stretching.

[0003] Even though the method is quite efficient and offers high audio quality, it is also plagued with a well-known drawback, the phasiness artifact (Laroche & Dolson, 1997). The main concept of the phase vocoder consists in resampling short-time frames (e.g. 20 ms) and rearranging them on the timeline the way the user wants, to make the sound shorter or longer, higher or lower. Because the coherence of the phase of one frame with that of its neighbors is important for auditory reasons, it is critically important to align neighboring frames properly on the timeline. The technique used for this alignment process is at the core of the difficulties of a phase vocoder. Numerous methods have been suggested for this purpose: US20060206318 uses phase matching to correct discontinuities in the decoded signal when the encoder and decoder may be out of sync in signal phase; WO2019038275 uses an adjusted phase derivative value set computed using finite differences; US6954727 relates to a method of reducing sinusoidal artifact generation in a vocoder, using a codebook excitation vector selection process to prevent the suspected noise-inducing codebook excitation vector from being continuously generated when the input is below a determined input energy threshold. In more detail, the phase alignment process optimizes the temporal correlation between one frame and the preceding one.
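For context, the frame-rearrangement idea described above can be sketched as a naive overlap-add time stretch (an illustrative sketch only, not the claimed method; the function and parameter names are ours). No phase alignment is performed here, which is precisely where the glitch/phasiness trade-off discussed below arises:

```python
import numpy as np

def ola_time_stretch(x, stretch=1.5, frame_len=441, analysis_hop=220):
    """Naive overlap-add time stretch: frames are taken every
    `analysis_hop` samples and laid back down every `synthesis_hop`
    samples.  Without any frame-to-frame phase alignment, the
    rearranged frames produce audible discontinuities."""
    synthesis_hop = int(round(analysis_hop * stretch))
    win = np.hanning(frame_len)
    n_frames = (len(x) - frame_len) // analysis_hop + 1
    out = np.zeros(synthesis_hop * n_frames + frame_len)
    norm = np.zeros_like(out)
    for n in range(n_frames):
        frame = x[n * analysis_hop:n * analysis_hop + frame_len] * win
        t = n * synthesis_hop  # new position on the timeline
        out[t:t + frame_len] += frame
        norm[t:t + frame_len] += win ** 2
    # normalize by the accumulated window energy
    return out / np.maximum(norm, 1e-12)
```

With `stretch > 1` the output is longer than the input; the glitches a listener would hear at frame boundaries are what the alignment techniques surveyed in this paragraph try to remove.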

[0004] The phasiness artifact is a well-known drawback of current vocoding methods; it results from undue optimization of the temporal correlation of the frames’ noise components, which reduces the variance of the noise frame phases. On the one hand, if the frame-to-frame correlation is not high enough, the smooth continuity of the frames’ deterministic components (e.g. pitched sounds) will not be preserved, and clicks and glitches will be audible. From this perspective, one might want to maximize the frame-to-frame time correlation. On the other hand, if the frame-to-frame correlation is too high, the frames’ noise component won’t be perceived as noisy any longer, since the frame-to-frame phase variance will be reduced. In this case, the time correlation is artificially increased, which results in the so-called phasiness artifact. Many methods have been suggested to find the optimal balance of correlation, but at the expense of computational power (e.g. through sinusoidal modelling (Robel, 2006) or other approaches, as in US2014/0372131 where phase adjustment is operated depending on control information determined from the vertical phase coherence of the audio signal). The Trax plugin (IRCAM) incorporates a phase vocoder (https://www.flux.audio/project/ircam-trax-v3/).

[0005] The invention provides a new method for generating a modified audio signal which allows a good balance between the frame-to-frame correlation of the deterministic and random components of the audio signal, while avoiding the above-mentioned phasiness artifact.

SUMMARY

[0006] The following sets forth a simplified summary of selected aspects, embodiments and examples of the present invention for the purpose of providing a basic understanding of the invention. However, the summary does not constitute an extensive overview of all the aspects, embodiments and examples of the invention. The sole purpose of the summary is to present selected aspects, embodiments and examples of the invention in a concise form as an introduction to the more detailed description of the aspects, embodiments and examples of the invention that follow the summary.

[0007] The invention aims to overcome the disadvantages of the prior art. In particular, the invention proposes a method for generating a modified audio signal wherein no or little phasiness artifact is generated upon audio signal modification.

In that regard, according to a first aspect, the invention relates to a method for generating a modified audio signal with a reduced phasiness artifact comprising the steps of:

- Acquiring a digitized audio signal,

- Segmenting the acquired audio signal into successive frames at a given time step, of a given width in the time domain,

- Resampling at least one frame in either the time or the spectral domain,

- Aligning each frame in phase with one another,

- Reconstructing the modified audio signal using an overlap-add process,

characterized in that said step of aligning comprises:

i. A frame windowing substep wherein, for a given previous frame Fn-1[u] and a current subsequent frame Fn[u] to be aligned with the previous frame:

On the current frame Fn[u]:
o Segmenting a time window of a determined width in the time domain, shorter than the width in the time domain of the current frame Fn[u] and centered on the current frame Fn[u],
o Applying a window function to said time window, to obtain a windowed frame,
o Obtaining a frequency representation Cn[u] of the audio signal to be aligned in time,

In the previous frame Fn-1[u]:
o Segmenting a time window of a determined width in the time domain, shorter than the width in the time domain of the previous frame Fn-1[u], the center of which is time-shifted forward from the center of the previous frame Fn-1[u] using the time step of the segmenting step,
o Obtaining a reference frequency representation Rn[u] of said shifted windowed frame,

ii. Determining the time delay “a” that maximizes the correlation between the frequency representations Rn[u] and Cn[u],

iii. Correcting the time position of the current frame Fn[u] according to the time delay “a”, thereby obtaining a corrected current frame ~Fn[u],

iv. Segmenting a time window in ~Fn[u], whose width is essentially the same as that of Cn[u], and whose center is the same as that of ~Fn[u],

v. Obtaining a windowed corrected current frame ~Cn[u] by applying a window function to the time window of the previous step, said windowed corrected current frame ~Cn[u] being used in the reconstructing step with the overlap-add process,

vi. Optionally, repeating steps i. to v., the corrected current frame ~Fn[u] becoming the previous frame Fn-1[u], thereby resulting in a modified audio signal with no phasiness artifact.

[0008] A further advantage of this method is that it requires few computing resources and allows real-time and/or embedded implementations.

[0009] According to other optional features of the method according to the invention, it can optionally include one or more of the following characteristics alone or in combination:

- at least one processing step is applied before the step of resampling at least one frame,

- at least one processing step is applied after the step of resampling at least one frame,

- the width in the time domain of the time windows is essentially the same in both Fn[u] and Fn-1[u],

- the windowing function is selected from the rectangular window, B-spline windows (e.g. Triangular, Parzen windows), Welch window, Sine window (power-of-sine/cosine windows), cosine-sum windows (e.g. Hann, Hamming, Blackman, Nuttall, Blackman-Nuttall, Blackman-Harris, Flat top, Rife-Vincent windows), adjustable windows (e.g. Gaussian, Confined Gaussian, approximate confined Gaussian, Generalized normal, Tukey, DPSS or Slepian, Kaiser, Dolph-Chebyshev, Ultraspherical, Exponential or Poisson windows), hybrid windows (e.g. Bartlett-Hann, Planck-Bessel, Hann-Poisson windows), the Generalized adaptive polynomial (GAP) window and the Lanczos window, preferably selected from the Hamming, Hann and Blackman window functions,

- the time position of the very first frame of the audio signal is not corrected,

- each frame size is between 2 ms and 200 ms long and the frame time interval is such as to allow the frames to overlap, or

- the audio signal is a representation of at least a human voice, for example at least a spoken and/or at least a singing human voice, or a mix thereof.

[0010] According to another aspect, the invention can also relate to a computer program for audio modification comprising computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the method for audio modification according to the invention as set out above.

According to this second aspect, the invention relates to a computer program for audio modification comprising instructions which, when the program is executed by a computer, implement the step of aligning in phase frames of a digitized audio signal which have been segmented into successive frames at a given time step, of a given width in the time domain, at least one of said frames having been resampled according to a predetermined desired effect, said aligning step comprising implementing the following substeps:

i) A frame windowing substep wherein, for a given previous frame Fn-1[u] and a current subsequent frame Fn[u] to be aligned with the previous frame:

On the current frame Fn[u]:
o Segmenting a time window of a determined width in the time domain, shorter than the width in the time domain of the current frame Fn[u] and centered on the current frame Fn[u],
o Applying a window function to said time window, to obtain a windowed frame,
o Obtaining a frequency representation Cn[u] of the audio signal to be aligned in time,

In the previous frame Fn-1[u]:
o Segmenting a time window of a determined width in the time domain, shorter than the width in the time domain of the previous frame Fn-1[u], the center of which is time-shifted forward from the center of the previous frame Fn-1[u] using the time step of the segmenting step,
o Obtaining a reference frequency representation Rn[u] of said shifted windowed frame,

ii) Determining the time delay “a” that maximizes the correlation between the frequency representations Rn[u] and Cn[u],

iii) Correcting the time position of the current frame Fn[u] according to the time delay “a”, thereby obtaining a corrected current frame ~Fn[u],

iv) Segmenting a time window in ~Fn[u], whose width is essentially the same as that of Cn[u], and whose center is the same as that of ~Fn[u],

v) Obtaining a windowed corrected current frame ~Cn[u] by applying a window function to the time window of the previous step, said windowed corrected current frame ~Cn[u] being used in the reconstructing step with the overlap-add process,

vi) Optionally, repeating steps i) to v), the corrected current frame ~Fn[u] becoming the previous frame Fn-1[u].

[0011] According to another aspect of the present invention, there is provided a digital audio system comprising one or more processors to perform the method according to the invention. More particularly, said digital audio system comprises at least one digital processor adapted to modify the digitized audio signal by implementing the step of aligning in phase frames of a digitized audio signal which have been segmented into successive frames at a given time step, of a given width in the time domain, at least one of said frames having been resampled according to a predetermined desired effect, said aligning step comprising implementing the following substeps:

i) A frame windowing substep wherein, for a given previous frame Fn-1[u] and a current subsequent frame Fn[u] to be aligned with the previous frame:

On the current frame Fn[u]:
o Segmenting a time window of a determined width in the time domain, shorter than the width in the time domain of the current frame Fn[u] and centered on the current frame Fn[u],
o Applying a window function to said time window, to obtain a windowed frame,
o Obtaining a frequency representation Cn[u] of the audio signal to be aligned in time,

In the previous frame Fn-1[u]:
o Segmenting a time window of a determined width in the time domain, shorter than the width in the time domain of the previous frame Fn-1[u], the center of which is time-shifted forward from the center of the previous frame Fn-1[u] using the time step of the segmenting step,
o Obtaining a reference frequency representation Rn[u] of said shifted windowed frame,

ii) Determining the time delay “a” that maximizes the correlation between the frequency representations Rn[u] and Cn[u],

iii) Correcting the time position of the current frame Fn[u] according to the time delay “a”, thereby obtaining a corrected current frame ~Fn[u],

iv) Segmenting a time window in ~Fn[u], whose width is essentially the same as that of Cn[u], and whose center is the same as that of ~Fn[u],

v) Obtaining a windowed corrected current frame ~Cn[u] by applying a window function to the time window of the previous step, said windowed corrected current frame ~Cn[u] being used in the reconstructing step with the overlap-add process,

vi) Optionally, repeating steps i) to v), the corrected current frame ~Fn[u] becoming the previous frame Fn-1[u].

FIGURE LEGENDS

[0012] The foregoing and other objects, features and advantages of the present invention will become more apparent from the following detailed description when taken in conjunction with the accompanying drawings in which :

Figure 1 represents a flowchart of a method according to an embodiment of the invention. Steps framed in dotted lines are optional.

Figure 2 details the specific step of aligning in phase said at least one resampled frame with one another according to an embodiment of the present invention.

Figure 3 exemplifies an alignment process according to an embodiment of the method of the invention for pitch shifting of the audio signal with factor 0.5.

Figure 4 represents a digital audio system according to an embodiment of the invention adapted to implement the method of generating a modified audio signal according to the invention, comprising at least one digital processor adapted to implement aligning step according to the invention.

DETAILED DESCRIPTION

Definitions

[0013] By “process”, “compute”, “determine”, “display”, “extract”, “compare” or, more broadly, “executable operation” is meant, within the meaning of the invention, an action performed by a computing device or a processor unless the context indicates otherwise. In this regard, the operations relate to actions and/or processes of a data processing system, for example a computing system or an electronic computing device, which manipulates and transforms data represented as physical (electronic) quantities in the memories of the computing system or other devices for storing, transmitting or displaying information. In particular, calculation operations are carried out by the processor of the device, and the produced data are entered in a corresponding field in a data memory. These operations may be based on applications or software.

[0014] The terms or expressions “application”, “software”, “program code”, “computer program” or “executable code” mean any expression, code or notation, of a set of instructions intended to cause a data processing system to perform a particular function directly or indirectly (for example after a conversion operation into another code). Exemplary program codes may include, but are not limited to, a subprogram, a function, an executable application, a source code, an object code, a library and/or any other sequence of instructions designed to be performed on a computing system.

[0015] By “processor” is meant, within the meaning of the invention, at least one hardware circuit configured to perform operations according to instructions contained in a code. The hardware circuit may be an integrated circuit. Examples of a processor include, but are not limited to, a central processing unit, a graphics processor, an application-specific integrated circuit (ASIC), and a programmable logic circuit. A single processor or several other units may be used to implement the invention.

[0016] By “coupled” is meant, within the meaning of the invention, connected, directly or indirectly, with one or more intermediate elements. Two elements may be coupled mechanically, electrically or linked by a communication channel.

[0017] By “computing device”, it should be understood any device comprising a processing unit or a processor, for example in the form of a microcontroller cooperating with a data memory, possibly a program memory, said memories possibly being dissociated. The processing unit cooperates with said memories by means of internal communication bus.

[0018] The term “essentially”, when referring to a value, a datum, a numeric range, etc., as used herein refers to a value, a datum, a numeric range, etc., for which a degree of variability is allowed, for example of 10%, of 5%, of 1% or even less. When a value, a numeric range, etc. is said herein to be essentially the same as another, it means that these values differ from each other by 10% or less, preferably by 5% or less, even more preferably by 1% or less.

[0019] An “audio signal” is meant to refer to a representation of sound. Audio signals may be synthesized directly, or may originate at a transducer such as a microphone, musical instrument pickup, phonograph cartridge, or tape head, which produces an analog audio signal from a sound. Typically, this sound is of any origin, more particularly a voice (spoken and/or sung), the sound of a monophonic music instrument, music, or a mix thereof. Accordingly, a digitized audio signal refers to an audio signal that has been recorded in, converted into, or encoded into digital form, typically using an analog-to-digital converter (ADC) when the sound has originally been transformed into an analog signal.

[0020] A description of example embodiments of the invention follows.

Method for generating a modified audio signal

[0021] The present method 100 for generating a modified audio signal avoids the drawbacks of current vocoding methods which give rise to the so-called phasiness artifact, caused by the undue temporal correlation of the frames’ noise components that occurs upon phase alignment. Such unwanted correlation results in the creation of new and unwanted deterministic sounds from what are, in the original signal, non-deterministic sounds. The method 100 according to the invention is a new method based on a novel alignment of the segmented frames of a modified audio signal.

[0022] The method 100 results in the production of a modified audio signal, if needed in real time, without phasiness artifact, at a low cost in terms of computational needs, since most of the method is implemented in the time domain. Typically, said method comprises the steps of:

- Acquiring 110 a digitized audio signal,

- Segmenting 120 the acquired audio signal into successive frames at a given time step, of a given width in the time domain,

- Resampling 140 at least one frame in either the time or the spectral domain,

- Aligning 160 each frame in phase with one another,

- Reconstructing 170 the modified audio signal using an overlap-add process.

[0023] As mentioned above, the acquiring step 110 of the method according to the invention can comprise the real-time acquisition of a sound which is transduced in real time into a digitized audio signal, and/or the transduction of a stored analog audio signal, or even the provision of a stored digitized audio signal. In an embodiment, the time duration of said digitized audio signal ranges from several milliseconds to several minutes or even hours. Classically, the time duration is between 3 ms and 1 hour. Preferably, the time duration of the digitized audio signal is between 1 s and 30 min, preferably between 5 s and 10 min.

[0024] Accordingly, the method according to the invention can be implemented on pre-recorded digitized signal(s) or in real time. Its low requirements in terms of computing resources and its efficiency make the method according to the invention particularly suitable for a real-time implementation. Also, in a particular embodiment, the method of generating a modified audio signal according to the invention is performed in real time.

[0025] The time step and frame window size of the segmenting step 120 have to respect the usual constraints of devices for audio signal modification (e.g. a vocoder). The precise choice depends on the application and on the desired effect, provided the frames’ time windows overlap. Classically, the time step and frame window size have to respect the basic constraint of any overlap-add process: the time step has to be short enough so that consecutive windows overlap. Also, the sum of two frame windows has to be greater than two times the machine epsilon (i.e. the machine precision). Accordingly, in the method according to the invention, the predetermined time step can be of any duration. In an embodiment, it is between 0.1 ms and 30 ms, preferably between 0.5 ms and 15 ms. In a more particular embodiment, said time step is 15 ms, 14 ms, 13 ms, 12 ms, 11 ms, 10 ms, 9 ms, 8 ms, 7 ms, 6 ms, 5 ms, 4 ms, 3 ms, 2 ms, 1 ms or even 0.5 ms. Preferably, said predetermined time step is 5 ms, which allows particularly robust performance to be obtained while optimizing the use of computing resources. Also, in a particular embodiment, the frame window size is between 2 ms and 200 ms, preferably between 5 ms and 100 ms, more preferably between 10 ms and 50 ms. In a more particular embodiment, said frame window size can be 50 ms, 40 ms, 30 ms, 20 ms or even 10 ms. Preferably, said frame window is 20 ms. In the method according to the invention, the number of frames is at least two.
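The segmenting step 120 with the preferred 5 ms time step and 20 ms frame window can be sketched as follows (an illustrative sketch; the 44.1 kHz sample rate and all names are our assumptions, not mandated by the text):

```python
import numpy as np

def segment(x, fs=44100, time_step_ms=5.0, frame_ms=20.0):
    """Cut the signal into overlapping frames (step 120).
    With a 5 ms step and 20 ms windows, consecutive frames
    overlap by 75%, satisfying the overlap-add constraint."""
    hop = int(fs * time_step_ms / 1000)    # 220 samples at 44.1 kHz
    width = int(fs * frame_ms / 1000)      # 882 samples at 44.1 kHz
    assert hop < width, "time step must be short enough for frames to overlap"
    n = (len(x) - width) // hop + 1
    return np.stack([x[i * hop:i * hop + width] for i in range(n)])
```

Each row of the returned array is one frame; neighboring rows share 75% of their samples, as required for the later overlap-add reconstruction 170.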

[0026] The resampling step 140 of at least one frame can be done in either the time or the spectral domain. In a preferred embodiment, said resampling 140 is performed on each frame of the signal. Said resampling can be, e.g., for the purpose of time stretching (making the audio signal longer or shorter without changing the other perceived characteristics) or for the purpose of pitch scaling (making the audio sound higher or lower). Any effect can be applied through the resampling method, as in common methods of audio signal modification. A resampling step of each frame in the time domain is particularly preferred. In certain embodiments, the resampling step is performed in the spectral domain. In these embodiments, therefore, an inverse transform is used to put the signal back in the time domain.
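A minimal sketch of per-frame resampling in the time domain (step 140), here using linear interpolation for illustration only — the text does not prescribe a particular interpolation scheme, and the function name and `factor` parameter are our assumptions:

```python
import numpy as np

def resample_frame(frame, factor):
    """Resample one frame in the time domain (step 140).
    Reading the frame at `factor` times its original rate:
    factor > 1 shortens the frame (raising the perceived pitch
    in a pitch-scaling use), factor < 1 lengthens it."""
    n_out = int(round(len(frame) / factor))
    t_out = np.arange(n_out) * factor          # fractional read positions
    return np.interp(t_out, np.arange(len(frame)), frame)
```

After such a resampling, the frames no longer line up in phase on the timeline, which is exactly what the aligning step 160 then corrects.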

[0027] Advantageously, any processing steps (130, 150) of the audio signal can be performed on the digitized audio signal before and/or after the resampling step 140. Said processing step can comprise any modification of the audio signal such as, for example: modification of the sound timbre, equalization, voice alteration, denoising, signal restoration, improvement of intelligibility or any other audio modification and/or audio enhancement. In the method according to the invention, these steps are optional. Also, in an embodiment, the method comprises at least one processing step 130 performed on the digitized signal before the resampling step 140. In another embodiment, the method comprises at least one processing step 150 performed on the digitized signal once resampled under step 140, i.e. after said step 140. In a further embodiment, the method comprises at least one processing step 130 performed on the digitized signal before the resampling step 140 and a processing step 150 performed on the digitized signal once resampled under step 140.

[0028] In the method for generating a modified audio signal of the invention, the step of aligning 160 in phase each frame with one another comprises the following substeps, an example of which is given in Figures 2 and 3:

i) A frame windowing substep 161 wherein, for a given previous frame Fn-1[u] and a current subsequent frame Fn[u] to be aligned with the previous frame:

On the current frame Fn[u]:
- segmenting a time window of a determined width in the time domain, shorter than the width in the time domain of the current frame Fn[u] and centered on the current frame Fn[u],
- applying a window function to said time window, to obtain a windowed frame,
- obtaining a frequency representation Cn[u] of the audio signal to be aligned in time,

In the previous frame Fn-1[u]:
- segmenting a time window of a determined width in the time domain, shorter than the width in the time domain of the previous frame Fn-1[u], the center of which is time-shifted forward from the center of the previous frame Fn-1[u] by the time step of the segmenting step 120,
- obtaining a reference frequency representation Rn[u] of said shifted windowed frame,

ii) Determining the time delay “a” 162 that maximizes the correlation between the frequency representations Rn[u] and Cn[u],

iii) Correcting the time position 163 of the current frame Fn[u] according to the time delay “a”, thereby obtaining a corrected current frame ~Fn[u],

iv) Segmenting a time window in ~Fn[u] 164, whose width is essentially the same as that of Cn[u], and whose center is the same as that of ~Fn[u],

v) Obtaining a windowed corrected current frame ~Cn[u] 165 by applying a window function to the time window of the previous step, said windowed corrected current frame ~Cn[u] being used in the reconstructing step 170 with the overlap-add process,

vi) Optionally, repeating steps 161 to 165, the corrected current frame ~Fn[u] becoming the previous frame Fn-1[u].
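The frame windowing substep can be sketched in plain Python as follows; this is an illustrative reading of substep i), in which the helper names `centered_window` and `shifted_window` and the choice of a Hann taper are assumptions, not the patented implementation:

```python
import math

def hann(n):
    # Hann taper, used here purely as an example window function
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def centered_window(frame, win_len):
    """Current frame Fn[u]: segment a time window shorter than the
    frame, centred on it, then apply the window function."""
    start = (len(frame) - win_len) // 2
    seg = frame[start:start + win_len]
    return [s * w for s, w in zip(seg, hann(win_len))]

def shifted_window(prev_frame, win_len, step):
    """Previous frame Fn-1[u]: same window width, but with its centre
    shifted forward by the time step of the segmenting step 120."""
    start = (len(prev_frame) - win_len) // 2 + step
    seg = prev_frame[start:start + win_len]
    return [s * w for s, w in zip(seg, hann(win_len))]
```

The frequency representations Cn[u] and Rn[u] would then be obtained from these two windowed segments, e.g. with a Fourier transform.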

[0029] In said step 160, the widths in the time domain of the time windows of Fn-1[u] and Fn[u] can be identical or different. Nevertheless, they should be essentially the same. Better performance in terms of quality of the transformed signal is obtained when the difference in width between Fn-1[u] and Fn[u] is equal to or lower than 5%, even more preferably equal to or lower than 1%. Accordingly, in an embodiment of the method according to the invention, the widths of the time windows of Fn-1[u] and Fn[u] are essentially the same. In a preferred embodiment, the difference in said widths is equal to or lower than 5%, more preferably equal to or lower than 1%. Even more preferably, the widths in the time domain of the time windows of Fn-1[u] and Fn[u] are identical.

[0030] Any frame windowing function can be used in the frame windowing substep 161 to implement the aligning step of the method of the invention. For example, applying a window function comprises applying at least one function chosen from the group comprising: rectangular window, B-spline windows (e.g. triangular, Parzen windows), Welch window, sine window (power-of-sine/cosine windows), cosine-sum windows (e.g. Hann, Hamming, Blackman, Nuttall, Blackman-Nuttall, Blackman-Harris, flat top, Rife-Vincent windows), adjustable windows (e.g. Gaussian, confined Gaussian, approximate confined Gaussian, generalized normal, Tukey, DPSS or Slepian, Kaiser, Dolph-Chebyshev, ultraspherical, exponential or Poisson windows), hybrid windows (e.g. Bartlett-Hann, Planck-Bessel, Hann-Poisson windows), generalized adaptive polynomial (GAP) window and Lanczos window. Preferred window functions are the Hamming, Hann and Blackman window functions. A particularly preferred windowing function is the Blackman window function, which has been found particularly well adapted to the step of determining the time delay “a” that allows the maximum of correlation, as explained below.
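As a minimal sketch, the three preferred windows named above (Hann, Hamming, Blackman) all belong to the cosine-sum family and can be generated from one formula; the coefficient values are the standard ones and `cosine_sum_window` is an illustrative helper name:

```python
import math

def cosine_sum_window(n, coeffs):
    """Generalised cosine-sum window of length n; the coefficient
    list selects Hann, Hamming, Blackman, etc."""
    return [sum(((-1) ** k) * a * math.cos(2 * math.pi * k * i / (n - 1))
                for k, a in enumerate(coeffs))
            for i in range(n)]

def hann(n):
    return cosine_sum_window(n, [0.5, 0.5])

def hamming(n):
    return cosine_sum_window(n, [0.54, 0.46])

def blackman(n):
    return cosine_sum_window(n, [0.42, 0.5, 0.08])
```

All three taper smoothly toward the frame edges; the Blackman window has the strongest side-lobe suppression of the three, which is consistent with its preference for the correlation-based delay estimation.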

[0031] In the method according to the invention, each frame Fn[u] is windowed, and the frames are chosen to be wider than the length of the window (as exemplified in the method of Figure 2).

[0032] Frequency representations of windowed frames can be obtained by any method suitable for representing signals in the frequency domain. Such methods can comprise applying a wavelet transform, least-squares estimation, an adaptive frequency representation, a Fourier-related transform or a Fourier transform. Fourier transforms are particularly preferred. Fast Fourier transforms are even more preferred, as such algorithms save computing resources and therefore contribute to making the method according to the invention even more suitable for real-time audio modification and even, for example, for embedded devices capable of implementing said method. Also, in a preferred embodiment, the steps of obtaining a frequency representation of the windowed frames are implemented using a Fourier transform, and more preferably a fast Fourier transform.
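For illustration only, a naive discrete Fourier transform makes explicit what "obtaining a frequency representation" computes; a fast Fourier transform returns the same coefficients in O(n log n) instead of O(n²), which is why it is preferred for real-time use:

```python
import cmath

def dft(x):
    """Naive discrete Fourier transform of a real or complex sequence.

    Returns the n complex frequency-domain coefficients; an FFT
    computes exactly the same values far more efficiently.
    """
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n))
            for k in range(n)]
```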

[0033] The frequency representation Rn[u] is obtained from a selected segment of the time signal within Fn-1[u], chosen one time step further than the previously corrected window. Consequently, the frequency representation Rn[u] is placed where the corrected frequency representation is supposed to be, and a maximisation of the cross-correlation can be used to align Cn[u] on Rn[u], leading to a corrected window. Indeed, Rn[u] and Cn[u] are not equivalent, as selecting a time segment shifted forward one time step after the previously corrected window but within the signal Fn-1[u] is not equivalent to resampling a window one time step further in the original signal. Any method of maximisation of the cross-correlation can be used, for example by applying the following formula:

a = argmaxτ K(Cn[u], Rn[u])(τ)

wherein a is the time delay for which the maximum of correlation is observed between Cn[u] and Rn[u], according to the function K which computes the cross-correlation of its arguments.
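A brute-force sketch of this maximisation of the cross-correlation may look as follows; the helper name `best_delay` and the bounded search range are illustrative assumptions, and practical implementations often evaluate the cross-correlation in the spectral domain instead:

```python
def best_delay(c, r, max_delay):
    """Return the delay a in [-max_delay, max_delay] maximising the
    cross-correlation K between the windowed current segment `c` and
    the reference segment `r` (substep ii of the aligning step).

    A negative a means `c` lags behind `r` and must be shifted back.
    """
    def corr(delay):
        # Cross-correlation of c and r at the given lag, summing only
        # over the samples where the two segments overlap
        return sum(c[i] * r[i + delay]
                   for i in range(len(c))
                   if 0 <= i + delay < len(r))
    return max(range(-max_delay, max_delay + 1), key=corr)
```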

[0034] Then, according to step 163 of correcting the time position of the current frame Fn[u], said current frame Fn[u] is displaced within the time frame to a position centered according to the time delay “a”, thereby resulting in a so-called corrected current frame ~Fn[u], which is segmented according to step 164 and windowed according to step 165, thereby generating a windowed corrected current frame ~Cn[u] to be used in the reconstructing step 170.

[0035] Particularly, in the method of the invention, the time position of the very first frame of the digitized audio signal is not corrected, and said first frame is used as such as the previous frame Fn-1[u] in the above-detailed aligning step 160. The aligning step 160 is reiterated until all the successive frames of the signal are aligned. Also, when only 2 frames are segmented in step 120, there is no need to repeat steps 161 to 165.

[0036] In the reconstructing step 170 of the method according to the invention, each of the windowed corrected current frames ~Cn[u] is put back in the time domain, using a function in accordance with the function that was used to obtain the frequency representation. For example, if a Fourier transform has been used to obtain a frequency representation of the signal, then an inverse Fourier transform will be used to put the signal back in the time domain; if a wavelet transform is used, then an inverse wavelet transform will be used to put the corrected frame back in the time domain. In a preferred embodiment, an inverse Fourier transform is used, even more preferably an inverse fast Fourier transform, to resynthesize the signal, a Fourier transform, respectively a fast Fourier transform, having been used to obtain the frequency representation; then the modified digitized audio is finally reconstructed through an overlap-add of all of its frames.

[0037] The overlap-add method, or any derivative thereof well known to the person skilled in the art, can be implemented in step 170 following the aligning step according to the invention. Optionally, said overlap-add method comprises a normalization procedure in order to account for the potential modulation of the windowed corrected current frames ~Cn[u], whose sum on each sample might not be equal to 1.0.
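The overlap-add with the optional normalization can be sketched as follows; this is an illustrative pure-Python version in which the frames are assumed to be already windowed, and the window envelopes are summed per sample so that regions where they do not add up to 1.0 are compensated:

```python
def overlap_add(frames, windows, step):
    """Reconstruct a signal from windowed corrected frames (step 170).

    `frames` are the windowed corrected frames ~Cn[u], `windows` the
    window envelopes used to produce them, and `step` the hop size in
    samples; dividing by the summed envelope normalises the output.
    """
    n_out = step * (len(frames) - 1) + len(frames[0])
    out = [0.0] * n_out
    norm = [0.0] * n_out
    for k, (frame, win) in enumerate(zip(frames, windows)):
        for i, (s, w) in enumerate(zip(frame, win)):
            out[k * step + i] += s
            norm[k * step + i] += w
    # Avoid dividing by a near-zero envelope at uncovered samples
    return [o / n if n > 1e-12 else 0.0 for o, n in zip(out, norm)]
```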

[0038] While not wishing to be bound by any particular theory, it can be inferred that, because the deterministic components of Rn[u] are correlated with those of the previously corrected window, the deterministic components of the corrected window ~Cn[u] will consequently also be correlated in time. This ensures the smooth continuity of the deterministic components throughout the whole transformed signal. Conversely, because the noise components of Rn[u] are not correlated with those of the previously corrected window, the noise components will not be correlated in time between ~Cn[u] and ~Fn-1[u] either. This prevents the noise components from over-correlating across frames and prevents or reduces the phasiness artifact, which is related to the undue correlation of non-deterministic components.

[0039] The method 100 of the invention is suited for generating a modification of any sound, the audio signal resulting from said modification showing no or reduced phasiness artifact. In particular, said audio signal resulting from said modification does not present the unwanted deterministic signal resulting from the undue temporal correlation of non-deterministic frame components that would otherwise occur in other audio signal modification methods of the art. The method 100 is particularly suited for audio signals corresponding to the human voice, either spoken or sung, or to the sound of a monophonic music instrument.

Computer program

[0040] Another object of the invention is a computer program for audio modification comprising instructions which, when the program is executed by a computer, implement the aligning step 160 as detailed above, in any of its embodiments, of aligning in phase the frames of a digitized audio signal which has been segmented into successive frames at a given time step, of a given width in the time domain, at least one of said frames having been resampled according to a predetermined desired effect.

Digital audio system

[0041] Another object of the invention relates to a digital audio system 500, as exemplified in Figure 4, which comprises:

- At least one digital processor 503 adapted to modify the digitized audio signal and notably to implement the aligning step 160 of the method as described above, in any of its embodiments. Said digital processor can be adapted to implement any transformation or modification of the digitized audio signal. In a particular embodiment, said digital processor is adapted to implement the method 100 for generating a modified audio signal of the invention as detailed above; said digital processor 503 is particularly suited for real-time modification of the audio signal.

[0042] Even more particularly, said digital processor 503 is adapted to segment the acquired audio signal into successive frames at a given time interval, of a given width in the time domain, to resample at least one frame according to a predetermined desired effect, to align the frames by performing the aligning step 160 of the method as described above, and then to resynthesize the modified digitized audio signal through an overlap-add method.

[0043] The digital audio system 500 can also comprise, as in one of the embodiments of the invention illustrated in Figure 4:

- At least one audio signal transducer 501 (e.g. a microphone, musical instrument pickup, phonograph cartridge, or tape head...) which converts a sound into an electric analog audio signal,

- At least one analog to digital converter (ADC) 502 which is adapted to convert the electric analog signal into a digitized audio signal,

- At least one digital to analog converter (DAC) 504 which is adapted to convert the modified digitized audio signal into a modified analog audio signal,

- At least one loudspeaker 505 adapted to generate a modified sound corresponding to the modified analog audio signal, and/or

- At least one memory 506 configured to store and/or to provide the stored digitized audio signal before or after its modification by the digital processor, which memory is operatively coupled to said audio signal transducer 501, ADC 502 and/or DAC 504. Such a digital audio system is particularly adapted to real-time modification of the audio signal.

[0044] As mentioned above, the aligning step 160 of the method for generating a modified audio signal is particularly suitable for implementation in the temporal domain while requiring few computing resources, therefore allowing a rapid generation of the modified audio signal and its implementation in an embedded digital audio system. Also, in a particular embodiment, the digital audio system of the invention is embedded.

REFERENCES

Laroche, J. & Dolson, M. (1997). Phase-vocoder: about this phasiness business. 4 pp. doi: 10.1109/ASPAA.1997.625603.

Axel Röbel (2006). Sinusoidal Modeling, summer lecture on analysis, modeling and transformation of audio signals. 25th August 2006.