Title:
METHOD AND SYSTEM FOR EPOCH BASED MODIFICATION OF SPEECH SIGNALS
Document Type and Number:
WIPO Patent Application WO/2016/035022
Kind Code:
A2
Abstract:
The present subject matter relates to a method and a speech modification system for modification of speech signals. In an embodiment, the method comprises receiving a speech signal in a digital format. Then, the method determines one or more epoch locations in the speech signal. Further, the method comprises assigning an identification to the one or more epoch locations in the speech signal by dedicating at least one bit of the least significant nibble of the speech signal. Then, the method comprises aligning frames of the speech signal based on the one or more epoch locations identified by the state of the dedicated at least one bit of the least significant nibble, wherein the frames are aligned using compare and shift operations. Finally, the method performs a weighted overlap-add of the aligned frames of the speech signal to modify the time-scale of the speech signal.

Inventors:
SEELAMANTULA CHANDRA SEKHAR (IN)
RUDRESH SUNIL (IN)
VASISHT ADITYA (IN)
Application Number:
PCT/IB2015/056661
Publication Date:
March 10, 2016
Filing Date:
September 02, 2015
Assignee:
INDIAN INST SCIENT (IN)
Attorney, Agent or Firm:
RAGHAVENDRA, Ramya Rao et al. (#4121/B 6th Cross, 19A Main, HAL II Stage, Bangalore Karnataka 8, IN)
Claims:
We claim:

1. A method for modifying a speech signal, comprising:

receiving, by a speech modification system, a speech signal in a digital format;

determining, by the speech modification system, one or more epoch locations in the speech signal;

assigning, by the speech modification system, an identification to the one or more epoch locations in the speech signal by dedicating at least one bit of the least significant nibble of the speech signal;

aligning, by the speech modification system, frames of the speech signal based on the one or more epoch locations identified by the state of the dedicated at least one bit of the least significant nibble, wherein the frames are aligned using compare and shift operations; and

performing, by the speech modification system, a weighted overlap-add of the aligned frames of the speech signal to modify the time-scale of the speech signal.

2. The method as claimed in claim 1, further comprising resampling the time-scale modified speech signal at a predefined rate to modify the pitch of the speech signal.

3. The method as claimed in claim 1, wherein the performing of the overlap-add of the weighted aligned frames comprises at least one of:

repeating a part of the aligned frames by a predetermined amount and adding the repeated frames to obtain time-stretched speech signals; and

overlapping the aligned frames by a predetermined amount, and adding the overlapped frames by removing some regions of the frame to obtain time-compressed speech signals.

4. The method as claimed in claim 1, wherein the one or more epoch locations are determined using a Zero Frequency Resonator (ZFR) method.

5. A speech modification system for modification of speech signals, comprising:

a processor; and

a memory communicatively coupled to the processor, wherein the memory stores processor-executable instructions, which, on execution, cause the processor to:

receive a speech signal in a digital format;

determine one or more epoch locations in the speech signal;

assign an identification to the one or more epoch locations in the speech signal by dedicating at least one bit of the least significant nibble of the speech signal;

align frames of the speech signal based on the one or more epoch locations identified by the state of the dedicated at least one bit of the least significant nibble, wherein the frames are aligned using compare and shift operations; and

perform a weighted overlap-add of the aligned frames of the speech signal to modify the time-scale of the speech signal.

6. The system as claimed in claim 5, wherein the one or more epoch locations are determined using a Zero Frequency Resonator (ZFR) method.

7. The system as claimed in claim 5, wherein the time-scale modified speech signal is resampled at a predefined rate to modify the pitch of the speech signal.

Description:
"METHOD AND SYSTEM FOR EPOCH BASED MODIFICATION OF SPEECH

SIGNALS"

FIELD OF THE DISCLOSURE

The present subject matter is related, in general, to processing of speech signals, and more particularly, but not exclusively, to a method and a system for modification of speech signals.

BACKGROUND

Time-scale modification (TSM) of a speech/audio signal results in the overall effect of speeding up or slowing down the perceived playback rate of the signal while retaining the signal's local frequency content. In other words, perceptually important features of the original signal should remain unchanged while the duration of the original signal is increased or decreased. In the case of speech, the time-scaled signal sounds as if the original speaker has spoken at a faster or slower rate. Pitch-scale modification combines time-scale modification with resampling of the signal at a required rate. Pitch-scaling has many applications in areas such as text-to-speech synthesis and in movies and advertisements: to give special effects to sound tracks, to fit a speech-based advertisement to a given time slot, to assist in lip-synchronization, to impersonate performances by vocalists, etc.

The conventional method for time-scaling a speech signal is to apply the overlap-add (OLA) technique. In the OLA technique, the input speech is segmented into overlapping frames with an analysis frame shift of 'Sa'. Time-scaled speech is obtained by outputting the frames after each synthesis period 'Ss', determined as Ss = aSa, where 'a' is the time-scaling factor. The last few samples of a speech frame are overlapped and added with the first few samples of the next speech frame with a cross-fading function to minimize discontinuities in the output speech. The OLA technique generally fails to preserve the pitch of an arbitrary input speech signal, because the output speech is determined solely by the choices of Sa and Ss. One way to overcome this problem is to synchronize the synthesis stage of OLA using Synchronized Overlap-Add (SOLA).
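To make the OLA procedure concrete, the following is a minimal NumPy sketch of plain (unsynchronized) OLA time-scaling; the frame length, analysis shift, and Hann cross-fade are illustrative choices, not parameters prescribed by the present disclosure:

```python
# Minimal sketch of plain OLA time-scaling (illustrative parameters).
import numpy as np

def ola_time_scale(x, a, N=512, Sa=256):
    Ss = int(a * Sa)                      # synthesis shift: Ss = a * Sa
    win = np.hanning(N)                   # cross-fading (weighting) window
    n_frames = (len(x) - N) // Sa
    y = np.zeros(n_frames * Ss + N)
    norm = np.zeros_like(y)               # accumulated window weight
    for m in range(n_frames):
        y[m * Ss : m * Ss + N] += win * x[m * Sa : m * Sa + N]
        norm[m * Ss : m * Ss + N] += win
    norm[norm < 1e-8] = 1.0               # avoid division by zero at edges
    return y / norm
```

Because the frames are output at the fixed period Ss without regard to pitch periods, such a sketch exhibits exactly the pitch artifacts described above, which SOLA and the present epoch-based method are designed to remove.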

In the SOLA method, a synchronization step is introduced to determine the overlapping regions of two frames before the addition of the frames. However, the drawback of the SOLA method is that more than two windows may overlap in certain regions, which complicates the process of overlapping and adding them with a suitable fading function.

Conventional TSM algorithms employ correlation to find similar portions of speech for the time-scaling operation. Although the quality of the resulting time-scaled speech signal is good, the use of correlation for synchronization does not always guarantee similarity between overlapping frames, particularly in the presence of noise. A synchronization method based on correlation is also not efficient in terms of computation. The state-of-the-art technologies that produce high-quality time-scaled speech perform correlation-based synchronization; however, calculating correlations between frames requires substantial computational resources.

Existing technologies rely on correlation to find similar portions in two speech frames. However, the presence of noise affects the correlation measure and leads to overlapping of non-similar regions, which causes audible/perceptible artifacts in the time/pitch-scaled speech.

The conventional method of pitch-scaling, which comprises time-scaling followed by resampling, suffers from poor-quality output, mainly because of the lack of good time-scaling techniques. Even when time-scaling techniques that produce good time-scaled speech signals are used for pitch-scaling, those techniques are computationally inefficient.

Therefore, there is a need for a method and a system for modification of speech signals to overcome the above-mentioned problems.

SUMMARY

One or more shortcomings of the prior art are overcome and additional advantages are provided through the present disclosure. Additional features and advantages are realized through the techniques of the present disclosure. Other embodiments and aspects of the disclosure are described in detail herein and are considered a part of the claimed disclosure.

Disclosed herein is a method for modifying a speech signal. The method comprises receiving, by a speech modification system, a speech signal in a digital format. Then, the method comprises determining one or more epoch locations in the speech signal and assigning an identification to the one or more epoch locations in the speech signal by dedicating at least one bit of the least significant nibble of the speech signal. The method further comprises aligning frames of the speech signal based on the one or more epoch locations identified by the state of the dedicated at least one bit of the least significant nibble. In an embodiment, the frames are aligned using compare and shift operations. Then, the method comprises performing a weighted overlap-add of the aligned frames of the speech signal to modify the time-scale of the speech signal.

In an aspect of the present disclosure, a speech modification system for modification of speech signals is provided. The speech modification system comprises a processor and a memory communicatively coupled to the processor. The memory stores processor-executable instructions, which, on execution, cause the processor to receive a speech signal in a digital format and determine one or more epoch locations in the speech signal. The processor further assigns an identification to the one or more epoch locations in the speech signal by dedicating at least one bit of the least significant nibble of the speech signal. Also, the processor aligns frames of the speech signal based on the one or more epoch locations identified by the state of the dedicated at least one bit of the least significant nibble, wherein the frames are aligned using compare and shift operations. The processor then performs a weighted overlap-add of the aligned frames of the speech signal to modify the time-scale of the speech signal.

The foregoing summary is illustrative only and is not intended to be in any way limiting. In addition to the illustrative aspects, embodiments and features described above, further aspects, embodiments, and features will become apparent by reference to the drawings and the following detailed description.

BRIEF DESCRIPTION OF THE ACCOMPANYING DRAWINGS

The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate exemplary embodiments and, together with the description, serve to explain the disclosed principles. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. The same numbers are used throughout the figures to reference features and components. Some embodiments of system and/or methods in accordance with embodiments of the present subject matter are now described, by way of example only, and with reference to the accompanying figures, in which:

Figure 1A illustrates exemplary speech waveforms for the utterance of words at different rates in accordance with some embodiments of the present disclosure;

Figure 1B illustrates a system diagram for modifying a speech signal in accordance with some embodiments of the present disclosure;

Figure 2 illustrates a block diagram of a system for modifying a speech signal in accordance with some embodiments of the present disclosure;

Figure 3 illustrates a flowchart of a method for modifying a speech signal in accordance with some embodiments of the present disclosure;

Figure 4 illustrates a schematic representation of the alignment of epochs in accordance with some embodiments of the present disclosure;

Figure 5 illustrates a representation of waveforms for the synchronization and overlap-addition of the frames in accordance with some embodiments of the present disclosure;

Figure 6 illustrates exemplary waveforms of a speech signal modified at different rates in accordance with some embodiments of the present disclosure.

It should be appreciated by those skilled in the art that any block diagrams herein represent conceptual views of illustrative systems embodying the principles of the present subject matter. Similarly, it will be appreciated that any flow charts, flow diagrams, state transition diagrams, pseudo code, and the like represent various processes that may be substantially represented in computer readable medium and executed by a computer or processor, whether or not such computer or processor is explicitly shown.

DETAILED DESCRIPTION

In the present document, the word "exemplary" is used herein to mean "serving as an example, instance, or illustration." Any embodiment or implementation of the present subject matter described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.

While the disclosure is susceptible to various modifications and alternative forms, specific embodiments thereof have been shown by way of example in the drawings and will be described in detail below. It should be understood, however, that it is not intended to limit the disclosure to the particular forms disclosed; on the contrary, the disclosure is to cover all modifications, equivalents, and alternatives falling within the spirit and the scope of the disclosure.

The terms "comprises", "comprising", or any other variations thereof, are intended to cover a non-exclusive inclusion, such that a setup, device or method that comprises a list of components or steps does not include only those components or steps but may include other components or steps not expressly listed or inherent to such setup or device or method. In other words, one or more elements in a system or apparatus proceeded by "comprises... a" does not, without more constraints, preclude the existence of other elements or additional elements in the system or apparatus.

In the following detailed description of the embodiments of the disclosure, reference is made to the accompanying drawings that form a part hereof, and in which are shown by way of illustration specific embodiments in which the disclosure may be practiced. These embodiments are described in sufficient detail to enable those skilled in the art to practice the disclosure, and it is to be understood that other embodiments may be utilized and that changes may be made without departing from the scope of the present disclosure. The following description is, therefore, not to be taken in a limiting sense.

The present disclosure relates to epoch-based time/pitch-scaling of speech signals. The epoch-based scaling is derived from an observation of the process of generation of voiced sounds. Figure 1A illustrates exemplary speech waveforms for the utterance of words at different rates in accordance with some embodiments of the present disclosure.

The illustrated figure shows the speech waveforms for the sound 'aa' in the utterance 'raa' spoken by a speaker at different rates. From the figure, it appears that the time-stretched signal (b) is obtained by repeating the glottal cycles, i.e., the speech signal between two successive epochs, of the upper waveform (a). Therefore, epochs play an important role in time-scaling of speech signals. To achieve time-stretching of speech signals, parts of the present frame are repeated by the required amount and overlap-added. To time-compress a speech signal, two frames are overlapped by the required amount and added, with possible removal of some regions of the signal. The amount of shifting to be done during overlapping and adding is synchronized with epoch locations.

Figure 1B illustrates a system diagram for modifying a speech signal in accordance with some embodiments of the present disclosure.

As shown in Figure 1B, a system 100 for modifying a speech signal comprises one or more components coupled with each other. In one implementation, the system 100 comprises an audio source 102, an Analog-to-Digital (A/D) Converter 104, a speech modification system 106, a Digital-to-Analog (D/A) Converter 108 and an audio sink 110. In an embodiment, the speech modification system 106 processes the input (audio/speech) signals received via the audio source 102. The audio source 102 comprises any device that receives input (audio/speech) signals. In some embodiments, the audio source 102 is configured to receive analog audio signals. In one example, the audio source 102 is a microphone coupled to the A/D converter 104. The microphone is configured to receive analog audio signals, while the A/D converter samples the analog audio signals to convert them into digital audio signals suitable for further processing. In alternative embodiments, the audio source 102 is configured to receive digital audio signals. For example, the audio source 102 is a disk device capable of reading audio signal data stored on a hard disk or other forms of media. Further embodiments may utilize other forms of audio signal sensing/capturing devices.

The audio sink 110 comprises any device for outputting the reconstructed audio signal. In some embodiments, the audio sink 110 is communicatively coupled to the D/A converter 108 for outputting an analog reconstructed audio signal. In an embodiment, the D/A converter 108 may be configured in the audio sink 110. As an example, the audio sink 110 may comprise a speaker. In this example, the D/A converter 108 is configured to receive and convert the digital reconstructed audio signal from the speech modification system 106 into the analog reconstructed audio signal. The speaker can then receive and output the analog reconstructed audio signal. The audio sink 110 can comprise any analog output device including, but not limited to, headphones, ear buds, or a hearing aid. Alternatively, the audio sink 110 comprises the D/A converter and an audio output port configured to be coupled to external audio devices, e.g., speakers, headphones, ear buds, or a hearing aid.

In alternative embodiments, the audio sink 110 outputs a digital reconstructed audio signal. In another example, the audio sink 110 is a disk device, wherein the reconstructed audio signal may be stored onto a hard disk or other medium. In alternative embodiments, the D/A converter 108 and the audio sink 110 are optional, and the speech modification system 106 produces the reconstructed audio signal for further processing. In one embodiment, the block diagram in Figure 1B can be implemented on a microcontroller, for example, a digital signal processor.

In one implementation, the speech modification system 106, as shown in FIG. 2, includes a Central Processing Unit ("CPU" or "processor") 202, a memory 204 and an interface 206. The processor 202 may comprise at least one data processor for executing program components and for executing user- or system-generated requests. A user may include a person, a person using a device such as those included in this disclosure, or such a device itself. The processor may include specialized processing units such as integrated system (bus) controllers, memory management control units, floating point units, graphics processing units, digital signal processing units, etc. The processor may include a microprocessor, such as AMD Athlon, Duron or Opteron, ARM's application, embedded or secure processors, IBM PowerPC, Intel's Core, Itanium, Xeon, Celeron or other lines of processors, etc. The processor 202 may be implemented using mainframe, distributed processor, multi-core, parallel, grid, or other architectures. Some embodiments may utilize embedded technologies such as application-specific integrated circuits (ASICs), digital signal processors (DSPs), Field Programmable Gate Arrays (FPGAs), etc. Among other capabilities, the processor 202 is configured to fetch and execute computer-readable instructions stored in the memory 204. The memory 204 can include any non-transitory computer-readable medium known in the art including, for example, volatile memory (e.g., RAM) and/or non-volatile memory (e.g., EPROM, flash memory, etc.). The interface(s) 206 may include a variety of software and hardware interfaces, for example, a web interface, a graphical user interface, etc. The interface 206 is coupled with the processor 202 and an I/O device. The I/O device is configured to receive inputs from a user via the interface 206 and transmit outputs for display on the I/O device via the interface 206. In one implementation, the speech modification system 106 further comprises modules 208.

In one example, the modules 208 may be stored within the memory 204. The modules 208, amongst other things, include routines, programs, objects, components, and data structures, which perform particular tasks or implement particular abstract data types. The modules 208 may also be implemented as signal processor(s), state machine(s), logic circuitries, and/or any other device or component that manipulates signals based on operational instructions. Further, the modules 208 can be implemented by one or more hardware components, by computer-readable instructions executed by a processing unit, or by a combination thereof.

The modules 208 may include, for example, an epoch estimation module 210, an identification assigning module 212, a frame alignment module 214 and an overlap-add module 216 coupled with the processor 202. The speech modification system 106 may also comprise other modules 218 to perform various miscellaneous functionalities of the speech modification system 106. It will be appreciated that the aforementioned modules may be represented as a single module or a combination of different modules.

In operation, the epoch estimation module 210 determines one or more epoch locations in the speech signal.

In an embodiment, the identification assigning module 212 assigns an identification to the one or more epoch locations in the speech signal by dedicating at least one bit of the least significant nibble of the speech signal. The frame alignment module 214 aligns frames of the speech signal based on the one or more epoch locations identified by the state of the dedicated at least one bit of the least significant nibble. In some embodiments, the frames are aligned using compare and shift operations. In an embodiment, the overlap-add module 216 performs a weighted overlap-add of the aligned frames of the speech signal to modify the time-scale of the speech signal.

Figure 3 illustrates a flowchart of a method for modifying a speech signal in accordance with some embodiments of the present disclosure.

As illustrated in Figure 3, the method 300 comprises one or more blocks for modifying a speech signal. The method 300 may be described in the general context of computer executable instructions. Generally, computer executable instructions can include routines, programs, objects, components, data structures, procedures, modules, and functions, which perform particular functions or implement particular abstract data types.

The order in which the method 300 is described is not intended to be construed as a limitation, and any number of the described method blocks can be combined in any order to implement the method 300. Additionally, individual blocks may be deleted from the method 300 without departing from the spirit and scope of the subject matter described herein. Furthermore, the method 300 can be implemented in any suitable hardware, software, firmware, or combination thereof.

The method of the present disclosure takes analysis frames of the input speech signal at an average rate of S_a, with each starting position allowed to vary within limits as in equation (1):

x_m(j) = x(mS_a + k_m + j), 0 <= j <= N - 1,   (1)

and an output signal is reconstructed using a fixed synthesis frame length S_s as in equations (2) and (3):

y(mS_s + j) = β(j) y(mS_s + j) + (1 - β(j)) x_m(j), 0 <= j <= L_ov - 1,   (2)

y(mS_s + j) = x_m(j), L_ov <= j <= N - 1,   (3)

where,

L_ov = N - S_s is the number of samples in the overlapping region,

S_s = aS_a, where 'a' is the time-scale factor, and

k_m is the starting point of the analysis frame.

The shift k_m of the analysis frame affects the starting point of the analysis frame x_m(j) in the input signal. In an embodiment, a suitable k_m is chosen by aligning the epochs in the overlapping region of the present output and analysis frames. β(j) is a weighting/cross-fading function, which can be linear, a Hamming window, or a raised cosine, and is used to perform weighting in the overlap-add reconstruction to suppress block artifacts.
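As a worked illustration of equations (2) and (3), a minimal sketch of the cross-faded overlap-add step is given below; the linear β(j) and the variable names are assumptions for illustration, not part of the disclosure:

```python
# Minimal sketch of the weighted overlap-add of equations (2) and (3).
import numpy as np

def overlap_add(y, frame, m, Ss, N):
    L_ov = N - Ss                          # samples in the overlapping region
    beta = np.linspace(1.0, 0.0, L_ov)     # linear cross-fade; a Hamming or
                                           # raised-cosine taper also works
    start = m * Ss
    # equation (2): old output fades out while the new frame fades in
    y[start : start + L_ov] = (beta * y[start : start + L_ov]
                               + (1.0 - beta) * frame[:L_ov])
    # equation (3): the non-overlapping part of the frame is copied directly
    y[start + L_ov : start + N] = frame[L_ov:]
    return y
```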

At block 310, receive a speech signal in a digital format. The speech modification system 106 receives the speech signal from the audio source 102 through the A/D converter 104 in a digital format.

At block 320, determine one or more epoch locations in the speech signal. The locations of epochs in a speech signal can be determined using one or more methods including, but not limited to, the Zero Frequency Resonator (ZFR) and Spectral Zero-Crossing Rate (SZCR) methods. A person skilled in the art would understand that any other epoch estimation method could also be used with the method of the present disclosure. The ZFR method for epoch estimation is accurate and robust for different types of noise-corrupted data at relatively high SNRs of 10-15 dB. Also, the ZFR epoch estimation method is computationally efficient, as it uses two IIR filters with a local mean subtraction process.
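By way of illustration only, a sketch of ZFR-style epoch estimation as commonly described in the literature follows; the window length and the number of mean-subtraction passes are assumptions, not values specified in the present disclosure:

```python
# Hedged sketch of Zero Frequency Resonator (ZFR) epoch estimation.
import numpy as np
from scipy.signal import lfilter

def zfr_epochs(x, fs, avg_pitch_ms=8.0):
    d = np.diff(x, prepend=x[0])              # differencing removes DC bias
    y = lfilter([1.0], [1.0, -2.0, 1.0], d)   # first 0-Hz resonator (IIR)
    y = lfilter([1.0], [1.0, -2.0, 1.0], y)   # second resonator in cascade
    w = int(avg_pitch_ms * 1e-3 * fs) | 1     # odd window ~ avg pitch period
    kernel = np.ones(w) / w
    for _ in range(3):                        # repeated local-mean subtraction
        y = y - np.convolve(y, kernel, mode='same')
    # epochs: negative-to-positive zero crossings of the residual
    return np.where((y[:-1] < 0) & (y[1:] >= 0))[0] + 1
```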

In an embodiment, the position of the first epoch in the m-th output frame y(mS_s + j) is determined for 0 <= j <= L_ov - 1 (the overlapping region). Let this position be j_y. Then, the positions of the epochs in the m-th analysis frame x_m(j) are determined for 0 <= j <= L_ov - 1. Let the positions of the epochs be denoted by j_x1, j_x2, ..., j_xp, where p is the number of epochs in the overlapping region.

At block 330, assign an identification to the one or more epoch locations in the speech signal. In an embodiment, the identification is provided by dedicating at least one bit of the least significant nibble of the speech signal to indicate the presence of an epoch location in the speech signal. As an example, the Least Significant Bit (LSB) may be marked as '1' for all sample locations of a given speech signal that correspond to an epoch. Further, the least significant bit of the rest of the samples, which are identified as not-an-epoch locations, may be marked as zero.
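A minimal sketch of the LSB marking of block 330 follows, assuming 16-bit PCM samples; the function names are hypothetical:

```python
# Sketch of dedicating the least significant bit to flag epoch locations.
import numpy as np

def mark_epochs_lsb(samples, epoch_locations):
    """Set the LSB to 1 at epoch samples and 0 elsewhere (int16 PCM)."""
    marked = np.asarray(samples, dtype=np.int16) & ~np.int16(1)  # clear LSBs
    marked[np.asarray(epoch_locations)] |= 1                     # flag epochs
    return marked

def read_epochs_lsb(samples):
    """Recover the epoch locations from the dedicated LSB."""
    return np.where(np.asarray(samples) & 1)[0]
```

Since the flag perturbs each sample by at most one quantization step, its audible effect is negligible, and the epochs travel with the signal, so they need not be re-estimated for repeated scaling.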

At block 340, align frames of the speech signal based on the one or more epoch locations. In an embodiment, the frames are aligned based on the one or more epoch locations identified by the state of the dedicated at least one bit of the least significant nibble. In some embodiments, the frames are aligned using compare and shift operations.

The process of aligning epochs is described in Figure 4. Figure 4 illustrates a schematic representation of the alignment of epochs for synchronizing synthesis and analysis frames in accordance with some embodiments of the present disclosure. For ease of understanding, the figure shows only the epoch locations in the speech waveform instead of the entire speech signal. In the synthesis frame of length N, the location of the first epoch in the overlapping region is j_y. In the analysis frame, also of length N, the second epoch location is chosen as j_xopt so that k_m is positive and minimum. Now, the analysis frame is shifted right by k_m samples to align the epochs of the two frames. The initial k_m samples of the analysis frame, which are now out of the overlapping region, are discarded. To compensate for the discarded samples, k_m samples from the next analysis frame are appended at the tail of the present analysis frame, as shown in the third frame. This ensures that the length of the frame remains essentially the same.

The parameter k_m can take values in the range [0, K_max], where K_max is the maximum amount of shift that can be allowed. K_max is chosen such that it is larger than the largest pitch period in the entire speech signal. β(j) is a fading factor used while overlapping and adding so as to reduce any audible artifacts by providing a smooth transition between synthesized frames. In case no epochs are present in the analysis or synthesis frames, or if k_m > K_max, k_m is set to zero. This situation occurs when dealing with unvoiced/silent segments of the input signal. Setting k_m to zero signifies that the output and input frames are simply overlapped and added based on the time-scale factor, without employing any synchronization procedure.
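A sketch of this compare-and-shift selection of k_m follows, assuming the synthesis and analysis overlap regions carry the LSB epoch flags of block 330; the names are illustrative:

```python
# Sketch of choosing the shift k_m from LSB-flagged epoch locations.
import numpy as np

def choose_shift(synth_overlap, analysis_overlap, K_max):
    syn_epochs = np.where(np.asarray(synth_overlap) & 1)[0]
    ana_epochs = np.where(np.asarray(analysis_overlap) & 1)[0]
    if len(syn_epochs) == 0 or len(ana_epochs) == 0:
        return 0                      # unvoiced/silent: plain overlap-add
    j_y = syn_epochs[0]               # first epoch in the synthesis overlap
    shifts = j_y - ana_epochs         # right-shift aligning each epoch to j_y
    shifts = shifts[shifts >= 0]
    if len(shifts) == 0 or shifts.min() > K_max:
        return 0                      # no admissible shift within K_max
    return int(shifts.min())          # minimal non-negative k_m
```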

At block 350, perform a weighted overlap-add of the aligned frames of the speech signal to modify the time-scale of the speech signal. The process of aligning epochs and overlap-adding the synchronized glottal cycles with weights (a cross-fading function) for two frames of a speech signal is shown in Figure 5. Figure 5 illustrates speech waveforms over which the epochs are superimposed. The output speech waveform is obtained after weighted overlap-adding of the input frames.

Further, in an embodiment, the time-scale modified speech signal may be resampled at a predefined rate to modify the pitch of the speech signal. In an exemplary embodiment, the pitch shift is performed by a factor β. In such a case, the signal playback rate is first changed to 1/β times the original rate, and then the time-scaled signal is resampled at the rate F_s/β, where F_s is the original sampling frequency. The output of the pitch shift is a signal of the same duration as the original but with frequencies scaled by β. If the desired pitch is higher than the original pitch, i.e., if β > 1, then the distance between consecutive pitch marks decreases. On the other hand, if β < 1, the distance between consecutive pitch marks increases. Figure 6 illustrates exemplary waveforms of a speech signal modified at different rates in accordance with some embodiments of the present disclosure. The first block illustrates the speech signal at the original rate (1x). The second block illustrates the speech signal with time-scale modification at a rate of 1.6x. The third block illustrates the speech signal with time-scale modification at a rate of 0.6x.
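A minimal sketch of this pitch-scaling step is given below; `time_scale` stands for any time-scaling routine, such as the epoch-synchronized procedure described above, and is passed in as a callable because its implementation is not repeated here:

```python
# Sketch of pitch-scaling by a factor beta: time-scale, then resample.
from scipy.signal import resample

def pitch_scale(x, beta, time_scale):
    # change the playback rate to 1/beta: duration x beta, pitch preserved
    stretched = time_scale(x, beta)
    # resampling back to the original length (rate Fs/beta) restores the
    # original duration while scaling all frequencies by beta
    return resample(stretched, len(x))
```

For example, `pitch_scale(x, 1.2, ola_time_scale)` would raise the pitch by 20% while keeping the duration of x unchanged.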

The described operations may be implemented as a method, system or article of manufacture using standard programming and/or engineering techniques to produce software, firmware, hardware, or any combination thereof. The described operations may be implemented as code maintained in a "non-transitory computer readable medium", where a processor may read and execute the code from the computer readable medium. The processor is at least one of a microprocessor and a processor capable of processing and executing queries. The processor may be implemented as a central processing unit (CPU) of the speech modification system 106. The CPU may include one or more processing units having one or more processor cores, or any number of processors having any number of processor cores. The CPU may include any type of processing unit, such as, for example, a multi-processing unit, a reduced instruction set computer (RISC), a processor having a pipeline, a complex instruction set computer (CISC), a digital signal processor (DSP), and so forth. A non-transitory computer readable medium may comprise media such as magnetic storage media (e.g., hard disk drives, floppy disks, tape, etc.), optical storage (CD-ROMs, DVDs, optical disks, etc.), volatile and non-volatile memory devices (e.g., EEPROMs, ROMs, PROMs, RAMs, DRAMs, SRAMs, flash memory, firmware, programmable logic, etc.), etc. Non-transitory computer-readable media comprise all computer-readable media except for transitory media. The code implementing the described operations may further be implemented in hardware logic (e.g., an integrated circuit chip, Programmable Gate Array (PGA), Application Specific Integrated Circuit (ASIC), etc.).

Still further, the code implementing the described operations may be implemented in "transmission signals," where transmission signals may propagate through space or through a transmission medium, such as an optical fiber, copper wire, etc. The transmission signals in which the code or logic is encoded may further comprise a wireless signal, satellite transmission, radio waves, infrared signals, Bluetooth, etc. The transmission signals in which the code or logic is encoded are capable of being transmitted by a transmitting station and received by a receiving station, where the code or logic encoded in the transmission signal may be decoded and stored in hardware or a non-transitory computer readable medium at the receiving and transmitting stations or devices. An "article of manufacture" comprises a non-transitory computer readable medium, hardware logic, and/or transmission signals in which code may be implemented. A device in which the code implementing the described embodiments of operations is encoded may comprise a computer readable medium or hardware logic. Of course, those skilled in the art will recognize that many modifications may be made to this configuration without departing from the scope of the invention, and that the article of manufacture may comprise any suitable information-bearing medium known in the art.

The terms "an embodiment", "embodiment", "embodiments", "the embodiment", "the embodiments", "one or more embodiments", "some embodiments", and "one embodiment" mean "one or more (but not all) embodiments of the invention(s)" unless expressly specified otherwise. The terms "including", "comprising", "having" and variations thereof mean "including but not limited to", unless expressly specified otherwise. The enumerated listing of items does not imply that any or all of the items are mutually exclusive, unless expressly specified otherwise. Moreover, the terms "first," "second," "third," and so forth, are used merely as labels, and are not intended to impose numerical requirements on their objects.

The terms "a", "an" and "the" mean "one or more", unless expressly specified otherwise. A description of an embodiment with several components in communication with each other does not imply that all such components are required. On the contrary a variety of optional components are described to illustrate the wide variety of possible embodiments of the invention.

When a single device or article is described herein, it will be readily apparent that more than one device/article (whether or not they cooperate) may be used in place of a single device/article. Similarly, where more than one device or article is described herein (whether or not they cooperate), it will be readily apparent that a single device/article may be used in place of the more than one device or article or a different number of devices/articles may be used instead of the shown number of devices or programs. The functionality and/or the features of a device may be alternatively embodied by one or more other devices which are not explicitly described as having such functionality/features. Thus, other embodiments of the invention need not include the device itself.

The illustrated operations of Figure 3 show certain events occurring in a certain order. In alternative embodiments, certain operations may be performed in a different order, modified, or removed. Moreover, steps may be added to the above-described logic and still conform to the described embodiments. Further, operations described herein may occur sequentially, or certain operations may be processed in parallel. Yet further, operations may be performed by a single processor or by distributed processing units.

Additionally, the advantages of the present disclosure are illustrated herein. In an embodiment, the present disclosure provides a computationally efficient technique for time-scaling of speech/audio signals which is robust to additive noise and produces a high-quality time-scaled version of the input speech as compared to correlation techniques. In an embodiment, the present disclosure provides a computationally efficient technique for pitch-scaling of speech/audio signals which produces a high-quality pitch-scaled version of the input speech.

In an embodiment, the method of the present disclosure uses epoch estimation algorithms, such as ZFR, which are robust to additive noise. This alleviates the problem of degradation in the quality of time-scaled speech due to the presence of noise.

In an embodiment, the method of the present disclosure uses ZFR or SZCR for the estimation of epochs, due to which the method performs faster while producing high-quality output.

In an embodiment, the quality of pitch-scaled signals obtained by the method of the present disclosure is observed to be high and provides an improvement over that of other known pitch-scaling techniques. In an embodiment, epochs can be embedded into the signal in the least significant nibble of the binary representation. Therefore, if the signal is required to be time/pitch-scaled multiple times, the epochs need not be estimated again; they need to be estimated only once.

In an embodiment, the present disclosure does not require voiced/unvoiced segment classification or voice activity detection, which are needed in most of the conventional methods.

In an embodiment, the method of the present disclosure is suitable for differential time/pitch-scaling, that is, scaling differently in different time regions. The present disclosure can be used in many applications; a few non-limiting examples are described herein. In an embodiment, a blind person can use the method of the present disclosure to speed up or slow down the speech or audio track of a video. Text-to-speech synthesis systems can use the method of the present disclosure for performing prosody modification. A learner of a foreign language or music can use the method to slow down speech or music, respectively, from a trainer/instructor and practice learning the nuances.

Movie technicians can use the method of the present disclosure for lip-synchronization without having to record the song track multiple times. A voice-mail system can use the method to play back messages at a faster rate and selectively slow down as desired. The method can also be used in impersonation performances, for example to impersonate a vocalist, and in animation movies in particular to give special effects to characters. The technique can also be applied effectively to singing voices. The method of the present disclosure is applicable to any periodic or quasi-periodic signal (electrocardiogram signals, for example), in addition to speech signals.

Finally, the language used in the specification has been principally selected for readability and instructional purposes, and it may not have been selected to delineate or circumscribe the inventive subject matter. While various aspects and embodiments have been disclosed herein, other aspects and embodiments will be apparent to those skilled in the art. The various aspects and embodiments disclosed herein are for purposes of illustration and are not intended to be limiting, with the true scope and spirit being indicated by the following claims.

Referral Numerals:

Reference Number    Description
102                 Audio Source
104                 A/D Converter
106                 Speech Modification System
108                 D/A Converter
110                 Audio Sink
202                 Processor
204                 Memory
206                 Interface
208                 Modules
210                 Epoch Estimation Module
212                 Identification Assigning Module
214                 Frame Alignment Module
216                 Overlap-add Module
218                 Other Modules