

Title:
PITCH PERIOD AND VOICED/UNVOICED SPEECH MARKING METHOD AND APPARATUS
Document Type and Number:
WIPO Patent Application WO/2018/026329
Kind Code:
A1
Abstract:
The present invention generally relates to speech analysis and synthesis; more specifically, it concerns a method and apparatus for voiced/unvoiced segmentation and pitch period marking in the analysis of a complex signal. The pitch marking method comprises removing the signal offset, low-pass zero-phase filtering with cut-off frequency fLP, calculation of the short-time autocorrelation using a sliding window with a variable window size, defining the short-time pitch autocorrelation time lag, defining the bandpass filter coefficients based on the value of the short-time pitch autocorrelation time lag, filtering the speech signal with an adaptive zero-phase bandpass filter with central frequency Fcf, where the filter output signal is a pitch waveform, defining the pitch marks of the pitch signal by extracting the half periods of the pitch waveform, defining the pitch segment, and mapping the pitch marks of the pitch waveform onto the pitch marks of the speech signal. The method can further denote the pitch mark as a negative peak within the pitch segment, as a positive peak, or as the start and end sample of the speech period within which the positive and negative peaks appear. Voiced/unvoiced speech detection is performed using the thresholding method. The short-time absolute amplitude average of the zero-phase low-pass filtered signal and the short-time pitch autocorrelation time lags are used as criteria for voiced/unvoiced segment detection and marking.

Inventors:
KACIC ZDRAVKO (SI)
Application Number:
PCT/SI2017/000007
Publication Date:
February 08, 2018
Filing Date:
April 25, 2017
Assignee:
UNIVERZA V MARIBORU FAKULTETA ZA ELEKTROTEHNIKO RACUNALNISTVO IN INFORMATIKO (SI)
International Classes:
G10L25/90; G10L25/93
Foreign References:
US 2009/0182556 A1 (2009-07-16)
US 8,280,725 B2 (2012-10-02)
US 8,214,201 B2 (2012-07-03)
US 2004/0260537 A1 (2004-12-23)
US 6,349,277 B1 (2002-02-19)
US 6,470,311 B1 (2002-10-22)
EP 0 227 858 A1 (1987-07-08)
Other References:
LEGÁT M ET AL: "On the detection of pitch marks using a robust multi-phase algorithm", SPEECH COMMUNICATION, ELSEVIER SCIENCE PUBLISHERS, AMSTERDAM, NL, vol. 53, no. 4, 14 January 2011 (2011-01-14), pages 552 - 566, XP028168781, ISSN: 0167-6393, [retrieved on 20110121], DOI: 10.1016/J.SPECOM.2011.01.008
DE CHEVEIGNÉ ALAIN ET AL: "YIN, a fundamental frequency estimator for speech and music", THE JOURNAL OF THE ACOUSTICAL SOCIETY OF AMERICA, AMERICAN INSTITUTE OF PHYSICS FOR THE ACOUSTICAL SOCIETY OF AMERICA, NEW YORK, NY, US, vol. 111, no. 4, 1 April 2002 (2002-04-01), pages 1917 - 1930, XP012002854, ISSN: 0001-4966, DOI: 10.1121/1.1458024
Attorney, Agent or Firm:
PATENTNI BIRO AF D.O.O. (SI)
Claims:
Patent claims

1. A method for pitch marking of speech and voiced/unvoiced segment marking, wherein the method comprises the following steps:

- the offset is first removed from the speech signal and the signal is low-pass filtered with a zero-phase low-pass filter with cut-off frequency FLP;

- the output low-pass filtered signal is used to calculate the short-time absolute amplitude average and short-time biased autocorrelation using a sliding window of variable window size;

- the values of the first two harmonically related peaks (harmonic peaks) of the short-time autocorrelation sequence and of the pitch autocorrelation time lag are defined;

- based on the pitch autocorrelation time lag value, the value of the central frequency Fcf of the zero-phase bandpass filter is defined and the filter coefficients calculated;

- the short-time absolute amplitude average of the low-pass filtered speech signal is calculated to determine voiced segments;

- in the case that pitch autocorrelation time lag values exist for the current segment of the speech signal and the values of the short-time amplitude average are greater than threshold TRE1, the current segment is denoted as a voiced segment; otherwise, the current segment is denoted as unvoiced;

- the low-pass filtered signal segment for which the pitch autocorrelation time lag values were defined is filtered with an adaptive bandpass zero-phase filter that uses the calculated filter coefficients. The output signal is the extracted pitch signal of the current speech segment. For voiced segments, the pitch marks are set at zero-crossing points of the pitch waveform as the start and end sample of the pitch period within which the positive and negative peaks appear;

- in the last stage, the mapping of the pitch marks of the pitch waveform onto the pitch marks of the speech signal is carried out;

- the pitch marks for unvoiced speech segments are defined in a rule-based manner, where the rules applied may place the pitch marks of unvoiced speech segments at constant time intervals of predefined lengths, define the interval length as the average value of the pitch mark lengths of the voiced segments, or base the definition on otherwise defined statistical characteristics of the voiced speech pitch lengths.

2. The method of claim 1, characterized in that the proposed method in the first step (1010) removes the offset from the input signal (1000); the resulting speech signal is filtered by a zero-phase low-pass (ZPLP) filter in step (1020); the output low-pass filtered speech signal is processed by a correlation module in step (1030), which includes a short-time autocorrelation calculation in step (1031) and a search for time lags of autocorrelation peaks in step (1032) to define the pitch autocorrelation time lag of the speech signal; the low-pass filtered speech signal resulting from step (1020) is input to step (1060), the speech buffer, and to step (1045), where the short-time absolute amplitude average is calculated; in step (1050) voiced/unvoiced speech detection is performed, where the signal from step (1032) is also used; the output speech signal from step (1050) is led to step (1060) (speech buffer) and step (2080); the signal from step (1032) is also led to step (2080), where the filter coefficients are calculated; the filter coefficients calculated in step (2080) are used in the passband zero-phase filter with central frequency Fcf in step (2070); step (2070) is followed by step (2090), where the generation of the pitch signal is performed; the defined pitch signal is led to the pitch marking step (1111), to which the signal with a removed offset is also conveyed; the pitch signal defined in step (2090), and the speech signal defined in step (1010), with detected polarity, which is defined in step (2100), are led to step (1110), which, in addition to step (1111) for voiced speech segment marking, also includes step (1112), where pitch marking of unvoiced speech segments is performed using the detected marks of unvoiced segments performed in step (1050) and the speech signal with offset removed; the processing results of steps (1050) and (1110) are written in the set of marks in step (1120); from step (1050) the processing result is written in the set of marks of voiced/unvoiced segments in step (1122); the processing results from steps (1111) and (1112) are written in a set of pitch marks of voiced and unvoiced speech segments in step (1121).

3. The method of claim 2, characterized in that the filter coefficient generation step is composed of step (2082), which includes a set of predefined coefficients for a set of bandpass filters, where the set of predefined filter coefficients is input to step (2081), which includes a module for filter coefficient selection. The output of step (2081) is a set of coefficients of a selected bandpass filter with central frequency Fcf.

4. The method of one of the claims from 1 to 3, characterized in that by triggering the START key the offset is removed from the input signal (1000) in step (1010); in step (1020), the resulting speech signal from step (1010) is filtered by a zero-phase low-pass (ZPLP) filter (1020) with cutoff frequency FLP, where the value of FLP is set above the highest pitch value expected in the input speech signal; in the case of low frequency noise, the input low-pass filter with cutoff frequency FLP may be replaced by a zero-phase bandpass filter, in order to remove the low frequency noise and to increase the robustness of the proposed method; the output low-pass filtered speech signal from step (1020) is processed by a correlation module (1030), which in step (1031) calculates a short-time autocorrelation, and in step (1032) searches for time lags of autocorrelation peaks; the short-time autocorrelation module calculates a biased autocorrelation sequence for the sliding window W of variable window size over the whole input signal; for each time instant, an initial long-time window WLT is first used to calculate the autocorrelation sequence from step (1031); the obtained autocorrelation sequence is normalised to value 1; the normalised autocorrelation sequence from step (1031) represents the input to the autocorrelation peak search module (1032), which searches for the first two harmonically related peaks in the autocorrelation sequence; the peaks are defined using threshold criteria; three thresholds are defined: TRA1, TRA2 and TRA3; in the case that both of the first two harmonically related autocorrelation peaks are above the corresponding thresholds, the autocorrelation time lag of the first peak, which was defined in step (1033), is defined as the time lag of the long-time window WLT; in the case that an autocorrelation time lag of WLT in step (1034) exists, a new length of the analysis window WN (short-time window) is defined to improve the time resolution of the method; the length of the short-time window WN is defined in step (1035) as a multiple of the autocorrelation time lag of the long-time window WLT; the autocorrelation sequence is calculated again using the short-time window WN in step (1036); a search for the first two harmonically related short-time autocorrelation peaks is performed in step (1037); if peaks in step (1038) exist and exceed thresholds TRA1 and TRA2, or if only the first peak exceeds the corresponding threshold TRA3, while the second peak is below threshold TRA2, the time lag of the first autocorrelation peak is denoted as the autocorrelation time lag of the short-time window WN; if none of the peaks exceed the thresholds, the autocorrelation time lag of the short-time window WN is not defined; if an autocorrelation time lag of the short-time window WN exists, it is compared in step (1039) to the calculated autocorrelation time lag of the initial long-time window WLT; if the difference between the time lags in step (1040) is less than the predefined threshold TRL1, the autocorrelation time lag of the short-time window WN is denoted as a pitch autocorrelation time lag for the current window in step (1041); otherwise, the autocorrelation time lag of the long-time window WLT is denoted as a pitch autocorrelation time lag of the current window in step (1042); if the autocorrelation time lag of the short-time window WN does not exist, the autocorrelation time lag of the long-time window WLT is defined as the pitch autocorrelation time lag; if no peaks of the autocorrelation sequence for the long-time window WLT exceed the thresholds, no pitch autocorrelation time lag is defined for the current window; the outputs from steps (1041) and (1042), as well as from step (1034), for cases where no pitch autocorrelation time lag for long-time window WLT exists, are inputs to step (1045).

5. The method of claim 4, characterized in that the threshold values are defined as TRA1=0.3, TRA2=0.2 and TRA3=0.38.

6. The method of claim 4, characterized in that the threshold values TRA1, TRA2 and TRA3 are defined experimentally and can differ for different acoustic environments.

7. A pitch marking and voiced/unvoiced segment marking apparatus, comprising:

- an offset removal unit,

- a zero-phase low-pass filter,

- an autocorrelation calculation and peak search unit,

- a short-time absolute amplitude average calculation unit,

- a voiced/unvoiced speech detection unit,

- a voiced speech buffer,

- a bandpass filter coefficient generation unit,

- an adaptive zero-phase bandpass filter,

- a speech polarity detection unit, and

- a pitch marking unit.

8. The method of any of claims 1 to 6, wherein the results of the method are used for pitch-synchronous speech quality improvement, clinical diagnostics, speech coding, automatic phonetic segmentation, pitch-synchronous speech analysis and processing, speaker characterisation, speech conversion, speech recognition, and speech synthesis.

Description:
PITCH PERIOD AND VOICED/UNVOICED SPEECH MARKING METHOD AND APPARATUS

Field of the invention

The present invention generally relates to speech analysis and synthesis; more specifically, it concerns a method and apparatus for voiced/unvoiced segmentation and pitch period marking in the analysis of a complex signal.

The technical problem

The present invention solves the technical problem of how to mark voiced and unvoiced signal segments for accurate definition of pitch marks with high time resolution. The problem is also how to mark pitch periods in the voiced segments and set marks in the unvoiced segments of a complex signal, such as speech or music, in order to avoid the problems of windowing, low time and frequency resolution, averaging, pitch mark insertion and deletion, and poor noise robustness. A noise-robust method with high time and frequency resolution and high accuracy is the goal of this invention.

Pitch is a fundamental feature of a speech signal. In the time domain, it is denoted by a pitch period within the voiced segment of speech. In the case of a speech signal, pitch marking means marking the time instants of glottal closure, also referred to as glottal closure instants (GCIs) or epochs. Usually, the pitch mark is set at the time instant of the speech signal amplitude extreme within a pitch period - most often a negative peak, which corresponds to the time instant of glottal closure.

In automatic pitch marking, a decision must be made whether to place the pitch marks at positive or negative peaks of a speech waveform, which can be determined by a polarity detection algorithm. Pitch marks can also be placed at zero-crossing points of the speech signal, where these points represent the start and end sample of the speech period within which the positive and negative peaks appear. The accurate marking of pitch periods is important in a number of speech processing areas, including pitch-synchronous speech enhancement, clinical diagnosis, speech coding, automatic phonetic segmentation, pitch-synchronous speech analysis and processing, speaker characterisation, voice conversion, speech recognition, and speech synthesis.

State of the art

In recent years, many pitch marking algorithms have been proposed that are based on different speech processing techniques, such as linear prediction, cepstral analysis, the autocorrelation function, the average magnitude difference function, Cohen's class time-frequency representations (TFR), group delay-based methods, multi-phase algorithms, ensemble empirical mode decomposition, and threshold-based and peak-picking methods. Many of these techniques face problems such as the windowing effect, low time or frequency resolution, averaging, epoch insertion or deletion, poor noise robustness, etc. Furthermore, many of them are computationally inefficient.

Among the frequently used pitch marking methods are autocorrelation-based methods. Such methods are described in US Pat. Nos. 8,280,725 B2 and 8,214,201 B2, in which the pitch period is defined based on the computation of the autocorrelation values of overlapping portions of the speech signal and calculation of a combined autocorrelation value, which is used to select the estimated pitch period. In document US 2004/0260537 A1 the autocorrelation is used to determine the pitch period using the iterative technique of calculating the autocorrelation lag that denotes the pitch period. Although most autocorrelation-based approaches can achieve good accuracy of pitch marks, the problem of low time resolution remains. In order to accurately define the autocorrelation values, and thus achieve good accuracy of the method, the autocorrelation window should contain three or more pitch periods of the voiced speech signal, which requires long time windows and thus lowers the time resolution of the method. The time resolution becomes critical, for example, at voiced/unvoiced speech borders.

Several other approaches use filtering techniques to extract the pitch period. In document US 6,349,277 B1 a pitch analyser is used to evaluate the pitch of the input signal, and an adaptive lowpass filter sets its cut-off frequency according to the pitch information so as to extract the pitch waveform of the input signal. Different pitch analysis methods can be used for the pitch analyser. In the second embodiment of that application, a bank of fixed low-pass filters is used, the filters being connected to peak detectors. Based on the detected peaks, the channel selector adaptively selects a proper channel at each unit time and thus defines a set of pitch marks, which are, to remove irregularities, converted to the pitch information that is used to control the adaptive lowpass filter, which extracts the pitch waveform. In document US 6,470,311 B1 the optimum filter is chosen by passing the largest voice area greater than 50 ms through multiple filters. The average energy output for each filter and the differences between the filter averages (delta energy) are calculated, and the first peak in delta energy above the average delta energy is used to define the optimal filter for filtering the signal.

Such filtering-based approaches mainly provide good time resolution. However, the accuracy of the methods that use pitch detection to set the filter parameters depends heavily on the accuracy, robustness, and time resolution of the pitch detection methods applied, whereas the methods that use lowpass filter arrays tend to have low noise robustness, which can lead to poor performance for noisy speech signals.

In document EP 0 227 858 A1 a band-pass filter is used to extract the pitch waveform, and the pass band is set in accordance with the pitch detected by the pitch detection section. In that application the pitch detection section has a complex structure, which means higher computational complexity of the method. The method of the present invention differs from the above-mentioned solutions in that it features a simple and robust mechanism for setting the adaptive bandpass filter parameters.

Description of the invention

The proposed method of pitch period and voiced/unvoiced segment marking according to this invention includes the following steps:

• the offset is first removed from the input speech signal, which is then filtered using a zero-phase low-pass filter with a cut-off frequency FLP;

• the filtered output speech is used for calculation of the short-time absolute amplitude average of the signal and calculation of the short-time biased autocorrelation sequence, using a sliding window with a variable window size;

• the first two harmonically related peak values of the short-time autocorrelation sequence are searched to define the pitch autocorrelation time lag;

• based on the value of the pitch autocorrelation time lag, the central frequency Fcf of the bandpass filter is defined and filter coefficients are calculated;

• the speech segment for which the pitch autocorrelation time lag was defined is filtered with an adaptive zero-phase bandpass filter using the calculated filter coefficients; the extracted signal is the pitch waveform of the speech segment;

• the short-time absolute amplitude average is calculated to determine voiced segments;

• if pitch autocorrelation time lags exist for the current signal segment and the short-time absolute amplitude average is greater than threshold TRE1, the current speech segment is denoted as voiced; otherwise, the segment is denoted as unvoiced;

• for voiced segments, pitch marks are defined for the pitch waveform at zero crossing points, where these points represent the start and end sample of the pitch waveform period within which the positive and negative peaks appear;

• in the last stage, mapping of the pitch waveform pitch marks onto the pitch marks of the speech signal is performed;

• the pitch marks for unvoiced segments are defined in a rule-based manner; the rules may place the pitch marks at constant time intervals of a selected length, where the selected length is defined as the average length between pitch marks of voiced segments, based on otherwise defined statistical characteristics of distances between pitch marks, or based on any other criteria for pitch mark distance definition.
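The adaptive zero-phase bandpass filtering step above can be sketched in Python. The mapping Fcf = fs / pitch_lag, the Butterworth design, the filter order, and the relative bandwidth are illustrative assumptions; the patent does not prescribe a particular filter family or bandwidth:

```python
import numpy as np
from scipy.signal import butter, filtfilt

def extract_pitch_waveform(segment, fs, pitch_lag, rel_bw=0.6, order=2):
    """Adaptive zero-phase band-pass filtering around the pitch frequency.

    Maps the pitch autocorrelation time lag (in samples) onto the central
    frequency Fcf = fs / pitch_lag.  rel_bw (relative bandwidth) and the
    Butterworth order are hypothetical choices, not from the patent."""
    f_cf = fs / float(pitch_lag)
    lo = f_cf * (1.0 - rel_bw / 2.0)
    hi = f_cf * (1.0 + rel_bw / 2.0)
    b, a = butter(order, [lo / (fs / 2.0), hi / (fs / 2.0)], btype="band")
    # filtfilt applies the filter forward and backward -> zero phase
    return filtfilt(b, a, np.asarray(segment, dtype=float))
```

Filtering a mixture of a 150 Hz "pitch" component and a 900 Hz component with a lag of fs/150 samples leaves the 150 Hz component dominant, which is the extracted pitch waveform.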

The apparatus for the method described above includes:

• an offset removal unit,

• a zero-phase low-pass filter,

• an autocorrelation calculation and peak search unit,

• a short-time absolute amplitude average calculation unit,

• a voiced/unvoiced speech detection unit,

• a voiced speech buffer,

• a bandpass filter coefficient generation unit,

• an adaptive zero-phase bandpass filter,

• a speech polarity detection unit, and

• a pitch marking unit.

The proposed pitch period and voiced/unvoiced speech marking method and apparatus will now be described in more detail with reference to the accompanying figures, which form part of this application and describe the following:

Figure 1 a block scheme of the first embodiment of the pitch marking method

Figure 2 a block scheme of the second embodiment of the pitch marking method

Figure 3 a flowchart of the proposed pitch marking method

Figure 4a an example waveform of the speech signal

Figure 4b the normalised curve of the short-time pitch autocorrelation time lag

Figure 4c the normalised short-time absolute amplitude average of the low-pass filtered speech signal

Figure 5 an example waveform of the voiced speech signal and the pitch waveform extracted from the voiced speech signal by the adaptive zero-phase band-pass filter

Figure 6 an example of a waveform of the voiced speech signal

Figure 7 an example of a biased autocorrelation sequence of the voiced speech signal

Figure 8 an example of a transition from voiced to unvoiced speech signal

Figure 9 an example of a biased autocorrelation sequence

Figure 10 an example of a transition from a voiced to an unvoiced speech signal

Figure 11 an example of a biased autocorrelation sequence

Figure 12 an example of the waveform of a speech signal with depicted pitch marks

Figure 13 an example of the waveform of a speech signal with depicted pitch marks

Figure 14 an example of the waveform of a speech signal with depicted pitch marks

Figure 15 the apparatus according to the present invention

The exemplary embodiments of the present invention are described in detail hereinafter with reference to the accompanying drawings. The same reference numbers are used throughout the drawings when referring to the same or like parts. In the following detailed description, only exemplary embodiments of the invention are shown and described. Therefore, the drawings and descriptions are to be regarded as illustrative in nature and not restrictive.

Figure 1 illustrates a block diagram of the proposed pitch marking method (embodiment I). The proposed pitch marking method starts with step 1010, in which the offset is removed from the input speech signal. The resulting speech signal is filtered by a zero-phase low-pass (ZPLP) filter in step 1020. In step 1030, the output low-pass filtered speech signal is processed by a correlation module, which consists of short-time autocorrelation calculation in step 1031 and a search for time lags of autocorrelation peaks in step 1032 to define the pitch autocorrelation time lag of the speech signal.
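The first two processing steps (offset removal in step 1010 and zero-phase low-pass filtering in step 1020) can be sketched as follows; the Butterworth design, the filter order, and the example cut-off value are assumptions, since the patent requires only that the cut-off lie above the highest expected pitch:

```python
import numpy as np
from scipy.signal import butter, filtfilt

def preprocess(signal, fs, f_lp=600.0, order=4):
    """Offset removal followed by zero-phase low-pass filtering.

    f_lp is a hypothetical cut-off (Hz); it must exceed the highest pitch
    expected in the input.  filtfilt gives the zero-phase response."""
    x = np.asarray(signal, dtype=float)
    x = x - np.mean(x)                          # step 1010: remove offset
    b, a = butter(order, f_lp / (fs / 2.0), btype="low")
    return filtfilt(b, a, x)                    # step 1020: ZPLP filter
```

Applied to a 120 Hz tone riding on a DC offset, the output keeps the tone but has (near-)zero mean and no phase shift.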

The signal resulting from step 1020 is input to step 1060 (speech buffer) and step 1045, where a short-time absolute amplitude average is calculated. In step 1050 voiced/unvoiced speech detection is performed, where the signal from step 1032 is also used. For voiced segments, the output from step 1050 is led to step 1060 (speech buffer) and to step 2080. The output from step 1032 is also led to step 2080, where the filter coefficients are calculated. The filter coefficients calculated in step 2080 are used in a passband zero-phase filter with central frequency Fcf in step 2070. The result of filtering in step 2070 is a generated pitch waveform (step 2090). The generated pitch signal is led to the pitch marking step 1111, to which a signal with a removed offset is also conveyed. From step 2090 and step 1010, both signals with a detected polarity (defined in step 2100) are led to step 1110, which, in addition to step 1111 for voiced speech segment marking, also includes step 1112. In step 1112, pitch marking of unvoiced speech segments is carried out using the detected marks of unvoiced segments, performed in 1050, and the signal with offset removed. The processing results of steps 1050 and 1110 are written in the set of marks in step 1120. The processing result from step 1050 is written into the set of marks of voiced/unvoiced segments in step 1122. From steps 1111 and 1112, the processing results are written in a set of pitch marks of voiced and unvoiced speech segments in step 1121. The speech signal is processed in steps 1111, 1112 and 1050. The results of this processing are marks. The marks may denote voiced/unvoiced speech segments, as defined in step 1050, or pitch periods of the speech signal. The latter are defined for voiced speech in step 1111, and for unvoiced speech in step 1112.
Step 1120 does not include processing of signals, but combines the results of processing steps 1111 and 1112 that are marks of voiced/unvoiced segments of speech (step 1122) and the definition of pitch marks for voiced and unvoiced speech segments (step 1121).

The pitch marking method according to the second embodiment is depicted in Figure 2. Step 2080 for filter coefficient generation is replaced with step 2082, which includes a set of predefined coefficients for a set of bandpass filters, where the set of predefined filter coefficients is input to step 2081, which includes a module for filter coefficient selection. The output of step 2081 is the set of coefficients of the selected filter, which is one of the two inputs to step 2070. With this, better computational efficiency of the proposed method is achieved, as less calculation is required to define the filter coefficients for the current speech segment.
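The second embodiment's lookup of predefined coefficients can be sketched as follows; the set of central frequencies, the band edges, and the nearest-centre selection rule are illustrative assumptions about how such a bank might be built:

```python
import numpy as np
from scipy.signal import butter

def build_filter_bank(fs, centres):
    """Precompute band-pass coefficients for a set of central frequencies
    (step 2082: coefficients are stored, not recomputed per segment)."""
    bank = {}
    for fc in centres:
        lo, hi = 0.7 * fc, 1.3 * fc            # hypothetical band edges
        bank[fc] = butter(2, [lo / (fs / 2.0), hi / (fs / 2.0)], btype="band")
    return bank

def select_coefficients(bank, pitch_lag, fs):
    """Step 2081: pick the predefined filter whose centre frequency is
    nearest to Fcf = fs / pitch_lag."""
    f_cf = fs / float(pitch_lag)
    nearest = min(bank, key=lambda fc: abs(fc - f_cf))
    return bank[nearest]
```

Only a dictionary lookup and one comparison pass are needed per segment, which is the computational saving the second embodiment targets.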

Figure 3 illustrates the flowchart of the pitch period and the voiced/unvoiced speech marking method. The START key initiates the method, in which in the first step 1010 the offset is removed from the input speech signal 1000. In step 1020, the resulting speech signal from step 1010 is filtered by a zero-phase low-pass (ZPLP) filter with cutoff frequency FLP, where the value of FLP is set above the highest pitch value expected in the input signal. In the case of low frequency noise, the input low-pass filter with cutoff frequency FLP may be replaced by a zero-phase bandpass filter to remove the low frequency noise and to increase the robustness of the proposed method. The output low-pass filtered speech signal from step 1020 is processed by a correlation module 1030, which consists of short-time autocorrelation calculation in step 1031 and a search for time lags of autocorrelation peaks in step 1032. The short-time autocorrelation module calculates a biased autocorrelation sequence for the sliding window W of variable window size over the whole input signal. For each time instant, an initial long-time window WLT is first used to calculate the autocorrelation sequence from step 1031. The obtained autocorrelation sequence is normalised to value 1. The normalised autocorrelation sequence from step 1031 represents the input to the autocorrelation peak search module 1032, which searches for the first two harmonically related peaks in the autocorrelation sequence. The peaks are defined using threshold criteria. Three thresholds are defined: TRA1, TRA2 and TRA3; in Figures 7, 9 and 11, shown as an illustrative example, the thresholds are set to values TRA1=0.3, TRA2=0.2 and TRA3=0.38. The threshold values TRA1, TRA2 and TRA3 are defined experimentally and can differ for different acoustic environments. Considering the predefined thresholds, three different situations may occur.
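A minimal sketch of the biased, normalised short-time autocorrelation and the lag search follows. The search bounds f_min and f_max are assumptions, and this sketch returns the lag of the largest peak in that range rather than applying the full two-peak threshold logic of steps 1032-1042:

```python
import numpy as np

def pitch_lag(frame, fs, f_min=60.0, f_max=400.0):
    """Estimate the pitch autocorrelation time lag of one analysis window.

    Computes the biased autocorrelation, normalises it to 1 at lag 0, and
    returns the lag (in samples) of the largest peak inside a plausible
    pitch range, or None.  f_min/f_max are illustrative bounds."""
    x = np.asarray(frame, dtype=float) - np.mean(frame)
    n = len(x)
    r = np.correlate(x, x, mode="full")[n - 1:]   # biased autocorrelation
    if r[0] <= 0:
        return None                               # silent frame
    r = r / r[0]                                  # normalise to value 1
    lo, hi = int(fs / f_max), int(fs / f_min)
    seg = r[lo:hi]
    if seg.size == 0:
        return None
    return lo + int(np.argmax(seg))
```

For a 200 Hz tone at 16 kHz the estimated lag is the true 80-sample period.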
In the first, both of the first two harmonically related autocorrelation peaks are above the corresponding thresholds: the first peak is above TRA1 and the second peak is above TRA2 (Fig. 7). In this case, the signal segment on which the autocorrelation sequence was calculated tends to be voiced (Fig. 6). In the second, only the first peak exceeds the corresponding thresholds TRA1 and TRA3 (Fig. 9), while the second peak is below threshold TRA2. The signal segment on which the autocorrelation sequence was calculated tends to be partially voiced (Fig. 8). In the third, none of the first two harmonically related autocorrelation peaks exceed the corresponding thresholds TRA1 and TRA2 (Fig. 11). The signal segment on which the autocorrelation sequence was calculated tends to be unvoiced (Fig. 10). If the peaks exceed the corresponding thresholds, the autocorrelation time lag of the first peak is defined as the time lag of the long-time window WLT 1033 (Fig. 3). If in step 1034 an autocorrelation time lag of WLT exists, a new length of the analysis window WN (short-time window) is defined to improve the time resolution of the method. The length of the short-time window WN is defined as a multiple of the autocorrelation time lag of the long-time window WLT in step 1035. The autocorrelation sequence is calculated again using the short-time window WN in step 1036. Next, a search for the first two harmonically related short-time autocorrelation peaks is performed in step 1037. If peaks exist and exceed thresholds TRA1 and TRA2, or if only the first peak exceeds the corresponding threshold TRA3 while the second peak is below threshold TRA2, the time lag of the first autocorrelation peak is denoted in step 1038 as the autocorrelation time lag of the short-time window WN. If none of the peaks exceed the thresholds, the autocorrelation time lag of the short-time window WN is not defined.
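The three-threshold decision on the first two harmonically related peaks can be written directly; the default threshold values are the illustrative ones given in the text (TRA1=0.3, TRA2=0.2, TRA3=0.38):

```python
def classify_peaks(p1, p2, tra1=0.3, tra2=0.2, tra3=0.38):
    """Classify a segment from its first two harmonically related
    autocorrelation peak values p1 and p2.

    Returns 'voiced', 'partially_voiced' or 'unvoiced', following the
    three situations described in the flowchart."""
    if p1 > tra1 and p2 > tra2:
        return "voiced"            # both peaks above their thresholds
    if p1 > tra3 and p2 <= tra2:
        return "partially_voiced"  # only the first peak is strong
    return "unvoiced"              # neither condition met
```

For example, peak values (0.5, 0.3) classify as voiced, (0.45, 0.1) as partially voiced, and (0.2, 0.1) as unvoiced.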
If an autocorrelation time lag of the short-time window WN exists, it is compared to the calculated autocorrelation time lag of the initial long-time window WLT in step 1039. If in step 1040 the difference between the time lags is less than the predefined threshold TRL1, the autocorrelation time lag of the short-time window WN is denoted in step 1041 as a pitch autocorrelation time lag for the current window; otherwise, the autocorrelation time lag of the long-time window WLT is denoted as a pitch autocorrelation time lag of the current window in step 1042. If the autocorrelation time lag of the short-time window WN does not exist, the autocorrelation time lag of the long-time window WLT is defined as the pitch autocorrelation time lag. If no peaks of the autocorrelation sequence for the long-time window WLT exceed the thresholds, no pitch autocorrelation time lag is defined for the current window. The method continues from steps 1041 and 1042, as well as from step 1034 (for cases where no pitch autocorrelation time lag for the long-time window WLT exists), in step 1045.

In addition to the autocorrelation calculation, the short-time absolute amplitude average is calculated in step 1045 using the low-pass filtered signal as input. The voiced/unvoiced speech detection module in step 1050 uses the pitch autocorrelation time lags defined in steps 1041 or 1042, the short-time absolute amplitude average calculated in step 1045, and the corresponding threshold TRE1 to denote voiced/unvoiced segments. If pitch autocorrelation time lags exist for the current segment and the short-time absolute amplitude average values exceed threshold TRE1 for the duration of the whole segment, the segment is denoted as voiced. If pitch autocorrelation time lags do not exist for the current segment, or the short-time absolute amplitude average values are below threshold TRE1, or both, the current segment is denoted as unvoiced. A set of voiced/unvoiced segment marks is defined using the information about the existing pitch autocorrelation time lags and the values of the short-time absolute amplitude average of the low-pass filtered speech signal that exceed threshold TRE1 (step 1050). Using the defined set of voiced/unvoiced segment marks (step 1122), the current speech segment is defined (step 1023). Fig. 4a illustrates an example waveform of a speech signal, Fig. 4b illustrates the normalised curve of the short-time pitch autocorrelation time lag, and Fig. 4c shows the normalised short-time absolute amplitude average of the low-pass filtered speech signal with the denoted average absolute amplitude threshold TRE1. The vertical solid lines in all three figures (4a, 4b and 4c) indicate the positions of the defined voiced/unvoiced segment marks.
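The voiced/unvoiced decision combines the two criteria above. A minimal sketch, assuming a moving-average window for the short-time absolute amplitude average and `None` for an undefined lag (both assumptions of this example; the text defines only the criteria and threshold TRE1):

```python
import numpy as np

def short_time_abs_avg(x, win):
    """Short-time absolute amplitude average of the (low-pass filtered)
    signal, computed here as a moving average of |x| over `win` samples."""
    x = np.abs(np.asarray(x, dtype=float))
    return np.convolve(x, np.ones(win) / win, mode="same")

def mark_segment(pitch_lags, abs_avg, TRE1):
    """A segment is voiced only if a pitch autocorrelation time lag exists
    for every analysis window AND the amplitude average exceeds TRE1 for
    the duration of the whole segment; otherwise it is unvoiced."""
    has_lags = all(lag is not None for lag in pitch_lags)
    loud_enough = bool(np.all(np.asarray(abs_avg) > TRE1))
    return "voiced" if (has_lags and loud_enough) else "unvoiced"
```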

If the current speech segment is voiced (step 1024), it is written into a speech buffer (step 1060), and the information about the pitch autocorrelation time lags of voiced segments is conveyed to the filter coefficient generation module in step 2080, where the values of the pitch autocorrelation time lags are used to define the central frequencies Fcf of the bandpass filter, based on which a set of filter coefficients is calculated. Here, a mapping function from the pitch autocorrelation time lag to the central frequency Fcf is used to map the value of the pitch autocorrelation time lag onto the value of the central frequency Fcf. The adaptive zero-phase bandpass filter in step 2070 uses the generated set of filter coefficients to extract the pitch signal from the low-pass filtered signal. Fig. 5 shows an example waveform of a voiced speech signal and the pitch waveform extracted from it using the adaptive zero-phase bandpass filter. The next step is the pitch marking procedure (step 1110, Fig. 1). First, for voiced speech segments, pitch marking of the extracted pitch signal is performed in the pitch marking module in step 1111, using half-period detection and denoting the pitch marks of the pitch waveform at the zero-crossing points of the pitch waveform as the start and end sample of the pitch period within which the positive and negative peaks appear. The pitch period denoted by the pitch marks defines the pitch mark region. Mapping of the pitch period pitch marks onto the pitch marks of the speech signal is performed next (step 1121). Three pitch marking approaches are possible here:

a) pitch marks are placed at negative peaks of the speech period (Fig. 12),

b) pitch marks are placed at positive peaks of the speech period (Fig. 13),

c) pitch marks are denoted as the start and end sample of the speech period within which the positive and negative peaks appear (Fig. 14).
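Variants a) and b) of the mapping can be sketched as follows. This is an illustrative sketch only: the sign-change zero-crossing detector, the grouping of two half periods into one full period, and all function names are assumptions made for the example; the text specifies only that marks are set at zero crossings of the pitch waveform and mapped onto peaks of the speech signal.

```python
import numpy as np

def zero_crossings(p):
    """Indices where the pitch waveform changes sign (half-period bounds)."""
    s = np.signbit(np.asarray(p, dtype=float))
    return np.nonzero(s[:-1] != s[1:])[0] + 1

def map_marks_to_peaks(speech, pitch_wave, polarity="negative"):
    """Within each full pitch period (two consecutive half periods of the
    pitch waveform), place the speech-signal pitch mark at the extreme
    sample: variant a) negative peaks, variant b) positive peaks."""
    zc = zero_crossings(pitch_wave)
    marks = []
    # a full period spans two half periods, i.e. zc[i] .. zc[i+2]
    for a, b in zip(zc[:-2:2], zc[2::2]):
        seg = np.asarray(speech[a:b])
        if polarity == "negative":
            marks.append(a + int(np.argmin(seg)))
        else:
            marks.append(a + int(np.argmax(seg)))
    return marks
```

Variant c) would simply keep the period bounds `(a, b)` themselves as the start and end marks instead of searching for a peak.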

In Figures 12, 13, and 14, the arrows indicate the mapping of the pitch marks of the pitch waveform onto the pitch marks of the speech signal. When peaks are marked as pitch marks, speech polarity detection in step 2100 (Fig. 1) defines whether positive or negative peaks should be marked.

If the current segment is unvoiced, the pitch marking algorithm for unvoiced segments defines pitch marks using a rule-based approach in step 1112. The rules applied may place the pitch marks of unvoiced speech segments at constant predefined time intervals of predefined lengths, at intervals equal to the average pitch mark length of the voiced segments, or at intervals based on otherwise defined statistical characteristics of voiced speech pitch lengths. If in step 1025 the last segment of the input speech signal has been processed, the method ends; otherwise the method continues with pitch marking in step 1123 for the next segment of the input speech signal.

Fig. 5 illustrates an example waveform of the voiced speech signal and the pitch waveform extracted from the voiced speech signal by the adaptive zero-phase bandpass filter.

Fig. 6 illustrates an example of a waveform of the voiced speech signal.

Fig. 7 illustrates an example of a biased autocorrelation sequence of the voiced speech signal with a first peak value P1 greater than threshold TRA1 and a second peak value P2 greater than threshold TRA2, where the autocorrelation is calculated for a voiced segment of length WLT of the speech signal illustrated in Fig. 6.

Fig. 8 illustrates an example of a transition from voiced to unvoiced speech signal.

Fig. 9 illustrates an example of a biased autocorrelation sequence, with a first peak value P1 greater than thresholds TRA1 and TRA3, and a second peak value P2 less than threshold TRA2, calculated for the example of the transition from a voiced to an unvoiced speech segment of length WLT shown in Fig. 8.

Fig. 10 illustrates an example of a transition from a voiced to an unvoiced speech signal - a major part of the signal is unvoiced.

Fig. 11 illustrates an example of a biased autocorrelation sequence with no peaks greater than thresholds TRA1 and TRA2, calculated for the example of the transition from a voiced to an unvoiced speech segment of length WLT shown in Fig. 10.

Fig. 12 illustrates an example of the waveform of a speech signal with depicted pitch marks (vertical dashed lines), where the pitch marks are denoted as negative peaks. The arrows indicate the mapping of pitch waveform pitch marks set at the zero crossing points of the pitch period waveform onto the points of the negative peaks of the speech signal.

Fig. 13 illustrates an example of the waveform of a speech signal with depicted pitch marks (vertical dashed lines), where the pitch marks are denoted as positive peaks. The arrows indicate the mapping of pitch waveform pitch marks set at the zero crossing points of the pitch period waveform onto the points of the positive peaks of the speech signal.

Fig. 14 illustrates an example of the waveform of a speech signal with depicted pitch marks (vertical dashed lines), where the pitch marks are denoted as the start and end sample of the speech period within which the positive and negative peaks appear. The arrows indicate the mapping of pitch waveform pitch marks set at the zero crossing points of the pitch period waveform onto the zero crossing points of the speech signal.

The apparatus for pitch marking (Figure 15), which carries out the method according to the present invention, consists of an offset removal unit 1010a, a zero-phase low-pass filter 1020a, an autocorrelation calculation and peak search unit 1030a, a short-time absolute amplitude average calculation unit 1045a, a voiced/unvoiced speech detection unit 1050a, a voiced speech buffer 1060a, a bandpass filter coefficient generation unit 2080a, an adaptive zero-phase bandpass (ZPBP) filter 2070a, a pitch marking unit 1110a with voiced speech marking 1111a and unvoiced speech marking 1112a, and a speech polarity detection unit 2100a.

The final result of the proposed algorithm is the set of marks defined in step 1120 (Fig. 1). A set of pitch marks for voiced/unvoiced speech is defined in step 1121, and a set of voiced/unvoiced segment marks is defined in step 1122.

In order to reduce the computational complexity of the algorithm, in the second embodiment the filter coefficient generation in step 2080 is replaced by a bank of predefined zero-phase bandpass filter coefficients (step 2082, Fig. 2) and a bandpass filter coefficient selection (step 2081). Data about voiced/unvoiced speech detection from step 1050 and the pitch autocorrelation time lags from step 1032 are used as criteria for the selection of the corresponding filter coefficients from the bank of predefined bandpass filters in step 2081. The main purpose of the present invention is to provide a means for robust and accurate calculation of speech pitch marks with high time resolution.
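The filter-bank selection of the second embodiment amounts to a nearest-centre-frequency lookup. A minimal sketch, assuming the bank is a mapping from centre frequency to precomputed coefficients and that the pitch frequency is fs divided by the pitch autocorrelation time lag (the representation of the bank is an assumption of this example):

```python
def select_bank_filter(bank, pitch_lag, fs):
    """Pick the predefined bandpass filter whose centre frequency is
    closest to the pitch frequency fs / pitch_lag.
    `bank` maps centre frequency (Hz) -> precomputed coefficients."""
    fcf = fs / pitch_lag
    best = min(bank, key=lambda f: abs(f - fcf))
    return best, bank[best]
```

With a bank spaced at 100, 150 and 200 Hz and fs = 8000 Hz, a pitch lag of 40 samples (200 Hz) selects the 200 Hz filter, and a lag of 57 samples (about 140 Hz) selects the 150 Hz filter.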

As the pitch of the speech signal changes over time, a time-varying filter should be applied to extract the changing pitch waveform. Calculating the coefficients of an adaptive filter can require a time-consuming, complex iterative algorithm. The goal of the filtering used in this invention is to extract the pitch waveform from speech using appropriately defined coefficients of an adaptive zero-phase bandpass filter with central frequency Fcf.
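Zero-phase bandpass filtering around the pitch frequency can be sketched with forward-backward filtering, which cancels the filter's phase delay so the extracted pitch waveform is not shifted relative to the speech signal. This is an illustrative sketch only: the Butterworth design, the relative bandwidth parameter, and the mapping Fcf = fs / pitch_lag are assumptions of this example, not the patented coefficient design.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def extract_pitch_waveform(x, fs, pitch_lag, rel_bw=0.6, order=2):
    """Extract the pitch waveform by zero-phase bandpass filtering around
    the pitch frequency. Forward-backward filtering (filtfilt) yields zero
    phase, so pitch-mark positions are not displaced by the filter."""
    fcf = fs / pitch_lag                  # assumed centre-frequency mapping
    lo = fcf * (1 - rel_bw / 2)
    hi = fcf * (1 + rel_bw / 2)
    b, a = butter(order, [lo, hi], btype="bandpass", fs=fs)
    return filtfilt(b, a, np.asarray(x, dtype=float))
```

Applied to a mixture of a 200 Hz fundamental and a 1000 Hz harmonic at fs = 8000 Hz with pitch_lag = 40, the output retains the 200 Hz component and suppresses the 1000 Hz component.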

The following advantages are achieved with solutions in line with the invention. The proposed method exhibits high noise robustness and high time resolution, without windowing effects or averaging in the pitch marking and voiced/unvoiced segment marking of the speech signal.

The proposed method and apparatus according to the present invention can be used in the areas of pitch-synchronous speech enhancement, clinical diagnosis, speech coding, automatic phonetic segmentation, pitch synchronous speech analysis and processing, speaker characterisation, voice conversion, speech recognition and speech synthesis.