


Title:
VOICE PACKET SUBSTITUTION SYSTEM AND METHOD
Document Type and Number:
WIPO Patent Application WO/1992/015987
Kind Code:
A1
Abstract:
A pitch estimate (212) transmitted with voice packets (210) is used at the destination device (10) to determine which stored voice data to substitute for missing or corrupted voice packets. An error check value (214) based on the most significant bits of the transmitted digital voice representation allows the destination device (10) to determine if the received voice representation meets a minimum quality standard. If not, a substitute representation is made based on a received pitch estimate (212).

Inventors:
BERKEN JAMES J (US)
TAYLOR MARK (US)
WANG HUIYU (US)
ODLYZKO PAUL (US)
Application Number:
PCT/US1992/001214
Publication Date:
September 17, 1992
Filing Date:
February 13, 1992
Assignee:
MOTOROLA INC (US)
International Classes:
H04B1/66; (IPC1-7): G10L9/00
Foreign References:
US5073940A1991-12-17
US4907277A1990-03-06
Other References:
IEEE TRANSACTIONS ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING, Volume ASSP-34, Number 6, December 1986, GOODMAN et al., column 1 to column 2, line 14 on page 1440 and Section II on page 441.
Claims:
1. A device (10) for transmitting voice packets comprising: (a) means (102) for monitoring consecutive voice samples at a transmitting device; (b) means (104) for calculating at the transmitting device pitch estimates based on said monitored voice samples; and (c) means (22) for transmitting voice packets including digital values corresponding to voice samples and the most recently calculated pitch estimate.
2. The device of claim 1 further comprising means (108) for calculating an error check value based at least on the voice samples of each voice packet and transmitting said error check value with each transmitted voice packet.
3. The device of claim 2 wherein said error check value calculating means (108) calculates an error check value based on a predetermined number of the most significant digits representing said voice samples, said predetermined number of digits being less than all digits representing said voice samples.
4. A device (10) for receiving voice packets and reproducing a voice signal comprising: (a) means (120, 124) for receiving voice packets each containing a pitch estimate and a digital representation of a predetermined number of voice samples; (b) means (136) for storing at least the most recent of said pitch estimates and a predetermined number of digital representations in memory (142); (c) means (130, 134) for determining if a digital representation in a voice packet meets a predetermined reception accuracy standard; (d) means (140, 158) for identifying a set of the stored digital representations based on said stored pitch estimate that are likely to have sound qualities similar to digital representations that did not meet said standard; and (e) means (152) for substituting said set of digital representations for the digital representations that did not meet said standard in order to improve the quality of the reproduced voice.
5. The device of claim 4 further comprising means (152) for storing said set of digital representations in memory (142) in place of said digital representations that did not meet the predetermined quality standard and deleting the digital representations corresponding to the oldest packet from memory.
6. The device of claim 4 wherein said standard determination means (130, 134) includes means (130) for receiving an error check value (from 128) based on digital representations for each packet and determining if the received digital representations (from 122) correspond exactly to the received check value (from 128).
7. A method for receiving voice packets and reproducing a voice signal comprising the steps of: (a) receiving voice packets (at steps 60-64) each containing a digital representation of a predetermined number of voice samples and an error check value based on a predetermined number of the most significant bits of said digital representation where said number is less than all bits of said digital representation; (b) storing (at step 66) a predetermined number of digital representations in memory; (c) determining (at step 64) if said digital representations in each voice packet meet a predetermined reception accuracy standard using said check value; (d) upon receiving a digital representation in a voice packet that does not meet said predetermined reception accuracy standard, using a pitch estimate to identify (at step 78) a set of the stored digital representations that are likely to have sound qualities similar to the digital representation that did not meet said standard; and (e) substituting (at step 80) said set of digital representations for the digital representations that did not meet said standard in order to improve the quality of the reproduced voice (at step 82), whereby only an error in the most significant bits of the digital representation will cause a substitution to be made.
8. A device (10) for receiving voice packets and reproducing a voice signal comprising: (a) means (120, 124) for receiving voice packets each containing a digital representation of a predetermined number of voice samples and an error check value based on a predetermined number of the most significant bits of said digital representation where said number is less than all bits of said digital representation; (b) means (144) for storing a predetermined number of digital representations in memory (142); (c) means (130, 134) for determining if said digital representations in each voice packet meet a predetermined reception accuracy standard using said check value; (d) means (140, 158) for identifying a set of the stored digital representations that are likely to have sound qualities similar to the digital representation that did not meet said standard based on a pitch estimate; and (e) means (152) for substituting said set of digital representations for the digital representations that did not meet said standard in order to improve the quality of the reproduced voice, whereby only an error in the most significant bits of the digital representation will cause a substitution to be made.
9. The device of claim 8 further comprising means (152) for storing said set of digital representations in memory (142) in place of said digital representations that did not meet said standard and deleting the digital representations corresponding to the oldest packet from memory.
10. The device of claim 8 further comprising means (136) for receiving said pitch estimate which is transmitted as part of each packet.
Description:
VOICE PACKET SUBSTITUTION SYSTEM AND METHOD

Field of the Invention

This invention pertains to a voice technique for substituting missing or substantially corrupted voice segments. The present invention is especially suited to, but not limited to, wireless packet systems in which burst-type packet errors occur and retransmission of voice packets cannot be used.

Background of the Invention

In a wireless fast voice/data packet environment, the need to handle missing or corrupted voice segments stems from the desire for high-quality voice reception in an RF environment where interference is likely to be encountered. Missing or incorrectly received data packets can be retransmitted in many data systems; however, this cannot be done for voice packets without introducing unacceptable delays into the system. An occasional voice packet error will normally have only a limited impact on sound quality. However, as the rate of missing or corrupted voice packets rises, noise in the form of audible clicks and static increases.

Pitch estimation is a technique in which the pitch of voiced speech is estimated. In a conventional packet network that carries voice, a destination device may use received voice packets to make pitch estimates. The destination device can use such estimates to create a voice segment to be substituted for a missing voice packet. Such a substitution normally yields better perceived audio quality than using silence in place of the missing voice segment. In these systems the destination device learns the position of missing packets from time stamps and/or sequence numbers in the headers of correctly received packets. If the destination device calculates a "poor" pitch estimate from a corrupted signal, the performance of the substitution system, and thereby the voice quality, is degraded. In the article by D.J. Goodman et al., "Waveform Substitution Techniques for Recovering Missing Speech Segments in Packet Voice Communications", IEEE Transactions on Acoustics, Speech and Signal Processing, Vol. ASSP-34, No. 6, December 1986, a technique is discussed for using pitch estimates calculated at the receiver to replace missing speech segments with waveform segments from correctly received packets. It is also known in the prior art to distinguish between voiced and unvoiced speech, and to generate a CRC.

There exists a need for a system and a method that provides improved voice quality when missing voice packets occur and when corrupted voice packets are encountered.

Brief Description of the Drawings

FIG. 1 is a block diagram of a voice packet origination and destination node in accordance with the present invention.

FIG. 2a illustrates a voice packet format used in a TDMA time slot.

FIG. 2b illustrates a TDMA frame format consisting of a plurality of time slots.

FIG. 3 illustrates a voice waveform in which a pitch estimate has been made.

FIG. 4 is a flow diagram illustrating steps in accordance with the present invention for transmitting a voice packet which includes a pitch estimate.

FIG. 5 is a flow diagram illustrating steps in accordance with the present invention for receiving voice packets and making substitutions based on previously received pitch estimates.

FIG. 6 is a block diagram of an embodiment for generating a pitch estimate and incorporating it into a packet at an origination node.

FIG. 7 is a block diagram of an embodiment for receiving a packet in accord with the present invention.

Detailed Description of Invention

A transmitted pitch estimate in accordance with the present invention improves voice quality by allowing a better selection of stored voice samples to substitute for a corrupted or missed voice packet. This packet substitution technique is best suited to voice packet loss rates between 0.5 and 10 percent. Below about 0.5 percent, voice quality is impaired, but not normally to a level that many people would find objectionable. At or above ten percent, unusable packets are so frequent that substitution techniques have difficulty maintaining an acceptable voice quality level. Using the pitch estimate of previously received packets to select suitable prior voice samples for substitution relies on the pitch of consecutive pitch periods (or cycles) being stationary relative to the packet transmission rate. Such substitution produces better sound quality than leaving silent gaps in place of unusable or lost voice packets.

In FIG. 1 a communication controller 10 includes a transceiver device 22 and a microprocessor 24 with associated read only memory 26 and random access memory 28. The transceiver 22 includes one or more antennas 12 for RF communication. The communication controller is connected by wire 14 to an analog to digital and digital to analog converter 20, which is connected by wire 16 to a telephone 18 that receives and transmits voice data.

The communication controller 10 both transmits and receives voice packets within a packet network. When originating or transmitting voice information, the controller receives analog signals from the telephone 18, and the A/D converter 20 converts these signals to a digital representation. The MPU 24, under the control of an operational program stored in ROM 26 and RAM 28, processes these digital signals into a packet format that includes the digital representations of the voice. The packet containing these digital representations is transmitted by the RF transmitter portion of the transceiver 22 as an RF signal over antenna 12 to another node in the packet network.

In the receiving mode, a packet signal is received by antenna 12 and demodulated by the RF receiver portion of the transceiver 22. The MPU 24, operating in accordance with ROM 26 and RAM 28, processes the received digital packet information to reconstruct the voice information. The digital to analog converter 20 converts the recovered digital representations back into analog voice information and provides it to telephone 18, which the user listens to. It will be apparent to those skilled in the art that the packet transmission rate limits the number of different voice channels in a real-time voice transmission system.

In FIG. 2a each voice packet 210 is segmented to contain a packet header 211, pitch estimate 212, voice data information 213, and an error check value or CRC 214. The packet header 211 includes relevant packet information such as the packet length, the address, and a CRC on the header. The communication controller uses the packet header to route voice packet information to the correct destination device. A number of voice samples must have occurred before the pitch estimate 212 can be calculated. When not enough samples have occurred to calculate a pitch estimate, the transmitter may set the pitch estimate to an arbitrary predetermined value, such as zero. After the pitch is estimated, it is inserted into the voice packet 210, and an error checking value such as a CRC 214 is generated and appended to the voice packet 210.

A CRC 214 is used to protect bits in each of the digital representations in the voice packet 210 and in the pitch estimate 212. A significant discovery resulted from this invention: test listeners found the audio quality of recovered speech more acceptable when errors occurred in the least significant bits of the digital speech representation than when errors occurred in the most significant bits. This indicated that the most significant bits have a larger impact on voice quality than the least significant bits. The voice quality experiment also indicated that with all bits protected, voice quality declined and the listener heard "chopped" or "slurred" voice, due to the more frequent packet substitutions dictated by CRC failure. This happens because, when all bits are protected, any single bit error causes a packet to be substituted at the destination device. It must be remembered that substituted voice data represents only a best guess at what the original voice data would have been. Thus, a CRC 214 that protects some but not all of the possible bits implements a predetermined reception accuracy standard. The advantage of this discovery is that only a portion of the bits, namely the most significant ones, needs to be protected to make the system sensitive enough to detect the corrupted voice signals with the largest impact. This minimizes the number of packet substitutions triggered by a CRC failure and maintains high voice quality, since a substituted voice signal is used only when needed. For example, where 8 bits are used for one voice sample, protecting the 4 most significant bits with a CRC results in better voice quality than protecting all 8 bits.

FIG. 2a illustrates a voice packet 210 which is transmitted during time slot number two in a TDMA system consisting of a plurality of time slots within each frame 200 as illustrated in FIG. 2b.

An originating node in accordance with the present invention generates pitch estimates 212 based on prior voice packets. The pitch estimate 212 is transmitted with each voice packet 210 along with a voice data field 213 which includes digital representations of a number of voice samples. Thus, a continuous voice analog voltage waveform such as shown in FIG. 3 is divided into a plurality of consecutive intervals as designated by the marks along the time axis in FIG. 3. These intervals each correspond to different packets appearing in a selected time slot on consecutive frames 200. As shown in FIG. 2b a plurality of different voice signals can be simultaneously transmitted during a TDMA frame 200, i.e. each time slot functions as a separate channel.

It is an important aspect of the present invention that the pitch estimate 212 is calculated at the origination node. This is advantageous since the pitch estimate is made based upon consecutive uncorrupted voice samples which results in a more accurately determined pitch estimate than could be made if calculated based upon received information which may contain inaccuracies.

FIG. 3 illustrates a voice waveform divided into consecutive intervals as indicated by the marks along the time axis. In the illustrative example each interval consists of 16 samples. One known method of determining pitch is to detect the positive and negative peaks of the speech signal. Peak detectors that use center clipping with a threshold can provide a voiced/unvoiced classification. In the event of unvoiced speech, the origination node forces the pitch estimate to equal 1 interval, which causes the receiver to use the previously received packet in place of the corrupted or missed packet. The pitch estimate 212 is calculated by measuring the elapsed time, in units of voice samples, between consecutive significant positive peaks and between consecutive significant negative peaks of the waveform. Because the speech pitch is unrelated to the packet transmission rate, a sufficient number of packet voice samples must be stored in memory to accommodate the normal pitch period found in human speech.
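A peak-spacing pitch estimator of the kind just described might look like the following minimal sketch. The clipping threshold (`clip_ratio=0.3` of the peak amplitude), the rule "no repeating peaks means unvoiced", and all function names are assumptions made for illustration; the patent leaves the pitch calculation method to known art. The signal is center-clipped, the strongest sample in each positive and each negative lobe is taken as a peak, and the pitch is the average spacing between consecutive peaks of the same polarity.

```python
import math
from typing import List

INTERVAL = 16  # samples per packet in the FIG. 3 example


def _run_peaks(clipped: List[float], sign: int) -> List[int]:
    """Index of the extreme sample in each contiguous run of one polarity."""
    peaks: List[int] = []
    start = None
    for i, v in enumerate(clipped + [0.0]):  # trailing 0.0 closes a final run
        if v * sign > 0:
            if start is None:
                start = i
        elif start is not None:
            peaks.append(max(range(start, i), key=lambda j: clipped[j] * sign))
            start = None
    return peaks


def estimate_pitch(samples: List[float], clip_ratio: float = 0.3,
                   interval: int = INTERVAL) -> int:
    """Pitch in samples from the peak spacing of a center-clipped signal."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return interval                      # silence: treated as unvoiced
    thr = clip_ratio * peak
    # Center clipping: zero everything inside +/-thr, shift the rest toward 0.
    clipped = [s - thr if s > thr else s + thr if s < -thr else 0.0
               for s in samples]
    gaps: List[int] = []
    for sign in (1, -1):                     # positive peaks, then negative
        idx = _run_peaks(clipped, sign)
        gaps += [b - a for a, b in zip(idx, idx[1:])]
    if not gaps:
        return interval                      # no repeating peaks: unvoiced
    return round(sum(gaps) / len(gaps))


def synthetic_tone(period: int, length: int) -> List[float]:
    """A pure sinusoid with the given period in samples, for trying it out."""
    return [math.sin(2 * math.pi * n / period) for n in range(length)]
```

For a sinusoid with a 52-sample period, `estimate_pitch(synthetic_tone(52, 208))` returns 52; silence returns the 16-sample "unvoiced" value. A practical estimator would need to be far more robust to noise and formant structure than this.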

In the present invention, when a packet 210 is determined to be missing or, because of corruption, not in compliance with a predetermined quality standard, the destination device uses the pitch estimate 212 of the speech received during the preceding packet to identify the stored digital representations of voice in memory that are to be substituted for the missed (or corrupted) voice packet. For example, if, as shown in FIG. 3, a voice packet is missed and the last pitch estimate received was 52 (i.e., 52 voice samples), this pitch is used to locate, relative to the current packet, the voice samples stored in memory to be used in place of the missing voice packet. In this example the receiver would replace the missing packet of 16 samples with the 16 voice samples received 52 samples earlier.
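The index arithmetic in the example above can be written as a small function. `substitute_packet` is a hypothetical name, and treating the history as a flat list of the most recent output samples is an assumption. Each replacement sample is copied from `pitch` samples before its own position and appended to a working copy, so a pitch shorter than the packet (such as the 16-sample value used for unvoiced speech, or anything smaller) repeats the most recent period instead of reading past the end of the stored data.

```python
from typing import List


def substitute_packet(history: List[int], pitch: int, n: int = 16) -> List[int]:
    """Return n replacement samples taken `pitch` samples back in the history."""
    if not 0 < pitch <= len(history):
        raise ValueError("pitch must lie within the stored history")
    work = list(history)          # stored voice samples, oldest first
    out: List[int] = []
    for _ in range(n):
        s = work[-pitch]          # sample one pitch period before this position
        out.append(s)
        work.append(s)            # later samples may reuse substituted ones
    return out
```

With `history = list(range(100))`, a pitch of 52 yields samples 48 through 63, i.e. the 16 samples received 52 samples earlier; a pitch of 16 simply repeats the previous packet.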

FIG. 4 is a flow diagram illustrating an exemplary method for the transmission of voice data, such as by the communication controller 10. Beginning with entry at START 40 and initialization of variable N (the number of packets transmitted) to zero, the variable N is incremented by 1 in step 42. In step 44 the originating device determines if N > X, i.e., if a sufficient number X of voice packets have been evaluated in memory to calculate a pitch estimate. If step 44 is NO, the originating device assigns a pitch estimate equal to the interval size in step 58 and control passes to step 48. Upon a YES determination by step 44, step 45 determines if a new pitch interval has been detected. A NO determination by step 45 results in the pitch last calculated by step 46 being reused in accord with step 47, and control passes to step 48. A YES determination by step 45 causes a pitch estimate to be calculated in step 46 on the uncorrupted voice signal at the transmitter. In step 48 a CRC is calculated on the pitch estimate and on a predetermined number of the most significant bits of the voice samples in each packet. In step 50 the voice packet, as shown in FIG. 2A, is transmitted. In determination step 52, a decision is made whether voice processing is to continue, i.e., whether more voice data is to be transmitted. A YES decision returns control to step 42 for processing another voice packet. A NO decision by step 52 terminates this method at RETURN 54.
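The control flow of FIG. 4 can be sketched as a generator. This is a hypothetical software rendering: `pitch_fn`, `new_interval`, and `crc_fn` are caller-supplied stand-ins for the pitch calculator, the "new pitch interval detected" test of step 45, and the partial-bit CRC of step 48, none of which the flow diagram specifies in detail. The `keep` limit on stored history is likewise an assumption.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

INTERVAL = 16  # voice samples per packet (the FIG. 3 example)


def transmit(packets: Iterable[List[int]],
             pitch_fn: Callable[[List[int]], int],
             new_interval: Callable[[List[int]], bool],
             crc_fn: Callable[[int, List[int]], int],
             x: int = 4,
             keep: int = 256) -> Iterator[Tuple[int, List[int], int]]:
    """Yield (pitch, samples, crc) for each outgoing packet, following FIG. 4."""
    history: List[int] = []          # recent uncorrupted input samples
    pitch = INTERVAL
    n = 0                            # START 40: N = 0
    for samples in packets:          # step 52: loop while there is voice to send
        n += 1                       # step 42
        history = (history + list(samples))[-keep:]
        if n <= x:                   # step 44 NO: too few packets for an estimate
            pitch = INTERVAL         # step 58
        elif new_interval(history):  # step 45 YES: a new pitch interval was seen
            pitch = pitch_fn(history)  # step 46
        # Otherwise (step 45 NO, step 47) the last calculated pitch is reused.
        yield pitch, samples, crc_fn(pitch, samples)  # steps 48 and 50
```

Because the estimate is computed from `history`, which holds the transmitter's own input, it is never affected by channel errors; this is the advantage the description attributes to calculating the pitch at the origination node.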

It will be apparent to those skilled in the art that the exemplary methods of FIGS. 4 and 5 are most advantageously incorporated as part of a software operating system used to control a device such as a communications controller. FIG. 5 is a flow diagram illustrating an exemplary method for the reception of voice data transmitted in accord with the method of FIG. 4. Beginning at START 60, variable M, which represents the number of received packets, is set to zero. In step 62 the receiver attempts to receive a voice packet for a particular voice channel or time slot. In determination step 64 a decision is made whether a voice packet was incorrectly received, i.e., not received at all or received with an error in its voice data or pitch estimate, as indicated by a locally generated CRC being unequal to the received CRC. Upon a NO determination (correctly received packet), the memory at the voice receiver is updated with the pitch estimate and voice samples of the received voice packet in step 66. Updating the memory includes storing the new pitch value, storing the new voice samples, and deleting the oldest of a predetermined number of stored voice samples. Variable M is incremented by one. In step 68 the corresponding output voice signal is generated using the received packet voice samples. In step 70 a determination is made whether to continue voice processing, i.e., whether more voice packets are to be received. A YES decision transfers control back to step 62 to process another packet. A NO decision by step 70 terminates voice processing at RETURN 72.

A YES decision by step 64 can result from receipt of a voice packet found corrupted by the CRC or from a missing voice packet. In step 78 the stored pitch estimate is used to identify the location of the stored voice samples (SVS) that will be used for substitution. In step 80 the voice sample memory is updated with the SVS and M is incremented by one. Updating the voice sample memory with replacement stored voice samples enhances the system's performance by providing reasonably good quality voice samples for possible future substitutions. In step 82 the voice output signal is generated using the SVS instead of the currently received voice samples. Control then passes to step 70, which proceeds as explained above.

FIG. 6 is a block diagram of an illustrative embodiment for generating a pitch estimate and CRC at an origination node in accordance with the present invention. An input terminal 100 receives serial PCM information representing sequential voice samples (see FIG. 3) from an originating codec. This PCM information is converted into parallel form by serial to parallel converter 102. A pitch calculator 104 receives the parallel information and implements a pitch calculation method. Pitch calculation methods and apparatus for implementing them are generally known in the art. The output of the pitch calculator 104 is a pitch estimate, which is transmitted with each voice packet. During voiced speech the pitch estimate typically remains unchanged over a number of voice packets, since the packet transmission rate is substantially faster than pitch changes. The pitch estimate is converted to serial form by parallel to serial converter 106. OR gate 107 couples either the input serial PCM or the serial pitch estimate to CRC generator 108. The CRC generator calculates a check value, which is coupled to packet assembler 110. The assembler also receives the input PCM information in parallel form and the pitch estimate, and it provides control outputs that enable elements 106 and 108. The parallel to serial converter 106 is enabled at the correct packet time slot so that the pitch estimate is included in the CRC check value. The packet assembler consists of conventional control circuitry as used in packet switches known in the art. The function of the packet assembler is to organize information in the correct time relationship to form a packet to be transmitted. In the exemplary embodiment, the information is organized in accordance with FIG. 2A. The packet header may contain a variety of relevant information tailored to a specific packet system and hence is not specifically addressed by this invention. The packet header field is followed by the pitch estimate for the current packet and then by sixteen 8-bit bytes corresponding to one voice time slot. In a preferred embodiment the CRC is generated to protect all of the information in the pitch estimate field and a predetermined number of the most significant bits in each of the sixteen voice bytes. In the illustrative example, providing CRC protection for three of the 8 bits in each voice byte was found to produce better voice quality than protecting only one bit or all 8 bits.

FIG. 7 is a block diagram of an embodiment for implementing the voice substitution method based on the received pitch and CRC in accordance with the present invention. In the illustrative embodiment, 8-bit packet bytes and packet clock information are received by the parallel to serial converter 120, which converts the parallel bytes into serial form and provides them to the CRC generator 122. The CRC generator 122 calculates a CRC based on the received pitch estimate, voice data, and transmitted CRC fields. A packet disassembler 124 also receives the packet bytes and clock, and generates control signals as described below. In general, packet disassemblers associated with packet communications systems are known and consist of conventional control circuitry that provides control and timing information to enable a received packet to be disassembled into its constituent parts. A disassembler signal on line 126 supplies control information to the CRC generator 122 that controls which bits it receives. This permits only the predetermined number of most significant bits of each voice byte to be input to the generator, while all of the pitch estimate and transmitted CRC information is received by the CRC generator 122. The value "zero" in reference 128 is provided as one input to comparator 130. Its other input is the output value of CRC generator 122. Upon a command on line 132, the comparator compares the CRC generator output with the zero reference. If the comparison is true, i.e., if the CRC generator output is zero, the substitute control line 134 is not enabled. If the comparison is not true, i.e., if the CRC generator output is other than zero, substitute control line 134 is enabled and thereby initiates voice packet substitution in accordance with the present invention.

A temporary pitch memory 136 stores the pitch estimate byte when it is present on the packet byte line as determined by disassembler control line 138. Thus, memory 136 stores the pitch estimate for each packet. The last correct pitch memory 140 stores the last correctly received pitch estimate. If the pitch estimate received in the current packet is correctly received as determined by substitute control line 134, the value in memory 136 is transferred to memory 140. If the current pitch estimate is not correctly received, it is not transferred to memory 140.

The voice data field in a number of consecutive voice packets is stored in dual port RAM 142 and is utilized in accordance with the present invention to provide substitute voice data when the voice data in a current packet is not correctly received as determined by control line 134.

Transmission gate 144 is normally enabled and permits packet bytes to be received by the data input of RAM 142. This gate is inhibited upon determination of an incorrectly received CRC by control line 134. Each voice byte is stored in the RAM at an address location determined by the RAM's input pointer. Control of this pointer is described as follows.

Transmission gate 146 couples either the packet clock or the codec clock to counter 148, which counts up to the maximum number of voice bytes to be stored in RAM 142. A subtractor 150 receives the counter output value and can subtract a predetermined number corresponding to the number of voice bytes contained in a voice packet, i.e., sixteen in the illustrative example. Except when enabled by a substitute command on control line 134, the subtractor merely passes the counter output to the input pointer of RAM 142. Thus, consecutive voice bytes are stored in RAM with the oldest byte being overwritten as new bytes are received. Voice data in an incorrectly received packet, i.e., one having a CRC error, is initially written into RAM 142. Upon CRC error detection, which initiates a substitute control signal on line 134, transmission gate 146 switches to provide counter 148 with the codec clock and enables subtractor 150, which subtracts 16 from the counter value. This effectively repositions the input pointer of RAM 142 to the first voice byte of the incorrectly received packet. Transmission gate 152 is enabled and couples the data out of RAM 142 back to its data input. As described below, the output pointer is set to a substitute memory location determined by the last correctly received pitch stored in memory 140. Sixteen voice bytes previously stored in memory are output at the data out port of RAM 142 and also input via gate 152, overwriting the voice bytes of the currently received packet that had a CRC error. The output data is converted from parallel to serial form by converter 154, which provides PCM output to the codec, which in turn translates the PCM into analog voice.

The output pointer of RAM 142 identifies the memory address in the RAM that holds data to be output at the data out port. A counter 156 is incremented by the codec clock and counts to the same predetermined number as counter 148, i.e., the maximum number of voice bytes stored in RAM 142. The output value of counter 156 is passed unaltered by subtractor 158 to the output pointer input of RAM 142, except upon receipt of an erroneous packet as determined by control line 134. Note that although counters 148 and 156 count to the same predetermined number, the output value of counter 148 leads that of counter 156, so that a voice byte is written to a memory location in RAM 142 before counter 156 attempts to access that location. When the output of counter 156 reaches a voice byte of an incorrectly received packet, subtractor 158 subtracts from the output value of counter 156 a number corresponding to the pitch substitution value stored in memory 140. This effectively re-indexes the output pointer backwards to the voice samples that, according to the pitch estimate, are to be substituted for the erroneous packet. The subtractor 158 continues to subtract as counter 156 indexes through the next sixteen counts, which correspond to the length of the voice data in the incorrectly received packet. Subtractor 158 then stops subtracting the pitch value and passes the actual output value of counter 156 to the output pointer input of RAM 142. This advances the output pointer to the voice byte location it would have reached if the current packet had been correctly received. As the substitute voice bytes are output, gate 152 couples them to the data input, overwriting the incorrectly received voice bytes in RAM 142. It will be apparent to those skilled in the art that the packet substitution technique of the present invention can be practiced by a hardware implementation such as the embodiment shown in FIGS. 6 and 7, or may be implemented in a microprocessor or, preferably, a digital signal processor. Further, the particular packet environment and the modulation, coding and decoding technique will affect the choice of the number of most significant bits that yields optimal performance.

In the illustrative embodiment of the present invention, calculating the pitch estimate at the originating device yields a better pitch estimate of the original waveform than one generated from received data. This is advantageous because a voice packet substitution based on a more accurate pitch estimate yields better voice quality. Protecting only a predetermined number of the most significant bits of the transmitted voice samples with a check value further enhances the quality achieved by the substitution method. Preventing an error in the least significant bits from triggering the use of a substituted voice sample set improves voice quality, since substitution in that situation often results in worse voice quality than using the received voice samples. Although voice samples generated using a pulse code modulation method are referenced, it will be apparent that other forms of modulation or coding could be used in accord with this invention.

Although an embodiment of the present invention has been shown and described, the scope of the invention is defined by the claims which follow.