INTELLIGENT VOICE CONVERTER

Title:

INTELLIGENT VOICE CONVERTER

Document Type and Number:

WIPO Patent Application WO/2002/023837

Kind Code:

A1

Abstract:

A voice converter (110) includes upstream and downstream resource managers for allocation of half-duplex resources of a full duplex DSP transcoder. The resource managers schedule available upstream or downstream resources and provide transcoded voice data to requesting applications.

Inventors:

YARLAGADDA MADHU

Application Number:

PCT/US2001/028367

Publication Date:

March 21, 2002

Filing Date:

September 10, 2001

Assignee:

YAHOO INC (US)

International Classes:

H04M7/00; H04Q11/04; (IPC1-7): H04L12/66; G06F3/00

Foreign References:

US6269095B1	2001-07-31
US5497373A	1996-03-05

Attorney, Agent or Firm:

Albert, Philip H. (Two Embarcadero Center 8th Floo, San Francisco CA, US)

Claims:

WHAT IS CLAIMED IS:

1.	A voice transcoding method for scheduling resources in a system having a plurality of full-duplex DSP resources, each full-duplex DSP resource having a half-duplex upstream resource for transcoding voice data from a VOIP format to a PSTN format and a half-duplex downstream resource for transcoding voice data from a PSTN format to a VOIP format, said method comprising the acts of: maintaining an upstream resource availability table indicating which upstream resources are available to be scheduled; receiving a request from a requesting application for upstream resource transcoding service; utilizing the upstream resource availability table to identify an available upstream resource; scheduling the available upstream resource to provide requested transcoding service; modifying the upstream resource availability table to indicate that the available upstream resource is a scheduled upstream resource and thus unavailable for scheduling; routing voice data in PSTN format to a scheduled upstream resource for transcoding; routing transcoded voice data in VOIP format from the scheduled upstream resource to the requesting application; and subsequent to completion of requested transcoding service, modifying the upstream resource availability table to indicate that the scheduled upstream resource is now an available upstream resource.

2.	The method of claim 1 further comprising the acts of: maintaining a downstream resource availability table indicating which downstream resources are available to be scheduled; receiving a request from a requesting application for downstream resource transcoding service; utilizing the downstream resource availability table to identify an available downstream resource; scheduling the available downstream resource to provide requested transcoding service; modifying the downstream resource availability table to indicate that the available downstream resource is a scheduled downstream resource and thus unavailable for scheduling; routing voice data in PSTN format to a scheduled downstream resource for transcoding; routing transcoded voice data in VOIP format from the scheduled downstream resource to the requesting application; and subsequent to completion of requested transcoding service, modifying the downstream resource availability table to indicate that the scheduled downstream resource is now an available downstream resource.

3.	The method of claim 1 further comprising the acts of: buffering upstream data prior to routing to the upstream resource to compensate for network latency.

4.	The method of claim 1 further comprising the acts of: receiving requests from the requesting application via a TCP/IP port; and routing voice data via a UDP port.

5.	A method for scheduling a single full-duplex DSP resource, including upstream and downstream half-duplex resources, comprising the steps of: scheduling a first user, requiring use of the upstream half-duplex resource, to have access to the half-duplex upstream resource; and scheduling a second user, requiring use of the downstream half-duplex resource, to concurrently access the downstream half-duplex resource so that independent data from the first and second users are concurrently processed by a single, full-duplex DSP resource.

Description:

INTELLIGENT VOICE CONVERTER

CROSS-REFERENCES TO RELATED APPLICATIONS

This application is related to and claims the benefit of co-pending applications No. 09/658,771, entitled "Voice Integrated System" (Atty. Docket No. 17887-006000US); No. 09/658,781, entitled "Intelligent Voice Bridging" (Atty. Docket No. 17887-007200US); and No. 09/659,233, entitled "Message Store Architecture" (Atty. Docket No. 17887-007400US), all filed September 11, 2000, the disclosures of which are incorporated herein by reference.

BACKGROUND OF THE INVENTION

The present invention relates to the field of transcoding of voice data. In general, transcoding is the process of converting one format of the voice data into another.

Internet telephony, known as voice over IP (VOIP), is becoming a realistic, cost-effective alternative to the traditional public switched telephone networks (PSTNs).

In general, most VOIP applications use a voice encoding format that is different from the voice encoding format used by PSTN networks. Because of the different voice formats used, many of the functionalities that exist for PSTN are not available to VOIP applications unless the functionality is built directly into the VOIP application.

Voice converters and transcoders (VCs) that convert voice data from one format to another are known and may be used to convert data supplied by a VOIP application to PSTN format to allow the VOIP application to utilize PSTN functionalities such as automatic speech recognition (ASR) and text to speech conversion (TTS). The VC may also convert the output of a PSTN functionality to VOIP format. Existing VCs provide such a service by using dedicated DSP resources. A dedicated DSP resource is an entity that is allocated to the voice channel at the very beginning of a process and remains allocated as long as the channel is in use. The DSP resources used to perform the transcode operation are full duplex. Both PSTN and VOIP networks are also full duplex in nature. Hence, to handle a full duplex network, a full duplex DSP resource is created, dedicated, and used.

Although almost all of the DSP resources are full duplex in nature, most human interaction is half-duplex in nature and most of the applications operate based on this half-duplex interaction. For example, almost all of the users who use telecommunication applications such as voice mail and informational services applications do not talk and listen at the same time. Accordingly, it is not necessary for an application to dedicate and allocate DSP resources for the entire duration of the application.

Based on the above observations, all of these existing applications use the DSP resource to less than 50% of its capabilities. From this, it is evident that an improved voice converter that utilizes the DSP resources more effectively and efficiently is required. Furthermore, such a system should be capable of handling a huge number of subscribers.

SUMMARY OF THE INVENTION

By virtue of this invention it is now possible to economically mix and match various functional components from VOIP and PSTN networks.

According to one aspect of the present invention, a method and a mechanism allow transcoding and scheduling of two independent voice data streams from two distinct and different subscribers onto the same full duplex DSP resource.

In one embodiment, the voice converter waits for a request for a conversion resource over TCP/IP. Based on the type of transcoding that was requested, the voice converter will allocate a half-duplex resource, perform a transcode operation, and send output data over a User Datagram Protocol (UDP) interface.

According to another aspect of the invention, look-ahead buffers are utilized to mask network latency and provide a continuous stream of data to the DSP resources.

According to another aspect of the invention, data is transferred in packets having session numbers. The session numbers are utilized to identify different data streams using a single DSP resource.

A further understanding of the nature and advantages of the invention herein may be realized by reference to the remaining portions of the specification and the attached drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

Fig. 1 is a block diagram of a distributed client/server telecommunication system coupled to a managed VOIP network according to one embodiment;

Fig. 2 is a block diagram and flow chart of a preferred embodiment of the invention; and

Fig. 3 is a block diagram of the packet format used by a preferred embodiment.

DESCRIPTION OF THE SPECIFIC EMBODIMENTS

Figure 1 shows one embodiment of a distributed client/server system 100, which is used to provide telecommunication application services to subscribers over a managed VOIP network 102. System 100 is disclosed in the commonly assigned, co-pending application entitled INTEGRATED VOIP SYSTEM (Attny. Docket No. 017887-006000US, filed September 11, 2000), which is hereby incorporated by reference for all purposes. This diagram shows the use of a voice converter to integrate TTS and ASR subsystems.

VC (Voice Converter): VC 110 is a server that can convert one voice data format into another. For this particular embodiment, VC 110 converts pulse code modulation (PCM) voice data (PSTN type data) into G723.1 format (VOIP type data) and vice versa.

GAS (Gateway Access Server): GAS 104 is the server that runs the telecommunication applications. It has a functional component called voice bridging that enables external systems and features to be integrated into the data path of the application running on GAS 104.

TTS (Text To Speech Server): TTS server 106 is responsible for converting text into speech that can be played to the user. Some of the applications that use this feature implement listening to email and other text based content from the phone.

ASR (Automatic Speech Recognition): ASR server 108 is responsible for recognition of voice data sent to it and translating it to text that is sent back to the requester.

Existing TTS 106 and ASR 108 servers have been designed to provide functionality to PSTNs and therefore are designed to process voice data in PCM format.

Accordingly, a VOIP application requires a VC to utilize these resources.

Y! Mail 112 (Yahoo Mail Servers): GAS 104 talks to Yahoo mail servers 112 to enable subscribers to listen to their email using the phone.

The art of transcoding using standard DSP resources, such as chip sets manufactured by Texas Instruments, is well known and will not be addressed in detail here.

The intelligent voice converter of the present embodiment enables VOIP applications to efficiently use external functional modules like TTS 106 and ASR 108. In the following, an upstream resource transcodes a VOIP data format, such as G723.1 or G729, to a standard PSTN format, such as Pulse Code Modulation (PCM), and a downstream resource transcodes standard PSTN formats to VOIP formats.

Fig. 2 is a block diagram of a preferred embodiment of a Voice Converter (VC) architecture and design 200. In Fig. 2, a TCP/IP and UDP/IP API (application program interface) 22 couples a VOIP application to an Up Stream Resource Manager (USRM) 24 and to a Down Stream Resource Manager (DSRM) 26. USRM 24 and DSRM 26 are coupled to a block 28 of digital signal processor (DSP) resources. Each DSP resource 30 is full-duplex and includes an upstream half-duplex block 32 and a downstream half-duplex block 34.

TCP/IP and UDP/IP API 22 utilizes a TCP port to call the voice converter command port. The voice converter command port will wait on the TCP/IP socket for a request. When the command port receives the request, it will analyze the request and identify whether it is an upstream request or a downstream request. At this point the voice converter command port will dispatch the request to either the upstream resource manager or the downstream resource manager.
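The dispatch step described above can be sketched as follows. This is only an illustration: the specification does not give the wire format of a command, so the plain-text command strings and the names `Direction`, `classify_request`, and `dispatch` are assumptions made for the example.

```python
from enum import Enum


class Direction(Enum):
    UPSTREAM = "upstream"      # VOIP format -> PSTN format
    DOWNSTREAM = "downstream"  # PSTN format -> VOIP format


def classify_request(command: str) -> Direction:
    """Identify whether a command is an upstream or a downstream request.

    Assumes (hypothetically) that commands arrive as text such as
    "upstream conversion start".
    """
    word = command.strip().lower()
    if word.startswith("upstream"):
        return Direction.UPSTREAM
    if word.startswith("downstream"):
        return Direction.DOWNSTREAM
    raise ValueError(f"unrecognized command: {command!r}")


def dispatch(command: str, usrm, dsrm):
    """Forward the command to the matching resource manager callable."""
    handlers = {Direction.UPSTREAM: usrm, Direction.DOWNSTREAM: dsrm}
    return handlers[classify_request(command)](command)
```

Here `usrm` and `dsrm` stand in for the two resource managers; any callables that accept the command string will do.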

As stated above, each of the DSP resources 30 is divided into two parts: the upstream resource 32 and the downstream resource 34. Each of these resources is a half-duplex channel, and together they form a full duplex DSP resource. Each of the two half-duplex resources 32 and 34 is assigned to a respective resource manager 24 and 26, where each resource manager maintains a table indicating the availability of each half-duplex resource so that an available resource may be scheduled to satisfy application transcoding requests.
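A minimal sketch of such an availability table follows, assuming the table is a simple mapping from a DSP resource identifier to a free/busy flag. The class name `ResourceManager` and its methods are hypothetical; the specification does not describe the table's layout.

```python
class ResourceManager:
    """Tracks availability of one direction's half-duplex resources."""

    def __init__(self, resource_ids):
        # True means the half-duplex resource is available for scheduling.
        self._available = {rid: True for rid in resource_ids}

    def schedule(self):
        """Mark the first available resource as scheduled and return its id.

        Returns None when every resource is already scheduled.
        """
        for rid, free in self._available.items():
            if free:
                self._available[rid] = False
                return rid
        return None

    def release(self, rid):
        """Mark a scheduled resource as available again."""
        if self._available.get(rid) is not False:
            raise ValueError(f"resource {rid!r} is not currently scheduled")
        self._available[rid] = True
```

One instance would serve as the upstream manager and a second, built over the same DSP identifiers, as the downstream manager, so that the two halves of a DSP are scheduled independently.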

Upstream Resource Manager: Upstream resource manager 24 is responsible for keeping track of upstream DSP resource availability and scheduling. USRM 24 waits for application requests for upstream conversion and accepts a request if resources are available. USRM 24 then receives a first buffer of voice data from the application for conversion, keeps the first buffer of data in a look-ahead buffer, and then waits for one more voice data buffer for conversion. After receiving a minimum of two voice buffers filled with data, USRM 24 will forward data to the upstream resource to start the conversion.

A look-ahead buffer is kept because, once the conversion is started on a DSP resource 30, buffers must continually be fed to DSP 30. As described above, DSP resource 30 is designed for use with a PSTN in one embodiment. In the PSTN, data is supplied at a continuous rate, so DSP resource 30 is designed to output data at a continuous rate. If no data is received, DSP resource 30 outputs silence frames. Thus, if, for any reason, the next buffer is not ready because of network latency, etc., DSP 30 will start generating silence frames as the output, and DSP resource 30 will always output data at a continuous bit rate. Silence frames may cause the transcoded voice to be unintelligible to a listener. The look-ahead cache buffer will mask any network latency in receiving data, always making sure that enough data is available to feed to DSP 30 and hence ensuring the quality of transcoded voice.
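The look-ahead behavior can be sketched as a small queue that withholds data until the minimum of two buffers has arrived. The class name `LookAheadBuffer`, the `min_buffers` parameter, and the use of `None` to signal "nothing to feed" are assumptions made for illustration.

```python
from collections import deque


class LookAheadBuffer:
    """Holds voice buffers back until enough are queued to start conversion."""

    def __init__(self, min_buffers=2):
        self._queue = deque()
        self._min = min_buffers
        self._started = False

    def push(self, voice_buffer):
        """Queue a buffer received from the application."""
        self._queue.append(voice_buffer)

    def pop_for_dsp(self):
        """Return the next buffer for the DSP, or None if none should be fed.

        Before conversion starts, None means "still filling". After it
        starts, None is an underrun, where the DSP would output silence.
        """
        if not self._started:
            if len(self._queue) < self._min:
                return None
            self._started = True
        if self._queue:
            return self._queue.popleft()
        return None
```

Keeping one buffer in reserve means a late packet delays the queue rather than the DSP, which is the latency masking the text describes.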

Down Stream Resource Manager: Downstream resource manager 26 is responsible for keeping track of downstream DSP resource availability and scheduling. Very similar to upstream resource manager 24, downstream resource manager 26 will also wait for the application request for downstream conversion, accept the conversion request, identify available downstream DSP resources, and schedule the downstream request onto a DSP resource. Prior to scheduling, it also maintains a look-ahead buffer to mask any network latency.

System Operation: The operation of the preferred embodiment depicted in Fig. 2 will now be described. Signaling commands from an application are sent to TCP/IP and UDP/IP interface 22 using TCP and all data is sent on UDP. The operation of USRM 24 will be described first.

TCP/IP and UDP/IP interface 22 forwards a request from an application to USRM 24 (arrow 2). USRM 24 checks its resource availability table to determine which upstream resources are available to service the request. USRM 24 then schedules an available upstream resource and provides data in VOIP format to the scheduled upstream resource (arrow 10). USRM 24 also connects (arrow 11) the scheduled upstream DSP resource output back to the UDP stream that sends transcoded data back to the requesting application.

For example, if the requesting application is utilizing ASR resource 108, then the received voice data in VOIP format is transcoded to PCM format by the upstream resource and sent back to the application for ASR processing.

The operation of DSRM 26 will now be described. TCP/IP and UDP/IP interface 22 forwards a request from an application to DSRM 26 (arrow 3). DSRM 26 checks its resource availability table to determine which downstream resources are available to service the request. DSRM 26 then schedules an available downstream resource and provides data in PCM format to the scheduled resource (arrows 7 and 8). DSRM 26 also connects the scheduled downstream DSP resource output back to the UDP stream that sends transcoded data back to the requesting application (arrow 19).

For example, if the requesting application is utilizing TTS services, then PCM data output by TTS 106 is transcoded to VOIP data and sent back to the requesting application.

Note that in this example in Fig. 2, the upstream and downstream resources are scheduled as the two half-duplex blocks of a single DSP resource. Thus, twice the performance is gained from a DSP resource, so that only a fraction of the DSP resources required by existing systems are required.
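The saving can be made concrete with a short calculation. The sketch assumes that an existing system dedicates one full-duplex DSP to every session, and that under the present scheme each session occupies only one half-duplex block at a time; the function name `dsps_required` is hypothetical.

```python
def dsps_required(upstream_sessions, downstream_sessions):
    """Return (dedicated, shared) counts of full-duplex DSPs needed.

    dedicated: one full-duplex DSP per session, as in existing systems.
    shared: upstream and downstream sessions paired on the two
            half-duplex blocks of the same DSP.
    """
    dedicated = upstream_sessions + downstream_sessions
    shared = max(upstream_sessions, downstream_sessions)
    return dedicated, shared
```

For example, four upstream and four downstream sessions need eight dedicated DSPs but only four shared ones. The gain shrinks as the load becomes one-sided, since unpaired half-duplex blocks sit idle.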

The voice converter uses TCP/IP as the command and control channel in one embodiment. Some commands supported in one embodiment include the following:

1. Upstream conversion start
2. Upstream conversion stop
3. Upstream conversion cancel
4. Downstream conversion start
5. Downstream conversion stop
6. Downstream conversion cancel

The voice converter sends actual voice data using UDP. An embodiment of a voice packet format is shown in Fig. 3. The voice packet format includes a packet sequence number 600, a packet session flag 602, and packet voice data 604. The first part of the packet is sequence number 600, which is an incremental number from 0 to 255. Sequence number 600 is used to re-sequence packets in case the packets arrive out of sequence at the destination, and also to identify packet loss.
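A sketch of packing and parsing such a packet is shown below. The sequence number range of 0 to 255 suggests a one-byte field, but the specification gives no field widths, so the one-byte session flag and the names `pack_packet`, `unpack_packet`, and `seq_gap` are assumptions.

```python
import struct

# Assumed layout: 1-byte sequence number, 1-byte session flag, then voice data.
HEADER = struct.Struct("!BB")


def pack_packet(seq, session, voice):
    """Build a packet; the sequence number wraps around after 255."""
    return HEADER.pack(seq % 256, session) + voice


def unpack_packet(data):
    """Split a packet into (sequence number, session flag, voice data)."""
    seq, session = HEADER.unpack_from(data)
    return seq, session, data[HEADER.size:]


def seq_gap(previous, current):
    """Distance from the previous to the current sequence number, modulo 256.

    A gap of 1 means the packets are consecutive; a larger gap suggests
    that gap - 1 packets were lost or are still in flight.
    """
    return (current - previous) % 256
```

Computing the gap modulo 256 keeps loss detection correct across the wrap from 255 back to 0.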

The next section in the packet is session flag 602. Flag 602 enables the bridge to support multiple simultaneous sessions. If session flag 602 for a particular session of bridging is set to a specific number, then all the voice packets belonging to that session will have the same value. This flag value is used to separate packets based on functionality into separate sessions. The packet also has a session flag that is used to keep track of the voice sessions. By using the session numbers associated with the segments of the voice, the application and the voice converter will take care of packets that have higher latency. For example, if a user is listening to email, the application will play the header with session number one ("1"). Then the application will continue to play the body of the email and use session number two ("2") for it. The advantage of using the session ID is the fact that the application can use multiple simultaneous streams for different sessions. In a given session, all packets sent for that session will have the same session number.
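Separating interleaved packets by session number can be sketched as below. The tuple representation `(sequence, session, voice)` and the function name `demultiplex` are assumptions; for brevity the sketch orders packets by raw sequence number and does not handle wrap-around past 255.

```python
from collections import defaultdict


def demultiplex(packets):
    """Group packets by session flag and reassemble each stream in order.

    packets: iterable of (sequence number, session flag, voice data).
    Returns a dict mapping each session flag to its concatenated voice data.
    """
    sessions = defaultdict(list)
    for seq, session, voice in packets:
        sessions[session].append((seq, voice))
    return {
        session: b"".join(
            voice for _, voice in sorted(items, key=lambda item: item[0])
        )
        for session, items in sessions.items()
    }
```

In the email example, header packets carrying session number 1 and body packets carrying session number 2 would come out as two separate streams even if they arrive interleaved.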

The third part of the packet is packet voice data 604, which includes the voice packets.

The invention has now been described with reference to the preferred embodiments. Alternatives and substitutions will now be apparent to persons of skill in the art. For example, different VOIP encoding schemes such as G.726 or CELP encoding may be utilized. Also, commands and data may be exchanged using network protocols other than UDP or TCP/IP. Accordingly, it is not intended to limit the invention except as provided by the appended claims.
