


Title:
AUDIO PACKET LOSS CONCEALMENT
Document Type and Number:
WIPO Patent Application WO/2019/213021
Kind Code:
A1
Abstract:
A method (400) includes receiving an audio stream (130) across a packet switched network (120), the audio stream including a sequence of indexed audio packets (132). The method also includes determining that the received audio stream is missing an indexed audio packet (132m) from the sequence of indexed audio packets and predicting a predicted audio packet (132p) to replace the missing indexed audio packet based on received indexed audio packets in the sequence of indexed audio packets of the audio stream. The method also includes substituting the predicted audio packet for the missing indexed audio packet within the sequence of indexed audio packets of the audio stream, resulting in a reconstituted audio stream (232). The method also includes communicating the reconstituted audio stream for audible play.

Inventors:
BRUSE, Martin (1600 Amphitheatre Parkway, Mountain View, CA, 94043, US)
UBERTI, Justin (1600 Amphitheatre Parkway, Mountain View, CA, 94043, US)
BLUM, Niklas (1600 Amphitheatre Parkway, Mountain View, CA, 94043, US)
WALTERS, Thomas (1600 Amphitheatre Parkway, Mountain View, CA, 94043, US)
NAREST, Alex (1600 Amphitheatre Parkway, Mountain View, CA, 94043, US)
Application Number:
US2019/029804
Publication Date:
November 07, 2019
Filing Date:
April 30, 2019
Assignee:
GOOGLE LLC (1600 Amphitheatre Parkway, Mountain View, CA, 94043, US)
International Classes:
G10L19/005; G10L25/30
Foreign References:
US 5907822 A, 1999-05-25
Other References:
None
Attorney, Agent or Firm:
KRUEGER, Brett, A. (Honigman LLP, 300 Ottawa Ave. NW Suite 40, Grand Rapids MI, 49503-2308, US)
Claims:
WHAT IS CLAIMED IS:

1. A method (400) comprising:

receiving, at data processing hardware (144), an audio stream (130) across a packet switched network (120), the audio stream (130) comprising a sequence of indexed audio packets (132);

determining, by the data processing hardware (144), that the received audio stream (130) is missing an indexed audio packet (132m) from the sequence of indexed audio packets (132);

predicting, by the data processing hardware (144), a predicted audio packet (132p) to replace the missing indexed audio packet (132m) based on received indexed audio packets (132) in the sequence of indexed audio packets (132) of the audio stream (130);

substituting, by the data processing hardware (144), the predicted audio packet (132p) for the missing indexed audio packet (132m) within the sequence of indexed audio packets (132) of the audio stream (130), resulting in a reconstituted audio stream (232); and

communicating, from the data processing hardware (144), the reconstituted audio stream (232) for audible play.

2. The method (400) of claim 1, wherein predicting the predicted audio packet (132p) comprises:

determining a prosodic representation (226) corresponding to the received indexed audio packets (132);

identifying an anomaly in the prosodic representation (226) corresponding to the missing indexed audio packet (132m);

predicting, using a machine learning model (300), at least one prosodic unit (228) corresponding to the anomaly in the prosodic representation (226); and

generating, using a speech synthesizer (224), an audio packet based on the predicted at least one prosodic unit (228) corresponding to the anomaly.

3. The method (400) of claim 2, wherein predicting the at least one prosodic unit (228) corresponding to the anomaly in the prosodic representation (226) comprises:

determining a first portion (226a) of the prosodic representation (226) corresponding to the received indexed audio packets (132) before the anomaly in the prosodic representation (226) corresponding to the missing indexed audio packet (132m); and

predicting, using the machine learning model (300), the at least one prosodic unit (228) corresponding to the anomaly in the prosodic representation (226) based on the first portion (226a) of the prosodic representation (226).

4. The method (400) of claim 3, wherein predicting the at least one prosodic unit (228) corresponding to the anomaly in the prosodic representation (226) comprises:

determining a second portion (226b) of the prosodic representation (226) corresponding to the received indexed audio packets (132) after the anomaly in the prosodic representation (226) corresponding to the missing indexed audio packet (132m); and

predicting, using the machine learning model (300), the at least one prosodic unit (228) corresponding to the anomaly in the prosodic representation (226) based on the second portion (226b) of the prosodic representation (226).

5. The method (400) of any of claims 2-4, wherein the machine learning model (300) comprises a neural network (120).

6. The method (400) of any of claims 1-5, wherein predicting the predicted audio packet (132p) comprises predicting, using a machine learning model (300), a plausible audio replacement for missing audio of the missing indexed audio packet (132m) based on an audio sample of one or more of the received indexed audio packets (132) in the sequence of indexed audio packets (132) of the audio stream (130).

7. The method (400) of any of claims 1-6, wherein the packet switched network (120) comprises a peer-to-peer connection protocol (P2P).

8. The method (400) of any of claims 1-7, wherein the audio stream (130) corresponds to a web-based real-time communication application.

9. The method (400) of any of claims 1-8, wherein the missing indexed audio packet (132m) corresponds to greater than 100ms of human speech.

10. The method (400) of any of claims 1-9, wherein the audio stream (130) comprises human speech.

11. A system (100) comprising:

data processing hardware (144); and

memory hardware (146) in communication with the data processing hardware (144), the memory hardware (146) storing instructions that when executed on the data processing hardware (144) cause the data processing hardware (144) to perform operations comprising:

receiving an audio stream (130) across a packet switched network (120), the audio stream (130) comprising a sequence of indexed audio packets (132);

determining that the received audio stream (130) is missing an indexed audio packet (132m) from the sequence of indexed audio packets (132);

predicting a predicted audio packet (132p) to replace the missing indexed audio packet (132m) based on received indexed audio packets (132) in the sequence of indexed audio packets (132) of the audio stream (130);

substituting the predicted audio packet (132p) for the missing indexed audio packet (132m) within the sequence of indexed audio packets (132) of the audio stream (130), resulting in a reconstituted audio stream (232); and

communicating the reconstituted audio stream (232) for audible play.

12. The system (100) of claim 11, wherein predicting the predicted audio packet (132p) comprises:

determining a prosodic representation (226) corresponding to the received indexed audio packets (132);

identifying an anomaly in the prosodic representation (226) corresponding to the missing indexed audio packet (132m);

predicting, using a machine learning model (300), at least one prosodic unit (228) corresponding to the anomaly in the prosodic representation (226); and

generating, using a speech synthesizer (224), an audio packet based on the predicted at least one prosodic unit (228) corresponding to the anomaly.

13. The system (100) of claim 12, wherein predicting the at least one prosodic unit (228) corresponding to the anomaly in the prosodic representation (226) comprises:

determining a first portion (226a) of the prosodic representation (226) corresponding to the received indexed audio packets (132) before the anomaly in the prosodic representation (226) corresponding to the missing indexed audio packet (132m); and

predicting, using the machine learning model (300), the at least one prosodic unit (228) corresponding to the anomaly in the prosodic representation (226) based on the first portion (226a) of the prosodic representation (226).

14. The system (100) of claim 13, wherein predicting the at least one prosodic unit (228) corresponding to the anomaly in the prosodic representation (226) comprises:

determining a second portion (226b) of the prosodic representation (226) corresponding to the received indexed audio packets (132) after the anomaly in the prosodic representation (226) corresponding to the missing indexed audio packet (132m); and

predicting, using the machine learning model (300), the at least one prosodic unit (228) corresponding to the anomaly in the prosodic representation (226) based on the second portion (226b) of the prosodic representation (226).

15. The system (100) of any of claims 12-14, wherein the machine learning model (300) comprises a neural network (120).

16. The system (100) of any of claims 11-15, wherein predicting the predicted audio packet (132p) comprises predicting, using a machine learning model (300), a plausible audio replacement for missing audio of the missing indexed audio packet (132m) based on an audio sample of one or more of the received indexed audio packets (132) in the sequence of indexed audio packets (132) of the audio stream (130).

17. The system (100) of any of claims 11-16, wherein the packet switched network (120) comprises a peer-to-peer connection protocol (P2P).

18. The system (100) of any of claims 11-17, wherein the audio stream (130) corresponds to a web-based real-time communication application.

19. The system (100) of any of claims 11-18, wherein the missing indexed audio packet (132m) corresponds to greater than 100ms of human speech.

20. The system (100) of any of claims 11-19, wherein the audio stream (130) comprises human speech.

Description:
Audio Packet Loss Concealment

TECHNICAL FIELD

[0001] This disclosure relates to audio packet loss concealment.

BACKGROUND

[0002] As technology related to wireless communications has grown to transfer information over distances near and far, telecommunication systems transfer packets of data across these distances. Depending on the technology involved in the data packet transfer, data packets may be lost or distorted, resulting in an imperfect data transfer. These imperfect data transfers impact audio connections, causing quality problems in the rendered audio. As people increasingly communicate using real-time audio connections, predictive modeling systems may be implemented to conceal audio packet loss.

SUMMARY

[0003] One aspect of the disclosure provides a method for communicating audio streams. The method includes receiving, at data processing hardware, an audio stream across a packet switched network. The audio stream includes a sequence of indexed audio packets. The method also includes determining, by the data processing hardware, that the received audio stream is missing an indexed audio packet from the sequence of indexed audio packets. The method also includes predicting, by the data processing hardware, a predicted audio packet to replace the missing indexed audio packet based on received indexed audio packets in the sequence of indexed audio packets of the audio stream. The method also includes substituting, by the data processing hardware, the predicted audio packet for the missing indexed audio packet within the sequence of indexed audio packets of the audio stream, resulting in a reconstituted audio stream. The method also includes communicating, from the data processing hardware, the reconstituted audio stream for audible play.

[0004] Implementations of the disclosure may include one or more of the following optional features. In some implementations, predicting the predicted audio packet includes: determining a prosodic representation corresponding to the received indexed audio packets; identifying an anomaly in the prosodic representation corresponding to the missing indexed audio packet; predicting, using a machine learning model, at least one prosodic unit corresponding to the anomaly in the prosodic representation; and generating, using a speech synthesizer, an audio packet based on the predicted at least one prosodic unit corresponding to the anomaly. In these implementations, predicting the at least one prosodic unit corresponding to the anomaly in the prosodic representation may include: determining a first portion of the prosodic representation corresponding to the received indexed audio packets before the anomaly in the prosodic representation corresponding to the missing indexed audio packet; and predicting, using the machine learning model, the at least one prosodic unit corresponding to the anomaly in the prosodic representation based on the first portion of the prosodic representation. Here, predicting the at least one prosodic unit corresponding to the anomaly in the prosodic representation includes: determining a second portion of the prosodic representation corresponding to the received indexed audio packets after the anomaly in the prosodic representation corresponding to the missing indexed audio packet; and predicting, using the machine learning model, the at least one prosodic unit corresponding to the anomaly in the prosodic representation based on the second portion of the prosodic representation. In some examples, the machine learning model includes a neural network.

[0005] In other examples, predicting the predicted audio packet includes predicting, using a machine learning model, a plausible audio replacement for missing audio of the missing indexed audio packet based on an audio sample of one or more of the received indexed audio packets in the sequence of indexed audio packets of the audio stream. The packet switched network may include a peer-to-peer connection protocol (P2P) and/or the audio stream may correspond to a web-based real-time communication application. In some implementations, the missing indexed audio packet corresponds to greater than 100ms of human speech. Thus, the audio stream may include human speech.

[0006] Another aspect of the disclosure provides a system for communicating audio streams. The system includes data processing hardware and memory hardware in communication with the data processing hardware, the memory hardware storing instructions that when executed on the data processing hardware cause the data processing hardware to perform operations that include receiving an audio stream across a packet switched network. The audio stream includes a sequence of indexed audio packets. The operations also include determining that the received audio stream is missing an indexed audio packet from the sequence of indexed audio packets. The operations also include predicting a predicted audio packet to replace the missing indexed audio packet based on received indexed audio packets in the sequence of indexed audio packets of the audio stream. The operations also include substituting the predicted audio packet for the missing indexed audio packet within the sequence of indexed audio packets of the audio stream, resulting in a reconstituted audio stream. The operations also include communicating the reconstituted audio stream for audible play.

[0007] This aspect may include one or more of the following optional features. In some implementations, predicting the predicted audio packet includes:

determining a prosodic representation corresponding to the received indexed audio packets; identifying an anomaly in the prosodic representation corresponding to the missing indexed audio packet; predicting, using a machine learning model, at least one prosodic unit corresponding to the anomaly in the prosodic representation; and generating, using a speech synthesizer, an audio packet based on the predicted at least one prosodic unit corresponding to the anomaly. In these implementations, predicting the at least one prosodic unit corresponding to the anomaly in the prosodic representation may include: determining a first portion of the prosodic representation corresponding to the received indexed audio packets before the anomaly in the prosodic representation corresponding to the missing indexed audio packet; and predicting, using the machine learning model, the at least one prosodic unit corresponding to the anomaly in the prosodic representation based on the first portion of the prosodic representation. Here, predicting the at least one prosodic unit corresponding to the anomaly in the prosodic representation includes: determining a second portion of the prosodic representation corresponding to the received indexed audio packets after the anomaly in the prosodic representation corresponding to the missing indexed audio packet; and predicting, using the machine learning model, the at least one prosodic unit corresponding to the anomaly in the prosodic representation based on the second portion of the prosodic representation. In some examples, the machine learning model includes a neural network.

[0008] In other examples, predicting the predicted audio packet includes predicting, using a machine learning model, a plausible audio replacement for missing audio of the missing indexed audio packet based on an audio sample of one or more of the received indexed audio packets in the sequence of indexed audio packets of the audio stream. The packet switched network may include a peer-to-peer connection protocol (P2P) and/or the audio stream may correspond to a web-based real-time communication application. In some implementations, the missing indexed audio packet corresponds to greater than 100ms of human speech. Thus, the audio stream may include human speech.

DESCRIPTION OF DRAWINGS

[0009] FIG. 1 is a schematic view of an example communication environment.

[0010] FIGS. 2A-2D are schematic views of example audio enhancers within the communication environment.

[0011] FIG. 3 is a schematic view of an example model for an audio enhancer.

[0012] FIG. 4 is a flow diagram of an example method for concealing audio packet loss.

[0013] FIG. 5 is a schematic view of an example computing device that may be used to implement the systems and methods described herein.

[0014] Like reference symbols in the various drawings indicate like elements.

DETAILED DESCRIPTION

[0015] FIG. 1 is an example of a communications environment 100. The communications environment 100 is an environment where users 10, 10a-b have a conversation 12 via user devices 110, 110a-b. The conversation 12 occurs between the user devices 110, 110a-b across a packet switched network 120. The packet switched network 120 is a type of communication network configured to route small units of data referred to as packets (i.e. data packets or network packets) between addresses (e.g., a source address and a destination address).

[0016] A conversation 12 generally refers to an audible sequence of speech between users 10. The user device 110 of each user 10 is configured to capture the conversation 12 and communicate the conversation 12 via the packet switched network 120. The user device 110 communicating via the packet switched network 120 enables remote users 10 to be connected and engaged in a conversation 12. For example, the user device 110 captures the conversation 12 with a component 116 (e.g., a peripheral or embedded component) such as a microphone, a speaker, a camera, or a webcam.

[0017] The user device 110 can be any computing device or data processing hardware capable of communicating with a packet switched network 120 and/or remote systems 140. With continued reference to FIG. 1, the user device 110 includes data processing hardware 112, memory hardware 114, and peripherals or embedded components 116 (e.g., such as speakers, microphones, cameras, webcams, etc.). The user device 110 includes, but is not limited to, desktop computing devices and mobile computing devices, such as laptops, tablets, smart phones, and wearable computing devices (e.g., headsets and/or watches). The user device 110 is configured such that the user 10 may engage in conversations 12 across a packet switched network 120 (e.g., via an audio receiver and/or transmitter).

[0018] A packet switched network 120, unlike a circuit switched network with dedicated resources, is a shared network where multiple users 10 and/or applications may send data in the form of data packets (e.g., audio packets 132). As a shared network, the packet switched network 120 utilizes the small unit size of the data packets to exploit efficient data transfer across network channels. In other words, when a user device 110 sends data, the data (i.e. original data) is broken into data packets 132 to optimize data transfer across the packet switched network 120. Each data packet 132 may include control information in the form of packet headers and trailers to aid in packet delivery across the packet switched network 120.

[0019] In some examples, the conversation 12 undergoes connectionless packet switching. In connectionless packet switching, each data packet 132 may be individually routed according to routing information associated with each data packet 132. For example, the packet header of each data packet 132 includes a destination address, a source address, a total number of data packets to form the original data, and an index number. In other examples, the packet header includes the destination address, the source address, and the index number, but not the total number of data packets 132. The index number corresponds to a sequence in which to arrange each packet 132 to reconstruct the original data. With the routing information included, each data packet 132 may be sent along different routes within the packet switched network 120 and still reconstructed to form the original data. In other examples, the conversation 12 undergoes connection-oriented packet switching (i.e. virtual circuit switching) where the data packets 132 are sent along a predefined route. When using a predefined route, the data packets may be indexed and transferred in sequential order to the destination. Thus, with connection-oriented packet switching, each data packet 132 does not require the routing information.

[0020] A data packet 132 is a formatted unit of data carried by the packet switched network 120. The data packet 132 generally includes the control information and user data (i.e. a payload). Various communication protocols have different data packet conventions that define elements of the data packet and/or format the data into the data packet. For instance, in point-to-point protocol, the protocol specifies the format of the data packet as 8-bit bytes. Other protocols, such as transmission control protocol (TCP) and internet protocol (IP), may further dictate how to divide data into data packets as well as how to manage data packet transmission and receipt. In other words, parameters (e.g., size/length) of a data packet 132 may depend on parameters of the network protocol and/or the original data transmitted. Furthermore, data packets 132 may also include other elements such as error detection and correction (e.g., checksums, parity bits, redundancy checks), time to live (TTL) identifiers, payload maxima, priority identifiers, etc.
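
To make the packet elements described above concrete, the following is a minimal sketch of an indexed audio packet in Python. The class and field names are illustrative assumptions only; real protocols (e.g., RTP over UDP) define their own binary header layouts.

```python
from dataclasses import dataclass

@dataclass
class AudioPacket:
    """A simplified indexed audio packet: control information plus payload.

    The fields mirror the elements named in paragraphs [0019]-[0020];
    actual header formats are protocol-specific.
    """
    source: str        # source address
    destination: str   # destination address
    index: int         # position within the sequence forming the original data
    payload: bytes     # encoded audio for this packet

    def checksum(self) -> int:
        # Toy error-detection value; real packets use CRCs or parity bits.
        return sum(self.payload) % 256
```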

[0021] Traditionally, networks have techniques to account for error detection and error correction. For example, checksum schemes such as parity bits, check digits, and redundancy checks indicate whether data packets have experienced any issues such as loss. In response to issues, networks seek to correct detected errors by methods such as automatic repeat requests (ARQ) or correcting codes. For instance, in an ARQ method, a receiver of data communicates an acknowledgment message indicating whether data has been received with or without errors. Here, when a transmitter of data does not receive an acknowledgment message, the transmitter resends the data. Although these methods have proven valuable, the demands of newer communication systems are beginning to strain aspects of traditional techniques.

[0022] As a shared network, bandwidth of communication channels within the packet switched network 120 may become limited depending on data transfer traffic or system load. Since packet switched networks 120 utilize data packets instead of continuous bit streams, data packets may suffer data transmission issues (e.g., be lost or delayed). These transmission issues complicate conversations 12 (e.g., conversations involving human speech) across packet switched networks 120 because parts of the conversation 12 may be lost or distorted, resulting in the audio stream 130 of the conversation 12 sounding broken, segmented, or delayed (i.e. off-timing). In the case of video streaming, the video and audio may become out of synchronization. For instance, a user 10 sees another user 10 talking, but hears garbled or scrambled audio accompanying the video.

[0023] Real-time communication places additional strains on the capabilities of packet switching. For example, real-time communication demands low latency to simulate and/or to match in-person communication. As real-time communication seeks to offer high-quality communication, real-time communication data transmitted as data packets via packet switched networks 120 must account for packet switching issues. For example, real-time communication cannot afford to wait for retransmission in order to smooth out network jitter from lost or distorted data packet transmission. Thus, often real-time communication experiences gaps within a media stream (e.g., audio stream or video stream). Without these traditional techniques (e.g., retransmission), concealment of gaps within the media stream may occur by generating new media content to overlay or replace the gaps. For instance, processes, such as pitch synchronous overlap and add (PSOLA), generate a loop of a prior segment of the media stream to insert in a gap as a form of gap concealment.

[0024] In the example shown in FIG. 1, a first user 10a is having a conversation 12 with a second user 10b (i.e. a receiving user) using a real-time communication application (RTC app) 20 executable on the user devices 110a, 110b. As the first user 10a talks to the second user 10b, the user device 110a associated with the first user 10a and executing the RTC app 20 receives audible speech of the conversation 12. Each user device 110 and/or the RTC app 20 is configured to memorialize the conversation 12 as a corresponding audio stream 130. Moreover, each user device 110, via the RTC app 20, facilitates the transmission and receipt of the audio stream 130 of the conversation 12. For example, as shown in FIG. 1, the first user 10a tells the second user 10b that “The quick brown fox jumps over the lazy dog” as a spoken sentence 30, 30a of the conversation 12. Here, the user device 110a captures sounds (e.g., via a microphone 116) generated from the spoken sentence 30a and converts the spoken sentence 30a into the communicated audio stream 130, 130a. To be communicated across the packet switched network 120, the communicated audio stream 130a is divided into a sequence of indexed audio packets 132, 132 1-N. In some examples, the communication across the packet switched network 120 is facilitated by a peer-to-peer (P2P) connection protocol. For simplicity, each word of the sentence 30 constitutes an audio packet 132. However, audio packets 132 may vary in size such that the audio stream 130 may be divided by other linguistic structures (e.g., words, syllables, phones, frames, etc.) or parts of linguistic structures depending on protocols related to the packet switched network 120.
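
As a rough illustration of this division step, the sketch below splits a raw audio buffer into fixed-size indexed packets. The 960-byte packet size is an arbitrary assumption; as the passage notes, real packet sizes depend on the codec, the linguistic segmentation, and the network protocol.

```python
def packetize(audio: bytes, packet_bytes: int = 960) -> list[tuple[int, bytes]]:
    """Split raw audio into a sequence of (index, payload) packets.

    Index numbers start at 1 and establish the order in which the
    receiver reassembles the original audio stream.
    """
    return [
        (i + 1, audio[start:start + packet_bytes])
        for i, start in enumerate(range(0, len(audio), packet_bytes))
    ]

# Example: 4800 bytes of audio become five packets indexed 1..5.
packets = packetize(b"\x00" * 4800)
assert [index for index, _ in packets] == [1, 2, 3, 4, 5]
```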

[0025] Due to the nature of packet switched networks (e.g., the packet switched network 120), parts of the communicated audio stream 130a may experience issues during transmission such that the received audio stream 130, 130b at the user device 110b associated with the second user 10b differs from the communicated audio stream 130a. For example, the received audio stream 130b is missing at least one indexed audio packet 132 (e.g., a missing audio packet 132, 132m) from the sequence of indexed audio packets 132, 132 1-N. As illustrated by FIG. 1, the received audio stream 130b is missing three audio packets 132 4-6 that were included in the communicated audio stream 130a transmitted from the user device 110a. Here, FIG. 1 depicts the missing three audio packets 132, 132 4-6 as blank. In some examples, when the received audio stream 130b is missing an indexed audio packet 132, the received audio stream 130b contains a gap 134. The gap 134 may correspond to inaudible sound, distorted sound, or garbled sound at the recipient (e.g., the second user 10b).

[0026] To account for transmission issues, the received audio stream 130b is communicated as an input 150 to an audio enhancer 200. The audio enhancer 200 is configured to predict and/or to generate sounds to replace missing audio packets 132, 132m of the received audio stream 130, 130b as an output 240. When the user device 110 receives the output 240 of the audio enhancer 200, the user device 110 is configured to communicate the output 240 to the user 10 (e.g., the receiving user 10b). As an example, FIG. 1 illustrates an audible output component (e.g., speaker) 116 of the user device 110b outputting audio corresponding to the output 240 of the audio enhancer 200. Here, the audible output 240 includes “The quick brown fox jumps over the lazy dog” as an output sentence 30, 30b of the conversation 12 through the RTC application 20 executing on the user devices 110. In this example, the spoken and output sentences 30a, 30b are the same such that the recipient user 10b hears what was conveyed by the first user 10a. Otherwise, without the audio enhancer 200 replacing the missing audio packets 132m, the recipient user 10b would hear the received audio stream 130b containing the gap 134 corresponding to inaudible, distorted, and/or garbled sounds due to the missing audio packets 132m.

[0027] In some examples, the audio enhancer 200 includes an application hosted by a remote system 140, such as a distributed system of a cloud environment, accessed via a user device 110. In some implementations, the audio enhancer 200 includes an application downloaded to memory hardware 114 of the user device 110 (e.g., the receiving user device 110b). Regardless of an access point to the audio enhancer 200, the audio enhancer 200 may be configured to communicate with the remote system 140 to access resources 142 (e.g., data processing hardware 144 or memory hardware 146). Access to resources 142 of the remote system 140 may allow the audio enhancer 200 to generate sounds to replace missing audio packets 132, 132m. For example, the audio enhancer 200 uses models 300 (FIGS. 2A-2D) or modeling data stored within the remote system 140 to predict and/or generate sounds to replace missing audio packets 132m. Additionally or alternatively, the audio enhancer 200 may store models 300 locally on the user device 110. Optionally, the RTC application 20 used to communicate between users 10 includes the audio enhancer 200.

[0028] FIGS. 2A-2D show examples of the audio enhancer 200. The audio enhancer 200 includes a packet identifier 210, a predictor 220, and a rebuilder 230. As an input 150, the audio enhancer 200 receives the audio stream 130 (e.g., the received audio stream 130, 130b) as a sequence of indexed audio packets 132, 132 1-3, 7-N via packet switching by the packet switched network 120. The packet identifier 210 is configured to determine whether the received audio stream 130 includes any missing audio packets 132m from within the sequence of indexed audio packets 132, 132 1-3, 7-N. In some examples, the packet identifier 210 indexes the received audio stream 130 and identifies whether any number is missing within the index. In this example, the packet identifier 210 identifies a number missing within the index as a missing audio packet 132m. As used herein, a “missing audio packet 132m” is interchangeably referred to as a “missing indexed audio packet 132m”.

[0029] In some implementations, the packet identifier 210 identifies missing audio packets 132m by interpreting the packet header corresponding to each indexed packet 132. For example, a packet header includes the index number for a given packet within a sequence of the original data. With this information, the packet identifier 210 may account for the index number of each received audio packet 132 and deduce each missing audio packet 132m based on missing numbers within the sequence of indexed audio packets 132 of the received audio stream 130, 130b. For example, during real-time communications such as a conversation 12, the total number of data packets forming the original data is typically unknown. As shown in FIG. 1, the packet identifier 210 determines that the audio stream 130 has a gap 134 corresponding to missing audio packets 132m (e.g., original audio packets 132 4-6 included in the communicated audio stream 130a of FIG. 1). When the packet identifier 210 identifies that an audio packet 132 is missing, the packet identifier 210 communicates the missing audio packets 132m to the predictor 220 to convey which audio packet(s) 132 the predictor 220 needs to predict for replacement. In some examples, the packet identifier 210 also communicates a location of the missing audio packet 132m within the audio stream 130 to aid the predictor 220 and/or the rebuilder 230.

[0030] The predictor 220 is configured to predict a replacement for any identified missing audio packets 132m. In other words, the predictor 220 generates an audio sample of a size corresponding to a missing audio packet 132m. The generated audio sample attempts to match the original audio packet 132 identified as missing from the received audio stream 130. By generating a matching or similar audio sample, the audio enhancer 200 may conceal missing audio packet(s) 132m (e.g., packets 132 4-6) within the received audio stream 130. The generated audio sample is referred to as a predicted audio packet 132p or a replacement audio packet.
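
The index-gap logic of the packet identifier 210 can be sketched in a few lines of Python. This is an assumed implementation, not the patent's: since the total packet count is unknown during real-time communication, only gaps between the lowest and highest indices received so far are detectable.

```python
def find_missing_indices(received_indices: set[int]) -> list[int]:
    """Deduce missing packet indices from gaps in the received sequence,
    mirroring the role of the packet identifier 210."""
    if not received_indices:
        return []
    lo, hi = min(received_indices), max(received_indices)
    return [i for i in range(lo, hi + 1) if i not in received_indices]

# For the FIG. 1 example, packets 1-3 and 7-9 arrive, so indices 4, 5,
# and 6 are flagged for the predictor 220 to replace.
assert find_missing_indices({1, 2, 3, 7, 8, 9}) == [4, 5, 6]
```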

[0031] With continued reference to FIGS. 2A-2D, the predictor 220 generates (i.e. predicts) a corresponding predicted audio packet 132p to replace each missing indexed audio packet 132m based on the received indexed audio packets 132 1-3, 7-9 in the sequence of indexed audio packets 132, 132 1-9 of the audio stream 130. For example, the predictor 220 replaces a first missing indexed audio packet 132m1 with a first predicted audio packet 132p1. Here, the first missing indexed audio packet 132m1 corresponds to the audio packet 132 4 within the communicated audio stream 130a. Accordingly, the predictor 220 also generates second and third predicted audio packets 132p2, 132p3 (not shown) to replace corresponding second and third missing indexed audio packets 132m2, 132m3 (not shown) that correspond to audio packets 132 5, 132 6 within the communicated audio stream 130a. In some examples, the predictor 220 receives audio packets 132 within a sequence of indexed audio packets 132 1-3 indexed before a missing audio packet 132, 132m to predict a corresponding predicted audio packet 132, 132p. In other examples, such as when multiple missing packets 132, 132m occur within the received sequence of indexed audio packets 132, 132 1-N, the predictor 220 receives the entire received audio stream 130, or a larger block portion of the received audio stream 130, to provide audio context for the predictor 220 to generate predicted audio packets 132p. For example, as shown in FIG. 2B, the input 150 to the predictor 220 is the received audio stream 130b with the sequence of indexed audio packets 132, 132 1-3, 7-9. In yet other examples, to generate the predicted audio packets 132p, the predictor 220 is configured to receive a set number of audio packets 132 in the received sequence of audio packets 132 that are adjacent to a missing audio packet 132m.

[0032] FIG. 2B depicts an example of the predictor 220 including a modeler 222 and a synthesizer 224. In this example, the modeler 222 determines a prosodic representation 226 corresponding to the received indexed audio packets 132, 132 1-3, 7-N. The prosodic representation 226 generally refers to a vector corresponding to prosodic variables for speech, such as pitch, vocal length, loudness, timbre, intonation, stress, rhythm, etc. In some examples, prosodic variables generate a prosodic representation 226 corresponding to frequency and/or acoustic intensity as a function of duration (i.e. time). In some implementations, the prosodic representation 226 includes a vector corresponding to tonal contours (e.g., intonation) of audio within an audio packet 132 based on prosodic variables. For example, FIG. 2B depicts the portion of the sentence 30 within the received audio stream 130 - “the quick brown... [inaudible]... the lazy dog” - as prosodic representations 226 of intonation for each received word of the sentence 30. In some configurations, the prosodic representation 226 includes an anomaly corresponding to one or more missing audio packets 132m. In other words, the anomaly refers to an abnormal variance within a context of the prosodic representation 226 for the received indexed audio packets 132, 132 1-3, 7-N. For example, FIG. 2B depicts, as an anomaly, the gap 134 within the received audio stream 130 such that the gap 134 corresponds to sections of the prosodic representation 226 that do not include a distinguishable vector and/or tonal contour (e.g., blank sections). In some examples, the modeler 222 may be configured to represent missing audio packets 132m identified by the packet identifier 210 without a curve or vector.
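
One way to picture the prosodic representation 226 is as a per-packet feature vector with blank entries at the gap. The sketch below is a crude stand-in: log-energy and zero-crossing rate substitute for the richer prosodic variables (pitch, intonation, rhythm) named above, and missing packets map to None, playing the role of the anomaly.

```python
import numpy as np

def prosodic_representation(packets: dict[int, np.ndarray], n_packets: int):
    """Per-packet prosodic features; None marks a gap 134 (the anomaly)."""
    features = []
    for index in range(1, n_packets + 1):
        samples = packets.get(index)
        if samples is None:
            features.append(None)   # no distinguishable contour for this slot
            continue
        log_energy = float(np.log(np.mean(samples ** 2) + 1e-9))
        zero_crossings = float(np.mean(np.abs(np.diff(np.sign(samples)))) / 2)
        features.append((log_energy, zero_crossings))
    return features
```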

[0033] Furthermore, the predictor 220 is configured to use a model 300 to generate the predicted audio packets 132p. Referring to FIG. 2B, the modeler 222 inputs the prosodic representations 226 of the received audio packets 132, 132 1-3, 7-9 into the model 300. The prosodic representations 226 input to the model 300 may include a first portion 226a occurring before the anomaly (e.g., before the gap 134) and/or a second portion 226b occurring after the anomaly (e.g., after the gap 134). In some implementations, based on the prosodic representations 226, the model 300 predicts at least one prosodic unit 228 corresponding to the anomaly in the prosodic representation 226. Here, the at least one prosodic unit 228 corresponds to a predicted sound by the model 300 that attempts to contextually match the received audio packets 132, 132 1-3, 7-9. In other words, the modeler 222 may use the model 300 to generate prosodic units 228 that replace the missing audio packets 132m to conceal gap(s) 134 and/or jitter within the received audio stream 130.

[0034] In some examples, the synthesizer 224 functions to generate audio packets 132 based on the predicted at least one prosodic unit 228. In these examples, the synthesizer 224 determines a size of an audio packet 132 within the received audio stream 130b and combines more than one predicted prosodic unit 228 to construct a predicted audio packet 132p. For example, in FIG. 2B, the synthesizer 224 combines three prosodic units 228 to construct the predicted audio packets 132p 1-3. In some configurations, the predicted prosodic units 228 of the modeler 222 correspond to the size of an audio packet 132 for the received audio stream 130b.
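
The sizing step the synthesizer 224 performs can be illustrated with a small helper. This is a sketch under assumed conventions (byte payloads, silence padding); a real synthesizer would generate waveform audio from the prosodic units rather than concatenate bytes.

```python
def assemble_packet(prosodic_units: list[bytes], packet_bytes: int) -> bytes:
    """Combine predicted prosodic units into one replacement packet,
    trimmed or padded to the packet size of the received stream."""
    audio = b"".join(prosodic_units)
    if len(audio) < packet_bytes:
        audio += b"\x00" * (packet_bytes - len(audio))  # pad with silence
    return audio[:packet_bytes]

# Three small predicted units are merged to fill one 960-byte packet slot.
packet = assemble_packet([b"\x01" * 300, b"\x02" * 300, b"\x03" * 300], 960)
assert len(packet) == 960
```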

[0035] In some configurations, instead of determining the prosodic representation 226 for the received audio stream 130b (as shown in FIG. 2B), the modeler 222 determines portions of the prosodic representation 226 to predict the at least one prosodic unit 228. For example, the modeler 222 may determine the first portion 226a of the prosodic representation 226 before the anomaly and/or the second portion 226b of the prosodic representation 226 occurring after the anomaly and predict the at least one prosodic unit 228 based on the first portion 226a and/or the second portion 226b of the prosodic representation 226. Whether the modeler 222 uses one or both of the first and second portions 226a, 226b depends on the design of the audio enhancer 200 and the size of the anomaly (e.g., identified by the packet identifier 210 and/or the modeler 222). Namely, the audio enhancer 200 may require a greater portion of the prosodic representation 226 than what the available first portion 226a provides in order to attain additional context for generating the predicted audio packets 132p.

[0036] Referring back to FIGS. 2A-2D, the rebuilder 230 is configured to substitute the predicted audio packets 132p predicted for the missing indexed audio packets 132m within the sequence of indexed audio packets 132 1-3, 7-N. By substituting the predicted audio packet(s) 132p for the missing indexed audio packet(s) 132m, the rebuilder 230 forms a reconstituted audio stream 232 for audible output 240. The reconstituted audio stream 232 may be communicated as the audible output 240 of the audio enhancer 200. For instance, FIG. 1 shows the audible output device (e.g., speaker) 116 of the receiving user device 110b outputting the audible output 240 corresponding to the reconstituted audio stream 232 as a reconstructed sentence 30, 30b - “The quick brown fox jumps over the lazy dog.” In some examples, the rebuilder 230 uses packet header information to determine boundaries that identify where to insert the predicted audio packets 132p. In these or other examples, the rebuilder 230 splices out sections of the received audio stream 130b corresponding to the missing audio packets 132m. In other examples, the rebuilder 230 overlays the predicted audio packets 132p at a location corresponding to the missing audio packets 132m within the received audio stream 130.

[0037] The audio enhancer 200 may generate different outputs 240 depending on a configuration of the audio enhancer 200. For example, FIG. 2C shows the audio enhancer 200 communicating the entire reconstituted audio stream 232 as an output 240, 240a. In FIG. 2D, the audio enhancer 200 communicates only the predicted audio packets 132, 132p as the output 240, 240b. For instance, in FIG. 2D, the output 240b omits the packets 132 1-3, 7-N present in the received audio stream 130 and only provides the predicted audio packets 132p predicted to replace the missing indexed audio packets 132 4-6. For example, the rebuilder 230 may be a separate component from the audio enhancer 200, whereby the rebuilder 230 executes at the user device 110 and the audio enhancer 200 executes at the remote system 140 (or vice versa). In some configurations, the audio enhancer 200 provides the predicted audio packets 132, 132p by bypassing the rebuilder 230. In other configurations, the rebuilder 230 functions as a checkpoint to ensure the predicted audio packets 132p can conceal the missing audio packets 132m, while the audio enhancer 200 communicates only the predicted audio packets 132p instead of the entire reconstituted audio stream 232.
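
A minimal sketch of the rebuilder 230's substitution step, together with the two output modes of FIGS. 2C and 2D, might look as follows. The data shapes (index-keyed byte payloads) are assumptions for illustration.

```python
def rebuild(received: dict[int, bytes], predicted: dict[int, bytes],
            n_packets: int, full_stream: bool = True):
    """Substitute predicted packets for missing ones (rebuilder 230).

    full_stream=True returns the reconstituted stream 232 (output 240a,
    FIG. 2C); full_stream=False returns only the predicted packets
    (output 240b, FIG. 2D).
    """
    if not full_stream:
        return predicted                       # FIG. 2D: predictions only
    return [received.get(i, predicted.get(i))  # FIG. 2C: full stream
            for i in range(1, n_packets + 1)]
```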

[0038] FIG. 3 is an example of the model 300. The model 300 corresponds to algorithms configured to predict the predicted audio packets 132, 132p for loss concealment. Predictive modeling may overcome setbacks of other concealment techniques, such as looping segments of the received audio stream 130, 130b, because predictive modeling fills the missing audio packets 132, 132m. By filling the missing audio packet(s) 132m with sounds, the audio enhancer 200 may conceal longer durations of missing audio packets 132m than looping techniques without significantly impacting the conversation 12 for the user 10 of the RTC application 20 executing on the user device 110. For example, with a predictive model 300, the audio enhancer 200 may conceal up to 150ms of lost speech in a received audio stream 130.

[0039] In some examples, the model 300 includes a machine learning model where the model 300 initializes by first undergoing model training 310 and, once trained, proceeds to modeling 320. During model training 310, the model 300 receives data sets and result sets to predict its own output based on input data similar to the data sets. Here, the training data sets and result sets are training audio samples 330, 330a-n of human speech such that the model 300 learns prosodic variables and/or linguistic features. For example, the audio samples 330a-n correlate to the structure of a received audio stream 130 such that linguistic structure, prosodic representations 226, tonal contours, and waveforms of the received audio stream 130 are identifiable and predictable by the model 300. In some examples, for training purposes, data (e.g., training audio samples 330, 330a-n) is segregated into training and evaluation sets (e.g., 90% training and 10% evaluation). With these sets, the model 300 trains with the audio samples 330a-n until a performance of the model 300 on the evaluation set stops improving. Once the performance stops improving on the evaluation set, the model 300 is ready for modeling 320 (i.e., inference) where the model 300 predicts predicted audio packets 132p based on the received audio stream 130.
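
The training regimen described above (90/10 split, train until evaluation performance stops improving) is a standard early-stopping loop. The sketch below assumes placeholder `train_step` and `evaluate` callables standing in for whatever update and scoring routines the model 300 actually uses, with a lower score taken to be better (e.g., an evaluation loss).

```python
import random

def train_with_early_stopping(samples, train_step, evaluate,
                              patience=3, max_epochs=100):
    """Train until evaluation-set performance stops improving."""
    random.shuffle(samples)
    cut = int(0.9 * len(samples))                        # 90% training
    train_set, eval_set = samples[:cut], samples[cut:]   # 10% evaluation

    best_score, stale_epochs = float("inf"), 0
    for _ in range(max_epochs):
        for sample in train_set:
            train_step(sample)
        score = evaluate(eval_set)
        if score < best_score:
            best_score, stale_epochs = score, 0
        else:
            stale_epochs += 1
            if stale_epochs >= patience:   # performance stopped improving
                break
    return best_score
```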

[0040] Additionally or alternatively, the model 300 includes a neural network. The model 300 may be a convolutional neural network (CNN) (e.g., a modified WaveNet), a deep neural network (DNN), or a recurrent neural network (RNN). In some examples, the model 300 is a combination of a convolutional neural network and a deep neural network such that the convolutional neural network filters, pools, then flattens information to send to a deep neural network. In other examples, the model 300 is a hybrid combination of any of the CNN, DNN, or RNN. Much like when the model 300 is a machine learning model, a neural network is trained to generate meaningful outputs that may be used as accurate predicted audio packets 132p. In some examples, a mean squared error loss function trains the model 300 when the model 300 includes a neural network.
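
If the model 300 is a neural network trained with mean squared error, a single training step might look like the following PyTorch sketch. The architecture, feature sizes, and optimizer are assumptions for illustration; the patent does not specify them.

```python
import torch
import torch.nn as nn

# Assumed toy architecture: map a window of context features to one
# predicted prosodic-unit vector, trained with MSE per paragraph [0040].
model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 16))
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

context = torch.randn(32, 64)   # placeholder batch of context features
target = torch.randn(32, 16)    # placeholder target prosodic units

prediction = model(context)
loss = loss_fn(prediction, target)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```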

[0041] FIG. 4 is a flowchart of an example method 400 of concealing audio packet loss. At block 402, the method 400 receives an audio stream 130 across a packet switched network 120. Here, the audio stream 130 includes a sequence of indexed audio packets 132. At block 404, the method 400 determines that the received audio stream 130 is missing an indexed audio packet 132m from the sequence of indexed audio packets 132. At block 406, the method 400 predicts a predicted audio packet 132p to replace the missing indexed audio packet 132m based on the received indexed audio packets 132 in the sequence of indexed audio packets 132, 132 1-N of the audio stream 130. At block 408, the method 400 substitutes the predicted audio packet 132p for the missing indexed audio packet 132, 132m within the sequence of indexed audio packets 132. This substitution results in a reconstituted audio stream 232. At block 410, the method 400 communicates the reconstituted audio stream 232 for audible output 240. For instance, a user device 110 that receives the audio stream 130 may output the reconstituted audio stream 232 from an audible output device (e.g., speaker) in communication with the user device 110. While the example method 400 describes only a single indexed audio packet 132m missing from the sequence of audio packets 132, the example method 400 can similarly determine multiple missing indexed audio packets 132m from the received audio stream 130 and predict corresponding predicted audio packets 132p to substitute for the missing indexed audio packets 132m to result in the reconstituted audio stream 232.
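
Tying the steps together, a compact end-to-end sketch of method 400 might read as follows, using the sequential block numbering above. The data shapes and the `predict` callable (a stand-in for the predictor 220 / model 300) are assumptions.

```python
def conceal_packet_loss(received: dict[int, bytes], n_packets: int, predict):
    """End-to-end sketch of method 400 (names and shapes assumed).

    received: index -> payload for the packets that arrived (block 402).
    predict:  stand-in for the predictor 220 / model 300.
    """
    # Block 404: determine which indexed audio packets are missing.
    missing = [i for i in range(1, n_packets + 1) if i not in received]
    # Block 406: predict a replacement packet for each missing index.
    predicted = {i: predict(received, i) for i in missing}
    # Block 408: substitute predictions, yielding the reconstituted stream 232.
    stream = [received[i] if i in received else predicted[i]
              for i in range(1, n_packets + 1)]
    # Block 410: hand the reconstituted stream off for audible play.
    return stream

# Example: packets 4-6 lost; a trivial predictor fills them with silence.
received = {i: b"audio" for i in (1, 2, 3, 7, 8, 9)}
stream = conceal_packet_loss(received, 9, lambda ctx, i: b"\x00" * 5)
```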

[0042] A software application (i.e., a software resource) may refer to computer software that causes a computing device to perform a task. In some examples, a software application may be referred to as an “application,” an “app,” or a “program.” Example applications include, but are not limited to, system diagnostic applications, system management applications, system maintenance applications, word processing applications, spreadsheet applications, messaging applications, media streaming applications, social networking applications, and gaming applications.

[0043] FIG. 5 is a schematic view of an example computing device 500 that may be used to implement the systems and methods of, for example, the RTC application 20, the user device 110, the audio enhancer 200, and/or the remote system 140 as described in the present disclosure. The computing device 500 is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The components shown here, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the inventions described and/or claimed in this document.

[0044] The computing device 500 includes a processor 510, memory 520, a storage device 530, a high-speed interface/controller 540 connecting to the memory 520 and high-speed expansion ports 550, and a low-speed interface/controller 560 connecting to a low-speed bus 570 and a storage device 530. Each of the components 510, 520, 530, 540, 550, and 560 are interconnected using various busses, and may be mounted on a common motherboard or in other manners as appropriate. The processor 510 can process instructions for execution within the computing device 500, including instructions stored in the memory 520 or on the storage device 530 to display graphical information for a graphical user interface (GUI) on an external input/output device, such as display 580 coupled to the high-speed interface 540. In other implementations, multiple processors and/or multiple buses may be used, as appropriate, along with multiple memories and types of memory. Also, multiple computing devices 500 may be connected, with each device providing portions of the necessary operations (e.g., as a server bank, a group of blade servers, or a multi-processor system).

[0045] The memory 520 stores information non-transitorily within the computing device 500. The memory 520 may be a computer-readable medium, a volatile memory unit(s), or non-volatile memory unit(s). The non-transitory memory 520 may be physical devices used to store programs (e.g., sequences of instructions) or data (e.g., program state information) on a temporary or permanent basis for use by the computing device 500. Examples of non-volatile memory include, but are not limited to, flash memory and read-only memory (ROM) / programmable read-only memory (PROM) / erasable programmable read-only memory (EPROM) / electronically erasable programmable read-only memory (EEPROM) (e.g., typically used for firmware, such as boot programs). Examples of volatile memory include, but are not limited to, random access memory (RAM), dynamic random access memory (DRAM), static random access memory (SRAM), phase change memory (PCM) as well as disks or tapes.

[0046] The storage device 530 is capable of providing mass storage for the computing device 500. In some implementations, the storage device 530 is a computer-readable medium. In various different implementations, the storage device 530 may be a floppy disk device, a hard disk device, an optical disk device, or a tape device, a flash memory or other similar solid state memory device, or an array of devices, including devices in a storage area network or other configurations. In additional implementations, a computer program product is tangibly embodied in an information carrier. The computer program product contains instructions that, when executed, perform one or more methods, such as those described above. The information carrier is a computer- or machine-readable medium, such as the memory 520, the storage device 530, or memory on processor 510.

[0047] The high-speed controller 540 manages bandwidth-intensive operations for the computing device 500, while the low-speed controller 560 manages lower bandwidth-intensive operations. Such allocation of duties is exemplary only. In some implementations, the high-speed controller 540 is coupled to the memory 520, the display 580 (e.g., through a graphics processor or accelerator), and to the high-speed expansion ports 550, which may accept various expansion cards (not shown). In some implementations, the low-speed controller 560 is coupled to the storage device 530 and a low-speed expansion port 590. The low-speed expansion port 590, which may include various communication ports (e.g., USB, Bluetooth, Ethernet, wireless Ethernet), may be coupled to one or more input/output devices, such as a keyboard, a pointing device, a scanner, or a networking device such as a switch or router, e.g., through a network adapter.

[0048] The computing device 500 may be implemented in a number of different forms, as shown in the figure. For example, it may be implemented as a standard server 500a or multiple times in a group of such servers 500a, as a laptop computer 500b, or as part of a rack server system 500c.

[0049] Various implementations of the systems and techniques described herein can be realized in digital electronic and/or optical circuitry, integrated circuitry, specially designed ASICs (application specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various implementations can include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, coupled to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device.

[0050] These computer programs (also known as programs, software, software applications or code) include machine instructions for a programmable processor, and can be implemented in a high-level procedural and/or object-oriented programming language, and/or in assembly/machine language. As used herein, the terms “machine-readable medium” and “computer-readable medium” refer to any computer program product, non-transitory computer readable medium, apparatus and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term “machine-readable signal” refers to any signal used to provide machine instructions and/or data to a programmable processor.

[0051] The processes and logic flows described in this specification can be performed by one or more programmable processors executing one or more computer programs to perform functions by operating on input data and generating output. The processes and logic flows can also be performed by special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit). Processors suitable for the execution of a computer program include, by way of example, both general and special purpose microprocessors, and any one or more processors of any kind of digital computer. Generally, a processor will receive instructions and data from a read only memory or a random access memory or both. The essential elements of a computer are a processor for performing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto optical disks, or optical disks. However, a computer need not have such devices. Computer readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto optical disks; and CD ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.

[0052] To provide for interaction with a user, one or more aspects of the disclosure can be implemented on a computer having a display device, e.g., a CRT (cathode ray tube), LCD (liquid crystal display) monitor, or touch screen for displaying information to the user and optionally a keyboard and a pointing device, e.g., a mouse or a trackball, by which the user can provide input to the computer. Other kinds of devices can be used to provide interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input. In addition, a computer can interact with a user by sending documents to and receiving documents from a device that is used by the user; for example, by sending web pages to a web browser on a user's client device in response to requests received from the web browser.

[0053] Further to the descriptions above, a user may be provided with controls allowing the user to make an election as to both if and when systems, programs or features described herein may enable collection of user information (e.g., information about a user’s social network, social actions or activities, profession, a user’s preferences, or a user’s current location), and if the user is sent content or communications from a server. In addition, certain data may be treated in one or more ways before it is stored or used, so that personally identifiable information is removed. For example, a user’s identity may be treated so that no personally identifiable information can be determined for the user, or a user’s geographic location may be generalized where location information is obtained (such as to a city, ZIP code, or state level), so that a particular location of a user cannot be determined. Thus, the user may have control over what information is collected about the user, how that information is used, and what information is provided to the user.

[0054] A number of implementations have been described. Nevertheless, it will be understood that various modifications may be made without departing from the spirit and scope of the disclosure. Accordingly, other implementations are within the scope of the following claims.