

Title:
CALCULATION OF LEFT AND RIGHT BINAURAL SIGNALS FOR OUTPUT
Document Type and Number:
WIPO Patent Application WO/2022/093162
Kind Code:
A1
Abstract:
For each of a number of screen locations respectively corresponding to discrete audio signals, left and right transfer functions are generated using a machine learning model. The left and right transfer functions can take into account a head orientation of a user relative to a display screen having the screen locations. To each discrete audio signal, the left and right transfer functions for the screen location corresponding to the discrete audio signal are applied to respectively generate left and right audio signals corresponding to the discrete audio signal. Left and right binaural signals are calculated by respectively adding together the left and right audio signals corresponding to the discrete audio signals. The left and right binaural signals can be output on left and right speakers, respectively, which may be left and right headphone speakers of headphones that the user is wearing.

Inventors:
KUTHURU SRIKANTH (US)
BALLAGAS RAFAEL (US)
Application Number:
PCT/US2020/057274
Publication Date:
May 05, 2022
Filing Date:
October 26, 2020
Assignee:
HEWLETT PACKARD DEVELOPMENT CO (US)
International Classes:
H04R5/04; G06F3/16; H04S3/00
Domestic Patent References:
WO2018057176A12018-03-29
Foreign References:
US20190387351A12019-12-19
US8854447B22014-10-07
US10503461B22019-12-10
Attorney, Agent or Firm:
GARDINER, Austin et al. (US)
Claims:
We claim:

1. A method comprising:
generating, for each of a plurality of screen locations respectively corresponding to a plurality of discrete audio signals, left and right transfer functions using a machine learning model;
applying, to each discrete audio signal, the left and right transfer functions for the screen location corresponding to the discrete audio signal to respectively generate left and right audio signals corresponding to the discrete audio signal;
calculating left and right binaural signals by respectively adding together the left and right audio signals corresponding to the discrete audio signals; and
outputting the left and right binaural signals on left and right speakers, respectively.

2. The method of claim 1, wherein the left and right transfer functions for each screen location respectively characterize how left and right ears receive sound from a point in space corresponding to the screen location.

3. The method of claim 2, wherein the left and right transfer functions are head-related transfer functions (HRTFs).

4. The method of claim 1, wherein the left and right transfer functions generated for each screen location take into account a head orientation of a user relative to a display screen having the screen locations.

5. The method of claim 4, wherein generating, for each screen location, the left and right transfer functions using the machine learning model comprises: for each screen location, providing as input to the machine learning model the screen location and the head orientation of the user, and receiving as output from the machine learning model the left and right transfer functions for the screen location.

6. The method of claim 1, wherein generating, for each screen location, the left and right transfer functions using the machine learning model comprises: for each screen location, providing as input to the machine learning model the screen location, and receiving as output from the machine learning model the left and right transfer functions for the screen location.

7. The method of claim 4, wherein the left and right speakers on which the left and right binaural signals are output are left and right headphone speakers of headphones that the user is wearing.

8. The method of claim 1, wherein the left and right speakers on which the left and right binaural signals are output are left and right speakers of or to either side of a display screen having the screen locations.

9. The method of claim 1, further comprising: accentuating the discrete audio signal to which the screen location to which eye gaze of a user is directed corresponds, relative to the discrete audio signals to which other of the screen locations correspond.

10. A non-transitory computer-readable data storage medium storing program code executable by a processor to:
generate, for each of a plurality of screen locations respectively corresponding to a plurality of discrete audio signals, left and right transfer functions that take into account a head orientation of a user relative to a display screen having the screen locations;
apply, to each discrete audio signal, the left and right transfer functions for the screen location corresponding to the discrete audio signal to respectively generate left and right audio signals corresponding to the discrete audio signal;
calculate left and right binaural signals by respectively adding together the left and right audio signals corresponding to the discrete audio signals; and
output the left and right binaural signals on left and right headphone speakers, respectively, of headphones that the user is wearing.

11. The non-transitory computer-readable data storage medium of claim 10, wherein the left and right transfer functions that take into account the head orientation of the user are generated for each screen location using a machine learning model.

12. The non-transitory computer-readable data storage medium of claim 11, wherein the program code is executable by the processor to generate, for each screen location, the left and right transfer functions that take into account the head orientation of the user by: for each screen location, providing as input to the machine learning model the screen location and the head orientation of the user, and receiving as output from the machine learning model the left and right transfer functions for the screen location.

13. The non-transitory computer-readable data storage medium of claim 10, wherein the program code is executable by the processor to further: accentuate the discrete audio signal to which the screen location to which the eye gaze of the user is directed corresponds, relative to the discrete audio signals to which other of the screen locations correspond.

14. A system comprising:
a display screen having a plurality of screen locations respectively corresponding to a plurality of discrete audio signals;
an interface to communicatively connect to headphones wearable by a user and that have left and right headphone speakers;
a processor; and
a memory storing program code executable by the processor to:
generate, for each screen location, left and right transfer functions that take into account a head orientation of the user relative to the display screen, using a machine learning model;
apply, to each discrete audio signal, the left and right transfer functions for the screen location corresponding to the discrete audio signal to respectively generate left and right audio signals corresponding to the discrete audio signal;
calculate left and right binaural signals by respectively adding together the left and right audio signals corresponding to the discrete audio signals; and
output the left and right binaural signals on the left and right headphone speakers, respectively.

15. The system of claim 14, wherein the program code is executable by the processor to further: accentuate the discrete audio signal to which the screen location to which the eye gaze of the user is directed corresponds, relative to the discrete audio signals to which other of the screen locations correspond.

Description:
CALCULATION OF LEFT AND RIGHT BINAURAL SIGNALS FOR OUTPUT

BACKGROUND

[0001] Videoconferencing permits real-time audio-video communication of people at different locations. In early renditions of videoconferencing, high equipment cost and limited availability of high-speed networks restricted most videoconferencing to dedicated corporate, governmental, and other facilities. A conference room may be fitted with a video camera and a microphone to permit one or multiple individuals to conduct videoconferences with other participants at one or multiple other locations. With decreasing equipment cost and widespread availability of high-speed networks, videoconferencing is more available to individuals almost anywhere they have access to high-speed Internet connectivity and laptops or other computing devices.

BRIEF DESCRIPTION OF THE DRAWINGS

[0002] FIGs. 1A and 1B are front view diagrams of example display screens of videoconferences.

[0003] FIGs. 2A and 2B are top and right side diagrams of example spatially coherent discrete audio signals that match display screen locations that correspond to the audio signals.

[0004] FIGs. 3A and 3B are top and right side diagrams of example spatially coherent discrete audio signals that match display screen locations that correspond to the audio signals and that take into account head orientation of a user relative to the display screen.

[0005] FIG. 4 is a diagram of an example process to coherently spatialize discrete audio signals to match display screen locations that correspond to the audio signals while taking into account head orientation of a user relative to the display screen.

[0006] FIGs. 5A and 5B are diagrams of example processes to accentuate the discrete audio signal corresponding to the display screen location to which eye gaze of a user is directed, relative to the discrete audio signals corresponding to other screen locations.

[0007] FIG. 6 is a flowchart of an example method.

[0008] FIG. 7 is a diagram of an example computer-readable data storage medium.

[0009] FIG. 8 is a diagram of an example system.

DETAILED DESCRIPTION

[0010] As noted in the background, videoconferencing is now available to individual users via computers, such as desktop, laptop, and notebook computers, as well as other computing devices, such as smartphones, tablet computing devices, and other types of mobile computing devices. A user can participate in a videoconference with multiple other individuals at different locations. For example, a user can in the comfort of his or her home, or at the office, a hotel, and so on, have a videoconference with colleagues that are each located at a different location.

[0011] The videoconferencing software running on a user’s computing device thus receives audio and video signals from the videoconferencing software running on each remote user’s computing device. If a user is having a videoconference with n remote participants, the software running on the user’s computing device receives n corresponding audio and video signals, and likewise sends local audio and video signals of the user to the computing devices of the remote participants. The videoconferencing software may display the remote video signals in a grid or other configuration on the display screen, so that each remote participant appears at a different screen location.

[0012] The remote audio signals, by comparison, are usually combined into a single audio signal and output on the speakers of the computing device. In the case of a laptop or notebook computer, the speakers may be the internal speakers of the computer. In the case of a desktop computer, the speakers may be the internal speakers of the display screen, or may be external speakers. The remote audio signals may be monophonic, or mono, in which case the same combined signal may be output on each speaker. The remote audio signals may instead be stereophonic, or stereo, in which case each remote audio signal includes left and right signals. The left signals may be combined and output on the left speaker, and the right signals may similarly be combined and output on the right speaker.

[0013] Techniques described herein coherently spatialize discrete audio signals to match corresponding screen locations on a display screen of a computing device. In the context of a videoconference, the discrete audio signals are coherently spatialized to match the screen locations at which their corresponding video signals are being displayed on the display screen. Therefore, to the user of the computing device, the audio signal from each other participant in the videoconference appears to come from the screen location at which the video signal including the other participant is displayed. The techniques can further take into account the user’s head orientation relative to the display screen to impart additional realism to the auditory portion of the videoconference when the user is wearing headphones instead of listening to the audio via internal or external speakers.

[0014] FIG. 1A shows a front view diagram of an example display screen 100 of a computing device on which a user of the device is participating in a videoconference with six remote users 102A, 102B, 102C, 102D, 102E, and 102F, who are collectively referred to as the remote users 102. Each remote user 102 is located at a corresponding remote location, and is participating in the videoconference via his or her own computing device. The remote users 102A, 102B, 102C, 102D, 102E, and 102F are thus displayed on the display screen 100 in corresponding windows 104A, 104B, 104C, 104D, 104E, and 104F, which are collectively referred to as the windows 104. Each window 104, and thus each remote user 102, is displayed at a different screen location on the screen 100.

[0015] FIG. 1B shows a front view diagram of another example display screen 100 of a computing device on which a user of the device is again participating in a videoconference with six remote users 152A, 152B, 152C, 152D, 152E, and 152F, who are collectively referred to as the remote users 152. However, the remote users 152 are located at the same remote location, around a conference table, and are participating in the videoconference via one computing device, which may be dedicated videoconferencing equipment, for instance. The remote users 152 are thus displayed on the display screen 100 in the same window 154, but nevertheless at different screen locations on the display screen 100.

[0016] The remainder of the detailed description is primarily described in relation to the example of FIG. 1A, in which remote users at different locations are displayed in corresponding windows at different screen locations of the display screen of a local user’s computing device. However, the description equally pertains to the example of FIG. 1B, in which remote users at the same location are displayed in the same window at different screen locations of the display screen. The description also pertains to a combined example, in which one window or some windows include individual remote users per FIG. 1A, and another window or other windows each include multiple remote users per FIG. 1B.

[0017] FIGs. 2A and 2B are top and right side diagrams of the display screen 100 of FIG. 1A, and show example spatially coherent discrete audio signals that match screen locations corresponding to the audio signals. A local user 210 of the computing device including the display screen 100 is positioned in front of the screen 100, centered from left to right per FIG. 2A and from top to bottom per FIG. 2B. In FIG. 2A, discrete audio signals 202A, 202B, and 202C correspond to the screen locations of the windows 104A, 104B, and 104C on the display screen 100. In FIG. 2B, discrete audio signals 202C and 202F correspond to the screen locations of the windows 104C and 104F on the display screen 100. The discrete audio signals 202A, 202B, 202C, and 202F are collectively referred to as the discrete audio signals 202.

[0018] The discrete audio signals 202 correspond to the screen locations of the windows 104 in that the audio signals 202 correspond to the video signals of the remote users 102 of FIG. 1A displayed in the windows 104. In FIGs. 2A and 2B, the discrete audio signals 202 are spatially coherent. That is, the discrete audio signals 202 are output so that they appear to come from the screen locations of their corresponding windows 104. Specifically, the audio signal 202A appears to come from the window 104A, the audio signal 202B appears to come from the window 104B, the audio signal 202C appears to come from the window 104C, and the audio signal 202F appears to come from the window 104F.

[0019] The discrete audio signals 202 are output by speakers of the computing device including the display screen 100. The computing device includes the display screen 100 in that the screen 100 is integrated with the computing device, as may be the case with a laptop, notebook, or all-in-one (AIO) computer, or is communicatively connected to the computing device, as may be the case with a desktop computer. The display screen 100, in other words, may be internal or external to the computing device. The speakers may be left and right internal speakers of the display screen 100, left and right external speakers placed to either side of the display screen 100, or left and right internal speakers of the computing device itself in the case of a laptop, notebook, or AIO computer.

[0020] The spatially coherent nature of the discrete audio signals 202 when output by the speakers thus imparts auditory realism to the local user 210 participating in the videoconference with the remote users 102 of FIG. 1A. The spatially coherent audio signals 202 match the arrangement of the screen locations of their corresponding windows 104 on the display screen 100. The local user 210 hears the audio signals 202 coming from the remote users 102 of FIG. 1A as if they were physically present in front of the user 210 in a locational arrangement corresponding to the screen locations of their corresponding windows 104.

[0021] In the example of FIGs. 2A and 2B, the spatially coherent audio signals 202 inherently take into account the head orientation of the local user 210 when output on internal or external speakers of the display screen 100 or the computing device itself. That is, as the local user 210 rotates his or her head in three-dimensional (3D) space, the user 210 will perceive the audio signals 202 differently in accordance with the position of each of the user 210’s ears relative to the screen locations of the windows 104. However, this effect will be lost if the local user 210 is wearing headphones and the spatially coherent audio signals 202 are instead output on left and right headphone speakers.

[0022] Rather, if the local user 210 is wearing headphones, as the user 210 rotates his or her head in 3D space, the user 210 will instead perceive the spatially coherent audio signals 202 as if the display screen 100 moved in concert with the rotation of the user 210’s head. This is because the headphone speakers themselves move with the head of the local user 210. While the discrete audio signals 202 remain spatially coherent in that they match the arrangement of the screen locations of their corresponding windows 104 on the display screen 100, the audio signals 202 no longer take into account the head orientation of the local user 210 when output on headphone speakers.

[0023] FIGs. 3A and 3B are top and right side diagrams of the display screen 100 of FIG. 1A, and show example spatially coherent discrete audio signals that match screen locations corresponding to the audio signals while taking into account user head orientation relative to the screen 100. The local user 210 of the computing device including the display screen 100 is again positioned in front of the screen 100, centered from left to right per FIG. 3A and from top to bottom per FIG. 3B. As in FIG. 2A, in FIG. 3A discrete audio signals 202A, 202B, and 202C correspond to the screen locations of the windows 104A, 104B, and 104C on the display screen 100. As in FIG. 2B, in FIG. 3B discrete audio signals 202C and 202F correspond to the screen locations of the windows 104C and 104F on the display screen 100.

[0024] The discrete audio signals 202 correspond to the screen locations of the windows 104 in that the audio signals 202 correspond to the video signals of the remote users 102 of FIG. 1A displayed in the windows 104, as in FIGs. 2A and 2B. However, in FIGs. 2A and 2B the audio signals 202 are spatially coherent but do not explicitly take into account the orientation of the head of the local user 210 relative to the display screen 100. By comparison, in FIGs. 3A and 3B the audio signals 202 are spatially coherent and explicitly take into account the orientation of the user 210’s head relative to the display screen 100.

[0025] Specifically, in the example of FIGs. 3A and 3B, the local user 210 has turned his head to the left, towards the screen location of the window 104A displayed on the display screen 100. The spatially coherent nature of the discrete audio signals 202 takes into account this rotation of the user 210’s head relative to the display screen 100. That is, the discrete audio signals 202 are output so as to appear to come from the screen locations of their corresponding windows 104 while also taking into account the orientation of the local user 210’s head relative to the display screen 100.

[0026] Additional auditory realism is thus imparted to the local user 210 participating in the videoconference with the remote users 102 of FIG. 1A when wearing headphones. The spatially coherent audio signals 202 again match the screen arrangement of their corresponding windows 104 on the display screen 100, as in FIGs. 2A and 2B. The local user 210 hears the audio signals 202 coming from the remote users 102 of FIG. 1A as if they were physically present in front of the user 210 in a locational arrangement corresponding to the screen locations of their corresponding windows 104, also as in FIGs. 2A and 2B. Furthermore, in FIGs. 3A and 3B, the local user 210 hears the audio signals 202 as if the user 210 were turned towards the remote user 102A of FIG. 1A in the window 104A - even though the user 210 is wearing headphones.

[0027] FIG. 4 shows an example process 400 to coherently spatialize discrete audio signals to match display screen locations that correspond to the audio signals, and which can take into account head orientation of a user relative to the display screen. The process 400 may be implemented as program code stored on a non-transitory computer-readable data storage medium. The program code can be executed by a processor, such as a processor of a computing device.

[0028] Each of a number of screen locations 402A, 402B, . . ., 402N of a display screen, which are collectively referred to as the screen locations 402, can be input to a machine learning model 404. For example, a local user may be participating in a videoconference with remote users that are displayed on the local user’s display screen at the screen locations 402. In the case in which the discrete video signal of each remote user is displayed in a corresponding window, as in FIG. 1A, the screen location 402 of a remote user is the location of the window on the display screen. In the case in which each remote user is displayed within the same window, as in FIG. 1B, the screen locations 402 of the remote users may be determined by performing facial recognition on the video signal of the remote users to identify the number of remote users within the singular video signal and the location of each.

[0029] The head orientation 406 of a user relative to the display screen may also be input to the machine learning model 404 in the case in which the user’s head orientation 406 is to be taken into account, such as when the user is wearing headphones. For example, a local user who is participating in a videoconference with remote users while wearing headphones may have his or her head orientation relative to the display screen input to the machine learning model 404. The head orientation of the local user may be determined by performing head tracking on the local video signal captured by a local camera positioned at the display screen, and which is transmitted to the computing devices of the remote users of the videoconference. The local user’s head orientation may instead be determined using accelerometer, gyroscope, and/or other types of sensors.

[0030] The head orientation 406 can include the angle of rotation relative to the display screen in 3D space. For instance, the head orientation 406 can include the angle of vertical rotation upwards or downwards relative to the display screen, as well as the angle of horizontal rotation to the left or right relative to the screen. The head orientation 406 may further include the position of the user’s head relative to the display screen. The position may include just horizontal and vertical position within the plane of the display screen, and not include the distance to the screen (i.e., how far away the user’s head is relative to the plane of the display screen). In another implementation, however, the position may include the distance to the display screen.
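To make the foregoing concrete, the following is a minimal sketch, in Python, of one way the head orientation 406 might be represented as data. The class name, field names, and units are illustrative assumptions rather than anything specified above.

```python
# A minimal sketch of one possible representation of the head orientation 406.
# All names and the choice of degrees are illustrative assumptions.
from dataclasses import dataclass
from typing import Optional

@dataclass
class HeadOrientation:
    yaw_deg: float                     # horizontal rotation to the left or right of the screen
    pitch_deg: float                   # vertical rotation upwards or downwards
    x: float = 0.0                     # horizontal head position within the plane of the screen
    y: float = 0.0                     # vertical head position within the plane of the screen
    distance: Optional[float] = None   # distance to the screen, if it is tracked at all
```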

[0031] Left head-related transfer functions (HRTFs) 408A, 408B, . . ., 408N, collectively referred to as the left HRTFs 408 and respectively corresponding to the screen locations 402A, 402B, . . ., 402N, are output from the machine learning model 404, as are right HRTFs 410A, 410B, . . ., 410N that are collectively referred to as the right HRTFs 410. There is thus a left HRTF 408 and a right HRTF 410 for each screen location 402. The left HRTF 408 of a screen location 402 is a function characterizing how the left ear of a user receives sound from a point in space corresponding to the screen location 402, and the right HRTF 410 of a screen location 402 is similarly a function characterizing how the right ear of the user receives sound from this point in space. The HRTFs 408 and 410 may also be referred to as anatomical transfer functions (ATFs), and are more generally transfer functions.

[0032] The process 400 thus includes generating for each screen location 402 left and right HRTFs 408 and 410 that can take into account the head orientation 406 of the user relative to the display screen by using the machine learning model 404. The screen locations 402 may be individually provided as input to the machine learning model 404, along with the head orientation 406, to receive corresponding left and right HRTFs 408 and 410 from the machine learning model 404. If the head orientation 406 is not input to the machine learning model 404 with a screen location 402, the corresponding left and right HRTFs 408 and 410 output from the model 404 do not take into account the head orientation 406 of the user relative to the display screen, however.
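As a rough illustration of the generation step just described, the sketch below feeds each screen location 402, optionally together with the head orientation 406, to a trained model and collects the left and right HRTFs 408 and 410 it returns. The model object and its predict() method are hypothetical stand-ins, not an actual library interface.

```python
# A minimal sketch of generating left and right HRTFs 408 and 410 per screen
# location 402 using a trained model. model.predict(...) is a hypothetical
# interface assumed for illustration.
def generate_hrtfs(model, screen_locations, head_orientation=None):
    """Return a dict mapping each screen location to its (left, right) HRTFs."""
    hrtfs = {}
    for location in screen_locations:
        if head_orientation is not None:
            # The head orientation 406 is provided alongside the screen location,
            # so the returned HRTFs take the orientation into account.
            left_hrtf, right_hrtf = model.predict(location, head_orientation)
        else:
            left_hrtf, right_hrtf = model.predict(location)
        hrtfs[location] = (left_hrtf, right_hrtf)
    return hrtfs
```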

[0033] The machine learning model 404 may be a supervised machine learning model, and may be a neural network or another type of deep learning model. The machine learning model 404 may be trained with training data that specify left and right HRTFs 408 and 410 for a number of screen locations 402. If the head orientation 406 is to be taken into account by the machine learning model 404, this information (left and right HRTFs 408 and 410 for a number of screen locations 402) is provided for a number of different head orientations 406 as well.

[0034] For example, a display screen may be divided into a uniform grid of 16 regions. Left and right HRTFs 408 and 410 may be constructed for a screen location 402 corresponding to each region, for a total of 32 HRTFs 408 and 410. If head orientation 406 is also to be taken into account, then these 32 HRTFs 408 and 410 are provided for each of a number of different head orientations as well. For example, 16 head orientations in 3D space about a point of origin centered a given distance in front of the display screen may be provided, for a total of 32x16=512 HRTFs 408 and 410.

[0035] The machine learning model 404 may instead be trained with training data that does not actually specify left and right HRTFs 408 and 410 for a number of screen locations 402. Rather, the training data may include a discrete audio signal, a number of screen locations 402, and corresponding spatialized left and right audio signals at each screen location 402. For instance, for a given screen location 402, the discrete audio signal may be output at a speaker positioned at the screen location 402, and the corresponding spatialized left and right audio signals recorded using left and right microphones at positions corresponding to the left and right ears of a user centered a distance in front of the screen.

[0036] In this case, the machine learning model 404 generates, for each screen location 402, the left and right HRTFs 408 and 410 that transform the discrete audio signal into the spatialized left and right audio signals for the screen location 402. The training data may include a number of discrete audio signals and corresponding spatialized left and right audio signals at each screen location 402 for each discrete audio signal. If head orientation 406 is also to be taken into account, this information is provided for a number of different head orientations 406 as well, where the left and right microphones are rotated in accordance with each head orientation 406 to record the corresponding spatialized left and right audio signals.

[0037] The left and right HRTFs 408A and 410A, 408B and 410B, . . ., 408N and 410N are respectively applied to discrete audio signals 412A, 412B, . . ., 412N, which correspond to the screen locations 402A, 402B, . . ., 402N and which are collectively referred to as the discrete audio signals 412. Each discrete audio signal 412 is associated with a corresponding screen location 402. That is, each discrete audio signal 412 includes the sound that is to be spatialized to appear to be coming from a corresponding screen location 402. In the example in which a local user is participating in a videoconference with remote users for which respective discrete video signals are displayed in corresponding windows at screen locations 402, as in FIG. 1A, the discrete audio signals 412 are the audio signals corresponding to these video signals.

[0038] In the case in which the remote users are displayed within the same window at different screen locations 402, as in FIG. 1B, the discrete audio signals 412 may be determined from the audio signal corresponding to the singular video signal. As one example, if the audio signal was recorded using an array of microphones, then beamforming or another technique may be employed to isolate the sound originating at each screen location 402. As another example, image processing may be employed to identify which remote user is speaking (e.g., by detecting which user is moving his or her lips), and the audio signal considered as the discrete audio signal 412 for the screen location 402 of the remote user who is currently speaking.

[0039] Application of the left and right HRTFs 408A and 410A, 408B and 410B, . . ., 408N and 410N to the discrete audio signals 412A, 412B, . . ., 412N, results in generation of corresponding left and right audio signals 414A and 416A, 414B and 416B, . . ., 414N and 416N. The left audio signals 414A, 414B, . . ., 414N are collectively referred to as the left audio signals 414, and the right audio signals 416A, 416B, . . ., 416N are collectively referred to as the right audio signals 416. Each pair of left and right audio signals 414 and 416 spatializes a corresponding discrete audio signal 412 for the left and right ears of the user, taking into account the head orientation 406 of the user if head orientation 406 was input to the machine learning model 404.
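One common way to apply a transfer function to a time-domain signal is to convolve the signal with the corresponding impulse response. The sketch below assumes each HRTF 408 and 410 is available as a finite impulse response; this is an illustrative realization of the application step, not the only possible one.

```python
# A minimal sketch of applying the left and right HRTFs 408 and 410 to one
# discrete audio signal 412, assuming each HRTF is given as a finite impulse
# response and that convolution is the chosen way to apply it.
import numpy as np
from scipy.signal import fftconvolve

def spatialize(discrete_signal, left_hrtf_ir, right_hrtf_ir):
    """Return the left and right audio signals 414 and 416 for one source."""
    discrete_signal = np.asarray(discrete_signal, dtype=float)
    left_audio = fftconvolve(discrete_signal, left_hrtf_ir, mode="same")
    right_audio = fftconvolve(discrete_signal, right_hrtf_ir, mode="same")
    return left_audio, right_audio
```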

[0040] The left audio signals 414 are added together to calculate a left binaural signal 418, and the right audio signals 416 are added together to calculate a right binaural signal 420. The left and right binaural signals 418 and 420 are respectively output on left and right speakers 422 and 424. As thus output on the left and right speakers 422 and 424, the left and right binaural signals 418 and 420 coherently spatialize the discrete audio signals 412 so that the audio signals 412 appear to come from points in space corresponding to their respective screen locations 402 as perceived by the user’s left and right ears. When the left and right speakers 422 and 424 are left and right headphone speakers worn by the user, the coherent spatialization takes into account the user’s head orientation 406 as well if head orientation 406 was input into the machine learning model 404.
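A minimal sketch of the summation into the left and right binaural signals 418 and 420 follows, assuming the per-source left and right audio signals 414 and 416 are sample-aligned NumPy arrays of equal length.

```python
# A minimal sketch of calculating the left and right binaural signals 418 and
# 420 by summing the per-source signals, assuming equal lengths and a common
# sample rate.
import numpy as np

def mix_binaural(left_signals, right_signals):
    left_binaural = np.sum(np.stack(left_signals), axis=0)    # left binaural signal 418
    right_binaural = np.sum(np.stack(right_signals), axis=0)  # right binaural signal 420
    # The two signals are then routed to the left speaker 422 and the right
    # speaker 424, e.g., as the two channels of a stereo output buffer.
    return left_binaural, right_binaural
```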

[0041] The described process 400 imparts auditory realism to a user participating in a videoconference with remote users. The process 400 can further be extended to take into account the eye gaze of the user so that the discrete audio signal 412 corresponding to the screen location 402 that is the focus of the user’s eye gaze is accentuated relative to the discrete audio signals 412 corresponding to the other screen locations 402. That is, independent and irrespective of the user’s head orientation 406, the user may be considered as being most interested in the discrete audio signal 412 corresponding to the screen location 402 to which his or her eye gaze is directed. This discrete audio signal 412 can be emphasized in volume relative to the other audio signals 412 within the left and right binaural signals 418 and 420 that coherently spatialize the audio signals 412.

[0042] FIGs. 5A and 5B show example processes 500 and 550 to accentuate the discrete audio signal corresponding to the display screen location to which the eye gaze of a user is directed, relative to the discrete audio signals corresponding to other screen locations. The processes 500 and 550 can be performed for each discrete audio signal 412, and are both described in relation to one such discrete audio signal 412. Like the process 400, the processes 500 and 550 may be implemented as program code stored on a non-transitory computer-readable data storage medium and executable by a processor, such as a processor of a computing device.

[0043] In FIG. 5A, the process 500 is performed on each discrete audio signal 412 prior to the application of the left and right HRTFs 408 and 410 for the corresponding screen location 402 on the audio signal 412 in FIG. 4. From the corresponding screen location 402 of a discrete audio signal 412 and an eye gaze 504 of the user, a distance weight 506 is determined. The eye gaze 504 of the user may specify the screen location of the display screen to which the user’s eye gaze 504 is directed, irrespective of the head orientation 406 of the user. The user’s eye gaze 504 may be determined by performing eye tracking on the local video signal captured by a local camera positioned at the display screen.

[0044] The distance weight 506 can be a weight corresponding to the Euclidean or other distance between the screen location to which the user’s eye gaze 504 is directed and the screen location 402 corresponding to the discrete audio signal 412 under consideration. For instance, the distance weight 506 may range from a minimum value of one to a maximum value greater than one. The minimum distance weight 506 corresponds to a farthest possible screen location 402 from the screen location to which the user’s eye gaze 504 is directed, and the maximum distance weight 506 corresponds to a screen location 402 coincident with the user’s eye gaze 504.

[0045] The distance weight 506 is applied to the discrete audio signal 412 to generate an eye gaze-weighted discrete audio signal 412’. The left and right HRTFs 408 and 410 are thus applied to the discrete audio signal 412’ instead of the discrete audio signal 412 in FIG. 4. The discrete audio signal 412 corresponding to a screen location 402 farther from the user’s eye gaze 504 is relatively lower in volume in the corresponding eye gaze-weighted discrete audio signal 412’ than the discrete audio signal 412 corresponding to a screen location 402 closer to the user’s eye gaze 504. In the resulting left and right binaural signals 418 and 420 of FIG. 4 that coherently spatialize the audio signals 412, the audio signal 412 corresponding to the screen location 402 to which the user’s eye gaze 504 is directed is thus accentuated relative to the other audio signals 412 corresponding to the other screen locations 402.

[0046] In FIG. 5B, the process 550 is performed on the left and right audio signals 414 and 416 for each discrete audio signal 412, and thus after application of the left and right HRTFs 408 and 410 for the corresponding screen location 402 on the audio signal 412 in FIG. 4. As in FIG. 5A, from the corresponding screen location 402 of a discrete audio signal 412 and the user’s eye gaze 504, a distance weight 506 is determined. The determined distance weight 506 is applied to the left and right audio signals 414 and 416 for the discrete audio signal 412 to generate eye gaze-weighted left and right audio signals 414’ and 416’. The left and right audio signals 414’ and 416’ for the discrete audio signals 412 are thus respectively added together in FIG. 4, instead of the audio signals 414 and 416. The difference between FIGs. 5A and 5B is, therefore, when the distance weight 506 is applied, before or after application of the left and right HRTFs 408 and 410.
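As a rough illustration, the sketch below computes one possible distance weight 506 that decays linearly from a maximum at the screen location of the eye gaze 504 to a value of one at the farthest possible screen location 402. The linear falloff and the particular maximum weight are assumptions for illustration; the description above specifies only the minimum and maximum values and the dependence on Euclidean or other distance.

```python
# A minimal sketch of one possible distance weight 506. The linear mapping and
# the default max_weight are assumptions, not part of the described technique.
import math

def distance_weight(gaze_xy, location_xy, max_distance, max_weight=2.0):
    d = min(math.dist(gaze_xy, location_xy), max_distance)  # Euclidean distance
    return max_weight - (max_weight - 1.0) * (d / max_distance)

# Per FIG. 5A the weight would scale the discrete audio signal 412 before the
# HRTFs are applied; per FIG. 5B it would instead scale the left and right
# audio signals 414 and 416 afterwards.
```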

[0047] FIG. 6 shows an example method 600. The method 600 includes generating, for each of a number of screen locations respectively corresponding to a number of discrete audio signals, left and right transfer functions using a machine learning model (602). The method 600 includes applying, to each discrete audio signal, the left and right transfer functions for the screen location corresponding to the discrete audio signal to respectively generate left and right audio signals corresponding to the discrete audio signal (604). The method 600 includes calculating left and right binaural signals by respectively adding together the left and right audio signals corresponding to the discrete audio signals (606). The method 600 includes outputting the left and right binaural signals on left and right speakers, respectively (608).
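For orientation, the sketch below strings the steps of method 600 together using the hypothetical helpers sketched earlier (generate_hrtfs, spatialize, and mix_binaural). It is an illustrative composition under the same assumptions, not the claimed implementation.

```python
# A minimal sketch composing the steps of method 600 from the earlier sketches.
def method_600(model, sources, head_orientation=None):
    """sources maps each screen location to its discrete audio signal."""
    hrtfs = generate_hrtfs(model, list(sources), head_orientation)       # 602
    lefts, rights = [], []
    for location, signal in sources.items():
        left_audio, right_audio = spatialize(signal, *hrtfs[location])   # 604
        lefts.append(left_audio)
        rights.append(right_audio)
    left_binaural, right_binaural = mix_binaural(lefts, rights)          # 606
    # The caller routes the two signals to the left and right speakers (608).
    return left_binaural, right_binaural
```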

[0048] FIG. 7 shows an example computer-readable data storage medium 700 storing program code 702. The program code 702 is executable by the processor to generate, for each of a number of screen locations respectively corresponding to a number of discrete audio signals, left and right transfer functions that take into account a head orientation of a user relative to a display screen having the screen locations (704). The program code 702 is executable by the processor to apply, to each discrete audio signal, the left and right transfer functions for the screen location corresponding to the discrete audio signal to respectively generate left and right audio signals corresponding to the discrete audio signal (706). The program code 702 is executable by the processor to calculate left and right binaural signals by respectively adding together the left and right audio signals corresponding to the discrete audio signals (708), and output the left and right binaural signals on left and right headphone speakers, respectively, of headphones that the user is wearing (710).

[0049] FIG. 8 shows an example system 800. The system 800 may be implemented as a computing device, for instance, such as a desktop, laptop, or notebook computer, or a smartphone, tablet computing device, or another type of mobile computing device. The system 800 includes a display screen 802 having locations respectively corresponding to a number of discrete audio signals. The system 800 includes a processor 804, and an interface 806 to communicatively connect to headphones 808 wearable by a user and that have left and right headphone speakers. The interface 806 may be a wired interface or a wireless interface. The system 800 includes a memory storing program code 812 executable by the processor 804.

[0050] The program code 812 is executable by the processor 804 to generate, for each screen location, left and right transfer functions that take into account a head orientation of the user relative to the display screen, using a machine learning model (814). The program code 812 is executable by the processor 804 to apply, to each discrete audio signal, the left and right transfer functions for the screen location corresponding to the discrete audio signal to respectively generate left and right audio signals corresponding to the discrete audio signal (816). The program code 812 is executable by the processor 804 to calculate left and right binaural signals by respectively adding together the left and right audio signals corresponding to the discrete audio signals (818), and output the left and right binaural signals on the left and right headphone speakers, respectively (820).

[0051] Techniques have been described that coherently spatialize discrete audio signals to match corresponding screen locations on a display screen of a computing device, as output on left and right speakers. Therefore, to a user of the computing device, the discrete audio signals appear to come from their corresponding screen locations. If the left and right speakers are part of headphones that the user is wearing, the discrete audio signals may further be coherently spatialized so that they take into account the user’s head orientation relative to the display screen.