Title:
USING A SMART SPEAKER TO ESTIMATE LOCATION OF USER
Document Type and Number:
WIPO Patent Application WO/2022/150640
Kind Code:
A1
Abstract:
A method, smart device and computer program product for identifying the location of a user. A smart device, such as a smart speaker, collects signals emanating from a user in a living space by a microphone array of the smart speaker. The angles of arrival of the propagation paths of one or more of the collected signals are then estimated. Furthermore, the smart speaker estimates the room structure by emitting chirp pulses into the room. The smart speaker then collects the signals reflected by reflectors of the room structure from the emitted chirp pulses. The smart speaker then estimates the positions and orientations of the reflectors in the room structure. The location of the user is then identified by retracing the propagation paths of the one or more signals received from the user based on the positions and orientations of the reflectors.

Inventors:
QIU LILI (US)
WANG MEI (US)
SUN WEI (US)
Application Number:
PCT/US2022/011692
Publication Date:
July 14, 2022
Filing Date:
January 07, 2022
Assignee:
UNIV TEXAS (US)
International Classes:
G06F17/00; H04W24/00
Foreign References:
US20180299527A12018-10-18
US20120044786A12012-02-23
US20130002488A12013-01-03
US20200309930A12020-10-01
US8174931B22012-05-08
Other References:
SHEN SHENG, CHEN DAGUAN, WEI YU-LIN, YANG ZHIJIAN, CHOUDHURY ROMIT ROY: "Voice Localization Using Nearby Wall Reflections", PROCEEDINGS OF THE 26TH ANNUAL INTERNATIONAL CONFERENCE ON MOBILE COMPUTING AND NETWORKING, 21 September 2020 (2020-09-21), pages 1 - 14, XP058463910, DOI: https://doi.org/10.1145/3372224.3380884
Attorney, Agent or Firm:
VOIGT, Robert, A., Jr. (US)
Claims:
CLAIMS:

1. A method for identifying a location of a user, the method comprising: collecting signals emanating from said user by a microphone array of a smart device; estimating an angle of arrival of propagation paths of one or more of said collected signals emanating from said user; estimating positions and orientations of reflectors of a room structure; and identifying said location of said user by retracing said propagation paths of said one or more signals collected from said user based on said positions and said orientations of said reflectors of said room structure.

2. The method as recited in claim 1 further comprising: performing Short-Time Fourier Transform (STFT) analysis on said one or more collected signals emanating from said user over different time windows; applying differencing to said Short-Time Fourier Transform (STFT) analysis to cancel signals in a time-frequency domain to generate a differencing result; and applying a multiple signal classification (MUSIC) algorithm to said differencing result.

3. The method as recited in claim 2 further comprising: using results from said STFT analysis on said one or more collected signals emanating from said user to form a base; searching for a nearest neighbor in results of said wavelet analysis and in results of applying said differencing to said wavelet and Short-Time Fourier Transform (STFT) analysis for each point in said base; and selecting a number of peaks from a selected number of nearest neighbors as corresponding to said estimates of said angle of arrival of propagation paths of said one or more of said collected signals emanating from said user.

4. The method as recited in claim 2 further comprising: performing wavelet analysis on said one or more collected signals emanating from said user over different time windows; applying differencing to said wavelet analysis to cancel signals in a time-frequency domain to generate a differencing result; and applying a multiple signal classification (MUSIC) algorithm to said differencing result.

5. The method as recited in claim 4 further comprising: combining results from said STFT analysis, said STFT differencing, and said wavelet differencing using a linear regression.

6. The method as recited in claim 4 further comprising: combining results from said STFT analysis, said STFT differencing, and said wavelet differencing using a neural network.

7. The method as recited in claim 4 further comprising: combining multiple signal classification (MUSIC) profiles from results of said STFT analysis, said STFT differencing, and said wavelet differencing; and selecting peaks from said combined MUSIC profiles.

8. The method as recited in claim 1 further comprising: emitting chirp pulses in said room structure; collecting signals reflected by said reflectors of said room structure from said emitted chirp pulses; and estimating said positions and said orientations of said reflectors of said room structure using a multiple signal classification (MUSIC) algorithm.

9. The method as recited in claim 8, wherein said chirp pulses correspond to frequency-modulated continuous-wave (FMCW) chirps, wherein the method further comprises: dividing a FMCW signal into multiple subbands in a time domain; running a multiple signal classification (MUSIC) algorithm in each of said multiple subbands to generate a MUSIC profile for each of said multiple subbands, wherein said MUSIC profile corresponds to an azimuth AoA-distance profile; and summing said generated MUSIC profiles.

10. The method as recited in claim 9 further comprising: searching and selecting azimuth AoAs from said summed generated MUSIC profiles that minimize a fitting error with a rectangular room.

11. The method as recited in claim 8, wherein said chirp pulses correspond to frequency-modulated continuous-wave (FMCW) chirps, wherein the method further comprises: performing a beamforming algorithm on said FMCW signals reflected by said reflectors of said room structure, forming FMCW profiles containing distance information from said reflectors to said smart device.

12. The method as recited in claim 1, wherein said location of said user is determined by retracing each of said propagation paths of said one or more signals collected from said user and using said positions of said reflectors of said room structure as a cone structure resulting in a plurality of cone structures, wherein a width of said cone structure is determined by a width of peaks in a multiple signal classification (MUSIC) profile, wherein said location of said user corresponds to a point in said plurality of cone structures such that a circle centered at said point overlaps with a maximum number of cones.

13. The method as recited in claim 12, wherein a width of each of said cone structures corresponds to a peak width obtained using a multiple signal classification (MUSIC) algorithm on frequency-modulated continuous-wave (FMCW) signals reflected by said reflectors of said room structure.

14. The method as recited in claim 1, wherein said location of said user is determined by retracing each of said propagation paths of said one or more signals collected from said user and using said positions of said reflectors of said room structure as a cone structure resulting in a plurality of cone structures, wherein each point in said cone structure is assigned a probability based on a distance from peaks in multiple signal classification (MUSIC) profiles, wherein a joint probability of a point in space is computed as a product of probabilities from said plurality of cone structures corresponding to all reflectors in said room structure, wherein said location of said user is derived as a weighted centroid of all points where weights are said joint probabilities.

15. The method as recited in claim 1, wherein said location of said user is determined by retracing each of said propagation paths of said one or more signals collected from said user and using said positions of said reflectors of said room structure as an intersection of propagation paths from multiple reflectors.

16. The method as recited in claim 1 further comprising: interpreting a command from said user in connection with said identified location of said user.

17. A computer program product for identifying a location of a user, the computer program product comprising one or more computer readable storage mediums having program code embodied therewith, the program code comprising programming instructions for: collecting signals emanating from said user by a microphone array of a smart device; estimating an angle of arrival of propagation paths of one or more of said collected signals emanating from said user; estimating positions and orientations of reflectors of a room structure; and identifying said location of said user by retracing said propagation paths of said one or more signals collected from said user based on said positions and said orientations of said reflectors of said room structure.

18. The computer program product as recited in claim 17, wherein the program code further comprises the programming instructions for: performing Short-Time Fourier Transform (STFT) analysis on said one or more collected signals emanating from said user over different time windows; applying differencing to said Short-Time Fourier Transform (STFT) analysis to cancel signals in a time-frequency domain to generate a differencing result; and applying a multiple signal classification (MUSIC) algorithm to said differencing result.

19. The computer program product as recited in claim 18, wherein the program code further comprises the programming instructions for: using results from said STFT analysis on said one or more collected signals emanating from said user to form a base; searching for a nearest neighbor in results of said wavelet analysis and in results of applying said differencing to said wavelet and Short-Time Fourier Transform (STFT) analysis for each point in said base; and selecting a number of peaks from a selected number of nearest neighbors as corresponding to said estimates of said angle of arrival of propagation paths of said one or more of said collected signals emanating from said user.

20. The computer program product as recited in claim 18, wherein the program code further comprises the programming instructions for: performing wavelet analysis on said one or more collected signals emanating from said user over different time windows; applying differencing to said wavelet analysis to cancel signals in a time-frequency domain to generate a differencing result; and applying a multiple signal classification (MUSIC) algorithm to said differencing result.

21. The computer program product as recited in claim 20, wherein the program code further comprises the programming instructions for: combining results from said STFT analysis, said STFT differencing, and said wavelet differencing using a linear regression.

22. The computer program product as recited in claim 20, wherein the program code further comprises the programming instructions for: combining results from said STFT analysis, said STFT differencing, and said wavelet differencing using a neural network.

23. The computer program product as recited in claim 20, wherein the program code further comprises the programming instructions for: combining multiple signal classification (MUSIC) profiles from results of said STFT analysis, said STFT differencing, and said wavelet differencing; and selecting peaks from said combined MUSIC profiles.

24. The computer program product as recited in claim 17, wherein the program code further comprises the programming instructions for: emitting chirp pulses in said room structure; collecting signals reflected by said reflectors of said room structure from said emitted chirp pulses; and estimating said positions and said orientations of said reflectors of said room structure using a multiple signal classification (MUSIC) algorithm.

25. The computer program product as recited in claim 24, wherein said chirp pulses correspond to frequency-modulated continuous-wave (FMCW) chirps, wherein the program code further comprises the programming instructions for: dividing a FMCW signal into multiple subbands in a time domain; running a multiple signal classification (MUSIC) algorithm in each of said multiple subbands to generate a MUSIC profile for each of said multiple subbands, wherein said MUSIC profile corresponds to an azimuth AoA-distance profile; and summing said generated MUSIC profiles.

26. The computer program product as recited in claim 25, wherein the program code further comprises the programming instructions for: searching and selecting azimuth AoAs from said summed generated MUSIC profiles that minimize a fitting error with a rectangular room.

27. The computer program product as recited in claim 24, wherein said chirp pulses correspond to frequency-modulated continuous-wave (FMCW) chirps, wherein the program code further comprises the programming instructions for: performing a beamforming algorithm on said FMCW signals reflected by said reflectors of said room structure forming FMCW profiles containing distance information from said reflectors to said smart device.

28. The computer program product as recited in claim 17, wherein said location of said user is determined by retracing each of said propagation paths of said one or more signals collected from said user and using said positions of said reflectors of said room structure as a cone structure resulting in a plurality of cone structures, wherein a width of said cone structure is determined by a width of peaks in a multiple signal classification (MUSIC) profile, wherein said location of said user corresponds to a point in said plurality of cone structures such that a circle centered at said point overlaps with a maximum number of cones.

29. The computer program product as recited in claim 28, wherein a width of each of said cone structures corresponds to a peak width obtained using a multiple signal classification (MUSIC) algorithm on frequency-modulated continuous-wave (FMCW) signals reflected by said reflectors of said room structure.

30. The computer program product as recited in claim 17, wherein said location of said user is determined by retracing each of said propagation paths of said one or more signals collected from said user and using said positions of said reflectors of said room structure as a cone structure resulting in a plurality of cone structures, wherein each point in said cone structure is assigned a probability based on a distance from peaks in multiple signal classification (MUSIC) profiles, wherein a joint probability of a point in space is computed as a product of probabilities from said plurality of cone structures corresponding to all reflectors in said room structure, wherein said location of said user is derived as a weighted centroid of all points where weights are said joint probabilities.

31. The computer program product as recited in claim 17, wherein said location of said user is determined by retracing each of said propagation paths of said one or more signals collected from said user and using said positions of said reflectors of said room structure as an intersection of propagation paths from multiple reflectors.

32. The computer program product as recited in claim 17, wherein the program code further comprises the programming instructions for: interpreting a command from said user in connection with said identified location of said user.

33. A smart device, comprising: a memory for storing a computer program for identifying a location of a user; and a processor connected to said memory, wherein said processor is configured to execute program instructions of the computer program comprising: collecting signals emanating from said user by a microphone array of a smart device; estimating an angle of arrival of propagation paths of one or more of said collected signals emanating from said user; estimating positions and orientations of reflectors of a room structure; and identifying said location of said user by retracing said propagation paths of said one or more signals collected from said user based on said positions and said orientations of said reflectors of said room structure.

34. The smart device as recited in claim 33, wherein the program instructions of the computer program further comprise: performing Short-Time Fourier Transform (STFT) analysis on said one or more collected signals emanating from said user over different time windows; applying differencing to said Short-Time Fourier Transform (STFT) analysis to cancel signals in a time-frequency domain to generate a differencing result; and applying a multiple signal classification (MUSIC) algorithm to said differencing result.

35. The smart device as recited in claim 34, wherein the program instructions of the computer program further comprise: using results from said STFT analysis on said one or more collected signals emanating from said user to form a base; searching for a nearest neighbor in results of said wavelet analysis and in results of applying said differencing to said wavelet and Short-Time Fourier Transform (STFT) analysis for each point in said base; and selecting a number of peaks from a selected number of nearest neighbors as corresponding to said estimates of said angle of arrival of propagation paths of said one or more of said collected signals emanating from said user.

36. The smart device as recited in claim 34, wherein the program instructions of the computer program further comprise: performing wavelet analysis on said one or more collected signals emanating from said user over different time windows; applying differencing to said wavelet analysis to cancel signals in a time-frequency domain to generate a differencing result; and applying a multiple signal classification (MUSIC) algorithm to said differencing result.

37. The smart device as recited in claim 36, wherein the program instructions of the computer program further comprise: combining results from said STFT analysis, said STFT differencing, and said wavelet differencing using a linear regression.

38. The smart device as recited in claim 36, wherein the program instructions of the computer program further comprise: combining results from said STFT analysis, said STFT differencing, and said wavelet differencing using a neural network.

39. The smart device as recited in claim 36, wherein the program instructions of the computer program further comprise: combining multiple signal classification (MUSIC) profiles from results of said STFT analysis, said STFT differencing, and said wavelet differencing; and selecting peaks from said combined MUSIC profiles.

40. The smart device as recited in claim 33, wherein the program instructions of the computer program further comprise: emitting chirp pulses in said room structure; collecting signals reflected by said reflectors of said room structure from said emitted chirp pulses; and estimating said positions and said orientations of said reflectors of said room structure using a multiple signal classification (MUSIC) algorithm.

41. The smart device as recited in claim 40, wherein said chirp pulses correspond to frequency-modulated continuous-wave (FMCW) chirps, wherein the program instructions of the computer program further comprise: dividing a FMCW signal into multiple subbands in a time domain; running a multiple signal classification (MUSIC) algorithm in each of said multiple subbands to generate a MUSIC profile for each of said multiple subbands, wherein said MUSIC profile corresponds to an azimuth AoA-distance profile; and summing said generated MUSIC profiles.

42. The smart device as recited in claim 41, wherein the program instructions of the computer program further comprise: searching and selecting azimuth AoAs from said summed generated MUSIC profiles that minimize a fitting error with a rectangular room.

43. The smart device as recited in claim 40, wherein said chirp pulses correspond to frequency-modulated continuous-wave (FMCW) chirps, wherein the program instructions of the computer program further comprise: performing a beamforming algorithm on said FMCW signals reflected by said reflectors of said room structure forming FMCW profiles containing distance information from said reflectors to said smart device.

44. The smart device as recited in claim 33, wherein said location of said user is determined by retracing each of said propagation paths of said one or more signals collected from said user and using said positions of said reflectors of said room structure as a cone structure resulting in a plurality of cone structures, wherein a width of said cone structure is determined by a width of peaks in a multiple signal classification (MUSIC) profile, wherein said location of said user corresponds to a point in said plurality of cone structures such that a circle centered at said point overlaps with a maximum number of cones.

45. The smart device as recited in claim 44, wherein a width of each of said cone structures corresponds to a peak width obtained using a multiple signal classification (MUSIC) algorithm on frequency-modulated continuous-wave (FMCW) signals reflected by said reflectors of said room structure.

46. The smart device as recited in claim 33, wherein said location of said user is determined by retracing each of said propagation paths of said one or more signals collected from said user and using said positions of said reflectors of said room structure as a cone structure resulting in a plurality of cone structures, wherein each point in said cone structure is assigned a probability based on a distance from peaks in multiple signal classification (MUSIC) profiles, wherein a joint probability of a point in space is computed as a product of probabilities from said plurality of cone structures corresponding to all reflectors in said room structure, wherein said location of said user is derived as a weighted centroid of all points where weights are said joint probabilities.

47. The smart device as recited in claim 33, wherein said location of said user is determined by retracing each of said propagation paths of said one or more signals collected from said user and using said positions of said reflectors of said room structure as an intersection of propagation paths from multiple reflectors.

48. The smart device as recited in claim 33, wherein the program instructions of the computer program further comprise: interpreting a command from said user in connection with said identified location of said user.

Description:
USING A SMART SPEAKER TO ESTIMATE LOCATION OF USER

CROSS REFERENCE TO RELATED APPLICATIONS

[0001] This application claims priority to U.S. Provisional Patent Application Serial No. 63/135,231 entitled “Using a Smart Speaker to Estimate Location of User,” filed on January 8, 2021, which is incorporated by reference herein in its entirety.

GOVERNMENT INTERESTS

[0002] This invention was made with government support under Grant numbers CNS1718585 and CNS2032125 awarded by the National Science Foundation. The U.S. government has certain rights in the invention.

TECHNICAL FIELD

[0003] The present invention relates generally to user localization systems, and more particularly to using a smart device, such as a smart speaker, to estimate the location of the user.

BACKGROUND

[0004] User localization systems attempt to estimate the location of the user. For example, such systems may attempt to use audio and vision-based schemes to estimate the location of the user. Such vision-based schemes may involve the use of cameras. However, it may not be prudent to deploy cameras throughout a home due to privacy concerns.

[0005] Other such user localization systems utilize device-based tracking, which requires the user, whose location is to be estimated, to carry a device (e.g., smartphone), which may not be convenient for the user at home.

[0006] Furthermore, other such user localization systems may utilize device-free radio frequency. However, such a scheme requires a large bandwidth as well as many antennas or millimeter wave chirps to achieve high accuracy, which is not easy to deploy at home.

[0007] Unfortunately, such user localization systems are deficient in accurately estimating the location of the user. Furthermore, such user localization systems may require the deployment of expensive equipment.

SUMMARY

[0008] In one embodiment of the present invention, a method for identifying a location of a user comprises collecting signals emanating from the user by a microphone array of a smart device. The method further comprises estimating an angle of arrival of propagation paths of one or more of the collected signals emanating from the user. The method additionally comprises estimating positions and orientations of reflectors of a room structure. Furthermore, the method comprises identifying the location of the user by retracing the propagation paths of the one or more signals collected from the user based on the positions and the orientations of the reflectors of the room structure.

[0009] Other forms of the embodiment of the method described above are in a smart device and in a computer program product.

[0010] The foregoing has outlined rather generally the features and technical advantages of one or more embodiments of the present invention in order that the detailed description of the present invention that follows may be better understood. Additional features and advantages of the present invention will be described hereinafter which may form the subject of the claims of the present invention.

BRIEF DESCRIPTION OF THE DRAWINGS

[0011] A better understanding of the present invention can be obtained when the following detailed description is considered in conjunction with the following drawings, in which:

[0012] Figure 1 illustrates a diagram of a placement of a smart device, such as a smart speaker, within a living space, such as a house, apartment, etc., in accordance with an embodiment of the present invention;

[0013] Figure 2 is a diagram of the software components of the smart speaker used to estimate the location of the user within the living space in accordance with an embodiment of the present invention;

[0014] Figure 3 illustrates an embodiment of the present invention of the hardware configuration of the smart speaker which is representative of a hardware environment for practicing the present invention;

[0015] Figure 4 is a flowchart of a method for identifying the location of the user in accordance with an embodiment of the present invention;

[0016] Figure 5 is a flowchart of a method for estimating the angle of arrival (AoA) of the propagation paths of one or more of the collected signals in accordance with an embodiment of the present invention;

[0017] Figure 6 is a diagram of the multi-resolution analysis algorithm in accordance with an embodiment of the present invention;

[0018] Figure 7 is a diagram illustrating the comparison of the AoA derived from Short-Time Fourier Transform (STFT) and wavelet with and without differencing in accordance with an embodiment of the present invention;

[0019] Figure 8 is a flowchart of a method for estimating the AoA of the propagation paths of one or more signals reflected from the reflectors of the room structure in accordance with an embodiment of the present invention;

[0020] Figure 9 illustrates an exemplary azimuth-distance profile in accordance with an embodiment of the present invention;

[0021] Figure 10A illustrates retracing using a ray structure for each of the two near parallel paths in accordance with an embodiment of the present invention; and

[0022] Figure 10B illustrates retracing using a cone structure for each of the paths in accordance with an embodiment of the present invention.

DETAILED DESCRIPTION

[0023] As stated in the Background section, user localization systems attempt to estimate the location of the user. For example, such systems may attempt to use audio and vision-based schemes to estimate the location of the user. Such vision-based schemes may involve the use of cameras. However, it may not be prudent to deploy cameras throughout a home due to privacy concerns.

[0024] Other such user localization systems utilize device-based tracking, which requires the user, whose location is to be estimated, to carry a device (e.g., smartphone), which may not be convenient for the user at home.

[0025] Furthermore, other such user localization systems may utilize device-free radio frequency. However, such a scheme requires a large bandwidth as well as many antennas or millimeter wave chirps to achieve high accuracy, which is not easy to deploy at home.

[0026] Unfortunately, such user localization systems are deficient in accurately estimating the location of the user. Furthermore, such user localization systems may require the deployment of expensive equipment.

[0027] The embodiments of the present invention provide a means for accurately estimating the location of the user using standard inexpensive equipment (e.g., smart speaker) that may already be located in the user’s home.

[0028] While the following discusses the present invention in connection with utilizing a smart speaker to estimate the location of the user, it is noted that any smart device may be utilized that can receive signals emanating from a user as well as emit chirp pulses and receive the signals of those emitted chirp pulses reflected by reflectors (e.g., walls, ceilings) in the user’s home. A person of ordinary skill in the art would be capable of applying the principles of the present invention to such implementations. Further, embodiments applying the principles of the present invention to such implementations would fall within the scope of the present invention.

[0029] In one embodiment, the principles of the present invention estimate the user’s location, such as within a particular room in a house, by retracing multiple propagation paths that the user’s sound traverses. In one embodiment, the angles of arrival (AoAs) of the multiple paths traversed by the voice signals from the user to a microphone array, such as on a smart speaker, are estimated. The multipath may include a direct path (referring to the path of a signal propagating between the user and the microphone array without any reflections) and the reflected paths (referring to the path of a signal propagating between the user and the microphone array with reflections, such as via walls, ceilings, etc.). In one embodiment, the indoor space structure (e.g., walls, ceilings) is estimated by emitting wideband chirp pulses to estimate the angle of arrival (AoA) and distance to the reflectors (e.g., walls) in the room. Furthermore, in one embodiment, the propagation paths of the signals (both from the user and the reflectors) are retraced based on the estimated AoA of the voice signals and the reflected chirp signals to localize the voice. “Localizing the voice,” as used herein, refers to estimating the location of the source of the voice, which, in the case of the present invention, represents the location of the user speaking such words. As a result, the principles of the present invention may actively map indoor rooms and localize voice sources using only a smart device, such as a smart speaker, without additional hardware. Furthermore, the present invention may localize voice in both line of sight (LoS) and non-line of sight (NLoS) scenarios. LoS scenarios refer to the user being within sight of the smart device, such as the smart speaker; whereas NLoS scenarios refer to the user not being within sight of the smart device (e.g., smart speaker), such as being behind a wall or in a different room. Prior user localization systems are not capable of estimating the user’s location in NLoS scenarios.
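
To make the retracing step concrete, the following is a minimal 2-D sketch of the intersection-style retracing described above (and in claim 15): walk a reflected arrival backward from the microphone array to an estimated wall, apply specular reflection there, and intersect the retraced rays from two reflectors. The wall positions, AoA values, and function names are illustrative assumptions, not the patented implementation.

```python
import numpy as np

def retrace_reflected_path(mic, aoa_deg, wall_x):
    """Walk a reflected arrival back from the mic to the wall x = wall_x,
    then apply specular reflection to obtain the ray toward the source.
    Illustrative 2-D geometry; the wall is assumed vertical."""
    theta = np.deg2rad(aoa_deg)
    d = np.array([np.cos(theta), np.sin(theta)])  # back-propagation direction
    t = (wall_x - mic[0]) / d[0]                  # where the ray meets the wall
    p = mic + t * d                               # reflection point on the wall
    return p, np.array([-d[0], d[1]])             # specular flip of x-component

def intersect_rays(p1, d1, p2, d2):
    """Solve p1 + s*d1 = p2 + t*d2: the user sits where retraced paths
    from multiple reflectors cross."""
    s, _ = np.linalg.solve(np.column_stack([d1, -d2]), p2 - p1)
    return p1 + s * d1

# Hypothetical numbers: mic array at the origin, walls at x = 3 m and x = -2 m,
# reflected voice arrivals at 40 and 150 degrees.
mic = np.zeros(2)
p1, d1 = retrace_reflected_path(mic, 40.0, 3.0)
p2, d2 = retrace_reflected_path(mic, 150.0, -2.0)
print("estimated user location:", intersect_rays(p1, d1, p2, d2))
```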

[0030] As discussed above, “localizing the voice,” as used herein, refers to estimating the location of the source of the voice, which, in the case of the present invention, represents the location of the user speaking such words. In one embodiment, such words may refer to a command, such as to turn on a light. As a result, the ability to localize human voice benefits smart devices, such as smart speakers, in many ways. For example, knowing the user’s location allows the smart speaker to beamform its transmission to the user so that it can both hear from and transmit to a faraway user. Second, the user location gives context information, which can assist in interpreting the user’s intent. For example, when the user issues the command to turn on the light, the smart speaker can resolve the ambiguity and tell which light to turn on depending on the user’s location. In addition, knowing the user’s location also enables location-based services. For instance, a smart speaker can automatically adjust the temperature and lighting conditions near the user. Moreover, location information can also help with speech recognition and natural language processing by providing important context information. For example, when a user says "orange" in the kitchen, the smart speaker knows that it refers to a fruit; whereas when the same user says "orange" elsewhere, it may be interpreted as a color.

[0031] In one embodiment, a microphone array, widely available on smart speakers, is utilized to collect the received signals from the user. In one embodiment, in order to reduce coherence and separate paths, the earliest arriving voice signals are captured so that the signal traversing via the shortest path has little or no overlap with those traversing via the longer paths.

[0032] Furthermore, in one embodiment, in connection with estimating the angle of arrival of the signals emanating from the user, wavelet and Short-Time Fourier Transform (STFT) analyses are performed on the signals emanating from the user over different time windows to benefit from both transient signals with low coherence and long signals with high cumulative energy. Furthermore, differencing is applied to the wavelet and STFT analyses to cancel the signals in the time-frequency domain to reduce coherence, thereby improving the AoA accuracy.
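
As a rough illustration of the differencing idea, the sketch below (Python with NumPy/SciPy; the channel count, sampling rate, and window size are assumed) computes an STFT per microphone channel and subtracts consecutive frames so that paths present in both frames cancel in the time-frequency domain; the differenced snapshots would then be handed to an AoA estimator such as MUSIC.

```python
import numpy as np
from scipy.signal import stft

def stft_difference(mic_signals, fs, nperseg=512):
    """Differencing in the time-frequency domain: subtract consecutive STFT
    frames per channel so that paths common to both frames cancel, reducing
    coherence before AoA estimation. mic_signals: (n_mics, n_samples)."""
    _, _, Z = stft(mic_signals, fs=fs, nperseg=nperseg, axis=-1)
    # Z has shape (n_mics, n_freq_bins, n_frames); difference adjacent frames.
    return np.diff(Z, axis=-1)

fs = 16000                             # assumed sampling rate
x = np.random.randn(4, fs)             # stand-in for a 4-mic voice snippet
print(stft_difference(x, fs).shape)    # (4, 257, n_frames - 1)
```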

[0033] Additionally, in one embodiment, the room contour (i.e., the distances and directions of the walls, ceilings, etc.) is estimated. In one embodiment, the smart device (e.g., smart speaker) of the present invention emits wideband frequency-modulated continuous-wave (FMCW) chirp pulses and utilizes the wideband 3D multiple signal classification (MUSIC) algorithm to estimate the multiple propagation paths from the reflected chirp pulses simultaneously. The wide bandwidth not only improves distance resolution, but also allows one to leverage the frequency diversity to estimate the AoAs of coherent signals. Furthermore, in one embodiment, the AoA estimation is improved by leveraging the assumption of a rectangular room (which is common in real-world scenarios). Additionally, in one embodiment, the accuracy of the distance estimation, such as to the wall, is improved by using beamforming.
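
The probing signal and the subband step can be sketched as follows, assuming a linear FMCW sweep so that equal time slices of a chirp correspond to frequency subbands (as in claim 9); the band edges, chirp duration, and array size are hypothetical.

```python
import numpy as np
from scipy.signal import chirp

fs = 48000                            # assumed playback/capture rate
T = 0.1                               # 100 ms pulse, illustrative
t = np.arange(int(T * fs)) / fs
# Wideband linear FMCW pulse; these band edges are assumptions.
tx = chirp(t, f0=1000, t1=T, f1=20000, method='linear')

def split_subbands(rx, n_subbands=8):
    """Slice a received chirp frame in time (claim 9: for a linear sweep,
    equal time slices correspond to frequency subbands). A MUSIC profile
    would be computed per slice and the profiles summed."""
    return np.array_split(rx, n_subbands, axis=-1)

rx = np.random.randn(6, len(t))       # stand-in for a 6-mic reflected chirp
bands = split_subbands(rx)
print(len(bands), bands[0].shape)     # 8 slices of (6, 600) samples each
```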

[0034] In the following description, numerous specific details are set forth to provide a thorough understanding of the present invention. However, it will be apparent to those skilled in the art that the present invention may be practiced without such specific details. For the most part, details considering timing considerations and the like have been omitted inasmuch as such details are not necessary to obtain a complete understanding of the present invention and are within the skills of persons of ordinary skill in the relevant art.

[0035] Referring now to the Figures in detail, Figure 1 illustrates a diagram of a placement of a smart device, such as a smart speaker, within a living space, such as a house, apartment, etc., in accordance with an embodiment of the present invention. “Living space,” as used herein, refers to space within a building. In such a space, a person (or people) may or may not live. For example, a living space may be a house, an apartment, etc., where people live. However, a living space may also include spaces, such as hospital rooms, offices, etc., where people work or interact with others.

[0036] Referring to Figure 1, living space 100 includes a smart device 101, such as a smart speaker as shown in Figure 1, that is placed within a particular room of the home. For example, as shown in Figure 1, smart speaker 101 is located in room 102A (identified as “Room 1” in Figure 1), separated from other rooms, such as room 102B (identified as “Room 2” in Figure 1), room 102C (identified as “Room 3” in Figure 1) and room 102D (identified as “Room 4” in Figure 1) via a wall, door, etc.

[0037] A “smart device,” as used herein, refers to an electronic device, generally connected to other devices or networks via different wireless protocols, such as Bluetooth, Zigbee, NFC, Wi-Fi, LiFi, 5G, etc., that can operate to some extent interactively and autonomously. Several notable types of smart devices include smartphones, smart televisions or displays, smart thermostats, smart doorbells, smart locks, smart refrigerators, phablets and tablets, smartwatches, smart bands, smart key chains, smart speakers and others.

[0038] A “smart speaker,” as used herein, refers to a speaker and voice command device with an integrated virtual assistant that offers interactive actions and hands-free activation with the help of one "hot word" (or several "hot words"). In one embodiment, smart speaker 101 functions as a smart device that utilizes Wi-Fi, Bluetooth and other protocol standards to extend usage beyond audio playback, such as to control home automation devices, such as light source 103. This can include, but is not limited to, features, such as compatibility across a number of services and platforms, peer-to-peer connection through mesh networking, virtual assistants, and others. Each can have its own designated interface and features in-house, usually launched or controlled via application or home automation software. Some smart speakers also include a screen to show the user a visual response. Some smart televisions or other electronic devices may have integrated smart speakers.

[0039] As discussed above, smart speaker 101 may be utilized to control home automation devices, such as light source 103. For example, a user 104 may be located in room 102A and instruct light source 103 to be turned on using voice commands captured by smart speaker 101.

[0040] In one embodiment, smart speaker 101 includes a microphone array 105 for extracting voice input, such as voice signals emanating from user 104 or signals reflected from reflectors of the room structure, such as walls, ceilings, etc. A “reflector,” as used herein, refers to an object in living space 100 that causes the reflection of a signal, whether a signal emanating from user 104 or a chirp pulse emanating from smart speaker 101. For example, the reflector may be a wall, ceiling, table, etc. located within living space 100.

[0041] While Figure 1 illustrates a living space with four rooms, it is noted that living space 100 may include any number of rooms, which may be separated in any number of ways (e.g., doors, walls, ceilings, etc.). Furthermore, user 104 may be located in any room, including being located in a different room from smart speaker 101. Additionally, living space 100 may include any number of smart speakers 101.

[0042] A diagram of the software components of smart speaker 101 used to estimate the location of the user 104 within living space 100 is discussed below in connection with Figure 2.

[0043] Figure 2 is a diagram of the software components of smart speaker 101 (Figure 1) used to estimate the location of the user 104 (Figure 1) within living space 100 (Figure 1) in accordance with an embodiment of the present invention.

[0044] Referring to Figure 2, in conjunction with Figure 1, smart speaker 101 includes an angle of arrival estimator 201 configured to estimate the angle of arrival of signals emanating from user 104 as discussed further below in connection with Figures 4-9 and 10A-10B.

[0045] Smart speaker 101 further includes a room structure estimator 202 configured to estimate the room contour as discussed further below in connection with Figures 4-9 and 10A-10B. “Room contour,” as used herein, refers to the structure or outline of living space 100.

[0046] Smart speaker 101 additionally includes a constrained beam retracing engine 203 configured to retrace the propagation paths of the signals received from user 104 as well as the signals received from the reflectors of the room structure as discussed further below in connection with Figures 4-9 and 10A-10B. “Retracing,” as used herein, refers to tracing back the propagation path of the signal(s) collected from user 104 and tracing back the propagation path of the signal(s) collected from the reflectors of the room structure.

[0047] A description of the hardware configuration of smart speaker 101 is provided below in connection with Figure 3.

[0048] Referring now to Figure 3, Figure 3 illustrates an embodiment of the present invention of the hardware configuration of smart speaker 101 (Figures 1 and 2) which is representative of a hardware environment for practicing the present invention.

[0049] Smart speaker 101 may be a machine that operates as a standalone device or may be networked to other machines. Further, while smart speaker 101 is shown only as a single machine, the term "system" shall also be taken to include any collection of machines that individually or jointly execute a set (or multiple sets) of instructions to perform any one or more of the methodologies discussed herein.

[0050] Smart speaker 101 may include one or more speakers 301, one or more processors 302, a main memory 303, and a static memory 304, which communicate with each other via a link 305 (e.g., a bus). Smart speaker 101 may further include a video display unit 306, an alphanumeric input device 307 (e.g., a keyboard), and a user interface (UI) navigation device 308. Video display unit 306, alphanumeric input device 307, and UI navigation device 308 may be incorporated into a touch screen display. A UI of smart speaker 101 can be realized by a set of instructions that can be executed by processor 302 to control operation of video display unit 306, alphanumeric input device 307, and UI navigation device 308. Video display unit 306, alphanumeric input device 307, and UI navigation device 308 may be implemented on smart speaker 101 arranged as a virtual assistant to manage parameters of the virtual assistant.

[0051] As illustrated in Figure 3, smart speaker 101 includes microphone array 105 and a set of optical sensors 309 having source(s) 310 and detector(s) 311. Smart speaker 101 may include a set of acoustic sensors 312 having transmitter(s) 313 and receiver(s) 314.

[0052] Smart speaker 101 may also include a network interface device 315, and one or more sensors (not shown), such as a global positioning system (GPS) sensor, compass, accelerometer, or other sensor. Communications may be provided using bus 305, which can include a link in a wired transmission or a wireless transmission. As a result, network interface device 315 interconnects bus 305 with an outside network (network 316), thereby allowing smart speaker 101 to communicate with other devices, such as other smart devices (not shown), etc.

[0053] Network 316 may be, for example, a local area network, a wide area network, a wireless wide area network, a circuit-switched telephone network, a Global System for Mobile Communications (GSM) network, a Wireless Application Protocol (WAP) network, a WiFi network, an IEEE 802.11 standards network, various combinations thereof, etc. Other networks, whose descriptions are omitted here for brevity, may also be used without departing from the scope of the present invention.

[0054] Referring again to Figure 3, main memory 303 may store application 317, which may include, for example, angle of arrival estimator 201 (Figure 2), room structure estimator 202 (Figure 2) and constrained beam retracing engine 203 (Figure 2). Furthermore, application 317 may include, for example, a program for estimating the location of the user, such as user 104 (Figure 1), as discussed further below in connection with Figures 4-9 and 10A-10B.

[0055] Processor(s) 302 may include instructions to completely or at least partially operate smart speaker 101 as an activated smart home speaker with user localization capabilities. Components of smart speaker 101, as taught herein, can be distributed as modules having instructions in one or more of main memory 303, static memory 304, and/or within instructions 318 of processor(s) 302.

[0056] In one embodiment, as discussed above, application 317 of smart device 101 includes the software components of angle of arrival estimator 201, room structure estimator 202 and constrained beam retracing engine 203. In one embodiment, such components may be implemented in hardware, where such hardware components would be connected to bus 305. The functions discussed above performed by such components are not generic computer functions. As a result, smart speaker 101 is a particular machine that is the result of implementing specific, non-generic computer functions.

[0057] The present invention may be a system, a method, and/or a computer program product at any possible technical detail level of integration. The computer program product may include a computer readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out aspects of the present invention.

[0058] The computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. A non-exhaustive list of more specific examples of the computer readable storage medium includes the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch-cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing. A computer readable storage medium, as used herein, is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.

[0059] Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network. The network may comprise copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. A network adapter card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium within the respective computing/processing device.

[0060] Computer readable program instructions for carrying out operations of the present invention may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, configuration data for integrated circuitry, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++, or the like, and procedural programming languages, such as the "C" programming language or similar programming languages. The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider). In some embodiments, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present invention.

[0061] Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer readable program instructions.

[0062] These computer readable program instructions may be provided to a processor of a computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable storage medium having instructions stored therein comprises an article of manufacture including instructions which implement aspects of the function/act specified in the flowchart and/or block diagram block or blocks.

[0063] The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable apparatus or other device to produce a computer implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other device implement the functions/acts specified in the flowchart and/or block diagram block or blocks.

[0064] The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the blocks may occur out of the order noted in the Figures. For example, two blocks shown in succession may, in fact, be accomplished as one step, executed concurrently, substantially concurrently, in a partially or wholly temporally overlapping manner, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts or carry out combinations of special purpose hardware and computer instructions.

[0065] As stated above, user localization systems attempt to estimate the location of the user. For example, such systems may attempt to use audio and vision-based schemes to estimate the location of the user. Such vision-based schemes may involve the use of cameras. However, it may not be prudent to deploy cameras throughout a home due to privacy concerns. Other such user localization systems utilize device-based tracking, which requires the user, whose location is to be estimated, to carry a device (e.g., smartphone), which may not be convenient for the user at home. Furthermore, other such user localization systems may utilize device-free radio frequency. However, such a scheme requires a large bandwidth as well as many antennas or millimeter wave chirps to achieve high accuracy, which is not easy to deploy at home. Unfortunately, such user localization systems are deficient in accurately estimating the location of the user. Furthermore, such user localization systems may require the deployment of expensive equipment.

[0066] The embodiments of the present invention provide a means for accurately estimating the location of the user using standard inexpensive equipment (e.g., smart speaker) that may already be located in the user’s home as discussed below in connection with Figures 4-9 and 10A-10B.

[0067] Figure 4 is a flowchart of a method 400 for identifying the location of the user (e.g., user 104 of Figure 1) in accordance with an embodiment of the present invention.

[0068] Referring now to Figure 4, in conjunction with Figures 1-3, in step 401, smart speaker 101 collects signals emanating from user 104 by microphone array 105 of smart speaker 101. For example, user 104 may speak the command to turn on light source 103. Such words may then be collected by microphone array 105 of smart speaker 101.
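
Step 401 amounts to a multi-channel recording. Below is a minimal capture sketch, assuming the python-sounddevice package and a hypothetical 4-channel microphone array exposed as the default input device; the rate and duration are illustrative.

```python
import sounddevice as sd   # assumes the python-sounddevice package

fs = 16000                 # assumed capture rate
duration = 2.0             # seconds of audio to collect
# Record from a hypothetical 4-channel microphone array (default device).
frames = sd.rec(int(duration * fs), samplerate=fs, channels=4)
sd.wait()                  # block until the capture completes
signals = frames.T         # (n_mics, n_samples) for the downstream AoA steps
print(signals.shape)
```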

[0069] In step 402, angle of arrival estimator 201 of smart speaker 101 estimates an angle of arrival (AoA) of the propagation paths of one or more of the signals collected in step 401.

[0070] An embodiment of estimating the AoA of the propagation paths of one or more of the signals collected in step 401 is discussed below in connection with Figure 5.

[0071] Figure 5 is a flowchart of a method 500 for estimating the AoA of the propagation paths of one or more of the collected signals (collected in step 401) in accordance with an embodiment of the present invention.

[0072] Referring to Figure 5, in conjunction with Figures 1-4, in step 501, angle of arrival estimator 201 of smart speaker 101 performs wavelet and Short-Time Fourier Transform (STFT) analysis on the signals emanating from user 104 over different time windows.

[0073] In step 502, angle of arrival estimator 201 of smart speaker 101 applies differencing to the wavelet and STFT analysis to cancel signals in a time-frequency domain.

[0074] In step 503, angle of arrival estimator 201 of smart speaker 101 uses the results from the STFT analysis on the signals emanating from user 104 to form a base.

[0075] In step 504, angle of arrival estimator 201 of smart speaker 101 searches for a nearest neighbor in the results of the wavelet analysis and in the results of applying the differencing to the wavelet and STFT analysis for each point in the base.

[0076] In step 505, angle of arrival estimator 201 of smart speaker 101 selects a number of peaks from a selected number of nearest neighbors as corresponding to the estimated angle of arrival of the propagation paths of one or more of the collected signals.
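
Steps 503 through 505 can be pictured with the sketch below: the STFT peaks serve as the base, a base AoA is kept only when every other analysis has a nearest neighbor within a tolerance, and the best-supported few are returned. The tolerance, the scoring, and the peak values are illustrative assumptions.

```python
import numpy as np

def fuse_aoa_candidates(base_peaks, other_peak_lists, tol_deg=5.0, k=3):
    """Keep a base (STFT) AoA peak only if every other analysis has a
    nearest neighbor within tol_deg; return the k best-supported peaks."""
    kept = []
    for aoa in base_peaks:
        nn = [np.abs(np.asarray(p) - aoa).min() for p in other_peak_lists]
        if max(nn) <= tol_deg:
            kept.append((sum(nn), aoa))   # smaller total distance = better
    kept.sort()
    return [aoa for _, aoa in kept[:k]]

# Hypothetical peak lists (degrees) from the three analyses.
base = [32.0, 61.0, 140.0]                # STFT analysis (the base)
wavelet_diff = [31.0, 95.0, 139.0]        # wavelet differencing
stft_diff = [33.5, 60.0, 141.0]           # STFT differencing
print(fuse_aoa_candidates(base, [wavelet_diff, stft_diff]))  # [140.0, 32.0]
```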

[0077] A more detailed discussion of method 500 is provided below.

[0078] In one embodiment, smart speaker 101 utilizes time-frequency analysis to reduce coherence in voice signals, since signals that differ in either time or frequency will be separated out. In one embodiment, smart speaker 101 separates coherent signals across different frequency bins, and then cancels the paths in each frequency bin by taking the difference between two consecutive time windows. It is noted that such a process is useful for voice signals since different pitches may occur at different times. An important decision in time-frequency analysis is selecting the sizes of the time window and frequency bin used to perform the analysis.

[0079] On the one hand, aggregating the signals over a larger time window and a larger frequency bin improves the signal-to-noise ratio (SNR) and in turn improves the AoA estimation accuracy. On the other hand, a larger time window and a larger frequency bin also means more coherent signals. Moreover, the frequency of voice signals varies unpredictably over time, which makes it challenging to determine a fixed time window and frequency bin.

[0080] To separate paths with different delays, one desires good time resolution. Small time windows have good time resolution but poor frequency resolution. To separate paths with different frequencies, one desires good frequency resolution. Similarly, small frequency bins have good frequency resolution but poor time resolution. Therefore, there is no single time window or frequency bin that works well in all cases.

[0081] To address this challenge, the principles of the present invention utilize multi-resolution analysis as illustrated in Figure 6. Figure 6 is a diagram of the multi-resolution analysis algorithm in accordance with an embodiment of the present invention.

[0082] Referring to Figures 5 and 6, in conjunction with Figures 1-4, in one embodiment, Short-Time Fourier Transform (STFT) with different window sizes 601, 602 and wavelet 603 are used as they are complementary to each other. In one embodiment, STFT using a large time window 601 is performed and its spectrogram is fed to the multiple signal classification (MUSIC) algorithm (see 604). While STFT results with a large window 601 have more coherent signals, which results in more outliers, their peaks also include points that are close to the ground truth, likely due to the stronger cumulative energy. In one embodiment, frequency analysis is performed using smaller windows 602, where the difference between adjacent windows is taken (see 605) to reduce the coherent signals and improve AoA estimation under the coherent multipath. In one embodiment, wavelet 603 is utilized, which has a higher time resolution for relatively high frequency signals. As a result, the transient voice signals that have low or no coherence can be captured, thereby reducing outliers in the MUSIC AoA estimation. However, in one embodiment, since transient signals have low cumulative energy and cause non-negligible AoA estimation errors, wavelet is combined with STFT with different window sizes as shown in Figure 6. These methods are elaborated below.

[0083] In connection with using STFT with a large window size 601, a larger window yields higher SNR and hence higher accuracy. On the other hand, a larger window tends to have more coherent multipath, which may degrade the accuracy. As a result, such an approach can provide information about the AoA of the direct path (see 604), but is not sufficient on its own.

[0084] In connection with using STFT with a short window size 602, using a smaller time window gives good time resolution and helps separate paths with different delays. In such an embodiment, the evanescent pitches are selected in the time-frequency domain to reduce error from coherence. The next step is to further reduce coherent signals by taking the difference between two consecutive time windows for each antenna (see 605). This cancels the paths with different delays in the time-frequency domain, and is more effective than cancelling in the time-domain alone. If the difference between two adjacent windows is greater than the delay difference of any two paths, this process can remove the old paths. As a result, coherence is reduced in the short time window.
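By way of illustration only, the short-window differencing step may be sketched in Python as follows; the scipy STFT call, sampling rate, window size, and synthetic test signal are illustrative assumptions rather than parameters disclosed herein:

import numpy as np
from scipy.signal import stft

def stft_diff(x, fs, win=256):
    # Short-window STFT followed by differencing of consecutive time
    # windows; paths that persist across adjacent windows cancel,
    # reducing coherence as described above.
    f, t, Z = stft(x, fs=fs, nperseg=win)
    return f, t[1:], np.diff(Z, axis=-1)

# Synthetic two-pitch test signal at an assumed 16 kHz sampling rate.
fs = 16000
n = np.arange(fs)
x = np.sin(2 * np.pi * 220 * n / fs) + 0.5 * np.sin(2 * np.pi * 440 * n / fs)
f, t, dZ = stft_diff(x, fs)
print(dZ.shape)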

[0085] In connection with using wavelet 603, wavelet is a multi-resolution analysis. In one embodiment, short basis functions are used to isolate signal discontinuities, and long basis functions are used to perform detailed frequency analysis. Wavelet 603 has superior resolution for relatively high frequency signals. Transient signals in the small time window have less energy and may yield large errors. As a result, to improve accuracy, the differences of the wavelet spectrum in the two consecutive time windows are taken (see 606) to further reduce the coherence.

[0086] In one embodiment, the AoAs derived from applying MUSIC to STFT and wavelet are compared. Figure 7 shows the results for the case where a woman speaks at 2.4 meters away from microphone array 105. In particular, Figure 7 is a diagram illustrating the comparison of the AoA derived from STFT and wavelet with and without differencing in accordance with an embodiment of the present invention.

[0087] Referring to Figure 7, in conjunction with Figures 5-6, dashed lines 701 are ground truth AoAs of different paths. The STFT results without taking the difference, shown in circles 702, deviate from the correct angles due to coherence even after using different window sizes. The wavelet results without taking the difference are plotted as circles 703, which also deviate considerably from dashed lines 701 because of low energy. Points 704 are the AoA estimates derived from MUSIC when differencing is applied to STFT and wavelet, referred to herein as the STFT Diff and Wavelet Diff methods. Compared with the original results (shown in circles 702, 703), differencing brings the estimation closer to the ground truth angles (shown as dashed lines 701). Interestingly, STFT Diff produces some false peaks, whereas the Wavelet Diff peaks are all close to the ground truth, even though STFT Diff may have individual peaks that are closer to the ground truth than those of the wavelet. This suggests that it is beneficial to combine the STFT Diff and Wavelet Diff results.

[0088] Returning to Figure 6, Figure 6 illustrates the algorithm for deriving the results using different time windows, where the combined results 607 are synthesized to select the final AoA results as discussed herein.

[0089] In one embodiment, the results from STFT, STFT differencing and wavelet differencing are combined using a linear regression. For example, training traces that contain the result from each method (e.g., MUSIC, STFT+MUSIC, and wavelet+MUSIC) and the ground truth Angle of Arrival (AoA) can be collected. Next, the training traces are utilized to train a linear regression model, such as y = Ax + b, where y is the ground truth AoA, and x is a vector that represents the results from these three individual methods. Furthermore, in one embodiment, the least squares method may be utilized to determine the parameters A and b. After training, in one embodiment, the trained model outputs the combined result using the estimations from these individual methods.

[0090] In another embodiment, a non-linear model (e.g., neural network) is trained using the training traces discussed above. In such an embodiment, the non-linear model (e.g., neural network) outputs the combined result using the estimations from these individual methods.
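As a minimal sketch of the least-squares fit described above (the training values below are invented toy numbers, not traces from the present disclosure), the parameters A and b of y = Ax + b may be estimated in Python as follows:

import numpy as np

# Toy training traces: each row holds AoA estimates (degrees) from the
# three methods (STFT, STFT Diff, Wavelet Diff); y is the ground truth.
X = np.array([[30.0, 32.0, 31.0],
              [61.0, 58.0, 60.0],
              [88.0, 91.0, 90.0],
              [119.0, 121.0, 120.0]])
y = np.array([31.0, 60.0, 90.0, 120.0])

# Augment with a constant column so the least-squares solution
# returns both the weight vector A and the intercept b.
X1 = np.hstack([X, np.ones((len(X), 1))])
coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
A, b = coef[:-1], coef[-1]

combined = X @ A + b   # fused AoA estimate for each trace
print(A, b, combined)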

[0091] In another embodiment, the multiple signal classification (MUSIC) profiles from STFT, STFT differencing, and wavelet differencing results are combined in which the peaks from the combined MUSIC profiles are selected.

[0092] In one embodiment, the weighted clusters of these points are computed, where the weight is set according to the magnitude of the MUSIC peak. In one embodiment, the top K clusters from each algorithm are selected, where K is a positive whole number.

[0093] In one embodiment, to combine the results across different algorithms, the nearest neighbor algorithm is used. Since STFT with a large window provides more stable results without significant outliers, such an algorithm is used to form the base. For each point in the base, the nearest neighbor is searched in the results of the other two methods as they contain both more accurate real peaks and outlier peaks. Finally, the top P peaks from the selected nearest neighbors are picked as the final AoA estimates. Pseudocode for the algorithm for estimating the AoA of the propagation paths of one or more of the collected signals is shown below as Algorithm 1.

[0094] Algorithm 1: Multi-Resolution Analysis Algorithm.

1:  function [AoAs, w] = MultiResolutionAoA(signal)
2:      Bandpass filter in voice frequency range
3:      spectLong = STFT(signal, LongWindow);
4:      spectShortDiff = diff(STFT(signal, ShortWindow));
5:      spectWaveletDiff = diff(Wavelet(signal));
6:      Select frequency and time ranges based on spectrograms
7:      for method in {STFTLong, STFTDiff, WaveletDiff} do
8:          for time in SelectedTimeSlots do
9:              for frequency in SelectedFrequencies do
10:                 forward-backward smoothing;
11:                 compute MUSIC profile;
12:             end for
13:             accumProfile = SUM(profile);
14:             [results, weights] = findPeaks(accumProfile);
15:             estimate candidateAoAs_m and weights_m;
16:         end for
17:     end for
18:     AoAs = select top P peaks from candidateAoAs_m for m = 1..3
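A simplified Python sketch of the fusion at the end of Algorithm 1 (steps 13-18) is given below; the peak lists and the equal-weight averaging are assumptions made for illustration, and the MUSIC-magnitude weighting described above is omitted:

import numpy as np

def combine_aoas(base_peaks, other_peak_sets, P=3):
    # For each long-window STFT peak (the base), find the closest peak
    # from each of the other methods, then keep the P candidates with
    # the smallest total disagreement. Angles are in degrees.
    scored = []
    for p in base_peaks:
        nn = [min(s, key=lambda q: abs(q - p)) for s in other_peak_sets]
        spread = sum(abs(q - p) for q in nn)
        scored.append((spread, np.mean([p, *nn])))
    scored.sort(key=lambda t: t[0])
    return [aoa for _, aoa in scored[:P]]

# Hypothetical peak lists (degrees) from the three methods.
stft_long = [31.0, 62.0, 118.0, 150.0]
stft_diff = [30.0, 60.5, 121.0]
wavelet_diff = [29.5, 61.0, 119.5]
print(combine_aoas(stft_long, [stft_diff, wavelet_diff]))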

[0095] Returning to Figure 4, in conjunction with Figures 1-3, in step 403, room structure estimator 202 of smart speaker 101 estimates the room structure by emitting chirp pulses in the room structure, such as living space 100. In one embodiment, the chirp pulses correspond to frequency-modulated continuous-wave (FMCW) chirps.
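For illustration, a linear FMCW chirp of the kind described may be generated as follows; the 48 kHz sampling rate and 100 ms duration are assumptions, while the 1 kHz - 10 kHz sweep matches the range discussed further below:

import numpy as np

def fmcw_chirp(f0, f1, T, fs):
    # One linear FMCW chirp sweeping f0..f1 Hz over T seconds.
    # The phase is the integral of the instantaneous frequency.
    t = np.arange(int(T * fs)) / fs
    k = (f1 - f0) / T                      # sweep slope (Hz/s)
    return np.sin(2 * np.pi * (f0 * t + 0.5 * k * t ** 2))

# Illustrative parameters: a 100 ms chirp from 1 kHz to 10 kHz at 48 kHz.
pulse = fmcw_chirp(1000.0, 10000.0, 0.1, 48000)
print(pulse.shape)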

[0096] In step 404, room structure estimator 202 of smart speaker 101 collects signals reflected by the reflectors (e.g., ceilings, doors, walls) of the room structure (e.g., living space 100) from the emitted chirp pulses. In one embodiment, the frequency of the FMCW signals reflected by the reflectors of the room structure from the emitted chirp pulses that are collected by smart speaker 101 is between 1 kHz and 3 kHz.

[0097] In step 405, room structure estimator 202 of smart speaker 101 estimates an angle of arrival (AoA) of the propagation paths of one or more signals reflected from the reflectors of the room structure.

[0098] An embodiment of estimating the AoA of the propagation paths of one or more signals reflected from the reflectors of the room structure is discussed below in connection with Figure 8.

[0099] Figure 8 is a flowchart of a method 800 for estimating the AoA of the propagation paths of one or more signals reflected from the reflectors of the room structure in accordance with an embodiment of the present invention.

[00100] Referring to Figure 8, in conjunction with Figures 1-4, in step 801, room structure estimator 202 of smart speaker 101 divides a FMCW signal into multiple subbands in a time domain.

[00101] In step 802, room structure estimator 202 of smart speaker 101 runs a 3D multiple signal classification (MUSIC) algorithm in each of the multiple subbands to generate a 3D MUSIC profile for each of the multiple subbands, where each 3D MUSIC profile corresponds to an azimuth AoA-distance profile.

[00102] In step 803, room structure estimator 202 of smart speaker 101 sums the generated 3D MUSIC profiles.

[00103] In step 804, room structure estimator 202 of smart speaker 101 searches and selects azimuth AoAs from the summed generated 3D MUSIC profiles that minimize a fitting error with a rectangular room.

[00104] In step 805, room structure estimator 202 of smart speaker 101 estimates the angle of arrival (AoA) of the propagation paths of one or more signals reflected from the reflectors of the room structure by adjusting the angles of the selected azimuth AoAs so that adjacent azimuth AoAs of the selected azimuth AoAs differ by exactly π/2.

[00105] In step 806, room structure estimator 202 of smart speaker 101 performs a delay-and-sum beamforming algorithm on the FMCW signals reflected by the reflectors of the room structure, forming FMCW profiles containing the estimated distances of the reflectors from smart speaker 101.

[00106] A more detailed discussion of method 800 is provided below.

[00107] In one embodiment, in order to localize the user, we need to find not only the AoAs of the propagation paths of the voice signals, but also the room structure information so as to retrace the paths. As discussed below, the principles of the present invention estimate the room contour using wideband 3D MUSIC algorithms. Accuracy is improved by leveraging constraints on the azimuth AoA and applying beamforming.

[00108] In one embodiment, smart speaker 101 estimates the room structure once unless it is moved to a new position. In one embodiment, smart speaker 101 estimates the room structure by sending FMCW chirps. Let f_c, B and T denote the center frequency, bandwidth, and duration of the chirp, respectively. Upon receiving the reflected signals, smart speaker 101 applies the 3D MUSIC algorithm.

[00109] In one embodiment, the 2D Range-Azimuth MUSIC algorithm is generalized to a 3D joint estimation of distance, azimuth AoA and elevation AoA. The 3D MUSIC algorithm has better resolution than the 2D MUSIC algorithm since the peaks that differ in any of the three dimensions are separated out. In one embodiment, the received signals are transformed into a 3D sinusoid whose frequencies are proportional to the distance and a function of the two angles. Furthermore, in one embodiment, the steering vector is extended to have three input parameters: distance R, azimuth angle θ, and elevation angle φ:

a_{i,m}(R, θ, φ) = exp(−j2π[(2BR/(cT))·m·N_s·T_s + (f_c·r/c)·cos(θ − 2πi/N)·cos(φ)]),   (1)

[00110] where i is the array index, N is the number of microphones, m indexes the temporal samples, r is the radius of the microphone array, c is the speed of sound, N_s is the subsampling rate, M_s is the temporal smoothing window (so m = 0, ..., M_s − 1) and T_s is the time interval.
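A Python sketch consistent with the reconstruction of equation (1) above is given below; the uniform circular array geometry and all parameter values are illustrative assumptions rather than values disclosed herein:

import numpy as np

def steering_vector(R, theta, phi, N=6, r=0.05, fc=2000.0,
                    B=2000.0, T=0.1, Ns=4, Ms=8, Ts=1 / 48000.0, c=343.0):
    # Steering vector per equation (1): a range term from the FMCW
    # beat frequency plus an angle term from the per-microphone
    # propagation delay on a uniform circular array.
    i = np.arange(N)                          # microphone index
    m = np.arange(Ms)                         # temporal smoothing index
    angle = (fc * r / c) * np.cos(theta - 2 * np.pi * i / N) * np.cos(phi)
    rng = (2 * B * R / (c * T)) * m * Ns * Ts
    # Broadcasting over (m, i) yields the Ms*N-element steering vector.
    return np.exp(-2j * np.pi * (rng[:, None] + angle[None, :])).ravel()

a = steering_vector(R=2.0, theta=np.deg2rad(45), phi=np.deg2rad(10))
print(a.shape)   # (Ms * N,)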

[00111] However, there are several challenges in applying the 3D MUSIC algorithm to indoor environments. First, the number of microphones and the size of the array are both limited, which limits the resolution of the 3D MUSIC algorithm. Second, there is significant reverberation in indoor scenarios. Third, a large bandwidth is required to get an accurate distance estimation, but MUSIC requires narrowband signals for AoA estimation. Therefore, three techniques were developed to improve the 3D MUSIC algorithm: (i) leveraging frequency diversity, (ii) incorporating the fact that rooms are typically rectangular shaped, and (iii) using beamforming to improve distance estimation.

[00112] In one embodiment, FMCW signals from 1 kHz to 3 kHz are used for AoA estimation. To satisfy the narrowband requirement of the MUSIC algorithm, in one embodiment, the 2 kHz bandwidth is divided into 20 subbands of 100 Hz each. In one embodiment, since the frequency of the FMCW signal increases linearly over time, room structure estimator 202 divides the FMCW signal into multiple subbands in the time domain, runs the 3D MUSIC algorithm in each subband to generate a 3D MUSIC profile (azimuth AoA-distance profile) for each of the subbands, and then sums up the 3D MUSIC profiles from all the subbands.
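The split-and-sum over subbands may be sketched as follows; the profile function below is a stand-in magnitude spectrum rather than the 3D MUSIC computation described above, and the array shapes are assumptions:

import numpy as np

def subband_profiles(x, n_sub, profile_fn):
    # Split a linear FMCW recording into n_sub equal time chunks (each
    # spanning an equal slice of the swept band, since frequency grows
    # linearly with time) and sum the per-subband profiles.
    chunks = np.array_split(x, n_sub, axis=-1)
    return sum(profile_fn(c) for c in chunks)

# Toy stand-in for the 3D MUSIC step: a summed magnitude spectrum.
profile = subband_profiles(np.random.randn(4, 4800), 20,
                           lambda c: np.abs(np.fft.rfft(c, 64, axis=-1)).sum(0))
print(profile.shape)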

[00113] In one embodiment, in order to use the 100 Hz subband for the 3D MUSIC algorithm, the transmission signal is aligned with the received signal so that they span the same subband. In one embodiment, the alignment is determined by the distance. Therefore, a peak is identified in the 3D MUSIC profile (azimuth AoA-distance profile), which is obtained by mixing the received signal with the transmitted signal that was sent δT earlier, where δT is the propagation delay and is determined based on the distance.

[00114] In one embodiment, the azimuth AoA and distance output from the 3D MUSIC algorithm are used as shown in Figure 9. Figure 9 illustrates an exemplary azimuth-distance profile in accordance with an embodiment of the present invention.

[00115] Due to multipath, the MUSIC profile can be noisy, which makes it difficult to determine the right peaks to use for distance and AoA estimation of the walls. In one embodiment, since the shapes of most rooms in a living area correspond to rectangular shapes, such information is leveraged to improve peak selection. Specifically, room structure estimator 202 selects the peaks such that the difference in the azimuth AoA of two consecutive peaks is as close to 90° as possible. In one embodiment, room structure estimator 202 searches for the 4 peaks θ₀, θ₁, θ₂, θ₃ from the 3D MUSIC profile that minimize the fitting error with a rectangular room (i.e., min Σᵢ |PhaseDiff(θᵢ, θᵢ₊₁) − π/2|, where PhaseDiff(·) is the difference between two angles, taking into account that the phase wraps every 2π).

[00116] After finding these peaks, room structure estimator 202 adjusts the solutions so that the difference between adjacent AoAs is exactly π/2. In one embodiment, this can be done by finding the θⱼ that minimizes the fitting error, whereupon the final AoAs are set to (θⱼ, θⱼ + π/2, θⱼ + π, θⱼ + 3π/2).
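A Python sketch of this rectangular-room fitting and snapping is given below; the candidate peak list is hypothetical, and the circular-mean snap is one reasonable reading of the adjustment step rather than the exact procedure:

import numpy as np
from itertools import combinations

def phase_diff(a, b):
    # Smallest absolute difference between two angles (radians),
    # accounting for the 2*pi phase wrap.
    d = (b - a) % (2 * np.pi)
    return min(d, 2 * np.pi - d)

def fit_rectangle(peaks):
    # Choose the 4 azimuth peaks whose consecutive gaps best match
    # pi/2, then snap them to an exact (theta, theta + pi/2, ...) fan.
    best, best_err = None, np.inf
    for combo in combinations(sorted(peaks), 4):
        err = sum(abs(phase_diff(combo[i], combo[(i + 1) % 4]) - np.pi / 2)
                  for i in range(4))
        if err < best_err:
            best, best_err = combo, err
    # Circular mean of the quadrant-removed offsets gives theta.
    offs = np.array([a - k * np.pi / 2 for k, a in enumerate(best)])
    theta = np.angle(np.mean(np.exp(1j * offs))) % (2 * np.pi)
    return [(theta + k * np.pi / 2) % (2 * np.pi) for k in range(4)]

# Hypothetical azimuth peaks (radians) from a noisy MUSIC profile.
peaks = [0.10, 1.62, 3.20, 4.70, 2.40]
print(np.rad2deg(fit_rectangle(peaks)))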

[00117] Accurate distance estimation requires a large bandwidth and high SNR. Therefore, to improve distance estimation, in one embodiment, smart speaker 101 sends 1 kHz - 10 kHz FMCW chirps. Among them, in one embodiment, only the 1 kHz - 3 kHz FMCW chirps are used for AoA estimation in order to reduce computational cost, since MUSIC requires expensive eigenvalue decomposition, but the full 1 kHz - 10 kHz FMCW chirps are used for distance estimation. In one embodiment, the SNR is increased using beamforming. In one embodiment, the delay-and-sum (DAS) beamforming algorithm is utilized by room structure estimator 202 towards the estimated azimuth AoAs. Then, room structure estimator 202 searches for a peak in the beamformed FMCW profile. It has been discovered that after beamforming, the peak magnitude increases significantly, yielding a more accurate distance estimation.
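A sketch of delay-and-sum beamforming toward an estimated azimuth is given below; the six-microphone circular array, the 5 cm radius, and the FFT-based fractional delay are assumptions for illustration:

import numpy as np

def das_beamform(x, theta, fs, r=0.05, c=343.0):
    # Delay-and-sum toward azimuth theta for a uniform circular array
    # of x.shape[0] microphones: advance each channel by its geometric
    # delay (via an FFT phase shift) and average the channels.
    n_mic, n = x.shape
    mic_ang = 2 * np.pi * np.arange(n_mic) / n_mic
    delays = r * np.cos(theta - mic_ang) / c          # seconds per mic
    f = np.fft.rfftfreq(n, d=1 / fs)
    X = np.fft.rfft(x, axis=-1)
    X *= np.exp(2j * np.pi * f[None, :] * delays[:, None])
    return np.fft.irfft(X.sum(axis=0), n=n) / n_mic

y = das_beamform(np.random.randn(6, 4800), np.deg2rad(30), fs=48000)
print(y.shape)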

[00118] Returning to Figure 4, in conjunction with Figures 1-3 and 5-9, in step 406, constrained beam retracing engine 203 of smart speaker 101 identifies a location of user 104 by retracing the propagation path of the signal(s) collected from user 104 and retracing the propagation paths of the signals(s) collected from the reflectors of the room structure. “Retracing,” as used herein, refers to tracing back the propagation path of the signal(s) collected from user 104 and tracing back the propagation path of the signal(s) collected from the reflectors of the room structure. In one embodiment, the location of user 104 is determined by retracing each of the propagation paths of the signals collected from user 104 and retracing each of the propagation paths of the signals collected from the reflectors of the room structure as a cone structure resulting in a plurality of cone structures, where the location of user 104 corresponds to a point in the cone structures such that a circle centered at the point with a radius of 0.5 m overlaps with a maximum number of cones as discussed further below. In one embodiment, the width of each of the cone structures corresponds to a peak width obtained using the 3D MUSIC algorithm on the FMCW signals reflected by the reflectors of the room structure.

[00119] In one embodiment, the user can be localized by retracing the paths using the estimated AoA of the voice signals and the room structure. As shown in Figure 10A, in conjunction with Figures 1-4, constrained beam retracing engine 203 may first find the reflection points on the walls by the propagation path derived from the estimated AoA. Figure 10A illustrates retracing using a ray structure for each of the two near parallel paths in accordance with an embodiment of the present invention. Then, constrained beam retracing engine 203 traces back the incoming path of voice signals before the wall reflection based on the reflection property.

[00120] In an alternative embodiment, to reduce localization error, the following strategies may be employed by constrained beam retracing engine 203. First, instead of treating each propagation path as a ray defined by the estimated AoA, the propagation paths are treated as a cone where the cone center is determined by the estimated AoA and the cone width is determined by the MUSIC peak width. This allows one to capture the uncertainty in the AoA estimation.

[00121] Second, while theoretically two paths are sufficient to perform triangulation, it is challenging to select the right paths for triangulation. Therefore, instead of prematurely selecting incorrect paths, constrained beam retracing engine 203 lets the AoA estimation procedure return more paths so that the room structure is incorporated to make an informed decision on which paths to use for localization. Specifically, for each of the K paths returned by the AoA estimation, constrained beam retracing engine 203 traces back using the cone structure as shown in Figure 10B. Figure 10B illustrates retracing using a cone structure for each of the paths in accordance with an embodiment of the present invention. It has been observed that the azimuth AoA is reliable for the strongest path, which is the direct path in LoS or the path from user 104 to the ceiling and then to the microphone (e.g., microphone of microphone array 105) in NLoS. Therefore, within the cone corresponding to the strongest path, constrained beam retracing engine 203 searches for a point O such that the circle centered at the point with a radius of 0.5 m overlaps with the maximum number of cones corresponding to the other K-1 paths. In one embodiment, the user is localized at the point O. In one embodiment, K is any whole number greater than 2.
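A simplified two-dimensional sketch of this cone search follows; the cone parameters, the search grid, and the geometric overlap test are assumptions chosen to illustrate the circle-overlap criterion:

import numpy as np

def circle_overlaps_cone(p, apex, direction, half_width, radius=0.5):
    # True if a circle of the given radius centered at p intersects the
    # 2D cone (apex, unit boresight direction, angular half-width):
    # the angular distance from the boresight, minus the angle the
    # circle subtends at the apex, must be within the half-width.
    v = p - apex
    d = np.linalg.norm(v)
    if d < radius:
        return True
    ang = np.arccos(np.clip(np.dot(v / d, direction), -1.0, 1.0))
    return ang - np.arcsin(min(radius / d, 1.0)) <= half_width

def locate(cones, grid):
    # Grid-search candidate points (assumed to lie inside the strongest
    # path's cone) for the one overlapping the most cones.
    counts = [sum(circle_overlaps_cone(p, *c) for c in cones) for p in grid]
    return grid[int(np.argmax(counts))]

# Hypothetical cones: (apex, unit boresight, half-width in radians).
cones = [(np.zeros(2), np.array([1.0, 0.0]), 0.1),
         (np.array([4.0, 0.0]), np.array([-0.8, 0.6]), 0.12)]
grid = [np.array([x, y]) for x in np.linspace(0.5, 4, 15)
        for y in np.linspace(-2, 2, 15)]
print(locate(cones, grid))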

[00122] In one embodiment, the location of the user is determined by retracing each of the propagation paths of the signals collected from the user based on the positions and orientations of the reflectors of the room structure as a cone structure resulting in a plurality of cone structures, where each point in the cone structure is assigned a probability based on the distance from the peaks in the multiple signal classification (MUSIC) profiles. The joint probability of a point in space is computed as the product of probabilities from the cone structures corresponding to all the reflectors. The location of the user is then derived as the weighted centroid of all the points where the weights are the joint probabilities.

[00123] In one embodiment, the width of the cone structure is determined by the width of the peaks in the multiple signal classification (MUSIC) profile. The location of the user then corresponds to a point in the plurality of cone structures such that a circle centered at the point overlaps with a maximum number of cones.

[00124] In another embodiment, the location of the user is determined by retracing each of the propagation paths of the signals collected from the user based on the positions and orientations of the reflectors of the room structure as an intersection of the propagation paths from multiple reflectors.

[00125] In one embodiment, the positions and orientations of the reflectors of the room structure are estimated using the multiple signal classification (MUSIC) algorithm.

[00126] In step 407, smart speaker 101 interprets a command from user 104 in connection with the identified location of user 104. For example, the user may have instructed smart speaker 101 to turn on the light; however, in a living space 100 containing multiple lights, smart speaker 101 may not know which light to turn on without knowing the user’s location.

[00127] In one embodiment, after identifying the user’s location, smart speaker 101 may perform a look-up in a table containing a listing of commands, objects and locations. Such a table may be stored in a storage device (e.g., memory 304). For example, after hearing the words to “turn on the light,” which smart speaker 101 may interpret using speech recognition software and natural language processing, such a statement may be associated with the command (e.g., “turn on light source”). For instance, speech recognition software may be used to recognize and translate spoken words into text. Examples of such speech recognition software include Braina Pro, e-Speaking, IBM® Watson Speech to Text, Amazon® Transcribe, etc. Such text may be listed in a table and associated with a command, such as “turn on light source.” Each command may then be associated with an object, such as a light source, and a location within living area 100 (e.g., room 102A). After identifying the location of user 104, the appropriate object associated with the command may be identified. For example, if the user’s location corresponded to being within room 102A, then the command to turn on the light associated with light source 103 in room 102A will be identified. In this manner, smart speaker 101 will be able to determine which light source 103 to activate.

[00128] In this manner, the location of the user can be determined by the smart device (e.g., smart speaker) localizing the user’s voice, even in situations when the user is not within the line of sight of the smart device, which could not be performed in prior user localization systems. First, by knowing the user’s location, the smart device, such as a smart speaker, can beamform its transmission to the user so that it can both hear from and transmit to a faraway user. Second, the user location gives context information, which can help to better interpret the user's intent. When the user issues the command to turn on the light, the smart device (e.g., smart speaker) can resolve the ambiguity and tell which light to turn on depending on the user's location. In addition, knowing the location also enables location-based services. For instance, a smart device (e.g., smart speaker) can automatically adjust the temperature and lighting condition near the user. Moreover, location information can also help with speech recognition and natural language processing by providing important context information. For example, when a user says "orange" in the kitchen, the system knows that refers to a fruit; when the same user says "orange" elsewhere, it may interpret that as a color.

[00129] Furthermore, as discussed herein, the principles of the present invention may identify the location of the user via the use of a single smart speaker, such as in a building with multiple rooms, without the need for a smart speaker to be located in each room.

[00130] Additionally, the principles of the present invention may identify the location of the user via the use of multiple smart speakers, such as in a building. In such an embodiment, the smart speakers may collectively be utilized to estimate the location of the user. For example, in one embodiment, for non-overlapping sounds, each smart speaker independently records the received sound. The smart speaker that receives the highest volume will run the algorithm of the present invention to localize the sound. For overlapping sounds, each smart speaker uses MUSIC to determine the AoA, beamforms to the estimated AoA, and records the sound from each beamforming angle. Then the smart speakers share the recorded sounds and cluster them to determine the number of unique sounds. For each sound (or cluster), the present invention identifies the smart speaker that “hears” the loudest sound (or highest peak in the MUSIC profile) and appoints that smart speaker to localize the sound using the approach of the present invention.

[00131] With respect to localizing multiple sounds, for non-overlapping sounds, the approach of the present invention to localize the sound can be applied directly. For overlapping sounds, the MUSIC algorithm may be implemented to estimate the Angle of Arrival (AoA) as well as to beamform towards each AoA. The AoAs are clustered based on the similarity of the sound, and the AoAs corresponding to the sounds in the same cluster are used as the AoA for that sound. The approach of the present invention is then applied to localize the sound in each cluster. In one embodiment, a clustering algorithm (e.g., the k-means clustering algorithm) determines the number of sounds that are overlapping. Spectral clustering may then be used to automatically determine the number of clusters.
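By way of a loose sketch only, the cluster-count selection and clustering might be approximated as follows; the eigengap heuristic, the similarity kernel, and the feature layout are assumptions and not the exact procedure of the present disclosure:

import numpy as np
from sklearn.cluster import KMeans

def cluster_sounds(features, max_k=4):
    # Pick the number of clusters k via the eigengap of a normalized
    # similarity Laplacian (a common spectral heuristic), then cluster
    # the per-AoA sound features with k-means.
    S = np.exp(-np.square(features[:, None] - features[None, :]).sum(-1))
    d = S.sum(1)
    L = np.eye(len(S)) - S / np.sqrt(np.outer(d, d))   # normalized Laplacian
    evals = np.sort(np.linalg.eigvalsh(L))
    k = int(np.argmax(np.diff(evals[:max_k + 1]))) + 1  # eigengap index
    return KMeans(n_clusters=k, n_init=10).fit_predict(features)

# Toy features: two well-separated groups of "sounds".
feats = np.vstack([np.random.randn(5, 3) + 4, np.random.randn(5, 3)])
print(cluster_sounds(feats))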

[00132] In one embodiment, the present invention tracks a sound generated from a moving source, such as a user. The approach of the present invention may be used to localize the sound during each snapshot. In one embodiment, accuracy is enhanced by leveraging the temporal relationship in the movement. In one embodiment, when cones are used to retrace, a term is added to minimize the change in the positions during two consecutive intervals. In one embodiment, the location of the user corresponds to a point in the plurality of cone structures such that a circle centered at the point overlaps with a maximum number of cones and is also close to the previous position.

[00133] In one embodiment, the present invention localizes a user in a different room based on the room structure and the AoA for multiple paths that the user’s sound traverses.

[00134] As a result of the foregoing, embodiments of the present invention provide a means for improving the technology or technical field of user localization systems by more accurately estimating the location of the user using standard inexpensive equipment (e.g., smart speaker).

[00135] Furthermore, the present invention improves the technology or technical field involving user localization systems. As discussed above, user localization systems attempt to estimate the location of the user. For example, such systems may attempt to use audio and vision-based schemes to estimate the location of the user. Such vision-based schemes may involve the use of cameras. However, it may not be prudent to deploy cameras throughout a home due to privacy concerns. Other such user localization systems utilize device-based tracking, which requires the user, whose location is to be estimated, to carry a device (e.g., smartphone), which may not be convenient for the user at home. Furthermore, other such user localization systems may utilize device-free radio frequency. However, such a scheme requires a large bandwidth as well as many antennas or millimeter wave chirps to achieve high accuracy, which is not easy to deploy at home. Unfortunately, such user localization systems are deficient in accurately estimating the location of the user. Furthermore, such user localization systems may require the deployment of expensive equipment.

[00136] Embodiments of the present invention improve such technology by having a smart device, such as a smart speaker, collect signals emanating from a user in a living space by a microphone array of the smart speaker. The angles of arrival of the propagation paths of one or more of the collected signals are then estimated. Furthermore, the smart speaker estimates a room structure by emitting chirp pulses in the room structure. The smart speaker then collects the signals reflected by reflectors of the room structure from the emitted chirp pulses. The smart speaker then estimates the angle of arrival of the propagation paths of one or more signals collected from the reflectors of the room structure. The location of the user is then identified by retracing the propagation paths of the one or more signals collected from the user and the propagation paths of the one or more signals collected from the reflectors of the room structure. In this manner, the location of the user can be determined by the smart device (e.g., smart speaker) localizing the user’s voice, even in situations when the user is not within the line of sight of the smart device, which could not be performed in prior user localization systems. Furthermore, in this manner, the location of the user can be more accurately identified than in prior user localization systems. Additionally, the present invention utilizes inexpensive equipment to identify the location of the user as opposed to using expensive equipment as in prior user localization systems. Consequently, in this manner, there is an improvement in the technical field involving user localization systems.

[00137] The technical solution provided by the present disclosure cannot be performed in the human mind or by a human using a pen and paper. That is, the technical solution provided by the present disclosure could not be accomplished in the human mind or by a human using a pen and paper in any reasonable amount of time and with any reasonable expectation of accuracy without the use of a computer.

[00138] The descriptions of the various embodiments of the present disclosure have been presented for purposes of illustration, but are not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, the practical application or technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.