

Title:
SYSTEM AND METHOD FOR ADAPTIVE DIALOGUE VIA SCENE MODELING USING COMBINATIONAL NEURAL NETWORKS
Document Type and Number:
WIPO Patent Application WO/2021/030449
Kind Code:
A1
Abstract:
The present teaching relates to methods, systems, media, and implementations for adaptive dialogue. One or more images of a dialogue scene are received that capture the surroundings of a dialogue between a user and a machine conducted based on a dialogue policy. One or more objects present in the scene are detected from the images based on a first artificial neural network. Spatial relationships among the one or more objects are detected from the images based on a second artificial neural network. Scene modeling information of the scene is then generated based on the one or more objects and the spatial relationships and is to be used for adaptive dialogue.

Inventors:
HUANG SIYUAN (US)
Application Number:
PCT/US2020/045951
Publication Date:
February 18, 2021
Filing Date:
August 12, 2020
Assignee:
DMAI INC (US)
International Classes:
G06K9/00; G06T7/00; G06T7/70; G10L15/00
Foreign References:
US20140324429A12014-10-30
US20170330319A12017-11-16
US20030091226A12003-05-15
CN105913039A2016-08-31
US20160219048A12016-07-28
Other References:
NESSELRATH: "SiAM-dp: an open development platform for massively multimodal dialogue systems in cyber-physical environments", Diss., Saarländische Universitäts- und Landesbibliothek, 17 December 2015 (2015-12-17), XP055793671, Retrieved from the Internet [retrieved on 2020-10-14]
Attorney, Agent or Firm:
GADKAR, Arush (US)
Claims:
WE CLAIM:

1. A method implemented on at least one machine including at least one processor, memory, and communication platform capable of connecting to a network for adaptive dialogue, the method comprising: receiving one or more images capturing surrounding of a dialogue scene in which a user and a machine are engaged in a dialogue conducted based on a dialogue policy; detecting, via a first type of artificial neural network, one or more objects present in the dialogue scene from the one or more images; detecting, via a second type of artificial neural network, spatial relationships among the one or more objects based on the one or more images; and generating scene modeling information characterizing the dialogue scene based on the one or more objects and the spatial relationships, wherein the scene modeling information is used for adaptive dialogue.

2. The method of claim 1, wherein the first type of artificial neural network is a convolutional neural network (CNN), configured based on CNN based object/feature detection models trained via supervised learning based on labeled objects detected from training images.

3. The method of claim 1, wherein the second type of artificial neural network is a graph neural network (GNN), configured based on GNN based spatial relation detection models trained via supervised learning based on labeled spatial relationships among objects detected from training images.

4. The method of claim 1, further comprising determining when to adapt the dialogue by assessing at least one of: whether the dialogue needs to be adapted; and whether the dialogue scene is to be augmented to allow the dialogue to be adapted in an augmented dialogue scene.

5. The method of claim 4, further comprising, when the dialogue scene is to be augmented, selecting one or more virtual objects to be rendered in the dialogue scene; determining parameters for rendering the one or more virtual objects based on the scene modeling information; and projecting the one or more virtual objects in the dialogue scene in accordance with the parameters to create the augmented dialogue scene.

6. The method of claim 4, further comprising adapting the dialogue in the augmented dialogue scene in accordance with an augmented dialogue policy consistent with the augmented dialogue scene.

7. The method of claim 6, wherein the step of adapting the dialogue in the augmented dialogue scene comprises: generating the augmented dialogue policy based on the augmented dialogue scene and the dialogue policy; and conducting the dialogue in the augmented dialogue scene based on the augmented dialogue policy.

8. Machine readable and non-transitory medium having information recorded thereon for adaptive dialogue, wherein the information, when read by the machine, causes the machine to perform: receiving one or more images capturing surrounding of a dialogue scene in which a user and a machine are engaged in a dialogue conducted based on a dialogue policy; detecting, via a first type of artificial neural network, one or more objects present in the dialogue scene from the one or more images; detecting, via a second type of artificial neural network, spatial relationships among the one or more objects based on the one or more images; and generating scene modeling information characterizing the dialogue scene based on the one or more objects and the spatial relationships, wherein the scene modeling information is used for adaptive dialogue.

9. The medium of claim 8, wherein the first type of artificial neural network is a convolutional neural network (CNN), configured based on CNN based object/feature detection models trained via supervised learning based on labeled objects detected from training images.

10. The medium of claim 8, wherein the second type of artificial neural network is a graph neural network (GNN), configured based on GNN based spatial relation detection models trained via supervised learning based on labeled spatial relationships among objects detected from training images.

11. The medium of claim 8, wherein the information, when read by the machine, further causes the machine to perform determining when to adapt the dialogue by assessing at least one of: whether the dialogue needs to be adapted; and whether the dialogue scene is to be augmented to allow the dialogue to be adapted in an augmented dialogue scene.

12. The medium of claim 11, wherein the information, when read by the machine, further causes the machine to perform, when the dialogue scene is to be augmented, selecting one or more virtual objects to be rendered in the dialogue scene; determining parameters for rendering the one or more virtual objects based on the scene modeling information; and projecting the one or more virtual objects in the dialogue scene in accordance with the parameters to create the augmented dialogue scene.

13. The medium of claim 11, wherein the information, when read by the machine, further causes the machine to perform adapting the dialogue in the augmented dialogue scene in accordance with an augmented dialogue policy consistent with the augmented dialogue scene.

14. The medium of claim 13, wherein the step of adapting the dialogue in the augmented dialogue scene comprises: generating the augmented dialogue policy based on the augmented dialogue scene and the dialogue policy; and conducting the dialogue in the augmented dialogue scene based on the augmented dialogue policy.

15. A system for adaptive dialogue, comprising: a dynamic dialogue scene modeling unit configured for receiving one or more images capturing surrounding of a dialogue scene in which a user and a machine are engaged in a dialogue conducted based on a dialogue policy; an object/feature detection unit configured for detecting, via a first type of artificial neural network, one or more objects present in the dialogue scene from the one or more images; an object spatial relation detector configured for detecting, via a second type of artificial neural network, spatial relationships among the one or more objects based on the one or more images; and a scene model generation unit configured for generating scene modeling information characterizing the dialogue scene based on the one or more objects and the spatial relationships, wherein the scene modeling information is used for adaptive dialogue.

16. The system of claim 15, wherein the first type of artificial neural network is a convolutional neural network (CNN), configured based on CNN based object/feature detection models trained via supervised learning based on labeled objects detected from training images.

17. The system of claim 15, wherein the second type of artificial neural network is a graph neural network (GNN), configured based on GNN based spatial relation detection models trained via supervised learning based on labeled spatial relationships among objects detected from training images.

18. The system of claim 15, further comprising a dialogue manager configured for determining when to adapt the dialogue by assessing at least one of: whether the dialogue needs to be adapted; and whether the dialogue scene is to be augmented to allow the dialogue to be adapted in an augmented dialogue scene.

19. The system of claim 18, further comprising, an augmented scene generation unit configured for, when the dialogue scene is to be augmented, selecting one or more virtual objects to be rendered in the dialogue scene; determining parameters for rendering the one or more virtual objects based on the scene modeling information; and projecting the one or more virtual objects in the dialogue scene in accordance with the parameters to create the augmented dialogue scene.

20. The system of claim 18, wherein the dialogue manager is further configured for adapting the dialogue in the augmented dialogue scene in accordance with an augmented dialogue policy consistent with the augmented dialogue scene.

21. The system of claim 20, further comprising an augmented dialogue policy generator configured for generating the augmented dialogue policy based on the augmented dialogue scene and the dialogue policy, wherein the dialogue manager is further configured for conducting the dialogue in the augmented dialogue scene based on the augmented dialogue policy.

Description:
SYSTEM AND METHOD FOR ADAPTIVE DIALOGUE VIA SCENE MODELING USING COMBINATIONAL NEURAL NETWORKS

CROSS-REFERENCE TO RELATED APPLICATIONS

The present application claims priority from U.S. Provisional Patent Application 62/885,526, filed August 12, 2019, the contents of which are incorporated herein by reference in their entirety.

BACKGROUND

1. Technical Field

[0001] The present teaching generally relates to computers. More specifically, the present teaching relates to a computerized intelligent agent.

2. Technical Background

[0002] With the advancement of artificial intelligence technologies and the explosion of Internet-based communications due to ubiquitous Internet connectivity, computer-aided dialogue systems have become increasingly popular. For example, more and more call centers deploy automated dialogue robots to handle customer calls. Hotels have installed various kiosks that can answer questions from tourists or guests. Online bookings (whether travel accommodations or theater tickets, etc.) are also more frequently done by chatbots. In recent years, automated human-machine communications in other areas are also becoming more and more popular.

[0003] Such traditional computer-aided dialogue systems are usually pre-programmed with certain questions and answers based on commonly known patterns of conversations in specific relevant domains. Unfortunately, a human conversant can be unpredictable and sometimes does not follow a pre-planned dialogue pattern. In addition, in certain situations, a human conversant may digress during the process, and continuing the fixed conversation patterns will likely cause irritation or loss of interest. When this happens, such traditional machine dialogue systems often will not be able to continue to engage the human conversant, so that the human-machine dialogue either has to be aborted to hand the task to a human operator or the human conversant simply leaves the dialogue, which is undesirable.

[0004] In addition, traditional machine-based dialogue systems are often not designed to address the emotional factor of a human, let alone taking into consideration how to address the emotional aspect of a conversation when conversing with a human. For example, a traditional machine dialogue system usually does not initiate a conversation unless a human activates the system or asks some questions. Even if a traditional dialogue system does initiate a conversation, it has a fixed way to conduct the conversation and does not change from human to human or adjust based on observations. As such, because they are programmed to faithfully follow the pre-designed dialogue pattern, such systems are usually not able to react to the unplanned dynamics of the conversation and adapt in order to keep the conversation going in a way that can continue to engage the human. For example, when a human involved in a dialogue is clearly annoyed or frustrated, a traditional machine dialogue system is completely unaware and will continue to press the conversation in the same manner that has annoyed the human. This not only makes the conversation unpleasant (of which the traditional machine dialogue system is still unaware) but also turns the person away from dialogues with any machine-based dialogue system in the future.

[0005] In some applications, conducting a human-machine dialogue session based on what is observed from the human is crucially important in order to determine how to proceed effectively. One example is an education-related dialogue. When a chatbot is used for teaching a child to read, whether the child is receptive to the way he/she is being taught has to be monitored and addressed continuously in order to be effective. Another limitation of traditional dialogue systems is their context unawareness. For example, a traditional dialogue system is not equipped with the ability to observe the context of a conversation and improvise a dialogue strategy in order to engage a user in a manner relevant to the context and improve the user experience.

[0006] Thus, there is a need for methods and systems that address such limitations.

SUMMARY

[0007] The teachings disclosed herein relate to methods, systems, and programming for data processing. More particularly, the present teaching relates to methods, systems, and programming related to modeling a scene to generate scene modeling information and utilization thereof.

[0008] In one example, a method is disclosed, implemented on a machine having at least one processor, storage, and a communication platform capable of connecting to a network, for adaptive dialogue. One or more images of a dialogue scene are received that capture the surroundings of a dialogue between a user and a machine conducted based on a dialogue policy. One or more objects present in the scene are detected from the images based on a first artificial neural network. Spatial relationships among the one or more objects are detected from the images based on a second artificial neural network. Scene modeling information of the scene is then generated based on the one or more objects and the spatial relationships and is to be used for adaptive dialogue.

[0009] In a different example, the present teaching discloses a system for adaptive dialogue, which includes a dynamic dialogue scene modeling unit, an object/feature detection unit, an object spatial relation detector, and a scene model generation unit. The dynamic dialogue scene modeling unit is configured for receiving one or more images capturing the surroundings of a dialogue scene in which a user and a machine are engaged in a dialogue conducted based on a dialogue policy. The object/feature detection unit is configured for detecting, via a first type of artificial neural network, one or more objects present in the dialogue scene from the one or more images. The object spatial relation detector is configured for detecting, via a second type of artificial neural network, spatial relationships among the one or more objects based on the one or more images. The scene model generation unit is configured for generating scene modeling information characterizing the dialogue scene based on the one or more objects and the spatial relationships, wherein the scene modeling information is used for adaptive dialogue.

[0010] Other concepts relate to software for implementing the present teaching. A software product, in accord with this concept, includes at least one machine-readable non-transitory medium and information carried by the medium. The information carried by the medium may be executable program code data, parameters in association with the executable program code, and/or information related to a user, a request, content, or other additional information.

[0011] In one example, a machine-readable, non-transitory, and tangible medium having data recorded thereon for adaptive dialogue is disclosed, wherein the data, when read by the machine, causes the machine to perform a series of steps. One or more images of a dialogue scene are received that capture the surroundings of a dialogue between a user and a machine conducted based on a dialogue policy. One or more objects present in the scene are detected from the images based on a first artificial neural network. Spatial relationships among the one or more objects are detected from the images based on a second artificial neural network. Scene modeling information of the scene is then generated based on the one or more objects and the spatial relationships and is to be used for adaptive dialogue.

[0012] Additional advantages and novel features will be set forth in part in the description which follows, and in part will become apparent to those skilled in the art upon examination of the following and the accompanying drawings or may be learned by production or operation of the examples. The advantages of the present teachings may be realized and attained by practice or use of various aspects of the methodologies, instrumentalities and combinations set forth in the detailed examples discussed below.

BRIEF DESCRIPTION OF THE DRAWINGS

[0013] The methods, systems and/or programming described herein are further described in terms of exemplary embodiments. These exemplary embodiments are described in detail with reference to the drawings. These embodiments are non-limiting exemplary embodiments, in which like reference numerals represent similar structures throughout the several views of the drawings, and wherein:

[0014] Fig. 1 depicts a networked environment for facilitating a dialogue between a user operating a user device and an agent device in conjunction with a user interaction engine, in accordance with an embodiment of the present teaching;

[0015] Figs. 2A-2B depict connections among a user device, an agent device, and a user interaction engine during a dialogue, in accordance with an embodiment of the present teaching;

[0016] Fig. 3A illustrates an exemplary structure of an agent device with exemplary types of agent body, in accordance with an embodiment of the present teaching;

[0017] Fig. 3B illustrates an exemplary agent device, in accordance with an embodiment of the present teaching;

[0018] Fig. 4A illustrates an exemplary dialogue scene;

[0019] Fig. 4B illustrates exemplary aspects of operation for achieving adaptive dialogue strategy, in accordance with an embodiment of the present teaching;

[0020] Fig. 5A depicts an exemplary high level system diagram of a system for learning dialogue environment modeling via combinational neural networks, in accordance with an embodiment of the present teaching;

[0021] Fig. 5B is a flowchart of an exemplary process of a system for learning dialogue environment modeling via combinational neural networks, in accordance with an embodiment of the present teaching;

[0022] Fig. 6A depicts an exemplary high level system diagram of a dynamic dialogue scene modeling unit, in accordance with an embodiment of the present teaching;

[0023] Fig. 6B is a flowchart of an exemplary process of a dynamic dialogue scene modeling unit, in accordance with an embodiment of the present teaching;

[0024] Fig. 7A depicts an exemplary high level system diagram of a dialogue system in an augmented reality scene built via dialogue scene modeling, in accordance with an embodiment of the present teaching;

[0025] Fig. 7B is a flowchart of an exemplary process of a dialogue system in an augmented reality scene built via dialogue scene modeling, in accordance with an embodiment of the present teaching;

[0026] Fig. 8A depicts an exemplary high level system diagram of a dialogue system in a virtual reality scene built via dialogue scene modeling, in accordance with an embodiment of the present teaching;

[0027] Fig. 8B is a flowchart of an exemplary process of a dialogue system in a virtual reality scene built via dialogue scene modeling, in accordance with an embodiment of the present teaching;

[0028] Fig. 9 is an illustrative diagram of an exemplary mobile device architecture that may be used to realize a specialized system implementing the present teaching in accordance with various embodiments; and

[0029] Fig. 10 is an illustrative diagram of an exemplary computing device architecture that may be used to realize a specialized system implementing the present teaching in accordance with various embodiments.

DETAILED DESCRIPTION

[0030] In the following detailed description, numerous specific details are set forth by way of examples in order to facilitate a thorough understanding of the relevant teachings. However, it should be apparent to those skilled in the art that the present teachings may be practiced without such details. In other instances, well known methods, procedures, components, and/or circuitry have been described at a relatively high-level, without detail, in order to avoid unnecessarily obscuring aspects of the present teachings.

[0031] The present teaching aims to address the deficiencies of the traditional human-machine dialogue systems and to provide methods and systems that enable a more effective and realistic human-to-machine dialogue. The present teaching incorporates artificial intelligence in an automated companion with an agent device, in conjunction with the backbone support from a user interaction engine, so that the automated companion can conduct a dialogue based on continuously monitored multimodal data indicative of the surroundings of the dialogue, adaptively estimate the mindset/emotion/intent of the participants of the dialogue, and adaptively adjust the conversation strategy based on the dynamically changing information/estimates/contextual information.

[0032] The automated companion according to the present teaching is capable of personalizing a dialogue by adapting on multiple fronts, including, but not limited to, the subject matter of the conversation, the hardware/components used to carry out the conversation, and the expression/behavior/gesture used to deliver responses to a human conversant. The adaptive control strategy is to make the conversation more realistic and productive by flexibly changing the conversation strategy based on observations of how receptive the human conversant is to the dialogue and on the context of the dialogue. The dialogue system according to the present teaching can be configured to achieve a goal-driven strategy, including dynamically configuring hardware/software components that are considered most appropriate for a current user to achieve an intended goal. Such optimizations are carried out based on learning, including learning from prior conversations as well as from an on-going conversation, by continuously assessing a human conversant's behavior/reactions during the conversation with respect to the intended goals. Paths exploited to achieve a goal-driven strategy may be determined so as to keep the human conversant engaged in the conversation, even though, in some instances, paths at some moments of time may appear to be deviating from the intended goal.

[0033] More specifically, the present teaching discloses a user interaction engine providing backbone support to an agent device to facilitate more realistic and more engaging dialogues with a human conversant. Fig. 1 depicts a networked environment 100 for facilitating a dialogue between a user operating a user device and an agent device in conjunction with a user interaction engine, in accordance with an embodiment of the present teaching. In Fig. 1, the exemplary networked environment 100 includes one or more user devices 110, such as user devices 110-a, 110-b, 110-c, and 110-d, one or more agent devices 160, such as agent devices 160-a, ..., 160-b, a user interaction engine 140, and a user information database 130, each of which may communicate with one another via network 120. In some embodiments, network 120 may correspond to a single network or a combination of different networks. For example, network 120 may be a local area network (“LAN”), a wide area network (“WAN”), a public network, a proprietary network, a Public Switched Telephone Network (“PSTN”), the Internet, an intranet, a Bluetooth network, a wireless network, a virtual network, and/or any combination thereof. In one embodiment, network 120 may also include various network access points. For example, environment 100 may include wired or wireless access points such as, without limitation, base stations or Internet exchange points 120-a, ..., 120-b. Base stations 120-a and 120-b may facilitate, for example, communications to/from user devices 110 and/or agent devices 160 with one or more other components in the networked framework 100 across different types of networks.

[0034] A user device, e.g., 110-a, may be of different types to facilitate a user operating the user device to connect to network 120 and transmit/receive signals. Such a user device 110 may correspond to any suitable type of electronic/computing device including, but not limited to, a desktop computer (110-d), a mobile device (110-a), a device incorporated in a transportation vehicle (110-b), ..., a mobile computer (110-c), or a stationary device/computer (110-d). A mobile device may include, but is not limited to, a mobile phone, a smart phone, a personal display device, a personal digital assistant (“PDA”), a gaming console/device, a wearable device such as a watch, a Fitbit, a pin/brooch, a headphone, etc. A transportation vehicle embedded with a device may include a car, a truck, a motorcycle, a boat, a ship, a train, or an airplane. A mobile computer may include a laptop, an Ultrabook device, a handheld device, etc. A stationary device/computer may include a television, a set top box, a smart household device (e.g., a refrigerator, a microwave, a washer or a dryer, an electronic assistant, etc.), and/or a smart accessory (e.g., a light bulb, a light switch, an electrical picture frame, etc.).

[0035] An agent device, e.g., any of 160-a, ..., 160-b, may correspond to one of different types of devices that may communicate with a user device and/or the user interaction engine 140. Each agent device, as described in greater detail below, may be viewed as an automated companion device that interfaces with a user with, e.g., the backbone support from the user interaction engine 140. An agent device as described herein may correspond to a robot which can be a game device, a toy device, a designated agent device such as a traveling agent or weather agent, etc. The agent device as disclosed herein is capable of facilitating and/or assisting in interactions with a user operating a user device. In doing so, an agent device may be configured as a robot capable of controlling some of its parts, via the backend support from the application server 130, for, e.g., making certain physical movements (such as of the head), exhibiting certain facial expressions (such as curved eyes for a smile), or saying things in a certain voice or tone (such as exciting tones) to display certain emotions.

[0036] When a user device (e.g., user device 110-a) is connected to an agent device, e.g., 160-a (e.g., via either a contact or contactless connection), a client running on the user device, e.g., 110-a, may communicate with the automated companion (either the agent device or the user interaction engine or both) to enable an interactive dialogue between the user operating the user device and the agent device. The client may act independently in some tasks or may be controlled remotely by the agent device or the user interaction engine 140. For example, to respond to a question from a user, the agent device or the user interaction engine 140 may control the client running on the user device to render the speech of the response to the user. During a conversation, an agent device may include one or more input mechanisms (e.g., cameras, microphones, touch screens, buttons, etc.) that allow the agent device to capture inputs related to the user or the local environment associated with the conversation. Such inputs may assist the automated companion in developing an understanding of the atmosphere surrounding the conversation (e.g., movements of the user, sound of the environment) and the mindset of the human conversant (e.g., the user picking up a ball, which may indicate that the user is bored), in order to enable the automated companion to react accordingly and conduct the conversation in a manner that will keep the user interested and engaged.

[0037] In the illustrated embodiments, the user interaction engine 140 may be a backend server, which may be centralized or distributed. It is connected to the agent devices and/or user devices. It may be configured to provide backbone support to agent devices 160 and guide the agent devices to conduct conversations in a personalized and customized manner. In some embodiments, the user interaction engine 140 may receive information from connected devices (either agent devices or user devices), analyze such information, and control the flow of the conversations by sending instructions to agent devices and/or user devices. In some embodiments, the user interaction engine 140 may also communicate directly with user devices, e.g., providing dynamic data, e.g., control signals for a client running on a user device to render certain responses.

[0038] Generally speaking, the user interaction engine 140 may control the state and the flow of conversations between users and agent devices. The flow of each of the conversations may be controlled based on different types of information associated with the conversation, e.g., information about the user engaged in the conversation (e.g., from the user information database 130), the conversation history, surrounding information of the conversation, and/or real-time user feedback. In some embodiments, the user interaction engine 140 may be configured to obtain various sensory inputs such as, and without limitation, audio inputs, image inputs, haptic inputs, and/or contextual inputs, process these inputs, formulate an understanding of the human conversant, accordingly generate a response based on such understanding, and control the agent device and/or the user device to carry out the conversation based on the response. As an illustrative example, the user interaction engine 140 may receive audio data representing an utterance from a user operating a user device, and generate a response (e.g., text) which may then be delivered to the user in the form of a computer-generated utterance as a response to the user. As yet another example, the user interaction engine 140 may also, in response to the utterance, generate one or more instructions that control an agent device to perform a particular action or set of actions.

[0039] As illustrated, during a human machine dialogue, a user, as the human conversant in the dialogue, may communicate across the network 120 with an agent device or the user interaction engine 140. Such communication may involve data in multiple modalities such as audio, video, text, etc. Via a user device, a user can send data (e.g., a request, audio signal representing an utterance of the user, or a video of the scene surrounding the user) and/or receive data (e.g., text or audio response from an agent device). In some embodiments, user data in multiple modalities, upon being received by an agent device or the user interaction engine 140, may be analyzed to understand the human user’s speech or gesture so that the user’s emotion or intent may be estimated and used to determine a response to the user.

[0040] Fig. 2A depicts specific connections among a user device 110-a, an agent device 160-a, and the user interaction engine 140 during a dialogue, in accordance with an embodiment of the present teaching. As seen, connections between any two of the parties may all be bi-directional, as discussed herein. The agent device 160-a may interface with the user via the user device 110-a to conduct a dialogue in bi-directional communication. On one hand, the agent device 160-a may be controlled by the user interaction engine 140 to utter a response to the user operating the user device 110-a. On the other hand, inputs from the user site, including, e.g., both the user's utterances/actions and information about the surroundings of the user, are provided to the agent device via the connections. The agent device 160-a may be configured to process such input and dynamically adjust its response to the user. For example, the agent device may be instructed by the user interaction engine 140 to render a tree on the user device. Knowing that the surrounding environment of the user (based on visual information from the user device) shows green trees and lawns, the agent device may customize the tree to be rendered as a lush green tree. If the scene from the user site shows that it is winter, the agent device may control the user device to render the tree with parameters for a tree that has no leaves. As another example, if the agent device is instructed to render a duck on the user device, the agent device may retrieve information from the user information database 130 on color preference and generate parameters for customizing the duck in the user's preferred color before sending the rendering instruction to the user device.

[0041] In some embodiments, such inputs from the user's site and the processing results thereof may also be transmitted to the user interaction engine 140 to facilitate the user interaction engine 140 in better understanding the specific situation associated with the dialogue, so that the user interaction engine 140 may determine the state of the dialogue and the emotion/mindset of the user, and generate a response that is based on the specific situation of the dialogue and the intended purpose of the dialogue (e.g., for teaching a child English vocabulary). For example, if information received from the user device indicates that the user appears to be bored and is becoming impatient, the user interaction engine 140 may determine to change the state of the dialogue to a topic that is of interest to the user (e.g., based on the information from the user information database 130) in order to continue to engage the user in the conversation.

[0042] In some embodiments, a client running on the user device may be configured to be able to process raw inputs of different modalities acquired from the user site and send the processed information (e.g., relevant features of the raw inputs) to the agent device or the user interaction engine for further processing. This will reduce the amount of data transmitted over the network and enhance communication efficiency. Similarly, in some embodiments, the agent device may also be configured to be able to process information from the user device and extract useful information for, e.g., customization purposes. Although the user interaction engine 140 may control the state and flow of the dialogue, making the user interaction engine 140 lightweight helps it scale better.

[0043] Fig. 2B depicts the same setting as what is presented in Fig. 2A with additional details on the user device 110-a. As shown, during a dialogue between the user and the agent 210, the user device 110-a may continually collect multi-modal sensor data related to the user and his/her surroundings, which may be analyzed to detect any information related to the dialogue and used to intelligently control the dialogue in an adaptive manner. This may further enhance the user experience or engagement. Fig. 2B illustrates exemplary sensors such as video sensor 230, audio sensor 240, ..., or haptic sensor 250. The user device may also send textual data as part of the multi-modal sensor data. Together, these sensors provide contextual information surrounding the dialogue and can be used by the user interaction system 140 to understand the situation in order to manage the dialogue. In some embodiments, the multi-modal sensor data may first be processed on the user device, and important features in different modalities may be extracted and sent to the user interaction system 140 so that the dialogue may be controlled with an understanding of the context. In some embodiments, the raw multi-modal sensor data may be sent directly to the user interaction system 140 for processing.

[0044] As seen in Figs. 2A-2B, the agent device may correspond to a robot that has different parts, including its head 210 and its body 220. Although the agent device as illustrated in Figs. 2A-2B appears to be a humanoid robot, it may also be constructed in other forms as well, such as a duck, a bear, a rabbit, etc. Fig. 3A illustrates an exemplary structure of an agent device with exemplary types of agent body, in accordance with an embodiment of the present teaching. As presented, an agent device may include a head and a body, with the head attached to the body. In some embodiments, the head of an agent device may have additional parts such as a face, a nose, and a mouth, some of which may be controlled to, e.g., make movements or expressions. In some embodiments, the face of an agent device may correspond to a display screen on which a face can be rendered, and the face may be of a person or of an animal. Such a displayed face may also be controlled to express emotion.

[0045] The body part of an agent device may also correspond to different forms such as a duck, a bear, a rabbit, etc. The body of the agent device may be stationary, movable, or semi-movable. An agent device with a stationary body may correspond to a device that can sit on a surface, such as a table, to conduct a face-to-face conversation with a human user sitting next to the table. An agent device with a movable body may correspond to a device that can move around on a surface such as a table surface or the floor. Such a movable body may include parts that can be kinematically controlled to make physical moves. For example, an agent body may include feet which can be controlled to move in space when needed. In some embodiments, the body of an agent device may be semi-movable, i.e., some parts are movable and some are not. For example, a tail on the body of an agent device with a duck appearance may be movable, but the duck cannot move in space. A bear-bodied agent device may also have arms that may be movable, but the bear can only sit on a surface.

[0046] Fig. 3B illustrates an exemplary agent device or automated companion 160-a, in accordance with an embodiment of the present teaching. The automated companion 160-a is a device that interacts with people using speech and/or facial expressions or physical gestures. For example, the automated companion 160-a corresponds to an animatronic peripheral device with different parts, including a head portion 310, an eye portion (cameras) 320, a mouth portion with a laser 325 and a microphone 330, a speaker 340, a neck portion with servos 350, one or more magnets or other components that can be used for contactless detection of presence 360, and a body portion corresponding to, e.g., a charge base 370. In operation, the automated companion 160-a may be connected to a user device, which may include a mobile multi-function device (110-a), via network connections. Once connected, the automated companion 160-a and the user device interact with each other via, e.g., speech, motion, gestures, and/or via pointing with a laser pointer.

[0047] Other exemplary functionalities of the automated companion 160-a may include reactive expressions in response to a user's response via, e.g., an interactive video cartoon character (e.g., an avatar) displayed on, e.g., a screen as part of a face on the automated companion. The automated companion may use a camera (320) to observe the user's presence, facial expressions, direction of gaze, surroundings, etc. An animatronic embodiment may “look” by pointing its head (310) containing a camera (320), “listen” using its microphone (330), and “point” by directing its head (310), which can move via servos (350). In some embodiments, the head of the agent device may also be controlled remotely by, e.g., the user interaction system 140 or by a client in a user device (110-a), via a laser (325). The exemplary automated companion 160-a as shown in Fig. 3B may also be controlled to “speak” via a speaker (340).

[0048] Fig. 4A illustrates an exemplary dialogue scene 400 where a robot agent 410 interacts with a user in a dialogue in the dialogue scene, in accordance with an embodiment of the present teaching. In this illustration, the dialogue scene 400 is a room with a number of objects, including a desk, a chair, a computer on the desk, walls, a window, and various pictures and/or boards hanging on the wall. In addition to such fixtures, there are the robot agent 410 on the desk and the user 420. In interacting with the user 420, the robot agent 410 may make observations of the dialogue scene via scene modeling. Such scene modeling may provide useful information to the robot agent 410 and play a role in assisting the robot agent 410 to adapt its dialogue strategy when needed. For instance, if the robot agent 410 notices that the user 420 does not engage well in responding to questions from the agent and always looks in a certain direction, then, using the information from the scene modeling, the robot agent 410 may estimate the object that the user is gazing at. In this case, the robot agent 410 may ask some questions related to that object in order to continue to engage the user in the on-going dialogue.

[0049] In a dialogue system, such an adaptive dialogue strategy may be needed in order to enhance user experience and engagement. An adaptive dialogue strategy may require, as illustrated above, an understanding of not only the dialogue itself but also the surroundings of the dialogue, such as the environment in which the dialogue is conducted. Fig. 4B shows different exemplary aspects of functionalities to support an adaptive dialogue strategy. For example, to adapt a dialogue strategy, environment modeling may be carried out to enable a robot agent to understand its surroundings. Spontaneous dialogue is part of the adaptive dialogue strategy, which may require that a robot agent use adaptive policies to dynamically control conversations. In some embodiments, to adapt, a robot agent may also create some type of collaborative augmented or virtual reality environment based on observed needs by, e.g., including an adaptively determined environment and/or some virtual agents to deliver adaptively determined content to a user in the augmented/virtual dialogue reality. The present teaching focuses on the aspect of environment modeling.

[0050] Gaining an understanding of a dialogue environment may be the basis of an adaptive dialogue strategy. For example, during a conversation planned between a virtual agent and a user, the virtual agent may be present in the dialogue environment (real, virtual, or augmented) to conduct a conversation, which may be conducted based on, e.g., objects present in the dialogue scene. Such objects may also be real or virtual. For instance, if a robot agent is teaching a student user the concept of adding, knowing that the student likes fruits, the robot agent may render a number of fruits in the dialogue scene and ask the student user to perform the addition. In this case, the fruits may be rendered in the dialogue scene in a manner that is coherent with the existing real scene, e.g., fruits must be displayed in front of the user rather than behind the user, and rendered on a table rather than in the sky.

[0051] Fig. 5A depicts an exemplary high level system diagram of a system 500 for dialogue environment modeling via combinational neural networks, in accordance with an embodiment of the present teaching. In this illustrated embodiment, the system 500 comprises a convolutional neural network 520 and a graph neural network 530, which are used to learn different aspects of objects present in a dialogue scene. Such a combined use of different neural networks trains each to capture different characteristics of an environment, e.g., the convolutional neural network 520 learns CNN based object/feature detection models 540 and the graph neural network 530 learns GNN based spatial relation detection models 550, respectively. In some embodiments, such models are represented by parameters associated with their respective neural networks, including, e.g., the number of layers, the number of nodes in each layer, the connections among different nodes, weights associated with different connections, the parameters for functions used by different nodes, etc. In such embodiments, with such parameters, each neural network may then be configured accordingly and used to perform the tasks learned.
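
To make the notion of models "represented by parameters" concrete, the following is a minimal Python sketch (using PyTorch, which the patent does not mandate; the tiny architecture and file names are illustrative assumptions only) of how a learned model such as 540 could be reduced to its architecture hyper-parameters plus learned weights and later re-configured from them:

import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    """Illustrative stand-in for the CNN-based object/feature detector (520/540)."""
    def __init__(self, num_classes: int = 20):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(64, num_classes)

    def forward(self, x):
        return self.head(self.features(x).flatten(1))

# The stored "model" is just the configuration plus the learned parameters.
cnn = SmallCNN(num_classes=20)
torch.save({"num_classes": 20, "state_dict": cnn.state_dict()}, "cnn_object_model_540.pt")

# Later, the network is re-configured from the stored parameters and used for inference.
ckpt = torch.load("cnn_object_model_540.pt")
restored = SmallCNN(num_classes=ckpt["num_classes"])
restored.load_state_dict(ckpt["state_dict"])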

[0052] In system 500, the convolutional neural network 520 is used for learning, based on training images 510, the CNN based object/feature detection models 540 from supervised training data. The training images 510 include images with labeled objects, and such objects may include a human user present in the dialogue scene and features associated therewith, as well as other types of non-human objects, such as a tree, a bench, the sky, a picture on a wall, a table, a chair, a computer, or a toy, that may be observed in a dialogue scene. For example, a user may be identified in a dialogue scene, and features characterizing the user may include his/her pose, orientation, etc.; such a person and his/her characteristics may be detected/extracted using the CNN based object/feature detection models 540.
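
As a hedged illustration of this supervised training step, the sketch below fine-tunes an off-the-shelf detector on the labeled objects; the patent does not specify a particular CNN architecture, and torchvision's Faster R-CNN is used here only as an assumed example.

import torch
import torchvision

# Pretrained detector used as an assumed stand-in for the CNN 520.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
model.train()

def training_step(images, targets):
    # images: list of [3, H, W] tensors of dialogue-scene training images (510-2)
    # targets: list of dicts with "boxes" and "labels" for the labeled objects (510-1)
    loss_dict = model(images, targets)   # in train mode the detector returns a dict of losses
    loss = sum(loss_dict.values())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()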

[0053] Similarly, the graph neural network 530 is used for learning, based on training images 510, the GNN based spatial relation detection models 550 from labeled image training data. The labeled data for training the GNN based spatial relation detection models 550 may include various spatial relations observed in the scenes of the training images, such as the spatial relationships between a user and the various objects present in a scene. For example, a user and a desk may be identified in a dialogue scene, and each may be associated with different characteristics such as its physical location, orientation, etc. The spatial relationship between the two may be determined based on their respective characteristics or may be detected/extracted from the training image. Such identified spatial relationships among different objects in the training images may be used to build the GNN based spatial relation detection models 550.
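
The following is a minimal sketch (assumptions only, not the patent's design) of what a GNN-style spatial relation detector could look like: detected objects become graph nodes, one round of message passing mixes neighbor features, and a pairwise head classifies the spatial relation for each object pair.

import torch
import torch.nn as nn

class SpatialRelationGNN(nn.Module):
    def __init__(self, node_dim: int = 256, num_relations: int = 6):
        super().__init__()
        self.message = nn.Linear(node_dim, node_dim)
        self.update = nn.GRUCell(node_dim, node_dim)
        self.pair_head = nn.Sequential(
            nn.Linear(2 * node_dim, node_dim), nn.ReLU(),
            nn.Linear(node_dim, num_relations),   # e.g. "on", "next to", "behind", ...
        )

    def forward(self, node_feats: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # node_feats: [N, node_dim] features of the N detected objects
        # adj: [N, N] adjacency matrix (1 where two objects are considered neighbors)
        msgs = adj @ self.message(node_feats)            # aggregate neighbor messages
        h = self.update(msgs, node_feats)                # update node states
        src = h.unsqueeze(1).expand(-1, h.size(0), -1)   # [N, N, D]
        dst = h.unsqueeze(0).expand(h.size(0), -1, -1)   # [N, N, D]
        return self.pair_head(torch.cat([src, dst], dim=-1))  # [N, N, num_relations] logits

# Supervised training would minimize cross-entropy between these pairwise logits and
# the labeled spatial relationships (510-3) extracted from the training images.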

[0054] To facilitate the training, objects (including a person) and their attributes in scene images 510-2 of the training images 510 are identified, saved as objects/attributes 510-1, and used by the convolutional neural network 520 for supervised learning to obtain the object/feature detection models 540. On the other hand, spatial relations among the objects identified from the training images are also identified, labeled, and saved as object spatial relationships 510-3. Such identified spatial relations among objects in different scene images are used by the graph neural network 530 for supervised training to obtain the GNN based spatial relation detection models 550.

[0055] Fig. 5B is a flowchart of an exemplary process of the system 500 for learning dialogue environment modeling via combinational neural networks, in accordance with an embodiment of the present teaching. At 555, training scene images 510-2 are accessed to obtain, at 560, labeled objects and associated features (attributes) 510-1. Spatial relationships 510-3 among such identified objects are also obtained at 565. The scene images 510-2 as well as the labeled objects/attributes are then used to train, at 570, the convolutional neural network 520 to obtain, at 575, the CNN based object/feature detection models 540. At the same time, the scene images 510-2 and the extracted spatial relations among objects in the scene images are used to train, at 580, the graph neural network 530 to obtain, at 585, the GNN based spatial relation detection models 550.

[0056] The derived CNN object/feature detection models 540 may then be used for detecting objects and associated attributes from an image of a dialogue scene. Similarly, the derived GNN based spatial relation detection models 550 may then be used for detecting spatial relations among objects detected from an image of a dialogue scene. Fig. 6A depicts an exemplary high level system diagram of a dynamic dialogue scene modeling unit 600, in accordance with an embodiment of the present teaching. In this illustrated embodiment, the dynamic dialogue scene modeling unit 600 comprises an object/feature detection unit 610, a scene estimation unit 620, an object spatial relation detector 630, and a scene model generation unit 640. In operation, the dynamic dialogue scene modeling unit 600 receives an image of a dialogue scene and derives a scene model characterizing the dialogue scene based on the CNN based object/feature detection models 540 and the GNN based spatial relation detection models 550.

[0057] Fig. 6B is a flowchart of an exemplary process of the dynamic dialogue scene modeling unit 600, in accordance with an embodiment of the present teaching. When an image of a dialogue scene is received at 655, the object/feature detection unit 610 detects, at 660, objects and their associated features based on the CNN based object/feature detection models 540. In some embodiments, the object/feature detection unit 610 is a trained convolutional neural network configured based on parameters specified in the CNN based object/feature detection models 540. Based on the detected objects and features, the scene estimation unit 620 may then estimate, at 665, the nature of the dialogue scene. As each type of scene may have some characterizing objects, the type of the dialogue scene may be estimated based on the objects detected from it. For instance, an office scene usually has a desk, some chairs, a bookcase, and/or a computer on the desk. If some of such objects are detected from an image of a scene, it may be inferred or estimated that the dialogue scene involves an office.
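
As a toy illustration of this scene-type inference (the prototype lists and scoring rule below are assumptions for illustration; a learned classifier could equally be used):

SCENE_PROTOTYPES = {
    "office":      {"desk", "chair", "bookcase", "computer", "monitor"},
    "living room": {"sofa", "tv", "coffee table", "lamp"},
    "classroom":   {"whiteboard", "desk", "chair", "projector"},
}

def estimate_scene_type(detected_labels: set) -> str:
    # Score each scene type by how many of its characteristic objects were detected.
    scores = {
        scene: len(detected_labels & prototype) / len(prototype)
        for scene, prototype in SCENE_PROTOTYPES.items()
    }
    return max(scores, key=scores.get)

print(estimate_scene_type({"desk", "chair", "computer", "window"}))   # prints "office"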

[0058] To facilitate an adaptive dialogue strategy, spatial relations among detected scene objects may also be detected based on the detected objects and their features. The object spatial relation detector 630 may identify, at 670, spatial relations among objects based on the scene image received and the GNN based spatial relation detection models 550. In some embodiments, the spatial relation detector 630 is a trained graph neural network configured based on parameters specified in the GNN based spatial relation detection models 550. In some embodiments, the object spatial relation detector 630 may detect the spatial relations among different objects directly from the scene image (e.g., if it was trained to learn from the images directly). In some embodiments, the object spatial relation detector 630 may also receive information from the object/feature detection unit 610 about the objects/features detected and use that to guide its detection of spatial relations among such detected objects. In some embodiments, the object spatial relation detector 630 may also receive information from the scene estimation unit 620 about an estimated type of scene and use that to facilitate its determination of spatial relations among objects. For instance, if the estimated type of scene is an office scene, then this knowledge may be utilized in determining that a computer is on top of a desk (because in an office, a computer is usually on top of a desk). With the identified scene objects/features and their spatial relations, the scene model generation unit 640 obtains, at 675, a scene model and then stores a representation of the scene model at 680.
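
One possible concrete form of the scene model produced at 675 and stored at 680 is sketched below; the field names are illustrative assumptions, as the patent does not prescribe a representation.

from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class SceneObject:
    object_id: int
    label: str                                      # e.g. "desk", "user"
    attributes: Dict = field(default_factory=dict)  # pose, orientation, location, ...

@dataclass
class SpatialRelation:
    subject_id: int
    relation: str                                   # e.g. "on top of", "next to"
    object_id: int

@dataclass
class SceneModel:
    scene_type: str                                 # from the scene estimation unit 620
    objects: List[SceneObject]                      # from the object/feature detection unit 610
    relations: List[SpatialRelation]                # from the object spatial relation detector 630

# Example: an office scene with a computer on a desk and a user facing the computer.
model = SceneModel(
    scene_type="office",
    objects=[SceneObject(0, "desk"), SceneObject(1, "computer"), SceneObject(2, "user")],
    relations=[SpatialRelation(1, "on top of", 0), SpatialRelation(2, "facing", 1)],
)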

[0059] Such a dynamically generated scene model may be used in determining an adaptive dialogue strategy. In some embodiments, a scene model created based on a dialogue scene may be used to determine how the dialogue scene may be augmented for the purpose of, e.g., better engaging a user or carrying out some specific tasks. Fig. 7A depicts an exemplary high level system diagram of an adaptive dialogue system 700 for dialogue in an augmented reality scene generated via dialogue scene modeling, in accordance with an embodiment of the present teaching. In this illustrated example application, a user 705 is engaged in a dialogue with the adaptive dialogue system 700. An utterance of the user 705 is received by a spoken language understanding (SLU) unit 702. With an understanding of what the user said, determined by the SLU unit 702, a dialogue manager 710 determines a response based on dialogue trees 725 and directs a response generator 730 to generate a textual response, which is then used by a text-to-speech (TTS) unit 735 to convert the textual response into a speech response to be delivered to the user 705.

[0060] During the dialogue, the dynamic dialogue scene modeling unit 600 may also receive a scene image capturing the dialogue environment and accordingly generate a scene model, stored in 650 (as discussed herein with reference to Figs. 6A-6B), representing the surroundings of the dialogue with the user 705. Such scene models may be created continuously and adapted to changes in the scene. For instance, the user may move around in the scene, so that features associated with the user and his/her spatial relations with other objects in the scene change accordingly. The changing features of the objects as well as the changing spatial relations among such objects in the dialogue scene may be captured via continuous scene modeling by the dynamic dialogue scene modeling unit 600, and, thus, the scene models stored in the environment modeling database 650 are adaptively updated with time.
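
A compressed sketch of this continuous update loop is shown below; the component interfaces (camera.stream, detect_objects, and so on) are hypothetical names standing in for the units of Figs. 6A and 7A.

def scene_modeling_loop(camera, scene_modeling_unit, environment_db):
    # Continuously refresh the scene model so that database 650 tracks scene changes,
    # e.g. the user moving around relative to other objects.
    for frame in camera.stream():                             # hypothetical image source
        objects, features = scene_modeling_unit.detect_objects(frame)
        relations = scene_modeling_unit.detect_relations(frame, objects)
        scene_model = scene_modeling_unit.build_model(objects, features, relations)
        environment_db.update(scene_model)                    # latest model replaces the stale one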

[0061] When the dialogue manager 710 decides to continue the dialogue using an augmented dialogue scene, the scene model for the surrounding environment is used by an augmented scene generation unit 720 to generate an augmented scene by visualizing, e.g., certain selected virtual objects in the dialogue scene. The dialogue manager 710 may also invoke an augmented dialogue policy generator 715 to create an augmented dialogue policy to be used to conduct the adapted dialogue in the augmented dialogue scene. For example, the dialogue system 700 may be engaged in a dialogue with user 705 teaching the user the concept of adding. During the dialogue, it may be recognized that the user does not respond well to questions presented on a computer screen and appears to be distracted. In this situation, the dialogue manager 710 may decide to better engage the user by projecting colorful virtual objects in the scene to continue the dialogue on adding. To facilitate that, the dialogue manager 710 may invoke the augmented scene generation unit 720 to create virtual objects (e.g., balls of different colors) and the augmented dialogue policy generator 715 to generate the dialogue policy to be used to govern the dialogue based on the augmented scene.

[0062] The created virtual objects are to be visualized or projected in the dialogue scene. To render virtual objects in an actual scene (to form an augmented reality scene) in a coherent manner, various parameters need to be determined, e.g., the 3D location in the scene at which to visualize the virtual objects, the orientation in which the virtual objects are to be projected (e.g., they have to be in the field of view of the user), the size of the virtual objects after the projection, etc. According to the present teaching, such parameters may be determined in accordance with the scene modeling characterizing the objects already present in the scene and their spatial relationships. For example, if the scene modeling indicates that the user is sitting on a chair near a desk, facing the screen of a computer on top of the desk, then the visualization may be performed at a location that will not be occluded by the computer on the desk. If the user is distracted and has already stood up and is walking around in the dialogue scene, then the visualization needs to be provided in the field of view of the user. To achieve that, the augmented scene generation unit 720 may access the adaptively updated scene modeling information from database 650 and accordingly determine the parameters needed to generate the augmented scene. Once the augmented scene and the augmented dialogue policy are accordingly generated, the dialogue manager 710 conducts the dialogue based on the augmented dialogue policy consistent with the augmented scene generated.
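
A highly simplified sketch of choosing such rendering parameters from the scene model follows; every helper predicate here (in_field_of_view, is_occluded, and so on) is a hypothetical placeholder rather than an API from the patent.

def choose_render_params(scene_model, candidate_positions):
    # Pick a placement that is visible to the user, not occluded, and resting on a surface.
    user = next(o for o in scene_model.objects if o.label == "user")
    for pos in candidate_positions:                            # e.g. sampled 3D points
        if not in_field_of_view(pos, user.attributes["pose"]):
            continue
        if is_occluded(pos, scene_model):                      # e.g. blocked by the computer
            continue
        if not supported_by_surface(pos, scene_model):         # e.g. on the desk, not in mid-air
            continue
        return {
            "position": pos,
            "orientation": face_towards(pos, user.attributes["pose"]),
            "scale": scale_for_distance(pos, user.attributes["pose"]),
        }
    return None   # no coherent placement found; fall back to the un-augmented dialogue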

[0063] Fig. 7B is a flowchart of an exemplary process of the dialogue system 700 for an adaptive dialogue strategy realized by creating an augmented reality scene via dialogue scene modeling, in accordance with an embodiment of the present teaching. The dialogue system 700 receives, at 740, various inputs, including an image capturing the dialogue scene and additional information such as audio data representing an utterance of a user. To model the dialogue surroundings, the dynamic dialogue scene modeling unit 600 detects, at 745, objects and their features from the scene image and generates, at 750, a scene model for the dialogue scene. At the same time, based on the received audio information representing an utterance of the user 705, the SLU unit 702 performs, at 755, spoken language understanding and sends the SLU result to the dialogue manager 710.
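
A minimal sketch of steps 740-755, assuming hypothetical scene_modeler, slu, and dialogue_manager interfaces, may process the scene image and the audio in parallel before handing both results to the dialogue manager:

    from concurrent.futures import ThreadPoolExecutor

    def handle_inputs(image, audio, scene_modeler, slu, dialogue_manager):
        # Steps 745-755: model the dialogue scene and understand the utterance in parallel.
        with ThreadPoolExecutor(max_workers=2) as pool:
            scene_future = pool.submit(scene_modeler.model_scene, image)  # objects + scene model
            slu_future = pool.submit(slu.understand, audio)               # spoken language understanding
            scene_model = scene_future.result()
            slu_result = slu_future.result()
        # Both results feed the dialogue manager's decision (step 760 onward).
        return dialogue_manager.decide(slu_result, scene_model)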

[0064] Based on the current state of the dialogue and the surrounding information represented by the scene model, the dialogue manager 710 determines, at 760, whether an augmented dialogue scene for an augmented dialogue is needed. If no augmented dialogue is needed, the dialogue manager 710 proceeds to determine, at 780, a response to the user based on the SLU result from the SLU unit 702 and the dialogue trees 725. Such a determined response is then delivered, at 785, to the user via the TTS unit 735. If an augmented dialogue is needed, the dialogue manager 710 invokes the augmented scene generation unit 720 to create an augmented dialogue scene. This is achieved by identifying, at 765, virtual objects to be rendered in the dialogue scene and determining parameters to be used to render such virtual objects in the scene based on the dynamically updated scene model associated with the dialogue scene. To facilitate the augmented dialogue, the augmented dialogue policy generator 715 is invoked to determine, at 770, the augmented dialogue policy corresponding to the augmented dialogue. Based on the virtual objects and rendering parameters, the augmented reality scene is rendered, at 775, and the dialogue manager 710 then conducts the augmented dialogue with respect to the augmented dialogue scene by determining, at 780, a response in accordance with the augmented dialogue policy. The response is then delivered to the user at 785. The process then proceeds back to 740 to receive the updated scene image and the next utterance from the user.
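
The branching described above (steps 760-785) can be summarized in a short control-flow sketch; all component interfaces here are assumptions for illustration, not the actual implementation:

    def dialogue_turn(dialogue_manager, slu_result, scene_model,
                      dialogue_trees, scene_generator, policy_generator, tts):
        if not dialogue_manager.needs_augmentation(slu_result, scene_model):    # 760
            response = dialogue_manager.respond(slu_result, dialogue_trees)     # 780
        else:
            virtual_objects, params = scene_generator.plan(scene_model)         # 765
            augmented_policy = policy_generator.generate(virtual_objects)       # 770
            scene_generator.render(virtual_objects, params)                     # 775
            response = dialogue_manager.respond(slu_result, augmented_policy)   # 780
        tts.speak(response)                                                     # 785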

[0065] Continuing the example of teaching the concept of adding numbers, where the augmented dialogue is conducted in an augmented dialogue scene in which colored balls are rendered or thrown in the air toward the user (to get the user’s attention for improved engagement), the augmented dialogue policy may be generated in accordance with the content of the augmented scene. In some embodiments, the dialogue manager 710 may have conducted the dialogue under the original policy by presenting different types of objects to the user on a computer screen and asking the user to add them and provide an answer. When it is observed that the user does not engage well using this means of communication (either answers incorrectly or looks away), the dialogue manager 710 may decide to use augmented means, presenting the tutorial content in a more interesting way to get the user’s attention. The dialogue manager 710 may provide the content of the original dialogue to the augmented scene generation unit 720 so that virtual and attractive objects may be selected (e.g., different animals in different colors) according to the original tutorial material. For instance, if the original content is “what is the total number of candies in one box with 2 candies and in another box with 5 candies,” the accordingly generated augmented content may be 2 orange bears and 5 blue dolphins with the question “How many animals?” The corresponding augmented dialogue policy may be changed accordingly to fit the new augmented content.
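
A minimal sketch of such content re-authoring is shown below; the object catalogue and the wording of the prompt are illustrative assumptions, and the only requirement carried over from the original question is that the counts, and therefore the expected answer, stay the same.

    import random

    ATTRACTIVE_OBJECTS = [("orange", "bears"), ("blue", "dolphins"), ("green", "turtles")]

    def augment_addition_question(count_a, count_b):
        # Pick two visually distinct object types and keep the original counts so the
        # arithmetic (and expected answer) of the tutorial question is preserved.
        (color_a, kind_a), (color_b, kind_b) = random.sample(ATTRACTIVE_OBJECTS, 2)
        virtual_objects = [
            {"kind": kind_a, "color": color_a, "count": count_a},
            {"kind": kind_b, "color": color_b, "count": count_b},
        ]
        prompt = (f"Here are {count_a} {color_a} {kind_a} and "
                  f"{count_b} {color_b} {kind_b}. How many animals?")
        return virtual_objects, prompt, count_a + count_b

    # augment_addition_question(2, 5) may, for example, yield 2 orange bears and
    # 5 blue dolphins with the question "How many animals?" and expected answer 7.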

[0066] In some embodiments, a scene model derived via dynamic dialogue scene modeling, based on one or more scene images captured at one location, may be used to construct a virtual scene at a different location in order to carry out a conversation there. Unlike an augmented scene, in which virtual objects are added to a real scene, a virtual scene created based on a scene modeling result is entirely computer generated from what is described in the scene modeling. For instance, if an office scene is modeled based on one or more scene images with a desk, two chairs, a computer on the desk, and a bookcase physically arranged in certain spatial relationships, such a scene, once detected, can be electronically represented by specifying the objects present, their features, and their spatial arrangement. Such a representation may then be used to create a virtual scene comprising the same (but virtual) objects with the specified features, arranged in the virtual scene with the same spatial relationships as in the original physical scene.
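
Under the illustrative SceneModel structure sketched earlier, converting such an electronic representation into a virtual scene description might look as follows (asset names and dictionary keys are assumptions):

    def build_virtual_scene(scene_model):
        # Mirror each detected physical object with a virtual counterpart that keeps
        # the detected features and 3D pose, and carry the spatial relations over unchanged.
        virtual_objects = []
        for obj in scene_model.objects:
            virtual_objects.append({
                "asset": obj.label,              # e.g., a "desk" or "bookcase" 3D asset
                "features": dict(obj.features),
                "position": obj.position,
            })
        return {"objects": virtual_objects, "relations": list(scene_model.relations)}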

[0067] Fig. 8A depicts an exemplary high level system diagram of a dialogue system 800 operating to conduct a dialogue in a virtual reality scene generated via scene modeling, in accordance with an embodiment of the present teaching. In this illustrated example application of user machine dialogue in a virtual scene created via real scene modeling, a user 805 is engaged in a dialogue with the dialogue system 800. An utterance of the user 805 is received by a spoken language understanding (SLU) unit 802, which analyzes the received audio data and produces a result representing an understanding of what the user said. A dialogue manager 830 determines a response based on dialogue trees 835 and directs, via a response generator 730, the response to a TTS unit 840 to generate an audio response via text to speech (TTS) conversion. In this example application, the dialogue is conducted in a virtual environment, which is created based on scene modeling information derived from a scene at a different location, where the dynamic dialogue scene modeling unit 600 is deployed to receive one or more scene images from which modeling information is generated to model the scene at that location.

[0068] In generating the scene modeling information, the dynamic dialogue scene modeling unit 600 may receive one or more images capturing the remote base scene to be modeled and accordingly generate a scene model in accordance with the disclosed approach with reference to Figs. 6A - 6B. Such scene modeling information may be created continuously over time and adapted to changes as they occur in the scene. For instance, the scene may be an outdoor scene in which objects move around, e.g., people moving around in a park scene (a person may sit on a bench and later leave) or clouds traveling in the sky. The changing features of the objects in the scene may also cause changes in the spatial relations among them, and such dynamics may be captured by scene modeling information that is continuously generated. Based on such dynamically updated scene modeling information, the dialogue system 800 may utilize the information stored in the database 650 to create virtual scenes. In some embodiments, the dialogue system 800 may also utilize static scene modeling information and create a virtual scene that is static, with virtual objects created based on the characterization of the corresponding base objects in the remote base scene and spatially configured in the virtual scene with a 3D pose similar to that in the base scene.

[0069] To generate a virtual scene, the dialogue system 800 includes a virtual reality scene generation unit 800, a virtual reality renderer 810, and a dynamic content authoring tool 820. The virtual reality scene generation unit 800 accesses the scene modeling information from the database 650 and accordingly generates a virtual scene in accordance with the accessed scene modeling information. As the database 650 stores scene models obtained from different remote scenes, the virtual reality scene generation unit 800 may select a scene model appropriate for the on-going dialogue. For instance, if the on-going dialogue requires an office scene with windows, a desk, chairs, and a bookcase, the virtual reality scene generation unit 800 may retrieve a scene model that meets that criterion. If the on-going dialogue requires an outdoor scene in a park with a bench and a lake, the virtual reality scene generation unit 800 may select another scene model for that purpose. The need for a virtual scene, and the type of virtual scene to be created, may be determined based on, e.g., what is needed to advance the dialogue or a desire to keep the user engaged.
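
A simple way to realize such selection, sketched here against the illustrative database interface introduced earlier, is to require that all objects needed by the on-going dialogue be present in the stored scene model:

    def select_scene_model(database, required_labels, scene_ids):
        # Return the first stored scene model containing every object the dialogue needs.
        for scene_id in scene_ids:
            model = database.latest(scene_id)
            labels = {obj.label for obj in model.objects}
            if set(required_labels) <= labels:
                return model
        return None

    # e.g., required_labels of {"desk", "chair", "bookcase", "window"} would match an
    # office-like model, while {"bench", "lake"} would match a park scene.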

[0070] The ability to create virtual reality dialogue scenes facilitates an adaptive dialogue strategy. For instance, during the course of an on-going dialogue, the dialogue manager 830 may decide, based on circumstances observed during the dialogue, to change virtual scenes for, e.g., better engaging the user 805 or providing an enhanced user experience. For example, if a user appears fatigued and does not engage well with the robot agent in a virtual office scene, the dialogue manager 830 may invoke the virtual reality scene generation unit 800 to create an outdoor scene in a sunny park to perk up the user. As another example, if a user appears bored in a tutorial session on addition and it is known that the user loves animals, a different virtual scene at a petting farm with different animals may be created. In this way, the user may become more interested while the goal of the tutorial is still accomplished. In some embodiments, the scene models stored in the environment modeling database 650 may provide choices of many different types of scene models, which can be retrieved for rendering a virtual scene in accordance with a base scene that existed somewhere else at an earlier time.
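
As a purely illustrative example, the mapping from an observed user state (and known preferences) to the kind of scene requested next might be as simple as:

    def pick_scene_requirements(user_state, preferences):
        # Map the observed user state (and known preferences) to the objects the
        # next virtual scene should contain.
        if user_state == "fatigued":
            return {"park", "bench", "sunshine"}     # outdoor scene to perk the user up
        if user_state == "bored" and "animals" in preferences:
            return {"farm", "animals"}               # petting-farm scene
        return {"office", "desk", "computer"}        # default tutoring environment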

[0071] With the virtual scene created, the dialogue manager 830 may also invoke the dynamic content authoring tool 820 to create dialogue content consistent not only with the virtual scene but also with the original dialogue policy embodied in the dialogue trees 835. For instance, the original dialogue content dictated by the dialogue trees in a dialogue with a virtual office scene may be to ask a user to add two numbers displayed on a computer screen. If, due to the need to better engage the user, the virtual office scene is now changed to a virtual outdoor scene with a number of red and green benches, the dialogue content needs to be accordingly changed to ask what the total number of benches is (adding red and green benches). The dynamic content authoring tool 820 may receive the scene modeling information for the outdoor scene and the original dialogue content from the dialogue trees 835, generate virtual reality dialogue content, and store it in storage 825 so that it can be used by the dialogue manager 830.
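
A minimal sketch of this re-authoring step, using the illustrative virtual scene description from above, restates the addition question in terms of the benches actually present in the new scene:

    def author_for_scene(virtual_scene):
        # Count the colored benches present in the rendered scene and restate the
        # original addition question in terms of those benches.
        benches = [o for o in virtual_scene["objects"] if o["asset"] == "bench"]
        red = sum(1 for o in benches if o["features"].get("color") == "red")
        green = sum(1 for o in benches if o["features"].get("color") == "green")
        question = (f"There are {red} red benches and {green} green benches in the park. "
                    f"How many benches are there in total?")
        return {"question": question, "expected_answer": red + green}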

[0072] Fig. 8B is a flowchart of an exemplary process of the dialogue system 800 operating to conduct a dialogue in a virtual reality scene generated via dialogue scene modeling, in accordance with an embodiment of the present teaching. In a dialogue conducted in a virtual scene, a scene model is first selected, at 850, from the database 650, which stores various scene models constructed by the dynamic dialogue scene modeling unit 600 based on images of different scenes at different locations. In some embodiments, such a selection may be made in accordance with the dialogue strategy to be applied to the dialogue, e.g., the planned dialogue content specified in dialogue trees 835. In some embodiments, the selection of a virtual reality scene used to conduct a dialogue may be made based on the preference of the user currently engaged in the dialogue. In some embodiments, the selection of a virtual scene may be made based on both the current dialogue strategy and the preference of the user involved.
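
Combining both criteria could be sketched as a simple scoring scheme (the weights and interfaces are assumptions): scene models that cannot support the dialogue strategy are excluded, and ties are broken by how many of the user's preferred elements they contain.

    def rank_scene_models(models, required_labels, preferred_labels):
        # Exclude models that cannot support the planned dialogue content, then prefer
        # models containing more of the elements the user is known to like.
        def score(model):
            labels = {obj.label for obj in model.objects}
            if not set(required_labels) <= labels:
                return -1
            return len(labels & set(preferred_labels))
        return max(models, key=score)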

[0073] With the selected scene model, the virtual reality scene generation unit 800 invokes the virtual reality renderer 810, which renders, at 855, the virtual reality scene of choice according to the scene model. In some embodiments, the rendering may be performed based on a scene rendering configuration 815, which may specify different parameters that can be used to control the rendering. In some embodiments, the rendering may also be performed based on a preference of the user (not shown), e.g., a user may prefer a darker environment. Based on the selected virtual reality scene, dialogue content to be used by the dialogue manager 830 to control the dialogue may also be authored with respect to the virtual reality scene. This is achieved by the dynamic content authoring tool 820, and the content authoring may be done automatically. For instance, if the planned dialogue content asks a question about adding two numbers represented by, e.g., different objects, the scene model for the virtual reality scene may be selected so that the rendered virtual scene has the same number of objects. In addition, dynamic dialogue content consistent with the dialogue strategy as well as the virtual scene content needs to be developed so that the dialogue strategy may be carried out in light of the virtual scene. The dynamic content authoring tool 820 dynamically generates dialogue content with respect to the virtual scene based on the planned dialogue content. It receives, at 860, the dialogue content specified in the dialogue trees and generates, at 865, dynamic dialogue content authored in accordance with the intended dialogue content and the rendered virtual scene.

[0074] In such a generated virtual scene, when the dialogue system 800 receives, at 870, an utterance from a user, the SLU unit 802 performs, at 875, spoken language understanding to obtain a result representing what the user uttered. Such a result is provided to the dialogue manager 830, which then determines, at 880, a response based on either the planned dialogue content represented in the dialogue trees 835 or the dynamic dialogue content stored in 825. The determined response is then used by the TTS unit 840 to generate, at 885, a speech response via a text to speech operation, which is delivered, at 890, to the user in response to the utterance. In some embodiments, the dialogue manager 830 may determine, at 895, whether the virtual scene needs to be changed based on the current state of the dialogue. If a change is called for, the process proceeds to 850 to select and render a different virtual scene suitable for the current dialogue state and to dynamically update the dialogue content based on the new virtual scene before the user responds, at 870, to the delivered response. If there is no need to change the virtual scene, the dialogue system 800 returns to 870 directly to receive the next utterance from the user.
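
The overall loop of Fig. 8B (steps 850-895) can be summarized in the following sketch, where the system interface is hypothetical and stands in for the units described above:

    def run_dialogue(system):
        scene_model = system.select_scene_model()                     # 850
        system.render_virtual_scene(scene_model)                      # 855
        content = system.author_content(scene_model)                  # 860-865
        while not system.dialogue_complete():
            utterance = system.receive_utterance()                    # 870
            slu_result = system.understand(utterance)                 # 875
            response = system.decide_response(slu_result, content)    # 880
            system.speak(response)                                    # 885-890
            if system.scene_change_needed():                          # 895
                scene_model = system.select_scene_model()             # back to 850
                system.render_virtual_scene(scene_model)
                content = system.author_content(scene_model)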

[0075] In some embodiments, once the dialogue system renders a virtual scene, it may not change the virtual scene until the dialogue is complete. In some embodiments, scene modeling may also be utilized in a scenario where a virtual scene and an augmented dialogue scene are combined. For example, in a virtual scene created based on a selected scene model, additional virtual objects may be projected as needed, e.g., 3 bears and 5 dolphins may be thrown into the virtual dialogue scene and the user asked to perform addition.

[0076] Fig. 9 is an illustrative diagram of an exemplary mobile device architecture that may be used to realize a specialized system implementing the present teaching in accordance with various embodiments. In this example, the user device on which the present teaching is implemented corresponds to a mobile device 900, including, but not limited to, a smart phone, a tablet, a music player, a handheld gaming console, a global positioning system (GPS) receiver, and a wearable computing device (e.g., eyeglasses, wrist watch, etc.), or any other form factor. Mobile device 900 may include one or more central processing units (“CPUs”) 940, one or more graphic processing units (“GPUs”) 930, a display 920, a memory 960, a communication platform 910, such as a wireless communication module, storage 990, and one or more input/output (I/O) devices 940. Any other suitable component, including but not limited to a system bus or a controller (not shown), may also be included in the mobile device 900. As shown in Fig. 9, a mobile operating system 970 (e.g., iOS, Android, Windows Phone, etc.) and one or more applications 980 may be loaded into memory 960 from storage 990 in order to be executed by the CPU 940. The applications 980 may include a browser or any other suitable mobile app for managing a conversation system on mobile device 900. User interactions may be achieved via the I/O devices 940 and provided to the automated dialogue companion via network(s) 120.

[0077] To implement various modules, units, and their functionalities described in the present disclosure, computer hardware platforms may be used as the hardware platform(s) for one or more of the elements described herein. The hardware elements, operating systems, and programming languages of such computers are conventional in nature, and it is presumed that those skilled in the art are adequately familiar therewith to adapt those technologies to appropriate settings as described herein. A computer with user interface elements may be used to implement a personal computer (PC) or other type of workstation or terminal device, although a computer may also act as a server if appropriately programmed. It is believed that those skilled in the art are familiar with the structure, programming, and general operation of such computer equipment and, as a result, the drawings should be self-explanatory.

[0078] Fig. 10 is an illustrative diagram of an exemplary computing device architecture that may be used to realize a specialized system implementing the present teaching in accordance with various embodiments. Such a specialized system incorporating the present teaching has a functional block diagram illustration of a hardware platform, which includes user interface elements. The computer may be a general purpose computer or a special purpose computer. Both can be used to implement a specialized system for the present teaching. This computer 1000 may be used to implement any component of the conversation or dialogue management system, as described herein. For example, the conversation management system may be implemented on a computer such as computer 1000, via its hardware, software program, firmware, or a combination thereof. Although only one such computer is shown, for convenience, the computer functions relating to the conversation management system as described herein may be implemented in a distributed fashion on a number of similar platforms, to distribute the processing load.

[0079] Computer 1000, for example, includes COM ports 1050 connected to and from a network connected thereto to facilitate data communications. Computer 1000 also includes a central processing unit (CPU) 1020, in the form of one or more processors, for executing program instructions. The exemplary computer platform includes an internal communication bus 1010 and program storage and data storage of different forms (e.g., disk 1070, read only memory (ROM) 1030, or random access memory (RAM) 1040) for various data files to be processed and/or communicated by computer 1000, as well as possibly program instructions to be executed by CPU 1020. Computer 1000 also includes an I/O component 1060, supporting input/output flows between the computer and other components therein such as user interface elements 1080. Computer 1000 may also receive programming and data via network communications.

[0080] Hence, aspects of the methods of dialogue management and/or other processes, as outlined above, may be embodied in programming. Program aspects of the technology may be thought of as “products” or “articles of manufacture” typically in the form of executable code and/or associated data that is carried on or embodied in a type of machine readable medium. Tangible non-transitory “storage” type media include any or all of the memory or other storage for the computers, processors, or the like, or associated modules thereof, such as various semiconductor memories, tape drives, disk drives, and the like, which may provide storage at any time for the software programming.

[0081] All or portions of the software may at times be communicated through a network such as the Internet or various other telecommunication networks. Such communications, for example, may enable loading of the software from one computer or processor into another, for example, in connection with conversation management. Thus, another type of media that may bear the software elements includes optical, electrical and electromagnetic waves, such as used across physical interfaces between local devices, through wired and optical landline networks and over various air-links. The physical elements that carry such waves, such as wired or wireless links, optical links or the like, also may be considered as media bearing the software. As used herein, unless restricted to tangible “storage” media, terms such as computer or machine “readable medium” refer to any medium that participates in providing instructions to a processor for execution.

[0082] Hence, a machine-readable medium may take many forms, including but not limited to, a tangible storage medium, a carrier wave medium, or a physical transmission medium. Non-volatile storage media include, for example, optical or magnetic disks, such as any of the storage devices in any computer(s) or the like, which may be used to implement the system or any of its components as shown in the drawings. Volatile storage media include dynamic memory, such as a main memory of such a computer platform. Tangible transmission media include coaxial cables, copper wire, and fiber optics, including the wires that form a bus within a computer system. Carrier-wave transmission media may take the form of electric or electromagnetic signals, or acoustic or light waves such as those generated during radio frequency (RF) and infrared (IR) data communications. Common forms of computer-readable media therefore include, for example: a floppy disk, a flexible disk, a hard disk, magnetic tape, any other magnetic medium, a CD-ROM, DVD or DVD-ROM, any other optical medium, punch cards, paper tape, any other physical storage medium with patterns of holes, a RAM, a PROM and EPROM, a FLASH-EPROM, any other memory chip or cartridge, a carrier wave transporting data or instructions, cables or links transporting such a carrier wave, or any other medium from which a computer may read programming code and/or data. Many of these forms of computer readable media may be involved in carrying one or more sequences of one or more instructions to a physical processor for execution.

[0083] Those skilled in the art will recognize that the present teachings are amenable to a variety of modifications and/or enhancements. For example, although the implementation of various components described above may be embodied in a hardware device, it may also be implemented as a software-only solution, e.g., an installation on an existing server. In addition, the adaptive dialogue techniques as disclosed herein may be implemented as a firmware, firmware/software combination, firmware/hardware combination, or hardware/firmware/software combination.

[0084] While the foregoing has described what are considered to constitute the present teachings and/or other examples, it is understood that various modifications may be made thereto and that the subject matter disclosed herein may be implemented in various forms and examples, and that the teachings may be applied in numerous applications, only some of which have been described herein. It is intended by the following claims to claim any and all applications, modifications and variations that fall within the true scope of the present teachings.