


Title:
SYSTEM AND METHOD FOR INTELLIGENT DIALOGUE BASED ON KNOWLEDGE TRACING
Document Type and Number:
WIPO Patent Application WO/2020/256992
Kind Code:
A1
Abstract:
The present teaching relates to methods, systems, media, and implementations for adaptive dialogue management. A language understanding result is received with an assessment thereof. The language understanding result is derived based on an utterance from a user engaged in a dialogue directed to a topic and governed by a dialogue policy. The assessment is obtained with respect to an expected result represented in the dialogue policy. A plurality of probabilities are derived based on the language understanding result and the associated assessment. A first set of parameters associated with the dialogue policy are updated based on the plurality of probabilities, wherein the first set of parameters parameterize the dialogue policy with respect to the user and characterize effectiveness of the dialogue with the user under the dialogue policy.

Inventors:
THOMBARE SHALAKA (US)
LIU CHANGSONG (US)
Application Number:
PCT/US2020/036743
Publication Date:
December 24, 2020
Filing Date:
June 09, 2020
Assignee:
DMAI INC (US)
International Classes:
G10L15/22
Foreign References:
US20090112598A12009-04-30
US20100138215A12010-06-03
US20120053945A12012-03-01
US20020173960A12002-11-21
US20110282879A12011-11-17
Attorney, Agent or Firm:
GADKAR, Arush et al. (US)
Claims:
WE CLAIM:

1. A method implemented on at least one machine including at least one processor, memory, and a communication platform capable of connecting to a network for adaptive dialogue management, the method comprising:

receiving a language understanding result with an assessment thereof, wherein the language understanding result is derived based on an utterance from a user engaged in a dialogue directed to a topic, the dialogue is governed by a dialogue policy, and the assessment is obtained with respect to an expected result represented in the dialogue policy;

determining a plurality of probabilities based on the language understanding result and the associated assessment;

updating a first set of parameters associated with the dialogue policy based on the plurality of probabilities, wherein the first set of parameters parameterize the dialogue policy with respect to the user and characterize effectiveness of the dialogue with the user under the dialogue policy.

2. The method of claim 1, wherein

the dialogue policy represents alternative ways to conduct the dialogue with the user on the topic; and

the expected result represents an answer from the user in response to a statement presented to the user according to the dialogue policy.

3. The method of claim 1, wherein the plurality of probabilities include: a knowing positive probability indicative of a likelihood that the user knows the expected result, whether or not the language understanding result is the same as the expected result;

a knowing negative probability indicative of a likelihood that the user does not know the expected result, whether or not the language understanding result is the same as the expected result; and

a guessing probability indicative of a likelihood that the user guessed the language understanding result.

4. The method of claim 1, further comprising updating a second set of parameters associated with a representation of the topic based on the plurality of probabilities, wherein the second set of parameters represents a dynamic evaluation of the user’s mastery of the topic.

5. The method of claim 4, wherein the first and second sets of parameters characterize utilities of the user with respect to the topic.

6. The method of claim 5, wherein the utilities include:

a state reward indicative of a level of reward to conduct a dialogue with the user on the topic and associated with a representation of the topic; and

one or more path rewards, each of which is associated with one of alternative dialogue paths embedded in the dialogue policy and represents effectiveness of the dialogue with the user along the dialogue path.

7. The method of claim 5, further comprising determining a response to the user based on the utilities of the user dynamically updated based on knowledge traced with respect to the dialogue policy for the topic.

8. Machine-readable and non-transitory media having information recorded thereon for adaptive dialogue management, wherein the information, when read by the machine, causes the machine to perform the following:

receiving a language understanding result with an assessment thereof, wherein the language understanding result is derived based on an utterance from a user engaged in a dialogue directed to a topic, the dialogue is governed by a dialogue policy, and the assessment is obtained with respect to an expected result represented in the dialogue policy;

determining a plurality of probabilities based on the language understanding result and the associated assessment;

updating a first set of parameters associated with the dialogue policy based on the plurality of probabilities, wherein the first set of parameters parameterize the dialogue policy with respect to the user and characterize effectiveness of the dialogue with the user under the dialogue policy.

9. The medium of claim 8, wherein

the dialogue policy represents alternative ways to conduct the dialogue with the user on the topic; and

the expected result represents an answer from the user in response to a statement presented to the user according to the dialogue policy.

10. The medium of claim 8, wherein the plurality of probabilities include: a knowing positive probability indicative of a likelihood that the user knows the expected result, whether or not the language understanding result is the same as the expected result;

a knowing negative probability indicative of a likelihood that the user does not know the expected result, whether or not the language understanding result is the same as the expected result; and

a guessing probability indicative of a likelihood that the user guessed the language understanding result.

11. The medium of claim 8, wherein the information, when read by the machine, further causes the machine to perform updating a second set of parameters associated with a representation of the topic based on the plurality of probabilities, wherein the second set of parameters represents a dynamic evaluation of the user’s mastery of the topic.

12. The medium of claim 11, wherein the first and second sets of parameters characterize utilities of the user with respect to the topic.

13. The medium of claim 12, wherein the utilities include:

a state reward indicative of a level of reward to conduct a dialogue with the user on the topic and associated with a representation of the topic; and

one or more path rewards, each of which is associated with one of alternative dialogue paths embedded in the dialogue policy and represents effectiveness of the dialogue with the user along the dialogue path.

14. The medium of claim 12, wherein the information, when read by the machine, further causes the machine to perform determining a response to the user based on the utilities of the user dynamically updated based on knowledge traced with respect to the dialogue policy for the topic.

15. A system for adaptive dialogue management, comprising:

a knowledge tracing unit configured for receiving a language understanding result with an assessment thereof, wherein the language understanding result is derived based on an utterance from a user engaged in a dialogue directed to a topic, the dialogue is governed by a dialogue policy, and the assessment is obtained with respect to an expected result represented in the dialogue policy;

a plurality of probability estimators configured for determining a plurality of probabilities based on the language understanding result and the associated assessment;

an information state updater configured for updating a first set of parameters associated with the dialogue policy based on the plurality of probabilities, wherein the first set of parameters parameterize the dialogue policy with respect to the user and characterize effectiveness of the dialogue with the user under the dialogue policy.

16. The system of claim 15, wherein

the dialogue policy represents alternative ways to conduct the dialogue with the user on the topic; and the expected result represents an answer from the user in response to a statement presented to the user according to the dialogue policy.

17. The system of claim 15, wherein the plurality of probability estimators include:

a knowing positive probability estimator configured for estimating a knowing positive probability indicative of a likelihood that the user knows the expected result, whether or not the language understanding result is the same as the expected result;

a knowing negative probability estimator configured for estimating a knowing negative probability indicative of a likelihood that the user does not know the expected result, whether or not the language understanding result is the same as the expected result; and

a guessing probability estimator configured for estimating a guessing probability indicative of a likelihood that the user guessed the language understanding result.

18. The system of claim 15, wherein the information state updater is further configured for updating a second set of parameters associated with a representation of the topic based on the plurality of probabilities, wherein the second set of parameters represents a dynamic evaluation of the user’s mastery of the topic.

19. The system of claim 18, wherein the first and second sets of parameters characterize utilities of the user.

20. The system of claim 19, wherein the utilities include: a state reward indicative of a level of reward to conduct a dialogue with the user on the topic and associated with a representation of the topic; and

one or more path rewards, each of which is associated with one of alternative dialogue paths embedded in the dialogue policy and represents effectiveness of the dialogue with the user along the dialogue path.

21. The system of claim 19, further comprising determining a response to the user based on the utilities of the user dynamically updated based on knowledge traced with respect to the dialogue policy for the topic.

Description:
SYSTEM AND METHOD FOR INTELLIGENT

DIALOGUE BASED ON KNOWLEDGE TRACING

CROSS-REFERENCE TO RELATED APPLICATIONS

The present application claims priority from U.S. Provisional Patent Application 62/862,268, filed June 17, 2019 (Attorney Docket No. 047437-0503582), U.S. Provisional Patent Application 62/862,253, filed June 17, 2019 (Attorney Docket No. 047437-0503572), U.S. Provisional Patent Application 62/862,257, filed June 17, 2019 (Attorney Docket No. 047437-0503574), U.S. Provisional Patent Application 62/862,261, filed June 17, 2019 (Attorney Docket No. 047437-0503575), U.S. Provisional Patent Application 62/862,264, filed June 17, 2019 (Attorney Docket No. 047437-0503578), U.S. Provisional Patent Application 62/862,265, filed June 17, 2019 (Attorney Docket No. 047437-0503581), U.S. Provisional Patent Application 62/862,273, filed June 17, 2019 (Attorney Docket No. 047437-0503579), U.S. Provisional Patent Application 62/862,275, filed June 17, 2019 (Attorney Docket No. 047437-0503580), U.S. Provisional Patent Application 62/862,279, filed June 17, 2019 (Attorney Docket No. 047437-0503584), U.S. Provisional Patent Application 62/862,282, filed June 17, 2019 (Attorney Docket No. 047437-0503585), U.S. Provisional Patent Application 62/862,286, filed June 17, 2019 (Attorney Docket No. 047437-0503586), U.S. Provisional Patent Application 62/862,290, filed June 17, 2019 (Attorney Docket No. 047437-0503587), and U.S. Provisional Patent Application 62/862,296, filed June 17, 2019 (Attorney Docket No. 047437-0503589), the contents of which are incorporated herein by reference in their entirety.

BACKGROUND

1. Technical Field

[0001] The present teaching generally relates to computers. More specifically, the present teaching relates to human-machine dialogue management.

2. Technical Background

[0002] With the advancement of artificial intelligence technologies and the explosion of Internet-based communications due to ubiquitous Internet connectivity, computer-aided dialogue systems have become increasingly popular. For example, more and more call centers deploy automated dialogue robots to handle customer calls. Hotels have started to install kiosks that can answer questions from tourists or guests. In recent years, automated human-machine communications in other areas are also becoming more and more popular.

[0003] Traditional computer-aided dialogue systems are usually pre-programmed with certain dialogue content, such as questions and answers, based on commonly known patterns of conversations in relevant domains. Unfortunately, a conversation pattern that works for some human users may not work for others. In addition, a human user may digress during a conversation, and continuing a fixed conversation pattern without regard to what the user says is likely to cause irritation or loss of interest, which is undesirable.

[0004] In planning a conversation, a human designer usually needs to manually author the content of the conversation based on known knowledge, which is time-consuming and tedious. Considering the need to author different conversation patterns, even more labor is required. Once the dialogue content is authored, any deviation from the designed conversation patterns may need to be noted and used in determining how to continue the conversation. Prior dialogue systems do not effectively address such issues.

[0005] With recent developments in the AI field, dynamically observed information may be adaptively incorporated in learning and used to guide the progression of a human-machine interaction session. Developing a knowledge representation capable of incorporating dynamic information in different dimensions, and sometimes in different modalities, is a challenging problem. Because such a knowledge representation serves as the basis for conducting a dynamic dialogue process between a human and a machine, the representation needs to be adequately configured to support adaptive conversation in a relevant manner.

[0006] To conduct a conversation with a human, an automated dialogue system may need to achieve different levels of understanding: what the human said linguistically, the semantic meaning of what was said, sometimes the emotional state of the human, and the mutual causal effect between what is said and the surroundings of the conversation environment. Traditional computer-aided dialogue systems are not adequate to address such issues.

[0007] Thus, there is a need for methods and systems that address such limitations.

SUMMARY

[0008] The teachings disclosed herein relate to methods, systems, and programming for adaptive dialogue management. More particularly, the present teaching relates to methods, systems, and programming related to conducting intelligent dialogues based on knowledge tracing.

[0009] In one example, a method, implemented on a machine having at least one processor, storage, and a communication platform capable of connecting to a network, is disclosed for adaptive dialogue management. A language understanding result is received with an assessment thereof. The language understanding result is derived based on an utterance from a user engaged in a dialogue directed to a topic and governed by a dialogue policy. The assessment is obtained with respect to an expected result represented in the dialogue policy. A plurality of probabilities are derived based on the language understanding result and the associated assessment. A first set of parameters associated with the dialogue policy are updated based on the plurality of probabilities, wherein the first set of parameters parameterize the dialogue policy with respect to the user and characterize effectiveness of the dialogue with the user under the dialogue policy.

[0010] In a different example, a system for adaptive dialogue management is disclosed that includes a knowledge tracing unit, a plurality of probability estimators, and an information state updater. The knowledge tracing unit is configured for receiving a language understanding result with an assessment thereof, wherein the language understanding result is derived based on an utterance from a user engaged in a dialogue directed to a topic, the dialogue is governed by a dialogue policy, and the assessment is obtained with respect to an expected result represented in the dialogue policy. The plurality of probability estimators are configured for determining a plurality of probabilities based on the language understanding result and the associated assessment. The information state updater is configured for updating a first set of parameters associated with the dialogue policy based on the plurality of probabilities, wherein the first set of parameters parameterize the dialogue policy with respect to the user and characterize effectiveness of the dialogue with the user under the dialogue policy.

[0011] Other concepts relate to software for implementing the present teaching. A software product, in accord with this concept, includes at least one machine-readable non-transitory medium and information carried by the medium. The information carried by the medium may be executable program code data, parameters in association with the executable program code, and/or information related to a user, a request, content, or other additional information.

[0012] In one example, a machine-readable, non-transitory and tangible medium having data recorded thereon for adaptive dialogue management is disclosed, wherein the medium, when read by the machine, causes the machine to perform a series of steps as required by the disclosed method for dialogue management. A language understanding result is received with an assessment thereof. The language understanding result is derived based on an utterance from a user engaged in a dialogue directed to a topic and governed by a dialogue policy. The assessment is obtained with respect to an expected result represented in the dialogue policy. A plurality of probabilities are derived based on the language understanding result and the associated assessment. A first set of parameters associated with the dialogue policy are updated based on the plurality of probabilities, wherein the first set of parameters parameterize the dialogue policy with respect to the user and characterize effectiveness of the dialogue with the user under the dialogue policy.

[0013] Additional advantages and novel features will be set forth in part in the description which follows, and in part will become apparent to those skilled in the art upon examination of the following and the accompanying drawings or may be learned by production or operation of the examples. The advantages of the present teachings may be realized and attained by practice or use of various aspects of the methodologies, instrumentalities and combinations set forth in the detailed examples discussed below.

BRIEF DESCRIPTION OF THE DRAWINGS

[0014] The methods, systems and/or programming described herein are further described in terms of exemplary embodiments. These exemplary embodiments are described in detail with reference to the drawings. These embodiments are non-limiting exemplary embodiments, in which like reference numerals represent similar structures throughout the several views of the drawings, and wherein:

[0015] Fig. 1A depicts an exemplary configuration of a dialogue system centered around an information state capturing dynamic information observed during a dialogue, in accordance with an embodiment of the present teaching;

[0016] Fig. 1B is a flowchart of an exemplary process of a dialogue system using an information state capturing dynamic information observed during a dialogue, in accordance with an embodiment of the present teaching;

[0017] Fig. 2A depicts an exemplary construction of an information state, in accordance with an embodiment of the present teaching;

[0018] Fig. 2B illustrates how representations of estimated different mindsets are connected in a dialogue with a robot tutor teaching a user adding fractions, in accordance with an embodiment of the present teaching;

[0019] Fig. 2C shows an exemplary relationship among estimated representations of an agent’s mindset, a shared mindset, and a user’s mindset in an information state, in accordance with an embodiment of the present teaching;

[0020] Fig. 3A shows exemplary relationships among different types of And-Or-Graphs (AOGs) used to represent estimated mindsets of parties involved in a dialogue, in accordance with an embodiment of the present teaching;

[0021] Fig. 3B depicts exemplary associations between spatial AOGs (S-AOGs) and temporal AOGs (T-AOGs) in an information state, in accordance with an embodiment of the present teaching;

[0022] Fig. 3C illustrates an exemplary S-AOG and its associated T-AOGs, in accordance with an embodiment of the present teaching;

[0023] Fig. 3D illustrates exemplary relationships among an S-AOG, a T-AOG, and a C-AOG, in accordance with an embodiment of the present teaching;

[0024] Fig. 4A illustrates an exemplary S-AOG representing partially an agent’s mindset for teaching different mathematical concepts, in accordance with an embodiment of the present teaching;

[0025] Fig. 4B illustrates an exemplary T-AOG representing a dialogue policy associated partially with an agent’s mindset to teach the concept of fraction, in accordance with an embodiment of the present teaching;

[0026] Fig. 4C shows exemplary dialogue content for teaching a concept associated with fraction, in accordance with an embodiment of the present teaching;

[0027] Fig. 5A illustrates an exemplary temporal parsed graph (T-PG) within a T-AOG representing a shared mindset between a user and a machine, in accordance with an embodiment of the present teaching;

[0028] Fig. 5B illustrates a part of a dialogue between a machine and a human along a dialogue path representing a present representation of a shared mindset, in accordance with an embodiment of the present teaching;

[0029] Fig. 5C depicts an exemplary S-AOG with nodes parameterized with measures related to levels of mastery of different underlying concepts to represent a user’s mindset, in accordance with an embodiment of the present teaching;

[0030] Fig. 5D shows exemplary types of personality traits of a user that can be estimated based on observations from a dialogue, in accordance with an embodiment of the present teaching;

[0031] Fig. 6A depicts a generic S-AOG for a tutoring dialogue, in accordance with an embodiment of the present teaching;

[0032] Fig. 6B depicts a specific T-AOG for a dialogue on greeting, in accordance with an embodiment of the present teaching;

[0033] Fig. 6C shows different types of parameterization alternatives for different types of AOGs, in accordance with an embodiment of the present teaching;

[0034] Fig. 6D illustrates an S-AOG with different nodes parameterized with rewards updated based on dynamic observations from a dialogue, in accordance with an embodiment of the present teaching;

[0035] Fig. 6E illustrates an exemplary T-AOG generated by consolidating different graphs via graph matching with parameterized content, in accordance with an embodiment of the present teaching;

[0036] Fig. 6F illustrates an exemplary T-AOG with parameterized content associated with nodes, in accordance with an embodiment of the present teaching;

[0037] Fig. 6G shows a T-AOG having each node parameterized with one or more content sets, in accordance with an embodiment of the present teaching;

[0038] Fig. 6H illustrates exemplary data in different content sets associated with different nodes of a T-AOG, in accordance with an embodiment of the present teaching;

[0039] Fig. 6I illustrates an exemplary T-AOG with different paths traversing different nodes parameterized with rewards updated based on dynamic observations from a dialogue, in accordance with an embodiment of the present teaching;

[0040] Fig. 7A depicts a high-level system diagram of a knowledge tracing unit, in accordance with an embodiment of the present teaching;

[0041] Fig. 7B illustrates how knowledge tracing enables adaptive dialogue management, in accordance with an embodiment of the present teaching;

[0042] Fig. 7C is a flowchart of an exemplary process of a knowledge tracing unit, in accordance with an embodiment of the present teaching;

[0043] Fig. 8A shows an example of utility-driven tutoring (node) planning with respect to S-AOGs, in accordance with an embodiment of the present teaching;

[0044] Fig. 8B illustrates an example of utility-driven path planning with respect to T-AOGs, in accordance with an embodiment of the present teaching;

[0045] Fig. 8C illustrates a dynamic state in utility-driven adaptive dialogue management derived based on parameterized AOGs, in accordance with an embodiment of the present teaching;

[0046] Fig. 9A depicts exemplary modes to create AOGs with authored content, in accordance with an embodiment of the present teaching;

[0047] Fig. 9B depicts an exemplary high-level system diagram of a content authoring system for automatically creating AOGs via machine learning, in accordance with an embodiment of the present teaching;

[0048] Fig. 9C shows different types of topic-based AOGs derived from machine learning, in accordance with an embodiment of the present teaching;

[0049] Fig. 9D is a flowchart of an exemplary process for a content authoring system for creating AOGs via machine learning, in accordance with an embodiment of the present teaching;

[0050] Fig. 10A illustrates an exemplary visual programming interface configured for content authoring associated with AOGs, in accordance with an embodiment of the present teaching;

[0051] Fig. 10B illustrates an exemplary visual programming interface configured for authoring content for parameterized AOGs, in accordance with an embodiment of the present teaching;

[0052] Fig. 10C is a flowchart of an exemplary process for creating AOGs and content authoring via visual programming, in accordance with an embodiment of the present teaching;

[0053] Fig. 11A illustrates exemplary codes obtained via automated/semi-automated content authoring for generating a scene associated with an S-AOG, in accordance with an embodiment of the present teaching;

[0054] Fig. 11B illustrates exemplary codes obtained via automated/semi-automated content authoring for generating a T-AOG, in accordance with an embodiment of the present teaching;

[0055] Fig. 12A depicts an exemplary high-level system diagram of a system for authoring content based on multimodal inputs from a user, in accordance with an embodiment of the present teaching;

[0056] Fig. 12B illustrates different types of metadata that can be automatically generated based on multimodal inputs from a user, in accordance with an embodiment of the present teaching;

[0057] Fig. 12C is a flowchart of an exemplary process of a system configured for authoring content based on multimodal inputs from a user, in accordance with an embodiment of the present teaching;

[0058] Fig. 13 is an illustrative diagram of an exemplary mobile device architecture that may be used to realize a specialized system implementing the present teaching in accordance with various embodiments; and

[0059] Fig. 14 is an illustrative diagram of an exemplary computing device architecture that may be used to realize a specialized system implementing the present teaching in accordance with various embodiments.

DETAILED DESCRIPTION

[0060] In the following detailed description, numerous specific details are set forth by way of examples in order to facilitate a thorough understanding of the relevant teachings. However, it should be apparent to those skilled in the art that the present teachings may be practiced without such details. In other instances, well-known methods, procedures, components, and/or circuitry have been described at a relatively high level, without detail, in order to avoid unnecessarily obscuring aspects of the present teachings.

[0061] The present teaching aims to address the deficiencies of traditional human-machine dialogue systems and to provide methods and systems that enable rich representations of multimodal information from the conversation environment, allowing the machine to have an improved sense of the surroundings of the dialogue in order to better adapt the dialogue for more effective conversation and enhanced engagement with users. Based on such representations, the present teaching further discloses different modes to create such representations and to author dialogue content in such representations. Furthermore, to allow adaptation of the representations based on the dynamics occurring during the conversation, the present teaching also discloses a mechanism for tracing the dynamics of a conversation and accordingly updating the representations, which are then used by the machine to conduct the dialogue in an adaptive manner that is utility-driven to achieve a maximized outcome.

[0062] Fig. 1A depicts an exemplary configuration of a dialogue system 100 centered around an information state 110 capturing dynamic information observed during the dialogue, in accordance with an embodiment of the present teaching. The dialogue system 100 comprises a multimodal information processor 120, an automatic speech recognition (ASR) engine 130, a natural language understanding (NLU) engine 140, a dialogue manager (DM) 150, a natural language generation (NLG) engine 160, and a text-to-speech (TTS) engine 170. The system 100 interfaces with a user 180 to conduct a dialogue.

[0063] During a dialogue, multimodal information is collected from the environment (including from the user 180), which captures the surrounding information of the conversation environment, the speech and expressions, either facial or physical, of the user, etc. Such collected multimodal information is analyzed by the multimodal information processor 120 to extract relevant features in different modalities in order to estimate different characteristics of the user and the environment. For instance, the speech signal may be analyzed to determine speech-related features such as talking speed, pitch, or even accent. The visual signal related to the user may also be analyzed to extract, e.g., facial features or physical gestures in order to determine expressions of the user. Combining the acoustic features and visual features, the multimodal information processor 120 may also be able to infer the emotional state of the user. For instance, a high pitch in voice, fast talking, plus an angry facial expression may indicate that the user is upset. In some embodiments, observed user activities may also be analyzed to better understand the user. For example, if a user is pointing or walking towards a specific object, it may reveal what the user is referring to in his/her speech. Such multimodal information may provide useful context in understanding the intent of the user. The multimodal information processor 120 may continuously analyze the multimodal information and store such analyzed information in the information state 110, which is then used by different components in system 100 to facilitate decision making related to dialogue management.

[0064] In operation, speech information from the user 180 is sent to the ASR engine 130 to perform speech recognition. The speech recognition may include discerning the language spoken and the words being uttered by the user 180. To understand the semantics of what the user said, the result from the ASR engine 130 is further processed by the NLU engine 140. Such understanding may rely not only on the words being spoken but also on other information such as the expression and gesture of the user 180 and/or other contextual information such as what was said previously. Based on the understanding of the user’s utterance, the dialogue manager 150 determines how to respond to the user. Such a determined response may then be generated via the NLG engine 160 in a text form and further transformed from text to speech signals via the TTS engine 170. The output of the TTS engine 170 may then be delivered to the user 180 as a response to the user’s utterance. The process continues via such back-and-forth interactions for the machine dialogue system to carry on the conversation with the user 180.

[0065] As seen in Fig. 1A, components in system 100 are connected to the information state 110, which, as discussed herein, captures the dynamics around the dialogue and provides relevant and rich contextual information that can be used to facilitate speech recognition (ASR), language understanding (NLU), and various dialogue related determinations, including what is an appropriate response (DM), what linguistic features to apply to the textual response (NLG), and how to convert a textual response to a speech form (TTS) (e.g., what accent). As discussed herein, the information state 110 may represent the dynamics relevant to a dialogue obtained based on multimodal information, either related to the user 180 or to the surroundings of the dialogue.
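The operational loop of paragraphs [0064]-[0065] can be summarized in code. The sketch below is a minimal illustration, assuming hypothetical interfaces for each engine (the class, function, and parameter names are inventions for this example, not the actual interfaces of system 100); each stage reads from and writes to a shared information state.

```python
# Minimal sketch of the ASR -> NLU -> DM -> NLG -> TTS turn described above.
# All component interfaces are hypothetical illustrations.

class InformationState:
    """Shared store (information state 110) read and updated by every stage."""
    def __init__(self):
        self.facts = {}

    def update(self, **kv):
        self.facts.update(kv)

def dialogue_turn(audio, scene, state, asr, nlu, dm, nlg, tts):
    """One back-and-forth turn of the dialogue system 100."""
    state.update(scene=scene)                 # multimodal processor 120 output
    words = asr(audio, state)                 # ASR engine 130
    meaning = nlu(words, state)               # NLU engine 140
    state.update(last_utterance=meaning)
    response = dm(meaning, state)             # dialogue manager 150
    text = nlg(response, state)               # NLG engine 160
    state.update(last_response=text)
    return tts(text, state)                   # TTS engine 170

# Toy stand-ins so the sketch runs end to end.
state = InformationState()
speech = dialogue_turn(
    audio="three fourths",
    scene={"user_points_at": "fraction card"},
    state=state,
    asr=lambda a, s: a,                       # pretend recognition is perfect
    nlu=lambda w, s: {"answer": w, "referent": s.facts["scene"]["user_points_at"]},
    dm=lambda m, s: "confirm" if m["answer"] == "three fourths" else "hint",
    nlg=lambda r, s: "That is right!" if r == "confirm" else "Try again.",
    tts=lambda t, s: t.encode("utf-8"),       # stand-in for audio rendering
)
print(speech)  # b'That is right!'
```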

[0066] Upon receiving the multimodal information from the dialogue scene (either about the user or about the dialogue surroundings), the multimodal information processor 120 analyzes the information and characterizes the dialogue surroundings in different dimensions, e.g., acoustic characteristics (e.g., pitch, speed, accent of the user), visual characteristics (e.g., facial expressions of the user, objects in the environment), physical characteristics (e.g., the user waving a hand or pointing at an object in the environment), estimated emotion and/or state of mind of the user, and/or preferences or intent of the user. Such information may then be stored in the information state 110.

[0067] The rich media contextual information stored in the information state 110 may facilitate different components to play their respective roles so that the dialogue may be conducted in a way that is adaptive, more engaging, and more effective with respect to the intended goals. For example, rich contextual information can improve understanding the utterance of the user 180 in light of what was observed in the dialogue scene, assessing the performance of the user 180, and/or estimating the utilities (or preferences) of the user in light of the intended goal of a dialogue; determining how to respond to the utterance of the user 180 based on the estimated emotional state of the user; and delivering a response in a manner that is considered most appropriate based on what is known about the user. For instance, if accent information about a user is captured in the information state, represented in both acoustic form (e.g., a special way of speaking certain phonemes) and visual form (e.g., special visemes of the user), the ASR engine 130 may utilize that information to figure out the words the user said. Similarly, the NLU engine 140 may also utilize the rich contextual information to figure out the semantics of what a user means. For instance, if a user points to a computer placed on a desk (visual information) and says, “I like this,” the NLU engine 140 may combine the output of the ASR engine 130 (i.e., “I like this”) and the visual information that the user is pointing at a computer in the room to understand that by “this” the user means the computer. As another example, if the user 180 repeatedly makes mistakes in a tutoring session and, at the same time, appears to be quite annoyed, as estimated based on the tone of the speech and/or the user’s facial expression (e.g., determined based on multimodal information), instead of continuing to press on the tutoring content, the DM 150 may determine to change the topic temporarily based on known interests of the user (e.g., talk about Lego games) in order to keep the user engaged. The decision to distract the user may be based on, e.g., utilities previously observed with respect to the user as to what worked (e.g., intermittently distracting the user has worked in the past) and what would not work (e.g., continuing to pressure the user to do better).

[0068] Fig. 1B is a flowchart of an exemplary process of the dialogue system 100 with the information state 110 capturing dynamic information observed during the dialogue, in accordance with an embodiment of the present teaching. As seen in Fig. 1B, the process is iterative. At 105, multimodal information is received, which is then analyzed by the multimodal information processor 120 at 115. As discussed herein, the multimodal information includes information related to the user 180 and/or related to the dialogue surroundings. Multimodal information related to the user may include the user’s utterance and/or visual observations of the user such as physical gestures and/or facial expressions. Information related to the dialogue surroundings may include information related to the environment such as objects present, the spatial/temporal relationships between the user and such observed objects (e.g., the user stands in front of a desk), and/or the dynamics between the user’s activities and the observed objects (e.g., the user walked towards the desk and points at a computer on the desk). An understanding of the multimodal information captured from the dialogue scene may then be used to facilitate other tasks in the dialogue system 100.

[0069] Based on the information stored in the information state 110 (representing the past state) as well as the analysis result from the multimodal information processor 120 (on the present state), the ASR engine 130 and the NLU engine 140 perform, at 125, respectively, speech recognition to ascertain the words spoken by the user and language understanding based on the recognized words. ASR and NLU may be performed based on the current information state 110 as well as the analysis results from the multimodal information processor 120.

[0070] Based on the multimodal information analysis and the result of language understanding, i.e., what the user said or meant, the changes of the dialogue state are traced, at 135, and such changes are used to update, at 145, the information state 110 accordingly to facilitate the subsequent processing. To carry on the dialogue, the DM 150 determines, at 155, a response based on a dialogue tree designed for the underlying dialogue, the output of the NLU engine 140 (understanding of the utterance), and the information stored in the information state 110. Once the response is determined, it is generated by the NLG engine 160 in, e.g., its textual form based on the information state 110. When a response is determined, there may be different ways of saying it. The NLG engine 160 may generate, at 165, a response in a style based on the user’s preferences or what is known to be more appropriate for the particular user in the current dialogue. For instance, if the user answers a question incorrectly, there are different ways to point out that the answer is incorrect. For a particular user in the present dialogue, if it is known that the user is sensitive and easily gets frustrated, a gentler way of telling the user that his/her answer is not correct may be used to generate the response. For example, instead of saying “It is wrong,” the NLG engine 160 may generate a textual response of “It is not completely correct.”

[0071] The textual response, generated by the NLG engine 160, may then be rendered into a speech form, at 175, by the TTS engine 170, e.g., in an audio signal form. Although standard or commonly used TTS techniques may be used to perform TTS, the present teaching discloses that the response generated by the NLG engine 160 may be further personalized based on information stored in the information state 110. For instance, if it is known that a slower talking speed or a softer talking manner works better for the user, the generated response may be rendered, at 175, by the TTS engine 170 into a speech form accordingly, e.g., with a lower speed and pitch. Another example is to render the response with an accent consistent with the user’s known accent according to the personalized information about the user in the information state 110. The rendered response may then be delivered, at 185, to the user as a response to the user’s utterance. Upon responding to the user, the dialogue system 100 then traces the additional change of the dialogue and updates, at 195, the information state 110 accordingly.
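Paragraphs [0070]-[0071] describe choosing the wording and the speech rendering from personalized information in the information state. A minimal sketch of that selection logic follows, assuming a simple dictionary-based user profile; the profile keys and settings are illustrative assumptions, not the disclosed representation.

```python
# Hypothetical illustration of personalizing NLG wording and TTS rendering
# from the user profile kept in the information state 110.

def phrase_feedback(correct: bool, profile: dict) -> str:
    """Pick wording for feedback; gentler phrasing for sensitive users."""
    if correct:
        return "That is correct!"
    if profile.get("sensitive", False):
        return "It is not completely correct."   # gentler variant from the text
    return "It is wrong."

def tts_settings(profile: dict) -> dict:
    """Render speech with pace/pitch/accent matched to this user."""
    return {
        "rate": 0.8 if profile.get("prefers_slow_speech") else 1.0,
        "pitch_shift": -2 if profile.get("prefers_soft_tone") else 0,
        "accent": profile.get("accent", "neutral"),
    }

profile = {"sensitive": True, "prefers_slow_speech": True, "accent": "en-IN"}
print(phrase_feedback(False, profile))   # It is not completely correct.
print(tts_settings(profile))             # {'rate': 0.8, 'pitch_shift': 0, 'accent': 'en-IN'}
```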

[0072] Fig. 2A depicts an exemplary construction of the information state representation 110, in accordance with an embodiment of the present teaching. Without limitation, the information state 110 includes representations of estimated mindsets. As illustrated, different representations may be estimated to represent, e.g., the agent’s mindset 200, the user’s mindset 220, and a shared mindset 210, in connection with other information recorded therein. The agent’s mindset 200 may refer to the intended goal(s) that the dialogue agent (machine) is to achieve in a particular dialogue. The shared mindset 210 may refer to the representation of the present dialogue situation, which is a combination of the agent’s carrying out of the intended agenda according to the agent’s mindset 200 and the actual performance of the user. The user’s mindset 220 may refer to the representation of an estimation, by the agent according to the shared mindset or the performance of the user, of where the student is with respect to the intended purpose of the dialogue. For example, if an agent’s current task is teaching a student user the concept of fraction in math (which may include sub-concepts that build up the understanding of fraction), the user’s mindset may include the user’s estimated levels of mastery of various related concepts. Such an estimation may be derived based on an assessment of the student’s performance at different stages of tutoring the relevant concepts.

[0073] Fig. 2B illustrates how such representations of different mindsets are connected in an example of a robot tutor 205 teaching a student user 180 on concept 215 related to adding fractions, in accordance with an embodiment of the present teaching. As seen, the robot agent 205 is interacting with the student user 180 via multimodal interactions. The robot agent 205 may start the tutoring based on an initial representation of the agent’s mindset 200 (e.g., the course on adding fractions, which may be represented as AOGs). During the tutoring, the student user 180 may answer questions from the robot tutor 205, and such answers form a certain dialogue path, enabling estimation of a representation of the shared mindset 210. Based on the user’s answers, the performance of the user is assessed and a representation of the user’s mindset 220 is estimated with respect to different aspects, e.g., whether the student has mastered the concept taught and what dialogue style works for this particular student.

[0074] As seen in Fig. 2A, the representations of the estimated mindsets are based on certain graph-related forms, including but not limited to Spatial-Temporal-Causal And-Or-Graphs (STC-AOGs) 230 and STC parsed graphs (STC-PGs) 240, and may be used in connection with other types of information stored in the information state, such as dialogue history 250, dialogue context 260, event-centric knowledge 270, common sense models 280, ..., and user profiles 290. These different types of information may be of multiple modalities and constitute different aspects of the dynamics of each dialogue with respect to each user. As such, the information state 110 captures both general information of various dialogues and personalized information with respect to each user and each dialogue, interconnected to facilitate different components in the dialogue system 100 carrying out their respective tasks in a more adaptive, personalized, and engaging manner.
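One plausible shape for the information state 110, assembled directly from the components enumerated above, is sketched below; the concrete field types are assumptions for illustration only.

```python
from dataclasses import dataclass, field

@dataclass
class InformationState:
    """Fields mirror the components named in the text (reference numerals
    in comments); the types are illustrative assumptions."""
    agent_mindset: dict = field(default_factory=dict)     # STC-AOGs (200, 230)
    shared_mindset: dict = field(default_factory=dict)    # STC-PGs traversed (210, 240)
    user_mindset: dict = field(default_factory=dict)      # estimated mastery, style (220)
    dialogue_history: list = field(default_factory=list)  # 250
    dialogue_context: dict = field(default_factory=dict)  # 260
    event_knowledge: dict = field(default_factory=dict)   # event-centric knowledge 270
    common_sense: dict = field(default_factory=dict)      # common sense models 280
    user_profile: dict = field(default_factory=dict)      # 290
```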

[0075] Fig. 2C shows an exemplary relationship among the agent’s mindset 200, the shared mindset 210, and the user’s mindset 220 represented in the information state 110, in accordance with an embodiment of the present teaching. As discussed herein, the shared mindset 210 represents a state of the dialogue achieved via interactions between an agent and a user and is a combination of what the agent intended (according to the agent’s mindset) and how the user performed in following the agent’s intended agenda. Based on the shared mindset 210, the dynamics of the dialogue may be traced as to what the agent and the user have each been able to achieve up to that point.

[0076] Tracing such dynamic knowledge enables the system to estimate what the user has achieved so far, i.e., which concepts or sub-concepts the student user has mastered and in what way (i.e., which dialogue path(s) are working and which one(s) may not work well). Based on what the student user has achieved so far, the user’s mindset 220 may be inferred or estimated, which is then used to determine how the agent may adjust or update the dialogue strategy in order to achieve the intended goal, or to adjust the agent’s mindset to adapt to the user. The process of adjusting the agent’s mindset enables deriving an updated agent’s mindset 200. Based on the dialogue history, the dialogue system 100 learns the preferences of the user or what works better for the user (utility). Such information, once incorporated into the information state, is used to adapt the dialogue strategy via utility-driven (or preference-driven) dialogue planning. An updated dialogue strategy drives the next step in the dialogue, which in turn leads to a response from the user, and subsequently to updates of the shared mindset, the user’s mindset, and the agent’s mindset. The process iterates so that the agent can continue to adapt the dialogue strategy based on the dynamic information state.
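The utility-driven planning described in [0076] can be illustrated with a small sketch. The text only states that preferences/utilities learned from the dialogue history drive the choice among alternative dialogue strategies; the exponential-moving-average update rule and the learning rate below are assumptions, not the disclosed method.

```python
# Sketch: keep a learned reward per alternative dialogue path and nudge it
# toward the observed effectiveness of the exchange just completed.

def update_path_reward(path_rewards: dict, path_id: str,
                       observed_effect: float, lr: float = 0.2) -> None:
    """observed_effect in [0, 1], e.g., the change in estimated mastery."""
    old = path_rewards.get(path_id, 0.5)                 # neutral prior
    path_rewards[path_id] = (1 - lr) * old + lr * observed_effect

def choose_path(path_rewards: dict, candidates: list) -> str:
    """Utility-driven planning: prefer the path with the highest reward."""
    return max(candidates, key=lambda p: path_rewards.get(p, 0.5))

rewards = {}
update_path_reward(rewards, "hint_then_retry", observed_effect=0.9)
update_path_reward(rewards, "direct_correction", observed_effect=0.2)
print(choose_path(rewards, ["hint_then_retry", "direct_correction"]))
# -> hint_then_retry
```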

[0077] According to the present teaching, different mindsets are represented based on, e.g., STC-AOGs and STC-PGs. Fig. 3A shows exemplary relationships among different types of And-Or-Graphs (AOGs) used to represent estimated mindsets of parties involved in a dialogue, in accordance with an embodiment of the present teaching. AOGs are graphs with AND and OR branches. Branches associated with a node in an AOG and related by an AND relation represent tasks that all need to be traversed. Branches from a node in an AOG and related by an OR relation represent tasks that can be alternatively traversed. As discussed herein, STC-AOGs include S-AOGs corresponding to spatial AOGs, T-AOGs corresponding to temporal AOGs, and C-AOGs corresponding to causal AOGs. According to the present teaching, an S-AOG is a graph comprising nodes each of which may correspond to a topic to be covered in a dialogue. A T-AOG is a graph comprising nodes each of which may correspond to a temporal action to be taken. Each T-AOG may be associated with a topic or node in an S-AOG, i.e., representing steps to be carried out during a dialogue on the topic corresponding to the S-AOG node. A C-AOG is a graph comprising nodes each of which may link to a node in a T-AOG and a node in a corresponding S-AOG, representing an action occurring in the T-AOG and its causal effect on the node in the corresponding S-AOG.
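A minimal sketch of the AND/OR node structure just described, assuming a simple recursive data type (an illustrative encoding, not the disclosed representation):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class AOGNode:
    """One node of an And-Or-Graph: AND children must all be traversed,
    OR children are mutually alternative ways to proceed."""
    name: str
    kind: str = "leaf"                         # "and" | "or" | "leaf"
    children: List["AOGNode"] = field(default_factory=list)

def leaves(node: AOGNode) -> List[str]:
    """Collect leaf names; for an OR node this lists every alternative."""
    if not node.children:
        return [node.name]
    return [name for c in node.children for name in leaves(c)]

# A tiny S-AOG: covering a topic requires covering both sub-topics (AND),
# and one sub-topic may be taught by alternative lesson plans (OR).
topic = AOGNode("topic", "and", [
    AOGNode("sub-topic 1", "or", [AOGNode("plan A"), AOGNode("plan B")]),
    AOGNode("sub-topic 2"),
])
print(leaves(topic))   # ['plan A', 'plan B', 'sub-topic 2']
```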

[0078] Fig. 3B depicts exemplary relationships between nodes in an S-AOG and nodes of an associated T-AOG represented in the information state 110, in accordance with an embodiment of the present teaching. In this illustration, each K node corresponds to a node in an S-AOG, representing a skill or a topic to be taught in a dialogue. An evaluation with respect to each K node may include “mastered” or “not-yet-mastered” with, e.g., respective probabilities P(T) and 1-P(T), where P(T) represents the transition probability from the not-yet-mastered to the mastered state. P(L0) denotes the probability of prior learning of the skill or prior knowledge on a topic, i.e., the likelihood that a student already mastered the concept before the tutoring session starts. To teach the skill/concept associated with each K node, a robot tutor may ask a number of questions in accordance with a T-AOG associated with the K node, and the student is to answer each question. Each question is shown as a Q node and a student answer is represented in Fig. 3B as an A node.

[0079] During a conversation between an agent and a user, a student’s answer may be a correct answer A(c) or a wrong answer A(w), as seen in Fig. 3B. Based on each answer received from the user, additional probabilities are determined based on, e.g., various knowledge or observations collected during the dialogue. For example, if a user provides a correct answer (A(c)), a probability P(G) of the answer being a guess may be determined, representing the likelihood that the student does not know the correct answer but guessed correctly. Conversely, 1-P(G) is the probability that the user knows the correct answer and answered correctly. For an incorrect or wrong answer, A(w), a probability P(S) may be determined, representing the likelihood of the student giving a wrong answer even though the student does know the concept. Based on P(S), a probability of 1-P(S) may be estimated, representing the likelihood that the student gives the wrong answer because the student does not know the concept. Such probabilities may be computed with respect to each node along a path traversed based on the actual dialogue and can be used to estimate when the student masters the concept and what may or may not work well in terms of teaching this specific student on each specific topic.
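The quantities named here, P(L0) prior mastery, P(T) transition, P(G) guess, and P(S) slip, are the parameters of classic Bayesian Knowledge Tracing. The text does not spell out the update formulas, so the standard BKT posterior update below should be read as one plausible realization rather than the disclosed computation.

```python
# Standard Bayesian Knowledge Tracing update over the parameters named in
# the text: P(L0) prior, P(T) transition, P(G) guess, P(S) slip.

def bkt_update(p_know: float, correct: bool,
               p_guess: float, p_slip: float, p_transit: float) -> float:
    """Return the updated P(mastered) after one observed answer."""
    if correct:   # A(c): knew it and did not slip, vs. guessed correctly
        num = p_know * (1 - p_slip)
        den = num + (1 - p_know) * p_guess
    else:         # A(w): knew it but slipped, vs. did not know and did not guess right
        num = p_know * p_slip
        den = num + (1 - p_know) * (1 - p_guess)
    posterior = num / den
    # Chance of transitioning from not-yet-mastered to mastered after practice.
    return posterior + (1 - posterior) * p_transit

p = 0.3                                            # P(L0) for this K node
for answer_correct in [True, False, True, True]:   # traced answers A(c)/A(w)
    p = bkt_update(p, answer_correct, p_guess=0.2, p_slip=0.1, p_transit=0.15)
    print(round(p, 3))
```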

[0080] Fig. 3C illustrates an exemplary S-AOG and its associated T-AOGs, in accordance with an embodiment of the present teaching. In an example S-AOG 310 for tutoring the concept of fraction, each node corresponds to a topic or concept to be taught to a student user during a dialogue. For example, S-AOG 310 includes a node P0 or 310-1, representing the concept of fraction, a node P1 or 310-2, representing the concept of divide, a node P2 or 310-3, representing the concept of multiply, a node P3 or 310-4, representing the concept of add, and a node P4 or 310-5, representing the concept of subtract. In this example, different nodes in the S-AOG 310 are related. For instance, to master the concept of fraction, at least some of the other concepts on add, subtract, multiply, and divide may need to be mastered first. To teach a concept represented by an S-AOG node, e.g., fraction, there is a series of steps or a process that an agent may need to carry out in a dialogue session with a student user. Such a process or series of steps corresponds to a T-AOG. In some embodiments, for each node in an S-AOG, there may be multiple T-AOGs, each of which may represent a different way to teach a student and may be invoked in a personalized manner. As shown, S-AOG node 310-1 has a plurality of T-AOGs 320, one of which is illustrated as 320-1, corresponding to a series of temporal steps of questions/answers 330, 340, 350, 360, 370, 380, ..., etc. In each tutoring session to teach the concept of fraction, the choice of which T-AOG to use may vary and may be determined based on various considerations, e.g., the user in the session (personalized), the present level of mastery of the concept (e.g., P(L0)), etc.
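As an illustration of the Fig. 3C structure, node P0 with its related concept nodes and alternative T-AOGs might be encoded as plain data, with the teaching path chosen per user, e.g., by the prior-mastery estimate P(L0). The second T-AOG alternative and the selection rule are hypothetical; the text only labels T-AOG 320-1.

```python
# The Fig. 3C fragment as plain data: node P0 (fraction) with its related
# concept nodes and alternative teaching paths (T-AOGs 320).
s_aog = {
    "P0 fraction": {
        "related": ["P1 divide", "P2 multiply", "P3 add", "P4 subtract"],
        "t_aogs": ["320-1 question/answer drill",
                   "worked-examples-first (hypothetical alternative)"],
    },
}

def pick_t_aog(node: dict, p_l0: float) -> str:
    """Choose a teaching path per user, here by prior mastery P(L0)."""
    return node["t_aogs"][0] if p_l0 < 0.5 else node["t_aogs"][1]

print(pick_t_aog(s_aog["P0 fraction"], p_l0=0.3))  # low prior -> drill path
```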

[0081] An STC-AOG based representation of a dialogue captures entities/objects/concepts related to the dialogue (S-AOGs), possible actions observed during the dialogue (T-AOGs), and the impact of each of the actions on the entities/objects/concepts (C-AOGs). Actual dialogue activities occurring during the dialogue (speech) may cause traversal of the corresponding graph representations or STC-AOGs, resulting in parsed graphs (PGs) corresponding to traversed portions of the STC-AOGs. In some embodiments, an S-AOG may model a spatial decomposition of objects and scenes of the dialogue. In some embodiments, an S-AOG may model a decomposition of concepts and sub-concepts as discussed herein. In some embodiments, a T-AOG may model a temporal decomposition of events/sub-events/actions that may be performed or have occurred in a dialogue in connection with certain entities/objects/concepts represented in a corresponding S-AOG. A C-AOG may model a decomposition of events represented in a T-AOG and their causal implications with respect to corresponding entities/objects/concepts represented in an S-AOG. That is, a C-AOG describes a change to a node in an S-AOG caused by an event/action taken in a dialogue and represented in a T-AOG. Such information covers different aspects of a dialogue and is captured in the information state 110. That is, the information state 110 represents the dynamics of a dialogue between a user and a dialogue agent. This is illustrated in Fig. 3D.

[0082] As discussed herein, based on the actual dialogue sessions, specific paths traversed based on the conversations yield different types of corresponding parsed graphs (PGs). For instance, this applies to S-AOGs, yielding S-PGs; to T-AOGs, yielding T-PGs; and to C-AOGs, yielding C-PGs. That is, based on STC-AOGs, an actual dialogue leads to dynamic STC-PGs, which represent, at least partially, the different mindsets of parties involved in the dialogue sessions. To illustrate this, Figs. 4A-4C show an exemplary S-AOG/T-AOG associated with an agent’s mindset for teaching fraction-related concepts; Figs. 5A-5B provide an exemplary representation of a shared mindset via a T-PG yielded based on dialogues in a specific tutoring session; and Figs. 5C-5D show an exemplary representation of a user’s mindset in terms of estimated levels of mastery of different concepts being taught in dialogues with a dialogue agent.

[0083] Fig. 4A shows an exemplary representation of an agent’s mindset with respect to tutoring on fraction, in accordance with an embodiment of the present teaching. As discussed herein, a representation of an agent’s mindset may reflect what the agent intends to or is designed to cover in a dialogue. The agent’s mindset may adapt during a dialogue session based on the user’s performance/behavior, so its representation is to capture such dynamics or adaptation. The exemplary representation of an agent’s mindset, as illustrated in Fig. 4A, comprises various nodes, each of which represents a sub-concept in connection with the concept of fraction. For example, there are sub-concepts related to “understand fraction” 400, “compare fractions” 405, “understand equivalent fractions” 410, “expand and reduce equivalent fractions” 415, “find factor pairs” 420, “apply properties of multiplication/division” 425, “add fractions” 430, “find LCM” 435, “solve unknown in multiplication/division” 440, “multiply and divide within 100” 445, “simplify improper fractions” 450, “understand improper fraction” 455, and “add and subtract” 460. These sub-concepts may constitute the landscape of fraction, and some sub-concepts may need to be taught before others, e.g., “understand improper fraction” 455 may need to be covered prior to “simplify improper fractions” 450, “add and subtract” 460 may need to be mastered prior to “multiply and divide within 100” 445, etc.

[0084] Fig. 4B illustrates an exemplary T-AOG representing an agent’s mindset in teaching a concept related to fraction, in accordance with an embodiment of the present teaching. As discussed herein, a T-AOG includes various steps associated with a dialogue, some relating to what an agent says, some relating to what a user responds, and some corresponding to certain evaluations of the conversation performed by the agent. There are branches in the T-AOG representing decisions. For example, at 470, the corresponding action is for the agent to highlight the numerator and denominator boxes, which, e.g., may come after a lecture to a student on what numerators and denominators are. Following link 480, the agent proceeds to 490 to ask for user input, e.g., to tell the agent which highlighted box is the denominator. Based on the answer received from the student, the agent follows the two links combined by an OR (the plus sign), where each of the links represents one path the user takes. For instance, if the user answers correctly on which one is the denominator, the agent proceeds to 490-3 to, e.g., further ask the user to evaluate the denominator. If the user answers incorrectly, the agent proceeds to 490-4 to provide a hint to the user on the denominator and then follows link 490-2 to go back to 490, asking the user for input again on which one is the denominator.

[0085] If an agent asks the user to evaluate the denominator at 490-3, there are two associated outcomes, one being a wrong answer and one being a correct answer. The former leads to 490-5, at which the agent indicates to the user that the answer is not correct and then follows link 490-1 to go back to 490, asking again for the user’s input. If the answer is correct, the agent follows the other path to proceed to 490-6, letting the user know that he/she is correct, and continues along the path to further set denominators and clear the highlight at 490-7 and 490-8, respectively. As can be seen, the steps at 490 represent temporal actions that the agent plans to take in teaching the concept of a denominator, in connection with the S-AOG in Fig. 4A, which represents the concepts that the agent plans to cover in teaching a student the concept of fraction. Hence, they together form a part of a representation of the agent’s mindset. Fig. 4C shows exemplary dialogue content authored for teaching a concept associated with fraction, in accordance with an embodiment of the present teaching. With a similar dialogue policy, the conversation is intended to be carried out in a question-answer flow.

[0086] Fig. 5A illustrates an exemplary representation of a shared mindset in the form of a T-PG (corresponding to a path in the T-AOG in Fig. 4B), in accordance with an embodiment of the present teaching. The highlighted steps form a specific path taken in a dialogue, reflecting actions carried out by a dialogue agent based on the answers from the user. Compared with the T-AOG shown in Fig. 4B, what is shown in Fig. 5A is a T-PG comprising various highlighted steps, e.g., 470, 510, 520, 530, 540, 550, ..., along the highlighted path. The T-PG as shown in Fig. 5A represents an instantiated path traversed based on both the agent’s and the user’s actions and thus represents a shared mindset. Fig. 5B illustrates a part of the authored dialogue content between an agent and a user based on which a representation of a shared mindset may be obtained, in accordance with an embodiment of the present teaching. As discussed herein, a representation of a shared mindset may be derived based on the flow of a dialogue, forming a particular path or T-PG traversed along an existing T-AOG.

[0087] As discussed herein, during a dialogue, a dialogue agent estimates the mindset of a user engaged in the dialogue based on observations of the conversation with the user, so that both the dialogue and the representation of the estimated user’s mindset are adapted based on the dynamics of the dialogue. For example, in order to determine how to proceed with a dialogue, the agent may need to assess or estimate, based on observations of the user, the user’s level of mastery with respect to a specific topic. The estimation may be probabilistic, as discussed with respect to Fig. 3B. Based on such probabilities, the agent may infer a current level of mastery of a concept and determine how to further conduct the conversation, e.g., either continue to tutor on the current topic if the estimated level of mastery is not adequate, or move on to another concept if the estimated level of the user’s mastery of the current concept suffices to do so. The agent may assess regularly in the course of a dialogue and annotate (parameterize) a PG along the way to facilitate the decision of a next move in traversing a graph. Such annotated or parameterized S-AOGs may yield S-PGs, indicating, e.g., which nodes in the S-AOGs have been adequately covered and which ones have not. Fig. 5C depicts an exemplary S-PG of a corresponding S-AOG, representing an estimated user’s mindset, in accordance with an embodiment of the present teaching. The underlying S-AOG is illustrated in Fig. 4A. In this illustrated example, during a dialogue, each node in this S-AOG is assessed based on the conversation and is parameterized or annotated based on such assessment. As shown in Fig. 5C, nodes representing different sub-concepts related to fraction are annotated with respective parameters indicating, e.g., a level of mastery of the corresponding nodes.

[0088] As shown in Fig. 5C, the nodes in the initial S-AOG (in Fig. 4A) are now annotated with different weights, each of which indicates an assessed level of mastery of the sub-concept for that corresponding node. As seen, the nodes in Fig. 5C are presented in different shades determined by the weights, representing different degrees of mastery of the underlying sub-concepts. For example, nodes that are now dotted may correspond to those sub-concepts that have been mastered, so no further traversal is needed. Nodes 560 and 565 (corresponding to “understand fraction” and “understand improper fraction”) may correspond to sub-concepts that have not reached a required mastery level. All nodes connected to these two nodes that are in-between, i.e., between mastered and not-yet-mastered, may be considered contributing reasons that the user still has not yet mastered the concepts of fraction and improper fraction.

[0089] The mastery levels estimated as such for respective nodes in an original S-AOG yield an annotated S-PG, representing an estimated user’s mindset that indicates degrees of understanding of the concepts associated with such nodes. This provides a basis for a dialogue agent to understand the relevant landscape about a user, e.g., what the user has already understood of what was taught and what the user still has problems with. As seen, the representation of a user’s mindset is estimated dynamically based on, e.g., the user’s performance and activities during an on-going dialogue. In addition to estimating the mastery levels of concepts associated with different nodes to understand the mindset of the user, contextual observations and information about a user collected during the on-going dialogue may also be used to estimate other characteristics or behavior indicators of the user as part of understanding the mindset of the user. Fig. 5D shows exemplary types of personality traits of a user that can be estimated based on information observed in a conversation, in accordance with an embodiment of the present teaching. As illustrated, in a conversation with a user, based on observations of the user’s behavior or expressions, whether in oral, visual, or physical form, an agent may, via multimodal information processing (e.g., by the multimodal information processor 170), estimate, in different dimensions, various characteristics of the user in terms of, e.g., whether the user has an outgoing personality, how mature the user is, whether the user is mischievous, whether the user easily gets excited, whether the user is generally cheerful, how confident or secure the user feels about him/herself, whether the user is reliable, rigorous, etc. Such information, once estimated, forms a profile of the user, which may influence how the dialogue system 100 adapts its dialogue strategy when needed and in what manner its agent should conduct a dialogue with a user.

[0090] Both S-AOGs and T-AOGs may have certain structures, organized based on, e.g., topics, concepts, or the flow of the conversation. Fig. 6A depicts an exemplary generic structure of an S-AOG related to a tutoring dialogue, in accordance with an embodiment of the present teaching. Instead of being subject-matter specific, the general structure shown in Fig. 6A may be used for teaching any subject matter. The exemplary structure comprises different stages involved in a tutoring dialogue, represented as different nodes in the S-AOG. As illustrated, node 600 is for a dialogue related to a greeting, node 605 for a chitchat about, e.g., weather or health, node 610 for a dialogue to review previously learned knowledge (e.g., as a basis of teaching the subject matter intended), node 615 for teaching the subject matter intended, node 620 for testing a student user on the subject matter taught, and node 625 for a dialogue for evaluating the mastery level of a student user on the subject matter taught based on the testing. Different nodes may be connected in a way encompassing different flows among underlying sub-dialogues, but the specific flow in each dialogue may be dynamically determined based on the situation. Some branches out of a node may be related via an AND relationship and some branches out of a node may be related via an OR relationship.

[0091] As seen in Fig. 6A, a dialogue related to tutoring may start with the greeting dialogue 600, such as “Good morning,” “Good afternoon,” or “Good evening.” There are three branches out of the node 600, including to node 605 for a brief chitchat, to node 610 for a review of previously learned knowledge, and to node 615 to start teaching directly. These three branches are ORed together, i.e., a dialogue agent may proceed to follow any of the three branches. After the chat session 605, there are also three branches, one to the teaching node 615, one to the testing node 620, and one to the review node 610. The review node 610 also has two branches, one to the teaching node 615 and the other to the testing node 620 (the student may be tested before teaching to gauge prior knowledge or a prior level of mastery of the subject matter). In this illustrated embodiment, teaching and testing nodes are required dialogues, so the branches from nodes 605 and 610 to the teaching and testing nodes 615 and 620 are related by AND.

[0092] Teaching and testing may be iterated, as evidenced by the bidirectional arrows between nodes 615 and 620. Either the teaching node 615 or the testing node 620 may proceed to the evaluation node 625, as needed. That is, an evaluation may be carried out based on either the teaching result from node 615 or a testing result from node 620. Based on an evaluation result, the dialogue may proceed to one of several alternatives (related by OR), including teaching 615 (to go over the concept again), testing 620 (re-testing), review 610 (to strengthen a user’s understanding of some concepts), or even chat 605 (e.g., if the user is found to be frustrated, the dialogue system 100 may switch topics in order to continue to engage the user rather than losing the user). This generic S-AOG for a tutoring-related dialogue is provided as an illustration rather than a limitation. An S-AOG for tutoring may be derived according to any logic flows as needed by an application.

[0093] As seen in Fig. 6A, each of the nodes is itself a dialogue and, as discussed herein, may be associated with one or more T-AOGs, each representing a flow of conversation directed to the subject matter intended. Fig. 6B depicts an exemplary T-AOG with dialogue content authored for S-AOG node 600 on greeting, in accordance with an embodiment of the present teaching. A T-AOG may be defined as a dialogue policy for a dialogue; following the steps defined in a T-AOG carries out the policy to achieve some intended purposes. In Fig. 6B, content in each rectangular box represents what is to be spoken by an agent and content in an ellipse represents what a user responds. As seen in the T-AOG for greeting illustrated in Fig. 6B, the agent first says one of three alternative greetings, i.e., good morning 630-1, good afternoon 630-2, or good evening 630-3. Responses from a user to such a greeting may differ. For example, a user may repeat what the agent said (i.e., good morning, good afternoon, or good evening). Some may repeat and then add “to you, too,” 635-1. Some may say “Thank you, and you?” at 635-2. Some may say both 635-1 and 635-2. Some may simply remain silent 635-3. There may be other alternative ways to respond to the agent’s greeting. Upon receiving a response from a user, a dialogue agent may then answer the user’s response, with a corresponding answer for each of the alternative responses from the user. This is illustrated by the content at 640-1, 640-2, and 640-3 in Fig. 6B.

[0094] The T-AOG shown in Fig. 6B may encompass multiple T-AOGs. For instance, 630-1, 635-2, and 640-2 in Fig. 6B may constitute one T-AOG for greeting. Similarly, 630-1, 635-1, and 640-1 may correspond to another T-AOG for greeting; 630-2, 635-1, and 640-1 may form another one; 630-1, 635-3, and 640-3 form a different one; 630-2, 635-3, and 640-3 may form yet another, etc. Although different, these alternative T-AOGs all have substantially similar structures and generic content. Such commonality may be utilized to generate a simplified T-AOG with flexible content associated with each node. This can be achieved via, e.g., graph matching. For instance, the above-mentioned different T-AOGs related to greetings, although authored with different greeting content, all have a similar structure, i.e., an initial greeting, plus a response from a user, plus a response to the user’s response to the greeting. In this sense, the T-AOG in Fig. 6B may not correspond to the most simplified generic T-AOG for greeting.

[0095] To facilitate flexible dialogue content and enable the dialogue system 100 to adapt a dialogue in a personalized manner, AOGs may be parameterized. Such parameterization may be applied to both S-AOGs and T-AOGs, in terms of both parameters associated with nodes in the AOGs and parameters associated with links between different nodes, according to different embodiments of the present teaching. Fig. 6C shows different exemplary types of parameterization, in accordance with an embodiment of the present teaching. As illustrated, parameterized AOGs include parameterized S-AOGs and T-AOGs. For a parameterized S-AOG, each of its nodes may be parameterized with, e.g., a reward representing the return obtained by covering the subject matter or topic/concept associated with the node. In the context of tutoring, the higher the reward associated with a node in an S-AOG, the more valuable it is for the agent to teach the concept associated with the node to a student user. Conversely, if a student user is already familiar with the concept associated with a node in the S-AOG (e.g., has already mastered the concept), a lower reward is assigned to the node because there is little further benefit in teaching the student the associated concept. Such rewards associated with nodes may be updated dynamically during the course of a tutoring dialogue. This is illustrated in Fig. 6D, where the S-AOG 310 includes nodes associated with relevant math concepts related to fraction. As seen, each node representing a relevant concept is parameterized with a reward, estimated to indicate whether there is a return in teaching the student that concept.

[0096] Each node in an S-AOG may have different branches, and each branch leads to another node associated with a different topic. Such branches may also be associated with parameters, such as probabilities of taking the respective branches, as represented in Fig. 6C. Parameterizing paths in an AOG is also illustrated in Fig. 6D. Teaching towards fraction may require building up the knowledge starting from add and subtract and then multiplication and division. Along each of the connections among different concepts, there are probabilities of moving from one to the other. For instance, as illustrated, from the “add” node 310-4 to the “subtract” node 310-5, the parameterized probability Pa,s may indicate the likelihood of success in teaching a student to understand the concept of “subtract” if the concept of “add” is taught first. Conversely, probability Ps,a may indicate the likelihood of success in teaching the student to understand add if subtract is taught first. As another illustration, the links from “add” to “multiplication”/“division” are parameterized with probabilities Pa,m and Pa,d, respectively. Similarly, the links from “subtract” to “multiplication”/“division” are parameterized with probabilities Ps,m and Ps,d, respectively. With such probabilities, a dialogue agent may select an optimized path by maximizing the probability of success in teaching the intended concepts in an order that may work better. Such probabilities may also be updated dynamically based on, e.g., observations from the dialogue. In this manner, the best course of action in teaching a student may be adapted in real time based on individual situations.
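
As a concrete illustration of this kind of probability-driven ordering, the sketch below scores candidate teaching orders by the product of the pairwise success probabilities along each order. The probability values and helper names are illustrative assumptions for the example, not values from the present teaching.

```python
# A minimal sketch (assumed values, not from the patent) of selecting a
# teaching order by maximizing the product of pairwise success probabilities.
from itertools import permutations
from math import prod

# P[(x, y)]: estimated probability of successfully teaching y after x.
P = {
    ("add", "subtract"): 0.9, ("subtract", "add"): 0.7,
    ("add", "multiply"): 0.6, ("subtract", "multiply"): 0.8,
    ("add", "divide"): 0.5,   ("subtract", "divide"): 0.6,
    ("multiply", "divide"): 0.85, ("divide", "multiply"): 0.55,
    ("multiply", "subtract"): 0.4, ("divide", "subtract"): 0.3,
    ("multiply", "add"): 0.4, ("divide", "add"): 0.3,
}

def order_score(order):
    # Likelihood of success for a whole order: product over consecutive pairs.
    return prod(P.get(pair, 0.0) for pair in zip(order, order[1:]))

concepts = ["add", "subtract", "multiply", "divide"]
best = max(permutations(concepts), key=order_score)
print(best, order_score(best))
```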

[0097] Parameterization may also be applied to T-AOGs, as indicated in Fig. 6C. As discussed herein, a T-AOG represents a dialogue policy directed to a specific topic. Each node in a T-AOG represents a specific step in a dialogue, often relating to what a dialogue agent is to say to a user, what a user is going to respond to the dialogue agent, or an evaluation of the conversation. As discussed herein, it is frequently the case that the same thing may be said in different ways, and any way of saying it ought to be recognized as conveying the same thing. Based on this observation, content associated with nodes in a T-AOG may be parameterized. This is illustrated in Fig. 6E, according to an embodiment of the present teaching. As shown in Fig. 6B, there are different ways to carry on a greeting dialogue. Even for such a simple subject, there may be many different ways to say pretty much the same thing. Content of a greeting dialogue may be parameterized in a more simplified T-AOG. Fig. 6E shows an exemplary parameterized T-AOG that corresponds to the T-AOG illustrated in Fig. 6B. The initial greeting is now parameterized as “Good [ _ ]!” 650, where the content in the brackets is parameterized with possible instantiations of “morning,” “afternoon,” and “evening.” An answer from a user responding to that initial greeting is now classified into two situations, one with a verbal answer 655-1 and the other without a verbal answer, i.e., silence 655-2. A verbal answer 655-1 may be parameterized with different choices of content to respond to the initial greeting, as shown in the braces associated with 655-1. That is, anything included in the parameterized set 655-1 may be recognized as a possible answer from a user responding to the initial greeting from the agent. Similarly, to respond to a user’s answer, the content for such a response from an agent at 660-1 may also be parameterized to be a set of all possible responses. The response content 660-2 by an agent in the event of silence from the user may also be similarly parameterized.
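
For illustration only, the following sketch shows one way such a parameterized greeting node might be represented: a template with an instantiation slot, plus sets of recognized user answers and agent responses. The class-free structures and example phrases are assumptions for this sketch, not structures defined by the present teaching.

```python
# A hedged sketch (assumed structures) of a parameterized T-AOG greeting node
# as in Fig. 6E: a template slot plus sets of recognized alternative contents.
import random

GREETING_TEMPLATE = "Good {part_of_day}!"
PART_OF_DAY = {"morning", "afternoon", "evening"}

# Recognized verbal answers (set 655-1) and agent responses (set 660-1).
USER_ANSWERS = {"good morning", "to you, too", "thank you, and you?"}
AGENT_RESPONSES = {"Great to hear!", "I am doing well, thank you!"}

def greet(part_of_day: str) -> str:
    assert part_of_day in PART_OF_DAY
    return GREETING_TEMPLATE.format(part_of_day=part_of_day)

def respond(user_utterance: str) -> str:
    # Any member of the parameterized answer set is recognized; silence is
    # handled as its own branch (655-2 / 660-2 in Fig. 6E).
    if user_utterance.lower() in USER_ANSWERS:
        return random.choice(sorted(AGENT_RESPONSES))
    return "Are you there? Good day to you!"

print(greet("morning"))
print(respond("Thank you, and you?"))
```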

[0098] Another example of parameterizing content associated with nodes in a T-AOG is illustrated in Fig. 6F. This example involves a T-AOG for testing a student on the concept of “add.” As shown, a T-AOG for such testing may comprise the steps of presenting a problem (665), asking the student for an answer (667), the student providing the answer (670-1 or 675-1), responding to the user’s answer (670-2 or 675-2), and then evaluating the reward associated with the S-AOG node for “add” (677). For each node of the T-AOG in Fig. 6F, the content associated therewith is parameterized. For instance, for step 665, the parameters involved include X, Y, and Oi, where X and Y are numeric values and Oi refers to an object of type i. By instantiating specific values for these parameters, many questions may be formed. In the example in Fig. 6F, the first step at 665 of the testing is to present X objects of type 1 (“o1”) and Y objects of type 2 (“o2”), where X and Y are instantiated with numbers (3 and 4) and “o1” and “o2” may be instantiated with types of objects (such as apple and orange). Based on such an instantiation of parameters, a specific testing question can be generated. At 667 of the T-AOG in Fig. 6F, to test a student, a dialogue agent is to ask the user for the sum of X objects of type “o1” and Y objects of type “o2.” When X, Y, o1, and o2 are instantiated with specific values, e.g., X=3, Y=4, o1=apple, and o2=orange, the test problem may be presented as “3 apples, 4 oranges” (or even a picture of the same), and a testing question may be asked by instantiating the parameterized question to inquire about the sum X+Y, e.g., “how many fruits are there?” or “can you tell me the total number of fruits?” In this manner, flexible testing questions may be generated under the generic and parameterized T-AOG. Parameterized testing questions may also facilitate generating an expected correct answer. In this example, as X and Y are respectively instantiated as 3 and 4, the expected correct answer to the testing question on the sum may be dynamically generated as X+Y=7. Such a dynamically generated expected correct answer may then be used to evaluate an answer from a student user in response to the question. In this manner, a T-AOG may be parameterized with a simpler graphical structure, yet at the same time enable a dialogue agent to flexibly configure different dialogue content in a parameterized framework to carry out the intended task.
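
As a rough sketch of the instantiation just described, the code below fills the X, Y, and object-type parameters of a test-question template and derives the expected answer from the same instantiation. The template wording and the simple digit-extraction classifier are assumptions for the example.

```python
# A minimal sketch (illustrative, not the patent's implementation) of
# instantiating the parameterized test question at nodes 665/667 of Fig. 6F
# and deriving the expected correct answer from the same instantiation.
import random

QUESTION_TEMPLATES = [
    "There are {x} {o1}s and {y} {o2}s. How many fruits are there?",
    "With {x} {o1}s and {y} {o2}s, can you tell me the total number of fruits?",
]

def make_test_item(x: int, y: int, o1: str, o2: str):
    question = random.choice(QUESTION_TEMPLATES).format(x=x, y=y, o1=o1, o2=o2)
    expected = x + y  # expected correct answer generated dynamically
    return question, expected

question, expected = make_test_item(3, 4, "apple", "orange")
print(question)  # e.g., "There are 3 apples and 4 oranges. ..."
print(expected)  # 7

def evaluate(answer: str, expected: int) -> str:
    # Classify the answer as in Fig. 6F: no number -> not-know; otherwise
    # compare the extracted number against the expected correct answer.
    digits = "".join(ch for ch in answer if ch.isdigit())
    if not digits:
        return "not-know"
    return "correct" if int(digits) == expected else "incorrect"

print(evaluate("I think 7", expected))  # "correct"
```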

[0099] As discussed herein, when parameterized content is instantiated, a dialogue agent may also dynamically derive a basis for evaluating the test. In this example, an expected correct answer of 7 is formed based on the instantiation of X=3 and Y=4. When an answer is received, the answer may be classified as a not-know answer (e.g., when the user either did not respond at all or the answer does not contain a number) at 670-1, or as an answer with a number (either a correct or an incorrect number) at 675-1. A response to an answer may also be parameterized. Either a not-know answer or an incorrect answer may be considered a not-correct answer, which can be responded to at 670-2 using a parameterized response.

[00100] A response to a not-correct answer may be further classified into different situations and, in each situation, the response content may be parameterized with content suitable for that classification. For example, when a not-correct answer is an incorrect total, the response may be parameterized to address the incorrect answer, which can be due to a mistake or a guess. If a not-correct answer arises because the user simply does not know, e.g., did not answer at all, the response may be parameterized to address that situation with appropriate response alternatives. Similarly, the response to a correct answer may also be parameterized to address whether it is indeed a correct answer or is estimated to be a lucky guess.

[00101] As shown in Fig. 6F, after a dialogue agent responds to an answer (or lack thereof) from the user in the different situations, the T-AOG includes a step at 677 for evaluating the current reward for mastery of the “add” concept. Upon the evaluation, the process may return to 665 to test the student on more problems. It may also proceed to “teaching” if the evaluation reveals that the student did not quite understand the concept, or exit if the student is viewed as having already mastered the concept. In some situations, the process may also go to exceptions if, e.g., it is detected that the student simply cannot do the work, so that the system may consider switching topics temporarily, as discussed with respect to Fig. 6A.

[00102] As shown in Fig. 6F, a T-AOG may be created with parameterized content associated with different nodes. The parameterized content associated with each node represents what is expected to be said/heard during a dialogue. Given the common recognition that there can be alternative ways to express the same thing, the more alternative content included in the parameterized content associated with each node, the more capably a parameterized T-AOG supports flexible dialogues. Figs. 6G-6H show a scheme of parameterizing content associated with nodes in a T-AOG, in accordance with an embodiment of the present teaching. Fig. 6G presents the T-AOG seen in Fig. 6F, except that each node in Fig. 6G is now associated with one or more content sets that may be spoken by an agent or heard from a user during a dialogue. Such parameterized content is also to be used to train ASR and/or NLU models to understand the utterances to be spoken.

[00103] A parameterized content set associated with a node represents what may possibly occur at the node. For example, as illustrated in Fig. 6G, various nodes are parameterized with their respective corresponding content sets. As illustrated, node 665 is associated with two content sets, [TN] and [To], wherein the former is for numbers (X and Y can be the data items included therein) and the latter is for objects o1 or o2 (e.g., apple, orange, pear, etc.); node 667 is associated with content set [Ti] for inquiry sentences; node 670-1 is associated with content set [TNKA] for not-know answers from a user; node 675-1 is associated with content set [TN] for a user’s answer with a number; node 670-2 is associated with content set [TRNCA] for responses to a not-correct answer (which are either alternative responses to a not-know answer or alternative responses to an incorrect answer); node 675-2 is associated with content set [TRCA] for responses to a correct answer; node 672-1 is associated with content set [TRNK] for responses to not-know answers from a user; node 672-2 is associated with content set [TRI] for responses to incorrect answers from a user; node 672-3 is associated with content set [TRC] for responses to correct answers from a user; node 674-1 is associated with content set [TRM] for responses to mistaken answers from a user; and node 674-2 is associated with content set [TRG] for responses to guessed answers from a user.

[00104] Fig. 6H illustrates exemplary data in the different content sets associated with different nodes of the T-AOG in Fig. 6G, in accordance with an embodiment of the present teaching. As shown, e.g., [TN] may be a set of any single or multiple-digit numbers; [To] may include names of alternative objects; [Ti] may include alternative ways to ask for a total of two numbers; [TNKA] may include alternative ways for a user to say “I don’t know”; [TRC] may include alternative ways to respond to a correct answer; [TRNK] may include alternative ways to respond to a not-know answer from a user; [TRM] may include alternative ways to respond to a mistaken answer from a user; [TRG] may include alternative ways to respond to a guessed answer from a user; [TRI] may include alternative ways to respond to an incorrect answer from a user, which can include alternative responses to a mistaken answer [TRM] or alternative responses to a guessed answer [TRG]; [TRNCA] may include alternative ways to respond to a not-correct answer from a user, which can include alternative responses to a not-know answer [TRNK] or alternative responses to an incorrect answer [TRI]; and [TRCA] may include alternative ways to respond to a correct answer from a user, which can include alternative responses to a correct answer [TRC] or alternative responses to a guessed answer [TRG]. The content sets for responses (e.g., [TRC], [TRNK], ..., [TRCA]) may be used for generating a robot agent’s responses to a user’s answer. As discussed herein, in addition to being used to describe possible utterances, such content sets can also be used as training data to train ASR/NLU to understand an utterance.
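
One plausible, purely illustrative realization of such content sets is a mapping from set names to alternative utterances, from which a response is drawn at generation time and which doubles as training text for ASR/NLU. The example phrases below are assumptions, not the actual sets of Fig. 6H.

```python
# A hedged sketch of content sets keyed by the names used in Figs. 6G-6H.
# The example phrases are illustrative assumptions only.
import random

CONTENT_SETS = {
    "TNKA": ["I don't know", "no idea", "not sure"],
    "TRC":  ["That's right!", "Correct, well done!"],
    "TRNK": ["That's okay, let's work it out together."],
    "TRM":  ["Almost there. Check your addition once more."],
    "TRG":  ["Good try! Can you explain how you got that?"],
}
# Composite sets reference their member sets, as described for [TRI]/[TRNCA]/[TRCA].
CONTENT_SETS["TRI"] = CONTENT_SETS["TRM"] + CONTENT_SETS["TRG"]
CONTENT_SETS["TRNCA"] = CONTENT_SETS["TRNK"] + CONTENT_SETS["TRI"]
CONTENT_SETS["TRCA"] = CONTENT_SETS["TRC"] + CONTENT_SETS["TRG"]

def generate_response(set_name: str) -> str:
    # Draw one alternative from the named content set.
    return random.choice(CONTENT_SETS[set_name])

print(generate_response("TRNCA"))
# The same lists can serve as training utterances for ASR/NLU models.
```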

[00105] Such parameterized content sets associated with different nodes in a T-AOG may significantly improve a robot agent’s capability to express the same response in different ways that are suitable to different users, providing flexibility in generating a personalized response to a user in different situations. In some embodiments, variations may be automatically generated not only in terms of language variations but also in terms of linguistic stylistic considerations. Language variations may be due to different spoken languages, alternative expressions, etc. Different stylistic considerations may include different acoustic characteristics, e.g., accent, pitch, volume, speed, etc. With respect to alternative expressions, for each text string, there may be linguistically and semantically equivalent expressions derived based on, e.g., synonyms or slang. Such variations may be collected and stored in databases and indexed based on content categories (e.g., greetings, responses to not-know answers, responses to correct answers, etc.) so that they can be retrieved for parameterizing nodes for similar content categories.

[00106] As discussed herein, while a T-AOG corresponds to a dialogue policy dictating alternative possible flows of a conversation, an actual conversation may traverse only a part of the T-AOG by following a particular path in the T-AOG. Different users, or the same user in different dialogue sessions, may yield different paths embedded in the same T-AOG. Such information may be useful in allowing the dialogue system 100 to personalize dialogues by parameterizing links along different pathways with respect to different users, and such parameterized paths may represent what works and what does not work well with respect to each user. For example, for each link between two nodes in a T-AOG, a reward of the link may be estimated with respect to the performance of each student user towards understanding of the underlying concept taught. Such a path-centric reward may be computed based on probabilities associated with different branches with respect to each node along the path. Fig. 6I illustrates a T-AOG associated with a user, with different paths between different nodes parameterized with rewards updated based on dynamic information observed in conversations with the user, in accordance with an embodiment of the present teaching. In this exemplary parameterized T-AOG (which is similar to that presented in Fig. 6F), after presenting, at 680, X objects of type 1 and Y objects of type 2 to a student user, the agent inquires, at 685, about the total of X+Y. Based on prior teaching or testing of the same user, there may be estimated likelihoods as to how the student will do this round, i.e., each possible outcome (690-1, 690-2, and 690-3) has an associated reward R11, R12, and R13, respectively.

[00107] If the answer from the student is incorrect (690-2), there may be different ways to respond, e.g., 695-1, 695-2, and 695-3. Based on past experience or the known personality of the user (again estimated in a personalized manner), there may be different reward scores associated with each possible response, R22, R23, and R24, respectively. For instance, if it is known that the user is sensitive and works better with encouragement or a positive manner, the reward associated with response 695-2 may be the highest. In this case, the dialogue system 100 may select to respond to an incorrect answer with a response 695-2 that is more positive and encouraging. For example, the dialogue agent may say “Almost there. Think again.” A different user may prefer not being told of a mistake; in this case, the reward R22 linked to response 695-1 for that user may be the highest as compared with R23 and R24. Such reward scores associated with alternative paths of a T-AOG are personalized based on what is known about a particular user and/or past dealings with the user. With AOGs configured with parameters with respect to both nodes and paths, the dialogue system 100 may dynamically configure and update, during each dialogue, the parameters to personalize the AOGs so that dialogues may be conducted in a flexible (content is parameterized), personalized (parameters are computed based on personalized information), and, hence, more productive manner.
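
To make the selection step concrete, here is a small sketch that picks the response branch with the highest personalized reward for the current user. The reward numbers, user identifiers, and response strings are invented for illustration.

```python
# A hedged sketch: choosing among response branches 695-1..695-3 by the
# user's personalized path rewards (values here are invented examples).
RESPONSES = {
    "695-1": "That's not right. The total is different.",
    "695-2": "Almost there. Think again.",
    "695-3": "Let's count them together, one by one.",
}

# Per-user reward scores (e.g., R22, R23, R24), learned from past dialogues.
USER_REWARDS = {
    "sensitive_user": {"695-1": 0.2, "695-2": 0.9, "695-3": 0.6},
    "direct_user":    {"695-1": 0.8, "695-2": 0.5, "695-3": 0.3},
}

def select_response(user_id: str) -> str:
    rewards = USER_REWARDS[user_id]
    best_branch = max(rewards, key=rewards.get)
    return RESPONSES[best_branch]

print(select_response("sensitive_user"))  # "Almost there. Think again."
print(select_response("direct_user"))
```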

[00108] As discussed herein and shown in Fig. 2A, the information state 110 is represented based on not only AOGs but also a variety of types of information, such as dialogue context 260, dialogue history 250, user profile 290, event-centric knowledge 270, and commonsense models 280. Representations of the different mindsets 200-220 are determined based on dynamically updated AOGs as well as other information from 250-290. For example, although AOGs are used to represent different mindsets, their corresponding PGs (the outcome of traversing the AOGs based on the dialogues) are generated based on the actual traversal (nodes and paths) in the AOGs and the dynamic information collected during the dialogue. For instance, values of parameters associated with nodes/links in AOGs may be dynamically estimated based on, e.g., the on-going dialogue, the dialogue history, the dialogue contexts, user profiles, events that occurred during dialogues, etc. Given that, to update the information state 110, different types of information, such as knowledge about events, surroundings, and the user’s characteristics and activities, may be traced, and such traced knowledge may then be used to update different parameters and ultimately the information state 110.

[00109] As discussed herein, AOGs/PGs are used to represent different mindsets, including a robot agent’s mindset (designed in terms of what is intended to be accomplished), a representation of a shared mindset between a robot agent and a user (derived based on the actual dialogue that occurred), and a mindset of the user (estimated based on the dialogue that occurred and the performance of the user in the dialogue). When AOGs and PGs are parameterized, values of the parameters associated with nodes and links may be evaluated based on, e.g., information related to the dialogue, the performance of the user, the characteristics of the user, and optionally event(s) that occurred during the dialogue. Based on such dynamic information, representations of such mindsets can be updated over time during a dialogue based on the changing situations.

[00110] Fig. 7A depicts a high level system diagram of a knowledge tracing unit 700 for tracing information and updating rewards associated with nodes/paths in AOGs/PGs, in accordance with an embodiment of the present teaching. As discussed herein, nodes in an AOG may be respectively parameterized with state related rewards, and paths in PGs may also be respectively parameterized with path related rewards or utilities. A reward/utility associated with a state or a node in an AOG may represent a level of mastery of the concept associated with the node. The higher the level of mastery with respect to the concept of an AOG node, the lower the associated state reward/utility of the node, i.e., the reward/utility of teaching a concept that has already been mastered is rather low. Rewards/utilities are personalized and obtained based on an assessment of a user’s performance, and such an assessment may be done continuously on-the-fly during a dialogue or at regular intervals.

[00111] As discussed herein, each S-AOG node associated with a concept (e.g., to be taught in tutoring) may have one or more T-AOGs, each of which may correspond to a specific way to teach the concept. A parsed path or PG is formed based on the nodes and links in a T-AOG traversed during a dialogue. A reward/utility associated with a path in a T-AOG or T-PG may represent a likelihood that this path will lead to a successful tutoring session or a successful mastery of the concept. Given that, the better the assessed performance of a user when traversing along a path, the higher the path reward/utility associated with the path. Such path related rewards/utilities may also be determined based on the performances of a plurality of users, which indicate statistically what teaching style works better over a group of users. Such estimated rewards/utilities along different branching paths may be especially helpful in a dialogue session when determining which branching path to take to continue a dialogue, and may guide a dialogue agent to select a path that has a statistically better chance of leading to a better performance, i.e., quicker achievement of the level of mastery of the concept.

[00112] The illustrated embodiment shown in Fig. 7A is directed to tracing state and path rewards during a dialogue. In this illustrated embodiment, rewards associated with nodes and paths are determined based on different probabilities estimated from the dynamic situations of the dialogue. For instance, for a node in an S-AOG associated with, e.g., the concept of “add,” its reward is a state based reward representing whether there is a return or reward in teaching the concept “add” to a specific user. For each student user who signs up to learn math from a robot agent, the reward value for each node of an S-AOG on math concepts is adaptively computed. The reward for each node in such an S-AOG (e.g., for the math concept “Add”) may be assigned an initial reward value, and the reward value may continue to change as the user goes through a dialogue dictated by an associated T-AOG (flows of conversation on the concept “Add”). During the dialogue dictated by the T-AOG, the robot agent may ask questions to the user, who then answers the questions. The answers from the user may be continuously evaluated, and probabilities are estimated as to whether the user is learning or making progress. Such probabilities, estimated while traversing the T-AOG, may be used to estimate the reward associated with the node in the S-AOG (i.e., the node representing the concept “Add”), indicating whether the user has mastered the concept. That is, the reward value associated with a node representing the concept is updated during the dialogue. If the teaching is successful, the reward may drop to a low value, indicating that there is no further value or reward in teaching the student the concept because the student has already mastered it. As seen, such state based rewards are personalized because they are computed based on the performance of each user in the dialogues.

[00113] There are also rewards associated with different paths in a T-AOG, which comprises different nodes, each of which may have multiple branches representing alternative pathways. Choices of different branches lead to different traversals of the underlying T-AOG, and each traversal yields a T-PG. In a tutoring application, to track the effectiveness of tutoring, at each node of a T-AOG (being traversed during a dialogue), different branches may be associated with respective measurements that indicate the likelihoods of achieving the intended goal when the respective branches are selected. The higher the measurement associated with a branch, the more likely it will lead to a path that fulfills the intended purpose. However, optimizing the selection of a branch out of each individual node may not lead to an overall optimal path. In some embodiments, instead of optimizing the individual selection of a branch at each node, optimization may be performed based on a path, i.e., the optimization is performed with respect to a path (of a certain length). In operation, such path based optimization may be implemented as a look-ahead operation, i.e., considering the next K choices along a path to determine the best choice at the current branch with respect to a current node. Such a look-ahead operation bases the selection of a branch on a compound measurement along each possible path, determined based on measurements accumulated over the links from the current node along each possible path. The depth of the look-ahead may vary and may be determined based on application needs. The compound measurements associated with all the alternative paths (stemming from the current node) may be referred to as path based rewards. A choice of a branch from the current node may then be made by maximizing the path based reward over all possible traversals from that current node.

[00114] The rewards along paths of a T-AOG may be determined based on a number of probabilities determined from the performance of a user observed during a dialogue. For example, at a current node in the T-AOG, a dialogue agent may ask a question to a student user and then receive an answer from the user in response to the question, where the answer corresponds to a branch stemming from the node for the question in the T-AOG. A measurement relating to a reward for the branch may then be estimated based on the probabilities. Such measurements, and hence the path based rewards, are personalized because they are computed based on personal information observed from a dialogue involving a specific user. The measurements associated with different branches along a path in a T-AOG (associated with an S-AOG node) may be used to estimate the reward of the S-AOG node as to the level of mastery of the student. The rewards, including both node based and path based rewards, may constitute “utilities” or preferences of the user and can be used by a robot agent to adaptively determine how to proceed with a dialogue in utility-driven dialogue planning. This is shown in Fig. 7B, which shows how knowledge can be traced to enable the dialogue system 100 to adapt, on-the-fly, to relevant knowledge based on the “shared mind” (which represents an actual dialogue), to use such traced knowledge to dynamically update the models (parameters in parameterized AOGs, e.g., rewards for an S-AOG on the level of mastery of an underlying concept in a student/user’s mind, and/or rewards for different paths in a T-AOG in an agent’s mind), which can then be used (by the agent) to perform utility-driven dialogue planning in accordance with the dynamic situation of a dialogue with a specific user.

[00115] Referring back to Fig. 7A, to perform knowledge tracing and update the information state 110 according to traced knowledge, the knowledge tracing unit 700 comprises an initial knowing probability estimator 710, a knowing positive probability estimator 720, a knowing negative probability estimator 730, a guessing probability estimator 740, a state reward estimator 760, a path reward estimator 750, and an information state updater 770. Fig. 7C is a flowchart of an exemplary process of the knowledge tracing unit 700, in accordance with an embodiment of the present teaching. In operation, initial knowing probabilities for nodes in relevant AOG representations may first be estimated at 705. This may include the initial knowing probabilities for each relevant node in an S-AOG and for each branch of each T-AOG associated with the S-AOG node.

[00116] With the initial probabilities estimated, a dialogue agent may proceed with a dialogue with a user on a certain topic represented by a relevant S-AOG node and a specific T-AOG for the S-AOG node, with associated probabilities initialized. To initiate a dialogue, the robot agent starts the dialogue by following the T-AOG. When the user responds to the robot agent, the NLU engine 120 may analyze the response and produce a language understanding output. In some embodiments, to understand the user’s utterance, the NLU engine 120 may also perform language understanding based on information besides the utterance, e.g., information from a multimodal information analyzer 702. For example, a user may say “this is a machine toy” while pointing at one of the toys on a desk. To understand the semantics of this utterance, i.e., what “this” means, the multimodal information analyzer 702 may analyze both audio and visual information to combine cues in different modalities, facilitating the NLU engine 120 in making sense of what the user meant, and output the user’s response with, e.g., an assessment as to the correctness of the response based on, e.g., the T-AOG.

[00117] When the user’s response with the assessment is received, at 715, by the knowledge tracing unit 700, to trace knowledge based on what occurred in the dialogue, different modules may be invoked to estimate respective probabilities based on the received inputs. For instance, if the user’s response corresponds to a correct answer, the knowing positive probability estimator 720 may be invoked to determine the probabilities associated with positively knowing the correct answer; the knowing negative probability estimator 730 may be invoked to estimate the probabilities associated with not knowing the answer; and the guessing probability estimator 740 may be invoked to determine the probabilities evaluating the likelihood that the user just made a guess. If the user’s response corresponds to an incorrect answer, the knowing positive probability estimator 720 may also determine the probabilities associated with positively knowing the answer yet making a mistake; the knowing negative probability estimator 730 may estimate the probabilities associated with not knowing, the user thus answering wrong; and the guessing probability estimator 740 may determine the probabilities that the answer is just a guess. These steps are performed at 725, 735, and 745, respectively.

[00118] As discussed herein, with a T-AOG, when a user interacts with a dialogue agent, such interactions form a parse graph that continues to grow with the progression of the dialogue. One example is shown in Fig. 5A. Given the parse graph or history of the interaction between a robot agent and a user, the probability that the user knows the underlying concept may be adaptively updated based on the estimated probabilities. In some embodiments, the probability of initial knowledge in (or knowing) the concept at time t+1 may be updated based on observations. In some embodiments, it may be computed based on the below formulation:

P(L_{t+1} | obs = correct) = P(L_t)(1 - P(S)) / [P(L_t)(1 - P(S)) + (1 - P(L_t)) P(G)]

P(L_{t+1} | obs = wrong) = P(L_t) P(S) / [P(L_t) P(S) + (1 - P(L_t))(1 - P(G))]

where P(L_{t+1} | obs = correct) represents the probability of initial knowledge at time t+1 given an observed correct answer, P(L_{t+1} | obs = wrong) represents the probability of initial knowledge at time t+1 given an observed wrong answer, P(L_t) is the probability of initial knowledge at time t, P(S) is the probability of a slip, and P(G) represents the probability of guessing. Thus, with probabilities estimated based on the observations of the dialogue, the probability of prior knowledge may be updated dynamically, as the example herein shows. Such a prior knowledge probability associated with a node in an S-AOG may then be used to compute, at 755 in Fig. 7C by the state reward estimator 760, the state based or node based reward associated with the node in the S-AOG, representing the user’s mastery of the skills associated with the concept node.
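
As a sanity check on the update just given, here is a minimal sketch of the computation in code. The slip/guess values are illustrative assumptions, and the function implements the standard knowledge-tracing posterior named by the variables above rather than any particular module of the system.

```python
# A minimal sketch of the knowledge-tracing update above. P_L is P(L_t);
# P_S and P_G are the slip and guess probabilities.
def update_knowledge(P_L: float, correct: bool,
                     P_S: float = 0.1, P_G: float = 0.2) -> float:
    if correct:
        # P(L_{t+1} | obs = correct)
        num = P_L * (1 - P_S)
        den = P_L * (1 - P_S) + (1 - P_L) * P_G
    else:
        # P(L_{t+1} | obs = wrong)
        num = P_L * P_S
        den = P_L * P_S + (1 - P_L) * (1 - P_G)
    return num / den

# Example: starting from P(L_t) = 0.5, a correct answer raises the estimate,
# and a subsequent wrong answer lowers it again.
p = 0.5
p = update_knowledge(p, correct=True)   # ~0.818
p = update_knowledge(p, correct=False)  # drops back down
print(p)
```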

[00119] Based on the probabilities computed with respect to the different branches (e.g., some corresponding to a correct answer and some corresponding to a wrong answer, as shown in Fig. 3B) of each node along a PG path in a T-AOG, a path based reward may be computed, at 765 in Fig. 7C by the path reward estimator 750, with respect to each of the paths. Based on such estimated state based and path based rewards, the information state updater 770 may then proceed to update, at 775, the parameterized AOGs in the information state 110. When the parameters associated with AOGs in the information state 110 are updated, the updated parameterized AOGs can then be used to control the dialogue based on the utilities (preferences) of the user.

[00120] In some embodiments, the different parameters used to parameterize AOGs may be learned based on observations and/or computed probabilities. In some implementations, unsupervised learning approaches may be employed to learn such model parameters, including, e.g., knowledge tracing parameters and/or utility/reward parameters. Such learning may be performed either online or offline. Below, an exemplary learning scheme is provided:
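
The formulation itself does not survive in this text. As a hedged illustration of the kind of unsupervised scheme this paragraph describes, the sketch below fits slip/guess/initial-knowledge parameters to observed answer sequences with a simple expectation-maximization-style loop. Everything here (function names, the static-mastery assumption, the iteration count) is an assumption chosen for illustration, not the formulation of the present teaching.

```python
# A hedged EM-style sketch for fitting knowledge-tracing parameters
# (P(L0), P(S), P(G)) from observed answer sequences. Illustrative only:
# mastery is treated as a fixed latent state per student for simplicity.
def em_fit(sequences, P_L0=0.5, P_S=0.1, P_G=0.2, iters=50):
    # sequences: list of lists of 0/1 answers (1 = correct), one per student.
    for _ in range(iters):
        post = []  # E-step: posterior P(mastered) per student
        for seq in sequences:
            lik_m = lik_u = 1.0  # likelihood under mastered vs. unmastered
            for obs in seq:
                lik_m *= (1 - P_S) if obs else P_S
                lik_u *= P_G if obs else (1 - P_G)
            post.append(P_L0 * lik_m / (P_L0 * lik_m + (1 - P_L0) * lik_u))
        # M-step: re-estimate parameters from expected counts.
        P_L0 = sum(post) / len(sequences)
        slips = total_m = guesses = total_u = 1e-9
        for p, seq in zip(post, sequences):
            for obs in seq:
                slips += p * (1 - obs); total_m += p
                guesses += (1 - p) * obs; total_u += (1 - p)
        P_S, P_G = slips / total_m, guesses / total_u
    return P_L0, P_S, P_G

data = [[1, 1, 1, 0], [0, 0, 1, 0], [1, 1, 1, 1]]
print(em_fit(data))
```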

[00121] Figs. 8A-8B depict utility-driven dialogue planning based on dynamically computed AOG parameters, in accordance with embodiments of the present teaching. Utility-driven dialogue planning may include dialogue node planning and dialogue path planning. The former may refer to selecting a node in an S-AOG to proceed with a dialogue session. The latter may refer to selecting a path in a T-AOG to conduct a dialogue. Fig. 8A shows an example of utility-driven tutoring planning with respect to parameterized S-AOGs, in accordance with an embodiment of the present teaching. Fig. 8B shows an example of utility-driven path planning in a parameterized T-AOG, in accordance with an embodiment of the present teaching.

[00122] With regard to node planning, as illustrated in Fig. 6D, an exemplary S-AOG 310 is for teaching various math concepts, with each node corresponding to one concept. It is shown in Fig. 8A with different nodes having reward related parameters associated therewith, and some nodes therein may be parameterized with conditions formulated based on the rewards of connected nodes. As seen in Fig. 8A, node 310-4 is for teaching the concept “Add,” node 310-5 is for teaching the concept “Subtract,” ..., etc. Each node is parameterized with, e.g., a reward indicating the return for teaching the concept in connection with, e.g., a current mastery level of the concept. The rewards associated with some nodes in S-AOG 310 are expressed as functions of the reward parameters of their connected nodes.

[00123] Some concepts may need to be taught subject to the requirement or condition that a user has mastered some other (e.g., prerequisite) concepts. For instance, to teach a student the “Division” concept, it may be required that the user has already mastered the concepts of “Add” and “Subtract.” This may be evidenced by requirement 820, expressed as Rd = Fd(Ra, Rs), where the reward Rd associated with node 310-3 is a function Fd of Ra and Rs, the rewards associated with node 310-4 on “Add” and node 310-5 on “Subtract,” respectively. For instance, an exemplary condition for teaching the concept of “Division” 310-3 may be that its reward level has to be high enough (i.e., the user has not yet mastered the concept of “Division”) and that the reward Ra for “Add” (310-4) and the reward Rs for “Subtract” (310-5) have to be low enough (i.e., the user has already mastered the prerequisite concepts of “Add” and “Subtract”). The mathematical formulation of the function Fd may be devised according to application needs to satisfy such conditions.

[00124] Node based planning may be set up so that a dialogue (T-AOG) associated with a node conditioned on some reward criterion in an S-AOG is not scheduled until the reward condition associated with the node is met. In this way, initially, when a user does not know any concept, the only nodes that are not conditioned and can be scheduled are 310-4 and 310-5. During the dialogue for either “Add” or “Subtract,” the associated reward (either Ra or Rs) may be continuously updated and propagated to nodes 310-2 and 310-3, so that Rm or Rd is updated as well, in accordance with Fm or Fd. At a certain point, when the user has mastered the concepts of both “Add” and “Subtract,” the rewards Ra and Rs become low enough that no dialogue associated with nodes 310-4 and 310-5 needs to be scheduled. At the same time, the low Ra and Rs may be plugged into Fm or Fd, so that the conditions associated with nodes 310-2 and 310-3 may now be met, making 310-2 and 310-3 active because Rm or Rd may now become high enough that they are ready to be chosen for carrying out dialogues on the topics of multiplication and division. When that happens, the T-AOGs associated therewith may be used to initiate the dialogues for teaching the respective concepts.
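
A toy version of this gating logic might look like the following. The threshold values and the concrete form of Fd are assumptions chosen for illustration, since the present teaching leaves Fd to be devised per application.

```python
# A hedged sketch of reward-conditioned node scheduling (Figs. 8A, 6D).
# The form of Fd and the thresholds are illustrative assumptions.
MASTERED = 0.2   # a reward at or below this means the concept is mastered
TEACHABLE = 0.7  # a reward at or above this makes a node worth scheduling

rewards = {"add": 0.9, "subtract": 0.9, "multiply": 0.0, "divide": 0.0}

def Fd(Ra: float, Rs: float) -> float:
    # One possible Fd: "Division" becomes rewarding only once both
    # prerequisites are mastered (their rewards are low).
    return 0.9 if max(Ra, Rs) <= MASTERED else 0.0

def schedulable(concept: str) -> bool:
    return rewards[concept] >= TEACHABLE

# Initially only "add"/"subtract" can be scheduled.
print([c for c in rewards if schedulable(c)])  # ['add', 'subtract']

# After tutoring succeeds, their rewards drop and propagate to "divide".
rewards["add"], rewards["subtract"] = 0.1, 0.15
rewards["divide"] = Fd(rewards["add"], rewards["subtract"])
print([c for c in rewards if schedulable(c)])  # ['divide']
```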

[00125] The same may also apply to the node for “Fraction.” It may be required that a user has already mastered the concepts of “Multiplication” and “Division” (the rewards for 310-2 and 310-3 being adequately low) before the reward for the node “Fraction” becomes accordingly reasonably high. In this manner, the state based rewards associated with nodes in an S-AOG may be utilized to dynamically control the traversal among the nodes in the S-AOG in a personalized way, e.g., adapted based on the situation relevant to each individual. That is, in actual dialogues with different users, the traversal can be controlled adaptively in a personalized manner based on observations of the actual dialogue situations. For example, in Fig. 8A, depending on the stage of the teaching, different nodes may have different rewards at different time instances. As illustrated, node 310-4 on “Add” is the darkest, signifying, e.g., the lowest reward value, which may indicate that the user has already mastered the concept of “Add.” Node 310-5 on “Subtract” has a reward value in-between, indicating, e.g., that the user has not yet mastered the concept but is close. Nodes 310-1, 310-2, and 310-3 are bright, indicating, e.g., high reward values representing that the user has not yet mastered the corresponding concepts.

[00126] Path related or path based rewards associated with paths in T-AOGs may also be dynamically computed based on observations of actual dialogues and may also be used for adapting how to traverse a T-AOG (how to select branches) during a dialogue. Fig. 8B illustrates an example of utility driven path planning with respect to T-AOGs, in accordance with an embodiment of the present teaching. As shown, in traversing a T-AOG, at each time instance, e.g., at time t, upon receiving an answer from a user, a robot agent needs to determine how to respond. During time instances 1, ..., t, the dialogue traversed a parse graph pg_{1:t} with traversed states s_1, s_2, ..., s_t. To respond, there may be multiple branches from state s_t, leading up to the next state s_{t+1}.

[00127] To determine which branch to take, a look-ahead operation may be performed based on the path based rewards along alternative paths. For example, to look ahead one step, the rewards associated with the alternative branches stemming from s_t (one step further) may be considered, and the branch that represents the best path based reward may be selected. To look ahead two steps, the rewards associated with each of the first set of alternative branches stemming from s_t, as well as the rewards associated with each of the secondary alternative branches (stemming from each of the first set of alternative branches), are considered, and a branch that leads to the best path based reward is selected as the next step. A deeper look-ahead can also be implemented based on the same principle. The example illustrated in Fig. 8B is a scheme in which a two-step look-ahead is implemented, i.e., at time t, the scope of the look-ahead includes multiple paths at t+1 as well as each of the multiple paths at t+2 stemming from each path at t+1. A branch is then selected via look-ahead to optimize the path based reward.

[00128] The path based rewards associated with branches may first be initialized and then updated during a dialogue. In some embodiments, initial path based rewards may be computed based on prior dialogues of the user. In some embodiments, such initial path based rewards may also be computed based on prior dialogues of multiple users who are similarly situated. Each path based reward may then be dynamically updated over time during a dialogue based on how each branch choice leads toward satisfaction of the intended purpose of the dialogue. Based on such dynamically updated path based rewards, a look-ahead optimization scheme may be driven by the utilities (or preferences) of each user as to how to proceed with a conversation, thus enabling adaptive path planning. Below is an exemplary formulation for path planning based on a look-ahead operation to optimize the path selection:

a* = argmax_a EU(a | s_t, pg_{1:t}), where

EU(a | s_t, pg_{1:t}) = E[ R(s_{t+1}, a) + max_{a'} EU(a' | s_{t+1}, pg_{1:t+1}) ]

In this exemplary formulation, a* is the optimally selected path given multiple branch selections a, the current state s_t, and the parse graph pg_{1:t}; EU is the expected utility of a branch choice a; and R(s_{t+1}, a) represents the reward of choice a at state s_{t+1}. As can be seen, the optimization is recursive, which allows look-ahead at any depth.
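
For concreteness, here is a small recursive sketch of such a bounded look-ahead over a branch-reward tree. The tree shape, reward numbers, and depth cap are assumptions for illustration, and outcome uncertainty is collapsed to a deterministic reward per branch for simplicity.

```python
# A hedged sketch of depth-bounded look-ahead path planning. Each branch
# maps to (reward, subtree of next branches); values are invented examples.
TREE = {
    "hint":    (0.3, {"re-ask": (0.6, {}), "simplify": (0.7, {})}),
    "re-test": (0.5, {"praise": (0.4, {}), "explain":  (0.2, {})}),
}

def lookahead_utility(branches: dict, depth: int) -> float:
    # Utility of the best branch, looking ahead up to `depth` steps.
    if depth == 0 or not branches:
        return 0.0
    return max(r + lookahead_utility(sub, depth - 1)
               for r, sub in branches.values())

def best_branch(branches: dict, depth: int) -> str:
    # a* = argmax over branch choices of reward plus look-ahead utility.
    return max(branches,
               key=lambda a: branches[a][0]
               + lookahead_utility(branches[a][1], depth - 1))

print(best_branch(TREE, depth=1))  # "re-test" (0.5 > 0.3 one step ahead)
print(best_branch(TREE, depth=2))  # "hint" (0.3 + 0.7 beats 0.5 + 0.4)
```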

[00129] Combined with the state based utility-driven node planning, the dialogue system 100 in accordance with the present teaching is capable of dynamically controlling conversations with a user based on past accumulated knowledge about the user as well as on-the-fly observations of the user in connection with the intended purposes of the underlying dialogues. Fig. 8C illustrates the use of utility-driven dialogue management of a dialogue with a student user based on a combination of node and path planning, in accordance with an embodiment of the present teaching. That is, in a dialogue with a student user, a dialogue agent conducts the dialogue with the user via utility-driven dialogue management with dynamic node and path selection based on parameterized AOGs.

[00130] In Fig. 8C, S-AOG 310 comprises different nodes for respective concepts to be taught, with annotated rewards and/or conditions. The rewards associated with the nodes may be determined previously based on knowledge about the user. For example, as illustrated, four nodes (310-2, 310-3, 310-4, and 310-5) may have lower rewards (represented as darker nodes), e.g., indicating that the student user has already mastered the concepts of add, subtract, multiply, and divide. There is one node with a high reward for teaching (i.e., a dialogue may be scheduled), which is 310-1 on “Fraction.” Selecting one of the S-AOG nodes to proceed with a dialogue is therefore reward-driven or utility-driven node planning.

[00131] Node 310-1 is shown to be associated with one or more T-AOGs 320, each corresponding to a dialogue policy governing a dialogue to teach the student the concept of “Fraction.” One of the T-AOGs, i.e., 320-1, may be selected to govern a dialogue session, and T-AOG 320-1 comprises various steps such as 330, 340, 350, 360, 370, 380, .... T-AOG 320-1 may be parameterized with, e.g., path based rewards. During the dialogue, the path based rewards may be used to dynamically conduct path planning to optimize the likelihood of achieving the goal of teaching the student to master the concept of “Fraction.” As illustrated, the highlighted nodes in 320-1 correspond to a path selected based on path planning, forming a parse graph representing the dynamic traverse based on the actual dialogue. This illustrates that knowledge tracing during a dialogue enables the dialogue system 100 to continuously update the parameters in the parameterized AOGs to reflect the utilities/preferences learned from the dialogues, and such learned utilities/preferences in turn enable the dialogue system 100 to adapt its path planning, making the dialogues more effective, engaging, and flexible.

[00132] AOGs used to represent an agent’s mindset in the information state 110 need to be created with content prior to being used in a human machine dialogue. The process of creating the structures of AOGs and the content associated with nodes/branches is referred to as content authoring. Traditionally, content in AOGs has been authored by humans, which can be time consuming and tedious. Different AOGs may be created in certain orders, e.g., the T-AOGs for each node in an S-AOG may be created after that node in the S-AOG has been created. This is shown in Fig. 9A, which depicts exemplary modes to create AOGs with authored content, in accordance with an embodiment of the present teaching. As seen, creation of AOGs includes creating S-AOGs and then the T-AOGs that are associated with the nodes in the S-AOGs. The present teaching discloses ways to create AOGs via automatic or semi-automatic means.

[00133] In creating an AOG, different creators may author different content. For instance, teachers may be called upon to create AOGs related to teaching. Different teachers may find different ways to teach a student a subject matter, e.g., add and subtract. Some may find it useful to teach add first and then subtract, and some may feel the opposite way. Depending on their personal experience, they may create different S-AOGs. In addition, with respect to a topic (e.g., “Add” corresponding to a node in an S-AOG), different creators may author different sequences of steps in a dialogue, or T-AOGs, to interact with a student on the topic, creating flexible ways of conveying the same topic. This is evidenced in Fig. 6B, where there are different ways of carrying out a greeting. According to the present teaching, different AOGs on a same topic may sometimes be consolidated via graph matching (see Fig. 6B and Fig. 6E). This enables creation of T-AOGs with parameterized content while preserving the structure of the flow as a more succinct representation of a parameterized T-AOG. Although this may simplify the representation of T-AOGs without losing content, it does not make the content authoring process more efficient. The present teaching discloses different ways to author content in a more effective manner, including both automatic and semi-automated content authoring processes.

[00134] Fig. 9B depicts an exemplary high level system diagram of a content authoring system 900 for automatically creating AOGs via machine learning, in accordance with an embodiment of the present teaching. Via this AOG learning system 900, the structures of S-AOGs and associated T-AOGs may be automatically created, and the content associated with the AOGs may be automatically authored, both via learning. In some embodiments, such automatically learned S-AOGs and T-AOGs may be further refined by humans, achieving a semi-automated means of creating AOGs. In the illustrated embodiment, the content authoring system 900 comprises a data-driven AOG learning engine 910, which is configured to learn both the structure of dialogues (which corresponds to S-AOGs) and the content associated with different parts of a dialogue structure (T-AOGs). The learning is based on data from past dialogues retrieved from a past dialogue database 915. The content stored in the past dialogue database 915 may be organized based on different criteria such as subjects, demographics of users, characteristic profiles, etc. Content from past dialogues may be indexed with respect to different classifications so that appropriate content may be used for learning. By accessing relevant indexed learning content, the data-driven AOG learning engine 910 may learn to create AOGs (both structurally and dialogue-content wise) that are appropriate for certain types of users.
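
A rough sketch of how past dialogue content might be indexed and retrieved for such learning follows; the record layout and method names are hypothetical illustrations, not the disclosed design of the database 915.

    # Hypothetical sketch: indexing past dialogues by subject and user demographics
    # so that the AOG learning engine can fetch only relevant training material.
    from collections import defaultdict

    class PastDialogueDB:
        def __init__(self):
            self.records = []
            self.by_subject = defaultdict(list)      # subject-based indices
            self.by_demographic = defaultdict(list)  # demographic-based indices

        def add(self, dialogue, subject, demographic):
            idx = len(self.records)
            self.records.append(dialogue)
            self.by_subject[subject].append(idx)
            self.by_demographic[demographic].append(idx)

        def fetch(self, subject=None, demographic=None):
            ids = set(range(len(self.records)))
            if subject is not None:
                ids &= set(self.by_subject[subject])
            if demographic is not None:
                ids &= set(self.by_demographic[demographic])
            return [self.records[i] for i in sorted(ids)]

    # e.g., db.fetch(subject="tutoring", demographic="children") would return
    # past tutoring dialogues with young learners as learning material.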

[00135] To learn AOGs related to a certain subject, e.g., tutoring, the data-driven AOG learning engine 910 may access relevant past dialogue data on tutoring, e.g., via the subject-based indices 920, and learn the structure, or S-AOG, of tutoring in terms of the flow of the dialogue over sub-concepts, as well as the speech content for each of the sub-concepts involved, i.e., the T-AOGs for each of the sub-concepts. A tutoring session on any subject matter may have several common sub-dialogues, each of which may be directed to a sub-concept. One example is shown in Fig. 6A in the form of an S-AOG, in which a tutoring session usually includes different sub-dialogues (nodes in the S-AOG) directed to, e.g., greetings (600), chat (605) or review (610), teaching the intended concept (615), testing (620), and evaluation (625). This general structure of the tutoring related dialogue may be generally applicable to any tutoring session, regardless of the specific concepts or subject matters to be tutored during the dialogue. Such a general structure forms the S-AOG for tutoring and can be learned from past dialogue data.

[00136] To learn the structure of S-AOGs associated with a subject, the data-driven AOG learning engine 910 may invoke topic based classification models 930 to recognize different sub-dialogues and classify them into different topics to derive the underlying structure. For instance, past dialogues related to tutoring may be processed, and a flow of different sub-dialogues directed to different topics may be recognized; such sub-dialogues may be always present (AND relationship) or alternatively conducted (OR relationship). As illustrated in Fig. 6A, chat (605) and review (610) are connected in an OR relationship, while teaching (615) and testing (620) are two required sub-dialogues in tutoring and are related by an AND relationship. After the evaluation (625), the next step can be any one of four possibilities: back to review (610), back to testing (620), back to teaching (615), or back to chat (605). These four possibilities are connected via an OR relationship.
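
The AND/OR structure described can be captured by a small recursive data type. The sketch below mirrors the Fig. 6A example and is an editorial illustration; SAOGNode and its fields are hypothetical names.

    # Illustrative S-AOG node with AND/OR semantics, mirroring the Fig. 6A example.
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class SAOGNode:
        label: str
        kind: str = "leaf"               # "leaf", "and" (all children), "or" (one child)
        children: List["SAOGNode"] = field(default_factory=list)

    # Tutoring: greeting, then chat OR review, then teaching AND testing, then evaluation.
    tutoring = SAOGNode("tutoring", "and", [
        SAOGNode("greeting"),
        SAOGNode("chat_or_review", "or", [SAOGNode("chat"), SAOGNode("review")]),
        SAOGNode("teaching"),
        SAOGNode("testing"),
        SAOGNode("evaluation"),
    ])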

[00137] To learn a T-AOG, the data-driven AOG learning engine 910 may classify different portions of each sub-dialogue into corresponding types based on the nature of the dialogue. For instance, as shown in Fig. 6E, a sub-dialogue related to greetings includes different portions, some relating to an initial greeting (650), some related to an answer to the initial greeting (655-1), and some related to a response to an answer to the initial greeting (660-1). Such different portions may be recognized via different means, e.g., based on a voice based approach, utterances from different parties may be recognized as such. Based on the content of the speech from different parties and the topic based classification models 930, each portion may be assigned a label representing the nature of the utterance. Different dialogues may have portions with the same label but with different content. For example, in answering an initial greeting (e.g., “Good morning”), in one dialogue, the answer may be “Thank you and you?” A party in a different dialogue may respond to the same greeting differently by answering “Good morning to you, too!” Both answers may be classified or labeled as an answer to an initial greeting but with different content. Similarly, a response to the answer (660-1) may also be recognized from a party different from the one who answered and be classified as a response to the answer. Such a response to an answer to a greeting may also include different utterances (e.g., “Thank you” or “I am good” in Fig. 6E). The sequence of steps in a conversation may then be generalized as a T-AOG, with each node representing an utterance spoken by a participant of a dialogue. As seen in the example shown in Fig. 6E, each label in the T-AOG may be associated with alternative content that may be spoken to instantiate the label.
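
The labeling step can be pictured as pooling, per label, the alternative utterances observed across dialogues. The sketch below is illustrative only and assumes each utterance has already been labeled by the topic based classification models 930 (a hypothetical pre-processing step for this example).

    # Sketch: collapse labeled utterances from many dialogues into T-AOG labels,
    # each label collecting the alternative content observed for it.
    from collections import defaultdict

    def build_taog_labels(labeled_dialogues):
        """labeled_dialogues: list of dialogues, each a list of (label, utterance)."""
        alternatives = defaultdict(set)
        for dialogue in labeled_dialogues:
            for label, utterance in dialogue:
                alternatives[label].add(utterance)
        return {label: sorted(utts) for label, utts in alternatives.items()}

    dialogues = [
        [("initial_greeting", "Good morning"),
         ("answer_to_greeting", "Thank you and you?")],
        [("initial_greeting", "Good morning"),
         ("answer_to_greeting", "Good morning to you, too!")],
    ]
    # build_taog_labels(dialogues)["answer_to_greeting"] yields both alternatives.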

[00138] Fig. 9C shows exemplary different types of subject based AOGs derived from machine learning, in accordance with an embodiment of the present teaching. Such AOGs, including both S-AOGs and T-AOGs, may be derived by learning from past dialogue data in a manner as disclosed herein. For example, based on past dialogue data, AOGs for tutoring math, language, ..., chemistry may be obtained, in terms of both the relationships among different sub-dialogues (S-AOGs) under each subject and a sequence of utterances (T-AOG) for each sub-dialogue, with each step in the sequence associated with alternative utterances.

[00139] Fig. 9D is a flowchart of an exemplary process of a content authoring system for creating AOGs via machine learning, in accordance with an embodiment of the present teaching. In this illustrated embodiment, to create AOGs for a particular subject (e.g., a tutoring session for teaching math), past dialogues related to the subject (e.g., all past dialogue data corresponding to tutoring sessions for teaching math concepts) are first accessed at 950 and used to recognize, at 955, the sub-dialogues (structure) in each of the past dialogues based on the topic based classification models 930, enabling the system to obtain, at 960, the structure or S-AOG for the subject; each node in the S-AOG is then labeled, at 965, according to the nature of the sub-dialogue. For instance, if a sub-dialogue is related to teaching the concept of fraction to a student, then the corresponding node in the S-AOG structure may be labeled as teaching.

[00140] For each node in an S-AOG linked to a sub-dialogue, the past dialogue data corresponding to the sub-dialogue may then be analyzed, at 970, to derive T-AOGs for the S-AOG node. In some embodiments, the learning system 900 may simply adopt certain sub-dialogue content to form a T-AOG. In some embodiments, past dialogue data from similar sub-dialogues involving similar yet different dialogue content may be used to create different T-AOGs for the S-AOG node. In some embodiments, different T-AOGs learned from past dialogue data may also be consolidated, e.g., via graph matching, to create one or more merged T-AOGs. In some situations, based on different T-AOGs to be merged to generate an integrated T-AOG, dialogue content from the different T-AOGs may be used to generate parameterized content for the integrated T-AOG. To create the structure of a T-AOG, the learning system 900 may recognize, at 975, different portions (e.g., “initial greeting,” “answer to initial greeting,” and “response to the answer” in Fig. 6E) of a sub-dialogue based on the topic based classification models 930. Each of such portions may be provided with parameterized dialogue content generated based on that of similar portions of different T-AOGs of the same S-AOG node. The concept of parameterized AOGs is discussed herein with reference to Figs. 6E - 6G.
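
Full graph matching over T-AOGs is involved; the much-simplified sketch below merges two T-AOGs only under the assumption (made purely for illustration) that they share the same sequence of labels, pooling the dialogue content of matching nodes into parameterized alternatives.

    # Simplified editorial sketch of T-AOG consolidation: two T-AOGs with identical
    # label sequences are merged by pooling, per label, their alternative utterances.
    def merge_taogs(taog_a, taog_b):
        """Each T-AOG: ordered list of (label, set_of_alternative_utterances)."""
        labels_a = [label for label, _ in taog_a]
        labels_b = [label for label, _ in taog_b]
        assert labels_a == labels_b, "this simplified merge requires matching structure"
        return [(label, alts_a | alts_b)
                for (label, alts_a), (_, alts_b) in zip(taog_a, taog_b)]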

[00141] In addition to automatically creating AOGs, including S-AOGs and/or the T-AOGs associated with each S-AOG node, AOGs may also be created in a semi-automated manner, in which the automatically generated AOGs may be further examined, confirmed, refined, or modified by a human via, e.g., a graphical user interface. This may include adjusting the S-AOGs and/or changing the dialogue content in T-AOGs obtained via machine learning. Fig. 10A illustrates an exemplary visual programming interface 1000 configured for authoring T-AOG content, in accordance with an embodiment of the present teaching. At the bottom of this exemplary interface 1000, it is indicated that the example is a content authoring interface for a dialogue related to “Greeting” a user in a tutoring dialogue session. There are questions (Q) and answers (A) with authored text content (Q - 1010-1, 1010-3, ..., and A - 1010-2, 1010-4, ...). Associated with each piece of text content, there is an illustrated “Edit” button to enable editing of the authored text content in the corresponding box.

[00142] In some embodiments, the authored text content 1010-1, ..., 1010-4, ..., may be initially created via learning and displayed in the interface 1000 for potential processing. If the machine-authored text content is acceptable, a human may click on a “Save” button 1025 to store the automatically authored text content associated with a T-AOG. A human may also, via the “Edit” options (1020-1, 1020-2, 1020-3, 1020-4, ...), modify the authored text content in the linked box. After the modifications, the “Save” button may be clicked to save the modified dialogue content associated with the T-AOG. Different humans may save different versions of modified dialogue content of the T-AOG. For example, one human may find the text content authored via machine learning acceptable and may then save such automatically generated content to be used by a robot agent to teach a student in a tutoring session on “Add.” Yet another human may prefer a robot agent that teaches the same “Add” concept in a different way, so that he/she may revise what the machine learned from the past dialogue data to customize the T-AOG differently and save it accordingly to drive his/her robot agent.

[00143] In some embodiments, although S-AOGs may be generated automatically via learning, the generation of T-AOGs for associated S-AOG nodes may be done manually, i.e., the interface 1000 may not initially display automatically populated authored text content, and instead a human may need to enter the text content in each box. Even with manual creation of T-AOG content, because S-AOGs are learned via machine learning, this process of semi-automated AOG creation is more efficient than a completely manual process. As discussed herein, in some embodiments, dialogue content of T-AOGs may be parameterized. Fig. 10B illustrates an exemplary visual programming interface 1030 configured for authoring parameterized T-AOGs, in accordance with an embodiment of the present teaching. This exemplary authoring tool interface 1030 may be provided in accordance with the T-AOG in Fig. 6F. In some embodiments, the dialogue flow/structure in Fig. 6F includes parameterized content and may be learned from past dialogue data via the AOG learning system 900. The interface 1030 shown in Fig. 10B may then be provided for one or more humans to add choices for the parameterized content.

[00144] As seen in Fig. 10B, the interface 1030 may present different portions related to the T-AOG on testing the concept of “Add.” Some portions may correspond to instructional portions, such as 1040-1 and 1040-4, that merely dictate some action to be performed, e.g., display (1040-1) something and/or say something (1040-4). Some portions are editable, such as the underlined parts for entering the values of variables X and Y (1040-2) or the content items in brackets [] to, e.g., specify desired objects (1040-3, 1040-4) or comments (1040-6, 1040-8, and 1040-9). Some portions provide conditions expressed in braces { }, such as 1040-5 and 1040-7. For example, if a problem of X + Y is presented to a student in a tutoring session and the student in the session answered the question 1040-4, the condition to utter a positive comment is when the answer from the student equals X+Y, i.e., the condition {[input]=X+Y} is met. When [input] does not equal X+Y, a dialogue is dictated to evaluate whether the incorrect answer is due to not knowing (e.g., a pure guess without knowing) or is merely incorrect (knowing but said wrong). Such evaluations may be performed during the dialogue on-the-fly based on, e.g., probabilities estimated from various observations related to the student. If it is evaluated as an incorrect answer, alternative comments for [Incorrect comment] 1040-8 may be authored via this authoring tool. If it is evaluated as a not-know answer, comments for 1040-9 may be authored and used to comment on the user’s answer.
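
The condition check and the three-way comment choice just described might be sketched as follows; the guess-probability threshold and the estimate_guess_probability helper are hypothetical stand-ins for the on-the-fly evaluation mentioned above.

    # Sketch of the {[input] = X + Y} condition and the three-way comment choice.
    # estimate_guess_probability is a hypothetical stand-in for the on-the-fly
    # evaluation of whether a wrong answer reflects guessing or a mere slip.
    def choose_comment(x, y, student_answer, estimate_guess_probability,
                       positive, incorrect, not_know, guess_threshold=0.5):
        if student_answer == x + y:              # condition {[input] = X + Y} met
            return positive
        if estimate_guess_probability() > guess_threshold:
            return not_know                      # likely answered without knowing
        return incorrect                         # knows the concept but said wrong

    # choose_comment(2, 3, 4, lambda: 0.8, "Great Job!", "Not quite.", "Let's review.")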

[00145] An editable portion may be edited either via a selection from a list of selectable items in a pull-down menu or in an “Edit” mode, which allows a human to enter text content or modify an existing text string. For example, a pull-down menu associated with [obj] (1040-3) may be activated by right clicking on 1040-3. Once a list of existing selectable items is presented, a human may make a selection from the list. For instance, for [obj1] and [obj2], their pull-down menus may be associated with lists of objects such as apples, oranges, .... Another mode of editing is to enter text strings. For example, the values of variables X and Y may be entered by a human in an editing mode.

[00146] Some editable portions may be edited based on both existing selectable choices (e.g., learned from past dialogue data) and newly entered authored text content. One example is provided in Fig. 10B with respect to the content item [Positive comment] 1040-6. When the Edit button 1045-3 associated therewith is clicked, an additional window 1046 may pop up. The pop-up window 1046 includes both pre-existing choices for a positive comment (each of which may be associated with a selection button on the left) and the choice to add more dialogue content (by clicking on the “More” button 1049-1). In this example, there are two pre-existing choices (“Great Job!” and “That is Correct. Nice work!”), with one selected (“Great Job!”), and two more entries newly entered (“Wonderful!” and “Fantastic!”). The “Accept” button 1049-2 may be clicked to save the selected and added choices as alternative dialogue content associated with [Positive comment], i.e., any of the saved content items may be recognized in a dialogue as a positive comment response to a student’s answer on X + Y.
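
Behind the pop-up window 1046, the editing behavior amounts to maintaining a set of alternative content items per parameterized label. A minimal sketch, with hypothetical class and method names:

    # Minimal editorial sketch of the alternatives editor behind pop-up window 1046.
    class AlternativesEditor:
        def __init__(self, label, existing):
            self.label = label                 # e.g., "[Positive comment]"
            self.choices = list(existing)      # pre-existing, learned choices
            self.selected = set()

        def select(self, item):                # a selection button on the left
            self.selected.add(item)

        def add_more(self, item):              # the "More" button 1049-1
            self.choices.append(item)
            self.selected.add(item)

        def accept(self):                      # the "Accept" button 1049-2
            return sorted(self.selected)       # saved alternative dialogue content

    ed = AlternativesEditor("[Positive comment]",
                            ["Great Job!", "That is Correct. Nice work!"])
    ed.select("Great Job!")
    ed.add_more("Wonderful!")
    ed.add_more("Fantastic!")
    # ed.accept() -> ["Fantastic!", "Great Job!", "Wonderful!"]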

[00147] If AOG structures are learned from past dialogue data via machine learning, such learned AOGs may either be directly used by a dialogue system to conduct dialogues with human users, or they can be further modified or enriched via the semi-automated authoring tool 1030. The results from such an authoring tool may also be used to generate program code that carries out the underlying dialogues dictated by the AOGs. That is, based on such generated AOGs (both S-AOGs and T-AOGs), code can be automatically generated that will follow through the dialogue flows specified by the AOGs in accordance with the dialogue content embedded in the nodes of the AOGs. So, the authoring tool or interface 1030 can be considered a visual programming tool, and together with the AOG learning system 900, the process of designing and implementing a machine agent for dialogues can be enhanced significantly.

[00148] Fig. 10C is a flowchart of an exemplary process for creating AOGs with authored dialogue content, in accordance with an embodiment of the present teaching. Based on past dialogue data, the AOG learning system 900 learns AOGs at 1050. In an automated mode, determined at 1055, such machine learned AOGs may be used directly to generate, at 1085, code for a machine agent to carry out the underlying dialogue without further performing content authoring or editing in a semi-automated mode. If machine learned AOGs are to be further refined/modified/edited, this is done via semi-automated means with an authoring tool such as 1000 or 1030 shown in Figs. 10A and 10B. To author or modify dialogue content associated with AOGs, each of the machine learned AOGs may be displayed, at 1065, for editing or authoring content associated with the AOG. As discussed herein, in some embodiments, the learned AOGs may have embedded dialogue content obtained via learning, and such learned content may be used as the basis for further refinement. In some embodiments, the dialogue content may be authored anew, as shown in Fig. 10A. During content authoring in the semi-automated mode, when modified dialogue content or new dialogue content is received, at 1070, it is stored appropriately with the relevant AOG at 1075. If there are more AOGs to be edited, as determined at 1080, the process proceeds to 1065 to display the next AOG. If all AOGs have been processed, modified AOGs based on the editing results are generated at 1085. Based on the modified/enhanced AOGs, code for carrying out the dialogues dictated by the AOGs is generated, at 1090, based on the authored dialogue content.
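
One way to picture the generated code is as a driver that walks a T-AOG at run time, uttering each node’s content and following the branch whose condition matches the user’s answer. The sketch below is an editorial illustration, not the disclosed code generator; say and listen are hypothetical I/O callbacks of the machine agent.

    # Editorial sketch: interpret a T-AOG directly; generated code would follow
    # the same flow. `say` and `listen` are hypothetical I/O callbacks.
    def run_taog(node, say, listen):
        while node is not None:
            say(node["utterance"])
            if not node.get("branches"):      # leaf node: dialogue ends here
                return
            answer = listen()
            # OR-branching: follow the first child whose condition matches.
            node = next((child for cond, child in node["branches"]
                         if cond(answer)), None)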

[00149] Fig. 11A illustrates exemplary code 1110 generated via visual programming and the result of the code in presenting scenes related to an S-AOG, in accordance with an embodiment of the present teaching. The code in 1110 is generated for presenting a scene with a group of objects used to teach a child student the concept of adding numbers, in connection with an S-AOG related to tutoring math. The code 1110 may be generated, via visual programming, based on dialogue content associated with the AOG and authored via a semi-automated content authoring tool. As seen, the code 1110 programs the presentation of a group of different types of objects, i.e., 1120-1 for pumpkin and 1130-1 for strawberry, as well as how many are to be presented for each type (1 pumpkin and 2 strawberries). Execution of such code may then generate 1120-3 and 1130-3, which present different numbers of the different types of objects in accordance with the dialogue content authored. The presentation is created to allow the dialogue to proceed toward the intended goal of teaching a student to understand numbers and/or the concept of adding. So, such created presentations may be followed by questions, asked by a machine agent of a student user, on how many pumpkins, how many strawberries, and what is the total number of fruits in the picture.

[00150] Fig. 11B illustrates exemplary code generated via visual programming based on dialogue content in a T-AOG obtained via semi-automated content authoring, in accordance with an embodiment of the present teaching. The code 1150 as shown implements a conversation represented by a T-AOG 1160 with AND and OR branches by traversing the T-AOG and taking specific paths (or a T-PG) based on the actual progression of the dialogue. Such code is automatically generated based on the authored T-AOG and the dialogue content associated with each of the nodes in the T-AOG. For example, in a particular conversation, a machine agent may ask a user whether he/she has cavities at 1130. If the answer to this question is yes or not-know, the machine agent traverses the conversation by proceeding to node 1140. Otherwise, the conversation proceeds to ask a diabetes related question at 1150. If the user’s answer to the diabetes question is yes/not-know, the path to 1170 is traversed; otherwise, the path to 1160 is traversed. In this manner, via either automated AOG learning or semi-automated content authoring combined with visual programming, the code for T-AOGs can be developed more efficiently.
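
The Fig. 11B flow can be phrased in the traversal style of the run_taog sketch above; the encoding below is an editorial guess at the flow the figure depicts, with hypothetical node contents.

    # Hypothetical encoding of the Fig. 11B flow for the run_taog sketch above.
    yes_or_unknown = lambda a: a in ("yes", "not know")
    otherwise = lambda a: True                 # fallback branch

    taog_fig_11b = {
        "utterance": "Do you have cavities?",
        "branches": [
            (yes_or_unknown, {"utterance": "Follow-up at node 1140."}),
            (otherwise, {
                "utterance": "Are you diabetic?",
                "branches": [
                    (yes_or_unknown, {"utterance": "Follow-up at node 1170."}),
                    (otherwise, {"utterance": "Follow-up at node 1160."}),
                ],
            }),
        ],
    }
    # run_taog(taog_fig_11b, say=print, listen=input)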

[00151] As discussed herein, content authoring may be done via an authoring tool which allows a human to modify existing dialogue content learned from past dialogue data or to enter new dialogue content to enrich existing AOGs. In some embodiments, dialogue content may also be authored based on what a user says and does, to provide not only the speech data but also the manner by which the speech is to be delivered. For instance, instead of modifying or entering new text in an interface, a human may simply speak the dialogue content to create authored content and, possibly, also perform certain expressions, facial, tonal, or physical, to convey how the authored content is to be delivered. That is, both the speech content and meta data about the speech may be authored in accordance with what a human does. Complying with such meta data may enable delivery of dialogue content with intended emotions, e.g., angry, happy, or animated.

[00152] Fig. 12A depicts an exemplary high level configuration of a system 1200 for content authoring based on multimodal inputs from a user, in accordance with an embodiment of the present teaching. In the illustrated embodiment, a human 1210 is engaged in authoring content related to AOGs via different means. As discussed above, in some embodiments, the human 1210 may create textual dialogue content via his/her computer/device by typing in authored dialogue content via a visual programming interface 1205. In addition, the present teaching also allows the human 1210 to author content via other means, with instructions describing the manner by which the authored content is to be delivered to a user during a user-machine dialogue. For example, a different means to author content is for the human 1210 to author the dialogue content by uttering the content (instead of typing). In some embodiments, when uttering the authored content, the human 1210 may also perform certain activities which may be translated into instructions on the manner of delivering the dialogue content. For instance, the human 1210 may utter the content in a special tone with a certain volume, speed, and pitch, with a particular facial expression to express a certain emotion, or act out certain physical movements, all of which may be translated into instructions so that the authored content may be delivered in the manner that the human 1210 demonstrated.

[00153] In using speech to author content, the utterance is captured by an audio sensor 1220 and then analyzed by an automated speech recognizer (ASR) 1240 to convert the speech signal into text as the authored content. At the same time, the acoustic signal may also be analyzed to extract acoustic features that can be used as acoustic based instructions for rendering the authored content. Such information (features of the utterance, as opposed to the content of the utterance) may be analyzed by an audio/visual (A/V) based instruction generator 1260 and used to generate, e.g., acoustic-related rendering instructions. For instance, the human 1210 may speak the content with anger, or at a high volume, a fast speed, and a high pitch. If the human 1210 utters the content with facial expressions, such visual signals may be captured by a camera 1230 and then be processed by the A/V based instruction generator 1260 to convert the visual signals into expression instructions so that the authored content may be delivered by a robot agent with certain facial expressions. For instance, the human 1210 may speak the content with a surprised facial expression.
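
Mapping measured acoustic features to rendering instructions can be as simple as thresholding; the feature names and threshold values below are arbitrary editorial placeholders, not the disclosed analysis.

    # Illustrative mapping from measured acoustic features of the author's speech
    # to rendering instructions; thresholds are arbitrary placeholder values.
    def acoustic_instructions(features):
        """features: dict with 'pitch_hz', 'volume_db', and 'rate_wps' (words/sec)."""
        return {
            "pitch":  "high" if features["pitch_hz"] > 200 else "normal",
            "volume": "loud" if features["volume_db"] > 70 else "normal",
            "speed":  "fast" if features["rate_wps"] > 3.0 else "normal",
        }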

[00154] In some embodiments, the acoustic and facial features may be combined to derive rendering instructions. The human 1210 may have a big smile on his/her face while uttering the content with the characteristics of an excited voice. Both the acoustic and visual information may be simultaneously captured by the audio/visual sensors 1220 and 1230 and used to derive acoustic and expression related rendering instructions. In some applications, a robot device may have a display on its head as its face, and a designated expression may then be rendered based on the expression instructions while the robot is uttering a piece of dialogue content based on the acoustic-related rendering instructions.

[00155] In some situations, a robot agent may have body parts that can be controlled based on instructions to perform, e.g., certain gestures such as waving a hand, tilting the head, making a fist, leaning the upper body forward, etc. The present teaching as disclosed herein also facilitates the generation of physical movement related instructions based on the acts of the human 1210. Such physical movement related instructions may be used to control a robot agent to perform certain physical acts as part of the expression to be exhibited while rendering some authored dialogue content. Instructions for such physical movement(s) may be generated automatically as part of the content authoring process. As illustrated in Fig. 12A, the camera 1230 may capture the actions of the human 1210, and a movement instruction generator 1250 may analyze the physical movement of the human 1210 and generate instructions as meta data associated with the authored content.

[00156] Instructions generated based on the acts of a human, intended to instruct a robot to achieve certain acoustic/visual/physical characteristics while speaking, may be referred to as A/V/P instructions. Fig. 12B illustrates exemplary types of meta data that may be generated and stored with authored dialogue content, according to an embodiment of the present teaching. Meta data or instructions associated with a piece of authored dialogue content may be directed to acoustic features, facial features, ..., and physical features. The acoustic features include the speed, pitch, tone, and/or volume to be used to convert a piece of dialogue content from its text form to its speech form. Facial features may include a happy expression, a sympathetic expression, a concerned expression, ..., etc. Physical features may include raising an arm, making a fist, tilting the head, ..., or leaning the body, each of which may be further specified, based on the physical acts of the human 1210, as to the left, right, forward, or backward.

[00157] A/V/P instructions so generated based on the acts of the human 1210 may be stored with their corresponding pieces of dialogue content, e.g., text content 1 may be associated with A/V/P instructions 1270-1; text content 2 may be associated with A/V/P instructions 1270-2; text content 3 may be associated with A/V/P instructions 1270-3; text content 4 may be associated with A/V/P instructions 1270-4; ..., etc. In this manner, whenever a robot agent is to utter a particular piece of dialogue content, it may access the associated A/V/P instructions and then be controlled to utter the content with the specified acoustic characteristics (e.g., tone, pitch, speed, volume, etc.), facial expression (e.g., smiling, frowning, or sad), and physical features (e.g., upper body leaning forward, finger pointing to the sky, or jumping). In this way, the dialogue content can be authored in an enriched manner via an efficient automatic or semi-automatic process.
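
Associating each piece of authored text with its A/V/P instructions is essentially a keyed store. A minimal sketch with an invented record layout and a hypothetical robot API:

    # Minimal sketch of storing authored content with its A/V/P instructions.
    avp_store = {}                  # text content -> A/V/P instruction record

    def save_authored(text, acoustic, facial, physical):
        avp_store[text] = {"acoustic": acoustic,
                           "facial": facial,
                           "physical": physical}

    def render(text, robot):
        """robot.speak is a hypothetical API; real agents will differ."""
        inst = avp_store.get(text, {})
        robot.speak(text,
                    voice=inst.get("acoustic", {}),
                    expression=inst.get("facial"),
                    gesture=inst.get("physical"))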

[00158] Fig. 12C is a flowchart of an exemplary process of the content authoring system 1200 for authoring content based on multimodal inputs from a user, in accordance with an embodiment of the present teaching. In content authoring, the system 1200 first receives, at 1205, multimodal inputs from different sensors. Such multimodal sensors may include audio, visual, and other types of sensors. Based on received audio signals, ASR is performed to generate, at 1215, dialogue content authored via speech. In addition, various types of acoustic features, such as pitch, volume, tone, etc., may also be estimated, at 1225, based on the received audio signal and used to generate, at 1235, acoustic related rendering instructions associated with the authored piece of dialogue content. At the same time, the received visual signal is analyzed, at 1245, to extract different visual features. Such extracted visual features may then be used to estimate facial expressions, if any, in order to generate, at 1255, relevant expression related rendering instructions for the authored piece of dialogue content. Similarly, the extracted visual features may also be used to further estimate, at 1265, physical acts, if any, that the human 1210 performed, in order to generate, at 1275, corresponding physical act related rendering instructions. Such automatically generated rendering instructions with respect to the dialogue content authored via speech may then be associated, at 1285, with the dialogue content for storage. The process then proceeds to the next piece of dialogue content if there is more, as determined at 1290, until all pieces of dialogue content related to the underlying T-AOG are authored.

[00159] Fig. 13 is an illustrative diagram of an exemplary mobile device architecture that may be used to realize a specialized system implementing the present teaching in accordance with various embodiments. In this example, the user device on which the present teaching is implemented corresponds to a mobile device 1300, including, but not limited to, a smart phone, a tablet, a music player, a handheld gaming console, a global positioning system (GPS) receiver, and a wearable computing device (e.g., eyeglasses, wrist watch, etc.), or any other form factor. Mobile device 1300 may include one or more central processing units (“CPUs”) 1340, one or more graphic processing units (“GPUs”) 1330, a display 1320, a memory 1360, a communication platform 1310, such as a wireless communication module, storage 1390, and one or more input/output (I/O) devices 1340. Any other suitable component, including but not limited to a system bus or a controller (not shown), may also be included in the mobile device 1300. As shown in Fig. 13, a mobile operating system 1370 (e.g., iOS, Android, Windows Phone, etc.) and one or more applications 1380 may be loaded into memory 1360 from storage 1390 in order to be executed by the CPU 1340. The applications 1380 may include a browser or any other suitable mobile apps for managing a conversation system on mobile device 1300. User interactions may be achieved via the I/O devices 1340 and provided to the automated dialogue companion via a network.

[00160] To implement various modules, units, and their functionalities described in the present disclosure, computer hardware platforms may be used as the hardware platform(s) for one or more of the elements described herein. The hardware elements, operating systems and programming languages of such computers are conventional in nature, and it is presumed that those skilled in the art are adequately familiar therewith to adapt those technologies to appropriate settings as described herein. A computer with user interface elements may be used to implement a personal computer (PC) or other type of workstation or terminal device, although a computer may also act as a server if appropriately programmed. It is believed that those skilled in the art are familiar with the structure, programming and general operation of such computer equipment and as a result the drawings should be self-explanatory.

[00161] Fig. 14 is an illustrative diagram of an exemplary computing device architecture that may be used to realize a specialized system implementing the present teaching in accordance with various embodiments. Such a specialized system incorporating the present teaching has a functional block diagram illustration of a hardware platform, which includes user interface elements. The computer may be a general purpose computer or a special purpose computer. Both can be used to implement a specialized system for the present teaching. This computer 1400 may be used to implement any component of the conversation or dialogue management system, as described herein. For example, the conversation management system may be implemented on a computer such as computer 1400, via its hardware, software program, firmware, or a combination thereof. Although only one such computer is shown, for convenience, the computer functions relating to the conversation management system as described herein may be implemented in a distributed fashion on a number of similar platforms, to distribute the processing load.

[00162] Computer 1400, for example, includes COM ports 1450 connected to and from a network connected thereto to facilitate data communications. Computer 1400 also includes a central processing unit (CPU) 1420, in the form of one or more processors, for executing program instructions. The exemplary computer platform includes an internal communication bus 1410, program storage and data storage of different forms (e.g., disk 1470, read only memory (ROM) 1430, or random access memory (RAM) 1440), for various data files to be processed and/or communicated by computer 1400, as well as possibly program instructions to be executed by CPU 1420. Computer 1400 also includes an I/O component 1460, supporting input/output flows between the computer and other components therein such as user interface elements 1480. Computer 1400 may also receive programming and data via network communications.

[00163] Hence, aspects of the methods of dialogue management and/or other processes, as outlined above, may be embodied in programming. Program aspects of the technology may be thought of as “products” or “articles of manufacture” typically in the form of executable code and/or associated data that is carried on or embodied in a type of machine readable medium. Tangible non-transitory “storage” type media include any or all of the memory or other storage for the computers, processors or the like, or associated modules thereof, such as various semiconductor memories, tape drives, disk drives and the like, which may provide storage at any time for the software programming.

[00164] All or portions of the software may at times be communicated through a network such as the Internet or various other telecommunication networks. Such communications, for example, may enable loading of the software from one computer or processor into another, for example, in connection with conversation management. Thus, another type of media that may bear the software elements includes optical, electrical and electromagnetic waves, such as used across physical interfaces between local devices, through wired and optical landline networks and over various air-links. The physical elements that carry such waves, such as wired or wireless links, optical links or the like, also may be considered as media bearing the software. As used herein, unless restricted to tangible “storage” media, terms such as computer or machine “readable medium” refer to any medium that participates in providing instructions to a processor for execution.

[00165] Hence, a machine-readable medium may take many forms, including but not limited to, a tangible storage medium, a carrier wave medium or physical transmission medium. Non-volatile storage media include, for example, optical or magnetic disks, such as any of the storage devices in any computer(s) or the like, which may be used to implement the system or any of its components as shown in the drawings. Volatile storage media include dynamic memory, such as a main memory of such a computer platform. Tangible transmission media include coaxial cables, copper wire, and fiber optics, including the wires that form a bus within a computer system. Carrier-wave transmission media may take the form of electric or electromagnetic signals, or acoustic or light waves such as those generated during radio frequency (RF) and infrared (IR) data communications. Common forms of computer-readable media therefore include, for example: a floppy disk, a flexible disk, a hard disk, magnetic tape, any other magnetic medium, a CD-ROM, DVD or DVD-ROM, any other optical medium, punch cards, paper tape, any other physical storage medium with patterns of holes, a RAM, a PROM and EPROM, a FLASH-EPROM, any other memory chip or cartridge, a carrier wave transporting data or instructions, cables or links transporting such a carrier wave, or any other medium from which a computer may read programming code and/or data. Many of these forms of computer readable media may be involved in carrying one or more sequences of one or more instructions to a physical processor for execution.

[00166] Those skilled in the art will recognize that the present teachings are amenable to a variety of modifications and/or enhancements. For example, although the implementation of various components described above may be embodied in a hardware device, it may also be implemented as a software-only solution, e.g., an installation on an existing server. In addition, the techniques as disclosed herein may be implemented as a firmware, firmware/software combination, firmware/hardware combination, or hardware/firmware/software combination.

[00167] While the foregoing has described what are considered to constitute the present teachings and/or other examples, it is understood that various modifications may be made thereto and that the subject matter disclosed herein may be implemented in various forms and examples, and that the teachings may be applied in numerous applications, only some of which have been described herein. It is intended by the following claims to claim any and all applications, modifications and variations that fall within the true scope of the present teachings.