Title:
ADVANCED CONVERSATIONAL RECOMMENDER SYSTEM
Document Type and Number:
WIPO Patent Application WO/2021/145823
Kind Code:
A1
Abstract:
Disclosed is a method for refining item recommendations including: (a) receiving a session initiation from the user; (b) ranking, using recommender component(s) (RC) and based on a conversation history of the user, candidate items (candidate items) and item attributes (candidate attributes) for the user; and (c) determining, using conversational component(s) (CC) and based on the ranked candidate items, ranked candidate attributes and conversation history, either to: ask a said ranked candidate attribute; or recommend at least one said ranked candidate item. In the former case, a candidate attribute is asked (the asked attribute). In the latter case, a candidate item is recommended (the recommended item). The method also includes receiving acceptance or rejection of the asked attribute or recommended item by the user, feeding the acceptance or rejection to the RC and updating the conversation history, and either exiting or repeating from step (c).

Inventors:
LEI WENQIANG (SG)
HE XIANGNAN (SG)
MIAO YISONG (SG)
KAN MIN-YEN (SG)
CHUA TAT-SENG (SG)
Application Number:
PCT/SG2021/050022
Publication Date:
July 22, 2021
Filing Date:
January 13, 2021
Assignee:
NAT UNIV SINGAPORE (SG)
International Classes:
G06F16/332; G06N20/00; G06Q30/00
Foreign References:
CN110033851A2019-07-19
CN110069657A2019-07-30
US20170091846A12017-03-30
CN107833117A2018-03-23
Other References:
SUN YUEMING; ZHANG YI: "Conversational Recommender System", The 41st International ACM SIGIR Conference, 8 June 2018 (2018-06-08), pages 235-244, XP058411144, DOI: 10.1145/3209978.3210002
Attorney, Agent or Firm:
DAVIES COLLISON CAVE ASIA PTE. LTD. (SG)
Claims

1. A method for refining item recommendations for a user, the method comprising the steps of:

(a) receiving, at a processor, a session initiation from the user;

(b) ranking, using one or more recommender components (RC) and based on a conversation history of the user: candidate items for the user (candidate items); and item attributes for the user (candidate attributes);

(c) determining, using one or more conversational components (CC) and based on the ranked candidate items, ranked candidate attributes and conversation history, either to: ask a said ranked candidate attribute; or recommend at least one said ranked candidate item;

(d1) if the CC determines to ask a ranked candidate attribute, asking a said candidate attribute (the asked attribute);

(d2) if the CC determines to recommend a ranked candidate item, recommending a said candidate item (the recommended item);

(e) receiving, at the processor, acceptance or rejection of the asked attribute or recommended item by the user, feeding the acceptance or rejection to the RC and updating the conversation history; and (f1) exiting; or (f2) repeating from step (c).

2. The method of claim 1, wherein ranking the candidate items comprises ranking items based on a probability of acceptance by the user, wherein the conversation history comprises at least one of: a lower score (thus inherently with lower probability) for an item that has been associated with a previous user interaction; and a higher score (thus inherently with higher probability) for an item that has been previously recommended and is not associated with a previous user interaction.

3. The method of claim 1 or 2, wherein ranking candidate items comprises optimising the RC based on a first pairwise training instance D1 that uses a first negative sample V1⁻, comprising all items presented to the user that are not interacted with, and a second pairwise training instance D2 that uses a second negative sample comprising all items, except for items presented to the user, having one or more of the candidate attributes.

4. The method of claim 3, wherein the second negative sample comprises all items, except for items presented to the user, having all the candidate attributes.

5. The method of claim 3 or 4, wherein optimising the RC comprises propagating a loss into the RC, the loss being based on the specific preference of the user given at least one of one or more recommended items and the candidate items.

6. The method of claim 5, wherein the loss is calculated according to:

L = Σ_{(u, v, v′) ∈ D1 ∪ D2} −ln σ( ŷ(u, v) − ŷ(u, v′) ) + λ_Θ ‖Θ‖²

in which u is the user, v is an item associated with a set of attributes Pv, ŷ is a predicted likelihood u will like v, σ is a sigmoid function, v′ is an element of V⁻ and λ_Θ is a parameter to prevent overfitting.

7. The method of any one of claims 1 to 6, wherein step (c) comprises encoding information from the RC and the dialogue history into a state vector, wherein the CC determines to either ask a said ranked candidate attribute or recommend at least one said ranked candidate item, based on the state vector.

8. The method of claim 7, wherein the state vector is a concatenation of at least two of: entropy information of each candidate attribute among all attributes of the candidate items; an encoding of user preferences of the user on each candidate attribute; an encoding of the conversation history; and an encoding of a length of a list of the candidate items.

9. The method of any one of claims 1 to 8, further comprising updating a factorisation machine for use by the RC in step (b), by optimising a Bayesian Personalised Ranking (BPR) loss.

10. The method of any one of claims 1 to 9, wherein step (f1) occurs upon acceptance of the recommended item or upon receiving, at the processor, a session exit by the user.

11. A recommender engine for refining item recommendations for a user, comprising: memory; at least one processor; one or more recommender components (RC); and one or more conversational components (CC), wherein the memory stores instructions that, when executed by the at least one processor, cause respective ones of the at least one processor, RC and CC to:

(a) receive, at the at least one processor, a session initiation from the user;

(b) rank, using the RC and based on a conversation history of the user: candidate items for the user (candidate items); and item attributes for the user (candidate attributes);

(c) determine, using the CC and based on the ranked candidate items, ranked candidate attributes and conversation history, either to: ask a said ranked candidate attribute; or recommend at least one said ranked candidate item;

(d1) if the CC determines to ask a ranked candidate attribute, ask a said candidate attribute (the asked attribute);

(d2) if the CC determines to recommend a ranked candidate item, recommend a said candidate item (the recommended item);

(f) receive, at the at least one processor, acceptance or rejection of the asked attribute or recommended item by the user, feed the acceptance or rejection to the RC and update the conversation history; and (f1) exit; or

(f2) repeat from step (c).

12. The engine of claim 11, wherein the RC ranks the candidate items by ranking items based on a probability of acceptance by the user, wherein the conversation history comprises at least one of: a lower score (thus inherently with lower probability) for an item that has been associated with a previous user interaction; and a higher score (thus inherently with higher probability) for an item that has been previously recommended and is not associated with a previous user interaction.

13. The engine of claim 11 or 12, wherein the RC ranks candidate items by optimising the RC based on a first pairwise training instance D1 that uses a first negative sample V1⁻, comprising all items presented to the user that are not interacted with, and a second pairwise training instance D2 that uses a second negative sample comprising all items, except for items presented to the user, having one or more of the candidate attributes.

14. The engine of claim 13, wherein the second negative sample comprises all items, except for items presented to the user, having all the candidate attributes.

15. The engine of claim 13 or 14, wherein optimising the RC comprises propagating a loss into the RC, the loss being based on the specific preference of the user given at least one of one or more recommended items and the candidate items.

16. The engine of claim 15, wherein the loss is calculated according to:

L = Σ_{(u, v, v′) ∈ D1 ∪ D2} −ln σ( ŷ(u, v) − ŷ(u, v′) ) + λ_Θ ‖Θ‖²

in which u is the user, v is an item associated with a set of attributes Pv, ŷ is a predicted likelihood u will like v, σ is a sigmoid function, v′ is an element of V⁻ and λ_Θ is a parameter to prevent overfitting.

17. The engine of any one of claims 11 to 16, wherein the CC determines either to ask a said ranked candidate attribute or to recommend at least one said ranked candidate item, based on a state vector produced by encoding information from the RC and the dialogue history.

18. The engine of claim 17, wherein the state vector is a concatenation of at least two of: entropy information of each candidate attribute among all attributes of the candidate items; an encoding of user preferences of the user on each candidate attribute; an encoding of the conversation history; and an encoding of a length of a list of the candidate items.

19. The engine of any one of claims 11 to 18, wherein the instructions further cause the at least one processor to update a factorisation machine for use by the RC in step (b), by optimising a Bayesian Personalised Ranking (BPR) loss.

20. The engine of any one of claims 11 to 19, wherein step (f1) occurs upon acceptance of the recommended item or upon receiving, at the processor, a session exit by the user.

Description:
ADVANCED CONVERSATIONAL RECOMMENDER SYSTEM

Field of Invention

This invention relates to a conversational recommender system, and also to a method for refining item recommendations for a user.

Background

Recommender systems are emerging as an important means of facilitating users’ information seeking. Typically, recommender system development has solely leveraged offline historical data to build the recommender model (henceforth, the static recommender system). As a result, the recommender is inherently limited where offline behaviour does not match online user behaviour. User preferences can be diverse and often change with time. As such, it is difficult to deduce from offline data the exact intent of a user when they use a service, even when the training data is sufficient.

The offline static recommendation task is formulated as an estimation of the affinity score between a user and an item. This is usually achieved by learning user preferences through historical user-item interactions such as clicks and purchases. Representative methods for performing this function use Matrix Factorization (MF) and a Factorization Machine (FM). Neural FM and DeepFM have improved the representation ability of FM with deep neural networks. However, such static recommendation methods suffer from the intrinsic limitation of being unable to capture users' dynamic preferences.

This intrinsic limitation motivates online recommendation, whose target is to adapt the recommendation results to users' online actions. This has typically been modelled as a multi-arm bandit problem. However, such methods focus on the exploration-exploitation trade-off in cold-start settings. In warm-start scenarios, capturing users' dynamic preferences is also crucial, as preference drift is common. The mathematical formulation of the multi-arm bandit problem also limits such methods to recommending a single item each time.

Recently, the conversational recommender system (CRS) has been proposed for dynamic acquisition of user preferences through conversations. However, existing CRSs tend not to model user history and, in any event, are only capable of handling a single round of interaction (e.g. recommendation).

It would be desirable to address or alleviate at least one of the above difficulties.

Summary

Disclosed herein is a method for refining item recommendations for a user, the method comprising the steps of:

(a) receiving, at a processor, a session initiation from the user;

(b) ranking, using one or more recommender components (RC) and based on a conversation history of the user: candidate items for the user (candidate items); and item attributes for the user (candidate attributes);

(c) determining, using one or more conversational components (CC) and based on the ranked candidate items, ranked candidate attributes and conversation history, either to: ask a said ranked candidate attribute; or recommend at least one said ranked candidate item;

(d1) if the CC determines to ask a ranked candidate attribute, asking a said candidate attribute (the asked attribute);

(d2) if the CC determines to recommend a ranked candidate item, recommending a said candidate item (the recommended item);

(e) receiving, at the processor, acceptance or rejection of the asked attribute or recommended item by the user, feeding the acceptance or rejection to the RC and updating the conversation history; and (f1) exiting; or (f2) repeating from step (c).

Ranking the candidate items may comprise ranking items based on a probability of acceptance by the user, wherein the conversation history comprises at least one of a lower score (thus inherently with lower probability) for an item that has been associated with a previous user interaction; and a higher score (thus inherently with higher probability) for an item that has been previously recommended and is not associated with a previous user interaction.

Ranking candidate items may comprise optimising the RC based on a first pairwise training instance D1 that uses a first negative sample V1⁻, comprising all items presented to the user that are not interacted with, and a second pairwise training instance D2 that uses a second negative sample comprising all items, except for items presented to the user, having one or more of the candidate attributes.

The second negative sample may comprise all items, except for items presented to the user, having all the candidate attributes.

Optimising the RC may comprise propagating a loss into the RC, the loss being based on the specific preference of the user given at least one of one or more recommended items and the candidate items.

The loss may be calculated according to:

L = Σ_{(u, v, v′) ∈ D1 ∪ D2} −ln σ( ŷ(v | u, Pu) − ŷ(v′ | u, Pu) ) + λ_Θ ‖Θ‖²

in which u is the user, v is an item associated with a set of attributes Pv, ŷ is a predicted likelihood that u will like v, σ is a sigmoid function, v′ is an element of V⁻ and λ_Θ is a parameter to prevent overfitting.

Step (c) may comprise encoding information from the RC and the dialogue history into a state vector, wherein the CC determines to either ask a said ranked candidate attribute or recommend at least one said ranked candidate item, based on the state vector.

The state vector may be a concatenation of at least two of entropy information of each candidate attribute among all attributes of the candidate items; an encoding of user preferences of the user on each candidate attribute; an encoding of the conversation history; and an encoding of a length of a list of the candidate items. The method may further comprise updating a factorisation machine for use by the RC in step (b), by optimising a Bayesian Personalised Ranking (BPR) loss.

Step (f1) may occur upon acceptance of the recommended item or upon receiving, at the processor, a session exit by the user.

Disclosed herein is also a recommender engine for refining item recommendations for a user, comprising: memory; at least one processor; one or more recommender components (RC); and one or more conversational components (CC), wherein the memory stores instructions that, when executed by the at least one processor, cause respective ones of the at least one processor, RC and CC to:

(a) receive, at the at least one processor, a session initiation from the user;

(b) rank, using the RC and based on a conversation history of the user: candidate items for the user (candidate items); and item attributes for the user (candidate attributes);

(c) determine, using the CC and based on the ranked candidate items, ranked candidate attributes and conversation history, either to: ask a said ranked candidate attribute; or recommend at least one said ranked candidate item;

(d1) if the CC determines to ask a ranked candidate attribute, ask a said candidate attribute (the asked attribute);

(d2) if the CC determines to recommend a ranked candidate item, recommend a said candidate item (the recommended item);

(f) receive, at the at least one processor, acceptance or rejection of the asked attribute or recommended item by the user, feed the acceptance or rejection to the RC and update the conversation history; and (f1) exit; or

(f2) repeat from step (c).

The system can be used to perform the method as described above. The resulting recommender (which is realised by implementing the above method or using the above system) is implemented in a multi-round conversational scenario and collects the user's preference data in a CRS session with a user simulator, and uses a multi-task learning technique to learn the user's preferences towards items and attributes simultaneously - e.g. using BPR loss in an FM model. Learning from these interactions can result in fewer conversation rounds or turns, with a higher level of recommendation hits. Enhanced item prediction directly increases successful recommendations, and attribute prediction ensures the CC can ask more targeted questions to rapidly reduce the candidate item space and increase the chance for the RC to make successful recommendations.

Description of Figures

Embodiments of the present invention will now be described, by way of non-limiting example only, with reference to the accompanying drawings in which:

Figure 1 shows a method for refining item recommendations for a user;

Figure 2 is a flow diagram showing a method used for refining item recommendations for a user;

Figure 3 shows the success rate of compared methods at different conversation turns on Yelp and LastFM (RQ1); and

Figure 4 is a schematic diagram showing components of an exemplary computer system for performing the methods described herein.

Detailed Description

The methods and systems described herein recognise that traditional CRSs pose challenges to development. Their settings are simplified - e.g. attributes are not asked, or the conversation with the user is under-developed or undeveloped - e.g. being restricted to one turn of asking and one of recommending. Such methods do not build a dialogue policy to decide when to ask or when to recommend.

Present methods are developed for real applications that are more complex than those previously envisaged or accommodated. The CRS should strategically ask attributes and make recommendations over multiple rounds, achieving successful recommendations in the fewest turns. In the present methods and systems the recommender component (RC) and the conversational component (CC) closely support each other, to achieve the goal of accurate recommendation in fewer turns. In this context, the following description decomposes the task into three key problems, namely: what to ask, when to recommend, and how to adapt to user feedback. The present methods and systems provide a new three-stage solution involving estimation, action and reflection - for example, ranking, recommending/asking, and updating based on feedback - to address the three problems in a unified framework employed in a multi-round conversational scenario.

An example method 100 for refining item recommendations for a user is described with reference to Figures 1 and 2. The method 100 broadly comprises:

102: session initiation;

104: ranking candidate items for the user and item attributes for the user;

106: determining whether to ask an attribute or recommend an item;

108: asking an attribute;

110: recommending an item;

112: receiving feedback; and either

114: exiting; or

116: repeating from step 106.

Step 116 is borne out of the fact that the CRS used herein inherently adopts a multi-round setting - i.e. multi-round conversation. The CRS may converse with a user to recommend items based on the user's conversation history - e.g. click history or dwell time on particular items, which may indicate that those items are attractive but insufficiently so to have resulted in purchase - i.e. not an absolute rejection, but a near acceptance - and thus losses resulting from such minor rejection may be de-emphasised (i.e. scaled down) for updating/learning.

In further detail, session initiation step 102 involves receiving, at a processor (e.g. processing components 410 of Figure 4), a session initiation from the user. This can occur when a user lands on a page or enters a portal through which the conversation can be controlled and items can be recommended. Steps 104 to 110 then control the presentation of CRS-generated input to the user conversation. Step 104 involves ranking, using one or more recommender components (RCs) and based on a conversation history of the user, candidate items for the user (hereinafter referred to as candidate items), and item attributes for the user (hereinafter referred to as candidate attributes). Notably, in the first round the conversation history may be empty or null. In each successive round, the conversation history is updated. Moreover, a "round" may be considered the performance of steps 106 to 112. In other embodiments, a round or trial is one recommendation. The proposed method 100 considers conversational recommendation as an inherently multi-round scenario, where a CRS interacts with the user by asking attributes and recommending items multiple times until the task succeeds or the user leaves. To distinguish the two, this invention terms the setting "single-round" where the CRS only makes recommendations once, ending the session regardless of the outcome. Step 106 involves determining, using one or more conversational components (CCs) and based on the ranked candidate items, ranked candidate attributes and conversation history, either to ask a candidate attribute ranked at step 104, or to recommend a candidate item ranked at step 104. Step 108 involves asking a ranked candidate attribute if the CC determines to do so, and similarly for step 110 in relation to recommending items.

With further regard to step 106, at each round, the CRS is allowed to choose two types of actions — either explicitly asking whether a user likes a certain item attribute or recommending at least one item (e.g. a list of items). In a session, the CRS may alternate between these actions multiple times, with the goal of finding desirable items for a user while minimizing the number of interaction rounds. There may be a single action per round - i.e. asking a candidate attribute or recommending an item or list of items. This multi-round setting is more challenging than the single round setting, as the CRS needs to strategically plan its actions. To achieve this functionality, the deep interactions between the CC and RC are modelled and developed. The CC is considered the component responsible for interacting with the user, while the RC is responsible for estimating user preference (e.g. generating the recommendation list).

Thus, the interaction between CC and RC is determined by various queries addressed by steps 104 to 112. A CRS needs to determine what to ask - i.e. attribute or item. To ask an attribute the CRS must first choose the attribute with which to interrogate the user. In some instances, step 108 may involve asking a binary query - for example, in making music recommendations the CRS may ask "Would you like to listen to classical music?" and expect a binary yes/no response. If the answer is yes, it can focus on items containing the attribute, benefitting the RC by reducing uncertainty in item ranking. However, if the answer is no, the CRS expends a conversation turn with less gain to the RC. In other instances, step 108 may involve asking an enumerated query - i.e. a question that elicits an enumerated response such as "Which music genre would you consider? I like pop, funk ...". The type of query performed at step 108 may be dictated by domain requirements. For the purposes of illustration, methods described with reference to the drawings will focus primarily on the basic single attribute case. Step 108 may, however, involve asking two or more, i.e. a plurality of, attributes - the two or more attributes may be related (e.g. classical music and Chopin), may specify mutually distinct sets (e.g. Chopin, Daft Punk) or have any other relationship or lack thereof as needed. For the purpose of exposition, the present disclosure also avoids open questions that do not constrain user response.

To recommend the correct items (i.e. items the user is likely to buy or select) with fewer turns, the CC may carefully consider whether the user will like the asked attribute. However, this scrutiny is performed by the RC. The RC considers the user's historical behaviour when developing recommendations and ranking items and attributes. This is used to determine when to recommend items. Once sufficient certainty in a particular recommendation is reached, the CC should push the recommendations generated by the RC. For example, step 110 may be performed once the candidate space is small enough, and/or when asking additional questions is determined to be unlikely to progress the user conversation - i.e. to be less useful or helpful from the perspective of either information gain or user patience - and/or when the RC is confident that the top recommendations will be accepted by the user. Determining the appropriate timing should take both the conversation history of the CC and the preference estimation of the RC into account.

Importantly, to understand how to behave during the next round, or in a new session with the user, the method 100 needs to adapt to the user's online feedback - i.e. feedback during use of a system (e.g. system 400) running method 100. Step 112 thus involves receiving acceptance or rejection by the user of the attribute asked at step 108 or item recommended at step 110, and feeding the acceptance or rejection to the RC and updating the conversation history. After each turn, the user gives feedback, e.g., yes/no on the queried attribute, or accept/reject on the recommended items - or an enumerated response in instances where enumerated queries are posed. For step 108: in the event of a positive, or "yes", response on the attribute, both a user profile (e.g. a history of past interactions of the particular user) and the item candidates need to be updated so as to generate better recommendations. This can be performed by online, but preferably offline, training of the RC to take such updates into account. In the event of a negative, or "no", response on the attribute, the CC needs to adjust its strategy accordingly. For step 110: if the recommended items are rejected (noting that step 110 involves recommending at least one item, but may involve recommending two or more (i.e. a plurality of) items such as a list of items), the RC model needs to be updated to incorporate the negative signal. Using transfer learning, it can be shown that the impacts on one of the CC and RC affect the other.

To model this interaction between CC and RC, an estimation, action and reflection (EAR) model is used. Under "Estimation", which aligns with step 104, predictive models are built (e.g. offline) to estimate user preferences on items and item attributes. In one embodiment, an FM is trained using user profiles and item attributes as input features. The FM may be jointly optimised on the two tasks of item prediction and attribute prediction. Moreover, the method 100 leverages adaptive training on conversation data with online user feedback on attributes. Under "Action", which aligns with steps 106 to 110, a conversational strategy is learned from which to determine whether to ask or recommend and, if asking, what attribute to ask. To do this, a policy network is learned or trained using reinforcement learning. This can involve optimizing a reward of shorter turns and successful recommendations based on the FM (or other model as appropriate) estimation of user preferred items and attributes, and the dialogue or conversation history. Under "Reflection", corresponding to step 112, the CRS is adapted or updated based on the user's online feedback. Specifically, when a user rejects the recommended items, new training triplets are constructed by treating the items as negative instances and the FM is updated in an online manner. Therefore, the method 100 comprehensively explores a multi-round CRS scenario that is more realistic than single-round methods, or methods that do not learn between the RC and CC. The present method 100 focuses on dialogue strategy in a multi-round recommendation scenario and designs a model based on the concept that the CC and RC, the two essential components, need to achieve deep interactions.

To learn a policy for guiding behaviour between rounds, reinforcement learning (RL) is used. The RL model is guided by rich statistics from both RC and CC. In the state vector of the RL model, various features may be encoded. For example, the state vector may encode one or more of the conversation history from the CC, and attribute entropy, attribute preference and candidate item list lengths from the RC. Moreover, per step 112, the method 100, and thus the system 400 implementing that method 100, reflects on rejected items. When a user rejects an item or a list of items, it implies that the current prediction for the user is inaccurate. Consequently, the RC is updated per step 112. This involves adjusting the estimation for the user by retraining the FM (or other) model using the rejected items as negative samples. This reflection technique is able to alleviate the inaccuracy of offline training of the RC and hence improve the overall performance of the system.

To formalise the above concept, let u ∈ U denote a user u from the user set U and v ∈ V denote an item v from the item set V. Each item v is associated with a set of attributes Pv which describe its properties, such as music genre "classical" or "jazz" for songs in a database, or tags such as "nightlife", "serving burgers", or "serving wines" for businesses in a different database. The set of all attributes is denoted as P. A specific attribute is denoted as p. A CRS session is started per step 102 with the user u specifying a preferred attribute p0. The CRS, using the RC, filters out (i.e. rejects from the candidate item list, or demotes down the list) candidate items that do not contain the preferred attribute p0. Then in each turn or round t (t = 1, 2, ..., T; T denotes the last turn of the session), the CRS needs to choose an action: recommend (i.e., step 110) or ask (i.e., step 108).

Assuming the ACTION is "recommend", let the recommended item list generated at step 104 be denoted as Vt ⊆ Vcand and the action for step 106 be a_rec. Then, on performance of step 110, the user examines whether Vt contains their desired item. If the feedback is positive, this session succeeds and can be terminated per step 114. Otherwise, the proposed method marks Vt as rejected and moves to the next round - i.e. from step 112 to step 104.

Assuming the ACTION is "ask", let the asked attribute be denoted by pt ∈ P and the action as a_ask(pt). Upon performance of step 108, the user specifies whether they prefer items that contain the attribute pt or not - this is an example of a binary query. If the feedback is positive, the proposed method adds pt into Pu to denote the preferred attributes of the user in the current session - note: preferred attributes may shift between sessions. Otherwise, the proposed method marks pt as rejected. Regardless of rejection or acceptance, the method 100 moves to the next turn.

The method 100 naturally forms an interaction loop (Figure 1) where the CRS may ask zero to many questions before making recommendations. A session terminates if a user accepts the recommendations or leaves due to impatience. The proposed method 100 sets the main goal of the CRS as making desired recommendations within as few rounds as possible. At step 104, the system starts working at the estimation stage, where the RC ranks candidate items and item attributes for the user (candidate attributes), so as to support the action decision of the CC. After the estimation stage, the system moves to the action stage (i.e. step 106), where the CC decides whether to choose an attribute to ask, or make a recommendation, according to the ranked candidates and attributes and the dialogue history. If the user likes the asked attribute, the CC at stage 112 feeds this attribute back to the RC to make a new estimation. Otherwise, the system at step 106 stays at the action stage, updates the dialogue history (i.e. CC) and chooses another action. Once a recommendation is rejected by the user, the CC at step 106 sends the rejected items back to the RC, triggering the reflection stage where the RC adjusts its estimations. After that, the system enters the estimation stage (i.e., step 104) again. It will be appreciated that step 114 occurs upon acceptance of the recommended item or upon receiving, at the processor, a session exit by the user. During a multi-round conversational scenario the CC interacts with a user u and accumulates evidence on the user's preferred attributes, denoted as Pu = {p1, p2, ..., pn}. Importantly, different from traditional recommendation methods, the RC here needs to make use of Pu to accurately predict both the user's preferred items and preferred attributes. These two goals exert a positive influence on EAR, where the first directly contributes to the success rate of recommendation, and the second guides the CC to choose better attributes to ask users. Better selection of attributes shortens the conversation.
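By way of illustration only, the interaction loop just described (estimation at step 104, action at steps 106 to 110, reflection at step 112) can be sketched in Python-style pseudocode. The component interfaces used below (rank, decide, reflect, answer_attribute, accepts_any) are hypothetical names introduced for this sketch and are not terms used in the claims; the recommendation list length and maximum number of turns are likewise illustrative.

# Minimal sketch of the estimation-action-reflection loop (steps 102 to 116).
# rc, cc and user are assumed to expose the hypothetical methods used below.

def run_session(rc, cc, user, max_turns=15, top_k=10):
    history = []                                   # conversation history, empty at step 102
    preferred_attrs = [user.initial_attribute()]   # attribute p0 given at session initiation

    for turn in range(max_turns):
        # Step 104 (estimation): the RC ranks candidate items and candidate attributes.
        items, attrs = rc.rank(user, preferred_attrs, history)

        # Step 106 (action): the CC decides whether to ask an attribute or recommend.
        action = cc.decide(items, attrs, history)

        if action.kind == "ask":                   # Step 108: ask the chosen attribute
            accepted = user.answer_attribute(action.attribute)
            if accepted:
                preferred_attrs.append(action.attribute)
            history.append(("ask", action.attribute, accepted))
        else:                                      # Step 110: recommend the top-ranked items
            recommended = items[:top_k]
            if user.accepts_any(recommended):
                history.append(("rec", recommended, True))
                return True, turn + 1              # Step 114: exit on success
            history.append(("rec", recommended, False))
            rc.reflect(user, rejected=recommended) # Step 112: feed rejection back to the RC

    return False, max_turns                        # session ends when patience is exhausted

The sketch assumes the RC exposes ranking and online-update methods and the CC exposes a single decision method; a concrete system may split these responsibilities differently.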

To explore the recommendation method, and to illustrate adaptation of the method to achieve both item and attribute prediction goals simultaneously, a factorization machine (FM) is chosen as the predictive model, owing to its ability to model pairwise interactions between input features. An FM considers all pairwise interactions between input features. This is costly and may introduce undesirable interactions that negatively affect the proposed two goals. To reduce this cost, only those interactions that are useful to achieving the two goals are retained. Given user u, the user's preferred attributes in the conversation Pu, and the target item v, the likelihood of u liking v in the conversation session is determined by:

ŷ(v | u, Pu) = uᵀv + Σ_{pi ∈ Pu} vᵀpi

where u and v denote the embedding for user u and item v, respectively, and pi denotes the embedding for attribute pi ∈ Pu. Bias terms are omitted for clarity but may be included, as will be understood by the skilled person in view of present teachings. The first term uᵀv models the general interest of the user on the target item, a common term in the FM model. The second term Σ vᵀpi models the affinity between the target item and the user's preferred attributes. While v's attributes Pv may be incorporated into the FM, this is generally not necessary; the item embedding v may already encode v's attribute information.
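For illustration, a minimal sketch of the scoring function above is given below, assuming NumPy arrays as embeddings and a 64-dimensional embedding size (both assumptions made here for the example). Bias terms are omitted, as in the text.

import numpy as np

def fm_item_score(user_emb, item_emb, attr_embs):
    """y(v | u, Pu) = u.v + sum over pi in Pu of v.pi  (bias terms omitted)."""
    general_interest = user_emb @ item_emb                 # u.v
    attr_affinity = sum(item_emb @ p for p in attr_embs)   # sum of v.pi
    return general_interest + attr_affinity

# toy example with 64-dimensional embeddings
rng = np.random.default_rng(0)
u_emb, v_emb = rng.normal(size=64), rng.normal(size=64)
confirmed = [rng.normal(size=64) for _ in range(2)]        # two confirmed attributes
print(fm_item_score(u_emb, v_emb, confirmed))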

Ranking the candidate items at step 104 may comprise ranking items based on a probability of acceptance by the user, wherein the conversation history comprises at least one of: a lower score (thus inherently with lower probability) for an item that has been associated with a previous user interaction; and a higher score (thus inherently with higher probability) for an item that has been previously recommended and is not associated with a previous user interaction. To be more specific, ranking candidate items involves training the model - presently the FM. To train the FM, the pairwise Bayesian Personalized Ranking (BPR) loss is optimised. Specifically, given a user u, it is assumed that the interacted items (e.g., visited restaurants, listened-to music) should be assigned higher scores than those not interacted with. The loss function of traditional BPR is:

L_item = Σ_{(u, v, v′) ∈ D1} −ln σ( ŷ(v | u, Pu) − ŷ(v′ | u, Pu) ) + λ_Θ ‖Θ‖²

where D1 is the set of pairwise instances for BPR training, D1 := {(u, v, v′) | v′ ∈ V⁻}, where v is the interacted item of the conversation session (i.e., the ground truth item of the session), V⁻ := V \ V⁺ denotes the set of non-interacted items of user u and V⁺ denotes the items interacted by u. σ is the sigmoid function, and λ_Θ is the regularization parameter to prevent overfitting.
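A minimal sketch of this traditional BPR objective follows, assuming a caller-supplied scoring function (which is taken to close over the user's confirmed attributes Pu) and an explicit list of model parameters for the regularisation term; these names are illustrative only.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bpr_loss(pairs, score_fn, params, reg=0.001):
    """Traditional BPR loss over D1 = {(u, v, v')}: the interacted item v should
    outscore the non-interacted item v'. score_fn(u, v) returns y(v | u, Pu)."""
    loss = 0.0
    for u, v_pos, v_neg in pairs:
        loss += -np.log(sigmoid(score_fn(u, v_pos) - score_fn(u, v_neg)))
    loss += reg * sum(np.sum(p ** 2) for p in params)       # regularisation term
    return loss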

The RC intends to accurately rank the items that contain the user's preferred attributes. Ranking candidate items may comprise optimising the RC based on a first pairwise training instance D1 that uses a first negative sample V1⁻, comprising all items presented to the user that are not interacted with, and a second pairwise training instance D2 that uses a second negative sample comprising all items, except for items presented to the user, having one or more of the candidate attributes. For example, if u specifies "Mexican restaurant" as his preferred attribute, a good CRS needs to rank the user's preferred restaurants among all available Mexican restaurants. To capture this, the method 100 may involve sampling two types of negative examples:

V1⁻ := V \ V⁺,   V2⁻ := Vcand \ V⁺

where V1⁻ is the same negative sample as in the traditional BPR setting, i.e., all non-interacted items of u, Vcand denotes the current candidate items satisfying the partially known preference Pu in the conversation, and V2⁻ is the subset of Vcand that excludes the observed items V⁺. That is to say, the second negative sample comprises all items, except for items presented to the user, having all the candidate attributes. The two types of pairwise training instances are defined as:

D1 := {(u, v, v′) | v′ ∈ V1⁻},   D2 := {(u, v, v′) | v′ ∈ V2⁻}

The proposed method 100 then trains the FM model by optimizing both D1 and D2:

L_item = Σ_{(u, v, v′) ∈ D1 ∪ D2} −ln σ( ŷ(v | u, Pu) − ŷ(v′ | u, Pu) ) + λ_Θ ‖Θ‖²

in which u is the user, v is an item associated with a set of attributes Pv, ŷ is a predicted likelihood that u will like v, σ is a sigmoid function, v′ is an element of the respective negative sample and λ_Θ is a parameter to prevent overfitting. The first loss (over D1) learns u's general preference, and the second loss (over D2) learns u's specific preference given the current candidates. The second loss can be very important for the model to rank well on the current candidates. This is important for the present CRS (i.e. system 400), since the candidate items dynamically change with user feedback along the conversation.
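The construction of the two kinds of training instances, D1 and D2, might be sketched as follows; the sampling size n_neg and the helper name are assumptions introduced for illustration, not part of the described method.

import random

def build_pairwise_instances(user, pos_items, all_items, candidate_items, n_neg=10):
    """Construct the two kinds of BPR training instances described above.
    v1_neg: non-interacted items from the whole catalogue (traditional BPR).
    v2_neg: non-interacted items from the current candidate set Vcand."""
    v1_neg = list(set(all_items) - set(pos_items))
    v2_neg = list(set(candidate_items) - set(pos_items))

    d1, d2 = [], []
    for v_pos in pos_items:
        d1 += [(user, v_pos, v) for v in random.sample(v1_neg, min(n_neg, len(v1_neg)))]
        d2 += [(user, v_pos, v) for v in random.sample(v2_neg, min(n_neg, len(v2_neg)))]
    return d1, d2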

The task of the second goal of accurate attribute prediction is formulated separately. The prediction of attribute preference is mainly used in the CC to support the action on which attribute to ask. As such, u's preferred attributes in the current session are taken into account by:

ĝ(p | u, Pu) = uᵀp + Σ_{pi ∈ Pu} pᵀpi

which estimates u's preference on attribute p, given u's current preferred attributes Pu.

BPR loss is also employed to train the model. To employ the BPR loss, it is assumed that the attributes of the ground truth item v (of the session) should be ranked higher than other attributes:

L_attr = Σ_{(u, p, p′) ∈ D3} −ln σ( ĝ(p | u, Pu) − ĝ(p′ | u, Pu) ) + λ_Θ ‖Θ‖²

where the pairwise training data D3 is defined as:

D3 := {(u, p, p′) | p ∈ Pv, p′ ∈ P \ Pv}

and where Pv denotes item v's attributes.
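A corresponding sketch for attribute scoring and the attribute BPR loss is shown below; as with the item case, the function names and the regularisation weight are illustrative assumptions, and the scoring function is assumed to close over the user's confirmed attributes Pu.

import numpy as np

def fm_attr_score(user_emb, attr_emb, confirmed_attr_embs):
    """g(p | u, Pu) = u.p + sum over pi in Pu of p.pi  (bias terms omitted)."""
    return user_emb @ attr_emb + sum(attr_emb @ p for p in confirmed_attr_embs)

def attribute_bpr_loss(triples, score_fn, params, reg=0.001):
    """L_attr over D3 = {(u, p, p')}: attributes of the ground-truth item
    should outscore attributes the item does not have."""
    sig = lambda x: 1.0 / (1.0 + np.exp(-x))
    loss = sum(-np.log(sig(score_fn(u, p) - score_fn(u, p_neg)))
               for u, p, p_neg in triples)
    return loss + reg * sum(np.sum(w ** 2) for w in params)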

Method 100 then performs joint training on the two tasks of item prediction and attribute prediction. This enables the learning task for one goal to improve the learning task for the other, and vice versa, since the parameters are shared. The multi-task training objective is:

L = L_item + L_attr

where L is the overall learning task, comprising the learning task for the items, L_item, and the learning task for the attributes, L_attr. Specifically, method 100 first trains the model with L_item. After convergence, method 100 optimizes the model using L_attr. The proposed method iterates the two steps until it achieves convergence under both losses. Empirically, 2 to 3 iterations are sufficient for convergence.
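The alternating optimisation described above might look as follows in outline; fit_until_converged is a hypothetical helper standing in for whatever optimiser (e.g. SGD or Adam) is run on each loss until it stops improving.

def train_multitask(model, item_data, attr_data, outer_iters=3):
    """Alternating optimisation of L = L_item + L_attr: train on the item loss
    until convergence, then on the attribute loss, and repeat a few times."""
    for _ in range(outer_iters):
        model.fit_until_converged(loss="item", data=item_data)   # optimise L_item
        model.fit_until_converged(loss="attr", data=attr_data)   # then optimise L_attr
    return model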

After the estimation stage 104, the action stage 106 finds the best strategy for when to recommend. Stage 106 comprises encoding information from the RC and the dialogue history into a state vector. The CC determines either to ask a said ranked candidate attribute or to recommend at least one said ranked candidate item, based on the state vector, in accordance with a learned transition policy, the policy being learned through RL. The state vector is a bridge for the interaction between the CC and RC. The proposed method encodes information from the RC and dialogue history into a state vector, providing it to the CC to choose actions. In principle, the state vector is a concatenation of at least two of: entropy information of each candidate attribute among all attributes of the candidate items; an encoding of user preferences of the user on each candidate attribute; an encoding of the conversation history; and an encoding of a length of a list of the candidate items. Each of the vector components captures an assumption on asking which attribute could be most useful, or whether now is a good time to push a recommendation. In some embodiments, the state vector is a concatenation of four component vectors that encode signals from different perspectives according to:

s = s_ent ⊕ s_pre ⊕ s_his ⊕ s_len

where s_ent encodes the entropy information of each attribute among the attributes of the current candidate items Vcand, s_pre encodes u's preference on each attribute, s_his encodes the conversation history, and s_len encodes the length of the current candidate list. The length |Vcand| is divided into ten (or some other number as suitable for a specific application) categorical (binary) features to facilitate the RL training.

Regarding s_ent: intuitively, asking attributes with large entropy will help reduce the candidate space. Reducing the candidate space benefits the goal of finding desired items in fewer turns. Its size is the attribute space size |P|, where the i-th dimension denotes the entropy of the attribute pi. s_pre is also of size |P|, where each dimension is evaluated on the corresponding attribute. Intuitively, attributes with high predicted preference are likely to receive positive feedback. This also helps to reduce the candidate space. The size of s_his is the number of maximum turns T, where each dimension t encodes user feedback at turn t. Specifically, -1 represents recommendation failure, 0 represents asking an attribute that u disprefers, and 1 represents successfully asking about an attribute that u desires. This state is useful to determine when to recommend items. For example, if the system has asked about a number of attributes of which u approves, step 110 can be performed with high confidence. Regarding s_len: intuitively, if the candidate list is short enough, EAR should turn to recommending to avoid wasting more turns - i.e. the CC should perform step 110 in preference to step 108. s_pre, s_ent and s_len are all derived from the RC component. This is an important differentiator from existing CRSs. The CC needs to take information from the RC to decide on the dialogue action - step 108 or step 110. Existing CRSs make decisions based solely on the belief tracker that is used to record the preferred attributes of the user. This makes existing CRSs less informative and thus less effective, especially when the number of attributes is large.
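A simplified sketch of assembling the state vector s = s_ent ⊕ s_pre ⊕ s_his ⊕ s_len is given below. For brevity the sketch computes entropy and preference only over attributes present in the current candidates, whereas the description fixes s_ent and s_pre at size |P|; the log-scale bucketing of |Vcand| into ten binary features is also an assumption made for this illustration.

import numpy as np

def build_state(cand_items, item_attrs, attr_pref, feedback, max_turns=15, n_bins=10):
    """Concatenate s_ent, s_pre, s_his and s_len for the current turn.
    item_attrs[v] is the attribute set of item v, attr_pref[p] the predicted
    preference g(p | u, Pu), feedback the per-turn history values (-1 / 0 / 1)."""
    all_attrs = sorted({p for v in cand_items for p in item_attrs[v]})

    # s_ent: binary entropy of each attribute over the current candidate items
    s_ent = []
    for p in all_attrs:
        q = sum(1 for v in cand_items if p in item_attrs[v]) / len(cand_items)
        s_ent.append(0.0 if q in (0.0, 1.0)
                     else -(q * np.log2(q) + (1 - q) * np.log2(1 - q)))

    s_pre = [attr_pref.get(p, 0.0) for p in all_attrs]            # predicted preferences
    s_his = feedback + [0] * (max_turns - len(feedback))          # padded turn history
    bin_idx = min(int(np.log2(max(len(cand_items), 1))), n_bins - 1)
    s_len = [1.0 if i == bin_idx else 0.0 for i in range(n_bins)] # bucketed |Vcand|

    return np.array(s_ent + s_pre + s_his + s_len, dtype=np.float32)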

The conversation action is chosen by a policy network in the proposed CC. To illustrate the efficacy of the state vector in the policy network, a simple policy network is adopted: a two-layer multi-layer perceptron. The perceptron can be optimized using a standard policy gradient method. The perceptron of the illustrative embodiment contains two fully-connected layers and maps the state vector s into the action space. The output layer is normalized into a probability distribution over all actions by softmax. The action space includes all attributes P and a dedicated action for recommendation. The action space is defined as A = {a_rec} ∪ {a_ask(p) | p ∈ P}, which is of size |P| + 1.
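A possible realisation of the two-layer policy network, assuming PyTorch, a hidden width of 64 and ReLU activations (all choices made for this sketch only):

import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """Two fully-connected layers mapping the state vector to a softmax
    distribution over the |P| + 1 actions (ask one of |P| attributes, or recommend)."""
    def __init__(self, state_dim, n_attributes, hidden_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, n_attributes + 1),   # a_ask(p) for each p, plus a_rec
        )

    def forward(self, state):
        return torch.softmax(self.net(state), dim=-1)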

After the CC takes an action at each round, it receives an immediate reward from the user (or user simulator). This will guide the CC to learn the optimal policy for optimizing long-term reward. In the present method 100, multiple rewards are designed. In the illustrated embodiment, there are four kinds of rewards: r_suc, a strongly positive reward when the recommendation is successful; r_ask, a positive reward when the user gives positive feedback on the asked attribute; r_quit, a strongly negative reward if the user quits the conversation; and r_prev, a slightly negative reward on every turn to discourage overly lengthy conversations. Other rewards can be incorporated based on dwell time, e.g. where the user has dwelled more than a predetermined time on providing feedback, a weakly positive reward may be given for an accepted attribute/item or a weakly negative reward for a rejected attribute/item. The intermediate reward r_t at turn t is the sum of the above four rewards, r_t = r_suc + r_ask + r_quit + r_prev.
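A sketch of combining the four reward signals for a single turn follows; the numeric reward values are placeholders chosen for illustration and are not values specified in the description.

def turn_reward(outcome, r_suc=1.0, r_ask=0.1, r_quit=-0.3, r_prev=-0.01):
    """Sum the four reward signals for one turn. `outcome` is a dict of flags
    describing what happened this turn."""
    r = r_prev                                    # small penalty applied every turn
    if outcome.get("recommend_success"):
        r += r_suc                                # strongly positive on a successful recommendation
    if outcome.get("attribute_accepted"):
        r += r_ask                                # mildly positive when the asked attribute is liked
    if outcome.get("user_quit"):
        r += r_quit                               # strongly negative if the user quits
    return r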

The policy network is denoted as π(a_t | s_t), which returns the probability of taking action a_t given the state s_t. Here a_t ∈ A and s_t denote the action to take and the state vector of the t-th turn, respectively. To optimize the policy network, the standard policy gradient method is used, formulated as follows:

θ ← θ + α ∇_θ log π_θ(a_t | s_t) R_t

where θ denotes the parameters of the policy network, α denotes the learning rate of the policy network, and R_t is the total reward accumulating from turn t to the final turn T:

R_t = Σ_{t′=t}^{T} γ^{t′−t} r_{t′}

where γ is a discount factor which discounts future rewards relative to the immediate reward.
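The update above corresponds to the standard REINFORCE estimator; a compact sketch, assuming PyTorch and that the caller has collected the log-probabilities of the chosen actions during the session, is:

import torch

def reinforce_update(optimizer, log_probs, rewards, gamma=0.95):
    """One policy-gradient step: accumulate discounted returns R_t and move the
    policy parameters in the direction of grad log pi(a_t | s_t) * R_t."""
    returns, running = [], 0.0
    for r in reversed(rewards):                   # R_t = sum over t' >= t of gamma^(t'-t) r_t'
        running = r + gamma * running
        returns.insert(0, running)
    returns = torch.tensor(returns)

    loss = torch.stack([-lp * R for lp, R in zip(log_probs, returns)]).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()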

This stage also implements the interaction between the CC and RC. It is triggered when the CC pushes the recommended items Vt to the user but gets rejected, so as to update the RC model per step 112 for better recommendations in future turns. In a traditional static recommender system training scenario, true negative samples are absent, since users do not explicitly indicate what they dislike. In the proposed conversational case, the rejection feedback is an explicit signal on user dislikes. These explicit rejections are highly valuable to utilize. Rejection also indicates that the offline-learned FM model improperly assigns high scores to rejected items. To leverage this source of feedback, the rejected items Vt are treated as negative samples. This has the benefit of constructing more training examples to refresh the FM model. Following the offline training process, the proposed method 100 at stage 112 comprises updating the factorisation machine for use by the RC, by optimizing the BPR loss as:

L_online = Σ_{(u, v, v′) ∈ D4} −ln σ( ŷ(v | u, Pu) − ŷ(v′ | u, Pu) ) + λ_Θ ‖Θ‖²

where D4 := {(u, v, v′) | v ∈ V⁺, v′ ∈ Vt}. Note that this stage is performed in an online fashion, where the ground truth positive item is not accessible. The historically interacted items V⁺ are therefore treated as the positive items to pair with the rejected items. All examples in D4 are then put into a batch for performance of batch learning with gradient descent. Empirically, it takes 3 to 5 epochs to converge, which is suitably efficient for online use.
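A sketch of this reflection step is shown below; sgd_step_bpr is a hypothetical method standing in for one gradient-descent pass of the BPR loss over the constructed batch D4.

def reflect_on_rejection(model, user, rejected_items, interacted_items,
                         epochs=5, lr=0.01):
    """Online update of the RC after a rejected recommendation: pair the
    historically interacted items V+ with the rejected items Vt as negatives
    (the set D4 above) and run a few epochs of gradient descent on the BPR loss."""
    d4 = [(user, v_pos, v_neg)
          for v_pos in interacted_items
          for v_neg in rejected_items]
    for _ in range(epochs):                       # 3 to 5 epochs typically suffice
        model.sgd_step_bpr(d4, lr=lr)             # hypothetical batch BPR update
    return model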

Experiments

The present method 100, and the system 400 that implements that method, are built on the principle that interaction between the CC and RC is valuable in refining conversation actions to converge to a favourable outcome (i.e. acceptance of a recommended item) in fewer rounds of conversation. To validate this principle, the whole system was validated to examine the overall effect brought by the interaction. An ablation study was then performed to investigate the effect of the interaction on each individual component.

Experiments were conducted on two datasets: Yelp for business recommendation and LastFM for music artist recommendation. The data were cleaned and split for training, testing and validation.

Both binary questions (on LastFM) and enumerated questions (on Yelp) were considered. To enable the enumerated question setting, a two-level taxonomy was built on the attributes of the Yelp data. For example, the parent attribute of {"wine", "beer", "whiskey"} is "alcohol". Thus, for enumeration, parent attributes may be built on attributes (i.e. child attributes). In the enumerated question setting, the system chooses one parent attribute to ask. That is to say, the size of the output space of the policy network is changed to X + 1, where X is the number of parent attributes containing all child attributes. Asking an attribute (i.e. asking if the user likes the attribute) per step 108 may involve asking (e.g. displaying on a display) the parent attribute, or asking both the parent attribute and child attributes associated with the parent attribute. If all child attributes are displayed with the parent, the user feedback may comprise a selection of one or more (e.g. multiple) child attributes.

Because conversational recommendation is a dynamic process, a user simulator is created to enable CRS training and evaluation. A conversation session is simulated for each observed interaction between users and items. Specifically, given an observed user-item interaction (u, v), v is treated as the ground truth item being sought and its attributes are the oracle set of attributes preferred by the user in this session. At the beginning, an attribute is randomly selected from the oracle set as the user's initialization of the session. Then the session proceeds in the loop of steps of method 100 involving the model acting (steps 108 and 110), the simulator responding (step 112) and the model updating. The max turn T of a session is set to 15 and the recommendation list length is standardized at 10.
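A minimal user simulator consistent with this description might look as follows; the class and method names are assumptions introduced for illustration (they match the hypothetical interfaces used in the session-loop sketch earlier in this description).

import random

class UserSimulator:
    """Simulates one conversation session for an observed (u, v) interaction:
    v's attributes form the oracle preference set, one of which seeds the session."""
    def __init__(self, user_id, ground_truth_item, item_attrs):
        self.user_id = user_id
        self.target = ground_truth_item
        self.oracle_attrs = set(item_attrs[ground_truth_item])

    def initial_attribute(self):
        return random.choice(sorted(self.oracle_attrs))      # random attribute to start the session

    def answer_attribute(self, attribute):
        return attribute in self.oracle_attrs                # yes/no feedback on an asked attribute

    def accepts_any(self, recommended_items):
        return self.target in recommended_items              # accept only the ground truth item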

Both offline and online training were run. The RC (e.g., FM) was built and the policy network (PN) was initialized by optimizing performance with the offline dialogue history. The strategy for determining which attribute to ask about involved selecting the attribute with the maximum entropy. For each round, the CC at step 106 chose the recommendation action with probability 10/max(|Vcand|, 10), where Vcand is the current candidate set. The intuition is that the confidence of recommendation grows when the candidate size is smaller. The ground-truth item and oracle attributes were given higher rank in the ranked candidate items and ranked candidate attributes, given the attributes confirmed by users in the dialogue histories. The policy was trained to mimic this rule-based strategy on the history. Online training of the policy network was then conducted through the system 400 interacting with the user.
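The rule-based warm-start strategy described above can be sketched as:

import random

def rule_based_action(cand_items, attr_entropy, k=10):
    """Warm-start strategy used to pre-train the policy: recommend with
    probability k / max(|Vcand|, k), otherwise ask the maximum-entropy attribute."""
    if random.random() < k / max(len(cand_items), k):
        return ("recommend", None)
    best_attr = max(attr_entropy, key=attr_entropy.get)
    return ("ask", best_attr)

Here k is the recommendation list length (10 in the experiments) and attr_entropy is assumed to map each candidate attribute to its entropy over the current candidates.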

The success rate (SR@t) was used to measure the ratio of successful conversations, i.e., recommending the ground truth item by turn t. The average turns (AT) needed to end a session were also reported. Larger SR denotes better recommendation and smaller AT denotes more efficient conversation. For offline training, the area under curve (AUC) score was used, which is a surrogate of the BPR objective, and one-sample paired t-tests were conducted to judge statistical significance.

Figure 3 shows the recommendation Success Rate* (SR*)@t at different turns (t = 1 to 15). SR* denotes the comparison of each method against the strongest baseline, CRM, indicated as y = 0 in Figure 3. The proposed model significantly outperforms other methods, confirming that considering extensive interactions between the CC and RC is an effective strategy for building a conversational recommender system.

Interestingly, Figure 3 indicates that in Yelp, EAR's gain over CRM enlarges in Turns 1-3, shrinks in Turns 4-6 and widens again afterwards. However, in LastFM it has a steadily increasing gain. This interesting phenomenon reveals that the proposed EAR system can learn different strategies in different settings. In the Yelp dataset, the CRS asks enumerated questions where the user can choose finer-grained attributes, resulting in a sharp reduction in the candidate space. The strategy that the EAR system learns is more aggressive: it attempts to ask attributes that can sharply shrink the candidate space and makes decisive recommendations in the beginning turns when it is confident. If this aggressive strategy fails, it changes to a more patient strategy of asking more questions without recommendations, causing less success in the middle turns (e.g., Turns 5-7). However, this strategy pays off in the long term, making recommendations more successful in the latter half of conversations (e.g., after Turn 7).

As can be seen, the attribute-aware BPR significantly boosts the performance of item ranking, being highly beneficial to ranking the ground truth item highly. Interestingly, it harms the performance of attribute prediction. This suggests that the attribute-aware BPR loss guides the model to specifically fit item ranking in the candidate list. Without an even optimization enforced for the attribute prediction task, that task may suffer from poor performance. This implies that it would be useful to explicitly optimize the attribute prediction task.

Figure 4 is a block diagram showing an exemplary computer device 400, in which embodiments of the method 100 may be practiced. The computer device 400 may be a personal computer, laptop, server system with client terminal, mobile computer device such as a smart phone, a wearable device, a palm-top computer, and multimedia Internet enabled cellular telephones, an on-board computing system or any other computing system, a mobile device such as an iPhone™ manufactured by Apple™ Inc. or one manufactured by LG™, HTC™ and Samsung™, for example, or other device.

As shown, the mobile computer device 400 includes the following components in electronic communication via a bus 406:

(a) a display 402;

(b) non-volatile (non-transitory) memory 404;

(c) random access memory ("RAM") 408;

(d) N processing components 410;

(e) a transceiver component 412 that includes N transceivers; and

(f) user controls 414.

Although the components depicted in Figure 4 represent physical components, Figure 4 is not intended to be a hardware diagram. Thus, many of the components depicted in Figure 4 may be realized by common constructs or distributed among additional physical components. Moreover, it is certainly contemplated that other existing and yet-to-be developed physical components and architectures may be utilized to implement the functional components described with reference to Figure 4.

The display 402 generally operates to provide a presentation of content to a user, and may be realized by any of a variety of displays (e.g., CRT, LCD, HDMI, micro-projector and OLED displays).

In general, the non-volatile data storage 404 (also referred to as non-volatile memory) functions to store (e.g., persistently store) data and executable code. The system architecture may be implemented in memory 404, or by instructions stored in memory 404.

In some embodiments for example, the non-volatile memory 404 includes bootloader code, modem software, operating system code, file system code, and code to facilitate the implementation components, well known to those of ordinary skill in the art, which are not depicted nor described for simplicity. For example, memory 404 may be a computer-readable medium containing computer program code, the code comprising instructions that, when executed by a processor (e.g. processing components 410) cause the computer system 400 to perform the method 100.

In many implementations, the non-volatile memory 404 is realized by flash memory (e.g., NAND or ONENAND memory), but it is certainly contemplated that other memory types may be utilized as well. Although it may be possible to execute the code from the non-volatile memory 404, the executable code in the non-volatile memory 404 is typically loaded into RAM 408 and executed by one or more of the N processing components 410.

The N processing components 410 in connection with RAM 408 generally operate to execute the instructions stored in non-volatile memory 404. As one of ordinary skill in the art will appreciate, the N processing components 410 may include a video processor, modem processor, DSP, graphics processing unit (GPU), and other processing components.

The transceiver component 412 includes N transceiver chains, which may be used for communicating with external devices via wireless networks. Each of the N transceiver chains may represent a transceiver associated with a particular communication scheme. For example, each transceiver may correspond to protocols that are specific to local area networks, cellular networks (e.g., a CDMA network, a GPRS network, a UMTS networks), and other types of communication networks.

The system 400 of Figure 4 may be connected to any appliance 418 necessary for the system 400 to carry out its functions.

It should be recognized that Figure 4 is merely exemplary and in one or more exemplary embodiments, the functions described herein may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the functions may be stored on or transmitted over as one or more instructions or code encoded on a non-transitory computer-readable medium 404. Non-transitory computer-readable medium 404 includes both computer storage medium and communication medium including any medium that facilitates transfer of a computer program from one place to another. A storage medium may be any available medium that can be accessed by a computer.

It will be appreciated that many further modifications and permutations of various aspects of the described embodiments are possible. Accordingly, the described aspects are intended to embrace all such alterations, modifications, and variations that fall within the spirit and scope of the appended claims.

Throughout this specification and the claims which follow, unless the context requires otherwise, the word “comprise”, and variations such as “comprises” and “comprising”, will be understood to imply the inclusion of a stated integer or step or group of integers or steps but not the exclusion of any other integer or step or group of integers or steps. The reference in this specification to any prior publication (or information derived from it), or to any matter which is known, is not, and should not be taken as an acknowledgment or admission or any form of suggestion that that prior publication (or information derived from it) or known matter forms part of the common general knowledge in the field of endeavour to which this specification relates.