Title:
MULTI-TASK LEARNING FOR BAYESIAN MATRIX FACTORIZATION
Document Type and Number:
WIPO Patent Application WO/2013/012990
Kind Code:
A1
Abstract:
A method of multi-task learning for Bayesian matrix factorization includes receiving a plurality of datasets including a plurality of items rated by a plurality of users, each dataset representing a different task (401), receiving a parameter set (402), determining a posterior distribution for the parameter set given the datasets (403), wherein the posterior distribution is approximated by a factorization distribution (404), for determining a plurality of feature vectors, and outputting the feature vectors as a training model (405), wherein the trained model predicts a user's rating of an item.

Inventors:
YUAN CHAO (US)
Application Number:
PCT/US2012/047304
Publication Date:
January 24, 2013
Filing Date:
July 19, 2012
Assignee:
SIEMENS CORP (US)
YUAN CHAO (US)
International Classes:
G06N7/00
Other References:
RUSLAN SALAKHUTDINOV ET AL: "Bayesian Probabilistic Matrix Factorization using Markov Chain Monte Carlo", PROCEEDING ICML '08 PROCEEDINGS OF THE 25TH INTERNATIONAL CONFERENCE ON MACHINE LEARNING, 9 July 2008 (2008-07-09), pages 880 - 887, XP055047261, ISBN: 978-1-60-558205-4, Retrieved from the Internet [retrieved on 20121210]
IAN PORTEOUS ET AL: "Bayesian Matrix Factorization with Side Information and Dirichlet Process Mixtures", PROCEEDINGS OF THE TWENTY-FOURTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE (AAAI-10), 15 July 2010 (2010-07-15), pages 563 - 568, XP055047278, Retrieved from the Internet [retrieved on 20121210]
Attorney, Agent or Firm:
CONOVER, Michele L. et al. (170 Wood Avenue SouthIselin, New Jersey, US)
Claims:
CLAIMS

What is claimed is:

1. A method of multi-task learning for Bayesian matrix factorization comprising:

receiving a plurality of datasets including a plurality of items rated by a plurality of users, each dataset representing a different task (401);

determining a posterior distribution of the datasets (403), wherein the posterior distribution is approximated by a factorization distribution, for determining a plurality of feature vectors (404), wherein each item and each user is assigned a feature vector; and outputting the feature vectors as a training model, wherein the trained model predicts a user rating of an item (405).

2. The method of claim 1, further comprising receiving a plurality of hyperparameters of the factorization distribution.

3. The method of claim 2, wherein the hyperparameters include

a concentration of a Dirichlet distribution,

a degree of freedom of a Wishart distribution,

a scale matrix of the Wishart distribution,

a plurality of scalar parameters for defining a plurality of cluster parameters, a number of clusters, and

a number of users.

4. The method of claim 3, further comprising: determining an inverse covariance matrix for the items and an inverse covariance matrix for the users given the degree of freedom of the Wishart distribution and the scale matrix of the Wishart distribution;

determining a plurality of cluster distributions of the items and a plurality of cluster distributions of the users given the concentration of the Dirichlet distribution, wherein the plurality of cluster distributions of the items and the plurality of cluster distributions of the users are determined independently of the inverse covariance matrix for the users and the inverse covariance matrix for the items; and

assigning a cluster label to each item and each user according to the plurality of cluster distributions of the items and the plurality of cluster distributions of the users, respectively.

5. The method of claim 4, further comprising:

determining a plurality of clusters of the items based on the cluster labels of the items, the inverse covariance matrix for the items and the plurality of cluster distributions of the items; and

determining a plurality of clusters of the users based on the cluster labels of the users, the inverse covariance matrix for the users and the plurality of cluster distributions of the users.

6. The method of claim 5, further comprising sampling the plurality of clusters of the items and the plurality of clusters of the users to generate the user rating of the item.

7. The method of claim 6, wherein the user rating of the item is conditioned by an inverse variance of a rating noise and a weighting vector.

8. The method of claim 1, wherein the determination of the posterior distribution is performed iteratively for each of the plurality of items.

9. The method of claim 1, wherein the factorization distribution determines the feature vectors from an inverse covariance matrix and a cluster mean, wherein the cluster mean is determined independent of the inverse covariance matrix.

10. The method of claim 1, wherein the training model is applied to a test data set, wherein the trained model predicts the user rating of the item missing from the test data set.

11. A computer program storage medium embodying instructions executable by a processor to perform a method for multi-task learning for Bayesian matrix factorization, the method comprising:

receiving a plurality of datasets including a plurality of items rated by a plurality of users, each dataset representing a different task (401);

determining a posterior distribution of the datasets (403), wherein the posterior distribution is approximated by a factorization distribution, for determining a plurality of feature vectors (404), wherein each item and each user is assigned a feature vector; and

outputting the feature vectors as a training model, wherein the trained model predicts a user rating of an item (405).

12. The computer program storage medium of claim 11, wherein the method further comprises receiving a plurality of hyperparameters of the factorization distribution.

13. The computer program storage medium of claim 12, wherein the hyperparameters include a concentration of a Dirichlet distribution,

a degree of freedom of a Wishart distribution,

a scale matrix of the Wishart distribution,

a plurality of scalar parameters for defining a plurality of cluster parameters, a number of clusters, and

a number of users.

14. The computer program storage medium of claim 13, wherein the method further comprises:

determining an inverse covariance matrix for the items and an inverse covariance matrix for the users given the degree of freedom of the Wishart distribution and the scale matrix of the Wishart distribution;

determining a plurality of cluster distributions of the items and a plurality of cluster distributions of the users given the concentration of the Dirichlet distribution, wherein the plurality of cluster distributions of the items and the plurality of cluster distributions of the users are determined independently of the inverse covariance matrix for the users and the inverse covariance matrix for the items; and

assigning a cluster label to each item and each user according to the plurality of cluster distributions of the items and the plurality of cluster distributions of the users, respectively.

15. The computer program storage medium of claim 14, wherein the method further comprises: determining a plurality of clusters of the items based on the cluster labels of the items, the inverse covariance matrix for the items and the plurality of cluster distributions of the items; and

determining a plurality of clusters of the users based on the cluster labels of the users, the inverse covariance matrix for the users and the plurality of cluster distributions of the users.

16. The computer program storage medium of claim 15, wherein the method further comprises sampling the plurality of clusters of the items and the plurality of clusters of the users to generate the user rating of the item.

17. The computer program storage medium of claim 16, wherein the user rating of the item is conditioned by an inverse variance of a rating noise and a weighting vector.

18. The computer program storage medium of claim 11, wherein the determination of the posterior distribution is performed iteratively for each of the plurality of items.

19. The computer program storage medium of claim 11, wherein the factorization distribution determines the feature vectors from an inverse covariance matrix and a cluster mean, wherein the cluster mean is determined independent of the inverse covariance matrix.

20. The computer program storage medium of claim 11, wherein the training model is applied to a test data set, wherein the trained model predicts the user rating of the item missing from the test data set.

Description:
MULTI-TASK LEARNING FOR BAYESIAN MATRIX FACTORIZATION

CROSS-REFERENCE TO RELATED APPLICATION This is a non-provisional application claiming the benefit of U.S. provisional application serial number 61/509,713, filed July 20, 2011, the contents of which are incorporated by reference herein in their entirety.

BACKGROUND

1. Technical Field

The present disclosure relates to collaborative filtering (CF), and more particularly to a Bayesian matrix factorization in a multi-task setting.

2. Discussion of Related Art

Collaborative filtering (CF) is an effective approach for recommender systems, which predict a rating of an item given by a user. CF is equivalent to completing a rating matrix by filling in missing ratings based on the existing ratings. Unlike content-based approaches, CF does not require side information (attributes) of users or items to be known.

CF methods can be classified into two categories: memory-based methods and model-based methods. Memory-based methods typically require less training than model-based methods. During testing, a rating may be interpolated from existing ratings from similar users or items. How to calculate similarity from the rating matrix is a challenging problem. Model-based methods learn a model from the rating matrix during training and apply this model for prediction. Exemplary model-based methods include multinomial rating models, matrix factorization models, principal component analysis (PCA) models, and multi-task learning models.
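
By way of illustration only, the following is a minimal sketch of a memory-based approach of the kind described above (user-user cosine similarity with weighted interpolation). The function name, similarity choice and neighborhood size are illustrative assumptions, not part of the disclosure.

```python
import numpy as np

def predict_memory_based(R, user, item, k=20):
    """Interpolate a missing rating from the k most similar users (cosine similarity).
    R is an N x M rating matrix with 0 marking a missing rating."""
    rated = R[:, item] > 0                      # users who rated this item
    rated[user] = False
    if not rated.any():
        return float(R[R > 0].mean())           # fall back to the global mean
    # cosine similarity between the target user's row and every other row
    norms = np.linalg.norm(R, axis=1) * np.linalg.norm(R[user]) + 1e-12
    sims = (R @ R[user]) / norms
    candidates = np.where(rated)[0]
    top = candidates[np.argsort(sims[candidates])[-k:]]
    weights = np.clip(sims[top], 0, None) + 1e-12
    return float(np.dot(weights, R[top, item]) / weights.sum())

R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [1, 0, 0, 4],
              [0, 1, 5, 4]], dtype=float)
print(predict_memory_based(R, user=0, item=2))
```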

Most CF methods are single-task approaches: a CF algorithm may be trained and tested on the same dataset. Most broadly used datasets such as the MovieLens and EachMovie datasets are well established; they typically contain ratings collected over a long period of time. For example, in the 1M MovieLens dataset, the average number of ratings per user is 165.

It is equally important for a CF method to work well for a newly created dataset in which available ratings are limited, leading to a sparse and thus more challenging rating prediction problem.

Multi-task learning (MTL) is a method for learning related single tasks jointly by sharing representations among tasks. MTL may improve the performance of each individual task with insufficient training data. Transfer learning is a related mechanism that generalizes knowledge acquired from a task with ample training data to a difficult task with inadequate training data. Both MTL and transfer learning have been used in CF applications. However, these methods make a strong assumption that an object (e.g., a user or an item) is present in different datasets and that the correspondence is known. Under this assumption, combining multiple datasets is similar to accumulating more ratings for the object. This assumption may not be valid.

BRIEF SUMMARY

According to an embodiment of the present disclosure, a method of multi-task learning for Bayesian matrix factorization includes receiving a plurality of datasets including a plurality of items rated by a plurality of users, each dataset representing a different task, determining a posterior distribution of the datasets, wherein the posterior distribution is approximated by a factorization distribution, for determining a plurality of feature vectors, wherein each item and each user is assigned a feature vector, and outputting the feature vectors as a training model, wherein the trained model predicts a user rating of an item. The method may further include receiving a plurality of hyperparameters of the factorization distribution.

The hyperparameters may include a concentration of a Dirichlet distribution, a degree of freedom of a Wishart distribution, a scale matrix of the Wishart distribution, a plurality of scalar parameters for defining a plurality of cluster parameters, a number of clusters, and a number of users.

The method may further include determining an inverse covariance matrix for the items and an inverse covariance matrix for the users given the degree of freedom of the Wishart distribution and the scale matrix of the Wishart distribution, determining a plurality of cluster distributions of the items and a plurality of cluster distributions of the users given the concentration of the Dirichlet distribution, wherein the plurality of cluster distributions of the items and the plurality of cluster distributions of the users are determined independently of the inverse covariance matrix for the users and the inverse covariance matrix for the items, and assigning a cluster label to each item and each user according to the plurality of cluster distributions of the items and the plurality of cluster distributions of the users, respectively.

The method may further include determining a plurality of clusters of the items based on the cluster labels of the items, the inverse covariance matrix for the items and the plurality of cluster distributions of the items, and determining a plurality of clusters of the users based on the cluster labels of the users, the inverse covariance matrix for the users and the plurality of cluster distributions of the users.

The method may further include sampling the plurality of clusters of the items and the plurality of clusters of the users to generate the user rating of the item. The user rating of the item may be conditioned by an inverse variance of a rating noise and a weighting vector.

The determination of the posterior distribution may be performed iteratively for each of the plurality of items. The factorization distribution may determine the feature vectors from an inverse covariance matrix and a cluster mean, wherein the cluster mean is determined independent of the inverse covariance matrix.

The training model may be applied to a test data set, wherein the trained model predicts the user rating of the item missing from the test data set.

According to an embodiment of the present disclosure, a method for multi-task learning for Bayesian matrix factorization may be embodied in a computer program storage medium embodying instructions executable by a processor to perform the method.

BRIEF DESCRIPTION OF THE DRAWINGS

Preferred embodiments of the present disclosure will be described below in more detail, with reference to the accompanying drawings:

FIG. 1 is a graphical model representation of a Bayesian matrix factorization model with a Dirichlet process mixture;

FIG. 2 is a graphical model representation of a multi-task learning method according to an embodiment of the present disclosure;

FIG. 3 is a graph of a prior distribution according to an embodiment of the present disclosure;

FIG. 4 is a flow diagram showing a method for collaborative filtering according to an embodiment of the present disclosure;

FIG. 5A depicts Table I showing NMAE results for an exemplary STL method on MovieLens data and EachMovie data;

FIG. 5B depicts Table II showing NMAE results for an exemplary MTL method on MovieLens data and EachMovie data; and

FIG. 6 is a diagram of a system for collaborative filtering according to an embodiment of the present disclosure.

DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS

According to an embodiment of the present disclosure, single-task Bayesian matrix factorization with Dirichlet process mixtures (BMF-DPM) is extended to a multi-task setting. A multi-task learning (MTL) method is described that improves the prediction accuracy of each single task, even in the presence of sparse rating matrices. Further, automatic relevance determination is introduced into the BMF-DPM model such that the dimension of a latent feature vector is determined automatically, increasing the speed of the method. A variational Bayesian method may be employed for training, which is faster than the Gibbs sampling used by other Bayesian matrix factorization approaches. Moreover, the method need not use side information (e.g., user or item identity), increasing the applicability of the method.

Throughout the detailed description, italic letters denote scalars, bold lowercase letters denote vectors, and bold uppercase letters denote matrices. The notation 1 : N is used as an abbreviation for the list 1, 2, ..., N.

An MTL extension of the Bayesian matrix factorization model with Dirichlet process mixtures shares parameters among different datasets. Although an exemplary MTL model is described in terms of movie rating datasets herein, other types of datasets such as book rating and news rating datasets can also be combined. Different datasets may introduce more between-task variations, for example, due to differences in rating ranges, rating environments and the nature of the items. Additional parameters or additive models accounting for these variations are possible extensions of the described method. According to an embodiment of the present disclosure, the Bayesian matrix factorization model with Dirichlet process mixture (BMF-DPM) may be extended for dimension reduction of the feature space and applied in a multi-task method.

The matrix factorization models may be used to decompose the N x M rating matrix X = U^T V, where U is a D x N user feature matrix and V is a D x M item feature matrix. Each column of U or V can be viewed as the latent feature vector for a user or an item, respectively. After U and V are learned, the corresponding rating r_ij will be the inner product between the i-th column of U and the j-th column of V: r_ij = u_i^T v_j. Tri-factorization decomposes the rating matrix into three latent matrices including U and V with various conditions added, for example, U and V being binary matrices.
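
By way of illustration only, the following minimal numpy sketch shows the factorization view just described; the matrices are randomly generated and the sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
D, N, M = 5, 4, 6                 # latent dimension, number of users, number of items

U = rng.normal(size=(D, N))       # D x N user feature matrix (one column per user)
V = rng.normal(size=(D, M))       # D x M item feature matrix (one column per item)

X = U.T @ V                       # N x M rating matrix approximated as U^T V
r_ij = U[:, 2] @ V[:, 3]          # predicted rating of item j=3 by user i=2: u_i^T v_j

assert np.isclose(X[2, 3], r_ij)  # the (i, j) entry equals the inner product
```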

In the context of the movie rating datasets, matrix factorization models assume that users and movies are represented by latent feature vectors u_i and v_j, respectively, where i = 1 : N and j = 1 : M. A rating (e.g., a missing rating in the matrix) may be predicted by the inner product u_i^T v_j. Objects may be grouped into clusters such that objects in the same cluster can share similar feature vectors and thus have similar ratings. Based on this idea, BMF-DPM has a generative process for clusters as shown in FIG. 1.

π_u ~ Dir(α/K),   π_v ~ Dir(α/L)   (1)

Λ_uk ~ W(ν_0, S_0),   Λ_vl ~ W(ν_0, S_0)   (2)

μ_uk ~ N(μ_0, (β_0 Λ_uk)^{-1}),   μ_vl ~ N(μ_0, (β_0 Λ_vl)^{-1}),   (3)

where k = 1 : K and l = 1 : L.

Referring to FIG. 1, π_u 101 and π_v 102 are the cluster distributions satisfying a Dirichlet distribution Dir( . ). Λ_uk 103 and Λ_vl 104 are the inverse covariance matrices following a Wishart distribution W( . ). Cluster means μ_uk 105 and μ_vl 106 are generated from a Gaussian distribution. Eqs. (2) and (3) may be referred to as a Gaussian-Wishart distribution.

Once these parameters are determined, for a user i = 1 : N 107 or an item j = 1 : M 108, its cluster label z_ui 109 or z_vj 110 is sampled from the corresponding multinomial distribution (treated the same as a categorical distribution)

z_ui ~ Multi(π_u),   z_vj ~ Multi(π_v).   (4)

Then feature vectors u_i 111 and v_j 112 for each user i and each item j are sampled from the corresponding clusters as

u_i ~ N(μ_{u z_ui}, Λ_{u z_ui}^{-1}),   v_j ~ N(μ_{v z_vj}, Λ_{v z_vj}^{-1}).   (5)

In view of the foregoing, a movie j rated by user i has a rating generated by

r_ij ~ N(u_i^T v_j, γ^{-1}),   (6)

where γ is the inverse variance of the rating noise.
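
By way of illustration only, the following compact sketch samples from the generative process described above (Eqs. (1)-(6) as reconstructed here). The small values of N, M, D, K, L, α, β_0, ν_0, S_0 and γ are illustrative and are not the disclosure's settings.

```python
import numpy as np
from scipy.stats import wishart

rng = np.random.default_rng(1)
N, M, D, K, L = 8, 10, 3, 2, 2
alpha, beta0, mu0, nu0, gamma = 1.0, 2.0, np.zeros(3), 4.0, 4.0
S0 = np.eye(D)

# Eq. (1): cluster distributions from a Dirichlet prior
pi_u = rng.dirichlet(np.full(K, alpha / K))
pi_v = rng.dirichlet(np.full(L, alpha / L))

# Eqs. (2)-(3): per-cluster inverse covariances and means (Gaussian-Wishart)
Lam_u = [wishart(df=nu0, scale=S0).rvs() for _ in range(K)]
Lam_v = [wishart(df=nu0, scale=S0).rvs() for _ in range(L)]
mu_u = [rng.multivariate_normal(mu0, np.linalg.inv(beta0 * A)) for A in Lam_u]
mu_v = [rng.multivariate_normal(mu0, np.linalg.inv(beta0 * A)) for A in Lam_v]

# Eq. (4): cluster labels; Eq. (5): latent feature vectors drawn from their clusters
z_u = rng.choice(K, size=N, p=pi_u)
z_v = rng.choice(L, size=M, p=pi_v)
U = np.stack([rng.multivariate_normal(mu_u[k], np.linalg.inv(Lam_u[k])) for k in z_u])
V = np.stack([rng.multivariate_normal(mu_v[l], np.linalg.inv(Lam_v[l])) for l in z_v])

# Eq. (6): observed rating = inner product plus Gaussian noise with inverse variance gamma
R = U @ V.T + rng.normal(scale=1.0 / np.sqrt(gamma), size=(N, M))
```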

If the rating part in Eq. (6) is ignored, the remaining Eqs. (1)-(5) describe a generative mixture model for an object (user or item). When the number of clusters K = 1, the mixture model becomes a single Gaussian model and the FIG. 1 model degrades to the Bayesian matrix factorization (BMF) model. When K = ∞, it becomes a Dirichlet process mixture (DPM) model, turning FIG. 1 into the BMF-DPM. One advantage of the DPM is that the actual number of clusters (no more than N or M) can be automatically determined. This contrasts with a regular Gaussian mixture model, where the number of clusters must be given and fixed. In practice, K may be set to a large number, bigger than the actual number of clusters.

Inference in the BMF-DPM model is more challenging than in the DPM model, because the feature vectors are observed in the latter case. However, in BMF-DPM, both user and item feature vectors are latent and need to be inferred from the ratings, which are the only visible variables. Inference so far has been limited to Gibbs sampling, which is relatively easy to implement. However, Gibbs sampling has a scalability problem and can be slow for medium-size datasets. In addition, neither BMF nor BMF-DPM addresses how to select D, the dimension of the feature vector. Therefore, these models are not fully non-parametric.

According to an embodiment of the present disclosure, BMF-DPM may be implemented with an automatic relevance determination. More particularly, a dimension reduction may be applied to BMF-DPM based on the concept of automatic relevance determination (ARD). ARD may be used for other non-parametric Bayesian methods including determining the number of kernel functions and the number of principal components.

According to an embodiment of the present disclosure, a weighting vector w is introduced and Eq. (6) may be extended to a triple-vector inner-product rating model

r_ij ~ N(Σ_{d=1}^{D} w_d u_id v_jd, γ^{-1}).   (7)

The weighting vector w has the following property: for a redundant or irrelevant dimension d, the corresponding weight w_d should be close to zero, such that this dimension can be ignored in Eq. (7). To encourage this behavior, a Gaussian prior may be given for each w_d

P(w_d | λ_d) = N(w_d | 0, λ_d^{-1}),   (8)

where λ_d is the inverse variance and has a Gamma distribution

P(λ_d) = Γ(λ_d | a_λ, b_λ).   (9)

Eqs. (8) and (9) are also represented in FIG. 2. The marginal probability of w_d 201 can now be calculated by marginalization of λ_d 202:

P(w_d) = ∫ P(w_d | λ_d) P(λ_d) dλ_d ∝ (b_λ + w_d^2/2)^{-(a_λ + 1/2)}.   (10)

FIG. 3 shows the probability density function 301 of this distribution with the constant hyperparameters a_λ = 0.01 and b_λ = 0.0001. a_λ and b_λ are typically set to low values. They can be ignored in P(w_d) when w_d is far away from zero, such that P(w_d) ∝ 1/|w_d|. When w_d is close to zero, P(w_d) becomes sharply peaked. Due to this property, w_d has the tendency to be zero when there is no evidence to support a non-zero value. This is in spirit similar to the Bayesian Lasso, where P(w_d) has a Laplace distribution given by P(w_d) ∝ e^{-λ|w_d|}. A Laplace distribution is also highly peaked at zero, but it is not a conjugate prior for the Gaussian distribution and special treatment is needed to make the inference tractable.
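
For reference, the marginalization behind the reconstruction of Eq. (10) can be sketched as follows (a standard Normal-Gamma integral, not reproduced from the disclosure's own derivation):

```latex
% Integrating out the Gamma-distributed precision \lambda_d:
P(w_d) = \int \mathcal{N}\!\left(w_d \mid 0, \lambda_d^{-1}\right)
              \Gamma\!\left(\lambda_d \mid a_\lambda, b_\lambda\right) d\lambda_d
       \;\propto\; \int \lambda_d^{\,a_\lambda - \frac{1}{2}}
              e^{-\lambda_d \left(b_\lambda + w_d^2/2\right)} d\lambda_d
       \;\propto\; \left(b_\lambda + \tfrac{w_d^2}{2}\right)^{-\left(a_\lambda + \frac{1}{2}\right)} .
```

For small a_λ and b_λ, and w_d away from zero, this behaves like 1/|w_d|, consistent with the discussion above.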

One advantage of the exemplary dimension reduction strategy is computational efficiency. If a dimension becomes irrelevant such that w_d ≈ 0, the corresponding dimension may be removed not only from the feature vectors u_i and v_j, but also from all other cluster parameters μ_uk, μ_vl, Λ_uk and Λ_vl. This makes the method faster during each iteration.
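
By way of illustration only, the following sketch shows the weighted rating model of Eq. (7) and the pruning of near-zero dimensions described above; the weight values and pruning threshold are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
D, N, M = 6, 5, 7
U = rng.normal(size=(N, D))                      # user feature vectors (rows)
V = rng.normal(size=(M, D))                      # item feature vectors (rows)
w = np.array([0.9, 1.1, 1e-6, 0.8, 1e-7, 1.2])   # ARD weights; two dimensions are irrelevant

# Eq. (7): r_ij = sum_d w_d * u_id * v_jd
R_hat = (U * w) @ V.T

# prune dimensions whose posterior-mean weight is (numerically) zero
keep = np.abs(w) > 1e-4                          # illustrative threshold
U_r, V_r, w_r = U[:, keep], V[:, keep], w[keep]
R_hat_reduced = (U_r * w_r) @ V_r.T
assert np.allclose(R_hat, R_hat_reduced, atol=1e-4)
```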

Referring to the BMF-DPM model in FIG. 2, as an initial matter the superscript (t) indexes the datasets (tasks), such as in the rating shown as r_ij^{(t)}. Referring further to FIG. 2, the cluster means μ_uk 203 and μ_vl 204 are generated from a Gaussian distribution, Eq. (11), independent from the inverse covariance matrices Λ_uk 205 and Λ_vl 206. In the FIG. 1 model, the same Λ_uk is used for both the cluster mean μ_uk and the feature vector u_i. In this way the model of FIG. 2 is more flexible; the inverse covariance matrix Λ_uk 205 and the cluster mean μ_uk 203 are independent in the determination of the feature vector u_i 207. Similarly, the feature vector v_j 208 is determined using the independent inverse covariance matrix Λ_vl 206 and cluster mean μ_vl 204. Stated simply, μ_uk is generated independently from Λ_uk and μ_vl is generated independently from Λ_vl. Moreover, the inverse variance γ 209 of rating r_ij 210 is a random variable in FIG. 2 with a Gamma prior

P(γ) = Γ(γ | a_γ, b_γ).   (12)

The exemplary BMF-DPM-ARD model of FIG. 2 is now complete, following Eqs. (1), (2), (4), (5), (7)-(9), (11), (12). All hyperparameters (open nodes in FIG. 2) are generic: the same settings can be used in different datasets and different applications (e.g., classification, regression). The hyperparameters are fixed or constant parameters. This makes the model fully non-parametric. The hyperparameters may be defined as follows:

α is a concentration of a Dirichlet distribution Dir().

ν_0 and S_0 are the degree of freedom and scale matrix of the Wishart distribution W(), respectively.

μ_0 and β_0 are two scalar parameters used in defining Eq. (3).

K and L are the number of clusters for users and items, respectively.

The remaining parameters are random parameters that may be learned during training.

Note that the side information is not needed in the model as shown in FIG. 2.

Referring now to an exemplary multi-task learning method for BMF-DPM-ARD, FIG. 2 and FIG. 4 show an exemplary extension of the BMF-DPM-ARD to multi-task learning. There are T different datasets 401, each representing a different task. Task t = 1 : T has its own set of users, items and ratings. All tasks share the same set of cluster parameters, the same inverse variance γ and the same weighting vector w. Although users and items can be quite different in different datasets, the interaction between user and item can be similar. Through the sharing of latent parameters, different tasks are linked and their training data are indirectly shared across tasks. This helps to address the data sparsity issue.

Previous multi-task work considers sharing the latent feature vectors u_i or v_j between different tasks. This is a direct way of sharing representation. However, to achieve this, the identity of a user or an item is assumed so that the correspondence of the same object can be found in different datasets. This implicitly uses the side information and can become restrictive when such information is not disclosed. Furthermore, it may not be possible to find such correspondences if items are of different types, e.g., movies in one task and books in another task.

According to an embodiment of the present disclosure, the model is generic, only sharing cluster-related parameters among different tasks. Turning now to an exemplary variational Bayesian inference method for learning a model such as that shown in FIG. 2, a parameter set Φ 402 includes the cluster parameters {π_u, π_v, μ_uk, μ_vl, Λ_uk, Λ_vl | k = 1 : K, l = 1 : L}, the per-task feature vectors and cluster labels {u_i^{(t)}, v_j^{(t)}, z_ui^{(t)}, z_vj^{(t)} | t = 1 : T}, the dimension weighting parameters {w_d, λ_d | d = 1 : D} and the rating noise parameter γ.

Let Θ = {α, μ_0, Λ_0, ν_0, S_0, a_γ, b_γ, a_λ, b_λ} denote the hyperparameter set. Given T training rating matrices X^{(1:T)}, the task of learning is to determine the posterior distribution P(Φ | X^{(1:T)}, Θ) for the parameter set Φ 403.

Because this inference is not analytically tractable, a variational Bayesian method may be used. The posterior distribution P(Φ | X^{(1:T)}, Θ) may be approximated by a factorization distribution 404

Q(Φ) = ∏_{k=1}^{K} Q(μ_uk) Q(Λ_uk) ∏_{l=1}^{L} Q(μ_vl) Q(Λ_vl) Q(π_u) Q(π_v) ∏_{t=1}^{T} [ ∏_{i=1}^{N^{(t)}} Q(u_i^{(t)}) Q(z_ui^{(t)}) ∏_{j=1}^{M^{(t)}} Q(v_j^{(t)}) Q(z_vj^{(t)}) ] ∏_{d=1}^{D} Q(w_d) Q(λ_d) Q(γ).   (13)

Each parameter has the same type of posterior distribution Q(.) as its conjugate prior. To determine the distribution for a parameter φ_i, the posterior mean of log P(X^{(1:T)}, Φ | Θ) is determined over all parameters except φ_i, i.e., ⟨log P(X^{(1:T)}, Φ | Θ)⟩_{Φ\φ_i}. Here, the variational Bayesian method performs an integration over all other parameters Φ\φ_i.
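
For reference, the generic mean-field update implied by the preceding paragraph can be written as follows (a standard variational Bayesian identity, not reproduced from the disclosure's own equations):

```latex
% Each factor is the exponentiated posterior mean of the log joint,
% taken over all remaining parameters \Phi \setminus \phi_i:
Q(\phi_i) \;\propto\; \exp\!\Big( \big\langle \log P\!\big(X^{(1:T)}, \Phi \mid \Theta\big) \big\rangle_{Q(\Phi \setminus \phi_i)} \Big)
```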

In every iteration, variational inference for the cluster parameters and the cluster indicators z_u^{(t)} and z_v^{(t)} has a time complexity of O(T(NK + ML)D^2). The feature vectors u_i^{(t)} and v_j^{(t)} need O(T(N M̄ + M N̄)D^2). Inferring γ and w takes O(T X̄ D^2).

N and M stand for the maximum values of N^{(t)} and M^{(t)}, respectively. N̄ and M̄ denote the maximum number of observed ratings per item and per user, respectively. Similarly, X̄ denotes the maximum number of observed ratings in the rating matrices X^{(1:T)}. It may be assumed that N̄, M̄ > K, L, D.

If the rating matrices are fully observed, then N̄ = N, M̄ = M and X̄ = NM. In this worst case, the total time complexity for all parameters will be O(T N M D^2). In terms of storage, the cluster parameters require O((K + L)D^2), the cluster indicators need O(T(NK + ML)) and the feature vectors require O(T(N + M)D). After training, the feature vectors of a trained model may be saved 405. Learned parameters may also be saved.

The approximation may be initialized with K = L = 10 and D = 30.

Hyperparameters may be set as follows: α = 10.0, μ_0 = 0, Λ_0 = 0.01 I, ν_0 = D + 0.1 = 30.1, S_0 = 100 I, a_γ = a_λ = 0.01, b_γ = b_λ = 0.0001, where I is the identity matrix. The posterior mean may be set as w_d = 1, starting with all D dimensions equally relevant. The posterior means of the cluster centers and latent feature vectors are randomly generated. Note that K = L = 10 represents the maximum number of clusters. The actual number may be smaller. During iteration, if a cluster k does not have a single feature vector supporting it, this cluster and its associated parameters μ_uk and Λ_uk may be removed. When the posterior mean of a dimension weight w_d approaches zero, the corresponding dimension is also removed.
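
By way of illustration only, the following sketch collects the initialization just described; the dictionary keys and variable names are illustrative, not the disclosure's actual implementation (which is in C++).

```python
import numpy as np

rng = np.random.default_rng(3)
K, L, D = 10, 10, 30                          # initial numbers of clusters and feature dimension

# Generic hyperparameter settings from the description (I is the identity matrix)
hyper = {
    "alpha": 10.0,
    "mu0": np.zeros(D),
    "Lambda0": 0.01 * np.eye(D),
    "nu0": D + 0.1,                            # 30.1 for D = 30
    "S0": 100.0 * np.eye(D),
    "a_gamma": 0.01, "b_gamma": 0.0001,
    "a_lambda": 0.01, "b_lambda": 0.0001,
}

w_mean = np.ones(D)                            # all D dimensions start equally relevant
mu_u_mean = rng.normal(size=(K, D))            # posterior means of user cluster centers
mu_v_mean = rng.normal(size=(L, D))            # posterior means of item cluster centers
# latent feature vectors would likewise be initialized at random for each task
```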

It is possible to estimate all hyperparameters Θ via an expectation-maximization (EM) algorithm. The EM algorithm is an iterative method for finding maximum likelihood or maximum a posteriori (MAP) estimates of parameters in statistical models, where the model depends on unobserved latent variables. The EM iteration alternates between performing an expectation (E) step for creating a function for the expectation of the log-likelihood evaluated using the current estimate for the parameters, and a maximization (M) step for determining parameters maximizing the expected log-likelihood found in the E step. These parameter estimates may be used to determine the distribution of the latent variables in the next E step. Here, the variational Bayesian method may be plugged into the E-step; in the M-step, Θ is estimated. However, since the hyperparameters are not problem-specific, the EM algorithm is not used and is not described further.

The following is a description of an exemplary implementation according to an embodiment of the present disclosure, implemented in C++ and run on a personal computer (PC) with an Intel 2.33 GHz Quad-Core CPU and 4 GB RAM. All tests are run in a single thread (thus only one core is actually used).

Referring to the evaluation criterion, we first randomly partition all users into two disjoint sets. In other words, the original dataset (an N x M rating matrix) is randomly split into a weak generalization dataset, an N_w x M rating matrix, and a strong generalization dataset, an N_s x M rating matrix, where N = N_w + N_s. The ratings for each user are also randomly split into a training set and a test set. Using a leave-one-out strategy, only one rating per user is held out for testing and all the remaining ratings are used for training. We extend this to a percentage-based testing strategy: for a specified fraction f ∈ [0, 1], f of the ratings are used for training and the remaining 1 − f of the ratings are used for testing. The original leave-one-out strategy is treated the same as when f = 1.00.
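
By way of illustration only, the following sketch implements a simplified version of the splitting protocol just described; the function name and the toy data are illustrative assumptions.

```python
import numpy as np

def split_ratings(R, n_weak, f, rng):
    """Split users into weak/strong generalization sets, then split each user's
    observed ratings into a training fraction f and a test fraction 1 - f.
    R is an N x M matrix with 0 marking a missing rating; boolean masks are returned."""
    N, _ = R.shape
    order = rng.permutation(N)
    weak_users = np.zeros(N, dtype=bool)
    weak_users[order[:n_weak]] = True            # remaining users form the strong set

    train = np.zeros(R.shape, dtype=bool)
    for i in range(N):
        observed = np.flatnonzero(R[i] > 0)
        rng.shuffle(observed)
        if f >= 1.0:                             # leave-one-out variant: hold out one rating
            n_train = max(len(observed) - 1, 1)
        else:
            n_train = max(1, int(round(f * len(observed))))
        train[i, observed[:n_train]] = True
    test = (R > 0) & ~train
    return weak_users, train, test

rng = np.random.default_rng(4)
R = np.where(rng.random((20, 15)) < 0.3, rng.integers(1, 6, (20, 15)), 0)  # toy sparse ratings
weak_users, train, test = split_ratings(R, n_weak=15, f=0.25, rng=rng)
```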

A system is first trained on the weak generalization dataset with training ratings and tested on the weak generalization dataset with test ratings. This is referred to as a weak generalization test. After this, the trained system is given the strong generalization set with training ratings, based on which the algorithm updates itself. Then the updated system is tested on the strong generalization set with test ratings. This is called the strong generalization test.

The purpose of the strong generalization is to check whether a method implemented by the system can be easily generalized to unseen users. Therefore, the system was not retrained using both the weak and strong generalization training datasets, but updated only using the strong generalization training set. Accordingly, during the evaluation all item side parameters and item feature vectors are fixed and the user side parameters and user feature vectors are updated.

We use root mean squared error (RMSE) to evaluate different methods. Normalized mean absolute error (NMAE) is also considered. The NMAE is determined by normalizing the mean absolute error (MAE) by a factor that depends on the range of rating. The factor is 1.6 for the rating set {1 : 5} and 1.944 for the rating set {1 : 6} .
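
By way of illustration only, the following sketch computes RMSE and the NMAE normalization just described; function names are illustrative.

```python
import numpy as np

def nmae(predicted, actual, max_rating):
    """Normalized mean absolute error: MAE divided by a range-dependent factor,
    1.6 for ratings in {1..5} (MovieLens) and 1.944 for ratings in {1..6} (EachMovie)."""
    factor = {5: 1.6, 6: 1.944}[max_rating]
    err = np.abs(np.asarray(predicted, float) - np.asarray(actual, float))
    return float(err.mean() / factor)

def rmse(predicted, actual):
    diff = np.asarray(predicted, float) - np.asarray(actual, float)
    return float(np.sqrt(np.mean(diff ** 2)))

print(nmae([3.2, 4.1, 1.9], [3, 5, 2], max_rating=5))   # MovieLens-style 1-5 ratings
```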

The 1M MovieLens data include 1 million ratings for N = 6040 users and M = 3952 movies, with ratings ranging between 1 and 5. Thus, NMAE = MAE/1.6. The weak generalization user set size is N_w = 5000 and the remainder is N_s = 1040. The EachMovie data contain 2.6 million ratings for 74424 users and 1648 movies. The ratings range from 1 to 6. Users with fewer than 20 ratings were discarded. This left N = 36656 users, which are split by N_w = 30000 and N_s = 6656.

We refer to embodiments of the present disclosure as the multi-task learning (MTL) method, where the number of tasks is T = 2. The MTL method is trained on both the MovieLens and EachMovie data jointly. We also consider a single-task learning (STL) version of our method, where T = 1. The STL method is trained on each dataset independently.

In the evaluation, we vary the fraction of ratings used for training over f = 0.05, 0.10, 0.25, 0.50, 0.75 and 1.00. When f = 0.05, only 5% of the ratings of each user are used for training (the remaining 95% for testing). This simulates the situation when the dataset has just been created and the available ratings are very limited. When f reaches 1.00, the dataset is complete, which is the only case most prior methods have been tested on. For each f, each dataset is randomly split into weak and strong generalization sets and into training and testing ratings ten times. The final score is the average of these ten tests.

Tables I and II (FIG. 5) show the NMAE results for the STL method and the MTL method, respectively, on the MovieLens data and EachMovie data. In general, when more ratings are available, both methods achieve better results. When training data are quite limited (f = 0.05), the MTL approach clearly performs better than the STL approach (e.g., 0.4683 vs. 0.4777 in the weak generalization test on the MovieLens data). In such cases, the STL method is likely to overfit the data. In contrast, the MTL method combines the training data from related datasets by sharing parameters and thus produces better results. As the percentage of training ratings increases to f = 1.00, the MTL method still outperforms the STL method, but the improvement becomes less appreciable. This is because when adequate training data are present, the overfitting problem of the STL method is mitigated. The improvements of MTL over STL are statistically significant, under a one-tailed paired t-test with confidence level at 0.05, for all tests with the exception of the weak generalization test on the EachMovie data with f = 1.00.
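
By way of illustration only, the following sketch shows the mechanics of such a one-tailed paired t-test over matched random splits; the per-split NMAE scores below are made-up numbers for demonstration, not the disclosure's results.

```python
import numpy as np
from scipy import stats

# Illustrative per-split NMAE scores for STL and MTL over ten matched random splits
# (made-up numbers for demonstration only; lower is better).
stl = np.array([0.478, 0.481, 0.476, 0.480, 0.479, 0.477, 0.482, 0.478, 0.480, 0.479])
mtl = np.array([0.469, 0.470, 0.467, 0.471, 0.468, 0.470, 0.472, 0.468, 0.469, 0.470])

t_stat, p_two_sided = stats.ttest_rel(mtl, stl)          # paired t-test over matched splits
p_one_tailed = p_two_sided / 2 if t_stat < 0 else 1 - p_two_sided / 2
print("MTL better than STL at the 0.05 level:", p_one_tailed < 0.05)
```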

Prior methods are limited to the f = 1.00 case (the last column in Tables I and II).

It should be understood that other methods can be extended to a multi-task learning approach according to the teachings of the present disclosure. For example, for the approaches based on probabilistic principal component analysis such as NPCA and GP-LVM, a Gaussian distribution is assumed either on the row (user) or on the column (item). Without knowing the correspondences of all users or items across different datasets, such a Gaussian distribution cannot be shared between different tasks. In addition, unlike the MTL method, these algorithms are one-sided (users and items are treated differently) and thus can only be used in strong generalization tests for either new users or new items, but not both.

We initialize the method under evaluation with the feature dimension D = 30. With automatic relevance determination, the dimension is reduced automatically. More training data support more dimensions (parameters). When f > 0.75 (approximately), all D = 30 dimensions become relevant and cannot be reduced. We also conduct the following test for f = 0.05 without dimension reduction: all w_d are fixed at 1, but D is varied from 30 to 60.

The results are similar to the dimension reduction results. This suggests that our dimension reduction does not affect performance, while greatly speeding up each iteration by removing irrelevant dimensions.

The number of iterations needed by our methods ranges between 30 and 50. When f = 0.05, the average training time is 0.026, 0.120 and 0.149 hours with dimension reduction (with an average D = 5) for STL on the MovieLens data, STL on the EachMovie data and MTL on both, respectively. Without dimension reduction, the time increases to 0.048, 0.179 and 0.266 hours, respectively. When f = 1.00, these three numbers become 0.166, 0.542 and 1.015 hours, respectively. The 0.542 hours on the EachMovie data is better than the 0.69 hours required by NPCA, which is, to the best of our knowledge, the fastest top algorithm on this dataset under the same setting.

The number of clusters for both users and items is initialized at K = L = 10. After convergence, the actual number of clusters for users ranges between 3 and 5, and for items between 4 and 7. We observe that all user clusters are shared by both the MovieLens and EachMovie datasets. This implies that users in these two datasets have similar distributions. Some clusters exclusively belong to EachMovie; this may be due to some special movie categories contained in the EachMovie dataset.

An MTL method according to an embodiment of the present disclosure may be adapted into a transfer learning method. For example, an STL method is first trained using only the source dataset. The same STL algorithm may then be re-trained on the target dataset with all parameters obtained in the first step, except the object parameters, held fixed.

Alternatively, instead of fixing parameters in the second step, the STL may be initialized with the learned parameters. The second step will then be a regular STL.

It is to be understood that embodiments of the present disclosure may be implemented in various forms of hardware, software, firmware, special purpose processors, or a combination thereof. In one embodiment, a software application program is tangibly embodied on a non-transitory computer-readable storage medium, such as a program storage device or computer-readable storage medium, with an executable program stored thereon. The application program may be uploaded to, and executed by, a machine comprising any suitable architecture.

Referring to FIG. 6, according to an embodiment of the present disclosure, a computer system (block 601) for multi-task learning for Bayesian matrix factorization includes, inter alia, a CPU (block 602), a memory (block 603) and an input/output (I/O) interface (block 604). The computer system (block 601) is generally coupled through the I/O interface (block 604) to a display (block 605) and various input devices (block 606) such as a mouse, keyboard, medical scanners, power equipment, etc. The display (block 605) may be implemented to display predicted ratings. The support circuits can include circuits such as cache, power supplies, clock circuits, and a communications bus. The memory (block 603) can include random access memory (RAM), read only memory (ROM), disk drive, tape drive, etc., or a combination thereof. The present invention can be implemented as a module (block 607) of the CPU or a routine stored in memory (block 603) and executed by the CPU (block 602) to process input data (block 608), e.g., including the training datasets. For example, the data may include image information from a camera, which may be stored to memory (block 603). As such, the computer system (block 601) is a general purpose computer system that becomes a specific purpose computer system when executing the routine of the present disclosure.

The computer platform (block 601) also includes an operating system and micro instruction code. The various processes and functions described herein may either be part of the micro instruction code or part of the application program (or a combination thereof) which is executed via the operating system. In addition, various other peripheral devices may be connected to the computer platform such as an additional data storage device and a printing device.

It is to be further understood that, because some of the constituent system components and method steps depicted in the accompanying figures may be implemented in software, the actual connections between the system components (or the process steps) may differ depending upon the manner in which the system is programmed. Given the teachings of the present disclosure provided herein, one of ordinary skill in the related art will be able to contemplate these and similar implementations or configurations of the present disclosure. Having described embodiments for multi-task learning for Bayesian matrix factorization, it is noted that modifications and variations can be made by persons skilled in the art in light of the above teachings. It is therefore to be understood that changes may be made in embodiments of the present disclosure that are within the scope and spirit thereof.