Constrained Probabilistic Matrix Factorization with Neural Network for Recommendation System

Abstract. In order to alleviate the problem of rating sparsity in recommendation systems, this paper proposes a model called Constrained Probabilistic Matrix Factorization with Neural Network (CPMF-NN). In user modeling, it takes the influence of the items users have interacted with into consideration. In item modeling, it utilizes a convolutional neural network to extract item latent features from the corresponding documents. In the fusion of latent feature vectors, a multi-layer perceptron is used to grasp the nonlinear structural characteristics of user-item interactions. Extensive experiments on three real-world datasets show that CPMF-NN achieves good performance on datasets of different sparsity.


Introduction
Recommendation is one of the effective methods to solve the problem of information overload and realize personalized information services. Collaborative Filtering (CF) is a commonly used technology for recommendation. However, with the increasing number of users and items, the user-item ratings used in collaborative filtering are becoming more and more sparse, which hinders the application of CF [1].
In recent studies, researchers usually try to alleviate the problem of rating sparsity from the view of user and item latent feature modeling. Salakhutdinov et al. [2] proposed a model called Constrained PMF (CPMF) on the basis of Probabilistic Matrix Factorization (PMF), which integrates the items that users have rated into user latent feature modeling in order to obtain more accurate user latent feature vectors, and thus better recommendation results on sparse datasets. Wang et al. [3] combined collaborative filtering with a probabilistic topic model and proposed Collaborative Topic Regression (CTR), which extracts item latent features from item documents by Latent Dirichlet Allocation (LDA). Wang et al. [4] argue that CTR cannot extract item latent features effectively, and proposed Collaborative Deep Learning (CDL) by combining a Bayesian stacked denoising autoencoder (Bayesian SDAE) with PMF. In the view of Kim et al. [5], CTR and CDL cannot fully capture document information because they assume a bag-of-words model that ignores the contextual information of documents, so they proposed Convolutional Matrix Factorization (ConvMF), which integrates a Convolutional Neural Network (CNN) into PMF. ConvMF leverages CNN to capture the contextual information of documents, so as to obtain more accurate representations of item latent features and more accurate predicted ratings. These studies show that integrating item documents into item modeling can improve the recommendation effect.
In the above studies, although CPMF took the items that users have rated into user modeling, it still placed spherical Gaussian priors with the same parameters on all item latent feature vectors as PMF does, so its item modeling can be further improved. On the other hand, CTR, CDL and ConvMF advanced item modeling by extracting item latent features from item documents, but they placed spherical Gaussian priors with the same parameters on all user latent feature vectors as PMF does. Salakhutdinov et al. [2] point out that under such spherical Gaussian priors, once the model has been fitted, users with few ratings will have feature vectors close to the prior mean, i.e., the average user, so the predicted ratings for those users will be close to the item average ratings. As a result, this still leads to inaccurate predicted ratings for some users on sparse datasets.
In view of the fact that the above approaches cannot improve user and item modeling at the same time, this paper proposes a model called Constrained Probabilistic Matrix Factorization with Neural Network (CPMF-NN), which achieves enhancements in the following three aspects. In user latent feature modeling, CPMF-NN takes the items that users have rated into account, so that users with different rated items have Gaussian priors with different parameters. In item latent feature modeling, a CNN is used to extract item latent features from item documents. In the fusion of user and item latent features, different from the linear fusion of traditional matrix factorization, CPMF-NN takes advantage of a Multi-Layer Perceptron (MLP) to realize a nonlinear fusion method that ultimately improves the accuracy of the predicted ratings.
The rest of this paper is organized as follows. Section 2 introduces the framework, the optimization methodology and the parameter updating method. Section 3 introduces the datasets and experiments. Finally, Section 4 summarizes the work of this paper and looks to future work.

Constrained probabilistic matrix factorization with neural network
Like PMF, CPMF-NN obtains user and item latent feature vectors by factorizing the user-item rating matrix, and it further decomposes the user and item latent feature vectors. The framework of CPMF-NN is shown in Fig 1. It consists of three parts, briefly described as follows. The first part, which is similar to PMF and is the basis of CPMF-NN, is shown in part (a) of Fig 1. Suppose there are n users and m items. Let R ∈ R^{n×m} denote the user-item rating matrix, where the integer rating R_ij ∈ {1, 2, 3, 4, 5} is the rating of user i for item j. The purpose of CPMF-NN is to factorize the rating matrix into a user latent feature matrix U ∈ R^{d×n} and an item latent feature matrix V ∈ R^{d×m} such that R̂_ij = f(U_i, V_j) ≈ R_ij, where d denotes the dimension of the latent feature vectors, R̂_ij denotes the predicted rating of user i for item j and f(·) denotes the fusion function. Similar to the idea of PMF, CPMF-NN decomposes the user and item latent feature vectors as well; but different from PMF, CPMF-NN takes a nonlinear fusion function rather than a linear one.
The second part is shown in part (b) of Fig 1. CPMF-NN decomposes each user latent feature vector U_i into a sum of two terms: an offset term X_i [2, 6] and a preference term P_i, where P_i is the mean of the constrained vectors of the items that user i has rated. Let I ∈ R^{n×m} denote the indicator matrix whose element I_ih equals 1 if user i rated item h and 0 otherwise.
The third part is shown in part (c) of Fig 1. CPMF-NN decomposes each item latent feature vector V_j into a sum of two terms. The first is the item latent feature term F_j extracted from the corresponding item document via CNN. The second is Gaussian noise, which enables us to further optimize the item latent feature vector for predicting ratings. The conditional distribution over the observed ratings is defined as Eq. (1): p(R | U, V, σ²) = Π_{i=1}^{n} Π_{j=1}^{m} [N(R_ij | f(U_i, V_j), σ²)]^{I_ij}, where N(x | μ, σ²) denotes the Gaussian distribution with mean μ and variance σ².

User latent feature modeling
In user latent feature modeling, the models in [3][4][5] all placed a zero-mean spherical Gaussian prior on the user latent feature vectors. However, such an assumption often leads to inaccurate predicted ratings for some users due to rating sparsity. The items that a user has rated usually reflect the user's preferences. In order to get more accurate user latent features, we define U_i as the sum of two terms: 1) the offset term X_i, which is the basic representation of user i; 2) the preference term P_i, which is another part of the representation of user i, constructed from all the items the user has rated. Besides the latent feature vector, CPMF-NN gives each item another representation, called the item constrained vector C_h, which is used to construct the preference term. Specifically, the preference term of user i is defined as the mean of the constrained vectors of the items that user i has rated, so the user latent feature vector is U_i = X_i + P_i with P_i = (Σ_h I_ih C_h) / (Σ_h I_ih). As CPMF does, we place spherical Gaussian priors on the offset terms and the item constrained vectors: X_i ~ N(0, σ_X² I_d) and C_h ~ N(0, σ_C² I_d).
Substituting Eq. (3) and Eq. (4) into Eq. (2), for each user we can draw a user latent feature vector. The variance of U_i gets closer to the variance of X_i as user i rates more items; that is to say, the influence of the item constrained vectors becomes smaller and is eventually eliminated. On the contrary, the influence is strong for users who have rated only a few items.
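As a sketch of this construction, the user latent feature vectors can be assembled from the offset matrix, the item constrained vectors and the indicator matrix. The variable names and shapes below are our own illustration, not the paper's code:

```python
import numpy as np

def user_latent_features(X, C, I):
    """U_i = X_i + P_i, where P_i is the mean of the constrained
    vectors C_h of the items user i has rated.
    X: d x n offset matrix, C: d x m constrained vectors,
    I: n x m 0/1 indicator matrix."""
    counts = I.sum(axis=1)                       # ratings per user
    counts = np.where(counts == 0, 1.0, counts)  # guard users with no ratings
    P = (C @ I.T) / counts                       # d x n preference terms
    return X + P
```

Note how a user who rated items 1 and 2 gets the average of those two constrained vectors added to their offset, matching the averaging in P_i.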

Item latent feature modeling
In item latent feature modeling, considering the sparsity problem, we leverage item documents to obtain item latent feature vectors. Similar to user latent feature modeling, we decompose each item latent feature vector into a sum of two parts: 1) the item latent feature term F_j, which is extracted from the corresponding item document Y_j by CNN; 2) Gaussian noise O_j, for a more accurate representation of the item latent feature. With these two parts, the item latent feature vector is V_j = F_j + O_j, where F_j = CNN(W, Y_j). We also place spherical Gaussian priors on the weights of the CNN and a Gaussian prior on the noise: W_k ~ N(0, σ_W²) and O_j ~ N(0, σ_V² I_d). Accordingly, the conditional distribution over item latent feature vectors is given by p(V | W, Y, σ_V²) = Π_j N(V_j | CNN(W, Y_j), σ_V² I_d). We use the CNN architecture Kim [7] proposed to analyse item documents. Specifically, for each item we take its document Y_j = [y_1, y_2, . . . , y_t] as the input of the CNN, where t denotes the length of the document and y_i is the word embedding vector of the i-th word. Then, with a shared weight W_c ∈ R^{|y|×x} whose window size is x, each convolution filter c generates a feature map e^c = [e^c_1, e^c_2, . . . , e^c_{t−x+1}]. In the pooling layer, we use max-pooling to get the document feature representation e = [max(e^1), max(e^2), . . . , max(e^{n_c})], where n_c is the number of filters. Finally, projecting e through nonlinear activations gives the feature representation of each document: F_j = tanh(W_2 (tanh(W_1 e + b_1)) + b_2) = CNN(W, Y_j).
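A minimal numpy sketch of this convolution, max-pooling and projection pipeline. The filter-bank layout, layer sizes and use of tanh inside the convolution are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def cnn_document_feature(Y, We, W1, b1, W2, b2):
    """Y:  t x p document matrix (t word embeddings of size p).
    We: n_c x (x*p) filter bank, i.e. n_c filters of window size x.
    Returns the document feature F_j."""
    t, p = Y.shape
    x = We.shape[1] // p                    # recover the window size
    # convolution: apply every filter to each window of x consecutive words
    windows = np.stack([Y[i:i + x].ravel() for i in range(t - x + 1)])
    conv = np.tanh(windows @ We.T)          # (t-x+1) x n_c feature maps
    e = conv.max(axis=0)                    # max-pooling over positions
    # two tanh projection layers give the final item feature F_j
    return np.tanh(W2 @ np.tanh(W1 @ e + b1) + b2)
```

Because of the final tanh, every component of F_j lies in (-1, 1), which keeps the item features on a scale comparable to the latent vectors.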

Fusion of latent features
In order to get the predicted ratings, we define a fusion function f(·) to fuse the user and item latent feature vectors. The framework of the fusion is shown in Fig 2. The process of fusion has the form R̂_ij = f(U_i, V_j), with the user and item latent feature vectors as input and the rating as output. Different from traditional linear fusion methods such as the inner product, CPMF-NN realizes a nonlinear fusion method based on MLP: R̂_ij = MLP(U_i* ⊙ V_j*), where ⊙ denotes the element-wise product. First, the user latent feature vector U_i and the item latent feature vector V_j are taken as inputs of two MLPs respectively. In particular, each hidden layer can be formulated as L_k = a_k(W_k L_{k−1} + b_k).
where x denotes the input (U_i or V_j) of the MLP; L_k, a_k, W_k and b_k respectively denote the output, activation function, weight and bias of hidden layer k (k = 1, 2, 3); and x* (U_i* or V_j*) denotes the output of the MLP. Then x = U_i* ⊙ V_j* is taken as the input of another MLP for predicting the rating. Finally, the output of the last layer is the predicted rating R̂_ij.
Compared to traditional linear fusion, the nonlinear fusion method proposed in this paper can capture the nonlinear features of the interactions between users and items and enhance the accuracy of the predicted ratings. We use the back propagation algorithm to optimize the weights and biases of the hidden layers, and take ReLU as the activation function since it is proved to be non-saturating [8]. In addition, ReLU encourages sparse activations, which is well suited for sparse data and makes the model less likely to overfit [9].
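The fusion pipeline above can be sketched as follows. The number and sizes of layers are illustrative (the paper uses three hidden layers per tower), and all names are our own:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def mlp(x, layers):
    """Apply a stack of (W, b) layers with ReLU activations."""
    for W, b in layers:
        x = relu(W @ x + b)
    return x

def fuse(Ui, Vj, tower_u, tower_v, head, w_out, b_out):
    """Nonlinear fusion: pass each latent vector through its own
    MLP tower, combine the outputs by element-wise product, then
    map the result to a single predicted rating."""
    Ui_star = mlp(Ui, tower_u)        # U_i -> U_i*
    Vj_star = mlp(Vj, tower_v)        # V_j -> V_j*
    h = mlp(Ui_star * Vj_star, head)  # element-wise product, then MLP
    return float(w_out @ h + b_out)   # scalar predicted rating
```

Replacing the inner product with this composition is what lets the model represent non-additive user-item interactions.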
Optimization methodology
To optimize parameters such as U, V and the weights of the CNN, maximum a posteriori (MAP) estimation is employed, since a full Bayesian treatment is intractable. Maximizing the posterior is equivalent to minimizing the negative log-likelihood E given below. Following Kim et al. [5], we adopt coordinate descent to optimize X_i, C_h and V_j, which optimizes one variable while fixing the remaining variables. As a result, the variables can be updated as in Eq. (10), where I_i is a diagonal matrix with I_ij as its diagonal elements, and I_d, I_j and I_h are defined in the same way as I_i.
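For intuition, the coordinate-descent update of one item vector takes a ridge-regression form in the linear-fusion special case; the following is a sketch under that simplification only (with the nonlinear MLP fusion there is no such closed form), with names and the regularization weight chosen for illustration:

```python
import numpy as np

def update_item_vector(U, R_j, I_j, F_j, lam_v):
    """One coordinate-descent update for item j under linear fusion.
    U:   d x n user latent feature matrix (held fixed).
    R_j: length-n column of ratings for item j.
    I_j: length-n 0/1 indicator of which users rated item j.
    F_j: CNN output for item j (the prior mean of V_j).
    Minimizing squared error plus the Gaussian-prior term is a
    ridge-regression solve centred on F_j."""
    d = U.shape[0]
    A = (U * I_j) @ U.T + lam_v * np.eye(d)  # sum of I_ij * U_i U_i^T
    b = U @ (I_j * R_j) + lam_v * F_j
    return np.linalg.solve(A, b)
```

When an item has no ratings at all, the data term vanishes and the update returns F_j itself, i.e. the item falls back to its document-based representation.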
As for the weights W of the CNN, we use the back propagation algorithm for optimization, since E can be seen as a squared error function with L2 regularization terms when the other variables are temporarily held constant.

Experimental environment and datasets
The experiments are implemented on an E5-2620 CPU workstation and a Tesla P100-PCIE GPU workstation. The development environment consists of Python 2.7, Tensorflow 1.3.0 and Keras 2.0.5, and the development tool is PyCharm.
We experimented with three publicly accessible datasets: Movielens 1m (ML-1m), Movielens 100k (ML-100k) and Amazon Instant Video (AIV). The user-item ratings in each dataset range from 1 to 5. ML-1m and ML-100k are movie rating datasets widely used in recommendation; we obtained the plot summary from IMDB as the document of each movie, and removed the movies whose plot summaries are absent on IMDB. AIV is an instant video rating dataset with reviews on each video. Because of the large scale of AIV, we removed the videos with fewer than 5 ratings and those with reviews of more than 10000 words. We randomly split each dataset into a training set (80%), a validation set (10%) and a test set (10%). The statistics of each dataset are shown in Table 1.
All the item documents are preprocessed as follows: 1) set the maximum length of documents to 300; 2) remove stop words; 3) calculate the tf-idf value of each word; 4) remove corpus-specific stop words with document frequency higher than 0.5; 5) select the top 8000 words as the vocabulary; 6) remove all non-vocabulary words from the documents.
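Steps 1-6 can be sketched as follows, assuming the documents are already tokenized and a stop-word list is given (the tf-idf scoring here is a plain term-frequency times inverse-document-frequency, which is one common variant):

```python
import math
from collections import Counter

def preprocess(docs, stop_words, max_len=300, max_df=0.5, vocab_size=8000):
    """Apply the six preprocessing steps to tokenized documents."""
    # steps 1-2: truncate to max_len tokens and drop stop words
    docs = [[w for w in doc[:max_len] if w not in stop_words] for doc in docs]
    n = len(docs)
    df = Counter(w for doc in docs for w in set(doc))  # document frequency
    tf = Counter(w for doc in docs for w in doc)       # term frequency
    # steps 3-4: tf-idf score, dropping corpus-specific stop words
    scores = {w: tf[w] * math.log(n / df[w])
              for w in df if df[w] / n <= max_df}
    # step 5: keep the top-scoring words as the vocabulary
    vocab = set(sorted(scores, key=scores.get, reverse=True)[:vocab_size])
    # step 6: remove non-vocabulary words from every document
    return [[w for w in doc if w in vocab] for doc in docs], vocab
```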

Baselines and parameter settings
We compared CPMF-NN with the following two baselines.
• PMF [2]: Probabilistic Matrix Factorization is a classical collaborative filtering method. It is the basis of ConvMF proposed by Kim et al. [5] and of the CPMF-NN model proposed in this paper.
• ConvMF [5]: ConvMF extracts item latent features from item documents by CNN and integrates the CNN into the PMF model. Compared to ConvMF, CPMF-NN involves users' interacted items in user modeling and fuses the user and item latent feature vectors in a nonlinear way.
We set the dimension of the latent feature vectors to 64 in the experiments. Table 2 shows the other parameter settings, which are chosen according to experience.

Evaluation protocols
We adopt root mean squared error (RMSE) and Recall as the evaluation protocols for each model on the three real-world datasets. RMSE is a popular metric that measures the error between the real ratings and the predicted ratings, and is defined as RMSE = sqrt((1/n) Σ_{(i,j)} (R_ij − R̂_ij)²),
where n is the number of user-item ratings in the test dataset.
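A minimal implementation of this metric:

```python
import numpy as np

def rmse(r_true, r_pred):
    """Root mean squared error over the n ratings in the test set."""
    r_true = np.asarray(r_true, dtype=float)
    r_pred = np.asarray(r_pred, dtype=float)
    return float(np.sqrt(np.mean((r_true - r_pred) ** 2)))
```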
Recall is a measure of classification accuracy, which indicates the ability of the model to predict the particular items a user likes. It is defined as Recall = (1/N) Σ_{i=1}^{N} |Z_i ∩ T_i| / |T_i|,
where N denotes the number of users in the test set, Z_i denotes the set of items recommended to user i, and T_i denotes the set of user i's actual items in the test set.

Fig 3 and Fig 4 show the performance of each model on the three real-world datasets. As shown in Fig 3, the trend of RMSE across the different models is consistent on all datasets. The classical CF method PMF is greatly influenced by dataset sparsity: on the Amazon Instant Video dataset, which is much sparser than the other two, the RMSE of PMF rises obviously compared to ConvMF and CPMF-NN. This indicates that using CNN to extract item latent features from documents can effectively alleviate the sparsity problem. On the three datasets, the improvements of CPMF-NN over the best competitor are 5.5470%, 7.4747% and 9.7166%, which shows that taking users' interacted items and a nonlinear fusion method into consideration helps to alleviate sparsity.

Fig 4 gives the overall Recall performance of each model on the three datasets. On the ML-100k dataset, when Top-n = 2 and Top-n = 4, the Recall of CPMF-NN is slightly lower than that of ConvMF, but it should be pointed out that when Top-n ≥ 6 the Recall of CPMF-NN is higher than that of ConvMF. On the ML-1m dataset, the results of CPMF-NN show a slight improvement over PMF and ConvMF. On the AIV dataset, the improvement of CPMF-NN over the baselines is obvious. The results on three datasets of different sparsity prove that the sparsity problem can be effectively alleviated by improving user modeling, item modeling and the fusion method.

Conclusion

CPMF-NN, proposed in this paper, is committed to alleviating the sparsity problem by combining the traditional PMF model with deep learning.
In user modeling, it considers the influence of the items that users have rated, and realizes them by adding item constrained vectors to user latent feature vectors. In item modeling, CPMF-NN extracts item latent features from the item documents by CNN. In the last, it fuses user and item latent feature vectors to get predicted ratings with the structure of MLP.

With the development of the internet, it is becoming easier and easier to access multimodal data such as context, reviews and images about users and items. How to effectively take advantage of such multimodal data is the direction of our future work.