A Content-Based Deep Hybrid Approach with Segmented Max-Pooling

Abstract. Convolutional matrix factorization (ConvMF), which integrates a convolutional neural network (CNN) into probabilistic matrix factorization (PMF), has recently been proposed to exploit contextual information and achieve higher rating prediction accuracy in model-based collaborative filtering (CF) recommender systems. However, ConvMF uses max-pooling, which may lose the location and frequency information of features. To solve this problem, this paper proposes a novel approach with segmented max-pooling (ConvMF-S). ConvMF-S can extract multiple features and keep their location and frequency information. Experiments show that the rating prediction accuracy is improved.


Introduction
Recommender systems have drawn increasing attention in the last decade. They help people find useful information in "the ocean of information" and appear in many areas of daily life. For example, Alibaba and Amazon use recommender systems to recommend products to users on their e-commerce platforms, and Facebook and Tencent Weibo apply recommender systems in their social networks.
Collaborative filtering (CF) is one of the main methods for building recommender systems [1]. Recently, there have been more and more efforts to combine CF with deep learning in recommender systems [2][3][4][5][6][7][8]. Due to the explosive growth in the number of users and items, the relationships between users and items can be extremely sparse, which degrades the prediction accuracy of CF recommender systems. To alleviate this problem, auxiliary information such as description documents of items, which is easily available from various sources, has been used to enhance rating prediction accuracy. In particular, a convolutional neural network (CNN) has been integrated into probabilistic matrix factorization (PMF) to develop the convolutional matrix factorization model (ConvMF).
Convolution and pooling are among the most important stages in a CNN, and max-pooling is the most common sub-sampling operation of the pooling layer. It keeps only the maximum feature from each feature vector obtained from the convolution layer, which has the following disadvantages: (1) The location information of the features is totally lost, even though the convolution layer preserves it. (2) Certain features may appear frequently, and the more frequently a feature appears, the stronger it is. Max-pooling also loses this frequency information.
To address these problems, we propose a new approach with segmented max-pooling, called ConvMF-S, which improves ConvMF.
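As a small illustration of the two disadvantages above, the following sketch applies global max-pooling to two feature vectors. The vectors and their values are hypothetical and chosen only to make the effect visible; they are not taken from the experiments.

```python
import numpy as np

# Hypothetical contextual feature vectors produced by one filter on two documents.
early_once = np.array([0.9, 0.1, 0.0, 0.2, 0.1, 0.0])  # strong feature once, near the start
late_twice = np.array([0.0, 0.1, 0.2, 0.9, 0.1, 0.9])  # same strength, twice, near the end

# Global max-pooling keeps a single value per vector.
pooled_early = early_once.max()
pooled_late = late_twice.max()
print(pooled_early, pooled_late)  # 0.9 0.9

# Information that the pooled values no longer carry:
print(int(early_once.argmax()), int(late_twice.argmax()))  # 0 3 (location)
print(int((early_once == 0.9).sum()), int((late_twice == 0.9).sum()))  # 1 2 (frequency)
```

Both vectors pool to the same value, although the strong feature occurs at different positions and a different number of times.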

Related Work
The great success of convolutional neural networks in computer vision has inspired recent efforts to apply deep learning methods to NLP. Since 2014, significant work in this field has been published. Kalchbrenner [9] proposed a CNN model for sentence modeling that uses dynamic k-max pooling as a global pooling operation over linear sequences. He also proposed an extended CNN for processing sequences [10]. The resulting network has two core properties: it runs in time that is linear in the length of the sequences, and it sidesteps the need for excessive memorization, which addresses the problem that the pooling layer may lose information (whether that information is useful or not). Chen [11] proposed a CNN model for event extraction that uses a dynamic multi-pooling layer according to event triggers and arguments to preserve more crucial information. Lei [12] proposed a non-linear discontinuous CNN for text modeling, which nonlinearly transforms the convolutional layer. The multi-column CNN model introduced by Dong [13] uses multiple columns of CNNs to learn representations of different aspects of questions. Ma [14] exploits various long-distance relationships between words and presents a dependency-based convolution framework. Johnson [15] studies CNNs for text categorization, applying a CNN directly to high-dimensional text data, which leads to directly learning embeddings of small text regions for use in classification.
More recently, CNNs have also been applied in recommender systems. Several hybrid methods have been proposed that utilize auxiliary information, in particular the reviews and abstracts of items. Kim [16] presented ConvMF, a robust document context-aware hybrid method that seamlessly integrates a CNN into probabilistic matrix factorization (PMF) in order to capture contextual information in description documents for rating prediction, while considering Gaussian noise differently by using the statistics of items. However, its max-pooling layer extracts only the maximum contextual feature from each contextual feature vector, so information about feature strength is lost. The location at which a feature appears is also important, and it is likewise ignored in ConvMF. To address these limitations of ConvMF, we propose an approach with segmented max-pooling, which keeps multiple features when pooling and reflects the location information of features.

Convolutional Matrix Factorization
In essence, a CNN is a classifier, because its objective is to solve classification tasks such as image recognition and label prediction for words, phrases or documents.
In contrast, the objective of a recommender system is a regression task, so a traditional CNN is not directly suitable for recommendation.
Convolutional matrix factorization addresses this issue by seamlessly integrating a CNN into PMF. The probabilistic model of ConvMF is shown in figure 1.
Figure 2 illustrates the CNN architecture for ConvMF, which is composed of four layers: an embedding layer, a convolution layer, a pooling layer and an output layer.
(1) Embedding layer. The objective of the embedding layer is to transform a raw document into a dense numeric matrix for the convolution layer. The document matrix $D \in \mathbb{R}^{p \times l}$ can be represented by: $D = [\,\cdots\; w_{i-1}\;\; w_i\;\; w_{i+1}\; \cdots\,]$, where $l$ is the length of the document and $p$ is the embedding dimension of each word $w_i$.
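The embedding step can be sketched as a table lookup followed by a transpose, so that each column of the document matrix is one word vector. This is a minimal sketch: the vocabulary, the padding token, the dimensions and the random embedding table are all hypothetical and only illustrate the shape of $D$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical vocabulary; "<pad>" fills documents shorter than l.
vocab = {"<pad>": 0, "space": 1, "pilot": 2, "rescues": 3, "crew": 4}
p = 4  # embedding dimension (kept small for readability)
l = 6  # fixed document length after padding or truncation

# One p-dimensional row per vocabulary word (random here, learned in practice).
embedding = rng.normal(size=(len(vocab), p))

def document_matrix(tokens, vocab, embedding, l):
    """Return D with shape (p, l); column i is the embedding of the i-th word."""
    ids = [vocab[t] for t in tokens if t in vocab][:l]  # drop unknown words, truncate
    ids += [vocab["<pad>"]] * (l - len(ids))            # pad to length l
    return embedding[ids].T

D = document_matrix(["space", "pilot", "rescues", "crew"], vocab, embedding, l)
print(D.shape)  # (4, 6)
```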
where $*$ is a convolution operator and $b_c^j \in \mathbb{R}$ is a bias for $W_c^j$. The contextual feature vector is $c^j = [c_1^j, c_2^j, \ldots, c_i^j, \ldots, c_{l-ws+1}^j]$.
(3) Pooling layer. The pooling layer extracts representative features from the convolution layer and also handles variable document lengths through a pooling operation that constructs a fixed-length feature vector. Max-pooling is used here to reduce the representation of a document to a fixed-length vector. The maximum contextual feature from each contextual feature vector can be expressed as: $d_f = [\max(c^1), \max(c^2), \ldots, \max(c^j), \ldots, \max(c^{n_c})]$
(4) Output layer. High-level features obtained from the previous layer are converted at the output layer. The produced document latent vector can be expressed as: $s_j = \mathrm{cnn}(W, X_j)$, where $W$ denotes all the weight and bias variables, $X_j$ denotes the raw document of item $j$, and $s_j$ denotes the document latent vector of item $j$.
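To make the fixed-length property of max-pooling concrete, the sketch below pools the feature vectors of a short and a long document. The feature vectors are random placeholders, and the number of filters (5) and the vector lengths (8 and 48) are arbitrary assumptions.

```python
import numpy as np

def max_pool(feature_vectors):
    """Keep the maximum of each contextual feature vector."""
    return np.array([c.max() for c in feature_vectors])

rng = np.random.default_rng(1)
n_c = 5  # number of contextual feature vectors (one per filter), assumed

# Hypothetical convolution outputs for two documents of different length.
short_doc = [rng.normal(size=8) for _ in range(n_c)]   # l - ws + 1 = 8
long_doc = [rng.normal(size=48) for _ in range(n_c)]   # l - ws + 1 = 48

d_short = max_pool(short_doc)
d_long = max_pool(long_doc)
print(d_short.shape, d_long.shape)  # (5,) (5,)
```

Whatever the document length, the pooled vector has one entry per filter, which is what lets the output layer use fixed-size weights.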

Improved ConvMF with Segmented Max-Pooling (ConvMF-S)
Convolution and pooling are among the most important stages in a CNN, and max-pooling is the most common sub-sampling operation of the pooling layer. It keeps only the maximum feature from each feature vector obtained from the convolution layer. One advantage of max-pooling is that it reduces the number of features, which improves performance, and it produces feature vectors of the same length, which makes the following layers easy to construct. The architecture of max-pooling is shown in figure 3. The disadvantages of max-pooling have been stated in section 1. To deal with them, we propose a new approach with segmented max-pooling, called ConvMF-S, which improves ConvMF. It divides each feature vector obtained from the convolution layer into segments as required and extracts the maximum value from each segment. The architecture of segmented max-pooling is shown in figure 4.
In ConvMF-S, the embedding layer, convolution layer and output layer are the same as in ConvMF. The only change is in the pooling layer, which is described as follows. Suppose each contextual feature vector $c^j$ produced by the convolution layer has length $l - ws + 1$. In order to keep more information in the pooling layer, we divide each contextual feature vector into segments and extract the maximum contextual feature from each segment. If a contextual feature vector is divided into $s$ segments, each segment has a length $l_s$ determined by the vector length $l - ws + 1$ and $s$. The fixed-length contextual feature vector is then obtained by extracting the maximum contextual feature from each of the $s$ segments of every $c^j$: $[\max(c_1^j, \ldots, c_{l_s}^j),\; \max(c_{l_s+1}^j, \ldots, c_{2 l_s}^j),\; \ldots,\; \max(c_{(s-1) l_s+1}^j, \ldots, c_{l-ws+1}^j)]$ (10)
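The segmented max-pooling step can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the implementation used in the experiments: the function name `segmented_max_pool` is hypothetical, the example vectors are made up, and the handling of lengths that are not divisible by $s$ (here delegated to `np.array_split`, whose segments differ in length by at most one) is an assumption.

```python
import numpy as np

def segmented_max_pool(c, s):
    """Split c into s contiguous segments and keep the maximum of each segment."""
    if s > len(c):
        raise ValueError("more segments than features")
    segments = np.array_split(c, s)  # assumption: near-equal contiguous segments
    return np.array([seg.max() for seg in segments])

# Hypothetical contextual feature vectors from two filters.
early_once = np.array([0.9, 0.1, 0.0, 0.2, 0.1, 0.0])  # strong feature once, near the start
late_twice = np.array([0.0, 0.1, 0.2, 0.9, 0.1, 0.9])  # strong feature twice, near the end

print(segmented_max_pool(early_once, 3))  # [0.9 0.2 0.1]
print(segmented_max_pool(late_twice, 3))  # [0.1 0.9 0.9]

# Flatten the per-filter results into one fixed-length vector of size s * n_c.
d_f = np.concatenate([segmented_max_pool(c, 3) for c in (early_once, late_twice)])
print(d_f.shape)  # (6,)
```

With $s = 1$ the function reduces to ordinary max-pooling. With $s = 3$ the two vectors, which share the same global maximum, yield different pooled outputs that reflect where and how often the strong feature occurs.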

ConvMF-S Algorithm
Integrating CNN into PMF, our ConvMF-S algorithm can be described as follows.

Table 1. ConvMF-S Algorithm
Input: R: user-item rating matrix, X: description documents of items
3: Extract feature values using segmented max-pooling to create the contextual feature vector $d_f$.
4: Flatten the pooling results into a one-dimensional vector.
5: Output the document latent vectors.
6: Use the document latent vectors as the mean of the Gaussian noise of each item to initialize the item feature matrix.
7: Initialize the user feature matrix and fit the rating matrix R with the item feature matrix.
8: Output the resulting RMSE.

Experiments
In this section, we evaluate the performance of ConvMF-S algorithm compared with PMF and ConvMF.

Experimental Environment and Datasets
We use the ml-100k dataset from MovieLens, which contains 100,000 ratings on 1682 movies from 943 users. We randomly divide it into a training set (80%), a validation set (10%) and a test set (10%).
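A random 80/10/10 split of this kind can be sketched as a shuffle of rating indices. The function name `split_ratings` and the fixed seed are assumptions for illustration; the text does not say how the random split was generated.

```python
import numpy as np

def split_ratings(n_ratings, seed=0):
    """Return shuffled index arrays for an 80/10/10 train/validation/test split."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_ratings)
    n_train = n_ratings * 8 // 10
    n_valid = n_ratings // 10
    return idx[:n_train], idx[n_train:n_train + n_valid], idx[n_train + n_valid:]

train_idx, valid_idx, test_idx = split_ratings(100_000)
print(len(train_idx), len(valid_idx), len(test_idx))  # 80000 10000 10000
```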
We also obtain the documents of the corresponding items from IMDB. The obtained documents are preprocessed as follows: (1) Set the maximum length of input documents to 300; (2) Remove stop words; (3) Calculate the TF-IDF score for each word; (4) Remove corpus-specific stop words whose document frequency is higher than 0.5; (5) Select the top 8000 distinct words as the vocabulary; (6) Remove all non-vocabulary words from the input documents.
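The six preprocessing steps can be sketched in plain Python as follows. Several details are assumptions, since the text does not specify them: the stop-word list is a tiny hypothetical one, the TF-IDF score uses idf = log(n / df) summed over the corpus, and the top words are ranked by that summed score.

```python
import math
from collections import Counter

STOP_WORDS = {"the", "a", "of", "and", "is"}  # hypothetical stop-word list

def preprocess(docs, max_len=300, max_df=0.5, vocab_size=8000):
    # (1) truncate to max_len tokens, (2) remove stop words
    docs = [[w for w in d[:max_len] if w not in STOP_WORDS] for d in docs]
    n = len(docs)
    df = Counter(w for d in docs for w in set(d))  # document frequency per word
    # (3) TF-IDF per word, summed over all documents (assumed scoring)
    score = Counter()
    for d in docs:
        for w, tf in Counter(d).items():
            score[w] += tf * math.log(n / df[w])
    # (4) drop corpus-specific stop words with document frequency above max_df
    candidates = [w for w in score if df[w] / n <= max_df]
    # (5) keep the top vocab_size words by score (ties broken alphabetically)
    vocab = set(sorted(candidates, key=lambda w: (-score[w], w))[:vocab_size])
    # (6) remove non-vocabulary words
    return [[w for w in d if w in vocab] for d in docs], vocab

docs = [
    "the alien ship lands".split(),
    "a ship and a storm".split(),
    "the ship is lost".split(),
    "storm of the century".split(),
]
cleaned, vocab = preprocess(docs)
print(cleaned)  # [['alien', 'lands'], ['storm'], ['lost'], ['storm', 'century']]
```

In this toy corpus "ship" appears in three of four documents, so it is removed as a corpus-specific stop word, while "storm" (exactly half of the documents) is kept.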

Word Vectors Pre-training with Word2vec
One of the most critical issues for contextual-based deep hybrid recommender systems is how to use text data more efficiently to generate high-quality features, which involves text analysis tasks in NLP. Therefore, our word embedding vectors are initialized with word2vec [17], a very popular pre-trained word embedding model. We pre-train our word vectors on IMDB, which contains 50000 labeled comments and 50000 unlabeled comments.
Each comment in IMDB is stored as a single file, so we merge these comments into one dataset. The format of the merged dataset is shown in table 2. The field id in table 2 represents the file name of the comment: the part to the left of the underscore is the movie ID, and the part to the right is the user's rating of the movie. The review contents are processed by removing HTML tags, punctuation and numbers, converting the text to lowercase, splitting it into individual words and removing repeated words.
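The id parsing and review cleaning described above can be sketched with the standard `re` module. The function names, the sample id `127_8` and the sample review are hypothetical, and reading "repeated words" as keeping only the first occurrence of each word is an assumption.

```python
import re

def parse_id(file_id):
    """Split an id of the form '<movie ID>_<rating>' into its two parts."""
    movie_id, rating = file_id.split("_")
    return movie_id, int(rating)

def clean_review(text):
    text = re.sub(r"<[^>]+>", " ", text)      # remove HTML tags
    text = re.sub(r"[^A-Za-z\s]", " ", text)  # remove punctuation and numbers
    words = text.lower().split()              # lowercase and split into words
    return list(dict.fromkeys(words))         # drop repeats, keep first-occurrence order

print(parse_id("127_8"))                                      # ('127', 8)
print(clean_review("Great movie!<br />Great cast, 10/10."))  # ['great', 'movie', 'cast']
```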

Experimental Results
In our experiments, RMSE (Root Mean Squared Error) is adopted as the evaluation measure, since it is directly related to the objective functions of the prediction models.
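For reference, RMSE is the square root of the mean squared difference between predicted and observed ratings. A minimal sketch follows; the function name and the three example ratings are made up for illustration.

```python
import numpy as np

def rmse(predicted, observed):
    """Root mean squared error between predicted and observed ratings."""
    predicted = np.asarray(predicted, dtype=float)
    observed = np.asarray(observed, dtype=float)
    return float(np.sqrt(np.mean((predicted - observed) ** 2)))

# Errors are -1, 0 and 2, so the mean squared error is 5/3.
value = rmse([4.0, 3.0, 5.0], [5.0, 3.0, 3.0])
print(round(value, 3))  # 1.291
```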
Firstly, we compare the performance of the three algorithms over the number of iterations, as illustrated in figure 5.

Fig. 5. Comparison of numbers of iterations
From figure 5, it can be seen that PMF converges quickly during the first 15 iterations, and its RMSE value tends to be stable after the 15th iteration. ConvMF converges quickly during the first 20 iterations. After the 20th iteration, the model is still converging, but more slowly. ConvMF becomes stable and its RMSE value does not change once the number of iterations exceeds 30. ConvMF-S is superior to PMF and ConvMF from the beginning of model training, which suggests that segmented max-pooling improves the CNN's ability to analyze document data. The final result of ConvMF-S is also superior to those of the other two algorithms, which further supports the claim that the improved method enhances recommendation quality.
Secondly, we compare the performance of the three algorithms on training sets of different percentages (20%, 40%, 60% and 80%). The result is shown in figure 6. Finally, we compare the performance of ConvMF-S with and without embedded pre-trained word vectors. The result is shown in figure 7. From figure 7, we can see that there is no obvious difference between the two. However, the model with embedded word vectors is still converging after 30 iterations, and its final result surpasses that of the model without embedded word vectors.

Conclusion
In this paper, we introduce a novel content-based deep hybrid approach with segmented max-pooling, which we call ConvMF-S. Segmented max-pooling preserves location and frequency information while extracting features. Experiments show that recommendation performance is improved. Future work may include using distributed technology to handle cases in which the document data is extremely large or the selected dimension is especially high.

Fig. 1. Probabilistic model of ConvMF. The left dotted part is PMF and the right dashed part is CNN. Suppose we have $N$ users and $M$ items, and observed ratings are represented by $R \in \mathbb{R}^{N \times M}$.

Fig. 2. CNN architecture for ConvMF
(2) Convolution layer. The convolution layer is responsible for extracting contextual features. A contextual feature $c_i^j \in \mathbb{R}$ is extracted by the $j$th shared weight $W_c^j \in \mathbb{R}^{p \times ws}$, whose window size $ws$ determines the number of surrounding words: $c_i^j = f(W_c^j * D_{(:,\, i:(i+ws-1))} + b_c^j)$, where $f$ is a non-linear activation function. Then the contextual feature vectors $c^j \in \mathbb{R}^{l-ws+1}$ of each document are returned as output.
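The convolution step can be sketched as sliding one shared weight over the columns of the document matrix. This is a minimal sketch under stated assumptions: ReLU is used as the non-linear activation (the text does not name one), the convolution operator is taken to be an element-wise product summed over the window, and all sizes and values are hypothetical.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def contextual_feature_vector(D, W, b, f=relu):
    """Slide one shared weight W (p x ws) over D (p x l); returns l - ws + 1 features."""
    p, l = D.shape
    ws = W.shape[1]
    return np.array([f(np.sum(W * D[:, i:i + ws]) + b) for i in range(l - ws + 1)])

rng = np.random.default_rng(0)
D = rng.normal(size=(4, 10))  # hypothetical document matrix: p = 4, l = 10
W = rng.normal(size=(4, 3))   # one shared weight with window size ws = 3
c = contextual_feature_vector(D, W, b=0.1)
print(c.shape)  # (8,)
```

The output length is $l - ws + 1 = 10 - 3 + 1 = 8$, which is why documents of different lengths produce contextual feature vectors of different lengths before pooling.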

$W$ is the weight matrix and $c_i^j$ is a contextual feature extracted by the $j$th filter in the convolution layer. The length of the document is $l$. After the convolution layer, a document is represented as $n_c$ contextual feature vectors, and each contextual feature vector has a variable length of $l - ws + 1$.

Steps 1 and 2 of Table 1:
1: Embed the one-hot encoded word vectors to generate the word sequence $D \in \mathbb{R}^{p \times l}$.
2: Process $D$ with filters of three different window sizes (3, 4, 5) to extract contextual features.

Fig. 6. Comparison on training sets with different percentages
From figure 6, it can be seen that the RMSE values of the three algorithms decrease as the percentage of the training set increases. ConvMF and ConvMF-S both outperform PMF, because adding document information improves accuracy. ConvMF is better than ConvMF-S when the percentage of the training set is below 40%. When the percentage of the training set exceeds 40%, ConvMF-S surpasses ConvMF.

Fig. 7. Effect of the embedded word vectors

Table 2. Format of the Merged Dataset