An Improved Collaborative Filtering Recommendation Algorithm for Big Data

Abstract. With the increase in the volume, velocity, and variety of big data, the traditional collaborative filtering recommendation algorithm, which recommends items based on the ratings of like-minded users, becomes increasingly inefficient. In this paper, two variants of a collaborative filtering recommendation algorithm are proposed. The first uses an improved k-means clustering technique, while the second couples the improved k-means clustering technique with Principal Component Analysis as a dimensionality reduction method to enhance recommendation accuracy for big data. The experimental results show that the proposed algorithms achieve better recommendation performance than the traditional collaborative filtering recommendation algorithm.


Introduction
With the explosive increase in available data on the web and the rapid advances in information technology, big data has become a hot research topic in the field of data mining. The term is commonly used to describe the exponential growth and availability of structured and unstructured data. Nowadays, many governmental and industrial communities are interested in the high potential of this technology. However, it is very difficult for such communities to find relevant content, and recommender systems have emerged to address this problem. A recommender system is defined as a decision-making strategy for users in complex information environments [1] that can effectively recommend the required information to end users. Various techniques for developing recommender systems have been proposed, using content-based filtering, collaborative filtering, or hybrid methods [2][3][4][5]. In particular, the collaborative filtering recommendation algorithm (CFRA) is popular and has been used by many providers and consumers of big data, such as eBay, Amazon, and Facebook.
Recently, many studies have reported that applying k-means as a clustering technique in collaborative recommender systems can significantly enhance the performance of the traditional CFRA [6]. Moreover, it has been shown that using Principal Component Analysis (PCA) as a dimensionality reduction method can significantly improve clustering techniques [7]; it is therefore necessary to perform dimensionality reduction before carrying out the clustering task. Hence, in this paper, we propose two variants of an effective collaborative filtering recommendation algorithm. The first uses the improved k-means clustering technique, while the second couples the improved k-means clustering technique with PCA as a dimensionality reduction method to enhance recommendation accuracy for big data. The experimental results show that the proposed algorithms achieve better recommendation performance than the traditional collaborative filtering recommendation algorithm.
The rest of this paper is organized as follows: Sect. 2 discusses related work. Section 3 presents the collaborative filtering recommendation algorithm. Section 4 explains the proposed approach in detail. Section 5 describes the experimental results. Finally, Sect. 6 concludes this study and outlines plans for future work.

Related Work
In recent years, big data has attracted great attention from governments, universities, and industry, and recommender systems have been introduced to help them find what they need through mechanisms that make predictions based on different criteria. One recommender framework that can provide several kinds of recommendation is the open source project Apache Mahout [8]. It primarily provides free, scalable implementations of machine learning methods [9,10]. Another free and open source scalable recommender system library is MyMediaLite [11], which addresses both rating prediction and item prediction from positive-only feedback. The rating prediction can be on a scale of 1 to 5 stars, while the positive-only implicit feedback used for item prediction can come from purchase actions or clicks. In [12], the authors propose a keyword-aware service recommendation method, named KASR, which captures users' preferences and generates appropriate recommendations on MapReduce [13] for big data applications. In [14], Lee et al. propose an adaptive recommendation algorithm, ACFSC, which relies on scalable clustering to address the scalability problem by building neighborhoods in a way that reduces time complexity. They also address the sparsity problem by learning the feature vectors of items and users incrementally. CSRS [15] is a customized service recommendation system for big data. It uses the MapReduce framework and focuses on a service recommendation method that creates suitable recommendations based on users' preferences. In [16], Zarzour et al. propose a new collaborative filtering recommendation algorithm based on dimensionality reduction and clustering techniques. They use the k-means clustering algorithm to cluster similar users and Singular Value Decomposition (SVD) to reduce the dimensionality.
In [17], the authors use the k-means algorithm to cluster users according to their interests and then a voting algorithm to generate predictions in recommender systems.

Collaborative Filtering Recommendation Algorithm
In the field of recommender systems, the collaborative filtering recommendation algorithm (CFRA) is the most successful recommendation method. The idea behind CFRA is to provide an active user with recommendations or predictions by first looking for users who share the same rating patterns and then using the ratings of those like-minded users to calculate a prediction for the active user. In other words, CFRA can suggest new, similar items or predict the interest of an active user in a certain item, based on that user's previous likings and the preferences of other similar users. More technically, it uses a user-item rating matrix that records users' preferences for items, matches users with relevant preferences by applying a similarity function to their profiles, and then makes recommendations or predicts the ratings of selected items [18,19].
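The idea described above can be illustrated with a minimal user-based sketch. It is not the authors' implementation: it assumes that a rating of 0 marks an unrated item, that similarity is the Pearson correlation over co-rated items, and that the prediction is a mean-centered weighted average over the most similar users. The names `pearson`, `predict`, and `n_neighbors` are hypothetical.

```python
import numpy as np

def pearson(u, v):
    """Pearson correlation over the items both users rated (0 = unrated, an assumption)."""
    both = (u > 0) & (v > 0)
    if both.sum() < 2:
        return 0.0
    du = u[both] - u[both].mean()
    dv = v[both] - v[both].mean()
    denom = np.sqrt((du ** 2).sum() * (dv ** 2).sum())
    return float((du * dv).sum() / denom) if denom > 0 else 0.0

def predict(M, active, item, n_neighbors=2):
    """Predict M[active, item] from the most similar users who rated `item`."""
    # Similarity of the active user to every other user (self is excluded).
    sims = np.array([pearson(M[active], M[v]) if v != active else -np.inf
                     for v in range(M.shape[0])])
    # Keep the most similar users that rated the item and correlate positively.
    neighbors = [v for v in np.argsort(-sims)
                 if M[v, item] > 0 and sims[v] > 0][:n_neighbors]
    mean_active = M[active][M[active] > 0].mean()
    if not neighbors:
        return float(mean_active)  # fall back to the user's own mean rating
    num = sum(sims[v] * (M[v, item] - M[v][M[v] > 0].mean()) for v in neighbors)
    den = sum(abs(sims[v]) for v in neighbors)
    return float(mean_active + num / den)

# Toy user-item matrix: rows are users, columns are items, 0 means "not rated".
M = np.array([[5, 3, 0, 1],
              [4, 3, 4, 1],
              [1, 1, 5, 5],
              [5, 4, 4, 0]], dtype=float)
print(predict(M, active=0, item=2))
```

Restricting the neighborhood to positively correlated users is one common design choice; other variants keep negative correlations or use a similarity threshold instead.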
To compute the similarity between users or items, several similarity measures are available. One of the most popular is the Pearson Correlation Coefficient (PCC), defined by formula (1). Once the similarity is computed, the N nearest users are selected as a group of similar users, called the neighborhood, and the predicted ratings of unrated items can then be computed with the recommendation formula, formula (2). The main steps of the collaborative filtering recommendation algorithm (CFRA) are as follows:
Step 1: Input the matrix M[m, n] of user-item rating data, the active user, and K;
Step 2: Calculate the similarity between users by using the Pearson Correlation Coefficient (PCC) and generate the similarity matrix S[m, m];
Step 3: Calculate the similarity between the active user and the clusters;
Step 4: Select the first n similar users of the active user;
Step 5: Calculate the prediction values of the active user for every cluster by using formula (2);
Step 6: Choose the top N items of the users as recommendations;
Step 7: Output the recommendations.
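For reference, formulas (1) and (2) are commonly written in the following standard forms. These are the textbook definitions and are given here as an assumption; the notation (users a and b, item i, co-rated item set I_{ab}, neighborhood N_a, mean ratings \bar{r}) is illustrative and may differ from the authors' own.

```latex
% (1) Pearson Correlation Coefficient between users a and b,
%     computed over the set I_{ab} of items rated by both users
\mathrm{sim}(a,b) =
  \frac{\sum_{i \in I_{ab}} (r_{a,i} - \bar{r}_a)(r_{b,i} - \bar{r}_b)}
       {\sqrt{\sum_{i \in I_{ab}} (r_{a,i} - \bar{r}_a)^2}\;
        \sqrt{\sum_{i \in I_{ab}} (r_{b,i} - \bar{r}_b)^2}}

% (2) Predicted rating of item i for the active user a,
%     using the neighborhood N_a of similar users who rated i
p_{a,i} = \bar{r}_a +
  \frac{\sum_{b \in N_a} \mathrm{sim}(a,b)\,(r_{b,i} - \bar{r}_b)}
       {\sum_{b \in N_a} \lvert \mathrm{sim}(a,b) \rvert}
```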

K-means Based-Collaborative Filtering Algorithm
In this paper, two variants of a collaborative filtering recommendation algorithm are proposed. The first applies the k-means clustering technique directly, while the second applies it after performing PCA. PCA aims to reduce the dimensionality of big data by extracting the most important information from the data. It can make big data mining more practical while yielding similar results with fewer dimensions [20].

K-means Algorithm
In data mining, k-means is considered one of the most widely used clustering methods [21]; it automatically partitions a dataset into a set of clusters in a simple way. The main aim of k-means is to make the similarity between points of the same cluster high and the similarity between different clusters low. The steps of the algorithm are as follows:
Step 1: Input the dataset and the number of clusters K;
Step 2: Randomly select K initial cluster centers;
Step 3: Calculate the distances between the centers and the objects, then assign each object to the nearest cluster;
Step 4: For each cluster, calculate the average as the new partition center;
Step 5: Use the new partition centers to redistribute the points into new clusters;
Step 6: Repeat Steps 4 and 5 until the algorithm converges to a stable partition;
Step 7: Output the K clusters.
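The steps above can be sketched in a few lines of NumPy. This is a minimal illustration of plain k-means under stated assumptions (Euclidean distance, initial centers drawn from the data points, a hypothetical `max_iter` cap), not the improved variant proposed in the paper.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Plain k-means on the rows of X; returns (labels, centers)."""
    rng = np.random.default_rng(seed)
    # Step 2: pick k distinct data points as the initial centers.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    labels = None
    for _ in range(max_iter):
        # Steps 3 and 5: assign every point to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 6: stop once the partition no longer changes.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 4: move each center to the mean of its members.
        for j in range(k):
            members = X[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return labels, centers

X = np.array([[0, 0], [0, 1], [1, 0],
              [10, 10], [10, 11], [11, 10]], dtype=float)
labels, centers = kmeans(X, k=2)
print(labels)
```

Which cluster receives index 0 depends on the random initialization, so only the grouping of the points is meaningful, not the label values themselves.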

CFRA-Km: A Collaborative Filtering Recommendation Algorithm Based on K-means Clustering
The general k-means algorithm is now adapted to take into account the recommendation requirements as well as the prediction of unknown ratings for a given active user. The specific steps are as follows:
Step 1: Input the matrix M[m, n] of user-item rating data, the active user, and K;
Step 2: Calculate the similarity between users by using the Pearson Correlation Coefficient (PCC) and generate the similarity matrix S[m, m];
Step 3: Use the matrix S[m, m] as the dataset and randomly select K initial cluster centers;
Step 4: Calculate the distances between the centers and the objects, then assign each object to the nearest cluster;
Step 5: For each cluster, calculate the average as the new partition center;
Step 6: Use the new partition centers to redistribute the points into new clusters;
Step 7: Repeat Steps 5 and 6 until the algorithm converges to a stable partition;
Step 8: Calculate the similarity between the active user and the clusters;
Step 9: Select the first n similar clusters of the active user;
Step 10: Calculate the prediction values of the active user for every cluster by using formula (2);
Step 11: Choose the top N items of the users as recommendations;
Step 12: Output the recommendations.
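A rough sketch of this pipeline follows, under several simplifying assumptions that are not taken from the paper: unrated items are stored as 0 and enter the correlation as zeros, `np.corrcoef` stands in for the PCC step, the active user's own cluster is used in place of Steps 8 and 9, k-means runs for a fixed number of iterations, and formula (2) is taken to be the usual mean-centered weighted average. The function names are hypothetical.

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Plain k-means with a fixed iteration count (a simplification)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.linalg.norm(X[:, None] - centers[None], axis=2).argmin(axis=1)
        centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return labels, centers

def cfra_km_predict(M, active, item, k=2):
    """Predict M[active, item] from users in the active user's cluster."""
    S = np.corrcoef(M)            # Step 2: user-user PCC matrix S[m, m]
    labels, _ = kmeans(S, k)      # Steps 3-7: cluster the rows of S
    # Simplified Steps 8-9: use the cluster that contains the active user.
    peers = [v for v in np.flatnonzero(labels == labels[active])
             if v != active and M[v, item] > 0]
    means = np.array([row[row > 0].mean() for row in M])  # mean of rated items
    if not peers:
        return float(means[active])
    w = S[active, peers]
    den = np.abs(w).sum()
    if den == 0:
        return float(means[active])
    # Step 10: mean-centered weighted average (assumed form of formula (2)).
    return float(means[active] + (w * (M[peers, item] - means[peers])).sum() / den)

M = np.array([[5, 4, 1, 1], [4, 5, 1, 2], [5, 5, 2, 0],
              [1, 1, 5, 4], [2, 1, 4, 5], [1, 2, 5, 5]], dtype=float)
print(cfra_km_predict(M, active=2, item=3))
```

Restricting the neighbor search to one cluster is what reduces the cost relative to scanning all users; the sketch assumes every user has at least one rating and no constant rating row, since `np.corrcoef` is undefined otherwise.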

Reducing the Dimension by PCA
One purpose of PCA is to analyze big data in order to eliminate noise and find patterns, reducing the dimensions of the data without losing relevant information. To do this, it converts a set of observations of possibly correlated variables into a set of values of principal components by using a linear transformation called an orthogonal transformation. In general, the number of principal components obtained is less than or equal to the number of original variables. Therefore, PCA is used as a statistical method not only to reduce the dimension of the user-user ratings matrix but also to limit the loss of information, by employing eigenvalue decomposition of the data covariance matrix to obtain the principal components of the dataset with their weights. The general steps of PCA are as follows:
Step 1: Input the dataset;
Step 2: Normalize the data in the dataset;
Step 3: Calculate the covariance matrix of the normalized data;
Step 4: Calculate the eigenvectors of the covariance matrix;
Step 5: By matrix multiplication, express the data in terms of the principal components.
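The five steps above can be sketched with NumPy's eigendecomposition. The sketch assumes that "normalize" means centering each variable on its mean (scaling to unit variance is another common choice), and the parameter `n_components` is a hypothetical name for the number of components to keep.

```python
import numpy as np

def pca(X, n_components):
    """Project the rows of X onto the top principal components.

    Returns (scores, eigenvalues), with eigenvalues in descending order.
    """
    Z = X - X.mean(axis=0)                 # Step 2: center each variable
    C = np.cov(Z, rowvar=False)            # Step 3: covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)   # Step 4: eigh returns ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    W = eigvecs[:, order]                  # principal directions as columns
    return Z @ W, eigvals[order]           # Step 5: express data in the new basis

# Points lying on the line y = 2x: one component carries all the variance.
X = np.array([[1, 2], [2, 4], [3, 6], [4, 8]], dtype=float)
scores, variances = pca(X, n_components=1)
print(scores.shape, variances)
```

`np.linalg.eigh` is used because the covariance matrix is symmetric; the sign of each principal direction is arbitrary, so scores may be flipped between runs or libraries.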

CFRA-Km-PCA: A Collaborative Filtering Recommendation Algorithm Based on K-means Clustering and PCA
The first version of our k-means clustering-based collaborative filtering recommendation algorithm does not include dimensionality reduction, the absence of which may significantly affect the prediction results and make them inaccurate. Thus, PCA is applied before the k-means clustering and the prediction step to reduce the dimension of the dataset and improve the quality of the prediction results. In other words, the collaborative filtering recommendation algorithm based on k-means clustering and PCA, called CFRA-Km-PCA, combines the advantages of the PCA method with those of the k-means clustering technique. The specific steps of CFRA-Km-PCA are as follows:
Step 1: Input the matrix M[m, n] of user-item rating data, the active user, and K;
Step 2: Calculate the similarity between users by using the Pearson Correlation Coefficient (PCC) and generate the similarity matrix S[m, m];
Step 3: Normalize the data in the obtained S[m, m];
Step 4: Calculate the covariance matrix of the normalized data;
Step 5: Calculate the eigenvectors of the covariance matrix;
Step 6: By matrix multiplication, express the data in terms of the principal components;
Step 7: Use the obtained principal components matrix as the dataset and randomly select K initial cluster centers;
Step 8: Calculate the distances between the centers and the objects, then assign each object to the nearest cluster;
Step 9: For each cluster, calculate the average as the new partition center;
Step 10: Use the new partition centers to redistribute the points into new clusters;
Step 11: Repeat Steps 9 and 10 until the algorithm converges to a stable partition;
Step 12: Calculate the similarity between the active user and the clusters;
Step 13: Select the first n similar clusters of the active user;
Step 14: Calculate the prediction values of the active user for every cluster by using formula (2);
Step 15: Choose the top N items of the users as recommendations;
Step 16: Output the recommendations.
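The first eleven steps, up to the stable partition, might be sketched as follows. This is an illustration under assumptions, not the authors' code: `np.corrcoef` stands in for the PCC step with unrated items stored as 0, "normalize" is read as mean-centering, k-means runs for a fixed number of iterations, and the number of retained components `d` is a hypothetical parameter. The prediction steps (12 to 16) are left out.

```python
import numpy as np

def pca(X, d):
    """Project the rows of X onto the d leading principal components."""
    Z = X - X.mean(axis=0)                                  # Step 3
    vals, vecs = np.linalg.eigh(np.cov(Z, rowvar=False))    # Steps 4-5
    return Z @ vecs[:, np.argsort(vals)[::-1][:d]]          # Step 6

def kmeans(X, k, iters=50, seed=0):
    """Plain k-means with a fixed iteration count (a simplification)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]       # Step 7
    for _ in range(iters):                                  # Steps 8-11
        labels = np.linalg.norm(X[:, None] - centers[None], axis=2).argmin(axis=1)
        centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return labels, centers

def cfra_km_pca_clusters(M, k=2, d=2):
    """Cluster users after reducing the user-user similarity matrix with PCA."""
    S = np.corrcoef(M)        # Step 2: similarity matrix S[m, m]
    P = pca(S, d)             # Steps 3-6: m rows, d columns
    labels, _ = kmeans(P, k)
    return labels

M = np.array([[5, 4, 1, 1], [4, 5, 1, 2], [5, 5, 2, 0],
              [1, 1, 5, 4], [2, 1, 4, 5], [1, 2, 5, 5]], dtype=float)
print(cfra_km_pca_clusters(M))
```

Clustering the m-by-d matrix instead of the m-by-m similarity matrix is where the dimensionality reduction pays off when the number of users m is large.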

Experimentation Results and Evaluation
To evaluate the performance of the k-means clustering-based collaborative filtering recommendation algorithm with and without PCA against the traditional collaborative filtering recommendation algorithm, experiments were conducted on real big data. The experimental dataset was obtained from Netflix [22] and contains 17,770 movies rated by approximately 480,000 users. In this dataset, there are over 100 million ratings ranging from 1 to 5 stars. A random sample was drawn from this dataset; 80% of the sample was randomly selected for training, and the remaining 20% was used to test the performance of the considered algorithms.
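A random 80/20 split of this kind could be sketched as follows. The representation of the ratings as an array of (user, item, rating) rows, the function name `split_ratings`, and the fixed seed are assumptions made for illustration; the paper does not describe how the split was implemented.

```python
import numpy as np

def split_ratings(ratings, train_frac=0.8, seed=0):
    """Randomly split rating rows into a training part and a test part."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(ratings))          # shuffle the row indices
    cut = int(round(train_frac * len(ratings)))  # size of the training part
    return ratings[idx[:cut]], ratings[idx[cut:]]

# Ten hypothetical (user, item, rating) rows.
ratings = np.array([[u, i, 1 + (u + i) % 5] for u in range(5) for i in range(2)])
train, test = split_ratings(ratings)
print(len(train), len(test))  # prints "8 2"
```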
In the performance evaluation of recommender systems, Mean Absolute Error (MAE) and Root Mean Square Error (RMSE) are the most widely used metrics. Therefore, we used these metrics to evaluate the recommendation performance of the CFRA, CFRA-Km, and CFRA-Km-PCA algorithms.
RMSE and MAE are defined, respectively, as

\[ \mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(p_i - r_i)^2}, \qquad \mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\lvert p_i - r_i \rvert, \]

where p_i is the predicted rating, r_i is the corresponding actual rating, and n is the number of tested ratings. Figure 1 shows the experimental results in terms of the RMSE metric for the three algorithms. As the graph shows, the RMSE of the proposed CFRA-Km and CFRA-Km-PCA is lower than that of the CFRA algorithm over the whole range of neighborhood sizes. More precisely, CFRA-Km-PCA achieves better results than both other algorithms. Figure 2 shows the experimental results in terms of the MAE metric for the three algorithms. Similarly, the MAE of the proposed CFRA-Km and CFRA-Km-PCA is lower than that of the CFRA algorithm over the whole range of neighborhood sizes, and CFRA-Km-PCA achieves better accuracy than both other algorithms.
From Figs. 1 and 2, we can conclude that the proposed algorithms, CFRA-Km and CFRA-Km-PCA, perform better than the traditional CFRA algorithm in terms of RMSE and MAE. We can also conclude that combining the PCA method with the k-means clustering technique significantly improved the recommendation performance, which indicates that CFRA-Km-PCA is the better algorithm for use in recommender systems for big data.
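The two error metrics used in this evaluation can be computed as in the following sketch, which uses their standard definitions over paired predicted and actual ratings. The function names and the toy ratings are illustrative and are not taken from the experiments.

```python
import numpy as np

def rmse(predicted, actual):
    """Root Mean Square Error between predicted and actual ratings."""
    err = np.asarray(predicted, dtype=float) - np.asarray(actual, dtype=float)
    return float(np.sqrt(np.mean(err ** 2)))

def mae(predicted, actual):
    """Mean Absolute Error between predicted and actual ratings."""
    err = np.asarray(predicted, dtype=float) - np.asarray(actual, dtype=float)
    return float(np.mean(np.abs(err)))

predicted = [3, 4, 5]
actual = [3, 5, 3]
print(mae(predicted, actual))   # 1.0
print(rmse(predicted, actual))  # sqrt(5/3), about 1.291
```

Because RMSE squares each error before averaging, it penalizes large individual errors more heavily than MAE, which is why the two metrics are usually reported together.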

Conclusion and Future Work
In this paper, we have presented two improved collaborative filtering algorithms intended to enhance prediction accuracy in the big data context. The first algorithm uses only the k-means clustering technique, while the second combines the advantages of the k-means clustering technique and the PCA method. PCA was used to perform dimensionality reduction before the clustering task, which significantly improved the performance of the k-means clustering-based collaborative filtering recommendation algorithm. The recommendation algorithms were evaluated in terms of the RMSE and MAE metrics. The experimental results showed that CFRA-Km-PCA achieved better results than both other algorithms, CFRA and CFRA-Km.
In the future, we will apply our algorithms to other datasets and study how dimensionality reduction can be coupled with other clustering techniques to improve recommendation precision.