Semantic-Based Recommendation Method for Sport News Aggregation System

Abstract. News on the Internet plays an important role in helping people follow daily events around the world. News aggregators are websites that collect content from different sources and present it in one location for easy viewing. However, the growing volume of news on the Internet makes it difficult for readers to find the news they care about. One solution to this problem is to employ recommender systems. In this research, we propose a novel news recommendation method that combines semantic similarity with content similarity between news items, and we implement it as a feature of the semantic-based news aggregator BKSport. Experimental results show that combining both kinds of similarity measure yields better recommendations than using either measure separately.


Introduction
The development of the Internet has brought a sharp increase in the number of news websites, and the Web has become a popular platform for broadcasting news. News aggregators are websites that collect news from various sources and provide an aggregated view of events taking place all over the world. Unfortunately, a critical issue for news aggregation systems is that the large number of news items published daily makes it hard for readers to find the ones relevant to their particular interests. A possible solution to this problem is the use of recommender systems, as they can traverse the space of choices and predict the potential usefulness of news for each reader.
There has been much research on news recommendation methods based on a similarity measure: either similarity between news items, known as Global Recommendation System (GRS), or similarity between readers' personal interests and news items, known as Personal Recommendation System (PRS) [2,5]. In GRS, the recommended news items are those with the highest similarity to the item the reader is currently reading. In PRS, on the other hand, the recommended items are those with the highest similarity to the reader's personal interests, which are modeled from the history of items the reader has read. Collaborative filtering (CF) is a widely applied technology in PRS development. With the explosion of news on the Web, designing novel approaches that suggest news more relevant to readers is still a matter of concern. In this research, we focus on a news recommendation method that follows only the global recommendation system model and improves on the results of existing works.
The most important task in developing a GRS is to build a model for calculating similarity between news items. Recent research on news similarity centers on two prominent approaches: content-based similarity and semantic-based similarity. In the content-based approach, similarity is calculated from statistics of the vocabulary appearing in the news content, and almost all recommended items focus only on the subject of the target item. In contrast, in the semantic-based approach [1], similarity is usually based on a knowledge base that is used to exploit semantic relationships between the elements appearing in the news items. The recommended items are therefore likely to cover a broader range of subjects than with the content-based approach. Both approaches have weaknesses that limit their effectiveness in news recommendation. Our approach is a hybrid one in the sense that it combines content-based and semantic-based recommendation. Concretely, the similarity of news items is a linear combination of content-based similarity and semantic-based similarity. The experimental results indicate that this combination produces more effective recommendations than using either measure separately.
This work is part of the research and development of the news aggregation system BKSport [11], which is based on Semantic Web technology and aims to effectively handle the large amount of sports news gathered from various sources on the Internet. It therefore inherits results obtained in our previous research, such as the ontology and knowledge base for the sport domain and the methods for named entity recognition and for extraction of semantic relationships between entities in the news.
The rest of the paper is organized as follows. Section 2 describes previous work related to measuring semantic similarity between news items. Section 3 presents our proposed method in detail. In Section 4, we present the experiments and the evaluation we performed using an implementation of the proposed recommender. Finally, Section 5 concludes with the advantages and disadvantages of the method, as well as corrective measures and future research directions.

Related work
Traditionally, many content-based recommenders [7,9] use term weighting methods such as TF-IDF (Term Frequency-Inverse Document Frequency [10]) in conjunction with the cosine similarity measure to compare two documents. TF-IDF measures the importance of a word in a document based on its frequency of occurrence in the document and across the entire document dataset (or corpus). After the TF-IDF value has been calculated for each word in a document, this metric is combined with the cosine measure or the Jaccard measure to calculate the similarity between two documents. The TF-IDF value of a word $t$ appearing in a document $d$ is calculated by the following formula:

$$\mathrm{tfidf}(t, d) = n_{t,d} \times \log \frac{|D|}{|\{d' \in D : t \in d'\}|}$$

where $n_{t,d}$ is the number of occurrences of the word $t$ in document $d$, $|D|$ is the total number of documents in the dataset, and $|\{d' \in D : t \in d'\}|$ is the number of documents that contain $t$.
Each document is then represented as an $m$-dimensional vector, where $m$ is the size of the dictionary, and the value of each element is the TF-IDF value of the corresponding word. If a word in the dictionary does not appear in the document, the corresponding element of the vector is 0.
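The TF-IDF vector representation and the cosine comparison described above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in the cited recommenders: it assumes the common variant in which the raw term count is multiplied by log(|D| / document frequency), it stores each vector sparsely as a dictionary (words absent from a document implicitly have value 0), and the function names `tfidf_vectors` and `cosine` are chosen here for illustration only.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists. Returns one sparse {word: tf-idf} dict per document."""
    n_docs = len(docs)
    # Document frequency: the number of documents in which each word occurs.
    doc_freq = Counter(word for doc in docs for word in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append(
            {w: c * math.log(n_docs / doc_freq[w]) for w, c in counts.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors; missing words count as 0."""
    dot = sum(u[w] * v[w] for w in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

Note that a word occurring in every document receives weight 0 under this variant, so it contributes nothing to the similarity.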
In the semantic-based approach, previous studies have exploited relationships between the components of news items to calculate semantic similarity. In the study carried out by Batet et al. [4], a measure based on the taxonomical structure of a biomedical ontology is proposed for determining the semantic similarity between word pairs. The method proposed by Capelle et al. [6] exploits the similarity between components (words or named entities) of news items and uses it to calculate the similarity between two news items. To measure the similarity between two components, their method relies on:
- the WordNet dictionary tree when the components are words, denoted by $sim_{WN}$;
- the PMI measure when the components are named entities, denoted by $sim_{PMI}$. This measure relates to the statistical frequency of occurrence of the components and the co-occurrence between them.
The final formula combines the two measures $sim_{WN}$ and $sim_{PMI}$ to calculate the semantic similarity between two news items as follows ($\alpha$ is a correction parameter):

$$sim = \alpha \times sim_{WN} + (1 - \alpha) \times sim_{PMI}$$

Also exploiting the relationships between components of two news items, Frasincar et al. [8] presented a number of news recommendation methods following the semantic-based approach. Similar to Capelle et al. [6], their work targets a personalized recommendation system. However, the user profile of a reader is also built from the news items that the reader has read, so calculating the similarity between a user profile and a news item is the same as calculating the similarity between two news items. The methods presented in that work use an ontology and a knowledge base to exploit semantic relationships between concepts, which are classes in the ontology. Their experiments showed that Ranked Semantic Recommendation 2 is the most effective among them. However, it still has certain limitations, which we describe in the following sections together with a method to overcome them.

Similarity between news items
There are two main approaches to calculating similarity between text news items: content-based and semantic-based. Each approach has its own advantages and disadvantages. We combine a content-based similarity measure and a semantic-based similarity measure with the expectation of overcoming the limitations of each approach and making recommendation more effective.

Semantic-based similarity
To calculate semantic similarity, we exploit the semantic relations between components of news items. These relations are determined based on the ontology and knowledge base that we have built. We extract and analyze the following components of news items: entities, types of entities and semantic annotations. The next sections present how these components are used in calculating semantic similarity between news items.

Semantic relation between entities
Specifically, to exploit relations between entities when calculating similarity between news items, we extend the Ranked Semantic Recommendation 2 method proposed by Frasincar et al. [8]. In this method, the authors also use an ontology and a knowledge base to exploit the relations between entities. However, the method has some limitations:
- It only considers direct relations between entities, without considering indirect relations.
- It does not consider the importance of entities according to the position in which they appear in the news item (title, description, etc.).
To overcome these limitations, in Section 3.1.1.1 we present a method to calculate the relation weight between entities based on the ontology and knowledge base. In addition, we use co-occurrence statistics of entities in the same news items to determine the relation weight between entities, as presented in Section 3.1.1.2. Finally, in Section 3.1.1.3 we present the method that uses the relation weights between entities to determine the semantic similarity between news items.

Relation weight between entities based on ontology and knowledge base
Aleman-Meza et al. [3] presented methods for ranking Semantic Associations based on the Semantic Paths between two entities, in order to determine the relation weight between entities. Specifically, they define Semantic Association and Semantic Path as follows. Definition: if two entities $e_1$ and $e_n$ can be connected by one or more sequences $e_1, p_1, e_2, p_2, e_3, p_3, \ldots, e_{n-1}, p_{n-1}, e_n$ in an RDF graph, where $e_i$, $1 \le i \le n$, are entities and $p_j$, $1 \le j \le n-1$, are relations in the ontology, then we say that there exists a semantic relation between $e_1$ and $e_n$. The sequence $e_1, p_1, e_2, p_2, e_3, p_3, \ldots, e_{n-1}, p_{n-1}, e_n$ is a Semantic Path.
For example, if there exists a semantic path between the two entities Lionel Messi and Luis Suarez, then there exists a semantic relation between Lionel Messi and Luis Suarez.
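As an illustration of the definition, the following sketch searches a small set of triples for a shortest semantic path between two entities using breadth-first search. It is a simplified stand-in, not the procedure of [3]: it assumes relations can be traversed in either direction, the function name `shortest_semantic_path` is hypothetical, and the triples in the usage note are illustrative data rather than the content of the BKSport knowledge base.

```python
from collections import deque


def shortest_semantic_path(triples, start, goal):
    """Return a shortest path [e1, p1, e2, ..., en] between two entities, or None.

    triples: iterable of (subject, predicate, object).
    Assumption: edge direction is ignored when walking the graph.
    """
    neighbours = {}
    for subj, pred, obj in triples:
        neighbours.setdefault(subj, []).append((pred, obj))
        neighbours.setdefault(obj, []).append((pred, subj))

    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for relation, entity in neighbours.get(path[-1], []):
            if entity not in visited:
                visited.add(entity)
                queue.append(path + [relation, entity])
    return None  # no semantic relation found between the two entities
```

With the illustrative triples `("Lionel-Messi", "playFor", "Barcelona-FC")` and `("Luis-Suarez", "playFor", "Barcelona-FC")`, the search connects the two players through the shared team entity.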
Based on the properties of a semantic path, we compute a path rank value that expresses the relation weight between the two entities at the ends of the path. Because there may be multiple semantic paths between two entities, we take the highest path rank value to represent the relation weight. Aleman-Meza et al. [3] use four characteristics of a semantic path to calculate the path rank, each corresponding to a weight. Applying this to football news recommendation, we found that Path Length Weight and Trust Weight are the two meaningful and appropriate weights. For this reason, we use only these two weights to determine the path rank of a semantic path.

Path Length Weight
The length of a semantic path $e_1, p_1, e_2, p_2, e_3, p_3, \ldots, e_{n-1}, p_{n-1}, e_n$ is the number of entities and relations in the path, excluding $e_1$ and $e_n$. When two entities are related only indirectly, the more entities and relations lie between them, the lower the similarity between the two entities. Consequently, the path rank of a semantic path must be inversely proportional to the length of that path. The Path Length Weight is defined in [3] as below:

$$W_L(P) = \frac{1}{l}$$

where $l$ is the length of the semantic path $P$. For example, consider two semantic paths: $P_1$, from Lionel Messi to Karim Benzema, and $P_2$, from Lionel Messi to Luis Suarez. $P_1$ has a length of 7, so we obtain $W_L(P_1) = \frac{1}{l_1} = \frac{1}{7}$; $P_2$ has a length of 3, so we obtain $W_L(P_2) = \frac{1}{l_2} = \frac{1}{3}$. From this, we can see that the similarity between Lionel Messi and Luis Suarez is higher than that between Lionel Messi and Karim Benzema.
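The length convention (count everything strictly between the two end entities) and the reciprocal weight can be checked with a short sketch. The list representation of a path and the function names `path_length` and `path_length_weight` are assumptions made for this illustration.

```python
def path_length(path):
    """Number of entities and relations strictly between the two end entities.

    path is a list [e1, p1, e2, ..., en] alternating entities and relations.
    """
    return len(path) - 2


def path_length_weight(path):
    """Reciprocal of the path length, so shorter paths receive a higher weight."""
    length = path_length(path)
    if length < 1:
        raise ValueError("a semantic path needs at least one relation between its ends")
    return 1.0 / length
```

A path that passes through a single shared entity, such as two players linked through one team, has length 3 under this convention, and a path with three intermediate entities has length 7.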

Path Relation Weight
There are many different relations defined in the ontology. Every relation has a different meaning and therefore also represents a different relation weight between entities. Some relations express a close association, while others express a loose association. For example, consider two triplets in the knowledge base:
- <Luis-Enrique> <managerOf> <Barcelona-FC>.
- <Luis-Suarez> <playFor> <Barcelona-FC>.
Here, there are two relations, <managerOf> and <playFor>. The relation <managerOf> expresses a closer association than <playFor>, because a team has only one manager at a given time but may have many players. Therefore, we assign <managerOf> a higher weight than <playFor>. For this reason, from the above triplets we conclude that <Barcelona-FC> has a higher similarity with <Luis-Enrique> than with <Luis-Suarez>. The weight of a relation is in the range (0, 1]. The Path Relation Weight of a whole path $P$ is computed from the weights of the relations on the path, as defined in [3].

Relation weight between two entities based on ontology and knowledge base
Combining the Path Length Weight and the Path Relation Weight by a pair of coefficients, we define the path rank of a semantic path as their weighted combination. The path rank is also the similarity value between the two entities based on the ontology and knowledge base.

Relation weight between entities based on statistics of co-occurrence in the same news items
Following the idea of Capelle et al. on the PMI measure [6], if two entities co-occur in the same news items many times, then these two entities have high similarity to each other. We count the co-occurrences of named entity pairs in a dataset of football news to calculate the PMI weights. The formula is defined as below:

$$PMI(e_1, e_2) = \log \frac{c(e_1, e_2) / N}{\left(c(e_1) / N\right) \times \left(c(e_2) / N\right)}$$

In which:
- $N$ is the number of news items available in the dataset.
- $c(e_1, e_2)$ is the number of news items in the dataset in which the two entities $e_1$ and $e_2$ co-occur.
- $c(e_1)$ is the number of news items in the dataset containing entity $e_1$, and $c(e_2)$ is the number of news items in the dataset containing entity $e_2$.
As such, for any entity pair, we have two values for calculating the relation weight: the weight $W_{onto}$ (calculated based on semantic paths) and the weight $W_{PMI}$ (calculated based on co-occurrence statistics of entity pairs). Before combining these two weights, we normalize each of them as below:

$$W' = \frac{W - W_{min}}{W_{max} - W_{min}}$$

where $W_{max}$ and $W_{min}$ are respectively the maximum and minimum values in the series of values of $W$.
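Finally, we combine these two values by a pair of coefficients $\beta_1$ and $\beta_2$ to calculate the similarity of each entity pair as below:

$$sim(e_1, e_2) = \frac{\beta_1 \times W'_{onto} + \beta_2 \times W'_{PMI}}{\beta_1 + \beta_2}$$

where $W'_{onto}$ and $W'_{PMI}$ are the normalized weights based on semantic paths and on co-occurrence statistics respectively. By convention, when $e_1 \equiv e_2$, $sim(e_1, e_2) = 1$.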
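The three steps of this subsection (PMI from counts, min-max normalization, weighted combination) can be sketched as follows. This is an illustrative reading of the text, not the BKSport code: the function names are hypothetical, the PMI is the standard logarithmic form, pairs that never co-occur are rejected rather than scored, the handling of a constant series in `min_max` is an assumption, and the default coefficients `a = b = 1` are placeholders.

```python
import math


def pmi(n_items, count_both, count_1, count_2):
    """Pointwise mutual information of two entities from news-item counts."""
    if count_both <= 0:
        # Assumption: pairs that never co-occur are handled by the caller.
        raise ValueError("PMI is undefined for entities that never co-occur")
    p_both = count_both / n_items
    return math.log(p_both / ((count_1 / n_items) * (count_2 / n_items)))


def min_max(values):
    """Rescale a series of weights to [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant non-zero series maps to 1, a zero series to 0.
        return [1.0 if hi != 0 else 0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def entity_similarity(w_onto, w_pmi, a=1.0, b=1.0):
    """Weighted average of the two normalized relation weights."""
    return (a * w_onto + b * w_pmi) / (a + b)
```

Dividing by `a + b` keeps the result in [0, 1] whenever both normalized inputs are in [0, 1], which makes it directly comparable with the convention that an entity has similarity 1 with itself.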

Method for calculating similarity between news items based on relation between entities
First, we define the set of entities related to an entity $e$ as the set of entities whose similarity with $e$ is greater than 0:

$$R(e) = \{ e' \mid sim(e, e') > 0 \}$$

Suppose there is a news item A. The set of named entities recognized in news item A is denoted as below:

$$A = \{ a_1, a_2, a_3, \ldots, a_n \}$$

For each entity $a_i$ in set A, we build the set $R(a_i)$ of entities related to $a_i$. Grouping all sets $R(a_i)$ together ($i: 1 \to n$), we obtain the set of all entities not included in A but related to A:

$$R = \left( \bigcup_{i=1}^{n} R(a_i) \right) \setminus A$$

Finally, we merge the two sets A and R to obtain the expansion set of news item A:

$$E = A \cup R$$

In the next step, we calculate a ranking value for each entity in the set $E$. Each ranking value characterizes the relevance of the corresponding entity to news item A. These ranking values should satisfy the following properties:
- (1) The more times an entity appears in the news item, the greater its ranking value.
- (2) The more entities in the news item an entity is related to, the greater its ranking value.
- (3) The ranking value also depends on the position in which the entity appears in the news item.
Regarding property (3), an entity can appear in different positions of the news item: title, description, bolder-text (bold text, image title, etc.) and content. We assign an importance weight to each of these positions. To calculate the ranking value for each entity in the set $E$, following the Ranked Semantic Recommendation 2 technique [8], we represent the entities in a matrix in which the first row lists the entities in the set $E$ and the first column lists the entities in the set A. The value of each cell of the matrix is calculated using the importance weight of the corresponding entity in the news item. This importance weight is calculated as follows. Suppose $e$ is an entity appearing in the news item, and $n_t$, $n_d$, $n_b$, $n_c$ are respectively the numbers of occurrences of $e$ in the title, description, bolder-text and content of the news item. The importance weight of $e$ is defined from these four counts and the importance weights of the corresponding positions. Finally, following the formula defined in [8], the ranking weight of each entity in the set $E$ is calculated from the matrix. Assume $V$ is a vector containing the ranking values calculated above. We normalize the value $v_i$ of each element of $V$ to the range [0, 1]:

$$v'_i = \frac{v_i - MIN}{MAX - MIN}$$

where MAX and MIN are respectively the maximum and minimum values of the elements of $V$. If $MAX = MIN \ne 0$, then $v'_i = 1$ for every $i$. As a result, the steps above produce one vector for each news item. The final step is to calculate the similarity between any two news items based on their vectors.
Suppose we have two news items A and B with two corresponding vectors $V_A$ and $V_B$. Because these two vectors can have different numbers of dimensions, we define the similarity between the two vectors (which is also the similarity between the two news items A and B) as a variation of cosine similarity, in which the terms are the ranking values of the entities in the vectors $V_A$ and $V_B$.
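The procedure of this subsection can be sketched end to end as below. Several details are assumptions made only for the illustration and are labeled as such in the code: the position weights are invented values, the importance weight is taken to be a weighted sum of per-position occurrence counts, the ranking value of an entity is taken to be the sum of importance times similarity over the entities of the news item, and the cosine variation treats an entity missing from one vector as 0. All function names are hypothetical.

```python
import math

# Hypothetical position weights; the values used in the paper are not reproduced here.
POSITION_WEIGHTS = {"title": 4.0, "description": 3.0, "bolder": 2.0, "content": 1.0}


def importance(counts):
    """Assumed form: weighted sum of the occurrence counts per position."""
    return sum(w * counts.get(pos, 0) for pos, w in POSITION_WEIGHTS.items())


def normalize(ranks):
    """Min-max normalization; if MAX = MIN != 0 every value becomes 1."""
    if not ranks:
        return {}
    lo, hi = min(ranks.values()), max(ranks.values())
    if hi == lo:
        return {e: (1.0 if hi != 0 else 0.0) for e in ranks}
    return {e: (v - lo) / (hi - lo) for e, v in ranks.items()}


def rank_vector(item_entities, related):
    """item_entities: {entity: {position: count}}; related: {entity: {other: sim > 0}}.

    Assumed ranking: every entity of the expansion set accumulates
    importance(a) * sim(a, e) over the entities a of the item, with sim(a, a) = 1.
    """
    ranks = {}
    for a, counts in item_entities.items():
        w = importance(counts)
        ranks[a] = ranks.get(a, 0.0) + w
        for e, sim in related.get(a, {}).items():
            if e != a:
                ranks[e] = ranks.get(e, 0.0) + w * sim
    return normalize(ranks)


def vector_similarity(u, v):
    """Cosine over entity vectors of different sizes; missing entities count as 0."""
    dot = sum(u[e] * v[e] for e in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0
```

Under these assumptions an entity gains rank when it occurs more often, when it occurs in a more prominent position, and when more entities of the item are related to it, which matches properties (1) to (3).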

Types of entities appeared in the news items
A reader who is interested in a subject is more likely to be also interested in other subjects of the same type. For example, if a reader is reading the news about football teams, then that reader tends to continue reading other news items about football teams rather than news items about players or stadiums. Therefore, if two news items have similarity in the types of entities, similarity of these two news items will be higher.

Fig. 1. An example of similarity between news based on types of entities in the news
In the ontology, each named entity defined in the knowledge base belongs to a certain object class. These classes can be regarded as the types of entities. For example, the two entities Lionel Messi and Luis Suarez in the knowledge base have the same type, because they both belong to the class FootballPlayer; however, neither has the same type as the entity Barcelona-FC, because this entity belongs to FootballTeam. Counting the entity types appearing in news items is similar to counting entities, except that two different entities can be of the same type. The position in which entities appear also affects the association weight between an entity type and the corresponding news item. These weights are calculated based on the frequency and position of appearance of the entities of that type. Suppose we calculate the association weight of an entity type $T$ for a news item $A$. Given the entities of class $T$ that appear in news item $A$, the association weight of entity type $T$ with news item $A$ is defined from the occurrences of these entities and their positions. We build a vector for news item $A$ whose elements are these weights, in the same way as the entity-based vector in Section 3.1.1.3. The elements of each vector are normalized before applying the variation of the cosine formula for calculating similarity between vectors used in Section 3.1.1.3. The resulting value is the type-based similarity between the two news items.
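One simple way to obtain such type weights is to add up, per class, the importance weights of the entities of that class found in the news item. The sketch below assumes exactly that aggregation, which may differ from the formula used by the authors; `type_weights` and its two dictionary arguments are hypothetical names.

```python
def type_weights(entity_importance, entity_type):
    """Aggregate entity importance weights per ontology class.

    entity_importance: {entity: importance weight in the news item}
    entity_type: {entity: class name in the ontology}
    Assumption: the association weight of a type is the sum over its entities.
    """
    weights = {}
    for entity, weight in entity_importance.items():
        cls = entity_type[entity]
        weights[cls] = weights.get(cls, 0.0) + weight
    return weights
```

The resulting dictionary plays the same role as the entity vector of Section 3.1.1.3, so it can be normalized and compared in the same way.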

Semantic annotations of the news items
Semantic annotations here are triplets of the form <subject> <predicate> <object>, in which the subject and object are two entities. These semantic annotations also play an important role because they partly represent the content that the news item is about.

Fig. 2. An example of similarity between news items based on semantic annotations of news
A news item may contain many triplets, and a triplet may appear several times. Triplets that appear several times in a news item are important triplets that reflect its main content. Moreover, the position in which these triplets appear in the news item also expresses their importance. The importance of positions in the news item (title, description, bolder-text, content) is the same as presented in the previous section. The more triplets two news items have in common, the higher their similarity.
For each triplet, we denote by $n_t$, $n_d$, $n_b$, $n_c$ the numbers of occurrences of this triplet in the title, description, bolder-text and content respectively. We use the same formula as the one for calculating the importance weight of entities in Section 3.1.1.3 to compute the importance weight of each triplet in the news item. We then represent these weights as elements of a vector and use the vector normalization formula to bring them into the range [0, 1]. To calculate the similarity between news items based on semantic annotations, we use the variation of the cosine formula described in Section 3.1.1.3 to compare the two vectors. The resulting value is the annotation-based similarity between the two news items. Thus, we use three parameters to determine the semantic similarity between news items, based on the following factors:
- relations between named entities,
- types of entities in the news items,
- semantic annotations of the news items.
Each of these three parameters has a different meaning in determining the semantic similarity between news items. We combine them to obtain the final value expressing the semantic similarity between two news items. To do so, we use a set of three coefficients that express the level of importance of each of the above parameters, and we define the semantic similarity between two news items as the combination of the three parameters weighted by these coefficients.

Content-based similarity
With a news recommendation method that uses only semantic similarity as proposed above, we may encounter some problems:
- Insufficient or incorrect identification of the named entities that appear in the news item.
- Insufficient semantic annotations of the news item.
These limitations are caused by limited information in the ontology and knowledge base. This is unavoidable, since the construction of the ontology and knowledge base must be done manually or semi-automatically and therefore requires a lot of effort. Furthermore, the evolution of real-world knowledge, for example when new players arrive or players change clubs, makes it difficult to keep them up to date. To overcome these limitations, we combine the proposed semantic similarity with the content similarity of two news items.
In this section we describe the content-based similarity, which is computed using the TF-IDF weights of words in the news item combined with the cosine measure. Words with high TF-IDF weight are often important words that reflect the main content of the news item, so we are only interested in words with high TF-IDF weight. The steps to build the set of important words of a news item are:
- Step 1: Eliminate stop words. Stop words are words that carry no meaning for representing the content of the news, such as "a", "an", "the", etc.
- Step 2: Normalize words to their base form. Verbs and nouns often occur in many different forms depending on the context while still expressing the same meaning, for example "make", "makes" and "made". We therefore convert them to their base form.
- Step 3: Calculate TF-IDF for each word in the news item (after normalization in Step 2).
- Step 4: Sort the words and select the top words with the highest TF-IDF based on a defined threshold.
After these steps, we obtain a set of words with the highest TF-IDF. We represent the news item as a vector whose values are the TF-IDF values of the words in this set. The similarity between two news items A and B, with two important-word sets $S_A$, $S_B$ and two corresponding vectors, is calculated with a variation of the cosine formula over the corresponding words in the two sets.
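Steps 1 to 4 can be sketched as a single function. The sketch makes several simplifying assumptions: the stop-word list contains only the three examples from the text, the base-form lookup is a tiny hypothetical table standing in for a real lemmatizer, the document frequencies are supplied by the caller (words missing from the table are treated as occurring in one document), and the threshold of Step 4 is interpreted as a fixed number of words `top_k`.

```python
import math
from collections import Counter

STOP_WORDS = {"a", "an", "the"}                 # examples from the text only
BASE_FORMS = {"makes": "make", "made": "make"}  # hypothetical stand-in for a lemmatizer


def important_words(tokens, doc_freq, n_docs, top_k):
    """Return the top_k words of a news item with their TF-IDF values."""
    words = [t.lower() for t in tokens]
    words = [w for w in words if w not in STOP_WORDS]   # Step 1
    words = [BASE_FORMS.get(w, w) for w in words]       # Step 2
    counts = Counter(words)
    scores = {                                          # Step 3
        w: c * math.log(n_docs / doc_freq.get(w, 1)) for w, c in counts.items()
    }
    # Step 4: highest TF-IDF first, ties broken alphabetically for determinism.
    top = sorted(scores, key=lambda w: (-scores[w], w))[:top_k]
    return {w: scores[w] for w in top}
```

The returned dictionary is a sparse vector, so two news items can be compared with a cosine measure in which words outside a set contribute 0.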

News recommendation algorithm with combined similarity
To combine the semantic similarity with the content similarity of two news items, we use a pair of weights and define the combined similarity as their weighted linear combination. We analyze the time complexity of the recommendation algorithm for a target news item A and a dataset C as follows. Assume that $n_t$ is the average number of tokens in a news item and $n$ is the number of news items in dataset C. In step 1, the complexity of named entity recognition and semantic annotation of a news item is $O(n_c n_t)$, where $n_c$ is the total number of classes, entities and properties in the ontology and knowledge base. Therefore, for $n$ news items in the set C and a news item A, the time complexity of step 1 is $O(n n_c n_t)$.
Step 2 transforms the $n+1$ news items into TF-IDF vectors. Because the IDF of all tokens in the dictionary is computed before running the algorithm, the time complexity of transforming a news item into a TF-IDF vector equals the time complexity of calculating the TF values for all tokens in that news item, $O(n_t)$. Consequently, the complexity of step 2 is $O(n n_t)$. Step 3 is repeated $n$ times, once for each element of C. Steps 3.1 to 3.4 multiply pairs of TF-IDF vectors, so the time complexity of each iteration is $O(n_t)$ and the time complexity of step 3 is $O(n n_t)$. The time complexity of the sorting algorithm in step 4 is $O(n \log n)$. As a result, the time complexity of the proposed algorithm is $O(n n_c n_t + n \log n)$.
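The scoring and sorting part of the algorithm can be sketched as follows, with the two similarity measures passed in as functions. This is an illustrative outline under stated assumptions: the names `combined_similarity` and `recommend` are hypothetical, the combination is normalized by the sum of the weights (which does not change the ranking), and the default weights 1 and 2 mirror the setting reported in the experiments on the reading that the larger weight belongs to content similarity.

```python
def combined_similarity(semantic, content, w_semantic=1.0, w_content=2.0):
    """Weighted linear combination of semantic and content similarity."""
    return (w_semantic * semantic + w_content * content) / (w_semantic + w_content)


def recommend(target, candidates, semantic_sim, content_sim, k):
    """Return the k candidates most similar to the target news item.

    semantic_sim and content_sim are callables taking (target, candidate).
    Scoring every candidate is linear in their number; the sort adds O(n log n).
    """
    scored = [
        (combined_similarity(semantic_sim(target, c), content_sim(target, c)), c)
        for c in candidates
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:k]]
```

Setting one of the two weights to 0 reduces the function to the purely semantic or purely content-based recommender, which is convenient when comparing the three variants.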

Experiment scenario
The goal of this section is to evaluate and compare the effectiveness of three news recommendation methods:
- Using only semantic similarity between news items.
- Using only content similarity between news items.
- Combining both similarities.
The evaluation of the different methods is performed by measuring precision. Because we have not yet built an online system, we use an offline evaluation method. For the offline evaluation, we chose N = 100 news items (denoted as set $S$) from a number of well-known sports websites such as http://www.skysports.com/ , http://www.espnfcasia.com/ , http://sports.yahoo.com/ and then asked collaborators to rate each news item as relevant or non-relevant to each other one. As a result, we have an experiment dataset in which each news item has $k$ ($0 \le k \le N - 1$) related news items and $(N - 1 - k)$ unrelated news items. We run each of the above methods separately for each news item in set $S$, generate the $k$ news items with the highest similarity to it, and compare them with the news items that the collaborators identified in the experiment dataset. For example, for the news item $s_1$, if the collaborators find 5 related news items among the remaining 99, then the algorithm also generates 5 news items, which are compared with the 5 items identified by the collaborators. We use the following notation:
- $TP_i$ is the number of news items that the algorithm correctly recommends for news item $s_i$.
- $FP_i$ is the number of news items that the algorithm incorrectly recommends for news item $s_i$.
- $FN_i$ is the number of related news items that the algorithm does not recommend for news item $s_i$.
We define the precision for a news item $s_i$ using the following formula:

$$P_i = \frac{TP_i}{TP_i + FP_i}$$

With the procedure we follow, we obtain $FP_i = FN_i$, and hence precision equals recall: $P_i = R_i = \frac{TP_i}{TP_i + FN_i}$. Therefore, we only consider precision when evaluating the above methods. Finally, we define the final precision of a method as the average of the precisions over all news items in the experiment dataset.
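The evaluation protocol can be sketched as below. The function names and the shape of the `relevance` dictionary are assumptions for this illustration, and so is the choice to skip news items for which the collaborators found no related item, since precision is undefined when nothing is recommended.

```python
def precision(recommended, relevant):
    """Fraction of recommended news items that the collaborators marked as related."""
    if not recommended:
        return 0.0
    hits = len(set(recommended) & set(relevant))
    return hits / len(recommended)


def mean_precision(recommend, relevance):
    """Average precision of a recommender over the experiment dataset.

    recommend: callable (item, k) -> list of k recommended items
    relevance: {item: set of items rated as related by the collaborators}
    """
    scores = []
    for item, related in relevance.items():
        if not related:
            continue  # assumption: items without related news are skipped
        recommended = recommend(item, len(related))
        scores.append(precision(recommended, related))
    return sum(scores) / len(scores) if scores else 0.0
```

Because the recommender is asked for exactly as many items as the collaborators marked as related, every wrong recommendation corresponds to one missed related item, which is why precision and recall coincide in this setup.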

Experiment parameters
Certain parameters are used to determine the importance of the components when they are combined. In this experiment, we set the values of the parameters based entirely on our own judgment. For instance:
- The weights of relations in the ontology used to calculate the Path Relation Weight were assigned based on our perception of the relevance of each relation, with values such as 0.8, 0.6, 0.5, …
- Two parameters are used when combining the semantic similarity measure and the content similarity measure between news items. As we consider content similarity more important than semantic similarity in news recommendation, we set the weight of semantic similarity to 1 and the weight of content similarity to 2.

Experiment results and evaluation
After running the three methods separately on the set containing 100 news items, following the experiment scenario presented in Section 4.1, we obtain the precision of each method shown in Table 1. Table 1 indicates that, for the experiment data containing 100 news items, the semantic-based recommendation method is not as precise as the content-based recommendation method. Meanwhile, combining the content-based and semantic-based similarity methods gives the best results. This can be explained as follows:

Assessment of experiment results
- When using only semantic-based similarity (the semantic-based approach), the recommendation depends mainly on the entities in the news items. Therefore, in some cases, the algorithm recommends news items about relevant entities but on a completely different topic. Some collaborators rate such items as irrelevant.
- With the content-based approach, the topic of the recommended news item is usually quite close to that of the target news item. However, this method cannot expand the topic. If we have two news items about the Barcelona club, in which the first is about the club's play and the second is about the transfer of the club's players, the content-based approach determines that the similarity of these news items is low.
- When the content-based similarity and the semantic-based similarity are combined, the recommendation overcomes the limitations of each separate measure, leading to more effective recommendation.

Conclusions and future work
In this research, we presented a recommendation method based on the combination of the content-based similarity and the semantic-based similarity of news items. The semantic-based measure is calculated from the semantic relations among objects. It enables the recommender not only to suggest news items on a similar topic or news items centered on a key object of the target news item, but also to recommend news items about other objects that have a semantic relation with objects in the target news item. However, this similarity measure focuses mainly on entities and does not consider the context mentioned in the news item. The content-based measure overcomes this weakness of the semantic-based measure by extracting from the news item the words with the highest TF-IDF values, which characterize the main context of the news item. We evaluated and compared the precision of the proposed method with that of the recommendation methods using either measure separately. The experimental results showed that combining the two similarities exploits the strengths of both and compensates for the weaknesses of each, ultimately yielding better recommendations. However, the proposed method still has some limitations, such as its dependency on the adequacy of the knowledge base and ontology. Determining the weights so that the combination of the measures achieves the highest effectiveness is also a difficult open problem for the method.