Combining Statistical Information and Semantic Similarity for Short Text Feature Extension

A short text feature extension method combining statistical information and semantic similarity is proposed. First, after defining the contribution of words and mutual information, an associated word-pair set is generated by comparing each pair's mutual information with a threshold; this set is then used as the query set for HowNet. For each word pair, the senses are retrieved from the HowNet knowledge base and the semantic similarity of the query pair is calculated. If the common sememes satisfy the similarity condition, they are added to the original term vector as extended features; otherwise, the semantic relationship is computed and the corresponding sememe is expanded into the feature set. Repeating this process yields the extended feature set. Experimental results show the effectiveness of our method.


Introduction
With the explosion of new network media and online communication, short texts in diverse forms, such as news titles, micro-blogs, and instant messages, have become the mainstream of information exchange. Most traditional classification methods perform poorly on short texts and fail to accomplish the task effectively. Therefore, how to classify massive amounts of short text efficiently has become a research focus.
Recently, new short text classification methods have appeared. Kim [1] proposed a novel language-independent semantic (LIS) kernel, which can effectively compute the similarity between short text documents. Wang [2] presented a method that tackles the data sparseness problem by building a strong feature thesaurus (SFT) based on latent Dirichlet allocation (LDA) and information gain (IG) models. The methods mentioned above mainly focus on the concepts and correlations of texts to obtain their logical structure, so their classification performance improves only slightly. Yuan [3] presented a short text feature extension method based on frequent term sets, but its larger search space results in higher time complexity; in particular, as the scale of the background knowledge increases, the dimension of the feature word set grows dramatically.
A short text feature extension method combining statistical information and semantic similarity is proposed to overcome the above drawbacks. The flowchart is shown in Figure 1. In this section, we briefly introduce related knowledge from two aspects: the contribution of words and mutual information.

Contribution of Words
We define the contribution [4] of the word w to the document d as

contr(w, d) = f(w, d) / fmax(d),

where f(w, d) is the number of occurrences of the word w in document d, and fmax(d) is the maximum number of occurrences of any word in document d. The contribution of the word w to the class Ck can then be defined as the sum of the contributions of w to all documents in Ck:

CONTR(w, Ck) = Σ_{d ∈ Ck} contr(w, d).

For different values of k, CONTR(w, Ck) denotes the contribution of the same feature to different categories.
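The two quantities above can be sketched directly from their definitions. The code below is a minimal illustration, assuming documents are already tokenised into word lists; function names are ours, not from the paper.

```python
from collections import Counter

def contribution(word, doc_tokens):
    # contr(w, d) = f(w, d) / fmax(d): the count of w in the document,
    # normalised by the count of the document's most frequent word
    counts = Counter(doc_tokens)
    if not counts:
        return 0.0
    return counts[word] / max(counts.values())

def class_contribution(word, class_docs):
    # CONTR(w, Ck) = sum of contr(w, d) over all documents d in class Ck
    return sum(contribution(word, doc) for doc in class_docs)
```

For example, with `class_docs = [["cat", "cat", "dog"], ["cat", "fish"]]`, the word "cat" contributes 1.0 in each document, giving CONTR = 2.0, while "dog" contributes only 0.5 in total.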

Mutual Information
Let T = {w1, w2} denote a word pair. We can compute the mutual information [5] between the word pair T and the classification system C according to the following formula:

MI(T, C) = H(C) − H(C|T),

where H(C) is the entropy of the whole classification system C, and H(C|T) is the conditional entropy of C given the word pair T.

Feature extension algorithm and weight computing Semantic Similarity in HowNet
HowNet [6] is a common-sense knowledge base that describes the relationships between concepts, as well as between concepts and attributes. Suppose there are two words w1 and w2, with n and m senses respectively: S1 = {s11, s12, …, s1n}, S2 = {s21, s22, …, s2m}. The word similarity [7] of w1 and w2 is the maximum similarity over all sense pairs (s1i, s2j):

ss(w1, w2) = max_{1≤i≤n, 1≤j≤m} sim(s1i, s2j).

It follows that if ss(w1, w2) > β, the intersection CS of S1 and S2 is not empty. The model is shown in figure 2.
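The maximum-over-sense-pairs definition can be sketched as below. The sense similarity function `sense_sim` stands in for the HowNet sememe-based similarity computation [7], which is not reproduced here:

```python
def word_similarity(senses1, senses2, sense_sim):
    # ss(w1, w2) = max over all sense pairs (s1i, s2j) of sim(s1i, s2j)
    return max(sense_sim(s1, s2) for s1 in senses1 for s2 in senses2)

def common_senses(senses1, senses2):
    # CS: the intersection of the two words' sense sets
    return set(senses1) & set(senses2)
```

With a toy similarity that scores identical senses 1.0 and distinct senses 0.3, two words sharing the sense "money" get ss = 1.0 and CS = {"money"}.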

Fig. 2. Sense relationship of word-pairs
White circles denote the senses of w1, black triangles represent the senses of w2, and white triangles are the common senses of w1 and w2.

Feature Extension Algorithm
The goal of expanding the short text feature set is to describe the topic and content of texts as accurately as possible. Following this principle, we propose a new method named FEASS (feature extension algorithm based on semantic similarity).
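The overall FEASS loop described in the abstract, from word-pair filtering by mutual information through sememe expansion, can be sketched as follows. This is only an illustration of the control flow: `mi`, `senses_of`, and `sense_sim` are hypothetical stand-ins for the mutual information computation and the HowNet lookups, and the fallback branch that expands sememes via semantic relationships is omitted for brevity.

```python
def extend_features(term_vector, word_pairs, mi, senses_of, sense_sim,
                    mi_threshold=0.10, beta=0.25):
    # Sketch of the FEASS feature-extension loop (illustrative only).
    extended = list(term_vector)
    for w1, w2 in word_pairs:
        # keep only associated word pairs whose MI exceeds the threshold
        if mi((w1, w2)) <= mi_threshold:
            continue
        # look up the senses of each word (stand-in for the HowNet query)
        s1, s2 = senses_of(w1), senses_of(w2)
        # ss(w1, w2): maximum similarity over all sense pairs
        if max(sense_sim(a, b) for a in s1 for b in s2) > beta:
            # similarity condition met: add the common sememes as features
            for s in set(s1) & set(s2):
                if s not in extended:
                    extended.append(s)
    return extended
```

With toy lookups where "bank" has senses {money, river} and "finance" has {money}, the associated pair ("bank", "finance") adds the shared sense "money" to the feature vector.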

Experiment Results and Analysis
We conduct three experiments with an SVM classifier to evaluate our method; the experimental setup and results are described in detail in the following subsections.

Dataset
The experimental data in this paper comes from the China Knowledge Resource Integrated Database (CNKI); we collected 35,603 article titles published between 2013 and 2015. For each class, two thirds of the article titles are kept as training samples and the remaining third are used as test samples.
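The per-class two-thirds/one-third split can be sketched as below; the exact ordering of samples within each class is our assumption, as the paper does not specify it:

```python
from collections import defaultdict

def split_by_class(samples):
    # samples: list of (title, label) pairs; keep two thirds of each
    # class for training and the remaining third for testing
    by_class = defaultdict(list)
    for title, label in samples:
        by_class[label].append((title, label))
    train, test = [], []
    for items in by_class.values():
        cut = (2 * len(items)) // 3
        train.extend(items[:cut])
        test.extend(items[cut:])
    return train, test
```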

Experiment and Analysis
The Influence of Different Parameters.
We ran experiments on the SVM classifier with different values of the parameters α, β and δ, and several representative results are shown in Table 1.
The results are best when α = 0.10, β = 0.05 and δ = 0.25. Classification performance is very poor when the values of α and β are small. There are two reasons for this: redundant features have not been screened out, and some uninformative words are extended into the feature set. Conversely, classification efficiency declines when the values of the two parameters are large.
The Efficiency of Classification before and after Feature Expansion.
Fig. 3 shows the results of our method before and after feature extension on the SVM classifier. Our method achieves F-measure improvements of 4.24%, 2.90%, 3.39%, 7.78%, 5.85%, 7.17%, 8.66%, 3.23%, 6.06% and 4.58% for Finance, Geology, Oceanography, Math, Astronomy, Agriculture, Biology, Physics, Medical science, and Computer respectively. Good precision results are achieved as well. In this part, we also compare the performance of FEASS with FEMFTS [8] (Feature Extension Method using Frequent Term Sets) and SCTCEFE [9] (Short Text Classification Considering Effective Feature Expansion), both state-of-the-art short text feature extension approaches. Figures 4 and 5 show the precision and F-measure of the FEASS algorithm on the SVM classifier. The best F-measure reaches 84.52 and the best precision 83.67, so our algorithm performs slightly better than FEMFTS and SCTCEFE.
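For reference, the F-measure reported throughout this section is the harmonic mean of precision and recall:

```python
def f_measure(precision, recall):
    # F-measure (F1): harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)
```

When precision and recall are equal, the F-measure equals them both; it is pulled toward the smaller of the two otherwise.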

Conclusion
In this paper, we propose a feature set extension algorithm for short text classification. Our method achieves good performance, which shows that feature compensation with the aid of an external knowledge base is feasible. In the future, we plan to study how to find the 'key' information for expansion in the corpus while adding as little noise as possible to the feature set, so as to achieve effective 'extension'.