An Approach to Modelling User Interests Using TF-IDF and Fuzzy Sets Qualitative Comparative Analysis

Modelling and understanding user interests are particularly important tasks for designing services and building systems for customised solutions in web personalisation and recommender systems. User-generated content (UGC) constitutes a significant source of information for capturing user interests. This paper suggests an approach to user profiling that analyses the Term Frequency (TF) and the Inverse Document Frequency (IDF) of selected tourism services by utilising the Fuzzy set Qualitative Comparative Analysis (FsQCA). It analyses a sample of customer reviews collected from tourism web sites. The amount of money that customers spent during their hotel stay is taken as the outcome set in the FsQCA analysis. The results produce causal combinations of services that are necessary and sufficient for building customer interest models that best lead to the outcome, and argue for the applicability of the FsQCA in modelling user interests.


Introduction
Recommender systems (RC) utilise techniques ranging from statistics to AI and machine learning in order to capture user interests, build user and product/service profiles, and suggest the most appropriate products or services to users. RC draw on several methods for developing user preference models, with user-generated content (UGC) representing a rich source of customer information [1,2]. Since social media platforms allow users to exchange experiences, feedback, opinions, complaints, etc., they provide significant information for capturing and understanding user interests [3]. Web personalisation is another area where user profiling is necessary, both for developing customised web interfaces and for supporting personalised search [4], which allows users to retrieve search results according to their personal needs.

User Profiling in Tourism
Building user interest models has also been the focus of e-tourism research. Drawing on the analysis of behavioural, socio-economic and demographic data, several researchers have shed light on people's travel behaviour [3]. Indeed, surveys on travellers' preferences have shown that the travel selection process is complex, depending, among others, on personality and mood-related factors, service quality issues, Word-Of-Mouth (WOM) and electronic WOM (eWOM). Customers often express their experience by publishing reviews. Sentiment analysis of user reviews provides the means for capturing and modelling users' preferences, emotions and attitudes, thus refining market segmentation by grouping customers with similar needs and incentives, and predicting customers' travel behaviour more precisely [5].
Collantes and Mokhtarian [6] claim that a variety of factors, such as personality traits, travel-related behaviours, lifestyle characteristics and travel trends, determine the subjective assessment of travelling and tourism services. Other researchers have noticed that travel behaviour is influenced by travel experiences and feelings [7,8]. It is also argued that it is important to analyse human behaviour characteristics in order to understand how customers react to alternative transport policies [9]. Other travel research studies have analysed environmental factors that influence travel and tourism. Stradling and Anable [10] argue that environmental characteristics, such as workplace, shops and site topography, affect travel choices.
Several approaches have been proposed for building user interest models. Kim and Chan [11] have proposed a hierarchical model for representing user interests. The user profile is constructed by analysing documents that users have visited on the web. The analysis of the documents yields a list of user interests, which are subsequently grouped according to their similarity in the hierarchical interest model. It is argued that there exist four classes of information contexts that need to be specified when attempting to understand user interests [12]. The general information class refers to personal characteristics such as the name, contact details and demographics of the user.
The event class represents the user's activities. The preference class refers to the user's interests. The social network class explains the user's connections and interactions with other users. The preference class is usually discovered by analysing various sources, such as relevant documents that the user has published [12,13].
Several approaches have been proposed for representing user interests. The three most frequent formats are keywords, semantic networks and concept-based representations [14,15]. Keywords representing domains of interest are associated with weights indicating the strength of the user's interest in a particular topic. Polysemy and synonymy are problems associated with keywords. Semantic networks address these problems by representing keywords as nodes that are connected with each other, including through co-occurrences. Concept-based representations resemble semantic networks in structure, but their nodes represent abstract topics rather than keywords [14,15]. User profiles can be used in various ways: during personalised information retrieval, when a system detects relevant documents and information according to users' interests; when re-evaluating the relevance of documents, taking into consideration which documents a user has retrieved; and during query processing, when a user query can be modified based on user interests [16].
It is argued that filtering and clustering techniques are very useful in reducing the number of concepts found on the web so that they can be used in formulating user profiles. However, [16] argues that these techniques lack effectiveness, for they produce the same structure of interests for users with different needs. Research shows that while many systems produce and use user profiles, e.g. in web personalisation and recommender systems, there exists no definitive procedure for deriving user interests [16-19]. This paper addresses the need for investigating alternative ways of developing user interest models and suggests analysing the TF-IDF with the FsQCA.

Methodology
The aim of the paper is to identify the causal combinations that are necessary and sufficient to represent customer interests. This paper utilises the FsQCA in order to analyse the TF and IDF of UGC and produce the causal combinations that best lead to an outcome. The FsQCA is particularly useful for investigating intertwined relationships between multiple factors that affect a dependent variable or contribute to the realisation of a certain outcome [20]. The FsQCA analyses the set relationships among causes; in FsQCA, variables are modelled as sets. FsQCA models allow a detailed analysis of how alternative conditions of causes combine and contribute to high membership scores of the outcome [21]. FsQCA may detect multiple paths, i.e. alternative causal combinations that can lead to high levels of the same outcome [20,22].
Data in this paper is collected from customer reviews published on hotel web sites. Causal combinations may be represented by tourism service terms, such as room, view, cleanliness, etc., in the set of selected documents. The outcome set in this paper is represented by the large amount of money spent by the customer. Other outcome sets can also be considered. Thus, this paper aims to identify the combinations of customer hotel service interests that best reflect customers' spending. A sample of the data collected is analysed in this paper. The steps of the methodology are shown below:
1. Select documents published by user $u_i$.
2. Identify the terms that will constitute the causal combinations and specify the term that will represent the outcome set.
3. Calculate the TF and the IDF for each identified term.
For two fuzzy sets $\tilde{A}$ and $\tilde{B}$, the fuzzy union is defined as

$$\mu_{\tilde{A} \cup \tilde{B}}(x) = \max\big(\mu_{\tilde{A}}(x), \mu_{\tilde{B}}(x)\big),$$

the fuzzy intersection is defined as

$$\mu_{\tilde{A} \cap \tilde{B}}(x) = \min\big(\mu_{\tilde{A}}(x), \mu_{\tilde{B}}(x)\big),$$

and the fuzzy complement is calculated as

$$\mu_{\tilde{A}^{c}}(x) = 1 - \mu_{\tilde{A}}(x).$$

6. Calculate the consistency and the coverage of the solutions using formulas (2) and (3) respectively.


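$$\text{Consistency}(X \le Y) = \frac{\sum \min\big(\mu(X), \mu(Y)\big)}{\sum \mu(X)}$$

$$\text{Coverage}(X \le Y) = \frac{\sum \min\big(\mu(X), \mu(Y)\big)}{\sum \mu(Y)}$$

where $\mu(X)$ is the membership degree of each causal combination and $\mu(Y)$ is the membership degree of the outcome set.
7. Identify the best combinations by selecting those that exhibit a consistency rate above a threshold (set at 0.8 in this paper) and the highest possible coverage. Simplify the solutions into the final set of causal combinations.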
The final causal combinations indicate the hotel services that customers who spend a large amount of money consider most important.
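As an illustration of how the consistency and coverage measures work together with the threshold rule, the following Python sketch computes both for a single causal combination. The membership degrees are hypothetical (they are not taken from the tables of this paper) and the function names are chosen for illustration only. The sketch assumes the usual FsQCA definitions: the sum of min(X, Y) divided by the sum of X for consistency, and by the sum of Y for coverage.

```python
def consistency(x, y):
    """Sum of min(x_i, y_i) over the sum of x_i (degree to which X is a subset of Y)."""
    return sum(min(a, b) for a, b in zip(x, y)) / sum(x)


def coverage(x, y):
    """Sum of min(x_i, y_i) over the sum of y_i (share of Y accounted for by X)."""
    return sum(min(a, b) for a, b in zip(x, y)) / sum(y)


# Hypothetical membership degrees for five customers (illustration only).
x = [0.2, 0.6, 0.4, 0.8, 0.0]  # causal combination X
y = [0.5, 0.4, 0.9, 0.7, 0.3]  # outcome set Y

THRESHOLD = 0.8  # consistency threshold used in this paper
cons = consistency(x, y)
cov = coverage(x, y)
is_candidate = cons >= THRESHOLD  # candidates are then ranked by coverage
print(f"consistency={cons:.3f}, coverage={cov:.3f}, candidate={is_candidate}")
```

A combination whose membership degrees are all zero would make the denominator of the consistency zero, so a practical implementation would need to guard against that case.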

Data Analysis: Illustrative Example
This paper analyses reviews collected from five (5) hotel customers. For simplicity, five (5) terms representing hotel services are selected from the total set of terms identified in the reviews. The outcome set, large amount of money spent (LMSp) by each user during his/her hotel stay, is represented by triangular fuzzy numbers (TFN). The membership function can be calculated according to the following equation [25]:

$$\mu(x) = \begin{cases} 0, & x \le a \\ \dfrac{x - a}{m - a}, & a < x \le m \\ \dfrac{b - x}{b - m}, & m < x < b \\ 0, & x \ge b \end{cases}$$

where a, m, b are real numbers. The linguistic scales used and their corresponding TFNs adopted in this study are shown in Table 1. For each user $u_i$, the weights for each term then result from using formula (1). The results are shown in Table 2.
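The two inputs described above, the outcome membership and the term weights, can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the TFN (100, 200, 300), the sample reviews and the function names are hypothetical, and the TF-IDF variant shown (relative term frequency multiplied by log(N/df)) is a common textbook form that may differ from formula (1).

```python
import math


def triangular_membership(x, a, m, b):
    """Membership degree of x in the triangular fuzzy number (a, m, b), with a < m < b."""
    if x <= a or x >= b:
        return 0.0
    if x <= m:
        return (x - a) / (m - a)
    return (b - x) / (b - m)


def tf_idf(documents, terms):
    """Return one {term: weight} dict per document.

    Assumed variant: tf = count / document length, idf = log(N / df).
    """
    tokenised = [doc.lower().split() for doc in documents]
    n_docs = len(tokenised)
    # Document frequency: number of documents that contain each term.
    df = {t: sum(1 for tokens in tokenised if t in tokens) for t in terms}
    weights = []
    for tokens in tokenised:
        row = {}
        for t in terms:
            tf = tokens.count(t) / len(tokens) if tokens else 0.0
            idf = math.log(n_docs / df[t]) if df[t] else 0.0
            row[t] = tf * idf
        weights.append(row)
    return weights


# Hypothetical TFN for the outcome set (not taken from Table 1).
outcome_degree = triangular_membership(150, 100, 200, 300)

# Hypothetical reviews, one per user u_i.
reviews = [
    "quiet room with a great view",
    "friendly staff and a good restaurant",
    "great view from the restaurant",
]
weights = tf_idf(reviews, ["view", "restaurant", "staff"])
```

Weights produced this way are not guaranteed to lie in [0, 1], so any rescaling needed before they are treated as fuzzy membership degrees is outside the scope of this sketch.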
Table 3 shows part of the truth table. The cells in the truth table take the value (1) or (0), representing true or false. Thus, permutation number 3 is read (Quietness=false, Sea View=false, Staff Friendliness=false, Cultural Activities=true, Restaurant=false). Next, the membership degrees of all combinations are calculated for each user, drawing on fuzzy set operations theory. Table 4 shows the membership degrees for the first 17 combinations.
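The construction of the truth table and the membership degree of a combination can be sketched in Python as follows. The sketch assumes that rows are enumerated in binary counting order starting from row 1, which is consistent with the reading of permutation number 3 given above, and that a combination's membership is the fuzzy intersection (minimum) of the term weights, with the complement 1 - w taken for absent terms. The weights are hypothetical and are not taken from Table 2.

```python
from itertools import product

TERMS = ["Quietness", "Sea View", "Staff Friendliness",
         "Cultural Activities", "Restaurant"]

# All 2^5 = 32 rows in binary counting order; row numbers start at 1.
truth_table = list(product([0, 1], repeat=len(TERMS)))


def combination_membership(row, weights):
    """Fuzzy AND over the terms: w where the term is present (1),
    the complement 1 - w where it is absent (0)."""
    return min(w if flag else 1.0 - w for flag, w in zip(row, weights))


# Hypothetical term weights in [0, 1] for one customer.
weights = [0.25, 0.5, 0.75, 1.0, 0.5]

row_3 = truth_table[2]  # permutation number 3
degree = combination_membership(row_3, weights)
print(dict(zip(TERMS, row_3)), degree)
```

Under the same ordering assumption, `truth_table[11]` and `truth_table[15]` correspond to permutation numbers 12 and 16.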
Therefore, the consistency for combination number 3 is 0.733.
Regarding the coverage, by applying formula (6), $\sum \min(X, Y) = 1.5$ and $\sum Y = 2.9$, thus coverage = 0.37. According to FsQCA, the best causal combinations should exhibit as high a consistency and coverage as possible. However, the higher the consistency, the lower the coverage tends to be. Assuming a threshold value of 0.8 for the consistency first, and then the highest possible coverage, the analysis results in two causal combinations: combinations number 12 and 16, extracted from Table 3, are shown in Table 6. A closer look at the combinations reveals that "quietness" is not among the customers' interests at all; it is not a necessary service. Thus, restructuring the causal combinations, the analysis shows that customers who spend a large amount of money show interest in
➢ (Sea View) AND (Staff friendliness) AND (Cultural activities) AND (Restaurant) OR
➢ (Sea View) AND (Cultural activities) AND (Restaurant).
In order to simplify the causal combinations, "staff friendliness" could be omitted, for it does not appear in both combinations.
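The simplification above is an instance of Boolean absorption, which can be checked exhaustively with a short Python sketch. The function names are chosen for illustration only, and the check treats the four services as crisp (true/false) conditions.

```python
from itertools import product


def two_path_solution(sea_view, staff, culture, restaurant):
    """(Sea View AND Staff AND Culture AND Restaurant) OR (Sea View AND Culture AND Restaurant)."""
    return (sea_view and staff and culture and restaurant) or \
           (sea_view and culture and restaurant)


def simplified_solution(sea_view, staff, culture, restaurant):
    """Sea View AND Culture AND Restaurant; staff friendliness is ignored."""
    return sea_view and culture and restaurant


# Compare both expressions on all 2^4 = 16 crisp assignments.
equivalent = all(
    bool(two_path_solution(*row)) == bool(simplified_solution(*row))
    for row in product([False, True], repeat=4)
)
print(equivalent)
```

The same result holds under the min/max fuzzy operators, since the minimum over four membership degrees can never exceed the minimum over three of them.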

Conclusions-Future Research
This study suggests that the FsQCA can be used for modelling users' interests. Data selected from customer reviews is analysed by utilising the TF and the IDF. The application of the FsQCA yields useful insights that can be used to understand customer priorities and build customer profiles. Future research can focus on examining the applicability of the FsQCA to handling multiple outcome sets and to specifying term priorities. When the FsQCA method is applied to large data sets with a long list of factors, the truth table and the set of possible causal combinations can become cumbersome to analyse, since the truth table doubles in size with every additional term. Thus, future research can also focus on combining the FsQCA with other techniques for pruning the truth table and reducing the causal combinations to a manageable size.
Apply the FsQCA and produce user interests causal combinations.
a. Produce the truth table of all possible permutations of the terms considered. Each permutation is a possible causal combination.
b. Calculate membership degrees for each combination. The calculation is performed drawing on fuzzy set operations theory for two fuzzy sets $\tilde{A}$ and $\tilde{B}$.

Table 1. Linguistic scales and corresponding TFNs for the Large Amount of Money Spent fuzzy sets

Table 2. The term weights and the membership degree for money spent for each customer

Table 3. The truth table (part of), showing all possible permutations of the terms

Table 4. Membership degrees of the combinations for each customer

Table 5. Consistency and coverage of the causal combinations

Table 6. The two necessary and sufficient causal combinations