Fine-Grained Privacy Setting Prediction Using a Privacy Attitude Questionnaire and Machine Learning

This paper proposes to recommend privacy settings to users of social networks (SNs) depending on the topic of the post. Based on the answers to a specifically designed questionnaire, machine learning is utilized to inform a user privacy model. The model then provides, for each post, an individual recommendation to which groups of other SN users the post in question should be disclosed. We conducted a pre-study to find out which friend groups typically exist and which topics are discussed. We explain the concept of the machine learning approach, and demonstrate in a validation study that the generated privacy recommendations are precise and perceived as highly plausible by SN users.


Introduction
The tradeoff between privacy and utility in a social network (SN) has been a research problem from the beginning, since SNs are largely used in public. Still, there is no acceptable solution that provides an optimal tradeoff between privacy and utility while keeping the user burden at a minimum. Social network providers tried to tackle this problem by introducing friend lists or circles. Users create one or more lists containing a subset of their online friends, and publish a new post exactly to the people inside these lists. Still, the SN users have the burden of manually setting the appropriate privacy setting for each of these groups in order to achieve a perfect privacy setting. Recent studies have shown that only 17% of all posted content is shared using friend lists [5].
We argue that every single post needs its own privacy setting and should only be disclosed to a specific list of users, depending on the topic of the post. To decrease the user burden, the privacy settings should be derived automatically, for example by using a machine learning approach. Although most social networks like Facebook or Google+ only allow a binary decision on the privacy settings (i.e., to disclose or not), we think that a user's decision on privacy is ultimately not binary. An SN user does not only think "I do not at all want my drinking buddies to know that I am a ballet dancer as a hobby" or "I would really like my co-dancers to see the pictures of that ballet contest". There are also some groups of people, like university friends, where a user would say "It is OK if they see it. I do not want to completely cut them off from that information, but I also do not want to draw too much attention to it". In this case, the user would take some middle road, for example by sharing the post and the pictures with the university friends, but hiding them from their timelines.

Related work
Several publications in the past have offered questionnaires to capture privacy attitudes. Starting with Westin scales [3] as a very general form of questionnaire, newer questionnaires like the IUIPC [4] provide a very specific privacy attitude regarding privacy towards online companies. Wisniewski et al. [10] created a privacy scale to observe how social connectedness corresponds with a user's privacy desires on a social network, which we also included in our questionnaire.
There are also other systems that use machine learning for the prediction of privacy settings, for example by labeling some of the friends with privacy permissions and using a supervised learning approach [6,7,2]. Other approaches additionally take the post content into account, by using latent Dirichlet allocation (LDA) and maximum entropy to predict settings for a new post based on the privacy settings chosen in earlier posts [8]. Although the idea seems promising, research has shown that privacy behavior in online social networks does not correspond to actual privacy desires; this is known as the privacy paradox [1]. We therefore decided to capture the privacy attitude using a distinct privacy questionnaire rather than trying to extract it from the user's SN behavior. Furthermore, all approaches so far rely on a binary decision (disclose/undisclose) for a privacy setting, whereas our approach offers five distinct privacy levels.

Approach
In a final implementation of our approach, the post topic is extracted and shown on the left side of the interface (Figure 1), while the proposed privacy settings for a selection of friend groups are displayed on the right side. As stated in the introduction, the proposed privacy settings are not only disclose/undisclose, but five different privacy levels: On level 1, everything is disclosed and shown on the wall. Level 2 means the content does not appear on the recipients' news wall, whereas level 3 completely hides comments and graphical content. Level 4 hides the entire post, and level 5 also hides it from the recipient's direct friends, so it cannot be propagated to the recipient by word of mouth. What exactly is hidden is denoted by the small pictograms next to each friend group.
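The five levels described above can be represented as an ordered enumeration. The following is only a minimal sketch: the member names and the helper `to_level` are hypothetical labels chosen for illustration, and rounding a continuous score to the nearest level is an assumption, since the text does not state how a continuous prediction is mapped to a level.

```python
from enum import IntEnum


class PrivacyLevel(IntEnum):
    """Five disclosure levels; names are illustrative, not from the study."""
    SHOW_ON_WALL = 1                    # everything disclosed and shown on the wall
    HIDE_FROM_NEWS_WALL = 2             # not shown on the recipients' news wall
    HIDE_COMMENTS_AND_MEDIA = 3         # comments and graphical content hidden
    HIDE_POST = 4                       # entire post hidden
    HIDE_FROM_RECIPIENT_FRIENDS = 5     # also hidden from the recipient's direct friends


def to_level(score: float) -> PrivacyLevel:
    """Map a continuous score to the nearest level (assumed rounding rule)."""
    clamped = min(5.0, max(1.0, score))
    # Adding 0.5 and truncating rounds halves upward, e.g. 2.5 -> 3.
    return PrivacyLevel(int(clamped + 0.5))


print(to_level(2.4).name)
```

Using an `IntEnum` keeps the levels comparable, so a stricter setting is simply a larger value.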
For suggesting the permissions, we use a machine learning technique called ridge regression. As input features, we use the measures calculated from the answers to the aforementioned two questionnaires [4,10] and the topic of the post, or only the questionnaire answers (called "generic" in Table 1). As an output, we receive for every friend group a privacy level between 1 and 5.
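To make the input and output of the regression concrete, the following sketch fits a multi-output ridge regression in closed form with NumPy. It is not the implementation used in the studies: the one-hot encoding of the topic, the number of questionnaire measures, the regularization strength `alpha`, the clipping to the range 1 to 5, and the randomly generated data are all assumptions made for illustration.

```python
import numpy as np

# Topics and friend groups as reported in the pre-study.
TOPICS = ["family affairs", "events", "movies", "politics", "food",
          "work", "hobbies", "travel", "music", "sports"]
FRIEND_GROUPS = ["extended family", "immediate family", "work friends",
                 "close friends", "acquaintances", "school/university friends",
                 "online friends", "sports friends"]


def build_features(questionnaire, topic=None):
    """Questionnaire measures plus a one-hot topic vector (assumed encoding).

    With topic=None only the questionnaire measures are used ("generic").
    """
    q = np.asarray(questionnaire, dtype=float)
    if topic is None:
        return q
    one_hot = np.zeros(len(TOPICS))
    one_hot[TOPICS.index(topic)] = 1.0
    return np.concatenate([q, one_hot])


def fit_ridge(X, Y, alpha=1.0):
    """Closed-form ridge regression with an unpenalized intercept."""
    x_mean, y_mean = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - x_mean, Y - y_mean
    W = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ Yc)
    return W, y_mean - x_mean @ W


def predict_levels(x, W, b):
    """One privacy level per friend group, clipped to the range 1..5."""
    return np.clip(x @ W + b, 1.0, 5.0)


# Synthetic stand-in data; the real questionnaire answers are not reproduced here.
rng = np.random.default_rng(0)
n_samples, n_measures = 200, 6  # hypothetical sizes
Q = rng.uniform(1.0, 7.0, size=(n_samples, n_measures))
topic_ids = rng.integers(0, len(TOPICS), size=n_samples)
X = np.stack([build_features(q, TOPICS[t]) for q, t in zip(Q, topic_ids)])
Y = rng.integers(1, 6, size=(n_samples, len(FRIEND_GROUPS))).astype(float)

W, b = fit_ridge(X, Y, alpha=1.0)
prediction = predict_levels(build_features(Q[0], "travel"), W, b)
for group, level in zip(FRIEND_GROUPS, prediction):
    print(f"{group}: {level:.2f}")
```

Each column of `W` corresponds to one friend group, so a single fit yields a recommendation for all groups at once.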
We performed three user studies to first find out which topics are most frequently discussed in people's social activities (online and offline) and which friend groups exist; second, to gather training data for the machine learning algorithm and to validate its precision; and third, to validate the approach in a scenario as realistic as possible, introducing the proposed settings of our machine learning prediction to Facebook users. All studies were performed using online questionnaires; participants were recruited using Prolific Academic, an online recruiting portal similar to Amazon Mechanical Turk.
Fig. 1. Envisioned user interface concept of a privacy setting prediction system.
For the first study, we asked 15 participants to list their friend groups and the most frequently discussed topics in their social life in free-text form. We merged the answers using an axial coding approach [9]. The most frequent topics were (in descending order) family affairs, events, movies, politics, food, work, hobbies, travel, music and sports. The most frequently mentioned friend groups were extended and immediate family, work friends, close friends, acquaintances and school/university friends, as well as online and sports friends.
In the second study ("main study"), we let 100 participants first answer the two aforementioned questionnaires, followed by a matrix where they had to enter a privacy level for each topic/friend group pair. We trained and validated the regression with a ten-fold cross validation. The mean squared error (MSE) between the prediction and the actual result can be found in Table 1.
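The evaluation procedure can be sketched as follows: the data is split into ten folds, the regression is trained on nine of them, and the mean squared error is computed on the held-out fold. The helper names, the row-wise fold assignment, the value of `alpha`, and the synthetic data are assumptions for illustration; the text does not specify whether folds were formed per participant or per row.

```python
import numpy as np


def fit_ridge(X, Y, alpha=1.0):
    """Closed-form ridge regression with an unpenalized intercept."""
    x_mean, y_mean = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - x_mean, Y - y_mean
    W = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ Yc)
    return W, y_mean - x_mean @ W


def kfold_mse(X, Y, k=10, alpha=1.0, seed=0):
    """Average held-out mean squared error over k folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)
    errors = []
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        W, b = fit_ridge(X[train_idx], Y[train_idx], alpha)
        residual = X[test_idx] @ W + b - Y[test_idx]
        errors.append(np.mean(residual ** 2))
    return float(np.mean(errors))


# Synthetic stand-in data: a noisy linear relation, not the study data.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
true_W = rng.normal(size=(5, 3))
Y = X @ true_W + 3.0 + 0.1 * rng.normal(size=(100, 3))

mse = kfold_mse(X, Y, k=10, alpha=1.0)
print(f"ten-fold cross-validated MSE: {mse:.4f}")
```

Restricting `X` and `Y` to the rows of a single topic before calling `kfold_mse` would give a per-topic error.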
For the third study, called the "validation study", 31 participants again filled out the two privacy questionnaires in the first part. This time, however, we let them copy and paste ten of their own Facebook posts that match our list of topics, and enter the topic of each post into the questionnaire. The website then proposed a privacy setting, using the ridge regression trained with the data of the main study. The participants were asked to adapt the settings if needed, and to answer on a five-point Likert scale whether they would use the system on Facebook. Again we calculated the mean squared error between the adapted and the proposed settings. 67% of the participants stated that they would likely or very likely use our system, supporting the design of our approach. The results in Table 1 show that the trends are similar for both studies: For almost all topics, we can achieve a mean squared error of less than one. Hobbies, travel and family are predicted best, whereas sports and politics are hardest to predict. For sports, this may be due to the diverse nature of the topic, where the exact sport affects whether it is likely to be shared or not; posts about football are more common and socially accepted than posts about ballet, for example. Politics and work are also hard to predict by privacy attitude; this could be because here the political interest or the job itself, rather than a pure privacy attitude, affects whether users want to share their thoughts. A professor is more likely to share his work with a community than a cleaner would be.

Lessons learned and future work
We did background research to find friend groups and topics that are present in people's online and offline social life, and conducted two studies to find out whether it is possible to propose fine-grained privacy settings based on privacy attitude and the topic of the post. We tested and evaluated in two different scenarios: In the main study, users had no proposed setting and had to enter their desired setting without support. In contrast, in the validation study they were given a proposed setting that they could adapt if needed. In both cases, we achieved an acceptable precision for most of the topics. Nevertheless, there are some topics like work and politics that seem not to depend on the privacy attitude, but rather on the actual occupation or political interest of the person.
Instead of a binary decision, our approach supports five privacy levels of disclosure, which offer to show only parts of the post to some friend groups, such as only the textual content without images or comments. For this study, we used an example implementation of the privacy levels. In future work, we would like to conduct further studies to determine which parts of the post users would hide depending on the post's sensitivity, and how an optimal implementation of the levels looks. Finally, we would like to offer a prototype of the proposed interface as a Facebook plugin, to be able to check whether the achieved prediction precision is sufficient for everyday use, and whether the tool is accepted by users.