FamilyID: A Hybrid Approach to Identify Family Information from Microblogs

With the growing popularity of social networks, an extremely large number of users routinely post messages about their daily lives to online social networking services. In particular, we have observed that family-related information, including some very sensitive information, is freely available and easily extracted from Twitter. In this paper, we present a hybrid information retrieval mechanism, namely FamilyID, to identify and extract family-related information about a user from his/her microblogs (tweets). The proposed model takes into account part-of-speech tagging, pattern matching, lexical similarity, and semantic similarity of the tweets. Experimental results show that FamilyID provides both high precision and recall. We expect the project to serve as a warning to users that they may have accidentally revealed too much personal/family information to the public. It could also help microblog users evaluate the amount of information that they have already revealed.


Introduction
With the growing popularity of online social networks (OSNs), the amount of publicly available data has increased many-fold. This data includes personal, employment, education, relationship, and family-related information. Figure 1 shows a microblog example: a tweet that was broadcast to the public and effectively reveals the Twitter ID, birthdate, and last name of the poster's mother.
Numerous commercial products and research projects have been developed to discover user information from online social networking data. Such information is used to improve the accuracy of advertisement delivery, to make sensible suggestions to users, and to predict events or trends. Moreover, the media industry (radio, movie, television) now depends heavily on feedback from public OSN data for market study, user preference analysis, hot topic identification, etc. Although such products/projects may benefit both OSN providers and end users, they pose significant privacy threats to all users, many of whom are unaware of such threats. An online stalker with limited hacking capability but ample time can effectively figure out many details about a targeted user from this publicly available data. For instance, a message with birthday or anniversary wishes can expose a user's age, date of birth, and family information.
Extracting family-related information from Twitter is challenging: (1) it is cumbersome to manually identify such posts, as we have discovered that less than 1% of the tweets are family-related; and (2) although it is possible to develop an automated mechanism to identify family-related tweets, the task is nontrivial, due to the size of the data, the use of short text and informal language, and the large number of synonyms. In this paper, we present FamilyID, a multi-phase approach that automatically identifies family-related information from publicly available Twitter data. Our algorithm considers multiple features of tweets, including part-of-speech tagging, term distribution similarity, and semantic similarity. Experimental results show that FamilyID produces good accuracy.
The key contributions of this paper are: (1) We make the first attempt to automatically identify family-related microblogs. Such microblogs usually disclose sensitive personal information, and they are the primary targets for both adversaries and defenders. (2) The proposed mechanisms exploit multiple lexical and semantic features, with a good balance of efficiency and precision. Our approach can handle large amounts of data and provide relatively high accuracy.

Related Work
Private Information Disclosure. People may publicize private information for social advantages [7]. Users' privacy settings often violate their sharing intentions, and users are unable or unwilling to fix the errors [11]. [13] explores three types of private information disclosed in the textual content of tweets. Impersonation attacks are proposed in [2] to steal private (friends-only) attributes.
Information Aggregation Attacks. Information aggregation attacks were introduced in [8,10,17]: a significant amount of private information is recovered when small pieces of information submitted by users are associated. [1] confirms that a significant number of user profiles from multiple SNSs can be linked by email addresses.
Inference Attacks. Hidden attributes are inferred from friends' attributes with a Bayesian network [4]. [5] developed a model to predict a user's birth year (i.e., age). Unknown user attributes can be accurately inferred when the attributes of as few as 20% of the users are known [14]. Friendship links and group membership information can be used to identify users [16] or infer sensitive hidden attributes [18].
Microblog Mining. Knowledge discovery in social networks is a hot research area. For instance, methods have been proposed to identify user attributes, such as gender, age, location [12], location type [9], activities [6], personalities [15], etc. There are also proposals to make predictions based on information and activities in social networks, e.g., to predict stock rates based on user tweets [3].

Problem Definition and Solution Overview
The goal of this research is to identify family-related posts from microblogs. Due to the volume of the data, manually reading each tweet and classifying it is an almost impossible task. Formally, the objective of the research is: For each tweet, efficiently and accurately identify whether it is related to one or more family members of the message owner (the user who posted the message).

Fig. 2. Overview of the FamilyID approach
As illustrated in Fig. 2, we first use a customized crawler to collect user information and messages from Twitter. Each message is pre-processed to remove all the special characters and other unwanted contents, such as multimedia data (images, audio and video files). Each message from a user (denoted as the owner of the account/tweet) is processed through three steps: pattern matching, lexical (phrase) similarity measurement, and semantic similarity measurement. These steps are used to predict the likelihood of each tweet being family-related.

Data Collection
Using the twitter4j API, we have collected information on 150 Twitter users, including username, screen name, friends (follower and following) lists, tweets, and tweet timestamps. Twitter does not have the concept of friends. Hence, we consider the intersection of the followers list and the following list as the friends list. We have randomly selected users with the following criteria: (1) Users with more than 1500 followers are omitted, as they have higher chances of being celebrities. Tweets of celebrities are not used in this research, since they demonstrate significantly different styles and contents from tweets of regular users. (2) Users with fewer than 2000 tweets are not crawled. (3) Users with a majority of tweets in languages other than English are discarded.
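The friends-list construction and the three selection criteria can be sketched as follows. This is a minimal illustration, not the crawler used in this work: the `User` record, its field names, and the precomputed `english_ratio` are assumptions introduced only for the example.

```python
from dataclasses import dataclass, field


@dataclass
class User:
    # Hypothetical record; field names are illustrative, not the crawler's schema.
    screen_name: str
    followers: set = field(default_factory=set)
    following: set = field(default_factory=set)
    tweet_count: int = 0
    english_ratio: float = 1.0  # assumed precomputed share of English tweets


def friends(user):
    """Friends are accounts that both follow and are followed by the user."""
    return user.followers & user.following


def is_eligible(user, max_followers=1500, min_tweets=2000):
    """Apply the three selection criteria described in the text."""
    if len(user.followers) > max_followers:  # (1) likely a celebrity
        return False
    if user.tweet_count < min_tweets:        # (2) too few tweets
        return False
    if user.english_ratio < 0.5:             # (3) majority of tweets not English
        return False
    return True


alice = User("alice", {"bob", "carol", "dave"}, {"carol", "dave", "erin"}, 2500, 0.9)
print(sorted(friends(alice)), is_eligible(alice))
```

Using sets makes the intersection a single operation and keeps the eligibility check independent of how the follower lists were crawled.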

Pre-Processing
Messages from Twitter are extremely noisy. We develop several heuristics to preprocess raw tweets: (1) Term Expansion. Twitter users like to use abbreviations and very informal terms that do not exist in the dictionary. Hence, we construct a Twitter term expansion table for family-related terms (some examples are shown in Table 1).
(2) URL Truncation. Tweets sometimes have URLs embedded in them. Since these URLs are not utilized in the pattern matching, lexical similarity, or semantic similarity assessments, we remove all URLs. (3) Stop Words. FamilyID does not remove stop words, since words like "my" and "our" are important in predicting family relationships. (4) Special Characters. All characters other than English letters and digits are removed. Although we do not process numbers, we keep them for future use, e.g., to identify patterns related to years.
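The four pre-processing heuristics can be sketched as a single function. This is a minimal illustration under stated assumptions: the entries in `EXPANSIONS` are hypothetical stand-ins for the expansion table (Table 1 is not reproduced here), and lowercasing the text is a simplification of this sketch rather than a step described above.

```python
import re

# Hypothetical expansion entries; the actual table in the paper may differ.
EXPANSIONS = {"bro": "brother", "sis": "sister", "hubby": "husband"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
NON_ALNUM_RE = re.compile(r"[^A-Za-z0-9\s]")


def preprocess(tweet):
    """Clean one raw tweet and return a space-separated token string."""
    text = URL_RE.sub(" ", tweet)        # (2) remove embedded URLs
    text = NON_ALNUM_RE.sub(" ", text)   # (4) drop special characters, keep digits
    tokens = text.lower().split()        # lowercasing is an assumption of this sketch
    # (1) expand informal family terms; (3) stop words such as "my" are kept
    tokens = [EXPANSIONS.get(tok, tok) for tok in tokens]
    return " ".join(tokens)


print(preprocess("Happy bday to my sis!! http://t.co/abc #1990"))
# happy bday to my sister 1990
```

URLs are removed before special characters so that the URL pattern still sees the punctuation it relies on.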

Pattern Extraction and Matching
In Sects. 3.4-3.6, we present a series of operations to identify family-related tweets. The design philosophy is to first employ computationally inexpensive methods to eliminate the majority of irrelevant tweets, and then refine the results with methods that are more effective but expensive.
Iterative Pattern Discovery. The first step in family-related tweet identification is to discover natural language patterns that are highly likely to mention family member(s). We first employ the Stanford NLP tagger for part-of-speech tagging on all crawled tweets. Next, we extract N-gram histograms (N = 2, 3, 4) across the dataset to collect the common patterns containing family terms. Pattern discovery is performed in an iterative manner: for each discovered pattern, we attempt to relax it, and validate the relaxed pattern on the dataset.
Pattern Matching. Every POS-tagged tweet is matched against the seed patterns. A tweet that matches a pattern potentially contains family-related information. Note that pattern matching is only the first filter in the whole process. It produces many noisy outputs, since many phrases could match one of our seed patterns. For instance, phrases such as "my dear dog" and "my sweet neighbor" match the PRP$ JJ NN pattern, although they have nothing to do with family members.
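The N-gram histogram and the seed-pattern filter can be sketched as follows. This is a minimal illustration that assumes tweets have already been tagged as lists of (word, tag) pairs with Penn Treebank tags; the `FAMILY_TERMS` set is hypothetical, and only the PRP$ JJ NN pattern mentioned in the text is used as a seed.

```python
from collections import Counter

FAMILY_TERMS = {"sister", "brother", "mother", "father"}  # hypothetical term list
SEED_PATTERNS = [("PRP$", "JJ", "NN")]  # the one seed pattern named in the text


def tag_ngrams(tagged, n):
    """Yield (start_index, tag_tuple) for every window of n tokens."""
    for i in range(len(tagged) - n + 1):
        yield i, tuple(tag for _, tag in tagged[i:i + n])


def pattern_histogram(tagged_tweets, sizes=(2, 3, 4)):
    """Count tag N-grams whose window contains at least one family term."""
    counts = Counter()
    for tagged in tagged_tweets:
        for n in sizes:
            for i, tags in tag_ngrams(tagged, n):
                words = {word.lower() for word, _ in tagged[i:i + n]}
                if words & FAMILY_TERMS:
                    counts[tags] += 1
    return counts


def matches_seed(tagged, patterns=SEED_PATTERNS):
    """Return the phrases of a tagged tweet that match any seed pattern."""
    hits = []
    for pattern in patterns:
        for i, tags in tag_ngrams(tagged, len(pattern)):
            if tags == pattern:
                hits.append(" ".join(word for word, _ in tagged[i:i + len(pattern)]))
    return hits


tweet = [("my", "PRP$"), ("dear", "JJ"), ("dog", "NN")]
print(matches_seed(tweet))  # ['my dear dog']
```

The last line reproduces the noise described above: "my dear dog" passes this filter even though it is not family-related, which is why the later phases are needed.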

Lexical Similarity Assessment
This phase determines whether a pattern-matched tweet contains family-related words. We first create a seed tweet set covering all possible relationships and frequent non-relationship components from the patterns. We then employ the UMBC ebiquity text similarity system to calculate the lexical similarities for pairs of tweets. The Stanford WebBase corpus is used to find possible synonyms of the given words. Table 2 shows some examples of similarities computed in FamilyID. Lexical similarity assessment effectively eliminates most of the noise from pattern matching. In particular, messages such as "my dog" and "my neighbors" are effectively eliminated. However, tweets such as "my dear dog is my best companion" pass both the pattern matching phase ("my dear dog" matches PRP$ JJ NN) and the lexical similarity assessment phase, due to the existence of the terms "dear", "best", and "companion". Since such tweets are clearly not family-related, we need another layer of semantic analysis to handle them.

Semantic Similarity Assessment
Semantic similarity assessment, which is relatively slow, is the last step to remove irrelevant tweets that have passed through the first two filters.
To generate a seed set for this model, we first take a seed such as "my little sister" and run the sliding window algorithm on it. This is a recurring model that matches patterns in windows of length up to 5. It replaces each word in the seed in turn and finds substitutions for that word.
To calculate semantic similarity, we employ the UMBC GetStsSim API. This API takes two text snippets and returns a value between 0 and 1 as a similarity measure. Every candidate tweet is compared with the seed tweets to measure the pairwise semantic similarity. As shown in Table 3, a similarity score of 0.75 or above indicates an almost perfect match, while a similarity score of 0.6 or above indicates relatively similar texts. Tweets whose highest similarity score exceeds the threshold are finally labeled as family-related. As shown in the previous example, the tweet "my dear dog is my best companion" passes the first two phases. When we evaluate its semantic similarity with the seed tweets in this phase, the highest similarity score is 0.33, which indicates that it is not similar to any of the seeds. In this way, this message is labeled as non-family-related.
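The seed expansion and the threshold decision can be sketched as follows. This is a minimal illustration under stated assumptions: the substitution table is hypothetical, the 0.6 default threshold is chosen only for the example, and `token_overlap` is a toy stand-in for a similarity function, not the UMBC service, whose scores it does not reproduce.

```python
def expand_seed(seed, substitutions):
    """Generate seed variants by replacing one word at a time."""
    words = seed.split()
    variants = [seed]
    for i, word in enumerate(words):
        for alt in substitutions.get(word, []):
            variants.append(" ".join(words[:i] + [alt] + words[i + 1:]))
    return variants


def token_overlap(a, b):
    """Toy stand-in similarity in [0, 1] (Jaccard overlap of tokens)."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def label_tweet(tweet, seeds, similarity=token_overlap, threshold=0.6):
    """Return (is_family_related, best_score); seeds must be non-empty."""
    best = max(similarity(tweet, seed) for seed in seeds)
    return best > threshold, best


# Hypothetical substitution table for illustration only.
subs = {"little": ["big"], "sister": ["brother", "mother"]}
seeds = expand_seed("my little sister", subs)
print(seeds)
print(label_tweet("my little brother", seeds))
```

Passing the similarity function as a parameter keeps the decision rule separate from the scoring backend, so a real semantic similarity service could be plugged in without changing `label_tweet`.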

Experimental Results
Tweet Identification. First, we have performed tweet identification on the collected dataset (150 Twitter users, more than 450,000 tweets). On average, FamilyID has identified approximately 30 tweets from each user as family-related, as shown in Fig. 3 (users are sorted by total number of tweets crawled). Less than 1% of the tweets are identified as related to family members. These include a small number of false positives (to be discussed later). From the numbers and from inspecting the identified tweets, we have found that the results reflect our previous observations: (1) For most Twitter users, family-related tweets are very sparse. It is extremely time-consuming, if not impossible, to manually identify such tweets. (2) The identified family-related tweets almost always bring additional information about the family members, including the relationship, Twitter username, date of birth, age, interests, etc.
Comparing with Keyword-based Retrieval. To evaluate the effectiveness of FamilyID in reducing false positives, we compare it with a keyword-based approach that identifies family-related tweets with keyword spotting. That is, when a pre-selected relationship keyword (e.g., "sister", "mother", the same as we used in Sect. 3.4) is found in a tweet, the tweet is labeled as "family-related". In order to manually examine the results, we perform keyword-based retrieval on 75 randomly selected users. We have evaluated 225,886 tweets. Keyword-based retrieval has found 6,121 tweets to be family-related, while FamilyID has identified 2,301 of them as family-related. Note that due to the selection of the keywords, each tweet identified by keyword spotting is a candidate tweet in FamilyID. Therefore, more than 62% of the tweets containing family-related keywords are identified as irrelevant to family relationships through content-based analysis in FamilyID. We have further manually examined such irrelevant tweets, and found that more than 90% of them are true negatives (not relevant to family members). This also indicates that the precision of the keyword spotting approach is low, since it includes a large number of non-family tweets.
Precision. We invite human evaluators to examine the tweets identified as family-related from 50 random users, to determine whether each tweet is truly related to family members. As the most important performance metric of FamilyID, the precision is defined as $\mathrm{Precision} = TP / P$, where $TP$ indicates the number of true positives (tweets labeled as family-related that are determined to be family-related by human evaluators), and $P$ indicates the number of positives (tweets labeled as family-related by FamilyID).
The evaluators have examined 1346 tweets that are identified as family-related by FamilyID. They have found 1110 tweets to be true positives. Therefore, the precision of FamilyID is 83%. Table 4 shows examples of true/false positives. The precision is high, especially considering the difficulty of the task. For some tweets, the human evaluator could hardly determine whether they are family-related. For instance, for the message "When one of my boys tells me he's in love", the evaluator has referred to many other posts from the user, to find that she is a teacher and is very likely talking about a student instead of a child. However, the evaluator is less confident about the verdict.
Finally, we would like to point out that we have not evaluated the overall recall of FamilyID, for two reasons: (1) the size of the data set (450K tweets in total) makes it infeasible to manually examine all tweets; and (2) due to the heavy use of urban slang, abbreviations, and short texts, it is difficult even for human evaluators to determine whether some of the tweets are family-related.
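Both the share of keyword hits filtered out in the keyword comparison and the precision metric are simple ratios. A minimal sketch of the arithmetic, using only the two counts reported for the keyword comparison (the function and variable names are illustrative):

```python
def precision(true_positives, positives):
    """Precision = TP / P; undefined when no tweet is labeled positive."""
    if positives == 0:
        raise ValueError("precision is undefined when P = 0")
    return true_positives / positives


keyword_hits = 6121   # tweets flagged by keyword spotting (from the text)
familyid_hits = 2301  # of those, tweets FamilyID keeps as family-related

# Share of keyword hits that FamilyID rejects as not family-related.
filtered_share = 1 - familyid_hits / keyword_hits
print(f"{filtered_share:.1%}")  # 62.4%
```

The printed value matches the "more than 62%" figure stated in the comparison.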

Conclusion
With the growing popularity of online social networks, large amounts of private information have been voluntarily posted to the Internet. Attackers could stalk a targeted user and attempt to extract such private information. However, manually identifying family-related tweets that are scattered among millions of microblog posts is very labor-intensive. The FamilyID project demonstrates the capabilities of an automated mechanism to identify family-related microblogs and extract family member information from them. By utilizing lexical and semantic features in a multi-phase approach, we are able to achieve high accuracy. Moreover, most of the identified tweets carry additional (very sensitive) information about the family, such as birthdates, hobbies,