Lessons Learned from Honeypots - Statistical Analysis of Logins and Passwords

Honeypots are unconventional tools for studying the methods, tools and goals of attackers. In addition to IP addresses, timestamps and counts of attacks, these tools collect combinations of logins and passwords. Analysis of data collected by honeypots can therefore offer a different view of logins and passwords. In this paper, advanced statistical methods and correlations with spatially oriented data are applied to obtain more detailed information about logins and passwords. We also use the chi-square test of independence to study the difference between logins and passwords. In addition, we study the agreement between the structure of passwords and logins using the kappa statistic.


Introduction
In the current information society we face increasing security threats, so the protection of information is an important part of information security. Common security tools, methods and techniques used in the past are ineffective against new security threats, and it is therefore necessary to choose other tools and techniques. Network forensics tools, especially honeypots and honeynets, appear to be very useful. The use of the word "honeypot" is quite recent [1]; however, honeypots have been used in computer systems for more than twenty years. A honeypot can be defined as a computing resource whose value lies in being attacked [2]. Lance Spitzner defines a honeypot as an information system resource whose value lies in unauthorized or illicit use of that resource [3].
The most common classification of honeypots is based on the level of interaction, defined as the range of possibilities the attacker is given after attacking the system. Honeypots can be divided into low-interaction and high-interaction. On one hand, low-interaction honeypots emulate the characteristics of network services or of a particular operating system. An example of this type of honeypot is Dionaea [4]. On the other hand, high-interaction honeypots use a complete operating system with all services to obtain more accurate information about attacks and attackers [5]. An example of this type of honeypot is HonSSH [6].
The concept of the honeypot is extended by the honeynet, a special kind of high-interaction honeypot. The honeynet can also be referred to as "a virtual environment, consisting of multiple honeypots, designed to deceive an intruder into thinking that he or she has located a network of computing devices of targeting value" [7]. The honeynet architecture has four main parts, namely data control, data capture, data collection and data analysis [2,7].
The main reason to use these tools is the collection and analysis of data captured by honeypots and honeynets. Learning new, unconventional information about attacks, attackers and tools contributes to the protection of the network services and computer networks of organizations. Each honeypot collects the IP addresses of attackers and further data specific to the type of honeypot. In this paper we use low-interaction Kippo honeypots [8], which collect timestamps, the attacker's IP address, the type of SSH client and combinations of logins and passwords. For the purpose of this paper we focus on logins, passwords and their combinations.
This paper is a sequel to the analysis of data collected from honeypots and honeynets. In [9] the authors focus on automated secure shell (SSH) brute-force attacks and discuss the length of passwords, password composition compared to known dictionaries, dictionary sharing, username-password combinations, username analysis and timing analysis. In contrast, the main aim of this paper is to shed light on attackers' behaviour and to provide recommendations for SSH users and administrators. We focus on two main statistical analyses: firstly, the chi-square test of independence, which analyses group differences; secondly, the kappa statistic, which measures agreement between observers.
To formalize the scope of our work, we state two research questions:
- Which attributes of logins, passwords and their combinations are significant for the security of systems?
- What is the relationship between logins and passwords and the origin of attacks?
This paper is organized into seven sections. Section II reviews published research related to lessons learned from the analysis of honeypot and honeynet data. Section III outlines the dataset and the methods used in the experiment. Sections IV-VI focus on statistical and spatial analysis of logins, passwords and their combinations. The last section contains conclusions, discussion and our suggestions for future research.

Related works
As mentioned before, the main task of honeypots and honeynets is to analyse captured data and search for new knowledge about attacks and attackers. This section provides an overview of papers that focus on lessons learned from honeypot and honeynet data.
Analysis of data collected by high-interaction honeypots is discussed by Nicomette et al. [10] and Alata et al. [11]. The authors of [10] concentrate on attacks carried out via the SSH service and on the activities performed after attackers gain access to the honeypot. Attackers and their activities after logging in are discussed in [11], whose authors correlated their findings with results from distributed low-interaction honeypots.
Low-interaction honeypots, in turn, are discussed by Sochor and Zuzcak in [12,13]. In [12], the data show currently spreading threats caught by honeypots, and a thorough interpretation of lessons learned from using the honeypots is outlined. Principal results are shown in [13]; in addition, the authors underline the fact that differentiating between honeypots according to their IP address is quite rough (e.g. differentiation between academic and commercial networks).
SGNET was used in [14] as a distributed system of honeypots. They doubt the floatation of representative malware samples datasets. They claim that the false negative alerts differ from what they are allowed to be. Additionally, false positive alerts occur in unexpected places. Clustering of attack patterns with a suitable similarity measure is discussed in [15]. The results of this study allow identification of the activities of several worms and botnets in the collected traffic.
Time-oriented data were of interest in [16], which outlines the visualization of such data in honeypots and honeynets. In addition, the authors provide results based on heatmaps, a special kind of visualisation. The results show that time is an important aspect of attacks: attackers are mainly active at night (according to the time zone of the honeynet).
Another example of using low-interaction honeypots (Dionaea) for research is [17]. It presents the results of nearly two years of operation of honeypot systems installed on an unprotected research network. The paper focuses on the lifetime of malware programs and on long-term malware activity.

Data collection and analysis methodology
The data were collected from a honeynet located in a campus network. The honeynet, which runs on port 22, consists of Kippo SSH honeypots [8] in low-interaction mode. In this mode the honeypots do not allow attackers to log into a shell; they only capture data about network flows entering the honeynet. The honeypots collected authentication attempts from 3rd August 2014 to 24th December 2015. During this period 1 391 746 records were collected. Each record contains the username and password used in an attempt, as well as the attacker's IP address and client version and the beginning and end of the session. The dataset contains 5 488 unique logins, 205 477 unique passwords and 212 687 unique combinations of login and password.
For spatial analysis, each record was completed with spatial data using the IP-API.com service [18]. This service provides free use of its Geo IP API through multiple response formats. Each record was supplemented with the time zone, country, region, city, Internet service provider (ISP) and Global Positioning System (GPS) coordinates.
Data cleaning and analysis were performed using the HoneyLog framework [19]. This framework for analysing honeypot and honeynet data is based on the FuelPHP PHP framework and JavaScript libraries. It has two main segments: a client part and a server part.
For the purpose of this paper, the important part of the dataset consists of combinations of logins and passwords. Since logins and passwords are qualitative data, they needed to be converted into quantitative data. To each login and password we assigned the following attributes:
- contains only lowercases - login or password contains only lowercase characters (ASCII codes between 97 and 122);
- contains only uppercases - login or password contains only capital characters (ASCII codes between 65 and 90);
- contains only numbers - login or password contains only numbers (ASCII codes between 48 and 57);
- contains number - login or password contains at least one number;
- contains year - login or password contains a year (2014 or 2015);
- contains special character - login or password contains at least one special character (ASCII codes 32-47, 58-64, 91-96 and 123-127).
In this paper we use two statistical methods: the chi-square test of independence and the kappa statistic. The chi-square test of independence, also known as the Pearson chi-square test [20], is one of the most useful tools for testing hypotheses when the variables are nominal. It is a non-parametric tool designed to analyse group differences. Each non-parametric test has its own specific assumptions as well. The assumptions of the chi-square test include:
1. The data in the cells should be frequencies, or counts of cases.
2. The categories of the variables are mutually exclusive.
3. Each subject may contribute data to one and only one cell in the chi-square table.
4. The study groups must be independent.
5. While the chi-square test has no rule about limiting the number of cells (by limiting the number of categories for each variable), a very large number of cells (over 20) can make it difficult to meet assumption 6 below and to interpret the meaning of the results.
6. The expected value of a cell should be 5 or more in at least 80 % of the cells, and no cell should have an expected value of less than one. This assumption is most likely to be met if the sample size equals at least the number of cells multiplied by 5.
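The conversion of a login or password into binary attributes can be sketched as follows. This is a minimal illustration and not the HoneyLog implementation: the function name and dictionary keys are hypothetical, digits are assumed to be ASCII codes 48 to 57, and an empty string is assumed to satisfy none of the "contains only" attributes.

```python
# Special characters as listed in the text: ASCII 32-47, 58-64, 91-96, 123-127.
SPECIAL_CODES = (
    set(range(32, 48)) | set(range(58, 65))
    | set(range(91, 97)) | set(range(123, 128))
)
# Years covered by the collection period.
YEARS = ("2014", "2015")


def string_attributes(value):
    """Return the six binary attributes of a login or password (hypothetical helper)."""
    codes = [ord(ch) for ch in value]
    non_empty = len(codes) > 0  # assumption: "" is not "only" anything
    return {
        "only_lowercase": non_empty and all(97 <= c <= 122 for c in codes),
        "only_uppercase": non_empty and all(65 <= c <= 90 for c in codes),
        "only_numbers": non_empty and all(48 <= c <= 57 for c in codes),
        "contains_number": any(48 <= c <= 57 for c in codes),
        "contains_year": any(year in value for year in YEARS),
        "contains_special": any(c in SPECIAL_CODES for c in codes),
    }
```

Applied to every record, such a function yields the counts from which the contingency tables for the chi-square test and the kappa statistic can be built.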
On the other hand, Kappa [21] is intended to give the reader a quantitative measure of the magnitude of agreement between observers. Interobserver variation can be measured in any situation in which two or more independent observers are evaluating the same thing.

Logins
The first observed aspect of the analysis is the login. The top 10 logins are shown in Fig. 1 (left). This diagram shows that the most tested login is root. Judging by the other logins, attackers test default logins for different systems (admin, user, PI, Oracle, etc.). Attackers also often try the same login and password combination. In this paper we focus on the logins with the largest number of unique passwords. The top 10 logins by number of unique passwords are shown in Fig. 1 (right). From this perspective, too, the most tested login is root. Attackers also test the following logins with a large number of unique passwords: user, test, nagios, mysql.

Attributes of logins
According to the Linux documentation for the useradd tool [22], a Unix/Linux username (login) must match the regular expression [a-z_][a-z0-9_-]*[$]?$. This expression means that the first character of a login is a lowercase letter or an underscore, and the remaining characters are lowercase letters, digits, underscores or dashes, optionally followed by a final dollar sign. Capital letters are not allowed. Moreover, logins must neither start with a dash nor contain a colon, whitespace, end of line, tabulation, etc. The documentation notes that using a slash may break the default algorithm for the definition of the user's home directory.
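As an illustration, the username rule can be checked with a short script. This is a sketch under stated assumptions: the pattern is written with underscores, as in the useradd documentation, the 32-character limit is taken from the same documentation, and the function name is hypothetical.

```python
import re

# Pattern from the useradd documentation: a lowercase letter or underscore,
# then lowercase letters, digits, underscores or dashes, and an optional
# trailing dollar sign.
USERNAME_RE = re.compile(r"[a-z_][a-z0-9_-]*[$]?")
MAX_LOGIN_LENGTH = 32  # documented upper bound on username length


def is_valid_login(login):
    """Return True if the login satisfies the useradd naming rule (hypothetical helper)."""
    if len(login) > MAX_LOGIN_LENGTH:
        return False
    # fullmatch anchors the pattern at both ends of the string.
    return USERNAME_RE.fullmatch(login) is not None
```

Logins captured by the honeypot that fail this check, such as those with capital letters or special characters, could never exist on a default Linux system.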
As we can see in Fig. 2, the largest group of logins consists of logins containing only lowercase characters (88,47 %). A small share of logins contains a number (7,89 %) or a special character (4,46 %). In our opinion, logins that contain capital letters or special characters are tested by a special group of attackers (script kiddies), or the attacks were directed at systems other than UNIX/Linux.
Another studied aspect is the length of logins (Fig. 3). According to the above-mentioned Linux documentation [22], logins may be at most 32 characters long. The length of the tested logins ranges from 1 to 50 characters. The logins with length between 33 and 50 are a sign of incorrect use of automated

Frequency of ASCII characters in logins
To study the frequency of ASCII characters in logins, we created a frequency table (Fig. 4). This table counts, for each character, the logins in which the character occurs at least once. The ASCII character with the highest occurrence is lowercase a. Lowercase e, which is the most frequent letter in many languages (e.g. English, French and German), is in 2nd place. On the other hand, lowercase q and x have the lowest occurrence. The most used digits are 1 and 2, while 6 and 8 are used the least. The special character that occurs most often in logins is /. In contrast, passwords do not contain this character. In our opinion, this is again a sign of incorrect use of automated programs.
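The counting rule used for the frequency table (each character is counted once per login, no matter how often it repeats inside that login) can be sketched as follows. The function name is hypothetical and the input is assumed to be a list of unique logins.

```python
from collections import Counter


def char_presence_counts(strings):
    """Count, for each character, how many strings contain it at least once."""
    counts = Counter()
    for s in strings:
        # set(s) removes repeats, so each string contributes at most 1 per character.
        counts.update(set(s))
    return counts
```

For example, the login aaa adds 1 to the count of a, not 3, which is what distinguishes this table from a plain character frequency count.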

Logins and origin of attacks
Tab. 1 shows the top 20 countries of origin of attacks. For each country, the table shows the count of attacks, the top login with its count and percentage, and the top three logins tested by attackers from that country. The login root is the most tested login in each of the top 20 countries. Interestingly, the percentage of attempts with the login root among all attempts from a country differs between countries. On one hand, the percentage is high in countries such as China, Hong Kong, France and Hungary. On the other hand, it is low in countries such as Argentina or Singapore. The most tested groups of logins are root/admin/ubnt, root/admin/test and root/admin/user. Based on this it can be concluded that groups of tested logins, considered together with the origin of attacks, can be an interesting indicator for identifying groups of attackers.
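A per-country summary of this kind could be computed along the following lines. This is a hedged sketch: the record layout (country, login) and all names are assumptions for illustration, not the structure of the original dataset.

```python
from collections import Counter, defaultdict


def top_logins_by_country(records, n=3):
    """Summarise (country, login) records: attack count, top login share, top-n group."""
    per_country = defaultdict(Counter)
    for country, login in records:
        per_country[country][login] += 1

    summary = {}
    for country, counter in per_country.items():
        total = sum(counter.values())
        top = counter.most_common(n)  # n most frequent logins, most frequent first
        top_login, top_count = top[0]
        summary[country] = {
            "attacks": total,
            "top_login": top_login,
            "top_share": 100.0 * top_count / total,  # percentage of all attempts
            "top_group": "/".join(login for login, _ in top),
        }
    return summary
```

Countries that share the same top_group string, for example root/admin/ubnt, are candidates for having been targeted by the same tool or dictionary.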

Passwords
The second observed aspect is the password. Compared with logins, passwords are more varied. The most commonly used password is admin. The top 10 most used passwords (123456, password, root, 1234, etc.) are shown in Fig. 5 (left). As with logins, we focus on the passwords that are used with the most unique logins. In this regard, the most used password is the empty password, denoted (none). Other passwords used with the most unique logins are shown in Fig. 5 (right).

Attributes of passwords
In this section we focus on the attributes of passwords, which are shown in Fig. 6. In contrast to logins, the Linux documentation places no restriction on the characters of a password (no security rule applies), because the system stores a hash of the password rather than the clear-text password. According to Fig. 6, the most frequently used passwords contain numbers (50,36 %). A slightly smaller share of passwords contains only lowercase characters (45,24 %). In contrast, passwords containing only numbers occur almost three times less often. An interesting fact is that among the top 10 passwords there were four passwords containing only numbers (123, 1234, 12345, 123456) (9,9 %) and only one password containing only lowercase characters (test) (0,83 %).

Another attribute of a password is its length, which ranges between 0 and 98 characters. The most common length is 8 characters, and most passwords are between 3 and 20 characters long. It is worth mentioning that passwords with 32 characters are hashes (e.g. 706e642a056c7e894ed5a01e55700004). The distribution of password lengths is shown in Fig. 7 (left). Passwords with 33 characters and more are a sign of incorrect use of a tool (e.g. #files th a:hover { background:transparent; border...) or of a manual attack by script kiddies (e.g. rooooooooooooooooooooooooooooooooooooooooooooooooooooooooooot).

We also focus on the group of passwords that contain only numbers. In this group the largest subgroups contain 8 and 6 digits, respectively. The distribution of lengths of passwords that contain only numbers is shown in Fig. 7 (right).
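The length-based observations above can be turned into a simple classifier. This is only a sketch: treating a 32-character hexadecimal string as hash-like is an assumption made here for illustration, and the function names and category labels are hypothetical.

```python
import re

# Assumption: a "hash-like" password is exactly 32 hexadecimal characters.
HEX32_RE = re.compile(r"[0-9a-fA-F]{32}")


def looks_like_hex_hash(password):
    """Return True for strings of exactly 32 hexadecimal characters."""
    return HEX32_RE.fullmatch(password) is not None


def classify_by_length(password):
    """Label a password as 'hash-like', 'overlong' (33+ characters) or 'regular'."""
    if looks_like_hex_hash(password):
        return "hash-like"
    if len(password) >= 33:
        return "overlong"
    return "regular"
```

The overlong category only flags candidates; deciding whether an entry is tool misuse or a manual attempt still requires looking at the password itself.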

Frequency of ASCII characters in passwords
As for logins, a frequency table of ASCII characters in passwords was created (Fig. 8). This table counts, for each character, the passwords in which the character occurs at least once. The ASCII character with the highest occurrence is lowercase a. Lowercase e, which is the most frequent letter in many languages (e.g. English, French and German), is in 2nd place. On the other hand, capital V and capital K have the lowest occurrence. Similar to logins, the most used digits are 1 and 2, while 6 and 7 are used the least. The special characters that occur most often in passwords are @ and !. An interesting fact is the occurrence of the characters Horizontal Tab (ASCII code 9) and Device Control 1-4 (ASCII codes 17-20) in passwords (e.g. %username DC1 3!@, %username DC2 34567890-=). These codes are used for software flow control (e.g. DC1 for quitting an application) and are not visible in logs. Passwords with these codes begin with the special characters !, % or @ and are linked to the login root. In our opinion, passwords with these codes result from incorrect use of a tool by script kiddies.
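Because these control codes are invisible in logs, it helps to render them explicitly before inspecting passwords. The following sketch does this for the five codes discussed above; the bracketed notation and the function names are assumptions chosen for illustration.

```python
# Control codes discussed in the text: Horizontal Tab and Device Control 1-4.
CONTROL_NAMES = {9: "HT", 17: "DC1", 18: "DC2", 19: "DC3", 20: "DC4"}


def control_codes(password):
    """Return the sorted ASCII codes of the listed control characters found in the password."""
    return sorted({ord(ch) for ch in password if ord(ch) in CONTROL_NAMES})


def visible_form(password):
    """Replace the listed control characters with a readable <NAME> marker."""
    return "".join(
        "<{}>".format(CONTROL_NAMES[ord(ch)]) if ord(ch) in CONTROL_NAMES else ch
        for ch in password
    )
```

Filtering the dataset for passwords with a non-empty control_codes result isolates the entries that would otherwise look like ordinary strings in the log.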

Passwords and origin of attacks
Tab. 2 shows the top 20 countries where attacks originated. For each country, the table shows the count of attacks, the most used password with its count and percentage, and the top three passwords tested by attackers from that country. In the table, (none) means that an empty password was entered. The password 123456 is the most tested one in 7 of the top countries. An interesting finding is the password weubao in Hong Kong. In the case of logins, the most tested groups of logins are similar across origins of attacks. In the case of passwords, there are no similar groups among the top 3 passwords. Based on this it can be concluded that there is a relationship between passwords and the origin of attacks.

Now we sum the cell chi-square values to obtain the chi-square statistic for the table. In this case it is 3571. The chi-square table requires knowledge of the degrees of freedom to determine the significance level of the statistic. It holds: df = (number of rows − 1) × (number of columns − 1) = 1 × 3 = 3. The critical value of the chi-square distribution with df = 3 is 7,815 (at the 0,05 significance level). Our calculated value is larger than the critical value, 3571 > 7,815, so the null hypothesis of independence is rejected, which means that there is a relationship between login and password. However, this result does not specify which attributes drive this relationship; this can be seen in Tab. 4. The largest cell chi-square values appear for special characters in logins, meaning that the number of logins containing a special character is significantly greater than the expected value. On the other hand, a cell chi-square value less than 1 means that the number of observed cases is approximately equal to the number of expected cases. So there is no effect on passwords for the attributes number and only uppercase.
Based on the above, it can be concluded that there is a relationship between the login and the password, especially if the password contains a special character or a number. Logins typically contain only lowercase characters. Therefore, if a login contains special characters, only numbers, at least one number or only capital characters, there is a relationship between the login and the password. It occurs to the greatest extent in the case of logins with a special character (e.g. password garland!@# for login root). Another example is the login root!"?$%& with password (none). In these cases, it can be concluded that the attack is not a dictionary or brute-force attack, but a manual attack or an automated attack by script kiddies.
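The chi-square computation described above (expected counts from the marginals, cell chi-square values, their sum and the degrees of freedom) can be sketched in plain Python. The function name is hypothetical and the input is any table of observed counts; the observed counts of the study are not reproduced here.

```python
def chi_square_independence(observed):
    """Pearson chi-square test of independence for a table of observed counts.

    Returns (statistic, degrees_of_freedom, cell_values), where cell_values
    holds (observed - expected)^2 / expected for every cell.
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)

    cell_values = []
    statistic = 0.0
    for i, row in enumerate(observed):
        cell_row = []
        for j, obs in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / total
            cell = (obs - expected) ** 2 / expected
            cell_row.append(cell)
            statistic += cell
        cell_values.append(cell_row)

    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return statistic, df, cell_values


# Critical value quoted in the text for df = 3.
CRITICAL_VALUE_DF3 = 7.815
```

For a table with 2 rows and 4 columns the function returns df = 3, and the null hypothesis of independence is rejected when the returned statistic exceeds CRITICAL_VALUE_DF3. The per-cell values show which attribute contributes most to the statistic.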

Agreement of structure of password and login
To study the agreement between the structure of passwords and logins, we use the kappa statistic. The data are collected in Table 6. The percentage of agreement can be calculated simply as the sum of the diagonal cells divided by the number of observations, which gives P_o = 0,903 (90,3 % agreement). However, this measure does not take into account agreement by random chance. The expected agreement is P_e = 0,416. The formula for kappa gives K = (P_o − P_e) / (1 − P_e) = (0,903 − 0,416) / (1 − 0,416) = 0,834. Using the table in [21] we can conclude that the agreement of login and password is substantial.
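The kappa calculation can be sketched as follows. The first function applies the formula to given values of P_o and P_e; the second derives both values from a square agreement table. Function names are hypothetical, and the table used in the study is not reproduced here.

```python
def kappa_from_agreement(p_o, p_e):
    """Kappa from observed agreement p_o and expected (chance) agreement p_e."""
    return (p_o - p_e) / (1 - p_e)


def cohen_kappa(table):
    """Kappa for a square table of counts (rows: first rating, columns: second rating)."""
    total = sum(sum(row) for row in table)
    # Observed agreement: share of observations on the diagonal.
    p_o = sum(table[i][i] for i in range(len(table))) / total
    # Expected agreement: sum over categories of the product of the marginal shares.
    row_share = [sum(row) / total for row in table]
    col_share = [sum(col) / total for col in zip(*table)]
    p_e = sum(r * c for r, c in zip(row_share, col_share))
    return kappa_from_agreement(p_o, p_e)
```

With the values reported above, kappa_from_agreement(0.903, 0.416) evaluates to approximately 0.834.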

Conclusions, recommendations and future works
Attacks collected by honeypots are an interesting source for further analysis. In this paper we focus on logins, passwords and their combinations, and outline a statistical analysis of the collected data. General rules for creating passwords state that a password should contain a lowercase letter, a capital letter, a number and a special character, and that its length should be 8 or more. Based on the above, we propose using capital V, capital K and the numbers 6 and 7 in passwords. We recommend avoiding the following lowercase letters: a, e, i, n, r, o, s, and the following numbers: 1, 2, 3 and 9. To strengthen a password, it is recommended to use a length of 10 or more and the special characters [, ], { and }.
Since the combination of login and password is used in an attack, the strength of the login also needs to be addressed. General safety rules state that default passwords and the login root should not be used. We agree with these rules and, based on the above, propose the following rules for creating logins. The first character of the login must be lowercase; lowercase q or x looks like the best choice. The login must have a length between 1 and 32 characters; we recommend a length between 12 and 32 characters. We recommend avoiding the following lowercase letters: a, e, i, r, n, o, s, t, l, c, and the following numbers: 1, 2, 3 and 0. In general, using numbers increases the security of the login, especially the numbers 6, 7 and 8.
As shown above, the chi-square test of independence and the kappa statistic indicate that there is a relationship between logins and passwords. On the basis of these tests, attacks can be divided into manual attacks and automated attacks.
In the future, research in the field of analysis of collected data will continue. We will primarily focus on types of clients and on time-oriented analysis from the perspective of logins and passwords.