Usability Problems Experienced by Different Groups of Skilled Internet Users: Gender, Age, and Background

Finding the right test persons to represent the target user group when conducting a usability evaluation is considered essential by the HCI research community. This paper explores data from a usability evaluation with 41 participants with high IT skills to examine whether age, gender, and job function or educational background have an impact on the amount and types of usability problems experienced by the users. All usability problems were analysed and categorised through closed coding, and the test persons were grouped in relation to gender, age, and job function or educational background. The study found that the usability problems experienced across gender, age group, and job function or educational background are approximately the same. This indicates that the usual characteristics of test persons might not be as important as assumed, and opens up for further research into whether users with different levels of internet skills might be a more relevant basis for recruitment.


Introduction
Usability evaluation is a strong tool for identifying areas of an interactive system that need improvement. In practice, one of the key challenges for usability evaluators is to find users who can participate as test subjects. Recruitment of test subjects is challenging, and the time required for test sessions and the subsequent data analysis usually depends on the number of test subjects. Therefore, there have been attempts to determine the minimal number of test users required for a usability evaluation [4], [7], [11]. Other researchers have criticised these attempts to define a minimal number. One of the arguments is that different users experience different usability problems [6], [9]. In these discussions, there has been little evidence as to the actual differences between the usability problems experienced by different groups of users.
For specialised systems that are used by a homogeneous group of users, this issue is not particularly relevant. However, for systems that are aimed at diverse and heterogeneous groups of users, it is highly relevant. This paper presents results from an exploratory study of the usability problems experienced by different users. The focus of this study was to what extent different test persons, who are all experienced internet users, experience different types of usability problems, across gender, age, and educational background or job function.
The system we evaluated was a government data dissemination website aimed at a very broad user population. In the following section, the related work is presented, followed by a description of the method used for data collection and analysis. Then the results are presented, and finally, the results are discussed and concluded upon.

Related Work
The question about the number of test subjects needed in a usability evaluation has been discussed for many years. Virzi [11] focused on the need to reduce the cost of applying good design practices, such as user testing, to the development of user interfaces. He was one of the first to experiment with the number of test subjects needed. Over a series of three experiments, he found that 80% of the usability problems were detected with four or five subjects, that additional subjects were less and less likely to reveal new information, and that the most severe usability problems were likely to be detected with the first few subjects. In the experiments, he used test subjects who were from the surrounding community or undergraduate students. There is no further description of their demography. Lewis [7] emphasises that the aim of a usability evaluation is to have representative participants. He reports from an experiment with fifteen employees of a temporary help agency who all had at least three months' experience with a computer system but had no programming training or experience. Five were clerks or secretaries and ten were business professionals. In this study, using five participants uncovered only 55% of the problems; uncovering 80% of the problems would require ten participants. The results show that additional participants discover fewer and fewer problems. The most important result was that problem discovery rates were the same regardless of problem severity. Again, there is no concern for the demography of the test subjects.
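Participant-count estimates of this kind rest on a simple binomial detection model: if each participant independently finds a given problem with probability p, then n participants uncover it with probability 1 - (1 - p)^n. The following sketch illustrates the model; the per-participant rate p = 0.3 is a hypothetical value, not a figure reported by Virzi or Lewis.

```python
import math

def detection_rate(p: float, n: int) -> float:
    """Probability that a problem with per-participant
    detection probability p is found at least once by n participants."""
    return 1 - (1 - p) ** n

def participants_needed(p: float, target: float) -> int:
    """Smallest n such that the expected proportion of
    problems found reaches the target (e.g. 0.8)."""
    return math.ceil(math.log(1 - target) / math.log(1 - p))

# With a hypothetical per-participant rate of 0.3,
# five participants uncover about 83% of the problems.
print(round(detection_rate(0.3, 5), 2))   # 0.83
print(participants_needed(0.3, 0.8))      # 5
```

The model makes the diminishing-returns pattern explicit: each additional participant re-detects problems already found, so the curve flattens as n grows.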
Caulton [2] argues that the results obtained in these early experiments were based on the assumption that all types of users have the same probability of encountering all usability problems, which he denotes the homogeneity assumption. If that assumption is violated, more subjects are needed. He argues that with heterogeneous user groups, problem detection with a given number of subjects is reduced: the more subgroups, the lower the proportion of problems expected to be found. If ten unknown user subgroups exist, 50 randomly sampled subjects should yield 80% of the problems.
Law and Hvannberg [6] examined the influence of subgroups on problem detection through an experiment with usability tests conducted in four different European countries. They conclude that the heterogeneity of subgroups in a test will dilute the problem detection rate; the diluting effect implied a reduction not only for severe problems but also for moderate and minor ones. The problem detection rate for the severe problems is significantly higher than for the less severe ones, but the absolute value for the severe problems is not particularly high. Between nine and ten participants were required to uncover 80% of the severe problems, whereas 15 participants were required to uncover 80% of the minor problems. In addition, they found no significant correlation between problem detection rate and problem severity level. Based on their results, they reject the so-called "magic five" assumption, as 11 participants were required to uncover 80% of the usability problems.
More recently, there has been another attempt to define a specific "magic" number [4]. This new attempt has been criticised for being flawed [9]. A detailed analysis has been made of the use of the "magic five" assumption. None of these or the previous references in this area have explored in more detail how heterogeneous different subgroups are and how different user groups experience different usability problems.

Method
We have conducted an exploratory study of usability problems experienced by different user groups. This section describes how the data was collected and analysed.

Data Collection
The data was gathered through a usability evaluation of a data dissemination website (dst.dk). This site provides publicly available statistics about the population (e.g. educational level or IT skills), the economy, the employment situation, etc. Test Persons. All test persons were invited through emails distributed across the university. For this study, data from 41 usability evaluations were included. The test persons consisted of 12 faculty members, from Ph.D. students to professors, from different departments; 15 students from technical or non-technical educations; and 14 participants from the technical and administrative staff of different departments. All participants received a gift with a value of approximately 20 USD for their participation. An overview of the participants can be seen in Table 1 on the following page.
All test persons were placed in one of six groups with regard to gender and age. The test persons varied in age between 21 and 66 years and consisted of 19 males and 22 females. All test persons were asked to assess their own internet skill level on a scale from 1 to 5, where 1 was the lowest and 5 the highest score. The average for each group is shown in the table; none of the 41 test persons assessed themselves lower than 4. Originally, 43 usability evaluations were conducted, but the data from two of them were excluded from this study because these test persons assessed their internet skill level at 3. All test persons were asked if they were familiar with, and used, this website. 19 people answered that they had never used the website, 20 answered that they were familiar with the site and used it approximately once a year, and two people answered that they use the website approximately once a month.
Usability Evaluations. All tests were conducted as think-aloud evaluations in a usability laboratory. The test moderator and test person were placed in different rooms and communicated through microphone and speakers in order to prevent the test moderator's body language or other visible expressions from influencing the test person. All test persons were asked to fill out a short questionnaire about their participation after the test.
Tasks. Each user solved eight tasks, all varying in difficulty. For example, the first task was to find the total number of people living in Denmark, while a more difficult task was to find the number of hotels and restaurants with one single employee in a particular area of Denmark. Data Handling. All usability evaluations were recorded, and the collected recordings were analysed by conducting video analysis. All recordings were analysed by two evaluators, both with extensive previous experience in analysing video data. The videos were analysed in different random orders to reduce possible bias from learning. The following characteristics were used to determine a usability problem: (A) the test person slowed down relative to their normal work speed; (B) inadequate understanding, e.g. not understanding how a specific functionality operates or is activated; (C) frustration (expressing aggravation); (D) test moderator intervention; (E) error compared to the correct approach.
The data handling resulted in a list of 147 usability problems after duplicates had been removed. To determine similarities between problems from each list, the usability problems found by each evaluator were discussed. Across the analysis, the evaluators had an any-two agreement of 0.44 (SD = 0.11), which is relatively high compared to other studies [3]. Further information about the data collection can be found in [1].
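The any-two agreement used here can be computed as the average, over all pairs of evaluators, of the overlap between their problem lists divided by their union. A minimal sketch, with hypothetical problem identifiers rather than the study's actual problem lists:

```python
from itertools import combinations

def any_two_agreement(problem_sets):
    """Average |Pi ∩ Pj| / |Pi ∪ Pj| over all pairs of evaluators."""
    ratios = [len(a & b) / len(a | b)
              for a, b in combinations(problem_sets, 2)]
    return sum(ratios) / len(ratios)

# Hypothetical problem sets found by two evaluators;
# with two evaluators there is a single pair.
eval_a = {"P01", "P02", "P03", "P04", "P05"}
eval_b = {"P03", "P04", "P05", "P06", "P07", "P08"}
print(round(any_two_agreement([eval_a, eval_b]), 2))  # 0.38
```

Here the two evaluators share three problems out of eight distinct ones, giving an agreement of 3/8 = 0.375.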
Data Analysis. We also uncovered which types of usability problems were experienced by the different groups of participants. We did this through closed coding [10], where each problem was categorised according to the 12 types listed in Nielsen et al. [8]. Two of the authors conducted this coding independently of each other. It was decided in advance that the raters would code all problems and that only problems where the raters independently agreed on the category would be used. An interrater reliability analysis using the Fleiss Kappa statistic was performed to validate the result; this determines the level of consistency between the two raters. The result was a moderate level of agreement (Kappa = 0.44, p < 0.001, 95% CI = 0.37, 0.52) [5]. The 12 categories used for this study are described next.
Affordance relates to issues on the user's perception versus the actual properties of an object or interface. Cognitive load regards the cognitive efforts necessary to use the system. Consistency concerns the consistency in labels, icons, layout, wording, commands etc. on the different screens. Ergonomics covers issues related to the physical properties of interaction. Feedback regards the manner in which the interface relays information back to the user on an action that has been done and notifications about system events. Information covers the understandability and amount of information presented by the interface at a given moment. Interaction styles concern the design strategy and determine the structure of interactive resources in the interface. Mapping is about the way in which controls and displays correlate to natural mappings and should ideally mimic physical analogies and cultural standards. Navigation regards the way in which the users navigate from screen to screen in the interface. Task flow relates to the order of steps in which tasks ought to be conducted. User's mental model covers problems where the interactive model, developed by the user during system use, does not correlate with the actual model applied to the interface. Visibility regards the ease with which users are able to perceive the available interactive resources at a given time.
The coding and analysis by the two raters resulted in a list of 83 coded usability problems, out of the original 147. This reduction happened because all usability problems where the raters did not agree on the category were removed from the study.
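For two raters, a chance-corrected agreement statistic of this kind can be sketched with Cohen's kappa, which compares observed agreement with the agreement expected by chance from each rater's category frequencies. The codings below are hypothetical examples, not the study's data:

```python
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Chance-corrected agreement between two raters:
    kappa = (p_observed - p_expected) / (1 - p_expected)."""
    n = len(ratings_a)
    # Observed agreement: proportion of items coded identically.
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Expected chance agreement from marginal category frequencies.
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / n ** 2
    return (p_o - p_e) / (1 - p_e)

# Hypothetical codings of ten problems into three categories.
rater_1 = ["Feedback", "Information", "Feedback", "Visibility", "Feedback",
           "Information", "Visibility", "Feedback", "Information", "Feedback"]
rater_2 = ["Feedback", "Information", "Information", "Visibility", "Feedback",
           "Information", "Feedback", "Feedback", "Information", "Feedback"]
print(round(cohens_kappa(rater_1, rater_2), 2))  # 0.67
```

A kappa around 0.4-0.6 is conventionally read as moderate agreement, which is why the study's value of 0.44 is reported as moderate.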
These categorisations were used to determine whether test persons experienced the same types of usability problems, or whether there were deviations across gender, age, and job function or educational background. The results of this analysis are presented in the following section.

Results
In this section, we present the results of the study from four different perspectives. First, the test persons are divided into males and females; then into the three age groups without taking gender into account; then into groups with regard to both age and gender; and finally into groups with regard to educational background or job function. This was done to show whether gender, age, or background plays a role in the usability problems experienced. The numbers shown in the tables in this section represent the average number of usability problems found per test person in each category; this makes it possible to compare groups containing different numbers of test persons.
The results show that problems were found in five of the twelve closed-coding categories: Affordance, Cognitive Load, Feedback, Information, and Visibility. As no problems were found relating to Consistency, Ergonomics, Interaction Styles, Mapping, Navigation, User's Mental Model, and Task Flow, these categories will not be mentioned further.
Note that all results are based on the problems for which the two raters agreed on the categorisation; if the two raters did not agree on the code of a particular problem, it was excluded from the results. Out of the total 147 problems, the raters agreed on 83.

Gender
We analysed whether males and females with similar internet skills experienced the same amount and types of usability problems. The results are presented in Table 2. An independent samples t-test revealed no significant difference in the total number of problems experienced between the genders (t=-0.9, df=39, p>0.2). We did, however, find significant differences when considering the problem types related to feedback (t=-1.2, df=10, p<0.01) and information (t=-1.8, df=39, p<0.01).
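An independent samples t-test of this kind compares the difference between two group means to the pooled within-group variability. A minimal sketch of the Student's (pooled-variance) version, using hypothetical per-participant problem counts rather than the study's data:

```python
from statistics import mean, variance

def independent_t(sample_a, sample_b):
    """Student's independent-samples t statistic with pooled variance.
    Returns the t statistic and the degrees of freedom."""
    n1, n2 = len(sample_a), len(sample_b)
    # Pooled variance: the two sample variances weighted by their df.
    sp2 = ((n1 - 1) * variance(sample_a)
           + (n2 - 1) * variance(sample_b)) / (n1 + n2 - 2)
    t = (mean(sample_a) - mean(sample_b)) / (sp2 * (1 / n1 + 1 / n2)) ** 0.5
    return t, n1 + n2 - 2

# Hypothetical problem counts per participant for two groups.
males = [6, 8, 7, 9, 5, 7]
females = [8, 9, 7, 10, 8, 9]
t, df = independent_t(males, females)
```

The t value is then compared against the t distribution with the returned degrees of freedom to obtain a p-value.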

Age
We also analysed whether age had an impact on the number of usability problems experienced. The results are presented in Table 3 on the following page. A one-way ANOVA test revealed no significant differences in the number of problems experienced between the three age groups (F=1.02, df=40, p>0.3).
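A one-way ANOVA compares the variance between group means to the variance within groups via the F statistic. A minimal sketch with hypothetical problem counts for three age groups (not the study's data):

```python
from statistics import mean

def one_way_anova_f(groups):
    """One-way ANOVA F statistic: ratio of between-group to
    within-group mean squares. Returns (F, df_between, df_within)."""
    grand = mean(x for g in groups for x in g)
    k = len(groups)
    n = sum(len(g) for g in groups)
    # Between-group sum of squares: group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    # Within-group sum of squares: observations around their group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between, df_within = k - 1, n - k
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within

# Hypothetical problem counts for three age groups.
young = [7, 8, 6, 9, 7]
middle = [8, 7, 9, 8, 6]
older = [6, 9, 8, 7, 8]
f, dfb, dfw = one_way_anova_f([young, middle, older])
```

An F value near zero, as with these similar hypothetical groups, indicates that the group means differ no more than chance would suggest.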

Job Function and Educational Background
Finally, we analysed whether the large number of test persons with a background in computer science had an impact on the number of usability problems experienced. The results are presented in Table 4.
The table shows that, when dividing the test persons by job function or educational background, students who are not in computer science experience more problems related to cognitive load and information. A one-way ANOVA test revealed no significant differences in the total number of problems experienced across job function or educational background (F=0.6, df=40, p>0.6).

Discussion
This study has focused on comparing the number of usability problems found when grouping the test persons by gender, age, and job function or educational background. This was possible because all test persons assessed themselves as experienced internet users, each rating themselves as either 4 or 5 on a scale from 1 to 5, where 5 was the highest score. This way, it could be explored whether test persons with a high degree of internet skills experienced different types of usability problems, or whether they could be considered a homogeneous group, where neither age, gender, nor job function or educational background made a difference in the average number of usability problems experienced.

Comparison with Related Work
Related work has shown that the number of test persons needed varies [7], [11]. As demographic data was not included in these studies, it is not possible for us to draw any conclusions in relation to our results, though it raises the question of whether the test persons chosen by Virzi [11] were more homogeneous than those chosen by Lewis [7] with regard to internet or general IT skills. This study has found indications that a user group can be homogeneous despite variation in age and background. Our results indicated that the test persons in this study experienced around the same number of usability problems in each category (Affordance, Cognitive Load, Feedback, Information, Visibility), across gender, age, and background. This corresponds with Caulton's conclusions about homogeneous user groups experiencing the same usability problems [2].
This study showed no substantial difference in the types of usability problems experienced by the test persons. This does not correspond with the findings of Law and Hvannberg, who concluded that the heterogeneity of subgroups in a test will dilute the problem detection rate [6].

Implications for Usability Practitioners
Though further research is needed, this study indicates that recruiting test persons across gender and age might not be necessary, as our findings show that users with approximately the same level of internet skills experience the same number of usability problems. If skill level is indeed the key factor when recruiting test persons for usability evaluations, then the most important thing is to recruit test persons covering all skill levels of the target user group for the website or application, and variety in age or gender is not important. These implications might be of particular interest when developing websites or applications for large heterogeneous user groups, e.g. public websites or self-service applications, as these types of sites are targeted at all citizens in a country. Representing all types of users in usability evaluations of such systems is challenging, as many test persons would need to be recruited, and conducting that number of usability evaluations would be costly. If, on the contrary, test persons only need to be recruited with regard to their level of internet and general IT skills, the cost would be reduced considerably.

Conclusion
This paper presents a study of the extent to which different test persons, who are all experienced internet users, experience different types of usability problems, across age, gender, and educational background or job function. The results are interesting as they indicate that the usability problems experienced by users with a high level of internet experience do not vary significantly across gender, age, or background. This means that recruitment of test persons might not have to be balanced with regard to gender or age, but that it is more important to find test persons at all levels of internet experience in the target user group. Our results also indicate that people with an education in computer science do not experience significantly fewer usability problems than other experienced internet users.

Limitations
We recognise that further studies are needed before conclusions can be drawn across user groups at different levels of internet experience, and that these results do not provide enough evidence to definitively reject the previously mentioned criticism of the "homogeneity assumption" by Law and Hvannberg [6]. Further research should therefore be conducted with several homogeneous user groups at different levels of internet skills, not just one group of experienced users, to investigate whether these results also hold for user groups with lower internet skills. We also recognise the limitations that our test persons had a higher educational background and a self-reported high expertise in internet usage, and that many of the identified usability problems were discarded in the coding phase and therefore not included in the data analysis.