What Users Prefer and Why: A User Study on Effective Presentation Styles of Opinion Summarization

Opinion summarization research addresses how to help people make appropriate decisions effectively. This paper aims to support users' decision-making by providing them with effective opinion presentation styles. We carried out two phases of experiments to systematically compare the usefulness of different types of opinion summarization techniques. In the first, crowd-sourced study, we recruited 46 turkers to generate high-quality summary information. This first phase produced four styles of summaries: Tag Clouds, Aspect Oriented Sentiments, Paragraph Summary and Group Sample. In the follow-up second phase, 34 participants tested the four styles in a card sorting experiment. Each participant was given 32 cards, 8 per presentation style, and grouped the cards into five categories according to their usefulness. Results indicated that participants preferred Aspect Oriented Sentiments the most and Tag Clouds the least. Implications and hypotheses are discussed.


Introduction
The widespread use of the Internet in many aspects of human activities has resulted in an abundance of publicly-accessible opinions. People can find opinions on a variety of topics in venues such as Twitter, Weibo, forums, e-commerce sites and specialized opinion-hosting sites such as Yelp. While most of these opinions are intended to be helpful for others, their sheer quantity often leaves most of them underutilized, as the information overload overwhelms many potential consumers. For example, Amazon.com has more than 18,000 reviews for the Kindle Fire, a single product alone. Summarizing these reviews in some concise form could bring enormous benefits to consumers and businesses alike. Not surprisingly, research on opinion summarization is gaining increased attention [14, 18, 19, 22, 24]. However, most of the research emphasizes technical advances in underlying algorithms, while paying less attention to the presentation of the results, which is the focus of this work. Correspondingly, evaluation of opinion summarization research is normally based on certain notions of precision and recall calculation commonly used in information retrieval [28] and data mining [29]. Studies have only begun to investigate the effectiveness of opinion summarization in terms of usability (e.g. [31]). Such studies focus on testing the newly-proposed techniques. A systematic comparison of the usefulness of different types of opinion summarization is still lacking. This paper reports our effort in addressing this deficiency.
One major difficulty with studying the effectiveness of opinion summarization is a confounding effect between content effectiveness and presentation effectiveness. It is often not clear whether a technique's empirical superiority can be attributed to its superior text analytics quality or its effective information presentation style. We plan to isolate the two factors and focus on studying the effect of presentation styles. This goal is achieved by using human-generated summarization as the content, so as to ensure the content has consistently high quality regardless of the presentation styles. We can then vary the presentation styles of the summaries to investigate their effect on the usefulness ratings of the summaries. Any differences found between the usefulness ratings of the summaries can be safely attributed to the differences in presentation styles. We identified four types of presentation styles of opinion summarization through a crowd-sourcing study on Amazon Mechanical Turk, and then conducted a lab user-centered experiment to compare the effectiveness of the four styles.

Previous work
Although not abundant, studies investigating the effectiveness of opinion summarization from the perspective of both usability and user preference are emerging. Several recent studies explore feedback from users regarding their preferences for certain opinion summarization styles and approaches. Most recently, Qazi et al. [26] addressed a gap in existing studies by examining the determination of useful opinion review types from customers' and designers' perspectives. Specifically, the researchers used the Technology Acceptance Model (TAM) as a lens to analyze users' perceptions toward different opinion review types and online review systems. The study, a pilot study, focused on three review types which are related to perceived usefulness, perceived ease of use, and behavioral intention: A (regular), B (comparative), and C (suggestive). Suggestive reviews, speech acts used to direct someone to do something in the form of a suggestion, were newly identified by the researchers as a third, innovative review type. To examine user perspectives, the researchers used a closed card sorting approach to analyze reviews from Amazon, blogs, and a self-deployed website. The results of their work indicated that review types play a significant role in developing user perception of a new product or system, with suggestive reviews being the most useful to both customers and designers, ultimately improving their satisfaction.
Further, in another work [31], researchers conducted a user study of a review summarization interface they created called "Review Spotlight." Review Spotlight is based on a tag cloud and uses adjective-noun word pairs to provide an overview of online restaurant reviews. Findings indicated that study participants could form detailed impressions about restaurants and make faster decisions between two options with Review Spotlight versus traditional review webpages. In a large-scale, comprehensive human evaluation of three opinion-based summarization models (Sentiment Match (SM), Sentiment Match + Aspect Coverage (SMAC), and Sentiment-Aspect Match (SAM)), Lerman, Blair-Goldensohn and McDonald [15] found that users have a strong preference for sentiment-informed summaries over simple, non-sentiment baselines. This finding reinforces the usefulness of modeling sentiments and aspects in opinion summarization. In another study, Lerman and McDonald [16] investigated contrastive versus single-product summarization of consumer electronics and found a significant improvement in the usefulness of contrastive summaries versus summaries generated by single-product opinion summarizers. To find out which visual properties influence people viewing tag clouds, Bateman, Gutwin and Nacenta [2] conducted an exploratory study that asked participants to select tags from clouds that manipulated nine visual properties (font size, tag area, number of characters, tag width, font weight, color, intensity, and number of pixels). Participants were asked to choose tags they felt were "visually important," and the results were used to determine which visual properties most captured people's attention. Study results indicated that font size and font weight have stronger effects than intensity, number of characters or tag area. However, when several visual properties were manipulated at one time, no one visual property stood out among the others.
Carenini, Ng and Pauls [4] also employed a user study as part of their wider comparison of a sentence extraction-based versus a language generation-based summarizer for summarizing evaluative text. In their quantitative data analysis, the researchers found that both approaches performed equally well. Qualitative data analysis also indicated that both approaches performed well, however, for different, complementary reasons. In a related work, Carenini, Ng and Pauls [5] examined the use of an interactive multimedia interface, called "Treemaps," for summarizing evaluative text of online reviews of consumer electronics. Treemaps presents the opinion summarizations as an interactive visualization along with a natural language summary. Results of their user study showed that participants were generally satisfied with the interface and found the Treemap summarization approach intuitive and informative.
In more recent work, researchers [12] presented a novel interactive visual text analytic system called "OpinionBlocks." OpinionBlocks had two key design goals: (1) automated creation of an aspect-based visual summary to support users' real-world opinion analysis tasks, and (2) support of user corrections of system text analytic errors to improve system quality over time. To demonstrate OpinionBlocks's success in addressing the design goals, the researchers employed two crowd-sourced studies on Amazon Mechanical Turk. According to their results, over 70% of users successfully accomplished non-trivial opinion analysis tasks using OpinionBlocks. Additionally, the study revealed that users are not only willing to use OpinionBlocks to correct text classification mistakes, but that their corrections also produce high-quality results. For example, study participants successfully identified numerous errors, and their aggregated corrections achieved 89% accuracy.
Additionally, Duan et al. [7] introduced the opinion mining system "VISA" (VIsual Sentiment Analysis), derived from an earlier system called TIARA. The VISA system employs a novel sentiment data model to support finer-grained sentiment analysis, at the core of which is the "sentiment tuple," composed of four elements: feature, aspect, opinion, and polarity. The researchers conducted a user experiment to explore how efficiently people could learn to use VISA and to demonstrate its effectiveness. Study results indicated that VISA performed significantly better than the two comparison tools (TripAdvisor and a text edit tool) due to its key features, namely mash-up visualizations and rich interaction features.
In the current study, investigation of user perspectives on opinion summarization styles is taken further with the evaluation and comparison of four distinct, popular summarization styles focused on textual opinions; namely, Tag Clouds, Aspect-Oriented Sentiments, Paragraph Summaries, and Group Samples.
In the following sections, we introduce the four presentation styles used in our study, the methodology, the results of the experiment, and the discussion and conclusions.

Opinion summarization presentation styles
Some opinion hosting sites allow opinion writers to give numerical ratings in addition to the textual opinions. Since the visualization of numerical values is a well-studied problem, we focus instead on the summarization of textual opinions. Similarly, we do not compare visualization systems that emphasize statistics rather than the textual content of the text collections (e.g. [6]). Opinion summarizations studied here are of the kind that could potentially be used in place of the full documents.
Based on our survey of the literature, we have categorized the presentation styles of such opinion summarization into four major types.

Tag clouds (TAG)
Tag clouds are perhaps the most popular form of summarization on the Internet today [3]. This type of text presentation has also been used extensively in research (e.g. [25,28]). Tag clouds consist of enumerations of the most common words or phrases in a collection of opinions, laid out on a 2D canvas. The size of a word or phrase often indicates how frequently it was mentioned: the larger the word or phrase, the more frequent the mentions. The effect of various visual features of tag clouds on their effectiveness has been investigated [2], but a comparison with other styles of summarization has yet to be done. See Figure 1.
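The frequency-to-size mapping described above can be sketched as follows. This is only an illustration, not the rendering method of any cited system: the phrases, counts, and point-size range are hypothetical, and a linear scale is just one common choice.

```python
from collections import Counter

def tag_sizes(counts, min_pt=10, max_pt=48):
    """Map phrase frequencies to font sizes with a linear scale:
    the most frequent phrase gets max_pt, the least frequent min_pt."""
    lo, hi = min(counts.values()), max(counts.values())
    span = (hi - lo) or 1  # avoid division by zero when all counts are equal
    return {phrase: round(min_pt + (n - lo) / span * (max_pt - min_pt))
            for phrase, n in counts.items()}

# Hypothetical phrase counts drawn from a review collection
counts = Counter({"battery life": 60, "easy to use": 45, "noisy": 20, "screen": 5})
sizes = tag_sizes(counts)
```

A log scale is sometimes preferred in practice, since raw frequencies tend to be heavily skewed and a linear scale can make mid-frequency phrases nearly indistinguishable.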

Aspect oriented sentiments (ASP)
Aspect oriented sentiment summarization is an active area of research in text mining [11,13,14,19,23]. In this approach, some important aspects or topics (also known as features) of opinions are extracted from an opinion collection. The sentiment orientation of the text snippets containing the aspects is then estimated and summary statistics are reported. A typical summarization for one aspect might look like this: for a collection of reviews on Kindle Fire, "screen" is identified as an aspect, and 100 text snippets in the collection are found to be about this aspect; 60 of them have positive sentiment, 30 of them are negative, and the rest are neutral. Representative text snippets for each sentiment orientation may also be provided. See Figure 2.
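The aggregation step of this style (counts per aspect and polarity, plus representative snippets) can be sketched as below. The aspect and polarity labels are assumed to come from an upstream extraction and sentiment classification step, which is where the cited research effort lies; the data here are hypothetical.

```python
from collections import defaultdict

def summarize_aspects(snippets):
    """Aggregate (aspect, polarity, text) labelled snippets into
    per-aspect polarity counts plus a few representative examples."""
    summary = defaultdict(
        lambda: {"positive": 0, "negative": 0, "neutral": 0, "examples": []})
    for aspect, polarity, text in snippets:
        entry = summary[aspect]
        entry[polarity] += 1
        if len(entry["examples"]) < 3:  # keep a few representative snippets
            entry["examples"].append(text)
    return dict(summary)

# Hypothetical labelled snippets for the "screen" aspect
data = [("screen", "positive", "The screen is crisp."),
        ("screen", "negative", "Glare on the screen is bad."),
        ("screen", "neutral", "The screen is 7 inches.")]
report = summarize_aspects(data)
```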

Paragraph summaries (PRG)
Automatic text summarization systems traditionally produce short passages of text as a summary [10,27]. The summarization is called extractive when the sentences are selected from the original documents, and abstractive when the sentences are generated by the system [8]. Regardless of the approach, the output could be a readable abstract that resembles what humans would write for generic purposes, emphasizing intrinsic properties such as fluency and coverage [21]. See Figure 3.
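As a minimal illustration of the extractive variant, the classic frequency-based baseline scores each sentence by the frequency of its words and keeps the top scorers in their original order. This is a toy sketch, not the method of the systems cited above.

```python
import re
from collections import Counter

def extractive_summary(text, n_sentences=2):
    """Return the n_sentences highest-scoring sentences, in original order.
    A sentence's score is the summed corpus frequency of its words."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"[a-z']+", text.lower()))
    ranked = sorted(range(len(sentences)),
                    key=lambda i: -sum(freq[w] for w in
                                       re.findall(r"[a-z']+", sentences[i].lower())))
    keep = sorted(ranked[:n_sentences])  # restore document order
    return " ".join(sentences[i] for i in keep)

# Hypothetical mini review collection flattened into one text
text = ("The robot cleans well. "
        "The robot cleans carpets and the robot is quiet. "
        "Shipping was slow.")
summary = extractive_summary(text, n_sentences=2)
```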

Group samples (GRP)
Clustering algorithms are typically used in post-query user interfaces as a way of organizing retrieved documents [9]. Clustering has also been used in summarization, where similar documents are grouped together and the representatives of the groups are displayed [1,20]. This approach has been shown to be effective in interactive information retrieval [17,30]. See Figure 4.

Summary of the four presentation styles
Tag clouds may perform most differently from the other three presentation styles. Because Tag clouds do not show full sentences, they lack the context needed to make accurate judgments about the value of the information. On the other hand, Tag clouds have the best support for fast perception, as they package the most important words or phrases efficiently in space and visually highlight their relative importance. Aspect oriented sentiments are similar to Tag clouds in that they lack a prose structure. On the other hand, they are also similar to Paragraph summaries because they include text snippets that are reader-friendly and provide the context missing in Tag clouds. Paragraph summaries are close to Aspect oriented sentiments because they may cover similar amounts of information, as paragraphs often list pros and cons in a form that resembles aspect oriented sentiments. However, the prose structure in a well-written paragraph summary affords deep analysis and accurate assessment of the context. Group samples are similar to paragraph summaries in form. However, unlike paragraph summaries that are written anew, group samples are drawn directly from the original document collection, and retain the greatest amount of contextual information.

Research questions
We hypothesize that humans respond differently to different presentation styles of opinion summaries, and that some styles would be more effective in terms of human acceptance.
In this two-phase study, we are interested in investigating the following research questions: 1. Will users prefer (or not prefer) a particular opinion summarization style in making judgments about product reviews?
2. What are the reasons that users may prefer (or not prefer) a particular opinion summarization style in making judgments about product reviews?

Phase I: crowd sourcing opinion summarization
As mentioned earlier, in order to study the effect of presentation styles alone, we wanted to ensure the consistently high quality of the summaries. We achieved this goal by leveraging the wisdom of the crowd. Essentially, we elicited four styles of opinion summaries with the help of Amazon's Mechanical Turk.
Procedure. First, we collected the top 50 reviews for one model of the iRoomba cleaning robot from Amazon. We chose this collection of opinion text because of the relative novelty of the product and the ability for the general public to relate to it. Using a within-subject design, we recruited 46 turkers located in the USA to answer a survey we developed to gather information for generating the opinion summaries from the text collection. In the survey, turkers were first directed to the raw text of the 50 reviews, and asked to read them in full. Then, questions about the reviews were asked. These questions were directly mapped from the information need for the four presentation styles of summaries. All questions were mandatory and were individually validated to ensure the quality of the answers. On average, turkers spent 56.3 minutes on the survey, and each was paid 4 US dollars.
Generating opinion summaries using Turkers. For Tag clouds, turkers were asked to list five short phrases to characterize the cleaning robot. They were also asked to estimate what percentage of the 50 reviews had opinions consistent with each phrase. The phrases turkers came up with were remarkably consistent and converged to 38 phrases (phrases with minor variations were grouped as one). All 38 phrases were used in the subsequent lab study. The averages of the turkers' percentage estimations were used in the subsequent lab study to determine the font size of the phrases. A total of 8 Tag clouds were drawn by hand, with each cloud containing 4 or 5 phrases.
Turkers were asked to list three important aspects of the product according to the reviews they read. For each aspect, they were asked to give an estimate of how many reviews mentioned the aspect, as well as the estimated percentages for positive, neutral and negative sentiment towards the aspect. The top 8 most frequently listed aspects were used in the subsequent study. Again, the averages of the estimations were used in the display of the aspects.
Each turker was also asked to write a summary of all the reviews so that "consumers who read your summary can make an informed decision on the product, without having to read the reviews themselves". Among the 46 summaries, the top 8 most readable summaries, as agreed by two judges, were used in the lab study.
We asked turkers to identify similar reviews and group them together. They were required to list 3 groups of similar reviews.
Each time two reviews appeared in the same group, their similarity measure was increased by one. This way, we were able to generate a similarity matrix among the 50 reviews.
Using the matrix as input, we applied a hierarchical clustering algorithm to cluster the 50 reviews. Four clusters produced the optimal fit, and the four cluster prototypes were used in the lab study as group samples. In addition, for each cluster, the review closest to the prototype was also selected, so that there were 8 group samples in total.
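The co-membership counting described above can be sketched as follows. The paper does not give its exact implementation, so this is only an assumed reconstruction; the groupings and the number of reviews are hypothetical, and the resulting matrix would then feed a hierarchical clustering routine.

```python
def similarity_matrix(groupings, n_reviews):
    """Build a co-membership similarity matrix: each time two reviews
    appear together in a turker-provided group, their pairwise
    similarity grows by one."""
    sim = [[0] * n_reviews for _ in range(n_reviews)]
    for group in groupings:
        members = sorted(group)
        for i, a in enumerate(members):
            for b in members[i + 1:]:
                sim[a][b] += 1
                sim[b][a] += 1
    return sim

# Hypothetical groupings over 6 reviews, contributed by different turkers
groups = [{0, 1, 2}, {0, 1}, {3, 4}]
sim = similarity_matrix(groups, 6)
```

To cluster, the similarity matrix is typically converted to a distance matrix (e.g. `max(sim) - sim`) before being handed to an agglomerative clustering routine.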

Phase II: comparing presentation styles of summary using card sorting technique
The goal of this phase was to compare the four presentation styles of opinion summary in a consumer decision-making context, responding to the two research questions.
Experiment design. The comparison of the opinion summaries was conducted as a lab card-sorting task. For each of the four presentation styles (experiment conditions), eight opinion summaries were prepared according to the procedure described in the previous section.
Each opinion summary was put on a single image of 960 x 720 pixels. Figure 1, Figure 2, Figure 3 and Figure 4 show a sample display for each of the conditions. In total, 4*8=32 image items were placed in the preparation bin in a random order.
The lab experiment was a within-subject design. Each participant's task was to take all 32 image items and place each of them in one of five category bins. These category bins were labeled "Not at all useful", "Somewhat useful", "Useful", "Very useful" and "Extremely useful". Participants were told to ignore the card order within each category box. Essentially, we asked participants to give a usefulness rating for each opinion summary. We used this card-sorting setting in order to record participants' thought processes, as they were asked to think aloud while placing the cards.
Participants. Thirty-four participants were recruited from the University at Albany; half of them were male. All of the participants stated that they read online reviews regularly when making purchase decisions.
Procedure. Each participant was tested individually in a human-computer interaction lab on a university campus in the USA. The subjects first filled out a consent form. Next, the subjects completed an entry questionnaire. The participants were then directed to http://websort.net/ to do the card sorting. After they completed the experiment, they were asked to answer several questions regarding the four presentation styles and their thoughts about the experiment. The whole experimental process was logged by TechSmith Morae 3 software.

Content analysis scheme
To address the earlier-mentioned research question (What are the reasons that users may prefer (or not prefer) a particular opinion summarization style in making judgments about product reviews?), we employed a qualitative content analysis using an open coding approach to analyze the exit interview data of the Phase II experiment. The content analysis began with a comprehensive read-through and evaluation of all 34 participants' interview transcripts by each of the first three authors, the primary investigator and two doctoral students. Based on the initial review and several discussions between the authors, a number of themes emerged from the interviews that pertained to the reasons for preference towards the presentation styles. Themes included: Comprehensiveness (Comprehensiveness of information), Time (Time required to read the summary), Organization/categorization (Organization/categorization of the summary's content), Length/Amount (Length/amount of information), Appearance (Appearance of summary content), and Ease of use (Ease of use of summarization style). A coding scheme was designed according to these themes and is shown in Table 1. The unit of analysis for the open coding was the individual interview document. Each of the 34 interview documents was independently coded by the three researchers, and the data were collected in an Excel spreadsheet. The average pairwise percent agreement among the 3 coders was 0.81. Along with each code, snippets of supporting text were extracted from the interview data.
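The average pairwise percent agreement statistic reported above can be computed as sketched below. The coder names and code labels are hypothetical; note that, unlike chance-corrected measures such as Cohen's kappa, raw percent agreement does not account for agreement expected by chance.

```python
from itertools import combinations

def pairwise_percent_agreement(codings):
    """`codings` maps each coder to a list of codes (one per unit of
    analysis); return the mean, over all coder pairs, of the fraction
    of units on which the pair assigned the same code."""
    scores = []
    for a, b in combinations(codings, 2):
        same = sum(x == y for x, y in zip(codings[a], codings[b]))
        scores.append(same / len(codings[a]))
    return sum(scores) / len(scores)

# Hypothetical codes assigned to four interview documents by three coders
codings = {"coder1": ["COM", "ORG", "TIME", "EASE"],
           "coder2": ["COM", "ORG", "LEN",  "EASE"],
           "coder3": ["COM", "APP", "TIME", "EASE"]}
agreement = pairwise_percent_agreement(codings)
```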

User perception of the presentation styles
As initially noted, the goal of this paper is to better understand the relationship between opinion summarization styles and users' decision judgments, and the underlying reasons. The results are aligned with this goal and are based on a qualitative content analysis of the participant exit interview data.
In the exit interview, the participants answered questions relevant to the four presentation styles, including: (1) most helpful/least helpful, (2) easiest/most difficult, (3) fastest/slowest, (4) most informative/least informative, (5) sacrifice/keep the most, and (6) like/dislike. The six questions were the basis for the measurement of user perception of the four presentation styles. The results are displayed in Figure 5.
Participant responses also included their opinions on the usefulness of the summaries in general and the strategies they used to make decisions, as well as suggestions on system features they liked best, disliked, and thought could be added in the future.
Although "think-aloud" data and additional computer log data were collected using the Morae software, the current paper focuses exclusively on results from the exit interview data.
As can be seen from Figure 5, participants felt that the paragraph summary was the most helpful and the most informative presentation style, while it was also the most difficult and the slowest one. Tag clouds were reported to be the easiest and the fastest to use, but accordingly they were the least informative and the least helpful style, and the one disliked by most of the participants. On the other hand, though Aspect Oriented Sentiments did not get the most votes, they were found to be generally helpful, easy to use, fast, and informative, and the participants liked them the most. Group samples were relatively less helpful, more difficult, and slower.

Users' opinion on the presentation styles
We were interested in finding out the major reasons influencing users' preference for the various opinion summarization styles. It appears that "Comprehensiveness", "Organization/Categorization" and "Ease of use" are equally important key reasons affecting users' preference for the presentation styles. "Appearance", "Time", and "Length/Amount" were found to be less important reasons to the participants. Table 2 shows the distribution of reasons coded across participants. To generate the table, each unique code instance was counted for each participant.
We further investigated the distribution of the identified reasons across the four presentation styles. Table 3 shows the top 5 reasons for each style. Tag clouds received positive comments on "Appearance" and "Time" and a nearly equal amount of positive and negative comments on "Ease of use." However, most of the participants were negative regarding their "Comprehensiveness." This finding can explain the above-mentioned finding that participants disliked tag clouds the most. As mentioned by participants, "They don't, they don't have details." All of the top 5 reasons related to aspect oriented sentiments were positive and covered all of the main reason categories except "Time." The participants liked them because they "contain negative and neutral and the positive opinions," were "very convenient or easy to read," and "very clear, brief," among other reasons. This finding correlates with the findings in Figure 5 and also explains why the participants liked aspect oriented sentiments the most. Most participants found the paragraph summary good in terms of "Comprehensiveness," and some liked its organization and found the formatting was "…what's most normal for me." On the other hand, the paragraph summary received negative comments regarding "Time" and "Length." The participants found it "too long to read" and said they needed to "spend time reading it." Compared with the previous three styles, the group sample received far fewer comments. Some of the participants mentioned that it was good in terms of "Comprehensiveness" and "Organization," but some did not like it because of its "Length," "Ease of use," and "Organization." Typical user comments can be found in Table 3.
Unsurprisingly, in both the paragraph summary and the group sample styles, comprehensiveness was the predominant reason in users' preference decisions. Specifically, participants claimed that the paragraph summary was "getting a lot of information in a fairly simple package" and that the group sample helped them "imagine what if I had that product." For tag clouds, though the participants liked them because they were "much faster" and "the font size was there for the words," they agreed that "They don't, they don't give details" and were negative regarding their "Comprehensiveness." On the contrary, the paragraph summary was long and time-consuming, but most of the participants found it provided "a lot of information in a fairly simple package" and rated it positively on "Comprehensiveness." Overall, aspect oriented sentiments were the best among the four styles. They were brief and, at the same time, comprehensive. The participants found them easy to use and liked their appearance and organization.

Discussion
In this paper, we were interested in discovering users' preferences and the reasons affecting their preferences for the presentation styles in an opinion summarization card-sorting task.
Results demonstrate that: (1) Aspect oriented sentiments are the most preferred presentation style; (2) comprehensiveness, time, organization/categorization, length/amount, appearance, and ease of use are the major reasons impacting users' preference for a presentation style when making decisions in a product review task.
Our results supported the finding reported in [3] in that our participants disliked the tag clouds the most. As [3] pointed out, "tags are useful for grouping articles into broad categories, but less effective in indicating the particular content of an article." In our study, the participants acknowledged that tag clouds were the easiest to use because "We can construe what the product was like in very short time," and the fastest to use because "It's very fast; the tag clouds was much faster." But, in making decisions on the usefulness of the presentation styles, their first priority was the comprehensiveness of the information in the summary presentation. As mentioned by participants, they understood that people liked the tag clouds because of the "font size" and "color," but they disliked them because they don't "give details" or "drive my decision making." This finding raises an important issue for the design of information systems: how can the user interface balance the need for comprehensiveness of information with the need to provide key features enabling users to quickly grasp the desired information? On the other hand, [3] reported that, compared with human-assigned tags, automated tagging produces "more focused, topical clusters." The tag clouds in our study were generated by the turkers. As a result, the comprehensiveness of automatically generated tags might be a future research direction for us.
It was interesting to learn that participants liked the aspect oriented sentiments the most. Organization/categorization was the most critical reason in users' decision-making relevant to this presentation style. Most importantly, they liked them because they contain "negative and neutral, and the positive opinions," "color," "number" and "percentages." Our results indicated that there may be a relationship between consumers' information needs and their preference for an opinion summary presentation style. With regard to the opinion summary of a cleaning robot, it is to be expected that consumers may want to look for information about system usability, performance, and reliability. This factor may contribute to the finding that participants preferred aspect oriented sentiments, not tag clouds.
It can also be noted that there exist biases potentially introduced by manually generated summary presentations. Our summarizations were generated by human turkers, but this generation could have been influenced by the instructions and contents distributed by the researchers.
Results of this study have practical implications for developers of text summarization. A few design considerations for improving usability and user experience emerged based on participant responses and our observations. First, a deeper understanding of users' information behavior and their information needs when using information systems that support consumer decision-making is important. In this study, we made a step towards this understanding by using turkers to generate the summary reviews. Second, after having identified the key features of a consumer decision-making system, a good design should balance the number of features and the amount of information provided in the interface. Third, an appropriate and comprehensive organization/categorization scheme should be selected in terms of the targeted user groups, tasks, and design considerations. Many participants expressed opinions about the importance of organization/categorization. We feel it should be given greater attention in the design process in future experiments.

Conclusion
This paper reports a study comparing the effectiveness of four major styles of opinion summarization in a consumer purchase decision scenario. The study leverages the power of the crowd to bypass issues of text mining quality in order to reach more meaningful conclusions.
Our next step is to design and implement an experimental system based on the findings of this study. Such an experimental system will provide customers with a better view of summarized opinions in the system interface. Additionally, the experimental system will be compared with our baseline system in a user-centered lab experiment to test its effectiveness and efficiency. Our goal is to contribute to improving the user experience and usability of information systems that support consumer decision-making.
As a lab-based user-centered study, limitations exist. In this experiment, the generalizability of the findings was restricted by the limited types of tasks, the number of topics, and the sample pool. Additionally, the coding scheme we generated is a simple, initial one. Deeper, more fine-tuned coding and analysis could be applied to the data in a subsequent analysis. Despite these limitations, the results of this type of research will have implications for the design of information systems that support consumer decision-making.