I am structured: Cluster Me, Don't Just Rank me

Sihem Amer-Yahia

Abstract

A large number of online applications are built over high-dimensional data. That is the case for shopping, where products have several features (e.g., price and color); dating, where personal profiles are described using several dimensions (e.g., physical features and political views); and entertainment (e.g., movie genre and director, restaurant ambiance and location). In addition, in some applications, items may be accompanied by qualitative data such as movie and restaurant reviews. The typical way users find items in those applications is by entering a keyword query and receiving a ranked list of relevant results. Ideally, just as in Web search, users would want to spend little time before finding a satisfactory item. In practice, due to the query output size, the high dimensionality of items, and in some cases the presence of qualitative data, users tend to spend a lot of time trying to understand correlations between item features and item quality.

In this talk, I will argue that the "ten blue links" experience we are used to in Web search (keywords as input, ranked list as output) is inappropriate when querying and ranking high-dimensional data. I will describe two applications: exploring qualitative data and ranked querying of structured data.

Exploring qualitative data is a common activity on collaborative rating sites such as IMDb, CNet and Yelp. The amount of information available on those sites is often daunting. For example, on Yelp, a not-so-popular restaurant, Joe's Shanghai, received nearly a thousand ratings, and more popular restaurants routinely exceed that number. Similarly, the movie The Social Network received more than 42,000 ratings on IMDb within just two months of its release. In practice, a user spends a lot of time examining items and reviews before making an informed decision.

Ranked querying of structured data is typical in applications such as online dating or real estate search.
In online dating, a user looking for a partner between 20 and 40 years old who sorts the matches by income from higher to lower will see a large number of matches in their late 30s who hold an MBA degree and work in the financial industry before seeing any matches in different age groups and walks of life. Similarly, in online real estate, a user looking for 1- or 2-bedroom apartments sorted by price will see a large number of cheap 1-bedrooms in undesirable neighborhoods before seeing any apartment with different features. Top results in ranked lists tend to be homogeneous, thereby hindering data exploration.

In both applications, an alternative to ranking is to cluster results on their attributes and describe the clusters (e.g., Woody Allen comedies liked by males over 35; cheap 2-bedrooms with 2 baths). However, not all clusters will be of interest to users, given varying item quality and varying reviewer information. When exploring qualitative data, different users are interested in the opinions of different reviewer populations. When querying and ranking structured data, different item features correlate differently with item quality.

I will discuss two approaches in this talk. Persona-driven search, for which we have preliminary ideas in restaurant search, aims to improve the exploration of qualitative data. Rank-aware clustering aims to unveil hidden correlations between item features and item quality. In that context, I will report the results of a large-scale user study and a performance evaluation over datasets from a leading dating site.
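
As a rough illustration of the contrast between a ranked list and attribute-based clusters, the following Python sketch groups a handful of hypothetical apartment listings by bedroom and bath count. It is a plain group-by on exact attribute values, not the rank-aware clustering method discussed in the talk; the listings, field names, and function names are all invented for this example.

```python
from collections import defaultdict

# Hypothetical listings; every value is made up for illustration only.
listings = [
    {"id": "a1", "bedrooms": 1, "baths": 1, "price": 900},
    {"id": "a2", "bedrooms": 1, "baths": 1, "price": 950},
    {"id": "a3", "bedrooms": 1, "baths": 1, "price": 980},
    {"id": "a4", "bedrooms": 2, "baths": 1, "price": 1400},
    {"id": "a5", "bedrooms": 2, "baths": 2, "price": 1600},
    {"id": "a6", "bedrooms": 2, "baths": 2, "price": 1550},
]


def ranked(items):
    """Return the items sorted by price, cheapest first (the ranked list)."""
    return sorted(items, key=lambda item: item["price"])


def cluster_by_attributes(items, attrs):
    """Group items by their exact values on attrs, keeping rank order inside each group."""
    clusters = defaultdict(list)
    for item in ranked(items):
        key = tuple((attr, item[attr]) for attr in attrs)
        clusters[key].append(item)
    return dict(clusters)


def describe(key):
    """Turn a cluster key into a short human-readable label."""
    return ", ".join(f"{attr}={value}" for attr, value in key)


if __name__ == "__main__":
    # The top of the ranked list contains only 1-bedroom, 1-bath apartments.
    print("Top 3 by price:", [item["id"] for item in ranked(listings)[:3]])

    # Clustering surfaces one group per attribute combination instead.
    for key, members in cluster_by_attributes(listings, ["bedrooms", "baths"]).items():
        print(describe(key), "->", [item["id"] for item in members])
```

In this toy data the three cheapest listings share the same attributes, while the clustered view shows each attribute combination with its best-ranked member first.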
