Re: IJHCS-D-07-00029
Title: Design Guidelines for Effective Recommender System Interfaces Based on a Usability Criteria Conceptual Model
Authors: A. Ant Ozok, PhD; Quyin Fan; Anthony F Norcio, PhD
Dear Dr. A. Ant Ozok,
Your paper has now been reviewed and comments are attached for your information.
In light of the recommendations received, I am sorry to have to inform you that we are unable to accept your paper for publication. In spite of this outcome, I hope that the referees' comments are useful to you and that you will not hesitate to submit to IJHCS in the future.
Please note that IJHCS imposes a minimum interval of 9 months before significantly revised resubmissions of rejected papers can be considered.
Yours sincerely,
Kerstin Severinson Eklundh, PhD
Associate Editor
International Journal of Human-Computer Studies
Comments:
Reviewer #1: In general:
This paper surveys existing e-commerce recommender systems to evaluate the usefulness of recommender systems and to provide design guidelines.
The paper is very well written and easy to follow. The subject is an important one to research, and the authors have done a good job of identifying key attributes that make such systems successful, as well as some that are less important.
I found the presentation of the results weak in its organization. The researchers basically just list their findings. The scope of their research is very broad, almost too broad, and this may be why the organization felt weak to me. Things that might help:
1. They could have organized the recommendations as "primary," "secondary," and "least important"
2. Perhaps they could mock up a few annotated designs for web pages based on their results
3. The findings could have been separated into guidelines and findings
It also seems like the research could have focused more on certain product categories. I think there are significant differences in how one might recommend items. For example, on an apparel site a person may want a shirt to match a pair of pants, and would therefore like to see the two items together and be able to swap in alternatives with different colors, rather than seeing just the image of the recommended item, which might suffice for a movie/music/book site. I think there are enough of these types of differences that focusing research efforts on specific product categories is important. This probably feeds into the comments regarding one-size-fits-all recommendations. I think the researchers could have gathered some product-category-specific data to enhance the usefulness of their research. Of course, this is identified as one of their future directions, but I think we could have gotten a glimpse of it in this paper.
Was any evaluation done on the impact of a review from a reviewer who was perceived as more credible than another? For example, "100 out of 100 people found this review useful" would presumably carry more weight than "10 out of 100 people found this review useful."
Some more specific comments:
On page 11 it is not clear what the statement "Flexibility of the recommender systems means....not be intrusive" means. How does flexibility relate to intrusiveness of an interface? Does this mean the user should be able to filter the amount of information they are presented to adjust the perceived intrusiveness? Why is that different from personalization?
On page 49 the questions should be preceded by a caption. It was hard to know what part of the research the questions were associated with. Why isn't Q2.2 folded into Q2.1 as (g) Other?
In Table 2, Q1, what was the result for Amazon.com DVD? 9.9%? I think the numbers across the top of Table 2 represent (a), (b), (c), etc., but why use numbers instead of letters? Also, what do the "open" answers correspond to? Are these the "Other" option answers to those questions?
Page 21: "recommenders" is spelled incorrectly.
On page 28, how do recommender systems "increase the shopping choices"? The items are already in the store; the recommender system just highlights them for the shopper to promote them. I do not agree with the conclusion based on the findings presented.
On page 33 there is talk of manipulating recommendations. How? Do people really do this? Amazon and Netflix seem to allow one to do this. Do people really believe it to be worth their effort? At what level?
Page 33: again, I don't see how shopping choices are improved (or increased, as concluded on p. 28). I think you can say that the systems help pinpoint relevant products.
Page 37: instead of using bulleted points, consider using numbered points as in your other lists.
p. 37: bullets 1 and 4 seem to conflict with each other. The first says shoppers want full control and the fourth says they want partial control.
p. 37: what precise information would shoppers like to see? Bullets 2, 3, and 5 seem obvious and could be summarized in a sentence at the beginning of 5.1.
p. 37, bullet 6: I thought the results indicated that layout and content affected opinion. Is that not part of design? I am not clear on what this finding is saying.
There are many statements about shopping performance and about how recommender systems do not affect it. It is not clear to me what is meant by that. Does that mean they do not increase sales? That they do not improve a customer's ability to find things? That the length of shopping sessions is not decreased?
p. 38, bullet 10: they'd like to see descriptions, but this is secondary to the essential information, right?
p. 39, bullet 4: it would be good to list some of the preferences.
p. 40: the last paragraph of the conclusions seems to reiterate the results of other research more than the findings of this research. I think the focus of your conclusion should be on your own research.
Reviewer #2: This paper looks to discover customer preferences in the layout and information presented in recommender systems through a questionnaire. It also presents a model of important factors in recommender systems. However, the paper has several flaws. First, the model is only loosely justified based on prior work and much of it is not used in the paper. Second, rather than using the questionnaire to test the model, assumptions from the model are embedded in the questionnaire. Third, the analysis is shallow. And finally, the paper assumes that all recommender systems are essentially the same. To each in turn.
1) The model is only loosely justified by prior work. The review of prior work seems haphazard. For example, the citations based on Cosley are not correct--the paper is not about star ratings, nor about algorithms; it's about the effect of rating scales on behavior and the effect of showing predictions on people's ratings. A number of other papers about interface work in recommender systems are also ignored: Herlocker's paper about explanations in 2000 (related to transparency) and his 2004 TOIS paper about recommender system evaluation in general. The prior work is not integrated to create the higher-level elements of the model such as transparency, sufficiency, etc. Then, the paper does not use those elements anyway. Making a better case for the high-level elements as arising from the prior work, then saying the paper focuses on the interface elements, would be one strategy for improving it. Eliminating the model entirely, and instead using the prior work to highlight a number of important recommendation features, would also be a reasonable approach.
2) Assumptions from the model are embedded in the questionnaire. Normally such a model would be derived from or verified by the data collected and presented in the paper. In this case, the data don't support or illuminate the model. Instead, one of the model's key assumptions is turned into a question: the model designates name, price, and image as primary information; the questionnaire asks whether people prefer name, price, image, or other information; and the analysis then claims that this means people prefer name, price, and image.
3) The analysis is shallow. First, it's not well connected back to the model. Second, there are no statistical tests. This is not necessarily bad for descriptive studies, but the paper claims that there are differences in the kind of information that people prefer, and to do this it should present evidence that the means are statistically different. Third, the paper presents a laundry list of findings without much in the way of order or organization, which makes them hard to use. Finally, some statements are not very well founded, most notably "present a maximum of 3 recommendations". This is based on a question that asks people whether they want 1, 2, 3, 4, 5, 6, or more recommendations, and because the majority says 3, the paper concludes 3 or fewer. But really, the majority of the responses are 3 or more. Claiming that relatively frequent online college-student shoppers are adequately general is also suspect. It might be useful to cite large web surveys to support the idea that the survey population is representative.
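For example (a minimal sketch in Python; the option labels and counts below are hypothetical, not taken from the paper), a chi-square goodness-of-fit test on the raw preference counts would at least show whether the apparent differences could plausibly be chance:

    # Hypothetical counts of respondents preferring each information type;
    # test whether they differ from a uniform "no preference" baseline.
    from scipy.stats import chisquare

    observed = [120, 95, 80, 35]   # e.g., name, price, image, other (made-up counts)
    result = chisquare(observed)   # expected frequencies default to uniform
    print(result.statistic, result.pvalue)  # small p-value -> preferences genuinely differ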
4) All recommender systems are not the same. It seems very likely that people's preferences on how much information to present will depend on the product domain, their experience with the system, their shopping goals, etc. The paper should probably make the argument (or, if the survey instructions were explicit about this, even better) that people were thinking about their default shopping behavior, and that this behavior is so dominant in shopping that these conclusions are relatively universal.
Other notes: the grammar is a little shaky at times.
Reviewer #3: The topic for this paper is interesting and, as the authors point out, understudied in the literature on recommender systems. However, there are several major flaws that make the paper unsuitable for publication in its current form.
I'll start with three broad areas of concern for this manuscript, and then move on to specific sections and feedback.
Concern one is conceptual in nature. The paper is organized around usability for e-commerce recommenders, but I'm not sure the e-commerce lens is useful or helpful here. There is no motivation for why e-commerce recommender usability would differ from that of other types of recommender systems. Adding the commerce perspective muddles the overall operationalization of the variables of interest and causes the introduction to meander. I'm not sure that usability metrics differ across these domains, but to prove that the authors would need either evidence from previous work or their own direct testing in this inquiry.
Concern two is methodological. Classic usability studies are experimental in nature, ranging from cognitive walk-throughs to eye-tracking to task accomplishment set-ups. Very few usability studies use an unguided survey as a main method. In this case, I think some of the questions for the survey become almost tautological. For example, question III.2 asks "There should be descriptions of the recommended product in the recommendation." Is there any reason to suspect that people aren't going to agree with that? Also, the way the questions are framed only gets to user satisfaction, which is one of many dimensions of usability. Using surveys severely limits what can accurately be said about the usability of recommenders. Another weakness of surveys as a method here (especially given that no scales were developed) is that
Concern three is conceptual/methodological. The sites picked as representative for the respondents each have characteristics well outside of their recommender systems that could significantly affect how users perceive those systems. Unless you can compare two sites where the only difference is the recommender system, the comparison is invalidated by the severe statistical noise that would be introduced.
Some more specific concerns-
Intro:
My main concern is the lack of focus in the introduction. The argument ranges from the importance of e-commerce to usability to survey methods. One consequence is that none of these areas is addressed with adequate depth to support later decisions and arguments. The authors often confound recommender systems and collaborative filtering systems, and do not address a larger body of work on the usability of recommenders (Cosley is a good piece, but there are also McNee, Lam, and a few others that could be addressed). The review of survey literature (and I do like the comparison of recommender systems and surveys) ignores a whole body of work on usability in web surveys (Couper is the best place to start here). Still, all of that would just take revision; it is the methodological concerns that I find more problematic.
Methods:
The rating of sites by experts should be reported with some justification and statistics. What was the sampling frame here? Why only two experts? What does expertise mean in this context? What was the inter-rater reliability?
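For two raters, Cohen's kappa would be the natural agreement statistic to report. A minimal sketch (the rating categories and values below are invented for illustration, not drawn from the paper):

    # Hypothetical inter-rater agreement between the two experts on
    # categorical site ratings; kappa near 1 indicates strong agreement,
    # near 0 indicates agreement no better than chance.
    from sklearn.metrics import cohen_kappa_score

    expert_a = ["good", "good", "poor", "fair", "good", "poor"]
    expert_b = ["good", "fair", "poor", "fair", "good", "good"]
    print(cohen_kappa_score(expert_a, expert_b))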
For the sample chosen to participate in the survey, what was the sampling frame? The authors go to some lengths to show that college students are a good target population for this study, but I do not find them to be a compelling group. Assuming they are heavy e-commerce shoppers (and data from Pew would have supported this), it could mean they are more tolerant of bad usability, which is a common finding among expert users in the usability literature.
Other methodological concerns are mentioned above.