In a clear and thoughtful article in the
Popular myth – and Mayo’s (1933) original conclusion – says that human subjects in any kind of research will perform better just because they’re aware they’re being studied.
Several researchers have reviewed the original study that generated the finding, and they say that’s not what really happened. Parsons (1974) was the first to say that the improvement in performance of subjects in the original study was more likely due to feedback they got from the researchers about their performance and what they learned from getting that feedback.
Why it doesn’t apply to usability tests
Macefield convincingly demonstrates why the Hawthorne effect doesn’t carry over to usability tests:
- The Hawthorne studies were longitudinal; most usability tests are not.
- The subjects were experts; most participants in a usability test are novices at something, because what they are using is new.
- The metrics used in the Hawthorne studies were different from those used in most usability tests.
- The subjects in the Hawthorne studies had horrible, boring jobs, so they may have been motivated to perform better because of attention they got from researchers; it’s possible in usability tests that participants are experiencing unwanted interruptions by being included or that they’re just doing the test to get paid.
- Hawthorne subjects may have thought that taking part in the study would improve their chances for raises or promotions; the days of usability test participants thinking that their participating in studies might help them get jobs are probably over.
What about feedback and learning effects?
We want feedback to be part of a good user interface, don’t we? Yes. And we want people to learn from using an interface, don’t we? Again, yes. But, as Macefield says, let’s make sure that all the feedback and learning from a usability test comes from the UI and not the researcher/moderator. Instead, get to the cause of problems from qualitative data such as the verbal protocol from participants’ thinking aloud to see how they’re thinking about the problem.
Look at effects across tasks or functions
Macefield suggests that if you’re getting grief about possible Hawthorne effects, add a control group to compare against and then look at performance across tasks. For example, you might expect that the test group (using an “improved” UI) would be more efficient or effective in all elements of a test than a control group. But it’s possible that the test group did better on one task while both groups had a similar level of problems on a different task. If this happens, it is unlikely that feedback or learning prompted by the moderator created the improvement, because that kind of effect should show up globally, across all tasks, rather than on just one.
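Here is one rough way to look at that comparison in practice. This is a minimal Python sketch, not Macefield’s method: the task names, the timing data, and the 10% threshold are all made up for illustration, and a real study would want a proper statistical test rather than a simple cutoff.

```python
from statistics import mean

# Hypothetical task completion times in seconds (invented data).
control = {
    "find_product": [62, 70, 58, 66],
    "checkout": [120, 131, 115, 126],
}
test = {
    "find_product": [41, 45, 39, 43],
    "checkout": [118, 129, 117, 124],
}


def relative_improvement(control_times, test_times):
    """Fraction by which the test group's mean time is lower than the control's."""
    control_mean = mean(control_times)
    return (control_mean - mean(test_times)) / control_mean


def improvement_by_task(control_group, test_group):
    """Map each task name to the test group's relative improvement on it."""
    return {
        task: relative_improvement(control_group[task], test_group[task])
        for task in control_group
    }


def is_global(improvements, threshold=0.10):
    """True if every task improved by at least `threshold` (an arbitrary cutoff)."""
    return all(value >= threshold for value in improvements.values())


improvements = improvement_by_task(control, test)
for task, value in improvements.items():
    print(f"{task}: {value:.0%} faster in the test group")
print("Improvement on every task:", is_global(improvements))
```

With this invented data, the test group is much faster on one task and about the same on the other. That uneven pattern is the kind of result that points to the UI, rather than the moderator, as the source of the difference.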
Macefield closes the article with a couple of pages that could be a lesson out of Defense Against the Dark Arts, setting out very specific ways to argue against any assertion that your findings might be “contaminated.” But don’t just zoom to the end of the piece. The value of the article is in knowing the whole story.