Designing comparative evaluations

Image: a set of smiling identical twins. The identical twins problem: are you sure your users will notice the differences you want to test?

It was one of those calls that is simultaneously good news and bad news. “We’d like you to do an evaluation for us. We have two designs here and we want to know which one is better”.

The good news: well, I’m a consultant so phone calls offering work are always good, right?

The bad news: comparative evaluations. Ugh. So I thought I’d at least make use of the pain by writing a few notes on them here.

Is A better than B?

The first challenge of a comparative evaluation is that the client wants a nice clear answer: A is better than B. Or perhaps: B is better than A. The problem is that the actual answer is usually more complicated. Parts of A are better than parts of B. Parts of B are better than parts of A. Some bits of A are execrable. Some bits of B, usually but not always different ones, are also execrable. There’s probably an approach C that is better than A or B, and the final answer is probably D: a bit of C, plus some of the good points from A and from B. It’s not exactly a nice clean story, is it?

‘Between subjects’ or ‘within subjects’

I don’t like to use the term ‘subject’ for the participant in a test, because in my view the system is the subject, not the person. But here we need to borrow from the design of psychological experiments, where the subject of the experiment is the person. If you have two designs to test, are you going to get the same participants to test both designs (‘within subjects’), or are you going to do two rounds of testing, where one group of participants gets A and another gets B (‘between subjects’)?

It’s not easy to design a good ‘within subjects’ test

The problem with ‘within subjects’ design is that nearly all systems have some learning effects:

If you ask the participants to try the same or similar tasks with both systems then they learn about the task with the first system and can’t unlearn that knowledge before they try the second system. I’ve known participants who had a hard time with the task on A so they were adamant that they preferred B even though it was downright horrible to do the task with B.
If they try different tasks with each system then are they really comparing like with like?

It’s not easy to design a good ‘between subjects’ test

The problem with ‘between subjects’ design is that you can’t ask the participants which they preferred. And surely that is one of the main reasons why we’re doing an evaluation anyway, to establish preference? So we end up in the murky world of inferential statistics: trying to figure out what the population as a whole might prefer on the basis of the two samples from that population who tried these two interfaces.

Both types of comparative test require large samples

‘Within subjects’ tests require much larger sample sizes than we usually work with, because we have to vary the order in which the systems are presented: one group of participants gets A then B, and an equal-sized group gets B then A.
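Varying the order like this is usually called counterbalancing. Here is a minimal sketch in Python of how the assignment might be done. The function name, the participant labels and the fixed seed are all hypothetical, chosen only for illustration.

```python
import random

def counterbalance(participants, orders=(("A", "B"), ("B", "A")), seed=None):
    """Randomly assign participants to presentation orders in equal-sized groups.

    Illustrative helper, not a standard library function.
    """
    if len(participants) % len(orders) != 0:
        raise ValueError("participant count must be a multiple of the number of orders")
    rng = random.Random(seed)
    shuffled = list(participants)
    rng.shuffle(shuffled)
    # Deal the shuffled participants round-robin so every order gets the same number.
    return {person: orders[i % len(orders)] for i, person in enumerate(shuffled)}

# Hypothetical participant labels P1..P8; the seed only makes the run repeatable.
schedule = counterbalance([f"P{n}" for n in range(1, 9)], seed=1)
for person, order in sorted(schedule.items()):
    print(person, "sees", " then ".join(order))
```

Shuffling before dealing means that who ends up in which group is left to chance rather than to the order in which people were recruited. Note that this balances the order of systems only; it does nothing about the learning effects themselves.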

‘Between subjects’ tests are even worse, because of those pesky inferential statistics. That means: random sampling and very much larger sample sizes than we normally use in usability testing.
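To get a feel for how much larger, here is a rough sketch of a textbook sample-size calculation for comparing two task completion rates, using the normal approximation for two independent proportions. The completion rates of 60% and 80% are invented for illustration, and the 5% significance level and 80% power are conventional defaults rather than anything specific to this kind of study.

```python
from math import ceil, sqrt
from statistics import NormalDist

def participants_per_group(p_a, p_b, alpha=0.05, power=0.80):
    """Approximate group size needed to detect a difference between two proportions.

    Two-sided test, normal approximation, equal group sizes, no continuity
    correction. Illustrative helper, not a substitute for a statistician.
    """
    if p_a == p_b:
        raise ValueError("the two rates must differ")
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the significance level
    z_power = z.inv_cdf(power)          # critical value for the desired power
    p_bar = (p_a + p_b) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_power * sqrt(p_a * (1 - p_a) + p_b * (1 - p_b))) ** 2
    return ceil(numerator / (p_a - p_b) ** 2)

# Assumed (made-up) completion rates: 60% on design A, 80% on design B.
n = participants_per_group(0.60, 0.80)
print(f"About {n} participants per group, {2 * n} in total")
```

Under these assumptions the answer comes out at around 80 participants per group, and it climbs steeply as the difference you hope to detect gets smaller.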

Minimal or radical differences?

My third recurring problem with comparative evaluation is the ‘identical twins’ problem. The client knows these babies and sees all the subtle and, to them, important differences that they want to explore. The participants see them as identical twins: both products look pretty much the same.

For example, we were looking at three different versions of a form that is much hated by the general public. The client could see all sorts of really, really major differences between them. The participants just saw the form they loathed.

Some tips

If you do have to undertake a comparative evaluation, maybe these tips will help:

Prepare your client for a complicated answer that picks elements from the different approaches.
Be prepared to undertake far more tests. You’ll probably need at least three times the number of participants you usually work with rather than just twice the number.
Dust off your statistics books. You really do need to think about what assertions are supported by your sample size.
Try to make sure that the differences you are exploring really do seem like differences to your participants.
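As one illustration of the statistics tip, a confidence interval shows how little a small sample can support. The sketch below computes a Wilson score interval for a completion rate. The function name and the example of 4 successes out of 5 participants are assumptions made for illustration, not data from a real test.

```python
from math import sqrt
from statistics import NormalDist

def wilson_interval(successes, n, confidence=0.95):
    """Wilson score interval for a proportion (for example, a task completion rate)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p_hat = successes / n
    denom = 1 + z ** 2 / n
    centre = (p_hat + z ** 2 / (2 * n)) / denom
    half_width = z * sqrt(p_hat * (1 - p_hat) / n + z ** 2 / (4 * n ** 2)) / denom
    return centre - half_width, centre + half_width

# Hypothetical result: 4 of 5 participants completed the task.
low, high = wilson_interval(4, 5)
print(f"Observed 80%, plausible range roughly {low:.0%} to {high:.0%}")
```

With only five participants, an observed 80% is compatible with a true rate anywhere from under 40% to over 95%, so two designs would have to differ enormously before a sample that size could separate them.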

A version of this article first appeared in ‘Caroline’s Corner’, in the August 2004 edition of Usability News.

featured image by Ethermoon, creative commons
