Statistically significant usability testing

It was an intriguing question: “How do I find out about statistically significant usability testing?” I’m sure it’s one that you’ve encountered, and maybe your reaction was the same as mine: “That’s the wrong question”. Then I realised that if someone asks a question, it probably means that they want an answer to it. So here we go.

Statistical significance

I’m assuming that we all know what we mean by ‘usability testing’, so let’s start with the idea of ‘statistical significance’. Suppose you have a target for your product (let’s make it a website): a typical user can complete a typical task within 30 seconds. Let’s make the task finding the telephone number for Customer Services.

You run a usability test, and find that the mean task time is 35 seconds. Then you make some changes, run another usability test, and find that the mean task time is now 29 seconds. Result: much rejoicing. Or maybe not: did the improvement come purely from the natural variability in the samples, or from the changes that you made?

That’s where ‘statistical significance’ comes in. If you take two different random samples, measure something about them, compare the measurements, and find a difference, then statistical tests of significance will help you to work out whether that difference could have come from the natural variation you get with any random sample, or whether the two samples were genuinely different for some underlying reason.
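To make that concrete, here is a minimal sketch in Python of one possible significance test, an exact permutation test. It is only one of many tests you could choose, and not necessarily the one any particular book or calculator uses. The task times are invented for illustration (chosen so that the means are 35 and 29 seconds), and the function name `permutation_p_value` is my own. The idea: if the changes made no difference, then which test a time came from is arbitrary, so we reshuffle the times into two groups in every possible way and see how often the reshuffled difference is at least as big as the one we observed.

```python
from itertools import combinations
from statistics import mean


def permutation_p_value(before, after):
    """Two-sided exact permutation test for a difference in means.

    Returns the share of all possible regroupings of the pooled values
    whose difference in means is at least as large as the observed one.
    """
    pooled = list(before) + list(after)
    n_a = len(before)
    n_b = len(after)
    total = sum(pooled)
    observed = abs(mean(before) - mean(after))

    extreme = 0
    n_splits = 0
    # Every way of choosing which values count as the "before" group.
    for idx in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in idx)
        diff = abs(sum_a / n_a - (total - sum_a) / n_b)
        n_splits += 1
        # Small tolerance so floating-point noise does not hide ties.
        if diff >= observed - 1e-9:
            extreme += 1
    return extreme / n_splits


# Hypothetical task times in seconds, five participants per test.
before_changes = [28, 41, 33, 39, 34]  # mean 35
after_changes = [25, 36, 27, 31, 26]   # mean 29

p = permutation_p_value(before_changes, after_changes)
print(f"Share of regroupings at least as extreme: {p:.3f}")
```

A small result suggests that chance alone rarely produces a gap this big; a large one means the 6-second improvement could easily be natural variation. With only five participants per test there are just 252 possible regroupings, so the test cannot be very sensitive.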

For example, suppose I wanted to know the typical number of years of experience of a usability professional. I could go to the next Usability Professionals’ Association meeting, ask the first ten people to arrive “how long have you been a usability professional?”, and calculate the mean of their answers. Then, being meticulous, I might decide to go to an extra meeting and do the same thing. I’d probably get a difference, and I could use a significance test to find out whether that difference could have arisen from chance.

Other statistical errors

Maybe you’re thinking: “hmm, I wouldn’t try to find out typical years of experience in that way”? I agree: my strategy is replete with other statistical errors. First of all, I’ve got a sampling error: the sample I chose (first ten people) isn’t random and it’s quite likely that the first ten people are somewhat different from the rest of the people in some way.

Secondly, I’ve got a non-response error: it’s quite possible that usability professionals who go to UPA meetings have a different experience profile to those who do not. What about people with young children? Very experienced professionals who are under the illusion that they know it all and don’t go to meetings anymore? Students who happened to all be having exams that week and couldn’t turn up? It’s possible that all these factors might average out, but unlikely.

Thirdly, I’ve almost certainly got a measurement error: some people at the meeting might not be usability professionals, and others may have had breaks in their experience. What exactly did I want to find out, how would I use the data that I obtained, and did I therefore ask the right question?

Statistical errors in usability testing

I think that’s why I reacted a bit negatively to the original question. I’m sure all of us start planning our usability tests by thinking first of all about what exactly we want to find out, how we’ll use the findings, and making sure that we “ask the right questions” by careful design of our tasks and choice of what parts of the product to test. Statistically speaking, that’s all about avoiding measurement error and has nothing to do with statistical significance.

Do I hear you cry “but what about recruiting?” Definitely. We all know that we have to be thoughtful about who we recruit. That’s where sampling error and non-response error come in, whether you go for the cheap-and-cheerful ‘hey you’ method, throw a lot of money at a complex, stratified sample, or do something in-between.

No matter what you spend, it’s unlikely that everyone in the target audience has an equal chance of being selected (that’s sampling error). And it’s very likely that there are some systematic differences between the people who did have a chance of being selected and those who did not: for example, we often have to recruit for specific geographical locations. That’s non-response error. So whatever your recruitment strategy, it is absolutely crucial to think about how the users you get differ from the actual target population. They always will.

(Aside: thinking about how the respondents you get differ from the actual target population is a completely standard step in large-scale surveys done by any reputable market research organisation. They call it ‘rebalancing the sample’.)
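As an illustration of the idea in that aside, one common way of rebalancing is to weight each participant so that the groups in the sample count in the same proportions as they occur in the target population. The sketch below assumes that this kind of weighting is roughly what is meant. The groups, population shares, and task times are all invented, and the function names are my own.

```python
from collections import Counter


def rebalancing_weights(sample_groups, population_shares):
    """Weight for each participant: population share / sample share of their group."""
    counts = Counter(sample_groups)
    n = len(sample_groups)
    return [population_shares[g] / (counts[g] / n) for g in sample_groups]


def weighted_mean(values, weights):
    """Mean of values where each value counts in proportion to its weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Hypothetical test: we managed to recruit 2 novices and 6 experts,
# but we believe the target audience is half novices, half experts.
groups = ["novice", "novice"] + ["expert"] * 6
task_times = [40, 44, 22, 25, 19, 28, 24, 26]  # seconds
shares = {"novice": 0.5, "expert": 0.5}

weights = rebalancing_weights(groups, shares)
print("Unweighted mean:", sum(task_times) / len(task_times))
print("Rebalanced mean:", weighted_mean(task_times, weights))
```

The rebalanced mean counts each novice more heavily because novices were under-represented in the sample. It cannot repair a sample in which a group is missing altogether, and with only a handful of participants per group the weights magnify whatever quirks those few people happen to have.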

We run the test and we find some problems. Did those problems arise from chance, in that our random selection of participants happened to bring us people who made lots of mistakes? Or did the problems arise because the product we tested was riddled with obvious usability defects? Or some mixture?

For many of us, it’s a no-brainer: our participants are indeed entirely representative of the target audience, and sadly the product is ghastly. We now have to exercise our diplomatic skills in tactfully breaking the bad news. There’s no question of statistics here: it’s all about plain facts, and that’s why I reacted badly to the initial question. If that’s your experience too, you can skip the ‘statistical significance’ question and stop reading here.

Statistical significance can be useful

But as your product improves, you’ll start to get more complex results. Some people have problems, others don’t, and when you look at solving a problem, the fix turns out to be quite subtle and involves a trade-off that might make the experience better for some but worse for others.

Now you’re into a place where statistics can really be helpful, allowing you to punch a few numbers into an appropriate calculator or spreadsheet and bingo, get some more numbers to use in the decision-making process.

OK, how do I do that?

And now we’re back to the original question: “How do I find out about statistically significant usability testing?” This used to be quite hard to answer, but no longer. Tom Tullis and Bill Albert have obliged with their book Measuring the user experience: Collecting, analyzing and presenting usability metrics.

I admit the title doesn’t sound all that enticing and I also admit that it’s got lots of rather nerve-wracking entries in the table of contents such as ‘Measures of Central Tendency’. Please don’t let that put you off. Tackle it gently, a chapter at a time. You’ll find that they ease you through with lots of practical advice, even to the point of explaining exactly what to do in Excel.

They have a companion website, Measuring UX, which has presentations and spreadsheets. I admit I was tempted to avoid paying for the book and instead just try to work directly from the website, but that would be a mistake: the book demystifies the spreadsheets and puts them into context, and that is definitely worth the price.

Finally, there’s Jeff Sauro’s site Measuring Usability. This dives straight into the tough stuff, such as calculating a Z-Score.
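For a flavour of what a z-score involves, here is a minimal sketch using the general textbook definition: how many standard deviations a value lies from the mean of a sample. It is not taken from the Measuring Usability site. The task times are invented, the 30-second target is borrowed from the example earlier in this article, and converting the z-score into a percentage assumes that task times are roughly normally distributed, which real task times often are not.

```python
from statistics import NormalDist, mean, stdev


def z_score(value, sample):
    """Number of sample standard deviations that value lies above the sample mean."""
    return (value - mean(sample)) / stdev(sample)


# Hypothetical task times in seconds from one usability test.
task_times = [25, 36, 27, 31, 26]
target = 30  # the 30-second target from the earlier example

z = z_score(target, task_times)
# Under the normality assumption, the standard normal CDF at z estimates
# the share of users who would finish within the target time.
share_within_target = NormalDist().cdf(z)

print(f"z-score of the target: {z:.2f}")
print(f"Estimated share finishing within {target}s: {share_within_target:.0%}")
```

A positive z-score here means the target is slower than the average participant, so more than half of users would be expected to meet it.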

This article first appeared in Usability News, 1 June 2009

Picture by Zoi Koraki, creative commons