Alan Dix

Statistics for HCI



will look at each in turn.

       Exploration – formative

      During the exploration stage of research or during formative evaluation of a product, you are interested in finding any interesting issue (Fig. 1.7). For research this is about something that you may then go on to study in depth and hope to publish papers about. In software development it is about finding usability problems to fix or identifying opportunities for improvements or enhancements. It does not matter whether you have found the most important issue, or the most debilitating bug, so long as you have found sufficient for the next cycle of development.


      Statistics are less important at this stage, but may help you establish priorities. If money or time is short, you may need to decide which of the issues you have uncovered is most interesting to study further, or fix first. In practical usability, the challenge is not usually finding problems, nor even working out how to fix them; it is deciding which are worth fixing.

       Validation – summative evaluation

      In both validation in research and summative evaluation during development (Fig. 1.8), the focus is much more exhaustive: you want to find all problems and issues (though we hope that few remain during summative evaluation!).

      The answers you need are definitive. You are not so much interested in new directions (though that may be an accidental outcome); instead, you are verifying that your precise hypothesis is true, or that the system works as intended. For this you may need statistical tests, whether traditional (p-value) or Bayesian (odds ratio).

      You may also want figures: how good is it (e.g., “nine out of ten owners say their cats prefer …”), how prevalent is an issue (e.g., “95% of users successfully use the auto-grow feature”). For this the size of effects is important, so you may be more interested in confidence intervals, or pretty graphs with error bars on them.
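      As a minimal sketch of why sample size matters for figures like these, the following computes an approximate 95% confidence interval for a proportion. The Wilson score interval is used here as one common choice, not as the method this book prescribes, and the counts (9 of 10, 90 of 100) are hypothetical.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Approximate Wilson score interval for a proportion.

    z=1.96 corresponds to roughly 95% confidence (an assumption of this sketch).
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half_width, centre + half_width


# Hypothetical data: the same observed 90% rate at two sample sizes.
for successes, n in [(9, 10), (90, 100)]:
    low, high = wilson_interval(successes, n)
    print(f"{successes}/{n}: {low:.2f} to {high:.2f}")
```

      With ten participants the interval spans roughly from 0.6 to nearly 1, so "nine out of ten" on its own says little about the wider population; with a hundred participants the same observed rate gives a much narrower range.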

      As we noted earlier, in practical software development there may not be an explicit summative step, but the decision will be based on the ongoing cycles of formative assessment. This is of course a statistical assessment, however informal; perhaps you just note that the number and severity of problems found has decreased with each iteration. It may also be pragmatic: you’ve run out of time and are simply delivering the best product you have. However, if there is any form of external client, or if the product is likely to be business critical, there should be some form of quality assessment. The decision about whether to use formal statistical methods, eyeballing of graphs and data, or simple expert assessment will depend on many factors including the pragmatics of liability and available time.

       Are five users enough?

      One of the most well-known (and misunderstood) myths of interaction design is the idea that five users are enough. I lose count of the number of times I have been asked about this, let alone seen variants of it quoted as a justification for study sizes in published papers.

      The idea originated in a paper by Nielsen and Landauer [54], 25 years ago. However, that was crucially about formative evaluation during iterative development. I emphasise, it was neither about summative evaluation, nor about sufficient numbers for statistics!

      Nielsen and Landauer combined a simple theoretical model based on software bug detection with empirical data from a small number of substantial software projects to establish the optimum number of users to test per iteration.

      Their notion of ‘optimum’ was based on cost-benefit analysis: each cycle of development costs a certain amount, each user test costs a certain amount. If you uncover too few user problems in each cycle you end up with many development cycles, which is expensive in terms of developer time. However, if you perform too many user tests you repeatedly find the same problems, thus wasting user-testing effort.

      The optimum value depends on the size and complexity of the project, with the number far higher for more complex projects, where redevelopment cycles are more costly; the figure of five was a rough average based on the projects studied at the time. Nowadays, with better tool support, redevelopment cycles are far less expensive than any of the projects in the original study, and there are arguments that the optimal value may now even be just testing one user [50]—especially if it is obvious that the issues uncovered are ones that appear likely to be common. This idea of one-by-one testing has been embedded in the RITE method (Rapid Iterative Testing and Evaluation), which in addition advocates having various stakeholders heavily involved in very rapid cycles of testing and fixing [52, 53].
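      The trade-off described above can be sketched numerically. The code below assumes the commonly cited form of the model, in which each user independently reveals any given problem with probability `detect_prob`, so that n users are expected to find a fraction 1 - (1 - detect_prob)^n of the problems. The value 0.3 and all the costs are hypothetical illustrations, not figures from the original study, and "problems found per unit cost" is only one possible reading of ‘optimum’.

```python
def proportion_found(n_users, detect_prob):
    """Expected fraction of problems seen by at least one of n_users."""
    return 1.0 - (1.0 - detect_prob) ** n_users


def found_per_unit_cost(n_users, detect_prob, cycle_cost, user_cost):
    """Fraction of problems found in one cycle, divided by that cycle's cost."""
    return proportion_found(n_users, detect_prob) / (cycle_cost + n_users * user_cost)


def best_users_per_cycle(detect_prob, cycle_cost, user_cost, max_users=50):
    """Number of users per cycle that maximises problems found per unit cost."""
    return max(
        range(1, max_users + 1),
        key=lambda n: found_per_unit_cost(n, detect_prob, cycle_cost, user_cost),
    )


# Hypothetical costs, measured in units of one user test.
for cycle_cost in (0.1, 5.0, 20.0):
    n = best_users_per_cycle(detect_prob=0.3, cycle_cost=cycle_cost, user_cost=1.0)
    print(f"cycle cost {cycle_cost}: test {n} user(s) per cycle")
```

      Under these assumptions, a very cheap redevelopment cycle favours testing a single user per iteration, while more expensive cycles push the best number upward, which matches the qualitative argument in the text.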

      However, whether 1, 5, or 20 users, there will be more users on the next iteration—this is not about the total number of users tested during development. In particular, at later stages of development, when the most glaring problems have been fixed, it will become more important to ensure you have covered a sufficient range of the target user group.

      For more on this see Jakob Nielsen’s more recent and nuanced advice [55] and my own analyses of “Are five users enough?” [20].


       Explanation

      While validation establishes that a phenomenon occurs (what is true), explanation tries to work out why it happens and how it works (Fig. 1.9): the aim is deep understanding.

      As noted, this will often involve more qualitative work on small samples of people. However, it is also often best connected with quantitative studies of large samples. For example, you might have a small number of rich in-depth interviews, but match the participants against the demographics of large-scale surveys. If, say, a particular pattern of response is evident in the large study and your in-depth interviewee has a similar response, it is often a reasonable assumption that their reasons will be similar to the large sample. Of course, they could just be saying the same thing for completely different reasons, but often common sense or prior knowledge means that the reliability is evident. If you are uncertain of the reliability of the explanation, that could always drive targeted questions in a further round of large-scale surveys.

      Similarly, if you have noticed a particular behaviour in logging data from a deployed experimental application, and a user shows the same behaviour during a think-aloud or eye-tracking session, then it is reasonable to assume that their vocal deliberations and cognitive or perceptual behaviours may be similar to those of the users of the deployed application.

      We noted that the parallel with software development was unclear; however, the last example starts to point toward a connection.

      During the development process, user testing often reveals many minor problems. It iterates toward a good-enough solution, but rarely makes large-scale changes. Furthermore, at worst, the changes you perform at each cycle may create new problems. This is a common problem with software bugs where code becomes fragile, and with user interfaces, where each change in the interface creates further confusion, and may not even solve the problem that gave rise to it. After a while you may lose track of why each feature is there at all.

      Rich understanding of the underlying human processes—perceptual, cognitive, social—can both ensure that ‘bug fixes’ actually solve the problem, and allow more radical, but informed redesign that may make whole rafts of problems simply disappear.

      The rest of this book is divided into three parts.

      Wild and wide—concerning randomness and distributions. This part will help you get a ‘gut feel’ for random