While this measured data is real, it is typically not what you wanted to know. Would the same user on a different day, under different conditions, have made the same errors? What about other users?
the population Another idea of ‘real’ is when there is a larger group of people you want to know about, say all the employees in your company, or all users of product A. This larger group is often referred to as the population. What would be the average (and variation in) error rate if all of them sat down and used the software you are testing? Or, as a more concrete kind of measurement, what is their average height? You might take a sample of 20 people and find their average height, but you are using this to make an estimate about your population as a whole.
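To make this concrete, here is a minimal sketch in Python; the population size, mean height of 170 cm, and spread of 10 cm are invented for illustration, not values from the text. A sample of 20 is drawn from a simulated population and its mean used as an estimate of the population mean.

```python
import random

# Hypothetical population: heights (cm) of all 2000 employees
# (simulated; the mean of 170 cm and spread of 10 cm are assumptions).
population = [random.gauss(170, 10) for _ in range(2000)]

# Take a random sample of 20 people and use its mean as an
# estimate of the population mean.
sample = random.sample(population, 20)
sample_mean = sum(sample) / len(sample)
population_mean = sum(population) / len(population)

print(f"sample mean:     {sample_mean:.1f} cm")
print(f"population mean: {population_mean:.1f} cm")
```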
the ideal However, while this idea of the actual population is very concrete, often the ‘real’ world you are interested in is slightly more nebulous. Consider the current users of product A. You are interested in the error rate not only if they try your new software today, but if they do so multiple times over a period—that is, a sort of ‘typical’ error rate when each uses the software.
Furthermore, it is not so much the actual set of current users (not that you don’t care about them), but rather the typical user, especially for a new piece of software where you have no current users yet. Similarly, when you toss a coin you have an idea of the behaviour of a fair coin, which is not simply the complete collection of every coin in circulation. Even when you have tossed the coin, you can still think about the different ways it could have fallen, somehow reasoning about all possible pasts and presents for an unrepeatable event.
the theoretical Finally, this hypothetical ‘real’ event may be represented mathematically as a theoretical distribution such as the Normal distribution (for heights) or Binomial distribution (for coin tosses).
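For instance, here is a short sketch of drawing from these two theoretical distributions; the parameters (a mean of 170 cm with a spread of 10 cm, and ten fair coin tosses) are illustrative assumptions:

```python
import random

# Normal distribution: a common model for heights (illustrative parameters).
height = random.gauss(mu=170, sigma=10)

# Binomial distribution: the number of heads in 10 fair coin tosses,
# built here as a sum of individual 50:50 outcomes.
heads = sum(random.random() < 0.5 for _ in range(10))

print(f"one sampled height: {height:.1f} cm")
print(f"heads in 10 tosses: {heads}")
```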
In practice, you rarely need to voice these things explicitly, but occasionally you do need to think carefully about it. If you have done a series of consistent blood tests you may know something very important about a particular individual, but not patients in general. If you are analysing big data you may know something very precise about your current users, and how they behave given a particular social context and particular algorithms in your system, but not necessarily about potential users and how they may behave if your algorithms and environment change.
1.3.2 THERE AND BACK AGAIN
Once you have clarity about the ‘real’ world that you want to investigate, the job of statistics also becomes clearer. You have taken measurements, often of some sample of people and situations, and you want to use the measurements to understand the real world (Fig. 1.3).
For example, given a sample of heights of 20 randomly chosen people from your organisation, what can you infer about the heights of everyone? Given the error rates of 20 people on an artificial task in a lab, what can you tell about the behaviour of a typical user in their everyday situation? Given the complete past history of ten million users of a website, what does this tell us about their future behaviour or the behaviour of a new user to the site?
Figure 1.3: The job of statistics—moving from data about the real world back to knowledge about the real world.
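One standard way of making such an inference is a confidence interval around the sample mean. A minimal sketch follows; the data are simulated, and 2 is used as a rough approximation to the usual t/z multiplier for a 95% interval:

```python
import math
import random
import statistics

# Hypothetical sample: measured heights (cm) of 20 randomly chosen people.
sample = [random.gauss(170, 10) for _ in range(20)]

mean = statistics.mean(sample)
sd = statistics.stdev(sample)        # sample standard deviation
se = sd / math.sqrt(len(sample))     # standard error of the mean

# Rough 95% interval for the population mean.
print(f"estimated population mean: {mean:.1f} cm "
      f"(roughly {mean - 2 * se:.1f} to {mean + 2 * se:.1f})")
```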
1.3.3 NOISE AND RANDOMNESS
If all the measurements we had were deterministic, we would not need statistics. For example, an ultrasonic range finder sends a pulse of sound, measures how long it takes to return, then multiplies the time by the speed of sound, divides by two, and gives you a readout.
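The calculation itself is pure arithmetic, as a tiny sketch shows; the speed of sound here assumes dry air at around 20 °C:

```python
SPEED_OF_SOUND = 343.0  # metres per second, dry air at ~20 degrees C

def distance_metres(echo_time_s: float) -> float:
    """Out-and-back time multiplied by the speed of sound, divided by two."""
    return echo_time_s * SPEED_OF_SOUND / 2

print(distance_metres(0.01))  # a 10 ms echo corresponds to about 1.7 m
```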
In the case of the sample of 20 people we can measure each of their heights relatively accurately, but maybe even this has some inaccuracy, so each measurement has some ‘noise.’ More critical is that they are randomly chosen from the far larger population of employees. In this and many similar situations, there is a degree of randomness in the measurements on which we base our decision making.
Just as with ‘real,’ ‘random’ is not so straightforward.
Some would argue that everything is pre-determined from its causes, with the possible exception of quantum mechanics, and even then only in some interpretations. However, in reality, when we toss a coin or roll a die, we treat these as probabilistic phenomena.
fundamentally random These are predominantly quantum-level processes, such as the decay of radionuclides. They are used in some of the most critical random number generators.
complex processes When we toss a coin, the high speed of the spinning coin, coupled with the airflows around it as it falls, means that its path is so complex that it is effectively random. In the digital world, random number generators are often seeded by measuring a large number of system parameters, each in principle deterministic, but so complex and varied that they are effectively unpredictable (see the code sketch after this list).
past random events Imagine you have tossed a coin, and your colleague has taken a quick peek,2 but you have not yet looked at it. What is the probability it is a head? Instinctively, you would probably say “1 in 2.” Clearly, it is already completely determined, but in your state of knowledge it is still effectively random.
uncontrolled factors As you go round measuring the heights of the people, perhaps tiny air movements subtly affect your ultrasonic height measurement. Or if you subsequently ask the people to perform a website navigation task, perhaps some have better web skills than others, or better spatial ability. Sometimes we can measure such effects, but often we have to treat them as effectively random.
Note that most people would regard the first two of these as ‘really’ random, or we could call them ontologically random—random in their actual state of being. In contrast, the latter two are epistemologically random—random in your state of knowledge. In practice, we often treat all these similarly.
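Here is a small sketch of the ‘complex processes’ point in Python: a pseudo-random generator is completely determined once you know its seed, whereas the operating system’s entropy pool draws on so many varied sources that its output is effectively unpredictable. The seed value 42 is arbitrary.

```python
import os
import random
import secrets

# Epistemically predictable: a pseudo-random generator is fully
# determined by its seed; anyone who knows the seed can reproduce it.
rng = random.Random(42)
print(rng.random())              # the same value on every run

# Effectively unpredictable: the OS gathers entropy from many complex,
# varied sources (timings, device events, and so on).
print(os.urandom(8).hex())       # 8 bytes of OS entropy
print(secrets.randbelow(6) + 1)  # a die roll drawn from OS entropy
```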
More important, and of more practical use, are the following distinctions:
persistence In some cases the random effect is in some way persistent (such as the skill or height of the person), but in other cases it is different for every measurement (like the air movements). This is important as the former may be measurable themselves, or in some circumstances can be cancelled out.
probability With the coin or die, we have an idea of the relative likelihood of each outcome, that is, we can assign probabilities, such as 1/6 for the die rolling a ‘3’. However, some things are fundamentally unknown, such as the trillionth digit of π; all we know is that it is one of the ten digits 0–9.
uniformity For the probabilistic phenomena, some are uniform: the chances of heads and tails are pretty much equal, as are the chances of landing on each of the six faces of a die. However, others are spread unevenly, such as the level of skill or height of a random employee. For the latter, we often need to know or measure the shape of this unevenness (its distribution), as the sketch below illustrates.
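A sketch of uniform versus non-uniform shapes, using simulated data; the Normal model for heights (mean 170 cm, spread 10 cm) is an illustrative assumption:

```python
import collections
import random

# Uniform: each face of a fair die is (roughly) equally likely.
die = [random.randint(1, 6) for _ in range(6000)]
print(sorted(collections.Counter(die).items()))

# Non-uniform: heights cluster around a typical value.
heights = [random.gauss(170, 10) for _ in range(6000)]
bands = collections.Counter(round(h, -1) for h in heights)  # 10 cm bands
for band in sorted(bands):
    print(f"{band:.0f} cm: {'#' * (bands[band] // 100)}")
```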
In order for statistics to be useful, the phenomena we deal with need to have some probability attached to them, but this does not need to be uniform; indeed, probability distributions (see Chapter 4) capture precisely this non-uniformity. Philosophically, there are many ways we can think about these probabilities:
frequentist This is the most down-to-earth interpretation. When you say the chance of a coin landing heads is 50:50, you mean that if you keep on tossing the coin again and again and again, on average, after many many tosses, the ratio of heads to tails will be about 50:50. In the case of an unrepeatable phenomenon, such as the already tossed coin, this can be interpreted as “if I reset the world and re-ran it lots of times,” though that, of course, is not quite so ‘down to earth.’
idealist Plato saw the actual events of the world as mere reflections of deeper ideals. The toss of the actual coin in some ways is ‘just’ an example of an ideal coin toss. Even if you toss a coin five times in a row, and it happens to come up