Experimental Design and Statistical Analysis for Pharmacology and the Biomedical Sciences
two pH meters.
There is an argument here that these two sets of samples simply reflect a single measurement in triplicate, similar to that described in the first example above (see Figure 3.2). However, this situation is slightly different since here we are interested in the performance of the two pH meters. Consequently, it is perfectly reasonable to treat these measurements as independent, so that n = 3 for each pH meter and each solution, whereupon the average pH (with the spread of data described by the standard deviation) may be calculated for each set of data. This may appear inconsistent with the earlier example, but to see why it is permissible you need to understand the questions being asked in the two situations. In the first, we were interested in the effect of the drug or serum on the population growth of the cells and not our consistency in measuring the number of cells. In this latter situation, however, it is the relative performance of the equipment being used, in terms of accuracy, consistency, precision, and variability, that is being investigated.
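As a minimal sketch of the calculation just described, the short Python example below treats three readings from each meter as independent observations (n = 3) and reports the mean and standard deviation for each set. The readings, the meter names, and the helper function `summarise` are all made up for illustration and are not data from this book.

```python
from statistics import mean, stdev

# Hypothetical triplicate pH readings (illustrative values only).
readings = {
    ("meter A", "solution 1"): [7.01, 7.03, 6.99],
    ("meter B", "solution 1"): [7.10, 6.95, 7.08],
}


def summarise(values):
    """Return (n, mean, sample standard deviation) for one set of readings."""
    return len(values), mean(values), stdev(values)


# One summary per meter and solution, each set treated as n = 3.
summary = {key: summarise(values) for key, values in readings.items()}

for (meter, solution), (n, average, sd) in summary.items():
    print(f"{meter}, {solution}: n = {n}, mean pH = {average:.2f}, SD = {sd:.3f}")
```

Here the standard deviation describes the spread of each meter's readings, which is exactly the quantity of interest when the question concerns the consistency of the equipment.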
The take‐home message here is that as with everything in data analysis, the question being asked is of paramount importance and is a direct consequence of the aim of the experiment.
Independent and paired data sets
I have used the term ‘independent’ on a few occasions so far, and it is a term that is often used in statistics (independent observations, independent groups, independent data), and so it is important that we understand the true meaning of this term.
Consider the situation where you are interested in the heights of female students compared with those of age‐matched male students (see Chapter 5, Tables 5.1 and 5.2 for example data). There is a clear distinction between these groups based on their gender, and so the groups of female and male students are clearly independent of each other: they contain different participants or subjects and consequently produce independent data sets. Similarly, if we compared the heights of male students when they started secondary school with those of a different group of students at university, then these groups would also be independent since they clearly contain different participants.
In contrast, a different situation arises where a group of male students had their heights measured both on starting secondary school and again when they started university. In this situation the heights of the same participants are examined but at different time points; in such circumstances the resulting data are said to be paired.
This distinction between independent and paired groups is important since, as we shall see later, it informs our decision on the inferential statistical tests subsequently used to analyse the data.
Of course, some data measurements include both independent and paired data sets. Consider the situation where the heights of female and male students are measured when they start secondary school and again when they are at university. Such data will include independent groups (based on their gender) and paired data (based on the time at which the heights were measured).
In addition to these terms, statisticians also use the terms ‘Between’ and ‘Within’ to describe different variables and factors in an experiment. The term ‘Between‐Group Variable’ refers to clear differences between the independent groups used in an experiment; in the examples above, gender is a Between‐Group Variable. In contrast, the term ‘Within‐Group Variable’ refers to experimental changes experienced by the same group of participants or subjects during the course of an experiment and always occurs as a function of time; in the examples above, the time at which the height measurements were obtained is a Within‐Group Variable.
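One way to see the difference between the two kinds of variable is to look at how such data might be organised. The sketch below is only an illustration: the participant labels and heights are invented, and the layout (one record per measurement) is one reasonable choice rather than a required format. The Within‐Group Variable (time) is handled by pairing the two measurements from the same participant, whereas the Between‐Group Variable (gender) is handled by comparing different participants.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical heights in cm (illustrative values only).
# Each record: (participant, gender, time point, height)
records = [
    ("P1", "female", "school", 150.0), ("P1", "female", "university", 164.0),
    ("P2", "female", "school", 148.0), ("P2", "female", "university", 160.0),
    ("P3", "male", "school", 152.0), ("P3", "male", "university", 178.0),
    ("P4", "male", "school", 155.0), ("P4", "male", "university", 181.0),
]

# Within-group variable (time): pair the measurements from the SAME participant.
heights_by_participant = defaultdict(dict)
gender_of = {}
for participant, gender, time_point, height in records:
    heights_by_participant[participant][time_point] = height
    gender_of[participant] = gender

growth = {
    participant: heights["university"] - heights["school"]
    for participant, heights in heights_by_participant.items()
}

# Between-group variable (gender): compare DIFFERENT participants.
growth_by_gender = defaultdict(list)
for participant, change in growth.items():
    growth_by_gender[gender_of[participant]].append(change)

mean_growth = {gender: mean(changes) for gender, changes in growth_by_gender.items()}

print("growth per participant (paired):", growth)
print("mean growth per gender (independent groups):", mean_growth)
```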
4 Data collection: sampling and populations, different types of data, data distributions
Sampling and populations
Whenever we perform an experiment, we obtain measurements according to our observations, and as described in earlier chapters our aim is to collect data in a precise and accurate manner. Statistics may be defined as the science of collecting, summarising, presenting, and interpreting such data. A collection of data on its own is not information, but a valid summary and description of that data set derives information by putting the data into context. Statistics therefore involves summarising a collection of data in a clear and understandable way such that our reader or audience may see clearly the similarities or differences between the groups in our experiment.
One very important issue we need to accept in statistics is that in almost all cases, we only work with samples taken from whole populations of subjects. Consequently, we are almost always faced with the situation in which we estimate population parameters from the samples we have obtained.
For example, suppose we wanted to determine the height of students at your university or institution. We would not be able to determine the height of every single student, so we would choose a ‘representative sample’ of students, measure their height, and then estimate the average height from these values.
A statistical population, therefore, is the set of all possible values (our observations/measurements) that could be measured.
A sample is a subset of the population: a limited number of observations, drawn at random, that will be used to describe the parent population (see Figure 4.1).
Figure 4.1 A data sample is a random set of values drawn from the parent population.
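The idea of estimating a population parameter from a sample can be sketched in a few lines of Python. Everything in this example is an assumption made for illustration: the 'population' is simulated (20,000 heights with a mean of about 170 cm and a standard deviation of about 10 cm), and the sample size of 50 is arbitrary. In a real experiment the population mean would, of course, be unknown.

```python
import random
from statistics import mean

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Simulated parent population of student heights in cm (assumed values).
population = [rng.gauss(170, 10) for _ in range(20_000)]

# A simple random sample of 50 students, drawn without replacement.
sample = rng.sample(population, 50)

population_mean = mean(population)  # normally unknown to the experimenter
sample_mean = mean(sample)          # our estimate of the population mean

print(f"population mean = {population_mean:.1f} cm")
print(f"sample mean     = {sample_mean:.1f} cm (n = {len(sample)})")
```

The two means will rarely be identical, but with a random sample the estimate should lie close to the population value.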
By necessity this process involves the tacit assumption that the sample group is truly representative of the parent population. Furthermore, if we take a large number of samples from the population and divide those randomly into subgroups of equal size, then each subgroup should also represent the parent population. If that is true, then the subgroups should also be equivalent to one another. In reality, of course, there will be some differences, not only between the subgroups but also between each subgroup and the parent population, and it is in determining the importance of those differences that we rely on statistical analysis.
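The random division into subgroups can be illustrated in the same way. In the sketch below, which again uses a simulated population with assumed values, 120 observations are shuffled and cut into four subgroups of 30. The subgroup means should be similar to each other and to the population mean, but not identical.

```python
import random
from statistics import mean

rng = random.Random(1)  # fixed seed so the sketch is repeatable

# Simulated parent population of heights in cm (assumed values).
population = [rng.gauss(170, 10) for _ in range(20_000)]

sample = rng.sample(population, 120)
rng.shuffle(sample)  # random allocation: shuffle, then cut into equal parts

group_size = 30
subgroups = [sample[i:i + group_size] for i in range(0, len(sample), group_size)]
subgroup_means = [mean(group) for group in subgroups]

print(f"population mean: {mean(population):.1f} cm")
for number, subgroup_mean in enumerate(subgroup_means, start=1):
    print(f"subgroup {number} mean: {subgroup_mean:.1f} cm")
```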
The Central Limit Theorem
Luckily, however, the small differences that arise as a result of taking samples from a population are not a huge issue thanks to what is known as the Central Limit Theorem. This states that, given a large enough sample size, the sampling distribution of the sample mean will approximate to a normal distribution regardless of the variable's distribution in the given population. I know I have not described or explained the nature of the normal distribution as yet (sorry!), but have a quick look at Figure 4.7 later in this chapter and compare the shape to the distributions of data sets shown in Figures 5.3 and 5.4 in Chapter 5; can you see the differences in shape?
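A small simulation can make the theorem less abstract. The sketch below is an illustration under stated assumptions rather than a proof: the population is taken to be exponentially distributed (a strongly skewed shape, chosen here simply because it looks nothing like a normal distribution), and the functions `sample_means` and `skewness` are helpers written for this example. Skewness is used as a rough measure of asymmetry; a normal distribution has a skewness of zero.

```python
import random
from statistics import mean, stdev

rng = random.Random(0)  # fixed seed so the sketch is repeatable


def sample_means(sample_size, n_samples=2_000):
    """Means of repeated random samples from a skewed (exponential) population."""
    return [
        mean(rng.expovariate(1.0) for _ in range(sample_size))
        for _ in range(n_samples)
    ]


def skewness(values):
    """Rough measure of asymmetry: the mean cubed standardised deviation."""
    centre, spread = mean(values), stdev(values)
    return mean(((value - centre) / spread) ** 3 for value in values)


means_n1 = sample_means(1)    # each 'sample' is a single observation
means_n30 = sample_means(30)  # each sample contains 30 observations

print(f"skewness of sample means, n = 1:  {skewness(means_n1):.2f}")
print(f"skewness of sample means, n = 30: {skewness(means_n30):.2f}")
print(f"mean of the sample means, n = 30: {mean(means_n30):.2f}")
```

With single observations the distribution of 'means' is as skewed as the population itself, whereas with samples of 30 the distribution of the sample means is far more symmetrical and is centred near the population mean (1.0 for this simulated population).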
So, what does this theorem mean? Well, for any set of observations we can easily produce a scatterplot of the magnitude of the observation on the x‐axis against the frequency