As we have noted, standard deviation is often abbreviated to SD in the medical literature. Sometimes for emphasis we will denote it by SD(x), where the bracketed term x is included for a reason to be introduced later.
Means or Medians?
Means and medians convey different impressions of the location of data, and one cannot give a prescription as to which is preferable; often both give useful information. If the distribution is symmetric, then in general the mean is the better summary statistic. If it is skewed, the median is less influenced by the tails and will better reflect a ‘typical’ individual. For example, if in a country median income is £20 000 and mean income is £24 000, most people will relate better to the former number.
It is sometimes stated, incorrectly, that the mean cannot be used with binary or ordered categorical data. However, as we have noted before, if binary data are scored 0/1 then the mean is simply the proportion of 1s. If the data are ordered categorical, then again the data can be scored, say 1, 2, 3, etc. and a mean calculated. This can often give more useful information than a median for such data, but should be used with care, because of the implicit assumption that the change from score 1 to 2, say, has the same meaning (value) as the change from score 2 to 3, and so on.
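Both points can be illustrated with a few lines of code. The values below are invented for illustration, and the 1 to 4 scoring of the ordered categories is an assumption of this sketch rather than a recommended scale.

```python
from statistics import mean, median

# Hypothetical binary outcomes scored 0/1 (1 = event occurred)
outcomes = [1, 0, 0, 1, 1, 0, 1, 0, 0, 0]
proportion_of_ones = sum(outcomes) / len(outcomes)
binary_mean = mean(outcomes)  # identical to the proportion of 1s

# Hypothetical ordered categories scored 1, 2, 3, 4 (equal spacing is assumed)
scores = [1, 1, 2, 2, 2, 3, 3, 4]
score_mean = mean(scores)
score_median = median(scores)

print(binary_mean, proportion_of_ones)  # 0.4 0.4
print(score_mean, score_median)         # 2.25 2.0
```

The mean of the scores changes whenever any individual moves category, while the median stays at 2 until the middle of the distribution shifts. That sensitivity is what makes the mean informative here, subject to the equal-spacing assumption.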
2.5 Displaying Continuous Data
A picture is worth a thousand words, or numbers, and there is no better way of getting a ‘feel’ for the data than to display them in a figure or graph. The general principle should be to convey as much information as possible in the figure, with the constraint that the reader is not overwhelmed by too much detail.
Dot Plots
The simplest method of conveying as much information as possible is to show all of the data and this can be conveniently carried out using a dot plot. It is also useful for showing the distributions in two or more groups side by side.
Example – Dot Plot – Baseline Corn Size
The data on corn size and treatment group (corn plaster or scalpel) are shown in Figure 2.5 as a dot plot. This method of presentation retains the individual subject values and clearly demonstrates any similarities or differences between the groups in a readily appreciated manner. An additional advantage is that any outliers will be detected by such a plot. However, such presentation is not usually practical with large numbers of subjects in each group because the dots will obscure the details of the distribution. Figure 2.5 shows that the two randomised groups had similar distributions of corn sizes at baseline.
Figure 2.5 Dot plot showing corn size (in mm) by randomised treatment group for 200 patients with corns.
(Source: data from Farndon et al. 2013).
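As a rough sketch of the idea behind a dot plot, the code below prints one dot per observation for two groups, using only the Python standard library. The corn sizes are invented for illustration and are not the trial data; the function name `dot_plot_rows` is also hypothetical.

```python
from collections import Counter

# Hypothetical corn sizes in mm for two groups (not the Farndon et al. data)
groups = {
    "plaster": [2, 3, 3, 3, 4, 4, 5, 5, 6, 9],
    "scalpel": [2, 2, 3, 3, 3, 4, 4, 5, 6, 6],
}

def dot_plot_rows(values):
    """Return one text row per whole-mm size, with one dot per observation."""
    counts = Counter(values)
    return [
        f"{size:>2} mm | " + "." * counts[size]
        for size in range(min(values), max(values) + 1)
    ]

for name, values in groups.items():
    print(name)
    print("\n".join(dot_plot_rows(values)))
```

Because every observation is drawn, the isolated value of 9 mm in the first group stands out after two empty rows, which is how a dot plot makes a possible outlier visible.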
Histograms
Patterns in a large data set of a continuous numerical variable may be revealed by forming a histogram. This is constructed by first dividing the range of the variable into several non‐overlapping intervals (classes or bins) of equal width, then counting the number of observations in each. A histogram for all the baseline corn sizes in the Farndon et al. (2013) trial data is shown in Figure 2.6. In this histogram the intervals have a width of 1 mm. The area of each histogram block is proportional to the number of subjects in the particular corn size category. Thus, the total area of the histogram blocks represents the total number of patients. Relative frequency histograms allow comparison between histograms made up of different numbers of observations, which may be useful when studies are compared.
Figure 2.6 Histogram of baseline index corn size (in mm) for 200 patients with corns.
(Source: data from Farndon et al. 2013).
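The counting step behind a histogram, and the conversion to relative frequencies, can be sketched as follows. The sizes are invented whole-mm values rather than the trial data, and `histogram_counts` is a hypothetical helper written for this illustration.

```python
from collections import Counter

# Hypothetical corn sizes recorded to the nearest mm (not the trial data)
sizes_mm = [1, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5, 6, 7, 10]

def histogram_counts(values, width=1):
    """Count observations in equal-width bins; keys are the lower bin limits."""
    counts = Counter((v // width) * width for v in values)
    low, high = min(counts), max(counts)
    # Include empty bins so that gaps in the distribution remain visible
    return {start: counts.get(start, 0) for start in range(low, high + width, width)}

counts = histogram_counts(sizes_mm)
total = sum(counts.values())

# Relative frequencies sum to 1 whatever the sample size
relative = {start: n / total for start, n in counts.items()}

for start, n in counts.items():
    print(f"{start:>2} mm: {n:>2} ({relative[start]:.2f})")
```

Dividing by the total is what allows two histograms based on different numbers of observations to be compared on the same vertical scale.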
The choice of the number and width of intervals or bins is important. Too few intervals and much important information may be smoothed out; too many intervals and the underlying shape will be obscured by a mass of confusing detail. As a rule of thumb, it is usual to choose between 5 and 15 intervals, but the correct choice will be based partly on a subjective impression of the resulting histogram. In the corn plaster trial the baseline corn size was recorded to the nearest whole mm. In Figure 2.6 we have 10 intervals or bins of width 1 mm, which fits our rule of thumb. In this example an interval of 1–1.99 mm covers bin 1, 2–2.99 mm covers bin 2, etc. Histograms with bins of unequal interval length can be constructed but they are usually best avoided.
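The effect of the bin width can be seen by tabulating the same values with several widths. This is a minimal sketch on invented whole-mm values, not the trial data; `bin_counts` is a hypothetical helper, and the widths 1, 2 and 5 are arbitrary choices.

```python
# Hypothetical corn sizes recorded to the nearest mm (not the trial data)
sizes_mm = [1, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5, 6, 7, 10]

def bin_counts(values, width):
    """Return the lowest bin limit and the counts for equal-width bins."""
    low = (min(values) // width) * width          # align bins to multiples of width
    n_bins = (max(values) - low) // width + 1
    counts = [0] * n_bins
    for v in values:
        counts[(v - low) // width] += 1
    return low, counts

for width in (1, 2, 5):
    low, counts = bin_counts(sizes_mm, width)
    print(f"width {width} mm, {len(counts)} bins from {low} mm: {counts}")
# width 1 mm, 10 bins from 1 mm: [1, 2, 4, 3, 2, 1, 1, 0, 0, 1]
# width 2 mm, 6 bins from 0 mm: [1, 6, 5, 2, 0, 1]
# width 5 mm, 3 bins from 0 mm: [10, 4, 1]
```

With a width of 5 mm the peak at 3 mm and the gap before 10 mm are both smoothed away, whereas a width of 1 mm keeps them visible with a number of bins inside the 5 to 15 rule of thumb.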
Box and Whisker Plot
A box and whisker plot contains five pieces of summary information about the data: the median; upper quartile; lower quartile; maximum and minimum values. If the number of points is large, a dot‐plot can be replaced by a box and whisker plot, which is more compact than the corresponding histogram.
Illustrative Example – Box and Whisker Plot – Baseline Corn Size by Treatment Group
A box and whisker plot is illustrated in Figure 2.7 for the corn size and treatment group from Farndon et al. (2013). The ‘whiskers’ in the diagram indicate the minimum and maximum values of the variable under consideration. The median value is indicated by the central horizontal line, whilst the lower and upper quartiles are indicated by the corresponding horizontal ends of the box. The shaded box itself represents the interquartile range. The box and whisker plot as used here therefore displays the median and two measures of spread, namely the range and interquartile range. In Figure 2.7, for the scalpel group the median and lower quartile for the baseline corn size coincide at 3 mm.
Figure 2.7 Box and whisker plot of size of corn at baseline (in mm) by randomised group for 200 patients with corns.
(Source: data from Farndon et al. 2013).
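The five values drawn in a box and whisker plot can be computed directly. The sketch below uses invented corn sizes, not the trial data, and relies on `statistics.quantiles` with `method="inclusive"`; several quartile conventions exist, so other software may give slightly different quartiles for the same data. The function name `five_number_summary` is hypothetical.

```python
from statistics import quantiles

# Hypothetical corn sizes in mm for one group (not the Farndon et al. data)
sizes_mm = [2, 3, 3, 3, 3, 3, 4, 5, 6, 8, 9]

def five_number_summary(values):
    """Minimum, lower quartile, median, upper quartile and maximum."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return {"min": min(values), "q1": q1, "median": q2, "q3": q3, "max": max(values)}

summary = five_number_summary(sizes_mm)
value_range = summary["max"] - summary["min"]  # distance between the whisker ends
iqr = summary["q3"] - summary["q1"]            # height of the box

print(summary)
# {'min': 2, 'q1': 3.0, 'median': 3.0, 'q3': 5.5, 'max': 9}
print(value_range, iqr)  # 7 2.5
```

In this made-up sample the lower quartile and the median are both 3 mm, so the median line would sit on the lower edge of the box, the same pattern described above for the scalpel group.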
Scatter Plots
When one wishes to illustrate a relationship between two continuous variables, a scatter plot of one against