
Observation   Deviation from the mean   Squared deviation
8.3           8.3 − 8.05 = 0.25         (0.25)² = 0.0625
8.1           8.1 − 8.05 = 0.05         (0.05)² = 0.0025
8.2           8.2 − 8.05 = 0.15         (0.15)² = 0.0225
7.6           7.6 − 8.05 = −0.45        (−0.45)² = 0.2025

S² = Σ(xᵢ − x̄)² / (n − 1)

      The variance measures how spread out the data are around their mean. The greater the variance, the greater the spread in the data.

      The variance is not in the same units as the data, but in squared units. If the data are in grams, the variance is expressed in squared grams, and so on. Thus, for descriptive purposes, its square root, called standard deviation, is used instead.

      The standard deviation (usually denoted by S) quantifies variability in the same units of measurement as we measure our data.

      Considering the previous example, the standard deviation is:

S = √S² ≈ 0.55

      The greater the standard deviation, the greater the spread of data values around the mean.

      Considering the mean and the standard deviation together and computing the range: mean ± S, we can say that data values vary on average from (mean − S) to (mean + S).

      From the previous example the average range is:

mean ± S = 8.05 ± 0.55, i.e. from 8.05 − 0.55 = 7.5 to 8.05 + 0.55 = 8.6

      The observed data vary on average from 7.5 to 8.6.
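      As a minimal sketch of these computations in Python, the following uses only the four observations visible in the deviation table above; its numbers are therefore illustrative and will not reproduce the 7.5 to 8.6 range of the full example.

```python
import statistics

# Illustrative data: only the four observations shown in the deviation table above.
data = [8.3, 8.1, 8.2, 7.6]

mean = statistics.mean(data)          # sample mean
variance = statistics.variance(data)  # sample variance S^2, divisor n - 1
std_dev = statistics.stdev(data)      # standard deviation S = sqrt(S^2)

# Data values vary on average from (mean - S) to (mean + S).
lower, upper = mean - std_dev, mean + std_dev

print(f"mean = {mean:.2f}, S^2 = {variance:.4f}, S = {std_dev:.2f}")
print(f"typical range: {lower:.2f} to {upper:.2f}")
```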

      Stat Tool 1.10 Measures of Variability: Coefficient of Variation

      Another measure of variability for numeric data is the coefficient of variation.

      It is calculated as follows:

CV = (S / x̄) × 100, i.e. the standard deviation expressed as a percentage of the mean

      Being a dimensionless quantity, the coefficient of variation is a useful statistic for comparing the spread of several datasets, even when their means differ, their data are expressed in different units, or they refer to different variables.


      The higher the coefficient of variation, the higher the variability.
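      As a brief sketch with hypothetical data (the two samples below are not taken from the book), the coefficient of variation lets us compare variability across measurements that have different units and different means:

```python
import statistics

def coefficient_of_variation(data):
    """CV = (standard deviation / mean) * 100, expressed as a percentage."""
    return statistics.stdev(data) / statistics.mean(data) * 100

# Hypothetical samples measured on different scales.
weights_g = [250, 255, 248, 260, 252]        # product weights in grams
lengths_mm = [12.1, 12.4, 11.8, 12.6, 12.0]  # product lengths in millimetres

print(f"CV of weights: {coefficient_of_variation(weights_g):.1f}%")
print(f"CV of lengths: {coefficient_of_variation(lengths_mm):.1f}%")
# The sample with the higher CV shows greater relative variability,
# regardless of its units or the size of its mean.
```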

      Stat Tool 1.11 Boxplots

      So far, we have looked at three different aspects of numerical data analysis: shape of the data, central and non‐central tendency, and variability.

      Boxplots can be used to assess and compare these three aspects of quantitative data distributions, and to look for outliers.

      Like histograms, boxplots work best with moderate to large sample sizes (at least 20 values).

      Let's look at how a boxplot is constructed. It can be displayed horizontally or vertically:

      1 Start by drawing a horizontal or vertical axis in the units of the data values.

      2 Draw a box to encompass 50% of middle data values. The left edge of the box is the first quartile Q1. The right edge of the box is the third quartile Q3. The width of the box is the interquartile range, IQR. Draw a line inside the box to denote the median.

      3 Draw lines, called whiskers, on the left (toward the minimum) and on the right (toward the maximum) of the box to show the spread of the remaining data (25% of data points lie below Q1 and 25% lie above Q3). Many statistical software packages do not allow the whiskers to extend beyond one and a half times the interquartile range (1.5 × IQR); any points outside this range are outliers and are displayed individually as asterisks.


      Boxplots help to summarize:

      1 Central tendency. Look at the value of the median.

      2 Non‐central tendency. Look at the values of the first quartile Q1 and the third quartile Q3.

      3 Variability. Look at the length of the boxplot (range) and the width of the box (IQR).

      4 Shape of data. Look at the position of the line of the median in the box and the position of the box between the two whiskers. In a symmetric distribution, the median is in the middle of the box and the two whiskers have the same length. In a skewed distribution, the median is closer to Q1 (skewed to the right) or to Q3 (skewed to the left) and the two whiskers do not have the same length (Figure 1.10).

       Figure 1.10 Histograms and boxplots for distributions that are skewed to the right, fairly symmetric, and skewed to the left.
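      The quantities behind a boxplot can also be computed directly. The following sketch uses Python's standard library on a hypothetical sample to derive the quartiles, the IQR, the 1.5 × IQR whisker limits, and the resulting outliers:

```python
import statistics

# Hypothetical sample (not from the book), large enough for a meaningful boxplot.
data = [7.2, 7.8, 7.9, 8.0, 8.0, 8.1, 8.1, 8.2, 8.2, 8.3,
        8.3, 8.4, 8.4, 8.5, 8.5, 8.6, 8.7, 8.8, 9.0, 9.9]

# First quartile, median, and third quartile.
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Whiskers extend to the most extreme values within 1.5 * IQR of the box;
# anything beyond these limits is flagged as an outlier.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_limit or x > upper_limit]

print(f"Q1 = {q1:.2f}, median = {median:.2f}, Q3 = {q3:.2f}, IQR = {iqr:.2f}")
print(f"whisker limits: {lower_limit:.2f} to {upper_limit:.2f}")
print(f"outliers: {outliers}")
```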

      Stat Tool 1.12 Basic Concepts of Statistical Inference

      After describing the important characteristics of sample data through descriptive statistics, the second step of a statistical analysis is usually inferential analysis, in which sample findings are generalized to the population from which the sample was drawn.

      Inferential techniques use descriptive statistics such as:

sample mean "x̄", sample proportion "p", sample standard deviation "S"

      to draw conclusions about the corresponding unknown quantities of the population, called parameters:

population mean "μ", population proportion "π", population standard deviation "σ"

      Note that it is standard to use Greek letters for certain parameters, such as μ to stand for a population mean, σ for a population standard deviation, σ2 for a population variance, and π for a proportion of statistical units having a characteristic of interest.
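      A small simulation sketch makes the distinction concrete. It builds a hypothetical population with known parameters (values chosen only for illustration), then draws a sample and computes the corresponding statistics; in a real study the parameters would be unknown and the statistics would serve as their estimates:

```python
import random
import statistics

random.seed(1)

# Hypothetical population with known parameters (in practice these are unknown).
mu, sigma = 8.05, 0.55  # population mean and standard deviation
population = [random.gauss(mu, sigma) for _ in range(100_000)]
pi = sum(x > 8.5 for x in population) / len(population)  # proportion above 8.5

# Draw a random sample and compute the corresponding statistics.
sample = random.sample(population, 50)
x_bar = statistics.mean(sample)                 # sample mean estimates mu
s = statistics.stdev(sample)                    # sample standard deviation estimates sigma
p = sum(x > 8.5 for x in sample) / len(sample)  # sample proportion estimates pi

print(f"parameters: mu = {mu:.2f}, sigma = {sigma:.2f}, pi = {pi:.2f}")
print(f"statistics: x_bar = {x_bar:.2f}, S = {s:.2f}, p = {p:.2f}")
```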

      A statistic (mean, proportion, variance) describes a characteristic of the sample (central tendency, variability, shape of data) and is known.

      A parameter (mean, proportion, variance) describes a characteristic of the population and is unknown.