sophisticated procedures such as regression-based imputation do exist. These methods play important roles mainly in medical and scientific studies, where data collection from patients or subjects is often costly. In most industrial data analytics applications where data are typically abundant, simpler methods of handling missing values are usually sufficient.
2.1 Data Visualization
Data visualization represents data using graphical methods. It is one of the most effective and intuitive ways to explore important patterns in the data, such as the distribution of the data, relationships among variables, unexpected clusters, and outliers. Data visualization is a fast-growing area, and a large number and variety of tools have been developed. This section discusses some of the most basic and useful types of graphical methods or data plots for industrial data analytics applications.
2.1.1 Distribution Plots for a Single Variable
Bar charts can be used to display the distribution of a categorical variable, while histograms and box plots are useful tools to display the distribution of a numerical variable.
Distribution of A Categorical Variable – Bar Chart
In a bar chart, the horizontal axis corresponds to all possible values/categories of a categorical variable. The vertical axis shows the number of observations in each category. To draw a bar chart for a categorical variable in R, we first use the table() function to count the number of observations in each category. The barplot() function can then be used to plot the calculated counts. For example, the following R code plots the distribution of the body.style variable in the auto_spec data set.
bodystyle.freq <- table(auto.spec.df$body.style)
barplot(bodystyle.freq, xlab = "Body Style",
ylim = c(0, 100))
The plotted bar chart is shown in Figure 2.1. From the bar chart, it is clear that most of the cars in the data are either sedans or hatchbacks.
Figure 2.1 Bar chart of car body style.
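The two-step procedure described above (count with table(), then plot with barplot()) can be tried without the auto_spec data. The following is a minimal sketch that assumes a small, made-up vector of body styles; the vector and the variable names body.style, style.freq, and style.sorted are hypothetical and are not part of the original data set. It also sorts the counts so that the most frequent categories appear first, which is an optional design choice rather than part of the original example.

```r
# Hypothetical body-style values standing in for auto.spec.df$body.style
body.style <- c("sedan", "hatchback", "sedan", "wagon", "sedan",
                "hatchback", "convertible", "sedan", "hatchback", "wagon")

# Step 1: table() counts the number of observations in each category
style.freq <- table(body.style)
print(style.freq)

# Optional: sort from most to least frequent so dominant categories come first
style.sorted <- sort(style.freq, decreasing = TRUE)

# Step 2: barplot() draws one bar per category and returns the bar midpoints
mids <- barplot(style.sorted, xlab = "Body Style", ylab = "Number of Cars",
                ylim = c(0, max(style.sorted) + 1))

# Write each count just above its bar
text(mids, as.vector(style.sorted), labels = as.vector(style.sorted), pos = 3)
```

Sorting is mainly useful when there are many categories; for ordered categories (for example, number of doors) the natural order is usually preferable.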
Distribution of Numerical Variables – Histogram and Box Plot
A histogram can be used to approximately represent the distribution of a numerical variable with continuous values. A histogram can be considered as a bar chart extended to continuous numerical variables. To draw a histogram, the entire range of the variable in the data set is divided into a number of consecutive, equal-sized intervals. Then a “bar” is shown for each interval to represent the number of observations in that interval.
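The binning step behind a histogram can be made explicit. The sketch below assumes a short, made-up numeric vector x and hand-picked break points (both hypothetical, not taken from the auto_spec data set). It counts the observations per interval with cut() and table(), and then checks that hist() reports the same counts when given the same breaks.

```r
# Hypothetical measurements (not from the auto_spec data set)
x <- c(2.1, 3.4, 3.9, 4.2, 4.8, 5.0, 5.5, 6.1, 6.3, 7.7, 8.2, 9.6)

# Divide the range into consecutive, equal-sized intervals of width 2
breaks <- seq(2, 10, by = 2)

# Count observations per interval by hand: cut() assigns each value to an
# interval, and table() counts the values in each interval
bins <- cut(x, breaks = breaks, include.lowest = TRUE)
bin.counts <- table(bins)
print(bin.counts)

# hist() performs the same counting; plot = FALSE returns the counts only
h <- hist(x, breaks = breaks, plot = FALSE)
print(h$counts)

# Draw the histogram with the same intervals
hist(x, breaks = breaks, xlab = "Value", main = "Histogram of x")
```

When breaks is not supplied, hist() chooses the intervals automatically, so the appearance of a histogram can change noticeably with the number of intervals.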
Another commonly used plot for representing the distribution of a numerical variable is the box plot. We illustrate the basic elements of a box plot in Figure 2.2, which shows the box plot of the numerical variable width of the auto_spec data set. The bold line within the rectangular box represents the median value of the variable in the data set. The lower and upper bounds of the box correspond to the first quartile (25th percentile) and the third quartile (75th percentile), respectively. The height of the box is the interquartile range (IQR), which is the distance between the first and the third quartiles. The short horizontal lines above and below the box are called the whiskers, which represent the maximum and minimum of the values in the data set, excluding the “outliers”. In box plots, an outlier is typically defined as a data point that lies either more than 1.5 times the IQR above the third quartile or more than 1.5 times the IQR below the first quartile. The individual outliers are shown by the open circles in the box plot in Figure 2.2.
Figure 2.2 Elements of a box plot.
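The elements of a box plot can also be computed numerically. The following sketch assumes a small, made-up vector x (hypothetical values, not the width variable of auto_spec) and applies the 1.5 × IQR rule described above. It uses fivenum(), whose "hinges" are the quartile estimates that boxplot() draws; these can differ slightly from the values returned by quantile(). The result is compared with boxplot.stats(), the function that boxplot() relies on.

```r
# Hypothetical values with one unusually large observation
x <- c(60, 62, 63, 64, 64, 65, 65, 66, 66, 67, 68, 75)

# fivenum() returns minimum, lower hinge, median, upper hinge, maximum;
# the hinges are the quartile estimates used for the box
five <- fivenum(x)
q1 <- five[2]
med <- five[3]
q3 <- five[4]
iqr <- q3 - q1            # height of the box

# Points beyond 1.5 * IQR from the box are flagged as outliers
lower.fence <- q1 - 1.5 * iqr
upper.fence <- q3 + 1.5 * iqr
outliers <- x[x < lower.fence | x > upper.fence]

# Whiskers extend to the most extreme values that are not outliers
lower.whisker <- min(x[x >= lower.fence])
upper.whisker <- max(x[x <= upper.fence])

print(c(q1 = q1, median = med, q3 = q3, IQR = iqr))
print(outliers)

# Compare with the statistics that boxplot() itself uses
bs <- boxplot.stats(x)
print(bs$stats)   # lower whisker, lower hinge, median, upper hinge, upper whisker
print(bs$out)     # outliers

boxplot(x, ylab = "Value", main = "Boxplot of x")
```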
The R functions hist() and boxplot() can be used to plot the histogram and box plot, respectively. The following R code plots the histograms and box plots, shown in Figure 2.3, for three numerical variables in the auto_spec data set: length, horsepower, and compression.ratio.
Figure 2.3 Histograms and box plots of three numerical variables.
oldpar <- par(mfrow = c(2, 3)) # split the plot into panels
hist(auto.spec.df$length, xlab = "Length",
     main = "Histogram of Length")
hist(auto.spec.df$horsepower, xlab = "Horsepower",
     main = "Histogram of Horsepower")
hist(auto.spec.df$compression.ratio, xlab = "Compression Ratio",
     main = "Histogram of Compression Ratio")
boxplot(auto.spec.df$length, ylab = "Length",
        main = "Boxplot of Length")
boxplot(auto.spec.df$horsepower, ylab = "Horsepower",
        main = "Boxplot of Horsepower")
boxplot(auto.spec.df$compression.ratio, ylab = "Compression Ratio",
        main = "Boxplot of Compression Ratio")
par(oldpar) # restore the original plot settings
From the histogram and box plot of the variable length, it can be seen that the distribution of the car lengths in the data set has a fairly symmetric shape. In contrast, the distribution of horsepower is more skewed, with a long right tail. The histogram of the compression ratios shows the existence of two groups or clusters of data, which is also indicated by the separate cluster of outliers with high compression ratios that can be seen in the box plot.
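A simple numerical check can complement the visual impression of symmetry or skewness. As a rule of thumb (not a formal test), the mean and median are close for a roughly symmetric distribution, while a long right tail tends to pull the mean above the median. The sketch below illustrates this with two small, made-up vectors; the names symmetric and right.skewed and their values are hypothetical and are not taken from the auto_spec data set.

```r
# Hypothetical data: one symmetric sample and one with a long right tail
symmetric    <- c(1, 2, 3, 4, 5, 6, 7)
right.skewed <- c(1, 2, 2, 3, 3, 4, 20)

# For the symmetric sample the mean and median coincide
print(c(mean = mean(symmetric), median = median(symmetric)))

# For the right-skewed sample the large value pulls the mean above the median
print(c(mean = mean(right.skewed), median = median(right.skewed)))

# The box plots show the same contrast: the large value appears as an outlier
oldpar <- par(mfrow = c(1, 2))
boxplot(symmetric, main = "Symmetric")
boxplot(right.skewed, main = "Right-skewed")
par(oldpar)
```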
2.1.2 Plots for Relationship Between Two Variables
The relationship between variables is one of the most useful patterns in industrial data analytics applications. For example, we are often interested in predicting a particular variable of interest, which is referred to as the response variable, based on available input information represented by a number of variables that are referred to as the predictor variables. In this situation, the relationship between the response variable and the predictor variables can help identify the most important predictors. Plotting of two variables can also be used to detect redundant variables and outliers in a data set. Depending on the types of variables being compared, different plots can be used to study the relationship between the variables.
Relationship Between Two Numerical Variables – Scatter Plot
In a scatter plot, each observation is represented by a point whose coordinates are the values of the two variables for that observation. The following R code draws the scatter plot for two numerical variables, horsepower and highway.mpg, of the auto_spec data set.
plot(auto.spec.df$highway.mpg ~ auto.spec.df$horsepower,
xlab = "Horsepower", ylab = "Highway MPG")
The obtained scatter plot is shown in Figure 2.4. It can be seen from the scatter plot that a general trend exists in the relationship between the highway MPG and the horsepower, where a car with higher horsepower is more likely to have a lower highway MPG.
Figure 2.4 Scatter plot of highway MPG versus horsepower.
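The trend seen in a scatter plot can be summarized with a correlation coefficient and a fitted straight line. The sketch below is an illustration only: it uses the mtcars data set that ships with R (columns hp and mpg) as a stand-in, since the auto_spec data are not reproduced here, and the choice of a linear fit via lm() is an assumption made for the example rather than a method prescribed by the text.

```r
# mtcars is a data set shipped with R; it is only a stand-in for auto_spec
hp  <- mtcars$hp    # gross horsepower
mpg <- mtcars$mpg   # miles per gallon

# Pearson correlation: a negative value indicates that higher horsepower
# tends to go together with lower fuel economy
r <- cor(hp, mpg)
print(r)

# Scatter plot of MPG versus horsepower
plot(mpg ~ hp, xlab = "Horsepower", ylab = "MPG")

# Overlay a least-squares line to make the general trend easier to see
fit <- lm(mpg ~ hp)
abline(fit, lty = 2)
```

A correlation coefficient captures only the linear part of a relationship, so it is best read together with the scatter plot rather than in place of it.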
Relationship Between A Numerical Variable and A Categorical Variable – Side-by-Side Box Plot
Side-by-side box plots can be used to show how the distribution of a numerical variable changes over different