Yong Chen

Industrial Data Analytics for Diagnosis and Prognosis



Chapter 11 introduces the concept of Gaussian processes as a nonparametric approach to the modeling and analysis of multiple longitudinal signals. The application of the multi-output Gaussian process to failure prognosis is presented as well. Chapter 12 introduces a method for failure prognosis that combines degradation signals and time-to-event data. An advanced joint prognosis model, which integrates the survival regression model and the mixed effects regression model, is presented.

      1.3 How to Use This Book

      This book is intended for students, engineers, and researchers who are interested in using modern statistical methods for variation modeling, diagnosis, and prediction in industrial systems.

      This book can be used as a textbook for graduate-level or advanced undergraduate-level courses on industrial data analytics. The book is fairly self-contained, although background in basic probability and statistics, such as the concepts of random variables, probability distributions, and moments, and basic knowledge of linear algebra, such as matrix operations and matrix decompositions, would be useful. The appendix at the end of the book provides a summary of the necessary concepts and results in linear spaces and matrix theory. The chapters in Part II of the book are relatively independent, so the instructor can combine selected chapters in Part II with Part I as the basic materials for different courses. For example, topics in Part I can be used for an advanced undergraduate-level course introducing industrial data analytics. The materials in Part I and selected chapters in Part II (e.g., Chapters 7, 8, and 9) can be used in a master’s-level statistical quality control course. Similarly, materials in Part I and selected later chapters in Part II (e.g., Chapters 10, 11, and 12) can be used in a master’s-level course with an emphasis on prognosis and reliability applications. Finally, Part II alone can be used as the textbook for an advanced graduate-level course on diagnosis and prognosis.

      One important feature of this book is that we provide detailed descriptions of the software implementation for most of the methods and algorithms. We adopt the statistical programming language R in this book. The R language is versatile and has a very large number of up-to-date packages implementing various statistical methods [R Core Team, 2020]. This feature makes the book well suited to the needs of practitioners in engineering fields who want to self-study and implement the statistical modeling and analysis methods. All the R code and data sets used in this book can be found at the book companion website.

      Bibliographic Notes

Part I Statistical Methods and Foundation for Industrial Data Analytics

      Before making a chess move, an experienced chess player first explores the positions of the pieces on the chess board for noticeable patterns, such as the opponent’s threats, special relationships between chess pieces, and the strengths and weaknesses of both sides, before digging into in-depth calculation of move sequences to find the optimal move. Similarly, a data scientist should also start with an exploration of the data set for noticeable patterns before conducting any in-depth analysis by building a sophisticated mathematical model or running a computationally intensive algorithm. Simple data exploration methods can help to understand the basic data structure, such as the dimension and types of variables; discover initial patterns, such as relationships among variables; and identify missing values, outliers, and skewed distributions that call for data pre-processing and transformation. This chapter focuses on basic graphical and numerical methods for data description and exploration. We first look at a data set in the following example.

      Example 2.1 (auto_spec data) The data set in auto_spec.csv, which is from the UCI Machine Learning Repository [Dua and Graff, 2017], contains the specifications of a sample of cars. The following R code can be used to read the data file and obtain information on the basic characteristics and structure of the data set.

      # load data
      auto.spec.df <- read.csv("auto_spec.csv", header = TRUE)

      # show basic information of the data set
      dim(auto.spec.df)
      names(auto.spec.df)
      head(auto.spec.df)

      From the R outputs, we see that this data set contains 205 observations on 23 variables, including the manufacturer, fuel type, body style, dimensions, horsepower, miles per gallon, and other specifications of a car. In the statistics and data mining literature, an observation is also called a record, a data point, a case, a sample, an entity, an instance, or a subject. The variables associated with an observation are also called attributes, fields, characteristics, or features. The summary() function shows basic summary information for each variable, such as the mean, median, and range of values. From the summary information, it is obvious that there are two types of variables. A variable such as fuel.type or body.style has a finite number of possible values, and there is no numerical relationship among the values. Such a variable is referred to as a categorical variable. On the other hand, a variable such as highway.mpg or horsepower has continuous numerical values, and is referred to as a numerical variable. Beyond the basic data summary, graphical methods can be used to show more patterns of both types of variables, as discussed in the following subsection.
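      The summary information referred to above can be obtained with summary(). A minimal sketch follows; note that in recent versions of R, categorical columns may need to be read as factors (e.g., with stringsAsFactors = TRUE, an assumed setting not shown in the code above) for their value counts to appear in the output:

      # re-read the data with categorical columns stored as factors so that
      # summary() reports counts of each level for them
      auto.spec.df <- read.csv("auto_spec.csv", header = TRUE,
                               stringsAsFactors = TRUE)

      # numerical variables: min, quartiles, median, mean, max, and NA counts
      # categorical variables: counts of each level
      summary(auto.spec.df)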

      Note that some variables in this data set contain missing values. The number of complete observations, i.e., observations without any missing values, can be checked by applying na.omit(), which removes all observations with missing values:

      > dim(na.omit(auto.spec.df))

      [1] 197 23
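      To see which variables contain the missing values, a simple check (not part of the book’s code) is to count the NA entries in each column:

      # number of missing values in each variable
      colSums(is.na(auto.spec.df))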

      If a significant number of observations in a data set have missing values, an alternative to simply removing the observations with missing values is imputation, which is the process of replacing missing values with substituted values. A simple method of imputation is to replace the missing values of a variable with the mean or median of the observed values of that variable.
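      As an illustrative sketch, not code from the book, mean imputation of a numerical variable such as horsepower could be carried out as follows:

      # replace missing horsepower values with the mean of the observed values
      # (horsepower is used here only as an assumed example variable)
      hp.mean <- mean(auto.spec.df$horsepower, na.rm = TRUE)
      auto.spec.df$horsepower[is.na(auto.spec.df$horsepower)] <- hp.mean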