Tormod Næs

Multiblock Data Fusion in Statistics and Machine Learning



       Table 5.4 Properties of methods for common and distinct components. The matrix D indicates a diagonal matrix with all positive elements on its diagonal.

       Table 6.1 Overview of methods. Legend: U=unsupervised, S=supervised, C=complex, HOM=homogeneous data, HET=heterogeneous data, SEQ=sequential, SIM=simultaneous, MOD=model-based, ALG=algorithm-based, C=common, CD=common/distinct, CLD=common/local/distinct, LS=least squares, ML=maximum likelihood, ED=eigendecomposition, MC=maximising correlations/covariances. For abbreviations of the methods, see Section 1.11.

       Table 8.1 Overview of methods. Legend: U=unsupervised, S=supervised, C=complex, HOM=homogeneous data, HET=heterogeneous data, SEQ=sequential, SIM=simultaneous, MOD=model-based, ALG=algorithm-based, C=common, CD=common/distinct, CLD=common/local/distinct, LS=least squares, ML=maximum likelihood, ED=eigendecomposition, MC=maximising correlations/covariances. The green colour indicates that this method is discussed extensively in this chapter. The abbreviations for the methods represent the different sections and follow the same order. For abbreviations of the methods, see Section 1.11.

       Table 8.2 Tabulation of consumer characteristics. A selection of two consumer attributes/characteristics, gender and lunch habits, is given. The numbers represent percentages in each of the categories for each of the segments (subgroups). The sums in each column for each consumer characteristic variable are equal to 100. The lunch variable reflects the frequency of use, with 1 representing the highest frequency and 5 ‘no answer’. Source: (Helgesen et al., 1997). Reproduced with permission from Elsevier.

       Table 8.3 Consumer liking of cheese. Design of the conjoint experiment based on six design factors. Source: (Almli et al., 2011). Reproduced with permission from Elsevier.

       Table 9.1 Overview of methods. Legend: U=unsupervised, S=supervised, C=complex, HOM=homogeneous data, HET=heterogeneous data, SEQ=sequential, SIM=simultaneous, MOD=model-based, ALG=algorithm-based, C=common, CD=common/distinct, CLD=common/local/distinct, LS=least squares, ML=maximum likelihood, ED=eigendecomposition, MC=maximising correlations/covariances. For abbreviations of the methods, see Section 1.11.

       Table 10.1 Overview of methods. Legend: U=unsupervised, S=supervised, C=complex, HOM=homogeneous data, HET=heterogeneous data, SEQ=sequential, SIM=simultaneous, MOD=model-based, ALG=algorithm-based, C=common, CD=common/distinct, CLD=common/local/distinct, LS=least squares, ML=maximum likelihood, ED=eigendecomposition, MC=maximising correlations/covariances. The abbreviations for the methods follow the same order as the sections. For abbreviations (or descriptions) of the methods, see Section 1.11.

       Table 10.2 Results of the single-block regression models. PCovR is Principal Covariates Regression, U-PLS is unfold-PLS, MCovR is multiway covariates regression. The 3,2,3 components for MCovR refer to the components for the three modes of Tucker3. For more explanation, see text.

       Table 10.3 Results of the multiway multiblock models. MB-PLS is multiblock PLS, MWMBCovR is multiway multiblock covariates regression. For more explanation, see text.

       Table 11.1 R packages on CRAN having one or more multiblock methods.

       Table 11.2 MATLAB toolboxes and functions having one or more multiblock methods.

       Table 11.3 Python packages having one or more multiblock methods.

Part I Introductory Concepts and Theory

      1.1 Scope of the Book

      In many areas of the natural and life sciences, data sets are collected that consist of multiple blocks of data measured on the same or similar systems. Examples are abundant. In genomics, it is becoming increasingly common to measure gene-expression, protein abundances and metabolite levels on the same biological system (Clish et al., 2004; Heijne et al., 2005; Kleemann et al., 2007; Curtis et al., 2012; Brink-Jensen et al., 2013; Franzosa et al., 2015). In sensory science, the interest is often in relations between the chemical and sensory properties of the samples involved as well as consumer liking of the same samples (Næs et al., 2010). In chemistry, different types of instruments are sometimes utilised to characterise different properties of the same set of samples (de Juan and Tauler, 2006). In cohort studies, it is increasingly popular to perform the same type of measurements in different cohorts to confirm results and perform meta-analyses. In the (bio-)chemical process industry, plant-wide measurements collected by several sensors in the plant are available (Lopes et al., 2002). Clinical trials are often supported by auxiliary measurements such as gene-expression and cytokines to characterise immune responses (Coccia et al., 2018). Challenge tests to establish the health status of individuals usually contain multiple types of data collected for the same individuals as a function of time (Wopereis et al., 2009; Pellis et al., 2012; Kardinaal et al., 2015). All these examples show that simple data sets are becoming less and less common.

      In Elaboration 1.1 we define the terms concerning data sets that we will use throughout this book. Sometimes we will deviate from this terminology to some extent in order to make connections between fields. In those places we will clarify exactly what we mean.

       ELABORATION 1.1

      Glossary of terms

      Elaboration 1.1 suggests a consistent vocabulary to be used in the book. However, the difference between variables and objects is not