Tormod Næs

Multiblock Data Fusion in Statistics and Machine Learning



blocks can be placed in different relationships. Deciding whether blocks are arranged as dependent or independent may itself be a purpose of the analysis. We call such an arrangement a topology. In that case, it would be useful to have a strategy for deciding on the topology that fits the data best.

Common versus distinct variation: There can be common and distinct variation in the multiple data blocks (see Section 1.8). Separating the variation into these types greatly simplifies subsequent interpretation of the results.

Treatment effects: The effect of a treatment can be measured in different blocks of data. The interest is usually in the main effect of a treatment on the measurements in the different blocks.

Individual differences: Apart from group differences, individual differences are also of interest, for example in personalised medicine, nutritional interventions, or consumer behaviour. Multiblock data analysis may help to find such differences and thereby facilitate population stratification and sub-typing.

Mixed goals: In real-life applications, a mixture of goals is usually present. A treatment may express itself differently in the common and the distinct variation. Moreover, interest may be in the main effects of treatments but also in individual differences in treatment effects.

      1.6 Some History

      Figure 1.8 Phylogeny of some multiblock methods and relations to basic data analysis methods used in this book.

      1.7 Fundamental Choices

      In any multiblock data analysis, choices have to be made, such as which method to use and what kind of pre-processing to apply. Two fundamental questions that should always be considered (and dealt with) are highlighted below.

      Variation explained: Do we only want to explain variation between blocks or also within blocks?

      Fairness: Should all blocks play a role in the final solution or can we allow some of the blocks to be dominant in this respect?
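      The fairness question can be made concrete with a small simulation. The sketch below is an illustration under stated assumptions, not a method taken from the text: two hypothetical blocks share one latent score `t`, block 1 is recorded on a scale ten times larger, and the first principal component of the concatenated blocks is inspected. Scaling each centred block to unit Frobenius norm is used here as one common pre-processing choice; all names and numbers are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100                                   # samples shared by both blocks
t = rng.standard_normal(n)                # hypothetical shared latent score


def unit(v):
    """Scale a vector to unit Euclidean length."""
    return v / np.linalg.norm(v)


p1 = unit(rng.standard_normal(5))         # loadings of block 1 (assumed)
p2 = unit(rng.standard_normal(5))         # loadings of block 2 (assumed)

# Same structure in both blocks, but block 1 is on a 10x larger scale.
X1 = 10.0 * (np.outer(t, p1) + 0.1 * rng.standard_normal((n, 5)))
X2 = np.outer(t, p2) + 0.1 * rng.standard_normal((n, 5))


def centre(X):
    """Column-centre a data block."""
    return X - X.mean(axis=0)


def block_shares(blocks):
    """Share of the squared first-PC loadings that falls in each block."""
    X = np.hstack(blocks)
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    sq = Vt[0] ** 2                       # squared loadings sum to 1
    edges = np.cumsum([0] + [B.shape[1] for B in blocks])
    return [float(sq[a:b].sum()) for a, b in zip(edges[:-1], edges[1:])]


blocks = [centre(X1), centre(X2)]
share_raw = block_shares(blocks)          # block 1 dominates the component

# Block scaling: divide each centred block by its Frobenius norm.
scaled = [B / np.linalg.norm(B) for B in blocks]
share_scaled = block_shares(scaled)       # blocks contribute about equally

print("raw shares:   ", share_raw)
print("scaled shares:", share_scaled)
```

      Without block scaling, the first component is determined almost entirely by the block with the larger scale; after scaling, both blocks contribute to a similar degree. Whether such equal treatment is desirable depends on the goal of the analysis.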

      1.8 Common and Distinct Components

      Figure 1.9 The idea of common and distinct components. Legend: blue is common variation; dark yellow and dark red are distinct variation and shaded areas are noise (unsystematic variation).

      Suppose there are two data blocks X1 and X2 sharing the same samples, i.e., different variables are measured on the same set of samples (see Chapter 3). Then these two blocks can have variation in common (the blue part). This common variation spans a subspace and the common components are then a basis for this subspace.

      Each block also has a part that still contains systematic variation (the dark yellow and dark red parts). These parts have nothing in common and are, therefore, called distinct parts. They also represent subspaces, and the distinct components (two sets; one set for each block) are the bases for these subspaces. What is left in the matrices is unsystematic variation or noise (shaded parts).

      The division of each data block into common, distinct, and unsystematic variation should not be read in terms of the individual variables being common or distinct, but in terms of subspaces. Hence, a part of the variation of a variable in block 1 may be in common with variation of some variables in block 2, whereas the remaining part of its variation may be distinct; see Elaboration 1.8.

      ELABORATION 1.8

      Common and distinct in spectroscopy

      Suppose that the same set of samples is measured in the UV-Vis regime (block X1) and with near-infrared (NIR, block X2). Also assume that this set of samples contains three chemical components (A,B,C): A absorbs both in UV-Vis and NIR; B only absorbs in the UV-Vis regime and C absorbs only in NIR. Then the common part is the absorption of A in both data blocks; the distinct parts are B in block 1 and C in block 2. However, at a particular wavelength in the NIR region there may be a contribution from both A and C. Hence, this wavelength, i.e., variable, has a common and a distinct part. The same can happen in block 1.
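      The spectroscopic example can be mimicked numerically. The sketch below is a minimal illustration under assumptions that are not part of the text: concentrations are drawn at random, the absorption bands are Gaussian shapes on arbitrary wavelength axes, and the common subspace is identified through principal angles between the two rank-2 column spaces. This is only one way to look for common variation and does not stand for any particular method in the book.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200                                    # samples measured in both blocks

# Hypothetical concentrations of the chemical components A, B and C.
conc = rng.uniform(0.0, 1.0, size=(n, 3))
c_A, c_B, c_C = conc[:, 0], conc[:, 1], conc[:, 2]


def gaussian_band(axis, centre, width):
    """Made-up absorption band with a Gaussian shape."""
    return np.exp(-0.5 * ((axis - centre) / width) ** 2)


wl1 = np.linspace(0.0, 1.0, 40)            # UV-Vis axis (arbitrary units)
wl2 = np.linspace(0.0, 1.0, 60)            # NIR axis (arbitrary units)

# Block 1 (UV-Vis): A and B absorb. Block 2 (NIR): A and C absorb,
# with overlapping bands so that single wavelengths mix A and C.
X1 = (np.outer(c_A, gaussian_band(wl1, 0.3, 0.10))
      + np.outer(c_B, gaussian_band(wl1, 0.7, 0.10)))
X2 = (np.outer(c_A, gaussian_band(wl2, 0.4, 0.15))
      + np.outer(c_C, gaussian_band(wl2, 0.6, 0.15)))
X1 += 1e-3 * rng.standard_normal(X1.shape)  # small unsystematic part
X2 += 1e-3 * rng.standard_normal(X2.shape)

# Column-centre each block.
X1c = X1 - X1.mean(axis=0)
X2c = X2 - X2.mean(axis=0)


def score_basis(X, rank):
    """Orthonormal basis of the dominant column (sample) space of X."""
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, :rank]


Q1 = score_basis(X1c, 2)                   # systematic subspace of block 1
Q2 = score_basis(X2c, 2)                   # systematic subspace of block 2

# Singular values of Q1'Q2 are the cosines of the principal angles.
W, cosines, _ = np.linalg.svd(Q1.T @ Q2)

# Direction in block 1 that is closest to the subspace of block 2.
common = Q1 @ W[:, 0]
corr_with_A = np.corrcoef(common, c_A)[0, 1]

print("cosines of principal angles:", cosines)
print("|correlation| of common direction with c_A:", abs(corr_with_A))
```

      One cosine close to 1 indicates a single shared direction, which here tracks the concentration of A; the second cosine is far smaller because B and C vary independently. Note that the common part is found as a direction in sample space, not as a set of wavelengths, in line with the subspace view taken above.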

      1.9 Overview and Links

      1 A method for unsupervised (U), supervised (S) or complex (C) data structures.

      2 The method can deal with heterogeneous data (HET, i.e., different measurement scales) or can only deal with homogeneous data (HOM).

      3 A method that uses a sequential (SEQ) or simultaneous (SIM) approach.

      4 The method is defined in terms of a model (MOD) or in terms of an algorithm (ALG).

      5 A method for finding common (C); common and distinct (CD); or finding common, local and distinct components (CLD).

      6 Estimation of the model parameters is based on least squares (LS), maximum likelihood