Multiblock Data Fusion in Statistics and Machine Learning. Tormod Næs.

two in three-dimensional space. The blue and green surfaces represent the column-spaces and the red line indicated with X_12C represents the common component. Source: Smilde et al. (2017). Reproduced with permission of John Wiley and Sons.

Figure 2.13 Common and distinct components. The common component is the same in both panels. For the distinct components there are now two choices regarding orthogonality: (a) both distinct components orthogonal to the common component, (b) distinct components mutually orthogonal. Source: Smilde et al. (2017). Reproduced with permission of John Wiley and Sons.

Figure 2.14 Common components in case of noise: (a) maximally correlated common components within column-spaces; (b) consensus component in neither of the column-spaces. Source: Smilde et al. (2017). Reproduced with permission of John Wiley and Sons.

Figure 2.15 Visualisation of a response vector, y, projected onto a two-dimensional data space spanned by x₁ and x₂.

Figure 2.16 Fitted values versus residuals from a linear regression model.

Figure 2.17 Simple linear regression: ŷ = ax + b (see legend for description of elements). In addition, leverage is indicated below the regression plot, where leverage is at a minimum at x̄ and increases for lower and higher x-values.

Figure 2.18 Two-variable multiple linear regression with indicated residuals and leverage (contours below regression plane).

Figure 2.19 Two-component PCA score plot of concatenated Raman data. Leverage for two components is indicated by the marker size.

Figure 2.20 Illustration of true versus predicted values from a regression model. The ideal line is indicated in dashed green.

Figure 2.21 Visualisation of bias variance trade-off as a function of model complexity. The observed MSE (in blue) is the sum of the bias² (red dashed), the variance (yellow dashed) and the irreducible error (purple dotted).

Figure 2.22 Learning curves showing how median R² and Q² from linear regression develop with the number of training samples for a simulated data set.

Figure 2.23 Visualisation of the process of splitting a data set into a set of segments (here chosen to be consecutive) and the sequential hold-out of one segment (V_k) for validation of models. All data blocks X_m and the response Y are split along the sample direction and corresponding segments removed simultaneously.

Figure 2.24 Cumulative explained variance for PCA of the concatenated Raman data using naive cross-validation (only leaving out samples). R² is calibrated and Q² is cross-validated.

Figure 2.25 Null distribution and observed test statistic used for significance estimation with permutation testing.

Figure 3.1 Skeleton of a three-block data set with a shared sample mode.

Figure 3.2 Skeleton of a four-block data set with a shared sample mode.

Figure 3.3 Skeleton of a three-block data set with a shared variable mode.

Figure 3.4 Skeleton of a three-block L-shaped data set with a shared variable or a shared sample mode.

Figure 3.5 Skeleton of a four-block U-shaped data set with a shared variable or a shared sample mode (a) and a four-block skeleton with a shared variable and a shared sample mode (b). This is a simplified version; it should be understood that all sample modes are shared as well as all variable modes.

Figure 3.6 Topology of a three-block data set with a shared sample mode and unsupervised analysis: (a) full topology and (b) simplified representation.

Figure 3.7 Topology of a three-block data set with a shared variable mode and unsupervised analysis.

Figure 3.8 Different arrangements of data sharing two modes. Topology (a) and multiway array (b).

Figure 3.9 Unsupervised combination of a three-way and two-way array.

Figure 3.10 Supervised three-set problem sharing the sample mode.

Figure 3.11 Supervised L-shape problem. Block X₁ is a predictor for block X₂ and extra information regarding the variables in block X₁ is available in block X₃.

Figure 3.12 Path model structure. Blocks are connected through shared samples and a causal structure is assumed.

Figure 3.13 Idea of linking two data blocks with a shared sample mode. For explanation, see text.

Figure 3.14 Different linking structures: (a) identity link, (b) flexible link, (c) partial identity link: common (T_12C) and distinct (T_1D, T_2D) components.

Figure 3.15 Idea of linking two data blocks with shared variable mode.

Figure 3.16 Different linking structures for supervised analysis: (a) linking structure where components are used both for the X-blocks and the Y-block; (b) linking structure that only uses components for the X-blocks.

Figure 3.17 Treating common and distinct linking structures for supervised analysis: (a) Linking structure with no differentiation between common and distinct in the X-blocks (C is common, D₁, D₂ are distinct for X₁ and X₂, respectively; e X₁ and e X₂ represent the unsystematic parts of X₁ and X₂); (b) first X₁ is used and then the remainder of X₂ after removing common (predictive) part T₁ of X₁.

Figure 4.1 Explanation of the scale (a) and orientation (b) component of the SVD. The axes are two variables and the spread of the samples is visualised including their contours as ellipsoids. Hence, this is a representation of the row-spaces of the matrices. For more explanation, see text. Source: Smilde et al. (2015). Reproduced with permission of John Wiley and Sons.

Figure 4.2 Topology of interactions between genomics data sets. Source: Aben et al. (2018). Reproduced with permission of Oxford University Press.

Figure 4.3 The RV and partial RV coefficients for the genomics example. For explanation, see the main text. Source: Aben et al. (2018). Reproduced with permission of Oxford University Press.

Figure 4.4 Decision tree for selecting a matrix correlation method. Abbreviations: HOM is homogeneous data, HET is heterogeneous data, Gen-RV is generalised RV, Full means full correlations, Partial means partial correlations. For more explanation, see text.

Figure 5.1 Unsupervised analysis as discussed in this chapter: (a) links between samples and (b) links between variables (simplified representations, see Chapter 3).

Figure 5.2 Illustration explaining the idea of exploring multiblock data. Source: Smilde et al. (2017). Reproduced with permission of John Wiley and Sons.

Figure 5.3 The idea of common (C), local (L) and distinct (D) parts of three data blocks. The symbols X^t denote row spaces; X^t_13L, e.g., is the part of X^t₁ and X^t₃ which is in common but does not share a part with X^t₂.

Figure 5.4 Proportion of explained variances (variances accounted for) for the TIV block (upper part), the LAIV block (middle part) and the concatenated blocks (lower part). Source: Van Deun et al. (2013). Reproduced with permission of Elsevier.

Figure 5.5 Row-spaces visualised. The true row-space (blue) contains the pure spectra (blue arrows). The row-space of X is the green plane which contains the estimated spectra (green arrows). The red arrows are off the row-space and closer to the true pure spectra.

Figure 5.6 Difference between weights and correlation
