Mark P. Kritzman

Prediction Revisited



      Relevance is the centerpiece of our approach to prediction. The key concepts that give rise to relevance were introduced over the past three centuries, as illustrated in this timeline. In Chapter 8, we offer more detail about the people who made these groundbreaking discoveries.

[Image: timeline of the key discoveries that give rise to relevance]

      This book introduces a new approach to prediction, which requires a new vocabulary—not new words, but new interpretations of words that are commonly understood to have other meanings. Therefore, to facilitate a quicker understanding of what awaits you, we define some essential concepts as they are used throughout this book. And rather than follow the convention of presenting them alphabetically, we present them in a sequence that matches the progression of ideas as they unfold in the following pages.

       Observation: One element among many that are described by a common set of attributes, distributed across time or space, and which collectively provide guidance about an outcome that has yet to be revealed. Classical statistics often refers to an observation as a multivariate data point.

       Attribute: A recorded value that is used individually or alongside other attributes to describe an observation. In classical statistics, attributes are called independent variables.

 Outcome: A measurement of interest that is usually observed alongside a set of attributes, and which one wishes to predict. In classical statistics, outcomes are called dependent variables.

       Arithmetic average: A weighted summation of the values of attributes or outcomes that efficiently aggregates the information contained in a sample of observations. Depending on the context and the weights that are used, the result may be interpreted as a typical value or as a prediction of an unknown outcome.

 Spread: The pairwise distance between observations of an attribute, measured in units of surprise. We compute this distance as the average of half the squared difference in values across every pair of observations. In classical statistics, the same quantity is usually computed as the average of squared deviations of observations from their mean and is referred to as variance. However, the equivalent evaluation of pairwise spreads reveals why we must divide by N – 1 rather than N to obtain an unbiased estimate of a sample's variance: the zero distance of an observation from itself (the diagonal in a matrix of pairs) conveys no information.
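The equivalence described above can be checked numerically. In this sketch, averaging half the squared pairwise differences over the N(N – 1) informative (off-diagonal) pairs reproduces the sample variance with the N – 1 correction, while averaging over all N² pairs, including each observation's zero distance from itself, reproduces the biased division by N:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=50)
n = len(x)

# Half the squared difference for every ordered pair of observations.
diffs = 0.5 * (x[:, None] - x[None, :]) ** 2

# Excluding the uninformative diagonal leaves n*(n-1) pairs, and the
# average equals the unbiased sample variance (divide by N - 1).
pairwise_spread = diffs.sum() / (n * (n - 1))
print(np.isclose(pairwise_spread, x.var(ddof=1)))  # True

# Averaging over all n**2 pairs, diagonal included, instead recovers
# the biased estimate (divide by N).
print(np.isclose(diffs.sum() / n**2, x.var(ddof=0)))  # True
```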

       Information theory: A unified mathematical theory of communication, created by Claude Shannon, which expresses messages as sequences of 0s and 1s and, based on the inverse relationship of information and probability, prescribes the optimal redundancy of symbols to manage the speed and accuracy of transmission.

       Circumstance: A set of attribute values that collectively describes an observation.

       Informativeness: A measure of the information conveyed by the circumstances of an observation, based on the inverse relationship of information and probability. For an observation of a single attribute, it is equal to the observed distance from the average, squared. For an observation of two or more uncorrelated attributes, it is equal to the sum of each individual attribute's informativeness. For an observation of two or more correlated attributes—the most general case—it is given by the Mahalanobis distance of the observation from the average of the observations. Informativeness is a component of relevance. It does not depend on the units of measurement.
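A minimal numerical sketch of the general case, assuming the squared form of the Mahalanobis distance and a sample covariance matrix computed with N – 1 (both our assumptions here). It also checks the unit-independence claimed above by rescaling one attribute:

```python
import numpy as np

rng = np.random.default_rng(1)
# Two correlated attributes, possibly in different units.
X = rng.normal(size=(100, 2)) @ np.array([[1.0, 0.4], [0.0, 0.9]])

mean = X.mean(axis=0)
inv_cov = np.linalg.inv(np.cov(X, rowvar=False))

def informativeness(x):
    """Mahalanobis distance of a circumstance from the sample average."""
    d = x - mean
    return d @ inv_cov @ d

info_x = informativeness(X[0])

# Unit invariance: rescaling an attribute (say, meters to centimeters)
# leaves informativeness unchanged, because the covariance matrix
# rescales along with the attribute.
Xs = X * np.array([100.0, 1.0])
ds = Xs[0] - Xs.mean(axis=0)
info_scaled = ds @ np.linalg.inv(np.cov(Xs, rowvar=False)) @ ds
print(np.isclose(info_x, info_scaled))  # True
```

For a single attribute, the same formula collapses to the squared distance from the average in standardized units, as the definition states.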

       Co-occurrence: The degree of alignment between two attributes for a single observation. It ranges between –1 and +1 and does not depend on the units of measurement.

       Correlation: The average co-occurrence of a pair of attributes across all observations, weighted by the informativeness of each observation. In classical statistics, it is known as the Pearson correlation coefficient.
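One way to make this decomposition concrete (the specific formulas for co-occurrence and per-observation informativeness below are our sketch, consistent with the definitions above): with standardized attributes, weighting each observation's co-occurrence by its informativeness recovers the Pearson correlation coefficient.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=200)
y = 0.6 * x + rng.normal(size=200)

# Standardize each attribute (population standard deviation).
zx = (x - x.mean()) / x.std()
zy = (y - y.mean()) / y.std()

# Informativeness of each observation for this pair of attributes:
# the average of its two squared standardized distances from the mean.
info = 0.5 * (zx**2 + zy**2)

# Co-occurrence: alignment of the two attributes for one observation.
# By the AM-GM inequality it is bounded between -1 and +1.
co = zx * zy / info

# The informativeness-weighted average of co-occurrence is the
# Pearson correlation coefficient.
r = np.sum(info * co) / np.sum(info)
print(np.isclose(r, np.corrcoef(x, y)[0, 1]))  # True
```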

 Covariance matrix: A symmetric square matrix of numbers that concisely summarizes the spreads of a set of attributes along with the signs and strengths of their correlations. Each element pertains to a pair of attributes and is equal to their correlation times their respective standard deviations (the square root of variance or spread).

       Mahalanobis distance: A standardized measure of distance or surprise for a single observation across many attributes, which incorporates all the information from the covariance matrix. The Mahalanobis distance of a set of attribute values (a circumstance) from the average of the attribute values measures the informativeness of that observation. Half of the negative of the Mahalanobis distance of one circumstance from another measures the similarity between them.

       Similarity: A measure of the closeness between one circumstance and another, based on their attributes. It is equal to the opposite (negative) of half the Mahalanobis distance between the two circumstances. Similarity is a component of relevance.

       Relevance: A measure of the importance of an observation to forming a prediction. Its components are the informativeness of past circumstances, the informativeness of current circumstances, and the similarity of past circumstances to current circumstances.
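As a sketch of how the three components combine (the equal weighting of the two informativeness terms is our reading of the definition): summing the similarity of a past circumstance to the current one with half the informativeness of each collapses, algebraically, to a Mahalanobis inner product of the two circumstances' deviations from the average.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(60, 3))
mean = X.mean(axis=0)
inv_cov = np.linalg.inv(np.cov(X, rowvar=False))

def informativeness(x):
    d = x - mean
    return d @ inv_cov @ d

def similarity(xi, xt):
    # Negative of half the Mahalanobis distance between two circumstances.
    d = xi - xt
    return -0.5 * (d @ inv_cov @ d)

xi = X[0]                  # a past circumstance
xt = rng.normal(size=3)    # the current circumstance

# Relevance: similarity plus half the informativeness of each circumstance.
relevance = (similarity(xi, xt)
             + 0.5 * informativeness(xi)
             + 0.5 * informativeness(xt))

# The same quantity as a single Mahalanobis inner product.
direct = (xi - mean) @ inv_cov @ (xt - mean)
print(np.isclose(relevance, direct))  # True
```

The identity follows from expanding the quadratic forms: the cross term of the similarity component cancels against the two informativeness components.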

 Partial sample regression: A two-step prediction process in which one first identifies a subset of observations that are relevant to the prediction task and, second, forms the prediction as a relevance-weighted average of the historical outcomes in the subset. When the subset from the first step equals the full sample, this procedure is equivalent to classical linear regression.
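The full-sample equivalence can be demonstrated directly. In this sketch (assuming relevance is the Mahalanobis inner product of deviations from the average, with sample covariance computed using N – 1), the relevance-weighted average of outcome deviations over the full sample matches the prediction of an ordinary least-squares regression with an intercept:

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 80, 2
X = rng.normal(size=(n, k))
y = X @ np.array([1.5, -0.7]) + rng.normal(size=n)

x_t = rng.normal(size=k)          # the current circumstance
x_bar, y_bar = X.mean(axis=0), y.mean()
inv_cov = np.linalg.inv(np.cov(X, rowvar=False))

# Relevance of every past observation to the current circumstance.
r = (X - x_bar) @ inv_cov @ (x_t - x_bar)

# Full-sample relevance-weighted prediction.
pred_relevance = y_bar + (r * (y - y_bar)).sum() / (n - 1)

# Classical linear regression prediction (OLS with an intercept).
beta, *_ = np.linalg.lstsq(np.column_stack([np.ones(n), X]), y, rcond=None)
pred_ols = beta[0] + x_t @ beta[1:]

print(np.isclose(pred_relevance, pred_ols))  # True
```

Restricting the sum to only the most relevant observations (and averaging over that subset) yields the partial sample prediction, which generally differs from the OLS value.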

       Asymmetry: A measure of the extent to which predictions differ when they are formed from a partial sample regression that includes the most relevant observations compared to one that includes the least relevant observations. It is computed as the average dissimilarity of the predictions from these two methods. Equivalently, it may be computed by comparing the respective fits of the most and least relevant subsets of observations to the cross-fit between them. The presence of asymmetry causes partial sample regression predictions to differ from those of classical linear regression. The minimum amount of asymmetry is zero, in which case the predictions from full-sample and partial-sample regression match.

       Fit: The average alignment between relevance and outcomes across all observation pairs for a single prediction. It is normalized by the spreads of relevance and outcomes, and while the alignment for one pair of observations may be positive or negative, their average always falls between zero and one. A large value indicates that observations that are similarly relevant have similar outcomes, in which case one should have more confidence in the prediction. A small value indicates that relevance does not line up with the outcomes, in which case one should view the prediction more cautiously.

       Bias: The artificial inflation of fit resulting from the inclusion of the alignment of each observation with itself. This bias is addressed by partitioning fit into two components—outlier influence, which is the fit of observations with themselves, and agreement, which is the fit of observations with their peers—and using agreement to give an unbiased measure of fit.

       Outlier influence: The fit of observations with themselves. It is always greater than zero, owing to the inherent bias of comparing observations with themselves, and it is larger to the extent that unusual circumstances coincide with unusual outcomes.

       Agreement: The fit of observations with their peers. It may be positive, negative, or zero, and is not systematically biased.

       Precision: The inverse of the extent to which the randomness of historical observations (often referred to as noise) introduces uncertainty to a prediction.

       Focus: The choice to form a prediction from a subset of relevant observations even though the smaller subset may be more sensitive to noise than the full sample of observations, because the consistency of the relevant subset improves confidence in the prediction more than noise undermines confidence.