David Sánchez

Database Anonymization



in X. Masking induces a relation between the records in Y and the original records in X. When applied to quasi-identifier attributes, the identity behind each record is masked (which yields anonymity). When applied to confidential attributes, the values of the confidential data are masked (which yields confidentiality, even if the subject to whom the record corresponds might still be re-identifiable). Masking methods can in turn be divided into two categories depending on their effect on the original data.

      – Perturbative masking. The microdata set is distorted before publication. The perturbation method used should be such that the statistics computed on the perturbed data set do not differ significantly from the statistics that would be obtained on the original data set. Noise addition, microaggregation, data/rank swapping, microdata rounding, resampling, and PRAM are examples of perturbative masking methods.

      – Non-perturbative masking. Non-perturbative methods do not alter data; rather, they produce partial suppressions or reductions of detail/coarsening in the original data set. Sampling, global recoding, top and bottom coding, and local suppression are examples of non-perturbative masking methods.
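      As a rough illustration of the two categories above, the following sketch applies noise addition (perturbative) and top coding (non-perturbative) to a toy income attribute. The function names, the noise scale, and the threshold are illustrative assumptions, not values taken from the text.

```python
import random
import statistics

def add_noise(values, rel_sd=0.1, seed=0):
    """Perturbative: add zero-mean Gaussian noise scaled to the attribute's spread."""
    rng = random.Random(seed)
    sd = rel_sd * statistics.pstdev(values)
    return [v + rng.gauss(0.0, sd) for v in values]

def top_code(values, threshold):
    """Non-perturbative: values above the threshold are only reported as '>=threshold'."""
    return [v if v <= threshold else f">={threshold}" for v in values]

# Hypothetical income attribute with one outlier that is easy to re-identify.
incomes = [21_000, 25_500, 30_000, 34_200, 41_000, 250_000]

noisy = add_noise(incomes)                   # every value is distorted
coded = top_code(incomes, threshold=50_000)  # only detail in the upper tail is removed
```

      In the noisy version every value changes slightly, whereas top coding leaves the values below the threshold untouched and merely coarsens the outlier.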

      • Synthetic data. The protected data set Y consists of randomly simulated records that do not directly derive from the records in X; the only connection between X and Y is that the latter preserves some statistics from the former (typically a model relating the attributes in X). The generation of a synthetic data set takes three steps [27, 77]: (i) a model for the population is proposed, (ii) the model is adjusted to the original data set X, and (iii) the synthetic data set Y is generated by drawing from the model. There are three types of synthetic data sets:

      – Fully synthetic [77], where every attribute value for every record has been synthesized. The population units (subjects) contained in Y are not the original population units in X but a new sample from the underlying population.

      – Partially synthetic [74], where only the data items (the attribute values) with high risk of disclosure are synthesized. The population units in Y are the same population units in X (in particular, X and Y have the same number of records).

      – Hybrid [19, 65], where the original data set is mixed with a fully synthetic data set.

      In a fully synthetic data set, any dependency between X and Y must come from the model. In other words, X and Y are conditionally independent given the adjusted model. The disclosure risk in fully synthetic data sets is usually low, for two reasons. First, the population units in Y are not the original population units in X. Second, the only information about the original data X that Y conveys is what the model incorporates, which is usually limited to some statistical properties. In a partially synthetic data set, the disclosure risk is reduced by replacing the values in the original data set that are at higher risk of disclosure with simulated values. The simulated values assigned to an individual should be representative but are not directly related to her. In hybrid data sets, the level of protection is the lowest of the three; mixing original and synthetic records breaks the conditional independence between the original data and the synthetic data. The parameters of the mixture determine the amount of dependence.
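      The three generation steps listed above can be sketched for a fully synthetic data set as follows. This is a minimal illustration under strong assumptions: the proposed model (a normal first attribute and a linear regression of the second attribute on the first), the toy data, and the names fit_model and draw_synthetic are all hypothetical and not taken from the cited works.

```python
import random
import statistics

# Step (i): the proposed model is x ~ Normal(mean_x, sd_x) and
# y = intercept + slope * x + Normal(0, sd_res). This choice is an assumption.

def fit_model(records):
    """Step (ii): estimate the parameters of the assumed model from X."""
    xs = [r[0] for r in records]
    ys = [r[1] for r in records]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    return {
        "mean_x": mx,
        "sd_x": statistics.pstdev(xs),
        "intercept": intercept,
        "slope": slope,
        "sd_res": statistics.pstdev(residuals),
    }

def draw_synthetic(model, n, seed=0):
    """Step (iii): draw n new records from the fitted model only."""
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        x = rng.gauss(model["mean_x"], model["sd_x"])
        noise = rng.gauss(0.0, model["sd_res"])
        out.append((x, model["intercept"] + model["slope"] * x + noise))
    return out

# Toy original data set X: (age, income in thousands).
X = [(25, 30.0), (32, 41.5), (41, 52.0), (47, 60.5), (53, 66.0), (60, 75.5)]
model = fit_model(X)
Y = draw_synthetic(model, n=len(X))
```

      Note that draw_synthetic never sees X: everything Y inherits from the original data passes through the five fitted parameters, which is the conditional independence described above.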

      The evaluation of the utility of the protected data set must be based on the intended uses of the data. The closer the results obtained for these uses on the original and the protected data, the more utility is preserved. However, very often microdata protection cannot be performed in a data-use-specific manner, for the following reasons.

      • Potential data uses are very diverse and it may even be hard to identify them all at the moment of the data release.

      • Even if all the data uses could be identified, releasing several versions of the same original data set so that the i-th version has been optimized for the i-th data use may result in unexpected disclosure.

      Since data must often be protected with no specific use in mind, it is usually more appropriate to refer to information loss rather than to utility. Measures of information loss provide generic ways for the data protector to assess how much harm is being inflicted on the data by a particular data masking technique.

      Information loss measures for numerical data. Assume a microdata set X with n individuals (records) x1,…,xn and m continuous attributes x1,…,xm. Let Y be the protected microdata set. The following tools are useful to characterize the information contained in the data set:

      • Covariance matrices V (on X) and V′ (on Y).

      • Correlation matrices R and R′.

      • Correlation matrices RF and RF′ between the m attributes and the m factors PC1, PC2,…,PCm obtained through principal components analysis.

      • Communality between each of the m attributes and the first principal component PC1 (or another principal component PCi). Communality is the percentage of each attribute's variance that is explained by PC1 (or PCi). Let C be the vector of communalities for X, and C′ the corresponding vector for Y.

      • Factor score coefficient matrices F and F′. Matrix F contains the factors that should multiply each attribute in X to obtain its projection on each principal component. F′ is the corresponding matrix for Y.
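      One possible way to obtain the quantities listed above is sketched below with NumPy. It assumes that principal components analysis is run on the correlation matrix (that is, on standardized attributes) and that F is simply the matrix of eigenvectors; the text does not fix these conventions, so both are assumptions, as are the function name and the toy data.

```python
import numpy as np

def structure_matrices(data):
    """Return V, R, RF, C, F for an (n records x m attributes) array."""
    V = np.cov(data, rowvar=False)        # covariance matrix
    R = np.corrcoef(data, rowvar=False)   # correlation matrix
    eigvals, eigvecs = np.linalg.eigh(R)  # PCA on the correlation matrix
    order = np.argsort(eigvals)[::-1]     # largest explained variance first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    F = eigvecs                           # weights applied to standardized attributes
    # Correlation between attribute j and component k.
    RF = eigvecs * np.sqrt(np.clip(eigvals, 0.0, None))
    C = RF[:, 0] ** 2                     # share of each attribute explained by PC1
    return V, R, RF, C, F

# Toy data: X is the original, Y a noise-added protected version.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X[:, 1] += 0.8 * X[:, 0]
Y = X + rng.normal(scale=0.1, size=X.shape)

V, R, RF, C, F = structure_matrices(X)
V2, R2, RF2, C2, F2 = structure_matrices(Y)
```

      Eigenvectors are only defined up to sign, so before comparing RF with RF′ or F with F′ the signs of corresponding components would need to be aligned; the sketch leaves that step out.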

      There does not seem to be a single quantitative measure that completely reflects the structural differences between X and Y. Therefore, [25, 87] proposed measuring the information loss through the discrepancies between the matrices X, V, R, RF, C, and F obtained on the original data and the corresponding Y, V′, R′, RF′, C′, and F′ obtained on the protected data set. In particular, discrepancy between correlations is related to the information loss for data uses such as regressions and cross-tabulations. Matrix discrepancy can be measured in at least three ways.

      • Mean square error. Sum of squared componentwise differences between pairs of matrices, divided by the number of cells in either matrix.

      • Mean absolute error. Sum of absolute componentwise differences between pairs of matrices, divided by the number of cells in either matrix.

      • Mean variation. Sum of absolute percent variation of components in the matrix computed on the protected data with respect to components in the matrix computed on the original data, divided by the number of cells in either matrix. This approach has the advantage of not being affected by scale changes of attributes.
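      The three discrepancy measures above can be written down directly. The sketch below uses plain lists of rows; the function names and the two small covariance matrices are made up for illustration, and the mean variation is returned as a fraction (multiply by 100 for a percentage).

```python
def _cells(orig, prot):
    """Pair up corresponding cells of two equally shaped matrices (lists of rows)."""
    return [(x, y) for row_o, row_p in zip(orig, prot) for x, y in zip(row_o, row_p)]

def mean_square_error(orig, prot):
    cells = _cells(orig, prot)
    return sum((x - y) ** 2 for x, y in cells) / len(cells)

def mean_absolute_error(orig, prot):
    cells = _cells(orig, prot)
    return sum(abs(x - y) for x, y in cells) / len(cells)

def mean_variation(orig, prot):
    # Change relative to the original cell; undefined if an original cell is 0.
    cells = _cells(orig, prot)
    return sum(abs(x - y) / abs(x) for x, y in cells) / len(cells)

# Hypothetical covariance matrices on the original and the protected data.
V = [[4.0, 1.0], [1.0, 2.0]]
V_prime = [[5.0, 1.5], [1.5, 2.0]]

mse = mean_square_error(V, V_prime)    # 0.375
mae = mean_absolute_error(V, V_prime)  # 0.5
mv = mean_variation(V, V_prime)        # 0.3125
```

      Multiplying both matrices by the same constant changes the first two measures but not the third, which is the scale invariance mentioned above.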

      Information loss measures for categorical data. These have usually been based on direct comparison of categorical values, comparison of contingency tables, or Shannon’s entropy [25]. More recently, the importance of the semantics underlying categorical data for data utility has been realized [60, 83]. As a result, semantically grounded information loss measures that exploit the formal semantics provided by structured knowledge sources (such as taxonomies or ontologies) have been proposed, both to measure the practical utility and to guide the sanitization algorithms in terms of the preservation of data semantics [23, 57, 59].
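      To make the first two approaches more concrete, the sketch below computes a direct value-by-value mismatch rate and a distance between contingency tables. Both definitions are simplified assumptions for illustration and are not necessarily the exact measures of [25]; the records, attribute names, and function names are hypothetical.

```python
from collections import Counter

def direct_mismatch_rate(orig, prot, attr):
    """Share of records whose value of `attr` changed (records assumed aligned)."""
    changed = sum(1 for a, b in zip(orig, prot) if a[attr] != b[attr])
    return changed / len(orig)

def contingency_distance(orig, prot, attrs):
    """Sum of absolute cell-count differences between the two contingency tables."""
    table_x = Counter(tuple(r[a] for a in attrs) for r in orig)
    table_y = Counter(tuple(r[a] for a in attrs) for r in prot)
    return sum(abs(table_x[c] - table_y[c]) for c in set(table_x) | set(table_y))

X = [
    {"job": "nurse", "city": "Reus"},
    {"job": "nurse", "city": "Tarragona"},
    {"job": "teacher", "city": "Reus"},
    {"job": "lawyer", "city": "Tarragona"},
]
# Y: the job of the last two records has been generalized (a toy recoding).
Y = [
    {"job": "nurse", "city": "Reus"},
    {"job": "nurse", "city": "Tarragona"},
    {"job": "professional", "city": "Reus"},
    {"job": "professional", "city": "Tarragona"},
]

job_mismatch = direct_mismatch_rate(X, Y, "job")           # 0.5
table_loss = contingency_distance(X, Y, ("job", "city"))   # 4
city_loss = contingency_distance(X, Y, ("city",))          # 0
```

      Neither measure accounts for how close "professional" is to "teacher" in meaning; capturing that is the role of the semantically grounded measures mentioned above.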

      Bounded information loss measures. The information loss measures discussed above are unbounded, i.e., they do not take values in a predefined interval. On the other hand, as discussed below, disclosure risk measures are naturally bounded (the risk of disclosure lies between 0 and 1). Defining bounded information loss measures may be convenient to enable the data protector to trade off information loss against disclosure risk. In [61], probabilistic information loss measures bounded between 0 and 1 are proposed for continuous data.

      Propensity scores: a global information loss measure for all types of data. In [105], an information loss measure U applicable to continuous and categorical microdata was proposed. It is computed as follows.