1. Merge the original microdata set X and the anonymized microdata set Y, and add to the merged data set a binary attribute T with value 1 for the anonymized records and 0 for the original records.
2. Regress T on the rest of attributes of the merged data set and call the adjusted attribute T̂. For categorical attributes, logistic regression can be used.
3. Let the propensity score p̂i of record i of the merged data set be the value of T̂ for record i. Then the utility of Y is high if the propensity scores of the anonymized and original records are similar (this means that, based on the regression model used, anonymized records cannot be distinguished from original records).
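4. Hence, if the number of original and anonymized records is the same, say N, so that the merged data set has 2N records, a utility measure is
\[ U = \frac{1}{2N} \sum_{i=1}^{2N} \left( \hat{p}_i - \frac{1}{2} \right)^2 \]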
The farther U is from 0, the greater the information loss, and conversely.
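The procedure above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the utility is taken to be the mean squared deviation of the propensity scores from 1/2 over all records of the merged data set, the regression is a plain logistic model fitted by gradient descent, and the function names (`predict`, `fit_logistic`, `propensity_utility`) and the toy data are hypothetical.

```python
import math

def predict(w, x):
    """Logistic model: estimated probability that record x is anonymized (T = 1)."""
    z = w[-1] + sum(wj * xj for wj, xj in zip(w, x))  # w[-1] is the intercept
    z = max(-30.0, min(30.0, z))  # clamp so that math.exp cannot overflow
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(rows, labels, epochs=500, lr=0.1):
    """Regress T on the attributes by batch gradient descent (step 2)."""
    w = [0.0] * (len(rows[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, t in zip(rows, labels):
            err = predict(w, x) - t
            for j, xj in enumerate(x):
                grad[j] += err * xj
            grad[-1] += err
        w = [wj - lr * g / len(rows) for wj, g in zip(w, grad)]
    return w

def propensity_utility(original, anonymized):
    """Mean squared deviation of the propensity scores from 1/2 (steps 1 to 4)."""
    merged = original + anonymized                      # step 1: merge X and Y
    t = [0] * len(original) + [1] * len(anonymized)     # step 1: attribute T
    w = fit_logistic(merged, t)                         # step 2: regress T
    scores = [predict(w, x) for x in merged]            # step 3: propensity scores
    return sum((p - 0.5) ** 2 for p in scores) / len(scores)

# Toy data (made up): one numerical attribute, 20 records.
original = [[i / 10.0] for i in range(20)]
identical_copy = [row[:] for row in original]        # indistinguishable from X
shifted_copy = [[x[0] + 5.0] for x in original]      # trivially distinguishable

u_identical = propensity_utility(original, identical_copy)
u_shifted = propensity_utility(original, shifted_copy)
print(u_identical, u_shifted)
```

In this toy setting the identical copy yields a value of U close to 0, whereas the shifted copy yields a clearly larger value, in line with the interpretation of U given above.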
2.7 TRADING OFF INFORMATION LOSS AND DISCLOSURE RISK
The goal of SDC, namely to modify data so that sufficient protection is provided at minimum information loss, suggests that a good anonymization method is one that comes close to optimizing the trade-off between disclosure risk and information loss. Several approaches have been proposed to handle this trade-off. Here we discuss SDC scores and R-U maps.
SDC scores
An SDC score is a formula that combines the effects of information loss and disclosure risk in a single figure. Having adopted an SDC score as a good trade-off measure, the goal is to optimize the score value. Following this idea, [25] proposed a score for method performance rating based on the average of information loss and disclosure risk measures. For each method M and parameterization P, the following score is computed:
\[ \mathrm{Score}(X, Y) = \frac{IL(X, Y) + DR(X, Y)}{2} \]
where IL is an information loss measure, DR is a disclosure risk measure, and Y is the protected data set obtained after applying method M with parameterization P to an original data set X. In [25], IL and DR were computed using a weighted combination of several information loss and disclosure risk measures. With the resulting score, a ranking of masking methods (and their parameterizations) was obtained. Using a score allows the selection of a masking method and its parameters to be treated as an optimization problem: a masking method can be applied to the original data file and then a post-masking optimization procedure can be applied to decrease the score obtained (that is, to reduce information loss and disclosure risk). On the negative side, no specific score weighting can do justice to all methods. Thus, when ranking methods, the values of all measures of information loss and disclosure risk should be supplied along with the overall score.
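As a rough illustration of score-based ranking, the following sketch averages an information loss value and a disclosure risk value for each (method, parameterization) pair and sorts the candidates by the result. The function names and the IL and DR figures are hypothetical placeholders, not results from [25]; in practice both values would come from the chosen measures computed on X and Y.

```python
def sdc_score(il, dr):
    """Average of an information loss value and a disclosure risk value."""
    return (il + dr) / 2.0

def rank_methods(candidates):
    """Rank (method, parameterization) pairs by increasing score.

    `candidates` maps (method, parameterization) to a pair (IL, DR).
    IL and DR are kept next to the score, since the score alone
    cannot do justice to every method.
    """
    rows = [
        (sdc_score(il, dr), il, dr, method, params)
        for (method, params), (il, dr) in candidates.items()
    ]
    return sorted(rows)

# Made-up IL and DR values on a common 0-100 scale, for illustration only.
candidates = {
    ("noise addition", "low variance"): (10.0, 60.0),
    ("noise addition", "high variance"): (45.0, 25.0),
    ("generalization", "coarse categories"): (30.0, 20.0),
}

ranking = rank_methods(candidates)
for score, il, dr, method, params in ranking:
    print(f"{method} ({params}): score={score:.1f} IL={il:.1f} DR={dr:.1f}")
```

Reporting IL and DR next to each score follows the recommendation above that the individual measures be supplied along with the overall score.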
R-U maps
A tool which may be enlightening when trying to construct a score or, more generally, optimize the trade-off between information loss and disclosure risk is a graphical representation of pairs of measures (disclosure risk, information loss) or their equivalents (disclosure risk, data utility). Such maps are called R-U confidentiality maps [28]. Here, R stands for disclosure risk and U for data utility. In its most basic form, an R-U confidentiality map is the set of paired values (R, U) of disclosure risk and data utility that correspond to the various strategies for data release (e.g., variations on a parameter). Such (R, U) pairs are typically plotted in a two-dimensional graph, so that the user can easily grasp the influence of a particular method and/or parameter choice.
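The following sketch shows how the (R, U) pairs of such a map might be computed for one family of release strategies, here noise addition with a varying standard deviation. Everything specific in it is an assumption made for illustration: the risk proxy (share of masked values whose nearest original value is their true source), the utility proxy (1 / (1 + mean squared error)), the function names, and the toy data. The resulting pairs could then be plotted with any charting tool.

```python
import random

def ru_point(original, sigma, rng):
    """Return one (R, U) pair for noise addition with standard deviation sigma."""
    masked = [x + rng.gauss(0.0, sigma) for x in original]
    # Risk proxy (assumption): fraction of masked values that are linked
    # to their true source by nearest-value matching.
    hits = 0
    for i, y in enumerate(masked):
        nearest = min(range(len(original)), key=lambda j: abs(original[j] - y))
        if nearest == i:
            hits += 1
    risk = hits / len(original)
    # Utility proxy (assumption): decreases as the masked values drift
    # away from the original ones.
    mse = sum((x - y) ** 2 for x, y in zip(original, masked)) / len(original)
    utility = 1.0 / (1.0 + mse)
    return risk, utility

def ru_map(original, sigmas, seed=0):
    """Set of (parameter, R, U) triples, one per release strategy."""
    rng = random.Random(seed)
    return [(sigma,) + ru_point(original, sigma, rng) for sigma in sigmas]

# Toy data (made up): 50 distinct values of one numerical attribute.
original_values = [float(i) for i in range(50)]
points = ru_map(original_values, [0.0, 0.5, 2.0])
for sigma, risk, utility in points:
    print(f"sigma={sigma:.1f}  R={risk:.2f}  U={utility:.2f}")
```

With no noise, the release is the original data, so both proxies take their maximum value; larger noise moves the point toward lower risk and lower utility.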
2.8 SUMMARY
This chapter has presented a broad overview of disclosure risk limitation. We have identified the privacy threats (identity and/or attribute disclosure), and we have introduced the main families of SDC methods (data masking via perturbative and non-perturbative methods, as well as synthetic data generation). We have also surveyed disclosure risk and information loss metrics, and we have discussed how risk and information loss can be traded off with a view to finding the best SDC method and parameterization.
CHAPTER 3
Anonymization Methods for Microdata
As noted in Section 2.5, the protected data set Y is generated either by masking the original data set X or by building it from scratch based on a model of the original data. Microdata masking techniques were further classified into perturbative masking (which distorts the original data and leads to the publication of non-truthful data) and non-perturbative masking (which reduces the amount of information, either by suppressing some of the data or by reducing the level of detail, but preserves truthfulness). This chapter classifies and reviews some well-known SDC techniques. These techniques are not only useful on their own but also constitute the basis to enforce the privacy guarantees required by privacy models.
3.1 NON-PERTURBATIVE MASKING METHODS
Non-perturbative methods do not alter data; rather, they produce partial suppressions or reductions of detail in the original data set.
Sampling
Instead of publishing the original microdata file X, what is published is a sample S of the original set of records [104]. Sampling methods are suitable for categorical microdata [58], but for continuous microdata they should probably be combined with other masking methods. The reason is that sampling alone leaves a continuous attribute unperturbed for all records in S. Thus, if any continuous attribute is present in an external administrative public file, unique matches with the published sample are very likely: indeed, given a continuous attribute and two respondents xi and xj, it is unlikely that both respondents will take the same value for the continuous attribute unless xi = xj (this is true even if the continuous attribute has been truncated to represent it digitally). If, for a continuous identifying attribute, the score of a respondent is only approximately known by an attacker, it might still make sense to use sampling methods to protect that attribute. However, assumptions on restricted attacker resources are perilous and may prove definitely too optimistic if good quality external administrative files are at hand.
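The weakness of sampling for continuous attributes can be illustrated with a small sketch. It draws a sample S from a toy population and counts how many sampled records match exactly one record of an external file on a given attribute. The function names, the attribute names, and the data are hypothetical, and the external file is assumed, pessimistically, to contain the whole population.

```python
import random

def sample_release(records, fraction, seed=0):
    """Publish a simple random sample S of the records instead of the full file."""
    rng = random.Random(seed)
    k = max(1, round(fraction * len(records)))
    return rng.sample(records, k)

def unique_matches(sample, external, attribute):
    """Count sampled records whose attribute value occurs exactly once in `external`."""
    counts = {}
    for rec in external:
        counts[rec[attribute]] = counts.get(rec[attribute], 0) + 1
    return sum(1 for rec in sample if counts.get(rec[attribute], 0) == 1)

# Toy population (made up): a continuous attribute with all-distinct values
# and a categorical attribute with two categories.
population = [
    {"income": 20000.0 + 137.5 * i, "sex": "F" if i % 2 == 0 else "M"}
    for i in range(100)
]

released = sample_release(population, fraction=0.1)
matches_on_income = unique_matches(released, population, "income")
matches_on_sex = unique_matches(released, population, "sex")
print(len(released), matches_on_income, matches_on_sex)
```

Because sampling leaves the continuous attribute unperturbed, every sampled record is uniquely matched on it in this toy setting, whereas the categorical attribute alone singles out no one.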
Generalization
This technique is also known as global recoding in the statistical disclosure control literature. For a categorical attribute Xi, several categories are combined to form new (less specific) categories, thus resulting in a new Yi with |Dom(Yi)| < |Dom(Xi)| where |·| is the cardinality operator and Dom(·) is the domain where the attribute takes values. For a continuous attribute, generalization means replacing Xi by another attribute Yi which is a discretized version of Xi. In other words, a potentially infinite range Dom(Xi) is mapped onto a finite range Dom(Yi). This is the technique used in the μ-Argus SDC package [45]. This technique is more appropriate for categorical microdata, where it helps disguise records with strange combinations of categorical attributes. Generalization is used heavily by statistical offices.
Example 3.1 If there is a record with “Marital status = Widow/er” and “Age = 17,” generalization could be applied to “Marital status” to create a broader category “Widow/er or divorced,” so that the probability of the above record being unique would diminish. Generalization can also be used on a continuous attribute, but the inherent discretization leads very often to an unaffordable loss of information. Also, arithmetical operations that were straightforward on the original Xi are no longer easy or intuitive on the discretized Yi.
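A minimal sketch of both forms of generalization follows: combining categories of a categorical attribute, as in Example 3.1, and discretizing a continuous attribute into intervals. The function names, the category mapping, and the interval edges are illustrative assumptions rather than the behavior of any particular SDC package.

```python
import bisect

def recode_categorical(values, mapping):
    """Global recoding: replace each category by its broader category, if any."""
    return [mapping.get(v, v) for v in values]

def interval_label(index, edges):
    """Label of the interval with the given index for sorted cut points `edges`."""
    if index == 0:
        return f"<{edges[0]}"
    if index == len(edges):
        return f">={edges[-1]}"
    return f"[{edges[index - 1]},{edges[index]})"

def discretize(values, edges):
    """Map each numerical value to the label of the interval that contains it."""
    return [interval_label(bisect.bisect_right(edges, v), edges) for v in values]

# Categorical attribute: combine two categories into a less specific one.
marital_status = ["Single", "Married", "Widow/er", "Divorced", "Widow/er"]
broader = {
    "Widow/er": "Widow/er or divorced",
    "Divorced": "Widow/er or divorced",
}
recoded_status = recode_categorical(marital_status, broader)

# Continuous attribute: map ages onto a finite set of intervals.
ages = [17, 25, 34, 61, 80]
recoded_ages = discretize(ages, edges=[18, 40, 65])

print(recoded_status)
print(recoded_ages)
```

The recoded attribute has a strictly smaller domain than the original one, and the discretized ages no longer support the arithmetic that was straightforward on the original values.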
Top and bottom coding
Top and bottom coding are special cases of generalization which can be used on attributes that can be ranked, that is, continuous or categorical ordinal. The idea is that top values (those above a certain