
Data Privacy

  • Transparency and data privacy. This includes transparency-aware masking methods and disclosure risk assessment under the transparency principle. Details here.

  • Disclosure. There are two main types of disclosure:
    • Identity disclosure. Identity disclosure occurs when we are able to identify someone in a database. Several privacy models focus on this type of disclosure (e.g., privacy from re-identification, k-anonymity, uniqueness). Identity disclosure needs to be avoided in most cases.
    • Attribute disclosure. Attribute disclosure occurs when we increase our knowledge of a particular attribute of a particular individual. Some authors distinguish attribute disclosure from inference/inferential disclosure. The difference is whether the additional knowledge is obtained from the database directly or inferred from it using, e.g., a statistical model. While identity disclosure is usually to be avoided, the extent to which we need to avoid attribute disclosure is not always so clear. When we build statistical and machine learning models, it is usual to expect some increase in the knowledge about particular individuals.
      There are privacy models that focus on this type of disclosure. An example is privacy for interval disclosure. Differential privacy can also be seen from this perspective: its goal is to avoid inferring that an individual was present or absent in the database used to compute a function. Integral privacy has a similar goal. Secure multiparty computation requires that the parties involved learn nothing but the outcome of the computation; in particular, they should not learn any unauthorized attribute of any individual whose data is used in the computation. Some extensions of k-anonymity, such as l-diversity, are defined to avoid attribute disclosure.

Privacy models:
  • A privacy model is a formal definition of privacy that allows us to design algorithms and validate them with respect to this formal definition. Examples of privacy models include the following:
    • Privacy from re-identification. When a database is made available for analysis, we want to prevent anyone from identifying an individual in it. This applies to any type of database, from standard SQL databases to, e.g., non-SQL ones. For example, for graphs representing social networks, re-identification occurs when we learn that a node of the graph is a particular person.
    • k-Anonymity. This privacy model is also related to privacy from re-identification. In this case, we require that when an intruder looks for an individual using some prior knowledge, there are at least k individuals with exactly the same information, any of whom can be the one being looked for. For a standard database, this means that given some values to be looked for, there are at least k records in the database with exactly those values.
    • Differential privacy. This model is related to the computation of a query or a function given a database. The objective of the model is to prevent anyone from learning, from the output of the function or query, that the data of a particular individual was used. A function satisfies differential privacy when its output (more precisely, the probability distribution of its output) does not change much under the presence or absence of any single individual.
    • Integral privacy. We proposed this privacy model as an alternative to differential privacy. Information and results on integral privacy here.
    • Secure multiparty computation. The goal is to compute a function in a distributed way, i.e. using data from different parties, so that the only knowledge parties acquire from the process is the result of the function. No additional knowledge should be obtained. Parties of a secure multiparty computation model should only learn what they would learn if instead of a distributed approach, a centralized approach were used (using a trusted third party to compute the result of the function).
    We have worked with all these models. Some of our results are reported below focusing on the protection procedures and measures of risk and utility.
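
The k-anonymity requirement above can be checked mechanically. The following is a minimal sketch in Python, not taken from the original text: the function name `k_anonymity_level`, the toy table, and the choice of quasi-identifier attributes are all hypothetical, and the prior knowledge of the intruder is assumed to be exactly the listed quasi-identifiers.

```python
from collections import Counter


def k_anonymity_level(records, quasi_identifiers):
    """Return the largest k for which `records` is k-anonymous.

    This is the size of the smallest group of records that share the same
    values on all quasi-identifier attributes (0 for an empty table).
    """
    if not records:
        return 0
    groups = Counter(tuple(r[a] for a in quasi_identifiers) for r in records)
    return min(groups.values())


# Hypothetical toy table; "age" and "zip" are already generalized.
records = [
    {"age": "30-39", "zip": "081**", "disease": "flu"},
    {"age": "30-39", "zip": "081**", "disease": "asthma"},
    {"age": "40-49", "zip": "082**", "disease": "flu"},
    {"age": "40-49", "zip": "082**", "disease": "diabetes"},
    {"age": "40-49", "zip": "082**", "disease": "flu"},
]

# Assumption: the intruder knows age and zip only.
k = k_anonymity_level(records, ["age", "zip"])
print(k)
```

With age and zip as quasi-identifiers, the smallest group has two records, so the table is 2-anonymous. If the intruder also knew the disease, some records would become unique and the level would drop to 1.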

Data protection (DP) methods:
  • Data protection methods. They implement privacy models. Because of that there is a close relationship between families of data protection methods and privacy models.
    Examples of data protection methods include:
    • Masking methods / data anonymization procedures. They are methods for achieving privacy from re-identification and k-anonymity. They are also used for local differential privacy. Masking methods are applied to databases (standard SQL and non-standard ones) to reduce their quality so that disclosure of information is avoided. The three main groups of methods are: perturbative, non-perturbative, and synthetic data generators.
    • Methods to achieve differential privacy. This includes additive noise using Laplace distribution, multiplicative noise, randomization.
    • Cryptographic protocols for secure-multiparty computation. Each function to be computed according to the secure-multiparty computation privacy model needs a specific cryptographic protocol.
    Information and results here.
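
As an illustration of additive noise with the Laplace distribution, here is a minimal sketch of a differentially private counting query. It is an assumption-laden example rather than a reference implementation: the names `laplace_noise` and `dp_count` and the toy data are hypothetical, neighbouring databases are assumed to differ by the presence or absence of one record (so a count has sensitivity 1), and plain floating-point sampling like this is for illustration only.

```python
import random


def laplace_noise(scale, rng):
    """Draw one sample from a Laplace distribution with mean 0.

    The difference of two independent exponential variables with mean
    `scale` follows a Laplace(0, scale) distribution.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_count(records, predicate, epsilon, rng=None):
    """Return a noisy count of the records satisfying `predicate`.

    Assumption: adding or removing one record changes the true count by at
    most 1, so the noise scale is sensitivity / epsilon = 1 / epsilon.
    """
    rng = rng or random.Random()
    true_count = sum(1 for r in records if predicate(r))
    sensitivity = 1.0
    return true_count + laplace_noise(sensitivity / epsilon, rng)


# Hypothetical data: ages of the individuals in a database.
ages = [23, 35, 41, 29, 52, 67, 38, 44]
noisy = dp_count(ages, lambda a: a >= 40, epsilon=1.0, rng=random.Random(0))
print(noisy)
```

Smaller values of `epsilon` give a larger noise scale, and therefore stronger protection at the cost of less accurate answers.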

  • Information loss (IL) and data utility measures. Masking methods apply a transformation to databases to reduce their quality to avoid disclosure of sensitive information. As quality is reduced, the utility of the new database is reduced or, in other words, there is some information loss. Information loss measures exist to quantify this loss.
    As we report in our book, masking is not always equivalent to information loss. Several authors have shown that, for some types of perturbation and data, disclosure risk can be reduced with no information loss. E.g., machine learning models learnt from masked data do not necessarily lose accuracy, and in some cases their accuracy even increases slightly. Of course, this depends on the type of data and the way the model is built. We also have results in this direction (here).
    Information loss measures depend on the type of data we have (e.g., standard numerical database vs. graphs or documents) and the data uses (e.g., regression, clustering):
    • Generic information loss measures. They are based on statistics (such as means, variances, correlations, and contingency tables). They are used to obtain a general metric of the perturbation suffered by the data, especially when we do not know much about the possible uses of the database.
    • Specific information loss measures. They are based on an actual analysis of the data. E.g., compare the accuracy of a model extracted from the original database with the accuracy of a model extracted from the masked database.
    We can formulate information loss between a database X and a masked database X' for an analysis f as follows:
    IL_f(X, X') = divergence(f(X), f(X'))
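
The formulation above translates almost directly into code. The sketch below is only an illustration under stated assumptions: the analysis f is taken to be a simple statistic (mean or variance), the divergence is the absolute difference, and the masked database comes from a toy univariate microaggregation. The names `information_loss`, `abs_diff`, and `microaggregate_sorted` and the data are hypothetical.

```python
import statistics


def information_loss(f, divergence, X, X_masked):
    """Information loss for analysis f: divergence(f(X), f(X'))."""
    return divergence(f(X), f(X_masked))


def abs_diff(a, b):
    """A simple divergence between two numerical results."""
    return abs(a - b)


def microaggregate_sorted(values, k):
    """Toy univariate microaggregation.

    Sort the values and replace each consecutive group of k values by the
    group mean. The last group may be smaller than k if len(values) is not
    a multiple of k; real methods handle that case more carefully.
    """
    ordered = sorted(values)
    masked = []
    for i in range(0, len(ordered), k):
        group = ordered[i:i + k]
        masked.extend([statistics.fmean(group)] * len(group))
    return masked


# Hypothetical numerical attribute and its masked version.
X = [21.0, 34.0, 45.0, 52.0, 63.0, 70.0]
X_masked = microaggregate_sorted(X, k=3)

il_mean = information_loss(statistics.fmean, abs_diff, X, X_masked)
il_var = information_loss(statistics.pvariance, abs_diff, X, X_masked)
print(il_mean, il_var)
```

In this toy case the mean is preserved (up to floating-point error) while the variance is reduced, which shows how the information loss of the same masked database depends on the analysis f under consideration.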

  • Disclosure risk (DR) measures. Masking methods reduce the quality of a database to reduce the risk. Nevertheless, not every modification reduces, e.g., the possibility of re-identification to the same extent. Disclosure risk measures are used to evaluate disclosure risk after masking.
    There are two main types of disclosure in a database:
    • Identity disclosure is when we are able to identify someone in a (masked) database. Uniqueness and measures based on record linkage are examples of disclosure risk measures for identity disclosure.
    • Attribute disclosure is when we increase our knowledge of an attribute of an individual. An example of an attribute disclosure measure is the difference (for a given individual) between the inferred value of an attribute and its real value.
    We have worked on different topics related to risk assessment. E.g.,
    • on disclosure risk assessment for the worst-case scenario using supervised ML for estimating risk using record linkage with distances based on [1] bilinear forms [2], Choquet integral [3], as well as using the weighted Euclidean distance,
    • on disclosure risk assessment under transparency attacks (for microaggregation [4], for rank swapping [5], [6]),
    • on disclosure risk assessment for synthetic data, showing that in some cases reidentification is possible [7].
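
A disclosure risk measure based on record linkage, as mentioned above for identity disclosure, can be sketched as follows. This is a minimal illustration and not the method of the cited works: it uses the plain (unweighted) Euclidean distance, it assumes that the intruder holds the original records, and it assumes that `masked[i]` was derived from `original[i]` so that correct links can be counted. The function name `linkage_risk` and the data are hypothetical.

```python
import math


def linkage_risk(original, masked):
    """Estimate re-identification risk by distance-based record linkage.

    Each masked record is linked to its nearest original record (Euclidean
    distance; ties go to the first candidate). The risk is the fraction of
    masked records linked to their true source, assuming masked[i] was
    derived from original[i].
    """
    correct = 0
    for i, m in enumerate(masked):
        nearest = min(range(len(original)),
                      key=lambda j: math.dist(m, original[j]))
        if nearest == i:
            correct += 1
    return correct / len(masked)


# Hypothetical numerical records (two attributes each).
original = [(1.0, 2.0), (5.0, 5.0), (9.0, 1.0)]

# Light perturbation: every record stays closest to its source.
masked_light = [(1.2, 1.9), (4.8, 5.3), (8.7, 1.1)]

# Stronger masking: the first two records end up near each other's source.
masked_strong = [(5.1, 4.9), (1.0, 2.1), (9.0, 1.0)]

print(linkage_risk(original, masked_light))
print(linkage_risk(original, masked_strong))
```

The lightly perturbed file is fully re-identified, whereas in the more strongly masked file only one of the three records is linked correctly. Replacing `math.dist` by a weighted or learnt distance would be one way to move towards a worst-case estimate.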