转发Privacy Preserving Publication of Set-Valued Data

Written by {ga=terrovitis}
Set-values are a natural representation of data that appears in a large number of application areas, ranging from retail sales logs to medical records. Datasets with set-values are sparse multidimensional data that usually contain unique combinations of values. Publishing such datasets poses great dangers to the privacy of the individuals that are associated with the records. Adversaries with partial background knowledge can use it to retrieve the whole record that is associated with an individual.
The aim of this article is to provide an overview of the basic anonymization methods for set-valued data. Anonymization methods allow sanitizing set-valued datasets in such a way that
the privacy of the individuals who are associated with the data is guaranteed. The paper describes the basic privacy guaranties that are used in research literature, and briefly presents the state-of-the-art anonymization methods.

Privacy preserving publishing of set-values

As technology infiltrates more and more aspects of our lives, each human activity leaves a digital trace in some repository. Vast amounts of personal data are implicitly or explicitly created each day, and rarely one is aware of the extent of information that is kept and processed about him or her. These personal data give rise to significant concerns about user privacy, since important and sensitive details about one's life are collected and exploited by third parties. The goal of privacy preservation technologies is to provide tools that allow greater control over the dissemination of user data. A promising trend in the field is Privacy Preserving Data Publishing (PPDP), which allows sharing of anonymized data. Anonymizing a dataset is not limited to the removal of direct identifiers that might exist in a dataset, e.g. the name or the Social Security Number of a person. It also includes removing secondary information, e.g. like
age, zipcode that might lead indirectly to the true identity of an individual. This secondary information is often referred to as quasi-identifiers.
A basic attack scenario against user privacy is as follows: An adversary who knows some characteristics of a person, e.g. age and zipcode, searches the anonymized data for records that match her background knowledge. If a single record is found, then she be can certain that the record refers to the person see knows. The sparser the data are, the more unique combinations exist, and the easier it is for an adversary to locate unique records that correspond to specific users. This makes collections of set-values, which are sparse multidimensional data, very hard to anonymize effectively without compromising user privacy [Aggarwal 2007]. There are several methods in research literature that address the challenges of anonymizing set-valued data. I report the basic techniques classified in three categories: anonymization methods that protect against identity disclosure, methods that protect against
attribute disclosure and methods that offer differential privacy.

Paste_Image.png

Protection against identity disclosure

Protection against identity disclosure guarantees that adversaries will not be able to associate specific records with a known individual. The most popular guaranty is k-anonymity [Samarati 2001, Sweeney 2002]. k-anonymity guarantees that each record will be indistinguishable from other k-1 records, with respect to the quasi identifiers. Every combination of quasi identifiers appears 0 or more than k times in the anonymized dataset.


Paste_Image.png

In [He et. al. 2009] a method for transforming set-valued data to a k-anonymous form, the Partition algorithm, is proposed. Partition employs generalization for transforming the data to k-anonymous form. Generalization is the replacement of a group of original values by one new more abstract one. For example, if the residence area of an individual is reported in terms of cities, the city name can be replaced by the country name in anonymized data as in Figure 2. Partition employs local recoding; not all appearances of an original
value are replaced by a generalized one. Partition is a top down algorithm; it starts by considering that all values are generalized to the more generic value of the generalization hierarchy and then drills down the hierarchy until the k-anonymity property nolonger holds.
Despite using a flexible local recoding transformation, Partition has to significantly transform the original data in order to achieve k-anonymity. Records of different cardinalities that contain items from a large domain must end up to groups of identical records of size at least k. In response to this problem [Terrovitis et. al. 2008, Terrovitis et. al. 2010] propose km-anonymity. km-anonymity requires that each combination of up to m quasi identifiers must appear at least k times in the published data. The intuition behind km-anonymity is that there is little privacy gain from protecting against adversaries who already know most of the terms of one record, and significant information loss in the effort to do so. In [Terrovitis et. al. 2008, Terrovitis et. al. 2010] algorithms that rely on generalization using both global and local recoding are proposed.

Paste_Image.png

In [Loukides et. al. 2011, Gkoulalas et. al. 2011] the authors provide an even more targeted approach. The paper assumes that a set of privacy and utility constraints is known by the data publisher. Such assumption holds in specific domains, like that of medical records [Loukides et. al. 2010/II]. Exploiting this explicit knowledge about privacy dangers and utility the authors propose algorithms that achieve protection against identity disclosure with limited information loss.

Protection against attribute disclosure

The data values in a dataset are not usually equally important as personal information. A common distinction in privacy related literature is between quasi identifiers and sensitive values. Quasi identifiers are usually known through several sources and they do not threaten the privacy of an individual. Sensitive values on the other hand, are not considered available through other sources and they reveal important personal information. If such a distinction holds and it is known by the data publisher, then data must also be protected against the disclosure of sensitive attributes. A common guaranty for protecting against sensitive values is l-diversity [Machanavajjhala et. al. 2006]. l-diversity guarantees that any adversary cannot associate her background knowledge with less than l well represented sensitive values. Well-represented is usually defined as a probability threshold: an adversary cannot associate her background knowledge with any sensitive value with probability over 1.

Paste_Image.png

The first anonymization method that provided protection against attribute disclosure in set-valued attributes was proposed in [Ghinita et. al. 2008]. The proposal of [Ghinita et. al. 2008, Ghinita et. al. 2011] relies on separating sensitive values from quasi identifiers as depicted in Tables 4 and 5. The idea of separation was first proposed in the relational context in [Xiao et. al 2006], but it was adjusted and extended in [Ghinita et. al. 2008, Ghinita et. al. 2011] for the set-valued context. The basic idea of proposed the anonymization method is to create clusters of similar records (with respect to quasi identifiers) and then publish at each cluster the quasi identifiers and the sensitive values separately. A benefit of this transformation with respect to generalization and suppression is that it does not require creating groups with identical quasi identifiers. This way the information loss is kept low, even for data of very high cardinality and dimensionality.

Protection against attribute disclosure is also studied in [Cao et. al. 2010]. The proposed guaranty, p-uncertainty is similar to as l-diversity and requires that quasi identifiers cannot be associated with sensitive values with probability over 1/p. The novelty of the approach is that is considers as quasi identifier every record subset that appears in the dataset, including the ones that contain sensitive values. This means that sensitive values can also act as quasi identifiers. The proposed anonymization method relies both on generalization and suppression.

A guaranty that provides protection both from identity and attribute disclosure, the (h,k,p)-coherence, is proposed in [Xu et. al. 2008/I, Xu et. al. 2008/II]. (h,k,p)-coherence. Similarly to km-anonymity it protects from adversaries how know up to p terms, by guaranteeing that every combination of items will appear at least k times. Moreover, (h,k,p)-coherence guarantees that combinations of up to p items cannot be associated with a sensitive value with probability over h. The proposed anonymization method relies solely on suppression.

In [Loukides et. al. 2010/I] an anonymization method that can be tailored to specific sensitive inferences is presented. The authors propose the notion of PS-rules, which are sensitive association rules specified by the data owner. The anonymization procedure guarantees that an adversary will not be able to infer these rules with high certainty. The proposed anonymization algorithms are based on generalization.

Differential privacy in for set-valued attributes

Differential privacy [Dwork et. al. 2006] is a more generic guaranty than k-anonymity
and l-diversity and the protection it offers is broader than protection against identity and attribute disclosure. It does not make any assumptions about the adversary's background knowledge (although anonymization methods that provide differential privacy might make some implicit ones [Kifer et. al. 2011]), and it requires that any analysis of the anonymized data is not significantly affected by the addition or removal of one record in the original data. The trade-off for providing such a strong anonymity guaranty is that the information quality of the result if significantly affected. Only the results of specific analysis can be provided, by adding noise in a non-deterministic procedure.
A first approach to the anonymization of set-values by using differential privacy appears in [Korolova et. al. 2009], in the context of web search query logs. Each record is the web search history of a single user, which can be modeled as a set value, but the result of the anonymization is not set-values. Only the frequent query terms and they support (with the addition of Laplace noise) is published. A more recent approach that publishes anonymized set values under differential privacy and provides set-values as a result, appears in [Chen 2011]. The paper proposes DiffPart, a top-down algorithm that decides which frequent itemsets can be published, without violating differential privacy. The supports of all published items are distorted by adding Laplace noise.

最后编辑于
©著作权归作者所有,转载或内容合作请联系作者
  • 序言:七十年代末,一起剥皮案震惊了整个滨河市,随后出现的几起案子,更是在滨河造成了极大的恐慌,老刑警刘岩,带你破解...
    沈念sama阅读 217,734评论 6 505
  • 序言:滨河连续发生了三起死亡事件,死亡现场离奇诡异,居然都是意外死亡,警方通过查阅死者的电脑和手机,发现死者居然都...
    沈念sama阅读 92,931评论 3 394
  • 文/潘晓璐 我一进店门,熙熙楼的掌柜王于贵愁眉苦脸地迎上来,“玉大人,你说我怎么就摊上这事。” “怎么了?”我有些...
    开封第一讲书人阅读 164,133评论 0 354
  • 文/不坏的土叔 我叫张陵,是天一观的道长。 经常有香客问我,道长,这世上最难降的妖魔是什么? 我笑而不...
    开封第一讲书人阅读 58,532评论 1 293
  • 正文 为了忘掉前任,我火速办了婚礼,结果婚礼上,老公的妹妹穿的比我还像新娘。我一直安慰自己,他们只是感情好,可当我...
    茶点故事阅读 67,585评论 6 392
  • 文/花漫 我一把揭开白布。 她就那样静静地躺着,像睡着了一般。 火红的嫁衣衬着肌肤如雪。 梳的纹丝不乱的头发上,一...
    开封第一讲书人阅读 51,462评论 1 302
  • 那天,我揣着相机与录音,去河边找鬼。 笑死,一个胖子当着我的面吹牛,可吹牛的内容都是我干的。 我是一名探鬼主播,决...
    沈念sama阅读 40,262评论 3 418
  • 文/苍兰香墨 我猛地睁开眼,长吁一口气:“原来是场噩梦啊……” “哼!你这毒妇竟也来了?” 一声冷哼从身侧响起,我...
    开封第一讲书人阅读 39,153评论 0 276
  • 序言:老挝万荣一对情侣失踪,失踪者是张志新(化名)和其女友刘颖,没想到半个月后,有当地人在树林里发现了一具尸体,经...
    沈念sama阅读 45,587评论 1 314
  • 正文 独居荒郊野岭守林人离奇死亡,尸身上长有42处带血的脓包…… 初始之章·张勋 以下内容为张勋视角 年9月15日...
    茶点故事阅读 37,792评论 3 336
  • 正文 我和宋清朗相恋三年,在试婚纱的时候发现自己被绿了。 大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
    茶点故事阅读 39,919评论 1 348
  • 序言:一个原本活蹦乱跳的男人离奇死亡,死状恐怖,灵堂内的尸体忽然破棺而出,到底是诈尸还是另有隐情,我是刑警宁泽,带...
    沈念sama阅读 35,635评论 5 345
  • 正文 年R本政府宣布,位于F岛的核电站,受9级特大地震影响,放射性物质发生泄漏。R本人自食恶果不足惜,却给世界环境...
    茶点故事阅读 41,237评论 3 329
  • 文/蒙蒙 一、第九天 我趴在偏房一处隐蔽的房顶上张望。 院中可真热闹,春花似锦、人声如沸。这庄子的主人今日做“春日...
    开封第一讲书人阅读 31,855评论 0 22
  • 文/苍兰香墨 我抬头看了看天上的太阳。三九已至,却和暖如春,着一层夹袄步出监牢的瞬间,已是汗流浃背。 一阵脚步声响...
    开封第一讲书人阅读 32,983评论 1 269
  • 我被黑心中介骗来泰国打工, 没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留,地道东北人。 一个月前我还...
    沈念sama阅读 48,048评论 3 370
  • 正文 我出身青楼,却偏偏与公主长得像,于是被迫代替她去往敌国和亲。 传闻我的和亲对象是个残疾皇子,可洞房花烛夜当晚...
    茶点故事阅读 44,864评论 2 354

推荐阅读更多精彩内容

  • 醉后书0815阅读 89评论 0 0
  • 意识是人脑对外界事物的客观反映。是人能觉察到的心理活动。 意识→觉心-觉知、心理功能、心理状态 意识的局限性与能动...
    黄佳宁阅读 710评论 0 1
  • 《芈月传》落幕了,然而关于剧中人物的讨论依然炽热。 很多人说这是一个女人和三个爱她的男人的故事,更多人在讨论月儿到...
    思舞阅读 436评论 1 2
  • 这是我现在很赞同很喜欢的观点也很认同推广这一理念。 因为我认为会生活甚至比会学习更重要。 生活的时间更长。 好的生...
    王丽燕199阅读 1,144评论 1 5