Pseudonymization with Metadata Encryption for Privacy-Preserving Searchable Documents

The average costs of data leakage are steadily on the rise. As a consequence, several data security and access control mechanisms have been introduced, ranging from data encryption to intrusion detection and role-based access control, which are effective in protecting sensitive information. However, the majority of these concepts are centrally controlled by administrators, who are one of the major threats to corporate security. This work presents a security protocol for data privacy that is strictly controlled by the data owner. To this end, we integrate pseudonymization and encryption techniques to create a methodology that uses pseudonyms as an access control mechanism, protects secret cryptographic keys by a layer-based security model, and provides privacy-preserving querying.


Introduction

With the quantities of stored data steadily on the rise, keeping these vast amounts of information secure has become a major challenge. Sensitive corporate data must be protected at all costs from being leaked to unauthorized persons; otherwise organizations face massive direct and indirect costs. For example, the leaking of construction plans for a critical component to a competitor may severely set back the company that spent considerable amounts of money on the development; a minor financial institution may not be able to cope with the loss of customer confidence resulting from a security incident involving the theft of customers’ account information.

Not only companies but also individuals have to be concerned with data security: an individual’s health information leaked to the wrong party may result in severe adverse consequences for this person. For example, information about a person’s health status may cause insurance companies to deny coverage. In the e-health sector, health information is often shared with multiple parties, resulting in a significant compromise of patients’ privacy [1]. Although security is considered a critical factor in health information systems, security gaps still exist [2].

In the past, several effective data security and access control mechanisms have been introduced, ranging from data encryption to intrusion detection and role-based access control. However, the majority of these concepts are centrally controlled by administrators, who define which persons are allowed to access which data. As long as these mechanisms are not circumvented and the administrators can be trusted, one can expect an adequate level of data security and privacy. But careless configuration or implementation may result in holes in the security architecture.
Internal attackers in particular, e.g., disgruntled employees, are a major threat to corporate security when they exceed their access rights and leak sensitive information to the highest bidder. The most dangerous internal adversaries are malicious administrators, who are usually endowed with full access rights to do their jobs. The majority of current security concepts cannot protect against this type of attacker. External attackers who exploit weaknesses in corporate security layers are also able to acquire direct access to sensitive data unless it is explicitly protected, as incidents involving Citigroup and Heartland Payment Systems and, more recently, the SONY hack have demonstrated [3].

Therefore, this work presents a security architecture for data privacy that is strictly controlled by the data owner, i.e., the data owner decides who is granted access to the data, which removes the need to trust administrators, especially database administrators. As relying on a single security strategy has its downsides, we integrate pseudonymization and encryption techniques to overcome their individual shortcomings and create a protocol that uses pseudonyms as an access control mechanism, protects secret cryptographic keys by a layer-based security model, and supports privacy-preserving querying.

Background

From a conceptual point of view, two approaches to data confidentiality and data leakage prevention exist: (i) limit access by a dedicated access control system and (ii) modify and persist the data records themselves in such a way that a potential attacker does not gain any useful information, i.e., data masking. Resource modification in this manner can be achieved either by making the data unreadable for unauthorized parties (encryption) or by disassociating individual data items.
Data disassociation assumes that the main property of data records that needs to be kept secret is the association between the items, not the items themselves (e.g., if the individual items are publicly available). Therefore, data disassociation “encrypts and delinks the data held about an individual from the individual’s identity” [4]. Anonymization and the closely related pseudonymization are examples of techniques based on data disassociation.

Traditional Access Control (in the context of this work) refers to limiting access to resources by a dedicated access control module that defines and decides which actors are allowed to access which resources, or in other words, explicit access control. Role-based access control (RBAC), for example, bases its decision on the role the actor currently assumes. The access rights are defined as rules or policies and can be expressed, e.g., in the eXtensible Access Control Markup Language (XACML). For RBAC, a policy expressed in XACML [5] includes (among others) elements for the particular role (actor), the resources (object), and the permitted actions on the resources (create, retrieve, etc.). These policies are created by the policy administration point (PAP). Access requests are checked against one or more policies and thus granted or denied by the policy decision point (PDP), and the decision is enforced by the policy enforcement point (PEP).
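
The division of labor between PAP, PDP, and PEP can be illustrated with a minimal sketch. It is not XACML syntax and not part of the presented protocol; the rule structure, the role and resource names, and the functions `decide` and `enforce` are assumptions chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    role: str            # role the actor currently assumes
    resource: str        # protected object
    actions: frozenset   # actions permitted on that object


# Rules as they might be authored at a policy administration point (PAP).
# All names below are hypothetical.
POLICIES = [
    Rule("physician", "health_record", frozenset({"create", "retrieve", "update"})),
    Rule("nurse", "health_record", frozenset({"retrieve"})),
]


def decide(role, resource, action, policies=POLICIES):
    """Policy decision point (PDP): Permit if any rule matches, otherwise Deny."""
    for rule in policies:
        if rule.role == role and rule.resource == resource and action in rule.actions:
            return "Permit"
    return "Deny"


def enforce(role, resource, action, operation):
    """Policy enforcement point (PEP): run the operation only after a Permit."""
    if decide(role, resource, action) != "Permit":
        raise PermissionError(f"{role} may not {action} {resource}")
    return operation()
```

Note that the data itself is untouched in this model: whoever can bypass `enforce` or edit `POLICIES` reads everything, which is the weakness discussed under Limitations.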

Disassociation techniques include anonymization, which can be achieved by depersonalization, i.e., the systematic removal of the individual’s identifying information from data records such that the records cannot be traced back to the corresponding individual. K-anonymity [6] deals with the existence of quasi identifiers (i.e., elements that are not identifying per se, but may be when grouped together, such as ZIP code combined with last name and birth date) by using generalization and suppression techniques to create equivalence groups. Some extensions of the basic k-anonymity approach deal with the problem of similarity of data tuples within an equivalence group (l-diversity [7]) and the distribution and semantic distance of specific sensitive attributes within an equivalence block in the complete dataset (t-closeness [8]) to further reduce the risk of re-identification. While anonymization is non-reversible, pseudonymization is a similar technique with the difference that identifying information is not permanently deleted but separated from the data records and referenced by a specifier, the pseudonym. Thereby, the process of depersonalization is reversible under specified and controlled circumstances, i.e., when knowing a particular secret. Pseudonymization is often used in identity management (e.g., [9]) but is also applied in other areas such as e-health. In this context, pseudonyms are generally used as ‘secret’ links between patients and health records, where the links are recoverable only by authorized parties (cf. [10], [11], [12]).
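
To make the notion of equivalence groups concrete, the following sketch checks whether a small table is k-anonymous before and after generalizing its quasi identifiers. The records, the chosen generalization (three ZIP digits, birth decade), and the function names are assumptions for illustration; this is not the procedure of [6].

```python
from collections import Counter


def exact(record):
    """Quasi identifiers without any generalization."""
    return (record["zip"], record["birth_year"])


def generalize(record):
    """Coarsened quasi identifiers: first three ZIP digits and the birth decade."""
    return (record["zip"][:3] + "**", record["birth_year"] // 10 * 10)


def is_k_anonymous(records, k, quasi=generalize):
    """True if every equivalence group of quasi-identifier values has at least k members."""
    groups = Counter(quasi(r) for r in records)
    return all(size >= k for size in groups.values())


# Hypothetical sample data.
records = [
    {"zip": "10115", "birth_year": 1981, "diagnosis": "asthma"},
    {"zip": "10117", "birth_year": 1984, "diagnosis": "diabetes"},
    {"zip": "10119", "birth_year": 1989, "diagnosis": "asthma"},
    {"zip": "20095", "birth_year": 1972, "diagnosis": "flu"},
    {"zip": "20099", "birth_year": 1975, "diagnosis": "flu"},
]

print(is_k_anonymous(records, 2, quasi=exact))  # False: every record is unique
print(is_k_anonymous(records, 2))               # True: groups of size 3 and 2
```

In the generalized table, both members of the second group carry the same diagnosis, so knowing that a person belongs to that group already reveals the sensitive value. This is the kind of homogeneity that l-diversity is meant to address.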

Encryption is the straightforward approach to shield sensitive data from unauthorized glances. The main issue here is how to efficiently query within encrypted data. The naive solution is to transmit the entire encrypted database to a trusted machine where the data is decrypted and then processed as usual. A more efficient and sophisticated approach involves special encryption techniques or some kind of precalculated index created by the data owner or data provider and post-filtering the result set. The simplest form is a hash-based or encryption-based index over individual attribute values. In [13], [14], encrypted table rows are stored along with a set of hash-based indexes, depending on which of the table columns are required for queries. Another approach is described in [15] where buckets are introduced, spanning over a pre-defined range of the attributes’ domain values. Each bucket is assigned an identifier which serves as index. Other approaches exploit the hierarchical structure of XML documents. For example, in [16], [17], [18], an XML document is stored as a set of (disjoint) document fragments, and crypto-indexes are used to facilitate the search. To answer a query on the structure or the content of an XML document, the crypto-indexes are scanned and all matching document fragments are transmitted to the client. The client then decrypts the fragments and performs some postprocessing on the retrieved fragments in order to obtain the final query result.
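
The bucket-based index with client-side post-filtering can be sketched as follows. This is a simplified illustration in the spirit of [15], not its actual scheme: the bucket width, the key constants, and all function names are assumptions, and the HMAC-based XOR stream cipher is only a standard-library stand-in for a vetted authenticated cipher.

```python
import hashlib
import hmac
import json

ENC_KEY = b"hypothetical encryption key"   # known to the client only
IDX_KEY = b"hypothetical index key"        # known to the client only
BUCKET_WIDTH = 10                          # each bucket spans ten age values


def _keystream(nonce, length):
    """Toy keystream from HMAC-SHA256 in counter mode (illustration only)."""
    out, counter = b"", 0
    while len(out) < length:
        out += hmac.new(ENC_KEY, nonce + counter.to_bytes(4, "big"), hashlib.sha256).digest()
        counter += 1
    return out[:length]


def encrypt(nonce, data):
    return bytes(a ^ b for a, b in zip(data, _keystream(nonce, len(data))))


decrypt = encrypt  # XOR with the same keystream inverts the operation


def bucket_id(value):
    """Opaque identifier of the bucket whose range covers `value`."""
    low = value // BUCKET_WIDTH * BUCKET_WIDTH
    return hmac.new(IDX_KEY, f"age:{low}".encode(), hashlib.sha256).hexdigest()[:16]


def store(rows):
    """Client side: build the table handed to the untrusted server."""
    table = []
    for i, row in enumerate(rows):
        nonce = i.to_bytes(8, "big")  # unique per row
        table.append((bucket_id(row["age"]), nonce, encrypt(nonce, json.dumps(row).encode())))
    return table


def server_select(table, wanted):
    """Server side: compares bucket identifiers only, never sees plaintext."""
    return [(nonce, blob) for bid, nonce, blob in table if bid in wanted]


def query_age_range(table, lo, hi):
    """Client side: map the range to buckets, then decrypt and post-filter."""
    first = lo // BUCKET_WIDTH * BUCKET_WIDTH
    wanted = {bucket_id(v) for v in range(first, hi + 1, BUCKET_WIDTH)}
    rows = [json.loads(decrypt(n, blob)) for n, blob in server_select(table, wanted)]
    return [r for r in rows if lo <= r["age"] <= hi]


# Hypothetical sample data.
rows = [{"name": n, "age": a} for n, a in
        [("A", 23), ("B", 27), ("C", 31), ("D", 38), ("E", 45)]]
table = store(rows)
print([r["age"] for r in query_age_range(table, 25, 34)])  # [27, 31]
```

For the range 25 to 34 the server returns all four rows from the buckets 20-29 and 30-39; the client discards 23 and 38 after decryption. Wider buckets reveal less about the value distribution to the server but increase the number of false positives that must be transmitted and filtered.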

Limitations. Traditional access control mechanisms are secure as long as the architecture is intact. If the access control module is circumvented by an external attacker, all data records are at risk of being leaked due to the lack of further protection mechanisms. Actors such as system or database administrators usually have unrestricted access to all sensitive information as well. Traditional access control, usually in the form of RBAC, is the predominant protection mechanism in electronic health care, for example in Austria’s ELGA (electronic health record) or the UK National Health Service. For instance, the IHE (Integrating the Healthcare Enterprise) standard, which defines how to exchange health data, requires that data is protected adequately by policy-based access control, but encryption or disassociation is not mentioned [19]. The HIPAA (Health Insurance Portability and Accountability Act) does not mandate encryption either; it is stated as optional [20]. In health care, most RBAC systems have exception mechanisms to circumvent normal access control, and these are often overused [21]. Privacy violation incidents have shown that hospital employees do exploit their (technical) access rights [22].

Disassociation and encryption alter the structure of the stored information and thereby also protect against internal attackers, as long as the involved cryptographic keys are secure, but they have other drawbacks. Anonymization techniques entail a loss of information and data accuracy, thus limiting data expressiveness. They cannot be reversed either, restricting their applicability to secondary use of the data pool only (e.g., for statistical or research purposes). Data encryption, on the other hand, prevents efficient secondary use unless the data is explicitly decrypted, which can be a major disadvantage, especially in e-health where secondary use of medical records for research is an important factor.
Data sharing, i.e., access authorization, is also difficult to handle with encryption, requiring either the redundant storage of encrypted data (if re-encrypted with the authorized person’s personal key) or sharing the secret decryption key itself. The latter makes de-authorization rather tedious, demanding re-encryption of the particular record and re-issuing the new decryption key to all other still authorized persons. Decryption can also become a performance issue when the records are very large and processing power is limited.

Pseudonymization of the sensitive data records supports both privacy-preserving primary and secondary use as long as the records are diligently depersonalized. The issue with pseudonymization is how to realize privacy-preserving querying: the metadata for searching must not contain any sensitive keywords, to prevent leakage of critical information. In other words, the domain of keywords must be standardized and highly structured. No arbitrary and therefore potentially compromising keywords should be allowed.
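
The two requirements stated above, separating identity from content via a pseudonym and restricting search metadata to a standardized vocabulary, can be sketched as follows. This is a simplified in-memory illustration and not the protocol presented in this work: the class, the vocabulary, and the plain dictionary holding the links are assumptions, and how the links are protected cryptographically is deliberately left out.

```python
import secrets

# Hypothetical standardized keyword domain; free-text keywords are rejected.
ALLOWED_KEYWORDS = {"radiology", "laboratory", "discharge_summary"}


class PseudonymStore:
    def __init__(self):
        # pseudonym -> (keywords, depersonalized body); may be visible to the server
        self.documents = {}
        # owner id -> pseudonyms; this link is the secret that must be protected
        self.links = {}

    def add(self, owner_id, body, keywords):
        unknown = set(keywords) - ALLOWED_KEYWORDS
        if unknown:
            raise ValueError(f"keywords outside the vocabulary: {sorted(unknown)}")
        pseudonym = secrets.token_hex(16)  # random, carries no information about the owner
        self.documents[pseudonym] = (frozenset(keywords), body)
        self.links.setdefault(owner_id, []).append(pseudonym)
        return pseudonym

    def search(self, keyword):
        """Metadata search; the result contains pseudonyms only, no identities."""
        return [p for p, (kw, _) in self.documents.items() if keyword in kw]

    def owner_documents(self, owner_id):
        """Re-association, possible only for whoever holds the secret links."""
        return [self.documents[p][1] for p in self.links.get(owner_id, [])]
```

A search for `"radiology"` returns pseudonyms of matching documents without revealing whose they are, while a keyword such as a specific diagnosis attached as free text would be rejected at insertion time.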



[1] Lechler T., Wetzel S., Jankowski R., Identifying and Evaluating the Threat of Transitive Information Leakage in Healthcare Systems, Proc. of the 44th Hawaii International Conference on System Sciences, 1-10, 2011
[2] Luethi M., Knolmayer G.F., Security in Health Information Systems: An Exploratory Comparison of U.S. and Swiss Hospitals, Proc. of the 42nd Hawaii International Conference on System Sciences, 1-10, 2009
[3] Grocer S., Sony, Citi, Lockheed: Big Data Breaches in History, The Wall Street Journal, June 9, 2011
[4] Eggers W.D., Government 2.0: Using Technology to Improve Education, Cut Red Tape, Reduce Gridlock, and Enhance Democracy, Rowman and Littlefield, 2007
[5] OASIS, eXtensible Access Control Markup Language (XACML) Version 3.0, Committee Specification 01, 2010
[6] Sweeney L., k-Anonymity: A Model for Protecting Privacy, International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, 10(5), 557-570, 2002
[7] Machanavajjhala A., Kifer D., Gehrke J., Venkitasubramaniam M., l-Diversity: Privacy Beyond k-Anonymity, ACM Transactions on Knowledge Discovery from Data, 1(1), 2007
[8] Li N., Li T., Venkatasubramanian S., t-Closeness: Privacy Beyond k-Anonymity and l-Diversity, IEEE 23rd International Conference on Data Engineering, 106-115, 2007
[9] Camenisch J., Shelat A., Sommer D., Fischer-Hübner S., Hansen M., Krasemann H., Lacoste G., Lenes R., Tseng J., Privacy and Identity Management for Everyone, Proc. of the 2005 Workshop on Digital Identity Management, 20-27, 2005
[10] Thielscher C., Gottfried M., Umbreit S., Boegner F., Haack J., Schroeders N., Data Processing System for Patient Data, Int. Patent, WO 03/034294 A2, 2005
[11] Noumeir R., Lemay A., Lina J., Pseudonymization of Radiology Data for Research Purposes, Journal of Digital Imaging, 20(3), 284-295, 2007
[12] NEMA, Digital Imaging and Communications in Medicine, Standard, 2008
[13] Damiani E., di Vimercati S.D.C., Jajodia S., Paraboschi S., Samarati P., Balancing Confidentiality and Efficiency in Untrusted Relational DBMSs, ACM Conference on Computer and Communications Security, 93-102, 2003
[14] Damiani E., di Vimercati S.D.C., Finetti M., Paraboschi S., Samarati P., Jajodia S., Implementation of a Storage Mechanism for Untrusted DBMSs, Proc. of the 2nd IEEE International Security in Storage Workshop, 38-46, 2004
[15] Hacigümüs H., Iyer B., Li C., Mehrotra S., Executing SQL over Encrypted Data in the Database-Service-Provider Model, Proc. of the 2002 ACM SIGMOD International Conference on Management of Data, 216-227, 2002
[16] Yang Y., Ng W., Lau H.L., Cheng J., An Efficient Approach to Support Querying Secure Outsourced XML Information, Conference on Advanced Information Systems Engineering, 157-171, 2006
[17] Jammalamadaka R.C., Mehrotra S., Querying Encrypted XML Documents, 10th International Database Engineering and Applications Symposium, 129-136, 2006
[18] Lee J.G., Whang K.Y., Secure Query Processing against Encrypted XML Data using Query-Aware Decryption, Information Sciences, 176(13), 1928-1947, 2006
[19] Integrating the Healthcare Enterprise (IHE), IHE IT Infrastructure (ITI) Technical Framework 7.0, 2010
[20] U.S. Department of Health & Human Services, HIPAA Administrative Simplification, Federal Register, 2006
[21] Røstad L., Edsberg O., A Study of Access Control for Healthcare Systems Based on Audit Trails from Access Logs, Proceedings of the 22nd Annual Computer Security Applications Conference, 175-186, 2006
[22] Lerner M., Allina Hospitals Fire 32 Over Privacy Violation, StarTribune, May 6, 2011
[23] Abouakil D., Heurix J., Neubauer T., Data Models for the Pseudonymization of DICOM Data, Proc. of the 44th Hawaii International Conference on System Sciences, 1-11, 2011
[24] Neubauer T., Heurix J., A Methodology for the Pseudonymization of Medical Data, Journal of Medical Informatics, 80(3), 190-207, 2011
[25] Heurix J., Neubauer T., Privacy-Preserving Storage and Access of Medical Data through Pseudonymization and Encryption, Proc. of the 8th International Conference on Trust, Privacy and Security in Digital Business, 186-197, 2011
[26] Schrefl M., Dorn J., Grün K., SemCrypt – Ensuring Privacy of Electronic Documents through Semantic based Encrypted Query Processing, Proc. of the International Workshop on Privacy Data Management, 2005
[27] Grün K., Karlinger M., Schrefl M., Schema-aware Labeling of XML Documents for Efficient Query and Update Processing in SemCrypt, Computer Systems Science and Engineering, 21(1), 65-82, 2006
[28] Shamir A., How to Share a Secret, Communications of the ACM, 22(11), 612-613, 1979



Heurix, J.; Karlinger, M.; Neubauer, T., “Pseudonymization with Metadata Encryption for Privacy-Preserving Searchable Documents,” in 2012 45th Hawaii International Conference on System Science (HICSS), pp. 3011-3020, 4-7 Jan. 2012
doi: 10.1109/HICSS.2012.491