Privacy-Preserving Storage and Access of Medical Data Through Pseudonymization and Encryption

Privacy-Preserving Storage and Access of Medical Data Through Pseudonymization and Encryption: E-health allows better communication between health care providers and higher availability of medical data. However, the downside of interconnected systems is the increased probability of unauthorized access to highly sensitive records, which could result in serious discrimination against the patient. This article provides an overview of current privacy threats and presents a pseudonymization approach that preserves the patient’s privacy and data confidentiality. It allows primary use (direct care) of medical records by authorized health care providers and privacy-preserving secondary use (non-direct care) by researchers. The solution also addresses the identifying nature of genetic data by extending the basic pseudonymization approach with queryable encryption.

E-Health and the Need for Privacy: Today’s health care is driven by the goal of streamlining and optimizing processes in order to reduce costs without compromising the quality of patient treatment. E-health denotes the application of information and communication technologies (ICT) to support medical workflows and to improve communication between health care providers. Over the past years, interconnected systems such as electronic health records (EHR) have provided the technical infrastructure for facilitated document sharing by making documents digitally available, with the potential to increase the quality of health care while keeping costs at a controlled level [1]. However, facilitated access also means a higher chance of misuse. Sensitive information such as HIV infection data or drug abuse histories must therefore be adequately protected to prevent discrimination, such as denied insurance coverage. Even the mere probability of developing a serious illness may be sufficient grounds for denying health or life insurance coverage. A particular example of this form of prejudice is genetic discrimination, the biased treatment of people due to gene mutations that may cause or increase the risk of an inherited disorder [4], [2]. There are numerous documented cases where the results of so-called predictive genetic tests were disclosed to insurance companies, resulting in denied insurance coverage, although genetic tests usually deliver uncertain probabilities instead of clear-cut predictions of developing a genetic disorder. Genetic discrimination is also an issue in job applications and employment, where employees were fired because ‘unfavourable’ genetic test results made keeping them appear too ‘risky’.
Although legal acts such as the Genetic Information Nondiscrimination Act (GINA) [3], the Health Insurance Portability and Accountability Act (HIPAA) [12], and the EU Directive 95/46/EC [5] exist, technical solutions are still required to prevent the disclosure of medical records to unauthorized persons. At the same time, the vast amounts of digitized data produced in today’s health care environment should be available for secondary use, that is, non-direct care use of personal health information including (but not limited to) analysis and research, as well as quality and safety measurement [9]. Providing access to this rich source of information can help to expand knowledge about diseases and treatment and enhance the effectiveness and efficiency of health care, which in turn improves direct care for the individual patient. But considering reports on the buying and selling of non-anonymized patient and health care provider data by the medical industry without explicit consent from patients or physicians, making these data available poses a significant privacy risk. Enabling effective primary and secondary use of medical records is therefore a major challenge when developing appropriate privacy protection measures.

Anonymization and Encryption: Two techniques often mentioned when confidentiality and privacy of data are required are anonymization and encryption. Anonymization refers to removing the identifier from the medical data such that the records cannot be traced back to the corresponding patient [11]. Anonymization can be achieved by depersonalization, the removal of any patient-identifying information from the health records. Because perfect depersonalization, where the data subject is no longer identifiable under any circumstances, is practically impossible to achieve, the requirement can be relaxed to modifying the health data such that the corresponding patient can be identified either not at all or only with a ‘disproportionate amount of time, expense and labour’ (cf. [6]). A well-known anonymization technique is k-anonymity [10], where identifying information is removed in such a way that each person cannot be distinguished from at least k-1 other individuals by comparing the remaining data stored in the database. A particular downside of anonymization is that it cannot be reversed, which means that anonymized health data cannot be used for direct care or primary use, where the link between health data and the corresponding patient obviously needs to be known to the health care providers. Anonymization also has its downsides in secondary use, where it is usually applied: as patients can no longer be identified, they can neither be contacted to ask for necessary further information nor be directly informed of any results, and thus cannot immediately profit from advances in medical treatment. Anonymization may also be inadequate for securely storing genetic data due to their identifying nature.
The other technique, data encryption, is usually employed when data confidentiality is required. By fully encrypting health data with a secret key known only to the patient, his or her privacy can be assured as well. Native data encryption is provided by many major database vendors and prevents unauthorized disclosure of any sensitive data as long as the decryption key is kept secret and protected adequately. Unlike anonymization, full data encryption is obviously reversible, but the major problem is that secondary use of the records in research projects is entirely prevented unless the patient explicitly decrypts the data, thus revealing his or her identity. Considering the technically heterogeneous environment of health institutions, (authorized) sharing of encrypted records is also more complicated. Furthermore, encryption and decryption can be very time-consuming when large (monolithic) medical records such as imaging data are involved, rendering data access operations quite tedious.
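The k-anonymity property mentioned in this section can be illustrated with a short check. The following is a minimal sketch, not taken from the article: the function name `is_k_anonymous`, the choice of `zip` and `age` as quasi-identifying attributes, and the sample records are all assumptions made for illustration. It groups records by their quasi-identifier values and tests whether every group contains at least k records, so that each person is indistinguishable from at least k-1 others.

```python
from collections import Counter


def is_k_anonymous(records, quasi_identifiers, k):
    """Return True if every combination of quasi-identifier values
    occurs in at least k records (hypothetical helper for illustration)."""
    groups = Counter(
        tuple(record[attr] for attr in quasi_identifiers) for record in records
    )
    return all(count >= k for count in groups.values())


# Illustrative, already generalized records (invented sample data).
records = [
    {"zip": "101**", "age": "30-39", "diagnosis": "asthma"},
    {"zip": "101**", "age": "30-39", "diagnosis": "diabetes"},
    {"zip": "102**", "age": "40-49", "diagnosis": "asthma"},
    {"zip": "102**", "age": "40-49", "diagnosis": "influenza"},
]

print(is_k_anonymous(records, ["zip", "age"], 2))  # True: each group has 2 records
print(is_k_anonymous(records, ["zip", "age"], 3))  # False: no group has 3 records
```

The check only verifies the property; producing a k-anonymous table in the first place requires generalizing or suppressing attribute values, which this sketch does not attempt.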

Pseudonymization as a Solution: Pseudonymization combines the strengths of anonymization and (full) document encryption: it achieves unlinkability by introducing specifiers (pseudonyms) which cannot be associated with the patient without knowing a certain secret. Unlike plain anonymization, it is reversible. Therefore, with prior depersonalization of health records, it allows storing the records in an anonymized state, while this anonymity can be reversed by authorized persons who know the secret key. While pseudonymization itself also relies on cryptography (when no cleartext mapping/linking list is involved), only metadata need to be encrypted, and thus the necessary cryptographic overhead can be considerably reduced compared to fully encrypting the health documents. Figure 1 represents the difficulty of preserving both the patient’s privacy and data usability as a trade-off between privacy and transparency: both anonymization and encryption shift the emphasis toward privacy, compromising transparency, while secondary use without anonymization or data encryption discloses the link between patient and health data, compromising the patient’s privacy. Pseudonymization, however, is able to keep the balance between privacy and transparency.
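The idea of pseudonyms that can only be linked to a patient with knowledge of a secret can be sketched as follows. This is an illustrative simplification and not the scheme proposed in the article: it assumes pseudonyms are derived with a keyed hash (HMAC-SHA256) over a hypothetical patient identifier and record number, and the names `derive_pseudonym`, `PseudonymizedStore`, and `patient-4711` are invented for the example. The stored records are assumed to be depersonalized already and are kept in cleartext; only the link to the patient depends on the secret key.

```python
import hashlib
import hmac
import secrets


def derive_pseudonym(secret_key: bytes, patient_id: str, record_no: int) -> str:
    """Derive a pseudonym that cannot be linked to patient_id without secret_key
    (assumed construction: HMAC-SHA256 over patient id and record number)."""
    message = f"{patient_id}:{record_no}".encode("utf-8")
    return hmac.new(secret_key, message, hashlib.sha256).hexdigest()


class PseudonymizedStore:
    """Hypothetical store holding depersonalized records indexed by pseudonym only."""

    def __init__(self):
        self._records = {}

    def put(self, pseudonym, depersonalized_record):
        self._records[pseudonym] = depersonalized_record

    def get(self, pseudonym):
        # Primary use: a key holder re-derives the pseudonym and looks it up.
        return self._records.get(pseudonym)

    def all_records(self):
        # Secondary use: records are visible without any link to a patient.
        return list(self._records.values())


patient_key = secrets.token_bytes(32)  # secret held by the patient or an authorized person
store = PseudonymizedStore()
store.put(derive_pseudonym(patient_key, "patient-4711", 0), {"diagnosis": "asthma"})
store.put(derive_pseudonym(patient_key, "patient-4711", 1), {"diagnosis": "influenza"})

# With the secret, the link between patient and record can be re-established.
print(store.get(derive_pseudonym(patient_key, "patient-4711", 0)))

# Without the secret, a different key yields a pseudonym that matches nothing.
print(store.get(derive_pseudonym(secrets.token_bytes(32), "patient-4711", 0)))
```

In this sketch, reversibility means that a holder of the secret can recompute the pseudonyms and thereby re-link records to the patient. A design with encrypted metadata, as the text describes, would store the pseudonym-to-patient relation in encrypted form instead of recomputing it.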

Heurix-Neubauer_Privacy-Preserving Storage and Access of Medical Data through Pseudonymization and Encryption



  1. Chaudry, B., Wang, J., Wu, S., Maglione, M., Mojica, W., Roth, E., Morton, S.C., Shekelle, P.G.: Systematic review: Impact of health information technology on quality, efficiency, and costs of medical care. Annals of Internal Medicine 144(10), 742-752 (2006)
  2. Coalition of Genetic Fairness: Faces of genetic discrimination – How genetic discrimination affects real people (July 2004)
  3. Congress of the United States of America: Genetic Information Nondiscrimination Act (2008)
  4. Council for Responsible Genetics: Genetic discrimination (January 2001)
  5. European Union: Directive 95/46/EC of the European Parliament and of the Council of 24 October 1995 on the protection of individuals with regard to the processing of personal data and on the free movement of such data. Official Journal of the European Communities L 281, 31-50 (1995)
  6. Fischer-Hübner, S.: IT-Security and Privacy: Design and use of privacy-enhancing security mechanisms. Springer, Berlin (2001)
  7. Neubauer, T., Heurix, J.: A methodology for the pseudonymization of medical data. International Journal of Medical Informatics 80(3), 190-204 (2011)
  8. Roses, A.D.: Pharmacogenetics and the practice of medicine. Nature 405, 857-865 (2000)
  9. Safran, C., Bloomrosen, M., Hammond, W.E., Labkoff, S., Markel-Fox, S., Tang, P.C., Detmer, D.E.: Toward a national framework for the secondary use of health data: An American Medical Informatics Association white paper. Journal of the American Medical Informatics Association 14, 1-9 (2007)
  10. Sweeney, L.: k-Anonymity: A model for protecting privacy. International Journal on Uncertainty, Fuzziness and Knowledge-based Systems 10(5), 557-570 (2002)
  11. Thomson, D., Bzdel, L., Golden-Biddle, K., Reay, T., Estabrooks, C.A.: Central questions of anonymization: A case study of secondary use of qualitative data. Forum Qualitative Social Research 6, 29 (2005)
  12. United States Department of Health & Human Services: HIPAA Administrative Simplification: Enforcement; Final Rule. Federal Register / Rules and Regulations 71(32) (2006)