Written By: Steve Zagieboylo, Senior Architect at Calavista
In this post, I'm going to discuss some approaches to obtaining realistic test data without compromising the security of any customer's sensitive data. In the world of health care software, this is referred to as Protected Health Information (PHI), but the concept exists in financial software, document management, … really, everything.
Why is Realistic Test Data Important?
The data you test against must include a lot of completely unrealistic data: the corner cases that you have to make sure you handle correctly, but almost never come up in real life. However, it is equally important to have a good quantity of realistic data, justified by, “You just never know.” I can’t count the times that a special case came up in the real data that we hadn’t considered as a corner case. If your application handles large amounts of real-world data, there’s just no substitute for testing it with large amounts of real-world data.
What Aspects are Important?
- Quantity. Having too little data will hide a lot of performance problems, such as an N+1 problem in your queries (see the sketch after this list). You don't need the individual developers' systems to have the same quantity as a production system, but the primary QA server should.
- Quality. One of Calavista's customers has a very cool product that does intelligent analysis of the data produced in a hospital stay. This includes checks, for instance, that the drugs prescribed make sense given the age of the patient. If the data were just randomly generated with actual drugs and actual dates, but no consideration given to making them 'realistic' (i.e., drugs that a doctor would actually have prescribed for a patient with this profile and this diagnosis), then the intelligent analysis code would freak out. Tests created with random data would just be useless.
- Interconnectedness. The manner and scale in which the different parts of the data model connect should be realistically represented. For example, compare a business in which customers typically place 3 or 4 orders a year with one where they typically place thousands of orders a year. The approaches to both the UI and the data storage will be quite different.
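To make the N+1 point concrete, here is a minimal sketch (in Python, against a hypothetical miniature customers/orders schema) of the query pattern that looks instant against a handful of hand-made rows but becomes a million-and-one round trips against production-sized data:

```python
import sqlite3

# Hypothetical miniature schema; in practice this is your QA database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
    INSERT INTO customers VALUES (1, 'Acme'), (2, 'Globex');
    INSERT INTO orders VALUES (1, 1, 9.99), (2, 1, 19.99), (3, 2, 5.00);
""")

# The N+1 pattern: one query for the customers, then one more query per customer.
# With two hand-made customers this is three fast queries; with a million real
# customers it is a million and one round trips, and only a QA database with
# realistic volume will expose the problem before production does.
for customer_id, name in conn.execute("SELECT id, name FROM customers"):
    count = conn.execute(
        "SELECT COUNT(*) FROM orders WHERE customer_id = ?", (customer_id,)
    ).fetchone()[0]
    print(name, count)
```

The fix is a single joined query, but you only notice that you need it when the test data has production-like quantity.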
Approaches to Create Realistic Test Data
There are three general approaches:
- Generate the data by hand. Someone who is an expert in the system as well as in the specific field would have to create a realistic data set. Even with tools to support the effort, this would be a humongous undertaking, and it would still be a challenge to make the data "realistic," especially medical data. It would need to include Patients and Visits and Symptoms and Diagnoses that all fit correctly together and have realistic quantities.
- Use a tool that generates the data. This is industry specific, of course, but often there are tools that can do some of this work. In the case of medical data, for instance, there are some tools for generating, from whole cloth, data that adheres to the HL7 standard. There are several good choices if all you're looking for is legal HL7 messages, but nothing that can automatically generate a fully coherent set of data.
- Start with real data and obfuscate it. This approach is the only one we have been able to find that really solves the problem. The challenge here is to remove all sensitive information from the data but still leave it adequately realistic. This approach is the subject of the rest of this post.
Obfuscation / Data Masking
PHI data is more than just the names and addresses of the patients. It is important that even a determined investigator would not be able to take obfuscated data and reverse the process to figure out whom the data describes. Therefore, not only names but birthdates, dates of visits, locations, and even some specific combinations of symptoms might need to be obscured. However, changing the dates too much might change the analysis of the illness: symptoms of osteoarthritis are much more alarming in a twelve-year-old than the same symptoms in someone who is seventy-three. Proper obfuscation, or data masking, creates characteristically intact, but inauthentic, replicas of personally identifiable data or other highly sensitive data in order to uphold the complexity and unique characteristics of the data. In this way, tests performed on properly masked data will yield the same results as they would on the authentic dataset.
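As a minimal illustration of "characteristically intact, but inauthentic" (the field names and the fake-name list here are hypothetical, not taken from any particular tool), the sketch below replaces identifying fields but shifts every date for a given patient by one consistent random offset, so the age and the intervals between visits are preserved:

```python
import random
from datetime import date, timedelta

FAKE_NAMES = ["Alex Rivera", "Jordan Lee", "Sam Patel", "Casey Morgan"]

def obfuscate_patient(patient: dict) -> dict:
    """Return a masked copy: the identity is gone, the clinical shape is not."""
    # One consistent offset per patient keeps the age and the intervals between
    # visits intact, while making the real dates unrecoverable.
    offset = timedelta(days=random.randint(-90, 90))
    masked = dict(patient)
    masked["name"] = random.choice(FAKE_NAMES)
    masked["birth_date"] = patient["birth_date"] + offset
    masked["visit_dates"] = [d + offset for d in patient["visit_dates"]]
    # Diagnoses and prescriptions are left alone so that downstream analysis
    # still sees medically coherent data.
    return masked

patient = {
    "name": "Jane Doe",
    "birth_date": date(2011, 4, 2),
    "visit_dates": [date(2023, 1, 10), date(2023, 2, 14)],
    "diagnoses": ["juvenile osteoarthritis"],
}
print(obfuscate_patient(patient))
```

A real tool also has to worry about rare combinations of values that identify someone on their own, but the principle is the same: change what identifies, keep what characterizes.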
Tool Recommendations
SOCR Data Sifter
An open-source tool for obfuscation. It has a setting for exactly how obfuscated the data should be, where the lowest setting leaves the data unchanged and the highest setting leaves it unrecognizable.
Documentation Link
Source Code Link
SQL Server Dynamic Data Masking (DDM)
As of SQL Server 2016, there is a Dynamic Data Masking feature that allows you to specify data masking rules on the database itself, with no need to change the stored data. Whether a particular user sees the real or the masked data is based on user privileges. Here is the best overview of the feature that I've found: Use Dynamic Data Masking to Obfuscate Your Sensitive Data
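For a sense of what this looks like in practice, here is a minimal sketch of applying masking rules from a script. The table, columns, user, and connection string are hypothetical, and pyodbc is just one convenient way to issue the statements; the masking functions (partial, email, default) and the UNMASK permission are the standard DDM building blocks:

```python
import pyodbc

# Hypothetical connection; DDM requires SQL Server 2016+ or Azure SQL Database.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=qa-sql;"
    "DATABASE=qa_copy;Trusted_Connection=yes;",
    autocommit=True,
)
cur = conn.cursor()

# Masking rules live on the columns themselves; the stored data is unchanged.
cur.execute("ALTER TABLE dbo.Patients ALTER COLUMN LastName "
            "ADD MASKED WITH (FUNCTION = 'partial(1,\"XXXXXX\",0)')")
cur.execute("ALTER TABLE dbo.Patients ALTER COLUMN Email "
            "ADD MASKED WITH (FUNCTION = 'email()')")
cur.execute("ALTER TABLE dbo.Patients ALTER COLUMN BirthDate "
            "ADD MASKED WITH (FUNCTION = 'default()')")

# Ordinary QA accounts see the masked values; only approved users get UNMASK.
cur.execute("GRANT UNMASK TO PhiReviewer")
```

Keep in mind that DDM masks data on the way out of the database; the real values are still stored, so this is a complement to, not a replacement for, keeping the production database inside the protected zone.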
Roll Your Own
I am usually a big advocate of buy over build. The software you imagine is always bug-free and does just what you want; it isn't until you've spent a bunch of time working on it that it acquires the bugs and feature compromises that make you unhappy with the existing tools. In this case, however, it is a pretty small bit of code and purely internal, so if it chokes, you haven't embarrassed yourself in front of customers. The biggest risk is that you have to make sure that no sensitive data is getting past it, which is pretty straightforward to check. The biggest task is a serious review of the entire database schema to make sure all the important fields are being modified.
Approaches:
You can randomize the data just with SQL, with clever use of the RAND function. Here is a great article on this approach: Obfuscating Your SQL Server Data.
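A minimal sketch of that idea follows (the table and column names are hypothetical, and pyodbc is just one way to run it). Seeding RAND() with CHECKSUM(NEWID()) is the usual trick to get a different random value for every row, since a bare RAND() is evaluated once for the whole statement:

```python
import pyodbc

# Hypothetical connection to the cloned, still-protected copy of the data.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=prod-sql;"
    "DATABASE=ProdScratch;Trusted_Connection=yes;",
    autocommit=True,
)

# Replace names with synthetic values and jitter dates slightly, row by row.
conn.execute("""
    UPDATE dbo.Patients
    SET LastName  = 'Patient-' + RIGHT('000000' + CAST(PatientId AS varchar(10)), 6),
        BirthDate = DATEADD(day, CAST(RAND(CHECKSUM(NEWID())) * 60 - 30 AS int), BirthDate)
""")
```

A small random jitter keeps ages roughly realistic; if exact intervals between records matter to your analysis, shift all of a patient's dates by one consistent offset instead, as in the earlier sketch.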
If the code for your data model is cleanly isolated in a library, you can make a new application, pull in that library, and build your tool with that, using the same technology your developers are already comfortable with. If your code is not so cleanly isolated, a riskier approach is to write the obfuscation code inside the same code base, where it can only be triggered with a completely isolated command. I'd recommend requiring what amounts to two-factor authentication to trigger this code; just think what a disaster it would be if it were somehow built and then run inside the production system. It should only be included if a TOOLS_ONLY flag is set in the build process, and can only be run if the OBFUSCATION_TOOL configuration flag is set in the application configuration. Of course, neither flag should be on by default. Even so, carefully consider the impact that an error – or even a rogue developer – could have on a production system before using this approach.
This extra code might need additional controls on it as well. For instance, if you have tables that are flagged as insert-only, you will need to change the settings and add support at your lowest data-access layer so the obfuscation code can write the obfuscated versions of the records. Again, these functions should have two levels of enablement, such that they are not even built in the code that goes to production, and if they accidentally do get built, they do nothing unless the OBFUSCATION_TOOL configuration flag is set.
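Here is a minimal sketch of that double gate, using the TOOLS_ONLY and OBFUSCATION_TOOL flags described above. How the build flag excludes the module depends entirely on your build system, so the sketch only shows the runtime gate:

```python
import os
import sys

# Gate 1 (build time): this module is only packaged at all when the build runs
# with TOOLS_ONLY set, so a production build simply does not contain it.
# Gate 2 (run time): even if it is present, it refuses to do anything unless
# the environment explicitly opts in.
def require_obfuscation_enabled() -> None:
    if os.environ.get("OBFUSCATION_TOOL") != "1":
        sys.exit("Refusing to run: OBFUSCATION_TOOL is not set in this environment.")

def obfuscate_all_tables() -> None:
    require_obfuscation_enabled()
    # ... walk the schema and rewrite every sensitive column here ...
    print("Obfuscation pass complete.")

if __name__ == "__main__":
    obfuscate_all_tables()
```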
Steps:
These steps should be built into the DevOps tools so they can be run easily, or even automated to run every day; a sketch of such a script follows the list below. For one Calavista customer, we set up a process where part of the nightly process was to copy all the production data, obfuscate it, and install it in the QA system. It was incredibly helpful in tracking down tricky bugs, because even though the names and dates were different, the IDs of the records that triggered the problems were consistent, and those IDs were all that was saved in the logs.
- Clone the data. This is no harder than just taking a recent backup, standing up a new database instance, and restoring from backup. This MUST be done inside the controlled data space, where only users who are approved to see PHI data have access.
- Run the obfuscation code on the alternate data set, changing it in place. There are a number of tools for this step, with some recommendations above. Alternatively, you could create these tools yourself.
- Create a “backup” of the alternate data set, which now has only obfuscated data. Only this backup can leave the protected zone.
- Delete the alternate data set.
- In the non-protected zone where the QA systems live, bring in the obfuscated backup and restore it to the database where you want this latest data.
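The whole sequence is small enough to script. Here is a minimal sketch that drives sqlcmd from Python; the server names, database names, file paths, and logical file names are all hypothetical, and the obfuscation script in step 2 is whichever approach you chose above:

```python
import subprocess

def sqlcmd(server: str, *args: str) -> None:
    """Run sqlcmd against the given server, failing loudly on any SQL error."""
    subprocess.run(["sqlcmd", "-S", server, "-b", *args], check=True)

# 1. Clone: back up production and restore it as a scratch copy.
#    Everything through step 4 happens inside the protected zone.
sqlcmd("prod-sql", "-Q",
       "BACKUP DATABASE Prod TO DISK = N'D:\\backups\\prod.bak' WITH INIT")
sqlcmd("prod-sql", "-Q",
       "RESTORE DATABASE ProdScratch FROM DISK = N'D:\\backups\\prod.bak' "
       "WITH MOVE 'Prod' TO N'D:\\scratch\\ProdScratch.mdf', "
       "MOVE 'Prod_log' TO N'D:\\scratch\\ProdScratch_log.ldf', REPLACE")

# 2. Run the obfuscation script against the scratch copy, in place.
sqlcmd("prod-sql", "-d", "ProdScratch", "-i", "D:\\tools\\obfuscate.sql")

# 3. Back up the now-obfuscated copy; only this file ever leaves the protected zone.
sqlcmd("prod-sql", "-Q",
       "BACKUP DATABASE ProdScratch TO DISK = N'D:\\backups\\qa_data.bak' WITH INIT")

# 4. Delete the scratch copy.
sqlcmd("prod-sql", "-Q", "DROP DATABASE ProdScratch")

# 5. In the QA zone (after copying qa_data.bak across), restore it as the QA database.
sqlcmd("qa-sql", "-Q",
       "RESTORE DATABASE QA FROM DISK = N'E:\\backups\\qa_data.bak' "
       "WITH MOVE 'Prod' TO N'E:\\data\\QA.mdf', "
       "MOVE 'Prod_log' TO N'E:\\data\\QA_log.ldf', REPLACE")
```

Wire this into the nightly job and the QA environment gets a fresh, obfuscated copy of production every morning.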
Additional Consideration: Logging
This is more a general thought for any system with sensitive data, but it intersects with this approach. When logging specific record information, such as when you are logging that an unusual process is being invoked for a certain record, the developers should know to be conscious of what data is sensitive, and never to put that data in the log. For most applications, it is acceptable to put naked IDs in the logs, since those do not mean anything to someone without access to the database, anyway. Fortunately, the IDs are exactly what you want to be able to hand to developers who are going to be debugging against the obfuscated QA database. They still match up to the problem records, whereas the names, addresses, etc. do not.
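A tiny sketch of the habit (the function and logger names are made up):

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("billing")

def handle_unusual_case(patient_id: int, patient_name: str) -> None:
    # Log the opaque ID, never the name, birthdate, or address: the ID means
    # nothing without database access, but it still matches the obfuscated QA
    # copy, so a developer can find the exact record that caused the problem.
    log.warning("Unusual reprocessing path taken for patient id=%s", patient_id)

handle_unusual_case(8675309, "Jane Doe")
```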
Conclusion
When dealing with sensitive data of any sort, including health care data, it is important for QA systems to have realistic data upon which to operate. Having only small amounts of hand-generated data is going to allow bugs and problems to creep in, and will make them very difficult to track down when they are found in the production system. There is really no substitute for getting the real production data and having a copy of it, properly obfuscated, in the QA environment. Calavista recommends performing this step automatically, as part of the nightly process. Even though getting this set up represents work that has to go on the schedule, it more than pays for itself in the long run.