
DevOps1 Approach to Enterprise Data Masking

Introduction

Data is an asset as well as a liability. In a complex enterprise environment, large IT systems undergo enhancements and change requests all the time. It is critical to test software changes on production-like data to catch issues that might otherwise arise in production. For this purpose, enterprises commonly keep a copy of production data in non-production environments, usually UAT, Pre-Prod and SIT. This gives the development team the ability to test on real data, but it introduces significant data compliance risks.

Often, developers and quality assurance professionals from different external vendors are engaged in these projects, sometimes offsite. The risk of personally identifiable and sensitive data being mishandled is therefore high. There have been instances of communications being triggered to actual customers from test environments, which is a serious issue and can lead to reputational damage and significant damage control costs for the organization.

We need to find a balance between data compliance needs and the ability of the QA team to provide the same level of quality assurance as if the test data were real production data. This is a difficult task, further complicated by the ecosystem of ERP systems deployed in a hybrid architecture. Data masking is a process that converts datasets into a structurally similar but inauthentic (non-identifiable) version that can be used for testing and other activities such as user training. DevOps1's approach to data masking is tool agnostic and focuses on the specific challenge of finding the balance between compliance and QA, as depicted in Figure 1.

DevOps1 Approach to Enterprise Data Masking

DevOps1's approach to enterprise data masking is based on an overarching process with several key subprocesses: identifying the PII data (also known as data profiling), technical setup, tool selection, development of masking scripts, QA and handover. These subprocesses are shown in the diagram below and described in the sections that follow.

PII Data

Identifying the Personally Identifiable Information (PII) is a critical first step in enterprise data masking. DevOps1 has developed frameworks to engage the relevant stakeholders in defining and identifying PII data in the context of the business. The typical stakeholders for the identification of PII data are application owners, SMEs, the test team and the InfoSec/CyberSecurity team. DevOps1 recommends first developing an overarching definition of PII data and applying it consistently across applications to avoid confusion. The process should also cover identifying PII data in the future, whenever an application's data model changes.
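As a rough illustration of data profiling, the sketch below flags candidate PII columns from a sample of rows by checking column names and value patterns against one shared rule set. The rule names, regular expressions, threshold and sample data are all hypothetical; a real PII definition would be agreed with the stakeholders listed above, and the findings would be reviewed by them rather than trusted blindly.

```python
import re

# Hypothetical overarching PII definition:
# category -> (column-name hint, value pattern). Illustrative only.
PII_RULES = {
    "email": (re.compile(r"e.?mail", re.I),
              re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")),
    "phone": (re.compile(r"phone|mobile", re.I),
              re.compile(r"^\+?[\d\s\-()]{8,}$")),
}

def profile_columns(table, rows, threshold=0.8):
    """Flag columns whose name or sampled values match a PII rule.

    `rows` is a list of dicts (column name -> value) sampled from the table.
    Returns a sorted list of (table, column, category) tuples for review.
    """
    findings = set()
    columns = rows[0].keys() if rows else []
    for column in columns:
        # Ignore empty values so sparse columns are not under-counted.
        values = [str(r.get(column)) for r in rows
                  if r.get(column) not in (None, "")]
        for category, (name_hint, value_pattern) in PII_RULES.items():
            name_match = bool(name_hint.search(column))
            matches = sum(bool(value_pattern.match(v)) for v in values)
            value_ratio = matches / len(values) if values else 0.0
            if name_match or value_ratio >= threshold:
                findings.add((table, column, category))
    return sorted(findings)

sample = [
    {"cust_id": 1, "contact_email": "a.smith@example.com", "notes": "call back"},
    {"cust_id": 2, "contact_email": "j.doe@example.org", "notes": "n/a"},
]
print(profile_columns("customers", sample))
```

Applying the same rule set to every application is one way to keep the PII definition consistent, and re-running the profile after a data model change helps surface newly added PII columns.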

Tool Selection/Technical Design

Although there is a multitude of similar-looking toolsets for data masking, we cannot provide the best solution without understanding the infrastructure and environment landscape of the business. In our experience, a significant amount of effort may be consumed in arriving at the correct architecture for the data masking appliance. A proof of concept with an appropriate tool is highly recommended to test assumptions and resolve technical challenges early on.

Below are a few things we should be cognisant of while developing a masking architecture:

  1. The selected masking tool should support the authentication method approved by the client (for example, token-based vs SQL authentication). LDAP/S and other security-related configurations should be discussed
  2. The support needs of the masking tool should conform to the client's policy
  3. The access method for each environment should be taken into account while developing the masking architecture
  4. The data refresh approach should be confirmed
  5. Any connection to the production database (even with read-only access) should be avoided. Pre-prod can be used to import the production data for masking
  6. Referential integrity is a critical component of enterprise data masking. It cannot be achieved without understanding the data flow across systems. A standard approach to maintaining referential integrity across systems is to use the same masking key. This should be considered while developing the architecture
  7. Sometimes, development effort is required on the client side to fully automate the masking process; for example, developing triggers for batch processes, GitHub integration etc. This needs to be discussed in advance. DevOps1's approach is to split the work into two phases: the core masking activities are dealt with in phase 1, and further automation of the masking process follows in phase 2
  8. The version control and backup strategy should be confirmed
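To make the shared masking key idea in item 6 concrete, here is a minimal sketch of deterministic masking. It assumes HMAC-SHA256 as the keyed function, which is our choice for illustration rather than what any particular masking tool does; the key, token prefix and row layouts are hypothetical. Only the identifier column is masked here, for brevity.

```python
import hashlib
import hmac

def mask_value(value: str, key: bytes, prefix: str = "CUST") -> str:
    """Deterministically pseudonymise `value` with HMAC-SHA256.

    The same (value, key) pair always yields the same token, so an
    identifier masked in two systems still joins after masking.
    """
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"{prefix}-{digest[:12]}"

# Hypothetical key; in practice it would come from a secrets store,
# never from source code.
MASKING_KEY = b"example-key-not-for-real-use"

crm_row = {"customer_id": "C-1001", "name": "Alice Smith"}
billing_row = {"customer_id": "C-1001", "amount": 42.50}

masked_crm = {**crm_row,
              "customer_id": mask_value(crm_row["customer_id"], MASKING_KEY)}
masked_billing = {**billing_row,
                  "customer_id": mask_value(billing_row["customer_id"], MASKING_KEY)}

# The masked keys still match across systems and differ from the original.
assert masked_crm["customer_id"] == masked_billing["customer_id"]
assert masked_crm["customer_id"] != crm_row["customer_id"]
```

If two systems were masked with different keys, the same customer would receive two unrelated tokens and cross-system joins would break, which is why the key has to be an architectural decision rather than a per-application one.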

Data Masking

DevOps1 proposes an iterative approach to data masking. In iteration 1, we focus on straightforward scenarios which can be handled using out-of-the-box algorithms. We have also experienced significant data quality issues with some legacy applications, and these need to be handled as part of iteration 1. Iteration 2, and possibly 3, are about handling corner cases and trying various options (such as segmented masking vs obfuscation) before deciding upon the best approach. DBAs, environment support and QA should be involved in this process. Referential integrity tests should be carried out continuously while masking data across systems. A full refresh in iteration 2 or 3, after all the issues have been resolved, will ensure that masking is working end to end. A pictorial description of the masking process is given in Figure 3.
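A referential integrity test of the kind mentioned above can be as simple as looking for child records whose masked key no longer has a matching parent. The sketch below works on in-memory rows with hypothetical table and column names; against real databases the same check would typically be an anti-join query.

```python
def find_orphans(child_rows, parent_rows, fk, pk):
    """Return child rows whose foreign key has no matching parent key."""
    parent_keys = {row[pk] for row in parent_rows}
    return [row for row in child_rows if row[fk] not in parent_keys]

# Hypothetical masked extracts from two systems.
masked_customers = [{"customer_id": "CUST-a1"}, {"customer_id": "CUST-b2"}]
masked_orders = [
    {"order_id": 1, "customer_id": "CUST-a1"},
    {"order_id": 2, "customer_id": "CUST-zz"},  # masked inconsistently
]

orphans = find_orphans(masked_orders, masked_customers,
                       fk="customer_id", pk="customer_id")
print(orphans)
```

Running such a check after every iteration, rather than only after the final full refresh, makes it easier to tell which masking rule introduced a break.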

Quality Assurance

QA is an important step in the masking process. DevOps1 recommends that QA activities be carefully planned rather than treated as an afterthought. We propose three layers of QA activities as part of the masking process.

  1. Technical Tests: Mostly performed by DBAs and Data Engineers, technical tests involve back-end verification and comparison of the masked records against the production data. Modern toolsets usually provide a good level of reporting on the masking activity, at table and record levels. Performance-related parameters should also be analysed as part of the technical verification
  2. UI/Regression Tests: These are usually performed by system testers. We recommend taking screenshots of key fields containing PII data from the UI before masking and comparing them with the post-masking values. Screenshots should be captured on several UI screens that reference the same data, to verify that the coverage of masking is adequate. If available, automated test cases (regression suites) should also be used to run a general sanity check on the application. A few tests are application specific and benefit from SME expertise; for example, in applications like SAP, even if data is masked correctly in the back-end tables, it is not reflected in the UI until the views are refreshed. Some search functions may also no longer work as before because fields such as email ID have been masked
  3. End to End Tests: This is the most critical part of the quality assurance process, and the most time consuming. The complexity in the masking process arises mainly from the data flow across systems. Testing the critical end to end scenarios that verify referential integrity across systems with minimal effort requires careful planning and a good understanding of the test cases. DevOps1 has developed a framework to identify the data flow across systems and prioritize test cases that test referential integrity. A high-level overview of the matrix used to elicit such test cases is given below.
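Returning briefly to the technical tests in the first layer, the comparison against production data could be sketched as follows. The sketch assumes each table has a surrogate key that is not itself masked, so original and masked rows can be paired; the function name, column names and sample rows are hypothetical.

```python
def unmasked_records(original_rows, masked_rows, key, pii_columns):
    """Return (key value, column) pairs where the masked value
    still equals the original value."""
    masked_by_key = {row[key]: row for row in masked_rows}
    leaks = []
    for row in original_rows:
        masked = masked_by_key.get(row[key])
        if masked is None:
            continue  # row missing after masking; report separately if needed
        for column in pii_columns:
            if row[column] is not None and row[column] == masked[column]:
                leaks.append((row[key], column))
    return leaks

# Hypothetical before/after extracts of the same table.
original = [
    {"id": 1, "email": "a.smith@example.com", "phone": "0400 111 222"},
    {"id": 2, "email": "j.doe@example.org", "phone": "0400 333 444"},
]
masked = [
    {"id": 1, "email": "user1@masked.invalid", "phone": "0400 000 001"},
    {"id": 2, "email": "user2@masked.invalid", "phone": "0400 333 444"},
]

print(unmasked_records(original, masked, key="id", pii_columns=["email", "phone"]))
```

A value that is unchanged is not always a defect (a masking algorithm can occasionally map a value to itself), so the output is best read as a list of records to inspect rather than a pass/fail verdict.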

The matrix provides an overview of the systems that each end to end test case spans. The numbering depicts the data flow (starting and end point). The matrix can be varied to align with project needs; however, the goal is to have a single view of the data flow required to execute the critical end to end test cases. The effort to develop the matrix, and to write and execute the test cases, should be part of the masking effort estimation.
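A matrix of this shape can also be kept as plain data so that it is easy to query. The sketch below is only an illustration: the systems, test case names and step numbers are invented, and ranking test cases by the number of systems they span is one possible prioritization rule that we assume here, not a description of the framework itself.

```python
# Hypothetical systems and end to end test cases. Each cell is the step
# number at which the test case touches that system (1 = starting point).
SYSTEMS = ["CRM", "Billing", "Warehouse", "Reporting"]

FLOW_MATRIX = {
    "TC1 new customer order": {"CRM": 1, "Billing": 2, "Reporting": 3},
    "TC2 address change": {"CRM": 1, "Warehouse": 2},
    "TC3 invoice reprint": {"Billing": 1},
}

def flow_path(steps):
    """Return the systems of one test case in data flow order."""
    return [system for system, _ in sorted(steps.items(), key=lambda item: item[1])]

def prioritise(matrix):
    """Order test cases so those spanning the most systems come first."""
    return sorted(matrix, key=lambda name: (-len(matrix[name]), name))

def uncovered(systems, matrix):
    """Return systems that no test case in the matrix touches."""
    covered = {system for steps in matrix.values() for system in steps}
    return [system for system in systems if system not in covered]

for name in prioritise(FLOW_MATRIX):
    print(name, "->", " > ".join(flow_path(FLOW_MATRIX[name])))
print("Uncovered systems:", uncovered(SYSTEMS, FLOW_MATRIX))
```

Keeping the matrix as data makes gaps visible early: a system that appears in no test case is a system whose masked data is never verified end to end.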

Governance

DevOps1 has developed its own governance model for managing all types of Quality Engineering projects, including Data Masking. Our governance model is based on the proactive management of risks and issues, real-time traceability and effective management of stakeholder expectations. We can provide frameworks to manage the data masking project as part of Agile, Waterfall or any other delivery model, as applicable. In an environment involving multiple applications, we recommend a three-layered approach to maintaining the stories related to masking work, as shown in the diagram below.

Conclusion

Data masking for a standalone database is relatively straightforward. However, when multiple interconnected applications are involved, it becomes a complex engineering task that requires careful planning and experienced execution. The focus should also be on establishing data masking as an ongoing process within test data management, rather than as a one-off, point-in-time activity. DevOps1's approach and experience in this space can provide significant value on both the process and the technology side. Please reach out to us to discuss your data compliance challenges and our approach to solving them.
