During a customer engagement we’re always balancing security against agility: gaining access to the data we need to conduct our analysis and deliver value back to the customer.  In the early stages particularly, there’s a process of building trust and understanding in both directions.  To help with this we’ve developed some principles and standards that ensure we balance these concerns effectively.

Our number one overarching principle is:

“If we don’t need it, we don’t want it!”

A no-brainer really.  It’s our default position at file level, but we apply the principle to every column in a structured dataset too.  Before we accept any real data, our first step is to request and assess a highly anonymised, small sample set.  This sample file is securely transferred into our RAMP platform, which immediately gives confidence through built-in security delivered in a repeatable way via infrastructure and deployment automation.  That means we can focus on assessing the data and identifying fields that contain Personally Identifiable Information (PII) without worrying about the security of the platform.

We’re then able to create our Anonymisation Spec, giving the customer clear guidance on what data we need and, more importantly, the anonymisation rules that must be applied prior to data delivery.  One thing to be mindful of: whilst it’s easy to get a list of PII fields to look out for, it’s just as important to understand how a combination of values may allow an individual to be identified, and it’s this that sometimes trips people up.  Applying our omission principle, any columns we don’t need are marked accordingly, and just like an interface spec, the anonymisation spec provides clear boundaries of responsibility.  This is critical where security is concerned, as it helps ensure there are no gaps.
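To make the idea concrete, here’s a minimal sketch of how such a spec might be expressed in code. The column names and rule names are hypothetical examples, not our actual spec format:

```python
# Hypothetical anonymisation spec: each column maps to the rule the
# customer must apply before delivery. "omit" = don't send it at all.
ANONYMISATION_SPEC = {
    "customer_id":   "hash",        # one-way hash before delivery
    "first_name":    "omit",        # not needed -- we don't want it
    "last_name":     "omit",
    "date_of_birth": "genericise",  # e.g. bucket into age ranges
    "postcode":      "truncate",    # e.g. keep outward code only
    "order_value":   "keep",        # needed for analysis, no PII risk
}

def columns_to_deliver(spec):
    """Return only the columns the customer should actually send,
    applying the omission principle to everything marked 'omit'."""
    return [col for col, rule in spec.items() if rule != "omit"]
```

Making the “omit” columns explicit in the spec, rather than simply leaving them out, is what gives those clear boundaries of responsibility: both sides can see exactly what was considered and rejected.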

On your marks…

This gives us a starting position but we can (and pretty much always do!) go back to the customer with further revisions requesting access to more information. Additional controls can be applied to the delivery of more sensitive data items, but the starting point is to keep it simple.

As well as building the anonymisation spec, this Data Assessment phase provides other benefits too:

  • It helps us familiarise ourselves with the data and begin to understand the values, distributions, correlations and other patterns
  • It helps uncover any data quality issues early on in the project
  • Wrangling with the data even at this early stage always provides some useful insights and observations
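A first pass over the sample file can surface all three of the benefits above. As a sketch, a tiny standard-library profiler over a hypothetical CSV-style list of row dicts might look like:

```python
from collections import Counter

def profile_column(rows, column):
    """First-look profile of one column in a sample dataset:
    distinct values, blank count, and the most common values.
    `rows` is a list of dicts, e.g. from csv.DictReader."""
    values = [row.get(column, "") for row in rows]
    blanks = sum(1 for v in values if not v.strip())
    return {
        "distinct": len(set(values)),          # cardinality
        "blank": blanks,                       # data quality signal
        "top": Counter(values).most_common(3), # distribution signal
    }
```

Even something this simple tends to flag quality issues (unexpected blanks, suspicious cardinality) before any real analysis begins.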

Personally Identifiable Information (PII)

What if data contains PII but is core to our analysis?  Well, there are some other standards and techniques we employ to ensure no-one can identify an individual from, or make sense of, PII data.  Hope for the best, plan for the worst :).

Shuffling cards

By Alan Cleaver (Flickr: Shuffling the deck) via Wikimedia Commons

Hashing is the most obvious approach and is core to much of what we do, providing a one-way mechanism for anonymising data.  Data hashed in this way is in theory susceptible to a brute-force attack, but using the right hashing algorithm and salts mitigates this risk.  In instances where the results of our analysis mean the customer wants to be able to identify an individual, encryption, re-sequencing or shuffling can be employed instead.
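As a minimal sketch of salted one-way hashing, here’s one reasonable implementation using HMAC-SHA256 (the choice of algorithm and the per-engagement salt are assumptions, not a statement of what we run in production):

```python
import hashlib
import hmac

def anonymise(value: str, salt: bytes) -> str:
    """One-way hash of a PII value using HMAC-SHA256. The salt is
    generated per engagement and never shipped with the delivered
    data, which is what frustrates brute-force / rainbow-table
    attacks against common values like names or emails."""
    return hmac.new(salt, value.encode("utf-8"), hashlib.sha256).hexdigest()
```

Because the same value and salt always produce the same token, hashed identifiers can still be joined across tables, which is usually why hashing is preferred over simple redaction.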

Another technique is to genericise data so that it still allows valuable insight to be gleaned but mitigates risk.  Date of Birth is a good example: it’s usually core to analysis but represents a high PII risk, yet genericising it into Age Buckets (0-5, 6-10 etc.) satisfies both needs.
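A minimal sketch of that bucketing, following the 0-5, 6-10, … scheme mentioned above (the bucket boundaries are just the example widths, not a fixed rule):

```python
def age_bucket(age: int) -> str:
    """Map an exact age to a generic bucket label: 0-5, 6-10,
    11-15, and so on in steps of five thereafter."""
    if age <= 5:
        return "0-5"
    lower = ((age - 6) // 5) * 5 + 6
    return f"{lower}-{lower + 4}"
```

The exact age (and hence Date of Birth) never leaves the customer’s side; only the bucket label is delivered.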

A word of caution when genericising, however: in some situations this still doesn’t solve the problem.  I’ve worked on customer projects where consideration needs to be given to a single row falling into an Age Bucket, for example.  If shared with a third party, that single row may still (in theory) allow an individual to be identified, depending on what other data the third party has access to.  In this instance we define a uniqueness threshold and discard rows that fall below it.
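That thresholding step can be sketched as a k-anonymity-style filter. The column names and the threshold value here are hypothetical:

```python
from collections import Counter

def apply_uniqueness_threshold(rows, key_columns, threshold=5):
    """Drop any row whose combination of quasi-identifying values
    (e.g. age bucket + region) occurs fewer than `threshold` times
    in the dataset -- so no delivered row is unique enough to point
    at an individual. `rows` is a list of dicts."""
    counts = Counter(tuple(row[c] for c in key_columns) for row in rows)
    return [
        row for row in rows
        if counts[tuple(row[c] for c in key_columns)] >= threshold
    ]
```

Discarding rows loses some signal, so the threshold is a judgement call made with the customer: high enough to protect individuals, low enough to keep the analysis meaningful.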

Wrapping up

One other word of advice: have a test set of data that can be worked on and shared freely with all parties.  This saves stacks of time by allowing everyone to collaborate during this phase, and it enables the anonymisation rules and results to be tested and walked through, giving all parties confidence.

As with most things, there’s no one-size-fits-all answer, and ultimately it depends on a number of factors.  Understanding these factors, and the remedial actions they call for, is core to our work.


Featured Image by William Murphy from Dublin, Ireland (Where’s Wally World Record (where you there?)) via Wikimedia Commons