Sampling for Sensitive Data: Sample “Depth”
October 11th, 2016
In a previous blog entry, Richness and Precision and Recall (Oh My!), we began to explore the many complexities of estimating richness, precision, and recall when searching for sensitive information. In this post, we’ll focus on one of those complexities: sample “depth.” Sample “depth” is the level at which we intend to measure and remediate. For sensitive data, we have three key sample depths—document, entity, and element.
The Power of Statistical Sampling
When properly executed, statistical sampling allows you to assess a subset of a “population” to estimate characteristics about the whole, within a margin of error and confidence level. Voter polling is a timely example. Pollsters contact a few hundred—maybe thousands of—probable voters and, based on the information from that relatively small group of voters, they estimate how many people are likely to vote for a given candidate in the coming general election. Although the reliability of the polling information heavily depends on the sampling methodology and execution (for example, sample size, representativeness, etc.), in general, sampling can provide a powerful means to understand large data sets while looking at only a small portion.
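To make "margin of error and confidence level" concrete, here is a minimal sketch of the usual normal-approximation formulas for a sample proportion. It assumes a simple random sample from a large population; the function names and the 1.96 z-score (roughly 95% confidence) are illustrative choices, not something this post prescribes.

```python
import math

def margin_of_error(p_hat, n, z=1.96):
    """Approximate margin of error for an observed proportion p_hat
    from a simple random sample of size n (normal approximation)."""
    return z * math.sqrt(p_hat * (1 - p_hat) / n)

def sample_size(margin, p=0.5, z=1.96):
    """Approximate sample size needed for a target margin of error.
    p=0.5 is the conservative (worst-case) assumption."""
    return math.ceil(z ** 2 * p * (1 - p) / margin ** 2)

# Sample size for +/-5% at roughly 95% confidence, worst-case proportion.
n_needed = sample_size(0.05)
```

Under these assumptions the required sample is a few hundred items regardless of how large the population is, which is why a small poll can describe a large electorate.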
When dealing with large collections of information containing sensitive data, leveraging the power of sampling can be useful to refine retrieval and/or remediation methodologies to better target sensitive content.
Statistical Sampling & Sensitive Data
The key to sampling for sensitive data is to measure at the level—or “depth”—you plan to use for remediation.
Often in the context of litigation—particularly when assessing the efficacy of keywords at returning responsive content—we evaluate samples for richness, precision and recall at the document level. Is the document—the highest unit in question—responsive? And is it returned by one of our keywords? With answers to only these two yes/no questions, we can easily calculate our metrics. (Read our primer on statistical sampling in the context of litigation discovery, including keyword searching.)
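As a rough sketch of how those two yes/no answers become metrics, the snippet below computes point estimates from a coded sample. The data layout (a list of boolean pairs) and the function name are assumptions made for illustration, and the sample counts are invented.

```python
def document_metrics(sample):
    """Point estimates from a random sample of documents.

    `sample` is a list of (is_responsive, is_hit) boolean pairs,
    one pair per sampled document (an assumed layout).
    """
    n = len(sample)
    tp = sum(1 for responsive, hit in sample if responsive and hit)
    fp = sum(1 for responsive, hit in sample if hit and not responsive)
    fn = sum(1 for responsive, hit in sample if responsive and not hit)
    return {
        # Share of sampled documents that are responsive.
        "richness": (tp + fn) / n if n else None,
        # Share of keyword hits that are responsive.
        "precision": tp / (tp + fp) if (tp + fp) else None,
        # Share of responsive documents that the keywords returned.
        "recall": tp / (tp + fn) if (tp + fn) else None,
    }

# Invented sample: 3 responsive hits, 1 non-responsive hit,
# 1 responsive miss, 5 non-responsive non-hits.
sample = ([(True, True)] * 3 + [(False, True)]
          + [(True, False)] + [(False, False)] * 5)
metrics = document_metrics(sample)
```

These are point estimates only; each would carry its own margin of error depending on the sample size.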
But suppose we’re not looking to identify responsive documents, and our goal instead is to find and redact all social security numbers (“SSN”) and medical identification (“ID”) numbers prior to a document production. There are several ways we could attack this problem.
Sampling at the Document Level
First, it may be OK to set our sample depth at the document level, even when assessing sensitive data. If our overall process contemplates a full human review of documents identified as potentially containing sensitive data, we may not care why a document is returned for review, only that it’s returned. To illustrate this, consider a document that contains three SSNs and three medical ID numbers. The SSN search we run against the data identifies two of the three SSNs; however, our search does not find any of the three medical ID numbers in this document, because these particular numbers are unique and don’t fit any known patterns. Because we’ve employed a good review team that has been trained to redact all instances of SSNs and medical ID numbers, it’s OK that certain sensitive data elements (one SSN and all three medical ID numbers) are missed in our initial searches. We can rely on our document-level assessment. Our retrieval methodology properly returned the document based on an SSN hit, and a well-trained reviewer will redact all sensitive data contained therein, including the medical ID numbers. Simple enough, right?
However, as we get more sophisticated with our remediation methods (e.g., human redaction, enhanced redaction workflow, auto-redaction), our sampling requirements become more demanding. If we want to use auto-redaction to decrease the need for human review, it’s probably not good enough to sample at the document level. We will need to know more particularly—in other words, with more sample depth—how well our search methods are performing.
Sampling at the Entity Level
To help understand sampling at the entity level, consider a workflow where we want to route documents that hit on a medical ID number to a team that specializes in medical document review/redactions (documents having an SSN hit and no medical ID number are still reviewed by the less costly primary team). We need to assess our sample one unit deeper, at the level of the sensitive data entity (here, SSN or medical ID number). It’s not good enough to know whether a document has any sensitive data hit. We need to know the specific sensitive data entity, so the document can be properly routed: documents with medical ID numbers to the specialized team, and documents with SSN-only hits to the primary team.
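The routing rule can be sketched in a few lines. The entity labels, queue names, and function name below are hypothetical; the point is only that routing depends on which entities were hit, not merely on whether the document was hit.

```python
def route(entities_hit):
    """Pick a review queue from the set of entity types the searches hit.
    Labels and queue names are illustrative assumptions."""
    if "medical_id" in entities_hit:
        return "specialized_team"
    if "ssn" in entities_hit:
        return "primary_team"
    return "no_sensitive_review"

# The example document from earlier: it truly contains both entity types,
# but the searches only hit on SSN.
actual_entities = {"ssn", "medical_id"}
searched_entities = {"ssn"}

intended_queue = route(actual_entities)
assigned_queue = route(searched_entities)
misrouted = intended_queue != assigned_queue
```

A document-level assessment would count this document as a successful retrieval, while an entity-level assessment exposes that it went to the wrong team.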
Sampling at the Element Level
To take our example one step further, now instead of routing documents with specific entity hits to a specialized team, we want to employ automated redactions for all SSNs. If you recall, our initial search only captured two of the three SSNs in our example document. If we assess our samples at the entity level, we’ll know which documents contain SSNs (and which contain medical ID numbers), but not how well the search performed as to each SSN or medical ID number element. If we simply wanted to route documents with SSNs for human redaction, assessing at the SSN entity level would suffice. However, because we want to apply auto-redactions, we must ensure our search captures each SSN element. Measuring at the entity or document level would not provide the details required for proper remediation. We need to understand the precision and recall of the SSN search for each SSN element that exists in a document, and so we must measure our sample at the element level.
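A minimal sketch of element-level measurement for the example document follows. Each element is identified here by a (document, placeholder label) pair; that layout, the labels, and the function name are assumptions for illustration.

```python
def element_metrics(true_elements, found_elements):
    """Precision and recall over individual sensitive data elements."""
    true_elements = set(true_elements)
    found_elements = set(found_elements)
    correct = len(true_elements & found_elements)
    precision = correct / len(found_elements) if found_elements else None
    recall = correct / len(true_elements) if true_elements else None
    return precision, recall

# The example document: three SSNs exist, the search captured two.
truth = {("doc1", "ssn_1"), ("doc1", "ssn_2"), ("doc1", "ssn_3")}
found = {("doc1", "ssn_1"), ("doc1", "ssn_2")}
precision, recall = element_metrics(truth, found)
```

At the document or entity level this document looks like a complete success. At the element level, recall is two out of three, meaning auto-redaction would leave one SSN exposed.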
It’s worth noting that while the above examples focus primarily on missing a sensitive data document, entity or element, over-identifying sensitive data also creates challenges. At the document and entity levels, over-identifying sensitive data likely means over-reviewing documents or having them reviewed by the wrong team—at a cost to the client. At the element level, where we rely on automation, over-identification would lead to over-redacting, reducing the quality of the output and possibly calling into question the overall process.
Sample Depth’s “Transitive” Properties
As is probably obvious at this point, we can sample at a depth greater than that at which we intend to remediate. If we sample at the element level, we can know which entities exist, and if we know which entities (or elements) exist, we can know which documents contain sensitive data.
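This roll-up from deeper to shallower levels can be sketched as a simple aggregation. The input layout, one (document, entity type) pair per element found, is an assumed format.

```python
def roll_up(elements):
    """Derive entity-level and document-level views from element-level data.

    `elements` is an iterable of (doc_id, entity_type) pairs,
    one pair per sensitive data element (an assumed layout).
    """
    entities_by_doc = {}
    for doc_id, entity_type in elements:
        entities_by_doc.setdefault(doc_id, set()).add(entity_type)
    sensitive_docs = set(entities_by_doc)
    return entities_by_doc, sensitive_docs

elements = [
    ("doc1", "ssn"), ("doc1", "ssn"), ("doc1", "medical_id"),
    ("doc2", "ssn"),
]
entities_by_doc, sensitive_docs = roll_up(elements)
```

The reverse is not possible from this data alone: knowing only that doc1 is sensitive says nothing about how many elements it holds.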
Is the opposite true? We’ve already established why you might want to sample at the entity or element level, and why sampling at a higher level generally won’t provide the details required for proper remediation. But what about a scenario where we want to assess at the element level, and we have 100 documents with one sensitive data element each and one document with 100,000 sensitive data elements? In this (admittedly somewhat extreme) example, when we draw our random sample of elements, chances are we’d only pull elements from the one document with 100,000 elements, which would tell us little about the other 100 documents. So in this case, sampling at the document level might allow us to extrapolate to the other, deeper levels. Although extreme, this scenario isn’t impossible; we could have a collection that includes relatively few records, one of which is a customer data dump with considerable PII, which might impact our sampling depth strategy.
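The arithmetic behind this skew can be checked with a short calculation. It assumes a simple random sample of elements drawn without replacement; the sample size of 385 is an illustrative choice, not a figure from this post.

```python
def prob_all_from_big_doc(big, small, n):
    """Probability that a simple random sample of n elements, drawn
    without replacement, comes entirely from the one large document.

    `big` is the element count of the large document; `small` is the
    combined element count of all other documents.
    """
    total = big + small
    p = 1.0
    for i in range(n):
        p *= (big - i) / (total - i)
    return p

big_doc_elements = 100_000
other_elements = 100  # 100 documents with one element each

# Share of all elements that live in the single large document.
share_in_big_doc = big_doc_elements / (big_doc_elements + other_elements)

# Chance that an illustrative 385-element sample never touches
# any of the other 100 documents.
p_only_big = prob_all_from_big_doc(big_doc_elements, other_elements, 385)
```

Under these assumptions about 99.9% of elements sit in one document, and a sample of that size more often than not contains no element from any other document.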
Decisions about sampling depth hinge on how you’ll use the sample results. If well-trained reviewers will lay eyes on the data, you may be ok sampling at the document level. If your remediation methods get more complex, you may want to sample at a greater depth to produce more precise statistical measurements in support of accuracy and defensibility. Ultimately, you must always consider your remediation methodology when determining your sampling depth and measure at the depth you intend to remediate.