Richness and Precision and Recall (Oh My!)
June 10th, 2016
Many of our blog readers are familiar with the concepts of richness, precision, and recall. We published a series of posts explaining these statistical measurements (in parts one, two, and three), we hosted a webinar on the subject, and there has been a good deal of education on this topic for eDiscovery practitioners. But most of that discussion and education focused on the context of statistical testing and measurement of techniques used to find documents containing relevant information for discovery—a context in which these statistics are fairly well-settled and easily understood.
However, organizations aren’t just searching their data collections for documents relevant to litigation disputes. It’s becoming more and more critical for companies to find and protect sensitive information—personally identifying information, personal health information, financial and payment information, etc.—stored in their caches of unstructured (non-database) data. What happens when we apply these statistical concepts of richness, precision, and recall to the identification of this sensitive information? The situation becomes a bit more complicated—and the statistics more difficult—which prompted me to write this post.
To help explain the challenge, let’s first define a few relevant terms:
Document – A collection of words, phrases, numbers, characters, or other items all grouped into one unit, such as a text file, word processing document, or spreadsheet.
Entity – A particular type of data that exists in a document. Examples of “sensitive” data entities include social security number, credit card number, username/password combination, and account number. Of course, a document with sensitive entities most likely contains other, non-sensitive entities, such as words, punctuation, metadata, formulas, pictures, and graphs—the list is almost endless.
Element – An individual instance of an entity, such as an individual social security number or particular word.
To illustrate the statistical oddities that can arise in calculating richness, precision, and recall when searching for sensitive data, let’s use a hypothetical example. Imagine a web site customer support rep for a financial institution, who authored ten documents saved on the company’s network. All ten documents were gathered and scanned for sensitive data. In one of those documents, the rep took copious notes regarding hundreds of customer interactions. The sensitive data scan reveals that those notes included the customers’ web site user names, passwords, social security numbers, and account numbers. The other nine documents did not contain any sensitive data.
In this example, the sensitive data document includes four sensitive data entities (user name, password, social security number, and account number), hundreds of sensitive data elements (each instance of one of those entities for a customer), and thousands of non-sensitive data elements (words, punctuation, etc.). Each of the other nine documents includes hundreds of non-sensitive data entities and thousands of non-sensitive data elements.
To make the math easy, assume that in this data set we have:
- 10 unique documents, with only one document containing sensitive data
- 100 unique entities, with 4 of those entities being sensitive data
- 10,000 unique elements, with 700 of those elements being an instance of one of the four sensitive data entities
The Calculations & Measurements
If we assume perfect knowledge of the sensitive data elements, we can calculate the richness of sensitive data. (What if we have imperfect knowledge, or rely on sampling? We’ll save that for a later, more in-depth blog post.) Generally, calculating richness is pretty easy—it’s simply the proportion of the total items that contain the content we’re measuring. But in the context of sensitive data, this is where the calculations get interesting. Is richness measured at the document level, entity level, or element level? In our example, richness of sensitive data at the document level is one out of 10 documents, or 10%. At the entity level, it is 4 out of 100, or 4%. At the element level, it is 700 out of 10,000, or 7%. So, which is correct? Is richness 10%, 7% or 4%? And to make it even more complicated, I assumed unique documents, entities and elements, but in reality there will be duplication on each level. How should we count duplicate entities (or elements or documents) in our calculations? (We’ll save that one for a later post, too.)
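The three richness figures above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic, assuming perfect knowledge of the data set as the example does. The names `COUNTS` and `richness` are illustrative and not part of any tool.

```python
from fractions import Fraction

# Counts from the hypothetical data set in this post: (sensitive, total).
COUNTS = {
    "document": (1, 10),
    "entity": (4, 100),
    "element": (700, 10_000),
}


def richness(sensitive: int, total: int) -> Fraction:
    """Proportion of the total items that contain (or are) sensitive data."""
    if total <= 0:
        raise ValueError("total must be positive")
    return Fraction(sensitive, total)


# One richness value per level of measurement.
richness_by_level = {level: richness(s, t) for level, (s, t) in COUNTS.items()}

for level, value in richness_by_level.items():
    print(f"{level:>8}: {float(value):.0%}")
```

The same data set yields 10%, 4%, and 7% depending only on which unit is counted, which is the ambiguity the questions above point to.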
This simple representation of richness considers sensitive data of any type. But what if we want to know the richness of user name, social security number, password, or account number separately? Do we calculate each one at the document level, entity level, and element level? The problem is multiplicative: we now have three levels of measurement for four different sensitive data entities, giving us 12 possible richness measurements.
When we turn to the calculations of precision and recall, the same questions arise and we continue to compound the measurement possibilities. Now we have three measurements (precision, recall, and richness) across four entities (user name, social security number, password, and account number) at three levels of measurement (document, entity and element). That gives us 36 different possible measurements.
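To see how quickly the possibilities multiply, the combinations can simply be enumerated. The sketch below uses Python's standard `itertools.product`. The tuple names are illustrative labels for the measures, entities, and levels named in this post.

```python
from itertools import product

MEASURES = ("richness", "precision", "recall")
ENTITIES = ("user name", "social security number", "password", "account number")
LEVELS = ("document", "entity", "element")

# Richness alone: one measurement per (entity, level) pair.
richness_measurements = list(product(ENTITIES, LEVELS))

# All three statistics: one measurement per (measure, entity, level) triple.
all_measurements = list(product(MEASURES, ENTITIES, LEVELS))

print(len(richness_measurements))  # 4 entities x 3 levels
print(len(all_measurements))       # 3 measures x 4 entities x 3 levels

# Show a few of the combinations.
for measure, entity, level in all_measurements[:3]:
    print(f"{measure} of {entity} at the {level} level")
```

Adding a fifth sensitive entity or a fourth statistic grows the list by the same multiplication.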
Which measurements do we choose? Not surprisingly, the answer is “It depends.” In large part, it depends on what we’re trying to accomplish. In our hypothetical, if the objective of the data scan is to redact every instance of sensitive data and return the sanitized document to the network, then we need to measure at the element level. But if our objective is to wholly remove any document containing sensitive data from the network, then we can measure at the document level.
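A small sketch shows why the level matters for precision and recall. It uses the standard definitions (precision is true positives over everything flagged, recall is true positives over everything actually sensitive). The scan outcome is invented purely for illustration and does not come from the example above: it assumes the scan found 650 of the 700 sensitive elements along with 30 false hits, and flagged the one sensitive document along with one clean document.

```python
def precision_recall(true_positives: int, false_positives: int, false_negatives: int):
    """Return (precision, recall) from raw counts, using the standard definitions."""
    flagged = true_positives + false_positives  # everything the scan flagged
    actual = true_positives + false_negatives   # everything truly sensitive
    precision = true_positives / flagged if flagged else 0.0
    recall = true_positives / actual if actual else 0.0
    return precision, recall


# Hypothetical scan outcome (assumed numbers, for illustration only).
# Element level: 650 sensitive elements found, 30 false hits, 50 missed.
element_precision, element_recall = precision_recall(650, 30, 50)

# Document level: the 1 sensitive document found, 1 clean document also flagged.
document_precision, document_recall = precision_recall(1, 1, 0)

print(f"element level:  precision {element_precision:.1%}, recall {element_recall:.1%}")
print(f"document level: precision {document_precision:.1%}, recall {document_recall:.1%}")
```

Under these assumed numbers, the same scan looks complete at the document level (every sensitive document was caught) while still missing elements that a redaction workflow would need to find.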
Richness, precision, and recall–oh my. We’re not in Kansas anymore.